This is Part 3 describing the statistics I used for generating better words in the **Orthographic Password Creator**. **Part 1** looked at the pairing of all English Orthography components (Wikipedia definition of *English Orthography*). **Part 2** showed count plots for the count tables.

I wrote a C program that stepped through every possible pairing of the letters in the alphabet and counted how many times each pair occurs in English words. The program loaded the entire 109,462-word dictionary (the same one used by the **Random Word Password Creator**) into memory, then counted every occurrence of every possible pair of letters (26 x 26 = 676 pairs).
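A minimal sketch of what such a pair counter might look like, not the original program. The function name `count_pairs`, the `ALPHABET` constant, and the choice to skip any pair containing a non-letter are all assumptions made for illustration; the code also assumes an ASCII-style character set where `'a'` through `'z'` are contiguous.

```c
#include <ctype.h>
#include <stddef.h>

#define ALPHABET 26  /* hypothetical name: number of letters, giving a 26 x 26 table */

/* Add every adjacent letter pair in `word` to counts[first][second].
   Assumption: pairs containing a non-letter (apostrophe, hyphen, newline)
   are skipped, and upper and lower case are counted together. */
void count_pairs(const char *word, unsigned long counts[ALPHABET][ALPHABET])
{
    for (size_t i = 0; word[i] != '\0' && word[i + 1] != '\0'; i++) {
        unsigned char a = (unsigned char)word[i];
        unsigned char b = (unsigned char)word[i + 1];
        if (isalpha(a) && isalpha(b))
            counts[tolower(a) - 'a'][tolower(b) - 'a']++;
    }
}
```

Calling `count_pairs` once per dictionary word on a zero-initialized table would leave the count for a pair such as "th" in `counts['t' - 'a']['h' - 'a']`.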

This scan works differently than the previous tests (**Part 1**), which used an editor's search function. Comparing the two methodologies, counts for the same pairings are a little higher with the C program because its scan lets pairs overlap. The program starts at the top of the dictionary, looks at the current and next character, increments the appropriate counter, then advances the pointer by one and repeats. The search function of the previous test advanced by two after each match, so matches never overlapped.

For example, take xxxy. The search function would see the first two letters as a pair, then advance by two. The third x is then paired with y, so xx gets one count and xy gets one count.

The C program advances by one, so for the same example, xxxy, it would see the first xx pair, then the xx pair formed by the second and third characters, then the xy pair. So xx gets two counts and xy gets one count.
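The difference between the two scans can be reproduced with a small sketch. The function `count_with_step` and its `match_step` parameter are hypothetical names, and modeling the editor search as "skip two characters after a match, one otherwise" is my assumption about how that search behaves.

```c
#include <stddef.h>
#include <string.h>

/* Count occurrences of the two-letter `pair` in `text`.
   After a match the scan advances by `match_step` characters:
   1 gives the overlapping scan, 2 models the non-overlapping search.
   `match_step` must be at least 1. */
unsigned long count_with_step(const char *text, const char *pair, size_t match_step)
{
    unsigned long hits = 0;
    size_t len = strlen(text);
    size_t i = 0;

    while (i + 1 < len) {
        if (text[i] == pair[0] && text[i + 1] == pair[1]) {
            hits++;
            i += match_step;  /* jump past (or only partly past) the match */
        } else {
            i++;              /* no match here, try the next position */
        }
    }
    return hits;
}
```

With the text xxxy, a step of 2 finds xx once and xy once, while a step of 1 finds xx twice and xy once, which matches the two walkthroughs above.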

The reason I did it this way was to get statistics about how many times a particular letter is followed by another specific letter. By scanning the entire dictionary, these numbers are about as accurate as possible.

The following shows the plots followed by the data table. The first plot is a linear scale for the counts. As you can see, the curve is exponential. The second plot is the natural logarithm of the count value.

Note on the logarithm graph/table entries: The plotting program ignores the infinite values from taking the logarithm of zero, whereas the table function in the blog barfs at infinity. I replaced the infinity values for zero counts (from ln(0)) with zero in the table.
