# On word-length and dictionary size.

It has long been known that the larger the dictionary, the longer
the words that are contained in it. In an attempt to quantify this
observation, I collected data from word lists based on several American
and English dictionaries:

1) WBH: Word Builder's Handbook (Computer Puzzle Library, 1972), based on the Merriam-Webster Pocket Dictionary

2) LCK: Longman Crossword Key (Essex, England, 1982), based on Longman's Dictionary

3) EWB: English Word Book (Pilot Press, 1963), based on

4) CW: Chambers Words (W&R Chambers, 1976), based on Chambers Twentieth Century Dictionary

5) NREWL: Normal and Reverse English Word List (University of Pennsylvania Linguistics Department, 1963), based upon Webster's Second (plus about 43 thousand non-Websterian words from four specialized medical and technical dictionaries)

The table below gives the number of words of various lengths taken from these word lists. The Word Builder's and Longman lists stop with 15-letter words; to make them consonant with the others, estimated numbers of 16-letter through 20-letter words have been added in parentheses. At the end of each column is given the total number of words and the average word-length.

The distribution of different word-lengths can be approximated by the following mathematical (Poisson) formula, where W = the total number of words in the list, W(n) = the number of words of length n, and a = the average word-length:

$$W(n) = \frac{W\,a^{n}\,e^{-a}}{n!}, \qquad n! = n(n-1)(n-2)\cdots 2\cdot 1$$

The final column in the table gives the number of words predicted by this formula; as one can see, the fit is good but far from perfect.

It is interesting to see how well this formula predicts the longest word in the dictionary on which the list is based. Ignoring the 45-letter pneumonoultramicroscopicsilicovolcanokoniosis (which should never have been included in Webster's Second), the longest word is the 27-letter honorificabilitudinitatibus, followed by three 26-letter ones from chemistry and medicine. According to the formula, W(25) = 5.16, W(26) = 1.93, and W(27) = 0.69 words, in excellent agreement with reality. The prediction is nearly as good for the Merriam-Webster Pocket Dictionary, which contains the 23-letter electroencephalographic, the 22-letter electroencephalography, and the 21-letter electroencephalograph; here, W(21) = 1.80, W(22) = 0.65 and W(23) = 0.23 words.
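
The quoted predictions can be checked with a few lines of Python. This is a minimal sketch rather than the author's own procedure: the helper name `predicted_count` is hypothetical, the factorial is evaluated through `math.lgamma` to avoid overflow, and W and a are the rounded values printed in the table, so the output should match the quoted figures only to within a few percent.

```python
import math

def predicted_count(total_words: int, mean_length: float, n: int) -> float:
    """Poisson estimate W(n) = W * a**n * exp(-a) / n! (hypothetical helper)."""
    # Work in logarithms; lgamma(n + 1) equals log(n!).
    log_count = (math.log(total_words) + n * math.log(mean_length)
                 - mean_length - math.lgamma(n + 1))
    return math.exp(log_count)

# (W, a) as printed in the table; a is rounded to two decimals there.
NREWL = (268015, 9.73)
WBH = (30376, 7.99)

for label, (w, a), lengths in (("NREWL", NREWL, (25, 26, 27)),
                               ("WBH", WBH, (21, 22, 23))):
    for n in lengths:
        print(f"{label}: W({n}) = {predicted_count(w, a, n):.2f}")
```

Because n appears in the exponent, the tail values are sensitive to small changes in a, which is why rounded inputs do not reproduce the last digit exactly.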

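| Length | WBH | LCK | EWB | CW | NREWL | Predicted (NREWL) |
|---|---|---|---|---|---|---|
| 2 | 38 | | 169 | 102 | 161 | 757 |
| 3 | 541 | 812 | 859 | 966 | 1402 | 2453 |
| 4 | 1843 | 2826 | 3284 | 3559 | 5370 | 5964 |
| 5 | 2798 | 4907 | 6344 | 6562 | 10746 | 11604 |
| 6 | 4056 | 7910 | 10314 | 11041 | 18914 | 18810 |
| 7 | 4591 | 9906 | 15558 | 14131 | 26066 | 26137 |
| 8 | 4415 | 11032 | 16386 | 16616 | 33226 | 31778 |
| 9 | 3979 | 10791 | 15453 | 16899 | 36419 | 34343 |
| 10 | 3138 | 9153 | 13265 | 15002 | 35186 | 33404 |
| 11 | 2100 | 6762 | 9623 | 11458 | 30193 | 29536 |
| 12 | 1339 | 4613 | 6700 | 7910 | 23999 | 23941 |
| 13 | 818 | 2895 | 4211 | 5059 | 17723 | 17912 |
| 14 | 362 | 1594 | 2522 | 2855 | 11910 | 12445 |
| 15 | 185 | 851 | 1211 | 1434 | 7438 | 8070 |
| 16 | (90) | (420) | 623 | 720 | 4380 | 4906 |
| 17 | (45) | (210) | 285 | 308 | 2504 | 2807 |
| 18 | (22) | (80) | 100 | 98 | 1299 | 1517 |
| 19 | (11) | (30) | 42 | 58 | 701 | 776 |
| 20 | (5) | (10) | 15 | 21 | 378 | 378 |
| W | 30376 | 74802 | 106964 | 114799 | 268015 | |
| a | 7.99 | 8.65 | 8.72 | 8.87 | 9.73 | |
| A | 7.94 | 8.60 | 8.87 | 8.93 | 9.62 | |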
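
The W and a rows can be recomputed from any column of counts. The sketch below does so for the WBH column, with the parenthesized estimates for 16 to 20 letters included; the names `wbh_counts` and `summarize` are illustrative and do not come from the source.

```python
# WBH column of the table: word length -> number of words.
# Lengths 16-20 are the estimated values shown in parentheses.
wbh_counts = {
    2: 38, 3: 541, 4: 1843, 5: 2798, 6: 4056, 7: 4591, 8: 4415,
    9: 3979, 10: 3138, 11: 2100, 12: 1339, 13: 818, 14: 362, 15: 185,
    16: 90, 17: 45, 18: 22, 19: 11, 20: 5,
}

def summarize(counts: dict[int, int]) -> tuple[int, float]:
    """Return (total number of words W, average word-length a)."""
    total = sum(counts.values())
    # Weighted mean: each length n contributes n * (words of that length).
    mean = sum(n * c for n, c in counts.items()) / total
    return total, mean

total, mean = summarize(wbh_counts)
print(total, round(mean, 2))
```

For this column the result agrees with the printed W = 30376 and a = 7.99.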

As the above table reveals, the average word-length slowly increases with dictionary size. The relationship can be modeled by the following mathematical (power law) formula:

$$A = 3.19\,W^{0.08823}$$

The congruence between a and A, the observed and predicted average word-lengths, can be seen by comparing the last two rows of the table.
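
As a rough check, the power law can be evaluated for each list and set beside the observed averages. This sketch reads the formula as A = 3.19 times W raised to the power 0.08823; the helper name `predicted_mean_length` is hypothetical, and since the two constants are themselves rounded, the results may differ from the printed A row in the second decimal place.

```python
def predicted_mean_length(total_words: int) -> float:
    """Power-law estimate A = 3.19 * W**0.08823 (hypothetical helper name)."""
    return 3.19 * total_words ** 0.08823

# (list, W, observed a) as printed in the table.
rows = [("WBH", 30376, 7.99), ("LCK", 74802, 8.65), ("EWB", 106964, 8.72),
        ("CW", 114799, 8.87), ("NREWL", 268015, 9.73)]

for name, w, observed in rows:
    predicted = predicted_mean_length(w)
    print(f"{name}: a = {observed:.2f}, A = {predicted:.2f}, "
          f"difference = {observed - predicted:+.2f}")
```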

According to the 1985 Guinness Book of World Records, the Oxford English Dictionary has 414,825 word listings. The second formula predicts that the average word-length is 10.00, and the first formula predicts that W(27) = 1.73, W(28) = 0.62 and W(29) = 0.21 words. The longest word in the OED is, in fact, the 29-letter floccinauci-nihili-pili-fication. However, there is a considerable gap between this and the longest-known solid words, the 23-letter transubstantiationalist and anthropomorphologically.
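
The OED estimate chains the two formulas: the power law supplies an average word-length, which then feeds the Poisson formula. The sketch below follows that chain under stated assumptions. With the constants as printed, the power law comes out slightly below the quoted 10.00, so the Poisson step uses the quoted value directly; the names `power_law_mean`, `QUOTED_MEAN` and `predicted_count` are illustrative only.

```python
import math

OED_LISTINGS = 414825  # word listings, as quoted from the 1985 Guinness book

# Step 1: power law with the printed (rounded) constants.
power_law_mean = 3.19 * OED_LISTINGS ** 0.08823
print(f"power-law A = {power_law_mean:.2f}")

# Step 2: Poisson tail, using the average quoted in the text.
QUOTED_MEAN = 10.00  # assumption: the value the article reports

def predicted_count(total_words: int, mean_length: float, n: int) -> float:
    """Poisson estimate W(n) = W * a**n * exp(-a) / n!, computed via logs."""
    return math.exp(math.log(total_words) + n * math.log(mean_length)
                    - mean_length - math.lgamma(n + 1))

for n in (27, 28, 29):
    print(f"W({n}) = {predicted_count(OED_LISTINGS, QUOTED_MEAN, n):.2f}")
```

The tail counts shift noticeably when the average changes by as little as 0.01, so the choice between the computed and the quoted average matters at these lengths.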

It would be of interest to see analogous studies for other languages. In a book by C. B. Williams entitled Style and Vocabulary: Numerical Studies (Griffin, London, 1970) is a table from a 1963 paper by R. Moreau, giving the distribution of words by length as found in the Littré French dictionary. For words from 2 through 20 letters, the average length of 76884 words was 8.97, 0.35 higher than the value predicted from English experience.

JOHN HENRICK

Seattle, Washington

Author: Henrick, John

Publication: Word Ways

Geographic Code: 1USA

Date: Nov 1, 2008