# On word-length and dictionary size.

It has long been known that the larger the dictionary, the longer the words that are contained in it. In an attempt to quantify this observation, I collected data from word lists based on several American and English dictionaries:

1) WBH: Word Builder's Handbook (Computer Puzzle Library, 1972), based on the Merriam-Webster Pocket Dictionary

2) LCK: Longman Crossword Key (Essex, England, 1982), based on Longman's Dictionary

3) EWB: English Word Book (Pilot Press, 1963), based on

4) CW: Chambers Words (W&R Chambers, 1976), based on Chambers Twentieth Century Dictionary

5) NREWL: Normal and Reverse English Word List (University of Pennsylvania Linguistics Department, 1963), based upon Webster's Second (plus about 43 thousand non-Websterian words from four specialized medical and technical dictionaries)

The table below gives the number of words of various lengths taken from these word lists. The Word Builder's and Longman lists stop with 15-letter words; to make them consonant with the others, estimated numbers of 16-letter through 20-letter words have been added in parentheses. At the end of each column is given the total number of words and the average word-length.

The distribution of different word-lengths can be approximated by the following mathematical (Poisson) formula, where W = the total number of words in the list, W(n) = the number of words of length n, and a = the average word-length:

$$W(n) = \frac{W\,a^{n}\,e^{-a}}{n!}, \qquad n! = n(n-1)(n-2)\cdots 2\cdot 1$$

The final column in the table gives the number of words predicted by this formula for the NREWL list (W = 268015, a = 9.73); as one can see, the fit is good but far from perfect.
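As a rough check on the formula, the following Python sketch evaluates W(n) for the NREWL totals given in the table (W = 268015, a = 9.73). The function name `predicted_count` is an illustrative choice, and the calculation is carried out in logarithms only to keep the factorial manageable; it is the same Poisson expression, not a different model.

```python
import math


def predicted_count(n, total_words, mean_length):
    """Poisson estimate W(n) = W * a**n * exp(-a) / n!."""
    # Work in logarithms; math.lgamma(n + 1) equals ln(n!).
    log_count = (
        math.log(total_words)
        + n * math.log(mean_length)
        - mean_length
        - math.lgamma(n + 1)
    )
    return math.exp(log_count)


# NREWL totals from the table: W = 268015 words, a = 9.73 letters.
for n in (2, 9, 20, 27):
    print(n, round(predicted_count(n, 268015, 9.73), 2))
```

Because the average length is rounded to two decimals, the printed values can be expected to differ slightly from the tabulated predictions.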

It is interesting to see how well this formula predicts the longest word in the dictionary on which the list is based. Ignoring the 45-letter pneumonoultramicroscopicsilicovolcanokoniosis (which should never have been included in Webster's Second), the longest word is the 27-letter honorificabilitudinitatibus, followed by three 26-letter ones from chemistry and medicine. According to the formula, W(25) = 5.16, W(26) = 1.93, and W(27) = 0.69 words, in excellent agreement with reality. The prediction is nearly as good for the Merriam-Webster Pocket Dictionary, which contains the 23-letter electroencephalographic, the 22-letter electroencephalography, and the 21-letter electroencephalograph; here, W(21) = 1.80, W(22) = 0.65 and W(23) = 0.23 words.

```
Length    WBH    LCK     EWB      CW   NREWL  Predicted

2          38      -     169     102     161        757
3         541    812     859     966    1402       2453
4        1843   2826    3284    3559    5370       5964
5        2798   4907    6344    6562   10746      11604

6        4056   7910   10314   11041   18914      18810
7        4591   9906   15558   14131   26066      26137
8        4415  11032   16386   16616   33226      31778
9        3979  10791   15453   16899   36419      34343
10       3138   9153   13265   15002   35186      33404

11       2100   6762    9623   11458   30193      29536
12       1339   4613    6700    7910   23999      23941
13        818   2895    4211    5059   17723      17912
14        362   1594    2522    2855   11910      12445
15        185    851    1211    1434    7438       8070

16       (90)  (420)     623     720    4380       4906
17       (45)  (210)     285     308    2504       2807
18       (22)   (80)     100      98    1299       1517
19       (11)   (30)      42      58     701        776
20        (5)   (10)      15      21     378        378

W       30376  74802  106964  114799  268015
a        7.99   8.65    8.72    8.87    9.73
A        7.94   8.60    8.87    8.93    9.62
```

As the above table reveals, the average word-length slowly increases with dictionary size. The relationship can be modeled by the following mathematical (power law) formula:

$$A = 3.19\,W^{0.08823}$$

The congruence between a and A, the observed and predicted average word-lengths, can be seen by comparing the last two rows of the table.
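The comparison of a and A can be reproduced with a few lines of Python. This is only a sketch: the function name `predicted_mean_length` and the list layout are illustrative, while the word totals and observed averages are copied from the table.

```python
def predicted_mean_length(total_words):
    """Power-law estimate A = 3.19 * W**0.08823."""
    return 3.19 * total_words ** 0.08823


# (list, total words W, observed average length a), taken from the table.
word_lists = [
    ("WBH", 30376, 7.99),
    ("LCK", 74802, 8.65),
    ("EWB", 106964, 8.72),
    ("CW", 114799, 8.87),
    ("NREWL", 268015, 9.73),
]

for name, total_words, observed in word_lists:
    predicted = predicted_mean_length(total_words)
    print(f"{name:6s} observed {observed:.2f}  predicted {predicted:.2f}")
```

The predicted values should land within a couple of hundredths of the A row; small differences are to be expected from rounding in the published constants.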

According to the 1985 Guinness Book of World Records, the Oxford English Dictionary has 414,825 word listings. The second formula predicts that the average word-length is 10.00, and the first formula predicts that W(27) = 1.73, W(28) = 0.62 and W(29) = 0.21 words. The longest word in the OED is, in fact, the 29-letter floccinauci-nihili-pili-fication. However, there is a considerable gap between this and the longest-known solid words, the 23-letter transubstantiationalist and anthropomorphologically.
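The OED figures combine both formulas, and the arithmetic can be sketched as follows. The helper names are illustrative; the only inputs are the 414,825 listings quoted above and the rounded average of 10.00 used in the text for the Poisson step.

```python
import math


def predicted_mean_length(total_words):
    """Power-law estimate A = 3.19 * W**0.08823."""
    return 3.19 * total_words ** 0.08823


def predicted_count(n, total_words, mean_length):
    """Poisson estimate W(n) = W * a**n * exp(-a) / n!, via logarithms."""
    log_count = (
        math.log(total_words)
        + n * math.log(mean_length)
        - mean_length
        - math.lgamma(n + 1)  # ln(n!)
    )
    return math.exp(log_count)


oed_words = 414825
# The power law gives a value just under 10; the text rounds it to 10.00.
print("average length:", round(predicted_mean_length(oed_words), 2))

oed_mean = 10.00  # rounded figure used in the text
for n in (27, 28, 29):
    print(n, round(predicted_count(n, oed_words, oed_mean), 2))
```

Using the unrounded power-law average instead of 10.00 would lower the three counts slightly, so the choice of rounding is worth stating when repeating the calculation.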

It would be of interest to see analogous studies for other languages. A book by C. B. Williams entitled Style and Vocabulary: Numerical Studies (Griffin, London, 1970) contains a table from a 1963 paper by R. Moreau, giving the distribution of words by length as found in the Littré French dictionary. For words from 2 through 20 letters, the average length of 76884 words was 8.97, 0.35 higher than the value predicted from English experience.

JOHN HENRICK

Seattle, Washington