
Characterizing linguistic structure with mutual information.

Identifying an appropriate way to characterize linguistic structure is a fundamental issue in our effort to understand the relevant cognitive processes. For example, we would like to know how much grammatical and syntactical information can be extracted on the basis of unsupervised processing of linguistic input. Some investigators have argued that the language learnability problem is so difficult that innate knowledge must (in part) guide linguistic development (e.g. Chomsky, 1965; Marcus, 2001; Pinker, 1994). Indeed, in a famous paper, Gold (1967) argued that grammar and syntax cannot be inferred from the processing of positive linguistic exemplars alone. However, later discussions have shown Gold's conclusions to be less restrictive than originally thought (e.g. Horning, 1969; Pinker, 1979; Wharton, 1974). Moreover, there has been a continuous stream of demonstrations of the richness of linguistic statistical structure and the sensitivity of the cognitive system to this structure (for discussions, see Bates & Elman, 1996; Plunkett, Karmiloff-Smith, Bates, & Elman, 1997; Seidenberg, 1997). For example, children as young as 8 months old can very rapidly become sensitive to regularity in auditory input, such as unbroken series of syllables (Bates & Elman, 1996; cf. Marcus, Vijayan, Bandi Rao, & Vishton, 1999; Saffran, Aslin, & Newport, 1996). Several studies have also shown how considerable grammatical and syntactical knowledge can be extracted from language statistics (Christiansen & Chater, 1999; Johnson & Riezler, 2002). However, what is the most appropriate (mathematical) way to characterize linguistic statistical regularity?

Shannon (1951) employed information theory. Information theory emphasizes information efficiency; in other words, the question is whether language is 'set up' in such a way that the use of tokens, types, etc., is optimal in terms of economical representation and prediction (Vitanyi & Li, 1997). For example, Shannon based his studies on the extent to which language structure allows us to predict subsequent symbols in linguistic input from knowledge of the present symbol. If I see the word 'cat', how likely is it that I will see the word 'chases' next? Similarly, if in English I see the letters 'th', then 'e' is highly likely to be the next letter. Note that Shannon's analysis focused on forward transitional probabilities (cf. Simple Recurrent Networks [SRNs]; Elman, 1991). Humans are clearly sensitive to forward transitional probabilities. For example, Cleeremans and McClelland (1991) created a symbol sequence such that the current symbol could predict subsequent symbols several positions down the sequence. After 60,000 trials, their participants were aware of contingencies between symbols that were separated by up to four other symbols and no more. However, humans in general are not only sensitive to forward regularity, but also to backward regularity (e.g. Boucher & Dienes, 2003), as well as more complex measures of contingency (Perruchet & Peereman, 2004).

We are presently interested in written language. Intuition, as well as the success of Shannon's investigations, indicates that a reasonable starting point would be to examine structure in written language in terms of forward regularity; that is, the question is how much information about later tokens exists in the present one. Shannon utilized entropy, a measure of uncertainty in identifying a target object in a set of objects. In the present study, we chose to explore mutual information (MI; Cover & Thomas, 1991), which is a more obviously justifiable measure of forward regularity.

Mutual information

Let 'range' be the number of words in between two given words, x and y. For instance, a range of one will indicate that words x and y are separated by one other word. We are asking whether our expectation of obtaining word y at a particular location is affected by the knowledge that we have word x in an earlier location. A measure of this is the MI between P(x) and P(y), the probabilities of obtaining word x and word y. MI indicates how much the uncertainty involved in expecting y is reduced by knowledge that we have x, and is given by $\mathrm{MI} = \sum_{x,y} P(x, y) \log \frac{P(x, y)}{P(x)\,P(y)}$ (the summation extends over all unique word pairs). For different ranges, P(x, y) is the probability of having both words x and y, separated by a number of words equal to the range. Note that MI is symmetrical in its arguments, but presently we wish to explore only forward associations. The practical advantage of MI is that even with a corpus of limited size one can expect meaningful statistics for word pairs. By contrast, this would not be the case if we considered how our expectation for a word at position n + s is affected by knowledge of the previous s words; in most samples, the occurrence of the same sequence of s words would be unique or extremely unlikely. For example, the last four words in the previous sentence will be unique for tens of thousands of pages of written text (cf. Schuetze, 1993).
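The pair-based MI just defined can be sketched in a few lines of Python. This is an illustrative sketch, not the procedure actually used for the analyses reported here: the function name `mutual_information` and the parameter `gap` (standing in for 'range') are hypothetical, natural logarithms are assumed, and P(x) and P(y) are assumed to be estimated from the first and second members of the observed pairs.

```python
import math
from collections import Counter


def mutual_information(words, gap):
    """Plug-in MI (in nats) between words separated by `gap` other words.

    Assumption: P(x) and P(y) are estimated from the first and second
    members of the observed pairs, so the three estimates are consistent.
    """
    offset = gap + 1  # gap = 0 means adjacent words
    pairs = list(zip(words, words[offset:]))
    if not pairs:
        raise ValueError("sequence too short for this range")
    n = len(pairs)
    pair_counts = Counter(pairs)
    first_counts = Counter(x for x, _ in pairs)
    second_counts = Counter(y for _, y in pairs)
    mi = 0.0
    # The sum runs over unique word pairs that actually occur;
    # unseen pairs contribute zero to the plug-in estimate.
    for (x, y), count in pair_counts.items():
        p_xy = count / n
        p_x = first_counts[x] / n
        p_y = second_counts[y] / n
        mi += p_xy * math.log(p_xy / (p_x * p_y))
    return mi


# Toy input: a perfectly predictable three-word cycle.
cycle = "a b c".split() * 100
mi_adjacent = mutual_information(cycle, gap=0)
```

For the fully predictable cycle, knowing x removes all uncertainty about y, so the estimate is close to log 3; for a sequence consisting of a single repeated word it is zero.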


Our source was the European Corpus Initiative/Multilingual Corpus 1 (ECI/MC1; ELSNET, 1994). We randomly selected 595 samples from the CD, excluding any with 1,000 words or fewer. The samples were from 25 languages with a total of approximately 30 million words (see Table 1). In order to simplify our analysis, tagging and punctuation were removed so that the material was a long string of words. MI computations were carried out on the basis of the formula presented in the previous section for all range values up to 20 (this limit was chosen on the basis of pilot analyses where it looked like there was no variability for range values of 20 or higher; in fact, there seemed to be no variability after about 10 words and we adopted a limit of 11 in later regression analyses). Table 1 additionally shows the average MI value for each language. Note that probabilities of word and word-pair occurrences were computed within each sample (as opposed to across samples for the same language).

Theoretical model and justification

If we had an infinite language sample then the MI curves for each language should level off (asymptotically, as range increases) at zero. This is because when P(x, y) = P(x)P(y), that is, there is no information about the later occurrence of y when x is encountered, the corresponding MI is zero. In practice, even in our 30 million word corpus there was considerable noise (but note that few samples had more than 100,000 words), so that MI curves did not level off at zero but rather at values very close to the average MI for each language. This is plausibly because of error in estimating P(x, y), P(x), P(y) from the limited samples we employed. Moreover, it is unclear how we could apply smoothing techniques to compensate for our lack of accuracy, since the statistical expectations of seeing word y after having encountered word x in a linguistic sample surely depend on the thematic context of the sample (cf. Lapata, Keller, & McDonald, 2001; Lee, 1999).

To address this problem, and acquire some insight into the MI curves, we propose that MI depends on range on the basis of a simple exponential curve, that is, $\mathrm{MI} = A e^{s \times \mathrm{range}}$, where A and s are constants (s is always negative, so that MI is reduced exponentially as range increases). Exponential decay functions have featured prominently in understanding similarity and generalization (e.g. Nosofsky, 1992; Shepard, 1980, 1987; see also Chater & Brown, 1999). However, we cannot presently motivate the exponential form in a theoretically rigorous way. We assume this form and show later that it leads to good fits to our MI data.

The exponential model implies that $\ln(\mathrm{MI}) = \ln A + s \times \mathrm{range}$; in other words, plotting the natural logarithm of MI versus range should result in negatively sloped straight lines (recall that s is negative). Note from before that a non-zero ln A term implies that there is some error in the estimation of the P(x, y), P(x), P(y). Accordingly, we can reasonably postulate that the interesting information about how MI characterizes each language is forthcoming only from the slopes, s. We justify these assumptions in four ways.

We first correlated the average MI in each of our 595 samples with the number of words in each sample. The rationale for this computation is that the lower the number of words, the more noisy the P(x), P(y) and P(x, y) estimates, hence the higher the average MI. The correlation was (of course, given the number of data points) significant at the .01 level but not particularly high: .342. This is because it is not only the size of the sample that affects the validity of the probability estimates, but also the number of distinct words in each sample (information which we did not extract from our corpus processing).

Secondly, we used the combined Sherlock Holmes novels from the ECI/MC1 database, which resulted in a corpus of approximately 750,000 tokens and about 23,000 types. We then transcribed the Sherlock Holmes novels from orthography to the International Phonetic Alphabet, using the Mobylist Pronunciator II™ (Copyright 1988-1993, Grady Ward). The Mobylist achieves this via a look-up table (rather than an algorithm). The original transcription left about 65,000 words untranscribed, most of which were plurals of common nouns or past tense forms of common verbs. Thus, using the rules for converting the speech sounds of singular forms to those of plural forms, and likewise for past tense conversions, a supplement to the Mobylist was introduced, so that in the end only about 12,000 words were not transcribed to a phonological representation, and these were eliminated from the corpus. The 750,000 words were represented by about 2.6 million phonemes and a mutual information analysis identical to the ones presented previously was performed, with statistics being computed for pairs of phonemes, rather than words. The total number of possible phoneme pairs, about 2,500, was massively smaller than the number of unique word pairs, hence we expected that our probability estimates for P(x, y) would likewise be more accurate. Figure 1 shows the MI curve for these data and confirms our expectation that, with accurate P(x, y) estimates, the MI curve levels off at MI = 0, as predicted by our model. In other words, when the ratio of unique token pairs to total tokens is about 1 to 1,000, MI computations appear accurate.

Thirdly, as already mentioned, we assume that in the equation $\mathrm{MI} = A e^{s \times \mathrm{range}}$ meaningful information about linguistic statistical structure involves s, not A. We therefore computed s for 50 samples, 10 from each of five languages (randomly selected). Computation of s in the different samples can be straightforwardly achieved by carrying out simple linear regressions of ln(MI) vs. range (where s would be the unstandardized slope). Figure 2 shows the computed s values, arranged by language. In some cases there is more spread in s values than in others, but in all cases the s values for different languages are grouped together.
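The regression of ln(MI) on range described above can be illustrated with an ordinary least-squares fit. The sketch below is only an illustration under stated assumptions: the function name `fit_log_linear` is hypothetical, the MI values are assumed to be listed for ranges 0, 1, 2, ... in order, and the input is synthetic rather than taken from the corpus.

```python
import math


def fit_log_linear(mi_values):
    """Least-squares fit of ln(MI) = ln A + s * range.

    `mi_values[r]` is assumed to be the (positive) MI at range r.
    Returns the pair (ln_A, s), where s is the unstandardized slope.
    """
    if len(mi_values) < 2:
        raise ValueError("need MI values for at least two ranges")
    ranges = list(range(len(mi_values)))
    logs = [math.log(v) for v in mi_values]
    n = len(ranges)
    mean_r = sum(ranges) / n
    mean_l = sum(logs) / n
    sxx = sum((r - mean_r) ** 2 for r in ranges)
    sxy = sum((r - mean_r) * (l - mean_l) for r, l in zip(ranges, logs))
    s = sxy / sxx
    ln_a = mean_l - s * mean_r
    return ln_a, s


# Synthetic MI curve that follows the exponential model exactly
# (A = 2.0 and s = -0.3 are made-up values for illustration).
synthetic = [2.0 * math.exp(-0.3 * r) for r in range(12)]
ln_a, s = fit_log_linear(synthetic)
```

On data that follow the exponential form exactly, the fit recovers the slope and intercept used to generate them; on real MI curves the residuals indicate how well the exponential assumption holds.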


Finally, it is possible that the computed MI values reflect noise rather than any meaningful linguistic structure. To address this possibility, we performed a series of MI analyses at the word level, for ranges 0 to 10, on the English Bible (authorized version, King James translation), for which we counted 828,864 words. Specifically, we analysed as separate samples the first 1,000, 5,000, 50,000, 100,000 and 300,000 words, as well as the full text. Additionally, we randomized the order of the words in these five samples and the full Bible text and repeated the MI analyses. If our MI computations were the result of noise, no difference would be expected between the two sets of computations. Figure 3 illustrates this not to be the case. In all cases, MI does not vary with range in the randomized samples, whereas it does in the non-randomized ones. Even in our 1,000 word sample, there is a clear difference between the non-randomized version, where MI is reduced exponentially as a function of range, and the randomized one. Moreover, inspection of Figure 3 shows the (constant) MI value in the randomized samples to be very near the asymptotic MI value in the non-randomized versions. This observation further reinforces our approach in assuming average MI values in Table 1 reflect noise (i.e. arise from inaccuracies in estimating probabilities of individual words and word pairs).
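The randomization control described above can be sketched as follows. This is a minimal illustration, assuming a plug-in MI estimate in nats; the names `pair_mi` and `shuffled_control`, the fixed random seed, and the toy sentence are all hypothetical and stand in for the Bible samples.

```python
import math
import random
from collections import Counter


def pair_mi(words, gap):
    """Plug-in MI (in nats) for word pairs separated by `gap` other words."""
    pairs = list(zip(words, words[gap + 1:]))
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(x for x, _ in pairs)
    right = Counter(y for _, y in pairs)
    # P(x,y) / (P(x) P(y)) simplifies to count * n / (left * right).
    return sum(
        (count / n) * math.log(count * n / (left[x] * right[y]))
        for (x, y), count in joint.items()
    )


def shuffled_control(words, gap, seed=0):
    """Return (MI of the original order, MI after shuffling word order)."""
    shuffled = list(words)
    random.Random(seed).shuffle(shuffled)  # destroys sequential structure
    return pair_mi(words, gap), pair_mi(shuffled, gap)


# Toy text with strong word-order regularities (not corpus data).
toy = "the cat chases the dog and the dog chases the cat".split() * 50
original_mi, shuffled_mi = shuffled_control(toy, gap=0)
```

Shuffling keeps the word frequencies intact but removes sequential dependencies, so whatever MI remains in the shuffled text reflects estimation noise rather than structure.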



Utilizing the data 1

In this section, we examine the range over which MI dependencies extend in different languages. In other words, how far ahead does the presence of a word inform us about the presence of other words (in different languages)?

We computed the natural logarithms of the average MI values for each language for ranges 0 to 11. We then ran simple linear regressions of ln(MI) values vs. range, one for each of our 25 languages. (It would have been more accurate to analyse individual language samples, but it was not practical to conduct 595 regressions.) The results are shown in Table 1 (all regression analyses involved one model and 11 residual degrees of freedom; recall that the regression equations try to predict the 12 MI values for ranges 0 to 11). The highly significant p values are somewhat deceptive since we were effectively dealing with straight lines parallel to the horizontal. The differences in slope appear small but, on the basis of our hypothesis about how MI depends on range, it is these differences that are meaningful in characterizing linguistic structure with MI.

Recall that the regression equations in Table 1 were of the form $\ln(\mathrm{MI}) = \ln A + s \times \mathrm{range}$, where ln A = c, and c is the unstandardized constant in Table 1. Therefore, we can write $\mathrm{MI} = e^{c + s \times \mathrm{range}}$. In order to assess the range over which MI dependency extends in different languages, consider our assumption that $\mathrm{MI}_0$ (the MI value for range = 0, that is, adjacent word pairs) and $\mathrm{MI}_{\mathrm{average}}$ (the average MI for each language) carry no information about the statistical properties of the languages. Hence, let us consider the range required for a 90% reduction in the quantity ($\mathrm{MI}_0 - \mathrm{MI}_{\mathrm{average}}$). Simple arithmetic (see Appendix) leads us to estimates for the ranges over which MI extends in different languages (Table 2). Note that for Latvian and Lithuanian, the zero slope prevents us from making this computation. This is clearly a spurious result that possibly arises from rounding errors. Overall, there is considerable consistency across all languages and all MI range estimates are close to the average of 5.008.
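One possible reading of the 90% criterion can be made concrete in code. The sketch below is an assumption about the arithmetic, not a reproduction of the Appendix: it takes the required range to be the value at which the fitted curve $e^{c + s \times \mathrm{range}}$ has covered 90% of the distance from $\mathrm{MI}_0$ to $\mathrm{MI}_{\mathrm{average}}$. The function name `range_for_reduction` and the numbers in the example are hypothetical.

```python
import math


def range_for_reduction(c, s, mi_average, fraction=0.9):
    """Range at which (MI - MI_average) has dropped by `fraction`.

    Assumed reading: solve
        exp(c + s * r) = MI_average + (1 - fraction) * (MI_0 - MI_average)
    for r, with MI_0 = exp(c) taken from the fitted line.
    """
    if s == 0:
        # A flat fitted line never approaches the average, so no finite
        # range exists (compare the zero-slope cases noted in the text).
        raise ValueError("zero slope: the range is undefined")
    mi_0 = math.exp(c)
    target = mi_average + (1.0 - fraction) * (mi_0 - mi_average)
    return (math.log(target) - c) / s


# Made-up regression constants, for illustration only.
example_range = range_for_reduction(c=0.0, s=-0.5, mi_average=0.0)
```

Under this reading a zero slope gives no finite answer, which is consistent with the computation being impossible for a language whose fitted slope rounds to zero.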

Utilizing the data 2

We performed a clustering analysis on our 25 languages on the basis of their slope (cf. Lund, Burgess, & Atchley, 1995; Schuetze, 1993).

We used Pothos and Chater's (2002, 2005) simplicity model of clustering, which is a model of which clustering for a group of items appears most natural and intuitive to naive observers. It is a model of spontaneous classification and (accordingly) unsupervised clustering. Very briefly, the model examines whether it is possible to compress the information-theoretic description of a set of items by arranging them into groups. The model can compute the optimal (in the information-theoretic sense of Pothos & Chater, 2002) classification for a set of items on the basis of some kind of item distance, without any information about either the distributional characteristics of the items or the number of groups sought (this latter feature, in particular, discriminates between the simplicity approach and alternative classification methods; e.g. Jardine & Sibson, 1968). Additionally, the model can look for subclusters, if these allow for further information-theoretic simplification.

The unstandardized slopes corresponding to the best fitting line for each language were used as the coordinates for each language in a one-dimensional space. Distances between languages were computed using the Euclidean metric. Figure 4 shows the optimal classification identified for two levels. In the final section, we discuss the potential significance of these groupings.
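The distance computation that feeds the clustering can be sketched briefly. The sketch covers only the pairwise distances, not the simplicity model itself; the function name `pairwise_distances`, the language labels, and the slope values are hypothetical placeholders rather than values from Table 1.

```python
def pairwise_distances(slopes):
    """Euclidean distances between items placed on a one-dimensional axis.

    `slopes` maps a label to its regression slope. In one dimension the
    Euclidean distance reduces to the absolute difference.
    """
    names = sorted(slopes)
    return {
        (a, b): abs(slopes[a] - slopes[b])
        for a in names
        for b in names
        if a < b  # keep each unordered pair once
    }


# Made-up slopes for three hypothetical languages.
example_slopes = {"lang_a": -0.010, "lang_b": -0.012, "lang_c": -0.030}
distances = pairwise_distances(example_slopes)
```

A clustering procedure that works from item distances alone, as the simplicity model does, would take a table of this kind as its input.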


The finding that MI dependencies extend over a range of nearly exactly five items for 25 different languages could be interpreted as implying an involvement of short-term memory (STM) in linguistic processing. The reasoning is that there has been extensive research showing that the STM can concurrently represent around four to seven items (Cowan, 2001; Miller, 1956; for discussions, see Baddeley, 1994; Davelaar, Goshen-Gottstein, Ashkenazi, Haarmann, & Usher, 2005). If the STM is involved in processing linguistic input, we would expect linguistic structure to extend over a range of words only as long as the STM can support. This is what our analyses show: The range over which MI dependencies extend matches very closely the reported STM spans.

In favour of this conclusion, there is evidence that the STM, and specifically the processing restrictions imposed by the STM, are critical for language learning (Newport, 1988, 1990). Elman (1993, 1996; Plunkett & Elman, 1996) created an artificial language that incorporated many of the critical characteristics of real languages (such as relative clauses). He then tried to get an SRN to learn the language; that is, the objective of the network was to recognize which novel linguistic stimuli were grammatical. SRNs work by gradually building a sensitivity to forward transitional probabilities in an input that involves sequentially presented symbols. In other words, an SRN gradually learns to anticipate which other symbols will appear after a certain symbol. Elman (1993) found that, although no learning was possible when the training set (consisting of a series of sentences) was just presented to the SRN, this was not the case when the 'memory' of the network for previous words was low to start with and gradually increased to 'adult' size (cf. Plunkett & Marchman, 1993). That is, the mechanism that enabled the network to have information about what was presented before was periodically reset in a way to mimic human STM development, by first starting small and only gradually being increased to a size corresponding to the adult STM span. If the STM is not initially restricted, the theory goes, the SRN (and presumably human learners) is overwhelmed by too complex a problem and ends up being unable to learn anything (for a different view, see Rohde & Plaut, 2003). Dempster (1981) argued against any developmental changes in STM span, but this view is not universally accepted (for discussions, see Hitch, Towse, & Hutton, 2001; Kail, 1984; Newport, 1990). Finally, note that the 'starting small' intuition has also been applied in other areas of cognition (e.g. Turkewitz & Kenny, 1982) and STM restrictions have been argued to have an adaptive role beyond linguistic processing (e.g. Dirlam, 1972; Kareev, 1995; Kareev, Lieberman, & Lev, 1997).


Additionally, Daneman and Carpenter (1980) emphasized the role of working memory in linguistic ability. These investigators made a distinction between passive measures of STM capacity and measures that reflect processing STM limitations (cf. Baddeley, 1983; Baddeley, Thomson, & Buchanan, 1975). The former are typically measured with tasks such as digit span. Daneman and Carpenter measured the latter in terms of 'reading span', a task involving retaining the last word of an increasing number of sentences. Daneman and colleagues showed that reading span, but not passive STM measures, correlates with linguistic ability (e.g. speech generation, reading tasks) in several studies (Daneman, 1991; Daneman & Merikle, 1996; cf. Lepine, Barrouillet, & Camos, 2005). Their conclusion was that working memory is intimately linked with linguistic processes. In a similar vein, Ellis and his colleagues demonstrated that individuals' phonological STM can correlate with their ability to learn vocabulary and syntax in first and second language acquisition (e.g. Ellis, 1996; Ellis & Schmidt, 1997; Ellis & Sinclair, 1996). Finally, when Jarvella (1971) interrupted participants reading some text, he found that they could remember the last seven words very well, but memory deteriorated rapidly from that point onwards. This result (trivially) indicates that the information explicitly available to the cognitive processor from reading linguistic material is constrained by the STM, so that STM span restrictions plausibly affect the comprehension of written text.

Against a conclusion that our finding that MI dependencies extend over a range of five items indicates an involvement of STM in linguistic processing, note first that the conclusions of Ellis (1996) or Elman (1993) concern primarily spoken language, whereas our analyses were carried out on written language. Second, the differential role of function words, prepositions, etc., may confuse MI range differences between languages (we thank Eddy Davelaar for bringing this point to our attention). For example, in a language like English that relies a great deal on phrasal verbs, we would expect many verb-preposition pairs to be chunked and hence be processed as a single unit in STM (Simon & Barenfeld, 1969). In other words, languages that rely more on function words/prepositions may appear to have a longer MI range. It is not clear how one can get round this problem in MI analyses at the word level.

Third, the links between STM and linguistic structure typically concern phonological STM (Baddeley et al., 1975). For example, Naveh-Benjamin and Ayres (1986; see also Ellis & Hennelly, 1980) examined relations between reading time and phonological STM, measured in terms of digit recall in English, Spanish, Hebrew and Arabic. They found that where the digits had more syllables, the STM span was shorter. Taking into account differences in the number of syllables in the material that participants in the different linguistic conditions had to memorize, it looked as though phonological STM was constant. Is STM for language best understood in terms of the (phonological) length of the material to be recalled or the number of words? The latest empirical results suggest both to be important (Chen & Cowan, 2005). Accordingly, we would expect regularity both at the phonemic level and the word level to play a part in linguistic processes, as indeed Monaghan, Chater, and Christiansen (2005) have found with artificial languages. To complicate things even further, the role of phonology in different languages appears to depend on the regularity between orthography and phonology (e.g. Jonsdottir, Shallice, & Wise, 1996). Since the functional unit of our computations was words, without information about syllables per word we cannot directly infer STM involvement in linguistic processes.

The considerations above clearly also limit potential interpretations of the clustering analysis. The clustering analysis should provide some indication of morphological similarity between languages and/or similarities in the STM in the respective populations and/or similarities in the average phonological length of words in different languages. Note that languages morphologically similar, such as Norwegian and Swedish, were assigned to different clusters, hence it looks as though MI statistics do not reflect simply language morphological properties. The computations presently reported prevent any more specific observations.

This work should indicate the potential utility of MI as a means of characterizing linguistic structure. A number of refinements for future work readily suggest themselves. First, the accuracy of MI estimates depends on the number of words relative to the number of unique pairs. Second, to fully explore the implications of the reported language clustering, one would additionally need measures of language morphological complexity (cf. Juola, Bailey, & Pothos, 1998) and behavioural measures of phonological STM differences.


We would like to thank Alan Baddeley, Todd Bailey, Nick Chater, Eddy Davelaar, Padraic Monaghan and Kim Plunkett for their helpful comments on this work. This research has been partly supported by EC Framework 6 grant, contract 516542 (NEST).


Baddeley, A. D. (1983). Working memory. Philosophical Transactions of the Royal Society of London B, 302, 311-324.

Baddeley, A. D. (1994). The magical number seven: Still magic after all these years? Psychological Review, 101, 353-356.

Baddeley, A. D., Thomson, N., & Buchanan, M. (1975). Word length and the structure of short-term memory. Journal of Verbal Learning and Verbal Behavior, 14, 575-589.

Bates, E., & Elman, J. (1996). Learning rediscovered. Science, 274, 1849-1850.

Boucher, L., & Dienes, Z. (2003). Two ways of learning associations. Cognitive Science, 27, 807-842.

Chater, N., & Brown, G. D. A. (1999). Scale-invariance as a unifying psychological principle. Cognition, 69, B17-B24.

Chen, Z., & Cowan, N. (2005). Chunk limits and length limits in immediate recall: A reconciliation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 31, 1235-1249.

Chomsky, N. (1965). Aspects of the theory of syntax. Cambridge, MA: MIT Press.

Christiansen, M. H., & Chater, N. (1999). Connectionist natural language processing: The state of the art. Cognitive Science, 23, 417-437.

Cleeremans, A., & McClelland, J. L. (1991). Learning the structure of event sequences. Journal of Experimental Psychology: General, 120, 235-253.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.

Cowan, N. (2001). The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behavioral and Brain Sciences, 24, 87-185.

Daneman, M. (1991). Working memory as a predictor of verbal fluency. Journal of Psycholinguistic Research, 20, 445-464.

Daneman, M., & Carpenter, P. A. (1980). Individual differences in working memory and reading. Journal of Verbal Learning and Verbal Behavior, 19, 450-466.

Daneman, M., & Merikle, P. M. (1996). Working memory and language comprehension: A meta-analysis. Psychonomic Bulletin and Review, 3, 422-433.

Davelaar, E. J., Goshen-Gottstein, Y., Ashkenazi, A., Haarmann, H. J., & Usher, M. (2005). The demise of short-term memory revisited: Empirical and computational investigations of recency effects. Psychological Review, 112, 3-42.

Dempster, E. N. (1981). Memory span: Sources of individual and developmental differences. Psychological Bulletin, 89, 63-100.

Dirlam, D. K. (1972). Most efficient chunk sizes. Cognitive Psychology, 3, 355-359.

Ellis, N. C. (1996). Sequencing in SLA. Studies in Second Language Acquisition, 18, 91-126.

Ellis, N. C., & Hennelly, R. A. (1980). A bilingual word-length effect: Implications for intelligence testing and the relative ease of mental calculation in Welsh and English. British Journal of Psychology, 71, 43-51.

Ellis, N. C., & Schmidt, R. (1997). Morphology and longer distance dependencies. Studies in Second Language Acquisition, 19, 145-171.

Ellis, N. C., & Sinclair, S. G. (1996). Working memory in the acquisition of vocabulary and syntax: Putting language in good order. Quarterly Journal of Experimental Psychology, 49A, 234-250.

Elman, J. L. (1991). Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7, 195-225.

Elman, J. L. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48, 71-99.

Elman, J. L. (1996). Rethinking innateness: A connectionist perspective on development. Cambridge, MA: MIT Press.

Gold, E. M. (1967). Language identification in the limit. Information and Control, 10, 447-474.

Hitch, G. H., Towse, J. N., & Hutton, U. (2001). What limits children's working memory span? Theoretical accounts and applications for scholastic development. Journal of Experimental Psychology: General, 130, 184-198.

Horning, J. J. (1969). A study of grammatical inference. Doctoral thesis, Stanford University.

Jardine, N., & Sibson, R. (1968). The construction of hierarchic and nonhierarchic classifications. Computer Journal, 11, 177-194.

Jarvella, R. J. (1971). Syntactic processing of connected speech. Journal of Verbal Learning and Verbal Behavior, 10, 409-416.

Johnson, M., & Riezler, S. (2002). Statistical models of language learning and use. Cognitive Science, 26, 239-253.

Jonsdottir, M. K., Shallice, T., & Wise, R. (1996). Phonological mediation and the graphemic buffer disorder in spelling: Cross-language differences? Cognition, 59, 169-197.

Juola, P., Bailey, T. M., & Pothos, E. M. (1998). Theory-neutral domain regularity measurements. In Proceedings of the Twentieth Annual Conference of the Cognitive Science Society (pp.555-560). Mahwah, NJ: Erlbaum.

Kail, R. (1984). The development of memory. New York: Freeman.

Kareev, Y. (1995). Through a narrow window: Working memory capacity and the detection of covariation. Cognition, 56, 263-269.

Kareev, Y., Lieberman, I., & Lev, M. (1997). Through a narrow window: Sample size and the perception of correlation. Journal of Experimental Psychology: General, 126, 278-287.

Lapata, M., Keller, F., & McDonald, S. (2001). Evaluating smoothing algorithms against plausibility judgments. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (pp. 346-353). Morristown, NJ: Association for Computational Linguistics.

Lee, L. (1999). Measures of distributional similarity. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (pp. 25-32). Morristown, NJ: Association for Computational Linguistics.

Lepine, R., Barrouillet, P., & Camos, V. (2005). What makes working memory spans so predictive of high level cognition? Psychonomic Bulletin and Review, 12(1), 165-170.

Lund, K., Burgess, C., & Atchley, R. A. (1995). Semantic and associative priming in high-dimensional semantic space. In Proceedings of the 17th Annual Conference of the Cognitive Science Society (pp.660-665). Mahwah, NJ: Erlbaum.

Marcus, G. F. (2001). The algebraic mind. Cambridge, MA: MIT Press.

Marcus, G. F., Vijayan, S., Bandi Rao, S., & Vishton, P. M. (1999). Rule learning by seven-month-old infants. Science, 283, 77-80.

Miller, G. A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63, 81-97.

Monaghan, P., Chater, N., & Christiansen, M. H. (2005). The differential role of phonological cues in grammatical categorization. Cognition, 96, 143-182.

Naveh-Benjamin, M., & Ayres, T. J. (1986). Digit span, reading rate, and linguistic relativity. Quarterly Journal of Experimental Psychology, 38A, 739-751.

Newport, E. L. (1988). Constraints on learning and their role in language acquisition: Studies of the acquisition of American Sign Language. Language Sciences, 10, 147-172.

Newport, E. L. (1990). Maturational constraints on language learning. Cognitive Science, 14, 11-29.

Nosofsky, R. M. (1992). Similarity scaling and cognitive process models. Annual Review of Psychology, 43, 25-53.

Perruchet, P., & Peereman, R. (2004). The exploitation of distributional information in syllable processing. Journal of Neurolinguistics, 17, 97-119.

Pinker, S. (1979). Formal models of language learning. Cognition, 7, 217-283.

Pinker, S. (1994). The language instinct. London: Penguin Books.

Plunkett, K., & Elman, J. L. (1996). Simulating nature and nurture: A handbook of connectionist exercises. Cambridge, MA: MIT Press.

Plunkett, K., Karmiloff-Smith, A., Bates, E., & Elman, J. L. (1997). Connectionism and developmental psychology. Journal of Child Psychology and Psychiatry and Allied Disciplines, 38, 53-80.

Plunkett, K., & Marchman, V. (1993). From rote learning to system building: Acquiring verb morphology in children and connectionist nets. Cognition, 48, 21-69.

Pothos, E. M., & Chater, N. (2002). A simplicity principle in unsupervised human categorization. Cognitive Science, 26, 303-343.

Pothos, E. M., & Chater, N. (2005). Unsupervised categorization and category learning. Quarterly Journal of Experimental Psychology, 58A, 733-752.

Rohde, D. L. T., & Plaut, D. C. (2003). Less is less in language acquisition. In P. Quinlan (Ed.), Connectionist modeling of cognitive development (pp. 189-231). Hove, UK: Psychology Press.

Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926-1928.

Schuetze, H. (1993). Word space. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems (Vol. 5, pp. 895-902). San Mateo, CA: Morgan Kaufmann.

Seidenberg, M. S. (1997). Language acquisition and use: Learning and applying probabilistic constraints. Science, 275, 1599-1603.

Shannon, C. E. (1951). Prediction and entropy of printed English. Bell System Technical Journal, 30, 50-64.

Shepard, R. N. (1980). Multidimensional scaling, tree-fitting, and clustering. Science, 210, 390-398.

Shepard, R. N. (1987). Toward a universal law of generalization for psychological science. Science, 237, 1317-1323.

Simon, H. A., & Barenfeld, M. (1969). Information-processing analysis of perceptual processes in problem solving. Psychological Review, 76, 473-483.

Turkewitz, G., & Kenny, P. A. (1982). Limitations on input as a basis for neural organization and perceptual development: A preliminary theoretical statement. Developmental Psychobiology, 15, 357-368.

Vitanyi, P. M. B., & Li, M. (1997). On prediction by data compression. In Proceedings of the 9th European Conference on Machine Learning, Lecture Notes in Artificial Intelligence (Vol. 1224, pp. 14-30). Heidelberg: Springer-Verlag.

Wharton, R. M. (1974). Approximate language identification. Information and Control, 26, 236-255.

Received 12 December 2005; revised version received 1 June 2006


Arithmetic supplement for the section 'Utilizing the data 1'

We seek the MI for which $(MI - MI_{av})/(MI_0 - MI_{av}) = 10\% = 0.1$, where $MI_0$ is the maximum MI for a language and $MI_{av}$ is the average MI for a language (both quantities are presently hypothesized to reflect sample sizes and number of unique pairs). Note that $MI_0 = e^{c} = A$. The MI value we seek is given by $MI = 0.1\,MI_0 + 0.9\,MI_{av}$. Now we have to find the range such that $0.1\,MI_0 + 0.9\,MI_{av} = e^{c + s \times range}$, where c and s are the unstandardized constants and slopes in Table 1. Hence, $range = [\ln(0.1\,MI_0 + 0.9\,MI_{av}) - c]/s$.
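
As an illustration, the arithmetic above can be sketched in a few lines of Python. The function name and its parameters are hypothetical (they do not appear in the original analysis), and the inputs are the rounded Table 1 values, so the results should only approximately match Table 2. A slope of zero is treated as giving an undefined range, which is an assumption about how the 'Undefined' entries in Table 2 arise.

```python
import math
from typing import Optional


def range_for_drop(c: float, s: float, mi_av: float,
                   fraction_left: float = 0.1) -> Optional[float]:
    """Range at which (MI - MI_av) / (MI_0 - MI_av) equals fraction_left.

    c and s are the regression constant and slope (ln MI = c + s * range),
    so MI_0 = exp(c). Returns None when the slope is zero, because the
    final division by s is then not defined (assumed 'Undefined' case).
    """
    if s == 0:
        return None
    mi_0 = math.exp(c)  # MI_0 = e^c = A
    target = fraction_left * mi_0 + (1 - fraction_left) * mi_av
    return (math.log(target) - c) / s


# Rounded Table 1 values: (c, s, average MI).
english = range_for_drop(1.502, -0.008, 4.293)
czech = range_for_drop(2.204, -0.001, 9.027)
latvian = range_for_drop(2.259, 0.0, 9.554)
print(english, czech, latvian)
```

With these inputs the English and Czech results come out close to the 5.05 and 3.40 reported in Table 2, and the Latvian call returns None.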

Emmanuel M. Pothos (1*) and Patrick Juola (2*)

(1) Swansea University, UK

(2) Duquesne University, Pittsburgh, USA

*Correspondence should be addressed to Emmanuel M. Pothos, Department of Psychology, Swansea University, Swansea SA2 8PP, UK, or to Patrick Juola, Department of Mathematics and Computer Science, Duquesne University, 600 Forbes Avenue, Pittsburgh, PA 15282, USA.
Table 1. Corpus data

            Average number of words  Samples  Average MI

Albanian    170,450                    1      2.102
Bulgarian     1,498                    3      6.462
Croatian     25,573                    1      6.182
Czech       118,694                  138      9.027
Danish        3,707                    1      6.387
Dutch        22,988                   39      5.109
English      64,688                   20      4.293
Estonian     92,762                    2      8.911
French       54,372                   60      5.091
Gaelic       24,808                    1      6.872
German       84,308                    8      6.188
Greek        38,834                   24      5.621
Italian      50,191                   10      5.733
Japanese     79,001                    4      5.176
Latvian      83,439                   15      9.554
Lithuanian   55,317                    1      9.466
Malay        44,107                   16      5.365
Norwegian    19,954                   43      4.289
Portuguese   26,623                    5      5.447
Russian      29,708                    4      6.704
Serbian      24,063                   16      6.708
Slovakian    21,554                    1      5.971
Spanish      27,835                   29      5.013
Swedish      11,319                  150      3.423
Turkish      68,819                    3      3.486

            Unstandardized constant, c  Unstandardized slope, s  p value

Albanian    0.791                       -0.008                   0.013
Bulgarian   1.877                       -0.002                   0.001
Croatian    1.841                       -0.003                   0.050
Czech       2.204                       -0.001                   0.012
Danish      1.866                       -0.002                   0.000
Dutch       1.66                        -0.005                   0.011
English     1.502                       -0.008                   0.009
Estonian    2.191                       -0.001                   0.001
French      1.67                        -0.007                   0.004
Gaelic      1.941                       -0.002                   0
German      1.845                       -0.004                   0.005
Greek       1.757                       -0.005                   0.009
Italian     1.784                       -0.007                   0.001
Japanese    1.764                       -0.022                   0
Latvian     2.259                        0                       0
Lithuanian  2.249                        0                       0.064
Malay       1.714                       -0.006                   0.007
Norwegian   1.494                       -0.007                   0.011
Portuguese  1.716                       -0.004                   0.012
Russian     1.93                        -0.005                   0.003
Serbian     1.914                       -0.002                   0.041
Slovakian   1.811                       -0.004                   0.057
Spanish     1.643                       -0.005                   0.007
Swedish     1.493                       -0.05                    0
Turkish     1.353                       -0.019                   0

Note. The unstandardized constant, unstandardized slope and p value
columns relate to the simple linear regressions of ln MI vs. range
($\ln MI = \ln A + s \times range = \ln e^{c} + s \times range$). Average MI: the
average MI for all range positions (up to 20) for all samples in a
language.
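
The log-linear model named in the note to Table 1 can be illustrated with a minimal sketch of an ordinary least-squares fit of ln MI against range. The function name is hypothetical, the data are synthetic and noise-free (not the corpus data), and the choice of range positions 1 to 20 is an assumption based on the "up to 20" in the note.

```python
import math


def fit_log_linear(ranges, mi_values):
    """Least-squares fit of ln(MI) = c + s * range; returns (c, s)."""
    ys = [math.log(mi) for mi in mi_values]
    n = len(ranges)
    mean_x = sum(ranges) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in ranges)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ranges, ys))
    s = sxy / sxx            # unstandardized slope
    c = mean_y - s * mean_x  # unstandardized constant (intercept)
    return c, s


# Synthetic illustration: MI = A * exp(s * range) with A = exp(1.5).
true_c, true_s = 1.5, -0.008
ranges = list(range(1, 21))  # assumed range positions 1..20
mi_values = [math.exp(true_c + true_s * r) for r in ranges]
c, s = fit_log_linear(ranges, mi_values)
print(c, s)
```

Because the synthetic values follow the model exactly, the fit recovers the constant and slope used to generate them; real MI measurements would scatter around the fitted line.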

Table 2. Range over which the quantity $MI_0 - MI_{av}$
drops by 90%


Albanian    5.40
Bulgarian   4.97
Croatian    5.80
Czech       3.40
Danish      5.28
Dutch       5.21
English     5.05
Estonian    3.34
French      5.45
Gaelic      6.09
German      5.03
Greek       5.48
Italian     4.84
Japanese    4.88
Latvian     Undefined
Lithuanian  Undefined
Malay       5.11
Norwegian   4.87
Portuguese  4.70
Russian     4.91
Serbian     4.81
Slovakian   5.41
Spanish     5.57
Swedish     4.66
Turkish     4.91
COPYRIGHT 2007 British Psychological Society
No portion of this article can be reproduced without the express written permission from the copyright holder.

Article Details
Author: Pothos, Emmanuel M.; Juola, Patrick
Publication: British Journal of Psychology
Date: May 1, 2007