Characterizing linguistic structure with mutual information.
Shannon (1951) approached language through information theory. Information theory emphasizes informational efficiency; in other words, the question is whether language is 'set up' so that the use of tokens, types, etc., is optimal in terms of economical representation and prediction (Vitanyi & Li, 1997). For example, Shannon based his studies on the extent to which language structure allows us to predict subsequent symbols in linguistic input from knowledge of the present symbol. If I see the word 'cat', how likely is it that I will see the word 'chases' next? Similarly, if in English I see the letters 'th', then 'e' is highly likely to be the next letter. Note that Shannon's analysis focused on forward transitional probabilities (cf. Simple Recurrent Networks [SRNs]; Elman, 1991). Humans are clearly sensitive to forward transitional probabilities. For example, Cleeremans and McClelland (1991) created a symbol sequence in which the current symbol could predict symbols several positions further down the sequence. After 60,000 trials, their participants were aware of contingencies between symbols separated by up to four, but no more than four, other symbols. However, humans are sensitive not only to forward regularity, but also to backward regularity (e.g. Boucher & Dienes, 2003), as well as to more complex measures of contingency (Perruchet & Peereman, 2004).
We are presently interested in written language. Intuition, as well as the success of Shannon's investigations, indicates that a reasonable starting point is to examine structure in written language in terms of forward regularity; that is, to ask how much information about later tokens is present in the current one. Shannon utilized entropy, a measure of the uncertainty in identifying a target object within a set of objects. In the present study, we chose instead to explore mutual information (MI; Cover & Thomas, 1991), which is a more directly justifiable measure of forward regularity.
Let 'range' be the number of words between two given words, x and y. For instance, a range of one indicates that words x and y are separated by one other word. We are asking whether our expectation of obtaining word y at a particular location is affected by the knowledge that word x occupies an earlier location. A measure of this is the MI between P(x) and P(y), the probabilities of obtaining word x and word y. MI indicates how much the uncertainty involved in expecting y is reduced by the knowledge that we have x, and is given by

MI = Σ_{x,y} P(x, y) log [P(x, y) / (P(x) P(y))],

where the summation extends over all unique word pairs. For different ranges, P(x, y) is the probability of obtaining both words x and y, separated by a number of words equal to the range. Note that MI is symmetrical in its arguments, but presently we wish to explore only forward associations. The practical advantage of MI is that even with a corpus of limited size one can expect meaningful statistics for word pairs. By contrast, this would not be the case if we considered how our expectation for a word at position n + s is affected by knowledge of the previous s words; in most samples, any given sequence of s words would occur once or not at all. For example, the last four words in the previous sentence will be unique over tens of thousands of pages of written text (cf. Schuetze, 1993).
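The computation above can be sketched in a few lines of Python. This is our own minimal illustration, not the authors' implementation: the function name is hypothetical, and we use natural logarithms (the base of the logarithm only rescales MI).

```python
import math
from collections import Counter

def mutual_information(tokens, rng):
    """MI between words separated by `rng` intervening words.
    rng = 0 means adjacent word pairs; natural logarithm is used."""
    pairs = list(zip(tokens, tokens[rng + 1:]))  # (x at i, y at i + rng + 1)
    n = len(pairs)
    px = Counter(x for x, _ in pairs)  # counts of x in the first position
    py = Counter(y for _, y in pairs)  # counts of y in the second position
    # P(x,y) = c/n, P(x) = px[x]/n, P(y) = py[y]/n,
    # so the ratio P(x,y)/(P(x)P(y)) simplifies to c*n/(px[x]*py[y])
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in Counter(pairs).items())
```

For a perfectly alternating sequence such as ['the', 'cat', 'the', 'cat', ...], the MI at range 0 approaches ln 2: each word fully determines the next, which would otherwise carry one bit (ln 2 nats) of uncertainty.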
Our source was the European Corpus Initiative/Multilingual Corpus 1 (ECI/MC1; ELSNET, 1994, www.elsnet.org). We randomly selected 595 samples from the CD, excluding any with 1,000 words or fewer. The samples came from 25 languages and totalled approximately 30 million words (see Table 1). To simplify our analysis, tagging and punctuation were removed, so that each sample was a long string of words. MI computations were carried out on the basis of the formula presented in the previous section for all range values up to 20 (this limit was chosen on the basis of pilot analyses, which suggested no variability for range values of 20 or higher; in fact, there seemed to be no variability after about 10 words, and we adopted a limit of 11 in the later regression analyses). Table 1 additionally shows the average MI value for each language. Note that probabilities of word and word-pair occurrences were computed within each sample (as opposed to across samples of the same language).
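The preprocessing step can be sketched as follows. This is our own illustration: the exact markup conventions of the ECI/MC1 samples are an assumption here, and simple SGML-style tags are used as a stand-in.

```python
import re

def strip_to_words(raw):
    """Remove SGML-style tags and punctuation, returning a flat word list.
    \\w is Unicode-aware in Python 3, so accented letters survive."""
    text = re.sub(r"<[^>]+>", " ", raw)   # drop markup tags (assumed format)
    text = re.sub(r"[^\w\s]", " ", text)  # drop punctuation
    return text.lower().split()
```

For example, `strip_to_words("<p>The cat, the dog.</p>")` yields a flat list of four lowercase words.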
Theoretical model and justification
If we had an infinite language sample, then the MI curves for each language should level off (asymptotically, as range increases) at zero. This is because when P(x, y) = P(x)P(y), that is, when encountering x carries no information about the later occurrence of y, the corresponding MI is zero. In practice, even in our 30 million word corpus there was considerable noise (note that few samples had more than 100,000 words), so that MI curves did not level off at zero but rather at values very close to the average MI for each language. This is plausibly because of error in estimating P(x, y), P(x) and P(y) from the limited samples we employed. Moreover, it is unclear how we could apply smoothing techniques to compensate for this lack of accuracy, since the statistical expectation of seeing word y after having encountered word x in a linguistic sample surely depends on the thematic context of the sample (cf. Lapata, Keller, & McDonald, 2001; Lee, 1999).
To address this problem, and to acquire some insight into the MI curves, we propose that MI depends on range according to a simple exponential curve, that is, MI = A·e^(s·range), where A and s are constants (s is always negative, so that MI decreases exponentially as range increases). Exponential decay functions have featured prominently in understanding similarity and generalization (e.g. Nosofsky, 1992; Shepard, 1980, 1987; see also Chater & Brown, 1999). However, we cannot presently motivate the exponential form in a theoretically rigorous way. We assume this form and show later that it leads to good fits to our MI data.
The exponential model implies that ln(MI) = ln A + s × range; in other words, plotting the natural logarithm of MI against range should yield negatively sloped straight lines (recall that s is negative). Note from before that a non-zero ln A term implies some error in the estimation of P(x, y), P(x) and P(y). Accordingly, we can reasonably postulate that the interesting information about how MI characterizes each language comes only from the slopes, s. We justify these assumptions in four ways.
We first correlated the average MI in each of our 595 samples with the number of words in each sample. The rationale for this computation is that the fewer the words in a sample, the noisier the estimates of P(x), P(y) and P(x, y), and hence the higher the average MI. The correlation was significant at the .01 level (inevitably, given the number of data points) but not particularly high: .342. This is because it is not only the size of a sample that affects the validity of the probability estimates, but also the number of distinct words in it (information that we did not extract in our corpus processing).
Secondly, we used the combined Sherlock Holmes novels from the ECI/MC1 database, which yielded a corpus of approximately 750,000 tokens and about 23,000 types. We then transcribed the Sherlock Holmes novels from orthography to the International Phonetic Alphabet, using the Mobylist Pronunciator II[TM] (Copyright 1988-1993, Grady Ward). The Mobylist achieves this via a look-up table (rather than an algorithm). The original transcription left about 65,000 words untranscribed, most of which were plurals of common nouns or past tense forms of common verbs. Thus, using the rules for converting the speech sounds of singular forms to those of plural forms, and likewise for past tense conversions, a supplement to the Mobylist was introduced, so that in the end only about 12,000 words were left without a phonological representation; these were eliminated from the corpus. The 750,000 words were represented by about 2.6 million phonemes, and a mutual information analysis identical to the ones presented previously was performed, with statistics computed for pairs of phonemes rather than words. The total number of possible phoneme pairs, about 2,500, was massively smaller than the number of unique word pairs, hence we expected our estimates of P(x, y) to be correspondingly more accurate. Figure 1 shows the MI curve for these data and confirms our expectation that, with accurate P(x, y) estimates, the MI curve levels off at MI = 0, as predicted by our model. In other words, when the ratio of unique token pairs to total tokens is about 1 to 1,000, MI computations appear accurate.
Thirdly, as already mentioned, we assume that in the equation MI = A·e^(s·range) the meaningful information about linguistic statistical structure resides in s, not A. We therefore computed s for 50 samples, 10 from each of five randomly selected languages. Computation of s for the different samples can be straightforwardly achieved by carrying out simple linear regressions of ln(MI) vs. range (where s is the unstandardized slope). Figure 2 shows the computed s values, arranged by language. In some cases there is more spread in the s values than in others, but in all cases the s values for a given language group together.
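The slope estimation can be sketched as an ordinary least-squares fit of ln(MI) against range. This is our own minimal implementation; the authors presumably used a standard statistics package.

```python
import math

def fit_exponential(mi_values):
    """Fit ln(MI) = ln A + s * range by least squares.
    mi_values[r] is the (positive) MI at range r; returns (A, s)."""
    xs = range(len(mi_values))
    ys = [math.log(v) for v in mi_values]
    n = len(ys)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return math.exp(mean_y - s * mean_x), s

# A noise-free exponential decay is recovered exactly
A, s = fit_exponential([2.0 * math.exp(-0.3 * r) for r in range(12)])
```

With noisy MI estimates the recovered A absorbs the estimation error discussed above, while s remains the quantity of interest.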
[FIGURE 1 OMITTED]
Finally, it is possible that the computed MI values reflect noise rather than any meaningful linguistic structure. To address this possibility, we performed a series of MI analyses at the word level, for ranges 0 to 10, on the English Bible (Authorized Version, King James translation), for which we counted 828,864 words. Specifically, we analysed as separate samples the first 1,000, 5,000, 50,000, 100,000 and 300,000 words, as well as the full text. Additionally, we randomized the order of the words in these five samples and in the full Bible text and repeated the MI analyses. If our MI computations were the result of noise, no difference would be expected between the two sets of computations. Figure 3 illustrates that this is not the case: in all samples, MI is flat across range in the randomized versions, whereas it varies with range in the non-randomized ones. Even in our 1,000 word sample, there is a clear exponential reduction in MI as a function of range in the non-randomized version that is absent from the randomized one. Moreover, inspection of Figure 3 shows the (constant) MI value in the randomized samples to be very near the asymptotic MI value in the non-randomized versions. This observation further reinforces our assumption that the average MI values in Table 1 reflect noise (i.e. arise from inaccuracies in estimating the probabilities of individual words and word pairs).
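The logic of this control can be illustrated with a toy sequence (our own sketch, not the Bible data): a sequence with rigid word order shows high MI at short range, while a shuffled copy of the same tokens shows only the small residual MI that reflects estimation noise.

```python
import math
import random
from collections import Counter

def mutual_information(tokens, rng):
    """MI between words separated by `rng` intervening words (natural log)."""
    pairs = list(zip(tokens, tokens[rng + 1:]))
    n = len(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in Counter(pairs).items())

structured = ['the', 'cat', 'sat', 'on', 'a', 'mat'] * 200  # rigid word order
shuffled = structured[:]
random.seed(0)             # fixed seed for reproducibility
random.shuffle(shuffled)   # same tokens, order destroyed

mi_structured = mutual_information(structured, 0)  # close to ln 6
mi_shuffled = mutual_information(shuffled, 0)      # small residual (noise)
```

The residual MI of the shuffled sequence plays the role of the flat baseline in the randomized Bible samples.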
[FIGURE 2 OMITTED]
[FIGURE 3 OMITTED]
Utilizing the data 1
In this section, we examine the range over which MI dependencies extend in different languages. In other words, how far ahead does the presence of a word inform us about the presence of other words (in different languages)?
We computed the natural logarithms of the average MI values for each language for ranges 0 to 11. We then ran simple linear regressions of ln(MI) vs. range, one for each of our 25 languages. (It would have been more accurate to analyse individual language samples, but it was not practical to conduct 595 regressions.) The results are shown in Table 1 (all regression analyses involved 1 model and 10 residual degrees of freedom; recall that each regression equation predicts the 12 MI values for ranges 0 to 11). The highly significant p values are somewhat deceptive, since we were effectively dealing with nearly horizontal straight lines. The differences in slope appear small but, on the basis of our hypothesis about how MI depends on range, it is these differences that are meaningful in characterizing linguistic structure with MI.
Recall that the regression equations in Table 1 were of the form ln(MI) = ln A + s × range, where ln A = c, and c is the unstandardized constant in Table 1. Therefore, we can write MI = e^(c + s·range). In order to assess the range over which MI dependency extends in different languages, recall our assumption that MI_0 (the MI value for range = 0, that is, for adjacent word pairs) and MI_average (the average MI for each language) carry no information about the statistical properties of the languages. Hence, let us consider the range required for a 90% reduction in the quantity (MI_0 - MI_average). Simple arithmetic (see Appendix) leads to estimates of the ranges over which MI extends in different languages (Table 2). Note that for Latvian and Lithuanian the zero slope prevents us from making this computation; this is clearly a spurious result that possibly arises from rounding errors. Overall, there is considerable consistency across languages, and all MI range estimates are close to the average of 5.008.
Utilizing the data 2
We performed a clustering analysis on our 25 languages on the basis of their slope (cf. Lund, Burgess, & Atchley, 1995; Schuetze, 1993).
We used Pothos and Chater's (2002, 2005) simplicity model of clustering, a model of which grouping of a set of items appears most natural and intuitive to naive observers. It is a model of spontaneous classification and, accordingly, of unsupervised clustering. Very briefly, the model examines whether the information-theoretic description of a set of items can be compressed by arranging the items into groups. The model can compute the optimal (in the information-theoretic sense of Pothos & Chater, 2002) classification of a set of items on the basis of some measure of item distance, without any information about either the distributional characteristics of the items or the number of groups sought (this latter feature, in particular, distinguishes the simplicity approach from alternative classification methods; e.g. Jardine & Sibson, 1968). Additionally, the model can look for subclusters, if these allow further information-theoretic simplification.
The unstandardized slope of the best-fitting line for each language was used as that language's coordinate in a one-dimensional space. Distances between languages were computed using the Euclidean metric. Figure 4 shows the optimal classification identified for two levels. In the final section, we discuss the potential significance of these groupings.
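For readers without access to the simplicity model, the flavour of grouping languages by a one-dimensional coordinate can be conveyed with a much cruder stand-in: splitting the sorted slopes wherever consecutive values are far apart. This is emphatically not Pothos and Chater's model (it needs a hand-picked gap threshold and performs no information-theoretic evaluation), and the subset of slopes below, taken from Table 1, is merely illustrative.

```python
def cluster_1d(coords, gap):
    """Group items by 1-D coordinate: start a new cluster wherever the
    sorted coordinates jump by more than `gap`. A crude illustration only,
    not the simplicity model of Pothos and Chater (2002)."""
    items = sorted(coords.items(), key=lambda kv: kv[1])
    clusters = [[items[0][0]]]
    for (_, prev), (name, value) in zip(items, items[1:]):
        if value - prev > gap:
            clusters.append([])
        clusters[-1].append(name)
    return clusters

# Slopes s for a few languages, from Table 1
slopes = {'Swedish': -0.050, 'Japanese': -0.022, 'Turkish': -0.019,
          'English': -0.008, 'Albanian': -0.008, 'Czech': -0.001}
groups = cluster_1d(slopes, gap=0.005)
```

With this subset and threshold, Swedish is isolated, Japanese groups with Turkish, English with Albanian, and Czech stands alone.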
The finding that MI dependencies extend over a range of almost exactly five items in 25 different languages could be interpreted as implying an involvement of short-term memory (STM) in linguistic processing. The reasoning is that extensive research indicates that STM can concurrently represent around four to seven items (Cowan, 2001; Miller, 1956; for discussions, see Baddeley, 1994; Davelaar, Goshen-Gottstein, Ashkenazi, Haarmann, & Usher, 2005). If STM is involved in processing linguistic input, we would expect linguistic structure to extend over only as long a range of words as STM can support. This is what our analyses show: the range over which MI dependencies extend matches very closely the reported STM spans.
In favour of this conclusion, there is evidence that STM, and specifically the processing restrictions imposed by STM, is critical for language learning (Newport, 1988, 1990). Elman (1993, 1996; Plunkett & Elman, 1996) created an artificial language that incorporated many of the critical characteristics of real languages (such as relative clauses). He then tried to get an SRN to learn the language; that is, the objective of the network was to recognize which novel linguistic stimuli were grammatical. SRNs work by gradually building a sensitivity to forward transitional probabilities in an input of sequentially presented symbols. In other words, an SRN gradually learns to anticipate which symbols will appear after a given symbol. Elman (1993) found that, although no learning was possible when the training set (consisting of a series of sentences) was simply presented to the SRN, learning did occur when the 'memory' of the network for previous words was initially low and gradually increased to 'adult' size (cf. Plunkett & Marchman, 1993). That is, the mechanism that gave the network information about what had been presented before was periodically reset in a way that mimicked human STM development, starting small and only gradually increasing to a size corresponding to the adult STM span. If STM is not initially restricted, the theory goes, the SRN (and presumably the human learner) is overwhelmed by too complex a problem and ends up unable to learn anything (for a different view, see Rohde & Plaut, 2003). Dempster (1981) argued against any developmental changes in STM span, but this view is not universally accepted (for discussions, see Hitch, Towse, & Hutton, 2001; Kail, 1984; Newport, 1990). Finally, note that the 'starting small' intuition has also been applied in other areas of cognition (e.g. Turkewitz & Kenny, 1982), and STM restrictions have been argued to have an adaptive role beyond linguistic processing (e.g. Dirlam, 1972; Kareev, 1995; Kareev, Lieberman, & Lev, 1997).
[FIGURE 4 OMITTED]
Additionally, Daneman and Carpenter (1980) emphasized the role of working memory in linguistic ability. These investigators distinguished between passive measures of STM capacity and measures that reflect STM processing limitations (cf. Baddeley, 1983; Baddeley, Thomson, & Buchanan, 1975). The former are typically assessed with tasks such as digit span. Daneman and Carpenter measured the latter in terms of 'reading span', a task involving retaining the last word of each of an increasing number of sentences. Daneman and colleagues showed in several studies that reading span, but not passive STM measures, correlates with linguistic ability (e.g. speech generation, reading tasks) (Daneman, 1991; Daneman & Merikle, 1996; cf. Lepine, Barrouillet, & Camos, 2005). Their conclusion was that working memory is intimately linked with linguistic processes. In a similar vein, Ellis and his colleagues demonstrated that individuals' phonological STM can correlate with their ability to learn vocabulary and syntax in first and second language acquisition (e.g. Ellis, 1996; Ellis & Schmidt, 1997; Ellis & Sinclair, 1996). Finally, when Jarvella (1971) interrupted participants reading a text, he found that they could remember the last seven words very well, but that memory deteriorated rapidly beyond that point. This result (trivially) indicates that the information explicitly available to the cognitive processor from reading linguistic material is constrained by STM, so that STM span restrictions plausibly affect the comprehension of written text.
Against the conclusion that our finding of MI dependencies extending over a range of five items indicates an involvement of STM in linguistic processing, note first that the conclusions of Ellis (1996) or Elman (1993) concern primarily spoken language, whereas our analyses were carried out on written language. Second, the differential role of function words, prepositions, etc., may confound MI range differences between languages (we thank Eddy Davelaar for bringing this point to our attention). For example, in a language like English that relies a great deal on phrasal verbs, we would expect many verb-preposition pairs to be chunked and hence processed as single units in STM (Simon & Barenfeld, 1969). In other words, languages that rely more on function words/prepositions may appear to have a longer MI range. It is not clear how one could get round this problem in MI analyses at the word level.
Third, the links between STM and linguistic structure typically concern phonological STM (Baddeley et al., 1975). For example, Naveh-Benjamin and Ayres (1986; see also Ellis & Hennelly, 1980) examined relations between reading time and phonological STM, measured in terms of digit recall in English, Spanish, Hebrew and Arabic. They found that where the digits had more syllables, STM span was shorter. Taking into account differences in the syllable counts of the material that participants in the different linguistic conditions had to memorize, it looked as though phonological STM was constant. Is STM for language best understood in terms of the (phonological) length of the material to be recalled, or the number of words? The latest empirical results suggest that both are important (Chen & Cowan, 2005). Accordingly, we would expect regularity at both the phonemic level and the word level to play a part in linguistic processes, as indeed Monaghan, Chater, and Christiansen (2005) have found with artificial languages. To complicate things further, the role of phonology in different languages appears to depend on the regularity between orthography and phonology (e.g. Jonsdottir, Shallice, & Wise, 1996). Since the functional unit of our computations was the word, without information about syllables per word we cannot directly infer STM involvement in linguistic processes.
The considerations above clearly also limit potential interpretations of the clustering analysis. The clustering analysis should provide some indication of morphological similarity between languages, and/or similarities in STM in the respective populations, and/or similarities in the average phonological length of words in different languages. Note that morphologically similar languages, such as Norwegian and Swedish, were assigned to different clusters, hence it looks as though MI statistics do not simply reflect the morphological properties of languages. The computations presently reported prevent any more specific observations.
This work should indicate the potential utility of MI as a means of characterizing linguistic structure. A number of refinements for future work readily suggest themselves. First, since the accuracy of MI estimates depends on the number of words relative to the number of unique pairs, larger samples per language would improve the estimates. Second, to fully explore the implications of the reported language clustering, one would additionally need measures of morphological complexity for each language (cf. Juola, Pothos, & Bailey, 1998) and behavioural measures of phonological STM differences.
We would like to thank Alan Baddeley, Todd Bailey, Nick Chater, Eddy Davelaar, Padraic Monaghan and Kim Plunkett for their helpful comments on this work. This research has been partly supported by EC Framework 6 grant, contract 516542 (NEST).
Baddeley, A. D. (1983). Working memory. Philosophical Transactions of the Royal Society of London B, 302, 311-324.
Baddeley, A. D. (1994). The magical number seven: Still magic after all these years? Psychological Review, 101, 353-356.
Baddeley, A. D., Thomson, N., & Buchanan, M. (1975). Word length and the structure of short-term memory. Journal of Verbal Learning and Verbal Behavior, 14, 575-589.
Bates, E., & Elman, J. (1996). Learning rediscovered. Science, 274, 1849-1850.
Boucher, L., & Dienes, Z. (2003). Two ways of learning associations. Cognitive Science, 27, 807-842.
Chater, N., & Brown, G. D. A. (1999). Scale-invariance as a unifying psychological principle. Cognition, 69, B17-B24.
Chen, Z., & Cowan, N. (2005). Chunk limits and length limits in immediate recall: A reconciliation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 31, 1235-1249.
Chomsky, N. (1965). Aspects of the theory of syntax. Cambridge, MA: MIT Press.
Christiansen, M. H., & Chater, N. (1999). Connectionist natural language processing: The state of the art. Cognitive Science, 23, 417-437.
Cleeremans, A., & McClelland, J. L. (1991). Learning the structure of event sequences. Journal of Experimental Psychology: General, 120, 235-253.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Cowan, N. (2001). The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behavioral and Brain Sciences, 24, 87-185.
Daneman, M. (1991). Working memory as a predictor of verbal fluency. Journal of Psycholinguistic Research, 20, 445-464.
Daneman, M., & Carpenter, P. A. (1980). Individual differences in working memory and reading. Journal of Verbal Learning and Verbal Behavior, 19, 450-466.
Daneman, M., & Merikle, P. M. (1996). Working memory and language comprehension: A meta-analysis. Psychonomic Bulletin and Review, 3, 422-433.
Davelaar, E. J., Goshen-Gottstein, Y., Ashkenazi, A., Haarmann, H. J., & Usher, M. (2005). The demise of short-term memory revisited: Empirical and computational investigations of recency effects. Psychological Review, 112, 3-42.
Dempster, E. N. (1981). Memory span: Sources of individual and developmental differences. Psychological Bulletin, 89, 63-100.
Dirlam, D. K. (1972). Most efficient chunk sizes. Cognitive Psychology, 3, 355-359.
Ellis, N. C. (1996). Sequencing in SLA. Studies in Second Language Acquisition, 18, 91-126.
Ellis, N. C., & Hennelly, R. A. (1980). A bilingual word-length effect: Implications for intelligence testing and the relative ease of mental calculation in Welsh and English. British Journal of Psychology, 71, 43-51.
Ellis, N. C., & Schmidt, R. (1997). Morphology and longer distance dependencies. Studies in Second Language Acquisition, 19, 145-171.
Ellis, N. C., & Sinclair, S. G. (1996). Working memory in the acquisition of vocabulary and syntax: Putting language in good order. Quarterly Journal of Experimental Psychology, 49A, 234-250.
Elman, J. L. (1991). Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7, 195-225.
Elman, J. L. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48, 71-99.
Elman, J. L. (1996). Rethinking innateness: A connectionist perspective on development. Cambridge, MA: MIT Press.
Gold, E. M. (1967). Language identification in the limit. Information and Control, 10, 447-474.
Hitch, G. H., Towse, J. N., & Hutton, U. (2001). What limits children's working memory span? Theoretical accounts and applications for scholastic development. Journal of Experimental Psychology: General, 130, 184-198.
Horning, J. J. (1969). A study of grammatical inference. Doctoral thesis, Stanford University.
Jardine, N., & Sibson, R. (1968). The construction of hierarchic and nonhierarchic classifications. Computer Journal, 11, 177-194.
Jarvella, R. J. (1971). Syntactic processing of connected speech. Journal of Verbal Learning and Verbal Behavior, 10, 409-416.
Johnson, M., & Riezler, S. (2002). Statistical models of language learning and use. Cognitive Science, 26, 239-253.
Jonsdottir, M. K., Shallice, T., & Wise, R. (1996). Phonological mediation and the graphemic buffer disorder in spelling: Cross-language differences? Cognition, 59, 169-197.
Juola, P., Bailey, T. M., & Pothos, E. M. (1998). Theory-neutral domain regularity measurements. In Proceedings of the Twentieth Annual Conference of the Cognitive Science Society (pp.555-560). Mahwah, NJ: Erlbaum.
Kail, R. (1984). The development of memory. New York: Freeman.
Kareev, Y. (1995). Through a narrow window: Working memory capacity and the detection of covariation. Cognition, 56, 263-269.
Kareev, Y., Lieberman, I., & Lev, M. (1997). Through a narrow window: Sample size and the perception of correlation. Journal of Experimental Psychology: General, 126, 278-287.
Lapata, M., Keller, F., & McDonald, S. (2001). Evaluating smoothing algorithms against plausibility judgments. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (pp. 346-353). Morristown, NJ: Association for Computational Linguistics.
Lee, L. (1999). Measures of distributional similarity. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (pp. 25-32). Morristown, NJ: Association for Computational Linguistics.
Lepine, R., Barrouillet, P., & Camos, V. (2005). What makes working memory spans so predictive of high level cognition? Psychonomic Bulletin and Review, 12(1), 165-170.
Lund, K., Burgess, C., & Atchley, R. A. (1995). Semantic and associative priming in high-dimensional semantic space. In Proceedings of the 17th Annual Conference of the Cognitive Science Society (pp.660-665). Mahwah, NJ: Erlbaum.
Marcus, G. F. (2001). The algebraic mind. Cambridge, MA: MIT Press.
Marcus, G. F., Vijayan, S., Bandi Rao, S., & Vishton, P. M. (1999). Rule learning by seven-month-old infants. Science, 283, 77-80.
Miller, G. A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63, 81-97.
Monaghan, P., Chater, N., & Christiansen, M. H. (2005). The differential role of phonological cues in grammatical categorization. Cognition, 96, 143-182.
Naveh-Benjamin, M., & Ayres, T. J. (1986). Digit span, reading rate, and linguistic relativity. Quarterly Journal of Experimental Psychology, 38A, 739-751.
Newport, E. L. (1988). Constraints on learning and their role in language acquisition: Studies of the acquisition of American Sign Language. Language Sciences, 10, 147-172.
Newport, E. L. (1990). Maturational constraints on language learning. Cognitive Science, 14, 11-29.
Nosofsky, R. M. (1992). Similarity scaling and cognitive process models. Annual Review of Psychology, 43, 25-53.
Perruchet, P., & Peereman, R. (2004). The exploitation of distributional information in syllable processing. Journal of Neurolinguistics, 17, 97-119.
Pinker, S. (1979). Formal models of language learning. Cognition, 7, 217-283.
Pinker, S. (1994). The language instinct. London: Penguin Books.
Plunkett, K., & Elman, J. L. (1996). Simulating nature and nurture: A handbook of connectionist exercises. Cambridge, MA: MIT Press.
Plunkett, K., Karmiloff-Smith, A., Bates, E., & Elman, J. L. (1997). Connectionism and developmental psychology. Journal of Child Psychology and Psychiatry and Allied Disciplines, 38, 53-80.
Plunkett, K., & Marchman, V. (1993). From rote learning to system building: Acquiring verb morphology in children and connectionist nets. Cognition, 48, 21-69.
Pothos, E. M., & Chater, N. (2002). A simplicity principle in unsupervised human categorization. Cognitive Science, 26, 303-343.
Pothos, E. M., & Chater, N. (2005). Unsupervised categorization and category learning. Quarterly Journal of Experimental Psychology, 58A, 733-752.
Rohde, D. L. T., & Plaut, D. C. (2003). Less is less in language acquisition. In P. Quinlan (Ed.), Connectionist modeling of cognitive development (pp. 189-231). Hove, UK: Psychology Press.
Saffran, J. R., Aslin, R. N., & Newport, E. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926-1928.
Schuetze, H. (1993). Word space. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems (Vol. 5, pp. 895-902). San Mateo, CA: Morgan Kaufmann.
Seidenberg, M. S. (1997). Language acquisition and use: Learning and applying probabilistic constraints. Science, 275, 1599-1603.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell System Technical Journal, 30, 50-64.
Shepard, R. N. (1980). Multidimensional scaling, tree-fitting, and clustering. Science, 210, 390-398.
Shepard, R. N. (1987). Toward a universal law of generalization for psychological science. Science, 237, 1317-1323.
Simon, H. A., & Barenfeld, M. (1969). Information-processing analysis of perceptual processes in problem solving. Psychological Review, 76, 473-483.
Turkewitz, G., & Kenny, P. A. (1982). Limitations on input as a basis for neural organization and perceptual development: A preliminary theoretical statement. Developmental Psychobiology, 15, 257-368.
Vitanyi, P. M. B., & Li, M. (1997). On prediction by data compression. In Proceedings of the 9th European Conference on Machine Learning, Lecture Notes in Artificial Intelligence (Vol. 1224, pp. 14-30). Heidelberg: Springer-Verlag.
Wharton, R. M. (1974). Approximate language identification. Information and Control, 26, 236-255.
Received 12 December 2005; revised version received 1 June 2006
Arithmetic supplement for the section 'Utilizing the data 1'
We seek the MI for which (MI - MI_av)/(MI_0 - MI_av) = 10% = 0.1, where MI_0 is the maximum MI for a language and MI_av is the average MI for a language (both quantities are presently hypothesized to reflect sample sizes and numbers of unique pairs). Note that MI_0 = e^c = A. The MI value we seek is thus MI = 0.1·MI_0 + 0.9·MI_av. We now have to find the range such that 0.1·MI_0 + 0.9·MI_av = e^(c + s·range), where c and s are the unstandardized constants and slopes in Table 1. Hence, range = [ln(0.1·MI_0 + 0.9·MI_av) - c]/s.
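The Appendix formula is easy to check numerically. The sketch below (our own code) plugs in the English values from Table 1 (c = 1.502, s = -0.008, average MI = 4.293) and recovers the English entry of about 5.05 in Table 2.

```python
import math

def mi_range(c, s, mi_average):
    """Range at which MI_0 - MI_average has dropped by 90%.
    Undefined when s = 0 (cf. Latvian and Lithuanian in Table 2)."""
    mi_0 = math.exp(c)                      # MI at range 0 (= A)
    target = 0.1 * mi_0 + 0.9 * mi_average  # MI after a 90% drop
    return (math.log(target) - c) / s

english_range = mi_range(1.502, -0.008, 4.293)  # close to 5.05
```

For s = 0 the division fails, which is exactly why the Latvian and Lithuanian entries in Table 2 are undefined.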
Emmanuel M. Pothos (1*) and Patrick Juola (2*)
(1) Swansea University, UK
(2) Duquesne University, Pittsburgh, USA
*Correspondence should be addressed to Emmanuel M. Pothos, Department of Psychology, Swansea University, Swansea SA2 8PP, UK (e-mail: email@example.com) or to Patrick Juola, Department of Mathematics and Computer Science, Duquesne University, 600 Forbes Avenue, Pittsburgh, PA 15282, USA (e-mail: firstname.lastname@example.org).
Table 1. Corpus data

Language     Average no.   Samples   Average   Unstandardized   Unstandardized   p value
             of words                MI        constant, c      slope, s
Albanian       170,450        1      2.102         0.791           -0.008         .013
Bulgarian        1,498        3      6.462         1.877           -0.002         .001
Croatian        25,573        1      6.182         1.841           -0.003         .050
Czech          118,694      138      9.027         2.204           -0.001         .012
Danish           3,707        1      6.387         1.866           -0.002         .000
Dutch           22,988       39      5.109         1.660           -0.005         .011
English         64,688       20      4.293         1.502           -0.008         .009
Estonian        92,762        2      8.911         2.191           -0.001         .001
French          54,372       60      5.091         1.670           -0.007         .004
Gaelic          24,808        1      6.872         1.941           -0.002         .000
German          84,308        8      6.188         1.845           -0.004         .005
Greek           38,834       24      5.621         1.757           -0.005         .009
Italian         50,191       10      5.733         1.784           -0.007         .001
Japanese        79,001        4      5.176         1.764           -0.022         .000
Latvian         83,439       15      9.554         2.259            0.000         .000
Lithuanian      55,317        1      9.466         2.249            0.000         .064
Malay           44,107       16      5.365         1.714           -0.006         .007
Norwegian       19,954       43      4.289         1.494           -0.007         .011
Portuguese      26,623        5      5.447         1.716           -0.004         .012
Russian         29,708        4      6.704         1.930           -0.005         .003
Serbian         24,063       16      6.708         1.914           -0.002         .041
Slovakian       21,554        1      5.971         1.811           -0.004         .057
Spanish         27,835       29      5.013         1.643           -0.005         .007
Swedish         11,319      150      3.423         1.493           -0.050         .000
Turkish         68,819        3      3.486         1.353           -0.019         .000

Note. The unstandardized constant, unstandardized slope and p value columns relate to the simple linear regressions of ln(MI) vs. range (ln(MI) = ln A + s × range, with A = e^c). Average MI: the average MI across all range positions (up to 20) for all samples in a language.

Table 2. Range over which the quantity MI_0 - MI_average drops by 90%

Language     Range
Albanian      5.40
Bulgarian     4.97
Croatian      5.80
Czech         3.40
Danish        5.28
Dutch         5.21
English       5.05
Estonian      3.34
French        5.45
Gaelic        6.09
German        5.03
Greek         5.48
Italian       4.84
Japanese      4.88
Latvian       Undefined
Lithuanian    Undefined
Malay         5.11
Norwegian     4.87
Portuguese    4.70
Russian       4.91
Serbian       4.81
Slovakian     5.41
Spanish       5.57
Swedish       4.66
Turkish       4.91
Publication: British Journal of Psychology, May 2007.