The Handbook of Psychological Testing.
The orientation of the book can be conveyed by a couple of quotations: 'only factor-analytic tests are worthy of consideration'; 'projective tests were never based on reason'. Although these are presented as summarizing statements, they appear to be prejudices, apparently acquired from Eysenck's paperbacks when Kline was a student. They seem to have prevented Kline from examining relevant work with care. Thus, he does not seem to be aware of the extensive programme of classic laboratory experimentation that went into the development of McClelland, Atkinson, Clark & Lowell's (1958) logically coherent, content-analysis-based scoring system for a derivative of the TAT, still less of the very good reasons why that scoring system remains at loggerheads with mainstream psychometric theory. Likewise, although Kline admits the possibility of constructing tests in which each item correlates with the total score but not with the other items, he concludes, 'No test constructor has been able to construct [such] a test'. Assuming that he is not pedantically insisting that the inter-item correlations be absolutely zero, his conclusion exposes his lack of familiarity both with tests based on Item Response Theory (IRT) and with those developed by researchers in the generic competency testing movement (e.g. Klemp, Munger & Spencer, 1977; McClelland, 1978; Pottinger, 1977; Raven, 1984, 1991; Spencer & Spencer, 1993).
Kline has at least tried to understand IRT, devoting almost two chapters to it. The problem is that the way he writes suggests an absence of personal practical experience. For example, whereas in his chapters on classical test theory he introduces numerous anecdotes and examples which nicely help the reader understand what is being said (and even explains that 'construct' means 'concept'), by the third paragraph of his chapter on IRT he is writing, 'the calibration of the test [must] be independent of the sample that is used ... and ... the measurement of the object should be independent of the items which happen to be used in the measurement'. The appearance of this statement at this point in the chapter suggests that its author is not on top of his subject matter. This suspicion is confirmed when one reads that IRT only works in 'simple fields of ability testing' and has not 'led to psychological tests of any note'. Such statements not only call Kline's familiarity with IRT into question, they also lead one to wonder how familiar he really is with the procedures used to construct such tests as the Progressive Matrices and the British Ability Scales. Naturally, such reflections can only lead one to question further the basis for his assertion that 'only factor analytic tests are worthy of consideration'.
I am uneasy about such a one-sided presentation in a book like this. Why? It is not that I do not accept that science advances most rapidly through the elaboration of contrasting positions (rather than through attempts at balanced eclecticism). Nor is it that I do not appreciate the way in which Kline has made his biases clear by forthrightly stating his opinions. Although I recoil from the thought, it seems that I somehow expect higher levels of scholarship in a book of this sort. In a review of this kind, it is therefore important to draw attention to a number of erroneous beliefs that are likely to be perpetuated if the book is widely adopted in courses on psychometrics.
The first of these concerns sampling. I was initially delighted to find a chapter on this subject. In trying to obtain good normative data for the Progressive Matrices, I have encountered a distinct lack of understanding of sampling issues among psychologists. Unfortunately, my delight was short-lived. While I welcome Kline's emphasis on large samples and accept the cost implications, it was disturbing to find his advocacy based on false premises. Despite a statement to the contrary, he continually confuses size with representativeness, as in the following example: 'huge samples are required if they are to be at all representative'. In point of fact, representativeness depends mainly on the care with which a sample is drawn, next on its size, and hardly at all on the size of the population to be represented. Had Kline been better informed about sampling matters, he would have underlined the difference between quota samples and carefully drawn random (or stratified random) samples. Most large-scale test standardizations are based on quota samples, which are notoriously unreliable; much smaller, carefully drawn samples would have been better. The real reason why large samples are needed is that it is often necessary to report data for subsamples, such as different age groups. If one has 100 children in a six-month age group, the 5th and 95th percentiles are the scores of the 5th child in from the respective tails of the distribution. To get 100 children in each of six six-month age groups, one must have a total sample of some 600. Even with such a sample, finer discrimination in the tails (which is where tests are most often used) is typically generated by little more than the application of smoke and mirrors (Dockrell, 1989). It is a great pity that Kline did not treat this topic more carefully and provide references that would lead readers back to more fundamental texts on sampling theory and practice.
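The arithmetic behind these sample-size figures can be sketched as follows (illustrative figures only, not drawn from any actual standardization):

```python
# Illustrative sketch: resolution of extreme percentiles in a norming sample.
# With n children in an age band, the pth percentile is pinned to roughly
# the (p/100 * n)-th score counted in from the nearer tail.

def tail_rank(n, percentile):
    """Rank (counted in from the nearer tail) of the score that defines a percentile."""
    p = min(percentile, 100 - percentile) / 100.0
    return max(1, round(p * n))

# 100 children per six-month band: the 5th and 95th percentiles each rest
# on the 5th child in from the respective tail.
print(tail_rank(100, 5))   # 5
print(tail_rank(100, 95))  # 5

# Six six-month bands at 100 children each imply a total sample of ~600.
print(6 * 100)  # 600

# Finer cut-offs in the tails are barely supported: the 1st percentile in a
# band of 100 rests on a single child's score.
print(tail_rank(100, 1))  # 1
```

This is why precision in the tails, where tests are most often used, cannot be bought simply by enlarging the total sample: it is the per-band count, and the care with which each band is drawn, that matter.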
Similar comments apply to the chapters on factor analysis. Readers are led through methods of condensation and rotation, but they are nowhere enjoined to perform the elementary task of rewriting and inspecting their correlation matrices after the variables have been ordered according to their factor loadings. If they did this, they would quickly find out whether the psychological models behind different computational algorithms fit their data. Failure to do this not only accounts for a large proportion of the meaningless factor analyses that are published. It also results in few researchers discovering that no factor-analytic model fits IRT-based tests, let alone more meaningful models of psychological abilities.
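The inspection step described above can be sketched in a few lines; the loadings here are invented purely for illustration:

```python
# A minimal sketch of the elementary check the review describes: reorder a
# correlation matrix so that variables loading on the same factor sit
# together, then inspect whether the block structure the model implies
# actually appears in the data. The loadings are hypothetical.
import numpy as np

# Hypothetical loadings for 6 variables on 2 factors (rows = variables).
loadings = np.array([
    [0.80, 0.10],
    [0.20, 0.70],
    [0.70, 0.20],
    [0.10, 0.80],
    [0.75, 0.15],
    [0.15, 0.75],
])

# Correlation matrix implied by the factor model (unit diagonal).
R = loadings @ loadings.T
np.fill_diagonal(R, 1.0)

# Order variables by their dominant factor, then by loading size within it.
dominant = np.argmax(np.abs(loadings), axis=1)
strength = np.max(np.abs(loadings), axis=1)
order = np.lexsort((-strength, dominant))  # primary key: dominant factor

R_sorted = R[np.ix_(order, order)]
print(np.round(R_sorted, 2))
# If the model fits, the reordered matrix shows high correlations within
# diagonal blocks and low correlations between them; if no such pattern
# emerges, the factor model does not fit the data.
```

Performing this check on real data would reveal, as the review argues, whether the psychological model behind a given computational algorithm actually fits.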
In the chapter on reporting test scores, percentiles are rejected and users are urged to report normalized scores. Only after devoting several pages to describing how to do this does Kline acknowledge a problem that actually undermines the whole operation: normalized scores cannot legitimately be used when test scores do not conform to a Gaussian distribution, especially in the tails. When Thorndike was reporting results from the Stanford-Binet restandardization in the mid-1980s, I asked him whether he had examined the shapes of the distributions of the subscales. 'Yes,' he replied, 'they are almost always bimodal.' 'What do you do about that?' I asked. 'Read off percentiles and convert them to deviation scores.' Ho hum. When restandardizing the WISC in Scotland, Dockrell and his colleagues examined the shapes of the distributions in the tails. The deviations from the Gaussian curve were so marked that the same child, with the same score on the same test, judged against the same normative sample, might have an IQ of 47 if the statistician who compiled the norms assumed a Gaussian curve, but an IQ of 60 if the actual distribution of scores were employed. (In the United States, scores differing by this amount have different legally mandated implications for treatment.)
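The discrepancy at issue can be demonstrated with fabricated data: on a bimodal distribution of the kind Thorndike reported, an extreme score's Gaussian-assumed standing and its empirical percentile standing diverge.

```python
# Illustration (fabricated data): on a bimodal score distribution, the
# percentile implied by a Gaussian assumption differs markedly from the
# percentile read off the actual distribution -- the source of the kind of
# IQ 47 vs. IQ 60 discrepancy described in the review.
import random
import statistics
from statistics import NormalDist

random.seed(0)
# Two score clusters, mimicking a bimodal subscale distribution.
scores = ([random.gauss(40, 5) for _ in range(500)] +
          [random.gauss(70, 5) for _ in range(500)])

mean = statistics.mean(scores)
sd = statistics.stdev(scores)

raw = 45.0  # a hypothetical child's raw score

# Gaussian assumption: convert the z-score directly to a percentile.
z = (raw - mean) / sd
gaussian_pct = NormalDist().cdf(z) * 100

# Empirical: read the percentile off the actual distribution of scores.
empirical_pct = 100 * sum(s <= raw for s in scores) / len(scores)

print(round(gaussian_pct, 1), round(empirical_pct, 1))
# The two percentiles differ substantially, so the standard score assigned
# to the same raw score depends on which of the two conventions the
# norm-builder adopted.
```

With these (invented) clusters the Gaussian route places the score far lower in the distribution than the data themselves do, which is precisely why the choice of convention carries real consequences for individual children.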
Kline's use of the word 'objective' troubled me throughout. He has clearly not considered how an assessment of a person, or an evaluation of an educational or staff-development programme, can be described as 'objective' when it records deficiencies but does not call attention to strengths. How can an evaluation of an educational programme that shows that it depresses scores on some measure of reading ability, but does not record that it increases leadership, the ability to make one's own observations, and self-confidence, legitimately be described as 'objective'? Yet such unobjective results have dramatic effects on policy. One result of this oversight is that there is, in the chapter on the applications of tests in education, no mention of the misuse of tests in perpetuating very limited understandings of 'ability' and 'education', and of the personally and socially destructive educational and social policies associated with them. Likewise, there is, in the chapter on occupational applications, no discussion of the widespread use of tests to promote incompetent and destructive people into influential positions. These are unethical uses of tests by psychologists. Had Kline recognized this, he might have been more prepared to acknowledge the use of unconventional methodologies in the hope of finding ways of assessing a wider range of components of competence.
Moving to the section on test reviews, I have had to content myself with looking at reviews of tests I know something about. However, it seems fairly obvious that Kline's test library is far from up to date. The references to the Wechsler publications stop in 1975, those to reviews in Buros in 1978. There is no discussion of the extensive work that went into transforming the British Ability Scales into the Differential Ability Scales. Among other things, Kline asserts that the norms for the Progressive Matrices are 'old and clearly defective in terms of sampling'. This can only mean that he has not read any of the numerous revisions to the Manual that have been published since 1972. The 1979 standardization of the Standard Progressive Matrices and the Mill Hill Vocabulary Scale among British schoolchildren (which could not have been carried out without the support of the Social Science Research Council) is among the most thorough standardizations ever undertaken. The American and International norms (based on some 100,000 protocols), while not everything they might be, hardly merit oblivion. The references in the review of the Myers-Briggs inventory are similarly dated. Kline's test library is not only dated, it is also thin. Thus, while some effort would have been required to secure early publications on the Progressive Matrices and the British Ability Scales, that effort would have disconfirmed Kline's claim that IRT has had little application in test construction. Likewise, the literature on the Myers-Briggs is much more extensive than he claims. Contrary to what is implied, the Murray Thematic Apperception Test has led not only to the development of theoretically based, experimentally validated scoring procedures, but also to a swathe of competence-based assessment procedures that are much more psychologically and ethically justifiable than most of those Kline recommends.
So, what is the value of this book? Would I refer someone who needed to understand and learn to use factor analysis to it? No. Someone who needed to understand and use IRT? Definitely not. Someone who was looking for a test which could be used to evaluate educational, staff-development or organizational development programmes? No. Someone who wished to understand the strengths and limitations of testing in educational or occupational areas? No. Someone who needed to understand sampling in norming? I am afraid not. What a pity.
Dockrell, W. B. (1989). Extreme scores on the WISC-R. Bulletin of the International Test Commission, 28, April, 1-7.
Klemp, G. O., Munger, M. T. & Spencer, L. M. (1977). An Analysis of Leadership and Management Competencies of Commissioned and Non-Commissioned Naval Officers in the Pacific and Atlantic Fleets. Boston, MA: McBer.
McClelland, D. C. (1978). Guide to Behavioral Event Interviewing. Boston, MA: McBer.
McClelland, D. C., Atkinson, J. W., Clark, R. A. & Lowell, E. L. (1958). A scoring manual for the achievement motive. In J. W. Atkinson (Ed.), Motives in Fantasy, Action, and Society. New York: Van Nostrand.
Pottinger, P. S. (1977). Competence Testing as an Alternative to Credentials. Boston, MA: McBer.
Raven, J. (1984). Competence in Modern Society: Its Identification, Development and Release. Oxford: Oxford Psychologists Press.
Raven, J. (1991). The Tragic Illusion: Educational Testing. New York: Trillium Press; Oxford: Oxford Psychologists Press.
Spencer, L. M. & Spencer, S. M. (1993). Competence at Work. New York: Wiley.
JOHN RAVEN (Edinburgh)
Publication: British Journal of Psychology
Article Type: Book Review
Date: Aug 1, 1995