RESEARCH NOTE: REPLICABLE EVIDENCE OF PSI: A REVISION OF MILTON'S (1999) META-ANALYSIS OF THE GANZFELD DATABASES.
ABSTRACT: J. Milton (1999) found that D. J. Bem and C. Honorton's (1994) successful result of a significant mean effect size for a database of autoganzfeld studies had not been replicated, either by J. Milton and R. Wiseman's (1999) 30-study ganzfeld database or by all ganzfeld studies dating from 1987 to 1998 (which includes a further 9 ganzfeld studies deemed suitable for inclusion with Milton and Wiseman's 30 studies). D. Radin (cited in G. Schmeidler & H. Edge, 1999) found that Milton and Wiseman's database was indeed a successful replication of Bem and Honorton's meta-analysis. In deference to Milton, replications were in fact demonstrated by way of the same performance comparisons she made if 1 "extremely successful" study in Bem and Honorton's database is judiciously considered not suitable for inclusion in that database. The issues of database homogeneity and the "outlier problem" are raised.
Milton and Wiseman (1999) presented an up-to-date meta-analysis of 30 post-Communique ganzfeld studies dating from early 1987 and published by February 1997. They found a nonsignificant mean effect size of .013 (Z = .70, p = .24), which did not replicate Bem and Honorton's (1994) significant result (ES = .164, Z = 2.55, p = .005). Milton (1999) then found another 12 ganzfeld studies dating from 1997, but only 9 were deemed suitable for further meta-analysis. Further performance comparisons by Milton (detailed below) again failed to replicate the successful results of Bem and Honorton.
The treatments I discuss in this article, using the databases from the same three above-mentioned studies (as well as a cursory consideration of Honorton, 1985, and Honorton et al., 1990) result in alternative interpretations of the ganzfeld situation, which are relevant not only to the whole ganzfeld issue as it stands but also to its history so far, and its future.
ISSUES OF HOMOGENEITY AND OUTLIER STUDIES
In the recent ganzfeld debate (Schmeidler & Edge, 1999), it was noted that Radin (cited in Schmeidler & Edge, 1999, p. 340) conducted a chi-square test (sum of z-squares) on Milton and Wiseman's (1999) 30 studies, χ²(30, N = 30) = 46.67, p = .027. This heterogeneous database was made homogeneous by the removal of three extremely negative z-score studies (i.e., outliers), χ²(30, N = 30) = 32.43, p = .35. A significant z score of 1.99 was found for the smaller database of 27 remaining studies (p = .02, one-tailed). Radin concluded that the Milton and Wiseman database was, after all, a "statistically significant replication" (p. 340) of Bem and Honorton's (1994) result. (Note that Radin mistakenly used 30 degrees of freedom, instead of 27, in the second chi-square test, so the p value should be .22, not .35.)
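As a rough check on the figures above, the sum-of-z-squares test can be sketched in a few lines of Python. The sketch assumes, as in Radin's test, that the degrees of freedom equal the number of studies. The function names `chi2_sf` and `homogeneity_test` are illustrative only and do not come from any of the cited papers. Because the per-study z scores are not reproduced in this note, only the reported chi-square totals are fed in.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variate with integer df."""
    y = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-y) * sum_{k=0}^{df/2 - 1} y^k / k!
        term = total = 1.0
        for k in range(1, df // 2):
            term *= y / k
            total += term
        return math.exp(-y) * total
    # Odd df: erfc(sqrt(y)) + exp(-y) * sum_k y^(k + 1/2) / Gamma(k + 3/2)
    term = math.sqrt(y) / math.gamma(1.5)
    total = 0.0
    for k in range((df - 1) // 2):
        total += term
        term *= y / (k + 1.5)
    return math.erfc(math.sqrt(y)) + math.exp(-y) * total

def homogeneity_test(z_scores):
    """Sum-of-z-squares test: chi-square = sum(z^2), df = number of studies."""
    chi2 = sum(z * z for z in z_scores)
    df = len(z_scores)
    return chi2, df, chi2_sf(chi2, df)

# Chi-square totals as reported in the text (assumption: no rounding error).
p_full = chi2_sf(46.67, 30)      # all 30 studies
p_trim_30 = chi2_sf(32.43, 30)   # 27 studies, tested with 30 df as Radin did
p_trim_27 = chi2_sf(32.43, 27)   # 27 studies, tested with 27 df
print(f"{p_full:.3f} {p_trim_30:.2f} {p_trim_27:.2f}")
```

Under these assumptions the trimmed total of 32.43 gives a p value near .35 with 30 degrees of freedom and near .22 with 27, which is the correction noted above. Either way the trimmed database remains homogeneous by the usual .05 criterion.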
There was some dispute (Schmeidler & Edge, 1999, pp. 340-349) over whether it was such a good idea to take out studies responsible for heterogeneity, and the implications of Radin's procedure for the ganzfeld domain as a whole are worth considering. It therefore seemed only natural to apply the same test to other ganzfeld databases. Specifically, the chi-square test can be applied to Bem and Honorton's (1994, pp. 10-11, Table 1) 10-study database, which they constructed in their "primary analysis" (p. 10). Even their 11-study database (which includes Study No. 302) could be tested, as well as Honorton et al.'s (1990, p. 114, Table 1) original database. One might also apply the test to Honorton's (1985) 28-study database, just to see what happens.
The earliest database, that of Honorton (1985), which includes 28 studies, yields χ²(28, N = 28) = 109.76, p = 1.25 × 10⁻¹¹. Given that this database is thus heterogeneous, would it not be wise to start trimming away outlier studies until one gets a homogeneous database?
Honorton et al.'s (1990) 11-study autoganzfeld database is not homogeneous either, χ²(11, N = 11) = 26.29, p = .006. Neither is Bem and Honorton's (1994) version of that 11-study database, χ²(11, N = 11) = 20.19, p = .043. However, their 10-study database "gets through the net," χ²(10, N = 10) = 10.95, p = .361.
Perhaps the removal altogether of Study No. 302 is a good idea, especially if homogeneity is important to the researcher. As it happens, homogeneity did not appear to be an issue for Bem and Honorton (1994). However, they did regard Study No. 302 as an anomaly in itself, because they never included it in their primary analysis (pp. 10-11) but instead dealt with it as a separate issue (pp. 11-12). (To clarify, it should be noted that the "anomalous" nature of that study stems from the fact that it was not completed because of laboratory closure, so an imbalance in the frequency of administered targets brought about an inflated hit rate. Thus, this extremely high hit rate had nothing to do with methodological flaw. For this reason, Bem and Honorton treated the study separately because, unlike the other studies in the series, it was incomplete, not because it was flawed, or too successful.)
PERFORMANCE COMPARISONS OF THE GANZFELD DATABASES
Bem and Honorton's (1994) Database and Milton and Wiseman's (1999) Database
Paradoxically, Milton and Wiseman (1999, p. 387) never criticized Bem and Honorton (1994) for taking the step of "conservatively excluding the 11th study" (No. 302), yet Milton (1999, p. 330) left it in when she compared all 11 studies with Milton and Wiseman's (p. 388, Table 1) database. By Milton's reckoning the two databases are significantly different, t(39) = 2.64, p = .0059, one-tailed, although this study's t value differs from hers, t(39) = 2.40, p = .042, one-tailed. Had she excluded Study No. 302 (as was done in this study, because I am only following Bem and Honorton's example!), and had she proposed a more conservative (two-tailed) t test, she would have found that the two databases were not significantly different on mean effect sizes after all, t(38) = 1.88, p = .067, two-tailed. Milton might then have argued that Milton and Wiseman's database did replicate Bem and Honorton's finding.
Bem and Honorton's (1994) Database and Milton's (1999) Updated Database
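The form of such a performance comparison can be sketched as follows. This is a minimal illustration assuming an independent-samples t test with pooled variance, which is consistent with the degrees of freedom reported above but is not stated explicitly in the sources. The function name `pooled_t` and the effect sizes in the example are hypothetical and are not the studies' actual values.

```python
import math
from statistics import mean, variance

def pooled_t(sample_a, sample_b):
    """Independent-samples t statistic with pooled variance; df = n_a + n_b - 2."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    # Weighted average of the two sample variances (n - 1 denominators).
    pooled_var = ((n_a - 1) * variance(sample_a)
                  + (n_b - 1) * variance(sample_b)) / df
    se = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean(sample_a) - mean(sample_b)) / se, df

# Hypothetical effect sizes for illustration only, NOT taken from any database.
database_a = [0.30, 0.10, 0.25, 0.05, 0.20]
database_b = [0.05, -0.10, 0.15, 0.00, -0.05, 0.10]
t_stat, df = pooled_t(database_a, database_b)
print(f"t({df}) = {t_stat:.2f}")
```

With 11 and 30 studies this test has 11 + 30 - 2 = 39 degrees of freedom, and with 10 and 30 studies it has 38, matching the t(39) and t(38) values reported above.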
The success or failure of the above performance comparisons using the Bem and Honorton (1994) database "hangs" on Study No. 302, but the story does not end here. When Bem and Honorton's 10-study database was compared with "all studies 1987 to present [viz., 1998]" (to use Milton's, 1999, p. 330, words) the result was not significant, t(47) = 1.44, p = .156, two-tailed. Furthermore, a comparison of Bem and Honorton's 10-study database with "all studies 1987 to present [viz., 1998] excluding Dalton (1997)" also yielded a nonsignificant result, t(46) = 1.60, p = .116, two-tailed. (If one generously admits one-tailed p values, the two t tests still produce nonsignificant results: ps = .078 and .058, respectively.) When Milton (1999, p. 330) made these same comparisons, using Bem and Honorton's 11-study database instead, the results were significant (one-tailed) in both cases. Two-tailed tests also were significant.
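The relation between the one-tailed and two-tailed p values quoted in this section can be checked with a short sketch. It integrates the Student t density numerically with Simpson's rule, which is a convenience chosen here to avoid external libraries and is not the method used in any of the cited papers. The function names `t_pdf` and `t_p_values` are illustrative.

```python
import math

def t_pdf(u, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2.0) - math.lgamma(df / 2.0)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2.0 * math.log1p(u * u / df))

def t_p_values(t, df, steps=2000):
    """Return (one_tailed, two_tailed) p values for an observed t.

    Integrates the density over [0, |t|] with Simpson's rule;
    `steps` must be even.
    """
    upper = abs(t)
    h = upper / steps
    area = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3.0                 # P(0 <= T <= |t|)
    one_tailed = 0.5 - area         # tail in the direction of the observed t
    return one_tailed, 2.0 * one_tailed

# t values and degrees of freedom as reported in the text.
for t, df in [(1.88, 38), (1.44, 47), (1.60, 46)]:
    one, two = t_p_values(t, df)
    print(f"t({df}) = {t:.2f}: one-tailed p = {one:.3f}, two-tailed p = {two:.3f}")
```

The one-tailed value is simply half the two-tailed value, so a comparison that falls short of significance two-tailed (as with t(46) = 1.60) may approach, without reaching, the .05 level one-tailed.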
In this study I followed Milton's (1999, p. 330) example of excluding Dalton (1997) in the second test. However, Milton originally targeted Dalton's study to make another point, that is, for no other reason than that it was "extremely successful" and proved to be single-handedly responsible for the significant mean effect size of all studies from 1987 to 1998. Study No. 302 was not single-handedly responsible for the significant mean effect size of Bem and Honorton's (1994) database (as Bem & Honorton [pp. 10-11] already showed), but might not its "extremely successful" outcome (even after Bem & Honorton's [p. 12] adjustment for "response biases") give good cause to question its validity as a study? Once establishing the fact that the study was "irregular" (irrespective of its outlier status), might it not be justifiably excluded from Bem and Honorton's database? If Bem and Honorton had reservations about it, why didn't Milton? Note, too, that Palmer and Broughton (2000) had the same reservations about this 11th study and referred to, and gave statistics for, only a 10-study database (see Palmer & Broughton, 2000, p. 231, Table 2). Milton seemed to target, or not target, studies on the basis of their success without really specifying whether their success warrants further investigation.
As I show here, the exclusion of Study No. 302 throws up completely different interpretations of post-Communique ganzfeld research. Bem and Honorton's (1994) successful result can be regarded as having been replicated by two (or three) databases:
* Milton and Wiseman's (1999) database (30 studies).
* Milton's (1999, p. 330) database of "all studies 1987 to present [viz., 1998] (with Dalton)" (39 studies).
* Milton's (1999, p. 330) database of "all studies 1987 to present [viz., 1998] (without Dalton)" (38 studies).
Note, too, that it was not necessary to trim Milton and Wiseman's database of the 3 outlier studies, as recommended by Radin (cited in Schmeidler & Edge, 1999, p. 340) to achieve these replications. I discuss this issue next.
Homogeneity: A Recurrent Issue
To return to the earlier chi-square finding for Honorton's (1985) database, one is still left with the burning question: "If constructing a 27-study trimmed-down version of Milton and Wiseman's (1999) database is an acceptable statistical practice (for some researchers, anyway), should researchers not revisit Honorton (1985) with the same intention of trimming the database of outliers?" Many researchers (e.g., Schmeidler & Edge, 1999, pp. 340-349) would answer in the negative to that question, and Storm and Ertel (in press), through their recent treatment of the ganzfeld databases, tacitly agreed with that response.
On the surface, it would seem that, without an answer to the above question, the meta-analyst is not able to undertake unequivocal performance comparisons between Honorton's (1985) database and later ganzfeld databases. Note, however, that at present there is no general rule that databases should be made homogeneous before further testing, and I do not argue here that they should be. I do not regard the homogeneity of Bem and Honorton's (1994) 10-study database as being in any way fortuitous or relevant to the aims of this study; it would not have made any difference if Bem and Honorton's database had still been heterogeneous after the removal of Study No. 302. The same performance comparisons would have been conducted just as conscientiously. The point here is that doubts were cast over a certain study, and when I followed the example of other researchers those same doubts led me to the inevitable conclusion that the study should be removed from its database. Good meta-analytic technique will require some kind of consensus on how to proceed with issues such as this. In this study I have shown that mean effect sizes for all post-Communique databases do not differ significantly from each other, and those results in themselves indicate that the ganzfeld has been a consistent performer for more than 20 years.
The major aims of this article were, first, to emphasize the two-edged problem of homogeneity and what to do with irregular or outlier studies, and second, to redress the performance comparisons made in Milton's (1999) discussion article. I found (or argued) that (a) homogeneity can be useful if it proves the significance of a database, (b) databases need not be homogeneous before performance comparisons are made, and (c) it should be a mandatory exercise that all "exceptional" studies (not just the occasional study) in all databases (past and present) be scrutinized and be excluded from their respective databases only with good reason. On the basis of that threefold position, and after the appropriate tests, I found that Bem and Honorton's (1994) result replicated on four occasions, which includes Radin's successful result (cited in Schmeidler & Edge, 1999, p. 340).
When researchers come to "assess the need for a pre-planned meta-analysis" (Schmeidler & Edge, 1999, p. 383), one cannot help but feel that special consideration must be given to the issues of homogeneity and outliers (with particular emphasis on problem studies, be they outliers or otherwise). Researchers are given a good deal of flexibility in deciding what, if anything, should be done about these problems, and they will decide for their own reasons whether the situation warrants further investigation. However, as was shown, the overall success or failure of more than 20 years of experimentation in the ganzfeld domain can hang on what may appear to some to be an arbitrary decision rule but to others is simply good meta-analytic practice. Although these kinds of problems keep the debate going, they will, no doubt, help parapsychologists and statisticians reach some kind of a consensus about what a good meta-analysis should ultimately look like.
(1.) Preparation of this article was supported by a grant from the Bial Foundation.
(2.) A database is regarded as homogeneous if a test for such yields a nonsignificant result. (If the test result is significant, the database is regarded as heterogeneous.) Thus, running the test is simple enough, but the difficulty for the researcher is in defining the alleged outlier(s) and trimming accordingly. Personal preference can enter at this point (e.g., see Lawrence's trim [cited in Schmeidler & Edge, 1999, p. 343]).
(3.) Honorton et al.'s original 11-study database included Study No. 302, which was later adjusted for "response biases" (Bem & Honorton, 1994, pp. 11-12).
(4.) One cannot overlook the fact that Milton had the opportunity and, therefore, the statistical advantage, in performing a one-tailed t test, because she already knew the results of Bem and Honorton's study and Milton and Wiseman's study. Conducting a two-tailed test would have been a more valid procedure, because that would imply that she was not taking advantage of a priori knowledge. Honorton et al. (1990, p. 127) set that precedent by performing two-tailed tests when they did their comparisons, even though they knew that the value of the mean effect size for Honorton's (1985) database was higher than that of their own database.
(5.) Note that Milton's (p. 329) Table A1 incorrectly reports the effect size of Wezelman and Bierman (1997) Amsterdam Series VI as -.2. It should be -.02.
BEM, D. J., & HONORTON, C. (1994). Does psi exist? Replicable evidence for an anomalous process of information transfer. Psychological Bulletin, 115, 4-18.
DALTON, K. (1997). Exploring the links: Creativity and psi in the ganzfeld. Proceedings of Presented Papers: The Parapsychological Association 40th Annual Convention, 119-134.
HONORTON, C. (1985). Meta-analysis of psi ganzfeld research: A response to Hyman. Journal of Parapsychology, 49, 51-91.
HONORTON, C., BERGER, R. E., VARVOGLIS, M. P., QUANT, M., DERR, P., SCHECHTER, E. I., & FERRARI, D. C. (1990). Psi communication in the ganzfeld: Experiments with an automated testing system and a comparison with a meta-analysis of earlier studies. Journal of Parapsychology, 54, 99-139.
MILTON, J. (1999). Should ganzfeld research continue to be crucial in the search for a replicable psi effect? Part I. Discussion paper and introduction to an electronic mail discussion. Journal of Parapsychology, 63, 309-333.
MILTON, J., & WISEMAN, R. (1999). Does psi exist? Lack of replication of an anomalous process of information transfer. Psychological Bulletin, 125, 387-391.
PALMER, J., & BROUGHTON, R. S. (2000). An updated meta-analysis of post-PRL ESP-ganzfeld experiments: The effect of standardness. Proceedings of Presented Papers: The Parapsychological Association 43rd Annual Convention, 224-240.
SCHMEIDLER, G., & EDGE, H. (1999). Should ganzfeld research continue to be crucial in the search for a replicable psi effect? Part II. Edited ganzfeld debate. Journal of Parapsychology, 63, 335-388.
STORM, L., & ERTEL, S. (in press). Does psi exist? A re-appraisal of Milton and Wiseman's (1999) meta-analysis of ganzfeld research. Psychological Bulletin.
|Publication:||The Journal of Parapsychology|
|Article Type:||Statistical Data Included|
|Date:||Dec 1, 2000|