Performance in group telepathy experiments as a function of target picture characteristics.
One exception is a series of group telepathy studies that have been performed at the Department of Psychology, Stockholm University, since 1993, initiated by myself. Based on the idea that strong emotional messages (e.g., signals of danger) may be, for evolutionary reasons, easier to transmit telepathically than are more neutral messages (Moss & Gengerelli, 1969), the studies have all been concerned with transmission of emotions, as evoked by slide pictures.
Each study comprised a series of repeated, identical experiments, with two groups of participants sitting in two adjacent rooms. In the first part of the experiment, participants in one of the two groups, the "sender" group, were presented with 15 emotionally positive and 15 emotionally negative pictures, one at a time. For each presented picture, participants in the other group, the "receiver" group, were to guess and note individually whether the picture was positive or negative. In a second part of the experiment, the two groups changed rooms and tasks: The former senders now served as receivers, and vice versa.
Hit rate, defined as the number or proportion of correct responses, was invariably used as the dependent variable in the data analyses. Hit rate was analysed as (a) a function of person or situational factors (e.g., belief in telepathy and the order in which the participant served as sender and receiver) and (b) a function of stimulus factors (e.g., rated characteristics of the target pictures). In recent studies, physical variables have also been examined, as detailed below.
As a first study in the present project (Dalkvist & Westerlund, 1998)--for an overview of the previous studies, see Table 1--five individual explorative studies were performed. They generated a large number of significant results (for example, an effect of belief in telepathy, with believers and nonbelievers performing worse than uncertain participants). Based on the significant results of the five explorative studies, which apparently were unlikely to have been obtained by chance (even though we failed to support this judgment with any statistical measure), a set of hypotheses was subsequently tested in a comprehensive replication study (Westerlund & Dalkvist, 2004). The outcome of this study was not encouraging: None of the eight predictions was supported. The two remaining studies (Dalkvist & Westerlund, 2006; Dalkvist, Montgomery, Montgomery, & Westerlund, 2009) were more restricted in scope, as can be seen from Table 1.
The present study is based on data from several separate studies, four of which have remained unpublished until now. One is a new replication study. One motivation for conducting this study was to extend the amount of data to allow a highly reliable stimulus level analysis to be performed. The remaining three studies provided new scale descriptions of the target pictures.
As indicated above, already at the beginning of the present project, the target pictures were rated on a set of psychological scales (see Table 1). Initially (Dalkvist & Westerlund, 1998), these scales showed some relationship to performance, but this relationship could not be replicated (Westerlund & Dalkvist, 2004).
These scales were judged to be inadequate, however. For one thing, their reliability could be questioned due to the small number of underlying ratings. Second, comparison of positive and negative pictures using the same scales for both types of pictures was not possible. To obtain more reliable and versatile psychological scales for use in future research, a new picture rating study was performed. This was done using participants in some of the experiments in the above-mentioned replication study (Westerlund & Dalkvist, 2004), without using the new scales in that study.
When at the start of our present project hit rate was chosen as the dependent variable, we assumed, quite naturally, that the receivers would follow the instructions and discriminate between positive and negative pictures--if they were able to show any discrimination at all. On closer reflection, however, this assumption is not at all obvious. After all, emotional reactions are complex phenomena, varying in at least two global dimensions: pleasure-displeasure and arousal (perceived and/or physiological), the latter dimension exhibiting a more or less pronounced U-shaped relation to the former (Lang, Greenwald, Bradley, & Hamm, 1993). Thus, the notion that some form of arousal rather than pleasure-displeasure (or some more specific pleasurable or displeasurable experience) might be transmitted from the senders to the receivers seemed to be conceivable and worth testing.
To allow for such comparisons, two studies were performed to extend the scale description of the targets to encompass not only psychological scales but also physiological ones. In one of the two physiological studies, electrodermal activity (EDA) was measured in response to the stimulus pictures, in the other, heart-rate (HR).
Increased EDA and decreased (!) HR enter as pivotal components in a pattern of various physiological reactions known as the orienting response, which is evoked by motivationally relevant stimuli that capture the individual's attention (Sokolov, 1960). The two reactions are not perfectly correlated, however. While both of them tend to be somewhat stronger in response to negative as compared to positive stimuli, the difference between negative and positive stimuli is somewhat more pronounced for the decrease in HR than for the increase in EDA (Lang et al., 1993). Thus, the two measures complement each other to some extent.
The aim of the present study was to investigate whether participants could discriminate among the present target pictures and, if they could, to establish the conditions underlying this ability in terms of stimulus and moderator variables, using more powerful data than those previously used for similar purposes (Dalkvist & Westerlund, 1998; Westerlund & Dalkvist, 2004). Two sets of telepathy data were analysed, one referred to as the "old" and the other as the "new" data set. The old data set was obtained from the last two of the five initial explorative studies (Dalkvist & Westerlund, 1998), the best-controlled ones, and from the previous replication study (Westerlund & Dalkvist, 2004). The new data set was obtained from the new replication study, reported below. Only the new picture scales were used, not the original psychological scales.
The Replication Study
Participants. Participants were 652 undergraduate students at the University of Stockholm, 459 females and 193 males, most of them at the Department of Psychology, who had chosen to participate in the study as part of course requirements.
As can be seen from Table 2, in terms of participant age and gender distributions, the present study was very similar to the corresponding previous studies, with a typical mean age of 26 or 27 years and a predominance of females, comprising about 70% of participants.
As can be seen in Table 3, the present study was also similar to the previous ones with respect to number of experiments/sessions (the two parts of an experiment) and the distribution of participants across experiments/sessions. It should be noted, however, that the average number of participants per session was somewhat smaller in the new study than in the old ones. This was due to the occurrence of a larger number of small sessions in the new study than in the old ones, resulting from the students' freedom to choose among different occasions for participating.
Stimulus material. The stimuli used were 30 slide pictures, 15 with positive motifs (such as nature pictures and pictures of happy people) and 15 with negative motifs (such as pictures of traffic accidents and starving children), collected specifically for the present project from various sources (IAPS, the frequently used emotion picture collection, did not exist when the project started). (For a detailed description of target pictures, see Dalkvist & Westerlund, 1998.)
Procedure. The new data collection was carried out exactly as in the first replication study (Westerlund & Dalkvist, 2004), but here one additional type of data was collected as well. In the new data collection, one of the two experimenters in the sender room registered any disturbance that occurred--coughs, rustling, scrapes, or any sound outside the sender room. Although the experimental rooms were soundproof, this sound registration was done to check whether any possible positive results could be accounted for by auditory (or vibrational) leakage, unconsciously perceived by the receivers and mediated by a correlation between disturbance and the two response alternatives (positive/negative picture).
In outline, the experiment was run as follows (for more details, see Westerlund & Dalkvist, 2004). When the participants arrived at the laboratory, they were randomly divided into two groups as equal in size as possible (the number of participants could be uneven): one sender group and one receiver group. The senders and the receivers were sequestered in two soundproof rooms, separated by one room. The two experimental rooms were connected to each other by a signal device: A lamp in the receiver room could be turned on and off from the sender room.
The slides were presented in random order for each group of senders, who sat in one or two rows in front of a screen. The senders' only task was to look at the pictures and to "hold on to" the feelings evoked by the respective pictures as long as they were shown. The receivers, who sat in a circle with their backs to each other, were instructed to guess whether a given picture was positive or negative and to mark the chosen alternative on a response sheet (the receivers were forced to choose one of the two options). One of the (two) experimenters in the receiver room (there were also two experimenters in the sender room) watched the signal lamp, which was turned off when a new picture was shown to the senders, and reported this to the receivers. Each picture was shown for 20 s, with an interstimulus interval of about half a second.
When all 30 pictures had been shown, the participants changed rooms, and, as mentioned before, those who had served as senders in the first part of the experiment now served as receivers, and vice versa.
Moderator variables. As in Dalkvist and Westerlund (1998) and Westerlund and Dalkvist (2004), eight person/situation-related variables were examined as potential moderator variables. Two of them concerned belief in telepathy, one as measured before the experiment on a 3-point scale, and the other as measured after the experiment on a 7-point scale. Furthermore, the two available demographic variables, age and gender, as well as two different measures of response style: number of negative guesses and repetition avoidance, defined as the number of times the subject shifted from one type of response ("positive picture" or "negative picture") to the other, were considered. The two remaining person/situation-related potential moderator variables were: number of receivers and the order of the two tasks, that as sender and that as receiver.
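The two response-style measures defined above can be illustrated with a minimal sketch. The function names and the "P"/"N" coding of guesses are assumptions made for illustration only; they are not taken from the original analyses.

```python
def count_negative_guesses(guesses):
    """Number of "negative picture" responses in one receiver's run."""
    return sum(1 for g in guesses if g == "N")


def count_response_shifts(guesses):
    """Repetition avoidance: number of times the response changes
    from one alternative to the other between consecutive trials."""
    return sum(1 for prev, cur in zip(guesses, guesses[1:]) if prev != cur)


# Hypothetical run of five guesses: P, P, N, N, P
run = list("PPNNP")
print(count_negative_guesses(run), count_response_shifts(run))
```

A receiver who alternates on every trial would obtain the maximum shift count (29 for 30 pictures), whereas one who never changes response would obtain 0.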
Moreover, as in Westerlund and Dalkvist (2004), two possible physical moderator variables were considered: (a) local sidereal time (LST)--an astronomical time and space measure, which is indirectly related to the magnitude of cosmic radiation that reaches the earth, and (b) fluctuations in the global geomagnetic field, as measured by the ap index. For a large number of different studies, performed in the northern hemisphere, Spottiswoode (1997) found both of these measures to be systematically related to the effect size of the studies. (In a more recent paper [Sturrock & Spottiswoode, 2007], Spottiswoode repudiates his previous conclusions in favour of a lunar phase interpretation, but this interpretation has thus far failed to make a breakthrough.) In the present project, the ap index--but not LST--has been shown to be related to psi performance in two previous studies: namely, in Dalkvist and Westerlund (2006) and in Dalkvist et al. (2009). Based on previous findings showing LST around 13:30 to be psi-conducive (Spottiswoode, 1997), the LST scale was divided into two periods: a "good" period (coded as "1"), ranging from LST = 10:00 through LST = 16:00, and a "bad" period (coded as "0"), comprising all other times. (Spottiswoode's psi-conducive window was, in fact, only 2 LST hours wide, but this small window would not have generated a sufficiently large amount of data.)
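The dichotomous LST coding can be sketched as follows. This is an illustration only: it assumes that LST is already available in decimal hours and that both endpoints of the "good" period are inclusive (as "through" suggests); the function name is hypothetical.

```python
def code_lst_period(lst_hours):
    """Return 1 for the "good" period (LST 10:00 through 16:00), else 0.

    lst_hours: local sidereal time in decimal hours, 0 <= lst_hours < 24.
    Assumption: both 10:00 and 16:00 belong to the "good" period.
    """
    if not 0.0 <= lst_hours < 24.0:
        raise ValueError("LST must lie in the interval [0, 24) hours")
    return 1 if 10.0 <= lst_hours <= 16.0 else 0


# 13:30 LST, the reported psi-conducive peak, falls in the "good" period.
print(code_lst_period(13.5))
```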
The Rating Study
Participants. Participants were 66 undergraduate students at Stockholm University, 42 females and 24 males, with a mean age of 27.30 years. They had, in fact, been recruited to participate in the abovementioned replication study (Westerlund & Dalkvist, 2004), where the present rating experiment served as a final, additional, session in some of the experiments.
Most of the participants were psychology students who chose to participate in the study as part of course requirements. All participants were informed beforehand that some of the pictures to be shown were very repulsive, and sensitive persons were recommended not to take part in the experiment.
Stimulus material. All the target pictures were the same as in the present replication study.
Procedure. The 30 target pictures were projected onto a white screen. The ratings were made in groups, with 10 or fewer participants in each group. The projector, being located behind the participants, was run automatically, by means of a timer. The target pictures were shown in a randomized order.
Each picture was rated on six different scales. Four of them measured purely emotional aspects of the pictures, namely how (a) pleasant/unpleasant, (b) involving, (c) compassion-arousing and (d) repulsive they were. The two remaining scales measured (e) how well known and (f) how perceptible (easy to understand) the motifs of the pictures were.
The participants were given a booklet of forms, each one containing the six scales, with one form for each target picture. For half of the subjects in a given session, the scales were written in the above order; for the other half, the order was reversed. Similarly, for half of the sessions, the target pictures were presented in one particular order, and for the other half in the reversed order. Each target picture was presented for 30 s. The judgments were made by drawing a vertical line on a 100-mm-long graphic scale with verbally anchored endpoints, representing extreme states of the experience to be judged (for example, "not at all disgusting" and "very highly disgusting").
The EDA Study
Participants. Participants were 77 undergraduate students at the University of Stockholm, 43 females and 31 males, most of them at the Department of Psychology, who had chosen to participate in the study as part of course requirements. Due to disturbances in the equipment or loose electrodes, only 60 participants, 32 females and 28 males, entered into the final sample. These participants varied in age from 19 to 41 years, with a mean age of 24.72 years. All participants were informed beforehand that some of the pictures to be shown were very repulsive, and sensitive persons were recommended not to take part in the experiment. None of the participants had participated in any other study in the project.
Stimulus material. The target pictures were the same as in the present replication study.
Apparatus. An EDA monitor, with a software program for collecting and analysing data, manufactured by Biopac Systems, Inc., was used. The monitor (model mp 100A) and the program (AcqKnowledge III for the 100WS version 3.2) were linked to and installed in a Macintosh 6320 computer.
The electrodes used were of the EL204S-Ag/AgCl type. Isotonic paste served as electrolyte (0.5% NaCl/100 ml H2O). The electrodes were fastened to the middle phalanges of the middle finger and the ring finger of the nondominant hand.
Procedure. The target pictures--the same as in the above study--were projected onto a white wall, in the same manner as in that study.
At most three participants took part in each measurement session. Each picture was presented for 20 s, with an interstimulus interval of about half a second. The total number of sessions was 35. Each session lasted for about 20 min, including instructions.
Three participant chairs, equipped with elbow rests, were located in a row in front of the projector.
The sequence of events during a session was as follows:
1. The participant(s) were asked to wash their hands with soap and water.
2. The participant(s) were given a complete description of the experiment, except for any expectation of the results or the connection to previous telepathy experiments. The participants were then informed that they were allowed to close their eyes if a picture was experienced as too unpleasant, and even to terminate the experiment.
3. The experimenter instructed the participant(s) on how to attach the electrodes by doing it on her own fingers and then letting the participant(s) do it themselves (without electrode paste).
4. The experimenter attached the electrode paste to the electrodes, and the participant(s) fastened the electrodes.
5. The participant(s) were instructed (a) to keep the hand with the electrodes as still as possible, but without concentrating too much on it, and (b) to put their arms as comfortably as possible on the arms of the chair. The participant(s) were also told not to speak to each other or to the experimenter during the experiment.
6. The light in the experimental room was turned off, and the projector was started. The first picture (a photo of a landscape) was not entered among the 30 "real" stimuli but was used only to accustom the participant(s) to the experimental situation, without the participant(s) knowing it.
7. After the last trial, the participant(s) were disconnected from the monitoring equipment and asked to fill out a short questionnaire, mainly concerned with demographic data.
8. The participant(s) were debriefed with respect to the purposes of the study. (In addition to testing the telepathy hypothesis, the data were also used in a study on gender differences in emotional reactions.)
The HR Study
Participants. Participants were 53 undergraduate students at the University of Stockholm, 37 females and 16 males, most of them at the Department of Psychology, who chose to participate in the study as part of course requirements. Due to missing data, only 50 participants, 36 females and 14 males, entered into the final sample. These participants varied in age from 19 to 39 years, with a mean age of 26.80 years. As before, all participants were informed beforehand that some of the pictures to be shown were very disgusting, and sensitive persons were recommended not to take part in the experiment. None of the participants had participated in any other study in the project.
Stimulus material. The target pictures were the same as in the present replication study.
Procedure. HR was measured by means of an electronic HR meter, manufactured by Polar Electro Oy. The HR meter (S610i™) consists of two parts: (a) a belt with heart-beat sensors, to be fastened around the chest, with the sensors adhering to the skin just below the breasts or the breast muscles, and (b) a "watch" for receiving and storing signals from the heart-beat sensors. Associated with the HR meter was a software program for data analysis.
There were at most four participants in each measurement session. Each picture was presented for 20 s, with an interstimulus interval of about half a second. Each session lasted for about 20 min, including instructions. There were 19 measurement sessions in all.
In order to avoid interference among the HR monitors, participants were spread out in the experimental room a couple of meters apart. In order to prevent females from becoming embarrassed when attaching the sensor belts, males were located in front of the females. On the experimenter's command, the participant(s) started their "watches" just before the start of the experiment, and stopped them when the experiment was finished. Also, again on the experimenter's command (saying "now"), the participant(s) pressed a button to register the point in time when a picture was exposed, and another button 7 s later to delineate the time interval for the measurement. (The minimum time interval required for HR calculations was 5 s, that is, the same period as that used in measuring EDA, but with so short a period one would have run the risk of obtaining periods that were too short to be analysed, due to delayed button presses when a picture was exposed.) The participant(s) could hear a "beep" when the button had been properly pressed.
The sequence of events was the same as for the EDA measurement study, except for the following steps:
1. The participant(s) were seated on chairs as described above.
2. Following the experimenter's instructions, the participant(s) fastened the sensor belts as described above.
3. The participant(s) were told not to speak to each other or to the experimenter during the measurements and to concentrate on the pictures and not on handling the watch.
4. The participants were instructed how and when to start and stop their watches and how and when to make the time registrations, as described above.
5. A short trial run, without any pictures, was carried out.
6. The light in the experimental room was turned off, the participants started their "watches," the experimenter started the projector, and so on.
7. After the last trial, the subjects took off their sensor belts and "watches" and filled out a short questionnaire, mainly concerned with demographic data.
Finally, the data were transferred to a computer for analysis.
Data from all the individual studies involved in the present study--the four new studies reported above, the first replication study and the two best-controlled initial studies--were brought together with a view to answer this basic question: Could participants discriminate among the target pictures and, if they could, which were the critical mediating variables? Thus, the four studies are treated collectively and not one-by-one.
In the previous quest for relationships between picture characteristics and performance (Dalkvist & Westerlund, 1998; Westerlund & Dalkvist, 2004), a simple hit-rate measure was used (proportion of correct guesses). However, using this measure prevented us from comparing positive and negative pictures, because the measured performance would be affected by the participants' propensity to guess that any picture was, say, positive. To get around this problem, measures of relative hit rate with respect to the tendency to guess that any picture was positive or negative were used in the present study. Specifically, after having coded a participant's hit in response to a given picture as "1" and a miss as "0," each hit for a positive picture was divided by the participant's total number of positive guesses and each hit for a negative picture by the participant's total number of negative guesses, to transform any absolute hit into a relative hit. The individual relative hit rates for positive and negative pictures, respectively, were then calculated as the sum of all relative hits for positive pictures and the sum of all relative hits for negative pictures. The individual total hit rates, that is, the relative hit rates for both positive and negative pictures, were calculated as the mean of the relative hit rates for positive and negative pictures. To be used at the session level, each of the three sets of individual relative hit rates was averaged within sessions.
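The relative hit rate calculation for a single participant can be sketched as follows. The function name and the "P"/"N" coding are illustrative assumptions, as is the decision to return None when a participant never used one of the two response alternatives (the text does not state how such cases were handled).

```python
def relative_hit_rates(targets, guesses):
    """Return (positive, negative, total) relative hit rates for one receiver.

    targets, guesses: equal-length sequences of "P" (positive) or "N" (negative).
    Each hit on a picture of a given type is divided by the participant's
    total number of guesses of that type; the relative hits are then summed.
    """
    if len(targets) != len(guesses):
        raise ValueError("targets and guesses must have the same length")
    rates = {}
    for label in ("P", "N"):
        n_guesses = sum(1 for g in guesses if g == label)
        hits = sum(1 for t, g in zip(targets, guesses) if t == label and g == label)
        # Assumption: the rate is undefined if this response was never given.
        rates[label] = hits / n_guesses if n_guesses else None
    if rates["P"] is None or rates["N"] is None:
        total = None
    else:
        total = (rates["P"] + rates["N"]) / 2
    return rates["P"], rates["N"], total


# Hypothetical run of four trials: one positive guess (a hit) and
# three negative guesses (two hits).
print(relative_hit_rates(list("PPNN"), list("PNNN")))
```

Because each rate is conditioned on the participant's own number of guesses of that type, a bias toward guessing, say, "positive" no longer inflates the measured performance for positive pictures.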
In running group experiments, one is most often confronted with a troublesome statistical problem, called the "stacking" problem: Due to the possible occurrence of dependency among participants' responses in group testing (caused, for example, by the occurrence of a common response bias, such as a tendency to give one type of response at the beginning of a run and another type at the end of it), the statistical assumption of independent measures runs the risk of being violated, leading to deflated (or inflated, in cases of negative correlations) p values (Thouless & Briar, 1970). In the present study, the risk of any stacking effect was eliminated by analysing data at the session level, in the form of group means. The reason why this procedure effectively eliminated any stacking effect is that participants' responses to a given stimulus picture in a given session were independent of participants' responses to the same stimulus picture in any other session because the stimulus orders were uniquely randomized.
In the statistical analyses, the new data were treated as an attempted replication of the old data, even though the old and the new data in reality were analysed simultaneously. Specifically, for each statistical test that was performed, the two data sets were submitted to a mini-Stouffer-Z analysis (N = 2). In analogy with a regular meta-analysis, this was done not only with a view to testing the significance of the total results per se but also to test the replicability of the two separate data sets, by combining the Stouffer Z analysis with assessment of the homogeneity of the data--in the present case, assessment of the difference between the old and the new data. It should be borne in mind, however, that such a replicability test is less stringent than performing a one-tailed significance test, and can only address the minimal criterion for claiming replicability.
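For illustration, an unweighted Stouffer Z for two data sets can be computed as in the sketch below. It assumes that each data set's test result has already been converted to a standard normal deviate and that equal weights are used; whether any weighting was applied in the present analyses is not stated, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def stouffer_z(z_values):
    """Combine standard normal deviates by the unweighted Stouffer method.

    Returns the combined Z and its one-tailed (upper-tail) p value.
    """
    if not z_values:
        raise ValueError("at least one z value is required")
    z = sum(z_values) / sqrt(len(z_values))
    p_one_tailed = 1.0 - NormalDist().cdf(z)
    return z, p_one_tailed


# Hypothetical z values for the old and the new data set (N = 2).
z, p = stouffer_z([1.0, 1.0])
print(round(z, 3), round(p, 3))
```

With N = 2, the combined value is simply the sum of the two z values divided by the square root of 2, so two modest effects in the same direction can jointly reach significance even when neither does so alone.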
Raw data were aggregated as follows in the three scaling studies.
In the rating study, for each scale and target picture, the arithmetical mean was calculated over all judges.
In the EDA study, for each target picture, the mean of the response amplitude was calculated over a 5-s interval, starting from the exposure of the picture. These means were then transformed into z values for each participant separately, to eliminate effects of individual response level differences. Finally, for each of the 30 target pictures, the mean z value was calculated across the 60 (nonexcluded) participants.
In the HR study, by means of the program belonging to the measurement equipment, for each target picture, the mean response amplitude was calculated over a 7-s interval, starting from the exposure of the target picture. These means were then transformed into z values for each participant separately, to eliminate effects of individual differences in response level. Finally, for each of the 30 stimulus pictures, the mean z value was calculated across the 50 (nonexcluded) subjects.
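The aggregation used for the two physiological scales (within-participant z transformation followed by averaging across participants for each picture) can be sketched as follows. The function name is hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def picture_scale_values(responses):
    """Return one mean z value per picture.

    responses[i][j]: mean response amplitude of participant i to picture j.
    Each participant's values are z-transformed separately to remove
    individual differences in response level, then averaged per picture.
    """
    z_rows = []
    for row in responses:
        m = mean(row)
        s = stdev(row)  # assumption: sample SD (n - 1 in the denominator)
        if s == 0:
            raise ValueError("a participant with constant responses cannot be z-transformed")
        z_rows.append([(x - m) / s for x in row])
    n_pictures = len(responses[0])
    return [mean(z_row[j] for z_row in z_rows) for j in range(n_pictures)]


# Two hypothetical participants with very different response levels
# but the same relative pattern across three pictures.
print(picture_scale_values([[1, 2, 3], [10, 20, 30]]))
```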
For each of the three scaling studies, the interrespondent reliability was assessed using Cronbach's alpha. In the rating study, reliability was assessed for each scale individually, yielding a mean alpha of .99 (SD = .02). For the EDA scale, Cronbach's alpha was .51, and for the HR scale .24.
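As an illustration of interrespondent reliability, Cronbach's alpha can be computed by treating respondents as "items" and pictures as "cases". The sketch below uses the standard formula; the function name and data layout are assumptions made for illustration.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha with respondents as items and pictures as cases.

    ratings[i][j]: value given by respondent i for picture j.
    alpha = k / (k - 1) * (1 - sum of respondent variances / variance of picture totals)
    """
    k = len(ratings)
    if k < 2:
        raise ValueError("at least two respondents are required")
    respondent_vars = [variance(row) for row in ratings]
    picture_totals = [sum(column) for column in zip(*ratings)]
    total_var = variance(picture_totals)
    return k / (k - 1) * (1 - sum(respondent_vars) / total_var)


# Two hypothetical respondents who agree perfectly across three pictures.
print(cronbach_alpha([[1, 2, 3], [1, 2, 3]]))
```

Because alpha increases with the number of respondents as well as with their agreement, the values for the three scaling studies also reflect the differing sample sizes (66, 60, and 50 respondents).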
As expected from previous work (Lang et al., 1993), the EDA and HR scales were found to be negatively correlated. However, the correlation was small and far from significant, r(28) = -.17, p = .36, two-tailed.
Table 4 shows the Pearson correlations among the six rating scales. As can be seen from this table, most of the correlations are strong or very strong. As can also be seen from this table, the four emotional scales (1-4) are positively correlated with one another, but negatively correlated with the two, positively correlated, nonemotional scales.
Table 5 shows the Pearson correlations between the two physiological scales and the six rating scales. As can be seen from this table, all of the four emotional rating scales are positively correlated with the EDA scale and negatively correlated with the HR scale, while the two nonemotional rating scales exhibit the reverse correlation pattern. It may also be noted that the correlations obtained for the HR scale are stronger than those obtained for the EDA scale, despite the fact that Cronbach's alpha is higher for the latter scale than it is for the former.
Given the strong correlations among most of the eight picture scales (the six subjective and the two physiological scales), all of them were combined into a composite scale. This was done by summing all of the original scales after having changed the signs of three of them--the HR, the familiarity, and the perceptibility scale--to make all scales positively correlated with one another. The resulting composite scale was interpreted as a bipolar scale measuring degree of unpleasant arousal, and was accordingly named a "negative arousal" (NA) scale. This summarizing scale was used to give an overall description of the target pictures.
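The construction of the composite scale can be sketched as follows. The scale names are hypothetical labels, and the sketch sums the values exactly as given; whether the eight scales were standardized before summation is not stated in the text, so no standardization is applied here.

```python
# Scales whose signs are reversed before summation (hypothetical labels).
REVERSED_SCALES = {"hr", "familiarity", "perceptibility"}


def negative_arousal(scales):
    """Return one composite (NA) value per picture.

    scales: dict mapping scale name to a list of per-picture values.
    Assumption: values are summed as given, without prior standardization.
    """
    n_pictures = len(next(iter(scales.values())))
    composite = []
    for j in range(n_pictures):
        total = 0.0
        for name, values in scales.items():
            total += -values[j] if name in REVERSED_SCALES else values[j]
        composite.append(total)
    return composite


# Hypothetical values for two pictures on three of the eight scales.
example = {"eda": [1.0, 2.0], "hr": [3.0, -1.0], "repulsive": [0.0, 5.0]}
print(negative_arousal(example))
```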
As indicated by independent t tests, the old data studies had significantly larger values than the new study for preexperiment belief in telepathy, t(232) = 3.47, p = .001, two-tailed, as well as for post-experiment belief in telepathy, t(232) = 5.49, p < .001, two-tailed. They had also significantly larger values for number of receivers, t(232) = 3.33, p = .001, two-tailed, and on the ap index, t(190) = 3.02, p = .003.
All Stimulus Pictures
The Pearson correlation between the old and the new data set in mean relative hit rate across all 30 stimulus pictures was r(28) = .31, p < .05, one-tailed, and the Spearman rank-correlation slightly higher: r(28) = .35, p = .03, one-tailed. Thus, as measured in terms of ordinary correlations, the two studies exhibited significant interstudy reliability with respect to relative hit rate.
The mean observed relative hit rate values for the old and the new data set, respectively, are given in Table 6, together with the results of one-sample t tests of deviation of mean observed values from mean chance expectation (MCE = .50) and a corresponding Stouffer Z analysis.
As can be seen from this table, whereas the old data set did not show any deviation from MCE at all, the new one showed an almost significant tendency for relative hit rate to lie below MCE. Accordingly, the Stouffer Z test did not reach significance, as can be seen. The apparent difference in relative hit rate between the new and the old data was not significant, however. (The large difference between the two p values compared to the small difference between the two observed relative hit rate values is basically due to extremely small standard deviations, SD = .03 for the old data and .05 for the new.)
To test whether there was any significant overall effect of the stimulus pictures, one-way repeated measures ANOVA was applied to the old and the new data set, with the 30 stimulus pictures as the independent variable and mean relative hit rate as the dependent variable. As can be seen from Table 7, while the old data set did not show any effect at all, the new one approached--but did not reach--significance. Stouffer Z did not show any significant overall effect (Z = 1.28, p = .10).
Table 8 shows the Pearson correlations between mean relative hit rate and the eight original picture scales, as well as the corresponding composite (NA) scale. As can be seen from this table, a highly significant positive correlation was obtained for the EDA scale in the new data set, as well as a highly significant corresponding Z value. As can further be seen from Table 8, however, none of the remaining scales exhibited any significant effect. None of the differences between corresponding correlations for the two data sets was significant.
Thus, considering all 30 stimulus pictures together, with the notable exception of the EDA scale, none of the eight original picture scales or the corresponding NA scale exhibited any linear relationship with mean relative hit rate.
Except for the EDA scale, however, rather than being linearly related to the picture scales, mean relative hit rate tended to exhibit U-shaped, or reversed U-shaped, relationships to these scales. For the old and the new data set, the U- or reversed U-shaped trends could be summarized by a quadratic U-shaped curve relating mean relative hit rate to the NA scale, corresponding to a Stouffer Z value of 2.93, p = .002. The curve was significant for the new data set, R = .51, F(2, 29) = 4.76, p = .02, two-tailed, but not for the old one, R = .23, F(2, 29) = 0.78, p = .48, two-tailed. As will be seen below, however, these U-shaped relations are almost entirely attributable to a positive correlation between relative hit rate and the NA scale for the negative pictures, the positive pictures showing no clear correlation with the NA scale.
Positive Versus Negative Pictures
For the 15 positive pictures, the Pearson correlation between the old and the new data set in mean relative hit rate was r(13) = .05, p = .85, one-tailed, and the Spearman correlation was [r.sub.s](13) = .16, p = .58, one-tailed. For the 15 negative pictures, the Pearson correlation between the old and the new data set was r(13) = .52, p = .05, one-tailed, and the Spearman correlation was [r.sub.s](13) = .60, p = .02, one-tailed. Thus, whereas the positive pictures did not show any significant interstudy reliability at all, the negative pictures did so, the strength of the correlation between the two sets of measures exceeding .50, thus explaining more than 25% of the variance.
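The interstudy reliability check described above amounts to correlating the 15 per-picture mean relative hit rates of the old data set with those of the new one. A minimal sketch of the two coefficients is given here; the function names are hypothetical, and the Spearman coefficient is computed as the Pearson correlation of average ranks, which is the usual definition when ties can occur.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_r(x, y):
    """Spearman rank correlation as the Pearson correlation of ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))
```

The proportion of shared variance is the squared coefficient, so a correlation of .52 corresponds to about 27% of the variance.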
As can be seen from Table 9, for the positive pictures, a one-way repeated measures ANOVA did not show any effect at all. As can further be seen from this table, however, for the negative pictures, a nearly significant effect was obtained for the new data set, while no effect at all was obtained for the old one. As tested by a two-way repeated measures ANOVA, this apparent difference between the two studies was not statistically significant. As can further be seen from Table 9, however, despite the lack of significance in any of the separate analyses, Stouffer Z did reach significance.
In Figure 1, applying a linear model, positive and negative pictures have been separately plotted against the NA scale for the old, the new, and the total data set. As can be seen from this figure, whereas the negative pictures exhibited a positive linear trend for each of the three data sets--although a stronger one for the new data set than for the old one--the positive pictures did not show any noteworthy linear (or other) trend (as will be seen below, the weak negative trends for the positive pictures are far from significant).
The Pearson correlations between the picture scales, including the NA scale, and mean relative hit rate for positive and negative pictures analysed separately are given in Table 10. As can be seen from this table, except for a moderately significant positive correlation for the HR scale in the old data set, no significant correlation was found for the positive pictures; accordingly, except for HR, no significant Z value was obtained for positive pictures. By contrast, as can also be seen from Table 10, several significant correlations, some of which are highly significant, were obtained for the negative pictures, but only for the new data set. No significant difference could be found between corresponding correlations in the old and the new study, however. The significant correlations in the new data set, as well as the corresponding correlations in the old one, are all positive. In agreement with the positive correlation obtained for the NA scale, the original scales exhibiting positive correlations--the EDA, pleasure-displeasure and repulsion scales--can be said to characterize the pictures in terms of unpleasant arousal. The correlations obtained for the remaining original scales are all negative, both in the old and the new study; with the exception of the HR scale, the corresponding scales describe the pictures in terms of "compassion," "familiarity" or "perceptibility" rather than "arousal." As can further be seen from Table 10, the scales exhibiting significant correlations were all associated with significant Z values.
In summary, the separate analyses of positive and negative pictures did not reveal any clear effect for the positive pictures, but did suggest that at least some of the negative pictures had been discriminated, at least in part based on the arousing aversive properties of the pictures. The significant Z values, along with the lack of any significant difference between corresponding correlations in the two studies and the overall agreement between their correlation patterns, show the two sets of correlations to be interrelated, even though no significant correlation was obtained for the old data set.
In view of these findings, it seemed appropriate to take an even closer look at the negative pictures.
Constructing a negative arousal discrimination scale. With the aim of exploring how and to what extent participants tended to discriminate pictures associated with high negative arousal from pictures associated with low negative arousal, a negative arousal discrimination (NAD) scale was constructed. To this end, all negative pictures were divided into two categories based on a median split with respect to their NA scores. The sum of the relative hit rate values for the seven pictures with the highest NA scores and the sum of the relative hit rate values for the seven pictures with the lowest NA scores were then computed, for each participant separately. Next, the difference between the high arousal picture sum and the low arousal picture sum was computed. Finally, for the purpose of comparing session groups instead of single participants, the individual NAD values were averaged within session groups.
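The construction steps just described can be sketched in code. This is an illustration under stated assumptions, not the original analysis script: the function names and the data layout (one dictionary of per-picture relative hit rate values for each participant) are hypothetical, and it is assumed that, with 15 negative pictures and seven pictures in each category, the picture at the median NA score is left out.

```python
from collections import defaultdict
from statistics import mean


def split_by_na(na_scores, k=7):
    """Return (low_ids, high_ids): the k pictures with the lowest and the
    k pictures with the highest NA scores (pictures in between are dropped)."""
    if 2 * k > len(na_scores):
        raise ValueError("not enough pictures for two groups of size k")
    ordered = sorted(na_scores, key=na_scores.get)
    return ordered[:k], ordered[-k:]


def nad_score(scores, low_ids, high_ids):
    """High-arousal sum minus low-arousal sum for one participant."""
    high_sum = sum(scores[p] for p in high_ids)
    low_sum = sum(scores[p] for p in low_ids)
    return high_sum - low_sum


def session_nad(participants, na_scores, k=7):
    """participants: iterable of (session_id, {picture_id: relative hit rate}).
    Returns {session_id: mean NAD score of that session's participants}."""
    low_ids, high_ids = split_by_na(na_scores, k)
    by_session = defaultdict(list)
    for session_id, scores in participants:
        by_session[session_id].append(nad_score(scores, low_ids, high_ids))
    return {s: mean(v) for s, v in by_session.items()}
```

A positive session value means that the high NA pictures were identified as negative more often than the low NA pictures; a value of zero is what chance predicts.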
Mean analyses. For each of the two studies, the mean of the NAD scores was compared to MCE (= 0) using a one-sample t test. As shown in Table 11, there was a positive deviation from MCE in both data sets, the deviation being significant for the new study, but not for the old one. The difference between the two studies approached but did not reach significance, t(232) = -1.72, p = .09, two-tailed. Thus, the significant Z value, given in Table 11, does not reflect the results in both studies, only the results in the new one.
In order to analyse the NAD scale more closely, the scale was split into its two components: the relative hit rate sum for the seven pictures with high NA values and the relative hit rate sum for the seven pictures with low NA values. This was done by averaging the respective individual relative hit rate sums within sessions.
Was one of the two sets of pictures--those with high and those with low NA values--more responsible for the positive deviations of the mean NAD scores from MCE shown in Table 11 than the other? To answer that question, the relative hit rate sums for the high and the low negative arousal pictures, respectively, were compared to MCE. As can be easily shown, MCE = 7 (number of low or high arousal pictures) / 30 (total number of pictures) = .23.
As can be seen from Table 12, there was a small and nonsignificant tendency for the high NA level pictures to have a mean relative hit rate sum above MCE, and a much stronger tendency for the low NA level pictures to have a mean relative hit rate sum below MCE, this deviation being significant in the new data set, but not in the old one. Thus, the significant positive NAD values shown in Table 11 resulted mostly from low NA level pictures having low relative hit rate sums rather than from high NA level pictures having high relative hit rate sums. In essence, this means that participants erroneously tended to guess that nonarousing negative pictures were positive instead of negative. In other words, whereas the high NA level pictures showed a (weak) tendency to evoke positive hit rates, the low NA level pictures were associated with so-called psi-missing, responses opposite in direction to the "correct" ones. This effect was clear-cut only in the new study, however. Thus, rather than demonstrating a common effect in the two studies, the significant Z value obtained for low NA pictures, shown in Table 12, only reflects the effect in the new study.
NAD distributions. Figure 2 shows the frequency distribution of the NAD values for each of the three data sets. Under the null hypothesis, that is, the assumption that no relationship existed between negative arousal and relative hit rate, the NAD scores are expected to be symmetrically distributed, with MCE being equal to zero (the converse does not hold: that a distribution is symmetrical with a mean of zero does not exclude the occurrence of interaction effects, which then have to balance each other for the distribution to be symmetrical). As can be seen from Figure 2, however, at least as far as the shape of the distribution is concerned, this prediction is ostensibly violated: None of the three distributions seems to be symmetrical--not even approximately so; each distribution seems instead to be positively skewed, with a clearly visible prolonged "tail" pointing in the positive direction of the scale, the bulk of the distribution being located below zero.
This picture was confirmed and nuanced by numerical statistical data, which are shown beside the respective graphs in Figure 2. There are many methods for calculating a skewness measure. The most important ones are based on the second and the third moment about the mean, the latter moment indicating the basic degree of skewness. The present skewness measure is a sample version of such a measure ([G.sub.1]), which is normally--or at least approximately normally--distributed (see Joanes & Gill, 1998). If N exceeds 150, as is the case in the total data set, the present measure of skewness can--like any measure of skewness based on the second and the third moment--for all practical purposes be considered normal. For the present significance testing, an exact measure of the standard error of skewness, suggested by Cramer (1997), was adopted.
As can be seen from Figure 2, for all three data sets, the skewness value fell between 0.50 and 1.00--the lower and the upper limit, respectively, for a distribution usually being characterized as "moderately" skewed (a distribution with a skewness value above 1 is usually characterized as "strongly" skewed and a distribution with a skewness value below 0.50 as "slightly skewed" or "approximately symmetrical"). For the old and the new data set, the z value and the corresponding p value indicate that the skewness of each of the two distributions is clearly--though not remarkably--significant. For the total data set, however, the z value and the corresponding p value show the skewness of the distribution to be extraordinarily significant, with a z value approaching 5 and a corresponding p value of one in a million. Thus, the distribution for the total data set can apparently be characterized as skewed beyond any reasonable doubt. This conclusion was confirmed by a Stouffer Z analysis, yielding a Z value of 4.53 and a corresponding p value of .000003, only marginally larger than that obtained in the skewness test.
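The skewness test described above can be sketched as follows. The sample skewness is computed as [G.sub.1] in the sense of Joanes and Gill (1998). The standard error is assumed to be the usual exact formula for [G.sub.1] under normality, sqrt(6n(n - 1) / ((n - 2)(n + 1)(n + 3))); whether this is exactly the expression taken from Cramer (1997) is an assumption, as are the function names and the one-tailed p value.

```python
from math import erfc, sqrt


def sample_skewness_g1(values):
    """Sample skewness G1: moment ratio g1 with the small-sample correction."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three observations")
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - m) ** 3 for v in values) / n  # third central moment
    g1 = m3 / m2 ** 1.5
    return g1 * sqrt(n * (n - 1)) / (n - 2)


def skewness_standard_error(n):
    """Exact standard error of G1 for a normal population (assumed formula)."""
    return sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def skewness_test(values):
    """Return (G1, z, one-tailed p) for positive skewness."""
    g = sample_skewness_g1(values)
    z = g / skewness_standard_error(len(values))
    return g, z, 0.5 * erfc(z / sqrt(2))
```

For N = 234 this standard error is about 0.159, so a skewness value of 0.76 corresponds to a z value of roughly 4.8, in line with the z value approaching 5 reported for the total data set.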
Moderator variables. The fact that the distributions were found to be skewed, and not symmetric, indicates that session groups tended to respond differently to each of the two types of negative pictures--those with high and those with low NA values. Hypothetically, this could be explained as resulting from interaction between negative arousal and some moderator variable(s). To discover whether one or more of the 10 potential moderator variables considered in the present study were involved, all of these variables were correlated with the NAD scale for the old and the new data set, respectively. As can be seen from Table 13, considering both the consistency across the two data sets, both of them yielding significant correlations, and the strength of the correlations, there was one--and only one--strong candidate: number of receivers. Thus, as the number of participants in the receiver group increased, the NAD value decreased significantly in both data sets, yielding a highly significant Z value.
In the following analyses involving moderator variables, attention will be focused on number of receivers. Even though they did not reach significance in both data sets, the two variables number of negative guesses and ap index will also be considered, as these two variables exhibited significant Z values, thus fulfilling a minimal requirement for claiming replicability.
Of these two variables, the more interesting one is perhaps ap index, as activity in the geomagnetic field has been linked to psi performance in several previous studies (e.g., Arango & Persinger, 1988; Berger & Persinger, 1991; Haraldsson & Gissurarson, 1987), most often with a low level of geomagnetic activity apparently being psi-conducive. In the aforementioned study by Spottiswoode (1997), a negative correlation was found between ap index and psi performance--but only in an apparently psi-conducive LST window of 2 hours around 13:30 LST. The possibility that such an interaction between ap index and LST also occurred in the present study was tested in relation to the NAD scale. The result was negative, however.
As can be seen from Table 14, the negative correlation between number of receivers and the NAD scale was mainly due to a positive correlation between the relative hit rate sum and number of receivers for the low NA level pictures, and only to a lesser extent to a negative correlation between the relative hit rate sum and number of receivers for the high NA level pictures. As can also be seen from Table 14, there is good general agreement between the correlation patterns in the old and the new study, as reflected by a highly significant Z value for the low NA level pictures and a less significant Z value for the high NA level pictures.
The exact relationships among the three current variables (NA level, relative hit rate sum, and number of receivers) are shown in Figure 3. As can be seen, for each of the three data sets, the low NA level pictures show a steep--essentially linear--increase in relative hit rate sum in small session groups as the number of receivers increases. In contrast, starting from a higher relative hit rate level, the high NA level pictures show a (less steep) decrease in relative hit rate sum with increasing number of receivers in small session groups. At a group size of about six participants, the two curves converge, and with increasing group size both curves level off.
Together the two curves clearly demonstrate the nature of the current effect: a failure on the part of the receivers in small, as opposed to large, groups to identify negative pictures having low negative arousal scores as negative, that is, psi-missing.
As mentioned earlier, the skewness of the distributions of NAD scores shown in Figure 2 could be expected to result from interaction between relative hit rate and one or more moderator variables. If this is true, given the negative correlation between number of receivers and the NAD scale, eliminating the effect of number of receivers should lead to a reduction--or even elimination--of the skewness of the NAD distributions. A reduction of the effect of number of receivers on the NAD scale was, in fact, accomplished by extracting the residuals of the NAD scores with respect to number of receivers using linear regression. In line with the expectation, the distributions of the residuals obtained were found to be considerably less skewed than the original NAD distributions, as can be seen from Table 15. As can also be seen from Table 15, however, the distributions were still positively skewed, and significantly so for the new and the total data sets. As can further be seen, however, by also entering the two other moderator variables that are significantly correlated with the NAD scale in the total data set--number of negative guesses and ap index--into the regression analysis, the skewness of the distributions was reduced additionally, resulting in nonsignificant skewness values for both the old and the new data set. Thus, the present analysis provided strong support for the interaction interpretation of the skewed NAD scale distributions, even though the skewness could not be fully eliminated by removing the effects of the three most influential moderator variables.
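The first step of this residual analysis, removing the linear effect of number of receivers from the NAD scores, can be sketched as an ordinary least-squares fit with a single predictor. The function name is hypothetical, and the extension to three predictors (adding number of negative guesses and ap index) would require a multiple regression that is not shown here.

```python
def linear_residuals(x, y):
    """Residuals of y after a least-squares straight-line fit on x.

    x: predictor values (e.g., number of receivers per session)
    y: outcome values (e.g., session NAD scores)
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("predictor has no variance")
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    # What is left of y once the fitted line has been subtracted
    return [b - (intercept + slope * a) for a, b in zip(x, y)]
```

By construction the residuals have a mean of zero and are uncorrelated with the predictor, so any skewness remaining in them cannot be attributed to a linear effect of that predictor.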
Four different control tests were performed to check whether the major positive results obtained could be explained away as artifacts. The major positive results were considered to be: (a) the positive correlation between mean relative hit rate and the NA scale for negative pictures, (b) the positive skewness of the NAD scale distribution for negative pictures, and (c) the correlation between the NAD scale and number of receivers for negative pictures. Four possible "natural" explanations of these findings were tested:
1. The positive results are methodological artifacts resulting from averaging individual data into session data. Although it is hard to imagine how such artifacts could have arisen, I recalculated the major positive results using individual data instead of group data, and compared these new results with the original ones. The individual level correlations between picture scales and mean relative hit rate for negative pictures showed the same pattern as the group level correlations, shown in Table 10. The strength of the correlations between mean relative hit rate and the NA scale was somewhat reduced, however, resulting in a reduction of Stouffer Z from 2.93, p = .002, to 2.14, p = .02.
A more pronounced weakening of the results was obtained for the skewness measure of the NAD scale distributions, shown in Figure 2. Most notably, for the whole data set, the skewness value was reduced from the extremely significant (p = .000001) value of 0.76 to the much smaller and much less significant (p = .004) value of 0.18.
A reduction in the strength of the results was also obtained for the correlations between number of receivers and the NAD scale, shown in Table 13, resulting in a corresponding diminishing of Stouffer Z from 3.35, p = .0004, to 3.20, p = .0007.
How should this weakening of the positive results be explained? Before attempting to answer this question, it should be emphasized that the positive results were not eliminated by shifting from the session level to the individual level, but were merely weakened. This fact excludes the possibility that the session level results were mere artifacts caused by averaging individual data. Indeed, the occurrence of weaker results at the individual level than at the session level is what is to be expected given the negative relationship between number of receivers and the NAD scale. Thus, in accordance with this relationship, when the analyses are based on individual data, participants in larger groups, showing only small or no arousal effects at all, get a heavier weight than do participants in smaller groups, who do show such effects, simply because participants in larger groups are more numerous than those in smaller groups. Conversely, at the session level, the impact of the large number of participants belonging to large groups becomes relatively small, because all groups are treated equally regardless of their number of participants.
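The weighting argument can be made concrete with a small numerical sketch. The numbers below are hypothetical and chosen only to illustrate the mechanism; the sketch compares simple means, whereas the analyses in the study concern correlations and skewness values.

```python
def session_level_mean(sessions):
    """Every session counts once, whatever its size.

    sessions: list of (number_of_participants, mean_effect_in_session)
    """
    return sum(effect for _, effect in sessions) / len(sessions)


def individual_level_mean(sessions):
    """Every participant counts once, so large sessions weigh more."""
    total = sum(n for n, _ in sessions)
    return sum(n * effect for n, effect in sessions) / total


# Hypothetical example: two small groups showing an effect of 0.4 and
# two large groups showing no effect at all.
example = [(2, 0.4), (2, 0.4), (10, 0.0), (10, 0.0)]
session_mean = session_level_mean(example)        # 0.2
individual_mean = individual_level_mean(example)  # about 0.067
```

In this example the effect shrinks to a third of its session-level size when individuals rather than sessions are the unit of analysis, although nothing about the data has changed.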
2. The positive results were artifacts of transforming absolute into relative hit rate values. To test this interpretation, the same positive results as considered above were recalculated using absolute instead of relative hit rates as input data. Only a negligible change followed. Instead of being reduced, the Stouffer Z reflecting the correlations between relative hit rate and the NA scale was, in fact, slightly increased: from 2.93, p = .002, to 3.01, p = .001; the skewness value for the NAD scale distribution for the total data set was reduced from [G.sub.1] = 0.76, p = .000001, to [G.sub.1] = 0.67, p = .00001; and for the correlations between the NAD scores and number of receivers, Stouffer Z was reduced from 3.35, p = .0004, to 3.20, p = .0007.
The next interpretation is specifically concerned with the significant skewness values of the NAD scale distributions (Figure 2), especially the remarkably significant value in the case of the total data set.
3. The skewness of the distribution of NAD values was due to asymmetrically distributed extreme values. Any measure of the skewness of a distribution is sensitive to asymmetrically distributed extreme values, especially outliers. To test whether the occurrence of asymmetrically located extreme values could account for the small p values associated with the skewness values of the NAD scale distributions--especially the remarkably small p value in the case of the total data set--the tails of the distributions were truncated by moving extreme values toward the centre of the distribution. This was done by categorizing the NAD values using open end-categories before the skewness analysis started. Specifically, for a given data set, the original NAD scale was transformed into a 10-point scale with equal intervals, except for the two end-categories, which had equal frequencies (5) in the case of the total data set. As can be seen from Figure 4, these open categories effectively eliminated any possible effect of extreme values. With the original NAD values being replaced by corresponding category values (1-10), new skewness values and their associated z and p values were calculated. Somewhat surprisingly, as can be seen from Figure 4, the effect of these recalculations was to strengthen (!), instead of weaken, the skewness indicators, yielding, for example, a p value close to 1 in 10 million (!) for the total data set. This strengthening of the skewness indicators was most likely due to a reduction of sampling errors resulting from the transformation of the original NAD data into categorical data, thus counteracting the skewness-reducing effect of truncating the distribution. It may also be noted that, whereas the distribution for the new study previously was somewhat more skewed than that for the old one, the two current distributions are practically identical.
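One way to read the categorization procedure is sketched below. The details are assumptions, since the text specifies only a 10-point scale with equal intervals and two open end-categories of equal frequency: here the five smallest and the five largest values form categories 1 and 10, and the range between them is divided into eight intervals of equal width. The function name and parameters are hypothetical.

```python
def categorize(values, n_end=5, n_categories=10):
    """Map values to categories 1..n_categories with open end-categories.

    Values below the (n_end + 1)-th smallest value go to category 1 and
    values above the (n_end + 1)-th largest value go to the top category;
    the range in between is cut into equal-width intervals. With ties at
    the cut points the end-categories may hold fewer than n_end values.
    """
    if n_categories < 3:
        raise ValueError("need at least three categories")
    ordered = sorted(values)
    if len(ordered) <= 2 * n_end:
        raise ValueError("too few values for the requested end-categories")
    lo = ordered[n_end]        # smallest value outside the lower end-category
    hi = ordered[-n_end - 1]   # largest value outside the upper end-category
    n_inner = n_categories - 2
    width = (hi - lo) / n_inner
    categories = []
    for v in values:
        if v < lo:
            categories.append(1)
        elif v > hi:
            categories.append(n_categories)
        elif width == 0:
            categories.append(2)  # degenerate case: all inner values equal
        else:
            inner = min(int((v - lo) / width), n_inner - 1)
            categories.append(inner + 2)
    return categories
```

Under this reading an outlier contributes no more to the skewness than any other value in its end-category, which is the property the control test relies on.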
4. Positive findings were due to disturbances in the sender room. As may be recalled, during the new data collection, occasional disturbances in the sender room (scrapes, coughs, and other sounds), which, at least in theory, might have been unconsciously perceived by the receivers, were registered by one of the two experimenters in the sender room. Typically, a few (minor) disturbances occurred during a session (Md = 3), only 23 of the 110 sessions (20.9%) being completely free from any disturbances. There was no significant correlation between number of negative guesses and number of disturbances as calculated across sessions, r(107) = -.015, p = .88, two-tailed. Likewise, there was no significant correlation between number of disturbances in a session and the NAD scores, either for small session groups, defined as n < 6, r(45) = .16, p = .28, two-tailed, or for all session groups regardless of number of receivers, r(108) = -.057, p = .55, two-tailed. Hence, at least as far as the new data are concerned, the major findings of the present study could not be due to auditory (or vibrational) perceptual leakage.
Implicitly, there was one general assumption behind the present study: Considering the whole set of stimulus pictures, performance was expected to be related to at least one of the eight picture scales, or to all of them in combination. This general assumption was supported by two findings: First, as measured by correlations between relative hit rate among the 30 stimulus pictures for the old and the new study, there was significant interstudy reliability. Second, the NA scale, summarizing the eight original scales, was significantly related to relative hit rate (by a quadratic function) as was the EDA scale (by a linear function), as indicated by Stouffer Z.
However, the strongest results were obtained through a systematic explorative procedure, aimed at unveiling the roots of the above findings. A hierarchical strategy was followed in this search, comprising two major steps. In the first step, positive and negative pictures were analysed separately. Whereas no clear effect was found for the positive pictures, several different analyses suggested that participants did discriminate among negative pictures. In the second step, the negative pictures were analysed in more detail, as discussed below.
Not surprisingly, in the many studies on the dimensional structure of emotions that have been performed during the twentieth century, a dimension of pleasure versus displeasure has repeatedly appeared as the most salient one. However, in the present study, the distinction between pleasant and unpleasant emotions did not appear to be related to psi performance. Instead a dimension of negative arousal, varying above all among the negative pictures, seemed to be critical.
As far as the negative pictures are concerned, the systematic quest for any relationship between pictures and receivers' responses resulted in three major findings:
1. Participants discriminated between two types of negative pictures. As related to relative hit rate, two types of negative stimulus pictures could be discerned: arousing and nonarousing ones. In terms of the eight original picture scales, the arousing pictures were characterized by particularly high levels of EDA and repulsiveness, and the nonarousing pictures by high familiarity or compassion levels. Both types of pictures tended to give rise to relative hit rate values that deviated from MCE--but in different directions. Thus, the predominantly arousing pictures tended to be associated with responses above MCE, that is, hits, and predominantly compassion-eliciting or familiar pictures with responses below MCE, that is, psi-missing. This latter deviation was the most pronounced one, and was significant for the new data set, although not for the old one.
While there was no evidence that participants had distinguished between positive and negative emotions, or between different positive emotions, the results suggested that participants did discriminate between two types of negative emotions: repulsion and compassion. Indeed, these two emotions are quite distinct--phenomenologically and physiologically as well as behaviourally. Thus, while compassion feels "soft," is physiologically characterized by parasympathetic rather than sympathetic autonomic reactions, and tends to evoke approaching, care-giving behaviour, repulsion feels "hard" and "tense," is dominated by sympathetic rather than parasympathetic activity, and evokes avoidance or withdrawal behaviour (see, e.g., Rollenhagen, 1990). There is thus a tangible basis for any discrimination between repulsion and compassion, making the present results potentially understandable in psychological as well as physiological terms.
2. The NAD scale distribution was found to be positively skewed. Under the general null hypothesis that no telepathic communication or any other psi phenomenon occurred in the present study, a distribution of NAD scores is expected to be completely symmetric (under the null hypothesis, there is no logical reason why the sum of the relative hit rate values for the seven high-arousal pictures should be larger than the sum of the relative hit rate values for the seven low-arousal pictures). At odds with this prediction, however, for each of the three data sets, the distribution of NAD scores was found to be significantly positively skewed. According to standard criteria, the degree of skewness obtained for the three data sets could neither be characterized as very high nor as very small, but as moderate. However, in the case of the total data set, apparently due to the large N (= 234), a significance test showed the skewness to be extremely significant, with a z value of 4.75, corresponding to a p value of one in a million. With respect to the amount of skewness, the NAD scale distributions for the old and the new data set agreed very well with that obtained for the total data set. However, although the skewness was clearly significant for both the old and the new data set, the corresponding z values (2.95 and 3.19, respectively) were considerably smaller than that obtained for the total data set, and, although small, the corresponding p values (.002 and .0007, respectively) were not exceptionally small.
Different control tests failed to explain away the skewness of the NAD scale distributions as due to any bias resulting from averaging individual data, the occurrence of asymmetrically distributed extreme-values, or the use of relative, instead of absolute, hit rate as the measure of performance.
But what about the extremely small p value obtained in the case of the total data set? Is this value really trustworthy?
First of all, there is no reason to dispute the validity of the present skewness test. For example, there was no underlying assumption that was violated, such as the requirement of a sufficiently large number of cases (at least in the case of the total data set, rather than being small, the sample was large according to established criteria), or independent measures (the different session groups could not in any way have affected each other's responses, or have been affected in the same way in responding to any given target picture, as the pictures were presented in uniquely randomized orders).
At least in theory, however, there is one possible remaining "natural" explanation for the extremely small p value in the case of the total data set: Such small p values may be unstable and therefore unreliable. In the present case, this would mean that, if the study had been only slightly different, the p value of one in a million might instead have become, say, one in a thousand or (why not) one in a hundred million. Whether or not this general statement about small p values being associated with a high degree of uncertainty is true (which, in fact, may be disputed), the statement seems at least not to be true in the present study. The p value for the total data set is thus perfectly consistent with the combination of the p value obtained for the old data set (p = .002) and that obtained for the new one (p = .0007), neither of which can be characterized as extremely small, the product of the two p values being equal to p = .0000014--almost exactly the same p value as that obtained for the total data set.
3. The skewness of the NAD scale distribution could largely be accounted for in terms of interaction between negative arousal and number of receivers. The skewness of the NAD scale distributions strongly suggests that some interaction had occurred between negative arousal and some moderator variable(s) (though not necessarily any of those considered in the present study). This interpretation was supported by a correlation analysis: For each of the three data sets, a significant negative correlation was found between the NAD scale and one of the 10 potential moderator variables of the present study: number of receivers. This relationship was for the most part attributable to a positive correlation between relative hit rate and number of receivers for the low negative arousal pictures, rather than a negative correlation between relative hit rate and number of receivers for the high negative arousal pictures.
The negative correlation between the NAD scale and number of receivers could account for a substantial part of the skewness of the NAD scale distributions, as shown by eliminating the effect of number of receivers using linear regression--but not all of it. A further reduction of the skewness was obtained by entering two additional moderator variables into the regression analysis: number of negative guesses and ap index. But the resulting distributions were still positively skewed. Perhaps some further, unknown, moderator variable, or some combination of the present ones, is needed to fully account for the skewness of the present NAD distributions. Additional potential moderator variables that should be tested in future research are the person-describing variables (age, gender, belief in telepathy and the two response style measures--number of negative guesses and repetition aversion) as applied to the senders instead of the receivers. Preliminary tests seem, in fact, to indicate that particular characteristics of the senders (the two response style variables) may be more important than the corresponding characteristics of the receivers.
So far in this discussion, the relation between the NAD scale and the size of the receiver group has been described as linear. However, the relationship was, in fact, only approximately linear. More specifically, whereas the NAD value decreased progressively with increasing group size for small receiver groups (< six receivers), there were no apparent differences in relative hit rate for large receiver groups (> five receivers). In other words, there was an upper limit around five receivers for any difference in relative hit rate between the two types of pictures to appear. This finding can apparently explain why the old data set in general exhibited weaker results than the new one, as the session groups tended to be smaller in the new data set than in the old one.
The difference between small and large receiver groups just discussed can be said to support a compromise between the common view, mentioned in the introduction, that group testing is inefficient in producing positive psi results, and the opposite view that group testing is equally, or more, efficient than individual testing in this respect. Thus, according to the present study, individual testing is more psi-conducive than group testing, but group testing will also do, provided that the groups are not too large. It remains to find out, however, why the size of the receiver group would be critical.
A comment should also be devoted to the significant negative correlation between the NAD scale and ap index for both data sets together. This relationship confirms previous suggestions that fluctuations in the geomagnetic field may be related to psi performance, even though Spottiswoode's specific finding of a psi-conducive window around 13:30 LST could not be replicated. Specifically, according to the present study, as well as several previous experimental studies (see, e.g., Berger & Persinger, 1991), a low level of geomagnetic fluctuations is psi-conducive, suggesting that telepathy somehow is transmitted by, or otherwise dependent on, electromagnetic fields. Perhaps the old, but nowadays less popular, idea among parapsychologists that telepathy--if it does exist--is an electromagnetic phenomenon should be reconsidered.
Critical Methodological Discussion
The really critical question is, of course, whether there is any reasonable "natural" explanation for the present positive results. During the initial studies in the present project, every effort was made to eliminate any perceptual leakage or any other experimental flaw, such as involuntary communication between the experimenters in the sender room and those in the receiver room--including a telepathy-based experimenter effect (the two experimenters in the sender room did not know which picture the senders were presented with at any given trial). Also, by testing the randomness of the random orders of the stimulus pictures, we made sure that they were free from any bias. And before the first replication experiment (Westerlund & Dalkvist, 2004) started, we were entirely convinced that any experimental error, however far-fetched it seemed, could be excluded--except for one remaining possible error. In spite of the fact that the senders were instructed to be silent, and the fact that the two experimental rooms were sufficiently well sound-insulated to prevent any normal sound in the sender room from being heard in the receiver room (loud screams--but not ordinary talk--could be heard from one room to the other), there were occasional minor disturbances in the sender room, which, at least in theory, could have been unconsciously perceived by the receivers. Hence, to the extent that there was any correlation between these disturbances and characteristics of the stimulus pictures, the disturbances could have affected the receivers' responses systematically. However, the control test made in this study, based on noted disturbances in the sender room, excludes this possibility. It is thus very unlikely that the present positive results can be explained by any experimental error whatsoever.
But what about statistical errors? The most critical question is perhaps whether the extremely small p value of one in a million that was obtained in testing the skewness of the NAD scale distribution for the total data set can be taken seriously, or whether it resulted from some statistical bias, such as one or several unfulfilled statistical assumptions or the instability of extremely small p values. As argued above, as far as I can see, there is no reason to dispute the validity of this extremely small p value. Moreover, such a small p value is not unique to the skewness test. A very small p value can also be obtained by combining larger significant, independent p values. For example, this holds for a combination of the p value obtained for the correlation between the NAD scale and number of receivers (.0003) and the p value for Stouffer Z for the two correlations between the NA scale and relative hit rate (.002). This yields a combined p value of .0003 * .002 = .0000006.
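The combination of the two p values can be set beside two standard combination rules. The following is a minimal sketch, assuming that the two p values are independent and one-tailed; the function names are illustrative and not taken from the study. In Fisher's method the product of the p values serves as the test statistic and is then referred to a chi-square distribution, so the resulting combined p value is somewhat larger than the raw product.

```python
import math
from statistics import NormalDist


def fisher_combined_p(p_values):
    """Fisher's method: -2 * sum(ln p) is chi-square with 2k df under H0."""
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # half the chi-square statistic
    # Closed-form chi-square survival function for an even number of df (2k).
    return math.exp(-half_stat) * sum(
        half_stat ** i / math.factorial(i) for i in range(k)
    )


def stouffer_combined(p_values):
    """Unweighted Stouffer Z and its one-tailed p from one-tailed p values."""
    nd = NormalDist()
    z = sum(nd.inv_cdf(1 - p) for p in p_values) / math.sqrt(len(p_values))
    return z, 1 - nd.cdf(z)


p_values = [0.0003, 0.002]  # the two p values quoted in the text
product = math.prod(p_values)
fisher_p = fisher_combined_p(p_values)
stouffer_z, stouffer_p = stouffer_combined(p_values)

print(f"raw product : {product:.2e}")
print(f"Fisher      : {fisher_p:.2e}")
print(f"Stouffer    : Z = {stouffer_z:.2f}, p = {stouffer_p:.2e}")
```

Under these assumptions both rules give combined p values of roughly 4 to 9 in a million, larger than the raw product but still very small.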
One major reason why the extremely small p values just discussed are important in interpreting the present positive results is that they argue against the likelihood that these results were due to positive selections from post hoc tests: It does not seem reasonable to argue that a p value of one in a million or thereabout can be found merely through such selections. In principle, the selection hypothesis can be tested by adjusting the p values for number of tests being performed using a Bonferroni-like method (multiplying the original p value by the number of tests). Unfortunately, however, in most practical situations, this cannot be done in a rigorous manner, for two major reasons. One is that the decision of how many tests to include in the analysis is often completely arbitrary. The other is that the method is only valid if the component analyses are independent, an assumption that is grossly violated in most actual research cases. Nevertheless, we can still say that, according to the Bonferroni logic, more than 10,000 independent tests are required before we can conclude that the most significant results may be due to multiple testing. Thus it is clear that these results cannot be solely accounted for by positive selection.
In the present paper, instead of attempting to adjust p values, an alternative approach has been taken: to consider the new study as a replication of the old one and to compare the results of the two studies, both with respect to similarities and with respect to differences. Arguably, to the extent that such similarities or differences can be established, the selection interpretation diminishes in credibility, even though no proper prediction testing is being performed.
As pointed out above, the existence of a general agreement between the two studies was suggested by the interstudy reliability analysis for the whole set of pictures. This conclusion was strengthened and qualified by the results of corresponding analyses for the negative and positive pictures separately, showing significant interstudy reliability for the negative pictures, but not for the positive ones. Thus, for both the whole set of stimulus pictures and for the negative pictures, the Pearson correlation as well as the Spearman rank-order correlation between the two data sets with respect to relative hit rate was significant.
More specifically, the major positive findings discussed above were all characterized by at least some agreement between the old and the new study, as indicated by Stouffer Z analyses in conjunction with assessment of differences between the old and the new study. Particularly good agreement was obtained (a) for the skewness of the NAD distribution and (b) for the correlation between number of receivers and the NAD scale: In both cases, not only was the Stouffer Z value highly significant, but the results were also independently significant in the old and the new study. In the case of the relation between relative hit rate and the NA scale, the picture is somewhat less clear. Neither this scale nor any of its component scales exhibited a significant correlation with relative hit rate in the old study, only in the new one. Nevertheless, the highly significant Stouffer Z value, the agreement between the correlation patterns in the old and the new study, as well as the lack of any significant difference between corresponding correlations in the two studies point to a common effect. The above agreements indicate that at least some of the findings were replicable: To some extent, results from the new study could thus be "predicted" from the old one, and, conversely, results from the old study could be "retrodicted" from new results.
Throughout the analyses, there was a tendency for the new study to yield clearer results than the old one, even though most of the noted differences were not significant, but only suggestive (the only exception is the correlation between the ap index and the NAD scale, which was significantly negative in the new study but zero in the old one). This tendency for the new study to yield clearer results than the old one may be explained by some minor methodological differences between the two studies. One obvious possible explanation is the tendency for the session groups to be smaller in the new study than in the old one.
It should not be forgotten, however, that previous attempts to predict future results in the present project have failed blatantly, which should be taken as a warning against drawing overly hasty conclusions. Still, the results are certainly sufficiently strong and interesting to warrant future tests. Unfortunately, in its current version, the present study is probably too time- and resource-consuming for any researcher to be willing or able to perform a more exact replication (it is no easy task to recruit almost 1,600 participants). Luckily, however, a reduced version of the present study is feasible and would do. Thus, taking advantage of the negative correlation between the NAD scale and number of receivers, a small number of receivers in each session group would suffice to demonstrate an effect. It would also be sufficient to use only negative pictures, provided they include both repulsive and familiar or compassion-evoking pictures. Mere conceptual replications based on the concept of negative arousal would also do. For example, some previous data (e.g., ganzfeld data associated with emotional target descriptions) might be re-analysed in terms of hit rate as related to some available measure of arousal. Also, by amplifying critical picture characteristics, for example, by using film clips instead of slide pictures, the effects might be strengthened, thereby reducing the need for large amounts of data.
In any case, the present study would seem to provide a potential recipe for bringing forth telepathic communication or some other psi phenomenon.
The research reported in this paper was supported by a grant from the John Bjorkheim Memorial Foundation. I am grateful to my friends and colleagues Gergo Hadlaczky and Joakim Westerlund for their theoretical and practical support during the progress of this work. I am also grateful to Eric Nissen for helpful comments on an earlier version of this paper.
Arango, M. A., & Persinger, M. A. (1988). Geophysical variables and behavior: LII. Decreased geomagnetic activity and spontaneous telepathic experiences from the Sidgwick collection. Perceptual and Motor Skills, 67, 907-910.
Barker, P. L., Messer, E., & Drucker, S. A. (1975). Intentionally-deployed attention states: Relaxation. A group majority vote procedure with percipient optimization. Proceedings of Presented Papers: The Parapsychological Association 16th Annual Convention, 165-167.
Berger, R. E., & Persinger, M. A. (1991). Geophysical variables and behavior: LXVII. Quieter annual geomagnetic activity and larger effect size for experimental psi (ESP) studies over six decades. Perceptual and Motor Skills, 73, 1219-1223.
Carpenter, J. C. (1988). Quasi-therapeutic group process and ESP. Journal of Parapsychology, 52, 279-304.
Carpenter, J. C. (1991). Prediction of forced-choice ESP performance: III. Three attempts to retrieve coded information using mood reports and a repeated-guessing technique. Journal of Parapsychology, 55, 227-280.
Cramer, D. (1997). Basic statistics for social research. London: Routledge, Chapman & Hall.
Dalkvist, J., & Westerlund, J. (1998). Five experiments on telepathic communication of emotions. Journal of Parapsychology, 62, 219-253.
Dalkvist, J., & Westerlund, J. (2006). Telepathic group communication of emotions: Announcement of predictions for an ongoing experiment. Proceedings of Presented Papers: Parapsychological Association 49th Annual Convention, 314-319.
Dalkvist, J., Montgomery, W., Montgomery, H., & Westerlund, J. (2010). Re-analysis of group telepathy data with focus on variability. Journal of Parapsychology, 74, 143-171.
Haraldsson, E., & Gissurarson, L. R. (1987). Does geomagnetic activity affect extrasensory perception? Personality and Individual Differences, 8, 745-747.
Haight, J., Weiner, D., & Morrison, M. (1978). Group testing for ESP: A novel approach to the combined use of individual and shared targets. Proceedings of Presented Papers: The Parapsychological Association 24th Annual Convention, 96-98.
Lang, P. J., Greenwald, M. K., Bradley, M. M., & Hamm, A. O. (1993). Looking at pictures: Affective, facial, visceral, and behavioral reactions. Psychophysiology, 30, 261-273.
Joanes, D. N., & Gill, C. A. (1998). Comparing measures of sample skewness and kurtosis. The Statistician, 47, 183-189.
Milton, J., & Wiseman, R. (1999). A meta-analysis of mass-media tests of extrasensory perception. British Journal of Psychology, 90, 235-240.
Moss, T., & Gengerelli, J. A. (1968). ESP effects generated by affective states. Journal of Parapsychology, 32, 90-100.
Rhine, J. B. (1947/1971). The reach of the mind. New York: Macy.
Rollenhagen, C. (1990). On the description of emotion awareness. Unpublished doctoral dissertation, Stockholm University, Sweden.
Sokolov, E. N. (1960). Neuronal models and the orienting reflex. In M. A. Brazier (Ed.), The central nervous system and behaviour (pp. 187-276). New York: Macy.
Spottiswoode, S. J. P. (1997). Geomagnetic fluctuations and free response anomalous cognition: A new understanding. Journal of Parapsychology, 61, 3-12.
Sturrock, P., & Spottiswoode, S. J. P. (2007). Time-series power spectrum analysis of performance in free response anomalous cognition experiments. Journal of Scientific Exploration, 21, 47-66.
Thouless, R. H., & Brier, R. M. (1970). The stacking effect and methods of correcting for it. Journal of Parapsychology, 34, 124-128.
Westerlund, J., & Dalkvist, J. (2004). A test of predictions from five studies on telepathic group communication of emotions. Proceedings of Presented Papers: The Parapsychological Association 47th Annual Convention, 269-277.
Department of Psychology
S-106 91 STOCKHOLM
Table 1
Summary of Previous Studies

Main studies

Studies                                                Content
Dalkvist & Westerlund, 1998                            1. Five different explorative substudies, resulting in several positive findings.
                                                       2. Construction of psychological target picture scales.
Westerlund & Dalkvist, 2004                            Test of eight predictions from the initial study, none of which was supported.
Dalkvist & Westerlund, 2006                            Establishing the occurrence of a relationship between a sender/receiver order effect on performance, found in Westerlund & Dalkvist, 2004, and disturbances in the geomagnetic field.
Dalkvist, Montgomery, Montgomery & Westerlund, 2009    Establishing reduced variability in hit rate in agreement with a sender/receiver order effect, found in Westerlund & Dalkvist, 2004.

Scale construction studies

Study number    Rated picture features
1               All pictures: pleasant-unpleasant, involving, familiar, perceptible.
2               Positive pictures: calm, exciting.
3               Negative pictures: compassion-arousing, repulsive.

Table 2
Age and Gender Distribution of Participants in the Old and the New Data Collection

Data           Age                Gender distribution
collections    Mean     Range     Females n (%)    Males n (%)    Total N (%)
Old            26.42    57        589 (69.7)       256 (30.3)     845 (100)
New            26.72    40        459 (70.4)       193 (29.6)     652 (100)
Total          26.55    57        1,048 (70.0)     449 (30.0)     1,497 (100)

Table 3
Overview of the Old and the New Data Collections

Data           N of           N of        Participants per session
collections    experiments    sessions    Mean     Range
Old            64 (a)         124         7.22     8
New            56 (b)         110         6.70     10
Total          120            234         7.03     10

(a) Four sessions were excluded due to apparatus failure.
(b) Two sessions were excluded due to apparatus failure.
Table 4
Pearson Correlations Among the Six Subjective Scales

Scales                       (1)       (2)       (3)       (4)     (5)
Pleasure-Displeasure (1)     1
Compassion (2)               .96 **    1
Repulsion (3)                .98 **    .97 **    1
Involvement (4)              .71 **    .85 **    .77 **    1
Familiarity (5)              -.74 **   -.62 **   -.75 **   -.32    1
Perceptibility (6)           -.53 **   -.40 *    -.53 **   -.03    .81 **

* p < .05, two-tailed
** p < .01, two-tailed

Table 5
Pearson Correlations Between the Six Subjective Scales and the Two Physiological Scales

Subjective scales       Physiological scales
                        EDA       HR
Pleasure-displeasure    .38 *     -.47 **
Compassion              .36       -.43
Repulsion               .39 *     -.49 **
Involvement             .30       -.42 *
Familiarity             -.31      .54 **
Perceptibility          -.15      .32

* p < .05, two-tailed
** p < .01, two-tailed

Table 6
Mean Observed Relative Hit Rate for the Old and the New Data Set, Respectively, Together With the Results of One-sample t Tests of Deviation of Mean Observed Values From MCE (= .50) and a Corresponding Stouffer Z Analysis

Data set    Mean observed relative hit rate    t        df     p (two-tailed)
Old         0.50                               -0.02    123    0.98
New         0.49                               -1.49    109    0.06
Both        Stouffer Z = 0.39, p = .35

Table 7
Results From One-way Repeated Measures ANOVA With All Stimulus Pictures as the Independent Variable and Relative Hit Rate as the Dependent Variable for the Old and the New Data Set and Results From a Stouffer Z Analysis

Data set    F       df         p (two-tailed)
Old         0.85    29/3567    0.70
New         1.30    29/3161    0.13
Both        Stouffer Z = 1.28, p = .10

Table 8
Pearson Correlations Between Mean Relative Hit Rate and the Scales for the 30 Stimulus Pictures for the Old and the New Data Set and Results From Corresponding Stouffer Z Analyses

                        Data set
                        Old              New               Both
Scales                  r       p *      r       p *       Z       p *
EDA                     .30     .11      .52     .003#     3.22    .0006#
HR                      .18     .35      -.18    .33       .01     .49
Pleasure-displeasure    .10     .60      .04     .82       .53     .30
Compassion              .00     .10      -.03    .86       .13     .45
Repulsion               .05     .79      .05     .79       .38     .35
Involvement             -.17    .38      .00     .99       .62     .27
Familiarity             -.17    .36      -.19    .31       1.35    .09
Perceptibility          -.26    .17      -.05    .79       1.16    .12
NA                      .09     .65      .16     .39       .93     .18

* two-tailed. Values with p < .05 (bold in the original) are marked with #.

Table 9
Results From One-way Repeated Measures ANOVA With Positive or Negative Stimulus Pictures as the Independent Variable and Mean Relative Hit Rate as the Dependent Variable for the Old and the New Data Set and Results From Corresponding Stouffer Z Analyses

Positive pictures
Data set    F       df         p *
Old         0.98    14/1722    0.47
New         1.00    14/1526    0.45
Both        Stouffer Z = 0.76, p = .23

Negative pictures
Data set    F       df         p *
Old         0.75    14/1722    0.73
New         1.59    14/1526    0.08
Both        Stouffer Z = 1.74, p = .04#

* two-tailed. Values with p < .05 (bold in the original) are marked with #.

Table 10
Pearson Correlations Between Mean Relative Hit Rate and Picture Scales for Positive and Negative Pictures for the Old and the New Data Set and Results From Corresponding Stouffer Z Analyses

Positive pictures
                        Old data set     New data set      Both data sets
Picture scales          r       p *      r        p *      Z       p *
EDA                     .21     .46      .12      .66      0.82    .21
HR                      .56     .03      .07      .81      1.70    .04#
Pleasure-Displeasure    .36     .19      -.7.1    .46      0.42    .34
Compassion              .00     .10      -.26     .35      0.67    .25
Repulsion               .36     .19      -.33     .24      0.09    .46
Involvement             -.45    .09      .16      .57      0.23    .41
Familiarity             -.33    .23      .18      .51      0.39    .35
Perceptibility          -.47    .07      .17      .54      0.83    .20
NA                      -.09    .76      -.08     .78      0.40    .34

Negative pictures
                        Old data set     New data set      Both data sets
Picture scales          r       p *      r        p *      Z       p *
EDA                     .38     .16      .77      .001#    3.31    .001#
HR                      -.08    .77      -.39     .15      1.21    .11
Pleasure-Displeasure    .13     .64      .61      .016#    2.03    .02#
Compassion              -.27    .33      -.16     .57      1.09    .14
Repulsion               .06     .83      .69      .005#    2.13    .02#
Involvement             -.19    .50      -.12     .68      0.77    .22
Familiarity             -.19    .50      -.42     .12      1.58    .06
Perceptibility          -.22    .43      -.14     .62      0.90    .19
NA                      .29     .29      .73      .002#    2.93    .002#

* two-tailed. Values with p < .05 (bold in the original) are marked with #.
Table 11
One-sample t Tests of Deviation of Mean NAD Scores From MCE for the Old and the New Data Set and Results From a Corresponding Stouffer Z Analysis

Data set    MCE    Mean NAD score    t       df     p (two-tailed)
Old         0      0.002             0.35    123    .73
New         0      0.015             2.42    109    .02#
Both        Stouffer Z = 1.83, p = .03#

Values with p < .05 (bold in the original) are marked with #.

Table 12
One-sample t Tests of Deviation of Observed Relative Hit Rate From MCE for Stimulus Pictures With High (Above the Median) and Low (Below the Median) NA Scores, Respectively, for the Old and the New Data Set and Results From Corresponding Stouffer Z Analyses

Data set    NA level    MCE for relative    Mean observed relative    t        df     p (two-tailed)
                        hit rate sum        hit rate sum
Old         High        0.23                0.23                      0.29     123    .78
            Low         0.23                0.23                      -0.30    123    .77
New         High        0.23                0.24                      0.95     109    .34
            Low         0.23                0.22                      -3.01    109    .003#
Both        High        Stouffer Z = 0.83, p = .20
            Low         Stouffer Z = 2.19, p = .02#

Values with p < .05 (bold in the original) are marked with #.

Table 13
Pearson Correlations Between 10 Possible Moderator Variables and NAD Scores for the Old and the New Data Set and Results From Corresponding Stouffer Z Analyses

                                             Data set
                                             Old (df = 122)    New (df = 108)    Both (df = 232)
Variables                                    r       p *       r       p *       Z       p *
Receiver order                               -.02    .87       .20     .05       1.18    .12
Gender                                       .05     .62       .00     .98       .38     .35
Belief in telepathy before the experiment    .11     .21       -.06    .54       .53     .30
Belief in telepathy after the experiment     .05     .58       -.02    .85       .29     .39
N of receivers                               -.23    .01#      -.21    .03#      3.35    .0004#
Age                                          .01     .88       .12     .21       1.16    .12
Repetition aversion                          -.13    .14       .08     .43       .58     .28
N of negative guesses                        -.18    .05       -.15    .13       2.48    .007#
LST (Good--Bad)                              -.12    .17       .12     .21       .20     .42
ap-index (In)                                .00     .97       -.26    .007#     1.81    .04#

* two-tailed. Values with p < .05 (bold in the original) are marked with #.
Table 14
Pearson Correlation Between N of Receivers and Relative Hit Rate Sum for High and Low NA Level Pictures for the Old and the New Data Set and Results From Corresponding Stouffer Z Analyses

            Data set
            Old (df = 122)    New (df = 108)    Both (df = 232)
NA level    r       p *       r       p *       Z       p
High        -.10    .27       -.15    .12       1.86    .032#
Low         .27     .003#     .19     .05#      3.54    .0001#

* two-tailed. Values with p < .05 (bold in the original) are marked with #.

Table 15
Skewness Tests for NAD Score Residuals Obtained From Linear Regression Analyses With NAD Score as Dependent Variable and Moderator Variables as Independent Variables

                                            Residuals for
Data set    Statistics    Original data     N of receivers    N of receivers, N of negative guesses, and ap index
Old         Skewness      0.64              0.33              0.29
            z             2.95              1.53              1.32
            p (124) *     0.002#            0.06              0.09
New         Skewness      0.34              0.51              0.37
            z             3.19              2.20              1.60
            p (110) *     0.0007#           0.003#            0.06
Total       Skewness      0.76              0.45              0.3,
            z             4.75              2.81              2..4
            p (234) *     0.000001#         0.003#            0.01

* one-tailed. Values with p < .05 (bold in the original) are marked with #.
Publication: The Journal of Parapsychology
Date: Mar 22, 2013