# Lies, damned lies and statistics.

Mark Twain quotes Disraeli as saying that there are three kinds of lies: lies, damned lies and statistics. Statistics summarizes a large set of data in two numbers: the central tendency of the data and a measure of its spread. Theoretically, these two numbers describe the set of data with no loss of information. In real life, however, problems arise from the basic assumptions made about the data (such as the assumption that the data are normally distributed). Real data require investigation before any statistical technique can be used. For example, using parametric tests on data that are not normally distributed may produce incorrect conclusions. Solo allows investigators to determine the distribution of their data and to select the proper type of test (i.e. parametric or nonparametric).
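The point about lost information can be illustrated with a small sketch. The two data sets below are invented for illustration (they come from neither Solo nor the texts cited in this review), and Python's standard `statistics` module stands in for a statistical package. Both sets report the same mean and standard deviation, although one is a lopsided single cluster and the other is split into two groups.

```python
import statistics


def summarize(values):
    """Return (mean, population standard deviation) for a list of numbers."""
    return statistics.mean(values), statistics.pstdev(values)


# Hypothetical data sets, chosen so that the two summary numbers coincide.
skewed = [2, 4, 4, 4, 5, 5, 7, 9]    # one cluster with a long right tail
bimodal = [3, 3, 3, 3, 7, 7, 7, 7]   # two separate groups

print("skewed :", summarize(skewed))
print("bimodal:", summarize(bimodal))
```

Each set has a mean of 5 and a standard deviation of 2, so the two numbers alone cannot tell them apart. This is why the distribution has to be examined before a test is chosen.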

The creators of Solo have designed an extremely useful software package for anyone who needs to manipulate data or summarize information using statistics. The surprising thing about this program is its price: 10 to 20% of the cost of similar statistical programs. Yet it has tests and routines otherwise found only in the best statistical packages. (This is not surprising, since BMDP has been creating statistical packages for mainframes for a long time.)

The package is easy to use and is tailored for those who really want to investigate their data. The manuals are easy to understand and explain all the techniques in detail. For techniques that need further elaboration, there is an excellent set of references. A simple interface moves the user from one set of tests to another. Nearly every test most of us could imagine is included. There is even a terminate-and-stay-resident (TSR) calculator. The graphics include many less common plots, such as stem-and-leaf plots and Chernoff faces. As a device for exploring data, Solo seems excellent. For specialized applications, a number of add-on programs provide additional capabilities, including quality control charts. Time Series and Survival Analysis add-ons are scheduled for release shortly.

Data were taken from the literature to test the various routines. One very good book covering many of the techniques used in this package is Biometry (Sokal and Rohlf, 1981), which carries the same sets of data through a number of tests. For the experimental design techniques, examples from Design and Analysis of Industrial Experiments (Davies, 1967) were used. The small set of time series analyses included in the main package was tested using examples in Time Series Analysis (Box and Jenkins, 1976). Every result agreed with the published values in the texts used, and none of the examples 'hung up'.

Unlike another package used by this reviewer, Solo seems to have no bugs, and it is fast! Simple ANOVAs take only a few seconds. The other package, which costs about 10 times more, has the annoying habit of dying after reporting a parity error, with no way of recovering other than starting from the very beginning (and sometimes rebooting the system). Solo can cut you out of a routine and bring you back to DOS, but it takes you back to your data set, and it is a simple matter to edit the data before trying again. The other statistical package would force you to load the data again, and it is annoyingly slow.

The graphics routines have no parallel in the texts used and are quite impressive. For example, when the aim is to show that transformed data approximate normality better than the untransformed set, density traces can be more meaningful to non-statisticians than a simple comparison of skewness and kurtosis.
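For readers who want to see the numerical comparison that the density traces replace, the following is a minimal sketch. It assumes a simple moment-based skewness, which is not necessarily the estimator Solo uses, and the readings are hypothetical values chosen to be strongly right-skewed.

```python
import math


def skewness(values):
    """Moment-based skewness m3 / m2**1.5; close to 0 for a symmetric sample."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical right-skewed readings (not taken from any of the cited texts).
raw = [1, 2, 4, 8, 16, 32, 64]
logged = [math.log(x) for x in raw]  # log transformation

print(f"skewness before transformation: {skewness(raw):.3f}")
print(f"skewness after transformation:  {skewness(logged):.3f}")
```

The raw values are clearly skewed to the right, while their logarithms are evenly spaced and so have a skewness of essentially zero. A density trace makes the same point visually, without asking the audience to interpret a moment statistic.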

The plots appear to be well made. They can be printed on most printers and plotters, or even imported into WordPerfect 5.0. The graphics files can also be saved to disk and printed at a later date. This feature saves time: saving a file to disk takes only a few seconds, far less than printing or plotting it. Saved plots can be printed with PrintAPlot.

Solo is compatible with most IBM graphics cards. It has a useful, but not wordy, help file. Other useful features include the ability to shell to DOS, save screen displays to a file, use keyboard macros and import ASCII files with variable and row labels.

There is no software protection, and at the price (US$149), it's a bargain. Other packages may do all of these things but are many times more expensive. The other packages are also not geared to exploring data: they assume that users already know which techniques they want to use. Except for factorial design techniques, this is usually not the case with most data. Information from plants tends to be sporadic, with many missing values and censored data (either left or right censored).

Limitations: Solo can handle up to 250 variables and 32,000 observations. This is quite a large amount of data; even mainframe programs have problems with data sets this large. Since Solo reads the data file constantly, it is best to copy the data file onto a RAM disk. When this is done, the computations occur quickly, even on a slow XT. The program also requires at least 450 K of memory.

Caveats: One small idiosyncrasy of the descriptive statistics, neither critically important nor wrong, is that the kurtosis (the measure of 'peakedness' of a distribution) has the value 3 subtracted from it. This means that the normal distribution has a skewness of 0 and a kurtosis of 0, rather than the kurtosis of 3 quoted in many statistical texts.
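The two conventions differ only by the constant 3, as the following minimal sketch shows. It assumes the plain moment-based definition of kurtosis; the exact estimator Solo applies (for instance, any small-sample correction) is not stated here, and the function name and `excess` flag are illustrative only.

```python
def kurtosis(values, excess=True):
    """Moment-based kurtosis m4 / m2**2.

    With excess=True, 3 is subtracted so that a normal distribution scores 0
    (the convention described above); with excess=False, it scores 3.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    k = m4 / m2 ** 2
    return k - 3 if excess else k


data = [1, 2, 3, 4, 5]  # small illustrative sample
print(kurtosis(data, excess=False))  # textbook convention
print(kurtosis(data, excess=True))   # convention with 3 subtracted
```

For this sample the textbook value is 1.7 and the value with 3 subtracted is -1.3. When comparing output against a text, check which convention the text uses before concluding that the figures disagree.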

| Title Annotation: | Solo computer program manipulates real data |
|---|---|
| Author: | Miyamoto, Henry K. |
| Publication: | Canadian Chemical News |
| Article Type: | Product/Service Evaluation |
| Date: | Feb 1, 1990 |
| Words: | 963 |