Nearly everyone is familiar with the literary detective Sherlock Holmes and his often-picked-on associate Dr. John Watson. In “A Scandal in Bohemia,” Holmes educates Watson on the difference between seeing and observing. That Watson is looking at the same information, yet deriving different (and more often incorrect) conclusions than Holmes, befuddles him. “I believe that my eyes are as good as yours,” Watson claims. To which Holmes answers “Quite so. You see, but you do not observe. The distinction is clear.”

The same holds true for the interpretation of scientific data. Researchers see data, but may fail to observe whether the data yield any meaningful results. Researchers reach conclusions, but may fail to observe whether the data support the conclusions. In doing so, they may miss important clues that the data provide.

Overreliance on statistical analyses of data is like seeing a snapshot of your data but not really observing the data. The *P* value is an excellent example of how researchers may over- or underinterpret data and thus fail to see the true picture. To illustrate this fact, let us start with what the *P* value really is.

In research, one can never state with 100% certainty that any change or difference is real. In fact, probability testing starts with the assumption that the difference between groups is zero (the null hypothesis). All one can then do is determine the probability (the *P* value) of observing a difference at least as large as the one seen if the null hypothesis were true. If the *P* value is small enough, it suggests, but does not prove, that the difference seen is unlikely to have occurred by chance and that the groups, therefore, may have come from different populations. However, the *P* value does not tell you anything about how large the difference between the 2 groups is, or whether the *statistical* significance implies any *clinical* significance. Yet the misperception persists that the smaller the *P* value, the greater the importance.
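To make this concrete, the arithmetic behind a two-sample comparison can be sketched in a few lines of Python. The sketch below (our illustration, not part of the original study) computes a Welch two-sample *t* statistic and a two-sided *P* value from the standard normal approximation, which is reasonable for groups of about 50 as used here; the function name and the data are hypothetical.

```python
import math
from statistics import mean, stdev

def welch_t_and_p(a, b):
    """Welch two-sample t statistic with a two-sided P value from the
    standard normal approximation (adequate for n >= ~30 per group)."""
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    t = (mean(a) - mean(b)) / se
    # Two-sided P via the standard normal CDF, expressed with math.erf
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))
    return t, p

# Hypothetical concentrations (ug/mL): two clearly separated groups
group_a = [10, 12, 14, 16, 18] * 10   # n = 50, mean 14
group_b = [20, 22, 24, 26, 28] * 10   # n = 50, mean 24
t, p = welch_t_and_p(group_a, group_b)
print(f"t = {t:.2f}, P = {p:.4f}")    # a large |t| gives a small P
```

Note what the formula does and does not contain: *P* shrinks as the mean difference grows relative to its standard error, so a trivial difference measured in enough patients can still produce a small *P* value.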

As examples of how the *P* value can be misleading, we have created a hypothetical study of patients with epilepsy being treated with phenytoin. Serum phenytoin concentrations were measured for 50 patients who remained seizure free and 50 patients who had a subsequent seizure. Two separate example data sets are provided.

Fig. 1 shows 1 data set. Bar graphs of mean with SD (Fig. 1, A and E); mean with SD plots (Fig. 1, B and F); median, interquartile range, and range plots (Fig. 1, C and G); and scatter plots of all data points (Fig. 1, D and H) are shown. The mean (SD) for the seizure-free group is 14.1 (4.3) vs 16.4 (6.0) μg/mL for the group having a second seizure while on phenytoin. Many researchers would fail to look at the distribution of data, assume a normal distribution, and apply a Student unpaired *t*-test for analysis of the data. In this case, the *P* value would be 0.026, below the commonly used cutoff of *P* < 0.05 for significance. It is possible that the researcher was familiar with nonparametric statistical tests and therefore applied a Mann–Whitney test to the data. The medians for the 2 groups are 14.0 and 17.9 μg/mL, respectively. A Mann–Whitney test of these data gives a *P* value of 0.048, again a statistically significant difference, but not a difference that has clinical value, as confirmed by direct observation of the individual data points (Fig. 1, D vs H). The minimum and maximum values are virtually the same, and there is almost complete overlap of the serum phenytoin concentrations between the 2 groups. A measured serum phenytoin concentration would provide little information for classifying a patient into either group. But how can that be? The statistical analysis suggested that there was a difference that was real. The answer is that the *P* value only gives an indication that the difference between the 2 groups was not due to chance, but says nothing about whether the difference was great or small or whether it had any real value. Simply seeing a summary of the data (*à la* Dr. Watson) vs actually observing the data (*à la* Sherlock Holmes) told a different story.
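One way to quantify what the eye sees in Fig. 1, D and H is an effect size such as Cohen's *d*, which expresses the mean difference in units of pooled SD and, unlike the *P* value, does not shrink or grow with sample size. A minimal sketch using the summary statistics quoted above (the helper name is ours, and the simple pooled SD assumes equal group sizes, as here):

```python
import math

def cohens_d_from_summary(m1, sd1, m2, sd2):
    """Cohen's d from group means and SDs, using the simple pooled SD
    appropriate when the two groups are the same size."""
    pooled_sd = math.sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return (m2 - m1) / pooled_sd

# Summary statistics for the first data set (Fig. 1):
# seizure free 14.1 (4.3), subsequent seizure 16.4 (6.0) ug/mL
d = cohens_d_from_summary(14.1, 4.3, 16.4, 6.0)
print(f"d = {d:.2f}")  # ~0.44: a modest effect despite P < 0.05
```

An effect size of roughly 0.44 SD is consistent with the near-complete overlap seen in the scatter plots: statistically detectable, but of little use for classifying an individual patient.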

Now let us look at the same patients (50 in each group) using a different set of data (Fig. 2). Using the same statistical approaches as above, the means (SD) are 14.1 (4.3) and 16.4 (9.0) μg/mL for the 2 groups, the same means as in the prior example, but the *t*-test now gives a *P* value of 0.104, which is not statistically significant. The medians for the 2 groups are 14.0 and 19.1 μg/mL, respectively (Fig. 2, C and G), a larger difference in medians than in the prior example, but the Mann–Whitney test gives a *P* value of 0.427, again not a statistically significant difference. One possible (and wrong) conclusion would be that, given these *P* values, the data showed nothing of value and were not worth further investigation. But here you would be seeing only what the statistical analyses suggested. Actual observation of the data (Fig. 2, D and H) tells a different story.

Whereas 36 of 50 patients in the group with no new seizures had serum phenytoin concentrations within the commonly accepted therapeutic range of 10–20 μg/mL (Fig. 2D), only 4 of 50 patients having a subsequent seizure had concentrations within this range (Fig. 2H). Only 1 patient in the seizure-free group had a concentration <7.0 μg/mL, compared with 10 patients with a subsequent seizure who had values <7.0 μg/mL. No seizure-free patient had a serum phenytoin concentration >25.0 μg/mL, but 13 patients in the group who had a subsequent seizure had concentrations ranging from 25.8 to 29.2 μg/mL.

So how can your eyes tell you that the distribution of the 2 groups of patients is clearly different, yet the *P* value informs you that the groups are not statistically different? How can information of clear clinical value be obscured by a nonsignificant *P* value? How can probability obscure practicality? This is the crux of the problem with reliance on the *P* value. Neither the *t*-test nor the Mann–Whitney test gave any indication of or took into account the bimodality of the data distribution in the group with subsequent seizures. This bimodality, easily observed by visual inspection, turned out to be the key determinant in interpreting these data. As Holmes suggested, if we merely see, but fail to observe, we can reach the wrong conclusion. There is no substitute for looking at your data.
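The lesson above, that near-identical means can hide a bimodal split of great clinical importance, can be sketched with deterministic made-up numbers. The construction below is purely illustrative (it is not the study's data), using the commonly cited 10–20 μg/mL therapeutic range as the cutoff:

```python
from statistics import mean

# Hypothetical concentrations (ug/mL): one group spread evenly across
# the 10-20 therapeutic range, the other split bimodally between
# subtherapeutic (~5-7) and supratherapeutic (~25-27) values.
seizure_free = [10 + 10 * i / 49 for i in range(50)]
new_seizure = [5 + i / 12 for i in range(25)] + [25 + i / 12 for i in range(25)]

def n_in_range(values, low=10.0, high=20.0):
    """Count concentrations inside the 10-20 ug/mL therapeutic range."""
    return sum(low <= v <= high for v in values)

# Means are nearly identical, yet the in-range counts are drastically
# different -- exactly the pattern a mean-based test cannot see.
print(f"means: {mean(seizure_free):.1f} vs {mean(new_seizure):.1f}")
print(f"in range: {n_in_range(seizure_free)}/50 vs {n_in_range(new_seizure)}/50")
```

A *t*-test on groups like these would report no significant difference in means, while a simple count against the therapeutic range, or a glance at a scatter plot, reveals two completely different clinical pictures.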

## Footnotes

(see editorial on page 909)

**Author Contributions:** *All authors confirmed they have contributed to the intellectual content of this paper and have met the following 3 requirements: (a) significant contributions to the conception and design, acquisition of data, or analysis and interpretation of data; (b) drafting or revising the article for intellectual content; and (c) final approval of the published article.*

**Authors' Disclosures or Potential Conflicts of Interest:** *Upon manuscript submission, all authors completed the author disclosure form. Disclosures and/or potential conflicts of interest:*

**Employment or Leadership:** T.M. Annesley, *Clinical Chemistry*, AACC; J.C. Boyd, *Clinical Chemistry*, AACC. **Consultant or Advisory Role:** None declared. **Stock Ownership:** None declared. **Honoraria:** None declared. **Research Funding:** None declared. **Expert Testimony:** None declared. **Patents:** None declared.

- Received for publication April 21, 2014.
- Accepted for publication April 30, 2014.

- © 2014 The American Association for Clinical Chemistry