## Abstract

Various viewpoints have been offered regarding the appropriate use of scatter plots or difference plots (bias plots or residual plots) in comparing analytical methods. In many of these discussions it seems the basic concepts of identity (within inherent imprecision) and acceptability based on analytical goals (analytical quality specifications) have been forgotten. With the increasing number of Reference Methods in laboratory medicine, these basic concepts are becoming more important in validation of field methods. Here we describe a simple and effective graphical test of these hypotheses (identity and acceptability) by use of difference plots. These plots display the underlying hypothesis before the measured differences are plotted and allow interpretation of the results according to specific criteria. We further describe simple but effective interpretations of the data, when the hypothesis is not fulfilled, by using two data sets drawn from comparisons of field methods for S-creatinine with a Reference Method for this analyte. The difference plot is a graphical tool with related simple statistics for comparison of a field method with a Reference Method, focusing on (*a*) identity within the inherent analytical imprecision or (*b*) acceptability within analytical quality specifications. Calculation of the standard deviation of the differences is an indispensable tool for evaluation of aberrant-sample bias (matrix effects).

Westgard et al. (1)(2)(3) outlined the basic principles for method comparison in a clear, easy-to-follow manual. They also introduced the concept of allowable analytical error and gave an overview of published performance criteria. They recommended that the estimated analytical imprecision and bias be compared with these performance criteria in method evaluation as well as in method comparison. Their approach made use of a scatter-plot and calculations based on regression lines, but with confidence limits and judgment of acceptability based on the criteria for allowable analytical error.

These principles of comparing analytical performance with performance criteria, however, have not been universally accepted, and recent publications have criticized the misuse of correlation coefficients (4) and overinterpretation of regression lines in method comparison (5)(6)(7). Bland and Altman (4) recommended the difference plot (or bias plot or residual plot) as an alternative approach for method comparison. On the abscissa they used the mean value of the methods to be compared, to avoid regression towards the mean, and on the ordinate they plotted the calculated difference between measurements by the two methods. They further estimated the mean and standard deviation of differences and displayed horizontal lines for the mean and for ±2 × the standard deviation. However, they missed the concept of a more objective criterion for acceptability. Recently, Hollis (5) has recommended difference plots as the only acceptable method for method comparison studies for publication in *Annals of Clinical Biochemistry*, but without specifying criteria for acceptability.

However, a few difference plots with evaluation of acceptability according to defined criteria have been published, e.g., in evaluation of estimated biological variation compared with analytical imprecision (8), and in external quality assessment of plasma proteins for the possibilities of sharing common reference intervals (9).

Maybe the scarcity of such publications is more a question of interpretation of the data by plotting than a strict choice between scatter-plot and difference plot, as discussed by Stöckl (10) recently. Investigators seem to rely too much on regression lines and *r*-values, without doing the equally important interpretation of the data points of the plot. This is becoming more and more disadvantageous with the increasing number of Reference Methods available for comparison with field methods, because in these cases, it is not a question of finding some relationships, but simply of judging the field method to be acceptable or not.

NCCLS has recently published guidelines for method comparison and bias estimation by using patients’ samples (11), where both scatter-plots and bias plots are advised. The document also recommends plotting of single determinations as mean values and stresses the need of visual inspection of data. Further, comparison with performance criteria is recommended, but these criteria are not specified and they are not used in the graphical interpretation. Recently, Houbouyan et al. (12) used ratio plots in their validation protocol of analytical hemostasis systems, where they used a preset, but arbitrarily chosen, acceptance limit of inaccuracy of 15%.

In the following, we will use the difference plot (or bias plot) in combination with simple statistics for the principal judgment of the identity or acceptability of a field method. The difference plot makes it easier to apply the concept; in principle, however, the same evaluations could be performed for a scatter-plot in relation to the line of identity (*y* = *x*).

The aim of this contribution is to pay attention to the hypothesis of identity and the concept of acceptable analytical quality in method comparison, especially when one of the methods is a Reference Method.

## Basic considerations

The basis for comparing an analytical field method with an analytical Reference Method measuring the same quantity (analyte) is the hypothesis of identity within inherent imprecision or within preset analytical quality specifications. The field method, whether a kit or a self-produced analytical procedure, is applied for routine analyses in laboratory medicine. The method must be demonstrated to have adequate accuracy (trueness), precision, and specificity (lack of aberrant-sample bias (13)) for its intended use. The Reference Method, in contrast, is a “thoroughly investigated method, clearly and exactly describing the necessary conditions for procedures, for the measurement of one or more property values that has been shown to have accuracy and precision commensurate with its intended use and that can therefore be used to assess the accuracy of other methods for the same measurement, particularly in permitting the characterization of a reference material” (14).

Further characteristics of the methods are the costs, the complexity, the equipment, the time used for production of the results, and so forth—the field method generally being cheap and well suited for routine work and the Reference Method usually being applicable only in few competent and specially equipped laboratories.

The two approaches to the comparison are:

1. Identity. The results from the field method do not deviate from the Reference Method results by more than the inherent imprecision of both methods.

2. Analytical quality specifications. The results from the field method do not deviate from the Reference Method results by more than the acceptance specifications do from the analytical goals.

In both cases the null hypothesis is that the measured differences for all samples are zero. In the first case, the acceptance limits are defined by the inherent analytical imprecisions, and in the latter case the acceptance limits are defined by the analytical quality specifications.

By assuming the ideal situation, the theoretical limits for testing the null hypothesis can be drawn in a difference plot, i.e., a mean difference [mean(δ)] equal to 0, and a standard deviation of differences [σ(δ)] calculated from the imprecisions of the two methods. The ultimate hypothesis, then, is that the points are distributed within these bounds. If they fall outside of these bounds, the hypothesis is rejected. Alternatively, the performance characteristics are tested against defined analytical quality specifications.

## Acceptance limits defined by inherent analytical imprecision

Constant analytical standard deviations are presumed. A number of patients’ samples are used for the comparison of the field method with the Reference Method. The result of the *i*th patient sample by the two methods is *x*_{iF} and *x*_{iR} for field and Reference Method, respectively, and the difference, *x*_{id}, is *x*_{iF} − *x*_{iR}. Further, the theoretical variance of the differences is the sum of the two variances denoting the inherent imprecision of field and Reference Methods: σ^{2}(δ) = σ^{2}_{F} + σ^{2}_{R} (the theoretical σ values for the two methods being estimated independently for the two methods of the comparison, or estimated from replicate measurements during the comparison). When means of duplicates are used, then σ should be divided by √2.

When the two methods are identical, the expectation is that ∼68% of differences will be distributed symmetrically around 0 within 0 ± 1σ(δ), and 95% will be within 0 ± 1.96σ(δ). If the distribution of differences fits these criteria, then it is not possible to find any difference between results from the two methods within the inherent imprecision. If this is not the case, then the methods are not identical.

Before the measured points are plotted on any figure, the hypothesis can be illustrated in a difference plot, as shown in Fig. 1A for the example of S-creatinine. The horizontal line *y* = 0 illustrates the hypothesis of *x*_{id} = 0; the other lines indicate within 0 ± 1σ(δ) for 68% and within 0 ± 1.96σ(δ) for 95% of the differences, respectively. The outer lines indicate the 95% prediction interval for the expected distribution of points. In this example, σ_{F} = 3.10 and σ_{R} = 0.50 and, therefore, σ(δ) = 3.15 μmol/L.
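The lines of the hypothesis plot follow directly from the two imprecision estimates. The following is a minimal sketch, assuming independent errors in the two methods and the same number of replicates for both; the function names are illustrative and are not taken from the paper. For the S-creatinine example the result is about 3.14 μmol/L, which matches the 3.15 μmol/L quoted above to within rounding.

```python
import math


def sigma_of_differences(sigma_field, sigma_ref, replicates=1):
    """Theoretical SD of differences from the two method SDs.

    replicates=2 corresponds to comparing means of duplicates
    (assumption: both methods are averaged over the same number of replicates).
    """
    sigma_single = math.sqrt(sigma_field ** 2 + sigma_ref ** 2)
    return sigma_single / math.sqrt(replicates)


def hypothesis_lines(sigma_delta):
    """y-values of the horizontal lines drawn before any data are plotted."""
    return {
        "identity": 0.0,
        "68%": (-sigma_delta, sigma_delta),
        "95%": (-1.96 * sigma_delta, 1.96 * sigma_delta),
    }


# S-creatinine example from the text (umol/L)
sigma_delta = sigma_of_differences(3.10, 0.50)
lines = hypothesis_lines(sigma_delta)
print(f"sigma(delta) = {sigma_delta:.2f} umol/L")
print(f"95% prediction interval = 0 +/- {lines['95%'][1]:.2f} umol/L")
```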

It is educational to describe the hypothesis before plotting the points, because most investigators will start interpretation of the points in the form of functional relationships and thereby forget about the hypothesis.

With the hypothesis in mind, one can now plot the data points as shown in Fig. 1B. The data points are generated for Reference Method values between 50 and 150 μmol/L and computer-simulated gaussian-distributed differences based on mean(δ) = −0.5 μmol/L and σ(δ) = 3.00 μmol/L; from these simulated data the calculated mean(d) was −0.84 μmol/L and s(d) was 3.27 μmol/L. The data points are distributed roughly according to the hypothesis, with 16 points (70%) within 0 ± 1σ(δ) and 21 points (91%) within 0 ± 2σ(δ), leaving 2 points (9%) outside the limits. Because these two points do not seem to deviate much from the general distribution, the conclusion could be that finding just 2 points outside (and close to) the 95% prediction interval is expected and acceptable and, therefore, that the field method is indistinguishable from the Reference Method within the analytical imprecision, so the evaluation can stop. Statistically, the mean difference can be evaluated by a *t*-test and the distribution of differences by an *F*-test.
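The counting of points within the limits and the *t*-test on the mean difference can be sketched as follows. The list of differences is hypothetical and serves only to show the calculation; the summary statistics passed to `t_from_summary` are those quoted above for the simulated data. The function names are illustrative assumptions, not part of the paper.

```python
import math
import statistics


def fraction_within(differences, limit):
    """Fraction of differences lying within 0 +/- limit."""
    inside = sum(1 for d in differences if abs(d) <= limit)
    return inside / len(differences)


def t_from_summary(mean_d, s_d, n):
    """t statistic for the null hypothesis mean difference = 0."""
    return mean_d / (s_d / math.sqrt(n))


def t_statistic(differences):
    """Same statistic computed from the raw differences."""
    return t_from_summary(
        statistics.mean(differences),
        statistics.stdev(differences),
        len(differences),
    )


# Hypothetical differences (umol/L), for illustration only
example = [-4.0, -2.5, -1.0, -0.5, 0.0, 0.5, 1.5, 2.0, 3.5, 7.0]
sigma_delta = 3.15

print(fraction_within(example, sigma_delta))         # share inside 0 +/- 1 sigma
print(fraction_within(example, 1.96 * sigma_delta))  # share inside 0 +/- 1.96 sigma

# Summary statistics reported in the text: mean(d) = -0.84, s(d) = 3.27, n = 23
t = t_from_summary(-0.84, 3.27, 23)
print(f"t = {t:.2f}")  # compare with the two-sided 5% critical value for 22 df, about 2.07
```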

This approach is very narrow, and some uncertainty related to unknown factors may be taken into account. Such variations could be the variation between the two tubes/vials with serum from the same individual or the underestimation of σ_{F}. Therefore, possible additional sources of uncertainty always should be taken into account when appropriate. However, the design of a method comparison has to be carefully planned so as to exclude additional uncertainties. Further, any addition of “acceptable” uncertainty should be well thought through and handled with caution.

For the experienced scientist, interpreting the difference plot is easy. A more objective criterion, however, for graphical validation of the distribution of points, and especially the more extreme points, is to apply the concept of tolerance intervals, where the standard deviate (*z* or *c*) is replaced by a tolerance factor, *k*, with a value dependent on the percentage of points (here 95%) and the confidence with which this percentage should be obtained. The *k*-value is determined by the assumptions about the new distribution, whether the mean or the standard deviation is unknown, or both. The *k*-value further depends on the number of points, n (15)(16). Although we have a hypothesis about mean difference and standard deviation, these figures are unknown in practice, so the *k*_{7} for unknown mean and standard deviation may be the most relevant to use. For n = 23 and 95% confidence for 95% of the points, the tolerance factor, *k*_{7}, is 2.67 and the tolerance interval is 0 ± 2.67σ. This is illustrated in Fig. 1C, where all points are distributed within the chosen tolerance interval.
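A short sketch of the tolerance-interval check, assuming the tolerance factor is supplied by the user from a table (here the *k*_{7} = 2.67 quoted above for n = 23) and that σ is the theoretical σ(δ) = 3.15 μmol/L of the S-creatinine example. The helper names are hypothetical.

```python
def tolerance_interval(sigma_delta, k, center=0.0):
    """Return (lower, upper) limits of center +/- k * sigma_delta."""
    half_width = k * sigma_delta
    return center - half_width, center + half_width


def all_within(differences, interval):
    """True when every difference lies inside the interval."""
    low, high = interval
    return all(low <= d <= high for d in differences)


# k7 = 2.67 for n = 23 (95% of points, 95% confidence), as quoted in the text
interval = tolerance_interval(3.15, 2.67)
print(f"tolerance interval: {interval[0]:.2f} to {interval[1]:.2f} umol/L")
```

The tolerance factor is deliberately an input rather than a computed quantity, because the appropriate table depends on which of the mean and standard deviation are treated as unknown.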

The present approach of theoretical expected distribution is compared with the approach of Bland and Altman (4) in Fig. 1D by inserting the new lines determined by mean(d) ± 2 s(d) estimated from the data points.

The difference between the present concept and the Bland and Altman concept is clear from Fig. 1D. Accordingly, we illustrate the 95% prediction interval before any data points are applied, in contrast to Bland and Altman, who simply illustrate the statistics of the points. In this example the mean values are clearly different, whereas the standard deviations are rather close to each other.
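The two sets of lines can be compared numerically. This is a minimal sketch using the values given in the text for the simulated data set (mean(d) = −0.84, s(d) = 3.27, σ(δ) = 3.15 μmol/L); the function names are assumptions made for illustration.

```python
def bland_altman_limits(mean_d, s_d, factor=2.0):
    """Limits estimated from the observed differences: mean(d) +/- factor * s(d)."""
    return mean_d - factor * s_d, mean_d + factor * s_d


def hypothesis_limits(sigma_delta, z=1.96):
    """Limits fixed in advance by the hypothesis of identity: 0 +/- z * sigma(delta)."""
    return -z * sigma_delta, z * sigma_delta


observed = bland_altman_limits(-0.84, 3.27)
expected = hypothesis_limits(3.15)

print(f"Bland and Altman: {observed[0]:.2f} to {observed[1]:.2f} umol/L")
print(f"Hypothesis:       {expected[0]:.2f} to {expected[1]:.2f} umol/L")
```

The first pair of limits is centered on the observed mean and moves with the data, whereas the second pair is centered on zero and is known before any sample is measured.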

To illustrate the relations to *x–y* plots, we first add to Fig. 1B 11 points (triangles), as shown in Fig. 2 (top). The specimens producing these points are assumed to contain some “nonspecific” components [in the S-creatinine example, perhaps the specimens are from diabetics, where (e.g.) glucose could result in nonspecific reactions by the Jaffe methods]. In the difference plot they separate clearly from the other points and the mean difference and standard deviation change to +0.16 and 4.08 μmol/L, respectively. In the *x–y* plot (Fig. 2, bottom), where the points should be related to the line of identity (*y* = *x*), it is difficult to see the difference between the two sets of points. If we turn to calculation of *r*-values, *r* decreases from 0.993 to 0.991, the slope of the regression line changes from 1.020 to 1.025, and the intercept changes from −2.54 to −1.17 μmol/L.

The information from difference plots and *x–y* plots (when using the line of identity) is the same, but it is easier to expand the differences in the difference plot and the calculations of variances are simpler, whereas the 45° angle in the *x–y* plot makes comparable calculations more difficult. This is also seen from Westgard et al. (2)(3), where the simple calculation of (e.g.) bias is easier to interpret in combination with figures.

## Acceptance limits defined by goals for analytical quality

A more relevant approach for comparing a field method with a Reference Method is to use the analytical goals (analytical quality specifications) as acceptance limits. These specifications may be related to the clinical use of laboratory data (17)(18) or more generally to the application of common reference intervals (19)(20) and monitoring patients (21)(22). Two European groups, one under the auspices of EGE-Lab (European Group for the Evaluation of Reagents and Analytical Systems in Laboratory Medicine) (23)(24), and another group under the auspices of European EQA-Organizers (External Quality Assessment Organizers) (25), have given recommendations for analytical quality specifications based on the same biological concepts and with identical criteria for acceptable analytical bias and imprecision, but with a different concept for combining these—the first (EGE-Lab) defining the recommendations for analytical bias and imprecision separately and the latter (EQA-Organizers) combining these two aspects. In both European recommendations, the analytical quality specification for imprecision is CV_{analytical} ≤0.5 CV_{within-subject variation} as proposed by Cotlove et al. (21), and that for analytical bias is |B_{analytical} | ≤0.25 CV_{total biological variation} (19). The EGE-Lab concept accepts both a maximum bias and a maximum imprecision simultaneously; the EQA-Organizer concept describes a functional relationship between the two in the form of a maximum allowable combination of imprecision and bias.

The latter is close to the original concept of Gowans et al. (19), defining the acceptable analytical percentage bias for S-creatinine as 2.8% (25) when imprecision is negligible. According to the EGE-Lab concept (23)(24), however, both a bias of 2.8% and an imprecision of 2.2% are acceptable simultaneously. This means that for single determinations 0 ± (bias + 1.65 × imprecision) is acceptable (26); i.e., 95% of the single points must lie within the limits of 0 ± (2.8% + 1.65 × 2.2%) = 0 ± 6.4%, as illustrated in Fig. 3 (left). This criterion is fulfilled as shown in Fig. 3 (middle). The concept, however, is one-sided, i.e., is valid only in one direction. Therefore, the standard deviation of differences should be judged against the imprecision criterion separately. The standard deviation of the field method is 3.1 μmol/L and the CV = 3.9%, which exceeds the imprecision specification of 2.2%.
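The combined limit for single determinations and the separate imprecision check can be written out as a short calculation. This is a sketch under the values stated above (bias 2.8%, imprecision 2.2%, field-method CV 3.9%); the function names and the conversion to absolute units at a chosen concentration are illustrative additions.

```python
def single_result_limit(bias_pct, cv_pct, z=1.65):
    """Half-width (in %) of the acceptance area for single determinations."""
    return bias_pct + z * cv_pct


def to_absolute(limit_pct, concentration):
    """Convert a percentage limit to concentration units at a given level."""
    return limit_pct / 100.0 * concentration


def meets_imprecision_spec(cv_observed_pct, cv_spec_pct):
    """Separate check of imprecision, because the combined limit alone is one-sided."""
    return cv_observed_pct <= cv_spec_pct


limit_pct = single_result_limit(2.8, 2.2)
print(f"acceptance limits: 0 +/- {limit_pct:.1f}%")
print(f"at 100 umol/L: 0 +/- {to_absolute(limit_pct, 100.0):.1f} umol/L")
print("imprecision acceptable:", meets_imprecision_spec(3.9, 2.2))
```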

Figure 3 (right) illustrates an example of acceptable analytical imprecision and bias. The CV is 2.1% and the estimated bias (mean difference) is +1.3 μmol/L (95% confidence interval, 0.6–2.0 μmol/L). Note that the mean difference is different from 0 but is acceptable according to the criterion.

For the purpose of method comparison, the value for maximum allowable bias might be expanded because of the uncertainty (confidence interval) of the Reference Method. This cannot be seen from the actual comparison, but because the Reference Method is allowed to have some uncertainty, then this must be allowed also for the field method. We propose using the factor 1.2, in light of a recent concept that requires for Reference Methods a total error of <0.2 times that of the routine method (27). In the present case, the acceptance limits for bias should thus be 0 ± 3.36%, as also used in the example below.

## Two examples utilizing comparison data

The data used are from a paper on a candidate reference method for determining S-creatinine, which was used for comparison of four field methods (28). Data from two of these comparisons are used, but only for concentration values <150 μmol/L, the only region in which there are sufficient data for our presentation.

*Practical example 1.*

Based on the duplicate analyses performed, analytical imprecision is calculated within the interval 50–150 μmol/L for both the Reference Method (HPLC) and the field method, giving σ_{R} = 0.568 μmol/L and σ_{F} = 0.791 μmol/L. When means of duplicates are used for the comparison, the calculated theoretical σ(δ) should be divided by √2, giving an expected distribution of 95% (the 95% prediction interval) of the points within 0 ± 2σ(δ) = 0 ± 1.4 μmol/L. Further, the expanded allowable bias of 3.36% is used for the analytical quality specifications according to both EGE-Lab and EQA-Organizers concepts.
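The arithmetic of this example can be retraced in a few lines. This is a sketch assuming independent errors in the two methods; the variable names are illustrative, and the expanded bias is simply the factor 1.2 applied to the 2.8% bias specification, as proposed earlier in the text.

```python
import math

sigma_ref = 0.568    # umol/L, Reference Method (HPLC)
sigma_field = 0.791  # umol/L, field method

# SD of differences between single determinations
sigma_delta_single = math.hypot(sigma_field, sigma_ref)

# SD of differences between means of duplicates
sigma_delta_duplicates = sigma_delta_single / math.sqrt(2)

# Half-width of the 95% prediction interval, using the factor 2 as in the text
prediction_half_width = 2 * sigma_delta_duplicates

# Allowable bias expanded for the uncertainty of the Reference Method
expanded_bias_pct = 1.2 * 2.8

print(f"sigma(delta), duplicates: {sigma_delta_duplicates:.2f} umol/L")
print(f"95% prediction interval: 0 +/- {prediction_half_width:.1f} umol/L")
print(f"expanded allowable bias: {expanded_bias_pct:.2f}%")
```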

The graphical evaluations are shown in Fig. 4 (top). Here, the calculated distribution is considerably broader than the one assumed from the analytical point of view, whereas the measured mean bias is within the limits of acceptance of the bias because the mean (−1.6) and 95% confidence interval (−0.5 to −2.7 μmol/L) are within the 3.36% allowable bias. Further, the points are distributed within the total EGE-Lab criteria with only one real outlier. The difference between the present concept of 95% prediction interval and the Bland and Altman description of the actual points is clear from the Figure.

At first glance, the field method should be acceptable, with CV = 1% and bias <3.36%, but the distribution of points (Fig. 4, top) reveals a much broader distribution (CV = 5%), which emphasizes an unknown uncertainty. The problem is further underlined by adding single determinations to the measurements in the Figure, illustrating the reproducibility of the individual differences. This uncertainty is close to 5% and may originate not only from vial-to-vial variation but also from aberrant-sample bias, whether from nonspecific reactions or interference in the field method. Thus the difference plot and calculation of the standard deviation of differences are tools to disclose aberrant-sample bias in field methods.

Fuentes-Arderiu and Fraser (30) have proposed that the combined effects of imprecision and interference should be used in the concept of specifications for imprecision, but the problem has not been dealt with in either the EGE-Lab or EQA-Organizer concepts.

*Practical example 2.*

In the other example (Fig. 4, bottom), all points are displaced from the acceptance area, showing a considerable mean bias (∼20 μmol/L) and also considerable uncertainty from aberrant-sample bias. The calculated imprecision, based on assays of duplicates, was ∼1%, but the difference plot reveals the errors clearly.

The CLIA criteria are much wider than the European recommendations but, being total error specifications, are easy to apply in the plot. For concentrations <175 μmol/L the acceptable deviation is ±26.5 μmol/L (±0.3 mg/dL), and ±15% at higher concentrations. The upper line of the CLIA criterion is shown in Fig. 4 (bottom). Because 7 of the 54 points are outside the line, the method is also unacceptable from the standpoint of proficiency testing.
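The CLIA limit as described here can be expressed as a piecewise function of the Reference Method value. The sketch below follows the threshold stated in the text (175 μmol/L); the function names and the example pairs are hypothetical and are not data from the comparison.

```python
def clia_allowable_deviation(reference_umol_l, threshold=175.0,
                             fixed=26.5, fraction=0.15):
    """Allowable total error (umol/L) at a given Reference Method value.

    Below the threshold a fixed deviation applies; at or above it,
    a fixed fraction of the concentration applies (values from the text).
    """
    if reference_umol_l < threshold:
        return fixed
    return fraction * reference_umol_l


def count_outside(pairs):
    """Count (reference, field) pairs whose difference exceeds the CLIA limit."""
    return sum(
        1
        for reference, field in pairs
        if abs(field - reference) > clia_allowable_deviation(reference)
    )


# Hypothetical (reference, field) results in umol/L, for illustration only
pairs = [(100.0, 120.0), (100.0, 130.0), (200.0, 235.0)]
print(count_outside(pairs))
```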

## General discussion

In the current discussion of difference plot vs *x–y* plots and the application of regression lines or functions, evaluations of data (vis-à-vis best relationship and comparison of a field method with a Reference Method) have often been mixed, as pointed out recently by Stöckl (10). Stöckl emphasizes that graphical presentation of data (*x–y* plot or difference plot) should not be mixed with statistical interpretation of data.

As long as the purpose is to find the best functional relationship between two methods in order to correct one with the other, then an *x–y* plot and calculation of the regression line with an estimation of the scatter via *s*_{y|x} may be most relevant, with visual inspection of the scatter and a residual plot. The present task, however, is not to find a functional relationship between two methods, but to judge a field method in relation to a Reference Method.

When a field method is compared with a Reference Method for acceptability according to certain criteria, whether analytical or biological, then the visual inspection of all data is essential. Whether one uses a simple *x–y* plot or a difference plot is not critical, as long as the area of interest is expanded and the single points are assessed according to the hypothesis of identity between the two methods. The hypothesis is that the measured values are identical (and not that they are unrelated, which is the basic hypothesis for correlation studies), which means that the hypothesis is described in an *x–y* plot as the line of identity (*y* = *x*) and in the difference plot as the line *y* = 0. When a ratio plot is more appropriate than a difference plot, e.g., when analytical CV is close to being constant, the same evaluations can be performed.

The advantages of difference plots (and ratio plots) are keeping the hypothesis of identity in mind and the ease of expanding the difference ordinate according to the investigator’s purpose. The power of the graphical illustration in Figs. 1A and 3 (left) lies in the simplicity and the clear definition of the hypotheses.

For the experienced interpreter, most situations can be evaluated by visual inspection of the plots, whether a difference plot or an *x–y* plot. If more objective criteria are wanted, calculations of mean difference with confidence intervals are a powerful tool, as is a table of *k*_{7}-values^{5} for estimation of tolerance intervals.

Most important, however, is the inspection of the distribution of the difference points, especially when samples expected to have matrix effects are marked, and calculation of the standard deviation of differences. When this exceeds the estimated analytical imprecision, it is an indication of aberrant-sample bias. In principle, the *r*-value from correlation between *x* and *y* reflects this, but in practice the *r*-value is insensitive, as illustrated in Fig. 2. Further, calculation of the standard deviation of differences gives an estimate of aberrant-sample bias compared with the theoretical imprecision. The same information can be obtained from the *s*_{y|x} estimate from regression analysis.
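One way to put a number on this comparison is to subtract the theoretical variance from the observed variance of the differences. This is only a sketch, resting on the assumption (not stated in the paper) that aberrant-sample bias and analytical imprecision are independent and therefore additive as variances; `excess_sd` is a hypothetical helper name.

```python
import math


def excess_sd(s_d, sigma_delta):
    """SD component of the differences not explained by analytical imprecision.

    Assumes independent, additive variance components; returns 0.0 when the
    observed SD does not exceed the theoretical one.
    """
    excess_variance = s_d ** 2 - sigma_delta ** 2
    return math.sqrt(excess_variance) if excess_variance > 0 else 0.0


# Practical example 1, in CV units: observed spread about 5%, analytical CV about 1%
print(f"unexplained component: {excess_sd(5.0, 1.0):.1f}%")
```

Under this assumption the unexplained component in practical example 1 stays close to 5%, in line with the uncertainty described there.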

The plotting of single determinations can give information about imprecision and aberrant-sample bias as well and, if needed, a functional curve can be calculated and drawn. In this context we mention that replicate measurements always should be performed in this type of comparison, and that specimens should be stored for evaluation of possible outliers.

Krouwer and Monti (29) presented a graphical method for evaluation of laboratory assays (a mountain plot). They computed the percentile for each ranked difference between the two methods, and by “turning” at the 50th percentile produced a histogram-like function (the mountain). This method is relevant for detecting large infrequent errors (differences) but lacks the aspect of concentration relationship. These investigators, therefore, recommend use of their plot together with difference plots. Introduction of analytical quality specifications in the mountain plots may be useful in method evaluations.
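The folding step of the mountain plot described above can be sketched as follows. The percentile convention (rank − 0.5)/n is an assumption chosen here so that the folded curve is symmetric; the original publication may use a different convention, and the function name is illustrative.

```python
def mountain_plot_points(differences):
    """Return (difference, folded percentile) pairs for a mountain plot.

    Differences are ranked, each is given a percentile, and percentiles
    above 50 are replaced by 100 minus the percentile ("turning" at the median).
    """
    ranked = sorted(differences)
    n = len(ranked)
    points = []
    for rank, difference in enumerate(ranked, start=1):
        percentile = 100.0 * (rank - 0.5) / n  # assumed plotting position
        folded = percentile if percentile <= 50.0 else 100.0 - percentile
        points.append((difference, folded))
    return points


# Hypothetical differences, for illustration only
for difference, height in mountain_plot_points([3, -1, 0, 2]):
    print(difference, height)
```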

Bland and Altman have pointed to the presentation of difference plots, where they recommend mean values of both methods on the abscissa (4); however, the risk of regression towards the mean is negligible in studies where field methods are compared with Reference Methods, because the results from the Reference Method (used on the abscissa) are assumed to have negligible error. They further calculate and present the standard deviation of the measured data, which is relevant information but not related to the hypothesis of identity.

This form of graphical testing of the hypothesis of identity has been used for biological data (8), where the standard deviation of measured differences was compared with the analytical imprecision. A more stringent method of testing measured values from field methods against target values has been applied for plasma proteins in The Nordic Protein Project (9). Here, the target values of serum pools were assigned from the European Community Bureau of Certified Reference Material, CRM 470 (31), by the method recommended by IFCC (32), and were used for the abscissa in a plot with acceptance lines according to acceptable bias (9) and with the measurements illustrated by the mean difference with a 90% confidence interval.

Another graphical method of prediction of single measurements of clinical data has been published for serial measurements of International Normalized Ratios of prothrombin times (33). Here, the differences between consecutive measurements in patients were compared with the expected variation estimated from patients under steady-state conditions. The abscissa was used for the latest result, resulting in a regression towards the mean, which was considered acceptable for the purpose (i.e., not to investigate correlations). The usefulness of the nomogram was improved by adding vertical lines for the therapeutic interval.

The goals used for acceptance are relevant for the use of common reference intervals (19)(20) and have been recommended by two European groups (23)(24)(25), with different consequences for the acceptance. This problem is related to the phenomenon of aberrant-sample bias (matrix effects) and has not been fully clarified by the proposed goal from either EGE-Lab or EQA-Organizers; Fuentes-Arderiu and Fraser (30), however, have proposed that the aberrant-sample bias be included in the precision goal. Other analytical goals have been postulated (17)(18)(34) and may be relevant for other evaluations. Goals based on biological data, however, are general in nature (related to common reference intervals and monitoring of patients) and are not restricted to a specific clinical application.

The CLIA criteria for S-creatinine are ±0.3 mg/dL (3 mg/L, or 26.5 μmol/L) or ±15% for concentrations >175 μmol/L (28). These are criteria for total error, but they are very wide compared with the European recommendations. The CLIA criteria are intended for use with proficiency testing (and therefore need to be wide), whereas the European recommendations are so-called educational criteria, which relate directly to the desirable performance criteria for optimum monitoring of patients and the sharing of common reference intervals within geographical areas with populations that are homogeneous for the quantity of analyte. The criteria proposed by Ehrmeyer et al. (35) for minimum intralaboratory performance characteristics to pass CLIA (CV <33% and bias <20%, proficiency testing criteria) can be applied as well as the European criteria, but for the latter the total error will be determining. The CLIA criteria may be applied even better in difference plots, given the total error concept.

The validity of the analytical conclusions of this type of evaluation relies on the Reference Method chosen. It must be correct and specific and so forth, with negligible imprecision; otherwise, the conclusions of the comparison will be weakened according to any possible flaws of the Reference Method.

In conclusion, we find the visual inspection of plots to be essential for method comparison. When a field method is compared with a Reference Method, the hypothesis of identity within analytical imprecision or within stated analytical quality specifications should be applied. Both *x–y* plots and difference plots are useful, but we find the difference plot is easier to handle and interpret and facilitates the calculations of uncertainty.

## Footnotes

Departments of ^{1}Clinical Chemistry and ^{4}Medical Gastroenterology, Odense University Hospital, DK-5000 Odense C, Denmark. ^{2}Laboratorium voor Analytische Chemie, Faculteit Farmaceutische Wetenschappen, Universiteit Gent, Harelbekestraat 72, B-9000 Gent, Belgium.

^{5}Some k_{7}-values for 95% tolerance factors for the 95% interval: 3.38 for n = 10; 2.75 for n = 20; 2.55 for n = 30; 2.38 for n = 50; 2.30 for n = 70; and 2.23 for n = 100.

- © 1997 The American Association for Clinical Chemistry