## Abstract

**BACKGROUND:** In clinical chemistry, quality control (QC) often relies on measurements of control samples, but limitations, such as a lack of commutability, compromise the ability of such measurements to detect out-of-control situations. Medians of patient results have also been used for QC purposes, but it may be difficult to distinguish changes in the patient population from analytical errors. This study aims to combine traditional control measurements and patient medians to facilitate the detection of biases.

**METHODS:** The software package “rSimLab” was developed to simulate measurements of 5 analytes. Internal QC measurements and patient medians were assessed for detecting impermissible biases. Various control rules combined these parameters. A *Westgard*-like algorithm was evaluated and new rules that aggregate Z-values of QC parameters were proposed.

**RESULTS:** Mathematical approximations estimated the required sample size for calculating meaningful patient medians. The appropriate number was highly dependent on the ratio of the spread of sample values to their center. Instead of applying a threshold to each QC parameter separately like the Westgard algorithm, the proposed aggregation of Z-values averaged these parameters. This behavior was found beneficial, as a bias could affect QC parameters unequally, resulting in differences between their Z-transformed values. In our simulations, control rules tended to outperform the simple QC parameters they combined. The inclusion of patient medians substantially improved bias detection for some analytes.

**CONCLUSIONS:** Patient result medians can supplement traditional QC, and aggregations of Z-values are novel and beneficial tools for QC strategies to detect biases.

Internal quality control (QC)^{4} ensures analytical quality in clinical chemistry. To prevent the release of erroneous results and ultimately avert patient harm, medical laboratories allocate considerable human and financial resources to these procedures. Over more than 60 years, QC has progressed from simple control charts (1) to statistically planned QC strategies (2). Based on quality goals for the individual laboratory test (3, 4) and actual performance (5), appropriate rules can be selected from a pool of established procedures. Groundbreaking work in this area has been contributed by Westgard, who developed rules to evaluate multiple control measurements together (6–8). Other parameters to adjust the QC strategy to individual needs include the frequency of QC events (9) and triggers (e.g., lot changes) (10, 11).

Traditionally, QC of biochemical tests relies on the measurement of well-characterized control samples. The quantity of the analyte in these samples is determined using methods of a higher metrological order. Therefore, comparison of measured and assigned values can reveal systematic and random errors of the test. However, these materials suffer from several drawbacks. QC samples often constitute a considerable amount of the total price of an analysis (12). Moreover, the central assumption that QC material exhibits the same behavior as patient samples does not hold true in many cases (13). Miller et al. compared patient and QC samples with different reagent lots and found significant differences between sample types in 40.9% of cases (14). In a similar study, Cho et al. found a significant difference in only 7.8% (15). Despite the discrepancy regarding the frequency of differences, both studies illustrate serious limitations of QC material. A lack of commutability is also frequently observed in external quality assessment schemes that use samples similar to QC materials (16, 17).

The test results of patient samples are increasingly being used for QC purposes (18–22). The true value of patient samples is unknown. However, if comparable and sufficiently large patient populations are repeatedly tested, a central location parameter (e.g., the median) of the distribution of measurements should remain constant within certain bounds. Calculated parameters do not generate costs for additional tests or materials. Because no sample is tested repeatedly, precision is hard to evaluate with patient samples. Moreover, the patient population cannot be controlled by the laboratory. Changes in the location parameter that result from changes in the patient population cannot be readily distinguished from changes caused by errors in the laboratory test. A systematic preanalytic error (e.g., during transport or centrifugation (23)) can influence many patient samples and shift the location parameter without changing control measurements.

This study provides estimates of the utility of daily patient medians for QC. Moreover, we introduce several algorithms that aggregate QC measurements and daily patient medians into a single parameter. All procedures were evaluated in simulations of laboratory tests and errors.

## Methods

### ESTIMATION OF THE UTILITY OF PATIENT MEDIANS FOR QUALITY CONTROL

To avoid relying on analytical performance specifications such as the allowable total error (*TE _{a}*), we calculated the number of patient samples required for the standard error of the median to reach the same magnitude as the standard error of a single QC measurement. For measurements in which patient values follow a normal or lognormal distribution, the required number can be estimated with Eqs. 1 and 2, respectively:
(1)

(2)

σ_{t}, μ, and *CV*_{t} are the standard deviation, mean, and coefficient of variation of the distribution of patient analyte true values, respectively. σ_{a} and *CV*_{a} express analytical imprecision as standard deviation and coefficient of variation, respectively (see Methods and Results in the Data Supplement that accompanies the online version of this article at http://www.clinchem.org/content/vol63/issue8).

### Z-VALUES FOR QUALITY CONTROL RULES

In this work, daily measurements of internal QC material and daily medians are considered as simple QC parameters. Control rules aggregate these parameters into a combined measure. All control rules presented here operate on Z-values that are calculated from a simple QC parameter *X* according to the following formula:
$$Z = \frac{X - \mu_{\mathrm{stable}}}{\sigma_{\mathrm{stable}}} \tag{3}$$

The standard deviation σ_{stable} and the mean μ_{stable} can be calculated from previous measurements that are assumed to have been conducted under stable conditions. Parameters of control materials are often also provided by the manufacturer.

### WESTGARD-LIKE ALGORITHM

Daily Z-values of simple QC parameters (internal QC measurements or daily medians) were combined with the following algorithm:

*IF (at least n = 2 Z-values are positive) { pos_z = minimum of the n = 2 largest Z-values }*

*ELSE { pos_z = 0 }*

*IF (at least n = 2 Z-values are negative) { neg_z = maximum of the n = 2 smallest Z-values }*

*ELSE { neg_z = 0 }*

*RETURN maximum of absolute values of pos_z and neg_z*

To the best of our knowledge, this algorithm has not been described before. For Z-values from 2 control measurements as input, it closely resembles “Westgard-Rules” of the form 2_{2S} or 2_{3S}, which are often applied to detect biases. The innovation of this approach is that it avoids specifying a threshold value for standard deviations away from the mean. Therefore, the general classification ability of this type of combination can be evaluated. When Z-values from 3 simple control parameters are included but n = 2, this algorithm resembles “Westgard-Rules” of the form 2 *of* 3_{2S} or 2 *of* 3_{3S}. We refer to this type of control rule as “Westgard-like algorithm” and abbreviate it as “west” if daily patient medians are included and as “west.QC” if not.
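The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the rSimLab implementation; the function name `westgard_like` and its signature are assumptions made here.

```python
def westgard_like(z_values, n=2):
    """Combine daily Z-values into one score, following the pseudocode above."""
    positives = sorted((z for z in z_values if z > 0), reverse=True)
    negatives = sorted(z for z in z_values if z < 0)
    # Minimum of the n largest Z-values, if at least n Z-values are positive.
    pos_z = positives[n - 1] if len(positives) >= n else 0.0
    # Maximum of the n smallest Z-values, if at least n Z-values are negative.
    neg_z = negatives[n - 1] if len(negatives) >= n else 0.0
    return max(abs(pos_z), abs(neg_z))
```

The returned score is the magnitude that at least n Z-values on the same side of the mean reach. Comparing it with a limit of 2 or 3 gives a decision of the 2_{2S} or 2_{3S} type, but the score itself carries no fixed threshold.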

### WEIGHTED AND UNWEIGHTED AGGREGATION OF Z-VALUES

In addition, unweighted Z-values were aggregated using Stouffer's method (24):

$$Z_{\mathrm{aggr}} = \frac{\sum_{i=1}^{k} Z_i}{\sqrt{k}} \tag{4}$$

where *Z*_{i} are the Z-values of the *k* simple QC parameters. A weighted method aggregates the individual Z-values as follows:

$$Z_{\mathrm{w}} = \frac{\sum_{i=1}^{k} w_i Z_i}{\sqrt{\sum_{i=1}^{k} w_i^{2}}} \tag{5}$$

Different weights *w _{i}*, based on sample size, effect size, or estimated standard error, have been proposed (25). Here, the probability of each simple QC parameter to detect an out-of-control situation is used (see Methods and Results in the online Data Supplement). We denominate these methods as weighted and unweighted aggregation of Z-values. The methods that include patient medians are abbreviated “zWAggr” and “zAggr,” whereas methods that aggregate only internal QC measurements are abbreviated as “zWAggr.QC” and “zAggr.QC.”
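The Z-transformation and the two aggregations (Eqs. 3 to 5) can be sketched as below. The sketch assumes the usual forms of Stouffer's method (sum of Z-values divided by the square root of their count) and of its weighted variant (weighted sum divided by the square root of the sum of squared weights). Function names are hypothetical, and the weights in any call are placeholders rather than the detection probabilities derived in the Data Supplement.

```python
import math


def z_value(x, mean_stable, sd_stable):
    """Z-transform a simple QC parameter using stable-phase mean and SD."""
    return (x - mean_stable) / sd_stable


def stouffer(z_values):
    """Unweighted aggregation: sum of Z-values over sqrt of their count."""
    return sum(z_values) / math.sqrt(len(z_values))


def weighted_stouffer(z_values, weights):
    """Weighted aggregation: weighted sum over sqrt of the sum of squared weights."""
    if len(z_values) != len(weights):
        raise ValueError("z_values and weights must have the same length")
    numerator = sum(w * z for w, z in zip(weights, z_values))
    return numerator / math.sqrt(sum(w * w for w in weights))
```

With equal weights, the weighted form reduces to the unweighted one, which is a convenient sanity check for any chosen weighting scheme.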

### COMPARISON OF THE WESTGARD-LIKE ALGORITHM AND THE AGGREGATION OF Z-VALUES UNDER SIMPLIFIED CONDITIONS

An out-of-control situation can be caused by a bias shifting all QC measurements by the equivalent of 2 Z-values (Fig. 1A). Other biases might affect Z-values unequally (Fig. 1B).
$$\Delta Z = Z(d_1) - Z(d_2) \tag{6}$$

Eq. 6 expresses the difference between the shifts at QC measurements 1 and 2 normalized to Z-values (Fig. 1C). To allow comparison of the Westgard-like algorithm and the unweighted aggregation of Z-values, simplified conditions were assumed. In this simulation, the mean of the shifts *Z*(*d*_{1}) and *Z*(*d*_{2}) was kept at 2 Z-values, but ΔZ was increased in steps of 0.01 from 0 to 3. In each step, 500 pairs of Z-values were generated for the stable and for the out-of-control situation. The areas under the receiver operating characteristic curves (AUC) of both algorithms were subsequently calculated.
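A toy version of this comparison is sketched below. Several details are assumptions made here, not taken from the article: Z-values are drawn with unit standard deviation, the two shifts are placed symmetrically around their mean, and the aggregated Z-value is used as an absolute value. The absolute AUCs from this setup are therefore not expected to match those reported; only the qualitative trend is illustrated.

```python
import math
import random


def westgard_like(z_values, n=2):
    """Westgard-like score: magnitude reached by at least n same-signed Z-values."""
    positives = sorted((z for z in z_values if z > 0), reverse=True)
    negatives = sorted(z for z in z_values if z < 0)
    pos_z = positives[n - 1] if len(positives) >= n else 0.0
    neg_z = negatives[n - 1] if len(negatives) >= n else 0.0
    return max(abs(pos_z), abs(neg_z))


def stouffer_abs(z_values):
    """Unweighted aggregation of Z-values; the absolute value is an assumption."""
    return abs(sum(z_values)) / math.sqrt(len(z_values))


def auc(stable_scores, ooc_scores):
    """Share of (out-of-control, stable) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in ooc_scores:
        for q in stable_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(ooc_scores) * len(stable_scores))


def compare_rules(delta_z, mean_shift=2.0, pairs=500, seed=0):
    """AUC of both rules for one value of delta_z (difference between the shifts)."""
    rng = random.Random(seed)
    shift_1 = mean_shift + delta_z / 2.0
    shift_2 = mean_shift - delta_z / 2.0
    stable = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(pairs)]
    ooc = [(rng.gauss(shift_1, 1), rng.gauss(shift_2, 1)) for _ in range(pairs)]
    rules = (("west.QC", westgard_like), ("zAggr.QC", stouffer_abs))
    return {
        name: auc([rule(p) for p in stable], [rule(p) for p in ooc])
        for name, rule in rules
    }


equal = compare_rules(delta_z=0.0)    # both QC measurements shifted by 2
unequal = compare_rules(delta_z=3.0)  # shifts of 3.5 and 0.5
```

Because the sum of the two shifts is the same for every ΔZ, the aggregated score keeps its separation from the stable phase, whereas the Westgard-like score is limited by the smaller of the two shifts.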

### SIMULATION OF VALUES AND MEASUREMENTS

We have created a dedicated software package “rSimLab” implemented in the “R” programming language (26) to simulate measurements in laboratory medicine. “rSimLab” is available free of charge at https://github.com/acnb/rSimLab. The design principle was to initially generate patient and QC samples with known true values as described in the following paragraph. Measurements of these samples were then simulated using characteristics of the respective assays (Fig. 2).

Daily measurements were simulated for 5 analytes—albumin, hemoglobin A1c (HbA1c), testosterone, troponin I, and vitamin D3 (Table 1). Together, these parameters represent a broad spectrum of analytical properties relevant for QC. The daily number of measurements followed a normal distribution. For each measurement, a true value was drawn from a normal, lognormal, or bimodal distribution (27). The center and spread of each analyte were modeled to resemble measurements stored in our laboratory information system at the university hospital of the Technische Universität München. Seasonal trends were simulated by changing the center of the distribution. At least 2 QC measurements with a fixed true value were simulated for each day.
(7)

Precision was specified using the so-called characteristic function (Eq. 7), with *c* being the concentration and σ_{c} being the standard deviation at this concentration (28). This function models a constant imprecision close to the limit of detection and an imprecision relative to the measured value at higher values (see Methods and Results and Fig. 1 in the online Data Supplement).
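The behavior described for the characteristic function can be sketched with one commonly used form, in which a constant term and a proportional term are combined in quadrature. Whether Eq. 7 uses exactly this parameterization is an assumption; the parameter names `sd_zero` and `cv_high` and both function names are hypothetical.

```python
import math
import random


def characteristic_sd(c, sd_zero, cv_high):
    """SD at concentration c: about sd_zero near zero, about cv_high * c at high c."""
    return math.sqrt(sd_zero ** 2 + (cv_high * c) ** 2)


def simulate_measurement(true_value, sd_zero, cv_high, rng=random):
    """Draw one measurement around a known true value with concentration-dependent SD."""
    sd = characteristic_sd(true_value, sd_zero, cv_high)
    return rng.gauss(true_value, sd)
```

Near zero concentration the standard deviation approaches `sd_zero`, and at high concentrations the coefficient of variation approaches `cv_high`, which matches the two regimes described above.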

### SIMPLE QUALITY CONTROL PARAMETERS

Levels of QC measurements were chosen to be similar to those in our laboratory, with at least 1 normal and 1 pathological control. Controls were measured only once daily. No bracketing was used, as is common practice in German laboratories. Medians were calculated daily from a varying number of samples.

### SIMULATION OF BIASES

Bias in clinical chemistry can have various causes, such as operator errors or unstable reagent lots (13, 29, 30). In this simulation, relative and absolute biases were modeled. They increased steadily over time or occurred abruptly. The size of biases was adjusted to exceed out-of-control specifications on approximately 5% of all days (see Fig. 3 and Methods and Results in the online Data Supplement).

### SIMULATION OF DISTURBANCES

To test the robustness of QC procedures, several disturbances were simulated that affect the simple QC parameters but not the patient sample measurements. Internal QC measurements were disturbed by a constant bias, by a relative bias, or by increased imprecision to simulate a lack of commutability (14, 15, 31). Patient medians were disturbed by a constant increase in all patient true values or by a removal of one-third of all samples to model changes in the patient population (see Methods and Results in the online Data Supplement).

### STEPS OF SIMULATION

Each run of a simulation started with a stable phase that was used to derive mean and standard deviation for the calculation of Z-values in control rules. Decision thresholds were determined in a first error-prone phase with simulated biases. The performance of simple QC parameters, of the Westgard-like algorithm, of the weighted, and of the unweighted aggregation of Z-values, was evaluated in an additional independent error-prone phase. All control rules were investigated using internal QC measurements alone and combining QC measurements and patient medians. Finally, the same control rules were tested in an error-prone phase with disturbances.

In each phase, 730 days were simulated to accommodate all simulated biases. For each analyte, 200 independent simulation runs were conducted to detect a difference of 0.01 in AUC values with a power of at least 95%.

To make this simulation reproducible (32), the full code is provided in a Code file in the online Data Supplement.

### PERFORMANCE EVALUATION

The German RiliBAEK guideline states a “permissible relative deviation” and the “applicable concentration intervals” for the most important analytes (33). The permissible relative deviations were treated as *TE _{a}*. Values outside the applicable concentration interval were not considered for performance evaluation. A day was regarded as out-of-control if >5% of patient measurements exceeded this quality requirement. Other ratios were tested in a sensitivity analysis (see Methods and Results and Fig. 2 in the online Data Supplement).
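The day-level classification can be sketched as follows. The sketch assumes that the relative deviation is taken against the simulated true value and that the applicable concentration interval is applied to true values with a positive lower limit; both points, as well as the function name and the behavior for days without eligible samples, are assumptions made here.

```python
def is_out_of_control(true_values, measured_values, te_a, interval, max_fraction=0.05):
    """Return True if more than max_fraction of eligible results exceed TEa."""
    low, high = interval
    # Keep only samples inside the applicable concentration interval (assumed: true value).
    eligible = [(t, m) for t, m in zip(true_values, measured_values) if low <= t <= high]
    if not eligible:
        return False  # assumption: a day without eligible samples counts as stable
    exceeding = sum(1 for t, m in eligible if abs(m - t) / t > te_a)
    return exceeding / len(eligible) > max_fraction
```

The `max_fraction` argument corresponds to the 5% criterion and can be varied to mimic the sensitivity analysis with other ratios.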

The AUC metric was used to express the ability to correctly classify out-of-control days regardless of thresholds (34). Concrete decision limits were determined based on maximum Youden Index and on 90% sensitivity (often called “probability of error detection”) (8) in an undisturbed, error-prone phase (35). For both thresholds, sensitivity, specificity, and balanced accuracy were calculated in independent error-prone simulation phases with and without disturbances.
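The two ways of fixing a decision limit can be illustrated with a small sketch. It assumes that a day is flagged when its score is at or above the threshold and that candidate thresholds are the observed scores; function names are hypothetical, and this is not the code from the Data Supplement.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity and specificity when days with score >= threshold are flagged."""
    n_pos = sum(1 for ooc in labels if ooc)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both stable and out-of-control days are required")
    tp = sum(1 for s, ooc in zip(scores, labels) if ooc and s >= threshold)
    tn = sum(1 for s, ooc in zip(scores, labels) if not ooc and s < threshold)
    return tp / n_pos, tn / n_neg


def youden_threshold(scores, labels):
    """Threshold that maximizes the Youden Index J = sensitivity + specificity - 1."""
    def youden_j(t):
        sens, spec = sensitivity_specificity(scores, labels, t)
        return sens + spec - 1.0
    return max(sorted(set(scores)), key=youden_j)


def sensitivity_threshold(scores, labels, target=0.90):
    """Highest threshold whose sensitivity is still at least `target`."""
    for t in sorted(set(scores), reverse=True):
        sens, _ = sensitivity_specificity(scores, labels, t)
        if sens >= target:
            return t
    raise ValueError("scores must not be empty")


def balanced_accuracy(scores, labels, threshold):
    """Mean of sensitivity and specificity at the given threshold."""
    sens, spec = sensitivity_specificity(scores, labels, threshold)
    return (sens + spec) / 2.0
```

In the study design, thresholds obtained this way in one error-prone phase are then applied unchanged to independent phases, so that the reported sensitivity, specificity, and balanced accuracy are not estimated on the data used for tuning.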

The Wilcoxon rank sum test was used to compare AUC values from different control rules. Differences with a *P*-value of <0.01 were considered statistically significant.

## Results

### NUMBER OF PATIENT SAMPLES FOR THE ERROR OF THE MEDIAN TO REACH THE SAME MAGNITUDE AS THE ERROR OF AN INTERNAL CONTROL MEASUREMENT

Of the simulated analytes, only albumin follows a normal distribution. The analytical *CV*_{a} at the mean is 0.037. The ratio of spread to the center of sample values, *CV*_{t}, is 0.255. In total, 79 patient samples are sufficient for the standard error of the median to reach the same magnitude as the standard error of a QC measurement at the same analyte value.

HbA1c, troponin I, and vitamin D3 follow a lognormal distribution. Their *CV*_{a}, *CV*_{t}, and required number of samples are 0.034, 0.162, and 36 samples; 0.038, 4.1, and 178 samples; and 0.056, 0.581, and 110 samples, respectively.

Because of the bimodal distribution, the number of samples needed for testosterone could not be readily calculated.
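As a rough cross-check of the albumin figure, the sketch below uses the asymptotic standard error of the median of a normal sample, which is the square root of π/2 times the standard deviation divided by the square root of n. It further assumes that the spread of patient results combines true-value and analytical variance. This is a hedged approximation and not necessarily the article's Eq. 1; the function name is hypothetical.

```python
import math


def samples_for_median_normal(cv_t, cv_a):
    """Samples needed so that SE(median) is about the SD of one QC measurement."""
    # Variance of patient results: true-value spread plus analytical imprecision.
    total_var = cv_t ** 2 + cv_a ** 2
    # Asymptotic variance of the median of a normal sample: (pi / 2) * variance / n.
    return math.ceil((math.pi / 2.0) * total_var / cv_a ** 2)


n_albumin = samples_for_median_normal(cv_t=0.255, cv_a=0.037)
```

With the rounded CVs quoted above, this returns 77, close to but not identical to the reported 79. The difference may reflect unrounded inputs or a differently specified Eq. 1. The lognormal case is not covered by this sketch.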

### COMPARISON OF QUALITY CONTROL RULES UNDER SIMPLIFIED CONDITIONS

Under simplified conditions, the Westgard-like algorithm and the unweighted aggregation of Z-values performed comparably when Z-values of QC measurements were affected equally (Fig. 1D). The AUC of the unweighted aggregation of Z-values and of the Westgard-like algorithm reaches 0.998 and 0.996, respectively.

When an out-of-control condition affected QC measurements unequally, their Z-values differed. With increasing difference Δ*Z* between these Z-values, the performance of the Westgard-like algorithm decreased but not that of the unweighted aggregation of Z-values. When the difference Δ*Z* equaled 3 Z-values, AUCs of the aggregation of Z-values and of the Westgard-like algorithm were 0.998 and 0.943, respectively.

### SIMULATION

The R package “rSimLab” is flexible enough to simulate normal, lognormal, or bimodal distributions, and seasonal variations of true values. The precision of the analytical methods was modeled using the characteristic function. The *TE _{a}* was specified as a proportion of the concentration and did not change in line with imprecision. Consequently, the probability of impermissible errors varied over the measurement range.

### ERROR-PRONE PHASE WITHOUT DISTURBANCES

After 200 runs for each simulation, the mean AUC (mAUC) for patient medians varied considerably, ranging from 0.64 to 0.96 (Fig. 4, Table 1 in the online Data Supplement). The highest mAUC was reached in the simulation of albumin, characterized by a large number of samples and a low ratio between spread and center. For vitamin D3, seasonal variations increased the error of the median, but 200 samples, on average, were sufficient to reach an mAUC of 0.84. In the simulation of HbA1c measurements, the mAUC of medians of only 70 patient samples was higher than that of the poorly performing control measurements. The median of the testosterone simulation was located between the distributions of “male” and “female” patients and was not effective in detecting a bias (mAUC 0.64).

The Westgard-like algorithm and the aggregations of Z-values tended to have a higher mAUC than the simple QC parameters they combined. Only the median of albumin and the control measurement 1 of testosterone had a slightly higher (<0.01) mAUC than an aggregation of Z-values. The Westgard-like algorithm performed worse than the simple QC parameters it combined in simulations of albumin, testosterone, and troponin I. The inclusion of patient medians improved all control rules in the simulations of albumin, HbA1c, and vitamin D3, but only the Westgard-like algorithm in the simulations of troponin I. When control rules were combined in an additional experiment using sex-specific medians of testosterone, mAUCs were not markedly improved (see Methods and Results and Fig. 3 in the online Data Supplement). Aggregations of Z-values had higher mAUCs than the respective Westgard-like algorithm in all simulations. Weighting nominally enhanced the aggregation of Z-values in the simulation of HbA1c and troponin I. However, mAUC of the weighted and unweighted aggregation of Z-values differed by <0.02 in all simulations. The differences between AUCs of the Westgard-like algorithm and of the corresponding weighted or unweighted Z-value aggregations were statistically significant.

Performance evaluations using thresholds based on Youden Index or 90% sensitivity followed the pattern of AUC values (see Tables 2–7 in the online Data Supplement). The inclusion of medians improved balanced accuracies and specificities in simulations of albumin, HbA1c, and vitamin D3. Mean balanced accuracies and mean specificities of Z-value aggregations were always higher than those of the respective Westgard-like algorithms.

### ERROR-PRONE PHASE WITH DISTURBANCES

When internal QC measurements were disturbed (by increased imprecision, by a relative bias, or by a constant bias), the mAUC of single control measurements decreased by >20% (e.g., vitamin D3). In most simulations, a control rule that included medians had a higher mAUC than the same rule without this parameter. Inclusions of medians led to a slightly lower mAUC (decrease of mAUC < 0.01) only in simulations of testosterone and troponin I when disturbed by an increase in imprecision. In simulations of albumin and vitamin D3, the disturbance caused a much larger performance loss in rules without medians.

When medians were disturbed (by removal of one-third of all samples or by a constant increase of all patient true values), the mAUC of medians also decreased by up to 20% (e.g., albumin). A rule that included medians still exhibited a higher or an only slightly lower (decrease of mAUC < 0.03) mAUC than the same rule without medians (Fig. 5, see Table 1 in the online Data Supplement).

In the simulation of HbA1c measurements, the removal of one-third of all patients decreased all mAUC values. The reduced number of patients increased randomness, in that days were regarded as “stable” or “out-of-control” that would have been classified differently with more samples. This classification served as the basis for AUC calculations. As such, mAUC values decreased although QC measurements were not affected by the fewer samples directly (see Methods and Results in the online Data Supplement; Fig. 1 in the online Data Supplement).

## Discussion

QC in clinical laboratories has become increasingly sophisticated. Instead of “one-size-fits-all” approaches, individual QC strategies are developed on the basis of knowledge of the analytical methods and clinical needs. We investigated the circumstances under which patient medians may be useful and how these can be incorporated into control rules for the detection of biases.

We did not address the construction of a comprehensive QC strategy combining several control rules. However, this work suggests that the aggregation of Z-values can be a new element of such “multirules.” Thresholds for aggregated Z-values need to be defined on the basis of individual needs. A Z-value threshold corresponds to a significance level (e.g., a threshold of ±1.96 Z equals a significance level of α = 0.05) (25). The inclusion of medians is particularly suited for retrograde QC strategies, in which patient results are held until a QC event has passed. Here, QC and patient samples measured under the same conditions can be evaluated together. Bracketing a fixed number of patient results for a QC event removes a source of uncertainty for patient medians. In anterograde strategies, in which a QC event has to be passed before measurements are started, patient medians are only available from the previous run and only persistent deviations can be detected.
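The correspondence between a Z-value threshold and a significance level mentioned above can be checked with the standard normal distribution. The sketch assumes a two-sided threshold and an aggregated Z-value that is standard normal under stable conditions; the function name is hypothetical.

```python
import math


def two_sided_alpha(z_threshold):
    """Probability that a standard normal Z-value exceeds the threshold in either direction."""
    return math.erfc(abs(z_threshold) / math.sqrt(2.0))


alpha_196 = two_sided_alpha(1.96)  # close to 0.05
alpha_3 = two_sided_alpha(3.0)     # close to 0.0027
```

A laboratory that tolerates fewer false rejections would choose a larger threshold, at the cost of a lower probability of detecting a given bias.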

In this study, QC procedures were compared in an extensive, carefully parameterized simulation. Precision was modeled with characteristic functions to create higher imprecision close to the limits of detection. *TE _{a}* was specified as a proportion of the concentration and did not change in line with imprecision. Consequently, the probability of impermissible errors was not constant over the range of measurements. This fact is often simplified in mathematical calculations of control rules. The AUC metric was used to evaluate the performance of QC procedures regardless of their decision thresholds.

The median of patient measurements can offer valuable information for QC if the spread of patient measurements in relation to their center is small. We have provided more exact estimates for the required numbers than the previously recommended 200 patient samples (18, 36) and confirmed the importance of the spread (37). Other location parameters besides the median have been proposed, including the mean after outlier removal with Tukey's method, the average of normal values, and the mean of log-transformed values (19). All parameters calculated from patient measurements are prone to being influenced by changes in the patient population. The more traditional use of QC materials also suffers from hard-to-control failures such as noncommutability. Patient medians should therefore supplement, but not replace, traditional QC measurements.

The proposed aggregation of Z-values has a superior discriminative ability compared with the algorithm resembling traditional Westgard single rules of the form 2_{xs} (x = 2, 2.5 …). However, although a variety of errors and disturbances was simulated, these settings may not reflect the situation of an individual assay. When selecting an algorithm for the detection of biases, the different mechanisms of the algorithms offer guidance. The Westgard-like algorithm combines individual Z-values with a “minimum threshold” strategy. For the 2_{2.5S} rule, 2 Z-values need to exceed the minimum boundary 2.5 to be flagged. If 1 Z-value is higher than the boundary by far but the other Z-value misses the boundary slightly, no signal is given. In contrast, the proposed aggregation of Z-values averages the results of simple QC parameters. As shown by our simulation under simplified conditions, this is especially beneficial when the deviation causing the out-of-control situation does not affect all simple QC parameters equally. The aggregation of Z-values can be nominally improved when the individual Z-values are weighted. Linnet has compared a “mean rule,” closely related to Stouffer's method, with traditional Westgard rules. Given the same type-I error, the mean rule was more powerful for detection of shifts of location than Westgard rules (38).
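The 2_{2.5S} case described above can be made concrete with a small worked example. The Z-values 4.0 and 2.4 are illustrative assumptions, chosen so that one value clearly exceeds the boundary and the other narrowly misses it. For reference, the last line gives the aggregated value when both Z-values sit exactly at the boundary.

$$
\begin{aligned}
\text{Westgard-like } (2_{2.5S}):&\quad \min(4.0,\ 2.4) = 2.4 < 2.5 \quad\Rightarrow\quad \text{no signal}\\
\text{Aggregation of Z-values}:&\quad \frac{4.0 + 2.4}{\sqrt{2}} \approx 4.53\\
\text{Both Z-values at the boundary}:&\quad \frac{2.5 + 2.5}{\sqrt{2}} \approx 3.54
\end{aligned}
$$

The aggregated value of about 4.53 lies well above the 3.54 obtained when both measurements just reach the boundary, so an aggregation with a comparable limit would signal this day although the minimum-threshold rule does not.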

The robustness of control rules was tested when internal QCs were disturbed with a constant or relative bias or an increase in imprecision. Patient medians were influenced by a removal of one-third of all patient values or a constant increase in patients' true values. When patient values remained unchanged, patient medians could mitigate the effect of failing QC measurements and vice versa. A confounding factor, such as inappropriate storage conditions or a change in the patient population, is much more likely to influence all QC samples or all patient measurements, but not both. Therefore, any rule that includes information from both sources provides superior detection of biases.

## Acknowledgment

We would like to thank Christof Winter for fruitful discussions.

## Footnotes

^{4} Nonstandard abbreviations: QC, quality control; HbA1c, hemoglobin A1c; TEa, allowable total error; CV, coefficient of variation; AUC, area under the receiver operating characteristic curve; mAUC, mean AUC; OOC, out-of-control.

**Author Contributions:** *All authors confirmed they have contributed to the intellectual content of this paper and have met the following 3 requirements: (a) significant contributions to the conception and design, acquisition of data, or analysis and interpretation of data; (b) drafting or revising the article for intellectual content; and (c) final approval of the published article.*

**Authors' Disclosures or Potential Conflicts of Interest:** *No authors declared any potential conflicts of interest.*

- Received for publication December 12, 2016.
- Accepted for publication April 11, 2017.

- © 2017 American Association for Clinical Chemistry