## Abstract

Sampling strategy fundamentally influences the effectiveness of quality control with control charts. This study presents a simple approach to optimizing the control strategy for automatic multichannel analyzers that takes cost-efficiency considerations into account. Our main focus is on the frequency of controls necessary. The methods used are based on a field study (on a Hitachi/BM 747), the views of experts, and computer simulations of customary cost models, together with a survey of the literature. We found that industrial cost models are applicable only with distinct limitations, but, unlike the test-yield model, they offer consistent solutions. On the basis of the field study and the opinions of experts, we adjusted the control strategy to account for inadequacies in the theoretical models. The combined result is that, for effective operation, the number of samples between controls may reach values up to 100, and controls should not be required more often than every 30 samples on comparable multichannel analyzers. For adequate statistical performance, a simple 3-SD Shewhart chart usually requires no more than two controls of the same material at each time.

Much has been written about quality control using control charts. As a result, many different types of control charts are available today, many of them statistically much more efficient than the Shewhart chart that was introduced into clinical laboratories by Levey and Jennings in 1950 (1). However, the chart itself is only part of the quality control strategy as a whole, which comprises both the sampling strategy and the chart. Although in practice the sampling strategy may be even more important than the chart type used, very few publications address this problem. Because of this situation, many laboratories still use traditional control schemes, although current widespread automatic analyzers may require other inspection strategies.

In 1990, Westgard et al. (2) examined a very similar problem when they tried to determine a cost-efficient control strategy for a Hitachi 737. Although their emphasis was on showing the impact of the size of medically relevant differences on the selection of alarm limits for control charts, they also concluded that, by increasing the batch size between controls from 20 to 60 patient samples, higher cost-effectiveness can be achieved. Starting from batches with bracketing controls, their study showed how switching to a nonbracketing mode of operation increased productivity.

These two main inspection designs found in practice (bracketing/nonbracketing) are explained below, using symbols. The controls are abbreviated “C,” the patient samples “p,” and “n” is an index for the number of measurements of a control sample. The bracket “]” means charting the control measurements and examining the results, whereas “[” means the start of a batch. Using this notation, the inspection strategies can be written as follows:

*(a)* batch mode with bracketing controls:

[C_{1} p p p … p C_{2}][C_{3} p p p … p C_{4}] [C_{5} … C_{n−2}] [C_{n−1} p p p … p C_{n}]

*(b)* nonbracketing mode:

[C_{1}] p p p … p [C_{2}] p p p … p [C_{3}] … [C_{(n−1)/2}] p p p … p [C_{(n+1)/2}].
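The same notation can be generated programmatically; the sketch below builds both symbol sequences (the batch counts and sizes are arbitrary illustrative values, and the function names are our own):

```python
def bracketing(batches, batch_size):
    """Batch mode: each batch of patient samples (p) is opened and
    closed by a control (C); results are charted at the closing bracket."""
    seq, c = [], 1
    for _ in range(batches):
        seq.append("[C%d %s C%d]" % (c, " ".join("p" * batch_size), c + 1))
        c += 2
    return " ".join(seq)

def nonbracketing(runs, run_size):
    """Nonbracketing mode: a single control (C) is measured and charted
    before each run of patient samples."""
    parts = ["[C%d] %s" % (i + 1, " ".join("p" * run_size)) for i in range(runs)]
    return " ".join(parts) + " [C%d]" % (runs + 1)

print(bracketing(2, 4))     # [C1 p p p p C2] [C3 p p p p C4]
print(nonbracketing(2, 4))  # [C1] p p p p [C2] p p p p [C3]
```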

Westgard et al. also presumed that cost-effectiveness could be increased further by even larger average numbers of patient samples between the control samples. Previously, Koch et al. (3) had shown that the size of the analytical errors to be detected in relation to the medically relevant changes, as well as the chart type used, must be taken into account when drawing such conclusions. They found that, for almost all analytes in serum chemistry, the use of Shewhart charts with no more than two controls at a time was sufficient.

Designs using preoperation controls before each batch offered little additional gain but reduced the speed of sample processing (2). However, a preoperation control after starting operation may be worth the delay (4).

There are generally three completely different approaches to assessing the problem of optimal control schemes systematically: *(a)* heuristic (based on experience), *(b)* statistical optimization, and *(c)* economic optimization.

Statistical approaches optimize the statistical properties of the inspection scheme, whereas economic models use a cost (or profit) function to optimize the total quality costs. These costs include, for example, the costs of false alarms or the costs of producing unsatisfactory quality. Combined statistical-economic approaches are mainly economic approaches, but they trade small cost increases for improved overall statistical performance.

The purely economic approach is not commonly used in clinical chemistry, but a basic model was described by Westgard and Barry on pp. 138–142 of their book on cost-effective quality control (5). In addition, from this approach they developed a pseudo-economic model, described on pp. 142–149 (5), that is based on so-called test yields (i.e., relative productivity) and is the most comprehensive and commonly used model. Regarding the optimal inspection strategy, some of the important conclusions of Westgard and Barry are as follows:

*(a)* Random access analyzers offer higher productivities than batch processes.

*(b)* The frequency of errors and statistical power of the control chart are critical for the control strategy.

*(c)* Increasing the number of simultaneous batches (i.e., many simultaneous processes) reduces the productivity.

*(d)* Productivity increases with increasing run sizes from 10 to 60 patient samples between controls [Fig. 6-2C on p. 147 of (5)]. However, the absolute gains from increasing run sizes become smaller and smaller at higher numbers of patient samples.

*(e)* For most situations in serum chemistry, one or two controls at each time are statistically sufficient for error detection.

## Materials and Methods

### simulations

To determine the optimal control strategy for automatic multichannel analyzers, computer programs for common optimization models were investigated. Models developed especially for the clinical laboratory, as well as economic models published in the *Journal of Quality Technology* and in *Technometrics*, were analyzed. Finally, of the many programs published in the *Journal of Quality Technology*, two approaches were chosen: a purely economic optimization [Montgomery (6)] and a combined statistical-economic optimization model [McWilliams (7)]. These programs are based on customary economic cost models: the underlying models are the basic model of Duncan (8) and the more recent, unified approach of Lorenzen and Vance (9). Continuous operation is assumed; only changes in accuracy (shifts) are considered, and only Shewhart charts are used. These restrictions limit the applicability of the results. The input parameters used for the economic models and the basic cost model of Duncan are listed in Table 1. These values were then varied by up to a factor of 10 in each direction to examine their impact (sensitivity analysis). Only one average set of inputs is shown in Table 1, rather than optimizing each analyte separately, because different control schemes for different analytes would be difficult to carry out without fully automated sample processing. However, the programs provided at our website (http://edv1.klinchem.med.tu-muenchen.de/∼neubauer) make it easy to perform different optimizations.

For the approach of Lorenzen and Vance (9), it is also assumed that the average time to remove an existing error is one hour, that the process does not run during an error search or error resolution, and that the minimum in-control average run length examined is 370 (which is equivalent to examining only schemes with a maximum frequency of false alarms corresponding to a 3-SD Shewhart chart). The Fortran source code of the programs [Montgomery (6) and McWilliams (7)] was received from the *Journal of Quality Technology* via e-mail (now available via the World Wide Web at http://lib.stat.cmu.edu/jqt/) and was compiled and run on a Pentium™ 100 system using a NAGware™ FTN90™ Professional Plus compiler, Ver. 2.1.
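As a rough illustration of what such economic models optimize, the sketch below implements a deliberately simplified Duncan-style loss function and a crude grid search, not the published Fortran programs; all cost values, the error rate λ, and the critical shift size are invented placeholders in the spirit of Table 1:

```python
import math
from itertools import product

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def loss_per_hour(n, h, k, delta=3.0, lam=0.05, a1=5.0, a2=2.0,
                  a3=50.0, a3p=100.0, M=1000.0, D=1.0):
    """Simplified Duncan-style loss per hour: sampling cost, expected
    false-alarm cost, and the penalty for running out of control until
    a chart on the mean of n controls (k-SD limits) signals."""
    alpha = 2.0 * (1.0 - phi(k))                       # false alarms per charted point
    power = 1.0 - phi(k - delta * math.sqrt(n)) + phi(-k - delta * math.sqrt(n))
    hours_off_target = h / power - h / 2.0 + D         # expected delay + repair time
    return ((a1 + a2 * n) / h                          # sampling/charting cost per hour
            + a3p * alpha / h                          # false-alarm cost per hour
            + lam * (a3 + M * hours_off_target))       # error rate x (search + quality)

# crude grid search over controls per point, hours between points, limit width
best = min(product(range(1, 5), [0.25, 0.5, 1.0, 2.0, 4.0], [2.5, 3.0, 3.16, 3.5]),
           key=lambda p: loss_per_hour(*p))
print("optimal (n, h, k):", best)
```

Varying the placeholder inputs here plays the role of the sensitivity analysis described above; the real programs optimize a considerably more detailed cycle model.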

To summarize, the optimization problem was approached here quite differently from the test-yield model of Westgard and Barry (5). Their model was investigated separately with a specially written simulation program and a Microsoft Excel spreadsheet.

### field study

A field study was undertaken to obtain data for evaluating the designs proposed in the literature and those derived by simulations. The test bed was prepared by increasing the frequency of control pools measured on the Hitachi 747 multichannel analyzer, which performs most of the serum analysis in our laboratory. To assess the use of more than one control sample, one rack was filled with three consecutive sample cups containing the same control pool and two sample cups containing another control pool. The control pools used were lyophilized serum pools (Boehringer Mannheim/Klinikum Großhadern) that differed mainly in the concentrations of the various analytes and were reconstituted every evening for the next day. The Hitachi 747 was operated according to our daily routine with Boehringer Mannheim reagents and was controlled by our standard inspection strategy, which is based on one control rack containing four different serum pools for approximately every 60 patient samples, charted on 3-SD Shewhart charts. The five sera on the additional study rack (containing the two different pools) were measured after approximately every 30 patient samples. No special cooling was provided for the control sera.

Our laboratory is part of a 1000-bed university hospital, for which it does most of the necessary analyses. Routinely, our laboratory determines 23 different analytes on the Hitachi 747. During the two months of the study (February and March of 1996; 41 working days) an average of 4000 measurements for the 23 different analytes were performed every day on 250–350 patient samples.

Seven of the more frequently requested analytes were selected for the study on the basis of three criteria: representativeness for different groups of tests, frequency of testing, and bearable costs. The analytes were sodium, potassium, calcium, creatinine, γ-glutamyltransferase, alkaline phosphatase, and pseudocholinesterase.

All data were stored in a Microsoft Access database (Ver. 2.0). Analysis of the ∼20 000 control measurements was performed with SPSS™ for Windows (Ver. 6.1.3) and Microsoft Excel, Ver. 5.0. The analysis included a thorough graphic and statistical examination of all data by all applicable methods available in SPSS, especially box plots and factor analysis. Finally, a simple approach was derived for determining an optimized control strategy: all data from the first month were used to calculate 3-SD limits for Shewhart control charts (Table 2). These control charts were then applied to the second month. Different strategies were constructed theoretically [based on the mean of one, two, or three of the measurements at each time and on different numbers of patients between the controls (30, 60, 90, or 120)]. The numbers of alarms caused by the control charts under these different theoretical inspection schemes were listed in two tables (summarized in Table 3). Finally, these tables were judged by nine experts (physicians/clinical chemists leading a hospital laboratory or part of such a laboratory) by means of a questionnaire (Fig. 1).
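The limit-setting and alarm-counting procedure just described can be sketched as follows; the data and helper names below are illustrative inventions, not the study's measurements:

```python
import statistics

def shewhart_limits(month1_values, k=3.0):
    """3-SD Shewhart limits computed from the first month's control results."""
    m = statistics.mean(month1_values)
    s = statistics.stdev(month1_values)
    return m - k * s, m + k * s

def count_alarms(month2_values, limits, n_per_point=1):
    """Chart the mean of n_per_point consecutive control measurements
    and count charted points falling outside the limits."""
    lo, hi = limits
    alarms = 0
    for i in range(0, len(month2_values) - n_per_point + 1, n_per_point):
        point = statistics.mean(month2_values[i:i + n_per_point])
        if not (lo <= point <= hi):
            alarms += 1
    return alarms

# illustrative control results only (e.g., calcium in mmol/L)
month1 = [2.50, 2.52, 2.48, 2.51, 2.49, 2.53, 2.47, 2.50, 2.52, 2.49,
          2.51, 2.50, 2.48, 2.52, 2.50, 2.49, 2.51, 2.53, 2.47, 2.50]
month2 = [2.50, 2.51, 2.49, 2.65, 2.48, 2.52, 2.50, 2.49, 2.51, 2.50]
limits = shewhart_limits(month1)
print("limits:", limits, "alarms:", count_alarms(month2, limits))
```

Running `count_alarms` for each combination of replicate count and batch size reproduces the structure of the "theoretical" alarm tables judged by the experts.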

## Results

### simulations

The solution of the cost-optimization models using the input factors listed in Table 1 leads to the proposal that two control samples be taken every 15 min and that their mean be documented in a 3.16-SD Shewhart chart (Fig. 2). The stability of the optimization solution was checked by varying the different input parameters. We found that more than two control samples are statistically useful only if the size of the shift decreases (e.g., three controls are necessary if a shift of 2 SD has to be detected) or the cost of false alarms increases dramatically. A reduced error probability, e.g., an error probability of 0.01 (one error per 100 h of operation), reduces the frequency of controls to one control set every 30 min. If the costs of low quality are reduced from 1000 to 100 DM per hour, a frequency of approximately one control per hour is sufficient.
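The dependence on the number of controls can be checked directly from the normal distribution. This sketch (our own simplification, not the simulation programs) computes the per-point probability that the mean of n controls falls outside k-SD limits after a shift of a given size:

```python
import math

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power(shift_sd, n, k=3.0):
    """Probability that the mean of n controls falls outside k-SD limits
    (scaled for the mean) when the process has shifted by shift_sd SDs."""
    z = shift_sd * math.sqrt(n)
    return (1.0 - phi(k - z)) + phi(-k - z)

for n in (1, 2, 3):
    print(f"shift 2 SD, n={n}: power={power(2.0, n):.2f}")
```

For a 2-SD shift, the single-point detection probability rises steeply from n = 1 to n = 3, which is consistent with the observation that smaller shifts require more controls per point.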

The results of the economic and the economic-statistical optimizations are nearly identical; the two simulation programs give very similar suggestions for optimal sampling plans. When we varied all input parameters by up to a factor of 10 in each direction, the time intervals between controls varied between 5 and 60 min. Assuming 3000 possible tests per hour and 10 tests per patient, ∼5 patient sera can be processed per minute by the Hitachi 747. Under these conditions, the possible intervals between controls vary between 25 and 300 patients. Adjustments to other assumptions, such as different costs or different quality requirements, can be performed easily by downloading the programs/spreadsheets from our website.

### field study

In 80% of the cases, the number of patient samples between controls was ≤30. The remaining 20% of cases showed longer run sizes of up to 50 patients. Because not all of the seven selected analytes were requested for each patient sample, an average of ∼18 (SD = 10) patient samples tested for a specific analyte lay between study controls.

To determine the optimal batch size, the numbers of alarms caused by control charts under the different inspection schemes listed in Table 3 were assessed by experts answering a questionnaire (Fig. 1). Of the 10 experts who were mailed questionnaires, 9 completed and returned them. Their answers, based on the information presented in this paper, are listed in Table 4 and Fig. 3.

The inspection schemes selected (Table 4) appear fairly divergent at first glance but reveal a degree of consensus on closer inspection. Most experts selected a control frequency of between 30 and 100 patient samples (question 1), and only one answer in each direction fell outside this interval (experts B and G). The repetition of controls (question 2) was believed to be useful by only two experts (experts C and D). For the chart type (question 4), a simple Shewhart chart was preferred by most of the quality control managers; two preferred adding the Rili-BÄK rules1, and two experts preferred choosing another chart type or other rules. Question 3 shows that two different control materials were suggested by the majority of experts (seven of the nine who responded). However, here two groups exist: one major group preferred two completely different materials (e.g., one serum pool and one commercial material, possibly in different concentrations), whereas the other group preferred using different concentrations of the same material.

The factors underlying these assessments are shown in Fig. 3. Regarding the costs of control material and reagents for quality control (questions 1 and 2), two groups exist: one group believed that these costs are important; the other did not regard this as a factor. A similar situation is found for the costs of judging the control results (question 3), but here the majority assessed this point as relatively unimportant. In question 4, the costs of error removal were also assessed by the majority as quite unimportant. Some disagreement is observed regarding the costs of false alarms (question 5). Questions 6 through 8 address the importance of different sizes of analytical errors. Although medically important errors were clearly appreciated by nearly all experts (question 8) and smaller errors were judged as less important, for some of the experts quite small errors (question 6) were of relatively high interest. Handling of the control chart (question 10) was recognized as very important by all experts, whereas the number of channels (question 9) was assessed as of only medium interest. The frequency of errors was believed to be crucial (question 12), whereas the time used for performing and analyzing controls was of medium interest (question 11). Two experts added (question 13) that the sensitivity of the control chart and the quality of control materials are essential. Neither in Fig. 3 nor in Table 4 could a correlation be identified between experts working in our institute (codes A, B, C, and I) and external experts (codes D, E, F, G, and H).

In addition to optimizing the sampling strategy, our study clearly showed how important the storage and handling of control material are: trends during the day can be identified for some analytes when cumulative control results are charted over the entire study period. Because the samples were not refrigerated or airtight, specimen evaporation increased concentrations or activities by up to 10% over 6 to 7 h. For example, the calcium concentration increased significantly [linear regression model (mmol/L): Ca = 2.571 + 0.0159 × hours; 95% confidence intervals of (2.542; 2.600) and (0.014; 0.018)]. These findings are in accordance with (11) and (12). The stability of the reconstituted control serum was high enough not to cause visible effects (13)(14)(15). Factors such as the day of the week showed no influence on the test results.
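The evaporation trend above is an ordinary least-squares fit of control results against time of day. The sketch below shows such a fit; the data points are invented to mimic the reported slope and intercept, not the study's measurements:

```python
import statistics

# illustrative calcium control results (mmol/L) over a working day
hours = [0, 1, 2, 3, 4, 5, 6, 7]
ca    = [2.57, 2.59, 2.60, 2.62, 2.63, 2.65, 2.67, 2.68]

# ordinary least-squares estimates: slope = Sxy / Sxx, intercept from means
mx, my = statistics.mean(hours), statistics.mean(ca)
sxx = sum((x - mx) ** 2 for x in hours)
sxy = sum((x - mx) * (y - my) for x, y in zip(hours, ca))
slope = sxy / sxx
intercept = my - slope * mx
print(f"Ca = {intercept:.3f} + {slope:.4f} x hours")
```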

Another point worth mentioning is the method of calculating control limits (Table 2). When control charts are used for quality control during the day, it is essential to use limits derived in the same way (e.g., including evaporation effects and serum instability during the day). Control limits derived from an earlier month and calculated exactly as recommended by the Rili-BÄK (≥20 values, the same scheme, e.g., always the second control of a day) can be used only for controlling day-to-day imprecision but cannot be used for same-day control purposes because of the effects mentioned above.

## Discussion

Three different approaches (literature research, simulation studies, and a field study) were used to determine an optimized, cost-effective control strategy. The variables we focused on were the use of more than one sample of the same control material at each time and the optimal number of patient samples between controls.

Regarding the number of controls at each time, all approaches showed that, when using 3-SD Shewhart charts, two identical control samples at each time are sufficient. Charting the mean of the two measurements offers enough statistical power for all analytes, in agreement with Koch et al. (3). The use of such a scheme with two measurements at each time can be represented as follows:

C_{11}C_{12} C_{21}C_{22} C_{31}C_{32} … C_{i1}C_{i2} p p p … p C_{11}C_{12} C_{21} …

We make no recommendation about the number of different control materials necessary, i.e., the index “i” in the symbolic description above. Such a selection (13) cannot be derived from statistical requirements (Table 4, however, does include the experts’ views on this topic).

With control charts that have greater statistical power than the Shewhart charts used here, a single control sample at each time might be sufficient for all analytes. Such charts include CUSUM charts (16), EWMA charts (17)(18), and the Westgard multirule algorithm (19). The official Rili-BÄK multirules are not suitable, as shown earlier (20). However, because the Westgard algorithm is based on one control sample at a time, multirules across at least two different control materials or concentrations are applied. Therefore, this procedure has the desired statistical power only if both materials behave identically.
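Two of the Westgard multirules mentioned here can be sketched on standardized control values; this is a minimal illustration covering only the 1_3s and 2_2s rules, not the full algorithm:

```python
def westgard_alarms(z_scores):
    """Minimal sketch of two Westgard multirules on standardized control
    values: 1_3s (one value beyond 3 SD) and 2_2s (two consecutive values
    beyond 2 SD on the same side of the mean)."""
    alarms = []
    for i, z in enumerate(z_scores):
        if abs(z) > 3:
            alarms.append((i, "1_3s"))
        if i >= 1 and z > 2 and z_scores[i - 1] > 2:
            alarms.append((i, "2_2s"))
        if i >= 1 and z < -2 and z_scores[i - 1] < -2:
            alarms.append((i, "2_2s"))
    return alarms

print(westgard_alarms([0.5, 2.2, 2.4, -0.3, 3.5]))  # [(2, '2_2s'), (4, '1_3s')]
```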

Regarding the frequency of controls necessary on automatic multichannel analyzers, our simulation studies showed relatively low frequencies to be cost-efficient. Our results suggest controls every 15 min, which implies a cost optimum of 75 patient samples between controls at a throughput of ∼5 patient samples per minute. In this respect, our conclusions are consistent with the findings of Westgard and Barry (5) and Westgard et al. (2). However, our simulation results must not be overinterpreted, because of limitations such as uncertain cost factors and the consideration of shifts in inaccuracy only.

Examining intervals longer than 120 patient samples between controls did not seem reasonable because a batch size of 120 patients means using only 2–3 controls per day, given a total of 200–400 patient samples per day. A change of reagents is often necessary once a day; therefore, one control would be used when starting the analysis in the morning, and the other after the reagents are changed in the afternoon.

The results of the expert assessments support the results of our rough simulations: *(a)* Between 30 and 100 patient samples between controls were preferred in question 1 of Table 4, which is consistent with the simulation results. *(b)* The use of a Shewhart chart appealed to the majority of experts (question 4, Table 4). *(c)* However, charting the mean of two controls at each time was not supported by the experts (question 2). In this context, the size of (medically) relevant shifts is very important and would have to be assessed separately for each analyte.

The experts’ judgment of the different factors underlying the cost model is reflected in Fig. 3. Some factors were judged not to be essential (e.g., questions 1–4 and question 11). But the factors crucial for the simulation results, the size (questions 6–8) and frequency of errors (question 12), were regarded as important by the experts. The result of question 6 supports the opinion that small errors may be of some interest because they occur more frequently and reveal incipient problems at an earlier timepoint (20). Interestingly, the answers to question 9 reveal that the importance of multiple channels is underestimated by most experts. However, as we were recently reminded by Petersen et al. (21), the probability of false rejection (P_{fr}) for a control chart with *n* channels is related to the single-channel false-rejection probability by its *n*^{th} power: P_{fr total} = 1 − [1 − P_{fr}]^{n}. This means, for example, that when 3-SD Shewhart charts with 20 channels are used, the overall probability of false rejection is 5.3% [P_{fr total} = 1 − (1 − 0.0027)^{20} ≈ 0.053].

Regarding the validity of our results, it should be noted that the applicability of our simulation approach rests on the assumption of a continuous process, which is not necessarily realized in the clinical laboratory. Nevertheless, the mode of operation is reasonably close to a continuous process, and the stability of the analyzer over the day is high enough to justify rejecting batches with bracketing controls (2). The economic models that we used investigated only inaccuracy, and nearly all costs are rough estimates. However, the use of these models still seems justified because they have been proven in practice over many years of industrial quality control and offer a simple means of optimization. Adaptation to other cost situations, separate optimizations for different tests, and charts other than the Shewhart chart can be achieved easily. However, models investigating not only inaccuracy but also imprecision, and based on numbers of samples instead of hours of operation, would improve the simulation results.

The specific Westgard and Barry model that uses test yields (5) is, unlike their cost model, not suitable for determining the optimal control strategy, because no optimum exists. The test yield approaches unity with increasing numbers of patients between controls, meaning:

test yield (20 samples) < test yield (60) < test yield (500) < test yield (1 000 000).

Additionally, negative test yields are possible when the rerun factors are changed. Table 5 shows this inconsistent behavior of Westgard’s test-yield model.
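The no-optimum behavior can be seen even in a toy version of a test yield. The sketch below is a drastic simplification of our own (the Westgard and Barry model also weights rerun factors and error rates), but it shows why the yield only grows with the run size:

```python
def simple_yield(s, n_controls=2, n_calibrators=0):
    """Toy test yield: fraction of all measurements that are patient
    results, for s patient samples per run plus controls/calibrators."""
    total = s + n_controls + n_calibrators
    return s / total

for s in (20, 60, 500, 1_000_000):
    print(s, round(simple_yield(s), 6))
```

The yield increases monotonically toward 1 as s grows, so no finite optimum exists; only a cost (or loss) term that grows with s can produce one.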

The advantages of the test-yield model include ease of adjustment to different quality requirements and to the performance of different quality control procedures, a performance estimate that has practical meaning in the laboratory and is potentially measurable and verifiable from laboratory results, and an overall modeling strategy that allows the prediction of quality in terms of the defect rate. When using economic cost models (or economic-statistical models) instead of test-yield models, one encounters the problem of determining costs instead of making simple assumptions on cost in the form of rerun factors. On the other hand, transforming these models into others that do not use monetary parameters (such as the test-yield model of Westgard and Barry) is quite complex and error-prone. The final result of such an attempt is mainly an exchange of “costs” for “rerun factors,” which still does not resolve the problem of determining these factors exactly. We think that using cost models simplifies the whole process because it is easier to estimate monetary costs than to assess less tangible factors such as “analytic rerun factors.”

A completely different approach, which makes use of the average of normals (AON) to maximize run lengths, was proposed by Westgard et al. (22) in 1996. The AON method observes the average of normal patient results and compares it with the theoretically expected average to assess whether imprecision and inaccuracy requirements are met. Because, in contrast to quality-control samples, the patient results incur no additional sample costs, repeated attempts to establish this method for quality control have been made. Although the high number of patient results necessary has prevented widespread use, Westgard et al. have shown that today, with larger laboratories and computer support, AON is applicable for many tests (22). Their minimum numbers for candidates with high potential for AON range between 30 and 450; these give the minimum number of patient samples necessary to use the AON method with a certain statistical power. In contrast, when optimizing a process controlled by control samples, the statistical properties of the control chart are already given, and patient numbers between controls are varied to reach a cost optimum. This optimum is found by taking into account, by means of a cost function or test yields, the consequences of cases not meeting quality standards, cases rejected falsely, and so forth. Thus, our economic optimization approach based on control samples is fundamentally different from Westgard’s statistical approach of determining the necessary AON lengths. However, control samples and the AON method may advantageously be applied simultaneously, leading to a combined control system. This combined system could again be optimized in a test-yield or economic model that uses the statistical properties and costs of both control methods.
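A minimal AON check might look like the sketch below; this is our own simplified illustration, and the window size, z cutoff, and data are hypothetical rather than taken from (22):

```python
import math
import statistics

def aon_alarm(patient_results, expected_mean, population_sd, window, z=1.96):
    """Flag a run if the mean of the last `window` normal patient results
    deviates from the expected mean by more than z standard errors."""
    if len(patient_results) < window:
        return False
    m = statistics.mean(patient_results[-window:])
    se = population_sd / math.sqrt(window)
    return abs(m - expected_mean) > z * se

# illustrative: sodium with expected mean 140 mmol/L, population SD 3 mmol/L;
# the later results drift upward, so the window mean triggers the alarm
results = [140.2, 139.5, 141.0, 140.8, 139.9, 143.2, 143.8, 144.1, 143.5, 142.9]
print(aon_alarm(results, 140.0, 3.0, window=10))
```

The standard-error denominator makes explicit why AON needs large windows: the smaller the shift to be detected relative to the population SD, the more patient results are required.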

To examine quality-control process parameters more thoroughly than is possible with solely theoretical models, we recommend conducting a field study. By establishing “theoretical” alarm tables for the measurements, we have presented a simple method for judging the impact of different control strategies. The final selection of a strategy can be made using the different cost-model (or test-yield) scenarios. Additionally, factors that are not included in the model, such as professional handling by the laboratory technician or special analyzer requirements, can be taken into account. This step means departing from strict economic optimization toward an optimization that includes intangible, nonmonetary factors. Because any model can include only part of the available expert knowledge, a final assessment by experts seems necessary and valuable to account for remaining inadequacies.

## Acknowledgments

We would like to thank the Hitachi team, especially Marion Borchers and Dieter Wagner, for their assistance in conducting the field study on the Hitachi 747. We would also like to thank Michael Page for reviewing the English language of this text and Thomas Schade (Weights and Measures Office Munich) for giving valuable suggestions. We also thank the external participants of the expert panel (S. Appel, Krankenhaus München-Neuperlach, München; S. Braun, Deutsches Herzzentrum, München; P. Cremer, Klinikum Großhadern, München; W. Ehret, Zentralklinikum Augsburg, Augsburg; and F. Keller, Universitätsklinik Würzburg, Würzburg) and the internal participants (Klinikum rechts der Isar, München; C. Falkner, P. Luppa, D. Neumeier, and C. Wolter).

## Footnotes

Institut für Klinische Chemie und Pathobiochemie der Technischen Universität München, Klinikum rechts der Isar, Ismaninger Straße 22, 81675 München, Germany.

↵1 Rili-BÄK (German guideline of the federal physicians’ association) multirules (10): 7_{T}, the assay is out of control if seven consecutive measurements show the same trend upward or downward; and 7_{X}, the assay is out of control if seven consecutive measurements fall on one side of the mean. The association recommends using these rules in addition to the 3-SD Shewhart chart for assessing imprecision in internal quality control. External (as well as internal) quality control is supervised by the Weights and Measures Office in Germany. Medical laboratories are required to participate in external quality assessment and to qualify for each analyte at least twice a year.

↵2 If *n* identical samples are taken every *h* hours, α is the type I error rate, and 1 − β is the power of the chart, the cost per hour *E(L)* can be written according to the basic cost model of Duncan (8).

Legend to Table 2:

1 Using the first routine control measurements of 20 days.

2 Calculated using all routine control data of 1 month.

3 Calculated using all additional control data of 1 month.

4 The high number of decimals was chosen to avoid problems caused by rounding.

5 GGT, γ-glutamyltransferase; AP, alkaline phosphatase; and PChE, pseudocholinesterase.

Legend to Table 3: The table shows the impact of different average numbers of patients between controls and the influence of using more than one (identical) control sample at each time (the mean is then used in the control chart). The values after the slash give the much lower number of alarms obtained when control limits are calculated from all values of 1 month.

1 Limits calculated according to Rili-BÄK using only the first daily control, or calculated using all control data.

2 In our routine operation, no analytical error was encountered for any of the analytes shown.

3 GGT, γ-glutamyltransferase; AP, alkaline phosphatase; and PChE, pseudocholinesterase.

1 Also necessary: medical validation and external quality control.

Legend to Table 5: Some examples of inconsistent test yields, as computed by the Westgard and Barry model [(5), p. 145], are shown. A random-access process with persistent errors was assumed. The rerun factors determine the number of times tests have to be repeated in the different cases; e.g., R_{TR} = 1 means that in case of a detected error all tests are repeated once. All results were computed by a special simulation program and verified with an Excel spreadsheet. Both are available via WWW at http://edv1.klinchem.med.tu-muenchen.de/∼neubauer/. It is obvious that this model is not applicable for determining the optimal number of patient samples *S* between controls for a given number of control samples *N* because: *(a)* no optimal number S is reached, but the test yield approaches unity with a higher S; and *(b)* negative test yields are calculated for some input parameters, which is inconsistent with the definition of the test yield (which should lie between 0 and 1).

1 T, total number of samples; N, number of controls; C, number of calibrators; T = S (patient samples) + N + C; R_{TR}, rerun factor for true reject; R_{FR}, rerun factor for false reject; R_{FA}, rerun factor for false accept; R_{TA}, rerun factor for true accept; f, probability of an error; m, number of simultaneous channels; ARL_{R}, average run length for rejectable quality; and ARL_{A}, average run length for acceptable quality.

2 The test yield approaches unity with higher numbers of samples S between controls.

3 Negative results for relative productivity do not make sense. The test yield should represent the portion of measurements that are correct and reportable as patients’ results.

- © 1998 The American Association for Clinical Chemistry