## Abstract

*Background:* Analytical error affects 2nd-trimester maternal serum screening for Down syndrome risk estimation. We analyzed the between-laboratory reproducibility of risk estimates from 2 laboratories.

*Methods:* Laboratory 1 used Bayer ACS-180 immunoassays for α-fetoprotein (AFP) and human chorionic gonadotropin (hCG), Diagnostic Systems Laboratories (DSL) RIA for unconjugated estriol (uE3), and DSL enzyme immunoassay for inhibin-A (INH-A). Laboratory 2 used Beckman immunoassays for AFP, hCG, and uE3, and DSL enzyme immunoassay for INH-A. Analyte medians were separately established for each laboratory. We used the same computational algorithm for all risk calculations, and we used Monte Carlo methods for computer modeling.

*Results:* For 462 samples tested, risk figures from the 2 laboratories differed >2-fold for 44.7%, >5-fold for 7.1%, and >10-fold for 1.7%. Between-laboratory differences in analytes were greatest for uE3 and INH-A. The screen-positive rates were 9.3% for laboratory 1 and 11.5% for laboratory 2, with a significant difference in the patients identified as screen-positive vs screen-negative (McNemar test, *P* <0.001). Computer modeling confirmed the large between-laboratory risk differences.

*Conclusion:* Differences in performance of assays and laboratory procedures can have a large effect on patient-specific risks. Screening laboratories should minimize test imprecision and ensure that each assay performs in a manner similar to that assumed in the risk computational algorithm.

Assessment of the risk for fetal Down syndrome or other aneuploidy has become a routine component of prenatal care. Risk is estimated on the basis of a combination of maternal age, prior history of aneuploidy, 1st- and 2nd-trimester maternal serum screening tests, and ultrasound examination of the fetus (1)(2). Women considered to be at high risk (screen-positive) are offered definitive diagnosis through chromosome analysis of chorionic villus samples or amniotic fluid cells.

Risks for Down syndrome are usually calculated by converting marker measurements into gestational age–adjusted multiples of the median (MoMs), establishing a likelihood ratio based on the expected distributions of the MoM values in affected and unaffected pregnancies, and using the likelihood ratio to modify the maternal age–specific risk for an affected pregnancy (3)(4). For maternal serum markers, the expected distributions are often based on studies that were carried out on a different population and with alternative assay protocols. Although these population or testing differences could affect marker measurements, it has generally been assumed that the risk assessments provided are accurate. Consistent with this assumption, the detection rates and false-positive rates are close to theoretical expectations (5). Furthermore, when patients are grouped according to ranges of risk, the proportions of affected and unaffected pregnancies approximately correspond to those expected (6)(7). These studies demonstrate that, on average, grouped risk figures are approximately correct.
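The risk calculation described above can be sketched in a few lines. The following is a minimal single-marker illustration, assuming Gaussian distributions of log₁₀ MoM values; the distribution parameters and prior risk are invented placeholders, not the published algorithm's values, and the real calculation uses a multivariate likelihood over several correlated markers.

```python
import math

def normal_pdf(x, mean, sd):
    """Gaussian density, evaluated on log10(MoM) values."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def down_syndrome_risk(mom, prior_risk,
                       affected=(0.25, 0.28), unaffected=(0.0, 0.2)):
    """Modify a maternal age-specific prior risk by the likelihood ratio
    of a single marker's log10 MoM under the affected vs unaffected
    distributions. The (mean, SD) pairs are illustrative placeholders,
    not published screening parameters."""
    x = math.log10(mom)
    lr = normal_pdf(x, *affected) / normal_pdf(x, *unaffected)
    prior_odds = prior_risk / (1.0 - prior_risk)
    posterior_odds = prior_odds * lr
    return posterior_odds / (1.0 + posterior_odds)
```

A marker value near the center of the affected distribution raises the prior risk; one near the center of the unaffected distribution lowers it.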

The analyses of group risks do not provide any insights into the effects of differences between assays and random errors that may substantially alter individual risks without affecting the mean. Proportions of affected and unaffected pregnancies for groups may be relatively insensitive to these effects because both affected and unaffected pregnancies may be subject to similar errors, with proportionate changes in the numbers of both types. An analysis of the effect of analytical imprecision in the 2nd-trimester triple test [α-fetoprotein (AFP), human chorionic gonadotropin (hCG), and unconjugated estriol (uE3)] showed that individual patient risk figures can change dramatically owing to compounding effects of errors in each of the 3 component test measurements (8). Proficiency test data on serum or defined quantities of the analytes in buffered solutions also show that when different laboratories measure the same specimen, markedly different risk figures are reported (9)(10).

Patient decisions to accept or reject amniocentesis depend on the risk figure reported (11). Because amniocentesis is anxiety-provoking and expensive and can lead to procedure-related fetal loss or other pregnancy complications, risk figures should be as accurate as possible.

The objective of our study was to compare the risk figures generated by 2 laboratories testing the same set of 2nd-trimester serum specimens.

## Materials and Methods

A total of 501 serum specimens from women with singleton pregnancies at 14–21.9 weeks gestational age were referred for Down syndrome risk evaluation. The study population was 63% Caucasian, 12% African American, 17% Hispanic, and 8% other, with 21.2% aged ≥35 years at the expected date of delivery. Gestational age was estimated from ultrasound measurements in 75.7% of pregnancies and from the time since the last menstrual period in 24.3%. For 462 specimens, sufficient serum was available to measure AFP, uE3, hCG, and inhibin-A (INH-A) in both laboratories. Based on subsequent follow-up data on pregnancy outcomes, none of the samples were thought to be from a Down syndrome–affected pregnancy.

In laboratory 1, AFP and hCG were measured with Bayer ACS-180 immunoassays (Bayer), uE3 with Diagnostic Systems Laboratories (DSL) RIA, and INH-A with DSL immunoassay. All testing was performed in duplicate, and reported results were the mean of the duplicate measurements. In laboratory 2, AFP, hCG, and uE3 were all measured with Beckman Access immunoassays (Beckman Coulter), and INH-A was measured with DSL immunoassay; all testing was based on a single measurement for each patient sample. Within each laboratory, all results included in this study were based on a single lot of commercial reagent sets. For both laboratories, testing was carried out within the time intervals for which analyte concentrations were expected to be reliable (12).

To convert analyte concentrations into MoMs, we divided each concentration by the expected gestational age–specific median. The accepted practice of establishing normal medians was applied; data were grouped by weeks, and raw medians were then calculated and regressed against the gestational age (with weighting according to the number of observations at each week). For AFP and uE3, the regression curve was log-linear (13)(14); for hCG, it was exponential (15); and for INH-A, it was quadratic (16). For each laboratory, Down syndrome risks were calculated with the same 2nd-trimester risk algorithm, based on the statistical parameters of Wald et al. (17)(18)(19) and taking into consideration the method of gestational age dating, maternal weight, and race. A risk >1:270 in the 2nd trimester was used to define a positive screening test result.
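As a concrete sketch of the median-regression and MoM steps, the fragment below fits a weighted log-linear curve to weekly medians and converts a raw concentration into a MoM. The function names and all coefficient values are ours, invented for illustration, not values from either laboratory.

```python
import math

def weighted_loglinear_fit(weeks, medians, counts):
    """Weighted least-squares fit of log10(median) = a + b * week,
    weighted by the number of observations at each week."""
    total = sum(counts)
    x_bar = sum(w * x for x, w in zip(weeks, counts)) / total
    logs = [math.log10(m) for m in medians]
    y_bar = sum(w * y for y, w in zip(logs, counts)) / total
    b = (sum(w * (x - x_bar) * (y - y_bar)
             for x, y, w in zip(weeks, logs, counts))
         / sum(w * (x - x_bar) ** 2 for x, w in zip(weeks, counts)))
    a = y_bar - b * x_bar
    return a, b

def to_mom(concentration, ga_weeks, a, b):
    """Express a raw concentration as a multiple of the regressed
    gestational age-specific median."""
    expected_median = 10.0 ** (a + b * ga_weeks)
    return concentration / expected_median
```

The exponential and quadratic regressions used for hCG and INH-A follow the same pattern with a different curve family.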

For statistical modeling with Monte Carlo methods, we used computer programs written for the statistical package S-Plus (Insightful Corp). For unaffected pregnancies, the test SDs and within- and between-laboratory correlation coefficients for pairs of tests were calculated directly from the data from the 2 laboratories. SDs of log₁₀ MoMs were estimated from the median absolute deviation function in S-Plus, which is considered to provide a robust estimate by minimizing the effect of outliers. Correlation coefficients between log₁₀-transformed MoMs were calculated after excluding values >3 SD from the mean (20). For affected pregnancies, it was assumed that established differences in the variances and covariances between affected and unaffected pregnancies could be directly added to the values obtained for the unaffected pregnancies (21). With the statistical parameter sets for each laboratory, we simulated the results for sets of affected and unaffected pregnancies tested in the 2 laboratories. We used the same risk algorithm to calculate the likelihood ratios for the 2 laboratories (17)(18)(19), and we calculated the maternal age–specific detection rates and false-positive rates with the 1:270 2nd-trimester cutoff. Net detection and false-positive rates for the population were based on the mean of these maternal age–specific detection rates and false-positive rates, weighted according to the maternal ages for US women delivering liveborn infants in the year 2000 (22).
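The robust SD estimate and the correlated-marker draws at the core of this modeling can be sketched as follows (in Python rather than S-Plus). The sketch uses a hand-coded 2×2 Cholesky factor for a bivariate normal; the full simulation uses all 4 markers, and every parameter value here is arbitrary.

```python
import math
import random
import statistics

def robust_sd(values):
    """SD estimate from the median absolute deviation (MAD), scaled by
    1.4826 so it matches the SD for Gaussian data, analogous to the
    S-Plus mad() function."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return 1.4826 * mad

def simulate_log_moms(n, means, sds, rho, rng):
    """Draw n correlated pairs of log10 MoM values from a bivariate
    normal distribution (2x2 Cholesky factorization)."""
    pairs = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x1 = means[0] + sds[0] * z1
        x2 = means[1] + sds[1] * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
        pairs.append((x1, x2))
    return pairs
```

Simulated affected and unaffected pregnancies drawn this way can be pushed through the same risk algorithm to estimate detection and false-positive rates at the 1:270 cutoff.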

## Results

For all 4 analytes (AFP, uE3, hCG, and INH-A), linear relationships existed between the raw concentrations measured in the 2 laboratories. The correlation coefficients were high for AFP (0.98) and hCG (0.97) but somewhat lower for uE3 (0.94) and INH-A (0.90; Fig. 1).

Regression curves for median concentrations of each analyte against gestational age are shown in Fig. 2. For AFP, hCG, and INH-A, the regressed medians for the 2 laboratories were approximately proportional across most gestational ages. The 2 regression curves for uE3, however, showed a divergence with increasing gestational age. When all analytes were converted into MoMs, correlation coefficients between test results from the 2 laboratories were not substantially decreased for AFP (0.96), hCG (0.97), or INH-A (0.90), but the correlation for uE3 decreased markedly (0.82; see Fig. 3).

Bland–Altman plots for the tests performed in the 2 laboratories are shown in Fig. 4. For each laboratory and analyte, log MoM values were standardized by dividing by the SD, making the means and differences comparable across analytes. These plots illustrate the relatively high consistency of hCG and AFP measurements and the relatively low consistency of uE3 and INH-A measurements between the 2 laboratories. There is little deviation from the horizontal line to suggest any substantial systematic bias. uE3 and, to a lesser extent, INH-A show a high amount of scatter about the horizontal line, indicating large random variation between measurements from the 2 laboratories.
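The standardization behind these plots amounts to the following sketch, in which each laboratory's log MoM values are divided by that laboratory's own SD before the Bland–Altman means and differences are formed (function and variable names are ours).

```python
import statistics

def bland_altman_points(lab1_log_moms, lab2_log_moms):
    """Standardize each laboratory's log MoM values by its own SD, then
    return (mean, difference) pairs for a Bland-Altman plot."""
    sd1 = statistics.stdev(lab1_log_moms)
    sd2 = statistics.stdev(lab2_log_moms)
    points = []
    for v1, v2 in zip(lab1_log_moms, lab2_log_moms):
        s1, s2 = v1 / sd1, v2 / sd2
        points.append(((s1 + s2) / 2.0, s1 - s2))
    return points
```

Because each laboratory is scaled by its own SD, a purely proportional between-laboratory difference cancels; only relative disagreement appears as scatter about zero.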

SDs and correlation coefficients for the laboratory tests performed in laboratory 1 and laboratory 2 are summarized in Table 1. Also included are reference data observed by another laboratory, which provide the basis for widely used algorithms for risk calculation (17)(18)(19), and data for 23 704 patients receiving routine screening through laboratory 1 (23). The statistical parameters are derived from log-transformed concentrations expressed in MoMs. On the basis of the relatively close values for the statistical parameters for the 462 study samples (columns 4 and 5) compared with the reference groups (columns 2 and 3), we considered the study group to be sufficiently large for a meaningful between-laboratory comparison.

For analyte results expressed in MoMs, the mean CVs for paired test results from the 2 laboratories were 5.0% for AFP, 5.3% for hCG, 11.0% for uE3, and 14.7% for INH-A. In comparison, for laboratory 1, the CVs for paired test results, measured in separate runs, were 4.1% for AFP, 2.8% for hCG, 3.9% for uE3, and 4.9% for INH-A. Estimates for laboratory 2 CVs were not available.
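The study does not state the exact CV formula used for these pairs. One common convention, which the sketch below assumes, takes each pair's SD (|a − b|/√2 for two values) divided by the pair mean and averages across pairs.

```python
import math

def mean_paired_cv(pairs):
    """Mean percent CV over measurement pairs: each pair contributes
    SD/mean, where the SD of two values a, b is |a - b| / sqrt(2).
    This convention is an assumption; the study's formula is unstated."""
    cvs = []
    for a, b in pairs:
        pair_mean = (a + b) / 2.0
        pair_sd = abs(a - b) / math.sqrt(2.0)
        cvs.append(pair_sd / pair_mean)
    return 100.0 * sum(cvs) / len(cvs)
```

For between-laboratory pairs, a and b are the two laboratories' MoMs for the same sample; for the within-laboratory estimates, they are duplicate results from separate runs.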

For each of the 462 samples, we calculated Down syndrome risk with the results from laboratory 1 and, separately, the results from laboratory 2. For each laboratory, risks were calculated with the same algorithm (17)(18)(19). A scatter plot of the 2 risk figures generated in each laboratory is shown in Fig. 5. The diagonal line represents exact correspondence of risks for the 2 laboratories. For laboratory 1, 43 (9.3%; 95% confidence interval, 7.0%–12.3%) tests were screen-positive. For laboratory 2, 53 (11.5%; 95% confidence interval, 8.9%–14.7%) results were positive. The difference between the 2 laboratories in the classification of patients as either screen-positive or screen-negative was statistically significant (McNemar test, *P* <0.001). Only 34 samples (7.4%) were screen-positive in both laboratories. No overall tendency existed to assign a higher risk in laboratory 2; the median risks reported for all 462 pregnancies were similar (1:6462 and 1:6579 for laboratory 1 and laboratory 2, respectively).
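The McNemar comparison uses only the discordant classifications, i.e., samples screen-positive in one laboratory but not the other. A minimal sketch of the uncorrected chi-square form, on hypothetical discordant counts rather than the study's data:

```python
import math

def mcnemar_p(b, c):
    """McNemar chi-square test on discordant counts: b samples
    screen-positive only in laboratory 1, c only in laboratory 2.
    With 1 degree of freedom, p = erfc(sqrt(chi2 / 2)). No continuity
    correction is applied; exact binomial variants also exist."""
    chi2 = (b - c) ** 2 / (b + c)
    return math.erfc(math.sqrt(chi2 / 2.0))
```

A large imbalance between the discordant counts yields a small p value; equal discordant counts give p = 1.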

The 2 risk figures for any particular sample were often markedly different. For 44.7% of pregnancies, the difference in risk figures was >2-fold; for 7.1%, it was >5-fold; and for 1.7%, it was >10-fold. Large differences in the risk figures from the 2 laboratories were not confined to patients whose risks were very low. For example, for the 104 patients with risks >1:1000 in laboratory 1, the risk reported by laboratory 2 ranged from 0.07-fold to 13-fold the laboratory 1 value.

Two extreme examples in which the risk figures for the 2 laboratories were widely discrepant are shown in Fig. 5. At point A, the risk associated with testing in laboratory 1 was 1:2128, based on AFP = 0.79 MoM, uE3 = 1.16 MoM, hCG = 1.30 MoM, and INH-A = 1.13 MoM. For laboratory 2, the risk was 1:124, based on AFP = 0.76 MoM, uE3 = 0.44 MoM, hCG = 1.55 MoM, and INH-A = 1.14 MoM. At point B, the laboratory 1 risk was 1:126, with AFP = 0.99 MoM, uE3 = 0.84 MoM, hCG = 1.67 MoM, and INH-A = 2.49 MoM, whereas the laboratory 2 results were 1:1253, with AFP = 1.08 MoM, uE3 = 1.04 MoM, hCG = 1.38 MoM, and INH-A = 1.52 MoM.

To further assess the effect of these between-laboratory differences on screening, we used statistical modeling to computer-simulate larger populations of pregnancies with the same statistical parameters as those summarized in Table 1. Applied to a population with maternal ages similar to the entire US population delivering liveborn infants, the Down syndrome detection rate and false-positive rate for laboratory 1 were expected to be 82.8% and 7.8%, respectively. For laboratory 2, these rates were expected to be 82.3% and 7.8%, respectively. The modeling indicated that 79% of affected pregnancies and 5.5% of unaffected pregnancies would have been screen-positive by both laboratories. Widely variable risk figures were again apparent, with 44.2% of normal pregnancies showing risk figure differences >2-fold, 8.6% >5-fold, and 1.8% >10-fold. These rates were similar in Down syndrome–affected pregnancies: 49.0%, 12.2%, and 3.1%, respectively.

## Discussion

We demonstrated that when risk evaluation from 2nd-trimester maternal serum screening for Down syndrome is performed at different laboratories, results often differ dramatically, even when laboratories evaluate the same sample. In each laboratory in our study, standard quality assurance protocols were met, identical sample populations were used to establish unaffected population medians, and the same algorithm was used for all risk calculations. For each component analyte, the raw test values for the 2 laboratories were highly correlated (*r* ≥ 0.90), yet individual risks could differ by more than 10-fold.

We have shown previously that the Down syndrome risk figure provided to women receiving serum screening is highly sensitive to small differences in the performance of the testing (8). Variability in each component test is compounded when the likelihood ratio is calculated, leading to relatively large differences in the risk figures. Our data showed that the between-laboratory differences in analyte MoM values can be large, notably for uE3 and INH-A. For these 2 tests, the CVs for pairs of tests performed in the 2 laboratories were high, and the correlation coefficients were relatively low. These 2 tests were therefore the major contributors to the between-laboratory differences in risk seen in this study.

For uE3 measurement, there appeared to be systematic differences in the performance of the 2 manufacturers’ assays. Although the raw uE3 values were highly correlated, values expressed in MoMs were substantially less well correlated. This lack of correlation is attributable to differences in regressed median values, particularly at higher gestational ages (Fig. 2), which occurred for unclear reasons. The difference in these regression curves could reflect uE3–antibody cross-reactivity or differences in uE3-antibody/antigen recognition in one of the assays that are more apparent at higher gestational age. Nonlinear assay dilution or matrix effects seem unlikely, because such effects should be apparent from the plot of the raw values. Whatever the cause, the difference in performance of different manufacturers’ assays is of concern because each risk is calculated under the assumption that all assays perform similarly across all gestational ages from 14 to 21.9 weeks and according to the statistical parameters incorporated into the algorithm.

A similar explanation cannot be applied to the poor between-laboratory reproducibility of INH-A test results. Both laboratories used the same manufacturer’s assay, and MoM values for the 2 laboratories correlated almost as strongly as the raw concentration data. The between-laboratory variability in INH-A results would therefore seem to be primarily attributable to differences in commercial reagent set lots and/or procedural differences (for example, the extent of color development, plate washing, and duplicate assay performance in laboratory 1). The INH-A assay was not carried out on an automated chemistry analyzer, and of the 4 analytes tested in laboratory 1, INH-A had the largest within-laboratory CV. The introduction of a more standardized protocol or automated assay for INH-A testing could potentially help decrease error in risk assessments.

Although individual patient risk assessments are highly sensitive to these laboratory test performance variables, from a population perspective, they appear to have only a modest effect on the overall performance of Down syndrome screening. Our results for the 462 specimens tested in laboratory 1 and laboratory 2 indicated that the false-positive rate would be higher for laboratory 2 (11.5% vs 9.3%). The results of our computer modeling indicated, however, that the overall effect on the detection rate and false-positive rate would be very small. Some caution is needed in interpreting the results of the simulation because the modeling is based on statistical parameters developed for an essentially full range of unaffected pregnancy test results, and assumptions were made as to the performance of the screening in affected pregnancies. It is not known whether any systematic bias was introduced as a result of these assumptions.

Most of the variance in markers is attributable to differences between patients, and the difference in the statistical parameters for different laboratory protocols will be relatively small (Table 1). Nevertheless, best efforts should be made to minimize sources of variation arising within the laboratory. To provide accurate risk assessment, screening programs must ensure that their test performance corresponds as closely as possible to the statistical parameters in the screening algorithm.

It should also be recognized that the statistical variables within the screening algorithm and the maternal age–specific risks for Down syndrome are subject to some uncertainty. Meta-analyses of well-conducted trials can provide more robust estimates of these variables. Ideally, each laboratory would modify its algorithm to match the screening achievable for its local population and particular choice of tests (21). In practice, because of competing sequential screening technologies (24), laboratories will likely find it increasingly difficult to generate an appropriate local set of statistical screening parameters from an unselected group of women. Greater emphasis therefore needs to be placed on indirect quality assurance measures that ensure that the assays perform according to expectations.

## Footnotes

1 Reference values were based on Wald et al. (17)(18)(19), computed from the variances and covariances with weighting to reflect the proportions of pregnancies with ultrasound dating (75.7%).

2 Based on 23,704 pregnancies studied in laboratory 1 (23), of which 73.3% were dated by ultrasound.

1 Current affiliation: Department of Pathology and Laboratory Medicine, Hartford Hospital, Hartford, CT.

2 Nonstandard abbreviations: MoM, multiple of the median; AFP, α-fetoprotein; hCG, human chorionic gonadotropin; uE3, unconjugated estriol; INH-A, inhibin-A; DSL, Diagnostic Systems Laboratories.

- © 2006 The American Association for Clinical Chemistry