Abstract
BACKGROUND: To date, no published nomogram for prostate cancer (PCa) risk prediction has considered the between-method differences associated with estimating concentrations of prostate-specific antigen (PSA).
METHODS: Total PSA (tPSA) and free PSA were measured in 780 biopsy-referred men with 5 different assays. These data, together with other clinical parameters, were applied to 5 published nomograms that are used for PCa detection. Discrimination and calibration criteria were used to characterize the accuracy of the nomogram models under these conditions.
RESULTS: PCa was found in 455 men (58.3%), and 325 men had no evidence of malignancy. Median tPSA concentrations ranged from 5.5 μg/L to 7.04 μg/L, whereas the median percentage of free PSA ranged from 10.6% to 16.4%. Both the calibration and discrimination of the nomograms varied significantly across different types of PSA assays. Median PCa probabilities, which indicate PCa risk, ranged from 0.59 to 0.76 when different PSA assays were used within the same nomogram. On the other hand, various nomograms produced different PCa probabilities when the same PSA assay was used. Although the ROC curves had comparable areas under the ROC curve, considerable differences were observed among the 5 assays when the sensitivities and specificities at various PCa probability cutoffs were analyzed.
CONCLUSIONS: The accuracy of the PCa probabilities predicted according to different nomograms is limited by the lack of agreement between the different PSA assays. This difference between methods may lead to unacceptable variation in PCa risk prediction. A more cautious application of nomograms is recommended.
Prostate cancer (PCa)^{5} detection relies on the measurement of prostate-specific antigen (PSA) concentrations (1, 2). An increased PSA value is directly associated with a higher probability of having PCa (3–5), but benign prostate hyperplasia or prostatitis can also cause increases in serum PSA (6). Of all the molecular variables involving the total PSA (tPSA) concentration, only the ratio of free PSA (fPSA) to tPSA [i.e., the percentage of free PSA (%fPSA)] is clinically relevant and capable of avoiding unnecessary biopsies (7). Yet, the low specificity of PSA and %fPSA remains problematic.
Multivariate models, such as artificial neural networks or logistic regression–based nomograms, improve PCa risk prediction by combining tPSA, %fPSA, age, digital rectal examination (DRE) results, and/or prostate volume (8, 9). The frequent use of nomograms as PCa-classification models and for recurrence prediction has recently been reviewed (8, 10). The inclusion of %fPSA in nomograms has improved the accuracy of PCa diagnosis (11, 12). The nomograms show an improvement in specificity (13) compared with the use of %fPSA alone, but these prediction models were developed with data from different populations, used various tPSA intervals (e.g., 0–20, 4–10, or 0–50 μg/L) and applied different PSA assays. To the best of our knowledge, no one has analyzed whether the use of different PSA and fPSA assays has an effect on nomogram-based PCa prediction. Clinicians use some nomograms that are available online in patient counseling, but without considering the inadequate comparability of PSA results obtained with the various assays and the effect on the probability calculated from the nomograms. Despite the introduction of tPSA and fPSA assays that are calibrated against WHO PSA reference materials, PSA values still cannot be used interchangeably (14, 15). Besides tPSA, %fPSA values also differ between assays (15). External nomogram validation studies have not considered (11, 16) or have considered only partially (17, 18) the influence of PSA and %fPSA assays.
The aim of this study was to evaluate the effect of assay-dependent variation in PSA and %fPSA values on nomogram-based PCa prediction. To this end, we measured the PSA values of 780 patients simultaneously with 5 different PSA assays and used the results from each of the assays to calculate the probability of PCa from 5 different nomograms. The 2 nomogram validation criteria, discrimination and calibration, were used to assess the effect of PSA and %fPSA variation on the predictive results obtained with the nomograms. This study also provided an external validation of the nomogram models.
Materials and Methods
STUDY GROUPS AND SAMPLES
The study population consisted of 455 PCa patients and 325 men with no evidence of malignancy (NEM) who had a tPSA value within the interval of 0.54–23.4 μg/L as measured with the AxSYM assay (Abbott Diagnostics). The study population was previously described, in detail, in another report (19). All men had been referred to the Department of Urology or the affiliated outpatient department at the Charité – Universitätsmedizin Berlin, and samples of archival sera collected between 2001 and 2004 were investigated retrospectively. Disease status was confirmed histologically for all patients by prostate biopsy (8–10 cores). A total of 364 patients treated by radical prostatectomy had pathologic stages pT1/pT2 (n = 263) or pT3/pT4 (n = 101) and Gleason scores ≤6 (n = 148) or ≥7 (n = 215; results not available for 1 patient). The clinical stages for the remaining 91 patients were as follows: T1/T2, n = 58; T3, n = 33. The biopsy Gleason scores were ≤6 for 49 patients, ≥7 for 34 patients, and unavailable for 8 patients.
Blood samples were taken before any procedures involving the prostate and at least 3–4 weeks after prostate manipulation. All samples were aliquoted and stored at −80 °C. Two aliquots of unthawed samples were analyzed in 2004 for the parallel measurement of tPSA and fPSA by 5 different assays, as previously described (15, 19). Prostate volume was determined by transrectal ultrasound examination with the prolate ellipsoid formula. All patients underwent DRE.
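The volume calculation mentioned above can be illustrated with a minimal sketch. It assumes the commonly used form of the prolate ellipsoid formula, volume = π/6 × length × width × height; the text does not state which variant was applied, and the function name and example diameters are hypothetical.

```python
import math

def prolate_ellipsoid_volume_ml(length_cm, width_cm, height_cm):
    """Volume in mL (cm^3) of an ellipsoid described by its three diameters.

    Assumption: the common form V = pi/6 * L * W * H; the article does not
    spell out the exact formula that was used.
    """
    return math.pi / 6.0 * length_cm * width_cm * height_cm

# Hypothetical ultrasound diameters in cm, for illustration only.
volume = prolate_ellipsoid_volume_ml(4.0, 4.5, 3.5)
print(f"{volume:.1f} mL")
```

With three equal diameters the expression reduces to the volume of a sphere, which is a quick plausibility check for the formula.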
PSA ASSAYS AND MEASUREMENTS
tPSA and fPSA [or complexed PSA (cPSA)] concentrations were measured with the following analyzers: AxSYM (Abbott Diagnostics), ADVIA Centaur [Siemens Healthcare Diagnostics; this assay measures cPSA instead of fPSA (Siemensc)], Access (Beckman Coulter), Immulite 2000 [Siemens Healthcare Diagnostics; measures fPSA (Siemensf)], and Elecsys 2010 (Roche Diagnostics). The assays that used the AxSYM, ADVIA, and Elecsys analyzers were calibrated against the WHO PSA standards (96/668 and 96/670), whereas the Access and Immulite assays used their own calibrators. The differences between these assays have previously been described in detail (15, 19). %fPSA values were determined from measurements obtained with these 5 assays. The PSA and %fPSA values used in a previous study (19) were applied to nomogram predictions.
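As a minimal sketch of how %fPSA follows from the measured concentrations: %fPSA is the ratio of fPSA to tPSA expressed as a percentage. For the assay that measures cPSA instead of fPSA, the sketch assumes that fPSA is approximated as tPSA minus cPSA; that step is an assumption and is not stated in the text. Function names and example values are hypothetical.

```python
def percent_free_psa(tpsa_ug_l, fpsa_ug_l):
    """%fPSA = 100 * fPSA / tPSA (both concentrations in ug/L)."""
    if tpsa_ug_l <= 0:
        raise ValueError("tPSA must be positive")
    return 100.0 * fpsa_ug_l / tpsa_ug_l

def free_psa_from_complexed(tpsa_ug_l, cpsa_ug_l):
    """ASSUMPTION: fPSA is approximated as tPSA - cPSA for a cPSA-based assay."""
    return tpsa_ug_l - cpsa_ug_l

# Hypothetical values for one serum sample.
pct_direct = percent_free_psa(6.0, 0.9)
pct_from_cpsa = percent_free_psa(6.0, free_psa_from_complexed(6.0, 5.1))
print(round(pct_direct, 1), round(pct_from_cpsa, 1))
```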
APPLICATION OF NOMOGRAMS
The data from all patients with complete sets of values for the 5 PSA assays, age, prostate volume, and DRE status were then used with the 5 different nomograms. Table 1 lists the characteristics of the 5 nomograms. Nomogram I is based on age, DRE, and tPSA. Nomogram II uses these variables but also includes %fPSA (11). Nomogram III, which is available at http://www.nomogram.org/, was constructed by combining age, DRE, tPSA, %fPSA, and sampling density (16). The other 2 nomograms include age, DRE, tPSA, and fPSA (nomogram IV) and the additional factors of transrectal ultrasound and prostate volume (nomogram V) (17).
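To make concrete how a logistic regression-based nomogram turns these variables into a PCa probability, the following sketch may help. All coefficients, the log transformation of tPSA, and the patient values are hypothetical and chosen only for illustration; they are not taken from any of the 5 published nomograms.

```python
import math

# HYPOTHETICAL coefficients for illustration only; not from any published nomogram.
HYPOTHETICAL_COEFFICIENTS = {
    "intercept": -1.0,
    "age": 0.02,
    "dre_suspicious": 0.9,
    "log_tpsa": 0.6,
    "pct_fpsa": -0.08,
}

def predicted_pca_probability(age, dre_suspicious, tpsa, pct_fpsa,
                              coef=HYPOTHETICAL_COEFFICIENTS):
    """Logistic model: p = 1 / (1 + exp(-z)), z = linear predictor."""
    z = (coef["intercept"]
         + coef["age"] * age
         + coef["dre_suspicious"] * (1 if dre_suspicious else 0)
         + coef["log_tpsa"] * math.log(tpsa)
         + coef["pct_fpsa"] * pct_fpsa)
    return 1.0 / (1.0 + math.exp(-z))

# Same hypothetical patient; only the PSA inputs differ, as two assays might report them.
p_assay_x = predicted_pca_probability(65, False, tpsa=5.5, pct_fpsa=16.0)
p_assay_y = predicted_pca_probability(65, False, tpsa=7.0, pct_fpsa=11.0)
print(round(p_assay_x, 2), round(p_assay_y, 2))
```

Because age and DRE status are held fixed, any difference between the two printed probabilities comes from the assay-dependent tPSA and %fPSA inputs alone.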
STATISTICAL ANALYSIS
Statistical calculations were performed with SPSS 17.0 for Windows (IBM SPSS). Specifically, the Friedman test was used to detect significant differences in multiple samples. The Wilcoxon test was used for pairwise comparison of groups, assays, or nomogram outputs. P values <0.05 were considered statistically significant. We used the Bonferroni correction to correct the results for pairwise comparisons involving multiple tests.
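The Bonferroni step can be sketched as follows. The sketch assumes that the family of tests consists of all pairwise comparisons among the 5 assays; the text does not state the number of tests used for the correction, and the assay labels are shorthand.

```python
from itertools import combinations

# Shorthand labels for the 5 assays.
ASSAYS = ["Abbott", "Siemens-c", "Beckman Coulter", "Siemens-f", "Roche"]

def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests

def is_significant(p_value, alpha, n_tests):
    """True if a single pairwise P value survives the correction."""
    return p_value < bonferroni_alpha(alpha, n_tests)

# ASSUMPTION: one test per unordered pair of assays.
pairs = list(combinations(ASSAYS, 2))
adjusted = bonferroni_alpha(0.05, len(pairs))
print(len(pairs), adjusted)
```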
We used ROC curve analyses to evaluate the discrimination capability of the nomograms, and we used the test of Hanley and McNeil to compare areas under the ROC curve (AUCs). MedCalc software (version 10.4.8.0; MedCalc Software) was used to compare sensitivity and specificity data for the 5 nomograms at various cutoffs.
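For orientation, the AUC and its standard error can be sketched in plain Python. The sketch uses the rank-based (Mann-Whitney) AUC and the standard-error approximation published by Hanley and McNeil; the comparison statistic takes the correlation r between two AUCs as an input, which is an assumption here because the text does not describe how that correlation was obtained. Function names and example scores are hypothetical.

```python
import math

def auc_mann_whitney(scores_pca, scores_nem):
    """Fraction of (PCa, NEM) pairs in which the PCa patient scores higher; ties count 0.5."""
    wins = 0.0
    for p in scores_pca:
        for n in scores_nem:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pca) * len(scores_nem))

def hanley_mcneil_se(auc, n_pca, n_nem):
    """Standard error of an AUC (Hanley-McNeil approximation)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (auc * (1.0 - auc)
                + (n_pca - 1) * (q1 - auc ** 2)
                + (n_nem - 1) * (q2 - auc ** 2)) / (n_pca * n_nem)
    return math.sqrt(variance)

def z_for_auc_difference(auc1, se1, auc2, se2, r=0.0):
    """z statistic for two AUCs; r is the ASSUMED correlation between them."""
    return (auc1 - auc2) / math.sqrt(se1 ** 2 + se2 ** 2 - 2.0 * r * se1 * se2)

# Hypothetical nomogram outputs for 3 PCa and 3 NEM patients.
example_auc = auc_mann_whitney([0.9, 0.8, 0.4], [0.5, 0.3, 0.2])
# Group sizes of this study (455 PCa, 325 NEM) with an illustrative AUC of 0.85.
example_se = hanley_mcneil_se(0.85, 455, 325)
print(round(example_auc, 3), round(example_se, 4))
```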
The calibration of the nomograms was assessed according to Harrell et al. (20) with calibration plots. The results were used as a performance measure of the agreement between the predicted probabilities and observed outcomes. The points of these calibration plots are constructed with the predicted probability of a positive biopsy result on the x axis and the observed frequency of PCa-positive biopsies on the y axis. For this purpose, the 780 patients were subdivided into 20 groups (i.e., each group being 5% of the entire study group) of 39 men, according to the order of their respective predicted PCa probabilities. The mean observed outcomes and predicted probabilities were calculated for each group. A cubic smoothing spline was computed to suppress random fluctuations in the graphical representation and to expose the relationship between the predicted probabilities and observed outcomes. Figures were developed with MATLAB (MathWorks). To determine the consistency between these pairs, we computed the intraclass correlation coefficient (ICC), where a value of 1 is ideal. The ICC is a measure of consistency, which is obtained by multiplying the Pearson correlation coefficient by a correction factor that is based on the means and SDs of the observed outcomes and predicted probabilities [see Lin et al. (21)].
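The grouping and the agreement measure can be sketched as follows. The coefficient is written in the form of Lin's concordance correlation coefficient, which equals the Pearson correlation multiplied by a correction factor based on means and SDs; that this matches the authors' exact computation is an assumption. The smoothing spline is omitted, and the function names and toy data are hypothetical.

```python
def calibration_groups(predicted, observed, n_groups=20):
    """Sort patients by predicted probability, split them into n_groups
    consecutive groups, and return (mean predicted, observed rate) per group."""
    pairs = sorted(zip(predicted, observed))
    n = len(pairs)
    points = []
    for g in range(n_groups):
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        obs_rate = sum(o for _, o in chunk) / len(chunk)
        points.append((mean_pred, obs_rate))
    return points

def lin_concordance(x, y):
    """Lin's concordance coefficient: 2*cov / (var_x + var_y + (mean_x - mean_y)^2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    var_x = sum((a - mx) ** 2 for a in x) / n
    var_y = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2.0 * cov / (var_x + var_y + (mx - my) ** 2)

# Toy data: 40 hypothetical patients, biopsy outcome coded 1 (PCa) or 0 (NEM).
toy_predicted = [i / 40 for i in range(40)]
toy_observed = [0] * 20 + [1] * 20
toy_points = calibration_groups(toy_predicted, toy_observed, n_groups=20)
toy_ccc = lin_concordance([p for p, _ in toy_points], [o for _, o in toy_points])
print(len(toy_points), round(toy_ccc, 3))
```

With 780 patients and 20 groups, the same slicing yields groups of 39 men each, as described above. A coefficient of 1 is reached only when every group lies on the 45° line; a constant offset between predicted and observed values lowers it even if the Pearson correlation stays at 1.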
Results
NOMOGRAM-BASED PREDICTIONS OF PCa ACCORDING TO PSA ASSAYS
Table 2 summarizes the clinical data and PSA values measured with the 5 assays for each of the study groups (58.3% PCa, 41.7% NEM). The median tPSA values obtained for the 5 assays were always significantly different except for the 2 lowest tPSA values for the Abbott and Siemensc assays (Table 2). The highest values were observed with the Siemensf assay. The largest median differences in %fPSA results were detected between the Siemensc and Siemensf assays.
Table 1 summarizes the predicted PCa probabilities, which are the median values of the respective individual PCa probabilities of all patients. For every nomogram, the pairwise comparison of the predicted probabilities was significantly (P < 0.0001) dependent on the PSA assay, except for the Abbott and Siemensc assays. Nomogram IV had the most diverse results, with median PCa probabilities of 0.59 and 0.76 for the Abbott assay and the Siemensf assay, respectively. In addition, we observed remarkable differences between the predicted probabilities obtained with the various nomograms, which ranged from 0.24 for nomogram I to 0.76 for nomogram IV, despite the fact that nomograms I and II and nomograms IV and V were established by the same group. Three patients, with tPSA values of approximately 2, 7, and 16 μg/L, exemplify the fact that different PCa probabilities are obtained when the results of different PSA assays are used, irrespective of the other variables (Table 3). Whereas the difference between algorithms in the probability of PCa between the lowest and highest tPSA increases with higher tPSA values, the difference in probabilities between the lowest and highest %fPSA values appears to be large for all 3 patients.
DISCRIMINATIVE ABILITY OF THE NOMOGRAMS ACCORDING TO PSA ASSAYS
The AUC as an overall discriminative criterion.
The ability of a nomogram to distinguish between PCa and NEM patients is termed discrimination, which is generally assessed by AUC analysis. A comparison of AUCs for the nomograms with data from the same PSA assay (Table 4) revealed significantly (P < 0.001) lower AUCs for nomogram I (0.79–0.80) compared with the other nomograms (0.82–0.87), with the exception of the comparison of nomograms I and IV for the Siemensc assay (P = 0.033). No differences between the other nomograms were observed. The reason for the lower AUCs for nomogram I may be that it is the only nomogram that does not include %fPSA.
Comparing the AUCs for the 5 PSA assays within the same nomogram showed clinically irrelevant differences (AUC differences ≤0.03).
Assessment of prediction ability according to various cutoffs.
Although AUC values represent the overall measure of the discriminative ability of a given model, it is most important to analyze the sensitivity and specificity values from ROC curve analyses from the nomograms over a certain range of outputs. Therefore, we applied data for different PSA assays to the nomograms and compared the sensitivity and specificity curves (Fig. 1) as a function of sensitivity or specificity on the y axis and the respective cutoff probability used for the nomogram on the x axis. The curves are obviously different. For example, the specificities obtained with the Siemensf and Abbott assays for nomogram IV (Fig. 1D) at a chosen nomogram probability were 43% and 73%, respectively, whereas the sensitivity varied from 81% (Siemensc assay) to 95% (Siemensf assay). On the other hand, at the clinically important sensitivity cutoff of 95% (see Table 1 in the Data Supplement that accompanies the online version of this article at http://www.clinchem.org/content/vol57/issue7), the specificity shows large variation within 1 nomogram when the 5 assays are compared. These data (Fig. 1; see Table 1 in the online Data Supplement) demonstrate that sensitivity and specificity data must be considered as important criteria, rather than only the global AUC measurements, in characterizing the effect of PSA interassay variation (Table 4).
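The cutoff-based comparison can be illustrated with a short sketch that computes sensitivity and specificity at a fixed nomogram probability for the same patients under two assays. The probabilities and outcomes below are hypothetical and serve only to show the calculation; treating a probability equal to the cutoff as test-positive is an assumption.

```python
def sensitivity_specificity(probabilities, has_pca, cutoff):
    """Classify as positive when the nomogram probability is >= cutoff."""
    tp = fn = tn = fp = 0
    for p, cancer in zip(probabilities, has_pca):
        positive = p >= cutoff
        if cancer and positive:
            tp += 1
        elif cancer:
            fn += 1
        elif positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical nomogram outputs for the same 4 patients with two assays.
has_pca = [False, False, True, True]
probs_assay_a = [0.30, 0.45, 0.55, 0.70]
probs_assay_b = [0.40, 0.60, 0.70, 0.85]  # systematically higher PSA-derived risk

sens_a, spec_a = sensitivity_specificity(probs_assay_a, has_pca, cutoff=0.5)
sens_b, spec_b = sensitivity_specificity(probs_assay_b, has_pca, cutoff=0.5)
print(sens_a, spec_a, sens_b, spec_b)
```

In this toy example the ranking of patients is the same for both assays, yet the specificity at the 0.5 cutoff differs because one assay shifts all probabilities upward.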
CALIBRATION OF THE NOMOGRAMS ACCORDING TO THE PSA ASSAYS
The concordance between the PCa probability predicted by the nomograms and the real (observed) rate of PCa can be visually represented in a calibration plot, which is considered a measure of a model's quality. With total concordance, there is no difference between predicted probabilities and observed rates, and all points lie on the 45° line. Fig. 2 shows the calibration differences due to assay-dependent PSA values for all of the nomograms. In general, differences between observed PCa rates and predicted PCa probabilities depend on both the PSA assay used and the corresponding nomogram. Only the Siemensf assay shows excellent performance in nomogram V, with an ICC of almost 1 (Fig. 2E). In contrast, Fig. 2A shows large differences between the observed PCa rates and predicted PCa probabilities, with up to 2-fold underestimation of the PCa rate, regardless of the PSA assay that is applied to the nomogram. These data indicate a weak performance of nomogram I, regardless of the PSA assay that is used. The main reason for this inferior validity for nomogram I is the absence of the %fPSA value.
Discussion
Numerous nomograms have been developed to predict PCa risk and to facilitate the process of prostate biopsy decision-making for the clinician (22). All of these tools use the PSA value as a decisive variable for risk stratification, in addition to such other variables as a suspicious DRE result and prostate volume. External validations of various nomograms have most often been performed by applying the nomograms to different populations (16, 18). Table 1 shows that the different results obtained with the different PSA assays greatly affect the reliability of PCa risk prediction with nomograms.
It is well established that the use of different PSA assays generally yields different tPSA and %fPSA values (14, 23), despite the introduction of WHO PSA standards to improve the interchangeability of PSA results among the various assays (14). Thus, assay calibration is only partially responsible for the differences between the assays in PSA estimation (15). An analysis of PCa probabilities that used nomograms IV and V in a separate cohort of approximately 640 men (24) revealed lower medians with the WHO-calibrated data than with the Hybritech-calibrated data (see Table 2 in the online Data Supplement). Aside from the variation in tPSA, differences in %fPSA are also responsible for the variation in PCa probabilities. For example, the PCa probability ranged from 0.49 for the Abbott assay to 0.76 for the Siemensf assay when predictions were made with nomogram IV, as seen in 1 patient (patient B in Table 3). The predicted PCa probabilities obtained with nomogram IV for the other 2 patients also demonstrated large variation (between 0.16 and 0.38, patient A in Table 3; between 0.54 and 0.91, patient C in Table 3).
When we applied fixed PCa probability cutoffs, the differences between different assays in sensitivity and specificity increased when we used nomograms that take %fPSA into account (Fig. 1, B–E). Therefore, tPSA assay variation seemed to have a more moderate impact on PCa probability values than the variation in %fPSA, a result that has already been shown (25). This conclusion is documented in the examples of 3 patients with different tPSA concentrations (Table 3). The specificities also demonstrated large variation at a given PCa probability, such as 0.5. Specificities for the Siemensf and Abbott data were >30% different at a 0.5 PCa probability when nomogram IV was used (Fig. 1D). These data suggest that the nomograms showed large differences in discrimination power, whereas overall AUC values did not (Table 4). Predictions based on ROC curve analyses are based on rank-order statistics (26). This approach is insensitive to systematic errors in calibration, an issue that has recently been reviewed (27); therefore, AUC comparisons alone are not appropriate for the validation of risk calculators. ROC curve analysis has been criticized when it is used as the only tool to differentiate between 2 cohorts (28, 29), and results can be misinterpreted (18). Our results, however, confirm reviewed data (8, 22) that show a general advantage of %fPSA-based multivariate models for PCa detection.
Calibration differences in PCa probabilities are also important, as Fig. 2 demonstrates. Yet nomogram I (Fig. 2A), which does not account for %fPSA results, showed only marginal variation among the PSA assays for predicted PCa probabilities and observed PCa rates but had the lowest overall ICC value. In addition, nomogram I showed the largest difference between observed rates and predicted PCa probabilities, with an approximately 2-fold higher observed PCa rate. This detection rate clearly improved in nomograms that included %fPSA in the calculations. On the other hand, the variance between the assays was much larger for nomograms that included %fPSA, a finding that is especially evident with nomogram IV. Thus, external validation of multivariate models requires a thorough assessment of the potential contributions of calibration analysis when attempting to estimate the concordance between predicted PCa probabilities and observed PCa rates.
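Why a rank-based AUC cannot reveal a systematic calibration error can be shown with a small sketch: applying any strictly increasing transformation to the predicted probabilities leaves the ranking, and hence the AUC, unchanged, while the specificity at a fixed cutoff changes. The probabilities and the square-root transformation are hypothetical stand-ins for an assay-related upward shift.

```python
def auc(pca_scores, nem_scores):
    """Fraction of (PCa, NEM) pairs ranked correctly; ties count 0.5."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pca_scores for n in nem_scores)
    return wins / (len(pca_scores) * len(nem_scores))

def specificity(nem_scores, cutoff):
    """Share of NEM patients whose probability falls below the cutoff."""
    return sum(1 for n in nem_scores if n < cutoff) / len(nem_scores)

# Hypothetical nomogram probabilities.
pca = [0.35, 0.55, 0.70, 0.90]
nem = [0.10, 0.30, 0.45, 0.60]

# Strictly increasing transformation: the order is preserved, the values are raised.
pca_shifted = [p ** 0.5 for p in pca]
nem_shifted = [n ** 0.5 for n in nem]

auc_original, auc_shifted = auc(pca, nem), auc(pca_shifted, nem_shifted)
spec_original, spec_shifted = specificity(nem, 0.5), specificity(nem_shifted, 0.5)
print(auc_original, auc_shifted, spec_original, spec_shifted)
```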
Interestingly, nomograms IV and V, which were developed with data from the Beckman Coulter and Abbott PSA assays, had relatively low ICC values (Fig. 2, D and E) for these 2 assays. This result indicates that the effect of the assays on PCa prediction seems to be superimposed on other effects. These contradictory results are most likely caused by differences in the characteristics of the cohorts used to build the nomograms, compared with our cohort. In addition to the effect of the variation contributed by the different PSA assays, 39% of the patients in our cohort had suspicious DRE findings (Table 2). This rate is different from the rates for the cohorts upon which the nomograms were based (11, 16, 17). Although Kawakami et al. (17) and Chun et al. (16) found only 17.5% and 20.3% of all patients, respectively, with a suspicious DRE result, the 3 cohorts used to develop nomograms I and II had higher rates of suspicious DRE findings (31.1% overall) (11).
Additionally, the prevalence of PCa in our population (58.3%) was higher than the prevalences in the cohorts used to develop the 5 nomograms, which varied from 35.2% to 41.9% (11, 16, 17). These differences may also have an impact on nomogram performance in external validations.
The study is limited by both the inclusion of only 5 available nomogram-based models and the retrospective study design. Unfortunately, none of the other nomograms available for PCa risk prediction were suitable for our data (30–33). In some cases, nomograms were available, but they had been developed with small cohorts (34).
In summary, the present study has provided 2 main conclusions. First, our results demonstrate that nomogram-based PCa prediction is influenced by the type of PSA assay that is used. Second, AUC comparison alone is insufficient, and calibration analysis is recommended for validation of models. The dependence on PSA assay calls into question the general applicability of these models without considering the suitability of a specific PSA assay for a given model.
Footnotes
† These authors contributed equally to the study.

5 Nonstandard abbreviations: PCa, prostate cancer; PSA, prostate-specific antigen; tPSA, total PSA; fPSA, free PSA; %fPSA, percentage of free PSA; DRE, digital rectal examination; NEM, no evidence of malignancy; cPSA, complexed PSA; Siemensc, Siemens test using cPSA; Siemensf, Siemens test using fPSA; AUC, area under the ROC curve; ICC, intraclass correlation coefficient.
Author Contributions: All authors confirmed they have contributed to the intellectual content of this paper and have met the following 3 requirements: (a) significant contributions to the conception and design, acquisition of data, or analysis and interpretation of data; (b) drafting or revising the article for intellectual content; and (c) final approval of the published article.
Authors' Disclosures or Potential Conflicts of Interest: No authors declared any potential conflicts of interest.
Role of Sponsor: The funding organizations played no role in the design of study, choice of enrolled patients, review and interpretation of data, or preparation or approval of manuscript.
 Received for publication June 3, 2010.
 Accepted for publication March 21, 2011.
 © 2011 The American Association for Clinical Chemistry