## Abstract

Background: Optimal cutoff values for test results involving continuous variables are often derived in a data-driven way. This approach, however, may lead to overly optimistic measures of diagnostic accuracy. We evaluated the magnitude of the bias in sensitivity and specificity associated with data-driven selection of cutoff values and examined potential solutions to reduce this bias.

Methods: Different sample sizes, distributions, and prevalences were used in a simulation study. We compared data-driven estimates of accuracy based on the Youden index with the true values and calculated the median bias. Three alternative approaches (assuming a specific distribution, leave-one-out, smoothed ROC curve) were examined for their ability to reduce this bias.

Results: The magnitude of bias caused by data-driven optimization of cutoff values was inversely related to sample size. If the true values for sensitivity and specificity are both 84%, the estimates in studies with a sample size of 40 will be approximately 90%. If the sample size increases to 200, the estimates will be 86%. The distribution of the test results had little impact on the amount of bias when sample size was held constant. More robust methods of optimizing cutoff values were less prone to bias, but the performance deteriorated if the underlying assumptions were not met.

Conclusions: Data-driven selection of the optimal cutoff value can lead to overly optimistic estimates of sensitivity and specificity, especially in small studies. Alternative methods can reduce this bias, but finding robust estimates for cutoff values and accuracy requires considerable sample sizes.

Diagnostic accuracy denotes the amount of agreement between the results of an index test (the test under evaluation) and the reference standard (the best method available for determining the presence or absence of the disease of interest). Commonly used accuracy measures are sensitivity (the proportion of individuals with the target condition who have a positive result in the index test) and specificity (the proportion of those without the target condition who have a negative result in the index test). For a test of a continuous or ordinal variable, the receiver operating characteristic (ROC) curve is an informative way to present sensitivity vs 1 − specificity for each possible cutoff value of the index test(1)(2). When higher values in a test are more indicative of the presence of disease, lowering the cutoff value will increase the sensitivity and decrease the specificity. For clinical purposes of linking actions to test results, a single threshold or cutoff value is used. The optimal choice for this cutoff value is ultimately determined by the consequences associated with false-positive and false-negative test results(3).

In the early phases of test development, when the exact role of the index test has not been fully defined and thus the consequences of incorrect test results have not yet been determined, the preference for choosing the optimal cutoff value is often a criterion that weighs both sensitivity and specificity equally. Such a criterion is the Youden index, which is defined as: sensitivity + specificity − 1(4)(5). The optimal cutoff value that maximizes the Youden index is often determined in a “data-driven” way. This means that the sensitivities and specificities are calculated from the data at hand across all possible cutoff values within the range of the test results, with the cutoff value leading to the highest Youden index then being selected.
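The data-driven selection just described is simple to state in code. The study's analyses used SAS; the sketch below is an illustrative Python re-implementation, assuming that higher test values indicate disease and that ties in the Youden index go to the lowest cutoff:

```python
def youden_optimal_cutoff(diseased, nondiseased):
    """Scan all observed test values as candidate cutoffs and return the
    (cutoff, sensitivity, specificity, youden) tuple that maximizes
    sensitivity + specificity - 1."""
    best = None
    for c in sorted(set(diseased) | set(nondiseased)):
        sens = sum(x >= c for x in diseased) / len(diseased)       # positives at or above c
        spec = sum(x < c for x in nondiseased) / len(nondiseased)  # negatives below c
        j = sens + spec - 1
        if best is None or j > best[3]:
            best = (c, sens, spec, j)
    return best

# Toy example with 4 diseased and 4 nondiseased individuals (made-up values)
cutoff, sens, spec, j = youden_optimal_cutoff([105, 112, 118, 125],
                                              [92, 98, 103, 109])
```

Because every observed value is tried as a cutoff, chance gaps between the two groups can be exploited, which is exactly the mechanism behind the bias discussed next.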

This data-driven selection of optimal cutoff values is prone to bias, meaning that it can systematically lead to overestimation of the sensitivity and specificity of the test under study. Because chance variation plays a larger role in smaller studies, the observed ROC curve obtained from a single small study will deviate more from the true underlying ROC curve than the observed ROC curve obtained from a large study (see Fig. 1). These fluctuations can occur in both directions, leading to both underestimation and overestimation with respect to the true sensitivity and specificity. In small studies, an increase in sensitivity obtained by taking a lower threshold value will not necessarily lead directly to a decrease in specificity. Because the data-driven approach specifically selects the cutoff value with the highest sum of sensitivity and specificity (i.e., closest to the top left corner of the ROC plot), this value generally corresponds to a point above the true underlying ROC curve. Data-driven selection of cutoff values in test results for a continuous variable may therefore lead to overly optimistic estimates of sensitivity and specificity in studies with a small sample size. Because small sample sizes (<200) are common in diagnostic studies(6), overestimation of diagnostic accuracy by data-driven selection of cutoff values can be a serious and prevalent problem.

This potential for bias associated with data-driven selection of the optimal Youden index has been recognized before, both in diagnostic and prognostic studies(7)(8)(9)(10)(11)(12)(13). These publications have been rather technical, without offering clear guidance or solutions for practice. We therefore performed a series of simulations to document the magnitude of overestimation of sensitivity and specificity under a range of conditions and examined the possible role of alternative ways of estimating sensitivity and specificity with the Youden index. The results of these simulations enable us to inform readers when to be aware of this bias and to advise researchers how to reduce the potential for this bias in future studies.

## Materials and Methods

### Simulation of datasets

Results for an index test of a continuous variable for individuals with and without the disease were simulated on the basis of a specific distribution, sample size, and disease prevalence. The true values for the optimal Youden index and cutoff and the corresponding true maximum sensitivities and specificities were calculated from the true underlying distribution of test results among individuals with the disease and those without the disease.

To examine the impact of sample size, disease prevalence, amount of spread in index test results, and their underlying distribution on the amount of bias, we varied these variables across scenarios. Sample sizes varied from 20 to 1000 patients, prevalence from 5% to 95%, and SDs from 5 to 20; test results were generated from an underlying normal distribution and from 2 types of nonsymmetric distributions, the lognormal and gamma distributions (see Fig. 2).

All analyses were carried out with SAS for Windows, version 9.1.3 (SAS Institute).

### Data-driven estimation of sensitivity and specificity

Data-driven estimates of diagnostic accuracy associated with the optimal cutoff value were evaluated in each simulated dataset and compared with their true values. Each simulation scenario was replicated 2000 times to determine the median magnitude of the bias (the difference between each data-driven estimate of sensitivity and specificity and their true values) and the number of times that sensitivity and specificity were overestimated (expressed as a percentage).
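A condensed Python sketch of this replication loop (the study itself used SAS), based on the basic scenario reported in the Results: normal distributions with means 100 (nondiseased) and 120 (diseased), SD 10, prevalence 50%, and a total sample size of 40. The replicate count is cut to 500 here to keep the example fast:

```python
import random
import statistics

random.seed(1)
TRUE_SENS = 0.8413  # Phi(1): true sensitivity at the true optimal cutoff of 110

def data_driven_sens_spec(diseased, nondiseased):
    """Data-driven (Youden-maximizing) sensitivity and specificity."""
    best = (None, 0.0, 0.0, -1.0)
    for c in sorted(set(diseased) | set(nondiseased)):
        sens = sum(x >= c for x in diseased) / len(diseased)
        spec = sum(x < c for x in nondiseased) / len(nondiseased)
        if sens + spec - 1 > best[3]:
            best = (c, sens, spec, sens + spec - 1)
    return best[1], best[2]

sens_estimates = []
for _ in range(500):                                          # 2000 in the paper
    diseased = [random.gauss(120, 10) for _ in range(20)]     # prevalence 50%,
    nondiseased = [random.gauss(100, 10) for _ in range(20)]  # total n = 40
    s, _ = data_driven_sens_spec(diseased, nondiseased)
    sens_estimates.append(s)

median_sens = statistics.median(sens_estimates)
```

With the true sensitivity at roughly 84%, the median of the data-driven estimates comes out clearly higher, illustrating the upward bias quantified in the Results.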

### Potential solutions to reduce overestimation

We evaluated 3 alternative methods as potential solutions for reducing the magnitude of bias: *(a)* using sample characteristics and assuming a specific underlying distribution, *(b)* leave-one-out cross-validation, and *(c)* robust fitting of ROC curves. We applied these methods to 2 scenarios in which the true underlying distribution of the index test results was gaussian, 2 scenarios with a true underlying lognormal distribution, and 2 scenarios with a true underlying gamma distribution (see Fig. 2). Within each simulated dataset, we compared the data-driven estimate with the estimates of the potential solutions to examine the effectiveness of the solutions in reducing the bias.

### Deriving the optimal cutoff point from assumed distributions

Sample characteristics describing the central tendency and the shape of the distribution of test results can be used to estimate the optimal cutoff value. By assuming a specific underlying distribution (e.g., gaussian) for the test results in the patients with the disease, we can use sample characteristics (descriptors such as the mean and SD) to calculate the cumulative proportion of diseased patients who will have an index test result equal to or greater than the cutoff value, i.e., an estimate of the true sensitivity. Similarly, we can use the observed mean and SD of the nondiseased patients to calculate the proportion of patients without the disease who have an index test result below each possible cutoff value. This proportion equals the specificity at that cutoff. The gamma distribution is characterized by a shape parameter and a scale parameter that describe, just as the mean and SD do for a normal distribution, the variation in test results for individuals with and without the disease within a sample. The lower the value of the shape parameter, the more skewed the distribution; the lower the value of the scale parameter, the less spread out the results (analogous to a smaller SD in normal distributions). We used the UNIVARIATE procedure in SAS to estimate the shape and scale parameters of a gamma distribution from the sampled data and then used the cumulative gamma distribution to calculate sensitivity and specificity.
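For the gaussian case, the calculation amounts to plugging the sample means and SDs into the normal cumulative distribution function and scanning candidate cutoffs; the gamma case is analogous with the fitted shape and scale parameters. A Python sketch of the gaussian version (the grid scan is an implementation convenience of this sketch, not a detail from the paper):

```python
import math

def norm_cdf(x, mu, sd):
    """Normal cumulative distribution function via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sd * math.sqrt(2.0))))

def gaussian_youden(mu_nd, sd_nd, mu_d, sd_d, steps=2000):
    """Scan candidate cutoffs under a gaussian assumption and return the
    (cutoff, sensitivity, specificity) maximizing the Youden index."""
    lo = min(mu_nd - 4 * sd_nd, mu_d - 4 * sd_d)
    hi = max(mu_nd + 4 * sd_nd, mu_d + 4 * sd_d)
    best, best_j = (lo, 1.0, 0.0), -1.0
    for i in range(steps + 1):
        c = lo + (hi - lo) * i / steps
        sens = 1.0 - norm_cdf(c, mu_d, sd_d)   # diseased at or above c
        spec = norm_cdf(c, mu_nd, sd_nd)       # nondiseased below c
        if sens + spec - 1.0 > best_j:
            best_j = sens + spec - 1.0
            best = (c, sens, spec)
    return best

# Basic scenario: means 100 and 120, SD 10
cutoff, sens, spec = gaussian_youden(100, 10, 120, 10)
```

With the parameters of the basic scenario, the scan recovers the true cutoff of 110 and sensitivity and specificity of about 84%, i.e., a Youden index of about 0.68.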

### Leave-one-out cross-validation

In the leave-one-out cross-validation approach, a single individual is removed from the study population and used in the validation process. The cutoff value is derived in a data-driven way from the remaining (n − 1) individuals, as described above. The resulting cutoff value is then applied to the single individual who did not take part in this process. This individual is classified as a true positive, a false positive, a false negative, or a true negative, depending on the individual's true disease status according to the reference standard and on whether the individual's test result falls above or below the cutoff value. This process is repeated for all of the patients in the dataset, and the resulting 2-by-2 table based on all of the individuals is used to evaluate the sensitivity and specificity corresponding to the cutoff values derived from the n − 1 patients.
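A minimal Python sketch of this procedure; the (test result, disease status) pair format and the helper names are illustrative assumptions, not from the paper:

```python
def youden_cutoff(pairs):
    """Data-driven Youden-optimal cutoff from (result, has_disease) pairs."""
    diseased = [x for x, d in pairs if d]
    nondiseased = [x for x, d in pairs if not d]
    best_c, best_j = None, -1.0
    for c in sorted(x for x, _ in pairs):
        sens = sum(x >= c for x in diseased) / len(diseased)
        spec = sum(x < c for x in nondiseased) / len(nondiseased)
        if sens + spec - 1 > best_j:
            best_c, best_j = c, sens + spec - 1
    return best_c

def loo_sens_spec(data):
    """Leave-one-out estimate of sensitivity and specificity."""
    tp = fp = fn = tn = 0
    for i, (x, has_disease) in enumerate(data):
        c = youden_cutoff(data[:i] + data[i + 1:])  # cutoff from the n - 1 others
        positive = x >= c
        if has_disease:
            tp, fn = tp + positive, fn + (not positive)
        else:
            fp, tn = fp + positive, tn + (not positive)
    return tp / (tp + fn), tn / (tn + fp)
```

Each held-out individual is judged against a cutoff chosen without that individual, so a chance gap in the full sample can no longer inflate both sensitivity and specificity at once.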

### Robust ROC curve fitting

In the robust ROC curve fitting approach, we used a smoothing procedure included in SAS software (the LOESS procedure) to fit a smooth nonparametric curve through the observed data points plotted in the ROC graph. The point on the fitted curve with the highest Youden index was used to obtain estimates of sensitivity and specificity.
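SAS's LOESS procedure is not reproduced here; as a rough stand-in for the idea, the sketch below smooths the empirical ROC points with a simple Gaussian kernel over the false-positive rate and reads off the point with the highest smoothed Youden index. The bandwidth and the sample points are arbitrary illustrative choices:

```python
import math

def smoothed_roc_youden(points, bandwidth=0.1):
    """points: (fpr, tpr) pairs from the empirical ROC curve. Returns the
    (fpr, smoothed tpr) pair with the highest smoothed Youden index."""
    best, best_j = (0.0, 0.0), -1.0
    for f0, _ in points:
        # Kernel-weighted average of the observed true-positive rates near f0
        weights = [math.exp(-((f - f0) / bandwidth) ** 2) for f, _ in points]
        tpr_smooth = sum(w * t for w, (_, t) in zip(weights, points)) / sum(weights)
        if tpr_smooth - f0 > best_j:
            best_j = tpr_smooth - f0
            best = (f0, tpr_smooth)
    return best

fpr_opt, sens_opt = smoothed_roc_youden(
    [(0.0, 0.0), (0.05, 0.4), (0.1, 0.6), (0.2, 0.8), (0.4, 0.9), (1.0, 1.0)])
```

Because the smoothed curve no longer passes through the single most extreme observed point, the selected operating point is pulled back toward the underlying curve.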

### Empirical evidence from published diagnostic reviews

From a set of 28 published systematic reviews used in a previously published metaepidemiologic project, we selected the reviews that reported test results for continuous variables and included studies both with and without a prespecified cutoff value. We then compared the summary diagnostic odds ratios for the 2 groups to examine whether the diagnostic accuracy was higher (overestimated) in studies with data-driven selection of cutoff values than in studies that used prespecified cutoff values. The diagnostic odds ratio is an overall measure of accuracy that combines both sensitivity and specificity: [sensitivity/(1 − sensitivity)]/[(1 − specificity)/specificity]. Further details about this set of systematic reviews and the applied statistical methods can be found in an earlier publication(14).
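The diagnostic odds ratio formula in code; at sensitivity = specificity = 84%, it evaluates to about 28, the value used later in the Results:

```python
def diagnostic_odds_ratio(sens, spec):
    """[sensitivity / (1 - sensitivity)] / [(1 - specificity) / specificity]."""
    return (sens / (1.0 - sens)) / ((1.0 - spec) / spec)

dor = diagnostic_odds_ratio(0.84, 0.84)  # (0.84 / 0.16)^2 = 27.5625
```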

## Results

### Simulation of datasets

In the basic scenario, index test results were generated from a normal distribution with a mean (SD) of 100 (10) for persons without the disease and a mean (SD) of 120 (10) for persons with the disease, leading to a true maximum Youden index of 0.68, a true optimal cutoff value of 110, and true values of 84% for both sensitivity and specificity. These true values will change only if the underlying distribution changes (such as a difference in means between diseased and nondiseased individuals or a difference in the spread of the test results) but are not affected by changes in sample size or disease prevalence.

### Data-driven overestimation of sensitivity and specificity

#### Effect of sample size

The amount of bias in the data-driven estimates was inversely related to sample size (Fig. 3). At a total sample size of 40, the median sensitivity and specificity were both 90% (interquartile range, 80%–95%), and their true values were both 84%. Both measures were overestimated in 74% of the simulations. In 60% of the simulations, estimates of sensitivity and specificity exceeded 89%, whereas their true values were both 84%. When the sample size was 200, sensitivity was overestimated in 62% of all simulations, and specificity was overestimated in 60%, whereas their median values (86%; interquartile range, 82%–89%) were closer to the true values of 84%.

#### Effect of disease prevalence

A prevalence of 50% is the most efficient for keeping the combined uncertainty in sensitivity and specificity as small as possible, as our results confirmed. Lowering the prevalence (at the same total sample size) leaves fewer individuals with the disease, producing larger chance fluctuations in sensitivity and therefore more room for its overestimation. The opposite occurs for specificity. The median absolute bias at a prevalence of 10% was 5.9% for sensitivity and 3.6% for specificity. At a prevalence of 90%, the median absolute bias was 2.2% for sensitivity and 6.7% for specificity (results not shown).

#### Overlap in test results between populations with and without the disease

The spread and overlap in test results between populations with and without the disease determine the absolute size of sensitivity and specificity. A smaller SD (less spread) while the difference in the mean values between the populations remains the same will lead to less overlap in test results between diseased and nondiseased individuals. Thus, sensitivity and specificity will increase, leaving less room for overestimation (the ceiling effect: sensitivity cannot exceed 100%). On the other hand, when we allowed the SDs to change without changing sensitivity and specificity, the amount of observed bias in sensitivity and specificity did not vary (results not shown).

#### Effect of underlying distributions

The underlying distribution of the results of simulated tests had little impact on the average amount of bias in comparisons of scenarios based on normal, lognormal, and gamma distributions (see Figs. 4 and 5); however, the amount of bias could vary substantially within a specific distribution, depending on the values of the parameters for that distribution. For example, one of the lognormal distributions produced results in which 60% of the simulations overestimated sensitivity by >5%, whereas the other lognormal distribution produced a sensitivity overestimate of this magnitude in 35% of the simulations.

### Potential solutions for reducing bias

#### Deriving the optimal cutoff point from assumed distributions

Assuming a normal distribution and using the estimated mean and SD from a dataset to calculate the optimal cutoff value decreased the amount of bias when the underlying distribution was in fact gaussian. In one of the scenarios with index test results conforming to a true underlying normal distribution, the median sensitivity and specificity obtained with this strategy were both estimated to be 85%, whereas their true values were 84%, a difference of only 1% (see Fig. 6). When the underlying distribution was a gamma or lognormal one, the same procedure systematically underestimated sensitivity and overestimated specificity, a result that was sometimes worse than that obtained with the uncorrected, data-driven approach. In these situations, the median estimated sensitivity was 2%–13% lower than the true sensitivity (see Fig. 6). The difference between the median estimated specificity and the underlying true specificity was 7% or 8%.

The gamma distribution was more flexible in approximating the various distributions and led to less bias than the uncorrected data-driven method in all scenarios. The median estimated sensitivity varied from 2% below to 3% above the true sensitivity. The median estimated specificity varied from 1% to 4% above the true specificity.

Because we sometimes observed that results for sensitivity and specificity varied in the opposite direction (overestimation in one parameter and underestimation in the other), we summed the absolute values of the bias in sensitivity and specificity. When we assumed underlying normal distributions, the total absolute value of the bias was 59% (the sum of the absolute bias for all sensitivities in all 5 studied scenarios was 28%; the summed bias of all specificities in all 5 scenarios was 31%). When we assumed underlying gamma distributions, the total absolute value of the bias was 20% (the sum of the absolute bias was 12% for sensitivity and 8% for specificity).

#### Leave-one-out cross-validation

The leave-one-out cross-validation approach produced less bias in sensitivity, with estimated values ranging from 2% below to 4% above the true values. Specificity was marginally underestimated (1%–5% lower than the true values; see Fig. 6). The sum of the absolute values of the bias was 9% for sensitivity and 20% for specificity (total bias, 29%).

#### Robust ROC curve fitting

Robust fitting of ROC curves also produced less bias in both sensitivity and specificity. The difference between true and estimated sensitivities ranged from −2% to 3%, and the difference between true and estimated specificities ranged from −4% to 3% (see Fig. 6). The sum of the absolute values of the bias in sensitivity was 9%, and the sum of the bias in specificity was 14% (total bias, 23%).

#### Empirical evidence from published diagnostic reviews

Seven reviews systematically evaluated a test producing results for a continuous variable. Five of these reviews included both studies with a prespecified cutoff value and studies with a data-driven cutoff value. The mean diagnostic odds ratio was 1.71 (95% confidence interval, 1.04–2.82; *P* = 0.03) times higher in studies with a data-driven cutoff value than in studies with a prespecified cutoff value. If we translate this result to sensitivity and specificity and use a prespecified cutoff value to estimate both sensitivity and specificity at 84% (a diagnostic odds ratio of 28), a study using data-driven selection would obtain estimates of sensitivity and specificity of 87.4%, corresponding to a diagnostic odds ratio of 48 (i.e., 28 × 1.71).
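The arithmetic in the last sentence can be checked in a few lines (assuming, as the text does, that sensitivity and specificity are equal, so that the diagnostic odds ratio equals the squared odds of a correct classification):

```python
import math

dor_prespecified = 28.0                    # sens = spec = 84% gives a DOR of ~28
dor_data_driven = dor_prespecified * 1.71  # = 47.88, i.e., roughly 48
odds = math.sqrt(dor_data_driven)          # if sens = spec, DOR = (sens / (1 - sens))^2
sens = odds / (1.0 + odds)                 # about 0.874, i.e., 87.4%
```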

## Discussion

Our simulation study has shown that data-driven selection of optimal cutoff values for a test of a continuous variable by means of the Youden index led to overestimates of sensitivity and specificity. The amount of bias in sensitivity and specificity predominantly depended on the sample size. A typical value for the absolute amount of bias in studies with a sample size of 40 was 5% in both sensitivity and specificity.

The bias decreased with increasing sample size. Overestimation by >5% was present in 27% of the simulations when the sample size was 200, compared with 60% of the studies with a sample size of 40. The underlying distributions had little or no effect on the amount of bias. This result can be explained by the nonparametric approach for the data-driven selection of the optimal cutoff value. The absolute magnitude of the true sensitivity and specificity did have an effect: the nearer the true values were to 100%, the less room there was for overestimation.

In this study, we have reported only the effect of optimizing cutoff values on sensitivity and specificity, although we also examined the effects on likelihood ratios and diagnostic odds ratios (results not reported). That these effects were in line with the results for sensitivity and specificity is not surprising, because both measures are direct functions of sensitivity and specificity. The potential for bias identified through our simulation studies was confirmed by the empirical data, because the diagnostic odds ratios in studies with data-driven cutoff values were significantly higher than in studies with prespecified values.

We applied 3 alternative and more robust methods for determining the sensitivity and specificity associated with the optimal cutoff value to examine whether these methods were less prone to bias. In general, these methods produced lower estimates of sensitivity and specificity, sometimes even producing estimates that were too conservative (see Fig. 6). As expected, the performance of the method that assumed the underlying distribution to be normal deteriorated considerably if this assumption was not met. Because it is difficult to examine whether the assumption of an underlying normal distribution is reasonable for a small sample, we do not recommend this method in general. Assuming a gamma distribution is a more flexible approach because it can mimic various shapes of distributions; accordingly, this method performed consistently well across our simulations. The approach of smooth ROC curve fitting can be viewed as a distribution-free method; that is, it would perform consistently irrespective of the true underlying distribution. The leave-one-out approach is a traditional way of cross-validating results in regression analyses to reduce the impact of overfitting. In our situation, the leave-one-out approach did produce lower estimates than the data-driven method; however, the estimates sometimes became too conservative, especially for specificity. We do not have an explanation for this finding. Bootstrapping, which is based on the same principle of cross-validation, would have been a slightly different approach, but because it rests on the same principle, we would expect its results to be similar to those obtained with the leave-one-out approach.

Another approach that will reduce the problem of overestimation is to use a prespecified cutoff value; however, in the early phases of test evaluation there may be little information available regarding the likely optimal cutoff value. Other, more complex solutions that use the actual data of a study have been described for generating less biased results. These approaches involve reporting a confidence interval around the “true” cutoff value and a Bayesian method to smooth the steps in an ROC curve. Details can be found in Fluss et al.(5) and Gail and Green(15).

Readers of diagnostic studies should be aware of the potential for bias when optimal cutoff values are derived in a data-driven way, especially if the sample size is small. Although what constitutes a small study is rather arbitrary and depends on the amount of bias that is deemed acceptable, our results show that there is a 27% probability that sensitivity and specificity will be overestimated by >5% in a study with a sample size of 200. As a rule of thumb, we suggest that a diagnostic study have at least 100 individuals without the disease and 100 individuals with the disease to reliably estimate a cutoff value from the data. Unfortunately, most diagnostic studies will not have these numbers(6). An additional problem both clinicians and laboratory professionals may encounter is that not only will the amount of bias increase as sample sizes decrease, but the confidence intervals around the estimates of the optimal cutoff value and both sensitivity and specificity will also widen. Even if more robust methods are used to reduce bias, uncertainty about the true optimal cutoff value and its corresponding diagnostic accuracy will remain.

In conclusion, researchers and readers of diagnostic studies should be aware that measures of diagnostic accuracy are likely to be overly optimistic when the results in a small study have been generated by a data-driven approach. Several methods can be used to reduce the amount of this bias, but it is important to stress that finding robust estimates of cutoff values and their associated measures of accuracy requires studies with a considerable sample size. In smaller studies, researchers may present a scatter graph showing the distribution of all test results for the nondiseased and diseased individuals. In addition, investigators can draw the empirical ROC curve and a robust (smoothed) ROC curve, but we recommend that they refrain from selecting the most outlying point closest to the top left corner (the maximum Youden index) as the optimal cutoff value.

## Acknowledgments

**Grant/Funding Support:** None declared.

**Financial Disclosures**: None declared.

- © 2008 The American Association for Clinical Chemistry