## Abstract

Background: Prediction models combine patient characteristics and test results to predict the presence of a disease or the occurrence of an event in the future. When a test result (predictor) is unavailable, users applying a prediction model need a strategy for dealing with the missing value. We evaluated 6 such strategies.

Methods: We developed and validated (in 1295 and 532 primary care patients, respectively) a prediction model to predict the risk of deep venous thrombosis. In an application set (259 patients), we mimicked 3 situations in which (1) an important predictor (D-dimer test), (2) a weaker predictor (difference in calf circumference), and (3) both predictors simultaneously were missing. The 6 strategies to deal with missing values were (1) ignoring the predictor, (2) overall mean imputation, (3) subgroup mean imputation, (4) multiple imputation, (5) applying a submodel including only the observed predictors as derived from the derivation set, or (6) the “one-step-sweep” method. We compared the model’s discriminative ability (expressed by the ROC area) with the true ROC area (no missing values) and the model’s estimated calibration slope and intercept with the ideal values of 1 and 0, respectively.

Results: Ignoring the predictor led to the worst and multiple imputation to the best discrimination. Multiple imputation led to calibration intercepts closest to the true value. The effect of the strategies on the slope differed between the 3 scenarios.

Conclusions: Multiple imputation is preferred if a predictor value is missing.

Clinical prediction models or risk scores are developed to estimate a patient’s risk of having (diagnosis) or developing (prognosis) a particular outcome. Well-known examples are the Apgar score (1) to estimate the prognosis of newborns and the Framingham risk score (2) to predict heart disease. Usually, 3 consecutive phases can be distinguished in clinical prediction research: derivation of the prediction model, validation of the model in new subjects (testing), and application in daily practice (3)(4)(5)(6).

Studies aimed at deriving or validating a prediction model commonly are negatively affected by missing values in one or more predictors. Often researchers conduct a so-called complete case analysis, neglecting the data of patients with missing values. Furthermore, predictors with (many) missing values are frequently excluded or replaced by a reference value. These approaches lead not only to loss of power (complete case analysis), but also to biased estimates of diagnostic or prognostic accuracy (7)(8)(9)(10)(11)(12)(13)(14). A more advanced method is multiple imputation (7)(8)(9)(10)(11)(12)(13)(14)(15)(16). This technique uses all observed patient information to multivariately impute the missing predictor values, which leads to more valid results (7)(8)(9)(10)(11)(12)(15)(16).

Physicians who apply prediction models to their patients may also face the problem of a missing predictor value. It is unclear how to deal with missing predictor values in individual patients. For example, a model to predict the presence of a bacterial infection in children with acute fever includes the predictor “duration of fever” (17); however, the parents may not remember the exact duration of the fever. Applying the prediction model without this predictor is not a sound solution, as the relative weights of the other predictors in the model become invalid. We compared 6 strategies to deal with missing values when applying a prediction model to individual patients. We used the empirical data of a prediction model aimed at predicting the presence of deep vein thrombosis (DVT).

## Materials and Methods

### clinical example

Timely diagnosis of DVT is important because patients with untreated DVT may develop pulmonary embolism, whereas unjustified therapy with anticoagulants poses a risk for major bleeding (18). Physicians have to decide which patients need to be referred for further workup and which can be safely kept under their surveillance without further workup. A diagnostic prediction model could aid physicians in this decision.

For this analysis, we used data from a large cohort of 2086 primary care patients with suspected DVT, as described in previous studies (19)(20)(21)(22). Because prediction models are first developed from a so-called derivation data set, then tested in a (usually smaller) validation set, and finally applied in daily practice (3)(4)(5)(6), we split our cohort into a derivation, a validation, and an application set. These 3 data sets have been described in previous studies (19)(20)(23). For the current study, we filled in the missing values with regression imputation, so that none of the 3 data sets contained missing values.

### derivation and validation of the prediction model

The derivation set consisted of 1295 patients included between January 2001 and May 2003 (Table 1). After information was obtained on patient history, physical examination, and D-dimer testing, all patients were referred for ultrasonography as the reference standard to document the true presence or absence of DVT. The prediction model was developed with multivariable logistic regression. Model reduction (backward stepwise selection) was performed with a *P*-value threshold of 0.157, corresponding to the Akaike Information Criterion (4)(24)(25), and the final model included 7 predictors (Formula 1; not reproduced here).

In Formula 1, −14.84 is the so-called intercept and the other numbers are the regression coefficients of the predictors; together they form the linear predictor. The risk of DVT in an individual patient (scale 0%–100%) can be calculated by applying the logistic transformation to the linear predictor (Formula 2): risk = 1 / (1 + e^(−linear predictor)).
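Since the model is a logistic regression, the transformation from linear predictor to risk (Formula 2) can be sketched in a few lines (a minimal illustration; the function name is ours):

```python
import math

def dvt_risk(linear_predictor: float) -> float:
    """Formula 2: transform the linear predictor (Formula 1) into the
    predicted probability of DVT (0-1 scale; multiply by 100 for %)."""
    return 1.0 / (1.0 + math.exp(-linear_predictor))
```

For example, a linear predictor of 0 corresponds to a predicted risk of 50%, and the intercept of −14.84 alone (all predictor terms zero) yields a risk close to 0%.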

We validated this prediction model in the second part of our data: 532 patients, selected and measured in the same way, who were included between June 2003 and June 2005 (see also Supplemental Data Section 1, which accompanies the online version of this article at http://www.clinchem.org/content/vol55/issue5).

### application of the prediction model

The application set consisted of the last 259 consecutive patients (Table 1). This application set did not contain any missing predictor values and served as the reference situation. We then mimicked 3 scenarios in which predictor values were missing. First, the D-dimer test (the strongest predictor; see Table 3) was missing for all patients. Second, the difference in calf circumference (a weaker predictor) was missing for all patients. Third, both predictors were missing for all patients.

### strategies to deal with missing values

We compared 6 strategies (Table 2) that can deal with missing predictor values when a prediction model is applied to individual patients. The first 4 strategies impute the missing value, in which case the original prediction model can be applied. The last 2 strategies use a modified prediction model (a submodel without the unobserved predictors). In these submodels, the intercept and regression coefficients of the remaining (observed) predictors are adjusted for the exclusion of the unobserved predictors. The submodels are derived either from the data of the derivation set or by a method called one-step-sweep.

#### 1. Imputation of the value zero.

The missing predictor value was imputed with the value zero. For example, in the first scenario, this means that the D-dimer test is neglected, whereas the intercept and regression coefficients of the remaining predictors in Formula 1 are used without adjustments.

#### 2. Overall mean imputation.

The missing predictor value was imputed with the mean value of the predictor, estimated from the derivation set. For example, if the D-dimer test was missing (first scenario), the mean log(D-dimer test) of the patients in the derivation set was imputed.

#### 3. Subgroup mean imputation.

The missing predictor value was imputed with a subgroup mean value, estimated from the derivation set. Subgroups were determined by sex and 5 age categories. For example, if the D-dimer test was missing for a 44-year-old male patient, the mean log(D-dimer test) of male patients between 40 and 50 years of age in the derivation set was imputed.
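Strategies 2 and 3 can be sketched as follows (the column names and values are illustrative assumptions, not the study data):

```python
import pandas as pd

# Hypothetical derivation data with the predictor log(D-dimer test).
derivation = pd.DataFrame({
    "sex":        ["m",     "m",     "f",     "f"],
    "age_cat":    ["40-50", "40-50", "40-50", "50-60"],
    "log_ddimer": [7.0,     7.2,     6.4,     6.8],
})

# Strategy 2: one overall mean, imputed for every patient with a missing value.
overall_mean = derivation["log_ddimer"].mean()

# Strategy 3: a mean per sex/age subgroup.
subgroup_means = derivation.groupby(["sex", "age_cat"])["log_ddimer"].mean()

# A 44-year-old male patient with a missing D-dimer value receives
# the mean of men aged 40-50 under strategy 3:
imputed = subgroup_means.loc[("m", "40-50")]
```

Both sets of means can be computed once from the derivation set and published alongside the model.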

#### 4. Multiple imputation.

Multiple imputation (see also online Supplemental Data Section 2) is a more advanced method that uses regression models to estimate multiple values of the missing predictor, based on the observed predictors or characteristics of that patient (7)(8)(9)(10)(11)(12)(13)(14)(15)(16). Multiple imputation is straightforward and feasible when analyzing a whole data set. Using this method when applying a prediction model to an individual patient, however, is less straightforward: one needs access to the data of the derivation set, for example via a website. The individual patient with a missing value is then added to the derivation set, and the missing predictor value is imputed multiple times. In this study, we imputed 10 values of the missing predictor for each patient. We then calculated 10 linear predictors (Formula 1) for each patient, which we subsequently averaged to obtain the patient’s risk of DVT presence (Formula 2).
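The final combination step can be sketched as follows (the names and values are illustrative; in practice the imputed values come from the multiple imputation procedure run on the derivation set):

```python
import math

def risk_after_multiple_imputation(lp_observed_part, coef_missing, imputed_values):
    """Combine m imputations as in the study: compute one linear predictor
    per imputed value (Formula 1), average the linear predictors, and
    transform the average into a probability (Formula 2)."""
    lps = [lp_observed_part + coef_missing * v for v in imputed_values]
    mean_lp = sum(lps) / len(lps)
    return 1.0 / (1.0 + math.exp(-mean_lp))

# 10 hypothetical imputed log(D-dimer) values for one patient:
imputations = [6.4, 6.6, 6.7, 6.8, 6.9, 6.9, 7.0, 7.1, 7.2, 7.4]
risk = risk_after_multiple_imputation(-10.0, 1.4, imputations)
```

Note that the averaging happens on the linear-predictor scale before the logistic transformation, matching the choice described in the Discussion.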

#### 5. Submodel derived from the derivation set.

The submodel, including only the observed predictors, was derived in the derivation set.

#### 6. Submodel derived by one-step-sweep.

The submodel, including only the observed predictors, was derived with a noniterative 1-step approximation called the one-step-sweep that has been proposed recently (26) (see also online Supplemental Data Section 3). This method can be applied without using the individual patient data of the derivation set. The regression coefficients of the submodel are based on the regression coefficients of the original model (Formula 1) and the covariance matrix obtained from the derivation set.
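The exact formula is given in reference (26) and the online supplement. As an illustration only, a generic one-step deletion approximation (an assumption on our part, not necessarily identical to the published one-step-sweep) adjusts each remaining coefficient using the covariance matrix of the fitted coefficients:

```python
import numpy as np

def one_step_submodel(beta, cov, missing):
    """Approximate the submodel coefficients after dropping the predictor
    at index `missing` (index 0 is the intercept). `cov` is the covariance
    matrix of the coefficient estimates of the original model.
    NOTE: a sketch of the generic one-step deletion formula; the exact
    one-step-sweep of reference (26) may differ in detail."""
    beta = np.asarray(beta, dtype=float)
    cov = np.asarray(cov, dtype=float)
    keep = [i for i in range(beta.size) if i != missing]
    # Each remaining coefficient absorbs part of the dropped coefficient,
    # in proportion to its covariance with the dropped one.
    return beta[keep] - cov[keep, missing] / cov[missing, missing] * beta[missing]
```

If the dropped coefficient is uncorrelated with the others, the remaining coefficients are unchanged, matching the intuition that an independent predictor can be dropped without reweighting the rest.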

### predictive accuracy measures

We estimated the accuracy of the 6 strategies by quantifying discrimination and calibration and compared these with the reference situation (no missing values in the application set). Discrimination is the ability of the model to distinguish between patients with and without DVT, quantified with the area under the ROC curve (27). An ROC area can range from 0.5 (no discrimination) to 1.0 (perfect discrimination) (28).

Calibration refers to the agreement between the predicted probabilities and the observed frequencies of DVT. It can be assessed graphically with a calibration plot, with the predicted probabilities on the *x* axis and the observed frequencies on the *y* axis. The plot shows a line that can be described with a so-called calibration slope and calibration intercept (estimated by fitting the linear predictor of the model as the only covariate in a logistic model with DVT as the outcome) (29)(30). The calibration slope and calibration intercept are ideally 1 and 0, respectively. A slope <1 indicates too-optimistic predictions (low predicted probabilities are too low and high predicted probabilities are too high); a slope >1 indicates that predictions are not extreme enough (low predicted probabilities not low enough and high predicted probabilities not high enough). A calibration intercept close to 0 indicates good calibration in the large, meaning that the mean predicted DVT probability equals the mean observed DVT frequency. A positive calibration intercept indicates (on average) underestimated risks, whereas a negative value indicates overestimated risks. Because this calibration intercept is difficult to interpret if the calibration slope is not equal to 1, the intercept is estimated with the slope fixed at 1, implying that it equals the difference between the observed DVT prevalence and the mean predicted risk (29)(30).
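The calibration slope and intercept described above come from refitting a logistic model with the model’s linear predictor as the only covariate. A minimal sketch using Newton-Raphson (any logistic regression routine would do; the function name is ours):

```python
import numpy as np

def calibration_slope_intercept(lp, y, iters=25):
    """Estimate intercept a and slope b of logit P(y = 1) = a + b * lp,
    where lp is the model's linear predictor and y the observed outcome.
    Ideal values: a = 0, b = 1."""
    lp = np.asarray(lp, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(lp), lp])
    theta = np.zeros(2)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ theta))   # current predicted probabilities
        grad = X.T @ (y - p)                   # score vector
        w = p * (1.0 - p)                      # logistic variance weights
        hess = X.T @ (X * w[:, None])          # Fisher information
        theta = theta + np.linalg.solve(hess, grad)
    return theta[0], theta[1]
```

To estimate the intercept with the slope fixed at 1, the linear predictor would instead enter the model as an offset.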

## Results

### strategy 1: zero imputation

This is the only strategy we could apply without additional estimations.

### strategy 2: overall mean imputation

The overall mean log(D-dimer test) and log(difference of calf circumference) in the derivation set were 6.83 and 1.14, respectively.

### strategy 3: subgroup means

The subgroup means for log(D-dimer test) and log(difference of calf circumference) in the derivation set are presented in online Supplemental Table 1. For both predictors, the subgroup means differed from the overall means (strategy 2) and were higher for men and older patients.

### strategy 4: multiple imputation

We used the derivation data set and a multiple imputation script (available on request).

### strategy 5: submodels estimated in the derivation set

Three submodels were derived (Table 3).

### strategy 6: submodels estimated by one-step-sweep

Three submodels were derived (Table 3). Online Supplemental Table 2 shows the covariance matrix (needed for this strategy) of the regression coefficients of the original model.

### accuracy of the 6 strategies

#### Discrimination.

In the reference situation (no missing predictor values in the application set), the ROC area of the original prediction model was 0.90 (95% CI 0.84–0.96). With only the D-dimer test missing (scenario 1), the ROC area decreased to approximately 0.70 for all strategies (Table 4), except multiple imputation (ROC area 0.77). If the difference in calf circumference was missing (scenario 2), the ROC area did not decrease with any of the strategies (ROC area 0.89 or 0.90). If both predictors were missing (scenario 3), the ROC area decreased to 0.66 or lower for all strategies, except multiple imputation (ROC area 0.78). Zero imputation and mean imputation resulted in the largest decrease (ROC area 0.62).

#### Calibration.

In the reference situation, the calibration slope was 1.06. If the D-dimer test was missing (scenario 1), subgroup mean imputation resulted in a calibration slope (1.02) closest to the reference slope (Table 5). Multiple imputation resulted in a calibration slope <1, indicating too-extreme predictions. The other 4 strategies led to calibration slopes >1, with strategies 5 and 6 (the submodels without the predictor with missing values) showing the largest deviation from the reference slope. If the difference in calf circumference was missing (scenario 2), all slopes were similar to the reference slope. If the 2 predictors were missing simultaneously (scenario 3), none of the strategies led to calibration slopes close to the reference slope, though subgroup mean imputation resulted in the smallest deviation (slope 0.94) from the reference situation. All imputation methods (strategies 1–4) resulted in calibration slopes <1.

The intercept of the calibration line in the reference situation was −0.10. In scenario 1, all strategies generally led to insufficient calibration in the large (intercept not equal to 0), apart from multiple imputation (intercept −0.06) (Table 5). Strategy 1, simply neglecting the predictor, led to the worst calibration (intercept 12.48). In scenario 2, calibration was most similar to the reference situation for multiple imputation (intercept −0.04) and for the submodel estimated in the derivation set (intercept −0.03). Also in this scenario, neglecting the predictor with missing values led to the largest deviation (intercept 0.97). For scenario 3, all strategies generally resulted in insufficient calibration, apart from multiple imputation (intercept 0.01).

## Discussion

When a prediction model is applied to individual patients, a particular predictor may often be unmeasured. The question then arises how to use the prediction model in such situations. We compared 6 strategies, of which multiple imputation of the missing values led to the most accurate model predictions.

### discrimination of the model

If the strong predictor D-dimer was missing, multiple imputation resulted in an ROC area closest to the reference value, whereas all other methods led to highly underestimated ROC areas. We expected this for ignoring the predictor with missing values without adjusting the regression coefficients of the remaining predictors. We also expected it for imputation of the overall mean, because that does not change the rank order of patients (every patient receives the same imputed value); yet this solution is frequently used in medical research. Imputation of a subgroup mean can hypothetically improve the model’s discrimination, although we did not find this in our results; apparently, the variability in imputed subgroup means (compared to the overall mean) was not large enough. Furthermore, submodels (derived either from the derivation set or with the one-step-sweep method) that contain only the predictors with observed values can hypothetically discriminate better than strategies that impute the same value for all patients. However, this is less likely if a strong predictor is missing, such as the D-dimer test in our study: any submodel without this predictor substantially loses discriminative ability. Indeed, if the value of a relatively weak predictor was missing (scenario 2), all strategies led to ROC areas similar to the reference situation. If both predictors were missing (scenario 3), we found similar or even worse results compared with scenario 1. Apparently, the discriminative ability of our prediction model was largely based on the strong predictor, the D-dimer test.

### calibration of the model

In the case of a missing D-dimer test (scenario 1), ignoring the predictive effect of this strong predictor (i.e., imputing the value zero) led to the worst calibration in the large: because this risk-increasing predictor was ignored, all predicted risks were too low. As expected, imputation of the overall mean or a subgroup mean improved the calibration intercept, as the missing predictor value is then to some extent incorporated in the risk estimation. Also as expected, multiple imputation led to a calibration intercept closest to the reference situation, as this strategy best approximated the missing predictor values. Application of the submodels, derived either from the derivation set or by the one-step-sweep method, resulted in too-high predicted probabilities (negative intercept), although closer to the reference value than (subgroup) mean imputation; the effect of these strategies probably depends on the data at hand and may differ in other situations. If the difference in calf circumference was missing (scenario 2), we found better results for all methods, though ignoring the predictor again led to the worst results. If both predictors were missing (scenario 3), we found results similar to those in scenario 1.

The same inferences can be drawn for the calibration slope (Table 5). If the D-dimer test was missing (scenario 1), ignoring the predictor and overall mean imputation resulted in a slope >1, indicating that the predicted probabilities were not extreme enough. This is expected, as the predicted probabilities become more alike (less extreme). Indeed, subgroup mean imputation improved the slope to a value close to 1, as it allows for more variation between patients. Application of the submodels resulted in calibration slopes >1, with the largest deviation from the reference situation. Again, this probably depends on the data at hand and may differ in other situations. If the difference in calf circumference was missing (scenario 2), all slopes were (nearly) equal to the reference slope. If both predictors were missing (scenario 3), largely the same inferences can be drawn as for scenario 1.

### methodological considerations

First, our results are based on one empirical example. Other datasets with other prediction models predicting other outcomes may show different results. For example, applying the submodels (without the predictor with missing values) may lead to better results if the remaining predictors of the model have a predictive strength similar to the one that is missing. In our study, the D-dimer test was such a strong predictor that estimating a submodel without this predictor inevitably led to less accurate predictions.

Second, to our knowledge, this is the first time that multiple imputation has been studied when a prediction model is applied to individual patients. We calculated 10 linear predictors (Formula 1) for each patient, averaged these, and transformed the average to the probability of the presence of DVT. Another option would be to first transform the 10 linear predictors into 10 probabilities and average these into a single probability. Because risks are not normally distributed, we chose the first strategy. Elaborate simulation studies, in which all potential scenarios can be mimicked, may be needed to determine the best strategy.
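The difference between the two combination options can be made concrete with a small sketch (the linear predictors are illustrative): because the logistic function is nonlinear, averaging on the linear-predictor scale and averaging on the probability scale generally give different risks.

```python
import math

def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Three illustrative linear predictors from three imputations:
lps = [-2.0, 0.0, 3.0]

# Option 1 (chosen in the study): average on the linear-predictor scale.
risk_lp_scale = logistic(sum(lps) / len(lps))

# Option 2: transform each linear predictor first, then average the risks.
risk_prob_scale = sum(logistic(x) for x in lps) / len(lps)
```

Here option 1 gives a risk of about 0.58 and option 2 about 0.52; the two coincide only when the linear predictors are symmetric around their mean or the relation is locally linear.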

Third, multiple imputation resulted in a calibration slope smaller than 1, indicating predicted probabilities that were too extreme, a pattern often caused by overfitted models. This suggests that the imputation model may have been overfitted. Shrinkage of the imputation model (i.e., adjusting the model for overfitting) may be a possible solution. More research should be conducted on these methodological issues.

Fourth, the 6 strategies vary in applicability. Overall mean imputation and subgroup mean imputation are easily applicable in daily clinical practice, as the required values can simply be added to an appendix of the manuscript presenting the prediction model. Likewise, the submodels derived from the derivation set and the covariance matrix necessary for the one-step-sweep can be presented in a manuscript. However, this can become quite complex if many predictors can have missing values: hypothetically, all 7 predictors of our prediction model can be missing in practice, so 2^{7} = 128 submodels would have to be developed. The one-step-sweep can estimate any of these submodels when needed, without the need to develop them all in advance (26). Multiple imputation is the most complex strategy to apply, as the original derivation set and the multiple imputation models need to be stored in such a way that they are publicly available; storing the data on Internet sites is a good option. Owing to the increasing introduction of electronic patient records in primary and secondary care, with their potential for built-in algorithms, these strategies may become easier to implement and apply.

Fifth, there may be more strategies to deal with missing values. For example, we could have imputed the missing values with regression models, in which the predictor with missing values is the dependent variable and the other predictors the independent variables. We did not apply this single regression imputation approach, as it is less feasible in practice. For a prediction model with 7 predictors like ours, one would need to develop and store the 6 * 2^{7} = 768 potential regression models.

Sixth, the gain of multiple imputation over single regression imputation lies in the correct estimation of the standard errors of the predicted probabilities. We did not take full advantage of this, as our interest was not in the confidence intervals of the predicted probabilities but in the predictive accuracy of the model. In situations where the confidence intervals of predicted probabilities are of interest, however, this will be an extra advantage of multiple imputation.

Finally, we could have split our cohort into a derivation set and an application set (ignoring the validation phase), which would have resulted in a larger derivation set. In our study, however, we explicitly wanted to use an in-between validation set to test the accuracy of the newly developed model. Although model validation is always highly important, it is still rarely applied. We would like to stress that before any clinical prediction model is applied in practice, it needs to be tested in new patients (5)(6).

In conclusion, if a prediction model is applied to an individual patient and a predictor value is missing, ignoring that predictor is the worst strategy, as the weights of the remaining predictors become incorrect. Imputation of the overall mean does not improve the discrimination, and the estimated risks may be incorrect. Imputation of a subgroup mean may improve the discrimination, although the predicted risks are not necessarily correct. Using a submodel without the predictor can result in poor discrimination if the predictor with missing values was a strong one. We found that multiple imputation resulted in the best discrimination, and the predicted risks were on average correct. The question of why the models derived by multiple imputation seemed overfitted needs to be addressed in future research.

## Acknowledgments

**Author Contributions:** *All authors confirmed they have contributed to the intellectual content of this paper and have met the following 3 requirements: (a) significant contributions to the conception and design, acquisition of data, or analysis and interpretation of data; (b) drafting or revising the article for intellectual content; and (c) final approval of the published article.*

**Authors’ Disclosures of Potential Conflicts of Interest:** *Upon manuscript submission, all authors completed the Disclosures of Potential Conflict of Interest form. Potential conflicts of interest:*

**Employment or Leadership:** F.E. Harrell, Jr., Department of Biostatistics, Vanderbilt University.

**Consultant or Advisory Role:** F.E. Harrell, Jr., Pfizer, Amgen, Becker Consulting, GlaxoSmithKline, Novartis, and Merck.

**Stock Ownership:** None declared.

**Honoraria:** F.E. Harrell, Jr., Johnson & Johnson, Statistics Society of Canada, and American Statistical Association.

**Research Funding:** F.E. Harrell, Jr., NIH. We gratefully acknowledge the support by the Netherlands Organization for Scientific Research (ZonMw 016.046.360).

**Expert Testimony:** None declared.

**Role of Sponsor:** The funding organizations played no role in the design of study, choice of enrolled patients, review and interpretation of data, or preparation or approval of manuscript.

**Acknowledgments:** We gratefully acknowledge that part of this work has been conducted in the Department of Statistics, Harvard University (Prof. D.B. Rubin), and in the Department of Biostatistics, Vanderbilt University Medical School (Prof. F.E. Harrell, Jr.).

- © 2009 The American Association for Clinical Chemistry