Background: We evaluated whether articles on molecular diagnostic tests appropriately interpret the clinical applicability of their results.
Methods: We selected original-research articles published in 2006 that addressed the diagnostic value of a molecular test. We defined overinterpretation of clinical applicability by means of prespecified rules that evaluated study design, conclusions regarding applicability, presence of statements suggesting the need for further clinical evaluation of the test, and diagnostic accuracy. Two reviewers independently evaluated the articles; consensus was reached after discussion and arbitration by a third reviewer.
Results: Of the 108 articles included in the study, 82 (76%) used a design based on healthy controls or alternative-diagnosis controls, and only 15 (14%) addressed a clinically relevant population similar to that in which the test might be applied in practice. Definitely favorable or promising statements regarding clinical applicability were made in 104 articles (96%), and 61 articles (56%) apparently overinterpreted the clinical applicability of their findings. Articles published in journals with higher impact factors were more likely to overinterpret their results than those with lower impact factors (adjusted odds ratio, 1.71 per impact-factor quartile; 95% CI, 1.09–2.69; P = 0.020). Overinterpretation was more common when authors were based in laboratories than in clinical settings (adjusted odds ratio, 18.7; 95% CI, 1.41–249; P = 0.036).
Conclusions: Although expectations are high for new diagnostic tests based on molecular techniques, the majority of published research has involved preclinical phases of research. Overinterpretation of the clinical applicability of findings for new molecular diagnostic tests is common.
With the remarkable advances in genomic and proteomic technologies, a large number of studies on new molecular diagnostic tests are being published. Expectations are high for the development of noninvasive molecular diagnostic tests, yet analysis and interpretation of the data have presented unique challenges(1). Few of the many proposed tests have been introduced into clinical practice with clearly documented benefits(2)(3)(4). Today, more than ever, intense promotion of molecular-diagnostic techniques strengthens the need to ensure that the provision of diagnostic tests in clinical settings is evidence-based; however, offering guidance for the introduction of a new diagnostic test into clinical practice remains a challenge(5). In addition to the increased attention to issues of reporting(6) and quality assessment(7), several authors(8)(9)(10) have proposed a formal structure to guide the process of diagnostic-test development.
In the path toward a successful clinical application, a diagnostic test should be evaluated in distinct populations that are similar to those in which the test is intended for eventual use (in clinical practice or in public health). Although preliminary studies may evaluate the ability of the test to distinguish between known disease cases and control individuals who are either healthy or have a specific, different diagnosis, excellent results in the preliminary, preclinical phases do not prove clinical utility. Application of a test in the real world usually involves a different spectrum of disease than preliminary studies, because real-life diagnostic investigations tend to address primarily patients suspected of having the target condition, not patients with severe, clear-cut disease or obviously healthy people. Moreover, competing diagnoses are prevalent in real life, whereas most healthy-control or alternative-diagnosis-control studies typically exclude patients with diagnoses that compete in the differential diagnosis. Analytical issues (e.g., reproducibility)(11)(12) and potential biases(13) may also complicate the transition from discovery to clinical translation(1). Although these conceptual and methodologic requirements have long been established, it is unknown whether the new generations of studies on molecular diagnostic tests recognize and integrate the extra requirements for clinical translation or whether, by contrast, they tend to overinterpret or exaggerate preliminary results as providing conclusive evidence for clinical applicability.
Our aim was to analyze a large sample of recent articles on molecular-diagnostic tests to determine whether the authors’ assessment of the clinical applicability of their results was coherent with their study design and findings or whether they overinterpreted the clinical significance of the available information.
Materials and Methods
data sources and searching
We identified diagnostic-accuracy studies on molecular research through a computerized search of MEDLINE that used the medical subject headings (MeSH): “Diagnosis” and “Genomics” or “Microarray analysis”; “Molecular diagnostic techniques” (MeSH) and “Sensitivity and Specificity” (MeSH); “diagnos*” and “genomics” or “proteomics”; and finally, “molecular” or “genetic” and “diagnostic test.” The searches were carried out on May 11, 2007. The full search strategy is documented in Fig. 1 in the Data Supplement that accompanies the online version of this article at http://www.clinchem.org/content/vol55/issue4.
We selected original research articles that used human participants in studies in which the main objective was to address the diagnostic value of a given test whose methodology was based on molecular techniques. The term “molecular techniques” included technologies that provide a comprehensive analysis of cellular-specific constituents, such as RNA, DNA, proteins, and intermediary metabolites, as well as techniques such as in situ hybridization of chromosomes for cytogenetic analysis, identification of pathogenic organisms via analysis of species-specific DNA sequences, and detection of mutations with the PCR. To maintain a focus on recent research, we limited our sample to articles published in 2006.
A single investigator screened the titles and abstracts according to specific criteria. Reviews, editorials, letters, and case reports were excluded. We also excluded preevaluation studies that focused on the analytical aspects of a diagnostic test (technical aspects on how a method is applied or how measurements are made) and studies that aimed to monitor disease prognosis or treatment effects.
To assess the reliability of the selection process, 2 investigators independently assessed a random sample of 200 abstracts; they agreed with the initial reviewer 94% and 83% of the time.
data extraction and definitions
Two investigators independently extracted data from each article. The data extractors assigned each study to one of the following 3 study designs according to previous definitions(14): (a) healthy-control or alternative-diagnosis-control study; (b) consecutive series or series of clinically relevant patients, in which the spectrum of patients/samples reflects, as closely as possible, populations in which the test may be used in practice; and (c) studies that could not be assigned with confidence to either of the other 2 groups. Table 1 details the operational definitions for each type of design. Furthermore, all statements in the articles referring to clinical applicability and the potential need for further clinical evaluation were recorded, as follows:
Statements regarding clinical applicability of the test. Statements on clinical applicability were graded as definitely favorable, promising, or unfavorable. Conditional language such as "may" was considered promising; however, if the authors affirmed that the study reflected the clinical evaluation of the test in question or that the test could be considered an option for diagnosis, the statement was graded as definitely favorable. In the final decision regarding overinterpretation, the greatest weight was given to the abstract.
Statements regarding further clinical evaluation of the test. The presence or absence of statements regarding the need for further clinical evaluation was recorded for each study. A distinction was made between studies that mentioned further clinical evaluation as a desirable possibility and those that stated clinical evaluation was necessary. Only the latter were considered to “mention need of further clinical evaluation.”
We defined overinterpretation of clinical applicability with the following rules, which were agreed upon up front and evaluated in a pilot study of 10 articles to ensure that they were operational (Table 2). In brief, in studies with healthy or alternative-diagnosis controls, overinterpretation was considered present when the authors gave a conclusion that was definitely favorable for application of the test in the clinic (with or without mentioning the requirement for further clinical evaluation), or when the authors stated that the assessed test was promising but did not mention the need for further clinical evaluation. In studies of patient series, any statement concluding that the test had clinical applications was classified as overinterpretation if the study had unacceptable diagnostic accuracy, as follows: both sensitivity and specificity were <60% in the main analysis; either sensitivity or specificity was <50% in the main analysis without justification of the merits of the test as an exclusion/inclusion test; the lower limits of the CIs of both sensitivity and specificity were <50%; the area under the ROC curve was <0.55 or had a CI that reached <0.50; or an accuracy index was absent, with insufficient information provided to calculate sensitivity or specificity.
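These rules amount to a small decision procedure. The sketch below is ours, for illustration only: the study applied the rules by manual review, and the function names, argument structure, and handling of missing indexes are our assumptions, not the authors' instrument.

```python
# Illustrative encoding of the prespecified overinterpretation rules.
# All names and the decision structure are our assumptions.

def unacceptable_accuracy(sens=None, spec=None, sens_lo=None, spec_lo=None,
                          auc=None, auc_lo=None, justified_single_index=False):
    """True if diagnostic accuracy meets any 'unacceptable' criterion."""
    if sens is None and spec is None and auc is None:
        return True  # no accuracy index and insufficient information to compute one
    if sens is not None and spec is not None:
        if sens < 0.60 and spec < 0.60:
            return True  # both sensitivity and specificity < 60%
        if (sens < 0.50 or spec < 0.50) and not justified_single_index:
            return True  # one index < 50% without an exclusion/inclusion rationale
        if (sens_lo is not None and spec_lo is not None
                and sens_lo < 0.50 and spec_lo < 0.50):
            return True  # lower CI limits of both indexes < 50%
    if auc is not None and (auc < 0.55 or (auc_lo is not None and auc_lo < 0.50)):
        return True  # area under the ROC curve too low
    return False

def overinterpreted(design, conclusion, mentions_further_evaluation, **accuracy):
    """design: 'control' (healthy/alternative-diagnosis controls) or 'clinical'.
    conclusion: 'favorable', 'promising', or 'unfavorable'."""
    if design == 'control':
        if conclusion == 'favorable':
            return True  # definitely favorable claim from a preclinical design
        if conclusion == 'promising' and not mentions_further_evaluation:
            return True  # promising claim without call for clinical evaluation
        return False
    # Clinically relevant series: a claim of applicability is overinterpretation
    # only when diagnostic accuracy is unacceptable.
    if conclusion in ('favorable', 'promising'):
        return unacceptable_accuracy(**accuracy)
    return False
```

For instance, a definitely favorable conclusion from a clinically relevant series with sensitivity 72% and specificity 92% would not be flagged, whereas the same conclusion with no accuracy index reported would be.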
Transcriptions of a selection of the articles examined and their classifications are provided in Annex 1 of the online Data Supplement for illustrative purposes, and some detailed examples are described in the Results. The degree of observer agreement regarding the presence or absence of overinterpretation was 79% at this stage. Discrepancies were resolved by consensus and by independent review by a third investigator. The reviewers were aware of the journal source and authorship.
From each study we also recorded the following variables: Thomson Reuters’ bibliographic impact factor; journal categories selected by Thomson Reuters’ Web of Science (Journal Citation Reports 2006); whether the authors were based in a laboratory, in a clinical setting, or both; the disease studied; the molecular methodology used, categorized as gene-targeting techniques (PCR-based and microarray), protein-targeting techniques (mass spectrometry or 2-dimensional gel electrophoresis, antibody array or protein microarray), and other; mention of previous studies on the same test and how the results were reported; and description of other diagnostic tests for the same diagnostic problem. We also recorded the sample size; in proteomic or genomic studies in which a pattern-recognition model is developed in a training set and then applied in an independent “validation” set(13), we recorded only the number of patients/samples included in the validation set.
To assess the association between the outcome variable (overinterpretation) and the variables listed in the previous paragraph, we computed odds ratios and their 95% CIs by means of unconditional logistic regression. Multivariable models considered all variables with P values <0.10 in univariate analyses and used stepwise forward selection. We always included study design and accuracy index as adjusting factors in the multivariable analysis, because they were included in the criteria for judging overinterpretation (as discussed above) and because they could be related to other study characteristics, thus acting as classic confounders. Study size and bibliographic impact-factor data were categorized in quartiles. Analyses were carried out with STATA/SE 8.0 (StataCorp).
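For a single binary predictor, the odds ratio from unconditional logistic regression coincides with the cross-product ratio of the corresponding 2 × 2 table, with a Wald CI computed on the log scale. A minimal sketch with invented counts (not the study's data; the adjusted estimates reported below required fitting the full multivariable model in Stata):

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI from a 2x2 table:
        a = exposed & overinterpreted,   b = exposed & not
        c = unexposed & overinterpreted, d = unexposed & not
    Uses the normal approximation on the log-odds-ratio scale."""
    or_ = (a * d) / (b * c)
    se_log = math.sqrt(1/a + 1/b + 1/c + 1/d)
    lo = math.exp(math.log(or_) - z * se_log)
    hi = math.exp(math.log(or_) + z * se_log)
    return or_, lo, hi

# Invented counts for illustration only:
or_, lo, hi = odds_ratio_ci(30, 10, 15, 25)
print(f"OR = {or_:.2f} (95% CI {lo:.2f}-{hi:.2f})")  # OR = 5.00 (95% CI 1.91-13.06)
```

Note how sparse cells inflate the log-scale standard error; this is why an adjusted odds ratio such as 18.7 can carry a CI as wide as 1.41–249.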
Results
After screening the titles and abstracts of the 1614 articles retrieved in the electronic searches, we considered 147 articles potentially eligible for the study. After examination of the full texts, we ultimately included 108 articles (see Annex 2 and the Flowchart in the online Data Supplement).
Table 3 lists the characteristics of the sample of 108 reports. Most of the included reports (76%) used a healthy-control or alternative-diagnosis-control design to assess diagnostic accuracy. Regarding the measurement of diagnostic accuracy, more than half (n = 58) of the studies reported classic diagnostic indexes (sensitivity and specificity, or area under the ROC curve). We presented sensitivity and specificity in the same category as area under the ROC curve because 9 of the 12 studies that reported area under the ROC curve presented it along with sensitivity and specificity values; however, when we separately analyzed the 3 studies that reported only area under the ROC curve, we obtained similar results. The sample size ranged from 4 to 8156, with a median of 68.
Thirty-one reports (29%) mentioned previous studies on the same tests; of these 31 reports, 15 quantitatively described the results of the previous studies. More than two thirds (n = 75) of the studies mentioned the existence of other diagnostic tests for the same diagnostic problem. Approximately half (n = 53, 49%) of the reports stated the need for studies other than diagnostic evaluations, such as identification of biomarkers or assessment of prognostic value.
overall stance and interpretation of the results
Half (n = 54, 50%) of the articles studied made definitely favorable statements with regard to clinical application, whereas 50 studies (46%) made statements that were classified as promising. Only 4 studies made unfavorable statements regarding the evaluated diagnostic test. About half (n = 57, 53%) of the articles mentioned the need to evaluate the test’s diagnostic performance in further studies.
Fifty-seven (59%) of the 97 studies that did not use a clinically relevant population overinterpreted clinical applicability. Of the 15 studies carried out in a clinically relevant population, 4 (27%) were also deemed to have overinterpreted their results because of insufficient diagnostic accuracy. In combination, overinterpretation of the clinical applicability of the test under study was apparent in more than half (n = 61, 56%) of the examined articles.
Authors based solely in clinical settings were much less likely to overinterpret results, and articles published in journals focusing on medical specialties were also less likely to do so. Furthermore, a higher journal impact factor was associated with a higher probability of overinterpretation (Table 4). Multivariable analyses indicated that laboratory-based authors were more likely than clinic-based authors to overinterpret the clinical implications of their results (odds ratio adjusted for study design, type of diagnostic-accuracy index, and impact factor, 18.7; 95% CI, 1.41–249.26; P = 0.026). Articles from journals with impact factors in the upper quartile were more likely to overinterpret than those from the lowest quartile (odds ratio adjusted for study design, type of diagnostic-accuracy index, and authorship, 4.33; 95% CI, 1.03–18.23; P = 0.045). The association between overinterpretation and impact factor was linear (odds ratio, 1.71 per quartile; 95% CI, 1.09–2.69; P = 0.020). We calculated cross-tabulations to examine differences between journals with high vs low impact factors. The only difference observed was in journal category: the higher-impact journals included a higher proportion of journals categorized as "laboratory and methodology," whereas the lower-impact journals included more "biomedical or general science" journals (P = 0.010).
examples in the assessment of overinterpretation
Example 1 (reference 25 in Annex 1 in the online Data Supplement).
This study used an alternative diagnosis-control design, and the statements regarding clinical applicability were considered definitely favorable: “This rapid MS-MA is a good primary screening method that can be implemented in a diagnostic laboratory to determine the methylation patterns of patients with suspected PWS or A.” The authors confirm that the diagnostic test is a good primary-screening method, despite the limited conclusiveness of the study design; therefore, the study was considered as overinterpretation.
Example 2 (reference 40 in Annex 1 in the online Data Supplement).
This study used a healthy-control design, and we did not consider it to have overinterpreted its results. The statements regarding clinical applicability were judged as simply promising ("This study shows that free-circulating DNA can be detected in cancer patients compared with disease-free individuals, and suggests a new, non invasive approach for early detection of cancer."). The authors additionally specify the need for further studies to evaluate the test ("Further studies are needed to understand the correlation of these new molecular markers with cancer diagnosis, outcome of disease, and eventually treatment response.").
Example 3 (reference 87 in Annex 1 in the online Data Supplement).
This study used a clinically relevant population, and we considered the statements regarding clinical utility as definitely favorable (“Component-based testing and the whole-allergen CAP are equally relevant in the diagnosis of grass-, birch- and cat-allergic patients.”). The authors specify the need for further clinical evaluation (“The clinical relevance of each allergen needs to be validated separately before the implementation of multiallergen panels into routine diagnostic settings.”). This study had acceptable diagnostic accuracy (sensitivity, 72%; specificity, 92%) and therefore was not considered to have overinterpreted the clinical applicability of its results.
Example 4 (reference 54 in Annex 1 in the online Data Supplement).
This study also used a clinically relevant population; we considered the statements regarding clinical utility as definitely favorable (“This PCR assay detects a variety of strains exhibiting characteristics of the EAEC group, making it a useful tool for identifying both typical and atypical EAEC.”); however, the authors did not report any measure of diagnostic accuracy. The study was therefore considered overinterpretation.
Discussion
Although clinical evaluation is necessary before a test is introduced into clinical practice, few recent diagnostic studies in molecular research have been carried out in a clinically relevant population. The authors almost always interpreted their findings as either definitely favorable or at least promising for the evaluated technology. More than half of the articles apparently overinterpreted the clinical applicability of their findings, and such interpretation was more likely in articles whose authors were all laboratory-based and in articles published in journals with higher impact factors. Most of the reviewed studies used healthy-control or alternative-diagnosis-control designs. These studies are not all equal(14): some may be affected by biases, whereas others may be unbiased. Such nonequivalence is one more reason why evaluations with study designs that come closer to real-life clinical settings are warranted.
Some authors have stressed the need to measure the value of a diagnostic test on health outcomes as a final phase in the evaluation of its clinical utility, once the test has been accepted clinically and made commercially available(6)(9). We have not covered this issue in this study; however, we do agree that evaluating whether a test positively influences health outcomes is a key aspect. We chose not to cover this aspect because few molecular-diagnostic tests have been incorporated into practice and because trials evaluating the clinical utility of such tests are still scarce. For example, no randomized trials have conclusively assessed the clinical utility of tests involving gene expression profiling, despite several thousand published articles on the subject (2 trials are ongoing)(15).
Other empirical investigations of the methodologic aspects of diagnostic research have reported serious methodologic limitations(16)(17)(18)(19). In the present study, however, we examined the applicability of diagnostic-test results to practice on the basis of the study design and independently of other methodologic aspects. We documented that considerable distance often exists between study design and the clinical applicability of the molecular-diagnostic tests, even if the design and the data are methodologically sound.
With the continuing development of new diagnostic tests, comprehensive clinical evaluations are needed if clinical harm and unnecessary spending are to be avoided. As our results show, studies that make claims about the clinical applicability of molecular-diagnostic tests often have not evaluated populations of clinically relevant patients and therefore lack evidence on which to base their claims. Enticing promises exist across the field of molecular medicine(20). The exaggeration of the clinical implications of preliminary investigations that we observed in our study may be due to different processes(4)(21)(22), including commercial influences(4) and insufficient awareness by researchers of their own “interpretive biases”(23)(24).
Overinterpretation can certainly arise when a strong result is obtained from a very small study. Indeed, the lack of reproducibility in analyses of proteomic and genomic data is often ascribed to small samples: The main difficulty in conducting a satisfactory early assessment is obtaining sufficient numbers of individuals for both training and validation; thus, the results may be overinterpreted. Large sample sizes and replication in multiple independent data sets are necessary but not sufficient for reliable results, however.
Comprehensive clinical evaluation of a single diagnostic test is expensive in terms of both money and time(25). Reliable consecutive series of samples that are representative of the real clinical settings of interest may be difficult to obtain in molecular-based research. Unless a well-thought-out research study is designed in collaboration with a clinical center, few groups are likely to hand over their “precious” clinical samples and their clinical and demographic data to a laboratory(26). Clinicians may be more sensitive to the difficulties and implications of moving these tests to the bedside and thus may be more cautious in their interpretation. Such reticence would be consistent with our observation that articles by exclusively laboratory-based authors were more likely to overinterpret the clinical applicability of their results. Finally, the observed relationship between journal-impact factor and overinterpretation could be a form of bias: Studies with the more spectacular conclusions appear in journals with higher impact factors, many of which are also more biologically and industry oriented than clinically based.
Some caveats about our methods require discussion. First, we used an operational search strategy and definition to identify a sufficiently large number of molecular-diagnostic studies, but there is no established, widely agreed strategy for identifying such studies in the literature. To evaluate the consistency of the selection process, 2 investigators assessed a random sample of the abstracts and achieved an adequate degree of agreement with the initial reviewer; accordingly, the complete MEDLINE screening of potential reports was carried out by a single reviewer. We cannot totally exclude the potential for selective inclusion, but we expect it to be small. Furthermore, the internal validity of the type of study we conducted does not require the same completeness of the sample that systematic reviews and metaanalyses of research findings require.
More importantly, passing judgment on whether overinterpretation exists is not always straightforward, and there is a risk that our own assessments overinterpret the language of an article. To establish an adequate definition of overinterpretation, we took several aspects of each scientific report into account; however, we acknowledge that this scheme is not a perfectly objective rule, and the agreement between the independent data extractors was less than perfect. Although such deficiencies may affect the exact extent of estimated overinterpretation, they do not affect our main conclusion that inferences on clinical applicability are exaggerated in this literature.
The requirements for the introduction of diagnostic tests into clinical practice are less strict than for the introduction of new treatments. Hence, flawed or exaggerated claims for diagnostic-research results could lead to the premature adoption of defective tests, which could translate into erroneous decisions with adverse consequences for health. All in all, our results emphasize the necessity for caution when interpreting the results of diagnostic-accuracy studies in molecular research.
Author Contributions: All authors confirmed they have contributed to the intellectual content of this paper and have met the following 3 requirements: (a) significant contributions to the conception and design, acquisition of data, or analysis and interpretation of data; (b) drafting or revising the article for intellectual content; and (c) final approval of the published article.
Authors’ Disclosures of Potential Conflicts of Interest: Upon manuscript submission, all authors completed the Disclosures of Potential Conflict of Interest form. Potential conflicts of interest:
Employment or Leadership: None declared.
Consultant or Advisory Role: None declared.
Stock Ownership: None declared.
Honoraria: None declared.
Research Funding: Spanish Agency for Health Technology Assessment (Exp PI06/90311) and CIBER en Epidemiología y Salud Pública (CIBERESP), Instituto de Salud Carlos III, Government of Spain.
Expert Testimony: None declared.
Role of Sponsor: The funding organizations played no role in the design of study, choice of enrolled patients, review and interpretation of data, or preparation or approval of manuscript.
Acknowledgments: This manuscript was presented in poster format at the Fifth Annual Meeting of Health Technology Assessment International (HTAi), Montréal, Canada, July 6–9, 2008. We thank Jonathan Whitehead for editorial help in preparing an early version of the manuscript.
© 2009 The American Association for Clinical Chemistry