Depending upon whom you ask, you may get very different answers to the question of “what does ‘big data’ mean to you?” Most obviously, the term “big data” applies to the high-resolution omics data for which we rely on various bioinformatics tools to draw conclusions about how to improve patient care. However, “big data” also readily refers to the data reported every day as part of the clinical laboratory testing environment, and more broadly to the information generated in electronic health records (EHRs).11 There are several practical IT solutions for handling day-to-day “big data” that enable millions of test results to be reported per year.
Informatics is changing the processes behind laboratory medicine. With ever-growing demands on laboratory medicine professionals not only to collect and interpret omics data in the era of the Precision Medicine Initiative, but also to ensure high-quality, low-cost patient management in the structure of accountable care organizations, we have invited several experts to discuss their take on “big data.”
Our experts highlight how to ensure that the data analyzed are of high quality, so that the conclusions we draw will translate into effective clinical management and optimal patient care. They review a number of IT solutions they rely on to gain efficiency in the clinical laboratory and benefit clinical practice. Our experts also discuss the ability to query the clinical laboratory database to improve test utilization, and how “big data” analytics allows for a more effective means of quality management. These 8 experts, with diverse backgrounds and interests, highlight various IT solutions for tackling our “big data.”
How do you define “big data” and what does it mean to you in your clinical practice?
Eric Klee: In my opinion, the term “big data” has different meanings depending on the context being considered. From an IT perspective, “big data” is anything that challenges an institution's computational infrastructure and requires application-specific modifications. A clear example is providing sufficient compute nodes, memory, and storage to account for the demands of whole genome sequencing. The size of the data sets generated often requires high-performance computer clusters and specialized storage infrastructure. I think “big data” has a different meaning from the context of a cyto- or molecular geneticist. From that perspective, I would assert that “big data” refers to any data set that challenges or exceeds an individual's ability to manually evaluate all data points for clinical relevance. This does not necessarily require the data to be whole genome sequencing, but any targeted next-generation sequencing (NGS) panel of sufficient size to require informatics solutions to enable data reduction before clinical interpretation. For example, a molecular geneticist might be capable of reviewing all variants called on a 10-gene panel (approximately 30–40 variants per case) without additional informatics support; however, they would be overwhelmed in trying to do this for a 60–100-gene panel.
Linnea Baudhuin: “Big data” is a broad term that relates to data sets that are so large, diverse, and/or complex that traditional data processing applications are inadequate to analyze, capture, curate, share, visualize, and store. Thus, “big data” requires innovative bioinformatics solutions for processing, to make the data meaningful, and to derive usable information. In the arena of clinical molecular genetic testing, NGS has required us to develop complex bioinformatics solutions that require raw data mapping, alignment, filtering, and variant calling.
Stephen Master: There are several different definitions of “big data” that get thrown around. One school of thought says that we should only use the term “big data” when we have too much information to store or process on a single computer. Another way to define “big data” is by the characteristics of volume (amount of data), velocity (how quickly we acquire data), and variety (different kinds of data). We also talk about “big data” in the context of analyzing large, complex data sets such as those derived from omics experiments. The important thing to recognize is that many of the same analytical approaches can be applied to laboratory medicine data regardless of the precise definition that we choose. As a result, I think that it's appropriate to use “big data” to describe the information that we get from our large numbers of patients, samples, and analytes in the clinical laboratory.
Daniel Holmes: The term “big data” is frequently used by my laboratory medicine colleagues, clinicians, and health administrators in various settings. From the context of these conversations, I (somewhat jokingly) would be prepared to say that most of us use the term to mean “I can't do the analysis in Excel.” Restricting my thinking to healthcare environments, I would define the term as follows:
Big data (in clinical medicine): (n) Extremely large data sets obtained from demographic, clinical (medical, nursing, paramedical, pharmacy), diagnostic, and public health records used to direct decisions about diagnostics, patient care, resource allocation, and epidemiological trends.
The connotation of the term suggests that the data itself might have to be pooled from disparate sources, usually databases of diverse structure, and that the data might require substantial “wrangling” to prepare it for analysis using preexisting or custom computational tools.
Mark Cervinski: All of the high-volume test result data produced by automated instruments in the clinical laboratory could qualify as “big data.” A medium-sized laboratory such as ours can generate 3 to 4 million patient test results a year and in addition, each one of those results has associated data that never make it to the chart. For every test performed, we track the time when, and where, a sample was drawn, transit time, processing time, analyzer time, resulting time, and specimen integrity (hemolysis, icterus, and lipemia indices). However, the only data to accompany the test result are typically the time the sample was drawn and when it was resulted. All of these nonresult data are valuable and mineable. With the proper tools and questions in hand, a laboratorian can dig into these data to query whether their reference ranges are appropriate, to look for test utilization patterns, to curb test overutilization, or to monitor preanalytic and analytic quality.
Gary Horowitz: Although my laboratory generates over 5 million patient test results each year, I don't usually think of the work I do as involving “big data,” but maybe I should. To me, “big data” relates to the kind of analytics that Google does—e.g., using the frequency of search terms to track influenza epidemics almost in real time. The work I do with large amounts of patient data reflects practices at my institution, which is only a small piece of the “big data” our laboratory produces each day. My goal in analyzing these clinical data is to see whether I can uncover ways to improve not only laboratory practice, but overall clinical practice. That is, in addition to ensuring the accuracy, precision, and turnaround times (TATs) of laboratory results, I try to see whether the results themselves can be used to monitor and improve clinical care. For example, if many of the vancomycin levels we report are outside the therapeutic range, it's not enough that our assays are accurate and TATs are good. I'd like to know what we can do, as an institution, to help ensure that patients' vancomycin levels are therapeutic.
Rajiv Kumar: “Big data” is the assessment of massive amounts of information from multiple electronic sources in unison by sophisticated analytic tools to reveal otherwise unrecognized patterns. As a pediatric endocrinologist, to me “big data” means a way to enhance the care of chronic diseases such as insulin-dependent diabetes mellitus. Multiple and fluctuating factors affect blood sugar control, and patients/parents are asked to make real-time decisions about multiple daily insulin doses without truly knowing whether a given dose will have the intended effect. They have the benefit of their own experience and healthcare provider guidance based on intermittent retrospective review of available data. However, each decision point represents a new combination of variables, with subtleties in patterns that may not be readily identified. As the diabetes research community moves closer to the realization of an automated closed-loop artificial pancreas in clinical use, “big data” will be the backbone that facilitates optimal glycemic control.
Albert Chan: Our world today is geared to improve the lives of consumers. From Amazon to Google to our local grocery store, we have the same expectation: a product that meets our consumer expectation of utmost quality and convenience. This is made possible by “big data.” On the continuum from creepiness to utility, our expectations have shifted. We now readily provide the personalized data that can simplify our transactions or personalize our experiences for the better.
How do you tackle your “big data” and what do you make out of it in your practice?
Linnea Baudhuin: Most laboratories utilize multiple different software programs and home-brewed IT solutions to analyze extremely large NGS output files. There are 3 major types of NGS tests in current clinical use: (1) cancer genetic variant testing for diagnosis, prognosis, or therapeutic response; (2) gene panel testing for diagnosis of inherited disorders; and (3) whole exome (or genome) sequencing for rare inherited disorders. In the future, we can expect NGS tests for pharmacogenomic testing panels, as well as transcriptomic- and epigenomic-based tests. For all of these applications, we need the ability to filter out potentially large numbers of benign or nonreportable variants. Additionally, we frequently encounter variants of uncertain significance (VUSs), and the more we sequence, the more VUSs we encounter. While testing of trios (e.g., parents and affected child) can help to create cleaner data, complete trios are often not available for testing. For these reasons, and more, we require a specialist to help us tackle big NGS data. Most clinical genetics laboratories now employ at least one bioinformatics specialist, with the bigger laboratories requiring a team of such specialists.
Eric Klee: The “big data” data sets that I generally work with are all NGS based, and most of what we attempt to make out of these data are high-quality variant call sets. The variant types range from simple single-nucleotide variants (SNVs) and small insertion or deletion variants (INDELs), to copy number variants (CNVs), to structural variants, including fusions, translocations, and inversions. The steps employed to make these calls include read-level QC and filtering, alignment (and realignment) to the appropriate reference genome, and then a series of highly customized, application-specific variant-calling algorithms. These are integrated into a centralized relational database, merging annotations together with the variant call data, to provide the appropriate context to the underlying data at the time of interpretation.
Stephen Master: From the perspective of software platforms, the bulk of our analytic work is being done in the R statistical program language. There are a few other language platforms that also have good support for statistical analytics in a big data context, but we now have a growing international group of clinical chemists who are sharing approaches for analyzing laboratory data in R.
Daniel Holmes: I exclusively use the R statistical programming language for cleaning, processing, analyzing, and visualizing the large data sets I have to deal with in laboratory medicine. I use R because it has a very large following of people from many disparate academic fields. This means that specific tools for almost any conceivable analysis are available in the Comprehensive R Archive Network (CRAN). Relevant to clinical chemists and pathologists, this includes tools for database creation/querying, data cleaning and reshaping, routine and sophisticated statistical analyses, mass spectrometry/chromatography, genetics, epidemiology, machine learning, and data visualization (including real-time interactive visualization). Additionally, R is freely available and open-source, platform independent, and has a broad community that shares its knowledge and source code. Our efforts with R have been primarily directed towards automating quality management tasks, visualization of quality metrics in new ways, assessing utilization, and finding needles in haystacks. For example, Levey-Jennings charts for all tests on all chemistry analyzers for the previous month are autogenerated in the early hours of Monday mornings. These are processed and converted to PDFs by a single R script (coordinated with Linux bash) and autoemailed to all relevant staff for review when they arrive for work. As for finding needles in haystacks, we had an analyzer filing identical panel results on 2 consecutive patients on a very rare basis. With R, we were able to write a script to identify these occurrences among 5 million analyses performed over the prior 3 years to find the 30 affected patient records and make investigations and corrections.
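The needle-in-the-haystack search described above reduces to a scan for identical panels filed on consecutive, distinct patients. The script in question was written in R, but the logic can be sketched in a few lines of Python (the function name and data shapes here are illustrative, not the actual script):

```python
def find_repeated_panels(filed_results):
    """Flag places where identical panel results were filed on two
    consecutive, distinct patients -- a pattern that should be vanishingly
    rare and may indicate an analyzer filing error.

    `filed_results` is a list of (patient_id, panel) tuples in filing
    order, where `panel` is a tuple of analyte results.
    """
    flagged = []
    for (prev_pt, prev_panel), (cur_pt, cur_panel) in zip(
        filed_results, filed_results[1:]
    ):
        # Identical results on the same patient (e.g., a repeat run) are
        # expected; identical results across two different patients are not.
        if prev_pt != cur_pt and prev_panel == cur_panel:
            flagged.append((prev_pt, cur_pt))
    return flagged
```

Run over a few years of filed results, a scan like this yields a short list of patient pairs for manual investigation, as in the 30 affected records described above.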
Mark Cervinski: For the last few years we've been using our “big data” to establish a moving averages program to monitor the mean patient value for a number of chemistry analytes in real time. Moving averages is not a new quality assurance (QA) concept but it has been difficult to implement, partially due to the difficulty in acquiring and analyzing the data. We were only able to develop sensitive moving average protocols once we were able to accumulate nearly 2 million test results in a database. Using this “big data” data set, we were able to model the moving average process in a statistical modeling software package. This modeling of “big data” allowed us to develop protocols to rapidly detect analytical shifts. Without the data, we would only be able to guess if our protocols would be sensitive enough to detect a systematic error condition. We've only begun to mine this data in our laboratory and thus far, we've focused only on the results and when and where the sample came from. But there is great potential to influence the medical care that happens outside of the laboratory's walls.
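The core of a moving-averages monitor can be sketched as follows; the window size, target, and control limit below are illustrative placeholders, since (as noted above) usable protocols must be tuned by modeling large sets of historical patient results:

```python
from collections import deque
from math import sqrt

def monitor_moving_average(results, window, target, sd, z=3.0):
    """Flag positions where the rolling mean of patient results drifts
    outside target +/- z * sd / sqrt(window).

    `target` and `sd` would come from modeling the historical patient
    population; `window` and `z` are tuning parameters traded off
    between sensitivity and false-alarm rate.
    """
    buf = deque(maxlen=window)          # fixed-size rolling window
    limit = z * sd / sqrt(window)       # control limit on the window mean
    alarms = []
    for i, x in enumerate(results):
        buf.append(x)
        if len(buf) == window and abs(sum(buf) / window - target) > limit:
            alarms.append(i)
    return alarms
```

With a stable patient mixture the monitor stays quiet; after a systematic analyzer shift, the rolling mean crosses the limit within a handful of results, which is the rapid shift detection described above.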
Gary Horowitz: We make extensive use of Microsoft Access and Microsoft Excel. We are fortunate in that we have access to a database of laboratory data extending back to 2003, along with many other clinical parameters [admissions; International Classification of Diseases, Ninth Revision (ICD-9) and, recently, Tenth Revision (ICD-10), codes; attending physicians; etc.]. We use Access to write queries to retrieve data of interest, and then we do the majority of our analyses using Excel. To accommodate the increasing number of variables in the future, more sophisticated analytical tools will be required, such as pulling data from this database using the R statistical software.
Rajiv Kumar: Currently when reviewing data for a patient with diabetes, we strive to recognize patterns that may benefit from modulation of insulin dosing parameters and other intervention. Variables assessed include blood sugar trends, characteristics of insulin dosing, physical activity, carbohydrate intake, and progression in growth and puberty. These data and metadata are challenging to access between quarterly clinic visits and are not readily organized in a manner conducive for rapid pattern recognition. In response, my group has explored an infrastructure goal of conveying more patient data, more often, without increasing patient/parent effort or exacerbating provider resource strain. Using integration of Apple's HealthKit platform with our Epic EHR patient portal, we are now able to receive up to 288 blood glucose readings per day for patients using Bluetooth-enabled continuous glucose monitors. Key aspects of this integration include data unified in the EHR (home of medical history variables, laboratory test results, growth data, prescription data, and provider work flow), passive data transfer via the patient's/parent's mobile device, and data security of this patient health information. With continued variable expansion using this approach, there will be important progress in clinical decision support toward more precise diabetes care.
Albert Chan: Historically, physicians have made decisions about patient care based on limited data. For example, physicians may change a hypertensive medication based on a blood pressure obtained in the exam room. Yet, 99 plus percent of a patient's life happens outside of the office. This is the promise of wearables and other remote monitoring solutions, providing our care teams with a more holistic view of the patient. For example, a home blood pressure cuff digitally connected to our electronic health record provides our care team with a more complete view of blood pressure control to make better clinical decisions. More importantly, empowering our patients with this data can facilitate new teachable moments.
Are there particular “big data” analytics you use to gain efficiencies, for quality management, or to provide clinical improvements?
Eric Klee: We use all the same type of standard NGS analytics that the broader community uses, including basic read-level QC (i.e., FastQC), alignment level QC (% reads mapped, mapped on target, mapped at Q30, etc.), and variant level QC [Ti/Tv (transition–transversion) ratio, number of variants per region, synonymous-to-nonsynonymous ratios, Q20 or greater variant counts, variant frequency distributions]. In addition, we use a highly specialized relational database to enable complete variant profile storage with rapid retrieval, enabling us to quickly generate QC and interpretative reports.
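As a concrete example of one of these metrics, the Ti/Tv ratio can be computed directly from the called SNVs. This is a simplified sketch (production pipelines compute it per region and per variant class); genome-wide germline call sets are typically expected to land near 2, and a markedly lower ratio suggests an excess of false-positive calls:

```python
PURINES = {"A", "G"}  # transitions stay within purines or within pyrimidines

def titv_ratio(snvs):
    """Compute the transition/transversion ratio from (ref, alt) SNV pairs.

    A substitution is a transition when ref and alt are both purines or
    both pyrimidines; otherwise it is a transversion.
    """
    ti = tv = 0
    for ref, alt in snvs:
        if ref == alt:
            continue  # not a variant; skip defensively
        if (ref in PURINES) == (alt in PURINES):
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else float("inf")
```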
Linnea Baudhuin: We utilize bioinformatics tools to analyze NGS data quality and filter out data that are of poor quality or require potential follow-up. Quality parameters that are assessed include number of overlapping reads, per-base depth of coverage, average depth of coverage within the total sequenced area, uniformity of coverage, variant frequency for heterozygotes and homozygotes, strand bias, and nonduplicate reads. Cutoffs for analytical performance parameters need to be established during test verification. Additionally, per-base quality scores are assessed for each test, and the bioinformatics pipeline removes bases with low quality scores before alignment. To gain efficiencies for variant classification, we utilize multiple sources to determine if a variant has been previously detected and reported in the general population, in specific ethnic groups, and/or in individuals or families with relevant disease phenotypes. These sources include our own internal variant database, ClinVar, the Human Gene Mutation Database, the National Heart, Lung, and Blood Institute's (NHLBI) Exome Variant Server, and the Exome Aggregation Consortium (ExAC) database. We utilize whatever information we gather from these sources, along with other in silico prediction and evolutionary conservation tools, to help us make decisions on variant classification. We also classify variants during test verification and store this information in a database, to help streamline downstream classification of variants encountered when the test is live. This information is especially helpful for benign or likely benign variants.
Stephen Master: Right now we deal with our “big data” once it has already been pulled from the laboratory information system (LIS) in a batch mode. This is fine for things that are not time sensitive (simple things like TAT analysis, identification of long-term testing trends, or discovering multivariable patterns), but it doesn't yet provide a way for us to turn our big-data analysis into real-time diagnostic output. Our next step at an institutional level is making sure that we have much more rapid access to the raw results data. In terms of specific analytic applications, my group has demonstrated the use of high-throughput hematology analyzer data to identify myelodysplastic syndrome. However, we really need to solve the real-time-data access problem to fully take advantage of these analytics in our clinical practice.
Daniel Holmes: At present, there is a significant desire for the application of quality metrics in both monitoring of traditional statistics (TAT, reporting of critical values, adverse event rates) and the development of novel metrics (identification of outlier behavior, trends in utilization, identification of testing of low clinical utility). Previously, all of this monitoring was done manually in spreadsheet-based programs, which is problematic for a number of reasons: it is not traceable (there is no record of the steps in the analysis), it is not automated (repeating the same tedious steps each month to generate reports), the statistical tools in spreadsheet programs are limited, and there are currently no automated report generation or real-time data visualization tools. For these reasons, my colleagues and I are coding tools to automate the traditional metrics of quality monitoring. We hope to automate TAT and utilization monitoring using the pipeline of R (database query and analysis), R-Markdown (http://rmarkdown.rstudio.com/), Knitr (http://yihui.name/knitr/), and LaTeX to create PDF reports. We may opt to create a web “dashboard” using the “Shiny” package for R (http://shiny.rstudio.com/). At a minimum, the advantage will be that the source code shows exactly what has been done, and we can use the same code to produce laboratory quality metric reports across the 7 large hospitals in our region.
Mark Cervinski: In addition to the moving averages protocols, we use tools available in our middleware software to analyze sample work flow and tests performed per hour, to adjust our instrument maintenance times, staffing, and test distribution mixture. We also routinely monitor autoverification rates, in-laboratory TAT, and specimen quality flags in real time, as deviations from the norm could indicate unnoticed instrumentation errors and predict instrument downtime. On a longer scale, the data we collect could be considered “big data” but as we monitor these changes on a daily, weekly, and monthly basis we tend to refer to this as “small data.” These “small data” fields are key to designing middleware rules that assist laboratory technologists and allow them to focus on those samples that need extra handling. A well-designed set of middleware and LIS rules can bump up the laboratory's autoverification rate and lower the cost per test, metrics that become ever more important as reimbursement rates for laboratory testing continue to decline.
Gary Horowitz: We generate and test hypotheses with a goal of analyzing and improving clinical practice. As an example, we wondered why our clinicians were ordering so many serum folate tests. Were they screening for folate deficiency, or were they working up cases of macrocytic anemias? Theoretically, there should be very little folate deficiency in the US population, since all breads and cereals have been fortified since 1996. Our analysis indicated that the test was being ordered in huge numbers and almost always in the absence of anemia, let alone macrocytic anemia. Our data indicated that it had an exceptionally low clinical yield: 3 cases of folate deficiency out of 84 000 samples (0.004%) over an 11-year period. We argued that, although we generate excellent results in a timely fashion, we would prefer not to do the test at all. In a similar way, we can look at how often physicians order a single troponin in patients undergoing evaluation for acute coronary syndromes, how often therapeutic drug levels are within their target ranges, and how often physicians react appropriately to d-dimer levels. In all of these situations, our goal is to try to identify areas where, together with our clinical colleagues, we can improve patient care above and beyond offering accurate test results in a timely manner.
Rajiv Kumar: In anticipation of an exponential increase in glucose data from our patients, we built an analytic triage report and glucose data viewer embedded in the EHR to facilitate intervisit retrospective data review without collapsing available resources. The automated report is generated at defined intervals to triage patients by glycemic control. This allows a diabetes provider to focus time and resources where they are needed most. For a patient whose home data meet flag criteria, the provider opens the patient's chart and uses the glucose viewer to review and quickly identify actionable data trends. Any questions or recommendations are conveyed to the patient and/or parent using the EHR's patient portal, permitting efficient communication while simultaneously documenting changes in the treatment plan. We are now translating this provider work flow to streamline analysis of patient-generated health data for additional chronic diseases.
What are some lessons learned from your experience with real-time “laboratory” data integration through Apple HealthKit?
Rajiv Kumar: A major component of managing data is managing patient/parent expectations about said data. Our current intention of using HealthKit is not to take over real-time decision-making, but rather to facilitate efficient identification of actionable trends between visits. In review of previous efforts to implement home data in the EHR, we learned that unless explicitly highlighted, patients may think their provider is constantly watching their data and get frustrated when they are not contacted immediately for an aberrant value. At setup, we use verbal and written notification to establish appropriate expectations regarding only intermittent provider monitoring. To date, we have received only positive feedback, as we are meeting the expectations we defined. When a patient/parent contacts the diabetes provider between clinic visits with questions or concerns regarding glucose trends, quick provider access to the data with no additional effort is appreciated on both sides. There is a technical requirement for patients/parents to keep their mobile device operating system and relevant apps updated to maintain passive data communication. While this is not a major hurdle for most smartphone users, we found use of a tip sheet with mobile device screenshots to be helpful for some.
How do you see these devices impacting healthcare in the near term, and way in the future? What should laboratorians be investing in now, to have a seat at the table 20 years from now?
Rajiv Kumar: In the near term, passive communication of patient-generated health data to the EHR enhances access to this information in the context of other laboratory and relevant variables in the chart. This organization of puzzle pieces in the center of provider work flow is likely to improve care for a given patient's condition. In the long term, this precision medicine initiative will continue to foster variable expansion and facilitate clinical decision support tools for providers and patients alike. Additionally, with respect to population health, unified organization of data sets in a given EHR will permit deidentified sharing across health systems to provide invaluable insight on epidemiology and optimized approaches for human disease. In anticipation of the short- and long-term power of complete health information, hospital systems and laboratorians should be investing in EHR data infrastructure today, including an interactive patient portal. We need to ensure that the data we generate and receive have longevity, are easy to access and share, and can be easily formatted to answer questions that we have not yet thought to ask. Importantly, we also need to advocate for healthcare policy that supports use of big data in evolving care models without challenging provider resources.
Albert Chan: In our experience with a personalized healthcare program for hypertension, almost 69% of patients with previously uncontrolled hypertension who were provided their blood pressure data contextualized with behavioral factors, such as exercise activity, are currently at target control. Looking ahead, simple binary measures such as a serum test will be augmented with measures that reflect our increasingly nuanced understanding of health. With genomics and other tests providing quantitative probabilities of disease, our clinicians will have to become facile at taking increasingly complex data and explaining the ramifications and the patient's options for intervention. I am teaching my son and daughters the basics of computer science. This is not a bet that they will grow up to be programmers. Rather, it is based on a belief that all of us, including those of us who participate in clinical care, will need superior quantitative skills to serve as advocates for our patients. Our healthcare consumers will depend on and demand that we have these abilities to better partner with them to make the critical decisions that influence their health.
What pitfalls must laboratorians be aware of to ensure we make accurate conclusions from the analysis of our “big data”?
Eric Klee: It is important that laboratorians understand all of the assumptions that have been built into any “big data” analysis pipeline. Oftentimes, these consist of default configurations of informatics solutions that will not necessarily meet the assay-specific requirements. A simple example is the minimum read depth or frequency of a variant that would be called and reported. These configurations are often set with the assumption that the user is analyzing basic genomic data in a hereditary test application and will fall short when thinking about somatic or mitochondrial assays. More complex are some of the assumptions made around complex variant situations, including INDEL events, or SNVs in proximity to INDELs, etc. It is equally important that the laboratorian is familiar with the type of variant quality filtering that is being used. When one is dealing with extremely large data sets, automated filtering and data reduction are required for efficient interpretation. A laboratorian cannot be expected to review all possible variants for each case analyzed, but must take the time to establish a solid understanding of the data reduction and QC steps employed in a “big data” assay during test development, to ensure the proper methods are being used.
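The read-depth and allele-fraction example can be made concrete. A filter with germline-oriented defaults will silently discard low-fraction somatic or heteroplasmic mitochondrial variants; the thresholds below are illustrative, not recommendations:

```python
def variant_passes(depth, alt_reads, min_depth=20, min_vaf=0.30):
    """Return True if a variant call survives simple depth and
    allele-fraction filters.

    The defaults mimic a germline-oriented pipeline expecting
    heterozygotes near 50% allele fraction; somatic or mitochondrial
    assays need far lower `min_vaf` settings, so inheriting these
    defaults unexamined causes silent false negatives.
    """
    if depth < min_depth:
        return False  # too few reads to call confidently
    return alt_reads / depth >= min_vaf
```

For example, a somatic variant supported by 40 of 500 reads (8% allele fraction) is rejected by the germline defaults but accepted once the threshold is set appropriately for the assay.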
Linnea Baudhuin: NGS has prompted us to move from targeted mutations or single gene analysis to multigene panels, whole exome, and even whole genome testing. Along with this, our analysis of the data has moved from fairly simple software solutions to the need to implement a stitched-together set of bioinformatics software systems that are a combination of off-the-shelf and in-house developed. A high-quality bioinformatics pipeline enables us to perform testing with high sensitivity and specificity. In the world of NGS, this means that we can detect as many variants and types of variants as are present while ensuring that the data being reported meet quality standards. But we need to balance this with being careful about creating a test that is clinically useful, keeping in mind that more is not necessarily better. In other words, the more we sequence, the more variants we find, and the more variant categorization needs to be done. This, in turn, translates to resources spent by the laboratory classifying variants, time spent by the clinician trying to understand and explain the results, a higher potential for incorrect interpretation of the report by clinician/patient, and potentially unnecessary follow-up testing on VUSs. Thus, we, the laboratorians, have a responsibility to provide NGS tests that are analytically and clinically valid, as well as clinically useful. We also need to carefully state the limitations of testing (e.g., what is not detected with the test), and we need to classify variants in a conservative and standardized manner.
Stephen Master: I think that the biggest pitfall for the laboratory is not spending enough time thinking about data management. There have been several highly publicized cases within the past 10–15 years where complex data were incorrectly processed and led to possible harm to subjects or patients. The underlying problem is that once we're talking about “big data” it can be very difficult to spot problems “by eye” unless we have well-validated ways of reproducibly managing and processing data. Another important pitfall is the relative lack of people in our field with quantitative expertise. If we're going to use “big data” approaches in clinical chemistry and laboratory medicine, we need to be able to effectively police ourselves and peer review each other's laboratories through the inspection process. This has important implications not only for the way that we prioritize our use of big data for computational pathology but also for the way that we train the upcoming cohort of young clinical laboratorians.
Mark Cervinski: Whether it is intentional or not, data can be massaged to fit a predetermined outcome. Formulating testable questions and disclosing all analysis conditions, including which values or data elements were included or excluded from the analysis, is therefore vital. Like all scientific experiments, our analyses of “big data” must be replicable by our peers. While sharing our data sets may not be possible because of protected health information (PHI) disclosure or simply because of the size of the database, I would support sharing the tools used so that they can be vetted and improved upon by other similarly skilled investigators.
Daniel Holmes: The data coming out of our LIS are not always as clean as we think. For example, in programming analytical tools for TAT, we have found the data polluted with add-ons, duplicate analyses, duplicate requests, nonnumeric results, and even negative TATs. Thorough review of data quality and a strategy for removing extraneous records are necessary to ensure that the results are meaningful and accurate. We usually start with small, predictable analyses on specific cases to verify that meaningful results are generated. Then we embed the analyses into R functions and apply them across cases. In medicine we often say, “Don't order a test if it does not change clinical management.” With “big data” we are effectively performing diagnostic tests on our own data to see if we can help direct patient care, defray costs, allocate resources, and identify problems early. However, we are in danger of investing our resources in custom analytics only to end up with metrics that are uninformative or to which the appropriate response is unclear. If the analysis does not or cannot inform clinical or laboratory medical practice, then we produce reports that have no value. It is critical that the individuals performing the analysis have a thorough understanding of clinical and/or laboratory medicine processes. Otherwise, they are likely to discover phenomena that may appear significant to an outsider but do not really matter from a practical standpoint. For this reason, isolating quality management personnel and programmers from the clinical or laboratory environment undermines a team-based approach to improving patient care. A coordinated approach prevents medical professionals from making pie-in-the-sky requests and analysts from drawing naive conclusions.
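The kind of cleanup Holmes describes can be sketched as follows (the field names model a hypothetical LIS extract; real columns and exclusion rules would be site-specific): duplicates, nonnumeric results, and negative turnaround times are stripped before any TAT statistics are computed.

```python
# Illustrative cleaning of a hypothetical LIS extract before TAT analysis.
from datetime import datetime

def clean_tat_records(rows):
    """Drop rows that would distort a turnaround-time (TAT) analysis."""
    seen = set()
    clean = []
    for r in rows:
        key = (r["accession"], r["test"])
        if key in seen:              # duplicate request or duplicate analysis
            continue
        seen.add(key)
        try:
            float(r["result"])       # skip nonnumeric results
        except (TypeError, ValueError):
            continue
        tat = (r["verified"] - r["received"]).total_seconds() / 60
        if tat < 0:                  # impossible negative TAT (timestamp error)
            continue
        clean.append({**r, "tat_min": tat})
    return clean

rows = [
    {"accession": "A1", "test": "K", "result": "4.1",
     "received": datetime(2015, 9, 1, 8, 0), "verified": datetime(2015, 9, 1, 8, 45)},
    {"accession": "A1", "test": "K", "result": "4.1",        # duplicate
     "received": datetime(2015, 9, 1, 8, 0), "verified": datetime(2015, 9, 1, 8, 45)},
    {"accession": "A2", "test": "K", "result": "HEMOLYZED",  # nonnumeric
     "received": datetime(2015, 9, 1, 9, 0), "verified": datetime(2015, 9, 1, 9, 30)},
    {"accession": "A3", "test": "K", "result": "3.9",        # negative TAT
     "received": datetime(2015, 9, 1, 10, 0), "verified": datetime(2015, 9, 1, 9, 0)},
]
print(clean_tat_records(rows))  # only the first row survives, with tat_min == 45.0
```

This mirrors the workflow described above: verify the logic on a few predictable cases, then wrap it in a function and apply it across the full data set.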
Gary Horowitz: By far, the most important pitfall for us to consider relates to the effects of local practice on the data, which limit the generalizability of the findings. As an example, we recently did an analysis of extremely high ferritin values (>10 000 ng/mL) from our hospital. None of the traditional causes (hemochromatosis, Still's disease, hemophagocytic lymphohistiocytosis) were among our cases; rather, we most commonly found liver failure and other hematologic diseases. Does this reflect our patient population, our doctors' ordering habits, and/or something else? Should we “educate” our clinicians by telling them that a ferritin level >10 000 ng/mL is not seen in those other diseases? Another example relates to efforts to derive reference intervals from laboratory databases, which are attractive because they encompass massive amounts of information. Both simple nonparametric techniques and sophisticated statistical techniques have been used in these efforts. But the results vary tremendously depending on whether one includes all patients, limits the analyses to just outpatients, or limits them to just outpatients with ICD-9/ICD-10 codes indicating the absence of disease. And even the use of these codes to filter the data can be problematic. We once analyzed the distribution of hemoglobin A1c values among outpatients with diabetes at our institution, relying on ICD-9 codes in the database to establish the diagnosis. The prevalence of excellent control was so high that we felt obliged to dig deeper into the data, at which point we discovered that many of the patients were being screened for diabetes rather than having an established diagnosis of diabetes. In other words, an ICD-9 code may not reflect reality, and one must be careful in drawing conclusions, notwithstanding how much data one has.
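The sensitivity of database-derived reference intervals to cohort selection can be shown with a simple nonparametric sketch (entirely synthetic values; a real study would require far more careful partitioning and exclusion criteria): the same percentile calculation yields very different upper limits once a presumed-diseased tail is excluded.

```python
# Nonparametric reference interval: central 95% of the result distribution.
import statistics

def reference_interval(values):
    """Return the 2.5th and 97.5th percentiles of the values."""
    qs = statistics.quantiles(values, n=40)  # cut points every 2.5%
    return qs[0], qs[-1]

# Synthetic results: an outpatient core plus a tail of high inpatient values
# that would inflate an unfiltered interval.
outpatients = [70 + i % 40 for i in range(400)]            # roughly 70-109
all_patients = outpatients + [180 + i for i in range(40)]  # inpatient tail

lo_all, hi_all = reference_interval(all_patients)
lo_out, hi_out = reference_interval(outpatients)
print(hi_all > hi_out)  # True: the unfiltered upper limit is pulled upward
```

The arithmetic is trivial; the hard and consequential choice, as noted above, is which patients belong in the denominator in the first place.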
Albert Chan: “Big data” alone does not guarantee better outcomes. Overwhelming clinicians with unbridled volumes of data makes it more difficult to separate the signal from the noise. To truly realize the benefits, we need to develop better algorithmic and computational approaches to convert “big data” into big insights to find novel opportunities for clinical treatment. With these new approaches, we will be able to empower our clinicians to be better diagnosticians in ways not possible without data.
This year's AACC Society for Young Clinical Laboratorians (SYCL) Workshop, detailing a number of IT solutions in laboratory medicine, inspired this Q&A session. The focus of the 2015 SYCL Workshop was developed collaboratively through the efforts of the 2015 SYCL Workshop and Mixer Planning Committee.
2015 SYCL Workshop Chair: Nicole V. Tolan, PhD, DABCC, Beth Israel Deaconess Medical Center and Harvard Medical School, Department of Pathology and Laboratory Medicine, Director of Clinical Chemistry and POCT.
SYCL AACC Staff Liaison: Michele Horwitz, AACC, Director of Membership.
Lindsay Bazydlo, PhD, DABCC, FACB, University of Virginia, Department of Pathology, Associate Director of Clinical Chemistry and Toxicology, Scientific Director of Coagulation Laboratory.
Erin J. Kaleta, PhD, DABCC, Sonora Quest Laboratories, Clinical Director.
Mark Marzinke, PhD, DABCC, Johns Hopkins School of Medicine, Departments of Pathology and Medicine, Director of Preanalytics and General Chemistry, Pharmacology Analytical Laboratory.
Fred Strathmann, PhD, DABCC (CC, TC), University of Utah and ARUP Laboratories, Department of Pathology, Director of Toxicology and Mass Spectrometry.
11 Nonstandard abbreviations:
- EHR, electronic health record;
- NGS, next-generation sequencing;
- LIS, laboratory information system;
- TAT, turnaround time;
- VUS, variant of uncertain significance;
- SNV, single-nucleotide variant;
- INDEL, insertion or deletion variant;
- CNV, copy number variant;
- CRAN, Comprehensive R Archive Network;
- QA, quality assurance;
- ICD-9, International Classification of Diseases, Ninth Revision;
- NHLBI, National Heart, Lung, and Blood Institute;
- ExAC, Exome Aggregation Consortium;
- PHI, protected health information;
- SYCL, Society for Young Clinical Laboratorians.
Author Contributions: All authors confirmed they have contributed to the intellectual content of this paper and have met the following 3 requirements: (a) significant contributions to the conception and design, acquisition of data, or analysis and interpretation of data; (b) drafting or revising the article for intellectual content; and (c) final approval of the published article.
Authors' Disclosures or Potential Conflicts of Interest: Upon manuscript submission, all authors completed the author disclosure form. Disclosures and/or potential conflicts of interest:
Employment or Leadership: E.W. Klee, Association for Molecular Pathology.
Consultant or Advisory Role: A.S. Chan, AnalyticsMD.
Stock Ownership: A.S. Chan, AnalyticsMD.
Honoraria: G.L. Horowitz, SYCL (presentation at AACC 2015 Meeting).
Research Funding: None declared.
Expert Testimony: None declared.
Patents: None declared.
Other Remuneration: Soft Genetics, royalties received for joint software development.
- Received for publication August 30, 2015.
- Accepted for publication September 10, 2015.
- © 2015 American Association for Clinical Chemistry