Electronic health records (EHRs) are now ubiquitous in US medical care, owing in part to legislation, enacted as a component of the 2009 American Recovery and Reinvestment Act, that incentivized their adoption.1 These records have been associated with improved medical communication through increased legibility and rapid access to individual patient information. However, the promise of leveraging the rich digital data collected during routine clinical practice for research and to support point-of-care applications is only now beginning to be realized. This progress is enabled by technology that augments traditional abstraction, software that digitizes and interprets unstructured documents, and computational tools that rapidly analyze large data sets. The report by Yuan and colleagues2 highlights the promise and challenges of applying artificial intelligence tools in the interpretation of real-world data derived from EHRs. This research addresses 2 important goals: construction of real-world data cohorts and predictive modeling.
In oncology, a small minority of patients take part in prospective clinical trials of investigational therapies, and those who do tend to be younger, have fewer comorbidities, and be less sociodemographically diverse than the broad population of patients with cancer. Thus, the generalizability of clinical trial results may be questioned. It may be possible to use real-world data from EHRs to explore treatment patterns and associated outcomes among a more representative population. However, many clinical features needed to select a cohort or perform an analysis, including cancer diagnosis, stage, biomarker profile, functional status, treatment, and clinical outcomes, may not be available in structured EHR data with high accuracy or completeness. This has created a need to develop technology-driven approaches for extracting information from unstructured EHR data, such as clinic notes, diagnostic test reports, and other text.
Yuan et al2 demonstrate such an approach using natural language processing in the construction of a cohort of patients with lung cancer. The authors used a semisupervised machine learning algorithm called PheCAP that supplements a small set of criterion standard labeled data with a large amount of unlabeled data to identify eligible patients. Yuan et al next extracted additional variables that were not reliably available from structured data. They used rule-based methods for many variables, such as stage, and they used machine learning for smoking status, for which criterion standard labeled data for model training were available. To evaluate their cohort selection and variable extraction, the authors used variable-specific validation measures, including sensitivity, specificity, and completeness, and added a holistic test of performance by comparing results with data that were previously collected for another epidemiologic study. Overall, Yuan and colleagues found that their constructed cohort, while limited by a sensitivity of 75% for identifying patients with lung cancer at the target specificity of 90%, achieved higher data accuracy and completeness than did use of structured data alone.
This work suggests that it may be possible to construct broad cohorts with a machine learning algorithm but that care must be taken to characterize the performance of the cohort. Given that such cohorts may be the basis for research or quality-improvement initiatives that ultimately affect patients, a question arises about what level of model performance is good enough. An algorithm that identifies patients with 90% specificity may be suitable for certain research questions but not others. When a very rare cohort is being identified for retrospective research or when patients who may be eligible for a prospective clinical trial are identified in real time, it may be more important to calibrate toward sensitivity (ie, minimizing false negatives) rather than specificity (ie, minimizing false positives). For this reason, the close collaboration of clinical and technical experts is essential in cohort design and assessment of fitness for purpose. The importance of training set characteristics is underscored by the fact that fewer than 3% of patients in the Yuan et al cohort were Black; the algorithm performance and downstream analytic results may therefore not generalize. The authors are commended for the detail provided regarding their analysis and the underlying data characteristics. The significant level of missing data for certain variables (eg, Eastern Cooperative Oncology Group [ECOG] performance status, a measure of functional status) and for dates of death suggests the importance of improving the entry of critical structured data elements in EHRs and the ongoing need to link EHRs with other data sources to improve completeness.3
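The choice between calibrating toward specificity and calibrating toward sensitivity can be made concrete with a small sketch. The Python below is illustrative only: the scores and labels are invented toy values rather than data from Yuan et al, and the function names are hypothetical. It picks a classification threshold in 2 ways, first to meet a specificity target while keeping as much sensitivity as possible, then to meet a sensitivity target while keeping as much specificity as possible.

```python
# Toy illustration only: the scores and labels below are made up, not study data.

def sensitivity_specificity(scores, labels, threshold):
    """Return (sensitivity, specificity) when score >= threshold is called positive."""
    tp = fn = tn = fp = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if label == 1 and predicted_positive:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def threshold_for_specificity(scores, labels, target):
    """Lowest observed score whose specificity meets the target.

    Specificity never decreases as the threshold rises, so the first match
    in ascending order is the one that preserves the most sensitivity.
    """
    for t in sorted(set(scores)):
        if sensitivity_specificity(scores, labels, t)[1] >= target:
            return t
    return None  # no observed score reaches the target


def threshold_for_sensitivity(scores, labels, target):
    """Highest observed score whose sensitivity meets the target.

    Sensitivity never increases as the threshold rises, so the first match
    in descending order is the one that preserves the most specificity.
    """
    for t in sorted(set(scores), reverse=True):
        if sensitivity_specificity(scores, labels, t)[0] >= target:
            return t
    return None  # no observed score reaches the target


# Hypothetical model scores (higher = more likely a true case) and chart-review labels.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.35, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

spec_first = threshold_for_specificity(scores, labels, 0.80)
sens_first = threshold_for_sensitivity(scores, labels, 1.00)

for name, t in (("specificity-first", spec_first), ("sensitivity-first", sens_first)):
    sens, spec = sensitivity_specificity(scores, labels, t)
    print(f"{name}: threshold={t:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

In this toy example the 2 strategies select different thresholds from the same model output, which is why the intended use of a cohort needs to be settled before its performance can be judged adequate.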
Yuan and colleagues2 also used EHR-derived data to inform a machine learning–based model to predict 5-year survival in their lung cancer cohort. Prognostic models have myriad potential applications in research and clinical care. A clinician-assigned measure of clinical function (eg, ECOG performance status) is commonly used as an eligibility criterion, a stratification factor in clinical trial design, and a covariate in multivariable analyses. A more objective measure based on factors commonly available in EHRs, and perhaps from emerging ambulatory digital tools, may provide greater clinical discrimination. This may be especially beneficial in settings where simple measures of functional status alone are inadequate predictors of outcome (eg, among older patients).4
Prediction models are of great interest in the clinical setting, where preventing an adverse outcome, such as a severe treatment side effect, early rehospitalization after discharge, or an emergency department visit, may be possible with more aggressive high-touch outpatient interventions. This opportunity is increasingly recognized, especially given that future payment models in oncology are anticipated to include a greater focus on value-based care. In this context, assigning categories of risk requires a tradeoff between sensitivity and specificity (or positive predictive value). The appropriate tradeoff may depend on the underlying prevalence of the outcome in the population targeted by a particular intervention (eg, the population may consist of patients at high risk of emergency department visits) and on the resources available to intervene (eg, nursing support). Thus, in the clinical arena, traditional performance metrics like area under the curve (AUC) must be translated into clinically relevant measures, such as positive predictive value at a certain target sensitivity, to clarify the patient-centered value of the model. Predictive algorithms may also require regulatory approval under certain circumstances.
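The dependence of positive predictive value on prevalence follows directly from Bayes' theorem and can be sketched briefly. In the Python below, the operating point (80% sensitivity, 85% specificity) and the prevalence values are hypothetical numbers chosen for illustration, not results from any model discussed here, and the function names are likewise assumptions. The sketch also reports how many patients per 1000 would be flagged, as a rough proxy for the intervention workload.

```python
# Illustrative sketch: all numeric inputs below are hypothetical.

def positive_predictive_value(sensitivity, specificity, prevalence):
    """PPV via Bayes' theorem: true positives / all positives."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1.0 - specificity) * (1.0 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


def flagged_per_1000(sensitivity, specificity, prevalence):
    """Expected number of patients flagged as high risk per 1000 screened."""
    flagged_fraction = (sensitivity * prevalence
                        + (1.0 - specificity) * (1.0 - prevalence))
    return 1000.0 * flagged_fraction


# A single fixed operating point, evaluated in populations with different
# underlying rates of the adverse outcome.
SENSITIVITY = 0.80
SPECIFICITY = 0.85

for prevalence in (0.05, 0.10, 0.30):
    ppv = positive_predictive_value(SENSITIVITY, SPECIFICITY, prevalence)
    flagged = flagged_per_1000(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"prevalence={prevalence:.2f}  PPV={ppv:.3f}  flagged per 1000={flagged:.0f}")
```

Under these assumptions, the same sensitivity and specificity yield a much lower positive predictive value when the outcome is uncommon, so a model with a fixed AUC may be worth acting on in one clinic population and not in another.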
The application of artificial intelligence in medical care has lagged behind its use in finance, advertising, and other consumer industries. This contrast is associated, in part, with the high stakes involved in developing tools that will ultimately affect patients. Given the expanding evidence gaps in oncology and the growing complexity of medical decisions, the imperative to apply available technologies has never been greater. In this context, careful consideration must be given to model development and scientific validation.5,6 Appropriate large-scale training data and rigorous downstream validation, with transparency to permit reproducibility, may provide researchers the ability to use machine-based variables in appropriate clinical settings. In addition, explainability of model features may be required if broad adoption by nontechnical clinical users is expected. The true promise of machine-based approaches is in enabling a learning health care system in which patient data are used for research and clinical applications and in which evolving care patterns and outcomes measurements are incorporated in a continuous feedback loop.7 Success demands a broad recognition of the importance of high-quality data collection, data standards, and the benefits of data sharing for patients and public health.