Affiliations: (1) Duke University; (2) Massachusetts Institute of Technology

EHRmonize: A Framework for Medical Concept Abstraction from Electronic Health Records using Large Language Models

João Matos (1), Jack Gallifant (2), Jian Pei (1), A. Ian Wong (1; corresponding author, a.ian.wong@duke.edu)
Abstract

Electronic health records (EHRs) contain vast amounts of complex data, but harmonizing and processing this information remains a challenging and costly task requiring significant clinical expertise. While large language models (LLMs) have shown promise in various healthcare applications, their potential for abstracting medical concepts from EHRs remains largely unexplored. We introduce EHRmonize, a framework leveraging LLMs to abstract medical concepts from EHR data. Our study uses medication data from two real-world EHR databases to evaluate five LLMs on two free-text extraction and six binary classification tasks across various prompting strategies. GPT-4o with 10-shot prompting achieved the highest performance in all tasks, matched by Claude-3.5-Sonnet in a subset of tasks. GPT-4o achieved an accuracy of 97% in identifying generic route names, 82% for generic drug names, and 100% in performing binary classification of antibiotics. While EHRmonize significantly enhances efficiency, reducing annotation time by an estimated 60%, we emphasize that clinician oversight remains essential. Our framework, available as a Python package (package on PyPI, repository on GitHub, and documentation on ReadTheDocs), offers a promising tool to assist clinicians in EHR data abstraction, potentially accelerating healthcare research and improving data harmonization processes.

Keywords: Large Language Models · Electronic Health Records · Chart Abstraction of Medical Concepts · EHRmonize

1 Introduction

The development of machine learning models in healthcare critically depends on large-scale, high-quality data. Electronic Health Records (EHRs) offer a rich source of such data, encompassing structured information generated during routine clinical practice, including vital signs, laboratory values, and clinical interventions [20]. However, the full potential of EHR data remains largely untapped due to significant challenges in data processing. A primary obstacle in leveraging EHR data is the substantial variability in recording practices, both between and within hospital systems [14]. This variability manifests in several ways:

  • Inconsistent Terminology: The same medical concept may be recorded differently across institutions or even within a single hospital. For example, in medication data, "dextrose 5%" is an intravenous fluid for volume expansion (and the same as "D5W" and the normalized RxNorm concept "glucose 50 mg/ml"), but is different from "D50" and "magnesium sulfate 1 g in d5w", which are medical therapies [16].

  • Local Coding Systems: Many healthcare institutions use local coding systems, making it difficult to compare/aggregate data across different sources.

  • Evolving Standards: As medical knowledge and practices evolve, so do the terminologies and coding systems used in EHRs, further complicating long-term data harmonization efforts.

Current approaches to addressing these challenges often rely on manual abstraction (i.e., cleaning, categorization, and/or summarization) of concepts and chart review [21, 24]. However, these methods are time-consuming, labor-intensive, and prone to errors [2, 13, 18]. Moreover, the expertise required for accurate data abstraction is not always available, limiting the accessibility of EHR data for many researchers and potentially hindering progress in healthcare research [22]. Although the body of literature is growing, most papers are not reproducible because the underlying codebases are not always shared [12].

Large Language Models (LLMs) have emerged as a promising technology with the potential to revolutionize various aspects of medicine [15, 4]. Their ability to understand and generate human-like text has shown promise in tasks such as note summarization, clinical decision support, and medical education [19]. Furthermore, LLMs have demonstrated significant encoded medical knowledge, as evidenced by their performance on medical question-answering benchmarks [10].

Given these capabilities, we hypothesize that LLMs can significantly improve workflow efficiency in abstracting medical concepts from EHR data. By automating the categorization and harmonization of EHR entries, LLMs could potentially address many of the challenges associated with EHR data processing, thereby lowering barriers to entry for researchers and enabling more widespread use of EHR data in healthcare research and analytics.

In this paper, we introduce EHRmonize, a novel framework that leverages the power of LLMs to automate the cleaning and categorization of medical concepts in EHR data. Our work makes the following key contributions:

  • LLM-based EHR Data Harmonization: We present a novel approach to using LLMs for abstracting medical concepts in EHR data, addressing the critical need for efficient, scalable data harmonization methods.

  • Curated Dataset: We provide a curated dataset of medication data from MIMIC-IV [11] and eICU-CRD [17], enabling reproducibility of our findings and facilitating further research in this domain. This labeled dataset is made publicly available on HuggingFace.

  • Comprehensive Evaluation: We conduct an extensive evaluation of five state-of-the-art LLMs across various prompting strategies, encompassing two free-text tasks and six binary classification tasks. This evaluation provides insights into the capabilities and limitations of different LLMs in EHR data processing tasks.

  • Open-Source Implementation: We release EHRmonize as an open-source PyPI package, implementing the use cases explored in this study and providing customizable modules for further applications. This contribution aims to foster collaboration and accelerate progress in the field of EHR data science.

By developing tools that automate the categorization and harmonization of EHR entries, EHRmonize aims to address critical challenges in EHR data processing, lower barriers to entry for researchers, and ultimately enable more widespread and efficient use of EHR data in healthcare research and analytics. In the following sections, we discuss related work, detail our methodology, present our findings, and discuss this work’s implications and future directions.

2 Related Work

The challenge of harmonizing and extracting meaningful information from EHRs has been addressed through various approaches over the years. Traditional methods have included rule-based systems using hard-coded queries for automated data abstraction [20] and cascading architectures for complex classification tasks [5]. While effective for specific use cases, these approaches often lack flexibility and require significant effort to maintain as medical terminologies evolve.

Natural Language Processing (NLP) techniques have been widely applied to unstructured EHR data, with Named Entity Recognition (NER) being a key focus. Ahmad et al. [1] and Durango et al. [6] provide comprehensive reviews of NER techniques applied to clinical text, highlighting successes in identifying medical concepts despite linguistic variability challenges. However, these approaches often struggle with the variability of medical terminology across different EHR systems and require extensive manual input or task-specific fine-tuning, limiting their scalability and generalizability.

Efforts to standardize medical concepts have led to the development of tools like RxNorm [23, 16]. While these tools have made significant strides in concept matching across vocabularies, they often require extensive manual review, limiting their scalability. LLMs have opened new avenues for processing extensive medical texts at unprecedented speeds. Liu et al. [15] and Chen et al. [4] discuss the broader potential of LLMs to revolutionize various aspects of healthcare, from clinical decision support to medical education.

However, the application of LLMs in healthcare is not without challenges. Recent studies have highlighted concerns regarding the faithfulness [9] and bias [3] of LLMs in medical contexts. In the domain of medication information processing, Gallifant et al. [8] demonstrated high performance in matching drug brand and generic terms using various LLMs, with GPT-4 achieving near-perfect accuracy. Nevertheless, their work also revealed limitations in handling more complex aspects of medication nomenclature. These findings underscore the need for specialized tools to manage the intricacies of medical drug data, which are crucial for developing comprehensive frameworks for AI-enabled pharmacovigilance and data harmonization [7].

EHRmonize addresses a critical gap in this landscape by focusing on abstraction rather than mere extraction or labeling. We leverage LLMs for automated EHR data harmonization, aiming to capture and standardize higher-level concepts across diverse EHR systems. This approach combines the flexibility of machine learning with the nuanced understanding of medical language demonstrated by LLMs, potentially offering a more scalable and adaptable solution to the challenges of EHR data harmonization.

3 Methods

EHRmonize facilitates EHR data harmonization, addressing multidisciplinary collaboration challenges between data scientists and clinicians (Figure 1). It comprises two components: corpus generation (SQL-based extraction of relevant text/concepts from EHR databases) and LLM inference (conversion of raw input to standardized classes via few-shot prompting) (Figure 2).

Figure 1: Example workflow and challenges in multidisciplinary clinical data science.

Context: A multidisciplinary clinical data science team is working with EHRs. The clinicians agreed to include the patients on antibiotics, but exclude patients on anticoagulants. It is necessary to abstract medication and route names as "antibiotics" or "anticoagulants".
Current Workflow, by role:
1. Data Scientist: Queries unique medication names from EHRs and sends them to clinician.
2. Clinician: Maps medications into predefined classes and returns to the data scientist.
Challenge: Step 2 may involve manual labeling of thousands of entries with the help of a clinical expert, which not all teams have access to.
User Story: As a data scientist working with EHR data, I want to automatically abstract medical concepts as a first pass, so that collaboration with clinical experts is more efficient.

Figure 2: Overall workflow of EHRmonize. Corpus generation from EHRs provides the data that needs categorization, across different domains and tasks, which is then fed to our package that employs LLMs to categorize the entries into predefined classes.

Tasks: We defined two task types: (1) free-text extraction of generic routes and drug names from raw entries, and (2) binary classification of (drug, route) pairs as antibiotic, anticoagulant, electrolytes, IV fluid, opioid analgesic, or stress ulcer prophylaxis. Data sources were MIMIC-IV [11] and eICU-CRD [17]. Preprocessing involved SQL extraction of unique drug-route pairs, selection of top 200 prevalent entries per task, and manual labeling by a physician (AIW).
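The corpus-generation step described above can be sketched as follows. This is a minimal illustration rather than the queries used in the study: the `prescriptions` table, its columns, the toy rows, and the `top_pairs` helper are all hypothetical, and an in-memory SQLite database stands in for MIMIC-IV or eICU-CRD.

```python
import sqlite3

# Hypothetical stand-in for an EHR medication table (not the real schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prescriptions (drug TEXT, route TEXT)")
rows = [
    ("Vancomycin 1 g", "IV"),
    ("Vancomycin 1 g", "IV"),
    ("Vancomycin 1 g", "IV"),
    ("NS 0.9%", "IV"),
    ("NS 0.9%", "IV"),
    ("Heparin", "SC"),
]
conn.executemany("INSERT INTO prescriptions VALUES (?, ?)", rows)


def top_pairs(connection, limit):
    """Return the `limit` most frequent unique (drug, route) pairs."""
    query = """
        SELECT drug, route, COUNT(*) AS n
        FROM prescriptions
        GROUP BY drug, route
        ORDER BY n DESC, drug ASC, route ASC
        LIMIT ?
    """
    return connection.execute(query, (limit,)).fetchall()


# The study kept the top 200 entries; 2 is used here for the toy data.
corpus = top_pairs(conn, 2)
print(corpus)
```

Grouping by both columns yields unique drug-route pairs, and ordering by count keeps the most prevalent ones; the secondary sort keys only make ties deterministic.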

Labeling:

  • Generic drug names: Free-text drug names were translated to the lowercase generic name, matching either the clinical drug component, precise ingredient, or ingredient in RxNorm, a National Library of Medicine system to normalize medications [16]. Salt names (e.g., "hydromorphone hydrochloride" to "hydromorphone") were not included unless the active ingredient was shared across multiple salts (e.g., "metoprolol tartrate" vs. "metoprolol succinate"). Prescription strengths were not included (e.g., "hydromorphone hydrochloride 1mg" to "hydromorphone"). Concentrations were included for intravenous fluids and dextrose to disambiguate a precise drug (e.g., "normal saline 0.9%" to "sodium chloride 9 mg/ml"; "dextrose 50%" to "glucose 500 mg/ml"). Medications with significant combinations (e.g., "pneumococcal 23-valent polysaccharide vaccine") kept all common RxNorm components but did not include valence.

  • Generic routes: Entries were transformed to the lowercase (no abbreviations) RxNorm classification (e.g., "IV" to "injectable product"; "PO/NG" to "oral product").

  • Binary classifications: Six classes were one-hot encoded.
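Two of these labeling conventions reduce to simple arithmetic and encoding, sketched below under stated assumptions. A percent weight/volume concentration is grams per 100 mL, so multiplying by 10 gives mg/mL, which reproduces the saline and dextrose examples above. The helper names, the class spellings in `CLASSES`, and the reading of one-hot encoding as one 0/1 indicator per class are illustrative assumptions, not the labeling code used in the study.

```python
# Hypothetical class spellings, taken from the six binary tasks.
CLASSES = [
    "antibiotic",
    "anticoagulant",
    "electrolytes",
    "iv fluid",
    "opioid analgesic",
    "stress ulcer prophylaxis",
]


def concentration_label(ingredient, percent):
    """Render a % (w/v) concentration as '<ingredient> <x> mg/ml'.

    1% w/v = 1 g per 100 mL = 1000 mg per 100 mL = 10 mg/mL.
    """
    mg_per_ml = round(percent * 10, 6)  # rounding guards against float noise
    return f"{ingredient} {mg_per_ml:g} mg/ml"


def one_hot(positive_classes):
    """Return one 0/1 indicator per class, in the order of CLASSES."""
    return [1 if name in positive_classes else 0 for name in CLASSES]


print(concentration_label("sodium chloride", 0.9))
print(concentration_label("glucose", 50))
print(one_hot({"iv fluid"}))
```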

Prompting: Each prompt comprised a specific task description instructing the model to act as an experienced clinician, the expected output format, and a definition of the target class. When few-shot prompting was used, a few representative examples were also provided (Figure 3).

LLMs: We assessed five models from four families: Anthropic’s Claude-3.5-Sonnet; Meta’s Llama3-70B; Mistral’s Mixtral-8x7B (via AWS’s Bedrock API); and OpenAI’s GPT-3.5-Turbo and GPT-4o. These models were selected for their strong performance on medical and non-medical benchmarks and for their cost-efficiency.

Experiments: In addition to comparing the five models, we explored different temperatures (0, 0.2, 0.5) and 0- to 10-shot prompting. As the objective of EHRmonize is to improve efficiency in data cleaning, we recorded and compared the time needed for manual annotation with the time needed to review EHRmonize’s output.

Figure 3: One-shot prompting example for the "IV fluid" binary classification task.

You are a well trained clinician doing data cleaning and harmonization. You are given a raw drug name and administration route out of the EHR data below, within square brackets such as [drugname, route]. Please output "1" if ["normal saline", "IV"] is classified as "IV fluid", otherwise "0". "IV fluid" means "intravenous fluid given for the purpose of volume expansion". Consider the following example: An input drug name "sodium chloride 0.9%" and route "IV" would be classified as "1". Please output nothing more than "1" or "0".
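A prompt of the shape shown in Figure 3 could be assembled programmatically along the following lines. This is a sketch only: `build_prompt` and `parse_binary` are hypothetical helpers written for illustration and are not the EHRmonize package API; the wording of the template is copied from Figure 3, and the example tuples are supplied by the caller.

```python
def build_prompt(drug, route, class_name, definition, examples=()):
    """Assemble a binary-classification prompt in the style of Figure 3.

    `examples` is a sequence of (drug, route, label) tuples; an empty
    sequence yields a zero-shot prompt.
    """
    parts = [
        "You are a well trained clinician doing data cleaning and harmonization.",
        "You are given a raw drug name and administration route out of the EHR "
        "data below, within square brackets such as [drugname, route].",
        f'Please output "1" if ["{drug}", "{route}"] is classified as '
        f'"{class_name}", otherwise "0".',
        f'"{class_name}" means "{definition}".',
    ]
    for ex_drug, ex_route, ex_label in examples:
        parts.append(
            f'Consider the following example: An input drug name "{ex_drug}" '
            f'and route "{ex_route}" would be classified as "{ex_label}".'
        )
    parts.append('Please output nothing more than "1" or "0".')
    return " ".join(parts)


def parse_binary(raw_output):
    """Map a model reply to 1 or 0; return None if it is neither."""
    cleaned = raw_output.strip().strip('"')
    return {"1": 1, "0": 0}.get(cleaned)


prompt = build_prompt(
    "normal saline",
    "IV",
    "IV fluid",
    "intravenous fluid given for the purpose of volume expansion",
    examples=[("sodium chloride 0.9%", "IV", 1)],
)
print(prompt)
```

Returning `None` for any reply other than "1" or "0" keeps malformed outputs visible for clinician review instead of silently coercing them to a class.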

4 Results

We labeled 398 entries from 14,604 and 8,803 unique medication-route pairs in eICU-CRD and MIMIC-IV databases, respectively (Table 1).

Table 1: Characteristics of the medication entries in the labeled dataset.
               Free-Text (#Unique)          Binary Tasks (#Positive)
Database  N    Generic Route  Generic Name  Antibiotic  Anticoagulant  Electrolytes  IV Fluids  Opioid Analgesic  Stress ulcer prophylaxis
MIMIC-IV  198  6              83            8           13             17            22         12                8
eICU-CRD  200  5              50            5           7              24            28         22                8

Model Performance: GPT-4o consistently outperformed other models, achieving an F1-score of 1.00 for antibiotic classification and 0.97 accuracy for route identification. Claude-3.5-Sonnet matched GPT-4o’s performance in several binary classification tasks. GPT-3.5-Turbo, Llama3 70B, and Mixtral 8x7B showed lower performance (Figure 4). Generic drug name extraction proved challenging for all models, with GPT-4o achieving 0.82 accuracy.
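For reference, the two metrics reported here can be computed as in the sketch below. It assumes that free-text outputs are scored by exact string match after trimming and lowercasing, and that F1 is computed on the positive class; the text does not spell out either detail, and the toy predictions are invented for illustration.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of free-text predictions that equal the reference label."""
    matches = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return matches / len(references)


def f1_score(predictions, references):
    """F1 on the positive class for 0/1 predictions and references."""
    tp = sum(p == 1 and r == 1 for p, r in zip(predictions, references))
    fp = sum(p == 1 and r == 0 for p, r in zip(predictions, references))
    fn = sum(p == 0 and r == 1 for p, r in zip(predictions, references))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for illustration only.
free_text_acc = exact_match_accuracy(
    ["hydromorphone", "glucose 50 mg/ml", "heparin"],
    ["hydromorphone", "glucose 500 mg/ml", "heparin"],
)
binary_f1 = f1_score([1, 0, 1, 1], [1, 0, 0, 1])
print(round(free_text_acc, 3), round(binary_f1, 3))
```

Under exact matching, a near miss such as "glucose 50 mg/ml" for "glucose 500 mg/ml" counts as an error, which is one reason free-text extraction scores lower than binary classification.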

N-shot Prompting: GPT-4o and Claude-3.5-Sonnet exhibited stable, high performance with increasing examples. Unexpectedly, GPT-3.5-Turbo’s performance declined as the number of examples increased, particularly in antibiotic, anticoagulant, and opioid analgesic tasks. Llama3 70B and Mixtral 8x7B maintained intermediate, relatively stable performance (Figure 5).

Temperature Impact: Variations in temperature up to 0.5 had minimal impact on model performance across tasks.

Efficiency Gains: EHRmonize significantly reduced annotation time, with savings of 67.9% for MIMIC-IV and 60.4% for eICU-CRD (Table 2).

Figure 4: LLM 10-shot performance across tasks and temperatures (398 samples).
Figure 5: LLM performance (temp. 0.2) with varying N-shot prompting across tasks.
Table 2: Time (minutes:seconds) spent in data annotation and in EHRmonize output review, using GPT-4o with 10-shot prompting.
              MIMIC-IV                          eICU-CRD
Task Type     Generic  Route  Binary  Total     Generic  Route  Binary  Total
Annotation    6:03     3:56   5:53    15:52     4:46     3:40   8:31    16:57
Revision      2:02     0:37   2:27    5:06      2:37     1:11   2:55    6:43
Corrections   10/100   2/100  1/600   13/800    22/100   3/100  1/600   26/800
Savings (%)   66.4%    84.3%  58.4%   67.9%     45.1%    67.7%  65.8%   60.4%
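The savings row is consistent with computing one minus the ratio of revision time to annotation time, with both times converted to seconds. The short sketch below reproduces the two totals; the helper names are illustrative, and the formula is inferred from the table values rather than stated in the text.

```python
def to_seconds(mmss):
    """Convert a 'minutes:seconds' string such as '15:52' to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def savings_percent(annotation, revision):
    """Percentage of annotation time saved by reviewing instead."""
    return 100 * (1 - to_seconds(revision) / to_seconds(annotation))


mimic_total = savings_percent("15:52", "5:06")
eicu_total = savings_percent("16:57", "6:43")
print(f"MIMIC-IV: {mimic_total:.1f}%")  # 67.9%
print(f"eICU-CRD: {eicu_total:.1f}%")   # 60.4%
```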

5 Conclusion and Discussion

EHRmonize demonstrates the potential of LLMs to abstract medical concepts from structured EHR data across multiple classification tasks. The framework yielded significant efficiency gains, reducing annotation time by approximately 60%. This underscores EHRmonize’s potential to enhance, rather than replace, manual chart review by prepopulating options and allowing clinicians to focus on more complex abstraction tasks.

Several limitations of this study warrant consideration. The dataset, while curated by a well-trained physician and supported by RxNorm materials, is limited in size (398 samples) and focused solely on medication data. Future research should aim to expand the dataset’s volume and scope, incorporating other domains such as laboratory results and flowsheet data. Additionally, the current approach to N-shot example selection was deterministic; exploring the impact of example ordering could yield valuable insights into prompt engineering for medical NLP tasks.

Further avenues for improvement include incorporating semantic equivalence in free-text evaluation, implementing batching for enhanced efficiency, exploring retrieval-augmented generation (RAG) methods to extend N-shot examples, and investigating fine-tuning strategies for task-specific optimization. The potential of agentic approaches in managing abstraction workflows and ensuring consistency across outputs also merits exploration. Finally, regular evaluation on periodic data could facilitate the identification of concept drift, allowing the tool to adapt to evolving medical practices and terminologies.

Prospect of application: EHRmonize, now available as a Python package on PyPI, represents a significant step towards lowering barriers in EHR data research. Improving abstraction efficiency for structured data fields—a task often performed manually—has the potential to accelerate research and enable more comprehensive analyses of EHR data.

Disclosure of Interests: AIW has received funding from NIMHD under U54MD012530. AIW has received support from AWS and CloudForce. All other authors have no competing interests.

References

  • [1] Ahmad, P.N., Shah, A.M., Lee, K.: A review on electronic health record text-mining for biomedical name entity recognition in healthcare domain. In: Healthcare. vol. 11, p. 1268. MDPI (2023)
  • [2] Byrne, M.D., Jordan, T., Welle, T.: Comparison of manual versus automated data collection method for an evidence-based nursing practice study. Applied Clinical Informatics 4(01), 61–74 (2013)
  • [3] Chen, S., Gallifant, J., Gao, M., Moreira, P., Munch, N., Muthukkumar, A., Rajan, A., Kolluri, J., Fiske, A., Hastings, J., Aerts, H., Anthony, B., Celi, L.A., Cava, W.G.L., Bitterman, D.S.: Cross-care: Assessing the healthcare implications of pre-training data on language model bias (2024), https://arxiv.org/abs/2405.05506
  • [4] Chen, Y., Liu, C., Huang, W., Cheng, S., Arcucci, R., Xiong, Z.: Generative text-guided 3d vision-language pretraining for unified medical image segmentation. arXiv preprint arXiv:2306.04811 (2023)
  • [5] Dai, H.J., Su, C.H., Wu, C.S.: Adverse drug event and medication extraction in electronic health records via a cascading architecture with different sequence labeling models and word embeddings. Journal of the American Medical Informatics Association 27(1), 47–55 (2020)
  • [6] Durango, M.C., Torres-Silva, E.A., Orozco-Duque, A.: Named entity recognition in electronic health records: A methodological review. Healthcare Informatics Research 29(4),  286 (2023)
  • [7] Gallifant, J., Celi, L.A., Sharon, E., Bitterman, D.S.: Navigating the complexities of artificial intelligence–enabled real-world data collection for oncology pharmacovigilance (2024)
  • [8] Gallifant, J., Chen, S., Moreira, P., Munch, N., Gao, M., Pond, J., Celi, L.A., Aerts, H., Hartvigsen, T., Bitterman, D.: Language models are surprisingly fragile to drug names in biomedical benchmarks (2024)
  • [9] Han, T., Kumar, A., Agarwal, C., Lakkaraju, H.: Towards safe and aligned large language models for medicine. arXiv preprint arXiv:2403.03744 (2024)
  • [10] Jin, D., Pan, E., Oufattole, N., Weng, W., Fang, H., Szolovits, P.: What disease does this patient have? A large-scale open domain question answering dataset from medical exams. CoRR abs/2009.13081 (2020), https://arxiv.org/abs/2009.13081
  • [11] Johnson, A.E.W., Bulgarelli, L., Shen, L., Gayles, A., Shammout, A., Horng, S., Pollard, T.J., Hao, S., Moody, B., Gow, B., Lehman, L.w.H., Celi, L.A., Mark, R.G.: MIMIC-IV, a freely accessible electronic health record dataset. Scientific Data 10(1),  1 (Jan 2023). https://doi.org/10.1038/s41597-022-01899-x, https://www.nature.com/articles/s41597-022-01899-x
  • [12] Johnson, A.E.W., Pollard, T.J., Mark, R.G.: Reproducibility in critical care: a mortality prediction case study. In: Doshi-Velez, F., Fackler, J., Kale, D., Ranganath, R., Wallace, B., Wiens, J. (eds.) Proceedings of the 2nd Machine Learning for Healthcare Conference. Proceedings of Machine Learning Research, vol. 68, pp. 361–376. PMLR (18–19 Aug 2017), https://proceedings.mlr.press/v68/johnson17a.html
  • [13] Lan, H., Thongprayoon, C., Ahmed, A., Herasevich, V., Sampathkumar, P., Gajic, O., O’Horo, J.C.: Automating quality metrics in the era of electronic medical records: digital signatures for ventilator bundle compliance. BioMed Research International 2015(1), 396508 (2015)
  • [14] Lester, C.A., Flynn, A.J., Marshall, V.D., Rochowiak, S., Rowell, B., Bagian, J.P.: Comparing the variability of ingredient, strength, and dose form information from electronic prescriptions with rxnorm drug product descriptions. Journal of the American Medical Informatics Association 29(9), 1471–1479 (2022)
  • [15] Liu, Y., Han, T., Ma, S., Zhang, J., Yang, Y., Tian, J., He, H., Li, A., He, M., Liu, Z., Wu, Z., Zhao, L., Zhu, D., Li, X., Qiang, N., Shen, D., Liu, T., Ge, B.: Summary of ChatGPT-Related research and perspective towards the future of large language models. Meta-Radiology 1(2), 100017 (Sep 2023). https://doi.org/10.1016/j.metrad.2023.100017, https://www.sciencedirect.com/science/article/pii/S2950162823000176
  • [16] National Library of Medicine: RxNorm technical documentation. https://www.nlm.nih.gov/research/umls/rxnorm/docs/index.html, accessed: 2024-06-23
  • [17] Pollard, T.J., Johnson, A.E.W., Raffa, J.D., Celi, L.A., Mark, R.G., Badawi, O.: The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Scientific Data 5(1), 180178 (Sep 2018). https://doi.org/10.1038/sdata.2018.178, https://www.nature.com/articles/sdata2018178
  • [18] Sauer, C.M., Chen, L.C., Hyland, S.L., Girbes, A., Elbers, P., Celi, L.A.: Leveraging electronic health records for data science: common pitfalls and how to avoid them. The Lancet Digital Health 4(12), E893–E898 (Dec 2022). https://doi.org/10.1016/S2589-7500(22)00154-6
  • [19] Thirunavukarasu, A.J., Ting, D.S.J., Elangovan, K., Gutierrez, L., Tan, T.F., Ting, D.S.W.: Large language models in medicine. Nature Medicine 29(8), 1930–1940 (Aug 2023). https://doi.org/10.1038/s41591-023-02448-8, https://www.nature.com/articles/s41591-023-02448-8
  • [20] Valencia Morales, D.J., Bansal, V., Heavner, S.F., Castro, J.C., Sharma, M., Tekin, A., Bogojevic, M., Zec, S., Sharma, N., Cartin-Ceba, R., et al.: Validation of automated data abstraction for sccm discovery virus covid-19 registry: practical ehr export pathways (virus-peep). Frontiers in Medicine 10, 1089087 (2023)
  • [21] Vassar, M., Matthew, H.: The retrospective chart review: important methodological considerations. Journal of educational evaluation for health professions 10 (2013)
  • [22] Wang, S., McDermott, M.B.A., Chauhan, G., Hughes, M.C., Naumann, T., Ghassemi, M.: Mimic-extract: A data extraction, preprocessing, and representation pipeline for MIMIC-III. CoRR abs/1907.08322 (2019), http://arxiv.org/abs/1907.08322
  • [23] Waters, R., Malecki, S., Lail, S., Mak, D., Saha, S., Jung, H.Y., Imrit, M.A., Razak, F., Verma, A.A.: Automated identification of unstandardized medication data: a scalable and flexible data standardization pipeline using rxnorm on gemini multicenter hospital data. JAMIA Open 6(3), ooad062 (October 2023). https://doi.org/10.1093/jamiaopen/ooad062
  • [24] Yin, A.L., Guo, W.L., Sholle, E.T., Rajan, M., Alshak, M.N., Choi, J.J., Goyal, P., Jabri, A., Li, H.A., Pinheiro, L.C., et al.: Comparing automated vs. manual data collection for covid-specific medications from electronic health records. International Journal of Medical Informatics 157, 104622 (2022)