Cohort design and natural language processing to reduce bias in electronic health records research

Khurshid, Shaan; Reeder, Christopher; Harrington, Lia X.; Singh, Pulkit; Sarma, Gopal; Friedman, Samuel F.; Di Achille, Paolo; Diamant, Nathaniel; Cunningham, Jonathan W.; Turner, Ashby C.; Lau, Emily S.; Haimovich, Julian S.; Al-Alusi, Mostafa A.; Wang, Xin; Klarqvist, Marcus D. R.; Ashburner, Jeffrey M.; Diedrich, Christian; Ghadessi, Mercedeh; Mielke, Johanna; Eilken, Hanna M.; McElhinney, Alice; Derix, Andrea; Atlas, Steven J.; Ellinor, Patrick T.; Philippakis, Anthony A.; Anderson, Christopher D.; Ho, Jennifer E.; Batra, Puneet; Lubitz, Steven A.

Published in

Nature Research, npj Digital Medicine, 1(5), 2022

DOI: 10.1038/s41746-022-00590-0

Journal article published in 2022 by Shaan Khurshid, Christopher Reeder, Lia X. Harrington, Pulkit Singh, Gopal Sarma, Samuel F. Friedman, Paolo Di Achille, Nathaniel Diamant, Jonathan W. Cunningham, Ashby C. Turner, Emily S. Lau, Julian S. Haimovich, Mostafa A. Al-Alusi, Xin Wang, Marcus D. R. Klarqvist and other authors.

This paper is made freely available by the publisher.

Preprint: archiving allowed

Postprint: archiving forbidden

Published version: archiving allowed

Abstract

Electronic health record (EHR) datasets are statistically powerful but are subject to ascertainment bias and missingness. Using the Mass General Brigham multi-institutional EHR, we approximated a community-based cohort by sampling patients receiving longitudinal primary care between 2001 and 2018 (Community Care Cohort Project [C3PO], n = 520,868). We utilized natural language processing (NLP) to recover vital signs from unstructured notes. We assessed the validity of C3PO by deploying established risk models for myocardial infarction/stroke and atrial fibrillation. We then compared C3PO to Convenience Samples including all individuals from the same EHR with complete data, but without a longitudinal primary care requirement. NLP reduced the missingness of vital signs by 31%. NLP-recovered vital signs were highly correlated with values derived from structured fields (Pearson r range 0.95–0.99). Atrial fibrillation and myocardial infarction/stroke incidence were lower and risk models were better calibrated in C3PO than in the Convenience Samples (calibration error range for myocardial infarction/stroke: 0.012–0.030 in C3PO vs. 0.028–0.046 in Convenience Samples; calibration error for atrial fibrillation: 0.028 in C3PO vs. 0.036 in Convenience Samples). Sampling patients receiving regular primary care and using NLP to recover missing data may reduce bias and maximize generalizability of EHR research.
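As a rough illustration of the idea of recovering vital signs from free-text notes and checking them against structured values, the Python sketch below may help. The abstract does not describe the NLP method actually used, so the regular expressions, the function names `extract_vitals` and `pearson_r`, and the sample note are all hypothetical stand-ins, not the authors' pipeline.

```python
import math
import re

# Hypothetical patterns for illustration only; the abstract does not
# specify how the authors' NLP step identifies vital signs.
BP_PATTERN = re.compile(
    r"\b(?:BP|blood pressure)\s*:?\s*(\d{2,3})\s*/\s*(\d{2,3})\b",
    re.IGNORECASE,
)
HR_PATTERN = re.compile(
    r"\b(?:HR|heart rate|pulse)\s*:?\s*(\d{2,3})\b",
    re.IGNORECASE,
)


def extract_vitals(note: str) -> dict[str, int]:
    """Return any blood pressure and heart rate values found in a note."""
    vitals: dict[str, int] = {}
    bp = BP_PATTERN.search(note)
    if bp:
        vitals["systolic"] = int(bp.group(1))
        vitals["diastolic"] = int(bp.group(2))
    hr = HR_PATTERN.search(note)
    if hr:
        vitals["heart_rate"] = int(hr.group(1))
    return vitals


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up note text, not data from the study.
    print(extract_vitals("Vitals: BP 128/76, HR 72, afebrile."))
```

In a validation step like the one the abstract reports, `pearson_r` would be applied to paired values for patients who have both a structured measurement and a text-recovered one.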
