Surrogate-assisted feature extraction for high-throughput phenotyping

Yu, Sheng; Chakrabortty, Abhishek; Liao, Katherine P.; Cai, Tianrun; Ananthakrishnan, Ashwin N.; Gainer, Vivian S.; Churchill, Susanne E.; Szolovits, Peter; Murphy, Shawn N.; Kohane, Isaac S.; Cai, Tianxi

Published in

Oxford University Press, JAMIA: A Scholarly Journal of Informatics in Health and Biomedicine, e1(24), p. e143-e149, 2016

DOI: 10.1093/jamia/ocw135

This paper is made freely available by the publisher.

Preprint: archiving allowed

Postprint: archiving restricted

Published version: archiving forbidden

Abstract

Objective: Phenotyping algorithms are capable of accurately identifying patients with specific phenotypes from within electronic medical records systems. However, developing phenotyping algorithms in a scalable way remains a challenge due to the extensive human resources required. This paper introduces a high-throughput unsupervised feature selection method, which improves the robustness and scalability of electronic medical record phenotyping without compromising its accuracy.

Methods: The proposed Surrogate-Assisted Feature Extraction (SAFE) method selects candidate features from a pool of comprehensive medical concepts found in publicly available knowledge sources. The target phenotype’s International Classification of Diseases, Ninth Revision and natural language processing counts, acting as noisy surrogates to the gold-standard labels, are used to create silver-standard labels. Candidate features highly predictive of the silver-standard labels are selected as the final features.

Results: Algorithms were trained to identify patients with coronary artery disease, rheumatoid arthritis, Crohn’s disease, and ulcerative colitis using various numbers of labels to compare the performance of features selected by SAFE, a previously published automated feature extraction for phenotyping procedure, and domain experts. The out-of-sample area under the receiver operating characteristic curve and F-score from SAFE algorithms were remarkably higher than those from the other two, especially at small label sizes.

Conclusion: SAFE advances high-throughput phenotyping methods by automatically selecting a succinct set of informative features for algorithm training, which in turn reduces overfitting and the needed number of gold-standard labels. SAFE also potentially identifies important features missed by automated feature extraction for phenotyping or experts.
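
The selection idea in the Methods can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not state the thresholds or the model used, so the cutoffs (`upper`, `lower`), the correlation filter standing in for "highly predictive", the `min_abs_corr` value, and all feature names and counts below are assumptions made purely for illustration.

```python
from math import sqrt

def silver_labels(surrogate_counts, upper=3, lower=0):
    """Turn noisy surrogate counts into silver-standard labels.

    Assumed rule: counts >= upper become 1, counts <= lower become 0,
    and anything in between is left unlabeled (None).
    """
    labels = []
    for count in surrogate_counts:
        if count >= upper:
            labels.append(1)
        elif count <= lower:
            labels.append(0)
        else:
            labels.append(None)
    return labels

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)

def select_features(features, surrogate_counts, min_abs_corr=0.5):
    """Keep candidate features that track the silver-standard labels.

    A simple correlation filter is used here as a stand-in for a
    predictive model; the threshold is an arbitrary assumption.
    """
    labels = silver_labels(surrogate_counts)
    labeled = [i for i, y in enumerate(labels) if y is not None]
    ys = [labels[i] for i in labeled]
    selected = []
    for name, values in features.items():
        xs = [values[i] for i in labeled]
        if abs(pearson(xs, ys)) >= min_abs_corr:
            selected.append(name)
    return selected

# Hypothetical data: one surrogate count per patient (for example, how often
# the target diagnosis code appears) and per-patient candidate feature counts.
counts = [5, 4, 0, 0, 1, 6, 0, 3]
features = {
    "methotrexate": [2, 3, 0, 0, 1, 4, 0, 1],
    "hypertension": [1, 0, 1, 0, 1, 1, 0, 0],
    "joint_erosion": [1, 1, 0, 0, 0, 2, 0, 1],
}

print(select_features(features, counts))
```

In this toy data the patient with an ambiguous surrogate count of 1 is left out of the silver-standard set, and only the features whose counts rise with the silver label pass the filter. No gold-standard chart review is needed at this stage, which is the point the abstract makes about reducing labeling effort.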
