Use of a support vector machine for categorizing free-text notes: assessment of accuracy across two institutions

Henkin, Stanislav; McCoy, Allison B.; Wright, Adam; Kale, Abhivyakti; Sittig, Dean F.

Published in

Oxford University Press, JAMIA: A Scholarly Journal of Informatics in Health and Biomedicine, 20(5), p. 887-890, 2013

DOI: 10.1136/amiajnl-2012-001576

Journal article published in 2013 by Stanislav Henkin, Allison B. McCoy, Adam Wright, Abhivyakti Kale, Dean F. Sittig

Abstract

BACKGROUND: Electronic health record (EHR) users must regularly review large amounts of data in order to make informed clinical decisions, and such review is time-consuming and often overwhelming. Technologies like automated summarization tools, EHR search engines and natural language processing have been shown to help clinicians manage this information.

OBJECTIVE: To develop a support vector machine (SVM)-based system for identifying EHR progress notes pertaining to diabetes, and to validate it at two institutions.

MATERIALS AND METHODS: We retrieved 2000 EHR progress notes from patients with diabetes at the Brigham and Women's Hospital (1000 for training and 1000 for testing) and another 1000 notes from the University of Texas Physicians (for validation). We manually annotated all notes and trained a SVM using a bag of words approach. We then used the SVM on the testing and validation sets and evaluated its performance with the area under the curve (AUC) and F statistics.

RESULTS: The model accurately identified diabetes-related notes in both the Brigham and Women's Hospital testing set (AUC=0.956, F=0.934) and the external University of Texas Faculty Physicians validation set (AUC=0.947, F=0.935).

DISCUSSION: Overall, the model we developed was quite accurate. Furthermore, it generalized, without loss of accuracy, to another institution with a different EHR and a distinct patient and provider population.

CONCLUSIONS: It is possible to use a SVM-based classifier to identify EHR progress notes pertaining to diabetes, and the model generalizes well.
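The pipeline named in the abstract (bag of words features, a linear SVM, evaluation by AUC and an F measure) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the abstract does not name their software, kernel, or tokenization, so the whitespace tokenizer, the hinge-loss subgradient trainer without a bias term, the hyperparameters, the reading of "F" as the F1 measure, and the toy notes below are all assumptions made for illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Turn a note into token counts (lowercased, whitespace-split)."""
    return Counter(text.lower().split())

def score(weights, features):
    """Linear decision value: dot product of weights and token counts."""
    return sum(weights.get(tok, 0.0) * n for tok, n in features.items())

def train_linear_svm(docs, labels, epochs=20, lr=0.1, lam=0.01):
    """Hinge-loss subgradient descent; labels are +1 / -1.
    The bias term is omitted to keep the sketch short (an assumption)."""
    weights = {}
    for _ in range(epochs):
        for features, y in zip(docs, labels):
            margin = y * score(weights, features)
            for tok in weights:            # L2 regularization: shrink weights
                weights[tok] *= 1.0 - lr * lam
            if margin < 1:                 # margin violated: move toward label
                for tok, n in features.items():
                    weights[tok] = weights.get(tok, 0.0) + lr * y * n
    return weights

def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1(preds, labels):
    """Harmonic mean of precision and recall for the +1 class."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y != 1 for p, y in zip(preds, labels))
    fn = sum(p != 1 and y == 1 for p, y in zip(preds, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Invented toy notes: +1 = diabetes-related, -1 = not diabetes-related.
train_notes = [("glucose insulin a1c", 1), ("knee pain xray", -1),
               ("insulin dose adjusted", 1), ("cough fever", -1)]
test_notes = [("insulin and glucose checked", 1), ("fever with cough", -1),
              ("a1c reviewed", 1), ("knee xray ordered", -1)]

weights = train_linear_svm([bag_of_words(t) for t, _ in train_notes],
                           [y for _, y in train_notes])
test_labels = [y for _, y in test_notes]
test_scores = [score(weights, bag_of_words(t)) for t, _ in test_notes]
test_preds = [1 if s > 0 else -1 for s in test_scores]

print("AUC:", auc(test_scores, test_labels))
print("F1:", f1(test_preds, test_labels))
```

AUC is computed from the raw decision values, so it measures ranking quality independently of any cutoff, while F1 depends on the threshold chosen (zero here). Reporting both, as the abstract does, covers ranking and hard classification separately.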
