JMIR Publications, JMIR Medical Informatics, 10(11), p. e37945, 2022
DOI: 10.2196/37945
Background: The increasing availability of “real-world” data in the form of written text holds promise for deepening our understanding of societal and health-related challenges. Textual data constitute a rich source of information, capturing lived experiences through a broad range of information (eg, content and emotional tone). Interviews are the “gold standard” for gaining qualitative insights into individual experiences and perspectives. However, conducting interviews on a large scale is not always feasible, and standardized quantitative assessments suitable for large-scale application may miss important information. Surveys that include open-text assessments can combine the advantages of both methods and are well suited for the application of natural language processing (NLP) methods. While innovations in NLP have made large-scale text analysis more accessible, the analysis of real-world textual data is still complex and requires several consecutive steps.
Objective: We developed an NLP pipeline for extracting real-world experiences from textual data and examined its utility and scientific value, with the aim of providing guidance for applied researchers.
Methods: We applied the NLP pipeline to large-scale textual data collected by the Swiss Multiple Sclerosis (MS) registry. These registry data constitute an ideal use case for the study of real-world text data. Specifically, we examined 639 text reports on the experienced impact of the first COVID-19 lockdown from the perspectives of persons with MS. The pipeline was implemented in Python and complemented by analyses using the “Linguistic Inquiry and Word Count” software. It consists of the following 5 interconnected analysis steps: (1) text preprocessing; (2) sentiment analysis; (3) descriptive text analysis; (4) unsupervised learning–topic modeling; and (5) results interpretation and validation.
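To make the first three analysis steps more concrete, the following is a minimal sketch using only the Python standard library. It is not the authors' implementation: the stop-word list, the sentiment lexicon, the example reports, and all function names are hypothetical and chosen purely for illustration.

```python
import re
from collections import Counter

# Hypothetical mini stop-word list and sentiment lexicon (illustration only;
# a real analysis would use validated, language-appropriate resources).
STOP_WORDS = {"the", "a", "an", "and", "i", "my", "was", "to", "of", "in", "it"}
SENTIMENT_LEXICON = {
    "lonely": -1, "worried": -1, "missed": -1,
    "calm": 1, "grateful": 1, "relaxed": 1,
}


def preprocess(text):
    """Step 1: lowercase, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def sentiment_score(tokens):
    """Step 2: mean polarity of the tokens found in the lexicon (0.0 if none)."""
    hits = [SENTIMENT_LEXICON[t] for t in tokens if t in SENTIMENT_LEXICON]
    return sum(hits) / len(hits) if hits else 0.0


def term_frequencies(token_lists):
    """Step 3: corpus-wide term counts as a simple descriptive statistic."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts


# Invented example reports, not taken from the registry data.
reports = [
    "I missed my friends and was lonely.",
    "Working from home was calm and I was grateful.",
]
token_lists = [preprocess(r) for r in reports]
scores = [sentiment_score(t) for t in token_lists]
frequencies = term_frequencies(token_lists)
```

Topic modeling (step 4) is not shown here; in practice it would typically rely on a dedicated library rather than hand-written code.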
Results: A topic modeling analysis identified the following 4 distinct groups based on the topics participants were mainly concerned with: “contacts/communication,” “social environment,” “work,” and “errands/daily routines.” Notably, the sentiment analysis revealed that the “contacts/communication” group was characterized by a pronounced negative emotional tone in the text reports. This heterogeneity in the emotional tone of the reported experiences of the first COVID-19–related lockdown likely reflects differences in emotional burden, individual circumstances, and ways of coping with the pandemic, which is in line with previous research.
Conclusions: This study illustrates the timely and efficient applicability of an NLP pipeline and serves as a precedent for applied researchers. It contributes both to the dissemination of NLP techniques in applied health sciences and to the identification of previously unknown experiences and burdens of persons with MS during the pandemic, which may be relevant for future treatment.
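The comparison of emotional tone across topic groups described in the Results can be sketched as a simple aggregation, assuming each report already carries a topic group label and a sentiment score. The function name and the numeric values below are hypothetical toy inputs, not study results.

```python
from collections import defaultdict
from statistics import mean


def mean_sentiment_by_group(records):
    """Average the sentiment scores of (group_label, score) pairs per group."""
    grouped = defaultdict(list)
    for group, score in records:
        grouped[group].append(score)
    return {group: mean(scores) for group, scores in grouped.items()}


# Toy (group, sentiment score) pairs; the values are invented for illustration.
records = [
    ("contacts/communication", -0.5),
    ("contacts/communication", -0.25),
    ("work", 0.25),
    ("work", 0.0),
]
group_means = mean_sentiment_by_group(records)

# Group with the lowest (most negative) mean sentiment in the toy data.
most_negative = min(group_means, key=group_means.get)
```

A descriptive comparison like this only summarizes the scores; whether group differences are meaningful would still need the interpretation and validation step of the pipeline.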