Natural Language Processing with Open-Text Healthcare Survey Data

Open-text questions provide qualitative data that offers insight into respondents’ unfiltered experiences and opinions. However, in large-scale surveys, manually analyzing the responses is often infeasible because of time and resource constraints. Natural Language Processing (NLP) is a branch of artificial intelligence concerned with enabling computers to read and interpret human language. Using NLP to categorize survey responses not only saves valuable time but can also reduce bias introduced by analysts’ interpretations of categories.

NLP can perform two types of categorization. The first is unsupervised topic modeling, in which the algorithm discovers topics based on recurring words in the text. The second is supervised topic modeling, in which analysts define the topics before running the algorithm. The algorithm then matches words found in the text to the words analysts have assigned to each predefined topic. Each technique has its pros and cons, depending on what the analyst wants from the data.
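To make the idea of "recurring words" concrete, here is a minimal sketch in Python. It is not a real topic model (methods such as latent Dirichlet allocation are considerably more involved); it only counts how many responses each word appears in and keeps the words that recur. The function name, the threshold, and the sample responses are all hypothetical.

```python
from collections import Counter

def recurring_words(responses, min_responses=2):
    """Return (word, count) pairs for words found in at least `min_responses` responses."""
    doc_freq = Counter()
    for response in responses:
        # Use a set so a word is counted once per response, not once per occurrence.
        doc_freq.update(set(response.lower().split()))
    return [(word, n) for word, n in doc_freq.most_common() if n >= min_responses]

# Hypothetical survey responses, already free of punctuation for simplicity.
responses = [
    "long wait in emergency room",
    "friendly nurse short wait",
    "nurse explained medication clearly",
]

print(recurring_words(responses))
```

In this toy sample, “wait” and “nurse” each appear in two responses, so they surface as candidate themes an analyst could then name and review.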

Before NLP techniques can be applied, the data must be preprocessed into a format that the program can work with. First, spelling errors are corrected, and punctuation, non-text characters, abbreviations, and stop words are removed. Stop words are commonly used words such as “the”, “is”, and “a”. Removing them focuses the analysis on the words that carry meaningful information. Another important preprocessing step is stemming, which reduces each word to its root form and further reduces dimensionality. For example, “running” becomes “run”.
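The preprocessing steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not a production pipeline: the stop-word list is a small hypothetical sample, `naive_stem` is a toy suffix stripper rather than an established stemming algorithm such as Porter’s, and spelling correction and abbreviation handling are left out entirely.

```python
import re

# Small illustrative stop-word list; a real project would use a fuller one.
STOP_WORDS = {"the", "is", "a", "an", "and", "was", "were", "to", "of", "in", "it", "i", "my"}

def naive_stem(word: str) -> str:
    """Toy stemmer: strips a few common suffixes. Not a real stemming algorithm."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run" (but keep "call", "miss").
            if stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    # Strip a simple plural "s" ("nurses" -> "nurse"), leaving words like "class" alone.
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def preprocess(response: str) -> list[str]:
    """Lowercase, drop non-letter characters, remove stop words, and stem."""
    text = response.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # punctuation, digits, symbols -> spaces
    tokens = text.split()
    return [naive_stem(token) for token in tokens if token not in STOP_WORDS]

print(preprocess("The nurse was running late!"))
```

For the sample sentence, the stop words “the” and “was” are dropped, the exclamation mark is removed, and “running” is reduced to “run”.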

Cammel and colleagues tested the effectiveness of unsupervised topic modeling on patient experience data from two different hospitals (Cammel et al., 2020). Because unsupervised topic modeling was used, the topics were derived directly from the responses rather than from categories defined in advance. The researchers then applied their original algorithm to a smaller data set to demonstrate that it transfers to data sets of different sizes. Although topics were created and valuable insights were drawn, NLP cannot replace the accuracy of manual categorization. Quality improvement teams can use NLP with large data sets to drill down into specific domains to focus on.

goShadow is developing an algorithm similar to that of Cammel et al., but for supervised topic modeling. In other words, the preprocessing techniques are the same, but instead of the algorithm creating the topics, goShadow has already defined them. Each topic has been assigned a list of words associated with it. The algorithm reads each survey response and matches its words against the words in each topic’s list. Although the algorithm is still in development, it is expected to greatly reduce analysis time and improve the standardization of results. With sufficient accuracy, the NLP results could be used to benchmark hospitals against one another and to pinpoint weaknesses.
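As a rough illustration of keyword-based matching in general (not goShadow’s actual algorithm, which is still in development), the Python sketch below assigns a response to every topic whose word list shares at least one word with it. The topic names, keyword lists, and function name are all hypothetical, and in practice the response would be preprocessed and stemmed before matching.

```python
import re

# Hypothetical predefined topics, each with an analyst-assigned word list.
TOPIC_KEYWORDS = {
    "wait time": {"wait", "delay", "late"},
    "staff": {"nurse", "doctor", "staff"},
    "communication": {"explain", "listen", "inform"},
}

def assign_topics(response, topic_keywords=TOPIC_KEYWORDS):
    """Return the sorted names of all topics whose keywords appear in the response."""
    tokens = set(re.findall(r"[a-z]+", response.lower()))
    return sorted(
        topic for topic, keywords in topic_keywords.items() if tokens & keywords
    )

print(assign_topics("The nurse did not explain the delay"))
print(assign_topics("Parking was expensive"))
```

The first response matches all three topics, while the second matches none. Responses with no match are worth reviewing manually, since they may point to a topic or keyword missing from the predefined lists.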

Want to learn more or design your own custom survey with the goShadow team? Email us to get started.

Cammel, S., De Vos, M., van Soest, D., Hettne, K., Boer, F., Steyerberg, E., & Boosman, H. (2020). How to automatically turn patient experience free-text responses into actionable insights: a natural language programming (NLP) approach. BMC Medical Informatics and Decision Making, 20(1), 97.

Posted on July 15, 2021