Word2Vec Inversion and Traditional Text Classifiers for Phenotyping Lupus
Turner, Clayton Anthony
Identifying patients who meet certain clinical criteria through manual chart review of doctors' notes is a daunting task, given the massive volume of free-text notes in electronic health records (EHRs). This task can be automated using text classifiers that combine Natural Language Processing (NLP) techniques with pattern-recognition machine learning (ML) algorithms. The aim of this research is to evaluate the performance of traditional classifiers for identifying patients with Systemic Lupus Erythematosus (SLE) in comparison with a newer Bayesian word vector method. We obtained clinical notes for patients with an SLE diagnosis, along with controls, from the Rheumatology Clinic. Sparse bag-of-words (BOW) and Unified Medical Language System (UMLS) Concept Unique Identifier (CUI) matrices were produced using NLP pipelines. These matrices were passed to several different classifiers built with customized Theano and scikit-learn libraries. Additionally, we used a Bayesian inversion method, implemented with Word2Vec, which requires much less overhead. Performance was measured by calculating accuracy and area under the Receiver Operating Characteristic (ROC) curve (AUC) on a cross-validated (CV) set and on a separate testing set. Our results suggest that Word2Vec inversion is as accurate as a shallow neural network with CUIs and as random forests with both CUIs and BOWs. Because generating CUIs carries substantial overhead, BOWs and the inversion method are more desirable. Of these two, the inversion method is more easily adapted to non-binary classification tasks and requires less initial setup.
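The inversion step can be illustrated with a minimal sketch. It is not the thesis implementation. It assumes that one language model has been trained per class (for example, one Word2Vec model on SLE notes and one on control notes) and that each model can report a log-likelihood for a new note; gensim's `Word2Vec.score` is one way such scores might be obtained. The function names `invert_posteriors` and `classify`, the class labels, and the numeric scores below are all hypothetical. Bayes' rule then converts the per-class likelihoods into posterior class probabilities.

```python
import math


def invert_posteriors(log_likelihoods, priors=None):
    """Convert per-class log-likelihoods of one document into posteriors.

    log_likelihoods: dict mapping class label -> log p(document | class),
        e.g. as reported by a language model trained only on that class.
    priors: optional dict mapping class label -> p(class); uniform if omitted.
    """
    labels = list(log_likelihoods)
    if priors is None:
        priors = {label: 1.0 / len(labels) for label in labels}

    # Unnormalised log posterior: log p(d | c) + log p(c)
    log_joint = {c: log_likelihoods[c] + math.log(priors[c]) for c in labels}

    # Subtract the maximum before exponentiating for numerical stability
    peak = max(log_joint.values())
    weights = {c: math.exp(v - peak) for c, v in log_joint.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


def classify(log_likelihoods, priors=None):
    """Return the class label with the highest posterior probability."""
    posteriors = invert_posteriors(log_likelihoods, priors)
    return max(posteriors, key=posteriors.get)


# Hypothetical log-likelihoods for a single clinical note (illustrative only)
example_scores = {"SLE": -120.4, "control": -123.1}
posterior = invert_posteriors(example_scores)
predicted = classify(example_scores)
```

Because the function accepts any number of class labels, the same sketch extends to non-binary tasks by adding one model, and therefore one score, per additional class.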