A Deep Representation Empowered Distant Supervision Paradigm for Clinical Information Extraction

Wang, Yanshan; Sohn, Sunghwan; Liu, Sijia; Shen, Feichen; Wang, Liwei; Atkinson, Elizabeth J.; Amin, Shreyasee; Liu, Hongfang

Abstract: Objective: To automatically create large labeled training datasets and reduce the feature engineering effort required to train accurate machine learning models for clinical information extraction. Materials and Methods: We propose a distant supervision paradigm empowered by deep representation for extracting information from clinical text. In this paradigm, rule-based NLP algorithms are used to generate weak labels and create large training datasets automatically. Additionally, we use pre-trained word embeddings as the deep representation to eliminate the need for task-specific feature engineering. We evaluated the effectiveness of the proposed paradigm on two clinical information extraction tasks: smoking status extraction and proximal femur (hip) fracture extraction. We tested three prevalent machine learning models, namely, Convolutional Neural Networks (CNN), Support Vector Machine (SVM), and Random Forest (RF). Results: The results indicate that CNN is the best fit for the proposed distant supervision paradigm. Given large datasets, it outperforms the rule-based NLP algorithms by capturing additional extraction patterns. We also verified the advantage of the word embedding feature representation in the paradigm over term frequency-inverse document frequency (tf-idf) and topic modeling representations. Discussion: In the clinical domain, the limited amount of labeled data is always a bottleneck for applying machine learning. Additionally, the performance of machine learning approaches depends heavily on task-specific feature engineering. The proposed paradigm could alleviate both problems by leveraging rule-based NLP algorithms to automatically assign weak labels and by using word embedding feature representation to eliminate the need for task-specific feature engineering.
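
The paradigm described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two regular-expression rules, the 2-dimensional toy vectors, the four example notes, and the function names are all hypothetical, and a nearest-centroid classifier stands in for the CNN, SVM, and RF models evaluated in the paper. The sketch only shows the flow: rules assign weak labels, notes are represented by averaged word vectors instead of hand-built features, and a model is trained on the weakly labeled data.

```python
import math
import re

# Hypothetical rules standing in for a rule-based NLP algorithm.
SMOKING_TERMS = re.compile(r"\b(smokes|smoker|tobacco|cigarettes)\b")
NEGATION_TERMS = re.compile(r"\b(no|never|denies)\b")

# Made-up 2-d vectors standing in for pre-trained word embeddings.
# Dimension 0 loosely encodes "tobacco-related", dimension 1 "negation".
TOY_EMBEDDINGS = {
    "smokes": (1.0, 0.0), "smoker": (1.0, 0.0),
    "cigarettes": (0.9, 0.0), "tobacco": (0.9, 0.0),
    "nicotine": (0.8, 0.0),  # close to "tobacco", but not covered by the rules
    "denies": (0.0, 1.0), "never": (0.0, 1.0), "no": (0.0, 1.0),
}


def weak_label(note):
    """Assign a noisy label from the rules alone (no human annotation)."""
    text = note.lower()
    if SMOKING_TERMS.search(text) and not NEGATION_TERMS.search(text):
        return "smoker"
    return "non-smoker"


def embed(note):
    """Represent a note as the average of its known word vectors."""
    tokens = re.findall(r"[a-z]+", note.lower())
    vectors = [TOY_EMBEDDINGS[t] for t in tokens if t in TOY_EMBEDDINGS]
    if not vectors:
        return (0.0, 0.0)  # no known word: fall back to the zero vector
    return tuple(sum(dim) / len(vectors) for dim in zip(*vectors))


def train(notes):
    """Weakly label every note, then compute one centroid per label."""
    grouped = {}
    for note in notes:
        grouped.setdefault(weak_label(note), []).append(embed(note))
    return {
        label: tuple(sum(dim) / len(vecs) for dim in zip(*vecs))
        for label, vecs in grouped.items()
    }


def predict(centroids, note):
    """Return the label whose centroid is nearest to the note's vector."""
    vector = embed(note)
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))


# Unlabeled toy notes; their labels come only from weak_label().
TRAINING_NOTES = [
    "Patient smokes cigarettes daily",
    "Current smoker uses tobacco",
    "Patient denies tobacco use",
    "Never smoked, no cigarettes",
]

centroids = train(TRAINING_NOTES)
print(weak_label("Uses nicotine daily"))          # label from the rules alone
print(predict(centroids, "Uses nicotine daily"))  # label from the trained model
```

In this toy setting the rules miss "nicotine" and label the last note as non-smoker, while the model trained on the weak labels places it with the smoker notes because its vector lies near "tobacco". This mirrors, on a very small scale, the abstract's claim that a trained model can capture extraction patterns the rules do not cover.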

Subjects:	Information Retrieval (cs.IR)
Cite as:	arXiv:1804.07814 [cs.IR]
	(or arXiv:1804.07814v1 [cs.IR] for this version)
	https://doi.org/10.48550/arXiv.1804.07814
