The Magic of Semantic Enrichment and NLP for Medical Coding

García-Santa, Nuria; San Miguel, Beatriz; Ugai, Takanori

doi:10.1007/978-3-030-32327-1_12

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11762)

Included in the following conference series:

European Semantic Web Conference

Abstract

Artificial Intelligence technologies are increasingly present in the medical domain. Several healthcare activities that were once performed entirely by hand by experts are now reaching a high level of automation, thanks to the successful integration of these technologies into the work of medical professionals. One such activity is medical coding, which consists of annotating clinical notes (free-text narrative reports) with codes from standard medical classifications in order to align this information with patients’ records. This paper presents a combination of NLP and semantic enrichment techniques that generates an extended Biomedical Knowledge Graph, which is then exploited in our automatic medical coding solution.

1 Introduction

The exponential growth of semi-structured and free-text narrative information in Electronic Health Records has created a need for automated tools that transform this data into valuable knowledge [3]. Knowledge Graphs and Semantic Web technologies in the healthcare domain are currently expected to improve clinical outcomes [1]. The combination of Natural Language Processing (NLP) with semantic annotations has enabled crucial advances in the semi-automation of data processing to build enriched knowledge bases for different healthcare use cases [5].

One healthcare use case of special interest is medical coding. This is the process of annotating clinical notes directly with the codes of medical classifications or standards, such as the International Classification of Diseases (ICD)^{Footnote 1}, for recording, diagnostic, billing and reporting purposes. Clinical notes are unstructured patient data written by healthcare professionals in narrative form; they contain medical terms that detail diagnoses, symptoms or procedures. Because medical coding is usually performed manually by human coders, it can be extremely time-consuming and error-prone [6].

This paper proposes an approach that combines NLP and semantic enrichment to automatically create a Biomedical Knowledge Graph (BKG), which is the basis for our Medical Coding solution (our demonstration). Sect. 2 describes our approach and demonstration, Sect. 3 presents the evaluation and results, and Sect. 4 concludes and highlights future work.

2 Medical Coding Solution

2.1 Solution Overview

This section describes our medical coding solution, which is based on the generation and use of a proprietary BKG.

Figure 1 illustrates our workflow. First, an initial BKG is defined to store the medical classification used for coding (such as ICD-9 or ICD-10). Its nodes are the ICD entities, with basic properties such as code and description, connected by hierarchical relations. Next, the BKG is automatically enriched with semantic information from external resources such as medical vocabularies and scientific articles from PubMed.
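As a rough illustration of the initial graph described above, the following sketch models ICD entities as nodes with a code, a description and a parent link for the hierarchy. The class name, field names and helper functions are assumptions made for this example and are not taken from the actual system; the synonym shown at the end is likewise only an illustrative value.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IcdNode:
    """One ICD entity in the graph (all field names are illustrative)."""
    code: str
    description: str
    parent: Optional[str] = None                      # hierarchical relation to a broader code
    synonyms: set = field(default_factory=set)        # to be filled by semantic enrichment
    related: set = field(default_factory=set)         # to be filled by semantic enrichment


def build_initial_graph(rows):
    """Build the initial graph from (code, description, parent_code) rows."""
    return {code: IcdNode(code, desc, parent) for code, desc, parent in rows}


def ancestors(graph, code):
    """Follow the hierarchy upward from a code to its top-level category."""
    chain = []
    parent = graph[code].parent
    while parent is not None:
        chain.append(parent)
        parent = graph[parent].parent
    return chain


rows = [
    ("E11", "Type 2 diabetes mellitus", None),
    ("E11.9", "Type 2 diabetes mellitus without complications", "E11"),
]
graph = build_initial_graph(rows)

# Enrichment would later attach extra labels to a node, for example:
graph["E11"].synonyms.add("adult-onset diabetes")
```

Keeping the enrichment fields on the node itself mirrors the idea that synonyms and related terms become additional features of the same ICD entity.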

With the enriched BKG, we can carry out the medical coding task. This involves Medical Named-Entity Recognition tools to identify diseases in clinical notes, string similarity comparison against the enriched BKG, and heuristic approaches that return the outcome as a scored ranking of candidate ICD codes.
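The ranking step described above can be sketched as follows. This is a simplified illustration, not the actual heuristics: it assumes the disease mentions have already been extracted by some NER tool (not shown), uses the standard-library `difflib.SequenceMatcher` as a stand-in string similarity measure, and scores each code by its best-matching label. The function names and the sample labels are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rank_codes(mentions, graph_labels, top_k=3):
    """Rank ICD codes by the best similarity between any mention and any label.

    mentions:     disease strings found in a clinical note (NER output, assumed given)
    graph_labels: dict mapping an ICD code to its labels (description, synonyms,
                  related terms) taken from the enriched graph
    """
    scores = {}
    for code, labels in graph_labels.items():
        scores[code] = max(
            (similarity(m, label) for m in mentions for label in labels),
            default=0.0,  # no mentions or no labels: the code gets the lowest score
        )
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


graph_labels = {
    "E11": ["Type 2 diabetes mellitus", "adult-onset diabetes"],
    "I10": ["Essential (primary) hypertension"],
    "J45": ["Asthma"],
}
candidates = rank_codes(["type 2 diabetes"], graph_labels)
```

Because synonyms and related terms are compared alongside the official description, a mention can match a code even when the note uses wording that differs from the classification text.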

2.2 The Semantic Enrichment of Biomedical Knowledge Graph

The semantic enrichment focuses on:

Synonym mapping from the Ontology of Consumer Health Vocabulary (OCHV)^{Footnote 2}: we developed a tool that compares the descriptions of all ICD entities in our BKG against the equivalent terms identified by this ontology. For the ICD entities annotated by the ontology, the synonyms are collected and added as a further semantic feature of the corresponding ICD entity in the BKG. We used tokenization, noun chunk detection and string similarity techniques.
Healthcare-related terms extracted with word embedding techniques: this makes it possible to find relationships between the descriptions of the ICD entities in the BKG and other closely related medical terms that are not synonyms. Word embeddings were generated using the Word2Vec skip-gram algorithm^{Footnote 3} over a dataset built from the titles of PubMed medical articles. We selected Word2Vec skip-gram because it represents words in a low-dimensional space and its architecture focuses on computing the probability of the context of a given word, which makes it a good way to obtain related terms with good performance. Once the Word2Vec model is trained, we go through the ICD descriptions, query the model for each word, and use the cosine similarity measure to retrieve related terms, which are then included in the BKG.
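To make the related-term lookup concrete, the sketch below computes cosine similarity over a handful of hand-written toy vectors. These vectors only stand in for a trained Word2Vec model; the vocabulary, the vector values, the `min_sim` threshold and the function names are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_terms(word, vectors, top_n=2, min_sim=0.5):
    """Return up to top_n (term, similarity) pairs closest to `word`."""
    if word not in vectors:
        return []  # words unknown to the model yield no related terms
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored = [(term, sim) for term, sim in scored if sim >= min_sim]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy 3-dimensional embeddings (made-up values, not from a real model).
vectors = {
    "diabetes": [0.9, 0.1, 0.0],
    "insulin": [0.8, 0.3, 0.1],
    "glucose": [0.7, 0.2, 0.2],
    "fracture": [0.0, 0.1, 0.9],
}
terms = [term for term, _ in related_terms("diabetes", vectors)]
```

With these toy values, "insulin" and "glucose" point in nearly the same direction as "diabetes" and are returned, while "fracture" falls below the threshold and is dropped.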

Each medical entity is expressed in RDF, where the data triples merge the original information with the enriched semantic data (‘dbpedia-owl:synonym’ and ‘dbpedia-owl:related’). This semantic data is then transformed into property-graph format to create our BKG for later exploitation. Property graphs provide compact models and express complex traversal queries with good performance. For the semantic enrichment we added properties from DBpedia^{Footnote 4} and the Human Disease Ontology^{Footnote 5} to define the labels and relations of medical synonyms and related terms. We also use the OCHV ontology mentioned above and the Symptoms Ontology to encode specific symptom instances^{Footnote 6}.

2.3 Demonstration

Our Medical Coding solution has been developed for several languages (English, Spanish and Japanese). The enriched BKG is stored in Neo4j^{Footnote 7} because of its high read and write performance, which stems from its scalable architecture and native graph storage, and because of its wide support. We provide a web-based interface that lets users view clinical notes, their relevant entities and the associated standard codes, and analyze texts dynamically. Fig. 2 shows a screenshot of the demonstration. A demo video is also publicly available^{Footnote 8}.

3 Preliminary Evaluation and Results

For the evaluation, two main datasets were used: (1) MIMIC-III [4], a publicly available clinical database that contains English discharge summaries with ICD-9 codes annotated by medical experts, from which we selected 5000 random discharge summaries (MIMIC-5000); and (2) 200 private de-identified clinical notes in English, manually annotated by experts following the ICD-10 classification. We work with ~15000 ICD-10 codes and ~17000 ICD-9 codes, which shows that our approach can handle different standards and is not limited to certain subsets of codes. An advantage of our semantic approach is that it avoids bias problems, because there is no training over a fixed target distribution. Our solution is based on semantic knowledge and covers the whole code distribution at the same level of coverage.

Medical coding is a multi-class, multi-label problem. We provide three candidate codes for each clinical note, returning ICD-9 or ICD-10 codes depending on the dataset. We measured the F-score, with the confusion matrix parameters defined as follows:

True Positives = the codes assigned by the experts are among those selected by our method.
False Positives = our method assigns codes to a text that, according to the experts, does not contain enough information to be coded.
False Negatives = the experts assign codes and our method either assigns no code or assigns codes different from those selected by the experts.
True Negatives = our method assigns no codes to a text that, according to the experts, does not contain enough information to be coded.

The most suitable metric for evaluating our approach is the F1-score, the harmonic mean of precision and recall. We achieve an F1-score of 0.75 on MIMIC-5000 and 0.72 on the 200 private clinical notes. In [2], the authors analyze current state-of-the-art methods for assigning ICD-9 codes to clinical notes, and the best models achieve an F-score of 0.7233 when returning the top 10 ICD-9 codes. Our solution exceeds this result on MIMIC-5000.
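The evaluation scheme above can be sketched per clinical note as follows. This is one possible reading of the definitions: it assumes that a note counts as a True Positive when at least one expert code appears among the returned candidates, and that an empty expert set marks a note the experts consider not codable. The function names and the sample notes are hypothetical and do not reproduce the reported figures.

```python
def classify_note(predicted, gold):
    """Label one note as TP, FP, FN or TN.

    predicted: list of candidate codes returned by the method (possibly empty)
    gold:      set of expert-assigned codes (empty means "not codable")
    """
    if not gold:
        return "FP" if predicted else "TN"
    if set(predicted) & gold:
        return "TP"   # assumption: any overlap with the expert codes counts
    return "FN"       # no code returned, or only codes the experts did not pick


def f1_score(outcomes):
    """F1 as the harmonic mean of precision and recall over note outcomes."""
    tp = outcomes.count("TP")
    fp = outcomes.count("FP")
    fn = outcomes.count("FN")
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up notes: (candidates returned, expert codes).
notes = [
    (["E11", "I10", "J45"], {"E11"}),   # TP
    (["I10", "E11", "J45"], {"I10"}),   # TP
    (["I10"], {"J45"}),                 # FN: different code
    (["E11"], set()),                   # FP: experts say not codable
    ([], set()),                        # TN
]
outcomes = [classify_note(pred, gold) for pred, gold in notes]
score = f1_score(outcomes)
```

True Negatives do not enter the F1 computation, which is why the last note leaves the score unchanged.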

4 Conclusions and Future Work

We presented a prototype of an automatic Medical Coding solution based on semantic and NLP technologies. The evaluation results are very competitive, with an F-score of 0.75 on the MIMIC-5000 dataset. Our approach is easily adaptable to any medical classification and language, and it does not need pre-annotated datasets. The preliminary evaluation was carried out only for English.

Exploiting semantic enrichment and NLP techniques interlinked with the original data of the KG makes it possible to build better mathematical models for automatic medical coding: they provide more context for the problem and, in turn, better performance and results [2].

In future work, we will compare our approach against different neural network approaches and provide further evaluation and comparisons between languages and methods. We also want to extend the system to more languages, apply cross-lingual approaches and analyze how to merge semantic and Deep Learning approaches.

Notes

References

[1] Barisevičius, G., et al.: Supporting digital healthcare services using semantic web technologies. In: Vrandečić, D., et al. (eds.) ISWC 2018. LNCS, vol. 11137, pp. 291–306. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00668-6_18
[2] Huang, J., Osorio, C., Sy, L.W.: An empirical evaluation of deep learning for ICD-9 code assignment using MIMIC-III clinical notes. arXiv preprint arXiv:1802.02311 (2018)
[3] Jiang, F., et al.: Artificial intelligence in healthcare: past, present and future. Stroke Vasc. Neurol. 2(4), 230–243 (2017)
[4] Johnson, A.E., et al.: MIMIC-III, a freely accessible critical care database. Sci. Data 3, 160035 (2016)
[5] Lee, Y., Geller, J.: Semantic enrichment for medical ontologies. J. Biomed. Inform. 39(2), 209–226 (2006)
[6] O’Malley, K.J., Cook, K.F., Price, M.D., Wildes, K.R., Hurdle, J.F., Ashton, C.M.: Measuring diagnoses: ICD code accuracy. Health Serv. Res. 40(5p2), 1620–1639 (2005)

Author information

Authors and Affiliations

Fujitsu Laboratories of Europe, Camino Cerro de los Gamos, 28224, Pozuelo de Alarcón, Madrid, Spain
Nuria García-Santa & Beatriz San Miguel
Fujitsu Laboratories Ltd., 4-1-1 Kamikodanaka, Nakahara-ku, Kawasaki, 211-8588, Japan
Takanori Ugai

Corresponding author

Correspondence to Nuria García-Santa.

Editor information

Editors and Affiliations

Kansas State University, Manhattan, KS, USA
Pascal Hitzler
Vienna University of Economics and Business, Vienna, Austria
Sabrina Kirrane
Linköping University, Linköping, Sweden
Olaf Hartig
Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
Victor de Boer
Leibniz Information Centre for Science and Technology University Library (TIB), Hannover, Germany
Maria-Esther Vidal
University of Bonn, Bonn, Germany
Maria Maleshkova
Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
Stefan Schlobach
Jönköping University, Jönköping, Sweden
Karl Hammar
F. Hoffmann-La Roche AG, Basel, Switzerland
Nelia Lasierra
Robert Bosch GmbH, Stuttgart, Germany
Steffen Stadtmüller
Aalborg University, Aalborg, Denmark
Katja Hose
IMEC, Ghent University, Ghent, Belgium
Ruben Verborgh

About this paper

Cite this paper

García-Santa, N., San Miguel, B., Ugai, T. (2019). The Magic of Semantic Enrichment and NLP for Medical Coding. In: Hitzler, P., et al. The Semantic Web: ESWC 2019 Satellite Events. ESWC 2019. Lecture Notes in Computer Science, vol 11762. Springer, Cham. https://doi.org/10.1007/978-3-030-32327-1_12

DOI: https://doi.org/10.1007/978-3-030-32327-1_12
Published: 10 October 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32326-4
Online ISBN: 978-3-030-32327-1
eBook Packages: Computer Science, Computer Science (R0)
