1 Introduction

The exponential growth of semi-structured and free-text narrative information in Electronic Health Records has created a need for automated tools that transform this data into valuable knowledge [3]. Knowledge Graphs and Semantic Web technologies in the healthcare domain are expected to improve clinical outcomes [1]. Combining Natural Language Processing (NLP) with semantic annotations has enabled important advances in semi-automating the data processing needed to build enriched knowledge bases for different healthcare use cases [5].

One healthcare use case of special interest is medical coding: the process of annotating clinical notes directly with the codes of medical classifications or standards, such as the International Classification of Diseases (ICD; footnote 1), for recording, diagnostic, billing and reporting purposes. Clinical notes are unstructured patient data written by healthcare professionals in narrative form; they contain medical terms that detail diagnoses, symptoms or procedures. Because medical coding is usually performed manually by human coders, it can be extremely time-consuming and prone to errors [6].

This paper proposes an approach that combines NLP and semantic enrichment to automatically create a Biomedical Knowledge Graph (BKG), which is the basis for our Medical Coding solution (our demonstration). Sect. 2 describes our approach and demonstration, Sect. 3 presents the evaluation and results, and Sect. 4 concludes and highlights future work.

2 Medical Coding Solution

2.1 Solution Overview

This section describes our medical coding solution, which is based on the generation and use of a proprietary BKG.

Figure 1 illustrates our workflow. First, an initial BKG is defined to store the medical classification used for coding (such as ICD-9 or ICD-10). Its nodes are the ICD entities, with basic properties such as code and description, and its edges are the hierarchical relations among them. Next, the BKG is automatically enriched with semantic information from external resources such as medical vocabularies and scientific articles from PubMed.

With the enriched BKG, we carry out the medical coding task in three steps: Medical Named-Entity Recognition tools identify diseases in clinical notes, the recognized entities are compared against the enriched BKG by string similarity, and heuristics return the outcome as a scored ranking of candidate ICD codes.
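The similarity-and-ranking step can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's actual implementation: the disease mentions are assumed to have already been produced by an NER step, `difflib.SequenceMatcher` stands in for whichever string similarity measure is really used, and the `BKG` dictionary, `similarity` and `rank_codes` are hypothetical names. The codes and shortened descriptions follow ICD-9-CM, while the synonym lists are illustrative values only.

```python
from difflib import SequenceMatcher

# Hypothetical miniature of an enriched BKG: ICD code -> description and synonyms.
# Descriptions are shortened ICD-9-CM labels; synonyms are illustrative values.
BKG = {
    "250.00": {"description": "diabetes mellitus without mention of complication",
               "synonyms": ["type 2 diabetes", "diabetes"]},
    "401.9": {"description": "unspecified essential hypertension",
              "synonyms": ["high blood pressure", "hypertension"]},
    "486": {"description": "pneumonia, organism unspecified",
            "synonyms": ["lung infection", "pneumonia"]},
}


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rank_codes(mentions, bkg, top_k=3):
    """Score each code by its best mention/label match and return the top_k."""
    if not mentions:
        return []
    scores = {}
    for code, node in bkg.items():
        labels = [node["description"], *node["synonyms"]]
        scores[code] = max(similarity(m, label) for m in mentions for label in labels)
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Mentions assumed to come from a medical NER tool run over a clinical note.
candidates = rank_codes(["high blood pressure"], BKG)
```

Here `candidates` is a list of `(code, score)` pairs in descending score order, so the first entry is the best-matching code. Taking the maximum over labels is one simple heuristic; a real system might weight descriptions, synonyms and related terms differently.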

2.2 The Semantic Enrichment of Biomedical Knowledge Graph

The semantic enrichment focuses on:

  • Synonym mapping from the Ontology of Consumer Health Vocabulary (OCHV; footnote 2): we developed a tool that compares the descriptions of all ICD entities in our BKG with the equivalent terms given by this ontology. For each ICD entity annotated by the ontology, its synonyms are collected and added as a further semantic feature of that entity in the BKG. The tool uses tokenization, noun chunk detection and string similarity techniques.

  • Healthcare-related terms extracted with word embedding techniques: this finds relationships between the descriptions of the ICD entities in the BKG and other closely related medical terms that are not synonyms. The word embeddings were generated with the Word2Vec skip-gram algorithm (footnote 3) over a dataset built from the titles of PubMed medical articles. We selected Word2Vec skip-gram because it represents words in a low-dimensional space and its architecture is focused on estimating the probability of the context of a given word, which yields related terms with good performance. Once the Word2Vec model is trained, we look up each word of every ICD description in the model and use the cosine similarity measure to retrieve related terms, which are then included in the BKG.
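The related-term lookup in the second item can be illustrated with a small sketch. It assumes a word-to-vector table is already available; the three-dimensional `EMBEDDINGS` values below are toy numbers standing in for a trained Word2Vec model, and the `threshold` and `top_n` parameters are hypothetical, not values taken from the paper.

```python
import math

# Toy vectors standing in for a trained Word2Vec skip-gram model (assumption).
EMBEDDINGS = {
    "diabetes": [0.9, 0.1, 0.0],
    "insulin": [0.8, 0.2, 0.1],
    "glucose": [0.7, 0.3, 0.0],
    "fracture": [0.0, 0.1, 0.9],
    "pneumonia": [0.1, 0.9, 0.2],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_terms(description, embeddings, threshold=0.9, top_n=5):
    """Look up each word of an ICD description and collect close vocabulary terms."""
    related = {}
    for word in description.lower().split():
        if word not in embeddings:
            continue  # out-of-vocabulary words are skipped
        for term, vector in embeddings.items():
            if term == word:
                continue
            score = cosine(embeddings[word], vector)
            if score >= threshold:
                related[term] = max(score, related.get(term, 0.0))
    return sorted(related, key=related.get, reverse=True)[:top_n]


terms = related_terms("Diabetes mellitus", EMBEDDINGS)
```

With a real model, the same lookup could be delegated to a library such as gensim, where skip-gram training is selected with `sg=1` and nearest neighbours are retrieved with `most_similar`; whether the authors used that library is not stated in the text.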

Fig. 1. Internal workflow of our approach

The sample below shows a specific medical entity in RDF. The triples merge the original information with the enriched semantic data (‘dbpedia-owl:synonym’ and ‘dbpedia-owl:related’). This semantic data is transformed into property-graph format to create our BKG for later exploitation, because property graphs provide compact models and express complex traversal queries with good performance. For the semantic enrichment we added further properties from DBpedia (footnote 4) and the Human Disease Ontology (footnote 5) to define the labels and relations of medical synonyms and related terms. We also use the OCHV ontology mentioned above and the Symptom Ontology to encode specific symptom instances (footnote 6).

[figure a: RDF sample of a medical entity with enriched semantic data]
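The transformation from RDF triples to a property graph described above can be sketched as follows. This is a simplified illustration, not the authors' converter: the `icd:` prefix, the `rdfs:label` and `rdfs:subClassOf` predicates, the literal values and the function name `to_property_graph` are all assumptions. Only ‘dbpedia-owl:synonym’ and ‘dbpedia-owl:related’ come from the text.

```python
from collections import defaultdict

# Hypothetical triples for one ICD entity; prefix, label and values are illustrative.
TRIPLES = [
    ("icd:250.00", "rdfs:label", "Diabetes mellitus without mention of complication"),
    ("icd:250.00", "dbpedia-owl:synonym", "type 2 diabetes"),
    ("icd:250.00", "dbpedia-owl:related", "insulin"),
    ("icd:250.00", "dbpedia-owl:related", "glucose"),
    ("icd:250.00", "rdfs:subClassOf", "icd:250.0"),
]


def to_property_graph(triples, resource_prefix="icd:"):
    """Fold literal-valued triples into node properties; keep the rest as edges.

    Objects are treated as resources when they start with resource_prefix,
    which is a simplification of real RDF term typing.
    """
    nodes = defaultdict(lambda: defaultdict(list))
    edges = []
    for subject, predicate, obj in triples:
        nodes[subject]  # make sure the subject node exists
        if obj.startswith(resource_prefix):
            nodes[obj]  # make sure the target node exists
            edges.append((subject, predicate, obj))
        else:
            nodes[subject][predicate].append(obj)
    return {node: dict(props) for node, props in nodes.items()}, edges


nodes, edges = to_property_graph(TRIPLES)
```

In this representation each ICD entity becomes a single node carrying its synonyms and related terms as list-valued properties, while hierarchical relations stay as edges. That compactness is the property the text attributes to property graphs.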

2.3 Demonstration

Our Medical Coding solution has been developed for several languages (English, Spanish and Japanese). The enriched BKG is stored in Neo4j (footnote 7) because of its high read and write performance, which stems from its scalable architecture and native graph storage, and because of its wide support. We provide a web-based interface that lets users view clinical notes, their relevant entities and the associated standard codes, and analyze texts dynamically. Fig. 2 shows a screenshot of the demonstration. A demo video is also publicly available (footnote 8).

Fig. 2. Medical Coding system: (left) our web tool with a clinical note and its associated codes; (right) part of the enriched KG in Neo4j

3 Preliminary Evaluation and Results

For the evaluation, two main datasets were used: (1) MIMIC-III [4], a publicly available clinical database that contains English discharge summaries with ICD-9 codes annotated by medical experts, from which we selected 5000 random discharge summaries (MIMIC-5000); and (2) 200 private de-identified clinical notes in English, manually annotated by experts following the ICD-10 classification. We work with ~15000 ICD-10 codes and ~17000 ICD-9 codes, which shows that our approach can handle different standards and is not limited to particular subsets of codes. An advantage of our semantic approach is that it avoids distribution bias, because nothing is trained over a fixed target distribution: the solution is based on semantic knowledge and covers the whole code distribution at the same level.

Medical coding is a multi-class, multi-label problem. We provide three candidate codes for each clinical note, returning ICD-9 or ICD-10 codes depending on the dataset. We measured the F-score, with the confusion-matrix parameters defined as follows:

  • True Positives = the codes assigned by the experts are among those selected by our method.

  • False Positives = our method assigns codes to a text which does not have enough information to be coded according to the experts.

  • False Negatives = the experts assign codes and our method either is not able to assign any code or assigns codes different from those selected by the experts.

  • True Negatives = our method does not assign codes to a text which does not have enough information according to the experts.
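The definitions above can be turned into a small scoring sketch. It makes one interpretive assumption that the text does not settle: a note counts as a true positive when at least one expert code appears among the codes returned by the method. The function names and the example notes are hypothetical.

```python
def classify_note(expert_codes, predicted_codes):
    """Assign one confusion-matrix outcome to a note, following the definitions above.

    Assumption: any overlap between expert and predicted codes counts as a TP.
    """
    expert, predicted = set(expert_codes), set(predicted_codes)
    if not expert:
        # The experts found too little information to code the text.
        return "FP" if predicted else "TN"
    return "TP" if expert & predicted else "FN"


def f1_score(notes):
    """Compute counts, precision, recall and F1 over (expert, predicted) pairs."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for expert_codes, predicted_codes in notes:
        counts[classify_note(expert_codes, predicted_codes)] += 1
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return counts, precision, recall, f1


# Illustrative notes: (codes assigned by experts, up to three codes from the method).
notes = [
    (["250.00"], ["250.00", "401.9", "486"]),  # expert code among the candidates
    (["401.9"], ["486"]),                      # different codes returned
    ([], ["486"]),                             # method codes an uncodable text
    ([], []),                                  # both abstain
    (["486"], []),                             # method returns nothing
    (["486"], ["486"]),                        # exact match
]
counts, precision, recall, f1 = f1_score(notes)
```

A stricter reading, in which every expert code must be among the candidates, would only change the overlap test in `classify_note` to a subset check.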

We evaluate our approach with the F1-score, the harmonic mean of precision and recall. We achieve an F1-score of 0.75 on MIMIC-5000 and 0.72 on the 200 private clinical notes. In [2], the authors analyze current state-of-the-art methods for assigning ICD-9 codes to clinical notes; the best models achieve an F-score of 0.7233 when returning the top 10 ICD-9 codes. Our result on MIMIC-5000 exceeds this figure.

4 Conclusions and Future Work

We presented a prototype of an automatic Medical Coding solution based on semantic and NLP technologies. The evaluation results are competitive, with an F-score of 0.75 on the MIMIC-5000 dataset. Our work is easily adaptable to any medical classification and language, and does not need pre-annotated datasets. The preliminary evaluation covered English only.

Exploiting semantic enrichment and NLP techniques interlinked with the original data of the KG allows better mathematical models to be built for automatic medical coding: they provide more context for resolving the problem, and therefore better performance and results [2].

For future work, we will compare our approach against different neural network approaches. We will also provide further evaluation and comparisons between languages and methods. Moreover, we want to extend the system to more languages, apply cross-lingual approaches and analyze how to merge semantic and Deep Learning approaches.