Unsupervised Clinical Language Translation

Authors:

Peter Szolovits

KDD '19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

Pages 3121 - 3131

https://doi.org/10.1145/3292500.3330710

Published: 25 July 2019

Abstract

As patients' access to their doctors' clinical notes becomes common, translating professional clinical jargon into language a layperson can understand is essential to improving patient-clinician communication. Such translation yields better clinical outcomes by enhancing patients' understanding of their own health conditions, and thus improving patients' involvement in their own care. Existing research has used dictionary-based word replacement or definition insertion to address this need. However, these methods are limited by expert curation, which is hard to scale and has trouble generalizing to unseen datasets that do not share an overlapping vocabulary. In contrast, we approach the clinical word and sentence translation problem in a completely unsupervised manner. We show that a framework using representation learning, bilingual dictionary induction and statistical machine translation achieves a best precision at 10 (P@10) of 0.827 on professional-to-consumer word translation, and mean opinion scores of 4.10 and 4.28 out of 5 for clinical correctness and layperson readability, respectively, on sentence translation. Our fully unsupervised strategy overcomes the curation problem, and the clinically meaningful evaluation reduces biases from inappropriate evaluators; both are critical in clinical machine learning.
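The two evaluation measures named in the abstract can be illustrated with a short sketch. It assumes one common definition of precision at k (the fraction of source terms whose top-k ranked candidates contain at least one accepted translation) and treats a mean opinion score as the arithmetic mean of rater scores on a 1 to 5 scale. The function names, the toy terms, and the ratings are hypothetical and are not taken from the paper, whose exact evaluation protocol may differ.

```python
def precision_at_k(ranked_candidates, gold, k=10):
    """Fraction of source terms with at least one gold translation in the top k.

    ranked_candidates: dict mapping a source term to its ranked candidate list.
    gold: dict mapping a source term to the set of accepted translations.
    This is one common definition, assumed here for illustration.
    """
    if not ranked_candidates:
        raise ValueError("no source terms to evaluate")
    hits = 0
    for term, candidates in ranked_candidates.items():
        # A hit means any accepted translation appears among the first k candidates.
        if gold.get(term, set()) & set(candidates[:k]):
            hits += 1
    return hits / len(ranked_candidates)


def mean_opinion_score(ratings):
    """Arithmetic mean of rater scores (assumed to lie on a 1-5 scale)."""
    if not ratings:
        raise ValueError("no ratings given")
    return sum(ratings) / len(ratings)


# Hypothetical toy data: professional terms with ranked consumer candidates.
ranked = {
    "dyspnea": ["shortness of breath", "breathless", "cough"],
    "emesis": ["nausea", "vomiting", "sick"],
    "pyrexia": ["chills", "cold", "ache"],
}
gold = {
    "dyspnea": {"shortness of breath"},
    "emesis": {"vomiting"},
    "pyrexia": {"fever"},
}

p_at_1 = precision_at_k(ranked, gold, k=1)
p_at_2 = precision_at_k(ranked, gold, k=2)
mos = mean_opinion_score([4, 5, 4, 3])  # hypothetical rater scores
```

In this toy setup only "dyspnea" is matched at k=1, while "emesis" is also matched at k=2, so raising k can only keep or increase the score.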

Index Terms

Unsupervised Clinical Language Translation
1. Applied computing
  1. Life and medical sciences
    1. Consumer health
    2. Health informatics
2. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Machine translation
  2. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
    2. Machine learning approaches
      1. Learning latent representations

Published In

KDD '19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

July 2019

3305 pages

ISBN:9781450362016

DOI:10.1145/3292500

General Chairs: Ankur Teredesai (KenSci), Vipin Kumar (University of Minnesota)

Program Chairs: Ying Li (EV Analysis Corporation), Rómer Rosales (LinkedIn), Evimaria Terzi (Boston University), George Karypis (University of Minnesota)

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

MIT-IBM Watson AI Lab

Conference

KDD '19

KDD '19: The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 4 - 8, 2019

Anchorage, AK, USA

Acceptance Rates

KDD '19 Paper Acceptance Rate 110 of 1,200 submissions, 9%;

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
