Lana Yeganova

National Institutes of Health, National Library of Medicine, Department Member

Papers by Lana Yeganova

Findings of the WMT 2021 Biomedical Translation Shared Task: Summaries of Animal Experiments as New Test Set

In the sixth edition of the WMT Biomedical Task, we addressed a total of eight language pairs, namely English/German, English/French, English/Spanish, English/Portuguese, English/Chinese, English/Russian, English/Italian, and English/Basque. Further, our tests were composed of three types of textual test sets. New this year, we released a test set of summaries of animal experiments, in addition to the test sets of scientific abstracts and terminologies. We received a total of 107 submissions from 15 teams from 6 countries.

Additional file 1 of Discovering themes in biomedical literature using a projection-based algorithm

Analysis of the projection algorithm. The file provides the proof of convergence, and identifies ...

Findings of the WMT 2020 Biomedical Translation Shared Task: Basque, Italian and Russian as New Additional Languages

Machine translation of scientific abstracts and terminologies has the potential to support health professionals and biomedical researchers in some of their activities. In the fifth edition of the WMT Biomedical Task, we addressed a total of eight language pairs. Five language pairs were previously addressed in past editions of the shared task, namely, English/German, English/French, English/Spanish, English/Portuguese, and English/Chinese. Three additional language pairs were also introduced this year: English/Russian, English/Italian, and English/Basque. The task addressed the evaluation of both scientific abstracts (all language pairs) and terminologies (English/Basque only). We received submissions from a total of 20 teams. For recurring language pairs, we observed an improvement in the translations in terms of automatic scores and qualitative evaluations, compared to previous years.

PDC - a probabilistic distributional clustering algorithm: a case study on suicide articles in PubMed

AMIA Joint Summits on Translational Science proceedings, 2020

The need to organize a large collection in a manner that facilitates human comprehension is crucial given the ever-increasing volumes of information. In this work, we present PDC (probabilistic distributional clustering), a novel algorithm that, given a document collection, computes disjoint term sets representing topics in the collection. The algorithm relies on probabilities of word co-occurrences to partition the set of terms appearing in the collection of documents into disjoint groups of related terms. In this work, we also present an environment to visualize the computed topics in the term space and retrieve the most related PubMed articles for each group of terms. We illustrate the algorithm by applying it to PubMed documents on the topic of suicide. Suicide is a major public health problem identified as the tenth leading cause of death in the US. In this application, our goal is to provide a global view of the mental health literature pertaining to the subject of suicide, an...

Evolving use of ancestry, ethnicity, and race in genetics research—A survey spanning seven decades

The American Journal of Human Genetics, 2021

Navigating the landscape of COVID-19 research through literature analysis: A bird's eye view

ArXiv, 2020

Timely access to accurate scientific literature in the battle with the ongoing COVID-19 pandemic is critical. This unprecedented public health risk has motivated research towards understanding the disease in general, identifying drugs to treat the disease, developing potential vaccines, etc. This has given rise to a rapidly growing body of literature that doubles in number of publications every 20 days as of May 2020. Providing medical professionals with means to quickly analyze the literature and discover growing areas of knowledge is necessary for addressing their questions and information needs. In this study we analyze the LitCovid collection, 13,369 COVID-19 related articles found in PubMed as of May 15th, 2020, with the purpose of examining the landscape of literature and presenting it in a format that facilitates information navigation and understanding. We do that by applying state-of-the-art named entity recognition, classification, clustering and other NLP techniques. By app...

Discovering themes in biomedical literature using a projection-based algorithm

BMC Bioinformatics, 2018

MeSH-based dataset for measuring the relevance of text retrieval

Proceedings of the BioNLP 2018 workshop, 2018

SingleCite: Towards an improved Single Citation Search in PubMed

Proceedings of the BioNLP 2018 workshop, 2018

A Field Sensor: computing the composition and intent of PubMed queries

Database, 2018

PubMed Phrases, an open set of coherent phrases for searching biomedical literature

Scientific data, Jan 12, 2018

In biomedicine, key concepts are often expressed by multiple words (e.g., 'zinc finger protein'). Previous work has shown that treating a sequence of words as a meaningful unit, where applicable, is not only important for human understanding but also beneficial for automatic information seeking. Here we present a collection of PubMed Phrases that are beneficial for information retrieval and human comprehension. We define these phrases as coherent chunks that are logically connected. To collect the phrase set, we apply the hypergeometric test to detect segments of consecutive terms that are likely to appear together in PubMed. These text segments are then filtered using the BM25 ranking function to ensure that they are beneficial from an information retrieval perspective. Thus, we obtain a set of 705,915 PubMed Phrases. We evaluate the quality of the set by investigating PubMed user click data and manually annotating a sample of 500 randomly selected noun phrases. We also analyze ...

Reports on the 2012 AAAI Fall Symposium Series

AI Magazine, 2012

The Association for the Advancement of Artificial Intelligence was pleased to present the 2012 Fall Symposium Series, held Friday through Sunday, November 2–4, at the Westin Arlington Gateway in Arlington, Virginia. The titles of the eight symposia were as follows: AI for Gerontechnology (FS-12-01), Artificial Intelligence of Humor (FS-12-02), Discovery Informatics: The Role of AI Research in Innovating Scientific Processes (FS-12-03), Human Control of Bio-Inspired Swarms (FS-12-04), Information Retrieval and Knowledge Discovery in Biomedical Text (FS-12-05), Machine Aggregation of Human Judgment (FS-12-06), Robots Learning Interactively from Human Teachers (FS-12-07), and Social Networks and Social Contagion (FS-12-08). The highlights of each symposium are presented in this report.

PubTermVariants: biomedical term variants and their use for PubMed search

Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 2016

Summarizing Topical Contents from PubMed Documents Using a Thematic Analysis

Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015

Finding abbreviations in biomedical literature: three BioC-compatible modules and four BioC-formatted corpora

Database: the journal of biological databases and curation, 2014

BioC is a recently created XML format to share text data and annotations, and an accompanying input/output library to promote interoperability of data and tools for natural language processing of biomedical text. This article reports the use of BioC to address a common challenge in processing biomedical text information: that of frequent entity name abbreviation. We selected three different abbreviation definition identification modules, and used the publicly available BioC code to convert these independent modules into BioC-compatible components that interact seamlessly with BioC-formatted data, and other BioC-compatible modules. In addition, we consider four manually annotated corpora of abbreviations in biomedical text: the Ab3P corpus of 1250 PubMed abstracts, the BIOADI corpus of 1201 PubMed abstracts, the old MEDSTRACT corpus of 199 PubMed® citations and the Schwartz and Hearst corpus of 1000 PubMed abstracts. Annotations in these corpora have been re-evaluated by four annota...

Finding biomedical categories in Medline®

Journal of biomedical semantics, Jan 5, 2012

There are several humanly defined ontologies relevant to Medline. However, Medline is a fast-growing collection of biomedical documents, which creates difficulties in updating and expanding these humanly defined ontologies. Automatically identifying meaningful categories of entities in a large text corpus is useful for information extraction, construction of machine learning features, and development of semantic representations. In this paper we describe and compare two methods for automatically learning meaningful biomedical categories in Medline. The first approach is a simple statistical method that uses part-of-speech and frequency information to extract a list of frequent nouns from Medline. The second method implements an alignment-based technique to learn frequent generic patterns that indicate a hyponymy/hypernymy relationship between a pair of noun phrases. We then apply these patterns to Medline to collect frequent hypernyms as potential biomedical categories. We study and ...

Extracting drug-drug interactions from literature using a rich feature-based linear kernel approach

Journal of biomedical informatics, Jan 19, 2015

Identifying unknown drug interactions is of great benefit in the early detection of adverse drug reactions. Despite the existence of several resources for drug-drug interaction (DDI) information, the wealth of such information is buried in a body of unstructured medical text which is growing exponentially. This calls for developing text mining techniques for identifying DDIs. The state-of-the-art DDI extraction methods use Support Vector Machines (SVMs) with non-linear composite kernels to explore diverse contexts in literature. While computationally less expensive, linear kernel-based systems have not achieved a comparable performance in DDI extraction tasks. In this work, we propose an efficient and scalable system using a linear kernel to identify DDI information. The proposed approach consists of two steps: identifying DDIs and assigning one of four different DDI types to the predicted drug pairs. We demonstrate that when equipped with a rich set of lexical and syntactic features, a...

HMM-based System for Identification of Related Gene/Protein Names

Automatic Identification of Key Concepts in Large PubMed Retrievals

PubMed queries frequently retrieve thousands of documents, making it very challenging for a user to identify information of interest. In this paper we propose a method for automatically identifying central concepts in large PubMed retrievals. The centrality of a concept is modeled using the hypergeometric distribution. Retrieved documents are grouped by concept, which can help users navigate the retrieval. We test our method on five datasets, each representing a medical condition.
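
As a rough illustration of how a hypergeometric model can score the centrality of a concept in a retrieval, the sketch below computes the upper-tail probability of seeing a concept at least as often as observed among the retrieved documents, given its frequency in the whole collection. This is a generic textbook formulation, not the paper's actual scoring; the function name `hypergeom_tail`, the concept labels, and all counts are hypothetical.

```python
from math import comb


def hypergeom_tail(N: int, K: int, n: int, k: int) -> float:
    """P(X >= k) for X ~ Hypergeometric.

    N: documents in the whole collection
    K: documents in the collection that mention the concept
    n: documents in the retrieval
    k: retrieved documents that mention the concept
    """
    upper = min(K, n)
    # math.comb returns 0 when the second argument exceeds the first,
    # so impossible terms drop out of the sum automatically.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical counts: (concept, K in collection, k in retrieval).
N, n = 1000, 100
counts = [("concept_a", 50, 20), ("concept_b", 500, 55)]

# A smaller tail probability means the concept is more over-represented
# in the retrieval than chance would suggest.
ranked = sorted(counts, key=lambda c: hypergeom_tail(N, c[1], n, c[2]))
for name, K, k in ranked:
    print(name, hypergeom_tail(N, K, n, k))
```

Here `concept_a` ranks first: it is rare in the collection but frequent in the retrieval, whereas `concept_b` appears in the retrieval only slightly more often than its collection-wide rate would predict.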

Text mining techniques for leveraging positively labeled data
