In the sixth edition of the WMT Biomedical Task, we addressed a total of eight language pairs, na... more In the sixth edition of the WMT Biomedical Task, we addressed a total of eight language pairs, namely English/German, English/French, English/Spanish, English/Portuguese, English/Chinese, English/Russian, English/Italian, and English/Basque. Further, our tests were composed of three types of textual test sets. New to this year, we released a test set of summaries of animal experiments, in addition to the test sets of scientific abstracts and terminologies. We received a total of 107 submissions from 15 teams from 6 countries.
Analysis of the projection algorithm. The file provides the proof of convergence, and identifies ... more Analysis of the projection algorithm. The file provides the proof of convergence, and identifies a convenient stopping criterion for the projection algorithm. (CVS 149 kb)
Machine translation of scientific abstracts and terminologies has the potential to support health... more Machine translation of scientific abstracts and terminologies has the potential to support health professionals and biomedical researchers in some of their activities. In the fifth edition of the WMT Biomedical Task, we addressed a total of eight language pairs. Five language pairs were previously addressed in past editions of the shared task, namely, English/German, English/French, English/Spanish, English/Portuguese, and English/Chinese. Three additional languages pairs were also introduced this year: English/Russian, English/Italian, and English/Basque. The task addressed the evaluation of both scientific abstracts (all language pairs) and terminologies (English/Basque only). We received submissions from a total of 20 teams. For recurring language pairs, we observed an improvement in the translations in terms of automatic scores and qualitative evaluations, compared to previous years.
AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational Science, 2020
The need to organize a large collection in a manner that facilitates human comprehension is cruci... more The need to organize a large collection in a manner that facilitates human comprehension is crucial given the ever-increasing volumes of information. In this work, we present PDC (probabilistic distributional clustering), a novel algorithm that, given a document collection, computes disjoint term sets representing topics in the collection. The algorithm relies on probabilities of word co-occurrences to partition the set of terms appearing in the collection of documents into disjoint groups of related terms. In this work, we also present an environment to visualize the computed topics in the term space and retrieve the most related PubMed articles for each group of terms. We illustrate the algorithm by applying it to PubMed documents on the topic of suicide. Suicide is a major public health problem identified as the tenth leading cause of death in the US. In this application, our goal is to provide a global view of the mental health literature pertaining to the subject of suicide, an...
Timely access to accurate scientific literature in the battle with the ongoing COVID-19 pandemic ... more Timely access to accurate scientific literature in the battle with the ongoing COVID-19 pandemic is critical. This unprecedented public health risk has motivated research towards understanding the disease in general, identifying drugs to treat the disease, developing potential vaccines, etc. This has given rise to a rapidly growing body of literature that doubles in number of publications every 20 days as of May 2020. Providing medical professionals with means to quickly analyze the literature and discover growing areas of knowledge is necessary for addressing their question and information needs. In this study we analyze the LitCovid collection, 13,369 COVID-19 related articles found in PubMed as of May 15th, 2020 with the purpose of examining the landscape of literature and presenting it in a format that facilitates information navigation and understanding. We do that by applying state-of-the-art named entity recognition, classification, clustering and other NLP techniques. By app...
In biomedicine, key concepts are often expressed by multiple words (e.g., 'zinc finger protei... more In biomedicine, key concepts are often expressed by multiple words (e.g., 'zinc finger protein'). Previous work has shown treating a sequence of words as a meaningful unit, where applicable, is not only important for human understanding but also beneficial for automatic information seeking. Here we present a collection of PubMed Phrases that are beneficial for information retrieval and human comprehension. We define these phrases as coherent chunks that are logically connected. To collect the phrase set, we apply the hypergeometric test to detect segments of consecutive terms that are likely to appear together in PubMed. These text segments are then filtered using the BM25 ranking function to ensure that they are beneficial from an information retrieval perspective. Thus, we obtain a set of 705,915 PubMed Phrases. We evaluate the quality of the set by investigating PubMed user click data and manually annotating a sample of 500 randomly selected noun phrases. We also analyze ...
The Association for the Advancement of Artificial Intelligence was pleased to present the 2012 Fa... more The Association for the Advancement of Artificial Intelligence was pleased to present the 2012 Fall Symposium Series, held Friday through Sunday, November 2–4, at the Westin Arlington Gateway in Arlington, Virginia. The titles of the eight symposia were as follows: AI for Gerontechnology (FS-12-01), Artificial Intelligence of Humor (FS-12-02), Discovery Informatics: The Role of AI Research in Innovating Scientific Processes (FS-12-03), Human Control of Bio-Inspired Swarms (FS-12-04), Information Retrieval and Knowledge Discovery in Biomedical Text (FS-12-05), Machine Aggregation of Human Judgment (FS-12-06), Robots Learning Interactively from Human Teachers (FS-12-07), and Social Networks and Social Contagion (FS-12-08). The highlights of each symposium are presented in this report.
Database : the journal of biological databases and curation, 2014
BioC is a recently created XML format to share text data and annotations, and an accompanying inp... more BioC is a recently created XML format to share text data and annotations, and an accompanying input/output library to promote interoperability of data and tools for natural language processing of biomedical text. This article reports the use of BioC to address a common challenge in processing biomedical text information-that of frequent entity name abbreviation. We selected three different abbreviation definition identification modules, and used the publicly available BioC code to convert these independent modules into BioC-compatible components that interact seamlessly with BioC-formatted data, and other BioC-compatible modules. In addition, we consider four manually annotated corpora of abbreviations in biomedical text: the Ab3P corpus of 1250 PubMed abstracts, the BIOADI corpus of 1201 PubMed abstracts, the old MEDSTRACT corpus of 199 PubMed(®) citations and the Schwartz and Hearst corpus of 1000 PubMed abstracts. Annotations in these corpora have been re-evaluated by four annota...
There are several humanly defined ontologies relevant to Medline. However, Medline is a fast grow... more There are several humanly defined ontologies relevant to Medline. However, Medline is a fast growing collection of biomedical documents which creates difficulties in updating and expanding these humanly defined ontologies. Automatically identifying meaningful categories of entities in a large text corpus is useful for information extraction, construction of machine learning features, and development of semantic representations. In this paper we describe and compare two methods for automatically learning meaningful biomedical categories in Medline. The first approach is a simple statistical method that uses part-of-speech and frequency information to extract a list of frequent nouns from Medline. The second method implements an alignment-based technique to learn frequent generic patterns that indicate a hyponymy/hypernymy relationship between a pair of noun phrases. We then apply these patterns to Medline to collect frequent hypernyms as potential biomedical categories. We study and ...
Identifying unknown drug interactions is of great benefit in the early detection of adverse drug ... more Identifying unknown drug interactions is of great benefit in the early detection of adverse drug reactions. Despite existence of several resources for drug-drug interaction (DDI) information, the wealth of such information is buried in a body of unstructured medical text which is growing exponentially. This calls for developing text mining techniques for identifying DDIs. The state-of-the-art DDI extraction methods use Support Vector Machines (SVMs) with non-linear composite kernels to explore diverse contexts in literature. While computationally less expensive, linear kernel-based systems have not achieved a comparable performance in DDI extraction tasks. In this work, we propose an efficient and scalable system using a linear kernel to identify DDI information. The proposed approach consists of two steps: identifying DDIs and assigning one of four different DDI types to the predicted drug pairs. We demonstrate that when equipped with a rich set of lexical and syntactic features, a...
ABSTRACT PubMed queries frequently retrieve thousands of documents making it very challenging for... more ABSTRACT PubMed queries frequently retrieve thousands of documents making it very challenging for a user to identify information of interest. In this paper we propose a method for automatically identifying central concepts in large PubMed retrievals. The centrality of a concept is modeled using the hypergeometric distribution. Retrieved documents are grouped by concept, which can help users navigate the retrieval. We test our method on five datasets, each representing a medical condition.
In the sixth edition of the WMT Biomedical Task, we addressed a total of eight language pairs, na... more In the sixth edition of the WMT Biomedical Task, we addressed a total of eight language pairs, namely English/German, English/French, English/Spanish, English/Portuguese, English/Chinese, English/Russian, English/Italian, and English/Basque. Further, our tests were composed of three types of textual test sets. New to this year, we released a test set of summaries of animal experiments, in addition to the test sets of scientific abstracts and terminologies. We received a total of 107 submissions from 15 teams from 6 countries.
Analysis of the projection algorithm. The file provides the proof of convergence, and identifies ... more Analysis of the projection algorithm. The file provides the proof of convergence, and identifies a convenient stopping criterion for the projection algorithm. (CVS 149 kb)
Machine translation of scientific abstracts and terminologies has the potential to support health... more Machine translation of scientific abstracts and terminologies has the potential to support health professionals and biomedical researchers in some of their activities. In the fifth edition of the WMT Biomedical Task, we addressed a total of eight language pairs. Five language pairs were previously addressed in past editions of the shared task, namely, English/German, English/French, English/Spanish, English/Portuguese, and English/Chinese. Three additional languages pairs were also introduced this year: English/Russian, English/Italian, and English/Basque. The task addressed the evaluation of both scientific abstracts (all language pairs) and terminologies (English/Basque only). We received submissions from a total of 20 teams. For recurring language pairs, we observed an improvement in the translations in terms of automatic scores and qualitative evaluations, compared to previous years.
AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational Science, 2020
The need to organize a large collection in a manner that facilitates human comprehension is cruci... more The need to organize a large collection in a manner that facilitates human comprehension is crucial given the ever-increasing volumes of information. In this work, we present PDC (probabilistic distributional clustering), a novel algorithm that, given a document collection, computes disjoint term sets representing topics in the collection. The algorithm relies on probabilities of word co-occurrences to partition the set of terms appearing in the collection of documents into disjoint groups of related terms. In this work, we also present an environment to visualize the computed topics in the term space and retrieve the most related PubMed articles for each group of terms. We illustrate the algorithm by applying it to PubMed documents on the topic of suicide. Suicide is a major public health problem identified as the tenth leading cause of death in the US. In this application, our goal is to provide a global view of the mental health literature pertaining to the subject of suicide, an...
Timely access to accurate scientific literature in the battle with the ongoing COVID-19 pandemic ... more Timely access to accurate scientific literature in the battle with the ongoing COVID-19 pandemic is critical. This unprecedented public health risk has motivated research towards understanding the disease in general, identifying drugs to treat the disease, developing potential vaccines, etc. This has given rise to a rapidly growing body of literature that doubles in number of publications every 20 days as of May 2020. Providing medical professionals with means to quickly analyze the literature and discover growing areas of knowledge is necessary for addressing their question and information needs. In this study we analyze the LitCovid collection, 13,369 COVID-19 related articles found in PubMed as of May 15th, 2020 with the purpose of examining the landscape of literature and presenting it in a format that facilitates information navigation and understanding. We do that by applying state-of-the-art named entity recognition, classification, clustering and other NLP techniques. By app...
In biomedicine, key concepts are often expressed by multiple words (e.g., 'zinc finger protei... more In biomedicine, key concepts are often expressed by multiple words (e.g., 'zinc finger protein'). Previous work has shown treating a sequence of words as a meaningful unit, where applicable, is not only important for human understanding but also beneficial for automatic information seeking. Here we present a collection of PubMed Phrases that are beneficial for information retrieval and human comprehension. We define these phrases as coherent chunks that are logically connected. To collect the phrase set, we apply the hypergeometric test to detect segments of consecutive terms that are likely to appear together in PubMed. These text segments are then filtered using the BM25 ranking function to ensure that they are beneficial from an information retrieval perspective. Thus, we obtain a set of 705,915 PubMed Phrases. We evaluate the quality of the set by investigating PubMed user click data and manually annotating a sample of 500 randomly selected noun phrases. We also analyze ...
The Association for the Advancement of Artificial Intelligence was pleased to present the 2012 Fa... more The Association for the Advancement of Artificial Intelligence was pleased to present the 2012 Fall Symposium Series, held Friday through Sunday, November 2–4, at the Westin Arlington Gateway in Arlington, Virginia. The titles of the eight symposia were as follows: AI for Gerontechnology (FS-12-01), Artificial Intelligence of Humor (FS-12-02), Discovery Informatics: The Role of AI Research in Innovating Scientific Processes (FS-12-03), Human Control of Bio-Inspired Swarms (FS-12-04), Information Retrieval and Knowledge Discovery in Biomedical Text (FS-12-05), Machine Aggregation of Human Judgment (FS-12-06), Robots Learning Interactively from Human Teachers (FS-12-07), and Social Networks and Social Contagion (FS-12-08). The highlights of each symposium are presented in this report.
Database : the journal of biological databases and curation, 2014
BioC is a recently created XML format to share text data and annotations, and an accompanying inp... more BioC is a recently created XML format to share text data and annotations, and an accompanying input/output library to promote interoperability of data and tools for natural language processing of biomedical text. This article reports the use of BioC to address a common challenge in processing biomedical text information-that of frequent entity name abbreviation. We selected three different abbreviation definition identification modules, and used the publicly available BioC code to convert these independent modules into BioC-compatible components that interact seamlessly with BioC-formatted data, and other BioC-compatible modules. In addition, we consider four manually annotated corpora of abbreviations in biomedical text: the Ab3P corpus of 1250 PubMed abstracts, the BIOADI corpus of 1201 PubMed abstracts, the old MEDSTRACT corpus of 199 PubMed(®) citations and the Schwartz and Hearst corpus of 1000 PubMed abstracts. Annotations in these corpora have been re-evaluated by four annota...
There are several humanly defined ontologies relevant to Medline. However, Medline is a fast grow... more There are several humanly defined ontologies relevant to Medline. However, Medline is a fast growing collection of biomedical documents which creates difficulties in updating and expanding these humanly defined ontologies. Automatically identifying meaningful categories of entities in a large text corpus is useful for information extraction, construction of machine learning features, and development of semantic representations. In this paper we describe and compare two methods for automatically learning meaningful biomedical categories in Medline. The first approach is a simple statistical method that uses part-of-speech and frequency information to extract a list of frequent nouns from Medline. The second method implements an alignment-based technique to learn frequent generic patterns that indicate a hyponymy/hypernymy relationship between a pair of noun phrases. We then apply these patterns to Medline to collect frequent hypernyms as potential biomedical categories. We study and ...
Identifying unknown drug interactions is of great benefit in the early detection of adverse drug ... more Identifying unknown drug interactions is of great benefit in the early detection of adverse drug reactions. Despite existence of several resources for drug-drug interaction (DDI) information, the wealth of such information is buried in a body of unstructured medical text which is growing exponentially. This calls for developing text mining techniques for identifying DDIs. The state-of-the-art DDI extraction methods use Support Vector Machines (SVMs) with non-linear composite kernels to explore diverse contexts in literature. While computationally less expensive, linear kernel-based systems have not achieved a comparable performance in DDI extraction tasks. In this work, we propose an efficient and scalable system using a linear kernel to identify DDI information. The proposed approach consists of two steps: identifying DDIs and assigning one of four different DDI types to the predicted drug pairs. We demonstrate that when equipped with a rich set of lexical and syntactic features, a...
ABSTRACT PubMed queries frequently retrieve thousands of documents making it very challenging for... more ABSTRACT PubMed queries frequently retrieve thousands of documents making it very challenging for a user to identify information of interest. In this paper we propose a method for automatically identifying central concepts in large PubMed retrievals. The centrality of a concept is modeled using the hypergeometric distribution. Retrieved documents are grouped by concept, which can help users navigate the retrieval. We test our method on five datasets, each representing a medical condition.
Uploads
Papers by Lana Yeganova