This article presents an ethical perspective on the project described in (Pestian et al., 2012b). The annotation campaign in question aimed to produce a corpus of suicide notes annotated for emotions. The annotators were either relatives or friends of suicide victims, or mental health professionals. We call these volunteer annotators, who volunteered in order to advance research, invested volunteers. This project raises a number of ethical questions, notably concerning the role of the annotators' empathy, the possible effects on the annotators themselves, and the potential uses of the results obtained. We conclude with an analysis of the corpus from the point of view of the Charte Éthique et Big Data. Abstract. Annotating suicide notes: ethical issues at a glance. According to the World Health Organization, 800,000 people die of suicide every year. About 20% of them leave a written message. This paper discusses a corpus of such messages. The corpus ...
2018 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), 2018
In order to develop a gold corpus for the Biomedical Natural Language Processing community, for the sake of knowledge discovery in drug repurposing, an active gene annotation corpus (AGAC) was developed in this research. Five semantic trigger labels and three root regulatory trigger labels were designed as molecular- and cell-level biological entity annotations, focusing on information about function changes in biological processes resulting from mutated genes. In addition, the predicates 'ThemeOf' and 'CauseOf' were also annotated manually for semantic knowledge extraction. Eventually, roles of gene mutation, including gain of function (GOF) and loss of function (LOF), were curated following the AGAC annotation guideline. The information from AGAC annotation effectively bridges the associations between mutation, gene, drug and disease, and makes it possible to predict new drug directions on a large scale. AGAC corpus availability: The corpus is available on the PubAnnotation platform: http://pubannotation.org/projects/HZAU_Active_Gene_Corpus.
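To make the annotation scheme concrete, here is a minimal sketch of what a trigger-and-predicate annotation could look like in a PubAnnotation-style JSON document. The label names ("Var", "MPA", "LOF"), the example sentence, and the span offsets are all illustrative assumptions, not taken from the actual AGAC guideline.

```python
import json

# Hypothetical AGAC-style standoff annotation in PubAnnotation JSON form:
# denotation spans carry trigger labels, and relations link them with the
# 'ThemeOf' / 'CauseOf' predicates. All labels and offsets are invented
# for illustration.
doc = {
    "text": "The R175H mutation causes loss of p53 transactivation activity.",
    "denotations": [
        {"id": "T1", "span": {"begin": 4, "end": 9}, "obj": "Var"},
        {"id": "T2", "span": {"begin": 34, "end": 62}, "obj": "MPA"},
        {"id": "T3", "span": {"begin": 26, "end": 30}, "obj": "LOF"},
    ],
    "relations": [
        {"id": "R1", "pred": "ThemeOf", "subj": "T2", "obj": "T3"},
        {"id": "R2", "pred": "CauseOf", "subj": "T1", "obj": "T3"},
    ],
}

# Round-trip through JSON and recover the surface mention for each span,
# which is how standoff offsets are checked against the source text.
loaded = json.loads(json.dumps(doc))
for d in loaded["denotations"]:
    s = d["span"]
    d["mention"] = loaded["text"][s["begin"]:s["end"]]

print([d["mention"] for d in loaded["denotations"]])
# → ['R175H', 'p53 transactivation activity', 'loss']
```

Keeping annotations as character-offset standoff rather than inline markup is what lets a platform like PubAnnotation layer multiple annotation projects over the same immutable text.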
Welcome to the HLT-NAACL'06 BioNLP Workshop, Linking Natural Language Processing and Biology: Towards Deeper Biological Literature Analysis. The late 1990s saw the beginning of a trend towards significant growth in the area of biomedical language processing, and in particular in the use of natural language processing techniques in the molecular biology and related computational bioscience domains. The figure below gives an indication of the amount of recent activity in this area: it shows the cumulative number of documents returned by searching PubMed, the premier repository of biomedical scientific literature, with the query ((natural language processing) OR (text mining)) AND (gene OR protein), limiting the search by year for every year from 1999 through 2005: the three papers in 1999 had grown to 227 by the end of 2005. Significant challenges to biological literature exploitation remain, in particular for such biological problem areas as automated function prediction and pathway reconstruction, and for linguistic applications such as relation extraction and abstractive summarization. In light of the nature of these remaining challenges, the focus of this workshop was intended to be applications that move towards deeper semantic analysis. We particularly solicited work that addresses relatively under-explored areas such as summarization and question answering from biological information. Papers describing applications of semantic processing technologies to the biology domain were especially invited. That is, the primary topics of interest were applications which require deeper linguistic analysis of the biological literature. We also solicited papers exploring issues in porting NLP systems originally constructed for other domains to the biology domain. What makes the biology domain special?
What hurdles must be overcome in performing linguistic analysis of biological text? Are any special linguistic or knowledge resources required, beyond a domain-specific lexicon? What relations in biological text are most interesting to biologists, and hence should be the focus of our future efforts? The workshop received 31 submissions: 29 full-paper submissions and two poster submissions. A strong program committee, representing BioNLP researchers in North America, Europe, and Asia, provided thorough reviews, resulting in the acceptance of eleven full papers and nineteen posters, for a full-paper acceptance rate of 38% (11/29), which we believe made this one of the most competitive BioNLP workshop or conference sessions to date. A notable trend in the accepted papers is that only one of them was on the topic of entity identification. The subject areas of the papers presented at BioNLP'06 covered an exceptionally wide range of topics: question answering, computational lexical semantics, information extraction, entity normalization, semantic role labelling, image classification, and syntactic aspects of the sublanguage of molecular biology.
There is a growing concern that the reproducibility of much work in science in general, and in the biomedical field in particular, is questionable. In order to improve the robustness of methods in biomedical Natural Language Processing (NLP), we need to assess the current situation in the field, after recent experiments on a shared task suggested that reproducibility cannot be taken for granted even under favorable conditions where data, code and evaluation toolkits are available. Nonetheless, data and code availability is the first requirement to enable reproducibility.
Computational linguistics has its origins in the post-Second World War research in the United States on the translation of Russian-language scientific journal articles. Today, biomedical natural language processing treats clinical data, the scientific literature, and social media, with use cases ranging from studying adverse effects of drugs to interpreting high-throughput genomic assays (Névéol and Zweigenbaum 2018). Many of the most prominent research areas in the field involve extracting information from text and normalizing it to enormous databases of domain-relevant semantic classes, such as genes, diseases, and biological processes. Moving forward, the field is expected to play a significant role in understanding reproducibility in natural language processing.
Node embedding of biological entity networks has been widely investigated for downstream application scenarios. To embed the full semantics of genes and diseases, a multi-relational heterogeneous graph is considered in a scenario where uni-relations between genes/diseases and other heterogeneous entities are abundant, while multi-relations between genes and diseases are relatively sparse. After introducing this novel graph format, it is illuminating to design a specific data integration algorithm to fully capture the graph information and produce high-quality embeddings. First, a typical multi-relational triple dataset was introduced, which carried significant associations between genes and diseases. Second, we curated all human genes and diseases in seven mainstream datasets and constructed a large-scale gene-disease network, which comprises 163,024 nodes and 25,265,607 edges, and relates to 27,165 genes, 2,665 diseases, 15,067 chemicals, 108,023 mutations, 2,363 pathways, and 7,732 phenotypes. Third, we proposed a Joint Decomposition of Heterogeneous Matrix and Tensor (JDHMT) model, which integrated all heterogeneous data resources and obtained an embedding for each gene or disease. Fourth, a visualized intrinsic evaluation was performed, which investigated the embeddings in terms of interpretable data clustering. Furthermore, an extrinsic evaluation was performed in the form of link prediction. Both intrinsic and extrinsic evaluation results showed that the JDHMT model outperformed eleven other state-of-the-art (SOTA) methods under relation-learning, proximity-preserving or message-passing paradigms. Finally, the constructed gene-disease network, embedding results and code were made available. The constructed massive gene-disease network is available at: https://hzaubionlp.com/heterogeneous-biological-network/.
The code is available at: https://github.com/bionlp-hzau/JDHMT.
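The core idea of a joint matrix-tensor decomposition can be sketched on toy data: a sparse multi-relational gene-disease tensor and a dense uni-relational side matrix share the same gene factor, so the abundant uni-relational evidence regularizes the sparse part. This is a minimal gradient-descent sketch of that sharing scheme, not the published JDHMT model; all dimensions, the squared loss, and the learning rate are assumptions.

```python
import numpy as np

# Toy joint factorization: tensor X (genes x diseases x relations) gets a
# CP-style reconstruction from factors G, D, R; side matrix M (genes x
# chemicals) is reconstructed as G @ C.T. The gene factor G is SHARED, so
# both data sources shape the gene embeddings. Shapes are illustrative.
rng = np.random.default_rng(0)
n_gene, n_dis, n_rel, n_chem, k = 8, 6, 3, 5, 4

X = rng.random((n_gene, n_dis, n_rel))   # multi-relational tensor
M = rng.random((n_gene, n_chem))         # uni-relational side matrix

G = rng.normal(scale=0.1, size=(n_gene, k))
D = rng.normal(scale=0.1, size=(n_dis, k))
R = rng.normal(scale=0.1, size=(n_rel, k))
C = rng.normal(scale=0.1, size=(n_chem, k))

def loss():
    xhat = np.einsum("ik,jk,rk->ijr", G, D, R)   # CP reconstruction
    return np.sum((X - xhat) ** 2) + np.sum((M - G @ C.T) ** 2)

lr = 0.01
start = loss()
for _ in range(300):
    E = np.einsum("ik,jk,rk->ijr", G, D, R) - X  # tensor residual
    Em = G @ C.T - M                             # matrix residual
    # Gradients of the joint squared loss; G collects terms from BOTH parts.
    gG = 2 * np.einsum("ijr,jk,rk->ik", E, D, R) + 2 * Em @ C
    gD = 2 * np.einsum("ijr,ik,rk->jk", E, G, R)
    gR = 2 * np.einsum("ijr,ik,jk->rk", E, G, D)
    gC = 2 * Em.T @ G
    G -= lr * gG
    D -= lr * gD
    R -= lr * gR
    C -= lr * gC

print(f"loss: {start:.2f} -> {loss():.2f}")  # rows of G are gene embeddings
```

After training, each row of `G` is a dense embedding of one gene that reflects both its sparse disease relations and its abundant chemical associations, which is the intuition behind integrating heterogeneous sources in one decomposition.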
Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, 2018
In an era when massive amounts of medical data have become available, researchers working in biological, biomedical and clinical domains have increasingly come to require the help of language engineers to process large quantities of biomedical and molecular biology literature, patient data and health records. With such a huge number of reports, evaluating their impact has long ceased to be a trivial task. Linking the contents of these documents to each other, as well as to specialized ontologies, could enable access to and discovery of structured clinical information and foster a major leap in natural language processing and health research.
This paper reports on Task 2 of the 2016 CLEF eHealth evaluation lab, which extended the previous information extraction tasks of the ShARe/CLEF eHealth evaluation labs. The task continued with named entity recognition and normalization in French narratives, as offered in CLEF eHealth 2015. Named entity recognition involved ten types of entities, including disorders, that were defined according to Semantic Groups in the Unified Medical Language System® (UMLS®), which was also used for normalizing the entities. In addition, we introduced a large-scale classification task on French death certificates, which consisted of extracting causes of death as coded in the International Classification of Diseases, tenth revision (ICD10). Participant systems were evaluated against a blind reference standard of 832 titles of scientific articles indexed in MEDLINE, 4 drug monographs published by the European Medicines Agency (EMEA) and 27,850 death certificates, using Precision, Recall and F-measure. In to...
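The Precision/Recall/F-measure scoring used in such evaluation labs reduces to set comparison of predicted versus gold annotations. A minimal sketch, where each annotation is a (begin, end, code) triple; the example spans and ICD-10-style codes are invented, and real lab scorers add details (relaxed span matching, per-document micro-averaging) not shown here.

```python
# Score a system's (begin, end, code) annotations against a gold reference.
# A prediction counts as a true positive only if span AND code both match.
gold = {(0, 12, "I10"), (15, 28, "E11"), (30, 40, "J45")}
pred = {(0, 12, "I10"), (15, 28, "E10"), (30, 40, "J45"), (42, 50, "K21")}

tp = len(gold & pred)                 # exact matches: 2
precision = tp / len(pred)            # 2/4 = 0.5
recall = tp / len(gold)               # 2/3
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
# → P=0.500 R=0.667 F1=0.571
```

Note how the wrong-code prediction (E10 vs E11) costs both precision and recall: it is simultaneously a false positive and a missed gold entity, which is why normalization errors are penalized twice in this style of scoring.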
2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2017
We propose a novel, semantic-reasoning-based approach to look for potentially adverse drug-drug interactions (DDIs) by using a knowledge base of biomedical public ontologies and datasets in a semantic graph representation. This approach makes it possible to find previously unknown relations between different biological entities such as drugs, proteins and biological processes, and to perform inferences on those relations. Finding nodes that represent drugs in this semantic graph, and intersecting pathways between these nodes (e.g. intersecting at a metabolic pathway step described in Reactome [1] data), can yield novel drug-drug interactions. The resulting pathways not only describe drug-drug interactions reflected in the literature, but also unstudied interactions that could elucidate reported adverse effects.
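The pathway-intersection idea can be illustrated on a toy graph: walk outward from two drug nodes and flag the entities reachable from both (e.g. a shared metabolic step) as candidate interaction points. The graph below is hand-made for illustration, not real Reactome or ontology data, and a real system would reason over typed edges rather than plain adjacency.

```python
from collections import deque

# Toy semantic graph: drugs point to the proteins and pathway steps they
# touch. Edges are untyped here purely for brevity.
graph = {
    "drugA": ["CYP3A4"],
    "drugB": ["CYP3A4", "P-gp"],
    "CYP3A4": ["oxidation_step"],
    "P-gp": ["efflux_step"],
    "oxidation_step": [],
    "efflux_step": [],
}

def reachable(graph, start):
    """Breadth-first search: all nodes reachable from `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Nodes on intersecting pathways are candidate DDI mechanisms.
shared = reachable(graph, "drugA") & reachable(graph, "drugB")
print(sorted(shared))
# → ['CYP3A4', 'oxidation_step']
```

Here both drugs reach the CYP3A4 node and its downstream oxidation step, the classic shape of a metabolic interaction; in the full approach, inference over ontology relations would then qualify whether such an intersection is plausibly adverse.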
Background A novel disease poses special challenges for informatics solutions. Biomedical informatics relies for the most part on structured data, which require a preexisting data or knowledge model; however, novel diseases do not have preexisting knowledge models. In an emergent epidemic, language processing can enable rapid conversion of unstructured text to a novel knowledge model. However, although this idea has often been suggested, no opportunity has arisen to actually test it in real time. The current coronavirus disease (COVID-19) pandemic presents such an opportunity. Objective The aim of this study was to evaluate the added value of information from clinical text in response to emergent diseases using natural language processing (NLP). Methods We explored the effects of long-term treatment with calcium channel blockers on the outcomes of COVID-19 infection in patients with high blood pressure during in-patient hospital stays, using two sources of information: data available s...