DOI: 10.1145/1516360.1516498

High-performance information extraction with AliBaba

Published: 24 March 2009

Abstract

A wealth of information is available only in web pages, patents, publications, and similar text sources. Extracting information from such sources is challenging, both because of the typically complex language processing steps required and because of the potentially large number of texts that need to be analyzed. Furthermore, integrating the extracted data with other sources of knowledge is often mandatory for subsequent analysis. In this demo, we present the AliBaba system for scalable information extraction from biomedical documents. Unlike many other systems, AliBaba performs both entity extraction and relationship extraction, and it graphically visualizes the resulting network of interconnected objects. It leverages the PubMed search engine to select relevant documents. The technical novelty of AliBaba is twofold: (a) its ability to automatically learn language patterns for relationship extraction without an annotated corpus, and (b) its high-performance pattern matching algorithm. We show that a simple yet effective pattern filtering technique drastically reduces the runtime of the system without harming its extraction effectiveness. Although AliBaba has been implemented for biomedical texts, its underlying principles should also be applicable in other domains.

Published In

EDBT '09: Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
March 2009
1180 pages
ISBN:9781605584225
DOI:10.1145/1516360
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Conference

EDBT/ICDT '09
EDBT/ICDT '09: EDBT/ICDT '09 joint conference
March 24 - 26, 2009
Saint Petersburg, Russia

Acceptance Rates

Overall Acceptance Rate 7 of 10 submissions, 70%

Article Metrics

  • Downloads (Last 12 months): 48
  • Downloads (Last 6 weeks): 7
Reflects downloads up to 10 Oct 2024

Cited By

  • (2020) Computational discovery of plant-based inhibitors against human carbonic anhydrase IX and molecular dynamics simulation. Journal of Biomolecular Structure and Dynamics 39:8 (2754-2770). DOI: 10.1080/07391102.2020.1753579. Online publication date: 29-Apr-2020
  • (2015) Molecular-docking study of malaria drug target enzyme transketolase in Plasmodium falciparum 3D7 portends the novel approach to its treatment. Source Code for Biology and Medicine 10:1. DOI: 10.1186/s13029-015-0037-3. Online publication date: 22-May-2015
  • (2015) Wide-coverage relation extraction from MEDLINE using deep syntax. BMC Bioinformatics 16:1. DOI: 10.1186/s12859-015-0538-8. Online publication date: 1-Apr-2015
  • (2014) KnowLife: A knowledge graph for health and life sciences. 2014 IEEE 30th International Conference on Data Engineering (1254-1257). DOI: 10.1109/ICDE.2014.6816754. Online publication date: Mar-2014
  • (2012) Text mining in livestock animal science: Introducing the potential of text mining to animal sciences. Journal of Animal Science 90:10 (3666-3676). DOI: 10.2527/jas.2011-4841. Online publication date: 1-Oct-2012
  • (2012) A new approach to the design of knowledge base using XCLS clustering. International Conference on Pattern Recognition, Informatics and Medical Engineering (PRIME-2012) (14-19). DOI: 10.1109/ICPRIME.2012.6208280. Online publication date: Mar-2012
  • (2012) Regular path queries on large graphs. Proceedings of the 24th international conference on Scientific and Statistical Database Management (177-194). DOI: 10.1007/978-3-642-31235-9_12. Online publication date: 25-Jun-2012
  • (2011) Your Personal, Virtual Librarian. Interdisciplinary Advances in Adaptive and Intelligent Assistant Systems (199-234). DOI: 10.4018/978-1-61520-851-7.ch009. Online publication date: 2011
  • (2011) Enabling information extraction by inference of regular expressions from sample entities. Proceedings of the 20th ACM international conference on Information and knowledge management (1285-1294). DOI: 10.1145/2063576.2063763. Online publication date: 24-Oct-2011
  • (2010) Simple tricks for improving pattern-based information extraction from the biomedical literature. Journal of Biomedical Semantics 1:1. DOI: 10.1186/2041-1480-1-9. Online publication date: 24-Sep-2010
