IBEX: Harvesting Entities from the Web Using Unique Identifiers

Authors:

Aliaksandr Talaika,

Antoine Amarilli,

Fabian M. Suchanek

WebDB'15: Proceedings of the 18th International Workshop on Web and Databases

Pages 13 - 19

https://doi.org/10.1145/2767109.2767116

Published: 31 May 2015

Abstract

In this paper, we study the prevalence of unique entity identifiers on the Web. Examples include ISBNs (for books), GTINs (for commercial products), DOIs (for documents), and email addresses. We show how these identifiers can be harvested systematically from Web pages, and how they can be associated with human-readable names for the entities at large scale.
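As a rough illustration of what harvesting one identifier type can look like, the sketch below pulls 13-digit GTIN candidates out of page text and keeps only those whose check digit is consistent. The check-digit rule (alternating weights 1 and 3 over the first twelve digits) is the standard GS1 one; the regular expression and the function names `gtin13_is_valid` and `extract_gtin13` are assumptions made for this example and are not taken from the paper.

```python
import re

# Hypothetical pattern: a run of exactly 13 ASCII digits that is not part
# of a longer digit string. Real pages would need more tolerant matching
# (spaces, hyphens, prefixes such as "EAN:").
GTIN13_RE = re.compile(r"(?<!\d)\d{13}(?!\d)", re.ASCII)


def gtin13_is_valid(code: str) -> bool:
    """Return True if `code` is 13 ASCII digits with a correct GS1 check digit."""
    if len(code) != 13 or not (code.isascii() and code.isdigit()):
        return False
    digits = [int(c) for c in code]
    # Weights alternate 1, 3, 1, 3, ... over the first 12 digits.
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == digits[12]


def extract_gtin13(text: str) -> list[str]:
    """Return all check-digit-valid GTIN-13 candidates found in `text`, in order."""
    return [m for m in GTIN13_RE.findall(text) if gtin13_is_valid(m)]


if __name__ == "__main__":
    page = "ISBN 9780306406157, EAN 4006381333931, order no. 1234567890123"
    print(extract_gtin13(page))
```

The checksum test discards most random 13-digit strings (order numbers, phone-like sequences), which is one reason identifiers with built-in redundancy are attractive extraction targets.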

Starting with a simple extraction of identifiers and names from Web pages, we show how we can use the properties of unique identifiers to filter out noise and clean up the extraction result on the entire corpus. The end result is a database of millions of uniquely identified entities of different types, with an accuracy of 73% to 96% and a very high coverage compared to existing knowledge bases. We use this database to compute novel statistics on the presence of products, people, and other entities on the Web.
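One property that could be exploited for cleaning is that a unique identifier denotes a single entity, so names collected for the same identifier across many pages should mostly agree. The sketch below is a simple majority-vote consolidation under that assumption; it is not the paper's actual pipeline, and the function name `consolidate` and the thresholds `min_support` and `min_share` are hypothetical choices for illustration.

```python
from collections import Counter, defaultdict


def consolidate(pairs, min_support=2, min_share=0.5):
    """Map each identifier to its dominant name.

    `pairs` is an iterable of (identifier, name) tuples as extracted from
    individual pages. An identifier is kept only if its most frequent
    (normalized) name was seen at least `min_support` times and accounts
    for more than `min_share` of all names seen with that identifier.
    """
    names_by_id = defaultdict(Counter)
    for identifier, name in pairs:
        # Crude normalization; a real system would do considerably more.
        names_by_id[identifier][name.strip().lower()] += 1

    result = {}
    for identifier, counts in names_by_id.items():
        name, freq = counts.most_common(1)[0]
        if freq >= min_support and freq / sum(counts.values()) > min_share:
            result[identifier] = name
    return result


if __name__ == "__main__":
    extracted = [
        ("9780306406157", "Some Book"),
        ("9780306406157", "some book "),
        ("9780306406157", "Buy now"),   # noise picked up next to the identifier
        ("4006381333931", "Pen"),       # seen only once, so not trusted
    ]
    print(consolidate(extracted))
```

Identifiers seen only once, or with no clearly dominant name, are dropped rather than guessed, trading some coverage for accuracy.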

Cited By

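Suchanek F, Luu A (2023). Knowledge Bases and Language Models: Complementing Forces. Rules and Reasoning, pp. 3-15. Online publication date: 15-Oct-2023. https://doi.org/10.1007/978-3-031-45072-3_1

Zhao J, Wang J, Yang J, Jin P (2019). Company Acquisition Relations Extraction From Web Pages. Semantic Web Science and Real-World Applications, pp. 1-17. Online publication date: 2019. https://doi.org/10.4018/978-1-5225-7186-5.ch001

Weikum G, Hoffart J, Suchanek F (2019). Knowledge Harvesting: Achievements and Challenges. Computing and Software Science, pp. 217-235. Online publication date: 2019. https://doi.org/10.1007/978-3-319-91908-9_13

Yan C, He Y, Das G, Jermaine C, Bernstein P (2018). Synthesizing Type-Detection Logic for Rich Semantic Data Types using Open-source Code. Proceedings of the 2018 International Conference on Management of Data, pp. 35-50. Online publication date: 27-May-2018. https://dl.acm.org/doi/10.1145/3183713.3196888

Rebele T, Suchanek F, Hoffart J, Biega J, Kuzey E, Weikum G (2016). YAGO: A Multilingual Knowledge Base from Wikipedia, Wordnet, and Geonames. The Semantic Web – ISWC 2016, pp. 177-185. Online publication date: 23-Sep-2016. https://doi.org/10.1007/978-3-319-46547-0_19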

Index Terms

Information systems

Published In

WebDB'15: Proceedings of the 18th International Workshop on Web and Databases

May 2015

75 pages

ISBN: 9781450336277

DOI: 10.1145/2767109

Editors: Julia Stoyanovich (Drexel University), Fabian M. Suchanek (Télécom ParisTech)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

SIGMOD/PODS'15

Sponsor:

SIGMOD

SIGMOD/PODS'15: International Conference on Management of Data

May 31 - June 4, 2015

Melbourne, VIC, Australia

Acceptance Rates

WebDB'15 Paper Acceptance Rate 9 of 31 submissions, 29%;

Overall Acceptance Rate 30 of 100 submissions, 30%
