Research article, DOI: 10.1145/2767109.2767116

IBEX: Harvesting Entities from the Web Using Unique Identifiers

Published: 31 May 2015

Abstract

In this paper we study the prevalence of unique entity identifiers on the Web. Examples include ISBNs (for books), GTINs (for commercial products), DOIs (for documents), and email addresses. We show how these identifiers can be harvested systematically from Web pages, and how they can be associated with human-readable names for the entities at large scale.
Starting with a simple extraction of identifiers and names from Web pages, we show how the properties of unique identifiers can be used to filter out noise and clean up the extraction result over the entire corpus. The end result is a database of millions of uniquely identified entities of different types, with an accuracy of 73-96% and very high coverage compared to existing knowledge bases. We use this database to compute novel statistics on the presence of products, people, and other entities on the Web.
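
The abstract does not say which properties of identifiers are used for filtering. One well-known property of ISBNs and GTINs is their check digit, so a minimal sketch of a noise filter, assuming check-digit validation is one of the properties in play, could look as follows. The function names (`gtin13_is_valid`, `isbn10_is_valid`, `filter_candidates`) and the `(identifier, name)` pair format are hypothetical and are not taken from the paper.

```python
import re


def _strip_separators(code: str) -> str:
    """Remove spaces and hyphens, which commonly appear in printed identifiers."""
    return re.sub(r"[\s-]", "", code)


def gtin13_is_valid(code: str) -> bool:
    """Check the check digit of a 13-digit GTIN (ISBN-13 uses the same scheme)."""
    digits = _strip_separators(code)
    if not re.fullmatch(r"\d{13}", digits):
        return False
    # Weights alternate 1, 3, 1, 3, ... over the first 12 digits.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == int(digits[12])


def isbn10_is_valid(code: str) -> bool:
    """Check the check digit of an ISBN-10; 'X' stands for the value 10."""
    chars = _strip_separators(code).upper()
    if not re.fullmatch(r"\d{9}[\dX]", chars):
        return False
    values = [10 if c == "X" else int(c) for c in chars]
    # Weights run from 10 down to 1; the weighted sum must be divisible by 11.
    return sum((10 - i) * v for i, v in enumerate(values)) % 11 == 0


def filter_candidates(candidates):
    """Keep only (identifier, name) pairs whose identifier passes a checksum.

    Assumption: candidates come from a naive extraction step that also picks
    up arbitrary digit strings, so a checksum test removes part of the noise.
    """
    return [
        (ident, name)
        for ident, name in candidates
        if gtin13_is_valid(ident) or isbn10_is_valid(ident)
    ]


if __name__ == "__main__":
    extracted = [
        ("9781450336277", "WebDB'15 proceedings"),  # ISBN listed on this page
        ("9781450336278", "digit string with a wrong check digit"),
        ("0-306-40615-2", "a hyphenated ISBN-10"),
    ]
    for ident, name in filter_candidates(extracted):
        print(ident, name)
```

A checksum test only rejects strings that cannot be identifiers; it says nothing about whether the extracted name really belongs to the identifier, which is a separate cleaning step.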


Published In

WebDB'15: Proceedings of the 18th International Workshop on Web and Databases
May 2015
75 pages
ISBN:9781450336277
DOI:10.1145/2767109

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SIGMOD/PODS'15: International Conference on Management of Data
May 31 - June 4, 2015
Melbourne, VIC, Australia

Acceptance Rates

WebDB'15 Paper Acceptance Rate 9 of 31 submissions, 29%;
Overall Acceptance Rate 30 of 100 submissions, 30%

Article Metrics

  • Downloads (last 12 months): 8
  • Downloads (last 6 weeks): 0
Reflects downloads up to 26 Sep 2024.

Cited By

  • (2023) Knowledge Bases and Language Models: Complementing Forces. Rules and Reasoning, pp. 3-15. DOI: 10.1007/978-3-031-45072-3_1. Online publication date: 15-Oct-2023
  • (2019) Company Acquisition Relations Extraction From Web Pages. Semantic Web Science and Real-World Applications, pp. 1-17. DOI: 10.4018/978-1-5225-7186-5.ch001. Online publication date: 2019
  • (2019) Knowledge Harvesting: Achievements and Challenges. Computing and Software Science, pp. 217-235. DOI: 10.1007/978-3-319-91908-9_13. Online publication date: 2019
  • (2018) Synthesizing Type-Detection Logic for Rich Semantic Data Types using Open-source Code. Proceedings of the 2018 International Conference on Management of Data, pp. 35-50. DOI: 10.1145/3183713.3196888. Online publication date: 27-May-2018
  • (2016) YAGO: A Multilingual Knowledge Base from Wikipedia, Wordnet, and Geonames. The Semantic Web – ISWC 2016, pp. 177-185. DOI: 10.1007/978-3-319-46547-0_19. Online publication date: 23-Sep-2016
