DOI: 10.1145/3183713.3183757
research-article

Big Data Linkage for Product Specification Pages

Published: 27 May 2018

Abstract

An increasing number of product pages are available from thousands of web sources, each page associated with a product and containing its attributes and one or more product identifiers. The sources provide overlapping information about the products, using diverse schemas, which makes web-scale integration extremely challenging. In this paper, we take advantage of the opportunity that sources publish product identifiers to perform big data linkage across sources at the beginning of the data integration pipeline, before schema alignment. To realize this opportunity, several challenges need to be addressed: identifiers need to be discovered on product pages, made difficult by the diversity of identifiers; the main product identifier on the page needs to be identified, made difficult by the many related products presented on the page; and identifiers across pages need to be resolved, made difficult by the ambiguity between identifiers across product categories. We present our RaF (Redundancy as Friend) solution to the problem of big data linkage for product specification pages, which takes advantage of the redundancy of identifiers at a global level, and the homogeneity of structure and semantics at the local source level, to effectively and efficiently link millions of pages of head and tail products across thousands of head and tail sources. We perform a thorough empirical evaluation of our RaF approach using the publicly available Dexter dataset, consisting of 1.9M product pages from 7.1k sources of 3.5k websites, and demonstrate its effectiveness in practice.
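The idea of using identifier redundancy across sources as a linkage signal can be illustrated with a small sketch. This is not the RaF algorithm, whose details the abstract does not give; it is a simplified stand-in that assumes candidate identifiers have already been extracted from each page. The function name `link_pages_by_identifier`, the `min_sources` parameter, the page dictionary fields, and the sample URLs are all hypothetical.

```python
from collections import defaultdict


def link_pages_by_identifier(pages, min_sources=2):
    """Group page URLs by candidate identifier.

    Hypothetical simplification: an identifier is kept as a linkage key
    only if it appears on at least `min_sources` distinct sources, so
    that strings local to a single source are ignored.
    """
    sources_by_id = defaultdict(set)
    pages_by_id = defaultdict(list)
    for page in pages:
        # De-duplicate identifiers repeated within one page.
        for ident in set(page["identifiers"]):
            sources_by_id[ident].add(page["source"])
            pages_by_id[ident].append(page["url"])
    return {
        ident: sorted(urls)
        for ident, urls in pages_by_id.items()
        if len(sources_by_id[ident]) >= min_sources
    }


# Made-up sample data: "X100" appears on two sources, "Z9" on only one.
pages = [
    {"url": "a.example/p1", "source": "a.example", "identifiers": ["X100", "Z9"]},
    {"url": "b.example/p7", "source": "b.example", "identifiers": ["X100"]},
    {"url": "a.example/p2", "source": "a.example", "identifiers": ["Z9"]},
]
clusters = link_pages_by_identifier(pages)
print(clusters)
```

In this toy input only `X100` survives as a linkage key, because `Z9` never appears outside a single source. The sketch deliberately leaves out the other challenges the abstract names, such as choosing the main identifier on a page and resolving identifiers that collide across product categories.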

Cited By

  • (2022) Fine-grained semantic type discovery for heterogeneous sources using clustering. The VLDB Journal 32:2 (305-324). DOI: 10.1007/s00778-022-00743-3. Online publication date: 17-May-2022
  • (2021) Incorporating Data Context to Cost-Effectively Automate End-to-End Data Wrangling. IEEE Transactions on Big Data 7:1 (169-186). DOI: 10.1109/TBDATA.2019.2907588. Online publication date: 1-Mar-2021

    Published In

    SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data
    May 2018
    1874 pages
    ISBN:9781450347037
    DOI:10.1145/3183713

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. big data
    2. data extraction
    3. data integration
    4. data linkage

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS '18

    Acceptance Rates

    SIGMOD '18 Paper Acceptance Rate 90 of 461 submissions, 20%;
    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Article Metrics

    • Downloads (last 12 months): 5
    • Downloads (last 6 weeks): 0
    Reflects downloads up to 30 Aug 2024
