research-article

Hybrid Crowd-Machine Wrapper Inference

Published: 24 September 2019
  • Abstract

    Wrapper inference deals with generating programs that extract data from Web pages. Several supervised and unsupervised wrapper inference approaches have been proposed in the literature. On the one hand, unsupervised approaches produce erratic wrappers: whenever the sources do not satisfy the underlying assumptions of the inference algorithm, their accuracy is compromised. On the other hand, supervised approaches produce accurate wrappers, but because they need training data, their scalability is limited. The recent advent of crowdsourcing platforms has opened new opportunities for supervised approaches, as they make it possible to produce large amounts of training data with the support of workers recruited online. Nevertheless, involving human workers has monetary costs. We present an original hybrid crowd-machine wrapper inference system that offers the benefits of both approaches by exploiting the cooperation of crowd workers and unsupervised algorithms. Based on a principled probabilistic model that estimates the quality of wrappers, human workers are recruited only when unsupervised wrapper induction algorithms are not able to produce sufficiently accurate solutions.
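
    The escalation rule described in the last sentence of the abstract can be illustrated with a minimal sketch. It is not the system's actual algorithm: the `Candidate` type, the `choose_action` function, the quality values, and the `threshold` parameter are all hypothetical names and numbers introduced here for illustration, and the quality estimate is assumed to come from some external probabilistic model.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate wrapper plus an estimated probability that it is correct.

    Both fields are illustrative assumptions, not part of the original system.
    """
    name: str
    quality: float  # hypothetical estimate in [0, 1] from a quality model


def choose_action(candidates, threshold=0.9):
    """Pick the best-scoring candidate and decide whether to escalate.

    Returns ("accept", best) when the best estimate reaches the threshold,
    otherwise ("ask_crowd", best), meaning human workers would be recruited.
    """
    if not candidates:
        raise ValueError("at least one candidate wrapper is required")
    best = max(candidates, key=lambda c: c.quality)
    action = "accept" if best.quality >= threshold else "ask_crowd"
    return action, best


if __name__ == "__main__":
    # Made-up estimates: no automatically inferred wrapper is confident enough.
    inferred = [Candidate("w1", 0.55), Candidate("w2", 0.62)]
    action, best = choose_action(inferred)
    print(action, best.name)
```

    In this sketch the monetary cost of the crowd is paid only when the best automatic estimate falls below the threshold, which is the trade-off the abstract describes; how the estimate is computed and how the threshold is set are left open here.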

    Published In

    ACM Transactions on Knowledge Discovery from Data  Volume 13, Issue 5
    October 2019
    258 pages
    ISSN:1556-4681
    EISSN:1556-472X
    DOI:10.1145/3364623
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 24 September 2019
    Accepted: 01 June 2019
    Revised: 01 April 2019
    Received: 01 October 2018
    Published in TKDD Volume 13, Issue 5

    Author Tags

    1. Crowdsourcing
    2. data extraction
    3. wrapper inference

    Qualifiers

    • Research-article
    • Research
    • Refereed
