Hybrid Crowd-Machine Wrapper Inference

Authors:

Valter Crescenzi,

Paolo Merialdo,

Disheng Qiu

ACM Transactions on Knowledge Discovery from Data (TKDD), Volume 13, Issue 5

Article No.: 51, Pages 1 - 43

https://doi.org/10.1145/3344720

Published: 24 September 2019

Abstract

Wrapper inference deals with generating programs that extract data from Web pages. Several supervised and unsupervised wrapper inference approaches have been proposed in the literature. On the one hand, unsupervised approaches produce erratic wrappers: whenever the sources do not satisfy the underlying assumptions of the inference algorithm, their accuracy is compromised. On the other hand, supervised approaches produce accurate wrappers, but since they need training data, their scalability is limited. The recent advent of crowdsourcing platforms has opened new opportunities for supervised approaches, as they make it possible to produce large amounts of training data with the support of workers recruited online. Nevertheless, involving human workers has monetary costs. We present an original hybrid crowd-machine wrapper inference system that offers the benefits of both approaches by exploiting the cooperation of crowd workers and unsupervised algorithms. Based on a principled probabilistic model that estimates the quality of wrappers, human workers are recruited only when unsupervised wrapper induction algorithms are not able to produce sufficiently accurate solutions.
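
The policy in the last sentence of the abstract can be illustrated with a minimal Python sketch. This is not the probabilistic model from the article, which this page does not describe. It assumes, purely for illustration, that exactly one candidate wrapper is correct, that an unsupervised step supplies a prior score per candidate, and that each worker answers a yes/no question correctly with a fixed probability. The names `infer_wrapper`, `ask_worker`, `threshold`, and `worker_accuracy` are hypothetical.

```python
from typing import Callable, Dict, Tuple


def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale non-negative scores so that they sum to 1."""
    total = sum(scores.values())
    return {name: value / total for name, value in scores.items()}


def infer_wrapper(
    priors: Dict[str, float],
    ask_worker: Callable[[str], bool],
    threshold: float = 0.9,        # assumed target confidence
    worker_accuracy: float = 0.8,  # assumed probability of a correct answer
    max_questions: int = 10,       # assumed budget of paid questions
) -> Tuple[str, float, int]:
    """Pick a wrapper, asking workers only while the estimate is too uncertain."""
    posterior = normalize(priors)
    questions = 0
    while max(posterior.values()) < threshold and questions < max_questions:
        best = max(posterior, key=posterior.get)
        # ask_worker returns True if the worker judges this wrapper's output correct.
        answer = ask_worker(best)
        for name in posterior:
            # Under the hypothesis "name is the correct wrapper", the true answer
            # is "yes" only when name == best; a worker reports the true answer
            # with probability worker_accuracy.
            consistent = (name == best) == answer
            posterior[name] *= worker_accuracy if consistent else 1.0 - worker_accuracy
        posterior = normalize(posterior)
        questions += 1
    best = max(posterior, key=posterior.get)
    return best, posterior[best], questions


# A confident machine estimate: the loop is skipped and no worker is asked.
print(infer_wrapper({"w1": 0.95, "w2": 0.05}, lambda name: True))
```

With uncertain priors, the loop keeps asking until one candidate's estimated probability reaches `threshold` or the budget runs out, so human effort is spent only where the automatic estimate is weak.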

Cited By

V. Cetorelli, P. Atzeni, V. Crescenzi, and F. Milicchio. 2021. The smallest extraction problem. Proceedings of the VLDB Endowment 14, 11 (2021), 2445-2458. Online publication date: 27-Oct-2021.
https://dl.acm.org/doi/10.14778/3476249.3476293

Index Terms

Hybrid Crowd-Machine Wrapper Inference
1. Information systems
  1. Data management systems
    1. Information integration
      1. Wrappers (data mining)
  2. World Wide Web
    1. Web applications
      1. Crowdsourcing
    2. Web mining
      1. Site wrapping

Recommendations

Crowdsourcing large scale wrapper inference

We present a crowdsourcing system for large-scale production of accurate wrappers to extract data from data-intensive websites. Our approach is based on supervised wrapper inference algorithms which demand the burden of generating training data to ...
A framework for learning web wrappers from the crowd
WWW '13: Proceedings of the 22nd international conference on World Wide Web

The development of solutions to scale the extraction of data from Web sources is still a challenging issue. High accuracy can be achieved by supervised approaches but the costs of training data, i.e., annotations over a set of sample pages, limit their ...
ALFRED: crowd assisted data extraction
WWW '13 Companion: Proceedings of the 22nd International Conference on World Wide Web

The development of solutions to scale the extraction of data from Web sources is still a challenging issue. High accuracy can be achieved by supervised approaches, but the costs of training data, i.e., annotations over a set of sample pages, limit their ...

Information & Contributors

Information

Published In

ACM Transactions on Knowledge Discovery from Data Volume 13, Issue 5

October 2019

258 pages

ISSN:1556-4681

EISSN:1556-472X

DOI:10.1145/3364623

Editors:
Charu Aggarwal (IBM T. J. Watson Research, USA),
Xindong Wu (Mininglamp Academy of Sciences, China)

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 September 2019

Accepted: 01 June 2019

Revised: 01 April 2019

Received: 01 October 2018

Published in TKDD Volume 13, Issue 5

Qualifiers

Research-article
Research
Refereed
