Research Article
DOI: 10.1145/2949689.2949690
SSDBM Conference Proceedings

Efficient Feedback Collection for Pay-as-you-go Source Selection

Published: 18 July 2016

Abstract

    Technical developments, such as the web of data and web data extraction, combined with policy developments such as those relating to open government or open science, are leading to the availability of increasing numbers of data sources. Given these physical sources, it is also possible to create further virtual sources that integrate, aggregate or summarise the data from the original sources. As a result, there is a plethora of data sources, of which a small subset may be able to provide the information required to support a task. The number of available sources, and the rate at which they change, are likely to make manual source selection and curation by experts impractical for many applications, motivating a pay-as-you-go approach, in which crowds or data consumers annotate results based on their correctness or suitability, and the resulting annotations are used to inform, e.g., source selection algorithms. However, for pay-as-you-go feedback collection to be cost-effective, it may be necessary to judiciously select the data items on which feedback is to be obtained. This paper describes OLBP (Ordering and Labelling By Precision), a heuristics-based approach to targeting data items for feedback, in support of mapping and source selection tasks where users express their preferences as a trade-off between precision and recall. The proposed approach is evaluated in two scenarios: mapping selection with synthetic data, and source selection with real data produced by web data extraction. The results demonstrate a significant reduction in the amount of feedback required to reach user-provided objectives when using OLBP.
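
    Since the abstract describes OLBP only at a high level, the sketch below illustrates the general shape of a precision-ordered, pay-as-you-go feedback loop: candidates (mappings or sources) receive a small seed of labels, are ranked by estimated precision, and the remaining feedback budget is spent on the most promising candidates first, selecting a candidate once its estimate meets the user's precision target. This is a minimal illustration under stated assumptions, not the OLBP algorithm from the paper; the function names, seeding strategy, and stopping rule are all hypothetical.

    import random

    def precision_ordered_feedback(candidates, precision_target, ask_user,
                                   seed_labels=5, budget_per_candidate=20):
        """Hypothetical sketch of precision-ordered feedback targeting.

        NOT the paper's OLBP algorithm -- only an illustration of the idea
        named in the abstract: order candidates by estimated precision and
        target feedback at the most promising ones first.

        candidates: dict mapping a candidate (mapping/source) name to the
            list of result items it produces.
        ask_user: callable(item) -> bool; stands in for crowd workers or
            data consumers annotating a result as correct or incorrect.
        """
        estimates = {}  # name -> (correct_labels, total_labels)

        # Seed each candidate's precision estimate with a few labels.
        for name, items in candidates.items():
            if not items:
                continue
            sample = random.sample(items, min(seed_labels, len(items)))
            correct = sum(ask_user(item) for item in sample)
            estimates[name] = (correct, len(sample))

        # Order candidates by current precision estimate, best first.
        ranked = sorted(estimates,
                        key=lambda n: estimates[n][0] / estimates[n][1],
                        reverse=True)

        selected = []
        for name in ranked:
            correct, labelled = estimates[name]
            # Spend more of the feedback budget refining the estimate; for
            # simplicity this sketch may relabel items already seen.
            extra = random.sample(candidates[name],
                                  min(budget_per_candidate,
                                      len(candidates[name])))
            correct += sum(ask_user(item) for item in extra)
            labelled += len(extra)
            if correct / labelled >= precision_target:
                selected.append(name)
        return selected

    For instance, with candidates drawn from two hypothetical web wrappers and an ask_user that consults a crowdsourcing queue, precision_ordered_feedback(candidates, 0.9, ask_user) would return the candidates whose estimated precision reaches 0.9; in the paper's setting the collected labels would additionally inform the recall side of the precision/recall trade-off.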




    Published In

    SSDBM '16: Proceedings of the 28th International Conference on Scientific and Statistical Database Management
    July 2016
    290 pages

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Data integration
    2. feedback collection
    3. mapping selection
    4. pay-as-you-go
    5. source selection

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    SSDBM '16

    Acceptance Rates

    Overall Acceptance Rate 56 of 146 submissions, 38%


    Cited By

    • (2020) Feedback driven improvement of data preparation pipelines. Information Systems, 92:101480. DOI: 10.1016/j.is.2019.101480. Online publication date: Sep-2020.
    • (2019) Crowdsourced Targeted Feedback Collection for Multicriteria Data Source Selection. Journal of Data and Information Quality, 11(1):1-27. DOI: 10.1145/3284934. Online publication date: 4-Jan-2019.
    • (2018) Source Selection Languages. Proceedings of the Workshop on Human-In-the-Loop Data Analytics, pages 1-6. DOI: 10.1145/3209900.3209906. Online publication date: 10-Jun-2018.
    • (2018) An approach to quantify integration quality using feedback on mapping results. International Journal of Web Information Systems. DOI: 10.1108/IJWIS-05-2018-0043. Online publication date: 31-Dec-2018.
    • (2017) Quantifying integration quality using feedback on mapping results. Proceedings of the 19th International Conference on Information Integration and Web-based Applications & Services, pages 3-12. DOI: 10.1145/3151759.3151763. Online publication date: 4-Dec-2017.
    • (2017) Observing the Data Scientist. Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics, pages 1-6. DOI: 10.1145/3077257.3077272. Online publication date: 14-May-2017.
    • (2017) Targeted Feedback Collection Applied to Multi-Criteria Source Selection. Advances in Databases and Information Systems, pages 136-150. DOI: 10.1007/978-3-319-66917-5_10. Online publication date: 25-Aug-2017.
