Research Article
DOI: 10.1145/2949689.2949690
SSDBM Conference Proceedings

Efficient Feedback Collection for Pay-as-you-go Source Selection

Published: 18 July 2016

Abstract

    Technical developments, such as the web of data and web data extraction, combined with policy developments such as those relating to open government or open science, are leading to the availability of increasing numbers of data sources. Given these physical sources, it is also possible to create further virtual sources that integrate, aggregate or summarise the data from the original sources. As a result, there is a plethora of data sources, of which a small subset may be able to provide the information required to support a task. The number of available sources, and the rate at which they change, are likely to make manual source selection and curation by experts impractical for many applications, motivating a pay-as-you-go approach, in which crowds or data consumers annotate results based on their correctness or suitability, and the resulting annotations are used to inform, e.g., source selection algorithms. However, for pay-as-you-go feedback collection to be cost-effective, it may be necessary to judiciously select the data items on which feedback is to be obtained. This paper describes OLBP (Ordering and Labelling By Precision), a heuristics-based approach to targeting data items for feedback, in support of mapping and source selection tasks where users express their preferences as a trade-off between precision and recall. The proposed approach is evaluated in two scenarios: mapping selection with synthetic data, and source selection with real data produced by web data extraction. The results demonstrate a significant reduction in the amount of feedback required to reach user-provided objectives when using OLBP.
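
    Since the abstract describes OLBP only at a high level, the sketch below illustrates the general shape of a precision-ordered, pay-as-you-go feedback loop: candidates (mappings or sources) receive a small seed of labels, are ranked by estimated precision, and the remaining feedback budget is spent on the most promising candidates first, selecting a candidate once its estimate meets the user's precision target. This is a minimal illustration under stated assumptions, not the OLBP algorithm from the paper; the function names, seeding strategy, and stopping rule are all hypothetical.

    import random

    def precision_ordered_feedback(candidates, precision_target, ask_user,
                                   seed_labels=5, budget_per_candidate=20):
        """Hypothetical sketch of precision-ordered feedback targeting.

        NOT the paper's OLBP algorithm -- only an illustration of the idea
        named in the abstract: order candidates by estimated precision and
        target feedback at the most promising ones first.

        candidates: dict mapping a candidate (mapping/source) name to the
            list of result items it produces.
        ask_user: callable(item) -> bool; stands in for crowd workers or
            data consumers annotating a result as correct or incorrect.
        """
        estimates = {}  # name -> (correct_labels, total_labels)

        # Seed each candidate's precision estimate with a few labels.
        for name, items in candidates.items():
            if not items:
                continue
            sample = random.sample(items, min(seed_labels, len(items)))
            correct = sum(ask_user(item) for item in sample)
            estimates[name] = (correct, len(sample))

        # Order candidates by current precision estimate, best first.
        ranked = sorted(estimates,
                        key=lambda n: estimates[n][0] / estimates[n][1],
                        reverse=True)

        selected = []
        for name in ranked:
            correct, labelled = estimates[name]
            # Spend more of the feedback budget refining the estimate; for
            # simplicity this sketch may relabel items already seen.
            extra = random.sample(candidates[name],
                                  min(budget_per_candidate,
                                      len(candidates[name])))
            correct += sum(ask_user(item) for item in extra)
            labelled += len(extra)
            if correct / labelled >= precision_target:
                selected.append(name)
        return selected

    For instance, with candidates drawn from two hypothetical web wrappers and an ask_user that consults a crowdsourcing queue, precision_ordered_feedback(candidates, 0.9, ask_user) would return the candidates whose estimated precision reaches 0.9; in the paper's setting the collected labels would additionally inform the recall side of the precision/recall trade-off.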




    Published In

    SSDBM '16: Proceedings of the 28th International Conference on Scientific and Statistical Database Management
    July 2016
    290 pages

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Data integration
    2. feedback collection
    3. mapping selection
    4. pay-as-you-go
    5. source selection

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    SSDBM '16

    Acceptance Rates

    Overall Acceptance Rate 56 of 146 submissions, 38%


    Cited By

    • (2020) Feedback driven improvement of data preparation pipelines. Information Systems, 92:101480. DOI: 10.1016/j.is.2019.101480. Online publication date: Sep-2020.
    • (2019) Crowdsourced Targeted Feedback Collection for Multicriteria Data Source Selection. Journal of Data and Information Quality, 11(1):1-27. DOI: 10.1145/3284934. Online publication date: 4-Jan-2019.
    • (2018) Source Selection Languages. Proceedings of the Workshop on Human-In-the-Loop Data Analytics, pages 1-6. DOI: 10.1145/3209900.3209906. Online publication date: 10-Jun-2018.
    • (2018) An approach to quantify integration quality using feedback on mapping results. International Journal of Web Information Systems. DOI: 10.1108/IJWIS-05-2018-0043. Online publication date: 31-Dec-2018.
    • (2017) Quantifying integration quality using feedback on mapping results. Proceedings of the 19th International Conference on Information Integration and Web-based Applications & Services, pages 3-12. DOI: 10.1145/3151759.3151763. Online publication date: 4-Dec-2017.
    • (2017) Observing the Data Scientist. Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics, pages 1-6. DOI: 10.1145/3077257.3077272. Online publication date: 14-May-2017.
    • (2017) Targeted Feedback Collection Applied to Multi-Criteria Source Selection. Advances in Databases and Information Systems, pages 136-150. DOI: 10.1007/978-3-319-66917-5_10. Online publication date: 25-Aug-2017.
