DOI: 10.1145/2588555.2610504
research-article

Characterizing and selecting fresh data sources

Published: 18 June 2014

    Abstract

    Data integration is a challenging task due to the large number of autonomous data sources, which necessitates techniques for reasoning about the benefits and costs of acquiring and integrating data. Recently, the problem of source selection (i.e., identifying the subset of sources that maximizes the profit from integration) was introduced as a preprocessing step before the actual integration. That problem was studied for static sources and used the accuracy of data fusion to quantify the integration profit.
    In this paper, we study source selection for dynamic data sources whose content changes over time. We define a set of time-dependent metrics, including coverage, freshness, and accuracy, to characterize the quality of integrated data, and we show how statistical models for the evolution of sources can be used to estimate these metrics. While source selection is NP-complete, we show that near-optimal solutions can be found for a large class of practical cases. We propose an algorithmic framework with theoretical guarantees for our problem and show its effectiveness with an extensive experimental evaluation on both real-world and synthetic data.
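
    To make the notion of "profit from integration" concrete, the following is a minimal illustrative sketch, not the algorithm proposed in the paper. It assumes a toy profit function (coverage of a fixed universe of items minus the summed acquisition cost of the chosen sources) and a plain greedy heuristic that adds a source only while profit strictly improves. All names (`coverage`, `profit`, `greedy_select`, `SOURCES`, `COSTS`) and all numbers are hypothetical, and the sketch ignores freshness and accuracy entirely.

```python
from typing import Dict, List, Set, Tuple


def coverage(selected: List[str], sources: Dict[str, Set[int]],
             universe_size: int) -> float:
    """Fraction of the universe covered by the union of the selected sources."""
    covered: Set[int] = set()
    for name in selected:
        covered |= sources[name]
    return len(covered) / universe_size


def profit(selected: List[str], sources: Dict[str, Set[int]],
           costs: Dict[str, float], universe_size: int) -> float:
    """Assumed toy profit: coverage gain minus total acquisition cost."""
    gain = coverage(selected, sources, universe_size)
    return gain - sum(costs[name] for name in selected)


def greedy_select(sources: Dict[str, Set[int]], costs: Dict[str, float],
                  universe_size: int) -> Tuple[List[str], float]:
    """Greedy heuristic: repeatedly add the source that raises profit the most.

    Stops as soon as no remaining source strictly improves the profit.
    This is a heuristic only; it carries no optimality guarantee here.
    """
    selected: List[str] = []
    best_profit = 0.0  # profit of selecting nothing
    remaining = sorted(sources)  # sorted for deterministic tie-breaking
    while remaining:
        candidate, candidate_profit = None, best_profit
        for name in remaining:
            p = profit(selected + [name], sources, costs, universe_size)
            if p > candidate_profit + 1e-12:
                candidate, candidate_profit = name, p
        if candidate is None:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_profit = candidate_profit
    return selected, best_profit


# Hypothetical toy data: a universe of 10 items and three overlapping sources.
SOURCES = {"A": set(range(0, 6)), "B": set(range(4, 10)), "C": {0, 1}}
COSTS = {"A": 0.1, "B": 0.1, "C": 0.3}

chosen, value = greedy_select(SOURCES, COSTS, universe_size=10)
print(chosen, round(value, 2))
```

    With these toy inputs the loop picks A and then B, and stops because C covers nothing new yet still costs something. A time-dependent variant would replace `coverage` with an estimate of the metric at a future time point, which is where models of source evolution would enter.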

    Published In

    SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
    June 2014
    1645 pages
    ISBN:9781450323765
    DOI:10.1145/2588555
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data integration
    2. dynamic data sources
    3. source selection

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS'14
    Acceptance Rates

    SIGMOD '14 Paper Acceptance Rate 107 of 421 submissions, 25%;
    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Article Metrics

    • Downloads (last 12 months): 16
    • Downloads (last 6 weeks): 1
    Reflects downloads up to 11 Aug 2024

    Cited By

    • (2023) Judging Knowledge by its Cover: Leveraging Large Language Models in Establishing Criteria for Knowledge Graph Sources Selection. 2023 8th International Conference on Information Technology and Digital Applications (ICITDA), pp. 1-8. DOI: 10.1109/ICITDA60835.2023.10427395. Online publication date: 17-Nov-2023
    • (2023) On the Replicability of Data Collection Using Online News Databases. PS: Political Science & Politics, pp. 1-8. DOI: 10.1017/S1049096522001317. Online publication date: 11-Jan-2023
    • (2021) Data source selection for approximate query. Journal of Combinatorial Optimization, 44:4, pp. 2443-2459. DOI: 10.1007/s10878-021-00760-y. Online publication date: 24-May-2021
    • (2021) Data Source Selection Support in the Big Data Integration Process – Towards a Taxonomy. Innovation Through Information Systems, pp. 5-21. DOI: 10.1007/978-3-030-86800-0_1. Online publication date: 29-Oct-2021
    • (2020) Using Demographic Pattern Analysis to Predict COVID-19 Fatalities on the US County Level. Digital Government: Research and Practice, 2:1, pp. 1-11. DOI: 10.1145/3430196. Online publication date: 3-Dec-2020
    • (2020) Selection of Information Streams in Social Sensing. Proceedings of the 12th International Conference on Management of Digital EcoSystems, pp. 157-161. DOI: 10.1145/3415958.3433099. Online publication date: 2-Nov-2020
    • (2020) From Appearance to Essence. ACM Transactions on Intelligent Systems and Technology, 11:6, pp. 1-24. DOI: 10.1145/3411749. Online publication date: 11-Sep-2020
    • (2020) BOXREC. ACM Transactions on Intelligent Systems and Technology, 11:6, pp. 1-28. DOI: 10.1145/3408890. Online publication date: 25-Sep-2020
    • (2020) Fast Distributed kNN Graph Construction Using Auto-tuned Locality-sensitive Hashing. ACM Transactions on Intelligent Systems and Technology, 11:6, pp. 1-18. DOI: 10.1145/3408889. Online publication date: 12-Oct-2020
    • (2020) Latent Unexpected Recommendations. ACM Transactions on Intelligent Systems and Technology, 11:6, pp. 1-25. DOI: 10.1145/3404855. Online publication date: 15-Sep-2020
