research-article

Characterizing and selecting fresh data sources

Authors:

Theodoros Rekatsinas,

Divesh Srivastava

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

Pages 919 - 930

https://doi.org/10.1145/2588555.2610504

Published: 18 June 2014

Abstract

Data integration is a challenging task due to the large number of autonomous data sources. This necessitates the development of techniques to reason about the benefits and costs of acquiring and integrating data. Recently, the problem of source selection (i.e., identifying the subset of sources that maximizes the profit from integration) was introduced as a preprocessing step before the actual integration. That problem was studied for static sources, with the accuracy of data fusion used to quantify the integration profit.

In this paper, we study the problem of source selection considering dynamic data sources whose content changes over time. We define a set of time-dependent metrics, including coverage, freshness and accuracy, to characterize the quality of integrated data. We show how statistical models for the evolution of sources can be used to estimate these metrics. While source selection is NP-complete, we show that near-optimal solutions can be found for a large class of practical cases. We propose an algorithmic framework with theoretical guarantees for our problem and show its effectiveness with an extensive experimental evaluation on both real-world and synthetic data.
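As a rough illustration of what "maximizing the profit from integration" can mean, the sketch below greedily picks sources by marginal profit, where profit is taken to be coverage gain minus cost. This is not the algorithm proposed in the paper and carries no approximation guarantee. The `Source` type, the `greedy_select` function, the profit definition, and the toy data are all assumptions made for the example, and freshness and accuracy are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    name: str
    items: frozenset  # entities this (hypothetical) source is assumed to cover
    cost: float       # assumed cost of acquiring and integrating the source


def coverage(selected, universe):
    """Fraction of the universe covered by the union of the selected sources."""
    covered = set()
    for s in selected:
        covered |= s.items
    return len(covered & universe) / len(universe)


def greedy_select(sources, universe, gain_weight=1.0):
    """Add the source with the best marginal profit until no source improves it."""

    def profit(sel):
        # Assumed profit model: weighted coverage minus total cost.
        return gain_weight * coverage(sel, universe) - sum(s.cost for s in sel)

    selected = []
    remaining = list(sources)
    current = profit(selected)
    while remaining:
        best = max(remaining, key=lambda s: profit(selected + [s]))
        best_profit = profit(selected + [best])
        if best_profit <= current:
            break  # no remaining source increases the profit
        selected.append(best)
        remaining.remove(best)
        current = best_profit
    return selected, current


# Toy data: ten entities and three overlapping sources.
universe = frozenset(range(1, 11))
sources = [
    Source("A", frozenset(range(1, 7)), 0.1),
    Source("B", frozenset(range(5, 11)), 0.1),
    Source("C", frozenset(range(1, 4)), 0.2),
]
chosen, total = greedy_select(sources, universe)
print([s.name for s in chosen], round(total, 3))
```

On this toy input the sketch selects A and B and skips C, whose items are already covered and whose cost would only lower the profit. A time-dependent variant would replace `coverage` with an estimate of the metric at a future time point.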

Index Terms

1. Information systems
  1. Data management systems
2. Mathematics of computing
  1. Probability and statistics

Published In

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

June 2014

1645 pages

ISBN:9781450323765

DOI:10.1145/2588555

General Chairs: Curtis Dyreson (Utah State University, USA), Feifei Li (University of Utah, USA)

Program Chair: M. Tamer Özsu (University of Waterloo, Canada)

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

SIGMOD/PODS'14

Sponsor:

SIGMOD

SIGMOD/PODS'14: International Conference on Management of Data

June 22 - 27, 2014

Snowbird, Utah, USA

Acceptance Rates

SIGMOD '14 Paper Acceptance Rate 107 of 421 submissions, 25%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%
