DOI: 10.1145/2948674.2948675
short-paper

Towards large-scale data discovery: position paper

Published: 26 June 2016

Abstract

With thousands of data sources spread across multiple databases and data lakes, modern organizations face a data discovery challenge. Analysts spend more time finding relevant data to answer the questions at hand than analyzing it.
In this paper we introduce a data discovery system that facilitates locating relevant data among thousands of data sources. We represent data sources succinctly through signatures, and then create search paths that permit quick execution of a set of data discovery primitives used for finding relevant data. We have built a prototype that is being used to solve data discovery challenges of two big organizations.
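
The abstract does not say how a signature is computed. As a rough illustration of the idea only, the sketch below assumes a signature is a MinHash sketch of a column's distinct values, so that two data sources can be compared without scanning their full contents. The function names, the signature length of 64, and the choice of MinHash itself are assumptions made for this example and are not taken from the paper.

```python
import hashlib

NUM_HASHES = 64  # assumed signature length; larger values reduce estimation error


def _seeded_hash(value: str, seed: int) -> int:
    """Hash a value to a 64-bit integer; each seed acts as a different hash function."""
    digest = hashlib.blake2b(
        value.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(values, num_hashes=NUM_HASHES):
    """Summarize a column by the minimum hash of its distinct values, per hash function."""
    distinct = {str(v) for v in values}
    if not distinct:
        raise ValueError("cannot build a signature for an empty column")
    return [min(_seeded_hash(v, seed) for v in distinct) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Estimate the Jaccard similarity of two columns from their signatures alone."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    # Hypothetical columns from two different data sources.
    customers = ["alice", "bob", "carol", "dave", "erin"]
    accounts = ["bob", "carol", "dave", "erin", "frank"]
    similarity = estimated_jaccard(
        minhash_signature(customers), minhash_signature(accounts)
    )
    print(f"estimated overlap: {similarity:.2f}")
```

Under this assumption, a discovery primitive such as "find columns similar to this one" reduces to comparing short fixed-length signatures, which is what makes searching across thousands of sources cheap. The estimate is approximate, and its accuracy depends on the signature length.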

Published In

ExploreDB '16: Proceedings of the Third International Workshop on Exploratory Search in Databases and the Web
June 2016
38 pages
ISBN:9781450343121
DOI:10.1145/2948674
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

  • LogicBlox: LogicBlox Inc.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Short-paper

Conference

SIGMOD/PODS'16: International Conference on Management of Data
Sponsor: LogicBlox
June 26 - July 1, 2016
San Francisco, California

Acceptance Rates

ExploreDB '16 Paper Acceptance Rate 5 of 11 submissions, 45%;
Overall Acceptance Rate 11 of 21 submissions, 52%

Cited By

  • (2024) CMDBench: A Benchmark for Coarse-to-fine Multimodal Data Discovery in Compound AI Systems. Proceedings of the Conference on Governance, Understanding and Integration of Data for Effective and Responsible AI, pp. 16-25. DOI: 10.1145/3665601.3669846. Online publication date: 9-Jun-2024
  • (2022) Information Resilience: the nexus of responsible and agile approaches to information use. The VLDB Journal 31:5, pp. 1059-1084. DOI: 10.1007/s00778-021-00720-2. Online publication date: 16-Jan-2022
  • (2019) Meta-mappings for schema mapping reuse. Proceedings of the VLDB Endowment 12:5, pp. 557-569. DOI: 10.14778/3303753.3303761. Online publication date: 1-Jan-2019
  • (2019) Data Management Systems Research at TU Berlin. ACM SIGMOD Record 47:4, pp. 23-28. DOI: 10.1145/3335409.3335415. Online publication date: 17-May-2019
  • (2018) Data Profiling. Synthesis Lectures on Data Management 10:4, pp. 1-154. DOI: 10.2200/S00878ED1V01Y201810DTM052. Online publication date: 7-Nov-2018
  • (2018) Data Lifecycle Challenges in Production Machine Learning. ACM SIGMOD Record 47:2, pp. 17-28. DOI: 10.1145/3299887.3299891. Online publication date: 11-Dec-2018
  • (2017) Are key-foreign key joins safe to avoid when learning high-capacity classifiers? Proceedings of the VLDB Endowment 11:3, pp. 366-379. DOI: 10.14778/3157794.3157804. Online publication date: 1-Nov-2017
  • (2017) Metacrate. Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 2483-2486. DOI: 10.1145/3132847.3133180. Online publication date: 6-Nov-2017
