DOI: 10.1145/2213836.2213864
research-article

NoDB: efficient query execution on raw data files

Published: 20 May 2012

Abstract

As data collections grow larger, data loading becomes a major bottleneck. Many applications, e.g., scientific data analysis and social networks, already avoid database systems because of their complexity and the increased data-to-query time. For such applications, data collections keep growing fast, even on a daily basis, and we are already in the era of the data deluge, where we have much more data than we can move or store, let alone analyze.
Our contribution in this paper is the design and roadmap of a new paradigm in database systems, called NoDB, which does not require data loading while still maintaining the whole feature set of a modern database system. In particular, we show how to make raw data files a first-class citizen, fully integrated with the query engine. Through our design and the lessons learned by implementing the NoDB philosophy over a modern DBMS, we discuss the fundamental limitations as well as the strong opportunities that such a research path brings. We identify performance bottlenecks specific to in situ processing, namely the repeated parsing and tokenizing overhead and the expensive data type conversion costs. To address these problems, we introduce an adaptive indexing mechanism that maintains positional information to provide efficient access to raw data files, together with a flexible caching structure.
Our implementation over PostgreSQL, called PostgresRaw, is able to avoid the loading cost completely, while matching the query performance of plain PostgreSQL and even outperforming it in many cases. We conclude that NoDB systems are feasible to design and implement over modern database architectures, bringing an unprecedented positive effect on usability and performance.
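
The two mechanisms named in the abstract, positional information over the raw file and a cache of converted values, can be illustrated with a toy sketch. This is not the PostgresRaw implementation: the class `RawCsvTable`, its methods, and the `chars_scanned` counter are hypothetical names invented here, and the sketch assumes a small comma-delimited text of integers with no quoting or escaping, held in memory as a string.

```python
class RawCsvTable:
    """Toy in situ access to delimited text (illustrative assumption, not PostgresRaw)."""

    def __init__(self, raw: str, delimiter: str = ","):
        self.raw = raw
        self.delimiter = delimiter
        # Start offset of every row, found by looking only at newlines.
        self.row_starts = [0] + [i + 1 for i, ch in enumerate(raw[:-1]) if ch == "\n"]
        # Positional map: (row, column) -> offset where that field begins.
        self.positions = {}
        # Cache: (row, column) -> value already converted to int.
        self.cache = {}
        # Rough measure of tokenizing work done so far.
        self.chars_scanned = 0

    def _locate(self, row: int, col: int) -> int:
        # Walk left to the nearest column whose position is already known.
        start_col = col
        while start_col > 0 and (row, start_col) not in self.positions:
            start_col -= 1
        pos = self.positions.get((row, start_col), self.row_starts[row])
        self.positions[(row, start_col)] = pos
        # Tokenize forward only over the fields that are still unknown.
        for c in range(start_col, col):
            nxt = self.raw.index(self.delimiter, pos)
            self.chars_scanned += nxt - pos + 1
            pos = nxt + 1
            self.positions[(row, c + 1)] = pos
        return pos

    def get_int(self, row: int, col: int) -> int:
        key = (row, col)
        if key in self.cache:
            # Cache hit: no tokenizing and no type conversion.
            return self.cache[key]
        start = self._locate(row, col)
        end = start
        while end < len(self.raw) and self.raw[end] not in (self.delimiter, "\n"):
            end += 1
        self.chars_scanned += end - start
        value = int(self.raw[start:end])  # the data type conversion step
        self.cache[key] = value
        return value

    def column(self, col: int) -> list:
        return [self.get_int(row, col) for row in range(len(self.row_starts))]


demo = RawCsvTable("10,20,30,40\n11,21,31,41\n12,22,32,42\n")
print(demo.column(3), demo.chars_scanned)  # first query tokenizes each row
print(demo.column(2), demo.chars_scanned)  # reuses positions recorded earlier
print(demo.column(3), demo.chars_scanned)  # served entirely from the cache
```

In this sketch the first query pays the full tokenizing cost, a later query on a neighbouring column jumps straight to the recorded offsets, and a repeated query touches neither the raw text nor the conversion routine. Both structures fill in only as queries arrive, which is one simple reading of "adaptive" in the abstract.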

Published In

SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
May 2012
886 pages
ISBN: 9781450312479
DOI: 10.1145/2213836
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. adaptive loading
  2. in situ querying
  3. positional map

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '12

Acceptance Rates

SIGMOD '12 Paper Acceptance Rate 48 of 289 submissions, 17%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%

Article Metrics

  • Downloads (Last 12 months): 87
  • Downloads (Last 6 weeks): 8
Reflects downloads up to 08 Feb 2025

Cited By

  • (2024) Data caching for enterprise-grade petabyte-scale OLAP. Proceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference, pp. 901-915. DOI: 10.5555/3691992.3692047. Online publication date: 10-Jul-2024
  • (2024) SeLeP: Learning Based Semantic Prefetching for Exploratory Database Workloads. Proceedings of the VLDB Endowment 17:8, pp. 2064-2076. DOI: 10.14778/3659437.3659458. Online publication date: 31-May-2024
  • (2024) Executing Ad-Hoc Queries on Large Geospatial Data Sets Without Acceleration Structures. SN Computer Science 5:5. DOI: 10.1007/s42979-024-02986-z. Online publication date: 13-Jun-2024
  • (2024) Optimizing the B+tree Index with Hotness Awareness and Adaptivity. Advanced Intelligent Computing Technology and Applications, pp. 356-367. DOI: 10.1007/978-981-97-5581-3_29. Online publication date: 1-Aug-2024
  • (2023) GIO: Generating Efficient Matrix and Frame Readers for Custom Data Formats by Example. Proceedings of the ACM on Management of Data 1:2, pp. 1-26. DOI: 10.1145/3589265. Online publication date: 20-Jun-2023
  • (2023) No DBA? No Regret! Multi-Armed Bandits for Index Tuning of Analytical and HTAP Workloads With Provable Guarantees. IEEE Transactions on Knowledge and Data Engineering 35:12, pp. 12855-12872. DOI: 10.1109/TKDE.2023.3271664. Online publication date: 1-Dec-2023
  • (2023) Analytical Engines With Context-Rich Processing: Towards Efficient Next-Generation Analytics. 2023 IEEE 39th International Conference on Data Engineering (ICDE), pp. 3699-3707. DOI: 10.1109/ICDE55515.2023.00298. Online publication date: Apr-2023
  • (2023) In-Situ Cross-Database Query Processing. 2023 IEEE 39th International Conference on Data Engineering (ICDE), pp. 2794-2807. DOI: 10.1109/ICDE55515.2023.00214. Online publication date: Apr-2023
  • (2023) Bringing Data Analysis to the Files and the Database to the Command Line. 2023 Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE), pp. 1490-1497. DOI: 10.1109/CSCE60160.2023.00246. Online publication date: 24-Jul-2023
  • (2023) MUAR: Maximizing Utilization of Available Resources for Query Processing. 2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing Workshops (CCGridW), pp. 176-183. DOI: 10.1109/CCGridW59191.2023.00040. Online publication date: May-2023