NoDB: efficient query execution on raw data files

Authors: Ioannis Alagiannis, Renata Borovica, Stratos Idreos, Anastasia Ailamaki

SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data

Pages 241 - 252

https://doi.org/10.1145/2213836.2213864

Published: 20 May 2012

Abstract

As data collections become larger and larger, data loading evolves into a major bottleneck. Many applications, e.g., scientific data analysis and social networks, already avoid using database systems because of their complexity and the increased data-to-query time. For such applications data collections keep growing fast, even on a daily basis, and we are already in the era of data deluge, where we have far more data than we can move or store, let alone analyze.

Our contribution in this paper is the design and roadmap of a new paradigm in database systems, called NoDB, which does not require data loading while still maintaining the whole feature set of a modern database system. In particular, we show how to make raw data files a first-class citizen, fully integrated with the query engine. Through our design and the lessons learned by implementing the NoDB philosophy over a modern DBMS, we discuss the fundamental limitations as well as the strong opportunities that such a research path brings. We identify performance bottlenecks specific to in situ processing, namely the repeated parsing and tokenizing overhead and the expensive data type conversion costs. To address these problems, we introduce an adaptive indexing mechanism that maintains positional information to provide efficient access to raw data files, together with a flexible caching structure.
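To make the two bottlenecks concrete, the following is a minimal Python sketch of the general idea, not the PostgresRaw implementation. It assumes an in-memory CSV string with a single-character delimiter and integer fields; the class `RawCsvTable`, its methods, and its counters are hypothetical names introduced only for illustration. The positional map records where fields start and end as queries touch them, so later queries can skip tokenizing, and the cache keeps converted values so type conversion is not repeated.

```python
class RawCsvTable:
    """Toy in situ reader over raw CSV text (illustrative assumption only)."""

    def __init__(self, raw, delimiter=","):
        self.raw = raw
        self.delimiter = delimiter  # assumed to be a single character
        # (start, end) offsets of every non-empty row, newline excluded.
        self.row_bounds = []
        pos = 0
        for line in raw.split("\n"):
            if line:
                self.row_bounds.append((pos, pos + len(line)))
            pos += len(line) + 1
        self.positions = {}      # positional map: (row, col) -> (start, end)
        self.cache = {}          # cache: (row, col) -> converted int value
        self.fields_scanned = 0  # tokenizing work performed so far
        self.conversions = 0     # string-to-int conversions performed so far

    def _locate(self, row, col):
        """Find the offsets of a field, tokenizing as little as possible."""
        if (row, col) in self.positions:
            return self.positions[(row, col)]
        row_start, row_end = self.row_bounds[row]
        # Resume from the nearest field to the left that is already mapped.
        c, start = 0, row_start
        for k in range(col - 1, -1, -1):
            if (row, k) in self.positions:
                c, start = k + 1, self.positions[(row, k)][1] + 1
                break
        while True:
            end = self.raw.find(self.delimiter, start, row_end)
            if end == -1:
                end = row_end
            self.positions[(row, c)] = (start, end)
            self.fields_scanned += 1
            if c == col:
                return start, end
            if end == row_end:
                raise IndexError(f"row {row} has no column {col}")
            c, start = c + 1, end + 1

    def get_int(self, row, col):
        """Return a field as int, converting it at most once."""
        key = (row, col)
        if key not in self.cache:
            start, end = self._locate(row, col)
            self.cache[key] = int(self.raw[start:end])
            self.conversions += 1
        return self.cache[key]

    def column_sum(self, col):
        """A stand-in for a query: aggregate one column over all rows."""
        return sum(self.get_int(r, col) for r in range(len(self.row_bounds)))


if __name__ == "__main__":
    table = RawCsvTable("1,10,100\n2,20,200\n3,30,300\n")
    for col in (1, 1, 2):
        total = table.column_sum(col)
        print(col, total, table.fields_scanned, table.conversions)
```

In this sketch, repeating a query on the same column performs no further tokenizing or conversion, and a query on a column to the right resumes from the nearest mapped field instead of re-scanning the row from its start. A real system would also need to bound the memory used by both structures, which this sketch ignores.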

Our implementation over PostgreSQL, called PostgresRaw, is able to avoid the loading cost completely, while matching the query performance of plain PostgreSQL and even outperforming it in many cases. We conclude that NoDB systems are feasible to design and implement over modern database architectures, bringing an unprecedented positive effect on usability and performance.

Published In

SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data

May 2012, 886 pages

ISBN: 9781450312479

DOI: 10.1145/2213836

General Chairs: K. Selçuk Candan (Arizona State University), Yi Chen (Arizona State University), Richard Snodgrass (University of Arizona)

Program Chair: Luis Gravano (Columbia University)

Publications Chair: Ariel Fuxman (Microsoft Research)

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGMOD/PODS '12

Sponsor:

SIGMOD

SIGMOD/PODS '12: International Conference on Management of Data

May 20 - 24, 2012

Scottsdale, Arizona, USA

Acceptance Rates

SIGMOD '12 Paper Acceptance Rate 48 of 289 submissions, 17%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%


Citations

Cited By

Tang CFan BZhao JLiang CWang YWang BQiu ZQiu LDing BSun SChe SMai JChen SZhu YXie JSun YLi YZhang YWang KChen MBagchi SZhang Y(2024)Data caching for enterprise-grade petabyte-scale OLAPProceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference10.5555/3691992.3692047(901-915)Online publication date: 10-Jul-2024
https://dl.acm.org/doi/10.5555/3691992.3692047
Zirak FChoudhury FBorovica-Gajic R(2024)SeLeP: Learning Based Semantic Prefetching for Exploratory Database WorkloadsProceedings of the VLDB Endowment10.14778/3659437.365945817:8(2064-2076)Online publication date: 31-May-2024
https://doi.org/10.14778/3659437.3659458
Bormann PKrämer MWürz HGöhringer P(2024)Executing Ad-Hoc Queries on Large Geospatial Data Sets Without Acceleration StructuresSN Computer Science10.1007/s42979-024-02986-z5:5Online publication date: 13-Jun-2024
https://dl.acm.org/doi/10.1007/s42979-024-02986-z
Wei YWang HJin P(2024)Optimizing the B+tree Index with Hotness Awareness and AdaptivityAdvanced Intelligent Computing Technology and Applications10.1007/978-981-97-5581-3_29(356-367)Online publication date: 1-Aug-2024
https://doi.org/10.1007/978-981-97-5581-3_29
Fathollahzadeh SBoehm M(2023)GIO: Generating Efficient Matrix and Frame Readers for Custom Data Formats by ExampleProceedings of the ACM on Management of Data10.1145/35892651:2(1-26)Online publication date: 20-Jun-2023
https://dl.acm.org/doi/10.1145/3589265
Perera ROetomo BRubinstein BBorovica-Gajic R(2023)No DBA? No Regret! Multi-Armed Bandits for Index Tuning of Analytical and HTAP Workloads With Provable GuaranteesIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2023.327166435:12(12855-12872)Online publication date: 1-Dec-2023
https://doi.org/10.1109/TKDE.2023.3271664
Sanca VAilamaki A(2023)Analytical Engines With Context-Rich Processing: Towards Efficient Next-Generation Analytics2023 IEEE 39th International Conference on Data Engineering (ICDE)10.1109/ICDE55515.2023.00298(3699-3707)Online publication date: Apr-2023
https://doi.org/10.1109/ICDE55515.2023.00298
Gavriilidis HBeedkar KQuiané-Ruiz JMarkl V(2023)In-Situ Cross-Database Query Processing2023 IEEE 39th International Conference on Data Engineering (ICDE)10.1109/ICDE55515.2023.00214(2794-2807)Online publication date: Apr-2023
https://doi.org/10.1109/ICDE55515.2023.00214
Potti SNguyen AGibson LBadia A(2023)Bringing Data Analysis to the Files and the Database to the Command Line2023 Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE)10.1109/CSCE60160.2023.00246(1490-1497)Online publication date: 24-Jul-2023
https://doi.org/10.1109/CSCE60160.2023.00246
Patel MBhise M(2023)MUAR: Maximizing Utilization of Available Resources for Query Processing2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing Workshops (CCGridW)10.1109/CCGridW59191.2023.00040(176-183)Online publication date: May-2023
https://doi.org/10.1109/CCGridW59191.2023.00040
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten