Database Learning: Toward a Database that Becomes Smarter Every Time

Authors:

Ahmad Shahab Tajik,

Michael Cafarella,

Barzan Mozafari

SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data

Pages 587 - 602

https://doi.org/10.1145/3035918.3064013

Published: 09 May 2017

Abstract

In today's databases, previous query answers rarely benefit answering future queries. For the first time, to the best of our knowledge, we change this paradigm in an approximate query processing (AQP) context. We make the following observation: the answer to each query reveals some degree of knowledge about the answer to another query because their answers stem from the same underlying distribution that has produced the entire dataset. Exploiting and refining this knowledge should allow us to answer queries more analytically, rather than by reading enormous amounts of raw data. Also, processing more queries should continuously enhance our knowledge of the underlying distribution, and hence lead to increasingly faster response times for future queries.

We call this novel idea, learning from past query answers, Database Learning. We exploit the principle of maximum entropy to produce answers, which are in expectation guaranteed to be more accurate than existing sample-based approximations. Empowered by this idea, we build a query engine on top of Spark SQL, called Verdict. We conduct extensive experiments on real-world query traces from a large customer of a major database vendor. Our results demonstrate that database learning supports 73.7% of these queries, speeding them up by up to 23.0x for the same accuracy level compared to existing AQP systems.
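The intuition in the abstract, that a past query answer carries information about a new one, can be illustrated with a small sketch. It is not Verdict's implementation; the abstract only says that the principle of maximum entropy is used. The sketch assumes that the only constraints are means, variances, and a correlation between two query answers, in which case the maximum entropy distribution is a multivariate normal and the improved answer follows from standard Gaussian conditioning. The function name `improved_answer` and all of its parameters (`prior_mean`, `prior_var`, `corr`, and the raw answers with their standard errors) are hypothetical and chosen only for illustration.

```python
def improved_answer(raw_new, err_new, raw_past, err_past,
                    prior_mean, prior_var, corr):
    """Combine a sample-based answer to a new query with a past query answer.

    Illustrative assumption: the two true answers are jointly normal with a
    common prior mean and variance and correlation `corr`; each raw answer is
    the true answer plus independent noise with the given standard error.
    Returns (improved answer, improved standard error).
    """
    # Covariance matrix of the observed raw answers (past, new).
    a = prior_var + err_past ** 2
    b = corr * prior_var
    d = prior_var + err_new ** 2
    det = a * d - b * b

    # Covariance between the true new answer and each raw answer.
    k_past = corr * prior_var
    k_new = prior_var

    # Weights w = inverse(covariance) * k, using the closed-form 2x2 inverse.
    w_past = (d * k_past - b * k_new) / det
    w_new = (-b * k_past + a * k_new) / det

    # Conditional mean and variance of the true new answer.
    mean = (prior_mean
            + w_past * (raw_past - prior_mean)
            + w_new * (raw_new - prior_mean))
    var = prior_var - (w_past * k_past + w_new * k_new)
    return mean, max(var, 0.0) ** 0.5


if __name__ == "__main__":
    answer, error = improved_answer(
        raw_new=110.0, err_new=10.0,   # sample-based answer to the new query
        raw_past=120.0, err_past=1.0,  # accurate answer to a related past query
        prior_mean=100.0, prior_var=400.0, corr=0.9,
    )
    print(answer, error)
```

Under these assumptions the improved standard error is never larger than the raw sample-based error `err_new`, and it shrinks further as the correlation with the past query grows. With `corr=0` the past answer receives zero weight and contributes nothing.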

Index Terms

Database Learning: Toward a Database that Becomes Smarter Every Time
1. Computing methodologies
  1. Artificial intelligence
    1. Knowledge representation and reasoning
      1. Probabilistic reasoning
  2. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by regression
    2. Machine learning approaches
      1. Learning in probabilistic graphical models
        Maximum entropy modeling
2. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
      2. Online analytical processing engines

Published In

SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data

May 2017

1810 pages

ISBN:9781450341974

DOI:10.1145/3035918

General Chairs: Rada Chirkova (North Carolina State University, USA), Jun Yang (Duke University, USA)

Program Chair: Dan Suciu (University of Washington, USA)

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Science Foundation

Conference

SIGMOD/PODS'17

Sponsor: SIGMOD

SIGMOD/PODS'17: International Conference on Management of Data

May 14 - 19, 2017

Chicago, Illinois, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
