DOI: 10.1145/3035918.3064013

Database Learning: Toward a Database that Becomes Smarter Every Time

Published: 09 May 2017

Abstract

In today's databases, previous query answers rarely benefit answering future queries. For the first time, to the best of our knowledge, we change this paradigm in an approximate query processing (AQP) context. We make the following observation: the answer to each query reveals some degree of knowledge about the answer to another query because their answers stem from the same underlying distribution that has produced the entire dataset. Exploiting and refining this knowledge should allow us to answer queries more analytically, rather than by reading enormous amounts of raw data. Also, processing more queries should continuously enhance our knowledge of the underlying distribution, and hence lead to increasingly faster response times for future queries.
We call this novel idea, learning from past query answers, Database Learning. We exploit the principle of maximum entropy to produce answers that are, in expectation, guaranteed to be more accurate than existing sample-based approximations. Building on this idea, we implement a query engine on top of Spark SQL, called Verdict. We conduct extensive experiments on real-world query traces from a large customer of a major database vendor. Our results demonstrate that database learning supports 73.7% of these queries, speeding them up by up to 23.0x at the same accuracy level compared to existing AQP systems.
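To make the intuition concrete, here is a minimal sketch of how one past query answer could sharpen a new sample-based answer. It is an illustration under stated assumptions, not Verdict's implementation: the abstract only names the maximum entropy principle, so the sketch relies on the standard fact that the maximum entropy distribution for a given mean and covariance is Gaussian, and it assumes that the prior mean, the prior variance, and the correlation `rho` between the two exact answers are already known. The function name `improve_answer` and all numbers are hypothetical.

```python
def improve_answer(raw_new, raw_new_var, raw_old, raw_old_var,
                   prior_mean, prior_var, rho):
    """Combine a new sample-based answer with one past answer.

    Assumed model (illustrative only): the two exact answers are jointly
    Gaussian with a shared prior mean and variance and correlation rho;
    each raw answer equals its exact answer plus independent zero-mean
    sampling noise with the given variance.
    """
    if raw_new_var <= 0:
        raise ValueError("raw_new_var must be positive")

    # Step 1: condition the new exact answer on the past raw answer.
    cov = rho * prior_var                  # covariance of the two exact answers
    old_total_var = prior_var + raw_old_var
    mean_given_old = prior_mean + cov / old_total_var * (raw_old - prior_mean)
    var_given_old = prior_var - cov * cov / old_total_var
    if var_given_old <= 0:
        raise ValueError("degenerate prior: past answer determines the new one")

    # Step 2: fuse that belief with the new raw answer by
    # inverse-variance (precision) weighting.
    post_var = 1.0 / (1.0 / var_given_old + 1.0 / raw_new_var)
    post_mean = post_var * (mean_given_old / var_given_old
                            + raw_new / raw_new_var)
    return post_mean, post_var


# Hypothetical numbers: a precise past answer and a noisy new one.
mean, var = improve_answer(raw_new=104.0, raw_new_var=16.0,
                           raw_old=110.0, raw_old_var=1.0,
                           prior_mean=100.0, prior_var=25.0, rho=0.9)
print(f"improved answer: {mean:.2f}, variance: {var:.2f} (raw variance: 16.00)")
```

Under this model the combined variance can never exceed the variance of the raw sample-based answer, because precision weighting only adds information. The improvement shrinks toward zero as `rho` approaches 0, which matches the observation that a past answer helps only to the extent that it is related to the new query.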



Published In

SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data
May 2017
1810 pages
ISBN: 9781450341974
DOI: 10.1145/3035918
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. approximate query processing
  2. database learning
  3. machine learning
  4. maximum entropy principle
  5. online aggregation

Qualifiers

  • Research-article

Conference

SIGMOD/PODS'17
Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
