Accelerating Machine Learning Inference with Probabilistic Predicates

Authors:

Aakanksha Chowdhery,

Srikanth Kandula,

Surajit Chaudhuri

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data

Pages 1493 - 1508

https://doi.org/10.1145/3183713.3183751

Published: 27 May 2018

Abstract

Classic query optimization techniques, including predicate pushdown, are of limited use for machine learning inference queries because the user-defined functions (UDFs) that extract relational columns from unstructured inputs are often very expensive, and query predicates remain stuck behind these UDFs whenever they require relational columns that the UDFs generate. In this work, we demonstrate how to construct and apply probabilistic predicates that filter out data blobs that do not satisfy the query predicate; this filtering is parametrized to different target accuracies. Furthermore, to support complex predicates and to avoid per-query training, we augment a cost-based query optimizer to choose plans with appropriate combinations of simpler probabilistic predicates. Experiments with several machine learning workloads on a big-data cluster show that query processing improves by as much as 10x.
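
The idea of filtering blobs ahead of an expensive UDF, with the filter tuned to a target accuracy, can be illustrated with a minimal sketch. This is one plausible reading of the abstract and not the paper's implementation: the function names (`choose_threshold`, `run_query`), the use of a single score cutoff, and the interpretation of "target accuracy" as the fraction of truly matching blobs that must survive the filter are all assumptions made for illustration.

```python
import math


def choose_threshold(positive_scores, target_accuracy):
    """Return the highest score cutoff that still retains at least
    `target_accuracy` of the validation blobs known to satisfy the predicate.
    (Hypothetical helper; the cutoff rule is an assumption.)"""
    if not positive_scores:
        raise ValueError("need at least one positive validation score")
    ranked = sorted(positive_scores, reverse=True)
    # Small epsilon guards against float noise, e.g. 0.7 * 10 == 7.000000000000001.
    keep = math.ceil(target_accuracy * len(ranked) - 1e-9)
    keep = max(1, min(keep, len(ranked)))
    return ranked[keep - 1]


def run_query(blobs, cheap_score, expensive_udf, predicate, threshold):
    """Apply the cheap filter first so only surviving blobs pay for the UDF."""
    results, udf_calls = [], 0
    for blob in blobs:
        if cheap_score(blob) < threshold:
            continue  # dropped early; a matching blob dropped here is a false negative
        udf_calls += 1
        row = expensive_udf(blob)  # stands in for costly ML inference
        if predicate(row):
            results.append(row)
    return results, udf_calls


if __name__ == "__main__":
    # Scores the cheap filter gave to validation blobs that truly match.
    validation_positive_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.95, 0.85, 0.2]
    threshold = choose_threshold(validation_positive_scores, target_accuracy=0.9)

    # Toy blobs: "score" mimics the cheap filter, "color" mimics the UDF output.
    blobs = [
        {"score": 0.10, "color": "blue"},
        {"score": 0.35, "color": "red"},
        {"score": 0.25, "color": "red"},
        {"score": 0.80, "color": "blue"},
    ]
    rows, calls = run_query(
        blobs,
        cheap_score=lambda b: b["score"],
        expensive_udf=lambda b: {"color": b["color"]},
        predicate=lambda r: r["color"] == "red",
        threshold=threshold,
    )
    print(f"threshold={threshold}, udf_calls={calls}, rows={rows}")
```

In this sketch, lowering `target_accuracy` raises the cutoff, which skips more UDF calls at the cost of more missed matches; the toy data shows one matching blob being dropped for exactly that reason.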

Index Terms

1. Computing methodologies
  1. Distributed computing methodologies
    1. Distributed algorithms

Published In

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data

May 2018

1874 pages

ISBN: 9781450347037

DOI: 10.1145/3183713

General Chairs:
Gautam Das, University of Texas at Arlington, USA
Christopher Jermaine, Rice University, USA
Philip Bernstein, Microsoft Research, USA

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGMOD/PODS '18

Sponsor:

SIGMOD

SIGMOD/PODS '18: International Conference on Management of Data

June 10 - 15, 2018

Houston, TX, USA

Acceptance Rates

SIGMOD '18 Paper Acceptance Rate 90 of 461 submissions, 20%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%
