Declarative Parameterizations of User-Defined Functions for Large-Scale Machine Learning and Optimization

Published: 01 November 2019

Abstract

Large-scale optimization has become an important application for data management systems, particularly in the context of statistical machine learning. In this paper, we consider how one might implement the join-and-co-group pattern in the context of a fully declarative data processing system. The join-and-co-group pattern is ubiquitous in iterative, large-scale optimization. In this pattern, a user-defined function $g$ is parameterized with a data object $x$ as well as the subset of the statistical model $\Theta_x$ that applies to that object, so that $g(x \mid \Theta_x)$ can be used to compute a partial update of the model. This is repeated for every $x$ in the full data set $X$. All partial updates are then aggregated and used to perform a complete update of the model. The join-and-co-group pattern poses several implementation challenges, including the potential for a massive blow-up in the size of a fully parameterized model. Thus, unless the correct physical execution plan is chosen for implementing the join-and-co-group pattern, the execution can easily take a very long time or even fail to complete. In this paper, we carefully consider the alternatives for implementing the join-and-co-group pattern on top of a declarative system, as well as how the best alternative can be selected automatically. Our focus is on SimSQL, an SQL-based database system with special facilities for large-scale, iterative optimization. Because SimSQL is an SQL-based system with a query optimizer, these choices can be made automatically.
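The abstract's description of the pattern can be sketched in a few lines of Python. This is a minimal illustration under the assumption that the model $\Theta$ is a key-value map and that each data object names the parameter keys that apply to it; all names here (`join_and_co_group`, `g`, `aggregate`, `apply_update`) are hypothetical and are not SimSQL's actual interface.

```python
# Illustrative sketch of the join-and-co-group pattern (not SimSQL's API).
from collections import defaultdict

def join_and_co_group(X, theta, g, aggregate, apply_update):
    """One iteration: parameterize g with each x and its model slice
    Theta_x, co-group the partial updates by parameter key, then
    apply the aggregated update to the model."""
    partials = defaultdict(list)
    for x in X:
        # "Join": extract only the subset of the model that applies to x.
        theta_x = {k: theta[k] for k in x["keys"]}
        # g(x | Theta_x) yields partial updates keyed by parameter.
        for key, delta in g(x, theta_x).items():
            partials[key].append(delta)
    # "Co-group": combine all partial updates per parameter key,
    # then perform the complete update of the model.
    for key, deltas in partials.items():
        theta[key] = apply_update(theta[key], aggregate(deltas))
    return theta
```

For example, with `g` emitting each object's value for every applicable key, `sum` as the aggregator, and additive updates, two passes over the data accumulate per-key totals into the model. The paper's concern is precisely that the "join" step, materialized naively, can blow up the fully parameterized model.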


Cited By

  • (2024) SACNN-IDS. CAAI Transactions on Intelligence Technology, vol. 9, no. 6, pp. 1398–1411, doi: 10.1049/cit2.12352. Online publication date: 12-Jun-2024.
  • (2024) ABCNN-IDS: Attention-Based Convolutional Neural Network for Intrusion Detection in IoT Networks. Wireless Personal Communications, vol. 136, no. 4, pp. 1981–2003, doi: 10.1007/s11277-024-11260-7. Online publication date: 1-Jun-2024.
  • (2021) Distributed Deep Learning on Data Systems. Proceedings of the VLDB Endowment, vol. 14, no. 10, pp. 1769–1782, doi: 10.14778/3467861.3467867. Online publication date: 1-Jun-2021.
Published In

IEEE Transactions on Knowledge and Data Engineering, Volume 31, Issue 11, Nov. 2019, 30 pages.

Publisher: IEEE Educational Activities Department, United States.

Qualifiers: Research-article.
