Declarative Parameterizations of User-Defined Functions for Large-Scale Machine Learning and Optimization

Published: 01 November 2019

Abstract

Large-scale optimization has become an important application for data management systems, particularly in the context of statistical machine learning. In this paper, we consider how one might implement the join-and-co-group pattern in the context of a fully declarative data processing system. The join-and-co-group pattern is ubiquitous in iterative, large-scale optimization. In this pattern, a user-defined function $g$ is parameterized with a data object $x$ as well as the subset of the statistical model $\Theta_x$ that applies to that object, so that $g(x \mid \Theta_x)$ can be used to compute a partial update of the model. This is repeated for every $x$ in the full data set $X$. All partial updates are then aggregated and used to perform a complete update of the model. The join-and-co-group pattern poses several implementation challenges, including the potential for a massive blow-up in the size of a fully parameterized model. Thus, unless the correct physical execution plan is chosen for implementing the join-and-co-group pattern, the execution can easily take a very long time or even fail to complete. In this paper, we carefully consider the alternatives for implementing the join-and-co-group pattern on top of a declarative system, as well as how the best alternative can be selected automatically. Our focus is on SimSQL, an SQL-based database system with special facilities for large-scale, iterative optimization. Because SimSQL is an SQL-based system with a query optimizer, these choices can be made automatically.
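The abstract's description of the pattern can be sketched in a few lines of Python. This is a minimal illustration under the assumption that the model $\Theta$ is a key-value map and that each data object names the parameter keys that apply to it; all names here (`join_and_co_group`, `g`, `aggregate`, `apply_update`) are hypothetical and are not SimSQL's actual interface.

```python
# Illustrative sketch of the join-and-co-group pattern (not SimSQL's API).
from collections import defaultdict

def join_and_co_group(X, theta, g, aggregate, apply_update):
    """One iteration: parameterize g with each x and its model slice
    Theta_x, co-group the partial updates by parameter key, then
    apply the aggregated update to the model."""
    partials = defaultdict(list)
    for x in X:
        # "Join": extract only the subset of the model that applies to x.
        theta_x = {k: theta[k] for k in x["keys"]}
        # g(x | Theta_x) yields partial updates keyed by parameter.
        for key, delta in g(x, theta_x).items():
            partials[key].append(delta)
    # "Co-group": combine all partial updates per parameter key,
    # then perform the complete update of the model.
    for key, deltas in partials.items():
        theta[key] = apply_update(theta[key], aggregate(deltas))
    return theta
```

For example, with `g` emitting each object's value for every applicable key, `sum` as the aggregator, and additive updates, two passes over the data accumulate per-key totals into the model. The paper's concern is precisely that the "join" step, materialized naively, can blow up the fully parameterized model.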


Cited By

  • (2024) SACNN-IDS. CAAI Transactions on Intelligence Technology, vol. 9, no. 6, pp. 1398–1411, doi: 10.1049/cit2.12352. Online publication date: 12-Jun-2024.
  • (2024) ABCNN-IDS: Attention-Based Convolutional Neural Network for Intrusion Detection in IoT Networks. Wireless Personal Communications, vol. 136, no. 4, pp. 1981–2003, doi: 10.1007/s11277-024-11260-7. Online publication date: 1-Jun-2024.
  • (2021) Distributed Deep Learning on Data Systems. Proceedings of the VLDB Endowment, vol. 14, no. 10, pp. 1769–1782, doi: 10.14778/3467861.3467867. Online publication date: 1-Jun-2021.
Published In

IEEE Transactions on Knowledge and Data Engineering, Volume 31, Issue 11, Nov. 2019, 30 pages.

Publisher: IEEE Educational Activities Department, United States.

Qualifiers: Research-article.
