research-article

Probabilistic demand forecasting at scale

Published: 01 August 2017
    Abstract

    We present a platform built on large-scale, data-centric machine learning (ML) approaches, with a particular focus on demand forecasting in retail. At its core, this platform enables the training and application of probabilistic demand forecasting models, and provides convenient abstractions and support functionality for forecasting problems. The platform comprises a complex end-to-end machine learning system built on Apache Spark, which includes data preprocessing, feature engineering, distributed learning, as well as evaluation, experimentation and ensembling. Furthermore, it meets the demands of a production system and scales to large catalogues containing millions of items.
    We describe the challenges of building such a platform and discuss our design decisions. We detail aspects at several levels of the system, such as a set of general distributed learning schemes, our machinery for ensembling predictions, and a high-level dataflow abstraction for modeling complex ML pipelines. To the best of our knowledge, no prior work on real-world demand forecasting systems rivals our approach in terms of scalability.
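    To make the notions of a probabilistic forecast and of ensembling predictions concrete, the following is a minimal single-machine sketch in Python. It is not the platform's implementation: the two models, their Gaussian sample generators, the function names, and the equal-weight pooling scheme are all assumptions chosen for illustration. Each hypothetical model represents its forecast for one item-day as a set of demand samples; the ensemble pools those samples, and quantiles of the pooled set are scored with the quantile (pinball) loss.

```python
import random


def quantile(samples, q):
    """Empirical q-quantile of a list of samples (simple rank-based rule)."""
    ordered = sorted(samples)
    # Rank index, clamped to the valid range of positions.
    index = min(len(ordered) - 1, max(0, int(q * len(ordered))))
    return ordered[index]


def pinball_loss(actual, predicted_quantile, q):
    """Quantile (pinball) loss for a single observation."""
    diff = actual - predicted_quantile
    return q * diff if diff >= 0 else (q - 1) * diff


def ensemble(sample_sets):
    """Pool samples from several models into one list.

    With equally sized sample sets this is an equal-weight mixture.
    """
    pooled = []
    for samples in sample_sets:
        pooled.extend(samples)
    return pooled


rng = random.Random(0)
# Two hypothetical models, each emitting 1000 demand samples for one item-day.
model_a = [max(0, round(rng.gauss(10, 3))) for _ in range(1000)]
model_b = [max(0, round(rng.gauss(14, 5))) for _ in range(1000)]

combined = ensemble([model_a, model_b])
p50 = quantile(combined, 0.5)
p90 = quantile(combined, 0.9)
print("median demand:", p50, "90th percentile:", p90)
print("pinball loss at P90 for actual demand 12:", pinball_loss(12, p90, 0.9))
```

    Representing a forecast as samples rather than a single number is what lets a downstream consumer read off any quantile it needs, for example a high quantile for stocking decisions. A distributed system would compute the same quantities per item in parallel; this sketch only shows the per-item logic.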

    Published In

    Proceedings of the VLDB Endowment  Volume 10, Issue 12
    August 2017
    427 pages
    ISSN:2150-8097

    Publisher

    VLDB Endowment
