DOI: 10.1145/3514221.3517836

Camel: Managing Data for Efficient Stream Learning

Published: 11 June 2022

Abstract

Many real-world applications rely on predictive models that are incrementally learned online. Specifically, in a typical stream learning framework, models are updated with a single pass over continuously arriving data batches. However, this framework has three shortcomings: high training cost, low data effectiveness, and catastrophic forgetting. We describe Camel, a system that addresses these issues. Camel includes two independent data management components: coreset selection and buffer update. To accelerate model training, Camel selects a coreset from each streaming data batch for the model update. Selecting a coreset with worst-case guarantees is NP-hard, so we reformulate coreset selection as a submodular maximization problem by deriving an upper bound on the objective function. To mitigate catastrophic forgetting, Camel maintains a buffer of past representative samples as new data arrive. Moreover, Camel quantizes numerical data in the buffer via a quantile sketch to reduce the memory footprint. Finally, extensive experiments validate the effectiveness and efficiency of Camel. In particular, our coreset selection algorithm achieves a linear speedup with marginal accuracy loss on redundant datasets, and our buffer update algorithms outperform state-of-the-art anti-forgetting methods on various data distributions.
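
As a rough illustration of the two data-management ideas described above, the Python sketch below combines greedy maximization of a facility-location (submodular) objective for per-batch coreset selection with a simple quantile-bucket quantizer for numeric buffer columns. It is a minimal sketch, not Camel's implementation: the RBF similarity, the fixed 256-bucket quantizer, and every function name here are assumptions made for illustration, and Camel's actual objective, upper bound, and sketch differ in their details.

```python
# Minimal sketch (assumptions, not Camel's code): greedy facility-location
# coreset selection plus quantile-bucket quantization of buffered values.
import numpy as np


def greedy_coreset(X, k):
    """Greedily pick k rows of X maximizing a facility-location objective:
    the sum, over all points, of the maximum similarity to a selected point.
    Greedy selection of a monotone submodular objective carries the classic
    (1 - 1/e) approximation guarantee."""
    # Pairwise RBF similarities; the bandwidth choice is an assumption.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    sim = np.exp(-sq / (2.0 * np.median(sq) + 1e-12))

    selected = []
    covered = np.zeros(len(X))  # best similarity to any selected point so far
    for _ in range(k):
        # Marginal gain of each candidate j: how much total coverage improves.
        gains = np.maximum(sim, covered[:, None]).sum(axis=0) - covered.sum()
        if selected:
            gains[selected] = -np.inf  # never re-select a point
        j = int(np.argmax(gains))
        selected.append(j)
        covered = np.maximum(covered, sim[:, j])
    return selected


def quantile_quantize(col, num_buckets=256):
    """Map each value in a numeric column to a 1-byte quantile-bucket id,
    keeping bucket midpoints so approximate values can be reconstructed."""
    edges = np.quantile(col, np.linspace(0.0, 1.0, num_buckets + 1))
    ids = np.clip(np.searchsorted(edges, col, side="right") - 1, 0, num_buckets - 1)
    mids = (edges[:-1] + edges[1:]) / 2.0
    return ids.astype(np.uint8), mids  # decode a value as mids[ids[i]]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch = rng.normal(size=(500, 8))           # one incoming streaming batch
    core = greedy_coreset(batch, k=32)          # train on 32 points, not 500
    ids, mids = quantile_quantize(batch[:, 0])  # 1 byte per buffered value
    print(len(core), "coreset points;", ids.nbytes, "bytes for", len(batch), "values")
```

The greedy loop inherits the (1 - 1/e) approximation guarantee for maximizing monotone submodular functions, which is the kind of worst-case guarantee coreset selection can retain once the objective is relaxed to a submodular surrogate; a lazy-greedy variant would avoid recomputing every marginal gain in each round.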

Supplementary Material

MP4 File (SIGMOD22_fp59.mp4)
Presentation video.


Published In

SIGMOD '22: Proceedings of the 2022 International Conference on Management of Data
June 2022
2597 pages
ISBN: 9781450392495
DOI: 10.1145/3514221
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 June 2022

Author Tags

  1. coreset selection
  2. data management
  3. stream learning

Qualifiers

  • Research-article

Funding Sources

  • the Tencent Wechat Rhino-Bird Focused Research Program
  • Shanghai Municipal Science and Technology Major Project
  • the Hong Kong RGC GRF Project
  • China NSFC
  • HKUST Global Strategic Partnership Fund
  • SJTU Global Strategic Partnership Fund
  • National Key Research and Development Program of China
  • HKUST-Webank joint research lab
  • Microsoft Research Asia Collaborative Research Grant
  • Didi-HKUST joint research lab
  • the Hong Kong RGC CRF Project
  • Hong Kong ITC ITF grants
  • HKUST-NAVER/LINE AI Lab
  • the Hong Kong RGC Theme-based Project TRS
  • the Hong Kong RGC RIF Project
  • Guangdong Basic and Applied Basic Research Foundation
  • the Hong Kong RGC AOE Project

Conference

SIGMOD/PODS '22

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Article Metrics

  • Downloads (Last 12 months): 511
  • Downloads (Last 6 weeks): 45
Reflects downloads up to 30 Aug 2024

Cited By

  • (2024) PECJ: Stream Window Join on Disorder Data Streams with Proactive Error Compensation. Proceedings of the ACM on Management of Data 2(1), 1-24. DOI: 10.1145/3639268. Online publication date: 26-Mar-2024.
  • (2024) Target-agnostic Source-free Domain Adaptation for Regression Tasks. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 1464-1477. DOI: 10.1109/ICDE60146.2024.00121. Online publication date: 13-May-2024.
  • (2024) E2GCL: Efficient and Expressive Contrastive Learning on Graph Neural Networks. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 859-873. DOI: 10.1109/ICDE60146.2024.00071. Online publication date: 13-May-2024.
  • (2023) FedCSS: Joint Client-and-Sample Selection for Hard Sample-Aware Noise-Robust Federated Learning. Proceedings of the ACM on Management of Data 1(3), 1-24. DOI: 10.1145/3617332. Online publication date: 13-Nov-2023.
  • (2023) EARLY: Efficient and Reliable Graph Neural Network for Dynamic Graphs. Proceedings of the ACM on Management of Data 1(2), 1-28. DOI: 10.1145/3589308. Online publication date: 20-Jun-2023.
  • (2023) Fast Prototyping of Distributed Stream Processing Applications with stream2gym. 2023 IEEE 43rd International Conference on Distributed Computing Systems (ICDCS), 395-405. DOI: 10.1109/ICDCS57875.2023.00034. Online publication date: Jul-2023.
  • (2023) Environment-agnostic Effective Learning for Domain Generalization on IoT Time Series Data. 2023 International Conference on Artificial Intelligence of Things and Systems (AIoTSys), 214-220. DOI: 10.1109/AIoTSys58602.2023.00053. Online publication date: 19-Oct-2023.
