DOI: 10.1145/3514221.3517836

Camel: Managing Data for Efficient Stream Learning

Published: 11 June 2022

Abstract

Many real-world applications rely on predictive models that are incrementally learned online. Specifically, in a typical stream learning framework, models are updated with a single pass over continuously arriving data batches. However, this framework has three shortcomings: high training cost, low data effectiveness, and catastrophic forgetting. We describe Camel, a system that addresses these issues. Camel includes two independent data management components: coreset selection and buffer update. To accelerate model training, Camel selects a coreset from each streaming data batch for the model update. Selecting a coreset with worst-case guarantees is NP-hard, so we reformulate coreset selection as a submodular maximization problem by deriving an upper bound on the objective function. To mitigate catastrophic forgetting, Camel maintains a buffer of past representative samples as new data arrive. Moreover, Camel quantizes numerical data in the buffer via a quantile sketch to reduce the memory footprint. Finally, extensive experiments validate the effectiveness and efficiency of Camel. In particular, our coreset selection algorithm achieves a linear speedup with marginal accuracy loss on redundant datasets, and our buffer update algorithms outperform state-of-the-art anti-forgetting methods on various data distributions.
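
As a rough illustration of the two data-management ideas described above, the Python sketch below combines greedy maximization of a facility-location (submodular) objective for per-batch coreset selection with a simple quantile-bucket quantizer for numeric buffer columns. It is a minimal sketch, not Camel's implementation: the RBF similarity, the fixed 256-bucket quantizer, and every function name here are assumptions made for illustration, and Camel's actual objective, upper bound, and sketch differ in their details.

```python
# Minimal sketch (assumptions, not Camel's code): greedy facility-location
# coreset selection plus quantile-bucket quantization of buffered values.
import numpy as np


def greedy_coreset(X, k):
    """Greedily pick k rows of X maximizing a facility-location objective:
    the sum, over all points, of the maximum similarity to a selected point.
    Greedy selection of a monotone submodular objective carries the classic
    (1 - 1/e) approximation guarantee."""
    # Pairwise RBF similarities; the bandwidth choice is an assumption.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    sim = np.exp(-sq / (2.0 * np.median(sq) + 1e-12))

    selected = []
    covered = np.zeros(len(X))  # best similarity to any selected point so far
    for _ in range(k):
        # Marginal gain of each candidate j: how much total coverage improves.
        gains = np.maximum(sim, covered[:, None]).sum(axis=0) - covered.sum()
        if selected:
            gains[selected] = -np.inf  # never re-select a point
        j = int(np.argmax(gains))
        selected.append(j)
        covered = np.maximum(covered, sim[:, j])
    return selected


def quantile_quantize(col, num_buckets=256):
    """Map each value in a numeric column to a 1-byte quantile-bucket id,
    keeping bucket midpoints so approximate values can be reconstructed."""
    edges = np.quantile(col, np.linspace(0.0, 1.0, num_buckets + 1))
    ids = np.clip(np.searchsorted(edges, col, side="right") - 1, 0, num_buckets - 1)
    mids = (edges[:-1] + edges[1:]) / 2.0
    return ids.astype(np.uint8), mids  # decode a value as mids[ids[i]]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch = rng.normal(size=(500, 8))           # one incoming streaming batch
    core = greedy_coreset(batch, k=32)          # train on 32 points, not 500
    ids, mids = quantile_quantize(batch[:, 0])  # 1 byte per buffered value
    print(len(core), "coreset points;", ids.nbytes, "bytes for", len(batch), "values")
```

The greedy loop inherits the (1 - 1/e) approximation guarantee for maximizing monotone submodular functions, which is the kind of worst-case guarantee coreset selection can retain once the objective is relaxed to a submodular surrogate; a lazy-greedy variant would avoid recomputing every marginal gain in each round.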

Supplementary Material

MP4 File (SIGMOD22_fp59.mp4)
Presentation video.


Published In

SIGMOD '22: Proceedings of the 2022 International Conference on Management of Data
June 2022
2597 pages
ISBN: 9781450392495
DOI: 10.1145/3514221
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 June 2022

Author Tags

  1. coreset selection
  2. data management
  3. stream learning

Qualifiers

  • Research-article

Funding Sources

  • the Tencent Wechat Rhino-Bird Focused Research Program
  • Shanghai Municipal Science and Technology Major Project
  • the Hong Kong RGC GRF Project
  • China NSFC
  • HKUST Global Strategic Partnership Fund
  • SJTU Global Strategic Partnership Fund
  • National Key Research and Development Program of China
  • HKUST-Webank joint research lab
  • Microsoft Research Asia Collaborative Research Grant
  • Didi-HKUST joint research lab
  • the Hong Kong RGC CRF Project
  • Hong Kong ITC ITF grants
  • HKUST-NAVER/LINE AI Lab
  • the Hong Kong RGC Theme-based Project TRS
  • the Hong Kong RGC RIF Project
  • Guangdong Basic and Applied Basic Research Foundation
  • the Hong Kong RGC AOE Project

Conference

SIGMOD/PODS '22

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Article Metrics

  • Downloads (Last 12 months): 511
  • Downloads (Last 6 weeks): 45
Reflects downloads up to 30 Aug 2024

Cited By

  • (2024) PECJ: Stream Window Join on Disorder Data Streams with Proactive Error Compensation. Proceedings of the ACM on Management of Data 2(1), 1-24. DOI: 10.1145/3639268. Online publication date: 26-Mar-2024.
  • (2024) Target-agnostic Source-free Domain Adaptation for Regression Tasks. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 1464-1477. DOI: 10.1109/ICDE60146.2024.00121. Online publication date: 13-May-2024.
  • (2024) E2GCL: Efficient and Expressive Contrastive Learning on Graph Neural Networks. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 859-873. DOI: 10.1109/ICDE60146.2024.00071. Online publication date: 13-May-2024.
  • (2023) FedCSS: Joint Client-and-Sample Selection for Hard Sample-Aware Noise-Robust Federated Learning. Proceedings of the ACM on Management of Data 1(3), 1-24. DOI: 10.1145/3617332. Online publication date: 13-Nov-2023.
  • (2023) EARLY: Efficient and Reliable Graph Neural Network for Dynamic Graphs. Proceedings of the ACM on Management of Data 1(2), 1-28. DOI: 10.1145/3589308. Online publication date: 20-Jun-2023.
  • (2023) Fast Prototyping of Distributed Stream Processing Applications with stream2gym. 2023 IEEE 43rd International Conference on Distributed Computing Systems (ICDCS), 395-405. DOI: 10.1109/ICDCS57875.2023.00034. Online publication date: Jul-2023.
  • (2023) Environment-agnostic Effective Learning for Domain Generalization on IoT Time Series Data. 2023 International Conference on Artificial Intelligence of Things and Systems (AIoTSys), 214-220. DOI: 10.1109/AIoTSys58602.2023.00053. Online publication date: 19-Oct-2023.
