DOI: 10.1145/3580305.3599326
Research article · Free access

Efficient Coreset Selection with Cluster-based Methods

Published: 04 August 2023

Abstract

Coreset selection is a technique for efficient machine learning that selects a subset of the training data such that a model trained on the subset achieves performance similar to one trained on the full dataset. It can be performed with or without training machine learning models. Coreset selection with training, which iteratively trains the model and updates the data items in the coreset, is time-consuming. Coreset selection without training selects the coreset before any training takes place; here, gradient approximation is the typical method, but it can also be slow on large training datasets because it requires multiple iterations, each involving pairwise distance computations. The state-of-the-art (SOTA) results with respect to effectiveness are achieved by this latter approach, i.e., gradient approximation without training.
In this paper, we aim to significantly improve the efficiency of coreset selection while preserving its effectiveness, by improving the SOTA gradient-approximation approaches that operate without training machine learning models. Specifically, we present a highly efficient coreset selection framework that utilizes an approximation of the gradient. It divides the entire training set into multiple clusters, each containing items with similar features (measured by Euclidean distance). Our framework further shows that the full gradient can be bounded based on the maximum feature distance between each item and each cluster, which enables more efficient coreset selection by iterating over clusters rather than individual items. Additionally, we propose an efficient method for estimating the maximum feature distance using the product quantization technique. Experiments on multiple real-world datasets demonstrate a 3-10× efficiency improvement over the SOTA with almost no loss of accuracy.
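The product-quantization idea mentioned above can be sketched in a few lines: split each feature vector into subvectors, quantize each subspace with a small codebook, and then estimate distances from precomputed per-subspace tables instead of scanning raw vectors. The sketch below is a toy illustration under assumptions, not the authors' code; `train_pq`, `pq_encode`, and `approx_max_distance` are hypothetical helpers, and the paper's actual bound construction over clusters may differ in its details.

```python
import numpy as np

def train_pq(X, n_subvectors=2, n_centroids=16, iters=10, seed=0):
    """Toy product quantization: split features into subvectors and run a
    small k-means independently in each subspace."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    sub = d // n_subvectors
    codebooks = []
    for j in range(n_subvectors):
        S = X[:, j * sub:(j + 1) * sub]
        C = S[rng.choice(n, n_centroids, replace=False)]  # initial centroids
        for _ in range(iters):
            # assign each subvector to its nearest centroid, then recenter
            assign = np.argmin(((S[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
            for c in range(n_centroids):
                if np.any(assign == c):
                    C[c] = S[assign == c].mean(axis=0)
        codebooks.append(C)
    return codebooks

def pq_encode(X, codebooks):
    """Encode each item as one centroid index per subspace."""
    n, d = X.shape
    m = len(codebooks)
    sub = d // m
    codes = np.empty((n, m), dtype=np.int64)
    for j, C in enumerate(codebooks):
        S = X[:, j * sub:(j + 1) * sub]
        codes[:, j] = np.argmin(((S[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
    return codes

def approx_max_distance(q, codes, codebooks):
    """Estimate the maximum Euclidean distance from q to any encoded item
    using per-subspace distance tables (one table lookup per code)."""
    m = len(codebooks)
    sub = q.shape[0] // m
    tables = [((codebooks[j] - q[j * sub:(j + 1) * sub]) ** 2).sum(-1)
              for j in range(m)]
    per_item = sum(tables[j][codes[:, j]] for j in range(m))
    return float(np.sqrt(per_item.max()))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((1000, 8))
    codebooks = train_pq(X)
    codes = pq_encode(X, codebooks)
    est = approx_max_distance(X[0], codes, codebooks)
    exact = float(np.sqrt(((X - X[0]) ** 2).sum(axis=1)).max())
    print(f"PQ estimate: {est:.3f}, exact max distance: {exact:.3f}")
```

The point of the table-based estimate is that, once the codebooks and codes are built, computing a distance estimate against all items costs only table lookups per subspace rather than full pairwise distance computations, which is what makes iterating over clusters cheap.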

Supplementary Material

MP4 File (video1701950363.mp4)
Presentation video - short version




Published In

KDD '23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
August 2023, 5996 pages
ISBN: 9798400701030
DOI: 10.1145/3580305

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States



Author Tags

  1. coreset selection
  2. data-efficient ml
  3. product quantization


Acceptance Rates

Overall acceptance rate: 1,133 of 8,635 submissions (13%)



Article Metrics

  • Downloads (last 12 months): 1,241
  • Downloads (last 6 weeks): 101

Reflects downloads up to 20 Feb 2025


Cited By

  • (2025) DynImpt: A Dynamic Data Selection Method for Improving Model Training Efficiency. IEEE Transactions on Knowledge and Data Engineering, 37(1), 239-252. DOI: 10.1109/TKDE.2024.3482466
  • (2025) GaussDB-AISQL: a composable cloud-native SQL system with AI capabilities. Frontiers of Computer Science, 19(9). DOI: 10.1007/s11704-024-40624-2
  • (2024) Optimizing Data Acquisition to Enhance Machine Learning Performance. Proceedings of the VLDB Endowment, 17(6), 1310-1323. DOI: 10.14778/3648160.3648172
  • (2024) MisDetect: Iterative Mislabel Detection using Early Loss. Proceedings of the VLDB Endowment, 17(6), 1159-1172. DOI: 10.14778/3648160.3648161
  • (2024) Tabular Data-centric AI: Challenges, Techniques and Future Perspectives. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 5522-5525. DOI: 10.1145/3627673.3679102
