Efficient Coreset Selection with Cluster-based Methods

Published: 04 August 2023


Coreset selection is a technique for efficient machine learning: it selects a subset of the training data that achieves model performance similar to that of training on the full dataset. It can be performed with or without training machine learning models. Coreset selection with training, which iteratively trains the machine learning model and updates the data items in the coreset, is time-consuming. Coreset selection without training can select the coreset before training. Gradient approximation is the typical method, but it can also be slow on large training datasets because it requires multiple iterations and pairwise distance computations in each iteration. The state-of-the-art (SOTA) results with respect to effectiveness are achieved by the latter approach, i.e., gradient approximation.
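To make the cost of pairwise distance computations concrete, the following is a minimal sketch of a training-free greedy selection. It assumes that Euclidean distance between feature vectors serves as a proxy for the difference between per-item gradients, and it uses a generic medoid-style greedy rule; the function name `greedy_coreset` and the weighting scheme are illustrative assumptions, not the SOTA method referred to above.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def greedy_coreset(features, k):
    """Pick k items so that every item is close to some selected item.

    Assumption: feature distance stands in for gradient difference.
    Returns (selected indices, weight per selected item).
    """
    n = len(features)
    # The full pairwise distance matrix is the O(n^2) bottleneck.
    dist = [[euclidean(features[i], features[j]) for j in range(n)]
            for i in range(n)]
    nearest = [math.inf] * n  # distance from each item to its closest pick
    selected = []
    for _ in range(k):
        best, best_cost = None, math.inf
        for c in range(n):
            if c in selected:
                continue
            # Total distance to the nearest pick if candidate c were added.
            cost = sum(min(nearest[i], dist[i][c]) for i in range(n))
            if cost < best_cost:
                best, best_cost = c, cost
        selected.append(best)
        nearest = [min(nearest[i], dist[i][best]) for i in range(n)]
    # Weight each pick by the number of items it represents.
    weights = [0] * k
    for i in range(n):
        owner = min(range(k), key=lambda s: dist[i][selected[s]])
        weights[owner] += 1
    return selected, weights
```

Each of the k rounds scans all n candidates against all n items, so the sketch performs on the order of k·n² distance lookups on top of the n² matrix construction, which illustrates why this style of selection becomes slow on large training sets.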
In this paper, we aim to significantly improve the efficiency of coreset selection while ensuring good effectiveness, by improving the SOTA gradient approximation approaches that do not train machine learning models. Specifically, we present a highly efficient coreset selection framework that utilizes an approximation of the gradient. This is achieved by dividing the entire training set into multiple clusters, each of which contains items that are close to one another in feature space (measured by Euclidean distance). We further show that the full gradient can be bounded based on the maximum feature distance between each item and each cluster, which allows more efficient coreset selection by iterating through these clusters rather than through individual items. Additionally, we propose an efficient method for estimating the maximum feature distance using the product quantization technique. Our experiments on multiple real-world datasets demonstrate that our framework improves efficiency by 3-10 times compared with SOTA, with almost no loss of accuracy.
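As an illustration of how product quantization can estimate the maximum feature distance between an item and a cluster, consider the minimal sketch below. It assumes the codebooks (one list of centroids per subspace) have already been learned, for example by running k-means in each subspace; the helper names `split`, `encode`, and `max_distance_to_cluster` are hypothetical and do not come from the paper. Because of quantization error, the result is an estimate rather than a guaranteed bound.

```python
import math

def split(vec, m):
    """Split a vector into m equal-length sub-vectors."""
    step = len(vec) // m
    return [vec[i * step:(i + 1) * step] for i in range(m)]

def sq_dist(a, b):
    """Squared Euclidean distance between two sub-vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def encode(vec, codebooks):
    """Replace each sub-vector by the index of its nearest centroid."""
    subs = split(vec, len(codebooks))
    return tuple(
        min(range(len(book)), key=lambda c: sq_dist(sub, book[c]))
        for sub, book in zip(subs, codebooks)
    )

def max_distance_to_cluster(query, cluster_codes, codebooks):
    """Estimate the largest distance from query to any encoded cluster member.

    A per-subspace lookup table is built once for the query, so each
    cluster member costs only one table lookup per subspace.
    """
    subs = split(query, len(codebooks))
    table = [[sq_dist(sub, centroid) for centroid in book]
             for sub, book in zip(subs, codebooks)]
    return max(
        math.sqrt(sum(table[j][c] for j, c in enumerate(code)))
        for code in cluster_codes
    )

# Toy setup (assumed values): 4-D features, 2 subspaces, 2 centroids each.
codebooks = [[(0.0, 0.0), (1.0, 1.0)], [(0.0, 0.0), (2.0, 2.0)]]
cluster = [(0.0, 0.0, 0.0, 0.0), (1.0, 1.0, 2.0, 2.0), (0.9, 1.1, 0.1, -0.1)]
codes = [encode(v, codebooks) for v in cluster]
estimate = max_distance_to_cluster((0.0, 0.0, 0.0, 0.0), codes, codebooks)
```

The design point is that the query is compared against a small number of centroids per subspace instead of against every raw feature vector, so the per-member cost no longer depends on the full feature dimension.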

KDD '23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
August 2023
5996 pages


Author Tags

  1. coreset selection
  2. data-efficient ml
  3. product quantization



