Importance-aware Dataset Partitioning for Data-parallel DNN Training
Abstract. Deep neural networks used for computer vision tasks are
typically trained on datasets consisting of thousands of images, called
examples. Recent studies have shown that examples in a dataset are not
of equal importance for model training and can be categorized based
on quantifiable measures reflecting a notion of “hardness” or “impor-
tance”. In this work, we conduct an empirical study of the impact of
importance-aware partitioning of the dataset examples across workers
on the performance of data-parallel training of deep neural networks.
Our experiments with CIFAR-10 and CIFAR-100 image datasets show
that data-parallel training with importance-aware partitioning can per-
form better than vanilla data-parallel training, which is oblivious to the
importance of examples. More specifically, the proper choice of the im-
portance measure, partitioning heuristic, and the number of intervals
for dataset repartitioning can improve the best accuracy of the model
trained for a fixed number of epochs. We conclude that the parameters
related to importance-aware data-parallel training, including the impor-
tance measure, number of warmup training epochs, and others defined in
the paper, may be considered as hyperparameters of data-parallel model
training.
1 Introduction
Data-parallel training (DPT) is the current best practice for training deep neural
networks (DNNs) on large datasets over several computing nodes (a.k.a. work-
ers) [11]. In DPT, the DNN (model) is replicated among the workers, and the
training dataset is partitioned and distributed uniformly among them. DPT is
an iterative process where in each iteration, each worker trains its model replica
on its dataset partition for one epoch. After each iteration, the parameters or
gradients of the worker models are aggregated and updated. Then, all work-
ers continue the training using the same updated model replicas. This “vanilla”
DPT scheme is shown in Figure 1.
The dataset partitions in vanilla DPT are constructed by random partition-
ing, i.e., randomly assigning training examples to each partition. However, it is
known that not all examples within a training dataset are of equal importance for
training DNNs [13, 3, 2, 6], meaning that different examples contribute differently
to the training process and the performance of the trained model (e.g., its pre-
diction accuracy). Prior works have used example importance to improve DNN
training schemes, mainly aiming at reducing the total training time or increas-
ing the performance of the trained models. For example, in dataset subset search
[3], the goal is to find subset(s) of a given training dataset that can be used to
train equally good or more performant models compared to the models trained
on the initial dataset. Example importance has also been used for developing
more effective sampling algorithms for stochastic gradient descent (SGD) [6], or
in active learning for choosing the best examples to label [2].
Contributions. All the above-mentioned solutions are mainly designed for non-
distributed model training. In this paper, we study different heuristics to assign
examples, based on their importance, to workers in a distributed environment
and in DPT. In particular, the contributions of this work are as follows.
Fig. 1. The vanilla DPT scheme with four workers and one parameter server. At each
epoch, each worker gets a random partition of the dataset, and all the workers are
assigned the same model replica. After one epoch of training, the workers send their
local gradients or model parameters to the parameter server. The parameter server
performs either gradient aggregation or model aggregation and sends back the new
gradients or parameters to the workers.
or the dataset (for DPT) and assign them to the workers to enable parallel
training. For our purposes, we define a worker w ∈ W as a process within a
processing node that is allocated exactly one GPU, i.e., each worker corresponds
to exactly one GPU in our cluster of processing nodes.
In the most common DPT scheme, which we refer to as vanilla DPT,
the DNN is replicated across the workers. At the beginning of each epoch, the
dataset is partitioned uniformly at random into disjoint subsets p ∈ P , such
that worker wi is allocated the partition pi (dataset partitioning step). More
formally, P = ⋃_{i=0}^{n−1} p_i such that p_i ∩ p_j = ∅ for i ≠ j, and p_i ≠ ∅ for each i.
For simplicity, we assume that the number of examples in the dataset, or |D|, is
divisible by the number of workers, n = |W |; but the approach and results can
easily be extended to cases where the assumption does not hold.
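As a concrete illustration, this random dataset partitioning step can be sketched in a few lines of Python (the helper name random_partition is a hypothetical placeholder, not part of any framework):

import random

def random_partition(num_examples, num_workers, seed=None):
    # Shuffle the example indices and split them into |W| disjoint,
    # equally sized partitions (assumes num_examples % num_workers == 0).
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    size = num_examples // num_workers
    return [indices[i * size:(i + 1) * size] for i in range(num_workers)]

# e.g., 8 examples across 4 workers -> 4 disjoint partitions of 2 indices each
partitions = random_partition(num_examples=8, num_workers=4, seed=0)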
During an epoch, each worker independently trains its own replica of the
DNN model (local training step) on its own partition pi . At the end of an epoch,
a model synchronization step occurs, e.g., using a parameter server, and the
workers get a new identical replica of the model. This process is repeated for
a specified budget (e.g., a pre-determined number of epochs) or until a model
convergence criterion or performance metric is satisfied. We are interested in whether a partitioning function based on notions of example importance can lead to better results than vanilla DPT’s random partitioning in terms of the target performance metrics. We define the importance of an example,
denoted by Imp, as a mapping of an example to a scalar value:
Imp : e → R (1)
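For instance, the two importance measures used later in the experiments, the average loss and the variance of the loss of an example over the tracked epochs, are both instances of such a mapping. A minimal sketch, assuming the per-epoch losses of an example are available as a list of floats:

from statistics import mean, pvariance

def importance_average_loss(per_epoch_losses):
    # Imp(e) as the average of the example's recorded loss values.
    return mean(per_epoch_losses)

def importance_loss_variance(per_epoch_losses):
    # Imp(e) as the (population) variance of the example's recorded loss values.
    return pvariance(per_epoch_losses)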
3 Importance-aware DPT
[Figure: the importance-aware DPT workflow. For the first Ewarmup epochs, random dataset partitioning and model training are repeated every epoch; afterwards, heuristic-based dataset partitioning, example importance calculation, and model training are repeated every Einterval epochs until the completion criterion is met.]
those fixed partitions for Einterval epochs, and (iii) example importance calcula-
tion, in which we recalculate and update the importance value of each example
for the next interval. In the rest of this section, we discuss importance-aware
DPT in more detail.
Fig. 4. Depiction of Stripes (left) and Blocks (right) partitioning heuristics for a
setting with eight examples (indexed in order of importance) and four workers.
Now that we have a mapping between examples and their importance values, we
can use various heuristics to proceed with dataset partitioning for importance-
aware DPT. Recall that in vanilla DPT, the examples are partitioned randomly
across the parallel workers at the beginning of each epoch. We have defined two
such heuristics, namely Stripes and Blocks, and compared them with random
partitioning (i.e., vanilla DPT).
where sorted index(e) returns the index of example e in the sorted list SE, n is
the number of workers, and i = 0, ..., n − 1. The Stripes heuristic is depicted
on the left side of Figure 4.
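Following the depiction in Figure 4, Stripes assigns examples to workers round-robin by their position in the importance-sorted list, while Blocks assigns contiguous chunks of that list. A minimal sketch of the two heuristics (function names are hypothetical placeholders) could look as follows:

def stripes_partition(sorted_examples, num_workers):
    # Worker i receives the examples whose position in the importance-sorted
    # list satisfies sorted_index(e) mod n == i (round-robin striping).
    return [sorted_examples[i::num_workers] for i in range(num_workers)]

def blocks_partition(sorted_examples, num_workers):
    # Worker i receives the i-th contiguous block of the importance-sorted list.
    size = len(sorted_examples) // num_workers
    return [sorted_examples[i * size:(i + 1) * size] for i in range(num_workers)]

# With eight examples sorted by importance and four workers:
# stripes_partition(list(range(8)), 4) -> [[0, 4], [1, 5], [2, 6], [3, 7]]
# blocks_partition(list(range(8)), 4)  -> [[0, 1], [2, 3], [4, 5], [6, 7]]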
4 Implementation in PyTorch
This section presents the implementation details of importance-aware data-
parallel training in PyTorch v1.10.1 [8, 9]. The implementation is mainly based
on several classes and methods that (i) track and calculate the importance of
examples as explained in Sections 3.1 and 3.2, (ii) partition the dataset across
workers based on importance-aware heuristics defined in Section 3.3, and (iii)
resume and continue the model training for fixed intervals of Einterval epochs as
described in Section 3.4.
To get individual (per-example) loss values, we construct an additional loss function of the same type and set its reduction parameter to 'none'. This way, this loss function returns a tensor of per-example losses instead of a single scalar.
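For a classification model trained with cross-entropy, for example, the default and per-example loss functions can be set up as follows (a minimal sketch, not the exact code of our implementation):

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()                              # default reduction: a single scalar (batch mean)
criterion_per_example = nn.CrossEntropyLoss(reduction='none')  # one loss value per example

logits = torch.randn(4, 10)             # a mini-batch of 4 examples, 10 classes
targets = torch.randint(0, 10, (4,))

batch_loss = criterion(logits, targets)                        # scalar
per_example_losses = criterion_per_example(logits, targets)    # shape: (4,)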
Hence, each step of the training consists of two forward passes: the first
one uses the customized loss function and writes values to a local worker copy
of a loss-epochs matrix similar to the one depicted in Figure 3, and the second
forward pass uses the default loss function implementation which is used with the
backward pass. Each worker maintains its own copy of the loss-epochs matrix,
and before each dataset partitioning step, the workers wait at a barrier (by calling
torch.distributed.barrier()) for the main process to merge the local copies
and aggregate them, i.e., to compute the row-wise average, which is the average loss of each example across the epochs. The output of this step is a sorted list of (example, importance value) tuples, the Sorted Examples list introduced in Section 3.3, which is used with the importance-aware partitioning heuristics.
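The merge itself can be realized in different ways; the sketch below replaces the main-process merge with a sum all-reduce, which gives the same result here because the per-epoch partitions are disjoint and each worker stores zeros for examples it did not process (the function name merged_importance is a hypothetical placeholder):

import torch
import torch.distributed as dist

def merged_importance(local_loss_matrix):
    # local_loss_matrix: (num_examples x num_tracked_epochs) tensor on this worker;
    # rows/entries for examples this worker did not process are zero, so a SUM
    # all-reduce over disjoint partitions reconstructs the full loss-epochs matrix.
    dist.barrier()                                          # wait for all workers
    dist.all_reduce(local_loss_matrix, op=dist.ReduceOp.SUM)
    importance = local_loss_matrix.mean(dim=1)              # row-wise average loss per example
    order = torch.argsort(importance, descending=True)
    # Sorted Examples list: (example index, importance value) pairs.
    return [(int(i), float(importance[i])) for i in order]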
Model training in PyTorch typically consists of a few blocks of code for setting up
the training (e.g., downloading the dataset, constructing the train/test/validation
folds and data samplers, and creating the model), followed by a single loop for it-
erative training of the model. To implement different stages of importance-aware
DPT, we first break down the default training loop into two separate loops: one
for warmup training (Section 3.1) and the other for intervals of importance-aware
training (Section 3.4). The first loop is similar to a typical PyTorch training loop
but is extended with code to track and calculate the importance of examples.
The second loop is nested: an outer loop maintains the intervals, while the in-
ner loop contains the code for the actual dataset partitioning step, the example
importance calculation step, and the model training step.
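A simplified, self-contained outline of this control flow, reusing the random_partition and stripes_partition helpers sketched earlier and eliding the actual training code, might look as follows (all constants are hypothetical):

E_warmup, E_interval, E_total = 20, 10, 100      # hypothetical values
num_examples, num_workers, rank = 50000, 4, 0

# Loop 1: warmup epochs with random partitioning; per-example losses are tracked.
for epoch in range(E_warmup):
    partitions = random_partition(num_examples, num_workers, seed=epoch)
    # ... one epoch of local training on partitions[rank], recording losses ...

# Loop 2: outer loop over intervals; partitions stay fixed for E_interval epochs.
epoch = E_warmup
while epoch < E_total:
    # ... example importance calculation -> importance-sorted example indices ...
    sorted_examples = list(range(num_examples))  # placeholder ordering
    partitions = stripes_partition(sorted_examples, num_workers)
    for _ in range(min(E_interval, E_total - epoch)):
        # ... one epoch of local training on partitions[rank] ...
        epoch += 1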
5 Evaluation
In this section, we describe our experimental setup and scenarios and discuss the
results of the experiments. When we talk about “model performance”, we mainly refer to the best test accuracy of a model trained for 100 epochs. Our hardware setup consists of a single machine with 4 GeForce RTX 2070 SUPER graphics cards, so we train with 4 workers.
[Figure 5: five box plots (y-axis: best_accuracy), each comparing vanilla DPT with the top 5 W-INT settings:
(1) CIFAR-10, ResNet-18, Stripes, Variance: vanilla, W20-INT30, W10-INT30, W40-INT8, W15-INT15, W30-INT10
(2) CIFAR-10, ResNet-18, Stripes, Average: vanilla, W60-INT10, W20-INT1, W10-INT1, W30-INT1, W60-INT5
(3) CIFAR-10, ResNet-18, Blocks, Variance: vanilla, W10-INT5, W30-INT8, W40-INT1, W15-INT1, W30-INT1
(4) CIFAR-10, ResNet-34, Stripes, Variance: vanilla, W30-INT10, W20-INT15, W20-INT1, W15-INT15, W15-INT10
(5) CIFAR-100, ResNet-34, Stripes, Variance (cf. Table 5): vanilla, W30-INT1, W60-INT30, W15-INT8, W40-INT8, W60-INT5]
Fig. 5. Box plots comparing the performance of the top 5 settings of Ewarmup (W) and
Einterval (INT) for different combinations of (Dataset, Model, Partitioning Heuristic,
Importance Metric). The leftmost box plot in each subfigure is the performance of
vanilla DPT (baseline), and the other five box plots are ordered in decreasing average
best test accuracy. The white square on each box plot denotes the average best test
accuracy for a setting. Each subfigure (1)-(5) corresponds to a table with the same
number, which contains the average best test accuracies and standard deviations over
ten runs for each of the combinations of W and INT.
Table 1. Average best test accuracies (over ten runs) and standard deviations for
different combinations of Ewarmup (W) and Einterval (I), when training ResNet-18
on CIFAR-10 with Stripes policy and loss variance as the importance metric. The
baseline (using vanilla DPT) is 82.983±0.327.
W \ I        1              5              8              10             15             30
10 82.766±0.185 82.848±0.278 82.742±0.152 82.862±0.237 82.836±0.387 82.988±0.299
15 82.743±0.373 82.752±0.157 82.891±0.302 82.888±0.296 82.958±0.247 82.873±0.262
20 82.776±0.243 82.832±0.262 82.749±0.309 82.722±0.221 82.878±0.283 83.044±0.311
30 82.846±0.202 82.858±0.376 82.837±0.263 82.946±0.204 82.843±0.307 82.773±0.266
40 82.946±0.246 82.773±0.208 82.985±0.238 82.869±0.364 82.815±0.296 82.827±0.161
60 82.813±0.283 82.898±0.300 82.882±0.152 82.764±0.293 82.830±0.249 82.705±0.415
Table 2. Average best test accuracies (over ten runs) and standard deviations for
different combinations of Ewarmup (W) and Einterval (I), when training ResNet-18 on
CIFAR-10 with Stripes policy and average loss as the importance metric. The baseline
(using vanilla DPT) is 82.983±0.327.
W \ I        1              5              8              10             15             30
10 82.941±0.262 82.880±0.339 82.859±0.312 82.815±0.290 82.836±0.226 82.891±0.195
15 82.885±0.231 82.816±0.287 82.841±0.316 82.778±0.259 82.866±0.260 82.773±0.247
20 82.952±0.314 82.913±0.247 82.903±0.240 82.889±0.265 82.841±0.278 82.919±0.210
30 82.939±0.294 82.854±0.185 82.853±0.236 82.889±0.227 82.743±0.335 82.929±0.279
40 82.864±0.138 82.903±0.152 82.883±0.225 82.766±0.220 82.905±0.244 82.851±0.236
60 82.908±0.337 82.931±0.339 82.818±0.245 82.956±0.228 82.806±0.195 82.758±0.237
Both CIFAR-10 and CIFAR-100 contain the same number of examples in their train (50000 examples) and test (10000 examples) subsets, but they differ in the number of classes. CIFAR-10 has ten classes
(5000 training examples per class), and CIFAR-100 has 100 classes (500 training
examples per class). Hence, CIFAR-100 has a higher complexity than CIFAR-10
in terms of classes.
The results show that, for several of the training settings, there are combinations of (Ewarmup, Einterval) that yield better models than vanilla DPT. Thus, the
gains of importance-aware DPT seem to hold across different datasets, given
that we can find and select good hyperparameters for the training setting (e.g.,
Ewarmup and Einterval ).
Table 3. Average best test accuracies (over ten runs) and standard deviations for
different combinations of Ewarmup (W) and Einterval (I), when training ResNet-18 on
CIFAR-10 with Blocks policy and loss variance as the importance metric. The baseline
(using vanilla DPT) is 82.983±0.327.
W \ I        1              5              8              10             15             30
10 82.921±0.352 83.067±0.270 82.778±0.426 82.743±0.218 82.662±0.240 82.706±0.165
15 82.992±0.321 82.899±0.308 82.890±0.253 82.805±0.165 82.664±0.178 82.109±0.338
20 82.845±0.292 82.939±0.376 82.850±0.429 82.716±0.205 82.747±0.289 82.523±0.165
30 82.956±0.189 82.942±0.309 83.055±0.153 82.954±0.382 82.815±0.247 82.583±0.206
40 83.001±0.270 82.861±0.336 82.786±0.247 82.925±0.18 82.865±0.177 82.894±0.254
60 82.918±0.348 82.873±0.283 82.848±0.271 82.886±0.273 82.884±0.228 82.462±0.222
Table 4. Average best test accuracies (over ten runs) and standard deviations for
different combinations of Ewarmup (W) and Einterval (I), when training ResNet-34
on CIFAR-10 with Stripes policy and loss variance as the importance metric. The
baseline (using vanilla DPT) is 82.661±0.478.
W \ I        1              5              8              10             15             30
10 82.650±0.547 82.653±0.399 82.590±0.395 82.621±0.243 82.751±0.461 82.753±0.632
15 82.537±0.332 82.424±0.510 82.745±0.401 82.799±0.481 82.832±0.239 82.433±1.020
20 82.845±0.441 82.659±0.637 82.787±0.407 82.606±0.541 82.890±0.321 82.492±0.300
30 82.671±0.434 82.539±0.307 82.719±0.509 82.920±0.287 82.594±0.434 82.720±0.589
40 82.669±0.426 82.773±0.403 82.422±0.728 82.530±0.305 82.649±0.339 82.562±0.353
60 82.789±0.336 82.615±0.342 82.683±0.397 82.768±0.525 82.678±0.451 82.622±0.661
Table 5. Average best test accuracies (over ten runs) and standard deviations for
different combinations of Ewarmup (W) and Einterval (I), when training ResNet-34
on CIFAR-100 with Stripes policy and loss variance as the importance metric. The
baseline (using vanilla DPT) is 49.042±0.698.
W \ I        1              5              8              10             15             30
10 49.169±0.335 49.064±0.312 49.167±0.432 48.758±0.597 49.04±0.503 49.033±0.450
15 49.156±0.332 48.959±0.437 49.264±0.292 49.186±0.498 49.073±0.573 49.079±0.351
20 48.978±0.550 49.144±0.637 49.024±0.365 49.149±0.297 48.944±0.436 48.977±0.380
30 49.278±0.399 48.906±0.792 49.102±0.393 48.897±0.432 49.152±0.446 48.966±0.389
40 49.129±0.549 48.978±0.527 49.262±0.489 49.155±0.387 48.998±0.450 49.024±0.284
60 49.083±0.348 49.224±0.338 49.027±0.453 49.194±0.396 49.107±0.461 49.270±0.429
Importance-aware DPT introduces two main sources of overhead: (1) tracking the per-example losses at every epoch (a.k.a. importance tracking overhead) and (2) calculating the importance of examples and repartitioning the dataset based on heuristics at the beginning of each interval (a.k.a. heuristic overhead). In Table 6, we report the statistics on these overheads (in
seconds) when we train ResNet-18 on CIFAR-10 for 100 epochs using four work-
ers and the 36 different combinations of Ewarmup and Einterval (as reported in
Tables 1-5). The importance tracking overhead is independent of Ewarmup and
Einterval , as it happens at every epoch, and on average accounts for 14.57% of
the total wallclock time. However, we should note that this is a prototype im-
plementation of importance-aware DPT, and many optimizations can be made
to significantly reduce the overheads (e.g., getting the individual example losses
and the mini-batch losses in the same forward pass or using MPI operations
for calculating the importance of examples). By requiring repartitioning only once every Einterval epochs, importance-aware DPT has the potential to significantly reduce
the network and I/O overhead that vanilla DPT requires for fetching exam-
ples at each epoch, especially in large training settings consisting of hundreds of
thousands or millions of examples.
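For instance, the first of these optimizations amounts to computing the per-example losses and the mini-batch loss from a single forward pass, roughly as follows (a sketch, not part of the current prototype):

import torch
import torch.nn as nn

criterion_per_example = nn.CrossEntropyLoss(reduction='none')

def losses_from_single_pass(model, inputs, targets):
    logits = model(inputs)                                  # one forward pass only
    per_example = criterion_per_example(logits, targets)    # kept for importance tracking
    batch_loss = per_example.mean()                         # used for the backward pass
    return per_example.detach(), batch_loss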
6 Conclusion
In this paper, we proposed importance-aware DPT, a data-parallel training ap-
proach for deep neural networks, that partitions the dataset examples across the
workers based on a notion of the importance of each example. Our empirical
evaluation across a number of well-known image classification workloads sug-
gests that by setting relevant values for the hyperparameters of this approach,
most notably Ewarmup and Einterval, we can find better models (in terms of best test accuracy) than with vanilla DPT. Future work
can concentrate on, e.g., using hyperparameter tuning methods for finding the
best values for the hyperparameters of importance-aware DPT and evaluating
the effects of different importance metrics and measures.
References
1. Arpit, D., Jastrzebski, S., Ballas, N., Krueger, D., Bengio, E., Kanwal, M.S., Ma-
haraj, T., Fischer, A., Courville, A., Bengio, Y., et al.: A closer look at memo-
rization in deep networks. In: International Conference on Machine Learning. pp.
233–242. PMLR (2017)
2. Chang, H.S., Learned-Miller, E., McCallum, A.: Active bias: Training more accu-
rate neural networks by emphasizing high variance samples. Advances in Neural
Information Processing Systems 30 (2017)
3. Chitta, K., Alvarez, J.M., Haussmann, E., Farabet, C.: Training data distribution
search with ensemble active learning. arXiv preprint arXiv:1905.12737 (2019)
4. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
pp. 770–778 (2016)
5. Isola, P., Xiao, J., Parikh, D., Torralba, A., Oliva, A.: What makes a photo-
graph memorable? IEEE Transactions on Pattern Analysis and Machine Intelli-
gence 36(7), 1469–1482 (2013)
6. Katharopoulos, A., Fleuret, F.: Not all samples are created equal: Deep learning
with importance sampling. In: International Conference on Machine Learning. pp.
2525–2534. PMLR (2018)
7. Krizhevsky, A.: Learning multiple layers of features from tiny images. Tech. rep.,
University of Toronto (2009)
8. Li, S., Zhao, Y., Varma, R., Salpekar, O., Noordhuis, P., Li, T., Paszke, A., Smith,
J., Vaughan, B., Damania, P., et al.: PyTorch distributed: Experiences on acceler-
ating data parallel training. Proceedings of the VLDB Endowment 13(12) (2020)
9. Paszke, A., et al.: PyTorch: An imperative style, high-performance deep learning
library. Advances in Neural Information Processing Systems 32 (2019)
10. Swayamdipta, S., Schwartz, R., Lourie, N., Wang, Y., Hajishirzi, H., Smith, N.A.,
Choi, Y.: Dataset cartography: Mapping and diagnosing datasets with training
dynamics. In: Proceedings of the 2020 Conference on Empirical Methods in Natural
Language Processing (EMNLP). pp. 9275–9293 (2020)
11. Tang, Z., Shi, S., Chu, X., Wang, W., Li, B.: Communication-efficient distributed
deep learning: A comprehensive survey. arXiv preprint arXiv:2003.06307 (2020)
12. Toneva, M., Sordoni, A., Combes, R.T.d., Trischler, A., Bengio, Y., Gordon, G.J.:
An empirical study of example forgetting during deep neural network learning. In:
ICLR (2019)
13. Vodrahalli, K., Li, K., Malik, J.: Are all training examples created equal? an em-
pirical study. arXiv preprint arXiv:1811.12569 (2018)
14. Yin, D., Pananjady, A., Lam, M., Papailiopoulos, D., Ramchandran, K., Bartlett,
P.: Gradient diversity: a key ingredient for scalable distributed learning. In: Inter-
national Conference on Artificial Intelligence and Statistics. pp. 1998–2007. PMLR
(2018)
15. Zhuang, D., Zhang, X., Song, S., Hooker, S.: Randomness in neural network train-
ing: Characterizing the impact of tooling. In: Marculescu, D., Chi, Y., Wu, C.
(eds.) Proceedings of Machine Learning and Systems. vol. 4, pp. 316–336 (2022)