
Hybrid Approach to Parallel Stochastic Gradient Descent

Aakash Sudhirbhai Vora, School of Computing and Augmented Intelligence
Arizona State University
Tempe, AZ, USA
avora8@asu.edu
Dhrumil Chetankumar Joshi, School of Computing and Augmented Intelligence
Arizona State University
Tempe, AZ, USA
djoshi12@asu.edu
Aksh Kantibhai Patel, School of Computing and Augmented Intelligence
Arizona State University
Tempe, AZ, USA
apate160@asu.edu
Abstract.

Stochastic gradient descent is used to train models on large datasets because it reduces training time. In addition, data parallelism is widely used to train neural networks efficiently on multiple worker nodes in parallel. Most systems implement data parallelism with either a synchronous or an asynchronous approach, and both have drawbacks. We propose a third approach to data parallelism, a hybrid that uses both synchronous and asynchronous updates to train the neural network. We show that when the threshold function is selected appropriately, so that parameter aggregation shifts gradually from asynchronous to synchronous, our hybrid approach outperforms both the asynchronous and the synchronous approach within a given time period.

Data parallelism, Synchronous approach, Asynchronous approach, Stochastic gradient descent, Distributed optimization

1. Introduction

Neural networks play an important role in modern computer applications and have become indispensable in day-to-day life. They are the backbone of recommender systems, stock trading algorithms, voice assistants, autonomous driving, and more. Neural networks are trained using gradient descent, where the goal is to minimize a loss function by iteratively improving the model parameters over the training data. Computing each update on the whole dataset at once is slow, since the training dataset can be huge. An alternative is stochastic gradient descent (SGD), where the data is passed to the network in batches. Because it is more efficient, SGD is widely accepted as the method for training neural networks.

Training can be accelerated further with distributed training, in which multiple worker nodes update the model parameters in parallel. Several parallelism techniques are used for training a model. Model parallelism splits the model across workers so that each worker controls some of the parameters of the model. Each worker computes the parameters for its part of the model, and all the parameters are then aggregated to obtain the final model. Petuum (Xing et al., 2015), Project Adam (Chilimbi et al., 2014) and Sandblaster (Dean et al., 2012) all use model parallelism to speed up neural network training.

Another approach to parallelizing training is pipeline parallelism. Its fundamental idea is to overcome a shortcoming of model parallelism: while one set of workers computes weights for one layer, the other workers should not sit idle. So, in addition to splitting the model, multiple inputs are supplied to the system so that at each point in time every worker is computing weights based on one set of inputs. Each worker node is responsible for computing the weights of one or more layers. Once a worker has computed the parameters for its assigned layers, it passes that information to the next set of workers, which are responsible for other layers in the network. The same worker computes the weights during both the feed-forward and backpropagation phases of training. The model is divided among workers in the direction of data flow. GPipe (Huang et al., 2019) and PipeDream (Narayanan et al., 2019) both use pipeline parallelism to speed up training.

Data parallelism is another approach to speeding up neural network training. The core idea is that the training dataset is split among the workers and each worker trains the model in parallel on its share of the data. The parameters from the workers are then combined to obtain the resulting model, and this combination is needed for the parameters to converge. Currently there are two approaches to data parallelism. The first is the synchronous approach, in which the workers running in parallel communicate among themselves to keep the parameters synchronized. Parameters can be synchronized after every iteration or after some fixed number of iterations. Popular implementations of the synchronous approach are Petuum (Xing et al., 2015), Horovod (Sergeev and Balso, 2018) and the Stale Synchronous Parallel Parameter Server (Ho et al., 2013). The clear drawback of this approach is that faster workers have to wait for slower workers to catch up, which wastes resources because the faster workers sit idle in the meantime. There is also communication overhead, since the workers must coordinate to keep the parameters synchronized.

The alternative is the asynchronous approach. Here, workers do not have to communicate with each other to keep parameters synchronized. Each worker is allowed to update the parameters independently, with little communication needed. This leads to better resource utilization, since faster workers do not have to wait for slower ones. Popular implementations are Hogwild (Niu et al., 2011), Project Adam (Chilimbi et al., 2014) and DistBelief (Dean et al., 2012). However, there is a major drawback regarding parameter convergence. If the updates to the model parameters are not sparse, convergence might take a long time, since faster workers frequently update the parameters while slower workers are still operating on stale parameters.

We propose a hybrid approach that incorporates the advantages of both the synchronous and the asynchronous approach by using them together. Initially, each worker computes parameters asynchronously. As iterations progress, however, more and more updates are accumulated before they are applied and passed on to the workers. This allows the workers to use the asynchronous approach to make more progress per iteration and the synchronous approach to make more confident progress.

2. Related Work

To increase the effectiveness of neural network training over large training sets, a number of strategies have been put forward. Due to its effectiveness in handling huge datasets, stochastic gradient descent (SGD) has been largely favored over gradient descent. In order to increase efficiency even more, distributed training techniques have been investigated. These techniques let worker nodes update model parameters concurrently.

Synchronous parallel gradient descent is one of the current training methods for neural networks. This method synchronizes all worker nodes either after every iteration or after a predetermined number of iterations. Thanks to this synchronization, the output parameters closely match those obtained by sequential stochastic gradient descent. The major drawback of this strategy is that faster workers must wait for slower workers to catch up, which adds a lot of idle time. "More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server" (Ho et al., 2013) offers a synchronous parallel gradient descent approach with a parameter server that tolerates slightly stale updates, but faster workers still need to wait for slower workers, resulting in idle time. "Horovod: Fast and Easy Distributed Deep Learning in TensorFlow" (Sergeev and Balso, 2018) describes a distributed training system that employs synchronous parallel gradient descent together with a ring-based communication technique called ring-allreduce. With features like gradient compression, it enhances scalability and speed. Fast and efficient training is one of its benefits, although it shares the disadvantage of idle time caused by slower workers. "Petuum: A New Platform for Distributed Machine Learning on Big Data" (Xing et al., 2015) introduces Petuum, a distributed machine learning platform that uses synchronous parallel gradient descent with a parameter server. It makes use of techniques like model compression and a customized communication protocol. It handles large-scale machine learning jobs efficiently, but it also suffers from idle time because of slower workers.

Asynchronous parallel gradient descent is another contemporary strategy, in which each worker updates the parameters independently and in parallel. Because workers do not have to wait for each other, this strategy makes better use of resources. However, convergence is achievable only for sparse updates, and the parameters held by individual workers can become outdated. "HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent" (Niu et al., 2011) introduces a technique that uses a lock-free mechanism to parallelize stochastic gradient descent. It enables independent model parameter updates across several threads without explicit synchronization. While this approach parallelizes SGD effectively and minimizes synchronization overhead and resource usage, its assumption of sparse updates makes it potentially ineffective for dense updates. Additionally, because there is no explicit synchronization, individual threads may use outdated parameters, which can affect the convergence and accuracy of training. "Project Adam: Building a Scalable and Efficient Deep Learning Training System" (Chilimbi et al., 2014) also uses asynchronous parallel SGD to allow worker nodes to update simultaneously. It uses a communication protocol to share information among workers and keep them consistent. The use of out-of-date parameters by individual workers can affect overall training quality, sometimes producing less-than-ideal results. Existing techniques thus exhibit various trade-offs between synchronization and convergence guarantees.

3. Problem Formulation

Let $X$ be a set of points sampled from some unknown distribution defined over the $n$-dimensional space $R^{n}$. Corresponding to each point $x \in X$, there exists a value $y$ sampled from another set $Y$, which is a subset of $R^{m}$. The goal is to find a function $f: R^{n} \times R^{k} \rightarrow R^{m}$, defined on $X$ and a parameter $\theta \in R^{k}$, that approximates the unknown function $f_{un}: R^{n} \rightarrow R^{m}$ mapping each point $x$ to its corresponding $y$. Further, we assume that there exists a differentiable convex function $j(x, y, \theta)$ defined as $R^{n} \times R^{m} \times R^{k} \rightarrow R$. Let a new function $J$ be defined for any subset of $X$, which applies $j$ to every $x$ in that subset and returns the sum of the results.
The function $f$ can be estimated by finding the value of the parameter $\theta$ that minimizes the function $J(X, Y, \theta)$. This minimization can be performed by gradient-based methods such as stochastic gradient descent.

(1) $J(X,Y,\theta)=\sum_{x_{i}\in X,\,y_{i}\in Y}j(x_{i},y_{i},\theta)$
(2) $\theta=\arg\min_{\theta\in R^{k}}J(X,Y,\theta)$
(3) $\theta_{i}^{\prime}=\theta_{i}-\eta\frac{\partial J(X,Y,\theta)}{\partial\theta_{i}}$
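
To make Equations (1) to (3) concrete, the following is a minimal single-machine sketch and not the implementation used in this work. It assumes a hypothetical one-parameter linear model $f(x,\theta)=\theta x$ with a squared-error loss as $j$; the names `j_grad` and `sgd`, the toy data and the hyperparameter values are illustrative assumptions.

```python
import random

def j_grad(x, y, theta):
    """Gradient in theta of the per-example loss j = (theta * x - y) ** 2."""
    return 2.0 * (theta * x - y) * x

def sgd(data, theta=0.0, eta=0.01, batch_size=4, epochs=200, seed=0):
    """Minimize J, the sum of j over the data, with mini-batch updates."""
    rng = random.Random(seed)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of J restricted to the current batch (Equation 1).
            grad = sum(j_grad(x, y, theta) for x, y in batch)
            # Parameter update (Equation 3).
            theta = theta - eta * grad
    return theta

# Toy data generated by y = 3x, so the minimizer of J is theta = 3.
points = [(x / 10.0, 3.0 * x / 10.0) for x in range(1, 11)]
theta_hat = sgd(points)
```

Because the toy data is noiseless, `theta_hat` should end up very close to 3; with real data the batches would only give a noisy estimate of the full gradient.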

However, we have multiple processors at our disposal that we would like to use for the minimization. We assign one machine to maintain the current values of the parameters and call it the parameter server $p$. The remaining machines are the worker machines $w_{i}$. Each worker machine $w_{i}$ holds a subset $(X_{i},Y_{i})$ of the entire dataset $(X,Y)$. Workers are responsible for fetching the current parameter values, calculating the gradient using $(X_{i},Y_{i})$, and sending it to the parameter server $p$. We do not assume any global notion of time among the workers. Furthermore, we assume that each worker executes at a different speed and that there are communication delays between the workers and the parameter server.

In such a setting, it becomes necessary to define which policy the parameter server uses to update the parameters from the gradients received from the workers. Two widely used approaches are synchronous and asynchronous. In the synchronous approach, the parameter server collects gradients from all workers before updating the parameters, and each worker waits until it receives the updated parameters. Every update in this approach is calculated by all workers from the same previous parameter values, and hence the updates are less noisy. However, this algorithm is slower, since faster workers have to wait for slower ones. In the asynchronous approach, on the other hand, the parameter server applies each update as soon as it is received from a worker and returns the new parameter values to that worker. This approach is simple and works very well in practice during initial convergence (it quickly reaches the neighborhood of a local minimum). As we move closer to the local minimum, however, frequent updates from faster workers leave slower workers computing on stale parameters, which might lead to noisy updates and slow down convergence. Thus we ask: is there an algorithm that combines the good properties of both, namely the high-confidence updates of the synchronous algorithm and the faster initial convergence of the asynchronous one? We hypothesize that starting with asynchronous updates and switching over time to synchronous updates leads to faster convergence.
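
The two policies can be contrasted with a small sketch. This is an illustration under stated assumptions, not the code used in our experiments: the function names `async_update` and `sync_update` are hypothetical, the gradients are fixed toy values, and averaging the collected gradients in the synchronous case (rather than summing them) is an assumption.

```python
def async_update(theta, grad, eta):
    """Asynchronous policy: apply one worker's gradient as soon as it arrives."""
    return [t - eta * g for t, g in zip(theta, grad)]

def sync_update(theta, grads, eta):
    """Synchronous policy: wait for one gradient per worker, then apply their mean.

    Averaging (rather than summing) is an assumption of this sketch.
    """
    n = len(grads)
    mean = [sum(g[i] for g in grads) / n for i in range(len(theta))]
    return [t - eta * m for t, m in zip(theta, mean)]

theta0 = [1.0, -2.0]
worker_grads = [[0.5, 0.0], [1.5, 2.0]]  # toy gradients from two workers

# Asynchronous: two separate updates, applied in arrival order. In a real
# system the second gradient may have been computed from stale parameters.
theta_async = theta0
for g in worker_grads:
    theta_async = async_update(theta_async, g, eta=0.1)

# Synchronous: a single update after both gradients have been collected.
theta_sync = sync_update(theta0, worker_grads, eta=0.1)
```

In this toy run the asynchronous policy moves the parameters twice, while the synchronous policy moves them once using the averaged gradient, which mirrors the throughput versus confidence trade-off described above.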

4. Methodology: Smooth Switch Algorithm

We develop a hybrid algorithm for stochastic gradient descent (SGD), which we call the smooth switch algorithm, by combining synchronous and asynchronous techniques. Asynchronous updates are used to increase iteration throughput, while synchronous updates are used to ensure that each update makes reliable progress. The method defines a smooth transition from the asynchronous to the synchronous approach, controlled by the threshold function. Asynchronicity helps achieve more updates per iteration, and synchronicity helps improve accuracy per iteration. By combining the advantages of both kinds of update, this two-pronged method offers a fair trade-off between training speed and accuracy.

Figure 1. System Architecture

Figure 1 shows the system architecture, in which a threshold parameter (K) controls the transition from asynchronous to synchronous updates. Initially, the threshold is set to a low value, enabling asynchronous updates to the parameter server. As training progresses, the threshold gradually increases. During each iteration, gradients (G1, G2, G3, … Gk) are accumulated in the gradient buffer, and the algorithm checks whether the number of accumulated gradients is greater than or equal to the threshold. If the threshold has not been reached, asynchronous updates continue. Once the threshold is reached, the buffered gradients are applied to the parameter server as a synchronous update. These steps are repeated until convergence or for a specified number of iterations. Algorithm 1 summarizes the entire process.

Algorithm 1
Step 1: Set the threshold (K) for transferring the gradient buffer to the Parameter Server to a very low value to allow for asynchronous updates (stale reads).
Step 2:
while (!(convergence) and !(specified number of iterations reached)) do
  (1) If the number of gradients in the gradient buffer >= threshold K, then synchronize all the gradients in the gradient buffer with the Parameter Server (reduction in stale reads).
  (2) If the threshold has not been reached, continue with asynchronous updates to the Parameter Server (might cause frequent stale reads).
  (3) Gradually increase the threshold (K) as the iterations progress.
  (4) Accumulate multiple gradients (G1, G2, G3, … Gn) during each iteration.
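
One possible reading of Algorithm 1 is sketched below as a single-process toy parameter server; it is not the Ray-based implementation used in the experiments. The class name `SmoothSwitchServer`, the `push` method, the example threshold function, and the choice to average the buffered gradients are all assumptions made for illustration. What workers do while their gradients sit in the buffer is not modeled.

```python
class SmoothSwitchServer:
    """Toy parameter server that applies buffered gradients once K is reached."""

    def __init__(self, theta, eta, threshold_fn):
        self.theta = list(theta)
        self.eta = eta
        self.threshold_fn = threshold_fn  # maps gradients received so far to K
        self.buffer = []
        self.received = 0

    def push(self, grad):
        """Receive one worker gradient; return True if the parameters changed."""
        self.buffer.append(list(grad))
        self.received += 1
        k = self.threshold_fn(self.received)
        if len(self.buffer) < k:
            return False  # threshold not reached, keep accumulating
        n = len(self.buffer)
        for i in range(len(self.theta)):
            # Averaging the buffered gradients is an assumption of this sketch.
            mean_i = sum(g[i] for g in self.buffer) / n
            self.theta[i] -= self.eta * mean_i
        self.buffer.clear()
        return True

# Hypothetical threshold: K grows by 1 every 2 gradients, capped at 4 workers.
server = SmoothSwitchServer([0.0], eta=0.1,
                            threshold_fn=lambda n: min(4, 1 + n // 2))
applied = [server.push([g]) for g in (1.0, 1.0, 3.0, 2.0, 2.0)]
```

With K fixed at 1 this sketch behaves like the asynchronous policy, and with K fixed at the number of workers it resembles the synchronous one, so the threshold function alone determines how quickly the behavior shifts.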

5. Data Preparation

As part of our research experiment, we evaluated the performance and generalization capabilities of our approach to stochastic gradient descent (SGD) on a diverse range of datasets. Among these, we used the MNIST dataset (Le Cunn, 1999), shown in Figure 2, and the CIFAR-10 dataset (Krizhevsky et al., 2009), shown in Figure 3.

Figure 2. MNIST samples

A curated collection of grayscale images showing handwritten numbers makes up the MNIST dataset. It serves as a well-known benchmark dataset for tasks involving image classification in which the goal is to correctly identify the represented digit in each image. The MNIST dataset is a rich source of labeled data with 60,000 training images and 10,000 test images, providing a significant quantity of resources for testing and training machine learning algorithms. With a size of 28x28 pixels for each image, the collection has 784 features. The MNIST dataset has been widely used to evaluate the effectiveness of various machine learning models due to its well-stated task and simplicity.

Figure 3. CIFAR 10 samples

In addition, the CIFAR-10 dataset (Krizhevsky et al., 2009) used in our research comprises a total of 60,000 color images, split into 50,000 training images and 10,000 test images. These images, which represent numerous objects and scenes, are divided into ten classes. The CIFAR-10 dataset is more complex than MNIST, since its images have a resolution of 32x32 pixels and three RGB color channels. It is regularly used as a benchmark for assessing how well image classification systems handle increasingly challenging visual tasks.

We wanted to evaluate the performance and generalization capabilities of our model using both the MNIST and CIFAR-10 datasets. We were able to assess the performance of our method using a variety of datasets, including basic grayscale photographs of numbers (MNIST) and more complex color photos of a variety of objects (CIFAR-10). Through this investigation, we gained valuable insights into the adaptability and robustness of our distributed large-scale Stochastic Gradient Descent approach.

6. Implementation and Experiments

The goal of our experiment is to validate our hypothesis that the proposed algorithm provides a speedup in convergence as compared to both completely synchronous and asynchronous versions of distributed stochastic gradient descent. Hence, in all our experiments, we ran all three algorithms namely, our proposed algorithm, synchronous and asynchronous algorithms for the same initial conditions, and collected values of training loss, testing loss, and testing accuracy at various time intervals.

All our experiments were executed in a clustered environment setup. The cluster was provided with 2 CPUs each with 14 cores which makes 28 cores in total. The CPUs were based on 64-bit x86 architecture and the model name was Intel(R) Xeon(R) CPU E5-2680 v4 with 2.4GHz clock speed. The kernel version was 3.10.0-1160.21.1.el7.x86_64 and the operating system was CentOS Linux 7. The available RAM on the cluster was 263.85 GB.

The code for all algorithms was written in Python 3.9.12. For parallel execution and scheduling, we used the Ray library, version 2.4.0. PyTorch 2.0.0 was used for model creation and training. To train the model, 25 gradient workers calculated gradients and passed their updates to a worker that acted as the parameter server. Further, to simulate communication delays and faster or slower workers during training, we randomly introduced execution delays in 50% of the gradient workers. The execution delays were sampled randomly from a normal distribution with a mean of 0 and a standard deviation of 0.25 each time a worker calculated a gradient.
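
The delay injection described above could be sketched as follows. This is a standalone illustration that does not use Ray and is not our experimental code. The helper name `make_delay_sampler` is hypothetical, and two details are assumptions because the setup above does not specify them: negative draws from the normal distribution are clamped to zero, and "50%" of 25 workers is rounded down to 12.

```python
import random

def make_delay_sampler(worker_id, num_workers=25, delayed_fraction=0.5,
                       mean=0.0, std=0.25, seed=0):
    """Return a function that samples the artificial delay (seconds) for a worker."""
    rng = random.Random(seed)
    # Assumption: the delayed subset is chosen once, rounding 12.5 down to 12.
    delayed_ids = set(rng.sample(range(num_workers),
                                 int(num_workers * delayed_fraction)))
    worker_rng = random.Random(seed * 1000 + worker_id)

    def sample_delay():
        if worker_id not in delayed_ids:
            return 0.0
        # Assumption: negative draws are clamped to zero (no negative sleep).
        return max(0.0, worker_rng.gauss(mean, std))

    return sample_delay

samplers = [make_delay_sampler(w) for w in range(25)]
delays = [sample() for sample in samplers]
# A worker could call time.sleep(sample_delay()) before computing each gradient.
```

Sampling a fresh delay for every gradient, rather than fixing one delay per worker, means that even a "slow" worker is occasionally fast, which matches the per-gradient sampling described above.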

For testing our approach, we selected the MNIST and CIFAR-10 datasets. For our algorithm, we used a step function as the threshold function, which raises the threshold as the number of gradient updates increases. We ran our experiments for combinations of step sizes of 3 and 5 times the reciprocal of the learning rate (300 and 500 for the learning rate used here) and batch sizes of 32 and 64. For each combination, we trained the model for 5 rounds, each starting from a random initialization, using our algorithm, the asynchronous algorithm and the synchronous algorithm. Within a round, the same initial weights were used for every algorithm. Training ran for 100 seconds in each round. Since both datasets contain images, a CNN was used as the model. We fixed the learning rate at 0.01. Since the model solves a classification problem, negative log-likelihood loss was used.
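
A step-function threshold of the kind described above might look like the sketch below. The function name `step_threshold` is hypothetical, and three details are assumptions for illustration: the threshold starts at 1, it increases by 1 each time another step size worth of gradient updates has been received, and it is capped at the number of gradient workers.

```python
def step_threshold(num_updates, learning_rate=0.01, multiple=3, num_workers=25):
    """Step-function threshold K given the number of gradient updates so far."""
    # Step size is a multiple of the reciprocal of the learning rate,
    # e.g. 3 / 0.01 -> 300 and 5 / 0.01 -> 500.
    step_size = round(multiple / learning_rate)
    # Assumptions: start at 1, grow by 1 per step, cap at the worker count.
    return min(num_workers, 1 + num_updates // step_size)

thresholds = [step_threshold(n) for n in (0, 299, 300, 600, 10**6)]
```

Under these assumptions a smaller step size raises the threshold sooner, so the algorithm spends less time in the asynchronous regime.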

Furthermore, we wanted to analyze the effects of various step sizes, batch sizes, and communication delays on our algorithm. Hence, we repeated our experiments for different combinations of step size, batch size, and communication delays and noted training and testing loss along with testing accuracy during different intervals. For this purpose, we used randomly generated datasets with 20 dimensions and 10 classes containing 10k samples with 80:20 train to test split. A newly sampled dataset was used for each configuration. The reason behind selecting a random dataset was to cover a wide range of classification problems and validate the robustness of our algorithm.
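
For illustration, a random classification dataset with the stated shape (20 dimensions, 10 classes, 10k samples, 80:20 split) could be generated as in the sketch below. The generator actually used is not described here, so the Gaussian-blob construction, the function name `make_random_classification` and the seed handling are all assumptions.

```python
import random

def make_random_classification(n_samples=10_000, n_features=20, n_classes=10,
                               train_fraction=0.8, seed=0):
    """Return (train, test) lists of (features, label) pairs."""
    rng = random.Random(seed)
    # Assumption: each class is a unit-variance Gaussian blob around a
    # randomly placed center.
    centers = [[rng.uniform(-5.0, 5.0) for _ in range(n_features)]
               for _ in range(n_classes)]
    samples = []
    for _ in range(n_samples):
        label = rng.randrange(n_classes)
        point = [rng.gauss(c, 1.0) for c in centers[label]]
        samples.append((point, label))
    rng.shuffle(samples)
    split = int(n_samples * train_fraction)
    return samples[:split], samples[split:]

train_set, test_set = make_random_classification()
```

Passing a different `seed` for each configuration would give a newly sampled dataset per run, in the spirit of the setup described above.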

7. Results and Discussion

In this section, we present the results of various experiments and analyses of the same. First, we present how our algorithm works on the MNIST and CIFAR-10 datasets. After that, we also present the effects of various choices of step sizes, batch sizes, and communication delays on our algorithm.

7.1. Results on MNIST and CIFAR-10 dataset

Figures 4 and 5 show the average values of testing accuracy, testing loss, and training loss over five rounds of training from random initialization on the MNIST dataset. It can be seen clearly that our algorithm maintains the lead in terms of accuracy and loss compared to both the asynchronous and synchronous versions. The same trend is observed for all combinations of batch size and step size. However, the speed gain of our algorithm over the asynchronous version is not that significant. We believe this is because MNIST poses a simple optimization problem that does not effectively bring out the problems of the asynchronous algorithm. Table 1 shows the difference in metrics such as accuracy and loss between our algorithm and the asynchronous algorithm, averaged over the entire training interval. For better performance, the difference in accuracy should be positive and the difference in loss should be negative.
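
The averaged difference reported in the tables can be read as the mean gap between two curves sampled at the same times. The sketch below illustrates that reading; the function name `mean_difference` is hypothetical, the curves are toy values rather than measured results, and the use of a plain unweighted mean over evenly spaced measurements is an assumption.

```python
def mean_difference(ours, baseline):
    """Average of (ours - baseline) over measurements taken at the same times."""
    if len(ours) != len(baseline) or not ours:
        raise ValueError("curves must be non-empty and sampled at the same times")
    return sum(a - b for a, b in zip(ours, baseline)) / len(ours)

# Toy accuracy curves (percent), not measured values.
hybrid_acc = [50.0, 70.0, 80.0]
async_acc = [48.0, 69.0, 80.0]
acc_gap = mean_difference(hybrid_acc, async_acc)
```

A positive value for an accuracy curve, or a negative value for a loss curve, indicates that the first algorithm was ahead on average over the interval.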

| Metric \ (Step size, Batch size) | (300,32) | (300,64) | (500,32) | (500,64) |
|---|---|---|---|---|
| Test Accuracy | 1.374 | -0.516 | 1.366 | 1.291 |
| Test loss | -0.047 | 0.001 | -0.053 | -0.022 |
| Train loss | -0.047 | -0.001 | -0.054 | -0.023 |

Table 1. Difference between the metric for our algorithm and the asynchronous algorithm, averaged over the entire training interval, for the MNIST dataset. For better performance, the difference in accuracy should be positive and the difference in loss should be negative.

| Metric \ (Step size, Batch size) | (300,32) | (300,64) | (500,32) | (500,64) |
|---|---|---|---|---|
| Test Accuracy | 4.849 | 2.435 | 3.468 | 2.884 |
| Test loss | -0.137 | -0.066 | -0.092 | -0.080 |
| Train loss | -0.139 | -0.067 | -0.091 | -0.082 |

Table 2. Difference between the metric for our algorithm and the asynchronous algorithm, averaged over the entire training interval, for the CIFAR-10 dataset. For better performance, the difference in accuracy should be positive and the difference in loss should be negative.

Figure 4. Testing accuracy, testing loss and training loss on MNIST for step size 300 and batch size 32 (left) and 64 (right)
Figure 5. Testing accuracy, testing loss and training loss on MNIST for step size 500 and batch size 32 (left) and 64 (right)
Figure 6. Testing accuracy, testing loss and training loss on CIFAR-10 for step size 300 and batch size 32 (left) and 64 (right)
Figure 7. Testing accuracy, testing loss and training loss on CIFAR-10 for step size 500 and batch size 32 (left) and 64 (right)

For the next set of experiments, we selected CIFAR-10 as our dataset, since we believe it poses a more difficult optimization problem than MNIST. Table 2 and Figures 6 and 7 show the same statistics as for MNIST. Here our algorithm clearly shows a significant speedup compared to both of the other algorithms. It achieves higher accuracy and lower loss than the asynchronous and synchronous algorithms. In all the previous experiments the synchronous algorithm was very slow, and hence in the remaining analyses we only present a comparison between our algorithm and the asynchronous algorithm.

7.2. Effect of different batch sizes

Further, we wanted to understand how different batch sizes affect the efficiency of our approach. For each batch size, we executed 5 rounds of training on the randomly generated dataset, each with a different initialization of the parameters. Table 3 shows the difference in metrics such as accuracy and loss between our algorithm and the asynchronous algorithm, averaged over the entire training interval. We hypothesized that the difference should decrease as the batch size increases, since with larger batches the asynchronous algorithm starts providing updates with high confidence. This is consistent with the trend observed in Figure 8.

Figure 8. Average difference in metrics for different batch sizes

| Metric \ Batch size | 8 | 16 | 32 | 64 | 128 |
|---|---|---|---|---|---|
| Test Accuracy | 4.896 | 5.183 | 4.222 | 3.304 | 2.599 |
| Test loss | -0.141 | -0.141 | -0.117 | -0.089 | -0.072 |
| Train loss | -0.143 | -0.141 | -0.114 | -0.088 | -0.068 |

Table 3. Difference between the metric for our algorithm and the asynchronous algorithm, averaged over the entire training interval, for various batch sizes and a constant step size of 500. For better performance, the difference in accuracy should be positive and the difference in loss should be negative.

7.3. Effect of different step sizes

We also ran experiments to investigate the effect of step size on the performance of our algorithm. Again, we repeated the experiment for various step sizes that are multiples of the reciprocal of the learning rate. Table 4 shows the difference between the metrics for our algorithm and the asynchronous algorithm, averaged over the training interval. Ideally, for smaller step sizes, our algorithm performs similarly to the synchronous algorithm. As we increase the step size, the behavior shifts towards that of the asynchronous algorithm. Figure 9 shows the relation between step size and the performance of the algorithm.

(Stepsize)Metric𝑆𝑡𝑒𝑝𝑠𝑖𝑧𝑒𝑀𝑒𝑡𝑟𝑖𝑐\frac{(Stepsize)}{Metric}divide start_ARG ( italic_S italic_t italic_e italic_p italic_s italic_i italic_z italic_e ) end_ARG start_ARG italic_M italic_e italic_t italic_r italic_i italic_c end_ARG 1lr1𝑙𝑟\frac{1}{lr}divide start_ARG 1 end_ARG start_ARG italic_l italic_r end_ARG 3lr3𝑙𝑟\frac{3}{lr}divide start_ARG 3 end_ARG start_ARG italic_l italic_r end_ARG 5lr5𝑙𝑟\frac{5}{lr}divide start_ARG 5 end_ARG start_ARG italic_l italic_r end_ARG 7lr7𝑙𝑟\frac{7}{lr}divide start_ARG 7 end_ARG start_ARG italic_l italic_r end_ARG 10lr10𝑙𝑟\frac{10}{lr}divide start_ARG 10 end_ARG start_ARG italic_l italic_r end_ARG
Test Accuracy 0.136 3.857 3.915 3.083 2.967
Test loss -0.016 -0.110 -0.118 -0.084 -0.074
Train loss -0.013 -0.110 -0.121 -0.079 -0.075
Table 4. Difference between the metric for our algorithm and asynchronous algorithm averaged over entire training interval for various step sizes and constant batch size of 32. For better performance, difference in accuracy should be positive and that loss should be negative
Figure 9. Average difference in metrics for different step sizes
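
To illustrate why a smaller step size pushes the behavior towards the synchronous algorithm, the sketch below uses a hypothetical step-shaped threshold that grows by one every `step_size` updates. The function names, the use of an update counter as the argument, and the learning rate of 0.01 are all assumptions made for illustration; this is not necessarily the exact threshold function used in our implementation.

```python
def step_size_from_lr(multiple: int, learning_rate: float) -> int:
    """Step size expressed as a multiple of 1 / learning_rate, rounded to an integer."""
    return round(multiple / learning_rate)


def aggregation_threshold(update_count: int, step_size: int, num_workers: int) -> int:
    """Hypothetical monotonically increasing threshold.

    Returns how many worker gradients are collected before aggregation:
    1 behaves like the asynchronous algorithm, num_workers like the synchronous one.
    """
    if step_size <= 0 or num_workers <= 0:
        raise ValueError("step_size and num_workers must be positive")
    return min(num_workers, 1 + update_count // step_size)


# Assumed values for illustration only: learning rate 0.01, 4 workers.
num_workers = 4
for multiple in (1, 5, 10):
    step = step_size_from_lr(multiple, 0.01)
    # Number of updates after which aggregation becomes fully synchronous.
    fully_sync_after = (num_workers - 1) * step
    print(multiple, step, fully_sync_after)
```

Under this sketch, a smaller step size reaches the fully synchronous regime after fewer updates, while a larger step size keeps the run in the asynchronous regime for longer.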

7.4. Effect of communication delay

As mentioned in the experiment section, we introduce communication delays by adding random execution delays to the workers. Communication delays are the major reason for the slowness of the synchronous algorithm, whereas the asynchronous algorithm does not suffer from this problem. We believe that, for a good choice of step size and batch size, our algorithm should be resilient to communication delay. Table 5 and Figure 10 show the results of the experiment with varying delay distributions (normal distributions with mean 0 and different standard deviations). For all tested values, our algorithm outperformed the asynchronous version.

| Metric \ (Mean, Std.) | (0, 0.25) | (0, 0.5) | (0, 0.75) | (0, 1) | (0, 1.25) |
|---|---|---|---|---|---|
| Test Accuracy | 3.915 | 1.920 | 3.012 | 2.879 | 5.184 |
| Test loss | -0.117 | -0.035 | -0.081 | -0.079 | -0.156 |
| Train loss | -0.120 | -0.039 | -0.079 | -0.075 | -0.166 |
Table 5. Difference between the metrics for our algorithm and the asynchronous algorithm, averaged over the entire training interval, for various delay distributions with a constant batch size of 32 and a constant step size of 500. For better performance, the difference in accuracy should be positive and the difference in loss should be negative.
Figure 10. Average difference in metrics for different communication delays
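
The delay injection described above can be sketched as follows. This is a minimal illustration, not our exact implementation: the helper names are hypothetical, and because a normal distribution with mean 0 produces negative samples, the sketch assumes that negative draws are clipped to zero (no delay).

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def sample_delay(std: float, mean: float = 0.0, rng: Optional[random.Random] = None) -> float:
    """Draw a delay in seconds from N(mean, std), clipped at zero (assumption)."""
    generator = rng if rng is not None else random.Random()
    return max(0.0, generator.gauss(mean, std))


def run_with_delay(work: Callable[[], T], std: float, rng: Optional[random.Random] = None) -> T:
    """Sleep for a random delay, then run the worker's computation."""
    delay = sample_delay(std, rng=rng)
    if delay > 0:
        time.sleep(delay)
    return work()


# Example: a seeded generator makes the injected delays reproducible.
rng = random.Random(0)
delays = [sample_delay(0.25, rng=rng) for _ in range(5)]
print(delays)
```

A larger standard deviation widens the spread of delays across workers, which is what stalls a fully synchronous aggregation step while it waits for the slowest worker.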

8. Conclusion

In summary, our approach combines the strengths of both existing approaches: it uses asynchronous aggregation to gain speed and synchronous aggregation to gain accuracy. The switch from asynchronous to synchronous is governed by a threshold function that takes the learning rate into account when determining how many parameters are aggregated for synchronization. Based on our experiments, we conclude that, for the same time period, our approach achieves better accuracy and lower loss than both the synchronous and asynchronous approaches. As the batch size decreases, which is the norm for large training datasets, our approach performs even better. The step size is currently defined in terms of the learning rate, and even if the step size is not selected appropriately, our approach gives better results than the synchronous approach.

9. Future Work

This approach can be tested on different CPU architectures with less memory and processing power to see what impact they have on the overall performance. Another challenge is to check whether it works with a non-convex loss function, since such a function has multiple local minima. We can also test this approach with more complex loss functions and model architectures to check its robustness, and with larger datasets to see how an increase in dataset size affects performance. Currently, the threshold for aggregating parameters is chosen based on experimental data. However, a good heuristic could be devised to form a basis for selecting the aggregation threshold for different types of models and datasets. Different monotonically increasing functions can also be tried to see whether all such functions can be plugged in directly without much change in performance. Formal proofs can also be derived for the convergence of the parameters even though the approach is not completely synchronous.

References

  • Chilimbi et al. (2014) Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. 2014. Project Adam: Building an Efficient and Scalable Deep Learning Training System. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). USENIX Association, Broomfield, CO, 571–582. https://www.usenix.org/conference/osdi14/technical-sessions/presentation/chilimbi
  • Dean et al. (2012) Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Quoc Le, and Andrew Ng. 2012. Large Scale Distributed Deep Networks. In Advances in Neural Information Processing Systems, F. Pereira, C.J. Burges, L. Bottou, and K.Q. Weinberger (Eds.), Vol. 25. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2012/file/6aca97005c68f1206823815f66102863-Paper.pdf
  • Ho et al. (2013) Qirong Ho, James Cipar, Henggang Cui, Jin Kyu Kim, Seunghak Lee, Phillip B. Gibbons, Garth A. Gibson, Gregory R. Ganger, and Eric P. Xing. 2013. More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server (NIPS’13). Curran Associates Inc., Red Hook, NY, USA, 1223–1231.
  • Huang et al. (2019) Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, and Zhifeng Chen. 2019. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2019/file/093f65e080a295f8076b1c5722a46aa2-Paper.pdf
  • Krizhevsky et al. (2009) Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. 2009. CIFAR-10 and CIFAR-100 Datasets. https://www.cs.toronto.edu/~kriz/cifar.html
  • LeCun (1999) Yann LeCun. 1999. The MNIST Database of Handwritten Digits. http://yann.lecun.com/exdb/mnist/
  • Narayanan et al. (2019) Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. 2019. PipeDream: Generalized Pipeline Parallelism for DNN Training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (Huntsville, Ontario, Canada) (SOSP ’19). Association for Computing Machinery, New York, NY, USA, 1–15. https://doi.org/10.1145/3341301.3359646
  • Niu et al. (2011) Feng Niu, Benjamin Recht, Christopher Ré, and Stephen J. Wright. 2011. HOGWILD! A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. In Proceedings of the 24th International Conference on Neural Information Processing Systems (Granada, Spain) (NIPS’11). Curran Associates Inc., Red Hook, NY, USA, 693–701.
  • Sergeev and Balso (2018) Alexander Sergeev and Mike Del Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. CoRR abs/1802.05799 (2018). arXiv:1802.05799 http://arxiv.org/abs/1802.05799
  • Xing et al. (2015) Eric P. Xing, Qirong Ho, Wei Dai, Jin Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, and Yaoliang Yu. 2015. Petuum: A New Platform for Distributed Machine Learning on Big Data. IEEE Transactions on Big Data 1, 2 (2015), 49–67. https://doi.org/10.1109/TBDATA.2015.2472014