Getting Started with Distributed Machine Learning with PyTorch and Ray

by PyTorch
This post covers various elements of the Ray ecosystem and how it can be used with
PyTorch!
What is Ray
Ray is an open source library for parallel and distributed Python. At a high level, the
Ray ecosystem consists of three parts: the core Ray system, scalable libraries for
machine learning (both native and third party), and tools for launching clusters on any
cluster manager or cloud provider.
Simplicity: you can scale your Python applications without rewriting them, and the
same code can run on one machine or multiple machines.
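As a small, self-contained sketch of what that looks like with Ray's core task API (the square function and its inputs are made up for illustration), an ordinary Python function becomes a parallel task with a single decorator, and the same script runs unchanged on a laptop or a cluster:

import ray

ray.init()  # start Ray locally; pass an address here to connect to an existing cluster

@ray.remote
def square(x):
    # an ordinary Python function, now runnable as a parallel remote task
    return x * x

# launch four tasks in parallel and gather the results
futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]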
Library Ecosystem
Because Ray is a general-purpose framework, the community has built many libraries
and frameworks on top of it to accomplish different tasks. The vast majority of these
support PyTorch, require minimal modifications to your code, and integrate seamlessly
with each other. Below are just a few of the many libraries in the ecosystem.
RaySGD
Comparison of PyTorch’s DataParallel vs Ray (which uses PyTorch’s DistributedDataParallel under the
hood) on p3dn.24xlarge instances. Image source.
RaySGD is a library that provides distributed training wrappers for data parallel
training. For example, the RaySGD TorchTrainer is a wrapper around
torch.distributed.launch. It provides a Python API to easily incorporate distributed
training into a larger Python application, as opposed to needing to wrap your training
code in bash scripts.
Ease of use: You can scale PyTorch’s native DistributedDataParallel without needing
to monitor individual nodes.
Scalability: You can scale up and down. Start on a single CPU. Scale up to multi-
node, multi-CPU, or multi-GPU clusters by changing 2 lines of code.
Accelerated Training: There is built-in support for mixed precision training with
NVIDIA Apex.
Fault Tolerance: There is support for automatic recovery when cloud machines are
preempted.
Compatibility: There is seamless integration with other libraries like Ray Tune and
Ray Serve.
You can get started with TorchTrainer by installing Ray (pip install -U ray torch) and
running the code below:
import torch
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR10
import torchvision.transforms as transforms

import ray
from ray.util.sgd.torch import TorchTrainer
from ray.util.sgd.torch import TrainingOperator
# https://github.com/kuangliu/pytorch-cifar/blob/master/models/resnet.py
from ray.util.sgd.torch.resnet import ResNet18


def cifar_creator(config):
    """Returns dataloaders to be used in `train` and `validate`."""
    tfms = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465),
                             (0.2023, 0.1994, 0.2010)),
    ])  # mean/std transformation
    train_loader = DataLoader(
        CIFAR10(root="~/data", download=True, transform=tfms),
        batch_size=config["batch"])
    validation_loader = DataLoader(
        CIFAR10(root="~/data", download=True, transform=tfms),
        batch_size=config["batch"])
    return train_loader, validation_loader


def optimizer_creator(model, config):
    """Returns an optimizer (or multiple)"""
    return torch.optim.SGD(model.parameters(), lr=config["lr"])


CustomTrainingOperator = TrainingOperator.from_creators(
    model_creator=ResNet18,  # A function that returns a nn.Module
    optimizer_creator=optimizer_creator,  # A function that returns an optimizer
    data_creator=cifar_creator,  # A function that returns dataloaders
    loss_creator=torch.nn.CrossEntropyLoss,  # A loss function
)

ray.init()

trainer = TorchTrainer(
    training_operator_cls=CustomTrainingOperator,
    config={"lr": 0.01,  # used in optimizer_creator
            "batch": 64  # used in data_creator
            },
    num_workers=2,  # amount of parallelism
    use_gpu=torch.cuda.is_available(),
    use_tqdm=True)

stats = trainer.train()
print(trainer.validate())

torch.save(trainer.state_dict(), "checkpoint.pt")
trainer.shutdown()
print("success!")
The script will download CIFAR10 and use a ResNet18 model to do image
classification. With a single parameter change (num_workers=N), you can utilize
multiple GPUs.
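To make that concrete, here is a sketch of the only change involved, assuming a cluster with 8 GPUs is already available (the worker count is an illustrative assumption, not part of the original example):

# same TorchTrainer as above, re-created with settings for an 8-GPU cluster
trainer = TorchTrainer(
    training_operator_cls=CustomTrainingOperator,
    config={"lr": 0.01, "batch": 64},
    num_workers=8,   # one data-parallel worker per GPU
    use_gpu=True,
    use_tqdm=True)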
If you would like to learn more about RaySGD and how to scale PyTorch training across
a cluster, you should check out this blog post.
Ray Tune
Ray Tune’s implementation of optimization algorithms like Population Based Training (shown above) can be
used with PyTorch for more performant models. Image from DeepMind.
Ray Tune is a Python library for experiment execution and hyperparameter tuning at any
scale. Some advantages of the library are:
Access to state-of-the-art algorithms such as Population Based Training (PBT),
BayesOptSearch, and HyperBand/ASHA.
You can get started with Ray Tune by installing Ray (pip install ray torch torchvision)
and running the code below.
import numpy as np
import torch
import torch.optim as optim

from ray import tune
from ray.tune.examples.mnist_pytorch import get_data_loaders, train, test
import ray
import sys

if len(sys.argv) > 1:
    ray.init(redis_address=sys.argv[1])

import torch.nn as nn
import torch.nn.functional as F


class ConvNet(nn.Module):
    def __init__(self):
        super(ConvNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 3, kernel_size=3)
        self.fc = nn.Linear(192, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 3))
        x = x.view(-1, 192)
        x = self.fc(x)
        return F.log_softmax(x, dim=1)


def train_mnist(config):
    model = ConvNet()
    train_loader, test_loader = get_data_loaders()
    optimizer = optim.SGD(
        model.parameters(), lr=config["lr"], momentum=config["momentum"])
    for i in range(10):
        train(model, optimizer, train_loader, torch.device("cpu"))
        acc = test(model, test_loader, torch.device("cpu"))
        tune.track.log(mean_accuracy=acc)
        if i % 5 == 0:
            # This saves the model to the trial directory
            torch.save(model.state_dict(), "./model.pth")


from ray.tune.schedulers import ASHAScheduler

search_space = {
    "lr": tune.choice([0.001, 0.01, 0.1]),
    "momentum": tune.uniform(0.1, 0.9)
}

analysis = tune.run(
    train_mnist,
    num_samples=30,
    scheduler=ASHAScheduler(metric="mean_accuracy", mode="max", grace_period=1),
    config=search_space)
The script shows how to leverage ASHA, a state-of-the-art early stopping algorithm that terminates
less promising trials and allocates more time and resources to more promising trials. Code source and
explanation.
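After the run, the analysis object returned by tune.run can be queried for the best trial. A minimal sketch, assuming the helper below is available in your installed Ray version (exact method signatures may differ slightly between releases):

# inspect the results of the sweep above
best_config = analysis.get_best_config(metric="mean_accuracy", mode="max")
print("Best hyperparameters found:", best_config)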
If you would like to learn about how to incorporate Ray Tune into your PyTorch
workflow, you should check out this blog post.
Ray Serve
Ray Serve can not only be used to serve models on its own, but also to scale other serving tools like FastAPI.
Ray Serve is a library for easy-to-use scalable model serving. Some advantages of the
library are:
The ability to use a single toolkit to serve everything from deep learning models
(PyTorch, TensorFlow, etc.) to scikit-learn models, to arbitrary Python business logic.
Compatibility with many other libraries like Ray Tune and FastAPI.
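To give a feel for the API, below is a minimal sketch of serving a placeholder PyTorch model with Ray Serve's deployment API from the Ray 1.x era; the model, deployment class, and input are illustrative assumptions rather than code from this post, and the exact API surface varies between Ray versions:

import ray
from ray import serve
import torch
import torch.nn as nn

ray.init()
serve.start()

@serve.deployment
class TorchPredictor:
    def __init__(self):
        # placeholder model for the sketch; a real deployment would load a
        # trained checkpoint (e.g. the one saved by TorchTrainer above)
        self.model = nn.Linear(4, 2)
        self.model.eval()

    def __call__(self, batch):
        tensor = torch.tensor(batch, dtype=torch.float32)
        with torch.no_grad():
            return self.model(tensor).tolist()

TorchPredictor.deploy()

# query the deployment from Python via a handle; Ray Serve also exposes it over HTTP
handle = TorchPredictor.get_handle()
print(ray.get(handle.remote([[1.0, 2.0, 3.0, 4.0]])))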
If you would like to learn how to incorporate Ray Serve and Ray Tune together into your
PyTorch workflow, you should check out the documentation for a full code example.
RLlib
RLlib provides ways to customize almost all aspects of training, including neural network models, action
distributions, policy definitions, environments, and the sample collection process.
RLlib is a library for reinforcement learning that offers both high scalability and a
unified API for a variety of applications. Some advantages include:
Native support for PyTorch, TensorFlow Eager, and TensorFlow (1.x and 2.x).
Support for complex model types, such as attention nets and LSTM stacks, via simple
config flags and auto-wrappers.
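As a brief sketch of what selecting the PyTorch backend looks like in practice (the environment, worker count, and iteration count are arbitrary choices for illustration, and gym must be installed for CartPole):

import ray
from ray.rllib.agents import ppo

ray.init()

# train PPO on CartPole using RLlib's PyTorch backend
trainer = ppo.PPOTrainer(
    env="CartPole-v0",
    config={"framework": "torch", "num_workers": 2})

for i in range(3):
    result = trainer.train()
    print(f"iteration {i}: episode_reward_mean={result['episode_reward_mean']}")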
Cluster Launcher
The Ray Cluster Launcher simplifies the process of launching and scaling across any cluster or cloud provider.
Once you have developed an application on your laptop and want to scale it up to the
cloud (perhaps with more data or more GPUs), the next steps aren’t always clear. You
either have an infrastructure team set everything up for you, or you work through the
steps yourself: navigating the cloud provider’s management console to set instance types,
security groups, spot prices, instance limits, and more.
An easier approach is to use the Ray Cluster Launcher to launch and scale machines
across any cluster or cloud provider. The Cluster Launcher lets you autoscale, sync files,
submit scripts, forward ports, and more. This means that you can run your Ray clusters
on Kubernetes, AWS, GCP, Azure, or a private cluster without needing to understand the
low-level details of cluster management.
Conclusion
Ray provides a distributed computing foundation for Ant Group’s Fusion Engine.
This article covered some of the benefits of Ray in the PyTorch ecosystem. Ray is being
used for a wide variety of applications, from Ant Group using Ray to support its financial
business, to LinkedIn running Ray on YARN, to Pathmind using Ray to connect
reinforcement learning to simulation software, and more. If you have any questions or
thoughts about Ray or want to learn more about parallel and distributed Python, please
join our community through Discourse, Slack, or GitHub.