DETOX

This repository contains source code for DETOX, a redundancy-based framework for faster and more robust gradient aggregation.

Overview:

DETOX is a general Byzantine-resilient distributed training framework that combines algorithmic redundancy with robust aggregation methods. DETOX operates in two steps, a filtering step that uses limited redundancy to significantly reduce the effect of Byzantine nodes, and a hierarchical aggregation step that can be used in tandem with any state-of-the-art robust aggregation method.

Depdendencies:

Tested stable depdencises:

python 3.7 (Anaconda)
PyTorch 1.0.1
torchvision 0.1.18
MPI4Py 0.3.0
python-blosc 1.5.0
joblib 0.13.2

We highly recommend installing an Anaconda environment. You will get a high-quality BLAS library (MKL) and you get a controlled compiler version regardless of your Linux distro.

Cluster Setup:

For running on distributed cluster, the first thing you need do is to launch AWS EC2 instances.

Launching Instances:

The script ./tools/pytorch_ec2.py helps you to launch EC2 instances automatically, but before running this script, you should follow the instruction to setup AWS CLI on your local machine. After that, please edit this part in ./tools/pytorch_ec2.py

cfg = Cfg({
    "name" : "PS_PYTORCH",      # Unique name for this specific configuration
    "key_name": "NameOfKeyFile",          # Necessary to ssh into created instances
    # Cluster topology
    "n_masters" : 1,                      # Should always be 1
    "n_workers" : 8,
    "num_replicas_to_aggregate" : "8", # deprecated, not necessary
    "method" : "spot",
    # Region speficiation
    "region" : "us-west-2",
    "availability_zone" : "us-west-2b",
    # Machine type - instance type configuration.
    "master_type" : "m4.2xlarge",
    "worker_type" : "m4.2xlarge",
    # please only use this AMI for pytorch
    "image_id": "ami-xxxxxxxx",            # id of AMI
    # Launch specifications
    "spot_price" : "0.15",                 # Has to be a string
    # SSH configuration
    "ssh_username" : "ubuntu",            # For sshing. E.G: ssh ssh_username@hostname
    "path_to_keyfile" : "/dir/to/NameOfKeyFile.pem",

    # NFS configuration
    # To set up these values, go to Services > ElasticFileSystem > Create new filesystem, and follow the directions.
    #"nfs_ip_address" : "172.31.3.173",         # us-west-2c
    #"nfs_ip_address" : "172.31.35.0",          # us-west-2a
    "nfs_ip_address" : "172.31.14.225",          # us-west-2b
    "nfs_mount_point" : "/home/ubuntu/shared",       # NFS base dir

For setting everything up on EC2 cluster, the easiest way is to setup one machine and create an AMI. Then use the AMI id for image_id in pytorch_ec2.py. Then, launch EC2 instances by running

python ./tools/pytorch_ec2.py launch

After all launched instances are ready (this may take a while), getting private ips of instances by

python ./tools/pytorch_ec2.py get_hosts

this will write ips into a file named hosts_address, which looks like

172.31.16.226 (${PS_IP})
172.31.27.245
172.31.29.131
172.31.18.108
172.31.18.174
172.31.17.228
172.31.16.25
172.31.30.61
172.31.29.30

After generating the hosts_address of all EC2 instances, running the following command will copy your keyfile to the parameter server (PS) instance whose address is always the first one in hosts_address. local_script.sh will also do some basic configurations e.g. clone this git repo

bash ./tool/local_script.sh ${PS_IP}

SSH related:

At this stage, you should ssh to the PS instance and all operation should happen on PS. In PS setting, PS should be able to ssh to any compute node. Running the following script will do the trick for you:

bash ./tools/remote_script.sh

Prepare Datasets

We currently support MNIST and CIFAR-10/100 datasets. Download, split, and transform datasets by (and ./tools/remote_script.sh dose this for you)

bash ./src/data_prepare.sh

Job Launching

Since this project is built on MPI, tasks are required to be launched by PS (or master) instance. run_pytorch.sh wraps job-launching process up. Commonly used options (arguments) are listed as following:

Argument	Comments
`n`	Number of processes (size of cluster) e.g. if we have P compute node and 1 PS, n=P+1.
`hostfile`	A directory to the file that contains Private IPs of every node in the cluster, we use `hosts_address` here as mentioned before.
`lr`	Inital learning rate that will be use.
`momentum`	Value of momentum that will be use.
`network`	Types of deep neural nets, currently `LeNet`, `ResNet-18/32/50/110/152`, and `VGGs` are supported.
`dataset`	Datasets use for training.
`batch-size`	Batch size for optimization algorithms.
`comm-type`	A fake parameter, please always set it to be `Bcast`, which gives you logarithmic comm complexity.
`num-aggregate`	Number of gradients required for the PS to aggregate.
`mode`	Robust aggregation methods e.g. `bulyan`, `multi-krum`, `coord-median`, `signSGD`
`approach`	This can be set to `baseline`, `draco-lite`(DETOX), and `maj_vote`
`epochs`	The maximal number of epochs to train (somehow redundant).
`err-mode`	Byzantine attack to simulate can be set as `rev_grad` or `constant`
`max-steps`	total number if iterations to run.
`eval-freq`	Frequency of iterations to evaluation the model.
`worker-fail`	Number of Byzantine nodes to simulate.
`lis-simulation`	Enable the "A little is enough attack", note that if this is set to `simulate`, the `err-mode` won't work any more.
`train-dir`	Directory to save model checkpoints for evaluation.

Model Evaluation

Distributed evaluator will fetch model checkpoints from the shared directory and evaluate model on validation set. To evaluate model, you can run

bash ./src/evaluate_pytorch.sh

with specified arguments.

Evaluation arguments are listed as following:

Argument	Comments
`eval-batch-size`	Batch size (on validation set) used during model evaluation.
`eval-freq`	Frequency of iterations to evaluation the model, should be set to the same value as run_pytorch.sh.
`network`	Types of deep neural nets, should be set to the same value as run_pytorch.sh.
`dataset`	Datasets use for training, should be set to the same value as run_pytorch.sh.
`model-dir`	Directory to save model checkpoints for evaluation, should be set to the same value as run_pytorch.sh.

Future Work

Those are potential directions we are actively working on, stay tuned!

Provide CUDA support.
Explore the compatibility with federated learning.
Explore the Byzantine-resilience in decentralized setups.

Citation

@inproceedings{rajput2019detox,
  title={DETOX: A redundancy-based framework for faster and more robust gradient aggregation},
  author={Rajput, Shashank and Wang, Hongyi and Charles, Zachary and Papailiopoulos, Dimitris},
  booktitle={Advances in Neural Information Processing Systems},
  pages={10320--10330},
  year={2019}
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

DETOX

Overview:

Depdendencies:

Cluster Setup:

Launching Instances:

SSH related:

Prepare Datasets

Job Launching

Model Evaluation

Future Work

Citation

Files

README.md

Latest commit

History

README.md

File metadata and controls

DETOX

Overview:

Depdendencies:

Cluster Setup:

Launching Instances:

SSH related:

Prepare Datasets

Job Launching

Model Evaluation

Future Work

Citation