Bringing Deep Learning Workloads to JSC supercomputers

Parallelize Training

Alexandre Strube // Sabrina Benassou

June 25, 2024

The ResNet50 Model

ImageNet class

import json
import os

from PIL import Image
from torch.utils.data import Dataset


class ImageNet(Dataset):
    def __init__(self, root, split, transform=None):
        if split not in ["train", "val"]:
            raise ValueError("split must be either 'train' or 'val'")

        self.root = root

        # imagenet_{split}.json maps relative image paths to class labels.
        with open(os.path.join(root, "imagenet_{}.json".format(split)), "rb") as f:
            data = json.load(f)

        self.samples = list(data.keys())
        self.targets = list(data.values())
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # Load the image lazily and force three channels.
        x = Image.open(os.path.join(self.root, self.samples[idx])).convert("RGB")
        if self.transform:
            x = self.transform(x)
        return x, self.targets[idx]

PyTorch Lightning Data Module

from typing import Callable, Optional

import pytorch_lightning as pl  # assumed import behind the `pl` alias
from torch.utils.data import DataLoader


class ImageNetDataModule(pl.LightningDataModule):
    def __init__(
        self,
        data_root: str,
        batch_size: int,
        num_workers: int,
        dataset_transforms: Callable,
    ):
        super().__init__()
        self.data_root = data_root
        self.batch_size = batch_size
        self.num_workers = num_workers
        self.dataset_transforms = dataset_transforms

    def setup(self, stage: Optional[str] = None):
        # Only the training split is used in this example.
        self.train = ImageNet(self.data_root, "train", self.dataset_transforms)

    def train_dataloader(self):
        return DataLoader(
            self.train,
            batch_size=self.batch_size,
            num_workers=self.num_workers,
        )

PyTorch Lightning Module

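import torch
import torch.nn.functional as F
from torchvision.models import resnet50


class resnet50Model(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # Start from ImageNet-pretrained weights. Newer torchvision releases
        # prefer the `weights=` argument over the deprecated `pretrained=True`.
        self.model = resnet50(pretrained=True)

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, labels = batch
        pred = self.forward(x)
        train_loss = F.cross_entropy(pred, labels)
        # Logged values show up in TensorBoard.
        self.log("training_loss", train_loss)
        return train_loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)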

One GPU training

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((256, 256))
])

# 1. Organize the data
datamodule = ImageNetDataModule("/p/scratch/training2425/data/", 256, \
    int(os.getenv('SLURM_CPUS_PER_TASK')), transform)
# 2. Build the model using desired Task
model = resnet50Model()
# 3. Create the trainer
trainer = pl.Trainer(max_epochs=10,  accelerator="gpu")
# 4. Train the model
trainer.fit(model, datamodule=datamodule)
# 5. Save the model!
trainer.save_checkpoint("image_classification_model.pt")

One GPU training

#!/bin/bash -x
#SBATCH --nodes=1            
#SBATCH --gres=gpu:1
#SBATCH --ntasks-per-node=1  
#SBATCH --cpus-per-task=96
#SBATCH --time=06:00:00
#SBATCH --partition=dc-gpu
#SBATCH --account=training2425
#SBATCH --output=%j.out
#SBATCH --error=%j.err
#SBATCH --reservation=training2425 

# Pass the CPUs-per-task setting on to srun
export SRUN_CPUS_PER_TASK="$SLURM_CPUS_PER_TASK"
# activate env
source $HOME/course/$USER/sc_venv_template/activate.sh
# run script from above
time srun python3 gpu_training.py

real    342m11.864s

DEMO

But what about many GPUs?

That’s when things get interesting

Data Parallel

Data Parallel - Averaging

Multi-GPU training

1 node and 4 GPUs

#!/bin/bash -x
#SBATCH --nodes=1                     
#SBATCH --gres=gpu:4                  # Use the 4 GPUs available
#SBATCH --ntasks-per-node=4           # With pl, set this to the number of GPUs per node (4 here)
#SBATCH --cpus-per-task=24            # Divide the number of cpus (96) by the number of GPUs (4)
#SBATCH --time=02:00:00
#SBATCH --partition=dc-gpu
#SBATCH --account=training2425
#SBATCH --output=%j.out
#SBATCH --error=%j.err
#SBATCH --reservation=training2425 

export CUDA_VISIBLE_DEVICES=0,1,2,3    # Very important to make the GPUs visible
export SRUN_CPUS_PER_TASK="$SLURM_CPUS_PER_TASK"

source $HOME/course/$USER/sc_venv_template/activate.sh
time srun python3 gpu_training.py

real    89m15.923s

DEMO

That’s it for data parallel!

Copy of the model on each GPU
Use different data for each GPU
Everything else is the same
Average after each iteration
Update of the weights
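
The steps above can be sketched in a few lines of pure Python. This is only a toy illustration, not how PyTorch implements it: the "GPUs" are list slices, the model is the one-parameter function y = w * x, and the names `grad_mse` and `data_parallel_step` are made up for this example. It assumes equally sized shards, in which case the averaged gradient equals the gradient over the full batch.

```python
# Toy data-parallel step for the 1-parameter model y = w * x (pure Python).
# "Workers" stand in for GPUs; all names here are illustrative.

def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def data_parallel_step(w, xs, ys, num_workers, lr):
    # Each worker holds a copy of w and sees a different, equally sized shard.
    shard = len(xs) // num_workers
    grads = [
        grad_mse(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(num_workers)
    ]
    # Average the per-worker gradients (the "all-reduce" step).
    avg_grad = sum(grads) / num_workers
    # Every worker applies the same update, so the copies stay identical.
    return w - lr * avg_grad, avg_grad


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]  # data generated with w = 2
new_w, avg_grad = data_parallel_step(0.0, xs, ys, num_workers=4, lr=0.01)
```

Because the averaged gradient matches the full-batch gradient here, four workers take the same step as one worker would on all eight samples, just with the work split four ways.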

There are more levels!

Data Parallel - Multi Node

Before we go further…

Data parallel is usually good enough 👌
If you need more than this, you should be giving this course, not me 🤷‍♂️

Model Parallel

Model itself is too big to fit in one single GPU 🐋
Each GPU holds a slice of the model 🍕
Data moves from one GPU to the next
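
A minimal sketch of that idea, with plain Python functions standing in for layers and list positions standing in for GPUs. The helper names `split_into_stages` and `forward` are hypothetical, and no real device placement happens here.

```python
# Naive model parallelism: contiguous slices of layers, one slice per "GPU".
# Layers are plain Python functions; device placement is only simulated.

def split_into_stages(layers, num_gpus):
    """Partition the layer list into num_gpus contiguous slices."""
    per_stage, extra = divmod(len(layers), num_gpus)
    stages, start = [], 0
    for gpu in range(num_gpus):
        end = start + per_stage + (1 if gpu < extra else 0)
        stages.append(layers[start:end])
        start = end
    return stages


def forward(stages, x):
    """Run the input through each stage in order and record which GPU worked."""
    trace = []
    for gpu, stage in enumerate(stages):
        for layer in stage:
            x = layer(x)
        trace.append(gpu)  # only this GPU is busy at this point
    return x, trace


layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3,
          lambda v: v * v, lambda v: v + 10]
stages = split_into_stages(layers, num_gpus=2)
out, trace = forward(stages, 4)
```

The `trace` list makes the sequential nature visible: the stages run one after another, never at the same time.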

Model Parallel

What’s the problem here? 🧐

Model Parallel

Waste of resources
While one GPU is working, the others are waiting for the whole process to end
- Source: GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

Model Parallel - Pipelining

This is an oversimplification!

Actually, you split the input minibatch into multiple microbatches.
There’s still idle time - an unavoidable “bubble” 🫧
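
The size of the bubble can be estimated with a small calculation. The sketch below assumes an idealized GPipe-style schedule in which every stage takes the same time per microbatch; real schedules differ, and the function names are made up for this example. With p stages and m microbatches, one pass needs m + p - 1 time slots, of which each GPU is busy for m.

```python
def forward_schedule(num_stages, num_microbatches):
    """Return {time_slot: [(stage, microbatch), ...]} for one pipelined pass."""
    schedule = {}
    for stage in range(num_stages):
        for mb in range(num_microbatches):
            # Stage s can start microbatch b once stage s-1 has finished it.
            schedule.setdefault(stage + mb, []).append((stage, mb))
    return schedule


def bubble_fraction(num_stages, num_microbatches):
    """Fraction of GPU time slots spent idle in the idealized schedule."""
    total_slots = num_microbatches + num_stages - 1
    return (num_stages - 1) / total_slots


schedule = forward_schedule(num_stages=4, num_microbatches=8)
busy = sum(len(work) for work in schedule.values())
idle_share = 1 - busy / (4 * len(schedule))
```

With a single microbatch the formula gives the naive case (three of four GPUs idle); more microbatches shrink the bubble but never remove it.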

Model Parallel - Multi Node

In this case, each node does the same as the others.
At each step, they all synchronize their weights.

Model Parallel - Multi Node

Model Parallel - going bigger

You can also have layers spread over multiple GPUs
One can even pipeline among nodes…

Recap

Data parallelism:
- Split the data over multiple GPUs
- Each GPU runs the whole model
- The gradients are averaged at each step
- Update of the model’s weights
Data parallelism, multi-node:
- Same, but gradients are averaged across nodes
Model parallelism:
- Split the model over multiple GPUs
- Each GPU does the forward/backward pass for its part of the model
Model parallelism, multi-node:
- Same, but gradients are averaged across nodes

Parallel Training with PyTorch DDP

PyTorch’s DDP (Distributed Data Parallel) works as follows:
- Each GPU across each node gets its own process.
- Each GPU gets visibility into a subset of the overall dataset. It will only ever see that subset.
- Each process inits the model.
- Each process performs a full forward and backward pass in parallel.
- The gradients are synced and averaged across all processes.
- Each process updates its optimizer.
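
How each process ends up with its own subset can be illustrated with a strided split of the sample indices. This is a simplified sketch in the spirit of `torch.utils.data.distributed.DistributedSampler`, without shuffling; `shard_indices` is a hypothetical helper, not a PyTorch function.

```python
def shard_indices(dataset_size, world_size, rank):
    """Indices seen by one process when the dataset is split by stride."""
    indices = list(range(dataset_size))
    # Pad by wrapping around so every rank gets the same number of samples.
    per_rank = -(-dataset_size // world_size)  # ceiling division
    total = per_rank * world_size
    indices += indices[: total - dataset_size]
    # Rank r takes every world_size-th index, starting at r.
    return indices[rank:total:world_size]


shards = [shard_indices(dataset_size=10, world_size=4, rank=r) for r in range(4)]
```

Equal shard lengths matter because all processes must run the same number of steps, otherwise the gradient synchronization would stall.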

Terminology

WORLD_SIZE: the number of processes participating in the job.
RANK: the global index of the process across all nodes (0 to WORLD_SIZE - 1).
LOCAL_RANK: the index of the process on its own node.
MASTER_PORT: a free port on the machine that hosts the process with rank 0.
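
As a small worked example of how these numbers relate, the sketch below assumes one process per GPU and that processes are numbered node by node (block-wise placement). Both functions are hypothetical helpers for illustration only.

```python
def world_size(num_nodes, gpus_per_node):
    """Total number of processes when every GPU gets one process."""
    return num_nodes * gpus_per_node


def global_rank(node_index, local_rank, gpus_per_node):
    """Global rank of a process, assuming ranks are assigned node by node."""
    return node_index * gpus_per_node + local_rank


# 16 nodes with 4 GPUs each, as in the multi-node job later in this deck.
total = world_size(num_nodes=16, gpus_per_node=4)
last = global_rank(node_index=15, local_rank=3, gpus_per_node=4)
```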

DDP steps

Set up the environment variables for the distributed mode (WORLD_SIZE, RANK, LOCAL_RANK …)

# os.getenv returns strings, so convert to int before use.
# The total number of processes started by Slurm.
ntasks = int(os.getenv('SLURM_NTASKS'))
# Index of the current process.
rank = int(os.getenv('SLURM_PROCID'))
# Index of the current process on this node only.
local_rank = int(os.getenv('SLURM_LOCALID'))
# The number of nodes
nnodes = int(os.getenv("SLURM_NNODES"))
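
One possible way to bundle these lookups is a small helper that converts the values to integers and checks that they are consistent. `read_slurm_layout` is a hypothetical name, and the environment in the example is hand-written so the snippet also runs outside a Slurm job.

```python
import os


def read_slurm_layout(env=os.environ):
    """Read the Slurm variables used above and convert them to int."""
    layout = {
        "world_size": int(env["SLURM_NTASKS"]),
        "rank": int(env["SLURM_PROCID"]),
        "local_rank": int(env["SLURM_LOCALID"]),
        "num_nodes": int(env["SLURM_NNODES"]),
    }
    if not 0 <= layout["rank"] < layout["world_size"]:
        raise ValueError("SLURM_PROCID must lie in [0, SLURM_NTASKS)")
    return layout


# Example with a made-up environment (values chosen for illustration only).
fake_env = {
    "SLURM_NTASKS": "64",
    "SLURM_PROCID": "5",
    "SLURM_LOCALID": "1",
    "SLURM_NNODES": "16",
}
layout = read_slurm_layout(fake_env)
```

Inside a real job, calling `read_slurm_layout()` without arguments reads from the process environment; a missing variable then raises a `KeyError` instead of silently passing `None` along.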

DDP steps

Initialize a sampler to specify the sequence of indices/keys used in data loading.
Implement data parallelism of the model.
Allow only one process to save checkpoints.

datamodule = ImageNetDataModule("/p/scratch/training2425/data/", 256, \
    int(os.getenv('SLURM_CPUS_PER_TASK')), transform)
trainer = pl.Trainer(max_epochs=10,  accelerator="gpu", num_nodes=nnodes)
trainer.fit(model, datamodule=datamodule)
trainer.save_checkpoint("image_classification_model.pt")

DDP steps

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((256, 256))
])

# 1. The number of nodes (os.getenv returns a string, Trainer expects an int)
nnodes = int(os.getenv("SLURM_NNODES"))
# 2. Organize the data
datamodule = ImageNetDataModule("/p/scratch/training2425/data/", 128, \
    int(os.getenv('SLURM_CPUS_PER_TASK')), transform)
# 3. Build the model using desired Task
model = resnet50Model()
# 4. Create the trainer
trainer = pl.Trainer(max_epochs=10,  accelerator="gpu", num_nodes=nnodes)
# 5. Train the model
trainer.fit(model, datamodule=datamodule)
# 6. Save the model!
trainer.save_checkpoint("image_classification_model.pt")

DDP training

16 nodes and 4 GPUs each

#!/bin/bash -x
#SBATCH --nodes=16                     # This needs to match Trainer(num_nodes=...)
#SBATCH --gres=gpu:4                   # Use the 4 GPUs available
#SBATCH --ntasks-per-node=4            # With pl, set this to the number of GPUs per node (4 here)
#SBATCH --cpus-per-task=24             # Divide the number of cpus (96) by the number of GPUs (4)
#SBATCH --time=00:15:00
#SBATCH --partition=dc-gpu
#SBATCH --account=training2425
#SBATCH --output=%j.out
#SBATCH --error=%j.err
#SBATCH --reservation=training2425 

export CUDA_VISIBLE_DEVICES=0,1,2,3    # Very important to make the GPUs visible
export SRUN_CPUS_PER_TASK="$SLURM_CPUS_PER_TASK"

source $HOME/course/$USER/sc_venv_template/activate.sh
time srun python3 ddp_training.py

real    6m56.457s

DDP training

With 4 nodes:

real    24m48.169s

With 8 nodes:

real    13m10.722s

With 16 nodes:

real    6m56.457s

With 32 nodes:

real    4m48.313s
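
To read these timings as a scaling result, one can compute speedup and parallel efficiency relative to the 4-node run. The snippet below only does arithmetic on the wall-clock times listed above; the helper names are made up for this example, and `real` time includes job startup, so this is a rough estimate.

```python
def to_seconds(minutes, seconds):
    return 60 * minutes + seconds


# Wall-clock times reported above: number of nodes -> seconds.
timings = {
    4: to_seconds(24, 48.169),
    8: to_seconds(13, 10.722),
    16: to_seconds(6, 56.457),
    32: to_seconds(4, 48.313),
}


def scaling_table(timings):
    """Speedup and efficiency of each run relative to the smallest run."""
    base_nodes = min(timings)
    base_time = timings[base_nodes]
    rows = []
    for nodes in sorted(timings):
        speedup = base_time / timings[nodes]
        efficiency = speedup / (nodes / base_nodes)
        rows.append((nodes, speedup, efficiency))
    return rows


for nodes, speedup, efficiency in scaling_table(timings):
    print(f"{nodes:>2} nodes: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Relative to 4 nodes, efficiency comes out at about 94% on 8 nodes, 89% on 16 nodes and 65% on 32 nodes, so each doubling still helps but pays off less.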

Data Parallel

It was

trainer = pl.Trainer(max_epochs=10,  accelerator="gpu")

Became

nnodes = int(os.getenv("SLURM_NNODES"))
trainer = pl.Trainer(max_epochs=10,  accelerator="gpu", num_nodes=nnodes)

Data Parallel

It was

#SBATCH --nodes=1                
#SBATCH --gres=gpu:1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=96

Became

#SBATCH --nodes=16                   # This needs to match Trainer(num_nodes=...)
#SBATCH --gres=gpu:4                 # Use the 4 GPUs available
#SBATCH --ntasks-per-node=4          # With pl, set this to the number of GPUs per node (4 here)
#SBATCH --cpus-per-task=24           # Divide the number of cpus (96) by the number of GPUs (4)
export CUDA_VISIBLE_DEVICES=0,1,2,3  # Very important to make the GPUs visible

DEMO

TensorBoard

In resnet50.py
```
self.log("training_loss", train_loss)
```

TensorBoard

source $HOME/course/$USER/sc_venv_template/activate.sh
tensorboard --logdir=[PATH_TO_TENSOR_BOARD]

DEMO

Llview

llview
https://go.fzj.de/llview-jureca

DAY 2 RECAP

Accessed data using FS, Arrow, and H5 files.
Ran parallel code.
Can submit single-node, multi-GPU and multi-node training.
Can use TensorBoard on the supercomputer.
Can use llview.

ANY QUESTIONS??

Feedback is more than welcome!

Link to other courses at JSC