Alexandre Strube // Sabrina Benassou // Javad Kasravi
November 19, 2024
Links for the complementary parts of this course:
Time | Title |
---|---|
10:00 - 10:10 | Welcome |
10:10 - 10:40 | Introduction |
10:40 - 11:00 | Jupyter-JSC |
11:00 - 11:10 | Coffee Break |
11:10 - 11:30 | SLURM |
11:30 - 12:00 | Setup Environment |
12:00 - 12:10 | Coffee Break |
12:10 - 12:40 | Distributed Data Parallel |
12:40 - 13:00 | Model Parallelism and Analysis |
Please open this document in your own browser! We will need it for the exercises. https://go.fzj.de/dl-in-neuroscience
training2441
module spider
strube1$ module spider PyTorch
------------------------------------------------------------------------------------
PyTorch:
------------------------------------------------------------------------------------
Description:
Tensors and Dynamic neural networks in Python with strong GPU acceleration.
PyTorch is a deep learning framework that puts Python first.
Versions:
PyTorch/1.7.0-Python-3.8.5
PyTorch/1.8.1-Python-3.8.5
PyTorch/1.11-CUDA-11.5
PyTorch/1.12.0-CUDA-11.7
Other possible modules matches:
PyTorch-Geometric PyTorch-Lightning
...
module avail
(Inside hierarchy)
Stage (full collection of software of a given year)
Compiler
MPI
Module
Eg:
module load Stages/2023 GCC OpenMPI PyTorch
module spider Software/version
Search for the software itself - it will suggest a version
Search with the version - it will suggest the hierarchy
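The hierarchy above implies a load order: stage first, then compiler, then MPI, then the software itself. A minimal Python sketch that assembles the `module load` lines in that order. The function name `module_load_lines` is hypothetical, and the script only builds strings; it does not call `module` itself.

```python
def module_load_lines(stage, compiler, mpi, modules):
    """Return module load commands ordered stage -> compiler -> MPI -> software."""
    stage_line = f"module load Stages/{stage}"
    software_line = "module load " + " ".join([compiler, mpi, *modules])
    return [stage_line, software_line]


lines = module_load_lines("2024", "GCC", "OpenMPI", ["Python", "PyTorch"])
for line in lines:
    print(line)
```

The printed lines can be pasted into a terminal or into a batch script.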
Copy and paste these lines
module load Stages/2024
module load GCC OpenMPI Python PyTorch
# And we run a small test: import torch (the PyTorch package) and ask for its version
python -c "import torch ; print(torch.__version__)"
Should look like this:
module key
$ module key toml
The following modules match your search criteria: "toml"
------------------------------------------------------------------------------------
Jupyter: Jupyter/2020.2.5-Python-3.8.5, Jupyter/2021.3.1-Python-3.8.5, Jupyter/2021.3.2-Python-3.8.5, Jupyter/2022.3.3, Jupyter/2022.3.4
Project Jupyter exists to develop open-source software, open-standards, and services for interactive computing across dozens of programming languages.
PyQuil: PyQuil/3.0.1
PyQuil is a library for generating and executing Quil programs on the Rigetti Forest platform.
Python: Python/3.8.5, Python/3.9.6, Python/3.10.4
Python is a programming language that lets you work more quickly and integrate your systems more effectively.
------------------------------------------------------------------------------------
module load Stages/2023
module load GCC OpenMPI PyTorch
python matrix.py
Simple Linux Utility for Resource Management
Create a file named jureca-matrix.sbatch as described in the previous section, and copy all the content from the following into this file.
#!/bin/bash
#SBATCH --account=training2441 # Who pays?
#SBATCH --nodes=1 # How many compute nodes
#SBATCH --job-name=matrix-multiplication
#SBATCH --ntasks-per-node=1 # How many MPI processes per node
#SBATCH --cpus-per-task=1 # How many CPUs per MPI process
#SBATCH --output=output.%j # Where to write results
#SBATCH --error=error.%j
#SBATCH --time=00:01:00 # For how long can it run?
#SBATCH --partition=dc-gpu # Machine partition
#SBATCH --reservation=training2441 # For today only
module load Stages/2024
module load GCC OpenMPI PyTorch # Load the correct modules on the compute node(s)
srun python matrix.py # srun tells the supercomputer how to run it
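Inside a job, Slurm passes the requested resources to the program through environment variables. The sketch below, which could be placed in any Python script started with `srun`, reads a few of them. It assumes the standard variable names `SLURM_JOB_ID`, `SLURM_JOB_NAME`, `SLURM_NTASKS`, `SLURM_CPUS_PER_TASK` and `SLURM_PROCID`. The helper name `slurm_job_info` is hypothetical. Outside a job the values are simply `None`.

```python
import os


def slurm_job_info(env=None):
    """Collect a few Slurm variables; a value is None if the variable is not set."""
    env = os.environ if env is None else env
    names = [
        "SLURM_JOB_ID",         # the number that replaces %j in output.%j
        "SLURM_JOB_NAME",       # from --job-name
        "SLURM_NTASKS",         # total number of tasks in the job
        "SLURM_CPUS_PER_TASK",  # from --cpus-per-task
        "SLURM_PROCID",         # rank of this task, set by srun
    ]
    return {name: env.get(name) for name in names}


for name, value in slurm_job_info().items():
    print(f"{name} = {value}")
```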
squeue --me
squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
412169 gpus matrix-m strube1 CF 0:02 1 jsfc013
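The ST column holds a short state code. A small sketch that splits the sample line above into named fields and expands the code. The mapping covers only a subset of Slurm's states, and the function name `parse_squeue_line` is hypothetical. It assumes no column contains spaces, which holds for this sample but not necessarily for every job name.

```python
# Subset of Slurm's short job state codes
STATE_NAMES = {
    "PD": "PENDING",
    "CF": "CONFIGURING",
    "R": "RUNNING",
    "CG": "COMPLETING",
    "CD": "COMPLETED",
    "F": "FAILED",
    "TO": "TIMEOUT",
    "CA": "CANCELLED",
}

COLUMNS = ["jobid", "partition", "name", "user", "state", "time", "nodes", "nodelist"]


def parse_squeue_line(line):
    """Split one squeue data line on whitespace and expand the state code."""
    fields = dict(zip(COLUMNS, line.split()))
    # Unknown codes are kept as they are.
    fields["state"] = STATE_NAMES.get(fields["state"], fields["state"])
    return fields


job = parse_squeue_line("412169 gpus matrix-m strube1 CF 0:02 1 jsfc013")
print(job["jobid"], job["state"], job["nodelist"])
```

Here CF means the job has been allocated a node that is still being prepared.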
training2441
Simply open output.412169 and error.412169 using the editor!
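The file names come from the `--output=output.%j` and `--error=error.%j` lines of the batch script, where `%j` is replaced by the job ID. As an alternative to the editor, a minimal sketch that builds both names and prints the files if they already exist. The helper name `job_log_paths` is hypothetical.

```python
from pathlib import Path


def job_log_paths(job_id, directory="."):
    """Return (stdout_path, stderr_path) for the patterns output.%j and error.%j."""
    base = Path(directory)
    return base / f"output.{job_id}", base / f"error.{job_id}"


out_path, err_path = job_log_paths(412169)
for path in (out_path, err_path):
    # The files only exist once the job has started writing to them.
    text = path.read_text() if path.exists() else "(not written yet)"
    print(f"--- {path.name} ---")
    print(text)
```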
mkdir $PROJECT_training2441/$USER
Create a shortcut for the project in the home folder
rm -rf ~/course ; ln -s $PROJECT_training2441/$USER ~/course
# Enter the course folder
cd ~/course
# Where am I?
pwd
# We will need these later
mkdir ~/course/.cache
mkdir ~/course/.config
mkdir ~/course/.fastai
rm -rf $HOME/.cache ; ln -s ~/course/.cache $HOME/
rm -rf $HOME/.config ; ln -s ~/course/.config $HOME/
rm -rf $HOME/.fastai ; ln -s ~/course/.fastai $HOME/
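The three `rm -rf` / `ln -s` lines all follow one pattern: create the real directory in the project folder, then replace the one in the home folder with a symbolic link to it, presumably so that large caches end up in the project folder. A sketch of that pattern in Python, run inside a temporary directory so it does not touch any real files. The helper name `redirect_dir` is hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def redirect_dir(home, project, name):
    """Replace home/name with a symlink to project/name (like rm -rf followed by ln -s)."""
    target = Path(project) / name
    target.mkdir(parents=True, exist_ok=True)
    link = Path(home) / name
    if link.is_symlink() or link.is_file():
        link.unlink()
    elif link.is_dir():
        shutil.rmtree(link)  # discards the old contents, as rm -rf does
    link.symlink_to(target, target_is_directory=True)
    return link


# Demo in a throwaway directory, so nothing in the real home folder is touched.
with tempfile.TemporaryDirectory() as tmp:
    home, project = Path(tmp) / "home", Path(tmp) / "course"
    home.mkdir()
    (home / ".cache").mkdir()  # pre-existing directory that gets replaced
    link = redirect_dir(home, project, ".cache")
    is_link = link.is_symlink()
    same_target = link.resolve() == (project / ".cache").resolve()
    print(is_link, same_target)
```

As with the shell commands, whatever was in the old directory is deleted, so anything worth keeping should be moved first.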
pip
Edit the file sc_venv_template/requirements.txt
Add these lines at the end:
Run on the terminal:
sc_venv_template/setup.sh
This is a minimal demo, to show some quirks of the supercomputer
Create a file “cats.py”
from fastai.vision.all import *
from fastai.callback.tensorboard import *
#
print("Downloading dataset...")
path = untar_data(URLs.PETS)/'images'
print("Finished downloading dataset")
#
# Label function: in this dataset, cat images have file names that start with an uppercase letter
def is_cat(x): return x[0].isupper()
# Create the dataloaders and resize the images
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))
print("On the login node, this will download resnet34")
learn = vision_learner(dls, resnet34, metrics=accuracy)
cbs=[SaveModelCallback(), TensorBoardCallback('runs', trace_model=True)]
# Trains the model for 6 epochs with this dataset
learn.unfreeze()
learn.fit_one_cycle(6, cbs=cbs)
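The labels in cats.py come only from the file names, through `is_cat`. The sketch below applies the same rule to a few file names without needing fastai. The names are illustrative, written in the dataset's naming style, on the assumption that cat breeds are capitalized and dog breeds are lowercase.

```python
def is_cat(x):
    # Same rule as in cats.py: an uppercase first letter marks a cat image.
    return x[0].isupper()


# Illustrative file names in the naming style of the pets dataset.
examples = ["Abyssinian_1.jpg", "beagle_32.jpg", "Bengal_7.jpg", "pug_3.jpg"]
labels = {name: is_cat(name) for name in examples}
for name, label in labels.items():
    print(f"{name}: {'cat' if label else 'dog'}")
```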
Create a file named fastai.sbatch
#!/bin/bash
#SBATCH --account=training2441
#SBATCH --mail-user=MYUSER@fz-juelich.de
#SBATCH --mail-type=ALL
#SBATCH --nodes=1
#SBATCH --job-name=cat-classifier
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=128
#SBATCH --output=output.%j
#SBATCH --error=error.%j
#SBATCH --time=00:20:00
#SBATCH --partition=dc-gpu
#SBATCH --reservation=training2441 # For today only
source sc_venv_template/activate.sh # Now we finally use the fastai module
srun python cats.py
error.${JOBID} file
$ source sc_venv_template/activate.sh
$ python cats.py
Downloading dataset...
|████████-------------------------------| 23.50% [190750720/811706944 00:08<00:26]
Downloading: "https://download.pytorch.org/models/resnet34-b627a593.pth" to /p/project/ccstao/cstao05/.cache/torch/hub/checkpoints/resnet34-b627a593.pth
100%|█████████████████████████████████████| 83.3M/83.3M [00:00<00:00, 266MB/s]
(To exit, type CTRL-C)
The activation script must be sourced, otherwise the virtual environment will not work.
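The reason is that a script run as a separate process can only change its own environment, while `source` runs the script in the current shell. The sketch below shows the same effect with a Python child process standing in for the shell script. The variable name `DEMO_VENV_ACTIVE` is hypothetical and only used for this demonstration.

```python
import os
import subprocess
import sys

VAR = "DEMO_VENV_ACTIVE"  # hypothetical variable, used only for this demo

# Make sure the variable is not set in this (parent) process.
os.environ.pop(VAR, None)

# The child sets the variable in its own environment and prints it.
child_code = f"import os; os.environ['{VAR}'] = '1'; print(os.environ['{VAR}'])"
result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True, text=True, check=True,
)

child_saw = result.stdout.strip()
parent_sees = os.environ.get(VAR)

print("child saw:", child_saw)
print("parent sees:", parent_sees)
```

The child sees the value it set, but the parent's environment is unchanged. Running `activate.sh` as a program instead of sourcing it would lose its settings in the same way.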
Setting vars
Downloading dataset...
Finished downloading dataset
epoch train_loss valid_loss error_rate time
Epoch 1/1 : |-----------------------------------| 0.00% [0/92 00:00<?]
Epoch 1/1 : |-----------------------------------| 2.17% [2/92 00:14<10:35 1.7452]
Epoch 1/1 : |█----------------------------------| 3.26% [3/92 00:14<07:01 1.6413]
Epoch 1/1 : |██---------------------------------| 5.43% [5/92 00:15<04:36 1.6057]
...
epoch train_loss valid_loss error_rate time
0 0.049855 0.021369 0.007442 00:42
Open a notebook
Choose PyDeepLearning-2024.3 kernel
Write
By now, I expect you have managed to: