Alexandre Strube // Sabrina Benassou
October 17, 2023
Links for the complementary parts of this course:
Please open this document in your own browser! We will need it for the exercises. https://go.fzj.de/inm-ml
training2336
(It might take 30+ mins, do it early!)
$PROJECT
Terminal
pwd
/p/project/training2336
# Create a shortcut to the project in the home folder
ln -s $PROJECT_training2336 ~/course
# Create a folder for myself
mkdir $HOME/course/$USER
# Enter my folder inside the course folder
cd $HOME/course/$USER
# Where am I?
pwd
# We will need those later
mkdir $HOME/course/$USER/.cache
mkdir $HOME/course/$USER/.config
mkdir $HOME/course/$USER/.fastai
ln -s $HOME/course/$USER/.cache $HOME/
ln -s $HOME/course/$USER/.config $HOME/
ln -s $HOME/course/$USER/.fastai $HOME/
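The shell commands above create three hidden folders in the project space and link them into the home folder, so that caches and configs do not fill up the small home quota. A minimal Python sketch of the same idea follows. The function name `make_course_dirs` and the throwaway directories are hypothetical, chosen only so the example does not touch a real `$HOME`.

```python
import tempfile
from pathlib import Path


def make_course_dirs(course_dir: Path, home: Path) -> list[Path]:
    """Create .cache, .config and .fastai under course_dir and link them into home."""
    links = []
    for name in (".cache", ".config", ".fastai"):
        target = course_dir / name
        target.mkdir(parents=True, exist_ok=True)  # like `mkdir -p`
        link = home / name
        if not link.exists() and not link.is_symlink():
            # like `ln -s <target> <home>/`
            link.symlink_to(target, target_is_directory=True)
        links.append(link)
    return links


# Demonstration in a temporary directory (hypothetical stand-ins for $HOME
# and the project folder), so nothing real is modified.
with tempfile.TemporaryDirectory() as tmp:
    fake_home = Path(tmp) / "home"
    fake_course = Path(tmp) / "project" / "user"
    fake_home.mkdir()
    created = make_course_dirs(fake_course, fake_home)
    link_names = [p.name for p in created]
    all_links = all(p.is_symlink() for p in created)
    print(link_names, all_links)
```

Unlike the plain `ln -s`, this sketch skips a link that already exists, which makes it safe to run twice.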
module spider
strube1$ module spider PyTorch
------------------------------------------------------------------------------------
PyTorch:
------------------------------------------------------------------------------------
Description:
Tensors and Dynamic neural networks in Python with strong GPU acceleration.
PyTorch is a deep learning framework that puts Python first.
Versions:
PyTorch/1.7.0-Python-3.8.5
PyTorch/1.8.1-Python-3.8.5
PyTorch/1.11-CUDA-11.5
PyTorch/1.12.0-CUDA-11.7
Other possible modules matches:
PyTorch-Geometric PyTorch-Lightning
...
module avail
(Inside hierarchy)
Stage (full collection of software of a given year)
Compiler
MPI
Module
E.g.:
module load Stages/2023 GCC OpenMPI PyTorch
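The order in that command follows the hierarchy: stage first, then compiler, then MPI, then the module itself. As a small illustration only, here is a Python sketch that assembles such a line. The helper `module_load_command` is hypothetical and not part of any module system.

```python
def module_load_command(stage: str, compiler: str, mpi: str, *modules: str) -> str:
    """Build a `module load` line in the order Stage, Compiler, MPI, Module."""
    parts = [f"Stages/{stage}", compiler, mpi, *modules]
    return "module load " + " ".join(parts)


command = module_load_command("2023", "GCC", "OpenMPI", "PyTorch")
print(command)  # module load Stages/2023 GCC OpenMPI PyTorch
```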
module spider Software/version
Search for the software itself - it will suggest a version
Search with the version - it will suggest the hierarchy
(Make sure you are still connected to JUWELS Booster)
Oh noes! 🙈
Let’s bring Python together with PyTorch!
Copy and paste these lines
# This command fails, as we have no proper python
python
# So, we load the correct modules...
module load Stages/2023
module load GCC OpenMPI Python PyTorch
# And we run a small test: import pytorch and ask its version
python -c "import torch ; print(torch.__version__)"
Should look like this:
module key
module key toml
The following modules match your search criteria: "toml"
------------------------------------------------------------------------------------
Jupyter: Jupyter/2020.2.5-Python-3.8.5, Jupyter/2021.3.1-Python-3.8.5, Jupyter/2021.3.2-Python-3.8.5, Jupyter/2022.3.3, Jupyter/2022.3.4
Project Jupyter exists to develop open-source software, open-standards, and services for interactive computing across dozens of programming languages.
PyQuil: PyQuil/3.0.1
PyQuil is a library for generating and executing Quil programs on the Rigetti Forest platform.
Python: Python/3.8.5, Python/3.9.6, Python/3.10.4
Python is a programming language that lets you work more quickly and integrate your systems more effectively.
------------------------------------------------------------------------------------
From Jupyter's terminal, navigate to your “course” folder and then to the folder with your user name that you created earlier.
This is our working directory. We do everything here.
pwd
commandmatrix.py
Simple Linux Utility for Resource Management
File juwelsbooster-matrix.sbatch
#!/bin/bash
#SBATCH --account=training2336 # Who pays?
#SBATCH --nodes=1 # How many compute nodes
#SBATCH --job-name=matrix-multiplication
#SBATCH --ntasks-per-node=1 # How many mpi processes/node
#SBATCH --cpus-per-task=1 # How many cpus per mpi proc
#SBATCH --output=output.%j # Where to write results
#SBATCH --error=error.%j
#SBATCH --time=00:01:00 # For how long can it run?
#SBATCH --partition=booster # Machine partition
#SBATCH --reservation=dl4neurosc # For today only
module load Stages/2023
module load GCC OpenMPI PyTorch # Load the correct modules on the compute node(s)
srun python matrix.py # srun tells the supercomputer how to run it
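Every `#SBATCH` line in that file is one option for the scheduler, written as `--key=value`, and anything after a second `#` is just a comment for humans. To make that structure visible, here is a small Python sketch that reads such lines into a dictionary. The function `parse_sbatch_directives` is a hypothetical helper for illustration; Slurm itself does the real parsing.

```python
def parse_sbatch_directives(script: str) -> dict[str, str]:
    """Collect `#SBATCH --key=value` lines into a dict, dropping trailing comments."""
    options = {}
    for line in script.splitlines():
        line = line.strip()
        if not line.startswith("#SBATCH"):
            continue  # shebang, module loads, srun, ...
        # Cut off the "#SBATCH" prefix and any trailing "# comment"
        directive = line[len("#SBATCH"):].split("#", 1)[0].strip()
        key, _, value = directive.lstrip("-").partition("=")
        options[key] = value
    return options


example = """#!/bin/bash
#SBATCH --account=training2336      # Who pays?
#SBATCH --nodes=1                   # How many compute nodes
#SBATCH --time=00:01:00             # For how long can it run?
#SBATCH --partition=booster         # Machine partition
"""
options = parse_sbatch_directives(example)
print(options)
```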
squeue --me
squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
412169 gpus matrix-m strube1 CF 0:02 1 jsfc013
dl4neurosc
# Notice that this number is the job id. It's different for every job
cat output.412169
cat error.412169
Or simply open it in Jupyter!
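The job id from the `squeue` row is what replaces `%j` in the output and error file names. The sketch below splits one such row into fields and derives the two file names. The helper `parse_squeue_line` is hypothetical, and the state table lists only a few of Slurm's short state codes.

```python
# A few of the short codes shown in the ST column
STATE_NAMES = {"PD": "PENDING", "CF": "CONFIGURING", "R": "RUNNING",
               "CG": "COMPLETING", "CD": "COMPLETED", "F": "FAILED"}


def parse_squeue_line(line: str) -> dict[str, str]:
    """Split one data row of `squeue --me` into named fields."""
    columns = ["jobid", "partition", "name", "user",
               "state", "time", "nodes", "nodelist"]
    # The last column may contain spaces, e.g. "(Priority)" style reasons
    fields = dict(zip(columns, line.split(None, len(columns) - 1)))
    fields["state_name"] = STATE_NAMES.get(fields["state"], "UNKNOWN")
    return fields


row = parse_squeue_line("412169 gpus matrix-m strube1 CF 0:02 1 jsfc013")
log_files = (f"output.{row['jobid']}", f"error.{row['jobid']}")
print(row["state_name"], log_files)
```

So `CF` in the row above means the job has been assigned a node and is still being configured.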
pip
Edit the file requirements.txt
Add these lines at the end:
Run on the terminal:
sc_venv_template/setup.sh
source sc_venv_template/activate.sh
python
import fastai
fastai.__version__
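If the import fails, a common cause is that the environment was not activated in the current shell. A quick way to check from inside Python is sketched below, assuming the template builds a standard `venv` (in that case `sys.prefix` differs from `sys.base_prefix`). The function name `in_virtualenv` is hypothetical.

```python
import sys


def in_virtualenv() -> bool:
    """True when the interpreter runs inside a venv (prefix differs from the base install)."""
    return sys.prefix != sys.base_prefix


active = in_virtualenv()
print("virtual environment active:", active)
print("interpreter:", sys.executable)
```

If this prints `False`, source the activation script again and restart `python`.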
source sc_venv_template/activate.sh
The activation script must be sourced, otherwise the virtual environment will not work.
Setting vars
The following modules were not unloaded:
(Use "module --force purge" to unload all):
1) Stages/2023
The following have been reloaded with a version change:
1) HDF5/1.12.2-serial => HDF5/1.12.2
python
Python 3.10.4 (main, Oct 4 2022, 08:48:14) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import fastai
>>> fastai.__version__
'2.7.12'
>>> exit()
from fastai.vision.all import *
from fastai.callback.tensorboard import *

# Download and unpack the Oxford-IIIT Pet dataset (needs internet access)
print("Downloading dataset...")
path = untar_data(URLs.PETS)/'images'
print("Finished downloading dataset")

# In this dataset, file names of cat breeds start with an uppercase letter
def is_cat(x): return x[0].isupper()

# Create the dataloaders and resize the images
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))

print("On the login node, this will download resnet34")
learn = vision_learner(dls, resnet34, metrics=accuracy)

# Train all layers of the model for 3 epochs, logging to TensorBoard in 'runs'
learn.unfreeze()
learn.fit_one_cycle(3, cbs=TensorBoardCallback('runs', trace_model=True, projector=True))
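The labelling rule `is_cat` may look odd at first: it only looks at the first letter of the file name. It works because, in this dataset, files of cat breeds start with an uppercase letter and files of dog breeds with a lowercase one. A tiny sketch, using illustrative file names in the dataset's naming style (the exact names are assumptions):

```python
def is_cat(filename: str) -> bool:
    """Same rule as in the script: an uppercase first letter marks a cat breed."""
    return filename[0].isupper()


# Illustrative file names, not read from disk
samples = ["Abyssinian_1.jpg", "beagle_32.jpg", "Bengal_7.jpg", "pug_10.jpg"]
labels = {name: is_cat(name) for name in samples}
print(labels)
```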
fastai.sbatch
#!/bin/bash
#SBATCH --account=training2336
#SBATCH --mail-user=MYUSER@fz-juelich.de
#SBATCH --mail-type=ALL
#SBATCH --nodes=1
#SBATCH --job-name=matrix-multiplication
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --output=output.%j
#SBATCH --error=error.%j
#SBATCH --time=00:10:00
#SBATCH --partition=booster
#SBATCH --reservation=dl4neurosc # For today only
cd $HOME/course/$USER
source sc_venv_template/activate.sh # Now we finally use the fastai module
srun python cats.py
sbatch fastai.sbatch
error.${JOBID}
File "/p/software/juwelsbooster/stages/2023/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/urllib/request.py", line 1391, in https_open
return self.do_open(http.client.HTTPSConnection, req,
File "/p/software/juwelsbooster/stages/2023/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/urllib/request.py", line 1351, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 111] Connection refused>
srun: error: jsfc013: task 0: Exited with exit code 1
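The traceback ends in `Connection refused`: the script tried to download something and could not reach the outside world from where it ran. A small connectivity test can make this visible before a long job starts. The sketch below is an assumption about how one might check this, and `can_connect` is a hypothetical helper, not part of fastai or Slurm.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, or name lookup failed
        return False


# Port 1 on localhost is normally closed, so this exercises the failure path.
local_result = can_connect("127.0.0.1", 1, timeout=1.0)
print("localhost:1 reachable:", local_result)

# On the cluster one might try, for example:
#   can_connect("download.pytorch.org", 443)
```

If the check fails where the job runs but succeeds on the login node, download the data on the login node first so that it is already in the cache when the job runs.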
$ source sc_venv_template/activate.sh
$ python fastai-demo.py
Downloading dataset...
|████████-------------------------------| 23.50% [190750720/811706944 00:08<00:26]
Downloading: "https://download.pytorch.org/models/resnet34-b627a593.pth" to /p/project/ccstao/cstao05/.cache/torch/hub/checkpoints/resnet34-b627a593.pth
100%|█████████████████████████████████████| 83.3M/83.3M [00:00<00:00, 266MB/s]
(To exit, type CTRL-C)
The activation script must be sourced, otherwise the virtual environment will not work.
Setting vars
Downloading dataset...
Finished downloading dataset
epoch train_loss valid_loss error_rate time
Epoch 1/1 : |-----------------------------------| 0.00% [0/92 00:00<?]
Epoch 1/1 : |-----------------------------------| 2.17% [2/92 00:14<10:35 1.7452]
Epoch 1/1 : |█----------------------------------| 3.26% [3/92 00:14<07:01 1.6413]
Epoch 1/1 : |██---------------------------------| 5.43% [5/92 00:15<04:36 1.6057]
...
....
Epoch 1/1 :
epoch train_loss valid_loss error_rate time
0 0.049855 0.021369 0.007442 00:42
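The last two lines are a small table: column names, then one row per epoch. Since the error rate is one minus the accuracy, the accuracy of this run can be read off directly. A minimal sketch, with `parse_metrics` as a hypothetical helper:

```python
def parse_metrics(header: str, row: str) -> dict[str, str]:
    """Pair the column names of the results table with one row of values."""
    return dict(zip(header.split(), row.split()))


metrics = parse_metrics("epoch train_loss valid_loss error_rate time",
                        "0 0.049855 0.021369 0.007442 00:42")
accuracy = 1.0 - float(metrics["error_rate"])
print(f"accuracy = {accuracy:.6f}")  # accuracy = 0.992558
```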
New Notebook
As of now, I expect you managed to: