Alexandre Strube // Sabrina Benassou // José Ignacio Robledo
December 4th, 2024
Links for the complementary parts of this course:
| Time | Title |
|---|---|
| 10:00 - 10:15 | Welcome |
| 10:15 - 11:00 | Introduction |
| 11:00 - 11:15 | Coffee break |
| 11:15 - 11:30 | Judoor, Keys |
| 11:30 - 12:00 | SSH, Jupyter, VS Code |
| 12:00 - 12:15 | Coffee Break |
| 12:15 - 13:00 | Running services on the login and compute nodes |
| 13:00 - 13:15 | Coffee Break |
| 13:30 - 14:00 | Sync (everyone should be at the same point) |
Please open this document in your own browser! We will need it for the exercises. https://go.fzj.de/bringing-dl-workloads-to-jsc

training2449
code
$ ssh-keygen -a 100 -t ed25519 -f ~/.ssh/id_ed25519-JSC
Generating public/private ed25519 key pair.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /Users/strube1/.ssh/id_ed25519-JSC
Your public key has been saved in /Users/strube1/.ssh/id_ed25519-JSC.pub
The key fingerprint is:
SHA256:EGNNC1NTaN8fHwpfuZRPa50qXHmGcQjxp0JuU0ZA86U strube1@Strube-16
The key's randomart image is:
+--[ED25519 256]--+
| *++oo=o. . |
| . =+o .= o |
| .... o.E..o|
| . +.+o+B.|
| S =o.o+B|
| . o*.B+|
| . . = |
| o . |
| . |
+----[SHA256]-----+
Windows users, from Ubuntu WSL (Change username for your user on Windows)
Host jureca
    HostName jureca.fz-juelich.de
    User [MY_USERNAME]   # Here goes your username, not the word MY_USERNAME.
    AddressFamily inet
    IdentityFile ~/.ssh/id_ed25519-JSC
    MACs hmac-sha2-512-etm@openssh.com
Copy contents to the config file and save it
REPLACE [MY_USERNAME] WITH YOUR USERNAME!!! 🤦♂️
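A quick way to check that no placeholder was left behind in the config: the sketch below scans SSH config text for bracketed, upper-case placeholders such as [MY_USERNAME]. The function name `leftover_placeholders` and the placeholder pattern are assumptions made for this example; they are not part of the course material.

```python
import re

# Assumption: placeholders are written in brackets using upper-case letters,
# underscores or spaces, e.g. [MY_USERNAME].
PLACEHOLDER = re.compile(r"\[[A-Z_ ]+\]")


def leftover_placeholders(config_text):
    """Return every bracketed placeholder still present outside of comments."""
    found = []
    for line in config_text.splitlines():
        # Drop everything after '#', so explanatory comments are ignored.
        setting = line.split("#", 1)[0]
        found.extend(PLACEHOLDER.findall(setting))
    return found


if __name__ == "__main__":
    example = """Host jureca
    HostName jureca.fz-juelich.de
    User [MY_USERNAME] # Here goes your username, not the word MY_USERNAME.
    IdentityFile ~/.ssh/id_ed25519-JSC
"""
    print(leftover_placeholders(example))  # prints ['[MY_USERNAME]']
```

An empty list means the config no longer contains a placeholder of this form.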

code key.txt and paste the number you got
Did everyone get their own IP address?
Example: 93.199.55.163
Replace the last two numbers of 93.199.55.163 with "0.0/16":
93.199.55.163 becomes 93.199.0.0/16 (with YOUR number, not with the example)
Put from="" around it: from="93.199.0.0/16"
Add ,10.0.0.0/8 inside the quotes 🧙‍♀️
It should now look like this: from="93.199.0.0/16,10.0.0.0/8" 🎬
Terminal:
code ~/.ssh/id_ed25519-JSC.pub
Something like this will open:
Paste this line into the same key.txt which you just opened
This might take some minutes
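The from="" clause built by hand in the steps above can also be derived with Python's standard `ipaddress` module. This is only a sketch: the function name `build_from_clause` is made up for this example, and it assumes the same /16 simplification and the same extra 10.0.0.0/8 network used in the example.

```python
import ipaddress


def build_from_clause(my_ip, extra_networks=("10.0.0.0/8",)):
    """Return a from="..." clause covering the /16 network of my_ip."""
    # strict=False accepts a host address and zeroes out the host bits,
    # which is the "replace the last two numbers with 0.0/16" step.
    network = ipaddress.ip_network(f"{my_ip}/16", strict=False)
    networks = [str(network), *extra_networks]
    return 'from="' + ",".join(networks) + '"'


if __name__ == "__main__":
    # Use YOUR address here, not the example one.
    print(build_from_clause("93.199.55.163"))
    # prints from="93.199.0.0/16,10.0.0.0/8"
```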
That’s it! Give it a try (and answer yes)
$ ssh jureca
The authenticity of host 'jrlogin03.fz-juelich.de (134.94.0.185)' cannot be established.
ED25519 key fingerprint is SHA256:ASeu9MJbkFx3kL1FWrysz6+paaznGenChgEkUW8nRQU.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])? Yes
**************************************************************************
* Welcome to Jureca DC *
**************************************************************************
...
...
strube1@jrlogin03~ $
# Create a folder for myself
mkdir $PROJECT_training2449/$USER
# Create a shortcut for the project on the home folder
rm -rf ~/course ; ln -s $PROJECT_training2449/$USER ~/course
# Enter the course folder
cd ~/course
# Where am I?
pwd
# We will need those later
mkdir ~/course/.cache
mkdir ~/course/.config
mkdir ~/course/.fastai
rm -rf $HOME/.cache ; ln -s ~/course/.cache $HOME/
rm -rf $HOME/.config ; ln -s ~/course/.config $HOME/
rm -rf $HOME/.fastai ; ln -s ~/course/.fastai $HOME/
module spider
strube1$ module spider PyTorch
------------------------------------------------------------------------------------
PyTorch:
------------------------------------------------------------------------------------
Description:
Tensors and Dynamic neural networks in Python with strong GPU acceleration.
PyTorch is a deep learning framework that puts Python first.
Versions:
PyTorch/1.7.0-Python-3.8.5
PyTorch/1.8.1-Python-3.8.5
PyTorch/1.11-CUDA-11.5
PyTorch/1.12.0-CUDA-11.7
Other possible modules matches:
PyTorch-Geometric PyTorch-Lightning
...
module avail (Inside hierarchy)
Stage (full collection of software of a given year)
Compiler
MPI
Module
Eg:
module load Stages/2023 GCC OpenMPI PyTorch
module spider Software/version
Search for the software itself - it will suggest a version

Search with the version - it will suggest the hierarchy

(make sure you are still connected to Jureca DC)
$ python
Python 3.9.18 (main, Jan 24 2024, 00:00:00)
[GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'torch'
Oh noes! 🙈
Let’s bring Python together with PyTorch!
Copy and paste these lines
# This command fails, as we have no proper pytorch
python -c "import torch ; print(torch.__version__)"
# So, we load the correct modules...
module load Stages/2024
module load GCC OpenMPI Python PyTorch
# And we run a small test: import pytorch and ask its version
python -c "import torch ; print(torch.__version__)"
Should look like this:
Search by keyword with “module key”:
module key toml
The following modules match your search criteria: "toml"
------------------------------------------------------------------------------------
Jupyter: Jupyter/2020.2.5-Python-3.8.5, Jupyter/2021.3.1-Python-3.8.5,
Jupyter/2021.3.2-Python-3.8.5, Jupyter/2022.3.3, Jupyter/2022.3.4
Project Jupyter exists to develop open-source software, open-standards,
and services for interactive computing across dozens of programming languages.
PyQuil: PyQuil/3.0.1
PyQuil is a library for generating and executing Quil programs on the Rigetti
Forest platform.
Python: Python/3.8.5, Python/3.9.6, Python/3.10.4
Python is a programming language that lets you work more quickly and integrate
your systems more effectively.
------------------------------------------------------------------------------------

From VSCode’s terminal, navigate to your “course” folder and to the folder with the name you created earlier.
This is our working directory. We do everything here.
Create the file “matrix.py” on VSCode on Jureca DC
Paste this into the file:
module load Stages/2024
module load GCC OpenMPI Python PyTorch
python matrix.py

Simple Linux Utility for Resource Management
code jureca-matrix.sbatch
#!/bin/bash
#SBATCH --account=training2449 # Who pays?
#SBATCH --nodes=1 # How many compute nodes
#SBATCH --job-name=matrix-multiplication
#SBATCH --ntasks-per-node=1 # How many mpi processes/node
#SBATCH --cpus-per-task=1 # How many cpus per mpi proc
#SBATCH --output=output.%j # Where to write results
#SBATCH --error=error.%j
#SBATCH --time=00:01:00 # For how long can it run?
#SBATCH --partition=dc-gpu # Machine partition
#SBATCH --reservation=training2449_day1 # For today only
module load Stages/2024
module load GCC OpenMPI PyTorch # Load the correct modules on the compute node(s)
srun python matrix.py # srun tells the supercomputer how to run it
squeue --me
squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
412169 gpus matrix-m strube1 CF 0:02 1 jsfc013
# Notice that this number is the job id. It's different for every job
cat output.412169
cat error.412169
Or simply open it on VSCode!
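Since the job id changes with every submission, the file names change too. As a small sketch, the helper below extracts the job id from the line that `sbatch` prints on submission ("Submitted batch job <id>") and derives the file names that follow from `--output=output.%j` and `--error=error.%j`. The function name `job_files` is an assumption for this example.

```python
import re


def job_files(sbatch_output):
    """Return (output_file, error_file) for the job id found in sbatch's message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError("no job id found in: " + repr(sbatch_output))
    job_id = match.group(1)
    # %j in the #SBATCH --output/--error lines is replaced by the job id.
    return f"output.{job_id}", f"error.{job_id}"


if __name__ == "__main__":
    print(job_files("Submitted batch job 412169"))
    # prints ('output.412169', 'error.412169')
```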
Edit the file sc_venv_template/requirements.txt
Add these lines at the end:
Run on the terminal:
sc_venv_template/setup.sh
from fastai.vision.all import *
from fastai.callback.tensorboard import *
#
print("Downloading dataset...")
path = untar_data(URLs.PETS)/'images'
print("Finished downloading dataset")
#
# In the PETS dataset, file names of cat images start with an uppercase letter
def is_cat(x): return x[0].isupper()
# Create the dataloaders and resize the images
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))
print("On the login node, this will download resnet34")
learn = vision_learner(dls, resnet34, metrics=accuracy)
cbs=[SaveModelCallback(), TensorBoardCallback('runs', trace_model=True)]
# Trains the model for 6 epochs with this dataset
learn.unfreeze()
learn.fit_one_cycle(6, cbs=cbs)
#!/bin/bash
#SBATCH --account=training2449
#SBATCH --mail-user=MYUSER@fz-juelich.de
#SBATCH --mail-type=ALL
#SBATCH --nodes=1
#SBATCH --job-name=cat-classifier
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=128
#SBATCH --output=output.%j
#SBATCH --error=error.%j
#SBATCH --time=00:20:00
#SBATCH --partition=dc-gpu
#SBATCH --reservation=training2449_day1 # For today only
cd $HOME/course/
source sc_venv_template/activate.sh # Now we finally use the fastai module
srun python cats.py
Check the error.${JOBID} file
$ source sc_venv_template/activate.sh
$ python cats.py
Downloading dataset...
|████████-------------------------------| 23.50% [190750720/811706944 00:08<00:26]
Downloading: "https://download.pytorch.org/models/resnet34-b627a593.pth" to /p/project/ccstao/cstao05/.cache/torch/hub/checkpoints/resnet34-b627a593.pth
100%|█████████████████████████████████████| 83.3M/83.3M [00:00<00:00, 266MB/s]
(To exit, type CTRL-C)
The activation script must be sourced, otherwise the virtual environment will not work.
Setting vars
Downloading dataset...
Finished downloading dataset
epoch train_loss valid_loss error_rate time
Epoch 1/1 : |-----------------------------------| 0.00% [0/92 00:00<?]
Epoch 1/1 : |-----------------------------------| 2.17% [2/92 00:14<10:35 1.7452]
Epoch 1/1 : |█----------------------------------| 3.26% [3/92 00:14<07:01 1.6413]
Epoch 1/1 : |██---------------------------------| 5.43% [5/92 00:15<04:36 1.6057]
...
....
Epoch 1/1 :
epoch train_loss valid_loss error_rate time
0 0.049855 0.021369 0.007442 00:42
PORTS next to the terminal
As of now, I expect you managed to:



Inside config.json, add at the "models" section:
REPLACE THE APIKEY WITH YOUR OWN TOKEN!!!!
On your machine, type “code $HOME/.ssh/config” and paste this at the end:
# -- Compute Nodes --
Host *.jureca
    User [ADD YOUR USERNAME HERE]
    StrictHostKeyChecking no
    IdentityFile ~/.ssh/id_ed25519-JSC
    ProxyJump jureca
Example: A service provides web interface on port 9999
On the supercomputer:
srun --time=00:05:00 \
--nodes=1 --ntasks=1 \
--partition=dc-gpu \
--account training2449 \
--cpu_bind=none \
--pty /bin/bash -i
bash-4.4$ hostname # This is running on a compute node of the supercomputer
jwb0002
bash-4.4$ cd $HOME/course/
bash-4.4$ source sc_venv_template/activate.sh
bash-4.4$ tensorboard --logdir=runs --port=9999 serve
On your machine:
Mind the letter “i” I added at the end of the hostname
Now you can access the service on your local browser at http://localhost:3334
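To make the pieces of the port forwarding explicit, here is a hedged sketch that assembles an `ssh -L` command from the values in this example: the compute node name printed by `hostname`, the service port 9999, and the local port 3334 from the URL. The function name `tunnel_command` is hypothetical, and writing the target as `<hostname>i.jureca` is an assumption based on the "i" remark and the `Host *.jureca` entry; the command used in the course may differ.

```python
import shlex


def tunnel_command(compute_node, remote_port=9999, local_port=3334):
    """Build an ssh command that forwards local_port to remote_port on the node."""
    # Assumption: the node is addressed as "<hostname>i.jureca", so that it
    # matches the "Host *.jureca" entry and is reached via ProxyJump.
    target = f"{compute_node}i.jureca"
    # -L local_port:localhost:remote_port is standard OpenSSH local forwarding.
    forward = f"{local_port}:localhost:{remote_port}"
    return ["ssh", "-L", forward, target]


if __name__ == "__main__":
    # Use the hostname YOUR job printed, not the example one.
    print(shlex.join(tunnel_command("jwb0002")))
    # prints ssh -L 3334:localhost:9999 jwb0002i.jureca
```

Returning a list rather than a string keeps the arguments separate, so the command can be passed to `subprocess.run` without shell quoting issues.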