Alexandre Strube // Sabrina Benassou
June 24th, 2025
Links for the complementary parts of this course:
Time | Title |
---|---|
13:00 - 13:15 | Welcome |
13:15 - 14:00 | Introduction |
14:00 - 14:15 | Coffee Break |
14:15 - 14:30 | Judoor, Keys |
14:30 - 15:00 | SSH, Jupyter, VS Code |
15:00 - 15:15 | Coffee Break |
15:15 - 16:00 | Running services on the login and compute nodes |
16:00 - 16:15 | Coffee Break |
16:30 - 17:00 | Sync (everyone should be at the same point) |
Please open this document in your own browser! We will need it for the exercises. https://go.fzj.de/bringing-dl-workloads-to-jsc
training2529
code
$ ssh-keygen -a 100 -t ed25519 -f ~/.ssh/id_ed25519-JSC
Generating public/private ed25519 key pair.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /Users/strube1/.ssh/id_ed25519-JSC
Your public key has been saved in /Users/strube1/.ssh/id_ed25519-JSC.pub
The key fingerprint is:
SHA256:EGNNC1NTaN8fHwpfuZRPa50qXHmGcQjxp0JuU0ZA86U strube1@Strube-16
The key's randomart image is:
+--[ED25519 256]--+
| *++oo=o. . |
| . =+o .= o |
| .... o.E..o|
| . +.+o+B.|
| S =o.o+B|
| . o*.B+|
| . . = |
| o . |
| . |
+----[SHA256]-----+
Windows users, from Ubuntu WSL (change the username to your Windows user name)
Host jureca
        HostName jureca.fz-juelich.de
        User [MY_USERNAME]   # Here goes your username, not the word MY_USERNAME.
        AddressFamily inet
        IdentityFile ~/.ssh/id_ed25519-JSC
        MACs hmac-sha2-512-etm@openssh.com
Copy contents to the config file and save it
REPLACE [MY_USERNAME] WITH YOUR USERNAME!!! 🤦‍♂️
- Run `code key.txt` and paste the number you got
- Did everyone get their own IP address?
- Example: `93.199.55.163`
- Replace the last two numbers of `93.199.55.163` with `0.0/16`: `93.199.55.163` becomes `93.199.0.0/16` (with YOUR number, not with the example)
- Put `from=""` around it: `from="93.199.0.0/16"`
- Append `,10.0.0.0/8` 🧙‍♀️: `from="93.199.0.0/16,10.0.0.0/8"`
- 🎬 `93.199.0.0/16`
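The address rewriting described above can be sketched in a few lines of Python. This is only an illustration using the standard `ipaddress` module; the helper name `from_clause` is hypothetical and not part of the course material, and the `/16` width and the extra `10.0.0.0/8` network are taken from the example above.

```python
import ipaddress


def from_clause(public_ip, extra_networks=("10.0.0.0/8",)):
    """Build the from="..." restriction from one IPv4 address (hypothetical helper)."""
    # strict=False accepts a host address and zeroes the host bits,
    # which is the "replace the last two numbers with 0.0/16" step.
    network = ipaddress.ip_network(f"{public_ip}/16", strict=False)
    allowed = [str(network), *extra_networks]
    return 'from="{}"'.format(",".join(allowed))


# The example address from the slides; use YOUR address instead.
print(from_clause("93.199.55.163"))
```

The printed text is the prefix that goes in front of the public key line.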
Terminal:
code ~/.ssh/id_ed25519-JSC.pub
Something like this will open:
```bash
ssh-ed25519 AAAAC3NzaC1lZDE1NTA4AAAAIHaoOJF3gqXd7CV6wncoob0DL2OJNfvjgnHLKEniHV6F strube@demonstration.fz-juelich.de
```
- Paste this line into the same `key.txt` file which you just opened
---
### SSH
#### Example: `93.199.0.0/16`
- Put them together and copy again:
- ```bash
  from="93.199.0.0/16,10.0.0.0/8" ssh-ed25519 AAAAC3NzaC1lZDE1NTA4AAAAIHaoOJF3gqXd7CV6wncoob0DL2OJNfvjgnHLKEniHV6F strube@demonstration.fz-juelich.de
  ```
This might take some minutes
That’s it! Give it a try (and answer yes)
$ ssh jureca
The authenticity of host 'jrlogin03.fz-juelich.de (134.94.0.185)' cannot be established.
ED25519 key fingerprint is SHA256:ASeu9MJbkFx3kL1FWrysz6+paaznGenChgEkUW8nRQU.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])? Yes
**************************************************************************
* Welcome to Jureca DC *
**************************************************************************
...
...
strube1@jrlogin03~ $
# Create a folder for myself
mkdir $PROJECT_training2529/$USER
# Create a shortcut for the project on the home folder
rm -rf ~/course ; ln -s $PROJECT_training2529/$USER ~/course
# Enter the course folder
cd ~/course
# Where am I?
pwd
# We will need those later
mkdir ~/course/.cache
mkdir ~/course/.config
mkdir ~/course/.fastai
rm -rf $HOME/.cache ; ln -s ~/course/.cache $HOME/
rm -rf $HOME/.config ; ln -s ~/course/.config $HOME/
rm -rf $HOME/.fastai ; ln -s ~/course/.fastai $HOME/
module spider
strube1$ module spider PyTorch
------------------------------------------------------------------------------------
PyTorch:
------------------------------------------------------------------------------------
Description:
Tensors and Dynamic neural networks in Python with strong GPU acceleration.
PyTorch is a deep learning framework that puts Python first.
Versions:
PyTorch/1.7.0-Python-3.8.5
PyTorch/1.8.1-Python-3.8.5
PyTorch/1.11-CUDA-11.5
PyTorch/1.12.0-CUDA-11.7
Other possible modules matches:
PyTorch-Geometric PyTorch-Lightning
...
- `module avail` (inside the hierarchy)
  - Stage (full collection of software of a given year)
  - Compiler
  - MPI
  - Module
- E.g.: `module load Stages/2023 GCC OpenMPI PyTorch`
- `module spider Software/version`
  - Search for the software itself: it will suggest a version
  - Search with the version: it will suggest the hierarchy
(make sure you are still connected to Jureca DC)
$ python
Python 3.12.3 (main, Apr 15 2024, 18:07:06) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'torch'
Oh noes! 🙈
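The traceback above only says that this Python cannot find PyTorch. A minimal sketch, using just the standard library, of checking for a package without triggering a traceback; `is_importable` is a hypothetical helper name, not something provided by the system:

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module can be found by this Python (hypothetical helper)."""
    # find_spec returns None when the top-level module is not installed.
    return importlib.util.find_spec(module_name) is not None


# The result depends on which modules are loaded in the current shell.
print("torch available:", is_importable("torch"))
```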
Let’s bring Python together with PyTorch!
Copy and paste these lines
# This command fails, as we have no proper pytorch
python -c "import torch ; print(torch.__version__)"
# So, we load the correct modules...
module load Stages/2025
module load GCC OpenMPI Python PyTorch
# And we run a small test: import pytorch and ask its version
python -c "import torch ; print(torch.__version__)"
Should look like this:
module key
$ module key toml
The following modules match your search criteria: "toml"
------------------------------------------------------------------------------------
Jupyter: Jupyter/2020.2.5-Python-3.8.5, Jupyter/2021.3.1-Python-3.8.5,
Jupyter/2021.3.2-Python-3.8.5, Jupyter/2022.3.3, Jupyter/2022.3.4
Project Jupyter exists to develop open-source software, open-standards,
and services for interactive computing across dozens of programming languages.
PyQuil: PyQuil/3.0.1
PyQuil is a library for generating and executing Quil programs on the Rigetti
Forest platform.
Python: Python/3.8.5, Python/3.9.6, Python/3.10.4
Python is a programming language that lets you work more quickly and integrate
your systems more effectively.
------------------------------------------------------------------------------------
From VSCode's terminal, navigate to your "course" folder (the link to the folder you created earlier).
```bash
cd $HOME/course/
pwd
```
- This is our working directory. We do everything here.
---
### Demo code
#### Create a new file "`matrix.py`" on VSCode on Jureca DC
```bash
code matrix.py
```
Paste this into the file:
module load Stages/2025
module load GCC OpenMPI Python PyTorch
python matrix.py
Simple Linux Utility for Resource Management
code jureca-matrix.sbatch
#!/bin/bash
#SBATCH --account=training2529 # Who pays?
#SBATCH --nodes=1 # How many compute nodes
#SBATCH --job-name=matrix-multiplication
#SBATCH --ntasks-per-node=1 # How many mpi processes/node
#SBATCH --cpus-per-task=1 # How many cpus per mpi proc
#SBATCH --output=output.%j # Where to write results
#SBATCH --error=error.%j
#SBATCH --time=00:01:00 # For how long can it run?
#SBATCH --partition=dc-gpu # Machine partition
#SBATCH --reservation=training2529 # For today only
module load Stages/2025
module load GCC OpenMPI PyTorch # Load the correct modules on the compute node(s)
srun python matrix.py # srun tells the supercomputer how to run it
squeue --me
squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
412169 gpus matrix-m strube1 CF 0:02 1 jsfc013
training2529
# Notice that this number is the job id. It's different for every job
cat output.412169
cat error.412169
Or simply open it on VSCode!
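The file names come from the `--output=output.%j` and `--error=error.%j` lines of the batch script: Slurm replaces `%j` with the job id. A small sketch of that substitution, assuming only the `%j` placeholder matters here (Slurm knows more patterns); `slurm_filename` is a hypothetical helper for illustration:

```python
def slurm_filename(pattern, job_id):
    """Expand the %j placeholder of an sbatch file name pattern (hypothetical helper)."""
    return pattern.replace("%j", str(job_id))


job_id = 412169  # the example job id from the squeue output; yours will differ
for pattern in ("output.%j", "error.%j"):
    print(slurm_filename(pattern, job_id))
```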
pip
….Edit the file sc_venv_template/requirements.txt
Add these lines at the end:
# Add here the pip packages you would like to install on this virtual environment / kernel
pip
ipykernel
fastai
numba==0.60.0
numpy==1.26.4
scipy==1.13.1
matplotlib==3.9.2
scikit-learn==1.5.2
pandas==2.2.2
accelerate==1.1.1
pyarrow==18.1.0
transformers==4.46.3
sentencepiece==0.2.0
datasets==3.6.0
fsspec==2025.2.0.*
torch==2.5.1
torchrun_jsc>=0.0.15
Run on the terminal:
sc_venv_template/setup.sh
---
### Example: Activating the virtual environment
```bash
source sc_venv_template/activate.sh
```
The activation script must be sourced, otherwise the virtual environment will not work.
Setting vars
The following modules were not unloaded:
(Use "module --force purge" to unload all):
1) Stages/2025
```python
from fastai.vision.all import *
from fastai.callback.tensorboard import *
#
print("Downloading dataset...")
path = untar_data(URLs.PETS)/'images'
print("Finished downloading dataset")
#
def is_cat(x): return x[0].isupper()
# Create the dataloaders and resize the images
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))
print("On the login node, this will download resnet34")
learn = vision_learner(dls, resnet34, metrics=accuracy)
cbs=[SaveModelCallback(), TensorBoardCallback('runs', trace_model=True)]
# Trains the model for 6 epochs with this dataset
learn.unfreeze()
learn.fit_one_cycle(6, cbs=cbs)
```
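The `is_cat` function in the script labels each image from its file name alone: in this pets dataset, cat breeds start with an uppercase letter and dog breeds with a lowercase one. A stand-alone sketch of that rule, with made-up file names in the dataset's naming style (the names are illustrative assumptions, not taken from the course):

```python
def is_cat(x):
    """Same rule as in the training script: uppercase first letter means cat."""
    return x[0].isupper()


# Hypothetical file names, written in the style used by the dataset.
examples = ["Bengal_101.jpg", "beagle_32.jpg", "Persian_7.jpg", "pug_55.jpg"]
for name in examples:
    label = "cat" if is_cat(name) else "dog"
    print(f"{name} -> {label}")
```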
#!/bin/bash
#SBATCH --account=training2529
#SBATCH --mail-user=MYUSER@fz-juelich.de
#SBATCH --mail-type=ALL
#SBATCH --nodes=1
#SBATCH --job-name=cat-classifier
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=128
#SBATCH --output=output.%j
#SBATCH --error=error.%j
#SBATCH --time=00:20:00
#SBATCH --partition=dc-gpu
#SBATCH --reservation=training2529 # For today only
cd $HOME/course/
source sc_venv_template/activate.sh # Now we finally use the fastai module
srun python cats.py
- ```bash
  $ cat error.7948496
  The following modules were not unloaded:
    (Use "module --force purge" to unload all):
    1) Stages/2025
  ```
- This is the `error.${JOBID}` file
---
## 🤔...
---
### What is it doing?
- This downloads the dataset:
- ```python
  path = untar_data(URLs.PETS)/'images'
  ```
---
## Remember, remember

---
## Compute nodes have no internet connection
- But the login nodes do!
- So we download our dataset before...
- On the login nodes!
---
## On the login node:
- Comment out the line which does AI training:
- ```python
  # learn.fit_one_cycle(6, cbs=cbs)
  ```
---
## Run the downloader on the login node
```bash
$ source sc_venv_template/activate.sh
$ python cats.py
Downloading dataset...
|████████-------------------------------| 23.50% [190750720/811706944 00:08<00:26]
Downloading: "https://download.pytorch.org/models/resnet34-b627a593.pth" to /p/project/ccstao/cstao05/.cache/torch/hub/checkpoints/resnet34-b627a593.pth
100%|█████████████████████████████████████| 83.3M/83.3M [00:00<00:00, 266MB/s]
```
- Submit the job!
- ```bash
  sbatch fastai.sbatch
  ```
(To exit, type CTRL-C)
- 🎉
- 🥳
---
### Tools for results analysis
- We already ran the code and have results
- To analyze them, there's a neat tool called TensorBoard
- And we already have the code for it in our example!
- ```python
  cbs=[SaveModelCallback(), TensorBoardCallback('runs', trace_model=True)]
  ```
- Opens a connection on port 9999... *OF THE SUPERCOMPUTER*.
- This port is behind the firewall. You can't access it directly...
- We need to bypass the firewall 🏴‍☠️
- SSH PORT FORWARDING
---
## Example: Tensorboard

---
## Port Forwarding

---
## Port forwarding demo:
- On VSCode's terminal:
- ```bash
  cd $HOME/course/
  source sc_venv_template/activate.sh
  tensorboard --logdir=runs --port=12345 serve
  ```
- See the `PORTS` tab next to the terminal

As of now, I expect you managed to:
Inside `config.json`, add to the `"models"` section:
```json
{
  "model": "AUTODETECT",
  "title": "Blablador",
  "apiKey": "ADD_BLABLADOR_TOKEN_HERE",
  "apiBase": "https://api.helmholtz-blablador.fz-juelich.de/v1",
  "provider": "openai"
},
```
- REPLACE THE APIKEY WITH YOUR OWN TOKEN!!!!
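If you prefer not to edit the file by hand, the same change can be sketched with Python's `json` module. This assumes that `config.json` is plain JSON with a top-level `"models"` list, as the instruction above suggests; `add_model` is a hypothetical helper, and the token stays a placeholder that you must replace.

```python
import json


def add_model(config_text, model_entry):
    """Return config JSON text with model_entry added to "models" (hypothetical helper)."""
    config = json.loads(config_text)
    models = config.setdefault("models", [])
    # Skip the entry if a model with the same title is already configured.
    if not any(m.get("title") == model_entry["title"] for m in models):
        models.append(model_entry)
    return json.dumps(config, indent=2)


blablador = {
    "model": "AUTODETECT",
    "title": "Blablador",
    "apiKey": "ADD_BLABLADOR_TOKEN_HERE",  # placeholder: use your own token
    "apiBase": "https://api.helmholtz-blablador.fz-juelich.de/v1",
    "provider": "openai",
}

# A stand-in for the real file contents, used only for this demonstration.
print(add_model('{"models": []}', blablador))
```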
---
### Blablador on VSCode
- Click on the "Continue.dev" extension on the left side of VSCode.
- Select some code from our exercises and send it to Continue with cmd-shift-L (or ctrl-shift-L)
- Ask it to add unit tests, for example.
---
## Backup slides
---
## There's more!
- Remember the magic? 🧙‍♂️
- Let's use it now to access the compute nodes directly!
---
## Proxy Jump
#### Accessing compute nodes directly
- If we need to access some ports on the compute nodes
---
## Proxy Jump - SSH Configuration
Type on your machine "`code $HOME/.ssh/config`" and paste this at the end:
```ssh
# -- Compute Nodes --
Host *.jureca
        User [ADD YOUR USERNAME HERE]
        StrictHostKeyChecking no
        IdentityFile ~/.ssh/id_ed25519-JSC
        ProxyJump jureca
```
Example: A service provides a web interface on port 9999
On the supercomputer:
srun --time=00:05:00 \
--nodes=1 --ntasks=1 \
--partition=dc-gpu \
--account training2529 \
--cpu_bind=none \
--pty /bin/bash -i
bash-4.4$ hostname # This is running on a compute node of the supercomputer
jwb0002
bash-4.4$ cd $HOME/course/
bash-4.4$ source sc_venv_template/activate.sh
bash-4.4$ tensorboard --logdir=runs --port=9999 serve
On your machine:
Mind the `i` letter I added at the end of the hostname
Now you can access the service on your local browser at http://localhost:3334
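A hedged sketch of what the forwarding command on your machine could look like, built in Python so that the pieces are visible. It assumes the compute node name printed by `hostname` (here `jwb0002`), the extra `i` and the `.jureca` suffix matching the `Host *.jureca` entry, local port 3334 and remote port 9999 from this example; `tunnel_command` is a hypothetical helper, and only the standard `ssh -N -L` options are used.

```python
import shlex


def tunnel_command(compute_host, local_port=3334, remote_port=9999):
    """Build an ssh port-forwarding command as an argument list (hypothetical helper)."""
    # -N: do not run a remote command, only forward ports.
    # -L local:localhost:remote: forward local_port to remote_port on the node.
    forward = f"{local_port}:localhost:{remote_port}"
    # Mind the "i" appended to the node name before the .jureca suffix.
    return ["ssh", "-N", "-L", forward, f"{compute_host}i.jureca"]


# Use the node name that `hostname` printed for YOUR job.
print(shlex.join(tunnel_command("jwb0002")))
```

Run the printed command in a local terminal and keep it open while you use the browser.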