$ ssh-keygen -a 100 -t ed25519 -f ~/.ssh/id_ed25519-JSC
Generating public/private ed25519 key pair.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /Users/strube1/.ssh/id_ed25519-JSC
Your public key has been saved in /Users/strube1/.ssh/id_ed25519-JSC.pub
The key fingerprint is:
SHA256:EGNNC1NTaN8fHwpfuZRPa50qXHmGcQjxp0JuU0ZA86U strube1@Strube-16
The key's randomart image is:
+--[ED25519 256]--+
|*++oo=o. . |
|. =+o .= o |
|.... o.E..o|
|. +.+o+B.|
|S =o.o+B|
|. o*.B+|
|. . = |
|o . |
|.|
+----[SHA256]-----+
SSH
Configure SSH session
code $HOME/.ssh/config
Windows users, from Ubuntu WSL (replace username with your Windows user name)
Host jureca
HostName jureca.fz-juelich.de
# Here goes your username, not the word MY_USERNAME.
User [MY_USERNAME]
AddressFamily inet
IdentityFile ~/.ssh/id_ed25519-JSC
MACs hmac-sha2-512-etm@openssh.com
$ ssh jureca
The authenticity of host 'jrlogin03.fz-juelich.de (134.94.0.185)' cannot be established.
ED25519 key fingerprint is SHA256:ASeu9MJbkFx3kL1FWrysz6+paaznGenChgEkUW8nRQU.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])? Yes
***************************************************************************
 Welcome to Jureca DC
***************************************************************************
...
...
strube1@jrlogin03~ $
SSH: Exercise
Make sure you
are connected to the supercomputer
# Create a folder for myself
mkdir $PROJECT_training2425/$USER

# Create a shortcut for the project on the home folder
rm -rf ~/course ; ln -s $PROJECT_training2425/$USER ~/course

# Enter course folder
cd ~/course

# Where am I?
pwd

# We will need those later
mkdir ~/course/.cache
mkdir ~/course/.config
mkdir ~/course/.fastai

rm -rf $HOME/.cache ; ln -s ~/course/.cache $HOME/
rm -rf $HOME/.config ; ln -s ~/course/.config $HOME/
rm -rf $HOME/.fastai ; ln -s ~/course/.fastai $HOME/
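To double-check that the three hidden folders in the home directory really point into the course folder, a minimal Python sketch can help. It assumes the layout created by the commands above; `check_links` is a hypothetical helper name, not part of any course tooling.

```python
import os

def check_links(home, names=(".cache", ".config", ".fastai")):
    """Return {name: target} for each entry in `home` that is a symlink.

    Entries that are missing or are regular folders map to None.
    """
    result = {}
    for name in names:
        path = os.path.join(home, name)
        # os.path.islink is True only for symbolic links
        result[name] = os.path.realpath(path) if os.path.islink(path) else None
    return result

if __name__ == "__main__":
    for name, target in check_links(os.path.expanduser("~")).items():
        print(name, "->", target if target else "not a symlink")
```

Each of the three names should print a path inside the project folder; "not a symlink" means the corresponding `ln -s` step has not been run yet.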
Working with the supercomputer’s software
We have literally thousands of software packages,
hand-compiled for the specifics of the supercomputer.
strube1$ module spider PyTorch
------------------------------------------------------------------------------------
PyTorch:
------------------------------------------------------------------------------------
Description:
Tensors and Dynamic neural networks in Python with strong GPU acceleration. PyTorch is a deep learning framework that puts Python first.

Versions:
PyTorch/1.7.0-Python-3.8.5
PyTorch/1.8.1-Python-3.8.5
PyTorch/1.11-CUDA-11.5
PyTorch/1.12.0-CUDA-11.7
Other possible modules matches:
PyTorch-Geometric PyTorch-Lightning
...
What do we have?
module avail (Inside hierarchy)
Module hierarchy
Stage (full collection of software of a given
year)
Compiler
MPI
Module
E.g.:
module load Stages/2023 GCC OpenMPI PyTorch
What do I need to load
such software?
module spider Software/version
Example: PyTorch
Search for the software itself - it will suggest a version
Example: PyTorch
Search with the version - it will suggest the hierarchy
Example: PyTorch
(make sure you are still connected to Jureca DC)
$ python
-bash: python: command not found
Oh noes! 🙈
Let’s bring Python together with PyTorch!
Example: PyTorch
Copy and paste these lines
# This command fails, as we have no proper python
python
# So, we load the correct modules...
module load Stages/2024
module load GCC OpenMPI Python PyTorch
# And we run a small test: import pytorch and ask its version
python -c "import torch ; print(torch.__version__)"
Some Python packages are bundled with the Python module itself, or with
other modules. Use “module key” to find out which module provides them.
module key toml

The following modules match your search criteria: "toml"
------------------------------------------------------------------------------------

Jupyter: Jupyter/2020.2.5-Python-3.8.5, Jupyter/2021.3.1-Python-3.8.5, Jupyter/2021.3.2-Python-3.8.5, Jupyter/2022.3.3, Jupyter/2022.3.4
Project Jupyter exists to develop open-source software, open-standards, and services for interactive computing across dozens of programming languages.

PyQuil: PyQuil/3.0.1
PyQuil is a library for generating and executing Quil programs on the Rigetti Forest platform.

Python: Python/3.8.5, Python/3.9.6, Python/3.10.4
Python is a programming language that lets you work more quickly and integrate your systems more effectively.
------------------------------------------------------------------------------------
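Before searching with `module key`, it can help to ask Python itself whether a package is already importable in the current environment. The sketch below uses the standard library's `importlib.util.find_spec`; the helper name `is_importable` and the package names in the usage lines are only illustrative.

```python
import importlib.util

def is_importable(package):
    """Return True if the top-level `package` can be found by Python."""
    # find_spec returns None when a top-level package cannot be located
    return importlib.util.find_spec(package) is not None

if __name__ == "__main__":
    for name in ("torch", "toml"):
        if is_importable(name):
            print(name, "-> available")
        else:
            print(name, "-> missing, try: module key", name)
```

Run it once before and once after `module load` to see which packages a module brings along.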
VSCode
Editing files on the
supercomputers
VSCode
You can have a terminal inside VSCode:
Go to the menu View->Terminal
VSCode
From the VSCode terminal, navigate to the
“course” folder you created earlier.
cd $HOME/course/
pwd
This is our working directory. We do everything
here.
Demo code
Create a new
file “matrix.py” on VSCode on Jureca DC
code matrix.py
Paste this into the file:
import torch

matrix1 = torch.randn(3,3)
print("The first matrix is", matrix1)

matrix2 = torch.randn(3,3)
print("The second matrix is", matrix2)

result = torch.matmul(matrix1,matrix2)
print("The result is:\n", result)
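For intuition, the product that `torch.matmul` computes for two 2-D matrices can be written out in plain Python. This is only a sketch with nested lists; `matmul_lists` is a hypothetical helper and neither uses nor replaces PyTorch.

```python
def matmul_lists(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("columns of a must match rows of b")
    cols = len(b[0])
    # Entry (i, j) is the dot product of row i of a with column j of b
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(len(a))
    ]

if __name__ == "__main__":
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    m = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    # Multiplying by the identity matrix gives back m
    print(matmul_lists(identity, m))
```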
Simple text file which describes what we want and
how much of it, for how long, and what to do with the results
Slurm submission file
example
code jureca-matrix.sbatch
#!/bin/bash
#SBATCH --account=training2425           # Who pays?
#SBATCH --nodes=1                        # How many compute nodes
#SBATCH --job-name=matrix-multiplication
#SBATCH --ntasks-per-node=1              # How many mpi processes/node
#SBATCH --cpus-per-task=1                # How many cpus per mpi proc
#SBATCH --output=output.%j               # Where to write results
#SBATCH --error=error.%j
#SBATCH --time=00:01:00                  # For how long can it run?
#SBATCH --partition=dc-gpu               # Machine partition
#SBATCH --reservation=training2425       # For today only

module load Stages/2024
module load GCC OpenMPI PyTorch  # Load the correct modules on the compute node(s)

srun python matrix.py            # srun tells the supercomputer how to run it
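The `#SBATCH` lines look like comments to bash, but `sbatch` reads them as options. To see at a glance what a submission file requests, a small Python sketch can collect them. `parse_sbatch_directives` is a hypothetical helper for illustration only and is not how Slurm itself parses the file.

```python
def parse_sbatch_directives(script_text):
    """Collect `#SBATCH --key=value` lines into a dictionary."""
    options = {}
    for line in script_text.splitlines():
        line = line.strip()
        if not line.startswith("#SBATCH"):
            continue
        # Drop the "#SBATCH" prefix and any trailing "# comment"
        directive = line[len("#SBATCH"):].split("#", 1)[0].strip()
        if not directive.startswith("--"):
            continue
        body = directive[2:]
        if "=" in body:
            key, value = body.split("=", 1)
        else:
            # Also accept the "--key value" spelling
            key, _, value = body.partition(" ")
        options[key.strip()] = value.strip()
    return options

if __name__ == "__main__":
    example = """#!/bin/bash
#SBATCH --nodes=1              # How many compute nodes
#SBATCH --time=00:01:00        # For how long can it run?
#SBATCH --partition=dc-gpu     # Machine partition
srun python matrix.py
"""
    print(parse_sbatch_directives(example))
```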
Even though we have PyTorch, we don’t have PyTorch
Lightning Flash
Same for fast.ai and wandb
We will install them in a virtual environment
Example: Let’s install
some software!
Edit the file
sc_venv_template/requirements.txt
Add these lines at the end:
fastai
wandb
accelerate
deepspeed
Run on the terminal:
sc_venv_template/setup.sh
Example: Activating
the virtual environment
source sc_venv_template/activate.sh
Example:
Activating the virtual environment
source ./activate.sh
The activation script must be sourced, otherwise the virtual environment will not work.
Setting vars
The following modules were not unloaded:
  (Use "module --force purge" to unload all):

  1) Stages/2024
This is a minimal demo, to show some quirks of the
supercomputer
code cats.py
from fastai.vision.all import *
from fastai.callback.tensorboard import *
#
print("Downloading dataset...")
path = untar_data(URLs.PETS)/'images'
print("Finished downloading dataset")
#
def is_cat(x): return x[0].isupper()
# Create the dataloaders and resize the images
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))
print("On the login node, this will download resnet34")
learn = vision_learner(dls, resnet34, metrics=accuracy)
cbs=[SaveModelCallback(), TensorBoardCallback('runs', trace_model=True)]
# Trains the model for 6 epochs with this dataset
learn.unfreeze()
learn.fit_one_cycle(6, cbs=cbs)
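The label function relies on a naming convention of the Oxford-IIIT Pet dataset: file names of cat breeds start with an uppercase letter, those of dog breeds with a lowercase one. The lines below try `is_cat` on a few file names; the names are made up for illustration and nothing is downloaded.

```python
def is_cat(x):
    # Same rule as in cats.py: an uppercase first letter marks a cat breed
    return x[0].isupper()

if __name__ == "__main__":
    for filename in ("Abyssinian_1.jpg", "beagle_32.jpg", "Siamese_7.jpg"):
        label = "cat" if is_cat(filename) else "dog"
        print(filename, "->", label)
```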
Submission file for the
classifier
code fastai.sbatch
#!/bin/bash
#SBATCH --account=training2425
#SBATCH --mail-user=MYUSER@fz-juelich.de
#SBATCH --mail-type=ALL
#SBATCH --nodes=1
#SBATCH --job-name=cat-classifier
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=128
#SBATCH --output=output.%j
#SBATCH --error=error.%j
#SBATCH --time=00:20:00
#SBATCH --partition=dc-gpu
#SBATCH --reservation=training2425 # For today only

cd $HOME/course/
source sc_venv_template/activate.sh # Now we finally use the fastai module

srun python cats.py
Submit it
sbatch fastai.sbatch
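On success, `sbatch` prints a line of the form `Submitted batch job <id>`, and Slurm replaces `%j` in `--output` and `--error` with that job ID. A minimal Python sketch of that mapping, assuming the patterns from the submission file (`log_files_for` is a hypothetical helper):

```python
import re

def log_files_for(sbatch_stdout, patterns=("output.%j", "error.%j")):
    """Derive the log file names from the text printed by sbatch."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    if match is None:
        raise ValueError("no job ID found in: " + repr(sbatch_stdout))
    job_id = match.group(1)
    # Only %j is handled here; Slurm supports further placeholders
    return [pattern.replace("%j", job_id) for pattern in patterns]

if __name__ == "__main__":
    print(log_files_for("Submitted batch job 7948496"))
```

These are the files to inspect with `cat` while the job is running.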
Submission time
Check error and output logs, check queue
Probably not much happening…
$ cat output.7948496
The activation script must be sourced, otherwise the virtual environment will not work.
Setting vars
Downloading dataset...
$ cat error.7948496
The following modules were not unloaded:
  (Use "module --force purge" to unload all):

  1) Stages/2024
💥
What happened?
It might be that it’s not enough time for the job
to give up
Check the error.${JOBID} file
If you run it longer, you will get the actual
error:
Click on the “Continue.dev” extension on the left
side of VSCode.
Select some code from our exercises and
send it to Continue with cmd-shift-L (or ctrl-shift-L)
Ask it to add unit tests, for example.
Backup slides
There’s more!
Remember the magic? 🧙‍♂️
Let’s use it now to access the compute nodes
directly!
Proxy Jump
Accessing compute nodes
directly
If we need to access some ports on the compute
nodes
Proxy Jump - SSH Configuration
Type on your machine “code $HOME/.ssh/config” and paste
this at the end:
# -- Compute Nodes --
Host *.jureca
User [ADD YOUR USERNAME HERE]
StrictHostKeyChecking no
IdentityFile ~/.ssh/id_ed25519-JSC
ProxyJump jureca
Proxy Jump: Connecting to a node
Example: A service provides a web interface on port 9999
On the supercomputer:
srun --time=00:05:00 \
     --nodes=1 --ntasks=1 \
     --partition=dc-gpu \
     --account training2425 \
     --cpu_bind=none \
     --pty /bin/bash -i

bash-4.4$ hostname # This is running on a compute node of the supercomputer
jwb0002

bash-4.4$ cd $HOME/course/

bash-4.4$ source sc_venv_template/activate.sh

bash-4.4$ tensorboard --logdir=runs --port=9999 serve
Proxy Jump
On your machine:
ssh -L :3334:localhost:9999 jrc002i.jureca
Mind the i letter I added at the
end of the hostname
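Once the tunnel is up, the service should answer on local port 3334, for example at http://localhost:3334 in a browser. A quick Python sketch to test whether anything is listening there; `port_is_open` is a hypothetical helper, and the port number is the one chosen in the `ssh` command above.

```python
import socket

def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts
        return False

if __name__ == "__main__":
    print("tunnel reachable:", port_is_open("localhost", 3334))
```

If this prints False, check that the `ssh` session is still open and that the service on the compute node is still running.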