Alexandre Strube // Sabrina Benassou
June 25, 2024
Links for the complementary parts of this course:
Time | Title |
---|---|
10:00 - 10:15 | Welcome |
10:15 - 11:00 | Introduction |
11:00 - 11:15 | Coffee Break |
11:15 - 11:30 | Judoor, Keys |
11:30 - 12:00 | SSH, Jupyter, VS Code |
12:00 - 12:15 | Coffee Break |
12:15 - 13:00 | Running services on the login and compute nodes |
13:00 - 13:15 | Coffee Break |
13:30 - 14:00 | Sync (everyone should be at the same point) |
Please open this document in your own browser! We will need it for the exercises. https://go.fzj.de/bringing-dl-workloads-to-jsc
training2425
code
$ ssh-keygen -a 100 -t ed25519 -f ~/.ssh/id_ed25519-JSC
Generating public/private ed25519 key pair.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /Users/strube1/.ssh/id_ed25519-JSC
Your public key has been saved in /Users/strube1/.ssh/id_ed25519-JSC.pub
The key fingerprint is:
SHA256:EGNNC1NTaN8fHwpfuZRPa50qXHmGcQjxp0JuU0ZA86U strube1@Strube-16
The key's randomart image is:
+--[ED25519 256]--+
| *++oo=o. . |
| . =+o .= o |
| .... o.E..o|
| . +.+o+B.|
| S =o.o+B|
| . o*.B+|
| . . = |
| o . |
| . |
+----[SHA256]-----+
Windows users, from Ubuntu WSL (replace username with your own user name on Windows)
Host jureca
    HostName jureca.fz-juelich.de
    # Here goes your username, not the word MY_USERNAME.
    User [MY_USERNAME]
    AddressFamily inet
    IdentityFile ~/.ssh/id_ed25519-JSC
    MACs hmac-sha2-512-etm@openssh.com
Copy contents to the config file and save it
REPLACE [MY_USERNAME] WITH YOUR USERNAME!!! 🤦♂️
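A forgotten placeholder is the most common mistake at this step. The following is a minimal sketch, not part of the course material, that scans the text of an SSH config for a leftover `[MY_USERNAME]`. The function name `find_placeholder_lines` and the sample text are made up for illustration.

```python
# Placeholder used in the example config; it must not survive in the real file.
PLACEHOLDER = "[MY_USERNAME]"


def find_placeholder_lines(config_text: str) -> list[int]:
    """Return the 1-based line numbers that still contain the placeholder."""
    return [
        number
        for number, line in enumerate(config_text.splitlines(), start=1)
        if PLACEHOLDER in line
    ]


# Illustrative config text (an assumption, not read from a real file).
example = """Host jureca
    HostName jureca.fz-juelich.de
    User [MY_USERNAME]
"""

print(find_placeholder_lines(example))
```

An empty list means the placeholder has been replaced everywhere.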
code key.txt
and paste the number you got.
Did everyone get their own IP address?
Example: 93.199.55.163
Keep the first two numbers of 93.199.55.163 and replace the last two with "0.0/16":
93.199.55.163 becomes 93.199.0.0/16 (with YOUR number, not with the example).
Put from="" around it: from="93.199.0.0/16"
Append ,10.0.0.0/8 🧙♀️: from="93.199.0.0/16,10.0.0.0/8"
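Widening an address to its /16 range and wrapping it in a `from=` clause can be sketched with Python's standard `ipaddress` module. This is only an illustration of the arithmetic. The function name `from_clause` is an assumption, and the address is the example one from the slides, not yours.

```python
import ipaddress


def from_clause(my_ip: str, extra_ranges: tuple[str, ...] = ("10.0.0.0/8",)) -> str:
    """Build the from="..." prefix for a public key entry.

    The address is widened to its /16 network (first two numbers kept,
    last two set to zero), then the extra ranges are appended.
    """
    # strict=False lets us pass a host address and get its enclosing network.
    network = ipaddress.ip_network(f"{my_ip}/16", strict=False)
    ranges = [str(network), *extra_ranges]
    return 'from="' + ",".join(ranges) + '"'


# Example address from the slides; use YOUR number instead.
print(from_clause("93.199.55.163"))  # prints from="93.199.0.0/16,10.0.0.0/8"
```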
Terminal:
code ~/.ssh/id_ed25519-JSC.pub
Something like this will open:
Paste this line into the same key.txt file that you just opened
This might take some minutes
That’s it! Give it a try (and answer yes)
$ ssh jureca
The authenticity of host 'jrlogin03.fz-juelich.de (134.94.0.185)' cannot be established.
ED25519 key fingerprint is SHA256:ASeu9MJbkFx3kL1FWrysz6+paaznGenChgEkUW8nRQU.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])? Yes
**************************************************************************
* Welcome to Jureca DC *
**************************************************************************
...
...
strube1@jrlogin03~ $
# Create a folder for myself
mkdir $PROJECT_training2425/$USER
# Create a shortcut for the project on the home folder
rm -rf ~/course ; ln -s $PROJECT_training2425/$USER ~/course
# Enter the course folder
cd ~/course
# Where am I?
pwd
# We will need those later
mkdir ~/course/.cache
mkdir ~/course/.config
mkdir ~/course/.fastai
rm -rf $HOME/.cache ; ln -s ~/course/.cache $HOME/
rm -rf $HOME/.config ; ln -s ~/course/.config $HOME/
rm -rf $HOME/.fastai ; ln -s ~/course/.fastai $HOME/
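The last three commands delete a folder in your home directory and replace it with a link into the project folder, where there is more space. The sketch below reproduces that pattern in Python inside a throwaway temporary directory, so it touches nothing real. The helper name `relink` and the sandbox layout are assumptions made for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def relink(course: Path, home: Path, name: str) -> Path:
    """Create course/name and make home/name a symlink pointing to it."""
    target = course / name
    target.mkdir(parents=True, exist_ok=True)
    link = home / name
    # Mirror "rm -rf": remove whatever is currently at the link location.
    if link.is_symlink() or link.is_file():
        link.unlink()
    elif link.is_dir():
        shutil.rmtree(link)
    link.symlink_to(target, target_is_directory=True)
    return link


# Run in a sandbox that stands in for $HOME and the project folder.
with tempfile.TemporaryDirectory() as sandbox:
    home = Path(sandbox) / "home"
    course = Path(sandbox) / "project" / "course"
    home.mkdir()
    (home / ".cache").mkdir()  # pretend an old cache folder already exists
    for name in (".cache", ".config", ".fastai"):
        link = relink(course, home, name)
        print(link, "->", link.resolve())
```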
module spider
strube1$ module spider PyTorch
------------------------------------------------------------------------------------
PyTorch:
------------------------------------------------------------------------------------
Description:
Tensors and Dynamic neural networks in Python with strong GPU acceleration.
PyTorch is a deep learning framework that puts Python first.
Versions:
PyTorch/1.7.0-Python-3.8.5
PyTorch/1.8.1-Python-3.8.5
PyTorch/1.11-CUDA-11.5
PyTorch/1.12.0-CUDA-11.7
Other possible modules matches:
PyTorch-Geometric PyTorch-Lightning
...
module avail
(Inside hierarchy)
Stage (full collection of software of a given year)
Compiler
MPI
Module
E.g.:
module load Stages/2023 GCC OpenMPI PyTorch
module spider Software/version
Search for the software itself - it will suggest a version
Search with the version - it will suggest the hierarchy
(make sure you are still connected to Jureca DC)
Oh noes! 🙈
Let’s bring Python together with PyTorch!
Copy and paste these lines
# This command fails, as we have no proper python
python
# So, we load the correct modules...
module load Stages/2024
module load GCC OpenMPI Python PyTorch
# And we run a small test: import pytorch and ask its version
python -c "import torch ; print(torch.__version__)"
Should look like this:
module key
module key toml
The following modules match your search criteria: "toml"
------------------------------------------------------------------------------------
Jupyter: Jupyter/2020.2.5-Python-3.8.5, Jupyter/2021.3.1-Python-3.8.5, Jupyter/2021.3.2-Python-3.8.5, Jupyter/2022.3.3, Jupyter/2022.3.4
Project Jupyter exists to develop open-source software, open-standards, and services for interactive computing across dozens of programming languages.
PyQuil: PyQuil/3.0.1
PyQuil is a library for generating and executing Quil programs on the Rigetti Forest platform.
Python: Python/3.8.5, Python/3.9.6, Python/3.10.4
Python is a programming language that lets you work more quickly and integrate your systems more effectively.
------------------------------------------------------------------------------------
From VSCode’s terminal, navigate to your “course” folder and to the name you created earlier.
This is our working directory. We do everything here.
Create a file called “matrix.py” in VSCode on Jureca DC.
Paste this into the file:
module load Stages/2024
module load GCC OpenMPI PyTorch
python matrix.py
Slurm: Simple Linux Utility for Resource Management
code jureca-matrix.sbatch
#!/bin/bash
#SBATCH --account=training2425 # Who pays?
#SBATCH --nodes=1 # How many compute nodes
#SBATCH --job-name=matrix-multiplication
#SBATCH --ntasks-per-node=1 # How many mpi processes/node
#SBATCH --cpus-per-task=1 # How many cpus per mpi proc
#SBATCH --output=output.%j # Where to write results
#SBATCH --error=error.%j
#SBATCH --time=00:01:00 # For how long can it run?
#SBATCH --partition=dc-gpu # Machine partition
#SBATCH --reservation=training2425 # For today only
module load Stages/2024
module load GCC OpenMPI PyTorch # Load the correct modules on the compute node(s)
srun python matrix.py # srun tells the supercomputer how to run it
squeue --me
squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
412169 gpus matrix-m strube1 CF 0:02 1 jsfc013
training2425
# Notice that this number is the job id. It's different for every job
cat output.412169
cat error.412169
Or simply open it on VSCode!
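The job id shown by `squeue --me` is what replaces `%j` in the `--output` and `--error` file names. As a rough sketch, the listing can be split into columns to derive those names. The function `parse_squeue` is hypothetical, the sample text is the listing from the slide, and the parsing assumes no column contains spaces.

```python
def parse_squeue(text: str) -> list[dict[str, str]]:
    """Turn whitespace-separated squeue output into one dict per job."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


# Sample listing copied from the slide; on the cluster it would come from squeue.
sample = """JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
412169 gpus matrix-m strube1 CF 0:02 1 jsfc013
"""

for job in parse_squeue(sample):
    job_id = job["JOBID"]
    # %j in the sbatch script expands to the job id.
    print(job_id, job["ST"], f"output.{job_id}", f"error.{job_id}")
```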
pip
Edit the file sc_venv_template/requirements.txt
Add these lines at the end:
Run on the terminal:
sc_venv_template/setup.sh
from fastai.vision.all import *
from fastai.callback.tensorboard import *
#
print("Downloading dataset...")
path = untar_data(URLs.PETS)/'images'
print("Finished downloading dataset")
#
def is_cat(x): return x[0].isupper()
# Create the dataloaders and resize the images
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))
print("On the login node, this will download resnet34")
learn = vision_learner(dls, resnet34, metrics=accuracy)
cbs=[SaveModelCallback(), TensorBoardCallback('runs', trace_model=True)]
# Trains the model for 6 epochs with this dataset
learn.unfreeze()
learn.fit_one_cycle(6, cbs=cbs)
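The labelling function `is_cat` works because of a naming convention in this dataset: file names of cat breeds start with an uppercase letter, those of dog breeds with a lowercase one. A small standalone illustration follows; the file names are examples chosen for the demonstration, and no fastai import is needed.

```python
def is_cat(filename: str) -> bool:
    """Label by naming convention: cat breeds are capitalised, dog breeds are not."""
    return filename[0].isupper()


# Illustrative file names following the dataset's naming convention.
for name in ("Abyssinian_1.jpg", "beagle_10.jpg"):
    print(name, "->", "cat" if is_cat(name) else "dog")
```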
#!/bin/bash
#SBATCH --account=training2425
#SBATCH --mail-user=MYUSER@fz-juelich.de
#SBATCH --mail-type=ALL
#SBATCH --nodes=1
#SBATCH --job-name=cat-classifier
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=128
#SBATCH --output=output.%j
#SBATCH --error=error.%j
#SBATCH --time=00:20:00
#SBATCH --partition=dc-gpu
#SBATCH --reservation=training2425 # For today only
cd $HOME/course/
source sc_venv_template/activate.sh # Now we finally use the fastai module
srun python cats.py
error.${JOBID} file
$ source sc_venv_template/activate.sh
$ python cats.py
Downloading dataset...
|████████-------------------------------| 23.50% [190750720/811706944 00:08<00:26]
Downloading: "https://download.pytorch.org/models/resnet34-b627a593.pth" to /p/project/ccstao/cstao05/.cache/torch/hub/checkpoints/resnet34-b627a593.pth
100%|█████████████████████████████████████| 83.3M/83.3M [00:00<00:00, 266MB/s]
(To exit, type CTRL-C)
The activation script must be sourced, otherwise the virtual environment will not work.
Setting vars
Downloading dataset...
Finished downloading dataset
epoch train_loss valid_loss error_rate time
Epoch 1/1 : |-----------------------------------| 0.00% [0/92 00:00<?]
Epoch 1/1 : |-----------------------------------| 2.17% [2/92 00:14<10:35 1.7452]
Epoch 1/1 : |█----------------------------------| 3.26% [3/92 00:14<07:01 1.6413]
Epoch 1/1 : |██---------------------------------| 5.43% [5/92 00:15<04:36 1.6057]
...
....
Epoch 1/1 :
epoch train_loss valid_loss error_rate time
0 0.049855 0.021369 0.007442 00:42
PORTS next to the terminal
As of now, I expect you managed to:
Inside config.json, add at the "models" section:
REPLACE THE APIKEY WITH YOUR OWN TOKEN!!!!
Type on your machine “code $HOME/.ssh/config” and paste this at the end:
# -- Compute Nodes --
Host *.jureca
    User [ADD YOUR USERNAME HERE]
    StrictHostKeyChecking no
    IdentityFile ~/.ssh/id_ed25519-JSC
    ProxyJump jureca
Example: A service provides a web interface on port 9999
On the supercomputer:
srun --time=00:05:00 \
--nodes=1 --ntasks=1 \
--partition=dc-gpu \
--account training2425 \
--cpu_bind=none \
--pty /bin/bash -i
bash-4.4$ hostname # This is running on a compute node of the supercomputer
jwb0002
bash-4.4$ cd $HOME/course/
bash-4.4$ source sc_venv_template/activate.sh
bash-4.4$ tensorboard --logdir=runs --port=9999 serve
On your machine:
Mind the letter i I added at the end of the hostname
Now you can access the service on your local browser at http://localhost:3334
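The exact command typed on the local machine is not shown in the text, so the following is only a sketch of what such a tunnel could look like, under stated assumptions: standard `ssh -L local_port:host:remote_port` forwarding, the compute node name with an extra `i` appended, and the `.jureca` suffix matching the `Host *.jureca` entry above. The helper `tunnel_command` is hypothetical and only builds the argument list; it does not connect anywhere.

```python
def tunnel_command(
    node: str, local_port: int, remote_port: int, suffix: str = ".jureca"
) -> list[str]:
    """Build an ssh port-forwarding command for a service on a compute node.

    Assumption: the reachable host name is the node name plus an "i",
    followed by a suffix matched by the "Host *.jureca" config entry.
    """
    host = f"{node}i{suffix}"
    return ["ssh", "-L", f"{local_port}:localhost:{remote_port}", host]


# Node name and ports taken from the example: the service listens on 9999,
# the local browser connects to 3334.
print(" ".join(tunnel_command("jwb0002", 3334, 9999)))
```

Use the node name that `hostname` printed in your own session rather than the one from the example.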