Accessing the machines, intro

June 25, 2024 · *NOT*

mkdir ~/.ssh/
ssh-keygen -a 100 -t ed25519 -f ~/.ssh/id_ed25519-JSC
$ ssh-keygen -a 100 -t ed25519 -f ~/.ssh/id_ed25519-JSC
Generating public/private ed25519 key pair.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /Users/strube1/.ssh/id_ed25519-JSC
Your public key has been saved in /Users/strube1/.ssh/id_ed25519-JSC.pub
The key fingerprint is:
SHA256:EGNNC1NTaN8fHwpfuZRPa50qXHmGcQjxp0JuU0ZA86U strube1@Strube-16
The key's randomart image is:
+--[ED25519 256]--+
|      *++oo=o. . |
|     . =+o .= o  |
|      .... o.E..o|
|       .  +.+o+B.|
|        S  =o.o+B|
|          . o*.B+|
|          . . =  |
|           o .   |
|            .    |
+----[SHA256]-----+
code $HOME/.ssh/config
ls -la /mnt/c/Users/
mkdir /mnt/c/Users/USERNAME/.ssh/
cp $HOME/.ssh/* /mnt/c/Users/USERNAME/.ssh/
Host jureca
        HostName jureca.fz-juelich.de
        User [MY_USERNAME]   # Here goes your username, not the word MY_USERNAME.
        AddressFamily inet
        IdentityFile ~/.ssh/id_ed25519-JSC
        MACs hmac-sha2-512-etm@openssh.com
ssh-ed25519 AAAAC3NzaC1lZDE1NTA4AAAAIHaoOJF3gqXd7CV6wncoob0DL2OJNfvjgnHLKEniHV6F strube@demonstration.fz-juelich.de
from="93.199.0.0/16,10.0.0.0/8" ssh-ed25519 AAAAC3NzaC1lZDE1NTA4AAAAIHaoOJF3gqXd7CV6wncoob0DL2OJNfvjgnHLKEniHV6F strube@demonstration.fz-juelich.de
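The `from=` prefix above restricts logins to the listed address ranges. As a minimal sketch (the helper names `widen_to_network` and `is_allowed` are made up for illustration, and only CIDR ranges are handled, not hostname patterns), Python's standard `ipaddress` module can show how the example address 93.199.55.163 widens to 93.199.0.0/16 and whether an address falls inside a `from=` list:

```python
import ipaddress


def widen_to_network(ip: str, prefix: int = 16) -> str:
    """Return the network of the given prefix length that contains `ip`."""
    # strict=False lets us pass a host address instead of a network address
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))


def is_allowed(ip: str, from_clause: str) -> bool:
    """Check `ip` against a comma-separated list of CIDR ranges."""
    networks = [ipaddress.ip_network(r.strip()) for r in from_clause.split(",")]
    address = ipaddress.ip_address(ip)
    return any(address in network for network in networks)


print(widen_to_network("93.199.55.163"))
print(is_allowed("93.199.55.163", "93.199.0.0/16,10.0.0.0/8"))
print(is_allowed("8.8.8.8", "93.199.0.0/16,10.0.0.0/8"))
```

A /16 keeps the first two numbers of the address fixed, which is usually wide enough to survive the provider handing out a new address.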
$ ssh jureca
The authenticity of host 'jrlogin03.fz-juelich.de (134.94.0.185)' cannot be established.
ED25519 key fingerprint is SHA256:ASeu9MJbkFx3kL1FWrysz6+paaznGenChgEkUW8nRQU.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])? Yes
**************************************************************************
*                            Welcome to Jureca DC                   *
**************************************************************************
...
...
strube1@jrlogin03~ $ 
# Create a folder for myself
mkdir $PROJECT_training2425/$USER

# Create a shortcut for the project on the home folder
rm -rf ~/course ; ln -s $PROJECT_training2425/$USER ~/course

# Enter the course folder
cd ~/course

# Where am I?
pwd

# We will need those later
mkdir ~/course/.cache
mkdir ~/course/.config
mkdir ~/course/.fastai

rm -rf $HOME/.cache ; ln -s ~/course/.cache $HOME/
rm -rf $HOME/.config ; ln -s ~/course/.config $HOME/
rm -rf $HOME/.fastai ; ln -s ~/course/.fastai $HOME/
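The commands above move the cache and configuration folders into the project folder and leave symbolic links behind in the home folder. The same idea can be sketched in Python with `pathlib`. This is an illustration only: the function name `relocate` is hypothetical, it runs in a throwaway directory rather than the real `$HOME`, and unlike `rm -rf` it refuses to delete an existing real directory.

```python
import tempfile
from pathlib import Path


def relocate(home: Path, course: Path, names=(".cache", ".config", ".fastai")):
    """Create each directory under `course` and link to it from `home`."""
    for name in names:
        target = course / name
        target.mkdir(parents=True, exist_ok=True)
        link = home / name
        if link.is_symlink():
            link.unlink()  # replace a stale link
        elif link.exists():
            # The shell version deletes here; this sketch stops instead
            raise FileExistsError(f"{link} exists; move it away first")
        link.symlink_to(target, target_is_directory=True)


# Demonstration in a temporary directory, not in the real home folder
with tempfile.TemporaryDirectory() as tmp:
    home, course = Path(tmp) / "home", Path(tmp) / "course"
    home.mkdir()
    course.mkdir()
    relocate(home, course)
    print(sorted(p.name for p in home.iterdir()))
    print((home / ".cache").resolve() == (course / ".cache").resolve())
```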
strube1$ module spider PyTorch
------------------------------------------------------------------------------------
  PyTorch:
------------------------------------------------------------------------------------
    Description:
      Tensors and Dynamic neural networks in Python with strong GPU acceleration. 
      PyTorch is a deep learning framework that puts Python first.

     Versions:
        PyTorch/1.7.0-Python-3.8.5
        PyTorch/1.8.1-Python-3.8.5
        PyTorch/1.11-CUDA-11.5
        PyTorch/1.12.0-CUDA-11.7
     Other possible modules matches:
        PyTorch-Geometric  PyTorch-Lightning
...
$ python
-bash: python: command not found
# This command fails, as we have no proper python
python 
# So, we load the correct modules...
module load Stages/2024
module load GCC OpenMPI Python PyTorch
# And we run a small test: import pytorch and ask its version
python -c "import torch ; print(torch.__version__)" 
$ python
-bash: python: command not found
$ module load Stages/2024
$ module load GCC OpenMPI Python PyTorch
$ python -c "import torch ; print(torch.__version__)" 
2.1.0
module key toml
The following modules match your search criteria: "toml"
------------------------------------------------------------------------------------

  Jupyter: Jupyter/2020.2.5-Python-3.8.5, Jupyter/2021.3.1-Python-3.8.5, Jupyter/2021.3.2-Python-3.8.5, Jupyter/2022.3.3, Jupyter/2022.3.4
    Project Jupyter exists to develop open-source software, open-standards, and services for interactive computing across dozens of programming languages.

  PyQuil: PyQuil/3.0.1
    PyQuil is a library for generating and executing Quil programs on the Rigetti Forest platform.

  Python: Python/3.8.5, Python/3.9.6, Python/3.10.4
    Python is a programming language that lets you work more quickly and integrate your systems more effectively.

------------------------------------------------------------------------------------
cd $HOME/course/
pwd
code matrix.py
import torch

matrix1 = torch.randn(3,3)
print("The first matrix is", matrix1)

matrix2 = torch.randn(3,3)
print("The second matrix is", matrix2)

result = torch.matmul(matrix1,matrix2)
print("The result is:\n", result)
module load Stages/2024
module load GCC OpenMPI PyTorch
python matrix.py
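For reference, the product that `torch.matmul` computes for two 2-D matrices can be written out in plain Python. This is a sketch for understanding only (the function name `matmul` here is our own, and it has none of PyTorch's speed or GPU support): entry (i, j) of the result is the sum over k of `a[i][k] * b[k][j]`.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions must match")
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
m = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

print(matmul(m, identity) == m)                      # multiplying by identity
print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))    # small worked example
```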
#!/bin/bash
#SBATCH --account=training2425           # Who pays?
#SBATCH --nodes=1                        # How many compute nodes
#SBATCH --job-name=matrix-multiplication
#SBATCH --ntasks-per-node=1              # How many mpi processes/node
#SBATCH --cpus-per-task=1                # How many cpus per mpi proc
#SBATCH --output=output.%j        # Where to write results
#SBATCH --error=error.%j
#SBATCH --time=00:01:00          # For how long can it run?
#SBATCH --partition=dc-gpu         # Machine partition
#SBATCH --reservation=training2425 # For today only

module load Stages/2024
module load GCC OpenMPI PyTorch  # Load the correct modules on the compute node(s)

srun python matrix.py            # srun tells the supercomputer how to run it
sbatch jureca-matrix.sbatch

Submitted batch job 412169
squeue --me
   JOBID  PARTITION    NAME      USER    ST       TIME  NODES NODELIST(REASON)
   412169 gpus         matrix-m  strube1 CF       0:02      1 jsfc013
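The `ST` column holds a short Slurm state code. As a small sketch (the helper `parse_squeue` and the `STATES` table are our own and cover only a few common codes), the listing above can be split into one dictionary per job:

```python
SAMPLE = """\
   JOBID  PARTITION    NAME      USER    ST       TIME  NODES NODELIST(REASON)
   412169 gpus         matrix-m  strube1 CF       0:02      1 jsfc013
"""

# A few common Slurm job state codes
STATES = {
    "PD": "pending",
    "CF": "configuring",
    "R": "running",
    "CG": "completing",
    "CD": "completed",
}


def parse_squeue(text):
    """Turn whitespace-separated squeue output into a list of dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    # Limit the split so the last column keeps any spaces it may contain
    return [dict(zip(header, line.split(None, len(header) - 1))) for line in lines[1:]]


for job in parse_squeue(SAMPLE):
    print(job["JOBID"], job["NAME"], STATES.get(job["ST"], job["ST"]))
```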
scancel <JOBID>
# Notice that this number is the job id. It's different for every job
cat output.412169 
cat error.412169 
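Because the job id changes with every submission, it can help to read it from the `sbatch` reply instead of typing it. The sketch below assumes the reply has the form shown above ("Submitted batch job ...") and that the script uses `--output=output.%j` and `--error=error.%j`. The helper names are hypothetical, and `submit` only works on a machine where `sbatch` is installed, so it is defined but not called here.

```python
import re
import subprocess


def parse_job_id(sbatch_stdout: str) -> int:
    """Extract the numeric job id from sbatch's reply."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    if match is None:
        raise ValueError(f"unexpected sbatch output: {sbatch_stdout!r}")
    return int(match.group(1))


def log_files(job_id: int):
    """File names produced by --output=output.%j and --error=error.%j."""
    return f"output.{job_id}", f"error.{job_id}"


def submit(script: str) -> int:
    """Submit a batch script and return its job id (needs sbatch on PATH)."""
    done = subprocess.run(
        ["sbatch", script], capture_output=True, text=True, check=True
    )
    return parse_job_id(done.stdout)


job_id = parse_job_id("Submitted batch job 412169")
print(job_id, log_files(job_id))
```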
cd $HOME/course/
git clone https://gitlab.jsc.fz-juelich.de/kesselheim1/sc_venv_template.git
fastai
wandb
accelerate
deepspeed
source sc_venv_template/activate.sh
source ./activate.sh 
The activation script must be sourced, otherwise the virtual environment will not work.
Setting vars
The following modules were not unloaded:
  (Use "module --force purge" to unload all):
 1) Stages/2024
jureca01 $ python
Python 3.11.3 (main, Jun 25 2023, 13:17:30) [GCC 12.3.0]
>>> import fastai
>>> fastai.__version__
'2.7.14'
code cats.py
from fastai.vision.all import *
from fastai.callback.tensorboard import *
#
print("Downloading dataset...")
path = untar_data(URLs.PETS)/'images'
print("Finished downloading dataset")
#
def is_cat(x): return x[0].isupper()
# Create the dataloaders and resize the images
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))
print("On the login node, this will download resnet34")
learn = vision_learner(dls, resnet34, metrics=accuracy)
cbs=[SaveModelCallback(), TensorBoardCallback('runs', trace_model=True)]
# Trains the model for 6 epochs with this dataset
learn.unfreeze()
learn.fit_one_cycle(6, cbs=cbs)
code fastai.sbatch
#!/bin/bash
#SBATCH --account=training2425
#SBATCH --mail-user=MYUSER@fz-juelich.de
#SBATCH --mail-type=ALL
#SBATCH --nodes=1
#SBATCH --job-name=cat-classifier
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=128
#SBATCH --output=output.%j
#SBATCH --error=error.%j
#SBATCH --time=00:20:00
#SBATCH --partition=dc-gpu
#SBATCH --reservation=training2425 # For today only

cd $HOME/course/
source sc_venv_template/activate.sh # Now we finally use the fastai module

srun python cats.py
sbatch fastai.sbatch
$ cat output.7948496 
The activation script must be sourced, otherwise the virtual environment will not work.
Setting vars
Downloading dataset...
$ cat error.7948496
The following modules were not unloaded:
(Use "module --force purge" to unload all):

1) Stages/2024
Traceback (most recent call last):
  File "/p/project/training2425/strube1/cats.py", line 5, in <module>
    path = untar_data(URLs.PETS)/'images'
    ...
    ...
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>
srun: error: jwb0160: task 0: Exited with exit code 1
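The timeout above comes from the download attempt on the compute node. One way to fail fast is to check for the dataset on disk before starting. The sketch below is only an illustration: the function name is hypothetical, and the location `~/.fastai/data/oxford-iiit-pet/images` is an assumption about where fastai unpacks `URLs.PETS` by default, so verify it on your system.

```python
import tempfile
from pathlib import Path


def dataset_is_cached(name="oxford-iiit-pet", fastai_home=None):
    """Return True if the unpacked dataset already has files on disk."""
    # Assumed default location; fastai_home lets you point somewhere else
    root = Path(fastai_home) if fastai_home else Path.home() / ".fastai"
    images = root / "data" / name / "images"
    return images.is_dir() and any(images.iterdir())


# Demonstration with a temporary directory standing in for ~/.fastai
with tempfile.TemporaryDirectory() as tmp:
    print(dataset_is_cached(fastai_home=tmp))   # nothing downloaded yet
    images = Path(tmp) / "data" / "oxford-iiit-pet" / "images"
    images.mkdir(parents=True)
    (images / "Abyssinian_1.jpg").touch()
    print(dataset_is_cached(fastai_home=tmp))   # now a file is present
```

Calling such a check at the top of a training script turns a slow connection timeout into an immediate, readable error.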
path = untar_data(URLs.PETS)/'images'
learn = vision_learner(dls, resnet34, metrics=error_rate)
# learn.fit_one_cycle(6, cbs=cbs)
source sc_venv_template/activate.sh # So that we have fast.ai library
python cats.py
$ source sc_venv_template/activate.sh
$ python cats.py 
Downloading dataset...
 |████████-------------------------------| 23.50% [190750720/811706944 00:08<00:26]
 Downloading: "https://download.pytorch.org/models/resnet34-b627a593.pth" to /p/project/ccstao/cstao05/.cache/torch/hub/checkpoints/resnet34-b627a593.pth
100%|█████████████████████████████████████| 83.3M/83.3M [00:00<00:00, 266MB/s]
learn.fit_one_cycle(6, cbs=cbs)
sbatch fastai.sbatch
watch squeue --me
The activation script must be sourced, otherwise the virtual environment will not work.
Setting vars
Downloading dataset...
Finished downloading dataset
epoch     train_loss  valid_loss  error_rate  time    
Epoch 1/1 : |-----------------------------------| 0.00% [0/92 00:00<?]
Epoch 1/1 : |-----------------------------------| 2.17% [2/92 00:14<10:35 1.7452]
Epoch 1/1 : |█----------------------------------| 3.26% [3/92 00:14<07:01 1.6413]
Epoch 1/1 : |██---------------------------------| 5.43% [5/92 00:15<04:36 1.6057]
...
....
Epoch 1/1 :
epoch     train_loss  valid_loss  error_rate  time    
0         0.049855    0.021369    0.007442    00:42     
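The `error_rate` column is the complement of accuracy, so the final row above corresponds to roughly 99.26% accuracy. A short sketch of that arithmetic, using a hypothetical helper `parse_epoch_row` on the two lines printed above:

```python
def parse_epoch_row(header: str, row: str) -> dict:
    """Pair each column name with the value printed under it."""
    return dict(zip(header.split(), row.split()))


header = "epoch     train_loss  valid_loss  error_rate  time"
row = "0         0.049855    0.021369    0.007442    00:42"

result = parse_epoch_row(header, row)
accuracy = 1.0 - float(result["error_rate"])
print(f"accuracy = {accuracy:.6f}")
```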
cbs=[SaveModelCallback(), TensorBoardCallback('runs', trace_model=True)]
tensorboard --logdir=runs  --port=9999 serve
cd $HOME/course/
source sc_venv_template/activate.sh
tensorboard --logdir=runs  --port=12345 serve
    {
      "title": "Mistral helmholtz",
      "provider": "openai",
      "contextLength": 16384,
      "model": "alias-code",
      "apiKey": "ADD-YOUR-TOKEN-HERE",
      "apiBase": "https://helmholtz-blablador.fz-juelich.de:8000"
    },

# -- Compute Nodes --
Host *.jureca
        User [ADD YOUR USERNAME HERE]
        StrictHostKeyChecking no
        IdentityFile ~/.ssh/id_ed25519-JSC
        ProxyJump jureca
srun --time=00:05:00 \
     --nodes=1 --ntasks=1 \
     --partition=dc-gpu \
     --account training2425 \
     --cpu-bind=none \
     --pty /bin/bash -i

bash-4.4$ hostname # This is running on a compute node of the supercomputer
jwb0002

bash-4.4$ cd $HOME/course/
bash-4.4$ source sc_venv_template/activate.sh
bash-4.4$ tensorboard --logdir=runs  --port=9999 serve
ssh -L :3334:localhost:9999 jrc002i.jureca
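The `-L` option forwards a port on your own machine to a port on the remote side, so TensorBoard running on port 9999 of the compute node becomes reachable locally on port 3334. As a sketch (the helper names are our own, and the bind address is left at ssh's default rather than the leading colon used above), the command and the address to open in a browser can be assembled like this:

```python
import shlex


def tunnel_command(host: str, remote_port: int, local_port: int):
    """Build an ssh command forwarding local_port to remote_port on host."""
    return ["ssh", "-L", f"{local_port}:localhost:{remote_port}", host]


def local_url(local_port: int) -> str:
    """Address to open in the browser once the tunnel is up."""
    return f"http://localhost:{local_port}"


command = tunnel_command("jrc002i.jureca", remote_port=9999, local_port=3334)
print(shlex.join(command))
print(local_url(3334))
```

The host name is the one from the command above; replace it with the compute node your own job is running on.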

Time	Title
10:00 - 10:15	Welcome
10:15 - 11:00	Introduction
11:00 - 11:15	Coffee Break
11:15 - 11:30	Judoor, Keys
11:30 - 12:00	SSH, Jupyter, VS Code
12:00 - 12:15	Coffee Break
12:15 - 13:00	Running services on the login and compute nodes
13:00 - 13:15	Coffee Break
13:30 - 14:00	Sync (everyone should be at the same point)

Communication:

Goals for this course:

Team:

Schedule for day 1

Note

Jülich Supercomputers

What is a supercomputer?

Anatomy of a supercomputer

JURECA DC Compute Nodes

How do I use a Supercomputer?

You don’t use the whole supercomputer

You submit jobs to a queue asking for resources

And get results back

You are just submitting jobs via the login node

Supercomputer Usage Model

Recap:

Connecting to Jureca DC

Getting compute time

Jupyter

Pay attention to the partition - DON’T RUN IT ON THE LOGIN NODE!!!

Connecting to Jureca DC

VSCode

Now with the remote explorer tab

SSH

Create key in VSCode’s Terminal (menu View->Terminal)

SSH

Configure SSH session

SSH

JSC restricts where you can log in from

So we need to:

SSH

Find your ip/name range

SSH

SSH - EXAMPLE

SSH - Example: 93.199.55.163

SSH - Example: 93.199.0.0/16

Copy your ssh key

SSH

Example: 93.199.0.0/16

SSH

Add new key to Judoor

SSH: Exercise

Make sure you are connected to the supercomputer

Working with the supercomputer’s software

Software

Tool for finding software: module spider

What do we have?

Module hierarchy

What do I need to load such software?

Example: PyTorch

Python Modules

Some Python packages are part of the Python module itself, or of other modules. Use “module key” to find them.

VSCode

Editing files on the supercomputers

VSCode

Create a new file “`matrix.py`” on VSCode on Jureca DC

You want that extra software from `pip`….