Bringing Deep Learning Workloads to JSC supercomputers

Data loading

Alexandre Strube // Sabrina Benassou

June 25th, 2025

Schedule for day 2

Time Title
13:00 - 13:10 Welcome, questions
13:10 - 14:10 Data loading
14:10 - 14:25 Coffee Break (flexible)
14:25 - 17:00 Parallelize Training

Let’s talk about DATA

I/O is separate and shared

  • All compute nodes of all supercomputers see the same files
  • Performance trade-off between shared accessibility and speed
  • Our I/O server is almost a supercomputer by itself (see the JSC Supercomputer Strategy)

Where do I keep my files?

  • Always store your code in the project folder ($PROJECT_projectname). In our case:

    /p/project1/training2529/$USER
  • Store data in the scratch directory for faster I/O access ($SCRATCH_projectname). ⚠️ Files in scratch are deleted after 90 days of inactivity.

    /p/scratch/training2529/$USER
  • Store the data in $DATA_dataset for a more permanent location.

    /p/data1/datasets

Data loading

  • We have CPUs and lots of memory - let’s use them
  • If your dataset is relatively small (< 500 GB) and fits into the working memory (RAM) of each compute node (along with the program state), you can store it in /dev/shm. This is a special filesystem that uses RAM for storage, making it extremely fast for data access (see the staging sketch after this list). ⚡️
  • For bigger datasets (> 500 GB), there are several strategies:
    • Hierarchical Data Format 5 (HDF5)
    • Apache Arrow
    • NVIDIA Data Loading Library (DALI)
    • SquashFS
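
For the /dev/shm approach, here is a minimal staging sketch in Python (the archive name dataset.tar is an assumption, not part of the course material): unpack the dataset from scratch into RAM once per node, so that later epochs read from memory instead of the parallel filesystem.

    import os
    import tarfile

    user = os.environ["USER"]
    src = f"/p/scratch/training2529/{user}/dataset.tar"  # hypothetical archive
    dst = "/dev/shm/dataset"                             # RAM-backed filesystem

    # Stage once per node; if another process already unpacked it, skip.
    if not os.path.isdir(dst):
        with tarfile.open(src) as tar:
            tar.extractall(dst)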

Inodes

  • Inodes (Index Nodes) are data structures that store metadata about files and directories.
  • Unique identification of files and directories within the file system.
  • Efficient management and retrieval of file metadata.
  • Essential for file operations like opening, reading, and writing.
  • Limitations:
    • Fixed number: a filesystem has a limited number of inodes; once they are exhausted, no new files can be created, even if free disk space remains.
    • Space consumption: inodes themselves consume disk space, so the filesystem must balance inode count against usable capacity.
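
To see how close you are to the limit, inode counts can be queried from Python's standard library (the path is just an example; point it at the filesystem you care about):

    import os

    # statvfs reports filesystem totals, including inode counts.
    st = os.statvfs("/p/scratch")
    print(f"total inodes: {st.f_files:,}")
    print(f"free inodes:  {st.f_ffree:,}")
    print(f"used:         {100 * (1 - st.f_ffree / st.f_files):.1f}%")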

Data loading

  • In this course, we provide you with examples of how to create and read HDF5 and PyArrow files.

  • We need to download some code:

    cd $HOME/course
    git clone https://github.com/HelmholtzAI-FZJ/2025-06-course-Bringing-Deep-Learning-Workloads-to-JSC-supercomputers.git
  • Move to the correct folder:

    cd 2025-06-course-Bringing-Deep-Learning-Workloads-to-JSC-supercomputers/code/dataloading/
  • We used the ImageNet dataset for the examples.

The ImageNet dataset

Large Scale Visual Recognition Challenge (ILSVRC)

  • An image dataset organized according to the WordNet hierarchy.
  • Extensively used in algorithms for object detection and image classification at large scale.
  • It has 1,000 classes, comprising 1.2 million training images and 50,000 validation images.

The ImageNet dataset

ILSVRC
|-- Data/
    `-- CLS-LOC
        |-- test
        |-- train
        |   |-- n01440764
        |   |   |-- n01440764_10026.JPEG
        |   |   |-- n01440764_10027.JPEG
        |   |   |-- n01440764_10029.JPEG
        |   |-- n01695060
        |   |   |-- n01695060_10009.JPEG
        |   |   |-- n01695060_10022.JPEG
        |   |   |-- n01695060_10028.JPEG
        |   |   |-- ...
        |   |...
        |-- val
            |-- ILSVRC2012_val_00000001.JPEG  
            |-- ILSVRC2012_val_00016668.JPEG  
            |-- ILSVRC2012_val_00033335.JPEG      
            |-- ...

The ImageNet dataset

imagenet_train.pkl

{
    'ILSVRC/Data/CLS-LOC/train/n03146219/n03146219_8050.JPEG': 524,
    'ILSVRC/Data/CLS-LOC/train/n03146219/n03146219_12728.JPEG': 524,
    'ILSVRC/Data/CLS-LOC/train/n03146219/n03146219_9736.JPEG': 524,
    ...
    'ILSVRC/Data/CLS-LOC/train/n03146219/n03146219_7460.JPEG': 524,
    ...
 }

imagenet_val.pkl

{
    'ILSVRC/Data/CLS-LOC/val/ILSVRC2012_val_00008838.JPEG': 785,
    'ILSVRC/Data/CLS-LOC/val/ILSVRC2012_val_00008555.JPEG': 129,
    'ILSVRC/Data/CLS-LOC/val/ILSVRC2012_val_00028410.JPEG': 968,
    ...
    'ILSVRC/Data/CLS-LOC/val/ILSVRC2012_val_00016007.JPEG': 709,
 }
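
These pickle files are plain dictionaries mapping an image path (relative to the dataset root) to its integer class label. A minimal sketch of loading one (assuming the file sits in the current directory):

    import pickle

    # Load the path -> class-index mapping shown above.
    with open("imagenet_train.pkl", "rb") as f:
        train_labels = pickle.load(f)

    # Inspect one entry.
    path, label = next(iter(train_labels.items()))
    print(path, "->", label)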

HDF5

  • A binary file format for storing large, complex datasets.

  • Store data like a file system inside a file.

  • Hierarchical: organizes data as groups and datasets
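
As a rough idea of what writing such a file looks like with h5py (a sketch only; the file name, layout, and image list are assumptions, and the course's actual code is in save_imagenet_files.py):

    import h5py
    import numpy as np

    image_paths = ["n03146219_8050.JPEG", "n03146219_12728.JPEG"]  # placeholders
    labels = [524, 524]

    with h5py.File("imagenet.h5", "w") as f:
        grp = f.create_group("train")
        # Raw JPEG bytes stored as variable-length uint8 arrays.
        vlen = h5py.vlen_dtype(np.dtype("uint8"))
        images = grp.create_dataset("images", shape=(len(image_paths),), dtype=vlen)
        grp.create_dataset("labels", data=np.asarray(labels, dtype="int64"))
        for i, path in enumerate(image_paths):
            with open(path, "rb") as img:
                images[i] = np.frombuffer(img.read(), dtype="uint8")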

PyArrow

  • A Python library that provides tools for Apache Arrow – an in-memory columnar data format

  • Stores data as tables, arrays, and record batches
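
A comparable PyArrow sketch (again with an assumed file name and schema): put image bytes and labels in columns of a table and write it in the Arrow IPC file format, which can later be memory-mapped.

    import pyarrow as pa

    table = pa.table({
        "image": pa.array([b"<jpeg bytes>", b"<jpeg bytes>"], type=pa.binary()),
        "label": pa.array([524, 785], type=pa.int32()),
    })

    # Write with the Arrow IPC (file) format.
    with pa.OSFile("imagenet.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)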

Run examples

  • The examples are in:

        imagenet_loaders.py    # to read the HDF5 and PyArrow files
        save_imagenet_files.py # to create the HDF5 and PyArrow files
  • To create the HDF5 or PyArrow files, run the examples by launching

        sbatch run_save_file.sh
  • To read those files, you can run:

        sbatch run_loader.sh
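
For orientation, reading the files back might look like the sketch below (hypothetical file names matching the earlier sketches; the course's actual readers are in imagenet_loaders.py):

    import h5py
    import pyarrow as pa

    # HDF5: fetch one image's raw bytes and its label.
    with h5py.File("imagenet.h5", "r") as f:
        raw = bytes(f["train/images"][0])
        label = int(f["train/labels"][0])

    # Arrow: memory-map the file, so processes share pages instead of copying.
    with pa.memory_map("imagenet.arrow") as source:
        table = pa.ipc.open_file(source).read_all()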