Data loading
Alexandre Strube // Sabrina Benassou
June 25th, 2025
| Time | Title |
|---|---|
| 13:00 - 13:10 | Welcome, questions |
| 13:10 - 14:10 | Data loading |
| 14:10 - 14:25 | Coffee Break (flexible) |
| 14:25 - 17:00 | Parallelize Training |
Always store your code in the project folder ($PROJECT_projectname). In our case
Store data in the scratch directory for faster I/O access ($SCRATCH_projectname).
⚠️ Files in scratch are deleted after 90 days of inactivity.
Store the data in $DATA_dataset for a more permanent location.
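As a minimal sketch of how a script might pick up these locations, the snippet below reads the variables from the environment and falls back to a local directory when one is unset. The helper `storage_dir` and the fallback paths are hypothetical, and `projectname` and `dataset` are placeholders for your own names.

```python
import os
from pathlib import Path


def storage_dir(env_var: str, fallback: str) -> Path:
    """Return the directory named by env_var, or fallback if it is unset."""
    return Path(os.environ.get(env_var, fallback))


# Placeholder variable names: substitute your own project and dataset names.
code_dir = storage_dir("PROJECT_projectname", "./project")      # code
scratch_dir = storage_dir("SCRATCH_projectname", "./scratch")   # fast I/O, purged when inactive
data_dir = storage_dir("DATA_dataset", "./data")                # more permanent data

print(code_dir, scratch_dir, data_dir)
```

Keeping the paths in one place like this makes it easier to switch a training script between scratch and the permanent data location.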
/dev/shm: a special filesystem that uses RAM for storage, making it extremely fast for data access.
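One way to use a RAM-backed filesystem is to copy a dataset file there once at job start and read from the copy afterwards. The sketch below assumes the file fits in memory. The helper name `stage_to_ram` is hypothetical, and the demo uses temporary directories in place of scratch and /dev/shm so that it runs anywhere.

```python
import shutil
import tempfile
from pathlib import Path


def stage_to_ram(src: Path, ram_root: Path = Path("/dev/shm")) -> Path:
    """Copy src into ram_root and return the path of the copy."""
    ram_root.mkdir(parents=True, exist_ok=True)
    dst = ram_root / src.name
    shutil.copy2(src, dst)
    return dst


# Demo with temporary directories standing in for scratch and /dev/shm,
# so the sketch also runs on machines without /dev/shm.
work = Path(tempfile.mkdtemp())
src_file = work / "toy_dataset.bin"
src_file.write_bytes(b"\x00" * 1024)

staged = stage_to_ram(src_file, ram_root=work / "fake_shm")
print(staged, staged.stat().st_size)
```

Files in /dev/shm count against the node's memory, so it is worth deleting the staged copy when the job finishes.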
⚡️ In this course, we provide some examples of how to create HDF5 and PyArrow files.
We need to download some code:
Move to the correct folder:
cd 2025-06-course-Bringing-Deep-Learning-Workloads-to-JSC-supercomputers/code/dataloading/
We used the ImageNet dataset for the examples.
ILSVRC
|-- Data/
    `-- CLS-LOC
        |-- test
        |-- train
        |   |-- n01440764
        |   |   |-- n01440764_10026.JPEG
        |   |   |-- n01440764_10027.JPEG
        |   |   |-- n01440764_10029.JPEG
        |   |-- n01695060
        |   |   |-- n01695060_10009.JPEG
        |   |   |-- n01695060_10022.JPEG
        |   |   |-- n01695060_10028.JPEG
        |   |   |-- ...
        |   |...
        |-- val
            |-- ILSVRC2012_val_00000001.JPEG
            |-- ILSVRC2012_val_00016668.JPEG
            |-- ILSVRC2012_val_00033335.JPEG
            |-- ...
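A layout like this, with one folder per class under train, can be turned into a list of (image path, label) pairs by walking the tree. The sketch below assumes labels are assigned by sorted folder name, which is a common convention but not necessarily how the course files were produced. The function name `list_train_samples` is hypothetical, and the demo builds a tiny fake tree with empty files.

```python
import tempfile
from pathlib import Path


def list_train_samples(train_dir: Path) -> dict[str, int]:
    """Map every *.JPEG under train_dir/<class>/ to an integer class index."""
    classes = sorted(p.name for p in train_dir.iterdir() if p.is_dir())
    class_to_idx = {name: i for i, name in enumerate(classes)}
    samples = {}
    for name in classes:
        for img in sorted((train_dir / name).glob("*.JPEG")):
            samples[str(img)] = class_to_idx[name]
    return samples


# Build a tiny stand-in for the train folder (empty files, two classes).
root = Path(tempfile.mkdtemp()) / "train"
toy_tree = {
    "n01440764": ["n01440764_10026.JPEG"],
    "n01695060": ["n01695060_10009.JPEG", "n01695060_10022.JPEG"],
}
for wnid, files in toy_tree.items():
    (root / wnid).mkdir(parents=True)
    for file_name in files:
        (root / wnid / file_name).touch()

samples = list_train_samples(root)
for path, label in samples.items():
    print(label, path)
```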
imagenet_train.pkl
{
'ILSVRC/Data/CLS-LOC/train/n03146219/n03146219_8050.JPEG': 524,
'ILSVRC/Data/CLS-LOC/train/n03146219/n03146219_12728.JPEG': 524,
'ILSVRC/Data/CLS-LOC/train/n03146219/n03146219_9736.JPEG': 524,
...
'ILSVRC/Data/CLS-LOC/train/n03146219/n03146219_7460.JPEG': 524,
...
}
imagenet_val.pkl
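A pickle file of this shape, a dictionary from relative image path to integer label, can be loaded once and exposed through `__len__` and `__getitem__`, which is the interface most data loaders expect. This is a minimal sketch in plain Python. The class name `PickleIndexDataset`, the toy file name, and the root directory are hypothetical, and the course code may differ.

```python
import pickle
import tempfile
from pathlib import Path


class PickleIndexDataset:
    """Index-style dataset backed by a {relative_path: label} pickle file."""

    def __init__(self, index_file, root=""):
        with open(index_file, "rb") as f:
            index = pickle.load(f)
        # Sort for a deterministic order regardless of dict insertion order.
        self.samples = sorted(index.items())
        self.root = Path(root)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, i):
        rel_path, label = self.samples[i]
        # A real dataset would open and decode the image here.
        return self.root / rel_path, label


# Write a toy index with two entries in the same format as shown above.
toy_index = {
    "ILSVRC/Data/CLS-LOC/train/n03146219/n03146219_8050.JPEG": 524,
    "ILSVRC/Data/CLS-LOC/train/n03146219/n03146219_12728.JPEG": 524,
}
index_file = Path(tempfile.mkdtemp()) / "toy_train.pkl"
with open(index_file, "wb") as f:
    pickle.dump(toy_index, f)

dataset = PickleIndexDataset(index_file, root="/hypothetical/imagenet/root")
path, label = dataset[0]
print(len(dataset), path, label)
```

Only unpickle files you created yourself or trust, since `pickle.load` can execute arbitrary code.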
A binary file format for storing large, complex datasets.
Store data like a file system inside a file.
Hierarchical: organizes data as groups and datasets
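The group and dataset hierarchy can be sketched with the h5py library, assuming h5py and NumPy are installed. The group name `train`, the dataset names, and the random toy images below are illustrative assumptions and not the layout used by the course examples.

```python
import os
import tempfile

import h5py
import numpy as np

# Toy data: four 8x8 RGB "images" and their integer labels.
images = np.random.randint(0, 256, size=(4, 8, 8, 3), dtype=np.uint8)
labels = np.array([524, 524, 1, 0], dtype=np.int64)

h5_path = os.path.join(tempfile.mkdtemp(), "toy.h5")

# Write: a group behaves like a folder, a dataset like a file holding an array.
with h5py.File(h5_path, "w") as f:
    train = f.create_group("train")
    train.create_dataset("images", data=images, compression="gzip")
    train.create_dataset("labels", data=labels)

# Read: slicing a dataset loads only the requested part from disk.
with h5py.File(h5_path, "r") as f:
    names = sorted(f["train"].keys())
    first_two = f["train/images"][:2]
    read_labels = f["train/labels"][:]

print(names, first_two.shape, read_labels)
```

Because slices are read on demand, a data loader can fetch single samples without loading the whole file into memory.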
A Python library that provides tools for Apache Arrow, an in-memory columnar data format
Stores data as tables, arrays, and record batches
The examples are in:
To create the HDF5 or PyArrow files, you can run the examples by launching:
To read those files, you can run: