Speedup Data loading
Alexandre Strube // Sabrina Benassou
October 17, 2023
Let’s talk about DATA
- Some general considerations to keep in mind
I/O is separate and shared
- All compute nodes of all supercomputers see the same files
- Performance tradeoff between shared accessibility and speed
- It’s simple to load data fast to 1 or 2 GPUs. But to 100? 1000? 10000?
Jülich Supercomputers
- Our I/O server is almost a supercomputer by itself
Where do I keep my files?
$PROJECT_projectname for code
- Most of your work should stay here
$DATA_projectname for big data(*)
- Permanent location for big datasets
$SCRATCH_projectname for temporary files (fast, but not permanent)
- Files are deleted after 90 days untouched
Data services
- JSC provides different data services
- Data projects give massive amounts of storage
- We use it for ML datasets. Join the project at Judoor
- After being approved, connect to the supercomputer and try it:
Data Staging
- The LARGEDATA filesystem is not accessible from compute nodes
- Copy files to an accessible filesystem BEFORE working
- Copying ImageNet-21K to $SCRATCH alone takes 21+ minutes
- We already copied it to $SCRATCH for you
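The staging step above can be sketched in Python. This is a minimal illustration, not the procedure JSC prescribes: the function name `stage_dataset` and the ".partial" suffix are hypothetical, and the source and destination paths are whatever your project uses. It copies a dataset directory once and skips the slow copy if the destination already exists.

```python
import shutil
from pathlib import Path


def stage_dataset(src: Path, dst: Path) -> Path:
    """Copy the dataset tree at src to dst unless dst already exists."""
    if dst.exists():
        # Assume an earlier run already staged the data; skip the slow copy.
        return dst
    dst.parent.mkdir(parents=True, exist_ok=True)
    # Copy under a temporary name first, so an interrupted copy is never
    # mistaken for a complete dataset on the next run.
    tmp = dst.with_name(dst.name + ".partial")
    if tmp.exists():
        shutil.rmtree(tmp)
    shutil.copytree(src, tmp)
    tmp.rename(dst)
    return dst
```

Because the source filesystem is not visible from compute nodes, a call like this would have to run where both filesystems are mounted, before the training job starts.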
Data loading
Strategies
- We have CPUs and lots of memory - let’s use them
- Overlap training with data loading for the next batch
- /dev/shm is a filesystem in RAM - ultra fast ⚡️
- Use big files made for parallel computing
- HDF5, Zarr, mmap() in a parallel fs, LMDB
- Use specialized data loading libraries
- Compression, such as squashfs
- Data transfer can be slower than decompression (must be checked case by case)
- Beneficial when there are numerous small files
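The idea of loading the next batch while the current one is being trained on can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions: `load_batch`, `prefetch` and `train` are hypothetical names, and the `time.sleep` calls stand in for real I/O and a real training step. In practice, data loading libraries provide this for you (for example, PyTorch's `DataLoader` with `num_workers` greater than zero).

```python
import queue
import threading
import time


def load_batch(index):
    """Hypothetical stand-in for reading and decoding one batch from disk."""
    time.sleep(0.01)  # simulate I/O latency
    return [index] * 4


def prefetch(num_batches, depth=2):
    """Yield batches while a background thread loads the next ones."""
    buffer = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the data

    def worker():
        for i in range(num_batches):
            buffer.put(load_batch(i))  # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        batch = buffer.get()
        if batch is done:
            return
        yield batch


def train(num_batches):
    """Consume prefetched batches; the sleep stands in for a training step."""
    seen = []
    for batch in prefetch(num_batches):
        time.sleep(0.01)  # simulate compute on the current batch
        seen.append(batch[0])
    return seen
```

The bounded queue is the design choice that matters here: `depth` caps how many batches sit in memory, so the loader runs ahead of training without filling up RAM.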