Speed Up Data Loading

Alexandre Strube // Sabrina Benassou

October 17, 2023

Let’s talk about DATA

$PROJECT_projectname for code
- Most of your work should stay here
$DATA_projectname for big data(*)
- Permanent location for big datasets
$SCRATCH_projectname for temporary files (fast, but not permanent)
- Files are deleted after 90 days untouched

The LARGEDATA filesystem is not accessible from compute nodes
- Copy files to an accessible filesystem BEFORE working
Copying ImageNet-21K to $SCRATCH alone takes 21+ minutes
- We already copied it to $SCRATCH for you
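
For your own datasets, the copy step can be a small guarded helper. This is only a sketch under assumptions: `stage_dataset` is a hypothetical name, and the source and destination paths are placeholders for your own $DATA_projectname and $SCRATCH_projectname locations. It skips the copy if the destination already exists, so repeated jobs do not pay the transfer time again.

```python
import shutil
from pathlib import Path


def stage_dataset(src: Path, dst: Path) -> bool:
    """Copy the directory tree `src` to `dst` unless `dst` already exists.

    Returns True if a copy was made, False if an existing copy was reused.
    (Hypothetical helper, not part of any course material.)
    """
    if dst.exists():
        return False
    dst.parent.mkdir(parents=True, exist_ok=True)
    # Copy under a temporary name first, so an interrupted copy is never
    # mistaken for a complete dataset on the next run.
    tmp = dst.with_name(dst.name + ".partial")
    if tmp.exists():
        shutil.rmtree(tmp)
    shutil.copytree(src, tmp)
    tmp.rename(dst)
    return True
```

Run it once on a node that can see both filesystems, before the training job starts, not inside the training loop.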

We have CPUs and lots of memory - let’s use them
- Overlap training with loading the data for the next batch
- /dev/shm is a filesystem in RAM - ultra fast ⚡️
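
Overlapping training with loading of the next batch can be sketched with a background thread and a bounded queue. This is a minimal standard-library illustration, not a recommendation over framework tools: `prefetch` and its `depth` parameter are hypothetical names. In PyTorch, `torch.utils.data.DataLoader` with `num_workers` greater than zero gives the same overlap using worker processes.

```python
import queue
import threading
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def prefetch(batches: Iterable[T], depth: int = 2) -> Iterator[T]:
    """Yield items from `batches` while a background thread loads up to
    `depth` further items ahead of the consumer."""
    q: queue.Queue = queue.Queue(maxsize=depth)

    def producer() -> None:
        try:
            for batch in batches:
                q.put(("item", batch))  # blocks while the queue is full
            q.put(("done", None))
        except Exception as exc:
            # Forward loader errors to the consumer instead of losing them.
            q.put(("error", exc))

    # Daemon thread: it will not keep the process alive if the consumer
    # stops iterating early.
    threading.Thread(target=producer, daemon=True).start()

    while True:
        kind, value = q.get()
        if kind == "done":
            return
        if kind == "error":
            raise value
        yield value
```

Usage would look like `for batch in prefetch(loader, depth=4): train_step(batch)`, where `loader` and `train_step` stand for your own code. Threads suit loading that is dominated by file I/O. CPU-heavy preprocessing in Python usually needs processes instead.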
Use big files made for parallel computing
- HDF5, Zarr, mmap() in a parallel fs, LMDB
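
As an illustration of the mmap() option, the sketch below stores all samples back to back in one file and reads single records through a read-only memory map. The record layout (four float32 values) and the names `write_records` and `MappedDataset` are assumptions made for this example. Real datasets would more likely use HDF5, Zarr or LMDB through their own libraries.

```python
import mmap
import struct
from pathlib import Path

# Hypothetical sample layout: four little-endian float32 values per record.
RECORD = struct.Struct("<4f")


def write_records(path: Path, samples) -> None:
    """Pack all samples back to back into a single file."""
    with open(path, "wb") as f:
        for sample in samples:
            f.write(RECORD.pack(*sample))


class MappedDataset:
    """Random access to fixed-size records through a read-only memory map."""

    def __init__(self, path: Path) -> None:
        self._file = open(path, "rb")
        # Length 0 maps the whole file; this fails on an empty file.
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    def __len__(self) -> int:
        return len(self._map) // RECORD.size

    def __getitem__(self, index: int):
        if not 0 <= index < len(self):
            raise IndexError(index)
        # The OS pages in only the part of the file backing this record
        # (plus whatever read-ahead it decides to do).
        return RECORD.unpack_from(self._map, index * RECORD.size)

    def close(self) -> None:
        self._map.close()
        self._file.close()
```

One big file means one open() on the parallel filesystem, not one per sample, and several workers can map the same file read-only at the same time.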
Use specialized data loading libraries
- FFCV, DALI, Apache Arrow
Compression such as squashfs
- Data transfer can be slower than decompression (must be checked case by case)
- Beneficial when the dataset consists of many small files
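
The many-small-files case can be illustrated from Python with a compressed tar archive. This is a stand-in, not squashfs itself: squashfs images are built and mounted with system tools, not from Python. The helper names `pack_directory` and `iter_samples` are hypothetical. The idea is the same: the filesystem sees one large compressed file and not thousands of tiny ones.

```python
import tarfile
from pathlib import Path


def pack_directory(src: Path, archive: Path) -> int:
    """Pack every regular file under `src` into one gzip-compressed tar
    archive. Returns the number of files packed.

    `archive` should live outside `src`.
    """
    count = 0
    with tarfile.open(archive, "w:gz") as tar:
        for path in sorted(src.rglob("*")):
            if path.is_file():
                # Store paths relative to `src` so the archive is relocatable.
                tar.add(path, arcname=path.relative_to(src).as_posix())
                count += 1
    return count


def iter_samples(archive: Path):
    """Yield (name, bytes) pairs by streaming through the archive once."""
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member).read()
```

A gzip-compressed tar only supports efficient sequential reads, which fits epoch-style streaming but not random access. Whether the decompression cost pays off still has to be measured on the target filesystem.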