Bringing Deep Learning Workloads to JSC supercomputers

Data loading

Alexandre Strube // Sabrina Benassou // José Ignacio Robledo

December 5th, 2024

Schedule for day 2

Time Title
10:00 - 10:15 Welcome, questions
10:15 - 11:30 Data loading
11:30 - 12:00 Coffee Break (flexible)
12:30 - 14:00 Parallelize Training

Let’s talk about DATA

I/O is separate and shared

  • All compute nodes of all supercomputers see the same files
  • Performance tradeoff between shared accessibility and speed
  • Our I/O server is almost a supercomputer by itself

[Figure: JSC Supercomputer Strategy]

Where do I keep my files?

  • Always store your code in the project folder ($PROJECT_projectname). In our case:

    /p/project/training2449/$USER
  • Store data in the scratch directory for faster I/O access ($SCRATCH_projectname). Files in scratch are deleted after 90 days of inactivity.

    /p/scratch/training2449/$USER
  • Store the data in $DATA_dataset for a more permanent location. This location is not accessible by compute nodes, so please copy the data to scratch to ensure your job can access it (a minimal copy sketch follows below).

    /p/data1/datasets
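
A minimal sketch of such a copy in Python, assuming a hypothetical dataset folder name; any other copy tool (rsync, cp) works just as well.

    # Hypothetical sketch: stage a dataset from the $DATA location (not reachable
    # from compute nodes) to scratch. "some_dataset" is an example name.
    import shutil

    shutil.copytree(
        "/p/data1/datasets/some_dataset",          # source: $DATA location
        "/p/scratch/training2449/some_dataset",    # target: scratch, visible to compute nodes
        dirs_exist_ok=True,
    )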

Data loading

  • We have CPUs and lots of memory - let’s use them
  • If your dataset is relatively small (< 500 GB) and can fit into the working memory (RAM) of each compute node (along with the program state), you can store it in /dev/shm. This is a special filesystem that uses RAM for storage, making it extremely fast for data access. ⚡️ (See the staging sketch after this list.)
  • For bigger datasets (> 500 GB) there are several strategies:
    • Hierarchical Data Format 5 (HDF5)
    • Apache Arrow
    • NVIDIA Data Loading Library (DALI)
    • SquashFS
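
A minimal staging sketch for the /dev/shm approach mentioned above; the archive name and the $SCRATCH_training2449 variable are assumptions, and the unpacking should run only once per node (e.g. from the first process on each node).

    # Hypothetical sketch: unpack a small dataset archive into node-local RAM.
    import os
    import shutil

    ram_dir = "/dev/shm/my_dataset"    # RAM-backed, local to each compute node
    archive = os.path.join(os.environ["SCRATCH_training2449"], "my_dataset.tar")  # assumed archive

    if not os.path.isdir(ram_dir):     # do this only once per node
        os.makedirs(ram_dir, exist_ok=True)
        shutil.unpack_archive(archive, ram_dir)

    # afterwards, point your Dataset's root at ram_dir instead of scratch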

Data loading

  • In this course, we will demonstrate how to store your data in HDF5 and PyArrow files.
  • We will use the ImageNet dataset to create HDF5 and PyArrow files.

But before

  • We need to download some code

    cd $HOME/course
    git clone https://github.com/HelmholtzAI-FZJ/2024-12-course-Bringing-Deep-Learning-Workloads-to-JSC-supercomputers.git
  • Move to the correct folder

    cd 2024-12-course-Bringing-Deep-Learning-Workloads-to-JSC-supercomputers/code/dataloading/

The ImageNet dataset

Large Scale Visual Recognition Challenge (ILSVRC)

  • An image dataset organized according to the WordNet hierarchy.
  • Extensively used in algorithms for object detection and image classification at large scale.
  • It has 1,000 classes, comprising 1.2 million training images and 50,000 validation images.

The ImageNet dataset

ILSVRC
|-- Data/
    `-- CLS-LOC
        |-- test
        |-- train
        |   |-- n01440764
        |   |   |-- n01440764_10026.JPEG
        |   |   |-- n01440764_10027.JPEG
        |   |   |-- n01440764_10029.JPEG
        |   |-- n01695060
        |   |   |-- n01695060_10009.JPEG
        |   |   |-- n01695060_10022.JPEG
        |   |   |-- n01695060_10028.JPEG
        |   |   |-- ...
        |   |...
        |-- val
            |-- ILSVRC2012_val_00000001.JPEG  
            |-- ILSVRC2012_val_00016668.JPEG  
            |-- ILSVRC2012_val_00033335.JPEG      
            |-- ...

The ImageNet dataset

imagenet_train.pkl

{
    'ILSVRC/Data/CLS-LOC/train/n03146219/n03146219_8050.JPEG': 524,
    'ILSVRC/Data/CLS-LOC/train/n03146219/n03146219_12728.JPEG': 524,
    'ILSVRC/Data/CLS-LOC/train/n03146219/n03146219_9736.JPEG': 524,
    ...
    'ILSVRC/Data/CLS-LOC/train/n03146219/n03146219_7460.JPEG': 524,
    ...
 }

imagenet_val.pkl

{
    'ILSVRC/Data/CLS-LOC/val/ILSVRC2012_val_00008838.JPEG': 785,
    'ILSVRC/Data/CLS-LOC/val/ILSVRC2012_val_00008555.JPEG': 129,
    'ILSVRC/Data/CLS-LOC/val/ILSVRC2012_val_00028410.JPEG': 968,
    ...
    'ILSVRC/Data/CLS-LOC/val/ILSVRC2012_val_00016007.JPEG': 709,
 }
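
A mapping like the one above could, for example, be built by walking the directory tree shown earlier; the sketch below is one possibility (the label assignment is an assumption, not necessarily the one used for the course files).

# Hypothetical sketch of how a path -> label mapping like imagenet_train.pkl could be built
import os
import pickle

train_root = "ILSVRC/Data/CLS-LOC/train"
classes = sorted(os.listdir(train_root))                # n01440764, n01695060, ...
class_to_idx = {c: i for i, c in enumerate(classes)}    # assumed label assignment

mapping = {}
for c in classes:
    for fname in os.listdir(os.path.join(train_root, c)):
        mapping[os.path.join(train_root, c, fname)] = class_to_idx[c]

with open("imagenet_train.pkl", "wb") as f:
    pickle.dump(mapping, f)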

Access File System

import os
import pickle

from PIL import Image
from torch.utils.data import Dataset


class ImageNet(Dataset):
    def __init__(self, root, split, transform=None):
        self.root = root
        file_name = "imagenet_train.pkl" if split == "train" else "imagenet_val.pkl"

        # The pickle maps relative image paths to integer class labels
        with open(os.path.join(root, file_name), "rb") as f:
            data = pickle.load(f)

        self.samples = list(data.keys())
        self.targets = list(data.values())
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # Every __getitem__ call opens one small file on the shared filesystem
        x = Image.open(os.path.join(self.root, self.samples[idx])).convert("RGB")
        if self.transform:
            x = self.transform(x)
        return x, self.targets[idx]
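
A minimal usage sketch for this Dataset; the root path, transform, and DataLoader settings below are assumptions rather than course defaults.

from torch.utils.data import DataLoader
from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# root is wherever the pickle files and the ILSVRC/ tree live (assumed path)
train_set = ImageNet(root="/p/scratch/training2449/data", split="train", transform=transform)
train_loader = DataLoader(train_set, batch_size=256, shuffle=True, num_workers=8)

images, labels = next(iter(train_loader))   # one batch of decoded, transformed images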
   

Inodes

  • Inodes (Index Nodes) are data structures that store metadata about files and directories.
  • Unique identification of files and directories within the file system.
  • Efficient management and retrieval of file metadata.
  • Essential for file operations like opening, reading, and writing.
  • Limitations:
    • Fixed number: a filesystem has a limited number of inodes; once they are exhausted, no new files can be created even if free disk space remains.
    • Space consumption: inodes themselves take up disk space, so their number has to be balanced against usable capacity.
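
The inode budget of a filesystem can be checked with df -i; the same numbers are also available from Python, as in this small sketch (the path is just an example).

    # Report inode usage for a filesystem (roughly what df -i shows)
    import os

    st = os.statvfs("/p/scratch/training2449")   # example path
    print(f"inodes total: {st.f_files}, free: {st.f_favail}")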

PyArrow File Creation

    binary_t = pa.binary()
    uint16_t = pa.uint16()

PyArrow File Creation

    # Column types: raw JPEG bytes as variable-length binary, labels as uint16
    binary_t = pa.binary()
    uint16_t = pa.uint16()

    # Two columns per sample: the encoded image and its class label
    schema = pa.schema([
        pa.field('image_data', binary_t),
        pa.field('label', uint16_t),
    ])

PyArrow File Creation

    # One Arrow IPC file per split, e.g. ImageNet-train.arrow, in the target folder
    with pa.OSFile(
            os.path.join(args.target_folder, f'ImageNet-{split}.arrow'),
            'wb',
    ) as f:
        # The writer appends record batches to the file one after another
        with pa.ipc.new_file(f, schema) as writer:

PyArrow File Creation

    # Each image becomes a one-row record batch: raw JPEG bytes plus its label
    for sample, label in tqdm(zip(samples, targets)):
        with open(os.path.join(args.data_root, sample), 'rb') as f:
            img_string = f.read()

        image_data = pa.array([img_string], type=binary_t)
        label = pa.array([label], type=uint16_t)

        batch = pa.record_batch([image_data, label], schema=schema)

        writer.write(batch)
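
Once the writer is closed, the file can be re-opened for a quick sanity check; the file name below assumes the train split was written into the current directory.

    # Quick sanity check of the Arrow file written above (path is an assumption)
    import pyarrow as pa

    with pa.OSFile('ImageNet-train.arrow', 'rb') as f:
        reader = pa.ipc.open_file(f)
        print(reader.num_record_batches, 'record batches')   # one batch per image
        print(reader.get_batch(0).schema)                     # image_data: binary, label: uint16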

Access Arrow File

def __getitem__(self, idx):
    # Open the Arrow file lazily, so that each DataLoader worker gets its own handle
    if self.arrowfile is None:
        self.arrowfile = pa.OSFile(self.data_root, 'rb')
        self.reader = pa.ipc.open_file(self.arrowfile)

    # The writer stored one record batch per sample
    row = self.reader.get_batch(idx)

    img_string = row['image_data'][0].as_py()
    target = row['label'][0].as_py()

    # Decode the JPEG bytes back into a PIL image
    with io.BytesIO(img_string) as byte_stream:
        with Image.open(byte_stream) as img:
            img = img.convert("RGB")

    if self.transform:
        img = self.transform(img)

    return img, target
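
The __getitem__ above only shows the read path. A possible __init__ and __len__ to go with it could look like the sketch below (attribute names follow the snippet above; the class name and constructor arguments are assumptions). Opening the file lazily inside __getitem__ matters because each DataLoader worker process should create its own file handle rather than inherit one from the parent process.

import pyarrow as pa
from torch.utils.data import Dataset

class ImageNetArrow(Dataset):        # hypothetical name
    def __init__(self, data_root, transform=None):
        self.data_root = data_root   # path to e.g. ImageNet-train.arrow
        self.transform = transform
        self.arrowfile = None        # opened lazily in __getitem__, once per worker
        self.reader = None
        # The writer stored one record batch per image, so the number of
        # record batches equals the number of samples
        with pa.OSFile(data_root, 'rb') as f:
            self._len = pa.ipc.open_file(f).num_record_batches

    def __len__(self):
        return self._len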

HDF5


# One HDF5 file; each split ('train'/'val') becomes a group inside it
with h5py.File(os.path.join(args.target_folder, 'ImageNet.h5'), "w") as f:

HDF5


# Create one group per split inside the open file handle f
group = f.create_group(split)

HDF5

# Variable-length uint8 arrays hold the raw JPEG bytes; labels are stored as int16
dt_sample = h5py.vlen_dtype(np.dtype(np.uint8))
dt_target = np.dtype('int16')

# One dataset entry per image in the split
dset = group.create_dataset(
    'images',
    (len(samples),),
    dtype=dt_sample,
)

dtargets = group.create_dataset(
    'targets',
    (len(samples),),
    dtype=dt_target,
)

HDF5

# Store the raw JPEG bytes of each image as a uint8 array; decoding happens at read time
for idx, (sample, target) in tqdm(enumerate(zip(samples, targets))):
    with open(sample, 'rb') as f:
        img_string = f.read()
        dset[idx] = np.array(list(img_string), dtype=np.uint8)
        dtargets[idx] = target

HDF5

Access h5 File

def __getitem__(self, idx):
    # Open the HDF5 file lazily, so that each DataLoader worker gets its own handle
    if self.h5file is None:
        self.h5file = h5py.File(self.train_data_path, 'r')[self.split]
        self.imgs = self.h5file["images"]
        self.targets = self.h5file["targets"]

    img_string = self.imgs[idx]
    target = self.targets[idx]

    # Decode the JPEG bytes back into a PIL image
    with io.BytesIO(img_string) as byte_stream:
        with Image.open(byte_stream) as img:
            img = img.convert("RGB")

    if self.transform:
        img = self.transform(img)

    return img, target
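
As with the Arrow reader, a matching __init__ and __len__ could look like this sketch (attribute names follow the snippet above; the class name and constructor arguments are assumptions).

import h5py
from torch.utils.data import Dataset

class ImageNetH5(Dataset):           # hypothetical name
    def __init__(self, train_data_path, split, transform=None):
        self.train_data_path = train_data_path   # path to ImageNet.h5
        self.split = split                        # "train" or "val" group in the file
        self.transform = transform
        self.h5file = None                        # opened lazily, once per DataLoader worker
        self.imgs = None
        self.targets = None
        # Read the length once and close the handle again, so it is not inherited by workers
        with h5py.File(train_data_path, 'r') as f:
            self._len = len(f[split]['targets'])

    def __len__(self):
        return self._len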

DEMO

Exercise

  • Can you create an Arrow file for the Flickr30K dataset stored in /p/scratch/training2449/data/Flickr30K/ and read it using a DataLoader?