## Buffered Python generators for data augmentation

September 13, 2015

When using deep convolutional neural networks (CNNs) for image classification tasks, it’s common to apply several transformations to the images in order to augment the data and reduce overfitting. For example, images are often randomly cropped, mirrored, rotated, and blurred to artificially increase the number of training examples. It’s much more efficient to do this in real-time rather than store extra transformed images on disk.

Large image datasets that won’t fit in memory all at once are loaded into GPU memory one batch at a time during training, and data augmentation steps are usually applied on the CPU. In Python, generators are a natural way to iterate over these batches, as in these Lasagne-style and Keras-style training loops:

```python
for batch in iterate_minibatches(X_train, y_train, 500):
    inputs, targets = batch
    train_err += train_fn(inputs, targets)
```

```python
for X_batch, Y_batch in datagen.flow(X_train, Y_train):
    loss = model.train_on_batch(X_batch, Y_batch)
```


Ideally you don’t want the GPU sitting idle while the CPU does preprocessing, because that slows down training. Krizhevsky et al. (2012) describe loading batches and applying data augmentation asynchronously: a separate CPU process loads the next batch into memory and applies any preprocessing, so that a fresh batch is ready to be sent to the GPU as soon as training on the current batch has finished.

I’ve used Caffe quite a bit, and it handles asynchronous batch processing for you. However, I noticed my Python CNN training was getting bogged down when doing lots of data augmentation. After some Googling, I found a nice bit of code in Sander Dieleman’s winning solution for the Kaggle National Data Science Bowl that lets you run any slow generator in a separate thread or process.

## Speedup with Keras CIFAR10

The Keras CIFAR10 example, whose training loop appears above, is a good case where a buffered generator can speed up training. The script applies several preprocessing steps to each batch of images via the `ImageDataGenerator` class.

### Case with no buffering

Running the Keras example via `python cifar10_cnn.py`, we get the following output for the first few epochs:

```
Using gpu device 0: GRID K520 (CNMeM is disabled)
X_train shape: (50000, 3, 32, 32)
50000 train samples
10000 test samples
Using real time data augmentation
----------------------------------------
Epoch 0
----------------------------------------
Training...
50000/50000 [==============================] - 238s - train loss: 1.6063
Testing...
10000/10000 [==============================] - 42s - test loss: 1.2623
----------------------------------------
Epoch 1
----------------------------------------
Training...
50000/50000 [==============================] - 237s - train loss: 1.2726
Testing...
10000/10000 [==============================] - 42s - test loss: 1.1471
----------------------------------------
Epoch 2
----------------------------------------
Training...
50000/50000 [==============================] - 238s - train loss: 1.1211
Testing...
10000/10000 [==============================] - 42s - test loss: 0.9634
----------------------------------------
Epoch 3
----------------------------------------
Training...
50000/50000 [==============================] - 237s - train loss: 1.0355
Testing...
10000/10000 [==============================] - 42s - test loss: 0.9138
----------------------------------------
Epoch 4
----------------------------------------
Training...
50000/50000 [==============================] - 236s - train loss: 0.9748
Testing...
10000/10000 [==============================] - 42s - test loss: 0.8442
----------------------------------------
Epoch 5
----------------------------------------
Training...
50000/50000 [==============================] - 237s - train loss: 0.9340
Testing...
10000/10000 [==============================] - 42s - test loss: 0.8224
```


### With buffering

Now we run a modified version of the script that wraps the data generator in Dieleman’s handy `buffered_gen_threaded` function so that batches are prepared asynchronously.
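
The full modified script isn’t reproduced here; the interesting piece is the buffered generator itself. Below is a simplified sketch of the idea behind `buffered_gen_threaded` (my own paraphrase with illustrative details, not Dieleman’s exact code): a background thread eagerly pulls batches from the slow source generator into a bounded queue while the main thread trains.

```python
import queue
import threading

def buffered_gen_threaded(source_gen, buffer_size=2):
    """Yield items from source_gen, computing up to buffer_size of them
    ahead of the consumer in a background thread."""
    assert buffer_size >= 2, "buffer_size must be at least 2"
    # One item is always "in flight" with the consumer, so the queue
    # only needs buffer_size - 1 slots.
    buf = queue.Queue(maxsize=buffer_size - 1)
    sentinel = object()  # unique marker for the end of the source

    def producer():
        for item in source_gen:
            buf.put(item)  # blocks while the buffer is full
        buf.put(sentinel)

    # Note: since this is itself a generator, the producer thread only
    # starts once the first batch is requested.
    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is sentinel:
            return
        yield item
```

With a helper like this, the only change to the Keras loop shown earlier is to wrap the generator: `for X_batch, Y_batch in buffered_gen_threaded(datagen.flow(X_train, Y_train), buffer_size=2): ...`.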

Running it with `python cifar10_cnn_buffered.py` produces:

```
Using gpu device 0: GRID K520 (CNMeM is disabled)
X_train shape: (50000, 3, 32, 32)
50000 train samples
10000 test samples
Using real time data augmentation with buffer_size = 2
----------------------------------------
Epoch 0
----------------------------------------
Training...
50000/50000 [==============================] - 207s - train loss: 1.5903
Testing...
10000/10000 [==============================] - 40s - test loss: 1.2370
----------------------------------------
Epoch 1
----------------------------------------
Training...
50000/50000 [==============================] - 207s - train loss: 1.2662
Testing...
10000/10000 [==============================] - 40s - test loss: 1.0837
----------------------------------------
Epoch 2
----------------------------------------
Training...
50000/50000 [==============================] - 207s - train loss: 1.1250
Testing...
10000/10000 [==============================] - 40s - test loss: 0.9738
----------------------------------------
Epoch 3
----------------------------------------
Training...
50000/50000 [==============================] - 207s - train loss: 1.0401
Testing...
10000/10000 [==============================] - 40s - test loss: 0.9147
----------------------------------------
Epoch 4
----------------------------------------
Training...
50000/50000 [==============================] - 207s - train loss: 0.9810
Testing...
10000/10000 [==============================] - 40s - test loss: 0.8359
----------------------------------------
Epoch 5
----------------------------------------
Training...
50000/50000 [==============================] - 207s - train loss: 0.9354
Testing...
10000/10000 [==============================] - 40s - test loss: 0.8138
```


### Differences

Each training epoch runs about 30 seconds faster with the buffered version. The mean GPU utilization is also higher when running the buffered version.

Here is the GPU utilization during the first epoch without buffering, with a mean of ~11%:

Here is the GPU utilization during the first epoch with buffering, with a mean of ~13%:

These plots were made using an IPython/Jupyter notebook and the Python API for the NVIDIA Management Library (NVML):

http://nbviewer.ipython.org/gist/jimgoo/76ce3caaa25ed02084f7

The notebook plots the percentage of GPU utilization for all of your GPUs every few seconds and comes in handy for identifying preprocessing bottlenecks. The benefits of using the buffered generator increase with the number and complexity of the preprocessing steps. In this case, only a few simple ones are used (mirroring and normalization by constants), so the difference is not as dramatic.
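
The polling loop itself is short. Here is a sketch, assuming the `pynvml` bindings to NVML (the `nvidia-ml-py` package) and at least one NVIDIA GPU are available; this is illustrative, not the notebook’s exact code:

```python
import time
import pynvml  # NVML bindings, e.g. pip install nvidia-ml-py

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

# Sample per-GPU utilization once per second for one minute.
for _ in range(60):
    utilization = [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles]
    print(utilization)  # list of utilization percentages, one per GPU
    time.sleep(1.0)

pynvml.nvmlShutdown()
```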

These results used `buffered_gen_threaded` with a buffer size of 2. There is also `buffered_gen_mp`, which uses the multiprocessing module instead of the threading module. It sidesteps the Python Global Interpreter Lock (GIL) and is useful when other libraries hold the GIL for long stretches (h5py, for example), at the cost of extra process startup and (de)serialization overhead. See this discussion for more info.
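
A process-based buffer can be sketched the same way. This is again a simplified paraphrase rather than Dieleman’s exact `buffered_gen_mp`; it assumes the fork start method (so Unix only, letting the child process inherit the generator without pickling it), and every batch must itself be picklable:

```python
import multiprocessing as mp

def buffered_gen_mp(source_gen, buffer_size=2):
    """Like a threaded buffer, but the producer runs in a separate process,
    sidestepping the GIL at the cost of pickling every item."""
    assert buffer_size >= 2, "buffer_size must be at least 2"
    ctx = mp.get_context("fork")  # child inherits source_gen via fork
    buf = ctx.Queue(maxsize=buffer_size - 1)
    sentinel = "__generator_exhausted__"  # must survive pickling

    def producer():
        for item in source_gen:
            buf.put(item)  # item is serialized and sent to the consumer
        buf.put(sentinel)

    proc = ctx.Process(target=producer, daemon=True)
    proc.start()
    while True:
        item = buf.get()
        if item == sentinel:
            break
        yield item
    proc.join()
```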