Classifying images in the Oxford 102 flower dataset with CNNs

September 17, 2015

I’ve added some code on GitHub for training deep convolutional neural networks to classify images in the Oxford 102 category flower dataset, using the lovely Caffe framework. The prototxt files for fine-tuning AlexNet and VGG_S models are included, and both start from weights pre-trained on the ILSVRC 2012 (ImageNet) data.

You can get the code via:

git clone https://github.com/jimgoo/caffe-oxford102

To download the Oxford 102 dataset, prepare the Caffe image files, and download pre-trained model weights for AlexNet and VGG_S, run:

python bootstrap.py

This will give you some pretty flower pictures:

[Sample images of flowers from the Oxford 102 dataset]

The categories are split into training, validation, and test sets. It seems odd that there are more test images than training images, but we’ll nonetheless train on the provided training set and evaluate on the test set.

[Plot of the number of images in the training, validation, and test splits]
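If you want to check the split sizes yourself, here’s a minimal sketch using scipy; the path to setid.mat is an assumption about where bootstrap.py leaves the downloads:

from scipy.io import loadmat

# setid.mat holds 1-indexed image IDs for each split
# (the path is hypothetical -- adjust to wherever bootstrap.py saved it)
splits = loadmat('data/setid.mat')
for key in ('trnid', 'valid', 'tstid'):
    print('%s: %d' % (key, splits[key].size))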

AlexNet

This model is a slightly modified version of the ILSVRC 2012 winning AlexNet. The number of outputs in the final inner product layer has been set to 102 to reflect the number of flower categories. The hyperparameter choices in AlexNet/solver.prototxt follow those in the Caffe example Fine-tuning CaffeNet for Style Recognition on “Flickr Style” Data: the global learning rate is reduced, while the learning rate for the final fully connected layer is increased relative to the other layers.

Once you’ve run the bootstrap.py script, you can begin training from this directory with:

cd AlexNet
$CAFFE_HOME/build/tools/caffe train -solver solver.prototxt -weights pretrained-weights.caffemodel -gpu 0
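Equivalently, training can be driven from Python. Here’s a minimal sketch, assuming pycaffe is importable and the script is run from the AlexNet directory:

import caffe

caffe.set_device(0)
caffe.set_mode_gpu()

# load the solver configuration, then copy the ImageNet weights
# into every layer whose name matches one in the pre-trained net
solver = caffe.SGDSolver('solver.prototxt')
solver.net.copy_from('pretrained-weights.caffemodel')
solver.solve()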

After 14,000 iterations, the test accuracy is 80%:

I0918 00:02:19.772845 67440 solver.cpp:266] Iteration 14000, Testing net (#0)
I0918 00:02:45.828433 67440 solver.cpp:315]     Test net output #0: accuracy = 0.8
I0918 00:02:45.978008 67440 solver.cpp:189] Iteration 14000, loss = 0.000117275

The Caffe model can be downloaded here. You can also use the Caffe utility to download from its gist:

cd $CAFFE_HOME
./scripts/download_model_from_gist.sh 0179e52305ca768a601f <dirname>
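Once you have the fine-tuned weights, classifying a single image from Python looks roughly like this. The deploy prototxt, snapshot, mean file, and image paths below are assumptions to adapt to your setup:

import numpy as np
import caffe

caffe.set_mode_gpu()

# hypothetical paths -- substitute your own deploy file, snapshot, and image
net = caffe.Net('AlexNet/deploy.prototxt',
                'AlexNet/snapshot_iter_14000.caffemodel',
                caffe.TEST)

# standard Caffe preprocessing: channel-first, BGR, mean-subtracted, 0-255
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))
transformer.set_mean('data', np.load('ilsvrc_2012_mean.npy').mean(1).mean(1))
transformer.set_raw_scale('data', 255)
transformer.set_channel_swap('data', (2, 1, 0))

img = caffe.io.load_image('flower.jpg')
net.blobs['data'].reshape(1, 3, 227, 227)
net.blobs['data'].data[...] = transformer.preprocess('data', img)
out = net.forward()
print('predicted class: %d' % out['prob'].argmax())  # assumes the output blob is named 'prob'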

VGG_S

This is another popular CNN from the University of Oxford’s Visual Geometry Group (VGG). On ILSVRC 2012, it achieves a top-5 error rate of 13.1%, compared to 15.3% for AlexNet.

Getting the prototxt file set up for training took a little more work because only the deploy.prototxt file was provided. I added the same per-layer learning rate multipliers and weight initialization schemes as in the AlexNet configuration, although the latter is redundant when starting from pre-trained weights. The same random cropping and mirroring are used as well.

To train:

cd VGG_S
$CAFFE_HOME/build/tools/caffe train -solver solver.prototxt -weights pretrained-weights.caffemodel -gpu 1

After 14,000 iterations, this model does a little better, with a test accuracy of 82%:

I0918 03:49:00.571482 68176 solver.cpp:266] Iteration 14000, Testing net (#0)
I0918 03:49:59.285096 68176 solver.cpp:315]     Test net output #0: accuracy = 0.824516
I0918 03:49:59.753748 68176 solver.cpp:189] Iteration 14000, loss = 0.000275362

AlexNet uses a crop size of 227 x 227, while VGG_S uses 224 x 224, so it’s not an exact comparison. Accuracy on the test set evolves as follows:

[Plot of test accuracy versus training iteration for AlexNet and VGG_S]
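These curves come straight from the Caffe training logs. Here’s a minimal sketch for extracting them, with regexes keyed to the log lines shown above (the log filename is a placeholder):

import re
import matplotlib.pyplot as plt

iters, accs = [], []
cur_iter = None
with open('caffe.log') as f:  # hypothetical: redirect Caffe's stderr to this file
    for line in f:
        m = re.search(r'Iteration (\d+), Testing net', line)
        if m:
            cur_iter = int(m.group(1))
        m = re.search(r'Test net output #0: accuracy = ([\d.]+)', line)
        if m and cur_iter is not None:
            iters.append(cur_iter)
            accs.append(float(m.group(1)))

plt.plot(iters, accs)
plt.xlabel('iteration')
plt.ylabel('test accuracy')
plt.show()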

GPU Utilization

I ran the two training jobs at the same time on two GPUs and monitored GPU usage as in my last post with Keras. AlexNet was on GPU 1 and VGG_S was on GPU 2. Notice how GPU utilization stays maxed out, apart from dips during testing:

[Plot of utilization over time for both GPUs]
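A trace like this can be captured by polling nvidia-smi. A rough sketch (the query flags are standard nvidia-smi options; the one-second interval and output filename are arbitrary):

import subprocess
import time

# poll nvidia-smi once per second and append per-GPU utilization to a CSV;
# stop with Ctrl-C
with open('gpu-util.csv', 'w') as log:
    while True:
        out = subprocess.check_output(
            ['nvidia-smi',
             '--query-gpu=index,utilization.gpu',
             '--format=csv,noheader,nounits'])
        for line in out.decode().strip().splitlines():
            log.write('%f,%s\n' % (time.time(), line))
        log.flush()
        time.sleep(1)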

I’ve yet to get my Python implementations of these models to be as efficient.

Caffe on AWS

Installing Caffe and the latest cuDNN libraries is no trivial matter. Luckily, there is an Amazon EC2 image ready to go with Caffe, CUDA 7, and cuDNN (ami-763a311e). It works with the g2.2xlarge (1 x K520) and g2.8xlarge (4 x K520) GPU instances. In this case, CAFFE_HOME = /home/ubuntu/caffe.
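Once the instance is up, a quick sanity check that pycaffe sees the GPU; this should fail loudly on a CPU-only build:

import caffe

# both calls abort with an error unless Caffe was built with GPU support
caffe.set_device(0)
caffe.set_mode_gpu()
print('Caffe GPU mode OK')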

Notes

  • The class labels for each species were deduced by GitHub user m-co and can be found in the file class-labels.py. They are in order from class 1 to class 102, as used in the .mat files.

  • These models were trained using the mean image for ILSVRC 2012 instead of the mean of the actual Oxford dataset. This was more out of laziness than anything else; a sketch for computing the dataset mean follows these notes.

  • This paper reports 87% top-1 accuracy on the Oxford-102 dataset using an SVM on features from the OverFeat model. I couldn’t tell which split they used for training.
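For completeness, here’s a minimal sketch of how the Oxford mean image could be computed with pycaffe; the image directory is an assumption about the extracted dataset layout:

import glob
import numpy as np
import caffe

# average every Oxford 102 image after resizing to the network input scale
# (the glob path is hypothetical -- point it at the extracted jpg directory)
paths = glob.glob('data/oxford102/jpg/*.jpg')
mean = np.zeros((256, 256, 3))
for p in paths:
    img = caffe.io.load_image(p)                  # HxWx3, RGB, values in [0, 1]
    mean += caffe.io.resize_image(img, (256, 256))
mean = mean / len(paths) * 255.0                  # back to the 0-255 range

# convert to Caffe's layout: BGR, channels first;
# pycaffe's Transformer can consume the resulting .npy directly
mean_bgr = mean[:, :, ::-1].transpose(2, 0, 1)
np.save('oxford102_mean.npy', mean_bgr)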