Comparison of Google Machine Learning Cloud's single GPU vs Nvidia Titan X

October 19, 2016


This was a quick and dirty test I did with some trial hours on the Google Machine Learning Cloud. I was curious how one of their Tensor Processing Units (TPUs) compares to a single Nvidia Titan X in terms of training time on an MNIST benchmark.

The summary is that they’re fairly similar, but the lag between submitting a job to the ML Cloud and getting output from your code made it not worthwhile for me compared to just using the local Titan machine. When I need more GPUs than I have locally, the ML Cloud would obviously be the way to go, but for development it’s much easier to work locally. Developing code to run on multiple GPUs when you only have a single local GPU (or none at all) is going to be slower on the ML Cloud than on AWS because of the lag between changing code and having it executed on the TPU; on AWS you can run your code instantly, directly from the GPU instance.

Google ML Cloud setup

This is a summary of https://cloud.google.com/ml/docs/how-tos/getting-set-up.

  • Go to the Google Cloud Console (https://console.cloud.google.com/) and select your project.
  • Start Cloud Shell (upper right icon).
  • Install the Cloud ML tools in the Cloud Shell instance:
# download and run the cloud shell install script 
curl https://storage.googleapis.com/cloud-ml/scripts/setup_cloud_shell.sh | bash
export PATH=${HOME}/.local/bin:${PATH}

# verify installation, should print success
curl https://storage.googleapis.com/cloud-ml/scripts/check_environment.py | python

# create a cloud ML project
gcloud beta ml init-project

# create a Google Cloud Storage bucket for reading and writing data during model training and batch prediction
PROJECT_ID=$(gcloud config list project --format "value(core.project)")
BUCKET_NAME=${PROJECT_ID}-ml
# use the same region where you plan on running your Cloud ML jobs
gsutil mb -l us-central1 gs://$BUCKET_NAME
  • Your 10GB home directory is persisted across sessions.
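To sanity-check the setup before moving on, you can list the new bucket with standard gsutil (this isn't part of Google's walkthrough, just a quick confirmation):

# confirm the bucket exists; the listing will be empty for a fresh bucket
gsutil ls gs://${BUCKET_NAME}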

Training on MNIST example

This is a summary of the MNIST training example from Google’s documentation.

In the same Cloud Shell as above, run the example Python MNIST trainer on a single instance:

cd ~/google-cloud-ml/samples/mnist/trainable/

JOB_NAME=mnist-jimmie
PROJECT_ID=`gcloud config list project --format "value(core.project)"`
TRAIN_BUCKET=gs://${PROJECT_ID}-ml
TRAIN_PATH=${TRAIN_BUCKET}/${JOB_NAME}

# remove output from any previous run with the same job name
gsutil rm -rf ${TRAIN_PATH}

gcloud beta ml jobs submit training ${JOB_NAME} \
  --package-path=trainer \
  --module-name=trainer.task \
  --staging-bucket="${TRAIN_BUCKET}" \
  --region=us-central1 \
  -- \
  --train_dir="${TRAIN_PATH}/train"

The job will be queued, and you can check its status in two ways:

1) via the command line:

gcloud beta ml jobs describe --project ${PROJECT_ID} ${JOB_NAME}

2) via the web console
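If you'd rather not rerun the describe command by hand, you can poll it (assuming watch is available in your shell, which it normally is in Cloud Shell):

# re-run the status check every 30 seconds until the job leaves the QUEUED state
watch -n 30 "gcloud beta ml jobs describe --project ${PROJECT_ID} ${JOB_NAME}"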

Important note

It took ~2 minutes from the time of job submission to the time that training actually started. The docs emphasize testing locally first, which would keep you from having to wait this long each time you update your code.
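For reference, a local test run is just the same module invoked directly. This is a sketch, assuming the sample's task.py accepts the same --train_dir flag used in the job submission above and that you have TensorFlow installed locally:

# run the trainer module locally so code changes can be tested without the ~2 minute queue delay
cd ~/google-cloud-ml/samples/mnist/trainable/
python -m trainer.task --train_dir=/tmp/${JOB_NAME}/train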

To get the training logs, grep for the interesting parts, and reverse their order, run:

gcloud beta logging read --project ${PROJECT_ID} --format=json \
  "labels.\"ml.googleapis.com/task_name\"=\"master-replica-0\" AND \
   labels.\"ml.googleapis.com/job_id\"=\"${JOB_NAME}\"" | grep 'Step\|Data Eval\|Num examples' | tac
"message": "Step 0: loss = 2.32 (0.008 sec)",
"message": "Step 100: loss = 2.09 (0.003 sec)",
"message": "Step 200: loss = 1.86 (0.003 sec)",
"message": "Step 300: loss = 1.56 (0.003 sec)",
"message": "Step 400: loss = 1.31 (0.002 sec)",
"message": "Step 500: loss = 0.87 (0.002 sec)",
"message": "Step 600: loss = 0.94 (0.002 sec)",
"message": "Step 700: loss = 0.80 (0.003 sec)",
"message": "Step 800: loss = 0.40 (0.002 sec)",
"message": "Step 900: loss = 0.60 (0.002 sec)",
"message": "Training Data Eval:",
"message": "  Num examples: 55000  Num correct: 47903  Precision @ 1: 0.8710",
"message": "Validation Data Eval:",
"message": "  Num examples: 5000  Num correct: 4387  Precision @ 1: 0.8774",
"message": "Test Data Eval:",
"message": "  Num examples: 10000  Num correct: 8772  Precision @ 1: 0.8772",
"message": "Step 1000: loss = 0.55 (0.008 sec)",
"message": "Step 1100: loss = 0.38 (0.123 sec)",
"message": "Step 1200: loss = 0.58 (0.066 sec)",
"message": "Step 1300: loss = 0.42 (0.003 sec)",
"message": "Step 1400: loss = 0.33 (0.002 sec)",
"message": "Step 1500: loss = 0.59 (0.002 sec)",
"message": "Step 1600: loss = 0.33 (0.002 sec)",
"message": "Step 1700: loss = 0.35 (0.003 sec)",
"message": "Step 1800: loss = 0.41 (0.002 sec)",
"message": "Step 1900: loss = 0.38 (0.002 sec)",
"message": "Training Data Eval:",
"message": "  Num examples: 55000  Num correct: 49454  Precision @ 1: 0.8992",
"message": "Validation Data Eval:",
"message": "  Num examples: 5000  Num correct: 4521  Precision @ 1: 0.9042",
"message": "Test Data Eval:",
"message": "  Num examples: 10000  Num correct: 9030  Precision @ 1: 0.9030",

Nvidia Titan X

For comparison, here is the same model run locally on an Nvidia Titan X:

Step 0: loss = 2.32 (0.154 sec)
Step 100: loss = 2.16 (0.001 sec)
Step 200: loss = 1.93 (0.001 sec)
Step 300: loss = 1.66 (0.001 sec)
Step 400: loss = 1.34 (0.001 sec)
Step 500: loss = 0.95 (0.001 sec)
Step 600: loss = 0.70 (0.002 sec)
Step 700: loss = 0.68 (0.001 sec)
Step 800: loss = 0.63 (0.001 sec)
Step 900: loss = 0.58 (0.001 sec)
Training Data Eval:
  Num examples: 55000  Num correct: 47772  Precision @ 1: 0.8686
Validation Data Eval:
  Num examples: 5000  Num correct: 4359  Precision @ 1: 0.8718
Test Data Eval:
  Num examples: 10000  Num correct: 8763  Precision @ 1: 0.8763
Step 1000: loss = 0.61 (0.002 sec)
Step 1100: loss = 0.51 (0.040 sec)
Step 1200: loss = 0.45 (0.001 sec)
Step 1300: loss = 0.51 (0.001 sec)
Step 1400: loss = 0.30 (0.001 sec)
Step 1500: loss = 0.35 (0.002 sec)
Step 1600: loss = 0.29 (0.001 sec)
Step 1700: loss = 0.50 (0.001 sec)
Step 1800: loss = 0.39 (0.001 sec)
Step 1900: loss = 0.25 (0.001 sec)
Training Data Eval:
  Num examples: 55000  Num correct: 49306  Precision @ 1: 0.8965
Validation Data Eval:
  Num examples: 5000  Num correct: 4504  Precision @ 1: 0.9008
Test Data Eval:
  Num examples: 10000  Num correct: 9000  Precision @ 1: 0.9000

So the cloud instance averages very roughly 0.002 sec/step and the Titan averages ~0.001 sec/step.
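For a slightly less eyeballed number, you can average the per-step times from either log. A minimal sketch, assuming the grep'd output above has been saved to a file (cloud_steps.txt is just a placeholder name):

# pull the "(N.NNN sec)" timings out of the saved log lines and average them
grep -o '([0-9.]* sec)' cloud_steps.txt | tr -d '()' \
  | awk '{ sum += $1; n++ } END { if (n > 0) printf "%.4f sec/step over %d steps\n", sum / n, n }'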