rolisz's site

Monitoring GPU usage with Stackdriver

At work we use Google Cloud Platform to run our machine learning jobs on multiple machines. GCP has a monitoring platform called Stackdriver, which can be used to view all kinds of metrics about your VMs. Unfortunately, it doesn't collect any metrics about GPUs, neither utilization nor memory. The good news is that it is extensible, so you can "easily" set up a new kind of metric and monitor it.

To get GPU metrics, we can use the nvidia-smi program, which is installed along with the drivers for your graphics card. If you call it without any arguments, it will give you the following output:

> nvidia-smi
+-----------------------------------------------------------------------------+
| 
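The formatted table is meant for humans. For feeding a monitoring system, nvidia-smi can also print selected fields as plain CSV through its --query-gpu and --format options. Below is a minimal sketch of reading those values from Python, assuming Python 3.7+ and nvidia-smi on the PATH. The function names parse_gpu_metrics and read_gpu_metrics, and the choice of fields, are my own assumptions for illustration.

```python
import subprocess

# Fields requested from nvidia-smi (an illustrative choice);
# `nvidia-smi --help-query-gpu` lists everything that is available.
QUERY_FIELDS = ["utilization.gpu", "memory.used", "memory.total"]


def _to_number(value):
    """Convert one CSV cell to a float, or None if it is not numeric."""
    try:
        return float(value)
    except ValueError:
        return None  # some GPUs report a non-numeric placeholder for a field


def parse_gpu_metrics(csv_text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    metrics = []
    for line in csv_text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) != len(QUERY_FIELDS):
            continue  # skip lines that do not match the requested fields
        metrics.append(
            {name: _to_number(v) for name, v in zip(QUERY_FIELDS, values)}
        )
    return metrics


def read_gpu_metrics():
    """Run nvidia-smi and return the parsed metrics (needs an NVIDIA driver)."""
    output = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        text=True,
    )
    return parse_gpu_metrics(output)
```

On a machine with the driver installed, read_gpu_metrics() should return one dictionary per GPU, with utilization in percent and memory in MiB. The parsing is kept in a separate function so it can be checked on a plain string without a GPU present.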