Monitoring GPU usage with StackDriver

    At work we use Google Cloud Platform to run our machine learning jobs on multiple machines. GCP has a monitoring platform called Stackdriver which can be used to view all kinds of metrics about your VMs. Unfortunately, it doesn't collect any metrics about GPUs, neither usage nor memory. The good news is that it is extensible, so you can "easily" set up a new kind of metric and monitor it.

    To get GPU metrics, we can use the nvidia-smi program, which is installed together with the drivers for your graphics card. If you call it without any arguments, it gives you the following output:

    > nvidia-smi
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 410.66       Driver Version: 410.66       CUDA Version: 10.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GeForce GTX 108...  Off  | 00000000:01:00.0  On |                  N/A |
    |  0%   43C    P8    17W / 250W |   1309MiB / 11177MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |    0       700      G   /usr/lib/Xorg                                 40MiB |
    |    0       733      G   /usr/bin/gnome-shell                         110MiB |
    |    0       931      G   /usr/lib/Xorg                                371MiB |
    |    0      1119      G   /usr/lib/firefox/firefox                       2MiB |
    |    0      1279      G   /usr/lib/firefox/firefox                       3MiB |
    |    0     23585      G   /usr/lib/firefox/firefox                      24MiB |
    +-----------------------------------------------------------------------------+

    This is convoluted, hard to parse and has more detail than we need. But with the right flags, you can get just the values you want, in CSV format:

    > nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv,noheader,nounits
    10,35

    The first value is the GPU utilization, as a percentage, and the second value is the memory utilization (the percentage of time the GPU's memory was being read or written). If the machine has more than one GPU, nvidia-smi prints one line per device.

    We are going to write a Python script that opens a subprocess to call nvidia-smi once a second and aggregates the statistics on a per-minute basis. We have to do this because we cannot write to a Stackdriver metric more than once a minute per label (labels are a sort of identifier for these time series).

    from subprocess import Popen, PIPE
    import os
    import time
    import sys
    
    def compute_stats():
        # Take ten one-second samples of per-GPU utilization and
        # return the per-device max and average.
        all_gpu = []
        all_mem = []
        for i in range(10):
            # One CSV line per GPU, e.g. "10, 35"
            p = Popen(["nvidia-smi", "--query-gpu=utilization.gpu,utilization.memory",
                       "--format=csv,noheader,nounits"], stdout=PIPE)
            stdout, stderror = p.communicate()
            output = stdout.decode('UTF-8')
            # Split on line breaks; every non-empty line is one device
            lines = output.strip().splitlines()
            numDevices = len(lines)
            gpu = []
            mem = []
            for g in range(numDevices):
                line = lines[g]
                vals = line.split(', ')
                gpu.append(float(vals[0]))
                mem.append(float(vals[1]))
    
            all_gpu.append(gpu)
            all_mem.append(mem)
            time.sleep(1)
    
        max_gpu = [max(x[i] for x in all_gpu) for i in range(numDevices)]
        avg_gpu = [sum(x[i] for x in all_gpu)/len(all_gpu) for i in range(numDevices)]
        max_mem = [max(x[i] for x in all_mem) for i in range(numDevices)]
        avg_mem = [sum(x[i] for x in all_mem)/len(all_mem) for i in range(numDevices)]
        return max_gpu, avg_gpu, max_mem, avg_mem

    Here we compute both the average and the maximum over the sampling window. This can be changed to other statistics if they are more relevant for your use case.
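    For example, a percentile is sometimes more robust than the maximum. A minimal sketch of a hypothetical helper (not part of the original script) that could replace the max aggregation:

    import math
    
    def percentile(samples, pct):
        # Nearest-rank percentile of the per-second samples for one device
        ordered = sorted(samples)
        rank = max(0, int(math.ceil(pct / 100.0 * len(ordered))) - 1)
        return ordered[rank]
    
    # e.g. the 95th percentile of GPU utilization for each device:
    # p95_gpu = [percentile([x[i] for x in all_gpu], 95) for i in range(numDevices)]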

    To write the data to Stackdriver, we have to build up the appropriate protobufs. We will set two labels: one for the zone in which our machines are and one for the instance_id, which we will hack to contain both the name of the machine and the number of the GPU (this is useful in case you attach multiple GPUs to one machine). I hacked the instance_id because Stackdriver kept refusing any API calls with custom labels, even though the docs said it supported them.

    from google.cloud import monitoring_v3
    
    client = monitoring_v3.MetricServiceClient()
    project = 'myGCPprojectid'  # your GCP project id
    project_name = client.project_path(project)
    
    
    def write_time_series(name, gpu_nr, value):
        # Custom metrics live under the custom.googleapis.com/ prefix
        series = monitoring_v3.types.TimeSeries()
        series.metric.type = 'custom.googleapis.com/' + name
        series.resource.type = 'gce_instance'
        # Hack: encode both the machine name and the GPU number into instance_id
        series.resource.labels['instance_id'] = sys.argv[1] + "_gpu_" + str(gpu_nr)
        series.resource.labels['zone'] = 'us-central1-f'
    
        # A single data point, timestamped with the current time
        point = series.points.add()
        point.value.double_value = value
        now = time.time()
        point.interval.end_time.seconds = int(now)
        point.interval.end_time.nanos = int(
            (now - point.interval.end_time.seconds) * 10**9)
        client.create_time_series(project_name, [series])

    And now, we put everything together. The program must be called with the name of the instance as its first parameter. If you only ever run it on GCP, you can get the name of the instance automatically from the GCP APIs (a sketch follows the snippet below).

    if len(sys.argv) < 2:
        print("You need to pass the instance name as first argument")
        sys.exit(1)
    
    try:
        max_gpu, avg_gpu, max_mem, avg_mem = compute_stats()
        for i in range(len(max_gpu)):
            write_time_series('max_gpu_utilization', i, max_gpu[i])
            write_time_series('max_gpu_memory', i, max_mem[i])
            write_time_series('avg_gpu_utilization', i, avg_gpu[i])
            write_time_series('avg_gpu_memory', i, avg_mem[i])
    except Exception as e:
        print(e)
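
    As mentioned above, when the script only runs on GCE the instance name could come from the metadata server instead of the command line. A minimal sketch (the helper below is mine, not part of the original script):

    from urllib.request import Request, urlopen
    
    def gce_instance_name():
        # The GCE metadata server exposes the instance name at this path;
        # the Metadata-Flavor header is mandatory.
        req = Request(
            "http://metadata.google.internal/computeMetadata/v1/instance/name",
            headers={"Metadata-Flavor": "Google"})
        return urlopen(req, timeout=2).read().decode("utf-8")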
        

    If you save all this code to a file called gpu_monitoring.py and you run it locally, on a machine with an NVIDIA GPU, after a minute you should start seeing the new metrics in your Stackdriver console, associated with your GCP project.

    This script can then be called from cron once a minute, or it can be changed so that it runs without stopping, posting results once a minute (a sketch of that variant follows the crontab line).

    * * * * * python /path/to/gpu_monitoring.py instance_name >> /var/log/gpu.log 2>&1
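
    A rough sketch of the long-running variant, assuming compute_stats and write_time_series from above are in scope:

    while True:
        start = time.time()
        try:
            max_gpu, avg_gpu, max_mem, avg_mem = compute_stats()
            for i in range(len(max_gpu)):
                write_time_series('max_gpu_utilization', i, max_gpu[i])
                write_time_series('max_gpu_memory', i, max_mem[i])
                write_time_series('avg_gpu_utilization', i, avg_gpu[i])
                write_time_series('avg_gpu_memory', i, avg_mem[i])
        except Exception as e:
            print(e)
        # Stackdriver accepts at most one point per minute per time series
        time.sleep(max(0, 60 - (time.time() - start)))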

    Setting up the GCP project and authentication to connect to Stackdriver is left as an exercise to the reader. The whole code can be seen in this gist.
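
    For the authentication part, one common approach, if you use a service account, is to point the client library at the key file through Application Default Credentials before the MetricServiceClient is created (the path below is just a placeholder):

    import os
    
    # Application Default Credentials: the google-cloud client picks this up
    # when monitoring_v3.MetricServiceClient() is constructed
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"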