Monitoring GPU usage with StackDriver

    At work we use Google Cloud Platform to run our machine learning jobs on multiple machines. GCP has a monitoring platform called Stackdriver which can be used to view all kinds of metrics about your VMs. Unfortunately, it doesn't collect any metrics about GPUs, neither usage nor memory. The good news is that it is extensible, so you can "easily" set up a new kind of metric and monitor it.

    To get GPU metrics, we can use the nvidia-smi program, which is installed together with the drivers for your graphics card. If you call it without any arguments, it gives you the following output:

    > nvidia-smi
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 410.66       Driver Version: 410.66       CUDA Version: 10.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GeForce GTX 108...  Off  | 00000000:01:00.0  On |                  N/A |
    |  0%   43C    P8    17W / 250W |   1309MiB / 11177MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |    0       700      G   /usr/lib/Xorg                                 40MiB |
    |    0       733      G   /usr/bin/gnome-shell                         110MiB |
    |    0       931      G   /usr/lib/Xorg                                371MiB |
    |    0      1119      G   /usr/lib/firefox/firefox                       2MiB |
    |    0      1279      G   /usr/lib/firefox/firefox                       3MiB |
    |    0     23585      G   /usr/lib/firefox/firefox                      24MiB |
    +-----------------------------------------------------------------------------+

    This is convoluted, hard to parse and has more detail than we need. But with the right flags, you can get just the values you want, in CSV format:

    > nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv,noheader,nounits
    10,35

    The first value is the GPU utilization, as a percentage, and the second value is the memory utilization (the percentage of time the GPU's memory was being read or written). If the machine has more than one GPU, nvidia-smi prints one line per device.

    We are going to write a Python script that opens a subprocess to call nvidia-smi once a second and aggregates the statistics on a per-minute basis. We have to do this because we cannot write to a Stackdriver metric more than once a minute per label (labels are a sort of identifier for these time series).

    from subprocess import Popen, PIPE
    import os
    import time
    import sys
    
    def compute_stats():
        # Take ten one-second samples of per-GPU utilization and
        # return the per-device max and average.
        all_gpu = []
        all_mem = []
        for i in range(10):
            # One CSV line per GPU, e.g. "10, 35"
            p = Popen(["nvidia-smi", "--query-gpu=utilization.gpu,utilization.memory",
                       "--format=csv,noheader,nounits"], stdout=PIPE)
            stdout, stderror = p.communicate()
            output = stdout.decode('UTF-8')
            # Split on line breaks; every non-empty line is one device
            lines = output.strip().splitlines()
            numDevices = len(lines)
            gpu = []
            mem = []
            for g in range(numDevices):
                line = lines[g]
                vals = line.split(', ')
                gpu.append(float(vals[0]))
                mem.append(float(vals[1]))
    
            all_gpu.append(gpu)
            all_mem.append(mem)
            time.sleep(1)
    
        max_gpu = [max(x[i] for x in all_gpu) for i in range(numDevices)]
        avg_gpu = [sum(x[i] for x in all_gpu)/len(all_gpu) for i in range(numDevices)]
        max_mem = [max(x[i] for x in all_mem) for i in range(numDevices)]
        avg_mem = [sum(x[i] for x in all_mem)/len(all_mem) for i in range(numDevices)]
        return max_gpu, avg_gpu, max_mem, avg_mem

    Here we compute both the average and the maximum over the sampling window. This can be changed to other statistics if they are more relevant for your use case.
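    For example, a percentile is sometimes more robust than the maximum. A minimal sketch of a hypothetical helper (not part of the original script) that could replace the max aggregation:

    import math
    
    def percentile(samples, pct):
        # Nearest-rank percentile of the per-second samples for one device
        ordered = sorted(samples)
        rank = max(0, int(math.ceil(pct / 100.0 * len(ordered))) - 1)
        return ordered[rank]
    
    # e.g. the 95th percentile of GPU utilization for each device:
    # p95_gpu = [percentile([x[i] for x in all_gpu], 95) for i in range(numDevices)]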

    To write the data to Stackdriver, we have to build up the appropriate protobufs. We will set two labels: one for the zone in which our machines are and one for the instance_id, which we will hack to contain both the name of the machine and the number of the GPU (this is useful in case you attach multiple GPUs to one machine). I hacked the instance_id because Stackdriver kept refusing any API calls with custom labels, even though the docs said it supported them.

    from google.cloud import monitoring_v3
    
    client = monitoring_v3.MetricServiceClient()
    project = 'myGCPprojectid'  # your GCP project id
    project_name = client.project_path(project)
    
    
    def write_time_series(name, gpu_nr, value):
        # Custom metrics live under the custom.googleapis.com/ prefix
        series = monitoring_v3.types.TimeSeries()
        series.metric.type = 'custom.googleapis.com/' + name
        series.resource.type = 'gce_instance'
        # Hack: encode both the machine name and the GPU number into instance_id
        series.resource.labels['instance_id'] = sys.argv[1] + "_gpu_" + str(gpu_nr)
        series.resource.labels['zone'] = 'us-central1-f'
    
        # A single data point, timestamped with the current time
        point = series.points.add()
        point.value.double_value = value
        now = time.time()
        point.interval.end_time.seconds = int(now)
        point.interval.end_time.nanos = int(
            (now - point.interval.end_time.seconds) * 10**9)
        client.create_time_series(project_name, [series])

    And now, we put everything together. The program must be called with the name of the instance as its first parameter. If you only ever run it on GCP, you can get the name of the instance automatically from the GCP APIs (a sketch follows the snippet below).

    if len(sys.argv) < 2:
        print("You need to pass the instance name as first argument")
        sys.exit(1)
    
    try:
        max_gpu, avg_gpu, max_mem, avg_mem = compute_stats()
        for i in range(len(max_gpu)):
            write_time_series('max_gpu_utilization', i, max_gpu[i])
            write_time_series('max_gpu_memory', i, max_mem[i])
            write_time_series('avg_gpu_utilization', i, avg_gpu[i])
            write_time_series('avg_gpu_memory', i, avg_mem[i])
    except Exception as e:
        print(e)
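
    As mentioned above, when the script only runs on GCE the instance name could come from the metadata server instead of the command line. A minimal sketch (the helper below is mine, not part of the original script):

    from urllib.request import Request, urlopen
    
    def gce_instance_name():
        # The GCE metadata server exposes the instance name at this path;
        # the Metadata-Flavor header is mandatory.
        req = Request(
            "http://metadata.google.internal/computeMetadata/v1/instance/name",
            headers={"Metadata-Flavor": "Google"})
        return urlopen(req, timeout=2).read().decode("utf-8")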
        

    If you save all this code to a file called gpu_monitoring.py and you run it locally, on a machine with an NVIDIA GPU, after a minute you should start seeing the new metrics in your Stackdriver console, associated with your GCP project.

    This script can then be called from cron once a minute, or it can be changed so that it runs without stopping, posting results once a minute (a sketch of that variant follows the crontab line).

    * * * * * python /path/to/gpu_monitoring.py instance_name >> /var/log/gpu.log 2>&1
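
    A rough sketch of the long-running variant, assuming compute_stats and write_time_series from above are in scope:

    while True:
        start = time.time()
        try:
            max_gpu, avg_gpu, max_mem, avg_mem = compute_stats()
            for i in range(len(max_gpu)):
                write_time_series('max_gpu_utilization', i, max_gpu[i])
                write_time_series('max_gpu_memory', i, max_mem[i])
                write_time_series('avg_gpu_utilization', i, avg_gpu[i])
                write_time_series('avg_gpu_memory', i, avg_mem[i])
        except Exception as e:
            print(e)
        # Stackdriver accepts at most one point per minute per time series
        time.sleep(max(0, 60 - (time.time() - start)))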

    Setting up the GCP project and authentication to connect to Stackdriver is left as an exercise to the reader. The whole code can be seen in this gist.
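
    For the authentication part, one common approach, if you use a service account, is to point the client library at the key file through Application Default Credentials before the MetricServiceClient is created (the path below is just a placeholder):

    import os
    
    # Application Default Credentials: the google-cloud client picks this up
    # when monitoring_v3.MetricServiceClient() is constructed
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"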