Chasing the latency tail

This proposal has been rejected.

*

One Line Summary

Measuring CFS-based CPU scheduling performance on multi-tenant hosts

Abstract

On a multi-tenant host, a task’s performance can be affected by CPU contention.
Breaking down and addressing these effects lets us drive higher utilization while maintaining guarantees for our customers.

In this talk, we describe our efforts to understand and control the effects of CPU scheduling on task performance. We’ll describe the metrics we used to study CPU scheduling latency across all tasks in our clusters. Along the way, we came up with a few SLIs for CPU scheduling, both for every job and for the overall health of the cluster. We discarded a few metrics that initially showed promise and evolved a few others. We also made a few changes to our stack to fix the problems we found using the new metrics. We’ll go into the details of our experiments and the evolution of our scheduling metrics.
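
As a rough illustration of what a scheduling-latency metric can look like, the sketch below derives the average run-queue wait per timeslice from two samples of Linux's /proc/<pid>/schedstat. This is an assumption for illustration only, not necessarily one of the metrics covered in the talk. The file is available on kernels built with scheduler statistics support and holds three fields: time spent running on a CPU (ns), time spent waiting on a run queue (ns), and the number of timeslices run. The names SchedStat, parse_schedstat, wait_per_slice_us and read_schedstat are hypothetical.

```python
from typing import NamedTuple


class SchedStat(NamedTuple):
    run_ns: int      # time spent running on a CPU, in nanoseconds
    wait_ns: int     # time spent runnable but waiting on a run queue, in nanoseconds
    timeslices: int  # number of timeslices the task has run


def parse_schedstat(text: str) -> SchedStat:
    """Parse the three whitespace-separated fields of a schedstat line."""
    run_ns, wait_ns, timeslices = (int(field) for field in text.split()[:3])
    return SchedStat(run_ns, wait_ns, timeslices)


def wait_per_slice_us(before: SchedStat, after: SchedStat) -> float:
    """Average run-queue wait per timeslice between two samples, in microseconds."""
    slices = after.timeslices - before.timeslices
    if slices <= 0:
        # The task did not get scheduled between the samples.
        return 0.0
    return (after.wait_ns - before.wait_ns) / slices / 1000.0


def read_schedstat(pid: int) -> SchedStat:
    """Read one sample for a live process (Linux only)."""
    with open(f"/proc/{pid}/schedstat") as f:
        return parse_schedstat(f.read())
```

Sampling read_schedstat(pid) at two points in time and passing both samples to wait_per_slice_us gives a per-task figure. An average hides the tail, so a metric aimed at tail latency would more likely track a distribution or high percentiles across tasks.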

Tags

performance, cpu, cfs, containers, scheduling, cluster, SLI

Speakers

  • Rohit Jnagal

    Google Inc

    Biography

    Rohit Jnagal is the lead for container management tools at Google. Rohit has been working on resource isolation and shared-load machine performance for Google’s cluster management system (Borg) since 2009.

    Before Google, Rohit worked on a distributed virtual machine monitor at 3Leaf Systems, HP-UX memory management at Hewlett Packard, and Veritas Volume Manager for Linux at Veritas.

  • David Ruiz

    Google

    Biography

    David Ruiz is a lead for container management tools at Google.