The Google computing infrastructure uses containers to manage millions of simultaneously running jobs in data centers worldwide. Although the applications are container aware and are designed to be resilient to failures, evictions due to resource contention and scheduled maintenance events can reduce overall efficiency due to the time required to rebuild complex application state. This talk discusses the ongoing use of the open source Checkpoint/Restore in Userspace (CRIU) software to migrate container workloads between machines without loss of application state, allowing improvements in efficiency and utilization. We’ll present our experiences with using CRIU at Google, including ongoing challenges supporting production workloads, current state of the project, changes required to integrate with our existing container infrastructure, new requirements from running CRIU at scale, and lessons learned from managing and supporting migratable containers. We hope to start a discussion around the future direction of CRIU as well as task migration in Linux as a whole.
|I agree to abide by the anti-harassment policy||Yes|