Checkpoint/Restart for distributed applications

One Line Summary

Executing efficient checkpoint/restart for distributed applications

Abstract

Using checkpoint/restart protocols for single-instance applications is well understood. Scaling this concept up to parallel applications running across multiple nodes is a delicate operation, one that can easily incur a significant performance impact. This talk covers the existing checkpoint/restart protocols and gives details about their implementation in one of the leading message passing libraries, Open MPI.

Tags

checkpoint, restart, resilience

Speaker

  • George Bosilca

    University of Tennessee

    Biography

    Research Director and Adjunct Assistant Professor at the Innovative Computing Laboratory at the University of Tennessee, Knoxville. Core developer of Open MPI and fervent supporter of resilience in distributed computing, particularly in MPI.