Checkpoint/Restart for distributed applications

One Line Summary

Distributed checkpoint/restart protocols

Abstract

Extending single-application checkpoint/restart to parallel applications running across distributed resources is a delicate operation, one that can easily incur a significant and undesirable performance penalty. The talk addresses several checkpoint/restart challenges, from coordinating distributed processes to ensuring that no message is lost or duplicated. It covers different distributed checkpoint/restart protocols in detail and describes ongoing efforts to incorporate and optimize checkpoint/restart support in an existing MPI implementation, Open MPI.
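
To make the "no message lost or duplicated" requirement concrete, the following is a minimal sketch of one classic coordinated protocol, the Chandy-Lamport marker algorithm, simulated in a single Python process. It is an illustration only: the abstract does not say which protocols the talk covers, and this is not how Open MPI implements checkpointing. All names (Network, Process, run_demo) are hypothetical, and the sketch assumes reliable FIFO channels between every pair of processes. Each process holds a token balance, so a consistent checkpoint must account for every token exactly once, either in a saved process state or in a logged in-flight message.

```python
from collections import deque


class Process:
    """Hypothetical simulated process holding a token balance."""

    def __init__(self, rank, balance, peers):
        self.rank = rank
        self.balance = balance
        self.peers = peers
        self.saved_state = None   # local checkpoint, None until taken
        self.recording = {}       # src -> in-flight messages still being logged
        self.channel_log = {}     # src -> finalized log of in-flight messages


class Network:
    """Assumes one reliable FIFO channel per ordered pair of processes."""

    def __init__(self, n, initial):
        self.procs = [
            Process(r, initial, [p for p in range(n) if p != r])
            for r in range(n)
        ]
        self.channels = {
            (s, d): deque() for s in range(n) for d in range(n) if s != d
        }

    def send(self, src, dst, amount):
        self.procs[src].balance -= amount
        self.channels[(src, dst)].append(("DATA", amount))

    def start_checkpoint(self, rank):
        self._take_local_checkpoint(rank, marker_from=None)

    def _take_local_checkpoint(self, rank, marker_from):
        p = self.procs[rank]
        p.saved_state = p.balance
        for src in p.peers:
            if src == marker_from:
                # Marker already seen on this channel: nothing was in flight.
                p.channel_log[src] = []
            else:
                # Log data arriving here until this channel's marker shows up.
                p.recording[src] = []
        for dst in p.peers:
            self.channels[(rank, dst)].append(("MARKER", None))

    def deliver(self, src, dst):
        kind, amount = self.channels[(src, dst)].popleft()
        p = self.procs[dst]
        if kind == "DATA":
            p.balance += amount
            if src in p.recording:
                p.recording[src].append(amount)
        elif p.saved_state is None:
            # First marker seen: checkpoint now, before any later message.
            self._take_local_checkpoint(dst, marker_from=src)
        else:
            # Marker closes the log for this incoming channel.
            p.channel_log[src] = p.recording.pop(src)

    def drain(self):
        # Deliver one message per non-empty channel until all are empty.
        while any(self.channels.values()):
            for (src, dst), queue in self.channels.items():
                if queue:
                    self.deliver(src, dst)

    def snapshot_total(self):
        states = sum(p.saved_state for p in self.procs)
        in_flight = sum(sum(log) for p in self.procs
                        for log in p.channel_log.values())
        return states + in_flight


def run_demo():
    net = Network(n=3, initial=100)
    net.send(0, 1, 10)
    net.send(1, 2, 20)
    net.start_checkpoint(0)   # rank 0 initiates while messages are in flight
    net.send(2, 0, 5)         # sent before rank 2 has seen any marker
    net.drain()
    return net


if __name__ == "__main__":
    net = run_demo()
    print("tokens in snapshot:", net.snapshot_total())
```

In this run the tokens recorded in the snapshot (saved balances plus logged channel contents) add up to the 300 tokens the system started with, which is the property a restart depends on: replaying the logged messages on top of the saved states neither loses nor duplicates a message.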

Tags

resilience, soft error detection, soft error correction

Speaker

  • George Bosilca

    University of Tennessee

    Biography

    Research Director and Adjunct Assistant Professor at the Innovative Computing Laboratory, University of Tennessee, Knoxville. Core developer of Open MPI and a fervent supporter of resilience in distributed computing, in particular in MPI.