Checkpoint/Restart for distributed applications
One Line Summary
Executing efficient checkpoint/restart for distributed applications
Abstract
Checkpoint/restart for single-instance applications is well understood. Scaling the concept up to parallel applications running across multiple nodes is a delicate operation, one that can easily incur a significant performance penalty. This talk covers the existing checkpoint/restart protocols and gives details about their implementation in one of the leading message-passing libraries, Open MPI.
Tags
checkpoint, restart, resilience
Speaker
George Bosilca
University of Tennessee
Website: http://icl.cs.utk.edu/~bosilca/
Biography
Research Director and Adjunct Assistant Professor at the Innovative Computing Laboratory at the University of Tennessee, Knoxville. Core developer of Open MPI and fervent supporter of resilience in distributed computing, particularly in MPI.