Tuesday October 2nd, 9:15 am – Room: Salle Sophie Germain
Speaker: Keita Teranishi, Sandia National Laboratories, California
Keita Teranishi received the BS and MS degrees from the University of Tennessee, Knoxville, in 1998 and 2000, respectively, and the PhD degree from Penn State University in 2004. He is a principal member of the technical staff at Sandia National Laboratories.
Title: Scalable, Efficient Fault Tolerance in Asynchronous Many Task (AMT) Programming Models
Abstract: With the growing scale and complexity of computational systems, HPC applications are increasingly susceptible to a wide variety of hardware and software faults. Yet applications are ill-equipped to deal with the full spectrum of possible faults, and their response, particularly in synchronous programming models, is often disproportionate to the fault rate. An alternative, Local Failure Local Recovery (LFLR), is based on the notion that fault recovery localized around the point of occurrence is more scalable and efficient than a bulk response. LFLR is better suited to asynchronous programming models than to synchronous ones. In this study, we demonstrate the efficiency and scalability of two task-based fault recovery methodologies, task replication and task replay, in an exemplar AMT runtime, Habanero-C++. The data/task semantics and API in Habanero were augmented to include functionality for the recovery techniques in a manner that incurs negligible overhead. Three representative mini-applications were implemented in Habanero to study the performance overhead of the recovery strategies at varying fault rates: a 1-D explicit stencil, a 3-D explicit stencil, and sparse matrix-vector multiplication. Experiments with the three applications show that, with an efficient load balancing strategy, the additional cost incurred due to fault recovery (additional tasks) is proportional to the rate of failed tasks.
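To make the two recovery strategies named in the abstract concrete, here is a minimal Python sketch of task replay and task replication applied to a single 1-D stencil update. It is not the Habanero-C++ API and not the speaker's implementation: every name in it (`TaskFault`, `make_faulty`, `replay`, `replicate`, `stencil_point`, `run_with_replay`) is hypothetical, faults are injected artificially with a random number generator, and the replication variant assumes a simple "run two copies, add a third and vote on disagreement" scheme.

```python
import random
from collections import Counter


class TaskFault(Exception):
    """Hypothetical signal that a task execution failed detectably."""


def stencil_point(left, center, right):
    """One explicit 1-D stencil update for a single grid point."""
    return 0.25 * left + 0.5 * center + 0.25 * right


def make_faulty(fn, fault_rate, rng):
    """Wrap fn so each execution fails with probability fault_rate
    (artificial fault injection, assumed for illustration only)."""
    def wrapped(*args):
        if rng.random() < fault_rate:
            raise TaskFault("injected fault")
        return fn(*args)
    return wrapped


def replay(task, args, max_attempts=10):
    """Task replay: re-execute only the failed task until it succeeds.
    Returns (result, number_of_executions)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(*args), attempt
        except TaskFault:
            continue  # recovery is local: no other task is rolled back
    raise RuntimeError("task failed on every attempt")


def replicate(task, args):
    """Task replication: run two copies and compare their results; on a
    mismatch run a third copy and take the majority value.
    Results must be hashable. Returns (result, number_of_executions)."""
    results = [task(*args), task(*args)]
    if results[0] == results[1]:
        return results[0], 2
    results.append(task(*args))
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority among replicas")
    return value, 3


def run_with_replay(n_tasks, fault_rate, seed=0):
    """Run n_tasks independent stencil tasks under replay and return the
    total number of task executions, including re-executions."""
    rng = random.Random(seed)
    task = make_faulty(stencil_point, fault_rate, rng)
    total = 0
    for _ in range(n_tasks):
        _, runs = replay(task, (1.0, 2.0, 3.0), max_attempts=100)
        total += runs
    return total
```

In this simplified model, where each execution fails independently with probability p, a task needs 1/(1 - p) executions on average, so the extra work per task is p/(1 - p), which is close to p when p is small. That matches the shape of the proportionality claim in the abstract, but it is a property of the toy model only, not a reproduction of the reported experiments.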