June 25th, 10am, Room Ada Lovelace
Title: How we built VeloC for Fault Tolerant Exascale Computing.
Over the past 12 years, a lot of research has focused on developing techniques to ensure that executions on Exascale systems complete with correct results. Many failure characterization studies have been conducted on existing systems to understand how they break and how failures disrupt parallel application executions. Many projections have been proposed to forecast the failure modes of Exascale systems, in order to narrow the focus to the failures that are most critical and most likely to occur. Many research directions have been investigated for the detection and mitigation of errors and failures. From this gigantic corpus of results (described in hundreds of publications), the teams actually developing applications and systems for the first Exascale systems have established the desirable strategy and characteristics of fault tolerance mechanisms. The result of these 12 years of research is the VeloC environment, developed in the context of the US DOE/NNSA Exascale Computing Project (ECP). The talk will review the most important analyses and results that led to the specifications of VeloC, its design, and its development. We will present VeloC and early results. VeloC is the main fault tolerance environment for ECP. It bears a huge responsibility: making sure that exascale applications running from 2021 onward survive failures and complete with minimal overhead. The talk will also look beyond exascale and describe the main research problems in fault tolerance for the next generation of systems.