Resilience algorithms to cope with fail-stop and silent errors by Hongyang Sun (ENS Lyon)

2:00 pm – 3:00 pm
March 31, 2016

This talk focuses on resilience algorithms at extreme scale. Many papers address fail-stop errors and many others address silent errors (also called silent data corruptions), but very few address both simultaneously. HPC applications, however, will have to cope with both error sources. This talk presents a unified framework and optimal algorithmic solutions to this double challenge. Silent errors are handled via verification mechanisms (either partially or fully accurate) and in-memory checkpoints. Fail-stop errors are handled via disk checkpoints. All verification and checkpoint types are combined into computational patterns. We provide a unified model and a full characterization of the optimal pattern. Our results extend several published solutions and demonstrate how to combine different techniques to address the double threat of fail-stop and silent errors.
