HiePACS Working Group

The next HiePACS Working Group will take place on Monday, April 18 at 9:30 in Ada Lovelace.

Context:
Robert Clay and Keita Teranishi are visiting the HiePACS and Inria Bordeaux HPC teams on Monday, April 18.

The morning will be dedicated to two talks on runtime systems and resilience, respectively.

9:30 Robert Clay (SNL)

Title: The DHARMA Approach to Asynchronous Many Task Programming

Abstract: Asynchronous Many-Task (AMT) programming models and runtime systems hold the promise to address key issues in future extreme-scale computer architectures, and hence are an active exascale research area. The DHARMA project at Sandia National Labs is working towards three complementary AMT research goals: 1) co-design a programming model specification that incorporates both application requirements and lessons learned from other AMT efforts; 2) design an implementation of that spec, leveraging existing components and expertise from the community; 3) engage the AMT community longer term to define best practices and ultimately standards. In this talk we discuss recent results and current state of the DHARMA project. We highlight our recent comparative analysis study and how it informs our higher-level design philosophy. We introduce features from our developing spec and where that spec fits in the AMT design space. Finally we discuss the effort remaining to achieve a DHARMA implementation.

10:30 Coffee break

11:00 Keita Teranishi (SNL)

Title: FENIX for Scalable Online Application Resilience

Abstract: Major exascale reports indicate that future HPC systems will suffer shorter Mean Time Between Failures (MTBF) due to increasing system complexity and shrinking hardware components. For such unreliable computing systems, it is reasonable for application users to explicitly manage the response to frequent system failures. Traditionally, checkpoint-restart (CR) has been a popular resilience technique for application users, but it incurs undue cost associated with access to secondary storage (distributed I/O) and the global restart of parallel programs. Interestingly, anecdotal evidence suggests that the majority of large-scale HPC application failures are attributable to failures at a single node. If this holds, traditional CR uses unnecessary system resources to contain application failures of any scale, which suggests a new approach that adapts to the scale of failures. We have proposed the Local Failure Local Recovery (LFLR) concept, which lets parallel applications recover locally from single-node (local) failures without global program termination and restart. In a joint effort with Rutgers University, we have developed prototype software, FENIX, to realize scalable online application recovery using MPI-ULFM (a fault-tolerant MPI prototype). In this talk, we will discuss the architecture of FENIX, its capabilities, and future research directions.
