Talk by Andres Rubio and Brice Goglin on April 9th

Brice and Andres will present new trends in non-volatile memory technologies.

Talk by Francieli Zanon Boito (Corse Team, Inria Grenoble/LIG) on January 23rd

On January 23rd at 2pm in room Grace Hopper 2 (4th floor), Francieli Zanon Boito (post-doc in the Corse team) will present her recent work on data management.

Title: Data management to promote near-data processing

Abstract: Motivated by a case study of instrumental data management at the CEA, this project aims at providing near-data processing (NDP) for tasks from scientific workflows, executed to perform analysis of instrumental data from a centralized storage. These tasks are submitted by users to a framework that will coordinate their execution over available processing nodes and all required data transfers. NDP is promoted by using part of the storage capacity from the processing nodes as cache for data. In this presentation I will talk about this project and present initial results with different replication strategies.

Talk by Navjot Kukreja (Imperial College London) on December 13th

Combining checkpointing and data compression for large scale seismic inversion

Seismic inversion is a class of adjoint-based optimization problems that process up to terabytes of data, regularly exceeding the memory capacity of available computers. Data compression is an effective strategy to reduce this memory requirement by a certain factor, particularly if some loss in accuracy is acceptable. A popular alternative is checkpointing, where data is stored at selected points in time, and values at other times are recomputed as needed from the last stored state. This allows arbitrarily large adjoint computations with limited memory, at the cost of additional recomputations. In this talk I discuss the combination of compression and checkpointing to compute a realistic seismic inversion. The combination of checkpointing and compression allows larger adjoint computations compared to using only compression, and reduces the recomputation overhead significantly compared to using only checkpointing.
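
As a rough illustration of the idea described above, the following Python sketch stores a compressed snapshot every few time steps and recomputes the states in between from the last snapshot. It is not the speaker's implementation: the `step` function is a hypothetical stand-in for a real wave-propagation kernel, and the "lossy" part is simply a downcast to 32-bit floats followed by lossless `zlib` compression.

```python
import zlib
from array import array


def step(state):
    """Hypothetical forward time step (stand-in for a real PDE kernel)."""
    return [0.5 * x + 1.0 for x in state]


def compress(state):
    """Downcast to float32 (lossy in general), then apply lossless zlib."""
    return zlib.compress(array("f", state).tobytes())


def decompress(blob):
    """Inverse of compress(): returns the state as a list of floats."""
    values = array("f")
    values.frombytes(zlib.decompress(blob))
    return list(values)


def forward_with_checkpoints(state0, n_steps, interval):
    """Run the forward model, saving a compressed snapshot every `interval` steps."""
    checkpoints = {}
    state = state0
    for t in range(n_steps):
        if t % interval == 0:
            checkpoints[t] = compress(state)
        state = step(state)
    return state, checkpoints


def restore(checkpoints, t, interval):
    """Recover the state after t steps: load the last snapshot, then recompute."""
    t_ckpt = (t // interval) * interval
    state = decompress(checkpoints[t_ckpt])
    for _ in range(t - t_ckpt):
        state = step(state)
    return state
```

A larger `interval` stores fewer snapshots but requires more recomputation in `restore`; compressing each snapshot lets more of them fit in the same memory, which is the trade-off the abstract refers to.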

Talk by Paul Hovland (Argonne National Laboratory) on December 13th

Compressing Checkpoints in MITgcm Adjoint Computations

Efficient computation of the gradients used for state estimation in the MITgcm general circulation model requires saving intermediate states to disk. We present some preliminary experiments on compressing these checkpoints in order to reduce the time to read and write checkpoints or to increase the number of checkpoints written to disk.

Hugo Taboada defends his PhD thesis

Hugo Taboada will defend his PhD thesis entitled “MPI Non-Blocking Collective Overlap on Manycore Processor” on Tuesday, December 11th at 10:00 AM.

Supercomputers used in HPC are composed of several interconnected machines. They are usually programmed using MPI, which specifies an API for message exchanges between machines. To amortize the cost of MPI communications, non-blocking operations have been proposed so that communications can be overlapped with computation. Initially, these operations were only available for communication between two MPI processes (point-to-point communications). Non-blocking communications were extended to collective communications in 2012 with MPI 3.0. This opens up the possibility of overlapping non-blocking collective communications with computation. However, these operations are more CPU-hungry than point-to-point communications.
We propose to approach this problem from several angles. On the one hand, we focus on the placement of the progress threads generated by MPI non-blocking collectives. We propose two progress-thread placement algorithms for all non-blocking collectives: we bind the threads either to free cores or to hyper-threads. Then, we focus on optimizing two types of algorithms used by collective operations: tree-based algorithms and chain-based algorithms.
On the other hand, we also study the scheduling of progress threads to avoid running them when it is unnecessary for the advancement of the collective algorithm. To that end, we first propose a mechanism to suspend the scheduling of these threads, and then we statically force their optimal scheduling by using semaphores. Finally, we introduce a proof-of-concept scheduling policy with priorities.
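
To give an idea of what tree-based and chain-based collective algorithms look like, here is a small Python sketch that lists, round by round, which rank sends to which rank in a broadcast from rank 0. It is a generic textbook illustration, not the algorithms or optimizations studied in the thesis: the function names are made up for this example, no MPI library is involved, and the chain is shown without the message pipelining that real implementations typically use.

```python
def chain_broadcast(n_procs):
    """Rounds of a chain broadcast: rank r forwards the data to rank r + 1."""
    return [[(r, r + 1)] for r in range(n_procs - 1)]


def binomial_tree_broadcast(n_procs):
    """Rounds of a binomial-tree broadcast.

    At the start of each round, ranks 0 .. dist - 1 hold the data and each
    sends it to rank + dist, so the number of holders doubles every round.
    """
    rounds = []
    dist = 1
    while dist < n_procs:
        rounds.append([(r, r + dist) for r in range(dist) if r + dist < n_procs])
        dist *= 2
    return rounds


if __name__ == "__main__":
    for name, schedule in [("chain", chain_broadcast(8)),
                           ("tree", binomial_tree_broadcast(8))]:
        print(name, "rounds:", len(schedule))
        for k, sends in enumerate(schedule):
            print("  round", k, sends)
```

Each (sender, receiver) pair in a round is a step that something must drive forward; in a non-blocking collective this is the work that falls to the progress threads mentioned above.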

Thesis reviewers:
George Bosilca, University of Tennessee, Knoxville
Christian Perez, Inria Grenoble Rhône-Alpes

Jury:
Emmanuel Jeannot (Inria Bordeaux Sud-Ouest)
Alexandre Denis (Inria Bordeaux Sud-Ouest)
Julien Jaeger (CEA)
Christian Perez (Inria Grenoble Rhône-Alpes)
Jean-Marc Pierson (Université de Toulouse)
Raymond Namyst (Université de Bordeaux)
Pascale Rossé-Laurent (Bull Atos)

Nicolas Denoyelle defends his PhD thesis

Nicolas Denoyelle will defend his PhD thesis entitled “From Software Locality to Hardware Locality in Shared Memory Systems with Heterogeneous and Non-Uniform memory” on Monday, November 5th at 2:00 PM.

Over the years, the complexity of the memory hierarchy in High Performance Computing (HPC) systems has increased. Nowadays, large-scale machines typically embed several levels of caches and a distributed memory. Recently, on-chip memories and non-volatile PCIe-based flash have entered the HPC landscape. This memory architecture is a necessary evil for obtaining high performance, but it comes at the cost of careful task and data placement. Hardware-managed caches used to hide the tedious locality optimizations. Now, data locality, in local or remote memories, in fast or slow memory, in volatile or non-volatile memory, with small or large capacity, is entirely software-manageable. This extra flexibility grants more freedom to application designers, but with the drawback of making their work more complex and expensive. Indeed, when managing task and data placement, one has to account for several complex trade-offs between memory performance, size and features.

This thesis was jointly supervised by Atos Bull Technologies and Inria Bordeaux Sud-Ouest. In this document, we detail contemporary HPC systems and characterize machine performance for several locality scenarios. We explain how programming language semantics affect data locality in the hardware, and thus application performance. Through joint work with the INESC-ID laboratory in Lisbon, we propose an insightful extension to the famous Roofline performance model in order to provide locality hints and improve application performance. We also present a modeling framework to map platform and application performance events to the hardware topology, in order to extract synthetic locality metrics. Finally, we propose an automatic locality policy selector, built on top of machine learning algorithms, to easily improve application task and data placement.
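
For readers unfamiliar with the Roofline model mentioned above, the following Python sketch computes the classic bound only: attainable performance is the minimum of the peak compute rate and the memory bandwidth multiplied by the arithmetic intensity of the kernel. The extension proposed in the thesis is not reproduced here, and all machine numbers, including the local and remote bandwidths, are made-up values chosen for illustration.

```python
def roofline(peak_gflops, bandwidth_gbs, intensity):
    """Classic Roofline bound in GFLOP/s.

    peak_gflops   -- peak compute rate of the machine (GFLOP/s)
    bandwidth_gbs -- sustained memory bandwidth (GB/s)
    intensity     -- arithmetic intensity of the kernel (FLOP per byte)
    """
    return min(peak_gflops, bandwidth_gbs * intensity)


if __name__ == "__main__":
    peak = 1000.0  # hypothetical peak, GFLOP/s
    # Hypothetical bandwidths: the memory the data lives in changes the ceiling.
    bandwidths = {"local memory": 100.0, "remote memory": 40.0}  # GB/s
    for name, bw in bandwidths.items():
        for intensity in (0.5, 4.0, 50.0):
            bound = roofline(peak, bw, intensity)
            print(f"{name}, {intensity} FLOP/byte: {bound} GFLOP/s")
```

A kernel with low arithmetic intensity is limited by the bandwidth of the memory it reads from, which is why placing its data in a slower or more distant memory lowers the bound.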

Jury:
Arnaud Legrand (CNRS/Inria Grenoble)
Patrick Carribault (CEA)
Cécile Germain (LRI/université de Paris)
Brice Goglin (Inria Bordeaux)
Emmanuel Jeannot (Inria Bordeaux)
Guillaume Papauré (Atos Grenoble)

Julien Herrmann joins the team as Postdoc

Julien Herrmann obtained his PhD from ENS Lyon in 2015. He will be working with Guillaume Aupy and Olivier Beaumont (RealOpt) on the influence of local storage capacities on task-based schedulers, with a focus on specific graph structures such as those involved in backpropagation.

He is funded by ANR Dash (ANR-17-CE25-0004) and by the regional funding “HPC Scalable Ecosystem”.

Andres Rubio Proano and Nicolas Vidal join the team as PhD students

Andres will work on task- and data-placement for HPC platforms with heterogeneous and non-volatile memories.

Talk by Navjot Kukreja (Imperial College) on June 28th, 2018

High-level abstractions for checkpointing in PDE-constrained optimisation

Gradient-based methods for PDE-constrained optimization problems often rely on solving a pair of forward and adjoint equations to calculate the gradient. This requires storing large amounts of intermediate data, limiting the largest problem that might be solved with a given amount of memory. Checkpointing is an approach that can reduce the amount of memory required by redoing parts of the computation instead of storing intermediate results. The Revolve checkpointing algorithm offers an optimal schedule that trades computational cost for smaller memory footprints. Integrating Revolve into a modern Python HPC code is not straightforward. We present pyrevolve, an API to the Revolve library that makes checkpointing accessible from a code generation environment. The separation of concerns effected by pyrevolve allows arbitrary operators to utilise checkpointing with no coupling. This means that more complex schedules like multi-level checkpointing can be implemented with no change in the PDE solver.
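
The memory versus recomputation trade-off behind Revolve can be illustrated with the classical binomial bound: with s checkpoints and at most r repeated forward evaluations of any time step, up to C(s + r, s) time steps can be reversed. The Python sketch below only evaluates this bound. It does not use pyrevolve or the Revolve library, and the function names are invented for this example.

```python
from math import comb


def max_steps(snaps, reps):
    """Largest number of time steps reversible with `snaps` checkpoints
    when no step is recomputed more than `reps` times (binomial bound)."""
    return comb(snaps + reps, snaps)


def min_repetitions(n_steps, snaps):
    """Smallest repetition count that makes `n_steps` reversible with `snaps` checkpoints."""
    if snaps < 1:
        raise ValueError("at least one checkpoint is required")
    reps = 0
    while max_steps(snaps, reps) < n_steps:
        reps += 1
    return reps


if __name__ == "__main__":
    for snaps in (2, 5, 10, 20):
        reps = min_repetitions(1000, snaps)
        print(f"{snaps} checkpoints: 1000 steps need at most {reps} repetitions per step")
```

Because the bound grows binomially, a handful of extra checkpoints sharply reduces the recomputation needed for a fixed number of time steps.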

The talk will be in Room Turing 2 at 10:30 am.

Talk by Jan Hückelheim (Imperial College) on June 28th, 2018

Algorithmic differentiation in high-performance computing: challenges and opportunities in optimisation, uncertainty quantification, and machine learning

Gradients are useful in countless applications, e.g. gradient-based shape optimisation in structural dynamics, adjoint methods in weather forecasting, or the training of neural networks. Algorithmic differentiation (AD) is a technique to efficiently compute gradients of computer programs, and has undergone decades of development. This talk will give a brief overview of AD techniques, and highlight some of the challenges that arise in the differentiation of code written for modern computer architectures such as multi-core and many-core processors, and the differentiation of high-level languages such as C++ or Python. The talk will also show some recent developments in the differentiation of shared-memory parallel fluid dynamics solvers for Intel Xeon Phi accelerators.

The talk will be in Room Alan Turing 2 at 10 am.