Hugo Taboada defends his PhD thesis

Hugo Taboada will defend his PhD thesis entitled “MPI Non-Blocking Collective Overlap on Manycore Processor” on Tuesday, December 11th at 10:00 AM.


Supercomputers used in HPC are composed of several interconnected machines. They are usually programmed using MPI, which specifies an API for exchanging messages between machines. To amortize the cost of MPI collective operations, non-blocking collectives have been proposed so that communication can be overlapped with computation. Initially, non-blocking operations were only available for communication between two MPI processes: point-to-point communications. They were extended to collective communications in 2012 with MPI 3.0, opening up the possibility of overlapping non-blocking collective communications with computation. However, these operations are more CPU-hungry than point-to-point communications.
We approach this problem from several angles. On the one hand, we focus on the placement of the progress threads generated by MPI non-blocking collectives. We propose two placement algorithms for these progress threads, applicable to all non-blocking collectives: binding them to free cores, or binding them to hyper-threads. We then focus on optimizing two types of algorithms used by collective operations: tree-based algorithms and chain-based algorithms.
On the other hand, we study the scheduling of progress threads to avoid running them when it is unnecessary for the progression of the collective algorithm. To that end, we first propose a mechanism to suspend the scheduling of these threads, and then force their optimal scheduling statically using semaphores. Finally, we introduce a proof-of-concept scheduling policy with priorities.
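As a rough illustration of the two placement strategies, here is a minimal sketch. All names are hypothetical, as is the common Linux convention that the hyper-thread sibling of core c is c plus the number of physical cores; this is not the thesis implementation.

```python
# Illustrative sketch only: function names and the sibling-core
# convention are assumptions, not the actual thesis code.

def place_on_free_cores(app_cores, all_cores, n_threads):
    """Bind progress threads to cores left free by the application,
    cycling through them if there are more threads than free cores."""
    free = [c for c in all_cores if c not in app_cores]
    return [free[i % len(free)] for i in range(n_threads)]

def place_on_hyperthreads(app_cores, n_physical_cores, n_threads):
    """Bind each progress thread to the hyper-thread sibling of an
    application core, so communication progresses without stealing
    a full core from the computation."""
    return [app_cores[i % len(app_cores)] + n_physical_cores
            for i in range(n_threads)]
```

For example, with an application pinned on cores 0-3 of an 8-core machine, the first strategy would place two progress threads on cores 4 and 5, while the second would co-locate them on the hyper-thread siblings of cores 0 and 1.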


The thesis reviewers are:
George Bosilca, University of Tennessee, Knoxville
Christian Perez, Inria Grenoble Rhône-Alpes

The jury is composed of:
Emmanuel Jeannot (Inria Bordeaux Sud-Ouest)
Alexandre Denis (Inria Bordeaux Sud-Ouest)
Julien Jaeger (CEA)
Christian Perez (Inria Grenoble Rhône-Alpes)
Jean-Marc Pierson (Université de Toulouse)
Raymond Namyst (Université de Bordeaux)
Pascale Rossé-Laurent (Bull Atos)

Nicolas Denoyelle defends his PhD thesis

Nicolas Denoyelle will defend his PhD thesis entitled “From Software Locality to Hardware Locality in Shared Memory Systems with Heterogeneous and Non-Uniform Memory” on Monday, November 5th at 2:00 PM.

Over the years, the memory hierarchy of High Performance Computing (HPC) systems has grown more complex. Nowadays, large-scale machines typically embed several levels of caches and a distributed memory. Recently, on-chip memories and non-volatile PCIe-based flash have entered the HPC landscape. This complex memory architecture is necessary to reach high performance, but at the cost of careful task and data placement. Hardware-managed caches used to hide these tedious locality optimizations; now, data locality (in local or remote memory, in fast or slow memory, in volatile or non-volatile memory, with small or large capacity) is entirely managed by software. This extra flexibility grants more freedom to application designers, but makes their work more complex and expensive: when placing tasks and data, one has to account for several complex trade-offs between memory performance, size, and features.

This thesis was supervised jointly by Atos Bull Technologies and Inria Bordeaux Sud-Ouest. In this document, we describe contemporary HPC systems and characterize machine performance under several locality scenarios. We explain how programming-language semantics affect data locality in the hardware, and thus application performance. Through joint work with the INESC-ID laboratory in Lisbon, we propose an insightful extension to the well-known Roofline performance model that provides locality hints and improves application performance. We also present a modeling framework that maps platform and application performance events onto the hardware topology in order to extract synthetic locality metrics. Finally, we propose an automatic locality-policy selector, built on machine-learning algorithms, to easily improve application task and data placement.
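For readers unfamiliar with the Roofline model mentioned above, its core formula is simple: attainable performance is bounded either by peak compute throughput or by memory bandwidth times the kernel's arithmetic intensity (flops per byte moved). A one-line sketch:

```python
def roofline(peak_flops, peak_bw_bytes, intensity):
    """Attainable performance (flop/s) under the basic Roofline model:
    min of the compute roof and the memory roof (bandwidth * intensity,
    with intensity in flops per byte)."""
    return min(peak_flops, peak_bw_bytes * intensity)
```

For a machine with 1 Tflop/s peak and 100 GB/s of bandwidth, a kernel with intensity 4 flops/byte is memory-bound at 400 Gflop/s, while a kernel with intensity 100 hits the 1 Tflop/s compute roof. The extension discussed in the thesis adds locality information on top of this basic model.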

The jury is composed of:
Arnaud Legrand (CNRS/Inria Grenoble)
Patrick Carribault (CEA)
Cécile Germain (LRI/Université de Paris)
Brice Goglin (Inria Bordeaux)
Emmanuel Jeannot (Inria Bordeaux)
Guillaume Papauré (Atos Grenoble)

Julien Herrmann joins the team as Postdoc

Julien Herrmann obtained his PhD from ENS Lyon in 2015. He will be working with Guillaume Aupy and Olivier Beaumont (RealOpt) on the influence of local storage capacities on task-based schedulers, with a focus on specific graph structures such as those involved in backpropagation.

He is funded by the ANR DASH project (ANR-17-CE25-0004) and by the “HPC Scalable Ecosystem” regional grant.

Andres Rubio Proano and Nicolas Vidal join the team as PhD students

Andres will work on task- and data-placement for HPC platforms with heterogeneous and non-volatile memories.


Talk by Navjot Kukreja (Imperial College) on June 28th, 2018

High-level abstractions for checkpointing in PDE-constrained optimisation

Gradient-based methods for PDE-constrained optimization problems often rely on solving a pair of forward and adjoint equations to calculate the gradient. This requires storing large amounts of intermediate data, limiting the largest problem that can be solved with a given amount of memory. Checkpointing is an approach that can reduce the amount of memory required by redoing parts of the computation instead of storing intermediate results. The Revolve checkpointing algorithm offers an optimal schedule that trades computational cost for smaller memory footprints. Integrating Revolve into a modern Python HPC code is not straightforward. We present pyrevolve, an API to the Revolve library that makes checkpointing accessible from a code generation environment. The separation of concerns effected by pyrevolve allows arbitrary operators to use checkpointing with no coupling. This means that more complex schedules, such as multi-level checkpointing, can be implemented with no change in the PDE solver.
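To make the memory/recomputation trade-off concrete, here is a deliberately simplified sketch: periodic checkpointing with recomputation from the nearest checkpoint. This is not pyrevolve's API, and Revolve's actual schedule is the provably optimal one rather than this fixed-period scheme; `forward` is a stand-in for one time step of a solver.

```python
def forward(state, step):
    # Stand-in for one forward time step of a PDE solver.
    return state + step

def run_with_checkpoints(n_steps, period):
    """Run the forward sweep, storing only every `period`-th state
    instead of all n_steps intermediate states."""
    ckpts, state = {}, 0
    for t in range(n_steps):
        if t % period == 0:
            ckpts[t] = state
        state = forward(state, t)
    return state, ckpts

def restore(ckpts, t, period):
    """Recover the state at time t (needed by the adjoint sweep) by
    recomputing forward from the nearest earlier checkpoint."""
    base = (t // period) * period
    state = ckpts[base]
    for s in range(base, t):
        state = forward(state, s)
    return state
```

With `period = 3` over 10 steps, only 4 states are stored instead of 10, and any intermediate state is recovered with at most 2 extra forward steps. Revolve generalizes this idea by choosing checkpoint positions that minimize total recomputation for a given memory budget.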


The talk will be in Room Turing 2 at 10:30 AM.

Talk by Jan Hückelheim (Imperial College) on June 28th, 2018

Algorithmic differentiation in high-performance computing: challenges and opportunities in optimisation, uncertainty quantification, and machine learning

Gradients are useful in countless applications, e.g. gradient-based shape optimisation in structural dynamics, adjoint methods in weather forecasting, or the training of neural networks. Algorithmic differentiation (AD) is a technique to efficiently compute gradients of computer programs, and has undergone decades of development. This talk will give a brief overview of AD techniques, and highlight some of the challenges that arise in the differentiation of code written for modern computer architectures such as multi-core and many-core processors, and the differentiation of high-level languages such as C++ or Python. The talk will also show some recent developments in the differentiation of shared-memory parallel fluid dynamics solvers for Intel Xeon Phi accelerators.
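For readers new to AD, here is a minimal sketch of one classic technique: forward-mode AD via dual numbers, where each value carries its derivative along through the computation. The class below handles only addition and multiplication and is purely illustrative of the idea, not of the tools discussed in the talk.

```python
class Dual:
    """Minimal forward-mode AD: a value paired with its derivative."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        # Product rule: (uv)' = u'v + uv'
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)
    __rmul__ = __mul__

def derivative(f, x):
    """Evaluate df/dx by seeding the input's derivative with 1."""
    return f(Dual(x, 1.0)).dot
```

For example, `derivative(lambda x: x*x + 3*x, 2.0)` yields 7.0, matching 2x + 3 at x = 2. Reverse-mode AD, which underlies adjoint methods and neural-network training, propagates derivatives in the opposite direction and is where many of the parallelization challenges mentioned in the talk arise.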

The talk will be in Room Alan Turing 2 at 10:00 AM.

Talk by Yves Robert on March 12th, 2018

Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms

joint work with Dorian Arnold, George Bosilca, Aurelien Bouteiller, Jack Dongarra, Kurt Ferreira and Thomas Hérault

In high-performance computing environments, input/output (I/O) from various sources often contends for scarce available bandwidth. In addition to the I/O operations inherent to the failure-free execution of an application, I/O from checkpoint/restart (CR) operations (used to ensure progress in the presence of failures) places an additional burden as it increases I/O contention, leading to degraded performance. In this work, we consider a cooperative scheduling policy that optimizes the overall performance of concurrently executing CR-based applications that share valuable I/O resources. First, we provide a theoretical model, and then derive a set of necessary constraints to minimize the global waste on the platform.

Our results demonstrate that the optimal checkpoint interval, as defined by Young/Daly, while a sensible choice for a single application, is not sufficient to optimally address resource contention at the platform scale. We therefore show that combining optimal checkpointing periods with I/O scheduling strategies can significantly improve overall application performance, thereby maximizing platform throughput. Overall, these results provide critical analysis and direct guidance on checkpointing large-scale workloads in the presence of competing I/O while minimizing the impact on application performance.
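The Young/Daly interval mentioned above has a simple closed form: the checkpointing period that minimizes expected wasted time, to first order, is the square root of twice the checkpoint cost times the platform's mean time between failures.

```python
import math

def young_daly_period(checkpoint_cost, mtbf):
    """First-order optimal checkpointing period (Young/Daly):
    W = sqrt(2 * C * MTBF), all quantities in the same time unit."""
    return math.sqrt(2 * checkpoint_cost * mtbf)
```

For instance, with a 60-second checkpoint and a one-day MTBF, the optimal period is about 3220 seconds, i.e. checkpoint roughly every 54 minutes. The talk's point is that applying this per-application formula naively is no longer optimal once several applications contend for the same I/O bandwidth.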

Talk by Jalil Boukhobza on Feb 27, 2018

Title: Toward an Orthogonal Approach to Optimizing (New) Storage Systems

Abstract: Today, in a single minute, more than 3 million posts are written on Facebook, more than 40,000 photos are uploaded to Instagram, and more than 120 hours of video are uploaded to YouTube. These are only a few examples of the deluge of digital data stored in various data centers. Processing this data has become a major economic and societal challenge. A prerequisite for processing this information efficiently is an efficient storage system.

During this talk, we will try to understand why flash memory has taken over the market, describing some of our contributions along the way. These contributions were designed at three complementary levels: architecture, system, and application. We will also introduce the characteristics of some new non-volatile memory technologies that may upend our conception of memory and storage in the near future.

Talk by Francieli Zanon Boito on Feb 15, 2018

Francieli Zanon Boito (postdoc in the Inria Corse team in Grenoble) will present her research work.

Title: I/O scheduling for HPC: finding the right access pattern and mitigating interference

Abstract: Scientific applications are executed in high performance computing (HPC) environments, where a parallel file system (PFS) provides access to a shared storage infrastructure. The key characteristic of these systems is the use of multiple storage servers, from which clients can obtain data in parallel. The performance observed by applications when accessing a PFS is directly affected by the way they perform this access, i.e. their access pattern. Additionally, when multiple applications access the PFS concurrently, their performance suffers from interference.

In this seminar, I’ll discuss my previous and current work on I/O scheduling at different levels of the I/O stack, adapting policies to applications’ access patterns and working to mitigate interference.
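As a toy illustration of why interference matters (a deliberate simplification, not the scheduling policies studied in the talk), compare two applications sharing a storage system's bandwidth fairly against a scheduler that grants them exclusive, serialized access:

```python
def completion_times(workloads, bandwidth, exclusive):
    """Toy model: `workloads` are I/O volumes per application.
    With fair sharing, each of the n apps gets bandwidth/n for its
    whole run (a simplification: real sharing speeds up survivors
    as apps finish). With exclusive access, apps run one after
    another at full bandwidth, shortest first."""
    if exclusive:
        times, clock = [], 0.0
        for w in sorted(workloads):
            clock += w / bandwidth
            times.append(clock)
        return times
    n = len(workloads)
    return [w / (bandwidth / n) for w in workloads]
```

With two equal workloads, fair sharing finishes both at time 20, while serialized access finishes one at 10 and the other still at 20: the same total throughput, but a better average completion time. This is the kind of trade-off an interference-aware I/O scheduler can exploit.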

Open PhD position

A PhD position is available in the team on “Data Placement Strategies for Heterogeneous and Non-Volatile Memories in High Performance Computing”.

Get more details and post your CV at