Dec 08

Thesis defense: Maria Predari

Friday December 9 – 14:30 – Ada Lovelace room

Title: Load Balancing for Coupled Simulations

Abstract: In the field of scientific computing, load balancing is an important step that conditions the performance of parallel programs. The goal is to distribute the computational load across multiple processors in order to minimize the execution time. This is a well-known problem that is unfortunately NP-hard. The most common approach is based on graph or hypergraph partitioning methods, using mature and efficient software tools such as Metis, Zoltan or Scotch.
Nowadays, numerical simulations are becoming more and more complex, mixing several models and codes to represent different physics or scales. Here, the key idea is to reuse available legacy codes through a coupling framework instead of merging them into a standalone application. For instance, the simulation of the Earth's climate system typically involves at least four codes for atmosphere, ocean, land surface and sea ice. Combining such different codes is still a challenge for reaching high performance and scalability. In this context, one crucial issue is the load balancing of the whole coupled simulation, which remains an open question. The goal here is to find the best data distribution for the coupled codes as a whole, and not only for each standalone code, as is usually done. Indeed, balancing each code naively on its own can lead to a significant imbalance and to a communication bottleneck during the coupling phase, which can dramatically decrease the overall performance. We therefore argue that the coupling itself must be modeled in order to ensure good scalability, especially when running on tens of thousands of processors. In this work, we develop new algorithms to perform a coupling-aware partitioning of the whole application.
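As a concrete illustration of the graph-partitioning approach mentioned above, the sketch below partitions a tiny graph into two parts with METIS, the kind of call a load-balancing step typically relies on. The graph (a chain of four vertices) and the part count are made up for the example; a real coupled simulation would build its graph from the mesh and the coupling interfaces.

// Minimal sketch: partition a toy graph with METIS.
#include <metis.h>
#include <cstdio>
#include <vector>

int main() {
    idx_t nvtxs = 4, ncon = 1, nparts = 2, objval;
    // CSR adjacency of the chain 0-1-2-3 (toy example, not data from the thesis)
    std::vector<idx_t> xadj   = {0, 1, 3, 5, 6};
    std::vector<idx_t> adjncy = {1, 0, 2, 1, 3, 2};
    std::vector<idx_t> part(nvtxs);

    int status = METIS_PartGraphKway(&nvtxs, &ncon, xadj.data(), adjncy.data(),
                                     /*vwgt=*/nullptr, /*vsize=*/nullptr, /*adjwgt=*/nullptr,
                                     &nparts, /*tpwgts=*/nullptr, /*ubvec=*/nullptr,
                                     /*options=*/nullptr, &objval, part.data());
    if (status != METIS_OK) return 1;

    for (idx_t v = 0; v < nvtxs; ++v)
        std::printf("vertex %d -> part %d\n", (int)v, (int)part[v]);
    std::printf("edge cut: %d\n", (int)objval);
    return 0;
}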

 

Sep 27

HiePACS working group – Friday September 30 – 9:00 – Ada Lovelace

The next HiePACS working group will be held on Friday September 30, 9:00 – 11:30, in Ada Lovelace.
It will consist of a survey of recent activities on hybrid solvers in the team.
Note that there will also be a presentation by Nathalie Moller (UVSQ/Dassault) at 14:30 the same day on the FMM.

Sep 14

Thesis defense: Jean-Marie Couteyen

Monday September 19 – 14:00 – LaBRI amphitheater

Title: Parallelization and scaling of the FLUSEPA code

Abstract:

Many types of satellites provide services that are useful in everyday life: satellite imagery, telecommunications, geolocation, and so on. Placing them into orbit requires launchers, whose design is one of the activities of Airbus Safran Launchers. When designing launchers, access to experimental data is not straightforward: wind tunnels cannot reproduce all the critical situations a launcher will face during its mission. Numerical simulation is therefore essential for the aerospace industry. Obtaining more accurate simulations requires access to, and the ability to exploit, substantial computing power through the use of supercomputers. These supercomputers evolve rapidly and are increasingly complex; existing codes must therefore be adapted in order to use them efficiently. Today, it seems increasingly necessary to rely on abstractions in order to port codes to new machines at a reasonable human cost and with good performance portability.

For more than 20 years, Airbus Safran Launchers has been developing the FLUSEPA simulation code, which is particularly well suited to modeling unsteady phenomena with changing topology, such as stage separations and the lift-off of space launchers. The code is based on a finite volume formulation. Relative motion is handled through an original conservative overlapping-mesh technique, and an explicit adaptive time integration scheme allows fast transients to be computed very efficiently.

The work carried out during this thesis concerns the parallelization of the FLUSEPA code, which was initially parallelized only in shared memory with OpenMP. A first distributed version of the code was developed, using hybrid MPI+OpenMP programming for compute clusters. The gains brought by this version were evaluated on two industrial computations. A demonstrator based on a task-graph programming model on top of a runtime system was also developed, in order to address more adequately the efficiency issues raised by the MPI+OpenMP version.
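To make the hybrid MPI+OpenMP scheme mentioned above concrete, here is a generic sketch (not FLUSEPA code): each MPI rank owns a block of cells and updates it with an OpenMP parallel loop, followed by a global reduction as a stand-in for the communication phase. The array and update rule are invented for the illustration.

// Generic MPI+OpenMP pattern: distributed memory across ranks,
// shared-memory threading inside each rank. Not taken from FLUSEPA.
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Each rank owns a block of "cells" (hypothetical local state).
    std::vector<double> cells(100000, 1.0);

    // Shared-memory parallelism inside the rank.
    #pragma omp parallel for
    for (long i = 0; i < (long)cells.size(); ++i)
        cells[i] *= 0.5;  // stand-in for a per-cell update

    // A global reduction, standing in for the communication phase.
    double local = 0.0, global = 0.0;
    for (double c : cells) local += c;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) std::printf("global sum = %f (%d ranks)\n", global, size);
    MPI_Finalize();
    return 0;
}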

May 13

HPC in the service of electromagnetism and acoustics

A presentation by Guillaume Sylvand, Expert Engineer at Airbus Group Innovations and HiePACS

Friday May 13, 2016, from 13:00 to 14:00, Ada Lovelace room, Inria

From wind-tunnel models to flight tests, through the design, manufacturing, transport and assembly of every part of the aircraft, Airbus Group goes through a very large number of phases to design an airplane. Long before the design offices begin their work, engineers manipulate equations that model the essential physics of the aircraft (aerodynamics, structural mechanics, electromagnetism, acoustics, ...) in order to guarantee a reliable design. The stakes are enormous.

In his talk, Guillaume Sylvand will focus on electromagnetics and acoustics and explain how, by modeling highly complex wave problems on computers (using, among others, tools developed at Inria), one achieves efficient solutions that help minimize the development cost of tomorrow's aircraft.

Mar 10

HiePACS Working Group

The next HiePACS Working Group will take place on Monday April 18 at 9:30 in Ada Lovelace.


Context: Robert Clay and Keita Teranishi are visiting HiePACS and the Inria Bordeaux HPC teams on Monday April 18.

The morning will be dedicated to two talks on runtime systems and resilience, respectively.

9:30 Robert Clay (SNL)

 

Title: The DHARMA Approach to Asynchronous Many Task Programming

 

Abstract: Asynchronous Many-Task (AMT) programming models and runtime systems hold the promise of addressing key issues in future extreme-scale computer architectures, and hence are an active exascale research area. The DHARMA project at Sandia National Labs is working towards three complementary AMT research goals: 1) co-design a programming model specification that incorporates both application requirements and lessons learned from other AMT efforts; 2) design an implementation of that spec, leveraging existing components and expertise from the community; 3) engage the AMT community longer term to define best practices and ultimately standards. In this talk we discuss recent results and the current state of the DHARMA project. We highlight our recent comparative analysis study and how it informs our higher-level design philosophy. We introduce features from our developing spec and where that spec fits in the AMT design space. Finally, we discuss the effort remaining to achieve a DHARMA implementation.
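DHARMA's own API is not shown here; as a generic illustration of the asynchronous many-task idea, the sketch below expresses a small dependency graph with OpenMP tasks, letting the runtime rather than the programmer decide the execution order. The variables and the tiny computation are invented for the example.

// Generic asynchronous-task sketch (OpenMP tasks, not the DHARMA API):
// tasks declare the data they read/write and the runtime orders them.
#include <cstdio>

int main() {
    double a = 0.0, b = 0.0, c = 0.0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a)
        a = 1.0;                       // produces a
        #pragma omp task depend(out: b)
        b = 2.0;                       // produces b, may run concurrently with the task above
        #pragma omp task depend(in: a, b) depend(out: c)
        c = a + b;                     // waits for both producers
        #pragma omp taskwait
    }
    std::printf("c = %f\n", c);
    return 0;
}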

 

10:30 Coffee break

 

11:00 Keita Teranishi (SNL)
 
Title: FENIX for Scalable Online Application Resilience
 
Abstract: Major exascale reports indicate that future HPC systems will suffer shorter Mean Time Between Failures (MTBF) due to the increase in system complexity and the shrinking of hardware components. For such unreliable computing systems, it is reasonable for application users to explicitly manage the response to frequent system failures. Traditionally, checkpoint-restart (CR) has been a popular resilience technique for application users, but it incurs undue costs associated with access to secondary storage (distributed I/O) and the global restart of parallel programs. Interestingly, anecdotal evidence suggests that the majority of large-scale HPC application failures are attributable to failures of a single node. If this holds, traditional CR uses unnecessary system resources to contain application failures of any scale, which suggests a new approach that adapts to the scale of the failure. We have proposed the Local Failure Local Recovery (LFLR) concept to allow parallel applications to recover locally from single-node (local) failures without global program termination and restart. In a joint effort with Rutgers University, we have developed prototype software, FENIX, to realize scalable online application recovery using MPI-ULFM (a fault-tolerant MPI prototype). In this talk, we will discuss the architecture of FENIX, its capabilities and future research directions.
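For readers unfamiliar with the checkpoint-restart baseline that LFLR improves upon, here is a deliberately simplified sketch of application-level CR: a periodic global checkpoint to stable storage and a restart from the last one. The file name and state array are hypothetical, and FENIX's actual interface is not shown.

// Simplified application-level checkpoint/restart loop (the traditional CR
// baseline discussed in the abstract). Not the FENIX API.
#include <cstdio>
#include <vector>

static const char* kCkptFile = "state.ckpt";  // hypothetical checkpoint file

bool load_checkpoint(std::vector<double>& state, int& step) {
    std::FILE* f = std::fopen(kCkptFile, "rb");
    if (!f) return false;
    std::fread(&step, sizeof step, 1, f);
    std::fread(state.data(), sizeof(double), state.size(), f);
    std::fclose(f);
    return true;
}

void save_checkpoint(const std::vector<double>& state, int step) {
    std::FILE* f = std::fopen(kCkptFile, "wb");
    std::fwrite(&step, sizeof step, 1, f);
    std::fwrite(state.data(), sizeof(double), state.size(), f);
    std::fclose(f);
}

int main() {
    std::vector<double> state(1000, 0.0);
    int step = 0;
    load_checkpoint(state, step);          // resume from the last checkpoint, if any

    const int n_steps = 100, ckpt_every = 10;
    for (; step < n_steps; ++step) {
        for (double& x : state) x += 1.0;  // stand-in for one simulation step
        if (step % ckpt_every == 0)
            save_checkpoint(state, step);  // global checkpoint to stable storage
    }
    std::printf("done at step %d\n", step);
    return 0;
}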

Feb 15

Thesis defense: Bérenger Bramas

Bérenger Bramas will defend his thesis on Monday February 15 at 14:30 in the Ada Lovelace room.

“Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain”

 

Committee:

George Biros – Professor – The University of Texas at Austin
Olivier Coulaud (Advisor) – Research Director – Inria Bordeaux – Sud-Ouest
Pascal Havé – Researcher – IFP Energies nouvelles
Stéphane Lanteri – Research Director – Inria Sophia Antipolis
Raymond Namyst – Professor – The University of Bordeaux
Guillaume Sylvand (Advisor) – Researcher – Airbus Group Innovations
Isabelle Terrasse – Research Director – Airbus Group
Richard Vuduc – Associate Professor – Georgia Institute of Technology

Abstract:
The time-domain BEM for the wave equation in acoustics and electromagnetism is used to simulate the propagation of a wave with a discretization in time. It makes it possible to obtain several frequency-domain results with a single solve. In this thesis, we investigate the implementation of an efficient TD-BEM solver using different approaches. We describe the context of our study and the TD-BEM formulation, expressed as a sparse linear system composed of multiple interaction/convolution matrices. This system is naturally computed using the sparse matrix-vector product (SpMV). We study the limits of the SpMV kernel by looking at matrix reordering and at the behavior of our SpMV kernels using vectorization (SIMD) on CPUs and an advanced blocking layout on Nvidia GPUs. We show that this operator is not appropriate for our problem, and we then propose to reorder the original computation to obtain a special matrix structure. This new structure, called a slice matrix, is computed with a custom matrix/vector product operator. We present an optimized implementation of this operator on CPUs and Nvidia GPUs, for which we describe advanced blocking schemes. The resulting solver is parallelized with a hybrid strategy over heterogeneous nodes and relies on a new heuristic to balance the work among the processing units. Due to the quadratic complexity of this matrix approach, we study the use of the fast multipole method (FMM) for our time-domain BEM solver. We investigate the parallelization of the general FMM algorithm using several paradigms in both shared and distributed memory, and we explain how modern runtime systems are well suited to express the FMM computation. Finally, we investigate the implementation and the parametrization of an FMM kernel specific to our TD-BEM, and we provide preliminary results.
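The SpMV kernel at the heart of the discussion is the standard compressed sparse row (CSR) matrix-vector product; a plain scalar version is sketched below as a reference point. The vectorized and GPU-blocked variants studied in the thesis, as well as the slice-matrix operator, are not reproduced here.

// Plain CSR sparse matrix-vector product y = A*x, the baseline kernel that the
// thesis optimizes with SIMD and GPU blocking (those variants are not shown).
#include <vector>
#include <cstddef>

struct CsrMatrix {
    std::size_t n_rows;
    std::vector<std::size_t> row_ptr;  // size n_rows + 1
    std::vector<std::size_t> col_idx;  // size nnz
    std::vector<double>      values;   // size nnz
};

void spmv(const CsrMatrix& A, const std::vector<double>& x, std::vector<double>& y) {
    for (std::size_t i = 0; i < A.n_rows; ++i) {
        double sum = 0.0;
        for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            sum += A.values[k] * x[A.col_idx[k]];
        y[i] = sum;
    }
}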

Jan 25

Solhar Meeting @ Bordeaux

Solhar meeting: Monday January 25th

Program:
09:00-09:30 Terry Cojean: “Resource aggregation in task-based applications over accelerator-based multicore machines”
09:30-10:00 Suraj Kumar: “Are Static Schedules so Bad? A Case Study on Cholesky Factorization”
10:00-10:30 Thomas Lambert: “A New Approximation Algorithm for Matrix Partitioning in Presence of Strongly Heterogeneous Processors”

10:30-11:00 Break

11:00-11:30 Loris Marchal: “A reasonable model of malleable tasks for linear algebra task graphs”
11:30-12:00 Bérenger Bramas: “Fast Multipole Methods over task-based runtime systems”

12:00-14:00 Lunch break

14:30-15:00 Marc Sergent: “Controlling the Memory Subscription of Applications with a Task-Based Runtime System”
15:00-15:30 Emmanuel Agullo: “StarPU-Simgrid: Overview and current work”
15:30-16:00 Alfredo Buttari: “Performance analysis of parallel codes on heterogeneous systems”
16:00-16:30 Samuel Thibault: “Open discussion about StarPU”

Dec 10

Thesis defense: Salli Moustafa

December 15, 2015 – 10:30 – EDF Clamart

Title: Massively Parallel Cartesian Discrete Ordinates Method for Neutron Transport Simulation

Abstract:

The goal of the research presented in this thesis was to study the challenges posed by the use of hierarchical massively parallel computers for solving the neutron transport equation according to the discrete ordinates method (SN), and to propose a suitable solution. The classic source iteration (SI) scheme used for solving this equation involves the so-called sweep operation on the spatial mesh. This sweep operation gathers the vast majority of the computations in the SN method and exhibits a wavefront-like progression over the spatial mesh.
We first proposed a strategy for designing an efficient parallel implementation of the sweep operation on modern architectures by combining the SIMD paradigm with emerging task-based runtime systems. We have shown that the PaRSEC framework, based on a parametrized DAG model, produced the most efficient implementation of the Cartesian transport sweep, compared to the Intel TBB and StarPU versions. We designed an accurate parallel sweep simulator in order to determine the optimal parallel spatial decomposition of the transport sweep. We used this sweep simulator to justify the need for a task-based implementation of the sweep operation in order to maximize its performance on multicore-based architectures. Using the optimal partitioning, the performance of the PaRSEC implementation of the sweep operation reaches 6.1 Tflop/s on 768 cores of the IVANOE supercomputer, which corresponds to 33.9% of the theoretical peak performance of this set of computational resources.
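To illustrate the wavefront progression of the sweep mentioned above, here is a minimal 2D Cartesian sketch for a single angular direction entering from the lower-left corner: cells on the same anti-diagonal are independent and can be processed in parallel. The mesh size, boundary inflow and update rule are placeholders, not the actual SN kernel.

// Minimal 2D wavefront sweep sketch for one direction (+x, +y):
// cell (i, j) depends on (i-1, j) and (i, j-1), so cells on the same
// anti-diagonal are independent. The update formula is a placeholder.
#include <vector>
#include <algorithm>

int main() {
    const int nx = 8, ny = 8;
    std::vector<double> phi(nx * ny, 0.0);
    auto at = [&](int i, int j) -> double& { return phi[j * nx + i]; };

    for (int d = 0; d < nx + ny - 1; ++d) {          // sweep diagonal by diagonal
        int i_begin = std::max(0, d - ny + 1);
        int i_end   = std::min(nx - 1, d);
        #pragma omp parallel for                      // cells of a diagonal are independent
        for (int i = i_begin; i <= i_end; ++i) {
            int j = d - i;
            double up_x = (i > 0) ? at(i - 1, j) : 1.0;  // boundary inflow
            double up_y = (j > 0) ? at(i, j - 1) : 1.0;
            at(i, j) = 0.5 * (up_x + up_y);              // placeholder update
        }
    }
    return 0;
}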
Then we studied the challenge of converging the source iterations in highly diffusive media such as PWR cores. We have implemented and studied the convergence of a new acceleration scheme (PDSA) that naturally suits our hybrid parallel implementation. The combination of all these techniques has enabled us to develop a massively parallel version of the Domino solver. It is capable of tackling the challenges posed by neutron transport simulations and compares favorably with state-of-the-art solvers such as Denovo. For a typical 26-group PWR calculation involving 1.02 × 10^12 DoFs, the time to solution required by the Domino solver is 45 min using 1536 cores. Consequently, the Domino solver can be used by nuclear power plant operators such as EDF for improving the efficiency and safety of nuclear power plants.

Dec 10

Thesis defense: Stojce Nakov

Monday December 14 – LaBRI amphitheater – 15:00

Title: On the design of sparse hybrid linear solvers for modern parallel architectures

Abstract:

Over the last few decades, there have been innumerable science, engineering and societal breakthroughs enabled by the development of High Performance Computing (HPC) applications, algorithms and architectures. These powerful tools have provided researchers with the ability to computationally find efficient solutions for some of the most challenging scientific questions and problems in medicine and biology, climatology, nanotechnology, energy and environment. In the context of this thesis, our focus is on numerical linear algebra, more precisely on the solution of large sparse systems of linear equations. We focus on designing efficient parallel implementations of MaPHyS, a hybrid linear solver based on domain decomposition techniques. Two approaches are considered in that perspective.

First, we investigate the MPI+threads approach. In MaPHyS, the first level of parallelism arises from the independent treatment of the various subdomains and is managed using message passing. The second level is exploited thanks to the use of multi-threaded dense and sparse linear algebra kernels involved at the subdomain level. Such a hybrid implementation of a hybrid linear solver suitably matches the hierarchical structure of modern supercomputers and enables a trade-off between the numerical and parallel performances of the solver. We describe how the interoperability between the various kernels has to be mastered to ensure the scalability of the parallel solver. We demonstrate the flexibility of our parallel implementation on a set of test examples coming from classical test matrices as well as from challenging geoscience test cases provided by our industrial partner Total.

Secondly, we follow a more disruptive approach where the algorithms are described as sets of tasks with data inter-dependencies, leading to a directed acyclic graph (DAG) representation. The scheduling and mapping of these tasks are handled by a runtime system. Such an approach makes it possible to keep a high-level description of the algorithms and does not require interleaving their numerical and parallel complexities. While designing a hybrid solver from scratch based on this paradigm would be a huge development effort, we perform an incremental feasibility study. We first illustrate how a first task-based parallel implementation can be obtained by composing task-based parallel libraries within MPI processes, and we illustrate our discussion with a preliminary prototype implementation of such a hybrid solver. We then show how a task-based approach that fully abstracts the hardware architecture can successfully exploit a wide range of modern hardware architectures in the case of a key component of the hybrid solver, namely a Krylov method. We implemented a full task-based conjugate gradient algorithm and showed that the proposed approach leads to very high performance on multi-GPU, multicore and heterogeneous architectures. This preliminary study motivates the design of the whole hybrid solver as a full task-based algorithm, which will be the focus of future work.
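The conjugate gradient method mentioned above decomposes into a handful of kernels (SpMV, dot products, axpy-like updates) that become tasks in a task-based formulation. The sketch below is a plain sequential CG on a toy dense SPD system, with comments marking those kernels; it is not the actual task-based implementation from the thesis, and the matrix and sizes are invented for the example.

// Sequential conjugate gradient; each commented kernel (SpMV, dot, axpy)
// is what becomes a task in a task-based formulation.
#include <vector>
#include <cmath>
#include <cstdio>

using Vec = std::vector<double>;

double dot(const Vec& a, const Vec& b) {                 // kernel: dot product
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

int main() {
    const int n = 3;
    // Toy SPD matrix A and right-hand side b (illustration only).
    std::vector<Vec> A = {{4, 1, 0}, {1, 3, 1}, {0, 1, 2}};
    Vec b = {1, 2, 3}, x(n, 0.0), r = b, p = r, Ap(n);

    double rr = dot(r, r);
    for (int it = 0; it < 100 && std::sqrt(rr) > 1e-10; ++it) {
        for (int i = 0; i < n; ++i) {                    // kernel: SpMV, Ap = A*p
            Ap[i] = 0.0;
            for (int j = 0; j < n; ++j) Ap[i] += A[i][j] * p[j];
        }
        double alpha = rr / dot(p, Ap);                  // kernel: dot
        for (int i = 0; i < n; ++i) x[i] += alpha * p[i];    // kernel: axpy
        for (int i = 0; i < n; ++i) r[i] -= alpha * Ap[i];   // kernel: axpy
        double rr_new = dot(r, r);                       // kernel: dot
        double beta = rr_new / rr;
        for (int i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];  // kernel: axpy-like
        rr = rr_new;
    }
    std::printf("x = [%f, %f, %f]\n", x[0], x[1], x[2]);
    return 0;
}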

Dec 10

Thesis defense: Arnaud Etcheverry

Monday November 23 at 14:30 – Ada Lovelace room.

Title: “Towards very large scale dislocation dynamics simulations”

Abstract:

The work carried out during this thesis aims to provide a dislocation dynamics simulation code with the essential components needed to scale on modern supercomputers.

We address several aspects of the numerical simulation, starting with algorithmic considerations. To enable simulations that are efficient in terms of algorithmic complexity at large scale, we examine the constraints of the different stages of the simulation, providing an analysis of the algorithms and improvements to them.

Particular attention is then paid to data structures. Taking the new algorithms into account, we propose a data structure that provides efficient access throughout the memory hierarchy. This structure is modular so as to accommodate two kinds of algorithms: on one side the management of the mesh, which requires dynamic memory management, and on the other the compute-intensive phases, which need fast accesses. This modular structure is complemented by an octree that handles the domain decomposition as well as hierarchical algorithms such as the stress field computation and collision detection.
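As a rough sketch of the kind of octree used for the domain decomposition and the hierarchical algorithms mentioned above, a node might look like the following. This is a generic illustration under simplifying assumptions (cubic cells, leaf-stored segment indices), not the data structure developed in the thesis.

// Very rough octree sketch: each node covers a cubic region of the domain and
// either stores segment indices (leaf) or eight children.
#include <array>
#include <memory>
#include <vector>

struct Box {
    std::array<double, 3> center;
    double half_width;
};

struct OctreeNode {
    Box box;
    std::vector<int> segment_ids;                        // dislocation segments in a leaf
    std::array<std::unique_ptr<OctreeNode>, 8> children; // all empty for a leaf

    bool is_leaf() const { return !children[0]; }

    // Split a leaf into eight octants (contents would then be redistributed).
    void split() {
        for (int c = 0; c < 8; ++c) {
            Box child = box;
            child.half_width *= 0.5;
            for (int d = 0; d < 3; ++d)
                child.center[d] += (((c >> d) & 1) ? 0.5 : -0.5) * box.half_width;
            children[c] = std::make_unique<OctreeNode>();
            children[c]->box = child;
        }
    }
};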

Finally, we present the parallel aspects of the code. We introduce a hybrid approach, with fine-grained thread-based parallelism and coarse-grained MPI parallelism, the latter requiring domain decomposition and load balancing.

These contributions are then validated on challenging test cases consisting of Frank-Read sources in a zirconium crystal containing a high density of irradiation loops.
