A new beta hwloc release 2.6.0rc1 was published. It brings better support for hybrid CPUs, AMD GPUs, Linux memory binding, and more.
On October 12, Nicolas Nisse, from the Inria team COATI, will introduce the field he works in and its links to problems arising in HPC.
Title: A brief introduction to parameterized algorithms
This talk will be a mini-course on parameterized complexity and algorithms. We will first survey the basics and show how the structure of problem instances can be exploited to design Fixed-Parameter Tractable (FPT) algorithms. We will give some examples drawn from scheduling problems and conclude with a discussion of tree-decompositions of graphs, an important tool in the design of FPT algorithms.
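To make the FPT idea concrete, here is a classic textbook example, not taken from the talk itself: the bounded-search-tree algorithm for Vertex Cover parameterized by the solution size k. Its running time is O(2^k · m), so the exponential part depends only on the parameter, not on the instance size.

```python
# Fixed-Parameter Tractable (FPT) branching for Vertex Cover:
# does the graph have a vertex cover of size <= k?
def vertex_cover_at_most_k(edges, k):
    """edges: list of (u, v) pairs; k: the parameter (budget)."""
    if not edges:
        return True          # nothing left to cover
    if k == 0:
        return False         # edges remain but budget is exhausted
    u, v = edges[0]
    # Every cover must contain u or v: branch on both choices,
    # each branch spending one unit of the budget.
    without_u = [(a, b) for (a, b) in edges if a != u and b != u]
    without_v = [(a, b) for (a, b) in edges if a != v and b != v]
    return (vertex_cover_at_most_k(without_u, k - 1)
            or vertex_cover_at_most_k(without_v, k - 1))
```

The search tree has depth at most k and branching factor 2, which is exactly the structure exploited by FPT analysis: for a fixed k, the algorithm is linear in the graph size.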
The slides are available here.
The PhD thesis is entitled “Data Placement Strategies for Heterogeneous and Non-Volatile Memory in High Performance Computing”.
On June 18th, Clément Foyer, a post-doc in our team, will present the work he did during his PhD.
Title: Abstractions for Portable Data Management in Heterogeneous Memory Systems
This talk presents the work I did for my thesis at the University of Bristol. This thesis is a study of data selection and placement in heterogeneous memories in modern high-performance computer architectures. Memory systems are becoming increasingly complex and diverse, which complicates the search for optimal data placement and reduces the portability of applications. As we enter the dawn of the exascale era, memory models have to be rethought to consider the new trade-offs between latency, bandwidth, capacity, persistence and accessibility, and their impact on performance. Moreover, this data management needs to be simplified and brought within reach of domain scientists in fields outside of Computer Science.
To address this issue, this work focuses on studying data movement, data optimisation and memory management in systems with heterogeneous memory. Firstly, a new algorithm was developed that improves the computation of data exchange in the context of multigrid data redistribution.
Secondly, multiple APIs for memory management were unified into a single abstraction that provides memory allocations and transfers in the form of a portable, adaptive and modular library. Lastly, allocation management was studied in a high-level language, along with ways to expose low-level control over memory placement to such a language.
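The kind of unified abstraction described above can be illustrated with a toy sketch (this is not the actual API from the thesis): a single allocation entry point that dispatches to backend-specific allocators registered per memory kind, so callers name a memory kind rather than a vendor API.

```python
# Toy memory-kind registry: one portable front-end, pluggable backends.
_backends = {}

def register_backend(kind, alloc_fn):
    """Register an allocator for a memory kind (e.g. dram, hbm, nvdimm)."""
    _backends[kind] = alloc_fn

def allocate(size, kind="dram"):
    """Portable allocation: the caller names a kind, not a backend API."""
    return _backends[kind](size)

# A trivial host-memory backend; a real library would register
# vendor-specific allocators (HBM, persistent memory, GPU...) here.
register_backend("dram", lambda size: bytearray(size))

buf = allocate(1024)  # 1 KiB from the default kind
```

The point of the indirection is portability: swapping the target memory tier changes one string, not the application's allocation code.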
The Adjacent Shifting of PEriodic Node data (ASPEN) algorithm, presented in this thesis, provides better performance than state-of-the-art algorithms used for producer-consumer data redistribution of block-cyclic organised data, as used in distributed numerical applications and libraries (e.g. ScaLAPACK). The MAMBA library was developed and aims to facilitate data management on heterogeneous memory systems. It uses a data broker developed with library cooperation and interoperability in mind. In addition to providing portability and memory abstraction, it also serves as a comparison tool for benchmarking or exploratory experiments. Finally, a use case of memory management in C for a Python application based on a distributed framework has been studied as a proof-of-concept for providing direct memory management to high-level application development.
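For readers unfamiliar with block-cyclic layouts, the following sketch shows the ownership rule underlying ScaLAPACK-style distributions; it is background for the redistribution problem ASPEN solves, not the ASPEN algorithm itself. Redistributing between a producer layout and a consumer layout amounts to determining, for each global index, its owner under both layouts.

```python
# 1-D block-cyclic ownership: blocks of `block` consecutive indices are
# dealt to `nranks` ranks in round-robin order.
def owner(i, block, nranks):
    """Rank owning global index i."""
    return (i // block) % nranks

def redistribution_pairs(n, bp, p, bq, q):
    """(producer_rank, consumer_rank) for each global index 0..n-1,
    for a producer layout (p ranks, block size bp) and a consumer
    layout (q ranks, block size bq)."""
    return [(owner(i, bp, p), owner(i, bq, q)) for i in range(n)]
```

Naively enumerating every index this way is what efficient redistribution algorithms avoid: they exploit the periodicity of the two layouts to compute whole message blocks at once.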
This work presents a data-centric approach to the challenges heterogeneous memory creates for performance-seeking applications.
On May 25th, Rosa M. Badia, from the Barcelona Supercomputing Center (BSC), will present her research group's recent activities.
Title: Parallel machine learning with PyCOMPSs and dislib
The seminar will present our group's research on parallel programming models, more specifically on PyCOMPSs. PyCOMPSs is a task-based programming model for distributed computing platforms. In the seminar we will present the basics of the programming model, as well as some of our recent work on parallelizing machine learning with the dislib library. Dislib is a distributed, parallel machine-learning library with a syntax inspired by scikit-learn, parallelized with PyCOMPSs.
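The task-based style described above can be mimicked in a few lines of plain Python; this is a deliberately simplified stand-in, not the real PyCOMPSs API. Functions marked as tasks run asynchronously and return futures, and the program synchronizes explicitly when it needs a result, letting independent tasks execute in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor()

def task(fn):
    """Toy @task decorator: calling the function submits it for
    asynchronous execution and returns a future."""
    def submit(*args, **kwargs):
        return _pool.submit(fn, *args, **kwargs)
    return submit

def wait_on(future):
    """Explicit synchronization point: block until the task finishes."""
    return future.result()

@task
def partial_sum(chunk):
    return sum(chunk)

# The three partial sums are independent tasks and may run in parallel.
futures = [partial_sum(chunk) for chunk in ([1, 2], [3, 4], [5, 6])]
total = sum(wait_on(f) for f in futures)
```

In a real task-based runtime, the scheduler additionally tracks data dependencies between tasks and distributes them across the nodes of the platform rather than across local threads.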
On June 4th, a PhD student from our team and the CEA will present his thesis topic.
The presentation will be given in French.
Title: Performance modelling and projection of parallel applications on ARM environments
The evolution of processor architectures makes predicting and evaluating the performance of a parallel application complex. Indeed, the growing number of compute cores, the multiplication of vector units and the internal organisation of the mesh network strongly influence processor behaviour. Nevertheless, in order to prepare for the arrival of future machines, it is necessary to have an idea of the performance (e.g., execution time or number of floating-point operations per second) of current parallel applications on future generations of processors. In this context, the CEA and ARM wish to develop a performance-prediction methodology: given the behaviour of a parallel application on an existing processor and the characteristics of a hypothetical processor, the goal is to predict the performance of that application on the latter.
The objective of this thesis is to define a performance-projection model for parallel applications based on the evolution of processors, including changes in the compute units (for example, the SVE instruction set), the evolution of the number of cores and the increase in memory capabilities (DDR, HBM, …). This will involve experiments on existing machines to validate the model, as well as on variations of known processors (changes in frequency, in the number of cores, …). This phase will make it possible to study the impact of the architecture on performance. Once this first study has been carried out, it will be possible to evolve the model to refine the performance predictions, notably in the case of micro-architectural changes (as happens between nearly identical generations).
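As a much simpler illustration of the kind of projection discussed above (not the thesis methodology), a roofline-style model bounds an application's runtime on a hypothetical processor from the application's measured operation and traffic counts and the target's peak compute rate and memory bandwidth.

```python
def projected_time(flops, bytes_moved, peak_flops, peak_bw):
    """Lower-bound runtime on a target processor: the kernel is limited
    either by compute throughput or by memory bandwidth.

    flops       -- floating-point operations the application performs
    bytes_moved -- bytes it exchanges with memory
    peak_flops  -- target's peak compute rate (flop/s)
    peak_bw     -- target's peak memory bandwidth (bytes/s)
    """
    return max(flops / peak_flops, bytes_moved / peak_bw)
```

One consequence the model makes visible: doubling the core count doubles `peak_flops` but leaves the projected time of a memory-bound kernel unchanged, which is why memory capabilities (DDR, HBM, …) appear explicitly in the thesis topic.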
On April 16th, a PhD student from our team and the CEA will present his thesis topic.
Title: Circuit partitioning for multi-FPGA platforms
An FPGA (‘Field Programmable Gate Array’) is an integrated circuit comprising a large number of programmable and interconnectable logic resources, which allow one to implement, by programming, a digital electronic circuit such as a microprocessor, a compute accelerator or a complex hybrid system-on-chip. FPGAs are widely used in integrated-circuit design, because they allow one to obtain prototype circuits very quickly, without having to manufacture the chip on silicon. However, some circuits are too big to be implemented on a single FPGA. To address this issue, it is possible to use a platform consisting of several highly interconnected FPGAs, which can be seen as a single virtual FPGA giving access to all the resources of the platform. This solution, although elegant, poses several problems. In particular, the existing tools do not account for all the constraints of the placement problem to be solved in order to efficiently map a circuit onto a multi-FPGA platform. For example, current cost functions are not designed to minimize signal propagation times between FPGA registers, nor do they take into account the capacity constraints induced by the routing of connections.
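In their simplest form, the two concerns mentioned above can be written down as follows; this is an illustrative sketch, not the thesis's actual cost function. A partition of circuit cells across FPGAs is scored by its cut size (nets spanning several FPGAs, whose signals must traverse slow inter-FPGA links) and checked against each FPGA's logic-resource capacity.

```python
def cut_size(nets, part):
    """Number of nets spanning more than one FPGA.
    nets: list of nets, each a collection of cell names;
    part: dict mapping each cell to its FPGA index."""
    return sum(1 for net in nets if len({part[c] for c in net}) > 1)

def respects_capacity(part, capacity):
    """Check that no FPGA holds more cells than its capacity allows
    (a crude stand-in for real logic-resource accounting)."""
    counts = {}
    for fpga in part.values():
        counts[fpga] = counts.get(fpga, 0) + 1
    return all(counts.get(f, 0) <= cap for f, cap in capacity.items())
```

Real multi-FPGA placement tools must go further, as the post points out: cut size alone ignores signal propagation times between registers and the routing capacity of the inter-FPGA links.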
On March 19th, an Assistant Professor in our team will present some results of her recent research activities.
Title: Application-aware arbitration of I/O resources in HPC machines
I/O forwarding is a well-established and widely-adopted technique in HPC to reduce contention in the access to storage servers and transparently improve I/O performance. Rather than having applications directly access the shared parallel file system, the forwarding technique defines a set of I/O nodes responsible for receiving application requests and forwarding them to the file system, thus reshaping the flow of requests. The typical approach is to statically assign I/O nodes to applications depending on the number of compute nodes they use, which is not necessarily related to their I/O requirements. This approach therefore leads to inefficient usage of these resources.
During this talk, I will present our recent work, accepted for IPDPS 2021, that investigates arbitration policies based on the applications' I/O demands, represented by their access patterns. We proposed a policy based on the Multiple-Choice Knapsack problem that seeks to maximize global bandwidth by giving more I/O nodes to the applications that will benefit the most. Our approach can transparently improve global I/O bandwidth by up to 85% in a live setup compared to the default static policy. I will also discuss ongoing work on predicting application I/O performance with different numbers of I/O nodes.
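The Multiple-Choice Knapsack formulation mentioned above can be sketched as follows (the formulation shape matches the description; the solver and the numbers in the test are illustrative, not from the paper). Each application contributes one class of items, one item per candidate I/O-node count, valued by the bandwidth that application would obtain; the capacity is the total number of I/O nodes, and exactly one item must be picked per class.

```python
def mckp(classes, capacity):
    """Multiple-Choice Knapsack by dynamic programming.
    classes: list of item lists, each item a (cost, value) pair;
    exactly one item is chosen from every class.
    Returns the maximum total value, or None if infeasible."""
    NEG = float("-inf")
    best = [0] + [NEG] * capacity      # best[c] = max value using c units
    for items in classes:
        nxt = [NEG] * (capacity + 1)   # forces one pick from this class
        for used in range(capacity + 1):
            if best[used] == NEG:
                continue
            for cost, value in items:
                if used + cost <= capacity:
                    nxt[used + cost] = max(nxt[used + cost],
                                           best[used] + value)
        best = nxt
    top = max(best)
    return None if top == NEG else top
```

In the arbitration setting, a cost is a number of I/O nodes and a value is the bandwidth the application gets with that many nodes, so the optimum allocates the shared I/O nodes where they raise global bandwidth the most.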
On March 4th, Emmanuel Jeannot, TADaaM team leader, will present some results of his research.
Title: Process Mapping on any Topology with TopoMatch
Process mapping (or process placement) is a useful algorithmic technique to optimize the way applications are launched and executed on a parallel machine. By taking into account the topology of the machine and the affinity between the processes, process mapping helps to reduce the communication time of the whole parallel application. Here, we present TopoMatch, a generic and versatile tool and algorithm to address the process-placement problem. We describe its features and characteristics, and we report different use cases that benefit from this tool. We also study the impact of different factors: the sparsity of the input affinity matrix, the trade-off between the speed and the quality of the mapping procedure, and the impact of noise on the input.
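The inputs of the process-placement problem can be made concrete with a small greedy sketch (far simpler than TopoMatch's actual algorithm): given an affinity matrix (communication volume between processes) and a distance matrix between cores, repeatedly place the process with the strongest affinity to the already-placed processes on the core that minimizes its affinity-weighted distance to them.

```python
def greedy_mapping(affinity, distance):
    """Map n processes onto n cores, one process per core.
    affinity[i][j]: communication volume between processes i and j;
    distance[a][b]: topology distance between cores a and b."""
    n = len(affinity)
    placed = {0: 0}                      # seed: process 0 on core 0
    free_cores = set(range(1, n))
    for _ in range(n - 1):
        # next process: strongest total affinity to placed processes
        p = max((q for q in range(n) if q not in placed),
                key=lambda q: sum(affinity[q][r] for r in placed))
        # best core: minimal affinity-weighted distance to placed ones
        c = min(free_cores,
                key=lambda c: sum(affinity[p][r] * distance[c][placed[r]]
                                  for r in placed))
        placed[p] = c
        free_cores.remove(c)
    return placed
```

Heavily communicating processes end up on nearby cores, which is the effect process mapping aims for; TopoMatch additionally handles hierarchical and arbitrary topologies, large sparse affinity matrices, and the speed/quality trade-off discussed in the talk.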
Hardware Locality has been nominated for a French award for academic open-source projects.