Autotuning MPI Collectives using Performance Guidelines, Sascha Hunold
– December 18, 2017
MPI collective operations provide a standardized interface for performing data movements within a group of processes. The efficiency of collective communication operations depends on the actual algorithm, its implementation, and the specific communication problem (type of communication, message size, and number of processes). Many MPI libraries provide numerous algorithms for specific collective operations. The strategy for selecting an efficient algorithm is often predefined (hard-coded) in MPI libraries, but some of them, such as Open MPI, allow users to change the algorithm manually. Finding the best algorithm for each case is a hard problem, and several approaches to tuning these algorithmic parameters have been proposed. We use an approach orthogonal to the parameter-tuning of MPI collectives: instead of testing the individual algorithmic choices provided by an MPI library, we compare the latency of a specific MPI collective operation to the latency of semantically equivalent functions, which we call mock-up implementations. The structure of the mock-up implementations is defined by self-consistent performance guidelines. The advantage of this approach is that tuning with mock-up implementations is always possible, whether or not an MPI library allows users to select a specific algorithm at run time. We implement this concept in a library called PGMPITuneLib, which is layered between the user code and the actual MPI implementation. This library selects the best-performing algorithmic pattern of an MPI collective by intercepting MPI calls and redirecting them to our mock-up implementations. Experimental results show that PGMPITuneLib can significantly reduce the latency of MPI collectives and, equally important, that it can help identify the tuning potential of MPI libraries.
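The interception mechanism that makes this kind of tuning possible is the MPI profiling interface (PMPI), which every MPI library provides. Below is a minimal sketch of the idea, not PGMPITuneLib's actual code: MPI_Allgather is overridden and, when a (hypothetical) prior measurement phase has found the mock-up faster, realized as a Gather followed by a Bcast, following the corresponding performance guideline (MPI_Allgather should not be slower than MPI_Gather + MPI_Bcast).

#include <mpi.h>

/* Would be set by a measurement/tuning phase; hard-coded here. */
static int use_mockup = 1;

int MPI_Allgather(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                  void *recvbuf, int recvcount, MPI_Datatype recvtype,
                  MPI_Comm comm)
{
    if (!use_mockup)  /* fall through to the library's own algorithm */
        return PMPI_Allgather(sendbuf, sendcount, sendtype,
                              recvbuf, recvcount, recvtype, comm);

    /* Mock-up: gather everything at rank 0, then broadcast the result
       (MPI_IN_PLACE handling omitted for brevity). */
    int size, rc;
    PMPI_Comm_size(comm, &size);
    rc = PMPI_Gather(sendbuf, sendcount, sendtype,
                     recvbuf, recvcount, recvtype, 0, comm);
    if (rc != MPI_SUCCESS)
        return rc;
    return PMPI_Bcast(recvbuf, size * recvcount, recvtype, 0, comm);
}

Linking such an interposer between the application and the MPI library requires no change to either one, which is what makes the approach applicable even when the library exposes no algorithm-selection knobs.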
TAPIOCA: a topology-aware data aggregation library for parallel I/O, François Tessier, Argonne
– December 21, 2017
TAPIOCA: a topology-aware data aggregation library for parallel I/O

The growing compute power of supercomputers comes at a considerable cost in data movement. Moreover, most scientific simulations have substantial read and write requirements on parallel file systems. Many software solutions have been developed to contain the bottleneck caused by I/O. A well-known strategy in the world of collective I/O operations consists of selecting a subset of the application's processes to aggregate contiguous chunks of data before performing the reads and writes. In this talk, I will present TAPIOCA, an MPI library implementing an optimized topology-aware data aggregation algorithm. I will show the substantial read and write performance gains we obtained on two supercomputers hosted at Argonne National Laboratory. Finally, I will discuss our ongoing work on TAPIOCA to take advantage of the new tiers of memory and storage available on current and upcoming systems (MCDRAM, local SSDs, ...).
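To make the aggregation pattern concrete, here is a minimal sketch under simplifying assumptions (fixed group size, aggregator at the first rank of each group, output file name out.dat, all of which are placeholders): each group funnels its contiguous chunks to one aggregator, which then issues a single large write. TAPIOCA's contribution is, in addition, to place the aggregators according to the network topology, which this sketch does not attempt.

#include <mpi.h>
#include <stdlib.h>

#define GROUP_SIZE 4
#define CHUNK (1 << 20)            /* 1 MiB per process, for illustration */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *chunk = malloc(CHUNK);   /* this process's contiguous data */

    /* One communicator per aggregation group; its rank 0 aggregates. */
    MPI_Comm group;
    MPI_Comm_split(MPI_COMM_WORLD, rank / GROUP_SIZE, rank, &group);
    int grank, gsize;
    MPI_Comm_rank(group, &grank);
    MPI_Comm_size(group, &gsize);

    char *agg = (grank == 0) ? malloc((size_t)gsize * CHUNK) : NULL;
    MPI_Gather(chunk, CHUNK, MPI_BYTE, agg, CHUNK, MPI_BYTE, 0, group);

    if (grank == 0) {              /* aggregator: one large write */
        MPI_File fh;
        MPI_File_open(MPI_COMM_SELF, "out.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_Offset off = (MPI_Offset)(rank / GROUP_SIZE) * GROUP_SIZE * CHUNK;
        MPI_File_write_at(fh, off, agg, gsize * CHUNK, MPI_BYTE,
                          MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
        free(agg);
    }
    MPI_Comm_free(&group);
    free(chunk);
    MPI_Finalize();
    return 0;
}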
Convergence of no-regret algorithms, Amélie Heliou (Polaris)
– November 30, 2017
No-regret algorithms are often used in repeated games in which players have little information about the game they are playing. These algorithms guarantee that each player's regret is sublinear. The time average of the strategies chosen by a no-regret algorithm converges to the set of correlated equilibria. However, this says nothing about the convergence of the sequence of strategies itself.
We are interested in the question: does the sequence of strategies produced by a no-regret algorithm converge to a Nash equilibrium?
In this talk, I will present a no-regret algorithm called Hedge, a variant of the exponential weights algorithm. In particular, I will discuss the convergence of the sequences of strategies produced by Hedge under two types of feedback available to the players.
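For reference, the textbook form of the Hedge update is the following (the talk's exact variant and feedback models may differ): player i keeps a cumulative payoff score for each action and plays the softmax of these scores,

\[
  x_{i,a}(t+1) = \frac{\exp\big(\eta\, y_{i,a}(t)\big)}{\sum_{b \in A_i} \exp\big(\eta\, y_{i,b}(t)\big)},
  \qquad
  y_{i,a}(t) = \sum_{s=1}^{t} u_i\big(a, x_{-i}(s)\big),
\]

where \eta > 0 is the learning rate. For payoffs in [0, 1], choosing \eta \approx \sqrt{\log|A_i| / T} yields regret O(\sqrt{T \log|A_i|}) after T rounds, i.e., the sublinear regret mentioned above.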
Learning efficient Nash equilibria in distributed systems by Bary Pradelski (ETH Zurich)
– December 14, 2017
Learning efficient Nash equilibria in distributed systems
Joint work with H. Peyton Young.
An individual’s learning rule is completely uncoupled if it does not depend directly on the actions or payoffs of anyone else. We propose a variant of log linear learning that is completely uncoupled and that selects an efficient (welfare-maximizing) pure Nash equilibrium in all generic n-person games that possess at least one pure Nash equilibrium. In games that do not have such an equilibrium, there is a simple formula that expresses the long-run probability of the various disequilibrium states in terms of two factors: i) the sum of payoffs over all agents, and ii) the maximum payoff gain that results from a unilateral deviation by some agent. This welfare/stability trade-off criterion provides a novel framework for analyzing the selection of disequilibrium as well as equilibrium states in n-person games.
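For context, the baseline (not completely uncoupled) log-linear learning rule, of which the talk proposes a variant, is the logit choice rule: a revising player i picks action a with probability

\[
  \Pr\big(a_i(t+1) = a\big) =
  \frac{\exp\big(\beta\, u_i(a, a_{-i}(t))\big)}{\sum_{a' \in A_i} \exp\big(\beta\, u_i(a', a_{-i}(t))\big)},
\]

where \beta > 0 controls the noise level; as \beta \to \infty, play concentrates on the stochastically stable states, which the talk characterizes through the welfare/stability trade-off described above. (The update studied in the talk necessarily differs from this baseline, since it may not depend directly on the actions or payoffs of others.)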
Predicting the Energy-Consumption of MPI Applications at Scale Using Only a Single Node, by Christian Heinrich (Polaris)
– January 18, 2018
Monitoring and assessing the energy efficiency of supercomputers and data centers is crucial in order to limit and reduce their energy consumption. Applications from the domain of High Performance Computing (HPC), such as MPI applications, account for a significant fraction of the overall energy consumed by HPC centers. Simulation is a popular approach for studying the behavior of these applications in a variety of scenarios, and it is therefore advantageous to be able to study their energy consumption in a cost-efficient, controllable, and reproducible simulation environment. Alas, simulators supporting HPC applications commonly lack the capability of predicting the energy consumption, particularly when target platforms consist of multi-core nodes. In this work, we aim to accurately predict the energy consumption of MPI applications via simulation. First, we introduce the models required for meaningful simulations: the computation model, the communication model, and the energy model of the target platform. Second, we demonstrate that by carefully calibrating these models on a single node, the predicted energy consumption of HPC applications at a larger scale is very close (within a few percent) to that of real experiments. We further show how to integrate such models into the SimGrid simulation toolkit. In order to obtain good execution-time predictions on multi-core architectures, we also establish that it is vital to correctly account for memory effects in simulation. The proposed simulator is validated through an extensive set of experiments with well-known HPC benchmarks. Lastly, we show that the simulator can be used to study applications at scale, which allows researchers to save both time and resources compared to real experiments.
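A common power model underlying this kind of single-node calibration is linear interpolation between idle and full load (an assumption for illustration; the talk's exact model may differ): with k of a node's C cores busy,

\[
  P(k) = P_{\mathrm{idle}} + \big(P_{\mathrm{full}} - P_{\mathrm{idle}}\big)\frac{k}{C},
  \qquad
  E = \int_0^T P\big(k(t)\big)\, dt,
\]

so measuring P_idle, P_full, and the per-core execution rates on a single node suffices to predict the energy E of larger runs, provided the simulator tracks the core occupancy k(t) accurately.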