Presentation

Performance evaluation and Optimization of LARge Infrastructures and Systems

INRIA Theme: Distributed and High Performance Computing
LIG Laboratory Axis: Distributed Systems, Parallel Computing and Networks
Keywords: large distributed and stochastic systems, experimental methodology, performance evaluation, simulation, trace analysis and visualization, distributed and stochastic optimization, game theory.

The goal of the POLARIS project is to contribute to the understanding (from the observation, modeling and analysis to the actual optimisation through adapted algorithms) of the performance of very large-scale distributed systems such as supercomputers, cloud infrastructures, wireless networks, smart grids, transportation systems, or even recommendation systems.

Here are some slides presenting the team in a nutshell, as well as a few recent results and our latest Inria Activity Report.

Overview

Building on our past experience, we have gathered skills in:

  • Experiment design: experimental methodology, measuring/monitoring/tracing tools, experiment control, design of experiments, reproducible research, in particular in the context of large computing infrastructures (grid, HPC, volunteer computing, embedded systems, …).
  • Trace Analysis: parallel application visualization (paje, triva/viva, framesoc/ocelotl, …), characterization of failures in large distributed systems, visualization and analysis for geographical information systems, spatio-temporal analysis of media events in RSS flows from newspapers, …
  • Modeling and Simulation: emulation, discrete event simulation, perfect sampling, Markov chains, Monte Carlo methods, …
  • Optimization: stochastic approximations, mean field limits, game theory, mean field games, primal dual optimization, learning, information theory.

Research directions

The POLARIS team works in close cooperation with other research teams on a continuum of five research themes:

  1. Measurement: Sound and Reproducible Experimental Methodology
  2. Analysis: Multi-Scale Analysis and Visualization
  3. Simulation: Fast and Faithful Performance Prediction of Very Large Systems
  4. Asymptotic Models: Local Interactions and Transient Analysis in Adaptive Dynamic Systems
  5. Distributed Optimization: Continuous Game Theory and On-line Distributed Optimization

Associated Teams

  • ReDaS (Analysis Techniques and Workflow Methodologies for Reproducible Data Science) is an associated team with our colleagues from UFRGS in Porto Alegre, Brazil.

Contribution to the AI/Learning Hype

AI and learning are everywhere now. Building on our performance evaluation and distributed computing background, we obviously publish our work in conferences like SIGMETRICS, INFOCOM, CCGRID or IPDPS, but we also regularly publish in major AI conferences like IJCAI, NeurIPS, ICLR, or ICML. Let us clarify how our research activities are positioned with respect to this trend.

A first line of research in POLARIS is devoted to the use of statistical learning techniques (Bayesian inference) to model the expected performance of distributed systems, to build aggregated performance views, to feed simulators of such systems, or to detect anomalous behaviours.

In a distributed context it is also essential to design systems that can seamlessly adapt to the workload and to the evolving behaviour of their components (users, resources, network). Obtaining faithful information on the dynamics of the system can be particularly difficult, which is why it is generally more efficient to design systems that dynamically learn the best actions to play through trial and error. A key characteristic of the work in the POLARIS project is to regularly leverage game-theoretic modeling to handle situations where the resources or the decisions are distributed among several agents, or even situations where a centralised decision maker has to adapt to strategic users.

An important research direction in POLARIS is thus centered on reinforcement learning (multi-armed bandits, Q-learning, online learning) and active learning in environments with one or several of the following features:

  • Feedback is limited (e.g., gradients or even stochastic gradients are not available, which requires, for example, resorting to stochastic approximations);
  • Multi-agent setting where each agent learns, possibly not in a synchronised way (i.e., decisions may be taken asynchronously, which raises convergence issues);
  • Delayed feedback (the aim is to avoid oscillations and to quantify the resulting degradation in convergence);
  • Non-stochastic (e.g., adversarial) or non-stationary workloads (e.g., in the presence of shocks);
  • Systems composed of a very large number of entities, which we study through mean field approximations (mean field games and mean field control).
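To make the limited-feedback and non-stochastic settings above more concrete, here is a minimal sketch of EXP3, a classical multi-armed bandit algorithm that only observes the reward of the arm it plays and makes no stochastic assumption on rewards. It is an illustration chosen for this page, not a description of a specific POLARIS algorithm; the function name `exp3`, the callback `reward_fn` and the parameter values are assumptions made for the example.

```python
import math
import random


def exp3(reward_fn, n_arms, horizon, gamma=0.1, seed=0):
    """Illustrative EXP3 loop; reward_fn(t, arm) must return a value in [0, 1]."""
    rng = random.Random(seed)
    weights = [1.0] * n_arms
    total_reward = 0.0

    def mixed_strategy():
        # Mix the weight-based distribution with uniform exploration.
        total_w = sum(weights)
        return [(1 - gamma) * w / total_w + gamma / n_arms for w in weights]

    for t in range(horizon):
        probs = mixed_strategy()
        arm = rng.choices(range(n_arms), weights=probs)[0]
        # Bandit feedback: only the reward of the played arm is observed.
        reward = reward_fn(t, arm)
        total_reward += reward
        # Importance weighting gives an unbiased estimate of the reward.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / n_arms)
        # Rescale so that the largest weight is 1 (avoids overflow).
        largest = max(weights)
        weights = [w / largest for w in weights]

    return mixed_strategy(), total_reward


if __name__ == "__main__":
    # Hypothetical workload: arm 2 always pays 1, the others pay 0.
    probs, total = exp3(lambda t, arm: 1.0 if arm == 2 else 0.0,
                        n_arms=3, horizon=2000)
    print(probs, total)
```

The importance-weighted estimate is what replaces the missing gradient information: the learner never sees what the other arms would have paid, yet its sampling distribution still concentrates on the best arm.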

As a side effect, many of the gained insights can often be used to dramatically improve the scalability and the performance of the implementation of more standard machine or deep learning techniques over supercomputers.

The POLARIS members are thus particularly interested in the design and analysis of adaptive learning algorithms for multi-agent systems, i.e. agents that seek to progressively improve their performance on a specific task (see Figure). The resulting algorithms should not only learn an efficient (Nash) equilibrium but should also be capable of doing so quickly (low regret), even when facing the difficulties associated with a distributed context (lack of coordination, uncertain world, information delay, limited feedback, …).
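As a toy illustration of agents that learn an equilibrium with low regret, the sketch below lets two players run the Hedge (multiplicative weights) rule against each other in rock-paper-scissors. In a two-player zero-sum game, the time-averaged strategies of low-regret learners approach a Nash equilibrium, which here is the uniform strategy. This is a simplified setting chosen for the example: it assumes full-information feedback and synchronous updates, and the names `hedge_self_play`, `PAYOFF`, `eta` and the starting points are hypothetical.

```python
import math

# Row player's payoff in rock-paper-scissors (rows and columns: rock, paper,
# scissors); the column player receives the negative of it.
PAYOFF = [[0, -1, 1],
          [1, 0, -1],
          [-1, 1, 0]]


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def hedge_self_play(x0, y0, rounds=20000, eta=0.02):
    """Both players update with Hedge; returns their time-averaged strategies."""
    x, y = normalize(x0), normalize(y0)
    avg_x, avg_y = [0.0] * 3, [0.0] * 3
    for _ in range(rounds):
        avg_x = [a + p / rounds for a, p in zip(avg_x, x)]
        avg_y = [a + p / rounds for a, p in zip(avg_y, y)]
        # Expected payoff of each pure action against the opponent's current mix.
        gain_x = [sum(PAYOFF[i][j] * y[j] for j in range(3)) for i in range(3)]
        gain_y = [-sum(PAYOFF[i][j] * x[i] for i in range(3)) for j in range(3)]
        # Multiplicative update: actions that did well get more probability.
        x = normalize([p * math.exp(eta * g) for p, g in zip(x, gain_x)])
        y = normalize([p * math.exp(eta * g) for p, g in zip(y, gain_y)])
    return avg_x, avg_y


if __name__ == "__main__":
    # Arbitrary non-uniform starting points, chosen only for the example.
    print(hedge_self_play([0.6, 0.3, 0.1], [0.2, 0.2, 0.6]))
```

Note that only the averaged strategies are returned: the day-to-day strategies of such learners may keep cycling around the equilibrium, which is one reason why convergence questions are delicate in multi-agent settings.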