Data Aware Large Scale Computing
INRIA Theme: Distributed and High Performance Computing
LIG Laboratory Axis: Distributed Systems, Parallel Computing and Networks
Keywords: Exascale; High Performance Computing; Parallel Algorithms; Scheduling; Multi-objective Optimization; Middleware; Batch Scheduler; High Performance Data Analytics
Today’s largest supercomputers (Top500 ranking) are composed of hundreds of thousands of cores, with performance reaching the petaflops range. Moving data on such large machines is becoming a major performance bottleneck, and the situation is expected to worsen at exascale and beyond, because data transfer capabilities are growing more slowly than processing power. The abundance of available flops will be difficult to exploit efficiently given these constrained communication capabilities.
The memory hierarchy and storage architecture are expected to change deeply with the emergence of new technologies such as non-volatile memories (NVRAM), which require new approaches to data management. Data movements are also an important source of power consumption, and thus a relevant target for energy savings.
The DataMove team addresses these challenges, performing research to optimize data movements for large scale computing. DataMove targets four main research axes:
- Integration of High Performance Computing and Data Analytics
- Data Aware Batch Scheduling
- Empirical Studies of Large Scale Platforms
- Forecasting Resource Availability
The batch scheduler is in charge of allocating resources in response to user requests for application executions (deciding when and where to execute a parallel job). The growing cost of data movements requires scheduling policies that take into account the influence of intra-application communications and I/Os, as well as the contention caused by data traffic from other concurrent applications. Modelling application behavior to anticipate actual resource usage on such architectures is challenging but critical to improving performance (time, energy).
The scheduler also needs to handle new workloads. High performance platforms increasingly have to execute data-intensive processing tasks, such as data analysis, in addition to traditional computation-intensive numerical simulations. In particular, the ever-growing amount of data generated by numerical simulations calls for a tighter integration between simulation and data analysis. The goal is to reduce data traffic and to speed up result analysis by performing result processing (compression, indexing, analysis, visualization, etc.) as close as possible to the place and time of data generation. This approach, called in-situ analytics, requires revisiting the traditional workflow (batch processing followed by post-mortem analysis). The application becomes a whole that includes the simulation, the in-situ processing and the I/Os. This motivates the development of adapted resource allocation strategies, data structures and parallel analytics schemes that efficiently interleave the execution of the different components of the application and improve overall performance.
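The idea of a scheduler that accounts for data movement when deciding where to run a job can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `Job` and `Rack` types, the linear `transfer_cost` model (free inside a rack, proportional to data volume otherwise) and the greedy policy are hypothetical and are not the team's algorithms or the behavior of any real batch scheduler.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes_needed: int
    data_location: str  # hypothetical: rack that holds the job's input data
    data_gb: float


@dataclass
class Rack:
    name: str
    free_nodes: int


def transfer_cost(job, rack, inter_rack_cost_per_gb=1.0):
    """Assumed cost model: no cost inside the data's rack, linear in volume otherwise."""
    if rack.name == job.data_location:
        return 0.0
    return job.data_gb * inter_rack_cost_per_gb


def place_jobs(jobs, racks):
    """Greedy placement: jobs with the most data choose first, and each job
    goes to the rack with enough free nodes and the lowest transfer cost."""
    placement = {}
    for job in sorted(jobs, key=lambda j: j.data_gb, reverse=True):
        candidates = [r for r in racks if r.free_nodes >= job.nodes_needed]
        if not candidates:
            placement[job.name] = None  # no room: the job stays in the queue
            continue
        best = min(candidates, key=lambda r: transfer_cost(job, r))
        best.free_nodes -= job.nodes_needed
        placement[job.name] = best.name
    return placement


racks = [Rack("rack-A", 4), Rack("rack-B", 4)]
jobs = [
    Job("sim", 4, "rack-A", 500.0),
    Job("analysis", 2, "rack-A", 50.0),
    Job("viz", 2, "rack-B", 5.0),
]
placement = place_jobs(jobs, racks)
print(placement)
```

In this example the simulation fills the rack that holds its data, so the smaller analysis job is pushed to the other rack and pays a transfer cost. A realistic policy would also have to weigh waiting time, network contention and energy, which this sketch ignores.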
To tackle these issues, we intertwine theoretical research and practical developments in an agile mode, in order to elaborate solutions that are generic and effective enough to be of practical interest. Algorithms with performance guarantees are designed and tested on large scale platforms, with realistic usage scenarios developed with partner scientists or based on logs of the largest available computing platforms. Conversely, our strong experimental expertise enables us to feed theoretical models with sound hypotheses and to adapt proven algorithms with practical heuristics, which can in turn be fed back into adequate theoretical models.