HPCProSol (Next-generation HPC PROblems and SOLutions) is a joint team (équipe associée) between Inria, in France, and the National Laboratory for Scientific Computing (LNCC), in Brazil. It started at the beginning of 2021, and is expected to continue until the end of 2023.
High-performance computing (HPC) architectures, or supercomputers, were designed to run traditional HPC applications efficiently: numerical simulations. However, with the convergence of HPC and big data, their workload is becoming more heterogeneous, and the notion of a scientific application is evolving into that of a scientific workflow composed of CPU-intensive and data-intensive tasks. This evolution characterizes the new HPC workload, which is expected to bring new challenges and accentuate existing ones:
- Efficient application execution becomes more challenging due to a mismatch between systems and applications. New applications bring new methods, libraries, and runtime systems that may not have been properly optimized for the supercomputer, leading to problems such as load imbalance and poor communication performance.
- Supercomputers’ resources are arbitrated between applications using little more than the requested number of CPUs and the estimated execution time. As a result, resources that an application leaves unused at different moments of its execution are potentially wasted.
- Although running on independent nodes, concurrent applications still share the network and I/O infrastructures, which means they can interfere with each other non-uniformly.
This project’s main goal is to study and characterize the new HPC workload, represented by a set of scientific applications that are important to the LNCC because they are representative of the workload of its Santos Dumont machine. The knowledge gained will guide the proposal of monitoring and profiling techniques for applications, and the design of new coordination mechanisms to arbitrate resources in HPC environments.
We are interested in evaluating and improving the performance of individual applications, but also in using this study to better understand how performance is affected by aspects such as interference. Moreover, we want to identify metrics that can be used to predict performance and to detect deviations from the applications’ expected behaviors, especially at run time.