2019 Activity

2019 Scientific Progress


Communication reordering – Current High Performance Computing networks rely on adaptive routing and out-of-order message delivery to improve communication throughput. At the application level, developers rely on non-blocking interfaces for communication overlap and latency hiding. However, in most scientific codes, non-blocking communication operations are issued in an order dictated by the domain decomposition rather than in the order that would maximize network throughput. Lavrijsen et al. introduced a generic approach for reordering non-blocking operations, in which each rank reorders its communication based on metrics obtained during a calibration phase performed at the beginning of the application. More recently, Lavrijsen et al. proposed a reordering method to optimize global bandwidth, in which a global order is determined dynamically for all ranks at the cost of global synchronizations.

During the first months of Pierre’s postdoc, we explored new reordering methods that require neither an expensive calibration phase nor global synchronizations. The first lead was to reorder messages statically. Several static strategies based on different criteria (message size, distance between nodes) were implemented and tested on the Cori supercomputer at NERSC. However, the results showed that, because of network noise and the high variability of communication times, the optimal ordering cannot be determined statically. The second lead is to reorder messages dynamically, so that the reordering decision adapts to the state of the network and to the contention on the nodes. The dynamic strategy has yet to be defined and tuned, but ongoing work shows potential performance benefits.
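
To make the static strategies concrete, here is a minimal sketch of how a rank could reorder its pending non-blocking sends by a single fixed criterion before posting them. The `PendingSend` record, its `hops` field (assumed to be a precomputed node distance), and the two ordering functions are hypothetical illustrations, not the code that was evaluated on Cori.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PendingSend:
    """A non-blocking send that has been issued but not yet posted (hypothetical)."""
    dest: int   # destination rank
    size: int   # message size in bytes
    hops: int   # assumed precomputed network distance to the destination node

def order_by_size(sends, largest_first=True):
    """Static policy: post the largest (or smallest) messages first."""
    return sorted(sends, key=lambda s: s.size, reverse=largest_first)

def order_by_distance(sends, farthest_first=True):
    """Static policy: post messages to the most (or least) distant nodes first."""
    return sorted(sends, key=lambda s: s.hops, reverse=farthest_first)

# Order in which the domain decomposition issues the sends.
issued = [
    PendingSend(dest=1, size=4096, hops=1),
    PendingSend(dest=7, size=65536, hops=3),
    PendingSend(dest=3, size=512, hops=5),
]

for s in order_by_size(issued):
    # A real code would post an MPI_Isend for `s` here; this sketch only prints the order.
    print(s.dest, s.size, s.hops)
```

Because `sorted` is stable, messages with equal keys keep their original issue order. Any such fixed key ignores the run-time state of the network, which is why the measurements led to the dynamic approach.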

Hiding the latency of MPI operations – Many applications use blocking operations (point-to-point and collectives) where non-blocking ones could be used instead. During Summer 2019, an intern developed an analysis that transforms existing applications to use non-blocking collectives, with promising results so far. A PhD thesis started in November 2019 to push this subject further, as a collaborative project with the CEA (French Alternative Energies and Atomic Energy Commission).
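
The core of such a transformation can be sketched on a toy straight-line program: replace the blocking collective with its non-blocking counterpart and delay the matching wait until the first statement that may touch one of its buffers. This is an illustrative sketch, not the intern's analysis; the `Stmt` record with explicit read/write sets is a hypothetical stand-in for what a compiler analysis would compute, and control flow, aliasing, and the side effects of other calls are ignored.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stmt:
    """One statement of a toy straight-line program (hypothetical IR)."""
    text: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()

PREFIX = "MPI_Allreduce("

def to_nonblocking(stmts, idx):
    """Split the blocking MPI_Allreduce at stmts[idx] into MPI_Iallreduce plus
    MPI_Wait, then sink the wait past every following statement that touches
    neither its send nor its receive buffer (a conservative dependence test)."""
    coll = stmts[idx]
    if not (coll.text.startswith(PREFIX) and coll.text.endswith(")")):
        raise ValueError("expected a blocking MPI_Allreduce call")
    buffers = coll.reads | coll.writes
    args = coll.text[len(PREFIX):-1]
    start = Stmt(f"MPI_Iallreduce({args}, &req)", coll.reads, coll.writes)
    wait = Stmt("MPI_Wait(&req, MPI_STATUS_IGNORE)")
    out = list(stmts[:idx]) + [start]
    j = idx + 1
    # Statements independent of the buffers may overlap with the collective.
    while j < len(stmts) and not ((stmts[j].reads | stmts[j].writes) & buffers):
        out.append(stmts[j])
        j += 1
    out.append(wait)  # the collective must complete before the first conflict
    out.extend(stmts[j:])
    return out

program = [
    Stmt("MPI_Allreduce(local, global, n, MPI_DOUBLE, MPI_SUM, comm)",
         reads=frozenset({"local"}), writes=frozenset({"global"})),
    Stmt("compute_interior(u)", reads=frozenset({"u"}), writes=frozenset({"u"})),
    Stmt("norm = sqrt(global[0])", reads=frozenset({"global"}), writes=frozenset({"norm"})),
]

for s in to_nonblocking(program, 0):
    print(s.text)
```

Treating any access to the send or receive buffer as a conflict is deliberately conservative; a finer analysis could, for example, let statements that only read the send buffer overlap with the collective.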

Correctness of MPI 3.0 one-sided communication – A PhD thesis started in March 2019 on the development of a method to help developers use MPI one-sided communications correctly in their applications.

Developing mixed-precision tuning tools and methods for High Performance Computing scientific applications – Most scientific applications in High Performance Computing use floating-point arithmetic to perform their calculations. Because the impact of floating-point arithmetic on result accuracy is hard to understand, many applications are written entirely in double precision. Research on transforming code into versions that use multiple precisions (IEEE 754 double, single, and half) aims to provide tools and methods that help developers write more efficient mixed-precision floating-point applications. The precision required in each part of a scientific code to achieve the desired accuracy in the result remains an open question. Each application seems to have its own specificities, and experts are still needed to write an optimized mixed-precision application.
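
A small illustration of why the choice of precision matters: the sketch below emulates IEEE 754 single and half precision in Python by rounding through the standard `struct` module's `'f'` and `'e'` formats, then adds 0.1 ten thousand times in each format. It is a toy example chosen for illustration, not one of the applications studied.

```python
import struct

def to_single(x):
    """Round a binary64 Python float to the nearest IEEE 754 binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def to_half(x):
    """Round a binary64 Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_double(x):
    """Python floats are already binary64: no extra rounding."""
    return x

def accumulate(value, count, rnd):
    """Add `value` to a running sum `count` times, rounding the operand and every
    partial sum to the target format with `rnd` (emulating arithmetic in it)."""
    v = rnd(value)
    total = 0.0
    for _ in range(count):
        total = rnd(total + v)
    return total

for name, rnd in (("double", to_double), ("single", to_single), ("half", to_half)):
    print(name, accumulate(0.1, 10_000, rnd))
```

The double-precision sum is accurate to far better than 1e-6, the single-precision sum visibly drifts from 1000, and the half-precision sum stalls at 256: beyond that magnitude, 0.1 is smaller than half a unit in the last place of binary16, so every addition rounds back to the same value.
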
In Hugo’s postdoc, we seek to develop a systematic approach to optimizing, at different accuracy levels, scientific applications whose performance is limited by calls to mathematical library functions (exp, log, sin, cos, etc.). We have developed a tool that automatically finds a locally optimal, contextual mixed-precision strategy. The method has been tested on both GPU and CPU programs and provides a performance improvement of up to 40% with little user effort.
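
A minimal sketch of the kind of search such a tool can perform, under strong simplifying assumptions: single precision is emulated by rounding the argument and result of a double-precision call to binary32 through `struct`, the four call sites of a hypothetical `kernel` are tuned independently (no calling context), and accuracy is measured as the maximum relative error against an all-double run on sample inputs. The names `kernel`, `tune`, and `SITES` are illustrative; this is not the tool described above, and execution time is not modeled.

```python
import math
import struct

def f32(x):
    """Round a binary64 value to the nearest binary32 value (emulated single precision)."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Hypothetical kernel with four math-library call sites, identified by name.
SITES = ("exp", "log", "sin", "cos")
FUNCS = {"exp": math.exp, "log": math.log, "sin": math.sin, "cos": math.cos}

def call(site, x, config):
    """Evaluate one call site in double precision, or emulate a single-precision call."""
    fn = FUNCS[site]
    if config[site] == "single":
        return f32(fn(f32(x)))
    return fn(x)

def kernel(x, config):
    return (call("exp", -x, config) * call("sin", x, config)
            + call("log", 1.0 + x, config) * call("cos", x, config))

def max_rel_error(config, inputs):
    """Maximum relative error against the all-double reference.
    Inputs must keep the kernel away from zero (true for 0 < x < pi/2)."""
    ref = {s: "double" for s in SITES}
    worst = 0.0
    for x in inputs:
        exact = kernel(x, ref)
        worst = max(worst, abs(kernel(x, config) - exact) / abs(exact))
    return worst

def tune(inputs, tol):
    """Greedy search: lower one call site at a time to single precision, keep the
    change if the error stays within `tol`, and repeat until no site can be lowered."""
    config = {s: "double" for s in SITES}
    changed = True
    while changed:
        changed = False
        for site in SITES:
            if config[site] == "double":
                config[site] = "single"
                if max_rel_error(config, inputs) <= tol:
                    changed = True
                else:
                    config[site] = "double"  # revert: too inaccurate
    return config

inputs = [0.1 + 0.09 * i for i in range(11)]
print(tune(inputs, 1e-3))
```

Repeating the pass until nothing changes is what makes the result locally optimal: lowering any remaining double-precision call site on its own would exceed `tol`. It does not guarantee a globally best assignment, and a real tool would also weigh the measured speedup of each site.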
