APOLLO

The CAMUS team is developing a framework dedicated to automatic, dynamic and speculative parallelization of programs. This framework is called APOLLO for Automatic speculative POLyhedral Loop Optimizer. This framework allows a user to mark in a source code some nested loops of any kind (for, while or do-while loops) in order to be handled by a speculative parallelization process and take advantage of the underlying multi-core processor architecture. The framework is composed of two main parts. First, extensions to the CLANG-LLVM compiler are devoted to prepare the program for being handled by the runtime system. This phase consists to generating several versions of each target loop nest:

  • an instrumented version, where every memory instruction, as well as some scalar updates, are associated with instructions to collect at runtime the values of accessed memory addresses, or values assigned to the variables ;

  • an original version which corresponds to the sequential execution of the target code ;

  • several code patterns, or skeletons, which represent incomplete optimized and parallelized versions of the target loop nest. These versions will be completed at runtime as soon as an efficient transformation of the original code will be decided. This strategy allows to generate dynamically new parallel versions in a fast way, simply by patching some areas of the selected code pattern. These patterns also contain additional instructions devoted to manage the speculative parallelization, as detecting misspredictions or launch a required rollback.

During the execution, the runtime system orchestrates the launch of the different versions through a mechanism of sliding windows. Each target loop nest is run as a sequence of slices of the outermost loop. Each of these slices is run by using one of the versions generated at compile-time. The scenario is the following:

  • at start-up, the runtime system launches a first slice using the instrumented version. This slice is reduced to a tenth of iterations of the outermost loop, in order to limit the time-overhead due to sequential execution and instrumentation ;

  • using the memory addresses and the values of some scalars that have been collected, the runtime system builds linear interpolation functions for each instrumented instruction, when possible ;

  • also, using the collected memory addresses, the runtime system performs an online dependency analysis by computing the dependence distance vectors for all direct accesses to the same memory locations, where at least one access is a write. By using the linear interpolation functions, the dependence analysis is completed by a value range analysis and a GCD test ;

  • the dependences knowledge is then used to select an optimizing and parallelizing transformation of the target loop nest. This transformation is speculative, since it is based on an analysis of a small slice of the original nest. It is applied by patching the corresponding code pattern ;

  • the parallel version is launched in a slice of significant size. The instructions devoted to the control of the speculation validate the execution. In case of invalidation, the slide is re-executed using the original version, and an instrumented version is re-launched, in order to detect the change of behavior and allow a new parallelization of the code.

Leave a Reply