IOComplexity Associate team

The IOComplexity associate team is an active collaboration between the research group of P. Sadayappan, Ohio State University (OSU, USA), the research group of Louis-Noel Pouchet, Colorado State University (CSU, USA), and Inria CORSE research project.

Most compilers and run-time systems lack a precise performance model for the hardware they are targeting. Yet performance modelling is essential in the quest of exploiting the best of current and future architectures. Performance modelling is a notoriously hard challenge. ¬To overcome this challenge, we propose to approach it through different angles. The first is to restrict to specific types of computation patterns that are amenable to polyhedral representation, possibly via over-approximations; The second is to develop models specialised to the targeted application; The third is to combine abstract approaches to machine learning. In all cases, already developed techniques in data-movement (I/O) complexity analysis will take an important place in our model design as performance bottleneck is usually associated to I/Os between different hardware components.

Team Members

Fabrice RASTELLO (PI Inria), Research Director, Inria CORSE
P. Sadayappan (PI U of U), Professor, U of U
Louis-Noël Pouchet (PI CSU), Associate Professor, CSU
Nicolas Tollenaere, PhD, Inria CORSE / U of U
Auguste Olivry, PhD, Inria CORSE
Nicolas Derumigny, PhD, Inria CORSE / CSU
Rui Li, PhD, U of U
Aravind S. R., Post Doc, OSU (former member)
Changwan Hong, PhD, OSU (former member)
Prashant Singh Rawat, PhD, OSU (former member)
Fabian Gruber, PhD, UGA CORSE (former member)
Antoine Pouille, Internship, Inria CORSE (former member)
Venmugil Elango, PhD, OSU (former member)
Duco van Amstel, PhD Inria CORSE (former member)
Diogo Sampaio, Engineer, Inria CORSE (former member)
Philippe Virouleau, PhD, Inria CORSE (former member)

Recent Scientific Contributions

Data-flow/dependence profiling for structured transformations

Participants: Fabian Gruber, Manuel Selva [CORSE], Diogo Sampaio [CORSE], Christophe Guillon [STMicroelectronics], Antoine Moynault [STMicroelectronics], Louis-Noël Pouchet, Fabrice Rastello

Profiling feedback is an important technique used by developers for performance debugging, where it is usually used to pinpoint performance bottlenecks and also to find optimization opportunities. Assessing the validity and potential benefit of a program transformation requires accurate knowledge of the data flow and data dependencies, which can be uncovered by profiling a particular execution of the program.

In this work we develop MICKEY, an end-to-end infrastructure for dynamic binary analysis, which produces feedback about the potential to apply structured transformations to uncover non-trivial parallelism and data locality via complex program rescheduling. Our tool can handle both inter- and intra-procedural aspects of the program in a unified way, thus enabling structured inter-procedural transformations.

Building a Polyhedral Representation from an Instrumented Execution: Making Dynamic Analyses of Non-Affine Programs Scalable

Participants: Manuel Selva [CORSE], Fabian Gruber, Diogo Sampaio [CORSE], Christophe Guillon [STMicroelectronics], Louis-Noël Pouchet, Fabrice Rastello

The polyhedral model has been successfully used in production compilers. Nevertheless, only a very restricted class of applications can benefit from it. Recent proposals investigated how runtime information could be used to apply polyhedral optimization on applications that do not statically fit the model. In this work, we go one step further in that direction. We propose the folding-based analysis that, from the output of an instrumented program execution, builds a compact polyhedral representation.It is able to accurately detect affine dependencies, fixed-stride memory accesses and induction variables in programs. It scales to real-life applications, which often include some non-affine dependencies and accesses in otherwise affine code. This is enabled by a safe fine-grain polyhedral over-approximation mechanism. We evaluate our analysis on the entire Rodinia benchmark suite, enabling accurate feedback about potential for complex polyhedral transformations.

Analytical Cache Modeling and Tilesize Optimization for Tensor Contractions

Participants: Rui Li, Aravind Sukumaran-Rajam [OSU], Richard Veras [LSU], Tze Meng Low [CMU], Fabrice Rastello, Atanas Routnev [OSU], P. Sadayappan

Data movement between processor and memory hierarchy is a fundamental bottleneck that limits the performance of many applications on modern computer architectures. Tiling and loop permutation are key techniques for improving data locality. However, selecting effective tile-sizes and loop permutations is particularly challenging for tensor contractions due to the large number of loops. Even state-of-the-art compilers usually produce sub-optimal tile-sizes and loop permutations, as they rely on naive cost models. In this paper we provide an analytical model based approach to multi-level tile size optimization and permutation selection for tensor contractions. Our experimental results show that this approach achieves comparable or better performance than state-of-the-art frameworks and libraries for tensor contractions.

Automated Derivation of Roofline Performance Limit for Affine Programs

Participants: Auguste Olivry, Venmugil Elango [OSU], Louis-Noël Pouchet, P. Sadayappan, Julien Langou [UCDenver] Fabrice Rastello

For most relevant computation, the energy and time needed for data movement dominates that for performing arithmetic operations on all computing systems today. Hence it is of critical importance to understand the minimal total data movement achievable during the execution of an algorithm. The achieved total data movement for different schedules of an algorithm can vary widely depending on how efficiently the cache is used, e.g., untiled versus effectively tiled matrix-matrix multiplication. A significant current challenge is that no existing tool is able to meaningfully quantify the potential reduction to the data movement of a computation that can be achieved by more effective use of the cache through operation rescheduling. Asymptotic parametric expressions of data movement lower bounds have previously been manually derived for a limited number of algorithms, often without scaling constants. In this work, we present the first compile-time approach for deriving non-asymptotic parametric expressions of data movement lower bounds for arbitrary affine computations.
The approach has been implemented in a fully automatic tool (IOLB) that can generate these lower bounds for input affine programs. IOLB’s use is demonstrated by exercising it on all the benchmarks of the Polybench suite. The advantages of IOLB are many: (1) IOLB enables us to derive bounds for few dozens of algorithms for which these lower bounds have never been derived. This reflects an increase of productivity by automation. (2) Anyone is able to obtain these lower bounds through IOLB, no expertise is required. (3) For some of the most well-studied algorithms, the lower bounds obtained by IOLB are higher than any previously reported manually derived lower bounds.

Recent Visits (2019)

P. Sadayappan came one week in Grenoble
Fabrice Rastello spent one week at U of U
Nicolas Tollenaere spent one months at U of U
Théo Barollet spent one month at CSU
Fabrice Rastello, Nicolas Tollenaere, Nicolas Derumigny spent one month at U of U

IOComplexity Associate team

Team Members

Recent Scientific Contributions

Recent Visits (2019)

Recent News

Leave a Reply Cancel reply

Categories

Meta

Blogroll