Software

Micro-architecture modeling & Performance debugging

Pipedream reverse engineers the following performance characteristics: (1) Instruction latency – The number of cycles an instruction requires to execute. (2) Peak micro-op retirement rate – How many fused micro-ops the CPU can retire per cycle. (3) Micro-fusion – The number of fused micro-ops an instruction decomposes into. (4) Micro-op decomposition and micro-op port usage – The list of unfused micro-ops every instruction decomposes into and the list of execution ports every one of these micro-ops can execute on.

Gitlab: https://gitlab.inria.fr/CORSE/pipedream

PALMED computes a bipartite graph assembly instructions to abstract resources that may be used for performance prediction, targeting static analysis tools and compilers. Internally, PALMED uses PIPEDREAM as a framework for microbenchmarking code generation, and use gurobi to find a first small graph. Then, PALMED deduces from the found resources and the microbenchmarks that saturates them a mapping of every supported instruction.

Demonstrator webpage: https://palmed.corse.inria.fr/

Mickey is a set of tools for profiling based performance debugging for compiled binaries. It uses a dynamic binary translator to instrument arbitrary programs as they are being run to reconstruct the control flow and track data dependencies. This information is then fed to a polyhedral optimizer that proposes structured transformations for the original code. Mickey can handle both inter- and intra-procedural control & data flow in a unified way, thus enabling inter-procedural structured transformations. It is based on QEMU to allow for portability, both in terms of targeted CPU architectures, but also in terms of programming environment and the use of third-party libraries for which no source code is available.

Data-movement complexity

IOLB computes a symbolic lower bound on the I/O, or data movement, complexity of a computer program, that is the amount of data that needs to be moved between cache and main memory to perform its computation. The input is a C program, and the output is a mathematical formula that depends on program parameters (array sizes…) and cache size.

IOOpt takes as input an abstract representation of a tileable program. The tool generates a tractable set of relevant permutations of the tiling loops, and a symbolic I/O cost expression for each of them. It then uses a non linear problem optimizer to find the best permutations and corresponding tile sizes for a given value of machine parameters (cache sizes and bandwidths at each level). IOOop can also be used to find an upper bound on the I/O cost of a program, for a given tiling scheme.

Demonstrator webpage: https://iocomplexity.corse.inria.fr/

Compiler optimization for ML workloads

TTiLE is a code generation tool for tensor operators present in CNNs such as tensor contraction and convolution. It takes as input: 1. a kernel specification, that is, a fonctionnal description of the operator (iteration space, tensors, single assignment description of the computation), 2. an optimization scheme that describes the layered structure of the generated code. The scheme “language” allows to express loop permutation, loop tiling, vectorization, unrolling, packing (which consists in using temporary buffers making cache access better behaved) and branching. Indeed, as opposed to existing schemes that rely on partial tiles, TTile can create non-perfectly nested loops to combine several “micro-kernels” (micro-kernel stands for the innermost part of the loop nest that is fully unrolled, register promoted and vectorized here). Using a specialized search strategy that combines operational research on analytical performance model and native execution time measurement, TTile outperforms current highly optimized libraries such as Intel oneDNN and Intel MKL.

The XTC (XDSL Transform Compiler) project, is a compiler framework that provides high-level scheduling specifications for linear algebra operations over a dataflow graph. The projet provides high-level syntax for applying transformations like: – Tiling – Loop interchange – Vectorization – Unrolling It allow the integration of multiple backend and implements: – MLIR backend (using MLIR linalg and transform dialects) – TVM backend (using TVM tensor IR and schedule APIs) – JIR backend (using JIR intermediate representation) Core components: – Abstract interfaces for tensors, graphs, nodes, and operators – Implementers for different backends – Schedulers for transformation management – Compilers for code generation – Evaluators for performance measurement Main tools: mlir-loop: Compilation driver loop-explore: Automatic exploration of different scheduling strategies loop-display: Visualization of exploration results

Teaching of programming

EasyTracker is intended for computer science teaching. It encapsulates the execution control and monitoring of another program written in Python or any compiled language supported by GDB. It can pause execution at interest points described in a high level language (variable modified or return of recursive functions for example).

Gitlab: https://gitlab.inria.fr/CORSE/easytracker

Agdbentures is a game to teach debugging. It is based on GDB (the Gnu Debugger), uses the Easytracker library, and proposes to students an RPG-like (Role Playing Game) 2D interface, where each level is a program in the C language that contains one or more bugs. To validate a level, one needs to first correct the bugs that block the main character in its goals. Difficulty is gradual, and the code for each level is based (and expands) on the preceding level, which allows players to get familiar with the code base.

Gitlab: https://gitlab.inria.fr/CORSE/agdbentures

Others

IPFME: Quantifier elimination is the process of removing existential variables of a given formula, obtaining one with less variables and that implies the original formula. This can also be viewed as a projection of the set of points (integer here) that satisfy the original formula onto a sub-vectorial space made up of all the non-eliminated variables. The obtained projection is an over-approximation of the exact projection. The goal of the process is to make it as tight as possible.

IPFME presents extensions to the Fourier-Motzkin quantifier elimination process. The developed techniques allow to derive more precise simplification operations when handling integer valued multivariate polynomial systems.

The implementation, in C++, uses GiNaC (https://www.ginac.de/) for the manipulation of symbolic expressions.

Software

Micro-architecture modeling & Performance debugging

Data-movement complexity

Compiler optimization for ML workloads

Teaching of programming

Others

Categories

Meta

Blogroll