Sylvain Collange

Research Scientist in the PACAP project-team at Inria in Rennes, France.

My research interests are throughput-oriented parallel architectures such as GPUs, and emerging compute architectures in general. My primary focus is computer architecture; I am also interested in compilers, software optimization and computer arithmetic.


Sylvain Collange
INRIA – Centre de recherche Rennes – Bretagne Atlantique
Campus de Beaulieu
35042 RENNES Cedex

E-mail: sylvain.collange [at]
Phone: +33 29984 7105
Office: E320 Rouge
Web (this page):


I aim to improve both the efficiency and the programmability of throughput-oriented processors through architecture changes and compiler techniques.

Compiler optimization for quantum computers

The first generation of usable quantum computers is now a reality, enabling experimental computer science in quantum computing research.
We propose compiler analyses and transformations that optimize quantum programs for existing and near-term quantum computers. With only a few tens or even hundreds of qubits, these computers are still too small to afford general error-correction techniques; instead, software and compilers have to cope with noise. Compiler optimization is critical, not only to minimize the time to solution, but more importantly to maximize the accuracy of results.

  • Qubit allocation is the process of mapping the logical qubits of a quantum program onto physical qubits while respecting hardware constraints; it is the quantum equivalent of register allocation. Our CGO 2018 paper formally introduces the qubit allocation problem and provides an exact solution to it. This optimal algorithm handles the simple quantum machinery available today; however, it cannot scale up to the more complex architectures scheduled to appear. Thus, we also provide a heuristic solution to qubit allocation, which is faster than current solutions to this problem.
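The flavor of the problem can be illustrated with a toy routine. This is only a sketch, not the CGO 2018 algorithm: it assumes a trivial initial placement and greedily inserts SWAPs until each two-qubit gate's operands sit on adjacent physical qubits of the coupling graph. The function name `allocate` and the mini gate format are hypothetical.

```python
from collections import deque

def allocate(circuit, coupling, n_phys):
    """Greedy qubit-allocation sketch: route logical qubits with SWAPs
    until every two-qubit gate spans adjacent physical qubits.
    `circuit` is a list of (q1, q2) CNOT pairs; `coupling` is a set of
    undirected physical edges. Returns the number of SWAPs inserted."""
    # Build adjacency of the physical coupling graph
    adj = {p: set() for p in range(n_phys)}
    for a, b in coupling:
        adj[a].add(b)
        adj[b].add(a)

    def dist(src, dst):
        # BFS distance between two physical qubits
        seen, frontier = {src}, deque([(src, 0)])
        while frontier:
            node, d = frontier.popleft()
            if node == dst:
                return d
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, d + 1))
        return float("inf")

    # Trivial initial placement (an assumption): logical i -> physical i
    mapping = list(range(n_phys))
    swaps = 0
    for q1, q2 in circuit:
        # Move q1 toward q2 one SWAP at a time until they are adjacent
        while dist(mapping[q1], mapping[q2]) > 1:
            # Pick the neighbor of q1's location closest to q2's location
            best = min(adj[mapping[q1]], key=lambda p: dist(p, mapping[q2]))
            # Exchange q1 with whichever logical qubit occupies `best`
            other = mapping.index(best)
            mapping[q1], mapping[other] = mapping[other], mapping[q1]
            swaps += 1
    return swaps
```

On a 4-qubit linear coupling 0-1-2-3, a CNOT between logical qubits 0 and 3 costs two SWAPs under this heuristic; minimizing such routing overhead is precisely what drives result accuracy on noisy machines.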

General-purpose SIMT: bridging the gap between SMT CPU and SIMT GPU architectures

I design hardware schemes that allow the execution of existing scalar instruction sets on GPU-like architectures. The mechanisms that perform dynamic vectorization in the Single-Instruction, Multiple-Thread (SIMT) model on current GPUs rely on explicit annotations in the instruction set and on hardware stack structures. I have shown that an alternative, constant-space mechanism can enable SIMT execution of conventional scalar instruction sets. It allows individual threads to be managed, suspended, resumed or migrated independently, lifting the main barrier separating SIMT architectures from general-purpose multi-threaded processors.
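The key idea can be illustrated with a toy model. The sketch below is not the actual hardware scheme: it simulates a few scalar threads, each with its own program counter, and each cycle executes every thread sitting at the minimum active PC. With this min-PC policy, threads that diverge at a branch naturally reconverge, with no stack and no instruction-set annotations. The three-instruction mini-ISA (`add`, `br_neg`, `jmp`) is invented for the example.

```python
def simt_run(program, n_threads, inputs):
    """Toy SIMT model over a scalar ISA: each thread keeps a private PC
    and one register. Each cycle, the scheduler selects the minimum
    active PC; the active mask is the set of threads at that PC.
    `program` is a list of (op, arg) tuples. Returns final registers
    and the number of cycles executed."""
    pc = [0] * n_threads
    reg = list(inputs)
    steps = 0
    while any(p < len(program) for p in pc):
        # Min-PC scheduling: earliest instruction among live threads
        cur = min(p for p in pc if p < len(program))
        mask = [t for t in range(n_threads) if pc[t] == cur]
        op, arg = program[cur]
        for t in mask:
            if op == "add":            # reg += arg
                reg[t] += arg
                pc[t] += 1
            elif op == "br_neg":       # branch to `arg` if reg < 0
                pc[t] = arg if reg[t] < 0 else pc[t] + 1
            elif op == "jmp":          # unconditional jump to `arg`
                pc[t] = arg
        steps += 1
    return reg, steps
```

Running a small if-then-else, the two threads take different paths, yet both arrive together at the instruction after the join point, executing it in a single cycle: reconvergence emerges from the scheduling policy alone.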

My ongoing research along this direction includes:

  • DITVA: Dynamic Inter-Thread Vectorization.
    DITVA starts from an in-order SMT core and incorporates an SIMT execution mode, retaining full compatibility with existing SPMD binaries.
    SBAC-PAD 2016 paper.
    Tech report.
  • Simty: General-Purpose SIMT made simple.
    Simty is an open-source, fully synthesizable RTL design of a general-purpose SIMT core implementing the RISC-V instruction set. Simty aims to define the RISC of general-purpose SIMT: a streamlined, resource-efficient SIMT pipeline suitable as a building block for highly scalable, easy-to-program parallel architectures.


Enabling composable, safe and efficient warp-synchronous SIMD programming on SIMT GPUs

Warp-synchronous programming has evolved from an obscure programmer trick into a common technique for expressing explicit SIMD
computations in CUDA or OpenCL programs, supported by new hardware primitives like warp vote and shuffle instructions.
Warp-synchronous programming is used extensively in highly-tuned libraries like CUB.
However, warp-synchronous programming still lacks clearly defined semantics, documentation, and vendor support, and its use raises code composability issues.
I seek to document warp-synchronous patterns and propose light-weight compiler extensions to improve expressivity and code composability.
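As an illustration of one such pattern, here is a Python model of the classic butterfly reduction built on warp shuffles. On a real GPU each step would be a `__shfl_xor_sync` call in CUDA; the sketch below simulates the lock-step exchange across all lanes at once, which is exactly the implicit synchrony these patterns rely on.

```python
def warp_reduce_sum(values, warp_size=32):
    """Butterfly (XOR) reduction mimicking the warp-shuffle pattern:
    at each step, every lane adds the value held by the lane whose
    index differs in one bit. After log2(warp_size) steps, every lane
    holds the sum of the whole warp. Python model for illustration."""
    assert len(values) == warp_size
    vals = list(values)
    offset = warp_size // 2
    while offset > 0:
        # All lanes exchange simultaneously, as in lock-step SIMD;
        # building a new list models the register exchange of a shuffle.
        vals = [vals[lane] + vals[lane ^ offset] for lane in range(warp_size)]
        offset //= 2
    return vals  # every lane now holds the same total
```

The composability problem mentioned above shows up when such a routine is called from divergent code: the model assumes all 32 lanes are active, an implicit contract that current programming models cannot express or check.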

Accurate, deterministic and fast floating-point calculations on parallel architectures

Floating-point computations on parallel architectures increasingly lead to non-deterministic results, as a consequence of calculation reordering and dynamic scheduling. This raises issues with debugging, testing and validation, as well as code portability. I advocate generalizing the correctly-rounded requirement that the IEEE-754 standard places on basic arithmetic operations to higher-level primitives like sums and sums of products. We have shown that correct rounding can be achieved with virtually no performance impact for bandwidth-bound reduction algorithms on current parallel architectures.
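A small Python sketch of the underlying issue: a naive left-to-right sum rounds differently depending on operand order, modeling the reorderings a parallel reduction introduces, while a correctly-rounded sum (here Python's `math.fsum`, standing in for the reduction algorithms discussed above) returns the same result for every ordering.

```python
import math

def naive_sum(xs):
    """Left-to-right accumulation: each += rounds, so the result
    depends on the order of the operands."""
    s = 0.0
    for x in xs:
        s += x
    return s

# Mixing magnitudes makes rounding order-sensitive: the small terms
# are absorbed differently depending on when the large terms cancel.
xs = [1e16, 1.0, -1e16, 1.0]

forward = naive_sum(xs)
backward = naive_sum(xs[::-1])   # same data, different order, different result

# math.fsum returns the correctly rounded sum of its inputs, so it is
# identical for every ordering -- the property advocated above,
# generalized here from a single operation to a whole reduction.
exact = math.fsum(xs)
```

Since the exact sum of `xs` is 2.0, `math.fsum` returns 2.0 regardless of order, whereas the two naive orderings disagree with each other.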


Book chapter

International journals

  • S. Kalathingal, S. Collange, B. Swamy, A. Seznec. DITVA: Dynamic Inter-Thread Vectorization Architecture. Journal of Parallel and Distributed Computing, 2018. On HAL
  • S. Collange, D. Defour, S. Graillat, and R. Iakymchuk. Numerical reproducibility for the parallel reduction on multi- and many-core architectures. Parallel Computing. Volume 49 Issue 9, Pages 83-97. 2015
  • T. Milanez, S. Collange, F. M. Q. Pereira, W. Meira Jr, and R. Ferreira. Thread scheduling and memory coalescing for dynamic vectorization of SPMD workloads. Parallel Computing. Volume 40 Issue 9, Pages 548-558. 2014.
  • M. G. Arnold, S. Collange. Options for Denormal Representation in Logarithmic Arithmetic. Journal of Signal Processing Systems. Volume 77, Issue 1-2, Pages 207-220. 2014
  • D. Sampaio, R. M. de Souza, S. Collange, F. M. Q. Pereira. Divergence Analysis. ACM Transactions on Programming Languages and Systems. Volume 35 Issue 4, Pages 13:1-13:36. 2014. On HAL
  • P.D. Vouzis, S. Collange, M.G. Arnold, M.V. Kothare. Improving model predictive control arithmetic robustness by Monte Carlo simulations. IET Control Theory and Applications. Volume 6, Issue 8, Pages 1064-1070. 2012.
  • M. G. Arnold, S. Collange. A Real/Complex Logarithmic Number System ALU. IEEE Transactions on Computers. Volume 60, Issue 2, Pages 202-213. 2011.
  • P. D. Vouzis, S. Collange, M. G. Arnold. A Novel Cotransformation for LNS Subtraction. Journal of Signal Processing Systems (JSPS). Volume 58, Issue 1, Pages 29-40. 2010. PDF
  • S. Collange, M. Daumas, D. Defour. Line-by-line spectroscopic simulations on graphics processing units. Computer Physics Communications (CPC). Volume 178, Issue 2, Pages 135-143. 2008. PDF

International conferences

  • M. Siraichi, V. F. Dos Santos, S. Collange, F. M. Q. Pereira. Qubit Allocation. International Symposium on Code Generation and Optimization (CGO), 2018. On HAL.
  • Sylvain Collange. Simty: generalized SIMT execution on RISC-V. First Workshop on Computer Architecture Research with RISC-V (CARRV). 2017. On HAL.
  • Rubens Moreira, Sylvain Collange, Fernando Pereira. Function Call Re-Vectorization. ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2017. On HAL.
  • Sajith Kalathingal, Sylvain Collange, Bharath Narasimha Swamy, André Seznec. Dynamic Inter-Thread Vectorization Architecture: extracting DLP from TLP. International Symposium on Computer Architecture and High-Performance Computing (SBAC-PAD), 2016. Best paper award. On HAL.
  • Sylvain Collange, Mioara Joldes, Jean-Michel Muller and Valentina Popescu. Parallel floating-point expansions for extended-precision GPU computations. IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP) 2016. On HAL.
  • Douglas do Couto Teixeira, Sylvain Collange and Fernando Magno Quintão Pereira. Fusion of Calling Sites. International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). 2015. On HAL.
  • M. G. Arnold, S. Collange. The Denormal Logarithmic Number System. 24th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2013. On HAL.
  • T. Milanez, S. Collange, F. Pereira, W. Meira. Data and Instruction Uniformity in Minimal Multi-Threading. International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 2012. On HAL.
  • D. Sampaio, R. Martins, S. Collange, F. Pereira. Divergence Analysis with Affine Constraints. International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 2012. On HAL.
  • N. Brunie, S. Collange, G. Diamos. Simultaneous Branch and Warp Interweaving for Sustained GPU Performance. International Symposium on Computer Architecture (ISCA), 2012. On HAL.
  • S. Collange, M. Daumas, D. Defour, D. Parello. Barra: A Parallel Functional Simulator for GPGPU. 18th Annual IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), 2010. PDF
  • M. G. Arnold, S. Collange, D. Defour. Implementing LNS using filtering units of GPUs. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2010. On HAL.
  • S. Collange, D. Defour, Y. Zhang. Dynamic detection of uniform and affine vectors in GPGPU computations. Europar 3rd Workshop on Highly Parallel Processing on a Chip (HPPC). 2009. On HAL.
  • M. G. Arnold, S. Collange. A Dual-Purpose Real/Complex Logarithmic Number System ALU. 19th Symposium on Computer Arithmetic (ARITH19). 2009. PDF
  • S. Collange, D. Defour, A. Tisserand. Power Consumption of GPUs from a Software Perspective. International Conference on Computational Science (ICCS). 2009. On HAL.
  • S. Collange, Y. Dandass, M. Daumas, D. Defour. Using Graphics Processors for Parallelizing Hash-based Data Carving. Hawaii International Conference on System Sciences (HICSS) 42. 2009. On HAL.
  • S. Collange, J. Flórez, D. Defour. A GPU interval library based on Boost.Interval. 8th Conference on Real Numbers and Computers (RNC8) 2008. PDF
  • S. Collange, M. Daumas, D. Defour. Graphic processors to speed-up simulations for the design of high performance solar receptors. IEEE 18th International Conference on Application-specific Systems, Architectures and Processors (ASAP) 2007. On HAL.
  • P. D. Vouzis, S. Collange, M. G. Arnold, M. Kothare. Monte Carlo Logarithmic Number System for Model Predictive Control. 17th International Conference on Field Programmable Logic and Applications (FPL) 2007. PDF
  • P. D. Vouzis, S. Collange, M. G. Arnold. LNS Subtraction Using Novel Cotransformation and/or Interpolation. IEEE 18th International Conference on Application-specific Systems, Architectures and Processors (ASAP) 2007. Best paper award. PDF
  • P. D. Vouzis, S. Collange, M. G. Arnold. Cotransformation Provides Area and Accuracy Improvement in an HDL Library for LNS Subtraction. EuroMicro Digital System Design (DSD) 2007. PDF
  • S. Collange, J. Detrey, F. de Dinechin. Floating Point or LNS: Choosing the Right Arithmetic on an Application Basis. EuroMicro Digital System Design (DSD) 2006, pp. 197-203. PDF

National journals and conferences

  • S. Collange, N. Brunie. Parcours par liste de chemins : une nouvelle classe de mécanismes de suivi de flot SIMT. Compas 2017. On HAL
  • S. Collange. Un processeur SIMT généraliste synthétisable. Compas 2016. On HAL
  • R. Iakymchuk, S. Graillat, S. Collange, D. Defour. ExBLAS: Reproducible and Accurate BLAS Library. 7ème Rencontre Arithmétique de l’Informatique Mathématique (RAIM 2015), Apr 2015, Rennes, France. 2015. On HAL
  • N. Brunie, S. Collange. Reconvergence de contrôle implicite pour les architectures SIMT. Technique et Science Informatiques, Vol 32/2 – 2013 – pp.153-178. On HAL.
  • D. Sampaio, E. Gedeon, F. Pereira, S. Collange. Spill Code Placement for SIMD Machines. Simpósio Brasileiro de Linguagens de Programação (SBLP). LNCS Volume 7554, 2012, pp 12-26. 2012. PDF
  • S. Collange. Une architecture unifiée pour traiter la divergence de contrôle et la divergence mémoire en SIMT. SYMPosium en Architectures (SYMPA) 2011. On HAL.
  • S. Collange, M. Daumas, D. Defour, D. Parello. Étude comparée et simulation d’algorithmes de branchements pour le GPGPU. SYMPosium en Architectures (SYMPA) 2009. On HAL.
  • S. Collange, M. Daumas, D. Defour, R. Olivès. Fonctions élémentaires sur GPU exploitant la localité de valeurs. SYMPosium en Architectures nouvelles de machines (SYMPA) 2008. On HAL.
  • S. Collange, M. Daumas, D. Defour. État de l’intégration de la virgule flottante dans les processeurs graphiques. Technique et Science Informatiques, Vol 27/6 – 2008 – pp.719-733. 2008. PDF

PhD Thesis

  • S. Collange. Enjeux de conception des architectures GPGPU : unités arithmétiques spécialisées et exploitation de la régularité. PhD Thesis, Université de Perpignan Via Domitia. 2010. On TEL

Technical reports

  • S. Collange, N. Brunie. Path list traversal: a new class of SIMT flow tracking mechanisms. Inria Research Report RR-9073, 2017. On HAL
  • S. Collange. Simty: a Synthesizable General-Purpose SIMT Processor. Inria Research Report RR-8944, 2016. On HAL
  • S. Kalathingal, S. Collange, B. N. Swamy, A. Seznec. Transforming TLP into DLP with the Dynamic Inter-Thread Vectorization Architecture. Inria Technical report RR-8830, 2015. On HAL
  • S. Collange, A. Kouyoumdjian. Affine Vector Cache for memory bandwidth savings. ENS Lyon, technical report ensl-00649200, 2011. On HAL-ENSL.
  • S. Collange. Stack-less SIMT reconvergence at low cost. Technical report hal-00622654, 2011. On HAL.
  • S. Collange. Identifying scalar behavior in CUDA kernels. Technical report hal-00555134, 2011. On HAL.
  • S. Collange. Analyse de l’architecture GPU Tesla. Technical report hal-00443875, January 2010. On HAL.
  • S. Collange, D. Defour, D. Parello. Barra, a Parallel Functional GPGPU Simulator. Technical report hal-00359342, 2009. On HAL.

Talks, posters

Graduate courses



  • Barra. A functional GPU simulator of Nvidia G80 running CUDA. Part of Unisim. BSD license, 2009.
  • Graphics memory latency and throughput tests in CUDA. 2008.
  • Boost.Interval on GPU. Guaranteed interval arithmetic in Cuda and Cg. Boost Software License, 2008.
  • GPU4RE: Line-by-line spectroscopic simulation on GPU. CPC Program Library adzy_v1_0. CPC license, 2007.
  • FPTest: Test suite for GPU floating-point arithmetic based on GPUBench. BSD license, 2007.
  • FPLibrary-Cotransformation. Hardware operators in the Logarithmic Number System using cotransformation. GPL, 2006.


  • ExBLAS. Correctly-rounded linear algebra routines. BSD license, 2014.
  • Unisim. UNIted SIMulation environment. BSD license, 2009.
  • FloPoCo. Generator of arithmetic cores (Floating-Point Cores, but not only) for FPGAs. GPL, 2008.
  • FPLibrary. Hardware arithmetic operators for FPGAs in FP and LNS. GPL, 2006.


Teaching: 2017/2018

Advanced Architectures (ADA)
1. Introduction to GPU architectures
2. SIMT control flow management

Optimization Techniques for Parallel Code (OPT)
1. Parallel programming models
2. Introduction to GPU architecture
3. CUDA basics
4. GPU code optimization
4.5. Atomics and unified memory
5. Warp-synchronous programming with Cooperative Groups
Labs 1-2
Labs 3-4
Labs 5-6
Labs 7-8

Parallel Programming (PPAR)
1. GPU architecture
2. CUDA programming
3. GPU code optimization
Lab 1
Lab 2
