
Computational kernels

Communication-avoiding algorithms

 

[Figure: comm_wall]

Left: computation-to-communication ratio for the LU factorization with partial pivoting of a dense matrix on a model of an exascale machine (ISC 2014). Right: preconditioned Krylov solver as used in the map-making problem in astrophysics; results obtained on a Cray XE6 machine (Astronomy and Astrophysics, 2014).

Our research focuses on a novel approach to dense and sparse linear algebra algorithms that aims to minimize communication, where communication refers to both the volume of data moved and the number of messages exchanged. The main goal is to reformulate and redesign linear algebra algorithms so that they are optimal in the amount of communication they perform, while retaining numerical stability. In general, they significantly reduce the time spent communicating relative to conventional algorithms. These new algorithms are referred to as communication-avoiding algorithms. Very often, they are based on a reformulation that expresses the operation on a block of columns of the input matrix as a reduction operation. The reduction tree shapes the task dependency graph associated with the computation, and its choice depends on the underlying architecture.
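As a rough illustration of this reduce-based formulation, here is a minimal NumPy sketch in the spirit of the tall-skinny QR (TSQR) kernel that underlies communication-avoiding QR: the factorization of a block of columns is obtained by factoring row blocks locally and combining only the small triangular factors along a binary reduction tree. The function name, block count, and sequential loop are illustrative assumptions; they stand in for what would be a parallel reduction across processors, and this is not a reproduction of our implementation.

import numpy as np

def tsqr_r(blocks):
    # Tall-skinny QR expressed as a reduction: each block of rows is
    # factored locally, then pairs of small R factors are stacked and
    # re-factored up a binary tree, so only k-by-k triangles are
    # combined instead of full panels.
    rs = [np.linalg.qr(b, mode='r') for b in blocks]       # local QR on each block
    while len(rs) > 1:
        nxt = []
        for i in range(0, len(rs) - 1, 2):
            nxt.append(np.linalg.qr(np.vstack((rs[i], rs[i + 1])), mode='r'))
        if len(rs) % 2:                                     # odd block carried to next level
            nxt.append(rs[-1])
        rs = nxt
    return rs[0]

# The R factor from the tree agrees (up to row signs) with a one-shot QR.
A = np.random.rand(1000, 8)                                 # a tall-skinny panel
R_tree = tsqr_r(np.array_split(A, 4))
R_flat = np.linalg.qr(A, mode='r')
print(np.allclose(np.abs(R_tree), np.abs(R_flat)))          # expected: True

On a distributed machine each leaf would live on a different processor and each level of the tree would cost one message exchange per pair, which is where the reduction in the number of messages comes from.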

 

[Figure: tslu]

Communication-avoiding LU uses tournament pivoting to minimize communication (SC08). Lightweight scheduling combines static and dynamic scheduling to provide a good trade-off between load balance, data locality, and dequeue overhead (IPDPS 2012).

 

Low-rank approximation

[Figure: low_rank_turing]

Comparison between LU-PP (bottom left), LU-CRTP (bottom), and the SVD decomposition (top) on a 919×707 image. Applying the truncated SVD to the original matrix with a target rank of 38 gives only a slight blur of the image, and doubling the rank gives an image visually indistinguishable from the original. Here the compression ratio is roughly 5 for both the SVD and LU-CRTP.

Low-rank approximation (LRA) algorithms aim to reduce both the computational cost and the storage footprint, for example in deep neural networks or in image compression. The idea is to approximate a matrix A of size M×N by a product of two smaller matrices of size M×k and k×N, where k is the rank, i.e., the number of useful columns; the storage thus drops from M·N to k·(M+N) entries. The most accurate LRA is based on the truncated Singular Value Decomposition (SVD). Unfortunately, the truncated SVD is expensive. Cheaper alternatives exist, such as QR or LU with partial pivoting, but in general they yield a less accurate approximation. In contrast, applying a rank-revealing QR (RRQR) or our low-rank algorithm LU-CRTP with the same rank achieves almost the same accuracy and the same compression as the truncated SVD, as the sketch below illustrates.
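As a rough illustration (not our LU-CRTP implementation), the following NumPy/SciPy sketch compares a truncated-SVD approximation with a QR-with-column-pivoting approximation, used here as a simple rank-revealing stand-in, on a synthetic matrix with the same dimensions as the image above. It also reports the storage compression ratio M·N / (k·(M+N)). The test matrix, the rank values, and the function names are illustrative assumptions.

import numpy as np
from scipy.linalg import qr

def svd_lowrank(A, k):
    # Rank-k approximation from the truncated SVD (most accurate, most expensive).
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

def qrcp_lowrank(A, k):
    # Cheaper rank-k approximation from QR with column pivoting, a simple
    # rank-revealing surrogate (not the LU-CRTP algorithm itself).
    Q, R, piv = qr(A, mode='economic', pivoting=True)
    return Q[:, :k] @ R[:k, :][:, np.argsort(piv)]         # undo the column permutation

M, N = 919, 707                                             # same shape as the image above
A = np.random.rand(M, 150) @ np.random.rand(150, N)         # synthetic rank-150 test matrix
for k in (38, 76):
    compression = (M * N) / (k * (M + N))                   # dense vs. factored storage
    err_svd = np.linalg.norm(A - svd_lowrank(A, k)) / np.linalg.norm(A)
    err_qr = np.linalg.norm(A - qrcp_lowrank(A, k)) / np.linalg.norm(A)
    print(f"k={k:3d}  compression {compression:.1f}  "
          f"rel. error: SVD {err_svd:.2e}, QRCP {err_qr:.2e}")

For k = 76 the factored storage is 76·(919+707) entries against 919·707 for the dense matrix, i.e. a compression ratio of about 5.3, consistent with the figure above.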