Inria Skoltech Workshop

In the framework of the Molière associated team between Skoltech and Inria, we organized a virtual workshop (by invitation only) from July 6 to 8, 2021, on the topics of training in DNNs, compression, tensors, and memory minimization.

The talks take place from 10:30 to 12:30 (Moscow time), that is, from 9:30 to 11:30 (Bordeaux time).

 

 

July 6: 10:30 to 12:30 (Moscow time) and 9:30 to 11:30 (Bordeaux time).

10:30-10:45 (Moscow time), 9:30-9:45 (Bordeaux time) : Welcome session: Olivier Beaumont, Ivan Oseledets

Session: Compressed Linear Algebra. Chair: Ivan Oseledets (Skoltech)

10:45-11:15 (Moscow time), 9:45-10:15 (Bordeaux time) :

Title: STARS-H: a High Performance Data-Sparse Matrix Market Library on Large-Scale Systems (slides)

Speaker: Aleksandr Mikhalev, RWTH Aachen, Germany, mikhalev@acom.rwth-aachen.de

Abstract: We introduce the STARS-H library, a Software for Testing Accuracy, Reliability and Scalability of Hierarchical matrix approximations. The aim of STARS-H is to provide a reference matrix market library for hierarchical or H-matrices. These matrices inherently exhibit a low-rank off-diagonal block structure, which may be exploited to reduce the memory footprint and the complexity of arithmetic operations. H-matrices arise in many scientific applications in which sparsity occurs naturally in the data, e.g., in covariance matrices when modeling interactions between particles in electrostatics or between geospatial locations for climate/weather forecasting. STARS-H takes advantage of this data sparsity by approximating the off-diagonal tiles up to a given application-dependent accuracy threshold. While STARS-H provides a myriad of H-matrices coming from synthetic and real applications, its software infrastructure delivers high-performance computations on modern x86 and GPU hardware architectures for shared- and distributed-memory systems. It supports OpenMP, MPI and task-based programming models for asynchronous execution, for a broad software outreach and integration. Existing H-matrix computation libraries can then rely on STARS-H to assess their numerical robustness, while enabling a fair performance comparison across a common reference set of H-matrices.
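To make the low-rank idea concrete, the short NumPy sketch below (an illustration only, unrelated to the STARS-H code or API) compresses one far-field interaction block with a truncated SVD up to a prescribed relative accuracy threshold, the kind of approximation applied tile by tile in such libraries.

```python
import numpy as np

def compress_block(block, tol):
    # Truncated SVD of one off-diagonal tile: keep only the singular
    # values above tol * sigma_max, as in tile low-rank formats.
    U, s, Vt = np.linalg.svd(block, full_matrices=False)
    r = max(1, int(np.sum(s > tol * s[0])))
    return U[:, :r] * s[:r], Vt[:r, :]          # rank-r factors

# Toy far-field block: a smooth kernel evaluated on two well-separated clusters.
x = np.linspace(0.0, 1.0, 200)
y = np.linspace(5.0, 6.0, 200)
block = 1.0 / np.abs(x[:, None] - y[None, :])
U, V = compress_block(block, tol=1e-8)
print(U.shape, V.shape)                          # rank is much smaller than 200
print(np.linalg.norm(block - U @ V) / np.linalg.norm(block))
```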

 

11:15-11:45 (Moscow time), 10:15-10:45 (Bordeaux time) :

Title: Exploiting Generic Tiled Algorithms Toward Scalable H-Matrices Factorizations on Top of Runtime Systems (slides)

Speaker: Mathieu Faverge, Bordeaux INP, Inria, LaBRI, France, mathieu.faverge@inria.fr

Abstract: Hierarchical matrices (H-matrices) have become important in applications where accuracy requirements can be relaxed to reduce both the execution time and the memory consumption to a logarithmic order, for instance when solving Boundary Element Method (BEM) problems. However, the natural hierarchical structure of H-matrices makes them harder to parallelize efficiently with modern programming paradigms such as task-based implementations. In this presentation we discuss how to combine Chameleon, a tiled dense linear algebra software relying on sequential task-based algorithms and runtime systems such as StarPU, with hmat-oss, a library providing a set of sequential algorithms for H-algebra operations. We will discuss the limitations in terms of H-matrix structure and memory compression that are imposed by the use of a tiled algorithm, and we will show the performance that can be obtained by such a generic solution with respect to more advanced implementations that fully exploit the hierarchical data structure.
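To give an idea of what a generic tiled algorithm looks like, the following NumPy sketch (purely illustrative; it uses none of the Chameleon, StarPU or hmat-oss interfaces) writes a tiled Cholesky factorization as a sequence of per-tile kernels; a runtime system would execute each loop body as an asynchronous task, and each tile could itself be stored in a compressed form.

```python
import numpy as np

def tiled_cholesky(tiles):
    # Right-looking tiled Cholesky on a dict {(i, j): tile} holding the
    # lower triangle.  Each loop body corresponds to one task
    # (POTRF / TRSM / SYRK / GEMM) that a runtime system could schedule
    # asynchronously once its data dependencies are satisfied.
    nt = max(i for i, _ in tiles) + 1
    for k in range(nt):
        tiles[k, k] = np.linalg.cholesky(tiles[k, k])                     # POTRF
        for i in range(k + 1, nt):
            tiles[i, k] = np.linalg.solve(tiles[k, k], tiles[i, k].T).T   # TRSM
        for i in range(k + 1, nt):
            tiles[i, i] -= tiles[i, k] @ tiles[i, k].T                    # SYRK
            for j in range(k + 1, i):
                tiles[i, j] -= tiles[i, k] @ tiles[j, k].T                # GEMM
    return tiles

# Toy usage: factor a 400x400 SPD matrix split into 100x100 tiles.
n, b = 400, 100
M = np.random.randn(n, n)
A = M @ M.T + n * np.eye(n)
tiles = {(i, j): A[i*b:(i+1)*b, j*b:(j+1)*b].copy()
         for i in range(n // b) for j in range(i + 1)}
L = tiled_cholesky(tiles)
print(np.allclose(L[3, 1], np.linalg.cholesky(A)[3*b:4*b, 1*b:2*b]))      # True
```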

11:45-12:15 (Moscow time), 10:45-11:15 (Bordeaux time) :

Title: Rank revealing QR methods for sparse block low-rank solvers (slides)

Speaker: Esragul Korkmaz, Inria, Bordeaux

Abstract: Solving linear systems of the form Ax=b for large sparse matrices frequently arises in science and engineering applications and often constitutes the main bottleneck. Although direct methods are costly in time and memory consumption, they are still the most robust way to solve these systems. Nowadays, supercomputers tend to have more and more computational units, while the memory available per core is reduced. Therefore, when solving these linear systems, memory reduction becomes as important as time reduction. For this purpose, compression methods for the dense blocks appearing inside sparse matrix solvers have been introduced to reduce the memory consumption, as well as the time to solution. When looking for the lowest possible compression rank, the Singular Value Decomposition (SVD) gives the best result, but it is too costly, as the whole factorization must be computed to find the resulting rank. Rank-revealing QR decomposition variants are less costly, but can produce larger ranks. Among these variants, column pivoting or matrix rotation can be applied to the matrix A, such that the most important information in the matrix is gathered in the leftmost columns and the remaining, unnecessary information can be omitted. To reduce the communication cost of the classical QR decomposition with column pivoting, blocked versions with randomization have been suggested as an alternative way to find the pivots. In these randomized variants, the matrix A is projected onto a much lower-dimensional matrix using an independent and identically distributed Gaussian matrix, so that the pivoting/rotation matrix can be computed on the lower-dimensional matrix. In addition, to avoid unnecessary updates of the trailing matrix at each iteration, a truncated randomized method has been suggested and shown to be more efficient for larger matrix sizes. Thanks to these methods, results closer to those of the SVD can be obtained and the cost of compression can be reduced. In this presentation, we will compare all these methods in terms of complexity, numerical stability, and performance. We will also present results from the integration of these methods in the PaStiX block low-rank solver.
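The randomized pivoting idea can be sketched in a few lines of NumPy/SciPy (an illustration only; the blocked and truncated variants integrated in PaStiX are more involved): the pivots are selected from QR with column pivoting applied to a small Gaussian sketch of A rather than to A itself.

```python
import numpy as np
from scipy.linalg import qr

def randomized_qrcp_compress(A, rank, oversample=10):
    # Select pivot columns with QR-with-column-pivoting applied to a small
    # Gaussian sketch of A, then build a rank-`rank` approximation A ~ Q @ R
    # from the selected columns of A.
    m, n = A.shape
    G = np.random.randn(rank + oversample, m)           # i.i.d. Gaussian projection
    B = G @ A                                           # (rank+p) x n sketch, much smaller than A
    _, _, piv = qr(B, mode='economic', pivoting=True)   # pivots chosen on the sketch
    Q, _ = np.linalg.qr(A[:, piv[:rank]])               # orthonormal basis of selected columns
    R = Q.T @ A
    return Q, R

# Toy usage on an exactly rank-30 matrix.
A = np.random.randn(500, 30) @ np.random.randn(30, 400)
Q, R = randomized_qrcp_compress(A, rank=30)
print(np.linalg.norm(A - Q @ R) / np.linalg.norm(A))    # close to machine precision
```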

 

July 7: 10:30 to 12:30 (Moscow time) and 9:30 to 11:30 (Bordeaux time).

Session: Memory Minimization when Training Deep Neural Networks

 

10:30-11:00 (Moscow time), 9:30-10:00 (Bordeaux time) :

Title: Memory Optimization with Re-materialization when Training DNNs (slides)

Speaker: Alena Shilova, Inria, Bordeaux

Abstract: Training Deep Neural Networks is known to be an expensive operation, both in terms of computational cost and memory load. Indeed, during training, all intermediate layer outputs (called activations) computed during the forward phase must be stored until the corresponding gradient has been computed in the backward phase. These memory requirements sometimes prevent the use of larger batch sizes and deeper networks, so that they can limit both convergence speed and accuracy. Recent works have proposed to rely on a rematerialization strategy: it consists in dynamically selecting the forward activations that are materialized (saved) or deleted during the forward phase, and then automatically recomputing missing activations from previously materialized ones when they are needed. We propose an original computation model that combines two types of activation savings: either storing only the layer inputs, or recording the complete history of operations that produced the outputs. For this model we find an optimal solution based on dynamic programming, and we show through a series of experiments that our new algorithm outperforms the previous state of the art.
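The underlying trade-off can be experimented with using PyTorch's built-in checkpointing utilities, as in the short example below; this only illustrates the general rematerialization mechanism and is not the dynamic-programming algorithm presented in the talk.

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# A deep sequential network whose activations would normally all be kept
# in memory between the forward and backward phases.
model = torch.nn.Sequential(*[
    torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU())
    for _ in range(64)
])
x = torch.randn(256, 1024, requires_grad=True)

# Split the network into 8 segments: only the segment boundaries are
# materialized during the forward phase, and the activations inside each
# segment are recomputed on the fly during the backward phase.
y = checkpoint_sequential(model, 8, x)
y.sum().backward()
```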

11:00-11:30 (Moscow time), 10:00-10:30 (Bordeaux time) :

Title: Rotor: a PyTorch extension for memory optimization of deep networks (slides)

Speaker: Lionel Eyraud-Dubois, Inria, Bordeaux

Abstract: Rotor (Rematerializing Optimally with pyTORch) is a software developed at Inria Bordeaux whose goal is to optimize memory usage when training deep PyTorch networks. It uses rematerialization to automatically adapt the memory usage of the network to the available memory on the device. In this talk, I will describe how to use Rotor, covering both basic and advanced usage.

 

11:30-12:00 (Moscow time), 10:30-11:00 (Bordeaux time) :

Title: Tensor methods for neural networks speed-up and compression (slides)

Speaker: Julia Gusak, Skoltech

Abstract: Most modern neural networks are overparameterized and exhibit high computational costs. As a result, they cannot be efficiently deployed on embedded systems and mobile devices due to power and memory limitations. We present several approaches to accelerate and compress neural networks: some are based on low-rank approximations of weight tensors, and others on low-rank matrix representations of layers' outputs. We also introduce a Python library called MUSCO (MUlti-Stage COmpression) for automated neural network compression via tensor approximations.
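As a toy example of the weight-factorization flavour of these methods (a hand-written sketch, not the MUSCO API), a linear layer can be replaced by two thinner layers obtained from a truncated SVD of its weight matrix:

```python
import torch

def factorize_linear(layer, rank):
    # Replace one Linear layer by two thinner ones obtained from a
    # truncated SVD of its weight matrix: W ~ (U sqrt(S)) (sqrt(S) Vh).
    W = layer.weight.data                                  # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    first = torch.nn.Linear(layer.in_features, rank, bias=False)
    second = torch.nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data = S[:rank].sqrt()[:, None] * Vh[:rank]   # (rank, in_features)
    second.weight.data = U[:, :rank] * S[:rank].sqrt()         # (out_features, rank)
    if layer.bias is not None:
        second.bias.data = layer.bias.data
    return torch.nn.Sequential(first, second)

# Toy usage on a layer whose weight happens to be (nearly) low rank.
layer = torch.nn.Linear(1024, 1024)
with torch.no_grad():
    layer.weight.copy_(torch.randn(1024, 64) @ torch.randn(64, 1024) / 64)
compressed = factorize_linear(layer, rank=64)                  # ~8x fewer weight parameters
x = torch.randn(32, 1024)
print((layer(x) - compressed(x)).abs().max())                  # small reconstruction error
```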

 

 

July 8: 10:30 to 12:30 (Moscow time) and 9:30 to 11:30 (Bordeaux time).

Session: Tensors

10:30-11:00 (Moscow time), 9:30-10:00 (Bordeaux time) :

Title: From low rank tensors for grids of multivariate functions to joint distributions of statistical models: selecting the relevant norm (slides)

Speaker: Alain Franc, INRAE, Bordeaux

Abstract: We first recall elementary bounds linking the CP and TT ranks of any tensor. Then we show (by elementary calculations as well) that the CP rank of the discretization of a polynomial on a Cartesian grid is low, and transfer these results to the TT-rank. We use the Stone-Weierstrass theorem to extend this to accurate low-rank (CP, TT) approximation of continuous functions in the supremum norm. Hence, working with a diversity of norms may be crucial, even if a standard theory also exists for the Frobenius norm. We then move to tensors representing the joint distribution of a statistical model, and focus on the calculation of the partition function, which is a standard and difficult problem both in statistical physics and in computational statistics. We recall that statisticians use the Kullback-Leibler distance to compare distributions, and observe that, as all coefficients of a tensor representing a joint distribution are non-negative, the partition function is the L1 norm of the tensor. Hence we move to the problem of finding the best low-rank approximation of a tensor in the L1 norm, a challenge for which we discuss some possible leads.
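A small NumPy example of the last observation (purely illustrative): for a non-negative tensor encoding an unnormalized joint distribution, the partition function is exactly the entry-wise L1 norm, and when the interactions form a chain it can be computed by contracting along the chain without ever forming the full tensor.

```python
import numpy as np

# Unnormalized joint distribution of three discrete variables with
# nearest-neighbour interactions only (a toy chain model).
n = 10
i, j, k = np.meshgrid(np.arange(n), np.arange(n), np.arange(n), indexing='ij')
P = np.exp(-0.1 * (i - j) ** 2 - 0.05 * (j - k) ** 2)

# All entries are non-negative, so the partition function Z (the sum of all
# entries) coincides with the entry-wise L1 norm of the tensor.
Z = P.sum()
assert np.isclose(Z, np.abs(P).sum())

# Because the interactions form a chain, Z can be computed by contracting
# the two pairwise factors, without ever forming the full tensor.
A = np.exp(-0.1 * (np.arange(n)[:, None] - np.arange(n)[None, :]) ** 2)
B = np.exp(-0.05 * (np.arange(n)[:, None] - np.arange(n)[None, :]) ** 2)
assert np.isclose(Z, np.ones(n) @ A @ B @ np.ones(n))
```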

 

11:00-11:30 (Moscow time), 10:00-10:30 (Bordeaux time) :

Title: Sampling zeros of a sparse tensor (slides)

Speaker: Bora Uçar, CNRS, ENS Lyon

Abstract: Kolda and Hong (SIAM J. MoDS, 2(4):1066–1095, 2020) have recently proposed a stochastic tensor decomposition method. It is an iterative method in which the zeros and nonzeros of a given tensor should be sampled separately at each iteration to accelerate convergence. While sampling nonzeros is straightforward, one needs an efficient algorithm for sampling zeros. We formulate the problem of efficiently sampling zeros of a sparse tensor as a static hashing problem on hypergraph-structured data and present results. This is joint work with Jules Bertrand (ENS de Lyon) and Fanny Dufosse (DataMove, Inria), and is available online at https://hal.inria.fr/hal-03127673
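For context, the naive baseline for sampling zeros is rejection sampling against the stored nonzeros, as in the short sketch below (an illustration of the problem, not the hashing-based scheme of the paper):

```python
import numpy as np

def sample_zeros_rejection(shape, nonzeros, k, seed=0):
    # Draw uniform random indices and reject those that hit a stored
    # nonzero; only positions holding a zero of the tensor are kept.
    rng = np.random.default_rng(seed)
    nz = {tuple(map(int, row)) for row in nonzeros}
    samples = []
    while len(samples) < k:
        idx = tuple(int(rng.integers(0, d)) for d in shape)
        if idx not in nz:
            samples.append(idx)
    return samples

# Toy 3-way sparse tensor given by the coordinates of its nonzeros.
nonzeros = np.array([[0, 1, 2], [3, 3, 3], [7, 0, 5]])
print(sample_zeros_rejection((10, 10, 10), nonzeros, k=5))
```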

 

11:30-12:00 (Moscow time), 10:30-11:00 (Bordeaux time) :

Title: Parallel Tensor Train through Hierarchical Decomposition (slides)

Speaker: Suraj Kumar, Inria, Alpines Team

Abstract: We consider the problem of developing parallel decomposition and approximation algorithms for high-dimensional tensors. We focus on a tensor representation named Tensor Train (TT). It stores a d-dimensional tensor with O(ndr^2) entries, much less than the O(n^d) entries of the original tensor, where r is usually a very small number that depends on the application. Sequential algorithms to compute the TT decomposition and TT approximation of a tensor have been proposed in the literature. Here we propose a parallel algorithm to compute the TT decomposition of a tensor. We prove that the ranks of the TT-representation produced by our algorithm are bounded by the ranks of the unfolding matrices of the tensor. Additionally, we propose a parallel algorithm to compute an approximation of a tensor in TT-representation. Our algorithm relies on a hierarchical partitioning of the dimensions of the tensor in a balanced binary tree shape and on the transmission of leading singular values of the associated unfolding matrices from a parent to its children. We consider several approaches based on how the leading singular values are transmitted in the tree. We present an in-depth experimental analysis of our approaches for different low-rank tensors. Our results show that the approach which transmits leading singular values to both of its children performs better in practice. The compression ratios and accuracies of the approximations obtained by our approaches are comparable with those of the sequential algorithm and, in some cases, even better.
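For reference, the sequential TT-SVD that such parallel algorithms are compared against can be written in a few lines of NumPy; the sketch below uses a simple relative truncation threshold and is not the authors' code.

```python
import numpy as np

def tt_svd(tensor, tol=1e-10):
    # Sequential TT-SVD: successive truncated SVDs of the unfolding matrices.
    dims = tensor.shape
    cores, r = [], 1
    rest = tensor.reshape(r * dims[0], -1)
    for k in range(len(dims) - 1):
        U, s, Vt = np.linalg.svd(rest, full_matrices=False)
        r_new = max(1, int(np.sum(s > tol * s[0])))
        cores.append(U[:, :r_new].reshape(r, dims[k], r_new))
        rest = (s[:r_new, None] * Vt[:r_new]).reshape(r_new * dims[k + 1], -1)
        r = r_new
    cores.append(rest.reshape(r, dims[-1], 1))
    return cores

# Low-TT-rank test tensor: samples of f(x, y, z) = sin(x + y + z).
x = np.linspace(0.0, 1.0, 20)
T = np.sin(x[:, None, None] + x[None, :, None] + x[None, None, :])
cores = tt_svd(T)
print([c.shape for c in cores])                  # TT ranks stay small (2 here)
print(np.allclose(np.einsum('aib,bjc,ckd->ijk', *cores), T))
```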

 

12:00-12:30 (Moscow time), 11:00-11:30 (Bordeaux time) :

Title: Tensor Train decomposition with Jax mindset (slides)

Speaker: Alexander Novikov (DeepMind)

Abstract: Tensor Train decomposition (a.k.a. MPS) provides a way to compactly represent and work with exponentially large tensors and is used, for example, in quantum chemistry, machine learning and solving PDEs. We present a library for working with the TT-decomposition called TTAX, which is based on and inspired by Jax, a functional library for machine learning. The two main novelties in TTAX are a) a tensor contraction engine, which allows you to fuse multiple elementary operations into a single one to (asymptotically) speed up computations, and b) automatic Riemannian differentiation, which allows you to write a function in a single line of Python and automatically get optimal code for computing its Riemannian gradient and the product of its (linearized) Riemannian Hessian with a vector. Also, thanks to using Jax under the hood, TTAX automatically supports running any function on GPUs and vectorization of operations.
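To give a flavour of working with TT cores in a Jax style (a generic sketch written directly against jax, not the TTAX API), the example below jit-compiles a function that contracts a list of TT cores into the full tensor.

```python
import jax
import jax.numpy as jnp

@jax.jit
def tt_full(cores):
    # Contract TT cores of shape (r_{k-1}, n_k, r_k) into the full tensor.
    res = cores[0]                                   # (1, n_1, r_1)
    for core in cores[1:]:
        res = jnp.einsum('...a,aib->...ib', res, core)
    return res[0, ..., 0]                            # drop the boundary ranks

keys = jax.random.split(jax.random.PRNGKey(0), 3)
cores = [jax.random.normal(keys[0], (1, 4, 3)),
         jax.random.normal(keys[1], (3, 4, 3)),
         jax.random.normal(keys[2], (3, 4, 1))]
print(tt_full(cores).shape)                          # (4, 4, 4)
```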
