Stochastic Subsampling for Huge Matrix Factorization

Existing approaches for matrix factorization are not readily usable to extract base sparse or sparsifying components from terabyte scale datasets as the one we encountered in fMRI. We therefore designed a matrix-factorization algorithm, called subsampled online matrix factorization (SOMF), that scales to input matrices with both huge number of rows and columns.

SOMF can learn factors sparse or dense and/or non-negative, which makes it suitable for dictionary learning, sparse component analysis, and non-negative matrix factorization. In brief, as shown in the following figure, SOMF streams matrix columns while subsampling them to iteratively learn the matrix factors. At each iteration, the row dimension of a new sample is reduced by subsampling, resulting in lower time complexity compared to a simple streaming algorithm.

Our method comes with convergence guarantees to reach a stationary point of the matrix-factorization problem. The convergence analysis is based on analysing the robustness to perturbation of a wider category of algorithm, known as stochastic majorization-minimization algorithms. We propose a generalised stochastic approximate majorization-minimization framework, that we use to derive an asymptotical convergence analysis for SOMF.

We demonstrate the efficiency of SOMF on massive functional Magnetic Resonance Imaging data (2 TB of resting-state data, that corresponds to the release of 500 subjects of the HCP study), and on patches extracted from hyperspectral images (103 GB). For both problems, in which we use different penalties on rows and columns, we obtain significant speed-ups compared to state-of-the-art algorithms. Finally, SOMF can be adapted for explicit collaborative filtering, and provides large speed-ups compared to the fastest methods available for models relying on matrix factorization.


Python package available here.


Arthur Mensch, Julien Mairal, Bertrand Thirion, Gaël Varoquaux
Stochastic Subsampling for Factorizing Huge Matrices,
IEEE Transactions on Signal Processing, 2018

Arthur Mensch, Julien Mairal, Bertrand Thirion, Gaël Varoquaux
Dictionary Learning for Massive Matrix Factorization,
International Conference oo Machine Learning, 2016

Comments are closed.