Total variation regularization

TV (Total-Variation) regularization can be used to extract information from brain images in both regression and classification settings. Feature selection and model estimation are performed jointly and capture the predictive information present in the data better than alternative methods. A particularly important property of this approach is its ability to create spatially coherent regions with similar weights, yielding simplified and informative sets of features. In particular, the segmented regions are robust to inter-subject variability. TV regularization is a powerful tool for understanding brain activity and for the spatial mapping of cognitive processes, and its use in this context first appeared in this paper.

Mental representations of shape and size of objects (inter-subject analysis). Top – voxels selected within one of the three main clusters by TV regression, for the Sizes prediction experiment. Bottom – voxels selected at least once within one of the three main clusters for each of the one-vs-one TV classifications, for the Objects prediction experiment. Some clusters found in the Objects prediction experiment are more anterior (their centers of mass are at [50, −72, −2] mm and [46, −80, −2] mm) than the ones found for the Sizes prediction experiment (centers of mass at [16, −96, 10] mm and [−26, −96, −10] mm). This is consistent with the hypothesis that shapes are processed at a higher level of visual information processing, and thus that the regions involved lie further along the ventral pathway.

TV-ℓ1 regularization (structured sparsity)

Here, we consider decoding as a statistical estimation problem and show that injecting a spatial segmentation prior leads to unmatched performance in recovering predictive regions. Specifically, we use ℓ1 penalization to set voxels to zero and a TV prior to segment regions. Our contribution is two-fold. On the one hand, we show via extensive experiments that, amongst a large selection of decoding and brain-mapping strategies, TV-ℓ1 leads to the best region recovery (see Fig. 8). On the other hand, we consider implementation issues related to this estimator. To tackle this joint prediction-segmentation problem efficiently, we introduce a fast optimization algorithm based on a primal-dual approach. We also tackle the automatic setting of hyper-parameters and the fast computation of image operations on the irregular masks that arise in brain imaging. Check out the full PRNI 2013 paper.
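To make the combined penalty concrete, here is a minimal NumPy sketch of how a TV-ℓ1 penalty can be evaluated for a weight vector defined on an irregular brain mask. The helper names (`image_gradient`, `tv_l1_penalty`), the `l1_ratio` mixing parameter and the zero-padding outside the mask are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np


def image_gradient(img):
    """Forward finite differences along each axis of a volume.

    Returns an array of shape (img.ndim, *img.shape). The last slice along
    each axis is left at zero (a simple boundary convention).
    """
    grad = np.zeros((img.ndim,) + img.shape)
    for axis in range(img.ndim):
        index = [slice(None)] * img.ndim
        index[axis] = slice(0, -1)
        grad[(axis,) + tuple(index)] = np.diff(img, axis=axis)
    return grad


def tv_l1_penalty(weights, mask, l1_ratio=0.5):
    """Hypothetical helper: l1_ratio * ||w||_1 + (1 - l1_ratio) * TV(w).

    `weights` holds one value per voxel inside the boolean `mask`.
    """
    img = np.zeros(mask.shape)
    img[mask] = weights  # put the voxel weights back on the 3D grid
    grad = image_gradient(img)
    tv = np.sqrt((grad ** 2).sum(axis=0)).sum()  # isotropic total variation
    l1 = np.abs(weights).sum()
    return l1_ratio * l1 + (1.0 - l1_ratio) * tv
```

For a fixed ℓ1 norm, a compact blob of non-zero weights gets a smaller TV term than the same weights scattered over isolated voxels, which is why the penalty favors segmented regions. Note that filling the outside of the mask with zeros, as done here, also penalizes jumps at the mask boundary; a careful implementation would handle that boundary explicitly.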


Figure 8. Results on fMRI data (from left to right: F-test, ElasticNet and TV-ℓ1). The TV-ℓ1 regularized model segments neuroscientifically meaningful predictive regions in agreement with univariate statistics, while the ElasticNet yields sparse but very scattered non-zero weights.

TV and TV-ℓ1 solvers

The TV-ℓ1 and TV priors lead to a difficult non-smooth convex optimization problem. In an award-winning PRNI 2014 paper, we explored a wide variety of solvers for this problem (proximal methods like ISTA and FISTA; LBFGS on smooth surrogates; operator splitting methods like ADMM and primal-dual; HANSO, etc.) and exhibited their convergence properties. The implementation finally retained was FISTA (Fast Iterative Shrinkage-Thresholding Algorithm), in which the prox of the TV-ℓ1 (resp. TV) prior is estimated in an embedded FISTA loop, with finer and finer precision.
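The outer FISTA loop can be sketched as follows for a least-squares data-fit term. This is an illustrative sketch, not the retained implementation: the function names are hypothetical, and the `prox` argument is demonstrated with plain soft-thresholding (the closed-form prox of the ℓ1 norm) as a stand-in. For TV or TV-ℓ1, `prox` would itself be an inner iterative solver run to a given precision.

```python
import numpy as np


def soft_threshold(v, t):
    """Closed-form proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def fista(X, y, prox, alpha, n_iter=200):
    """Minimize 0.5 * ||y - X w||^2 + alpha * penalty(w) with FISTA.

    `prox(v, t)` must return the proximal operator of t * penalty at v.
    """
    lipschitz = np.linalg.norm(X, 2) ** 2  # Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    z, t = w.copy(), 1.0
    for _ in range(n_iter):
        grad = X.T @ (X @ z - y)  # gradient of the data-fit term at z
        w_new = prox(z - grad / lipschitz, alpha / lipschitz)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        z = w_new + ((t - 1.0) / t_new) * (w_new - w)  # momentum step
        w, t = w_new, t_new
    return w
```

Passing the prox as a function keeps the outer loop unchanged when the penalty changes; only the cost and accuracy of each `prox` call differ.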

Benchmarking solvers for TV-l1 least-squares and logistic regression in brain imaging

HAL paper here

Learning predictive models from brain imaging data, as in decoding cognitive states from fMRI (functional Magnetic Resonance Imaging), is typically an ill-posed problem as it entails estimating many more parameters than available sample points. This estimation problem thus requires regularization. Total variation regularization, combined with sparse models, has been shown to yield good predictive performance, as well as stable and interpretable maps. However, the corresponding optimization problem is very challenging: it is non-smooth, non-separable and heavily ill-conditioned. For the penalty to fully exercise its structuring effect on the maps, this optimization problem must be solved to a good tolerance, resulting in a computational challenge. In this work, we explore a wide variety of solvers and exhibit their convergence properties on fMRI data. We introduce a variant of smooth solvers and show that it is a promising approach in these settings. Our findings show that care must be taken in solving TV-l1 estimation in brain imaging and highlight the successful strategies.


TV-l1 maps for a face-house discrimination task taken from a visual recognition dataset, with regularization parameters chosen by cross-validation, for different stopping criteria. The stopping criterion is defined as a threshold on the energy decrease per iteration of the algorithm. This figure shows the importance of convergence of the multivariate estimator, and motivates the need for a fast solver.
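A stopping rule of this kind can be sketched as below. This is an assumption-laden illustration: the helper names are hypothetical, the threshold is taken on the absolute energy decrease, and a small ℓ1-penalized least-squares problem solved with ISTA stands in for the TV-l1 estimator used in the figure.

```python
import numpy as np


def minimize_with_energy_tol(step, energy, w0, tol, max_iter=100000):
    """Iterate `step` until the energy decrease per iteration drops below `tol`."""
    w, e_old = w0, energy(w0)
    n_iter = 0
    while n_iter < max_iter:
        w = step(w)
        n_iter += 1
        e_new = energy(w)
        if e_old - e_new < tol:  # stopping criterion on the energy decrease
            break
        e_old = e_new
    return w, n_iter


def make_lasso_problem(seed=0, n_samples=40, n_features=200, alpha=1.0):
    """Toy stand-in problem: more features than samples, sparse ground truth."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, n_features))
    w_true = np.zeros(n_features)
    w_true[:5] = 1.0
    y = X @ w_true + 0.1 * rng.standard_normal(n_samples)
    lipschitz = np.linalg.norm(X, 2) ** 2

    def energy(w):
        return 0.5 * np.sum((y - X @ w) ** 2) + alpha * np.abs(w).sum()

    def step(w):  # one ISTA iteration
        v = w - X.T @ (X @ w - y) / lipschitz
        return np.sign(v) * np.maximum(np.abs(v) - alpha / lipschitz, 0.0)

    return step, energy, np.zeros(n_features)
```

Running the same problem with a loose and a tight `tol` and comparing the two weight vectors is a quick way to check whether a map has actually converged or merely stopped early.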


SpaceNet is a family of “structure + sparsity” priors for regularizing models for brain decoding. It includes not only the TV and TV-ℓ1 priors already discussed above, but also other models such as GraphNet [Grosenick 2013] (aka Smooth-Lasso [Hebiri 2011]) and Sparse-Variation [Eickenberg 2015]. The code has been merged, and will appear in the next release of nilearn. For an in-depth exposition on SpaceNet, check out the following presentation at OHBM 2015.
Shown to the left are the 20% most important voxels, and then the whole brain, for a prediction task [Mixed Gambles]. SpaceNet employs a screening heuristic to eliminate irrelevant voxels from the brain before the model fitting problem is even entered. This heuristic (and others described in more detail in the following PRNI 2015 paper) leads to a 10-fold speedup in the model fit.
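The idea of screening voxels before fitting can be illustrated with a simple univariate filter. The sketch below is not the SpaceNet implementation: the function name `screen_voxels` is hypothetical, and ranking voxels by absolute correlation with the target is an assumed scoring rule chosen to keep the example self-contained.

```python
import numpy as np


def screen_voxels(X, y, percentile=20.0):
    """Keep the `percentile` % of voxels most correlated (in absolute value) with y.

    X has shape (n_samples, n_voxels); returns a boolean support mask.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    # Constant voxels have a zero denominator: give them a score of zero.
    corr = np.abs(Xc.T @ yc) / np.where(denom > 0, denom, np.inf)
    threshold = np.percentile(corr, 100.0 - percentile)
    return corr >= threshold


# Usage: fit the (expensive) model only on the retained columns.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 100))
y = X[:, 3] + X[:, 7] + 0.1 * rng.standard_normal(200)
support = screen_voxels(X, y, percentile=20.0)
X_screened = X[:, support]
```

Because the solver cost grows with the number of voxels, fitting on `X_screened` rather than `X` shrinks every gradient and prox computation, at the risk of discarding voxels that are only informative jointly with others.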

Sparse Variation for statistical learning.

A computationally efficient estimator: fAASTA

Full paper on HAL

The total variation (TV) penalty, as many other analysis-sparsity problems, does not lead to separable factors or a proximal operator with a closed-form expression, such as soft thresholding for the ℓ1 penalty. As a result, in a variational formulation of an inverse problem or statistical learning estimation, it leads to challenging non-smooth optimization problems that are often solved with elaborate single-step first-order methods. When the data-fit term arises from empirical measurements, as in brain imaging, it is often very ill-conditioned and without simple structure. In this situation, in proximal splitting methods, the computation cost of the gradient step can easily dominate each iteration. Thus it is beneficial to minimize the number of gradient steps. We present fAASTA, a variant of FISTA, that relies on an internal solver for the TV proximal operator, and refines its tolerance to balance computational cost of the gradient and the proximal steps. We give benchmarks and illustrations on brain decoding: recovering brain maps from noisy measurements to predict observed behavior. The algorithm as well as the empirical study of convergence speed are valuable for any non-exact proximal operator, in particular analysis-sparsity problems.
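The contrast between a closed-form prox and one that needs an internal solver can be seen on a 1D toy case. The sketch below is an illustration under stated assumptions, not fAASTA: the function names are hypothetical, the signal is 1D, and the inner solver is a plain projected-gradient iteration on the dual problem, stopped when the duality gap falls below `tol`.

```python
import numpy as np


def adjoint_diff(u):
    """Adjoint of np.diff: maps n - 1 dual values to n primal values."""
    return -np.diff(np.concatenate(([0.0], u, [0.0])))


def prox_tv1d(y, lam, tol=1e-6, max_steps=10000, u0=None):
    """Approximate argmin_x 0.5 * ||x - y||^2 + lam * sum(|diff(x)|).

    Returns the primal estimate, the dual variable (usable as a warm start
    through `u0`) and the number of inner steps taken.
    """
    u = np.zeros(y.size - 1) if u0 is None else u0.copy()
    n_steps = 0
    while True:
        x = y - adjoint_diff(u)  # primal point associated with u
        dx = np.diff(x)
        gap = lam * np.abs(dx).sum() - dx @ u  # duality gap, always >= 0
        if gap <= tol or n_steps >= max_steps:
            return x, u, n_steps
        # Projected gradient step on the dual; 0.25 <= 1 / ||diff||^2.
        u = np.clip(u + 0.25 * dx, -lam, lam)
        n_steps += 1
```

The `tol` argument is the knob that an outer solver can tighten over its iterations: a loose tolerance makes each prox call cheap but inexact, a tight one costs more inner steps, and reusing the returned dual variable as `u0` reduces the work of the next call.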


Convergence of currently available optimization algorithms for three scenarios, with weak, medium and strong regularization, where medium regularization corresponds to the value chosen by cross-validation. These are log-log plots, with zero defined as the lowest energy value reached across all algorithms.
