Abstract: Differentially private (DP) stochastic convex optimization (SCO) is a fundamental problem, where the goal is to approximately minimize the population risk with respect to a convex loss function, given a dataset of i.i.d. samples from a distribution, while satisfying differential privacy with respect to the dataset. Most existing work on private convex optimization focuses on the Euclidean (i.e., $\ell_2$) setting, where the loss is assumed to be Lipschitz (and possibly smooth) with respect to the $\ell_2$ norm over a constraint set of bounded $\ell_2$ diameter. Algorithms based on noisy stochastic gradient descent (SGD) are known to attain the optimal excess risk in this setting.
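To make the noisy-SGD idea concrete, here is a minimal sketch (not the talk's exact algorithm) of one pass of differentially private projected SGD with Gaussian gradient perturbation; `grad` and `project` are user-supplied callables, and the step size `eta`, clipping threshold `C`, and noise multiplier `sigma` are illustrative placeholders. The privacy accounting across iterations (composition, amplification by sampling) is omitted.

```python
import numpy as np

def dp_sgd(grad, project, w0, data, eta=0.1, C=1.0, sigma=1.0, seed=0):
    """Noisy projected SGD sketch: clip each per-sample gradient to
    l2-norm C, add Gaussian noise calibrated to that sensitivity,
    then project back onto the constraint set."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    for x in data:
        g = np.asarray(grad(w, x), dtype=float)
        g *= min(1.0, C / (np.linalg.norm(g) + 1e-12))   # enforce l2 sensitivity C
        g += rng.normal(0.0, sigma * C, size=g.shape)    # Gaussian mechanism
        w = project(w - eta * g)                         # stay in the constraint set
    return w
```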
In this talk I will provide a brief overview of differential privacy and its applications in stochastic convex optimization, together with new upper and lower bounds on the population risk for DP-SCO in non-Euclidean settings, in particular where the reference norm is the $\ell_p$ norm, $1\leq p\leq \infty$. Our algorithms include DP versions of variance-reduced stochastic Frank-Wolfe and mirror-descent methods, together with a novel privacy mechanism which generalizes the Gaussian mechanism. Our work draws upon concepts from the geometry of normed spaces, such as the notions of regularity, uniform convexity, and uniform smoothness.
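For reference, the standard Gaussian mechanism that the talk's new mechanism generalizes releases a statistic $f$ with $\ell_2$-sensitivity $\Delta_2$ as
$$ M(D) = f(D) + Z, \qquad Z \sim \mathcal{N}(0, \sigma^2 I), \qquad \sigma = \frac{\Delta_2 \sqrt{2\ln(1.25/\delta)}}{\varepsilon}, $$
which satisfies $(\varepsilon,\delta)$-differential privacy for $\varepsilon \in (0,1)$; the generalization adapts this kind of calibrated noise to non-Euclidean geometries.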
Based on joint work with Raef Bassily and Anupama Nandi, available at arXiv:2103.01278.
Yohann de Castro
Abstract: Lifting a problem into a much larger space can lead to a simplified and useful representation. This trick is used, for instance, in the Monge-Kantorovich formulation of Optimal Transport. We will use the same trick for sparse regularization of mixture models (MM). We will show that MM recovery can be lifted to the space of measures, addressing two fundamental limits: the problem becomes convex, and knowledge of the number of mixture components is no longer required.
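As a rough illustration of the lift (one common convex instantiation, not necessarily the talk's exact formulation): instead of optimizing over $k$ weights and component parameters, one optimizes over a nonnegative measure $\mu$ on the parameter space $\Theta$,
$$ \min_{\mu \in \mathcal{M}_+(\Theta)} \; \frac{1}{2} \Big\| y - \int_{\Theta} \varphi(\theta)\, \mathrm{d}\mu(\theta) \Big\|^2 + \lambda\, \mu(\Theta), $$
where $\varphi(\theta)$ is the contribution of a component with parameter $\theta$ to the observations $y$ and $\lambda > 0$ promotes sparsity. The objective is convex in $\mu$, and the number of components is read off from the support of the solution rather than fixed in advance.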
Alexandre d’Aspremont
Michaël Fanuel
Christophe Biernacki
Wilfried Heyse
Florent Dewez
Abstract: Trajectory optimisation is a wide field of academic and industrial study, with numerous applications to real-world problems. Optimal control methods are commonly used to compute the trajectory minimising a criterion while complying with certain constraints. This family of physics-based methods requires knowledge of the underlying dynamics of the system to provide realistic solutions. Nevertheless, the dynamics are often (partially) unknown in applications, and estimating them can be a challenging task; moreover, the solution may be affected by the resulting statistical errors. In this talk, we will present a dynamics-free methodology for optimised yet realistic trajectories. It leverages the information in available trajectory data to derive constrained optimisation problems via a maximum a posteriori approach. These problems, which involve regularised objective functions, are constrained in a natural and simple way by the data: the search space is automatically restricted to a region centered on the data and contained in a metric space reflecting estimated features. The solutions thus inherit realistic patterns from the observed trajectories. We will also present two applications, including an industrial problem from aeronautics, whose results are very promising. This is joint work with Benjamin Guedj, Vincent Vandewalle and Arthur Talpaert.
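For context, the generic maximum a posteriori reduction behind such formulations (stated in textbook form; the talk's specific likelihood and prior are not reproduced here) reads
$$ \hat{x} \in \arg\max_{x} \, p(x \mid D) = \arg\min_{x} \, \big\{ -\log p(D \mid x) - \log p(x) \big\}, $$
so the negative log-likelihood acts as a data-fit term and the negative log-prior as the regulariser, which is how regularised objective functions and data-driven constraints of the kind described above arise.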
Stephane Chretien
Ernesto Araya Valdivia
Abstract: In the Random Geometric Graph (RGG) model, each vertex of a graph is associated with a latent point in a metric space, and the probability of an edge between two vertices depends on the distance between the associated latent points. Placing ourselves in the context of dense graphs and taking the Euclidean unit sphere as the ambient (latent) space, we consider the following estimation problem: a graph is generated by the RGG model, but we only have access to its adjacency matrix, and our objective is to recover the distances between the latent points. We propose a spectral estimator for the matrix of latent distances, which is consistent provided that a spectral gap-type assumption holds. Moreover, we give its rate of convergence in mean squared error and propose an efficient algorithm to compute it. We discuss extensions to the case where the sphere is replaced by the unit ball.
The talk is based on a joint work with Yohann de Castro.
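As a rough, generic illustration of the spectral idea (a minimal sketch, not the paper's estimator; the rank choice and rescaling are illustrative assumptions): truncate the eigendecomposition of the scaled adjacency matrix to get a surrogate Gram matrix, then convert inner products on the sphere into distances.

```python
import numpy as np

def estimate_sphere_distances(A, d):
    """Rank-d spectral sketch: build a surrogate Gram matrix from the
    adjacency matrix A of a dense RGG on the unit sphere, then use
    ||x_i - x_j||^2 = 2 - 2 <x_i, x_j> to recover distances."""
    eigvals, eigvecs = np.linalg.eigh(A.astype(float) / A.shape[0])
    order = np.argsort(-np.abs(eigvals))       # eigenpairs by magnitude
    keep = order[1 : d + 1]                    # skip the top, near-constant one
    G = (eigvecs[:, keep] * eigvals[keep]) @ eigvecs[:, keep].T
    scale = np.sqrt(np.clip(np.diag(G), 1e-12, None))
    G = np.clip(G / np.outer(scale, scale), -1.0, 1.0)  # unit diagonal, as on the sphere
    return np.sqrt(np.maximum(2.0 - 2.0 * G, 0.0))
```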
Benjamin Guedj
Abstract: PAC-Bayes is a generic and flexible framework for analysing the generalisation abilities of machine learning algorithms. It leverages the power of Bayesian inference and allows one to derive new learning strategies. I will briefly present the key concepts of PAC-Bayes and illustrate a few of its recent successes, for deep neural networks and for coherent risk measures.
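For orientation, one classical PAC-Bayes bound (a McAllester-type bound in Maurer's form; the talk presents sharper and more recent variants): for a prior $\pi$, confidence level $\delta \in (0,1)$ and $n$ i.i.d. samples, with probability at least $1-\delta$, simultaneously for all posteriors $\rho$,
$$ L(\rho) \;\leq\; \hat{L}_n(\rho) + \sqrt{\frac{\mathrm{KL}(\rho \,\|\, \pi) + \ln(2\sqrt{n}/\delta)}{2n}}, $$
where $L$ and $\hat{L}_n$ denote the population and empirical risks of the randomised predictor drawn from $\rho$.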
- Letarte, Germain, Guedj and Laviolette, NeurIPS 2019 https://papers.nips.cc/paper/8911-dichotomize-and-generalize-pac-bayesian-binary-activated-deep-neural-networks.pdf
- Mhammedi, Guedj and Williamson, NeurIPS 2020 https://arxiv.org/abs/2006.14763