Modal Seminar, 2014-2015 (22 sessions)

Organizer: Benjamin Guedj.

Olivier Delrieu

Date: 22/06/2015 at 14.00
Affiliation: Adorial.
Title: Distributed computing on GPUs.
Abstract: Lessons learned from porting to GPU a genetic data analysis software package that runs on the Amazon cloud.
Slides: Link.

Sébastien Gadat

Date: 16/06/2015 at 14.00
Affiliation: Toulouse School of Economics & Université Toulouse 1 Capitole.
Webpage: Link.
Title: Regret bounds for Narendra-Shapiro bandit algorithms.
Abstract: Narendra-Shapiro (NS) algorithms are bandit-type algorithms introduced in the sixties (with a view to applications in psychology or learning automata), whose convergence has been intensively studied in the stochastic algorithm literature. In this talk, we study the efficiency of these bandit algorithms from a regret point of view. We show that some competitive bounds can be obtained for such algorithms in a modified penalized version. Up to an over-penalization modification, the pseudo-regret $R_{n}$ related to the penalized two-armed bandit is uniformly bounded by $C \sqrt{n}$ (for a known $C$). We also generalize existing convergence and rates of convergence results to the multi-armed case of the over-penalized bandit algorithm, including the convergence toward the invariant measure of a Piecewise Deterministic Markov Process (PDMP) after a suitable renormalization. Finally, ergodic properties of this PDMP are given in the multi-armed case.
Slides: Link.
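
For orientation, the basic two-armed Narendra-Shapiro update can be sketched as follows. This is only an illustration under stated assumptions, not the penalized or over-penalized algorithm studied in the talk: the step size 1/(n+1), the Bernoulli reward probabilities and the name `narendra_shapiro` are all choices made for this example.

```python
import random


def narendra_shapiro(p1, p2, n_steps, seed=0):
    """Crude (non-penalized) two-armed Narendra-Shapiro scheme.

    x is the probability of playing arm 1. A rewarded arm has its
    probability increased; an unrewarded play leaves x unchanged.
    Returns the final x and the accumulated pseudo-regret.
    """
    rng = random.Random(seed)
    x = 0.5                      # start with no preference between arms
    best = max(p1, p2)
    pseudo_regret = 0.0
    for n in range(1, n_steps + 1):
        gamma = 1.0 / (n + 1)    # assumed step size, always in (0, 1)
        play_first = rng.random() < x
        p = p1 if play_first else p2
        pseudo_regret += best - p        # expected loss of this play
        rewarded = rng.random() < p      # Bernoulli(p) reward
        if rewarded and play_first:
            x += gamma * (1.0 - x)       # move x towards 1
        elif rewarded:
            x -= gamma * x               # move x towards 0
    return x, pseudo_regret


if __name__ == "__main__":
    x_final, regret = narendra_shapiro(0.8, 0.3, 2000)
    print(f"P(arm 1) = {x_final:.3f}, pseudo-regret = {regret:.1f}")
```

Because an unrewarded play never changes `x`, this crude scheme can lock onto the wrong arm; penalization, as discussed in the abstract, is one way to counteract that.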

Alessandro Lazaric

Date: 19/05/2015 at 14.00
Affiliation: Inria Lille – Nord Europe.
Webpage: Link.
Title: Exploiting easy data in online optimization.
Abstract: We consider the problem of online optimization, where a learner chooses a decision from a given decision set and suffers some loss associated with the decision and the state of the environment. The learner’s objective is to minimize its cumulative regret against the best fixed decision in hindsight. Over the past few decades numerous variants have been considered, with many algorithms designed to achieve sub-linear regret in the worst case. However, this level of robustness comes at a cost. Proposed algorithms are often over-conservative, failing to adapt to the actual complexity of the loss sequence which is often far from the worst case. In this paper we introduce a general algorithm that, provided with a “safe” learning algorithm and an opportunistic “benchmark”, can effectively combine good worst-case guarantees with much improved performance on “easy” data. We derive general theoretical bounds on the regret of the proposed algorithm and discuss its implementation in a wide range of applications, notably in the problem of learning with shifting experts. Finally, we provide numerical simulations in the setting of prediction with expert advice with comparisons to the state of the art.
Slides: Link.

Quentin Grimonprez, Jérémie Kellner & Florence Loingeville

Format: Rehearsal for the Journées de Statistique.
Date: 12/05/2015 at 10.30
Shared affiliation: Inria Lille – Nord Europe.
Quentin Grimonprez: Selection of groups of correlated variables by agglomerative hierarchical clustering and group-lasso. In a variable selection setting, using penalized regressions in the presence of strong correlations can be problematic: only a subset of the correlated variables is selected. Aggregating related variables beforehand can help both selection and interpretation. However, variable grouping methods require the calibration of additional parameters. We will present a new method combining agglomerative hierarchical clustering and selection of groups of variables.
Jérémie Kellner: Normality test in high dimension using kernel methods. We propose a new normality test in a reproducing kernel Hilbert space (RKHS). This test builds on the principle of the MMD (Maximum Mean Discrepancy), traditionally used for homogeneity or independence tests. Our method includes a special parametric bootstrap procedure, typical of goodness-of-fit tests, that is computationally cheaper than the standard parametric bootstrap. In addition, a theoretical bound on the Type-II error is given. Finally, simulations show the power of our test in settings where common normality tests quickly become unusable in high dimension.
Florence Loingeville: Gamma-Poisson Hierarchical Generalized Linear Model with 3 random factors, with application to quality control. The count of particles in a homogeneous phase is ideally modelled by the Poisson distribution. In practice, however, the dispersion of microbial count results turns out to be larger than expected under the Poisson model. In this work we propose a Gamma-Poisson Hierarchical Generalized Linear Model with three random factors, in order to estimate the dispersions induced by the different factors of an interlaboratory trial.
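
The overdispersion mentioned in the last abstract can be illustrated with a small simulation. The sketch below is a single-level Gamma-Poisson mixture, not the three-random-factor hierarchical model of the talk; the mean rate, the Gamma shape and all function names are assumptions made for this example. When the Poisson rate is itself Gamma-distributed with mean m and shape k, the counts have variance m + m²/k instead of m.

```python
import math
import random
import statistics


def sample_poisson(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def overdispersion_demo(mean_rate, shape, n_samples, seed=0):
    """Compare plain Poisson counts with Gamma-Poisson counts."""
    rng = random.Random(seed)
    # Plain Poisson: the rate is the same for every count.
    plain = [sample_poisson(rng, mean_rate) for _ in range(n_samples)]
    # Gamma-Poisson: each count gets its own rate, drawn from a Gamma
    # distribution with the given shape and with mean equal to mean_rate.
    scale = mean_rate / shape
    mixed = [sample_poisson(rng, rng.gammavariate(shape, scale))
             for _ in range(n_samples)]
    return {
        "poisson_mean": statistics.mean(plain),
        "poisson_var": statistics.variance(plain),
        "mixed_mean": statistics.mean(mixed),
        "mixed_var": statistics.variance(mixed),
    }


if __name__ == "__main__":
    for name, value in overdispersion_demo(5.0, 2.0, 5000, seed=1).items():
        print(f"{name}: {value:.2f}")
```

With these assumed values the plain Poisson variance stays close to the mean, while the mixed counts have a variance several times larger than their mean.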

Ilaria Giulini

Date: 28/04/2015 at 14.00
Affiliation: ENS Ulm.
Webpage: Link.
Title: PAC-Bayesian bounds for the covariance matrix.
Abstract: Using a PAC-Bayesian approach it is possible to construct a robust estimator of the covariance matrix and to provide non-asymptotic dimension-free bounds. This result allows us to introduce a stable version of principal component analysis (PCA) where we perform a smooth cut-off of the eigenvalues instead of the projection on the largest eigenvectors. Since the previous results do not explicitly depend on the dimension of the ambient space, they can be generalized to infinite-dimensional Hilbert spaces. This approach also allows us to present a new algorithm of spectral clustering.
Slides: Link.

Pierre Alquier

Date: 21/04/2015 at 14.00
Affiliation: ENSAE ParisTech.
Webpage: Link.
Title: Bayesian Estimation of Low-rank Matrices.
Abstract: In many statistical problems, the parameter of interest is a high-dimensional low-rank matrix. While the statistical behaviour of rank-penalization and related methods is now well understood, Bayesian estimation in these models is an as-yet unexplored avenue of research. In this talk, I will describe in detail some priors for two models: a) reduced rank regression, b) matrix completion (with noise). I will discuss the minimax-optimality of Bayesian estimators in both models, and test these methods on simulated and real datasets.
Slides: Link.

Julie Josse

Date: 14/04/2015 at 14.00
Affiliation: Agrocampus Ouest.
Webpage: Link.
Title: A flexible framework for regularized low-rank matrix estimation.
Abstract: Low-rank matrix estimation plays a key role in many scientific and engineering tasks including collaborative filtering and image denoising. Low-rank procedures are often motivated by the statistical model where we observe a noisy matrix drawn from some distribution with expectation assumed to have a low-rank representation. The statistical goal is to try to recover the signal from the noisy data. Classical approaches are centered around singular-value decomposition algorithms. Although the truncated singular value decomposition has been extensively used and studied, the estimator is found to be noisy and its performance can be improved by regularization. Methods based on singular-value shrinkage have achieved considerable empirical success and also have provable optimality properties in the Gaussian noise model (Gavish & Donoho, 2014). In this presentation, we propose a new framework for regularized low-rank estimation that does not start from the singular-value shrinkage point of view. Our approach is motivated by a simple parametric bootstrap idea. In the simplest case of isotropic Gaussian noise, we end up with a new singular-value shrinkage estimator whereas for non-isotropic noise models, our procedure yields new estimators that perform well in experiments.
Slides: Link.

Maxime Brunin

Date: 07/04/2015 at 14.00
Affiliation: Inria Lille – Nord Europe.
Title: The statistical accuracy – computational cost trade-off.
Abstract: This talk is a brief introduction to the time-accuracy trade-off that arises in the big data setting statisticians now face. After a general overview of this question from a bibliographic point of view, we will focus on the change-point detection problem, on which some important ideas will be illustrated.
Slides: Link.

Franck Picard

Date: 24/03/2015 at 14.00
Affiliation: CNRS & Université Claude Bernard Lyon 1.
Webpage: Link.
Title: High throughput approaches for studying the genomic and epigenetic landscapes of human replication origins.
Abstract: Replication is the mechanism by which genomes are duplicated into two exact copies. Genomic stability is under the control of a spatiotemporal program that orchestrates both the positioning and the timing of firing of about 50,000 replication starting points, also called replication origins. Replication bubbles found at origins have been very difficult to map due to their short lifespan. Moreover, with the flood of data characterizing new sequencing technologies, the precise statistical analysis of replication data has become an additional challenge. We propose a new method to map replication origins on the human genome, and we assess the reliability of our finding using experimental validation and comparison with origin maps obtained by bubble trapping. This fine mapping then allowed us to identify potential regulators of the replication dynamics. Our study highlights the key role of CpG Islands and identifies new potential epigenetic regulators (methylation of lysine 20 on histone H4, and tri-methylation of lysine 27 on histone H3) whose coupling is correlated with an increase in the efficiency of replication origins, suggesting those marks as potential key regulators of replication. Overall, our study defines new potentially important pathways that might regulate the sequential firing of origins during genome duplication.
Slides: Link.

Magali Champion

Date: 17/03/2015 at 14.00
Affiliation: INSA Toulouse.
Webpage: Link.
Title: Sparse regression and optimization in high-dimensional framework: application to Gene Regulatory Networks.
Abstract: In this presentation, we focus on the theoretical analysis and the use of statistical and optimization methods for sparse linear regression in a high-dimensional setting. The first part of this work is dedicated to the study of statistical learning methods, more precisely penalized methods and greedy algorithms. The second part concerns the application of these methods to the inference of gene regulatory networks. Gene regulatory networks are powerful tools to represent and analyse complex biological systems, and enable the modelling of functional relationships between elements of these systems. We thus propose to develop optimization methods to estimate relationships in such networks.
Slides: Link.

Alexandre Brouste

Date: 10/03/2015 at 14.00
Affiliation: Université du Maine.
Webpage: Link.
Title: Estimation of wind turbine production.
Abstract: Practitioners consider several uncertainties to estimate the energy produced by a wind farm. Two industrial problems can be studied: estimation of the annual wind farm production in the investment phase and short-term forecasting in the operational phase. We will show how “classical” methods (parameter estimation in diffusion processes, GLM) can be used in this context.
Slides: Link.

Jean Peyhardi

Date: 03/03/2015 at 14.00
Affiliation: Université Montpellier 1 & Inria.
Webpage: Link.
Title: Specification of regression models for categorical data.
Abstract: Categorical data are observed in different applied fields such as econometrics, medicine and biology. Such data involve simple structures (ordinal and nominal data) and also hierarchical structures (partially ordered data for instance). Therefore many regression models have been independently developed for categorical data with respect to the fields and the structures. We first propose a unifying specification of regression models for nominal and ordinal data. Equivalences and invariance properties are then studied in order to characterize derived models. The interest of this characterization is illustrated with two particular situations: individual choice modelling of transport mode and quality of life in oncology. Secondly we introduce the class of partitioned conditional generalized linear models (PCGLMs) for hierarchically structured data. The hierarchical structure of these models is fully specified by a partition tree of categories. Using the genericity of the former specification, the class of PCGLMs handles nominal, ordinal and also partially-ordered response variables.
Slides: Link.

Charles-Elie Rabier

Date: 24/02/2015 at 14.00
Affiliation: INRA.
Webpage: Link.
Title: Gaussian processes for gene detection.
Abstract: We consider the Gaussian processes arising from the search for genes (QTL) on a chromosome. We will establish the asymptotic properties of the likelihood ratio test (LRT) for the absence of a QTL on the chromosome, under selective genotyping (i.e. genotyping only the extreme individuals). We will focus in particular on two models of recombination in the genome: Haldane's Poisson model and an interference model. We will prove that even if the LRT is built from the wrong recombination model (i.e. a model that does not match the data), it converges asymptotically to the LRT built from the true recombination model. However, the estimated QTL location turns out to be different.
Slides: Link.

Allou Samé

Date: 17/02/2015 at 14.00
Affiliation: IFSTTAR – Institut français des sciences et technologies des transports, de l’aménagement et des réseaux.
Title: Segmentation and modelling of temporal data. Mixture models with logistic proportions and extensions.
Abstract: Modelling and describing data that evolve over time are central problems in many applications. This type of data covers univariate and multivariate series as well as functional data. In this talk, we will first show how mixtures of distributions whose proportions vary over time according to logistic functions can be used to segment signals exhibiting heterogeneities. From a practical point of view, this formalism gives access to a whole range of estimation tools, notably the EM algorithm, which will be described in detail. We will then discuss extensions of this model dedicated to the clustering of sets of curves and to the modelling of sequences of curves. The algorithms described in the talk will be illustrated on both simulated data and real data from the monitoring of complex transport systems.
Slides: Link.

Clément Théry

Date: 10/02/2015 at 14.00
Affiliation: ArcelorMittal & Inria Lille – Nord Europe.
Title: Model-based linear regression for correlated and missing covariates. Application to steel industry datasets.
Abstract: Presentation of CorReg, a tool that explicitly models the correlations within a dataset as a system of sub-regressions. This system is found automatically by an MCMC chain and then used to define two estimators of a main linear regression based on the data. The first is a marginal estimator on the independent data (the redundant data are temporarily set aside). The second refines the first sequentially, by plug-in, through a regression on the residuals of the marginal model. Finally, the sub-regression structure can be used to handle missing values. The efficiency of the method (available through the CorReg package on CRAN) is illustrated on simulated data as well as on real data from the steel industry (ArcelorMittal).

Karim Lounici

Date: 03/02/2015 at 14.00
Affiliation: Georgia Institute of Technology.
Webpage: Link.
Title: Principal Component Analysis: estimation of principal components in high dimension.
Abstract: Let $X_{1}, \dots, X_{n}$ be an i.i.d. sample taking values in a Hilbert space such that $E [X_{1}] = 0$ and $E [X_{1} \otimes X_{1}] = \Sigma$. We prove that the empirical covariance matrix $\hat{\Sigma} = \frac{1}{n} \sum_{i = 1}^{n} X_{i} \otimes X_{i}$ provides consistent estimators of the spectrum and eigenspaces of $\Sigma$ under the following condition on the effective rank: $r_{e} (\Sigma) = o (n)$, where $r_{e} (\Sigma) = \frac{\operatorname{tr} (\Sigma)}{\| \Sigma \|_{\infty}}$. In particular, we establish an oracle inequality in sup norm for the eigenvectors, which can be exploited in the variable selection problem. This is joint work with Vladimir Koltchinskii.
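
As a small numerical illustration of the effective rank, the sketch below computes the ratio of the trace to the largest eigenvalue of a finite-dimensional covariance matrix. It assumes that the norm in the denominator is the operator norm, which equals the largest eigenvalue for a symmetric positive semi-definite matrix; the power-iteration helper and all function names are hypothetical and chosen for this example.

```python
def operator_norm(sigma, n_iter=200):
    """Largest eigenvalue of a symmetric PSD matrix by power iteration.

    The iteration starts from the constant vector, so it assumes that
    this vector is not orthogonal to the leading eigenvector.
    """
    d = len(sigma)
    v = [1.0 / d ** 0.5] * d
    lam = 0.0
    for _ in range(n_iter):
        w = [sum(sigma[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
        lam = norm          # norm of sigma applied to a unit vector
    return lam


def effective_rank(sigma):
    """Effective rank tr(sigma) / ||sigma|| of a covariance matrix."""
    trace = sum(sigma[i][i] for i in range(len(sigma)))
    top = operator_norm(sigma)
    if top == 0.0:
        raise ValueError("effective rank is undefined for the zero matrix")
    return trace / top


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    spiked = [[4.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(effective_rank(identity))
    print(effective_rank(spiked))
```

The effective rank never exceeds the dimension and shrinks when a few directions carry most of the variance, which is why a condition on it can hold even in very high or infinite dimension.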

Serge Iovleff

Date: 27/01/2015 at 14.00
Affiliation: Université Lille 1 & Inria Lille – Nord Europe.
Webpage: Link.
Title: Presentation of the stk++ computing library (The Statistical ToolKit).
Abstract: stk++ is a library written in C++ and geared towards statistical applications. In this talk we will present the different parts of the code that make up the library and try to explain the techniques involved to a non-specialist audience. We will then illustrate uses of the library through the aam program, the HDPenReg package and the Mixcomp project.
Slides: Link.

Pierre Pudlo

Date: 20/01/2015 at 14.00
Affiliation: Université Montpellier 2.
Webpage: Link.
Title: Model choice with ABC methods.
Abstract: Approximate Bayesian computation (ABC) methods make it possible to carry out a Bayesian analysis when the likelihood cannot be computed explicitly (latent variables, unknown normalizing constant, etc.). They replace the computation of this likelihood by the simulation of many datasets whose parameters are drawn from the prior distribution. After a quick overview of ABC methods, this talk will focus on model choice. In particular, we will see how to summarize the simulated and observed datasets in order to compare them, and what the target of ABC is when it is viewed as a Monte Carlo method. The talk will end with machine learning methods for predicting the model, in particular random forests. We will illustrate the talk with population genetics questions, where the goal is to infer the past demographic history of natural populations from the traces this history has left in the genomes of individuals sampled today.
Slides: Link.
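
The simulate-and-compare principle described in this abstract can be sketched with a basic rejection sampler for model choice. The toy setting below is an assumption made for this example and is unrelated to the population genetics applications of the talk: model 1 is a coin with unknown bias drawn from a uniform prior, model 2 is a fair coin, and the summary statistic is the number of successes. The random forest approach mentioned in the abstract is not shown.

```python
import random


def simulate_successes(rng, p, n_trials):
    """Simulate a dataset and return its summary: the success count."""
    return sum(rng.random() < p for _ in range(n_trials))


def abc_model_choice(s_obs, n_trials, n_draws=40000, tol=0, seed=0):
    """Rejection ABC estimate of the posterior probability of model 1.

    Model 1: success probability p drawn from Uniform(0, 1).
    Model 2: success probability fixed at 0.5.
    Both models have prior probability 1/2.
    """
    rng = random.Random(seed)
    accepted = {1: 0, 2: 0}
    for _ in range(n_draws):
        model = 1 if rng.random() < 0.5 else 2      # draw the model
        p = rng.random() if model == 1 else 0.5     # draw its parameter
        s_sim = simulate_successes(rng, p, n_trials)
        if abs(s_sim - s_obs) <= tol:               # compare summaries
            accepted[model] += 1
    total = accepted[1] + accepted[2]
    if total == 0:
        raise ValueError("no simulation accepted; increase tol or n_draws")
    return accepted[1] / total


if __name__ == "__main__":
    print(abc_model_choice(s_obs=18, n_trials=20))  # data far from a fair coin
    print(abc_model_choice(s_obs=10, n_trials=20))  # data typical of a fair coin
```

The share of accepted simulations that came from each model approximates its posterior probability. With richer data the summary statistics are no longer sufficient, which is the difficulty the talk addresses.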

Karin Sahmer

Date: 13/01/2015 at 14.00
Affiliation: ISA Lille & LGCgE.
Title: Using nonlinear regression for microbiological applications.
Abstract: The same nonlinear regression is used in different microbiological applications. The talk will cover two of them. The first example is the comparison of the efficacy of biofungicides. One of the model parameters corresponds to the IC50 (50% inhibitory concentration). An F test is used to compare a model in which the IC50 is estimated separately for each biofungicide with a model with a single IC50 for all biofungicides. This makes it possible to conclude whether efficacy differs. In some cases, it is preferable to compare the 90% inhibitory concentration, the IC90, or more generally the ICp for a given percentage p. This ICp can be computed from the estimated model parameters. Rewriting the model equation makes it possible to estimate and compare the ICp directly. A second example of nonlinear regression concerns the assessment of soil quality through the respiration activity of microorganisms. The CO2 produced has a nonlinear influence on the optical density measured in microplate assays. In a calibration phase, the model parameters are estimated. In the usage phase, the inverse of the estimated function is used to compute the CO2 from the measured optical density.
Slides: Link.
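
The abstract does not state which nonlinear model is used. Purely as an illustration, the sketch below assumes a log-logistic dose-response curve with no lower plateau, y = upper / (1 + (x / IC50)^slope), and shows how a concentration giving p% inhibition (written ICp in the code's docstrings) follows from the fitted parameters. The model form, the parameter names and the function names are all assumptions made for this example.

```python
def response(x, upper, ic50, slope):
    """Assumed log-logistic response at concentration x."""
    return upper / (1.0 + (x / ic50) ** slope)


def icp(ic50, slope, p):
    """ICp: concentration at which the response is reduced by p percent.

    Solving upper / (1 + (x / ic50)**slope) = upper * (1 - p / 100)
    for x gives x = ic50 * (p / (100 - p))**(1 / slope).
    """
    if not 0.0 < p < 100.0:
        raise ValueError("p must lie strictly between 0 and 100")
    return ic50 * (p / (100.0 - p)) ** (1.0 / slope)


def response_icp_form(x, upper, icp_value, slope, p):
    """Same curve, rewritten so that ICp is itself a model parameter."""
    return upper / (1.0 + (p / (100.0 - p)) * (x / icp_value) ** slope)


if __name__ == "__main__":
    ic90 = icp(ic50=2.0, slope=1.5, p=90.0)
    print(f"IC90 = {ic90:.3f}")
    print(f"response at IC90 = {response(ic90, 10.0, 2.0, 1.5):.3f}")
```

Under this assumed model, fitting `response_icp_form` instead of `response` gives the estimate of the chosen inhibitory concentration directly, so it can be compared across products with the same F test as for the IC50.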

Guillem Rigaill

Date: 06/01/2015 at 14.00
Affiliation: Unité de Recherche en Génomique Végétale, INRA – CNRS – Université d’Evry.
Title: Fast tree inference with weighted fusion penalties.
Abstract: Given a data set with many features observed in a large number of conditions, it is desirable to fuse and aggregate conditions which are similar to ease the interpretation and extract the main characteristics of the data. This paper presents a multidimensional fusion penalty framework to address this question when the number of conditions is large. If the fusion penalty is encoded by a norm, we prove for uniform weights that the path of solutions is a tree which is suitable for interpretability. For the $ℓ_{1}$ and $ℓ_{\infty}$ norms, the path is piecewise linear and we derive a homotopy algorithm to recover exactly the whole tree structure. For weighted $ℓ_{1}$-fusion penalties, we demonstrate that distance decreasing weights lead to balanced tree structures. For a subclass of these weights that we call “exponentially adaptive”, we derive an $O (n \log (n))$ homotopy algorithm and we prove an asymptotic oracle property. This guarantees that we recover the underlying structure of the data efficiently both from a statistical and computational point of view. We provide a fast implementation of the homotopy algorithm for the single feature case, as well as an efficient embedded cross-validation procedure that takes advantage of the tree structure of the path of solutions. Our proposal outperforms its competitors on simulations both in terms of timings and prediction accuracy. As an example we consider phenotypic data: given one or several traits, we reconstruct a balanced tree structure and assess its agreement with the known taxonomy.
Reference: Link to the paper on arXiv.
Slides: Link.

Jérémie Kellner

Date: 16/12/2014 at 14.00
Affiliation: Université de Lille 1 & Inria Lille – Nord Europe.
Title: Discussion on “Asymptotics of Graphical Projection Pursuit” (Diaconis & Freedman, 1984).
Reference: download the article.

Alain Celisse

Date: 20/11/2014 at 10.30
Affiliation: Université de Lille 1 & Inria Lille – Nord Europe.
Webpage: Link.
Title: Optimality of cross-validation in density estimation for the $L^{2}$ loss.
Slides: Link.
