Statistical Learning Theory: a Short Course

by Peter Grünwald from the Information-theoretic learning group at the Centrum voor Wiskunde en Informatica (CWI), Netherlands, and Leiden University, Dept. of Mathematics


The general topic is model and variable selection, and prediction, among large sets of complex models. The main method discussed will be MDL (minimum description length) and related information-theoretic methods, but these will also be used to highlight the strengths and weaknesses of other methods, in particular Bayesian, cross-validation and prequential methods (see below).

We will certainly not simply advertise MDL in all settings; in some settings, other methods are preferable. We will furthermore spend significant time on:

  • the inherent conflict between statistical consistency (finding the correct model) and predictive optimality. We will show how this ‘AIC vs BIC’ conflict can be escaped in predictive settings. We will give some practical results in a regression setting (see ‘Catching Up Faster…’ paper, below).
  • what happens when all models are wrong (as is usually the case). There are mathematical results showing that in some cases standard Bayesian inference and MDL may both go badly wrong here, in the sense that the available model ‘closest’ to the truth is not identified, no matter how much data is available. We will also show some computer simulations in which this indeed happens, and discuss what can be done about it.
  • on-line sequential prediction in worst-case settings. Here the goal is to predict essentially as well as the best element of a (potentially huge) set of predictors, without making any probabilistic assumptions at all. Surprisingly, this turns out to be possible in certain settings; it is a thriving topic in theoretical machine learning, and the time is ripe to apply it in practice. We will briefly discuss results by Devaine et al., who used such methods very successfully for on-line prediction of electricity demand at Electricite de France.
  • ‘Occam’s Razor’, in particular its role in Bayesian nonparametrics. MDL can be viewed as a particular formalization of Occam’s Razor, but one has to be very careful with such interpretations. We will dispel certain myths (such as ‘a preference for simple models is an inherent consequence of Bayesian methods’ – it is much more complicated than that; or ‘simplicity is a model bias like any other’ – there are formal ways to define ‘simplicity’ of a model for which this is certainly not true).
  • the (little known but important) ‘prequential’ methods. Prequential model assessment (initiated by Dawid in 1984) is a sort of sequential version of cross-validation. Prequential methods can be used to ‘tie all the philosophies together’: the intuitions behind cross-validation, MDL (using some form of Occam’s Razor), worst-case sequential prediction and Bayesian inference can all be seen to share common ground.
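
As a small illustration of the worst-case prediction and prequential items above, the following Python sketch predicts a binary sequence with a uniform-prior Bayesian mixture over a finite set of predictors and accumulates logarithmic loss one outcome at a time. It is a toy example under stated assumptions, not the method used in any of the papers mentioned: the function name, the three constant "experts" and the data are all hypothetical, and the experts are assumed to assign probabilities strictly between 0 and 1. Under logarithmic loss, the mixture's cumulative loss exceeds that of the best of K predictors by at most ln K on every sequence, with no probabilistic assumptions on the data.

```python
import math


def mixture_log_loss(expert_probs, outcomes):
    """Cumulative log loss (nats) of a uniform-prior mixture and of each expert.

    expert_probs[t][k] is the probability expert k assigns to outcome 1 at
    round t (assumed strictly between 0 and 1); outcomes[t] is the bit seen.
    """
    n_experts = len(expert_probs[0])
    weights = [1.0 / n_experts] * n_experts  # uniform prior over experts
    expert_losses = [0.0] * n_experts
    mixture_loss = 0.0
    for probs, y in zip(expert_probs, outcomes):
        # Probability each expert assigned to the outcome that occurred.
        likes = [p if y == 1 else 1.0 - p for p in probs]
        # The mixture predicts with the weighted average of the experts.
        mix = sum(w * l for w, l in zip(weights, likes))
        mixture_loss += -math.log(mix)
        for k, l in enumerate(likes):
            expert_losses[k] += -math.log(l)
        # Bayesian update: reweight each expert by how well it predicted.
        weights = [w * l / mix for w, l in zip(weights, likes)]
    return mixture_loss, expert_losses


# Hypothetical data: three constant predictors on a short bit sequence.
outcomes = [1, 1, 0, 1, 1, 1, 0, 1]
expert_probs = [[0.2, 0.5, 0.8] for _ in outcomes]
mixture_loss, expert_losses = mixture_log_loss(expert_probs, outcomes)
print("mixture loss:", mixture_loss)
print("best expert loss:", min(expert_losses))
print("bound (best + ln K):", min(expert_losses) + math.log(3))
```

The same running sum of log losses can be read as a prequential score for the set of predictors, or as the length (in nats) of a code for the data, which is one way to see the common ground mentioned above.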





PROVISIONAL PROGRAM (The final program may change in some details)

DAY 1 (June the 3rd, 9am – noon)

2 hours:

  • Information theory, universal coding, sequential prediction with logarithmic loss (these are *absolutely essential* in order to understand Minimum Description Length (MDL) and other information-theoretic model selection and prediction methods).
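
For orientation, the standard correspondence between codes, probabilities and sequential prediction can be written as follows. The notation is an assumption made here for illustration (x^n for a sequence of n outcomes, P for a probability distribution over such sequences, L for a code-length function in bits), and code lengths are idealized by ignoring rounding to integers.

```latex
% Kraft inequality: the lengths of any uniquely decodable binary code satisfy
\[
  \sum_{x^n} 2^{-L(x^n)} \le 1 ,
\]
% so a distribution P corresponds to a code with idealized lengths
\[
  L_P(x^n) = -\log_2 P(x^n) .
\]
% By the chain rule, this total code length equals the cumulative
% logarithmic loss of predicting each outcome from the past:
\[
  -\log_2 P(x^n) = \sum_{i=1}^{n} -\log_2 P(x_i \mid x^{i-1}) .
\]
```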

1 hour:

  • MDL. In particular, I will introduce the four ways of doing MDL, including the little known but important ‘normalized maximum likelihood’ method and ‘prequential model validation’.
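
To make two of these notions concrete, here is a minimal Python sketch for the simplest possible case, the Bernoulli model on binary sequences. It is an illustration under stated assumptions rather than the course's own material: the function names and the data are hypothetical, and the prequential code shown uses the Laplace rule of succession as one simple choice of sequential predictor. The normalized maximum likelihood (NML) code length is the negative maximized log-likelihood plus the logarithm of a normalizer obtained by summing the maximized likelihood over all sequences of the same length.

```python
import math


def ml_log_likelihood(k, n):
    """Maximized Bernoulli log-likelihood (nats) for k ones among n outcomes."""
    ll = 0.0
    if k > 0:
        ll += k * math.log(k / n)
    if k < n:
        ll += (n - k) * math.log((n - k) / n)
    return ll


def nml_code_length(bits):
    """NML code length (nats) of a binary sequence under the Bernoulli model."""
    n, k = len(bits), sum(bits)
    # Sum of maximized likelihoods over all 2^n sequences, grouped by the
    # number of ones j (there are comb(n, j) sequences with j ones).
    normalizer = sum(
        math.comb(n, j) * math.exp(ml_log_likelihood(j, n)) for j in range(n + 1)
    )
    return -ml_log_likelihood(k, n) + math.log(normalizer)


def prequential_code_length(bits):
    """Cumulative log loss (nats) when bit i is predicted from bits before i.

    Assumption: the predictor is the Laplace rule (ones seen + 1) / (seen + 2).
    """
    ones, total = 0, 0.0
    for i, b in enumerate(bits):
        p_one = (ones + 1) / (i + 2)
        total += -math.log(p_one if b == 1 else 1.0 - p_one)
        ones += b
    return total


# Hypothetical data for illustration.
bits = [1, 1, 0, 1, 1, 1, 0, 1]
print("NML code length (nats):", nml_code_length(bits))
print("prequential code length (nats):", prequential_code_length(bits))
```

Both functions return a code length for the data given the model class as a whole, without committing to a single parameter value in advance; comparing such lengths across model classes is the basic move in MDL model selection.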

DAY 2 (June the 4th, 9am – noon)

1/2 hour: short repetition of Day 1

1 1/2 hours:

  • How MDL relates to Bayes and Occam’s Razor esp. in nonparametric situations
  • Dispelling some myths about MDL, Bayes, Cross-Validation
  • Minimax Optimality vs. ‘Correct on Average’; the concept of ‘luckiness’ (central in modern approaches to MDL) and how it is subtly different from ‘subjective prior information’ in Bayesian statistics
  • Sequential On-Line Prediction in worst case settings – the example at EDF (Electricite de France)


  • Application of Information-Theoretic Methods to Classification, Ridge Regression, Lasso-type methods etc.

DAY 3 (June the 6th, 2pm – 5pm)

1 1/2 hours:

  • Learning when All Models are Wrong: what can happen to MDL and Bayesian methods when they are used with models that are substantially wrong (rather terrible things, in fact). How cross-validation avoids such issues to some extent; the crucial role of the loss function of interest (for some loss functions, MDL and Bayes continue to perform very well when the model is wrong; for others, e.g. classification loss, things can go badly wrong).

1 1/2 hours:

  • The Catch-Up Phenomenon: AIC/Cross-Validation vs. BIC/MDL/full Bayes. A new explanation of the conflicts between these methods, and ‘a way out’ (the ‘switching code’); discussion of a fast implementation of the ‘switching code’ based on dynamic programming; some experiments.



  • First lecture, Monday morning, 9am – noon: Kahn 1-2-3
  • Second lecture, Tuesday morning, 9am – noon: Kahn 1-2-3
  • Third lecture, Thursday afternoon, 1pm – 4pm: Euler bleu (CP room)
  • NB: Thursday morning, G. Berry will replay the inaugural lecture from his course at Collège de France.