# Scool Seminar

### 25 February 2022: Memory Saving Strategies for Deep Neural Network Training, by Alena Shilova (Inria Scool).

Artificial Intelligence is a field that has received a lot of attention recently and in particular Deep Learning, its sub-field that groups together machine learning methods based on neural networks. These neural networks have proven to be effective in solving very complex problems in different domains. However, their effectiveness depends on a number of factors: the architecture of the model, its size, how and where the training is performed… Most studies indicate that the large models are more likely to achieve the smallest error, but they are also more difficult to train. The main challenges are related to insufficient computational power and limited memory of the machines: if the model is too large then it can take a long time to be trained (days or even months), or it cannot even fit in memory in the worst case. During the training, it is necessary to store the weights (model parameters), the activations (intermediate computed data) and the optimizer states.
This situation offers several opportunities to deal with memory problems, depending on their origin. Training can be distributed across multiple resources of the computing platform, and different parallelization techniques suggest different ways of dividing memory load. In addition, data structures that remain inactive for a long period of time can be temporarily offloaded to a larger storage space with the possibility of retrieving them later (offloading strategies). Furthermore, activations that are computed anew at each iteration can be deleted and recomputed several times within it (rematerialization strategies). Memory saving strategies usually induce a time overhead with respect to the direct execution, therefore optimization problems should be considered to choose the best approach for each strategy. In this work, we formulate and analyze optimization problems in relation to various methods reducing memory consumption of the training process. In particular, we focus on rematerialization and activation offloading, for both of them we design optimal solutions under a set of assumptions. Finally, we propose a fully functional tool called rotor that combines activation offloading and rematerialization and that can be applied to training in PyTorch, allowing to process big models that otherwise would not fit into memory.

### 18 February 2022: Robustness via distributional dynamic programming, by Mastane Achab (Universitat Pompeu Fabra).

In dynamic programming (DP) and reinforcement learning (RL), an agent learns to act optimally in terms of expected long-term return by sequentially interacting with its environment modeled by a Markov decision process (MDP). More generally in distributional reinforcement learning (DRL), the focus is on the whole distribution of the return, not just its expectation. Although DRL-based methods produced state-of-the-art performance in RL with function approximation, they involve additional quantities (compared to the non-distributional setting) that are still not well understood. As a first contribution, we introduce a new class of distributional operators, together with a practical DP algorithm for policy evaluation, that come with a robust MDP interpretation. Indeed, our approach reformulates through an augmented state space where each state is split into a worst-case substate and a best-case substate, whose values are maximized by safe and risky policies respectively. Finally, we derive distributional operators and DP algorithms solving a new control task: How to distinguish safe from risky optimal actions in order to break ties in the space of optimal policies?

### 11 February 2022: Routing in an Uncertain World: Adaptivity, Efficiency, and Equilibrium, by Dong Quan Vu (Grenoble Computer Science Laboratory – LIGLAB).

We consider the traffic assignment problem in non-atomic routing games where the players’ cost functions may be subject to random fluctuations (e.g., weather disturbances, perturbations in the underlying network, etc.). We tackle this problem from the viewpoint of a control interface that makes routing recommendations based solely on observed costs and without any further knowledge of the system’s governing dynamics – such as the network’s cost functions, the distribution of any random events affecting the network, etc. In this online setting, learning methods based on the popular exponential weights
algorithm converge to equilibrium at an O(1/sqrt{T}) rate: this rate is known to be order-optimal in stochastic networks, but it is otherwise suboptimal in static networks. In the latter case, it is possible to achieve an O(1/T^2) equilibrium convergence rate via the use of finely tuned accelerated algorithms; on the other hand, these accelerated algorithms fail to converge altogether in the presence of persistent randomness, so it is not clear how to achieve the “best of both worlds” in terms of convergence speed. Our paper seeks to fill this gap by proposing an adaptive routing algorithm with the following desirable properties: (i) it seamlessly interpolates between the O(1/T^2) and O(1/sqrt{T}) rates for static and stochastic environments respectively; (ii) its convergence speed is polylogarithmic in the number of paths in the network; (iii) the method’s per-iteration complexity and memory requirements are both linear in the number of nodes and edges in the network; and (iv) it does not require any prior knowledge of the problem’s parameters.

### 4 February 2022: A Unified Perspective on Value Backup and Exploration in Monte Carlo Tree Search, by Tuan Dam

Monte Carlo Tree Search (MCTS) is a class of methods for solving complex decision-making problems through the synergy of Monte-Carlo planning and Reinforcement Learning (RL). The highly combinatorial nature of the problems commonly addressed by MCTS requires the use of efficient exploration strategies for navigating the planning tree, and quickly convergent value backup methods. These crucial problems are particularly evident in recent advances that combine MCTS with deep neural networks for function approximation. In this work, we propose two methods for improving the convergence rate and exploration based on a newly introduced backup operator, and entropy regularization. We provide strong theoretical guarantees to bound convergence rate, approximation error, and regret, of our methods. Moreover, we introduce a mathematical framework, based on the use of the $\alpha$-divergence for backup and exploration in MCTS. We show that this theoretical formulation unifies different approaches, including our newly introduced ones, under the same mathematical framework, allowing to obtain different methods by simply changing the value of $\alpha$. In practice, our unified perspective offers a flexible way to balance between exploration and exploitation by simply tuning the single $\alpha$ parameter according to the problem at hand. We validate our methods through a rigorous empirical study from basic toy problems to the complex Atari games, and including both MDP and POMDP problems.

### 28 January 2022: A/B/n Testing with Control in the Presence of Subpopulations, by Wooter M. Koolen (Centrum Wiskunde & Informatica, Amsterdam).

A key task in reinforcement learning is best policy identification. We study a version of this problem in structured bandits (single state MDPs), and investigate the sample complexity as a function of the action and feedback models. Concretely, we revisit the question: “which versions of my website are better than the control”. We address this question in the active sequential testing framework, where we explicitly model the presence of subpopulations among the users. For three modes of interaction depending on what is observed about these subpopulations, we characterize the optimal sample complexity, develop matching algorithms and validate the methods experimentally.

### 21 January 2022: From Lyapunov to model free reinforcement learning in games, by Julien Perolat (Deepmind).

Over the past decade, optimisation of neural networks has been the building block of many if not most progress in Deep Learning and Reinforcement Learning at scale. I will argue that one needs to go beyond optimisation to solve many crucial problems in machine learning such as learning Nash equilibrium in imperfect information games or mean field games. To do so I will show how to use the Lyapunov method to analyse theoretically and propose practical algorithms to learn in many different classes of games.

### 14 January 2022: Verification and Explanation of Fairness in Machine Learning, by Bishwamittra Ghosh (National University of Singapore)

In recent years, machine learning (ML) algorithms have been deployed in safety-critical and high-stake decision-making, where the fairness of algorithms is of paramount importance. Fairness in ML centers on detecting bias towards certain demographic populations induced by an ML classifier and proposes algorithmic solutions to mitigate the bias with respect to different fairness definitions.

In this talk, we are presenting two problems in fairness in machine learning: verification and explanation. A fairness verifier computes the bias in the prediction of an ML classifier — essentially beyond a finite dataset — given the probability distribution of input features. To this end, we discuss two solutions to the verification problem, one based on stochastic satisfiability (SSAT) and another based on the stochastic subset-sum problem (S3P). Our SSAT based framework can verify any classifier represented as a Boolean formula. In contrast, our S3P based framework provides a scalable solution for linear classifiers. Both solutions overcome challenges in earlier SMT and sampling-based verifiers by verifying multiple sensitive features while considering correlated features represented as a Bayesian network.

Our current research is on explaining the unfairness of a classifier by attributing unfairness contribution among input features and their interactions. To this end, we discuss a model-agnostic solution built on global sensitivity analysis and variance decomposition. The resulting explanation framework can identify influential features and feature interactions that potentially cause the unfairness of the classifier.

### 7 January 2022: Fixed-confidence top-m identification for structured models, by Clémence Réda (Inria Scool)

We study the problem of the identification of m arms with largest means under a fixed error rate delta, for linear and misspecified linear bandit models. Linear models are popular due to their simplicity and the existence of efficient algorithms for recommendation, but, in practice, data often deviates from linearity. We first present a practical family of algorithms to deal with linear models; then, we show a tractable lower bound on the sample complexity of any delta-correct algorithm for the general Top-m identification problem. For misspecified linear models, we prove that knowing the scale of the deviation from linearity is necessary to exploit the structure of the problem. We then describe the first algorithm for this setting, which is practical and matches the lower bound when the error rate delta tends to zero. Finally, we evaluate this algorithm on real-world data, with an application to drug repurposing for chemoresistant epilepsy.

References:

### 17 December 2021: Concentration of exponential families with several parameters, by Rémy Degenne (Inria Scool)

Motivated by bandit applications, we investigate concentration of the maximum likelihood estimator in exponential families. The multi-armed bandit algorithm KL-UCB has been shown to have optimal asymptotic regret in a large class of bandit problems. An essential ingredient of the regret analysis is a bound on the probability of deviation of an estimated parameter of the arm distribution from its true value, measured by a Kullback-Leibler divergence. Similar concentration inequalities are useful in bandit identification, in particular for the design of stopping rules. Such results can be obtained in several cases, for example if the arm distributions belong to a one-parameter exponential family or if the set of possible laws contains all the distributions with support in a known bounded interval. It is also possible to derive bounds for multi-parameter exponential families. The talk will focus on that last case and present techniques related to the method of mixtures and PAC-Bayes bounds.

### 26 November 2021: From Optimality to Robustness: Dirichlet Randomized Exploration in Stochastic Bandits, by Patrick Saux (Inria Scool)

The stochastic multi-arm bandit problem has been extensively studied under standard assumptions on the arm’s distribution (e.g bounded with known support, exponential family, etc). These assumptions are suitable for many real-world problems but sometimes they require knowledge (on tails for instance) that may not be precisely accessible to the practitioner, raising the question of the robustness of bandit algorithms to model misspecification. In this paper we study a generic Dirichlet Sampling (DS) algorithm, based on pairwise comparisons of empirical indices computed with re-sampling of the arms’ observations and a data-dependent exploration bonus. We show that different variants of this strategy achieve provably optimal regret guarantees when the distributions are bounded and logarithmic regret for semi-bounded distributions with a mild quantile condition. We also show that a simple tuning achieve robustness with respect to a large class of unbounded distributions, at the cost of slightly worse than logarithmic asymptotic regret. We finally provide numerical experiments showing the merits of DS in a decision-making problem on synthetic agriculture data.

### 19 November 2021: Choosing Answers in ε-Best-Answer Identification for Linear Bandits, by Marc Jourdan (Inria Scool)

In pure-exploration problems, information is gathered sequentially to answer a question on the stochastic environment. While best-arm identification for linear bandits has been extensively studied in recent years, few works have been dedicated to identifying one arm that is close to the best one (and not exactly the best one). In this problem with several correct answers, an identification algorithm should focus on one candidate among those answers and verify that it is correct. We demonstrate that picking the answer with highest mean does not allow an algorithm to reach asymptotic optimality in terms of expected sample complexity. Instead, a furthest answer should be identified. Using that insight to choose the candidate answer, we propose an asymptotically optimal algorithm for best-answer identification in transductive linear stochastic bandits. Finally, our algorithm is shown to achieve competitive empirical performance against existing best-arm identification algorithms.

### 5 November 2021: Testing by betting, by Rianne de Heide (Inria Scool)

A large fraction of published research in top journals in applied science such as medicine and psychology is irreproducible. In light of this ’replicability crisis’, classical p-value based hypothesis testing has come under intense scrutiny. One of the many problems is the following: suppose a test result is promising but inconclusive (p = 0.07), simply deciding to gather a few more data points, a practice that is ubiquitous in science, invalidates p-values and their error guarantees. We propose an alternative hypothesis testing methodology based on e-values and test martingales. This ‘safe testing’ or ’testing by betting’ method allows us to consider additional data and effortlessly combine results from different tests, while preserving error guarantees. In this talk I will illustrate testing by betting by means of a modified famous historical example: a lady tasting tea, and I will point out the mathematical and interpretational advantages of this new framework for hypothesis testing.

### 29 October 2021: Exploiting Structure in State-Action Sequences: Reversibility, Redundancy, by Nathan Grinsztajn (Inria Scool)

The talk will be about the two following papers:
•  There Is No Turning Back: A Self-Supervised Approach for Reversibility-Aware Reinforcement Learning.
We propose to learn to distinguish reversible from irreversible actions for better informed decision-making in Reinforcement Learning (RL). From theoretical considerations, we show that approximate reversibility can be learned through a simple surrogate task: ranking randomly sampled trajectory events in chronological order. Intuitively, pairs of events that are always observed in the same order are likely to be separated by an irreversible sequence of actions. Conveniently, learning the temporal order of events can be done in a fully self-supervised way, which we use to estimate the reversibility of actions from experience, without any priors. We propose two different strategies that incorporate reversibility in RL agents, one strategy for exploration (RAE) and one strategy for control (RAC). We demonstrate the potential of reversibility-aware agents in several environments, including the challenging Sokoban game. In synthetic tasks, we show that we can learn control policies that never fail and reduce to zero the side-effects of interactions, even without access to the reward function.
• More Efficient Exploration with Symbolic Priors on Action Sequence Equivalences.
Incorporating prior knowledge in reinforcement learning algorithms is mainly an open question. Even when insights about the environment dynamics are available, reinforcement learning is traditionally used in a tabula rasa setting and must explore and learn everything from scratch. In this paper, we consider the problem of exploiting priors about action sequence equivalence: that is, when different sequences of actions produce the same effect. We propose a new local exploration strategy calibrated to minimize collisions and maximize new state visitations. We show that this strategy can be computed at little cost, by solving a convex optimization problem. By replacing the usual epsilon-greedy strategy in a DQN, we demonstrate its potential in several environments with various dynamic structures.

### 22 October 2021: Online Sign Identification: Minimization of the Number of Errors in Thresholding Bandits, by Reda Ouhamma (Inria Scool)

In the fixed budget thresholding bandit problem, an algorithm sequentially allocates a budgeted number of samples to different distributions. It then predicts whether the mean of each distribution is larger or lower than a given threshold. We introduce a large family of algorithms (containing most existing relevant ones), inspired by the Frank-Wolfe algorithm, and provide a thorough yet generic analysis of their performance. This allowed us to construct new explicit algorithms, for a broad class of problems, whose losses are within a small constant factor of the non-adaptive oracle ones. Quite interestingly, we observed that adaptive methods empirically greatly out-perform non-adaptive oracles, an uncommon behavior in standard online learning settings, such as regret minimization. We explain this surprising phenomenon on an insightful toy problem.

### 1 October 2021: rlberry: A Reinforcement Learning Library for Research and Education, by Omar Darwiche Domingues (Inria Scool)

Writing reinforcement learning algorithms is fun! But after the fun, we have lots of boring things to implement: run our agents in parallel, average and plot results, optimize hyperparameters, compare to baselines coming from different codebases, create tricky environments etc. rlberry is a Python library that helps you to do all that with a few lines of code. In this presentation, I will talk about the motivation behind rlberry, show some examples of how it might be useful, present some features that are under development, and ask your feedback about what you’d like to have in the library, or what would be useful for your research.

### 24 September 2021: On the efficiency of subsampling algorithms for exploration in bandit, by Dorian Baudry (Inria Scool)

In this talk we consider a family of bandit algorithms called “Subsampling Dueling Algorithms” (SDA). Unlike Thompson Sampling which requires to specify a different prior to be optimal for specific problems, SDA does not need any distribution-dependent tuning: it simply picks arms by comparing subsamples of the same size from their history. We first show that a randomized sampling scheme, Random Block SDA (RB-SDA), is asymptotically optimal when the arms’ distribution come from the same single-parameter exponential family (SPEF) without specifying which one. One drawback however of these approaches is the additional complexity required by random subsampling and the storage of the full history of rewards. Our next contribution is to show that a simple deterministic subsampling rule, ”last-block subsampling”, attains the same results. In addition, we prove that these guarantees also hold when limiting the algorithm memory to a polylogarithmic function of the time horizon. These findings open up new perspectives, in particular for non-stationary scenarios in which the arm distributions evolve over time. We propose a variant of the algorithm in which only the most recent observations are used for subsampling, achieving optimal regret guarantees under the assumption of a known number of abrupt changes. We finally discuss the potential of this family of algorithms for future works.

References: