Publications

Below are selected publications from the team in 2022 and 2023, followed by recent publications of the team members indexed in HAL.

  • Azize, A. and Basu, D. (2023). Interactive and concentrated differential privacy for bandits. In Sixteenth European Workshop on Reinforcement Learning.
  • Azize, A., Jourdan, M., Al Marjani, A., and Basu, D. (2023). On the complexity of differentially private best-arm identification with fixed confidence. In Thirty-seventh Conference on Neural Information Processing Systems.
  • Honda, J., Ito, S., and Tsuchiya, T. (2023). Follow-the-Perturbed-Leader Achieves Best-of-Both-Worlds for Bandit Problems. In Proceedings of The 34th International Conference on Algorithmic Learning Theory, volume 201 of Proceedings of Machine Learning Research, pages 726–754. PMLR.
  • Jourdan, M., Degenne, R., and Kaufmann, E. (2023a). An ϵ-best-arm identification algorithm for fixed-confidence and beyond. In Advances in Neural Information Processing Systems (NeurIPS).
  • Jourdan, M., Degenne, R., and Kaufmann, E. (2023b). Dealing with unknown variances in best-arm identification. In Algorithmic Learning Theory (ALT).
  • Komiyama, J., Tsuchiya, T., and Honda, J. (2022). Minimax optimal algorithms for fixed-budget best arm identification. In NeurIPS.
  • Kone, C., Kaufmann, E., and Richert, L. (2023). Adaptive algorithms for relaxed Pareto set identification. In Advances in Neural Information Processing Systems (NeurIPS).
  • Lee, J., Honda, J., Chiang, C.-K., and Sugiyama, M. (2023a). Optimality of Thompson sampling with noninformative priors for Pareto bandits. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J., editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 18810–18851.
  • Lee, J., Honda, J., and Sugiyama, M. (2023b). Optimality of Thompson sampling with noninformative priors for Pareto bandits. In Proceedings of the 15th Asian Conference on Machine Learning, Proceedings of Machine Learning Research. To appear.
  • Matsuura, K., Sakamaki, K., Honda, J., and Sozu, T. (2023). Optimal dose escalation methods using deep reinforcement learning in phase I oncology trials. Journal of Biopharmaceutical Statistics, pages 1–14.
  • Pesquerel, F. and Maillard, O.-A. (2022). IMED-RL: Regret optimal learning of ergodic Markov decision processes. In NeurIPS 2022 – Thirty-sixth Conference on Neural Information Processing Systems, New Orleans, United States.
  • Saber, H., Pesquerel, F., Maillard, O.-A., and Talebi, M. S. (2023). Logarithmic regret in communicating MDPs: Leveraging known dynamics with bandits. In Asian Conference on Machine Learning, Istanbul, Turkey.
  • Tsuchiya, T., Ito, S., and Honda, J. (2023). Best-of-both-worlds algorithms for partial monitoring. In Agrawal, S. and Orabona, F., editors, Proceedings of The 34th International Conference on Algorithmic Learning Theory, volume 201 of Proceedings of Machine Learning Research, pages 1484–1515. PMLR.

HAL publications of Odalric-Ambrym Maillard, Emilie Kaufmann, and Debabrota Basu

2024

Poster communications

titre
Évaluation de critères de sélection de noyaux pour la régression Ridge à noyau dans un contexte de petits jeux de données
auteur
Frédérick Fabre Ferber, Dominique Gay, Jean-Christophe Soulié, Jean Diatta, Odalric-Ambrym Maillard
article
24ème conférence francophone sur l’Extraction et la Gestion des Connaissances EGC 2024, Jan 2024, Dijon, France. RNTI E-40
Accès au texte intégral et bibtex
https://hal.univ-reunion.fr/hal-04516719/file/1002940.pdf BibTex

2023

Conference papers

titre
An ε-Best-Arm Identification Algorithm for Fixed-Confidence and Beyond
auteur
Marc Jourdan, Rémy Degenne, Emilie Kaufmann
article
Advances in Neural Information Processing Systems (NeurIPS), Dec 2023, New Orleans, United States
resume
We propose EB-TCε, a novel sampling rule for ε-best arm identification in stochastic bandits. It is the first instance of a Top Two algorithm analyzed for approximate best arm identification. EB-TCε is an anytime sampling rule that can therefore be employed without modification for fixed confidence or fixed budget identification (without prior knowledge of the budget). We provide three types of theoretical guarantees for EB-TCε. First, we prove bounds on its expected sample complexity in the fixed confidence setting, notably showing its asymptotic optimality in combination with an adaptive tuning of its exploration parameter. We complement these findings with upper bounds on its probability of error at any time and for any error parameter, which further yield upper bounds on its simple regret at any time. Finally, we show through numerical simulations that EB-TCε performs favorably compared to existing algorithms, in different settings.
Accès au texte intégral et bibtex
https://hal.science/hal-04306214/file/TTeBAI.pdf BibTex
titre
Adaptive Algorithms for Relaxed Pareto Set Identification
auteur
Cyrille Kone, Emilie Kaufmann, Laura Richert
article
NeurIPS 2023 – 37th Conference on Neural Information Processing Systems, Dec 2023, New Orleans, LA, United States
resume
In this paper we revisit the fixed-confidence identification of the Pareto optimal set in a multi-objective multi-armed bandit model. As the sample complexity to identify the exact Pareto set can be very large, a relaxation allowing to output some additional near-optimal arms has been studied. In this work we also tackle alternative relaxations that allow instead to identify a relevant subset of the Pareto set. Notably, we propose a single sampling strategy, called Adaptive Pareto Exploration, that can be used in conjunction with different stopping rules to take into account different relaxations of the Pareto Set Identification problem. We analyze the sample complexity of these different combinations, quantifying in particular the reduction in sample complexity that occurs when one seeks to identify at most k Pareto optimal arms. We showcase the good practical performance of Adaptive Pareto Exploration on a real-world scenario, in which we adaptively explore several vaccination strategies against Covid-19 in order to find the optimal ones when multiple immunogenicity criteria are taken into account.
Accès au texte intégral et bibtex
https://hal.science/hal-04306210/file/KKR23.pdf BibTex
titre
Fast Asymptotically Optimal Algorithms for Non-Parametric Stochastic Bandits
auteur
Dorian Baudry, Fabien Pesquerel, Rémy Degenne, Odalric-Ambrym Maillard
article
NeurIPS 2023 – Thirty-seventh Conference on Neural Information Processing Systems, Dec 2023, New Orleans (Louisiana), United States
resume
We consider the problem of regret minimization in non-parametric stochastic bandits. When the rewards are known to be bounded from above, there exist asymptotically optimal algorithms, with asymptotic regret depending on an infimum of Kullback-Leibler divergences (KL). These algorithms are computationally expensive and require storing all past rewards, thus simpler but non-optimal algorithms are often used instead. We introduce several methods to approximate the infimum KL which drastically reduce the computational and memory costs of existing optimal algorithms, while keeping their regret guarantees. We apply our findings to design new variants of the MED and IMED algorithms, and demonstrate their interest with extensive numerical simulations.
Accès au texte intégral et bibtex
https://inria.hal.science/hal-04337742/file/5313_fast_asymptotically_optimal_al.pdf BibTex
titre
On the Complexity of Differentially Private Best-Arm Identification with Fixed Confidence
auteur
Achraf Azize, Marc Jourdan, Aymen Al Marjani, Debabrota Basu
article
NeurIPS 2023 – Conference on Neural Information Processing Systems, Dec 2023, New Orleans, United States
resume
Best Arm Identification (BAI) problems are progressively used for data-sensitive applications, such as designing adaptive clinical trials, tuning hyper-parameters, and conducting user studies to name a few. Motivated by the data privacy concerns invoked by these applications, we study the problem of BAI with fixed confidence under $\epsilon$-global Differential Privacy (DP). First, to quantify the cost of privacy, we derive a lower bound on the sample complexity of any $\delta$-correct BAI algorithm satisfying $\epsilon$-global DP. Our lower bound suggests the existence of two privacy regimes depending on the privacy budget $\epsilon$. In the high-privacy regime (small $\epsilon$), the hardness depends on a coupled effect of privacy and a novel information-theoretic quantity, called the Total Variation Characteristic Time. In the low-privacy regime (large $\epsilon$), the sample complexity lower bound reduces to the classical non-private lower bound. Second, we propose AdaP-TT, an $\epsilon$-global DP variant of the Top Two algorithm. AdaP-TT runs in arm-dependent adaptive episodes and adds Laplace noise to ensure a good privacy-utility trade-off. We derive an asymptotic upper bound on the sample complexity of AdaP-TT that matches with the lower bound up to multiplicative constants in the high-privacy regime. Finally, we provide an experimental analysis of AdaP-TT that validates our theoretical results.
Accès au bibtex
https://arxiv.org/pdf/2309.02202 BibTex
titre
Marich: A Query-efficient Distributionally Equivalent Model Extraction Attack using Public Data
auteur
Pratik Karmakar, Debabrota Basu
article
Advances in Neural Information Processing Systems (NeurIPS), Dec 2023, New Orleans, United States
resume
We study design of black-box model extraction attacks that can send a minimal number of queries from a publicly available dataset to a target ML model through a predictive API with an aim to create an informative and distributionally equivalent replica of the target. First, we define distributionally equivalent and Max-Information model extraction attacks, and reduce them into a variational optimisation problem. The attacker sequentially solves this optimisation problem to select the most informative queries that simultaneously maximise the entropy and reduce the mismatch between the target and the stolen models. This leads to an active sampling-based query selection algorithm, Marich, which is model-oblivious. Then, we evaluate Marich on different text and image data sets, and different models, including CNNs and BERT. Marich extracts models that achieve $\sim 60-95\%$ of the true model’s accuracy and uses $\sim 1,000 – 8,500$ queries from the publicly available datasets, which are different from the private training datasets. Models extracted by Marich yield prediction distributions, which are $\sim 2-4\times$ closer to the target’s distribution in comparison to the existing active sampling-based attacks. The extracted models also reach up to $96\%$ accuracy under membership inference attacks. Experimental results validate that Marich is query-efficient, and capable of performing task-accurate, high-fidelity, and informative model extraction.
Accès au bibtex
https://arxiv.org/pdf/2302.08466 BibTex
titre
Logarithmic regret in communicating MDPs: Leveraging known dynamics with bandits
auteur
Hassan Saber, Fabien Pesquerel, Odalric-Ambrym Maillard, Mohammad Sadegh Talebi
article
Asian Conference on Machine Learning, Nov 2023, Istanbul, Turkey
resume
We study regret minimization in an average-reward and communicating Markov Decision Process (MDP) with known dynamics, but unknown reward function. Although learning in such MDPs is a priori easier than in fully unknown ones, they are still largely challenging as they include as special cases large classes of problems such as combinatorial semi-bandits. Leveraging the knowledge on transition function in regret minimization, in a statistically efficient way, appears largely unexplored. As it is conjectured that achieving exact optimality in generic MDPs is NP-hard, even with known transitions, we focus on a computationally efficient relaxation, at the cost of achieving order-optimal logarithmic regret instead of exact optimality. We contribute to filling this gap by introducing a novel algorithm based on the popular Indexed Minimum Empirical Divergence strategy for bandits. A key component of the proposed algorithm is a carefully designed stopping criterion leveraging the recurrent classes induced by stationary policies. We derive a nonasymptotic, problem-dependent, and logarithmic regret bound for this algorithm, which relies on a novel regret decomposition leveraging the structure. We further provide an efficient implementation and experiments illustrating its promising empirical performance.
Accès au texte intégral et bibtex
https://hal.science/hal-04241513/file/acml_2023_fabien.pdf BibTex
titre
Pure Exploration in Bandits with Linear Constraints
auteur
Emil Carlsson, Debabrota Basu, Fredrik D. Johansson, Devdatt Dubhashi
article
EWRL 2023 – European Workshop on Reinforcement Learning, Sep 2023, Brussels, Belgium
resume
We address the problem of identifying the optimal policy with a fixed confidence level in a multi-armed bandit setup, when the arms are subject to linear constraints. Unlike the standard best-arm identification problem which is well studied, the optimal policy in this case may not be deterministic and could mix between several arms. This changes the geometry of the problem which we characterize via an information-theoretic lower bound. We introduce two asymptotically optimal algorithms for this setting, one based on the Track-and-Stop method and the other based on a game-theoretic approach. Both these algorithms try to track an optimal allocation based on the lower bound and computed by a weighted projection onto the boundary of a normal cone. Finally, we provide empirical results that validate our bounds and visualize how constraints change the hardness of the problem.
Accès au bibtex
https://arxiv.org/pdf/2306.12774 BibTex
titre
Interactive and Concentrated Differential Privacy for Bandits
auteur
Achraf Azize, Debabrota Basu
article
EWRL 2023 – European Workshop on Reinforcement Learning, Sep 2023, Brussels, Belgium
resume
Bandits play a crucial role in interactive learning schemes and modern recommender systems. However, these systems often rely on sensitive user data, making privacy a critical concern. This paper investigates privacy in bandits with a trusted centralized decision-maker through the lens of interactive Differential Privacy (DP). While bandits under pure $\epsilon$-global DP have been well-studied, we contribute to the understanding of bandits under zero Concentrated DP (zCDP). We provide minimax and problem-dependent lower bounds on regret for finite-armed and linear bandits, which quantify the cost of $\rho$-global zCDP in these settings. These lower bounds reveal two hardness regimes based on the privacy budget $\rho$ and suggest that $\rho$-global zCDP incurs less regret than pure $\epsilon$-global DP. We propose two $\rho$-global zCDP bandit algorithms, AdaC-UCB and AdaC-GOPE, for finite-armed and linear bandits respectively. Both algorithms use a common recipe of Gaussian mechanism and adaptive episodes. We analyze the regret of these algorithms to show that AdaC-UCB achieves the problem-dependent regret lower bound up to multiplicative constants, while AdaC-GOPE achieves the minimax regret lower bound up to poly-logarithmic factors. Finally, we provide experimental validation of our theoretical results under different settings.
Accès au bibtex
https://arxiv.org/pdf/2309.00557 BibTex
titre
Active Coverage for PAC Reinforcement Learning
auteur
Aymen Al-Marjani, Andrea Tirinzoni, Emilie Kaufmann
article
Conference on Learning Theory 2023, Jul 2023, Bangalore, India
resume
Collecting and leveraging data with good coverage properties plays a crucial role in different aspects of reinforcement learning (RL), including reward-free exploration and offline learning. However, the notion of “good coverage” really depends on the application at hand, as data suitable for one context may not be so for another. In this paper, we formalize the problem of active coverage in episodic Markov decision processes (MDPs), where the goal is to interact with the environment so as to fulfill given sampling requirements. This framework is sufficiently flexible to specify any desired coverage property, making it applicable to any problem that involves online exploration. Our main contribution is an instance-dependent lower bound on the sample complexity of active coverage and a simple game-theoretic algorithm, COVGAME, that nearly matches it. We then show that COVGAME can be used as a building block to solve different PAC RL tasks. In particular, we obtain a simple algorithm for PAC reward-free exploration with an instance-dependent sample complexity that, in certain MDPs which are “easy to explore”, is lower than the minimax one. By further coupling this exploration algorithm with a new technique to do implicit eliminations in policy space, we obtain a computationally-efficient algorithm for best-policy identification whose instance-dependent sample complexity scales with gaps between policy values.
Accès au texte intégral et bibtex
https://hal.science/hal-04215441/file/COLT23.pdf BibTex
titre
Bregman Deviations of Generic Exponential Families
auteur
Sayak Ray Chowdhury, Patrick Saux, Odalric-Ambrym Maillard, Aditya Gopalan
article
Conference On Learning Theory (COLT), Jul 2023, Bangalore, India
resume
We revisit the method of mixtures technique, also known as the Laplace method, to study the concentration phenomenon in generic exponential families. Combining the properties of the Bregman divergence associated with the log-partition function of the family with the method of mixtures for super-martingales, we establish a generic bound controlling the Bregman divergence between the parameter of the family and a finite sample estimate of the parameter. Our bound is time-uniform and involves a quantity extending the classical information gain to exponential families, which we call the Bregman information gain. For the practitioner, we instantiate this novel bound to several classical families, e.g., Gaussian, Bernoulli, Exponential, Weibull, Pareto, Poisson and Chi-square, yielding explicit forms of the confidence sets and the Bregman information gain. We further numerically compare the resulting confidence bounds to state-of-the-art alternatives for time-uniform concentration and show that this novel method yields competitive results. Finally, we highlight the benefit of our concentration bounds on some illustrative applications.
Accès au texte intégral et bibtex
https://hal.science/hal-04161043/file/Bregman-Deviations-Laplace%2Fmain.pdf BibTex
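For reference, the Bregman divergence associated with the log-partition function L of an exponential family, which the bound above controls, is the standard textbook quantity (stated here as background, not as a result of the paper):

    $\mathcal{B}_{L}(\theta, \theta') = L(\theta) - L(\theta') - \langle \nabla L(\theta'),\, \theta - \theta' \rangle.$

For a Gaussian family with known variance it reduces, up to scaling, to the squared distance between mean parameters.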
titre
From Noisy Fixed-Point Iterations to Private ADMM for Centralized and Federated Learning
auteur
Edwige Cyffers, Aurélien Bellet, Debabrota Basu
article
Proceedings of the 40th International Conference on Machine Learning (ICML), Jul 2023, Honolulu, United States
resume
We study differentially private (DP) machine learning algorithms as instances of noisy fixed-point iterations, in order to derive privacy and utility results from this well-studied framework. We show that this new perspective recovers popular private gradient-based methods like DP-SGD and provides a principled way to design and analyze new private optimization algorithms in a flexible manner. Focusing on the widely-used Alternating Directions Method of Multipliers (ADMM) method, we use our general framework to derive novel private ADMM algorithms for centralized, federated and fully decentralized learning. For these three algorithms, we establish strong privacy guarantees leveraging privacy amplification by iteration and by subsampling. Finally, we provide utility guarantees using a unified analysis that exploits a recent linear convergence result for noisy fixed-point iterations.
Accès au bibtex
https://arxiv.org/pdf/2302.12559 BibTex
titre
“How Biased are Your Features?”: Computing Fairness Influence Functions with Global Sensitivity Analysis
auteur
Bishwamittra Ghosh, Debabrota Basu, Kuldeep Meel
article
FAccT ’23: the 2023 ACM Conference on Fairness, Accountability, and Transparency, Jun 2023, Chicago IL, United States. pp.138-148, ⟨10.1145/3593013.3593983⟩
resume
Fairness in machine learning has attained significant focus due to the widespread application in high-stake decision-making tasks. Unregulated machine learning classifiers can exhibit bias towards certain demographic groups in data, thus the quantification and mitigation of classifier bias is a central concern in fairness in machine learning. In this paper, we aim to quantify the influence of different features in a dataset on the bias of a classifier. To do this, we introduce the Fairness Influence Function (FIF). This function breaks down bias into its components among individual features and the intersection of multiple features. The key idea is to represent existing group fairness metrics as the difference of the scaled conditional variances in the classifier’s prediction and apply a decomposition of variance according to global sensitivity analysis. To estimate FIFs, we instantiate an algorithm that applies variance decomposition of the classifier’s prediction following local regression. Experiments demonstrate that the resulting algorithm captures FIFs of individual and intersectional features, provides a better approximation of bias based on FIFs, demonstrates higher correlation of FIFs with fairness interventions, and detects changes in bias due to fairness affirmative/punitive actions in the classifier. The code is available at https://github.com/ReAILe/bias-explainer. The extended version of the paper is at https://arxiv.org/pdf/2206.00667.pdf.
Accès au bibtex
https://arxiv.org/pdf/2206.00667 BibTex
titre
Risk-aware linear bandits with convex loss
auteur
Patrick Saux, Odalric-Ambrym Maillard
article
International Conference on Artificial Intelligence and Statistics (AISTATS), Apr 2023, Valencia, Spain
resume
In decision-making problems such as the multi-armed bandit, an agent learns sequentially by optimizing a certain feedback. While the mean reward criterion has been extensively studied, other measures that reflect an aversion to adverse outcomes, such as mean-variance or conditional value-at-risk (CVaR), can be of interest for critical applications (healthcare, agriculture). Algorithms have been proposed for such risk-aware measures under bandit feedback without contextual information. In this work, we study contextual bandits where such risk measures can be elicited as linear functions of the contexts through the minimization of a convex loss. A typical example that fits within this framework is the expectile measure, which is obtained as the solution of an asymmetric least-square problem. Using the method of mixtures for supermartingales, we derive confidence sequences for the estimation of such risk measures. We then propose an optimistic UCB algorithm to learn optimal risk-aware actions, with regret guarantees similar to those of generalized linear bandits. This approach requires solving a convex problem at each round of the algorithm, which we can relax by allowing only approximated solution obtained by online gradient descent, at the cost of slightly higher regret. We conclude by evaluating the resulting algorithms on numerical experiments.
Accès au texte intégral et bibtex
https://hal.science/hal-04044440/file/main.pdf BibTex
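As a concrete instance of the asymmetric least-squares problem mentioned in the abstract above, the expectile of level τ of a real-valued random variable X is the standard quantity (textbook definition, not the paper's own notation):

    $e_\tau(X) = \arg\min_{\theta \in \mathbb{R}} \ \mathbb{E}\big[\, |\tau - \mathbf{1}\{X \le \theta\}| \,(X - \theta)^2 \,\big],$

which recovers the mean for τ = 1/2; in the contextual setting of the paper, the scalar θ is replaced by a linear function of the context.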
titre
Dealing with Unknown Variances in Best-Arm Identification
auteur
Marc Jourdan, Rémy Degenne, Emilie Kaufmann
article
Algorithmic Learning Theory (ALT), Feb 2023, Singapore
resume
The problem of identifying the best arm among a collection of items having Gaussian rewards distribution is well understood when the variances are known. Despite its practical relevance for many applications, few works studied it for unknown variances. In this paper we introduce and analyze two approaches to deal with unknown variances, either by plugging in the empirical variance or by adapting the transportation costs. In order to calibrate our two stopping rules, we derive new time-uniform concentration inequalities, which are of independent interest. Then, we illustrate the theoretical and empirical performances of our two sampling rule wrappers on Track-and-Stop and on a Top Two algorithm. Moreover, by quantifying the impact on the sample complexity of not knowing the variances, we reveal that it is rather small.
Accès au texte intégral et bibtex
https://hal.science/hal-04306221/file/BAIUV.pdf BibTex
titre
Optimistic PAC Reinforcement Learning: the Instance-Dependent View
auteur
Andrea Tirinzoni, Aymen Al-Marjani, Emilie Kaufmann
article
Algorithmic Learning Theory (ALT), Feb 2023, Singapore
resume
Optimistic algorithms have been extensively studied for regret minimization in episodic tabular Markov Decision Processes (MDPs), both from a minimax and an instance-dependent view. However, for the PAC RL problem, where the goal is to identify a near-optimal policy with high probability, little is known about their instance-dependent sample complexity. A negative result of Wagenmaker et al. (2022) suggests that optimistic sampling rules cannot be used to attain the (still elusive) optimal instance-dependent sample complexity. On the positive side, we provide the first instance-dependent bound for an optimistic algorithm for PAC RL, BPI-UCRL, for which only minimax guarantees were available (Kaufmann et al., 2021). While our bound features some minimal visitation probabilities, it also features a refined notion of sub-optimality gap compared to the value gaps that appear in prior work. Moreover, in MDPs with deterministic transitions, we show that BPI-UCRL is actually near instance-optimal (up to a factor of the horizon). On the technical side, our analysis is very simple thanks to a new “target trick” of independent interest. We complement these findings with a novel hardness result explaining why the instance-dependent complexity of PAC RL cannot be easily related to that of regret minimization, unlike in the minimax regime.
Accès au texte intégral et bibtex
https://hal.science/hal-04306228/file/TAMK23.pdf BibTex
titre
Farm-gym: A modular reinforcement learning platform for stochastic agronomic games
auteur
Odalric-Ambrym Maillard, Timothée Mathieu, Debabrota Basu
article
AIAFS 2023 – Artificial Intelligence for Agriculture and Food Systems, Feb 2023, Washington DC, United States
resume
We introduce Farm-gym, an open-source farming environment written in Python, that models sequential decision-making in farms using Reinforcement Learning (RL). Farm-gym conceptualizes a farm as a dynamical system with many interacting entities. Leveraging a modular design, it enables us to instantiate from very simple to highly complicated environments. In contrast to many available gym environments, Farm-gym features intrinsically stochastic games, using stochastic growth models and weather data. Further, it enables creating farm games in a modular way, activating or deactivating entities (e.g. weeds, pests, pollinators), and yielding non-trivial coupled dynamics. Finally, every game can be customized with .yaml files for rewards, feasible actions, and initial/end-game conditions. We illustrate some interesting features on simple farms. We also showcase the challenges posed by Farm-gym to deep RL algorithms, in order to stimulate studies in the RL community.
Accès au texte intégral et bibtex
https://inria.hal.science/hal-03960683/file/2022_AAAI_AIAFS%20%282%29.pdf BibTex
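For readers unfamiliar with gym-style environments such as the one described above, the interaction loop typically looks as follows. This is a minimal sketch assuming the standard Gymnasium API (reset/step); the environment id "Farm-v0" is purely illustrative and not the actual Farm-gym registration name.

    import gymnasium as gym  # assumes the environment is registered with Gymnasium

    # Hypothetical environment id, for illustration only.
    env = gym.make("Farm-v0")

    obs, info = env.reset(seed=0)
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()  # pick a random farming action
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        done = terminated or truncated
    print("episode return:", total_reward)

An RL agent would replace the random action by its policy; the .yaml customization mentioned in the abstract is applied when the game is configured.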
titre
Learning crop management by reinforcement: gym-DSSAT
auteur
Romain Gautron, Emilio J Padrón, Philippe Preux, Julien Bigot, Odalric-Ambrym Maillard, Gerrit Hoogenboom, Julien Teigny
article
AIAFS 2023 – 2nd AAAI Workshop on AI for Agriculture and Food Systems, Feb 2023, Washington DC, United States
resume
We introduce gym-DSSAT, a gym environment for crop management tasks, that is easy to use for training Reinforcement Learning (RL) agents. gym-DSSAT is based on DSSAT, a state-of-the-art mechanistic crop growth simulator. We modify DSSAT so that an external software agent can interact with it to control the actions performed in a crop field during a growing season. The RL environment provides predefined decision problems without having to manipulate the complex crop simulator. We report encouraging preliminary results on a use case of nitrogen fertilization for maize. This work opens up opportunities to explore new sustainable crop management strategies with RL, and provides RL researchers with an original set of challenging tasks to investigate.
Accès au texte intégral et bibtex
https://inria.hal.science/hal-03976393/file/AI4AFS.pdf BibTex
titre
Bilinear Exponential Family of MDPs: Frequentist Regret Bound with Tractable Exploration & Planning
auteur
Reda Ouhamma, Debabrota Basu, Odalric-Ambrym Maillard
article
Proceedings of the AAAI Conference on Artificial Intelligence, Feb 2023, Washington DC, United States. pp.9336-9344, ⟨10.1609/aaai.v37i8.26119⟩
resume
We study the problem of episodic reinforcement learning in continuous state-action spaces with unknown rewards and transitions. Specifically, we consider the setting where the rewards and transitions are modeled using parametric bilinear exponential families. We propose an algorithm, BEF-RLSVI, that a) uses penalized maximum likelihood estimators to learn the unknown parameters, b) injects a calibrated Gaussian noise in the parameter of rewards to ensure exploration, and c) leverages linearity of the exponential family with respect to an underlying RKHS to perform tractable planning. We further provide a frequentist regret analysis of BEF-RLSVI that yields an upper bound of Õ((d^3 H^3 K)^{1/2}), where d is the dimension of the parameters, H is the episode length, and K is the number of episodes. Our analysis improves the existing bounds for the bilinear exponential family of MDPs by √H and removes the handcrafted clipping deployed in existing RLSVI-type algorithms. Our regret bound is order-optimal with respect to H and K.
Accès au texte intégral et bibtex
https://hal.science/hal-03790997/file/bef_rlsvi.pdf BibTex

Reports

titre
AdaStop: sequential testing for efficient and reliable comparisons of Deep RL Agents
auteur
Timothée Mathieu, Riccardo Della Vecchia, Alena Shilova, Matheus Centa de Medeiros, Hector Kohler, Odalric-Ambrym Maillard, Philippe Preux
article
RR-9513, Inria Lille Nord Europe – Laboratoire CRIStAL – Université de Lille. 2023
resume
The reproducibility of many experimental results in Deep Reinforcement Learning (RL) is under question. To solve this reproducibility crisis, we propose a theoretically sound methodology to compare multiple Deep RL algorithms. The performance of one execution of a Deep RL algorithm is random, so that independent executions are needed to assess it precisely. When comparing several RL algorithms, a major question is how many executions must be made and how we can ensure that the results of such a comparison are theoretically sound. Researchers in Deep RL often use fewer than 5 independent executions to compare algorithms: we claim that this is not enough in general. Moreover, when comparing several algorithms at once, the error of each comparison accumulates and must be taken into account with a multiple tests procedure to preserve low error guarantees. To address this problem in a statistically sound way, we introduce AdaStop, a new statistical test based on multiple group sequential tests. When comparing algorithms, AdaStop adapts the number of executions to stop as early as possible while ensuring that we have enough information to distinguish algorithms that perform better than the others in a statistically significant way. We prove both theoretically and empirically that AdaStop has a low probability of making an error (Family-Wise Error). Finally, we illustrate the effectiveness of AdaStop in multiple use-cases, including toy examples and difficult cases such as Mujoco environments.
Accès au texte intégral et bibtex
https://inria.hal.science/hal-04132861/file/RR-9513.pdf BibTex

Preprints, Working Papers, …

titre
Reinforcement Learning in the Wild with Maximum Likelihood-based Model Transfer
auteur
Hannes Eriksson, Debabrota Basu, Tommy Tram, Mina Alibeigi, Christos Dimitrakakis
article
2023
resume
In this paper, we study the problem of transferring the available Markov Decision Process (MDP) models to learn and plan efficiently in an unknown but similar MDP. We refer to it as \textit{Model Transfer Reinforcement Learning (MTRL)} problem. First, we formulate MTRL for discrete MDPs and Linear Quadratic Regulators (LQRs) with continuous state actions. Then, we propose a generic two-stage algorithm, MLEMTRL, to address the MTRL problem in discrete and continuous settings. In the first stage, MLEMTRL uses a \textit{constrained Maximum Likelihood Estimation (MLE)}-based approach to estimate the target MDP model using a set of known MDP models. In the second stage, using the estimated target MDP model, MLEMTRL deploys a model-based planning algorithm appropriate for the MDP class. Theoretically, we prove worst-case regret bounds for MLEMTRL both in realisable and non-realisable settings. We empirically demonstrate that MLEMTRL allows faster learning in new MDPs than learning from scratch and achieves near-optimal performance depending on the similarity of the available MDPs and the target MDP.
Accès au bibtex
https://arxiv.org/pdf/2302.09273 BibTex
titre
CRIMED: Lower and Upper Bounds on Regret for Bandits with Unbounded Stochastic Corruption
auteur
Shubhada Agrawal, Timothée Mathieu, Debabrota Basu, Odalric-Ambrym Maillard
article
2023
resume
We investigate the regret-minimisation problem in a multi-armed bandit setting with arbitrary corruptions. Similar to the classical setup, the agent receives rewards generated independently from the distribution of the arm chosen at each time. However, these rewards are not directly observed. Instead, with a fixed $\varepsilon\in (0,\frac{1}{2})$, the agent observes a sample from the chosen arm’s distribution with probability $1-\varepsilon$, or from an arbitrary corruption distribution with probability $\varepsilon$. Importantly, we impose no assumptions on these corruption distributions, which can be unbounded. In this setting, accommodating potentially unbounded corruptions, we establish a problem-dependent lower bound on regret for a given family of arm distributions. We introduce CRIMED, an asymptotically-optimal algorithm that achieves the exact lower bound on regret for bandits with Gaussian distributions with known variance. Additionally, we provide a finite-sample analysis of CRIMED’s regret performance. Notably, CRIMED can effectively handle corruptions with $\varepsilon$ values as high as $\frac{1}{2}$. Furthermore, we develop a tight concentration result for medians in the presence of arbitrary corruptions, even with $\varepsilon$ values up to $\frac{1}{2}$, which may be of independent interest. We also discuss an extension of the algorithm for handling misspecification in the Gaussian model.
Accès au bibtex
https://arxiv.org/pdf/2309.16563 BibTex
titre
Towards Instance-Optimality in Online PAC Reinforcement Learning
auteur
Aymen Al-Marjani, Andrea Tirinzoni, Emilie Kaufmann
article
2023
resume
Several recent works have proposed instance-dependent upper bounds on the number of episodes needed to identify, with probability 1 − δ, an ε-optimal policy in finite-horizon tabular Markov Decision Processes (MDPs). These upper bounds feature various complexity measures for the MDP, which are defined based on different notions of sub-optimality gaps. However, as of now, no lower bound has been established to assess the optimality of any of these complexity measures, except for the special case of MDPs with deterministic transitions. In this paper, we propose the first instance-dependent lower bound on the sample complexity required for the PAC identification of a near-optimal policy in any tabular episodic MDP. Additionally, we demonstrate that the sample complexity of the PEDEL algorithm of Wagenmaker and Jamieson (2022) closely approaches this lower bound. Considering the intractability of PEDEL, we formulate an open question regarding the possibility of achieving our lower bound using a computationally-efficient algorithm.
Accès au texte intégral et bibtex
https://hal.science/hal-04270888/file/main.pdf BibTex
titre
Online Instrumental Variable Regression: Regret Analysis and Bandit Feedback
auteur
Riccardo Della Vecchia, Debabrota Basu
article
2023
resume
The independence of noise and covariates is a standard assumption in online linear regression with unbounded noise and linear bandit literature. This assumption and the following analysis are invalid in the case of endogeneity, i.e., when the noise and covariates are correlated. In this paper, we study the online setting of Instrumental Variable (IV) regression, which is widely used in economics to identify the underlying model from an endogenous dataset. Specifically, we upper bound the identification and oracle regrets of the popular Two-Stage Least Squares (2SLS) approach to IV regression but in the online setting. Our analysis shows that Online 2SLS (O2SLS) achieves $\mathcal O(d^2\log^2 T)$ identification and $\mathcal O(\gamma \sqrt{d T \log T})$ oracle regret after $T$ interactions, where $d$ is the dimension of covariates and $\gamma$ is the bias due to endogeneity. Then, we leverage O2SLS as an oracle to design OFUL-IV, a linear bandit algorithm. OFUL-IV can tackle endogeneity and achieves $\mathcal O(d\sqrt{T}\log T)$ regret. For datasets with endogeneity, we experimentally show the efficiency of OFUL-IV in terms of estimation error and regret.
Accès au texte intégral et bibtex
https://hal.science/hal-03831210/file/o2sls.pdf BibTex
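As background for the Two-Stage Least Squares approach discussed above, the batch 2SLS estimator for covariates X, instruments Z and outcomes y is the standard textbook expression (the paper's O2SLS is an online variant of this, not reproduced here):

    $\hat{\beta}_{\mathrm{2SLS}} = \big(X^\top P_Z X\big)^{-1} X^\top P_Z\, y, \qquad P_Z = Z\big(Z^\top Z\big)^{-1} Z^\top,$

i.e. first regress the covariates on the instruments, then regress the outcomes on the fitted covariates; instrumenting removes the bias created by the correlation between noise and covariates (endogeneity).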

2022

Journal articles

titre
A channel selection game for multi-operator LoRaWAN deployments
auteur
Kinda Khawam, Hassan Fawaz, Samer Lahoud, Odalric-Ambrym Maillard, Steven Martin
article
Computer Networks, 2022, 216, pp.109185. ⟨10.1016/j.comnet.2022.109185⟩
resume
With over 75 billion Internet of Things (IoT) devices expected worldwide by the year 2025, inaugural MAC layer solutions for long-range IoT deployments no longer suffice. LoRaWAN, the principal technology for comprehensive IoT deployments, enables low power and long range communications. However, synchronous transmissions on the same spreading factors and within the same frequency channel, augmented by the necessary clustering of IoT devices, will cause collisions and drops in efficiency. The performance is further impacted by the shortage of radio resources and by multiple operators utilizing the same unlicensed frequency bands. In this paper, we propose a game theoretic based channel selection algorithm for LoRaWAN in a multi-operator deployment scenario. We begin by proposing an optimal formulation for the selection process with the objective of maximizing the total normalized throughput per spreading factor, per channel. Then, we propose centralized optimal approaches, as well as distributed algorithms, based on reinforcement learning and regret matching dynamics, to finding both the Nash and Correlated equilibria of the proposed game. We simulate our proposals and compare them to the legacy approach of randomized channel access, stressing their efficiency in improving the total normalized throughput as well as the packet delivery ratios.
Accès au bibtex
BibTex
titre
Reinforcement Learning for crop management
auteur
Romain Gautron, Odalric-Ambrym Maillard, Philippe Preux, Marc Corbeels, Régis Sabbadin
article
Computers and Electronics in Agriculture, 2022, 200, pp.107182. ⟨10.1016/j.compag.2022.107182⟩
Accès au texte intégral et bibtex
https://inria.hal.science/hal-03834290/file/main.pdf BibTex
titre
Efficient Change-Point Detection for Tackling Piecewise-Stationary Bandits
auteur
Lilian Besson, Emilie Kaufmann, Odalric-Ambrym Maillard, Julien Seznec
article
Journal of Machine Learning Research, 2022
resume
We introduce GLR-klUCB, a novel algorithm for the piecewise iid non-stationary bandit problem with bounded rewards. This algorithm combines an efficient bandit algorithm, kl-UCB, with an efficient, parameter-free, changepoint detector, the Bernoulli Generalized Likelihood Ratio Test, for which we provide new theoretical guarantees of independent interest. Unlike previous non-stationary bandit algorithms using a change-point detector, GLR-klUCB does not need to be calibrated based on prior knowledge on the arms’ means. We prove that this algorithm can attain a $O(\sqrt{TA \Upsilon_T\log(T)})$ regret in $T$ rounds on some “easy” instances, where A is the number of arms and $\Upsilon_T$ the number of change-points, without prior knowledge of $\Upsilon_T$. In contrast with recently proposed algorithms that are agnostic to $\Upsilon_T$, we perform a numerical study showing that GLR-klUCB is also very efficient in practice, beyond easy instances.
Accès au texte intégral et bibtex
https://inria.hal.science/hal-02006471/file/BKMS22%20%281%29.pdf BibTex
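As a concrete illustration of the Bernoulli Generalized Likelihood Ratio test mentioned above, a direct (unoptimized) detector scans every split point of the observed rewards; the sketch below uses a placeholder threshold, not the calibration derived in the paper.

    import numpy as np

    def kl_bernoulli(p, q, eps=1e-12):
        # Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q).
        p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
        return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

    def bernoulli_glr_change(rewards, threshold):
        # Flag a change-point if, for some split s, the GLR statistic
        # s * kl(mean(r[:s]), mean(r)) + (n - s) * kl(mean(r[s:]), mean(r))
        # exceeds the given threshold.
        r = np.asarray(rewards, dtype=float)
        n, mu_all = len(r), float(r.mean())
        for s in range(1, n):
            stat = s * kl_bernoulli(r[:s].mean(), mu_all) \
                 + (n - s) * kl_bernoulli(r[s:].mean(), mu_all)
            if stat > threshold:
                return True
        return False

    # Example: rewards whose mean jumps from 0.2 to 0.8 after 200 rounds.
    rng = np.random.default_rng(0)
    obs = np.concatenate([rng.binomial(1, 0.2, 200), rng.binomial(1, 0.8, 200)])
    print(bernoulli_glr_change(obs, threshold=10.0))  # placeholder threshold

In GLR-klUCB such a detector is combined with the kl-UCB bandit strategy, as described in the abstract.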
titre
Local Dvoretzky-Kiefer-Wolfowitz confidence bands
auteur
Odalric-Ambrym Maillard
article
Mathematical Methods of Statistics, 2022, ⟨10.3103/S1066530721010038⟩
resume
In this paper, we revisit the concentration inequalities for the supremum of the cumulative distribution function (CDF) of a real-valued continuous distribution as established by Dvoretzky, Kiefer, Wolfowitz and revisited later by Massart in two seminal papers. We focus on the concentration of the local supremum over a sub-interval, rather than on the full domain. That is, denoting $U$ the CDF of the uniform distribution over $[0, 1]$ and $U_n$ its empirical version built from $n$ samples, we study $\mathbb{P}\big(\sup_{u\in[\underline{u},\overline{u}]} U_n(u)-U(u) > \varepsilon\big)$ for different values of $\underline{u}, \overline{u} \in [0, 1]$. Such local controls naturally appear for instance when studying the estimation error of spectral risk measures (such as the conditional value at risk), where $[\underline{u},\overline{u}]$ is typically $[0, \alpha]$ or $[1-\alpha, 1]$ for a risk level $\alpha$, after reshaping the CDF $F$ of the considered distribution into $U$ by the general inverse transform $F^{-1}$. Extending a proof technique from Smirnov, we provide exact expressions of the local quantities $\mathbb{P}\big(\sup_{u\in[\underline{u},\overline{u}]} U_n(u)-U(u) > \varepsilon\big)$ and $\mathbb{P}\big(\sup_{u\in[\underline{u},\overline{u}]} U(u)-U_n(u) > \varepsilon\big)$ for each $n$, $\varepsilon$, $\underline{u}$, $\overline{u}$. Interestingly these quantities, seen as functions of $\varepsilon$, can be easily inverted numerically into functions of the probability level $\delta$. Although not explicit, they can be computed and tabulated. We plot such expressions and compare them to the classical bound $\sqrt{\log(1/\delta)/(2n)}$ provided by Massart's inequality. We then provide an application of such results to the control of generic functionals of the CDF, motivated by the case of the conditional value at risk. Last, we extend the local concentration results holding individually for each $n$ to time-uniform concentration inequalities holding simultaneously for all $n$, revisiting a reflection inequality by James, which is of independent interest for the study of sequential decision-making strategies.
Accès au texte intégral et bibtex
https://hal.science/hal-03780573/file/LocalDKW_MMStat_CR.pdf BibTex
titre
Collaborative Algorithms for Online Personalized Mean Estimation
auteur
Mahsa Asadi, Aurélien Bellet, Odalric-Ambrym Maillard, Marc Tommasi
article
Transactions on Machine Learning Research Journal, 2022
resume
We consider an online estimation problem involving a set of agents. Each agent has access to a (personal) process that generates samples from a real-valued distribution and seeks to estimate its mean. We study the case where some of the distributions have the same mean, and the agents are allowed to actively query information from other agents. The goal is to design an algorithm that enables each agent to improve its mean estimate thanks to communication with other agents. The means as well as the number of distributions with same mean are unknown, which makes the task nontrivial. We introduce a novel collaborative strategy to solve this online personalized mean estimation problem. We analyze its time complexity and introduce variants that enjoy good performance in numerical experiments. We also extend our approach to the setting where clusters of agents with similar means seek to estimate the mean of their cluster.
Accès au texte intégral et bibtex
https://inria.hal.science/hal-03905917/file/tmlr.pdf BibTex

Conference papers

titre
Near-Optimal Collaborative Learning in Bandits
auteur
Clémence Réda, Sattar Vakili, Emilie Kaufmann
article
NeurIPS 2022 – 36th Conference on Neural Information Processing Systems, Dec 2022, New Orleans, United States
resume
This paper introduces a general multi-agent bandit model in which each agent is facing a finite set of arms and may communicate with other agents through a central controller in order to identify (in pure exploration) or play (in regret minimization) its optimal arm. The twist is that the optimal arm for each agent is the arm with largest expected mixed reward, where the mixed reward of an arm is a weighted sum of the rewards of this arm for all agents. This makes communication between agents often necessary. This general setting allows us to recover and extend several recent models for collaborative bandit learning, including the recently proposed federated learning with personalization [30]. In this paper, we provide new lower bounds on the sample complexity of pure exploration and on the regret. We then propose a near-optimal algorithm for pure exploration. This algorithm is based on phased elimination with two novel ingredients: a data-dependent sampling scheme within each phase, aimed at matching a relaxation of the lower bound.
Accès au texte intégral et bibtex
https://hal.science/hal-03825099/file/RVK22.pdf BibTex
titre
When Privacy Meets Partial Information: A Refined Analysis of Differentially Private Bandits
auteur
Achraf Azize, Debabrota Basu
article
Advances in Neural Information Processing Systems, Dec 2022, New Orleans, United States
resume
We study the problem of multi-armed bandits with $\epsilon$-global Differential Privacy (DP). First, we prove the minimax and problem-dependent regret lower bounds for stochastic and linear bandits that quantify the hardness of bandits with $\epsilon$-global DP. These bounds suggest the existence of two hardness regimes depending on the privacy budget $\epsilon$. In the high-privacy regime (small $\epsilon$), the hardness depends on a coupled effect of privacy and partial information about the reward distributions. In the low-privacy regime (large $\epsilon$), bandits with $\epsilon$-global DP are not harder than the bandits without privacy. For stochastic bandits, we further propose a generic framework to design a near-optimal $\epsilon$ global DP extension of an index-based optimistic bandit algorithm. The framework consists of three ingredients: the Laplace mechanism, arm-dependent adaptive episodes, and usage of only the rewards collected in the last episode for computing private statistics. Specifically, we instantiate $\epsilon$-global DP extensions of UCB and KL-UCB algorithms, namely AdaP-UCB and AdaP-KLUCB. AdaP-KLUCB is the first algorithm that both satisfies $\epsilon$-global DP and yields a regret upper bound that matches the problem-dependent lower bound up to multiplicative constants.
Accès au texte intégral et bibtex
https://hal.science/hal-03781600/file/When_Privacy_Meets_Partial_Information-2.pdf BibTex
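As background for the Laplace-mechanism ingredient used by AdaP-UCB and AdaP-KLUCB: releasing a statistic f(D) of sensitivity Δf through

    $M(D) = f(D) + \mathrm{Lap}\big(\Delta f / \epsilon\big)$

satisfies ε-differential privacy (a standard fact, not specific to these algorithms); per the abstract, the private statistics here are computed using only the rewards collected in the last episode.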
titre
Top Two Algorithms Revisited
auteur
Marc Jourdan, Rémy Degenne, Dorian Baudry, Rianne de Heide, Emilie Kaufmann
article
NeurIPS 2022 – 36th Conference on Neural Information Processing Systems, Nov 2022, New Orleans, United States
resume
Top Two algorithms arose as an adaptation of Thompson sampling to best arm identification in multi-armed bandit models [38], for parametric families of arms. They select the next arm to sample from by randomizing among two candidate arms, a leader and a challenger. Despite their good empirical performance, theoretical guarantees for fixed-confidence best arm identification have only been obtained when the arms are Gaussian with known variances. In this paper, we provide a general analysis of Top Two methods, which identifies desirable properties of the leader, the challenger, and the (possibly non-parametric) distributions of the arms. As a result, we obtain theoretically supported Top Two algorithms for best arm identification with bounded distributions. Our proof method demonstrates in particular that the sampling step used to select the leader inherited from Thompson sampling can be replaced by other choices, like selecting the empirical best arm.
Accès au texte intégral et bibtex
https://hal.science/hal-03825103/file/npbai.pdf BibTex
titre
IMED-RL: Regret optimal learning of ergodic Markov decision processes
auteur
Fabien Pesquerel, Odalric-Ambrym Maillard
article
NeurIPS 2022 – Thirty-sixth Conference on Neural Information Processing Systems, Nov 2022, New Orleans, United States
resume
We consider reinforcement learning in a discrete, undiscounted, infinite-horizon Markov Decision Problem (MDP) under the average reward criterion, and focus on the minimization of the regret with respect to an optimal policy, when the learner does not know the rewards nor the transitions of the MDP. In light of their success at regret minimization in multi-armed bandits, popular bandit strategies, such as the optimistic UCB, KL-UCB or the Bayesian Thompson sampling strategy, have been extended to the MDP setup. Despite some key successes, existing strategies for solving this problem either fail to be provably asymptotically optimal, or suffer from prohibitive burn-in phase and computational complexity when implemented in practice. In this work, we shed a novel light on regret minimization strategies, by extending to reinforcement learning the computationally appealing Indexed Minimum Empirical Divergence (IMED) bandit algorithm. Traditional asymptotic problem-dependent lower bounds on the regret are known under the assumption that the MDP is ergodic. Under this assumption, we introduce IMED-RL and prove that its regret upper bound asymptotically matches the regret lower bound. We discuss both the case when the supports of transitions are unknown, and the more informative but a priori harder-to-exploit-optimally case when they are known. Rewards are assumed light-tailed, semi-bounded from above. Last, we provide numerical illustrations on classical tabular MDPs, ergodic and communicating only, showing the competitiveness of IMED-RL in finite-time against state-of-the-art algorithms. IMED-RL also benefits from a light complexity.
Accès au texte intégral et bibtex
https://hal.science/hal-03825423/file/IMED_RL_pesquerel_neurips.pdf BibTex
titre
Near Instance-Optimal PAC Reinforcement Learning for Deterministic MDPs
auteur
Andrea Tirinzoni, Aymen Al-Marjani, Emilie Kaufmann
article
NeurIPS 2022 – 36th Conference on Neural Information Processing Systems, Nov 2022, New Orleans, United States
resume
In probably approximately correct (PAC) reinforcement learning (RL), an agent is required to identify an ε-optimal policy with probability 1 − δ. While minimax optimal algorithms exist for this problem, its instance-dependent complexity remains elusive in episodic Markov decision processes (MDPs). In this paper, we propose the first nearly matching (up to a horizon squared factor and logarithmic terms) upper and lower bounds on the sample complexity of PAC RL in deterministic episodic MDPs with finite state and action spaces. In particular, our bounds feature a new notion of sub-optimality gap for state-action pairs that we call the deterministic return gap. While our instance-dependent lower bound is written as a linear program, our algorithms are very simple and do not require solving such an optimization problem during learning. Their design and analyses employ novel ideas, including graph-theoretical concepts (minimum flows) and a new maximum-coverage exploration strategy.
Accès au texte intégral et bibtex
https://hal.science/hal-03825101/file/TAMK22.pdf BibTex
titre
On Meritocracy in Optimal Set Selection
auteur
Thomas Kleine Buening, Meirav Segal, Debabrota Basu, Anne-Marie George, Christos Dimitrakakis
article
EAAMO 2022 – Equity and Access in Algorithms, Mechanisms, and Optimization, ACM, Oct 2022, Arlington, United States
resume
Typically, merit is defined with respect to some intrinsic measure of worth. We instead consider a setting where an individual’s worth is relative: when a decision maker (DM) selects a set of individuals from a population to maximise expected utility, it is natural to consider the expected marginal contribution (EMC) of each person to the utility. We show that this notion satisfies an axiomatic definition of fairness for this setting. We also show that for certain policy structures, this notion of fairness is aligned with maximising expected utility, while for linear utility functions it is identical to the Shapley value. However, for certain natural policies, such as those that select individuals with a specific set of attributes (e.g. high enough test scores for college admissions), there is a trade-off between meritocracy and utility maximisation. We analyse the effect of constraints on the policy on both utility and fairness in extensive experiments based on college admissions and outcomes in Norwegian universities.
Accès au bibtex
https://arxiv.org/pdf/2102.11932 BibTex
titre
Risk-aware linear bandits with convex loss
auteur
Patrick Saux, Odalric-Ambrym Maillard
article
European Workshop on Reinforcement Learning, Sep 2022, Milan, Italy
resume
In decision-making problems such as the multi-armed bandit, an agent learns sequentially by optimizing a certain feedback. While the mean reward criterion has been extensively studied, other measures that reflect an aversion to adverse outcomes, such as mean-variance or conditional value-at-risk (CVaR), can be of interest for critical applications (healthcare, agriculture). Algorithms have been proposed for such risk-aware measures under bandit feedback without contextual information. In this work, we study contextual bandits where such risk measures can be elicited as linear functions of the contexts through the minimization of a convex loss. A typical example that fits within this framework is the expectile measure, which is obtained as the solution of an asymmetric least-square problem. Using the method of mixtures for supermartingales, we derive confidence sequences for the estimation of such risk measures. We then propose an optimistic UCB algorithm to learn optimal risk-aware actions, with regret guarantees similar to those of generalized linear bandits. This approach requires solving a convex problem at each round of the algorithm, which we can relax by allowing only approximated solution obtained by online gradient descent, at the cost of slightly higher regret. We conclude by evaluating the resulting algorithms on numerical experiments.
Accès au texte intégral et bibtex
https://hal.science/hal-03776680/file/main.pdf BibTex
titre
Near Instance-Optimal PAC Reinforcement Learning for Deterministic MDPs
auteur
Andrea Tirinzoni, Aymen Al-Marjani, Emilie Kaufmann
article
EWRL 2022 – European Workshop on Reinforcement Learning, Sep 2022, Milan, Italy
resume
In probably approximately correct (PAC) reinforcement learning (RL), an agent is required to identify an ε-optimal policy with probability 1 − δ. While minimax optimal algorithms exist for this problem, its instance-dependent complexity remains elusive in episodic Markov decision processes (MDPs). In this paper, we propose the first nearly matching (up to a horizon squared factor and logarithmic terms) upper and lower bounds on the sample complexity of PAC RL in deterministic episodic MDPs with finite state and action spaces. In particular, our bounds feature a new notion of sub-optimality gap for state-action pairs that we call the deterministic return gap. While our instance-dependent lower bound is written as a linear program, our algorithms are very simple and do not require solving such an optimization problem during learning. Their design and analyses employ novel ideas, including graph-theoretical concepts (minimum flows) and a new maximum-coverage exploration strategy.
Accès au texte intégral et bibtex
https://hal.science/hal-03767412/file/eprl_ewrl.pdf BibTex
titre
Optimistic PAC Reinforcement Learning: the Instance-Dependent View
auteur
Andrea Tirinzoni, Aymen Al-Marjani, Emilie Kaufmann
article
EWRL 2022 – European Workshop on Reinforcement Learning, Sep 2022, Milan, Italy
resume
Optimistic algorithms have been extensively studied for regret minimization in episodic tabular MDPs, both from a minimax and an instance-dependent view. However, for the PAC RL problem, where the goal is to identify a near-optimal policy with high probability, little is known about their instance-dependent sample complexity. A negative result of Wagenmaker et al. (2022) suggests that optimistic sampling rules cannot be used to attain the (still elusive) optimal instance-dependent sample complexity. On the positive side, we provide the first instance-dependent bound for an optimistic algorithm for PAC RL, BPI-UCRL, for which only minimax guarantees were available (Kaufmann et al., 2021). While our bound features some minimal visitation probabilities, it also features a refined notion of sub-optimality gap compared to the value gaps that appear in prior work. Moreover, in MDPs with deterministic transitions, we show that BPI-UCRL is actually near-optimal. On the technical side, our analysis is very simple thanks to a new “target trick” of independent interest. We complement these findings with a novel hardness result explaining why the instance-dependent complexity of PAC RL cannot be easily related to that of regret minimization, unlike in the minimax regime.
Accès au texte intégral et bibtex
https://hal.science/hal-03767409/file/bpi_ucrl.pdf BibTex
titre
UDO: Universal Database Optimization using Reinforcement Learning
auteur
Junxiong Wang, Immanuel Trummer, Debabrota Basu
article
Proceedings of the VLDB Endowment, Sep 2022, Sydney, Australia. pp.3402-3414, ⟨10.14778/3484224.3484236⟩
resume
UDO is a versatile tool for offline tuning of database systems for specific workloads. UDO can consider a variety of tuning choices, ranging from picking transaction code variants and index selections to database system parameter tuning. UDO uses reinforcement learning to converge to near-optimal configurations, creating and evaluating different configurations via actual query executions (instead of relying on simplifying cost models). To cater to different parameter types, UDO distinguishes heavy parameters (which are expensive to change, e.g. physical design parameters) from light parameters. Specifically for optimizing heavy parameters, UDO uses reinforcement learning algorithms that allow delaying the point at which the reward feedback becomes available. This gives us the freedom to optimize the point in time and the order in which different configurations are created and evaluated (by benchmarking a workload sample). UDO uses a cost-based planner to minimize reconfiguration overheads. For instance, it aims to amortize the creation of expensive data structures by consecutively evaluating configurations using them. We evaluate UDO on Postgres as well as MySQL and on TPC-H as well as TPC-C, optimizing a variety of light and heavy parameters concurrently.
Accès au bibtex
https://arxiv.org/pdf/2104.01744 BibTex
titre
SENTINEL: Taming Uncertainty with Ensemble-based Distributional Reinforcement Learning
auteur
Hannes Eriksson, Debabrota Basu, Mina Alibeigi, Christos Dimitrakakis
article
UAI 2022- Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence, Aug 2022, Eindhoven, Netherlands. pp.631-640
resume
In this paper, we consider risk-sensitive sequential decision-making in Reinforcement Learning (RL). Our contributions are two-fold. First, we introduce a novel and coherent quantification of risk, namely composite risk, which quantifies the joint effect of aleatory and epistemic risk during the learning process. Existing works considered either aleatory or epistemic risk individually, or as an additive combination. We prove that the additive formulation is a particular case of the composite risk when the epistemic risk measure is replaced with expectation. Thus, the composite risk is more sensitive to both aleatory and epistemic uncertainty than the individual and additive formulations. We also propose an algorithm, SENTINEL-K, based on ensemble bootstrapping and distributional RL for representing epistemic and aleatory uncertainty respectively. The ensemble of K learners uses Follow The Regularised Leader (FTRL) to aggregate the return distributions and obtain the composite risk. We experimentally verify that SENTINEL-K estimates the return distribution better and, when used with composite risk estimates, demonstrates higher risk-sensitive performance than state-of-the-art risk-sensitive and distributional RL algorithms.
Accès au texte intégral et bibtex
https://hal.science/hal-03150823/file/eriksson22a-supp.pdf BibTex
titre
SAAC: Safe Reinforcement Learning as an Adversarial Game of Actor-Critics
auteur
Yannis Flet-Berliac, Debabrota Basu
article
RLDM 2022 – The Multi-disciplinary Conference on Reinforcement Learning and Decision Making, Jun 2022, Providence, United States
resume
Although Reinforcement Learning (RL) is effective for sequential decision-making problems under uncertainty, it still fails to thrive in real-world systems where risk or safety is a binding constraint. In this paper, we formulate the RL problem with safety constraints as a non-zero-sum game. When combined with maximum entropy RL, this formulation leads to a safe, adversarially guided soft actor-critic framework, called SAAC. In SAAC, the adversary aims to break the safety constraint while the RL agent aims to maximize the constrained value function given the adversary’s policy. The safety constraint on the agent’s value function manifests only as a repulsion term between the agent’s and the adversary’s policies. Unlike previous approaches, SAAC can address different safety criteria such as safe exploration, mean-variance risk sensitivity, and CVaR-like coherent risk sensitivity. We illustrate the design of the adversary for these constraints. Then, in each of these variations, we show the agent differentiates itself from the adversary’s unsafe actions in addition to learning to solve the task. Finally, for challenging continuous control tasks, we demonstrate that SAAC achieves faster convergence, better efficiency, and fewer failures to satisfy the safety constraints than risk-averse distributional RL and risk-neutral soft actor-critic algorithms.
Accès au texte intégral et bibtex
https://hal.science/hal-03771734/file/saac%20arxiv.pdf BibTex
titre
Risk-Sensitive Bayesian Games for Multi-Agent Reinforcement Learning under Policy Uncertainty
auteur
Hannes Eriksson, Debabrota Basu, Mina Alibeigi, Christos Dimitrakakis
article
OptLearnMAS@AAMAS, May 2022, Virtual, New Zealand
resume
In stochastic games with incomplete information, the uncertainty is evoked by the lack of knowledge about a player’s own and the other players’ types, i.e. the utility function and the policy space, and also the inherent stochasticity of different players’ interactions. In existing literature, the risk in stochastic games has been studied in terms of the inherent uncertainty evoked by the variability of transitions and actions. In this work, we instead focus on the risk associated with the uncertainty over types. We contrast this with the multi-agent reinforcement learning framework where the other agents have fixed stationary policies and investigate risk-sensitiveness due to the uncertainty about the other agents’ adaptive policies. We propose risk-sensitive versions of existing algorithms proposed for risk-neutral stochastic games, such as Iterated Best Response (IBR), Fictitious Play (FP) and a general multi-objective gradient approach using dual ascent (DAPG). Our experimental analysis shows that risk-sensitive DAPG performs better than competing algorithms for both social welfare and general-sum stochastic games.
Accès au bibtex
https://arxiv.org/pdf/2203.10045 BibTex
titre
Efficient Algorithms for Extreme Bandits
auteur
Dorian Baudry, Yoan Russac, Emilie Kaufmann
article
International conference on Artificial Intelligence and Statistics (AISTATS), Mar 2022, Virtual Conference, Spain
resume
In this paper, we contribute to the Extreme Bandit problem, a variant of Multi-Armed Bandits in which the learner seeks to collect the largest possible reward. We first study the concentration of the maximum of i.i.d. random variables under mild assumptions on the tail of the reward distributions. This analysis motivates the introduction of Quantile of Maxima (QoMax) (see the sketch after this entry). The properties of QoMax are sufficient to build an Explore-Then-Commit (ETC) strategy, QoMax-ETC, achieving strong asymptotic guarantees despite its simplicity. We then propose and analyze a more adaptive, anytime algorithm, QoMax-SDA, which combines QoMax with a subsampling method recently introduced by Baudry et al. (2021). Both algorithms are more efficient than existing approaches in two respects: (1) they lead to better empirical performance, and (2) they enjoy a significant reduction in memory and time complexity.
Accès au texte intégral et bibtex
https://hal.science/hal-03741302/file/RBK22.pdf BibTex
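A small sketch of the Quantile-of-Maxima idea behind QoMax-ETC, under illustrative assumptions: the batch count, quantile level, exploration budget and toy reward distributions below are arbitrary choices, not the tuned algorithm analysed in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def qomax(samples, n_batches=10, q=0.5):
    """Quantile of Maxima: split the samples into batches, take each batch's
    maximum, and return the q-quantile of those maxima."""
    batches = np.array_split(np.asarray(samples, dtype=float), n_batches)
    maxima = [b.max() for b in batches if len(b) > 0]
    return np.quantile(maxima, q)

def qomax_etc(arms, horizon, explore_per_arm=200):
    """Explore-Then-Commit with the QoMax statistic (a sketch): pull each arm
    explore_per_arm times, then commit to the arm with the largest QoMax for
    the remaining rounds and report the best reward collected there."""
    stats = [qomax([arm() for _ in range(explore_per_arm)]) for arm in arms]
    best = int(np.argmax(stats))
    remaining = horizon - explore_per_arm * len(arms)
    rewards = [arms[best]() for _ in range(remaining)]
    return best, max(rewards)

# Two arms with different tail behaviour; the heavier tail is preferable
# when the goal is to collect the single largest reward.
arms = [lambda: rng.exponential(1.0), lambda: rng.pareto(2.5)]
print(qomax_etc(arms, horizon=5000))
```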
titre
Algorithmic fairness verification with graphical models
auteur
Bishwamittra Ghosh, Debabrota Basu, Kuldeep S. Meel
article
AAAI-2022 – 36th AAAI Conference on Artificial Intelligence, Feb 2022, Virtual, United States
resume
In recent years, machine learning (ML) algorithms have been deployed in safety-critical and high-stake decision-making, where the fairness of algorithms is of paramount importance. Fairness in ML centers on detecting bias towards certain demographic populations induced by an ML classifier and proposes algorithmic solutions to mitigate the bias with respect to different fairness definitions. To this end, several fairness verifiers have been proposed that compute the bias in the prediction of an ML classifier—essentially beyond a finite dataset—given the probability distribution of input features. In the context of verifying linear classifiers, existing fairness verifiers are limited in accuracy, due to imprecise modeling of correlations among features, and in scalability, due to restrictive formulations of the classifiers as SSAT/SMT formulas or to sampling. In this paper, we propose an efficient fairness verifier, called FVGM, that encodes the correlations among features as a Bayesian network. In contrast to existing verifiers, FVGM proposes a stochastic subset-sum based approach for verifying linear classifiers. Experimentally, we show that FVGM leads to an accurate and scalable assessment for more diverse families of fairness-enhancing algorithms, fairness attacks, and group/causal fairness metrics than the state-of-the-art. We also demonstrate that FVGM facilitates the computation of fairness influence functions as a stepping stone to detect the source of bias induced by subsets of features. (A toy illustration of the quantity such a verifier computes follows this entry.)
Accès au texte intégral et bibtex
https://hal.science/hal-03770361/file/2109.09447.pdf BibTex
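For intuition about what a fairness verifier computes, here is a brute-force toy that measures the demographic-parity gap of a linear classifier over a small, fully enumerable feature distribution. It only illustrates the target quantity; FVGM itself avoids enumeration by encoding feature correlations in a Bayesian network and using a stochastic subset-sum routine. All weights, features and probabilities below are made up for the example.

```python
import itertools
import numpy as np

def demographic_parity_gap(w, b, feature_values, probs, group_index):
    """Brute-force verification of a linear classifier w.x + b > 0 over a
    discrete feature distribution: compute the positive-prediction rate
    conditioned on each value of the protected feature and return the
    largest gap between groups."""
    mass = {}  # group value -> [P(group), P(group and positive prediction)]
    for v in itertools.product(*feature_values):
        p = probs(v)
        pos = float(np.dot(w, v) + b > 0)
        acc = mass.setdefault(v[group_index], [0.0, 0.0])
        acc[0] += p
        acc[1] += p * pos
    rates = [acc[1] / acc[0] for acc in mass.values() if acc[0] > 0]
    return max(rates) - min(rates)

# Two binary features, the second one protected, independent and uniform.
w, b = np.array([1.0, 0.8]), -0.5
gap = demographic_parity_gap(
    w, b,
    feature_values=[(0, 1), (0, 1)],
    probs=lambda v: 0.25,
    group_index=1,
)
print(gap)  # 0.5 for this toy classifier: group 1 is accepted twice as often
```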
titre
Procrastinated Tree Search: Black-box Optimization with Delayed, Noisy, and Multi-fidelity Feedback
auteur
Junxiong Wang, Debabrota Basu, Immanuel Trummer
article
AAAI Conference on Artificial Intelligence, Feb 2022, Virtual, United States. pp.10381-10390
resume
In black-box optimization problems, we aim to maximize an unknown objective function, where the function is only accessible through the feedback of an evaluation or simulation oracle. In real life, the feedback from such oracles is often noisy and available after some unknown delay that may depend on the computation time of the oracle. Additionally, if exact evaluations are expensive but coarse approximations are available at a lower cost, the feedback can be multi-fidelity. To address this problem, we propose a generic extension of hierarchical optimistic tree search (HOO), called ProCrastinated Tree Search (PCTS), that flexibly accommodates a delay- and noise-tolerant bandit algorithm (a toy illustration of the delayed-feedback setting follows this entry). We provide a generic proof technique to quantify the regret of PCTS under delayed, noisy, and multi-fidelity feedback. Specifically, we derive regret bounds of PCTS enabled with the delayed-UCB1 (DUCB1) and delayed-UCB-V (DUCBV) algorithms. Given a horizon T, PCTS retains the regret bound of non-delayed HOO for an expected delay of O(log T), and worsens by T^((1-α)/(d+2)) for expected delays of O(T^(1-α)) with α ∈ (0, 1]. We experimentally validate on multiple synthetic functions and hyperparameter tuning problems that PCTS outperforms state-of-the-art black-box optimization methods for feedback with different noise levels, delays, and fidelities.
Accès au texte intégral et bibtex
https://hal.science/hal-03445909/file/2110.07232v2%20%281%29.pdf BibTex
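A toy illustration of bandit learning under delayed feedback, the setting PCTS builds on: a UCB1-style index computed only from the rewards that have already arrived. The fixed delay, Bernoulli arms, index and parameters below are illustrative assumptions, not the DUCB1/DUCBV algorithms analysed in the paper.

```python
import heapq
import numpy as np

rng = np.random.default_rng(2)

def delayed_ucb1(arm_means, horizon=2000, delay=50):
    """Toy UCB1 loop with delayed Bernoulli feedback: each reward only becomes
    visible `delay` rounds after the pull, and the index is built from the
    feedback that has already arrived. Returns the observed pull counts."""
    k = len(arm_means)
    arrived_sum = np.zeros(k)   # sum of rewards already observed per arm
    arrived_cnt = np.zeros(k)   # number of rewards already observed per arm
    pending = []                # (arrival_time, arm, reward) min-heap
    for t in range(1, horizon + 1):
        while pending and pending[0][0] <= t:       # deliver due feedback
            _, a, r = heapq.heappop(pending)
            arrived_sum[a] += r
            arrived_cnt[a] += 1
        # UCB1 index on arrived feedback; arms with no feedback yet come first
        idx = np.where(
            arrived_cnt > 0,
            arrived_sum / np.maximum(arrived_cnt, 1)
            + np.sqrt(2 * np.log(t) / np.maximum(arrived_cnt, 1)),
            np.inf,
        )
        a = int(np.argmax(idx))
        r = float(rng.random() < arm_means[a])       # Bernoulli reward
        heapq.heappush(pending, (t + delay, a, r))   # observed later
    return arrived_cnt

print(delayed_ucb1([0.4, 0.5, 0.7]))
```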

Book sections

titre
De l’échantillonnage adaptatif à la résolution de jeux
auteur
Nathanaël Fijalkow, Emilie Kaufmann
article
Informatique Mathématique Une photographie en 2022, 2022
resume
In this chapter we introduce the bandit problem, which is the basic building block of reinforcement learning. The objective is to present the Monte Carlo Tree Search (MCTS) algorithm for solving games, and more precisely its UCT variant (see the sketch of the UCT selection rule after this entry).
Accès au texte intégral et bibtex
https://hal.science/hal-04152484/file/ejcim.pdf BibTex
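Since the chapter is centred on UCT, here is a minimal sketch of the UCT child-selection rule; the exploration constant and the (total value, visit count) representation of children are illustrative choices, not the chapter’s notation.

```python
import math

def uct_select(children, c=math.sqrt(2)):
    """UCT child selection: pick the child maximising its average value plus
    an exploration bonus c * sqrt(ln(parent visits) / child visits).
    `children` is a list of (total_value, visit_count) pairs; children that
    were never visited are selected first."""
    parent_visits = sum(n for _, n in children)
    def score(child):
        value, n = child
        if n == 0:
            return math.inf
        return value / n + c * math.sqrt(math.log(parent_visits) / n)
    return max(range(len(children)), key=lambda i: score(children[i]))

# Example: the second child has a lower mean but far fewer visits,
# so the exploration bonus makes UCT select it here.
print(uct_select([(30.0, 50), (5.0, 10)]))  # prints 1
```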

Poster communications

titre
Petits jeux de données et prédiction en Intelligence Artificielle, vers une meilleure cohabitation : Application à la gestion durable de l’enherbement des systèmes agricoles à La Réunion
auteur
Frédérick Fabre Ferber, Jean Diatta, Jean-Christophe Soulié, Dominique Gay, Odalric-Ambrym Maillard, Thomas Le Bourgeois, Sandrine Auzoux
article
Comité scientifique et technique du DPP CapTerre, Nov 2022, Saint-Leu de La Réunion, Réunion
resume
Transforming the data to obtain a better representation and thus better handle weed management; kernels and feature-space transformations; perspectives. This thesis work will, on the one hand, improve the predictive performance of algorithms for better weed management in agricultural systems and, on the other hand, help guide the creation of efficient agronomic data-collection protocols to improve data quality.
Accès au texte intégral et bibtex
https://hal.science/hal-03971262/file/CST-%202022%20-%20V3-FFF.pdf BibTex

Reports

titre
gym-DSSAT: a crop model turned into a Reinforcement Learning environment
auteur
Romain Gautron, Emilio J. Padrón, Philippe Preux, Julien Bigot, Odalric-Ambrym Maillard, David Emukpere
article
[Research Report] RR-9460, Inria Lille. 2022, pp.31
resume
Addressing a real-world sequential decision problem with Reinforcement Learning (RL) usually starts with the use of a simulated environment that mimics real conditions. We present a novel open source RL environment for realistic crop management tasks. gym-DSSAT is a gym interface to the Decision Support System for Agrotechnology Transfer (DSSAT), a high-fidelity crop simulator. DSSAT has been developed over the last 30 years and is widely recognized by agronomists. gym-DSSAT comes with predefined simulations based on real-world maize experiments. The environment is as easy to use as any gym environment (see the interaction-loop sketch after this entry). We provide performance baselines using basic RL algorithms. We also briefly outline how the monolithic DSSAT simulator written in Fortran has been turned into a Python RL environment. Our methodology is generic and may be applied to similar simulators. We report on very preliminary experimental results which suggest that RL can help researchers improve the sustainability of fertilization and irrigation practices.
Accès au texte intégral et bibtex
https://inria.hal.science/hal-03711132/file/RR-9460.pdf BibTex
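Because gym-DSSAT exposes the crop simulator through the standard gym interface, a random-policy baseline can be written as the usual interaction loop. The sketch below uses the classic gym API (reset returns an observation; step returns a 4-tuple) and a stand-in environment id; the actual gym-DSSAT id, observation space and fertilization/irrigation action spaces are documented in the report and are not reproduced here.

```python
import gym

# "CartPole-v1" is only a stand-in so the loop runs as-is: replace it with the
# environment id registered by the gym-DSSAT package.
env = gym.make("CartPole-v1")
obs = env.reset()
done, episode_return = False, 0.0
while not done:
    action = env.action_space.sample()          # random policy baseline
    obs, reward, done, info = env.step(action)  # one simulated decision step
    episode_return += reward
print("episode return:", episode_return)
```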

Preprints, Working Papers, …

titre
Bandits Corrupted by Nature: Lower Bounds on Regret and Robust Optimistic Algorithm
auteur
Debabrota Basu, Odalric-Ambrym Maillard, Timothée Mathieu
article
2022
resume
In this paper, we study the stochastic bandit problem with k unknown heavy-tailed and corrupted reward distributions, i.e. arms with time-invariant corruption distributions. At each iteration, the player chooses an arm. Given the arm, the environment returns an uncorrupted reward with probability 1−ε and an arbitrarily corrupted reward with probability ε. In our setting, the uncorrupted reward might be heavy-tailed and the corrupted reward might be unbounded. We prove a lower bound on the regret indicating that the corrupted and heavy-tailed bandits are strictly harder than uncorrupted or light-tailed bandits. We observe that the environments can be categorised into hardness regimes depending on the suboptimality gap Δ, variance σ, and corruption proportion ε. Following this, we design a UCB-type algorithm, namely HuberUCB, that leverages Huber’s estimator for robust mean estimation (see the sketch after this entry). HuberUCB leads to tight upper bounds on the regret in the proposed corrupted and heavy-tailed setting. To derive the upper bound, we prove a novel concentration inequality for Huber’s estimator, which might be of independent interest.
Accès au texte intégral et bibtex
https://hal.science/hal-03611816/file/main.pdf BibTex
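A generic sketch of Huber’s location estimator, the robust mean estimator that HuberUCB plugs into a UCB index; the threshold delta, the initialisation and the stopping rule below are illustrative choices and do not reflect the paper’s tuning or its confidence bonuses.

```python
import numpy as np

def huber_mean(x, delta=1.0, n_iter=100, tol=1e-9):
    """Huber's M-estimator of the mean via iteratively reweighted least squares:
    points within delta of the current estimate keep full weight, points further
    away are down-weighted, which makes the estimate robust to heavy tails and
    to a small fraction of corrupted observations."""
    x = np.asarray(x, dtype=float)
    m = np.median(x)                      # robust starting point
    for _ in range(n_iter):
        res = np.abs(x - m)
        w = np.where(res <= delta, 1.0, delta / np.maximum(res, 1e-12))
        m_new = np.sum(w * x) / np.sum(w)
        if abs(m_new - m) < tol:
            break
        m = m_new
    return m

# A heavy-tailed sample with a few corrupted observations: the Huber mean
# stays close to the bulk while the empirical mean is dragged away.
rng = np.random.default_rng(3)
sample = np.concatenate([rng.standard_t(df=2.5, size=1000), [50.0] * 20])
print(huber_mean(sample), sample.mean())
```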
