Publications

Thanks to the Inria communication staff, we produced the following outreach article (see also the French version), intended for a general audience.

HAL publications

2024

Journal articles

Title
The Survival Bandit Problem
Authors
Charles Riou, Junya Honda, Masashi Sugiyama
Reference
Transactions on Machine Learning Research Journal, 2024, pp.2835-8856
Abstract
In this paper, we introduce and study a new variant of the multi-armed bandit problem (MAB), called the survival bandit problem (S-MAB). While in both problems the objective is to maximize the so-called cumulative reward, in this new variant the procedure is interrupted if the cumulative reward falls below a preset threshold. This simple yet unexplored extension of the MAB arises in many practical applications. For example, when testing two medicines against each other on volunteer patients, people’s lives and health are at stake, and it is necessary to be able to interrupt experiments if serious side effects occur or if the disease symptoms are not alleviated by the treatment. From a theoretical perspective, the S-MAB is the first variant of the MAB in which the procedure may or may not be interrupted. We start by formalizing the S-MAB and we define its objective as the minimization of the so-called survival regret, which naturally generalizes the regret of the MAB. Then, we show that the objective of the S-MAB is considerably more difficult than that of the MAB, in the sense that, contrary to the MAB, no policy can achieve a reasonably small (i.e., sublinear) survival regret. Instead, we minimize the survival regret in the sense of Pareto, i.e., we seek a policy whose cumulative reward cannot be improved for some problem instance without being sacrificed for another one. For that purpose, we identify two key components of the survival regret: the regret given no ruin (which corresponds to the regret in the MAB), and the probability that the procedure is interrupted, called the probability of ruin. We derive a lower bound on the probability of ruin, as well as policies whose probability of ruin matches the lower bound. Finally, based on a doubling trick applied to those policies, we derive a policy which minimizes the survival regret in the sense of Pareto, providing an answer to the open problem raised by Perotto et al. (2019).
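To make the interruption rule above concrete (our notation, not necessarily the paper's): if $X_s$ denotes the reward collected at round $s$ and $\theta$ the preset threshold, the procedure stops at the ruin time
$$ \tau \;=\; \inf\Big\{ t \ge 1 \;:\; \sum_{s=1}^{t} X_s < \theta \Big\}, $$
and the probability of ruin discussed above is the probability that this stopping time is ever reached.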
Full text and BibTeX access
https://hal.science/hal-04869371/file/2094_The_Survival_Bandit_Probl.pdf
Title
Finite-time Analysis of Globally Nonstationary Multi-Armed Bandits
Authors
Junpei Komiyama, Edouard Fouché, Junya Honda
Reference
Journal of Machine Learning Research, 2024, 25 (112), pp.1-56
Abstract
We consider nonstationary multi-armed bandit problems where the model parameters of the arms change over time. We introduce the adaptive resetting bandit (ADR-bandit), a bandit algorithm class that leverages adaptive windowing techniques from the literature on data streams. We first provide new guarantees on the quality of estimators resulting from adaptive windowing techniques, which are of independent interest. Furthermore, we conduct a finite-time analysis of ADR-bandit in two typical environments: an abrupt environment where changes occur instantaneously and a gradual environment where changes occur progressively. We demonstrate that ADR-bandit has nearly optimal performance when abrupt or gradual changes occur in a coordinated manner that we call global changes. We demonstrate that forced exploration is unnecessary when we assume such global changes. Unlike existing nonstationary bandit algorithms, ADR-bandit has optimal performance in stationary environments as well as in nonstationary environments with global changes. Our experiments show that the proposed algorithms outperform the existing approaches in synthetic and real-world environments.
BibTeX access
Title
Bandits with Stochastic Corruption: Lower Bounds on Regret and Robust Optimistic Algorithms
Authors
Timothée Mathieu, Debabrota Basu, Odalric-Ambrym Maillard
Reference
Transactions on Machine Learning Research Journal, 2024
Abstract
We study the Bandits with Stochastic Corruption problem, i.e. a stochastic multi-armed bandit problem with $k$ unknown reward distributions, which are heavy-tailed and corrupted by a history-independent stochastic adversary or Nature. To be specific, the reward obtained by playing an arm comes from the corresponding heavy-tailed reward distribution with probability $1-\varepsilon \in (0.5,1]$ and from an arbitrary corruption distribution of unbounded support with probability $\varepsilon \in [0,0.5)$. First, we provide a problem-dependent lower bound on the regret of any corrupted bandit algorithm. The lower bounds indicate that the Bandits with Stochastic Corruption problem is harder than the classical stochastic bandit problem with sub-Gaussian or heavy-tailed rewards. Following that, we propose a novel UCB-type algorithm for Bandits with Stochastic Corruption, namely HubUCB, that builds on Huber’s estimator for robust mean estimation. Leveraging a novel concentration inequality for Huber’s estimator, we prove that HubUCB achieves a near-optimal regret upper bound. Since computing Huber’s estimator has quadratic complexity, we further introduce a sequential version of Huber’s estimator that exhibits linear complexity. We leverage this sequential estimator to design SeqHubUCB, which enjoys similar regret guarantees while reducing the computational burden. Finally, we experimentally illustrate the efficiency of HubUCB and SeqHubUCB in solving Bandits with Stochastic Corruption for different reward distributions and different levels of corruption.
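As a reminder, Huber's estimator mentioned above is the classical robust M-estimator of the mean (standard textbook definition, not specific to this paper): for observations $x_1, \dots, x_n$ and a threshold $\delta > 0$,
$$ \hat{\mu}_\delta \in \arg\min_{\mu \in \mathbb{R}} \sum_{i=1}^{n} \rho_\delta(x_i - \mu), \qquad \rho_\delta(u) = \begin{cases} \tfrac{1}{2} u^2 & \text{if } |u| \le \delta, \\ \delta |u| - \tfrac{1}{2}\delta^2 & \text{otherwise.} \end{cases} $$
The quadratic part makes it behave like the empirical mean on small residuals, while the linear part limits the influence of outliers, which is what makes it suited to the corrupted, heavy-tailed rewards considered here.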
Full text and BibTeX access
https://hal.science/hal-04615733/file/Corrupted_Bandits_RobustUCB.pdf
Title
AdaStop: adaptive statistical testing for sound comparisons of Deep RL agents
Authors
Timothée Mathieu, Riccardo Della Vecchia, Alena Shilova, Matheus Medeiros Centa, Hector Kohler, Odalric-Ambrym Maillard, Philippe Preux
Reference
Transactions on Machine Learning Research Journal, 2024
Abstract
The reproducibility of many experimental results in Deep Reinforcement Learning (RL) is under question. To solve this reproducibility crisis, we propose a theoretically sound methodology to compare multiple Deep RL algorithms. The performance of one execution of a Deep RL algorithm is random, so that independent executions are needed to assess it precisely. When comparing several RL algorithms, a major question is how many executions must be made and how we can ensure that the results of such a comparison are theoretically sound. Researchers in Deep RL often use fewer than 5 independent executions to compare algorithms: we claim that this is not enough in general. Moreover, when comparing several algorithms at once, the error of each comparison accumulates and must be taken into account with a multiple-testing procedure to preserve low error guarantees. To address this problem in a statistically sound way, we introduce AdaStop, a new statistical test based on multiple group sequential tests. When comparing algorithms, AdaStop adapts the number of executions to stop as early as possible while ensuring that we have enough information to distinguish algorithms that perform better than the others in a statistically significant way. We prove both theoretically and empirically that AdaStop has a low probability of making an error (Family-Wise Error). Finally, we illustrate the effectiveness of AdaStop in multiple use-cases, including toy examples and difficult cases such as Mujoco environments.
Full text and BibTeX access
https://inria.hal.science/hal-04132861/file/AdaStop__sequential_testing_for_efficient_and_reliable_comparisons_of_Deep_RL_Agents__1_.pdf

Conference papers

Title
Learning with Posterior Sampling for Revenue Management under Time-varying Demand
Authors
Kazuma Shimizu, Junya Honda, Shinji Ito, Shinji Nakadai
Reference
International Joint Conference on Artificial Intelligence (IJCAI), Aug 2024, Jeju, South Korea. pp.4911-4919, ⟨10.24963/ijcai.2024/543⟩
Abstract
This paper discusses the revenue management (RM) problem of maximizing revenue by pricing items or services. One challenge in this problem is that the demand distribution is unknown and varies over time in real applications such as the airline and retail industries. In particular, the time-varying demand has not been well studied under scenarios of unknown demand due to the difficulty of jointly managing the remaining inventory and estimating the demand. To tackle this challenge, we first introduce an episodic generalization of the RM problem motivated by typical application scenarios. We then propose a computationally efficient algorithm based on posterior sampling, which effectively optimizes prices by solving a linear program. We derive a Bayesian regret upper bound of this algorithm for general models where demand parameters can be correlated between time periods, while also deriving a regret lower bound for generic algorithms. Our empirical study shows that the proposed algorithm performs better than other benchmark algorithms and comparably to the optimal policy in hindsight. We also propose a heuristic modification of the proposed algorithm, which learns the pricing policy even more efficiently in the experiments.
BibTeX access
https://arxiv.org/pdf/2405.04910
Title
Bandits with Multimodal Structure
Authors
Hassan Saber, Odalric-Ambrym Maillard
Reference
RLC 2024 – Reinforcement Learning Conference, Aug 2024, Amherst, Massachusetts, United States. pp.39
Abstract

We consider a multi-armed bandit problem specified by a set of one-dimensional exponential family distributions endowed with a multimodal structure. The multimodal structure naturally extends the unimodal structure and appears, in quite interesting ways, to underlie popular structures such as linear or Lipschitz bandits. We introduce IMED-MB, an algorithm that optimally exploits the multimodal structure, by adapting to this setting the popular Indexed Minimum Empirical Divergence (IMED) algorithm. We provide an instance-dependent regret analysis of this strategy. Numerical experiments show that IMED-MB performs well in practice under unimodal, polynomial or Lipschitz mean functions.

Full text and BibTeX access
https://inria.hal.science/hal-04711994/file/2024_multimodal%20%281%29%20%282%29.pdf
Title
Exploration by Optimization with Hybrid Regularizers: Logarithmic Regret with Adversarial Robustness in Partial Monitoring
Authors
Taira Tsuchiya, Shinji Ito, Junya Honda
Reference
ICML 2024 – 41st International Conference on Machine Learning, Jul 2024, Vienna, Austria. pp.48768-48790
Abstract
Partial monitoring is a generic framework for online decision-making problems with limited feedback. To make decisions from such limited feedback, it is necessary to find an appropriate distribution for exploration. Recently, a powerful approach for this purpose, exploration by optimization (ExO), was proposed, which achieves optimal bounds in adversarial environments with follow-the-regularized-leader for a wide range of online decision-making problems. However, a naive application of ExO in stochastic environments significantly degrades regret bounds. To resolve this issue in locally observable games, we first establish a new framework and analysis for ExO with a hybrid regularizer. This development allows us to significantly improve existing regret bounds of best-of-both-worlds (BOBW) algorithms, which achieve nearly optimal bounds both in stochastic and adversarial environments. In particular, we derive a stochastic regret bound of $O(\sum_{a \neq a^*} k^2 m^2 \log T / \Delta_a)$, where $k$, $m$, and $T$ are the numbers of actions, observations and rounds, $a^*$ is an optimal action, and $\Delta_a$ is the suboptimality gap for action $a$. This bound is roughly $\Theta(k^2 \log T)$ times smaller than existing BOBW bounds. In addition, for globally observable games, we provide a new BOBW algorithm with the first $O(\log T)$ stochastic bound.
Full text and BibTeX access
https://hal.science/hal-04869940/file/tsuchiya24a.pdf
Title
Power Mean Estimation in Stochastic Monte-Carlo Tree Search
Authors
Tuan Quang Tuan Dam, Odalric-Ambrym Maillard, Emilie Kaufmann
Reference
Uncertainty in Artificial Intelligence, Jul 2024, Barcelona, Spain
Abstract

Monte-Carlo Tree Search (MCTS) is a widely used strategy for online planning that combines Monte-Carlo sampling with forward tree search. Its success relies on the Upper Confidence bound for Trees (UCT) algorithm, an extension of the UCB method for multi-armed bandits. However, the theoretical foundation of UCT is incomplete due to an error in the logarithmic bonus term for action selection, leading to the development of Fixed-Depth-MCTS with a polynomial exploration bonus to balance exploration and exploitation [Shah et al., 2022]. Both UCT and Fixed-Depth-MCTS suffer from biased value estimation: the weighted sum underestimates the optimal value, while the maximum valuation overestimates it [Coulom, 2006]. The power mean estimator offers a balanced solution, lying between the average and maximum values. Power-UCT [Dam et al., 2019] incorporates this estimator for more accurate value estimates, but its theoretical analysis remains incomplete. This paper introduces Stochastic-Power-UCT, an MCTS algorithm using the power mean estimator and tailored for stochastic MDPs. We analyze its polynomial convergence in estimating root node values and show that it shares the same convergence rate of $O(n^{-1/2})$, where $n$ is the number of visited trajectories, as Fixed-Depth-MCTS, with the latter being a special case of the former. Our theoretical results are validated with empirical tests across various stochastic MDP environments.
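For context, the power mean referred to above is the standard quantity (shown here in its unweighted form)
$$ M_p(x_1, \dots, x_n) \;=\; \Big( \frac{1}{n} \sum_{i=1}^{n} x_i^{\,p} \Big)^{1/p}, \qquad p \ge 1, $$
which equals the arithmetic mean at $p = 1$ and tends to $\max_i x_i$ as $p \to \infty$, hence interpolating between the two biased backup operators mentioned in the abstract.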

Full text and BibTeX access
https://inria.hal.science/hal-04714124/file/607_Power_Mean_Estimation_in_S.pdf
Title
Adaptive Learning Rate for Follow-the-Regularized-Leader: Competitive Analysis and Best-of-Both-Worlds
Authors
Shinji Ito, Taira Tsuchiya, Junya Honda
Reference
COLT 2024 – 37th Annual Conference on Learning Theory, Jun 2024, Edmonton, Canada. pp.2522-2563
Abstract
Follow-The-Regularized-Leader (FTRL) is known as an effective and versatile approach in online learning, where an appropriate choice of the learning rate is crucial for achieving small regret. To this end, we formulate the problem of adjusting FTRL’s learning rate as a sequential decision-making problem and introduce the framework of competitive analysis. We establish a lower bound for the competitive ratio and propose update rules for the learning rate that achieve an upper bound within a constant factor of this lower bound. Specifically, we illustrate that the optimal competitive ratio is characterized by the (approximate) monotonicity of the components of the penalty term, showing that a constant competitive ratio is achievable if the components of the penalty term form a monotone non-increasing sequence, and derive a tight competitive ratio when the penalty terms are ξ-approximately monotone non-increasing. Our proposed update rule, referred to as stability-penalty matching, also facilitates the construction of Best-Of-Both-Worlds (BOBW) algorithms for stochastic and adversarial environments. In these environments, our results contribute to achieving tighter regret bounds and broaden the applicability of algorithms to various settings such as multi-armed bandits, graph bandits, linear bandits, and contextual bandits.
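For reference, the update whose learning rate is being tuned is the standard FTRL rule (generic form, not the paper's exact formulation): at round $t+1$ the learner plays
$$ x_{t+1} \in \arg\min_{x \in \mathcal{X}} \Big\{ \Big\langle \sum_{s=1}^{t} \hat{\ell}_s,\, x \Big\rangle + \frac{1}{\eta_{t+1}}\, \psi(x) \Big\}, $$
where $\hat{\ell}_s$ are the (estimated) losses, $\psi$ is the regularizer and $\eta_{t+1}$ is the learning rate; the penalty and stability terms discussed above are the two components of the usual regret decomposition induced by this update.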
Full text and BibTeX access
https://hal.science/hal-04869864/file/ito24a.pdf
Title
Follow-the-Perturbed-Leader with Fréchet-type Tail Distributions: Optimality in Adversarial Bandits and Best-of-Both-Worlds
Authors
Jongyeong Lee, Junya Honda, Shinji Ito, Min-Hwan Oh
Reference
COLT 2024 – 37th Annual Conference on Learning Theory, Jun 2024, Edmonton, Canada. pp.3375-3430
Abstract
This paper studies the optimality of the Follow-the-Perturbed-Leader (FTPL) policy in both adversarial and stochastic K-armed bandits. In contrast to the widespread use of the Follow-the-Regularized-Leader (FTRL) framework with various choices of regularization, the FTPL framework, which relies on random perturbations, has received much less attention despite its inherent simplicity. In adversarial bandits, it has been conjectured that FTPL could achieve $O(\sqrt{KT})$ regret if the perturbations follow a distribution with a Fréchet-type tail. Recent work by Honda et al. (2023) showed that FTPL with a Fréchet distribution of shape α = 2 indeed attains this bound and, notably, logarithmic regret in stochastic bandits, establishing the Best-of-Both-Worlds (BOBW) capability of FTPL. However, this result only partly resolves the above conjecture because their analysis heavily relies on the specific form of the Fréchet distribution with this shape. In this paper, we establish a sufficient condition for perturbations to achieve $O(\sqrt{KT})$ regret in the adversarial setting, which covers, e.g., Fréchet, Pareto, and Student-t distributions. We also demonstrate the BOBW achievability of FTPL with certain Fréchet-type tail distributions. Our results contribute not only to resolving existing conjectures through the lens of extreme value theory but also potentially offer insights into the effect of the regularization functions in FTRL through the mapping from FTPL to FTRL.
Full text and BibTeX access
https://hal.science/hal-04869754/file/lee24a.pdf
Title
Pure Exploration in Bandits with Linear Constraints
Authors
Emil Carlsson, Debabrota Basu, Fredrik D. Johansson, Devdatt Dubhashi
Reference
International Conference on Artificial Intelligence and Statistics, May 2024, Valencia, Spain. pp.334-342
Abstract
We address the problem of identifying the optimal policy with a fixed confidence level in a multi-armed bandit setup, when the arms are subject to linear constraints. Unlike the standard best-arm identification problem which is well studied, the optimal policy in this case may not be deterministic and could mix between several arms. This changes the geometry of the problem which we characterize via an information-theoretic lower bound. We introduce two asymptotically optimal algorithms for this setting, one based on the Track-and-Stop method and the other based on a game-theoretic approach. Both these algorithms try to track an optimal allocation based on the lower bound and computed by a weighted projection onto the boundary of a normal cone. Finally, we provide empirical results that validate our bounds and visualize how constraints change the hardness of the problem.
BibTeX access
https://arxiv.org/pdf/2306.12774
Title
CRIMED: Lower and Upper Bounds on Regret for Bandits with Unbounded Stochastic Corruption
Authors
Shubhada Agrawal, Timothée Mathieu, Debabrota Basu, Odalric-Ambrym Maillard
Reference
International Conference on Algorithmic Learning Theory (ALT), Feb 2024, San Diego (CA), United States. pp.74-124
Abstract
We investigate the regret-minimisation problem in a multi-armed bandit setting with arbitrary corruptions. Similar to the classical setup, the agent receives rewards generated independently from the distribution of the arm chosen at each time. However, these rewards are not directly observed. Instead, with a fixed $\varepsilon\in (0,\frac{1}{2})$, the agent observes a sample from the chosen arm’s distribution with probability $1-\varepsilon$, or from an arbitrary corruption distribution with probability $\varepsilon$. Importantly, we impose no assumptions on these corruption distributions, which can be unbounded. In this setting, accommodating potentially unbounded corruptions, we establish a problem-dependent lower bound on regret for a given family of arm distributions. We introduce CRIMED, an asymptotically optimal algorithm that achieves the exact lower bound on regret for bandits with Gaussian distributions with known variance. Additionally, we provide a finite-sample analysis of CRIMED’s regret performance. Notably, CRIMED can effectively handle corruptions with $\varepsilon$ values as high as $\frac{1}{2}$. Furthermore, we develop a tight concentration result for medians in the presence of arbitrary corruptions, even with $\varepsilon$ values up to $\frac{1}{2}$, which may be of independent interest. We also discuss an extension of the algorithm for handling misspecification in the Gaussian model.
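Written out, the observation model above says that when the agent pulls arm $a$ with reward distribution $\nu_a$, the observed sample is drawn (assuming the corruption event is independent at each round) from the mixture
$$ (1 - \varepsilon)\, \nu_a + \varepsilon\, H_a, $$
where $H_a$ is an arbitrary, possibly unbounded corruption distribution and $\varepsilon \in (0, \tfrac{1}{2})$, while the regret is still measured with respect to the uncorrupted arm means.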
BibTeX access
https://arxiv.org/pdf/2309.16563

Preprints, Working Papers, …

Title
Differentially Private Best-Arm Identification
Authors
Achraf Azize, Marc Jourdan, Aymen Al Marjani, Debabrota Basu
Reference
2024
Abstract
Best Arm Identification (BAI) problems are progressively used for data-sensitive applications, such as designing adaptive clinical trials, tuning hyper-parameters, and conducting user studies. Motivated by the data privacy concerns invoked by these applications, we study the problem of BAI with fixed confidence in both the local and central models, i.e. $\epsilon$-local and $\epsilon$-global Differential Privacy (DP). First, to quantify the cost of privacy, we derive lower bounds on the sample complexity of any $\delta$-correct BAI algorithm satisfying $\epsilon$-global DP or $\epsilon$-local DP. Our lower bounds suggest the existence of two privacy regimes. In the high-privacy regime, the hardness depends on a coupled effect of privacy and novel information-theoretic quantities involving the Total Variation. In the low-privacy regime, the lower bounds reduce to the non-private lower bounds. We propose $\epsilon$-local DP and $\epsilon$-global DP variants of a Top Two algorithm, namely CTB-TT and AdaP-TT*, respectively. For $\epsilon$-local DP, CTB-TT is asymptotically optimal by plugging in a private estimator of the means based on Randomised Response. For $\epsilon$-global DP, our private estimator of the mean runs in arm-dependent adaptive episodes and adds Laplace noise to ensure a good privacy-utility trade-off. By adapting the transportation costs, the expected sample complexity of AdaP-TT* reaches the asymptotic lower bound up to multiplicative constants.
BibTeX access
https://arxiv.org/pdf/2406.06408

2023

Journal articles

Title
Optimal dose escalation methods using deep reinforcement learning in phase I oncology trials
Authors
Kentaro Matsuura, Kentaro Sakamaki, Junya Honda, Takashi Sozu
Reference
Journal of Biopharmaceutical Statistics, 2023, 33 (5), pp.639-652. ⟨10.1080/10543406.2023.2170402⟩
Abstract
In phase I trials of a novel anticancer drug, one of the most important objectives is to identify the maximum tolerated dose (MTD). To this end, a number of methods have been proposed and evaluated under various scenarios. However, the percentages of correct selection (PCS) of MTDs using previous methods are insufficient to determine the dose for late-phase trials. The purpose of this study is to construct an action rule for escalating or de-escalating the dose and continuing or stopping the trial to increase the PCS as much as possible. We show that deep reinforcement learning with an appropriately defined state, action, and reward can be used to construct such an action selection rule. The simulation study shows that the proposed method can improve the PCS compared with the 3 + 3 design, CRM, BLRM, BOIN, mTPI, and i3 + 3 methods.
BibTeX access

Conference papers

Title
Stability-penalty-adaptive follow-the-regularized-leader: Sparsity, game-dependency, and best-of-both-worlds
Authors
Taira Tsuchiya, Shinji Ito, Junya Honda
Reference
NeurIPS 2023 – Conference on Neural Information Processing Systems, Dec 2023, New Orleans, United States. pp.47406-47437
Abstract
Adaptivity to the difficulties of a problem is a key property in sequential decision-making problems to broaden the applicability of algorithms. Follow-the-regularized-leader (FTRL) has recently emerged as one of the most promising approaches for obtaining various types of adaptivity in bandit problems. Aiming to further generalize this adaptivity, we develop a generic adaptive learning rate, called the stability-penalty-adaptive (SPA) learning rate, for FTRL. This learning rate yields a regret bound jointly depending on the stability and penalty of the algorithm, into which the regret of FTRL is typically decomposed. With this result, we establish several algorithms with three types of adaptivity: sparsity, game-dependency, and best-of-both-worlds (BOBW). Despite the fact that sparsity appears frequently in real problems, existing sparse multi-armed bandit algorithms with k arms assume that the sparsity level s ≤ k is known in advance, which is often not the case in real-world scenarios. To address this issue, we first establish s-agnostic algorithms with regret bounds of O(sqrt{sT}) in the adversarial regime for T rounds, which matches the existing lower bound up to a logarithmic factor. Meanwhile, BOBW algorithms aim to achieve a near-optimal regret in both the stochastic and adversarial regimes. Leveraging the SPA learning rate and the technique for s-agnostic algorithms, combined with a new analysis to bound the variation in FTRL output in response to changes in a regularizer, we establish the first BOBW algorithm with a sparsity-dependent bound. Additionally, we explore partial monitoring and demonstrate that the proposed SPA learning rate framework allows us to achieve a game-dependent bound and BOBW guarantees simultaneously.
Full text and BibTeX access
https://hal.science/hal-04869334/file/NeurIPS-2023-stability-penalty-adaptive-follow-the-regularized-leader-sparsity-game-dependency-and-best-of-both-worlds-Paper-Conference.pdf
Title
An ε-Best-Arm Identification Algorithm for Fixed-Confidence and Beyond
Authors
Marc Jourdan, Rémy Degenne, Emilie Kaufmann
Reference
Advances in Neural Information Processing Systems (NeurIPS), Dec 2023, New Orleans, United States
Abstract
We propose EB-TCε, a novel sampling rule for ε-best-arm identification in stochastic bandits. It is the first instance of a Top Two algorithm analyzed for approximate best arm identification. EB-TCε is an anytime sampling rule that can therefore be employed without modification for fixed-confidence or fixed-budget identification (without prior knowledge of the budget). We provide three types of theoretical guarantees for EB-TCε. First, we prove bounds on its expected sample complexity in the fixed-confidence setting, notably showing its asymptotic optimality in combination with an adaptive tuning of its exploration parameter. We complement these findings with upper bounds on its probability of error at any time and for any error parameter, which further yield upper bounds on its simple regret at any time. Finally, we show through numerical simulations that EB-TCε performs favorably compared to existing algorithms, in different settings.
Full text and BibTeX access
https://hal.science/hal-04306214/file/TTeBAI.pdf
Title
Adaptive Algorithms for Relaxed Pareto Set Identification
Authors
Cyrille Kone, Emilie Kaufmann, Laura Richert
Reference
NeurIPS 2023 – 37th Conference on Neural Information Processing Systems, Dec 2023, New Orleans, LA, United States
Abstract
In this paper we revisit the fixed-confidence identification of the Pareto optimal set in a multi-objective multi-armed bandit model. As the sample complexity to identify the exact Pareto set can be very large, a relaxation allowing the algorithm to output some additional near-optimal arms has been studied. In this work we also tackle alternative relaxations that instead allow identifying a relevant subset of the Pareto set. Notably, we propose a single sampling strategy, called Adaptive Pareto Exploration, that can be used in conjunction with different stopping rules to take into account different relaxations of the Pareto Set Identification problem. We analyze the sample complexity of these different combinations, quantifying in particular the reduction in sample complexity that occurs when one seeks to identify at most k Pareto optimal arms. We showcase the good practical performance of Adaptive Pareto Exploration on a real-world scenario, in which we adaptively explore several vaccination strategies against Covid-19 in order to find the optimal ones when multiple immunogenicity criteria are taken into account.
Full text and BibTeX access
https://hal.science/hal-04306210/file/KKR23.pdf
Title
Fast Asymptotically Optimal Algorithms for Non-Parametric Stochastic Bandits
Authors
Dorian Baudry, Fabien Pesquerel, Rémy Degenne, Odalric-Ambrym Maillard
Reference
NeurIPS 2023 – Thirty-seventh Conference on Neural Information Processing Systems, Dec 2023, New Orleans (Louisiana), United States
Abstract
We consider the problem of regret minimization in non-parametric stochastic bandits. When the rewards are known to be bounded from above, there exist asymptotically optimal algorithms, with asymptotic regret depending on an infimum of Kullback-Leibler divergences (KL). These algorithms are computationally expensive and require storing all past rewards, thus simpler but non-optimal algorithms are often used instead. We introduce several methods to approximate the infimum KL which drastically reduce the computational and memory costs of existing optimal algorithms, while keeping their regret guarantees. We apply our findings to design new variants of the MED and IMED algorithms, and demonstrate their practical interest with extensive numerical simulations.
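The "infimum of Kullback-Leibler divergences" mentioned above is the usual quantity appearing in these lower bounds (standard definition for rewards bounded from above by some $B$):
$$ \mathcal{K}_{\inf}(F, \mu) \;=\; \inf\big\{ \mathrm{KL}(F, G) : G \text{ supported on } (-\infty, B],\ \mathbb{E}_{X \sim G}[X] > \mu \big\}, $$
i.e. the smallest divergence between an arm's (empirical) distribution $F$ and any alternative distribution with mean larger than $\mu$; the approximation methods introduced in the paper target this optimization problem.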
Full text and BibTeX access
https://inria.hal.science/hal-04337742/file/5313_fast_asymptotically_optimal_al.pdf
Title
On the Complexity of Differentially Private Best-Arm Identification with Fixed Confidence
Authors
Achraf Azize, Marc Jourdan, Aymen Al Marjani, Debabrota Basu
Reference
NeurIPS 2023 – Conference on Neural Information Processing Systems, Dec 2023, New Orleans, United States. pp.71150–71194
Abstract
Best Arm Identification (BAI) problems are progressively used for data-sensitive applications, such as designing adaptive clinical trials, tuning hyper-parameters, and conducting user studies to name a few. Motivated by the data privacy concerns invoked by these applications, we study the problem of BAI with fixed confidence under $\epsilon$-global Differential Privacy (DP). First, to quantify the cost of privacy, we derive a lower bound on the sample complexity of any $\delta$-correct BAI algorithm satisfying $\epsilon$-global DP. Our lower bound suggests the existence of two privacy regimes depending on the privacy budget $\epsilon$. In the high-privacy regime (small $\epsilon$), the hardness depends on a coupled effect of privacy and a novel information-theoretic quantity, called the Total Variation Characteristic Time. In the low-privacy regime (large $\epsilon$), the sample complexity lower bound reduces to the classical non-private lower bound. Second, we propose AdaP-TT, an $\epsilon$-global DP variant of the Top Two algorithm. AdaP-TT runs in arm-dependent adaptive episodes and adds Laplace noise to ensure a good privacy-utility trade-off. We derive an asymptotic upper bound on the sample complexity of AdaP-TT that matches with the lower bound up to multiplicative constants in the high-privacy regime. Finally, we provide an experimental analysis of AdaP-TT that validates our theoretical results.
BibTeX access
https://arxiv.org/pdf/2309.02202
Title
Logarithmic regret in communicating MDPs: Leveraging known dynamics with bandits
Authors
Hassan Saber, Fabien Pesquerel, Odalric-Ambrym Maillard, Mohammad Sadegh Talebi
Reference
Asian Conference on Machine Learning, Nov 2023, Istanbul, Turkey
Abstract
We study regret minimization in an average-reward and communicating Markov Decision Process (MDP) with known dynamics, but unknown reward function. Although learning in such MDPs is a priori easier than in fully unknown ones, they remain largely challenging, as they include as special cases large classes of problems such as combinatorial semi-bandits. Leveraging the knowledge of the transition function for regret minimization, in a statistically efficient way, appears largely unexplored. As it is conjectured that achieving exact optimality in generic MDPs is NP-hard, even with known transitions, we focus on a computationally efficient relaxation, at the cost of achieving order-optimal logarithmic regret instead of exact optimality. We contribute to filling this gap by introducing a novel algorithm based on the popular Indexed Minimum Empirical Divergence strategy for bandits. A key component of the proposed algorithm is a carefully designed stopping criterion leveraging the recurrent classes induced by stationary policies. We derive a non-asymptotic, problem-dependent, and logarithmic regret bound for this algorithm, which relies on a novel regret decomposition leveraging the structure. We further provide an efficient implementation and experiments illustrating its promising empirical performance.
Full text and BibTeX access
https://hal.science/hal-04241513/file/acml_2023_fabien.pdf
Title
Thompson Exploration with Best Challenger Rule in Best Arm Identification
Authors
Jongyeong Lee, Junya Honda, Masashi Sugiyama
Reference
ACML 2023 – Asian Conference on Machine Learning, Nov 2023, Istanbul, Turkey. pp.646-661
Abstract
This paper studies the fixed-confidence best arm identification (BAI) problem in the bandit framework in the canonical single-parameter exponential models. For this problem, many policies have been proposed, but most of them require solving an optimization problem at every round and/or are forced to explore an arm at least a certain number of times except those restricted to the Gaussian model. To address these limitations, we propose a novel policy that combines Thompson sampling with a computationally efficient approach known as the best challenger rule. While Thompson sampling was originally considered for maximizing the cumulative reward, we demonstrate that it can be used to naturally explore arms in BAI without forcing it. We show that our policy is asymptotically optimal for any two-armed bandit problems and achieves near optimality for general K-armed bandit problems for K ≥ 3. Nevertheless, in numerical experiments, our policy shows competitive performance compared to asymptotically optimal policies in terms of sample complexity while requiring less computation cost. In addition, we highlight the advantages of our policy by comparing it to the concept of β-optimality, a relaxed notion of asymptotic optimality commonly considered in the analysis of a class of policies including the proposed one.
Full text and BibTeX access
https://hal.science/hal-04869296/file/lee24a.pdf
Title
Interactive and Concentrated Differential Privacy for Bandits
Authors
Achraf Azize, Debabrota Basu
Reference
EWRL 2023 – European Workshop on Reinforcement Learning, Sep 2023, Brussels, Belgium
Abstract
Bandits play a crucial role in interactive learning schemes and modern recommender systems. However, these systems often rely on sensitive user data, making privacy a critical concern. This paper investigates privacy in bandits with a trusted centralized decision-maker through the lens of interactive Differential Privacy (DP). While bandits under pure $\epsilon$-global DP have been well-studied, we contribute to the understanding of bandits under zero Concentrated DP (zCDP). We provide minimax and problem-dependent lower bounds on regret for finite-armed and linear bandits, which quantify the cost of $\rho$-global zCDP in these settings. These lower bounds reveal two hardness regimes based on the privacy budget $\rho$ and suggest that $\rho$-global zCDP incurs less regret than pure $\epsilon$-global DP. We propose two $\rho$-global zCDP bandit algorithms, AdaC-UCB and AdaC-GOPE, for finite-armed and linear bandits respectively. Both algorithms use a common recipe of Gaussian mechanism and adaptive episodes. We analyze the regret of these algorithms to show that AdaC-UCB achieves the problem-dependent regret lower bound up to multiplicative constants, while AdaC-GOPE achieves the minimax regret lower bound up to poly-logarithmic factors. Finally, we provide experimental validation of our theoretical results under different settings.
BibTeX access
https://arxiv.org/pdf/2309.00557
Title
Optimality of Thompson Sampling with Noninformative Priors for Pareto Bandits
Authors
Jongyeong Lee, Junya Honda, Chao-Kai Chiang, Masashi Sugiyama
Reference
International Conference on Machine Learning, Jul 2023, Hawaii, United States. pp.18810-18851
Abstract
In the stochastic multi-armed bandit problem, a randomized probability matching policy called Thompson sampling (TS) has shown excellent performance in various reward models. In addition to the empirical performance, TS has been shown to achieve asymptotic problem-dependent lower bounds in several models. However, its optimality has been mainly addressed under light-tailed or one-parameter models that belong to exponential families. In this paper, we consider the optimality of TS for the Pareto model that has a heavy tail and is parameterized by two unknown parameters. Specifically, we discuss the optimality of TS with probability matching priors that include the Jeffreys prior and the reference priors. We first prove that TS with certain probability matching priors can achieve the optimal regret bound. Then, we show the suboptimality of TS with other priors, including the Jeffreys and the reference priors. Nevertheless, we find that TS with the Jeffreys and reference priors can achieve the asymptotic lower bound if one uses a truncation procedure. These results suggest carefully choosing noninformative priors to avoid suboptimality and show the effectiveness of truncation procedures in TS-based policies.
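As a reminder of the mechanism under study, generic Thompson sampling proceeds as follows (schematic description, not the specific two-parameter Pareto procedure of the paper): maintain a posterior $\pi_a(\cdot \mid \mathcal{H}_{t-1})$ over the parameters of each arm $a$, and at each round $t$ draw
$$ \tilde{\theta}_a \sim \pi_a(\cdot \mid \mathcal{H}_{t-1}) \ \text{ for each arm } a, \qquad A_t \in \arg\max_{a} \mu(\tilde{\theta}_a), $$
where $\mu(\theta)$ denotes the mean reward under parameter $\theta$; the question addressed above is which noninformative priors make this scheme asymptotically optimal when rewards follow a Pareto distribution with two unknown parameters.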
Full text and BibTeX access
https://hal.science/hal-04872012/file/lee23d.pdf
Title
Risk-aware linear bandits with convex loss
Authors
Patrick Saux, Odalric-Ambrym Maillard
Reference
International Conference on Artificial Intelligence and Statistics (AISTATS), Apr 2023, Valencia, Spain
Abstract
In decision-making problems such as the multi-armed bandit, an agent learns sequentially by optimizing a certain feedback. While the mean reward criterion has been extensively studied, other measures that reflect an aversion to adverse outcomes, such as mean-variance or conditional value-at-risk (CVaR), can be of interest for critical applications (healthcare, agriculture). Algorithms have been proposed for such risk-aware measures under bandit feedback without contextual information. In this work, we study contextual bandits where such risk measures can be elicited as linear functions of the contexts through the minimization of a convex loss. A typical example that fits within this framework is the expectile measure, which is obtained as the solution of an asymmetric least-squares problem. Using the method of mixtures for supermartingales, we derive confidence sequences for the estimation of such risk measures. We then propose an optimistic UCB algorithm to learn optimal risk-aware actions, with regret guarantees similar to those of generalized linear bandits. This approach requires solving a convex problem at each round of the algorithm, which we can relax by allowing only an approximate solution obtained by online gradient descent, at the cost of a slightly higher regret. We conclude by evaluating the resulting algorithms in numerical experiments.
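For completeness, the expectile mentioned above is defined through an asymmetric least-squares problem (standard definition): for a level $\tau \in (0,1)$, the $\tau$-expectile of a random variable $X$ is
$$ e_\tau(X) \;=\; \arg\min_{m \in \mathbb{R}} \; \mathbb{E}\big[\, |\tau - \mathbf{1}\{X \le m\}|\, (X - m)^2 \,\big], $$
which reduces to the mean at $\tau = \tfrac{1}{2}$ and puts more weight on one tail otherwise, which is what makes it a risk-aware criterion.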
Full text and BibTeX access
https://hal.science/hal-04044440/file/main.pdf
Title
Further Adaptive Best-of-Both-Worlds Algorithm for Combinatorial Semi-Bandits
Authors
Taira Tsuchiya, Shinji Ito, Junya Honda
Reference
International Conference on Artificial Intelligence and Statistics (AISTATS), Apr 2023, Valencia, Spain. pp.8144
Abstract
We consider the combinatorial semi-bandit problem and present a new algorithm with a best-of-both-worlds regret guarantee, in which the regrets are near-optimally bounded in the stochastic and adversarial regimes. In the stochastic regime, we prove a variance-dependent regret bound depending on the tight suboptimality gap introduced by Kveton et al. (2015) with a good leading constant. In the adversarial regime, we show that the same algorithm simultaneously obtains various data-dependent regret bounds. Our algorithm is based on the follow-the-regularized-leader framework with a refined regularizer and adaptive learning rate. Finally, we numerically test the proposed algorithm and confirm its superior or competitive performance compared to existing algorithms, including Thompson sampling, under most settings.
Full text and BibTeX access
https://hal.science/hal-04847083/file/tsuchiya23a.pdf
Title
Dealing with Unknown Variances in Best-Arm Identification
Authors
Marc Jourdan, Rémy Degenne, Emilie Kaufmann
Reference
Algorithmic Learning Theory (ALT), Feb 2023, Singapore
Abstract
The problem of identifying the best arm among a collection of items having Gaussian reward distributions is well understood when the variances are known. Despite its practical relevance for many applications, few works have studied it for unknown variances. In this paper we introduce and analyze two approaches to deal with unknown variances, either by plugging in the empirical variance or by adapting the transportation costs. In order to calibrate our two stopping rules, we derive new time-uniform concentration inequalities, which are of independent interest. Then, we illustrate the theoretical and empirical performance of our two sampling rule wrappers on Track-and-Stop and on a Top Two algorithm. Moreover, by quantifying the impact on the sample complexity of not knowing the variances, we reveal that it is rather small.
Full text and BibTeX access
https://hal.science/hal-04306221/file/BAIUV.pdf
Title
Follow-the-Perturbed-Leader Achieves Best-of-Both-Worlds for Bandit Problems
Authors
Junya Honda, Shinji Ito, Taira Tsuchiya
Reference
International Conference on Algorithmic Learning Theory (ALT), Feb 2023, Singapore. pp.726-754
Abstract
This paper discusses the adversarial and stochastic K-armed bandit problems. In the adversarial setting, the best possible regret is known to be $O(\sqrt{KT})$ for time horizon T. This bound can be achieved by several policies, but they require explicitly computing the arm-selection probabilities by solving optimization problems at each round, which becomes problematic in some settings. One promising candidate to avoid this issue is the Follow-The-Perturbed-Leader (FTPL) policy, which simply chooses the arm with the minimum cumulative estimated loss with a random perturbation. In particular, it has been conjectured that $O(\sqrt{KT})$ regret might be achieved by FTPL with a Fréchet-type perturbation. This paper affirmatively resolves this conjecture by showing that a Fréchet perturbation indeed achieves this bound. We also show that FTPL achieves a logarithmic regret in the stochastic setting, meaning that FTPL achieves best-of-both-worlds regret bounds. The key to these bounds is a novel technique to evaluate the stability of the policy and to express the regret at each round in multiple forms depending on the estimated losses.
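Schematically, the FTPL rule described above selects at each round $t$
$$ A_t \in \arg\min_{i \in [K]} \Big( \hat{L}_{t-1,i} - \frac{r_{t,i}}{\eta_t} \Big), $$
where $\hat{L}_{t-1,i}$ is the cumulative estimated loss of arm $i$, the $r_{t,i}$ are i.i.d. random perturbations (of Fréchet type here) and $\eta_t$ is a learning-rate-like scaling; the exact placement of the scaling varies between formulations, so this display only illustrates the "minimum perturbed cumulative loss" principle stated in the abstract.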
Full text and BibTeX access
https://hal.science/hal-04872043/file/honda23a.pdf
Title
Best-of-Both-Worlds Algorithms for Partial Monitoring
Authors
Taira Tsuchiya, Shinji Ito, Junya Honda
Reference
International Conference on Algorithmic Learning Theory (ALT), Feb 2023, Singapore. pp.8117-8144
Abstract
This study considers the partial monitoring problem with k actions and d outcomes and provides the first best-of-both-worlds algorithms, whose regrets are favorably bounded both in the stochastic and adversarial regimes. In particular, we show that for non-degenerate locally observable games, the regret is O(m^2 k^4 log(T) log(k_Π T)/∆_min) in the stochastic regime and O(m k^{3/2} sqrt{T log(T) log k_Π}) in the adversarial regime, where T is the number of rounds, m is the maximum number of distinct observations per action, ∆_min is the minimum suboptimality gap, and k_Π is the number of Pareto optimal actions. Moreover, we show that for globally observable games, the regret is O(c_G^2 log(T) log(k_Π T)/∆_min^2) in the stochastic regime, where c_G is a game-dependent constant. We also provide regret bounds for a stochastic regime with adversarial corruptions. Our algorithms are based on the follow-the-regularized-leader framework and are inspired by the approach of exploration by optimization and the adaptive learning rate in the field of online learning with feedback graphs.
Full text and BibTeX access
https://hal.science/hal-04872033/file/tsuchiya23a.pdf

2022

Conference papers

Title
Minimax Optimal Algorithms for Fixed-Budget Best Arm Identification
Authors
Junpei Komiyama, Taira Tsuchiya, Junya Honda
Reference
Neural Information Processing Systems (NeurIPS), Nov 2022, New Orleans, United States. pp.10393-10404
Abstract
We consider the fixed-budget best arm identification problem where the goal is to find the arm of the largest mean with a fixed number of samples. It is known that the probability of misidentifying the best arm decays exponentially in the number of rounds. However, only limited characterizations of the rate (exponent) of this decay have been established. In this paper, we characterize the minimax optimal rate as a result of an optimization over all possible parameters. We introduce two rates, R^{go} and R_∞^{go}, corresponding to lower bounds on the probability of misidentification, each of which is associated with a proposed algorithm. The rate R^{go} is associated with R^{go}-tracking, which can be efficiently implemented by a neural network and is shown to outperform existing algorithms. However, this rate requires a nontrivial condition to be achievable. To address this issue, we introduce the second rate R_∞^{go}. We show that this rate is indeed achievable by introducing a conceptual algorithm called delayed optimal tracking (DOT).
Full text and BibTeX access
https://hal.science/hal-04872057/file/NeurIPS-2022-minimax-optimal-algorithms-for-fixed-budget-best-arm-identification-Paper-Conference.pdf
Title
Nearly Optimal Best-of-Both-Worlds Algorithms for Online Learning with Feedback Graphs
Authors
Shinji Ito, Taira Tsuchiya, Junya Honda
Reference
NeurIPS 2022 – 36th Conference on Neural Information Processing Systems, Nov 2022, New Orleans, United States. pp.28631-28643
Abstract
This study considers online learning with general directed feedback graphs. For this problem, we present best-of-both-worlds algorithms that achieve nearly tight regret bounds for adversarial environments as well as poly-logarithmic regret bounds for stochastic environments. As Alon et al. [2015] have shown, tight regret bounds depend on the structure of the feedback graph: strongly observable graphs yield minimax regret of Θ(α^{1/2} T^{1/2}), while weakly observable graphs induce minimax regret of Θ(δ^{1/3} T^{2/3}), where α and δ, respectively, represent the independence number of the graph and the domination number of a certain portion of the graph. Our proposed algorithm for strongly observable graphs has a regret bound of Õ(α^{1/2} T^{1/2}) for adversarial environments, as well as of O(α(ln T)^3 / Δ_min ) for stochastic environments, where Δ_min expresses the minimum suboptimality gap. This result resolves an open question raised by Erez and Koren [2021]. We also provide an algorithm for weakly observable graphs that achieves a regret bound of Õ(δ^{1/3} T^{2/3}) for adversarial environments and poly-logarithmic regret for stochastic environments. The proposed algorithms are based on the follow-the-regularized-leader approach combined with newly designed update rules for learning rates.
Full text and BibTeX access
https://hal.science/hal-04869222/file/NeurIPS-2022-nearly-optimal-best-of-both-worlds-algorithms-for-online-learning-with-feedback-graphs-Paper-Conference.pdf
Title
IMED-RL: Regret optimal learning of ergodic Markov decision processes
Authors
Fabien Pesquerel, Odalric-Ambrym Maillard
Reference
NeurIPS 2022 – Thirty-sixth Conference on Neural Information Processing Systems, Nov 2022, New Orleans, United States
Abstract
We consider reinforcement learning in a discrete, undiscounted, infinite-horizon Markov Decision Problem (MDP) under the average reward criterion, and focus on the minimization of the regret with respect to an optimal policy, when the learner does not know the rewards nor the transitions of the MDP. In light of their success at regret minimization in multi-armed bandits, popular bandit strategies, such as the optimistic UCB, KL-UCB or the Bayesian Thompson sampling strategy, have been extended to the MDP setup. Despite some key successes, existing strategies for solving this problem either fail to be provably asymptotically optimal, or suffer from a prohibitive burn-in phase and computational complexity when implemented in practice. In this work, we shed a novel light on regret minimization strategies, by extending to reinforcement learning the computationally appealing Indexed Minimum Empirical Divergence (IMED) bandit algorithm. Traditional asymptotic problem-dependent lower bounds on the regret are known under the assumption that the MDP is ergodic. Under this assumption, we introduce IMED-RL and prove that its regret upper bound asymptotically matches the regret lower bound. We discuss both the case when the supports of the transitions are unknown, and the more informative but a priori harder-to-exploit-optimally case when they are known. Rewards are assumed to be light-tailed and semi-bounded from above. Last, we provide numerical illustrations on classical tabular MDPs, ergodic and communicating only, showing the competitiveness of IMED-RL in finite time against state-of-the-art algorithms. IMED-RL also benefits from a low computational complexity.
Full text and BibTeX access
https://hal.science/hal-03825423/file/IMED_RL_pesquerel_neurips.pdf

2021

Journal articles

Title
Optimal adaptive allocation using deep reinforcement learning in a dose-response study
Authors
Kentaro Matsuura, Junya Honda, Imad El Hanafi, Takashi Sozu, Kentaro Sakamaki
Reference
Statistics in Medicine, 2021, 41 (7), pp.1157-1171. ⟨10.1002/sim.9247⟩
Abstract
Estimation of the dose-response curve for efficacy and subsequent selection of an appropriate dose in phase II trials are important processes in drug development. Various methods have been investigated to estimate dose-response curves. Generally, these methods are used with equal allocation of subjects for simplicity; nevertheless, they may not fully optimize performance metrics because of nonoptimal allocation. Optimal allocation methods, which include adaptive allocation methods, have been proposed to overcome the limitations of equal allocation. However, they rely on asymptotics, and thus sometimes cannot efficiently optimize the performance metric with the sample size of an actual clinical trial. The purpose of this study is to construct an adaptive allocation rule that directly optimizes a performance metric, such as power, accuracy of model selection, accuracy of the estimated target dose, or mean absolute error over the estimated dose-response curve. We demonstrate that deep reinforcement learning with an appropriately defined state and reward can be used to construct such an adaptive allocation rule. The simulation study shows that the proposed method can successfully improve the performance metric to be optimized when compared with the equal allocation, D-optimal, and TD-optimal methods. In particular, when the mean absolute error is set as the metric to be optimized, it is possible to construct a rule that is superior for many metrics.
BibTeX access