# American Institute of Mathematical Sciences

• Previous Article
EmT: Locating empty territories of homology group generators in a dataset
• FoDS Home
• This Issue
• Next Article
Levels and trends in the sex ratio at birth and missing female births for 29 states and union territories in India 1990–2016: A Bayesian modeling study
June  2019, 1(2): 197-225. doi: 10.3934/fods.2019009

## On adaptive estimation for dynamic Bernoulli bandits

 Department of Mathematics, Imperial College London, London, SW7 2AZ, UK

* Corresponding author: Nikolas Kantas

Published  June 2019

The multi-armed bandit (MAB) problem is a classic example of the exploration-exploitation dilemma. It is concerned with maximising the total rewards for a gambler by sequentially pulling an arm from a multi-armed slot machine where each arm is associated with a reward distribution. In static MABs, the reward distributions do not change over time, while in dynamic MABs, each arm's reward distribution can change, and the optimal arm can switch over time. Motivated by many real applications where rewards are binary, we focus on dynamic Bernoulli bandits. Standard methods like $\epsilon$-Greedy and Upper Confidence Bound (UCB), which rely on the sample mean estimator, often fail to track changes in the underlying reward for dynamic problems. In this paper, we overcome the shortcoming of slow response to change by deploying adaptive estimation in the standard methods and propose a new family of algorithms, which are adaptive versions of $\epsilon$-Greedy, UCB, and Thompson sampling. These new methods are simple and easy to implement. Moreover, they do not require any prior knowledge about the dynamic reward process, which is important for real applications. We examine the new algorithms numerically in different scenarios and the results show solid improvements of our algorithms in dynamic environments.

Citation: Xue Lu, Niall Adams, Nikolas Kantas. On adaptive estimation for dynamic Bernoulli bandits. Foundations of Data Science, 2019, 1 (2) : 197-225. doi: 10.3934/fods.2019009
##### References:

show all references

##### References:
Illustration of the difference between tuning $d$ in AFF-$d$-Greedy and tuning $\epsilon$ in `adaptive estimation $\epsilon$-Greedy'. The step size $\eta = 0.01$
Performance of different algorithms in the case of small number of changes
Abruptly changing scenario (Case 1): examples of $\mu_{t}$ sampled from the model in (28) with parameters of Case 1 displayed in Table 1
Abruptly changing scenario (Case 2): examples of $\mu_{t}$ sampled from the model in (28) with parameters of Casek 2 displayed in Table 1
Results for the two-armed Bernoulli bandit with abruptly changing expected rewards. The top row displays the cumulative regret over time; results are averaged over 100 replications. The bottom row are boxplots of total regret at time $t = 10,000$. Trajectories are sampled from (28) with parameters displayed in Table 1
Drifting scenario (Case 3): examples of $\mu_t$ simulated from the model in (29) with $\sigma^{2}_{\mu} = 0.0001$
Drifting scenario (Case 4): examples of $\mu_t$ simulated from the model in (30) with $\sigma^{2}_{\mu} = 0.001$
Results for the two-armed Bernoulli bandit with drifting expected rewards. The top row displays the cumulative regret over time; results are averaged over 100 independent replications. The bottom row are boxplots of total regret at time $t = 10,000$. Trajectories for Case 3 are sampled from (29) with $\sigma^{2}_{\mu} = 0.0001$, and trajectories for Case 4 are sampled from (30) with $\sigma^{2}_{\mu} = 0.001$
Large number of arms: abruptly changing environment (Case 1)
Large number of arms: abruptly changing environment (Case 2)
Large number of arms: drifting environment (Case 3)
Large number of arms: drifting environment (Case 4)
AFF-$d$-Greedy algorithm with different $\eta$ values. $\eta_{1} = 0.0001, \eta_{2} = 0.001$, $\eta_{3} = 0.01$, and $\eta_{4}(t) = 0.0001/s^{2}_{t}$, where $s^{2}_{t}$ is as in (11)
AFF versions of UCB algorithm with different $\eta$ values. $\eta_{1} = 0.0001$, $\eta_{2} = 0.001$, $\eta_{3} = 0.01$, and $\eta_{4}(t) = 0.0001/s^{2}_{t}$, where $s^{2}_{t}$ is as in (11)
AFF versions of TS algorithm with different $\eta$ values. $\eta_{1} = 0.0001$, $\eta_{2} = 0.001$, $\eta_{3} = 0.01$, and $\eta_{4}(t) = 0.0001/s^{2}_{t}$, where $s^{2}_{t}$ is as in (11)
D-UCB and SW-UCB algorithms with different values of key parameters
Boxplot of total regret for algorithms DTS, AFF-DTS1, and AFF-DTS2. Acronym like DTS-C5 represents the DTS algorithm with parameter $C = 5$. Similarly acronym like AFF-DTS1-C5 represents the AFF-DTS1 algorithm with initial value $C_{0} = 5$. The result of AFF-OTS is plotted as a benchmark
Parameters used in the exponential clock model shown in (28)
 Case 1 Case 2 $\theta$ $r_{l}$ $r_{u}$ $\theta$ $r_{l}$ $r_{u}$ Arm 1 0.001 0.0 1.0 0.001 0.3 1.0 Arm 2 0.010 0.0 1.0 0.010 0.0 0.7
 Case 1 Case 2 $\theta$ $r_{l}$ $r_{u}$ $\theta$ $r_{l}$ $r_{u}$ Arm 1 0.001 0.0 1.0 0.001 0.3 1.0 Arm 2 0.010 0.0 1.0 0.010 0.0 0.7
 [1] Tamar Friedlander, Naama Brenner. Adaptive response and enlargement of dynamic range. Mathematical Biosciences & Engineering, 2011, 8 (2) : 515-528. doi: 10.3934/mbe.2011.8.515 [2] Christopher Rackauckas, Qing Nie. Adaptive methods for stochastic differential equations via natural embeddings and rejection sampling with memory. Discrete & Continuous Dynamical Systems - B, 2017, 22 (7) : 2731-2761. doi: 10.3934/dcdsb.2017133 [3] Dongho Kim, Eun-Jae Park. Adaptive Crank-Nicolson methods with dynamic finite-element spaces for parabolic problems. Discrete & Continuous Dynamical Systems - B, 2008, 10 (4) : 873-886. doi: 10.3934/dcdsb.2008.10.873 [4] Arthur Henrique Caixeta, Irena Lasiecka, Valéria Neves Domingos Cavalcanti. On long time behavior of Moore-Gibson-Thompson equation with molecular relaxation. Evolution Equations & Control Theory, 2016, 5 (4) : 661-676. doi: 10.3934/eect.2016024 [5] Luciano Abadías, Carlos Lizama, Marina Murillo-Arcila. Hölder regularity for the Moore-Gibson-Thompson equation with infinite delay. Communications on Pure & Applied Analysis, 2018, 17 (1) : 243-265. doi: 10.3934/cpaa.2018015 [6] Marta Pellicer, Joan Solà-Morales. Optimal scalar products in the Moore-Gibson-Thompson equation. Evolution Equations & Control Theory, 2019, 8 (1) : 203-220. doi: 10.3934/eect.2019011 [7] Alexandre J. Chorin, Fei Lu, Robert N. Miller, Matthias Morzfeld, Xuemin Tu. Sampling, feasibility, and priors in data assimilation. Discrete & Continuous Dynamical Systems - A, 2016, 36 (8) : 4227-4246. doi: 10.3934/dcds.2016.36.4227 [8] Omri M. Sarig. Bernoulli equilibrium states for surface diffeomorphisms. Journal of Modern Dynamics, 2011, 5 (3) : 593-608. doi: 10.3934/jmd.2011.5.593 [9] Matthew Nicol. Induced maps of hyperbolic Bernoulli systems. Discrete & Continuous Dynamical Systems - A, 2001, 7 (1) : 147-154. doi: 10.3934/dcds.2001.7.147 [10] Hajnal R. Tóth. Infinite Bernoulli convolutions with different probabilities. Discrete & Continuous Dynamical Systems - A, 2008, 21 (2) : 595-600. doi: 10.3934/dcds.2008.21.595 [11] Peter Monk, Virginia Selgas. Sampling type methods for an inverse waveguide problem. Inverse Problems & Imaging, 2012, 6 (4) : 709-747. doi: 10.3934/ipi.2012.6.709 [12] Martin Hanke. Why linear sampling really seems to work. Inverse Problems & Imaging, 2008, 2 (3) : 373-395. doi: 10.3934/ipi.2008.2.373 [13] T. Hillen, K. Painter, Christian Schmeiser. Global existence for chemotaxis with finite sampling radius. Discrete & Continuous Dynamical Systems - B, 2007, 7 (1) : 125-144. doi: 10.3934/dcdsb.2007.7.125 [14] Kamil Rajdl, Petr Lansky. Fano factor estimation. Mathematical Biosciences & Engineering, 2014, 11 (1) : 105-123. doi: 10.3934/mbe.2014.11.105 [15] Jiying Liu, Jubo Zhu, Fengxia Yan, Zenghui Zhang. Compressive sampling and $l_1$ minimization for SAR imaging with low sampling rate. Inverse Problems & Imaging, 2013, 7 (4) : 1295-1305. doi: 10.3934/ipi.2013.7.1295 [16] João M. Lemos, Fernando Machado, Nuno Nogueira, Luís Rato, Manuel Rijo. Adaptive and non-adaptive model predictive control of an irrigation channel. Networks & Heterogeneous Media, 2009, 4 (2) : 303-324. doi: 10.3934/nhm.2009.4.303 [17] Imen Bhouri, Houssem Tlili. On the multifractal formalism for Bernoulli products of invertible matrices. Discrete & Continuous Dynamical Systems - A, 2009, 24 (4) : 1129-1145. doi: 10.3934/dcds.2009.24.1129 [18] Ben A. Vanderlei, Matthew M. Hopkins, Lisa J. Fauci. Error estimation for immersed interface solutions. Discrete & Continuous Dynamical Systems - B, 2012, 17 (4) : 1185-1203. doi: 10.3934/dcdsb.2012.17.1185 [19] H.T. Banks, Jimena L. Davis. Quantifying uncertainty in the estimation of probability distributions. Mathematical Biosciences & Engineering, 2008, 5 (4) : 647-667. doi: 10.3934/mbe.2008.5.647 [20] Azmy S. Ackleh, Jeremy J. Thibodeaux. Parameter estimation in a structured erythropoiesis model. Mathematical Biosciences & Engineering, 2008, 5 (4) : 601-616. doi: 10.3934/mbe.2008.5.601

Impact Factor: