Collaborative Best Arm Identification in Multi-armed Bandits
Date Issued
01-01-2022
Author(s)
Abstract
We consider the best arm identification problem in stochastic multi-armed bandits (MAB) under a fixed-budget setting where the agents learning the MAB instance are connected through a network. In our collaborative model, each agent's action and its associated reward are instantaneously observed by the agent as well as by its one-hop neighbours. At the end of the budget, an ensemble learner (EL) aggregates samples from the agents and decides the best arm. The objective is to design arm selection policies for each of the agents (and a suitable aggregation policy for the EL) that minimize the probability of incorrect arm identification. We propose two collaborative learning policies that first partition the network of agents into dominating sets. We then employ (i) a highly exploring UCB-E variant or (ii) a 'Follow Your Leader' policy within each star partition, and at the end of the budget, the EL aggregates the samples from all the partitions to select the best arm. We show that our policies enjoy a probability of error that decays exponentially in both the number of agents and the budget. Notably, our policies do not require a minimum dominating set partition of the graph (an NP-hard problem), and our error probability bounds are independent of the size of the dominating set partition.
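The abstract outlines a three-stage pipeline: partition the agent network into star partitions via a dominating set, run a highly exploring index policy within each star, and let the ensemble learner pool all samples to pick the best arm. The following minimal Python sketch illustrates one plausible instantiation of that pipeline; the function names (greedy_star_partition, ucb_e_star, ensemble_decide), the Bernoulli reward model, the greedy dominating-set heuristic, and the choice of exploration parameter are all illustrative assumptions, not the authors' exact algorithm or analysis.

import numpy as np

def greedy_star_partition(adj):
    # Greedily build a dominating set of "leader" agents and assign every
    # agent to exactly one leader's star; each member is a one-hop
    # neighbour of its leader (or the leader itself).
    uncovered = set(adj)
    stars = {}
    while uncovered:
        leader = max(adj, key=lambda v: len((adj[v] | {v}) & uncovered))
        members = (adj[leader] | {leader}) & uncovered
        stars[leader] = sorted(members)
        uncovered -= members
    return stars

def ucb_e_star(means, members, budget, a, rng):
    # UCB-E-style index policy run within one star partition: because the
    # leader observes every member's pull, the star pools its samples.
    # Returns per-arm reward sums and pull counts for this star.
    k = len(means)
    sums, counts = np.zeros(k), np.zeros(k)
    for t in range(budget * len(members)):
        if t < k:
            arm = t                                   # pull each arm once first
        else:
            arm = int(np.argmax(sums / counts + np.sqrt(a / counts)))
        reward = rng.binomial(1, means[arm])          # Bernoulli rewards (assumed)
        sums[arm] += reward
        counts[arm] += 1
    return sums, counts

def ensemble_decide(star_stats):
    # Ensemble learner: pool the samples from all star partitions and
    # declare the arm with the largest pooled empirical mean.
    total_sum = sum(s for s, _ in star_stats)
    total_cnt = sum(c for _, c in star_stats)
    return int(np.argmax(total_sum / np.maximum(total_cnt, 1)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    means = np.array([0.3, 0.5, 0.45, 0.7])                  # unknown arm means
    adj = {0: {1, 2}, 1: {0}, 2: {0, 3}, 3: {2, 4}, 4: {3}}  # agent network
    budget = 200                                             # per-agent budget
    a = budget / 4        # exploration parameter; a large value makes the
                          # policy "highly exploring" (exact choice assumed)
    stars = greedy_star_partition(adj)
    stats = [ucb_e_star(means, m, budget, a, rng) for m in stars.values()]
    print("star partitions:", stars)
    print("EL's best-arm estimate:", ensemble_decide(stats))

Note that the greedy partition above need not be a minimum dominating set, which mirrors the abstract's remark that a minimum dominating set partition is not required.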