Optimal Learning for Structured Bandits

APA

(2022). Optimal Learning for Structured Bandits. The Simons Institute for the Theory of Computing. https://old.simons.berkeley.edu/talks/optimal-learning-structured-bandits

MLA

Optimal Learning for Structured Bandits. The Simons Institute for the Theory of Computing, Oct. 13, 2022, https://old.simons.berkeley.edu/talks/optimal-learning-structured-bandits

BibTeX

@misc{scivideos_22749,
  url       = {https://old.simons.berkeley.edu/talks/optimal-learning-structured-bandits},
  language  = {en},
  title     = {Optimal Learning for Structured Bandits},
  publisher = {The Simons Institute for the Theory of Computing},
  year      = {2022},
  month     = {oct},
  note      = {22749, see \url{https://scivideos.org/simons-institute/22749}}
}
Negin Golrezei (Massachusetts Institute of Technology)
Source Repository: Simons Institute

Abstract

In this work, we study structured multi-armed bandits: the problem of online decision-making under uncertainty in the presence of structural information. In this problem, the decision-maker must discover the best course of action despite observing only uncertain rewards over time. The decision-maker knows certain structural information about the reward distributions and would like to exploit it to minimize regret, where regret is the performance difference against a benchmark policy that knows the best action ahead of time. In the absence of structural information, the classical upper confidence bound (UCB) and Thompson sampling algorithms are well known to suffer only minimal regret. As recently pointed out, however, neither algorithm is capable of exploiting structural information that is commonly available in practice. We propose a novel learning algorithm, called DUSA, whose worst-case regret matches the information-theoretic regret lower bound up to a constant factor and which can handle a wide range of structural information. DUSA solves a dual counterpart of the regret lower bound at the empirical reward distribution and follows its suggested play. Our proposed algorithm is the first computationally viable learning policy for structured bandit problems with asymptotically minimal regret.
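To make the unstructured baseline concrete, the UCB1 algorithm mentioned in the abstract can be sketched as follows. This is a minimal illustration of the classical structure-free policy, not the talk's DUSA algorithm; the Bernoulli arm means and the horizon are arbitrary assumptions for the toy example.

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """Classical UCB1: play each arm once, then repeatedly pick the arm
    maximizing empirical mean + sqrt(2 ln t / n_a) exploration bonus."""
    counts = [0] * n_arms       # times each arm has been pulled
    sums = [0.0] * n_arms       # total reward observed per arm
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1         # initialization round: try every arm once
        else:
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2 * math.log(t) / counts[a]),
            )
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
    return counts, sums

# Toy Bernoulli bandit (hypothetical means); arm 1 is best.
random.seed(0)
means = [0.3, 0.9, 0.5]
counts, sums = ucb1(
    lambda a: 1.0 if random.random() < means[a] else 0.0,
    n_arms=3,
    horizon=2000,
)
```

After 2000 rounds, the pull counts concentrate on the best arm: regret grows only logarithmically in the horizon. The structured setting of the talk differs in that known relations among the arms' distributions (which UCB1 ignores) can be exploited to learn faster.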