We discuss the solution of multistage decision problems using methods that are based on the idea of policy iteration (PI for short), i.e., starting from some base policy and generating an improved policy. Rollout is the simplest method of this type, where just one improved policy is generated. We can view PI as repeated application of rollout, where the rollout policy at each iteration serves as the base policy for the next iteration. In contrast with PI, rollout can be applied on-line and is suitable for on-line replanning. Moreover, rollout can use as its base policy one of the policies produced by PI, thereby improving on that policy. This is the type of scheme underlying the prominently successful AlphaZero chess program.
In this paper we focus on rollout and PI-like methods for multiagent problems, where the control consists of multiple components, each selected (conceptually) by a separate agent. We discuss an approach whereby, at every stage, the agents sequentially (one at a time) execute a local rollout algorithm that uses a base policy, together with some coordinating information from the other agents. The total computation required at every stage grows linearly with the number of agents. By contrast, in the standard rollout algorithm, the total computation grows exponentially with the number of agents. Despite this dramatic reduction in required computation, we show that our multiagent rollout algorithm has the fundamental cost improvement property of standard rollout: it guarantees improved performance relative to the base policy.
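As a rough illustration of where the linear-versus-exponential difference in computation comes from, the following minimal Python sketch contrasts standard (all-agents-at-once) rollout with the agent-by-agent variant; the simulator, base policy, and per-agent action sets are hypothetical placeholders, not part of the paper.

```python
import itertools

def rollout_cost(state, joint_action, base_policy, simulate, horizon):
    """Apply joint_action once, then follow the base policy for `horizon` steps
    (illustrative deterministic simulator interface)."""
    total, s, a = 0.0, state, joint_action
    for _ in range(horizon):
        s, cost = simulate(s, a)
        total += cost
        a = base_policy(s)
    return total

def standard_rollout(state, action_sets, base_policy, simulate, horizon):
    """Minimize over all joint actions: work grows exponentially in the number of agents."""
    return min(itertools.product(*action_sets),
               key=lambda joint: rollout_cost(state, joint, base_policy, simulate, horizon))

def multiagent_rollout(state, action_sets, base_policy, simulate, horizon):
    """Agents optimize one at a time: earlier agents' choices are fixed, later agents
    follow the base policy, so work grows linearly in the number of agents."""
    joint = list(base_policy(state))  # start from the base policy's joint action
    for i, actions_i in enumerate(action_sets):
        def cost_of(a_i, i=i):
            trial = list(joint)
            trial[i] = a_i
            return rollout_cost(state, tuple(trial), base_policy, simulate, horizon)
        joint[i] = min(actions_i, key=cost_of)  # agent i's local rollout step
    return tuple(joint)
```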
We first develop our agent-by-agent policy improvement approach for finite horizon problems, and then we extend it to exact and approximate PI for discounted and other infinite horizon problems. We prove that the cost improvement property steers the algorithm towards convergence to an agent-by-agent optimal policy, thus establishing a connection with the theory of teams. We also discuss autonomous multiagent rollout schemes that allow the agents to make decisions autonomously through the use of precomputed signaling information, which is sufficient to maintain the cost improvement property, without any on-line coordination of control selection between the agents.
We consider the batch (off-line) policy learning problem in the infinite-horizon Markov Decision Process setting. Motivated by mobile health applications, we focus on learning a policy that maximizes the long-term average reward. We propose a doubly robust estimator of the average reward for a given policy and show that it achieves semi-parametric efficiency. The proposed estimator requires estimation of two policy-dependent nuisance functions. We develop an optimization algorithm to compute the optimal policy in a parameterized stochastic policy class. The performance of the estimated policy is measured by the regret, i.e., the difference between the optimal average reward in the policy class and the average reward of the estimated policy, and we establish a finite-sample regret guarantee for our proposed method.
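For concreteness, one common form of such a doubly robust construction (written here in generic notation that may differ from the talk's) couples the average reward with a differential value function Q and a stationary density ratio ω, the latter two playing the role of the policy-dependent nuisance functions, and defines the estimate of the average reward through an empirical estimating equation:

```latex
\frac{1}{n}\sum_{t=1}^{n} \hat\omega(S_t,A_t)\Big( R_t - \hat\eta
    + \sum_{a}\pi(a \mid S_{t+1})\,\hat Q(S_{t+1},a) - \hat Q(S_t,A_t) \Big) \;=\; 0 .
```

Estimators of this general type are called doubly robust because they remain consistent if at least one of the two nuisance functions is consistently estimated.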
We establish a theoretical comparison between the asymptotic mean-squared error of Double Q-learning and Q-learning. Using prior work on the asymptotic mean-squared error of linear stochastic approximation based on Lyapunov equations, we show that the asymptotic mean-squared error of Double Q-learning is exactly equal to that of Q-learning if Double Q-learning uses twice the learning rate of Q-learning and outputs the average of its two estimators. We also present some practical implications of this theoretical observation using simulations.
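Schematically, and in generic tabular notation (not necessarily that of the paper), the updates being compared are:

```latex
% Q-learning with step size \alpha_n:
Q_{n+1}(s,a) \;=\; Q_n(s,a) + \alpha_n \big( r + \gamma \max_{a'} Q_n(s',a') - Q_n(s,a) \big).

% Double Q-learning with step size 2\alpha_n: at each step one of the two estimators,
% chosen at random, is updated using the other for evaluation, e.g.
Q^A_{n+1}(s,a) \;=\; Q^A_n(s,a) + 2\alpha_n \big( r
    + \gamma\, Q^B_n\!\big(s', \arg\max_{a'} Q^A_n(s',a')\big) - Q^A_n(s,a) \big),

% and the reported estimator is the average \tfrac12\,(Q^A_n + Q^B_n).
```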
Sample complexity bounds are a common performance metric in the RL literature. In the discounted-cost, infinite-horizon setting, all of the known bounds can become arbitrarily large as the discount factor approaches unity. For a large discount factor, these bounds seem to imply that a very large number of samples is required to achieve an epsilon-optimal policy. In this talk, we will discuss a new class of algorithms whose sample complexity is uniformly bounded over all discount factors.
One may argue that this is impossible, in view of a recent minimax lower bound. The explanation is that this lower bound applies to a specific problem formulation, which we modify without compromising the ultimate objective of obtaining an epsilon-optimal policy.
Specifically, we show that the asymptotic covariance of the Q-learning algorithm with an optimized step-size sequence is a quadratic function of a factor that grows without bound as the discount factor approaches 1; this is an expected and essentially known result. The new relative Q-learning algorithm proposed here is shown to have an asymptotic covariance that is uniformly bounded over all discount factors.
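A rough sketch of the relative-normalization idea, in generic notation that may differ in detail from the algorithm of the talk: the target in the Q-learning recursion is shifted by a scalar normalization term, so the iterates track the optimal Q-function only up to an additive constant,

```latex
Q_{n+1}(X_n,U_n) \;=\; Q_n(X_n,U_n) + \alpha_n \Big( c(X_n,U_n)
    + \gamma \min_{u} Q_n(X_{n+1},u) \;-\; \kappa\,\langle \mu, Q_n\rangle \;-\; Q_n(X_n,U_n) \Big),
```

where the bracketed term is a fixed weighted average of the entries of Q_n and κ is a constant. Removing the large common offset of the Q-function, which grows like 1/(1-γ), is what allows the asymptotic covariance to stay bounded as the discount factor approaches 1.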
Zap Q-learning is a recent class of reinforcement learning algorithms, motivated primarily as a means to accelerate convergence. Stability theory has been absent outside of two restrictive classes: the tabular setting and optimal stopping. This paper introduces a new framework for the analysis of a more general class of recursive algorithms known as stochastic approximation. Based on this general theory, it is shown that Zap Q-learning is consistent under a non-degeneracy assumption, even when the function approximation architecture is nonlinear. Zap Q-learning with neural network function approximation emerges as a special case, and is tested on examples from OpenAI Gym. Based on multiple experiments with a range of neural network sizes, it is found that the new algorithms converge quickly and are robust to the choice of function approximation architecture.
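For orientation, the matrix-gain ("Zap") stochastic approximation recursion has the following general shape (generic notation and sign conventions; not necessarily the paper's exact formulation): the usual stochastic approximation update is premultiplied by the inverse of a running estimate of the mean-field Jacobian, maintained on a faster time scale,

```latex
\theta_{n+1} \;=\; \theta_n \;-\; \alpha_{n+1}\,\widehat{A}_{n+1}^{-1}\, f(\theta_n, \Phi_{n+1}),
\qquad
\widehat{A}_{n+1} \;=\; \widehat{A}_n + \gamma_{n+1}\Big( \partial_\theta f(\theta_n,\Phi_{n+1}) - \widehat{A}_n \Big),
```

so that the recursion approximates a Newton-Raphson flow; Zap Q-learning corresponds to a particular choice of f built from the Q-learning temporal-difference term.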
The problem of Offline Policy Evaluation (OPE) in Reinforcement Learning (RL) is a critical step towards applying RL in real-life applications. Existing work on OPE mostly focuses on evaluating a fixed target policy π, which does not provide useful bounds for offline policy learning, as π will then be data-dependent. We address this problem by simultaneously evaluating all policies in a policy class Π -- uniform convergence in OPE -- and obtain nearly optimal error bounds for a number of global/local policy classes. Our results imply that model-based planning achieves an optimal episode complexity of Õ(H^3/(d_m ε^2)) in identifying an ε-optimal policy under the time-inhomogeneous episodic MDP model (H is the planning horizon and d_m is a quantity that reflects the exploration of the logging policy μ). To the best of our knowledge, this is the first time the optimal rate is shown to be achievable in the offline RL setting, and the paper is the first to systematically investigate uniform convergence in OPE.
In this talk I will discuss recent progress on a long-standing open problem in batch RL: learning Q* from an exploratory and polynomial-sized dataset, using a realizable and otherwise arbitrary function class. In fact, all existing algorithms demand function-approximation assumptions stronger than realizability, and the mounting negative evidence has led to a conjecture that sample-efficient learning is impossible in this setting (Chen and Jiang, 2019). Our algorithm, BVFT, breaks the hardness conjecture (albeit under a stronger notion of exploratory data) via a tournament procedure that reduces the learning problem to pairwise comparison, and solves the latter with the help of a state-action partition constructed from the compared functions. I will also discuss how BVFT can be applied to model selection and holdout validation, among other extensions and open problems.
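A very high-level sketch of the tournament structure described above (purely illustrative; the pairwise loss is left abstract and all names are hypothetical, so this is not the authors' implementation):

```python
def tournament_select(candidates, pairwise_loss):
    """Select a candidate value function by pairwise comparison.

    `pairwise_loss(f, g)` is assumed to estimate a Bellman-error-like quantity
    for f, computed on a state-action partition built from the pair (f, g).
    Each candidate is scored by its worst loss against any other candidate,
    and the candidate with the smallest worst-case score is returned."""
    scores = {}
    for f in candidates:
        scores[f] = max((pairwise_loss(f, g) for g in candidates if g is not f),
                        default=0.0)
    return min(scores, key=scores.get)
```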
Reinforcement Learning (RL) problems for continuous state and action space systems are among the most challenging in RL. Recently, deep reinforcement learning methods have been shown to be quite effective for certain RL problems in settings of very large or continuous state and action spaces. But such methods require extensive hyper-parameter tuning and huge amounts of data, and come with no performance guarantees. We note that such methods are mostly trained 'offline' on experience replay buffers. In this talk, I will describe a series of simple reinforcement learning schemes for various settings. Our premise is that we have access to a generative model that can give us simulated samples of the next state. I will introduce the RANDPOL (randomized function approximation for policy iteration) algorithm, an empirical actor-critic algorithm that uses randomized neural networks, and show that it can successfully solve a challenging robotics problem with continuous state and action spaces. We also provide theoretical performance guarantees for the algorithm. Specifically, it allows for arbitrarily good approximation, with high probability, for any problem. I will also touch upon the probabilistic contraction analysis framework for iterative stochastic algorithms that underpins the theoretical analysis. This talk is based on work with Hiteshi Sharma (Microsoft).
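As a hedged illustration of the "randomized function approximation" ingredient, the sketch below fits only the output weights of a network whose hidden-layer weights are drawn at random and then frozen; the single-hidden-layer architecture, feature map, and names are assumptions for illustration, not the exact RANDPOL architecture.

```python
import numpy as np

class RandomFeatureApproximator:
    """One hidden layer with fixed random weights; only the linear output
    layer is fit, here by ridge-regularized least squares."""

    def __init__(self, input_dim, num_features=256, reg=1e-3, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(num_features, input_dim))        # frozen random weights
        self.b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)  # frozen random offsets
        self.theta = np.zeros(num_features)                        # trainable output weights
        self.reg = reg

    def features(self, X):
        return np.cos(X @ self.W.T + self.b)   # random Fourier-type features

    def fit(self, X, y):
        Phi = self.features(X)
        A = Phi.T @ Phi + self.reg * np.eye(Phi.shape[1])
        self.theta = np.linalg.solve(A, Phi.T @ y)

    def predict(self, X):
        return self.features(X) @ self.theta
```

In an empirical actor-critic loop of this kind, one such approximator could play the critic (fit to sampled Bellman targets obtained from the generative model) and another the actor.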
In this talk we discuss computational approaches based on Monte Carlo sampling techniques for solving multistage stochastic programming problems. In some applications the considered programs exhibit periodical behavior. We demonstrate that in such cases it is possible to drastically reduce the number of stages by introducing a periodical analog of the so-called Bellman equations used in Markov Decision Processes and Stochastic Optimal Control. Furthermore, we describe a primal-dual variant of the Stochastic Dual Dynamic Programming algorithm, applied to the constructed periodical Bellman equations. We consider both risk-neutral and risk-averse settings.
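As a rough sketch in generic notation (for a period of m stages and a discount factor γ; not the paper's exact formulation), the periodical Bellman equations couple only m value functions cyclically, rather than one value function per stage of a long horizon:

```latex
V_t(x_t) \;=\; \min_{u_t \in \,\mathcal{U}_t(x_t)}
  \mathbb{E}\big[\, c_t(x_t,u_t,\xi_t) + \gamma\, V_{t+1}(x_{t+1}) \,\big],
\qquad t = 1,\dots,m, \qquad V_{m+1} \equiv V_1 ,
```

so that a horizon with many periods collapses to a fixed-point system in m functions, to which a Stochastic Dual Dynamic Programming-type cutting-plane scheme can then be applied.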
Robust control theory has highlighted the importance of quantifying model uncertainty for the design of feedback control strategies that achieve a provable level of performance. The robustness paradigm motivated work on ‘robust learning’ to address the question of how well model uncertainty can be characterized from data. The quality and convergence rate of model approximation from data impose a fundamental limit on the rate of adaptation of the underlying control/decision strategy. In particular, for some model classes, sample complexity results impose significant limitations on such adaptation rates. We conjecture that the same limitations exist for convergence in reinforcement learning.
The characterization of the relationship between learning and model uncertainty hinges on having a tractable theory of model approximation. Such a theory exists for broad classes of linear stochastic dynamical systems. In this talk, I will present some results on learning classes of stochastic systems, namely jump linear systems. A key question for such models is unraveling the underlying model structure from data. In this context, I will show that spectral methods allow for estimating the underlying state dimension when the number of stochastic transitions is known. Utilizing existing results on model reduction using Hankel-like matrices, I will show that efficient learning of low-dimensional models is quite possible.
This work is in collaboration with Tuhin Sarkar and Alexander Rakhlin.
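A minimal sketch of the spectral idea mentioned above (purely illustrative; the Hankel construction from estimated Markov parameters and the rank threshold are assumptions, not the estimator of the talk): stack the estimates into a block-Hankel matrix and read the state dimension off its numerical rank.

```python
import numpy as np

def estimate_state_dim(markov_params, tol=1e-2):
    """markov_params: list of estimated (p x m) Markov parameters M_1, M_2, ...
    Build the block-Hankel matrix with block (i, j) equal to M_{i+j+1} and return
    the number of singular values above tol times the largest one."""
    k = len(markov_params) // 2
    p, m = markov_params[0].shape
    H = np.zeros((k * p, k * m))
    for i in range(k):
        for j in range(k):
            H[i * p:(i + 1) * p, j * m:(j + 1) * m] = markov_params[i + j]
    svals = np.linalg.svd(H, compute_uv=False)
    return int(np.sum(svals > tol * svals[0]))
```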