Samples from high-dimensional distributions can be scarce or expensive. Can we meaningfully learn such distributions from one or just a few samples? We provide guarantees for single-sample estimation of Ising models, quantifying the estimation error in terms of the metric entropy of possible interaction matrices. As corollaries of our main result, we derive bounds when the model's interaction matrix is a (sparse) linear combination of known matrices, or it belongs to a finite set, or to a high-dimensional manifold.
Our result handles multiple independent samples by viewing them as one sample from a larger model, and can be used to derive estimation bounds that are qualitatively similar to state-of-the-art in the multiple-sample literature. We thus unify two separate strands of work in the literature: (1) a renaissance of recent work in Computer Science on estimating Ising models/MRFs from multiple independent samples under minimal assumptions about the model's interaction matrix; and (2) a growing literature in Probability Theory on estimating them from one sample in restrictive settings. On the technical front, we exploit novel concentration and anti-concentration inequalities for functions of the Ising model.
We reduce the problem of proving a "Boolean Unique Games Conjecture" (with gap 1-delta vs. 1-C*delta, for any C>1, and sufficiently small delta>0) to the problem of proving a PCP Theorem for a certain non-unique game. In a previous work, Khot and Moshkovitz suggested a candidate reduction (i.e., without a proof of soundness). The current work is the first to provide a reduction along with a proof of soundness. The non-unique game we reduce from is similar to non-unique games for which PCP theorems are known.
Our proof relies on a new, tight, local testing theorem for the low degree part of functions f:R^n-->R in Gaussian space: We show that the low degree part of the restriction of f to a random hyperplane h is extremely close (O(1/n) Euclidean distance) to the restriction of the low degree part of f to h with high probability.
This is joint work with Ronen Eldan from the Weizmann Institute.
Consider the following basic problem in sparse linear regression -- an algorithm gets labeled samples of the form (x, w.x + \eps) where w is an unknown n-dimensional vector, x is drawn from a background n-dimensional distribution D, and \eps is some independent noise. Given the promise that w is k-sparse, the breakthrough work of Candes, Romberg and Tao shows that w can be recovered with a number of samples that scales as O(k log n). This should be contrasted with general linear regression where O(n) samples are information theoretically necessary. We look at this problem from the vantage point of property testing, and study the complexity of determining whether the unknown vector w is k-sparse versus "far" from k-sparse. We show that this decision problem can be solved with a number of samples which is independent of n as long as the background distribution D is i.i.d. and its components are not Gaussian. We further show that weakening any of the conditions in this result necessarily makes the complexity scale logarithmically in n.
Joint work with Xue Chen (George Mason University) and Anindya De (University of Pennsylvania).
When can we test functions more efficiently than we can learn them? We will consider this fundamental problem in the standard PAC learning setting, which corresponds to the sample-based distribution-free property testing model. The VC dimension of a class of functions characterizes the number of samples needed to learn the class. But since some properties can indeed be tested more efficiently than they can be learned, VC dimension by itself does not characterize the number of samples needed to test it. Nonetheless, we will see that VC dimension combined with a closely-related “lower VC dimension" variant does give strong lower bounds on the number of samples required to test many properties of functions, including halfspaces, intersection of halfspaces, polynomial threshold functions, and decision trees.
This is joint work with Renato Ferreira Pinto Jr. and Nathan Harms.
In this talk, we consider the sample complexity required for testing the monotonicity of distributions over partial orders. A distribution p over a poset is monotone if, for any pair of domain elements x and y such that x \preceq y, p(x) \leq p(y). I am going to present the proof sketch of achieving a lower bound for this problem over a few posets, e.g., matchings, and hypercubes. The main idea of these lower bounds is the following: we introduce a new property called *bigness* over a finite domain, where the distribution is T-big if the minimum probability for any domain element is at least T. Relying on the framework of [Wu-Yang'15], we establish a lower bound of \Omega(n/\log n) for testing bigness of distributions on domains of size n. We then build on these lower bounds to give \Omega(n/\log{n}) lower bounds for testing monotonicity over a matching poset of size n and significantly improved lower bounds over the hypercube poset. Although finding a tight upper bound for testing monotonicity remains open, I will also discuss the steps we took towards the upper bound as well.
Joint work with: Themis Gouleakis, John Peebles, Ronitt Rubinfeld, and Anak Yodpinyanee.
Modern supervised machine learning algorithms are at their best when provided with large datasets and large, high-capacity models. This kind of data-driven paradigm has driven remarkable progress in fields ranging from computer vision to natural language processing and speech recognition. However, reinforcement learning algorithms have proven difficult to scale to such large data regimes without the use of simulation. Online reinforcement learning algorithms require recollecting data in each experiment -- when the dataset is on the scale of ImageNet and MS-COCO, this becomes infeasible to do in the real world. Offline reinforcement learning algorithms have so far been difficult to integrate with deep neural networks. In this talk, I will discuss some recent advances in offline reinforcement learning that I think represent a step toward bridging this gap, and in addition carry a number of appealing theoretical properties. I will discuss the theoretical reasons why offline reinforcement learning is challenging, discuss the solutions that have been proposed in the literature, and describe our recent advances in developing conservative Q-learning methods that provide theoretical guarantees in the face of distributional shift, providing not only a practical way of deploying offline deep RL, but also a degree of confidence that the resulting solution will avoid common pitfalls of overestimation. I will also describe how the ideas in offline RL can be applied more broadly to data-driven optimization problems that do not necessarily involve sequential decision making, such as designing protein sequences or robot morphologies from prior experimental data. This emerging field shares many of the theoretical foundations with offline RL, but represents the possibility to broaden the impact of data-driven decision making to a wider range of applications.
Model-free reinforcement learning attempts to find an optimal control action for an unknown dynamical system by directly searching over the parameter space of controllers. The convergence behavior and statistical properties of these approaches are often poorly understood because of the nonconvex nature of the underlying optimization problems and the lack of exact gradient computation. In this talk, we discuss performance and efficiency of such methods by focusing on the standard infinite-horizon linear quadratic regulator problem for continuous-time systems with unknown state-space parameters. We establish exponential stability for the ordinary differential equation (ODE) that governs the gradient-flow dynamics over the set of stabilizing feedback gains and show that a similar result holds for the gradient descent method that arises from the forward Euler discretization of the corresponding ODE. We also provide theoretical bounds on the convergence rate and sample complexity of the random search method with two-point gradient estimates. We prove that the required simulation time for achieving $\epsilon$-accuracy in the model-free setup and the total number of function evaluations both scale as $\log (1/\epsilon)$.
Most literature on policy evaluation, bandit methods, etc., is focused on settings where actions taken on one unit do not affect other units. Such lack of interference, however, fails to hold in many applications of interest. For example, in a vaccine study, one person getting vaccinated also protects others; in a microcredit study, loans given to one person may stimulate the economy and indirectly benefit others; or, in a jobs-training study, training more people to perform a given task may create over-supply of qualified workers, thus reducing the market value of the training. In this talk, I'll survey various approaches to modeling cross-unit interference, and discuss associated methods for policy evaluation.
We consider the problem of reinforcement learning (RL) with unbounded state space, motivated by the classical problem of scheduling in a queueing network. Traditional policies as well as error metric that are designed for finite, bounded or compact state space, require infinite samples for providing any meaningful performance guarantee (e.g. ℓ_∞ error) for unbounded state space. We need a new notion of performance metric. Inspired by the literature in queuing systems and control theory, we propose stability as the notion of “goodness”: the state dynamics under the policy should remain in a bounded region with high probability. As a proof of concept, we propose an RL policy using Sparse-Sampling-based Monte Carlo Oracle and argue that it satisfies the stability property as long as the system dynamics under the optimal policy respects a Lyapunov function. The assumption of existence of a Lyapunov function is not restrictive as it is equivalent to the positive recurrence or stability property of any Markov chain. And, our policy does not utilize the knowledge of the specific Lyapunov function. To make our method sample efficient, we provide an improved, sample efficient Monte Carlo Oracle with Lipschitz value. We also design an adaptive version of the algorithm, based on carefully constructed statistical tests, which finds the correct tuning parameter automatically.
This work is joint with Devavrat Shah and Zhi Xu.
We consider the question of learning Q-function in a sample efficient manner for reinforcement learning with continuous state and action spaces under a generative model. If Q-function is Lipschitz continuous, then the minimal sample complexity for estimating ϵ-optimal Q-function is known to scale as O(eps^{-(d1+d2+2)}) per classical non-parametric learning theory, where d1 and d2 denote the dimensions of the state and action spaces respectively. The Q-function, when viewed as a kernel, induces a Hilbert-Schmidt operator and hence possesses square-summable spectrum. This motivates us to consider a parametric class of Q-functions parameterized by its "rank" r, which contains all Lipschitz Q-functions as r→∞. As our key contribution, we develop a simple, iterative learning algorithm that finds ϵ-optimal Q-function with sample complexity of O(eps^{-max(d1,d2)+2}) when the optimal Q-function has low rank r and the discounting factor is below a certain threshold. Thus, this provides an exponential improvement in sample complexity. To enable our result, we develop a novel Matrix Estimation algorithm that faithfully estimates an unknown low-rank matrix with respect to max-norm sense even in the presence of arbitrary bounded noise, which might be of interest in its own right. Empirical results on several stochastic control tasks confirm the efficacy of our "low-rank" algorithms.
This is based on joint work with Dogyoon Song, Zhi Xu, Yuzhe Yang. The manuscript is available at https://arxiv.org/abs/2006.06135.
We study the reward-free reinforcement learning framework, which is particularly suitable for batch reinforcement learning and scenarios where one needs policies for multiple reward functions. This framework has two phases. In the exploration phase, the agent collects trajectories by interacting with the environment without using any reward signal. In the planning phase, the agent needs to return a near-optimal policy for arbitrary reward functions. We give a new efficient algorithm which interacts with the environment at most $O\left( \frac{S^2A}{\epsilon^2}\poly\log\left(\frac{SAH}{\epsilon}\right) \right)$ episodes in the exploration phase, and guarantees to output a near-optimal policy for arbitrary reward functions in the planning phase. Here, $S$ is the size of state space, $A$ is the size of action space, $H$ is the planning horizon, and $\epsilon$ is the target accuracy relative to the total reward. Our sample complexity scales only logarithmically with $H$, in contrast to all existing results which scale polynomially with $H$. Furthermore, this bound matches the minimax lower bound $\Omega\left(\frac{S^2A}{\epsilon^2}\right)$ up to logarithmic factors.
Offline RL is crucial in applications where experimentation is limited, such as medicine, but it is also notoriously difficult because the similarity between the trajectories observed and those generated by any proposed policy diminishes exponentially as horizon grows, known as the curse of horizon. To better understand this limitation, we study the statistical efficiency limits of two central tasks in offline reinforcement learning: estimating the policy value and the policy gradient from off-policy data. The efficiency bounds reveal that the curse is generally insurmountable without assuming additional structure and as such plagues many standard estimators that work in general problems, but it may be overcome in Markovian settings and even further attenuated in stationary settings. We develop the first estimators achieving the efficiency limits in finite- and infinite-horizon MDPs using a meta-algorithm we term Double Reinforcement Learning (DRL). We provide favorable guarantees for DRL and for off-policy policy optimization via efficiently-estimated policy gradient ascent.