Showing 1–50 of 83 results for author: Van Roy, B

  1. arXiv:2407.01456  [pdf, other]

    cs.LG cs.AI

    Information-Theoretic Foundations for Neural Scaling Laws

    Authors: Hong Jun Jeon, Benjamin Van Roy

    Abstract: Neural scaling laws aim to characterize how out-of-sample error behaves as a function of model and training dataset size. Such scaling laws guide allocation of computational resources between model and data processing to minimize error. However, existing theoretical support for neural scaling laws lacks rigor and clarity, entangling the roles of information and optimization. In this work, we dev…

    Submitted 27 June, 2024; originally announced July 2024.

    Comments: arXiv admin note: text overlap with arXiv:2212.01365

  2. arXiv:2402.00396  [pdf, other]

    cs.LG cs.AI cs.CL stat.ME stat.ML

    Efficient Exploration for LLMs

    Authors: Vikranth Dwaracherla, Seyed Mohammad Asghari, Botao Hao, Benjamin Van Roy

    Abstract: We present evidence of substantial benefit from efficient exploration in gathering human feedback to improve large language models. In our experiments, an agent sequentially generates queries while fitting a reward model to the feedback received. Our best-performing agent generates queries using double Thompson sampling, with uncertainty represented by an epistemic neural network. Our results demo…

    Submitted 4 June, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

    Comments: Accepted at ICML 2024

  3. arXiv:2401.15530  [pdf, ps, other]

    cs.LG cs.IT

    An Information-Theoretic Analysis of In-Context Learning

    Authors: Hong Jun Jeon, Jason D. Lee, Qi Lei, Benjamin Van Roy

    Abstract: Previous theoretical results pertaining to meta-learning on sequences build on contrived assumptions and are somewhat convoluted. We introduce new information-theoretic tools that lead to an elegant and very general decomposition of error into three components: irreducible error, meta-learning error, and intra-task error. These tools unify analyses across many meta-learning challenges. To illustra…

    Submitted 27 January, 2024; originally announced January 2024.

  4. arXiv:2401.13239  [pdf, other]

    cs.LG cs.HC

    Adaptive Crowdsourcing Via Self-Supervised Learning

    Authors: Anmol Kagrecha, Henrik Marklund, Benjamin Van Roy, Hong Jun Jeon, Richard Zeckhauser

    Abstract: Common crowdsourcing systems average estimates of a latent quantity of interest provided by many crowdworkers to produce a group estimate. We develop a new approach -- predict-each-worker -- that leverages self-supervised learning and a novel aggregation scheme. This approach adapts weights assigned to crowdworkers based on estimates they provided for previous quantities. When skills vary across c…

    Submitted 1 February, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

    Comments: 33 pages, 3 figures

  5. arXiv:2312.01057  [pdf, other]

    cs.LG cs.AI cs.CL

    RLHF and IIA: Perverse Incentives

    Authors: Wanqiao Xu, Shi Dong, Xiuyuan Lu, Grace Lam, Zheng Wen, Benjamin Van Roy

    Abstract: Existing algorithms for reinforcement learning from human feedback (RLHF) can incentivize responses at odds with preferences because they are based on models that assume independence of irrelevant alternatives (IIA). The perverse incentives induced by IIA hinder innovations on query formats and learning algorithms.

    Submitted 1 February, 2024; v1 submitted 2 December, 2023; originally announced December 2023.

  6. arXiv:2310.07786  [pdf, other]

    cs.LG cs.IR

    Non-Stationary Contextual Bandit Learning via Neural Predictive Ensemble Sampling

    Authors: Zheqing Zhu, Yueyang Liu, Xu Kuang, Benjamin Van Roy

    Abstract: Real-world applications of contextual bandits often exhibit non-stationarity due to seasonality, serendipity, and evolving social trends. While a number of non-stationary contextual bandit learning algorithms have been proposed in the literature, they excessively explore due to a lack of prioritization for information of enduring value, or are designed in ways that do not scale in modern applicati…

    Submitted 14 October, 2023; v1 submitted 11 October, 2023; originally announced October 2023.

  7. arXiv:2308.11958  [pdf, other]

    cs.LG cs.AI

    Maintaining Plasticity in Continual Learning via Regenerative Regularization

    Authors: Saurabh Kumar, Henrik Marklund, Benjamin Van Roy

    Abstract: In continual learning, plasticity refers to the ability of an agent to quickly adapt to new information. Neural networks are known to lose plasticity when processing non-stationary data streams. In this paper, we propose L2 Init, a simple approach for maintaining plasticity by incorporating in the loss function L2 regularization toward initial parameters. This is very similar to standard L2 regula…

    Submitted 3 October, 2023; v1 submitted 23 August, 2023; originally announced August 2023.

  8. arXiv:2307.11046  [pdf, other]

    cs.LG cs.AI

    A Definition of Continual Reinforcement Learning

    Authors: David Abel, André Barreto, Benjamin Van Roy, Doina Precup, Hado van Hasselt, Satinder Singh

    Abstract: In a standard view of the reinforcement learning problem, an agent's goal is to efficiently identify a policy that maximizes long-term reward. However, this perspective is based on a restricted view of learning as finding a solution, rather than treating learning as endless adaptation. In contrast, continual reinforcement learning refers to the setting in which the best agents never stop learning.…

    Submitted 1 December, 2023; v1 submitted 20 July, 2023; originally announced July 2023.

    Comments: NeurIPS 2023

  9. arXiv:2307.11044  [pdf, other]

    cs.LG cs.AI

    On the Convergence of Bounded Agents

    Authors: David Abel, André Barreto, Hado van Hasselt, Benjamin Van Roy, Doina Precup, Satinder Singh

    Abstract: When has an agent converged? Standard models of the reinforcement learning problem give rise to a straightforward definition of convergence: An agent converges when its behavior or performance in each environment state stops changing. However, as we shift the focus of our learning problem from the environment's state to the agent's state, the concept of an agent's convergence becomes significantly…

    Submitted 20 July, 2023; originally announced July 2023.

  10. arXiv:2307.04345  [pdf, other]

    cs.LG cs.AI

    Continual Learning as Computationally Constrained Reinforcement Learning

    Authors: Saurabh Kumar, Henrik Marklund, Ashish Rao, Yifan Zhu, Hong Jun Jeon, Yueyang Liu, Benjamin Van Roy

    Abstract: An agent that efficiently accumulates knowledge to develop increasingly sophisticated skills over a long lifetime could advance the frontier of artificial intelligence capabilities. The design of such agents, which remains a long-standing challenge of artificial intelligence, is addressed by the subject of continual learning. This monograph clarifies and formalizes concepts of continual learning,…

    Submitted 20 August, 2023; v1 submitted 10 July, 2023; originally announced July 2023.

  11. arXiv:2306.14834  [pdf, other]

    cs.IR cs.AI

    Scalable Neural Contextual Bandit for Recommender Systems

    Authors: Zheqing Zhu, Benjamin Van Roy

    Abstract: High-quality recommender systems ought to deliver both innovative and relevant content through effective and exploratory interactions with users. Yet, supervised learning-based neural networks, which form the backbone of many existing recommender systems, only leverage recognized user interests, falling short when it comes to efficiently uncovering unknown user preferences. While there has been so…

    Submitted 18 August, 2023; v1 submitted 26 June, 2023; originally announced June 2023.

    Journal ref: 32nd ACM International Conference on Information and Knowledge Management (CIKM 2023)

  12. arXiv:2305.11455  [pdf, other]

    cs.CL cs.AI cs.LG

    Shattering the Agent-Environment Interface for Fine-Tuning Inclusive Language Models

    Authors: Wanqiao Xu, Shi Dong, Dilip Arumugam, Benjamin Van Roy

    Abstract: A centerpiece of the ever-popular reinforcement learning from human feedback (RLHF) approach to fine-tuning autoregressive language models is the explicit training of a reward model to emulate human feedback, distinct from the language model itself. This reward model is then coupled with policy-gradient methods to dramatically improve the alignment between language model outputs and desired respon…

    Submitted 19 May, 2023; originally announced May 2023.

  13. arXiv:2305.03263  [pdf, other]

    cs.LG cs.AI

    Bayesian Reinforcement Learning with Limited Cognitive Load

    Authors: Dilip Arumugam, Mark K. Ho, Noah D. Goodman, Benjamin Van Roy

    Abstract: All biological and artificial agents must learn and make decisions given limits on their ability to process information. As such, a general theory of adaptive behavior should be able to account for the complex interactions between an agent's learning history, decisions, and capacity constraints. Recent work in computer science has begun to clarify the principles that shape these dynamics by bridgi…

    Submitted 4 May, 2023; originally announced May 2023.

  14. arXiv:2302.12202  [pdf, ps, other]

    cs.LG stat.ML

    A Definition of Non-Stationary Bandits

    Authors: Yueyang Liu, Xu Kuang, Benjamin Van Roy

    Abstract: Despite the subject of non-stationary bandit learning having attracted much recent attention, we have yet to identify a formal definition of non-stationarity that can consistently distinguish non-stationary bandits from stationary ones. Prior work has characterized non-stationary bandits as bandits for which the reward distribution changes over time. We demonstrate that this definition can ambiguo…

    Submitted 28 July, 2023; v1 submitted 23 February, 2023; originally announced February 2023.

  15. arXiv:2302.09205  [pdf, other]

    cs.LG cs.AI

    Approximate Thompson Sampling via Epistemic Neural Networks

    Authors: Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Vikranth Dwaracherla, Morteza Ibrahimi, Xiuyuan Lu, Benjamin Van Roy

    Abstract: Thompson sampling (TS) is a popular heuristic for action selection, but it requires sampling from a posterior distribution. Unfortunately, this can become computationally intractable in complex environments, such as those modeled using neural networks. Approximate posterior samples can produce effective actions, but only if they reasonably approximate joint predictive distributions of outputs acro…

    Submitted 17 February, 2023; originally announced February 2023.

  16. arXiv:2302.03319  [pdf, ps, other]

    cs.LG math.ST stat.ML

    Leveraging Demonstrations to Improve Online Learning: Quality Matters

    Authors: Botao Hao, Rahul Jain, Tor Lattimore, Benjamin Van Roy, Zheng Wen

    Abstract: We investigate the extent to which offline demonstration data can improve online learning. It is natural to expect some improvement, but the question is how, and by how much? We show that the degree of improvement must depend on the quality of the demonstration data. To generate portable insights, we focus on Thompson sampling (TS) applied to a multi-armed bandit as a prototypical online learning…

    Submitted 17 May, 2023; v1 submitted 7 February, 2023; originally announced February 2023.

    Comments: Accepted at ICML 2023

  17. arXiv:2212.12633  [pdf, other]

    cs.LG cs.AI

    Inclusive Artificial Intelligence

    Authors: Dilip Arumugam, Shi Dong, Benjamin Van Roy

    Abstract: Prevailing methods for assessing and comparing generative AIs incentivize responses that serve a hypothetical representative individual. Evaluating models in these terms presumes homogeneous preferences across the population and engenders selection of agglomerative AIs, which fail to represent the diverse range of interests across individuals. We propose an alternative evaluation method that inste…

    Submitted 3 March, 2023; v1 submitted 23 December, 2022; originally announced December 2022.

  18. arXiv:2212.01365  [pdf, other]

    cs.LG

    An Information-Theoretic Analysis of Compute-Optimal Neural Scaling Laws

    Authors: Hong Jun Jeon, Benjamin Van Roy

    Abstract: We study the compute-optimal trade-off between model and training data set sizes for large neural networks. Our result suggests a linear relation similar to that supported by the empirical analysis of Chinchilla. While that work studies transformer-based large language models trained on the MassiveText corpus (Gopher), as a starting point for development of a mathematical theory, we focus on a simpl…

    Submitted 18 October, 2023; v1 submitted 2 December, 2022; originally announced December 2022.

  19. arXiv:2211.15931  [pdf, other]

    cs.LG stat.ML

    Posterior Sampling for Continuing Environments

    Authors: Wanqiao Xu, Shi Dong, Benjamin Van Roy

    Abstract: We develop an extension of posterior sampling for reinforcement learning (PSRL) that is suited for a continuing agent-environment interface and integrates naturally into agent designs that scale to complex environments. The approach, continuing PSRL, maintains a statistically plausible model of the environment and follows a policy that maximizes expected $γ$-discounted return in that model. At eac…

    Submitted 1 February, 2023; v1 submitted 29 November, 2022; originally announced November 2022.

  20. arXiv:2211.01568  [pdf, other]

    cs.CL cs.AI

    Fine-Tuning Language Models via Epistemic Neural Networks

    Authors: Ian Osband, Seyed Mohammad Asghari, Benjamin Van Roy, Nat McAleese, John Aslanides, Geoffrey Irving

    Abstract: Language models often pre-train on large unsupervised text corpora, then fine-tune on additional task-specific data. However, typical fine-tuning schemes do not prioritize the examples that they tune on. We show that, if you can prioritize informative training data, you can achieve better performance while using fewer labels. To do this we augment a language model with an epinet: a small additiona…

    Submitted 10 May, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

  21. arXiv:2210.16877  [pdf, ps, other]

    cs.LG cs.AI

    On Rate-Distortion Theory in Capacity-Limited Cognition & Reinforcement Learning

    Authors: Dilip Arumugam, Mark K. Ho, Noah D. Goodman, Benjamin Van Roy

    Abstract: Throughout the cognitive-science literature, there is widespread agreement that decision-making agents operating in the real world do so under limited information-processing capabilities and without access to unbounded cognitive or computational resources. Prior work has drawn inspiration from this fact and leveraged an information-theoretic model of such behaviors or policies as communication cha…

    Submitted 30 October, 2022; originally announced October 2022.

    Comments: Accepted to the NeurIPS Workshop on Information-Theoretic Principles in Cognitive Systems (InfoCog) 2022. arXiv admin note: text overlap with arXiv:2206.02072

  22. arXiv:2209.08627  [pdf, other]

    cs.LG

    Is Stochastic Gradient Descent Near Optimal?

    Authors: Yifan Zhu, Hong Jun Jeon, Benjamin Van Roy

    Abstract: The success of neural networks over the past decade has established them as effective models for many relevant data generating processes. Statistical theory on neural networks indicates graceful scaling of sample complexity. For example, Jeon & Van Roy (arXiv:2203.00246) demonstrate that, when data is generated by a ReLU teacher network with $W$ parameters, an optimal learner needs only…

    Submitted 6 October, 2022; v1 submitted 18 September, 2022; originally announced September 2022.

    Comments: arXiv admin note: substantial text overlap with arXiv:2203.00246

  23. arXiv:2207.00137  [pdf, other]

    cs.LG

    Robustness of Epinets against Distributional Shifts

    Authors: Xiuyuan Lu, Ian Osband, Seyed Mohammad Asghari, Sven Gowal, Vikranth Dwaracherla, Zheng Wen, Benjamin Van Roy

    Abstract: Recent work introduced the epinet as a new approach to uncertainty modeling in deep learning. An epinet is a small neural network added to traditional neural networks, which, together, can produce predictive distributions. In particular, using an epinet can greatly improve the quality of joint predictions across multiple inputs, a measure of how well a neural network knows what it does not know. I…

    Submitted 30 June, 2022; originally announced July 2022.

  24. arXiv:2206.03633  [pdf, other]

    cs.LG cs.AI stat.ML

    Ensembles for Uncertainty Estimation: Benefits of Prior Functions and Bootstrapping

    Authors: Vikranth Dwaracherla, Zheng Wen, Ian Osband, Xiuyuan Lu, Seyed Mohammad Asghari, Benjamin Van Roy

    Abstract: In machine learning, an agent needs to estimate uncertainty to efficiently explore and adapt and to make effective decisions. A common approach to uncertainty estimation maintains an ensemble of models. In recent years, several approaches have been proposed for training ensembles, and conflicting views prevail with regard to the importance of various ingredients of these approaches. In this paper…

    Submitted 7 June, 2022; originally announced June 2022.

  25. arXiv:2206.02072  [pdf, ps, other]

    cs.LG cs.IT stat.ML

    Deciding What to Model: Value-Equivalent Sampling for Reinforcement Learning

    Authors: Dilip Arumugam, Benjamin Van Roy

    Abstract: The quintessential model-based reinforcement-learning agent iteratively refines its estimates or prior beliefs about the true underlying model of the environment. Recent empirical successes in model-based reinforcement learning with function approximation, however, eschew the true model in favor of a surrogate that, while ignoring various facets of the environment, still facilitates effective plan…

    Submitted 30 October, 2022; v1 submitted 4 June, 2022; originally announced June 2022.

    Comments: Accepted to Neural Information Processing Systems (NeurIPS) 2022

  26. arXiv:2206.02025  [pdf, ps, other]

    cs.LG cs.IT

    Between Rate-Distortion Theory & Value Equivalence in Model-Based Reinforcement Learning

    Authors: Dilip Arumugam, Benjamin Van Roy

    Abstract: The quintessential model-based reinforcement-learning agent iteratively refines its estimates or prior beliefs about the true underlying model of the environment. Recent empirical successes in model-based reinforcement learning with function approximation, however, eschew the true model in favor of a surrogate that, while ignoring various facets of the environment, still facilitates effective plan…

    Submitted 4 June, 2022; originally announced June 2022.

    Comments: Accepted to the Multi-Disciplinary Conference on Reinforcement Learning and Decision Making (RLDM) 2022

  27. arXiv:2205.01970  [pdf, other]

    cs.LG stat.ML

    Non-Stationary Bandit Learning via Predictive Sampling

    Authors: Yueyang Liu, Xu Kuang, Benjamin Van Roy

    Abstract: Thompson sampling has proven effective across a wide range of stationary bandit environments. However, as we demonstrate in this paper, it can perform poorly when applied to non-stationary environments. We attribute such failures to the fact that, when exploring, the algorithm does not differentiate actions based on how quickly the information acquired loses its usefulness due to non-stationarity.…

    Submitted 17 July, 2023; v1 submitted 4 May, 2022; originally announced May 2022.

  28. arXiv:2203.01303  [pdf, other]

    cs.LG stat.ML

    An Analysis of Ensemble Sampling

    Authors: Chao Qin, Zheng Wen, Xiuyuan Lu, Benjamin Van Roy

    Abstract: Ensemble sampling serves as a practical approximation to Thompson sampling when maintaining an exact posterior distribution over model parameters is computationally intractable. In this paper, we establish a regret bound that ensures desirable behavior when ensemble sampling is applied to the linear bandit problem. This represents the first rigorous regret analysis of ensemble sampling and is made…

    Submitted 1 March, 2023; v1 submitted 2 March, 2022; originally announced March 2022.

    Comments: [NeurIPS 2022 camera-ready version](https://openreview.net/forum?id=c6ibx0yl-aG) with improved regret bounds

  29. arXiv:2203.00246  [pdf, other]

    cs.LG cs.AI stat.ML

    An Information-Theoretic Framework for Supervised Learning

    Authors: Hong Jun Jeon, Yifan Zhu, Benjamin Van Roy

    Abstract: Each year, deep learning demonstrates new and improved empirical results with deeper and wider neural networks. Meanwhile, with existing theoretical frameworks, it is difficult to analyze networks deeper than two layers without resorting to counting parameters or encountering sample complexity bounds that are exponential in depth. Perhaps it may be fruitful to try to analyze modern machine learnin…

    Submitted 24 March, 2023; v1 submitted 1 March, 2022; originally announced March 2022.

  30. arXiv:2202.13509  [pdf, other]

    stat.ML cs.AI cs.LG

    Evaluating High-Order Predictive Distributions in Deep Learning

    Authors: Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Vikranth Dwaracherla, Xiuyuan Lu, Benjamin Van Roy

    Abstract: Most work on supervised learning research has focused on marginal predictions. In decision problems, joint predictive distributions are essential for good performance. Previous work has developed methods for assessing low-order predictive distributions with inputs sampled i.i.d. from the testing distribution. With low-dimensional inputs, these methods distinguish agents that effectively estimate u…

    Submitted 27 February, 2022; originally announced February 2022.

  31. arXiv:2201.01902  [pdf, other]

    cs.LG stat.ML

    Gaussian Imagination in Bandit Learning

    Authors: Yueyang Liu, Adithya M. Devraj, Benjamin Van Roy, Kuang Xu

    Abstract: Assuming distributions are Gaussian often facilitates computations that are otherwise intractable. We study the performance of an agent that attains a bounded information ratio with respect to a bandit environment with a Gaussian prior distribution and a Gaussian likelihood function when applied instead to a Bernoulli bandit. Relative to an information-theoretic bound on the Bayesian regret the ag…

    Submitted 21 February, 2022; v1 submitted 5 January, 2022; originally announced January 2022.

  32. arXiv:2110.13973  [pdf, other]

    cs.LG cs.IT

    The Value of Information When Deciding What to Learn

    Authors: Dilip Arumugam, Benjamin Van Roy

    Abstract: All sequential decision-making agents explore so as to acquire knowledge about a particular target. It is often the responsibility of the agent designer to construct this target which, in rich and complex environments, constitutes an onerous burden; without full knowledge of the environment itself, a designer may forge a sub-optimal learning target that poorly balances the amount of information an…

    Submitted 26 October, 2021; originally announced October 2021.

    Comments: Accepted to Neural Information Processing Systems (NeurIPS) 2021

  33. arXiv:2110.04629  [pdf, other]

    cs.LG cs.AI stat.ML

    The Neural Testbed: Evaluating Joint Predictions

    Authors: Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Vikranth Dwaracherla, Botao Hao, Morteza Ibrahimi, Dieterich Lawson, Xiuyuan Lu, Brendan O'Donoghue, Benjamin Van Roy

    Abstract: Predictive distributions quantify uncertainties ignored by point estimates. This paper introduces The Neural Testbed: an open-source benchmark for controlled and principled evaluation of agents that generate such predictions. Crucially, the testbed assesses agents not only on the quality of their marginal predictions per input, but also on their joint predictions across many inputs. We evaluate a…

    Submitted 1 November, 2022; v1 submitted 9 October, 2021; originally announced October 2021.

  34. arXiv:2109.12509  [pdf, other]

    cs.IR cs.AI cs.LG

    Deep Exploration for Recommendation Systems

    Authors: Zheqing Zhu, Benjamin Van Roy

    Abstract: Modern recommendation systems ought to benefit by probing for and learning from delayed feedback. Research has tended to focus on learning from a user's response to a single recommendation. Such work, which leverages methods of supervised and bandit learning, forgoes learning from the user's subsequent behavior. Where past work has aimed to learn from subsequent behavior, there has been a lack of…

    Submitted 30 July, 2023; v1 submitted 26 September, 2021; originally announced September 2021.

  35. arXiv:2107.09224  [pdf, ps, other]

    cs.LG stat.ML

    From Predictions to Decisions: The Importance of Joint Predictive Distributions

    Authors: Zheng Wen, Ian Osband, Chao Qin, Xiuyuan Lu, Morteza Ibrahimi, Vikranth Dwaracherla, Mohammad Asghari, Benjamin Van Roy

    Abstract: A fundamental challenge for any intelligent system is prediction: given some inputs, can you predict corresponding outcomes? Most work on supervised learning has focused on producing accurate marginal predictions for each input. However, we show that for a broad class of decision problems, accurate joint predictions are required to deliver good performance. In particular, we establish several resu…

    Submitted 23 May, 2022; v1 submitted 19 July, 2021; originally announced July 2021.

  36. arXiv:2107.08924  [pdf, other]

    cs.LG cs.AI stat.ML

    Epistemic Neural Networks

    Authors: Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Vikranth Dwaracherla, Morteza Ibrahimi, Xiuyuan Lu, Benjamin Van Roy

    Abstract: Intelligence relies on an agent's knowledge of what it does not know. This capability can be assessed based on the quality of joint predictions of labels across multiple inputs. In principle, ensemble-based approaches produce effective joint predictions, but the computational costs of training large ensembles can become prohibitive. We introduce the epinet: an architecture that can supplement any…

    Submitted 17 May, 2023; v1 submitted 19 July, 2021; originally announced July 2021.

  37. arXiv:2103.04047  [pdf, other]

    cs.LG cs.AI

    Reinforcement Learning, Bit by Bit

    Authors: Xiuyuan Lu, Benjamin Van Roy, Vikranth Dwaracherla, Morteza Ibrahimi, Ian Osband, Zheng Wen

    Abstract: Reinforcement learning agents have demonstrated remarkable achievements in simulated environments. Data efficiency poses an impediment to carrying this success over to real environments. The design of data-efficient agents calls for a deeper understanding of information acquisition and representation. We discuss concepts and regret analysis that together offer principled guidance. This line of thi…

    Submitted 4 May, 2023; v1 submitted 6 March, 2021; originally announced March 2021.

  38. arXiv:2102.09488  [pdf, other]

    cs.LG

    A Bit Better? Quantifying Information for Bandit Learning

    Authors: Adithya M. Devraj, Benjamin Van Roy, Kuang Xu

    Abstract: The information ratio offers an approach to assessing the efficacy with which an agent balances between exploration and exploitation. Originally, this was defined to be the ratio between squared expected regret and the mutual information between the environment and action-observation pair, which represents a measure of information gain. Recent work has inspired consideration of alternative informa…

    Submitted 18 February, 2021; originally announced February 2021.

    Comments: 41 pages, 10 figures, 1 table

  39. arXiv:2102.05261  [pdf, other]

    cs.LG cs.AI

    Simple Agent, Complex Environment: Efficient Reinforcement Learning with Agent States

    Authors: Shi Dong, Benjamin Van Roy, Zhengyuan Zhou

    Abstract: We design a simple reinforcement learning (RL) agent that implements an optimistic version of $Q$-learning and establish through regret analysis that this agent can operate with some level of competence in any environment. While we leverage concepts from the literature on provably efficient RL, we consider a general agent-environment interface and provide a novel agent design and analysis. This le…

    Submitted 11 July, 2021; v1 submitted 9 February, 2021; originally announced February 2021.

  40. arXiv:2101.06197  [pdf, other]

    cs.LG cs.IT

    Deciding What to Learn: A Rate-Distortion Approach

    Authors: Dilip Arumugam, Benjamin Van Roy

    Abstract: Agents that learn to select optimal actions represent a prominent focus of the sequential decision-making literature. In the face of a complex environment or constraints on time and resources, however, aiming to synthesize such an optimal policy can become infeasible. These scenarios give rise to an important trade-off between the information an agent must acquire to learn and the sub-optimality o…

    Submitted 21 June, 2021; v1 submitted 15 January, 2021; originally announced January 2021.

  41. arXiv:2010.02383  [pdf, other]

    cs.LG cs.AI stat.ML

    Randomized Value Functions via Posterior State-Abstraction Sampling

    Authors: Dilip Arumugam, Benjamin Van Roy

    Abstract: State abstraction has been an essential tool for dramatically improving the sample efficiency of reinforcement-learning algorithms. Indeed, by exposing and accentuating various types of latent structure within the environment, different classes of state abstraction have enabled improved theoretical guarantees and empirical performance. When dealing with state abstractions that capture structure in…

    Submitted 17 June, 2021; v1 submitted 5 October, 2020; originally announced October 2020.

    Comments: Accepted to the Workshop on Biological and Artificial Reinforcement Learning (NeurIPS 2020)

  42. arXiv:2006.07464  [pdf, other]

    cs.LG math.OC stat.ML

    Hypermodels for Exploration

    Authors: Vikranth Dwaracherla, Xiuyuan Lu, Morteza Ibrahimi, Ian Osband, Zheng Wen, Benjamin Van Roy

    Abstract: We study the use of hypermodels to represent epistemic uncertainty and guide exploration. This generalizes and extends the use of ensembles to approximate Thompson sampling. The computational cost of training an ensemble grows with its size, and as such, prior work has typically been limited to ensembles with tens of elements. We show that alternative hypermodels can enjoy dramatic efficiency gain…

    Submitted 12 June, 2020; originally announced June 2020.

    Comments: Published as a conference paper at ICLR 2020

  43. arXiv:2002.07282  [pdf, other]

    cs.LG cs.AI stat.ML

    Langevin DQN

    Authors: Vikranth Dwaracherla, Benjamin Van Roy

    Abstract: Algorithms that tackle deep exploration -- an important challenge in reinforcement learning -- have relied on epistemic uncertainty representation through ensembles or other hypermodels, exploration bonuses, or visitation count distributions. An open question is whether deep exploration can be achieved by an incremental reinforcement learning algorithm that tracks a single point estimate, without…

    Submitted 23 February, 2021; v1 submitted 17 February, 2020; originally announced February 2020.

    Comments: 5 figures, 14 pages

  44. arXiv:1912.06366  [pdf, ps, other]

    stat.ML cs.LG math.OC

    Provably Efficient Reinforcement Learning with Aggregated States

    Authors: Shi Dong, Benjamin Van Roy, Zhengyuan Zhou

    Abstract: We establish that an optimistic variant of Q-learning applied to a fixed-horizon episodic Markov decision process with an aggregated state representation incurs regret $\tilde{\mathcal{O}}(\sqrt{H^5 M K} + εHK)$, where $H$ is the horizon, $M$ is the number of aggregate states, $K$ is the number of episodes, and $ε$ is the largest difference between any pair of optimal state-action values associate…

    Submitted 19 February, 2020; v1 submitted 13 December, 2019; originally announced December 2019.

  45. arXiv:1911.09724  [pdf, other]

    stat.ML cs.AI cs.LG

    Information-Theoretic Confidence Bounds for Reinforcement Learning

    Authors: Xiuyuan Lu, Benjamin Van Roy

    Abstract: We integrate information-theoretic concepts into the design and analysis of optimistic algorithms and Thompson sampling. By making a connection between information-theoretic quantities and confidence bounds, we obtain results that relate the per-period performance of the agent with its information gain about the environment, thus explicitly characterizing the exploration-exploitation tradeoff. The…

    Submitted 21 November, 2019; originally announced November 2019.

  46. arXiv:1911.07910  [pdf, other]

    cs.LG stat.ML

    Comments on the Du-Kakade-Wang-Yang Lower Bounds

    Authors: Benjamin Van Roy, Shi Dong

    Abstract: Du, Kakade, Wang, and Yang recently established intriguing lower bounds on sample complexity, which suggest that reinforcement learning with a misspecified representation is intractable. Another line of work, which centers around a statistic called the eluder dimension, establishes tractability of problems similar to those considered in the Du-Kakade-Wang-Yang paper. We compare these results and r…

    Submitted 18 November, 2019; originally announced November 2019.

  47. arXiv:1908.03568  [pdf, other]

    cs.LG cs.AI stat.ML

    Behaviour Suite for Reinforcement Learning

    Authors: Ian Osband, Yotam Doron, Matteo Hessel, John Aslanides, Eren Sezener, Andre Saraiva, Katrina McKinney, Tor Lattimore, Csaba Szepesvari, Satinder Singh, Benjamin Van Roy, Richard Sutton, David Silver, Hado Van Hasselt

    Abstract: This paper introduces the Behaviour Suite for Reinforcement Learning, or bsuite for short. bsuite is a collection of carefully-designed experiments that investigate core capabilities of reinforcement learning (RL) agents with two objectives. First, to collect clear, informative and scalable problems that capture key issues in the design of general and efficient learning algorithms. Second, to stud…

    Submitted 14 February, 2020; v1 submitted 9 August, 2019; originally announced August 2019.

  48. arXiv:1905.04654  [pdf, other]

    stat.ML cs.LG

    On the Performance of Thompson Sampling on Logistic Bandits

    Authors: Shi Dong, Tengyu Ma, Benjamin Van Roy

    Abstract: We study the logistic bandit, in which rewards are binary with success probability $\exp(\beta a^\top \theta) / (1 + \exp(\beta a^\top \theta))$ and actions $a$ and coefficients $\theta$ are within the $d$-dimensional unit ball. While prior regret bounds for algorithms that address the logistic bandit exhibit exponential dependence on the slope parameter $\beta$, we establish a regret bound for Thompson sampling that is indep…

    Submitted 12 May, 2019; originally announced May 2019.

    Comments: Accepted for presentation at the Conference on Learning Theory (COLT) 2019

  49. arXiv:1805.11845  [pdf, other]

    stat.ML cs.IT cs.LG

    An Information-Theoretic Analysis for Thompson Sampling with Many Actions

    Authors: Shi Dong, Benjamin Van Roy

    Abstract: Information-theoretic Bayesian regret bounds of Russo and Van Roy capture the dependence of regret on prior uncertainty. However, this dependence is through entropy, which can become arbitrarily large as the number of actions increases. We establish new bounds that depend instead on a notion of rate-distortion. Among other things, this allows us to recover through information-theoretic arguments a…

    Submitted 7 July, 2020; v1 submitted 30 May, 2018; originally announced May 2018.

  50. arXiv:1805.08948  [pdf, other]

    cs.LG cs.AI stat.ML

    Scalable Coordinated Exploration in Concurrent Reinforcement Learning

    Authors: Maria Dimakopoulou, Ian Osband, Benjamin Van Roy

    Abstract: We consider a team of reinforcement learning agents that concurrently operate in a common environment, and we develop an approach to efficient coordinated exploration that is suitable for problems of practical scale. Our approach builds on seed sampling (Dimakopoulou and Van Roy, 2018) and randomized value function learning (Osband et al., 2016). We demonstrate that, for simple tabular contexts, t…

    Submitted 16 December, 2018; v1 submitted 22 May, 2018; originally announced May 2018.

    Comments: NIPS 2018