Skip to main content

Showing 1–43 of 43 results for author: Van Roy, B

  1. arXiv:2402.00396  [pdf, other

    cs.LG cs.AI cs.CL stat.ME stat.ML

    Efficient Exploration for LLMs

    Authors: Vikranth Dwaracherla, Seyed Mohammad Asghari, Botao Hao, Benjamin Van Roy

    Abstract: We present evidence of substantial benefit from efficient exploration in gathering human feedback to improve large language models. In our experiments, an agent sequentially generates queries while fitting a reward model to the feedback received. Our best-performing agent generates queries using double Thompson sampling, with uncertainty represented by an epistemic neural network. Our results demo… ▽ More

    Submitted 4 June, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

    Comments: Accepted at ICML 2024

  2. arXiv:2302.12202  [pdf, ps, other

    cs.LG stat.ML

    A Definition of Non-Stationary Bandits

    Authors: Yueyang Liu, Xu Kuang, Benjamin Van Roy

    Abstract: Despite the subject of non-stationary bandit learning having attracted much recent attention, we have yet to identify a formal definition of non-stationarity that can consistently distinguish non-stationary bandits from stationary ones. Prior work has characterized non-stationary bandits as bandits for which the reward distribution changes over time. We demonstrate that this definition can ambiguo… ▽ More

    Submitted 28 July, 2023; v1 submitted 23 February, 2023; originally announced February 2023.

  3. arXiv:2302.03319  [pdf, ps, other

    cs.LG math.ST stat.ML

    Leveraging Demonstrations to Improve Online Learning: Quality Matters

    Authors: Botao Hao, Rahul Jain, Tor Lattimore, Benjamin Van Roy, Zheng Wen

    Abstract: We investigate the extent to which offline demonstration data can improve online learning. It is natural to expect some improvement, but the question is how, and by how much? We show that the degree of improvement must depend on the quality of the demonstration data. To generate portable insights, we focus on Thompson sampling (TS) applied to a multi-armed bandit as a prototypical online learning… ▽ More

    Submitted 17 May, 2023; v1 submitted 7 February, 2023; originally announced February 2023.

    Comments: Accepted at ICML 2023

  4. arXiv:2211.15931  [pdf, other

    cs.LG stat.ML

    Posterior Sampling for Continuing Environments

    Authors: Wanqiao Xu, Shi Dong, Benjamin Van Roy

    Abstract: We develop an extension of posterior sampling for reinforcement learning (PSRL) that is suited for a continuing agent-environment interface and integrates naturally into agent designs that scale to complex environments. The approach, continuing PSRL, maintains a statistically plausible model of the environment and follows a policy that maximizes expected $γ$-discounted return in that model. At eac… ▽ More

    Submitted 1 February, 2023; v1 submitted 29 November, 2022; originally announced November 2022.

  5. arXiv:2206.03633  [pdf, other

    cs.LG cs.AI stat.ML

    Ensembles for Uncertainty Estimation: Benefits of Prior Functions and Bootstrapping

    Authors: Vikranth Dwaracherla, Zheng Wen, Ian Osband, Xiuyuan Lu, Seyed Mohammad Asghari, Benjamin Van Roy

    Abstract: In machine learning, an agent needs to estimate uncertainty to efficiently explore and adapt and to make effective decisions. A common approach to uncertainty estimation maintains an ensemble of models. In recent years, several approaches have been proposed for training ensembles, and conflicting views prevail with regards to the importance of various ingredients of these approaches. In this paper… ▽ More

    Submitted 7 June, 2022; originally announced June 2022.

  6. arXiv:2206.02072  [pdf, ps, other

    cs.LG cs.IT stat.ML

    Deciding What to Model: Value-Equivalent Sampling for Reinforcement Learning

    Authors: Dilip Arumugam, Benjamin Van Roy

    Abstract: The quintessential model-based reinforcement-learning agent iteratively refines its estimates or prior beliefs about the true underlying model of the environment. Recent empirical successes in model-based reinforcement learning with function approximation, however, eschew the true model in favor of a surrogate that, while ignoring various facets of the environment, still facilitates effective plan… ▽ More

    Submitted 30 October, 2022; v1 submitted 4 June, 2022; originally announced June 2022.

    Comments: Accepted to Neural Information Processing Systems (NeurIPS) 2022

  7. arXiv:2205.01970  [pdf, other

    cs.LG stat.ML

    Non-Stationary Bandit Learning via Predictive Sampling

    Authors: Yueyang Liu, Xu Kuang, Benjamin Van Roy

    Abstract: Thompson sampling has proven effective across a wide range of stationary bandit environments. However, as we demonstrate in this paper, it can perform poorly when applied to non-stationary environments. We attribute such failures to the fact that, when exploring, the algorithm does not differentiate actions based on how quickly the information acquired loses its usefulness due to non-stationarity.… ▽ More

    Submitted 17 July, 2023; v1 submitted 4 May, 2022; originally announced May 2022.

  8. arXiv:2203.01303  [pdf, other

    cs.LG stat.ML

    An Analysis of Ensemble Sampling

    Authors: Chao Qin, Zheng Wen, Xiuyuan Lu, Benjamin Van Roy

    Abstract: Ensemble sampling serves as a practical approximation to Thompson sampling when maintaining an exact posterior distribution over model parameters is computationally intractable. In this paper, we establish a regret bound that ensures desirable behavior when ensemble sampling is applied to the linear bandit problem. This represents the first rigorous regret analysis of ensemble sampling and is made… ▽ More

    Submitted 1 March, 2023; v1 submitted 2 March, 2022; originally announced March 2022.

    Comments: [NeurIPS 2022 camera-ready version](https://openreview.net/forum?id=c6ibx0yl-aG) with improved regret bounds

  9. arXiv:2203.00246  [pdf, other

    cs.LG cs.AI stat.ML

    An Information-Theoretic Framework for Supervised Learning

    Authors: Hong Jun Jeon, Yifan Zhu, Benjamin Van Roy

    Abstract: Each year, deep learning demonstrates new and improved empirical results with deeper and wider neural networks. Meanwhile, with existing theoretical frameworks, it is difficult to analyze networks deeper than two layers without resorting to counting parameters or encountering sample complexity bounds that are exponential in depth. Perhaps it may be fruitful to try to analyze modern machine learnin… ▽ More

    Submitted 24 March, 2023; v1 submitted 1 March, 2022; originally announced March 2022.

  10. arXiv:2202.13509  [pdf, other

    stat.ML cs.AI cs.LG

    Evaluating High-Order Predictive Distributions in Deep Learning

    Authors: Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Vikranth Dwaracherla, Xiuyuan Lu, Benjamin Van Roy

    Abstract: Most work on supervised learning research has focused on marginal predictions. In decision problems, joint predictive distributions are essential for good performance. Previous work has developed methods for assessing low-order predictive distributions with inputs sampled i.i.d. from the testing distribution. With low-dimensional inputs, these methods distinguish agents that effectively estimate u… ▽ More

    Submitted 27 February, 2022; originally announced February 2022.

  11. arXiv:2201.01902  [pdf, other

    cs.LG stat.ML

    Gaussian Imagination in Bandit Learning

    Authors: Yueyang Liu, Adithya M. Devraj, Benjamin Van Roy, Kuang Xu

    Abstract: Assuming distributions are Gaussian often facilitates computations that are otherwise intractable. We study the performance of an agent that attains a bounded information ratio with respect to a bandit environment with a Gaussian prior distribution and a Gaussian likelihood function when applied instead to a Bernoulli bandit. Relative to an information-theoretic bound on the Bayesian regret the ag… ▽ More

    Submitted 21 February, 2022; v1 submitted 5 January, 2022; originally announced January 2022.

  12. arXiv:2110.04629  [pdf, other

    cs.LG cs.AI stat.ML

    The Neural Testbed: Evaluating Joint Predictions

    Authors: Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Vikranth Dwaracherla, Botao Hao, Morteza Ibrahimi, Dieterich Lawson, Xiuyuan Lu, Brendan O'Donoghue, Benjamin Van Roy

    Abstract: Predictive distributions quantify uncertainties ignored by point estimates. This paper introduces The Neural Testbed: an open-source benchmark for controlled and principled evaluation of agents that generate such predictions. Crucially, the testbed assesses agents not only on the quality of their marginal predictions per input, but also on their joint predictions across many inputs. We evaluate a… ▽ More

    Submitted 1 November, 2022; v1 submitted 9 October, 2021; originally announced October 2021.

  13. arXiv:2107.09224  [pdf, ps, other

    cs.LG stat.ML

    From Predictions to Decisions: The Importance of Joint Predictive Distributions

    Authors: Zheng Wen, Ian Osband, Chao Qin, Xiuyuan Lu, Morteza Ibrahimi, Vikranth Dwaracherla, Mohammad Asghari, Benjamin Van Roy

    Abstract: A fundamental challenge for any intelligent system is prediction: given some inputs, can you predict corresponding outcomes? Most work on supervised learning has focused on producing accurate marginal predictions for each input. However, we show that for a broad class of decision problems, accurate joint predictions are required to deliver good performance. In particular, we establish several resu… ▽ More

    Submitted 23 May, 2022; v1 submitted 19 July, 2021; originally announced July 2021.

  14. arXiv:2107.08924  [pdf, other

    cs.LG cs.AI stat.ML

    Epistemic Neural Networks

    Authors: Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Vikranth Dwaracherla, Morteza Ibrahimi, Xiuyuan Lu, Benjamin Van Roy

    Abstract: Intelligence relies on an agent's knowledge of what it does not know. This capability can be assessed based on the quality of joint predictions of labels across multiple inputs. In principle, ensemble-based approaches produce effective joint predictions, but the computational costs of training large ensembles can become prohibitive. We introduce the epinet: an architecture that can supplement any… ▽ More

    Submitted 17 May, 2023; v1 submitted 19 July, 2021; originally announced July 2021.

  15. arXiv:2010.02383  [pdf, other

    cs.LG cs.AI stat.ML

    Randomized Value Functions via Posterior State-Abstraction Sampling

    Authors: Dilip Arumugam, Benjamin Van Roy

    Abstract: State abstraction has been an essential tool for dramatically improving the sample efficiency of reinforcement-learning algorithms. Indeed, by exposing and accentuating various types of latent structure within the environment, different classes of state abstraction have enabled improved theoretical guarantees and empirical performance. When dealing with state abstractions that capture structure in… ▽ More

    Submitted 17 June, 2021; v1 submitted 5 October, 2020; originally announced October 2020.

    Comments: Accepted to the Workshop on Biological and Artificial Reinforcement Learning (NeurIPS 2020)

  16. arXiv:2006.07464  [pdf, other

    cs.LG math.OC stat.ML

    Hypermodels for Exploration

    Authors: Vikranth Dwaracherla, Xiuyuan Lu, Morteza Ibrahimi, Ian Osband, Zheng Wen, Benjamin Van Roy

    Abstract: We study the use of hypermodels to represent epistemic uncertainty and guide exploration. This generalizes and extends the use of ensembles to approximate Thompson sampling. The computational cost of training an ensemble grows with its size, and as such, prior work has typically been limited to ensembles with tens of elements. We show that alternative hypermodels can enjoy dramatic efficiency gain… ▽ More

    Submitted 12 June, 2020; originally announced June 2020.

    Comments: Published as a conference paper at ICLR 2020

  17. arXiv:2002.07282  [pdf, other

    cs.LG cs.AI stat.ML

    Langevin DQN

    Authors: Vikranth Dwaracherla, Benjamin Van Roy

    Abstract: Algorithms that tackle deep exploration -- an important challenge in reinforcement learning -- have relied on epistemic uncertainty representation through ensembles or other hypermodels, exploration bonuses, or visitation count distributions. An open question is whether deep exploration can be achieved by an incremental reinforcement learning algorithm that tracks a single point estimate, without… ▽ More

    Submitted 23 February, 2021; v1 submitted 17 February, 2020; originally announced February 2020.

    Comments: 5 figures, 14 pages

  18. arXiv:1912.06366  [pdf, ps, other

    stat.ML cs.LG math.OC

    Provably Efficient Reinforcement Learning with Aggregated States

    Authors: Shi Dong, Benjamin Van Roy, Zhengyuan Zhou

    Abstract: We establish that an optimistic variant of Q-learning applied to a fixed-horizon episodic Markov decision process with an aggregated state representation incurs regret $\tilde{\mathcal{O}}(\sqrt{H^5 M K} + εHK)$, where $H$ is the horizon, $M$ is the number of aggregate states, $K$ is the number of episodes, and $ε$ is the largest difference between any pair of optimal state-action values associate… ▽ More

    Submitted 19 February, 2020; v1 submitted 13 December, 2019; originally announced December 2019.

  19. arXiv:1911.09724  [pdf, other

    stat.ML cs.AI cs.LG

    Information-Theoretic Confidence Bounds for Reinforcement Learning

    Authors: Xiuyuan Lu, Benjamin Van Roy

    Abstract: We integrate information-theoretic concepts into the design and analysis of optimistic algorithms and Thompson sampling. By making a connection between information-theoretic quantities and confidence bounds, we obtain results that relate the per-period performance of the agent with its information gain about the environment, thus explicitly characterizing the exploration-exploitation tradeoff. The… ▽ More

    Submitted 21 November, 2019; originally announced November 2019.

  20. arXiv:1911.07910  [pdf, other

    cs.LG stat.ML

    Comments on the Du-Kakade-Wang-Yang Lower Bounds

    Authors: Benjamin Van Roy, Shi Dong

    Abstract: Du, Kakade, Wang, and Yang recently established intriguing lower bounds on sample complexity, which suggest that reinforcement learning with a misspecified representation is intractable. Another line of work, which centers around a statistic called the eluder dimension, establishes tractability of problems similar to those considered in the Du-Kakade-Wang-Yang paper. We compare these results and r… ▽ More

    Submitted 18 November, 2019; originally announced November 2019.

  21. arXiv:1908.03568  [pdf, other

    cs.LG cs.AI stat.ML

    Behaviour Suite for Reinforcement Learning

    Authors: Ian Osband, Yotam Doron, Matteo Hessel, John Aslanides, Eren Sezener, Andre Saraiva, Katrina McKinney, Tor Lattimore, Csaba Szepesvari, Satinder Singh, Benjamin Van Roy, Richard Sutton, David Silver, Hado Van Hasselt

    Abstract: This paper introduces the Behaviour Suite for Reinforcement Learning, or bsuite for short. bsuite is a collection of carefully-designed experiments that investigate core capabilities of reinforcement learning (RL) agents with two objectives. First, to collect clear, informative and scalable problems that capture key issues in the design of general and efficient learning algorithms. Second, to stud… ▽ More

    Submitted 14 February, 2020; v1 submitted 9 August, 2019; originally announced August 2019.

  22. arXiv:1905.04654  [pdf, other

    stat.ML cs.LG

    On the Performance of Thompson Sampling on Logistic Bandits

    Authors: Shi Dong, Tengyu Ma, Benjamin Van Roy

    Abstract: We study the logistic bandit, in which rewards are binary with success probability $\exp(βa^\top θ) / (1 + \exp(βa^\top θ))$ and actions $a$ and coefficients $θ$ are within the $d$-dimensional unit ball. While prior regret bounds for algorithms that address the logistic bandit exhibit exponential dependence on the slope parameter $β$, we establish a regret bound for Thompson sampling that is indep… ▽ More

    Submitted 12 May, 2019; originally announced May 2019.

    Comments: Accepted for presentation at the Conference on Learning Theory (COLT) 2019

  23. arXiv:1805.11845  [pdf, other

    stat.ML cs.IT cs.LG

    An Information-Theoretic Analysis for Thompson Sampling with Many Actions

    Authors: Shi Dong, Benjamin Van Roy

    Abstract: Information-theoretic Bayesian regret bounds of Russo and Van Roy capture the dependence of regret on prior uncertainty. However, this dependence is through entropy, which can become arbitrarily large as the number of actions increases. We establish new bounds that depend instead on a notion of rate-distortion. Among other things, this allows us to recover through information-theoretic arguments a… ▽ More

    Submitted 7 July, 2020; v1 submitted 30 May, 2018; originally announced May 2018.

  24. arXiv:1805.08948  [pdf, other

    cs.LG cs.AI stat.ML

    Scalable Coordinated Exploration in Concurrent Reinforcement Learning

    Authors: Maria Dimakopoulou, Ian Osband, Benjamin Van Roy

    Abstract: We consider a team of reinforcement learning agents that concurrently operate in a common environment, and we develop an approach to efficient coordinated exploration that is suitable for problems of practical scale. Our approach builds on seed sampling (Dimakopoulou and Van Roy, 2018) and randomized value function learning (Osband et al., 2016). We demonstrate that, for simple tabular contexts, t… ▽ More

    Submitted 16 December, 2018; v1 submitted 22 May, 2018; originally announced May 2018.

    Comments: NIPS 2018

  25. arXiv:1706.04241  [pdf, other

    stat.ML cs.LG

    On Optimistic versus Randomized Exploration in Reinforcement Learning

    Authors: Ian Osband, Benjamin Van Roy

    Abstract: We discuss the relative merits of optimistic and randomized approaches to exploration in reinforcement learning. Optimistic approaches presented in the literature apply an optimistic boost to the value estimate at each state-action pair and select actions that are greedy with respect to the resulting optimistic value function. Randomized approaches sample from among statistically plausible value f… ▽ More

    Submitted 13 June, 2017; originally announced June 2017.

    Comments: Extended abstract for RLDM 2017

  26. arXiv:1705.07347  [pdf, ps, other

    stat.ML cs.AI cs.LG

    Ensemble Sampling

    Authors: Xiuyuan Lu, Benjamin Van Roy

    Abstract: Thompson sampling has emerged as an effective heuristic for a broad range of online decision problems. In its basic form, the algorithm requires computing and sampling from a posterior distribution over models, which is tractable only for simple special cases. This paper develops ensemble sampling, which aims to approximate Thompson sampling while maintaining tractability even in the face of compl… ▽ More

    Submitted 25 April, 2023; v1 submitted 20 May, 2017; originally announced May 2017.

  27. arXiv:1703.07608  [pdf, other

    stat.ML cs.AI cs.LG

    Deep Exploration via Randomized Value Functions

    Authors: Ian Osband, Benjamin Van Roy, Daniel Russo, Zheng Wen

    Abstract: We study the use of randomized value functions to guide deep exploration in reinforcement learning. This offers an elegant means for synthesizing statistically and computationally efficient exploration with common practical approaches to value function learning. We present several reinforcement learning algorithms that leverage randomized value functions and demonstrate their efficacy through comp… ▽ More

    Submitted 23 September, 2019; v1 submitted 22 March, 2017; originally announced March 2017.

    Comments: Accepted for publication in Journal of Machine Learning Research 2019

  28. arXiv:1702.04126  [pdf, other

    stat.ML cs.LG math.PR

    Gaussian-Dirichlet Posterior Dominance in Sequential Learning

    Authors: Ian Osband, Benjamin Van Roy

    Abstract: We consider the problem of sequential learning from categorical observations bounded in [0,1]. We establish an ordering between the Dirichlet posterior over categorical outcomes and a Gaussian posterior under observations with N(0,1) noise. We establish that, conditioned upon identical data with at least two observations, the posterior mean of the categorical distribution will always second-order… ▽ More

    Submitted 9 February, 2018; v1 submitted 14 February, 2017; originally announced February 2017.

  29. arXiv:1611.06426  [pdf, other

    stat.ML cs.LG

    Conservative Contextual Linear Bandits

    Authors: Abbas Kazerouni, Mohammad Ghavamzadeh, Yasin Abbasi-Yadkori, Benjamin Van Roy

    Abstract: Safety is a desirable property that can immensely increase the applicability of learning algorithms in real-world decision-making problems. It is much easier for a company to deploy an algorithm that is safe, i.e., guaranteed to perform at least as well as a baseline. In this paper, we study the issue of safety in contextual linear bandits that have application in many different fields including p… ▽ More

    Submitted 3 March, 2017; v1 submitted 19 November, 2016; originally announced November 2016.

  30. arXiv:1608.02732  [pdf, other

    stat.ML cs.LG

    On Lower Bounds for Regret in Reinforcement Learning

    Authors: Ian Osband, Benjamin Van Roy

    Abstract: This is a brief technical note to clarify the state of lower bounds on regret for reinforcement learning. In particular, this paper: - Reproduces a lower bound on regret for reinforcement learning, similar to the result of Theorem 5 in the journal UCRL2 paper (Jaksch et al 2010). - Clarifies that the proposed proof of Theorem 6 in the REGAL paper (Bartlett and Tewari 2009) does not hold using… ▽ More

    Submitted 9 August, 2016; originally announced August 2016.

  31. arXiv:1608.02731  [pdf, ps, other

    stat.ML cs.LG

    Posterior Sampling for Reinforcement Learning Without Episodes

    Authors: Ian Osband, Benjamin Van Roy

    Abstract: This is a brief technical note to clarify some of the issues with applying the application of the algorithm posterior sampling for reinforcement learning (PSRL) in environments without fixed episodes. In particular, this paper aims to: - Review some of results which have been proven for finite horizon MDPs (Osband et al 2013, 2014a, 2014b, 2016) and also for MDPs with finite ergodic structure (G… ▽ More

    Submitted 9 August, 2016; originally announced August 2016.

  32. arXiv:1607.00215  [pdf, other

    stat.ML cs.AI cs.LG

    Why is Posterior Sampling Better than Optimism for Reinforcement Learning?

    Authors: Ian Osband, Benjamin Van Roy

    Abstract: Computational results demonstrate that posterior sampling for reinforcement learning (PSRL) dramatically outperforms algorithms driven by optimism, such as UCRL2. We provide insight into the extent of this performance boost and the phenomenon that drives it. We leverage this insight to establish an $\tilde{O}(H\sqrt{SAT})$ Bayesian expected regret bound for PSRL in finite-horizon episodic Markov d… ▽ More

    Submitted 13 June, 2017; v1 submitted 1 July, 2016; originally announced July 2016.

  33. arXiv:1602.04621  [pdf, other

    cs.LG cs.AI eess.SY stat.ML

    Deep Exploration via Bootstrapped DQN

    Authors: Ian Osband, Charles Blundell, Alexander Pritzel, Benjamin Van Roy

    Abstract: Efficient exploration in complex environments remains a major challenge for reinforcement learning. We propose bootstrapped DQN, a simple algorithm that explores in a computationally and statistically efficient manner through use of randomized value functions. Unlike dithering strategies such as epsilon-greedy exploration, bootstrapped DQN carries out temporally-extended (or deep) exploration; thi… ▽ More

    Submitted 4 July, 2016; v1 submitted 15 February, 2016; originally announced February 2016.

  34. arXiv:1507.00300  [pdf, other

    stat.ML cs.LG

    Bootstrapped Thompson Sampling and Deep Exploration

    Authors: Ian Osband, Benjamin Van Roy

    Abstract: This technical note presents a new approach to carrying out the kind of exploration achieved by Thompson sampling, but without explicitly maintaining or sampling from posterior distributions. The approach is based on a bootstrap technique that uses a combination of observed and artificially generated data. The latter serves to induce a prior distribution which, as we will demonstrate, is critical… ▽ More

    Submitted 1 July, 2015; originally announced July 2015.

  35. arXiv:1406.1853  [pdf, ps, other

    stat.ML cs.LG

    Model-based Reinforcement Learning and the Eluder Dimension

    Authors: Ian Osband, Benjamin Van Roy

    Abstract: We consider the problem of learning to optimize an unknown Markov decision process (MDP). We show that, if the MDP can be parameterized within some known function class, we can obtain regret bounds that scale with the dimensionality, rather than cardinality, of the system. We characterize this dependence explicitly as $\tilde{O}(\sqrt{d_K d_E T})$ where $T$ is time elapsed, $d_K$ is the Kolmogorov… ▽ More

    Submitted 31 October, 2014; v1 submitted 6 June, 2014; originally announced June 2014.

  36. arXiv:1403.3741  [pdf, ps, other

    stat.ML cs.LG

    Near-optimal Reinforcement Learning in Factored MDPs

    Authors: Ian Osband, Benjamin Van Roy

    Abstract: Any reinforcement learning algorithm that applies to all Markov decision processes (MDPs) will suffer $Ω(\sqrt{SAT})$ regret on some MDP, where $T$ is the elapsed time and $S$ and $A$ are the cardinalities of the state and action spaces. This implies $T = Ω(SA)$ time to guarantee a near-optimal policy. In many settings of practical interest, due to the curse of dimensionality, $S$ and $A$ can be s… ▽ More

    Submitted 31 October, 2014; v1 submitted 14 March, 2014; originally announced March 2014.

  37. arXiv:1402.0635  [pdf, other

    stat.ML cs.AI cs.LG eess.SY

    Generalization and Exploration via Randomized Value Functions

    Authors: Ian Osband, Benjamin Van Roy, Zheng Wen

    Abstract: We propose randomized least-squares value iteration (RLSVI) -- a new reinforcement learning algorithm designed to explore and generalize efficiently via linearly parameterized value functions. We explain why versions of least-squares value iteration that use Boltzmann or epsilon-greedy exploration can be highly inefficient, and we present computational results that demonstrate dramatic efficiency… ▽ More

    Submitted 15 February, 2016; v1 submitted 4 February, 2014; originally announced February 2014.

    Comments: arXiv admin note: text overlap with arXiv:1307.4847

  38. arXiv:1307.4847  [pdf, other

    cs.LG cs.AI eess.SY stat.ML

    Efficient Reinforcement Learning in Deterministic Systems with Value Function Generalization

    Authors: Zheng Wen, Benjamin Van Roy

    Abstract: We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization. We establish that when the true value function lies within a given hypothesis class, OCP selects optimal actions over all but at most K… ▽ More

    Submitted 6 July, 2016; v1 submitted 18 July, 2013; originally announced July 2013.

  39. arXiv:1306.0940  [pdf, other

    stat.ML cs.LG

    (More) Efficient Reinforcement Learning via Posterior Sampling

    Authors: Ian Osband, Daniel Russo, Benjamin Van Roy

    Abstract: Most provably-efficient learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration, posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision proce… ▽ More

    Submitted 26 December, 2013; v1 submitted 4 June, 2013; originally announced June 2013.

    Comments: 10 pages

  40. arXiv:1303.5984  [pdf, ps, other

    stat.ML cs.LG math.OC

    Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems

    Authors: Morteza Ibrahimi, Adel Javanmard, Benjamin Van Roy

    Abstract: We study the problem of adaptive control of a high dimensional linear quadratic (LQ) system. Previous work established the asymptotic convergence to an optimal controller for various adaptive control schemes. More recently, for the average cost LQ problem, a regret bound of ${O}(\sqrt{T})$ was shown, apart form logarithmic factors. However, this bound scales exponentially with $p$, the dimension o… ▽ More

    Submitted 24 March, 2013; originally announced March 2013.

    Comments: 16 pages

    Journal ref: Advances in Neural Information Processing Systems (NIPS) 2012: 2645-2653

  41. arXiv:1206.7112  [pdf, ps, other

    cs.LG cs.IR stat.ML

    A Hybrid Method for Distance Metric Learning

    Authors: Yi-Hao Kao, Benjamin Van Roy, Daniel Rubin, Jiajing Xu, Jessica Faruque, Sandy Napel

    Abstract: We consider the problem of learning a measure of distance among vectors in a feature space and propose a hybrid method that simultaneously learns from similarity ratings assigned to pairs of vectors and class labels assigned to individual vectors. Our method is based on a generative model in which class labels can provide information that is not encoded in feature vectors but yet relates to percei… ▽ More

    Submitted 29 June, 2012; originally announced June 2012.

  42. arXiv:1206.6141  [pdf, ps, other

    cs.LG eess.SY stat.ML

    Directed Time Series Regression for Control

    Authors: Yi-Hao Kao, Benjamin Van Roy

    Abstract: We propose directed time series regression, a new approach to estimating parameters of time-series models for use in certainty equivalent model predictive control. The approach combines merits of least squares regression and empirical optimization. Through a computational study involving a stochastic version of a well known inverted pendulum balancing problem, we demonstrate that directed time ser… ▽ More

    Submitted 26 June, 2012; originally announced June 2012.

  43. Learning a Factor Model via Regularized PCA

    Authors: Yi-Hao Kao, Benjamin Van Roy

    Abstract: We consider the problem of learning a linear factor model. We propose a regularized form of principal component analysis (PCA) and demonstrate through experiments with synthetic and real data the superiority of resulting estimates to those produced by pre-existing factor analysis approaches. We also establish theoretical results that explain how our algorithm corrects the biases induced by convent… ▽ More

    Submitted 24 February, 2013; v1 submitted 26 November, 2011; originally announced November 2011.

    Journal ref: Machine Learning, Volume 91, Number 3, pp. 279-303 (2013)