-
Achieving Tractable Minimax Optimal Regret in Average Reward MDPs
Authors:
Victor Boone,
Zihan Zhang
Abstract:
In recent years, significant attention has been directed towards learning average-reward Markov Decision Processes (MDPs). However, existing algorithms either suffer from sub-optimal regret guarantees or computational inefficiencies. In this paper, we present the first tractable algorithm with minimax optimal regret of $\widetilde{\mathrm{O}}(\sqrt{\mathrm{sp}(h^*) S A T})$, where…
▽ More
In recent years, significant attention has been directed towards learning average-reward Markov Decision Processes (MDPs). However, existing algorithms either suffer from sub-optimal regret guarantees or computational inefficiencies. In this paper, we present the first tractable algorithm with minimax optimal regret of $\widetilde{\mathrm{O}}(\sqrt{\mathrm{sp}(h^*) S A T})$, where $\mathrm{sp}(h^*)$ is the span of the optimal bias function $h^*$, $S \times A$ is the size of the state-action space and $T$ the number of learning steps. Remarkably, our algorithm does not require prior information on $\mathrm{sp}(h^*)$. Our algorithm relies on a novel subroutine, Projected Mitigated Extended Value Iteration (PMEVI), to compute bias-constrained optimal policies efficiently. This subroutine can be applied to various previous algorithms to improve regret bounds.
△ Less
Submitted 3 June, 2024;
originally announced June 2024.
-
Gradient-enhanced deep Gaussian processes for multifidelity modelling
Authors:
Viv Bone,
Chris van der Heide,
Kieran Mackle,
Ingo H. J. Jahn,
Peter M. Dower,
Chris Manzie
Abstract:
Multifidelity models integrate data from multiple sources to produce a single approximator for the underlying process. Dense low-fidelity samples are used to reduce interpolation error, while sparse high-fidelity samples are used to compensate for bias or noise in the low-fidelity samples. Deep Gaussian processes (GPs) are attractive for multifidelity modelling as they are non-parametric, robust t…
▽ More
Multifidelity models integrate data from multiple sources to produce a single approximator for the underlying process. Dense low-fidelity samples are used to reduce interpolation error, while sparse high-fidelity samples are used to compensate for bias or noise in the low-fidelity samples. Deep Gaussian processes (GPs) are attractive for multifidelity modelling as they are non-parametric, robust to overfitting, perform well for small datasets, and, critically, can capture nonlinear and input-dependent relationships between data of different fidelities. Many datasets naturally contain gradient data, especially when they are generated by computational models that are compatible with automatic differentiation or have adjoint solutions. Principally, this work extends deep GPs to incorporate gradient data. We demonstrate this method on an analytical test problem and a realistic partial differential equation problem, where we predict the aerodynamic coefficients of a hypersonic flight vehicle over a range of flight conditions and geometries. In both examples, the gradient-enhanced deep GP outperforms a gradient-enhanced linear GP model and their non-gradient-enhanced counterparts.
△ Less
Submitted 25 February, 2024;
originally announced February 2024.
-
The Sliding Regret in Stochastic Bandits: Discriminating Index and Randomized Policies
Authors:
Victor Boone
Abstract:
This paper studies the one-shot behavior of no-regret algorithms for stochastic bandits. Although many algorithms are known to be asymptotically optimal with respect to the expected regret, over a single run, their pseudo-regret seems to follow one of two tendencies: it is either smooth or bumpy. To measure this tendency, we introduce a new notion: the sliding regret, that measures the worst pseud…
▽ More
This paper studies the one-shot behavior of no-regret algorithms for stochastic bandits. Although many algorithms are known to be asymptotically optimal with respect to the expected regret, over a single run, their pseudo-regret seems to follow one of two tendencies: it is either smooth or bumpy. To measure this tendency, we introduce a new notion: the sliding regret, that measures the worst pseudo-regret over a time-window of fixed length sliding to infinity. We show that randomized methods (e.g. Thompson Sampling and MED) have optimal sliding regret, while index policies, although possibly asymptotically optimal for the expected regret, have the worst possible sliding regret under regularity conditions on their index (e.g. UCB, UCB-V, KL-UCB, MOSS, IMED etc.). We further analyze the average bumpiness of the pseudo-regret of index policies via the regret of exploration, that we show to be suboptimal as well.
△ Less
Submitted 30 November, 2023;
originally announced November 2023.
-
The equivalence of dynamic and strategic stability under regularized learning in games
Authors:
Victor Boone,
Panayotis Mertikopoulos
Abstract:
In this paper, we examine the long-run behavior of regularized, no-regret learning in finite games. A well-known result in the field states that the empirical frequencies of no-regret play converge to the game's set of coarse correlated equilibria; however, our understanding of how the players' actual strategies evolve over time is much more limited - and, in many cases, non-existent. This issue i…
▽ More
In this paper, we examine the long-run behavior of regularized, no-regret learning in finite games. A well-known result in the field states that the empirical frequencies of no-regret play converge to the game's set of coarse correlated equilibria; however, our understanding of how the players' actual strategies evolve over time is much more limited - and, in many cases, non-existent. This issue is exacerbated further by a series of recent results showing that only strict Nash equilibria are stable and attracting under regularized learning, thus making the relation between learning and pointwise solution concepts particularly elusive. In lieu of this, we take a more general approach and instead seek to characterize the \emph{setwise} rationality properties of the players' day-to-day play. To that end, we focus on one of the most stringent criteria of setwise strategic stability, namely that any unilateral deviation from the set in question incurs a cost to the deviator - a property known as closedness under better replies (club). In so doing, we obtain a far-reaching equivalence between strategic and dynamic stability: a product of pure strategies is closed under better replies if and only if its span is stable and attracting under regularized learning. In addition, we estimate the rate of convergence to such sets, and we show that methods based on entropic regularization (like the exponential weights algorithm) converge at a geometric rate, while projection-based methods converge within a finite number of iterations, even with bandit, payoff-based feedback.
△ Less
Submitted 4 November, 2023;
originally announced November 2023.
-
When do discounted-optimal policies also optimize the gain?
Authors:
Victor Boone
Abstract:
In this technical note, we establish an upper-bound on the threshold on the discount factor starting from which all discounted-optimal deterministic policies are gain-optimal, that we prove to be tight on an example. To address computability issues of that theoretical threshold, we provide a weaker bound which is tractable on ergodic MDPs in polynomial time.
In this technical note, we establish an upper-bound on the threshold on the discount factor starting from which all discounted-optimal deterministic policies are gain-optimal, that we prove to be tight on an example. To address computability issues of that theoretical threshold, we provide a weaker bound which is tractable on ergodic MDPs in polynomial time.
△ Less
Submitted 17 April, 2023;
originally announced April 2023.
-
Towards model predictive control of supercritical CO2 cycles
Authors:
Viv Bone,
Michael Kearney,
Ingo Jahn
Abstract:
Control of non-condensing non-ideal-gas power cycles is challenging because their output power dynamics depend on complex system interactions, non-ideal-gas effects complicate turbomachinery behavior, and state constraints must be respected. This article presents a control methodology for these systems, comprising a control modeling approach and model predictive control (MPC) strategy. This method…
▽ More
Control of non-condensing non-ideal-gas power cycles is challenging because their output power dynamics depend on complex system interactions, non-ideal-gas effects complicate turbomachinery behavior, and state constraints must be respected. This article presents a control methodology for these systems, comprising a control modeling approach and model predictive control (MPC) strategy. This methodology is demonstrated on the high-pressure side of a simple supercritical CO2 cycle power block, composed of a variable-speed compressor, heat exchanger, and fixed-speed turbine. The control model is developed by applying timescale-separation arguments to a high-fidelity simulation model and locally linearizing non-ideal-gas turbomachinery performance maps. MPC is implemented by linearizing the control model online at each sampling instant. Closed-loop simulations with a full-order gas-dynamics truth model demonstrate the effectiveness of the proposed control methodology. In response to load changes, the controller maintains high turbine inlet temperatures while achieving net power output ramp rates in excess of 100% of nameplate output per minute. The controller often acts at the intersection of motor torque, compressor surge, and turbine inlet temperature constraints, and performs well from 35 to 105% of nameplate capacity with no parameter scheduling. Good performance and fast update rates are obtained via online linearization. The results demonstrate the suitability of MPC for the supercritical CO2 cycle, and the proposed methodology is extensible to more complex cycle variants such as the recuperated and recompression cycle.
△ Less
Submitted 27 August, 2021;
originally announced August 2021.
-
From Darwin to Poincaré and von Neumann: Recurrence and Cycles in Evolutionary and Algorithmic Game Theory
Authors:
Victor Boone,
Georgios Piliouras
Abstract:
Replicator dynamics, the continuous-time analogue of Multiplicative Weights Updates, is the main dynamic in evolutionary game theory. In simple evolutionary zero-sum games, such as Rock-Paper-Scissors, replicator dynamic is periodic \cite{zeeman1980population}, however, its behavior in higher dimensions is not well understood. We provide a complete characterization of its behavior in zero-sum evol…
▽ More
Replicator dynamics, the continuous-time analogue of Multiplicative Weights Updates, is the main dynamic in evolutionary game theory. In simple evolutionary zero-sum games, such as Rock-Paper-Scissors, replicator dynamic is periodic \cite{zeeman1980population}, however, its behavior in higher dimensions is not well understood. We provide a complete characterization of its behavior in zero-sum evolutionary games. We prove that, if and only if, the system has an interior Nash equilibrium, the dynamics exhibit Poincaré recurrence, i.e., almost all orbits come arbitrary close to their initial conditions infinitely often. If no interior equilibria exist, then all interior initial conditions converge to the boundary. Specifically, the strategies that are not in the support of any equilibrium vanish in the limit of all orbits. All recurrence results furthermore extend to a class of games that generalize both graphical polymatrix games as well as evolutionary games, establishing a unifying link between evolutionary and algorithmic game theory. We show that two degrees of freedom, as in Rock-Paper-Scissors, is sufficient to prove periodicity.
△ Less
Submitted 3 October, 2019;
originally announced October 2019.