-
Accelerated Parameter-Free Stochastic Optimization
Authors:
Itai Kreisler,
Maor Ivgi,
Oliver Hinder,
Yair Carmon
Abstract:
We propose a method that achieves near-optimal rates for smooth stochastic convex optimization and requires essentially no prior knowledge of problem parameters. This improves on prior work which requires knowing at least the initial distance to optimality d0. Our method, U-DoG, combines UniXGrad (Kavis et al., 2019) and DoG (Ivgi et al., 2023) with novel iterate stabilization techniques. It requi…
▽ More
We propose a method that achieves near-optimal rates for smooth stochastic convex optimization and requires essentially no prior knowledge of problem parameters. This improves on prior work which requires knowing at least the initial distance to optimality d0. Our method, U-DoG, combines UniXGrad (Kavis et al., 2019) and DoG (Ivgi et al., 2023) with novel iterate stabilization techniques. It requires only loose bounds on d0 and the noise magnitude, provides high probability guarantees under sub-Gaussian noise, and is also near-optimal in the non-smooth case. Our experiments show consistent, strong performance on convex problems and mixed results on neural network training.
△ Less
Submitted 5 July, 2024; v1 submitted 31 March, 2024;
originally announced April 2024.
-
The Price of Adaptivity in Stochastic Convex Optimization
Authors:
Yair Carmon,
Oliver Hinder
Abstract:
We prove impossibility results for adaptivity in non-smooth stochastic convex optimization. Given a set of problem parameters we wish to adapt to, we define a "price of adaptivity" (PoA) that, roughly speaking, measures the multiplicative increase in suboptimality due to uncertainty in these parameters. When the initial distance to the optimum is unknown but a gradient norm bound is known, we show…
▽ More
We prove impossibility results for adaptivity in non-smooth stochastic convex optimization. Given a set of problem parameters we wish to adapt to, we define a "price of adaptivity" (PoA) that, roughly speaking, measures the multiplicative increase in suboptimality due to uncertainty in these parameters. When the initial distance to the optimum is unknown but a gradient norm bound is known, we show that the PoA is at least logarithmic for expected suboptimality, and double-logarithmic for median suboptimality. When there is uncertainty in both distance and gradient norm, we show that the PoA must be polynomial in the level of uncertainty. Our lower bounds nearly match existing upper bounds, and establish that there is no parameter-free lunch.
En route, we also establish tight upper and lower bounds for (known-parameter) high-probability stochastic convex optimization with heavy-tailed and bounded noise, respectively.
△ Less
Submitted 27 June, 2024; v1 submitted 16 February, 2024;
originally announced February 2024.
-
Datasets and Benchmarks for Nanophotonic Structure and Parametric Design Simulations
Authors:
Jungtaek Kim,
Mingxuan Li,
Oliver Hinder,
Paul W. Leu
Abstract:
Nanophotonic structures have versatile applications including solar cells, anti-reflective coatings, electromagnetic interference shielding, optical filters, and light emitting diodes. To design and understand these nanophotonic structures, electrodynamic simulations are essential. These simulations enable us to model electromagnetic fields over time and calculate optical properties. In this work,…
▽ More
Nanophotonic structures have versatile applications including solar cells, anti-reflective coatings, electromagnetic interference shielding, optical filters, and light emitting diodes. To design and understand these nanophotonic structures, electrodynamic simulations are essential. These simulations enable us to model electromagnetic fields over time and calculate optical properties. In this work, we introduce frameworks and benchmarks to evaluate nanophotonic structures in the context of parametric structure design problems. The benchmarks are instrumental in assessing the performance of optimization algorithms and identifying an optimal structure based on target optical properties. Moreover, we explore the impact of varying grid sizes in electrodynamic simulations, shedding light on how evaluation fidelity can be strategically leveraged in enhancing structure designs.
△ Less
Submitted 29 October, 2023;
originally announced October 2023.
-
DoG is SGD's Best Friend: A Parameter-Free Dynamic Step Size Schedule
Authors:
Maor Ivgi,
Oliver Hinder,
Yair Carmon
Abstract:
We propose a tuning-free dynamic SGD step size formula, which we call Distance over Gradients (DoG). The DoG step sizes depend on simple empirical quantities (distance from the initial point and norms of gradients) and have no ``learning rate'' parameter. Theoretically, we show that a slight variation of the DoG formula enjoys strong parameter-free convergence guarantees for stochastic convex opti…
▽ More
We propose a tuning-free dynamic SGD step size formula, which we call Distance over Gradients (DoG). The DoG step sizes depend on simple empirical quantities (distance from the initial point and norms of gradients) and have no ``learning rate'' parameter. Theoretically, we show that a slight variation of the DoG formula enjoys strong parameter-free convergence guarantees for stochastic convex optimization assuming only \emph{locally bounded} stochastic gradients. Empirically, we consider a broad range of vision and language transfer learning tasks, and show that DoG's performance is close to that of SGD with tuned learning rate. We also propose a per-layer variant of DoG that generally outperforms tuned SGD, approaching the performance of tuned Adam. A PyTorch implementation is available at https://github.com/formll/dog
△ Less
Submitted 16 July, 2023; v1 submitted 8 February, 2023;
originally announced February 2023.
-
Optimal Diagonal Preconditioning
Authors:
Zhaonan Qu,
Wenzhi Gao,
Oliver Hinder,
Yinyu Ye,
Zhengyuan Zhou
Abstract:
Preconditioning has long been a staple technique in optimization, often applied to reduce the condition number of a matrix and speed up the convergence of algorithms. Although there are many popular preconditioning techniques in practice, most lack guarantees on reductions in condition number. Moreover, the degree to which we can improve over existing heuristic preconditioners remains an important…
▽ More
Preconditioning has long been a staple technique in optimization, often applied to reduce the condition number of a matrix and speed up the convergence of algorithms. Although there are many popular preconditioning techniques in practice, most lack guarantees on reductions in condition number. Moreover, the degree to which we can improve over existing heuristic preconditioners remains an important practical question. In this paper, we study the problem of optimal diagonal preconditioning that achieves maximal reduction in the condition number of any full-rank matrix by scaling its rows and/or columns. We first reformulate the problem as a quasi-convex problem and provide a simple algorithm based on bisection. Then we develop an interior point algorithm with $O(\log(1/ε))$ iteration complexity, where each iteration consists of a Newton update based on the Nesterov-Todd direction. Next, we specialize to one-sided optimal diagonal preconditioning problems, and demonstrate that they can be formulated as standard dual SDP problems. We then develop efficient customized solvers and study the empirical performance of our optimal diagonal preconditioning procedures through extensive experiments on large matrices. Our findings suggest that optimal diagonal preconditioners can significantly improve upon existing heuristics-based diagonal preconditioners at reducing condition numbers and speeding up iterative methods. Moreover, our implementation of customized solvers, combined with a random row/column sampling step, can find near-optimal diagonal preconditioners for matrices up to size 200,000 in reasonable time, demonstrating their practical appeal.
△ Less
Submitted 4 November, 2022; v1 submitted 2 September, 2022;
originally announced September 2022.
-
Making SGD Parameter-Free
Authors:
Yair Carmon,
Oliver Hinder
Abstract:
We develop an algorithm for parameter-free stochastic convex optimization (SCO) whose rate of convergence is only a double-logarithmic factor larger than the optimal rate for the corresponding known-parameter setting. In contrast, the best previously known rates for parameter-free SCO are based on online parameter-free regret bounds, which contain unavoidable excess logarithmic terms compared to t…
▽ More
We develop an algorithm for parameter-free stochastic convex optimization (SCO) whose rate of convergence is only a double-logarithmic factor larger than the optimal rate for the corresponding known-parameter setting. In contrast, the best previously known rates for parameter-free SCO are based on online parameter-free regret bounds, which contain unavoidable excess logarithmic terms compared to their known-parameter counterparts. Our algorithm is conceptually simple, has high-probability guarantees, and is also partially adaptive to unknown gradient norms, smoothness, and strong convexity. At the heart of our results is a novel parameter-free certificate for SGD step size choice, and a time-uniform concentration result that assumes no a-priori bounds on SGD iterates.
△ Less
Submitted 1 March, 2024; v1 submitted 4 May, 2022;
originally announced May 2022.
-
An efficient nonconvex reformulation of stagewise convex optimization problems
Authors:
Rudy Bunel,
Oliver Hinder,
Srinadh Bhojanapalli,
Krishnamurthy,
Dvijotham
Abstract:
Convex optimization problems with staged structure appear in several contexts, including optimal control, verification of deep neural networks, and isotonic regression. Off-the-shelf solvers can solve these problems but may scale poorly. We develop a nonconvex reformulation designed to exploit this staged structure. Our reformulation has only simple bound constraints, enabling solution via project…
▽ More
Convex optimization problems with staged structure appear in several contexts, including optimal control, verification of deep neural networks, and isotonic regression. Off-the-shelf solvers can solve these problems but may scale poorly. We develop a nonconvex reformulation designed to exploit this staged structure. Our reformulation has only simple bound constraints, enabling solution via projected gradient methods and their accelerated variants. The method automatically generates a sequence of primal and dual feasible solutions to the original convex problem, making optimality certification easy. We establish theoretical properties of the nonconvex formulation, showing that it is (almost) free of spurious local minima and has the same global optimum as the convex problem. We modify PGD to avoid spurious local minimizers so it always converges to the global minimizer. For neural network verification, our approach obtains small duality gaps in only a few gradient steps. Consequently, it can quickly solve large-scale verification problems faster than both off-the-shelf and specialized solvers.
△ Less
Submitted 27 October, 2020;
originally announced October 2020.
-
Near-Optimal Methods for Minimizing Star-Convex Functions and Beyond
Authors:
Oliver Hinder,
Aaron Sidford,
Nimit S. Sohoni
Abstract:
In this paper, we provide near-optimal accelerated first-order methods for minimizing a broad class of smooth nonconvex functions that are strictly unimodal on all lines through a minimizer. This function class, which we call the class of smooth quasar-convex functions, is parameterized by a constant $γ\in (0,1]$, where $γ= 1$ encompasses the classes of smooth convex and star-convex functions, and…
▽ More
In this paper, we provide near-optimal accelerated first-order methods for minimizing a broad class of smooth nonconvex functions that are strictly unimodal on all lines through a minimizer. This function class, which we call the class of smooth quasar-convex functions, is parameterized by a constant $γ\in (0,1]$, where $γ= 1$ encompasses the classes of smooth convex and star-convex functions, and smaller values of $γ$ indicate that the function can be "more nonconvex." We develop a variant of accelerated gradient descent that computes an $ε$-approximate minimizer of a smooth $γ$-quasar-convex function with at most $O(γ^{-1} ε^{-1/2} \log(γ^{-1} ε^{-1}))$ total function and gradient evaluations. We also derive a lower bound of $Ω(γ^{-1} ε^{-1/2})$ on the worst-case number of gradient evaluations required by any deterministic first-order method, showing that, up to a logarithmic factor, no deterministic first-order method can improve upon ours.
△ Less
Submitted 24 February, 2023; v1 submitted 27 June, 2019;
originally announced June 2019.
-
Worst-case iteration bounds for log barrier methods on problems with nonconvex constraints
Authors:
Oliver Hinder,
Yinyu Ye
Abstract:
Interior point methods (IPMs) that handle nonconvex constraints such as IPOPT, KNITRO and LOQO have had enormous practical success. We consider IPMs in the setting where the objective and constraints are thrice differentiable, and have Lipschitz first and second derivatives on the feasible region. We provide an IPM that, starting from a strictly feasible point, finds a $μ$-approximate Fritz John p…
▽ More
Interior point methods (IPMs) that handle nonconvex constraints such as IPOPT, KNITRO and LOQO have had enormous practical success. We consider IPMs in the setting where the objective and constraints are thrice differentiable, and have Lipschitz first and second derivatives on the feasible region. We provide an IPM that, starting from a strictly feasible point, finds a $μ$-approximate Fritz John point by solving $\mathcal{O}( μ^{-7/4})$ trust-region subproblems. For IPMs that handle nonlinear constraints, this result represents the first iteration bound with a polynomial dependence on $1/μ$. We also show how to use our method to find scaled-KKT points starting from an infeasible solution and improve on existing complexity bounds.
△ Less
Submitted 3 November, 2023; v1 submitted 1 July, 2018;
originally announced July 2018.
-
Cutting plane methods can be extended into nonconvex optimization
Authors:
Oliver Hinder
Abstract:
We show that it is possible to obtain an $O(ε^{-4/3})$ expected runtime --- including computational cost --- for finding $ε$-stationary points of smooth nonconvex functions using cutting plane methods. This improves on the best-known epsilon dependence achieved by cubic regularized Newton of $O(ε^{-3/2})$ as proved by Nesterov and Polyak (2006). Our techniques utilize the convex until proven guilt…
▽ More
We show that it is possible to obtain an $O(ε^{-4/3})$ expected runtime --- including computational cost --- for finding $ε$-stationary points of smooth nonconvex functions using cutting plane methods. This improves on the best-known epsilon dependence achieved by cubic regularized Newton of $O(ε^{-3/2})$ as proved by Nesterov and Polyak (2006). Our techniques utilize the convex until proven guilty principle proposed by Carmon, Duchi, Hinder, and Sidford (2017).
△ Less
Submitted 26 June, 2019; v1 submitted 21 May, 2018;
originally announced May 2018.
-
Accelerated Methods for Non-Convex Optimization
Authors:
Yair Carmon,
John C. Duchi,
Oliver Hinder,
Aaron Sidford
Abstract:
We present an accelerated gradient method for non-convex optimization problems with Lipschitz continuous first and second derivatives. The method requires time $O(ε^{-7/4} \log(1/ ε) )$ to find an $ε$-stationary point, meaning a point $x$ such that $\|\nabla f(x)\| \le ε$. The method improves upon the $O(ε^{-2} )$ complexity of gradient descent and provides the additional second-order guarantee th…
▽ More
We present an accelerated gradient method for non-convex optimization problems with Lipschitz continuous first and second derivatives. The method requires time $O(ε^{-7/4} \log(1/ ε) )$ to find an $ε$-stationary point, meaning a point $x$ such that $\|\nabla f(x)\| \le ε$. The method improves upon the $O(ε^{-2} )$ complexity of gradient descent and provides the additional second-order guarantee that $\nabla^2 f(x) \succeq -O(ε^{1/2})I$ for the computed $x$. Furthermore, our method is Hessian free, i.e. it only requires gradient computations, and is therefore suitable for large scale applications.
△ Less
Submitted 2 February, 2017; v1 submitted 2 November, 2016;
originally announced November 2016.