subscribe to arXiv mailings

Towards Modular LLMs by Building and Reusing a Library of LoRAs

Authors: Oleksiy Ostapenko, Zhan Su, Edoardo Maria Ponti, Laurent Charlin, Nicolas Le Roux, Matheus Pereira, Lucas Caccia, Alessandro Sordoni

Abstract: The growing number of parameter-efficient adaptations of a base large language model (LLM) calls for studying whether we can reuse such trained adapters to improve performance for new tasks. We study how to best build a library of adapters given multi-task data and devise techniques for both zero-shot and supervised task generalization through routing in such library. We benchmark existing approac… ▽ More The growing number of parameter-efficient adaptations of a base large language model (LLM) calls for studying whether we can reuse such trained adapters to improve performance for new tasks. We study how to best build a library of adapters given multi-task data and devise techniques for both zero-shot and supervised task generalization through routing in such library. We benchmark existing approaches to build this library and introduce model-based clustering, MBC, a method that groups tasks based on the similarity of their adapter parameters, indirectly optimizing for transfer across the multi-task dataset. To re-use the library, we present a novel zero-shot routing mechanism, Arrow, which enables dynamic selection of the most relevant adapters for new inputs without the need for retraining. We experiment with several LLMs, such as Phi-2 and Mistral, on a wide array of held-out tasks, verifying that MBC-based adapters and Arrow routing lead to superior generalization to new tasks. We make steps towards creating modular, adaptable LLMs that can match or outperform traditional joint training. △ Less

Submitted 17 May, 2024; originally announced May 2024.

arXiv:2402.16354 [pdf, other]

Language-guided Skill Learning with Temporal Variational Inference

Authors: Haotian Fu, Pratyusha Sharma, Elias Stengel-Eskin, George Konidaris, Nicolas Le Roux, Marc-Alexandre Côté, Xingdi Yuan

Abstract: We present an algorithm for skill discovery from expert demonstrations. The algorithm first utilizes Large Language Models (LLMs) to propose an initial segmentation of the trajectories. Following that, a hierarchical variational inference framework incorporates the LLM-generated segmentation information to discover reusable skills by merging trajectory segments. To further control the trade-off be… ▽ More We present an algorithm for skill discovery from expert demonstrations. The algorithm first utilizes Large Language Models (LLMs) to propose an initial segmentation of the trajectories. Following that, a hierarchical variational inference framework incorporates the LLM-generated segmentation information to discover reusable skills by merging trajectory segments. To further control the trade-off between compression and reusability, we introduce a novel auxiliary objective based on the Minimum Description Length principle that helps guide this skill discovery process. Our results demonstrate that agents equipped with our method are able to discover skills that help accelerate learning and outperform baseline skill learning approaches on new long-horizon tasks in BabyAI, a grid world navigation environment, as well as ALFRED, a household simulation environment. △ Less

Submitted 27 May, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

Comments: ICML 2024

arXiv:2306.12509 [pdf, other]

Joint Prompt Optimization of Stacked LLMs using Variational Inference

Authors: Alessandro Sordoni, Xingdi Yuan, Marc-Alexandre Côté, Matheus Pereira, Adam Trischler, Ziang Xiao, Arian Hosseini, Friederike Niedtner, Nicolas Le Roux

Abstract: Large language models (LLMs) can be seen as atomic units of computation mapping sequences to a distribution over sequences. Thus, they can be seen as stochastic language layers in a language network, where the learnable parameters are the natural language prompts at each layer. By stacking two such layers and feeding the output of one layer to the next, we obtain a Deep Language Network (DLN). We… ▽ More Large language models (LLMs) can be seen as atomic units of computation mapping sequences to a distribution over sequences. Thus, they can be seen as stochastic language layers in a language network, where the learnable parameters are the natural language prompts at each layer. By stacking two such layers and feeding the output of one layer to the next, we obtain a Deep Language Network (DLN). We first show how to effectively perform prompt optimization for a 1-Layer language network (DLN-1). Then, we present an extension that applies to 2-layer DLNs (DLN-2), where two prompts must be learned. The key idea is to consider the output of the first layer as a latent variable, which requires inference, and prompts to be learned as the parameters of the generative distribution. We first test the effectiveness of DLN-1 in multiple reasoning and natural language understanding tasks. Then, we show that DLN-2 can reach higher performance than a single layer, showing promise that we might reach comparable performance to GPT-4, even when each LLM in the network is smaller and less powerful. △ Less

Submitted 4 December, 2023; v1 submitted 21 June, 2023; originally announced June 2023.

Comments: NeurIPS 2023

arXiv:2306.10043 [pdf, ps, other]

Unraveling the Interconnected Axes of Heterogeneity in Machine Learning for Democratic and Inclusive Advancements

Authors: Maryam Molamohammadi, Afaf Taik, Nicolas Le Roux, Golnoosh Farnadi

Abstract: The growing utilization of machine learning (ML) in decision-making processes raises questions about its benefits to society. In this study, we identify and analyze three axes of heterogeneity that significantly influence the trajectory of ML products. These axes are i) values, culture and regulations, ii) data composition, and iii) resource and infrastructure capacity. We demonstrate how these ax… ▽ More The growing utilization of machine learning (ML) in decision-making processes raises questions about its benefits to society. In this study, we identify and analyze three axes of heterogeneity that significantly influence the trajectory of ML products. These axes are i) values, culture and regulations, ii) data composition, and iii) resource and infrastructure capacity. We demonstrate how these axes are interdependent and mutually influence one another, emphasizing the need to consider and address them jointly. Unfortunately, the current research landscape falls short in this regard, often failing to adopt a holistic approach. We examine the prevalent practices and methodologies that skew these axes in favor of a selected few, resulting in power concentration, homogenized control, and increased dependency. We discuss how this fragmented study of the three axes poses a significant challenge, leading to an impractical solution space that lacks reflection of real-world scenarios. Addressing these issues is crucial to ensure a more comprehensive understanding of the interconnected nature of society and to foster the democratic and inclusive development of ML systems that are more aligned with real-world complexities and its diverse requirements. △ Less

Submitted 11 June, 2023; originally announced June 2023.

arXiv:2305.15249 [pdf, other]

Decision-Aware Actor-Critic with Function Approximation and Theoretical Guarantees

Authors: Sharan Vaswani, Amirreza Kazemi, Reza Babanezhad, Nicolas Le Roux

Abstract: Actor-critic (AC) methods are widely used in reinforcement learning (RL) and benefit from the flexibility of using any policy gradient method as the actor and value-based method as the critic. The critic is usually trained by minimizing the TD error, an objective that is potentially decorrelated with the true goal of achieving a high reward with the actor. We address this mismatch by designing a j… ▽ More Actor-critic (AC) methods are widely used in reinforcement learning (RL) and benefit from the flexibility of using any policy gradient method as the actor and value-based method as the critic. The critic is usually trained by minimizing the TD error, an objective that is potentially decorrelated with the true goal of achieving a high reward with the actor. We address this mismatch by designing a joint objective for training the actor and critic in a decision-aware fashion. We use the proposed objective to design a generic, AC algorithm that can easily handle any function approximation. We explicitly characterize the conditions under which the resulting algorithm guarantees monotonic policy improvement, regardless of the choice of the policy and critic parameterization. Instantiating the generic algorithm results in an actor that involves maximizing a sequence of surrogate functions (similar to TRPO, PPO) and a critic that involves minimizing a closely connected objective. Using simple bandit examples, we provably establish the benefit of the proposed critic objective over the standard squared error. Finally, we empirically demonstrate the benefit of our decision-aware actor-critic framework on simple RL problems. △ Less

Submitted 30 October, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

Comments: NeurIPS 2023

arXiv:2302.02607 [pdf, other]

Target-based Surrogates for Stochastic Optimization

Authors: Jonathan Wilder Lavington, Sharan Vaswani, Reza Babanezhad, Mark Schmidt, Nicolas Le Roux

Abstract: We consider minimizing functions for which it is expensive to compute the (possibly stochastic) gradient. Such functions are prevalent in reinforcement learning, imitation learning and adversarial training. Our target optimization framework uses the (expensive) gradient computation to construct surrogate functions in a \emph{target space} (e.g. the logits output by a linear model for classificatio… ▽ More We consider minimizing functions for which it is expensive to compute the (possibly stochastic) gradient. Such functions are prevalent in reinforcement learning, imitation learning and adversarial training. Our target optimization framework uses the (expensive) gradient computation to construct surrogate functions in a \emph{target space} (e.g. the logits output by a linear model for classification) that can be minimized efficiently. This allows for multiple parameter updates to the model, amortizing the cost of gradient computation. In the full-batch setting, we prove that our surrogate is a global upper-bound on the loss, and can be (locally) minimized using a black-box optimization algorithm. We prove that the resulting majorization-minimization algorithm ensures convergence to a stationary point of the loss. Next, we instantiate our framework in the stochastic setting and propose the $SSO$ algorithm, which can be viewed as projected stochastic gradient descent in the target space. This connection enables us to prove theoretical guarantees for $SSO$ when minimizing convex functions. Our framework allows the use of standard stochastic optimization algorithms to construct surrogates which can be minimized by any deterministic optimization method. To evaluate our framework, we consider a suite of supervised learning and imitation learning problems. Our experiments indicate the benefits of target optimization and the effectiveness of $SSO$. △ Less

Submitted 8 June, 2023; v1 submitted 6 February, 2023; originally announced February 2023.

arXiv:2211.03831 [pdf, other]

Multi-Head Adapter Routing for Cross-Task Generalization

Authors: Lucas Caccia, Edoardo Ponti, Zhan Su, Matheus Pereira, Nicolas Le Roux, Alessandro Sordoni

Abstract: Parameter-efficient fine-tuning (PEFT) for cross-task generalization consists in pre-training adapters on a multi-task training set before few-shot adaptation to test tasks. Polytropon [Ponti et al., 2023] ($\texttt{Poly}$) jointly learns an inventory of adapters and a routing function that selects a (variable-size) subset of adapters for each task during both pre-training and few-shot adaptation.… ▽ More Parameter-efficient fine-tuning (PEFT) for cross-task generalization consists in pre-training adapters on a multi-task training set before few-shot adaptation to test tasks. Polytropon [Ponti et al., 2023] ($\texttt{Poly}$) jointly learns an inventory of adapters and a routing function that selects a (variable-size) subset of adapters for each task during both pre-training and few-shot adaptation. In this paper, we investigate the role that adapter routing plays in its success and design new variants based on our findings. First, we build on the intuition that finer-grained routing provides more expressivity. Hence, we propose $\texttt{MHR}$ (Multi-Head Routing) which combines subsets of adapter parameters and outperforms $\texttt{Poly}$ under a comparable parameter budget; by only fine-tuning the routing function and not the adapters ($\texttt{MHR}$-$z$) we achieve competitive performance with extreme parameter efficiency. Second, we find that $\texttt{Poly}$/$\texttt{MHR}$ performance is a result of better multi-task optimization, rather than modular inductive biases that facilitate adapter recombination and local adaptation, as previously hypothesized. In fact, we find that $\texttt{MHR}$ exhibits high gradient alignment between training tasks. We find that routing is most beneficial during multi-task pre-training rather than during few-shot adaptation and propose $\texttt{MHR}$-$μ$, which discards routing and fine-tunes the average of the pre-trained adapters on each downstream tasks. This establishes $\texttt{MHR}$-$μ$ as an effective method for single-adapter fine-tuning. We also show that $\texttt{MHR}$-$μ$ can be used as an effective zero-shot transfer method by training the average of the pre-trained adapters for a few additional steps on the multi-task training set: this yields gains up to 3% on absolute accuracy w.r.t. the baselines. △ Less

Submitted 13 November, 2023; v1 submitted 7 November, 2022; originally announced November 2022.

Comments: Accepted at NeurIPS 2023. Code is available at https://github.com/microsoft/mttl

arXiv:2108.05828 [pdf, other]

A general class of surrogate functions for stable and efficient reinforcement learning

Authors: Sharan Vaswani, Olivier Bachem, Simone Totaro, Robert Mueller, Shivam Garg, Matthieu Geist, Marlos C. Machado, Pablo Samuel Castro, Nicolas Le Roux

Abstract: Common policy gradient methods rely on the maximization of a sequence of surrogate functions. In recent years, many such surrogate functions have been proposed, most without strong theoretical guarantees, leading to algorithms such as TRPO, PPO or MPO. Rather than design yet another surrogate function, we instead propose a general framework (FMA-PG) based on functional mirror ascent that gives ris… ▽ More Common policy gradient methods rely on the maximization of a sequence of surrogate functions. In recent years, many such surrogate functions have been proposed, most without strong theoretical guarantees, leading to algorithms such as TRPO, PPO or MPO. Rather than design yet another surrogate function, we instead propose a general framework (FMA-PG) based on functional mirror ascent that gives rise to an entire family of surrogate functions. We construct surrogate functions that enable policy improvement guarantees, a property not shared by most existing surrogate functions. Crucially, these guarantees hold regardless of the choice of policy parameterization. Moreover, a particular instantiation of FMA-PG recovers important implementation heuristics (e.g., using forward vs reverse KL divergence) resulting in a variant of TRPO with additional desirable properties. Via experiments on simple bandit problems, we evaluate the algorithms instantiated by FMA-PG. The proposed framework also suggests an improved variant of PPO, whose robustness and efficiency we empirically demonstrate on the MuJoCo suite. △ Less

Submitted 30 October, 2023; v1 submitted 12 August, 2021; originally announced August 2021.

Comments: Fixed minor typos

arXiv:2108.03489 [pdf, other]

Impact of Aliasing on Generalization in Deep Convolutional Networks

Authors: Cristina Vasconcelos, Hugo Larochelle, Vincent Dumoulin, Rob Romijnders, Nicolas Le Roux, Ross Goroshin

Abstract: We investigate the impact of aliasing on generalization in Deep Convolutional Networks and show that data augmentation schemes alone are unable to prevent it due to structural limitations in widely used architectures. Drawing insights from frequency analysis theory, we take a closer look at ResNet and EfficientNet architectures and review the trade-off between aliasing and information loss in each… ▽ More We investigate the impact of aliasing on generalization in Deep Convolutional Networks and show that data augmentation schemes alone are unable to prevent it due to structural limitations in widely used architectures. Drawing insights from frequency analysis theory, we take a closer look at ResNet and EfficientNet architectures and review the trade-off between aliasing and information loss in each of their major components. We show how to mitigate aliasing by inserting non-trainable low-pass filters at key locations, particularly where networks lack the capacity to learn them. These simple architectural changes lead to substantial improvements in generalization on i.i.d. and even more on out-of-distribution conditions, such as image classification under natural corruptions on ImageNet-C [11] and few-shot learning on Meta-Dataset [26]. State-of-the art results are achieved on both datasets without introducing additional trainable parameters and using the default hyper-parameters of open source codebases. △ Less

Submitted 7 August, 2021; originally announced August 2021.

Comments: Accepted to ICCV 2021. arXiv admin note: text overlap with arXiv:2011.10675

arXiv:2107.00464 [pdf, other]

On the Convergence of Stochastic Extragradient for Bilinear Games using Restarted Iteration Averaging

Authors: Chris Junchi Li, Yaodong Yu, Nicolas Loizou, Gauthier Gidel, Yi Ma, Nicolas Le Roux, Michael I. Jordan

Abstract: We study the stochastic bilinear minimax optimization problem, presenting an analysis of the same-sample Stochastic ExtraGradient (SEG) method with constant step size, and presenting variations of the method that yield favorable convergence. In sharp contrasts with the basic SEG method whose last iterate only contracts to a fixed neighborhood of the Nash equilibrium, SEG augmented with iteration a… ▽ More We study the stochastic bilinear minimax optimization problem, presenting an analysis of the same-sample Stochastic ExtraGradient (SEG) method with constant step size, and presenting variations of the method that yield favorable convergence. In sharp contrasts with the basic SEG method whose last iterate only contracts to a fixed neighborhood of the Nash equilibrium, SEG augmented with iteration averaging provably converges to the Nash equilibrium under the same standard settings, and such a rate is further improved by incorporating a scheduled restarting procedure. In the interpolation setting where noise vanishes at the Nash equilibrium, we achieve an optimal convergence rate up to tight constants. We present numerical experiments that validate our theoretical findings and demonstrate the effectiveness of the SEG method when equipped with iteration averaging and restarting. △ Less

Submitted 8 April, 2022; v1 submitted 30 June, 2021; originally announced July 2021.

Comments: Camera-ready version appeared at AISTATS 2022; short version appeared at NeurIPS OPT 2021 Workshop

arXiv:2102.08868 [pdf, other]

Bridging the Gap Between Adversarial Robustness and Optimization Bias

Authors: Fartash Faghri, Sven Gowal, Cristina Vasconcelos, David J. Fleet, Fabian Pedregosa, Nicolas Le Roux

Abstract: We demonstrate that the choice of optimizer, neural network architecture, and regularizer significantly affect the adversarial robustness of linear neural networks, providing guarantees without the need for adversarial training. To this end, we revisit a known result linking maximally robust classifiers and minimum norm solutions, and combine it with recent results on the implicit bias of optimize… ▽ More We demonstrate that the choice of optimizer, neural network architecture, and regularizer significantly affect the adversarial robustness of linear neural networks, providing guarantees without the need for adversarial training. To this end, we revisit a known result linking maximally robust classifiers and minimum norm solutions, and combine it with recent results on the implicit bias of optimizers. First, we show that, under certain conditions, it is possible to achieve both perfect standard accuracy and a certain degree of robustness, simply by training an overparametrized model using the implicit bias of the optimization. In that regime, there is a direct relationship between the type of the optimizer and the attack to which the model is robust. To the best of our knowledge, this work is the first to study the impact of optimization methods such as sign gradient descent and proximal methods on adversarial robustness. Second, we characterize the robustness of linear convolutional models, showing that they resist attacks subject to a constraint on the Fourier-$\ell_\infty$ norm. To illustrate these findings we design a novel Fourier-$\ell_\infty$ attack that finds adversarial examples with controllable frequencies. We evaluate Fourier-$\ell_\infty$ robustness of adversarially-trained deep CIFAR-10 models from the standard RobustBench benchmark and visualize adversarial perturbations. △ Less

Submitted 7 June, 2021; v1 submitted 17 February, 2021; originally announced February 2021.

Comments: New CIFAR-10 experiments and Fourier attack variations

arXiv:2011.10675 [pdf, other]

An Effective Anti-Aliasing Approach for Residual Networks

Authors: Cristina Vasconcelos, Hugo Larochelle, Vincent Dumoulin, Nicolas Le Roux, Ross Goroshin

Abstract: Image pre-processing in the frequency domain has traditionally played a vital role in computer vision and was even part of the standard pipeline in the early days of deep learning. However, with the advent of large datasets, many practitioners concluded that this was unnecessary due to the belief that these priors can be learned from the data itself. Frequency aliasing is a phenomenon that may occ… ▽ More Image pre-processing in the frequency domain has traditionally played a vital role in computer vision and was even part of the standard pipeline in the early days of deep learning. However, with the advent of large datasets, many practitioners concluded that this was unnecessary due to the belief that these priors can be learned from the data itself. Frequency aliasing is a phenomenon that may occur when sub-sampling any signal, such as an image or feature map, causing distortion in the sub-sampled output. We show that we can mitigate this effect by placing non-trainable blur filters and using smooth activation functions at key locations, particularly where networks lack the capacity to learn them. These simple architectural changes lead to substantial improvements in out-of-distribution generalization on both image classification under natural corruptions on ImageNet-C [10] and few-shot learning on Meta-Dataset [17], without introducing additional trainable parameters and using the default hyper-parameters of open source codebases. △ Less

Submitted 20 November, 2020; originally announced November 2020.

arXiv:2008.13773 [pdf, other]

Beyond variance reduction: Understanding the true impact of baselines on policy optimization

Authors: Wesley Chung, Valentin Thomas, Marlos C. Machado, Nicolas Le Roux

Abstract: Bandit and reinforcement learning (RL) problems can often be framed as optimization problems where the goal is to maximize average performance while having access only to stochastic estimates of the true gradient. Traditionally, stochastic optimization theory predicts that learning dynamics are governed by the curvature of the loss function and the noise of the gradient estimates. In this paper we… ▽ More Bandit and reinforcement learning (RL) problems can often be framed as optimization problems where the goal is to maximize average performance while having access only to stochastic estimates of the true gradient. Traditionally, stochastic optimization theory predicts that learning dynamics are governed by the curvature of the loss function and the noise of the gradient estimates. In this paper we demonstrate that this is not the case for bandit and RL problems. To allow our analysis to be interpreted in light of multi-step MDPs, we focus on techniques derived from stochastic optimization principles (e.g., natural policy gradient and EXP3) and we show that some standard assumptions from optimization theory are violated in these problems. We present theoretical results showing that, at least for bandit problems, curvature and noise are not sufficient to explain the learning dynamics and that seemingly innocuous choices like the baseline can determine whether an algorithm converges. These theoretical findings match our empirical evaluation, which we extend to multi-state MDPs. △ Less

Submitted 19 February, 2021; v1 submitted 31 August, 2020; originally announced August 2020.

arXiv:2006.11266 [pdf, other]

An operator view of policy gradient methods

Authors: Dibya Ghosh, Marlos C. Machado, Nicolas Le Roux

Abstract: We cast policy gradient methods as the repeated application of two operators: a policy improvement operator $\mathcal{I}$, which maps any policy $π$ to a better one $\mathcal{I}π$, and a projection operator $\mathcal{P}$, which finds the best approximation of $\mathcal{I}π$ in the set of realizable policies. We use this framework to introduce operator-based versions of traditional policy gradient… ▽ More We cast policy gradient methods as the repeated application of two operators: a policy improvement operator $\mathcal{I}$, which maps any policy $π$ to a better one $\mathcal{I}π$, and a projection operator $\mathcal{P}$, which finds the best approximation of $\mathcal{I}π$ in the set of realizable policies. We use this framework to introduce operator-based versions of traditional policy gradient methods such as REINFORCE and PPO, which leads to a better understanding of their original counterparts. We also use the understanding we develop of the role of $\mathcal{I}$ and $\mathcal{P}$ to propose a new global lower bound of the expected return. This new perspective allows us to further bridge the gap between policy-based and value-based methods, showing how REINFORCE and the Bellman optimality operator, for example, can be seen as two sides of the same coin. △ Less

Submitted 22 October, 2020; v1 submitted 19 June, 2020; originally announced June 2020.

Comments: NeurIPS 2020

arXiv:2006.06821 [pdf, other]

To Each Optimizer a Norm, To Each Norm its Generalization

Authors: Sharan Vaswani, Reza Babanezhad, Jose Gallego-Posada, Aaron Mishkin, Simon Lacoste-Julien, Nicolas Le Roux

Abstract: We study the implicit regularization of optimization methods for linear models interpolating the training data in the under-parametrized and over-parametrized regimes. Since it is difficult to determine whether an optimizer converges to solutions that minimize a known norm, we flip the problem and investigate what is the corresponding norm minimized by an interpolating solution. Using this reasoni… ▽ More We study the implicit regularization of optimization methods for linear models interpolating the training data in the under-parametrized and over-parametrized regimes. Since it is difficult to determine whether an optimizer converges to solutions that minimize a known norm, we flip the problem and investigate what is the corresponding norm minimized by an interpolating solution. Using this reasoning, we prove that for over-parameterized linear regression, projections onto linear spans can be used to move between different interpolating solutions. For under-parameterized linear classification, we prove that for any linear classifier separating the data, there exists a family of quadratic norms ||.||_P such that the classifier's direction is the same as that of the maximum P-margin solution. For linear classification, we argue that analyzing convergence to the standard maximum l2-margin is arbitrary and show that minimizing the norm induced by the data results in better generalization. Furthermore, for over-parameterized linear classification, projections onto the data-span enable us to use techniques from the under-parameterized setting. On the empirical side, we propose techniques to bias optimizers towards better generalizing solutions, improving their test performance. We validate our theoretical results via synthetic experiments, and use the neural tangent kernel to handle non-linear models. △ Less

Submitted 11 June, 2020; originally announced June 2020.

arXiv:2002.08056 [pdf, other]

The Geometry of Sign Gradient Descent

Authors: Lukas Balles, Fabian Pedregosa, Nicolas Le Roux

Abstract: Sign-based optimization methods have become popular in machine learning due to their favorable communication cost in distributed optimization and their surprisingly good performance in neural network training. Furthermore, they are closely connected to so-called adaptive gradient methods like Adam. Recent works on signSGD have used a non-standard "separable smoothness" assumption, whereas some old… ▽ More Sign-based optimization methods have become popular in machine learning due to their favorable communication cost in distributed optimization and their surprisingly good performance in neural network training. Furthermore, they are closely connected to so-called adaptive gradient methods like Adam. Recent works on signSGD have used a non-standard "separable smoothness" assumption, whereas some older works study sign gradient descent as steepest descent with respect to the $\ell_\infty$-norm. In this work, we unify these existing results by showing a close connection between separable smoothness and $\ell_\infty$-smoothness and argue that the latter is the weaker and more natural assumption. We then proceed to study the smoothness constant with respect to the $\ell_\infty$-norm and thereby isolate geometric properties of the objective function which affect the performance of sign-based methods. In short, we find sign-based methods to be preferable over gradient descent if (i) the Hessian is to some degree concentrated on its diagonal, and (ii) its maximal eigenvalue is much larger than the average eigenvalue. Both properties are common in deep networks. △ Less

Submitted 19 February, 2020; originally announced February 2020.

arXiv:1906.07774 [pdf, other]

On the interplay between noise and curvature and its effect on optimization and generalization

Authors: Valentin Thomas, Fabian Pedregosa, Bart van Merriënboer, Pierre-Antoine Mangazol, Yoshua Bengio, Nicolas Le Roux

Abstract: The speed at which one can minimize an expected loss using stochastic methods depends on two properties: the curvature of the loss and the variance of the gradients. While most previous works focus on one or the other of these properties, we explore how their interaction affects optimization speed. Further, as the ultimate goal is good generalization performance, we clarify how both curvature and… ▽ More The speed at which one can minimize an expected loss using stochastic methods depends on two properties: the curvature of the loss and the variance of the gradients. While most previous works focus on one or the other of these properties, we explore how their interaction affects optimization speed. Further, as the ultimate goal is good generalization performance, we clarify how both curvature and noise are relevant to properly estimate the generalization gap. Realizing that the limitations of some existing works stems from a confusion between these matrices, we also clarify the distinction between the Fisher matrix, the Hessian, and the covariance matrix of the gradients. △ Less

Submitted 6 April, 2020; v1 submitted 18 June, 2019; originally announced June 2019.

Comments: Accepted to AISTATS 2020

arXiv:1906.03532 [pdf, other]

Reducing the variance in online optimization by transporting past gradients

Authors: Sébastien M. R. Arnold, Pierre-Antoine Manzagol, Reza Babanezhad, Ioannis Mitliagkas, Nicolas Le Roux

Abstract: Most stochastic optimization methods use gradients once before discarding them. While variance reduction methods have shown that reusing past gradients can be beneficial when there is a finite number of datapoints, they do not easily extend to the online setting. One issue is the staleness due to using past gradients. We propose to correct this staleness using the idea of implicit gradient transpo… ▽ More Most stochastic optimization methods use gradients once before discarding them. While variance reduction methods have shown that reusing past gradients can be beneficial when there is a finite number of datapoints, they do not easily extend to the online setting. One issue is the staleness due to using past gradients. We propose to correct this staleness using the idea of implicit gradient transport (IGT) which transforms gradients computed at previous iterates into gradients evaluated at the current iterate without using the Hessian explicitly. In addition to reducing the variance and bias of our updates over time, IGT can be used as a drop-in replacement for the gradient estimate in a number of well-understood methods such as heavy ball or Adam. We show experimentally that it achieves state-of-the-art results on a wide range of architectures and benchmarks. Additionally, the IGT gradient estimator yields the optimal asymptotic convergence rate for online stochastic optimization in the restricted setting where the Hessians of all component functions are equal. △ Less

Submitted 18 June, 2019; v1 submitted 8 June, 2019; originally announced June 2019.

Comments: Open-source implementation available at: https://github.com/seba-1511/igt.pth

arXiv:1902.05083 [pdf, other]

Anytime Tail Averaging

Authors: Nicolas Le Roux

Abstract: Tail averaging consists in averaging the last examples in a stream. Common techniques either have a memory requirement which grows with the number of samples to average, are not available at every timestep or do not accomodate growing windows. We propose two techniques with a low constant memory cost that perform tail averaging with access to the average at every time step. We also show how one ca… ▽ More Tail averaging consists in averaging the last examples in a stream. Common techniques either have a memory requirement which grows with the number of samples to average, are not available at every timestep or do not accomodate growing windows. We propose two techniques with a low constant memory cost that perform tail averaging with access to the average at every time step. We also show how one can improve the accuracy of that average at the cost of increased memory consumption. △ Less

Submitted 19 February, 2019; v1 submitted 13 February, 2019; originally announced February 2019.

Comments: Added a specific section on the case of multiple accumulators when k_t is a constant

arXiv:1902.03149 [pdf, other]

Distributional reinforcement learning with linear function approximation

Authors: Marc G. Bellemare, Nicolas Le Roux, Pablo Samuel Castro, Subhodeep Moitra

Abstract: Despite many algorithmic advances, our theoretical understanding of practical distributional reinforcement learning methods remains limited. One exception is Rowland et al. (2018)'s analysis of the C51 algorithm in terms of the Cramér distance, but their results only apply to the tabular setting and ignore C51's use of a softmax to produce normalized distributions. In this paper we adapt the Cramé… ▽ More Despite many algorithmic advances, our theoretical understanding of practical distributional reinforcement learning methods remains limited. One exception is Rowland et al. (2018)'s analysis of the C51 algorithm in terms of the Cramér distance, but their results only apply to the tabular setting and ignore C51's use of a softmax to produce normalized distributions. In this paper we adapt the Cramér distance to deal with arbitrary vectors. From it we derive a new distributional algorithm which is fully Cramér-based and can be combined to linear function approximation, with formal guarantees in the context of policy evaluation. In allowing the model's prediction to be any real vector, we lose the probabilistic interpretation behind the method, but otherwise maintain the appealing properties of distributional approaches. To the best of our knowledge, ours is the first proof of convergence of a distributional algorithm combined with function approximation. Perhaps surprisingly, our results provide evidence that Cramér-based distributional methods may perform worse than directly approximating the value function. △ Less

Submitted 8 February, 2019; originally announced February 2019.

Comments: To appear

Journal ref: Proceedings of AISTATS 2019

arXiv:1902.02366 [pdf, other]

Negative eigenvalues of the Hessian in deep neural networks

Authors: Guillaume Alain, Nicolas Le Roux, Pierre-Antoine Manzagol

Abstract: The loss function of deep networks is known to be non-convex but the precise nature of this nonconvexity is still an active area of research. In this work, we study the loss landscape of deep networks through the eigendecompositions of their Hessian matrix. In particular, we examine how important the negative eigenvalues are and the benefits one can observe in handling them appropriately. The loss function of deep networks is known to be non-convex but the precise nature of this nonconvexity is still an active area of research. In this work, we study the loss landscape of deep networks through the eigendecompositions of their Hessian matrix. In particular, we examine how important the negative eigenvalues are and the benefits one can observe in handling them appropriately. △ Less

Submitted 6 February, 2019; originally announced February 2019.

arXiv:1901.11530 [pdf, other]

A Geometric Perspective on Optimal Representations for Reinforcement Learning

Authors: Marc G. Bellemare, Will Dabney, Robert Dadashi, Adrien Ali Taiga, Pablo Samuel Castro, Nicolas Le Roux, Dale Schuurmans, Tor Lattimore, Clare Lyle

Abstract: We propose a new perspective on representation learning in reinforcement learning based on geometric properties of the space of value functions. We leverage this perspective to provide formal evidence regarding the usefulness of value functions as auxiliary tasks. Our formulation considers adapting the representation to minimize the (linear) approximation of the value function of all stationary po… ▽ More We propose a new perspective on representation learning in reinforcement learning based on geometric properties of the space of value functions. We leverage this perspective to provide formal evidence regarding the usefulness of value functions as auxiliary tasks. Our formulation considers adapting the representation to minimize the (linear) approximation of the value function of all stationary policies for a given environment. We show that this optimization reduces to making accurate predictions regarding a special class of value functions which we call adversarial value functions (AVFs). We demonstrate that using value functions as auxiliary tasks corresponds to an expected-error relaxation of our formulation, with AVFs a natural candidate, and identify a close relationship with proto-value functions (Mahadevan, 2005). We highlight characteristics of AVFs and their usefulness as auxiliary tasks in a series of experiments on the four-room domain. △ Less

Submitted 25 June, 2019; v1 submitted 31 January, 2019; originally announced January 2019.

arXiv:1901.11524 [pdf, other]

The Value Function Polytope in Reinforcement Learning

Authors: Robert Dadashi, Adrien Ali Taïga, Nicolas Le Roux, Dale Schuurmans, Marc G. Bellemare

Abstract: We establish geometric and topological properties of the space of value functions in finite state-action Markov decision processes. Our main contribution is the characterization of the nature of its shape: a general polytope (Aigner et al., 2010). To demonstrate this result, we exhibit several properties of the structural relationship between policies and value functions including the line theorem… ▽ More We establish geometric and topological properties of the space of value functions in finite state-action Markov decision processes. Our main contribution is the characterization of the nature of its shape: a general polytope (Aigner et al., 2010). To demonstrate this result, we exhibit several properties of the structural relationship between policies and value functions including the line theorem, which shows that the value functions of policies constrained on all but one state describe a line segment. Finally, we use this novel perspective to introduce visualizations to enhance the understanding of the dynamics of reinforcement learning algorithms. △ Less

Submitted 15 May, 2019; v1 submitted 31 January, 2019; originally announced January 2019.

arXiv:1811.11214 [pdf, other]

Understanding the impact of entropy on policy optimization

Authors: Zafarali Ahmed, Nicolas Le Roux, Mohammad Norouzi, Dale Schuurmans

Abstract: Entropy regularization is commonly used to improve policy optimization in reinforcement learning. It is believed to help with \emph{exploration} by encouraging the selection of more stochastic policies. In this work, we analyze this claim using new visualizations of the optimization landscape based on randomly perturbing the loss function. We first show that even with access to the exact gradient,… ▽ More Entropy regularization is commonly used to improve policy optimization in reinforcement learning. It is believed to help with \emph{exploration} by encouraging the selection of more stochastic policies. In this work, we analyze this claim using new visualizations of the optimization landscape based on randomly perturbing the loss function. We first show that even with access to the exact gradient, policy optimization is difficult due to the geometry of the objective function. Then, we qualitatively show that in some environments, a policy with higher entropy can make the optimization landscape smoother, thereby connecting local optima and enabling the use of larger learning rates. This paper presents new tools for understanding the optimization landscape, shows that policy entropy serves as a regularizer, and highlights the challenge of designing general-purpose policy optimization algorithms. △ Less

Submitted 7 June, 2019; v1 submitted 27 November, 2018; originally announced November 2018.

Comments: Accepted to ICML 2019

arXiv:1710.07462 [pdf, other]

Tracking the gradients using the Hessian: A new look at variance reducing stochastic methods

Authors: Robert M. Gower, Nicolas Le Roux, Francis Bach

Abstract: Our goal is to improve variance reducing stochastic methods through better control variates. We first propose a modification of SVRG which uses the Hessian to track gradients over time, rather than to recondition, increasing the correlation of the control variates and leading to faster theoretical convergence close to the optimum. We then propose accurate and computationally efficient approximatio… ▽ More Our goal is to improve variance reducing stochastic methods through better control variates. We first propose a modification of SVRG which uses the Hessian to track gradients over time, rather than to recondition, increasing the correlation of the control variates and leading to faster theoretical convergence close to the optimum. We then propose accurate and computationally efficient approximations to the Hessian, both using a diagonal and a low-rank matrix. Finally, we demonstrate the effectiveness of our method on a wide range of problems. △ Less

Submitted 31 March, 2018; v1 submitted 20 October, 2017; originally announced October 2017.

Comments: 17 pages, 2 figures, 1 table

MSC Class: 90C15; 90C25; 68W20

arXiv:1705.10405 [pdf, other]

Distributed SAGA: Maintaining linear convergence rate with limited communication

Authors: Clément Calauzènes, Nicolas Le Roux

Abstract: In recent years, variance-reducing stochastic methods have shown great practical performance, exhibiting linear convergence rate when other stochastic methods offered a sub-linear rate. However, as datasets grow ever bigger and clusters become widespread, the need for fast distribution methods is pressing. We propose here a distribution scheme for SAGA which maintains a linear convergence rate, ev… ▽ More In recent years, variance-reducing stochastic methods have shown great practical performance, exhibiting linear convergence rate when other stochastic methods offered a sub-linear rate. However, as datasets grow ever bigger and clusters become widespread, the need for fast distribution methods is pressing. We propose here a distribution scheme for SAGA which maintains a linear convergence rate, even when communication between nodes is limited. △ Less

Submitted 29 May, 2017; originally announced May 2017.

arXiv:1704.00773 [pdf, other]

A comparative study of counterfactual estimators

Authors: Thomas Nedelec, Nicolas Le Roux, Vianney Perchet

Abstract: We provide a comparative study of several widely used off-policy estimators (Empirical Average, Basic Importance Sampling and Normalized Importance Sampling), detailing the different regimes where they are individually suboptimal. We then exhibit properties optimal estimators should possess. In the case where examples have been gathered using multiple policies, we show that fused estimators domina… ▽ More We provide a comparative study of several widely used off-policy estimators (Empirical Average, Basic Importance Sampling and Normalized Importance Sampling), detailing the different regimes where they are individually suboptimal. We then exhibit properties optimal estimators should possess. In the case where examples have been gathered using multiple policies, we show that fused estimators dominate basic ones but can still be improved. △ Less

Submitted 29 January, 2019; v1 submitted 3 April, 2017; originally announced April 2017.

arXiv:1612.08967 [pdf, other]

Efficient iterative policy optimization

Authors: Nicolas Le Roux

Abstract: We tackle the issue of finding a good policy when the number of policy updates is limited. This is done by approximating the expected policy reward as a sequence of concave lower bounds which can be efficiently maximized, drastically reducing the number of policy updates required to achieve good performance. We also extend existing methods to negative rewards, enabling the use of control variates. We tackle the issue of finding a good policy when the number of policy updates is limited. This is done by approximating the expected policy reward as a sequence of concave lower bounds which can be efficiently maximized, drastically reducing the number of policy updates required to achieve good performance. We also extend existing methods to negative rewards, enabling the use of control variates. △ Less

Submitted 28 December, 2016; originally announced December 2016.

Comments: 12 pages

arXiv:1606.09202 [pdf, other]

Tighter bounds lead to improved classifiers

Authors: Nicolas Le Roux

Abstract: The standard approach to supervised classification involves the minimization of a log-loss as an upper bound to the classification error. While this is a tight bound early on in the optimization, it overemphasizes the influence of incorrectly classified examples far from the decision boundary. Updating the upper bound during the optimization leads to improved classification rates while transformin… ▽ More The standard approach to supervised classification involves the minimization of a log-loss as an upper bound to the classification error. While this is a tight bound early on in the optimization, it overemphasizes the influence of incorrectly classified examples far from the decision boundary. Updating the upper bound during the optimization leads to improved classification rates while transforming the learning into a sequence of minimization problems. In addition, in the context where the classifier is part of a larger system, this modification makes it possible to link the performance of the classifier to that of the whole system, allowing the seamless introduction of external constraints. △ Less

Submitted 28 December, 2016; v1 submitted 29 June, 2016; originally announced June 2016.

arXiv:1309.2388 [pdf, other]

Minimizing Finite Sums with the Stochastic Average Gradient

Authors: Mark Schmidt, Nicolas Le Roux, Francis Bach

Abstract: We propose the stochastic average gradient (SAG) method for optimizing the sum of a finite number of smooth convex functions. Like stochastic gradient (SG) methods, the SAG method's iteration cost is independent of the number of terms in the sum. However, by incorporating a memory of previous gradient values the SAG method achieves a faster convergence rate than black-box SG methods. The convergen… ▽ More We propose the stochastic average gradient (SAG) method for optimizing the sum of a finite number of smooth convex functions. Like stochastic gradient (SG) methods, the SAG method's iteration cost is independent of the number of terms in the sum. However, by incorporating a memory of previous gradient values the SAG method achieves a faster convergence rate than black-box SG methods. The convergence rate is improved from O(1/k^{1/2}) to O(1/k) in general, and when the sum is strongly-convex the convergence rate is improved from the sub-linear O(1/k) to a linear convergence rate of the form O(p^k) for p \textless{} 1. Further, in many cases the convergence rate of the new method is also faster than black-box deterministic gradient methods, in terms of the number of gradient evaluations. Numerical experiments indicate that the new algorithm often dramatically outperforms existing SG and deterministic gradient methods, and that the performance may be further improved through the use of non-uniform sampling strategies. △ Less

Submitted 11 May, 2016; v1 submitted 10 September, 2013; originally announced September 2013.

Comments: Revision from January 2015 submission. Major changes: updated literature follow and discussion of subsequent work, additional Lemma showing the validity of one of the formulas, somewhat simplified presentation of Lyapunov bound, included code needed for checking proofs rather than the polynomials generated by the code, added error regions to the numerical experiments

arXiv:1202.6258 [pdf, other]

A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets

Authors: Nicolas Le Roux, Mark Schmidt, Francis Bach

Abstract: We propose a new stochastic gradient method for optimizing the sum of a finite set of smooth functions, where the sum is strongly convex. While standard stochastic gradient methods converge at sublinear rates for this problem, the proposed method incorporates a memory of previous gradient values in order to achieve a linear convergence rate. In a machine learning context, numerical experiments ind… ▽ More We propose a new stochastic gradient method for optimizing the sum of a finite set of smooth functions, where the sum is strongly convex. While standard stochastic gradient methods converge at sublinear rates for this problem, the proposed method incorporates a memory of previous gradient values in order to achieve a linear convergence rate. In a machine learning context, numerical experiments indicate that the new algorithm can dramatically outperform standard algorithms, both in terms of optimizing the training error and reducing the test error quickly. △ Less

Submitted 11 March, 2013; v1 submitted 28 February, 2012; originally announced February 2012.

Comments: The notable changes over the current version: - worked example of convergence rates showing SAG can be faster than first-order methods - pointing out that the storage cost is O(n) for linear models - the more-stable line-search - comparison to additional optimal SG methods - comparison to rates of coordinate descent methods in quadratic case

arXiv:1109.2415 [pdf, other]

Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization

Authors: Mark Schmidt, Nicolas Le Roux, Francis Bach

Abstract: We consider the problem of optimizing the sum of a smooth convex function and a non-smooth convex function using proximal-gradient methods, where an error is present in the calculation of the gradient of the smooth term or in the proximity operator with respect to the non-smooth term. We show that both the basic proximal-gradient method and the accelerated proximal-gradient method achieve the same… ▽ More We consider the problem of optimizing the sum of a smooth convex function and a non-smooth convex function using proximal-gradient methods, where an error is present in the calculation of the gradient of the smooth term or in the proximity operator with respect to the non-smooth term. We show that both the basic proximal-gradient method and the accelerated proximal-gradient method achieve the same convergence rate as in the error-free case, provided that the errors decrease at appropriate rates.Using these rates, we perform as well as or better than a carefully chosen fixed error level on a set of structured sparsity problems. △ Less

Submitted 1 December, 2011; v1 submitted 12 September, 2011; originally announced September 2011.

Comments: Neural Information Processing Systems (2011)

arXiv:1109.0093 [pdf, ps, other]

Local Component Analysis

Authors: Nicolas Le Roux, Francis Bach

Abstract: Kernel density estimation, a.k.a. Parzen windows, is a popular density estimation method, which can be used for outlier detection or clustering. With multivariate data, its performance is heavily reliant on the metric used within the kernel. Most earlier work has focused on learning only the bandwidth of the kernel (i.e., a scalar multiplicative factor). In this paper, we propose to learn a full E… ▽ More Kernel density estimation, a.k.a. Parzen windows, is a popular density estimation method, which can be used for outlier detection or clustering. With multivariate data, its performance is heavily reliant on the metric used within the kernel. Most earlier work has focused on learning only the bandwidth of the kernel (i.e., a scalar multiplicative factor). In this paper, we propose to learn a full Euclidean metric through an expectation-minimization (EM) procedure, which can be seen as an unsupervised counterpart to neighbourhood component analysis (NCA). In order to avoid overfitting with a fully nonparametric density estimator in high dimensions, we also consider a semi-parametric Gaussian-Parzen density model, where some of the variables are modelled through a jointly Gaussian density, while others are modelled through Parzen windows. For these two models, EM leads to simple closed-form updates based on matrix inversions and eigenvalue decompositions. We show empirically that our method leads to density estimators with higher test-likelihoods than natural competing methods, and that the metrics may be used within most unsupervised learning techniques that rely on such metrics, such as spectral clustering or manifold learning methods. Finally, we present a stochastic approximation scheme which allows for the use of this method in a large-scale setting. △ Less

Submitted 10 December, 2012; v1 submitted 1 September, 2011; originally announced September 2011.

arXiv:1107.3823 [pdf, ps, other]

Weakly Supervised Learning of Foreground-Background Segmentation using Masked RBMs

Authors: Nicolas Heess, Nicolas Le Roux, John Winn

Abstract: We propose an extension of the Restricted Boltzmann Machine (RBM) that allows the joint shape and appearance of foreground objects in cluttered images to be modeled independently of the background. We present a learning scheme that learns this representation directly from cluttered images with only very weak supervision. The model generates plausible samples and performs foreground-background segm… ▽ More We propose an extension of the Restricted Boltzmann Machine (RBM) that allows the joint shape and appearance of foreground objects in cluttered images to be modeled independently of the background. We present a learning scheme that learns this representation directly from cluttered images with only very weak supervision. The model generates plausible samples and performs foreground-background segmentation. We demonstrate that representing foreground objects independently of the background can be beneficial in recognition tasks. △ Less

Submitted 19 July, 2011; originally announced July 2011.

Journal ref: International Conference on Artificial Neural Networks (2011)

arXiv:0804.2181 [pdf, ps, other]

Products of Ordinary Differential Operators by Evaluation and Interpolation

Authors: Alin Bostan, Frédéric Chyzak, Nicolas Le Roux

Abstract: It is known that multiplication of linear differential operators over ground fields of characteristic zero can be reduced to a constant number of matrix products. We give a new algorithm by evaluation and interpolation which is faster than the previously-known one by a constant factor, and prove that in characteristic zero, multiplication of differential operators and of matrices are computation… ▽ More It is known that multiplication of linear differential operators over ground fields of characteristic zero can be reduced to a constant number of matrix products. We give a new algorithm by evaluation and interpolation which is faster than the previously-known one by a constant factor, and prove that in characteristic zero, multiplication of differential operators and of matrices are computationally equivalent problems. In positive characteristic, we show that differential operators can be multiplied in nearly optimal time. Theoretical results are validated by intensive experiments. △ Less

Submitted 14 April, 2008; originally announced April 2008.

Journal ref: Dans ISSAC'08 (2008)

Showing 1–35 of 35 results for author: Roux, N L