Showing 1–36 of 36 results for author: Pennington, J

  1. arXiv:2405.15074  [pdf, other]

    stat.ML cs.LG math.OC math.PR math.ST

    4+3 Phases of Compute-Optimal Neural Scaling Laws

    Authors: Elliot Paquette, Courtney Paquette, Lechao Xiao, Jeffrey Pennington

    Abstract: We consider the three parameter solvable neural scaling model introduced by Maloney, Roberts, and Sully. The model has three parameters: data complexity, target complexity, and model-parameter-count. We use this neural scaling model to derive new predictions about the compute-limited, infinite-data scaling law regime. To train the neural scaling model, we run one-pass stochastic gradient descent o…

    Submitted 23 May, 2024; originally announced May 2024.

  2. arXiv:2404.19261  [pdf, other]

    cs.LG math.OC math.ST physics.data-an

    High dimensional analysis reveals conservative sharpening and a stochastic edge of stability

    Authors: Atish Agarwala, Jeffrey Pennington

    Abstract: Recent empirical and theoretical work has shown that the dynamics of the large eigenvalues of the training loss Hessian have some remarkably robust features across models and datasets in the full batch regime. There is often an early period of progressive sharpening where the large eigenvalues increase, followed by stabilization at a predictable value known as the edge of stability. Previous work…

    Submitted 30 April, 2024; originally announced April 2024.

  3. arXiv:2404.03626  [pdf, other]

    cs.CL cs.LG

    Training LLMs over Neurally Compressed Text

    Authors: Brian Lester, Jaehoon Lee, Alex Alemi, Jeffrey Pennington, Adam Roberts, Jascha Sohl-Dickstein, Noah Constant

    Abstract: In this paper, we explore the idea of training large language models (LLMs) over highly compressed text. While standard subword tokenizers compress text by a small factor, neural text compressors can achieve much higher rates of compression. If it were possible to train LLMs directly over neurally compressed text, this would confer advantages in training and serving efficiency, as well as easier h…

    Submitted 4 April, 2024; originally announced April 2024.

  4. arXiv:2312.06585  [pdf, other]

    cs.LG

    Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models

    Authors: Avi Singh, John D. Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J. Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron Parisi, Abhishek Kumar, Alex Alemi, Alex Rizkowsky, Azade Nova, Ben Adlam, Bernd Bohnet, Gamaleldin Elsayed, Hanie Sedghi, Igor Mordatch, Isabelle Simpson, Izzeddin Gur, Jasper Snoek, Jeffrey Pennington, Jiri Hron, et al. (16 additional authors not shown)

    Abstract: Fine-tuning language models (LMs) on human-generated data remains a prevalent practice. However, the performance of such models is often limited by the quantity and diversity of high-quality human data. In this paper, we explore whether we can go beyond human data on tasks where we have access to scalar feedback, for example, on math problems where one can verify correctness. To do so, we investig…

    Submitted 17 April, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

    Comments: Accepted to TMLR. Camera-ready version. First three authors contributed equally

  5. arXiv:2311.07587  [pdf, other]

    cs.CL cs.AI cs.CY cs.LG

    Frontier Language Models are not Robust to Adversarial Arithmetic, or "What do I need to say so you agree 2+2=5?"

    Authors: C. Daniel Freeman, Laura Culp, Aaron Parisi, Maxwell L Bileschi, Gamaleldin F Elsayed, Alex Rizkowsky, Isabelle Simpson, Alex Alemi, Azade Nova, Ben Adlam, Bernd Bohnet, Gaurav Mishra, Hanie Sedghi, Igor Mordatch, Izzeddin Gur, Jaehoon Lee, JD Co-Reyes, Jeffrey Pennington, Kelvin Xu, Kevin Swersky, Kshiteej Mahajan, Lechao Xiao, Rosanne Liu, Simon Kornblith, Noah Constant, et al. (5 additional authors not shown)

    Abstract: We introduce and study the problem of adversarial arithmetic, which provides a simple yet challenging testbed for language model alignment. This problem is comprised of arithmetic questions posed in natural language, with an arbitrary adversarial string inserted before the question is complete. Even in the simple setting of 1-digit addition problems, it is easy to find adversarial prompts that mak…

    Submitted 15 November, 2023; v1 submitted 8 November, 2023; originally announced November 2023.

  6. arXiv:2309.14322  [pdf, other]

    cs.LG

    Small-scale proxies for large-scale Transformer training instabilities

    Authors: Mitchell Wortsman, Peter J. Liu, Lechao Xiao, Katie Everett, Alex Alemi, Ben Adlam, John D. Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein, Kelvin Xu, Jaehoon Lee, Justin Gilmer, Simon Kornblith

    Abstract: Teams that have trained large Transformer-based models have reported training instabilities at large scale that did not appear when training with the same hyperparameters at smaller scales. Although the causes of such instabilities are of scientific interest, the amount of resources required to reproduce them has made investigation difficult. In this work, we seek ways to reproduce and study train…

    Submitted 16 October, 2023; v1 submitted 25 September, 2023; originally announced September 2023.

  7. arXiv:2210.04860  [pdf, other]

    cs.LG cs.AI math.OC

    Second-order regression models exhibit progressive sharpening to the edge of stability

    Authors: Atish Agarwala, Fabian Pedregosa, Jeffrey Pennington

    Abstract: Recent studies of gradient descent with large step sizes have shown that there is often a regime with an initial increase in the largest eigenvalue of the loss Hessian (progressive sharpening), followed by a stabilization of the eigenvalue near the maximum value which allows convergence (edge of stability). These phenomena are intrinsically non-linear and do not happen for models in the constant N…

    Submitted 10 October, 2022; originally announced October 2022.

  8. arXiv:2207.04612  [pdf, other]

    cs.LG

    Synergy and Symmetry in Deep Learning: Interactions between the Data, Model, and Inference Algorithm

    Authors: Lechao Xiao, Jeffrey Pennington

    Abstract: Although learning in high dimensions is commonly believed to suffer from the curse of dimensionality, modern machine learning methods often exhibit an astonishing power to tackle a wide range of challenging real-world learning problems without using abundant amounts of data. How exactly these methods break this curse remains a fundamental open question in the theory of deep learning. While previou…

    Submitted 11 July, 2022; originally announced July 2022.

    Comments: Accepted by ICML 2022; 23 pages

    MSC Class: 68T07

  9. arXiv:2206.07673  [pdf, other]

    stat.ML cs.LG

    Wide Bayesian neural networks have a simple weight posterior: theory and accelerated sampling

    Authors: Jiri Hron, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein

    Abstract: We introduce repriorisation, a data-dependent reparameterisation which transforms a Bayesian neural network (BNN) posterior to a distribution whose KL divergence to the BNN prior vanishes as layer widths grow. The repriorisation map acts directly on parameters, and its analytic simplicity complements the known neural network Gaussian process (NNGP) behaviour of wide BNNs in function space. Exploit…

    Submitted 15 June, 2022; originally announced June 2022.

    Comments: ICML 2022

  10. arXiv:2206.07252  [pdf, other]

    stat.ML cs.LG math.OC math.PR math.ST

    Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions

    Authors: Courtney Paquette, Elliot Paquette, Ben Adlam, Jeffrey Pennington

    Abstract: Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of problems. While the empirical success of SGD is often attributed to its computational efficiency and favorable generalization behavior, neither effect is well understood and disentangling them remains an open problem. Even in the simple setting of convex quad…

    Submitted 14 June, 2022; originally announced June 2022.

    Comments: arXiv admin note: text overlap with arXiv:2205.07069

  11. arXiv:2205.14846  [pdf, other]

    cs.LG stat.ML

    Precise Learning Curves and Higher-Order Scaling Limits for Dot Product Kernel Regression

    Authors: Lechao Xiao, Hong Hu, Theodor Misiakiewicz, Yue M. Lu, Jeffrey Pennington

    Abstract: As modern machine learning models continue to advance the computational frontier, it has become increasingly important to develop precise estimates for expected performance improvements under different model and data scaling regimes. Currently, theoretical understanding of the learning curves that characterize how the prediction error depends on the number of samples is restricted to either large-…

    Submitted 12 June, 2023; v1 submitted 30 May, 2022; originally announced May 2022.

    Comments: 42 pages; 5 + 6 figures

    MSC Class: 68T07

  12. arXiv:2111.08234  [pdf, other]

    stat.ML cs.LG

    Covariate Shift in High-Dimensional Random Feature Regression

    Authors: Nilesh Tripuraneni, Ben Adlam, Jeffrey Pennington

    Abstract: A significant obstacle in the development of robust machine learning models is covariate shift, a form of distribution shift that occurs when the input distributions of the training and test sets differ while the conditional label distributions remain the same. Despite the prevalence of covariate shift in real-world applications, a theoretical understanding in the context of modern machine learnin…

    Submitted 16 November, 2021; originally announced November 2021.

    Comments: 107 pages, 10 figures

  13. arXiv:2011.03321  [pdf, other]

    stat.ML cs.LG

    Understanding Double Descent Requires a Fine-Grained Bias-Variance Decomposition

    Authors: Ben Adlam, Jeffrey Pennington

    Abstract: Classical learning theory suggests that the optimal generalization performance of a machine learning model should occur at an intermediate model complexity, with simpler models exhibiting high bias and more complex models exhibiting high variance of the predictive function. However, such a simple trade-off does not adequately describe deep learning models that simultaneously attain low bias and va…

    Submitted 4 November, 2020; originally announced November 2020.

    Comments: Published as a conference paper in the Proceedings of the Thirty-fourth Conference on Neural Information Processing Systems; 54 pages; 5 figures. arXiv admin note: text overlap with arXiv:2008.06786

  14. arXiv:2010.07355  [pdf, other]

    stat.ML cs.LG

    Exploring the Uncertainty Properties of Neural Networks' Implicit Priors in the Infinite-Width Limit

    Authors: Ben Adlam, Jaehoon Lee, Lechao Xiao, Jeffrey Pennington, Jasper Snoek

    Abstract: Modern deep learning models have achieved great success in predictive accuracy for many data modalities. However, their application to many real-world tasks is restricted by poor uncertainty estimates, such as overconfidence on out-of-distribution (OOD) data and ungraceful failing under distributional shift. Previous benchmarks have found that ensembles of neural networks (NNs) are typically the b…

    Submitted 14 October, 2020; originally announced October 2020.

    Comments: 23 pages, 11 figures

  15. arXiv:2010.07344  [pdf, other]

    cs.LG cs.AI

    Temperature check: theory and practice for training models with softmax-cross-entropy losses

    Authors: Atish Agarwala, Jeffrey Pennington, Yann Dauphin, Sam Schoenholz

    Abstract: The softmax function combined with a cross-entropy loss is a principled approach to modeling probability distributions that has become ubiquitous in deep learning. The softmax function is defined by a lone hyperparameter, the temperature, that is commonly set to one or regarded as a way to tune model confidence after training; however, less is known about how the temperature impacts training dynam…

    Submitted 14 October, 2020; originally announced October 2020.

  16. arXiv:2008.07023  [pdf, other]

    cs.DS

    Selection on $X_1 + X_2 + \cdots + X_m$ via Cartesian product tree

    Authors: Patrick Kreitzberg, Kyle Lucke, Jake Pennington, Oliver Serang

    Abstract: Selection on the Cartesian product is a classic problem in computer science. Recently, an optimal algorithm for selection on $X+Y$, based on soft heaps, was introduced. By combining this approach with layer-ordered heaps (LOHs), an algorithm using a balanced binary tree of $X+Y$ selections was proposed to perform $k$-selection on $X_1+X_2+\cdots+X_m$ in $o(n\cdot m + k\cdot m)$, where $X_i$ have l…

    Submitted 16 August, 2020; originally announced August 2020.

  17. arXiv:2008.06786  [pdf, other]

    stat.ML cs.LG

    The Neural Tangent Kernel in High Dimensions: Triple Descent and a Multi-Scale Theory of Generalization

    Authors: Ben Adlam, Jeffrey Pennington

    Abstract: Modern deep learning models employ considerably more parameters than required to fit the training data. Whereas conventional statistical wisdom suggests such models should drastically overfit, in practice these models generalize remarkably well. An emerging paradigm for describing this unexpected behavior is in terms of a double descent curve, in which increasing a model's capacity causes i…

    Submitted 15 August, 2020; originally announced August 2020.

    Comments: Published as a conference paper in the Proceedings of the 37th International Conference on Machine Learning; 31 pages; 4 figures

  18. arXiv:2007.15801  [pdf, other]

    cs.LG stat.ML

    Finite Versus Infinite Neural Networks: an Empirical Study

    Authors: Jaehoon Lee, Samuel S. Schoenholz, Jeffrey Pennington, Ben Adlam, Lechao Xiao, Roman Novak, Jascha Sohl-Dickstein

    Abstract: We perform a careful, thorough, and large scale empirical study of the correspondence between wide neural networks and kernel methods. By doing so, we resolve a variety of open questions related to the study of infinitely wide neural networks. Our experimental results include: kernel methods outperform fully-connected finite-width networks, but underperform convolutional finite width networks; neu…

    Submitted 8 September, 2020; v1 submitted 30 July, 2020; originally announced July 2020.

    Comments: 17+11 pages; v2 references added, minor improvements

  19. arXiv:2007.13356  [pdf, other]

    cs.DS

    Optimal construction of a layer-ordered heap

    Authors: Jake Pennington, Patrick Kreitzberg, Kyle Lucke, Oliver Serang

    Abstract: The layer-ordered heap (LOH) is a simple, recently proposed data structure used in optimal selection on $X+Y$, the algorithm with the best known runtime for selection on $X_1+X_2+\cdots+X_m$, and the fastest method in practice for computing the most abundant isotope peaks in a chemical compound. Here, we introduce a few algorithms for constructing LOHs, analyze their complexity, and demonstrate tha…

    Submitted 15 August, 2020; v1 submitted 27 July, 2020; originally announced July 2020.

  20. arXiv:2006.14599  [pdf, other]

    cs.LG cs.NE stat.ML

    The Surprising Simplicity of the Early-Time Learning Dynamics of Neural Networks

    Authors: Wei Hu, Lechao Xiao, Ben Adlam, Jeffrey Pennington

    Abstract: Modern neural networks are often regarded as complex black-box functions whose behavior is difficult to understand owing to their nonlinear dependence on the data and the nonconvexity in their loss landscapes. In this work, we show that these common perceptions can be completely false in the early phase of learning. In particular, we formally prove that, for a class of well-behaved input distribut…

    Submitted 25 June, 2020; originally announced June 2020.

  21. arXiv:2006.10541  [pdf, other]

    stat.ML cs.LG

    Exact posterior distributions of wide Bayesian neural networks

    Authors: Jiri Hron, Yasaman Bahri, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein

    Abstract: Recent work has shown that the prior over functions induced by a deep Bayesian neural network (BNN) behaves as a Gaussian process (GP) as the width of all layers becomes large. However, many BNN applications are concerned with the BNN function space posterior. While some empirical evidence of the posterior convergence was provided in the original works of Neal (1996) and Matthews et al. (2018), it…

    Submitted 26 November, 2020; v1 submitted 18 June, 2020; originally announced June 2020.

  22. arXiv:2004.07444  [pdf, other]

    cs.CE cs.DS

    Fast exact computation of the $k$ most abundant isotope peaks with layer-ordered heaps

    Authors: Patrick Kreitzberg, Jake Pennington, Kyle Lucke, Oliver Serang

    Abstract: The theoretical computation of isotopic distribution of compounds is crucial in many important applications of mass spectrometry, especially as machine precision grows. A considerable amount of good tools have been created in the last decade for doing so. In this paper we present a novel algorithm for calculating the top $k$ peaks of a given compound. The algorithm takes advantage of layer-ordered…

    Submitted 15 April, 2020; originally announced April 2020.

  23. arXiv:2001.05992  [pdf, other]

    cs.LG cs.NE math.OC stat.ML

    Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks

    Authors: Wei Hu, Lechao Xiao, Jeffrey Pennington

    Abstract: The selection of initial parameter values for gradient-based optimization of deep neural networks is one of the most impactful hyperparameter choices in deep learning systems, affecting both convergence times and model performance. Yet despite significant empirical and theoretical analysis, relatively little has been proved about the concrete effects of different initialization schemes. In this wo…

    Submitted 16 January, 2020; originally announced January 2020.

    Comments: International Conference on Learning Representations (ICLR) 2020

  24. arXiv:1912.13053  [pdf, other]

    cs.LG stat.ML

    Disentangling Trainability and Generalization in Deep Neural Networks

    Authors: Lechao Xiao, Jeffrey Pennington, Samuel S. Schoenholz

    Abstract: A longstanding goal in the theory of deep learning is to characterize the conditions under which a given neural network architecture will be trainable, and if so, how well it might generalize to unseen data. In this work, we provide such a characterization in the limit of very wide and very deep networks, for which the analysis simplifies considerably. For wide networks, the trajectory under gradi…

    Submitted 13 July, 2020; v1 submitted 30 December, 2019; originally announced December 2019.

    Comments: 22 pages, 3 figures, ICML 2020. Associated Colab notebook at https://colab.research.google.com/github/google/neural-tangents/blob/master/notebooks/Disentangling_Trainability_and_Generalization.ipynb

  25. arXiv:1912.00827  [pdf, other]

    stat.ML cs.LG

    A Random Matrix Perspective on Mixtures of Nonlinearities for Deep Learning

    Authors: Ben Adlam, Jake Levinson, Jeffrey Pennington

    Abstract: One of the distinguishing characteristics of modern deep learning systems is that they typically employ neural network architectures that utilize enormous numbers of parameters, often in the millions and sometimes even in the billions. While this paradigm has inspired significant research on the properties of large networks, relatively little work has been devoted to the fact that these networks a…

    Submitted 12 November, 2021; v1 submitted 2 December, 2019; originally announced December 2019.

  26. arXiv:1902.08129  [pdf, other]

    cs.NE cond-mat.dis-nn cs.LG math.DS

    A Mean Field Theory of Batch Normalization

    Authors: Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, Samuel S. Schoenholz

    Abstract: We develop a mean field theory for batch normalization in fully-connected feedforward neural networks. In so doing, we provide a precise characterization of signal propagation and gradient backpropagation in wide batch-normalized networks at initialization. Our theory shows that gradient signals grow exponentially in depth and that these exploding gradients cannot be eliminated by tuning the initi…

    Submitted 5 March, 2019; v1 submitted 21 February, 2019; originally announced February 2019.

    Comments: To appear in ICLR 2019

  27. Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent

    Authors: Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, Jeffrey Pennington

    Abstract: A longstanding goal in deep learning research has been to precisely characterize training and generalization. However, the often complex loss landscapes of neural networks have made a theory of learning dynamics elusive. In this work, we show that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained…

    Submitted 8 December, 2019; v1 submitted 18 February, 2019; originally announced February 2019.

    Comments: 12+16 pages; open-source code available at https://github.com/google/neural-tangents; accepted to NeurIPS 2019

  28. arXiv:1901.08987  [pdf, other]

    cs.LG stat.ML

    Dynamical Isometry and a Mean Field Theory of LSTMs and GRUs

    Authors: Dar Gilboa, Bo Chang, Minmin Chen, Greg Yang, Samuel S. Schoenholz, Ed H. Chi, Jeffrey Pennington

    Abstract: Training recurrent neural networks (RNNs) on long sequence tasks is plagued with difficulties arising from the exponential explosion or vanishing of signals as they propagate forward or backward through the network. Many techniques have been proposed to ameliorate these issues, including various algorithmic and architectural modifications. Two of the most successful RNN architectures, the LSTM and…

    Submitted 23 May, 2019; v1 submitted 25 January, 2019; originally announced January 2019.

  29. arXiv:1810.05148  [pdf, other]

    stat.ML cs.AI cs.LG cs.NE

    Bayesian Deep Convolutional Networks with Many Channels are Gaussian Processes

    Authors: Roman Novak, Lechao Xiao, Jaehoon Lee, Yasaman Bahri, Greg Yang, Jiri Hron, Daniel A. Abolafia, Jeffrey Pennington, Jascha Sohl-Dickstein

    Abstract: There is a previously identified equivalence between wide fully connected neural networks (FCNs) and Gaussian processes (GPs). This equivalence enables, for instance, test set predictions that would have resulted from a fully Bayesian, infinitely wide trained FCN to be computed without ever instantiating the FCN, but by instead evaluating the corresponding GP. In this work, we derive an analogous…

    Submitted 21 August, 2020; v1 submitted 11 October, 2018; originally announced October 2018.

    Comments: Published as a conference paper at ICLR 2019

  30. arXiv:1806.05394  [pdf, other]

    stat.ML cs.LG

    Dynamical Isometry and a Mean Field Theory of RNNs: Gating Enables Signal Propagation in Recurrent Neural Networks

    Authors: Minmin Chen, Jeffrey Pennington, Samuel S. Schoenholz

    Abstract: Recurrent neural networks have gained widespread use in modeling sequence data across various domains. While many successful recurrent architectures employ a notion of gating, the exact mechanism that enables such remarkable performance is not well understood. We develop a theory for signal propagation in recurrent networks after random initialization using a combination of mean field theory and r…

    Submitted 15 August, 2018; v1 submitted 14 June, 2018; originally announced June 2018.

    Comments: ICML 2018 Conference Proceedings

  31. arXiv:1806.05393  [pdf, other]

    stat.ML cs.LG

    Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks

    Authors: Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S. Schoenholz, Jeffrey Pennington

    Abstract: In recent years, state-of-the-art methods in computer vision have utilized increasingly deep convolutional neural network architectures (CNNs), with some of the most successful models employing hundreds or even thousands of layers. A variety of pathologies such as vanishing/exploding gradients make training such deep networks challenging. While residual connections and batch normalization do enabl…

    Submitted 10 July, 2018; v1 submitted 14 June, 2018; originally announced June 2018.

    Comments: ICML 2018 Conference Proceedings

  32. arXiv:1802.09979  [pdf, other]

    stat.ML cs.LG

    The Emergence of Spectral Universality in Deep Networks

    Authors: Jeffrey Pennington, Samuel S. Schoenholz, Surya Ganguli

    Abstract: Recent work has shown that tight concentration of the entire spectrum of singular values of a deep network's input-output Jacobian around one at initialization can speed up learning by orders of magnitude. Therefore, to guide important design choices, it is important to build a full theoretical understanding of the spectra of Jacobians at initialization. To this end, we leverage powerful tools fro…

    Submitted 27 February, 2018; originally announced February 2018.

    Comments: 17 pages, 4 figures. Appearing at the 21st International Conference on Artificial Intelligence and Statistics (AISTATS) 2018

  33. arXiv:1802.08760  [pdf, other]

    stat.ML cs.AI cs.LG cs.NE

    Sensitivity and Generalization in Neural Networks: an Empirical Study

    Authors: Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, Jascha Sohl-Dickstein

    Abstract: In practice it is often found that large over-parameterized neural networks generalize better than their smaller counterparts, an observation that appears to conflict with classical notions of function complexity, which typically favor smaller models. In this work, we investigate this tension between complexity and generalization through an extensive empirical exploration of two natural metrics of…

    Submitted 18 June, 2018; v1 submitted 23 February, 2018; originally announced February 2018.

    Comments: Published as a conference paper at ICLR 2018

  34. arXiv:1711.04735  [pdf, other]

    cs.LG stat.ML

    Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice

    Authors: Jeffrey Pennington, Samuel S. Schoenholz, Surya Ganguli

    Abstract: It is well known that the initialization of weights in deep neural networks can have a dramatic impact on learning speed. For example, ensuring the mean squared singular value of a network's input-output Jacobian is $O(1)$ is essential for avoiding the exponential vanishing or explosion of gradients. The stronger condition that all singular values of the Jacobian concentrate near $1$ is a property…

    Submitted 13 November, 2017; originally announced November 2017.

    Comments: 13 pages, 6 figures. Appearing at the 31st Conference on Neural Information Processing Systems (NIPS 2017)

  35. arXiv:1711.00165  [pdf, other]

    stat.ML cs.LG

    Deep Neural Networks as Gaussian Processes

    Authors: Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, Jascha Sohl-Dickstein

    Abstract: It has long been known that a single-layer fully-connected neural network with an i.i.d. prior over its parameters is equivalent to a Gaussian process (GP), in the limit of infinite network width. This correspondence enables exact Bayesian inference for infinite width neural networks on regression tasks by means of evaluating the corresponding GP. Recently, kernel functions which mimic multi-layer…

    Submitted 2 March, 2018; v1 submitted 31 October, 2017; originally announced November 2017.

    Comments: Published version in ICLR 2018. 10 pages + appendix

  36. arXiv:1710.06570  [pdf, other]

    stat.ML cond-mat.dis-nn cs.LG

    A Correspondence Between Random Neural Networks and Statistical Field Theory

    Authors: Samuel S. Schoenholz, Jeffrey Pennington, Jascha Sohl-Dickstein

    Abstract: A number of recent papers have provided evidence that practical design questions about neural networks may be tackled theoretically by studying the behavior of random networks. However, until now the tools available for analyzing random neural networks have been relatively ad-hoc. In this work, we show that the distribution of pre-activations in random neural networks can be exactly mapped onto la…

    Submitted 17 October, 2017; originally announced October 2017.