Showing 1–17 of 17 results for author: Hestness, J

  1. arXiv:2405.15743

    cs.LG

    Sparse maximal update parameterization: A holistic approach to sparse training dynamics

    Authors: Nolan Dey, Shane Bergsma, Joel Hestness

    Abstract: Several challenges make it difficult for sparse neural networks to compete with dense models. First, setting a large fraction of weights to zero impairs forward and gradient signal propagation. Second, sparse studies often need to test multiple sparsity levels, while also introducing new hyperparameters (HPs), leading to prohibitive tuning costs. Indeed, the standard practice is to re-use the lear…

    Submitted 24 May, 2024; originally announced May 2024.

    Comments: 9 pages main text, 11 pages reference and appendix, 11 figures

  2. arXiv:2403.00952

    cs.CL cs.LG

    MediSwift: Efficient Sparse Pre-trained Biomedical Language Models

    Authors: Vithursan Thangarasa, Mahmoud Salem, Shreyas Saxena, Kevin Leong, Joel Hestness, Sean Lie

    Abstract: Large language models (LLMs) are typically trained on general source data for various domains, but a recent surge in domain-specific LLMs has shown their potential to outperform general-purpose models in domain-specific tasks (e.g., biomedicine). Although domain-specific pre-training enhances efficiency and leads to smaller models, the computational costs of training these LLMs remain high, posing…

    Submitted 1 March, 2024; originally announced March 2024.

  3. arXiv:2310.13017

    cs.CL cs.AI cs.LG

    Position Interpolation Improves ALiBi Extrapolation

    Authors: Faisal Al-Khateeb, Nolan Dey, Daria Soboleva, Joel Hestness

    Abstract: Linear position interpolation helps pre-trained models using rotary position embeddings (RoPE) to extrapolate to longer sequence lengths. We propose using linear position interpolation to extend the extrapolation range of models using Attention with Linear Biases (ALiBi). We find position interpolation significantly improves extrapolation capability on upstream language modelling and downstream su…

    Submitted 18 October, 2023; originally announced October 2023.

    Comments: 4 pages content, 1 page references, 4 figures

  4. arXiv:2309.11568

    cs.AI cs.CL cs.LG

    BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model

    Authors: Nolan Dey, Daria Soboleva, Faisal Al-Khateeb, Bowen Yang, Ribhu Pathria, Hemant Khachane, Shaheer Muhammad, Zhiming Chen, Robert Myers, Jacob Robert Steeves, Natalia Vassilieva, Marvin Tom, Joel Hestness

    Abstract: We introduce the Bittensor Language Model, called "BTLM-3B-8K", a new state-of-the-art 3 billion parameter open-source language model. BTLM-3B-8K was trained on 627B tokens from the SlimPajama dataset with a mixture of 2,048 and 8,192 context lengths. BTLM-3B-8K outperforms all existing 3B parameter models by 2-5.5% across downstream tasks. BTLM-3B-8K is even competitive with some 7B parameter mod…

    Submitted 20 September, 2023; originally announced September 2023.

  5. arXiv:2309.10818

    cs.CL cs.AI

    SlimPajama-DC: Understanding Data Combinations for LLM Training

    Authors: Zhiqiang Shen, Tianhua Tao, Liqun Ma, Willie Neiswanger, Zhengzhong Liu, Hongyi Wang, Bowen Tan, Joel Hestness, Natalia Vassilieva, Daria Soboleva, Eric Xing

    Abstract: This paper aims to understand the impacts of various data combinations (e.g., web text, Wikipedia, GitHub, books) on the pretraining of large language models using SlimPajama. SlimPajama is a rigorously deduplicated, multi-source dataset, which has been refined and further deduplicated to 627B tokens from the extensive 1.2T token RedPajama dataset contributed by Together. We have termed our resear…

    Submitted 9 May, 2024; v1 submitted 19 September, 2023; originally announced September 2023.

    Comments: Technical report. Models at: https://huggingface.co/MBZUAI-LLM/SlimPajama-DC and dataset at: https://huggingface.co/datasets/MBZUAI-LLM/SlimPajama-627B-DC

  6. arXiv:2308.16149

    cs.CL cs.AI cs.LG

    Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models

    Authors: Neha Sengupta, Sunil Kumar Sahu, Bokang Jia, Satheesh Katipomu, Haonan Li, Fajri Koto, William Marshall, Gurpreet Gosal, Cynthia Liu, Zhiming Chen, Osama Mohammed Afzal, Samta Kamboj, Onkar Pandit, Rahul Pal, Lalit Pradhan, Zain Muhammad Mujahid, Massa Baali, Xudong Han, Sondos Mahmoud Bsharat, Alham Fikri Aji, Zhiqiang Shen, Zhengzhong Liu, Natalia Vassilieva, Joel Hestness, Andy Hock, et al. (7 additional authors not shown)

    Abstract: We introduce Jais and Jais-chat, new state-of-the-art Arabic-centric foundation and instruction-tuned open generative large language models (LLMs). The models are based on the GPT-3 decoder-only architecture and are pretrained on a mixture of Arabic and English texts, including source code in various programming languages. With 13 billion parameters, they demonstrate better knowledge and reasoning…

    Submitted 29 September, 2023; v1 submitted 30 August, 2023; originally announced August 2023.

    Comments: Arabic-centric, foundation model, large-language model, LLM, generative model, instruction-tuned, Jais, Jais-chat

    MSC Class: 68T50 ACM Class: F.2.2; I.2.7

  7. arXiv:2304.03208

    cs.LG cs.CL

    Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster

    Authors: Nolan Dey, Gurpreet Gosal, Zhiming Chen, Hemant Khachane, William Marshall, Ribhu Pathria, Marvin Tom, Joel Hestness

    Abstract: We study recent research advances that improve large language models through efficient pre-training and scaling, and open datasets and tools. We combine these advances to introduce Cerebras-GPT, a family of open compute-optimal language models scaled from 111M to 13B parameters. We train Cerebras-GPT models on the Eleuther Pile dataset following DeepMind Chinchilla scaling rules for efficient pre-…

    Submitted 6 April, 2023; originally announced April 2023.

    Comments: 13 pages main text, 16 pages appendix, 13 figures

  8. arXiv:2206.14098

    cs.LG cs.AI cs.CV

    RevBiFPN: The Fully Reversible Bidirectional Feature Pyramid Network

    Authors: Vitaliy Chiley, Vithursan Thangarasa, Abhay Gupta, Anshul Samar, Joel Hestness, Dennis DeCoste

    Abstract: This work introduces RevSilo, the first reversible bidirectional multi-scale feature fusion module. Like other reversible methods, RevSilo eliminates the need to store hidden activations by recomputing them. However, existing reversible methods do not apply to multi-scale feature fusion and are, therefore, not applicable to a large class of networks. Bidirectional multi-scale feature fusion promot…

    Submitted 28 April, 2023; v1 submitted 28 June, 2022; originally announced June 2022.

    Comments: Presented at MLSys 2023. Code available from Cerebras Systems: https://github.com/CerebrasResearch/RevBiFPN

  9. arXiv:2203.09128

    cs.LG cs.CL econ.GN

    Time Dependency, Data Flow, and Competitive Advantage

    Authors: Ehsan Valavi, Joel Hestness, Marco Iansiti, Newsha Ardalani, Feng Zhu, Karim R. Lakhani

    Abstract: Data is fundamental to machine learning-based products and services and is considered strategic due to its externalities for businesses, governments, non-profits, and more generally for society. It is renowned that the value of organizations (businesses, government agencies and programs, and even industries) scales with the volume of available data. What is often less appreciated is that the data…

    Submitted 17 March, 2022; originally announced March 2022.

    Comments: 24 Pages

  10. arXiv:2203.09118

    cs.LG cs.CL econ.GN

    Time and the Value of Data

    Authors: Ehsan Valavi, Joel Hestness, Newsha Ardalani, Marco Iansiti

    Abstract: Managers often believe that collecting more data will continually improve the accuracy of their machine learning models. However, we argue in this paper that when data lose relevance over time, it may be optimal to collect a limited amount of recent data instead of keeping around an infinite supply of older (less relevant) data. In addition, we argue that increasing the stock of data by including…

    Submitted 17 March, 2022; originally announced March 2022.

    Comments: 43 Pages, 8 Figures, Harvard Business School Working Paper 21-016

  11. arXiv:2201.01942

    cs.LG stat.ML

    Efficiently Disentangle Causal Representations

    Authors: Yuanpeng Li, Joel Hestness, Mohamed Elhoseiny, Liang Zhao, Kenneth Church

    Abstract: This paper proposes an efficient approach to learning disentangled representations with causal mechanisms based on the difference of conditional probabilities in original and new distributions. We approximate the difference with models' generalization abilities so that it fits in the standard machine learning framework and can be efficiently computed. In contrast to the state-of-the-art approach,…

    Submitted 1 January, 2024; v1 submitted 6 January, 2022; originally announced January 2022.

    Comments: 17 pages, 7 figures

    Report number: Causal-01

  12. Memory Efficient 3D U-Net with Reversible Mobile Inverted Bottlenecks for Brain Tumor Segmentation

    Authors: Mihir Pendse, Vithursan Thangarasa, Vitaliy Chiley, Ryan Holmdahl, Joel Hestness, Dennis DeCoste

    Abstract: We propose combining memory saving techniques with traditional U-Net architectures to increase the complexity of the models on the Brain Tumor Segmentation (BraTS) challenge. The BraTS challenge consists of a 3D segmentation of a 240x240x155x4 input image into a set of tumor classes. Because of the large volume and need for 3D convolutional layers, this task is very memory intensive. To address th…

    Submitted 20 April, 2021; v1 submitted 19 April, 2021; originally announced April 2021.

    Comments: 11 pages, 5 figures, Published at MICCAI BrainLes 2020

    Journal ref: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries (2021) 388-397

  13. arXiv:2003.11666

    cs.LG cs.DC stat.ML

    Pipelined Backpropagation at Scale: Training Large Models without Batches

    Authors: Atli Kosson, Vitaliy Chiley, Abhinav Venigalla, Joel Hestness, Urs Köster

    Abstract: New hardware can substantially increase the speed and efficiency of deep neural network training. To guide the development of future hardware architectures, it is pertinent to explore the hardware and machine learning properties of alternative training algorithms. In this work we evaluate the use of small batch, fine-grained Pipelined Backpropagation, an asynchronous pipeline parallel training alg…

    Submitted 9 April, 2021; v1 submitted 25 March, 2020; originally announced March 2020.

    Comments: Proceedings of the 4th MLSys Conference, 2021

  14. arXiv:1910.02612

    cs.CL cs.LG

    Compositional Generalization for Primitive Substitutions

    Authors: Yuanpeng Li, Liang Zhao, Jianyu Wang, Joel Hestness

    Abstract: Compositional generalization is a basic mechanism in human language learning, but current neural networks lack such ability. In this paper, we conduct fundamental research for encoding compositionality in neural networks. Conventional methods use a single representation for the input sentence, making it hard to apply prior knowledge of compositionality. In contrast, our approach leverages such kno…

    Submitted 7 October, 2019; originally announced October 2019.

    Comments: EMNLP 2019

  15. arXiv:1909.01736

    cs.LG

    Beyond Human-Level Accuracy: Computational Challenges in Deep Learning

    Authors: Joel Hestness, Newsha Ardalani, Greg Diamos

    Abstract: Deep learning (DL) research yields accuracy and product improvements from both model architecture changes and scale: larger data sets and models, and more computation. For hardware design, it is difficult to predict DL model changes. However, recent prior work shows that as dataset sizes grow, DL model accuracy and model size grow predictably. This paper leverages the prior work to project the dat…

    Submitted 3 September, 2019; originally announced September 2019.

  16. arXiv:1712.00409

    cs.LG stat.ML

    Deep Learning Scaling is Predictable, Empirically

    Authors: Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, Yanqi Zhou

    Abstract: Deep learning (DL) creates impactful advances following a virtuous recipe: model architecture search, creating large training data sets, and scaling computation. It is widely believed that growing training sets and models should improve accuracy and result in better products. As DL application domains grow, we would like a deeper understanding of the relationships between training set size, comput…

    Submitted 1 December, 2017; originally announced December 2017.

    Comments: 19 pages, 11 figures

  17. arXiv:1703.05390

    cs.CL cs.AI cs.LG

    Convolutional Recurrent Neural Networks for Small-Footprint Keyword Spotting

    Authors: Sercan O. Arik, Markus Kliegl, Rewon Child, Joel Hestness, Andrew Gibiansky, Chris Fougner, Ryan Prenger, Adam Coates

    Abstract: Keyword spotting (KWS) constitutes a major component of human-technology interfaces. Maximizing the detection accuracy at a low false alarm (FA) rate, while minimizing the footprint size, latency and complexity are the goals for KWS. Towards achieving them, we study Convolutional Recurrent Neural Networks (CRNNs). Inspired by large-scale state-of-the-art speech recognition systems, we combine the…

    Submitted 4 July, 2017; v1 submitted 15 March, 2017; originally announced March 2017.

    Comments: Accepted to Interspeech 2017