Showing 1–35 of 35 results for author: Blankevoort, T

  1. arXiv:2405.16406  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV

    SpinQuant: LLM quantization with learned rotations

    Authors: Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, Tijmen Blankevoort

    Abstract: Post-training quantization (PTQ) techniques applied to weights, activations, and the KV cache greatly reduce memory usage, latency, and power consumption of Large Language Models (LLMs), but may lead to large quantization errors when outliers are present. Recent findings suggest that rotating activation or weight matrices helps remove outliers and benefits quantization. In this work, we identify a…

    Submitted 28 May, 2024; v1 submitted 25 May, 2024; originally announced May 2024.

  2. arXiv:2405.14862  [pdf, other]

    cs.CL

    Bitune: Bidirectional Instruction-Tuning

    Authors: Dawid J. Kopiczko, Tijmen Blankevoort, Yuki M. Asano

    Abstract: We introduce Bitune, a method that improves instruction-tuning of pretrained decoder-only large language models, leading to consistent gains on downstream tasks. Bitune applies both causal and bidirectional attention to the prompt, to obtain a better representation of the query or instruction. We realize this by introducing two sets of parameters, for which we apply parameter-efficient finetuning…

    Submitted 23 May, 2024; originally announced May 2024.

  3. arXiv:2402.16848  [pdf, other]

    cs.LG

    InterroGate: Learning to Share, Specialize, and Prune Representations for Multi-task Learning

    Authors: Babak Ehteshami Bejnordi, Gaurav Kumar, Amelie Royer, Christos Louizos, Tijmen Blankevoort, Mohsen Ghafoorian

    Abstract: Jointly learning multiple tasks with a unified model can improve accuracy and data efficiency, but it faces the challenge of task interference, where optimizing one task objective may inadvertently compromise the performance of another. A solution to mitigate this issue is to allocate task-specific parameters, free from interference, on top of shared features. However, manually designing such arch…

    Submitted 26 February, 2024; originally announced February 2024.

    Comments: Under review

  4. arXiv:2402.16844  [pdf, other]

    cs.LG cs.AI cs.CL

    Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding

    Authors: Benjamin Bergner, Andrii Skliar, Amelie Royer, Tijmen Blankevoort, Yuki Asano, Babak Ehteshami Bejnordi

    Abstract: Large language models (LLMs) have become ubiquitous in practice and are widely used for generation tasks such as translation, summarization and instruction following. However, their enormous size and reliance on autoregressive decoding increase deployment costs and complicate their use in latency-critical applications. In this work, we propose a hybrid approach that combines language models of dif…

    Submitted 16 July, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

    Comments: Work presented at the ES-FoMo II Workshop at ICML 2024

  5. arXiv:2402.15319  [pdf, other]

    cs.LG cs.CL

    GPTVQ: The Blessing of Dimensionality for LLM Quantization

    Authors: Mart van Baalen, Andrey Kuzmin, Markus Nagel, Peter Couperus, Cedric Bastoul, Eric Mahurin, Tijmen Blankevoort, Paul Whatmough

    Abstract: In this work we show that the size versus accuracy trade-off of neural network quantization can be significantly improved by increasing the quantization dimensionality. We propose the GPTVQ method, a new fast method for post-training vector quantization (VQ) that scales well to Large Language Models (LLMs). Our method interleaves quantization of one or more columns with updates to the remaining un…

    Submitted 23 February, 2024; originally announced February 2024.

  6. arXiv:2312.17244  [pdf, other]

    cs.LG cs.CL

    The LLM Surgeon

    Authors: Tycho F. A. van der Ouderaa, Markus Nagel, Mart van Baalen, Yuki M. Asano, Tijmen Blankevoort

    Abstract: State-of-the-art language models are becoming increasingly large in an effort to achieve the highest performance on large corpora of available textual data. However, the sheer size of the Transformer architectures makes it difficult to deploy models within computational, environmental or device-specific constraints. We explore data-driven compression of existing pretrained models as an alternative…

    Submitted 20 March, 2024; v1 submitted 28 December, 2023; originally announced December 2023.

  7. arXiv:2310.11454  [pdf, other]

    cs.CL

    VeRA: Vector-based Random Matrix Adaptation

    Authors: Dawid J. Kopiczko, Tijmen Blankevoort, Yuki M. Asano

    Abstract: Low-rank adaptation (LoRA) is a popular method that reduces the number of trainable parameters when finetuning large language models, but still faces acute storage challenges when scaling to even larger models or deploying numerous per-user or per-task adapted models. In this work, we present Vector-based Random Matrix Adaptation (VeRA), which significantly reduces the number of trainable parameter…

    Submitted 16 January, 2024; v1 submitted 17 October, 2023; originally announced October 2023.

    Comments: Accepted at ICLR 2024, website: https://dkopi.github.io/vera

  8. arXiv:2310.08910  [pdf, other]

    cs.LG cs.CV

    Scalarization for Multi-Task and Multi-Domain Learning at Scale

    Authors: Amelie Royer, Tijmen Blankevoort, Babak Ehteshami Bejnordi

    Abstract: Training a single model on multiple input domains and/or output tasks allows for compressing information from multiple sources into a unified backbone and hence improves model efficiency. It also enables potential positive knowledge transfer across tasks/domains, leading to improved accuracy and data-efficient training. However, optimizing such networks is a challenge, in particular due to discrepanci…

    Submitted 13 October, 2023; originally announced October 2023.

    Comments: NeurIPS 2023; https://openreview.net/forum?id=TSuq3debnD

  9. arXiv:2308.07350  [pdf, other]

    cs.LG cs.AI

    Efficient Neural PDE-Solvers using Quantization Aware Training

    Authors: Winfried van den Dool, Tijmen Blankevoort, Max Welling, Yuki M. Asano

    Abstract: In the past years, the application of neural networks as an alternative to classical numerical methods to solve Partial Differential Equations has emerged as a potential paradigm shift in this century-old mathematical field. However, in terms of practical applicability, computational cost remains a substantial bottleneck. Classical approaches try to mitigate this challenge by limiting the spatial…

    Submitted 14 August, 2023; originally announced August 2023.

    Comments: Accepted at the ICCV 2023 Workshop on Resource Efficient Deep Learning for Computer Vision

  10. arXiv:2307.04535  [pdf, other]

    cs.LG cs.AI cs.CV

    QBitOpt: Fast and Accurate Bitwidth Reallocation during Training

    Authors: Jorn Peters, Marios Fournarakis, Markus Nagel, Mart van Baalen, Tijmen Blankevoort

    Abstract: Quantizing neural networks is one of the most effective methods for achieving efficient inference on mobile and embedded devices. In particular, mixed precision quantized (MPQ) networks, whose layers can be quantized to different bitwidths, achieve better task performance for the same resource constraint compared to networks with homogeneous bitwidths. However, finding the optimal bitwidth allocat…

    Submitted 10 July, 2023; originally announced July 2023.

  11. arXiv:2307.02973  [pdf, other]

    cs.LG

    Pruning vs Quantization: Which is Better?

    Authors: Andrey Kuzmin, Markus Nagel, Mart van Baalen, Arash Behboodi, Tijmen Blankevoort

    Abstract: Neural network pruning and quantization techniques are almost as old as neural networks themselves. However, to date only ad-hoc comparisons between the two have been published. In this paper, we set out to answer the question of which is better: neural network quantization or pruning? By answering this question, we hope to inform design decisions made on neural network hardware going forward. We…

    Submitted 16 February, 2024; v1 submitted 6 July, 2023; originally announced July 2023.

  12. arXiv:2307.02321  [pdf, other]

    cs.CV

    MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers

    Authors: Jakob Drachmann Havtorn, Amelie Royer, Tijmen Blankevoort, Babak Ehteshami Bejnordi

    Abstract: The input tokens to Vision Transformers carry little semantic meaning as they are defined as regular equal-sized patches of the input image, regardless of its content. However, processing uniform background areas of an image should not necessitate as much compute as dense, cluttered areas. To address this issue, we propose a dynamic mixed-scale tokenization scheme for ViT, MSViT. Our method introd…

    Submitted 7 September, 2023; v1 submitted 5 July, 2023; originally announced July 2023.

    Comments: ICCV Workshops 2023; Code for the Generalized Batch-Shaping loss is available at https://github.com/Qualcomm-AI-research/batchshaping

  13. arXiv:2306.12929  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV

    Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing

    Authors: Yelysei Bondarenko, Markus Nagel, Tijmen Blankevoort

    Abstract: Transformer models have been widely adopted in various domains over the last years, and especially large language models have advanced the field of AI significantly. Due to their size, the capability of these networks has increased tremendously, but this has come at the cost of a significant increase in necessary compute. Quantization is one of the most effective ways to reduce the computational t…

    Submitted 9 November, 2023; v1 submitted 22 June, 2023; originally announced June 2023.

  14. arXiv:2304.05497  [pdf, other]

    cs.CV cs.LG

    Revisiting Single-gated Mixtures of Experts

    Authors: Amelie Royer, Ilia Karmanov, Andrii Skliar, Babak Ehteshami Bejnordi, Tijmen Blankevoort

    Abstract: Mixtures of Experts (MoE) are rising in popularity as a means to train extremely large-scale models while still allowing for a reasonable computational cost at inference time. Recent state-of-the-art approaches usually assume a large number of experts, and require training all experts jointly, which often leads to training instabilities such as the router collapsing. In contrast, in this work, we propose to…

    Submitted 11 April, 2023; originally announced April 2023.

    Comments: BMVC 2022

  15. arXiv:2303.17951  [pdf, other]

    cs.LG

    FP8 versus INT8 for efficient deep learning inference

    Authors: Mart van Baalen, Andrey Kuzmin, Suparna S Nair, Yuwei Ren, Eric Mahurin, Chirag Patel, Sundar Subramanian, Sanghyuk Lee, Markus Nagel, Joseph Soriaga, Tijmen Blankevoort

    Abstract: Recently, the idea of using FP8 as a number format for neural network training has been floating around the deep learning world. Given that most training is currently conducted with entire networks in FP32, or sometimes FP16 with mixed-precision, the step to having some parts of a network run in FP8 with 8-bit weights is an appealing potential speed-up for the generally costly and time-intensive t…

    Submitted 15 June, 2023; v1 submitted 31 March, 2023; originally announced March 2023.

  16. arXiv:2302.05397  [pdf, other]

    cs.LG

    A Practical Mixed Precision Algorithm for Post-Training Quantization

    Authors: Nilesh Prasad Pandey, Markus Nagel, Mart van Baalen, Yin Huang, Chirag Patel, Tijmen Blankevoort

    Abstract: Neural network quantization is frequently used to optimize model size, latency and power consumption for on-device deployment of neural networks. In many cases, a target bit-width is set for an entire network, meaning every layer gets quantized to the same number of bits. However, for many networks some layers are significantly more robust to quantization noise than others, leaving an important axi…

    Submitted 10 February, 2023; originally announced February 2023.

  17. arXiv:2208.09225  [pdf, other]

    cs.LG

    FP8 Quantization: The Power of the Exponent

    Authors: Andrey Kuzmin, Mart Van Baalen, Yuwei Ren, Markus Nagel, Jorn Peters, Tijmen Blankevoort

    Abstract: When quantizing neural networks for efficient inference, low-bit integers are the go-to format for efficiency. However, low-bit floating point numbers have an extra degree of freedom, assigning some bits to work on an exponential scale instead. This paper investigates in depth this benefit of the floating point format for neural network inference. We detail the choices that can be made for the FP8…

    Submitted 23 February, 2024; v1 submitted 19 August, 2022; originally announced August 2022.

  18. arXiv:2206.08236  [pdf, other]

    cs.CV cs.LG eess.IV

    Simple and Efficient Architectures for Semantic Segmentation

    Authors: Dushyant Mehta, Andrii Skliar, Haitam Ben Yahia, Shubhankar Borse, Fatih Porikli, Amirhossein Habibian, Tijmen Blankevoort

    Abstract: Though the state-of-the-art architectures for semantic segmentation, such as HRNet, demonstrate impressive accuracy, the complexity arising from their salient design choices hinders a range of model acceleration tools, and further they make use of operations that are inefficient on current hardware. This paper demonstrates that a simple encoder-decoder architecture with a ResNet-like backbone and a sm…

    Submitted 16 June, 2022; originally announced June 2022.

    Comments: To be presented at Efficient Deep Learning for Computer Vision Workshop at CVPR 2022

  19. arXiv:2203.11086  [pdf, other]

    cs.LG

    Overcoming Oscillations in Quantization-Aware Training

    Authors: Markus Nagel, Marios Fournarakis, Yelysei Bondarenko, Tijmen Blankevoort

    Abstract: When training neural networks with simulated quantization, we observe that quantized weights can, rather unexpectedly, oscillate between two grid-points. The importance of this effect and its impact on quantization-aware training (QAT) are not well-understood or investigated in the literature. In this paper, we delve deeper into the phenomenon of weight oscillations and show that it can lead to a sign…

    Submitted 28 June, 2022; v1 submitted 21 March, 2022; originally announced March 2022.

    Comments: Published as oral paper at ICML 2022

  20. arXiv:2202.01290  [pdf, other]

    cs.LG cs.CV

    Cyclical Pruning for Sparse Neural Networks

    Authors: Suraj Srinivas, Andrey Kuzmin, Markus Nagel, Mart van Baalen, Andrii Skliar, Tijmen Blankevoort

    Abstract: Current methods for pruning neural network weights iteratively apply magnitude-based pruning on the model weights and re-train the resulting model to recover lost accuracy. In this work, we show that such strategies do not allow for the recovery of erroneously pruned weights. To enable weight recovery, we propose a simple strategy called cyclical pruning which requires the pruning schedul…

    Submitted 2 February, 2022; originally announced February 2022.

  21. arXiv:2201.08442  [pdf, other]

    cs.LG cs.AI cs.AR cs.PF cs.SE

    Neural Network Quantization with AI Model Efficiency Toolkit (AIMET)

    Authors: Sangeetha Siddegowda, Marios Fournarakis, Markus Nagel, Tijmen Blankevoort, Chirag Patel, Abhijit Khobare

    Abstract: While neural networks have advanced the frontiers in many machine learning applications, they often come at a high computational cost. Reducing the power and latency of neural network inference is vital to integrating modern networks into edge devices with strict power and compute requirements. Neural network quantization is one of the most effective ways of achieving these savings, but the additi…

    Submitted 20 January, 2022; originally announced January 2022.

    Comments: arXiv admin note: substantial text overlap with arXiv:2106.08295

  22. arXiv:2109.12948  [pdf, other]

    cs.LG cs.AI cs.CL

    Understanding and Overcoming the Challenges of Efficient Transformer Quantization

    Authors: Yelysei Bondarenko, Markus Nagel, Tijmen Blankevoort

    Abstract: Transformer-based architectures have become the de-facto standard models for a wide range of Natural Language Processing tasks. However, their memory footprint and high latency are prohibitive for efficient deployment and inference on resource-limited devices. In this work, we explore quantization for transformers. We show that transformers have unique quantization challenges -- namely, high dynam…

    Submitted 27 September, 2021; originally announced September 2021.

  23. arXiv:2106.08295  [pdf, other]

    cs.LG cs.AI cs.CV

    A White Paper on Neural Network Quantization

    Authors: Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart van Baalen, Tijmen Blankevoort

    Abstract: While neural networks have advanced the frontiers in many applications, they often come at a high computational cost. Reducing the power and latency of neural network inference is key if we want to integrate modern networks into edge devices with strict power and compute requirements. Neural network quantization is one of the most effective ways of achieving these savings but the additional noise…

    Submitted 15 June, 2021; originally announced June 2021.

  24. arXiv:2012.08859  [pdf, other]

    cs.LG cs.AI cs.CV cs.NE stat.ML

    Distilling Optimal Neural Networks: Rapid Search in Diverse Spaces

    Authors: Bert Moons, Parham Noorzad, Andrii Skliar, Giovanni Mariani, Dushyant Mehta, Chris Lott, Tijmen Blankevoort

    Abstract: Current state-of-the-art Neural Architecture Search (NAS) methods neither efficiently scale to multiple hardware platforms, nor handle diverse architectural search-spaces. To remedy this, we present DONNA (Distilling Optimal Neural Network Architectures), a novel pipeline for rapid, scalable and diverse NAS, that scales to many user scenarios. DONNA consists of three phases. First, an accuracy pre…

    Submitted 27 August, 2021; v1 submitted 16 December, 2020; originally announced December 2020.

    Comments: Accepted at ICCV2021. Main text 9 pages, Full text 21 pages, 18 figures

  25. arXiv:2007.10463  [pdf, ps, other]

    cs.LG stat.ML

    Differentiable Joint Pruning and Quantization for Hardware Efficiency

    Authors: Ying Wang, Yadong Lu, Tijmen Blankevoort

    Abstract: We present a differentiable joint pruning and quantization (DJPQ) scheme. We frame neural network compression as a joint gradient-based optimization problem, trading off between model pruning and quantization automatically for hardware efficiency. DJPQ incorporates variational information bottleneck based structured pruning and mixed-bit precision quantization into a single differentiable loss fun…

    Submitted 4 April, 2021; v1 submitted 20 July, 2020; originally announced July 2020.

    Comments: Accepted to ECCV 2020

  26. arXiv:2005.07093  [pdf, other]

    cs.LG cs.CV stat.ML

    Bayesian Bits: Unifying Quantization and Pruning

    Authors: Mart van Baalen, Christos Louizos, Markus Nagel, Rana Ali Amjad, Ying Wang, Tijmen Blankevoort, Max Welling

    Abstract: We introduce Bayesian Bits, a practical method for joint mixed precision quantization and pruning through gradient based optimization. Bayesian Bits employs a novel decomposition of the quantization operation, which sequentially considers doubling the bit width. At each new bit width, the residual error between the full precision value and the previously rounded value is quantized. We then decide…

    Submitted 27 October, 2020; v1 submitted 14 May, 2020; originally announced May 2020.

  27. arXiv:2004.10568  [pdf, other]

    cs.LG cs.CV stat.ML

    Up or Down? Adaptive Rounding for Post-Training Quantization

    Authors: Markus Nagel, Rana Ali Amjad, Mart van Baalen, Christos Louizos, Tijmen Blankevoort

    Abstract: When quantizing neural networks, assigning each floating-point weight to its nearest fixed-point value is the predominant approach. We find that, perhaps surprisingly, this is not the best we can do. In this paper, we propose AdaRound, a better weight-rounding mechanism for post-training quantization that adapts to the data and the task loss. AdaRound is fast, does not require fine-tuning of the n…

    Submitted 30 June, 2020; v1 submitted 22 April, 2020; originally announced April 2020.

    Comments: Published as a conference paper at ICML 2020

  28. arXiv:2004.09576  [pdf, other]

    cs.CV cs.LG stat.ML

    LSQ+: Improving low-bit quantization through learnable offsets and better initialization

    Authors: Yash Bhalgat, Jinwon Lee, Markus Nagel, Tijmen Blankevoort, Nojun Kwak

    Abstract: Unlike ReLU, newer activation functions (like Swish, H-swish, Mish) that are frequently employed in popular efficient architectures can also result in negative activation values, with skewed positive and negative ranges. Typical learnable quantization schemes [PACT, LSQ] assume unsigned quantization for activations and quantize all negative activations to zero which leads to significant loss in pe…

    Submitted 20 April, 2020; originally announced April 2020.

    Comments: Camera-ready for Joint Workshop on Efficient Deep Learning in Computer Vision, CVPR 2020

  29. arXiv:2004.00070  [pdf, other]

    cs.CV cs.LG stat.ML

    Conditional Channel Gated Networks for Task-Aware Continual Learning

    Authors: Davide Abati, Jakub Tomczak, Tijmen Blankevoort, Simone Calderara, Rita Cucchiara, Babak Ehteshami Bejnordi

    Abstract: Convolutional Neural Networks experience catastrophic forgetting when optimized on a sequence of learning problems: as they meet the objective of the current training examples, their performance on previous tasks drops drastically. In this work, we introduce a novel framework to tackle this problem with conditional computation. We equip each convolutional layer with task-specific gating modules, s…

    Submitted 31 March, 2020; originally announced April 2020.

    Comments: CVPR 2020 (oral)

  30. arXiv:2003.00075  [pdf, other]

    cs.LG stat.ML

    Learned Threshold Pruning

    Authors: Kambiz Azarian, Yash Bhalgat, Jinwon Lee, Tijmen Blankevoort

    Abstract: This paper presents a novel differentiable method for unstructured weight pruning of deep neural networks. Our learned-threshold pruning (LTP) method learns per-layer thresholds via gradient descent, unlike conventional methods where they are set as input. Making thresholds trainable also makes LTP computationally efficient, hence scalable to deeper networks. For example, it takes 30 epochs for…

    Submitted 18 March, 2021; v1 submitted 28 February, 2020; originally announced March 2020.

  31. arXiv:2002.07520  [pdf, other]

    cs.LG stat.ML

    Gradient $\ell_1$ Regularization for Quantization Robustness

    Authors: Milad Alizadeh, Arash Behboodi, Mart van Baalen, Christos Louizos, Tijmen Blankevoort, Max Welling

    Abstract: We analyze the effect of quantizing weights and activations of neural networks on their loss and derive a simple regularization scheme that improves robustness against post-training quantization. By training quantization-ready networks, our approach enables storing a single set of weights that can be quantized on-demand to different bit-widths as energy and memory requirements of the application c…

    Submitted 18 February, 2020; originally announced February 2020.

    Comments: ICLR 2020

  32. arXiv:1912.09802  [pdf, other]

    cs.LG cs.CV stat.ML

    Taxonomy and Evaluation of Structured Compression of Convolutional Neural Networks

    Authors: Andrey Kuzmin, Markus Nagel, Saurabh Pitre, Sandeep Pendyam, Tijmen Blankevoort, Max Welling

    Abstract: The success of deep neural networks in many real-world applications is leading to new challenges in building more efficient architectures. One effective way of making networks more efficient is neural network compression. We provide an overview of existing neural network compression methods that can be used to make neural networks more efficient by changing the architecture of the network. First,…

    Submitted 20 December, 2019; originally announced December 2019.

  33. arXiv:1907.06627  [pdf, other]

    cs.LG cs.CV stat.ML

    Batch-Shaping for Learning Conditional Channel Gated Networks

    Authors: Babak Ehteshami Bejnordi, Tijmen Blankevoort, Max Welling

    Abstract: We present a method that trains large capacity neural networks with significantly improved accuracy and lower dynamic computational cost. We achieve this by gating the deep-learning architecture on a fine-grained-level. Individual convolutional maps are turned on/off conditionally on features in the network. To achieve this, we introduce a new residual block architecture that gates convolutional c…

    Submitted 3 April, 2020; v1 submitted 15 July, 2019; originally announced July 2019.

    Comments: Published as a conference paper at ICLR 2020

  34. arXiv:1906.04721  [pdf, other]

    cs.LG cs.CV stat.ML

    Data-Free Quantization Through Weight Equalization and Bias Correction

    Authors: Markus Nagel, Mart van Baalen, Tijmen Blankevoort, Max Welling

    Abstract: We introduce a data-free quantization method for deep neural networks that does not require fine-tuning or hyperparameter selection. It achieves near-original model performance on common computer vision architectures and tasks. 8-bit fixed-point quantization is essential for efficient inference on modern deep learning hardware. However, quantizing models to run in 8-bit is a non-trivial task, freq…

    Submitted 25 November, 2019; v1 submitted 11 June, 2019; originally announced June 2019.

    Comments: ICCV 2019

    Journal ref: The IEEE International Conference on Computer Vision (ICCV), 2019

  35. arXiv:1810.01875  [pdf, other]

    cs.LG stat.ML

    Relaxed Quantization for Discretized Neural Networks

    Authors: Christos Louizos, Matthias Reisser, Tijmen Blankevoort, Efstratios Gavves, Max Welling

    Abstract: Neural network quantization has become an important research area due to its great impact on deployment of large models on resource constrained devices. In order to train networks that can be effectively discretized without loss of performance, we introduce a differentiable quantization procedure. Differentiability can be achieved by transforming continuous distributions over the weights and activ…

    Submitted 3 October, 2018; originally announced October 2018.