Showing 1–3 of 3 results for author: Svirschevski, R

  1. arXiv:2406.02532  [pdf, other]

    cs.CL

    SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices

    Authors: Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, Max Ryabinin

    Abstract: As large language models gain widespread adoption, running them efficiently becomes crucial. Recent works on LLM inference use speculative decoding to achieve extreme speedups. However, most of these works implicitly design their algorithms for high-end datacenter hardware. In this work, we ask the opposite question: how fast can we run LLMs on consumer machines? Consumer GPUs can no longer fit th…

    Submitted 25 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

    Comments: preprint

  2. arXiv:2402.12374  [pdf, other]

    cs.CL

    Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding

    Authors: Zhuoming Chen, Avner May, Ruslan Svirschevski, Yuhsun Huang, Max Ryabinin, Zhihao Jia, Beidi Chen

    Abstract: As the usage of large language models (LLMs) grows, performing efficient inference with these models becomes increasingly important. While speculative decoding has recently emerged as a promising direction for speeding up inference, existing methods are limited in their ability to scale to larger speculation budgets, and adapt to different hyperparameters and hardware. This paper introduces Sequoi…

    Submitted 29 February, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

  3. arXiv:2306.03078  [pdf, other]

    cs.CL cs.LG

    SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression

    Authors: Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, Dan Alistarh

    Abstract: Recent advances in large language model (LLM) pretraining have led to high-quality LLMs with impressive abilities. By compressing such LLMs via quantization to 3-4 bits per parameter, they can fit into memory-limited devices such as laptops and mobile phones, enabling personalized use. However, quantization down to 3-4 bits per parameter usually leads to moderate-to-high accuracy losses, especiall…

    Submitted 5 June, 2023; originally announced June 2023.

    Comments: Extended preprint