Showing 1–4 of 4 results for author: Kasikci, B

  1. arXiv:2406.10774  [pdf, other]

    cs.CL cs.LG

    Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

    Authors: Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, Song Han

    Abstract: As the demand for long-context large language models (LLMs) increases, models with context windows of up to 128K or 1M tokens are becoming increasingly prevalent. However, long-context LLM inference is challenging since the inference speed decreases significantly as the sequence length grows. This slowdown is primarily caused by loading a large KV cache during self-attention. Previous works have s…

    Submitted 15 June, 2024; originally announced June 2024.

    Comments: ICML 2024

  2. arXiv:2403.14770  [pdf, other]

    cs.AR

    Beehive: A Flexible Network Stack for Direct-Attached Accelerators

    Authors: Katie Lim, Matthew Giordano, Theano Stavrinos, Pratyush Patel, Jacob Nelson, Irene Zhang, Baris Kasikci, Tom Anderson

    Abstract: Direct-attached accelerators, where application accelerators are directly connected to the datacenter network via a hardware network stack, offer substantial benefits in terms of reduced latency, CPU overhead, and energy use. However, a key challenge is that modern datacenter network stacks are complex, with interleaved protocol layers, network management functions, and virtualization support. To…

    Submitted 30 May, 2024; v1 submitted 21 March, 2024; originally announced March 2024.

  3. arXiv:2402.07033  [pdf, other]

    cs.LG cs.AI cs.DC cs.OS

    Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models

    Authors: Keisuke Kamahori, Yile Gu, Kan Zhu, Baris Kasikci

    Abstract: Large Language Models (LLMs) based on Mixture-of-Experts (MoE) architecture are showing promising performance on various tasks. However, running them on resource-constrained settings, where GPU memory resources are not abundant, is challenging due to huge model sizes. Existing systems that offload model weights to CPU memory suffer from the significant overhead of frequently moving data between CP…

    Submitted 10 February, 2024; originally announced February 2024.

  4. arXiv:2310.19102  [pdf, other]

    cs.LG

    Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

    Authors: Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci

    Abstract: The growing demand for Large Language Models (LLMs) in applications such as content generation, intelligent chatbots, and sentiment analysis poses considerable challenges for LLM service providers. To efficiently use GPU resources and boost throughput, batching multiple requests has emerged as a popular paradigm; to further speed up batching, LLM quantization techniques reduce memory consumption a…

    Submitted 16 April, 2024; v1 submitted 29 October, 2023; originally announced October 2023.