Showing 1–31 of 31 results for author: Song, S L

  1. arXiv:2406.05223  [pdf, other]

    cs.LG cs.AI

    CorDA: Context-Oriented Decomposition Adaptation of Large Language Models

    Authors: Yibo Yang, Xiaojie Li, Zhongzhu Zhou, Shuaiwen Leon Song, Jianlong Wu, Liqiang Nie, Bernard Ghanem

    Abstract: Current parameter-efficient fine-tuning (PEFT) methods build adapters without considering the context of downstream task to learn, or the context of important knowledge to maintain. As a result, there is often a performance gap compared to full-parameter finetuning, and meanwhile the finetuned model suffers from catastrophic forgetting of the pre-trained world knowledge. In this paper, we propose…

    Submitted 7 June, 2024; originally announced June 2024.

  2. arXiv:2406.00977  [pdf, other]

    cs.CV cs.AI

    Dragonfly: Multi-Resolution Zoom Supercharges Large Visual-Language Model

    Authors: Kezhen Chen, Rahul Thapa, Rahul Chalamala, Ben Athiwaratkun, Shuaiwen Leon Song, James Zou

    Abstract: Recent advances in large multimodal models (LMMs) suggest that higher image resolution enhances the fine-grained understanding of image details, crucial for tasks such as visual commonsense reasoning and analyzing biomedical images. However, increasing input resolution poses two main challenges: 1) It extends the context length required by the language model, leading to inefficiencies and hitting…

    Submitted 3 June, 2024; originally announced June 2024.

  3. arXiv:2401.14112  [pdf, other]

    cs.LG cs.AI cs.AR

    FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design

    Authors: Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, Olatunji Ruwase, Yuxiong He, Shuaiwen Leon Song

    Abstract: Six-bit quantization (FP6) can effectively reduce the size of large language models (LLMs) and preserve the model quality consistently across varied applications. However, existing systems do not provide Tensor Core support for FP6 quantization and struggle to achieve practical performance improvements during LLM inference. It is challenging to support FP6 quantization on GPUs due to (1) unfriendl…

    Submitted 3 March, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

    Comments: Adding URL link of the source code

  4. arXiv:2310.04610  [pdf, other]

    cs.AI cs.LG

    DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery through Sophisticated AI System Technologies

    Authors: Shuaiwen Leon Song, Bonnie Kruft, Minjia Zhang, Conglong Li, Shiyang Chen, Chengming Zhang, Masahiro Tanaka, Xiaoxia Wu, Jeff Rasley, Ammar Ahmad Awan, Connor Holmes, Martin Cai, Adam Ghanem, Zhongzhu Zhou, Yuxiong He, Pete Luferenko, Divya Kumar, Jonathan Weyn, Ruixiong Zhang, Sylwester Klocek, Volodymyr Vragov, Mohammed AlQuraishi, Gustaf Ahdritz, Christina Floristean, Cristina Negri, et al. (67 additional authors not shown)

    Abstract: In the upcoming decade, deep learning may revolutionize the natural sciences, enhancing our capacity to model and predict natural occurrences. This could herald a new era of scientific exploration, bringing significant advancements across sectors from drug development to renewable energy. To answer this call, we present the DeepSpeed4Science initiative (deepspeed4science.ai) which aims to build unique…

    Submitted 11 October, 2023; v1 submitted 6 October, 2023; originally announced October 2023.

  5. arXiv:2309.14509  [pdf, other]

    cs.LG cs.CL cs.DC

    DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    Authors: Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, Yuxiong He

    Abstract: Computation in a typical Transformer-based large language model (LLM) can be characterized by batch size, hidden dimension, number of layers, and sequence length. Until now, system works for accelerating LLM training have focused on the first three dimensions: data parallelism for batch size, tensor parallelism for hidden size and pipeline parallelism for model depth or layers. These widely studie…

    Submitted 4 October, 2023; v1 submitted 25 September, 2023; originally announced September 2023.

  6. arXiv:2309.10285  [pdf, other]

    cs.DC cs.AR cs.LG

    Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

    Authors: Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, Shuaiwen Leon Song

    Abstract: With the fast growth of parameter size, it becomes increasingly challenging to deploy large generative models as they typically require large GPU memory consumption and massive computation. Unstructured model pruning has been a common approach to reduce both GPU memory footprint and the overall computation while retaining good model accuracy. However, the existing solutions do not provide a highly…

    Submitted 18 September, 2023; originally announced September 2023.

    Comments: VLDB 2024

  7. arXiv:2309.00810  [pdf, other]

    cs.CV cs.AI

    RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Model

    Authors: Fengxiang Bie, Yibo Yang, Zhongzhu Zhou, Adam Ghanem, Minjia Zhang, Zhewei Yao, Xiaoxia Wu, Connor Holmes, Pareesa Golnari, David A. Clifton, Yuxiong He, Dacheng Tao, Shuaiwen Leon Song

    Abstract: Text-to-image generation (TTI) refers to the usage of models that could process text input and generate high fidelity images based on text descriptions. Text-to-image generation using neural networks could be traced back to the emergence of Generative Adversarial Network (GAN), followed by the autoregressive Transformer. Diffusion models are one prominent type of generative model used for the genera…

    Submitted 1 September, 2023; originally announced September 2023.

  8. arXiv:2308.01320  [pdf, other]

    cs.LG cs.AI cs.CL

    DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales

    Authors: Zhewei Yao, Reza Yazdani Aminabadi, Olatunji Ruwase, Samyam Rajbhandari, Xiaoxia Wu, Ammar Ahmad Awan, Jeff Rasley, Minjia Zhang, Conglong Li, Connor Holmes, Zhongzhu Zhou, Michael Wyatt, Molly Smith, Lev Kurilenko, Heyang Qin, Masahiro Tanaka, Shuai Che, Shuaiwen Leon Song, Yuxiong He

    Abstract: ChatGPT-like models have revolutionized various applications in artificial intelligence, from summarization and coding to translation, matching or even surpassing human performance. However, the current landscape lacks an accessible, efficient, and cost-effective end-to-end RLHF (Reinforcement Learning with Human Feedback) training pipeline for these powerful models, particularly when training at…

    Submitted 2 August, 2023; originally announced August 2023.

    Comments: 14 pages, 7 figures

  9. arXiv:2307.02666  [pdf, other]

    cs.AR

    Chiplet Cloud: Building AI Supercomputers for Serving Large Generative Language Models

    Authors: Huwan Peng, Scott Davidson, Richard Shi, Shuaiwen Leon Song, Michael Taylor

    Abstract: Large language models (LLMs) such as OpenAI's ChatGPT and Google's Gemini have demonstrated unprecedented capabilities of autoregressive AI models across multiple tasks, triggering disruptive technology innovations around the world. However, as models continue to grow, the cost to serve these models also continues to grow, threatening the democratization of LLMs. To address this issue, we propose C…

    Submitted 20 May, 2024; v1 submitted 5 July, 2023; originally announced July 2023.

  10. HEAT: A Highly Efficient and Affordable Training System for Collaborative Filtering Based Recommendation on CPUs

    Authors: Chengming Zhang, Shaden Smith, Baixi Sun, Jiannan Tian, Jonathan Soifer, Xiaodong Yu, Shuaiwen Leon Song, Yuxiong He, Dingwen Tao

    Abstract: Collaborative filtering (CF) has been proven to be one of the most effective techniques for recommendation. Among all CF approaches, SimpleX is the state-of-the-art method that adopts a novel loss function and a proper number of negative samples. However, there is no work that optimizes SimpleX on multi-core CPUs, leading to limited performance. To this end, we perform an in-depth profiling and an…

    Submitted 3 May, 2023; v1 submitted 14 April, 2023; originally announced April 2023.

    Comments: 12 pages, 14 figures, 7 tables, accepted by ACM ICS '23

  11. arXiv:2212.10642  [pdf, other]

    quant-ph

    Mitigating Coupling Map Constrained Correlated Measurement Errors on Quantum Devices

    Authors: Alan Robertson, Shuaiwen Leon Song

    Abstract: We introduce a technique for the suppression of state-dependent and correlated measurement errors, which are commonly observed on modern superconducting quantum devices. Our method leverages previous results, establishing that correlated errors tend to be physically localised on quantum devices to perform characterisations over the coupling map of the device, and to join overlapping measurement ca…

    Submitted 20 December, 2022; originally announced December 2022.

    Comments: 13 pages, 15 figures

  12. arXiv:2209.07552  [pdf, other]

    cs.DC

    MSREP: A Fast yet Light Sparse Matrix Framework for Multi-GPU Systems

    Authors: Jieyang Chen, Chenhao Xie, Jesun S Firoz, Jiajia Li, Shuaiwen Leon Song, Kevin Barker, Mark Raugas, Ang Li

    Abstract: Sparse linear algebra kernels play a critical role in numerous applications, covering from exascale scientific simulation to large-scale data analytics. Offloading linear algebra kernels on one GPU will no longer be viable in these applications, simply because the rapidly growing data volume may exceed the memory capacity and computing power of a single GPU. Multi-GPU systems nowadays being ubiqui…

    Submitted 15 September, 2022; originally announced September 2022.

  13. arXiv:2209.06272  [pdf, other]

    cs.DC

    Towards Efficient Architecture and Algorithms for Sensor Fusion

    Authors: Zhendong Wang, Xiaoming Zeng, Shuaiwen Leon Song, Yang Hu

    Abstract: The safety of an automated vehicle hinges crucially upon the accuracy of perception and decision-making latency. Under these stringent requirements, future automated cars are usually equipped with multi-modal sensors such as cameras and LiDARs. The sensor fusion is adopted to provide a confident context of driving scenarios for better decision-making. A promising sensor fusion technique is middle…

    Submitted 13 September, 2022; originally announced September 2022.

  14. arXiv:2207.00737  [pdf, other]

    cs.DC cs.RO

    Brief Industry Paper: The Necessity of Adaptive Data Fusion in Infrastructure-Augmented Autonomous Driving System

    Authors: Shaoshan Liu, Jianda Wang, Zhendong Wang, Bo Yu, Wei Hu, Yahui Liu, Jie Tang, Shuaiwen Leon Song, Cong Liu, Yang Hu

    Abstract: This paper is the first to provide a thorough system design overview along with the fusion methods selection criteria of a real-world cooperative autonomous driving system, named Infrastructure-Augmented Autonomous Driving or IAAD. We present an in-depth introduction of the IAAD hardware and software on both road-side and vehicle-side computing and communication platforms. We extensively character…

    Submitted 2 July, 2022; originally announced July 2022.

    Journal ref: 28th IEEE Real-Time and Embedded Technology and Applications Symposium, 2022

  15. arXiv:2111.09562  [pdf, other]

    cs.AI cs.DC

    COMET: A Novel Memory-Efficient Deep Learning Training Framework by Using Error-Bounded Lossy Compression

    Authors: Sian Jin, Chengming Zhang, Xintong Jiang, Yunhe Feng, Hui Guan, Guanpeng Li, Shuaiwen Leon Song, Dingwen Tao

    Abstract: Training wide and deep neural networks (DNNs) requires large amounts of storage resources such as memory because the intermediate activation data must be saved in the memory during forward propagation and then restored for backward propagation. However, state-of-the-art accelerators such as GPUs are only equipped with very limited memory capacities due to hardware design constraints, which signific…

    Submitted 18 November, 2021; originally announced November 2021.

    Comments: 14 pages, 17 figures, accepted by VLDB 2022. arXiv admin note: substantial text overlap with arXiv:2011.09017

  16. Shift-BNN: Highly-Efficient Probabilistic Bayesian Neural Network Training via Memory-Friendly Pattern Retrieving

    Authors: Qiyu Wan, Haojun Xia, Xingyao Zhang, Lening Wang, Shuaiwen Leon Song, Xin Fu

    Abstract: Bayesian Neural Networks (BNNs) that possess a property of uncertainty estimation have been increasingly adopted in a wide range of safety-critical AI applications which demand reliable and robust decision making, e.g., self-driving, rescue robots, medical image diagnosis. The training procedure of a probabilistic BNN model involves training an ensemble of sampled DNN models, which induces orders…

    Submitted 7 October, 2021; originally announced October 2021.

    Comments: 54th IEEE/ACM International Symposium on Microarchitecture

  17. MAPA: Multi-Accelerator Pattern Allocation Policy for Multi-Tenant GPU Servers

    Authors: Kiran Ranganath, Joshua D. Suetterlein, Joseph B. Manzano, Shuaiwen Leon Song, Daniel Wong

    Abstract: Multi-accelerator servers are increasingly being deployed in shared multi-tenant environments (such as in cloud data centers) in order to meet the demands of large-scale compute-intensive workloads. In addition, these accelerators are increasingly being inter-connected in complex topologies and workloads are exhibiting a wider variety of inter-accelerator communication patterns. However, existing…

    Submitted 7 October, 2021; originally announced October 2021.

  18. arXiv:2109.08219  [pdf, other]

    cs.IR cs.DB cs.DC

    Dr. Top-k: Delegate-Centric Top-k on GPUs

    Authors: Anil Gaihre, Da Zheng, Scott Weitze, Lingda Li, Shuaiwen Leon Song, Caiwen Ding, Xiaoye S Li, Hang Liu

    Abstract: Recent top-$k$ computation efforts explore the possibility of revising various sorting algorithms to answer top-$k$ queries on GPUs. These endeavors, unfortunately, perform significantly more work than needed. This paper introduces Dr. Top-k, a Delegate-centric top-$k$ system on GPUs that can reduce the top-$k$ workloads significantly. Particularly, it contains three major contributions: First, we…

    Submitted 16 September, 2021; originally announced September 2021.

    Comments: To be published in The International Conference for High Performance Computing, Networking, Storage and Analysis (SC 21)

  19. Toward Efficient Interactions between Python and Native Libraries

    Authors: Jialiang Tan, Yu Chen, Zhenming Liu, Bin Ren, Shuaiwen Leon Song, Xipeng Shen, Xu Liu

    Abstract: Python has become a popular programming language because of its excellent programmability. Many modern software packages utilize Python for high-level algorithm design and depend on native libraries written in C/C++/Fortran for efficient computation kernels. Interaction between Python code and native libraries introduces performance losses because of the abstraction lying on the boundary of Python…

    Submitted 10 June, 2021; originally announced July 2021.

    Comments: In Proceedings of the 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2021), August 23-27, 2021, Athens, Greece. ACM, New York, NY, USA, 12 pages

  20. arXiv:2106.11872  [pdf, other]

    cs.LG cs.NE

    Randomness In Neural Network Training: Characterizing The Impact of Tooling

    Authors: Donglin Zhuang, Xingyao Zhang, Shuaiwen Leon Song, Sara Hooker

    Abstract: The quest for determinism in machine learning has disproportionately focused on characterizing the impact of noise introduced by algorithmic design choices. In this work, we address a less well understood and studied question: how does our choice of tooling introduce randomness to deep neural network training. We conduct large scale experiments across different types of hardware, accelerators, sta…

    Submitted 22 June, 2021; originally announced June 2021.

    Comments: 21 pages, 10 figures

  21. Q-VR: System-Level Design for Future Mobile Collaborative Virtual Reality

    Authors: Chenhao Xie, Xie Li, Yang Hu, Huwan Peng, Michael Taylor, Shuaiwen Leon Song

    Abstract: High Quality Mobile Virtual Reality (VR) is what the incoming graphics technology era demands: users around the world, regardless of their hardware and network conditions, can all enjoy the immersive virtual experience. However, the state-of-the-art software-based mobile VR designs cannot fully satisfy the realtime performance requirements due to the highly interactive nature of user's actions and…

    Submitted 25 February, 2021; originally announced February 2021.

  22. arXiv:2012.06959  [pdf, other]

    cs.DC cs.AR

    Fast and Scalable Sparse Triangular Solver for Multi-GPU Based HPC Architectures

    Authors: Chenhao Xie, Jieyang Chen, Jesun S Firoz, Jiajia Li, Shuaiwen Leon Song, Kevin Barker, Mark Raugas, Ang Li

    Abstract: Designing efficient and scalable sparse linear algebra kernels on modern multi-GPU based HPC systems is a daunting task due to significant irregular memory references and workload imbalance across the GPUs. This is particularly the case for Sparse Triangular Solver (SpTRSV) which introduces additional two-dimensional computation dependencies among subsequent computation steps. Dependency informati…

    Submitted 12 December, 2020; originally announced December 2020.

  23. ClickTrain: Efficient and Accurate End-to-End Deep Learning Training via Fine-Grained Architecture-Preserving Pruning

    Authors: Chengming Zhang, Geng Yuan, Wei Niu, Jiannan Tian, Sian Jin, Donglin Zhuang, Zhe Jiang, Yanzhi Wang, Bin Ren, Shuaiwen Leon Song, Dingwen Tao

    Abstract: Convolutional neural networks (CNNs) are becoming increasingly deeper, wider, and non-linear because of the growing demand on prediction accuracy and analysis quality. The wide and deep CNNs, however, require a large amount of computing resources and processing time. Many previous works have studied model pruning to improve inference performance, but little work has been done for effectively reduc…

    Submitted 30 April, 2021; v1 submitted 19 November, 2020; originally announced November 2020.

    Comments: 12 pages, 15 figures, 2 tables, published by ICS'21

  24. arXiv:2011.09017  [pdf, other]

    cs.DC cs.CV

    A Novel Memory-Efficient Deep Learning Training Framework via Error-Bounded Lossy Compression

    Authors: Sian Jin, Guanpeng Li, Shuaiwen Leon Song, Dingwen Tao

    Abstract: Deep neural networks (DNNs) are becoming increasingly deeper, wider, and non-linear due to the growing demands on prediction accuracy and analysis quality. When training a DNN model, the intermediate activation data must be saved in the memory during forward propagation and then restored for backward propagation. However, state-of-the-art accelerators such as GPUs are only equipped with very limit…

    Submitted 17 November, 2020; originally announced November 2020.

    Comments: 11 pages, 11 figures, 1 table, accepted by PPoPP '21 as a poster

  25. Accurate Face Rig Approximation with Deep Differential Subspace Reconstruction

    Authors: Steven L. Song, Weiqi Shi, Michael Reed

    Abstract: To be suitable for film-quality animation, rigs for character deformation must fulfill a broad set of requirements. They must be able to create highly stylized deformation, allow a wide variety of controls to permit artistic freedom, and accurately reflect the design intent. Facial deformation is especially challenging due to its nonlinearity with respect to the animation controls and its addition…

    Submitted 17 August, 2020; v1 submitted 2 June, 2020; originally announced June 2020.

    Comments: 12 pages, ACM Trans. on Graphics (Proceedings of SIGGRAPH 2020)

  26. TSM2X: High-Performance Tall-and-Skinny Matrix-Matrix Multiplication on GPUs

    Authors: Cody Rivera, Jieyang Chen, Nan Xiong, Shuaiwen Leon Song, Dingwen Tao

    Abstract: Linear algebra operations have been widely used in big data analytics and scientific computations. Many works have been done on optimizing linear algebra operations on GPUs with regular-shaped input. However, few works focus on fully utilizing GPU resources when the input is not regular-shaped. Current optimizations do not consider fully utilizing the memory bandwidth and computing power; therefor…

    Submitted 18 February, 2021; v1 submitted 8 February, 2020; originally announced February 2020.

    Comments: 17 pages, 14 figures, published in JPDC

  27. OO-VR: NUMA Friendly Object-Oriented VR Rendering Framework For Future NUMA-Based Multi-GPU Systems

    Authors: Chenhao Xie, Xin Fu, Mingsong Chen, Shuaiwen Leon Song

    Abstract: With the strong computation capability, NUMA-based multi-GPU system is a promising candidate to provide sustainable and scalable performance for Virtual Reality. However, the entire multi-GPU system is viewed as a single GPU which ignores the data locality in VR rendering during the workload distribution, leading to tremendous remote memory accesses among GPU models. By conducting comprehensive ch…

    Submitted 8 January, 2020; originally announced January 2020.

  28. arXiv:1911.03451  [pdf, other]

    cs.DC cs.AR cs.LG

    Enabling Highly Efficient Capsule Networks Processing Through A PIM-Based Architecture Design

    Authors: Xingyao Zhang, Shuaiwen Leon Song, Chenhao Xie, Jing Wang, Weigong Zhang, Xin Fu

    Abstract: In recent years, the CNNs have achieved great successes in the image processing tasks, e.g., image recognition and object detection. Unfortunately, traditional CNN's classification is found to be easily misled by increasingly complex image features due to the usage of pooling operations, hence unable to preserve accurate position and pose information of the objects. To address this challenge, a no…

    Submitted 7 November, 2019; originally announced November 2019.

    Comments: To appear in the 2020 26th International Symposium on High-Performance Computer Architecture (HPCA 2020)

  29. arXiv:1903.04611  [pdf, other]

    cs.AR cs.DC cs.NI cs.PF

    Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect

    Authors: Ang Li, Shuaiwen Leon Song, Jieyang Chen, Jiajia Li, Xu Liu, Nathan Tallent, Kevin Barker

    Abstract: High performance multi-GPU computing becomes an inevitable trend due to the ever-increasing demand on computation capability in emerging domains such as deep learning, big data and planet-scale simulations. However, the lack of deep understanding on how modern GPUs can be connected and the real impact of state-of-the-art interconnect technology on multi-GPU application performance become a hurdle.…

    Submitted 11 March, 2019; originally announced March 2019.

    Comments: 15 pages. The paper is going to be submitted to TPDS

  30. SuperNeurons: Dynamic GPU Memory Management for Training Deep Neural Networks

    Authors: Linnan Wang, Jinmian Ye, Yiyang Zhao, Wei Wu, Ang Li, Shuaiwen Leon Song, Zenglin Xu, Tim Kraska

    Abstract: Going deeper and wider in neural architectures improves the accuracy, while the limited GPU DRAM places an undesired restriction on the network design domain. Deep Learning (DL) practitioners either need to change to less desired network architectures, or nontrivially dissect a network across multiple GPUs. These distract DL practitioners from concentrating on their original machine learning tasks. We pr…

    Submitted 12 January, 2018; originally announced January 2018.

    Comments: PPoPP 2018: 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

  31. arXiv:1605.02043  [pdf, other]

    cs.DC cs.PL

    A Graph-based Model for GPU Caching Problems

    Authors: Lingda Li, Ari B. Hayes, Stephen A. Hackler, Eddy Z. Zhang, Mario Szegedy, Shuaiwen Leon Song

    Abstract: Modeling data sharing in GPU programs is a challenging task because of the massive parallelism and complex data sharing patterns provided by GPU architectures. Better GPU caching efficiency can be achieved through careful task scheduling among different threads. Traditionally, in the field of parallel computing, graph partition models are used to model data communication and guide task scheduling.…

    Submitted 6 May, 2016; originally announced May 2016.

    Comments: Currently under submission