Showing 1–25 of 25 results for author: Rhu, M

  1. arXiv:2406.14571

    cs.AR cs.AI cs.LG

    PreSto: An In-Storage Data Preprocessing System for Training Recommendation Models

    Authors: Yunjae Lee, Hyeseong Kim, Minsoo Rhu

    Abstract: Training recommendation systems (RecSys) faces several challenges as it requires the "data preprocessing" stage to preprocess an ample amount of raw data and feed them to the GPU for training in a seamless manner. To sustain high training throughput, state-of-the-art solutions reserve a large fleet of CPU servers for preprocessing which incurs substantial deployment cost and power consumption. Our…

    Submitted 11 June, 2024; originally announced June 2024.

    Journal ref: Published at the 51st IEEE/ACM International Symposium on Computer Architecture (ISCA-51), 2024

  2. arXiv:2406.06955

    cs.DC cs.IR cs.LG

    ElasticRec: A Microservice-based Model Serving Architecture Enabling Elastic Resource Scaling for Recommendation Models

    Authors: Yujeong Choi, Jiin Kim, Minsoo Rhu

    Abstract: With the increasing popularity of recommendation systems (RecSys), the demand for compute resources in datacenters has surged. However, the model-wise resource allocation employed in current RecSys model serving architectures falls short in effectively utilizing resources, leading to sub-optimal total cost of ownership. We propose ElasticRec, a model serving architecture for RecSys providing resou…

    Submitted 11 June, 2024; originally announced June 2024.

    Journal ref: 51st IEEE/ACM International Symposium on Computer Architecture (ISCA-51), 2024

  3. arXiv:2404.08847

    cs.IR cs.CR cs.LG

    LazyDP: Co-Designing Algorithm-Software for Scalable Training of Differentially Private Recommendation Models

    Authors: Juntaek Lim, Youngeun Kwon, Ranggi Hwang, Kiwan Maeng, G. Edward Suh, Minsoo Rhu

    Abstract: Differential privacy (DP) is widely being employed in the industry as a practical standard for privacy protection. While private training of computer vision or natural language processing applications has been studied extensively, the computational challenges of training of recommender systems (RecSys) with DP have not been explored. In this work, we first present our detailed characterization of…

    Submitted 12 April, 2024; originally announced April 2024.

    Journal ref: Published at the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-29), 2024

  4. arXiv:2312.12391

    cs.LG cs.AI cs.AR

    vTrain: A Simulation Framework for Evaluating Cost-effective and Compute-optimal Large Language Model Training

    Authors: Jehyeon Bang, Yujeong Choi, Myeongwoo Kim, Yongdeok Kim, Minsoo Rhu

    Abstract: As large language models (LLMs) become widespread in various application domains, a critical challenge the AI community is facing is how to train these large AI models in a cost-effective manner. Existing LLM training plans typically employ a heuristic based parallel training strategy which is based on empirical observations rather than grounded upon a thorough examination of the search space of L…

    Submitted 27 November, 2023; originally announced December 2023.

  5. arXiv:2308.00846

    cs.AR

    Pathfinding Future PIM Architectures by Demystifying a Commercial PIM Technology

    Authors: Bongjoon Hyun, Taehun Kim, Dongjae Lee, Minsoo Rhu

    Abstract: Processing-in-memory (PIM) has been explored for decades by computer architects, yet it has never seen the light of day in real-world products due to their high design overheads and lack of a killer application. With the advent of critical memory-intensive workloads, several commercial PIM technologies have been introduced to the market ranging from domain-specific PIM architectures to more genera…

    Submitted 6 March, 2024; v1 submitted 1 August, 2023; originally announced August 2023.

    Comments: Published at the 30th IEEE International Symposium on High-Performance Computer Architecture (HPCA-30), 2024

  6. arXiv:2302.11750

    cs.DC cs.IR cs.LG

    Hera: A Heterogeneity-Aware Multi-Tenant Inference Server for Personalized Recommendations

    Authors: Yujeong Choi, John Kim, Minsoo Rhu

    Abstract: While providing low latency is a fundamental requirement in deploying recommendation services, achieving high resource utility is also crucial in cost-effectively maintaining the datacenter. Co-locating multiple workers of a model is an effective way to maximize query-level parallelism and server throughput, but the interference caused by concurrent workers at shared resources can prevent server q…

    Submitted 22 February, 2023; originally announced February 2023.

  7. arXiv:2301.10904

    cs.CR cs.DC cs.LG

    GPU-based Private Information Retrieval for On-Device Machine Learning Inference

    Authors: Maximilian Lam, Jeff Johnson, Wenjie Xiong, Kiwan Maeng, Udit Gupta, Yang Li, Liangzhen Lai, Ilias Leontiadis, Minsoo Rhu, Hsien-Hsin S. Lee, Vijay Janapa Reddi, Gu-Yeon Wei, David Brooks, G. Edward Suh

    Abstract: On-device machine learning (ML) inference can enable the use of private user data on user devices without revealing them to remote servers. However, a pure on-device solution to private ML inference is impractical for many applications that rely on embedding tables that are too large to be stored on-device. In particular, recommendation models typically use multiple embedding tables each on the or…

    Submitted 25 September, 2023; v1 submitted 25 January, 2023; originally announced January 2023.

  8. arXiv:2208.12392

    cs.AR cs.AI cs.CR cs.LG

    DiVa: An Accelerator for Differentially Private Machine Learning

    Authors: Beomsik Park, Ranggi Hwang, Dongho Yoon, Yoonhyuk Choi, Minsoo Rhu

    Abstract: The widespread deployment of machine learning (ML) is raising serious concerns on protecting the privacy of users who contributed to the collection of training data. Differential privacy (DP) is rapidly gaining momentum in the industry as a practical standard for privacy protection. Despite DP's importance, however, little has been explored within the computer systems community regarding the impli…

    Submitted 25 August, 2022; originally announced August 2022.

    Comments: Accepted for publication at the 55th IEEE/ACM International Symposium on Microarchitecture (MICRO-55), 2022

  9. arXiv:2205.04711

    cs.AR cs.AI cs.LG

    SmartSAGE: Training Large-scale Graph Neural Networks using In-Storage Processing Architectures

    Authors: Yunjae Lee, Jinha Chung, Minsoo Rhu

    Abstract: Graph neural networks (GNNs) can extract features by learning both the representation of each objects (i.e., graph nodes) and the relationship across different objects (i.e., the edges that connect nodes), achieving state-of-the-art performance in various graph-based tasks. Despite its strengths, utilizing these algorithms in a production environment faces several challenges as the number of graph…

    Submitted 10 May, 2022; originally announced May 2022.

    Comments: Accepted for publication at the 49th IEEE/ACM International Symposium on Computer Architecture (ISCA-49), 2022

  10. arXiv:2205.04702

    cs.AR cs.AI cs.LG

    Training Personalized Recommendation Systems from (GPU) Scratch: Look Forward not Backwards

    Authors: Youngeun Kwon, Minsoo Rhu

    Abstract: Personalized recommendation models (RecSys) are one of the most popular machine learning workload serviced by hyperscalers. A critical challenge of training RecSys is its high memory capacity requirements, reaching hundreds of GBs to TBs of model size. In RecSys, the so-called embedding layers account for the majority of memory usage so current systems employ a hybrid CPU-GPU design to have the la…

    Submitted 10 May, 2022; originally announced May 2022.

    Comments: Accepted for publication at the 49th IEEE/ACM International Symposium on Computer Architecture (ISCA-49), 2022

  11. ARK: Fully Homomorphic Encryption Accelerator with Runtime Data Generation and Inter-Operation Key Reuse

    Authors: Jongmin Kim, Gwangho Lee, Sangpyo Kim, Gina Sohn, John Kim, Minsoo Rhu, Jung Ho Ahn

    Abstract: Homomorphic Encryption (HE) is one of the most promising post-quantum cryptographic schemes that enable privacy-preserving computation on servers. However, noise accumulates as we perform operations on HE-encrypted data, restricting the number of possible operations. Fully HE (FHE) removes this restriction by introducing the bootstrapping operation, which refreshes the data; however, FHE schemes a…

    Submitted 29 October, 2022; v1 submitted 2 May, 2022; originally announced May 2022.

    Comments: 18 pages, 9 figures

  12. arXiv:2203.00158

    cs.AR cs.AI cs.LG

    GROW: A Row-Stationary Sparse-Dense GEMM Accelerator for Memory-Efficient Graph Convolutional Neural Networks

    Authors: Ranggi Hwang, Minhoo Kang, Jiwon Lee, Dongyun Kam, Youngjoo Lee, Minsoo Rhu

    Abstract: Graph convolutional neural networks (GCNs) have emerged as a key technology in various application domains where the input data is relational. A unique property of GCNs is that its two primary execution stages, aggregation and combination, exhibit drastically different dataflows. Consequently, prior GCN accelerators tackle this research space by casting the aggregation and combination stages as a…

    Submitted 30 November, 2022; v1 submitted 28 February, 2022; originally announced March 2022.

    Comments: Accepted for publication at the 29th IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023

  13. arXiv:2202.13481

    cs.DC cs.AI cs.AR cs.LG

    PARIS and ELSA: An Elastic Scheduling Algorithm for Reconfigurable Multi-GPU Inference Servers

    Authors: Yunseong Kim, Yujeong Choi, Minsoo Rhu

    Abstract: In cloud machine learning (ML) inference systems, providing low latency to end-users is of utmost importance. However, maximizing server utilization and system throughput is also crucial for ML service providers as it helps lower the total-cost-of-ownership. GPUs have oftentimes been criticized for ML inference usages as its massive compute and memory throughput is hard to be fully utilized under…

    Submitted 27 February, 2022; originally announced February 2022.

    Comments: This is an extended version of our work, which is accepted for publication at the 59th ACM/ESDA/IEEE Design Automation Conference (DAC), 2022

  14. BTS: An Accelerator for Bootstrappable Fully Homomorphic Encryption

    Authors: Sangpyo Kim, Jongmin Kim, Michael Jaemin Kim, Wonkyung Jung, Minsoo Rhu, John Kim, Jung Ho Ahn

    Abstract: Homomorphic encryption (HE) enables the secure offloading of computations to the cloud by providing computation on encrypted data (ciphertexts). HE is based on noisy encryption schemes in which noise accumulates as more computations are applied to the data. The limited number of operations applicable to the data prevents practical applications from exploiting HE. Bootstrapping enables an unlimited…

    Submitted 28 April, 2022; v1 submitted 31 December, 2021; originally announced December 2021.

    Comments: 15 pages, 10 figures

  15. arXiv:2010.13103

    cs.DC cs.AR cs.LG cs.NE

    LazyBatching: An SLA-aware Batching System for Cloud Machine Learning Inference

    Authors: Yujeong Choi, Yunseong Kim, Minsoo Rhu

    Abstract: In cloud ML inference systems, batching is an essential technique to increase throughput which helps optimize total-cost-of-ownership. Prior graph batching combines the individual DNN graphs into a single one, allowing multiple inputs to be concurrently executed in parallel. We observe that the coarse-grained graph batching becomes suboptimal in effectively handling the dynamic inference request t…

    Submitted 25 October, 2020; originally announced October 2020.

  16. arXiv:2010.13100

    cs.AR cs.DC cs.IR cs.LG cs.NE

    Tensor Casting: Co-Designing Algorithm-Architecture for Personalized Recommendation Training

    Authors: Youngeun Kwon, Yunjae Lee, Minsoo Rhu

    Abstract: Personalized recommendations are one of the most widely deployed machine learning (ML) workload serviced from cloud datacenters. As such, architectural solutions for high-performance recommendation inference have recently been the target of several prior literatures. Unfortunately, little have been explored and understood regarding the training side of this emerging ML workload. In this paper, we…

    Submitted 25 October, 2020; originally announced October 2020.

  17. arXiv:2005.05968

    cs.DC cs.IR cs.LG

    Centaur: A Chiplet-based, Hybrid Sparse-Dense Accelerator for Personalized Recommendations

    Authors: Ranggi Hwang, Taehun Kim, Youngeun Kwon, Minsoo Rhu

    Abstract: Personalized recommendations are the backbone machine learning (ML) algorithm that powers several important application domains (e.g., ads, e-commerce, etc) serviced from cloud datacenters. Sparse embedding layers are a crucial building block in designing recommendations yet little attention has been paid in properly accelerating this important ML algorithm. This paper first provides a detailed wo…

    Submitted 12 May, 2020; originally announced May 2020.

    Comments: Accepted for publication at the 47th IEEE/ACM International Symposium on Computer Architecture (ISCA-47), 2020

  18. arXiv:1911.06859

    cs.AR cs.DC cs.LG cs.NE

    NeuMMU: Architectural Support for Efficient Address Translations in Neural Processing Units

    Authors: Bongjoon Hyun, Youngeun Kwon, Yujeong Choi, John Kim, Minsoo Rhu

    Abstract: To satisfy the compute and memory demands of deep neural networks, neural processing units (NPUs) are widely being utilized for accelerating deep learning algorithms. Similar to how GPUs have evolved from a slave device into a mainstream processor architecture, it is likely that NPUs will become first class citizens in this fast-evolving heterogeneous architecture space. This paper makes a case fo…

    Submitted 15 November, 2019; originally announced November 2019.

  19. arXiv:1909.04548

    cs.DC cs.LG cs.NE

    PREMA: A Predictive Multi-task Scheduling Algorithm For Preemptible Neural Processing Units

    Authors: Yujeong Choi, Minsoo Rhu

    Abstract: To amortize cost, cloud vendors providing DNN acceleration as a service to end-users employ consolidation and virtualization to share the underlying resources among multiple DNN service requests. This paper makes a case for a "preemptible" neural processing unit (NPU) and a "predictive" multi-task scheduler to meet the latency demands of high-priority inference while maintaining high throughput. W…

    Submitted 6 September, 2019; originally announced September 2019.

  20. arXiv:1908.03072

    cs.LG cs.AR cs.DC cs.NE

    TensorDIMM: A Practical Near-Memory Processing Architecture for Embeddings and Tensor Operations in Deep Learning

    Authors: Youngeun Kwon, Yunjae Lee, Minsoo Rhu

    Abstract: Recent studies from several hyperscalars pinpoint to embedding layers as the most memory-intensive deep learning (DL) algorithm being deployed in today's datacenters. This paper addresses the memory capacity and bandwidth challenges of embedding layers and the associated tensor operations. We present our vertically integrated hardware/software co-design, which includes a custom DIMM module enhance…

    Submitted 25 August, 2019; v1 submitted 8 August, 2019; originally announced August 2019.

    Comments: Accepted for publication at the 52nd IEEE/ACM International Symposium on Microarchitecture (MICRO-52), 2019

  21. arXiv:1902.06468

    cs.DC cs.AR cs.LG cs.NE

    Beyond the Memory Wall: A Case for Memory-centric HPC System for Deep Learning

    Authors: Youngeun Kwon, Minsoo Rhu

    Abstract: As the models and the datasets to train deep learning (DL) models scale, system architects are faced with new challenges, one of which is the memory capacity bottleneck, where the limited physical memory inside the accelerator device constrains the algorithm that can be studied. We propose a memory-centric deep learning system that can transparently expand the memory capacity available to the acce…

    Submitted 18 February, 2019; originally announced February 2019.

    Comments: Published as a conference paper at the 51st IEEE/ACM International Symposium on Microarchitecture (MICRO-51), 2018

  22. arXiv:1806.00512

    cs.LG cs.CL stat.ML

    Structurally Sparsified Backward Propagation for Faster Long Short-Term Memory Training

    Authors: Maohua Zhu, Jason Clemons, Jeff Pool, Minsoo Rhu, Stephen W. Keckler, Yuan Xie

    Abstract: Exploiting sparsity enables hardware systems to run neural networks faster and more energy-efficiently. However, most prior sparsity-centric optimization techniques only accelerate the forward pass of neural networks and usually require an even longer training process with iterative pruning and retraining. We observe that artificially inducing sparsity in the gradients of the gates in an LSTM cell…

    Submitted 1 June, 2018; originally announced June 2018.

  23. arXiv:1708.04485

    cs.NE cs.AR cs.LG

    SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks

    Authors: Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W. Keckler, William J. Dally

    Abstract: Convolutional Neural Networks (CNNs) have emerged as a fundamental technology for machine learning. High performance and extreme energy efficiency are critical for deployments of CNNs in a wide range of situations, especially mobile platforms such as autonomous vehicles, cameras, and electronic personal assistants. This paper introduces the Sparse CNN (SCNN) accelerator architecture, which improve…

    Submitted 23 May, 2017; originally announced August 2017.

  24. arXiv:1705.01626

    cs.LG cs.AR

    Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks

    Authors: Minsoo Rhu, Mike O'Connor, Niladrish Chatterjee, Jeff Pool, Stephen W. Keckler

    Abstract: Popular deep learning frameworks require users to fine-tune their memory usage so that the training data of a deep neural network (DNN) fits within the GPU physical memory. Prior work tries to address this restriction by virtualizing the memory usage of DNNs, enabling both CPU and GPU memory to be utilized for memory allocations. Despite its merits, virtualizing memory can incur significant perfor…

    Submitted 3 May, 2017; originally announced May 2017.

  25. arXiv:1602.08124

    cs.DC cs.LG cs.NE

    vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design

    Authors: Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, Stephen W. Keckler

    Abstract: The most widely used machine learning frameworks require users to carefully tune their memory usage so that the deep neural network (DNN) fits into the DRAM capacity of a GPU. This restriction hampers a researcher's flexibility to study different machine learning algorithms, forcing them to either use a less desirable network architecture or parallelize the processing across multiple GPUs. We prop…

    Submitted 28 July, 2016; v1 submitted 25 February, 2016; originally announced February 2016.

    Comments: Published as a conference paper at the 49th IEEE/ACM International Symposium on Microarchitecture (MICRO-49), 2016