Showing 1–8 of 8 results for author: Mowry, T C

  1. arXiv:2311.02103  [pdf, other]

    cs.LG cs.AI cs.PL

    Relax: Composable Abstractions for End-to-End Dynamic Machine Learning

    Authors: Ruihang Lai, Junru Shao, Siyuan Feng, Steven S. Lyubomirsky, Bohan Hou, Wuwei Lin, Zihao Ye, Hongyi Jin, Yuchen Jin, Jiawei Liu, Lesheng Jin, Yaxing Cai, Ziheng Jiang, Yong Wu, Sunghyun Park, Prakalp Srivastava, Jared G. Roesch, Todd C. Mowry, Tianqi Chen

    Abstract: Dynamic shape computations have become critical in modern machine learning workloads, especially in emerging large language models. The success of these models has driven demand for deploying them to a diverse set of backend environments. In this paper, we present Relax, a compiler abstraction for optimizing end-to-end dynamic machine learning workloads. Relax introduces first-class symbolic shape…

    Submitted 1 November, 2023; originally announced November 2023.

  2. arXiv:2305.10611  [pdf, other]

    cs.LG

    ACRoBat: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time

    Authors: Pratik Fegade, Tianqi Chen, Phillip B. Gibbons, Todd C. Mowry

    Abstract: Dynamic control flow is an important technique often used to design expressive and efficient deep learning computations for applications such as text parsing, machine translation, exiting early out of deep models and so on. The control flow divergence resulting from dynamic control flow makes batching, an important optimization enabling high throughput and hardware utilization, difficult to perfor…

    Submitted 16 May, 2024; v1 submitted 17 May, 2023; originally announced May 2023.

  3. arXiv:2302.03851  [pdf, other]

    cs.LG cs.SE

    ED-Batch: Efficient Automatic Batching of Dynamic Neural Networks via Learned Finite State Machines

    Authors: Siyuan Chen, Pratik Fegade, Tianqi Chen, Phillip B. Gibbons, Todd C. Mowry

    Abstract: Batching has a fundamental influence on the efficiency of deep neural network (DNN) execution. However, for dynamic DNNs, efficient batching is particularly challenging as the dataflow graph varies per input instance. As a result, state-of-the-art frameworks use heuristics that result in suboptimal batching decisions. Further, batching puts strict restrictions on memory adjacency and can lead to h…

    Submitted 7 February, 2023; originally announced February 2023.

  4. arXiv:2110.10221  [pdf, other]

    cs.LG

    The CoRa Tensor Compiler: Compilation for Ragged Tensors with Minimal Padding

    Authors: Pratik Fegade, Tianqi Chen, Phillip B. Gibbons, Todd C. Mowry

    Abstract: There is often variation in the shape and size of input data used for deep learning. In many cases, such data can be represented using tensors with non-uniform shapes, or ragged tensors. Due to limited and non-portable support for efficient execution on ragged tensors, current deep learning frameworks generally use techniques such as padding and masking to make the data shapes uniform and then off…

    Submitted 21 March, 2022; v1 submitted 19 October, 2021; originally announced October 2021.

    Comments: 23 pages, 25 figures and 10 tables

  5. arXiv:2011.01383  [pdf, other]

    cs.LG cs.DC

    Cortex: A Compiler for Recursive Deep Learning Models

    Authors: Pratik Fegade, Tianqi Chen, Phillip B. Gibbons, Todd C. Mowry

    Abstract: Optimizing deep learning models is generally performed in two steps: (i) high-level graph optimizations such as kernel fusion and (ii) low level kernel optimizations such as those found in vendor libraries. This approach often leaves significant performance on the table, especially for the case of recursive deep learning models. In this paper, we present Cortex, a compiler-based approach to genera…

    Submitted 5 March, 2021; v1 submitted 2 November, 2020; originally announced November 2020.

    Comments: 11 pages, 12 figures and 6 tables

    MSC Class: 68N20; ACM Class: D.3.4

  6. arXiv:1805.03502  [pdf, other]

    cs.AR

    RowClone: Accelerating Data Movement and Initialization Using DRAM

    Authors: Vivek Seshadri, Yoongu Kim, Chris Fallin, Donghyuk Lee, Rachata Ausavarungnirun, Gennady Pekhimenko, Yixin Luo, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, Todd C. Mowry

    Abstract: In existing systems, to perform any bulk data movement operation (copy or initialization), the data has to first be read into the on-chip processor, all the way into the L1 cache, and the result of the operation must be written back to main memory. This is despite the fact that these operations do not involve any actual computation. RowClone exploits the organization and operation of commodity DRA…

    Submitted 7 May, 2018; originally announced May 2018.

    Comments: arXiv admin note: text overlap with arXiv:1605.06483

  7. arXiv:1611.09988  [pdf, other]

    cs.AR

    Buddy-RAM: Improving the Performance and Efficiency of Bulk Bitwise Operations Using DRAM

    Authors: Vivek Seshadri, Donghyuk Lee, Thomas Mullins, Hasan Hassan, Amirali Boroumand, Jeremie Kim, Michael A. Kozuch, Onur Mutlu, Phillip B. Gibbons, Todd C. Mowry

    Abstract: Bitwise operations are an important component of modern day programming. Many widely-used data structures (e.g., bitmap indices in databases) rely on fast bitwise operations on large bit vectors to achieve high performance. Unfortunately, in existing systems, regardless of the underlying architecture (e.g., CPU, GPU, FPGA), the throughput of such bulk bitwise operations is limited by the available…

    Submitted 29 November, 2016; originally announced November 2016.

    Comments: arXiv admin note: text overlap with arXiv:1605.06483

  8. arXiv:1602.01348  [pdf, other]

    cs.AR

    A Framework for Accelerating Bottlenecks in GPU Execution with Assist Warps

    Authors: Nandita Vijaykumar, Gennady Pekhimenko, Adwait Jog, Saugata Ghose, Abhishek Bhowmick, Rachata Ausavarungnirun, Chita Das, Mahmut Kandemir, Todd C. Mowry, Onur Mutlu

    Abstract: Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in utilization of resources in the cores. For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often…

    Submitted 3 February, 2016; originally announced February 2016.