Showing 1–2 of 2 results for author: Shim, D E

Search v0.5.6 released 2020-02-24

arXiv:2003.06464 [pdf, other]

eess.SP cs.LG

LCP: A Low-Communication Parallelization Method for Fast Neural Network Inference in Image Recognition

Authors: Ramyad Hadidi, Bahar Asgari, Jiashen Cao, Younmin Bae, Da Eun Shim, Hyojong Kim, Sung-Kyu Lim, Michael S. Ryoo, Hyesoon Kim

Abstract: Deep neural networks (DNNs) have inspired new studies in myriad edge applications with robots, autonomous agents, and Internet-of-things (IoT) devices. However, performing inference of DNNs in the edge is still a severe challenge, mainly because of the contradiction between the intensive resource requirements of DNNs and the tight resource availability in several edge domains. Further, as communic… ▽ More Deep neural networks (DNNs) have inspired new studies in myriad edge applications with robots, autonomous agents, and Internet-of-things (IoT) devices. However, performing inference of DNNs in the edge is still a severe challenge, mainly because of the contradiction between the intensive resource requirements of DNNs and the tight resource availability in several edge domains. Further, as communication is costly, taking advantage of other available edge devices by using data- or model-parallelism methods is not an effective solution. To benefit from available compute resources with low communication overhead, we propose the first DNN parallelization method for reducing the communication overhead in a distributed system. We propose a low-communication parallelization (LCP) method in which models consist of several almost-independent and narrow branches. LCP offers close-to-minimum communication overhead with better distribution and parallelization opportunities while significantly reducing memory footprint and computation compared to data- and model-parallelism methods. We deploy LCP models on three distributed systems: AWS instances, Raspberry Pis, and PYNQ boards. We also evaluate the performance of LCP models on a customized hardware (tailored for low latency) implemented on a small edge FPGA and as a 16mW 0.107mm2 ASIC @7nm chip. LCP models achieve a maximum and average speedups of 56x and 7x, compared to the originals, which could be improved by up to an average speedup of 33x by incorporating common optimizations such as pruning and quantization. △ Less

Submitted 17 November, 2020; v1 submitted 13 March, 2020; originally announced March 2020.
arXiv:2002.12151 [pdf, other]

cs.DC

Vortex: OpenCL Compatible RISC-V GPGPU

Authors: Fares Elsabbagh, Blaise Tine, Priyadarshini Roshan, Ethan Lyons, Euna Kim, Da Eun Shim, Lingjun Zhu, Sung Kyu Lim, Hyesoon kim

Abstract: The current challenges in technology scaling are pushing the semiconductor industry towards hardware specialization, creating a proliferation of heterogeneous systems-on-chip, delivering orders of magnitude performance and power benefits compared to traditional general-purpose architectures. This transition is getting a significant boost with the advent of RISC-V with its unique modular and extens… ▽ More The current challenges in technology scaling are pushing the semiconductor industry towards hardware specialization, creating a proliferation of heterogeneous systems-on-chip, delivering orders of magnitude performance and power benefits compared to traditional general-purpose architectures. This transition is getting a significant boost with the advent of RISC-V with its unique modular and extensible ISA, allowing a wide range of low-cost processor designs for various target applications. In addition, OpenCL is currently the most widely adopted programming framework for heterogeneous platforms available on mainstream CPUs, GPUs, as well as FPGAs and custom DSP. In this work, we present Vortex, a RISC-V General-Purpose GPU that supports OpenCL. Vortex implements a SIMT architecture with a minimal ISA extension to RISC-V that enables the execution of OpenCL programs. We also extended OpenCL runtime framework to use the new ISA. We evaluate this design using 15nm technology. We also show the performance and energy numbers of running them with a subset of benchmarks from the Rodinia Benchmark suite. △ Less

Submitted 27 February, 2020; originally announced February 2020.

Search v0.5.6 released 2020-02-24