A Fast Control Plane for a Large-Scale and High-Speed Optical Circuit Switch System

Authors: Ryousei Takano, Kiyo Ishii, Toshiyuki Shimizu, Fumihiro Okazaki, Shu Namiki, Ken-ichi Sato

Abstract: We experimentally verify a fast control plane with 100 microseconds of configuration time that can support more than 1000 racks, leveraged by a software-defined network controller and an industrial real-time Ethernet standard EtherCAT.

Submitted 17 January, 2024; originally announced January 2024.

Comments: 5 pages, 4 figures

arXiv:2309.06565 [pdf, other]

METICULOUS: An FPGA-based Main Memory Emulator for System Software Studies

Authors: Takahiro Hirofuchi, Takaaki Fukai, Akram Ben Ahmed, Ryousei Takano, Kento Sato

Abstract: Due to the scaling problem of the DRAM technology, non-volatile memory devices, which are based on a different principle of operation than DRAM, are now being intensively developed to expand the main memory of computers. Disaggregated memory is also drawing attention as an emerging technology to scale up the main memory. Although system software studies need to discuss management mechanisms for the new main memory designs incorporating such emerging memory systems, there are no feasible memory emulation mechanisms that efficiently work for large-scale, privileged programs such as operating systems and hypervisors. In this paper, we propose an FPGA-based main memory emulator for system software studies on new main memory systems. It can emulate the main memory incorporating multiple memory regions with different performance characteristics. For the address region of each memory device, it emulates the latencies, bandwidths and bit-flip error rates of read/write operations, respectively. The emulator is implemented at the hardware module of an off-the-shelf FPGA System-on-Chip board. Any privileged/unprivileged software programs running on its powerful 64-bit CPU cores can access emulated main memory devices at a practical speed through exactly the same interface as normal DRAM main memory.
We confirmed that the emulator transparently worked for CPU cores and successfully changed the performance of a memory region according to given emulation parameters; for example, the latencies measured by CPU cores were exactly proportional to the latencies inserted by the emulator, involving the minimum overhead of approximately 240 ns. As a preliminary use case, we confirmed that the emulator allows us to change the bandwidth limit and the inserted latency individually for unmodified software programs, making discussions on latency sensitivity much easier.

Submitted 7 September, 2023; originally announced September 2023.

arXiv:2105.12301 [pdf, other]

doi 10.1145/3437359.3465571

kEDM: A Performance-portable Implementation of Empirical Dynamic Modeling using Kokkos

Authors: Keichi Takahashi, Wassapon Watanakeesuntorn, Kohei Ichikawa, Joseph Park, Ryousei Takano, Jason Haga, George Sugihara, Gerald M. Pao

Abstract: Empirical Dynamic Modeling (EDM) is a state-of-the-art non-linear time-series analysis framework. Despite its wide applicability, EDM was not scalable to large datasets due to its expensive computational cost. To overcome this obstacle, researchers have attempted and succeeded in accelerating EDM from both algorithmic and implementational aspects. In previous work, we developed a massively parallel implementation of EDM targeting HPC systems (mpEDM). However, mpEDM maintains different backends for different architectures. This design becomes a burden in the increasingly diversifying HPC systems, when porting to new hardware. In this paper, we design and develop a performance-portable implementation of EDM based on the Kokkos performance portability framework (kEDM), which runs on both CPUs and GPUs while based on a single codebase. Furthermore, we optimize individual kernels specifically for EDM computation, and use real-world datasets to demonstrate up to $5.5\times$ speedup compared to mpEDM in convergent cross mapping computation.

Submitted 25 May, 2021; originally announced May 2021.

Comments: 8 pages, 9 figures, accepted at Practice & Experience in Advanced Research Computing (PEARC'21), corresponding authors: Keichi Takahashi, Gerald M. Pao

arXiv:2104.09075 [pdf, ps, other]

doi 10.1145/3431379.3460644

An Oracle for Guiding Large-Scale Model/Hybrid Parallel Training of Convolutional Neural Networks

Authors: Albert Njoroge Kahira, Truong Thao Nguyen, Leonardo Bautista Gomez, Ryousei Takano, Rosa M Badia, Mohamed Wahib

Abstract: Deep Neural Network (DNN) frameworks use distributed training to enable faster time to convergence and alleviate memory capacity limitations when training large models and/or using high dimension inputs. With the steady increase in datasets and model sizes, model/hybrid parallelism is deemed to have an important role in the future of distributed training of DNNs. We analyze the compute, communication, and memory requirements of Convolutional Neural Networks (CNNs) to understand the trade-offs between different parallelism approaches on performance and scalability. We leverage our model-driven analysis to be the basis for an oracle utility which can help in detecting the limitations and bottlenecks of different parallelism approaches at scale. We evaluate the oracle on six parallelization strategies, with four CNN models and multiple datasets (2D and 3D), on up to 1024 GPUs. The results demonstrate that the oracle has an average accuracy of about 86.74% when compared to empirical results, and as high as 97.57% for data parallelism.

Submitted 19 April, 2021; originally announced April 2021.

Comments: The International ACM Symposium on High-Performance Parallel and Distributed Computing 2021 (HPDC'21)

arXiv:2011.11082 [pdf, other]

Massively Parallel Causal Inference of Whole Brain Dynamics at Single Neuron Resolution

Authors: Wassapon Watanakeesuntorn, Keichi Takahashi, Kohei Ichikawa, Joseph Park, George Sugihara, Ryousei Takano, Jason Haga, Gerald M. Pao

Abstract: Empirical Dynamic Modeling (EDM) is a nonlinear time series causal inference framework. The latest implementation of EDM, cppEDM, has only been used for small datasets due to computational cost. With the growth of data collection capabilities, there is a great need to identify causal relationships in large datasets. We present mpEDM, a parallel distributed implementation of EDM optimized for modern GPU-centric supercomputers. We improve the original algorithm to reduce redundant computation and optimize the implementation to fully utilize hardware resources such as GPUs and SIMD units. As a use case, we run mpEDM on AI Bridging Cloud Infrastructure (ABCI) using datasets of an entire animal brain sampled at single neuron resolution to identify dynamical causation patterns across the brain. mpEDM is 1,530x faster than cppEDM, and a dataset containing 101,729 neurons was analyzed in 199 seconds on 512 nodes. This is the largest EDM causal inference achieved to date.

Submitted 22 November, 2020; originally announced November 2020.

Comments: 10 pages, 10 figures, accepted at IEEE International Conference on Parallel and Distributed Systems (ICPADS) 2020, corresponding authors: Keichi Takahashi, Gerald M Pao

ACM Class: K.6.3; G.4; J.3

arXiv:2010.13594 [pdf, other]

Disaggregated Accelerator Management System for Cloud Data Centers

Authors: Ryousei Takano, Kuniyasu Suzaki

Abstract: A conventional data center that consists of monolithic servers is confronted with limitations including lack of operational flexibility, low resource utilization, low maintainability, etc. Resource disaggregation is a promising solution to address the above issues. We propose a concept of disaggregated cloud data center architecture called Flow-in-Cloud (FiC) that enables an existing cluster computer system to expand an accelerator pool through a high-speed network. FlowOS-RM manages the entire pool resources, and deploys a user job on a dynamically constructed slice according to a user request. This slice consists of compute nodes and accelerators where each accelerator is attached to the corresponding compute node. This paper demonstrates the feasibility of FiC in a proof of concept experiment running a distributed deep learning application on the prototype system. The result successfully warrants the applicability of the proposed system.

Submitted 26 October, 2020; originally announced October 2020.

Comments: To appear in IEICE Transactions on Information and Systems, 2020

arXiv:2008.11421 [pdf, other]

Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA

Authors: Mohamed Wahib, Haoyu Zhang, Truong Thao Nguyen, Aleksandr Drozd, Jens Domke, Lingqi Zhang, Ryousei Takano, Satoshi Matsuoka

Abstract: The dedicated memory of hardware accelerators can be insufficient to store all weights and/or intermediate states of large deep learning models. Although model parallelism is a viable approach to reduce the memory pressure issue, significant modification of the source code and considerations for algorithms are required. An alternative solution is to use out-of-core methods instead of, or in addition to, data parallelism. We propose a performance model based on the concurrency analysis of out-of-core training behavior, and derive a strategy that combines layer swapping and redundant recomputing. We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods. We also introduce the first method to solve the challenging problem of out-of-core multi-node training by carefully pipelining gradient exchanges and performing the parameter updates on the host. Our data parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turing-NLG.

Submitted 26 August, 2020; originally announced August 2020.

Comments: ACM/IEEE Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'20)

arXiv:2002.06018 [pdf, other]

doi 10.1587/transinf.2019EDL8141

A Prompt Report on the Performance of Intel Optane DC Persistent Memory Module

Authors: Takahiro Hirofuchi, Ryousei Takano

Abstract: In this prompt report, we present the basic performance evaluation of Intel Optane Data Center Persistent Memory Module (Optane DCPMM), which is the first commercially available, byte-addressable non-volatile memory module, released in April 2019. Since at the moment of writing only a few reports on its performance were published, this letter is intended to complement other performance studies. Through experiments using our own measurement tools, we found that the latency of random read-only access was approximately 374 ns. That of random writeback-involving access was 391 ns. The bandwidths of read-only and writeback-involving access for interleaved memory modules were approximately 38 GB/s and 3 GB/s, respectively.

Submitted 13 February, 2020; originally announced February 2020.

Comments: To appear in IEICE Transactions on Information and Systems, 2020. arXiv admin note: substantial text overlap with arXiv:1907.12014

arXiv:1909.02724 [pdf, other]

doi 10.1145/3295500.3356163

iFDK: A Scalable Framework for Instant High-resolution Image Reconstruction

Authors: Peng Chen, Mohamed Wahib, Shinichiro Takizawa, Ryousei Takano, Satoshi Matsuoka

Abstract: Computed Tomography (CT) is a widely used technology that requires compute-intense algorithms for image reconstruction. We propose a novel back-projection algorithm that reduces the projection computation cost to 1/6 of the standard algorithm. We also propose an efficient implementation that takes advantage of the heterogeneity of GPU-accelerated systems by overlapping the filtering and back-projection stages on CPUs and GPUs, respectively. Finally, we propose a distributed framework for high-resolution image reconstruction on state-of-the-art GPU-accelerated supercomputers. The framework relies on an elaborate interleave of MPI collective communication steps to achieve scalable communication. Evaluation on a single Tesla V100 GPU demonstrates that our back-projection kernel performs up to 1.6x faster than the standard FDK implementation. We also demonstrate the scalability and instantaneous CT capability of the distributed framework by using up to 2,048 V100 GPUs to solve 4K and 8K problems within 30 seconds and 2 minutes, respectively (including I/O).

Submitted 6 September, 2019; originally announced September 2019.

Comments: ACM/IEEE Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'19)

arXiv:1908.02135 [pdf, other]

doi 10.1587/transinf.2019PAP0018

A Software-based NVM Emulator Supporting Read/Write Asymmetric Latencies

Authors: Atsushi Koshiba, Takahiro Hirofuchi, Ryousei Takano, Mitaro Namiki

Abstract: Non-volatile memory (NVM) is a promising technology for low-energy and high-capacity main memory of computers. The characteristics of NVM devices, however, tend to be fundamentally different from those of DRAM (i.e., the memory device currently used for main memory), because of differences in principles of memory cells. Typically, the write latency of an NVM device such as PCM and ReRAM is much higher than its read latency. The asymmetry in read/write latencies likely affects the performance of applications significantly. For analyzing behavior of applications running on NVM-based main memory, most researchers use software-based emulation tools due to the limited number of commercial NVM products. However, these existing emulation tools are too slow to emulate a large-scale, realistic workload or too simplistic to investigate the details of application behavior on NVM with asymmetric read/write latencies. This paper therefore proposes a new NVM emulation mechanism that is not only light-weight but also aware of a read/write latency gap in NVM-based main memory. We implemented the prototype of the proposed mechanism for the Intel CPU processors of the Haswell architecture. We also evaluated its accuracy and performed case studies for practical benchmarks. The results showed that our prototype accurately emulated write-latencies of NVM-based main memory: it emulated the NVM write latencies in a range from 200 ns to 1000 ns with negligible errors from 0.2% to 1.1%.
We confirmed that the use of our emulator enabled us to successfully estimate performance of practical workloads for NVM-based main memory, while an existing light-weight emulation model misestimated.

Submitted 2 August, 2019; originally announced August 2019.

Comments: To appear in IEICE Transactions on Information and Systems, December, 2019

arXiv:1907.12014 [pdf, other]

The Preliminary Evaluation of a Hypervisor-based Virtualization Mechanism for Intel Optane DC Persistent Memory Module

Authors: Takahiro Hirofuchi, Ryousei Takano

Abstract: Non-volatile memory (NVM) technologies, being accessible in the same manner as DRAM, are considered indispensable for expanding main memory capacities. Intel Optane DCPMM is a long-awaited product that drastically increases main memory capacities. However, a substantial performance gap exists between DRAM and DCPMM. In our experiments, the read/write latencies of DCPMM were 400% and 407% higher than those of DRAM, respectively. The read/write bandwidths were 37% and 8% of those of DRAM. This performance gap in main memory presents a new challenge to researchers; we need a new system software technology supporting emerging hybrid memory architecture. In this paper, we present RAMinate, a hypervisor-based virtualization mechanism for hybrid memory systems, and a key technology to address the performance gap in main memory systems. It provides great flexibility in memory management and maximizes the performance of virtual machines (VMs) by dynamically optimizing memory mappings. Through experiments, we confirmed that even though a VM has only 1% of DRAM in its RAM, the performance degradation of the VM was drastically alleviated by memory mapping optimization. The elapsed time to finish the build of Linux Kernel in the VM was 557 seconds, which was only a 13% increase from the 100% DRAM case (i.e., 495 seconds). When the optimization mechanism was disabled, the elapsed time increased to 624 seconds (i.e., a 26% increase from the 100% DRAM case).

Submitted 28 July, 2019; originally announced July 2019.

ACM Class: D.4; B.3

arXiv:1907.06154 [pdf, other]

doi 10.1145/3295500.3356162

A Versatile Software Systolic Execution Model for GPU Memory-Bound Kernels

Authors: Peng Chen, Mohamed Wahib, Shinichiro Takizawa, Ryousei Takano, Satoshi Matsuoka

Abstract: This paper proposes a versatile high-performance execution model, inspired by systolic arrays, for memory-bound regular kernels running on CUDA-enabled GPUs. We formulate a systolic model that shifts partial sums by CUDA warp primitives for the computation. We also employ register files as a cache resource in order to operate the entire model efficiently. We demonstrate the effectiveness and versatility of the proposed model for a wide variety of stencil kernels that appear commonly in HPC, and also convolution kernels (increasingly important in deep learning workloads). Our algorithm outperforms the top reported state-of-the-art stencil implementations, including implementations with sophisticated temporal and spatial blocking techniques, on the two latest Nvidia architectures: Tesla V100 and P100. For 2D convolution of general filter sizes and shapes, our algorithm is on average 2.5x faster than Nvidia's NPP on V100 and P100 GPUs.

Submitted 6 September, 2019; v1 submitted 13 July, 2019; originally announced July 2019.

Comments: ACM/IEEE Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'19)

arXiv:1902.01514 [pdf]

Perturbative GAN: GAN with Perturbation Layers

Authors: Yuma Kishi, Tsutomu Ikegami, Shin-ichi O'uchi, Ryousei Takano, Wakana Nogami, Tomohiro Kudoh

Abstract: Perturbative GAN, which replaces convolution layers of existing convolutional GANs (DCGAN, WGAN-GP, BIGGAN, etc.) with perturbation layers that add a fixed noise mask, is proposed. Compared with the convolutional GANs, the number of parameters to be trained is smaller, the convergence of training is faster, the inception score of generated images is higher, and the overall training cost is reduced. Algorithmic generation of the noise masks is also proposed, with which the training, as well as the generation, can be boosted with hardware acceleration. Perturbative GAN is evaluated using conventional datasets (CIFAR10, LSUN, ImageNet), both in the cases when a perturbation layer is adopted only for Generators and when it is introduced to both Generator and Discriminator.

Submitted 4 February, 2019; originally announced February 2019.

arXiv:1509.06991 [pdf, other]

Feasibility Evaluation of 6LoWPAN over Bluetooth Low Energy

Authors: Varat Chawathaworncharoen, Vasaka Visoottiviseth, Ryousei Takano

Abstract: IPv6 over Low power Wireless Personal Area Network (6LoWPAN) is an emerging technology to enable ubiquitous IoT services. However, there are very few studies of the performance evaluation on real hardware environments. This paper demonstrates the feasibility of 6LoWPAN through conducting a preliminary performance evaluation of a commodity hardware environment, including Bluetooth Low Energy (BLE) network, Raspberry Pi, and a laptop PC. Our experimental results show that the power consumption of 6LoWPAN over BLE is one-tenth lower than that of IP over WiFi; the performance significantly depends on the distance between devices and the message size; and the communication completely stops when bursty traffic transfers. This observation provides our optimistic conclusions on the feasibility of 6LoWPAN although the maturity of implementations is a remaining issue.

Submitted 23 September, 2015; originally announced September 2015.

Comments: 4 pages, PRAGMA Workshop on International Clouds for Data Science (PRAGMA-ICDS 2015)

Showing 1–14 of 14 results for author: Takano, R