-
The infrastructure powering IBM's Gen AI model development
Authors:
Talia Gershon,
Seetharami Seelam,
Brian Belgodere,
Milton Bonilla,
Lan Hoang,
Danny Barnett,
I-Hsin Chung,
Apoorve Mohan,
Ming-Hung Chen,
Lixiang Luo,
Robert Walkup,
Constantinos Evangelinos,
Shweta Salaria,
Marc Dombrowa,
Yoonho Park,
Apo Kayi,
Liran Schour,
Alim Alim,
Ali Sydney,
Pavlos Maniotis,
Laurent Schares,
Bernard Metzler,
Bengi Karacali-Akyamac,
Sophia Wen,
Tatsuhiro Chiba,
et al. (121 additional authors not shown)
Abstract:
AI Infrastructure plays a key role in the speed and cost-competitiveness of developing and deploying advanced AI models. The current demand for powerful AI infrastructure for model training is driven by the emergence of generative AI and foundational models, where on occasion thousands of GPUs must cooperate on a single training job for the model to be trained in a reasonable time. Delivering efficient and high-performing AI training requires an end-to-end solution that combines hardware, software and holistic telemetry to cater for multiple types of AI workloads. In this report, we describe IBM's hybrid cloud infrastructure that powers our generative AI model development. This infrastructure includes (1) Vela: an AI-optimized supercomputing capability directly integrated into the IBM Cloud, delivering scalable, dynamic, multi-tenant and geographically distributed infrastructure for large-scale model training and other AI workflow steps and (2) Blue Vela: a large-scale, purpose-built, on-premises hosting environment that is optimized to support our largest and most ambitious AI model training tasks. Vela provides IBM with the dual benefit of high performance for internal use along with the flexibility to adapt to an evolving commercial landscape. Blue Vela provides us with the benefits of rapid development of our largest and most ambitious models, as well as future-proofing against the evolving model landscape in the industry. Taken together, they provide IBM with the ability to rapidly innovate in the development of both AI models and commercial offerings.
Submitted 7 July, 2024;
originally announced July 2024.
-
Objcache: An Elastic Filesystem over External Persistent Storage for Container Clusters
Authors:
Takeshi Yoshimura,
Tatsuhiro Chiba,
Sunyanan Choochotkaew,
Seetharami Seelam,
Hui-fang Wen,
Jonas Pfefferle
Abstract:
Container virtualization enables emerging AI workloads, such as model serving, highly parallelized training, and machine learning pipelines, to be easily scaled on demand on elastic cloud infrastructure. In particular, AI workloads require persistent storage for data such as training inputs, models, and checkpoints. An external storage system like cloud object storage is a common choice because of its elasticity and scalability. To mitigate access latency to external storage, caching on a local filesystem is an essential technique. However, building local caches on scaling clusters must cope with explosive disk usage, redundant networking, and unexpected failures. We propose objcache, an elastic filesystem over external storage. Objcache introduces an internal transaction protocol over Raft logging to enable atomic updates of distributed persistent states with consistent hashing. The proposed transaction protocol can also manage inode dirtiness by maintaining the consistency between the local cache and external storage. Objcache supports scaling down to zero by automatically evicting dirty files to external storage. Our evaluation reports that objcache sped up model serving startup by 98.9% compared to direct copies via S3 interfaces. Scaling up with 1024 dirty files completed in 2 to 14 seconds.
Submitted 4 September, 2023;
originally announced September 2023.
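
The objcache abstract above mentions consistent hashing as the way distributed state is placed across a cluster that grows and shrinks. The following is a minimal, generic sketch of a consistent hash ring, not objcache's actual implementation: the class name `HashRing`, the `vnodes` parameter, the node names, and the choice of SHA-256 are all assumptions made for illustration. It shows the property that matters for elastic scaling: when a node joins, the only keys that change owner are the ones that move to the new node.

```python
import bisect
import hashlib
from typing import Dict, List


def _hash(value: str) -> int:
    """Map a string to a 64-bit point on the ring (SHA-256 is an arbitrary choice)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    """Hypothetical consistent hash ring with virtual nodes."""

    def __init__(self, vnodes: int = 64) -> None:
        self.vnodes = vnodes              # ring points per physical node
        self._points: List[int] = []      # sorted ring positions
        self._owner: Dict[int, str] = {}  # ring position -> node name

    def add_node(self, node: str) -> None:
        # Place several points per node so load spreads evenly.
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node: str) -> None:
        # Drop every ring point that belongs to the departing node.
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key: str) -> str:
        # The owner is the first ring point clockwise from the key's hash.
        if not self._points:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


ring = HashRing()
for name in ("node-a", "node-b", "node-c"):
    ring.add_node(name)
print(ring.lookup("checkpoints/step-1000.pt"))
```

In a real system the ring membership itself would have to be replicated consistently (the abstract names Raft logging for this); the sketch covers only key placement.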
-
ns-3 Implementation of Sub-Terahertz and Millimeter Wave Drop-based NYU Channel Model (NYUSIM)
Authors:
Hitesh Poddar,
Tomoki Yoshimura,
Matteo Pagin,
Theodore S Rappaport,
Art Ishii,
Michele Zorzi
Abstract:
The next generation of wireless networks will use sub-THz frequencies alongside mmWave frequencies to enable multi-Gbps and low-latency applications. To enable different verticals and use cases, engineers must take a holistic approach to build, analyze, and study different parts of the network and the interplay among the lower and higher layers of the protocol stack. It is of paramount importance to accurately characterize the radio propagation in diverse scenarios such as urban microcell (UMi), urban macrocell (UMa), rural macrocell (RMa), indoor hotspot (InH), and indoor factory (InF) for a wide range of frequencies. The 3GPP statistical channel model (SCM) is oversimplified and restricted to the frequency range of 0.5-100 GHz. Thus, to overcome these limitations, this paper presents a detailed implementation of the drop-based NYU channel model (NYUSIM) for the frequency range of 0.5-150 GHz for the UMi, UMa, RMa, InH, and InF scenarios. NYUSIM allows researchers to design and evaluate new algorithms and protocols for future sub-THz wireless networks in ns-3.
Submitted 2 May, 2023;
originally announced May 2023.
-
Full-Stack End-To-End mmWave Simulations Using 3GPP and NYUSIM Channel Model in ns-3
Authors:
H. Poddar,
T. Yoshimura,
M. Pagin,
T. S. Rappaport,
A. Ishii,
M. Zorzi
Abstract:
Accurate channel modeling and simulation tools are vital for studying sub-THz and millimeter wave (mmWave) wideband communication system performance. To accurately design future high data rate, low latency wireless modems, the entire protocol stack must be appropriately modeled to understand how the physical layer impacts the end-to-end performance experienced by the end user. This paper presents a full-stack end-to-end performance analysis in ns-3 using the drop-based NYU channel model (NYUSIM) and the 3GPP statistical channel model (SCM) in four scenarios, namely urban microcell (UMi), urban macrocell (UMa), rural macrocell (RMa), and indoor hotspot (InH), at 28 GHz with 100 MHz bandwidth. Video data is transmitted at 50 Mbps using User Datagram Protocol (UDP), and we observe that the RMa channel is benign in non-line of sight (NLOS) for NYUSIM and 3GPP SCM, as it exhibits no packet drops and yields maximum throughput (48.1 Mbps) and a latency of approximately 20 ms. In NLOS, for NYUSIM, the UMa and RMa channels are similar in terms of throughput and packet drops, and the latency in the UMi and InH scenarios is 10 times and 25 times higher, respectively, compared to UMa. Our results indicate that mmWave bands can support data rates of 50 Mbps with negligible packet drops and latency below 150 ms in all scenarios using NYUSIM.
Submitted 5 March, 2023; v1 submitted 23 February, 2023;
originally announced February 2023.
-
Embedding a Differentiable Mel-cepstral Synthesis Filter to a Neural Speech Synthesis System
Authors:
Takenori Yoshimura,
Shinji Takaki,
Kazuhiro Nakamura,
Keiichiro Oura,
Yukiya Hono,
Kei Hashimoto,
Yoshihiko Nankaku,
Keiichi Tokuda
Abstract:
This paper integrates a classic mel-cepstral synthesis filter into a modern neural speech synthesis system towards end-to-end controllable speech synthesis. Since the mel-cepstral synthesis filter is explicitly embedded in neural waveform models in the proposed system, both voice characteristics and the pitch of synthesized speech are highly controllable via a frequency warping parameter and fundamental frequency, respectively. We implement the mel-cepstral synthesis filter as a differentiable and GPU-friendly module to enable the acoustic and waveform models in the proposed system to be simultaneously optimized in an end-to-end manner. Experiments show that the proposed system improves speech quality over a baseline system while maintaining controllability. The core PyTorch modules used in the experiments will be publicly available on GitHub.
Submitted 21 November, 2022;
originally announced November 2022.
-
ESPnet2-TTS: Extending the Edge of TTS Research
Authors:
Tomoki Hayashi,
Ryuichi Yamamoto,
Takenori Yoshimura,
Peter Wu,
Jiatong Shi,
Takaaki Saeki,
Yooncheol Ju,
Yusuke Yasuda,
Shinnosuke Takamichi,
Shinji Watanabe
Abstract:
This paper describes ESPnet2-TTS, an end-to-end text-to-speech (E2E-TTS) toolkit. ESPnet2-TTS extends our earlier version, ESPnet-TTS, by adding many new features, including: on-the-fly flexible pre-processing, joint training with neural vocoders, and state-of-the-art TTS models with extensions like full-band E2E text-to-waveform modeling, which simplify the training pipeline and further enhance TTS performance. The unified design of our recipes enables users to quickly reproduce state-of-the-art E2E-TTS results. We also provide many pre-trained models in a unified Python interface for inference, offering a quick means for users to generate baseline samples and build demos. Experimental evaluations with English and Japanese corpora demonstrate that our provided models synthesize utterances comparable to ground-truth ones, achieving state-of-the-art TTS performance. The toolkit is available online at https://github.com/espnet/espnet.
Submitted 14 October, 2021;
originally announced October 2021.
-
End-to-End Automatic Speech Recognition Integrated With CTC-Based Voice Activity Detection
Authors:
Takenori Yoshimura,
Tomoki Hayashi,
Kazuya Takeda,
Shinji Watanabe
Abstract:
This paper integrates a voice activity detection (VAD) function with end-to-end automatic speech recognition toward an online speech interface and transcribing very long audio recordings. We focus on connectionist temporal classification (CTC) and its extension of CTC/attention architectures. As opposed to an attention-based architecture, input-synchronous label prediction can be performed based on a greedy search with the CTC (pre-)softmax output. This prediction includes consecutive long blank labels, which can be regarded as a non-speech region. We use the labels as a cue for detecting speech segments with simple thresholding. The threshold value is directly related to the length of a non-speech region, which is more intuitive and easier to control than conventional VAD hyperparameters. Experimental results on unsegmented data show that the proposed method outperformed the baseline methods using the conventional energy-based and neural-network-based VAD methods and achieved an RTF less than 0.2. The proposed method is publicly available.
Submitted 14 February, 2020; v1 submitted 2 February, 2020;
originally announced February 2020.
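
The CTC-based VAD abstract above describes treating long runs of consecutive blank labels in the greedy CTC output as non-speech, with a threshold tied directly to the length of the non-speech region. A minimal sketch of that thresholding idea follows; it is not the authors' released code. The function name `detect_speech_segments`, the blank index `0`, and the frame-level input format are assumptions made for illustration.

```python
from typing import List, Tuple

BLANK = 0  # assumed index of the CTC blank label


def detect_speech_segments(
    frame_labels: List[int], min_blank_run: int
) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges treated as speech.

    A run of at least `min_blank_run` consecutive blank labels is treated as
    a non-speech region and closes the current segment; shorter blank runs
    stay inside the segment.
    """
    segments: List[Tuple[int, int]] = []
    start = None      # first frame of the open segment, if any
    last_speech = -1  # most recent non-blank frame
    blank_run = 0     # length of the current run of blanks
    for i, label in enumerate(frame_labels):
        if label != BLANK:
            if start is None:
                start = i
            last_speech = i
            blank_run = 0
        else:
            blank_run += 1
            if start is not None and blank_run >= min_blank_run:
                # The blank run is long enough to count as non-speech.
                segments.append((start, last_speech + 1))
                start = None
    if start is not None:
        segments.append((start, last_speech + 1))
    return segments


# Greedy per-frame labels (argmax over the CTC output), hypothetical values.
labels = [0, 0, 3, 0, 4, 0, 0, 0, 0, 5, 5, 0]
print(detect_speech_segments(labels, min_blank_run=3))  # [(2, 5), (9, 11)]
```

Because `min_blank_run` is measured in frames, multiplying it by the frame shift gives the minimum pause duration directly, which is the sense in which the abstract calls the threshold intuitive.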
-
ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit
Authors:
Tomoki Hayashi,
Ryuichi Yamamoto,
Katsuki Inoue,
Takenori Yoshimura,
Shinji Watanabe,
Tomoki Toda,
Kazuya Takeda,
Yu Zhang,
Xu Tan
Abstract:
This paper introduces a new end-to-end text-to-speech (E2E-TTS) toolkit named ESPnet-TTS, which is an extension of the open-source speech processing toolkit ESPnet. The toolkit supports state-of-the-art E2E-TTS models, including Tacotron 2, Transformer TTS, and FastSpeech, and also provides recipes inspired by the Kaldi automatic speech recognition (ASR) toolkit. The recipes are based on a design unified with the ESPnet ASR recipe, providing high reproducibility. The toolkit also provides pre-trained models and samples for all of the recipes so that users can use them as baselines. Furthermore, the unified design enables the integration of ASR functions with TTS, e.g., ASR-based objective evaluation and semi-supervised learning with both ASR and TTS models. This paper describes the design of the toolkit and an experimental evaluation in comparison with other toolkits. The experimental results show that our models can achieve state-of-the-art performance comparable to the other latest toolkits, resulting in a mean opinion score (MOS) of 4.25 on the LJSpeech dataset. The toolkit is publicly available at https://github.com/espnet/espnet.
Submitted 16 February, 2020; v1 submitted 24 October, 2019;
originally announced October 2019.
-
A Comparative Study on Transformer vs RNN in Speech Applications
Authors:
Shigeki Karita,
Nanxin Chen,
Tomoki Hayashi,
Takaaki Hori,
Hirofumi Inaguma,
Ziyan Jiang,
Masao Someki,
Nelson Enrique Yalta Soplin,
Ryuichi Yamamoto,
Xiaofei Wang,
Shinji Watanabe,
Takenori Yoshimura,
Wangyou Zhang
Abstract:
Sequence-to-sequence models have been widely used in end-to-end speech processing, for example, automatic speech recognition (ASR), speech translation (ST), and text-to-speech (TTS). This paper focuses on an emergent sequence-to-sequence model called Transformer, which achieves state-of-the-art performance in neural machine translation and other natural language processing applications. We undertook intensive studies in which we experimentally compared and analyzed Transformer and conventional recurrent neural networks (RNN) in a total of 15 ASR, one multilingual ASR, one ST, and two TTS benchmarks. Our experiments revealed various training tips and significant performance benefits obtained with Transformer for each task, including the surprising superiority of Transformer in 13/15 ASR benchmarks in comparison with RNN. We are preparing to release Kaldi-style reproducible recipes using open source and publicly available datasets for all the ASR, ST, and TTS tasks so that the community can build on our exciting outcomes.
Submitted 28 September, 2019; v1 submitted 13 September, 2019;
originally announced September 2019.
-
Chain-NN: An Energy-Efficient 1D Chain Architecture for Accelerating Deep Convolutional Neural Networks
Authors:
Shihao Wang,
Dajiang Zhou,
Xushen Han,
Takeshi Yoshimura
Abstract:
Deep convolutional neural networks (CNNs) have shown good performance in many computer vision tasks. However, the high computational complexity of CNNs involves a huge amount of data movement between the computational processor core and the memory hierarchy, which accounts for the majority of the power consumption. This paper presents Chain-NN, a novel energy-efficient 1D chain architecture for accelerating deep CNNs. Chain-NN consists of dedicated dual-channel processing engines (PEs). In Chain-NN, convolutions are performed by 1D systolic primitives composed of a group of adjacent PEs. These systolic primitives, together with the proposed column-wise scan input pattern, can fully reuse input operands to reduce the memory bandwidth requirement and save energy. Moreover, the 1D chain architecture allows the systolic primitives to be easily reconfigured according to specific CNN parameters with less design complexity. Chain-NN is synthesized and laid out in a TSMC 28nm process. It costs 3751k logic gates and 352KB of on-chip memory. The results show that a 576-PE Chain-NN can be scaled up to 700MHz. This achieves a peak throughput of 806.4GOPS at 567.5mW and is able to accelerate the five convolutional layers in AlexNet at a frame rate of 326.2fps. The resulting power efficiency of 1421.0GOPS/W is 2.5x to 4.1x better than state-of-the-art works.
Submitted 4 March, 2017;
originally announced March 2017.
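
The throughput and efficiency figures quoted in the Chain-NN abstract above can be cross-checked with a short calculation. The sketch below uses only the numbers from the abstract (576 PEs, 700 MHz, 567.5 mW) plus one assumption that the abstract does not state: each PE completes one multiply-accumulate per cycle, counted as two operations. The function names are hypothetical.

```python
# Figures quoted in the Chain-NN abstract.
NUM_PES = 576
CLOCK_HZ = 700e6
POWER_W = 0.5675

# Assumption (not stated in the abstract): one multiply-accumulate per PE
# per cycle, counted as 2 operations (one multiply, one add).
OPS_PER_PE_PER_CYCLE = 2


def peak_throughput_gops(num_pes: int, clock_hz: float, ops_per_cycle: int) -> float:
    """Peak throughput in GOPS = PEs x clock x operations per PE per cycle."""
    return num_pes * clock_hz * ops_per_cycle / 1e9


def power_efficiency_gops_per_w(gops: float, watts: float) -> float:
    """Power efficiency in GOPS/W."""
    return gops / watts


peak = peak_throughput_gops(NUM_PES, CLOCK_HZ, OPS_PER_PE_PER_CYCLE)
efficiency = power_efficiency_gops_per_w(peak, POWER_W)
print(f"{peak:.1f} GOPS, {efficiency:.1f} GOPS/W")  # 806.4 GOPS, 1421.0 GOPS/W
```

Under that assumption the calculation reproduces both the 806.4 GOPS peak and the 1421.0 GOPS/W efficiency reported in the abstract.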