Showing 1–27 of 27 results for author: Shibuya, T

  1. arXiv:2406.17672  [pdf, other]

    cs.SD eess.AS

    SpecMaskGIT: Masked Generative Modeling of Audio Spectrograms for Efficient Audio Synthesis and Beyond

    Authors: Marco Comunità, Zhi Zhong, Akira Takahashi, Shiqi Yang, Mengjie Zhao, Koichi Saito, Yukara Ikemiya, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: Recent advances in generative models that iteratively synthesize audio clips sparked great success to text-to-audio synthesis (TTA), but with the cost of slow synthesis speed and heavy computation. Although there have been attempts to accelerate the iterative procedure, high-quality TTA systems remain inefficient due to hundreds of iterations required in the inference phase and large amount of mod…

    Submitted 26 June, 2024; v1 submitted 25 June, 2024; originally announced June 2024.

    Comments: 6 pages, 8 figures, 8 tables. Audio samples: https://zzaudio.github.io/SpecMaskGIT/index.html

  2. arXiv:2406.01867  [pdf, other]

    cs.CV

    MoLA: Motion Generation and Editing with Latent Diffusion Enhanced by Adversarial Training

    Authors: Kengo Uchida, Takashi Shibuya, Yuhta Takida, Naoki Murata, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: In motion generation, controllability as well as generation quality and speed is becoming more and more important. There are various motion editing tasks, such as in-betweening, upper body editing, and path-following, but existing methods perform motion editing with a data-space diffusion model, which is slow in inference compared to a latent diffusion model. In this paper, we propose MoLA, which…

    Submitted 3 June, 2024; originally announced June 2024.

    Comments: 12 pages, 6 figures

  3. arXiv:2405.18503  [pdf, other]

    cs.SD cs.LG eess.AS

    SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation

    Authors: Koichi Saito, Dongjun Kim, Takashi Shibuya, Chieh-Hsin Lai, Zhi Zhong, Yuhta Takida, Yuki Mitsufuji

    Abstract: Sound content is an indispensable element for multimedia works such as video games, music, and films. Recent high-quality diffusion-based sound generation models can serve as valuable tools for the creators. However, despite producing high-quality sounds, these models often suffer from slow inference speeds. This drawback burdens creators, who typically refine their sounds through trial and error…

    Submitted 10 June, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

    Comments: Audio samples: https://koichi-saito-sony.github.io/soundctm/. Codes: https://github.com/sony/soundctm. Checkpoints: https://huggingface.co/Sony/soundctm

  4. arXiv:2405.17842  [pdf, other]

    cs.CV cs.LG cs.MM cs.SD eess.AS

    Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation

    Authors: Akio Hayakawa, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji

    Abstract: In this study, we aim to construct an audio-video generative model with minimal computational cost by leveraging pre-trained single-modal generative models for audio and video. To achieve this, we propose a novel method that guides each single-modal model to cooperatively generate well-aligned samples across modalities. Specifically, given two pre-trained base diffusion models, we train a lightwei…

    Submitted 28 May, 2024; originally announced May 2024.

  5. arXiv:2405.17251  [pdf, other]

    cs.CV

    GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping

    Authors: Junyoung Seo, Kazumi Fukuda, Takashi Shibuya, Takuya Narihira, Naoki Murata, Shoukang Hu, Chieh-Hsin Lai, Seungryong Kim, Yuki Mitsufuji

    Abstract: Generating novel views from a single image remains a challenging task due to the complexity of 3D scenes and the limited diversity in the existing multi-view datasets to train a model on. Recent research combining large-scale text-to-image (T2I) models with monocular depth estimation (MDE) has shown promise in handling in-the-wild images. In these methods, an input view is geometrically warped to…

    Submitted 27 May, 2024; originally announced May 2024.

    Comments: Project page: https://GenWarp-NVS.github.io

  6. arXiv:2405.14598  [pdf, other]

    cs.CV cs.LG cs.MM cs.SD eess.AS

    Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation

    Authors: Shiqi Yang, Zhi Zhong, Mengjie Zhao, Shusuke Takahashi, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji

    Abstract: In recent years, with the realistic generation results and a wide range of personalized applications, diffusion-based generative models gain huge attention in both visual and audio generation areas. Compared to the considerable advancements of text2image or text2audio generation, research in audio2visual or visual2audio generation has been relatively slow. The recent audio-visual generation method…

    Submitted 24 May, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

    Comments: 10 pages

  7. arXiv:2402.07584  [pdf, ps, other]

    cs.CR

    Privacy-Optimized Randomized Response for Sharing Multi-Attribute Data

    Authors: Akito Yamamoto, Tetsuo Shibuya

    Abstract: With the increasing amount of data in society, privacy concerns in data sharing have become widely recognized. Particularly, protecting personal attribute information is essential for a wide range of aims from crowdsourcing to realizing personalized medicine. Although various differentially private methods based on randomized response have been proposed for single attribute information or specific…

    Submitted 12 February, 2024; originally announced February 2024.

  8. arXiv:2401.00365  [pdf, other]

    cs.LG cs.AI cs.CV

    HQ-VAE: Hierarchical Discrete Representation Learning with Variational Bayes

    Authors: Yuhta Takida, Yukara Ikemiya, Takashi Shibuya, Kazuki Shimada, Woosung Choi, Chieh-Hsin Lai, Naoki Murata, Toshimitsu Uesaka, Kengo Uchida, Wei-Hsiang Liao, Yuki Mitsufuji

    Abstract: Vector quantization (VQ) is a technique to deterministically learn features with discrete codebook representations. It is commonly performed with a variational autoencoding model, VQ-VAE, which can be further extended to hierarchical structures for making high-fidelity reconstructions. However, such hierarchical extensions of VQ-VAE often suffer from the codebook/layer collapse issue, where the co…

    Submitted 28 March, 2024; v1 submitted 30 December, 2023; originally announced January 2024.

    Comments: 34 pages with 17 figures, accepted for TMLR

  9. arXiv:2312.07055  [pdf, other]

    cs.CR cs.AI

    Communication Cost Reduction for Subgraph Counting under Local Differential Privacy via Hash Functions

    Authors: Quentin Hillebrand, Vorapong Suppakitpaisarn, Tetsuo Shibuya

    Abstract: We suggest the use of hash functions to cut down the communication costs when counting subgraphs under edge local differential privacy. While various algorithms exist for computing graph statistics, including the count of subgraphs, under the edge local differential privacy, many suffer with high communication costs, making them less efficient for large graphs. Though data compression is a typical…

    Submitted 12 December, 2023; originally announced December 2023.

    Comments: 13 pages, 3 figures

  10. arXiv:2310.13267  [pdf, other]

    cs.CL cs.CV cs.LG cs.SD eess.AS

    On the Language Encoder of Contrastive Cross-modal Models

    Authors: Mengjie Zhao, Junya Ono, Zhi Zhong, Chieh-Hsin Lai, Yuhta Takida, Naoki Murata, Wei-Hsiang Liao, Takashi Shibuya, Hiromi Wakaki, Yuki Mitsufuji

    Abstract: Contrastive cross-modal models such as CLIP and CLAP aid various vision-language (VL) and audio-language (AL) tasks. However, there has been limited investigation of and improvement in their language encoder, which is the central component of encoding natural language descriptions of image/audio into vector representations. We extensively evaluate how unsupervised and supervised sentence embedding…

    Submitted 20 October, 2023; originally announced October 2023.

  11. arXiv:2309.09223  [pdf, other]

    cs.SD eess.AS

    Zero- and Few-shot Sound Event Localization and Detection

    Authors: Kazuki Shimada, Kengo Uchida, Yuichiro Koyama, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji, Tatsuya Kawahara

    Abstract: Sound event localization and detection (SELD) systems estimate direction-of-arrival (DOA) and temporal activation for sets of target classes. Neural network (NN)-based SELD systems have performed well in various sets of target classes, but they only output the DOA and temporal activation of preset classes trained before inference. To customize target classes after training, we tackle zero- and few…

    Submitted 17 January, 2024; v1 submitted 17 September, 2023; originally announced September 2023.

    Comments: 5 pages, 4 figures, accepted for publication in IEEE ICASSP 2024

  12. arXiv:2309.02836  [pdf, other]

    cs.SD cs.LG eess.AS

    BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network

    Authors: Takashi Shibuya, Yuhta Takida, Yuki Mitsufuji

    Abstract: Generative adversarial network (GAN)-based vocoders have been intensively studied because they can synthesize high-fidelity audio waveforms faster than real-time. However, it has been reported that most GANs fail to obtain the optimal projection for discriminating between real and fake data in the feature space. In the literature, it has been demonstrated that slicing adversarial network (SAN), an…

    Submitted 24 March, 2024; v1 submitted 6 September, 2023; originally announced September 2023.

    Comments: Accepted at ICASSP 2024. Equation (5) in the previous version is wrong. We modified it

  13. arXiv:2305.10734  [pdf, other]

    cs.SD cs.CL eess.AS

    Diffusion-Based Speech Enhancement with Joint Generative and Predictive Decoders

    Authors: Hao Shi, Kazuki Shimada, Masato Hirano, Takashi Shibuya, Yuichiro Koyama, Zhi Zhong, Shusuke Takahashi, Tatsuya Kawahara, Yuki Mitsufuji

    Abstract: Diffusion-based generative speech enhancement (SE) has recently received attention, but reverse diffusion remains time-consuming. One solution is to initialize the reverse diffusion process with enhanced features estimated by a predictive SE system. However, the pipeline structure currently does not consider for a combined use of generative and predictive decoders. The predictive decoder allows us…

    Submitted 28 February, 2024; v1 submitted 18 May, 2023; originally announced May 2023.

  14. arXiv:2305.06701  [pdf, ps, other]

    cs.SD eess.AS

    Extending Audio Masked Autoencoders Toward Audio Restoration

    Authors: Zhi Zhong, Hao Shi, Masato Hirano, Kazuki Shimada, Kazuya Tateishi, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: Audio classification and restoration are among major downstream tasks in audio signal processing. However, restoration derives less of a benefit from pretrained models compared to the overwhelming success of pretrained models in classification tasks. Due to such unbalanced benefits, there has been rising interest in how to improve the performance of pretrained models for restoration tasks, e.g., s…

    Submitted 17 August, 2023; v1 submitted 11 May, 2023; originally announced May 2023.

    Comments: WASPAA 2023. Copyright 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

  15. arXiv:2301.12811  [pdf, other]

    cs.LG

    SAN: Inducing Metrizability of GAN with Discriminative Normalized Linear Layer

    Authors: Yuhta Takida, Masaaki Imaizumi, Takashi Shibuya, Chieh-Hsin Lai, Toshimitsu Uesaka, Naoki Murata, Yuki Mitsufuji

    Abstract: Generative adversarial networks (GANs) learn a target probability distribution by optimizing a generator and a discriminator with minimax objectives. This paper addresses the question of whether such optimization actually provides the generator with gradients that make its distribution close to the target distribution. We derive metrizable conditions, sufficient conditions for the discriminator to…

    Submitted 10 April, 2024; v1 submitted 30 January, 2023; originally announced January 2023.

    Comments: 34 pages with 17 figures, accepted for publication in ICLR 2024

  16. arXiv:2212.10352  [pdf, other]

    cs.NE cs.LG

    Fixed-Weight Difference Target Propagation

    Authors: Tatsukichi Shibuya, Nakamasa Inoue, Rei Kawakami, Ikuro Sato

    Abstract: Target Propagation (TP) is a biologically more plausible algorithm than the error backpropagation (BP) to train deep networks, and improving practicality of TP is an open issue. TP methods require the feedforward and feedback networks to form layer-wise autoencoders for propagating the target values generated at the output layer. However, this causes certain drawbacks; e.g., careful hyperparameter…

    Submitted 19 December, 2022; originally announced December 2022.

    Comments: Accepted at the Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI-23). 9 pages and 3 figures in main manuscript; 11 pages and 5 figures in supplementary material

  17. Diffiner: A Versatile Diffusion-based Generative Refiner for Speech Enhancement

    Authors: Ryosuke Sawata, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: Although deep neural network (DNN)-based speech enhancement (SE) methods outperform the previous non-DNN-based ones, they often degrade the perceptual quality of generated outputs. To tackle this problem, we introduce a DNN-based generative refiner, Diffiner, aiming to improve perceptual speech quality pre-processed by an SE method. We train a diffusion-based generative model by utilizing a datase…

    Submitted 30 August, 2023; v1 submitted 27 October, 2022; originally announced October 2022.

    Comments: Accepted by Interspeech 2023

  18. arXiv:2210.16978  [pdf, other]

    cs.CL

    XMD: An End-to-End Framework for Interactive Explanation-Based Debugging of NLP Models

    Authors: Dong-Ho Lee, Akshen Kadakia, Brihi Joshi, Aaron Chan, Ziyi Liu, Kiran Narahari, Takashi Shibuya, Ryosuke Mitani, Toshiyuki Sekiya, Jay Pujara, Xiang Ren

    Abstract: NLP models are susceptible to learning spurious biases (i.e., bugs) that work on some datasets but do not properly reflect the underlying task. Explanation-based model debugging aims to resolve spurious biases by showing human users explanations of model behavior, asking users to give feedback on the behavior, then using the feedback to update the model. While existing model debugging methods have…

    Submitted 30 October, 2022; originally announced October 2022.

    Comments: 6 pages, 7 figures. Project page: https://inklab.usc.edu/xmd/

  19. arXiv:2205.07547  [pdf, other]

    cs.LG cs.CV

    SQ-VAE: Variational Bayes on Discrete Representation with Self-annealed Stochastic Quantization

    Authors: Yuhta Takida, Takashi Shibuya, WeiHsiang Liao, Chieh-Hsin Lai, Junki Ohmura, Toshimitsu Uesaka, Naoki Murata, Shusuke Takahashi, Toshiyuki Kumakura, Yuki Mitsufuji

    Abstract: One noted issue of vector-quantized variational autoencoder (VQ-VAE) is that the learned discrete representation uses only a fraction of the full capacity of the codebook, also known as codebook collapse. We hypothesize that the training scheme of VQ-VAE, which involves some carefully designed heuristics, underlies this issue. In this paper, we propose a new training scheme that extends the standa…

    Submitted 9 June, 2022; v1 submitted 16 May, 2022; originally announced May 2022.

    Comments: 25 pages with 10 figures, accepted for publication in ICML 2022 (Our code is available at https://github.com/sony/sqvae)

  20. arXiv:2110.08454  [pdf, other]

    cs.CL

    Good Examples Make A Faster Learner: Simple Demonstration-based Learning for Low-resource NER

    Authors: Dong-Ho Lee, Akshen Kadakia, Kangmin Tan, Mahak Agarwal, Xinyu Feng, Takashi Shibuya, Ryosuke Mitani, Toshiyuki Sekiya, Jay Pujara, Xiang Ren

    Abstract: Recent advances in prompt-based learning have shown strong results on few-shot text classification by using cloze-style templates. Similar attempts have been made on named entity recognition (NER) which manually design templates to predict entity types for every text span in a sentence. However, such methods may suffer from error propagation induced by entity span detection, high cost due to enume…

    Submitted 30 March, 2022; v1 submitted 15 October, 2021; originally announced October 2021.

    Comments: Accepted to ACL 2022 main conference. 14 pages, 8 figures, 9 tables

  21. arXiv:2011.00425  [pdf, other]

    cs.CL cs.LG

    Analyzing the Effect of Multi-task Learning for Biomedical Named Entity Recognition

    Authors: Arda Akdemir, Tetsuo Shibuya

    Abstract: Developing high-performing systems for detecting biomedical named entities has major implications. State-of-the-art deep-learning based solutions for entity recognition often require large annotated datasets, which is not available in the biomedical domain. Transfer learning and multi-task learning have been shown to improve performance for low-resource domains. However, the applications of these…

    Submitted 1 November, 2020; originally announced November 2020.

  22. arXiv:2004.12247  [pdf, other]

    cs.CL cs.IR cs.LG

    Hierarchical Multi Task Learning with Subword Contextual Embeddings for Languages with Rich Morphology

    Authors: Arda Akdemir, Tetsuo Shibuya, Tunga Güngör

    Abstract: Morphological information is important for many sequence labeling tasks in Natural Language Processing (NLP). Yet, existing approaches rely heavily on manual annotations or external software to capture this information. In this study, we propose using subword contextual embeddings to capture the morphological information for languages with rich morphology. In addition, we incorporate these embeddi…

    Submitted 25 April, 2020; originally announced April 2020.

  23. arXiv:1909.02250  [pdf, other]

    cs.CL

    Nested Named Entity Recognition via Second-best Sequence Learning and Decoding

    Authors: Takashi Shibuya, Eduard Hovy

    Abstract: When an entity name contains other names within it, the identification of all combinations of names can become difficult and expensive. We propose a new method to recognize not only outermost named entities but also inner nested ones. We design an objective function for training a neural model that treats the tag sequence for nested entities as the second best path within the span of their parent…

    Submitted 10 July, 2020; v1 submitted 5 September, 2019; originally announced September 2019.

    Comments: Accepted to TACL

  24. arXiv:1804.08285  [pdf, other]

    cs.DS

    Succinct Oblivious RAM

    Authors: Taku Onodera, Tetsuo Shibuya

    Abstract: Reducing the database space overhead is critical in big-data processing. In this paper, we revisit oblivious RAM (ORAM) using big-data standard for the database space overhead. ORAM is a cryptographic primitive that enables users to perform arbitrary database accesses without revealing the access pattern to the server. It is particularly important today since cloud services become increasingly c…

    Submitted 23 April, 2018; originally announced April 2018.

    Comments: 21 pages. A preliminary version of this paper appeared in STACS'18

  25. arXiv:1308.1611  [pdf]

    cs.CY

    Pattern Language for Good Old Future From Japanese Culture

    Authors: Megumi Kadotani, Aya Matsumoto, Takafumi Shibuya, Younjae Lee, Saori Watanabe, Takashi Iba

    Abstract: Having developed greatly over millennium under its culture, the ancient buildings and old town atmospheres maintain a quality of comfort. However, people only appreciate the "good old" quality and do not think further about the rational reasons why they feel comfort in it. This keeps them from creating their own things and models with good old quality, relying on the imported western thinking and…

    Submitted 6 August, 2013; originally announced August 2013.

    Comments: Presented at COINs13 Conference, Chile, 2013 (arxiv:1308.1028)

    Report number: coins13/2013/09

  26. arXiv:1307.7925  [pdf, ps, other]

    cs.DS cs.CE cs.DM q-bio.QM

    Detecting Superbubbles in Assembly Graphs

    Authors: Taku Onodera, Kunihiko Sadakane, Tetsuo Shibuya

    Abstract: We introduce a new concept of a subgraph class called a superbubble for analyzing assembly graphs, and propose an efficient algorithm for detecting it. Most assembly algorithms utilize assembly graphs like the de Bruijn graph or the overlap graph constructed from reads. From these graphs, many assembly algorithms first detect simple local graph structures (motifs), such as tips and bubbles, mainly…

    Submitted 30 July, 2013; originally announced July 2013.

    Comments: Peer-reviewed and presented as part of the 13th Workshop on Algorithms in Bioinformatics (WABI2013)

  27. arXiv:0801.0931  [pdf, ps, other]

    cs.IT

    The Asymptotic Bit Error Probability of LDPC Codes for the Binary Erasure Channel with Finite Iteration Number

    Authors: Ryuhei Mori, Kenta Kasai, Tomoharu Shibuya, Kohichi Sakaniwa

    Abstract: We consider communication over the binary erasure channel (BEC) using low-density parity-check (LDPC) code and belief propagation (BP) decoding. The bit error probability for infinite block length is known by density evolution and it is well known that a difference between the bit error probability at finite iteration number for finite block length $n$ and for infinite block length is asymptotic…

    Submitted 23 January, 2008; v1 submitted 7 January, 2008; originally announced January 2008.

    Comments: 5 pages, 6 figures, correcting errors in Theorem 1 and poor English