Showing 1–18 of 18 results for author: Takida, Y

  1. arXiv:2406.01867  [pdf, other]

    cs.CV

    MoLA: Motion Generation and Editing with Latent Diffusion Enhanced by Adversarial Training

    Authors: Kengo Uchida, Takashi Shibuya, Yuhta Takida, Naoki Murata, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: In motion generation, controllability as well as generation quality and speed is becoming more and more important. There are various motion editing tasks, such as in-betweening, upper body editing, and path-following, but existing methods perform motion editing with a data-space diffusion model, which is slow in inference compared to a latent diffusion model. In this paper, we propose MoLA, which…

    Submitted 18 July, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

    Comments: 12 pages, 6 figures

  2. arXiv:2405.18503  [pdf, other]

    cs.SD cs.LG eess.AS

    SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation

    Authors: Koichi Saito, Dongjun Kim, Takashi Shibuya, Chieh-Hsin Lai, Zhi Zhong, Yuhta Takida, Yuki Mitsufuji

    Abstract: Sound content is an indispensable element for multimedia works such as video games, music, and films. Recent high-quality diffusion-based sound generation models can serve as valuable tools for the creators. However, despite producing high-quality sounds, these models often suffer from slow inference speeds. This drawback burdens creators, who typically refine their sounds through trial and error…

    Submitted 10 June, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

    Comments: Audio samples: https://koichi-saito-sony.github.io/soundctm/. Codes: https://github.com/sony/soundctm. Checkpoints: https://huggingface.co/Sony/soundctm

  3. arXiv:2405.14822  [pdf, other]

    cs.CV cs.AI cs.LG stat.ML

    PaGoDA: Progressive Growing of a One-Step Generator from a Low-Resolution Diffusion Teacher

    Authors: Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Yuhta Takida, Naoki Murata, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon

    Abstract: To accelerate sampling, diffusion models (DMs) are often distilled into generators that directly map noise to data in a single step. In this approach, the resolution of the generator is fundamentally limited by that of the teacher DM. To overcome this limitation, we propose Progressive Growing of Diffusion Autoencoder (PaGoDA), a technique to progressively grow the resolution of the generator beyo…

    Submitted 23 May, 2024; originally announced May 2024.

  4. arXiv:2404.19228  [pdf, other]

    cs.LG

    Understanding Multimodal Contrastive Learning Through Pointwise Mutual Information

    Authors: Toshimitsu Uesaka, Taiji Suzuki, Yuhta Takida, Chieh-Hsin Lai, Naoki Murata, Yuki Mitsufuji

    Abstract: Multimodal representation learning to integrate different modalities, such as text, vision, and audio is important for real-world applications. The symmetric InfoNCE loss proposed in CLIP is a key concept in multimodal representation learning. In this work, we provide a theoretical understanding of the symmetric InfoNCE loss through the lens of the pointwise mutual information and show that encode…

    Submitted 29 April, 2024; originally announced April 2024.

  5. arXiv:2401.00365  [pdf, other]

    cs.LG cs.AI cs.CV

    HQ-VAE: Hierarchical Discrete Representation Learning with Variational Bayes

    Authors: Yuhta Takida, Yukara Ikemiya, Takashi Shibuya, Kazuki Shimada, Woosung Choi, Chieh-Hsin Lai, Naoki Murata, Toshimitsu Uesaka, Kengo Uchida, Wei-Hsiang Liao, Yuki Mitsufuji

    Abstract: Vector quantization (VQ) is a technique to deterministically learn features with discrete codebook representations. It is commonly performed with a variational autoencoding model, VQ-VAE, which can be further extended to hierarchical structures for making high-fidelity reconstructions. However, such hierarchical extensions of VQ-VAE often suffer from the codebook/layer collapse issue, where the co…

    Submitted 28 March, 2024; v1 submitted 30 December, 2023; originally announced January 2024.

    Comments: 34 pages with 17 figures, accepted for TMLR

  6. arXiv:2311.16424  [pdf, other]

    cs.LG cs.AI cs.CV

    Manifold Preserving Guided Diffusion

    Authors: Yutong He, Naoki Murata, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Dongjun Kim, Wei-Hsiang Liao, Yuki Mitsufuji, J. Zico Kolter, Ruslan Salakhutdinov, Stefano Ermon

    Abstract: Despite the recent advancements, conditional image generation still faces challenges of cost, generalizability, and the need for task-specific training. In this paper, we propose Manifold Preserving Guided Diffusion (MPGD), a training-free conditional generation framework that leverages pretrained diffusion models and off-the-shelf neural networks with minimal additional inference cost for a broad…

    Submitted 27 November, 2023; originally announced November 2023.

  7. arXiv:2310.13267  [pdf, other]

    cs.CL cs.CV cs.LG cs.SD eess.AS

    On the Language Encoder of Contrastive Cross-modal Models

    Authors: Mengjie Zhao, Junya Ono, Zhi Zhong, Chieh-Hsin Lai, Yuhta Takida, Naoki Murata, Wei-Hsiang Liao, Takashi Shibuya, Hiromi Wakaki, Yuki Mitsufuji

    Abstract: Contrastive cross-modal models such as CLIP and CLAP aid various vision-language (VL) and audio-language (AL) tasks. However, there has been limited investigation of and improvement in their language encoder, which is the central component of encoding natural language descriptions of image/audio into vector representations. We extensively evaluate how unsupervised and supervised sentence embedding…

    Submitted 20 October, 2023; originally announced October 2023.

  8. arXiv:2310.02279  [pdf, other]

    cs.LG cs.AI cs.CV stat.ML

    Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion

    Authors: Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, Stefano Ermon

    Abstract: Consistency Models (CM) (Song et al., 2023) accelerate score-based diffusion model sampling at the cost of sample quality but lack a natural way to trade-off quality for speed. To address this limitation, we propose Consistency Trajectory Model (CTM), a generalization encompassing CM and score-based models as special cases. CTM trains a single neural network that can -- in a single forward pass --…

    Submitted 30 March, 2024; v1 submitted 1 October, 2023; originally announced October 2023.

    Comments: International Conference on Learning Representations

  9. arXiv:2309.02836  [pdf, other]

    cs.SD cs.LG eess.AS

    BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network

    Authors: Takashi Shibuya, Yuhta Takida, Yuki Mitsufuji

    Abstract: Generative adversarial network (GAN)-based vocoders have been intensively studied because they can synthesize high-fidelity audio waveforms faster than real-time. However, it has been reported that most GANs fail to obtain the optimal projection for discriminating between real and fake data in the feature space. In the literature, it has been demonstrated that slicing adversarial network (SAN), an…

    Submitted 24 March, 2024; v1 submitted 6 September, 2023; originally announced September 2023.

    Comments: Accepted at ICASSP 2024. Equation (5) in the previous version is wrong. We modified it.

  10. arXiv:2307.04305  [pdf, other]

    cs.SD cs.LG eess.AS

    Automatic Piano Transcription with Hierarchical Frequency-Time Transformer

    Authors: Keisuke Toyama, Taketo Akama, Yukara Ikemiya, Yuhta Takida, Wei-Hsiang Liao, Yuki Mitsufuji

    Abstract: Taking long-term spectral and temporal dependencies into account is essential for automatic piano transcription. This is especially helpful when determining the precise onset and offset for each note in the polyphonic piano content. In this case, we may rely on the capability of self-attention mechanism in Transformers to capture these long-term dependencies in the frequency and time axes. In this…

    Submitted 9 July, 2023; originally announced July 2023.

    Comments: 8 pages, 6 figures, to be published in ISMIR2023

  11. arXiv:2306.00367  [pdf, other]

    cs.LG cs.AI math.ST

    On the Equivalence of Consistency-Type Models: Consistency Models, Consistent Diffusion Models, and Fokker-Planck Regularization

    Authors: Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Naoki Murata, Yuki Mitsufuji, Stefano Ermon

    Abstract: The emergence of various notions of "consistency" in diffusion models has garnered considerable attention and helped achieve improved sample quality, likelihood estimation, and accelerated sampling. Although similar concepts have been proposed in the literature, the precise relationships among them remain unclear. In this study, we establish theoretical connections between three recent "consist…

    Submitted 1 June, 2023; originally announced June 2023.

  12. arXiv:2301.12811  [pdf, other]

    cs.LG

    SAN: Inducing Metrizability of GAN with Discriminative Normalized Linear Layer

    Authors: Yuhta Takida, Masaaki Imaizumi, Takashi Shibuya, Chieh-Hsin Lai, Toshimitsu Uesaka, Naoki Murata, Yuki Mitsufuji

    Abstract: Generative adversarial networks (GANs) learn a target probability distribution by optimizing a generator and a discriminator with minimax objectives. This paper addresses the question of whether such optimization actually provides the generator with gradients that make its distribution close to the target distribution. We derive metrizable conditions, sufficient conditions for the discriminator to…

    Submitted 10 April, 2024; v1 submitted 30 January, 2023; originally announced January 2023.

    Comments: 34 pages with 17 figures, accepted for publication in ICLR 2024

  13. arXiv:2301.12686  [pdf, other]

    cs.LG cs.AI cs.CV cs.SD eess.AS

    GibbsDDRM: A Partially Collapsed Gibbs Sampler for Solving Blind Inverse Problems with Denoising Diffusion Restoration

    Authors: Naoki Murata, Koichi Saito, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon

    Abstract: Pre-trained diffusion models have been successfully used as priors in a variety of linear inverse problems, where the goal is to reconstruct a signal from noisy linear measurements. However, existing approaches require knowledge of the linear operator. In this paper, we propose GibbsDDRM, an extension of Denoising Diffusion Restoration Models (DDRM) to a blind setting in which the linear measureme…

    Submitted 27 June, 2023; v1 submitted 30 January, 2023; originally announced January 2023.

  14. arXiv:2211.04124  [pdf, other]

    eess.AS cs.LG cs.SD

    Unsupervised vocal dereverberation with diffusion-based generative models

    Authors: Koichi Saito, Naoki Murata, Toshimitsu Uesaka, Chieh-Hsin Lai, Yuhta Takida, Takao Fukui, Yuki Mitsufuji

    Abstract: Removing reverb from reverberant music is a necessary technique to clean up audio for downstream music manipulations. Reverberation of music contains two categories, natural reverb, and artificial reverb. Artificial reverb has a wider diversity than natural reverb due to its various parameter setups and reverberation types. However, recent supervised dereverberation methods may fail because they r…

    Submitted 8 November, 2022; originally announced November 2022.

    Comments: 6 pages, 2 figures, submitted to ICASSP 2023

  15. Diffiner: A Versatile Diffusion-based Generative Refiner for Speech Enhancement

    Authors: Ryosuke Sawata, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: Although deep neural network (DNN)-based speech enhancement (SE) methods outperform the previous non-DNN-based ones, they often degrade the perceptual quality of generated outputs. To tackle this problem, we introduce a DNN-based generative refiner, Diffiner, aiming to improve perceptual speech quality pre-processed by an SE method. We train a diffusion-based generative model by utilizing a datase…

    Submitted 30 August, 2023; v1 submitted 27 October, 2022; originally announced October 2022.

    Comments: Accepted by Interspeech 2023

  16. arXiv:2210.04296  [pdf, other]

    cs.LG cs.AI

    FP-Diffusion: Improving Score-based Diffusion Models by Enforcing the Underlying Score Fokker-Planck Equation

    Authors: Chieh-Hsin Lai, Yuhta Takida, Naoki Murata, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon

    Abstract: Score-based generative models (SGMs) learn a family of noise-conditional score functions corresponding to the data density perturbed with increasingly large amounts of noise. These perturbed data densities are linked together by the Fokker-Planck equation (FPE), a partial differential equation (PDE) governing the spatial-temporal evolution of a density undergoing a diffusion process. In this work,…

    Submitted 14 June, 2023; v1 submitted 9 October, 2022; originally announced October 2022.

  17. arXiv:2205.07547  [pdf, other]

    cs.LG cs.CV

    SQ-VAE: Variational Bayes on Discrete Representation with Self-annealed Stochastic Quantization

    Authors: Yuhta Takida, Takashi Shibuya, WeiHsiang Liao, Chieh-Hsin Lai, Junki Ohmura, Toshimitsu Uesaka, Naoki Murata, Shusuke Takahashi, Toshiyuki Kumakura, Yuki Mitsufuji

    Abstract: One noted issue of vector-quantized variational autoencoder (VQ-VAE) is that the learned discrete representation uses only a fraction of the full capacity of the codebook, also known as codebook collapse. We hypothesize that the training scheme of VQ-VAE, which involves some carefully designed heuristics, underlies this issue. In this paper, we propose a new training scheme that extends the standa…

    Submitted 9 June, 2022; v1 submitted 16 May, 2022; originally announced May 2022.

    Comments: 25 pages with 10 figures, accepted for publication in ICML 2022 (Our code is available at https://github.com/sony/sqvae)

  18. arXiv:2102.08663  [pdf, other]

    cs.LG cs.CV

    Preventing Oversmoothing in VAE via Generalized Variance Parameterization

    Authors: Yuhta Takida, Wei-Hsiang Liao, Chieh-Hsin Lai, Toshimitsu Uesaka, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: Variational autoencoders (VAEs) often suffer from posterior collapse, which is a phenomenon in which the learned latent space becomes uninformative. This is often related to the hyperparameter resembling the data variance. It can be shown that an inappropriate choice of this hyperparameter causes the oversmoothness in the linearly approximated case and can be empirically verified for the general c…

    Submitted 21 August, 2022; v1 submitted 17 February, 2021; originally announced February 2021.

    Comments: 35 pages with 12 figures, accepted for Neurocomputing