Showing 51–79 of 79 results for author: Mitsufuji, Y

  1. arXiv:2208.11428  [pdf, other]

    eess.AS cs.LG cs.SD eess.SP

    Automatic music mixing with deep learning and out-of-domain data

    Authors: Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Giorgio Fabbro, Stefan Uhlich, Chihiro Nagashima, Yuki Mitsufuji

    Abstract: Music mixing traditionally involves recording instruments in the form of clean, individual tracks and blending them into a final mixture using audio effects and expert knowledge (e.g., a mixing engineer). The automation of music production tasks has become an emerging field in recent years, where rule-based methods and machine learning approaches have been explored. Nevertheless, the lack of dry o…

    Submitted 29 August, 2022; v1 submitted 24 August, 2022; originally announced August 2022.

    Comments: 23rd International Society for Music Information Retrieval Conference (ISMIR), December, 2022. Source code, demo and audio examples: https://marco-martinez-sony.github.io/FxNorm-automix/ - added acknowledgements

  2. arXiv:2206.01948  [pdf, other]

    eess.AS cs.SD

    STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events

    Authors: Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, Tuomas Virtanen

    Abstract: This report presents the Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) dataset for sound event localization and detection, comprised of spatial recordings of real scenes collected in various interiors of two different sites. The dataset is captured with a high resolution spherical microphone array and delivered in two 4-channel formats, first-order Ambisonics and tetrahedral microphone arr…

    Submitted 2 September, 2022; v1 submitted 4 June, 2022; originally announced June 2022.

  3. arXiv:2205.07547  [pdf, other]

    cs.LG cs.CV

    SQ-VAE: Variational Bayes on Discrete Representation with Self-annealed Stochastic Quantization

    Authors: Yuhta Takida, Takashi Shibuya, Wei-Hsiang Liao, Chieh-Hsin Lai, Junki Ohmura, Toshimitsu Uesaka, Naoki Murata, Shusuke Takahashi, Toshiyuki Kumakura, Yuki Mitsufuji

    Abstract: One noted issue of vector-quantized variational autoencoder (VQ-VAE) is that the learned discrete representation uses only a fraction of the full capacity of the codebook, also known as codebook collapse. We hypothesize that the training scheme of VQ-VAE, which involves some carefully designed heuristics, underlies this issue. In this paper, we propose a new training scheme that extends the standa…

    Submitted 9 June, 2022; v1 submitted 16 May, 2022; originally announced May 2022.

    Comments: 25 pages with 10 figures, accepted for publication in ICML 2022 (Our code is available at https://github.com/sony/sqvae)

  4. arXiv:2202.01664  [pdf, other]

    eess.AS cs.LG cs.SD

    Distortion Audio Effects: Learning How to Recover the Clean Signal

    Authors: Johannes Imort, Giorgio Fabbro, Marco A. Martínez Ramírez, Stefan Uhlich, Yuichiro Koyama, Yuki Mitsufuji

    Abstract: Given the recent advances in music source separation and automatic mixing, removing audio effects in music tracks is a meaningful step toward developing an automated remixing system. This paper focuses on removing distortion audio effects applied to guitar tracks in music production. We explore whether effect removal can be solved by neural networks designed for source separation and audio effect…

    Submitted 13 September, 2022; v1 submitted 3 February, 2022; originally announced February 2022.

    Comments: Audio examples available at https://joimort.github.io/distortionremoval/

  5. arXiv:2110.07124  [pdf, other]

    eess.AS cs.SD

    Multi-ACCDOA: Localizing and Detecting Overlapping Sounds from the Same Class with Auxiliary Duplicating Permutation Invariant Training

    Authors: Kazuki Shimada, Yuichiro Koyama, Shusuke Takahashi, Naoya Takahashi, Emiru Tsunoo, Yuki Mitsufuji

    Abstract: Sound event localization and detection (SELD) involves identifying the direction-of-arrival (DOA) and the event class. The SELD methods with a class-wise output format make the model predict activities of all sound event classes and corresponding locations. The class-wise methods can output activity-coupled Cartesian DOA (ACCDOA) vectors, which enable us to solve a SELD task with a single target u…

    Submitted 27 March, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

    Comments: 5 pages, 3 figures, accepted for publication in IEEE ICASSP 2022

  6. arXiv:2110.06525  [pdf, other]

    cs.SD cs.LG eess.AS

    Automatic DJ Transitions with Differentiable Audio Effects and Generative Adversarial Networks

    Authors: Bo-Yu Chen, Wei-Han Hsu, Wei-Hsiang Liao, Marco A. Martínez Ramírez, Yuki Mitsufuji, Yi-Hsuan Yang

    Abstract: A central task of a Disc Jockey (DJ) is to create a mixset of music with seamless transitions between adjacent tracks. In this paper, we explore a data-driven approach that uses a generative adversarial network to create the song transition by learning from real-world DJ mixes. In particular, the generator of the model uses two differentiable digital signal processing components, an equalizer (EQ…

    Submitted 17 February, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

    Comments: To be published at ICASSP 2022

  7. arXiv:2110.06501  [pdf, other]

    cs.SD eess.AS

    Spatial Data Augmentation with Simulated Room Impulse Responses for Sound Event Localization and Detection

    Authors: Yuichiro Koyama, Kazuhide Shigemi, Masafumi Takahashi, Kazuki Shimada, Naoya Takahashi, Emiru Tsunoo, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: Recording and annotating real sound events for a sound event localization and detection (SELD) task is time consuming, and data augmentation techniques are often favored when the amount of data is limited. However, how to augment the spatial information in a dataset, including unlabeled directional interference events, remains an open research question. Furthermore, directional interference events…

    Submitted 28 April, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

    Comments: 5 pages, 2 figures, accepted for publication in IEEE ICASSP 2022

  8. arXiv:2110.06494  [pdf, other]

    cs.SD eess.AS

    Music Source Separation with Deep Equilibrium Models

    Authors: Yuichiro Koyama, Naoki Murata, Stefan Uhlich, Giorgio Fabbro, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: While deep neural network-based music source separation (MSS) is very effective and achieves high performance, its model size is often a problem for practical deployment. Deep implicit architectures such as deep equilibrium models (DEQ) were recently proposed, which can achieve higher performance than their explicit counterparts with limited depth while keeping the number of parameters small. This…

    Submitted 28 April, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

    Comments: 5 pages, 4 figures, accepted for publication in IEEE ICASSP 2022

  9. Spatial mixup: Directional loudness modification as data augmentation for sound event localization and detection

    Authors: Ricardo Falcon-Perez, Kazuki Shimada, Yuichiro Koyama, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: Data augmentation methods have shown great importance in diverse supervised learning problems where labeled data is scarce or costly to obtain. For sound event localization and detection (SELD) tasks several augmentation methods have been proposed, with most borrowing ideas from other domains such as images, speech, or monophonic audio. However, only a few exploit the spatial properties of a full…

    Submitted 12 October, 2021; originally announced October 2021.

    Comments: 5 pages, 2 figures, 4 tables. Submitted to the 2022 International Conference on Acoustics, Speech, & Signal Processing (ICASSP)

  10. arXiv:2110.05059  [pdf, other]

    cs.SD eess.AS

    Amicable examples for informed source separation

    Authors: Naoya Takahashi, Yuki Mitsufuji

    Abstract: This paper deals with the problem of informed source separation (ISS), where the sources are accessible during the so-called encoding stage. Previous works computed side-information during the encoding stage and source separation models were designed to utilize the side-information to improve the separation performance. In contrast, in this work, we improve the performance of a pretrained…

    Submitted 17 February, 2022; v1 submitted 11 October, 2021; originally announced October 2021.

    Comments: Accepted to ICASSP 2022

  11. arXiv:2110.05054  [pdf, other]

    cs.SD cs.CR eess.AS

    Source Mixing and Separation Robust Audio Steganography

    Authors: Naoya Takahashi, Mayank Kumar Singh, Yuki Mitsufuji

    Abstract: Audio steganography aims at concealing secret information in carrier audio with imperceptible modification on the carrier. Although previous works addressed the robustness of concealed message recovery against distortions introduced during transmission, they do not address the robustness against aggressive editing such as mixing of other audio sources and source separation. In this work, we propos…

    Submitted 17 February, 2022; v1 submitted 11 October, 2021; originally announced October 2021.

    Comments: Accepted to ICASSP 2022

  12. Music Demixing Challenge 2021

    Authors: Yuki Mitsufuji, Giorgio Fabbro, Stefan Uhlich, Fabian-Robert Stöter, Alexandre Défossez, Minseok Kim, Woosung Choi, Chin-Yun Yu, Kin-Wai Cheuk

    Abstract: Music source separation has been intensively studied in the last decade and tremendous progress with the advent of deep learning could be observed. Evaluation campaigns such as MIREX or SiSEC connected state-of-the-art models and corresponding papers, which can help researchers integrate the best practices into their models. In recent years, the widely used MUSDB18 dataset played an important role…

    Submitted 23 May, 2022; v1 submitted 30 August, 2021; originally announced August 2021.

    Journal ref: Frontiers in Signal Processing, 28 January 2022

  13. arXiv:2106.10806  [pdf, other]

    eess.AS cs.SD

    Ensemble of ACCDOA- and EINV2-based Systems with D3Nets and Impulse Response Simulation for Sound Event Localization and Detection

    Authors: Kazuki Shimada, Naoya Takahashi, Yuichiro Koyama, Shusuke Takahashi, Emiru Tsunoo, Masafumi Takahashi, Yuki Mitsufuji

    Abstract: This report describes our systems submitted to the DCASE2021 challenge task 3: sound event localization and detection (SELD) with directional interference. Our previous system based on activity-coupled Cartesian direction of arrival (ACCDOA) representation enables us to solve a SELD task with a single target. This ACCDOA-based system with efficient network architecture called RD3Net and data augme…

    Submitted 20 June, 2021; originally announced June 2021.

    Comments: 5 pages, 3 figures, submitted to DCASE2021 task3

  14. arXiv:2105.12315  [pdf, other]

    eess.AS cs.LG cs.SD

    Training Speech Enhancement Systems with Noisy Speech Datasets

    Authors: Koichi Saito, Stefan Uhlich, Giorgio Fabbro, Yuki Mitsufuji

    Abstract: Recently, deep neural network (DNN)-based speech enhancement (SE) systems have been used with great success. During training, such systems require clean speech data - ideally, in large quantity with a variety of acoustic conditions, many different speaker characteristics and for a given sampling rate (e.g., 48kHz for fullband SE). However, obtaining such clean speech data is not straightforward -…

    Submitted 25 May, 2021; originally announced May 2021.

    Comments: 5 pages, 3 figures, submitted to WASPAA2021

  15. arXiv:2102.08663  [pdf, other]

    cs.LG cs.CV

    Preventing Oversmoothing in VAE via Generalized Variance Parameterization

    Authors: Yuhta Takida, Wei-Hsiang Liao, Chieh-Hsin Lai, Toshimitsu Uesaka, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: Variational autoencoders (VAEs) often suffer from posterior collapse, which is a phenomenon in which the learned latent space becomes uninformative. This is often related to the hyperparameter resembling the data variance. It can be shown that an inappropriate choice of this hyperparameter causes the oversmoothness in the linearly approximated case and can be empirically verified for the general c…

    Submitted 21 August, 2022; v1 submitted 17 February, 2021; originally announced February 2021.

    Comments: 35 pages with 12 figures, accepted for Neurocomputing

  16. arXiv:2101.06842  [pdf, other]

    cs.SD cs.LG eess.AS

    Hierarchical disentangled representation learning for singing voice conversion

    Authors: Naoya Takahashi, Mayank Kumar Singh, Yuki Mitsufuji

    Abstract: Conventional singing voice conversion (SVC) methods often suffer from operating in high-resolution audio owing to a high dimensionality of data. In this paper, we propose a hierarchical representation learning that enables the learning of disentangled representations with multiple resolutions independently. With the learned disentangled representations, the proposed method progressively performs S…

    Submitted 25 April, 2021; v1 submitted 17 January, 2021; originally announced January 2021.

    Comments: accepted at IJCNN 2021

  17. arXiv:2011.11844  [pdf, other]

    cs.CV cs.LG

    Densely connected multidilated convolutional networks for dense prediction tasks

    Authors: Naoya Takahashi, Yuki Mitsufuji

    Abstract: Tasks that involve high-resolution dense prediction require a modeling of both local and global patterns in a large input field. Although the local and global structures often depend on each other and their simultaneous modeling is important, many convolutional neural network (CNN)-based approaches interchange representations in different resolutions only a few times. In this paper, we claim the i…

    Submitted 8 June, 2021; v1 submitted 21 November, 2020; originally announced November 2020.

    Comments: Accepted to CVPR 2021. arXiv admin note: text overlap with arXiv:2010.01733

  18. arXiv:2010.15306  [pdf, other]

    eess.AS cs.SD

    ACCDOA: Activity-Coupled Cartesian Direction of Arrival Representation for Sound Event Localization and Detection

    Authors: Kazuki Shimada, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: Neural-network (NN)-based methods show high performance in sound event localization and detection (SELD). Conventional NN-based methods use two branches for a sound event detection (SED) target and a direction-of-arrival (DOA) target. The two-branch representation with a single network has to decide how to balance the two objectives during optimization. Using two networks dedicated to each task in…

    Submitted 14 February, 2021; v1 submitted 28 October, 2020; originally announced October 2020.

    Comments: 5 pages, 5 figures, accepted for publication in IEEE ICASSP 2021

  19. arXiv:2010.04228  [pdf, ps, other]

    eess.AS cs.SD

    All for One and One for All: Improving Music Separation by Bridging Networks

    Authors: Ryosuke Sawata, Stefan Uhlich, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: This paper proposes several improvements for music separation with deep neural networks (DNNs), namely a multi-domain loss (MDL) and two combination schemes. First, by using MDL we take advantage of the frequency and time domain representation of audio signals. Next, we utilize the relationship among instruments by jointly considering them. We do this on the one hand by modifying the network archi…

    Submitted 11 May, 2021; v1 submitted 8 October, 2020; originally announced October 2020.

    Comments: Both implementations of our code, i.e., NNabla and PyTorch, are available with this latest version of the paper

  20. arXiv:2010.03164  [pdf, other]

    cs.SD cs.LG eess.AS

    Adversarial attacks on audio source separation

    Authors: Naoya Takahashi, Shota Inoue, Yuki Mitsufuji

    Abstract: Despite the excellent performance of neural-network-based audio source separation methods and their wide range of applications, their robustness against intentional attacks has been largely neglected. In this work, we reformulate various adversarial attack methods for the audio source separation problem and intensively investigate them under different attack conditions and target models. We furthe…

    Submitted 14 February, 2021; v1 submitted 7 October, 2020; originally announced October 2020.

    Comments: Accepted at ICASSP 2021

  21. arXiv:2010.01733  [pdf, other]

    eess.AS cs.LG cs.SD

    D3Net: Densely connected multidilated DenseNet for music source separation

    Authors: Naoya Takahashi, Yuki Mitsufuji

    Abstract: Music source separation involves a large input field to model a long-term dependence of an audio signal. Previous convolutional neural network (CNN)-based approaches address the large input field modeling using sequentially down- and up-sampling feature maps or dilated convolution. In this paper, we claim the importance of a rapid growth of a receptive field and a simultaneous modeling of multi-re…

    Submitted 27 March, 2021; v1 submitted 4 October, 2020; originally announced October 2020.

  22. arXiv:2006.12014  [pdf, other]

    eess.AS cs.SD

    Sound Event Localization and Detection Using Activity-Coupled Cartesian DOA Vector and RD3net

    Authors: Kazuki Shimada, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji

    Abstract: Our systems submitted to the DCASE2020 task 3: Sound Event Localization and Detection (SELD) are described in this report. We consider two systems: a single-stage system that solves sound event localization (SEL) and sound event detection (SED) simultaneously, and a two-stage system that first handles the SED and SEL tasks individually and later combines those results. As the single-stage system, w…

    Submitted 7 October, 2020; v1 submitted 22 June, 2020; originally announced June 2020.

    Comments: Submitted to DCASE2020 task3

  23. arXiv:1911.12928  [pdf, other]

    cs.SD cs.LG eess.AS

    Improving Voice Separation by Incorporating End-to-end Speech Recognition

    Authors: Naoya Takahashi, Mayank Kumar Singh, Sakya Basak, Parthasaarathy Sudarsanam, Sriram Ganapathy, Yuki Mitsufuji

    Abstract: Despite recent advances in voice separation methods, many challenges remain in realistic scenarios such as noisy recording and the limits of available data. In this work, we propose to explicitly incorporate the phonetic and linguistic nature of speech by taking a transfer learning approach using an end-to-end automatic speech recognition (E2EASR) system. The voice separation is conditioned on dee…

    Submitted 3 May, 2020; v1 submitted 28 November, 2019; originally announced November 2019.

    Comments: Accepted in ICASSP 2020

  24. arXiv:1911.02091  [pdf, other]

    eess.AS cs.SD

    Closing the Training/Inference Gap for Deep Attractor Networks

    Authors: Cyril Cadoux, Stefan Uhlich, Marc Ferras, Yuki Mitsufuji

    Abstract: This paper improves the deep attractor network (DANet) approach by closing its gap between training and inference. During training, DANet relies on attractors, which are computed from the ground truth separations. As this information is not available at inference time, the attractors have to be estimated, which is typically done by k-means. This results in two mismatches: The first mismatch stems…

    Submitted 5 November, 2019; originally announced November 2019.

  25. arXiv:1904.03065  [pdf, other]

    cs.SD eess.AS

    Recursive speech separation for unknown number of speakers

    Authors: Naoya Takahashi, Sudarsanam Parthasaarathy, Nabarun Goswami, Yuki Mitsufuji

    Abstract: In this paper we propose a method of single-channel speaker-independent multi-speaker speech separation for an unknown number of speakers. As opposed to previous works, in which the number of speakers is assumed to be known in advance and speech separation models are specific for the number of speakers, our proposed method can be applied to cases with different numbers of speakers using a single m…

    Submitted 1 September, 2019; v1 submitted 5 April, 2019; originally announced April 2019.

    Comments: Interspeech 2019 (oral)

  26. arXiv:1807.02710  [pdf, other]

    cs.SD cs.LG eess.AS

    Improving DNN-based Music Source Separation using Phase Features

    Authors: Joachim Muth, Stefan Uhlich, Nathanael Perraudin, Thomas Kemp, Fabien Cardinaux, Yuki Mitsufuji

    Abstract: Music source separation with deep neural networks typically relies only on amplitude features. In this paper we show that additional phase features can improve the separation performance. Using the theoretical relationship between STFT phase and amplitude, we conjecture that derivatives of the phase are a good feature representation opposed to the raw phase. We verify this conjecture experimentall…

    Submitted 16 July, 2018; v1 submitted 7 July, 2018; originally announced July 2018.

    Comments: 7 pages, 9 figures, Joint Workshop on Machine Learning for Music at ICML, IJCAI/ECAI and AAMAS, 2018

  27. arXiv:1805.02410  [pdf, other]

    cs.SD eess.AS

    MMDenseLSTM: An efficient combination of convolutional and recurrent neural networks for audio source separation

    Authors: Naoya Takahashi, Nabarun Goswami, Yuki Mitsufuji

    Abstract: Deep neural networks have become an indispensable technique for audio source separation (ASS). It was recently reported that a variant of CNN architecture called MMDenseNet was successfully employed to solve the ASS problem of estimating source amplitudes, and state-of-the-art results were obtained for DSD100 dataset. To further enhance MMDenseNet, here we propose a novel architecture that integra…

    Submitted 29 May, 2018; v1 submitted 7 May, 2018; originally announced May 2018.

  28. arXiv:1803.00187  [pdf, other]

    cs.SD eess.AS eess.SP

    Mode Domain Spatial Active Noise Control Using Sparse Signal Representation

    Authors: Yu Maeno, Yuki Mitsufuji, Thushara D. Abhayapala

    Abstract: Active noise control (ANC) over a sizeable space requires a large number of reference and error microphones to satisfy the spatial Nyquist sampling criterion, which limits the feasibility of practical realization of such systems. This paper proposes a mode-domain feedforward ANC method to attenuate the noise field over a large space while reducing the number of microphones required. We adopt a spa…

    Submitted 28 February, 2018; originally announced March 2018.

    Comments: to appear at ICASSP 2018

  29. arXiv:1706.09588  [pdf, other]

    cs.SD cs.CL cs.MM

    Multi-scale Multi-band DenseNets for Audio Source Separation

    Authors: Naoya Takahashi, Yuki Mitsufuji

    Abstract: This paper deals with the problem of audio source separation. To handle the complex and ill-posed nature of the problems of audio source separation, the current state-of-the-art approaches employ deep neural networks to obtain instrumental spectra from a mixture. In this study, we propose a novel network architecture that extends the recently developed densely connected convolutional network (Dens…

    Submitted 29 June, 2017; originally announced June 2017.

    Comments: to appear at WASPAA 2017