Showing 1–50 of 51 results for author: Fujita, Y

  1. arXiv:2406.16315  [pdf, other]

    eess.AS cs.CL cs.SD

    Song Data Cleansing for End-to-End Neural Singer Diarization Using Neural Analysis and Synthesis Framework

    Authors: Hokuto Munakata, Ryo Terashima, Yusuke Fujita

    Abstract: We propose a data cleansing method that utilizes a neural analysis and synthesis (NANSY++) framework to train an end-to-end neural diarization model (EEND) for singer diarization. Our proposed model converts song data with choral singing, which is commonly contained in popular music and unsuitable for generating a simulated dataset, to solo singing data. This cleansing is based on NANSY++, which…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: INTERSPEECH 2024 accepted

  2. arXiv:2406.13139  [pdf, other]

    eess.AS cs.SD

    Audio Fingerprinting with Holographic Reduced Representations

    Authors: Yusuke Fujita, Tatsuya Komatsu

    Abstract: This paper proposes an audio fingerprinting model with holographic reduced representation (HRR). The proposed method reduces the number of stored fingerprints, whereas conventional neural audio fingerprinting requires many fingerprints for each audio track to achieve high accuracy and time resolution. We utilize HRR to aggregate multiple fingerprints into a composite fingerprint via circular convo…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: accepted at Interspeech 2024

  3. arXiv:2406.12194  [pdf, other]

    eess.AS cs.SD

    Universal Score-based Speech Enhancement with High Content Preservation

    Authors: Robin Scheibler, Yusuke Fujita, Yuma Shirahata, Tatsuya Komatsu

    Abstract: We propose UNIVERSE++, a universal speech enhancement method based on score-based diffusion and adversarial training. Specifically, we improve the existing UNIVERSE model that decouples clean speech feature extraction and diffusion. Our contributions are three-fold. First, we make several modifications to the network architecture, improving training stability and final performance. Second, we intr…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: 5 pages, 5 figures, accepted at Interspeech 2024

  4. arXiv:2401.11700  [pdf, other]

    cs.CL cs.SD eess.AS

    Keep Decoding Parallel with Effective Knowledge Distillation from Language Models to End-to-end Speech Recognisers

    Authors: Michael Hentschel, Yuta Nishikawa, Tatsuya Komatsu, Yusuke Fujita

    Abstract: This study presents a novel approach for knowledge distillation (KD) from a BERT teacher model to an automatic speech recognition (ASR) model using intermediate layers. To distil the teacher's knowledge, we use an attention decoder that learns from BERT's token probabilities. Our method shows that language model (LM) information can be more effectively distilled into an ASR model using both the in…

    Submitted 22 January, 2024; originally announced January 2024.

    Comments: Accepted at ICASSP 2024

  5. arXiv:2310.03975  [pdf, ps, other]

    cs.SD cs.CL

    HuBERTopic: Enhancing Semantic Representation of HuBERT through Self-supervision Utilizing Topic Model

    Authors: Takashi Maekaku, Jiatong Shi, Xuankai Chang, Yuya Fujita, Shinji Watanabe

    Abstract: Recently, the usefulness of self-supervised representation learning (SSRL) methods has been confirmed in various downstream tasks. Many of these models, as exemplified by HuBERT and WavLM, use pseudo-labels generated from spectral features or the model's own representation features. From previous studies, it is known that the pseudo-labels contain semantic information. However, the masked predicti…

    Submitted 5 October, 2023; originally announced October 2023.

    Comments: Submitted to IEEE ICASSP 2024

  6. arXiv:2309.15826  [pdf, other]

    cs.CL cs.SD eess.AS

    Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing

    Authors: Brian Yan, Xuankai Chang, Antonios Anastasopoulos, Yuya Fujita, Shinji Watanabe

    Abstract: Recent works in end-to-end speech-to-text translation (ST) have proposed multi-tasking methods with soft parameter sharing which leverage machine translation (MT) data via secondary encoders that map text inputs to an eventual cross-modal representation. In this work, we instead propose a ST/MT multi-tasking framework with hard parameter sharing in which all model parameters are shared cross-modal…

    Submitted 27 September, 2023; originally announced September 2023.

  7. arXiv:2309.15800  [pdf, other]

    cs.CL cs.SD eess.AS

    Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study

    Authors: Xuankai Chang, Brian Yan, Kwanghee Choi, Jeeweon Jung, Yichen Lu, Soumi Maiti, Roshan Sharma, Jiatong Shi, Jinchuan Tian, Shinji Watanabe, Yuya Fujita, Takashi Maekaku, Pengcheng Guo, Yao-Fei Cheng, Pavel Denisov, Kohei Saijo, Hsiu-Hsuan Wang

    Abstract: Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies, evoking inefficiencies in sequence modeling. High-dimensional speech features such as spectrograms are often used as the input for the subsequent model. However, they can still be redundant. Recent investigations proposed the use of discrete speech units derived from self-supervised learning repre…

    Submitted 27 September, 2023; originally announced September 2023.

    Comments: Submitted to IEEE ICASSP 2024

  8. arXiv:2309.08141  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD eess.SP

    Audio Difference Learning for Audio Captioning

    Authors: Tatsuya Komatsu, Yusuke Fujita, Kazuya Takeda, Tomoki Toda

    Abstract: This study introduces a novel training paradigm, audio difference learning, for improving audio captioning. The fundamental concept of the proposed learning method is to create a feature representation space that preserves the relationship between audio, enabling the generation of captions that detail intricate audio information. This method employs a reference audio along with the input audio, bo…

    Submitted 15 September, 2023; originally announced September 2023.

    Comments: submitted to ICASSP2024

  9. arXiv:2305.18108  [pdf, other]

    cs.SD eess.AS

    Exploration of Efficient End-to-End ASR using Discretized Input from Self-Supervised Learning

    Authors: Xuankai Chang, Brian Yan, Yuya Fujita, Takashi Maekaku, Shinji Watanabe

    Abstract: Self-supervised learning (SSL) of speech has shown impressive results in speech-related tasks, particularly in automatic speech recognition (ASR). While most methods employ the output of intermediate layers of the SSL model as real-valued features for downstream tasks, there is potential in exploring alternative approaches that use discretized token sequences. This approach offers benefits such as…

    Submitted 29 May, 2023; originally announced May 2023.

    Comments: Accepted at INTERSPEECH 2023

  10. arXiv:2303.06806  [pdf, other]

    eess.AS cs.CL cs.SD

    Neural Diarization with Non-autoregressive Intermediate Attractors

    Authors: Yusuke Fujita, Tatsuya Komatsu, Robin Scheibler, Yusuke Kida, Tetsuji Ogawa

    Abstract: End-to-end neural diarization (EEND) with encoder-decoder-based attractors (EDA) is a promising method to handle the whole speaker diarization problem simultaneously with a single neural network. While the EEND model can produce all frame-level speaker labels simultaneously, it disregards output label dependency. In this work, we propose a novel EEND model that introduces the label dependency betw…

    Submitted 12 March, 2023; originally announced March 2023.

    Comments: ICASSP 2023

  11. arXiv:2211.05967  [pdf, ps, other]

    cs.CL eess.AS

    Align, Write, Re-order: Explainable End-to-End Speech Translation via Operation Sequence Generation

    Authors: Motoi Omachi, Brian Yan, Siddharth Dalmia, Yuya Fujita, Shinji Watanabe

    Abstract: The black-box nature of end-to-end speech translation (E2E ST) systems makes it difficult to understand how source language inputs are being mapped to the target language. To solve this problem, we would like to simultaneously generate automatic speech recognition (ASR) and ST predictions such that each source language word is explicitly mapped to a target language word. A major challenge arises f…

    Submitted 10 November, 2022; originally announced November 2022.

  12. arXiv:2204.00540  [pdf, other]

    cs.SD cs.CL eess.AS

    End-to-End Integration of Speech Recognition, Speech Enhancement, and Self-Supervised Learning Representation

    Authors: Xuankai Chang, Takashi Maekaku, Yuya Fujita, Shinji Watanabe

    Abstract: This work presents our end-to-end (E2E) automatic speech recognition (ASR) model targeting robust speech recognition, called Integrated speech Recognition with enhanced speech Input for Self-supervised learning representation (IRIS). Compared with conventional E2E ASR models, the proposed E2E model integrates two important modules including a speech enhancement (SE) module and a self-supervise…

    Submitted 1 April, 2022; originally announced April 2022.

    Comments: Submitted to Interspeech 2022

  13. arXiv:2204.00176  [pdf, other]

    cs.CL cs.SD eess.AS

    Better Intermediates Improve CTC Inference

    Authors: Tatsuya Komatsu, Yusuke Fujita, Jaesong Lee, Lukas Lee, Shinji Watanabe, Yusuke Kida

    Abstract: This paper proposes a method for improved CTC inference with searched intermediates and multi-pass conditioning. The paper first formulates self-conditioned CTC as a probabilistic model with an intermediate prediction as a latent representation and provides a tractable conditioning framework. We then propose two new conditioning methods based on the new formulation: (1) Searched intermediate condi…

    Submitted 31 March, 2022; originally announced April 2022.

    Comments: 5 pages, submitted to INTERSPEECH 2022

  14. Alternate Intermediate Conditioning with Syllable-level and Character-level Targets for Japanese ASR

    Authors: Yusuke Fujita, Tatsuya Komatsu, Yusuke Kida

    Abstract: End-to-end automatic speech recognition directly maps input speech to characters. However, the mapping can be problematic when several different pronunciations should be mapped into one character or when one pronunciation is shared among many different characters. Japanese ASR suffers the most from such many-to-one and one-to-many mapping problems due to Japanese kanji characters. To alleviate the…

    Submitted 12 March, 2023; v1 submitted 31 March, 2022; originally announced April 2022.

    Comments: SLT 2022

  15. arXiv:2204.00174  [pdf, other]

    cs.CL cs.SD eess.AS

    InterAug: Augmenting Noisy Intermediate Predictions for CTC-based ASR

    Authors: Yu Nakagome, Tatsuya Komatsu, Yusuke Fujita, Shuta Ichimura, Yusuke Kida

    Abstract: This paper proposes InterAug: a novel training method for CTC-based ASR using augmented intermediate representations for conditioning. The proposed method exploits the conditioning framework of self-conditioned CTC to train robust models by conditioning with "noisy" intermediate predictions. During the training, intermediate predictions are changed to incorrect intermediate predictions, and fed in…

    Submitted 31 March, 2022; originally announced April 2022.

    Comments: This paper was submitted to INTERSPEECH2022

  16. arXiv:2201.01683  [pdf, other]

    cs.CV

    Surface-Aligned Neural Radiance Fields for Controllable 3D Human Synthesis

    Authors: Tianhan Xu, Yasuhiro Fujita, Eiichi Matsumoto

    Abstract: We propose a new method for reconstructing controllable implicit 3D human models from sparse multi-view RGB videos. Our method defines the neural scene representation on the mesh surface points and signed distances from the surface of a human body mesh. We identify an indistinguishability issue that arises when a point in 3D space is mapped to its nearest surface point on a mesh for learning surfa…

    Submitted 3 April, 2022; v1 submitted 5 January, 2022; originally announced January 2022.

    Comments: CVPR 2022. Project page: https://pfnet-research.github.io/surface-aligned-nerf/

  17. arXiv:2110.05249  [pdf, other]

    eess.AS cs.CL cs.SD

    A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation

    Authors: Yosuke Higuchi, Nanxin Chen, Yuya Fujita, Hirofumi Inaguma, Tatsuya Komatsu, Jaesong Lee, Jumon Nozaki, Tianzi Wang, Shinji Watanabe

    Abstract: Non-autoregressive (NAR) models simultaneously generate multiple outputs in a sequence, which significantly speeds up inference at the cost of an accuracy drop compared to autoregressive baselines. Showing great potential for real-time applications, an increasing number of NAR models have been explored in different fields to mitigate the performance gap against AR models. In this work, we con…

    Submitted 11 October, 2021; originally announced October 2021.

    Comments: Accepted to ASRU2021

  18. arXiv:2107.09428  [pdf, other]

    eess.AS cs.LG cs.SD

    Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models

    Authors: Tianzi Wang, Yuya Fujita, Xuankai Chang, Shinji Watanabe

    Abstract: Non-autoregressive (NAR) modeling has gained more and more attention in speech processing. With recent state-of-the-art attention-based automatic speech recognition (ASR) structure, NAR can realize promising real-time factor (RTF) improvement with only small degradation of accuracy compared to the autoregressive (AR) models. However, the recognition inference needs to wait for the completion of a…

    Submitted 20 July, 2021; originally announced July 2021.

    Comments: 5 pages, 1 figure, Interspeech21 conference

  19. arXiv:2107.05899  [pdf, ps, other]

    cs.SD eess.AS

    Speech Representation Learning Combining Conformer CPC with Deep Cluster for the ZeroSpeech Challenge 2021

    Authors: Takashi Maekaku, Xuankai Chang, Yuya Fujita, Li-Wei Chen, Shinji Watanabe, Alexander Rudnicky

    Abstract: We present a system for the Zero Resource Speech Challenge 2021, which combines a Contrastive Predictive Coding (CPC) with deep cluster. In deep cluster, we first prepare pseudo-labels obtained by clustering the outputs of a CPC network with k-means. Then, we train an additional autoregressive model to classify the previously obtained pseudo-labels in a supervised manner. Phoneme discriminative re…

    Submitted 16 February, 2022; v1 submitted 13 July, 2021; originally announced July 2021.

  20. Encoder-Decoder Based Attractors for End-to-End Neural Diarization

    Authors: Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Yawen Xue, Paola Garcia

    Abstract: This paper investigates an end-to-end neural diarization (EEND) method for an unknown number of speakers. In contrast to the conventional cascaded approach to speaker diarization, EEND methods are better in terms of speaker overlap handling. However, EEND still has a disadvantage in that it cannot deal with a flexible number of speakers. To remedy this problem, we introduce encoder-decoder-based a…

    Submitted 28 March, 2022; v1 submitted 20 June, 2021; originally announced June 2021.

    Comments: Accepted to IEEE/ACM TASLP. This article is based on our previous conference paper arxiv:2005.09921

  21. arXiv:2106.09198  [pdf]

    cs.GR cs.CV

    Learning Perceptual Manifold of Fonts

    Authors: Haoran Xie, Yuki Fujita, Kazunori Miyata

    Abstract: Along with the rapid development of deep learning techniques in generative models, it is becoming an urgent issue to combine machine intelligence with human intelligence to solve practical problems. Motivated by this methodology, this work aims to adjust machine-generated character fonts with the effort of human workers in a perception study. Although numerous fonts are available online f…

    Submitted 16 June, 2021; originally announced June 2021.

    Comments: 9 pages, 16 figures

  22. arXiv:2106.04764  [pdf, other]

    eess.AS cs.SD

    Semi-Supervised Training with Pseudo-Labeling for End-to-End Neural Diarization

    Authors: Yuki Takashima, Yusuke Fujita, Shota Horiguchi, Shinji Watanabe, Paola García, Kenji Nagamatsu

    Abstract: In this paper, we present a semi-supervised training technique using pseudo-labeling for end-to-end neural diarization (EEND). The EEND system has shown promising performance compared with traditional clustering-based methods, especially in the case of overlapping speech. However, to get a well-tuned model, EEND requires labeled data for all the joint speech activities of every speaker at each tim…

    Submitted 8 June, 2021; originally announced June 2021.

    Comments: Accepted for Interspeech 2021

  23. arXiv:2106.04078  [pdf, other]

    eess.AS cs.SD

    End-to-End Speaker Diarization Conditioned on Speech Activity and Overlap Detection

    Authors: Yuki Takashima, Yusuke Fujita, Shinji Watanabe, Shota Horiguchi, Paola García, Kenji Nagamatsu

    Abstract: In this paper, we present a conditional multitask learning method for end-to-end neural speaker diarization (EEND). The EEND system has shown promising performance compared with traditional clustering-based methods, especially in the case of overlapping speech. In this paper, to further improve the performance of the EEND system, we propose a novel multitask learning framework that solves speaker…

    Submitted 7 June, 2021; originally announced June 2021.

    Comments: Accepted for SLT 2021

    Journal ref: IEEE Spoken Language Technology Workshop (SLT), 2021, pp. 849-856

  24. arXiv:2102.01363  [pdf, other]

    eess.AS cs.CL cs.SD

    The Hitachi-JHU DIHARD III System: Competitive End-to-End Neural Diarization and X-Vector Clustering Systems Combined by DOVER-Lap

    Authors: Shota Horiguchi, Nelson Yalta, Paola Garcia, Yuki Takashima, Yawen Xue, Desh Raj, Zili Huang, Yusuke Fujita, Shinji Watanabe, Sanjeev Khudanpur

    Abstract: This paper provides a detailed description of the Hitachi-JHU system that was submitted to the Third DIHARD Speech Diarization Challenge. The system outputs the ensemble results of the five subsystems: two x-vector-based subsystems, two end-to-end neural diarization-based subsystems, and one hybrid subsystem. We refine each system and all five subsystems become competitive and complementary. After…

    Submitted 2 February, 2021; originally announced February 2021.

  25. arXiv:2101.08473  [pdf, other]

    cs.SD eess.AS

    Online Streaming End-to-End Neural Diarization Handling Overlapping Speech and Flexible Numbers of Speakers

    Authors: Yawen Xue, Shota Horiguchi, Yusuke Fujita, Yuki Takashima, Shinji Watanabe, Paola Garcia, Kenji Nagamatsu

    Abstract: We propose a streaming diarization method based on an end-to-end neural diarization (EEND) model, which handles flexible numbers of speakers and overlapping speech. In our previous study, the speaker-tracing buffer (STB) mechanism was proposed to achieve a chunk-wise streaming diarization using a pre-trained EEND model. STB traces the speaker information in previous chunks to map the speakers in a…

    Submitted 6 April, 2021; v1 submitted 21 January, 2021; originally announced January 2021.

  26. arXiv:2012.10055  [pdf, other]

    eess.AS cs.CL cs.SD

    End-to-End Speaker Diarization as Post-Processing

    Authors: Shota Horiguchi, Paola Garcia, Yusuke Fujita, Shinji Watanabe, Kenji Nagamatsu

    Abstract: This paper investigates the utilization of an end-to-end diarization model as post-processing of conventional clustering-based diarization. Clustering-based diarization methods partition frames into clusters of the number of speakers; thus, they typically cannot handle overlapping speech because each frame is assigned to one speaker. On the other hand, some end-to-end diarization methods can handl…

    Submitted 23 December, 2020; v1 submitted 18 December, 2020; originally announced December 2020.

  27. arXiv:2012.02530  [pdf, other]

    cs.LG

    Logic Synthesis Meets Machine Learning: Trading Exactness for Generalization

    Authors: Shubham Rai, Walter Lau Neto, Yukio Miyasaka, Xinpei Zhang, Mingfei Yu, Qingyang Yi, Masahiro Fujita, Guilherme B. Manske, Matheus F. Pontes, Leomar S. da Rosa Junior, Marilton S. de Aguiar, Paulo F. Butzen, Po-Chun Chien, Yu-Shan Huang, Hoa-Ren Wang, Jie-Hong R. Jiang, Jiaqi Gu, Zheng Zhao, Zixuan Jiang, David Z. Pan, Brunno A. de Abreu, Isac de Souza Campos, Augusto Berndt, Cristina Meinhardt, Jonata T. Carvalho, Mateus Grellert, et al. (15 additional authors not shown)

    Abstract: Logic synthesis is a fundamental step in hardware design whose goal is to find structural representations of Boolean functions while minimizing delay and area. If the function is completely-specified, the implementation accurately represents the function. If the function is incompletely-specified, the implementation has to be true only on the care set. While most of the algorithms in logic synthes…

    Submitted 15 December, 2020; v1 submitted 4 December, 2020; originally announced December 2020.

    Comments: In this 23 page manuscript, we explore the connection between machine learning and logic synthesis which was the main goal for International Workshop on logic synthesis. It includes approaches applied by ten teams spanning 6 countries across the world

  28. arXiv:2011.07791  [pdf, other]

    eess.AS cs.SD eess.SP

    Block-Online Guided Source Separation

    Authors: Shota Horiguchi, Yusuke Fujita, Kenji Nagamatsu

    Abstract: We propose a block-online algorithm of guided source separation (GSS). GSS is a speech separation method that uses diarization information to update parameters of the generative model of observation signals. Previous studies have shown that GSS performs well in multi-talker scenarios. However, it requires a large amount of calculation time, which is an obstacle to the deployment of online applicat…

    Submitted 16 November, 2020; originally announced November 2020.

    Comments: Accepted to SLT 2021

  29. arXiv:2007.15868  [pdf, other]

    eess.AS cs.CL cs.SD

    Utterance-Wise Meeting Transcription System Using Asynchronous Distributed Microphones

    Authors: Shota Horiguchi, Yusuke Fujita, Kenji Nagamatsu

    Abstract: A novel framework for meeting transcription using asynchronous microphones is proposed in this paper. It consists of audio synchronization, speaker diarization, utterance-wise speech enhancement using guided source separation, automatic speech recognition, and duplication reduction. Doing speaker diarization before speech enhancement enables the system to deal with overlapped speech without consid…

    Submitted 31 July, 2020; originally announced July 2020.

    Comments: Accepted to INTERSPEECH 2020

  30. arXiv:2007.08082  [pdf, other]

    cs.RO cs.AI cs.DC cs.LG stat.ML

    Distributed Reinforcement Learning of Targeted Grasping with Active Vision for Mobile Manipulators

    Authors: Yasuhiro Fujita, Kota Uenishi, Avinash Ummadisingu, Prabhat Nagarajan, Shimpei Masuda, Mario Ynocente Castro

    Abstract: Developing personal robots that can perform a diverse range of manipulation tasks in unstructured environments necessitates solving several challenges for robotic grasping systems. We take a step towards this broader goal by presenting the first RL-based system, to our knowledge, for a mobile manipulator that can (a) achieve targeted grasping generalizing to unseen target objects, (b) learn comple…

    Submitted 14 October, 2020; v1 submitted 15 July, 2020; originally announced July 2020.

    Comments: Accepted at IROS 2020

  31. arXiv:2006.14150  [pdf, other]

    eess.AS cs.SD

    Sequence to Multi-Sequence Learning via Conditional Chain Mapping for Mixture Signals

    Authors: Jing Shi, Xuankai Chang, Pengcheng Guo, Shinji Watanabe, Yusuke Fujita, Jiaming Xu, Bo Xu, Lei Xie

    Abstract: Neural sequence-to-sequence models are well established for applications which can be cast as mapping a single input sequence into a single output sequence. In this work, we focus on one-to-many sequence transduction problems, such as extracting multiple sequential sources from a mixture sequence. We extend the standard sequence-to-sequence model to a conditional multi-sequence model, which explic…

    Submitted 24 June, 2020; originally announced June 2020.

    Comments: 15 pages, 5 figures

  32. arXiv:2006.14149  [pdf, other]

    eess.AS cs.SD

    Speaker-Conditional Chain Model for Speech Separation and Extraction

    Authors: Jing Shi, Jiaming Xu, Yusuke Fujita, Shinji Watanabe, Bo Xu

    Abstract: Speech separation has been extensively explored to tackle the cocktail party problem. However, these studies are still far from having enough generalization capabilities for real scenarios. In this work, we raise a common strategy named Speaker-Conditional Chain Model to process complex speech recordings. In the proposed method, our model first infers the identities of variable numbers of speakers…

    Submitted 24 June, 2020; originally announced June 2020.

    Comments: 7 pages, 3 figures

  33. arXiv:2006.02616  [pdf, other]

    eess.AS cs.SD

    Online End-to-End Neural Diarization with Speaker-Tracing Buffer

    Authors: Yawen Xue, Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Kenji Nagamatsu

    Abstract: This paper proposes a novel online speaker diarization algorithm based on a fully supervised self-attention mechanism (SA-EEND). Online diarization inherently presents a speaker's permutation problem due to the possibility to assign speaker regions incorrectly across the recording. To circumvent this inconsistency, we proposed a speaker-tracing buffer mechanism that selects several input frames re…

    Submitted 6 March, 2021; v1 submitted 3 June, 2020; originally announced June 2020.

    Comments: Accepted to SLT 2021

  34. arXiv:2006.01796  [pdf, other]

    eess.AS cs.CL cs.SD

    Neural Speaker Diarization with Speaker-Wise Chain Rule

    Authors: Yusuke Fujita, Shinji Watanabe, Shota Horiguchi, Yawen Xue, Jing Shi, Kenji Nagamatsu

    Abstract: Speaker diarization is an essential step for processing multi-speaker audio. Although an end-to-end neural diarization (EEND) method achieved state-of-the-art performance, it is limited to a fixed number of speakers. In this paper, we solve this fixed-number-of-speakers issue by a novel speaker-wise conditional inference method based on the probabilistic chain rule. In the proposed method, each spe…

    Submitted 2 June, 2020; originally announced June 2020.

    Comments: Submitted to Interspeech 2020

  35. arXiv:2005.13211  [pdf, other]

    eess.AS cs.SD

    Insertion-Based Modeling for End-to-End Automatic Speech Recognition

    Authors: Yuya Fujita, Shinji Watanabe, Motoi Omachi, Xuankai Chang

    Abstract: End-to-end (E2E) models have gained attention in the research field of automatic speech recognition (ASR). Many E2E models proposed so far assume left-to-right autoregressive generation of an output token sequence except for connectionist temporal classification (CTC) and its variants. However, left-to-right decoding cannot consider the future output context, and it is not always optimal for ASR.…

    Submitted 15 November, 2020; v1 submitted 27 May, 2020; originally announced May 2020.

    Comments: INTERSPEECH 2020

  36. arXiv:2005.09921  [pdf, other]

    eess.AS cs.CL cs.SD

    End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors

    Authors: Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Yawen Xue, Kenji Nagamatsu

    Abstract: End-to-end speaker diarization for an unknown number of speakers is addressed in this paper. Recently proposed end-to-end speaker diarization outperformed conventional clustering-based speaker diarization, but it has one drawback: it is less flexible in terms of the number of speakers. This paper proposes a method for encoder-decoder based attractor calculation (EDA), which first generates a flexi…

    Submitted 5 October, 2020; v1 submitted 20 May, 2020; originally announced May 2020.

    Comments: Accepted to INTERSPEECH 2020

  37. arXiv:2004.09249  [pdf, other]

    cs.SD cs.CL eess.AS

    CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings

    Authors: Shinji Watanabe, Michael Mandel, Jon Barker, Emmanuel Vincent, Ashish Arora, Xuankai Chang, Sanjeev Khudanpur, Vimal Manohar, Daniel Povey, Desh Raj, David Snyder, Aswin Shanmugam Subramanian, Jan Trmal, Bar Ben Yair, Christoph Boeddeker, Zhaoheng Ni, Yusuke Fujita, Shota Horiguchi, Naoyuki Kanda, Takuya Yoshioka, Neville Ryant

    Abstract: Following the success of the 1st, 2nd, 3rd, 4th and 5th CHiME challenges, we organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6). The new challenge revisits the previous CHiME-5 challenge and further considers the problem of distant multi-microphone conversational speech diarization and recognition in everyday home environments. Speech material is the same as the previous C…

    Submitted 2 May, 2020; v1 submitted 20 April, 2020; originally announced April 2020.

  38. arXiv:2003.02966  [pdf, other]

    eess.AS cs.CL cs.SD

    End-to-End Neural Diarization: Reformulating Speaker Diarization as Simple Multi-label Classification

    Authors: Yusuke Fujita, Shinji Watanabe, Shota Horiguchi, Yawen Xue, Kenji Nagamatsu

    Abstract: The most common approach to speaker diarization is clustering of speaker embeddings. However, the clustering-based approach has a number of problems; i.e., (i) it is not optimized to minimize diarization errors directly, (ii) it cannot handle speaker overlaps correctly, and (iii) it has trouble adapting its speaker embedding models to real audio recordings with speaker overlaps. To solve these p…

    Submitted 24 February, 2020; originally announced March 2020.

    Comments: Submission to IEEE TASLP. This article draws from our previous conference papers: arxiv:1909.06247 and arxiv:1909.05952

  39. arXiv:2002.06220  [pdf, other]

    eess.AS cs.SD

    Speaker Diarization with Region Proposal Network

    Authors: Zili Huang, Shinji Watanabe, Yusuke Fujita, Paola Garcia, Yiwen Shao, Daniel Povey, Sanjeev Khudanpur

    Abstract: Speaker diarization is an important pre-processing step for many speech applications, and it aims to solve the "who spoke when" problem. Although the standard diarization systems can achieve satisfactory results in various scenarios, they are composed of several independently-optimized modules and cannot deal with the overlapped speech. In this paper, we propose a novel speaker diarization method:…

    Submitted 14 February, 2020; originally announced February 2020.

    Comments: Accepted to ICASSP 2020

  40. arXiv:1912.04201  [pdf, other]

    cs.LG cs.AI stat.ML

    Learning Latent State Spaces for Planning through Reward Prediction

    Authors: Aaron Havens, Yi Ouyang, Prabhat Nagarajan, Yasuhiro Fujita

    Abstract: Model-based reinforcement learning methods typically learn models for high-dimensional state spaces by aiming to reconstruct and predict the original observations. However, drawing inspiration from model-free reinforcement learning, we propose learning a latent dynamics model directly from rewards. In this work, we introduce a model-based planning framework which learns a latent reward prediction…

    Submitted 9 December, 2019; originally announced December 2019.

    Comments: Deep RL Workshop, Neurips 2019, Vancouver

  41. arXiv:1912.03905  [pdf, other]

    cs.LG cs.AI stat.ML

    ChainerRL: A Deep Reinforcement Learning Library

    Authors: Yasuhiro Fujita, Prabhat Nagarajan, Toshiki Kataoka, Takahiro Ishikawa

    Abstract: In this paper, we introduce ChainerRL, an open-source deep reinforcement learning (DRL) library built using Python and the Chainer deep learning framework. ChainerRL implements a comprehensive set of DRL algorithms and techniques drawn from state-of-the-art research in the field. To foster reproducible research, and for instructional purposes, ChainerRL provides scripts that closely replicate the…

    Submitted 11 April, 2021; v1 submitted 9 December, 2019; originally announced December 2019.

    Comments: Journal of Machine Learning Research

    Journal ref: Journal of Machine Learning Research 22(77) (2021) 1-14

  42. arXiv:1909.08103  [pdf, other]

    cs.CL cs.SD eess.AS

    Simultaneous Speech Recognition and Speaker Diarization for Monaural Dialogue Recordings with Target-Speaker Acoustic Models

    Authors: Naoyuki Kanda, Shota Horiguchi, Yusuke Fujita, Yawen Xue, Kenji Nagamatsu, Shinji Watanabe

    Abstract: This paper investigates the use of target-speaker automatic speech recognition (TS-ASR) for simultaneous speech recognition and speaker diarization of single-channel dialogue recordings. TS-ASR is a technique to automatically extract and recognize only the speech of a target speaker given a short sample utterance of that speaker. One obvious drawback of TS-ASR is that it cannot be used when the sp…

    Submitted 17 September, 2019; originally announced September 2019.

    Comments: Accepted to ASRU 2019

  43. arXiv:1909.06247  [pdf, other]

    eess.AS cs.CL cs.SD

    End-to-End Neural Speaker Diarization with Self-attention

    Authors: Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Yawen Xue, Kenji Nagamatsu, Shinji Watanabe

    Abstract: Speaker diarization has been mainly developed based on the clustering of speaker embeddings. However, the clustering-based approach has two major problems; i.e., (i) it is not optimized to minimize diarization errors directly, and (ii) it cannot handle speaker overlaps correctly. To solve these problems, the End-to-End Neural Diarization (EEND), in which a bidirectional long short-term memory (BLS…

    Submitted 13 September, 2019; originally announced September 2019.

    Comments: Accepted for ASRU 2019

  44. arXiv:1909.05952  [pdf, other]

    eess.AS cs.CL cs.SD

    End-to-End Neural Speaker Diarization with Permutation-Free Objectives

    Authors: Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Kenji Nagamatsu, Shinji Watanabe

    Abstract: In this paper, we propose a novel end-to-end neural-network-based speaker diarization method. Unlike most existing methods, our proposed method does not have separate modules for extraction and clustering of speaker representations. Instead, our model has a single neural network that directly outputs speaker diarization results. To realize such a model, we formulate the speaker diarization problem…

    Submitted 12 September, 2019; originally announced September 2019.

    Comments: Accepted to INTERSPEECH 2019

  45. arXiv:1906.10876  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Auxiliary Interference Speaker Loss for Target-Speaker Speech Recognition

    Authors: Naoyuki Kanda, Shota Horiguchi, Ryoichi Takashima, Yusuke Fujita, Kenji Nagamatsu, Shinji Watanabe

    Abstract: In this paper, we propose a novel auxiliary loss function for target-speaker automatic speech recognition (ASR). Our method automatically extracts and transcribes target speaker's utterances from a monaural mixture of multiple speakers speech given a short sample of the target speaker. The proposed auxiliary loss function attempts to additionally maximize interference speaker ASR accuracy during t…

    Submitted 26 June, 2019; originally announced June 2019.

    Comments: Accepted to INTERSPEECH 2019

  46. arXiv:1905.12230  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Guided Source Separation Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR

    Authors: Naoyuki Kanda, Christoph Boeddeker, Jens Heitkaemper, Yusuke Fujita, Shota Horiguchi, Kenji Nagamatsu, Reinhold Haeb-Umbach

    Abstract: In this paper, we present Hitachi and Paderborn University's joint effort for automatic speech recognition (ASR) in a dinner party scenario. The main challenges of ASR systems for dinner party recordings obtained by multiple microphone arrays are (1) heavy speech overlaps, (2) severe noise and reverberation, (3) very natural conversational content, and possibly (4) insufficient training data. As a…

    Submitted 26 June, 2019; v1 submitted 29 May, 2019; originally announced May 2019.

    Comments: Accepted to INTERSPEECH 2019

  47. arXiv:1904.09049  [pdf, other]

    eess.AS cs.CL cs.SD

    An Investigation of End-to-End Multichannel Speech Recognition for Reverberant and Mismatch Conditions

    Authors: Aswin Shanmugam Subramanian, Xiaofei Wang, Shinji Watanabe, Toru Taniguchi, Dung Tran, Yuya Fujita

    Abstract: Sequence-to-sequence (S2S) modeling is becoming a popular paradigm for automatic speech recognition (ASR) because of its ability to jointly optimize all the conventional ASR components in an end-to-end (E2E) fashion. This report investigates the ability of E2E ASR from standard close-talk to far-field applications by encompassing entire multichannel speech enhancement and ASR components within the…

    Submitted 28 April, 2019; v1 submitted 18 April, 2019; originally announced April 2019.

  48. arXiv:1902.02992  [pdf, other]

    stat.ML cs.LG

    A Wrapped Normal Distribution on Hyperbolic Space for Gradient-Based Learning

    Authors: Yoshihiro Nagano, Shoichiro Yamaguchi, Yasuhiro Fujita, Masanori Koyama

    Abstract: Hyperbolic space is a geometry that is known to be well-suited for representation learning of data with an underlying hierarchical structure. In this paper, we present a novel hyperbolic distribution called pseudo-hyperbolic Gaussian, a Gaussian-like distribution on hyperbolic space whose density can be evaluated analytically and differentiated with respect to the parameters. Our distribu…

    Submitted 9 May, 2019; v1 submitted 8 February, 2019; originally announced February 2019.

    Comments: 20 pages, 12 figures

  49. arXiv:1810.10727  [pdf, other]

    eess.AS cs.LG cs.SD

    Speaker Selective Beamformer with Keyword Mask Estimation

    Authors: Yusuke Kida, Dung Tran, Motoi Omachi, Toru Taniguchi, Yuya Fujita

    Abstract: This paper addresses the problem of automatic speech recognition (ASR) of a target speaker in background speech. The novelty of our approach is that we focus on a wakeup keyword, which is usually used for activating ASR systems like smart speakers. The proposed method firstly utilizes a DNN-based mask estimator to separate the mixture signal into the keyword signal uttered by the target speaker an…

    Submitted 7 November, 2018; v1 submitted 25 October, 2018; originally announced October 2018.

    Comments: Accepted by SLT 2018

  50. arXiv:1809.05214  [pdf, other]

    cs.LG cs.AI stat.ML

    Model-Based Reinforcement Learning via Meta-Policy Optimization

    Authors: Ignasi Clavera, Jonas Rothfuss, John Schulman, Yasuhiro Fujita, Tamim Asfour, Pieter Abbeel

    Abstract: Model-based reinforcement learning approaches carry the promise of being data efficient. However, due to challenges in learning dynamics models that sufficiently match the real-world dynamics, they struggle to achieve the same asymptotic performance as model-free methods. We propose Model-Based Meta-Policy-Optimization (MB-MPO), an approach that foregoes the strong reliance on accurate learned dyn…

    Submitted 13 September, 2018; originally announced September 2018.

    Comments: First 2 authors contributed equally. Accepted for Conference on Robot Learning (CoRL)