Showing 1–24 of 24 results for author: Saeki, T

  1. arXiv:2406.00899  [pdf, other]

    cs.CL cs.SD eess.AS

    YODAS: Youtube-Oriented Dataset for Audio and Speech

    Authors: Xinjian Li, Shinnosuke Takamichi, Takaaki Saeki, William Chen, Sayaka Shiota, Shinji Watanabe

    Abstract: In this study, we introduce YODAS (YouTube-Oriented Dataset for Audio and Speech), a large-scale, multilingual dataset comprising currently over 500k hours of speech data in more than 100 languages, sourced from both labeled and unlabeled YouTube speech datasets. The labeled subsets, including manual or automatic subtitles, facilitate supervised model training. Conversely, the unlabeled subsets ar…

    Submitted 2 June, 2024; originally announced June 2024.

    Comments: ASRU 2023

  2. arXiv:2402.18932  [pdf, other]

    eess.AS cs.SD

    Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data

    Authors: Takaaki Saeki, Gary Wang, Nobuyuki Morioka, Isaac Elias, Kyle Kastner, Fadi Biadsy, Andrew Rosenberg, Bhuvana Ramabhadran, Heiga Zen, Françoise Beaufays, Hadar Shemtov

    Abstract: Collecting high-quality studio recordings of audio is challenging, which limits the language coverage of text-to-speech (TTS) systems. This paper proposes a framework for scaling a multilingual TTS model to 100+ languages using found data without supervision. The proposed framework combines speech-text encoder pretraining with unsupervised training using untranscribed speech and unspoken text data…

    Submitted 16 July, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

    Comments: To appear in ICASSP 2024. Demo page: https://google.github.io/tacotron/publications/extending_tts/

  3. arXiv:2401.16812  [pdf, other]

    cs.SD eess.AS

    SpeechBERTScore: Reference-Aware Automatic Evaluation of Speech Generation Leveraging NLP Evaluation Metrics

    Authors: Takaaki Saeki, Soumi Maiti, Shinnosuke Takamichi, Shinji Watanabe, Hiroshi Saruwatari

    Abstract: While subjective assessments have been the gold standard for evaluating speech generation, there is a growing need for objective metrics that are highly correlated with human subjective judgments due to their cost efficiency. This paper proposes reference-aware automatic evaluation methods for speech generation inspired by evaluation metrics in natural language processing. The proposed SpeechBERTS…

    Submitted 12 June, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

    Comments: Accepted by Interspeech 2024. An extended version with Appendix. Code: https://github.com/Takaaki-Saeki/DiscreteSpeechMetrics

  4. arXiv:2309.08127  [pdf, other]

    cs.SD eess.AS

    Diversity-based core-set selection for text-to-speech with linguistic and acoustic features

    Authors: Kentaro Seki, Shinnosuke Takamichi, Takaaki Saeki, Hiroshi Saruwatari

    Abstract: This paper proposes a method for extracting a lightweight subset from a text-to-speech (TTS) corpus ensuring synthetic speech quality. In recent years, methods have been proposed for constructing large-scale TTS corpora by collecting diverse data from massive sources such as audiobooks and YouTube. Although these methods have gained significant attention for enhancing the expressive capabilities o…

    Submitted 14 September, 2023; originally announced September 2023.

  5. arXiv:2302.13652  [pdf, ps, other]

    eess.AS cs.CL cs.LG cs.SD

    Duration-aware pause insertion using pre-trained language model for multi-speaker text-to-speech

    Authors: Dong Yang, Tomoki Koriyama, Yuki Saito, Takaaki Saeki, Detai Xin, Hiroshi Saruwatari

    Abstract: Pause insertion, also known as phrase break prediction and phrasing, is an essential part of TTS systems because proper pauses with natural duration significantly enhance the rhythm and intelligibility of synthetic speech. However, conventional phrasing models ignore various speakers' different styles of inserting silent pauses, which can degrade the performance of the model trained on a multi-spe…

    Submitted 27 February, 2023; originally announced February 2023.

    Comments: Accepted by ICASSP2023

  6. arXiv:2301.12596  [pdf, other]

    eess.AS cs.CL

    Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining

    Authors: Takaaki Saeki, Soumi Maiti, Xinjian Li, Shinji Watanabe, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: While neural text-to-speech (TTS) has achieved human-like natural synthetic speech, multilingual TTS systems are limited to resource-rich languages due to the need for paired text and studio-quality audio data. This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language. The use of text-only data allows the development of TTS systems for low-resource la…

    Submitted 27 May, 2023; v1 submitted 29 January, 2023; originally announced January 2023.

    Comments: To appear in IJCAI 2023

  7. arXiv:2212.04559  [pdf, other]

    eess.AS cs.LG cs.SD

    SpeechLMScore: Evaluating speech generation using speech language model

    Authors: Soumi Maiti, Yifan Peng, Takaaki Saeki, Shinji Watanabe

    Abstract: While human evaluation is the most reliable metric for evaluating speech generation systems, it is generally costly and time-consuming. Previous studies on automatic speech quality assessment address the problem by predicting human evaluation scores with machine learning models. However, they rely on supervised learning and thus suffer from high annotation costs and domain-shift problems. We propo…

    Submitted 8 December, 2022; originally announced December 2022.

  8. arXiv:2210.15447  [pdf, other]

    cs.SD cs.CL eess.AS

    Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech

    Authors: Takaaki Saeki, Heiga Zen, Zhehuai Chen, Nobuyuki Morioka, Gary Wang, Yu Zhang, Ankur Bapna, Andrew Rosenberg, Bhuvana Ramabhadran

    Abstract: This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models. Existing multilingual TTS typically supports tens of languages, which are a small fraction of the thousands of languages in the world. One difficulty to scale multilingual TTS to hundreds of languages is collecting high-quality speech-text paired da…

    Submitted 15 March, 2023; v1 submitted 27 October, 2022; originally announced October 2022.

    Comments: To appear in ICASSP 2023

  9. arXiv:2210.14850  [pdf, other]

    cs.SD eess.AS

    Text-to-speech synthesis from dark data with evaluation-in-the-loop data selection

    Authors: Kentaro Seki, Shinnosuke Takamichi, Takaaki Saeki, Hiroshi Saruwatari

    Abstract: This paper proposes a method for selecting training data for text-to-speech (TTS) synthesis from dark data. TTS models are typically trained on high-quality speech corpora that cost much time and money for data collection, which makes it very challenging to increase speaker variation. In contrast, there is a large amount of data whose availability is unknown (a.k.a. "dark data"), such as YouTube v…

    Submitted 26 October, 2022; originally announced October 2022.

    Comments: Submitted to ICASSP 2023

  10. arXiv:2210.09815  [pdf, other]

    cs.SD eess.AS

    Improving robustness of spontaneous speech synthesis with linguistic speech regularization and pseudo-filled-pause insertion

    Authors: Yuta Matsunaga, Takaaki Saeki, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: We present a training method with linguistic speech regularization that improves the robustness of spontaneous speech synthesis methods with filled pause (FP) insertion. Spontaneous speech synthesis is aimed at producing speech with human-like disfluencies, such as FPs. Because modeling the complex data distribution of spontaneous speech with a rich FP vocabulary is challenging, the quality of FP-…

    Submitted 19 September, 2023; v1 submitted 18 October, 2022; originally announced October 2022.

    Comments: Accepted to SSW12

  11. arXiv:2210.07559  [pdf, ps, other]

    cs.SD cs.CL eess.AS

    Empirical Study Incorporating Linguistic Knowledge on Filled Pauses for Personalized Spontaneous Speech Synthesis

    Authors: Yuta Matsunaga, Takaaki Saeki, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: We present a comprehensive empirical study for personalized spontaneous speech synthesis on the basis of linguistic knowledge. With the advent of voice cloning for reading-style speech synthesis, a new voice cloning paradigm for human-like and spontaneous speech synthesis is required. We, therefore, focus on personalized spontaneous speech synthesis that can clone both the individual's voice timbr…

    Submitted 19 September, 2023; v1 submitted 14 October, 2022; originally announced October 2022.

    Comments: Accepted to APSIPA ASC 2022

  12. arXiv:2204.02152  [pdf, other]

    cs.SD eess.AS

    UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022

    Authors: Takaaki Saeki, Detai Xin, Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: We present the UTokyo-SaruLab mean opinion score (MOS) prediction system submitted to VoiceMOS Challenge 2022. The challenge is to predict the MOS values of speech samples collected from previous Blizzard Challenges and Voice Conversion Challenges for two tracks: a main track for in-domain prediction and an out-of-domain (OOD) track for which there is less labeled data from different listening tes…

    Submitted 29 June, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

    Comments: Accepted to INTERSPEECH 2022

  13. arXiv:2203.15683  [pdf, other]

    cs.SD eess.AS

    DRSpeech: Degradation-Robust Text-to-Speech Synthesis with Frame-Level and Utterance-Level Acoustic Representation Learning

    Authors: Takaaki Saeki, Kentaro Tachibana, Ryuichi Yamamoto

    Abstract: Most text-to-speech (TTS) methods use high-quality speech corpora recorded in a well-designed environment, incurring a high cost for data collection. To solve this problem, existing noise-robust TTS methods are intended to use noisy speech corpora as training data. However, they only address either time-invariant or time-variant noises. We propose a degradation-robust TTS method, which can be trai…

    Submitted 29 June, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

    Comments: Accepted to INTERSPEECH 2022

  14. arXiv:2203.14725  [pdf, other]

    cs.SD

    vTTS: visual-text to speech

    Authors: Yoshifumi Nakano, Takaaki Saeki, Shinnosuke Takamichi, Katsuhito Sudoh, Hiroshi Saruwatari

    Abstract: This paper proposes visual-text to speech (vTTS), a method for synthesizing speech from visual text (i.e., text as an image). Conventional TTS converts phonemes or characters into discrete symbols and synthesizes a speech waveform from them, thus losing the visual features that the characters essentially have. Therefore, our method synthesizes speech not from discrete symbols but from visual text.…

    Submitted 28 March, 2022; originally announced March 2022.

    Comments: Submitted to INTERSPEECH 2022

  15. arXiv:2203.12937  [pdf, other]

    cs.SD eess.AS

    SelfRemaster: Self-Supervised Speech Restoration with Analysis-by-Synthesis Approach Using Channel Modeling

    Authors: Takaaki Saeki, Shinnosuke Takamichi, Tomohiko Nakamura, Naoko Tanji, Hiroshi Saruwatari

    Abstract: We present a self-supervised speech restoration method without paired speech corpora. Because the previous general speech restoration method uses artificial paired data created by applying various distortions to high-quality speech corpora, it cannot sufficiently represent acoustic distortions of real data, limiting the applicability. Our model consists of analysis, synthesis, and channel modules…

    Submitted 27 June, 2022; v1 submitted 24 March, 2022; originally announced March 2022.

    Comments: Accepted to INTERSPEECH 2022

  16. arXiv:2203.09961  [pdf, other]

    cs.SD eess.AS

    Personalized Filled-pause Generation with Group-wise Prediction Models

    Authors: Yuta Matsunaga, Takaaki Saeki, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: In this paper, we propose a method to generate personalized filled pauses (FPs) with group-wise prediction models. Compared with fluent text generation, disfluent text generation has not been widely explored. To generate more human-like texts, we addressed disfluent text generation. The usage of disfluency, such as FPs, rephrases, and word fragments, differs from speaker to speaker, and thus, the…

    Submitted 22 April, 2022; v1 submitted 18 March, 2022; originally announced March 2022.

    Comments: Accepted to LREC 2022

  17. arXiv:2112.09323  [pdf, other]

    cs.SD eess.AS

    JTubeSpeech: corpus of Japanese speech collected from YouTube for speech recognition and speaker verification

    Authors: Shinnosuke Takamichi, Ludwig Kürzinger, Takaaki Saeki, Sayaka Shiota, Shinji Watanabe

    Abstract: In this paper, we construct a new Japanese speech corpus called "JTubeSpeech." Although recent end-to-end learning requires large-size speech corpora, open-sourced such corpora for languages other than English have not yet been established. In this paper, we describe the construction of a corpus from YouTube videos and subtitles for speech recognition and speaker verification. Our method can autom…

    Submitted 17 December, 2021; originally announced December 2021.

    Comments: Submitted to ICASSP2022

  18. arXiv:2110.07840  [pdf, other]

    cs.CL cs.SD eess.AS

    ESPnet2-TTS: Extending the Edge of TTS Research

    Authors: Tomoki Hayashi, Ryuichi Yamamoto, Takenori Yoshimura, Peter Wu, Jiatong Shi, Takaaki Saeki, Yooncheol Ju, Yusuke Yasuda, Shinnosuke Takamichi, Shinji Watanabe

    Abstract: This paper describes ESPnet2-TTS, an end-to-end text-to-speech (E2E-TTS) toolkit. ESPnet2-TTS extends our earlier version, ESPnet-TTS, by adding many new features, including: on-the-fly flexible pre-processing, joint training with neural vocoders, and state-of-the-art TTS models with extensions like full-band E2E text-to-waveform modeling, which simplify the training pipeline and further enhance T…

    Submitted 14 October, 2021; originally announced October 2021.

    Comments: Submitted to ICASSP2022. Demo HP: https://espnet.github.io/icassp2022-tts/

  19. arXiv:2109.10724  [pdf, other]

    cs.SD cs.CL eess.AS

    Low-Latency Incremental Text-to-Speech Synthesis with Distilled Context Prediction Network

    Authors: Takaaki Saeki, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: Incremental text-to-speech (TTS) synthesis generates utterances in small linguistic units for the sake of real-time and low-latency applications. We previously proposed an incremental TTS method that leverages a large pre-trained language model to take unobserved future context into account without waiting for the subsequent segment. Although this method achieves comparable speech quality to that…

    Submitted 22 September, 2021; originally announced September 2021.

    Comments: Accepted for ASRU2021

  20. arXiv:2105.06025  [pdf]

    cs.LG

    Machine-learning-based investigation on classifying binary and multiclass behavior outcomes of children with PIMD/SMID

    Authors: Von Ralph Dane Marquez Herbuela, Tomonori Karita, Yoshiya Furukawa, Yoshinori Wada, Yoshihiro Yagi, Shuichiro Senba, Eiko Onishi, Tatsuo Saeki

    Abstract: Recently, the importance of weather parameters and location information to better understand the context of the communication of children with profound intellectual and multiple disabilities (PIMD) or severe motor and intellectual disorders (SMID) has been proposed. However, an investigation on whether these data can be used to classify their behavior for system optimization aimed for predicting t…

    Submitted 12 May, 2021; originally announced May 2021.

  21. Incremental Text-to-Speech Synthesis Using Pseudo Lookahead with Large Pretrained Language Model

    Authors: Takaaki Saeki, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: This letter presents an incremental text-to-speech (TTS) method that performs synthesis in small linguistic units while maintaining the naturalness of output speech. Incremental TTS is generally subject to a trade-off between latency and synthetic speech quality. It is challenging to produce high-quality speech with a low-latency setup that does not make much use of an unobserved future sentence (…

    Submitted 14 April, 2021; v1 submitted 23 December, 2020; originally announced December 2020.

    Comments: Accepted for IEEE Signal Processing Letters

  22. arXiv:2009.00260  [pdf]

    cs.HC

    Children with PIMD/SMID expressive behaviors: Development and testing of ChildSIDE app, the first step for independent communication and mobility

    Authors: Von Ralph Dane Marquez Herbuela, Tomonori Karita, Yoshiya Furukawa, Yoshinori Wada, Shuichiro Senba, Eiko Onishi, Tatsuo Saeki

    Abstract: Children with profound intellectual and multiple disabilities or severe motor and intellectual disabilities only communicate through movements, vocalizations, body postures, muscle tensions, or facial expressions on a pre- or protosymbolic level. Yet, to the best of our knowledge, hardly any system has been developed to interpret their expressive behaviors. This paper describes the design, develop…

    Submitted 1 September, 2020; originally announced September 2020.

  23. arXiv:2002.06778  [pdf, other]

    cs.SD eess.AS

    Lifter Training and Sub-band Modeling for Computationally Efficient and High-Quality Voice Conversion Using Spectral Differentials

    Authors: Takaaki Saeki, Yuki Saito, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: In this paper, we propose computationally efficient and high-quality methods for statistical voice conversion (VC) with direct waveform modification based on spectral differentials. The conventional method with a minimum-phase filter achieves high-quality conversion but requires heavy computation in filtering. This is because the minimum phase using a fixed lifter of the Hilbert transform often re…

    Submitted 17 February, 2020; originally announced February 2020.

    Comments: 5 pages, to appear in IEEE International Conference on Acoustics, Speech, and Signal Processing 2020 (ICASSP 2020)

  24. arXiv:1804.04824  [pdf, other]

    cs.IT

    An Improvement of Non-binary Code Correcting Single b-Burst of Insertions or Deletions

    Authors: Toyohiko Saeki, Takayuki Nozaki

    Abstract: This paper constructs a non-binary code correcting a single $b$-burst of insertions or deletions with a large cardinality. This paper also proposes a decoding algorithm of this code and evaluates a lower bound of the cardinality of this code. Moreover, we evaluate an asymptotic upper bound on the cardinality of codes which correct a single burst of insertions or deletions.

    Submitted 9 August, 2018; v1 submitted 13 April, 2018; originally announced April 2018.

    Comments: 7 pages, accepted to ISITA 2018