LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning

Authors: Masaya Kawamura, Ryuichi Yamamoto, Yuma Shirahata, Takuya Hasumi, Kentaro Tachibana

Abstract: We introduce LibriTTS-P, a new corpus based on LibriTTS-R that includes utterance-level descriptions (i.e., prompts) of speaking style and speaker-level prompts of speaker characteristics. We employ a hybrid approach to construct prompt annotations: (1) manual annotations that capture human perceptions of speaker characteristics and (2) synthetic annotations on speaking style. Compared to existing English prompt datasets, our corpus provides more diverse prompt annotations for all speakers of LibriTTS-R. Experimental results for prompt-based controllable TTS demonstrate that the TTS model trained with LibriTTS-P achieves higher naturalness than the model using the conventional dataset. Furthermore, the results for style captioning tasks show that the model utilizing LibriTTS-P generates 2.5 times more accurate words than the model using a conventional dataset. Our corpus, LibriTTS-P, is available at https://github.com/line/LibriTTS-P.

Submitted 12 June, 2024; originally announced June 2024.

Comments: Accepted to INTERSPEECH 2024

arXiv:2404.14100 [pdf, other]

doi 10.1109/HUMANOIDS.2018.8625002

A Method of Joint Angle Estimation Using Only Relative Changes in Muscle Lengths for Tendon-driven Humanoids with Complex Musculoskeletal Structures

Authors: Kento Kawaharazuka, Shogo Makino, Masaya Kawamura, Yuki Asano, Kei Okada, Masayuki Inaba

Abstract: Tendon-driven musculoskeletal humanoids typically have complex structures similar to those of human beings, such as ball joints and the scapula, in which encoders cannot be installed. Therefore, joint angles cannot be directly obtained and need to be estimated using the changes in muscle lengths. In previous studies, methods using table search and the extended Kalman filter have been developed. These methods express the joint-muscle mapping, which is the nonlinear relationship between joint angles and muscle lengths, by using a data table, polynomials, or a neural network. However, due to computational complexity, these methods cannot consider the effects of polyarticular muscles. In this study, considering the limitation of the computational cost, we reduce unnecessary degrees of freedom, divide joints and muscles into several groups, and formulate a joint angle estimation method that takes into account polyarticular muscles. Also, we extend the estimation method to propose a joint angle estimation method using only the relative changes in muscle lengths. By this extension, which does not use absolute muscle lengths, we do not need to execute a difficult calibration of muscle lengths for tendon-driven musculoskeletal humanoids. Finally, we conduct experiments in simulation and actual environments, and verify the effectiveness of this study.

Submitted 22 April, 2024; originally announced April 2024.

Comments: Accepted at Humanoids2018

arXiv:2404.05295 [pdf, other]

doi 10.1109/LRA.2018.2789849

Online Learning of Joint-Muscle Mapping Using Vision in Tendon-driven Musculoskeletal Humanoids

Authors: Kento Kawaharazuka, Shogo Makino, Masaya Kawamura, Yuki Asano, Kei Okada, Masayuki Inaba

Abstract: The body structures of tendon-driven musculoskeletal humanoids are complex, and accurate modeling is difficult, because they are made by imitating the body structures of human beings. For this reason, we have not been able to move them accurately like ordinary humanoids driven by actuators in each axis, and large internal muscle tension and slack of tendon wires have emerged because of the model error between the geometric model and the actual robot. Therefore, we construct a joint-muscle mapping (JMM) using a neural network (NN), which expresses a nonlinear relationship between joint angles and muscle lengths, and aim to move tendon-driven musculoskeletal humanoids accurately by updating the JMM online from data of the actual robot. In this study, the JMM is updated online by using the vision of the robot so that it moves to the correct position (Vision Updater). Also, we execute another update to modify muscle antagonisms correctly (Antagonism Updater). By using these two updaters, the error between the target and actual joint angles decreases to about 40% in 5 minutes, and we show through a manipulation experiment that the tendon-driven musculoskeletal humanoid Kengoro becomes able to move as intended. This novel system can adapt to the state change and growth of robots, because it updates the JMM online successively.

Submitted 8 April, 2024; originally announced April 2024.

Comments: Accepted at IEEE Robotics and Automation Letters, 2018

arXiv:2404.05286 [pdf, other]

doi 10.1109/IROS.2018.8593428

Online Self-body Image Acquisition Considering Changes in Muscle Routes Caused by Softness of Body Tissue for Tendon-driven Musculoskeletal Humanoids

Authors: Kento Kawaharazuka, Shogo Makino, Masaya Kawamura, Ayaka Fujii, Yuki Asano, Kei Okada, Masayuki Inaba

Abstract: Tendon-driven musculoskeletal humanoids have many benefits in terms of the flexible spine, multiple degrees of freedom, and variable stiffness. At the same time, because of their body complexity, there are problems in controllability. First, due to the large difference between the actual robot and its geometric model, the robot cannot move as intended and large internal muscle tension may emerge. Second, movements which do not appear as changes in muscle lengths may emerge, because of the muscle route changes caused by softness of body tissue. To solve these problems, we construct two models using a neural network: an ideal joint-muscle model and a muscle-route change model. We initialize these models by a man-made geometric model and update them online using the sensor information of the actual robot. We validate through several experiments that the tendon-driven musculoskeletal humanoid Kengoro is able to obtain a correct self-body image.

Submitted 8 April, 2024; originally announced April 2024.

Comments: Accepted at IROS2018

arXiv:2403.17459 [pdf, other]

doi 10.1109/IROS.2017.8202291

High-Power, Flexible, Robust Hand: Development of Musculoskeletal Hand Using Machined Springs and Realization of Self-Weight Supporting Motion with Humanoid

Authors: Shogo Makino, Kento Kawaharazuka, Masaya Kawamura, Yuki Asano, Kei Okada, Masayuki Inaba

Abstract: Humans can support their bodies not only while standing or walking but also with their hands, for example when dangling from a bar. Most humanoid robots, however, support their bodies only with their feet and use their hands just to manipulate objects, because their hands are too weak to support their bodies. Strong hands should enable humanoid robots to act in a much broader range of scenes. Therefore, we developed a new life-size five-fingered hand that can support the body of a life-size humanoid robot. It is a tendon-driven, underactuated hand, and actuators in the forearms produce a large gripping force. The hand has flexible joints using machined springs, which can be designed integrally with the attachment. Thus, it has both structural strength and impact resistance in spite of its small size. In addition, the hand has force sensors to measure external force, and the fingers can flex along objects even though the number of actuators for flexing the fingers is smaller than the number of fingers. We installed the developed hand on the musculoskeletal humanoid "Kengoro" and achieved two self-weight supporting motions: a push-up motion and a dangling motion.

Submitted 26 March, 2024; originally announced March 2024.

Comments: accepted at IROS2017

arXiv:2403.17452 [pdf, other]

doi 10.1109/IROS.2018.8594316

Five-fingered Hand with Wide Range of Thumb Using Combination of Machined Springs and Variable Stiffness Joints

Authors: Shogo Makino, Kento Kawaharazuka, Ayaka Fujii, Masaya Kawamura, Tasuku Makabe, Moritaka Onitsuka, Yuki Asano, Kei Okada, Koji Kawasaki, Masayuki Inaba

Abstract: Human hands can not only grasp objects of various shapes and sizes and manipulate them in the hand, but also exert a gripping force large enough to support the body in situations such as dangling from a bar or climbing a ladder. In contrast, it is difficult for most robot hands to manage both. Therefore, in this paper we developed a hand that can grasp various objects and exert a large gripping force. To develop such a hand, we focused on a thumb CM joint with a wide range of motion and on MP joints of the four fingers with the DOF of abduction and adduction. We applied these joint mechanisms to a hand with large gripping force and flexibility based on machined springs. The thumb CM joint has a wide range of motion because of the combination of three machined springs, and the MP joints of the four fingers have a variable rigidity mechanism instead of driving each joint independently, so that the joints can move within limited space and with a limited number of actuators. Using the developed hand, we achieved the grasping of various objects, the support of a large load, and several motions with an arm.

Submitted 26 March, 2024; originally announced March 2024.

Comments: accepted at IROS2018

arXiv:2402.05508 [pdf, ps, other]

Performance Evaluation of Associative Watermarking Using Statistical Neurodynamics

Authors: Ryoto Kanegae, Masaki Kawamura

Abstract: We theoretically evaluated the performance of our proposed associative watermarking method in which the watermark is not embedded directly into the image. We previously proposed a watermarking method that extends the zero-watermarking model by applying associative memory models. In this model, the hetero-associative memory model is introduced to the mapping process between image features and watermarks, and the auto-associative memory model is applied to correct watermark errors. We herein show that the associative watermarking model outperforms the zero-watermarking model through computer simulations using actual images. In this paper, we describe how we derive the macroscopic state equation for the associative watermarking model using the Okada theory. The theoretical results obtained by the fourth-order theory were in good agreement with those obtained by computer simulations. Furthermore, the performance of the associative watermarking model was evaluated using the bit error rate of the watermark, both theoretically and using computer simulations.

Submitted 8 February, 2024; originally announced February 2024.

arXiv:2309.08140 [pdf, other]

PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-to-Speech Using Natural Language Descriptions

Authors: Reo Shimizu, Ryuichi Yamamoto, Masaya Kawamura, Yuma Shirahata, Hironori Doi, Tatsuya Komatsu, Kentaro Tachibana

Abstract: We propose PromptTTS++, a prompt-based text-to-speech (TTS) synthesis system that allows control over speaker identity using natural language descriptions. To control speaker identity within the prompt-based TTS framework, we introduce the concept of speaker prompt, which describes voice characteristics (e.g., gender-neutral, young, old, and muffled) designed to be approximately independent of speaking style. Since there is no large-scale dataset containing speaker prompts, we first construct a dataset based on the LibriTTS-R corpus with manually annotated speaker prompts. We then employ a diffusion-based acoustic model with mixture density networks to model diverse speaker factors in the training data. Unlike previous studies that rely on style prompts describing only a limited aspect of speaker individuality, such as pitch, speaking speed, and energy, our method utilizes an additional speaker prompt to effectively learn the mapping from natural language descriptions to the acoustic features of diverse speakers. Our subjective evaluation results show that the proposed method can better control speaker characteristics than the methods without the speaker prompt. Audio samples are available at https://reppy4620.github.io/demo.promptttspp/.

Submitted 27 December, 2023; v1 submitted 15 September, 2023; originally announced September 2023.

Comments: Accepted to ICASSP 2024

arXiv:2210.15975 [pdf, other]

Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform

Authors: Masaya Kawamura, Yuma Shirahata, Ryuichi Yamamoto, Kentaro Tachibana

Abstract: We propose a lightweight end-to-end text-to-speech model using multi-band generation and inverse short-time Fourier transform. Our model is based on VITS, a high-quality end-to-end text-to-speech model, but adopts two changes for more efficient inference: 1) the most computationally expensive component is partially replaced with a simple inverse short-time Fourier transform, and 2) multi-band generation, with fixed or trainable synthesis filters, is used to generate waveforms. Unlike conventional lightweight models, which employ optimization or knowledge distillation separately to train two cascaded components, our method enjoys the full benefits of end-to-end optimization. Experimental results show that our model synthesized speech as natural as that synthesized by VITS, while achieving a real-time factor of 0.066 on an Intel Core i7 CPU, 4.1 times faster than VITS. Moreover, a smaller version of the model significantly outperformed a lightweight baseline model with respect to both naturalness and inference speed. Code and audio samples are available from https://github.com/MasayaKawamura/MB-iSTFT-VITS.

Submitted 21 February, 2023; v1 submitted 28 October, 2022; originally announced October 2022.

Comments: Accepted to ICASSP 2023

arXiv:2202.00200 [pdf, other]

Differentiable Digital Signal Processing Mixture Model for Synthesis Parameter Extraction from Mixture of Harmonic Sounds

Authors: Masaya Kawamura, Tomohiko Nakamura, Daichi Kitamura, Hiroshi Saruwatari, Yu Takahashi, Kazunobu Kondo

Abstract: A differentiable digital signal processing (DDSP) autoencoder is a musical sound synthesizer that combines a deep neural network (DNN) and spectral modeling synthesis. It allows us to flexibly edit sounds by changing the fundamental frequency, timbre feature, and loudness (synthesis parameters) extracted from an input sound. However, it is designed for a monophonic harmonic sound and cannot handle mixtures of harmonic sounds. In this paper, we propose a model (DDSP mixture model) that represents a mixture as the sum of the outputs of multiple pretrained DDSP autoencoders. By fitting the output of the proposed model to the observed mixture, we can directly estimate the synthesis parameters of each source. Through synthesis parameter extraction experiments, we show that the proposed method has high and stable performance compared with a straightforward method that applies the DDSP autoencoder to the signals separated by an audio source separation method.

Submitted 31 January, 2022; originally announced February 2022.

Comments: 5 pages, 2 figures, to appear in 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2022)

arXiv:1312.2658 [pdf]

doi 10.5121/acij.2013.4601

The Study about the Analysis of Responsiveness Pair Clustering to Social Network Bipartite Graph

Authors: Akira Otsuki, Masayoshi Kawamura

Abstract: In this study, regional (cities, towns and villages) data and tweet data are obtained from Twitter, and purchase information (where and what was bought) is extracted from the tweet data by morphological analysis and rule-based dependency analysis. Then, the "regional information" and the "information of purchase history" (where and what was bought) are captured as a bipartite graph, and Responsiveness Pair Clustering analysis (a clustering using correspondence analysis as the similarity measure) is conducted. Since it was found to be difficult to analyze a network such as a bipartite graph, which has limitations in links, by using modularity Q, responsiveness is used instead of modularity Q as the similarity measure. As a result of this analysis, a "regional information cluster", which refers to a group of nodes with similar "information of purchase history", is generated. Finally, similar regions are visualized by mapping the regional information cluster on the map. This visualization system is expected to contribute as an analytical tool for customers' purchasing behavior and so on.

Submitted 9 December, 2013; originally announced December 2013.

Comments: 14 pages, 8 figures, 3 tables

Journal ref: Advanced Computing: An International Journal (ACIJ), Vol.4, No.6, November 2013

arXiv:1310.4900 [pdf]

doi 10.5121/ijdkp.2013.3501

GV-Index: Scientific Contribution Rating Index That Takes into Account the Growth Degree of Research Area and Variance Values of the Publication Year of Cited Paper

Authors: Akira Otsuki, Masayoshi Kawamura

Abstract: There are a wide variety of scientific contribution rating indices including the impact factor and h-index. These are used for quantitative analyses on research papers published in the past, and are therefore unable to incorporate in the assessment the growth, or deterioration, of the research area: whether the research area of a particular paper is in decline or conversely in a growing trend. On the other hand, the use of the conventional rating indices may result in higher rates for papers that are hardly referenced nowadays in other papers although frequently cited in the past. This study proposes a new type of scientific contribution ranking index, "Growing Degree of Research Area and Variance Values Index (GV-Index)". The GV-Index is computed by a principal component analysis based on an estimated value obtained by the PageRank algorithm, which takes into account the growing degree of the research area and its variance. We also propose a visualization system of a scientist's network using the GV-Index.

Submitted 17 October, 2013; originally announced October 2013.

Comments: 11 pages, 9 figures, 8 tables

Journal ref: International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.3, No.5, September 2013

arXiv:1209.4772 [pdf, ps, other]

doi 10.1103/PhysRevE.99.062132

Statistical mechanical evaluation of spread spectrum watermarking model with image restoration

Authors: Masaki Kawamura, Kao Hayashi, Tatsuya Uezu, Masato Okada

Abstract: In cases in which an original image is blind, a decoding method where both the image and the messages can be estimated simultaneously is desirable. We propose a spread spectrum watermarking model with image restoration based on Bayes estimation. We therefore need to assume some prior probabilities. The probability for estimating the messages is given by the uniform distribution, and the ones for the image are given by the infinite range model and 2D Ising model. Any attacks from unauthorized users can be represented by channel models. We can obtain the estimated messages and image by maximizing the posterior probability. We analyzed the performance of the proposed method by the replica method in the case of the infinite range model. We first calculated the theoretical values of the bit error rate from obtained saddle point equations and then verified them by computer simulations. For this purpose, we assumed that the image is binary and is generated from a given prior probability. We also assume that attacks can be represented by the Gaussian channel. The computer simulation results agreed with the theoretical values. In the case of prior probability given by the 2D Ising model, in which each pixel is statically connected with four-neighbors, we evaluated the decoding performance by computer simulations, since the replica theory could not be applied. Results using the 2D Ising model showed that the proposed method with image restoration is as effective as the infinite range model for decoding messages.
We compared the performances in a case in which the image was blind and one in which it was informed. The difference between these cases was small as long as the embedding and attack rates were small. This demonstrates that the proposed method with simultaneous estimation is effective as a watermarking decoder.

Submitted 26 June, 2019; v1 submitted 21 September, 2012; originally announced September 2012.

Journal ref: Phys. Rev. E 99, 062132 (2019)

Showing 1–13 of 13 results for author: Kawamura, M