Showing 1–35 of 35 results for author: Tjandra, A

  1. arXiv:2406.06251  [pdf, other]

    eess.AS cs.CL

    Learning Fine-Grained Controllability on Speech Generation via Efficient Fine-Tuning

    Authors: Chung-Ming Chien, Andros Tjandra, Apoorv Vyas, Matt Le, Bowen Shi, Wei-Ning Hsu

    Abstract: As the scale of generative models continues to grow, efficient reuse and adaptation of pre-trained models have become crucial considerations. In this work, we propose Voicebox Adapter, a novel approach that integrates fine-grained conditions into a pre-trained Voicebox speech generation model using a cross-attention module. To ensure a smooth integration of newly added modules with pre-trained one…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: Accepted by InterSpeech 2024

  2. arXiv:2312.15821  [pdf, other]

    cs.SD cs.LG eess.AS

    Audiobox: Unified Audio Generation with Natural Language Prompts

    Authors: Apoorv Vyas, Bowen Shi, Matthew Le, Andros Tjandra, Yi-Chiao Wu, Baishan Guo, Jiemin Zhang, Xinyue Zhang, Robert Adkins, William Ngan, Jeff Wang, Ivan Cruz, Bapi Akula, Akinniyi Akinyemi, Brian Ellis, Rashel Moritz, Yael Yungster, Alice Rakotoarison, Liang Tan, Chris Summers, Carleigh Wood, Joshua Lane, Mary Williamson, Wei-Ning Hsu

    Abstract: Audio is an essential part of our life, but creating it often requires expertise and is time-consuming. Research communities have made great progress over the past year advancing the performance of large scale audio generative models for a single modality (speech, sound, or music) through adopting more powerful generative models and scaling data. However, these models lack controllability in sever…

    Submitted 25 December, 2023; originally announced December 2023.

  3. arXiv:2310.16338  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Generative Pre-training for Speech with Flow Matching

    Authors: Alexander H. Liu, Matt Le, Apoorv Vyas, Bowen Shi, Andros Tjandra, Wei-Ning Hsu

    Abstract: Generative models have gained more and more attention in recent years for their remarkable success in tasks that require estimating and sampling data distribution to generate high-fidelity synthetic data. In speech, text-to-speech synthesis and neural vocoder are good examples where generative models have shined. While generative models have been applied to different applications in speech, there…

    Submitted 25 March, 2024; v1 submitted 24 October, 2023; originally announced October 2023.

    Comments: ICLR 2024

  4. arXiv:2309.13018  [pdf, other]

    eess.AS cs.CL cs.SD

    Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of A Multilingual ASR Model

    Authors: Jiamin Xie, Ke Li, Jinxi Guo, Andros Tjandra, Yuan Shangguan, Leda Sari, Chunyang Wu, Junteng Jia, Jay Mahadeokar, Ozlem Kalinli

    Abstract: Neural network pruning offers an effective method for compressing a multilingual automatic speech recognition (ASR) model with minimal performance loss. However, it requires several rounds of pruning and re-training for each language. In this work, we propose the use of an adaptive masking approach in two scenarios for pruning a multilingual ASR model efficiently, each resulting in…

    Submitted 11 January, 2024; v1 submitted 22 September, 2023; originally announced September 2023.

  5. arXiv:2305.13516  [pdf, other]

    cs.CL cs.SD eess.AS

    Scaling Speech Technology to 1,000+ Languages

    Authors: Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli

    Abstract: Expanding the language coverage of speech technology has the potential to improve access to information for many more people. However, current speech technology is restricted to about one hundred languages, which is a small fraction of the over 7,000 languages spoken around the world. The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on…

    Submitted 22 May, 2023; originally announced May 2023.

  6. arXiv:2301.02966  [pdf, other]

    cs.CL cs.LG eess.AS

    SpeeChain: A Speech Toolkit for Large-Scale Machine Speech Chain

    Authors: Heli Qi, Sashi Novitasari, Andros Tjandra, Sakriani Sakti, Satoshi Nakamura

    Abstract: This paper introduces SpeeChain, an open-source PyTorch-based toolkit designed to develop the machine speech chain for large-scale use. This first release focuses on the TTS-to-ASR chain, a core component of the machine speech chain, that refers to the TTS data augmentation by unspoken text for ASR. To build an efficient pipeline for the large-scale TTS-to-ASR chain, we implement easy-to-use multi…

    Submitted 7 January, 2023; originally announced January 2023.

    Comments: Submitted to ICASSP 2023

    MSC Class: 68T10 ACM Class: I.2.7

  7. arXiv:2211.13282  [pdf, other]

    cs.SD cs.AI eess.AS

    Voice-preserving Zero-shot Multiple Accent Conversion

    Authors: Mumin Jin, Prashant Serai, Jilong Wu, Andros Tjandra, Vimal Manohar, Qing He

    Abstract: Most people who have tried to learn a foreign language would have experienced difficulties understanding or speaking with a native speaker's accent. For native speakers, understanding or speaking a new accent is likewise a difficult task. An accent conversion system that changes a speaker's accent but preserves that speaker's voice identity, such as timbre and pitch, has the potential for a range…

    Submitted 14 October, 2023; v1 submitted 23 November, 2022; originally announced November 2022.

    Comments: Accepted to IEEE ICASSP 2023

  8. arXiv:2211.05756  [pdf, other]

    cs.CL cs.SD eess.AS

    Massively Multilingual ASR on 70 Languages: Tokenization, Architecture, and Generalization Capabilities

    Authors: Andros Tjandra, Nayan Singhal, David Zhang, Ozlem Kalinli, Abdelrahman Mohamed, Duc Le, Michael L. Seltzer

    Abstract: End-to-end multilingual ASR has become more appealing for several reasons, such as simplifying the training and deployment process and positive performance transfer from high-resource to low-resource languages. However, scaling up the number of languages, total hours, and number of unique tokens is not a trivial task. This paper explores large-scale multilingual ASR models on 70 languages. W…

    Submitted 10 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  9. arXiv:2209.05735  [pdf, other]

    eess.AS cs.CL

    Learning ASR pathways: A sparse multilingual ASR model

    Authors: Mu Yang, Andros Tjandra, Chunxi Liu, David Zhang, Duc Le, Ozlem Kalinli

    Abstract: Neural network pruning compresses automatic speech recognition (ASR) models effectively. However, in multilingual ASR, language-agnostic pruning may lead to severe performance drops on some languages because language-agnostic pruning masks may not fit all languages and discard important language-specific parameters. In this work, we present ASR pathways, a sparse multilingual ASR model that activa…

    Submitted 28 September, 2023; v1 submitted 13 September, 2022; originally announced September 2022.

    Comments: Accepted by ICASSP 2023

  10. arXiv:2203.15643  [pdf, other]

    cs.SD cs.CL cs.LG cs.NE eess.AS

    Nix-TTS: Lightweight and End-to-End Text-to-Speech via Module-wise Distillation

    Authors: Rendi Chevi, Radityo Eko Prasojo, Alham Fikri Aji, Andros Tjandra, Sakriani Sakti

    Abstract: Several solutions for lightweight TTS have shown promising results. Still, they either rely on a hand-crafted design that reaches non-optimum size or use a neural architecture search but often suffer from high training costs. We present Nix-TTS, a lightweight TTS achieved via knowledge distillation to a high-quality yet large-sized, non-autoregressive, and end-to-end (vocoder-free) TTS teacher model. Specif…

    Submitted 5 November, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

    Comments: Accepted at SLT 2022 (https://slt2022.org/). Associated materials can be seen in https://github.com/rendchevi/nix-tts

    MSC Class: 68T50 (Primary) 68T07; 68T10; 68T99 (Secondary) ACM Class: I.2.7; I.2.6; H.5.5

  11. arXiv:2111.09296  [pdf, other]

    cs.CL cs.SD eess.AS

    XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale

    Authors: Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli

    Abstract: This paper presents XLS-R, a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0. We train models with up to 2B parameters on nearly half a million hours of publicly available speech audio in 128 languages, an order of magnitude more public data than the largest known prior work. Our evaluation covers a wide range of tasks, domains, data regimes and languages, b…

    Submitted 16 December, 2021; v1 submitted 17 November, 2021; originally announced November 2021.

  12. arXiv:2110.07313  [pdf, other]

    cs.SD cs.LG eess.AS

    Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks

    Authors: Sangeeta Srivastava, Yun Wang, Andros Tjandra, Anurag Kumar, Chunxi Liu, Kritika Singh, Yatharth Saraf

    Abstract: Representation learning from unlabeled data has been of major interest in artificial intelligence research. While self-supervised speech representation learning has been popular in the speech research community, very few works have comprehensively analyzed audio representation learning for non-speech audio tasks. In this paper, we propose a self-supervised audio representation learning method and…

    Submitted 6 January, 2022; v1 submitted 14 October, 2021; originally announced October 2021.

    Comments: 4 pages. Submitted to ICASSP in Oct 2021

  13. arXiv:2107.04082  [pdf, other]

    cs.CL cs.SD eess.AS

    Improved Language Identification Through Cross-Lingual Self-Supervised Learning

    Authors: Andros Tjandra, Diptanu Gon Choudhury, Frank Zhang, Kritika Singh, Alexis Conneau, Alexei Baevski, Assaf Sela, Yatharth Saraf, Michael Auli

    Abstract: Language identification greatly impacts the success of downstream tasks such as automatic speech recognition. Recently, self-supervised speech representations learned by wav2vec 2.0 have been shown to be very effective for a range of speech tasks. We extend previous self-supervised work on language identification by experimenting with pre-trained models which were learned on real-world unconstrain…

    Submitted 17 October, 2021; v1 submitted 8 July, 2021; originally announced July 2021.

  14. arXiv:2106.15529  [pdf, other]

    cs.LG

    On Graph Neural Network Ensembles for Large-Scale Molecular Property Prediction

    Authors: Edward Elson Kosasih, Joaquin Cabezas, Xavier Sumba, Piotr Bielak, Kamil Tagowski, Kelvin Idanwekhai, Benedict Aaron Tjandra, Arian Rokkum Jamasb

    Abstract: In order to advance large-scale graph machine learning, the Open Graph Benchmark Large Scale Challenge (OGB-LSC) was proposed at the KDD Cup 2021. The PCQM4M-LSC dataset defines a molecular HOMO-LUMO property prediction task on about 3.8M graphs. In this short paper, we show our current work-in-progress solution which builds an ensemble of three graph neural network models based on GIN, Bayesian…

    Submitted 29 June, 2021; originally announced June 2021.

    Comments: 7 pages, 1 figure, 1 table

  15. arXiv:2011.02128  [pdf, other]

    cs.CL cs.SD eess.AS

    Cross-Lingual Machine Speech Chain for Javanese, Sundanese, Balinese, and Bataks Speech Recognition and Synthesis

    Authors: Sashi Novitasari, Andros Tjandra, Sakriani Sakti, Satoshi Nakamura

    Abstract: Even though over seven hundred ethnic languages are spoken in Indonesia, the available technology that could support communication within indigenous communities, as well as with people outside the villages, remains limited. As a result, indigenous communities still face isolation due to cultural barriers; languages continue to disappear. To accelerate communication, speech-to-speech translation (S2S…

    Submitted 4 November, 2020; originally announced November 2020.

    Comments: Accepted in SLTU-CCURL 2020

  16. arXiv:2011.02127  [pdf, other]

    cs.CL cs.SD eess.AS

    Sequence-to-Sequence Learning via Attention Transfer for Incremental Speech Recognition

    Authors: Sashi Novitasari, Andros Tjandra, Sakriani Sakti, Satoshi Nakamura

    Abstract: Attention-based sequence-to-sequence automatic speech recognition (ASR) requires a significant delay to recognize long utterances because the output is generated after receiving entire input sequences. Although several studies recently proposed sequence mechanisms for incremental speech recognition (ISR), using different frameworks and learning algorithms is more complicated than the standard ASR…

    Submitted 4 November, 2020; originally announced November 2020.

    Comments: Accepted in INTERSPEECH 2019

  17. arXiv:2011.02126  [pdf, other]

    cs.CL cs.SD eess.AS

    Incremental Machine Speech Chain Towards Enabling Listening while Speaking in Real-time

    Authors: Sashi Novitasari, Andros Tjandra, Tomoya Yanagita, Sakriani Sakti, Satoshi Nakamura

    Abstract: Inspired by a human speech chain mechanism, a machine speech chain framework based on deep learning was recently proposed for the semi-supervised development of automatic speech recognition (ASR) and text-to-speech synthesis (TTS) systems. However, the mechanism to listen while speaking can be done only after receiving entire input sequences. Thus, there is a significant delay when encountering lon…

    Submitted 3 November, 2020; originally announced November 2020.

    Comments: Accepted in INTERSPEECH 2020

  18. arXiv:2011.02099  [pdf, other]

    cs.CL cs.SD eess.AS

    Augmenting Images for ASR and TTS through Single-loop and Dual-loop Multimodal Chain Framework

    Authors: Johanes Effendi, Andros Tjandra, Sakriani Sakti, Satoshi Nakamura

    Abstract: Previous research has proposed a machine speech chain to enable automatic speech recognition (ASR) and text-to-speech synthesis (TTS) to assist each other in semi-supervised learning and to avoid the need for a large amount of paired speech and text data. However, that framework still requires a large amount of unpaired (speech or text) data. A prototype multimodal machine chain was then explored…

    Submitted 3 November, 2020; originally announced November 2020.

    Comments: Accepted at INTERSPEECH 2020

  19. arXiv:2010.12973  [pdf, other]

    cs.CL cs.SD eess.AS

    Unsupervised Learning of Disentangled Speech Content and Style Representation

    Authors: Andros Tjandra, Ruoming Pang, Yu Zhang, Shigeki Karita

    Abstract: We present an approach for unsupervised learning of speech representation disentangling contents and styles. Our model consists of: (1) a local encoder that captures per-frame information; (2) a global encoder that captures per-utterance information; and (3) a conditional decoder that reconstructs speech given local and global latent variables. Our experiments show that (1) the local latent variab…

    Submitted 20 June, 2021; v1 submitted 24 October, 2020; originally announced October 2020.

    Comments: Submitted to Interspeech 2021

  20. arXiv:2005.11676  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Transformer VQ-VAE for Unsupervised Unit Discovery and Speech Synthesis: ZeroSpeech 2020 Challenge

    Authors: Andros Tjandra, Sakriani Sakti, Satoshi Nakamura

    Abstract: In this paper, we report our submitted system for the ZeroSpeech 2020 challenge on Track 2019. The main theme in this challenge is to build a speech synthesizer without any textual information or phonetic labels. In order to tackle those challenges, we build a system that addresses two major components: 1) given speech audio, extract subword units in an unsupervised way and 2) re-synthes…

    Submitted 24 May, 2020; originally announced May 2020.

    Comments: Submitted to INTERSPEECH 2020

  21. arXiv:1910.10324  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Deja-vu: Double Feature Presentation and Iterated Loss in Deep Transformer Networks

    Authors: Andros Tjandra, Chunxi Liu, Frank Zhang, Xiaohui Zhang, Yongqiang Wang, Gabriel Synnaeve, Satoshi Nakamura, Geoffrey Zweig

    Abstract: Deep acoustic models typically receive features in the first layer of the network, and process increasingly abstract representations in the subsequent layers. Here, we propose to feed the input features at multiple depths in the acoustic model. As our motivation is to allow acoustic models to re-examine their input features in light of partial hypotheses, we introduce intermediate model heads and l…

    Submitted 13 February, 2020; v1 submitted 22 October, 2019; originally announced October 2019.

    Comments: Accepted in IEEE ICASSP 2020

  22. Transformer-based Acoustic Modeling for Hybrid Speech Recognition

    Authors: Yongqiang Wang, Abdelrahman Mohamed, Duc Le, Chunxi Liu, Alex Xiao, Jay Mahadeokar, Hongzhao Huang, Andros Tjandra, Xiaohui Zhang, Frank Zhang, Christian Fuegen, Geoffrey Zweig, Michael L. Seltzer

    Abstract: We propose and evaluate transformer-based acoustic models (AMs) for hybrid speech recognition. Several modeling choices are discussed in this work, including various positional embedding methods and an iterated loss to enable training deep transformers. We also present a preliminary study of using limited right context in transformer models, which makes it possible for streaming applications. We d…

    Submitted 29 April, 2020; v1 submitted 22 October, 2019; originally announced October 2019.

    Comments: to appear in ICASSP 2020

  23. arXiv:1910.00795  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Speech-to-speech Translation between Untranscribed Unknown Languages

    Authors: Andros Tjandra, Sakriani Sakti, Satoshi Nakamura

    Abstract: In this paper, we explore a method for training speech-to-speech translation tasks without any transcription or linguistic supervision. Our proposed method consists of two steps: First, we train and generate discrete representation with unsupervised term discovery with a discrete quantized autoencoder. Second, we train a sequence-to-sequence model that directly maps the source language speech to t…

    Submitted 5 October, 2019; v1 submitted 2 October, 2019; originally announced October 2019.

    Comments: Accepted in IEEE ASRU 2019. Web-page for more samples & details: https://sp2code-translation-v1.netlify.com/

  24. arXiv:1906.00579  [pdf, other]

    cs.CL cs.SD eess.AS

    Listening while Speaking and Visualizing: Improving ASR through Multimodal Chain

    Authors: Johanes Effendi, Andros Tjandra, Sakriani Sakti, Satoshi Nakamura

    Abstract: Previously, a machine speech chain, which is based on sequence-to-sequence deep learning, was proposed to mimic speech perception and production behavior. Such chains separately processed listening and speaking by automatic speech recognition (ASR) and text-to-speech synthesis (TTS) and simultaneously enabled them to teach each other in semi-supervised learning when they received unpaired data. Un…

    Submitted 14 November, 2019; v1 submitted 3 June, 2019; originally announced June 2019.

    Comments: Accepted in IEEE ASRU 2019

  25. arXiv:1905.11449  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    VQVAE Unsupervised Unit Discovery and Multi-scale Code2Spec Inverter for Zerospeech Challenge 2019

    Authors: Andros Tjandra, Berrak Sisman, Mingyang Zhang, Sakriani Sakti, Haizhou Li, Satoshi Nakamura

    Abstract: We describe our submitted system for the ZeroSpeech Challenge 2019. The current challenge theme addresses the difficulty of constructing a speech synthesizer without any text or phonetic labels and requires a system that can (1) discover subword units in an unsupervised way, and (2) synthesize the speech with a target speaker's voice. Moreover, the system should also balance the discrimination sco…

    Submitted 29 May, 2019; v1 submitted 27 May, 2019; originally announced May 2019.

    Comments: Submitted to Interspeech 2019

  26. arXiv:1810.13107  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    End-to-End Feedback Loss in Speech Chain Framework via Straight-Through Estimator

    Authors: Andros Tjandra, Sakriani Sakti, Satoshi Nakamura

    Abstract: The speech chain mechanism integrates automatic speech recognition (ASR) and text-to-speech synthesis (TTS) modules into a single cycle during training. In our previous work, we applied a speech chain mechanism for semi-supervised learning. It provides the ability for ASR and TTS to assist each other when they receive unpaired data and let them infer the missing pair and optimize the model with r…

    Submitted 31 October, 2018; originally announced October 2018.

  27. arXiv:1807.08280  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Multi-scale Alignment and Contextual History for Attention Mechanism in Sequence-to-sequence Model

    Authors: Andros Tjandra, Sakriani Sakti, Satoshi Nakamura

    Abstract: A sequence-to-sequence model is a neural network module for mapping two sequences of different lengths. The sequence-to-sequence model has three core modules: encoder, decoder, and attention. Attention is the bridge that connects the encoder and decoder modules and improves model performance in many tasks. In this paper, we propose two ideas to improve sequence-to-sequence model performance by enh…

    Submitted 22 July, 2018; originally announced July 2018.

  28. arXiv:1803.10525  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Machine Speech Chain with One-shot Speaker Adaptation

    Authors: Andros Tjandra, Sakriani Sakti, Satoshi Nakamura

    Abstract: In previous work, we developed a closed-loop speech chain model based on deep learning, in which the architecture enabled the automatic speech recognition (ASR) and text-to-speech synthesis (TTS) components to mutually improve their performance. This was accomplished by the two parts teaching each other using both labeled and unlabeled data. This approach could significantly improve model performa…

    Submitted 28 March, 2018; originally announced March 2018.

  29. arXiv:1802.10410  [pdf, other]

    cs.LG

    Tensor Decomposition for Compressing Recurrent Neural Network

    Authors: Andros Tjandra, Sakriani Sakti, Satoshi Nakamura

    Abstract: In machine learning, the Recurrent Neural Network (RNN) has become a popular architecture for sequential data modeling. However, behind the impressive performance, RNNs require a large number of parameters for both training and inference. In this paper, we try to reduce the number of parameters while maintaining the expressive power of the RNN. We utilize several tensor decom…

    Submitted 8 May, 2018; v1 submitted 28 February, 2018; originally announced February 2018.

    Comments: Accepted at IJCNN 2018. Source code URL: https://github.com/androstj/tensor_rnn

  30. arXiv:1710.10774  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Sequence-to-Sequence ASR Optimization via Reinforcement Learning

    Authors: Andros Tjandra, Sakriani Sakti, Satoshi Nakamura

    Abstract: Despite the success of sequence-to-sequence approaches in automatic speech recognition (ASR) systems, the models still suffer from several problems, mainly due to the mismatch between the training and inference conditions. In the sequence-to-sequence architecture, the model is trained to predict the grapheme of the current time-step given the input of speech signal and the ground-truth grapheme hi…

    Submitted 28 February, 2018; v1 submitted 30 October, 2017; originally announced October 2017.

    Comments: Accepted at ICASSP 2018

  31. arXiv:1709.07814  [pdf, other]

    cs.CL cs.LG cs.SD

    Attention-based Wav2Text with Feature Transfer Learning

    Authors: Andros Tjandra, Sakriani Sakti, Satoshi Nakamura

    Abstract: Conventional automatic speech recognition (ASR) typically performs multi-level pattern recognition tasks that map the acoustic speech waveform into a hierarchy of speech units. However, it is widely known that information loss in the earlier stage can propagate through the later stages. After the resurgence of deep learning, interest has emerged in the possibility of developing a purely end-to-end ASR…

    Submitted 22 September, 2017; originally announced September 2017.

    Comments: Accepted at ASRU 2017

  32. arXiv:1707.04879  [pdf, other]

    cs.CL cs.LG cs.SD

    Listening while Speaking: Speech Chain by Deep Learning

    Authors: Andros Tjandra, Sakriani Sakti, Satoshi Nakamura

    Abstract: Despite the close relationship between speech perception and production, research in automatic speech recognition (ASR) and text-to-speech synthesis (TTS) has progressed more or less independently without exerting much mutual influence on each other. In human communication, on the other hand, a closed-loop speech chain mechanism with auditory feedback from the speaker's mouth to her ear is crucial…

    Submitted 16 July, 2017; originally announced July 2017.

  33. arXiv:1706.02222  [pdf, other]

    cs.LG cs.CL stat.ML

    Gated Recurrent Neural Tensor Network

    Authors: Andros Tjandra, Sakriani Sakti, Ruli Manurung, Mirna Adriani, Satoshi Nakamura

    Abstract: Recurrent Neural Networks (RNNs), which are a powerful scheme for modeling temporal and sequential data, need to capture long-term dependencies in datasets and represent them in hidden layers with a powerful model to capture more information from inputs. For modeling long-term dependencies in a dataset, the gating mechanism concept can help RNNs remember and forget previous information. Representin…

    Submitted 7 June, 2017; originally announced June 2017.

    Comments: Accepted at IJCNN 2016. URL: http://ieeexplore.ieee.org/document/7727233/

  34. arXiv:1705.08091  [pdf, other]

    cs.CL

    Local Monotonic Attention Mechanism for End-to-End Speech and Language Processing

    Authors: Andros Tjandra, Sakriani Sakti, Satoshi Nakamura

    Abstract: Recently, encoder-decoder neural networks have shown impressive performance on many sequence-related tasks. The architecture commonly uses an attentional mechanism which allows the model to learn alignments between the source and the target sequence. Most attentional mechanisms used today are based on a global attention property which requires a computation of a weighted summarization of the whole…

    Submitted 3 November, 2017; v1 submitted 23 May, 2017; originally announced May 2017.

    Comments: Accepted at IJCNLP 2017 --- (V2: added more experiments on G2P & MT)

  35. Compressing Recurrent Neural Network with Tensor Train

    Authors: Andros Tjandra, Sakriani Sakti, Satoshi Nakamura

    Abstract: Recurrent Neural Networks (RNNs) are a popular choice for modeling temporal and sequential tasks and achieve state-of-the-art performance on various complex problems. However, most of the state-of-the-art RNNs have millions of parameters and require many computational resources for training and predicting new data. This paper proposes an alternative RNN model to reduce the number of parameters…

    Submitted 22 May, 2017; originally announced May 2017.

    Comments: Accepted at IJCNN 2017