Showing 1–50 of 73 results for author: Waibel, A

  1. arXiv:2406.16777  [pdf, other]

    cs.CL cs.AI

    Blending LLMs into Cascaded Speech Translation: KIT's Offline Speech Translation System for IWSLT 2024

    Authors: Sai Koneru, Thai-Binh Nguyen, Ngoc-Quan Pham, Danni Liu, Zhaolin Li, Alexander Waibel, Jan Niehues

    Abstract: Large Language Models (LLMs) are currently under exploration for various tasks, including Automatic Speech Recognition (ASR), Machine Translation (MT), and even End-to-End Speech Translation (ST). In this paper, we present KIT's offline submission in the constrained + LLM track by incorporating recently proposed techniques that can be added to any cascaded speech translation. Specifically, we inte…

    Submitted 24 June, 2024; originally announced June 2024.

  2. arXiv:2406.10421  [pdf, other]

    cs.CL

    SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading

    Authors: Tu Anh Dinh, Carlos Mullov, Leonard Bärmann, Zhaolin Li, Danni Liu, Simon Reiß, Jueun Lee, Nathan Lerzer, Fabian Ternava, Jianfeng Gao, Tobias Röddiger, Alexander Waibel, Tamim Asfour, Michael Beigl, Rainer Stiefelhagen, Carsten Dachsbacher, Klemens Böhm, Jan Niehues

    Abstract: With the rapid development of Large Language Models (LLMs), it is crucial to have benchmarks which can evaluate the ability of LLMs on different domains. One common use of LLMs is performing tasks on scientific topics, such as writing algorithms, querying databases or giving mathematical proofs. Inspired by the way university students are evaluated on such tasks, in this paper, we propose SciEx -…

    Submitted 12 July, 2024; v1 submitted 14 June, 2024; originally announced June 2024.

    ACM Class: I.2.7

  3. arXiv:2405.04327  [pdf, other]

    cs.CV

    Audio-Visual Speech Representation Expert for Enhanced Talking Face Video Generation and Evaluation

    Authors: Dogucan Yaman, Fevziye Irem Eyiokur, Leonard Bärmann, Seymanur Aktı, Hazım Kemal Ekenel, Alexander Waibel

    Abstract: In the task of talking face generation, the objective is to generate a face video with lips synchronized to the corresponding audio while preserving visual details and identity information. Current methods face the challenge of learning accurate lip synchronization while avoiding detrimental effects on visual quality, as well as robustly evaluating such synchronization. To tackle these problems, w…

    Submitted 7 May, 2024; originally announced May 2024.

    Comments: CVPR2024 NTIRE Workshop

  4. arXiv:2402.17633  [pdf, other]

    cs.CL

    From Text Segmentation to Smart Chaptering: A Novel Benchmark for Structuring Video Transcriptions

    Authors: Fabian Retkowski, Alexander Waibel

    Abstract: Text segmentation is a fundamental task in natural language processing, where documents are split into contiguous sections. However, prior research in this area has been constrained by limited datasets, which are either small in scale, synthesized, or only contain well-structured documents. In this paper, we address these limitations by introducing a novel benchmark YTSeg focusing on spoken conten…

    Submitted 27 February, 2024; originally announced February 2024.

    Comments: Accepted to EACL 2024

  5. arXiv:2401.04482  [pdf, other]

    cs.CL cs.LG

    Continuously Learning New Words in Automatic Speech Recognition

    Authors: Christian Huber, Alexander Waibel

    Abstract: Despite recent advances, Automatic Speech Recognition (ASR) systems are still far from perfect. Typical errors include acronyms, named entities and domain-specific special words for which little or no data is available. To address the problem of recognizing these words, we propose a self-supervised continual learning approach. Given the audio of a lecture talk with corresponding slides, we bias t…

    Submitted 9 January, 2024; originally announced January 2024.

  6. arXiv:2309.11379  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff

    Authors: Peter Polák, Brian Yan, Shinji Watanabe, Alex Waibel, Ondřej Bojar

    Abstract: Blockwise self-attentional encoder models have recently emerged as one promising end-to-end approach to simultaneous speech translation. These models employ a blockwise beam search with hypothesis reliability scoring to determine when to wait for more input speech before translating further. However, this method maintains multiple hypotheses until the entire speech input is consumed -- this scheme…

    Submitted 20 September, 2023; originally announced September 2023.

    Comments: Accepted at INTERSPEECH 2023

    Journal ref: Polák, P., Yan, B., Watanabe, S., Waibel, A., Bojar, O. (2023) Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff. Proc. INTERSPEECH 2023, 3979-3983

  7. arXiv:2309.04316  [pdf, other]

    cs.RO cs.AI

    Incremental Learning of Humanoid Robot Behavior from Natural Interaction and Large Language Models

    Authors: Leonard Bärmann, Rainer Kartmann, Fabian Peller-Konrad, Jan Niehues, Alex Waibel, Tamim Asfour

    Abstract: Natural-language dialog is key for intuitive human-robot interaction. It can be used not only to express humans' intents, but also to communicate instructions for improvement if a robot does not understand a command correctly. Of great importance is to endow robots with the ability to learn from such interaction experience in an incremental way to allow them to improve their behaviors or avoid mis…

    Submitted 16 May, 2024; v1 submitted 8 September, 2023; originally announced September 2023.

    Comments: This version (v3) adds further quantitative evaluation and many improvements. v2 was presented at the Workshop on Language and Robot Learning (LangRob) at the Conference on Robot Learning (CoRL) 2023. Supplementary video available at https://youtu.be/y5O2mRGtsLM

  8. arXiv:2308.11380  [pdf, other]

    cs.SD cs.CL eess.AS

    Convoifilter: A case study of doing cocktail party speech recognition

    Authors: Thai-Binh Nguyen, Alexander Waibel

    Abstract: This paper presents an end-to-end model designed to improve automatic speech recognition (ASR) for a particular speaker in a crowded, noisy environment. The model utilizes a single-channel speech enhancement module that isolates the speaker's voice from background noise (ConVoiFilter) and an ASR module. The model can decrease ASR's word error rate (WER) from 80% to 26.4% through this approach. Typ…

    Submitted 7 April, 2024; v1 submitted 22 August, 2023; originally announced August 2023.

    Comments: Accepted at HSCMA 2024

  9. arXiv:2308.03415  [pdf, other]

    cs.CL cs.AI

    End-to-End Evaluation for Low-Latency Simultaneous Speech Translation

    Authors: Christian Huber, Tu Anh Dinh, Carlos Mullov, Ngoc Quan Pham, Thai Binh Nguyen, Fabian Retkowski, Stefan Constantin, Enes Yavuz Ugan, Danni Liu, Zhaolin Li, Sai Koneru, Jan Niehues, Alexander Waibel

    Abstract: The challenge of low-latency speech translation has recently drawn significant interest in the research community, as shown by several publications and shared tasks. Therefore, it is essential to evaluate these different approaches in realistic scenarios. However, currently only specific aspects of the systems are evaluated and often it is not possible to compare different approaches. In this work…

    Submitted 23 October, 2023; v1 submitted 7 August, 2023; originally announced August 2023.

  10. arXiv:2307.09368  [pdf, other]

    cs.CV

    Audio-driven Talking Face Generation by Overcoming Unintended Information Flow

    Authors: Dogucan Yaman, Fevziye Irem Eyiokur, Leonard Bärmann, Hazim Kemal Ekenel, Alexander Waibel

    Abstract: Audio-driven talking face generation is the task of creating a lip-synchronized, realistic face video from given audio and reference frames. This involves two major challenges: overall visual quality of generated images on the one hand, and audio-visual synchronization of the mouth part on the other hand. In this paper, we start by identifying several problematic aspects of synchronization methods…

    Submitted 11 December, 2023; v1 submitted 18 July, 2023; originally announced July 2023.

  11. arXiv:2306.05320  [pdf, other]

    cs.CL cs.SD

    KIT's Multilingual Speech Translation System for IWSLT 2023

    Authors: Danni Liu, Thai Binh Nguyen, Sai Koneru, Enes Yavuz Ugan, Ngoc-Quan Pham, Tuan-Nam Nguyen, Tu Anh Dinh, Carlos Mullov, Alexander Waibel, Jan Niehues

    Abstract: Many existing speech translation benchmarks focus on native-English speech in high-quality recording conditions, which often do not match the conditions in real-life use-cases. In this paper, we describe our speech translation system for the multilingual track of IWSLT 2023, which evaluates translation quality on scientific conference talks. The test condition features accented input speech and te…

    Submitted 12 July, 2023; v1 submitted 8 June, 2023; originally announced June 2023.

    Comments: IWSLT 2023

  12. arXiv:2305.03873  [pdf, other]

    cs.CL

    Train Global, Tailor Local: Minimalist Multilingual Translation into Endangered Languages

    Authors: Zhong Zhou, Jan Niehues, Alex Waibel

    Abstract: In many humanitarian scenarios, translation into severely low resource languages often does not require a universal translation engine, but a dedicated text-specific translation engine. For example, healthcare records, hygienic procedures, government communication, emergency procedures and religious texts are all limited texts. While generic translation engines for all languages do not exist, tran…

    Submitted 5 May, 2023; originally announced May 2023.

    Comments: In Proceedings of the 6th Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT) of the 17th Conference of the European Chapter of the Association for Computational Linguistics in 2023

  13. arXiv:2211.11703   

    cs.CL cs.SD eess.AS

    Towards continually learning new languages

    Authors: Ngoc-Quan Pham, Jan Niehues, Alexander Waibel

    Abstract: Multilingual speech recognition with neural networks is often implemented with batch-learning, when all of the languages are available before training. An ability to add new languages after the prior training sessions can be economically beneficial, but the main challenge is catastrophic forgetting. In this work, we combine the qualities of weight factorization and elastic weight consolidation in…

    Submitted 1 March, 2023; v1 submitted 21 November, 2022; originally announced November 2022.

    Comments: Work in progress

  14. arXiv:2211.03705  [pdf, other]

    cs.CV eess.IV

    A Survey on Computer Vision based Human Analysis in the COVID-19 Era

    Authors: Fevziye Irem Eyiokur, Alperen Kantarcı, Mustafa Ekrem Erakın, Naser Damer, Ferda Ofli, Muhammad Imran, Janez Križaj, Albert Ali Salah, Alexander Waibel, Vitomir Štruc, Hazım Kemal Ekenel

    Abstract: The emergence of COVID-19 has had a global and profound impact, not only on society as a whole, but also on the lives of individuals. Various prevention measures were introduced around the world to limit the transmission of the disease, including face masks, mandates for social distancing and regular disinfection in public spaces, and the use of screening applications. These developments also trig…

    Submitted 7 November, 2022; originally announced November 2022.

    Comments: Submitted to Image and Vision Computing, 44 pages, 7 figures

  15. arXiv:2210.08992  [pdf, other]

    cs.CL cs.SD eess.AS

    Language-agnostic Code-Switching in Sequence-To-Sequence Speech Recognition

    Authors: Enes Yavuz Ugan, Christian Huber, Juan Hussain, Alexander Waibel

    Abstract: Code-Switching (CS) refers to the phenomenon of alternately using words and phrases from different languages. While today's neural end-to-end (E2E) models deliver state-of-the-art performances on the task of automatic speech recognition (ASR), it is commonly known that these systems are very data-intensive. However, only a little transcribed and aligned CS speech is available. To overcome t…

    Submitted 3 July, 2023; v1 submitted 17 October, 2022; originally announced October 2022.

    Comments: 18 pages

  16. arXiv:2210.01512  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Code-Switching without Switching: Language Agnostic End-to-End Speech Translation

    Authors: Christian Huber, Enes Yavuz Ugan, Alexander Waibel

    Abstract: We propose a) a Language Agnostic end-to-end Speech Translation model (LAST), and b) a data augmentation strategy to increase code-switching (CS) performance. With increasing globalization, multiple languages are increasingly used interchangeably during fluent speech. Such CS complicates traditional speech recognition and translation, as we must recognize which language was spoken first and then a…

    Submitted 9 November, 2022; v1 submitted 4 October, 2022; originally announced October 2022.

  17. arXiv:2206.04523  [pdf, other]

    cs.CL cs.CV cs.SD eess.AS eess.IV

    Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos

    Authors: Alexander Waibel, Moritz Behr, Fevziye Irem Eyiokur, Dogucan Yaman, Tuan-Nam Nguyen, Carlos Mullov, Mehmet Arif Demirtas, Alperen Kantarcı, Stefan Constantin, Hazım Kemal Ekenel

    Abstract: In this paper, we propose a neural end-to-end system for voice preserving, lip-synchronous translation of videos. The system is designed to combine multiple component models and produces a video of the original speaker speaking in the target language that is lip-synchronous with the target speech, yet maintains emphases in speech, voice characteristics, and face video of the original speaker. The pipe…

    Submitted 9 June, 2022; originally announced June 2022.

  18. arXiv:2205.12304  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Adaptive multilingual speech recognition with pretrained models

    Authors: Ngoc-Quan Pham, Alex Waibel, Jan Niehues

    Abstract: Multilingual speech recognition with supervised learning has achieved great results as reflected in recent research. With the development of pretraining methods on audio and text data, it is imperative to transfer the knowledge from unsupervised multilingual models to facilitate recognition, especially in many languages with limited data. Our work investigated the effectiveness of using two pretra…

    Submitted 24 May, 2022; originally announced May 2022.

    Comments: Submitted to INTERSPEECH 2022

  19. arXiv:2204.10648  [pdf, other]

    cs.CV

    Exposure Correction Model to Enhance Image Quality

    Authors: Fevziye Irem Eyiokur, Dogucan Yaman, Hazım Kemal Ekenel, Alexander Waibel

    Abstract: Exposure errors in an image cause a degradation in the contrast and low visibility in the content. In this paper, we address this problem and propose an end-to-end exposure correction model in order to handle both under- and overexposure errors with a single model. Our model contains an image encoder, consecutive residual blocks, and image decoder to synthesize the corrected image. We utilize perc…

    Submitted 22 April, 2022; originally announced April 2022.

    Comments: Accepted for CVPR2022 NTIRE Workshop

  20. arXiv:2204.06028  [pdf, other]

    cs.CL

    CUNI-KIT System for Simultaneous Speech Translation Task at IWSLT 2022

    Authors: Peter Polák, Ngoc-Quan Ngoc, Tuan-Nam Nguyen, Danni Liu, Carlos Mullov, Jan Niehues, Ondřej Bojar, Alexander Waibel

    Abstract: In this paper, we describe our submission to the Simultaneous Speech Translation at IWSLT 2022. We explore strategies to utilize an offline model in a simultaneous setting without the need to modify the original model. In our experiments, we show that our onlinization algorithm is almost on par with the offline setting while being $3\times$ faster than offline in terms of latency on the test set.…

    Submitted 11 May, 2022; v1 submitted 12 April, 2022; originally announced April 2022.

    Comments: Accepted to IWSLT22

  21. arXiv:2203.15404  [pdf, other]

    cs.CL

    Short-Term Word-Learning in a Dynamically Changing Environment

    Authors: Christian Huber, Rishu Kumar, Ondřej Bojar, Alexander Waibel

    Abstract: Neural sequence-to-sequence automatic speech recognition (ASR) systems are in principle open vocabulary systems, when using appropriate modeling units. In practice, however, they often fail to recognize words not seen during training, e.g., named entities, numbers or technical terms. To alleviate this problem, Huber et al. proposed to supplement an end-to-end ASR system with a word/phrase memory a…

    Submitted 29 March, 2022; originally announced March 2022.

    Comments: This paper was submitted to Interspeech 2022

  22. arXiv:2108.07127  [pdf, other]

    cs.CL

    Active Learning for Massively Parallel Translation of Constrained Text into Low Resource Languages

    Authors: Zhong Zhou, Alex Waibel

    Abstract: We translate a closed text that is known in advance and available in many languages into a new and severely low resource language. Most human translation efforts adopt a portion-based approach to translate consecutive pages/chapters in order, which may not suit machine translation. We compare the portion-based approach that optimizes coherence of the text locally with the random sampling approach…

    Submitted 26 October, 2021; v1 submitted 16 August, 2021; originally announced August 2021.

    Journal ref: In Proceedings of the LoResMT Workshop of the 18th Biennial Machine Translation Summit in 2021

  23. arXiv:2107.02268  [pdf, other]

    cs.CL cs.LG

    Instant One-Shot Word-Learning for Context-Specific Neural Sequence-to-Sequence Speech Recognition

    Authors: Christian Huber, Juan Hussain, Sebastian Stüker, Alexander Waibel

    Abstract: Neural sequence-to-sequence systems deliver state-of-the-art performance for automatic speech recognition (ASR). When using appropriate modeling units, e.g., byte-pair encoded characters, these systems are in principle open vocabulary systems. In practice, however, they often fail to recognize words not seen during training, e.g., named entities, numbers or technical terms. To alleviate this probl…

    Submitted 5 July, 2021; originally announced July 2021.

    Comments: 7 pages, 1 figure, 4 tables

  24. arXiv:2106.03210  [pdf, other]

    cs.CV

    Alpha Matte Generation from Single Input for Portrait Matting

    Authors: Dogucan Yaman, Hazım Kemal Ekenel, Alexander Waibel

    Abstract: In portrait matting, the goal is to predict an alpha matte that identifies the effect of each pixel on the foreground subject. Traditional approaches and most of the existing works utilized an additional input, e.g., trimap, background image, to predict alpha matte. However, (1) providing additional input is not always practical, and (2) models are too sensitive to these additional inputs. To…

    Submitted 25 April, 2022; v1 submitted 6 June, 2021; originally announced June 2021.

    Comments: Accepted for CVPR 2022 NTIRE Workshop

  25. arXiv:2105.03010  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Efficient Weight factorization for Multilingual Speech Recognition

    Authors: Ngoc-Quan Pham, Tuan-Nam Nguyen, Sebastian Stueker, Alexander Waibel

    Abstract: End-to-end multilingual speech recognition involves using a single model training on a compositional speech corpus including many languages, resulting in a single neural network to handle transcribing different languages. Due to the fact that each language in the training data has different characteristics, the shared network may struggle to optimize for all various languages simultaneously. In th…

    Submitted 6 May, 2021; originally announced May 2021.

    Comments: Submitted to Interspeech 2021

  26. CAGAN: Text-To-Image Generation with Combined Attention GANs

    Authors: Henning Schulze, Dogucan Yaman, Alexander Waibel

    Abstract: Generating images according to natural language descriptions is a challenging task. Prior research has mainly focused on enhancing the quality of generation by investigating the use of spatial attention and/or textual attention, thereby neglecting the relationship between channels. In this work, we propose the Combined Attention Generative Adversarial Network (CAGAN) to generate photo-realistic image…

    Submitted 14 January, 2022; v1 submitted 26 April, 2021; originally announced April 2021.

    Journal ref: LNCS 13024 (2021) 392-404

  27. arXiv:2104.05848  [pdf, other]

    cs.CL cs.AI

    Family of Origin and Family of Choice: Massively Parallel Lexiconized Iterative Pretraining for Severely Low Resource Machine Translation

    Authors: Zhong Zhou, Alex Waibel

    Abstract: We translate a closed text that is known in advance into a severely low resource language by leveraging massive source parallelism. In other words, given a text in 124 source languages, we translate it into a severely low resource language using only ~1,000 lines of low resource data without any external help. Firstly, we propose a systematic method to rank and choose source languages that are clo…

    Submitted 15 October, 2021; v1 submitted 12 April, 2021; originally announced April 2021.

    Journal ref: In Proceedings of the 3rd Workshop on Research in Computational Typology and Multilingual NLP of the 20th Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technologies in 2021

  28. Unconstrained Face-Mask & Face-Hand Datasets: Building a Computer Vision System to Help Prevent the Transmission of COVID-19

    Authors: Fevziye Irem Eyiokur, Hazım Kemal Ekenel, Alexander Waibel

    Abstract: Health organizations advise social distancing, wearing a face mask, and avoiding touching the face to prevent the spread of coronavirus. Based on these protective measures, we developed a computer vision system to help prevent the transmission of COVID-19. Specifically, the developed system performs face mask detection, face-hand interaction detection, and measures social distance. To train and evaluate…

    Submitted 8 December, 2021; v1 submitted 15 March, 2021; originally announced March 2021.

    Comments: 9 pages, 4 figures

    Journal ref: SIViP (2022)

  29. arXiv:2103.06689  [pdf, other]

    cs.CL

    Unsupervised Transfer Learning in Multilingual Neural Machine Translation with Cross-Lingual Word Embeddings

    Authors: Carlos Mullov, Ngoc-Quan Pham, Alexander Waibel

    Abstract: In this work we look into adding a new language to a multilingual NMT system in an unsupervised fashion. Under the utilization of pre-trained cross-lingual word embeddings we seek to exploit a language independent multilingual sentence representation to easily generalize to a new language. While using cross-lingual embeddings for word lookup we decode from a yet entirely unseen source language in…

    Submitted 11 March, 2021; originally announced March 2021.

  30. arXiv:2010.03449  [pdf, other]

    cs.CV

    Super-Human Performance in Online Low-latency Recognition of Conversational Speech

    Authors: Thai-Son Nguyen, Sebastian Stueker, Alex Waibel

    Abstract: Achieving super-human performance in recognizing human speech has been a goal for several decades, as researchers have worked on increasingly challenging tasks. In the 1990s it was discovered that conversational speech between two humans is considerably more difficult than read speech, as hesitations, disfluencies, false starts and sloppy articulation complicate acoustic processing a…

    Submitted 26 July, 2021; v1 submitted 7 October, 2020; originally announced October 2020.

    Comments: To appear in Interspeech 2021

  31. arXiv:2005.09940  [pdf, other]

    eess.AS cs.CL cs.SD

    Relative Positional Encoding for Speech Recognition and Direct Translation

    Authors: Ngoc-Quan Pham, Thanh-Le Ha, Tuan-Nam Nguyen, Thai-Son Nguyen, Elizabeth Salesky, Sebastian Stueker, Jan Niehues, Alexander Waibel

    Abstract: Transformer models are powerful sequence-to-sequence architectures that are capable of directly mapping speech inputs to transcriptions or translations. However, the mechanism for modeling positions in this model was tailored for text modeling, and thus is less ideal for acoustic inputs. In this work, we adapt the relative position encoding scheme to the Speech Transformer, where the key addition…

    Submitted 20 May, 2020; originally announced May 2020.

    Comments: Submitted to Interspeech 2020

  32. arXiv:2004.04243  [pdf, other]

    cs.CL

    Error correction and extraction in request dialogs

    Authors: Stefan Constantin, Alex Waibel

    Abstract: We propose a dialog system utility component that gets the last two utterances of a user and can detect whether the last utterance is an error correction of the second last utterance. If yes, it corrects the second last utterance according to the error correction in the last utterance and outputs the extracted pairs of reparandum and repair entity. This component offers two advantages, learning th…

    Submitted 20 June, 2023; v1 submitted 8 April, 2020; originally announced April 2020.

    Comments: 10 pages, 8 figures, 3 tables, presented at ICNLSP 2022

  33. arXiv:2003.10022  [pdf, other]

    eess.AS cs.CL cs.SD

    High Performance Sequence-to-Sequence Model for Streaming Speech Recognition

    Authors: Thai-Son Nguyen, Ngoc-Quan Pham, Sebastian Stueker, Alex Waibel

    Abstract: Recently sequence-to-sequence models have started to achieve state-of-the-art performance on standard speech recognition tasks when processing audio data in batch mode, i.e., the complete audio data is available when starting processing. However, when it comes to performing run-on recognition on an input stream of audio data while producing recognition results in real-time and with low word-based…

    Submitted 26 July, 2020; v1 submitted 22 March, 2020; originally announced March 2020.

    Comments: To appear in Interspeech 2020

  34. arXiv:2003.09891  [pdf, other]

    eess.AS cs.CL cs.SD

    Low Latency ASR for Simultaneous Speech Translation

    Authors: Thai Son Nguyen, Jan Niehues, Eunah Cho, Thanh-Le Ha, Kevin Kilgour, Markus Muller, Matthias Sperber, Sebastian Stueker, Alex Waibel

    Abstract: User studies have shown that reducing the latency of our simultaneous lecture translation system should be the most important goal. We therefore have worked on several techniques for reducing the latency for both components, the automatic speech recognition and the speech translation module. Since the commonly used commitment latency is not appropriate in our case of continuous stream decoding, we…

    Submitted 22 March, 2020; originally announced March 2020.

  35. arXiv:2003.04194  [pdf, ps, other]

    eess.AS cs.CV cs.LG cs.SD

    Toward Cross-Domain Speech Recognition with End-to-End Models

    Authors: Thai-Son Nguyen, Sebastian Stüker, Alex Waibel

    Abstract: In the area of multi-domain speech recognition, research in the past focused on hybrid acoustic models to build cross-domain and domain-invariant speech recognition systems. In this paper, we empirically examine the difference in behavior between hybrid acoustic models and neural end-to-end systems when mixing acoustic training data from several domains. For these experiments we composed a multi-d…

    Submitted 9 March, 2020; originally announced March 2020.

    Comments: Presented in Life-Long Learning for Spoken Language Systems Workshop - ASRU 2019

  36. arXiv:2001.11120  [pdf, other]

    cs.CV

    Gun Source and Muzzle Head Detection

    Authors: Zhong Zhou, Isak Czeresnia Etinger, Florian Metze, Alexander Hauptmann, Alexander Waibel

    Abstract: There is a surging need across the world for protection against gun violence. There are three main areas that we have identified as challenging in research that tries to curb gun violence: temporal location of gunshots, gun type prediction and gun source (shooter) detection. Our task is gun source detection and muzzle head detection, where the muzzle head is the round opening of the firing end of…

    Submitted 29 January, 2020; originally announced January 2020.

    Comments: EI 2020

    Journal ref: Electronic Imaging 2020.8 (2020): 187-1

  37. arXiv:1912.04235  [pdf, other]

    cs.RO cs.HC

    An Interactive Indoor Drone Assistant

    Authors: Tino Fuhrman, David Schneider, Felix Altenberg, Tung Nguyen, Simon Blasen, Stefan Constantin, Alex Waibel

    Abstract: With the rapid advance of sophisticated control algorithms, the capabilities of drones to stabilise, fly and manoeuvre autonomously have dramatically improved, enabling us to pay greater attention to entire missions and the interaction of a drone with humans and with its environment during the course of such a mission. In this paper, we present an indoor office drone assistant that is tasked to ru…

    Submitted 9 December, 2019; originally announced December 2019.

    Comments: Presented at IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2019

  38. arXiv:1912.02610  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD stat.ML

    Bimodal Speech Emotion Recognition Using Pre-Trained Language Models

    Authors: Verena Heusser, Niklas Freymuth, Stefan Constantin, Alex Waibel

    Abstract: Speech emotion recognition is a challenging task and an important step towards more natural human-machine interaction. We show that pre-trained language models can be fine-tuned for text emotion recognition, achieving an accuracy of 69.5% on Task 4A of SemEval 2017, improving upon the previous state of the art by over 3% absolute. We combine these language models with speech emotion recognition, a…

    Submitted 29 November, 2019; originally announced December 2019.

    Comments: Life-Long Learning for Spoken Language Systems ASRU 2019

    ACM Class: I.2.7

  39. arXiv:1911.02709  [pdf, other]

    cs.CL

    Using Interlinear Glosses as Pivot in Low-Resource Multilingual Machine Translation

    Authors: Zhong Zhou, Lori Levin, David R. Mortensen, Alex Waibel

    Abstract: We demonstrate a new approach to Neural Machine Translation (NMT) for low-resource languages using a ubiquitous linguistic resource, Interlinear Glossed Text (IGT). IGT represents a non-English sentence as a sequence of English lemmas and morpheme labels. As such, it can serve as a pivot or interlingua for NMT. Our contribution is four-fold. Firstly, we pool IGT for 1,497 languages in ODIN (54,545…

    Submitted 3 March, 2020; v1 submitted 6 November, 2019; originally announced November 2019.

  40. arXiv:1910.13296  [pdf, other]

    eess.AS cs.CV cs.LG cs.SD

    Improving sequence-to-sequence speech recognition training with on-the-fly data augmentation

    Authors: Thai-Son Nguyen, Sebastian Stueker, Jan Niehues, Alex Waibel

    Abstract: Sequence-to-Sequence (S2S) models recently started to show state-of-the-art performance for automatic speech recognition (ASR). With these large and deep models overfitting remains the largest problem, outweighing performance improvements that can be obtained from better architectures. One solution to the overfitting problem is increasing the amount of available training data and the variety exhib…

    Submitted 3 February, 2020; v1 submitted 29 October, 2019; originally announced October 2019.

    Comments: To appear in ICASSP 2020

  41. arXiv:1909.13790  [pdf, other]

    cs.CL cs.SD eess.AS

    Incremental processing of noisy user utterances in the spoken language understanding task

    Authors: Stefan Constantin, Jan Niehues, Alex Waibel

    Abstract: The state-of-the-art neural network architectures make it possible to create spoken language understanding systems with high quality and fast processing time. One major challenge for real-world applications is the high latency of these systems caused by triggered actions with high execution times. If an action can be separated into subactions, the reaction time of the systems can be improved thro…

    Submitted 30 September, 2019; originally announced September 2019.

    Comments: 10 pages, 3 figures, 7 tables, forthcoming in W-NUT 2019

  42. arXiv:1906.08584  [pdf, other]

    cs.CL

    Improving Zero-shot Translation with Language-Independent Constraints

    Authors: Ngoc-Quan Pham, Jan Niehues, Thanh-Le Ha, Alex Waibel

    Abstract: An important concern in training multilingual neural machine translation (NMT) is to translate between language pairs unseen during training, i.e., zero-shot translation. Improving this ability kills two birds with one stone by providing an alternative to pivot translation which also allows us to better understand how the model captures information between languages. In this work, we carried out a…

    Submitted 20 June, 2019; originally announced June 2019.

    Comments: 10 pages version accepted in WMT 2019

  43. arXiv:1906.01617  [pdf, other]

    cs.CL

    Self-Attentional Models for Lattice Inputs

    Authors: Matthias Sperber, Graham Neubig, Ngoc-Quan Pham, Alex Waibel

    Abstract: Lattices are an efficient and effective method to encode ambiguity of upstream systems in natural language processing tasks, for example to compactly capture multiple speech recognition hypotheses, or to represent multiple linguistic analyses. Previous work has extended recurrent neural networks to model lattice inputs and achieved improvements in various tasks, but these models suffer from very s…

    Submitted 4 June, 2019; originally announced June 2019.

    Comments: ACL 2019

  44. arXiv:1906.00556  [pdf, other]

    cs.CL

    Fluent Translations from Disfluent Speech in End-to-End Speech Translation

    Authors: Elizabeth Salesky, Matthias Sperber, Alex Waibel

    Abstract: Spoken language translation applications for speech suffer due to conversational speech phenomena, particularly the presence of disfluencies. With the rise of end-to-end speech translation models, processing steps such as disfluency removal that were previously an intermediate step between speech recognition and machine translation need to be incorporated into model architectures. We use a sequenc…

    Submitted 2 June, 2019; originally announced June 2019.

    Comments: Accepted at NAACL 2019

  45. arXiv:1904.13377  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Very Deep Self-Attention Networks for End-to-End Speech Recognition

    Authors: Ngoc-Quan Pham, Thai-Son Nguyen, Jan Niehues, Markus Müller, Sebastian Stüker, Alexander Waibel

    Abstract: Recently, end-to-end sequence-to-sequence models for speech recognition have gained significant interest in the research community. While previous architecture choices revolve around time-delay neural networks (TDNN) and long short-term memory (LSTM) recurrent neural networks, we propose to use self-attention via the Transformer architecture as an alternative. Our analysis shows that deep Transfor…

    Submitted 3 May, 2019; v1 submitted 30 April, 2019; originally announced April 2019.

    Comments: Submitted to INTERSPEECH 2019

  46. arXiv:1904.07209  [pdf, other]

    cs.CL

    Attention-Passing Models for Robust and Data-Efficient End-to-End Speech Translation

    Authors: Matthias Sperber, Graham Neubig, Jan Niehues, Alex Waibel

    Abstract: Speech translation has traditionally been approached through cascaded models consisting of a speech recognizer trained on a corpus of transcribed speech, and a machine translation system trained on parallel texts. Several recent works have shown the feasibility of collapsing the cascade into a single, direct model that can be trained in an end-to-end fashion on a corpus of translated speech. Howev…

    Submitted 15 April, 2019; originally announced April 2019.

    Comments: Authors' final version, accepted at TACL 2019

  47. arXiv:1904.02147  [pdf, other]

    eess.AS cs.LG cs.SD

    Learning Shared Encoding Representation for End-to-End Speech Recognition Models

    Authors: Thai-Son Nguyen, Sebastian Stueker, Alex Waibel

    Abstract: In this work, we learn a shared encoding representation for a multi-task neural network model optimized with connectionist temporal classification (CTC) and conventional framewise cross-entropy training criteria. Our experiments show that the multi-task training not only tackles the complexity of optimizing CTC models such as acoustic-to-word but also results in significant improvement compared to…

    Submitted 31 March, 2019; originally announced April 2019.

    Comments: arXiv admin note: substantial text overlap with arXiv:1902.01951

  48. arXiv:1902.01951   

    eess.AS cs.CL cs.LG cs.SD

    Using multi-task learning to improve the performance of acoustic-to-word and conventional hybrid models

    Authors: Thai-Son Nguyen, Sebastian Stueker, Alex Waibel

    Abstract: Acoustic-to-word (A2W) models that allow direct mapping from acoustic signals to word sequences are an appealing approach to end-to-end automatic speech recognition due to their simplicity. However, prior works have shown that modelling A2W typically encounters issues of data sparsity that prevent training such a model directly. So far, pre-training initialization is the only approach proposed to…

    Submitted 15 May, 2019; v1 submitted 2 February, 2019; originally announced February 2019.

    Comments: Submitted newer work which includes this paper's results

  49. arXiv:1812.06876  [pdf, other]

    cs.CL

    Multi-task learning to improve natural language understanding

    Authors: Stefan Constantin, Jan Niehues, Alex Waibel

    Abstract: Recent advancements in sequence-to-sequence neural network architectures have led to improved natural language understanding. When building a neural network-based Natural Language Understanding component, one main challenge is to collect enough training data. The generation of a synthetic dataset is an inexpensive and quick way to collect data. Since this data often has less variety than real…

    Submitted 15 February, 2019; v1 submitted 17 December, 2018; originally announced December 2018.

    Comments: 11 pages, 4 figures, 2 tables, forthcoming in IWSDS 2019

  50. arXiv:1811.03189  [pdf, other]

    cs.CL

    Towards Fluent Translations from Disfluent Speech

    Authors: Elizabeth Salesky, Susanne Burger, Jan Niehues, Alex Waibel

    Abstract: When translating from speech, special consideration for conversational speech phenomena such as disfluencies is necessary. Most machine translation training data consists of well-formed written texts, causing issues when translating spontaneous speech. Previous work has introduced an intermediate step between speech recognition (ASR) and machine translation (MT) to remove disfluencies, making the…

    Submitted 7 November, 2018; originally announced November 2018.

    Comments: To appear at SLT 2018