Showing 1–50 of 78 results for author: Bojar, O

  1. arXiv:2406.03881  [pdf, other]

    cs.CL

    Evaluating the IWSLT2023 Speech Translation Tasks: Human Annotations, Automatic Metrics, and Segmentation

    Authors: Matthias Sperber, Ondřej Bojar, Barry Haddow, Dávid Javorský, Xutai Ma, Matteo Negri, Jan Niehues, Peter Polák, Elizabeth Salesky, Katsuhito Sudoh, Marco Turchi

    Abstract: Human evaluation is a critical component in machine translation system development and has received much attention in text translation research. However, little prior work exists on the topic of human evaluation for speech translation, which adds additional challenges such as noisy data and segmentation mismatches. We take first steps to fill this gap by conducting a comprehensive human evaluation…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: LREC-COLING2024 publication (with corrections for Table 3)

    Journal ref: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

  2. arXiv:2404.13855  [pdf, other]

    cs.CL

    Understanding the role of FFNs in driving multilingual behaviour in LLMs

    Authors: Sunit Bhattacharya, Ondřej Bojar

    Abstract: Multilingualism in Large Language Models (LLMs) is an as-yet under-explored area. In this paper, we conduct an in-depth analysis of the multilingual capabilities of a Large Language Model family, examining its architecture, activation patterns, and processing mechanisms across languages. We introduce novel metrics to probe the model's multilingual behaviour at different layers and shed light on…

    Submitted 21 April, 2024; originally announced April 2024.

    Comments: 10 pages

  3. arXiv:2404.00798  [pdf, other]

    cs.LG

    On Difficulties of Attention Factorization through Shared Memory

    Authors: Uladzislau Yorsh, Martin Holeňa, Ondřej Bojar, David Herel

    Abstract: Transformers have revolutionized deep learning in numerous fields, including natural language processing, computer vision, and audio processing. Their strength lies in their attention mechanism, which allows for the discovering of complex input relationships. However, this mechanism's quadratic time and memory complexity pose challenges for larger inputs. Researchers are now investigating models l…

    Submitted 31 March, 2024; originally announced April 2024.

    Comments: 2 pages of main content, 8 pages in total, published as a Tiny Paper at ICLR 2024

  4. arXiv:2401.01283  [pdf, other]

    cs.CL

    Quality and Quantity of Machine Translation References for Automatic Metrics

    Authors: Vilém Zouhar, Ondřej Bojar

    Abstract: Automatic machine translation metrics typically rely on human translations to determine the quality of system translations. Common wisdom in the field dictates that the human references should be of very high quality. However, there are no cost-benefit analyses that could be used to guide practitioners who plan to collect references for machine translation evaluation. We find that higher-quality r…

    Submitted 10 April, 2024; v1 submitted 2 January, 2024; originally announced January 2024.

  5. arXiv:2311.16787  [pdf, other]

    cs.CL

    Evaluating Optimal Reference Translations

    Authors: Vilém Zouhar, Věra Kloudová, Martin Popel, Ondřej Bojar

    Abstract: The overall translation quality reached by current machine translation (MT) systems for high-resourced language pairs is remarkably good. Standard methods of evaluation are neither suitable nor intended to uncover the many translation errors and quality deficiencies that still persist. Furthermore, the quality of standard reference translations is commonly questioned and comparable quality levels have…

    Submitted 8 March, 2024; v1 submitted 28 November, 2023; originally announced November 2023.

    Comments: To appear in Natural Language Engineering 2024

  6. arXiv:2310.15552  [pdf, other]

    cs.CL

    Unveiling Multilinguality in Transformer Models: Exploring Language Specificity in Feed-Forward Networks

    Authors: Sunit Bhattacharya, Ondrej Bojar

    Abstract: Recent research suggests that the feed-forward module within Transformers can be viewed as a collection of key-value memories, where the keys learn to capture specific patterns from the input based on the training examples. The values then combine the output from the 'memories' of the keys to generate predictions about the next token. This leads to an incremental process of prediction that gradual…

    Submitted 24 October, 2023; originally announced October 2023.

  7. arXiv:2310.14262  [pdf, other]

    cs.CL

    Boosting Unsupervised Machine Translation with Pseudo-Parallel Data

    Authors: Ivana Kvapilíková, Ondřej Bojar

    Abstract: Even with the latest developments in deep learning and large-scale language modeling, the task of machine translation (MT) of low-resource languages remains a challenge. Neural MT systems can be trained in an unsupervised way without any translation resources but the quality lags behind, especially in truly low-resource conditions. We propose a training strategy that relies on pseudo-parallel sent…

    Submitted 22 October, 2023; originally announced October 2023.

    Comments: MT Summit 2023

    Journal ref: Ivana Kvapilíková, Ondřej Bojar (2023): Boosting Unsupervised Machine Translation with Pseudo-Parallel Data. In: Proceedings of Machine Translation Summit XIX vol. 1: Research Track, pp. 135-147, AAMT, Kyoto, Japan

  8. arXiv:2309.11384  [pdf, ps, other]

    cs.CL cs.AI cs.SD eess.AS

    Long-Form End-to-End Speech Translation via Latent Alignment Segmentation

    Authors: Peter Polák, Ondřej Bojar

    Abstract: Current simultaneous speech translation models can process audio only up to a few seconds long. Contemporary datasets provide an oracle segmentation into sentences based on human-annotated transcripts and translations. However, the segmentation into sentences is not available in the real world. Current speech segmentation approaches either offer poor segmentation quality or have to trade latency f…

    Submitted 20 September, 2023; originally announced September 2023.

    Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  9. arXiv:2309.11379  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff

    Authors: Peter Polák, Brian Yan, Shinji Watanabe, Alex Waibel, Ondřej Bojar

    Abstract: Blockwise self-attentional encoder models have recently emerged as one promising end-to-end approach to simultaneous speech translation. These models employ a blockwise beam search with hypothesis reliability scoring to determine when to wait for more input speech before translating further. However, this method maintains multiple hypotheses until the entire speech input is consumed -- this scheme…

    Submitted 20 September, 2023; originally announced September 2023.

    Comments: Accepted at INTERSPEECH 2023

    Journal ref: Polák, P., Yan, B., Watanabe, S., Waibel, A., Bojar, O. (2023) Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff. Proc. INTERSPEECH 2023, 3979-3983

  10. arXiv:2309.05272  [pdf, other]

    cs.CL

    Minuteman: Machine and Human Joining Forces in Meeting Summarization

    Authors: František Kmječ, Ondřej Bojar

    Abstract: Many meetings require creating a meeting summary to keep everyone up to date. Creating minutes of sufficient quality is however very cognitively demanding. Although we currently possess capable models for both audio speech recognition (ASR) and summarization, their fully automatic use is still problematic. ASR models frequently commit errors when transcribing named entities while the summarization…

    Submitted 11 September, 2023; originally announced September 2023.

    Comments: 6 pages, 3 figures

  11. arXiv:2308.04398  [pdf, other]

    cs.CL

    Character-level NMT and language similarity

    Authors: Josef Jon, Ondřej Bojar

    Abstract: We explore the effectiveness of character-level neural machine translation using Transformer architecture for various levels of language similarity and size of the training dataset on translation between Czech and Croatian, German, Hungarian, Slovak, and Spanish. We evaluate the models using automatic MT metrics and show that translation between similar languages benefits from character-level inpu…

    Submitted 8 August, 2023; originally announced August 2023.

  12. arXiv:2308.03601  [pdf, other]

    cs.CL

    Negative Lexical Constraints in Neural Machine Translation

    Authors: Josef Jon, Dušan Variš, Michal Novák, João Paulo Aires, Ondřej Bojar

    Abstract: This paper explores negative lexical constraining in English to Czech neural machine translation. Negative lexical constraining is used to prohibit certain words or expressions in the translation produced by the neural translation model. We compared various methods based on modifying either the decoding process or the training data. The comparison was performed on two tasks: paraphrasing and feedb…

    Submitted 7 August, 2023; originally announced August 2023.

  13. arXiv:2307.14743  [pdf, other]

    cs.CL

    Turning Whisper into Real-Time Transcription System

    Authors: Dominik Macháček, Raj Dabre, Ondřej Bojar

    Abstract: Whisper is one of the recent state-of-the-art multilingual speech recognition and translation models; however, it is not designed for real-time transcription. In this paper, we build on top of Whisper and create Whisper-Streaming, an implementation of real-time speech transcription and translation of Whisper-like models. Whisper-Streaming uses a local agreement policy with self-adaptive latency to e…

    Submitted 21 September, 2023; v1 submitted 27 July, 2023; originally announced July 2023.

    Comments: IJCNLP-AACL 2023 system demonstration

  14. arXiv:2305.19689  [pdf, other]

    cs.CL

    Assessing Word Importance Using Models Trained for Semantic Tasks

    Authors: Dávid Javorský, Ondřej Bojar, François Yvon

    Abstract: Many NLP tasks require automatically identifying the most significant words in a text. In this work, we derive word significance from models trained to solve semantic tasks: Natural Language Inference and Paraphrase Identification. Using an attribution method aimed to explain the predictions of these models, we derive importance scores for each input token. We evaluate their relevance using a so-ca…

    Submitted 31 May, 2023; originally announced May 2023.

    Comments: Published in the Findings of ACL 2023

  15. arXiv:2305.19330  [pdf, other]

    cs.CL

    Breeding Machine Translations: Evolutionary approach to survive and thrive in the world of automated evaluation

    Authors: Josef Jon, Ondřej Bojar

    Abstract: We propose a genetic algorithm (GA) based method for modifying n-best lists produced by a machine translation (MT) system. Our method offers an innovative approach to improving MT quality and identifying weaknesses in evaluation metrics. Using common GA operations (mutation and crossover) on a list of hypotheses in combination with a fitness function (an arbitrary MT metric), we obtain novel and d…

    Submitted 30 May, 2023; originally announced May 2023.

  16. arXiv:2305.17690  [pdf, other]

    cs.CL

    HaVQA: A Dataset for Visual Question Answering and Multimodal Research in Hausa Language

    Authors: Shantipriya Parida, Idris Abdulmumin, Shamsuddeen Hassan Muhammad, Aneesh Bose, Guneet Singh Kohli, Ibrahim Said Ahmad, Ketan Kotwal, Sayan Deb Sarkar, Ondřej Bojar, Habeebah Adamu Kakudi

    Abstract: This paper presents HaVQA, the first multimodal dataset for visual question-answering (VQA) tasks in the Hausa language. The dataset was created by manually translating 6,022 English question-answer pairs, which are associated with 1,555 unique images from the Visual Genome dataset. As a result, the dataset provides 12,044 gold standard English-Hausa parallel sentences that were translated in a fa…

    Submitted 28 May, 2023; originally announced May 2023.

    Comments: Accepted at ACL 2023 as a long paper (Findings)

  17. arXiv:2305.16894  [pdf, other]

    cs.CL

    Robustness of Multi-Source MT to Transcription Errors

    Authors: Dominik Macháček, Peter Polák, Ondřej Bojar, Raj Dabre

    Abstract: Automatic speech translation is sensitive to speech recognition errors, but in a multilingual scenario, the same content may be available in various languages via simultaneous interpreting, dubbing or subtitling. In this paper, we hypothesize that leveraging multiple sources will improve translation quality if the sources complement one another in terms of correct information they contain. To this…

    Submitted 26 May, 2023; originally announced May 2023.

    Comments: ACL 2023 Findings

  18. arXiv:2303.11192  [pdf, other]

    cs.CL

    Multimodal Shannon Game with Images

    Authors: Vilém Zouhar, Sunit Bhattacharya, Ondřej Bojar

    Abstract: The Shannon game has long been used as a thought experiment in linguistics and NLP, asking participants to guess the next letter in a sentence based on its preceding context. We extend the game by introducing an optional extra modality in the form of image information. To investigate the impact of multimodal information in this game, we use human participants and a language model (LM, GPT-2). We s…

    Submitted 20 March, 2023; originally announced March 2023.

  19. arXiv:2211.16174  [pdf, other]

    cs.CL

    CUNI Submission in WMT22 General Task

    Authors: Josef Jon, Martin Popel, Ondřej Bojar

    Abstract: We present the CUNI-Bergamot submission for the WMT22 General translation task. We compete in the English→Czech direction. Our submission further explores block backtranslation techniques. Compared to the previous work, we measure performance in terms of COMET score and named entities translation accuracy. We evaluate performance of MBR decoding compared to traditional mixed backtranslatio…

    Submitted 29 November, 2022; originally announced November 2022.

    Comments: 8 pages, WMT22

  20. arXiv:2211.08633  [pdf, other]

    cs.CL cs.AI

    MT Metrics Correlate with Human Ratings of Simultaneous Speech Translation

    Authors: Dominik Macháček, Ondřej Bojar, Raj Dabre

    Abstract: There have been several meta-evaluation studies on the correlation between human ratings and offline machine translation (MT) evaluation metrics such as BLEU, chrF2, BertScore and COMET. These metrics have been used to evaluate simultaneous speech translation (SST) but their correlations with human ratings of SST, which have recently been collected as Continuous Ratings (CR), are unclear. In this p…

    Submitted 1 June, 2023; v1 submitted 15 November, 2022; originally announced November 2022.

    Comments: IWSLT 2023

  21. arXiv:2210.09754  [pdf, other]

    cs.CL

    Simultaneous Translation for Unsegmented Input: A Sliding Window Approach

    Authors: Sukanta Sen, Ondřej Bojar, Barry Haddow

    Abstract: In the cascaded approach to spoken language translation (SLT), the ASR output is typically punctuated and segmented into sentences before being passed to MT, since the latter is typically trained on written text. However, erroneous segmentation, due to poor sentence-final punctuation by the ASR system, leads to degradation in translation quality, especially in the simultaneous (online) setting whe…

    Submitted 18 October, 2022; originally announced October 2022.

  22. arXiv:2210.06928  [pdf, other]

    cs.CL

    Sentence Ambiguity, Grammaticality and Complexity Probes

    Authors: Sunit Bhattacharya, Vilém Zouhar, Ondřej Bojar

    Abstract: It is unclear whether, how and where large pre-trained language models capture subtle linguistic traits like ambiguity, grammaticality and sentence complexity. We present results of automatic classification of these traits and compare their viability and patterns across representation types. We demonstrate that template-based datasets with surface-level artifacts should not be used for probing, ca…

    Submitted 15 October, 2022; v1 submitted 13 October, 2022; originally announced October 2022.

    Comments: Accepted at BlackboxNLP @ EMNLP 2022

  23. arXiv:2205.05433  [pdf, other]

    cs.CL

    ALIGNMEET: A Comprehensive Tool for Meeting Annotation, Alignment, and Evaluation

    Authors: Peter Polák, Muskaan Singh, Anna Nedoluzhko, Ondřej Bojar

    Abstract: Summarization is a challenging problem, and even more challenging is to manually create, correct, and evaluate the summaries. The severity of the problem grows when the inputs are multi-party dialogues in a meeting setup. To facilitate the research in this area, we present ALIGNMEET, a comprehensive tool for meeting annotation, alignment, and evaluation. The tool aims to provide an efficient and c…

    Submitted 11 May, 2022; originally announced May 2022.

    Comments: Accepted to LREC22

  24. arXiv:2205.01133  [pdf, other]

    cs.CL cs.CV cs.LG

    Hausa Visual Genome: A Dataset for Multi-Modal English to Hausa Machine Translation

    Authors: Idris Abdulmumin, Satya Ranjan Dash, Musa Abdullahi Dawud, Shantipriya Parida, Shamsuddeen Hassan Muhammad, Ibrahim Sa'id Ahmad, Subhadarshi Panda, Ondřej Bojar, Bashir Shehu Galadanci, Bello Shehu Bello

    Abstract: Multi-modal Machine Translation (MMT) enables the use of visual information to enhance the quality of translations. The visual information can serve as a valuable piece of context information to decrease the ambiguity of input sentences. Despite the increasing popularity of such a technique, good and sizeable datasets are scarce, limiting the full extent of their potential. Hausa, a Chadic languag…

    Submitted 6 May, 2022; v1 submitted 2 May, 2022; originally announced May 2022.

    Comments: Accepted at Language Resources and Evaluation Conference 2022 (LREC2022)

  25. arXiv:2204.06028  [pdf, other]

    cs.CL

    CUNI-KIT System for Simultaneous Speech Translation Task at IWSLT 2022

    Authors: Peter Polák, Ngoc-Quan Ngoc, Tuan-Nam Nguyen, Danni Liu, Carlos Mullov, Jan Niehues, Ondřej Bojar, Alexander Waibel

    Abstract: In this paper, we describe our submission to the Simultaneous Speech Translation at IWSLT 2022. We explore strategies to utilize an offline model in a simultaneous setting without the need to modify the original model. In our experiments, we show that our onlinization algorithm is almost on par with the offline setting while being 3× faster than offline in terms of latency on the test set.…

    Submitted 11 May, 2022; v1 submitted 12 April, 2022; originally announced April 2022.

    Comments: Accepted to IWSLT22

  26. arXiv:2204.04998  [pdf, other]

    cs.CL cs.LG

    Team ÚFAL at CMCL 2022 Shared Task: Figuring out the correct recipe for predicting Eye-Tracking features using Pretrained Language Models

    Authors: Sunit Bhattacharya, Rishu Kumar, Ondrej Bojar

    Abstract: Eye-Tracking data is a very useful source of information to study cognition and especially language comprehension in humans. In this paper, we describe our systems for the CMCL 2022 shared task on predicting eye-tracking information. We describe our experiments with pretrained models like BERT and XLM and the different ways in which we used those representations to predict four eye-tracking featur…

    Submitted 11 April, 2022; originally announced April 2022.

  27. arXiv:2204.02905  [pdf, other]

    cs.CL cs.HC

    EMMT: A simultaneous eye-tracking, 4-electrode EEG and audio corpus for multi-modal reading and translation scenarios

    Authors: Sunit Bhattacharya, Věra Kloudová, Vilém Zouhar, Ondřej Bojar

    Abstract: We present the Eyetracked Multi-Modal Translation (EMMT) corpus, a dataset containing monocular eye movement recordings, audio and 4-electrode electroencephalogram (EEG) data of 43 participants. The objective was to collect cognitive signals as responses of participants engaged in a number of language intensive tasks involving different text-image stimuli settings when translating from English to…

    Submitted 6 April, 2022; originally announced April 2022.

    Comments: Submitted to Nature Scientific Data

  28. arXiv:2203.15404  [pdf, other]

    cs.CL

    Short-Term Word-Learning in a Dynamically Changing Environment

    Authors: Christian Huber, Rishu Kumar, Ondřej Bojar, Alexander Waibel

    Abstract: Neural sequence-to-sequence automatic speech recognition (ASR) systems are in principle open vocabulary systems, when using appropriate modeling units. In practice, however, they often fail to recognize words not seen during training, e.g., named entities, numbers or technical terms. To alleviate this problem, Huber et al. proposed to supplement an end-to-end ASR system with a word/phrase memory a…

    Submitted 29 March, 2022; originally announced March 2022.

    Comments: This paper was submitted to Interspeech 2022

  29. arXiv:2203.02458  [pdf, other]

    cs.CL

    Comprehension of Subtitles from Re-Translating Simultaneous Speech Translation

    Authors: Dávid Javorský, Dominik Macháček, Ondřej Bojar

    Abstract: In simultaneous speech translation, one can vary the size of the output window, system latency and sometimes the allowed level of rewriting. The effect of these properties on readability and comprehensibility has not been tested with modern neural translation systems. In this work, we propose an evaluation method and investigate the effects on comprehension and user preferences. It is a pilot stud…

    Submitted 4 March, 2022; originally announced March 2022.

  30. arXiv:2202.12814  [pdf]

    cs.CL

    The Reality of Multi-Lingual Machine Translation

    Authors: Tom Kocmi, Dominik Macháček, Ondřej Bojar

    Abstract: Our book "The Reality of Multi-Lingual Machine Translation" discusses the benefits and perils of using more than two languages in machine translation systems. While focused on the particular task of sequence-to-sequence processing and multi-task learning, the book targets somewhat beyond the area of natural language processing. Machine translation is for us a prime example of deep learning applica…

    Submitted 25 February, 2022; originally announced February 2022.

    Comments: ISBN 978-80-88132-11-0. arXiv admin note: substantial text overlap with arXiv:2001.01622

  31. arXiv:2109.09354  [pdf, other]

    cs.CL

    CUNI systems for WMT21: Multilingual Low-Resource Translation for Indo-European Languages Shared Task

    Authors: Josef Jon, Michal Novák, João Paulo Aires, Dušan Variš, Ondřej Bojar

    Abstract: This paper describes the Charles University submission for the Multilingual Low-Resource Translation for Indo-European Languages shared task at WMT21. We competed in translation from Catalan into Romanian, Italian and Occitan. Our systems are based on a shared multilingual model. We show that using a joint model for multiple similar language pairs improves upon translation quality in each pair. We also demons…

    Submitted 20 September, 2021; originally announced September 2021.

  32. arXiv:2109.09350  [pdf, other]

    cs.CL

    CUNI systems for WMT21: Terminology translation Shared Task

    Authors: Josef Jon, Michal Novák, João Paulo Aires, Dušan Variš, Ondřej Bojar

    Abstract: This paper describes the Charles University submission for the Terminology translation Shared Task at WMT21. The objective of this task is to design a system which translates certain terms based on a provided terminology database, while preserving high overall translation quality. We competed in the English-French language pair. Our approach is based on providing the desired translations alongside the input s…

    Submitted 20 September, 2021; originally announced September 2021.

  33. Sequence Length is a Domain: Length-based Overfitting in Transformer Models

    Authors: Dušan Variš, Ondřej Bojar

    Abstract: Transformer-based sequence-to-sequence architectures, while achieving state-of-the-art results on a large number of NLP tasks, can still suffer from overfitting during training. In practice, this is usually countered either by applying regularization methods (e.g. dropout, L2-regularization) or by providing huge amounts of training data. Additionally, Transformer and other architectures are known…

    Submitted 15 September, 2021; originally announced September 2021.

  34. arXiv:2109.05016  [pdf, other]

    cs.CL cs.HC

    Neural Machine Translation Quality and Post-Editing Performance

    Authors: Vilém Zouhar, Aleš Tamchyna, Martin Popel, Ondřej Bojar

    Abstract: We test the natural expectation that using MT in professional translation saves human processing time. The last such study was carried out by Sanchez-Torron and Koehn (2016) with phrase-based MT, artificially reducing the translation quality. In contrast, we focus on neural MT (NMT) of high quality, which has become the state-of-the-art approach since then and also got adopted by most translation…

    Submitted 10 September, 2021; originally announced September 2021.

    Comments: 9 pages, 1 page appendix. To be presented at EMNLP2021

  35. arXiv:2109.00916  [pdf, other]

    cs.CL

    Coarse-To-Fine And Cross-Lingual ASR Transfer

    Authors: Peter Polák, Ondřej Bojar

    Abstract: End-to-end neural automatic speech recognition systems have recently achieved state-of-the-art results, but they require large datasets and extensive computing resources. Transfer learning has been proposed to overcome these difficulties even across languages, e.g., German ASR trained from an English model. We experiment with much less related languages, reusing an English model for Czech ASR. To simpl…

    Submitted 2 September, 2021; originally announced September 2021.

    Comments: Accepted to ITAT WAFNL

  36. End-to-End Lexically Constrained Machine Translation for Morphologically Rich Languages

    Authors: Josef Jon, João Paulo Aires, Dušan Variš, Ondřej Bojar

    Abstract: Lexically constrained machine translation allows the user to manipulate the output sentence by enforcing the presence or absence of certain words and phrases. Although current approaches can enforce terms to appear in the translation, they often struggle to make the constraint word form agree with the rest of the generated output. Our manual analysis shows that 46% of the errors in the output of a…

    Submitted 24 June, 2021; v1 submitted 23 June, 2021; originally announced June 2021.

  37. arXiv:2106.09343  [pdf, ps, other]

    cs.CL

    Lost in Interpreting: Speech Translation from Source or Interpreter?

    Authors: Dominik Macháček, Matúš Žilinec, Ondřej Bojar

    Abstract: Interpreters facilitate multi-lingual meetings but the affordable set of languages is often smaller than what is needed. Automatic simultaneous speech translation can extend the set of provided languages. We investigate if such an automatic system should rather follow the original speaker, or an interpreter to achieve better translation quality at the cost of increased delay. To answer the quest…

    Submitted 17 June, 2021; originally announced June 2021.

    Comments: to be published at INTERSPEECH 2021

  38. Unsupervised Multilingual Sentence Embeddings for Parallel Corpus Mining

    Authors: Ivana Kvapilıkova, Mikel Artetxe, Gorka Labaka, Eneko Agirre, Ondřej Bojar

    Abstract: Existing models of multilingual sentence embeddings require large parallel data resources which are not available for low-resource languages. We propose a novel unsupervised method to derive multilingual sentence embeddings relying only on monolingual data. We first produce a synthetic parallel corpus using unsupervised machine translation, and use it to fine-tune a pretrained cross-lingual masked…

    Submitted 21 May, 2021; originally announced May 2021.

    Comments: ACL SRW 2020

    Journal ref: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics - Student Research Workshop, pages 255-262, Association for Computational Linguistics, 2020

  39. arXiv:2104.05688  [pdf, other]

    cs.CL cs.HC

    Backtranslation Feedback Improves User Confidence in MT, Not Quality

    Authors: Vilém Zouhar, Michal Novák, Matúš Žilinec, Ondřej Bojar, Mateo Obregón, Robin L. Hill, Frédéric Blain, Marina Fomicheva, Lucia Specia, Lisa Yankovskaya

    Abstract: Translating text into a language unknown to the text's author, dubbed outbound translation, is a modern need for which the user experience has significant room for improvement, beyond the basic machine translation facility. We demonstrate this by showing three ways in which user confidence in the outbound translation, as well as its overall final quality, can be affected: backward translation, qua…

    Submitted 12 April, 2021; originally announced April 2021.

    Comments: 9 pages (excluding references); to appear at NAACL-HWT 2021

  40. arXiv:2102.08892  [pdf, ps, other]

    cs.CL cs.HC

    THEaiTRE 1.0: Interactive generation of theatre play scripts

    Authors: Rudolf Rosa, Tomáš Musil, Ondřej Dušek, Dominik Jurko, Patrícia Schmidtová, David Mareček, Ondřej Bojar, Tom Kocmi, Daniel Hrbek, David Košťák, Martina Kinská, Marie Nováková, Josef Doležal, Klára Vosecká, Tomáš Studeník, Petr Žabka

    Abstract: We present the first version of a system for interactive generation of theatre play scripts. The system is based on a vanilla GPT-2 model with several adjustments, targeting specific issues we encountered in practice. We also list other issues we encountered but plan to only solve in a future version of the system. The presented system was used to generate a theatre play script planned for premier…

    Submitted 17 February, 2021; originally announced February 2021.

    Comments: Submitted to Text2Story workshop 2021

    Journal ref: Proc. Text2Story (2021) 71-76

  41. arXiv:2010.15924  [pdf]

    cs.CL cs.IR cs.LG

    How Many Pages? Paper Length Prediction from the Metadata

    Authors: Erion Çano, Ondřej Bojar

    Abstract: Being able to predict the length of a scientific paper may be helpful in numerous situations. This work defines the paper length prediction task as a regression problem and reports several experimental results using popular machine learning models. We also create a huge dataset of publication metadata and the respective lengths in number of pages. The dataset will be freely available and is intend…

    Submitted 17 December, 2020; v1 submitted 29 October, 2020; originally announced October 2020.

    Comments: 5 pages, 6 tables. Published in proceedings of NLPIR 2020, the 4th International Conference on Natural Language Processing and Information Retrieval, Seoul, Korea

  42. arXiv:2010.11747  [pdf, other]

    cs.CL

    CUNI Systems for the Unsupervised and Very Low Resource Translation Task in WMT20

    Authors: Ivana Kvapilíková, Tom Kocmi, Ondřej Bojar

    Abstract: This paper presents a description of CUNI systems submitted to the WMT20 task on unsupervised and very low-resource supervised machine translation between German and Upper Sorbian. We experimented with training on synthetic data and pre-training on a related language pair. In the fully unsupervised scenario, we achieved 25.5 and 23.7 BLEU translating from and into Upper Sorbian, respectively. Our…

    Submitted 22 October, 2020; originally announced October 2020.

    Comments: WMT20

  43. Unsupervised Pretraining for Neural Machine Translation Using Elastic Weight Consolidation

    Authors: Dušan Variš, Ondřej Bojar

    Abstract: This work presents our ongoing research of unsupervised pretraining in neural machine translation (NMT). In our method, we initialize the weights of the encoder and decoder with two language models that are trained with monolingual data and then fine-tune the model on parallel data using Elastic Weight Consolidation (EWC) to avoid forgetting of the original language modeling tasks. We compare the…

    Submitted 19 October, 2020; originally announced October 2020.

    Comments: ACL-SRW 2019 (camera-ready)

  44. arXiv:2009.09016  [pdf, other]

    cs.CL

    Presenting Simultaneous Translation in Limited Space

    Authors: Dominik Macháček, Ondřej Bojar

    Abstract: Some methods of automatic simultaneous translation of a long-form speech allow revisions of outputs, trading accuracy for low latency. Deploying these systems for users faces the problem of presenting subtitles in a limited space, such as two lines on a television screen. The subtitles must be shown promptly, incrementally, and with adequate time for reading. We provide an algorithm for subtitling…

    Submitted 18 September, 2020; originally announced September 2020.

    Journal ref: ITAT WAFNL 2020

  45. arXiv:2007.03006  [pdf, ps, other]

    cs.CL

    Announcing CzEng 2.0 Parallel Corpus with over 2 Gigawords

    Authors: Tom Kocmi, Martin Popel, Ondrej Bojar

    Abstract: We present a new release of the Czech-English parallel corpus CzEng 2.0 consisting of over 2 billion words (2 "gigawords") in each language. The corpus contains document-level information and is filtered with several techniques to lower the amount of noise. In addition to the data in the previous version of CzEng, it contains new authentic and also high-quality synthetic parallel data. CzEng is fr…

    Submitted 6 July, 2020; originally announced July 2020.

  46. arXiv:2006.14668  [pdf, ps, other]

    cs.CL

    THEaiTRE: Artificial Intelligence to Write a Theatre Play

    Authors: Rudolf Rosa, Ondřej Dušek, Tom Kocmi, David Mareček, Tomáš Musil, Patrícia Schmidtová, Dominik Jurko, Ondřej Bojar, Daniel Hrbek, David Košťák, Martina Kinská, Josef Doležal, Klára Vosecká

    Abstract: We present THEaiTRE, a starting project aimed at automatic generation of theatre play scripts. This paper reviews related work and drafts an approach we intend to follow. We plan to adopt generative neural language models and hierarchical generation approaches, supported by summarization and machine translation methods, and complemented with a human-in-the-loop approach.

    Submitted 25 June, 2020; originally announced June 2020.

    Comments: accepted to AI4Narratives2020

    Journal ref: Proc. AI4Narratives (2020) 9-13

  47. arXiv:2006.13268  [pdf, ps, other]

    cs.CL cs.LG

    Automating Text Naturalness Evaluation of NLG Systems

    Authors: Erion Çano, Ondřej Bojar

    Abstract: Automatic methods and metrics that assess various quality criteria of automatically generated texts are important for developing NLG systems because they produce repeatable results and allow for a fast development cycle. We present here an attempt to automate the evaluation of text naturalness which is a very important characteristic of natural language generation methods. Instead of relying on hu…

    Submitted 23 June, 2020; originally announced June 2020.

    Comments: 15 pages, 4 equations, 3 tables. arXiv admin note: text overlap with arXiv:2006.03189

  48. arXiv:2006.03331  [pdf, ps, other]

    cs.CL

    ELITR Non-Native Speech Translation at IWSLT 2020

    Authors: Dominik Macháček, Jonáš Kratochvíl, Sangeet Sagar, Matúš Žilinec, Ondřej Bojar, Thai-Son Nguyen, Felix Schneider, Philip Williams, Yuekun Yao

    Abstract: This paper is an ELITR system submission for the non-native speech translation task at IWSLT 2020. We describe systems for offline ASR, real-time ASR, and our cascaded approach to offline SLT and real-time SLT. We select our primary candidates from a pool of pre-existing systems, develop a new end-to-end general ASR system, and a hybrid ASR trained on non-native speech. The provided small validati…

    Submitted 5 June, 2020; originally announced June 2020.

    Comments: IWSLT 2020

  49. arXiv:2006.03189  [pdf, ps, other]

    cs.CL cs.LG

    Human or Machine: Automating Human Likeliness Evaluation of NLG Texts

    Authors: Erion Çano, Ondřej Bojar

    Abstract: Automatic evaluation of various text quality criteria produced by data-driven intelligent methods is very common and useful because it is cheap, fast, and usually yields repeatable results. In this paper, we present an attempt to automate the human likeliness evaluation of the output text samples coming from natural language generation methods used to solve several tasks. We propose to use a human…

    Submitted 4 June, 2020; originally announced June 2020.

    Comments: 9 pages, 5 equations, 1 table

  50. arXiv:2002.04689  [pdf, ps, other]

    cs.CL cs.IR

    Two Huge Title and Keyword Generation Corpora of Research Articles

    Authors: Erion Çano, Ondřej Bojar

    Abstract: Recent developments in sequence-to-sequence learning with neural networks have considerably improved the quality of automatically generated text summaries and document keywords, stipulating the need for even bigger training corpora. Metadata of research articles are usually easy to find online and can be used to perform research on various tasks. In this paper, we introduce two huge datasets for t…

    Submitted 11 February, 2020; originally announced February 2020.

    Comments: 9 pages, 8 tables. Published in proceedings of LREC 2020, the 12th International Conference on Language Resources and Evaluation, Marseille, France