-
ParaCotta: Synthetic Multilingual Paraphrase Corpora from the Most Diverse Translation Sample Pair
Authors:
Alham Fikri Aji,
Tirana Noor Fatyanosa,
Radityo Eko Prasojo,
Philip Arthur,
Suci Fitriany,
Salma Qonitah,
Nadhifa Zulfa,
Tomi Santoso,
Mahendra Data
Abstract:
We release our synthetic parallel paraphrase corpus across 17 languages: Arabic, Catalan, Czech, German, English, Spanish, Estonian, French, Hindi, Indonesian, Italian, Dutch, Romanian, Russian, Swedish, Vietnamese, and Chinese. Our method relies only on monolingual data and a neural machine translation system to generate paraphrases, and is therefore simple to apply. We generate multiple translation samples using beam search and choose the most lexically diverse pair according to their sentence BLEU. We compare our generated corpus with ParaBank2. According to our evaluation, our synthetic paraphrase pairs are semantically similar and lexically diverse.
Submitted 9 May, 2022;
originally announced May 2022.
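
The selection step described in the ParaCotta abstract (pick the most lexically diverse pair among several translation samples according to sentence BLEU) might be sketched as follows. This is a minimal illustration, not the authors' implementation: the simplified add-one-smoothed BLEU, the symmetric averaging of the two directions, and the names `sentence_bleu` and `most_diverse_pair` are all assumptions made for this sketch.

```python
from collections import Counter
from itertools import combinations
import math


def ngrams(tokens, n):
    """Return a multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hyp, ref, max_n=4):
    """Simplified sentence BLEU with add-one smoothing (an assumption)."""
    hyp_toks, ref_toks = hyp.split(), ref.split()
    if not hyp_toks or not ref_toks:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_ng, ref_ng = ngrams(hyp_toks, n), ngrams(ref_toks, n)
        overlap = sum((hyp_ng & ref_ng).values())  # clipped n-gram matches
        total = sum(hyp_ng.values())
        log_prec += math.log((overlap + 1) / (total + 1)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the reference.
    brevity = min(1.0, math.exp(1 - len(ref_toks) / len(hyp_toks)))
    return brevity * math.exp(log_prec)


def most_diverse_pair(samples):
    """Pick the pair of samples with the lowest symmetric sentence BLEU."""
    def score(pair):
        a, b = pair
        return (sentence_bleu(a, b) + sentence_bleu(b, a)) / 2
    return min(combinations(samples, 2), key=score)
```

A lower BLEU between two samples means less n-gram overlap, so taking the minimum favors lexically different outputs; it does not check that the two samples still mean the same thing.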
-
It is Not as Good as You Think! Evaluating Simultaneous Machine Translation on Interpretation Data
Authors:
Jinming Zhao,
Philip Arthur,
Gholamreza Haffari,
Trevor Cohn,
Ehsan Shareghi
Abstract:
Most existing simultaneous machine translation (SiMT) systems are trained and evaluated on offline translation corpora. We argue that SiMT systems should be trained and tested on real interpretation data. To illustrate this argument, we propose an interpretation test set and conduct a realistic evaluation of SiMT trained on offline translations. Our results, on our test set along with 3 existing smaller-scale language pairs, highlight a difference of up to 13.83 BLEU when SiMT models are evaluated on translation versus interpretation data. In the absence of interpretation training data, we propose a translation-to-interpretation (T2I) style transfer method which allows converting existing offline translations into interpretation-style data, leading to an improvement of up to 2.8 BLEU. However, the evaluation gap remains notable, calling for constructing large-scale interpretation corpora better suited for evaluating and developing SiMT systems.
Submitted 11 October, 2021;
originally announced October 2021.
-
Learning Coupled Policies for Simultaneous Machine Translation using Imitation Learning
Authors:
Philip Arthur,
Trevor Cohn,
Gholamreza Haffari
Abstract:
We present a novel approach to efficiently learn a simultaneous translation model with coupled programmer-interpreter policies. First, we present an algorithmic oracle to produce oracle READ/WRITE actions for training bilingual sentence pairs using the notion of word alignments. These oracle actions are designed to capture enough information from the partial input before writing the output. Next, we perform a coupled scheduled sampling to effectively mitigate the exposure bias when learning both policies jointly with imitation learning. Experiments on six language pairs show that our method outperforms strong baselines in terms of translation quality while keeping the translation delay low.
Submitted 25 January, 2021; v1 submitted 11 February, 2020;
originally announced February 2020.
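
The idea of deriving READ/WRITE actions from word alignments, as described in the abstract above, can be illustrated with a small sketch. The rule used here is a hypothetical simplification and not necessarily the paper's oracle: before writing a target word, read up to the furthest source word aligned to it. The function name `oracle_actions` and the handling of unaligned target words are assumptions.

```python
def oracle_actions(src_len, tgt_len, alignments):
    """Derive READ/WRITE actions from word alignments.

    alignments: iterable of (src_idx, tgt_idx) pairs, 0-indexed.
    Hypothetical rule: before writing target word t, read up to the
    furthest source word aligned to t. Unaligned target words are
    written once at least one source word has been read.
    """
    # needed[t] = number of source words that must be read before writing t
    needed = [0] * tgt_len
    for s, t in alignments:
        needed[t] = max(needed[t], s + 1)

    actions, read = [], 0
    for t in range(tgt_len):
        target = max(needed[t], 1)
        while read < min(target, src_len):
            actions.append("READ")
            read += 1
        actions.append("WRITE")
    return actions
```

With a monotone alignment the actions alternate READ and WRITE; a target word aligned to a late source word forces several READs first, which is where translation delay comes from.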
-
Multilingual Neural Machine Translation With Soft Decoupled Encoding
Authors:
Xinyi Wang,
Hieu Pham,
Philip Arthur,
Graham Neubig
Abstract:
Multilingual training of neural machine translation (NMT) systems has led to impressive accuracy improvements on low-resource languages. However, there are still significant challenges in efficiently learning word representations in the face of paucity of data. In this paper, we propose Soft Decoupled Encoding (SDE), a multilingual lexicon encoding framework specifically designed to share lexical-level information intelligently without requiring heuristic preprocessing such as pre-segmenting the data. SDE represents a word by its spelling through a character encoding, and its semantic meaning through a latent embedding space shared by all languages. Experiments on a standard dataset of four low-resource languages show consistent improvements over strong multilingual NMT baselines, with gains of up to 2 BLEU on one of the tested languages, achieving the new state-of-the-art on all four language pairs.
Submitted 9 February, 2019;
originally announced February 2019.
-
XNMT: The eXtensible Neural Machine Translation Toolkit
Authors:
Graham Neubig,
Matthias Sperber,
Xinyi Wang,
Matthieu Felix,
Austin Matthews,
Sarguna Padmanabhan,
Ye Qi,
Devendra Singh Sachan,
Philip Arthur,
Pierre Godard,
John Hewitt,
Rachid Riad,
Liming Wang
Abstract:
This paper describes XNMT, the eXtensible Neural Machine Translation toolkit. XNMT distinguishes itself from other open-source NMT toolkits by its focus on modular code design, with the purpose of enabling fast iteration in research and replicable, reliable results. In this paper we describe the design of XNMT and its experiment configuration system, and demonstrate its utility on the tasks of machine translation, speech recognition, and multi-tasked machine translation/parsing. XNMT is available open-source at https://github.com/neulab/xnmt
Submitted 28 February, 2018;
originally announced March 2018.
-
Linguistic unit discovery from multi-modal inputs in unwritten languages: Summary of the "Speaking Rosetta" JSALT 2017 Workshop
Authors:
Odette Scharenborg,
Laurent Besacier,
Alan Black,
Mark Hasegawa-Johnson,
Florian Metze,
Graham Neubig,
Sebastian Stueker,
Pierre Godard,
Markus Mueller,
Lucas Ondel,
Shruti Palaskar,
Philip Arthur,
Francesco Ciannella,
Mingxing Du,
Elin Larsen,
Danny Merkx,
Rachid Riad,
Liming Wang,
Emmanuel Dupoux
Abstract:
We summarize the accomplishments of a multi-disciplinary workshop exploring the computational and scientific issues surrounding the discovery of linguistic units (subwords and words) in a language without orthography. We study the replacement of orthographic transcriptions by images and/or translated text in a well-resourced language to help unsupervised discovery from raw speech.
Submitted 14 February, 2018;
originally announced February 2018.
-
Neural Machine Translation via Binary Code Prediction
Authors:
Yusuke Oda,
Philip Arthur,
Graham Neubig,
Koichiro Yoshino,
Satoshi Nakamura
Abstract:
In this paper, we propose a new method for calculating the output layer in neural machine translation systems. The method is based on predicting a binary code for each word and can reduce computation time/memory requirements of the output layer to be logarithmic in vocabulary size in the best case. In addition, we also introduce two advanced approaches to improve the robustness of the proposed model: using error-correcting codes and combining softmax and binary codes. Experiments on two English-Japanese bidirectional translation tasks show that the proposed models achieve BLEU scores that approach the softmax, while reducing memory usage to the order of less than 1/10 and improving decoding speed on CPUs by 5x to 10x.
Submitted 23 April, 2017;
originally announced April 2017.
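
To make the "logarithmic in vocabulary size" claim in the abstract above concrete, the following sketch assigns each word id a plain binary code and decodes per-bit predictions back to a word id. It is only an illustration under stated assumptions: the code assignment (the binary representation of the word id), the 0.5 threshold, and the helper names are choices made here, and the error-correcting codes and softmax hybrid mentioned in the abstract are not shown.

```python
def code_length(vocab_size):
    """Number of bits needed to give each word id a distinct code."""
    return max(1, (vocab_size - 1).bit_length())


def word_to_bits(word_id, num_bits):
    """Binary code of a word id, most significant bit first."""
    return [(word_id >> i) & 1 for i in reversed(range(num_bits))]


def bits_to_word(bit_probs):
    """Threshold per-bit probabilities at 0.5 and decode to a word id."""
    word_id = 0
    for p in bit_probs:
        word_id = (word_id << 1) | (1 if p >= 0.5 else 0)
    return word_id
```

For a vocabulary of 65,536 words, `code_length` returns 16, so an output layer of this kind would predict 16 bit probabilities instead of 65,536 softmax scores. A single wrong bit decodes to a different word, which is one reason robustness measures matter.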
-
Incorporating Discrete Translation Lexicons into Neural Machine Translation
Authors:
Philip Arthur,
Graham Neubig,
Satoshi Nakamura
Abstract:
Neural machine translation (NMT) often makes mistakes in translating low-frequency content words that are essential to understanding the meaning of the sentence. We propose a method to alleviate this problem by augmenting NMT systems with discrete translation lexicons that efficiently encode translations of these low-frequency words. We describe a method to calculate the lexicon probability of the next word in the translation candidate by using the attention vector of the NMT model to select which source word lexical probabilities the model should focus on. We test two methods to combine this probability with the standard NMT probability: (1) using it as a bias, and (2) linear interpolation. Experiments on two corpora show an improvement of 2.0-2.3 BLEU and 0.13-0.44 NIST score, and faster convergence time.
Submitted 4 October, 2016; v1 submitted 6 June, 2016;
originally announced June 2016.
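
The two combination methods named in the abstract above (using the lexicon probability as a bias, and linear interpolation) can be sketched in plain Python. This is a minimal sketch under assumptions: the lexicon probability is taken to be the attention-weighted sum of per-source-word lexical probabilities, `eps` and the fixed interpolation weight `lam` are illustrative values chosen here, and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def lexicon_prob(attention, lex_table):
    """Attention-weighted lexicon distribution over the target vocabulary.

    attention: weights over source positions (assumed to sum to 1).
    lex_table: lex_table[j][v] is the lexicon probability of target
               word v given the source word at position j.
    """
    vocab = len(lex_table[0])
    return [
        sum(a * row[v] for a, row in zip(attention, lex_table))
        for v in range(vocab)
    ]


def combine_bias(logits, p_lex, eps=1e-3):
    """Method (1): add log lexicon probability to the logits as a bias."""
    return softmax([x + math.log(p + eps) for x, p in zip(logits, p_lex)])


def combine_linear(logits, p_lex, lam=0.5):
    """Method (2): linearly interpolate lexicon and NMT probabilities."""
    p_nmt = softmax(logits)
    return [lam * pl + (1 - lam) * pn for pl, pn in zip(p_lex, p_nmt)]
```

The `eps` term keeps the logarithm finite for words the lexicon assigns zero probability; in the bias variant such words are strongly suppressed, whereas interpolation still leaves them a share of the NMT probability.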