-
A New Massive Multilingual Dataset for High-Performance Language Technologies
Authors:
Ona de Gibert,
Graeme Nail,
Nikolay Arefyev,
Marta Bañón,
Jelmer van der Linde,
Shaoxiong Ji,
Jaume Zaragoza-Bernabeu,
Mikko Aulamo,
Gema Ramírez-Sánchez,
Andrey Kutuzov,
Sampo Pyysalo,
Stephan Oepen,
Jörg Tiedemann
Abstract:
We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of ~5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are among the largest open text corpora ever released, providing a valuable resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.
Submitted 20 March, 2024;
originally announced March 2024.
-
OpusCleaner and OpusTrainer, open source toolkits for training Machine Translation and Large language models
Authors:
Nikolay Bogoychev,
Jelmer van der Linde,
Graeme Nail,
Barry Haddow,
Jaume Zaragoza-Bernabeu,
Gema Ramírez-Sánchez,
Lukas Weymann,
Tudor Nicolae Mateiu,
Jindřich Helcl,
Mikko Aulamo
Abstract:
Developing high-quality machine translation systems is a labour-intensive, challenging, and confusing process for newcomers to the field. We present a pair of tools, OpusCleaner and OpusTrainer, that aim to simplify the process, reduce the amount of work, and lower the entry barrier for newcomers.
OpusCleaner is a data downloading, cleaning, and preprocessing toolkit. It is designed to allow researchers to quickly download, visualise, and preprocess bilingual (or monolingual) data that comes from many different sources, each with different quality, issues, and unique filtering/preprocessing requirements.
OpusTrainer is a data scheduling and data augmenting tool aimed at building large-scale, robust machine translation systems and large language models. It features deterministic data mixing from many different sources, on-the-fly data augmentation, and more.
Using these tools, we showcase how to create high-quality machine translation models that are robust to noisy user input, as well as multilingual models and terminology-aware models.
Submitted 24 November, 2023;
originally announced November 2023.
-
Democratizing Neural Machine Translation with OPUS-MT
Authors:
Jörg Tiedemann,
Mikko Aulamo,
Daria Bakshandaeva,
Michele Boggia,
Stig-Arne Grönroos,
Tommi Nieminen,
Alessandro Raganato,
Yves Scherrer,
Raul Vazquez,
Sami Virpioja
Abstract:
This paper presents the OPUS ecosystem with a focus on the development of open machine translation models and tools, and their integration into end-user applications, development platforms and professional workflows. We discuss our on-going mission of increasing language coverage and translation quality, and also describe on-going work on the development of modular translation models and speed-optimized compact solutions for real-time translation on regular desktops and small devices.
Submitted 4 July, 2023; v1 submitted 4 December, 2022;
originally announced December 2022.
-
Paraphrase Detection on Noisy Subtitles in Six Languages
Authors:
Eetu Sjöblom,
Mathias Creutz,
Mikko Aulamo
Abstract:
We perform automatic paraphrase detection on subtitle data from the Opusparcus corpus comprising six European languages: German, English, Finnish, French, Russian, and Swedish. We train two types of supervised sentence embedding models: a word-averaging (WA) model and a gated recurrent averaging network (GRAN) model. We find that GRAN outperforms WA and is more robust to noisy training data. Better results are obtained with more and noisier data than with less and cleaner data. Additionally, we experiment on other datasets, without reaching the same level of performance, because of domain mismatch between training and test data.
Submitted 21 September, 2018;
originally announced September 2018.