Showing 1–4 of 4 results for author: Aulamo, M

  1. arXiv:2403.14009  [pdf, other]

    cs.CL

    A New Massive Multilingual Dataset for High-Performance Language Technologies

    Authors: Ona de Gibert, Graeme Nail, Nikolay Arefyev, Marta Bañón, Jelmer van der Linde, Shaoxiong Ji, Jaume Zaragoza-Bernabeu, Mikko Aulamo, Gema Ramírez-Sánchez, Andrey Kutuzov, Sampo Pyysalo, Stephan Oepen, Jörg Tiedemann

    Abstract: We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performa…

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: LREC-COLING 2024

  2. arXiv:2311.14838  [pdf, other]

    cs.CL

    OpusCleaner and OpusTrainer, open source toolkits for training Machine Translation and Large language models

    Authors: Nikolay Bogoychev, Jelmer van der Linde, Graeme Nail, Barry Haddow, Jaume Zaragoza-Bernabeu, Gema Ramírez-Sánchez, Lukas Weymann, Tudor Nicolae Mateiu, Jindřich Helcl, Mikko Aulamo

    Abstract: Developing high-quality machine translation systems is a labour-intensive, challenging and confusing process for newcomers to the field. We present a pair of tools, OpusCleaner and OpusTrainer, that aim to simplify the process, reduce the amount of work and lower the entry barrier for newcomers. OpusCleaner is a data downloading, cleaning, and preprocessing toolkit. It is designed to allow researc…

    Submitted 24 November, 2023; originally announced November 2023.

    Comments: Code on Github: https://github.com/hplt-project/OpusCleaner and https://github.com/hplt-project/OpusTrainer

  3. arXiv:2212.01936  [pdf, other]

    cs.CL

    Democratizing Neural Machine Translation with OPUS-MT

    Authors: Jörg Tiedemann, Mikko Aulamo, Daria Bakshandaeva, Michele Boggia, Stig-Arne Grönroos, Tommi Nieminen, Alessandro Raganato, Yves Scherrer, Raul Vazquez, Sami Virpioja

    Abstract: This paper presents the OPUS ecosystem with a focus on the development of open machine translation models and tools, and their integration into end-user applications, development platforms and professional workflows. We discuss our on-going mission of increasing language coverage and translation quality, and also describe on-going work on the development of modular translation models and speed-opt…

    Submitted 4 July, 2023; v1 submitted 4 December, 2022; originally announced December 2022.

  4. arXiv:1809.07978  [pdf, other]

    cs.CL

    Paraphrase Detection on Noisy Subtitles in Six Languages

    Authors: Eetu Sjöblom, Mathias Creutz, Mikko Aulamo

    Abstract: We perform automatic paraphrase detection on subtitle data from the Opusparcus corpus comprising six European languages: German, English, Finnish, French, Russian, and Swedish. We train two types of supervised sentence embedding models: a word-averaging (WA) model and a gated recurrent averaging network (GRAN) model. We find that GRAN outperforms WA and is more robust to noisy training data. B…

    Submitted 21 September, 2018; originally announced September 2018.

    Comments: To appear in Proceedings of W-NUT at EMNLP 2018, Brussels, Belgium, 1 November 2018