Showing 1–27 of 27 results for author: Soroa, A

  1. arXiv:2406.15227  [pdf, other]

    cs.CL

    A LLM-Based Ranking Method for the Evaluation of Automatic Counter-Narrative Generation

    Authors: Irune Zubiaga, Aitor Soroa, Rodrigo Agerri

    Abstract: The proliferation of misinformation and harmful narratives in online discourse has underscored the critical need for effective Counter Narrative (CN) generation techniques. However, existing automatic evaluation methods often lack interpretability and fail to capture the nuanced relationship between generated CNs and human perception. Aiming to achieve a higher correlation with human judgments, th…

    Submitted 21 June, 2024; originally announced June 2024.

  2. arXiv:2406.07302  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    BertaQA: How Much Do Language Models Know About Local Culture?

    Authors: Julen Etxaniz, Gorka Azkune, Aitor Soroa, Oier Lopez de Lacalle, Mikel Artetxe

    Abstract: Large Language Models (LLMs) exhibit extensive knowledge about the world, but most evaluations have been limited to global or anglocentric subjects. This raises the question of how well these models perform on topics relevant to other cultures, whose presence on the web is not that prominent. To address this gap, we introduce BertaQA, a multiple-choice trivia dataset that is parallel in English an…

    Submitted 11 June, 2024; originally announced June 2024.

  3. arXiv:2404.06996  [pdf, other]

    cs.CL cs.AI

    XNLIeu: a dataset for cross-lingual NLI in Basque

    Authors: Maite Heredia, Julen Etxaniz, Muitze Zulaika, Xabier Saralegi, Jeremy Barnes, Aitor Soroa

    Abstract: XNLI is a popular Natural Language Inference (NLI) benchmark widely used to evaluate cross-lingual Natural Language Understanding (NLU) capabilities across languages. In this paper, we expand XNLI to include Basque, a low-resource language that can greatly benefit from transfer-learning approaches. The new dataset, dubbed XNLIeu, has been developed by first machine-translating the English XNLI cor…

    Submitted 10 April, 2024; originally announced April 2024.

    Comments: Accepted to NAACL 2024

  4. arXiv:2403.20266  [pdf, other]

    cs.CL cs.AI cs.LG

    Latxa: An Open Language Model and Evaluation Suite for Basque

    Authors: Julen Etxaniz, Oscar Sainz, Naiara Perez, Itziar Aldabe, German Rigau, Eneko Agirre, Aitor Ormazabal, Mikel Artetxe, Aitor Soroa

    Abstract: We introduce Latxa, a family of large language models for Basque ranging from 7 to 70 billion parameters. Latxa is based on Llama 2, which we continue pretraining on a new Basque corpus comprising 4.3M documents and 4.2B tokens. Addressing the scarcity of high-quality benchmarks for Basque, we further introduce 4 multiple choice evaluation datasets: EusProficiency, comprising 5,169 questions from…

    Submitted 29 March, 2024; originally announced March 2024.

  5. arXiv:2403.00587  [pdf, other]

    cs.CV cs.AI

    Improving Explicit Spatial Relationships in Text-to-Image Generation through an Automatically Derived Dataset

    Authors: Ander Salaberria, Gorka Azkune, Oier Lopez de Lacalle, Aitor Soroa, Eneko Agirre, Frank Keller

    Abstract: Existing work has observed that current text-to-image systems do not accurately reflect explicit spatial relations between objects such as 'left of' or 'below'. We hypothesize that this is because explicit spatial relations rarely appear in the image captions used to train these models. We propose an automatic method that, given existing images, generates synthetic captions that contain 14 explici…

    Submitted 1 March, 2024; originally announced March 2024.

    Comments: 12 pages and 5 figures

  6. arXiv:2308.01223  [pdf, other]

    cs.CL cs.AI cs.LG

    Do Multilingual Language Models Think Better in English?

    Authors: Julen Etxaniz, Gorka Azkune, Aitor Soroa, Oier Lopez de Lacalle, Mikel Artetxe

    Abstract: Translate-test is a popular technique to improve the performance of multilingual language models. This approach works by translating the input into English using an external machine translation system, and running inference over the translated input. However, these improvements can be attributed to the use of a separate translation system, which is typically trained on large amounts of parallel da…

    Submitted 2 August, 2023; originally announced August 2023.

  7. arXiv:2303.03915  [pdf, other]

    cs.CL cs.AI

    The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

    Authors: Hugo Laurençon, Lucile Saulnier, Thomas Wang, Christopher Akiki, Albert Villanova del Moral, Teven Le Scao, Leandro Von Werra, Chenghao Mou, Eduardo González Ponferrada, Huu Nguyen, Jörg Frohberg, Mario Šaško, Quentin Lhoest, Angelina McMillan-Major, Gerard Dupont, Stella Biderman, Anna Rogers, Loubna Ben allal, Francesco De Toni, Giada Pistilli, Olivier Nguyen, Somaieh Nikpoor, Maraim Masoud, Pierre Colombo, Javier de la Rosa, et al. (29 additional authors not shown)

    Abstract: As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the f…

    Submitted 7 March, 2023; originally announced March 2023.

    Comments: NeurIPS 2022, Datasets and Benchmarks Track

    ACM Class: I.2.7

  8. arXiv:2211.05100  [pdf, other]

    cs.CL

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Authors: BigScience Workshop, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, et al. (369 additional authors not shown)

    Abstract: Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access…

    Submitted 27 June, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

  9. arXiv:2211.03152  [pdf, other]

    cs.CL cs.AI

    Noisy Channel for Automatic Text Simplification

    Authors: Oscar M Cumbicus-Pineda, Iker Gutiérrez-Fandiño, Itziar Gonzalez-Dios, Aitor Soroa

    Abstract: In this paper we present a simple re-ranking method for Automatic Sentence Simplification based on the noisy channel scheme. Instead of directly computing the best simplification given a complex text, the re-ranking method also considers the probability of the simple sentence to produce the complex counterpart, as well as the probability of the simple text itself, according to a language model. Ou…

    Submitted 6 November, 2022; originally announced November 2022.

    Comments: 8 pages

  10. arXiv:2205.12213  [pdf, other]

    cs.CL

    Principled Paraphrase Generation with Parallel Corpora

    Authors: Aitor Ormazabal, Mikel Artetxe, Aitor Soroa, Gorka Labaka, Eneko Agirre

    Abstract: Round-trip Machine Translation (MT) is a popular choice for paraphrase generation, which leverages readily available parallel corpora for supervision. In this paper, we formalize the implicit similarity function induced by this approach, and show that it is susceptible to non-paraphrase pairs sharing a single ambiguous translation. Based on these insights, we design an alternative similarity metri…

    Submitted 23 May, 2023; v1 submitted 24 May, 2022; originally announced May 2022.

  11. arXiv:2205.12206  [pdf, other]

    cs.CL cs.AI

    PoeLM: A Meter- and Rhyme-Controllable Language Model for Unsupervised Poetry Generation

    Authors: Aitor Ormazabal, Mikel Artetxe, Manex Agirrezabal, Aitor Soroa, Eneko Agirre

    Abstract: Formal verse poetry imposes strict constraints on the meter and rhyme scheme of poems. Most prior work on generating this type of poetry uses existing poems for supervision, which are difficult to obtain for most languages and poetic forms. In this work, we propose an unsupervised approach to generate poems following any given meter and rhyme scheme, without requiring any poetic text for training.…

    Submitted 28 October, 2022; v1 submitted 24 May, 2022; originally announced May 2022.

    Comments: EMNLP Findings 2022

  12. arXiv:2203.08111  [pdf, other]

    cs.CL cs.AI cs.LG

    Does Corpus Quality Really Matter for Low-Resource Languages?

    Authors: Mikel Artetxe, Itziar Aldabe, Rodrigo Agerri, Olatz Perez-de-Viñaspre, Aitor Soroa

    Abstract: The vast majority of non-English corpora are derived from automatically filtered versions of CommonCrawl. While prior work has identified major issues on the quality of these datasets (Kreutzer et al., 2021), it is not clear how this impacts downstream performance. Taking representation learning in Basque as a case study, we explore tailored crawling (manually identifying and scraping websites wit…

    Submitted 26 October, 2022; v1 submitted 15 March, 2022; originally announced March 2022.

    Comments: EMNLP 2022

  13. arXiv:2201.10066  [pdf, other]

    cs.CL cs.DB

    Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources

    Authors: Angelina McMillan-Major, Zaid Alyafeai, Stella Biderman, Kimbo Chen, Francesco De Toni, Gérard Dupont, Hady Elsahar, Chris Emezue, Alham Fikri Aji, Suzana Ilić, Nurulaqilla Khamis, Colin Leong, Maraim Masoud, Aitor Soroa, Pedro Ortiz Suarez, Zeerak Talat, Daniel van Strien, Yacine Jernite

    Abstract: In recent years, large-scale data collection efforts have prioritized the amount of data collected in order to improve the modeling capabilities of large language models. This prioritization, however, has resulted in concerns with respect to the rights of data subjects represented in data collections, particularly when considering the difficulty in interrogating these collections due to insufficie…

    Submitted 24 January, 2022; originally announced January 2022.

    Comments: 8 pages plus appendix and references

  14. Image Captioning for Effective Use of Language Models in Knowledge-Based Visual Question Answering

    Authors: Ander Salaberria, Gorka Azkune, Oier Lopez de Lacalle, Aitor Soroa, Eneko Agirre

    Abstract: Integrating outside knowledge for reasoning in visio-linguistic tasks such as visual question answering (VQA) is an open problem. Given that pretrained language models have been shown to include world knowledge, we propose to use a unimodal (text-only) train and inference procedure based on automatic off-the-shelf captioning of images and pretrained language models. Our results on a visual questio…

    Submitted 25 March, 2022; v1 submitted 15 September, 2021; originally announced September 2021.

    Comments: Under review. 25 pages with 4 figures

    Journal ref: Expert Systems with Applications, Volume 212, 2023, 118669

  15. Inferring spatial relations from textual descriptions of images

    Authors: Aitzol Elu, Gorka Azkune, Oier Lopez de Lacalle, Ignacio Arganda-Carreras, Aitor Soroa, Eneko Agirre

    Abstract: Generating an image from its textual description requires both a certain level of language understanding and common sense knowledge about the spatial relations of the physical entities being described. In this work, we focus on inferring the spatial relation between entities, a key step in the process of composing scenes based on text. More specifically, given a caption containing a mention to a s…

    Submitted 1 February, 2021; originally announced February 2021.

    Comments: Accepted in Pattern Recognition

    Journal ref: Pattern Recognition, Volume 113, 2021, 107847

  16. Beyond Offline Mapping: Learning Cross Lingual Word Embeddings through Context Anchoring

    Authors: Aitor Ormazabal, Mikel Artetxe, Aitor Soroa, Gorka Labaka, Eneko Agirre

    Abstract: Recent research on cross-lingual word embeddings has been dominated by unsupervised mapping approaches that align monolingual embeddings. Such methods critically rely on those embeddings having a similar structure, but it was recently shown that the separate training in different languages causes departures from this assumption. In this paper, we propose an alternative approach that does not have…

    Submitted 3 August, 2021; v1 submitted 31 December, 2020; originally announced December 2020.

    Comments: ACL 2021

  17. arXiv:2011.00615  [pdf, other]

    cs.CL

    Improving Conversational Question Answering Systems after Deployment using Feedback-Weighted Learning

    Authors: Jon Ander Campos, Kyunghyun Cho, Arantxa Otegi, Aitor Soroa, Gorka Azkune, Eneko Agirre

    Abstract: The interaction of conversational systems with users poses an exciting opportunity for improving them after deployment, but little evidence has been provided of its feasibility. In most applications, users are not able to provide the correct answer to the system, but they are able to provide binary (correct, incorrect) feedback. In this paper we propose feedback-weighted learning based on importan…

    Submitted 1 November, 2020; originally announced November 2020.

    Comments: Accepted at COLING 2020. 11 pages, 5 figures

  18. arXiv:2010.02140  [pdf, other]

    cs.AI cs.CL

    Spot The Bot: A Robust and Efficient Framework for the Evaluation of Conversational Dialogue Systems

    Authors: Jan Deriu, Don Tuggener, Pius von Däniken, Jon Ander Campos, Alvaro Rodrigo, Thiziri Belkacem, Aitor Soroa, Eneko Agirre, Mark Cieliebak

    Abstract: The lack of time-efficient and reliable evaluation methods hampers the development of conversational dialogue systems (chatbots). Evaluations requiring humans to converse with chatbots are time- and cost-intensive, put high cognitive demands on the human judges, and yield low-quality results. In this work, we introduce Spot The Bot, a cost-efficient and robust evaluation framework that replac…

    Submitted 5 October, 2020; originally announced October 2020.

  19. arXiv:2005.01328  [pdf, other]

    cs.CL

    DoQA -- Accessing Domain-Specific FAQs via Conversational QA

    Authors: Jon Ander Campos, Arantxa Otegi, Aitor Soroa, Jan Deriu, Mark Cieliebak, Eneko Agirre

    Abstract: The goal of this work is to build conversational Question Answering (QA) interfaces for the large body of domain-specific information available in FAQ sites. We present DoQA, a dataset with 2,437 dialogues and 10,917 QA pairs. The dialogues are collected from three Stack Exchange sites using the Wizard of Oz method with crowdsourcing. Compared to previous work, DoQA comprises well-defined informat…

    Submitted 18 May, 2020; v1 submitted 4 May, 2020; originally announced May 2020.

    Comments: Accepted at ACL 2020. 13 pages 4 figures

    Journal ref: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020

  20. arXiv:2004.01894  [pdf, other]

    cs.CL

    Evaluating Multimodal Representations on Visual Semantic Textual Similarity

    Authors: Oier Lopez de Lacalle, Ander Salaberria, Aitor Soroa, Gorka Azkune, Eneko Agirre

    Abstract: The combination of visual and textual representations has produced excellent results in tasks such as image captioning and visual question answering, but the inference capabilities of multimodal representations are largely untested. In the case of textual representations, inference tasks such as Textual Entailment and Semantic Textual Similarity have been often used to benchmark the quality of tex…

    Submitted 4 April, 2020; originally announced April 2020.

    Comments: Accepted in ECAI-2020, 8 pages, 6 tables, 6 figures

  21. arXiv:2004.00033  [pdf, ps, other]

    cs.CL

    Give your Text Representation Models some Love: the Case for Basque

    Authors: Rodrigo Agerri, Iñaki San Vicente, Jon Ander Campos, Ander Barrena, Xabier Saralegi, Aitor Soroa, Eneko Agirre

    Abstract: Word embeddings and pre-trained language models make it possible to build rich representations of text and have enabled improvements across most NLP tasks. Unfortunately, they are very expensive to train, and many small companies and research groups tend to use models that have been pre-trained and made available by third parties, rather than building their own. This is suboptimal as, for many languages, the…

    Submitted 2 April, 2020; v1 submitted 31 March, 2020; originally announced April 2020.

    Comments: Accepted at LREC 2020; 8 pages, 7 tables

  22. Analyzing the Limitations of Cross-lingual Word Embedding Mappings

    Authors: Aitor Ormazabal, Mikel Artetxe, Gorka Labaka, Aitor Soroa, Eneko Agirre

    Abstract: Recent research in cross-lingual word embeddings has almost exclusively focused on offline methods, which independently train word embeddings in different languages and map them to a shared space through linear transformations. While several authors have questioned the underlying isomorphism assumption, which states that word embeddings in different languages have approximately the same structure,…

    Submitted 12 June, 2019; originally announced June 2019.

    Comments: ACL 2019

  23. arXiv:1809.03695  [pdf, other]

    cs.CL cs.AI

    Evaluating Multimodal Representations on Sentence Similarity: vSTS, Visual Semantic Textual Similarity Dataset

    Authors: Oier Lopez de Lacalle, Aitor Soroa, Eneko Agirre

    Abstract: In this paper we introduce vSTS, a new dataset for measuring textual similarity of sentences using multimodal information. The dataset comprises images along with their respective textual captions. We describe the dataset both quantitatively and qualitatively, and claim that it is a valid gold standard for measuring automatic multimodal textual similarity systems. We also describe the initia…

    Submitted 11 September, 2018; originally announced September 2018.

    Journal ref: ICCV17: second workshop on Closing the Loop Between Vision and Language. Venice, Italy. 2017

  24. arXiv:1805.04277  [pdf, ps, other]

    cs.CL

    The risk of sub-optimal use of Open Source NLP Software: UKB is inadvertently state-of-the-art in knowledge-based WSD

    Authors: Eneko Agirre, Oier López de Lacalle, Aitor Soroa

    Abstract: UKB is an open source collection of programs for performing, among other tasks, knowledge-based Word Sense Disambiguation (WSD). Since it was released in 2009 it has been often used out-of-the-box in sub-optimal settings. We show that nine years later it is the state-of-the-art on knowledge-based WSD. This case shows the pitfalls of releasing open source NLP software without optimal default settin…

    Submitted 11 May, 2018; originally announced May 2018.

  25. Bilingual Embeddings with Random Walks over Multilingual Wordnets

    Authors: J. Goikoetxea, A. Soroa, E. Agirre

    Abstract: Bilingual word embeddings represent words of two languages in the same space, and allow knowledge to be transferred from one language to the other without machine translation. The main approach is to train monolingual embeddings first and then map them using bilingual dictionaries. In this work, we present a novel method to learn bilingual embeddings based on multilingual knowledge bases (KB) such as Wo…

    Submitted 23 April, 2018; originally announced April 2018.

    Comments: Preprint version, Knowledge-Based Systems (ISSN: 0950-7051). (2018)

  26. arXiv:1509.03739  [pdf, other]

    cs.CL

    Improving distant supervision using inference learning

    Authors: Roland Roller, Eneko Agirre, Aitor Soroa, Mark Stevenson

    Abstract: Distant supervision is a widely applied approach to automatic training of relation extraction systems and has the advantage that it can generate large amounts of labelled data with minimal effort. However, this data may contain errors and consequently systems trained using distant supervision tend not to perform as well as those based on manually labelled data. This work proposes a novel method fo…

    Submitted 12 September, 2015; originally announced September 2015.

    Comments: In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

  27. arXiv:1503.01655  [pdf, other]

    cs.CL

    Studying the Wikipedia Hyperlink Graph for Relatedness and Disambiguation

    Authors: Eneko Agirre, Ander Barrena, Aitor Soroa

    Abstract: Hyperlinks and other relations in Wikipedia are an extraordinary resource which is still not fully understood. In this paper we study the different types of links in Wikipedia, and contrast the use of the full graph with respect to just direct links. We apply a well-known random walk algorithm on two tasks, word relatedness and named-entity disambiguation. We show that using the full graph is more…

    Submitted 12 March, 2015; v1 submitted 5 March, 2015; originally announced March 2015.