Showing 1–5 of 5 results for author: Conforti, C

  1. arXiv:2403.19546  [pdf, other]

    cs.LG cs.AI cs.DB cs.IR

    Croissant: A Metadata Format for ML-Ready Datasets

    Authors: Mubashara Akhtar, Omar Benjelloun, Costanza Conforti, Pieter Gijsbers, Joan Giner-Miguelez, Nitisha Jain, Michael Kuchnik, Quentin Lhoest, Pierre Marcenac, Manil Maskey, Peter Mattson, Luis Oala, Pierre Ruyssen, Rajat Shinde, Elena Simperl, Goeffry Thomas, Slava Tykhonov, Joaquin Vanschoren, Jos van der Velde, Steffen Vogler, Carole-Jean Wu

    Abstract: Data is a critical resource for Machine Learning (ML), yet working with data remains a key friction point. This paper introduces Croissant, a metadata format for datasets that simplifies how data is used by ML tools and frameworks. Croissant makes datasets more discoverable, portable and interoperable, thereby addressing significant challenges in ML data management and responsible AI. Croissant is…

    Submitted 30 May, 2024; v1 submitted 28 March, 2024; originally announced March 2024.

    Comments: Published in Proceedings of ACM SIGMOD/PODS'24 Data Management for End-to-End Machine Learning (DEEM) Workshop https://dl.acm.org/doi/10.1145/3650203.3663326

  2. arXiv:2102.02841  [pdf, other]

    cs.CL

    Building Representative Corpora from Illiterate Communities: A Review of Challenges and Mitigation Strategies for Developing Countries

    Authors: Stephanie Hirmer, Alycia Leonard, Josephine Tumwesige, Costanza Conforti

    Abstract: Most well-established data collection methods currently adopted in NLP depend on the assumption of speaker literacy. Consequently, the collected corpora largely fail to represent swathes of the global population, which tend to be some of the most vulnerable and marginalised people in society, and often live in rural developing areas. Such underrepresented groups are thus not only ignored when maki…

    Submitted 4 February, 2021; originally announced February 2021.

    Comments: Accepted at EACL 2021

  3. arXiv:2005.00388  [pdf, other]

    cs.CL

    Will-They-Won't-They: A Very Large Dataset for Stance Detection on Twitter

    Authors: Costanza Conforti, Jakob Berndt, Mohammad Taher Pilehvar, Chryssi Giannitsarou, Flavio Toxvaerd, Nigel Collier

    Abstract: We present a new challenging stance detection dataset, called Will-They-Won't-They (WT-WT), which contains 51,284 tweets in English, making it by far the largest available dataset of the type. All the annotations are carried out by experts; therefore, the dataset constitutes a high-quality and reliable benchmark for future research in stance detection. Our experiments with a wide range of recent s…

    Submitted 1 May, 2020; originally announced May 2020.

    Comments: 10 pages. Accepted at ACL 2020

  4. arXiv:2004.12935  [pdf, other]

    cs.CL

    Natural language processing for achieving sustainable development: the case of neural labelling to enhance community profiling

    Authors: Costanza Conforti, Stephanie Hirmer, David Morgan, Marco Basaldella, Yau Ben Or

    Abstract: In recent years, there has been an increasing interest in the application of Artificial Intelligence - and especially Machine Learning - to the field of Sustainable Development (SD). However, until now, NLP has not been applied in this context. In this research paper, we show the high potential of NLP applications to enhance the sustainability of projects. In particular, we focus on the case of co…

    Submitted 17 November, 2020; v1 submitted 27 April, 2020; originally announced April 2020.

    Comments: 18 pages, 9 figures. Accepted at EMNLP 2020

  5. Neural Architectures for Open-Type Relation Argument Extraction

    Authors: Benjamin Roth, Costanza Conforti, Nina Poerner, Sanjeev Karn, Hinrich Schütze

    Abstract: In this work, we introduce the task of Open-Type Relation Argument Extraction (ORAE): Given a corpus, a query entity Q and a knowledge base relation (e.g., "Q authored notable work with title X"), the model has to extract an argument of non-standard entity type (entities that cannot be extracted by a standard named entity tagger, e.g. X: the title of a book or a work of art) from the corpus. A dist…

    Submitted 30 September, 2018; v1 submitted 5 March, 2018; originally announced March 2018.

    Journal ref: Nat. Lang. Eng. 25 (2019) 219-238