Showing 1–13 of 13 results for author: Corro, C

  1. arXiv:2403.17534  [pdf, other]

    cs.CL

    Sparse Logistic Regression with High-order Features for Automatic Grammar Rule Extraction from Treebanks

    Authors: Santiago Herrera, Caio Corro, Sylvain Kahane

    Abstract: Descriptive grammars are highly valuable, but writing them is time-consuming and difficult. Furthermore, while linguists typically use corpora to create them, grammar descriptions often lack quantitative data. As for formal grammars, they can be challenging to interpret. In this paper, we propose a new method to extract and explore significant fine-grained grammar patterns and potential syntactic…

    Submitted 26 March, 2024; originally announced March 2024.

    Comments: Published in LREC-Coling 2024 proceedings

  2. arXiv:2403.03883  [pdf, other]

    cs.CL

    SaulLM-7B: A pioneering Large Language Model for Law

    Authors: Pierre Colombo, Telmo Pessoa Pires, Malik Boudiaf, Dominic Culver, Rui Melo, Caio Corro, Andre F. T. Martins, Fabrizio Esposito, Vera Lúcia Raposo, Sofia Morgado, Michael Desa

    Abstract: In this paper, we introduce SaulLM-7B, a large language model (LLM) tailored for the legal domain. With 7 billion parameters, SaulLM-7B is the first LLM designed explicitly for legal text comprehension and generation. Leveraging the Mistral 7B architecture as its foundation, SaulLM-7B is trained on an English legal corpus of over 30 billion tokens. SaulLM-7B exhibits state-of-the-art proficiency i…

    Submitted 7 March, 2024; v1 submitted 6 March, 2024; originally announced March 2024.

  3. arXiv:2402.00786  [pdf, other]

    cs.CL cs.LG

    CroissantLLM: A Truly Bilingual French-English Language Model

    Authors: Manuel Faysse, Patrick Fernandes, Nuno M. Guerreiro, António Loison, Duarte M. Alves, Caio Corro, Nicolas Boizard, João Alves, Ricardo Rei, Pedro H. Martins, Antoni Bigata Casademunt, François Yvon, André F. T. Martins, Gautier Viaud, Céline Hudelot, Pierre Colombo

    Abstract: We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and French tokens, to bring to the research and industrial community a high-performance, fully open-sourced bilingual model that runs swiftly on consumer-grade local hardware. To that end, we pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio, a cust…

    Submitted 29 March, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

  4. arXiv:2310.14124  [pdf, other]

    cs.CL

    Structural generalization in COGS: Supertagging is (almost) all you need

    Authors: Alban Petit, Caio Corro, François Yvon

    Abstract: In many Natural Language Processing applications, neural networks have been found to fail to generalize on out-of-distribution examples. In particular, several recent semantic parsing datasets have put forward important limitations of neural networks in cases where compositional generalization is required. In this work, we extend a neural graph-based semantic parsing framework in several ways to a…

    Submitted 21 October, 2023; originally announced October 2023.

    Comments: accepted at EMNLP 2023

  5. arXiv:2302.07679  [pdf, other]

    cs.CL cs.LG

    On graph-based reentrancy-free semantic parsing

    Authors: Alban Petit, Caio Corro

    Abstract: We propose a novel graph-based approach for semantic parsing that resolves two problems observed in the literature: (1) seq2seq models fail on compositional generalization tasks; (2) previous work using phrase structure parsers cannot cover all the semantic parses observed in treebanks. We prove that both MAP inference and latent tag anchoring (required for weakly-supervised learning) are NP-hard…

    Submitted 15 February, 2023; originally announced February 2023.

    Comments: This work has been accepted for publication in TACL. This version is a pre-MIT Press publication version

  6. arXiv:2301.10810  [pdf, other]

    cs.LG cs.CL stat.ML

    On the inconsistency of separable losses for structured prediction

    Authors: Caio Corro

    Abstract: In this paper, we prove that separable negative log-likelihood losses for structured prediction are not necessarily Bayes consistent, or, in other words, minimizing these losses may not result in a model that predicts the most probable structure in the data distribution for a given input. This fact opens the question of whether these losses are well-adapted for structured prediction and, if so, wh…

    Submitted 25 January, 2023; originally announced January 2023.

    Comments: Preprint, to appear in proc. of EACL 2023

  7. arXiv:2301.07473  [pdf, other]

    cs.LG stat.ML

    Discrete Latent Structure in Neural Networks

    Authors: Vlad Niculae, Caio F. Corro, Nikita Nangia, Tsvetomila Mihaylova, André F. T. Martins

    Abstract: Many types of data from fields including natural language processing, computer vision, and bioinformatics, are well represented by discrete, compositional structures such as trees, sequences, or matchings. Latent structure models are a powerful tool for learning to extract such representations, offering a way to incorporate structural bias, discover insight about the data, and interpret decisions.…

    Submitted 18 January, 2023; originally announced January 2023.

    ACM Class: I.2.6

  8. arXiv:2210.04738  [pdf, ps, other]

    cs.CL

    A dynamic programming algorithm for span-based nested named-entity recognition in O(n^2)

    Authors: Caio Corro

    Abstract: Span-based nested named-entity recognition (NER) has a cubic-time complexity using a variant of the CYK algorithm. We show that by adding a supplementary structural constraint on the search space, nested NER has a quadratic-time complexity, that is, the same asymptotic complexity as the non-nested case. The proposed algorithm covers a large part of three standard English benchmarks and delivers c…

    Submitted 26 May, 2023; v1 submitted 10 October, 2022; originally announced October 2022.

    Comments: ACL 2023

  9. arXiv:2112.00709  [pdf, ps, other]

    cs.DC cs.CL

    GPU-Accelerated Forward-Backward algorithm with Application to Lattice-Free MMI

    Authors: Lucas Ondel, Léa-Marie Lam-Yee-Mui, Martin Kocour, Caio Filippo Corro, Lukáš Burget

    Abstract: We propose to express the forward-backward algorithm in terms of operations between sparse matrices in a specific semiring. This new perspective naturally leads to a GPU-friendly algorithm which is easy to implement in Julia or any programming language with native support of semiring algebra. We use this new implementation to train a TDNN with the LF-MMI objective function and we compare the trai…

    Submitted 22 October, 2021; originally announced December 2021.

    Comments: Submitted to ICASSP 2022

  10. arXiv:2110.14945  [pdf, ps, other]

    cs.LG cs.CL

    Preventing posterior collapse in variational autoencoders for text generation via decoder regularization

    Authors: Alban Petit, Caio Corro

    Abstract: Variational autoencoders trained to minimize the reconstruction error are sensitive to the posterior collapse problem, that is, the proposal posterior distribution is always equal to the prior. We propose a novel regularization method based on fraternal dropout to prevent posterior collapse. We evaluate our approach using several metrics and observe improvements in all the tested configurations.

    Submitted 28 October, 2021; originally announced October 2021.

    Comments: Accepted at NeurIPS 2021 Workshop DGMs Applications

  11. arXiv:2003.13785  [pdf, ps, other]

    cs.CL

    Span-based discontinuous constituency parsing: a family of exact chart-based algorithms with time complexities from O(n^6) down to O(n^3)

    Authors: Caio Corro

    Abstract: We introduce a novel chart-based algorithm for span-based parsing of discontinuous constituency trees of block degree two, including ill-nested structures. In particular, we show that we can build variants of our parser with smaller search spaces and time complexities ranging from $\mathcal O(n^6)$ down to $\mathcal O(n^3)$. The cubic time variant covers 98% of constituents observed in linguistic…

    Submitted 30 March, 2020; originally announced March 2020.

  12. arXiv:1906.09992  [pdf, other]

    cs.CL cs.LG

    Learning Latent Trees with Stochastic Perturbations and Differentiable Dynamic Programming

    Authors: Caio Corro, Ivan Titov

    Abstract: We treat projective dependency trees as latent variables in our probabilistic model and induce them in such a way as to be beneficial for a downstream task, without relying on any direct tree supervision. Our approach relies on Gumbel perturbations and differentiable dynamic programming. Unlike previous approaches to latent tree learning, we stochastically sample global structures and our parser i…

    Submitted 24 June, 2019; originally announced June 2019.

    Comments: Accepted at ACL 2019

  13. arXiv:1807.09875  [pdf, ps, other]

    cs.CL cs.LG

    Differentiable Perturb-and-Parse: Semi-Supervised Parsing with a Structured Variational Autoencoder

    Authors: Caio Corro, Ivan Titov

    Abstract: Human annotation for syntactic parsing is expensive, and large resources are available only for a fraction of languages. A question we ask is whether one can leverage abundant unlabeled texts to improve syntactic parsers, beyond just using the texts to obtain more generalisable lexical features (i.e. beyond word embeddings). To this end, we propose a novel latent-variable generative model for semi…

    Submitted 20 February, 2019; v1 submitted 25 July, 2018; originally announced July 2018.

    Comments: Accepted at ICLR 2019