Showing 1–14 of 14 results for author: Oala, L

  1. arXiv:2404.12241  [pdf, other]

    cs.CL cs.AI

    Introducing v0.5 of the AI Safety Benchmark from MLCommons

    Authors: Bertie Vidgen, Adarsh Agrawal, Ahmed M. Ahmed, Victor Akinwande, Namir Al-Nuaimi, Najla Alfaraj, Elie Alhajjar, Lora Aroyo, Trupti Bavalatti, Max Bartolo, Borhane Blili-Hamelin, Kurt Bollacker, Rishi Bommasani, Marisa Ferrara Boston, Siméon Campos, Kal Chakra, Canyu Chen, Cody Coleman, Zacharie Delpierre Coudert, Leon Derczynski, Debojyoti Dutta, Ian Eisenberg, James Ezick, Heather Frase, Brian Fuller, et al. (75 additional authors not shown)

    Abstract: This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-pu…

    Submitted 13 May, 2024; v1 submitted 18 April, 2024; originally announced April 2024.

  2. arXiv:2403.19546  [pdf, other]

    cs.LG cs.AI cs.DB cs.IR

    Croissant: A Metadata Format for ML-Ready Datasets

    Authors: Mubashara Akhtar, Omar Benjelloun, Costanza Conforti, Pieter Gijsbers, Joan Giner-Miguelez, Nitisha Jain, Michael Kuchnik, Quentin Lhoest, Pierre Marcenac, Manil Maskey, Peter Mattson, Luis Oala, Pierre Ruyssen, Rajat Shinde, Elena Simperl, Goeffry Thomas, Slava Tykhonov, Joaquin Vanschoren, Jos van der Velde, Steffen Vogler, Carole-Jean Wu

    Abstract: Data is a critical resource for Machine Learning (ML), yet working with data remains a key friction point. This paper introduces Croissant, a metadata format for datasets that simplifies how data is used by ML tools and frameworks. Croissant makes datasets more discoverable, portable and interoperable, thereby addressing significant challenges in ML data management and responsible AI. Croissant is…

    Submitted 30 May, 2024; v1 submitted 28 March, 2024; originally announced March 2024.

    Comments: Published in Proceedings of ACM SIGMOD/PODS'24 Data Management for End-to-End Machine Learning (DEEM) Workshop https://dl.acm.org/doi/10.1145/3650203.3663326

  3. arXiv:2311.13028  [pdf, other]

    cs.LG cs.AI cs.DC eess.SP

    DMLR: Data-centric Machine Learning Research -- Past, Present and Future

    Authors: Luis Oala, Manil Maskey, Lilith Bat-Leah, Alicia Parrish, Nezihe Merve Gürel, Tzu-Sheng Kuo, Yang Liu, Rotem Dror, Danilo Brajovic, Xiaozhe Yao, Max Bartolo, William A Gaviria Rojas, Ryan Hileman, Rainier Aliment, Michael W. Mahoney, Meg Risdal, Matthew Lease, Wojciech Samek, Debojyoti Dutta, Curtis G Northcutt, Cody Coleman, Braden Hancock, Bernard Koch, Girmaw Abebe Tadesse, Bojan Karlaš, et al. (13 additional authors not shown)

    Abstract: Drawing from discussions at the inaugural DMLR workshop at ICML 2023 and meetings prior, in this report we outline the relevance of community engagement and infrastructure development for the creation of next-generation public datasets that will advance machine learning science. We chart a path forward as a collective effort to sustain the creation and maintenance of these datasets and methods tow…

    Submitted 1 June, 2024; v1 submitted 21 November, 2023; originally announced November 2023.

    Comments: Published in the Journal of Data-centric Machine Learning Research (DMLR) at https://data.mlr.press/assets/pdf/v01-5.pdf

  4. arXiv:2310.17638  [pdf, other]

    cs.LG stat.ML

    Generative Fractional Diffusion Models

    Authors: Gabriel Nobis, Maximilian Springenberg, Marco Aversa, Michael Detzel, Rembert Daems, Roderick Murray-Smith, Shinichi Nakajima, Sebastian Lapuschkin, Stefano Ermon, Tolga Birdal, Manfred Opper, Christoph Knochenhauer, Luis Oala, Wojciech Samek

    Abstract: We introduce the first continuous-time score-based generative model that leverages fractional diffusion processes for its underlying dynamics. Although diffusion models have excelled at capturing data distributions, they still suffer from various limitations such as slow convergence, mode-collapse on imbalanced data, and lack of diversity. These issues are partially linked to the use of light-tail…

    Submitted 24 June, 2024; v1 submitted 26 October, 2023; originally announced October 2023.

    ACM Class: I.2.4; F.4.1; G.3

  5. arXiv:2307.01767  [pdf, other]

    cs.LG cs.CV

    Localized Data Work as a Precondition for Data-Centric ML: A Case Study of Full Lifecycle Crop Disease Identification in Ghana

    Authors: Darlington Akogo, Issah Samori, Cyril Akafia, Harriet Fiagbor, Andrews Kangah, Donald Kwame Asiedu, Kwabena Fuachie, Luis Oala

    Abstract: The Ghana Cashew Disease Identification with Artificial Intelligence (CADI AI) project demonstrates the importance of sound data work as a precondition for the delivery of useful, localized datacentric solutions for public good tasks such as agricultural productivity and food security. Drone collected data and machine learning are utilized to determine crop stressors. Data, model and the final app…

    Submitted 4 July, 2023; originally announced July 2023.

  6. arXiv:2306.13384  [pdf, other]

    eess.IV cs.CV cs.LG

    DiffInfinite: Large Mask-Image Synthesis via Parallel Random Patch Diffusion in Histopathology

    Authors: Marco Aversa, Gabriel Nobis, Miriam Hägele, Kai Standvoss, Mihaela Chirica, Roderick Murray-Smith, Ahmed Alaa, Lukas Ruff, Daniela Ivanova, Wojciech Samek, Frederick Klauschen, Bruno Sanguinetti, Luis Oala

    Abstract: We present DiffInfinite, a hierarchical diffusion model that generates arbitrarily large histological images while preserving long-range correlation structural information. Our approach first generates synthetic segmentation masks, subsequently used as conditions for the high-fidelity generative diffusion process. The proposed sampling method can be scaled up to any desired image size while only r…

    Submitted 25 October, 2023; v1 submitted 23 June, 2023; originally announced June 2023.

  7. arXiv:2211.15564   

    cs.LG

    Machine Learning for Health symposium 2022 -- Extended Abstract track

    Authors: Antonio Parziale, Monica Agrawal, Shalmali Joshi, Irene Y. Chen, Shengpu Tang, Luis Oala, Adarsh Subbaswamy

    Abstract: A collection of the extended abstracts that were presented at the 2nd Machine Learning for Health symposium (ML4H 2022), which was held both virtually and in person on November 28, 2022, in New Orleans, Louisiana, USA. Machine Learning for Health (ML4H) is a longstanding venue for research into machine learning for health, including both theoretical works and applied works. ML4H 2022 featured two…

    Submitted 28 November, 2022; originally announced November 2022.

    MSC Class: 68Txx

    ACM Class: I.2; J.3; I.6; I.4

  8. arXiv:2211.02578  [pdf]

    cs.LG cs.AI cs.CV

    Data Models for Dataset Drift Controls in Machine Learning With Optical Images

    Authors: Luis Oala, Marco Aversa, Gabriel Nobis, Kurt Willis, Yoan Neuenschwander, Michèle Buck, Christian Matek, Jerome Extermann, Enrico Pomarico, Wojciech Samek, Roderick Murray-Smith, Christoph Clausen, Bruno Sanguinetti

    Abstract: Camera images are ubiquitous in machine learning research. They also play a central role in the delivery of important services spanning medicine and environmental surveying. However, the application of machine learning models in these domains has been limited because of robustness concerns. A primary failure mode are performance drops due to differences between the training and deployment data. Wh…

    Submitted 7 May, 2023; v1 submitted 4 November, 2022; originally announced November 2022.

    Comments: Published as a journal paper in the Transactions on Machine Learning Research 2023 (TMLR) available at https://openreview.net/forum?id=I4IkGmgFJz

  9. arXiv:2112.00179   

    cs.LG

    A collection of the accepted abstracts for the Machine Learning for Health (ML4H) symposium 2021

    Authors: Fabian Falck, Yuyin Zhou, Emma Rocheteau, Liyue Shen, Luis Oala, Girmaw Abebe, Subhrajit Roy, Stephen Pfohl, Emily Alsentzer, Matthew B. A. McDermott

    Abstract: A collection of the accepted abstracts for the Machine Learning for Health (ML4H) symposium 2021. This index is not complete, as some accepted abstracts chose to opt-out of inclusion.

    Submitted 30 November, 2021; originally announced December 2021.

  10. arXiv:2104.10223  [pdf, other]

    cs.LG cs.CV stat.ML

    More Than Meets The Eye: Semi-supervised Learning Under Non-IID Data

    Authors: Saul Calderon-Ramirez, Luis Oala

    Abstract: A common heuristic in semi-supervised deep learning (SSDL) is to select unlabelled data based on a notion of semantic similarity to the labelled data. For example, labelled images of numbers should be paired with unlabelled images of numbers instead of, say, unlabelled images of cars. We refer to this practice as semantic data set matching. In this work, we demonstrate the limits of semantic data…

    Submitted 20 April, 2021; originally announced April 2021.

    Comments: Presented as a RobustML workshop paper at ICLR 2021. Both authors contributed equally. This article extends arXiv:2006.07767

  11. arXiv:2104.03624  [pdf, other]

    cs.LG cs.AI cs.NE

    Post-Hoc Domain Adaptation via Guided Data Homogenization

    Authors: Kurt Willis, Luis Oala

    Abstract: Addressing shifts in data distributions is an important prerequisite for the deployment of deep learning models to real-world settings. A general approach to this problem involves the adjustment of models to a new domain through transfer learning. However, in many cases, this is not applicable in a post-hoc manner to deployed models and further parameter adjustments jeopardize safety certification…

    Submitted 8 April, 2021; originally announced April 2021.

    Comments: Published as a conference paper at ICLR 2021; 4 pages, plus appendix, 5 figures

  12. arXiv:2006.07767  [pdf, ps, other]

    cs.LG stat.ML

    MixMOOD: A systematic approach to class distribution mismatch in semi-supervised learning using deep dataset dissimilarity measures

    Authors: Saul Calderon-Ramirez, Luis Oala, Jordina Torrents-Barrena, Shengxiang Yang, Armaghan Moemeni, Wojciech Samek, Miguel A. Molina-Cabello

    Abstract: In this work, we propose MixMOOD - a systematic approach to mitigate effect of class distribution mismatch in semi-supervised deep learning (SSDL) with MixMatch. This work is divided into two components: (i) an extensive out of distribution (OOD) ablation test bed for SSDL and (ii) a quantitative unlabelled dataset selection heuristic referred to as MixMOOD. In the first part, we analyze the sensi…

    Submitted 13 June, 2020; originally announced June 2020.

    Comments: The first two authors made equal contribution

    ACM Class: I.5.2

  13. arXiv:2003.13471  [pdf, other]

    eess.IV cs.CV cs.LG stat.ML

    Interval Neural Networks as Instability Detectors for Image Reconstructions

    Authors: Jan Macdonald, Maximilian März, Luis Oala, Wojciech Samek

    Abstract: This work investigates the detection of instabilities that may occur when utilizing deep learning models for image reconstruction tasks. Although neural networks often empirically outperform traditional reconstruction methods, their usage for sensitive medical applications remains controversial. Indeed, in a recent series of works, it has been demonstrated that deep learning approaches are suscept…

    Submitted 26 March, 2020; originally announced March 2020.

    Comments: JM, MM and LO contributed equally

    ACM Class: I.5.1; I.4.5; J.3; I.2.m

  14. arXiv:2003.11566  [pdf, other]

    cs.LG cs.CV eess.IV stat.ML

    Interval Neural Networks: Uncertainty Scores

    Authors: Luis Oala, Cosmas Heiß, Jan Macdonald, Maximilian März, Wojciech Samek, Gitta Kutyniok

    Abstract: We propose a fast, non-Bayesian method for producing uncertainty scores in the output of pre-trained deep neural networks (DNNs) using a data-driven interval propagating network. This interval neural network (INN) has interval valued parameters and propagates its input using interval arithmetic. The INN produces sensible lower and upper bounds encompassing the ground truth. We provide theoretical…

    Submitted 25 March, 2020; originally announced March 2020.

    Comments: LO and CH contributed equally

    ACM Class: I.5.1; I.4.5; J.3; I.2.m