Showing 1–10 of 10 results for author: Nazabal, A

  1. arXiv:2211.00192

    cs.DB

    AI Assistants: A Framework for Semi-Automated Data Wrangling

    Authors: Tomas Petricek, Gerrit J. J. van den Burg, Alfredo Nazábal, Taha Ceritli, Ernesto Jiménez-Ruiz, Christopher K. I. Williams

    Abstract: Data wrangling tasks such as obtaining and linking data from various sources, transforming data formats, and correcting erroneous records, can constitute up to 80% of typical data engineering work. Despite the rise of machine learning and artificial intelligence, data wrangling remains a tedious and manual task. We introduce AI assistants, a class of semi-automatic interactive tools to streamline…

    Submitted 31 October, 2022; originally announced November 2022.

    Comments: Accepted for publication in IEEE Transactions on Knowledge and Data Engineering

  2. Inference and Learning for Generative Capsule Models

    Authors: Alfredo Nazabal, Nikolaos Tsagkas, Christopher K. I. Williams

    Abstract: Capsule networks (see e.g. Hinton et al., 2018) aim to encode knowledge of and reason about the relationship between an object and its parts. In this paper we specify a generative model for such data, and derive a variational algorithm for inferring the transformation of each model object in a scene, and the assignments of observed parts to the objects. We derive a learning algorithm for the objec…

    Submitted 21 October, 2022; v1 submitted 7 September, 2022; originally announced September 2022.

    Comments: 31 pages, 6 figures. This paper extends our previous work (arxiv:2103.06676) by covering the learning of the models as well as inference. Paper accepted for publication in Neural Computation

    Journal ref: Neural Computation 35(4) (2023) 727-761

  3. arXiv:2207.08050

    cs.LG stat.ML

    Repairing Systematic Outliers by Learning Clean Subspaces in VAEs

    Authors: Simao Eduardo, Kai Xu, Alfredo Nazabal, Charles Sutton

    Abstract: Data cleaning often comprises outlier detection and data repair. Systematic errors result from nearly deterministic transformations that occur repeatedly in the data, e.g. specific image pixels being set to default values or watermarks. Consequently, models with enough capacity easily overfit to these errors, making detection and repair difficult. Seeing as a systematic outlier is a combination of…

    Submitted 16 July, 2022; originally announced July 2022.

    Comments: Submitted for review in ICLR 2022

  4. arXiv:2103.06676

    cs.LG

    Inference for Generative Capsule Models

    Authors: Alfredo Nazabal, Nikolaos Tsagkas, Christopher K. I. Williams

    Abstract: Capsule networks (see e.g. Hinton et al., 2018) aim to encode knowledge of and reason about the relationship between an object and its parts. In this paper we specify a generative model for such data, and derive a variational algorithm for inferring the transformation of each object and the assignments of observed parts to the objects. We apply this model to (i) data generated from multiple ge…

    Submitted 14 March, 2022; v1 submitted 11 March, 2021; originally announced March 2021.

  5. arXiv:2006.05301

    cs.LG stat.ML

    VAEs in the Presence of Missing Data

    Authors: Mark Collier, Alfredo Nazabal, Christopher K. I. Williams

    Abstract: Real world datasets often contain entries with missing elements e.g. in a medical dataset, a patient is unlikely to have taken all possible diagnostic tests. Variational Autoencoders (VAEs) are popular generative models often used for unsupervised learning. Despite their widespread use it is unclear how best to apply VAEs to datasets with missing data. We develop a novel latent variable model of a…

    Submitted 21 March, 2021; v1 submitted 9 June, 2020; originally announced June 2020.

    Comments: Accepted to ICML Workshop on the Art of Learning with Missing Values (Artemiss), 17 July 2020

  6. arXiv:2004.12929

    cs.DB

    Data Engineering for Data Analytics: A Classification of the Issues, and Case Studies

    Authors: Alfredo Nazabal, Christopher K. I. Williams, Giovanni Colavizza, Camila Rangel Smith, Angus Williams

    Abstract: Consider the situation where a data analyst wishes to carry out an analysis on a given dataset. It is widely recognized that most of the analyst's time will be taken up with data engineering tasks such as acquiring, understanding, cleaning and preparing the data. In this paper we provide a description and classification of such tasks into high-level groups, namely data organization, data q…

    Submitted 27 April, 2020; originally announced April 2020.

    Comments: 24 pages, 1 figure, submitted to IEEE Transactions on Knowledge and Data Engineering

  7. arXiv:1907.06671

    cs.LG stat.ML

    Robust Variational Autoencoders for Outlier Detection and Repair of Mixed-Type Data

    Authors: Simão Eduardo, Alfredo Nazábal, Christopher K. I. Williams, Charles Sutton

    Abstract: We focus on the problem of unsupervised cell outlier detection and repair in mixed-type tabular data. Traditional methods are concerned only with detecting which rows in the dataset are outliers. However, identifying which cells are corrupted in a specific row is an important problem in practice, and the very first step towards repairing them. We introduce the Robust Variational Autoencoder (RVAE)…

    Submitted 3 March, 2020; v1 submitted 15 July, 2019; originally announced July 2019.

    Comments: Accepted for publication at AISTATS 2020

  8. Wrangling Messy CSV Files by Detecting Row and Type Patterns

    Authors: Gerrit J. J. van den Burg, Alfredo Nazabal, Charles Sutton

    Abstract: It is well known that data scientists spend the majority of their time on preparing data for analysis. One of the first steps in this preparation phase is to load the data from the raw storage format. Comma-separated value (CSV) files are a popular format for tabular data due to their simplicity and ostensible ease of use. However, formatting standards for CSV files are not followed consistently,…

    Submitted 27 November, 2018; originally announced November 2018.

    ACM Class: E.5; H.2.8

    Journal ref: Data Mining and Knowledge Discovery (July, 2019)

  9. arXiv:1807.03653

    cs.LG cs.AI stat.ML

    Handling Incomplete Heterogeneous Data using VAEs

    Authors: Alfredo Nazabal, Pablo M. Olmos, Zoubin Ghahramani, Isabel Valera

    Abstract: Variational autoencoders (VAEs), as well as other generative models, have been shown to be efficient and accurate for capturing the latent structure of vast amounts of complex high-dimensional data. However, existing VAEs still cannot directly handle data that are heterogeneous (mixed continuous and discrete) or incomplete (with missing data at random), which is indeed common in real-world applica…

    Submitted 22 May, 2020; v1 submitted 10 July, 2018; originally announced July 2018.

  10. arXiv:1801.03851

    cs.LG stat.ML

    Autoencoders and Probabilistic Inference with Missing Data: An Exact Solution for The Factor Analysis Case

    Authors: Christopher K. I. Williams, Charlie Nash, Alfredo Nazábal

    Abstract: Latent variable models can be used to probabilistically "fill-in" missing data entries. The variational autoencoder architecture (Kingma and Welling, 2014; Rezende et al., 2014) includes a "recognition" or "encoder" network that infers the latent variables given the data variables. However, it is not clear how to handle missing data variables in this network. The factor analysis (FA) model is a ba…

    Submitted 19 February, 2019; v1 submitted 11 January, 2018; originally announced January 2018.

    Comments: 7 pages, 2 figures, Adding ref to Ilin and Raiko (2010)