Naive Bayes Classifiers and One-hot Encoding of Categorical Variables

Abstract: This paper investigates the consequences of encoding a $K$-valued categorical variable incorrectly as $K$ bits via one-hot encoding, when using a Naïve Bayes classifier. This gives rise to a product-of-Bernoullis (PoB) assumption, rather than the correct categorical Naïve Bayes classifier. The differences between the two classifiers are analysed mathematically and experimentally. In our experiments using probability vectors drawn from a Dirichlet distribution, the two classifiers are found to agree on the maximum a posteriori class label for most cases, although the posterior probabilities are usually greater for the PoB case.

Submitted 28 April, 2024; originally announced April 2024.

Comments: 7 pages, 3 figures
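
The comparison described in the abstract above can be illustrated with a small sketch. This is not the paper's code: it assumes a single $K$-valued feature, and it assumes the PoB model reuses the categorical probabilities `theta[c, j]` as the Bernoulli parameter of bit `j` under class `c`. The function names, the symmetric Dirichlet parameter and the random seed are all illustrative choices.

```python
import numpy as np

def categorical_posterior(theta, prior, k):
    """Class posterior when the feature is one K-valued categorical variable."""
    joint = prior * theta[:, k]  # p(c) * p(x = k | c)
    return joint / joint.sum()

def pob_posterior(theta, prior, k):
    """Class posterior when the one-hot bits are treated as K independent Bernoullis."""
    # Bit k is on (factor theta[c, k]); every other bit j is off (factor 1 - theta[c, j]).
    off_bits = np.prod(1.0 - np.delete(theta, k, axis=1), axis=1)
    joint = prior * theta[:, k] * off_bits
    return joint / joint.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)           # seed chosen arbitrarily
    n_classes, K = 2, 4                      # illustrative sizes
    theta = rng.dirichlet(np.ones(K), size=n_classes)  # one probability vector per class
    prior = np.full(n_classes, 1.0 / n_classes)
    for k in range(K):
        print(k, categorical_posterior(theta, prior, k), pob_posterior(theta, prior, k))
```

Comparing the two printed posteriors for each observed value `k` shows whether the classifiers pick the same class and how far apart their confidence levels are for that particular draw.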

arXiv:2306.03066 [pdf, other]

doi 10.1007/s11263-024-02118-3

Of Mice and Mates: Automated Classification and Modelling of Mouse Behaviour in Groups using a Single Model across Cages

Authors: Michael P. J. Camilleri, Rasneer S. Bains, Christopher K. I. Williams

Abstract: Behavioural experiments often happen in specialised arenas, but this may confound the analysis. To address this issue, we provide tools to study mice in the home-cage environment, equipping biologists with the possibility to capture the temporal aspect of the individual's behaviour and model the interaction and interdependence between cage-mates with minimal human intervention. Our main contribution is the novel Group Behaviour Model (GBM) which summarises the joint behaviour of groups of mice across cages, using a permutation matrix to match the mouse identities in each cage to the model. In support of the above, we also (a) developed the Activity Labelling Module (ALM) to automatically classify mouse behaviour from video, and (b) released two datasets, ABODe for training behaviour classifiers and IMADGE for modelling behaviour.

Submitted 24 June, 2024; v1 submitted 5 June, 2023; originally announced June 2023.

Comments: International Journal of Computer Vision (2024)

arXiv:2302.03531 [pdf, other]

Structured Generative Models for Scene Understanding

Authors: Christopher K. I. Williams

Abstract: This position paper argues for the use of \emph{structured generative models} (SGMs) for scene understanding. This requires the reconstruction of a 3D scene from an input image, whereby the contents of the image are causally explained in terms of models of instantiated objects, each with their own type, shape, appearance and pose, along with global variables like scene lighting and camera parameters. This approach also requires scene models which account for the co-occurrences and inter-relationships of objects in a scene. The SGM approach has the merits that it is compositional and generative, which lead to interpretability. To pursue the SGM agenda, we need models for objects and scenes, and approaches to carry out inference. We first review models for objects, which include ``things'' (object categories that have a well defined shape), and ``stuff'' (categories which have amorphous spatial extent). We then move on to review \emph{scene models} which describe the inter-relationships of objects. Perhaps the most challenging problem for SGMs is \emph{inference} of the objects, lighting and camera parameters, and scene inter-relationships from input consisting of a single or multiple images. We conclude with a discussion of issues that need addressing to advance the SGM agenda.

Submitted 7 February, 2023; originally announced February 2023.

Comments: 33 pages, 10 figures

arXiv:2211.00192 [pdf, other]

AI Assistants: A Framework for Semi-Automated Data Wrangling

Authors: Tomas Petricek, Gerrit J. J. van den Burg, Alfredo Nazábal, Taha Ceritli, Ernesto Jiménez-Ruiz, Christopher K. I. Williams

Abstract: Data wrangling tasks such as obtaining and linking data from various sources, transforming data formats, and correcting erroneous records, can constitute up to 80% of typical data engineering work. Despite the rise of machine learning and artificial intelligence, data wrangling remains a tedious and manual task. We introduce AI assistants, a class of semi-automatic interactive tools to streamline data wrangling. An AI assistant guides the analyst through a specific data wrangling task by recommending a suitable data transformation that respects the constraints obtained through interaction with the analyst. We formally define the structure of AI assistants and describe how existing tools that treat data cleaning as an optimization problem fit the definition. We implement AI assistants for four common data wrangling tasks and make AI assistants easily accessible to data analysts in an open-source notebook environment for data science, by leveraging the common structure they follow. We evaluate our AI assistants both quantitatively and qualitatively through three example scenarios. We show that the unified and interactive design makes it easy to perform tasks that would be difficult to do manually or with a fully automatic tool.

Submitted 31 October, 2022; originally announced November 2022.

Comments: Accepted for publication in IEEE Transactions on Knowledge and Data Engineering

arXiv:2210.04221 [pdf, other]

The Elliptical Quartic Exponential Distribution: An Annular Distribution Obtained via Maximum Entropy

Authors: Christopher K I Williams

Abstract: This paper describes the Elliptical Quartic Exponential distribution in $\mathbb{R}^D$, obtained via a maximum entropy construction by imposing second and fourth moment constraints. I discuss relationships to related work, analytical expressions for the normalization constant and the entropy, and the conditional and marginal distributions.

Submitted 9 October, 2022; originally announced October 2022.

Comments: 6 pages, 1 figure

arXiv:2210.04023 [pdf, other]

Multi-Task Dynamical Systems

Authors: Alex Bird, Christopher K. I. Williams, Christopher Hawthorne

Abstract: Time series datasets are often composed of a variety of sequences from the same domain, but from different entities, such as individuals, products, or organizations. We are interested in how time series models can be specialized to individual sequences (capturing the specific characteristics) while still retaining statistical power by sharing commonalities across the sequences. This paper describes the multi-task dynamical system (MTDS); a general methodology for extending multi-task learning (MTL) to time series models. Our approach endows dynamical systems with a set of hierarchical latent variables which can modulate all model parameters. To our knowledge, this is a novel development of MTL, and applies to time series both with and without control inputs. We apply the MTDS to motion-capture data of people walking in various styles using a multi-task recurrent neural network (RNN), and to patient drug-response data using a multi-task pharmacodynamic model.

Submitted 8 October, 2022; originally announced October 2022.

Comments: 52 pages, 17 figures

Journal ref: Journal of Machine Learning Research 23 (2022)

arXiv:2209.03115 [pdf, other]

doi 10.1162/neco_a_01564

Inference and Learning for Generative Capsule Models

Authors: Alfredo Nazabal, Nikolaos Tsagkas, Christopher K. I. Williams

Abstract: Capsule networks (see e.g. Hinton et al., 2018) aim to encode knowledge of and reason about the relationship between an object and its parts. In this paper we specify a generative model for such data, and derive a variational algorithm for inferring the transformation of each model object in a scene, and the assignments of observed parts to the objects. We derive a learning algorithm for the object models, based on variational expectation maximization (Jordan et al., 1999). We also study an alternative inference algorithm based on the RANSAC method of Fischler and Bolles (1981). We apply these inference methods to (i) data generated from multiple geometric objects like squares and triangles ("constellations"), and (ii) data from a parts-based model of faces. Recent work by Kosiorek et al. (2019) has used amortized inference via stacked capsule autoencoders (SCAEs) to tackle this problem -- our results show that we significantly outperform them where we can make comparisons (on the constellations data).

Submitted 21 October, 2022; v1 submitted 7 September, 2022; originally announced September 2022.

Comments: 31 pages, 6 figures. This paper extends our previous work (arxiv:2103.06676) by covering the learning of the models as well as inference. Paper accepted for publication in Neural Computation

Journal ref: Neural Computation 35(4) (2023) 727-761

arXiv:2203.08089 [pdf, other]

doi 10.1162/neco_a_01533

On Suspicious Coincidences and Pointwise Mutual Information

Authors: Christopher K. I. Williams

Abstract: Barlow (1985) hypothesized that the co-occurrence of two events $A$ and $B$ is "suspicious" if $P(A,B) \gg P(A) P(B)$. We first review classical measures of association for $2 \times 2$ contingency tables, including Yule's $Y$ (Yule, 1912), which depends only on the odds ratio $λ$, and is independent of the marginal probabilities of the table. We then discuss the mutual information (MI) and pointwise mutual information (PMI), which depend on the ratio $P(A,B)/P(A)P(B)$, as measures of association. We show that, once the effect of the marginals is removed, MI and PMI behave similarly to $Y$ as functions of $λ$. The pointwise mutual information is used extensively in some research communities for flagging suspicious coincidences, but it is important to bear in mind the sensitivity of the PMI to the marginals, with increased scores for sparser events.

Submitted 2 March, 2023; v1 submitted 15 March, 2022; originally announced March 2022.

Comments: 9 pages, 1 figure. Addendum added March 2023

Journal ref: Neural Computation 34(10) 2037-2046 (2022)
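
The quantities named in the abstract above can be computed from a $2 \times 2$ joint probability table. The sketch below uses the standard textbook definitions (PMI as $\log P(A,B)/(P(A)P(B))$, odds ratio $λ = p_{11} p_{00} / (p_{10} p_{01})$, and Yule's $Y = (\sqrt{λ} - 1)/(\sqrt{λ} + 1)$) and is not taken from the paper. The function name and argument order are illustrative, and all four cell probabilities are assumed to be strictly positive.

```python
import math

def association_measures(p11, p10, p01, p00):
    """Return (PMI, odds ratio, Yule's Y) for a 2x2 joint probability table.

    p11 = P(A, B), p10 = P(A, not B), p01 = P(not A, B), p00 = P(not A, not B).
    Assumes every cell is strictly positive so the logs and ratios are defined.
    """
    p_a = p11 + p10                      # marginal P(A)
    p_b = p11 + p01                      # marginal P(B)
    pmi = math.log(p11 / (p_a * p_b))    # natural-log pointwise mutual information
    odds_ratio = (p11 * p00) / (p10 * p01)
    root = math.sqrt(odds_ratio)
    yule_y = (root - 1.0) / (root + 1.0)
    return pmi, odds_ratio, yule_y

if __name__ == "__main__":
    # A table with positive association between A and B (illustrative numbers).
    print(association_measures(0.4, 0.1, 0.1, 0.4))
```

For an independent table such as four cells of 0.25, the PMI and Yule's $Y$ are both zero and the odds ratio is one.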

arXiv:2203.04694 [pdf, other]

Align-Deform-Subtract: An Interventional Framework for Explaining Object Differences

Authors: Cian Eastwood, Li Nanbo, Christopher K. I. Williams

Abstract: Given two object images, how can we explain their differences in terms of the underlying object properties? To address this question, we propose Align-Deform-Subtract (ADS) -- an interventional framework for explaining object differences. By leveraging semantic alignments in image-space as counterfactual interventions on the underlying object properties, ADS iteratively quantifies and removes differences in object properties. The result is a set of "disentangled" error measures which explain object differences in terms of the underlying properties. Experiments on real and synthetic data illustrate the efficacy of the framework.

Submitted 20 July, 2022; v1 submitted 9 March, 2022; originally announced March 2022.

Comments: ICLR 2022 Workshop on Objects, Structure and Causality

arXiv:2112.06809 [pdf, other]

doi 10.1007/s00138-023-01414-1

Persistent Animal Identification Leveraging Non-Visual Markers

Authors: Michael P. J. Camilleri, Li Zhang, Rasneer S. Bains, Andrew Zisserman, Christopher K. I. Williams

Abstract: Our objective is to locate and provide a unique identifier for each mouse in a cluttered home-cage environment through time, as a precursor to automated behaviour recognition for biological research. This is a very challenging problem due to (i) the lack of distinguishing visual features for each mouse, and (ii) the close confines of the scene with constant occlusion, making standard visual tracking approaches unusable. However, a coarse estimate of each mouse's location is available from a unique RFID implant, so there is the potential to optimally combine information from (weak) tracking with coarse information on identity. To achieve our objective, we make the following key contributions: (a) the formulation of the object identification problem as an assignment problem (solved using Integer Linear Programming), and (b) a novel probabilistic model of the affinity between tracklets and RFID data. The latter is a crucial part of the model, as it provides a principled probabilistic treatment of object detections given coarse localisation. Our approach achieves 77% accuracy on this animal identification problem, and is able to reject spurious detections when the animals are hidden.

Submitted 19 July, 2023; v1 submitted 13 December, 2021; originally announced December 2021.

Journal ref: Machine Vision and Applications 34, 68 (2023)

arXiv:2111.11959 [pdf, other]

Identifying the Units of Measurement in Tabular Data

Authors: Taha Ceritli, Christopher K. I. Williams

Abstract: We consider the problem of identifying the units of measurement in a data column that contains both numeric values and unit symbols in each row, e.g., "5.2 l", "7 pints". In this case we seek to identify the dimension of the column (e.g. volume) and relate the unit symbols to valid units (e.g. litre, pint) obtained from a knowledge graph. Below we present PUC, a Probabilistic Unit Canonicalizer that can accurately identify the units of measurement, extract semantic descriptions of quantitative data columns and canonicalize their entries. We present the first messy real-world tabular datasets annotated for units of measurement, which can enable and accelerate the research in this area. Our experiments on these datasets show that PUC achieves better results than existing solutions.

Submitted 23 November, 2021; originally announced November 2021.

arXiv:2111.11956 [pdf, other]

ptype-cat: Inferring the Type and Values of Categorical Variables

Authors: Taha Ceritli, Christopher K. I. Williams

Abstract: Type inference is the task of identifying the type of values in a data column and has been studied extensively in the literature. Most existing type inference methods support data types such as Boolean, date, float, integer and string. However, these methods do not consider non-Boolean categorical variables, where there are more than two possible values encoded by integers or strings. Therefore, such columns are annotated either as integer or string rather than categorical, and need to be transformed into categorical manually by the user. In this paper, we propose a probabilistic type inference method that can identify the general categorical data type (including non-Boolean variables). Additionally, we identify the possible values of each categorical variable by adapting the existing type inference method ptype. Combining these methods, we present ptype-cat which achieves better results than existing applicable solutions.

Submitted 23 November, 2021; originally announced November 2021.

arXiv:2107.05446 [pdf, other]

Source-Free Adaptation to Measurement Shift via Bottom-Up Feature Restoration

Authors: Cian Eastwood, Ian Mason, Christopher K. I. Williams, Bernhard Schölkopf

Abstract: Source-free domain adaptation (SFDA) aims to adapt a model trained on labelled data in a source domain to unlabelled data in a target domain without access to the source-domain data during adaptation. Existing methods for SFDA leverage entropy-minimization techniques which: (i) apply only to classification; (ii) destroy model calibration; and (iii) rely on the source model achieving a good level of feature-space class-separation in the target domain. We address these issues for a particularly pervasive type of domain shift called measurement shift which can be resolved by restoring the source features rather than extracting new ones. In particular, we propose Feature Restoration (FR) wherein we: (i) store a lightweight and flexible approximation of the feature distribution under the source data; and (ii) adapt the feature-extractor such that the approximate feature distribution under the target data realigns with that saved on the source. We additionally propose a bottom-up training scheme which boosts performance, which we call Bottom-Up Feature Restoration (BUFR). On real and synthetic data, we demonstrate that BUFR outperforms existing SFDA methods in terms of accuracy, calibration, and data efficiency, while being less reliant on the performance of the source model in the target domain.

Submitted 17 March, 2022; v1 submitted 12 July, 2021; originally announced July 2021.

Comments: ICLR 2022 (Spotlight)

arXiv:2106.03216 [pdf, other]

On Memorization in Probabilistic Deep Generative Models

Authors: Gerrit J. J. van den Burg, Christopher K. I. Williams

Abstract: Recent advances in deep generative models have led to impressive results in a variety of application domains. Motivated by the possibility that deep learning models might memorize part of the input data, there have been increased efforts to understand how memorization arises. In this work, we extend a recently proposed measure of memorization for supervised learning (Feldman, 2019) to the unsupervised density estimation problem and adapt it to be more computationally efficient. Next, we present a study that demonstrates how memorization can occur in probabilistic deep generative models such as variational autoencoders. This reveals that the form of memorization to which these models are susceptible differs fundamentally from mode collapse and overfitting. Furthermore, we show that the proposed memorization score measures a phenomenon that is not captured by commonly-used nearest neighbor tests. Finally, we discuss several strategies that can be used to limit memorization in practice. Our work thus provides a framework for understanding problematic memorization in probabilistic generative models.

Submitted 29 December, 2021; v1 submitted 6 June, 2021; originally announced June 2021.

Comments: Accepted for publication at NeurIPS 2021

MSC Class: 68T07

arXiv:2105.05699 [pdf, other]

doi 10.1145/3495256

Automating Data Science: Prospects and Challenges

Authors: Tijl De Bie, Luc De Raedt, José Hernández-Orallo, Holger H. Hoos, Padhraic Smyth, Christopher K. I. Williams

Abstract: Given the complexity of typical data science projects and the associated demand for human expertise, automation has the potential to transform the data science process. Key insights: * Automation in data science aims to facilitate and transform the work of data scientists, not to replace them. * Important parts of data science are already being automated, especially in the modeling stages, where techniques such as automated machine learning (AutoML) are gaining traction. * Other aspects are harder to automate, not only because of technological challenges, but because open-ended and context-dependent tasks require human interaction.

Submitted 28 February, 2022; v1 submitted 12 May, 2021; originally announced May 2021.

Comments: 19 pages, 3 figures. v1 accepted for publication (April 2021) in Communications of the ACM

Journal ref: Communications of the ACM 65(3) 76-87 (2022)

arXiv:2103.06676 [pdf, other]

Inference for Generative Capsule Models

Authors: Alfredo Nazabal, Nikolaos Tsagkas, Christopher K. I. Williams

Abstract: Capsule networks (see e.g. Hinton et al., 2018) aim to encode knowledge and reason about the relationship between an object and its parts. In this paper we specify a \emph{generative} model for such data, and derive a variational algorithm for inferring the transformation of each object and the assignments of observed parts to the objects. We apply this model to (i) data generated from multiple geometric objects like squares and triangles ("constellations"), and (ii) data from a parts-based model of faces. Recent work by Kosiorek et al. [2019] has used amortized inference via stacked capsule autoencoders (SCAEs) to tackle this problem -- our results show that we significantly outperform them where we can make comparisons (on the constellations data).

Submitted 14 March, 2022; v1 submitted 11 March, 2021; originally announced March 2021.

arXiv:2007.01905 [pdf, other]

doi 10.1162/neco_a_01362

The Effect of Class Imbalance on Precision-Recall Curves

Authors: Christopher K I Williams

Abstract: In this note I study how the precision of a classifier depends on the ratio $r$ of positive to negative cases in the test set, as well as the classifier's true and false positive rates. This relationship allows prediction of how the precision-recall curve will change with $r$, which seems not to be well known. It also allows prediction of how $F_β$ and the Precision Gain and Recall Gain measures of Flach and Kull (2015) vary with $r$.

Submitted 27 April, 2021; v1 submitted 3 July, 2020; originally announced July 2020.

Comments: 4 pages, 1 figure. Added ref to Siblini et al (2020) and last sentence. Final m/s version of paper published in Neural Computation

Journal ref: Neural Computation 33(4) 853-857 (2021)
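
The dependence of precision on $r$ mentioned in the abstract above follows directly from the definitions: with true positive rate TPR, false positive rate FPR and $r$ = (number of positives)/(number of negatives), precision = TP/(TP + FP) = TPR·$r$ / (TPR·$r$ + FPR). The sketch below simply evaluates that identity; the function name and the example rates are illustrative assumptions, not values from the note.

```python
def precision_from_rates(tpr, fpr, r):
    """Precision implied by a classifier's TPR and FPR at positive:negative ratio r.

    Derivation: TP = tpr * n_pos and FP = fpr * n_neg, so dividing
    TP / (TP + FP) through by n_neg gives tpr * r / (tpr * r + fpr).
    Assumes tpr * r + fpr > 0.
    """
    return (tpr * r) / (tpr * r + fpr)

if __name__ == "__main__":
    tpr, fpr = 0.8, 0.1  # one fixed operating point (illustrative values)
    for r in (1.0, 0.1, 0.01):
        # Precision at the same operating point as positives become rarer.
        print(r, precision_from_rates(tpr, fpr, r))
```

Because recall equals the TPR and does not depend on $r$, applying this to every operating point shifts each point of a precision-recall curve vertically as the class ratio changes.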

arXiv:2006.05301 [pdf, other]

VAEs in the Presence of Missing Data

Authors: Mark Collier, Alfredo Nazabal, Christopher K. I. Williams

Abstract: Real world datasets often contain entries with missing elements e.g. in a medical dataset, a patient is unlikely to have taken all possible diagnostic tests. Variational Autoencoders (VAEs) are popular generative models often used for unsupervised learning. Despite their widespread use it is unclear how best to apply VAEs to datasets with missing data. We develop a novel latent variable model of a corruption process which generates missing data, and derive a corresponding tractable evidence lower bound (ELBO). Our model is straightforward to implement, can handle both missing completely at random (MCAR) and missing not at random (MNAR) data, scales to high dimensional inputs and gives both the VAE encoder and decoder principled access to indicator variables for whether a data element is missing or not. On the MNIST and SVHN datasets we demonstrate improved marginal log-likelihood of observed data and better missing data imputation, compared to existing approaches.

Submitted 21 March, 2021; v1 submitted 9 June, 2020; originally announced June 2020.

Comments: Accepted to ICML Workshop on the Art of Learning with Missing Values (Artemiss), 17 July 2020

arXiv:2004.12929 [pdf, other]

Data Engineering for Data Analytics: A Classification of the Issues, and Case Studies

Authors: Alfredo Nazabal, Christopher K. I. Williams, Giovanni Colavizza, Camila Rangel Smith, Angus Williams

Abstract: Consider the situation where a data analyst wishes to carry out an analysis on a given dataset. It is widely recognized that most of the analyst's time will be taken up with \emph{data engineering} tasks such as acquiring, understanding, cleaning and preparing the data. In this paper we provide a description and classification of such tasks into high-level groups, namely data organization, data quality and feature engineering. We also make available four datasets and example analyses that exhibit a wide variety of these problems, to help encourage the development of tools and techniques to help reduce this burden and push forward research towards the automation or semi-automation of the data engineering process.

Submitted 27 April, 2020; originally announced April 2020.

Comments: 24 pages, 1 figure, submitted to IEEE Transactions on Knowledge and Data Engineering

arXiv:2003.06222 [pdf, other]

An Evaluation of Change Point Detection Algorithms

Authors: Gerrit J. J. van den Burg, Christopher K. I. Williams

Abstract: Change point detection is an important part of time series analysis, as the presence of a change point indicates an abrupt and significant change in the data generating process. While many algorithms for change point detection have been proposed, comparatively little attention has been paid to evaluating their performance on real-world time series. Algorithms are typically evaluated on simulated data and a small number of commonly-used series with unreliable ground truth. Clearly this does not provide sufficient insight into the comparative performance of these algorithms. Therefore, instead of developing yet another change point detection method, we consider it vastly more important to properly evaluate existing algorithms on real-world data. To achieve this, we present a data set specifically designed for the evaluation of change point detection algorithms that consists of 37 time series from various application domains. Each series was annotated by five human annotators to provide ground truth on the presence and location of change points. We analyze the consistency of the human annotators, and describe evaluation metrics that can be used to measure algorithm performance in the presence of multiple ground truth annotations. Next, we present a benchmark study where 14 algorithms are evaluated on each of the time series in the data set. Our aim is that this data set will serve as a proving ground in the development of novel change point detection algorithms.

Submitted 12 February, 2022; v1 submitted 13 March, 2020; originally announced March 2020.

Comments: For code and data, see https://github.com/alan-turing-institute/TCPDBench ; Changelog in pdf

MSC Class: 62M10 ACM Class: G.3

arXiv:1911.10081 [pdf, other]

doi 10.1007/s10618-020-00680-1

ptype: Probabilistic Type Inference

Authors: Taha Ceritli, Christopher K. I. Williams, James Geddes

Abstract: Type inference refers to the task of inferring the data type of a given column of data. Current approaches often fail when data contains missing data and anomalies, which are found commonly in real-world data sets. In this paper, we propose ptype, a probabilistic robust type inference method that allows us to detect such entries, and infer data types. We further show that the proposed method outperforms the existing methods.

Submitted 23 March, 2020; v1 submitted 22 November, 2019; originally announced November 2019.

Journal ref: Data Mining and Knowledge Discovery (2020)

arXiv:1910.05026 [pdf, other]

Customizing Sequence Generation with Multi-Task Dynamical Systems

Authors: Alex Bird, Christopher K. I. Williams

Abstract: Dynamical system models (including RNNs) often lack the ability to adapt the sequence generation or prediction to a given context, limiting their real-world application. In this paper we show that hierarchical multi-task dynamical systems (MTDSs) provide direct user control over sequence generation, via use of a latent code $\mathbf{z}$ that specifies the customization to the individual data sequence. This enables style transfer, interpolation and morphing within generated sequences. We show the MTDS can improve predictions via latent code interpolation, and avoid the long-term performance degradation of standard RNN approaches.

Submitted 11 October, 2019; originally announced October 2019.

arXiv:1907.06671 [pdf, other]

Robust Variational Autoencoders for Outlier Detection and Repair of Mixed-Type Data

Authors: Simão Eduardo, Alfredo Nazábal, Christopher K. I. Williams, Charles Sutton

Abstract: We focus on the problem of unsupervised cell outlier detection and repair in mixed-type tabular data. Traditional methods are concerned only with detecting which rows in the dataset are outliers. However, identifying which cells are corrupted in a specific row is an important problem in practice, and the very first step towards repairing them. We introduce the Robust Variational Autoencoder (RVAE), a deep generative model that learns the joint distribution of the clean data while identifying the outlier cells, allowing their imputation (repair). RVAE explicitly learns the probability of each cell being an outlier, balancing different likelihood models in the row outlier score, making the method suitable for outlier detection in mixed-type datasets. We show experimentally that RVAE not only performs better than several state-of-the-art methods in cell outlier detection and repair for tabular data, but is also robust against the initial hyper-parameter selection.

Submitted 3 March, 2020; v1 submitted 15 July, 2019; originally announced July 2019.

Comments: Accepted for publication at AISTATS 2020

arXiv:1906.01251 [pdf, other]

doi 10.1007/978-3-030-43823-4_11

The Extended Dawid-Skene Model: Fusing Information from Multiple Data Schemas

Authors: Michael P. J. Camilleri, Christopher K. I. Williams

Abstract: While label fusion from multiple noisy annotations is a well understood concept in data wrangling (tackled for example by the Dawid-Skene (DS) model), we consider the extended problem of carrying out learning when the labels themselves are not consistently annotated with the same schema. We show that even if annotators use disparate, albeit related, label-sets, we can still draw inferences for the underlying full label-set. We propose the Inter-Schema AdapteR (ISAR) to translate the fully-specified label-set to the one used by each annotator, enabling learning under such heterogeneous schemas, without the need to re-annotate the data. We apply our method to a mouse behavioural dataset, achieving significant gains (compared with DS) in out-of-sample log-likelihood (-3.40 to -2.39) and F1-score (0.785 to 0.864).

Submitted 6 March, 2020; v1 submitted 4 June, 2019; originally announced June 2019.

Comments: Updated with Author-Preprint version following Publication in P. Cellier and K. Driessens (Eds.): ECML PKDD 2019 Workshops, CCIS 1167, pp. 121 - 136, 2020

Journal ref: in ECML PKDD 2019 Workshops, CCIS 1167, pp. 121 - 136, 2020

arXiv:1903.08970 [pdf, other]

Multi-Task Time Series Analysis applied to Drug Response Modelling

Authors: Alex Bird, Christopher K. I. Williams, Christopher Hawthorne

Abstract: Time series models such as dynamical systems are frequently fitted to a cohort of data, ignoring variation between individual entities such as patients. In this paper we show how these models can be personalised to an individual level while retaining statistical power, via use of multi-task learning (MTL). To our knowledge this is a novel development of MTL which applies to time series both with and without control inputs. The modelling framework is demonstrated on a physiological drug response problem which results in improved predictive accuracy and uncertainty estimation over existing state-of-the-art models.

Submitted 21 March, 2019; originally announced March 2019.

Comments: To appear in AISTATS 2019

arXiv:1812.07524 [pdf, other]

doi 10.1016/j.patcog.2020.107369

Learning Direct Optimization for Scene Understanding

Authors: Lukasz Romaszko, Christopher K. I. Williams, John Winn

Abstract: We develop a Learning Direct Optimization (LiDO) method for the refinement of a latent variable model that describes input image x. Our goal is to explain a single image x with an interpretable 3D computer graphics model having scene graph latent variables z (such as object appearance, camera position). Given a current estimate of z we can render a prediction of the image g(z), which can be compared to the image x. The standard way to proceed is then to measure the error E(x, g(z)) between the two, and use an optimizer to minimize the error. However, it is unknown which error measure E would be most effective for simultaneously addressing issues such as misaligned objects, occlusions, textures, etc. In contrast, the LiDO approach trains a Prediction Network to predict an update directly to correct z, rather than minimizing the error with respect to z. Experiments show that our LiDO method converges rapidly as it does not need to perform a search on the error landscape, produces better solutions than error-based competitors, and is able to handle the mismatch between the data and the fitted scene model. We apply LiDO to a realistic synthetic dataset, and show that the method also transfers to work well with real images.

Submitted 7 May, 2020; v1 submitted 18 December, 2018; originally announced December 2018.

Journal ref: Pattern Recognition, Volume 105, 2020, 107369

arXiv:1806.00400 [pdf, other]

Inverting Supervised Representations with Autoregressive Neural Density Models

Authors: Charlie Nash, Nate Kushman, Christopher K. I. Williams

Abstract: We present a method for feature interpretation that makes use of recent advances in autoregressive density estimation models to invert model representations. We train generative inversion models to express a distribution over input features conditioned on intermediate model representations. Insights into the invariances learned by supervised models can be gained by viewing samples from these inversion models. In addition, we can use these inversion models to estimate the mutual information between a model's inputs and its intermediate representations, thus quantifying the amount of information preserved by the network at different stages. Using this method we examine the types of information preserved at different layers of convolutional neural networks, and explore the invariances induced by different architectural choices. Finally we show that the mutual information between inputs and network layers decreases over the course of training, supporting recent work by Shwartz-Ziv and Tishby (2017) on the information bottleneck theory of deep learning.

Submitted 2 January, 2019; v1 submitted 1 June, 2018; originally announced June 2018.

Comments: Accepted for publication by AISTATS 2019

arXiv:1801.03851 [pdf, other]

Autoencoders and Probabilistic Inference with Missing Data: An Exact Solution for The Factor Analysis Case

Authors: Christopher K. I. Williams, Charlie Nash, Alfredo Nazábal

Abstract: Latent variable models can be used to probabilistically "fill-in" missing data entries. The variational autoencoder architecture (Kingma and Welling, 2014; Rezende et al., 2014) includes a "recognition" or "encoder" network that infers the latent variables given the data variables. However, it is not clear how to handle missing data variables in this network. The factor analysis (FA) model is a basic autoencoder, using linear encoder and decoder networks. We show how to calculate exactly the latent posterior distribution for the factor analysis (FA) model in the presence of missing data, and note that this solution implies that a different encoder network is required for each pattern of missingness. We also discuss various approximations to the exact solution. Experiments compare the effectiveness of various approaches to filling in the missing data.

Submitted 19 February, 2019; v1 submitted 11 January, 2018; originally announced January 2018.

Comments: 7 pages, 2 figures, Adding ref to Ilin and Raiko (2010)

arXiv:1612.00662 [pdf, other]

Predicting Patient State-of-Health using Sliding Window and Recurrent Classifiers

Authors: Adam McCarthy, Christopher K. I. Williams

Abstract: Bedside monitors in Intensive Care Units (ICUs) frequently sound incorrectly, slowing response times and desensitising nurses to alarms (Chambrin, 2001), causing true alarms to be missed (Hug et al., 2011). We compare sliding window predictors with recurrent predictors to classify patient state-of-health from ICU multivariate time series; we report slightly improved performance for the RNN for three out of four targets.

Submitted 2 December, 2016; originally announced December 2016.

Comments: NIPS 2016 Workshop on Machine Learning for Health

arXiv:1608.00242 [pdf, ps, other]

Input-Output Non-Linear Dynamical Systems applied to Physiological Condition Monitoring

Authors: Konstantinos Georgatzis, Christopher K. I. Williams, Christopher Hawthorne

Abstract: We present a non-linear dynamical system for modelling the effect of drug infusions on the vital signs of patients admitted in Intensive Care Units (ICUs). More specifically we are interested in modelling the effect of a widely used anaesthetic drug (Propofol) on a patient's monitored depth of anaesthesia and haemodynamics. We compare our approach with one from the Pharmacokinetics/Pharmacodynamics (PK/PD) literature and show that we can provide significant improvements in performance without requiring the incorporation of expert physiological knowledge in our system.

Submitted 8 October, 2016; v1 submitted 31 July, 2016; originally announced August 2016.

Comments: 15 pages, 4 figures, Presented at 2016 Machine Learning and Healthcare Conference (MLHC 2016), Los Angeles, CA, camera ready version

arXiv:1506.03852 [pdf, other]

Tree-Cut for Probabilistic Image Segmentation

Authors: Shell X. Hu, Christopher K. I. Williams, Sinisa Todorovic

Abstract: This paper presents a new probabilistic generative model for image segmentation, i.e. the task of partitioning an image into homogeneous regions. Our model is grounded on a mid-level image representation, called a region tree, in which regions are recursively split into subregions until superpixels are reached. Given the region tree, image segmentation is formalized as sampling cuts in the tree from the model. Inference for the cuts is exact, and formulated using dynamic programming. Our tree-cut model can be tuned to sample segmentations at a particular scale of interest out of many possible multiscale image segmentations. This generalizes the common notion that there should be only one correct segmentation per image. Also, it allows moving beyond the standard single-scale evaluation, where the segmentation result for an image is averaged against the corresponding set of coarse and fine human annotations, to conduct a scale-specific evaluation. Our quantitative results are comparable to those of the leading gPb-owt-ucm method, with the notable advantage that we additionally produce a distribution over all possible tree-consistent segmentations of the image.

Submitted 11 June, 2015; originally announced June 2015.

arXiv:1504.06494 [pdf, ps, other]

Discriminative Switching Linear Dynamical Systems applied to Physiological Condition Monitoring

Authors: Konstantinos Georgatzis, Christopher K. I. Williams

Abstract: We present a Discriminative Switching Linear Dynamical System (DSLDS) applied to patient monitoring in Intensive Care Units (ICUs). Our approach is based on identifying the state-of-health of a patient given their observed vital signs using a discriminative classifier, and then inferring their underlying physiological values conditioned on this status. The work builds on the Factorial Switching Linear Dynamical System (FSLDS) (Quinn et al., 2009) which has been previously used in a similar setting. The FSLDS is a generative model, whereas the DSLDS is a discriminative model. We demonstrate on two real-world datasets that the DSLDS is able to outperform the FSLDS in most cases of interest, and that an $α$-mixture of the two models achieves higher performance than either of the two models separately.

Submitted 24 April, 2015; originally announced April 2015.

arXiv:1408.1489 [pdf]

Renewal Strings for Cleaning Astronomical Databases

Authors: Amos J. Storkey, Nigel C. Hambly, Christopher K. I. Williams, Robert G. Mann

Abstract: Large astronomical databases obtained from sky surveys such as the SuperCOSMOS Sky Surveys (SSS) invariably suffer from a small number of spurious records coming from artefactual effects of the telescope, satellites and junk objects in orbit around Earth, and physical defects on the photographic plate or CCD. Though relatively small in number, these spurious records present a significant problem in many situations where they can become a large proportion of the records potentially of interest to a given astronomer. In this paper we focus on the four most common causes of unwanted records in the SSS: satellite or aeroplane tracks; scratches, fibres and other linear phenomena introduced to the plate; circular halos around bright stars due to internal reflections within the telescope; and diffraction spikes near to bright stars. Accurate and robust techniques are needed for locating and flagging such spurious objects. We have developed renewal strings, a probabilistic technique combining the Hough transform, renewal processes and hidden Markov models which have proven highly effective in this context. The methods are applied to the SSS data to develop a dataset of spurious object detections, along with confidence measures, which can allow this unwanted data to be removed from consideration. These methods are general and can be adapted to any future astronomical survey data.

Submitted 7 August, 2014; originally announced August 2014.

Comments: Appears in Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI2003)

Report number: UAI-P-2003-PG-559-566

arXiv:1205.6326 [pdf, other]

A Framework for Evaluating Approximation Methods for Gaussian Process Regression

Authors: Krzysztof Chalupka, Christopher K. I. Williams, Iain Murray

Abstract: Gaussian process (GP) predictors are an important component of many Bayesian approaches to machine learning. However, even a straightforward implementation of Gaussian process regression (GPR) requires O(n^2) space and O(n^3) time for a dataset of n examples. Several approximation methods have been proposed, but there is a lack of understanding of the relative merits of the different approximations, and in what situations they are most useful. We recommend assessing the quality of the predictions obtained as a function of the compute time taken, and comparing to standard baselines (e.g., Subset of Data and FITC). We empirically investigate four different approximation algorithms on four different prediction problems, and make our code available to encourage future comparisons.

Submitted 5 November, 2012; v1 submitted 29 May, 2012; originally announced May 2012.

Comments: 19 pages, 4 figures

Showing 1–34 of 34 results for author: Williams, C K I