-
Differentiable Scene Graphs
Authors:
Moshiko Raboh,
Roei Herzig,
Gal Chechik,
Jonathan Berant,
Amir Globerson
Abstract:
Reasoning about complex visual scenes involves perception of entities and their relations. Scene graphs provide a natural representation for reasoning tasks, by assigning labels to both entities (nodes) and relations (edges). Unfortunately, reasoning systems based on SGs are typically trained in a two-step procedure: First, training a model to predict SGs from images; Then, a separate model is cre…
▽ More
Reasoning about complex visual scenes involves perception of entities and their relations. Scene graphs provide a natural representation for reasoning tasks, by assigning labels to both entities (nodes) and relations (edges). Unfortunately, reasoning systems based on SGs are typically trained in a two-step procedure: First, training a model to predict SGs from images; Then, a separate model is created to reason based on predicted SGs. In many domains, it is preferable to train systems jointly in an end-to-end manner, but SGs are not commonly used as intermediate components in visual reasoning systems because being discrete and sparse, scene-graph representations are non-differentiable and difficult to optimize. Here we propose Differentiable Scene Graphs (DSGs), an image representation that is amenable to differentiable end-to-end optimization, and requires supervision only from the downstream tasks. DSGs provide a dense representation for all regions and pairs of regions, and do not spend modelling capacity on areas of the images that do not contain objects or relations of interest. We evaluate our model on the challenging task of identifying referring relationships (RR) in three benchmark datasets, Visual Genome, VRD and CLEVR. We describe a multi-task objective, and train in an end-to-end manner supervised by the downstream RR task. Using DSGs as an intermediate representation leads to new state-of-the-art performance.
△ Less
Submitted 14 March, 2020; v1 submitted 26 February, 2019;
originally announced February 2019.
-
Cross-Lingual Alignment of Contextual Word Embeddings, with Applications to Zero-shot Dependency Parsing
Authors:
Tal Schuster,
Ori Ram,
Regina Barzilay,
Amir Globerson
Abstract:
We introduce a novel method for multilingual transfer that utilizes deep contextual embeddings, pretrained in an unsupervised fashion. While contextual embeddings have been shown to yield richer representations of meaning compared to their static counterparts, aligning them poses a challenge due to their dynamic nature. To this end, we construct context-independent variants of the original monolin…
▽ More
We introduce a novel method for multilingual transfer that utilizes deep contextual embeddings, pretrained in an unsupervised fashion. While contextual embeddings have been shown to yield richer representations of meaning compared to their static counterparts, aligning them poses a challenge due to their dynamic nature. To this end, we construct context-independent variants of the original monolingual spaces and utilize their mapping to derive an alignment for the context-dependent spaces. This mapping readily supports processing of a target language, improving transfer by context-aware embeddings. Our experimental results demonstrate the effectiveness of this approach for zero-shot and few-shot learning of dependency parsing. Specifically, our method consistently outperforms the previous state-of-the-art on 6 tested languages, yielding an improvement of 6.8 LAS points on average.
△ Less
Submitted 3 April, 2019; v1 submitted 25 February, 2019;
originally announced February 2019.
-
Spatio-Temporal Action Graph Networks
Authors:
Roei Herzig,
Elad Levi,
Huijuan Xu,
Hang Gao,
Eli Brosh,
Xiaolong Wang,
Amir Globerson,
Trevor Darrell
Abstract:
Events defined by the interaction of objects in a scene are often of critical importance; yet important events may have insufficient labeled examples to train a conventional deep model to generalize to future object appearance. Activity recognition models that represent object interactions explicitly have the potential to learn in a more efficient manner than those that represent scenes with globa…
▽ More
Events defined by the interaction of objects in a scene are often of critical importance; yet important events may have insufficient labeled examples to train a conventional deep model to generalize to future object appearance. Activity recognition models that represent object interactions explicitly have the potential to learn in a more efficient manner than those that represent scenes with global descriptors. We propose a novel inter-object graph representation for activity recognition based on a disentangled graph embedding with direct observation of edge appearance. We employ a novel factored embedding of the graph structure, disentangling a representation hierarchy formed over spatial dimensions from that found over temporal variation. We demonstrate the effectiveness of our model on the Charades activity recognition benchmark, as well as a new dataset of driving activities focusing on multi-object interactions with near-collision events. Our model offers significantly improved performance compared to baseline approaches without object-graph representations, or with previous graph-based models.
△ Less
Submitted 29 September, 2019; v1 submitted 4 December, 2018;
originally announced December 2018.
-
Why do Larger Models Generalize Better? A Theoretical Perspective via the XOR Problem
Authors:
Alon Brutzkus,
Amir Globerson
Abstract:
Empirical evidence suggests that neural networks with ReLU activations generalize better with over-parameterization. However, there is currently no theoretical analysis that explains this observation. In this work, we provide theoretical and empirical evidence that, in certain cases, overparameterized convolutional networks generalize better than small networks because of an interplay between weig…
▽ More
Empirical evidence suggests that neural networks with ReLU activations generalize better with over-parameterization. However, there is currently no theoretical analysis that explains this observation. In this work, we provide theoretical and empirical evidence that, in certain cases, overparameterized convolutional networks generalize better than small networks because of an interplay between weight clustering and feature exploration at initialization. We demonstrate this theoretically for a 3-layer convolutional neural network with max-pooling, in a novel setting which extends the XOR problem. We show that this interplay implies that with overparamterization, gradient descent converges to global minima with better generalization performance compared to global minima of small networks. Empirically, we demonstrate these phenomena for a 3-layer convolutional neural network in the MNIST task.
△ Less
Submitted 29 January, 2019; v1 submitted 6 October, 2018;
originally announced October 2018.
-
Explaining Queries over Web Tables to Non-Experts
Authors:
Jonathan Berant,
Daniel Deutch,
Amir Globerson,
Tova Milo,
Tomer Wolfson
Abstract:
Designing a reliable natural language (NL) interface for querying tables has been a longtime goal of researchers in both the data management and natural language processing (NLP) communities. Such an interface receives as input an NL question, translates it into a formal query, executes the query and returns the results. Errors in the translation process are not uncommon, and users typically strug…
▽ More
Designing a reliable natural language (NL) interface for querying tables has been a longtime goal of researchers in both the data management and natural language processing (NLP) communities. Such an interface receives as input an NL question, translates it into a formal query, executes the query and returns the results. Errors in the translation process are not uncommon, and users typically struggle to understand whether their query has been mapped correctly. We address this problem by explaining the obtained formal queries to non-expert users. Two methods for query explanations are presented: the first translates queries into NL, while the second method provides a graphic representation of the query cell-based provenance (in its execution on a given table). Our solution augments a state-of-the-art NL interface over web tables, enhancing it in both its training and deployment phase. Experiments, including a user study conducted on Amazon Mechanical Turk, show our solution to improve both the correctness and reliability of an NL interface.
△ Less
Submitted 14 August, 2018;
originally announced August 2018.
-
Learning Rules-First Classifiers
Authors:
Deborah Cohen,
Amit Daniely,
Amir Globerson,
Gal Elidan
Abstract:
Complex classifiers may exhibit "embarassing" failures in cases where humans can easily provide a justified classification. Avoiding such failures is obviously of key importance. In this work, we focus on one such setting, where a label is perfectly predictable if the input contains certain features, or rules, and otherwise it is predictable by a linear classifier. We define a hypothesis class tha…
▽ More
Complex classifiers may exhibit "embarassing" failures in cases where humans can easily provide a justified classification. Avoiding such failures is obviously of key importance. In this work, we focus on one such setting, where a label is perfectly predictable if the input contains certain features, or rules, and otherwise it is predictable by a linear classifier. We define a hypothesis class that captures this notion and determine its sample complexity. We also give evidence that efficient algorithms cannot achieve this sample complexity. We then derive a simple and efficient algorithm and show that its sample complexity is close to optimal, among efficient algorithms. Experiments on synthetic and sentiment analysis data demonstrate the efficacy of the method, both in terms of accuracy and interpretability.
△ Less
Submitted 13 June, 2019; v1 submitted 8 March, 2018;
originally announced March 2018.
-
Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction
Authors:
Roei Herzig,
Moshiko Raboh,
Gal Chechik,
Jonathan Berant,
Amir Globerson
Abstract:
Machine understanding of complex images is a key goal of artificial intelligence. One challenge underlying this task is that visual scenes contain multiple inter-related objects, and that global context plays an important role in interpreting the scene. A natural modeling framework for capturing such effects is structured prediction, which optimizes over complex labels, while modeling within-label…
▽ More
Machine understanding of complex images is a key goal of artificial intelligence. One challenge underlying this task is that visual scenes contain multiple inter-related objects, and that global context plays an important role in interpreting the scene. A natural modeling framework for capturing such effects is structured prediction, which optimizes over complex labels, while modeling within-label interactions. However, it is unclear what principles should guide the design of a structured prediction model that utilizes the power of deep learning components. Here we propose a design principle for such architectures that follows from a natural requirement of permutation invariance. We prove a necessary and sufficient characterization for architectures that follow this invariance, and discuss its implication on model design. Finally, we show that the resulting model achieves new state of the art results on the Visual Genome scene graph labeling benchmark, outperforming all recent approaches.
△ Less
Submitted 1 November, 2018; v1 submitted 15 February, 2018;
originally announced February 2018.
-
Predict and Constrain: Modeling Cardinality in Deep Structured Prediction
Authors:
Nataly Brukhim,
Amir Globerson
Abstract:
Many machine learning problems require the prediction of multi-dimensional labels. Such structured prediction models can benefit from modeling dependencies between labels. Recently, several deep learning approaches to structured prediction have been proposed. Here we focus on capturing cardinality constraints in such models. Namely, constraining the number of non-zero labels that the model outputs…
▽ More
Many machine learning problems require the prediction of multi-dimensional labels. Such structured prediction models can benefit from modeling dependencies between labels. Recently, several deep learning approaches to structured prediction have been proposed. Here we focus on capturing cardinality constraints in such models. Namely, constraining the number of non-zero labels that the model outputs. Such constraints have proven very useful in previous structured prediction approaches, but it is a challenge to introduce them into a deep learning framework. Here we show how to do this via a novel deep architecture. Our approach outperforms strong baselines, achieving state-of-the-art results on multi-label classification benchmarks.
△ Less
Submitted 13 February, 2018;
originally announced February 2018.
-
Weakly-supervised Semantic Parsing with Abstract Examples
Authors:
Omer Goldman,
Veronica Latcinnik,
Udi Naveh,
Amir Globerson,
Jonathan Berant
Abstract:
Training semantic parsers from weak supervision (denotations) rather than strong supervision (programs) complicates training in two ways. First, a large search space of potential programs needs to be explored at training time to find a correct program. Second, spurious programs that accidentally lead to a correct denotation add noise to training. In this work we propose that in closed worlds with…
▽ More
Training semantic parsers from weak supervision (denotations) rather than strong supervision (programs) complicates training in two ways. First, a large search space of potential programs needs to be explored at training time to find a correct program. Second, spurious programs that accidentally lead to a correct denotation add noise to training. In this work we propose that in closed worlds with clear semantic types, one can substantially alleviate these problems by utilizing an abstract representation, where tokens in both the language utterance and program are lifted to an abstract form. We show that these abstractions can be defined with a handful of lexical rules and that they result in sharing between different examples that alleviates the difficulties in training. To test our approach, we develop the first semantic parser for CNLVR, a challenging visual reasoning dataset, where the search space is large and overcoming spuriousness is critical, because denotations are either TRUE or FALSE, and thus random programs are likely to lead to a correct denotation. Our method substantially improves performance, and reaches 82.5% accuracy, a 14.7% absolute accuracy improvement compared to the best reported accuracy so far.
△ Less
Submitted 13 March, 2019; v1 submitted 14 November, 2017;
originally announced November 2017.
-
SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data
Authors:
Alon Brutzkus,
Amir Globerson,
Eran Malach,
Shai Shalev-Shwartz
Abstract:
Neural networks exhibit good generalization behavior in the over-parameterized regime, where the number of network parameters exceeds the number of observations. Nonetheless, current generalization bounds for neural networks fail to explain this phenomenon. In an attempt to bridge this gap, we study the problem of learning a two-layer over-parameterized neural network, when the data is generated b…
▽ More
Neural networks exhibit good generalization behavior in the over-parameterized regime, where the number of network parameters exceeds the number of observations. Nonetheless, current generalization bounds for neural networks fail to explain this phenomenon. In an attempt to bridge this gap, we study the problem of learning a two-layer over-parameterized neural network, when the data is generated by a linearly separable function. In the case where the network has Leaky ReLU activations, we provide both optimization and generalization guarantees for over-parameterized networks. Specifically, we prove convergence rates of SGD to a global minimum and provide generalization guarantees for this global minimum that are independent of the network size. Therefore, our result clearly shows that the use of SGD for optimization both finds a global minimum, and avoids overfitting despite the high capacity of the model. This is the first theoretical demonstration that SGD can avoid overfitting, when learning over-specified neural network classifiers.
△ Less
Submitted 27 October, 2017;
originally announced October 2017.
-
Robust Conditional Probabilities
Authors:
Yoav Wald,
Amir Globerson
Abstract:
Conditional probabilities are a core concept in machine learning. For example, optimal prediction of a label $Y$ given an input $X$ corresponds to maximizing the conditional probability of $Y$ given $X$. A common approach to inference tasks is learning a model of conditional probabilities. However, these models are often based on strong assumptions (e.g., log-linear models), and hence their estima…
▽ More
Conditional probabilities are a core concept in machine learning. For example, optimal prediction of a label $Y$ given an input $X$ corresponds to maximizing the conditional probability of $Y$ given $X$. A common approach to inference tasks is learning a model of conditional probabilities. However, these models are often based on strong assumptions (e.g., log-linear models), and hence their estimate of conditional probabilities is not robust and is highly dependent on the validity of their assumptions.
Here we propose a framework for reasoning about conditional probabilities without assuming anything about the underlying distributions, except knowledge of their second order marginals, which can be estimated from data. We show how this setting leads to guaranteed bounds on conditional probabilities, which can be calculated efficiently in a variety of settings, including structured-prediction. Finally, we apply them to semi-supervised deep learning, obtaining results competitive with variational autoencoders.
△ Less
Submitted 8 August, 2017;
originally announced August 2017.
-
Semi-Supervised Learning with Competitive Infection Models
Authors:
Nir Rosenfeld,
Amir Globerson
Abstract:
The goal in semi-supervised learning is to effectively combine labeled and unlabeled data. One way to do this is by encouraging smoothness across edges in a graph whose nodes correspond to input examples. In many graph-based methods, labels can be thought of as propagating over the graph, where the underlying propagation mechanism is based on random walks or on averaging dynamics. While theoretica…
▽ More
The goal in semi-supervised learning is to effectively combine labeled and unlabeled data. One way to do this is by encouraging smoothness across edges in a graph whose nodes correspond to input examples. In many graph-based methods, labels can be thought of as propagating over the graph, where the underlying propagation mechanism is based on random walks or on averaging dynamics. While theoretically elegant, these dynamics suffer from several drawbacks which can hurt predictive performance.
Our goal in this work is to explore alternative mechanisms for propagating labels. In particular, we propose a method based on dynamic infection processes, where unlabeled nodes can be "infected" with the label of their already infected neighbors. Our algorithm is efficient and scalable, and an analysis of the underlying optimization objective reveals a surprising relation to other Laplacian approaches. We conclude with a thorough set of experiments across multiple benchmarks and various learning settings.
△ Less
Submitted 27 February, 2018; v1 submitted 19 March, 2017;
originally announced March 2017.
-
Globally Optimal Gradient Descent for a ConvNet with Gaussian Inputs
Authors:
Alon Brutzkus,
Amir Globerson
Abstract:
Deep learning models are often successfully trained using gradient descent, despite the worst case hardness of the underlying non-convex optimization problem. The key question is then under what conditions can one prove that optimization will succeed. Here we provide a strong result of this kind. We consider a neural net with one hidden layer and a convolutional structure with no overlap and a ReL…
▽ More
Deep learning models are often successfully trained using gradient descent, despite the worst case hardness of the underlying non-convex optimization problem. The key question is then under what conditions can one prove that optimization will succeed. Here we provide a strong result of this kind. We consider a neural net with one hidden layer and a convolutional structure with no overlap and a ReLU activation function. For this architecture we show that learning is NP-complete in the general case, but that when the input distribution is Gaussian, gradient descent converges to the global optimum in polynomial time. To the best of our knowledge, this is the first global optimality guarantee of gradient descent on a convolutional neural network with ReLU activations.
△ Less
Submitted 25 February, 2017;
originally announced February 2017.
-
Learning to generalize to new compositions in image understanding
Authors:
Yuval Atzmon,
Jonathan Berant,
Vahid Kezami,
Amir Globerson,
Gal Chechik
Abstract:
Recurrent neural networks have recently been used for learning to describe images using natural language. However, it has been observed that these models generalize poorly to scenes that were not observed during training, possibly depending too strongly on the statistics of the text in the training data. Here we propose to describe images using short structured representations, aiming to capture t…
▽ More
Recurrent neural networks have recently been used for learning to describe images using natural language. However, it has been observed that these models generalize poorly to scenes that were not observed during training, possibly depending too strongly on the statistics of the text in the training data. Here we propose to describe images using short structured representations, aiming to capture the crux of a description. These structured representations allow us to tease-out and evaluate separately two types of generalization: standard generalization to new images with similar scenes, and generalization to new combinations of known entities. We compare two learning approaches on the MS-COCO dataset: a state-of-the-art recurrent network based on an LSTM (Show, Attend and Tell), and a simple structured prediction model on top of a deep network. We find that the structured model generalizes to new compositions substantially better than the LSTM, ~7 times the accuracy of predicting structured representations. By providing a concrete method to quantify generalization for unseen combinations, we argue that structured representations and compositional splits are a useful benchmark for image captioning, and advocate compositional models that capture linguistic and visual structure.
△ Less
Submitted 26 August, 2016;
originally announced August 2016.
-
Learning Infinite-Layer Networks: Without the Kernel Trick
Authors:
Roi Livni,
Daniel Carmon,
Amir Globerson
Abstract:
Infinite--Layer Networks (ILN) have recently been proposed as an architecture that mimics neural networks while enjoying some of the advantages of kernel methods. ILN are networks that integrate over infinitely many nodes within a single hidden layer. It has been demonstrated by several authors that the problem of learning ILN can be reduced to the kernel trick, implying that whenever a certain in…
▽ More
Infinite--Layer Networks (ILN) have recently been proposed as an architecture that mimics neural networks while enjoying some of the advantages of kernel methods. ILN are networks that integrate over infinitely many nodes within a single hidden layer. It has been demonstrated by several authors that the problem of learning ILN can be reduced to the kernel trick, implying that whenever a certain integral can be computed analytically they are efficiently learnable.
In this work we give an online algorithm for ILN, which avoids the kernel trick assumption. More generally and of independent interest, we show that kernel methods in general can be exploited even when the kernel cannot be efficiently computed but can only be estimated via sampling.
We provide a regret analysis for our algorithm, showing that it matches the sample complexity of methods which have access to kernel values. Thus, our method is the first to demonstrate that the kernel trick is not necessary as such, and random features suffice to obtain comparable performance.
△ Less
Submitted 28 July, 2017; v1 submitted 16 June, 2016;
originally announced June 2016.
-
Optimal Tagging with Markov Chain Optimization
Authors:
Nir Rosenfeld,
Amir Globerson
Abstract:
Many information systems use tags and keywords to describe and annotate content. These allow for efficient organization and categorization of items, as well as facilitate relevant search queries. As such, the selected set of tags for an item can have a considerable effect on the volume of traffic that eventually reaches an item. In settings where tags are chosen by an item's creator, who in turn i…
▽ More
Many information systems use tags and keywords to describe and annotate content. These allow for efficient organization and categorization of items, as well as facilitate relevant search queries. As such, the selected set of tags for an item can have a considerable effect on the volume of traffic that eventually reaches an item. In settings where tags are chosen by an item's creator, who in turn is interested in maximizing traffic, a principled approach for choosing tags can prove valuable. In this paper we introduce the problem of optimal tagging, where the task is to choose a subset of tags for a new item such that the probability of a browsing user reaching that item is maximized. We formulate the problem by modeling traffic using a Markov chain, and asking how transitions in this chain should be modified to maximize traffic into a certain state of interest. The resulting optimization problem involves maximizing a certain function over subsets, under a cardinality constraint. We show that the optimization problem is NP-hard, but nonetheless has a simple (1-1/e)-approximation via a simple greedy algorithm. Furthermore, the structure of the problem allows for an efficient implementation of the greedy step.To demonstrate the effectiveness of our method, we perform experiments on three tagging datasets, and show that the greedy algorithm outperforms other baselines.
△ Less
Submitted 19 May, 2016; v1 submitted 16 May, 2016;
originally announced May 2016.
-
Tight Error Bounds for Structured Prediction
Authors:
Amir Globerson,
Tim Roughgarden,
David Sontag,
Cafer Yildirim
Abstract:
Structured prediction tasks in machine learning involve the simultaneous prediction of multiple labels. This is typically done by maximizing a score function on the space of labels, which decomposes as a sum of pairwise elements, each depending on two specific labels. Intuitively, the more pairwise terms are used, the better the expected accuracy. However, there is currently no theoretical account…
▽ More
Structured prediction tasks in machine learning involve the simultaneous prediction of multiple labels. This is typically done by maximizing a score function on the space of labels, which decomposes as a sum of pairwise elements, each depending on two specific labels. Intuitively, the more pairwise terms are used, the better the expected accuracy. However, there is currently no theoretical account of this intuition. This paper takes a significant step in this direction.
We formulate the problem as classifying the vertices of a known graph $G=(V,E)$, where the vertices and edges of the graph are labelled and correlate semi-randomly with the ground truth. We show that the prospects for achieving low expected Hamming error depend on the structure of the graph $G$ in interesting ways. For example, if $G$ is a very poor expander, like a path, then large expected Hamming error is inevitable. Our main positive result shows that, for a wide class of graphs including 2D grid graphs common in machine vision applications, there is a polynomial-time algorithm with small and information-theoretically near-optimal expected error. Our results provide a first step toward a theoretical justification for the empirical success of the efficient approximate inference algorithms that are used for structured prediction in models where exact inference is intractable.
△ Less
Submitted 19 September, 2014;
originally announced September 2014.
-
Tighter Linear Program Relaxations for High Order Graphical Models
Authors:
Elad Mezuman,
Daniel Tarlow,
Amir Globerson,
Yair Weiss
Abstract:
Graphical models with High Order Potentials (HOPs) have received considerable interest in recent years. While there are a variety of approaches to inference in these models, nearly all of them amount to solving a linear program (LP) relaxation with unary consistency constraints between the HOP and the individual variables. In many cases, the resulting relaxations are loose, and in these cases the…
▽ More
Graphical models with High Order Potentials (HOPs) have received considerable interest in recent years. While there are a variety of approaches to inference in these models, nearly all of them amount to solving a linear program (LP) relaxation with unary consistency constraints between the HOP and the individual variables. In many cases, the resulting relaxations are loose, and in these cases the results of inference can be poor. It is thus desirable to look for more accurate ways of performing inference in these models. In this work, we study the LP relaxations that result from enforcing additional consistency constraints between the HOP and the rest of the model. We address theoretical questions about the strength of the resulting relaxations compared to the relaxations that arise in standard approaches, and we develop practical and efficient message passing algorithms for optimizing the LPs. Empirically, we show that the LPs with additional consistency constraints lead to more accurate inference on some challenging problems that include a combination of low order and high order terms.
△ Less
Submitted 26 September, 2013;
originally announced September 2013.
-
Learning Max-Margin Tree Predictors
Authors:
Ofer Meshi,
Elad Eban,
Gal Elidan,
Amir Globerson
Abstract:
Structured prediction is a powerful framework for coping with joint prediction of interacting outputs. A central difficulty in using this framework is that often the correct label dependence structure is unknown. At the same time, we would like to avoid an overly complex structure that will lead to intractable prediction. In this work we address the challenge of learning tree structured predictive…
▽ More
Structured prediction is a powerful framework for coping with joint prediction of interacting outputs. A central difficulty in using this framework is that often the correct label dependence structure is unknown. At the same time, we would like to avoid an overly complex structure that will lead to intractable prediction. In this work we address the challenge of learning tree structured predictive models that achieve high accuracy while at the same time facilitate efficient (linear time) inference. We start by proving that this task is in general NP-hard, and then suggest an approximate alternative. Briefly, our CRANK approach relies on a novel Circuit-RANK regularizer that penalizes non-tree structures and that can be optimized using a CCCP procedure. We demonstrate the effectiveness of our approach on several domains and show that, despite the relative simplicity of the structure, prediction accuracy is competitive with a fully connected model that is computationally costly at prediction time.
△ Less
Submitted 26 September, 2013;
originally announced September 2013.
-
Sufficient Dimensionality Reduction with Irrelevant Statistics
Authors:
Amir Globerson,
Gal Chechik,
Naftali Tishby
Abstract:
The problem of finding a reduced dimensionality representation of categorical variables while preserving their most relevant characteristics is fundamental for the analysis of complex data. Specifically, given a co-occurrence matrix of two variables, one often seeks a compact representation of one variable which preserves information about the other variable. We have recently intro…
▽ More
The problem of finding a reduced dimensionality representation of categorical variables while preserving their most relevant characteristics is fundamental for the analysis of complex data. Specifically, given a co-occurrence matrix of two variables, one often seeks a compact representation of one variable which preserves information about the other variable. We have recently introduced ``Sufficient Dimensionality Reduction' [GT-2003], a method that extracts continuous reduced dimensional features whose measurements (i.e., expectation values) capture maximal mutual information among the variables. However, such measurements often capture information that is irrelevant for a given task. Widely known examples are illumination conditions, which are irrelevant as features for face recognition, writing style which is irrelevant as a feature for content classification, and intonation which is irrelevant as a feature for speech recognition. Such irrelevance cannot be deduced apriori, since it depends on the details of the task, and is thus inherently ill defined in the purely unsupervised case. Separating relevant from irrelevant features can be achieved using additional side data that contains such irrelevant structures. This approach was taken in [CT-2002], extending the information bottleneck method, which uses clustering to compress the data. Here we use this side-information framework to identify features whose measurements are maximally informative for the original data set, but carry as little information as possible on a side data set. In statistical terms this can be understood as extracting statistics which are maximally sufficient for the original dataset, while simultaneously maximally ancillary for the side dataset. We formulate this tradeoff as a constrained optimization problem and characterize its solutions. We then derive a gradient descent algorithm for this problem, which is based on the Generalized Iterative Scaling method for finding maximum entropy distributions. The method is demonstrated on synthetic data, as well as on real face recognition datasets, and is shown to outperform standard methods such as oriented PCA.
△ Less
Submitted 19 October, 2012;
originally announced December 2012.
-
The Minimum Information Principle for Discriminative Learning
Authors:
Amir Globerson,
Naftali Tishby
Abstract:
Exponential models of distributions are widely used in machine learning for classiffication and modelling. It is well known that they can be interpreted as maximum entropy models under empirical expectation constraints. In this work, we argue that for classiffication tasks, mutual information is a more suitable information theoretic measure to be optimized. We show how the principle of minimum mut…
▽ More
Exponential models of distributions are widely used in machine learning for classiffication and modelling. It is well known that they can be interpreted as maximum entropy models under empirical expectation constraints. In this work, we argue that for classiffication tasks, mutual information is a more suitable information theoretic measure to be optimized. We show how the principle of minimum mutual information generalizes that of maximum entropy, and provides a comprehensive framework for building discriminative classiffiers. A game theoretic interpretation of our approach is then given, and several generalization bounds provided. We present iterative algorithms for solving the minimum information problem and its convex dual, and demonstrate their performance on various classiffication tasks. The results show that minimum information classiffiers outperform the corresponding maximum entropy models.
△ Less
Submitted 11 July, 2012;
originally announced July 2012.
-
Discriminative Learning via Semidefinite Probabilistic Models
Authors:
Koby Crammer,
Amir Globerson
Abstract:
Discriminative linear models are a popular tool in machine learning. These can be generally divided into two types: The first is linear classifiers, such as support vector machines, which are well studied and provide state-of-the-art results. One shortcoming of these models is that their output (known as the 'margin') is not calibrated, and cannot be translated naturally into a distribution over t…
▽ More
Discriminative linear models are a popular tool in machine learning. These can be generally divided into two types: The first is linear classifiers, such as support vector machines, which are well studied and provide state-of-the-art results. One shortcoming of these models is that their output (known as the 'margin') is not calibrated, and cannot be translated naturally into a distribution over the labels. Thus, it is difficult to incorporate such models as components of larger systems, unlike probabilistic based approaches. The second type of approach constructs class conditional distributions using a nonlinearity (e.g. log-linear models), but is occasionally worse in terms of classification error. We propose a supervised learning method which combines the best of both approaches. Specifically, our method provides a distribution over the labels, which is a linear function of the model parameters. As a consequence, differences between probabilities are linear functions, a property which most probabilistic models (e.g. log-linear) do not have.
Our model assumes that classes correspond to linear subspaces (rather than to half spaces). Using a relaxed projection operator, we construct a measure which evaluates the degree to which a given vector 'belongs' to a subspace, resulting in a distribution over labels. Interestingly, this view is closely related to similar concepts in quantum detection theory. The resulting models can be trained either to maximize the margin or to optimize average likelihood measures. The corresponding optimization problems are semidefinite programs which can be solved efficiently. We illustrate the performance of our algorithm on real world datasets, and show that it outperforms 2nd order kernel methods.
△ Less
Submitted 27 June, 2012;
originally announced June 2012.
-
Convergent Propagation Algorithms via Oriented Trees
Authors:
Amir Globerson,
Tommi S. Jaakkola
Abstract:
Inference problems in graphical models are often approximated by casting them as constrained optimization problems. Message passing algorithms, such as belief propagation, have previously been suggested as methods for solving these optimization problems. However, there are few convergence guarantees for such algorithms, and the algorithms are therefore not guaranteed to solve the corresponding opt…
▽ More
Inference problems in graphical models are often approximated by casting them as constrained optimization problems. Message passing algorithms, such as belief propagation, have previously been suggested as methods for solving these optimization problems. However, there are few convergence guarantees for such algorithms, and the algorithms are therefore not guaranteed to solve the corresponding optimization problem. Here we present an oriented tree decomposition algorithm that is guaranteed to converge to the global optimum of the Tree-Reweighted (TRW) variational problem. Our algorithm performs local updates in the convex dual of the TRW problem - an unconstrained generalized geometric program. Primal updates, also local, correspond to oriented reparametrization operations that leave the distribution intact.
△ Less
Submitted 20 June, 2012;
originally announced June 2012.
-
Learning the Experts for Online Sequence Prediction
Authors:
Elad Eban,
Aharon Birnbaum,
Shai Shalev-Shwartz,
Amir Globerson
Abstract:
Online sequence prediction is the problem of predicting the next element of a sequence given previous elements. This problem has been extensively studied in the context of individual sequence prediction, where no prior assumptions are made on the origin of the sequence. Individual sequence prediction algorithms work quite well for long sequences, where the algorithm has enough time to learn the te…
▽ More
Online sequence prediction is the problem of predicting the next element of a sequence given previous elements. This problem has been extensively studied in the context of individual sequence prediction, where no prior assumptions are made on the origin of the sequence. Individual sequence prediction algorithms work quite well for long sequences, where the algorithm has enough time to learn the temporal structure of the sequence. However, they might give poor predictions for short sequences. A possible remedy is to rely on the general model of prediction with expert advice, where the learner has access to a set of $r$ experts, each of which makes its own predictions on the sequence. It is well known that it is possible to predict almost as well as the best expert if the sequence length is order of $\log(r)$. But, without firm prior knowledge on the problem, it is not clear how to choose a small set of {\em good} experts. In this paper we describe and analyze a new algorithm that learns a good set of experts using a training set of previously observed sequences. We demonstrate the merits of our approach by applying it on the task of click prediction on the web.
△ Less
Submitted 18 June, 2012;
originally announced June 2012.
-
Tightening LP Relaxations for MAP using Message Passing
Authors:
David Sontag,
Talya Meltzer,
Amir Globerson,
Tommi S. Jaakkola,
Yair Weiss
Abstract:
Linear Programming (LP) relaxations have become powerful tools for finding the most probable (MAP) configuration in graphical models. These relaxations can be solved efficiently using message-passing algorithms such as belief propagation and, when the relaxation is tight, provably find the MAP configuration. The standard LP relaxation is not tight enough in many real-world problems, however, and t…
▽ More
Linear Programming (LP) relaxations have become powerful tools for finding the most probable (MAP) configuration in graphical models. These relaxations can be solved efficiently using message-passing algorithms such as belief propagation and, when the relaxation is tight, provably find the MAP configuration. The standard LP relaxation is not tight enough in many real-world problems, however, and this has lead to the use of higher order cluster-based LP relaxations. The computational cost increases exponentially with the size of the clusters and limits the number and type of clusters we can use. We propose to solve the cluster selection problem monotonically in the dual LP, iteratively selecting clusters with guaranteed improvement, and quickly re-solving with the added clusters by reusing the existing solution. Our dual message-passing algorithm finds the MAP configuration in protein sidechain placement, protein design, and stereo problems, in cases where the standard LP relaxation fails.
△ Less
Submitted 13 June, 2012;
originally announced June 2012.
-
Convergent message passing algorithms - a unifying view
Authors:
Talya Meltzer,
Amir Globerson,
Yair Weiss
Abstract:
Message-passing algorithms have emerged as powerful techniques for approximate inference in graphical models. When these algorithms converge, they can be shown to find local (or sometimes even global) optima of variational formulations to the inference problem. But many of the most popular algorithms are not guaranteed to converge. This has lead to recent interest in convergent message-passing alg…
▽ More
Message-passing algorithms have emerged as powerful techniques for approximate inference in graphical models. When these algorithms converge, they can be shown to find local (or sometimes even global) optima of variational formulations to the inference problem. But many of the most popular algorithms are not guaranteed to converge. This has lead to recent interest in convergent message-passing algorithms. In this paper, we present a unified view of convergent message-passing algorithms. We present a simple derivation of an abstract algorithm, tree-consistency bound optimization (TCBO) that is provably convergent in both its sum and max product forms. We then show that many of the existing convergent algorithms are instances of our TCBO algorithm, and obtain novel convergent algorithms "for free" by exchanging maximizations and summations in existing algorithms. In particular, we show that Wainwright's non-convergent sum-product algorithm for tree based variational bounds, is actually convergent with the right update order for the case where trees are monotonic chains.
△ Less
Submitted 9 May, 2012;
originally announced May 2012.
-
Convexifying the Bethe Free Energy
Authors:
Ofer Meshi,
Ariel Jaimovich,
Amir Globerson,
Nir Friedman
Abstract:
The introduction of loopy belief propagation (LBP) revitalized the application of graphical models in many domains. Many recent works present improvements on the basic LBP algorithm in an attempt to overcome convergence and local optima problems. Notable among these are convexified free energy approximations that lead to inference procedures with provable convergence and quality properties. Howeve…
▽ More
The introduction of loopy belief propagation (LBP) revitalized the application of graphical models in many domains. Many recent works present improvements on the basic LBP algorithm in an attempt to overcome convergence and local optima problems. Notable among these are convexified free energy approximations that lead to inference procedures with provable convergence and quality properties. However, empirically LBP still outperforms most of its convex variants in a variety of settings, as we also demonstrate here. Motivated by this fact we seek convexified free energies that directly approximate the Bethe free energy. We show that the proposed approximations compare favorably with state-of-the art convex free energy approximations.
△ Less
Submitted 9 May, 2012;
originally announced May 2012.
-
What Cannot be Learned with Bethe Approximations
Authors:
Uri Heinemann,
Amir Globerson
Abstract:
We address the problem of learning the parameters in graphical models when inference is intractable. A common strategy in this case is to replace the partition function with its Bethe approximation. We show that there exists a regime of empirical marginals where such Bethe learning will fail. By failure we mean that the empirical marginals cannot be recovered from the approximated maximum likeliho…
▽ More
We address the problem of learning the parameters in graphical models when inference is intractable. A common strategy in this case is to replace the partition function with its Bethe approximation. We show that there exists a regime of empirical marginals where such Bethe learning will fail. By failure we mean that the empirical marginals cannot be recovered from the approximated maximum likelihood parameters (i.e., moment matching is not achieved). We provide several conditions on empirical marginals that yield outer and inner bounds on the set of Bethe learnable marginals. An interesting implication of our results is that there exists a large class of marginals that cannot be obtained as stable fixed points of belief propagation. Taken together our results provide a novel approach to analyzing learning with Bethe approximations and highlight when it can be expected to work or fail.
△ Less
Submitted 14 February, 2012;
originally announced February 2012.
-
Gaussian Robust Classification
Authors:
Ido Ginodi,
Amir Globerson
Abstract:
Supervised learning is all about the ability to generalize knowledge. Specifically, the goal of the learning is to train a classifier using training data, in such a way that it will be capable of classifying new unseen data correctly. In order to acheive this goal, it is important to carefully design the learner, so it will not overfit the training data. The later can is done usually by adding a r…
▽ More
Supervised learning is all about the ability to generalize knowledge. Specifically, the goal of the learning is to train a classifier using training data, in such a way that it will be capable of classifying new unseen data correctly. In order to acheive this goal, it is important to carefully design the learner, so it will not overfit the training data. The later can is done usually by adding a regularization term. The statistical learning theory explains the success of this method by claiming that it restricts the complexity of the learned model. This explanation, however, is rather abstract and does not have a geometric intuition. The generalization error of a classifier may be thought of as correlated with its robustness to perturbations of the data: a classifier that copes with disturbance is expected to generalize well. Indeed, Xu et al. [2009] have shown that the SVM formulation is equivalent to a robust optimization (RO) formulation, in which an adversary displaces the training and testing points within a ball of pre-determined radius. In this work we explore a different kind of robustness, namely changing each data point with a Gaussian cloud centered at the sample. Loss is evaluated as the expectation of an underlying loss function on the cloud. This setup fits the fact that in many applications, the data is sampled along with noise. We develop an RO framework, in which the adversary chooses the covariance of the noise. In our algorithm named GURU, the tuning parameter is a spectral bound on the noise, thus it can be estimated using physical or applicative considerations. Our experiments show that this framework performs as well as SVM and even slightly better in some cases. Generalizations for Mercer kernels and for the multiclass case are presented as well. We also show that our framework may be further generalized, using the technique of convex perspective functions.
△ Less
Submitted 1 April, 2011;
originally announced April 2011.