-
PaliGemma: A versatile 3B VLM for transfer
Authors:
Lucas Beyer,
Andreas Steiner,
André Susano Pinto,
Alexander Kolesnikov,
Xiao Wang,
Daniel Salz,
Maxim Neumann,
Ibrahim Alabdulmohsin,
Michael Tschannen,
Emanuele Bugliarello,
Thomas Unterthiner,
Daniel Keysers,
Skanda Koppula,
Fangyu Liu,
Adam Grycner,
Alexey Gritsenko,
Neil Houlsby,
Manoj Kumar,
Keran Rong,
Julian Eisenschlos,
Rishabh Kabra,
Matthias Bauer,
Matko Bošnjak,
Xi Chen,
Matthias Minderer, et al. (10 additional authors not shown)
Abstract:
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more specialized tasks such as remote-sensing and segmentation.
Submitted 10 July, 2024;
originally announced July 2024.
-
Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection
Authors:
Tim Salzmann,
Markus Ryll,
Alex Bewley,
Matthias Minderer
Abstract:
Visual relationship detection aims to identify objects and their relationships in images. Prior methods approach this task by adding separate relationship modules or decoders to existing object detection architectures. This separation increases complexity and hinders end-to-end training, which limits performance. We propose a simple and highly efficient decoder-free architecture for open-vocabulary visual relationship detection. Our model consists of a Transformer-based image encoder that represents objects as tokens and models their relationships implicitly. To extract relationship information, we introduce an attention mechanism that selects object pairs likely to form a relationship. We provide a single-stage recipe to train this model on a mixture of object and relationship detection data. Our approach achieves state-of-the-art relationship detection performance on Visual Genome and on the large-vocabulary GQA benchmark at real-time inference speeds. We provide analyses of zero-shot performance, ablations, and real-world qualitative examples.
Submitted 21 March, 2024;
originally announced March 2024.
-
Improving fine-grained understanding in image-text pre-training
Authors:
Ioana Bica,
Anastasija Ilić,
Matthias Bauer,
Goker Erdogan,
Matko Bošnjak,
Christos Kaplanis,
Alexey A. Gritsenko,
Matthias Minderer,
Charles Blundell,
Razvan Pascanu,
Jovana Mitrović
Abstract:
We introduce SPARse Fine-grained Contrastive Alignment (SPARC), a simple method for pretraining more fine-grained multimodal representations from image-text pairs. Given that multiple image patches often correspond to single words, we propose to learn a grouping of image patches for every token in the caption. To achieve this, we use a sparse similarity metric between image patches and language tokens and compute for each token a language-grouped vision embedding as the weighted average of patches. The token and language-grouped vision embeddings are then contrasted through a fine-grained sequence-wise loss that only depends on individual samples and does not require other batch samples as negatives. This enables more detailed information to be learned in a computationally inexpensive manner. SPARC combines this fine-grained loss with a contrastive loss between global image and text embeddings to learn representations that simultaneously encode global and local information. We thoroughly evaluate our proposed method and show improved performance over competing approaches both on image-level tasks relying on coarse-grained information, e.g. classification, as well as region-level tasks relying on fine-grained information, e.g. retrieval, object detection, and segmentation. Moreover, SPARC improves model faithfulness and captioning in foundational vision-language models.
Submitted 18 January, 2024;
originally announced January 2024.
-
Video OWL-ViT: Temporally-consistent open-world localization in video
Authors:
Georg Heigold,
Matthias Minderer,
Alexey Gritsenko,
Alex Bewley,
Daniel Keysers,
Mario Lučić,
Fisher Yu,
Thomas Kipf
Abstract:
We present an architecture and a training recipe that adapts pre-trained open-world image models to localization in videos. Understanding the open visual world (without being constrained by fixed label spaces) is crucial for many real-world vision tasks. Contrastive pre-training on large image-text datasets has recently led to significant improvements for image-level tasks. For more structured tasks involving object localization, applying pre-trained models is more challenging. This is particularly true for video tasks, where task-specific data is limited. We show successful transfer of open-world models by building on the OWL-ViT open-vocabulary detection model and adapting it to video by adding a transformer decoder. The decoder propagates object representations recurrently through time by using the output tokens for one frame as the object queries for the next. Our model is end-to-end trainable on video data and enjoys improved temporal consistency compared to tracking-by-detection baselines, while retaining the open-world capabilities of the backbone detector. We evaluate our model on the challenging TAO-OW benchmark and demonstrate that open-world capabilities, learned from large-scale image-text pre-training, can be transferred successfully to open-world localization across diverse videos.
Submitted 21 August, 2023;
originally announced August 2023.
-
Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
Authors:
Mostafa Dehghani,
Basil Mustafa,
Josip Djolonga,
Jonathan Heek,
Matthias Minderer,
Mathilde Caron,
Andreas Steiner,
Joan Puigcerver,
Robert Geirhos,
Ibrahim Alabdulmohsin,
Avital Oliver,
Piotr Padlewski,
Alexey Gritsenko,
Mario Lučić,
Neil Houlsby
Abstract:
The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution before processing them with computer vision models has not yet been successfully challenged. However, models such as the Vision Transformer (ViT) offer flexible sequence-based modeling, and hence varying input sequence lengths. We take advantage of this with NaViT (Native Resolution ViT) which uses sequence packing during training to process inputs of arbitrary resolutions and aspect ratios. Alongside flexible model usage, we demonstrate improved training efficiency for large-scale supervised and contrastive image-text pretraining. NaViT can be efficiently transferred to standard tasks such as image and video classification, object detection, and semantic segmentation and leads to improved results on robustness and fairness benchmarks. At inference time, the input resolution flexibility can be used to smoothly navigate the test-time cost-performance trade-off. We believe that NaViT marks a departure from the standard, CNN-designed, input and modelling pipeline used by most computer vision models, and represents a promising direction for ViTs.
Submitted 12 July, 2023;
originally announced July 2023.
-
Scaling Open-Vocabulary Object Detection
Authors:
Matthias Minderer,
Alexey Gritsenko,
Neil Houlsby
Abstract:
Open-vocabulary object detection has benefited greatly from pretrained vision-language models, but is still limited by the amount of available detection training data. While detection training data can be expanded by using Web image-text pairs as weak supervision, this has not been done at scales comparable to image-level pretraining. Here, we scale up detection data with self-training, which uses an existing detector to generate pseudo-box annotations on image-text pairs. Major challenges in scaling self-training are the choice of label space, pseudo-annotation filtering, and training efficiency. We present the OWLv2 model and OWL-ST self-training recipe, which address these challenges. OWLv2 surpasses the performance of previous state-of-the-art open-vocabulary detectors already at comparable training scales (~10M examples). However, with OWL-ST, we can scale to over 1B examples, yielding further large improvement: With an L/14 architecture, OWL-ST improves AP on LVIS rare classes, for which the model has seen no human box annotations, from 31.2% to 44.6% (43% relative improvement). OWL-ST unlocks Web-scale training for open-world localization, similar to what has been seen for image classification and language modelling.
Submitted 22 May, 2024; v1 submitted 16 June, 2023;
originally announced June 2023.
-
PaLI-X: On Scaling up a Multilingual Vision and Language Model
Authors:
Xi Chen,
Josip Djolonga,
Piotr Padlewski,
Basil Mustafa,
Soravit Changpinyo,
Jialin Wu,
Carlos Riquelme Ruiz,
Sebastian Goodman,
Xiao Wang,
Yi Tay,
Siamak Shakeri,
Mostafa Dehghani,
Daniel Salz,
Mario Lucic,
Michael Tschannen,
Arsha Nagrani,
Hexiang Hu,
Mandar Joshi,
Bo Pang,
Ceslee Montgomery,
Paulina Pietrzyk,
Marvin Ritter,
AJ Piergiovanni,
Matthias Minderer,
Filip Pavetic, et al. (18 additional authors not shown)
Abstract:
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. PaLI-X advances the state-of-the-art on most vision-and-language benchmarks considered (25+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
Submitted 29 May, 2023;
originally announced May 2023.
-
Scaling Vision Transformers to 22 Billion Parameters
Authors:
Mostafa Dehghani,
Josip Djolonga,
Basil Mustafa,
Piotr Padlewski,
Jonathan Heek,
Justin Gilmer,
Andreas Steiner,
Mathilde Caron,
Robert Geirhos,
Ibrahim Alabdulmohsin,
Rodolphe Jenatton,
Lucas Beyer,
Michael Tschannen,
Anurag Arnab,
Xiao Wang,
Carlos Riquelme,
Matthias Minderer,
Joan Puigcerver,
Utku Evci,
Manoj Kumar,
Sjoerd van Steenkiste,
Gamaleldin F. Elsayed,
Aravindh Mahendran,
Fisher Yu,
Avital Oliver, et al. (17 additional authors not shown)
Abstract:
The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between fairness and performance, state-of-the-art alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.
Submitted 10 February, 2023;
originally announced February 2023.
-
FlexiViT: One Model for All Patch Sizes
Authors:
Lucas Beyer,
Pavel Izmailov,
Alexander Kolesnikov,
Mathilde Caron,
Simon Kornblith,
Xiaohua Zhai,
Matthias Minderer,
Michael Tschannen,
Ibrahim Alabdulmohsin,
Filip Pavetic
Abstract:
Vision Transformers convert images to sequences by slicing them into patches. The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost, but changing the patch size typically requires retraining the model. In this paper, we demonstrate that simply randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes, making it possible to tailor the model to different compute budgets at deployment time. We extensively evaluate the resulting model, which we call FlexiViT, on a wide range of tasks, including classification, image-text retrieval, open-world detection, panoptic segmentation, and semantic segmentation, concluding that it usually matches, and sometimes outperforms, standard ViT models trained at a single patch size in an otherwise identical setup. Hence, FlexiViT training is a simple drop-in improvement for ViT that makes it easy to add compute-adaptive capabilities to most models relying on a ViT backbone architecture. Code and pre-trained models are available at https://github.com/google-research/big_vision
Submitted 23 March, 2023; v1 submitted 15 December, 2022;
originally announced December 2022.
-
Decoder Denoising Pretraining for Semantic Segmentation
Authors:
Emmanuel Brempong Asiedu,
Simon Kornblith,
Ting Chen,
Niki Parmar,
Matthias Minderer,
Mohammad Norouzi
Abstract:
Semantic segmentation labels are expensive and time consuming to acquire. Hence, pretraining is commonly used to improve the label-efficiency of segmentation models. Typically, the encoder of a segmentation model is pretrained as a classifier and the decoder is randomly initialized. Here, we argue that random initialization of the decoder can be suboptimal, especially when few labeled examples are available. We propose a decoder pretraining approach based on denoising, which can be combined with supervised pretraining of the encoder. We find that decoder denoising pretraining on the ImageNet dataset strongly outperforms encoder-only supervised pretraining. Despite its simplicity, decoder denoising pretraining achieves state-of-the-art results on label-efficient semantic segmentation and offers considerable gains on the Cityscapes, Pascal Context, and ADE20K datasets.
Submitted 23 May, 2022;
originally announced May 2022.
-
Simple Open-Vocabulary Object Detection with Vision Transformers
Authors:
Matthias Minderer,
Alexey Gritsenko,
Austin Stone,
Maxim Neumann,
Dirk Weissenborn,
Alexey Dosovitskiy,
Aravindh Mahendran,
Anurag Arnab,
Mostafa Dehghani,
Zhuoran Shen,
Xiao Wang,
Xiaohua Zhai,
Thomas Kipf,
Neil Houlsby
Abstract:
Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling properties of this setup shows that increasing image-level pre-training and model size yield consistent improvements on the downstream detection task. We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection. Code and models are available on GitHub.
Submitted 20 July, 2022; v1 submitted 12 May, 2022;
originally announced May 2022.
-
SCENIC: A JAX Library for Computer Vision Research and Beyond
Authors:
Mostafa Dehghani,
Alexey Gritsenko,
Anurag Arnab,
Matthias Minderer,
Yi Tay
Abstract:
Scenic is an open-source JAX library with a focus on Transformer-based models for computer vision research and beyond. The goal of this toolkit is to facilitate rapid experimentation, prototyping, and research of new vision architectures and models. Scenic supports a diverse range of vision tasks (e.g., classification, segmentation, detection) and facilitates working on multi-modal problems, along with GPU/TPU support for multi-host, multi-device large-scale training. Scenic also offers optimized implementations of state-of-the-art research models spanning a wide range of modalities. Scenic has been successfully used for numerous projects and published papers and continues serving as the library of choice for quick prototyping and publication of new research ideas.
Submitted 18 October, 2021;
originally announced October 2021.
-
Revisiting the Calibration of Modern Neural Networks
Authors:
Matthias Minderer,
Josip Djolonga,
Rob Romijnders,
Frances Hubis,
Xiaohua Zhai,
Neil Houlsby,
Dustin Tran,
Mario Lucic
Abstract:
Accurate estimation of predictive uncertainty (model calibration) is essential for the safe application of neural networks. Many instances of miscalibration in modern neural networks have been reported, suggesting a trend that newer, more accurate models produce poorly calibrated predictions. Here, we revisit this question for recent state-of-the-art image classification models. We systematically relate model calibration and accuracy, and find that the most recent models, notably those not using convolutions, are among the best calibrated. Trends observed in prior model generations, such as decay of calibration with distribution shift or model size, are less pronounced in recent architectures. We also show that model size and amount of pretraining do not fully explain these differences, suggesting that architecture is a major determinant of calibration properties.
Submitted 26 October, 2021; v1 submitted 15 June, 2021;
originally announced June 2021.
-
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Authors:
Alexey Dosovitskiy,
Lucas Beyer,
Alexander Kolesnikov,
Dirk Weissenborn,
Xiaohua Zhai,
Thomas Unterthiner,
Mostafa Dehghani,
Matthias Minderer,
Georg Heigold,
Sylvain Gelly,
Jakob Uszkoreit,
Neil Houlsby
Abstract:
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
Submitted 3 June, 2021; v1 submitted 22 October, 2020;
originally announced October 2020.
-
On Robustness and Transferability of Convolutional Neural Networks
Authors:
Josip Djolonga,
Jessica Yung,
Michael Tschannen,
Rob Romijnders,
Lucas Beyer,
Alexander Kolesnikov,
Joan Puigcerver,
Matthias Minderer,
Alexander D'Amour,
Dan Moldovan,
Sylvain Gelly,
Neil Houlsby,
Xiaohua Zhai,
Mario Lucic
Abstract:
Modern deep convolutional networks (CNNs) are often criticized for not generalizing under distributional shifts. However, several recent breakthroughs in transfer learning suggest that these networks can cope with severe distribution shifts and successfully adapt to new tasks from a few training examples. In this work we study the interplay between out-of-distribution and transfer performance of modern image classification CNNs for the first time and investigate the impact of the pre-training data size, the model scale, and the data preprocessing pipeline. We find that increasing both the training set and model sizes significantly improves the distributional shift robustness. Furthermore, we show that, perhaps surprisingly, simple changes in the preprocessing such as modifying the image resolution can significantly mitigate robustness issues in some cases. Finally, we outline the shortcomings of existing robustness evaluation datasets and introduce a synthetic dataset SI-Score we use for a systematic analysis across factors of variation common in visual data such as object size and position.
Submitted 23 March, 2021; v1 submitted 16 July, 2020;
originally announced July 2020.
-
Automatic Shortcut Removal for Self-Supervised Representation Learning
Authors:
Matthias Minderer,
Olivier Bachem,
Neil Houlsby,
Michael Tschannen
Abstract:
In self-supervised visual representation learning, a feature extractor is trained on a "pretext task" for which labels can be generated cheaply, without human annotation. A central challenge in this approach is that the feature extractor quickly learns to exploit low-level visual features such as color aberrations or watermarks and then fails to learn useful semantic representations. Much work has gone into identifying such "shortcut" features and hand-designing schemes to reduce their effect. Here, we propose a general framework for mitigating the effect of shortcut features. Our key assumption is that those features which are the first to be exploited for solving the pretext task may also be the most vulnerable to an adversary trained to make the task harder. We show that this assumption holds across common pretext tasks and datasets by training a "lens" network to make small image changes that maximally reduce performance in the pretext task. Representations learned with the modified images outperform those learned without in all tested cases. Additionally, the modifications made by the lens reveal how the choice of pretext task and dataset affects the features learned by self-supervision.
Submitted 30 June, 2020; v1 submitted 20 February, 2020;
originally announced February 2020.
-
Unsupervised Learning of Object Structure and Dynamics from Videos
Authors:
Matthias Minderer,
Chen Sun,
Ruben Villegas,
Forrester Cole,
Kevin Murphy,
Honglak Lee
Abstract:
Extracting and predicting object structure and dynamics from videos without supervision is a major challenge in machine learning. To address this challenge, we adopt a keypoint-based image representation and learn a stochastic dynamics model of the keypoints. Future frames are reconstructed from the keypoints and a reference frame. By modeling dynamics in the keypoint coordinate space, we achieve stable learning and avoid compounding of errors in pixel space. Our method improves upon unstructured representations both for pixel-level video prediction and for downstream tasks requiring object-level understanding of motion dynamics. We evaluate our model on diverse datasets: a multi-agent sports dataset, the Human3.6M dataset, and datasets based on continuous control tasks from the DeepMind Control Suite. The spatially structured representation outperforms unstructured representations on a range of motion-related tasks such as object tracking, action recognition and reward prediction.
Submitted 2 March, 2020; v1 submitted 18 June, 2019;
originally announced June 2019.