Showing 1–29 of 29 results for author: Mensink, T

  1. arXiv:2406.04103  [pdf, other]

    cs.LG cs.AI cs.CV cs.NE

    Multistep Distillation of Diffusion Models via Moment Matching

    Authors: Tim Salimans, Thomas Mensink, Jonathan Heek, Emiel Hoogeboom

    Abstract: We present a new method for making diffusion models faster to sample. The method distills many-step diffusion models into few-step models by matching conditional expectations of the clean data given noisy data along the sampling trajectory. Our approach extends recently proposed one-step methods to the multi-step case, and provides a new perspective by interpreting these approaches in terms of mom…

    Submitted 6 June, 2024; originally announced June 2024.

  2. arXiv:2404.05465  [pdf, other]

    cs.CV cs.LG

    HAMMR: HierArchical MultiModal React agents for generic VQA

    Authors: Lluis Castrejon, Thomas Mensink, Howard Zhou, Vittorio Ferrari, Andre Araujo, Jasper Uijlings

    Abstract: Combining Large Language Models (LLMs) with external specialized tools (LLMs+tools) is a recent paradigm to solve multimodal tasks such as Visual Question Answering (VQA). While this approach was demonstrated to work well when optimized and evaluated for each individual benchmark, in practice it is crucial for the next generation of real-world AI systems to handle a broad range of multimodal probl…

    Submitted 8 April, 2024; originally announced April 2024.

  3. arXiv:2310.06641  [pdf, other]

    cs.CV

    How (not) to ensemble LVLMs for VQA

    Authors: Lisa Alazraki, Lluis Castrejon, Mostafa Dehghani, Fantine Huot, Jasper Uijlings, Thomas Mensink

    Abstract: This paper studies ensembling in the era of Large Vision-Language Models (LVLMs). Ensembling is a classical method to combine different models to get increased performance. In the recent work on Encyclopedic-VQA the authors examine a wide variety of models to solve their task: from vanilla LVLMs, to models including the caption as extra context, to models augmented with Lens-based retrieval of Wik…

    Submitted 7 December, 2023; v1 submitted 10 October, 2023; originally announced October 2023.

    Comments: 4th I Can't Believe It's Not Better Workshop (co-located with NeurIPS 2023)

  4. arXiv:2306.09224  [pdf, other]

    cs.CV

    Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories

    Authors: Thomas Mensink, Jasper Uijlings, Lluis Castrejon, Arushi Goel, Felipe Cadar, Howard Zhou, Fei Sha, André Araujo, Vittorio Ferrari

    Abstract: We propose Encyclopedic-VQA, a large scale visual question answering (VQA) dataset featuring visual questions about detailed properties of fine-grained categories and instances. It contains 221k unique question+answer pairs each matched with (up to) 5 images, resulting in a total of 1M VQA samples. Moreover, our dataset comes with a controlled knowledge base derived from Wikipedia, marking the evi…

    Submitted 24 July, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

    Comments: ICCV'23

  5. arXiv:2305.10293  [pdf, other]

    cs.CV cs.LG

    Infinite Class Mixup

    Authors: Thomas Mensink, Pascal Mettes

    Abstract: Mixup is a widely adopted strategy for training deep networks, where additional samples are augmented by interpolating inputs and labels of training pairs. Mixup has shown to improve classification performance, network calibration, and out-of-distribution generalisation. While effective, a cornerstone of Mixup, namely that networks learn linear behaviour patterns between classes, is only indirectl…

    Submitted 6 September, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

    Comments: BMVC 2023

  6. arXiv:2302.05442  [pdf, other]

    cs.CV cs.AI cs.LG

    Scaling Vision Transformers to 22 Billion Parameters

    Authors: Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin F. Elsayed, Aravindh Mahendran, Fisher Yu, Avital Oliver, et al. (17 additional authors not shown)

    Abstract: The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al…

    Submitted 10 February, 2023; originally announced February 2023.

  7. arXiv:2206.04453  [pdf, other]

    cs.CV

    The Missing Link: Finding label relations across datasets

    Authors: Jasper Uijlings, Thomas Mensink, Vittorio Ferrari

    Abstract: Computer vision is driven by the many datasets available for training or evaluating novel methods. However, each dataset has a different set of class labels, visual definition of classes, images following a specific distribution, annotation protocols, etc. In this paper we explore the automatic discovery of visual-semantic relations between labels across datasets. We aim to understand how instance…

    Submitted 9 August, 2022; v1 submitted 9 June, 2022; originally announced June 2022.

    Comments: ECCV 2022

  8. arXiv:2204.01403  [pdf, other]

    cs.CV

    How stable are Transferability Metrics evaluations?

    Authors: Andrea Agostinelli, Michal Pándy, Jasper Uijlings, Thomas Mensink, Vittorio Ferrari

    Abstract: Transferability metrics is a maturing field with increasing interest, which aims at providing heuristics for selecting the most suitable source models to transfer to a given target dataset, without fine-tuning them all. However, existing works rely on custom experimental setups which differ across papers, leading to inconsistent conclusions about which transferability metrics work best. In this pa…

    Submitted 20 October, 2022; v1 submitted 4 April, 2022; originally announced April 2022.

    Comments: ECCV 2022

  9. arXiv:2111.13011  [pdf, other]

    cs.CV

    Transferability Metrics for Selecting Source Model Ensembles

    Authors: Andrea Agostinelli, Jasper Uijlings, Thomas Mensink, Vittorio Ferrari

    Abstract: We address the problem of ensemble selection in transfer learning: Given a large pool of source models we want to select an ensemble of models which, after fine-tuning on the target training set, yields the best performance on the target test set. Since fine-tuning all possible ensembles is computationally prohibitive, we aim at predicting performance on the target dataset using a computationally…

    Submitted 31 March, 2022; v1 submitted 25 November, 2021; originally announced November 2021.

  10. arXiv:2111.12780  [pdf, other]

    cs.CV

    Transferability Estimation using Bhattacharyya Class Separability

    Authors: Michal Pándy, Andrea Agostinelli, Jasper Uijlings, Vittorio Ferrari, Thomas Mensink

    Abstract: Transfer learning has become a popular method for leveraging pre-trained models in computer vision. However, without performing computationally expensive fine-tuning, it is difficult to quantify which pre-trained source models are suitable for a specific target task, or, conversely, to which tasks a pre-trained source model can be easily adapted to. In this work, we propose Gaussian Bhattacharyya…

    Submitted 11 April, 2022; v1 submitted 24 November, 2021; originally announced November 2021.

    Comments: Accepted for CVPR 2022

  11. arXiv:2103.13318  [pdf, other]

    cs.CV

    Factors of Influence for Transfer Learning across Diverse Appearance Domains and Task Types

    Authors: Thomas Mensink, Jasper Uijlings, Alina Kuznetsova, Michael Gygli, Vittorio Ferrari

    Abstract: Transfer learning enables to re-use knowledge learned on a source task to help learning a target task. A simple form of transfer learning is common in current state-of-the-art computer vision models, i.e. pre-training a model for image classification on the ILSVRC dataset, and then fine-tune on any target task. However, previous systematic studies of transfer learning have been limited and the cir…

    Submitted 20 November, 2021; v1 submitted 24 March, 2021; originally announced March 2021.

    Comments: Accepted for future publication in TPAMI

  12. arXiv:2011.04389  [pdf, other]

    cs.CV

    EDEN: Multimodal Synthetic Dataset of Enclosed GarDEN Scenes

    Authors: Hoang-An Le, Thomas Mensink, Partha Das, Sezer Karaoglu, Theo Gevers

    Abstract: Multimodal large-scale datasets for outdoor scenes are mostly designed for urban driving problems. The scenes are highly structured and semantically different from scenarios seen in nature-centered scenes such as gardens or parks. To promote machine learning methods for nature-oriented applications, such as agriculture and gardening, we propose the multimodal synthetic dataset for Enclosed garDEN…

    Submitted 10 November, 2020; v1 submitted 9 November, 2020; originally announced November 2020.

    Comments: Accepted for publishing at WACV 2021

  13. arXiv:2009.08321  [pdf, other]

    cs.CV

    Novel View Synthesis from Single Images via Point Cloud Transformation

    Authors: Hoang-An Le, Thomas Mensink, Partha Das, Theo Gevers

    Abstract: In this paper the argument is made that for true novel view synthesis of objects, where the object can be synthesized from any viewpoint, an explicit 3D shape representation is desired. Our method estimates point clouds to capture the geometry of the object, which can be freely rotated into the desired view and then projected into a new image. This image, however, is sparse by nature and hence this…

    Submitted 18 September, 2020; v1 submitted 17 September, 2020; originally announced September 2020.

    Comments: Accepted at British Machine Vision Conference 2020

  14. arXiv:2009.01717  [pdf, other]

    cs.CV cs.AI

    Multi-Loss Weighting with Coefficient of Variations

    Authors: Rick Groenendijk, Sezer Karaoglu, Theo Gevers, Thomas Mensink

    Abstract: Many interesting tasks in machine learning and computer vision are learned by optimising an objective function defined as a weighted linear combination of multiple losses. The final performance is sensitive to choosing the correct (relative) weights for these losses. Finding a good set of weights is often done by adopting them into the set of hyper-parameters, which are set using an extensive grid…

    Submitted 10 November, 2020; v1 submitted 3 September, 2020; originally announced September 2020.

    Comments: Paper was accepted at the IEEE Winter Conference on Applications of Computer Vision 2021 (WACV2021)

    MSC Class: 68T45; ACM Class: I.4

  15. arXiv:2008.06374  [pdf, other]

    cs.CV

    PointMixup: Augmentation for Point Clouds

    Authors: Yunlu Chen, Vincent Tao Hu, Efstratios Gavves, Thomas Mensink, Pascal Mettes, Pengwan Yang, Cees G. M. Snoek

    Abstract: This paper introduces data augmentation for point clouds by interpolation between examples. Data augmentation by interpolation has shown to be a simple and effective approach in the image domain. Such a mixup is however not directly transferable to point clouds, as we do not have a one-to-one correspondence between the points of two different objects. In this paper, we define data augmentation bet…

    Submitted 14 August, 2020; originally announced August 2020.

    Comments: Accepted as Spotlight presentation at European Conference on Computer Vision (ECCV), 2020

  16. arXiv:2006.12807  [pdf, other]

    cs.LG cs.CV stat.ML

    Post-hoc Calibration of Neural Networks by g-Layers

    Authors: Amir Rahimi, Thomas Mensink, Kartik Gupta, Thalaiyasingam Ajanthan, Cristian Sminchisescu, Richard Hartley

    Abstract: Calibration of neural networks is a critical aspect to consider when incorporating machine learning models in real-world decision-making systems where the confidence of decisions are equally important as the decisions themselves. In recent years, there is a surge of research on neural network calibration and the majority of the works can be categorized into post-hoc calibration methods, defined as…

    Submitted 21 February, 2022; v1 submitted 23 June, 2020; originally announced June 2020.

  17. arXiv:2006.12800  [pdf, other]

    cs.LG cs.CV stat.ML

    Calibration of Neural Networks using Splines

    Authors: Kartik Gupta, Amir Rahimi, Thalaiyasingam Ajanthan, Thomas Mensink, Cristian Sminchisescu, Richard Hartley

    Abstract: Calibrating neural networks is of utmost importance when employing them in safety-critical applications where the downstream decision making depends on the predicted probabilities. Measuring calibration error amounts to comparing two empirical distributions. In this work, we introduce a binning-free calibration measure inspired by the classical Kolmogorov-Smirnov (KS) statistical test in which the…

    Submitted 29 December, 2021; v1 submitted 23 June, 2020; originally announced June 2020.

    Comments: ICLR 2021

  18. arXiv:2005.09927  [pdf, other]

    cs.CV cs.LG cs.RO

    Range Conditioned Dilated Convolutions for Scale Invariant 3D Object Detection

    Authors: Alex Bewley, Pei Sun, Thomas Mensink, Dragomir Anguelov, Cristian Sminchisescu

    Abstract: This paper presents a novel 3D object detection framework that processes LiDAR data directly on its native representation: range images. Benefiting from the compactness of range images, 2D convolutions can efficiently process dense LiDAR data of a scene. To overcome scale sensitivity in this perspective view, a novel range-conditioned dilation (RCD) layer is proposed to dynamically adjust a contin…

    Submitted 22 January, 2021; v1 submitted 20 May, 2020; originally announced May 2020.

    Comments: CoRL 2020

  19. On the Benefit of Adversarial Training for Monocular Depth Estimation

    Authors: Rick Groenendijk, Sezer Karaoglu, Theo Gevers, Thomas Mensink

    Abstract: In this paper we address the benefit of adding adversarial training to the task of monocular depth estimation. A model can be trained in a self-supervised setting on stereo pairs of images, where depth (disparities) are an intermediate result in a right-to-left image reconstruction pipeline. For the quality of the image reconstruction and disparity prediction, a combination of different losses is…

    Submitted 29 October, 2019; originally announced October 2019.

    Comments: 11 pages, 8 tables, 5 figures, accepted at CVIU

    MSC Class: 68T45

  20. arXiv:1910.01460  [pdf, other]

    cs.CV

    3D Neighborhood Convolution: Learning Depth-Aware Features for RGB-D and RGB Semantic Segmentation

    Authors: Yunlu Chen, Thomas Mensink, Efstratios Gavves

    Abstract: A key challenge for RGB-D segmentation is how to effectively incorporate 3D geometric information from the depth channel into 2D appearance features. We propose to model the effective receptive field of 2D convolution based on the scale and locality from the 3D neighborhood. Standard convolutions are local in the image space ($u, v$), often with a fixed receptive field of 3x3 pixels. We propose to…

    Submitted 3 October, 2019; originally announced October 2019.

  21. Automatic Generation of Dense Non-rigid Optical Flow

    Authors: Hoàng-Ân Lê, Tushar Nimbhorkar, Thomas Mensink, Anil S. Baslamisli, Sezer Karaoglu, Theo Gevers

    Abstract: There hardly exists any large-scale datasets with dense optical flow of non-rigid motion from real-world imagery as of today. The reason lies mainly in the required setup to derive ground truth optical flows: a series of images with known camera poses along its trajectory, and an accurate 3D model from a textured scene. Human annotation is not only too tedious for large databases, it can simply ha…

    Submitted 7 September, 2021; v1 submitted 5 December, 2018; originally announced December 2018.

    Comments: The paper is accepted for publication for Computer Vision and Image Understanding (CVIU)

    Journal ref: Volume 212, November 2021, 103274

  22. arXiv:1807.07473  [pdf, other]

    cs.CV

    Three for one and one for three: Flow, Segmentation, and Surface Normals

    Authors: Hoang-An Le, Anil S. Baslamisli, Thomas Mensink, Theo Gevers

    Abstract: Optical flow, semantic segmentation, and surface normals represent different information modalities, yet together they bring better cues for scene understanding problems. In this paper, we study the influence between the three modalities: how one impacts on the others and their efficiency in combination. We employ a modular approach using a convolutional refinement network which is trained supervi…

    Submitted 19 July, 2018; originally announced July 2018.

    Comments: BMVC 2018

  23. IterGANs: Iterative GANs to Learn and Control 3D Object Transformation

    Authors: Ysbrand Galama, Thomas Mensink

    Abstract: We are interested in learning visual representations which allow for 3D manipulations of visual objects based on a single 2D image. We cast this into an image-to-image transformation task, and propose Iterative Generative Adversarial Networks (IterGANs) which iteratively transform an input image into an output image. Our models learn a visual representation that can be used for objects seen in tra…

    Submitted 4 September, 2019; v1 submitted 16 April, 2018; originally announced April 2018.

  24. arXiv:1801.10253  [pdf, other]

    cs.CL cs.IR cs.MM

    The New Modality: Emoji Challenges in Prediction, Anticipation, and Retrieval

    Authors: Spencer Cappallo, Stacey Svetlichnaya, Pierre Garrigues, Thomas Mensink, Cees G. M. Snoek

    Abstract: Over the past decade, emoji have emerged as a new and widespread form of digital communication, spanning diverse social networks and spoken languages. We propose to treat these ideograms as a new modality in their own right, distinct in their semantic structure from both the text in which they are often embedded as well as the images which they resemble. As a new modality, emoji present rich novel…

    Submitted 2 February, 2018; v1 submitted 30 January, 2018; originally announced January 2018.

  25. arXiv:1612.06753  [pdf, other]

    cs.IR cs.MM

    Video Stream Retrieval of Unseen Queries using Semantic Memory

    Authors: Spencer Cappallo, Thomas Mensink, Cees G. M. Snoek

    Abstract: Retrieval of live, user-broadcast video streams is an under-addressed and increasingly relevant challenge. The on-line nature of the problem requires temporal evaluation and the unforeseeable scope of potential queries motivates an approach which can accommodate arbitrary search queries. To account for the breadth of possible queries, we adopt a no-example approach to query retrieval, which uses a…

    Submitted 20 December, 2016; originally announced December 2016.

    Comments: Presented at BMVC 2016, British Machine Vision Conference, 2016

  26. arXiv:1604.02275  [pdf, other]

    cs.CV cs.LG stat.ML

    Online Open World Recognition

    Authors: Rocco De Rosa, Thomas Mensink, Barbara Caputo

    Abstract: As we enter into the big data age and an avalanche of images have become readily available, recognition systems face the need to move from close, lab settings where the number of classes and training data are fixed, to dynamic scenarios where the number of categories to be recognized grows continuously over time, as well as new data providing useful information to update the system. Recent attempt…

    Submitted 8 April, 2016; originally announced April 2016.

    Comments: Keywords: Open world recognition, Open set, Incremental Learning, Metric Learning, Nonparametric methods, Classification confidence

  27. arXiv:1511.02492  [pdf, other]

    cs.CV cs.MM

    VideoStory Embeddings Recognize Events when Examples are Scarce

    Authors: Amirhossein Habibian, Thomas Mensink, Cees G. M. Snoek

    Abstract: This paper aims for event recognition when video examples are scarce or even completely absent. The key in such a challenging setting is a semantic video representation. Rather than building the representation from individual attribute detectors and their annotations, we propose to learn the entire representation from freely available web videos and their descriptions using an embedding between vi…

    Submitted 8 November, 2015; originally announced November 2015.

  28. arXiv:1510.06939  [pdf, other]

    cs.CV

    Objects2action: Classifying and localizing actions without any video example

    Authors: Mihir Jain, Jan C. van Gemert, Thomas Mensink, Cees G. M. Snoek

    Abstract: The goal of this paper is to recognize actions in video without the need for examples. Different from traditional zero-shot approaches we do not demand the design and specification of attribute classifiers and class-to-attribute mappings to allow for transfer from seen classes to unseen classes. Our key contribution is objects2action, a semantic word embedding that is spanned by a skip-gram model…

    Submitted 23 October, 2015; originally announced October 2015.

  29. arXiv:1510.01544  [pdf, other]

    cs.CV

    Active Transfer Learning with Zero-Shot Priors: Reusing Past Datasets for Future Tasks

    Authors: Efstratios Gavves, Thomas Mensink, Tatiana Tommasi, Cees G. M. Snoek, Tinne Tuytelaars

    Abstract: How can we reuse existing knowledge, in the form of available datasets, when solving a new and apparently unrelated target task from a set of unlabeled data? In this work we make a first contribution to answer this question in the context of image classification. We frame this quest as an active learning problem and use zero-shot classifiers to guide the learning process by linking the new task to…

    Submitted 6 October, 2015; originally announced October 2015.