-
A Mamba-based Siamese Network for Remote Sensing Change Detection
Authors:
Jay N. Paranjape,
Celso de Melo,
Vishal M. Patel
Abstract:
Change detection in remote sensing images is an essential tool for analyzing a region at different times. It finds varied applications in monitoring environmental changes, man-made changes as well as corresponding decision-making and prediction of future trends. Deep learning methods like Convolutional Neural Networks (CNNs) and Transformers have achieved remarkable success in detecting significant changes, given two images at different times. In this paper, we propose a Mamba-based Change Detector (M-CD) that segments out the regions of interest even better. Mamba-based architectures demonstrate linear-time training capabilities and an improved receptive field over transformers. Our experiments on four widely used change detection datasets demonstrate significant improvements over existing state-of-the-art (SOTA) methods. Our code and pre-trained models are available at https://github.com/JayParanjape/M-CD
Submitted 8 July, 2024;
originally announced July 2024.
-
ViLCo-Bench: VIdeo Language COntinual learning Benchmark
Authors:
Tianqi Tang,
Shohreh Deldari,
Hao Xue,
Celso De Melo,
Flora D. Salim
Abstract:
Video language continual learning involves continuously adapting to information from video and text inputs, enhancing a model's ability to handle new tasks while retaining prior knowledge. This field is a relatively under-explored area, and establishing appropriate datasets is crucial for facilitating communication and research in this field. In this study, we present the first dedicated benchmark, ViLCo-Bench, designed to evaluate continual learning models across a range of video-text tasks. The dataset comprises ten-minute-long videos and corresponding language queries collected from publicly available datasets. Additionally, we introduce a novel memory-efficient framework that incorporates self-supervised learning and mimics long-term and short-term memory effects. This framework addresses challenges including memory complexity from long video clips, natural language complexity from open queries, and text-video misalignment. We posit that ViLCo-Bench, with greater complexity compared to existing continual learning benchmarks, would serve as a critical tool for exploring the video-language domain, extending beyond conventional class-incremental tasks, and addressing complex and limited annotation issues. The curated data, evaluations, and our novel method are available at https://github.com/cruiseresearchgroup/ViLCo .
Submitted 18 June, 2024;
originally announced June 2024.
-
WeatherProof: Leveraging Language Guidance for Semantic Segmentation in Adverse Weather
Authors:
Blake Gella,
Howard Zhang,
Rishi Upadhyay,
Tiffany Chang,
Nathan Wei,
Matthew Waliman,
Yunhao Ba,
Celso de Melo,
Alex Wong,
Achuta Kadambi
Abstract:
We propose a method to infer semantic segmentation maps from images captured under adverse weather conditions. We begin by examining existing models on images degraded by weather conditions such as rain, fog, or snow, and find that they exhibit a large performance drop compared to images captured under clear weather. To control for changes in scene structures, we propose WeatherProof, the first semantic segmentation dataset with accurate clear and adverse weather image pairs that share an underlying scene. Through this dataset, we analyze the error modes in existing models and find that they are sensitive to the highly complex combination of different weather effects induced on the image during capture. To improve robustness, we propose a way to use language as guidance by identifying contributions of adverse weather conditions and injecting that as "side information". Models trained using our language guidance exhibit performance gains of up to 10.2% in mIoU on WeatherProof, up to 8.44% in mIoU on the widely used ACDC dataset compared to standard training techniques, and up to 6.21% in mIoU on the ACDC dataset compared to previous SOTA methods.
Submitted 7 May, 2024; v1 submitted 21 March, 2024;
originally announced March 2024.
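The WeatherProof abstract above reports its gains in mIoU (mean intersection over union). As a rough illustration of how that metric is typically computed, here is a minimal sketch over flattened label lists. The function name `mean_iou` and the choice to skip classes absent from both prediction and target are assumptions for this example; benchmark toolkits usually accumulate counts over an entire dataset and may handle absent classes differently.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for two label sequences.

    Assumption: classes that appear in neither `pred` nor `target`
    are skipped instead of being counted as IoU = 0.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        # Pixels where both agree on class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either one says class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class is present in pred or target")
    return sum(ious) / len(ious)


# Toy example: six "pixels", three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(round(score, 3))
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5.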
-
Entropic Open-set Active Learning
Authors:
Bardia Safaei,
Vibashan VS,
Celso M. de Melo,
Vishal M. Patel
Abstract:
Active Learning (AL) aims to enhance the performance of deep models by selecting the most informative samples for annotation from a pool of unlabeled data. Despite impressive performance in closed-set settings, most AL methods fail in real-world scenarios where the unlabeled data contains unknown categories. Recently, a few studies have attempted to tackle the AL problem for the open-set setting. However, these methods focus more on selecting known samples and do not efficiently utilize unknown samples obtained during AL rounds. In this work, we propose an Entropic Open-set AL (EOAL) framework which leverages both known and unknown distributions effectively to select informative samples during AL rounds. Specifically, our approach employs two different entropy scores. One measures the uncertainty of a sample with respect to the known-class distributions. The other measures the uncertainty of the sample with respect to the unknown-class distributions. By utilizing these two entropy scores we effectively separate the known and unknown samples from the unlabeled data resulting in better sampling. Through extensive experiments, we show that the proposed method outperforms existing state-of-the-art methods on CIFAR-10, CIFAR-100, and TinyImageNet datasets. Code is available at https://github.com/bardisafa/EOAL.
Submitted 21 December, 2023;
originally announced December 2023.
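The EOAL abstract above relies on entropy scores to quantify a sample's uncertainty. The abstract does not define its two scores, so the following is only a generic sketch of the underlying idea: the Shannon entropy of a softmax distribution, normalized to [0, 1]. The function names and the example logits are hypothetical and are not taken from the EOAL code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def normalized_entropy(logits):
    """Entropy of softmax(logits) divided by its maximum, log(num_classes)."""
    probs = softmax(logits)
    return entropy(probs) / math.log(len(probs))


# A peaked prediction has low entropy; a flat one has maximal entropy.
confident = normalized_entropy([8.0, 0.5, 0.2, 0.1])
uncertain = normalized_entropy([1.0, 1.0, 1.0, 1.0])
print(confident < uncertain)
```

A selection rule could then rank unlabeled samples by such scores; how EOAL combines its known-class and unknown-class scores is specified in the paper, not here.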
-
Unsupervised Video Domain Adaptation with Masked Pre-Training and Collaborative Self-Training
Authors:
Arun Reddy,
William Paul,
Corban Rivera,
Ketul Shah,
Celso M. de Melo,
Rama Chellappa
Abstract:
In this work, we tackle the problem of unsupervised domain adaptation (UDA) for video action recognition. Our approach, which we call UNITE, uses an image teacher model to adapt a video student model to the target domain. UNITE first employs self-supervised pre-training to promote discriminative feature learning on target domain videos using a teacher-guided masked distillation objective. We then perform self-training on masked target data, using the video student model and image teacher model together to generate improved pseudolabels for unlabeled target videos. Our self-training process successfully leverages the strengths of both models to achieve strong transfer performance across domains. We evaluate our approach on multiple video domain adaptation benchmarks and observe significant improvements upon previously reported results.
Submitted 20 April, 2024; v1 submitted 5 December, 2023;
originally announced December 2023.
-
Guarding Barlow Twins Against Overfitting with Mixed Samples
Authors:
Wele Gedara Chaminda Bandara,
Celso M. De Melo,
Vishal M. Patel
Abstract:
Self-supervised Learning (SSL) aims to learn transferable feature representations for downstream applications without relying on labeled data. The Barlow Twins algorithm, renowned for its widespread adoption and straightforward implementation compared to its counterparts like contrastive learning methods, minimizes feature redundancy while maximizing invariance to common corruptions. Optimizing for the above objective forces the network to learn useful representations, while avoiding noisy or constant features, resulting in improved downstream task performance with limited adaptation. Despite Barlow Twins' proven effectiveness in pre-training, the underlying SSL objective can inadvertently cause feature overfitting due to the lack of strong interaction between the samples unlike the contrastive learning approaches. From our experiments, we observe that optimizing for the Barlow Twins objective doesn't necessarily guarantee sustained improvements in representation quality beyond a certain pre-training phase, and can potentially degrade downstream performance on some datasets. To address this challenge, we introduce Mixed Barlow Twins, which aims to improve sample interaction during Barlow Twins training via linearly interpolated samples. This results in an additional regularization term to the original Barlow Twins objective, assuming linear interpolation in the input space translates to linearly interpolated features in the feature space. Pre-training with this regularization effectively mitigates feature overfitting and further enhances the downstream performance on CIFAR-10, CIFAR-100, TinyImageNet, STL-10, and ImageNet datasets. The code and checkpoints are available at: https://github.com/wgcban/mix-bt.git
Submitted 4 December, 2023;
originally announced December 2023.
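The Mixed Barlow Twins abstract above rests on the assumption that linearly interpolating two inputs yields linearly interpolated features. The sketch below illustrates only that assumption, using a toy linear encoder for which it holds exactly; for a real nonlinear network it holds at best approximately, which is why the paper turns it into a regularization term. The names `mix` and `linear_encoder`, the weights, and the mixing coefficient are all hypothetical, and the actual regularizer is not reproduced here.

```python
def mix(x_a, x_b, lam):
    """Linear interpolation: lam * x_a + (1 - lam) * x_b, element-wise."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]


def linear_encoder(x, weights):
    """Toy encoder: a single matrix-vector product (no nonlinearity)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


weights = [[1.0, 0.5], [-2.0, 1.0]]  # hypothetical encoder weights
x_a, x_b, lam = [1.0, 2.0], [3.0, -1.0], 0.25

# Encode the mixed input ...
feat_of_mix = linear_encoder(mix(x_a, x_b, lam), weights)
# ... and compare with mixing the two encoded inputs.
mix_of_feats = mix(linear_encoder(x_a, weights), linear_encoder(x_b, weights), lam)

print(feat_of_mix, mix_of_feats)
```

For this linear encoder both vectors come out as [2.375, -5.25]; a training-time penalty on the gap between the two would be one way to encourage the same behavior in a deep network.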
-
ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning
Authors:
Qiao Gu,
Alihusein Kuwajerwala,
Sacha Morin,
Krishna Murthy Jatavallabhula,
Bipasha Sen,
Aditya Agarwal,
Corban Rivera,
William Paul,
Kirsty Ellis,
Rama Chellappa,
Chuang Gan,
Celso Miguel de Melo,
Joshua B. Tenenbaum,
Antonio Torralba,
Florian Shkurti,
Liam Paull
Abstract:
For robots to perform a wide variety of tasks, they require a 3D representation of the world that is semantically rich, yet compact and efficient for task-driven perception and planning. Recent approaches have attempted to leverage features from large vision-language models to encode semantics in 3D representations. However, these approaches tend to produce maps with per-point feature vectors, which do not scale well in larger environments, nor do they contain semantic spatial relationships between entities in the environment, which are useful for downstream planning. In this work, we propose ConceptGraphs, an open-vocabulary graph-structured representation for 3D scenes. ConceptGraphs is built by leveraging 2D foundation models and fusing their output to 3D by multi-view association. The resulting representations generalize to novel semantic classes, without the need to collect large 3D datasets or finetune models. We demonstrate the utility of this representation through a number of downstream planning tasks that are specified through abstract (language) prompts and require complex reasoning over spatial and semantic concepts. (Project page: https://concept-graphs.github.io/ Explainer video: https://youtu.be/mRhNkQwRYnc )
Submitted 28 September, 2023;
originally announced September 2023.
-
RobôCIn Small Size League Extended Team Description Paper for RoboCup 2023
Authors:
Aline Lima de Oliveira,
Cauê Addae da Silva Gomes,
Cecília Virginia Santos da Silva,
Charles Matheus de Sousa Alves,
Danilo Andrade Martins de Souza,
Driele Pires Ferreira Araújo Xavier,
Edgleyson Pereira da Silva,
Felipe Bezerra Martins,
Lucas Henrique Cavalcanti Santos,
Lucas Dias Maciel,
Matheus Paixão Gumercindo dos Santos,
Matheus Lafayette Vasconcelos,
Matheus Vinícius Teotonio do Nascimento Andrade,
João Guilherme Oliveira Carvalho de Melo,
João Pedro Souza Pereira de Moura,
José Ronald da Silva,
José Victor Silva Cruz,
Pedro Henrique Santana de Morais,
Pedro Paulo Salman de Oliveira,
Riei Joaquim Matos Rodrigues,
Roberto Costa Fernandes,
Ryan Vinicius Santos Morais,
Tamara Mayara Ramos Teobaldo,
Washington Igor dos Santos Silva,
Edna Natividade Silva Barros
Abstract:
RobôCIn has participated in RoboCup Small Size League since 2019, won its first world title in 2022 (Division B), and is currently a three-time Latin-American champion. This paper presents our improvements to defend the Small Size League (SSL) Division B title in RoboCup 2023 in Bordeaux, France. This paper aims to share some of the academic research that our team developed over the past year. Our team has successfully published two articles related to SSL at two high-impact conferences: the 25th RoboCup International Symposium and the 19th IEEE Latin American Robotics Symposium (LARS 2022). Over the last year, we have been continuously migrating from our past codebase to Unification. We will describe the new architecture implemented and some points of software and AI refactoring. In addition, we discuss the process of integrating machined components into the mechanical system, our development for participating in the vision blackout challenge last year and what we are preparing for this year.
Submitted 19 July, 2023;
originally announced July 2023.
-
STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition
Authors:
Xiaoyu Zhu,
Po-Yao Huang,
Junwei Liang,
Celso M. de Melo,
Alexander Hauptmann
Abstract:
We study the problem of human action recognition using motion capture (MoCap) sequences. Unlike existing techniques that take multiple manual steps to derive standardized skeleton representations as model input, we propose a novel Spatial-Temporal Mesh Transformer (STMT) to directly model the mesh sequences. The model uses a hierarchical transformer with intra-frame off-set attention and inter-frame self-attention. The attention mechanism allows the model to freely attend between any two vertex patches to learn non-local relationships in the spatial-temporal domain. Masked vertex modeling and future frame prediction are used as two self-supervised tasks to fully activate the bi-directional and auto-regressive attention in our hierarchical transformer. The proposed method achieves state-of-the-art performance compared to skeleton-based and point-cloud-based models on common MoCap benchmarks. Code is available at https://github.com/zgzxy001/STMT.
Submitted 31 March, 2023;
originally announced March 2023.
-
Synthetic-to-Real Domain Adaptation for Action Recognition: A Dataset and Baseline Performances
Authors:
Arun V. Reddy,
Ketul Shah,
William Paul,
Rohita Mocharla,
Judy Hoffman,
Kapil D. Katyal,
Dinesh Manocha,
Celso M. de Melo,
Rama Chellappa
Abstract:
Human action recognition is a challenging problem, particularly when there is high variability in factors such as subject appearance, backgrounds and viewpoint. While deep neural networks (DNNs) have been shown to perform well on action recognition tasks, they typically require large amounts of high-quality labeled data to achieve robust performance across a variety of conditions. Synthetic data has shown promise as a way to avoid the substantial costs and potential ethical concerns associated with collecting and labeling enormous amounts of data in the real world. However, synthetic data may differ from real data in important ways. This phenomenon, known as domain shift, can limit the utility of synthetic data in robotics applications. To mitigate the effects of domain shift, substantial effort is being dedicated to the development of domain adaptation (DA) techniques. Yet, much remains to be understood about how best to develop these techniques. In this paper, we introduce a new dataset called Robot Control Gestures (RoCoG-v2). The dataset is composed of both real and synthetic videos from seven gesture classes, and is intended to support the study of synthetic-to-real domain shift for video-based action recognition. Our work expands upon existing datasets by focusing the action classes on gestures for human-robot teaming, as well as by enabling investigation of domain shift in both ground and aerial views. We present baseline results using state-of-the-art action recognition and domain adaptation algorithms and offer initial insight on tackling the synthetic-to-real and ground-to-air domain shifts.
Submitted 17 March, 2023;
originally announced March 2023.
-
AZTR: Aerial Video Action Recognition with Auto Zoom and Temporal Reasoning
Authors:
Xijun Wang,
Ruiqi Xian,
Tianrui Guan,
Celso M. de Melo,
Stephen M. Nogar,
Aniket Bera,
Dinesh Manocha
Abstract:
We propose a novel approach for aerial video action recognition. Our method is designed for videos captured using UAVs and can run on edge or mobile devices. We present a learning-based approach that uses customized auto zoom to automatically identify the human target and scale it appropriately. This makes it easier to extract the key features and reduces the computational overhead. We also present an efficient temporal reasoning algorithm to capture the action information along the spatial and temporal domains within a controllable computational cost. Our approach has been implemented and evaluated both on the desktop with high-end GPUs and on the low power Robotics RB5 Platform for robots and drones. In practice, we achieve 6.1-7.4% improvement over SOTA in Top-1 accuracy on the RoCoG-v2 dataset, 8.3-10.4% improvement on the UAV-Human dataset and 3.2% improvement on the Drone Action dataset.
Submitted 2 March, 2023;
originally announced March 2023.
-
ConceptFusion: Open-set Multimodal 3D Mapping
Authors:
Krishna Murthy Jatavallabhula,
Alihusein Kuwajerwala,
Qiao Gu,
Mohd Omama,
Tao Chen,
Alaa Maalouf,
Shuang Li,
Ganesh Iyer,
Soroush Saryazdi,
Nikhil Keetha,
Ayush Tewari,
Joshua B. Tenenbaum,
Celso Miguel de Melo,
Madhava Krishna,
Liam Paull,
Florian Shkurti,
Antonio Torralba
Abstract:
Building 3D maps of the environment is central to robot navigation, planning, and interaction with objects in a scene. Most existing approaches that integrate semantic concepts with 3D maps largely remain confined to the closed-set setting: they can only reason about a finite set of concepts, pre-defined at training time. Further, these maps can only be queried using class labels, or in recent work, using text prompts.
We address both these issues with ConceptFusion, a scene representation that is (i) fundamentally open-set, enabling reasoning beyond a closed set of concepts and (ii) inherently multimodal, enabling a diverse range of possible queries to the 3D map, from language, to images, to audio, to 3D geometry, all working in concert. ConceptFusion leverages the open-set capabilities of today's foundation models pre-trained on internet-scale data to reason about concepts across modalities such as natural language, images, and audio. We demonstrate that pixel-aligned open-set features can be fused into 3D maps via traditional SLAM and multi-view fusion approaches. This enables effective zero-shot spatial reasoning, not needing any additional training or finetuning, and retains long-tailed concepts better than supervised approaches, outperforming them by a margin of more than 40% on 3D IoU. We extensively evaluate ConceptFusion on a number of real-world datasets, simulated home environments, a real-world tabletop manipulation task, and an autonomous driving platform. We showcase new avenues for blending foundation models with 3D open-set multimodal mapping.
For more information, visit our project page https://concept-fusion.github.io or watch our 5-minute explainer video https://www.youtube.com/watch?v=rkXgws8fiDs
Submitted 23 October, 2023; v1 submitted 14 February, 2023;
originally announced February 2023.
-
Open-Set Automatic Target Recognition
Authors:
Bardia Safaei,
Vibashan VS,
Celso M. de Melo,
Shuowen Hu,
Vishal M. Patel
Abstract:
Automatic Target Recognition (ATR) is a category of computer vision algorithms which attempts to recognize targets on data obtained from different sensors. ATR algorithms are extensively used in real-world scenarios such as military and surveillance applications. Existing ATR algorithms are developed for traditional closed-set methods where training and testing have the same class distribution. Thus, these algorithms have not been robust to unknown classes not seen during the training phase, limiting their utility in real-world applications. To this end, we propose an Open-set Automatic Target Recognition framework where we enable open-set recognition capability for ATR algorithms. In addition, we introduce a plugin Category-aware Binary Classifier (CBC) module to effectively tackle unknown classes seen during inference. The proposed CBC module can be easily integrated with any existing ATR algorithms and can be trained in an end-to-end manner. Experimental results show that the proposed approach outperforms many open-set methods on the DSIAC and CIFAR-10 datasets. To the best of our knowledge, this is the first work to address the open-set classification problem for ATR algorithms. Source code is available at: https://github.com/bardisafa/Open-set-ATR.
Submitted 10 November, 2022;
originally announced November 2022.
-
The Impact of Partner Expressions on Felt Emotion in the Iterated Prisoner's Dilemma: An Event-level Analysis
Authors:
Maria Angelika-Nikita,
Celso M. de Melo,
Kazunori Terada,
Gale Lucas,
Jonathan Gratch
Abstract:
Social games like the prisoner's dilemma are often used to develop models of the role of emotion in social decision-making. Here we examine an understudied aspect of emotion in such games: how an individual's feelings are shaped by their partner's expressions. Prior research has tended to focus on other aspects of emotion. Research on felt-emotion has focused on how an individual's feelings shape how they treat their partner, or whether these feelings are authentically expressed. Research on expressed-emotion has focused on how an individual's decisions are shaped by their partner's expressions, without regard for whether these expressions actually evoke feelings. Here, we use computer-generated characters to examine how an individual's moment-to-moment feelings are shaped by (1) how they are treated by their partner and (2) what their partner expresses during this treatment. Surprisingly, we find that partner expressions are far more important than actions in determining self-reported feelings. In other words, our partner can behave in a selfish and exploitive way, but if they show a collaborative pattern of expressions, we will feel greater pleasure collaborating with them. These results also emphasize the importance of context in determining how someone will feel in response to an expression (i.e., knowing a partner is happy is insufficient; we must know what they are happy-at). We discuss the implications of this work for cognitive-system design, emotion theory, and methodological practice in affective computing.
Submitted 2 July, 2022;
originally announced July 2022.
-
On Structuring Functional Programs with Monoidal Profunctors
Authors:
Alexandre Garcia de Oliveira,
Mauro Jaskelioff,
Ana Cristina Vieira de Melo
Abstract:
We study monoidal profunctors as a tool to reason about and structure pure functional programs, both from a categorical perspective and as a Haskell implementation. From the categorical point of view we approach them as monoids in a certain monoidal category of profunctors. We study properties of this monoidal category and construct and implement the free monoidal profunctor. We study the relationship of the monoidal construction to optics, and introduce a promising generalization of the implementation which we illustrate by introducing effectful monoidal profunctors.
Submitted 2 July, 2022;
originally announced July 2022.
-
Not Just Streaks: Towards Ground Truth for Single Image Deraining
Authors:
Yunhao Ba,
Howard Zhang,
Ethan Yang,
Akira Suzuki,
Arnold Pfahnl,
Chethan Chinder Chandrappa,
Celso de Melo,
Suya You,
Stefano Soatto,
Alex Wong,
Achuta Kadambi
Abstract:
We propose a large-scale dataset of real-world rainy and clean image pairs and a method to remove degradations, induced by rain streaks and rain accumulation, from the image. As there exists no real-world dataset for deraining, current state-of-the-art methods rely on synthetic data and thus are limited by the sim2real domain gap; moreover, rigorous evaluation remains a challenge due to the absence of a real paired dataset. We fill this gap by collecting a real paired deraining dataset through meticulous control of non-rain variations. Our dataset enables paired training and quantitative evaluation for diverse real-world rain phenomena (e.g. rain streaks and rain accumulation). To learn a representation robust to rain phenomena, we propose a deep neural network that reconstructs the underlying scene by minimizing a rain-robust loss between rainy and clean images. Extensive experiments demonstrate that our model outperforms the state-of-the-art deraining methods on real rainy images under various conditions. Project website: https://visual.ee.ucla.edu/gt_rain.htm/.
Submitted 28 August, 2022; v1 submitted 21 June, 2022;
originally announced June 2022.
-
A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition
Authors:
R. Gnana Praveen,
Wheidima Carneiro de Melo,
Nasib Ullah,
Haseeb Aslam,
Osama Zeeshan,
Théo Denorme,
Marco Pedersoli,
Alessandro Koerich,
Simon Bacon,
Patrick Cardinal,
Eric Granger
Abstract:
Multimodal emotion recognition has recently gained much attention since it can leverage diverse and complementary relationships over multiple modalities (e.g., audio, visual, biosignals, etc.), and can provide some robustness to noisy modalities. Most state-of-the-art methods for audio-visual (A-V) fusion rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of A-V modalities. In this paper, we focus on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos. Specifically, we propose a joint cross-attention model that relies on the complementary relationships to extract the salient features across A-V modalities, allowing for accurate prediction of continuous values of valence and arousal. The proposed fusion model efficiently leverages the inter-modal relationships, while reducing the heterogeneity between the features. In particular, it computes the cross-attention weights based on correlation between the combined feature representation and individual modalities. By deploying the combined A-V feature representation into the cross-attention module, the performance of our fusion module improves significantly over the vanilla cross-attention module. Experimental results on validation-set videos from the AffWild2 dataset indicate that our proposed A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches. The code is available on GitHub: https://github.com/praveena2j/JointCrossAttentional-AV-Fusion.
Submitted 6 July, 2024; v1 submitted 28 March, 2022;
originally announced March 2022.
-
Facial Expression Analysis Using Decomposed Multiscale Spatiotemporal Networks
Authors:
Wheidima Carneiro de Melo,
Eric Granger,
Miguel Bordallo Lopez
Abstract:
Video-based analysis of facial expressions has been increasingly applied to infer health states of individuals, such as depression and pain. Among the existing approaches, deep learning models composed of structures for multiscale spatiotemporal processing have shown strong potential for encoding facial dynamics. However, such models have high computational complexity, which makes these solutions difficult to deploy. To address this issue, we introduce a new technique to decompose the extraction of multiscale spatiotemporal features. In particular, a building block structure called Decomposed Multiscale Spatiotemporal Network (DMSN) is presented along with three variants: DMSN-A, DMSN-B, and DMSN-C blocks. The DMSN-A block generates multiscale representations by analyzing spatiotemporal features at multiple temporal ranges, while the DMSN-B block analyzes spatiotemporal features at multiple ranges, and the DMSN-C block analyzes spatiotemporal features at multiple spatial sizes. Using these variants, we design our DMSN architecture, which can explore a variety of multiscale spatiotemporal features, favoring the adaptation to different facial behaviors. Our extensive experiments on challenging datasets show that the DMSN-C block is effective for depression detection, whereas the DMSN-A block is efficient for pain estimation. Results also indicate that our DMSN architecture provides a cost-effective solution for expressions that range from fewer facial variations over time, as in depression detection, to greater variations, as in pain estimation.
Submitted 21 March, 2022;
originally announced March 2022.
-
Covariance-free Partial Least Squares: An Incremental Dimensionality Reduction Method
Authors:
Artur Jordao,
Maiko Lie,
Victor Hugo Cunha de Melo,
William Robson Schwartz
Abstract:
Dimensionality reduction plays an important role in computer vision problems since it reduces computational cost and is often capable of yielding a more discriminative data representation. In this context, Partial Least Squares (PLS) has presented notable results in tasks such as image classification and neural network optimization. However, PLS is infeasible on large datasets, such as ImageNet, because it requires all the data to be in memory in advance, which is often impractical due to hardware limitations. Additionally, this requirement prevents us from employing PLS on streaming applications where the data are being continuously generated. Motivated by this, we propose a novel incremental PLS, named Covariance-free Incremental Partial Least Squares (CIPLS), which learns a low-dimensional representation of the data using a single sample at a time. In contrast to other state-of-the-art approaches, instead of adopting a partially-discriminative or SGD-based model, we extend Nonlinear Iterative Partial Least Squares (NIPALS) -- the standard algorithm used to compute PLS -- for incremental processing. Among the advantages of this approach are the preservation of discriminative information across all components, the possibility of employing its score matrices for feature selection, and its computational efficiency. We validate CIPLS on face verification and image classification tasks, where it outperforms several other incremental dimensionality reduction techniques. In the context of feature selection, CIPLS achieves results comparable to state-of-the-art techniques.
Submitted 10 November, 2020; v1 submitted 5 October, 2019;
originally announced October 2019.
-
The Information Processing Factory: Organization, Terminology, and Definitions
Authors:
Eberle A. Rambo,
Bryan Donyanavard,
Minjun Seo,
Florian Maurer,
Thawra Kadeed,
Caio B. de Melo,
Biswadip Maity,
Anmol Surhonne,
Andreas Herkersdorf,
Fadi Kurdahi,
Nikil Dutt,
Rolf Ernst
Abstract:
The Information Processing Factory (IPF) project has recently introduced the abstraction of complex architectures as self-aware information processing factories. These factories consist of a set of highly configurable resources, e.g., processing elements and interconnects, whose use is monitored, planned, and configured during runtime. Managing a factory involves multiple facets, such as efficiency, availability, reliability, integrity, and timing. IPF conquers the complexity of managing facets in digital systems by hierarchically decomposing the challenges and addressing them with different co-existing entities in the factory. This paper introduces the organization, terminology, and definitions of IPF.
Submitted 2 July, 2019;
originally announced July 2019.
-
Activity Recognition based on a Magnitude-Orientation Stream Network
Authors:
Carlos Caetano,
Victor H. C. de Melo,
Jefersson A. dos Santos,
William Robson Schwartz
Abstract:
The temporal component of videos provides an important clue for activity recognition, as a number of activities can be reliably recognized based on the motion information. In view of that, this work proposes a novel temporal stream for two-stream convolutional networks based on images computed from the optical flow magnitude and orientation, named Magnitude-Orientation Stream (MOS), to learn the motion in a better and richer manner. Our method applies simple nonlinear transformations on the vertical and horizontal components of the optical flow to generate input images for the temporal stream. Experiments carried out on two well-known datasets (HMDB51 and UCF101) demonstrate that using our proposed temporal stream as input to existing neural network architectures can improve their performance for activity recognition. The results also show that our temporal stream provides complementary information that can improve the classical two-stream methods, indicating the suitability of our approach as a temporal video representation.
Submitted 22 August, 2017;
originally announced August 2017.