subscribe to arXiv mailings

DVANet: Disentangling View and Action Features for Multi-View Action Recognition

Authors: Nyle Siddiqui, Praveen Tirupattur, Mubarak Shah

Abstract: In this work, we present a novel approach to multi-view action recognition where we guide learned action representations to be separated from view-relevant information in a video. When trying to classify action instances captured from multiple viewpoints, there is a higher degree of difficulty due to the difference in background, occlusion, and visibility of the captured action from different came… ▽ More In this work, we present a novel approach to multi-view action recognition where we guide learned action representations to be separated from view-relevant information in a video. When trying to classify action instances captured from multiple viewpoints, there is a higher degree of difficulty due to the difference in background, occlusion, and visibility of the captured action from different camera angles. To tackle the various problems introduced in multi-view action recognition, we propose a novel configuration of learnable transformer decoder queries, in conjunction with two supervised contrastive losses, to enforce the learning of action features that are robust to shifts in viewpoints. Our disentangled feature learning occurs in two stages: the transformer decoder uses separate queries to separately learn action and view information, which are then further disentangled using our two contrastive losses. We show that our model and method of training significantly outperforms all other uni-modal models on four multi-view action recognition datasets: NTU RGB+D, NTU RGB+D 120, PKU-MMD, and N-UCLA. Compared to previous RGB works, we see maximal improvements of 1.5\%, 4.8\%, 2.2\%, and 4.8\% on each dataset, respectively. △ Less

Submitted 9 December, 2023; originally announced December 2023.

arXiv:2312.05623 [pdf, other]

Impact of Urban Street Geometry on the Detection Probability of Automotive Radars

Authors: Mohammad Taha Shah, Ankit Kumar, Gourab Ghatak, Shobha Sundar Ram

Abstract: Prior works have analyzed the performance of millimeter wave automotive radars in the presence of diverse clutter and interference scenarios using stochastic geometry tools instead of more time-consuming measurement studies or system-level simulations. In these works, the distributions of radars or discrete clutter scatterers were modeled as Poisson point processes in the Euclidean space. However,… ▽ More Prior works have analyzed the performance of millimeter wave automotive radars in the presence of diverse clutter and interference scenarios using stochastic geometry tools instead of more time-consuming measurement studies or system-level simulations. In these works, the distributions of radars or discrete clutter scatterers were modeled as Poisson point processes in the Euclidean space. However, since most automotive radars are likely to be mounted on vehicles and road infrastructure, road geometries are an important factor that must be considered. Instead of considering each road geometry as an individual case for study, in this work, we model each case as a specific instance of an underlying Poisson line process and further model the distribution of vehicles on the road as a Poisson point process - forming a Poisson line Cox process. Then, through the use of stochastic geometry tools, we estimate the average number of interfering radars for specific road and vehicular densities and the effect of radar parameters such as noise and beamwidth on the radar detection metrics. The numerical results are validated with Monte Carlo simulations. △ Less

Submitted 9 December, 2023; originally announced December 2023.

Comments: Submitted to IEEE Radar Conference 2024 (RadarConf24)

arXiv:2312.04850 [pdf, other]

doi 10.1103/PhysRevB.109.115403

Impact of the Fizeau drag effect on Goos-Hänchen shifts in graphene

Authors: Rafi Ud Din, Muzamil Shah, Reza Asgari, Gao Xianlong

Abstract: We investigate the Goos-Hänchen shifts in reflection for a light beam within a graphene structure, utilizing the Fizeau drag effect induced by its massless Dirac electrons in incident light. The magnitudes of spatial and angular shifts for a light beam propagating against the direction of drifting electrons are significantly enhanced, while shifts for a beam co-propagating with the drifting electr… ▽ More We investigate the Goos-Hänchen shifts in reflection for a light beam within a graphene structure, utilizing the Fizeau drag effect induced by its massless Dirac electrons in incident light. The magnitudes of spatial and angular shifts for a light beam propagating against the direction of drifting electrons are significantly enhanced, while shifts for a beam co-propagating with the drifting electrons are suppressed. The Goos-Hänchen shifts exhibit augmentation with increasing drift velocities of electrons in graphene. The impact of incident wavelength on the angular and spatial shifts in reflection is discussed. Furthermore, the study highlights the crucial roles of the density of charged particles in graphene, the particle relaxation time, and the thickness of the graphene in manipulating the drag-affected Goos-Hänchen shifts. This investigation offers valuable insights for efficiently guiding light in graphene structures under the influence of the Fizeau drag effect. △ Less

Submitted 6 March, 2024; v1 submitted 8 December, 2023; originally announced December 2023.

Comments: 11 pages, 7 figures

arXiv:2312.04548 [pdf, other]

Multiview Aerial Visual Recognition (MAVREC): Can Multi-view Improve Aerial Visual Perception?

Authors: Aritra Dutta, Srijan Das, Jacob Nielsen, Rajatsubhra Chakraborty, Mubarak Shah

Abstract: Despite the commercial abundance of UAVs, aerial data acquisition remains challenging, and the existing Asia and North America-centric open-source UAV datasets are small-scale or low-resolution and lack diversity in scene contextuality. Additionally, the color content of the scenes, solar-zenith angle, and population density of different geographies influence the data diversity. These two factors… ▽ More Despite the commercial abundance of UAVs, aerial data acquisition remains challenging, and the existing Asia and North America-centric open-source UAV datasets are small-scale or low-resolution and lack diversity in scene contextuality. Additionally, the color content of the scenes, solar-zenith angle, and population density of different geographies influence the data diversity. These two factors conjointly render suboptimal aerial-visual perception of the deep neural network (DNN) models trained primarily on the ground-view data, including the open-world foundational models. To pave the way for a transformative era of aerial detection, we present Multiview Aerial Visual RECognition or MAVREC, a video dataset where we record synchronized scenes from different perspectives -- ground camera and drone-mounted camera. MAVREC consists of around 2.5 hours of industry-standard 2.7K resolution video sequences, more than 0.5 million frames, and 1.1 million annotated bounding boxes. This makes MAVREC the largest ground and aerial-view dataset, and the fourth largest among all drone-based datasets across all modalities and tasks. Through our extensive benchmarking on MAVREC, we recognize that augmenting object detectors with ground-view images from the corresponding geographical location is a superior pre-training strategy for aerial detection. Building on this strategy, we benchmark MAVREC with a curriculum-based semi-supervised object detection approach that leverages labeled (ground and aerial) and unlabeled (only aerial) images to enhance the aerial detection. We publicly release the MAVREC dataset: https://mavrec.github.io. △ Less

Submitted 7 December, 2023; originally announced December 2023.

ACM Class: I.4.0; I.4.8; I.5.1; I.5.4; I.2.10

arXiv:2311.13435 [pdf, other]

PG-Video-LLaVA: Pixel Grounding Large Video-Language Models

Authors: Shehan Munasinghe, Rusiru Thushara, Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, Mubarak Shah, Fahad Khan

Abstract: Extending image-based Large Multimodal Models (LMMs) to videos is challenging due to the inherent complexity of video data. The recent approaches extending image-based LMMs to videos either lack the grounding capabilities (e.g., VideoChat, Video-ChatGPT, Video-LLaMA) or do not utilize the audio-signals for better video understanding (e.g., Video-ChatGPT). Addressing these gaps, we propose PG-Video… ▽ More Extending image-based Large Multimodal Models (LMMs) to videos is challenging due to the inherent complexity of video data. The recent approaches extending image-based LMMs to videos either lack the grounding capabilities (e.g., VideoChat, Video-ChatGPT, Video-LLaMA) or do not utilize the audio-signals for better video understanding (e.g., Video-ChatGPT). Addressing these gaps, we propose PG-Video-LLaVA, the first LMM with pixel-level grounding capability, integrating audio cues by transcribing them into text to enrich video-context understanding. Our framework uses an off-the-shelf tracker and a novel grounding module, enabling it to spatially localize objects in videos following user instructions. We evaluate PG-Video-LLaVA using video-based generative and question-answering benchmarks and introduce new benchmarks specifically designed to measure prompt-based object grounding performance in videos. Further, we propose the use of Vicuna over GPT-3.5, as utilized in Video-ChatGPT, for video-based conversation benchmarking, ensuring reproducibility of results which is a concern with the proprietary nature of GPT-3.5. Our framework builds on SoTA image-based LLaVA model and extends its advantages to the video domain, delivering promising gains on video-based conversation and grounding tasks. Project Page: https://github.com/mbzuai-oryx/Video-LLaVA △ Less

Submitted 13 December, 2023; v1 submitted 22 November, 2023; originally announced November 2023.

Comments: Technical Report

arXiv:2311.12733 [pdf, other]

doi 10.1145/3626252.3630899

Not Just Training, Also Testing: High School Youths' Perspective-Taking through Peer Testing Machine Learning-Powered Applications

Authors: L. Morales-Navarro, M. Shah, Y. B. Kafai

Abstract: Most attention in K-12 artificial intelligence and machine learning (AI/ML) education has been given to having youths train models, with much less attention to the equally important testing of models when creating machine learning applications. Testing ML applications allows for the evaluation of models against predictions and can help creators of applications identify and address failure and edge… ▽ More Most attention in K-12 artificial intelligence and machine learning (AI/ML) education has been given to having youths train models, with much less attention to the equally important testing of models when creating machine learning applications. Testing ML applications allows for the evaluation of models against predictions and can help creators of applications identify and address failure and edge cases that could negatively impact user experiences. We investigate how testing each other's projects supported youths to take perspective about functionality, performance, and potential issues in their own projects. We analyzed testing worksheets, audio and video recordings collected during a two week workshop in which 11 high school youths created physical computing projects that included (audio, pose, and image) ML classifiers. We found that through peer-testing youths reflected on the size of their training datasets, the diversity of their training data, the design of their classes and the contexts in which they produced training data. We discuss future directions for research on peer-testing in AI/ML education and current limitations for these kinds of activities. △ Less

Submitted 14 December, 2023; v1 submitted 21 November, 2023; originally announced November 2023.

ACM Class: K.3.2

arXiv:2311.05548 [pdf, other]

L-WaveBlock: A Novel Feature Extractor Leveraging Wavelets for Generative Adversarial Networks

Authors: Mirat Shah, Vansh Jain, Anmol Chokshi, Guruprasad Parasnis, Pramod Bide

Abstract: Generative Adversarial Networks (GANs) have risen to prominence in the field of deep learning, facilitating the generation of realistic data from random noise. The effectiveness of GANs often depends on the quality of feature extraction, a critical aspect of their architecture. This paper introduces L-WaveBlock, a novel and robust feature extractor that leverages the capabilities of the Discrete W… ▽ More Generative Adversarial Networks (GANs) have risen to prominence in the field of deep learning, facilitating the generation of realistic data from random noise. The effectiveness of GANs often depends on the quality of feature extraction, a critical aspect of their architecture. This paper introduces L-WaveBlock, a novel and robust feature extractor that leverages the capabilities of the Discrete Wavelet Transform (DWT) with deep learning methodologies. L-WaveBlock is catered to quicken the convergence of GAN generators while simultaneously enhancing their performance. The paper demonstrates the remarkable utility of L-WaveBlock across three datasets, a road satellite imagery dataset, the CelebA dataset and the GoPro dataset, showcasing its ability to ease feature extraction and make it more efficient. By utilizing DWT, L-WaveBlock efficiently captures the intricate details of both structural and textural details, and further partitions feature maps into orthogonal subbands across multiple scales while preserving essential information at the same time. Not only does it lead to faster convergence, but also gives competent results on every dataset by employing the L-WaveBlock. The proposed method achieves an Inception Score of 3.6959 and a Structural Similarity Index of 0.4261 on the maps dataset, a Peak Signal-to-Noise Ratio of 29.05 and a Structural Similarity Index of 0.874 on the CelebA dataset. The proposed method performs competently to the state-of-the-art for the image denoising dataset, albeit not better, but still leads to faster convergence than conventional methods. With this, L-WaveBlock emerges as a robust and efficient tool for enhancing GAN-based image generation, demonstrating superior convergence speed and competitive performance across multiple datasets for image resolution, image generation and image denoising. △ Less

Submitted 9 November, 2023; originally announced November 2023.

Comments: 12 figures, 8 pages

arXiv:2310.15324 [pdf, other]

Videoprompter: an ensemble of foundational models for zero-shot video understanding

Authors: Adeel Yousaf, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, Mubarak Shah

Abstract: Vision-language models (VLMs) classify the query video by calculating a similarity score between the visual features and text-based class label representations. Recently, large language models (LLMs) have been used to enrich the text-based class labels by enhancing the descriptiveness of the class names. However, these improvements are restricted to the text-based classifier only, and the query vi… ▽ More Vision-language models (VLMs) classify the query video by calculating a similarity score between the visual features and text-based class label representations. Recently, large language models (LLMs) have been used to enrich the text-based class labels by enhancing the descriptiveness of the class names. However, these improvements are restricted to the text-based classifier only, and the query visual features are not considered. In this paper, we propose a framework which combines pre-trained discriminative VLMs with pre-trained generative video-to-text and text-to-text models. We introduce two key modifications to the standard zero-shot setting. First, we propose language-guided visual feature enhancement and employ a video-to-text model to convert the query video to its descriptive form. The resulting descriptions contain vital visual cues of the query video, such as what objects are present and their spatio-temporal interactions. These descriptive cues provide additional semantic knowledge to VLMs to enhance their zeroshot performance. Second, we propose video-specific prompts to LLMs to generate more meaningful descriptions to enrich class label representations. Specifically, we introduce prompt techniques to create a Tree Hierarchy of Categories for class names, offering a higher-level action context for additional visual cues, We demonstrate the effectiveness of our approach in video understanding across three different zero-shot settings: 1) video action recognition, 2) video-to-text and textto-video retrieval, and 3) time-sensitive video tasks. Consistent improvements across multiple benchmarks and with various VLMs demonstrate the effectiveness of our proposed framework. Our code will be made publicly available. △ Less

Submitted 23 October, 2023; originally announced October 2023.

arXiv:2310.05932 [pdf, other]

A Multi-Agent Systems Approach for Peer-to-Peer Energy Trading in Dairy Farming

Authors: Mian Ibad Ali Shah, Abdul Wahid, Enda Barrett, Karl Mason

Abstract: To achieve desired carbon emission reductions, integrating renewable generation and accelerating the adoption of peer-to-peer energy trading is crucial. This is especially important for energy-intensive farming, like dairy farming. However, integrating renewables and peer-to-peer trading presents challenges. To address this, we propose the Multi-Agent Peer-to-Peer Dairy Farm Energy Simulator (MAPD… ▽ More To achieve desired carbon emission reductions, integrating renewable generation and accelerating the adoption of peer-to-peer energy trading is crucial. This is especially important for energy-intensive farming, like dairy farming. However, integrating renewables and peer-to-peer trading presents challenges. To address this, we propose the Multi-Agent Peer-to-Peer Dairy Farm Energy Simulator (MAPDES), enabling dairy farms to participate in peer-to-peer markets. Our strategy reduces electricity costs and peak demand by approximately 30% and 24% respectively, while increasing energy sales by 37% compared to the baseline scenario without P2P trading. This demonstrates the effectiveness of our approach. △ Less

Submitted 21 August, 2023; originally announced October 2023.

Comments: Proc. of the Artificial Intelligence for Sustainability, ECAI 2023, Eunika et al. (eds.), Sep 30- Oct 1, 2023, https://sites.google.com/view/ai4s. 2023

arXiv:2310.04445 [pdf, other]

LoFT: Local Proxy Fine-tuning For Improving Transferability Of Adversarial Attacks Against Large Language Model

Authors: Muhammad Ahmed Shah, Roshan Sharma, Hira Dhamyal, Raphael Olivier, Ankit Shah, Joseph Konan, Dareen Alharthi, Hazim T Bukhari, Massa Baali, Soham Deshmukh, Michael Kuhlmann, Bhiksha Raj, Rita Singh

Abstract: It has been shown that Large Language Model (LLM) alignments can be circumvented by appending specially crafted attack suffixes with harmful queries to elicit harmful responses. To conduct attacks against private target models whose characterization is unknown, public models can be used as proxies to fashion the attack, with successful attacks being transferred from public proxies to private targe… ▽ More It has been shown that Large Language Model (LLM) alignments can be circumvented by appending specially crafted attack suffixes with harmful queries to elicit harmful responses. To conduct attacks against private target models whose characterization is unknown, public models can be used as proxies to fashion the attack, with successful attacks being transferred from public proxies to private target models. The success rate of attack depends on how closely the proxy model approximates the private model. We hypothesize that for attacks to be transferrable, it is sufficient if the proxy can approximate the target model in the neighborhood of the harmful query. Therefore, in this paper, we propose \emph{Local Fine-Tuning (LoFT)}, \textit{i.e.}, fine-tuning proxy models on similar queries that lie in the lexico-semantic neighborhood of harmful queries to decrease the divergence between the proxy and target models. First, we demonstrate three approaches to prompt private target models to obtain similar queries given harmful queries. Next, we obtain data for local fine-tuning by eliciting responses from target models for the generated similar queries. Then, we optimize attack suffixes to generate attack prompts and evaluate the impact of our local fine-tuning on the attack's success rate. Experiments show that local fine-tuning of proxy models improves attack transferability and increases attack success rate by $39\%$, $7\%$, and $0.5\%$ (absolute) on target models ChatGPT, GPT-4, and Claude respectively. △ Less

Submitted 21 October, 2023; v1 submitted 2 October, 2023; originally announced October 2023.

arXiv:2309.16020 [pdf, other]

GeoCLIP: Clip-Inspired Alignment between Locations and Images for Effective Worldwide Geo-localization

Authors: Vicente Vivanco Cepeda, Gaurav Kumar Nayak, Mubarak Shah

Abstract: Worldwide Geo-localization aims to pinpoint the precise location of images taken anywhere on Earth. This task has considerable challenges due to immense variation in geographic landscapes. The image-to-image retrieval-based approaches fail to solve this problem on a global scale as it is not feasible to construct a large gallery of images covering the entire world. Instead, existing approaches div… ▽ More Worldwide Geo-localization aims to pinpoint the precise location of images taken anywhere on Earth. This task has considerable challenges due to immense variation in geographic landscapes. The image-to-image retrieval-based approaches fail to solve this problem on a global scale as it is not feasible to construct a large gallery of images covering the entire world. Instead, existing approaches divide the globe into discrete geographic cells, transforming the problem into a classification task. However, their performance is limited by the predefined classes and often results in inaccurate localizations when an image's location significantly deviates from its class center. To overcome these limitations, we propose GeoCLIP, a novel CLIP-inspired Image-to-GPS retrieval approach that enforces alignment between the image and its corresponding GPS locations. GeoCLIP's location encoder models the Earth as a continuous function by employing positional encoding through random Fourier features and constructing a hierarchical representation that captures information at varying resolutions to yield a semantically rich high-dimensional feature suitable to use even beyond geo-localization. To the best of our knowledge, this is the first work employing GPS encoding for geo-localization. We demonstrate the efficacy of our method via extensive experiments and ablations on benchmark datasets. We achieve competitive performance with just 20% of training data, highlighting its effectiveness even in limited-data settings. Furthermore, we qualitatively demonstrate geo-localization using a text query by leveraging CLIP backbone of our image encoder. The project webpage is available at: https://vicentevivan.github.io/GeoCLIP △ Less

Submitted 21 November, 2023; v1 submitted 27 September, 2023; originally announced September 2023.

Comments: Accepted at NeurIPS 2023

arXiv:2309.15140 [pdf, other]

doi 10.1109/CVCI59596.2023.10397371

A Review on AI Algorithms for Energy Management in E-Mobility Services

Authors: Sen Yan, Maqsood Hussain Shah, Ji Li, Noel O'Connor, Mingming Liu

Abstract: E-mobility, or electric mobility, has emerged as a pivotal solution to address pressing environmental and sustainability concerns in the transportation sector. The depletion of fossil fuels, escalating greenhouse gas emissions, and the imperative to combat climate change underscore the significance of transitioning to electric vehicles (EVs). This paper seeks to explore the potential of artificial… ▽ More E-mobility, or electric mobility, has emerged as a pivotal solution to address pressing environmental and sustainability concerns in the transportation sector. The depletion of fossil fuels, escalating greenhouse gas emissions, and the imperative to combat climate change underscore the significance of transitioning to electric vehicles (EVs). This paper seeks to explore the potential of artificial intelligence (AI) in addressing various challenges related to effective energy management in e-mobility systems (EMS). These challenges encompass critical factors such as range anxiety, charge rate optimization, and the longevity of energy storage in EVs. By analyzing existing literature, we delve into the role that AI can play in tackling these challenges and enabling efficient energy management in EMS. Our objectives are twofold: to provide an overview of the current state-of-the-art in this research domain and propose effective avenues for future investigations. Through this analysis, we aim to contribute to the advancement of sustainable and efficient e-mobility solutions, shaping a greener and more sustainable future for transportation. △ Less

Submitted 26 September, 2023; originally announced September 2023.

Comments: 8 pages, 4 tables, 1 figure

arXiv:2309.13962 [pdf, other]

Egocentric RGB+Depth Action Recognition in Industry-Like Settings

Authors: Jyoti Kini, Sarah Fleischer, Ishan Dave, Mubarak Shah

Abstract: Action recognition from an egocentric viewpoint is a crucial perception task in robotics and enables a wide range of human-robot interactions. While most computer vision approaches prioritize the RGB camera, the Depth modality - which can further amplify the subtleties of actions from an egocentric perspective - remains underexplored. Our work focuses on recognizing actions from egocentric RGB and… ▽ More Action recognition from an egocentric viewpoint is a crucial perception task in robotics and enables a wide range of human-robot interactions. While most computer vision approaches prioritize the RGB camera, the Depth modality - which can further amplify the subtleties of actions from an egocentric perspective - remains underexplored. Our work focuses on recognizing actions from egocentric RGB and Depth modalities in an industry-like environment. To study this problem, we consider the recent MECCANO dataset, which provides a wide range of assembling actions. Our framework is based on the 3D Video SWIN Transformer to encode both RGB and Depth modalities effectively. To address the inherent skewness in real-world multimodal action occurrences, we propose a training strategy using an exponentially decaying variant of the focal loss modulating factor. Additionally, to leverage the information in both RGB and Depth modalities, we opt for late fusion to combine the predictions from each modality. We thoroughly evaluate our method on the action recognition task of the MECCANO dataset, and it significantly outperforms the prior work. Notably, our method also secured first place at the multimodal action recognition challenge at ICIAP 2023. △ Less

Submitted 25 September, 2023; originally announced September 2023.

arXiv:2309.10058 [pdf, other]

Dual Student Networks for Data-Free Model Stealing

Authors: James Beetham, Navid Kardan, Ajmal Mian, Mubarak Shah

Abstract: Existing data-free model stealing methods use a generator to produce samples in order to train a student model to match the target model outputs. To this end, the two main challenges are estimating gradients of the target model without access to its parameters, and generating a diverse set of training samples that thoroughly explores the input space. We propose a Dual Student method where two stud… ▽ More Existing data-free model stealing methods use a generator to produce samples in order to train a student model to match the target model outputs. To this end, the two main challenges are estimating gradients of the target model without access to its parameters, and generating a diverse set of training samples that thoroughly explores the input space. We propose a Dual Student method where two students are symmetrically trained in order to provide the generator a criterion to generate samples that the two students disagree on. On one hand, disagreement on a sample implies at least one student has classified the sample incorrectly when compared to the target model. This incentive towards disagreement implicitly encourages the generator to explore more diverse regions of the input space. On the other hand, our method utilizes gradients of student models to indirectly estimate gradients of the target model. We show that this novel training objective for the generator network is equivalent to optimizing a lower bound on the generator's loss if we had access to the target model gradients. We show that our new optimization framework provides more accurate gradient estimation of the target model and better accuracies on benchmark classification datasets. Additionally, our approach balances improved query efficiency with training computation cost. Finally, we demonstrate that our method serves as a better proxy model for transfer-based adversarial attacks than existing data-free model stealing methods. △ Less

Submitted 18 September, 2023; originally announced September 2023.

Comments: Published in the ICLR 2023 - The Eleventh International Conference on Learning Representations

arXiv:2309.03989 [pdf, other]

CDFSL-V: Cross-Domain Few-Shot Learning for Videos

Authors: Sarinda Samarasinghe, Mamshad Nayeem Rizve, Navid Kardan, Mubarak Shah

Abstract: Few-shot video action recognition is an effective approach to recognizing new categories with only a few labeled examples, thereby reducing the challenges associated with collecting and annotating large-scale video datasets. Existing methods in video action recognition rely on large labeled datasets from the same domain. However, this setup is not realistic as novel categories may come from differ… ▽ More Few-shot video action recognition is an effective approach to recognizing new categories with only a few labeled examples, thereby reducing the challenges associated with collecting and annotating large-scale video datasets. Existing methods in video action recognition rely on large labeled datasets from the same domain. However, this setup is not realistic as novel categories may come from different data domains that may have different spatial and temporal characteristics. This dissimilarity between the source and target domains can pose a significant challenge, rendering traditional few-shot action recognition techniques ineffective. To address this issue, in this work, we propose a novel cross-domain few-shot video action recognition method that leverages self-supervised learning and curriculum learning to balance the information from the source and target domains. To be particular, our method employs a masked autoencoder-based self-supervised training objective to learn from both source and target data in a self-supervised manner. Then a progressive curriculum balances learning the discriminative information from the source dataset with the generic information learned from the target domain. Initially, our curriculum utilizes supervised learning to learn class discriminative features from the source data. As the training progresses, we transition to learning target-domain-specific features. We propose a progressive curriculum to encourage the emergence of rich features in the target domain based on class discriminative supervised features in the source domain. We evaluate our method on several challenging benchmark datasets and demonstrate that our approach outperforms existing cross-domain few-shot learning techniques. Our code is available at https://github.com/Sarinda251/CDFSL-V △ Less

Submitted 15 September, 2023; v1 submitted 7 September, 2023; originally announced September 2023.

Comments: ICCV 2023

arXiv:2309.02626 [pdf, other]

Adaptive Consensus: A network pruning approach for decentralized optimization

Authors: Suhail M. Shah, Albert S. Berahas, Raghu Bollapragada

Abstract: We consider network-based decentralized optimization problems, where each node in the network possesses a local function and the objective is to collectively attain a consensus solution that minimizes the sum of all the local functions. A major challenge in decentralized optimization is the reliance on communication which remains a considerable bottleneck in many applications. To address this chal… ▽ More We consider network-based decentralized optimization problems, where each node in the network possesses a local function and the objective is to collectively attain a consensus solution that minimizes the sum of all the local functions. A major challenge in decentralized optimization is the reliance on communication which remains a considerable bottleneck in many applications. To address this challenge, we propose an adaptive randomized communication-efficient algorithmic framework that reduces the volume of communication by periodically tracking the disagreement error and judiciously selecting the most influential and effective edges at each node for communication. Within this framework, we present two algorithms: Adaptive Consensus (AC) to solve the consensus problem and Adaptive Consensus based Gradient Tracking (AC-GT) to solve smooth strongly convex decentralized optimization problems. We establish strong theoretical convergence guarantees for the proposed algorithms and quantify their performance in terms of various algorithmic parameters under standard assumptions. Finally, numerical experiments showcase the effectiveness of the framework in significantly reducing the information exchange required to achieve a consensus solution. △ Less

Submitted 5 September, 2023; originally announced September 2023.

Comments: 35 pages, 3 figures

arXiv:2309.02226 [pdf, ps, other]

Asymptotically harmonic manifolds of dimension 3 with minimal horospheres

Authors: Jihun Kim, JeongHyeong Park, Hemangi Madhusudan Shah

Abstract: In [14], it was shown that, if M is a 3-dimensional asymptotically harmonic with minimal horospheres, then M is flat. However, there is a gap in the proof of this paper. In this paper, we provide the correct proof of the result. Thus we complete the classification of asymptotically harmonic manifolds of dimension 3: An asymptotically harmonic manifold of dimension 3 is either a flat or real hyperb… ▽ More In [14], it was shown that, if M is a 3-dimensional asymptotically harmonic with minimal horospheres, then M is flat. However, there is a gap in the proof of this paper. In this paper, we provide the correct proof of the result. Thus we complete the classification of asymptotically harmonic manifolds of dimension 3: An asymptotically harmonic manifold of dimension 3 is either a flat or real hyperbolic space. △ Less

Submitted 5 September, 2023; originally announced September 2023.

Comments: 7 pages

MSC Class: Primary 53C35; Secondary 53C25

arXiv:2308.13711 [pdf, other]

EventTransAct: A video transformer-based framework for Event-camera based action recognition

Authors: Tristan de Blegiers, Ishan Rajendrakumar Dave, Adeel Yousaf, Mubarak Shah

Abstract: Recognizing and comprehending human actions and gestures is a crucial perception requirement for robots to interact with humans and carry out tasks in diverse domains, including service robotics, healthcare, and manufacturing. Event cameras, with their ability to capture fast-moving objects at a high temporal resolution, offer new opportunities compared to standard action recognition in RGB videos… ▽ More Recognizing and comprehending human actions and gestures is a crucial perception requirement for robots to interact with humans and carry out tasks in diverse domains, including service robotics, healthcare, and manufacturing. Event cameras, with their ability to capture fast-moving objects at a high temporal resolution, offer new opportunities compared to standard action recognition in RGB videos. However, previous research on event camera action recognition has primarily focused on sensor-specific network architectures and image encoding, which may not be suitable for new sensors and limit the use of recent advancements in transformer-based architectures. In this study, we employ a computationally efficient model, namely the video transformer network (VTN), which initially acquires spatial embeddings per event-frame and then utilizes a temporal self-attention mechanism. In order to better adopt the VTN for the sparse and fine-grained nature of event data, we design Event-Contrastive Loss ($\mathcal{L}_{EC}$) and event-specific augmentations. Proposed $\mathcal{L}_{EC}$ promotes learning fine-grained spatial cues in the spatial backbone of VTN by contrasting temporally misaligned frames. We evaluate our method on real-world action recognition of N-EPIC Kitchens dataset, and achieve state-of-the-art results on both protocols - testing in seen kitchen (\textbf{74.9\%} accuracy) and testing in unseen kitchens (\textbf{42.43\% and 46.66\% Accuracy}). Our approach also takes less computation time compared to competitive prior approaches, which demonstrates the potential of our framework \textit{EventTransAct} for real-world applications of event-camera based action recognition. Project Page: \url{https://tristandb8.github.io/EventTransAct_webpage/} △ Less

Submitted 25 August, 2023; originally announced August 2023.

Comments: IROS 2023; The first two authors contributed equally

arXiv:2308.13077 [pdf, other]

Preserving Modality Structure Improves Multi-Modal Learning

Authors: Swetha Sirnam, Mamshad Nayeem Rizve, Nina Shvetsova, Hilde Kuehne, Mubarak Shah

Abstract: Self-supervised learning on large-scale multi-modal datasets allows learning semantically meaningful embeddings in a joint multi-modal representation space without relying on human annotations. These joint embeddings enable zero-shot cross-modal tasks like retrieval and classification. However, these methods often struggle to generalize well on out-of-domain data as they ignore the semantic struct… ▽ More Self-supervised learning on large-scale multi-modal datasets allows learning semantically meaningful embeddings in a joint multi-modal representation space without relying on human annotations. These joint embeddings enable zero-shot cross-modal tasks like retrieval and classification. However, these methods often struggle to generalize well on out-of-domain data as they ignore the semantic structure present in modality-specific embeddings. In this context, we propose a novel Semantic-Structure-Preserving Consistency approach to improve generalizability by preserving the modality-specific relationships in the joint embedding space. To capture modality-specific semantic relationships between samples, we propose to learn multiple anchors and represent the multifaceted relationship between samples with respect to their relationship with these anchors. To assign multiple anchors to each sample, we propose a novel Multi-Assignment Sinkhorn-Knopp algorithm. Our experimentation demonstrates that our proposed approach learns semantically meaningful anchors in a self-supervised manner. Furthermore, our evaluation on MSR-VTT and YouCook2 datasets demonstrates that our proposed multi-anchor assignment based solution achieves state-of-the-art performance and generalizes to both inand out-of-domain datasets. Code: https://github.com/Swetha5/Multi_Sinkhorn_Knopp △ Less

Submitted 24 August, 2023; originally announced August 2023.

Comments: Accepted at ICCV 2023

arXiv:2308.12816 [pdf, ps, other]

doi 10.5277/e-Inf160105

Software Startups -- A Research Agenda

Authors: Michael Unterkalmsteiner, Pekka Abrahamsson, Xiaofeng Wang, Anh Nguyen-Duc, Syed M. Ali Shah, Sohaib Shahid Bajwa, Guido H. Baltes, Kieran Conboy, Eoin Cullina, Denis Dennehy, Henry Edison, Carlos Fernández-Sánchez, Juan Garbajosa, Tony Gorschek, Eriks Klotins, Laura Hokkanen, Fabio Kon, Ilaria Lunesu, Michele Marchesi, Lorraine Morgan, Markku Oivo, Christoph Selig, Pertti Seppänen, Roger Sweetman, Pasi Tyrväinen , et al. (2 additional authors not shown)

Abstract: Software startup companies develop innovative, software-intensive products within limited time frames and with few resources, searching for sustainable and scalable business models. Software startups are quite distinct from traditional mature software companies, but also from micro-, small-, and medium-sized enterprises, introducing new challenges relevant for software engineering research. This p… ▽ More Software startup companies develop innovative, software-intensive products within limited time frames and with few resources, searching for sustainable and scalable business models. Software startups are quite distinct from traditional mature software companies, but also from micro-, small-, and medium-sized enterprises, introducing new challenges relevant for software engineering research. This paper's research agenda focuses on software engineering in startups, identifying, in particular, 70+ research questions in the areas of supporting startup engineering activities, startup evolution models and patterns, ecosystems and innovation hubs, human aspects in software startups, applying startup concepts in non-startup environments, and methodologies and theories for startup research. We connect and motivate this research agenda with past studies in software startup research, while pointing out possible future directions. While all authors of this research agenda have their main background in Software Engineering or Computer Science, their interest in software startups broadens the perspective to the challenges, but also to the opportunities that emerge from multi-disciplinary research. Our audience is therefore primarily software engineering researchers, even though we aim at stimulating collaborations and research that crosses disciplinary boundaries. We believe that with this research agenda we cover a wide spectrum of the software startup industry current needs. △ Less

Submitted 24 August, 2023; originally announced August 2023.

Journal ref: e-Informatica Softw. Eng. J. 10(1): 89-124 (2016)

arXiv:2308.11405 [pdf, other]

Achievable Sum-rate of variants of QAM over Gaussian Multiple Access Channel with and without security

Authors: Shifa Showkat, Zahid Bashir Dar, Shahid Mehraj Shah

Abstract: The performance of next generation wireless systems (5G/6G and beyond) at the physical layer is primarily driven by the choice of digital modulation techniques that are bandwidth and power efficient, while maintaining high data rates. Achievable rates for Gaussian input and some finite constellations (BPSK/QPSK/QAM) are well studied in the literature. However, new variants of Quadrature Amplitude… ▽ More The performance of next generation wireless systems (5G/6G and beyond) at the physical layer is primarily driven by the choice of digital modulation techniques that are bandwidth and power efficient, while maintaining high data rates. Achievable rates for Gaussian input and some finite constellations (BPSK/QPSK/QAM) are well studied in the literature. However, new variants of Quadrature Amplitude Modulation (QAM) such as Cross-QAM (XQAM), Star-QAM (S-QAM), Amplitude and phase shift keying (APSK), and Hexagonal Quadrature Amplitude Modulation (H-QAM) are not studied in the context of achievable rates for meeting the demand of high data rates. In this paper, we study achievable rate region for different variants of M-QAM like Cross-QAM, H-QAM, Star-QAM and APSK. We also compute mutual information corresponding to the sum rate of Gaussian Multiple Access Channel (G-MAC), for hybrid constellation scheme, e.g., user 1 transmits using Star-QAM and user 2 by H-QAM. From the results, it is observed that S-QAM gives the maximum sum-rate when users transmit same constellations. Also, it has been found that when hybrid constellation is used, the combination of Star-QAM \& H-QAM gives the maximum rate. In the next part of the paper, we consider a scenario wherein an adversary is also present at the receiver side and is trying to decode the information. We model this scenario as Gaussian Multiple Access Wiretap Channel (G-MAW-WT). We then compute the achievable secrecy sum rate of two user G-MAC-WT with discrete inputs from different variants of QAM (viz, X-QAM, H-QAM and S-QAM).It has been found that at higher values of SNR, S-QAM gives better values of SSR than the other variants. For hybrid inputs of QAM, at lower values of SNR, combination of APSK and S-QAM gives better results and at higher values of SNR, combination of HQAM and APSK gives greater value of SSR. △ Less

Submitted 22 August, 2023; originally announced August 2023.

Comments: 11 Figures, two tables. Accepted for publication in IEEE International Conference on Signal Processing and Computer Vision (SPCV-2023)

arXiv:2308.11072 [pdf, other]

TeD-SPAD: Temporal Distinctiveness for Self-supervised Privacy-preservation for video Anomaly Detection

Authors: Joseph Fioresi, Ishan Rajendrakumar Dave, Mubarak Shah

Abstract: Video anomaly detection (VAD) without human monitoring is a complex computer vision task that can have a positive impact on society if implemented successfully. While recent advances have made significant progress in solving this task, most existing approaches overlook a critical real-world concern: privacy. With the increasing popularity of artificial intelligence technologies, it becomes crucial… ▽ More Video anomaly detection (VAD) without human monitoring is a complex computer vision task that can have a positive impact on society if implemented successfully. While recent advances have made significant progress in solving this task, most existing approaches overlook a critical real-world concern: privacy. With the increasing popularity of artificial intelligence technologies, it becomes crucial to implement proper AI ethics into their development. Privacy leakage in VAD allows models to pick up and amplify unnecessary biases related to people's personal information, which may lead to undesirable decision making. In this paper, we propose TeD-SPAD, a privacy-aware video anomaly detection framework that destroys visual private information in a self-supervised manner. In particular, we propose the use of a temporally-distinct triplet loss to promote temporally discriminative features, which complements current weakly-supervised VAD methods. Using TeD-SPAD, we achieve a positive trade-off between privacy protection and utility anomaly detection performance on three popular weakly supervised VAD datasets: UCF-Crime, XD-Violence, and ShanghaiTech. Our proposed anonymization model reduces private attribute prediction by 32.25% while only reducing frame-level ROC AUC on the UCF-Crime anomaly detection dataset by 3.69%. Project Page: https://joefioresi718.github.io/TeD-SPAD_webpage/ △ Less

Submitted 21 August, 2023; originally announced August 2023.

Comments: ICCV 2023

arXiv:2308.09693 [pdf, other]

A Lightweight Transformer for Faster and Robust EBSD Data Collection

Authors: Harry Dong, Sean Donegan, Megna Shah, Yuejie Chi

Abstract: Three dimensional electron back-scattered diffraction (EBSD) microscopy is a critical tool in many applications in materials science, yet its data quality can fluctuate greatly during the arduous collection process, particularly via serial-sectioning. Fortunately, 3D EBSD data is inherently sequential, opening up the opportunity to use transformers, state-of-the-art deep learning architectures tha… ▽ More Three dimensional electron back-scattered diffraction (EBSD) microscopy is a critical tool in many applications in materials science, yet its data quality can fluctuate greatly during the arduous collection process, particularly via serial-sectioning. Fortunately, 3D EBSD data is inherently sequential, opening up the opportunity to use transformers, state-of-the-art deep learning architectures that have made breakthroughs in a plethora of domains, for data processing and recovery. To be more robust to errors and accelerate this 3D EBSD data collection, we introduce a two step method that recovers missing slices in an 3D EBSD volume, using an efficient transformer model and a projection algorithm to process the transformer's outputs. Overcoming the computational and practical hurdles of deep learning with scarce high dimensional data, we train this model using only synthetic 3D EBSD data with self-supervision and obtain superior recovery accuracy on real 3D EBSD data, compared to existing methods. △ Less

Submitted 18 August, 2023; originally announced August 2023.

arXiv:2308.07705 [pdf, other]

Parametric entropy based Cluster Centriod Initialization for k-means clustering of various Image datasets

Authors: Faheem Hussayn, Shahid M Shah

Abstract: One of the most employed yet simple algorithm for cluster analysis is the k-means algorithm. k-means has successfully witnessed its use in artificial intelligence, market segmentation, fraud detection, data mining, psychology, etc., only to name a few. The k-means algorithm, however, does not always yield the best quality results. Its performance heavily depends upon the number of clusters supplie… ▽ More One of the most employed yet simple algorithm for cluster analysis is the k-means algorithm. k-means has successfully witnessed its use in artificial intelligence, market segmentation, fraud detection, data mining, psychology, etc., only to name a few. The k-means algorithm, however, does not always yield the best quality results. Its performance heavily depends upon the number of clusters supplied and the proper initialization of the cluster centroids or seeds. In this paper, we conduct an analysis of the performance of k-means on image data by employing parametric entropies in an entropy based centroid initialization method and propose the best fitting entropy measures for general image datasets. We use several entropies like Taneja entropy, Kapur entropy, Aczel Daroczy entropy, Sharma Mittal entropy. We observe that for different datasets, different entropies provide better results than the conventional methods. We have applied our proposed algorithm on these datasets: Satellite, Toys, Fruits, Cars, Brain MRI, Covid X-Ray. △ Less

Submitted 15 August, 2023; originally announced August 2023.

Comments: 6 Pages, 2 tables, one algorithm. Accepted for publication in IEEE International Conference on Signal Processing and Computer Vision (SPCV-2023)

arXiv:2308.05430 [pdf, other]

Ensemble Modeling for Multimodal Visual Action Recognition

Authors: Jyoti Kini, Sarah Fleischer, Ishan Dave, Mubarak Shah

Abstract: In this work, we propose an ensemble modeling approach for multimodal action recognition. We independently train individual modality models using a variant of focal loss tailored to handle the long-tailed distribution of the MECCANO [21] dataset. Based on the underlying principle of focal loss, which captures the relationship between tail (scarce) classes and their prediction difficulties, we prop… ▽ More In this work, we propose an ensemble modeling approach for multimodal action recognition. We independently train individual modality models using a variant of focal loss tailored to handle the long-tailed distribution of the MECCANO [21] dataset. Based on the underlying principle of focal loss, which captures the relationship between tail (scarce) classes and their prediction difficulties, we propose an exponentially decaying variant of focal loss for our current task. It initially emphasizes learning from the hard misclassified examples and gradually adapts to the entire range of examples in the dataset. This annealing process encourages the model to strike a balance between focusing on the sparse set of hard samples, while still leveraging the information provided by the easier ones. Additionally, we opt for the late fusion strategy to combine the resultant probability distributions from RGB and Depth modalities for final action prediction. Experimental evaluations on the MECCANO dataset demonstrate the effectiveness of our approach. △ Less

Submitted 25 September, 2023; v1 submitted 10 August, 2023; originally announced August 2023.

Comments: 22nd International Conference on Image Analysis and Processing Workshops - Multimodal Action Recognition on the MECCANO Dataset, 2023

arXiv:2308.03956 [pdf, other]

Fixed Inter-Neuron Covariability Induces Adversarial Robustness

Authors: Muhammad Ahmed Shah, Bhiksha Raj

Abstract: The vulnerability to adversarial perturbations is a major flaw of Deep Neural Networks (DNNs) that raises question about their reliability when in real-world scenarios. On the other hand, human perception, which DNNs are supposed to emulate, is highly robust to such perturbations, indicating that there may be certain features of the human perception that make it robust but are not represented in t… ▽ More The vulnerability to adversarial perturbations is a major flaw of Deep Neural Networks (DNNs) that raises question about their reliability when in real-world scenarios. On the other hand, human perception, which DNNs are supposed to emulate, is highly robust to such perturbations, indicating that there may be certain features of the human perception that make it robust but are not represented in the current class of DNNs. One such feature is that the activity of biological neurons is correlated and the structure of this correlation tends to be rather rigid over long spans of times, even if it hampers performance and learning. We hypothesize that integrating such constraints on the activations of a DNN would improve its adversarial robustness, and, to test this hypothesis, we have developed the Self-Consistent Activation (SCA) layer, which comprises of neurons whose activations are consistent with each other, as they conform to a fixed, but learned, covariability pattern. When evaluated on image and sound recognition tasks, the models with a SCA layer achieved high accuracy, and exhibited significantly greater robustness than multi-layer perceptron models to state-of-the-art Auto-PGD adversarial attacks \textit{without being trained on adversarially perturbed data △ Less

Submitted 7 August, 2023; originally announced August 2023.

arXiv:2308.03866 [pdf, other]

Trusting Language Models in Education

Authors: Jogi Suda Neto, Li Deng, Thejaswi Raya, Reza Shahbazi, Nick Liu, Adhitya Venkatesh, Miral Shah, Neeru Khosla, Rodrigo Capobianco Guido

Abstract: Language Models are being widely used in Education. Even though modern deep learning models achieve very good performance on question-answering tasks, sometimes they make errors. To avoid misleading students by showing wrong answers, it is important to calibrate the confidence - that is, the prediction probability - of these models. In our work, we propose to use an XGBoost on top of BERT to outpu… ▽ More Language Models are being widely used in Education. Even though modern deep learning models achieve very good performance on question-answering tasks, sometimes they make errors. To avoid misleading students by showing wrong answers, it is important to calibrate the confidence - that is, the prediction probability - of these models. In our work, we propose to use an XGBoost on top of BERT to output the corrected probabilities, using features based on the attention mechanism. Our hypothesis is that the level of uncertainty contained in the flow of attention is related to the quality of the model's response itself. △ Less

Submitted 7 August, 2023; originally announced August 2023.

arXiv:2308.01987 [pdf, other]

Bengali Fake Reviews: A Benchmark Dataset and Detection System

Authors: G. M. Shahariar, Md. Tanvir Rouf Shawon, Faisal Muhammad Shah, Mohammad Shafiul Alam, Md. Shahriar Mahbub

Abstract: The proliferation of fake reviews on various online platforms has created a major concern for both consumers and businesses. Such reviews can deceive customers and cause damage to the reputation of products or services, making it crucial to identify them. Although the detection of fake reviews has been extensively studied in English language, detecting fake reviews in non-English languages such as… ▽ More The proliferation of fake reviews on various online platforms has created a major concern for both consumers and businesses. Such reviews can deceive customers and cause damage to the reputation of products or services, making it crucial to identify them. Although the detection of fake reviews has been extensively studied in English language, detecting fake reviews in non-English languages such as Bengali is still a relatively unexplored research area. This paper introduces the Bengali Fake Review Detection (BFRD) dataset, the first publicly available dataset for identifying fake reviews in Bengali. The dataset consists of 7710 non-fake and 1339 fake food-related reviews collected from social media posts. To convert non-Bengali words in a review, a unique pipeline has been proposed that translates English words to their corresponding Bengali meaning and also back transliterates Romanized Bengali to Bengali. We have conducted rigorous experimentation using multiple deep learning and pre-trained transformer language models to develop a reliable detection system. Finally, we propose a weighted ensemble model that combines four pre-trained transformers: BanglaBERT, BanglaBERT Base, BanglaBERT Large, and BanglaBERT Generator . According to the experiment results, the proposed ensemble model obtained a weighted F1-score of 0.9843 on 13390 reviews, including 1339 actual fake reviews and 5356 augmented fake reviews generated with the nlpaug library. The remaining 6695 reviews were randomly selected from the 7710 non-fake instances. The model achieved a 0.9558 weighted F1-score when the fake reviews were augmented using the bnaug library. △ Less

Submitted 4 May, 2024; v1 submitted 3 August, 2023; originally announced August 2023.

arXiv:2308.01472 [pdf, other]

Reverse Stable Diffusion: What prompt was used to generate this image?

Authors: Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, Mubarak Shah

Abstract: Text-to-image diffusion models such as Stable Diffusion have recently attracted the interest of many researchers, and inverting the diffusion process can play an important role in better understanding the generative process and how to engineer prompts in order to obtain the desired images. To this end, we introduce the new task of predicting the text prompt given an image generated by a generative… ▽ More Text-to-image diffusion models such as Stable Diffusion have recently attracted the interest of many researchers, and inverting the diffusion process can play an important role in better understanding the generative process and how to engineer prompts in order to obtain the desired images. To this end, we introduce the new task of predicting the text prompt given an image generated by a generative diffusion model. We combine a series of white-box and black-box models (with and without access to the weights of the diffusion network) to deal with the proposed task. We propose a novel learning framework comprising of a joint prompt regression and multi-label vocabulary classification objective that generates improved prompts. To further improve our method, we employ a curriculum learning procedure that promotes the learning of image-prompt pairs with lower labeling noise (i.e. that are better aligned), and an unsupervised domain-adaptive kernel learning method that uses the similarities between samples in the source and target domains as extra features. We conduct experiments on the DiffusionDB data set, predicting text prompts from images generated by Stable Diffusion. Our novel learning framework produces excellent results on the aforementioned task, yielding the highest gains when applied on the white-box model. In addition, we make an interesting discovery: training a diffusion model on the prompt generation task can make the model generate images that are much better aligned with the input prompts, when the model is directly reused for text-to-image generation. △ Less

Submitted 2 August, 2023; originally announced August 2023.

arXiv:2308.00854 [pdf, other]

Training on Foveated Images Improves Robustness to Adversarial Attacks

Authors: Muhammad A. Shah, Bhiksha Raj

Abstract: Deep neural networks (DNNs) have been shown to be vulnerable to adversarial attacks -- subtle, perceptually indistinguishable perturbations of inputs that change the response of the model. In the context of vision, we hypothesize that an important contributor to the robustness of human visual perception is constant exposure to low-fidelity visual stimuli in our peripheral vision. To investigate th… ▽ More Deep neural networks (DNNs) have been shown to be vulnerable to adversarial attacks -- subtle, perceptually indistinguishable perturbations of inputs that change the response of the model. In the context of vision, we hypothesize that an important contributor to the robustness of human visual perception is constant exposure to low-fidelity visual stimuli in our peripheral vision. To investigate this hypothesis, we develop \RBlur, an image transform that simulates the loss in fidelity of peripheral vision by blurring the image and reducing its color saturation based on the distance from a given fixation point. We show that compared to DNNs trained on the original images, DNNs trained on images transformed by \RBlur are substantially more robust to adversarial attacks, as well as other, non-adversarial, corruptions, achieving up to 25\% higher accuracy on perturbed data. △ Less

Submitted 1 August, 2023; originally announced August 2023.

arXiv:2307.14942 [pdf, other]

A Stochastic Gradient Tracking Algorithm for Decentralized Optimization With Inexact Communication

Authors: Suhail M. Shah, Raghu Bollapragada

Abstract: Decentralized optimization is typically studied under the assumption of noise-free transmission. However, real-world scenarios often involve the presence of noise due to factors such as additive white Gaussian noise channels or probabilistic quantization of transmitted data. These sources of noise have the potential to degrade the performance of decentralized optimization algorithms if not effecti… ▽ More Decentralized optimization is typically studied under the assumption of noise-free transmission. However, real-world scenarios often involve the presence of noise due to factors such as additive white Gaussian noise channels or probabilistic quantization of transmitted data. These sources of noise have the potential to degrade the performance of decentralized optimization algorithms if not effectively addressed. In this paper, we focus on the noisy communication setting and propose an algorithm that bridges the performance gap caused by communication noise while also mitigating other challenges like data heterogeneity. We establish theoretical results of the proposed algorithm that quantify the effect of communication noise and gradient noise on the performance of the algorithm. Notably, our algorithm achieves the optimal convergence rate for minimizing strongly convex, smooth functions in the context of inexact communication and stochastic gradients. Finally, we illustrate the superior performance of the proposed algorithm compared to its state-of-the-art counterparts on machine learning problems using MNIST and CIFAR-10 datasets. △ Less

Submitted 27 July, 2023; originally announced July 2023.

Comments: 34 pages, 4 figures

arXiv:2307.13721 [pdf, other]

Foundational Models Defining a New Era in Vision: A Survey and Outlook

Authors: Muhammad Awais, Muzammal Naseer, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Fahad Shahbaz Khan

Abstract: Vision systems to see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The complex relations between objects and their locations, ambiguities, and variations in the real-world environment can be better described in human language, naturally governed by grammatical rules and other modalities such as audio and depth. The models learned to bridge… ▽ More Vision systems to see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The complex relations between objects and their locations, ambiguities, and variations in the real-world environment can be better described in human language, naturally governed by grammatical rules and other modalities such as audio and depth. The models learned to bridge the gap between such modalities coupled with large-scale training data facilitate contextual reasoning, generalization, and prompt capabilities at test time. These models are referred to as foundational models. The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene or manipulating the robot's behavior through language instructions. In this survey, we provide a comprehensive review of such emerging foundational models, including typical architecture designs to combine different modalities (vision, text, audio, etc), training objectives (contrastive, generative), pre-training datasets, fine-tuning mechanisms, and the common prompting patterns; textual, visual, and heterogeneous. We discuss the open challenges and research directions for foundational models in computer vision, including difficulties in their evaluations and benchmarking, gaps in their real-world understanding, limitations of their contextual understanding, biases, vulnerability to adversarial attacks, and interpretability issues. We review recent developments in this field, covering a wide range of applications of foundation models systematically and comprehensively. A comprehensive list of foundational models studied in this work is available at \url{https://github.com/awaisrauf/Awesome-CV-Foundational-Models}. △ Less

Submitted 25 July, 2023; originally announced July 2023.

Comments: Project page: https://github.com/awaisrauf/Awesome-CV-Foundational-Models

arXiv:2307.11671 [pdf, other]

Adapted poling to break the nonlinear efficiency limit in nanophotonic lithium niobate waveguides

Authors: Pao-Kang Chen, Ian Briggs, Chaohan Cui, Liang Zhang, Manav Shah, Linran Fan

Abstract: Nonlinear frequency mixing is of critical importance in extending the wavelength range of optical sources. It is also indispensable for emerging applications such as quantum information and photonic signal processing. Conventional lithium niobate with periodic poling is the most widely used device for frequency mixing due to the strong second-order nonlinearity. The recent development of nanophoto… ▽ More Nonlinear frequency mixing is of critical importance in extending the wavelength range of optical sources. It is also indispensable for emerging applications such as quantum information and photonic signal processing. Conventional lithium niobate with periodic poling is the most widely used device for frequency mixing due to the strong second-order nonlinearity. The recent development of nanophotonic lithium niobate waveguides promises improvements of nonlinear efficiencies by orders of magnitude with sub-wavelength optical conferment. However, the intrinsic nanoscale inhomogeneity in nanophotonic lithium niobate limits the coherent interaction length, leading to low nonlinear efficiencies. Therefore, the performance of nanophotonic lithium niobate waveguides is still far behind conventional counterparts. Here, we overcome this limitation and demonstrate ultra-efficient second order nonlinearity in nanophotonic lithium niobate waveguides significantly outperforming conventional crystals. This is realized by developing the adapted poling approach to eliminate the impact of nanoscale inhomogeneity in nanophotonic lithium niobate waveguides. We realize overall secondharmonic efficiency near 10^4 %/W without cavity enhancement, which saturates the theoretical limit. Phase-matching bandwidths and temperature tunability are improved through dispersion engineering. The ideal square dependence of the nonlinear efficiency on the waveguide length is recovered. We also break the trade-off between the energy conversion ratio and pump power. A conversion ratio over 80% is achieved in the single-pass configuration with pump power as low as 20 mW. △ Less

Submitted 21 July, 2023; originally announced July 2023.

arXiv:2307.07395 [pdf, other]

Flexible Beamforming in B5G for Improving Tethered UAV Coverage over Smart Environments

Authors: Abdu Saif, Nor Shahida Mohd Shah, Soreen Ameen Fattah, Saeed Hamood Alsamhi, Santosh Kumar, Ali Saad Al khuraib

Abstract: Unmanned Aerial Vehicles (UAVs) are being used for wireless communications in smart environments. However, the need for mobility, scalability of data transmission over wide areas, and the required coverage area make UAV beamforming essential for better coverage and user experience. To this end, we propose a flexible beamforming approach to improve tethered UAV coverage quality and maximize the num… ▽ More Unmanned Aerial Vehicles (UAVs) are being used for wireless communications in smart environments. However, the need for mobility, scalability of data transmission over wide areas, and the required coverage area make UAV beamforming essential for better coverage and user experience. To this end, we propose a flexible beamforming approach to improve tethered UAV coverage quality and maximize the number of users experiencing the minimum required rate in any target environment. Our solution demonstrates a significant achievement in flexible beamforming in smart environments, including urban, suburban, dense, and high-rise urban. Furthermore, the beamforming gain is mainly concentrated in the target to improve the coverage area based on various scenarios. Simulation results show that the proposed approach can achieve a significantly received flexible power beam that focuses the transmitted signal towards the receiver and improves received power by reducing signal power spread. In the case of no beamforming, signal power spreads out as distance increases, reducing the signal strength. Furthermore, our proposed solution is suitable for improving UAV coverage and reliability in smart and harsh environments. △ Less

Submitted 14 July, 2023; originally announced July 2023.

Comments: 6 pages, 7 figures

arXiv:2307.07269 [pdf, other]

Frequency Domain Adversarial Training for Robust Volumetric Medical Segmentation

Authors: Asif Hanif, Muzammal Naseer, Salman Khan, Mubarak Shah, Fahad Shahbaz Khan

Abstract: It is imperative to ensure the robustness of deep learning models in critical applications such as, healthcare. While recent advances in deep learning have improved the performance of volumetric medical image segmentation models, these models cannot be deployed for real-world applications immediately due to their vulnerability to adversarial attacks. We present a 3D frequency domain adversarial at… ▽ More It is imperative to ensure the robustness of deep learning models in critical applications such as, healthcare. While recent advances in deep learning have improved the performance of volumetric medical image segmentation models, these models cannot be deployed for real-world applications immediately due to their vulnerability to adversarial attacks. We present a 3D frequency domain adversarial attack for volumetric medical image segmentation models and demonstrate its advantages over conventional input or voxel domain attacks. Using our proposed attack, we introduce a novel frequency domain adversarial training approach for optimizing a robust model against voxel and frequency domain attacks. Moreover, we propose frequency consistency loss to regulate our frequency domain adversarial training that achieves a better tradeoff between model's performance on clean and adversarial samples. Code is publicly available at https://github.com/asif-hanif/vafa. △ Less

Submitted 20 July, 2023; v1 submitted 14 July, 2023; originally announced July 2023.

Comments: This paper has been accepted in MICCAI 2023 conference

arXiv:2307.06947 [pdf, other]

Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition

Authors: Syed Talal Wasim, Muhammad Uzair Khattak, Muzammal Naseer, Salman Khan, Mubarak Shah, Fahad Shahbaz Khan

Abstract: Recent video recognition models utilize Transformer models for long-range spatio-temporal context modeling. Video transformer designs are based on self-attention that can model global context at a high computational cost. In comparison, convolutional designs for videos offer an efficient alternative but lack long-range dependency modeling. Towards achieving the best of both designs, this work prop… ▽ More Recent video recognition models utilize Transformer models for long-range spatio-temporal context modeling. Video transformer designs are based on self-attention that can model global context at a high computational cost. In comparison, convolutional designs for videos offer an efficient alternative but lack long-range dependency modeling. Towards achieving the best of both designs, this work proposes Video-FocalNet, an effective and efficient architecture for video recognition that models both local and global contexts. Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention for better efficiency. Further, the aggregation step and the interaction step are both implemented using efficient convolution and element-wise multiplication operations that are computationally less expensive than their self-attention counterparts on video representations. We extensively explore the design space of focal modulation-based spatio-temporal context modeling and demonstrate our parallel spatial and temporal encoding design to be the optimal choice. Video-FocalNets perform favorably well against the state-of-the-art transformer-based models for video recognition on five large-scale datasets (Kinetics-400, Kinetics-600, SS-v2, Diving-48, and ActivityNet-1.3) at a lower computational cost. Our code/models are released at https://github.com/TalalWasim/Video-FocalNets. △ Less

Submitted 27 October, 2023; v1 submitted 13 July, 2023; originally announced July 2023.

Comments: Accepted to ICCV-2023. Camera-Ready version. Project page: https://TalalWasim.github.io/Video-FocalNets/

arXiv:2306.12041 [pdf, other]

Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors

Authors: Nicolae-Catalin Ristea, Florinel-Alin Croitoru, Radu Tudor Ionescu, Marius Popescu, Fahad Shahbaz Khan, Mubarak Shah

Abstract: We propose an efficient abnormal event detection model based on a lightweight masked auto-encoder (AE) applied at the video frame level. The novelty of the proposed model is threefold. First, we introduce an approach to weight tokens based on motion gradients, thus shifting the focus from the static background scene to the foreground objects. Second, we integrate a teacher decoder and a student de… ▽ More We propose an efficient abnormal event detection model based on a lightweight masked auto-encoder (AE) applied at the video frame level. The novelty of the proposed model is threefold. First, we introduce an approach to weight tokens based on motion gradients, thus shifting the focus from the static background scene to the foreground objects. Second, we integrate a teacher decoder and a student decoder into our architecture, leveraging the discrepancy between the outputs given by the two decoders to improve anomaly detection. Third, we generate synthetic abnormal events to augment the training videos, and task the masked AE model to jointly reconstruct the original frames (without anomalies) and the corresponding pixel-level anomaly maps. Our design leads to an efficient and effective model, as demonstrated by the extensive experiments carried out on four benchmarks: Avenue, ShanghaiTech, UBnormal and UCSD Ped2. The empirical results show that our model achieves an excellent trade-off between speed and accuracy, obtaining competitive AUC scores, while processing 1655 FPS. Hence, our model is between 8 and 70 times faster than competing methods. We also conduct an ablation study to justify our design. Our code is freely available at: https://github.com/ristea/aed-mae. △ Less

Submitted 9 March, 2024; v1 submitted 21 June, 2023; originally announced June 2023.

Comments: Accepted at CVPR 2024

arXiv:2306.09239 [pdf, ps, other]

Exploiting the Brain's Network Structure for Automatic Identification of ADHD Subjects

Authors: Soumyabrata Dey, Ravishankar Rao, Mubarak Shah

Abstract: Attention Deficit Hyperactive Disorder (ADHD) is a common behavioral problem affecting children. In this work, we investigate the automatic classification of ADHD subjects using the resting state Functional Magnetic Resonance Imaging (fMRI) sequences of the brain. We show that the brain can be modeled as a functional network, and certain properties of the networks differ in ADHD subjects from cont… ▽ More Attention Deficit Hyperactive Disorder (ADHD) is a common behavioral problem affecting children. In this work, we investigate the automatic classification of ADHD subjects using the resting state Functional Magnetic Resonance Imaging (fMRI) sequences of the brain. We show that the brain can be modeled as a functional network, and certain properties of the networks differ in ADHD subjects from control subjects. We compute the pairwise correlation of brain voxels' activity over the time frame of the experimental protocol which helps to model the function of a brain as a network. Different network features are computed for each of the voxels constructing the network. The concatenation of the network features of all the voxels in a brain serves as the feature vector. Feature vectors from a set of subjects are then used to train a PCA-LDA (principal component analysis-linear discriminant analysis) based classifier. We hypothesized that ADHD-related differences lie in some specific regions of the brain and using features only from those regions is sufficient to discriminate ADHD and control subjects. We propose a method to create a brain mask that includes the useful regions only and demonstrate that using the feature from the masked regions improves classification accuracy on the test data set. We train our classifier with 776 subjects and test on 171 subjects provided by The Neuro Bureau for the ADHD-200 challenge. We demonstrate the utility of graph-motif features, specifically the maps that represent the frequency of participation of voxels in network cycles of length 3. The best classification performance (69.59%) is achieved using 3-cycle map features with masking. Our proposed approach holds promise in being able to diagnose and understand the disorder. △ Less

Submitted 15 June, 2023; originally announced June 2023.

arXiv:2306.09209 [pdf, ps, other]

Bayesian Game Formulation of Power Allocation in Multiple Access Wiretap Channel with Incomplete CSI

Authors: Basharat Rashid, Majed Haddad, Shahid Mehraj Shah

Abstract: In this paper, we address the problem of distributed power allocation in a $K$ user fading multiple access wiretap channel, where global channel state information is limited, i.e., each user has knowledge of their own channel state with respect to Bob and Eve but only knows the distribution of other users' channel states. We model this problem as a Bayesian game, where each user is assumed to self… ▽ More In this paper, we address the problem of distributed power allocation in a $K$ user fading multiple access wiretap channel, where global channel state information is limited, i.e., each user has knowledge of their own channel state with respect to Bob and Eve but only knows the distribution of other users' channel states. We model this problem as a Bayesian game, where each user is assumed to selfishly maximize his average \emph{secrecy capacity} with partial channel state information. In this work, we first prove that there is a unique Bayesian equilibrium in the proposed game. Additionally, the price of anarchy is calculated to measure the efficiency of the equilibrium solution. We also propose a fast convergent iterative algorithm for power allocation. Finally, the results are validated using simulation results. △ Less

Submitted 4 September, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

Comments: 7 Pages, 2 Figures, submitted for possible publication

arXiv:2306.06453 [pdf, other]

Isometric models of the Funk disc and the Busemann function

Authors: Ashok Kumar, Hemangi Madhusudan Shah, Bankteshwar Tiwari

Abstract: In this article, we find three isometric models of the Funk disc: Finsler upper half of the hyperboloid of two sheets model, the Finsler band model and the Finsler upper hemi sphere model; and we also find two new models of the Finsler-Poincaré disc. We explicitly describe the geodesics in each model. Moreover, we compute the Busemann function and consequently describe the horocycles in the Funk a… ▽ More In this article, we find three isometric models of the Funk disc: Finsler upper half of the hyperboloid of two sheets model, the Finsler band model and the Finsler upper hemi sphere model; and we also find two new models of the Finsler-Poincaré disc. We explicitly describe the geodesics in each model. Moreover, we compute the Busemann function and consequently describe the horocycles in the Funk and the Hilbert disc. Finally, we prove the asymptotic harmonicity of the Funk disc. We also show that, the concept of asymptotic harmonicity of the Finsler manifolds {\it tacitly} depends on the measure, in {\it contrast} to the Riemannian case. △ Less

Submitted 10 June, 2023; originally announced June 2023.

arXiv:2306.06325 [pdf, other]

Explaining a machine learning decision to physicians via counterfactuals

Authors: Supriya Nagesh, Nina Mishra, Yonatan Naamad, James M. Rehg, Mehul A. Shah, Alexei Wagner

Abstract: Machine learning models perform well on several healthcare tasks and can help reduce the burden on the healthcare system. However, the lack of explainability is a major roadblock to their adoption in hospitals. \textit{How can the decision of an ML model be explained to a physician?} The explanations considered in this paper are counterfactuals (CFs), hypothetical scenarios that would have resulte… ▽ More Machine learning models perform well on several healthcare tasks and can help reduce the burden on the healthcare system. However, the lack of explainability is a major roadblock to their adoption in hospitals. \textit{How can the decision of an ML model be explained to a physician?} The explanations considered in this paper are counterfactuals (CFs), hypothetical scenarios that would have resulted in the opposite outcome. Specifically, time-series CFs are investigated, inspired by the way physicians converse and reason out decisions `I would have given the patient a vasopressor if their blood pressure was lower and falling'. Key properties of CFs that are particularly meaningful in clinical settings are outlined: physiological plausibility, relevance to the task and sparse perturbations. Past work on CF generation does not satisfy these properties, specifically plausibility in that realistic time-series CFs are not generated. A variational autoencoder (VAE)-based approach is proposed that captures these desired properties. The method produces CFs that improve on prior approaches quantitatively (more plausible CFs as evaluated by their likelihood w.r.t original data distribution, and 100$\times$ faster at generating CFs) and qualitatively (2$\times$ more plausible and relevant) as evaluated by three physicians. △ Less

Submitted 9 June, 2023; originally announced June 2023.

arXiv:2305.17033 [pdf, other]

The Brain Tumor Segmentation (BraTS) Challenge 2023: Focus on Pediatrics (CBTN-CONNECT-DIPGR-ASNR-MICCAI BraTS-PEDs)

Authors: Anahita Fathi Kazerooni, Nastaran Khalili, Xinyang Liu, Debanjan Haldar, Zhifan Jiang, Syed Muhammed Anwar, Jake Albrecht, Maruf Adewole, Udunna Anazodo, Hannah Anderson, Sina Bagheri, Ujjwal Baid, Timothy Bergquist, Austin J. Borja, Evan Calabrese, Verena Chung, Gian-Marco Conte, Farouk Dako, James Eddy, Ivan Ezhov, Ariana Familiar, Keyvan Farahani, Shuvanjan Haldar, Juan Eugenio Iglesias, Anastasia Janas , et al. (48 additional authors not shown)

Abstract: Pediatric tumors of the central nervous system are the most common cause of cancer-related death in children. The five-year survival rate for high-grade gliomas in children is less than 20\%. Due to their rarity, the diagnosis of these entities is often delayed, their treatment is mainly based on historic treatment concepts, and clinical trials require multi-institutional collaborations. The MICCA… ▽ More Pediatric tumors of the central nervous system are the most common cause of cancer-related death in children. The five-year survival rate for high-grade gliomas in children is less than 20\%. Due to their rarity, the diagnosis of these entities is often delayed, their treatment is mainly based on historic treatment concepts, and clinical trials require multi-institutional collaborations. The MICCAI Brain Tumor Segmentation (BraTS) Challenge is a landmark community benchmark event with a successful history of 12 years of resource creation for the segmentation and analysis of adult glioma. Here we present the CBTN-CONNECT-DIPGR-ASNR-MICCAI BraTS-PEDs 2023 challenge, which represents the first BraTS challenge focused on pediatric brain tumors with data acquired across multiple international consortia dedicated to pediatric neuro-oncology and clinical trials. The BraTS-PEDs 2023 challenge focuses on benchmarking the development of volumentric segmentation algorithms for pediatric brain glioma through standardized quantitative performance evaluation metrics utilized across the BraTS 2023 cluster of challenges. Models gaining knowledge from the BraTS-PEDs multi-parametric structural MRI (mpMRI) training data will be evaluated on separate validation and unseen test mpMRI dataof high-grade pediatric glioma. The CBTN-CONNECT-DIPGR-ASNR-MICCAI BraTS-PEDs 2023 challenge brings together clinicians and AI/imaging scientists to lead to faster development of automated segmentation techniques that could benefit clinical trials, and ultimately the care of children with brain tumors. △ Less

Submitted 23 May, 2024; v1 submitted 26 May, 2023; originally announced May 2023.

arXiv:2305.06597 [pdf, ps, other]

Nature of Some Solitons on Almost coKähler Manifolds and Asymptotically Harmonic Manifolds

Authors: Paritosh Ghosh, Hemangi Madhusudan Shah, Arindam Bhattacharyya

Abstract: In this research, we study the nature of $η$-Einstein and gradient $η$-Einstein soliton in the framework of almost coKähler manifolds and $(κ, μ)$-almost coKähler manifolds. We find some expressions for scalar curvature of the almost coKähler manifold admitting $η$-Einstein soliton in various cases. We also prove that if a $(κ, μ)$-almost coKähler manifold admits a gradient $η$-Einstein soliton, t… ▽ More In this research, we study the nature of $η$-Einstein and gradient $η$-Einstein soliton in the framework of almost coKähler manifolds and $(κ, μ)$-almost coKähler manifolds. We find some expressions for scalar curvature of the almost coKähler manifold admitting $η$-Einstein soliton in various cases. We also prove that if a $(κ, μ)$-almost coKähler manifold admits a gradient $η$-Einstein soliton, then either the manifold is coKähler, or $N(κ)$-almost coKahler, or the soliton is trivial. We present an example which validates our results. Finally, we investigate the asymptotically harmonic manifolds admitting non-trivial Ricci solitons and show that they exhibit rigid behavior if for example, scalar curvature attains maximum. △ Less

Submitted 11 May, 2023; originally announced May 2023.

MSC Class: 53B40; 53B20; 53C25; 53D15

arXiv:2304.08682 [pdf, other]

Learning Situation Hyper-Graphs for Video Question Answering

Authors: Aisha Urooj Khan, Hilde Kuehne, Bo Wu, Kim Chheu, Walid Bousselham, Chuang Gan, Niels Lobo, Mubarak Shah

Abstract: Answering questions about complex situations in videos requires not only capturing the presence of actors, objects, and their relations but also the evolution of these relationships over time. A situation hyper-graph is a representation that describes situations as scene sub-graphs for video frames and hyper-edges for connected sub-graphs and has been proposed to capture all such information in a… ▽ More Answering questions about complex situations in videos requires not only capturing the presence of actors, objects, and their relations but also the evolution of these relationships over time. A situation hyper-graph is a representation that describes situations as scene sub-graphs for video frames and hyper-edges for connected sub-graphs and has been proposed to capture all such information in a compact structured form. In this work, we propose an architecture for Video Question Answering (VQA) that enables answering questions related to video content by predicting situation hyper-graphs, coined Situation Hyper-Graph based Video Question Answering (SHG-VQA). To this end, we train a situation hyper-graph decoder to implicitly identify graph representations with actions and object/human-object relationships from the input video clip. and to use cross-attention between the predicted situation hyper-graphs and the question embedding to predict the correct answer. The proposed method is trained in an end-to-end manner and optimized by a VQA loss with the cross-entropy function and a Hungarian matching loss for the situation graph prediction. The effectiveness of the proposed architecture is extensively evaluated on two challenging benchmarks: AGQA and STAR. Our results show that learning the underlying situation hyper-graphs helps the system to significantly improve its performance for novel challenges of video question-answering tasks. △ Less

Submitted 6 May, 2023; v1 submitted 17 April, 2023; originally announced April 2023.

arXiv:2304.03410 [pdf, other]

$R^{2}$Former: Unified $R$etrieval and $R$eranking Transformer for Place Recognition

Authors: Sijie Zhu, Linjie Yang, Chen Chen, Mubarak Shah, Xiaohui Shen, Heng Wang

Abstract: Visual Place Recognition (VPR) estimates the location of query images by matching them with images in a reference database. Conventional methods generally adopt aggregated CNN features for global retrieval and RANSAC-based geometric verification for reranking. However, RANSAC only employs geometric information but ignores other possible information that could be useful for reranking, e.g. local fe… ▽ More Visual Place Recognition (VPR) estimates the location of query images by matching them with images in a reference database. Conventional methods generally adopt aggregated CNN features for global retrieval and RANSAC-based geometric verification for reranking. However, RANSAC only employs geometric information but ignores other possible information that could be useful for reranking, e.g. local feature correlations, and attention values. In this paper, we propose a unified place recognition framework that handles both retrieval and reranking with a novel transformer model, named $R^{2}$Former. The proposed reranking module takes feature correlation, attention value, and xy coordinates into account, and learns to determine whether the image pair is from the same location. The whole pipeline is end-to-end trainable and the reranking module alone can also be adopted on other CNN or transformer backbones as a generic component. Remarkably, $R^{2}$Former significantly outperforms state-of-the-art methods on major VPR datasets with much less inference time and memory consumption. It also achieves the state-of-the-art on the hold-out MSLS challenge set and could serve as a simple yet strong solution for real-world large-scale applications. Experiments also show vision transformer tokens are comparable and sometimes better than CNN local features on local matching. The code is released at https://github.com/Jeff-Zilence/R2Former. △ Less

Submitted 6 April, 2023; originally announced April 2023.

Comments: CVPR

arXiv:2304.03307 [pdf, other]

Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting

Authors: Syed Talal Wasim, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, Mubarak Shah

Abstract: Adopting contrastive image-text pretrained models like CLIP towards video classification has gained attention due to its cost-effectiveness and competitive performance. However, recent works in this area face a trade-off. Finetuning the pretrained model to achieve strong supervised performance results in low zero-shot generalization. Similarly, freezing the backbone to retain zero-shot capability… ▽ More Adopting contrastive image-text pretrained models like CLIP towards video classification has gained attention due to its cost-effectiveness and competitive performance. However, recent works in this area face a trade-off. Finetuning the pretrained model to achieve strong supervised performance results in low zero-shot generalization. Similarly, freezing the backbone to retain zero-shot capability causes significant drop in supervised accuracy. Because of this, recent works in literature typically train separate models for supervised and zero-shot action recognition. In this work, we propose a multimodal prompt learning scheme that works to balance the supervised and zero-shot performance under a single unified training. Our prompting approach on the vision side caters for three aspects: 1) Global video-level prompts to model the data distribution; 2) Local frame-level prompts to provide per-frame discriminative conditioning; and 3) a summary prompt to extract a condensed video representation. Additionally, we define a prompting scheme on the text side to augment the textual context. Through this prompting scheme, we can achieve state-of-the-art zero-shot performance on Kinetics-600, HMDB51 and UCF101 while remaining competitive in the supervised setting. By keeping the pretrained backbone frozen, we optimize a much lower number of parameters and retain the existing general representation which helps achieve the strong zero-shot performance. Our codes/models are released at https://github.com/TalalWasim/Vita-CLIP. △ Less

Submitted 6 April, 2023; originally announced April 2023.

Comments: Accepted at CVPR-2023. Codes/models available at https://github.com/TalalWasim/Vita-CLIP

arXiv:2304.02739 [pdf]

doi 10.1109/ICNLP58431.2023.00011

Bengali Fake Review Detection using Semi-supervised Generative Adversarial Networks

Authors: Md. Tanvir Rouf Shawon, G. M. Shahariar, Faisal Muhammad Shah, Mohammad Shafiul Alam, Md. Shahriar Mahbub

Abstract: This paper investigates the potential of semi-supervised Generative Adversarial Networks (GANs) to fine-tune pretrained language models in order to classify Bengali fake reviews from real reviews with a few annotated data. With the rise of social media and e-commerce, the ability to detect fake or deceptive reviews is becoming increasingly important in order to protect consumers from being misled… ▽ More This paper investigates the potential of semi-supervised Generative Adversarial Networks (GANs) to fine-tune pretrained language models in order to classify Bengali fake reviews from real reviews with a few annotated data. With the rise of social media and e-commerce, the ability to detect fake or deceptive reviews is becoming increasingly important in order to protect consumers from being misled by false information. Any machine learning model will have trouble identifying a fake review, especially for a low resource language like Bengali. We have demonstrated that the proposed semi-supervised GAN-LM architecture (generative adversarial network on top of a pretrained language model) is a viable solution in classifying Bengali fake reviews as the experimental results suggest that even with only 1024 annotated samples, BanglaBERT with semi-supervised GAN (SSGAN) achieved an accuracy of 83.59% and a f1-score of 84.89% outperforming other pretrained language models - BanglaBERT generator, Bangla BERT Base and Bangla-Electra by almost 3%, 4% and 10% respectively in terms of accuracy. The experiments were conducted on a manually labeled food review dataset consisting of total 6014 real and fake reviews collected from various social media groups. Researchers that are experiencing difficulty recognizing not just fake reviews but other classification issues owing to a lack of labeled data may find a solution in our proposed methodology. △ Less

Submitted 5 April, 2023; originally announced April 2023.

arXiv:2304.01200 [pdf, other]

Video Instance Segmentation in an Open-World

Authors: Omkar Thawakar, Sanath Narayan, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Jorma Laaksonen, Mubarak Shah, Fahad Shahbaz Khan

Abstract: Existing video instance segmentation (VIS) approaches generally follow a closed-world assumption, where only seen category instances are identified and spatio-temporally segmented at inference. Open-world formulation relaxes the close-world static-learning assumption as follows: (a) first, it distinguishes a set of known categories as well as labels an unknown object as `unknown' and then (b) it i… ▽ More Existing video instance segmentation (VIS) approaches generally follow a closed-world assumption, where only seen category instances are identified and spatio-temporally segmented at inference. Open-world formulation relaxes the close-world static-learning assumption as follows: (a) first, it distinguishes a set of known categories as well as labels an unknown object as `unknown' and then (b) it incrementally learns the class of an unknown as and when the corresponding semantic labels become available. We propose the first open-world VIS approach, named OW-VISFormer, that introduces a novel feature enrichment mechanism and a spatio-temporal objectness (STO) module. The feature enrichment mechanism based on a light-weight auxiliary network aims at accurate pixel-level (unknown) object delineation from the background as well as distinguishing category-specific known semantic classes. The STO module strives to generate instance-level pseudo-labels by enhancing the foreground activations through a contrastive loss. Moreover, we also introduce an extensive experimental protocol to measure the characteristics of OW-VIS. Our OW-VISFormer performs favorably against a solid baseline in OW-VIS setting. Further, we evaluate our contributions in the standard fully-supervised VIS setting by integrating them into the recent SeqFormer, achieving an absolute gain of 1.6\% AP on Youtube-VIS 2019 val. set. Lastly, we show the generalizability of our contributions for the open-world detection (OWOD) setting, outperforming the best existing OWOD method in the literature. Code, models along with OW-VIS splits are available at \url{https://github.com/OmkarThawakar/OWVISFormer}. △ Less

Submitted 3 April, 2023; originally announced April 2023.

Comments: 9 pages, 5 figures

arXiv:2303.17959 [pdf, other]

Diffusion Action Segmentation

Authors: Daochang Liu, Qiyue Li, AnhDung Dinh, Tingting Jiang, Mubarak Shah, Chang Xu

Abstract: Temporal action segmentation is crucial for understanding long-form videos. Previous works on this task commonly adopt an iterative refinement paradigm by using multi-stage models. We propose a novel framework via denoising diffusion models, which nonetheless shares the same inherent spirit of such iterative refinement. In this framework, action predictions are iteratively generated from random no… ▽ More Temporal action segmentation is crucial for understanding long-form videos. Previous works on this task commonly adopt an iterative refinement paradigm by using multi-stage models. We propose a novel framework via denoising diffusion models, which nonetheless shares the same inherent spirit of such iterative refinement. In this framework, action predictions are iteratively generated from random noise with input video features as conditions. To enhance the modeling of three striking characteristics of human actions, including the position prior, the boundary ambiguity, and the relational dependency, we devise a unified masking strategy for the conditioning inputs in our framework. Extensive experiments on three benchmark datasets, i.e., GTEA, 50Salads, and Breakfast, are performed and the proposed method achieves superior or comparable results to state-of-the-art methods, showing the effectiveness of a generative approach for action segmentation. △ Less

Submitted 11 August, 2023; v1 submitted 31 March, 2023; originally announced March 2023.

Comments: ICCV 2023

arXiv:2303.16268 [pdf, other]

TimeBalance: Temporally-Invariant and Temporally-Distinctive Video Representations for Semi-Supervised Action Recognition

Authors: Ishan Rajendrakumar Dave, Mamshad Nayeem Rizve, Chen Chen, Mubarak Shah

Abstract: Semi-Supervised Learning can be more beneficial for the video domain compared to images because of its higher annotation cost and dimensionality. Besides, any video understanding task requires reasoning over both spatial and temporal dimensions. In order to learn both the static and motion related features for the semi-supervised action recognition task, existing methods rely on hard input inducti… ▽ More Semi-Supervised Learning can be more beneficial for the video domain compared to images because of its higher annotation cost and dimensionality. Besides, any video understanding task requires reasoning over both spatial and temporal dimensions. In order to learn both the static and motion related features for the semi-supervised action recognition task, existing methods rely on hard input inductive biases like using two-modalities (RGB and Optical-flow) or two-stream of different playback rates. Instead of utilizing unlabeled videos through diverse input streams, we rely on self-supervised video representations, particularly, we utilize temporally-invariant and temporally-distinctive representations. We observe that these representations complement each other depending on the nature of the action. Based on this observation, we propose a student-teacher semi-supervised learning framework, TimeBalance, where we distill the knowledge from a temporally-invariant and a temporally-distinctive teacher. Depending on the nature of the unlabeled video, we dynamically combine the knowledge of these two teachers based on a novel temporal similarity-based reweighting scheme. Our method achieves state-of-the-art performance on three action recognition benchmarks: UCF101, HMDB51, and Kinetics400. Code: https://github.com/DAVEISHAN/TimeBalance △ Less

Submitted 28 March, 2023; originally announced March 2023.

Comments: CVPR-2023

Showing 51–100 of 421 results for author: Shah, M