-
CM2-Net: Continual Cross-Modal Mapping Network for Driver Action Recognition
Authors:
Ruoyu Wang,
Chen Cai,
Wenqian Wang,
Jianjun Gao,
Dan Lin,
Wenyang Liu,
Kim-Hui Yap
Abstract:
Driver action recognition has significantly advanced in enhancing driver-vehicle interactions and ensuring driving safety by integrating multiple modalities, such as infrared and depth. Nevertheless, compared to RGB modality only, it is always laborious and costly to collect extensive data for all types of non-RGB modalities in car cabin environments. Therefore, previous works have suggested independently learning each non-RGB modality by fine-tuning a model pre-trained on RGB videos, but these methods are less effective in extracting informative features when faced with newly-incoming modalities due to large domain gaps. In contrast, we propose a Continual Cross-Modal Mapping Network (CM2-Net) to continually learn each newly-incoming modality with instructive prompts from the previously-learned modalities. Specifically, we have developed Accumulative Cross-modal Mapping Prompting (ACMP), to map the discriminative and informative features learned from previous modalities into the feature space of newly-incoming modalities. Then, when faced with newly-incoming modalities, these mapped features are able to provide effective prompts for which features should be extracted and prioritized. These prompts are accumulating throughout the continual learning process, thereby boosting further recognition performances. Extensive experiments conducted on the Drive&Act dataset demonstrate the performance superiority of CM2-Net on both uni- and multi-modal driver action recognition.
Submitted 18 June, 2024; v1 submitted 17 June, 2024;
originally announced June 2024.
-
Video sentence grounding with temporally global textual knowledge
Authors:
Cai Chen,
Runzhong Zhang,
Jianjun Gao,
Kejun Wu,
Kim-Hui Yap,
Yi Wang
Abstract:
Temporal sentence grounding involves the retrieval of a video moment with a natural language query. Many existing works directly incorporate the given video and temporally localized query for temporal grounding, overlooking the inherent domain gap between different modalities. In this paper, we utilize pseudo-query features containing extensive temporally global textual knowledge sourced from the same video-query pair, to enhance the bridging of domain gaps and attain a heightened level of similarity between multi-modal features. Specifically, we propose a Pseudo-query Intermediary Network (PIN) to achieve an improved alignment of visual and comprehensive pseudo-query features within the feature space through contrastive learning. Subsequently, we utilize learnable prompts to encapsulate the knowledge of pseudo-queries, propagating them into the textual encoder and multi-modal fusion module, further enhancing the feature alignment between visual and language for better temporal grounding. Extensive experiments conducted on the Charades-STA and ActivityNet-Captions datasets demonstrate the effectiveness of our method.
Submitted 1 June, 2024; v1 submitted 21 April, 2024;
originally announced April 2024.
-
Multi-modality action recognition based on dual feature shift in vehicle cabin monitoring
Authors:
Dan Lin,
Philip Hann Yung Lee,
Yiming Li,
Ruoyu Wang,
Kim-Hui Yap,
Bingbing Li,
You Shing Ngim
Abstract:
Driver Action Recognition (DAR) is crucial in vehicle cabin monitoring systems. In real-world applications, it is common for vehicle cabins to be equipped with cameras featuring different modalities. However, multi-modality fusion strategies for the DAR task within car cabins have rarely been studied. In this paper, we propose a novel yet efficient multi-modality driver action recognition method based on dual feature shift, named DFS. DFS first integrates complementary features across modalities by performing modality feature interaction. Meanwhile, DFS achieves the neighbour feature propagation within single modalities, by feature shifting among temporal frames. To learn common patterns and improve model efficiency, DFS shares feature extracting stages among multiple modalities. Extensive experiments have been carried out to verify the effectiveness of the proposed DFS model on the Drive&Act dataset. The results demonstrate that DFS achieves good performance and improves the efficiency of multi-modality driver action recognition.
Submitted 26 January, 2024;
originally announced January 2024.
-
Octopus: A Fair Packet Delivery Service
Authors:
Junzhi Gong,
Yuliang Li,
Devdeep Ray,
KK Yap,
Nandita Dukkipati
Abstract:
The packet delivery fairness is critical in many applications in the cloud, such as exchange systems, consensus protocols, and online gaming applications. However, due to nonidentical and dynamic packet forwarding paths, as well as many in-network queuing delays, supporting packet delivery fairness is challenging in a shared compute environment. In this paper, we present Octopus, the first general fair packet delivery service to achieve packet arrival time variations smaller than tens of nanoseconds, with the existence of latency variations in the network. The key ideas of Octopus to support such good fairness come from repurposing hardware traffic shaping capabilities in modern NICs, and deploying agents at local SmartNICs to minimize latency variations from packet forwarding. Evaluation results show that Octopus has less than 40 ns unfairness for up to 99.97% multicast packets.
Submitted 16 January, 2024;
originally announced January 2024.
-
Learning-Based Biharmonic Augmentation for Point Cloud Classification
Authors:
Jiacheng Wei,
Guosheng Lin,
Henghui Ding,
Jie Hu,
Kim-Hui Yap
Abstract:
Point cloud datasets often suffer from inadequate sample sizes in comparison to image datasets, making data augmentation challenging. While traditional methods, like rigid transformations and scaling, have limited potential in increasing dataset diversity due to their constraints on altering individual sample shapes, we introduce the Biharmonic Augmentation (BA) method. BA is a novel and efficient data augmentation technique that diversifies point cloud data by imposing smooth non-rigid deformations on existing 3D structures. This approach calculates biharmonic coordinates for the deformation function and learns diverse deformation prototypes. Utilizing a CoefNet, our method predicts coefficients to amalgamate these prototypes, ensuring comprehensive deformation. Moreover, we present AdvTune, an advanced online augmentation system that integrates adversarial training. This system synergistically refines the CoefNet and the classification network, facilitating the automated creation of adaptive shape deformations contingent on the learner status. Comprehensive experimental analysis validates the superiority of Biharmonic Augmentation, showcasing notable performance improvements over prevailing point cloud augmentation techniques across varied network designs.
Submitted 10 November, 2023;
originally announced November 2023.
-
Bitstream-Corrupted Video Recovery: A Novel Benchmark Dataset and Method
Authors:
Tianyi Liu,
Kejun Wu,
Yi Wang,
Wenyang Liu,
Kim-Hui Yap,
Lap-Pui Chau
Abstract:
The past decade has witnessed great strides in video recovery by specialist technologies, like video inpainting, completion, and error concealment. However, they typically simulate the missing content by manual-designed error masks, thus failing to fill in the realistic video loss in video communication (e.g., telepresence, live streaming, and internet video) and multimedia forensics. To address this, we introduce the bitstream-corrupted video (BSCV) benchmark, the first benchmark dataset with more than 28,000 video clips, which can be used for bitstream-corrupted video recovery in the real world. The BSCV is a collection of 1) a proposed three-parameter corruption model for video bitstream, 2) a large-scale dataset containing rich error patterns, multiple corruption levels, and flexible dataset branches, and 3) a plug-and-play module in video recovery framework that serves as a benchmark. We evaluate state-of-the-art video inpainting methods on the BSCV dataset, demonstrating existing approaches' limitations and our framework's advantages in solving the bitstream-corrupted video recovery problem. The benchmark and dataset are released at https://github.com/LIUTIGHE/BSCV-Dataset.
Submitted 26 September, 2023; v1 submitted 25 September, 2023;
originally announced September 2023.
-
OccluTrack: Rethinking Awareness of Occlusion for Enhancing Multiple Pedestrian Tracking
Authors:
Jianjun Gao,
Yi Wang,
Kim-Hui Yap,
Kratika Garg,
Boon Siew Han
Abstract:
Multiple pedestrian tracking faces the challenge of tracking pedestrians in the presence of occlusion. Existing methods suffer from inaccurate motion estimation, appearance feature extraction, and association due to occlusion, leading to inadequate Identification F1-Score (IDF1), excessive ID switches (IDSw), and insufficient association accuracy and recall (AssA and AssR). We found that the main reason is abnormal detections caused by partial occlusion. In this paper, we suggest that the key insight is explicit motion estimation, reliable appearance features, and fair association in occlusion scenes. Specifically, we propose an adaptive occlusion-aware multiple pedestrian tracker, OccluTrack. We first introduce an abnormal motion suppression mechanism into the Kalman Filter to adaptively detect and suppress outlier motions caused by partial occlusion. Second, we propose a pose-guided re-ID module to extract discriminative part features for partially occluded pedestrians. Last, we design a new occlusion-aware association method towards fair IoU and appearance embedding distance measurement for occluded pedestrians. Extensive evaluation results demonstrate that our OccluTrack outperforms state-of-the-art methods on MOT-Challenge datasets. Particularly, the improvements on IDF1, IDSw, AssA, and AssR demonstrate the effectiveness of our OccluTrack on tracking and association performance.
Submitted 19 September, 2023;
originally announced September 2023.
-
Ladder-of-Thought: Using Knowledge as Steps to Elevate Stance Detection
Authors:
Kairui Hu,
Ming Yan,
Joey Tianyi Zhou,
Ivor W. Tsang,
Wen Haw Chong,
Yong Keong Yap
Abstract:
Stance detection aims to identify the attitude expressed in a document towards a given target. Techniques such as Chain-of-Thought (CoT) prompting have advanced this task, enhancing a model's reasoning capabilities through the derivation of intermediate rationales. However, CoT relies primarily on a model's pre-trained internal knowledge during reasoning, thereby neglecting the valuable external information that is previously unknown to the model. This omission, especially within the unsupervised reasoning process, can affect the model's overall performance. Moreover, while CoT enhances Large Language Models (LLMs), smaller LMs, though efficient operationally, face challenges in delivering nuanced reasoning. In response to these identified gaps, we introduce the Ladder-of-Thought (LoT) for the stance detection task. Constructed through a dual-phase Progressive Optimization Framework, LoT directs the small LMs to assimilate high-quality external knowledge, refining the intermediate rationales produced. These bolstered rationales subsequently serve as the foundation for more precise predictions - akin to how a ladder facilitates reaching elevated goals. LoT achieves a balance between efficiency and performance. Our empirical evaluations underscore LoT's efficacy, marking a 16% improvement over GPT-3.5 and a 10% enhancement compared to GPT-3.5 with CoT on stance detection task.
Submitted 7 September, 2023; v1 submitted 31 August, 2023;
originally announced August 2023.
-
Top-Down Framework for Weakly-supervised Grounded Image Captioning
Authors:
Chen Cai,
Suchen Wang,
Kim-hui Yap,
Yi Wang
Abstract:
Weakly-supervised grounded image captioning (WSGIC) aims to generate the caption and ground (localize) predicted object words in the input image without using bounding box supervision. Recent two-stage solutions mostly apply a bottom-up pipeline: (1) encode the input image into multiple region features using an object detector; (2) leverage region features for captioning and grounding. However, utilizing independent proposals produced by object detectors tends to make the subsequent grounded captioner overfitted in finding the correct object words, overlooking the relation between objects, and selecting incompatible proposal regions for grounding. To address these issues, we propose a one-stage weakly-supervised grounded captioner that directly takes the RGB image as input to perform captioning and grounding at the top-down image level. Specifically, we encode the image into visual token representations and propose a Recurrent Grounding Module (RGM) in the decoder to obtain precise Visual Language Attention Maps (VLAMs), which recognize the spatial locations of the objects. In addition, we explicitly inject a relation module into our one-stage framework to encourage relation understanding through multi-label classification. These relation semantics serve as contextual information facilitating the prediction of relation and object words in the caption. We observe that the relation semantics not only assist the grounded captioner in generating a more accurate caption but also improve the grounding performance. We validate the effectiveness of our proposed method on two challenging datasets (Flickr30k Entities captioning and MSCOCO captioning). The experimental results demonstrate that our method achieves state-of-the-art grounding performance.
Submitted 2 March, 2024; v1 submitted 12 June, 2023;
originally announced June 2023.
-
Guiding Computational Stance Detection with Expanded Stance Triangle Framework
Authors:
Zhengyuan Liu,
Yong Keong Yap,
Hai Leong Chieu,
Nancy F. Chen
Abstract:
Stance detection determines whether the author of a piece of text is in favor of, against, or neutral towards a specified target, and can be used to gain valuable insights into social media. The ubiquitous indirect referral of targets makes this task challenging, as it requires computational solutions to model semantic features and infer the corresponding implications from a literal statement. Moreover, the limited amount of available training data leads to subpar performance in out-of-domain and cross-target scenarios, as data-driven approaches are prone to rely on superficial and domain-specific features. In this work, we decompose the stance detection task from a linguistic perspective, and investigate key components and inference paths in this task. The stance triangle is a generic linguistic framework previously proposed to describe the fundamental ways people express their stance. We further expand it by characterizing the relationship between explicit and implicit objects. We then use the framework to extend one single training corpus with additional annotation. Experimental results show that strategically-enriched data can significantly improve the performance on out-of-domain and cross-target evaluation.
Submitted 31 May, 2023;
originally announced May 2023.
-
SSN: Stockwell Scattering Network for SAR Image Change Detection
Authors:
Gong Chen,
Yanan Zhao,
Yi Wang,
Kim-Hui Yap
Abstract:
Recently, synthetic aperture radar (SAR) image change detection has become an interesting yet challenging direction due to the presence of speckle noise. Although both traditional and modern learning-driven methods attempted to overcome this challenge, deep convolutional neural networks (DCNNs)-based methods are still hindered by the lack of interpretability and the requirement of large computation power. To overcome this drawback, the wavelet scattering network (WSN) and the Fourier scattering network (FSN) have been proposed. Combining the respective merits of WSN and FSN, we propose the Stockwell scattering network (SSN), based on the Stockwell transform, which is widely applied against noisy signals and shows advantageous characteristics in speckle reduction. The proposed SSN provides noise-resilient feature representation and obtains state-of-the-art performance in SAR image change detection as well as high computational efficiency. Experimental results on three real SAR image datasets demonstrate the effectiveness of the proposed method.
Submitted 22 April, 2023;
originally announced April 2023.
-
A Byte Sequence is Worth an Image: CNN for File Fragment Classification Using Bit Shift and n-Gram Embeddings
Authors:
Wenyang Liu,
Yi Wang,
Kejun Wu,
Kim-Hui Yap,
Lap-Pui Chau
Abstract:
File fragment classification (FFC) on small chunks of memory is essential in memory forensics and Internet security. Existing methods mainly treat file fragments as 1d byte signals and utilize the captured inter-byte features for classification, while the bit information within bytes, i.e., intra-byte information, is seldom considered. This is inherently inapt for classifying variable-length coding files whose symbols are represented as the variable number of bits. Conversely, we propose Byte2Image, a novel data augmentation technique, to introduce the neglected intra-byte information into file fragments and re-treat them as 2d gray-scale images, which allows us to capture both inter-byte and intra-byte correlations simultaneously through powerful convolutional neural networks (CNNs). Specifically, to convert file fragments to 2d images, we employ a sliding byte window to expose the neglected intra-byte information and stack their n-gram features row by row. We further propose a byte sequence & image fusion network as a classifier, which can jointly model the raw 1d byte sequence and the converted 2d image to perform FFC. Experiments on FFT-75 dataset validate that our proposed method can achieve notable accuracy improvements over state-of-the-art methods in nearly all scenarios. The code will be released at https://github.com/wenyang001/Byte2Image.
Submitted 14 April, 2023;
originally announced April 2023.
-
Bitstream-Corrupted JPEG Images are Restorable: Two-stage Compensation and Alignment Framework for Image Restoration
Authors:
Wenyang Liu,
Yi Wang,
Kim-Hui Yap,
Lap-Pui Chau
Abstract:
In this paper, we study a real-world JPEG image restoration problem with bit errors on the encrypted bitstream. The bit errors bring unpredictable color casts and block shifts on decoded image contents, which cannot be resolved by existing image restoration methods mainly relying on pre-defined degradation models in the pixel domain. To address these challenges, we propose a robust JPEG decoder, followed by a two-stage compensation and alignment framework to restore bitstream-corrupted JPEG images. Specifically, the robust JPEG decoder adopts an error-resilient mechanism to decode the corrupted JPEG bitstream. The two-stage framework is composed of the self-compensation and alignment (SCA) stage and the guided-compensation and alignment (GCA) stage. The SCA adaptively performs block-wise image color compensation and alignment based on the estimated color and block offsets via image content similarity. The GCA leverages the extracted low-resolution thumbnail from the JPEG header to guide full-resolution pixel-wise image restoration in a coarse-to-fine manner. It is achieved by a coarse-guided pix2pix network and a refine-guided bi-directional Laplacian pyramid fusion network. We conduct experiments on three benchmarks with varying degrees of bit error rates. Experimental results and ablation studies demonstrate the superiority of our proposed method. The code will be released at https://github.com/wenyang001/Two-ACIR.
Submitted 14 April, 2023;
originally announced April 2023.
-
TAPS3D: Text-Guided 3D Textured Shape Generation from Pseudo Supervision
Authors:
Jiacheng Wei,
Hao Wang,
Jiashi Feng,
Guosheng Lin,
Kim-Hui Yap
Abstract:
In this paper, we investigate an open research task of generating controllable 3D textured shapes from the given textual descriptions. Previous works either require ground truth caption labeling or extensive optimization time. To resolve these issues, we present a novel framework, TAPS3D, to train a text-guided 3D shape generator with pseudo captions. Specifically, based on rendered 2D images, we retrieve relevant words from the CLIP vocabulary and construct pseudo captions using templates. Our constructed captions provide high-level semantic supervision for generated 3D shapes. Further, in order to produce fine-grained textures and increase geometry diversity, we propose to adopt low-level image regularization to enable fake-rendered images to align with the real ones. During the inference phase, our proposed model can generate 3D textured shapes from the given text without any additional optimization. We conduct extensive experiments to analyze each of our proposed components and show the efficacy of our framework in generating high-fidelity 3D textured and text-relevant shapes.
Submitted 23 March, 2023;
originally announced March 2023.
-
Dense Supervision Propagation for Weakly Supervised Semantic Segmentation on 3D Point Clouds
Authors:
Jiacheng Wei,
Guosheng Lin,
Kim-Hui Yap,
Fayao Liu,
Tzu-Yi Hung
Abstract:
Semantic segmentation on 3D point clouds is an important task for 3D scene understanding. While dense labeling on 3D data is expensive and time-consuming, only a few works address weakly supervised semantic point cloud segmentation methods to relieve the labeling cost by learning from simpler and cheaper labels. Meanwhile, there are still huge performance gaps between existing weakly supervised methods and state-of-the-art fully supervised methods. In this paper, we train a semantic point cloud segmentation network with only a small portion of points being labeled. We argue that we can better utilize the limited supervision information as we densely propagate the supervision signal from the labeled points to other points within and across the input samples. Specifically, we propose a cross-sample feature reallocating module to transfer similar features and therefore re-route the gradients across two samples with common classes and an intra-sample feature redistribution module to propagate supervision signals on unlabeled points across and within point cloud samples. We conduct extensive experiments on public datasets S3DIS and ScanNet. Our weakly supervised method with only 10% and 1% of labels can produce comparable results with the fully supervised counterpart.
Submitted 1 April, 2024; v1 submitted 23 July, 2021;
originally announced July 2021.
-
Reconciliation of Statistical and Spatial Sparsity For Robust Image and Image-Set Classification
Authors:
Hao Cheng,
Kim-Hui Yap,
Bihan Wen
Abstract:
Recent image classification algorithms, by learning deep features from large-scale datasets, have achieved significantly better results compared to the classic feature-based approaches. However, there are still various challenges of image classifications in practice, such as classifying noisy image or image-set queries and training deep image classification models over the limited-scale dataset. Instead of applying generic deep features, the model-based approaches can be more effective and data-efficient for robust image and image-set classification tasks, as various image priors are exploited for modeling the inter- and intra-set data variations while preventing over-fitting. In this work, we propose a novel Joint Statistical and Spatial Sparse representation, dubbed J3S, to model the image or image-set data for classification, by reconciling both their local patch structures and global Gaussian distribution mapped into Riemannian manifold. To the best of our knowledge, no work to date utilized both global statistics and local patch structures jointly via joint sparse representation. We propose to solve the joint sparse coding problem based on the J3S model, by coupling the local and global image representations using joint sparsity. The learned J3S models are used for robust image and image-set classification. Experiments show that the proposed J3S-based image classification scheme outperforms the popular or state-of-the-art competing methods over FMD, UIUC, ETH-80 and YTC databases.
Submitted 1 June, 2021;
originally announced June 2021.
-
The D-plus Discriminant and Complexity of Root Clustering
Authors:
Jing Yang,
Chee K. Yap
Abstract:
Let $p(x)$ be an integer polynomial with $m\ge 2$ distinct roots $ρ_1,\ldots,ρ_m$ whose multiplicities are $\boldsymbol{μ}=(μ_1,\ldots,μ_m)$. We define the D-plus discriminant of $p(x)$ to be $D^+(p):= \prod_{1\le i<j\le m}(ρ_i-ρ_j)^{μ_i+μ_j}$. We first prove a conjecture that $D^+(p)$ is a $\boldsymbol{μ}$-symmetric function of its roots $ρ_1,\ldots,ρ_m$. Our main result gives an explicit formula for $D^+(p)$, as a rational function of its coefficients. Our proof is ideal-theoretic, based on re-casting the classic Poisson resultant as the "symbolic Poisson formula". The D-plus discriminant first arose in the complexity analysis of a root clustering algorithm from Becker et al. (ISSAC 2016). The bit-complexity of this algorithm is proportional to a quantity $\log(|D^+(p)|^{-1})$. As an application of our main result, we give an explicit upper bound on this quantity in terms of the degree of $p$ and its leading coefficient.
Submitted 19 May, 2021; v1 submitted 9 May, 2021;
originally announced May 2021.
-
Empirical Analysis of Overfitting and Mode Drop in GAN Training
Authors:
Yasin Yazici,
Chuan-Sheng Foo,
Stefan Winkler,
Kim-Hui Yap,
Vijay Chandrasekhar
Abstract:
We examine two key questions in GAN training, namely overfitting and mode drop, from an empirical perspective. We show that when stochasticity is removed from the training procedure, GANs can overfit and exhibit almost no mode drop. Our results shed light on important characteristics of the GAN training procedure. They also provide evidence against prevailing intuitions that GANs do not memorize the training set, and that mode dropping is mainly due to properties of the GAN objective rather than how it is optimized during training.
Submitted 25 June, 2020;
originally announced June 2020.
-
Arnold: an eFPGA-Augmented RISC-V SoC for Flexible and Low-Power IoT End-Nodes
Authors:
Pasquale Davide Schiavone,
Davide Rossi,
Alfio Di Mauro,
Frank Gurkaynak,
Timothy Saxe,
Mao Wang,
Ket Chong Yap,
Luca Benini
Abstract:
A wide range of Internet of Things (IoT) applications require powerful, energy-efficient and flexible end-nodes to acquire data from multiple sources, process and distill the sensed data through near-sensor data analytics algorithms, and transmit it wirelessly. This work presents Arnold: a 0.5 V to 0.8 V, 46.83 uW/MHz, 600 MOPS fully programmable RISC-V Microcontroller unit (MCU) fabricated in 22 nm Globalfoundries GF22FDX (GF22FDX) technology, coupled with a state-of-the-art (SoA) microcontroller to an embedded Field Programmable Gate Array (FPGA). We demonstrate the flexibility of the System-on-Chip (SoC) to tackle the challenges of many emerging IoT applications, such as (i) interfacing sensors and accelerators with non-standard interfaces, (ii) performing on-the-fly pre-processing tasks on data streamed from peripherals, and (iii) accelerating near-sensor analytics, encryption, and machine learning tasks. A unique feature of the proposed SoC is the exploitation of body-biasing to reduce leakage power of the embedded FPGA (eFPGA) fabric by up to 18x at 0.5 V, achieving SoA state bitstream-retentive sleep power for the eFPGA fabric, as low as 20.5 uW. The proposed SoC provides 3.4x better performance and 2.9x better energy efficiency than other fabricated heterogeneous re-configurable SoCs of the same class.
Submitted 25 June, 2020;
originally announced June 2020.
-
Multi-Path Region Mining For Weakly Supervised 3D Semantic Segmentation on Point Clouds
Authors:
Jiacheng Wei,
Guosheng Lin,
Kim-Hui Yap,
Tzu-Yi Hung,
Lihua Xie
Abstract:
Point clouds provide intrinsic geometric information and surface context for scene understanding. Existing methods for point cloud segmentation require a large amount of fully labeled data. With advanced depth sensors, collecting large-scale 3D datasets is no longer a cumbersome process. However, manually producing point-level labels on large-scale datasets is time- and labor-intensive. In this paper, we propose a weakly supervised approach to predict point-level results using weak labels on 3D point clouds. We introduce our multi-path region mining module to generate pseudo point-level labels from a classification network trained with weak labels. It mines the localization cues for each class from various aspects of the network features using different attention modules. Then, we use the point-level pseudo labels to train a point cloud segmentation network in a fully supervised manner. To the best of our knowledge, this is the first method that uses cloud-level weak labels on raw 3D space to train a point cloud semantic segmentation network. In our setting, the 3D weak labels only indicate the classes that appear in our input sample. We discuss both scene- and subcloud-level weak labels on raw 3D point cloud data and perform in-depth experiments on them. On the ScanNet dataset, our result trained with subcloud-level labels is comparable to some fully supervised methods.
Submitted 29 March, 2020;
originally announced March 2020.
-
On mu-Symmetric Polynomials
Authors:
Jing Yang,
Chee K. Yap
Abstract:
In this paper, we study functions of the roots of a univariate polynomial in which the roots have a given multiplicity structure $μ$. Traditionally, root functions are studied via the theory of symmetric polynomials; we extend this theory to $μ$-symmetric polynomials. We were motivated by a conjecture from Becker et al. (ISSAC 2016) about the $μ$-symmetry of a particular root function $D^+(μ)$, called D-plus. To investigate this conjecture, it was desirable to have fast algorithms for checking if a given root function is $μ$-symmetric. We designed three such algorithms: one based on Gröbner bases, another based on preprocessing and reduction, and the third based on solving linear equations. We implemented them in Maple and experiments show that the latter two algorithms are significantly faster than the first.
Submitted 21 January, 2020;
originally announced January 2020.
-
AANet: Attribute Attention Network for Person Re-Identifications
Authors:
Chiat-Pin Tay,
Sharmili Roy,
Kim-Hui Yap
Abstract:
This paper proposes Attribute Attention Network (AANet), a new architecture that integrates person attributes and attribute attention maps into a classification framework to solve the person re-identification (re-ID) problem. Many person re-ID models typically employ semantic cues such as body parts or human pose to improve the re-ID performance. Attribute information, however, is often not utilized. The proposed AANet builds on a baseline model that uses body parts and integrates the key attribute information in a unified learning framework. The AANet consists of a global person ID task, a part detection task and a crucial attribute detection task. By estimating the class responses of individual attributes and combining them to form the attribute attention map (AAM), a very strong discriminatory representation is constructed. The proposed AANet outperforms the best state-of-the-art method arXiv:1711.09349v3 [cs.CV] using ResNet-50 by 3.36% in mAP and 3.12% in Rank-1 accuracy on the DukeMTMC-reID dataset. On the Market1501 dataset, AANet achieves 92.38% mAP and 95.10% Rank-1 accuracy with re-ranking, outperforming arXiv:1804.00216v1 [cs.CV], another state-of-the-art method using ResNet-152, by 1.42% in mAP and 0.47% in Rank-1 accuracy. In addition, AANet can perform person attribute prediction (e.g., gender, hair length, clothing length, etc.), and localize the attributes in the query image.
Submitted 19 December, 2019;
originally announced December 2019.
-
Semantic Granularity Metric Learning for Visual Search
Authors:
Dipu Manandhar,
Muhammet Bastan,
Kim-Hui Yap
Abstract:
Deep metric learning applied to various applications has shown promising results in identification, retrieval and recognition. Existing methods often do not consider different granularities in visual similarity. However, in many domain applications, images exhibit similarity at multiple granularities with visual semantic concepts, e.g. fashion demonstrates similarity ranging from clothing of the exact same instance to similar looks/design or a common category. Therefore, training image triplets/pairs used for metric learning inherently possess different degrees of information. However, the existing methods often treat them with equal importance during training. This hinders capturing the underlying granularities in feature similarity required for effective visual search.
In view of this, we propose a new deep semantic granularity metric learning (SGML) method that leverages the attribute semantic space to capture different granularities of similarity, and then integrates this information into deep metric learning. The proposed method simultaneously learns image attributes and embeddings using multitask CNNs. The two tasks are not only jointly optimized but are further linked by the semantic granularity similarity mappings to leverage the correlations between the tasks. To this end, we propose a new soft-binomial deviance loss that effectively integrates the degree of information in training samples, which helps to capture visual similarity at multiple granularities. Compared to recent ensemble-based methods, our framework is conceptually elegant, computationally simple and provides better performance. We perform extensive experiments on benchmark metric learning datasets and demonstrate that our method outperforms recent state-of-the-art methods, e.g., 1-4.5% improvement in Recall@1 over the previous state-of-the-art methods [1], [2] on the DeepFashion In-Shop dataset.
Submitted 14 November, 2019;
originally announced November 2019.
-
Venn GAN: Discovering Commonalities and Particularities of Multiple Distributions
Authors:
Yasin Yazıcı,
Bruno Lecouat,
Chuan-Sheng Foo,
Stefan Winkler,
Kim-Hui Yap,
Georgios Piliouras,
Vijay Chandrasekhar
Abstract:
We propose a GAN design which models multiple distributions effectively and discovers their commonalities and particularities. Each data distribution is modeled with a mixture of $K$ generator distributions. As the generators are partially shared between the modeling of different true data distributions, the shared ones capture the commonality of the distributions, while the non-shared ones capture unique aspects of them. We show the effectiveness of our method on various datasets (MNIST, Fashion MNIST, CIFAR-10, Omniglot, CelebA) with compelling results.
Submitted 9 February, 2019;
originally announced February 2019.
-
Interest Point Detection based on Adaptive Ternary Coding
Authors:
Zhenwei Miao,
Kim-Hui Yap,
Xudong Jiang
Abstract:
In this paper, an adaptive pixel ternary coding mechanism is proposed and a contrast invariant and noise resistant interest point detector is developed on the basis of this mechanism. Every pixel in a local region is adaptively encoded into one of three statuses: bright, uncertain and dark. The blob significance of the local region is measured by the spatial distribution of the bright and dark pixels. Interest points are extracted from this blob significance measurement. By labeling each pixel with one of the ternary statuses (bright, uncertain, or dark), the proposed detector shows more robustness to image noise and quantization errors. Moreover, the adaptive strategy for the ternary coding, which relies on two thresholds that automatically converge to the median of the local region in measurement, makes this coding insensitive to the local image contrast. As a result, the proposed detector is invariant to illumination changes. State-of-the-art results are achieved on the standard datasets, and also in the face recognition application.
Submitted 31 December, 2018;
originally announced January 2019.
-
DCI: Discriminative and Contrast Invertible Descriptor
Authors:
Zhenwei Miao,
Kim-Hui Yap,
Xudong Jiang,
Subbhuraam Sinduja,
Zhenhua Wang
Abstract:
Local feature descriptors have been widely used in fine-grained visual object search thanks to their robustness to scale and rotation variation and cluttered backgrounds. However, the performance of such descriptors drops under severe illumination changes. In this paper, we propose a Discriminative and Contrast Invertible (DCI) local feature descriptor. In order to increase the discriminative ability of the descriptor under illumination changes, a Laplace gradient based histogram is proposed. A robust contrast flipping estimate is proposed based on the divergence of a local region. Experiments on fine-grained object recognition and retrieval applications demonstrate the superior performance of the DCI descriptor over others.
Submitted 31 December, 2018;
originally announced January 2019.
-
The Unusual Effectiveness of Averaging in GAN Training
Authors:
Yasin Yazıcı,
Chuan-Sheng Foo,
Stefan Winkler,
Kim-Hui Yap,
Georgios Piliouras,
Vijay Chandrasekhar
Abstract:
We examine two different techniques for parameter averaging in GAN training. Moving Average (MA) computes the time-average of parameters, whereas Exponential Moving Average (EMA) computes an exponentially discounted sum. Whilst MA is known to lead to convergence in bilinear settings, we provide the -- to our knowledge -- first theoretical arguments in support of EMA. We show that EMA converges to limit cycles around the equilibrium with vanishing amplitude as the discount parameter approaches one for simple bilinear games and also enhances the stability of general GAN training. We establish experimentally that both techniques are strikingly effective in the non-convex-concave GAN setting as well. Both improve inception and FID scores on different architectures and for different GAN objectives. We provide comprehensive experimental results across a range of datasets -- mixture of Gaussians, CIFAR-10, STL-10, CelebA and ImageNet -- to demonstrate its effectiveness. We achieve state-of-the-art results on CIFAR-10 and produce clean CelebA face images. The code is available at https://github.com/yasinyazici/EMA_GAN
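The two averaging schemes named in the abstract can be sketched in a few lines. This is a minimal, framework-free illustration and not the authors' released code; the function names `update_ma` and `update_ema`, the flat-list parameter representation, and the default `beta` value are assumptions made here.

```python
def update_ma(avg, params, step):
    """Uniform running average of the parameters seen so far.

    `step` is the 1-based index of the current iterate, so after `step`
    calls `avg` equals the plain mean of all iterates.
    """
    return [a + (p - a) / step for a, p in zip(avg, params)]


def update_ema(avg, params, beta=0.999):
    """Exponential moving average: old values are discounted by `beta`."""
    return [beta * a + (1.0 - beta) * p for a, p in zip(avg, params)]


# Toy sequence of one-parameter "generator" iterates (hypothetical values).
iterates = [[1.0], [2.0], [3.0]]

ma = [0.0]
for t, params in enumerate(iterates, start=1):
    ma = update_ma(ma, params, t)

ema = list(iterates[0])  # initialise the EMA at the first iterate
for params in iterates[1:]:
    ema = update_ema(ema, params, beta=0.5)

print(ma, ema)  # [2.0] [2.25]
```

In practice the averaged copy would be kept alongside the trained generator and updated after each optimizer step, with the averaged parameters used only for evaluation or sampling; a `beta` close to one gives a long averaging window.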
Submitted 26 February, 2019; v1 submitted 12 June, 2018;
originally announced June 2018.
-
Remote Detection of Idling Cars Using Infrared Imaging and Deep Networks
Authors:
Muhammet Bastan,
Kim-Hui Yap,
Lap-Pui Chau
Abstract:
Idling vehicles waste energy and pollute the environment through exhaust emission. In some countries, idling a vehicle for more than a predefined duration is prohibited and automatic idling vehicle detection is desirable for law enforcement. We propose the first automatic system to detect idling cars, using infrared (IR) imaging and deep networks.
We rely on the differences in spatio-temporal heat signatures of idling and stopped cars and monitor the car temperature with a long-wavelength IR camera. We formulate the idling car detection problem as spatio-temporal event detection in IR image sequences and employ deep networks for spatio-temporal modeling. We collected the first IR image sequence dataset for idling car detection. First, we detect the cars in each IR image using a convolutional neural network, which is pre-trained on regular RGB images and fine-tuned on IR images for higher accuracy. Then, we track the detected cars over time to identify the cars that are parked. Finally, we use the 3D spatio-temporal IR image volume of each parked car as input to convolutional and recurrent networks to classify them as idling or not. We carried out an extensive empirical evaluation of temporal and spatio-temporal modeling approaches with various convolutional and recurrent architectures. We present promising experimental results on our IR image sequence dataset.
Submitted 28 April, 2018;
originally announced April 2018.
-
Handling state space explosion in verification of component-based systems: A review
Authors:
Faranak Nejati,
Abdul Azim Abd. Ghani,
Ng Keng Yap,
Azmi Jaafar
Abstract:
Component-based software development (CBSD) is an alternative approach to constructing software systems that offers numerous benefits, particularly in decreasing the complexity of system design. However, deploying components into a system is a challenging and error-prone task. Model-checking is one of the reliable methods to systematically analyze the correctness of a system. It is a brute-force check of the system's state space that helps to significantly increase the level of confidence in the system. Nevertheless, model-checking is limited by a critical problem called state-space explosion (SSE). To benefit from model-checking, an appropriate method is required to reduce SSE. In the past two decades, a great number of SSE reduction methods have been proposed, containing many similarities, dissimilarities, and, in some cases, unclear concepts. First, this research presents a review of SSE handling methods and classifies them based on their similarities, principles, and characteristics. Second, it investigates the methods for handling the SSE problem in the verification process of CBSD and provides insight into the potential limitations, underlining the key challenges for future research efforts.
Submitted 26 May, 2021; v1 submitted 28 July, 2017;
originally announced September 2017.
-
Resolution-Exact Planner for Thick Non-Crossing 2-Link Robots
Authors:
Chee K. Yap,
Zhongdi Luo,
Ching-Hsiang Hsu
Abstract:
We consider the path planning problem for a 2-link robot amidst polygonal obstacles. Our robot is parametrizable by the lengths $\ell_1, \ell_2>0$ of its two links, the thickness $τ\ge 0$ of the links, and an angle $κ$ that constrains the angle between the 2 links to be strictly greater than $κ$. The case $τ>0$ and $κ\ge 0$ corresponds to "thick non-crossing" robots. This results in a novel 4DOF configuration space ${\mathbb R}^2\times ({\mathbb T}\setminusΔ(κ))$ where ${\mathbb T}$ is the torus and $Δ(κ)$ the diagonal band of width $κ$. We design a resolution-exact planner for this robot using the framework of Soft Subdivision Search (SSS). First, we provide an analysis of the space of forbidden angles, leading to a soft predicate for classifying configuration boxes. We further exploit the T/R splitting technique which was previously introduced for self-crossing thin 2-link robots. Our open-source implementation in Core Library achieves real-time performance for a suite of combinatorially non-trivial obstacle sets. Experimentally, our algorithm is significantly better than any of the state-of-the-art sampling algorithms we looked at, in timing and in success rate.
Submitted 17 April, 2017;
originally announced April 2017.
-
Certified Computation of planar Morse-Smale Complexes
Authors:
Amit Chattopadhyay,
Gert Vegter,
Chee K. Yap
Abstract:
The Morse-Smale complex is an important tool for global topological analysis in various problems of computational geometry and topology. Algorithms for Morse-Smale complexes have been presented in the case of piecewise linear manifolds. However, previous research in this field is incomplete in the case of smooth functions. In the current paper we address the following question: Given an arbitrarily complex Morse-Smale system on a planar domain, is it possible to compute its certified (topologically correct) Morse-Smale complex? Towards this, we develop an algorithm using interval arithmetic to compute certified critical points and separatrices forming the Morse-Smale complexes of smooth functions on a bounded planar domain. Our algorithm can also compute geometrically close Morse-Smale complexes.
Submitted 20 June, 2015;
originally announced June 2015.