-
Visible-Thermal Tiny Object Detection: A Benchmark Dataset and Baselines
Authors:
Xinyi Ying,
Chao Xiao,
Ruojing Li,
Xu He,
Boyang Li,
Zhaoxu Li,
Yingqian Wang,
Mingyuan Hu,
Qingyu Xu,
Zaiping Lin,
Miao Li,
Shilin Zhou,
Wei An,
Weidong Sheng,
Li Liu
Abstract:
Small object detection (SOD) has been a longstanding yet challenging task for decades, with numerous datasets and algorithms being developed. However, they mainly focus on either visible or thermal modality, while visible-thermal (RGBT) bimodality is rarely explored. Although some RGBT datasets have been developed recently, the insufficient quantity, limited category, misaligned images and large target size cannot provide an impartial benchmark to evaluate multi-category visible-thermal small object detection (RGBT SOD) algorithms. In this paper, we build the first large-scale benchmark with high diversity for RGBT SOD (namely RGBT-Tiny), including 115 paired sequences, 93K frames and 1.2M manual annotations. RGBT-Tiny contains abundant targets (7 categories) and high-diversity scenes (8 types that cover different illumination and density variations). Note that, over 81% of targets are smaller than 16x16, and we provide paired bounding box annotations with tracking ID to offer an extremely challenging benchmark with wide-range applications, such as RGBT fusion, detection and tracking. In addition, we propose a scale adaptive fitness (SAFit) measure that exhibits high robustness on both small and large targets. The proposed SAFit can provide reasonable performance evaluation and promote detection performance. Based on the proposed RGBT-Tiny dataset and SAFit measure, extensive evaluations have been conducted, including 23 recent state-of-the-art algorithms that cover four different types (i.e., visible generic detection, visible SOD, thermal SOD and RGBT object detection). Project is available at https://github.com/XinyiYing24/RGBT-Tiny.
Submitted 20 June, 2024;
originally announced June 2024.
-
AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention
Authors:
Wenbin An,
Feng Tian,
Sicong Leng,
Jiahao Nie,
Haonan Lin,
QianYing Wang,
Guang Dai,
Ping Chen,
Shijian Lu
Abstract:
Despite their great success across various multimodal tasks, Large Vision-Language Models (LVLMs) are facing a prevalent problem with object hallucinations, where the generated textual responses are inconsistent with ground-truth objects in the given image. This paper investigates various LVLMs and pinpoints attention deficiency toward discriminative local image features as one root cause of object hallucinations. Specifically, LVLMs predominantly attend to prompt-independent global image features, while failing to capture prompt-relevant local features, consequently undermining the visual grounding capacity of LVLMs and leading to hallucinations. To this end, we propose Assembly of Global and Local Attention (AGLA), a training-free and plug-and-play approach that mitigates object hallucinations by exploring an ensemble of global features for response generation and local features for visual discrimination simultaneously. Our approach exhibits an image-prompt matching scheme that captures prompt-relevant local features from images, leading to an augmented view of the input image where prompt-relevant content is reserved while irrelevant distractions are masked. With the augmented view, a calibrated decoding distribution can be derived by integrating generative global features from the original image and discriminative local features from the augmented image. Extensive experiments show that AGLA consistently mitigates object hallucinations and enhances general perception capability for LVLMs across various discriminative and generative benchmarks. Our code will be released at https://github.com/Lackel/AGLA.
Submitted 21 June, 2024; v1 submitted 18 June, 2024;
originally announced June 2024.
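The AGLA abstract above describes deriving a calibrated decoding distribution from two views of the image (the original and a prompt-masked augmented view). The abstract does not give the combination rule, so the following is only a minimal sketch under an assumed rule: next-token logits from the two views are added with a hypothetical weight `alpha` and then normalized. The function names and the additive form are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def calibrated_distribution(logits_original, logits_augmented, alpha=1.0):
    """Combine next-token logits from two image views.

    ASSUMPTION: a simple weighted sum of logits; `alpha` is a hypothetical
    weight for the augmented (prompt-relevant) view.
    """
    if len(logits_original) != len(logits_augmented):
        raise ValueError("both views must score the same vocabulary")
    combined = [a + alpha * b for a, b in zip(logits_original, logits_augmented)]
    return softmax(combined)


if __name__ == "__main__":
    # Token 0 narrowly wins on the original image; the augmented view favors token 1.
    print(calibrated_distribution([2.0, 1.9], [0.0, 1.0]))
```

With `alpha=0` the sketch falls back to the distribution from the original image alone, which makes the contribution of the augmented view easy to isolate.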
-
MMRel: A Relation Understanding Dataset and Benchmark in the MLLM Era
Authors:
Jiahao Nie,
Gongjie Zhang,
Wenbin An,
Yap-Peng Tan,
Alex C. Kot,
Shijian Lu
Abstract:
Despite the recent advancements in Multi-modal Large Language Models (MLLMs), understanding inter-object relations, i.e., interactions or associations between distinct objects, remains a major challenge for such models. This issue significantly hinders their advanced reasoning capabilities and is primarily due to the lack of large-scale, high-quality, and diverse multi-modal data essential for training and evaluating MLLMs. In this paper, we provide a taxonomy of inter-object relations and introduce Multi-Modal Relation Understanding (MMRel), a comprehensive dataset designed to bridge this gap by providing large-scale, high-quality and diverse data for studying inter-object relations with MLLMs. MMRel features three distinctive attributes: (i) It includes over 15K question-answer pairs, which are sourced from three distinct domains, ensuring large scale and high diversity; (ii) It contains a subset featuring highly unusual relations, on which MLLMs often fail due to hallucinations, thus are very challenging; (iii) It provides manually verified high-quality labels for inter-object relations. Thanks to these features, MMRel is ideal for evaluating MLLMs on relation understanding, as well as being used to fine-tune MLLMs to enhance relation understanding and even benefit overall performance in various vision-language tasks. Extensive experiments on various popular MLLMs validate the effectiveness of MMRel. Both MMRel dataset and the complete labeling scripts have been made publicly available.
Submitted 13 June, 2024;
originally announced June 2024.
-
Vim-F: Visual State Space Model Benefiting from Learning in the Frequency Domain
Authors:
Juntao Zhang,
Kun Bian,
Peng Cheng,
Wenbo An,
Jianning Liu,
Jun Zhou
Abstract:
In recent years, State Space Models (SSMs) with efficient hardware-aware designs, known as the Mamba deep learning models, have made significant progress in modeling long sequences such as language understanding. Therefore, building efficient and general-purpose visual backbones based on SSMs is a promising direction. Compared to traditional convolutional neural networks (CNNs) and Vision Transformers (ViTs), the performance of Vision Mamba (ViM) methods is not yet fully competitive. To enable SSMs to process image data, ViMs typically flatten 2D images into 1D sequences, inevitably ignoring some 2D local dependencies, thereby weakening the model's ability to interpret spatial relationships from a global perspective. We use Fast Fourier Transform (FFT) to obtain the spectrum of the feature map and add it to the original feature map, enabling ViM to model a unified visual representation in both frequency and spatial domains. The introduction of frequency domain information enables ViM to have a global receptive field during scanning. We propose a novel model called Vim-F, which employs pure Mamba encoders and scans in both the frequency and spatial domains. Moreover, we question the necessity of position embedding in ViM and remove it accordingly in Vim-F, which helps to fully utilize the efficient long-sequence modeling capability of ViM. Finally, we redesign a patch embedding for Vim-F, leveraging a convolutional stem to capture more local correlations, further improving the performance of Vim-F. Code is available at: https://github.com/yws-wxs/Vim-F.
Submitted 28 May, 2024;
originally announced May 2024.
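The Vim-F abstract above says the spectrum of a feature map is obtained and added to the original feature map. A minimal sketch of that one step follows. It makes several assumptions not stated in the abstract: the amplitude (magnitude) of the spectrum is what gets added, the addition is scaled by a hypothetical `weight`, and a naive 2D discrete Fourier transform stands in for an FFT so the example needs no third-party library (the resulting spectrum is the same, only slower to compute).

```python
import cmath


def dft2_amplitude(feature_map):
    """Amplitude of the 2D discrete Fourier transform of a small 2D list.

    Naive O(N^2) DFT used in place of an FFT to stay dependency-free.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    amplitude = [[0.0] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            acc = 0j
            for r in range(rows):
                for c in range(cols):
                    angle = -2.0 * cmath.pi * (u * r / rows + v * c / cols)
                    acc += feature_map[r][c] * cmath.exp(1j * angle)
            amplitude[u][v] = abs(acc)
    return amplitude


def add_spectrum(feature_map, weight=1.0):
    """Add the amplitude spectrum to the spatial feature map element-wise.

    ASSUMPTION: amplitude-only fusion with a hypothetical scalar `weight`.
    """
    amplitude = dft2_amplitude(feature_map)
    return [
        [x + weight * a for x, a in zip(row, amp_row)]
        for row, amp_row in zip(feature_map, amplitude)
    ]


if __name__ == "__main__":
    print(add_spectrum([[1.0, 1.0], [1.0, 1.0]]))
```

For a constant map, all spectral energy sits in the zero-frequency bin, so only the top-left element changes. Every output element depends on every input element through the transform, which is the sense in which frequency information is global.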
-
SpecDETR: A Transformer-based Hyperspectral Point Object Detection Network
Authors:
Zhaoxu Li,
Wei An,
Gaowei Guo,
Longguang Wang,
Yingqian Wang,
Zaiping Lin
Abstract:
Hyperspectral target detection (HTD) aims to identify specific materials based on spectral information in hyperspectral imagery and can detect point targets, some of which occupy less than one pixel. However, existing HTD methods are developed based on per-pixel binary classification, which limits the feature representation capability for point targets. In this paper, we rethink hyperspectral point target detection from the object detection perspective, and focus more on the object-level prediction capability rather than the pixel classification capability. Inspired by the token-based processing flow of Detection Transformer (DETR), we propose the first specialized network for hyperspectral multi-class point object detection, SpecDETR. Without the backbone part of the current object detection framework, SpecDETR treats the spectral features of each pixel in hyperspectral images as a token and utilizes a multi-layer Transformer encoder with local and global coordination attention modules to extract deep spatial-spectral joint features. SpecDETR regards point object detection as a one-to-many set prediction problem, thereby achieving a concise and efficient DETR decoder that surpasses the current state-of-the-art DETR decoder in terms of parameters and accuracy in point object detection. We develop a simulated hyperSpectral Point Object Detection benchmark termed SPOD, and for the first time, evaluate and compare the performance of current object detection networks and HTD methods on hyperspectral multi-class point object detection. SpecDETR demonstrates superior performance as compared to current object detection networks and HTD methods on the SPOD dataset. Additionally, we validate on a public HTD dataset that by using data simulation instead of manual annotation, SpecDETR can detect real-world single-spectral point objects directly.
Submitted 16 May, 2024;
originally announced May 2024.
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Authors:
DeepSeek-AI,
Aixin Liu,
Bei Feng,
Bin Wang,
Bingxuan Wang,
Bo Liu,
Chenggang Zhao,
Chengqi Dengr,
Chong Ruan,
Damai Dai,
Daya Guo,
Dejian Yang,
Deli Chen,
Dongjie Ji,
Erhang Li,
Fangyun Lin,
Fuli Luo,
Guangbo Hao,
Guanting Chen,
Guowei Li,
H. Zhang,
Hanwei Xu,
Hao Yang,
Haowei Zhang,
Honghui Ding
, et al. (132 additional authors not shown)
Abstract:
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.
Submitted 19 June, 2024; v1 submitted 7 May, 2024;
originally announced May 2024.
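The headline figures in the DeepSeek-V2 abstract above can be turned into a few ratios with simple arithmetic. The sketch below uses only numbers quoted in the abstract (236B total parameters, 21B activated per token, 93.3% KV cache reduction, 42.5% training cost saving); the function and key names are illustrative.

```python
def deepseek_v2_ratios():
    """Derive simple ratios from the figures quoted in the abstract."""
    total_params_b = 236.0        # total parameters, in billions
    active_params_b = 21.0        # parameters activated per token, in billions
    kv_cache_reduction = 0.933    # reduction relative to DeepSeek 67B
    training_cost_saving = 0.425  # saving relative to DeepSeek 67B
    return {
        # Share of the model that is active for each token (about 8.9%).
        "active_fraction": active_params_b / total_params_b,
        # KV cache that remains after the reduction (about 6.7%).
        "kv_cache_remaining": 1.0 - kv_cache_reduction,
        # Training cost that remains after the saving (57.5%).
        "training_cost_remaining": 1.0 - training_cost_saving,
    }


if __name__ == "__main__":
    for name, value in deepseek_v2_ratios().items():
        print(f"{name}: {value:.3f}")
```

Put differently, fewer than one parameter in eleven is used per token, and the KV cache shrinks to roughly one fifteenth of the baseline's size.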
-
DreamSalon: A Staged Diffusion Framework for Preserving Identity-Context in Editable Face Generation
Authors:
Haonan Lin,
Mengmeng Wang,
Yan Chen,
Wenbin An,
Yuzhe Yao,
Guang Dai,
Qianying Wang,
Yong Liu,
Jingdong Wang
Abstract:
While large-scale pre-trained text-to-image models can synthesize diverse and high-quality human-centered images, novel challenges arise with a nuanced task of "identity fine editing": precisely modifying specific features of a subject while maintaining its inherent identity and context. Existing personalization methods either require time-consuming optimization or learning additional encoders, adept in "identity re-contextualization". However, they often struggle with detailed and sensitive tasks like human face editing. To address these challenges, we introduce DreamSalon, a noise-guided, staged-editing framework, uniquely focusing on detailed image manipulations and identity-context preservation. By discerning editing and boosting stages via the frequency and gradient of predicted noises, DreamSalon first performs detailed manipulations on specific features in the editing stage, guided by high-frequency information, and then employs stochastic denoising in the boosting stage to improve image quality. For more precise editing, DreamSalon semantically mixes source and target textual prompts, guided by differences in their embedding covariances, to direct the model's focus on specific manipulation areas. Our experiments demonstrate DreamSalon's ability to efficiently and faithfully edit fine details on human faces, outperforming existing methods both qualitatively and quantitatively.
Submitted 28 March, 2024;
originally announced March 2024.
-
Transfer and Alignment Network for Generalized Category Discovery
Authors:
Wenbin An,
Feng Tian,
Wenkai Shi,
Yan Chen,
Yaqiang Wu,
Qianying Wang,
Ping Chen
Abstract:
Generalized Category Discovery is a crucial real-world task. Despite the improved performance on known categories, current methods perform poorly on novel categories. We attribute the poor performance to two reasons: biased knowledge transfer between labeled and unlabeled data and noisy representation learning on the unlabeled data. To mitigate these two issues, we propose a Transfer and Alignment Network (TAN), which incorporates two knowledge transfer mechanisms to calibrate the biased knowledge and two feature alignment mechanisms to learn discriminative features. Specifically, we model different categories with prototypes and transfer the prototypes in labeled data to correct model bias towards known categories. On the one hand, we pull instances with known categories in unlabeled data closer to these prototypes to form more compact clusters and avoid boundary overlap between known and novel categories. On the other hand, we use these prototypes to calibrate noisy prototypes estimated from unlabeled data based on category similarities, which allows for more accurate estimation of prototypes for novel categories that can be used as reliable learning targets later. After knowledge transfer, we further propose two feature alignment mechanisms to acquire both instance- and category-level knowledge from unlabeled data by aligning instance features with both augmented features and the calibrated prototypes, which can boost model performance on both known and novel categories with less noise. Experiments on three benchmark datasets show that our model outperforms SOTA methods, especially on novel categories. Theoretical analysis is provided for an in-depth understanding of our model in general. Our code and data are available at https://github.com/Lackel/TAN.
Submitted 27 December, 2023;
originally announced December 2023.
-
Semantic Importance-Aware Based for Multi-User Communication Over MIMO Fading Channels
Authors:
Haotai Liang,
Zhicheng Bao,
Wannian An,
Chen Dong,
Xiaodong Xu
Abstract:
Semantic communication, as a novel communication paradigm, has attracted the interest of many scholars, with multi-user, multi-input multi-output (MIMO) scenarios being one of the critical contexts. This paper presents a semantic importance-aware based communication system (SIA-SC) over MIMO Rayleigh fading channels. Combining the semantic symbols' inequality and the equivalent subchannels of MIMO channels based on Singular Value Decomposition (SVD) maximizes the end-to-end semantic performance through the new layer mapping method. For multi-user scenarios, a method of semantic interference cancellation is proposed. Furthermore, a new metric, namely semantic information distortion (SID), is established to unify the expressions of semantic performance, which is affected by channel bandwidth ratio (CBR) and signal-to-noise ratio (SNR). With the help of the proposed metric, we derive performance expressions and the Semantic Outage Probability (SOP) of SIA-SC for Single-User Single-Input Single-Output (SU-SISO), Single-User MIMO (SU-MIMO), Multi-User SISO (MU-SISO) and Multi-User MIMO (MU-MIMO) scenarios. Numerical experiments show that SIA-SC can significantly improve semantic performance across various scenarios.
Submitted 26 December, 2023;
originally announced December 2023.
-
Generalized Category Discovery with Large Language Models in the Loop
Authors:
Wenbin An,
Wenkai Shi,
Feng Tian,
Haonan Lin,
QianYing Wang,
Yaqiang Wu,
Mingxiang Cai,
Luyan Wang,
Yan Chen,
Haiping Zhu,
Ping Chen
Abstract:
Generalized Category Discovery (GCD) is a crucial task that aims to recognize both known and novel categories from a set of unlabeled data by utilizing a few labeled data with only known categories. Due to the lack of supervision and category information, current methods usually perform poorly on novel categories and struggle to reveal semantic meanings of the discovered clusters, which limits their applications in the real world. To mitigate the above issues, we propose Loop, an end-to-end active-learning framework that introduces Large Language Models (LLMs) into the training loop, which can boost model performance and generate category names without relying on any human efforts. Specifically, we first propose Local Inconsistent Sampling (LIS) to select samples that have a higher probability of falling to wrong clusters, based on neighborhood prediction consistency and entropy of cluster assignment probabilities. Then we propose a Scalable Query strategy to allow LLMs to choose true neighbors of the selected samples from multiple candidate samples. Based on the feedback from LLMs, we perform Refined Neighborhood Contrastive Learning (RNCL) to pull samples and their neighbors closer to learn clustering-friendly representations. Finally, we select representative samples from clusters corresponding to novel categories to allow LLMs to generate category names for them. Extensive experiments on three benchmark datasets show that Loop outperforms SOTA models by a large margin and generates accurate category names for the discovered clusters. Code and data are available at https://github.com/Lackel/LOOP.
Submitted 26 May, 2024; v1 submitted 17 December, 2023;
originally announced December 2023.
-
Performance Analysis of MDMA-Based Cooperative MRC Networks with Relays in Dissimilar Rayleigh Fading Channels
Authors:
Lei Teng,
Wannian An,
Chen Dong,
Xiaoqi Qin,
Xiaodong Xu
Abstract:
Multiple access technology is a key technology in various generations of wireless communication systems. As a potential multiple access technology for the next generation wireless communication systems, model division multiple access (MDMA) technology improves spectrum efficiency and feasibility regions. This implies that the MDMA scheme can achieve greater performance gains compared to traditional schemes. Relay-assisted cooperative networks, as an infrastructure of wireless communication, can effectively utilize resources and improve performance when MDMA is applied. In this paper, a communication relay cooperative network based on MDMA in dissimilar Rayleigh fading channels is proposed, which consists of two source nodes, any number of decode-and-forward (DF) relay nodes, and one destination node, and uses maximal ratio combining (MRC) at the destination to combine the signals received from the source and relays. By applying the state transition matrix (STM) and moment generating function (MGF), closed-form analytical solutions for outage probability and resource utilization efficiency are derived. Theoretical and simulation results are presented to verify the validity of the theoretical analysis.
Submitted 27 November, 2023;
originally announced November 2023.
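The abstract above relies on maximal ratio combining (MRC) at the destination and on outage probability in dissimilar Rayleigh fading. A minimal Monte Carlo sketch of just those two ideas follows. It is a heavy simplification and not the paper's model: it ignores MDMA, the two sources, and whether each DF relay decodes successfully, and simply treats every branch as always present. It uses two standard facts: with ideal MRC the output SNR is the sum of the branch SNRs, and under Rayleigh fading each branch's instantaneous SNR is exponentially distributed around its mean. The function names, trial count, and seed are illustrative.

```python
import random


def mrc_output_snr(branch_snrs):
    """Ideal MRC: the combined SNR is the sum of the per-branch SNRs."""
    return sum(branch_snrs)


def simulate_outage(mean_snrs, threshold, trials=20000, seed=0):
    """Estimate P(combined SNR < threshold) by Monte Carlo simulation.

    `mean_snrs` holds one average SNR (linear scale) per branch; unequal
    values model dissimilar Rayleigh fading.
    """
    rng = random.Random(seed)
    outages = 0
    for _ in range(trials):
        # Rayleigh fading amplitude implies an exponentially distributed SNR.
        branch_snrs = [rng.expovariate(1.0 / mean) for mean in mean_snrs]
        if mrc_output_snr(branch_snrs) < threshold:
            outages += 1
    return outages / trials


if __name__ == "__main__":
    print("one branch :", simulate_outage([1.0], threshold=1.0))
    print("two branches:", simulate_outage([1.0, 2.0], threshold=1.0))
```

For a single branch with unit mean SNR and unit threshold, the estimate should sit close to the analytic value 1 - e^{-1}; adding a second branch lowers the outage probability, which is the diversity gain MRC provides.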
-
A Relay System for Semantic Image Transmission based on Shared Feature Extraction and Hyperprior Entropy Compression
Authors:
Wannian An,
Zhicheng Bao,
Haotai Liang,
Chen Dong,
Xiaodong
Abstract:
Nowadays, the need for high-quality image reconstruction and restoration is increasingly urgent. However, most image transmission systems may suffer from image quality degradation or transmission interruption in the face of interference such as channel noise and link fading. To solve this problem, a relay communication network for semantic image transmission based on shared feature extraction and hyperprior entropy compression (HEC) is proposed, where a shared feature extraction technique based on Pearson correlation is used to eliminate part of the shared features in the extracted semantic latent features. In addition, HEC is used to resist the effects of channel noise and link fading, and is carried out at both the source node and the relay node. Experimental results demonstrate that, compared with other recent methods, the proposed system has lower transmission overhead and higher semantic image transmission performance. In particular, under the same conditions, the multi-scale structural similarity (MS-SSIM) of this system exceeds that of the comparison method by approximately 0.2.
Submitted 17 November, 2023;
originally announced November 2023.
-
The Communication GSC System with Energy Harvesting Nodes aided by Opportunistic Routing
Authors:
Hanyu Liu,
Lei Teng,
Wannian An,
Xiaoqi Qin,
Chen Dong,
Xiaodong Xu
Abstract:
In this paper, a cooperative communication network based on energy-harvesting (EH) decode-and-forward (DF) relays is proposed. The relay nodes in this system use a harvest-storage-use (HSU) structure, and energy can be obtained from the surrounding environment through energy buffering. In order to improve the performance of the communication system, the opportunistic routing algorithm and the generalized selection combining (GSC) algorithm are adopted. In addition, from a discrete-time continuous-state space Markov chain model (DCSMC), a theoretical expression of the limiting distribution of the energy stored in infinite buffers is derived. Using the probability distribution and state transition matrix, the theoretical expressions of system outage probability, throughput and time cost per packet are obtained. Simulations verify that the theoretical results are in good agreement with the simulation results.
Submitted 16 November, 2023;
originally announced November 2023.
-
A Diffusion Weighted Graph Framework for New Intent Discovery
Authors:
Wenkai Shi,
Wenbin An,
Feng Tian,
Qinghua Zheng,
QianYing Wang,
Ping Chen
Abstract:
New Intent Discovery (NID) aims to recognize both new and known intents from unlabeled data with the aid of limited labeled data containing only known intents. Without considering structure relationships between samples, previous methods generate noisy supervisory signals which cannot strike a balance between quantity and quality, hindering the formation of new intent clusters and effective transfer of the pre-training knowledge. To mitigate this limitation, we propose a novel Diffusion Weighted Graph Framework (DWGF) to capture both semantic similarities and structure relationships inherent in data, enabling more sufficient and reliable supervisory signals. Specifically, for each sample, we diffuse neighborhood relationships along semantic paths guided by the nearest neighbors for multiple hops to characterize its local structure discriminately. Then, we sample its positive keys and weigh them based on semantic similarities and local structures for contrastive learning. During inference, we further propose Graph Smoothing Filter (GSF) to explicitly utilize the structure relationships to filter high-frequency noise embodied in semantically ambiguous samples on the cluster boundary. Extensive experiments show that our method outperforms state-of-the-art models on all evaluation metrics across multiple benchmark datasets. Code and data are available at https://github.com/yibai-shi/DWGF.
Submitted 24 October, 2023;
originally announced October 2023.
-
DNA: Denoised Neighborhood Aggregation for Fine-grained Category Discovery
Authors:
Wenbin An,
Feng Tian,
Wenkai Shi,
Yan Chen,
Qinghua Zheng,
QianYing Wang,
Ping Chen
Abstract:
Discovering fine-grained categories from coarsely labeled data is a practical and challenging task, which can bridge the gap between the demand for fine-grained analysis and the high annotation cost. Previous works mainly focus on instance-level discrimination to learn low-level features, but ignore semantic similarities between data, which may prevent these models from learning compact cluster representations. In this paper, we propose Denoised Neighborhood Aggregation (DNA), a self-supervised framework that encodes semantic structures of data into the embedding space. Specifically, we retrieve k-nearest neighbors of a query as its positive keys to capture semantic similarities between data and then aggregate information from the neighbors to learn compact cluster representations, which can make fine-grained categories more separable. However, the retrieved neighbors can be noisy and contain many false-positive keys, which can degrade the quality of learned embeddings. To cope with this challenge, we propose three principles to filter out these false neighbors for better representation learning. Furthermore, we theoretically justify that the learning objective of our framework is equivalent to a clustering loss, which can capture semantic similarities between data to form compact fine-grained clusters. Extensive experiments on three benchmark datasets show that our method can retrieve more accurate neighbors (21.31% accuracy improvement) and outperform state-of-the-art models by a large margin (average 9.96% improvement on three metrics). Our code and data are available at https://github.com/Lackel/DNA.
Submitted 16 October, 2023;
originally announced October 2023.
-
MovePose: A High-performance Human Pose Estimation Algorithm on Mobile and Edge Devices
Authors:
Dongyang Yu,
Haoyue Zhang,
Ruisheng Zhao,
Guoqi Chen,
Wangpeng An,
Yanhong Yang
Abstract:
We present MovePose, an optimized lightweight convolutional neural network designed specifically for real-time body pose estimation on CPU-based mobile devices. The current solutions do not provide satisfactory accuracy and speed for human posture estimation, and MovePose addresses this gap. It aims to maintain real-time performance while improving the accuracy of human posture estimation for mobile devices. Our MovePose algorithm has attained a Mean Average Precision (mAP) score of 68.0 on the COCO validation dataset. The MovePose algorithm displayed efficiency with a performance of 69+ frames per second (fps) when run on an Intel i9-10920x CPU. Additionally, it showcased an increased performance of 452+ fps on an NVIDIA RTX3090 GPU. On an Android phone equipped with a Snapdragon 8 + 4G processor, the fps reached above 11. To enhance accuracy, we incorporated three techniques: deconvolution, large kernel convolution, and coordinate classification methods. Compared to basic upsampling, deconvolution is trainable, improves model capacity, and enhances the receptive field. Large kernel convolution strengthens these properties at a decreased computational cost. In summary, MovePose provides high accuracy and real-time performance, making it a potential tool for a variety of applications, including those focused on mobile-side human posture estimation. The code and models for this algorithm will be made publicly accessible.
Submitted 19 April, 2024; v1 submitted 17 August, 2023;
originally announced August 2023.
-
OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation
Authors:
Dongyang Yu,
Shihao Wang,
Yuan Fang,
Wangpeng An
Abstract:
This paper presents OmniDataComposer, an innovative approach for multimodal data fusion and unlimited data generation with an intent to refine and uncomplicate interplay among diverse data modalities. Coming to the core breakthrough, it introduces a cohesive data structure proficient in processing and merging multimodal data inputs, which include video, audio, and text.
Our crafted algorithm leverages advancements across multiple operations such as video/image caption extraction, dense caption extraction, Automatic Speech Recognition (ASR), Optical Character Recognition (OCR), Recognize Anything Model (RAM), and object tracking. OmniDataComposer is capable of identifying over 6400 categories of objects, substantially broadening the spectrum of visual information. It amalgamates these diverse modalities, promoting reciprocal enhancement among modalities and facilitating cross-modal data correction. The final output metamorphoses each video input into an elaborate sequential document, virtually transmuting videos into thorough narratives that are easier for large language models to process.
Future prospects include optimizing datasets for each modality to encourage unlimited data generation. This robust base will offer priceless insights to models like ChatGPT, enabling them to create higher quality datasets for video captioning and easing question-answering tasks based on video content. OmniDataComposer inaugurates a new stage in multimodal learning, imparting enormous potential for augmenting AI's understanding and generation of complex, real-world data.
Submitted 17 August, 2023; v1 submitted 8 August, 2023;
originally announced August 2023.
-
Joint Coordinate Regression and Association For Multi-Person Pose Estimation, A Pure Neural Network Approach
Authors:
Dongyang Yu,
Yunshi Xie,
Wangpeng An,
Li Zhang,
Yufeng Yao
Abstract:
We introduce a novel one-stage end-to-end multi-person 2D pose estimation algorithm, known as Joint Coordinate Regression and Association (JCRA), that produces human pose joints and associations without requiring any post-processing. The proposed algorithm is fast, accurate, effective, and simple. The one-stage end-to-end network architecture significantly improves the inference speed of JCRA. Meanwhile, we devised a symmetric network structure for both the encoder and decoder, which ensures high accuracy in identifying keypoints. It follows an architecture that directly outputs part positions via a transformer network, resulting in a significant improvement in performance. Extensive experiments on the MS COCO and CrowdPose benchmarks demonstrate that JCRA outperforms state-of-the-art approaches in both accuracy and efficiency. Moreover, JCRA achieves 69.2 mAP and is 78% faster at inference than previous state-of-the-art bottom-up algorithms. The code for this algorithm will be publicly available.
Submitted 19 April, 2024; v1 submitted 3 July, 2023;
originally announced July 2023.
-
Monte Carlo Linear Clustering with Single-Point Supervision is Enough for Infrared Small Target Detection
Authors:
Boyang Li,
Yingqian Wang,
Longguang Wang,
Fei Zhang,
Ting Liu,
Zaiping Lin,
Wei An,
Yulan Guo
Abstract:
Single-frame infrared small target (SIRST) detection aims at separating small targets from cluttered backgrounds in infrared images. Recently, deep learning based methods have achieved promising performance on SIRST detection, but at the cost of a large amount of training data with expensive pixel-level annotations. To reduce the annotation burden, we propose the first method to achieve SIRST detection with single-point supervision. The core idea of this work is to recover the per-pixel mask of each target from the given single point label by using clustering approaches, which looks simple but is indeed challenging since targets are always non-salient and accompanied by background clutter. To handle this issue, we introduce randomness to the clustering process by adding noise to the input images, and then obtain much more reliable pseudo masks by averaging the clustered results. Thanks to this "Monte Carlo" clustering approach, our method can accurately recover pseudo masks and thus turn arbitrary fully supervised SIRST detection networks into weakly supervised ones with only single point annotation. Experiments on four datasets demonstrate that our method can be applied to existing SIRST detection networks to achieve comparable performance with their fully supervised counterparts, which reveals that single-point supervision is strong enough for SIRST detection. Our code will be available at: https://github.com/YeRen123455/SIRST-Single-Point-Supervision.
Submitted 10 April, 2023;
originally announced April 2023.
-
You Only Train Once: Learning a General Anomaly Enhancement Network with Random Masks for Hyperspectral Anomaly Detection
Authors:
Zhaoxu Li,
Yingqian Wang,
Chao Xiao,
Qiang Ling,
Zaiping Lin,
Wei An
Abstract:
In this paper, we introduce a new approach to address the challenge of generalization in hyperspectral anomaly detection (AD). Our method eliminates the need for adjusting parameters or retraining on new test scenes as required by most existing methods. Employing an image-level training paradigm, we achieve a general anomaly enhancement network for hyperspectral AD that only needs to be trained once. Trained on a set of anomaly-free hyperspectral images with random masks, our network can learn the spatial context characteristics between anomalies and background in an unsupervised way. Additionally, a plug-and-play model selection module is proposed to search for a spatial-spectral transform domain that is more suitable for the AD task than the original data. To establish a unified benchmark to comprehensively evaluate our method and existing methods, we develop a large-scale hyperspectral AD dataset (HAD100) that includes 100 real test scenes with diverse anomaly targets. In comparison experiments, we combine our network with a parameter-free detector and achieve the optimal balance between detection accuracy and inference speed among state-of-the-art AD methods. Experimental results also show that our method still achieves competitive performance when the training and test sets are captured by different sensor devices. Our code is available at https://github.com/ZhaoxuLi123/AETNet.
Submitted 31 March, 2023;
originally announced March 2023.
-
Parameter-Free Channel Attention for Image Classification and Super-Resolution
Authors:
Yuxuan Shi,
Lingxiao Yang,
Wangpeng An,
Xiantong Zhen,
Liuqing Wang
Abstract:
The channel attention mechanism is a useful technique widely employed in deep convolutional neural networks to boost the performance of image processing tasks, e.g., image classification and image super-resolution. It is usually designed as a parameterized sub-network and embedded into the convolutional layers of the network to learn more powerful feature representations. However, current channel attention introduces more parameters and therefore leads to higher computational costs. To deal with this issue, in this work, we propose a Parameter-Free Channel Attention (PFCA) module that boosts the performance of popular image classification and image super-resolution networks while completely eliminating the parameter growth of channel attention. Experiments on CIFAR-100, ImageNet, and DIV2K validate that our PFCA module improves the performance of ResNet on image classification and improves the performance of MSRResNet on image super-resolution tasks, respectively, while bringing little growth of parameters and FLOPs.
Submitted 20 March, 2023;
originally announced March 2023.
-
Modeling and Performance Analysis of Single-Server Database Over Quasi-static Rayleigh Fading Channel
Authors:
Mengying Chen,
Wannian An,
Yang Liu,
Chen Dong,
Xiaodong Xu,
Boxiao Han,
Ping Zhang
Abstract:
The cloud database is a key technology in cloud computing. The effective and efficient service quality of the cloud database is inseparable from communication technology, just as improving communication quality will reduce the concurrency phenomenon in the ticketing system. In order to visually observe the impact of communication on the cloud database, we propose a Communication-Database (C-D) Model with a single-server database over the quasi-static Rayleigh fading channel, which consists of three parts: CLIENTS SOURCE, COMMUNICATION SYSTEM and DATABASE SYSTEM. This paper uses the queuing model, M/G/1//K, to model the whole system. The C-D Model is analyzed in two cases: nonlinearity and linearity, which correspond to some instances of SISO and MIMO. The simulation results of average staying time, average number of transactions and other performance characteristics are largely consistent with the theoretical results, which verifies the validity of the C-D Model. The comparison of these experimental results also shows that poor communication quality does lead to a reduction in the quality of service.
Submitted 17 January, 2023; v1 submitted 18 December, 2022;
originally announced December 2022.
-
Generalized Category Discovery with Decoupled Prototypical Network
Authors:
Wenbin An,
Feng Tian,
Qinghua Zheng,
Wei Ding,
QianYing Wang,
Ping Chen
Abstract:
Generalized Category Discovery (GCD) aims to recognize both known and novel categories from a set of unlabeled data, based on another dataset labeled with only known categories. Without considering differences between known and novel categories, current methods learn about them in a coupled manner, which can hurt the model's generalization and discriminative ability. Furthermore, the coupled training approach prevents these models from transferring category-specific knowledge explicitly from labeled data to unlabeled data, which can lose high-level semantic information and impair model performance. To mitigate the above limitations, we present a novel model called Decoupled Prototypical Network (DPN). By formulating a bipartite matching problem for category prototypes, DPN can not only decouple known and novel categories to achieve different training targets effectively, but also align known categories in labeled and unlabeled data to transfer category-specific knowledge explicitly and capture high-level semantics. Furthermore, DPN can learn more discriminative features for both known and novel categories through our proposed Semantic-aware Prototypical Learning (SPL). Besides capturing meaningful semantic information, SPL can also alleviate the noise of hard pseudo labels through semantic-weighted soft assignment. Extensive experiments show that DPN outperforms state-of-the-art models by a large margin on all evaluation metrics across multiple benchmark datasets. Code and data are available at https://github.com/Lackel/DPN.
Submitted 15 March, 2023; v1 submitted 28 November, 2022;
originally announced November 2022.
-
Fine-grained Category Discovery under Coarse-grained supervision with Hierarchical Weighted Self-contrastive Learning
Authors:
Wenbin An,
Feng Tian,
Ping Chen,
Siliang Tang,
Qinghua Zheng,
QianYing Wang
Abstract:
Novel category discovery aims at adapting models trained on known categories to novel categories. Previous works only focus on the scenario where known and novel categories are of the same granularity. In this paper, we investigate a new practical scenario called Fine-grained Category Discovery under Coarse-grained supervision (FCDC). FCDC aims at discovering fine-grained categories with only coarse-grained labeled data, which can adapt models to categories of different granularity from known ones and significantly reduce labeling cost. It is also a challenging task since supervised training on coarse-grained categories tends to focus on inter-class distance (distance between coarse-grained classes) but ignore intra-class distance (distance between fine-grained sub-classes), which is essential for separating fine-grained categories. Considering that most current methods cannot transfer knowledge from the coarse-grained level to the fine-grained level, we propose a hierarchical weighted self-contrastive network by building a novel weighted self-contrastive module and combining it with supervised learning in a hierarchical manner. Extensive experiments on public datasets show both the effectiveness and efficiency of our model over compared methods. Code and data are available at https://github.com/Lackel/Hierarchical_Weighted_SCL.
Submitted 14 October, 2022;
originally announced October 2022.
-
MTU-Net: Multi-level TransUNet for Space-based Infrared Tiny Ship Detection
Authors:
Tianhao Wu,
Boyang Li,
Yihang Luo,
Yingqian Wang,
Chao Xiao,
Ting Liu,
Jungang Yang,
Wei An,
Yulan Guo
Abstract:
Space-based infrared tiny ship detection aims at separating tiny ships from the images captured by earth orbiting satellites. Due to the extremely large image coverage area (e.g., thousands of square kilometers), candidate targets in these images are much smaller, dimmer, and more changeable than those targets observed by aerial-based and land-based imaging devices. Existing short imaging distance-based infrared datasets and target detection methods cannot be readily adapted to the space-based surveillance task. To address these problems, we develop a space-based infrared tiny ship detection dataset (namely, NUDT-SIRST-Sea) with 48 space-based infrared images and 17598 pixel-level tiny ship annotations. Each image covers about 10000 square kilometers of area with 10000×10000 pixels. Considering the extreme characteristics (e.g., small, dim, changeable) of those tiny ships in such challenging scenes, we propose a multi-level TransUNet (MTU-Net) in this paper. Specifically, we design a Vision Transformer (ViT) Convolutional Neural Network (CNN) hybrid encoder to extract multi-level features. Local feature maps are first extracted by several convolution layers and then fed into the multi-level feature extraction module (MVTM) to capture long-distance dependency. We further propose a copy-rotate-resize-paste (CRRP) data augmentation approach to accelerate the training phase, which effectively alleviates the issue of sample imbalance between targets and background. Besides, we design a FocalIoU loss to achieve both target localization and shape description. Experimental results on the NUDT-SIRST-Sea dataset show that our MTU-Net outperforms traditional and existing deep learning based SIRST methods in terms of probability of detection, false alarm rate and intersection over union.
Submitted 27 September, 2022;
originally announced September 2022.
-
Soft decoding without soft demapping with ORBGRAND
Authors:
Wei An,
Muriel Medard,
Ken R. Duffy
Abstract:
For spectral efficiency, higher order modulation symbols confer information on more than one bit. As soft detection forward error correction decoders assume the availability of information at binary granularity, however, soft demappers are required to compute per-bit reliabilities from complex-valued signals. Here we show that the recently introduced universal soft detection decoder ORBGRAND can be adapted to work with symbol-level soft information, obviating the need for energy expensive soft demapping. We establish that doing so reduces complexity while retaining the error correction performance achieved with the optimal demapper.
Submitted 25 July, 2022;
originally announced July 2022.
-
Real-World Light Field Image Super-Resolution via Degradation Modulation
Authors:
Yingqian Wang,
Zhengyu Liang,
Longguang Wang,
Jungang Yang,
Wei An,
Yulan Guo
Abstract:
Recent years have witnessed the great advances of deep neural networks (DNNs) in light field (LF) image super-resolution (SR). However, existing DNN-based LF image SR methods are developed on a single fixed degradation (e.g., bicubic downsampling), and thus cannot be applied to super-resolve real LF images with diverse degradation. In this paper, we propose a simple yet effective method for real-world LF image SR. In our method, a practical LF degradation model is developed to formulate the degradation process of real LF images. Then, a convolutional neural network is designed to incorporate the degradation prior into the SR process. By training on LF images using our formulated degradation, our network can learn to modulate different degradation while incorporating both spatial and angular information in LF images. Extensive experiments on both synthetically degraded and real-world LF images demonstrate the effectiveness of our method. Compared with existing state-of-the-art single and LF image SR methods, our method achieves superior SR performance under a wide range of degradation, and generalizes better to real LF images. Codes and models are available at https://yingqianwang.github.io/LF-DMnet/.
Submitted 30 November, 2023; v1 submitted 13 June, 2022;
originally announced June 2022.
-
Opportunistic Routing aided Cooperative Communication MRC Network with Energy-Harvesting Nodes
Authors:
Lei Teng,
Wannian An,
Chen Dong,
Xiaodong Xu,
Boxiao Han
Abstract:
In this paper, we consider a multi-hop cooperative network founded on two energy-harvesting (EH) decode-and-forward (DF) relays, which are equipped with a harvest-store-use (HSU) architecture to harvest energy from the ambience using energy buffers. To boost data delivery in this network, we propose maximal ratio combining (MRC) at the destination to combine the signals received from the source and relays, as well as an opportunistic routing (OR) algorithm that considers the channel status information, location and energy buffer status of the relays. By applying a discrete-time continuous-state space Markov chain model (DCSMC), the algorithm-based theoretical expression for the limiting distribution of stored energy in an infinite-size buffer is derived. Furthermore, using both the limiting distributions of the energy buffers and the probability of the transmitter candidates set, the algorithm-based theoretical expressions for outage probability, throughput and timeslot cost for each data of the network are obtained. The simulation results are presented to validate the derived algorithm-based theoretical expressions.
Submitted 2 February, 2023; v1 submitted 10 June, 2022;
originally announced June 2022.
-
Opportunistic Routing Aided Cooperative Communication Network with Energy-Harvesting
Authors:
Wannian An,
Chen Dong,
Xiaodong Xu,
Chao Xu,
Shujun Han,
Lei Teng
Abstract:
In this paper, a cooperative communication network based on energy-harvesting (EH) decode-and-forward (DF) relays that harvest energy from the ambience using buffers with harvest-store-use (HSU) architecture is considered. An opportunistic routing (OR) protocol, which selects the transmission path of a packet based on the node transmission priority, is proposed to improve data delivery in this network. Additionally, an algorithm based on the state transition matrix (STM) is proposed to obtain the probability distribution of the candidate broadcast node set. Based on the probability distribution, the existence conditions and the theoretical expressions for the limiting distribution of energy in energy buffers using the discrete-time continuous-state space Markov chain (DCSMC) model are derived. Furthermore, the closed-form expressions for network outage probability and throughput are obtained with the help of the limiting distributions of energy stored in buffers. Numerous experiments have been performed to validate the derived theoretical expressions.
Submitted 11 June, 2022; v1 submitted 13 May, 2022;
originally announced May 2022.
-
Occlusion-Aware Cost Constructor for Light Field Depth Estimation
Authors:
Yingqian Wang,
Longguang Wang,
Zhengyu Liang,
Jungang Yang,
Wei An,
Yulan Guo
Abstract:
Matching cost construction is a key step in light field (LF) depth estimation, but has rarely been studied in the deep learning era. Recent deep learning-based LF depth estimation methods construct matching cost by sequentially shifting each sub-aperture image (SAI) with a series of predefined offsets, which is complex and time-consuming. In this paper, we propose a simple and fast cost constructor to construct matching cost for LF depth estimation. Our cost constructor is composed of a series of convolutions with specifically designed dilation rates. By applying our cost constructor to SAI arrays, pixels under predefined disparities can be integrated and matching cost can be constructed without using any shifting operation. More importantly, the proposed cost constructor is occlusion-aware and can handle occlusions by dynamically modulating pixels from different views. Based on the proposed cost constructor, we develop a deep network for LF depth estimation. Our network ranks first on the commonly used 4D LF benchmark in terms of the mean square error (MSE), and achieves a faster running time than other state-of-the-art methods.
Submitted 3 March, 2022;
originally announced March 2022.
-
Ordered Reliability Bits Guessing Random Additive Noise Decoding
Authors:
Ken R. Duffy,
Wei An,
Muriel Medard
Abstract:
Error correction techniques traditionally focus on the co-design of restricted code-structures in tandem with code-specific decoders that are computationally efficient when decoding long codes in hardware. Modern applications are, however, driving demand for ultra-reliable low-latency communications (URLLC), rekindling interest in the performance of shorter, higher-rate error correcting codes, and raising the possibility of revisiting universal, code-agnostic decoders.
To that end, here we introduce a soft-detection variant of Guessing Random Additive Noise Decoding (GRAND) called Ordered Reliability Bits GRAND that can accurately decode any moderate redundancy block-code. It is designed with efficient circuit implementation in mind, and determines accurate decodings while retaining the original hard detection GRAND algorithm's suitability for a highly parallelized implementation in hardware.
ORBGRAND is shown to provide excellent soft decision block error performance for codes of distinct classes (BCH, CA-Polar and RLC) with modest complexity, while providing better block error rate performance than CA-SCL, a state of the art soft detection CA-Polar decoder. ORBGRAND offers the possibility of an accurate, energy efficient soft detection decoder suitable for delivering URLLC in a single hardware realization.
Submitted 29 August, 2022; v1 submitted 28 February, 2022;
originally announced February 2022.
-
Disentangling Light Fields for Super-Resolution and Disparity Estimation
Authors:
Yingqian Wang,
Longguang Wang,
Gaochang Wu,
Jungang Yang,
Wei An,
Jingyi Yu,
Yulan Guo
Abstract:
Light field (LF) cameras record both intensity and directions of light rays, and encode 3D scenes into 4D LF images. Recently, many convolutional neural networks (CNNs) have been proposed for various LF image processing tasks. However, it is challenging for CNNs to effectively process LF images since the spatial and angular information is highly intertwined with varying disparities. In this paper, we propose a generic mechanism to disentangle this coupled information for LF image processing. Specifically, we first design a class of domain-specific convolutions to disentangle LFs from different dimensions, and then leverage these disentangled features by designing task-specific modules. Our disentangling mechanism can well incorporate the LF structure prior and effectively handle 4D LF data. Based on the proposed mechanism, we develop three networks (i.e., DistgSSR, DistgASR and DistgDisp) for spatial super-resolution, angular super-resolution and disparity estimation. Experimental results show that our networks achieve state-of-the-art performance on all three tasks, which demonstrates the effectiveness, efficiency, and generality of our disentangling mechanism. Project page: https://yingqianwang.github.io/DistgLF/.
Submitted 22 July, 2023; v1 submitted 21 February, 2022;
originally announced February 2022.
-
Detecting and Tracking Small and Dense Moving Objects in Satellite Videos: A Benchmark
Authors:
Qian Yin,
Qingyong Hu,
Hao Liu,
Feng Zhang,
Yingqian Wang,
Zaiping Lin,
Wei An,
Yulan Guo
Abstract:
Satellite video cameras can provide continuous observation for a large-scale area, which is important for many remote sensing applications. However, achieving moving object detection and tracking in satellite videos remains challenging due to the insufficient appearance information of objects and lack of high-quality datasets. In this paper, we first build a large-scale satellite video dataset with rich annotations for the task of moving object detection and tracking. This dataset is collected by the Jilin-1 satellite constellation and composed of 47 high-quality videos with 1,646,038 instances of interest for object detection and 3,711 trajectories for object tracking. We then introduce a motion modeling baseline to improve the detection rate and reduce false alarms based on accumulative multi-frame differencing and robust matrix completion. Finally, we establish the first public benchmark for moving object detection and tracking in satellite videos, and extensively evaluate the performance of several representative approaches on our dataset. Comprehensive experimental analyses and insightful conclusions are also provided. The dataset is available at https://github.com/QingyongHu/VISO.
Submitted 25 November, 2021;
originally announced November 2021.
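The abstract above mentions accumulative multi-frame differencing as part of its motion modeling baseline. The following is a minimal sketch of one simple reading of that idea, not the authors' baseline: it sums absolute differences between consecutive frames and thresholds the result. The function name, the list-of-lists frame format and the threshold are all assumptions made for illustration, and the robust matrix completion step is not covered.

```python
def accumulate_frame_differences(frames, threshold):
    """Sum absolute differences of consecutive frames, then threshold.

    frames: list of equally sized 2D lists of pixel intensities
    (a hypothetical input format chosen for this sketch).
    Returns a binary motion mask as a 2D list of 0/1 values.
    """
    height, width = len(frames[0]), len(frames[0][0])
    accumulated = [[0] * width for _ in range(height)]
    # Walk over consecutive frame pairs and accumulate their differences.
    for previous, current in zip(frames, frames[1:]):
        for y in range(height):
            for x in range(width):
                accumulated[y][x] += abs(current[y][x] - previous[y][x])
    # Pixels whose accumulated change reaches the threshold count as moving.
    return [[1 if value >= threshold else 0 for value in row]
            for row in accumulated]


# A single bright pixel moving one step to the right per frame.
frames = [[[9, 0, 0, 0]], [[0, 9, 0, 0]], [[0, 0, 9, 0]]]
mask = accumulate_frame_differences(frames, threshold=9)
```

Accumulating over several frames makes a slow, tiny object leave a longer trace than a single frame pair would, at the cost of smearing its position along the trajectory.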
-
Dense Dual-Attention Network for Light Field Image Super-Resolution
Authors:
Yu Mo,
Yingqian Wang,
Chao Xiao,
Jungang Yang,
Wei An
Abstract:
Light field (LF) images can be used to improve the performance of image super-resolution (SR) because both angular and spatial information is available. It is challenging to incorporate distinctive information from different views for LF image SR. Moreover, the long-term information from the previous layers can be weakened as the depth of network increases. In this paper, we propose a dense dual-attention network for LF image SR. Specifically, we design a view attention module to adaptively capture discriminative features across different views and a channel attention module to selectively focus on informative information across all channels. These two modules are fed to two branches and stacked separately in a chain structure for adaptive fusion of hierarchical features and distillation of valid information. Meanwhile, a dense connection is used to fully exploit multi-level information. Extensive experiments demonstrate that our dense dual-attention mechanism can capture informative information across views and channels to improve SR performance. Comparative results show the advantage of our method over state-of-the-art methods on public datasets.
Submitted 22 October, 2021;
originally announced October 2021.
-
Selective Light Field Refocusing for Camera Arrays Using Bokeh Rendering and Superresolution
Authors:
Yingqian Wang,
Jungang Yang,
Yulan Guo,
Chao Xiao,
Wei An
Abstract:
Camera arrays provide spatial and angular information within a single snapshot. With refocusing methods, focal planes can be altered after exposure. In this letter, we propose a light field refocusing method to improve the imaging quality of camera arrays. In our method, the disparity is first estimated. Then, the unfocused region (bokeh) is rendered by using a depth-based anisotropic filter. Finally, the refocused image is produced by a reconstruction-based superresolution approach where the bokeh image is used as a regularization term. Our method can selectively refocus images with focused region being superresolved and bokeh being aesthetically rendered. Our method also enables postadjustment of depth of field. We conduct experiments on both public and self-developed datasets. Our method achieves superior visual performance with acceptable computational cost as compared to other state-of-the-art methods. Code is available at https://github.com/YingqianWang/Selective-LF-Refocusing.
Submitted 9 August, 2021;
originally announced August 2021.
-
Dense Nested Attention Network for Infrared Small Target Detection
Authors:
Boyang Li,
Chao Xiao,
Longguang Wang,
Yingqian Wang,
Zaiping Lin,
Miao Li,
Wei An,
Yulan Guo
Abstract:
Single-frame infrared small target (SIRST) detection aims at separating small targets from cluttered backgrounds. With the advances of deep learning, CNN-based methods have yielded promising results in generic object detection due to their powerful modeling capability. However, existing CNN-based methods cannot be directly applied to infrared small targets since pooling layers in their networks could lead to the loss of targets in deep layers. To handle this problem, we propose a dense nested attention network (DNANet) in this paper. Specifically, we design a dense nested interactive module (DNIM) to achieve progressive interaction among high-level and low-level features. With the repeated interaction in DNIM, infrared small targets in deep layers can be maintained. Based on DNIM, we further propose a cascaded channel and spatial attention module (CSAM) to adaptively enhance multi-level features. With our DNANet, contextual information of small targets can be well incorporated and fully exploited by repeated fusion and enhancement. Moreover, we develop an infrared small target dataset (namely, NUDT-SIRST) and propose a set of evaluation metrics to conduct comprehensive performance evaluation. Experiments on both public and our self-developed datasets demonstrate the effectiveness of our method. Compared to other state-of-the-art methods, our method achieves better performance in terms of probability of detection (Pd), false-alarm rate (Fa), and intersection over union (IoU).
Submitted 15 August, 2022; v1 submitted 1 June, 2021;
originally announced June 2021.
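Two of the metrics named in the abstract above can be illustrated on binary masks. The sketch below computes pixel-level IoU and a pixel-level false-alarm rate; the abstract does not give the exact definitions, so the formulas here (intersection divided by union, and false-positive pixels divided by all pixels) are assumptions, and the function name is hypothetical. Probability of detection is target-level and is left out.

```python
def iou_and_false_alarm(pred, truth):
    """Pixel-level IoU and false-alarm rate for two binary masks.

    pred, truth: equally sized 2D lists of 0/1 values.
    Fa is taken here as false-positive pixels over all pixels,
    which is an assumed definition for this sketch.
    """
    intersection = union = false_positive = total = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            intersection += p & t
            union += p | t
            false_positive += p & (1 - t)
            total += 1
    # Two empty masks agree perfectly, so report IoU = 1 in that case.
    iou = intersection / union if union else 1.0
    return iou, false_positive / total


pred = [[1, 1, 0], [0, 0, 0]]
truth = [[0, 1, 1], [0, 0, 0]]
iou, fa = iou_and_false_alarm(pred, truth)
```

For targets that occupy only a few pixels, a one-pixel offset already changes IoU substantially, which is one reason small-target work reports several metrics side by side.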
-
Non-Convex Tensor Low-Rank Approximation for Infrared Small Target Detection
Authors:
Ting Liu,
Jungang Yang,
Boyang Li,
Chao Xiao,
Yang Sun,
Yingqian Wang,
Wei An
Abstract:
Infrared small target detection is an important fundamental task in the infrared system. Therefore, many infrared small target detection methods have been proposed, in which the low-rank model has been used as a powerful tool. However, most low-rank-based methods assign the same weights for different singular values, which will lead to inaccurate background estimation. Considering that different singular values have different importance and should be treated discriminatively, in this paper, we propose a non-convex tensor low-rank approximation (NTLA) method for infrared small target detection. In our method, NTLA regularization adaptively assigns different weights to different singular values for accurate background estimation. Based on the proposed NTLA, we propose asymmetric spatial-temporal total variation (ASTTV) regularization to achieve more accurate background estimation in complex scenes. Compared with the traditional total variation approach, ASTTV exploits different smoothness intensities for spatial and temporal regularization. We design an efficient algorithm to find the optimal solution of our method. Compared with some state-of-the-art methods, the proposed method achieves an improvement in terms of various evaluation metrics. Extensive experimental results in various complex scenes demonstrate that our method has strong robustness and low false-alarm rate. Code is available at https://github.com/LiuTing20a/ASTTV-NTLA.
Submitted 20 November, 2021; v1 submitted 31 May, 2021;
originally announced May 2021.
-
Exploring Robustness of Unsupervised Domain Adaptation in Semantic Segmentation
Authors:
Jinyu Yang,
Chunyuan Li,
Weizhi An,
Hehuan Ma,
Yuzhi Guo,
Yu Rong,
Peilin Zhao,
Junzhou Huang
Abstract:
Recent studies imply that deep neural networks are vulnerable to adversarial examples: inputs with a slight but intentional perturbation that are incorrectly classified by the network. Such vulnerability makes it risky for some security-related applications (e.g., semantic segmentation in autonomous cars) and triggers tremendous concerns about model reliability. For the first time, we comprehensively evaluate the robustness of existing unsupervised domain adaptation (UDA) methods and propose a robust UDA approach. It is rooted in two observations: (i) the robustness of UDA methods in semantic segmentation remains unexplored, which poses a security concern in this field; and (ii) although commonly used self-supervision (e.g., rotation and jigsaw) benefits image tasks such as classification and recognition, it fails to provide the critical supervision signals needed to learn discriminative representations for segmentation tasks. These observations motivate us to propose adversarial self-supervision UDA (or ASSUDA) that maximizes the agreement between clean images and their adversarial examples by a contrastive loss in the output space. Extensive empirical studies on commonly used benchmarks demonstrate that ASSUDA is resistant to adversarial attacks.
Submitted 25 July, 2021; v1 submitted 22 May, 2021;
originally announced May 2021.
-
CRC Codes as Error Correction Codes
Authors:
Wei An,
Muriel Médard,
Ken R. Duffy
Abstract:
CRC codes have long since been adopted in a vast range of applications. The established notion that they are suitable primarily for error detection can be set aside through use of the recently proposed Guessing Random Additive Noise Decoding (GRAND). Hard-detection (GRAND-SOS) and soft-detection (ORBGRAND) variants can decode any short, high-rate block code, making them suitable for error correction of CRC-coded data. When decoded with GRAND, short CRC codes have error correction capability that is at least as good as popular codes such as BCH codes, but with no restriction on either code length or rate.
The state-of-the-art CA-Polar codes are concatenated CRC and Polar codes. For error correction, we find that the CRC is a better short code than either Polar or CA-Polar codes. Moreover, the standard CA-SCL decoder only uses the CRC for error detection and therefore suffers severe performance degradation in short, high rate settings when compared with the performance GRAND provides, which uses all of the CA-Polar bits for error correction.
Using GRAND, existing systems can be upgraded from error detection to low-latency error correction without re-engineering the encoder, and additional applications of CRCs can be found in IoT, Ultra-Reliable Low Latency Communication (URLLC), and beyond. The universality of GRAND, its ready parallelized implementation in hardware, and the good performance of CRC as codes make their combination a viable solution for low-latency applications.
Submitted 28 April, 2021;
originally announced April 2021.
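The idea in the abstract above, using a CRC for correction rather than only detection, can be sketched with a toy hard-detection GRAND loop: noise patterns are guessed in order of increasing Hamming weight, and the first guess that turns the received word into a valid CRC codeword is accepted. This is an illustrative sketch under stated assumptions, not the authors' implementation: the 3-bit generator x^3 + x + 1, the function names and the `max_weight` cutoff are all chosen here for the example.

```python
from itertools import combinations

GEN = [1, 0, 1, 1]  # x^3 + x + 1, a toy generator assumed for this sketch


def crc_remainder(bits, gen=GEN):
    """Remainder of bits(x) divided by gen(x) over GF(2)."""
    work = list(bits)
    for i in range(len(work) - len(gen) + 1):
        if work[i]:
            for j, g in enumerate(gen):
                work[i + j] ^= g
    return work[-(len(gen) - 1):]


def crc_encode(message, gen=GEN):
    """Append the CRC so that the whole word is divisible by gen(x)."""
    padding = [0] * (len(gen) - 1)
    return list(message) + crc_remainder(list(message) + padding, gen)


def grand_decode(received, gen=GEN, max_weight=2):
    """Guess noise patterns by increasing Hamming weight.

    Returns (codeword, flipped_positions) for the first guess whose
    removal leaves a zero CRC remainder, or (None, None) on abandonment.
    """
    for weight in range(max_weight + 1):
        for flips in combinations(range(len(received)), weight):
            candidate = list(received)
            for position in flips:
                candidate[position] ^= 1
            if not any(crc_remainder(candidate, gen)):
                return candidate, flips
    return None, None


codeword = crc_encode([1, 1, 0, 1])
noisy = list(codeword)
noisy[2] ^= 1  # a single bit error on the channel
decoded, flips = grand_decode(noisy)
```

The decoder only needs a membership test for the code, which is why the same loop works for any block code; the cost is that the number of guesses grows quickly with the noise weight, so the approach suits short, high-rate codes.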
-
Unsupervised Degradation Representation Learning for Blind Super-Resolution
Authors:
Longguang Wang,
Yingqian Wang,
Xiaoyu Dong,
Qingyu Xu,
Jungang Yang,
Wei An,
Yulan Guo
Abstract:
Most existing CNN-based super-resolution (SR) methods are developed based on an assumption that the degradation is fixed and known (e.g., bicubic downsampling). However, these methods suffer a severe performance drop when the real degradation is different from their assumption. To handle various unknown degradations in real-world applications, previous methods rely on degradation estimation to reconstruct the SR image. Nevertheless, degradation estimation methods are usually time-consuming and may lead to SR failure due to large estimation errors. In this paper, we propose an unsupervised degradation representation learning scheme for blind SR without explicit degradation estimation. Specifically, we learn abstract representations to distinguish various degradations in the representation space rather than explicit estimation in the pixel space. Moreover, we introduce a Degradation-Aware SR (DASR) network with flexible adaption to various degradations based on the learned representations. It is demonstrated that our degradation representation learning scheme can extract discriminative representations to obtain accurate degradation information. Experiments on both synthetic and real images show that our network achieves state-of-the-art performance for the blind SR task. Code is available at: https://github.com/LongguangWang/DASR.
Submitted 1 April, 2021;
originally announced April 2021.
-
Symmetric Parallax Attention for Stereo Image Super-Resolution
Authors:
Yingqian Wang,
Xinyi Ying,
Longguang Wang,
Jungang Yang,
Wei An,
Yulan Guo
Abstract:
Although recent years have witnessed the great advances in stereo image super-resolution (SR), the beneficial information provided by binocular systems has not been fully used. Since stereo images are highly symmetric under epipolar constraint, in this paper, we improve the performance of stereo image SR by exploiting symmetry cues in stereo image pairs. Specifically, we propose a symmetric bi-directional parallax attention module (biPAM) and an inline occlusion handling scheme to effectively interact cross-view information. Then, we design a Siamese network equipped with a biPAM to super-resolve both sides of views in a highly symmetric manner. Finally, we design several illuminance-robust losses to enhance stereo consistency. Experiments on four public datasets demonstrate the superior performance of our method. Source code is available at https://github.com/YingqianWang/iPASSR.
Submitted 20 April, 2021; v1 submitted 7 November, 2020;
originally announced November 2020.
-
Keep the bursts and ditch the interleavers
Authors:
Wei An,
Muriel Médard,
Ken R. Duffy
Abstract:
To facilitate applications in IoT, 5G, and beyond, there is an engineering need to enable high-rate, low-latency communications. Errors in physical channels typically arrive in clumps, but most decoders are designed assuming that channels are memoryless. As a result, communication networks rely on interleaving over tens of thousands of bits so that channel conditions match decoder assumptions. Even for short high rate codes, awaiting sufficient data to interleave at the sender and de-interleave at the receiver is a significant source of unwanted latency. Using existing decoders with non-interleaved channels causes a degradation in block error rate performance owing to mismatch between the decoder's channel model and true channel behaviour.
Through further development of the recently proposed Guessing Random Additive Noise Decoding (GRAND) algorithm, which we call GRAND-MO for GRAND Markov Order, here we establish that by abandoning interleaving and embracing bursty noise, low-latency, short-code, high-rate communication is possible with block error rates that outperform their interleaved counterparts by a substantial margin. Moreover, while most decoders are twinned to a specific code-book structure, GRAND-MO can decode any code. Using this property, we establish that certain well-known structured codes are ill-suited for use in bursty channels, but Random Linear Codes (RLCs) are robust to correlated noise. This work suggests that the use of RLCs with GRAND-MO is a good candidate for applications requiring high throughput with low latency.
Submitted 6 November, 2020;
originally announced November 2020.
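The interleaving that the abstract above argues against can be illustrated with a small block interleaver, which shows both what it buys and where the latency comes from. This is a generic textbook construction written for illustration, not code from the paper; the function names and the 3-by-4 layout are assumptions.

```python
def interleave(bits, rows, cols):
    """Write row by row, read column by column."""
    assert len(bits) == rows * cols
    return [bits[r * cols + c] for c in range(cols) for r in range(rows)]


def deinterleave(bits, rows, cols):
    """Invert interleave(): restore the original row-major order."""
    assert len(bits) == rows * cols
    return [bits[c * rows + r] for r in range(rows) for c in range(cols)]


# Three codewords of four bits each are interleaved before transmission.
rows, cols = 3, 4
burst = [1, 1, 1] + [0] * 9           # error burst on the channel
spread = deinterleave(burst, rows, cols)
```

After de-interleaving, the three-bit burst lands as one error in each codeword, which looks memoryless to the decoder. The sender, however, must wait for all `rows * cols` bits before it can transmit the first column, and that wait is the latency the abstract refers to.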
-
Dense-View GEIs Set: View Space Covering for Gait Recognition based on Dense-View GAN
Authors:
Rijun Liao,
Weizhi An,
Shiqi Yu,
Zhu Li,
Yongzhen Huang
Abstract:
Gait recognition has proven to be effective for long-distance human recognition. However, view variation changes human appearance greatly and reduces recognition performance. Most existing gait datasets collect data at a dozen different view angles, or even fewer. Such limited view angles prevent the learning of better view-invariant features. Collecting data at 1-degree intervals could further improve the robustness of gait recognition, but building such a dataset is time-consuming and labor-intensive. In this paper, we therefore introduce a Dense-View GEIs Set (DV-GEIs) to deal with the challenge of limited view angles. This set covers the whole view space, with view angles from 0 to 180 degrees at 1-degree intervals. In addition, a Dense-View GAN (DV-GAN) is proposed to synthesize this dense-view set. DV-GAN consists of a Generator, a Discriminator and a Monitor, where the Monitor is designed to preserve human identification and view information. The proposed method is evaluated on the CASIA-B and OU-ISIR datasets. The experimental results show that using DV-GEIs synthesized by DV-GAN is an effective way to learn better view-invariant features. We believe the idea of densely generated view samples will further advance the development of gait recognition.
Submitted 26 September, 2020;
originally announced September 2020.
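The abstract above builds on GEIs without defining them. A GEI (gait energy image) is conventionally the pixel-wise average of aligned binary silhouettes over a gait cycle; the sketch below shows that averaging step only. The function name and the list-of-lists silhouette format are assumptions for illustration, and silhouette extraction, alignment and the DV-GAN synthesis are out of scope.

```python
def gait_energy_image(silhouettes):
    """Average aligned binary silhouettes pixel by pixel.

    silhouettes: list of equally sized 2D lists of 0/1 values,
    assumed to be already cropped and aligned.
    Returns a 2D list of floats in [0, 1].
    """
    count = len(silhouettes)
    height, width = len(silhouettes[0]), len(silhouettes[0][0])
    return [[sum(frame[y][x] for frame in silhouettes) / count
             for x in range(width)]
            for y in range(height)]


# Two tiny one-row "silhouettes": the left pixel is always body,
# the right pixel is body in only one of the two frames.
gei = gait_energy_image([[[1, 0]], [[1, 1]]])
```

Pixels near 1 correspond to parts of the body that stay put across the cycle, while intermediate values mark the swinging limbs, so a single image summarizes both shape and motion.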
-
Parallax Attention for Unsupervised Stereo Correspondence Learning
Authors:
Longguang Wang,
Yulan Guo,
Yingqian Wang,
Zhengfa Liang,
Zaiping Lin,
Jungang Yang,
Wei An
Abstract:
Stereo image pairs encode 3D scene cues into stereo correspondences between the left and right images. To exploit 3D cues within stereo images, recent CNN-based methods commonly use cost volume techniques to capture stereo correspondence over large disparities. However, since disparities can vary significantly for stereo cameras with different baselines, focal lengths and resolutions, the fixed maximum disparity used in cost volume techniques hinders them from handling different stereo image pairs with large disparity variations. In this paper, we propose a generic parallax-attention mechanism (PAM) to capture stereo correspondence regardless of disparity variations. Our PAM integrates epipolar constraints with an attention mechanism to calculate feature similarities along the epipolar line to capture stereo correspondence. Based on our PAM, we propose a parallax-attention stereo matching network (PASMnet) and a parallax-attention stereo image super-resolution network (PASSRnet) for stereo matching and stereo image super-resolution tasks. Moreover, we introduce a new large-scale dataset named Flickr1024 for stereo image super-resolution. Experimental results show that our PAM is generic and can effectively learn stereo correspondence under large disparity variations in an unsupervised manner. Comparative results show that our PASMnet and PASSRnet achieve state-of-the-art performance.
Submitted 12 October, 2021; v1 submitted 15 September, 2020;
originally announced September 2020.
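The core step described in the abstract above, computing feature similarities along the epipolar line, can be sketched for one row of a rectified stereo pair, where the epipolar line is simply the same image row. This is a minimal illustration under stated assumptions, not the authors' PAM: features are plain Python lists, similarity is a dot product followed by a softmax, and the function names are hypothetical.

```python
import math


def parallax_attention_row(left_row, right_row):
    """Attention of each left pixel over all right pixels in the same row.

    left_row, right_row: lists of feature vectors, one per pixel.
    Returns a matrix whose row i is a softmax over right positions j.
    """
    attention = []
    for left_feature in left_row:
        scores = [sum(a * b for a, b in zip(left_feature, right_feature))
                  for right_feature in right_row]
        peak = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(score - peak) for score in scores]
        total = sum(weights)
        attention.append([weight / total for weight in weights])
    return attention


def expected_disparity(attention):
    """Soft disparity: left position minus expected matching position."""
    return [i - sum(j * weight for j, weight in enumerate(row))
            for i, row in enumerate(attention)]


# Left pixels 1 and 2 match right pixels 0 and 1 (a disparity of one).
left_row = [[0, 0], [8, 0], [0, 8]]
right_row = [[8, 0], [0, 8], [0, 0]]
attention = parallax_attention_row(left_row, right_row)
disparity = expected_disparity(attention)
```

Because every left pixel attends over the full row, no maximum disparity has to be fixed in advance, which is the property the abstract contrasts with cost volume techniques.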
-
Light Field Image Super-Resolution Using Deformable Convolution
Authors:
Yingqian Wang,
Jungang Yang,
Longguang Wang,
Xinyi Ying,
Tianhao Wu,
Wei An,
Yulan Guo
Abstract:
Light field (LF) cameras can record scenes from multiple perspectives, and thus introduce beneficial angular information for image super-resolution (SR). However, it is challenging to incorporate angular information due to disparities among LF images. In this paper, we propose a deformable convolution network (i.e., LF-DFnet) to handle the disparity problem for LF image SR. Specifically, we design an angular deformable alignment module (ADAM) for feature-level alignment. Based on ADAM, we further propose a collect-and-distribute approach to perform bidirectional alignment between the center-view feature and each side-view feature. Using our approach, angular information can be well incorporated and encoded into features of each view, which benefits the SR reconstruction of all LF images. Moreover, we develop a baseline-adjustable LF dataset to evaluate SR performance under different disparity variations. Experiments on both public and our self-developed datasets have demonstrated the superiority of our method. Our LF-DFnet can generate high-resolution images with more faithful details and achieve state-of-the-art reconstruction accuracy. Besides, our LF-DFnet is more robust to disparity variations, which has not been well addressed in literature.
Submitted 25 November, 2020; v1 submitted 7 July, 2020;
originally announced July 2020.
-
Exploring Sparsity in Image Super-Resolution for Efficient Inference
Authors:
Longguang Wang,
Xiaoyu Dong,
Yingqian Wang,
Xinyi Ying,
Zaiping Lin,
Wei An,
Yulan Guo
Abstract:
Current CNN-based super-resolution (SR) methods process all locations equally with computational resources being uniformly assigned in space. However, since missing details in low-resolution (LR) images mainly exist in regions of edges and textures, less computational resources are required for those flat regions. Therefore, existing CNN-based methods involve redundant computation in flat regions, which increases their computational cost and limits their applications on mobile devices. In this paper, we explore the sparsity in image SR to improve inference efficiency of SR networks. Specifically, we develop a Sparse Mask SR (SMSR) network to learn sparse masks to prune redundant computation. Within our SMSR, spatial masks learn to identify "important" regions while channel masks learn to mark redundant channels in those "unimportant" regions. Consequently, redundant computation can be accurately localized and skipped while maintaining comparable performance. It is demonstrated that our SMSR achieves state-of-the-art performance with 41%/33%/27% FLOPs being reduced for x2/3/4 SR. Code is available at: https://github.com/LongguangWang/SMSR.
Submitted 1 April, 2021; v1 submitted 16 June, 2020;
originally announced June 2020.
-
Learning A Single Network for Scale-Arbitrary Super-Resolution
Authors:
Longguang Wang,
Yingqian Wang,
Zaiping Lin,
Jungang Yang,
Wei An,
Yulan Guo
Abstract:
Recently, the performance of single image super-resolution (SR) has been significantly improved with powerful networks. However, these networks are developed for image SR with a single specific integer scale (e.g., x2, x3, x4), and cannot be used for non-integer and asymmetric SR. In this paper, we propose to learn a scale-arbitrary image SR network from scale-specific networks. Specifically, we propose a plug-in module for existing SR networks to perform scale-arbitrary SR, which consists of multiple scale-aware feature adaption blocks and a scale-aware upsampling layer. Moreover, we introduce a scale-aware knowledge transfer paradigm to transfer knowledge from scale-specific networks to the scale-arbitrary network. Our plug-in module can be easily adapted to existing networks to achieve scale-arbitrary SR. Networks equipped with our module can achieve promising results for non-integer and asymmetric SR while maintaining state-of-the-art performance for SR with integer scale factors. Besides, the additional computational and memory cost of our module is very small.
Submitted 23 July, 2021; v1 submitted 7 April, 2020;
originally announced April 2020.
-
Deformable 3D Convolution for Video Super-Resolution
Authors:
Xinyi Ying,
Longguang Wang,
Yingqian Wang,
Weidong Sheng,
Wei An,
Yulan Guo
Abstract:
The spatio-temporal information among video sequences is significant for video super-resolution (SR). However, the spatio-temporal information cannot be fully used by existing video SR methods since spatial feature extraction and temporal motion compensation are usually performed sequentially. In this paper, we propose a deformable 3D convolution network (D3Dnet) to incorporate spatio-temporal information from both spatial and temporal dimensions for video SR. Specifically, we introduce deformable 3D convolution (D3D) to integrate deformable convolution with 3D convolution, obtaining both superior spatio-temporal modeling capability and motion-aware modeling flexibility. Extensive experiments have demonstrated the effectiveness of D3D in exploiting spatio-temporal information. Comparative results show that our network achieves state-of-the-art SR performance. Code is available at: https://github.com/XinyiYing/D3Dnet.
Submitted 15 August, 2020; v1 submitted 6 April, 2020;
originally announced April 2020.
-
Label-Driven Reconstruction for Domain Adaptation in Semantic Segmentation
Authors:
Jinyu Yang,
Weizhi An,
Sheng Wang,
Xinliang Zhu,
Chaochao Yan,
Junzhou Huang
Abstract:
Unsupervised domain adaptation alleviates the need for pixel-wise annotation in semantic segmentation. One of the most common strategies is to translate images from the source domain to the target domain and then align their marginal distributions in the feature space using adversarial learning. However, source-to-target translation enlarges the bias in translated images and introduces extra computations, owing to the dominant data size of the source domain. Furthermore, consistency of the joint distribution in source and target domains cannot be guaranteed through global feature alignment. Here, we present an innovative framework, designed to mitigate the image translation bias and align cross-domain features with the same category. This is achieved by 1) performing the target-to-source translation and 2) reconstructing both source and target images from their predicted labels. Extensive experiments on adapting from synthetic to real urban scene understanding demonstrate that our framework competes favorably against existing state-of-the-art methods.
Submitted 23 August, 2020; v1 submitted 10 March, 2020;
originally announced March 2020.
-
Context-Aware Domain Adaptation in Semantic Segmentation
Authors:
Jinyu Yang,
Weizhi An,
Chaochao Yan,
Peilin Zhao,
Junzhou Huang
Abstract:
In this paper, we consider the problem of unsupervised domain adaptation in semantic segmentation. There are two primary issues in this field: what domain knowledge to transfer and how to transfer it across two domains. Existing methods mainly focus on adapting domain-invariant features (what to transfer) through adversarial learning (how to transfer). Context dependency is essential for semantic segmentation; however, its transferability is still not well understood. Furthermore, how to transfer contextual information across two domains remains unexplored. Motivated by this, we propose a cross-attention mechanism based on self-attention to capture context dependencies between two domains and adapt transferable context. To achieve this goal, we design two cross-domain attention modules to adapt context dependencies from both spatial and channel views. Specifically, the spatial attention module captures local feature dependencies between each position in the source and target images. The channel attention module models semantic dependencies between each pair of cross-domain channel maps. To adapt context dependencies, we further selectively aggregate the context information from the two domains. The superiority of our method over existing state-of-the-art methods is empirically demonstrated on "GTA5 to Cityscapes" and "SYNTHIA to Cityscapes".
Submitted 9 March, 2020;
originally announced March 2020.
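The cross-domain spatial attention described in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the dot-product scoring, and the plain softmax weighting are assumptions chosen to show the general idea of letting each source position attend over all target positions and aggregate their features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_domain_spatial_attention(source, target):
    """Hypothetical helper: aggregate target context for every source position.

    source: list of N feature vectors (one per source position), each of length C
    target: list of M feature vectors (one per target position), each of length C
    Returns a list of N context vectors of length C.
    """
    dim = len(target[0])
    context = []
    for s in source:
        # Similarity between this source position and every target position
        # (dot product is an assumption; the paper may use a learned projection).
        scores = [sum(a * b for a, b in zip(s, t)) for t in target]
        weights = softmax(scores)
        # Weighted sum of target features gives the cross-domain context.
        context.append(
            [sum(w * t[d] for w, t in zip(weights, target)) for d in range(dim)]
        )
    return context


if __name__ == "__main__":
    src = [[1.0, 0.0], [0.0, 1.0]]
    tgt = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
    for vec in cross_domain_spatial_attention(src, tgt):
        print([round(v, 3) for v in vec])
```

A channel-view variant would follow the same pattern with the roles of positions and channels swapped, so that each source channel map attends over target channel maps.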