subscribe to arXiv mailings

Pathformer3D: A 3D Scanpath Transformer for 360° Images

Authors: Rong Quan, Yantao Lai, Mengyu Qiu, Dong Liang

Abstract: Scanpath prediction in 360° images can help realize rapid rendering and better user interaction in Virtual/Augmented Reality applications. However, existing scanpath prediction models for 360° images execute scanpath prediction on 2D equirectangular projection plane, which always result in big computation error owing to the 2D plane's distortion and coordinate discontinuity. In this work, we perfo… ▽ More Scanpath prediction in 360° images can help realize rapid rendering and better user interaction in Virtual/Augmented Reality applications. However, existing scanpath prediction models for 360° images execute scanpath prediction on 2D equirectangular projection plane, which always result in big computation error owing to the 2D plane's distortion and coordinate discontinuity. In this work, we perform scanpath prediction for 360° images in 3D spherical coordinate system and proposed a novel 3D scanpath Transformer named Pathformer3D. Specifically, a 3D Transformer encoder is first used to extract 3D contextual feature representation for the 360° image. Then, the contextual feature representation and historical fixation information are input into a Transformer decoder to output current time step's fixation embedding, where the self-attention module is used to imitate the visual working memory mechanism of human visual system and directly model the time dependencies among the fixations. Finally, a 3D Gaussian distribution is learned from each fixation embedding, from which the fixation position can be sampled. Evaluation on four panoramic eye-tracking datasets demonstrates that Pathformer3D outperforms the current state-of-the-art methods. Code is available at https://github.com/lsztzp/Pathformer3D . △ Less

Submitted 15 July, 2024; originally announced July 2024.

Comments: ECCV 2024

arXiv:2407.05617 [pdf, other]

LINEAR: Learning Implicit Neural Representation With Explicit Physical Priors for Accelerated Quantitative T1rho Mapping

Authors: Yuanyuan Liu, Jinwen Xie, Zhuo-Xu Cui, Qingyong Zhu, Jing Cheng, Dong Liang, Yanjie Zhu

Abstract: Quantitative T1rho parameter mapping has shown promise in clinical and research studies. However, it suffers from long scan times. Deep learning-based techniques have been successfully applied in accelerated quantitative MR parameter mapping. However, most methods require fully-sampled training dataset, which is impractical in the clinic. In this study, a novel subject-specific unsupervised method… ▽ More Quantitative T1rho parameter mapping has shown promise in clinical and research studies. However, it suffers from long scan times. Deep learning-based techniques have been successfully applied in accelerated quantitative MR parameter mapping. However, most methods require fully-sampled training dataset, which is impractical in the clinic. In this study, a novel subject-specific unsupervised method based on the implicit neural representation is proposed to reconstruct images from highly undersampled k-space data and estimate parameter maps from reconstructions, which only takes spatiotemporal coordinates as the input. Specifically, the proposed method learned a implicit neural representation of the MR images driven by two explicit priors of images (or k-space data), including the low-rankness of Hankel matrix, and the self-consistency of k-space data. The ablation experiments show that the proposed method can characterize the physical priors of MR images well. Moreover,experimental results of retrospective and prospective data show that the proposed method outperforms the state-of-the-art methods in terms of supressing artifacts and achieving the lowest error. △ Less

Submitted 8 July, 2024; originally announced July 2024.

Comments: Yuanyuan Liu and Jinwen Xie contributed equally to this work

arXiv:2407.04289 [pdf, other]

Electronic Correlations in Multielectron Silicon Quantum Dots

Authors: Dylan H. Liang, MengKe Feng, Philip Y. Mai, Jesus D. Cifuentes, Andrew S. Dzurak, Andre Saraiva

Abstract: Silicon quantum computing has the potential to revolutionize technology with capabilities to solve real-life problems that are computationally complex or even intractable for modern computers [1] by offering sufficient high quality qubits to perform complex error-corrected calculations. Silicon metal-oxide-semiconductor based quantum dots present a promising pathway for realizing practical quantum… ▽ More Silicon quantum computing has the potential to revolutionize technology with capabilities to solve real-life problems that are computationally complex or even intractable for modern computers [1] by offering sufficient high quality qubits to perform complex error-corrected calculations. Silicon metal-oxide-semiconductor based quantum dots present a promising pathway for realizing practical quantum computers. To improve certain qubit properties, it is a common strategy to incorporate multiple electrons in the same dot in order to form qubits in higher confined orbital states. Theoretical modelling is an essential part of understanding the quantum behaviour of these electrons, providing a basis for validating the physical working of device models as well as providing insights into experimental data. Hartree-Fock theory is an imperative tool for the electronic structure modelling of multi-electron quantum dots due to its ability to simulate a large number of electrons with manageable computation load. However, an efficient calculation of the self-consistent field becomes hard because dot formations in silicon are characterized by strong electron-electron interactions and conduction band valleys, besides the relatively high comparative effective mass, which add to create a behaviour dominated by repulsion between electrons rather than a well established shell structure. In this paper, we present a Hartree-Fock-based method that accounts for these complexities for the modelling of silicon quantum dots. With this method, we first establish the significance of including electron-electron interactions and valley degree of freedom and their implications. We then explore a simple case of anisotropic dots and observe the impact of anisotropy on dot formations. △ Less

Submitted 5 July, 2024; originally announced July 2024.

arXiv:2407.03719 [pdf, other]

Relative Difficulty Distillation for Semantic Segmentation

Authors: Dong Liang, Yue Sun, Yun Du, Songcan Chen, Sheng-Jun Huang

Abstract: Current knowledge distillation (KD) methods primarily focus on transferring various structured knowledge and designing corresponding optimization goals to encourage the student network to imitate the output of the teacher network. However, introducing too many additional optimization objectives may lead to unstable training, such as gradient conflicts. Moreover, these methods ignored the guideline… ▽ More Current knowledge distillation (KD) methods primarily focus on transferring various structured knowledge and designing corresponding optimization goals to encourage the student network to imitate the output of the teacher network. However, introducing too many additional optimization objectives may lead to unstable training, such as gradient conflicts. Moreover, these methods ignored the guidelines of relative learning difficulty between the teacher and student networks. Inspired by human cognitive science, in this paper, we redefine knowledge from a new perspective -- the student and teacher networks' relative difficulty of samples, and propose a pixel-level KD paradigm for semantic segmentation named Relative Difficulty Distillation (RDD). We propose a two-stage RDD framework: Teacher-Full Evaluated RDD (TFE-RDD) and Teacher-Student Evaluated RDD (TSE-RDD). RDD allows the teacher network to provide effective guidance on learning focus without additional optimization goals, thus avoiding adjusting learning weights for multiple losses. Extensive experimental evaluations using a general distillation loss function on popular datasets such as Cityscapes, CamVid, Pascal VOC, and ADE20k demonstrate the effectiveness of RDD against state-of-the-art KD methods. Additionally, our research showcases that RDD can integrate with existing KD methods to improve their upper performance bound. △ Less

Submitted 4 July, 2024; originally announced July 2024.

arXiv:2407.03699 [pdf, other]

Generalized Robust Fundus Photography-based Vision Loss Estimation for High Myopia

Authors: Zipei Yan, Zhile Liang, Zhengji Liu, Shuai Wang, Rachel Ka-Man Chun, Jizhou Li, Chea-su Kee, Dong Liang

Abstract: High myopia significantly increases the risk of irreversible vision loss. Traditional perimetry-based visual field (VF) assessment provides systematic quantification of visual loss but it is subjective and time-consuming. Consequently, machine learning models utilizing fundus photographs to estimate VF have emerged as promising alternatives. However, due to the high variability and the limited ava… ▽ More High myopia significantly increases the risk of irreversible vision loss. Traditional perimetry-based visual field (VF) assessment provides systematic quantification of visual loss but it is subjective and time-consuming. Consequently, machine learning models utilizing fundus photographs to estimate VF have emerged as promising alternatives. However, due to the high variability and the limited availability of VF data, existing VF estimation models fail to generalize well, particularly when facing out-of-distribution data across diverse centers and populations. To tackle this challenge, we propose a novel, parameter-efficient framework to enhance the generalized robustness of VF estimation on both in- and out-of-distribution data. Specifically, we design a Refinement-by-Denoising (RED) module for feature refinement and adaptation from pretrained vision models, aiming to learn high-entropy feature representations and to mitigate the domain gap effectively and efficiently. Through independent validation on two distinct real-world datasets from separate centers, our method significantly outperforms existing approaches in RMSE, MAE and correlation coefficient for both internal and external validation. Our proposed framework benefits both in- and out-of-distribution VF estimation, offering significant clinical implications and potential utility in real-world ophthalmic practices. △ Less

Submitted 4 July, 2024; originally announced July 2024.

Comments: Accepted by MICCAI 2024, code: https://github.com/yanzipei/VF_RED

arXiv:2407.03263 [pdf, other]

A Unified Framework for 3D Scene Understanding

Authors: Wei Xu, Chunsheng Shi, Sifan Tu, Xin Zhou, Dingkang Liang, Xiang Bai

Abstract: We propose UniSeg3D, a unified 3D segmentation framework that achieves panoptic, semantic, instance, interactive, referring, and open-vocabulary semantic segmentation tasks within a single model. Most previous 3D segmentation approaches are specialized for a specific task, thereby limiting their understanding of 3D scenes to a task-specific perspective. In contrast, the proposed method unifies six… ▽ More We propose UniSeg3D, a unified 3D segmentation framework that achieves panoptic, semantic, instance, interactive, referring, and open-vocabulary semantic segmentation tasks within a single model. Most previous 3D segmentation approaches are specialized for a specific task, thereby limiting their understanding of 3D scenes to a task-specific perspective. In contrast, the proposed method unifies six tasks into unified representations processed by the same Transformer. It facilitates inter-task knowledge sharing and, therefore, promotes comprehensive 3D scene understanding. To take advantage of multi-task unification, we enhance the performance by leveraging task connections. Specifically, we design a knowledge distillation method and a contrastive learning method to transfer task-specific knowledge across different tasks. Benefiting from extensive inter-task knowledge sharing, our UniSeg3D becomes more powerful. Experiments on three benchmarks, including the ScanNet20, ScanRefer, and ScanNet200, demonstrate that the UniSeg3D consistently outperforms current SOTA methods, even those specialized for individual tasks. We hope UniSeg3D can serve as a solid unified baseline and inspire future work. The code will be available at https://dk-liang.github.io/UniSeg3D/. △ Less

Submitted 3 July, 2024; originally announced July 2024.

Comments: The code will be available at https://dk-liang.github.io/UniSeg3D/

arXiv:2407.01016 [pdf, other]

SOOD++: Leveraging Unlabeled Data to Boost Oriented Object Detection

Authors: Dingkang Liang, Wei Hua, Chunsheng Shi, Zhikang Zou, Xiaoqing Ye, Xiang Bai

Abstract: Semi-supervised object detection (SSOD), leveraging unlabeled data to boost object detectors, has become a hot topic recently. However, existing SSOD approaches mainly focus on horizontal objects, leaving multi-oriented objects common in aerial images unexplored. At the same time, the annotation cost of multi-oriented objects is significantly higher than that of their horizontal counterparts. Ther… ▽ More Semi-supervised object detection (SSOD), leveraging unlabeled data to boost object detectors, has become a hot topic recently. However, existing SSOD approaches mainly focus on horizontal objects, leaving multi-oriented objects common in aerial images unexplored. At the same time, the annotation cost of multi-oriented objects is significantly higher than that of their horizontal counterparts. Therefore, in this paper, we propose a simple yet effective Semi-supervised Oriented Object Detection method termed SOOD++. Specifically, we observe that objects from aerial images are usually arbitrary orientations, small scales, and aggregation, which inspires the following core designs: a Simple Instance-aware Dense Sampling (SIDS) strategy is used to generate comprehensive dense pseudo-labels; the Geometry-aware Adaptive Weighting (GAW) loss dynamically modulates the importance of each pair between pseudo-label and corresponding prediction by leveraging the intricate geometric information of aerial objects; we treat aerial images as global layouts and explicitly build the many-to-many relationship between the sets of pseudo-labels and predictions via the proposed Noise-driven Global Consistency (NGC). Extensive experiments conducted on various multi-oriented object datasets under various labeled settings demonstrate the effectiveness of our method. For example, on the DOTA-V1.5 benchmark, the proposed method outperforms previous state-of-the-art (SOTA) by a large margin (+2.92, +2.39, and +2.57 mAP under 10%, 20%, and 30% labeled data settings, respectively) with single-scale training and testing. More importantly, it still improves upon a strong supervised baseline with 70.66 mAP, trained using the full DOTA-V1.5 train-val set, by +1.82 mAP, resulting in a 72.48 mAP, pushing the new state-of-the-art. The code will be made available. △ Less

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2406.14952 [pdf, other]

ESC-Eval: Evaluating Emotion Support Conversations in Large Language Models

Authors: Haiquan Zhao, Lingyu Li, Shisong Chen, Shuqi Kong, Jiaan Wang, Kexin Huang, Tianle Gu, Yixu Wang, Dandan Liang, Zhixu Li, Yan Teng, Yanghua Xiao, Yingchun Wang

Abstract: Emotion Support Conversation (ESC) is a crucial application, which aims to reduce human stress, offer emotional guidance, and ultimately enhance human mental and physical well-being. With the advancement of Large Language Models (LLMs), many researchers have employed LLMs as the ESC models. However, the evaluation of these LLM-based ESCs remains uncertain. Inspired by the awesome development of ro… ▽ More Emotion Support Conversation (ESC) is a crucial application, which aims to reduce human stress, offer emotional guidance, and ultimately enhance human mental and physical well-being. With the advancement of Large Language Models (LLMs), many researchers have employed LLMs as the ESC models. However, the evaluation of these LLM-based ESCs remains uncertain. Inspired by the awesome development of role-playing agents, we propose an ESC Evaluation framework (ESC-Eval), which uses a role-playing agent to interact with ESC models, followed by a manual evaluation of the interactive dialogues. In detail, we first re-organize 2,801 role-playing cards from seven existing datasets to define the roles of the role-playing agent. Second, we train a specific role-playing model called ESC-Role which behaves more like a confused person than GPT-4. Third, through ESC-Role and organized role cards, we systematically conduct experiments using 14 LLMs as the ESC models, including general AI-assistant LLMs (ChatGPT) and ESC-oriented LLMs (ExTES-Llama). We conduct comprehensive human annotations on interactive multi-turn dialogues of different ESC models. The results show that ESC-oriented LLMs exhibit superior ESC abilities compared to general AI-assistant LLMs, but there is still a gap behind human performance. Moreover, to automate the scoring process for future ESC models, we developed ESC-RANK, which trained on the annotated data, achieving a scoring performance surpassing 35 points of GPT-4. Our data and code are available at https://github.com/haidequanbu/ESC-Eval. △ Less

Submitted 24 June, 2024; v1 submitted 21 June, 2024; originally announced June 2024.

Comments: Pre-print

arXiv:2406.14067 [pdf]

A microwave photonic prototype for concurrent radar detection and spectrum sensing over an 8 to 40 GHz bandwidth

Authors: Taixia Shi, Dingding Liang, Lu Wang, Lin Li, Shaogang Guo, Jiawei Gao, Xiaowei Li, Chulun Lin, Lei Shi, Baogang Ding, Shiyang Liu, Fangyi Yang, Chi Jiang, Yang Chen

Abstract: In this work, a microwave photonic prototype for concurrent radar detection and spectrum sensing is proposed, designed, built, and investigated. A direct digital synthesizer and an analog electronic circuit are integrated to generate an intermediate frequency (IF) linearly frequency-modulated (LFM) signal with a tunable center frequency from 2.5 to 9.5 GHz and an instantaneous bandwidth of 1 GHz.… ▽ More In this work, a microwave photonic prototype for concurrent radar detection and spectrum sensing is proposed, designed, built, and investigated. A direct digital synthesizer and an analog electronic circuit are integrated to generate an intermediate frequency (IF) linearly frequency-modulated (LFM) signal with a tunable center frequency from 2.5 to 9.5 GHz and an instantaneous bandwidth of 1 GHz. The IF LFM signal is converted to the optical domain via an intensity modulator and then filtered by a fiber Bragg grating (FBG) to generate only two 2nd-order optical LFM sidebands. In radar detection, the two optical LFM sidebands beat with each other to generate a frequency-and-bandwidth-quadrupled LFM signal, which is used for ranging, radial velocity measurement, and imaging. By changing the center frequency of the IF LFM signal, the radar function can be operated within 8 to 40 GHz. In spectrum sensing, one 2nd-order optical LFM sideband is selected by another FBG, which then works in conjunction with the stimulated Brillouin scattering gain spectrum to map the frequency of the signal under test to time with an instantaneous measurement bandwidth of 2 GHz. By using a frequency shift module to adjust the pump frequency, the frequency measurement range can be adjusted from 0 to 40 GHz. The prototype is comprehensively studied and tested, which is capable of achieving a range resolution of 3.75 cm, a range error of less than $\pm$ 2 cm, a radial velocity error within $\pm$ 1 cm/s, delivering clear imaging of multiple small targets, and maintaining a frequency measurement error of less than $\pm$ 7 MHz and a frequency resolution of better than 20 MHz. △ Less

Submitted 20 June, 2024; originally announced June 2024.

Comments: 18 pages, 12 figures, 1 table

arXiv:2406.11397 [pdf, other]

DistPred: A Distribution-Free Probabilistic Inference Method for Regression and Forecasting

Authors: Daojun Liang, Haixia Zhang, Dongfeng Yuan

Abstract: Traditional regression and prediction tasks often only provide deterministic point estimates. To estimate the uncertainty or distribution information of the response variable, methods such as Bayesian inference, model ensembling, or MC Dropout are typically used. These methods either assume that the posterior distribution of samples follows a Gaussian process or require thousands of forward passes… ▽ More Traditional regression and prediction tasks often only provide deterministic point estimates. To estimate the uncertainty or distribution information of the response variable, methods such as Bayesian inference, model ensembling, or MC Dropout are typically used. These methods either assume that the posterior distribution of samples follows a Gaussian process or require thousands of forward passes for sample generation. We propose a novel approach called DistPred for regression and forecasting tasks, which overcomes the limitations of existing methods while remaining simple and powerful. Specifically, we transform proper scoring rules that measure the discrepancy between the predicted distribution and the target distribution into a differentiable discrete form and use it as a loss function to train the model end-to-end. This allows the model to sample numerous samples in a single forward pass to estimate the potential distribution of the response variable. We have compared our method with several existing approaches on multiple datasets and achieved state-of-the-art performance. Additionally, our method significantly improves computational efficiency. For example, compared to state-of-the-art models, DistPred has a 90x faster inference speed. Experimental results can be reproduced through https://github.com/Anoise/DistPred. △ Less

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.07594 [pdf, other]

MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models

Authors: Tianle Gu, Zeyang Zhou, Kexin Huang, Dandan Liang, Yixu Wang, Haiquan Zhao, Yuanqi Yao, Xingge Qiao, Keqing Wang, Yujiu Yang, Yan Teng, Yu Qiao, Yingchun Wang

Abstract: Powered by remarkable advancements in Large Language Models (LLMs), Multimodal Large Language Models (MLLMs) demonstrate impressive capabilities in manifold tasks. However, the practical application scenarios of MLLMs are intricate, exposing them to potential malicious instructions and thereby posing safety risks. While current benchmarks do incorporate certain safety considerations, they often la… ▽ More Powered by remarkable advancements in Large Language Models (LLMs), Multimodal Large Language Models (MLLMs) demonstrate impressive capabilities in manifold tasks. However, the practical application scenarios of MLLMs are intricate, exposing them to potential malicious instructions and thereby posing safety risks. While current benchmarks do incorporate certain safety considerations, they often lack comprehensive coverage and fail to exhibit the necessary rigor and robustness. For instance, the common practice of employing GPT-4V as both the evaluator and a model to be evaluated lacks credibility, as it tends to exhibit a bias toward its own responses. In this paper, we present MLLMGuard, a multidimensional safety evaluation suite for MLLMs, including a bilingual image-text evaluation dataset, inference utilities, and a lightweight evaluator. MLLMGuard's assessment comprehensively covers two languages (English and Chinese) and five important safety dimensions (Privacy, Bias, Toxicity, Truthfulness, and Legality), each with corresponding rich subtasks. Focusing on these dimensions, our evaluation dataset is primarily sourced from platforms such as social media, and it integrates text-based and image-based red teaming techniques with meticulous annotation by human experts. This can prevent inaccurate evaluation caused by data leakage when using open-source datasets and ensures the quality and challenging nature of our benchmark. Additionally, a fully automated lightweight evaluator termed GuardRank is developed, which achieves significantly higher evaluation accuracy than GPT-4. Our evaluation results across 13 advanced models indicate that MLLMs still have a substantial journey ahead before they can be considered safe and responsible. △ Less

Submitted 13 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

arXiv:2406.04801 [pdf, other]

MoE Jetpack: From Dense Checkpoints to Adaptive Mixture of Experts for Vision Tasks

Authors: Xingkui Zhu, Yiran Guan, Dingkang Liang, Yuchao Chen, Yuliang Liu, Xiang Bai

Abstract: The sparsely activated mixture of experts (MoE) model presents a promising alternative to traditional densely activated (dense) models, enhancing both quality and computational efficiency. However, training MoE models from scratch demands extensive data and computational resources. Moreover, public repositories like timm mainly provide pre-trained dense checkpoints, lacking similar resources for M… ▽ More The sparsely activated mixture of experts (MoE) model presents a promising alternative to traditional densely activated (dense) models, enhancing both quality and computational efficiency. However, training MoE models from scratch demands extensive data and computational resources. Moreover, public repositories like timm mainly provide pre-trained dense checkpoints, lacking similar resources for MoE models, hindering their adoption. To bridge this gap, we introduce MoE Jetpack, an effective method for fine-tuning dense checkpoints into MoE models. MoE Jetpack incorporates two key techniques: (1) checkpoint recycling, which repurposes dense checkpoints as initial weights for MoE models, thereby accelerating convergence, enhancing accuracy, and alleviating the computational burden of pre-training; (2) hyperspherical adaptive MoE (SpheroMoE) layer, which optimizes the MoE architecture for better integration of dense checkpoints, enhancing fine-tuning performance. Our experiments on vision tasks demonstrate that MoE Jetpack significantly improves convergence speed and accuracy when fine-tuning dense checkpoints into MoE models. Our code will be publicly available at https://github.com/Adlith/MoE-Jetpack. △ Less

Submitted 7 June, 2024; originally announced June 2024.

Comments: 9 pages, 6 figures

ACM Class: I.2

arXiv:2405.19736 [pdf, other]

Intrinsic Dynamics-Driven Generalizable Scene Representations for Vision-Oriented Decision-Making Applications

Authors: Dayang Liang, Jinyang Lai, Yunlong Liu

Abstract: How to improve the ability of scene representation is a key issue in vision-oriented decision-making applications, and current approaches usually learn task-relevant state representations within visual reinforcement learning to address this problem. While prior work typically introduces one-step behavioral similarity metrics with elements (e.g., rewards and actions) to extract task-relevant state… ▽ More How to improve the ability of scene representation is a key issue in vision-oriented decision-making applications, and current approaches usually learn task-relevant state representations within visual reinforcement learning to address this problem. While prior work typically introduces one-step behavioral similarity metrics with elements (e.g., rewards and actions) to extract task-relevant state information from observations, they often ignore the inherent dynamics relationships among the elements that are essential for learning accurate representations, which further impedes the discrimination of short-term similar task/behavior information in long-term dynamics transitions. To alleviate this problem, we propose an intrinsic dynamics-driven representation learning method with sequence models in visual reinforcement learning, namely DSR. Concretely, DSR optimizes the parameterized encoder by the state-transition dynamics of the underlying system, which prompts the latent encoding information to satisfy the state-transition process and then the state space and the noise space can be distinguished. In the implementation and to further improve the representation ability of DSR on encoding similar tasks, sequential elements' frequency domain and multi-step prediction are adopted for sequentially modeling the inherent dynamics. Finally, experimental results show that DSR has achieved significant performance improvements in the visual Distracting DMControl control tasks, especially with an average of 78.9\% over the backbone baseline. Further results indicate that it also achieves the best performances in real-world autonomous driving applications on the CARLA simulator. Moreover, qualitative analysis results validate that our method possesses the superior ability to learn generalizable scene representations on visual tasks. The source code is available at https://github.com/DMU-XMU/DSR. △ Less

Submitted 30 June, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2405.15830 [pdf, other]

Diff-DTI: Fast Diffusion Tensor Imaging Using A Feature-Enhanced Joint Diffusion Model

Authors: Lang Zhang, Jinling He, Dong Liang, Hairong Zheng, Yanjie Zhu

Abstract: Magnetic resonance diffusion tensor imaging (DTI) is a critical tool for neural disease diagnosis. However, long scan time greatly hinders the widespread clinical use of DTI. To accelerate image acquisition, a feature-enhanced joint diffusion model (Diff-DTI) is proposed to obtain accurate DTI parameter maps from a limited number of diffusion-weighted images (DWIs). Diff-DTI introduces a joint dif… ▽ More Magnetic resonance diffusion tensor imaging (DTI) is a critical tool for neural disease diagnosis. However, long scan time greatly hinders the widespread clinical use of DTI. To accelerate image acquisition, a feature-enhanced joint diffusion model (Diff-DTI) is proposed to obtain accurate DTI parameter maps from a limited number of diffusion-weighted images (DWIs). Diff-DTI introduces a joint diffusion model that directly learns the joint probability distribution of DWIs with DTI parametric maps for conditional generation. Additionally, a feature enhancement fusion mechanism (FEFM) is designed and incorporated into the generative process of Diff-DTI to preserve fine structures in the generated DTI maps. A comprehensive evaluation of the performance of Diff-DTI was conducted on the Human Connectome Project dataset. The results demonstrate that Diff-DTI outperforms existing state-of-the-art fast DTI imaging methods in terms of visual quality and quantitative metrics. Furthermore, Diff-DTI has shown the ability to produce high-fidelity DTI maps with only three DWIs, thus overcoming the requirement of a minimum of six DWIs for DTI. △ Less

Submitted 24 May, 2024; originally announced May 2024.

Comments: 11 pages, 7 figures

arXiv:2405.15271 [pdf]

Seamless Integration and Implementation of Distributed Contact and Contactless Vital Sign Monitoring

Authors: Dingding Liang, Yang Chen, Jiawei Gao, Taixia Shi, Jianping Yao

Abstract: Real-time vital sign monitoring is gaining immense significance not only in the medical field but also in personal health management. Facing the needs of different application scenarios of the smart and healthy city in the future, the low-cost, large-scale, scalable, and distributed vital sign monitoring system is of great significance. In this work, a seamlessly integrated contact and contactless… ▽ More Real-time vital sign monitoring is gaining immense significance not only in the medical field but also in personal health management. Facing the needs of different application scenarios of the smart and healthy city in the future, the low-cost, large-scale, scalable, and distributed vital sign monitoring system is of great significance. In this work, a seamlessly integrated contact and contactless vital sign monitoring system, which can simultaneously implement respiration and heartbeat monitoring, is proposed. In contact vital sign monitoring, the chest wall movement due to respiration and heartbeat is translated into changes in the optical output intensity of a fiber Bragg grating (FBG). The FBG is also an important part of radar signal generation for contactless vital sign monitoring, in which the chest wall movement is translated into phase changes of the radar de-chirped signal. By analyzing the intensity of the FBG output and phase of the radar de-chirped signal, real-time respiration and heartbeat monitoring are realized. In addition, due to the distributed structure of the system and its good integration with the wavelength-division multiplexing optical network, it can be massively scaled by employing more wavelengths. A proof-of-concept experiment is carried out. Contact and contactless respiration and heartbeat monitoring of three people are simultaneously realized. During a monitoring time of 60 s, the maximum absolute measurement errors of respiration and heartbeat rates are 1.6 respirations per minute and 2.3 beats per minute, respectively. The measurement error does not have an obvious change even when the monitoring time is decreased to 5 s. △ Less

Submitted 24 May, 2024; originally announced May 2024.

Comments: 14 pages,9 figures

arXiv:2405.13152 [pdf, other]

Enhancing Interaction Modeling with Agent Selection and Physical Methods for Trajectory Prediction

Authors: Shiji Huang, Lei Ye, Min Chen, Wenhai Luo, Chenqi Xu, Deyuan Liang, Dihong Wang

Abstract: In this study, we address the limitations inherent in most existing vehicle trajectory prediction methodologies that indiscriminately incorporate all agents within a predetermined proximity when accounting for inter-agent interactions. These approaches commonly employ attention-based architecture or graph neural networks for encoding interactions, which introduces three challenges: (i) The indiscr… ▽ More In this study, we address the limitations inherent in most existing vehicle trajectory prediction methodologies that indiscriminately incorporate all agents within a predetermined proximity when accounting for inter-agent interactions. These approaches commonly employ attention-based architecture or graph neural networks for encoding interactions, which introduces three challenges: (i) The indiscriminate selection of all nearby agents substantially escalates the computational demands of the model, particularly in those interaction-rich scenarios. (ii) Moreover, the simplistic feature extraction of current time agents falls short of adequately capturing the nuanced dynamics of interactions. (iii) Compounded by the inherently low interpretability of attention mechanism and graph neural networks, there is a propensity for the model to allocate unreliable correlation coefficients to certain agents, adversely impacting the accuracy of trajectory predictions. To mitigate these issues, we introduce ASPILin, a novel approach that enhances the selection of interacting agents by considering their current and future lanes, extending this consideration across all historical frames. Utilizing the states of the agents, we estimate the nearest future distance between agents and the time needed to reach this distance. Then, combine these with their current distances to derive a physical correlation coefficient to encode interactions. Experiments conducted on popular trajectory prediction datasets demonstrate that our method is efficient and straightforward, outperforming other state-of-the-art methods. △ Less

Submitted 21 May, 2024; originally announced May 2024.

Comments: code:https://github.com/kkk00714/ASPILin

arXiv:2405.12434 [pdf, other]

Resolving Word Vagueness with Scenario-guided Adapter for Natural Language Inference

Authors: Yonghao Liu, Mengyu Li, Di Liang, Ximing Li, Fausto Giunchiglia, Lan Huang, Xiaoyue Feng, Renchu Guan

Abstract: Natural Language Inference (NLI) is a crucial task in natural language processing that involves determining the relationship between two sentences, typically referred to as the premise and the hypothesis. However, traditional NLI models solely rely on the semantic information inherent in independent sentences and lack relevant situational visual information, which can hinder a complete understandi… ▽ More Natural Language Inference (NLI) is a crucial task in natural language processing that involves determining the relationship between two sentences, typically referred to as the premise and the hypothesis. However, traditional NLI models solely rely on the semantic information inherent in independent sentences and lack relevant situational visual information, which can hinder a complete understanding of the intended meaning of the sentences due to the ambiguity and vagueness of language. To address this challenge, we propose an innovative ScenaFuse adapter that simultaneously integrates large-scale pre-trained linguistic knowledge and relevant visual information for NLI tasks. Specifically, we first design an image-sentence interaction module to incorporate visuals into the attention mechanism of the pre-trained model, allowing the two modalities to interact comprehensively. Furthermore, we introduce an image-sentence fusion module that can adaptively integrate visual information from images and semantic information from sentences. By incorporating relevant visual information and leveraging linguistic knowledge, our approach bridges the gap between language and vision, leading to improved understanding and inference capabilities in NLI tasks. Extensive benchmark experiments demonstrate that our proposed ScenaFuse, a scenario-guided approach, consistently boosts NLI performance. △ Less

Submitted 20 May, 2024; originally announced May 2024.

Comments: IJCAI24

arXiv:2405.12119 [pdf, other]

Reindex-Then-Adapt: Improving Large Language Models for Conversational Recommendation

Authors: Zhankui He, Zhouhang Xie, Harald Steck, Dawen Liang, Rahul Jha, Nathan Kallus, Julian McAuley

Abstract: Large language models (LLMs) are revolutionizing conversational recommender systems by adeptly indexing item content, understanding complex conversational contexts, and generating relevant item titles. However, controlling the distribution of recommended items remains a challenge. This leads to suboptimal performance due to the failure to capture rapidly changing data distributions, such as item p… ▽ More Large language models (LLMs) are revolutionizing conversational recommender systems by adeptly indexing item content, understanding complex conversational contexts, and generating relevant item titles. However, controlling the distribution of recommended items remains a challenge. This leads to suboptimal performance due to the failure to capture rapidly changing data distributions, such as item popularity, on targeted conversational recommendation platforms. In conversational recommendation, LLMs recommend items by generating the titles (as multiple tokens) autoregressively, making it difficult to obtain and control the recommendations over all items. Thus, we propose a Reindex-Then-Adapt (RTA) framework, which converts multi-token item titles into single tokens within LLMs, and then adjusts the probability distributions over these single-token item titles accordingly. The RTA framework marries the benefits of both LLMs and traditional recommender systems (RecSys): understanding complex queries as LLMs do; while efficiently controlling the recommended item distributions in conversational recommendations as traditional RecSys do. Our framework demonstrates improved accuracy metrics across three different conversational recommendation datasets and two adaptation settings △ Less

Submitted 20 May, 2024; originally announced May 2024.

arXiv:2405.10242 [pdf, ps, other]

Quantum State Learning Implies Circuit Lower Bounds

Authors: Nai-Hui Chia, Daniel Liang, Fang Song

Abstract: We establish connections between state tomography, pseudorandomness, quantum state synthesis, and circuit lower bounds. In particular, let $\mathfrak{C}$ be a family of non-uniform quantum circuits of polynomial size and suppose that there exists an algorithm that, given copies of $|ψ\rangle$, distinguishes whether $|ψ\rangle$ is produced by $\mathfrak{C}$ or is Haar random, promised one of these… ▽ More We establish connections between state tomography, pseudorandomness, quantum state synthesis, and circuit lower bounds. In particular, let $\mathfrak{C}$ be a family of non-uniform quantum circuits of polynomial size and suppose that there exists an algorithm that, given copies of $|ψ\rangle$, distinguishes whether $|ψ\rangle$ is produced by $\mathfrak{C}$ or is Haar random, promised one of these is the case. For arbitrary fixed constant $c$, we show that if the algorithm uses at most $O(2^{n^c})$ time and $2^{n^{0.99}}$ samples then $\mathsf{stateBQE} \not\subset \mathsf{state}\mathfrak{C}$. Here $\mathsf{stateBQE} := \mathsf{stateBQTIME}[2^{O(n)}]$ and $\mathsf{state}\mathfrak{C}$ are state synthesis complexity classes as introduced by Rosenthal and Yuen (ITCS 2022), which capture problems with classical inputs but quantum output. Note that efficient tomography implies a similarly efficient distinguishing algorithm against Haar random states, even for nearly exponential-time algorithms. Because every state produced by a polynomial-size circuit can be learned with $2^{O(n)}$ samples and time, or $O(n^{ω(1)})$ samples and $2^{O(n^{ω(1)})}$ time, we show that even slightly non-trivial quantum state tomography algorithms would lead to new statements about quantum state synthesis. Finally, a slight modification of our proof shows that distinguishing algorithms for quantum states can imply circuit lower bounds for decision problems as well. This help sheds light on why time-efficient tomography algorithms for non-uniform quantum circuit classes has only had limited and partial progress. Our work parallels results by Arunachalam et al. (FOCS 2021) that revealed a similar connection between quantum learning of Boolean functions and circuit lower bounds for classical circuit classes, but modified for the purposes of state tomography and state synthesis. △ Less

Submitted 16 May, 2024; originally announced May 2024.

Comments: 53 pages

arXiv:2405.05763 [pdf]

DP-MDM: Detail-Preserving MR Reconstruction via Multiple Diffusion Models

Authors: Mengxiao Geng, Jiahao Zhu, Xiaolin Zhu, Qiqing Liu, Dong Liang, Qiegen Liu

Abstract: Detail features of magnetic resonance images play a cru-cial role in accurate medical diagnosis and treatment, as they capture subtle changes that pose challenges for doc-tors when performing precise judgments. However, the widely utilized naive diffusion model has limitations, as it fails to accurately capture more intricate details. To en-hance the quality of MRI reconstruction, we propose a com… ▽ More Detail features of magnetic resonance images play a cru-cial role in accurate medical diagnosis and treatment, as they capture subtle changes that pose challenges for doc-tors when performing precise judgments. However, the widely utilized naive diffusion model has limitations, as it fails to accurately capture more intricate details. To en-hance the quality of MRI reconstruction, we propose a comprehensive detail-preserving reconstruction method using multiple diffusion models to extract structure and detail features in k-space domain instead of image do-main. Moreover, virtual binary modal masks are utilized to refine the range of values in k-space data through highly adaptive center windows, which allows the model to focus its attention more efficiently. Last but not least, an inverted pyramid structure is employed, where the top-down image information gradually decreases, ena-bling a cascade representation. The framework effective-ly represents multi-scale sampled data, taking into ac-count the sparsity of the inverted pyramid architecture, and utilizes cascade training data distribution to repre-sent multi-scale data. Through a step-by-step refinement approach, the method refines the approximation of de-tails. Finally, the proposed method was evaluated by con-ducting experiments on clinical and public datasets. The results demonstrate that the proposed method outper-forms other methods. △ Less

Submitted 9 May, 2024; originally announced May 2024.

arXiv:2404.16680 [pdf, other]

Unrevealing the existence of nontensorial gravitational-wave polarizations from individual supermassive black hole binaries with pulsar timing arrays

Authors: Dicong Liang, Siyuan Chen, Chao Zhang, Lijing Shao

Abstract: With the strong evidence for a gravitational wave (GW) background in the nanohertz frequency band from pulsar timing arrays, the detection of continuous GWs from individual supermassive black hole binaries is already at the dawn. Utilizing continuous GWs to test theories of gravity, especially to test the polarizations of GWs is becoming more and more realistic. In this theoretical study, assuming… ▽ More With the strong evidence for a gravitational wave (GW) background in the nanohertz frequency band from pulsar timing arrays, the detection of continuous GWs from individual supermassive black hole binaries is already at the dawn. Utilizing continuous GWs to test theories of gravity, especially to test the polarizations of GWs is becoming more and more realistic. In this theoretical study, assuming a detection of signals from individual supermassive binary black holes, we use the null stream to estimate the capability of identifying the nontensorial polarizations of GWs. We consider cases for the nontensorial polarizations where the dipole radiation and quadrupole radiation dominate separately. With a frequentist method, we estimate the threshold of the nontensor-to-tensor relative amplitude above which extra polarizations can be detected. We also conduct Bayesian analysis to do parameter estimation with the null stream data. Our treatment provides a data-analysis methodology using the null stream to probe the nontensorial GW polarizations with pulsar timing arrays. △ Less

Submitted 25 April, 2024; originally announced April 2024.

Comments: 10 pages, 5 figures. Comments are welcome

arXiv:2404.08882 [pdf, other]

Explanations of MTF discrepancy in grating-based X-ray differential phase contrast CT imaging

Authors: Yuhang Tan, Jiecheng Yang, Hairong Zheng, Dong Liang, Peiping Zhu, Yongshuai Ge

Abstract: As a multi-contrast X-ray computed tomography (CT) imaging system, the grating-based Talbot-Lau interferometer is able to generate the absorption contrast and differential phase contrast (DPC) images concurrently. However, experiments found that the absorption CT (ACT) images have better spatial resolution, i.e., higher modulation transfer function (MTF), than the differential phase contrast CT (D… ▽ More As a multi-contrast X-ray computed tomography (CT) imaging system, the grating-based Talbot-Lau interferometer is able to generate the absorption contrast and differential phase contrast (DPC) images concurrently. However, experiments found that the absorption CT (ACT) images have better spatial resolution, i.e., higher modulation transfer function (MTF), than the differential phase contrast CT (DPCT) images. Until now, the root cause of such observed discrepancy has not been rigorously investigated. Through physical experiments, this study revealed that the phase grating in the Talbot-Lau interferometer induces direct superposition of paired split absorption signals and inverse superposition of paired split phase signals via diffraction. Further simulation experiments demonstrated that this splitting leads to a reduction in MTF in both ACT and DPCT images, with distinct superposition mechanisms contributing to the lower MTF in DPCT. Besides, such MTF discrepancy may also be affected in a minor extent by object composition, sample size, beam spectra and detector pixel size. Based on this study, the spatial resolution could be optimized when designing a grating-based DPC imaging system. △ Less

Submitted 12 April, 2024; originally announced April 2024.

Comments: 7 pages,3 figures

ACM Class: J.2

arXiv:2404.08450 [pdf, other]

Joint Physical-Digital Facial Attack Detection Via Simulating Spoofing Clues

Authors: Xianhua He, Dashuang Liang, Song Yang, Zhanlong Hao, Hui Ma, Binjie Mao, Xi Li, Yao Wang, Pengfei Yan, Ajian Liu

Abstract: Face recognition systems are frequently subjected to a variety of physical and digital attacks of different types. Previous methods have achieved satisfactory performance in scenarios that address physical attacks and digital attacks, respectively. However, few methods are considered to integrate a model that simultaneously addresses both physical and digital attacks, implying the necessity to dev… ▽ More Face recognition systems are frequently subjected to a variety of physical and digital attacks of different types. Previous methods have achieved satisfactory performance in scenarios that address physical attacks and digital attacks, respectively. However, few methods are considered to integrate a model that simultaneously addresses both physical and digital attacks, implying the necessity to develop and maintain multiple models. To jointly detect physical and digital attacks within a single model, we propose an innovative approach that can adapt to any network architecture. Our approach mainly contains two types of data augmentation, which we call Simulated Physical Spoofing Clues augmentation (SPSC) and Simulated Digital Spoofing Clues augmentation (SDSC). SPSC and SDSC augment live samples into simulated attack samples by simulating spoofing clues of physical and digital attacks, respectively, which significantly improve the capability of the model to detect "unseen" attack types. Extensive experiments show that SPSC and SDSC can achieve state-of-the-art generalization in Protocols 2.1 and 2.2 of the UniAttackData dataset, respectively. Our method won first place in "Unified Physical-Digital Face Attack Detection" of the 5th Face Anti-spoofing Challenge@CVPR2024. Our final submission obtains 3.75% APCER, 0.93% BPCER, and 2.34% ACER, respectively. Our code is available at https://github.com/Xianhua-He/cvpr2024-face-anti-spoofing-challenge. △ Less

Submitted 12 April, 2024; originally announced April 2024.

Comments: 10 pages with 6 figures, Accepted by CVPRW 2024

arXiv:2404.08023 [pdf, other]

Pathology-genomic fusion via biologically informed cross-modality graph learning for survival analysis

Authors: Zeyu Zhang, Yuanshen Zhao, Jingxian Duan, Yaou Liu, Hairong Zheng, Dong Liang, Zhenyu Zhang, Zhi-Cheng Li

Abstract: The diagnosis and prognosis of cancer are typically based on multi-modal clinical data, including histology images and genomic data, due to the complex pathogenesis and high heterogeneity. Despite the advancements in digital pathology and high-throughput genome sequencing, establishing effective multi-modal fusion models for survival prediction and revealing the potential association between histo… ▽ More The diagnosis and prognosis of cancer are typically based on multi-modal clinical data, including histology images and genomic data, due to the complex pathogenesis and high heterogeneity. Despite the advancements in digital pathology and high-throughput genome sequencing, establishing effective multi-modal fusion models for survival prediction and revealing the potential association between histopathology and transcriptomics remains challenging. In this paper, we propose Pathology-Genome Heterogeneous Graph (PGHG) that integrates whole slide images (WSI) and bulk RNA-Seq expression data with heterogeneous graph neural network for cancer survival analysis. The PGHG consists of biological knowledge-guided representation learning network and pathology-genome heterogeneous graph. The representation learning network utilizes the biological prior knowledge of intra-modal and inter-modal data associations to guide the feature extraction. The node features of each modality are updated through attention-based graph learning strategy. Unimodal features and bi-modal fused features are extracted via attention pooling module and then used for survival prediction. We evaluate the model on low-grade gliomas, glioblastoma, and kidney renal papillary cell carcinoma datasets from the Cancer Genome Atlas (TCGA) and the First Affiliated Hospital of Zhengzhou University (FAHZU). Extensive experimental results demonstrate that the proposed method outperforms both unimodal and other multi-modal fusion models. For demonstrating the model interpretability, we also visualize the attention heatmap of pathological images and utilize integrated gradient algorithm to identify important tissue structure, biological pathways and key genes. △ Less

Submitted 11 April, 2024; originally announced April 2024.

arXiv:2404.04586 [pdf, other]

PIE: Physics-inspired Low-light Enhancement

Authors: Dong Liang, Zhengyan Xu, Ling Li, Mingqiang Wei, Songcan Chen

Abstract: In this paper, we propose a physics-inspired contrastive learning paradigm for low-light enhancement, called PIE. PIE primarily addresses three issues: (i) To resolve the problem of existing learning-based methods often training a LLE model with strict pixel-correspondence image pairs, we eliminate the need for pixel-correspondence paired training data and instead train with unpaired images. (ii)… ▽ More In this paper, we propose a physics-inspired contrastive learning paradigm for low-light enhancement, called PIE. PIE primarily addresses three issues: (i) To resolve the problem of existing learning-based methods often training a LLE model with strict pixel-correspondence image pairs, we eliminate the need for pixel-correspondence paired training data and instead train with unpaired images. (ii) To address the disregard for negative samples and the inadequacy of their generation in existing methods, we incorporate physics-inspired contrastive learning for LLE and design the Bag of Curves (BoC) method to generate more reasonable negative samples that closely adhere to the underlying physical imaging principle. (iii) To overcome the reliance on semantic ground truths in existing methods, we propose an unsupervised regional segmentation module, ensuring regional brightness consistency while eliminating the dependency on semantic ground truths. Overall, the proposed PIE can effectively learn from unpaired positive/negative samples and smoothly realize non-semantic regional enhancement, which is clearly different from existing LLE efforts. Besides the novel architecture of PIE, we explore the gain of PIE on downstream tasks such as semantic segmentation and face detection. Training on readily available open data and extensive experiments demonstrate that our method surpasses the state-of-the-art LLE models over six independent cross-scenes datasets. PIE runs fast with reasonable GFLOPs in test time, making it easy to use on mobile devices. △ Less

Submitted 6 April, 2024; originally announced April 2024.

Comments: arXiv admin note: text overlap with arXiv:2112.06451

arXiv:2404.03813 [pdf, ps, other]

Agnostic Tomography of Stabilizer Product States

Authors: Sabee Grewal, Vishnu Iyer, William Kretschmer, Daniel Liang

Abstract: We define a quantum learning task called agnostic tomography, where given copies of an arbitrary state $ρ$ and a class of quantum states $\mathcal{C}$, the goal is to output a succinct description of a state that approximates $ρ$ at least as well as any state in $\mathcal{C}$ (up to some small error $\varepsilon$). This task generalizes ordinary quantum tomography of states in $\mathcal{C}$ and is… ▽ More We define a quantum learning task called agnostic tomography, where given copies of an arbitrary state $ρ$ and a class of quantum states $\mathcal{C}$, the goal is to output a succinct description of a state that approximates $ρ$ at least as well as any state in $\mathcal{C}$ (up to some small error $\varepsilon$). This task generalizes ordinary quantum tomography of states in $\mathcal{C}$ and is more challenging because the learning algorithm must be robust to perturbations of $ρ$. We give an efficient agnostic tomography algorithm for the class $\mathcal{C}$ of $n$-qubit stabilizer product states. Assuming $ρ$ has fidelity at least $τ$ with a stabilizer product state, the algorithm runs in time $n^{O(1 + \log(1/τ))} / \varepsilon^2$. This runtime is quasipolynomial in all parameters, and polynomial if $τ$ is a constant. △ Less

Submitted 4 April, 2024; originally announced April 2024.

Comments: 20 pages

arXiv:2404.02546 [pdf, ps, other]

Analysis and approximation to parabolic optimal control problems with measure-valued controls in time

Authors: Wei Gong, Dongdong Liang

Abstract: In this paper, we investigate an optimal control problem governed by parabolic equations with measure-valued controls over time. We establish the well-posedness of the optimal control problem and derive the first-order optimality condition using Clarke's subgradients, revealing a sparsity structure in time for the optimal control. Consequently, these optimal control problems represent a generaliza… ▽ More In this paper, we investigate an optimal control problem governed by parabolic equations with measure-valued controls over time. We establish the well-posedness of the optimal control problem and derive the first-order optimality condition using Clarke's subgradients, revealing a sparsity structure in time for the optimal control. Consequently, these optimal control problems represent a generalization of impulse control for evolution equations. To discretize the optimal control problem, we employ the space-time finite element method. Here, the state equation is approximated using piecewise linear and continuous finite elements in space, alongside a Petrov-Galerkin method utilizing piecewise constant trial functions and piecewise linear and continuous test functions in time. The control variable is discretized using the variational discretization concept. For error estimation, we initially derive a priori error estimates and stabilities for the finite element discretizations of the state and adjoint equations. Subsequently, we establish weak-* convergence for the control under the norm $\mathcal{M}(\bar I_c;L^2(ω))$, with a convergence order of $O(h^\frac{1}{2}+τ^\frac{1}{4})$ for the state. △ Less

Submitted 3 April, 2024; originally announced April 2024.

MSC Class: 65M06; 68M99; 49M40

arXiv:2404.02065 [pdf, other]

doi 10.1109/TMM.2024.3374594

Multi-Level Label Correction by Distilling Proximate Patterns for Semi-supervised Semantic Segmentation

Authors: Hui Xiao, Yuting Hong, Li Dong, Diqun Yan, Jiayan Zhuang, Junjie Xiong, Dongtai Liang, Chengbin Peng

Abstract: Semi-supervised semantic segmentation relieves the reliance on large-scale labeled data by leveraging unlabeled data. Recent semi-supervised semantic segmentation approaches mainly resort to pseudo-labeling methods to exploit unlabeled data. However, unreliable pseudo-labeling can undermine the semi-supervision processes. In this paper, we propose an algorithm called Multi-Level Label Correction (… ▽ More Semi-supervised semantic segmentation relieves the reliance on large-scale labeled data by leveraging unlabeled data. Recent semi-supervised semantic segmentation approaches mainly resort to pseudo-labeling methods to exploit unlabeled data. However, unreliable pseudo-labeling can undermine the semi-supervision processes. In this paper, we propose an algorithm called Multi-Level Label Correction (MLLC), which aims to use graph neural networks to capture structural relationships in Semantic-Level Graphs (SLGs) and Class-Level Graphs (CLGs) to rectify erroneous pseudo-labels. Specifically, SLGs represent semantic affinities between pairs of pixel features, and CLGs describe classification consistencies between pairs of pixel labels. With the support of proximate pattern information from graphs, MLLC can rectify incorrectly predicted pseudo-labels and can facilitate discriminative feature representations. We design an end-to-end network to train and perform this effective label corrections mechanism. Experiments demonstrate that MLLC can significantly improve supervised baselines and outperforms state-of-the-art approaches in different scenarios on Cityscapes and PASCAL VOC 2012 datasets. Specifically, MLLC improves the supervised baseline by at least 5% and 2% with DeepLabV2 and DeepLabV3+ respectively under different partition protocols. △ Less

Submitted 9 April, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

Comments: 12 pages, 8 figures. IEEE Transactions on Multimedia, 2024

arXiv:2404.00126 [pdf, ps, other]

Pseudoentanglement Ain't Cheap

Authors: Sabee Grewal, Vishnu Iyer, William Kretschmer, Daniel Liang

Abstract: We show that any pseudoentangled state ensemble with a gap of $t$ bits of entropy requires $Ω(t)$ non-Clifford gates to prepare. This bound is tight up to polylogarithmic factors if linear-time quantum-secure pseudorandom functions exist. Our result follows from a polynomial-time algorithm to estimate the entanglement entropy of a quantum state across any cut of qubits. When run on an $n$-qubit st… ▽ More We show that any pseudoentangled state ensemble with a gap of $t$ bits of entropy requires $Ω(t)$ non-Clifford gates to prepare. This bound is tight up to polylogarithmic factors if linear-time quantum-secure pseudorandom functions exist. Our result follows from a polynomial-time algorithm to estimate the entanglement entropy of a quantum state across any cut of qubits. When run on an $n$-qubit state that is stabilized by at least $2^{n-t}$ Pauli operators, our algorithm produces an estimate that is within an additive factor of $\frac{t}{2}$ bits of the true entanglement entropy. △ Less

Submitted 11 April, 2024; v1 submitted 29 March, 2024; originally announced April 2024.

Comments: 15 pages; v2: slight edits to concurrent work section

arXiv:2403.15132 [pdf, other]

Transfer CLIP for Generalizable Image Denoising

Authors: Jun Cheng, Dong Liang, Shan Tan

Abstract: Image denoising is a fundamental task in computer vision. While prevailing deep learning-based supervised and self-supervised methods have excelled in eliminating in-distribution noise, their susceptibility to out-of-distribution (OOD) noise remains a significant challenge. The recent emergence of contrastive language-image pre-training (CLIP) model has showcased exceptional capabilities in open-w… ▽ More Image denoising is a fundamental task in computer vision. While prevailing deep learning-based supervised and self-supervised methods have excelled in eliminating in-distribution noise, their susceptibility to out-of-distribution (OOD) noise remains a significant challenge. The recent emergence of contrastive language-image pre-training (CLIP) model has showcased exceptional capabilities in open-world image recognition and segmentation. Yet, the potential for leveraging CLIP to enhance the robustness of low-level tasks remains largely unexplored. This paper uncovers that certain dense features extracted from the frozen ResNet image encoder of CLIP exhibit distortion-invariant and content-related properties, which are highly desirable for generalizable denoising. Leveraging these properties, we devise an asymmetrical encoder-decoder denoising network, which incorporates dense features including the noisy image and its multi-scale features from the frozen ResNet encoder of CLIP into a learnable image decoder to achieve generalizable denoising. The progressive feature augmentation strategy is further proposed to mitigate feature overfitting and improve the robustness of the learnable decoder. Extensive experiments and comparisons conducted across diverse OOD noises, including synthetic noise, real-world sRGB noise, and low-dose CT image noise, demonstrate the superior generalization ability of our method. △ Less

Submitted 22 March, 2024; originally announced March 2024.

Comments: Accepted by CVPR2024

arXiv:2403.14972 [pdf, other]

A Picture Is Worth a Graph: Blueprint Debate on Graph for Multimodal Reasoning

Authors: Changmeng Zheng, Dayong Liang, Wengyu Zhang, Xiao-Yong Wei, Tat-Seng Chua, Qing Li

Abstract: This paper presents a pilot study aimed at introducing multi-agent debate into multimodal reasoning. The study addresses two key challenges: the trivialization of opinions resulting from excessive summarization and the diversion of focus caused by distractor concepts introduced from images. These challenges stem from the inductive (bottom-up) nature of existing debating schemes. To address the iss… ▽ More This paper presents a pilot study aimed at introducing multi-agent debate into multimodal reasoning. The study addresses two key challenges: the trivialization of opinions resulting from excessive summarization and the diversion of focus caused by distractor concepts introduced from images. These challenges stem from the inductive (bottom-up) nature of existing debating schemes. To address the issue, we propose a deductive (top-down) debating approach called Blueprint Debate on Graphs (BDoG). In BDoG, debates are confined to a blueprint graph to prevent opinion trivialization through world-level summarization. Moreover, by storing evidence in branches within the graph, BDoG mitigates distractions caused by frequent but irrelevant concepts. Extensive experiments validate BDoG, achieving state-of-the-art results in Science QA and MMBench with significant improvements over previous methods. △ Less

Submitted 22 March, 2024; originally announced March 2024.

Comments: Work in progress

arXiv:2403.09493 [pdf, other]

Anomaly Detection by Adapting a pre-trained Vision Language Model

Authors: Yuxuan Cai, Xinwei He, Dingkang Liang, Ao Tong, Xiang Bai

Abstract: Recently, large vision and language models have shown their success when adapting them to many downstream tasks. In this paper, we present a unified framework named CLIP-ADA for Anomaly Detection by Adapting a pre-trained CLIP model. To this end, we make two important improvements: 1) To acquire unified anomaly detection across industrial images of multiple categories, we introduce the learnable p… ▽ More Recently, large vision and language models have shown their success when adapting them to many downstream tasks. In this paper, we present a unified framework named CLIP-ADA for Anomaly Detection by Adapting a pre-trained CLIP model. To this end, we make two important improvements: 1) To acquire unified anomaly detection across industrial images of multiple categories, we introduce the learnable prompt and propose to associate it with abnormal patterns through self-supervised learning. 2) To fully exploit the representation power of CLIP, we introduce an anomaly region refinement strategy to refine the localization quality. During testing, the anomalies are localized by directly calculating the similarity between the representation of the learnable prompt and the image. Comprehensive experiments demonstrate the superiority of our framework, e.g., we achieve the state-of-the-art 97.5/55.6 and 89.3/33.1 on MVTec-AD and VisA for anomaly detection and localization. In addition, the proposed method also achieves encouraging performance with marginal training data, which is more challenging. △ Less

Submitted 14 March, 2024; originally announced March 2024.

arXiv:2403.06323 [pdf, other]

Risk-Sensitive RL with Optimized Certainty Equivalents via Reduction to Standard RL

Authors: Kaiwen Wang, Dawen Liang, Nathan Kallus, Wen Sun

Abstract: We study Risk-Sensitive Reinforcement Learning (RSRL) with the Optimized Certainty Equivalent (OCE) risk, which generalizes Conditional Value-at-risk (CVaR), entropic risk and Markowitz's mean-variance. Using an augmented Markov Decision Process (MDP), we propose two general meta-algorithms via reductions to standard RL: one based on optimistic algorithms and another based on policy optimization.… ▽ More We study Risk-Sensitive Reinforcement Learning (RSRL) with the Optimized Certainty Equivalent (OCE) risk, which generalizes Conditional Value-at-risk (CVaR), entropic risk and Markowitz's mean-variance. Using an augmented Markov Decision Process (MDP), we propose two general meta-algorithms via reductions to standard RL: one based on optimistic algorithms and another based on policy optimization. Our optimistic meta-algorithm generalizes almost all prior RSRL theory with entropic risk or CVaR. Under discrete rewards, our optimistic theory also certifies the first RSRL regret bounds for MDPs with bounded coverability, e.g., exogenous block MDPs. Under discrete rewards, our policy optimization meta-algorithm enjoys both global convergence and local improvement guarantees in a novel metric that lower bounds the true OCE risk. Finally, we instantiate our framework with PPO, construct an MDP, and show that it learns the optimal risk-sensitive policy while prior algorithms provably fail. △ Less

Submitted 10 March, 2024; originally announced March 2024.

arXiv:2403.05385 [pdf, other]

Switching the Loss Reduces the Cost in Batch Reinforcement Learning

Authors: Alex Ayoub, Kaiwen Wang, Vincent Liu, Samuel Robertson, James McInerney, Dawen Liang, Nathan Kallus, Csaba Szepesvári

Abstract: We propose training fitted Q-iteration with log-loss (FQI-LOG) for batch reinforcement learning (RL). We show that the number of samples needed to learn a near-optimal policy with FQI-LOG scales with the accumulated cost of the optimal policy, which is zero in problems where acting optimally achieves the goal and incurs no cost. In doing so, we provide a general framework for proving… ▽ More We propose training fitted Q-iteration with log-loss (FQI-LOG) for batch reinforcement learning (RL). We show that the number of samples needed to learn a near-optimal policy with FQI-LOG scales with the accumulated cost of the optimal policy, which is zero in problems where acting optimally achieves the goal and incurs no cost. In doing so, we provide a general framework for proving $\textit{small-cost}$ bounds, i.e. bounds that scale with the optimal achievable cost, in batch RL. Moreover, we empirically verify that FQI-LOG uses fewer samples than FQI trained with squared loss on problems where the optimal policy reliably achieves the goal. △ Less

Submitted 12 March, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

arXiv:2403.02151 [pdf, other]

TripoSR: Fast 3D Object Reconstruction from a Single Image

Authors: Dmitry Tochilkin, David Pankratz, Zexiang Liu, Zixuan Huang, Adam Letts, Yangguang Li, Ding Liang, Christian Laforte, Varun Jampani, Yan-Pei Cao

Abstract: This technical report introduces TripoSR, a 3D reconstruction model leveraging transformer architecture for fast feed-forward 3D generation, producing 3D mesh from a single image in under 0.5 seconds. Building upon the LRM network architecture, TripoSR integrates substantial improvements in data processing, model design, and training techniques. Evaluations on public datasets show that TripoSR exh… ▽ More This technical report introduces TripoSR, a 3D reconstruction model leveraging transformer architecture for fast feed-forward 3D generation, producing 3D mesh from a single image in under 0.5 seconds. Building upon the LRM network architecture, TripoSR integrates substantial improvements in data processing, model design, and training techniques. Evaluations on public datasets show that TripoSR exhibits superior performance, both quantitatively and qualitatively, compared to other open-source alternatives. Released under the MIT license, TripoSR is intended to empower researchers, developers, and creatives with the latest advancements in 3D generative AI. △ Less

Submitted 4 March, 2024; originally announced March 2024.

Comments: Model: https://huggingface.co/stabilityai/TripoSR Code: https://github.com/VAST-AI-Research/TripoSR Demo: https://huggingface.co/spaces/stabilityai/TripoSR

arXiv:2403.01439 [pdf, other]

Dynamic Adapter Meets Prompt Tuning: Parameter-Efficient Transfer Learning for Point Cloud Analysis

Authors: Xin Zhou, Dingkang Liang, Wei Xu, Xingkui Zhu, Yihan Xu, Zhikang Zou, Xiang Bai

Abstract: Point cloud analysis has achieved outstanding performance by transferring point cloud pre-trained models. However, existing methods for model adaptation usually update all model parameters, i.e., full fine-tuning paradigm, which is inefficient as it relies on high computational costs (e.g., training GPU memory) and massive storage space. In this paper, we aim to study parameter-efficient transfer… ▽ More Point cloud analysis has achieved outstanding performance by transferring point cloud pre-trained models. However, existing methods for model adaptation usually update all model parameters, i.e., full fine-tuning paradigm, which is inefficient as it relies on high computational costs (e.g., training GPU memory) and massive storage space. In this paper, we aim to study parameter-efficient transfer learning for point cloud analysis with an ideal trade-off between task performance and parameter efficiency. To achieve this goal, we freeze the parameters of the default pre-trained models and then propose the Dynamic Adapter, which generates a dynamic scale for each token, considering the token significance to the downstream task. We further seamlessly integrate Dynamic Adapter with Prompt Tuning (DAPT) by constructing Internal Prompts, capturing the instance-specific features for interaction. Extensive experiments conducted on five challenging datasets demonstrate that the proposed DAPT achieves superior performance compared to the full fine-tuning counterparts while significantly reducing the trainable parameters and training GPU memory by 95% and 35%, respectively. Code is available at https://github.com/LMD0311/DAPT. △ Less

Submitted 5 April, 2024; v1 submitted 3 March, 2024; originally announced March 2024.

Comments: Accepted to CVPR 2024. Code is available at https://github.com/LMD0311/DAPT

arXiv:2402.17521 [pdf, other]

AVS-Net: Point Sampling with Adaptive Voxel Size for 3D Scene Understanding

Authors: Hongcheng Yang, Dingkang Liang, Dingyuan Zhang, Zhe Liu, Zhikang Zou, Xingyu Jiang, Yingying Zhu

Abstract: The recent advancements in point cloud learning have enabled intelligent vehicles and robots to comprehend 3D environments better. However, processing large-scale 3D scenes remains a challenging problem, such that efficient downsampling methods play a crucial role in point cloud learning. Existing downsampling methods either require a huge computational burden or sacrifice fine-grained geometric i… ▽ More The recent advancements in point cloud learning have enabled intelligent vehicles and robots to comprehend 3D environments better. However, processing large-scale 3D scenes remains a challenging problem, such that efficient downsampling methods play a crucial role in point cloud learning. Existing downsampling methods either require a huge computational burden or sacrifice fine-grained geometric information. For such purpose, this paper presents an advanced sampler that achieves both high accuracy and efficiency. The proposed method utilizes voxel centroid sampling as a foundation but effectively addresses the challenges regarding voxel size determination and the preservation of critical geometric cues. Specifically, we propose a Voxel Adaptation Module that adaptively adjusts voxel sizes with the reference of point-based downsampling ratio. This ensures that the sampling results exhibit a favorable distribution for comprehending various 3D objects or scenes. Meanwhile, we introduce a network compatible with arbitrary voxel sizes for sampling and feature extraction while maintaining high efficiency. The proposed approach is demonstrated with 3D object detection and 3D semantic segmentation. Compared to existing state-of-the-art methods, our approach achieves better accuracy on outdoor and indoor large-scale datasets, e.g. Waymo and ScanNet, with promising efficiency. △ Less

Submitted 15 April, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

Comments: 10 pages, 7 figures

arXiv:2402.14598 [pdf, other]

Brain-inspired Distributed Memorization Learning for Efficient Feature-free Unsupervised Domain Adaptation

Authors: Jianming Lv, Depin Liang, Zequan Liang, Yaobin Zhang, Sijun Xia

Abstract: Compared with gradient based artificial neural networks, biological neural networks usually show a more powerful generalization ability to quickly adapt to unknown environments without using any gradient back-propagation procedure. Inspired by the distributed memory mechanism of human brains, we propose a novel gradient-free Distributed Memorization Learning mechanism, namely DML, to support quick… ▽ More Compared with gradient based artificial neural networks, biological neural networks usually show a more powerful generalization ability to quickly adapt to unknown environments without using any gradient back-propagation procedure. Inspired by the distributed memory mechanism of human brains, we propose a novel gradient-free Distributed Memorization Learning mechanism, namely DML, to support quick domain adaptation of transferred models. In particular, DML adopts randomly connected neurons to memorize the association of input signals, which are propagated as impulses, and makes the final decision by associating the distributed memories based on their confidence. More importantly, DML is able to perform reinforced memorization based on unlabeled data to quickly adapt to a new domain without heavy fine-tuning of deep features, which makes it very suitable for deploying on edge devices. Experiments based on four cross-domain real-world datasets show that DML can achieve superior performance of real-time domain adaptation compared with traditional gradient based MLP with more than 10% improvement of accuracy while reducing 87% of the timing cost of optimization. △ Less

Submitted 4 February, 2024; originally announced February 2024.

Comments: 15 pages,15 figures

arXiv:2402.13188 [pdf, other]

Question Calibration and Multi-Hop Modeling for Temporal Question Answering

Authors: Chao Xue, Di Liang, Pengfei Wang, Jing Zhang

Abstract: Many models that leverage knowledge graphs (KGs) have recently demonstrated remarkable success in question answering (QA) tasks. In the real world, many facts contained in KGs are time-constrained thus temporal KGQA has received increasing attention. Despite the fruitful efforts of previous models in temporal KGQA, they still have several limitations. (I) They adopt pre-trained language models (PL… ▽ More Many models that leverage knowledge graphs (KGs) have recently demonstrated remarkable success in question answering (QA) tasks. In the real world, many facts contained in KGs are time-constrained thus temporal KGQA has received increasing attention. Despite the fruitful efforts of previous models in temporal KGQA, they still have several limitations. (I) They adopt pre-trained language models (PLMs) to obtain question representations, while PLMs tend to focus on entity information and ignore entity transfer caused by temporal constraints, and finally fail to learn specific temporal representations of entities. (II) They neither emphasize the graph structure between entities nor explicitly model the multi-hop relationship in the graph, which will make it difficult to solve complex multi-hop question answering. To alleviate this problem, we propose a novel Question Calibration and Multi-Hop Modeling (QC-MHM) approach. Specifically, We first calibrate the question representation by fusing the question and the time-constrained concepts in KG. Then, we construct the GNN layer to complete multi-hop message passing. Finally, the question representation is combined with the embedding output by the GNN to generate the final prediction. Empirical results verify that the proposed model achieves better performance than the state-of-the-art models in the benchmark dataset. Notably, the Hits@1 and Hits@10 results of QC-MHM on the CronQuestions dataset's complex questions are absolutely improved by 5.1% and 1.2% compared to the best-performing baseline. Moreover, QC-MHM can generate interpretable and trustworthy predictions. △ Less

Submitted 20 February, 2024; originally announced February 2024.

Comments: Accepted by AAAI 2024

arXiv:2402.10739 [pdf, other]

PointMamba: A Simple State Space Model for Point Cloud Analysis

Authors: Dingkang Liang, Xin Zhou, Wei Xu, Xingkui Zhu, Zhikang Zou, Xiaoqing Ye, Xiao Tan, Xiang Bai

Abstract: Transformers have become one of the foundational architectures in point cloud analysis tasks due to their excellent global modeling ability. However, the attention mechanism has quadratic complexity, making the design of a linear complexity method with global modeling appealing. In this paper, we propose PointMamba, transferring the success of Mamba, a recent representative state space model (SSM)… ▽ More Transformers have become one of the foundational architectures in point cloud analysis tasks due to their excellent global modeling ability. However, the attention mechanism has quadratic complexity, making the design of a linear complexity method with global modeling appealing. In this paper, we propose PointMamba, transferring the success of Mamba, a recent representative state space model (SSM), from NLP to point cloud analysis tasks. Unlike traditional Transformers, PointMamba employs a linear complexity algorithm, presenting global modeling capacity while significantly reducing computational costs. Specifically, our method leverages space-filling curves for effective point tokenization and adopts an extremely simple, non-hierarchical Mamba encoder as the backbone. Comprehensive evaluations demonstrate that PointMamba achieves superior performance across multiple datasets while significantly reducing GPU memory usage and FLOPs. This work underscores the potential of SSMs in 3D vision-related tasks and presents a simple yet effective Mamba-based baseline for future research. The code is available at https://github.com/LMD0311/PointMamba. △ Less

Submitted 29 May, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

Comments: Update the architecture and performance. The code is available at https://github.com/LMD0311/PointMamba

arXiv:2402.05954 [pdf, other]

EasyFS: an Efficient Model-free Feature Selection Framework via Elastic Transformation of Features

Authors: Jianming Lv, Sijun Xia, Depin Liang, Wei Chen

Abstract: Traditional model-free feature selection methods treat each feature independently while disregarding the interrelationships among features, which leads to relatively poor performance compared with the model-aware methods. To address this challenge, we propose an efficient model-free feature selection framework via elastic expansion and compression of the features, namely EasyFS, to achieve better… ▽ More Traditional model-free feature selection methods treat each feature independently while disregarding the interrelationships among features, which leads to relatively poor performance compared with the model-aware methods. To address this challenge, we propose an efficient model-free feature selection framework via elastic expansion and compression of the features, namely EasyFS, to achieve better performance than state-of-the-art model-aware methods while sharing the characters of efficiency and flexibility with the existing model-free methods. In particular, EasyFS expands the feature space by using the random non-linear projection network to achieve the non-linear combinations of the original features, so as to model the interrelationships among the features and discover most correlated features. Meanwhile, a novel redundancy measurement based on the change of coding rate is proposed for efficient filtering of redundant features. Comprehensive experiments on 21 different datasets show that EasyFS outperforms state-of-the art methods up to 10.9\% in the regression tasks and 5.7\% in the classification tasks while saving more than 94\% of the time. △ Less

Submitted 4 February, 2024; originally announced February 2024.

arXiv:2402.05740 [pdf, other]

CounterCLR: Counterfactual Contrastive Learning with Non-random Missing Data in Recommendation

Authors: Jun Wang, Haoxuan Li, Chi Zhang, Dongxu Liang, Enyun Yu, Wenwu Ou, Wenjia Wang

Abstract: Recommender systems are designed to learn user preferences from observed feedback and comprise many fundamental tasks, such as rating prediction and post-click conversion rate (pCVR) prediction. However, the observed feedback usually suffer from two issues: selection bias and data sparsity, where biased and insufficient feedback seriously degrade the performance of recommender systems in terms of… ▽ More Recommender systems are designed to learn user preferences from observed feedback and comprise many fundamental tasks, such as rating prediction and post-click conversion rate (pCVR) prediction. However, the observed feedback usually suffer from two issues: selection bias and data sparsity, where biased and insufficient feedback seriously degrade the performance of recommender systems in terms of accuracy and ranking. Existing solutions for handling the issues, such as data imputation and inverse propensity score, are highly susceptible to additional trained imputation or propensity models. In this work, we propose a novel counterfactual contrastive learning framework for recommendation, named CounterCLR, to tackle the problem of non-random missing data by exploiting the advances in contrast learning. Specifically, the proposed CounterCLR employs a deep representation network, called CauNet, to infer non-random missing data in recommendations and perform user preference modeling by further introducing a self-supervised contrastive learning task. Our CounterCLR mitigates the selection bias problem without the need for additional models or estimators, while also enhancing the generalization ability in cases of sparse data. Experiments on real-world datasets demonstrate the effectiveness and superiority of our method. △ Less

Submitted 8 February, 2024; originally announced February 2024.

Comments: 2023 IEEE International Conference on Data Mining (ICDM)

arXiv:2402.02332 [pdf, other]

Minusformer: Improving Time Series Forecasting by Progressively Learning Residuals

Authors: Daojun Liang, Haixia Zhang, Dongfeng Yuan, Bingzheng Zhang, Minggao Zhang

Abstract: In this paper, we find that ubiquitous time series (TS) forecasting models are prone to severe overfitting. To cope with this problem, we embrace a de-redundancy approach to progressively reinstate the intrinsic values of TS for future intervals. Specifically, we introduce a dual-stream and subtraction mechanism, which is a deep Boosting ensemble learning method. And the vanilla Transformer is ren… ▽ More In this paper, we find that ubiquitous time series (TS) forecasting models are prone to severe overfitting. To cope with this problem, we embrace a de-redundancy approach to progressively reinstate the intrinsic values of TS for future intervals. Specifically, we introduce a dual-stream and subtraction mechanism, which is a deep Boosting ensemble learning method. And the vanilla Transformer is renovated by reorienting the information aggregation mechanism from addition to subtraction. Then, we incorporate an auxiliary output branch into each block of the original model to construct a highway leading to the ultimate prediction. The output of subsequent modules in this branch will subtract the previously learned results, enabling the model to learn the residuals of the supervision signal, layer by layer. This designing facilitates the learning-driven implicit progressive decomposition of the input and output streams, empowering the model with heightened versatility, interpretability, and resilience against overfitting. Since all aggregations in the model are minus signs, which is called Minusformer. Extensive experiments demonstrate the proposed method outperform existing state-of-the-art methods, yielding an average performance improvement of 11.9% across various datasets.The code has been released at https://github.com/Anoise/Minusformer. △ Less

Submitted 17 June, 2024; v1 submitted 3 February, 2024; originally announced February 2024.

arXiv:2401.17509 [pdf, other]

Anything in Any Scene: Photorealistic Video Object Insertion

Authors: Chen Bai, Zeman Shao, Guoxiang Zhang, Di Liang, Jie Yang, Zhuorui Zhang, Yujian Guo, Chengzhang Zhong, Yiqiao Qiu, Zhendong Wang, Yichen Guan, Xiaoyin Zheng, Tao Wang, Cheng Lu

Abstract: Realistic video simulation has shown significant potential across diverse applications, from virtual reality to film production. This is particularly true for scenarios where capturing videos in real-world settings is either impractical or expensive. Existing approaches in video simulation often fail to accurately model the lighting environment, represent the object geometry, or achieve high level… ▽ More Realistic video simulation has shown significant potential across diverse applications, from virtual reality to film production. This is particularly true for scenarios where capturing videos in real-world settings is either impractical or expensive. Existing approaches in video simulation often fail to accurately model the lighting environment, represent the object geometry, or achieve high levels of photorealism. In this paper, we propose Anything in Any Scene, a novel and generic framework for realistic video simulation that seamlessly inserts any object into an existing dynamic video with a strong emphasis on physical realism. Our proposed general framework encompasses three key processes: 1) integrating a realistic object into a given scene video with proper placement to ensure geometric realism; 2) estimating the sky and environmental lighting distribution and simulating realistic shadows to enhance the light realism; 3) employing a style transfer network that refines the final video output to maximize photorealism. We experimentally demonstrate that Anything in Any Scene framework produces simulated videos of great geometric realism, lighting realism, and photorealism. By significantly mitigating the challenges associated with video data generation, our framework offers an efficient and cost-effective solution for acquiring high-quality videos. Furthermore, its applications extend well beyond video data augmentation, showing promising potential in virtual reality, video editing, and various other video-centric applications. Please check our project website https://anythinginanyscene.github.io for access to our project code and more high-resolution video results. △ Less

Submitted 30 January, 2024; originally announced January 2024.

arXiv:2401.15319 [pdf, other]

doi 10.1109/LRA.2023.3313053

You Only Look Bottom-Up for Monocular 3D Object Detection

Authors: Kaixin Xiong, Dingyuan Zhang, Dingkang Liang, Zhe Liu, Hongcheng Yang, Wondimu Dikubab, Jianwei Cheng, Xiang Bai

Abstract: Monocular 3D Object Detection is an essential task for autonomous driving. Meanwhile, accurate 3D object detection from pure images is very challenging due to the loss of depth information. Most existing image-based methods infer objects' location in 3D space based on their 2D sizes on the image plane, which usually ignores the intrinsic position clues from images, leading to unsatisfactory perfor… ▽ More Monocular 3D Object Detection is an essential task for autonomous driving. Meanwhile, accurate 3D object detection from pure images is very challenging due to the loss of depth information. Most existing image-based methods infer objects' location in 3D space based on their 2D sizes on the image plane, which usually ignores the intrinsic position clues from images, leading to unsatisfactory performances. Motivated by the fact that humans could leverage the bottom-up positional clues to locate objects in 3D space from a single image, in this paper, we explore the position modeling from the image feature column and propose a new method named You Only Look Bottum-Up (YOLOBU). Specifically, our YOLOBU leverages Column-based Cross Attention to determine how much a pixel contributes to pixels above it. Next, the Row-based Reverse Cumulative Sum (RRCS) is introduced to build the connections of pixels in the bottom-up direction. Our YOLOBU fully explores the position clues for monocular 3D detection via building the relationship of pixels from the bottom-up way. Extensive experiments on the KITTI dataset demonstrate the effectiveness and superiority of our method. △ Less

Submitted 27 January, 2024; originally announced January 2024.

Comments: Accepted by IEEE Robotics and Automation Letters (RA-L)

arXiv:2401.13757 [pdf]

Heterogeneously Integrated Laser on Silicon with Non-Volatile Wavelength Tuning

Authors: Bassem Tossoun, Di Liang, Xia Sheng, John Paul Strachan, Raymond G. Beausoleil

Abstract: The von-Neumann bottleneck has constrained computing systems from efficiently operating on the increasingly large demand in data from networks and devices. Silicon (Si) photonics offers a powerful solution for this issue by providing a platform for high-bandwidth, energy-efficient interconnects. Furthermore, memristors have emerged as a fundamental building block for non-volatile data storage and… ▽ More The von-Neumann bottleneck has constrained computing systems from efficiently operating on the increasingly large demand in data from networks and devices. Silicon (Si) photonics offers a powerful solution for this issue by providing a platform for high-bandwidth, energy-efficient interconnects. Furthermore, memristors have emerged as a fundamental building block for non-volatile data storage and novel computing architectures with powerful in-memory processing capabilities. In this paper, we integrate an Al2O3 memristor into a heterogeneous Si quantum dot microring laser to demonstrate the first laser with non-volatile optical memory. The memristor alters the effective optical modal index of the microring laser cavity by the plasma dispersion effect in the high resistance state (HRS) or Joule heating in the low resistance state (LRS), subsequently controlling the output wavelength of the laser in a non-volatile manner. This device enables a novel pathway for future optoelectronic neuromorphic computers and optical memory chips. △ Less

Submitted 24 January, 2024; originally announced January 2024.

arXiv:2401.10516 [pdf, other]

Episodic Reinforcement Learning with Expanded State-reward Space

Authors: Dayang Liang, Yaru Zhang, Yunlong Liu

Abstract: Empowered by deep neural networks, deep reinforcement learning (DRL) has demonstrated tremendous empirical successes in various domains, including games, health care, and autonomous driving. Despite these advancements, DRL is still identified as data-inefficient as effective policies demand vast numbers of environmental samples. Recently, episodic control (EC)-based model-free DRL methods enable s… ▽ More Empowered by deep neural networks, deep reinforcement learning (DRL) has demonstrated tremendous empirical successes in various domains, including games, health care, and autonomous driving. Despite these advancements, DRL is still identified as data-inefficient as effective policies demand vast numbers of environmental samples. Recently, episodic control (EC)-based model-free DRL methods enable sample efficiency by recalling past experiences from episodic memory. However, existing EC-based methods suffer from the limitation of potential misalignment between the state and reward spaces for neglecting the utilization of (past) retrieval states with extensive information, which probably causes inaccurate value estimation and degraded policy performance. To tackle this issue, we introduce an efficient EC-based DRL framework with expanded state-reward space, where the expanded states used as the input and the expanded rewards used in the training both contain historical and current information. To be specific, we reuse the historical states retrieved by EC as part of the input states and integrate the retrieved MC-returns into the immediate reward in each interactive transition. As a result, our method is able to simultaneously achieve the full utilization of retrieval information and the better evaluation of state values by a Temporal Difference (TD) loss. Empirical results on challenging Box2d and Mujoco tasks demonstrate the superiority of our method over a recent sibling method and common baselines. Further, we also verify our method's effectiveness in alleviating Q-value overestimation by additional experiments of Q-value comparison. △ Less

Submitted 19 January, 2024; originally announced January 2024.

Comments: Accepted at AAMAS'24

arXiv:2401.07757 [pdf, other]

doi 10.1103/PhysRevD.109.084076

Dynamic instability analysis for bumblebee black holes: the odd parity

Authors: Zhan-Feng Mai, Rui Xu, Dicong Liang, Lijing Shao

Abstract: Spherical black-hole (BH) solutions have been found in the bumblebee gravity where a vector field nonminimally couples to the Ricci tensor. We study dynamic (in)stability associated with the gravitational and vector perturbations of odd parity against these bumblebee BHs. Under the plane-wave approximation, we find that bumblebee BHs do not suffer ghost instability, but gradient instability and ta… ▽ More Spherical black-hole (BH) solutions have been found in the bumblebee gravity where a vector field nonminimally couples to the Ricci tensor. We study dynamic (in)stability associated with the gravitational and vector perturbations of odd parity against these bumblebee BHs. Under the plane-wave approximation, we find that bumblebee BHs do not suffer ghost instability, but gradient instability and tachyonic instability exist when the bumblebee charge exceeds certain values. The existence of the instabilities also depends on the nonminimal coupling constant $ξ$ that, there is a minimal value $ξ\sim 4πG$ with $G$ the gravitational constant for the instabilities to happen. The theoretical consideration for bumblebee BH stability turns out to place stronger constraints on the parameter space than those from the recent observations of supermassive BH shadows by the Event Horizon Telescope Collaboration. It is also reminiscent of Penrose's cosmic censorship conjecture since the charge of bumblebee BHs cannot be too large due to the dynamic instabilities. Specifically, for $ξ(ξ-16πG) > 0$, we find that the charge of a bumblebee BH cannot be larger than its mass. △ Less

Submitted 8 April, 2024; v1 submitted 15 January, 2024; originally announced January 2024.

Comments: Accepted by PRD, 13 pages, 4 figures, references added

Journal ref: Phys. Rev. D 109 (2024) 084076

arXiv:2401.06990 [pdf, other]

doi 10.1080/01621459.2024.2302198

Graphical Principal Component Analysis of Multivariate Functional Time Series

Authors: Jianbin Tan, Decai Liang, Yongtao Guan, Hui Huang

Abstract: In this paper, we consider multivariate functional time series with a two-way dependence structure: a serial dependence across time points and a graphical interaction among the multiple functions within each time point. We develop the notion of dynamic weak separability, a more general condition than those assumed in literature, and use it to characterize the two-way structure in multivariate func… ▽ More In this paper, we consider multivariate functional time series with a two-way dependence structure: a serial dependence across time points and a graphical interaction among the multiple functions within each time point. We develop the notion of dynamic weak separability, a more general condition than those assumed in literature, and use it to characterize the two-way structure in multivariate functional time series. Based on the proposed weak separability, we develop a unified framework for functional graphical models and dynamic principal component analysis, and further extend it to optimally reconstruct signals from contaminated functional data using graphical-level information. We investigate asymptotic properties of the resulting estimators and illustrate the effectiveness of our proposed approach through extensive simulations. We apply our method to hourly air pollution data that were collected from a monitoring network in China. △ Less

Submitted 13 January, 2024; originally announced January 2024.

Comments: Journal of the American Statistical Association (2024)

Journal ref: Journal of the American Statistical Association (2024): 1-24

arXiv:2401.05506 [pdf, ps, other]

On the Coherency of Completed Group Algebra

Authors: David Burns, Yu Kuang, Dingli Liang

Abstract: We investigate coherency properties of certain completed integral group rings, precisely for compact $p$-adic Lie groups. We investigate coherency properties of certain completed integral group rings, precisely for compact $p$-adic Lie groups. △ Less

Submitted 16 January, 2024; v1 submitted 10 January, 2024; originally announced January 2024.

Comments: 16 pages. Submitted

MSC Class: 16D10; 16E05; 20E18(primary); 16S34(secondary)

Showing 1–50 of 378 results for author: Liang, D