
VoxBlink2: A 100K+ Speaker Recognition Corpus and the Open-Set Speaker-Identification Benchmark

Authors: Yuke Lin, Ming Cheng, Fulin Zhang, Yingying Gao, Shilei Zhang, Ming Li

Abstract: In this paper, we provide a large audio-visual speaker recognition dataset, VoxBlink2, which includes approximately 10M utterances with videos from 110K+ speakers in the wild. This dataset represents a significant expansion over the VoxBlink dataset, encompassing a broader diversity of speakers and scenarios thanks to an optimized data collection pipeline. Afterward, we explore the impact of training strategies, data scale, and model complexity on speaker verification and finally establish a new single-model state-of-the-art EER at 0.170% and minDCF at 0.006% on the VoxCeleb1-O test set. Such remarkable results motivate us to explore speaker recognition from a new challenging perspective. We raise the Open-Set Speaker-Identification task, which is designed to either match a probe utterance with a known gallery speaker or categorize it as an unknown query. Associated with this task, we design concrete benchmark and evaluation protocols. The data and model resources can be found at http://voxblink2.github.io.

Submitted 16 July, 2024; originally announced July 2024.

Comments: Accepted by Interspeech 2024
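
The open-set decision described in the abstract above (match a probe to a known gallery speaker, or reject it as unknown) can be illustrated with a minimal sketch. This is not the benchmark's protocol: the cosine scoring, the fixed threshold, and the names `cosine_similarity` and `identify_open_set` are all assumptions made for illustration, and the embeddings are toy vectors.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify_open_set(
    probe: Sequence[float],
    gallery: Dict[str, Sequence[float]],
    threshold: float,
) -> Optional[str]:
    """Return the best-matching gallery speaker, or None for an unknown probe.

    Assumption: a probe is accepted only if its best cosine score reaches
    `threshold`; the actual benchmark may score and reject differently.
    """
    best_speaker, best_score = None, float("-inf")
    for speaker, embedding in gallery.items():
        score = cosine_similarity(probe, embedding)
        if score > best_score:
            best_speaker, best_score = speaker, score
    return best_speaker if best_score >= threshold else None


# Toy gallery with hypothetical 2-D speaker embeddings.
gallery = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
known = identify_open_set([0.9, 0.1], gallery, threshold=0.8)
unknown = identify_open_set([1.0, 1.0], gallery, threshold=0.8)
```

In this sketch the threshold trades false acceptances of unknown speakers against false rejections of enrolled ones, which is the trade-off an open-set evaluation protocol would need to measure.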

arXiv:2407.10540 [pdf, other]

Sudden polarization angle jumps of the repeating fast radio burst FRB 20201124A

Authors: J. R. Niu, W. Y. Wang, J. C. Jiang, Y. Qu, D. J. Zhou, W. W. Zhu, K. J. Lee, J. L. Han, B. Zhang, D. Li, S. Cao, Z. Y. Fang, Y. Feng, Q. Y. Fu, P. Jiang, W. C. Jing, J. Li, Y. Li, R. Luo, L. Q. Meng, C. C. Miao, X. L. Miao, C. H. Niu, Y. C. Pan, B. J. Wang, et al. (19 additional authors not shown)

Abstract: We report the first detection of polarization angle (PA) orthogonal jumps, a phenomenon previously only observed from radio pulsars, from a fast radio burst (FRB) source FRB 20201124A. We find three cases of orthogonal jumps in over two thousand bursts, all resembling those observed in pulsar single pulses. We propose that the jumps are due to the superposition of two orthogonal emission modes that could only be produced in a highly magnetized plasma, and they are caused by the line of sight sweeping across a rotating magnetosphere. The shortest jump timescale is of the order of one millisecond, which hints that the emission modes come from regions smaller than the light cylinder of most pulsars or magnetars. This discovery provides convincing evidence that FRB emission originates from the complex magnetosphere of a magnetar, suggesting an FRB emission mechanism that is analogous to radio pulsars despite a huge luminosity difference between the two types of objects.

Submitted 15 July, 2024; originally announced July 2024.

Comments: 10 pages, 5 figures, submitted to ApJL

arXiv:2407.08883 [pdf]

TractGraphFormer: Anatomically Informed Hybrid Graph CNN-Transformer Network for Classification from Diffusion MRI Tractography

Authors: Yuqian Chen, Fan Zhang, Meng Wang, Leo R. Zekelman, Suheyla Cetin-Karayumak, Tengfei Xue, Chaoyi Zhang, Yang Song, Nikos Makris, Yogesh Rathi, Weidong Cai, Lauren J. O'Donnell

Abstract: The relationship between brain connections and non-imaging phenotypes is increasingly studied using deep neural networks. However, the local and global properties of the brain's white matter networks are often overlooked in convolutional network design. We introduce TractGraphFormer, a hybrid Graph CNN-Transformer deep learning framework tailored for diffusion MRI tractography. This model leverages local anatomical characteristics and global feature dependencies of white matter structures. The Graph CNN module captures white matter geometry and grey matter connectivity to aggregate local features from anatomically similar white matter connections, while the Transformer module uses self-attention to enhance global information learning. Additionally, TractGraphFormer includes an attention module for interpreting predictive white matter connections. In sex prediction tests, TractGraphFormer shows strong performance in large datasets of children (n=9345) and young adults (n=1065). Overall, our approach suggests that widespread connections in the WM are predictive of the sex of an individual, and consistent predictive anatomical tracts are identified across the two datasets. The proposed approach highlights the potential of integrating local anatomical information and global feature dependencies to improve prediction performance in machine learning with diffusion MRI tractography.

Submitted 11 July, 2024; originally announced July 2024.

Comments: 23 pages, 4 figures

arXiv:2407.08661 [pdf, other]

Self-consistent theory for the fractional quantum anomalous Hall effect in rhombohedral pentalayer graphene

Authors: Ke Huang, Xiao Li, Sankar Das Sarma, Fan Zhang

Abstract: The fractional quantum anomalous Hall (FQAH) effect in rhombohedral pentalayer graphene (PLG) has attracted significant attention due to its potential for observing exotic quantum states. In this work, we present a self-consistent Hartree-Fock theory for the FQAH effect in rhombohedral PLG. In particular, we focus on the convergence of the Hartree-Fock calculation with various reference fields and discuss the stability of the FQAH states in PLG. We show that the so-called charge neutrality scheme provides an unambiguous result for the Hartree-Fock calculation, as it ensures a convergence with respect to the momentum cutoff. Based on the Hartree-Fock band structure, we further carry out exact diagonalization calculations to explore the stability of the FQAH states in PLG. Our work provides an improved and unified (minimal) theoretical framework to understand the FQAH effect in rhombohedral PLG and paves the way for future experimental and theoretical studies.

Submitted 11 July, 2024; originally announced July 2024.

Comments: 18 pages, 12 figures. Comments are welcome

arXiv:2407.08303 [pdf, other]

DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception

Authors: Xiaotong Li, Fan Zhang, Haiwen Diao, Yueze Wang, Xinlong Wang, Ling-Yu Duan

Abstract: Existing Multimodal Large Language Models (MLLMs) increasingly emphasize complex understanding of various visual elements, including multiple objects, text information, and spatial relations. Their development for comprehensive visual perception hinges on the availability of high-quality image-text datasets that offer diverse visual elements and thorough image descriptions. However, the scarcity of such hyper-detailed datasets currently hinders progress within the MLLM community. The bottleneck stems from the limited perceptual capabilities of current caption engines, which fall short in providing complete and accurate annotations. To facilitate the cutting-edge research of MLLMs on comprehensive vision perception, we thereby propose Perceptual Fusion, using a low-budget but highly effective caption engine for complete and accurate image descriptions. Specifically, Perceptual Fusion integrates diverse perception experts as image priors to provide explicit information on visual elements and adopts an efficient MLLM as a centric pivot to mimic advanced MLLMs' perception abilities. We carefully select 1M highly representative images from the uncurated LAION dataset and generate dense descriptions using our engine, dubbed DenseFusion-1M. Extensive experiments validate that our engine outperforms its counterparts, where the resulting dataset significantly improves the perception and cognition abilities of existing MLLMs across diverse vision-language benchmarks, especially with high-resolution images as inputs. The dataset and code are publicly available at https://github.com/baaivision/DenseFusion.

Submitted 11 July, 2024; originally announced July 2024.

arXiv:2407.07334 [pdf, other]

First-order Néel-VBS transition in $S=3/2$ antiferromagnets

Authors: Fan Zhang, Wenan Guo, Ribhu K. Kaul

Abstract: We study the transition between Néel and columnar valence-bond solid ordering in two-dimensional $S=3/2$ square lattice quantum antiferromagnets with SO(3) symmetry. According to the deconfined criticality scenario, this transition can be direct and continuous like the well-studied $S=1/2$ case. To study the global phase diagram, we work with four multi-spin couplings with full rotational symmetry that are free of the sign problem of quantum Monte Carlo. Exploring the phase diagram with quantum Monte Carlo simulations, we find that the phase transition between Néel and valence-bond solid is strongly first-order in the parts of the phase diagram that we have accessed.

Submitted 9 July, 2024; originally announced July 2024.

Comments: 11 pages, 16 figures

arXiv:2407.07318 [pdf]

Serial coherent diffraction imaging of dynamic samples based on inter-frame constraint

Authors: Pengju Sheng, Fucai Zhang

Abstract: We proposed a novel approach to coherent imaging of dynamic samples. The inter-frame similarity of the sample's local structures is found to be a powerful constraint in phasing a sequence of diffraction patterns. We devised a new image reconstruction algorithm that exploits this inter-frame constraint enabled by an adaptive similar region determination approach. We demonstrated the feasibility of this technique in visible light experiments with various real samples, achieving reconstructions of good quality within a few hundred iterations. With a setup as simple as conventional coherent diffraction imaging but with much-improved convergence and robustness to missing data and noise, our method is expected to enrich X-ray imaging techniques and electron microscopy, offering a new tool for dynamics studies.

Submitted 9 July, 2024; originally announced July 2024.

arXiv:2407.05749 [pdf, other]

LDGCN: An Edge-End Lightweight Dual GCN Based on Single-Channel EEG for Driver Drowsiness Monitoring

Authors: Jingwei Huang, Chuansheng Wang, Jiayan Huang, Haoyi Fan, Antoni Grau, Fuquan Zhang

Abstract: Driver drowsiness electroencephalography (EEG) signal monitoring can alert drivers to their drowsiness status in a timely manner, thereby reducing the probability of traffic accidents. Graph convolutional networks (GCNs) have shown significant advancements in processing the non-stationary, time-varying, and non-Euclidean nature of EEG signals. However, the existing single-channel EEG adjacency graph construction process lacks interpretability, which hinders the ability of GCNs to effectively extract adjacency graph features, thus affecting the performance of drowsiness monitoring. To address this issue, we propose an edge-end lightweight dual graph convolutional network (LDGCN). Specifically, we are the first to incorporate neurophysiological knowledge to design a Baseline Drowsiness Status Adjacency Graph (BDSAG), which characterizes driver drowsiness status. Additionally, to express more features within limited EEG data, we introduce the Augmented Graph-level Module (AGM). This module captures global and local information at the graph level, ensuring that BDSAG features remain intact while enhancing effective feature expression capability. Furthermore, to deploy our method on the fourth-generation Raspberry Pi, we utilize Adaptive Pruning Optimization (APO) on both channels and neurons, reducing inference latency by almost half. Experiments on benchmark datasets demonstrate that LDGCN offers the best trade-off between monitoring performance and hardware resource utilization compared to existing state-of-the-art algorithms. All our source code can be found at https://github.com/BryantDom/Driver-Drowsiness-Monitoring.

Submitted 8 July, 2024; originally announced July 2024.

arXiv:2407.04396 [pdf, other]

Graph-Guided Test-Time Adaptation for Glaucoma Diagnosis using Fundus Photography

Authors: Qian Zeng, Le Zhang, Yipeng Liu, Ce Zhu, Fan Zhang

Abstract: Glaucoma is a leading cause of irreversible blindness worldwide. While deep learning approaches using fundus images have largely improved early diagnosis of glaucoma, variations in images from different devices and locations (known as domain shifts) challenge the use of pre-trained models in real-world settings. To address this, we propose a novel Graph-guided Test-Time Adaptation (GTTA) framework to generalize glaucoma diagnosis models to unseen test environments. GTTA integrates the topological information of fundus images into the model training, enhancing the model's transferability and reducing the risk of learning spurious correlations. During inference, GTTA introduces a novel test-time training objective to make the source-trained classifier progressively adapt to target patterns with reliable class conditional estimation and consistency regularization. Experiments on cross-domain glaucoma diagnosis benchmarks demonstrate the superiority of the overall framework and individual components under different backbone networks.

Submitted 9 July, 2024; v1 submitted 5 July, 2024; originally announced July 2024.

Comments: 11 pages, 3 figures, 3 tables, submitted to MICCAI

arXiv:2407.03964 [pdf, other]

Improving Sample Efficiency of Reinforcement Learning with Background Knowledge from Large Language Models

Authors: Fuxiang Zhang, Junyou Li, Yi-Chen Li, Zongzhang Zhang, Yang Yu, Deheng Ye

Abstract: Low sample efficiency is an enduring challenge of reinforcement learning (RL). With the advent of versatile large language models (LLMs), recent works impart common-sense knowledge to accelerate policy learning for RL processes. However, we note that such guidance is often tailored for one specific task but loses generalizability. In this paper, we introduce a framework that harnesses LLMs to extract background knowledge of an environment, which contains general understandings of the entire environment, making various downstream RL tasks benefit from one-time knowledge representation. We ground LLMs by feeding a few pre-collected experiences and requesting them to delineate background knowledge of the environment. Afterward, we represent the output knowledge as potential functions for potential-based reward shaping, which has a good property for maintaining policy optimality from task rewards. We instantiate three variants to prompt LLMs for background knowledge, including writing code, annotating preferences, and assigning goals. Our experiments show that these methods achieve significant sample efficiency improvements in a spectrum of downstream tasks from Minigrid and Crafter domains.

Submitted 4 July, 2024; originally announced July 2024.
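
The potential-based reward shaping mentioned in the abstract above has a standard form: the shaped reward adds gamma * Phi(s') - Phi(s) to the task reward, where Phi is a potential function over states. A minimal sketch follows. The grid world, the goal location `GOAL`, and the hand-written `goal_potential` are hypothetical stand-ins; in the paper the potentials come from LLM-derived background knowledge, which is not reproduced here.

```python
from typing import Callable, Tuple

State = Tuple[int, int]  # hypothetical grid position (x, y)


def shaped_reward(
    task_reward: float,
    state: State,
    next_state: State,
    potential: Callable[[State], float],
    gamma: float = 0.99,
) -> float:
    """Task reward plus the shaping term gamma * Phi(s') - Phi(s)."""
    return task_reward + gamma * potential(next_state) - potential(state)


GOAL: State = (3, 3)  # assumed goal cell, for illustration only


def goal_potential(state: State) -> float:
    """Hand-written potential: negative Manhattan distance to GOAL."""
    x, y = state
    return -float(abs(GOAL[0] - x) + abs(GOAL[1] - y))


# One step toward the goal earns a positive shaping bonus.
bonus = shaped_reward(0.0, (0, 0), (1, 0), goal_potential, gamma=1.0)
```

With gamma = 1 the shaping terms telescope along any trajectory to Phi(end) - Phi(start), so they depend only on the endpoints. This is the intuition behind the policy-optimality property the abstract refers to.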

arXiv:2407.03856 [pdf, other]

Q-Adapter: Training Your LLM Adapter as a Residual Q-Function

Authors: Yi-Chen Li, Fuxiang Zhang, Wenjie Qiu, Lei Yuan, Chengxing Jia, Zongzhang Zhang, Yang Yu

Abstract: We consider the problem of adapting Large Language Models (LLMs) pre-trained with Reinforcement Learning from Human Feedback (RLHF) to downstream preference data. Naive approaches to achieve this could be supervised fine-tuning on preferred responses or reinforcement learning with a learned reward model. However, the LLM runs the risk of forgetting its initial knowledge as the fine-tuning progresses. To customize the LLM while preserving its existing capabilities, this paper proposes a novel method, named Q-Adapter. We start by formalizing LLM adaptation as a problem of maximizing the linear combination of two rewards, one of which corresponds to the reward optimized by the pre-trained LLM and the other to the downstream preference data. Although both rewards are unknown, we show that this can be solved by directly learning a new module from the preference data that approximates the residual Q-function. We consider this module to be an adapter because the original pre-trained LLM, together with it, can form the optimal customised LLM. Empirically, experiments on a range of domain-specific tasks and safety alignment tasks illustrate the superiority of Q-Adapter in both anti-forgetting and learning from new preferences.

Submitted 4 July, 2024; originally announced July 2024.

arXiv:2407.03560 [pdf, ps, other]

Numerical semigroups from rational matrices I: power-integral matrices and nilpotent representations

Authors: Arsh Chhabra, Stephan Ramon Garcia, Fangqian Zhang, Hechun Zhang

Abstract: Our aim in this paper is to initiate the study of exponent semigroups for rational matrices. We prove that every numerical semigroup is the exponent semigroup of some rational matrix. We also obtain lower bounds on the size of such matrices and discuss the related class of power-integral matrices.

Submitted 3 July, 2024; originally announced July 2024.

Comments: 13 pages

arXiv:2407.02880 [pdf, other]

Knowledge Composition using Task Vectors with Learned Anisotropic Scaling

Authors: Frederic Z. Zhang, Paul Albert, Cristian Rodriguez-Opazo, Anton van den Hengel, Ehsan Abbasnejad

Abstract: Pre-trained models produce strong generic representations that can be adapted via fine-tuning. The learned weight difference relative to the pre-trained model, known as a task vector, characterises the direction and stride of fine-tuning. The significance of task vectors is such that simple arithmetic operations on them can be used to combine diverse representations from different domains. This paper builds on these properties of task vectors and aims to answer (1) whether components of task vectors, particularly parameter blocks, exhibit similar characteristics, and (2) how such blocks can be used to enhance knowledge composition and transfer. To this end, we introduce aTLAS, an algorithm that linearly combines parameter blocks with different learned coefficients, resulting in anisotropic scaling at the task vector level. We show that such linear combinations explicitly exploit the low intrinsic dimensionality of pre-trained models, with only a few coefficients being the learnable parameters. Furthermore, composition of parameter blocks leverages the already learned representations, thereby reducing the dependency on large amounts of data. We demonstrate the effectiveness of our method in task arithmetic, few-shot recognition and test-time adaptation, with supervised or unsupervised objectives. In particular, we show that (1) learned anisotropic scaling allows task vectors to be more disentangled, causing less interference in composition; (2) task vector composition excels with scarce or no labeled data and is less prone to domain shift, thus leading to better generalisability; (3) mixing the most informative parameter blocks across different task vectors prior to training can reduce the memory footprint and improve the flexibility of knowledge transfer. Moreover, we show the potential of aTLAS as a PEFT method, particularly with less data, and demonstrate its scalability.

Submitted 3 July, 2024; originally announced July 2024.
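
The abstract above defines a task vector as the weight difference between a fine-tuned and a pre-trained model, and describes combining parameter blocks with a separate coefficient per block. A minimal sketch of that arithmetic follows. It is not the aTLAS implementation: the coefficients are fixed by hand rather than learned, and the dict-of-lists weight layout and the names `task_vector` and `compose` are assumptions made for illustration.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # block name -> flattened parameters


def task_vector(pretrained: Weights, finetuned: Weights) -> Weights:
    """Per-block weight difference: finetuned minus pretrained."""
    return {
        name: [f - p for p, f in zip(pretrained[name], finetuned[name])]
        for name in pretrained
    }


def compose(
    pretrained: Weights,
    task_vectors: List[Weights],
    coefficients: List[Dict[str, float]],
) -> Weights:
    """Add each task vector to the pretrained weights, scaling every
    parameter block by its own coefficient (anisotropic scaling).

    Assumption: coefficients[i][name] scales block `name` of task vector i.
    In the paper these coefficients are learned; here they are given.
    """
    merged = {name: list(values) for name, values in pretrained.items()}
    for vector, block_scales in zip(task_vectors, coefficients):
        for name, delta in vector.items():
            scale = block_scales[name]
            merged[name] = [w + scale * d for w, d in zip(merged[name], delta)]
    return merged


# Toy model with two parameter blocks.
pretrained = {"encoder": [1.0, 2.0], "head": [0.0]}
finetuned = {"encoder": [2.0, 4.0], "head": [1.0]}
tau = task_vector(pretrained, finetuned)
merged = compose(pretrained, [tau], [{"encoder": 0.5, "head": 2.0}])
```

Using one shared coefficient for every block would recover ordinary (isotropic) task arithmetic. The per-block dictionary is what makes the scaling anisotropic in this sketch.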

arXiv:2407.02675 [pdf, other]

Depth-Aware Endoscopic Video Inpainting

Authors: Francis Xiatian Zhang, Shuang Chen, Xianghua Xie, Hubert P. H. Shum

Abstract: Video inpainting fills in corrupted video content with plausible replacements. While recent advances in endoscopic video inpainting have shown potential for enhancing the quality of endoscopic videos, they mainly repair 2D visual information without effectively preserving crucial 3D spatial details for clinical reference. Depth-aware inpainting methods attempt to preserve these details by incorporating depth information. Still, in endoscopic contexts, they face challenges including reliance on pre-acquired depth maps, less effective fusion designs, and ignorance of the fidelity of 3D spatial details. To address them, we introduce a novel Depth-aware Endoscopic Video Inpainting (DAEVI) framework. It features a Spatial-Temporal Guided Depth Estimation module for direct depth estimation from visual features, a Bi-Modal Paired Channel Fusion module for effective channel-by-channel fusion of visual and depth information, and a Depth Enhanced Discriminator to assess the fidelity of the RGB-D sequence comprised of the inpainted frames and estimated depth images. Experimental evaluations on established benchmarks demonstrate our framework's superiority, achieving a 2% improvement in PSNR and a 6% reduction in MSE compared to state-of-the-art methods. Qualitative analyses further validate its enhanced ability to inpaint fine details, highlighting the benefits of integrating depth information into endoscopic inpainting.

Submitted 2 July, 2024; originally announced July 2024.

Comments: Accepted by MICCAI 2024

arXiv:2407.01461 [pdf, other]

Enhancing the Capability and Robustness of Large Language Models through Reinforcement Learning-Driven Query Refinement

Authors: Zisu Huang, Xiaohua Wang, Feiran Zhang, Zhibo Xu, Cenyuan Zhang, Xiaoqing Zheng, Xuanjing Huang

Abstract: The capacity of large language models (LLMs) to generate honest, harmless, and helpful responses heavily relies on the quality of user prompts. However, these prompts often tend to be brief and vague, thereby significantly limiting the full potential of LLMs. Moreover, harmful prompts can be meticulously crafted and manipulated by adversaries to jailbreak LLMs, inducing them to produce potentially toxic content. To enhance the capabilities of LLMs while maintaining strong robustness against harmful jailbreak inputs, this study proposes a transferable and pluggable framework that refines user prompts before they are input into LLMs. This strategy improves the quality of the queries, empowering LLMs to generate more truthful, benign and useful responses. Specifically, a lightweight query refinement model is introduced and trained using a specially designed reinforcement learning approach that incorporates multiple objectives to enhance particular capabilities of LLMs. Extensive experiments demonstrate that the refinement model not only improves the quality of responses but also strengthens their robustness against jailbreak attacks. Code is available at: https://github.com/Huangzisu/query-refinement.

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2407.01230 [pdf, other]

DaBiT: Depth and Blur informed Transformer for Joint Refocusing and Super-Resolution

Authors: Crispian Morris, Nantheera Anantrasirichai, Fan Zhang, David Bull

Abstract: In many real-world scenarios, recorded videos suffer from accidental focus blur, and while video deblurring methods exist, most specifically target motion blur. This paper introduces a framework optimised for the joint task of focal deblurring (refocusing) and video super-resolution (VSR). The proposed method employs novel map-guided transformers, in addition to image propagation, to effectively leverage the continuous spatial variance of focal blur and restore the footage. We also introduce a flow re-focusing module to efficiently align relevant features between the blurry and sharp domains. Additionally, we propose a novel technique for generating synthetic focal blur data, broadening the model's learning capabilities to include a wider array of content. We have made a new benchmark dataset, DAVIS-Blur, available. This dataset, a modified extension of the popular DAVIS video segmentation set, provides realistic out-of-focus blur degradations as well as the corresponding blur maps. Comprehensive experiments on DAVIS-Blur demonstrate the superiority of our approach. We achieve state-of-the-art results with an average PSNR performance over 1.9dB greater than comparable existing video restoration methods. Our source code will be made available at https://github.com/crispianm/DaBiT

Submitted 10 July, 2024; v1 submitted 1 July, 2024; originally announced July 2024.

arXiv:2407.01219 [pdf, other]

Searching for Best Practices in Retrieval-Augmented Generation

Authors: Xiaohua Wang, Zhenghua Wang, Xuan Gao, Feiran Zhang, Yixin Wu, Zhibo Xu, Tianyuan Shi, Zhengyuan Wang, Shizheng Li, Qi Qian, Ruicheng Yin, Changze Lv, Xiaoqing Zheng, Xuanjing Huang

Abstract: Retrieval-augmented generation (RAG) techniques have proven to be effective in integrating up-to-date information, mitigating hallucinations, and enhancing response quality, particularly in specialized domains. While many RAG approaches have been proposed to enhance large language models through query-dependent retrievals, these approaches still suffer from their complex implementation and prolonged response times. Typically, a RAG workflow involves multiple processing steps, each of which can be executed in various ways. Here, we investigate existing RAG approaches and their potential combinations to identify optimal RAG practices. Through extensive experiments, we suggest several strategies for deploying RAG that balance both performance and efficiency. Moreover, we demonstrate that multimodal retrieval techniques can significantly enhance question-answering capabilities about visual inputs and accelerate the generation of multimodal content using a "retrieval as generation" strategy.

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2407.00136 [pdf, other]

Observation of the Electromagnetic Dalitz Transition $h_c \rightarrow e^+e^-η_c$

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, S. Ahmed, M. Albrecht, R. Aliberti, A. Amoroso, M. R. An, Q. An, X. H. Bai, Y. Bai, O. Bakina, R. Baldini Ferroli, I. Balossino, Y. Ban, K. Begzsuren, N. Berger, M. Bertani, D. Bettoni, F. Bianchi, J. Bloms, A. Bortone, I. Boyko, R. A. Briere , et al. (495 additional authors not shown)

Abstract: Using $(27.12\pm 0.14)\times10^8$ $ψ(3686)$ decays and data samples of $e^+e^-$ collisions with $\sqrt{s}$ from 4.130 to 4.780~GeV collected with the BESIII detector, we report the first observation of the electromagnetic Dalitz transition $h_c\to e^+e^-η_c$ with a statistical significance of $5.4σ$. We measure the ratio of the branching fractions $\frac{\mathcal{B}(h_c\rightarrow e^+e^-η_c)}{\mathcal{B}(h_c\rightarrow γη_c)}$ separately for the $h_c$ samples produced via $ψ(3686)\toπ^0h_c$ and $e^+e^-\toπ^+π^-h_c$. The average ratio is determined to be $(0.59\pm0.10(\text{stat.})\pm0.04(\text{syst.}))\%$, where the uncertainty includes both statistical and systematic components.

Submitted 2 July, 2024; v1 submitted 28 June, 2024; originally announced July 2024.

arXiv:2406.20036 [pdf]

Direct observation of layer skyrmions in twisted WSe2 bilayers

Authors: Fan Zhang, Nicolás Morales-Durán, Yanxing Li, Wang Yao, Jung-Jung Su, Yu-Chuan Lin, Chengye Dong, Hyunsue Kim, Joshua A. Robinson, Allan H. Macdonald, Chih-Kang Shih

Abstract: Transition metal dichalcogenide (TMD) twisted homobilayers have been established as an ideal platform for studying strong correlation phenomena, as exemplified by the recent discovery of fractional Chern insulator (FCI) states in twisted MoTe2 and Chern insulators (CI) and unconventional superconductivity in twisted WSe2. In these systems, nontrivial topology in the strongly layer-hybridized regime can arise from a spatial patterning of interlayer tunneling amplitudes and layer-dependent potentials that yields a lattice of layer skyrmions. Here we report the direct observation of skyrmion textures in the layer degree of freedom of Rhombohedral-stacked (R-stacked) twisted WSe2 homobilayers. This observation is based on scanning tunneling spectroscopy that separately resolves the Γ-valley and K-valley moiré electronic states. We show that Γ-valley states are subjected to a moiré potential with an amplitude of ~ 120 meV. At ~150 meV above the Γ-valley, the K-valley states are subjected to a weaker moiré potential of ~30 meV. Most significantly, we reveal opposite layer polarization of the K-valley at the MX and XM sites within the moiré unit cell, confirming the theoretically predicted skyrmion layer-texture. The dI/dV mappings allow the parameters that enter the continuum model for the description of moiré bands in twisted TMD bilayers to be determined experimentally, further establishing a direct correlation between the shape of LDOS profile in real space and topology of topmost moiré band.

Submitted 28 June, 2024; originally announced June 2024.

arXiv:2406.19240 [pdf, other]

Data Preparation for Deep Learning based Code Smell Detection: A Systematic Literature Review

Authors: Fengji Zhang, Zexian Zhang, Jacky Wai Keung, Xiangru Tang, Zhen Yang, Xiao Yu, Wenhua Hu

Abstract: Code Smell Detection (CSD) plays a crucial role in improving software quality and maintainability. And Deep Learning (DL) techniques have emerged as a promising approach for CSD due to their superior performance. However, the effectiveness of DL-based CSD methods heavily relies on the quality of the training data. Despite its importance, little attention has been paid to analyzing the data preparation process. This systematic literature review analyzes the data preparation techniques used in DL-based CSD methods. We identify 36 relevant papers published by December 2023 and provide a thorough analysis of the critical considerations in constructing CSD datasets, including data requirements, collection, labeling, and cleaning. We also summarize seven primary challenges and corresponding solutions in the literature. Finally, we offer actionable recommendations for preparing and accessing high-quality CSD data, emphasizing the importance of data diversity, standardization, and accessibility. This survey provides valuable insights for researchers and practitioners to harness the full potential of DL techniques in CSD.

Submitted 27 June, 2024; originally announced June 2024.

arXiv:2406.17694 [pdf, other]

Protecting the 'Stop Using My Data' Right through Blockchain-assisted Evidence Generation

Authors: Fan Zhang, Peng Liu

Abstract: In order to provide personalized services to users, Internet-based platforms collect and utilize user-generated behavioral data. Although the 'stop using my data' right should be a fundamental data right, which allows individuals to request their personal data to be no longer utilized by online platforms, the existing preventive data protection measures (e.g., cryptographic data elimination, differential privacy) are unfortunately not applicable. This work aims to develop the first Evidence Generation Framework for deterring post-acquisition data right violations. We formulated the 'stop using my data' problem, which captures a vantage facet of the multi-faceted notion of 'right to be forgotten'. We designed and implemented the first blockchain-assisted system to generate evidence for deterring the violations of the 'stop using my data' right. Our system employs a novel two-stage evidence generation protocol whose efficacy is ensured by a newly proposed Lemma. To validate our framework, we conducted a case study on recommendation systems with systematic evaluation experiments using two real-world datasets: the measured success rate exceeds 99%.

Submitted 29 June, 2024; v1 submitted 25 June, 2024; originally announced June 2024.

arXiv:2406.17247 [pdf, ps, other]

doi 10.1142/S0217732324500305

Einstein-Podolsky-Rosen steering paradox "2=1'' for $N$ qubits

Authors: Zhi-Jie Liu, Jie Zhou, Hui-Xian Meng, Xing-Yan Fan, Mi Xie, Fu-lin Zhang, Jing-Ling Chen

Abstract: Einstein-Podolsky-Rosen (EPR) paradox highlights the absence of a local realistic explanation for quantum mechanics, and shows the incompatibility of the local-hidden-state models with quantum theory. For $N$-qubit states, or more importantly, the $N$-qubit mixed states, we present the EPR steering paradox in the form of the contradictory equality "2=1". We show that the contradiction holds for any $N$-qubit state as long as both the pure state requirement and the measurement requirement are satisfied. This also indicates that the EPR steering paradox exists in more general cases. Finally, we give specific examples to demonstrate and analyze our arguments.

Submitted 24 June, 2024; originally announced June 2024.

Comments: 12 pages, 0 figure

Journal ref: Modern Physics Letters A Vol. 39, No. 9, 2450030 (2024)

arXiv:2406.16486 [pdf, other]

Towards Comprehensive Preference Data Collection for Reward Modeling

Authors: Yulan Hu, Qingyang Li, Sheng Ouyang, Ge Chen, Kaihui Chen, Lijun Mei, Xucheng Ye, Fuzheng Zhang, Yong Liu

Abstract: Reinforcement Learning from Human Feedback (RLHF) facilitates the alignment of large language models (LLMs) with human preferences, thereby enhancing the quality of responses generated. A critical component of RLHF is the reward model, which is trained on preference data and outputs a scalar reward during the inference stage. However, the collection of preference data still lacks thorough investigation. Recent studies indicate that preference data is collected either by AI or humans, where chosen and rejected instances are identified among pairwise responses. We question whether this process effectively filters out noise and ensures sufficient diversity in collected data. To address these concerns, for the first time, we propose a comprehensive framework for preference data collection, decomposing the process into four incremental steps: Prompt Generation, Response Generation, Response Filtering, and Human Labeling. This structured approach ensures the collection of high-quality preferences while reducing reliance on human labor. We conducted comprehensive experiments based on the data collected at different stages, demonstrating the effectiveness of the proposed data collection method.

Submitted 24 June, 2024; originally announced June 2024.

arXiv:2406.16062 [pdf, other]

Towards Biologically Plausible Computing: A Comprehensive Comparison

Authors: Changze Lv, Yufei Gu, Zhengkang Guo, Zhibo Xu, Yixin Wu, Feiran Zhang, Tianyuan Shi, Zhenghua Wang, Ruicheng Yin, Yu Shang, Siqi Zhong, Xiaohua Wang, Muling Wu, Wenhao Liu, Tianlong Li, Jianhao Zhu, Cenyuan Zhang, Zixuan Ling, Xiaoqing Zheng

Abstract: Backpropagation is a cornerstone algorithm in training neural networks for supervised learning, which uses a gradient descent method to update network weights by minimizing the discrepancy between actual and desired outputs. Despite its pivotal role in propelling deep learning advancements, the biological plausibility of backpropagation is questioned due to its requirements for weight symmetry, global error computation, and dual-phase training. To address this long-standing challenge, many studies have endeavored to devise biologically plausible training algorithms. However, a fully biologically plausible algorithm for training multilayer neural networks remains elusive, and interpretations of biological plausibility vary among researchers. In this study, we establish criteria for biological plausibility that a desirable learning algorithm should meet. Using these criteria, we evaluate a range of existing algorithms considered to be biologically plausible, including Hebbian learning, spike-timing-dependent plasticity, feedback alignment, target propagation, predictive coding, forward-forward algorithm, perturbation learning, local losses, and energy-based learning. Additionally, we empirically evaluate these algorithms across diverse network architectures and datasets. We compare the feature representations learned by these algorithms with brain activity recorded by non-invasive devices under identical stimuli, aiming to identify which algorithm can most accurately replicate brain activity patterns. We are hopeful that this study could inspire the development of new biologically plausible algorithms for training multilayer networks, thereby fostering progress in both the fields of neuroscience and machine learning.

Submitted 23 June, 2024; originally announced June 2024.

arXiv:2406.15879 [pdf]

Robust Ptychographic Reconstruction with an Out-of-Focus Electron Probe

Authors: Shoucong Ning, Wenhui Xu, Pengju Sheng, Leyi Loh, Stephen Pennycook, Fucai Zhang, Michel Bosman, Qian He

Abstract: As a burgeoning technique, out-of-focus electron ptychography offers the potential for rapidly imaging atomic-scale large fields of view (FoV) using a single diffraction dataset. However, achieving robust out-of-focus ptychographic reconstruction poses a significant challenge due to the inherent scan instabilities of electron microscopes, compounded by the presence of unknown aberrations in the probe-forming lens. In this study, we substantially enhance the robustness of out-of-focus ptychographic reconstruction by extending our previous calibration method (the Fourier method), which was originally developed for the in-focus scenario. This extended Fourier method surpasses existing calibration techniques by providing more reliable and accurate initialization of scan positions and electron probes. Additionally, we comprehensively explore and recommend optimized experimental parameters for robust out-of-focus ptychography, including aperture size and defocus, through extensive simulations. Lastly, we conduct a comprehensive comparison between ptychographic reconstructions obtained with focused and defocused electron probes, particularly in the context of low-dose and precise phase imaging, utilizing our calibration method as the basis for evaluation.

Submitted 22 June, 2024; originally announced June 2024.

Comments: 22 pages, 6 figures

arXiv:2406.13597 [pdf, other]

GraphKAN: Enhancing Feature Extraction with Graph Kolmogorov Arnold Networks

Authors: Fan Zhang, Xin Zhang

Abstract: Massive number of applications involve data with underlying relationships embedded in non-Euclidean space. Graph neural networks (GNNs) are utilized to extract features by capturing the dependencies within graphs. Despite groundbreaking performances, we argue that Multi-layer perceptrons (MLPs) and fixed activation functions impede the feature extraction due to information loss. Inspired by Kolmogorov Arnold Networks (KANs), we make the first attempt to GNNs with KANs. We discard MLPs and activation functions, and instead used KANs for feature extraction. Experiments demonstrate the effectiveness of GraphKAN, emphasizing the potential of KANs as a powerful tool. Code is available at https://github.com/Ryanfzhang/GraphKan.

Submitted 19 June, 2024; originally announced June 2024.

arXiv:2406.13268 [pdf, other]

CEC: A Noisy Label Detection Method for Speaker Recognition

Authors: Yao Shen, Yingying Gao, Yaqian Hao, Chenguang Hu, Fulin Zhang, Junlan Feng, Shilei Zhang

Abstract: Noisy labels are inevitable, even in well-annotated datasets. The detection of noisy labels is of significant importance to enhance the robustness of speaker recognition models. In this paper, we propose a novel noisy label detection approach based on two new statistical metrics: Continuous Inconsistent Counting (CIC) and Total Inconsistent Counting (TIC). These metrics are calculated through Cross-Epoch Counting (CEC) and correspond to the early and late stages of training, respectively. Additionally, we categorize samples based on their prediction results into three categories: inconsistent samples, hard samples, and easy samples. During training, we gradually increase the difficulty of hard samples to update model parameters, preventing noisy labels from being overfitted. Compared to contrastive schemes, our approach not only achieves the best performance in speaker verification but also excels in noisy label detection.

Submitted 19 June, 2024; originally announced June 2024.

Comments: interspeech 2024

arXiv:2406.13007 [pdf, other]

NTIRE 2024 Challenge on Night Photography Rendering

Authors: Egor Ershov, Artyom Panshin, Oleg Karasev, Sergey Korchagin, Shepelev Lev, Alexandr Startsev, Daniil Vladimirov, Ekaterina Zaychenkova, Nikola Banić, Dmitrii Iarchuk, Maria Efimova, Radu Timofte, Arseniy Terekhin, Shuwei Yue, Yuyang Liu, Minchen Wei, Lu Xu, Chao Zhang, Yasi Wang, Furkan Kınlı, Doğa Yılmaz, Barış Özcan, Furkan Kıraç, Shuai Liu, Jingyuan Xiao , et al. (25 additional authors not shown)

Abstract: This paper presents a review of the NTIRE 2024 challenge on night photography rendering. The goal of the challenge was to find solutions that process raw camera images taken in nighttime conditions, and thereby produce photo-quality output images in the standard RGB (sRGB) space. Unlike the previous year's competition, the challenge images were collected with a mobile phone and the speed of algorithms was also measured alongside the quality of their output. To evaluate the results, a sufficient number of viewers were asked to assess the visual quality of the proposed solutions, considering the subjective nature of the task. There were 2 nominations: quality and efficiency. Top 5 solutions in terms of output quality were sorted by evaluation time (see Fig. 1). The top ranking participants' solutions effectively represent the state-of-the-art in nighttime photography rendering. More results can be found at https://nightimaging.org.

Submitted 18 June, 2024; originally announced June 2024.

Comments: 10 pages, 10 figures

arXiv:2406.12520 [pdf, other]

On the analysis of two-time correlation functions: equilibrium vs non-equilibrium systems

Authors: Anastasia Ragulskaya, Vladimir Starostin, Fajun Zhang, Christian Gutt, Frank Schreiber

Abstract: X-ray photon correlation spectroscopy (XPCS) is a powerful tool for the investigation of dynamics covering a broad range of time and length scales. The two-time correlation function (TTC) is commonly used to track non-equilibrium dynamical evolution in XPCS measurements, followed by the extraction of one-time correlations. While the theoretical foundation for the quantitative analysis of TTCs is primarily established for equilibrium systems, where key parameters such as diffusion remain constant, non-equilibrium systems pose a unique challenge. In such systems, different projections ("cuts") of the TTC may lead to divergent results if the underlying fundamental parameters themselves are subject to temporal variations. This article explores widely used approaches for TTC calculations and common methods for extracting relevant information from correlation functions on case studies, particularly in the light of comparing dynamics in equilibrium and non-equilibrium systems.

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.11653 [pdf, other]

Communication-Efficient MARL for Platoon Stability and Energy-efficiency Co-optimization in Cooperative Adaptive Cruise Control of CAVs

Authors: Min Hua, Dong Chen, Kun Jiang, Fanggang Zhang, Jinhai Wang, Bo Wang, Quan Zhou, Hongming Xu

Abstract: Cooperative adaptive cruise control (CACC) has been recognized as a fundamental function of autonomous driving, in which platoon stability and energy efficiency are outstanding challenges that are difficult to accommodate in real-world operations. This paper studied the CACC of connected and autonomous vehicles (CAVs) based on the multi-agent reinforcement learning algorithm (MARL) to optimize platoon stability and energy efficiency simultaneously. The optimal use of communication bandwidth is the key to guaranteeing learning performance in real-world driving, and thus this paper proposes a communication-efficient MARL by incorporating the quantified stochastic gradient descent (QSGD) and a binary differential consensus (BDC) method into a fully-decentralized MARL framework. We benchmarked the performance of our proposed BDC-MARL algorithm against several non-communicative and communicative MARL algorithms, e.g., IA2C, FPrint, and DIAL, through the evaluation of platoon stability, fuel economy, and driving comfort. Our results show that BDC-MARL achieved the highest energy savings, improving by up to 5.8%, with an average velocity of 15.26 m/s and an inter-vehicle spacing of 20.76 m. In addition, we conducted different information-sharing analyses to assess communication efficacy, along with sensitivity analyses and scalability tests with varying platoon sizes. The practical effectiveness of our approach is further demonstrated using real-world scenarios sourced from open-sourced OpenACC.

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.11512 [pdf, ps, other]

Asymptotic Behaviors of Moduli of One-dimensional Sheaves on Surfaces

Authors: Fei Si, Feinuo Zhang

Abstract: In this paper, we study the asymptotic behaviors of the Betti numbers and Picard numbers of the moduli space $M_{β,χ}$ of one-dimensional sheaves supported in a curve class $β$ on $S$ with Euler characteristic $χ$. We determine the intersection cohomology Betti numbers of $M_{β,χ}$ when $S$ is a del Pezzo surface and $β$ is sufficiently positive. As an application, we formulate a $P = C$ conjecture regarding the refined BPS invariants for local del Pezzo surfaces.

Submitted 17 June, 2024; originally announced June 2024.

Comments: 28 pages, comments are very welcome!

arXiv:2406.11277 [pdf, other]

Small Agent Can Also Rock! Empowering Small Language Models as Hallucination Detector

Authors: Xiaoxue Cheng, Junyi Li, Wayne Xin Zhao, Hongzhi Zhang, Fuzheng Zhang, Di Zhang, Kun Gai, Ji-Rong Wen

Abstract: Hallucination detection is a challenging task for large language models (LLMs), and existing studies heavily rely on powerful closed-source LLMs such as GPT-4. In this paper, we propose an autonomous LLM-based agent framework, called HaluAgent, which enables relatively smaller LLMs (e.g. Baichuan2-Chat 7B) to actively select suitable tools for detecting multiple hallucination types such as text, code, and mathematical expression. In HaluAgent, we integrate the LLM, multi-functional toolbox, and design a fine-grained three-stage detection framework along with memory mechanism. To facilitate the effectiveness of HaluAgent, we leverage existing Chinese and English datasets to synthesize detection trajectories for fine-tuning, which endows HaluAgent with the capability for bilingual hallucination detection. Extensive experiments demonstrate that only using 2K samples for tuning LLMs, HaluAgent can perform hallucination detection on various types of tasks and datasets, achieving performance comparable to or even higher than GPT-4 without tool enhancements on both in-domain and out-of-domain datasets. We release our dataset and code at https://github.com/RUCAIBox/HaluAgent.

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.10810 [pdf, other]

RGBlimp-Q: Robotic Gliding Blimp With Moving Mass Control Based on a Bird-Inspired Continuum Arm

Authors: Hao Cheng, Feitian Zhang

Abstract: Robotic blimps, as lighter-than-air aerial systems, offer prolonged duration and enhanced safety in human-robot interactions due to their buoyant lift. However, robust flight against environmental airflow disturbances remains a significant challenge, limiting the broader application of these robots. Drawing inspiration from the flight mechanics of birds and their ability to perch against natural wind, this article introduces RGBlimp-Q, a robotic gliding blimp equipped with a bird-inspired continuum arm. This arm allows for flexible attitude adjustments through moving mass control to enhance disturbance resilience, while also enabling object capture by using claws to counteract environmental disturbances, similar to a bird. This article presents the design, modeling, and prototyping of RGBlimp-Q, thus extending the advantages of robotic blimps to more complex environments. To the best of the authors' knowledge, this is the first interdisciplinary design integrating continuum mechanisms onto robotic blimps. Experimental results from both indoor and outdoor settings validate the improved flight robustness against environmental disturbances offered by this novel design.

Submitted 16 June, 2024; originally announced June 2024.

arXiv:2406.10558 [pdf, other]

A Hybrid Controller Design for Human-Assistive Piloting of an Underactuated Blimp

Authors: Wugang Meng, Tianfu Wu, Qiuyang Tao, Fumin Zhang

Abstract: This paper introduces a novel solution to the manual control challenge for indoor blimps. The problem's complexity arises from the conflicting demands of executing human commands while maintaining stability through automatic control for underactuated robots. To tackle this challenge, we introduced an assisted piloting hybrid controller with a preemptive mechanism, that seamlessly switches between executing human commands and activating automatic stabilization control. Our algorithm ensures that the automatic stabilization controller operates within the time delay between human observation and perception, providing assistance to the driver in a way that remains imperceptible.

Submitted 15 June, 2024; originally announced June 2024.

arXiv:2406.09598 [pdf, other]

Introducing HOT3D: An Egocentric Dataset for 3D Hand and Object Tracking

Authors: Prithviraj Banerjee, Sindi Shkodrani, Pierre Moulon, Shreyas Hampali, Fan Zhang, Jade Fountain, Edward Miller, Selen Basol, Richard Newcombe, Robert Wang, Jakob Julian Engel, Tomas Hodan

Abstract: We introduce HOT3D, a publicly available dataset for egocentric hand and object tracking in 3D. The dataset offers over 833 minutes (more than 3.7M images) of multi-view RGB/monochrome image streams showing 19 subjects interacting with 33 diverse rigid objects, multi-modal signals such as eye gaze or scene point clouds, as well as comprehensive ground truth annotations including 3D poses of objects, hands, and cameras, and 3D models of hands and objects. In addition to simple pick-up/observe/put-down actions, HOT3D contains scenarios resembling typical actions in a kitchen, office, and living room environment. The dataset is recorded by two head-mounted devices from Meta: Project Aria, a research prototype of light-weight AR/AI glasses, and Quest 3, a production VR headset sold in millions of units. Ground-truth poses were obtained by a professional motion-capture system using small optical markers attached to hands and objects. Hand annotations are provided in the UmeTrack and MANO formats and objects are represented by 3D meshes with PBR materials obtained by an in-house scanner. We aim to accelerate research on egocentric hand-object interaction by making the HOT3D dataset publicly available and by co-organizing public challenges on the dataset at ECCV 2024. The dataset can be downloaded from the project website: https://facebookresearch.github.io/hot3d/.

Submitted 13 June, 2024; originally announced June 2024.

arXiv:2406.08997 [pdf, ps, other]

Adaptive Temporal Motion Guided Graph Convolution Network for Micro-expression Recognition

Authors: Fengyuan Zhang, Zhaopei Huang, Xinjie Zhang, Qin Jin

Abstract: Micro-expressions serve as essential cues for understanding individuals' genuine emotional states. Recognizing micro-expressions attracts increasing research attention due to its various applications in fields such as business negotiation and psychotherapy. However, the intricate and transient nature of micro-expressions poses a significant challenge to their accurate recognition. Most existing works either neglect temporal dependencies or suffer from redundancy issues in clip-level recognition. In this work, we propose a novel framework for micro-expression recognition, named the Adaptive Temporal Motion Guided Graph Convolution Network (ATM-GCN). Our framework excels at capturing temporal dependencies between frames across the entire clip, thereby enhancing micro-expression recognition at the clip level. Specifically, the integration of Adaptive Temporal Motion layers empowers our method to aggregate global and local motion features inherent in micro-expressions. Experimental results demonstrate that ATM-GCN not only surpasses existing state-of-the-art methods, particularly on the Composite dataset, but also achieves superior performance on the latest micro-expression dataset CAS(ME)$^3$.

Submitted 13 June, 2024; originally announced June 2024.

Comments: Accepted by ICME 2024

arXiv:2406.08759 [pdf, other]

Gaussian-Forest: Hierarchical-Hybrid 3D Gaussian Splatting for Compressed Scene Modeling

Authors: Fengyi Zhang, Tianjun Zhang, Lin Zhang, Helen Huang, Yadan Luo

Abstract: The field of novel-view synthesis has recently witnessed the emergence of 3D Gaussian Splatting, which represents scenes in a point-based manner and renders through rasterization. This methodology, in contrast to Radiance Fields that rely on ray tracing, demonstrates superior rendering quality and speed. However, the explicit and unstructured nature of 3D Gaussians poses a significant storage challenge, impeding its broader application. To address this challenge, we introduce the Gaussian-Forest modeling framework, which hierarchically represents a scene as a forest of hybrid 3D Gaussians. Each hybrid Gaussian retains its unique explicit attributes while sharing implicit ones with its sibling Gaussians, thus optimizing parameterization with significantly fewer variables. Moreover, adaptive growth and pruning strategies are designed, ensuring detailed representation in complex regions and a notable reduction in the number of required Gaussians. Extensive experiments demonstrate that Gaussian-Forest not only maintains comparable speed and quality but also achieves a compression rate surpassing 10 times, marking a significant advancement in efficient scene modeling. Codes are available at https://github.com/Xian-Bei/GaussianForest.

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.08698 [pdf, other]

Constraints on Ultra Heavy Dark Matter Properties from Dwarf Spheroidal Galaxies with LHAASO Observations

Authors: Zhen Cao, F. Aharonian, Q. An, Axikegu, Y. X. Bai, Y. W. Bao, D. Bastieri, X. J. Bi, Y. J. Bi, J. T. Cai, Q. Cao, W. Y. Cao, Zhe Cao, J. Chang, J. F. Chang, A. M. Chen, E. S. Chen, Liang Chen, Lin Chen, Long Chen, M. J. Chen, M. L. Chen, Q. H. Chen, S. H. Chen, S. Z. Chen , et al. (255 additional authors not shown)

Abstract: In this work we search for signals generated by ultra-heavy dark matter in data from the Large High Altitude Air Shower Observatory (LHAASO). We look for possible gamma rays from dark matter annihilation or decay in 16 dwarf spheroidal galaxies in the field of view of LHAASO. Dwarf spheroidal galaxies are among the most promising targets for indirect detection of dark matter because they have low fluxes of astrophysical $γ$-ray background and a large amount of dark matter. By analyzing more than 700 days of observational data from LHAASO, we detect no significant dark matter signal from 1 TeV to 1 EeV. Accordingly, we derive the most stringent constraints on the ultra-heavy dark matter annihilation cross-section up to EeV. The constraints on the lifetime of dark matter in decay mode are also derived.

Submitted 12 June, 2024; originally announced June 2024.

Comments: 17 pages, 12 figures, accepted by PRL

arXiv:2406.08090 [pdf, other]

From Sim-to-Real: Toward General Event-based Low-light Frame Interpolation with Per-scene Optimization

Authors: Ziran Zhang, Yongrui Ma, Yueting Chen, Feng Zhang, Jinwei Gu, Tianfan Xue, Shi Guo

Abstract: Video Frame Interpolation (VFI) is important for video enhancement, frame rate up-conversion, and slow-motion generation. The introduction of event cameras, which capture per-pixel brightness changes asynchronously, has significantly enhanced VFI capabilities, particularly for high-speed, nonlinear motions. However, these event-based methods encounter challenges in low-light conditions, notably trailing artifacts and signal latency, which hinder their direct applicability and generalization. Addressing these issues, we propose a novel per-scene optimization strategy tailored for low-light conditions. This approach utilizes the internal statistics of a sequence to handle degraded event data under low-light conditions, improving the generalizability to different lighting and camera settings. To evaluate its robustness in low-light conditions, we further introduce EVFI-LL, a unique RGB+Event dataset captured under low-light conditions. Our results demonstrate state-of-the-art performance in low-light environments. Both the dataset and the source code will be made publicly available upon publication. Project page: https://naturezhanghn.github.io/sim2real.

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.06951 [pdf, other]

doi 10.1093/mnras/stae1346

Determination method of binary fractions by the integrated spectrum

Authors: F. Zhang, L. Li, Z. Han, X. Wang

Abstract: We need to resolve the individual stars for binary fraction determinations of stellar systems. Therefore, it is not possible to obtain the binary fractions for dense or distant stellar systems. We proposed a method to determine the binary fraction of a dense or distant stellar system. The method is to first determine the binary fraction variation for any two adjacent regions and then add up those binary fraction variations along the radial direction to obtain the binary fraction for a stellar system. Binary fraction variation is derived by using ten binary fraction-sensitive spectral absorption feature indices (SAFIs) and the binary fraction variation calibrations in terms of these SAFIs. Using this method, we first presented the binary fraction variations for twenty-one Galactic globular clusters (GCs). By comparison, we find that they agree well with the binary fractions based on the main-sequence fiducial line method by previous studies. This verifies that the above-mentioned method is feasible. Next, we presented the binary fraction variations of thirteen Galactic GCs. We gave the relationships between binary fraction and various parameters, and found that binary fraction is negatively correlated with NHB and NRR, binary fraction of some studies is not strongly correlated with NBS, and the number of GCs with large binary fraction is greater at extreme blue horizontal branch population ratio. Finally, to obtain a more accurate binary fraction, we suggest that the spectroscopic and photometric observations be conducted at an appropriate area interval for a stellar system.

Submitted 11 June, 2024; originally announced June 2024.

Comments: 12 pages, 6 figures, accepted by MNRAS

arXiv:2406.06640 [pdf]

A high-performance reconstruction method for partially coherent ptychography

Authors: Wenhui Xu, Shoucong Ning, Pengju Sheng, Huixiang Lin, Angus I Kirkland, Yong Peng, Fucai Zhang

Abstract: Ptychography is now integrated as a tool in mainstream microscopy allowing quantitative and high-resolution imaging capabilities over a wide field of view. However, its ultimate performance is inevitably limited by the available coherent flux when implemented using electrons or laboratory X-ray sources. We present a universal reconstruction algorithm with high tolerance to low coherence for both far-field and near-field ptychography. The approach is practical for partial temporal and spatial coherence and requires no prior knowledge of the source properties. Our initial visible-light and electron data show that the method can dramatically improve the reconstruction quality and accelerate the convergence rate of the reconstruction. The approach also integrates well into existing ptychographic engines. It can also improve mixed-state and numerical monochromatisation methods, requiring a smaller number of coherent modes or lower dimensionality of Krylov subspace while providing more stable and faster convergence. We propose that this approach could have significant impact on ptychography of weakly scattering samples.

Submitted 9 June, 2024; originally announced June 2024.

arXiv:2406.05357 [pdf, other]

Classification of Fermi Gamma-Ray Bursts Based on Machine Learning

Authors: Si-Yuan Zhu, Wan-Peng Sun, Da-Ling Ma, Fu-Wen Zhang

Abstract: Gamma-ray bursts (GRBs) are typically classified into long and short GRBs based on their durations. However, there is significant overlap in the duration distributions of these two categories. In this paper, we apply the unsupervised dimensionality reduction algorithms t-SNE and UMAP to classify 2061 Fermi GRBs based on four observed quantities: duration, peak energy, fluence, and peak flux. The map results of t-SNE and UMAP show a clear division of these GRBs into two clusters. We mark the two clusters as GRBs-I and GRBs-II, and find that all GRBs associated with supernovae are classified as GRBs-II. This includes the peculiar short GRB 200826A, which was confirmed to originate from the death of a massive star. Furthermore, except for two extreme events, GRB 211211A and GRB 230307A, all GRBs associated with kilonovae fall into the GRBs-I population. In contrast to the traditional classification of short and long GRBs, the duration distributions of GRBs-I and GRBs-II do not have a fixed boundary. We find that more than 10% of GRBs-I have a duration greater than 2 seconds, while approximately 1% of GRBs-II have a duration shorter than 2 seconds.

Submitted 8 June, 2024; originally announced June 2024.

Comments: 11 pages, 5 figures, revised version submitted to MNRAS

Report number: https://doi.org/10.1093/mnras/stae1594

Journal ref: MNRAS, 2024, 532, 1434-1443

arXiv:2406.03394 [pdf, other]

Gaussian Representation for Deformable Image Registration

Authors: Jihe Li, Fabian Zhang, Xia Li, Tianhao Zhang, Ye Zhang, Joachim Buhmann

Abstract: Deformable image registration (DIR) is a fundamental task in radiotherapy, with existing methods often struggling to balance computational efficiency, registration accuracy, and speed effectively. We introduce a novel DIR approach employing parametric 3D Gaussian control points achieving a better tradeoff. It provides an explicit and flexible representation for spatial deformation fields between 3D volumetric medical images, producing a displacement vector field (DVF) across all volumetric positions. The movement of individual voxels is derived using linear blend skinning (LBS) through localized interpolation of transformations associated with neighboring Gaussians. This interpolation strategy not only simplifies the determination of voxel motions but also acts as an effective regularization technique. Our approach incorporates a unified optimization process through backpropagation, enabling iterative learning of both the parameters of the 3D Gaussians and their transformations. Additionally, the density of Gaussians is adjusted adaptively during the learning phase to accommodate varying degrees of motion complexity. We validated our approach on the 4D-CT lung DIR-Lab and cardiac ACDC datasets, achieving an average target registration error (TRE) of 1.06 mm within a much-improved processing time of 2.43 seconds for the DIR-Lab dataset over existing methods, demonstrating significant advancements in both accuracy and efficiency.

Submitted 5 June, 2024; originally announced June 2024.
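
The linear blend skinning step named in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the isotropic `sigma`, the dictionary layout, the choice to blend over every control point rather than only neighboring ones, and the function names `gaussian_weight` and `blend_displacement` are all assumptions made for illustration.

```python
import math


def gaussian_weight(x, center, sigma):
    """Unnormalized isotropic Gaussian weight of point x for one control point."""
    d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def blend_displacement(x, gaussians):
    """Displacement of point x as a weighted blend of per-Gaussian rigid transforms.

    Each entry of `gaussians` is a dict (hypothetical layout) with keys:
    'center' (3 floats), 'sigma' (float), 'rotation' (3x3 nested list),
    'translation' (3 floats).
    """
    weights = [gaussian_weight(x, g["center"], g["sigma"]) for g in gaussians]
    total = sum(weights)
    if total == 0.0:
        # No control point influences x (or weights underflowed): no motion.
        return [0.0, 0.0, 0.0]

    moved = [0.0, 0.0, 0.0]
    for w, g in zip(weights, gaussians):
        # Rigid transform about the Gaussian center: R (x - c) + c + t.
        rel = [xi - ci for xi, ci in zip(x, g["center"])]
        rotated = [sum(g["rotation"][r][k] * rel[k] for k in range(3)) for r in range(3)]
        for r in range(3):
            moved[r] += (w / total) * (rotated[r] + g["center"][r] + g["translation"][r])

    # The displacement vector is the blended position minus the original one.
    return [moved[r] - x[r] for r in range(3)]
```

Because the weights are normalized to sum to one, a voxel close to a single control point follows that point's transform almost exactly, while a voxel between control points moves by a smooth mixture of their transforms.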

arXiv:2406.01007 [pdf, other]

Measurement of Electron Antineutrino Oscillation Amplitude and Frequency via Neutron Capture on Hydrogen at Daya Bay

Authors: Daya Bay collaboration, F. P. An, W. D. Bai, A. B. Balantekin, M. Bishai, S. Blyth, G. F. Cao, J. Cao, J. F. Chang, Y. Chang, H. S. Chen, H. Y. Chen, S. M. Chen, Y. Chen, Y. X. Chen, Z. Y. Chen, J. Cheng, J. Cheng, Y. -C. Cheng, Z. K. Cheng, J. J. Cherwinka, M. C. Chu, J. P. Cummings, O. Dalager, F. S. Deng , et al. (177 additional authors not shown)

Abstract: This Letter reports the first measurement of the oscillation amplitude and frequency of reactor antineutrinos at Daya Bay via neutron capture on hydrogen using 1958 days of data. With over 3.6 million signal candidates, an optimized candidate selection, improved treatment of backgrounds and efficiencies, refined energy calibration, and an energy response model for the capture-on-hydrogen sensitive region, the relative $\overline{ν}_{e}$ rates and energy spectra variation among the near and far detectors gives $\sin^2 2θ_{13} = 0.0759_{-0.0049}^{+0.0050}$ and $Δm^2_{32} = (2.72^{+0.14}_{-0.15})\times10^{-3}$ eV$^2$ assuming the normal neutrino mass ordering, and $Δm^2_{32} = (-2.83^{+0.15}_{-0.14})\times10^{-3}$ eV$^2$ for the inverted neutrino mass ordering. This estimate of $\sin^2 2θ_{13}$ is consistent with and essentially independent from the one obtained using the capture-on-gadolinium sample at Daya Bay. The combination of these two results yields $\sin^2 2θ_{13} = 0.0833\pm0.0022$, which represents an 8% relative improvement in precision regarding the Daya Bay full 3158-day capture-on-gadolinium result.

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2406.00947 [pdf, other]

Cross-Dimensional Medical Self-Supervised Representation Learning Based on a Pseudo-3D Transformation

Authors: Fei Gao, Siwen Wang, Fandong Zhang, Hong-Yu Zhou, Yizhou Wang, Churan Wang, Gang Yu, Yizhou Yu

Abstract: Medical image analysis suffers from a shortage of data, whether annotated or not. This becomes even more pronounced when it comes to 3D medical images. Self-Supervised Learning (SSL) can partially ease this situation by using unlabeled data. However, most existing SSL methods can only make use of data in a single dimensionality (e.g. 2D or 3D), and are incapable of enlarging the training dataset by using data with differing dimensionalities jointly. In this paper, we propose a new cross-dimensional SSL framework based on a pseudo-3D transformation (CDSSL-P3D), that can leverage both 2D and 3D data for joint pre-training. Specifically, we introduce an image transformation based on the im2col algorithm, which converts 2D images into a format consistent with 3D data. This transformation enables seamless integration of 2D and 3D data, and facilitates cross-dimensional self-supervised learning for 3D medical image analysis. We run extensive experiments on 13 downstream tasks, including 2D and 3D classification and segmentation. The results indicate that our CDSSL-P3D achieves superior performance, outperforming other advanced SSL methods.

Submitted 4 July, 2024; v1 submitted 2 June, 2024; originally announced June 2024.

Comments: MICCAI 2024 accept

arXiv:2406.00707 [pdf, other]

QUADFormer: Learning-based Detection of Cyber Attacks in Quadrotor UAVs

Authors: Pengyu Wang, Zhaohua Yang, Nachuan Yang, Zikai Wang, Jialu Li, Fan Zhang, Chaoqun Wang, Jiankun Wang, Max Q. -H. Meng, Ling Shi

Abstract: Safety-critical intelligent cyber-physical systems, such as quadrotor unmanned aerial vehicles (UAVs), are vulnerable to different types of cyber attacks, and the absence of timely and accurate attack detection can lead to severe consequences. When UAVs are engaged in large outdoor maneuvering flights, their system constitutes highly nonlinear dynamics that include non-Gaussian noises. Therefore, the commonly employed traditional statistics-based and emerging learning-based attack detection methods do not yield satisfactory results. In response to the above challenges, we propose QUADFormer, a novel Quadrotor UAV Attack Detection framework with transFormer-based architecture. This framework includes a residue generator designed to generate a residue sequence sensitive to anomalies. Subsequently, this sequence is fed into a transformer structure with disparity in correlation to specifically learn its statistical characteristics for the purpose of classification and attack detection. Finally, we design an alert module to ensure the safe execution of tasks by UAVs under attack conditions. We conduct extensive simulations and real-world experiments, and the results show that our method has achieved superior detection performance compared with many state-of-the-art methods.

Submitted 14 June, 2024; v1 submitted 2 June, 2024; originally announced June 2024.

arXiv:2406.00706 [pdf, other]

MINER-RRT*: A Hierarchical and Fast Trajectory Planning Framework in 3D Cluttered Environments

Authors: Pengyu Wang, Jiawei Tang, Hin Wang Lin, Fan Zhang, Chaoqun Wang, Jiankun Wang, Ling Shi, Max Q. -H. Meng

Abstract: Trajectory planning for quadrotors in cluttered environments has been challenging in recent years. While many trajectory planning frameworks have been successful, there still exists potential for improvements, particularly in enhancing the speed of generating efficient trajectories. In this paper, we present a novel hierarchical trajectory planning framework to reduce computational time and memory usage called MINER-RRT*, which consists of two main components. First, we propose a sampling-based path planning method boosted by neural networks, where the predicted heuristic region accelerates the convergence of rapidly-exploring random trees. Second, we utilize the optimal conditions derived from the quadrotor's differential flatness properties to construct polynomial trajectories that minimize control effort in multiple stages. Extensive simulation and real-world experimental results demonstrate that, compared to several state-of-the-art (SOTA) approaches, our method can generate high-quality trajectories with better performance in 3D cluttered environments.

Submitted 14 June, 2024; v1 submitted 2 June, 2024; originally announced June 2024.

arXiv:2406.00312 [pdf, other]

NuRF: Nudging the Particle Filter in Radiance Fields for Robot Visual Localization

Authors: Wugang Meng, Tianfu Wu, Huan Yin, Fumin Zhang

Abstract: Can we localize a robot in radiance fields only using monocular vision? This study presents NuRF, a nudged particle filter framework for 6-DoF robot visual localization in radiance fields. NuRF sets anchors in SE(3) to leverage visual place recognition, which provides image comparisons to guide the sampling process. This guidance could improve the convergence and robustness of particle filters for robot localization. Additionally, an adaptive scheme is designed to enhance the performance of NuRF, thus enabling both global visual localization and local pose tracking. Real-world experiments are conducted with comprehensive tests to demonstrate the effectiveness of NuRF. The results showcase the advantages of NuRF in terms of accuracy and efficiency, including comparisons with alternative approaches. Furthermore, we report our findings for future studies and advancements in robot navigation in radiance fields.

Submitted 1 June, 2024; originally announced June 2024.

Comments: 11 pages, 14 figures

arXiv:2406.00212 [pdf, other]

MVAD: A Multiple Visual Artifact Detector for Video Streaming

Authors: Chen Feng, Duolikun Danier, Fan Zhang, David Bull

Abstract: Visual artifacts are often introduced into streamed video content, due to prevailing conditions during content production and/or delivery. Since these can degrade the quality of the user's experience, it is important to automatically and accurately detect them in order to enable effective quality measurement and enhancement. Existing detection methods often focus on a single type of artifact and/or determine the presence of an artifact through thresholding objective quality indices. Such approaches have been reported to offer inconsistent prediction performance and are also impractical for real-world applications where multiple artifacts co-exist and interact. In this paper, we propose a Multiple Visual Artifact Detector, MVAD, for video streaming which, for the first time, is able to detect multiple artifacts using a single framework that is not reliant on video quality assessment models. Our approach employs a new Artifact-aware Dynamic Feature Extractor (ADFE) to obtain artifact-relevant spatial features within each frame for multiple artifact types. The extracted features are further processed by a Recurrent Memory Vision Transformer (RMViT) module, which captures both short-term and long-term temporal information within the input video. The proposed network architecture is optimized in an end-to-end manner based on a new, large and diverse training database that is generated by simulating the video streaming pipeline and based on Adversarial Data Augmentation. This model has been evaluated on two video artifact databases, Maxwell and BVI-Artifact, and achieves consistent and improved prediction results for ten target visual artifacts when compared to seven existing single and multiple artifact detectors. The source code and training database will be available at https://chenfeng-bristol.github.io/MVAD/.

Submitted 31 May, 2024; originally announced June 2024.

Comments: 9 pages

arXiv:2405.19883 [pdf, other]

From Words to Actions: Unveiling the Theoretical Underpinnings of LLM-Driven Autonomous Systems

Authors: Jianliang He, Siyu Chen, Fengzhuo Zhang, Zhuoran Yang

Abstract: In this work, from a theoretical lens, we aim to understand why large language model (LLM) empowered agents are able to solve decision-making problems in the physical world. To this end, consider a hierarchical reinforcement learning (RL) model where the LLM Planner and the Actor perform high-level task planning and low-level execution, respectively. Under this model, the LLM Planner navigates a partially observable Markov decision process (POMDP) by iteratively generating language-based subgoals via prompting. Under proper assumptions on the pretraining data, we prove that the pretrained LLM Planner effectively performs Bayesian aggregated imitation learning (BAIL) through in-context learning. Additionally, we highlight the necessity for exploration beyond the subgoals derived from BAIL by proving that naively executing the subgoals returned by LLM leads to a linear regret. As a remedy, we introduce an $ε$-greedy exploration strategy to BAIL, which is proven to incur sublinear regret when the pretraining error is small. Finally, we extend our theoretical framework to include scenarios where the LLM Planner serves as a world model for inferring the transition model of the environment and to multi-agent settings, enabling coordination among multiple Actors.

Submitted 30 May, 2024; originally announced May 2024.

Comments: Accepted by ICML 2024
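
The $ε$-greedy idea mentioned in the abstract above can be sketched generically as follows. This is only an illustration of standard $ε$-greedy selection applied to subgoals, not the paper's algorithm: the function name `epsilon_greedy_subgoal`, the explicit `candidate_subgoals` list, and uniform sampling during exploration are assumptions.

```python
import random


def epsilon_greedy_subgoal(planner_subgoal, candidate_subgoals, epsilon, rng=random):
    """Pick a subgoal: explore with probability epsilon, otherwise follow the planner.

    planner_subgoal: the subgoal proposed by the (hypothetical) LLM planner.
    candidate_subgoals: non-empty list of subgoals to sample from when exploring.
    epsilon: exploration probability in [0, 1].
    rng: object exposing random() and choice(), e.g. random.Random(seed).
    """
    if rng.random() < epsilon:
        # Exploration step: ignore the planner and sample uniformly.
        return rng.choice(candidate_subgoals)
    # Exploitation step: execute the planner's suggestion unchanged.
    return planner_subgoal
```

With `epsilon = 0` this reduces to always executing the planner's subgoal, the case the abstract says leads to linear regret; a small positive `epsilon` occasionally tries other subgoals.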

Showing 1–50 of 2,583 results for author: Zhang, F