
Image Compression for Machine and Human Vision with Spatial-Frequency Adaptation

Authors: Han Li, Shaohui Li, Shuangrui Ding, Wenrui Dai, Maida Cao, Chenglin Li, Junni Zou, Hongkai Xiong

Abstract: Image compression for machine and human vision (ICMH) has gained increasing attention in recent years. Existing ICMH methods are limited by high training and storage overheads due to heavy design of task-specific networks. To address this issue, in this paper, we develop a novel lightweight adapter-based tuning framework for ICMH, named Adapt-ICMH, that better balances task performance and bitrates with reduced overheads. We propose a spatial-frequency modulation adapter (SFMA) that simultaneously eliminates non-semantic redundancy with a spatial modulation adapter, and enhances task-relevant frequency components and suppresses task-irrelevant frequency components with a frequency modulation adapter. The proposed adapter is plug-and-play and compatible with almost all existing learned image compression models without compromising the performance of pre-trained models. Experiments demonstrate that Adapt-ICMH consistently outperforms existing ICMH frameworks on various machine vision tasks with fewer fine-tuned parameters and reduced computational complexity. Code will be released at https://github.com/qingshi9974/ECCV2024-AdpatICMH.

Submitted 13 July, 2024; originally announced July 2024.

Comments: Accepted by ECCV2024, project: https://github.com/qingshi9974/ECCV2024-AdpatICMH

arXiv:2407.08340 [pdf, other]

SLRL: Structured Latent Representation Learning for Multi-view Clustering

Authors: Zhangci Xiong, Meng Cao

Abstract: In recent years, Multi-View Clustering (MVC) has attracted increasing attention for its potential to reduce the annotation burden associated with large datasets. The aim of MVC is to exploit the inherent consistency and complementarity among different views, thereby integrating information from multiple perspectives to improve clustering outcomes. Despite extensive research in MVC, most existing methods focus predominantly on harnessing complementary information across views to enhance clustering effectiveness, often neglecting the structural information among samples, which is crucial for exploring sample correlations. To address this gap, we introduce a novel framework, termed Structured Latent Representation Learning based Multi-View Clustering method (SLRL). SLRL leverages both the complementary and structural information. Initially, it learns a common latent representation for all views. Subsequently, to exploit the structural information among samples, a k-nearest neighbor graph is constructed from this common latent representation. This graph facilitates enhanced sample interaction through graph learning techniques, leading to a structured latent representation optimized for clustering. Extensive experiments demonstrate that SLRL not only competes well with existing methods but also sets new benchmarks in various multi-view datasets.

Submitted 11 July, 2024; originally announced July 2024.
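
The k-nearest neighbor graph step mentioned in the SLRL abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `knn_graph`, the Euclidean metric, and the "connect if either point is among the other's k nearest" symmetrization rule are all assumptions made for illustration.

```python
import math

def knn_graph(points, k):
    """Return a symmetric k-nearest-neighbor adjacency map for `points`.

    Hypothetical helper: Euclidean distance and union-style symmetrization
    are illustrative choices, not taken from the SLRL paper.
    """
    n = len(points)
    neighbors = {i: set() for i in range(n)}
    for i in range(n):
        # Rank every other point by its distance to point i.
        order = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        for j in order[:k]:
            # Add an undirected edge between i and each of its k nearest points.
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors

# Toy "common latent representation": two well-separated groups in 2-D.
latent = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
graph = knn_graph(latent, k=1)
```

In this toy example the graph only links points within the same group, which is the kind of sample-level structure a downstream graph learning step could exploit.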

arXiv:2407.05097 [pdf, other]

$\mathcal{PT}$-symmetric photonic lattices with type-II Dirac cones

Authors: Qian Tang, Milivoj R. Belić, Hua Zhong, Meng Cao, Yongdong Li, Yiqi Zhang

Abstract: The type-II Dirac cone is a special feature of the band structure, whose Fermi level is represented by a pair of crossing lines. It has been demonstrated that such a structure is useful for investigating topological edge solitons, and more specifically, for mimicking the Klein tunneling. However, it is still not clear what the interplay between type-II Dirac cones and the non-Hermiticity mechanism will result in. Here, this question is addressed; in particular, we report the $\mathcal{PT}$-symmetric photonic lattices with type-II Dirac cones for the first time. We identify a slope-exceptional ring and name it the type-II exceptional ring. We display the restoration of the $\mathcal{PT}$ symmetry of the lattice by reducing the separation between the sites in the unit cell. Curiously, the amplitude of the beam during propagation in the non-Hermitian lattice with $\mathcal{PT}$ symmetry only decays because of diffraction, whereas in the $\mathcal{PT}$ symmetry-broken lattice it will be amplified, even though the beam still diffracts. This work establishes the link between the non-Hermiticity mechanism and the violation of Lorentz invariance in these physical systems.

Submitted 6 July, 2024; originally announced July 2024.

Comments: 5 pages, 4 figures, to appear in Optics Letters. Comments are welcome

arXiv:2407.02787 [pdf]

A versatile quantum microwave photonic signal processing platform based on coincidence window selection technique

Authors: Xinghua Li, Yifan Guo, Xiao Xiang, Runai Quan, Mingtao Cao, Ruifang Dong, Tao Liu, Ming Li, Shougang Zhang

Abstract: Quantum microwave photonics (QMWP) is an innovative approach that combines energy-time entangled biphoton sources as the optical carrier with time-correlated single-photon detection for high-speed RF signal recovery. This groundbreaking method offers unique advantages such as nonlocal RF signal encoding and robust resistance to dispersion-induced frequency fading. This paper explores the versatility of processing the quantum microwave photonic signal by utilizing coincidence window selection on the biphoton coincidence distribution. The demonstration includes finely-tunable RF phase shifting, flexible multi-tap transversal filtering (with up to 15 taps), and photonically implemented RF mixing, leveraging the nonlocal RF mapping characteristic of QMWP. These accomplishments significantly enhance the capability of microwave photonic systems in processing ultra-weak signals, opening up new possibilities for various applications.

Submitted 2 July, 2024; originally announced July 2024.

arXiv:2407.02774 [pdf]

Quantum microwave photonic mixer with a large spurious-free dynamic range

Authors: Xinghua Li, Yifan Guo, Xiao Xiang, Runai Quan, Mingtao Cao, Ruifang Dong, Tao Liu, Ming Li, Shougang Zhang

Abstract: As one of the most fundamental functionalities of microwave photonics, microwave frequency mixing plays an essential role in modern radars and wireless communication systems. However, the commonly utilized intensity modulation in the systems often leads to inadequate spurious-free dynamic range (SFDR) for many sought-after applications. The quantum microwave photonics technique offers a promising solution for improving SFDR in terms of higher-order harmonic distortion. In this paper, we demonstrate two types of quantum microwave photonic mixers based on the configuration of the intensity modulators: cascade-type and parallel-type. Leveraging the nonlocal RF signal encoding capability, both types of quantum microwave photonic mixers not only exhibit the advantage of dual-channel output but also present significant improvement in SFDR. Specifically, the parallel-type quantum microwave photonic mixer achieves a remarkable SFDR value of 113.6 dB$\cdot$Hz$^{1/2}$, which is 30 dB better than that of the cascade-type quantum microwave photonic mixer. When compared to the classical microwave photonic mixer, this enhancement reaches a notable 53.6 dB at the expense of 8 dB conversion loss. These results highlight the superiority of quantum microwave photonic mixers in the fields of microwave and millimeter-wave systems. Further applying multi-photon frequency entangled sources as optical carriers, the dual-channel microwave frequency conversion capability endowed by the quantum microwave photonic mixer can be extended to enhance the performance of multi-path microwave mixing, which is essential for radar net systems.

Submitted 2 July, 2024; originally announced July 2024.

arXiv:2406.16935 [pdf, other]

Benchmarking Out-of-Distribution Generalization Capabilities of DNN-based Encoding Models for the Ventral Visual Cortex

Authors: Spandan Madan, Will Xiao, Mingran Cao, Hanspeter Pfister, Margaret Livingstone, Gabriel Kreiman

Abstract: We characterized the generalization capabilities of DNN-based encoding models when predicting neuronal responses from the visual cortex. We collected \textit{MacaqueITBench}, a large-scale dataset of neural population responses from the macaque inferior temporal (IT) cortex to over $300,000$ images, comprising $8,233$ unique natural images presented to seven monkeys over $109$ sessions. Using \textit{MacaqueITBench}, we investigated the impact of distribution shifts on models predicting neural activity by dividing the images into Out-Of-Distribution (OOD) train and test splits. The OOD splits included several different image-computable types including image contrast, hue, intensity, temperature, and saturation. Compared to the performance on in-distribution test images -- the conventional way these models have been evaluated -- models performed worse at predicting neuronal responses to out-of-distribution images, retaining as little as $20\%$ of the performance on in-distribution test images. The generalization performance under OOD shifts is well accounted for by a simple image similarity metric -- the cosine distance between image representations extracted from a pre-trained object recognition model is a strong predictor of neural predictivity under different distribution shifts. The dataset of images, neuronal firing rate recordings, and computational benchmarks are hosted publicly at: https://bit.ly/3zeutVd.

Submitted 16 June, 2024; originally announced June 2024.
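
The similarity metric named in the MacaqueITBench abstract, cosine distance between image representations, can be sketched as follows. The feature vectors are made-up placeholders standing in for embeddings from a pre-trained recognition model; the function name `cosine_distance` is an assumption, and how the paper aggregates distances across a train/test split is not specified in the abstract.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle) between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Placeholder embeddings (hypothetical values, for illustration only).
train_feature = [1.0, 2.0, 0.5]
test_feature = [0.9, 2.1, 0.4]
shift = cosine_distance(train_feature, test_feature)
```

A distance near 0 means the two representations point in almost the same direction; a distance of 1 means they are orthogonal.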

arXiv:2406.14887 [pdf, other]

InternLM-Law: An Open Source Chinese Legal Large Language Model

Authors: Zhiwei Fei, Songyang Zhang, Xiaoyu Shen, Dawei Zhu, Xiao Wang, Maosong Cao, Fengzhe Zhou, Yining Li, Wenwei Zhang, Dahua Lin, Kai Chen, Jidong Ge

Abstract: While large language models (LLMs) have showcased impressive capabilities, they struggle with addressing legal queries due to the intricate complexities and specialized expertise required in the legal field. In this paper, we introduce InternLM-Law, a specialized LLM tailored for addressing diverse legal queries related to Chinese laws, spanning from responding to standard legal questions (e.g., legal exercises in textbooks) to analyzing complex real-world legal situations. We meticulously construct a dataset in the Chinese legal domain, encompassing over 1 million queries, and implement a data filtering and processing pipeline to ensure its diversity and quality. Our training approach involves a novel two-stage process: initially fine-tuning LLMs on both legal-specific and general-purpose content to equip the models with broad knowledge, followed by exclusive fine-tuning on high-quality legal data to enhance structured output generation. InternLM-Law achieves the highest average performance on LawBench, outperforming state-of-the-art models, including GPT-4, on 13 out of 20 subtasks. We make InternLM-Law and our dataset publicly available to facilitate future research in applying LLMs within the legal domain.

Submitted 21 June, 2024; originally announced June 2024.

Comments: Our dataset, code and models will be released at https://github.com/InternLM/InternLM-Law

arXiv:2406.14060 [pdf, ps, other]

Distributed Event-Triggered Bandit Convex Optimization with Time-Varying Constraints

Authors: Kunpeng Zhang, Xinlei Yi, Guanghui Wen, Ming Cao, Karl H. Johansson, Tianyou Chai, Tao Yang

Abstract: This paper considers the distributed bandit convex optimization problem with time-varying inequality constraints over a network of agents, where the goal is to minimize network regret and cumulative constraint violation. Existing distributed online algorithms require that each agent broadcasts its decision to its neighbors at each iteration. To better utilize the limited communication resources, we propose a distributed event-triggered online primal--dual algorithm with two-point bandit feedback. Under several classes of appropriately chosen decreasing parameter sequences and non-increasing event-triggered threshold sequences, we establish dynamic network regret and network cumulative constraint violation bounds. These bounds are comparable to the results achieved by distributed event-triggered online algorithms with full-information feedback. Finally, a numerical example is provided to verify the theoretical results.

Submitted 20 June, 2024; originally announced June 2024.

Comments: 34 pages, 4 figures. arXiv admin note: text overlap with arXiv:2311.01957
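
Two-point bandit feedback, as mentioned in the abstract above, means a gradient is estimated from two function evaluations rather than observed directly. The sketch below shows one common symmetric form of such an estimator. It is an assumption that this form resembles the one used in the paper; the helper names and the exploration radius `delta` are hypothetical, and the event-triggered communication and primal-dual updates are not modeled.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Sample a direction uniformly from the unit sphere in `dim` dimensions."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 0.0:
            return [x / norm for x in v]

def two_point_gradient(f, x, delta, rng):
    """Estimate the gradient of f at x from two evaluations of f.

    Uses g = (d / (2 * delta)) * (f(x + delta*u) - f(x - delta*u)) * u,
    where u is a random unit vector and d is the dimension of x.
    """
    d = len(x)
    u = random_unit_vector(d, rng)
    f_plus = f([xi + delta * ui for xi, ui in zip(x, u)])
    f_minus = f([xi - delta * ui for xi, ui in zip(x, u)])
    scale = d * (f_plus - f_minus) / (2.0 * delta)
    return [scale * ui for ui in u]

def average_estimate(f, x, delta, samples, rng):
    """Average many independent two-point estimates to reduce variance."""
    total = [0.0] * len(x)
    for _ in range(samples):
        g = two_point_gradient(f, x, delta, rng)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / samples for t in total]
```

For a quadratic such as f(x) = sum of x_i squared, the expectation of this estimator equals the true gradient 2x, so averaging many samples should approach it.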

arXiv:2406.12703 [pdf, other]

Coarse-Fine Spectral-Aware Deformable Convolution For Hyperspectral Image Reconstruction

Authors: Jincheng Yang, Lishun Wang, Miao Cao, Huan Wang, Yinping Zhao, Xin Yuan

Abstract: We study the inverse problem of Coded Aperture Snapshot Spectral Imaging (CASSI), which captures a spatial-spectral data cube using snapshot 2D measurements and uses algorithms to reconstruct 3D hyperspectral images (HSI). However, current methods based on Convolutional Neural Networks (CNNs) struggle to capture long-range dependencies and non-local similarities. The recently popular Transformer-based methods are poorly deployed on downstream tasks due to the high computational cost caused by self-attention. In this paper, we propose Coarse-Fine Spectral-Aware Deformable Convolution Network (CFSDCN), applying deformable convolutional networks (DCN) to this task for the first time. Considering the sparsity of HSI, we design a deformable convolution module that exploits its deformability to capture long-range dependencies and non-local similarities. In addition, we propose a new spectral information interaction module that considers both coarse-grained and fine-grained spectral similarities. Extensive experiments demonstrate that our CFSDCN significantly outperforms previous state-of-the-art (SOTA) methods on both simulated and real HSI datasets.

Submitted 18 June, 2024; originally announced June 2024.

Comments: 7 pages, 5 figures, Accepted by ICIP2024

arXiv:2406.10318 [pdf, other]

Creating a Lens of Chinese Culture: A Multimodal Dataset for Chinese Pun Rebus Art Understanding

Authors: Tuo Zhang, Tiantian Feng, Yibin Ni, Mengqin Cao, Ruying Liu, Katharine Butler, Yanjun Weng, Mi Zhang, Shrikanth S. Narayanan, Salman Avestimehr

Abstract: Large vision-language models (VLMs) have demonstrated remarkable abilities in understanding everyday content. However, their performance in the domain of art, particularly culturally rich art forms, remains less explored. As a pearl of human wisdom and creativity, art encapsulates complex cultural narratives and symbolism. In this paper, we offer the Pun Rebus Art Dataset, a multimodal dataset for art understanding deeply rooted in traditional Chinese culture. We focus on three primary tasks: identifying salient visual elements, matching elements with their symbolic meanings, and explaining the conveyed messages. Our evaluation reveals that state-of-the-art VLMs struggle with these tasks, often providing biased and hallucinated explanations and showing limited improvement through in-context learning. By releasing the Pun Rebus Art Dataset, we aim to facilitate the development of VLMs that can better understand and interpret culturally specific content, promoting greater inclusiveness beyond English-based corpora.

Submitted 14 June, 2024; originally announced June 2024.

arXiv:2406.09838 [pdf, other]

Vision-Language Models Meet Meteorology: Developing Models for Extreme Weather Events Detection with Heatmaps

Authors: Jian Chen, Peilin Zhou, Yining Hua, Dading Chong, Meng Cao, Yaowei Li, Zixuan Yuan, Bing Zhu, Junwei Liang

Abstract: Real-time detection and prediction of extreme weather protect human lives and infrastructure. Traditional methods rely on numerical threshold setting and manual interpretation of weather heatmaps with Geographic Information Systems (GIS), which can be slow and error-prone. Our research redefines Extreme Weather Events Detection (EWED) by framing it as a Visual Question Answering (VQA) problem, thereby introducing a more precise and automated solution. Leveraging Vision-Language Models (VLM) to simultaneously process visual and textual data, we offer an effective aid to enhance the analysis process of weather heatmaps. Our initial assessment of general-purpose VLMs (e.g., GPT-4-Vision) on EWED revealed poor performance, characterized by low accuracy and frequent hallucinations due to inadequate color differentiation and insufficient meteorological knowledge. To address these challenges, we introduce ClimateIQA, the first meteorological VQA dataset, which includes 8,760 wind gust heatmaps and 254,040 question-answer pairs covering four question types, both generated from the latest climate reanalysis data. We also propose Sparse Position and Outline Tracking (SPOT), an innovative technique that leverages OpenCV and K-Means clustering to capture and depict color contours in heatmaps, providing ClimateIQA with more accurate color spatial location information. Finally, we present Climate-Zoo, the first meteorological VLM collection, which adapts VLMs to meteorological applications using the ClimateIQA dataset. Experiment results demonstrate that models from Climate-Zoo substantially outperform state-of-the-art general VLMs, achieving an accuracy increase from 0% to over 90% in EWED verification. The datasets and models in this study are publicly available for future climate science research: https://github.com/AlexJJJChen/Climate-Zoo.

Submitted 14 June, 2024; originally announced June 2024.

arXiv:2406.06329 [pdf, other]

A Parameter-efficient Language Extension Framework for Multilingual ASR

Authors: Wei Liu, Jingyong Hou, Dong Yang, Muyong Cao, Tan Lee

Abstract: Covering all languages with a multilingual speech recognition model (MASR) is very difficult. Performing language extension on top of an existing MASR is a desirable choice. In this study, the MASR continual learning problem is probabilistically decomposed into language identity prediction (LP) and cross-lingual adaptation (XLA) sub-problems. Based on this, we propose an architecture-based framework for language extension that can fundamentally solve catastrophic forgetting, dubbed PELE. PELE is designed to be parameter-efficient, incrementally incorporating an add-on module to adapt to a new language. Specifically, different parameter-efficient fine-tuning (PEFT) modules and their variants are explored as potential candidates to perform XLA. Experiments are carried out on 5 new languages with a wide range of low-resourced data sizes. The best-performing PEFT candidate can achieve satisfactory performance across all languages and demonstrates superiority in three of five languages over the continual joint learning setting. Notably, PEFT methods focusing on weight parameters or input features are revealed to be limited in performance, showing significantly inferior extension capabilities compared to inserting a lightweight module in between layers such as an Adapter.

Submitted 10 June, 2024; originally announced June 2024.

Comments: Accepted by Interspeech 2024

arXiv:2405.20607 [pdf, other]

Textual Inversion and Self-supervised Refinement for Radiology Report Generation

Authors: Yuanjiang Luo, Hongxiang Li, Xuan Wu, Meng Cao, Xiaoshuang Huang, Zhihong Zhu, Peixi Liao, Hu Chen, Yi Zhang

Abstract: Existing mainstream approaches follow the encoder-decoder paradigm for generating radiology reports. They focus on improving the network structure of encoders and decoders, which leads to two shortcomings: overlooking the modality gap and ignoring report content constraints. In this paper, we propose Textual Inversion and Self-supervised Refinement (TISR) to address the above two issues. Specifically, textual inversion can project text and image into the same space by representing images as pseudo words to eliminate the cross-modeling gap. Subsequently, self-supervised refinement refines these pseudo words through contrastive loss computation between images and texts, enhancing the fidelity of generated reports to images. Notably, TISR is plug-and-play and orthogonal to most existing methods. We conduct experiments on two widely-used public datasets and achieve significant improvements on various baselines, which demonstrates the effectiveness and generalization of TISR. The code will be available soon.

Submitted 6 June, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

Comments: This paper has been early accepted by MICCAI 2024!

arXiv:2405.19689 [pdf, other]

Uncertainty-aware sign language video retrieval with probability distribution modeling

Authors: Xuan Wu, Hongxiang Li, Yuanjiang Luo, Xuxin Cheng, Xianwei Zhuang, Meng Cao, Keren Fu

Abstract: Sign language video retrieval plays a key role in facilitating information access for the deaf community. Despite significant advances in video-text retrieval, the complexity and inherent uncertainty of sign language preclude the direct application of these techniques. Previous methods achieve the mapping between sign language video and text through fine-grained modal alignment. However, due to the scarcity of fine-grained annotation, the uncertainty inherent in sign language video is underestimated, limiting the further development of sign language retrieval tasks. To address this challenge, we propose a novel Uncertainty-aware Probability Distribution Retrieval (UPRet) that conceptualizes the mapping process of sign language video and text in terms of probability distributions, explores their potential interrelationships, and enables flexible mappings. Experiments on three benchmarks demonstrate the effectiveness of our method, which achieves state-of-the-art results on How2Sign (59.1%), PHOENIX-2014T (72.0%), and CSL-Daily (78.4%).

Submitted 30 May, 2024; originally announced May 2024.

arXiv:2405.19465 [pdf, other]

RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter

Authors: Meng Cao, Haoran Tang, Jinfa Huang, Peng Jin, Can Zhang, Ruyang Liu, Long Chen, Xiaodan Liang, Li Yuan, Ge Li

Abstract: Text-Video Retrieval (TVR) aims to align relevant video content with natural language queries. To date, most state-of-the-art TVR methods learn image-to-video transfer learning based on large-scale pre-trained vision-language models (e.g., CLIP). However, fully fine-tuning these pre-trained models for TVR incurs prohibitively expensive computation costs. To this end, we propose to conduct efficient text-video Retrieval with a sparse-and-correlated AdaPter (RAP), i.e., fine-tuning the pre-trained model with a few parameterized layers. To accommodate the text-video scenario, we equip our RAP with two indispensable characteristics: temporal sparsity and correlation. Specifically, we propose a low-rank modulation module to refine the per-image features from the frozen CLIP backbone, which accentuates salient frames within the video features while alleviating temporal redundancy. Besides, we introduce an asynchronous self-attention mechanism that first selects the top responsive visual patches and augments the correlation modeling between them with learnable temporal and patch offsets. Extensive experiments on four TVR datasets demonstrate that RAP achieves superior or comparable performance compared to the fully fine-tuned counterpart and other parameter-efficient fine-tuning methods.

Submitted 29 May, 2024; originally announced May 2024.

Comments: Accepted by ACL 2024 Findings

arXiv:2405.18969 [pdf, ps, other]

Global and local observability of hypergraphs

Authors: Chencheng Zhang, Hao Yang, Shaoxuan Cui, Bin Jiang, Ming Cao

Abstract: This paper studies observability for non-uniform hypergraphs with inputs and outputs. To capture higher-order interactions, we define a canonical non-homogeneous dynamical system with nonlinear outputs on hypergraphs. We then construct algebraic necessary and sufficient conditions based on polynomial ideals and varieties for global observability at an initial state of hypergraphs. An example is given to illustrate the proposed criteria for observability. Further, necessary and sufficient conditions for local observability are derived based on rank conditions of observability matrices, which provide a framework to study local observability for non-uniform hypergraphs. Finally, the similarity of observability for hypergraphs is proposed using similarity of tensors, which reveals the relation of observability between two hypergraphs and helps to check the observability intuitively.

Submitted 29 May, 2024; originally announced May 2024.

arXiv:2405.18333 [pdf, other]

On the analysis of a higher-order Lotka-Volterra model: an application of S-tensors and the polynomial complementarity problem

Authors: Shaoxuan Cui, Qi Zhao, Guofeng Zhang, Hildeberto Jardón-Kojakhmetov, Ming Cao

Abstract: It is known that the effect of species' density on species' growth is non-additive in real ecological systems. This challenges the conventional Lotka-Volterra model, where the interactions are always pairwise and their effects are additive. To address this challenge, we introduce HOIs (Higher-Order Interactions) which are able to capture, for example, the indirect effect of one species on a second one correlating to a third species. Towards this end, we propose a general higher-order Lotka-Volterra model. We provide an existence result of a positive equilibrium for a non-homogeneous polynomial equation system with the help of S-tensors. Afterward, by utilizing the latter result, as well as the theory of monotone systems and results from the polynomial complementarity problem, we provide comprehensive results regarding the existence, uniqueness, and stability of the corresponding equilibrium. These results can be regarded as natural extensions of many analogous ones for the classical Lotka-Volterra model, especially in the case of full cooperation, competition among two factions, and pure competition. Finally, illustrative numerical examples are provided to highlight our contributions.

Submitted 8 July, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

arXiv:2405.13865 [pdf, other]

ReVideo: Remake a Video with Motion and Content Control

Authors: Chong Mou, Mingdeng Cao, Xintao Wang, Zhaoyang Zhang, Ying Shan, Jian Zhang

Abstract: Despite significant advancements in video generation and editing using diffusion models, achieving accurate and localized video editing remains a substantial challenge. Additionally, most existing video editing methods primarily focus on altering visual content, with limited research dedicated to motion editing. In this paper, we present a novel attempt to Remake a Video (ReVideo) which stands out from existing methods by allowing precise video editing in specific areas through the specification of both content and motion. Content editing is facilitated by modifying the first frame, while the trajectory-based motion control offers an intuitive user interaction experience. ReVideo addresses a new task involving the coupling and training imbalance between content and motion control. To tackle this, we develop a three-stage training strategy that progressively decouples these two aspects from coarse to fine. Furthermore, we propose a spatiotemporal adaptive fusion module to integrate content and motion control across various sampling steps and spatial locations. Extensive experiments demonstrate that our ReVideo has promising performance on several accurate video editing applications, i.e., (1) locally changing video content while keeping the motion constant, (2) keeping content unchanged and customizing new motion trajectories, (3) modifying both content and motion trajectories. Our method can also seamlessly extend these applications to multi-area editing without specific training, demonstrating its flexibility and robustness.

Submitted 22 May, 2024; originally announced May 2024.

arXiv:2405.10915 [pdf, other]

Strategic control for a Boltzmann like decision-making model

Authors: Luis Guillermo Venegas-Pineda, Hildeberto Jardón-Kojakhmetov, Maximilian Engel, Jobst Heitzig, Muhittin Cenk Eser, Ming Cao

Abstract: We study a prototypical non-polynomial decision-making model for which agents in a population potentially alternate between two consumption strategies, one related to the exploitation of an unlimited but considerably expensive resource and the other a comparably cheaper but restricted and slowly renewable source. In particular, we study a model following a Boltzmann-like exploration policy, enhancing the accuracy at which the exchange rates are captured with respect to classical polynomial approaches by considering sigmoidal functions to represent the cost-profit relation in both exploit strategies. Additionally, given the intrinsic timescale separation between the decision-making process and recovery rates of the renewable resource, we use geometric singular perturbation theory to analyze the model. We further use numerical analysis to determine parameter ranges for which the model undergoes bifurcations. These bifurcations, being related to critical states of the system, are relevant to the fast transitions between strategies. Hence, we design controllers to regulate such rapid transitions by taking advantage of the system's criticality.

Submitted 17 May, 2024; originally announced May 2024.

Comments: 40 pages, 20 figures

arXiv:2405.07740 [pdf, ps, other]

The $σ$ hulls of matrix-product codes and related entanglement-assisted quantum error-correcting codes

Authors: Meng Cao

Abstract: Let $\mathrm{SLAut}(\mathbb{F}_{q}^{n})$ denote the group of all semilinear isometries on $\mathbb{F}_{q}^{n}$, where $q=p^{e}$ is a prime power. Matrix-product (MP) codes are a class of long classical codes generated by combining several commensurate classical codes with a defining matrix. We give an explicit formula for calculating the dimension of the $σ$ hull of a MP code. As a result, we give necessary and sufficient conditions for the MP codes to be $σ$ dual-containing and $σ$ self-orthogonal. We prove that $\mathrm{dim}_{\mathbb{F}_{q}}(\mathrm{Hull}_σ(\mathcal{C}))=\mathrm{dim}_{\mathbb{F}_{q}}(\mathrm{Hull}_σ(\mathcal{C}^{\bot_σ}))$. We prove that for any integer $h$ with $\mathrm{max}\{0,k_{1}-k_{2}\}\leq h\leq \mathrm{dim}_{\mathbb{F}_{q}}(\mathcal{C}_{1}\cap\mathcal{C}_{2}^{\bot_σ})$, there exists a linear code $\mathcal{C}_{2,h}$ monomially equivalent to $\mathcal{C}_{2}$ such that $\mathrm{dim}_{\mathbb{F}_{q}}(\mathcal{C}_{1}\cap\mathcal{C}_{2,h}^{\bot_σ})=h$, where $\mathcal{C}_{i}$ is an $[n,k_{i}]_{q}$ linear code for $i=1,2$. We show that given an $[n,k,d]_{q}$ linear code $\mathcal{C}$, there exists a monomially equivalent $[n,k,d]_{q}$ linear code $\mathcal{C}_{h}$, whose $σ$ dual code has minimum distance $d'$, such that there exist an $[[n,k-h,d;n-k-h]]_{q}$ EAQECC and an $[[n,n-k-h,d';k-h]]_{q}$ EAQECC for every integer $h$ with $0\leq h\leq \mathrm{dim}_{\mathbb{F}_{q}}(\mathrm{Hull}_σ(\mathcal{C}))$. Based on this result, we present a general construction method for deriving EAQECCs with flexible parameters from MP codes related to $σ$ hulls.

Submitted 13 May, 2024; originally announced May 2024.

arXiv:2405.02538 [pdf, other]

AdaFPP: Adapt-Focused Bi-Propagating Prototype Learning for Panoramic Activity Recognition

Authors: Meiqi Cao, Rui Yan, Xiangbo Shu, Guangzhao Dai, Yazhou Yao, Guo-Sen Xie

Abstract: Panoramic Activity Recognition (PAR) aims to identify multi-granularity behaviors performed by multiple persons in panoramic scenes, including individual activities, group activities, and global activities. Previous methods 1) heavily rely on manually annotated detection boxes in training and inference, hindering further practical deployment; or 2) directly employ normal detectors to detect multiple persons with varying size and spatial occlusion in panoramic scenes, blocking the performance gain of PAR. To this end, we consider learning a detector adapting varying-size occluded persons, which is optimized along with the recognition module in the all-in-one framework. Therefore, we propose a novel Adapt-Focused bi-Propagating Prototype learning (AdaFPP) framework to jointly recognize individual, group, and global activities in panoramic activity scenes by learning an adapt-focused detector and multi-granularity prototypes as the pretext tasks in an end-to-end way. Specifically, to accommodate the varying sizes and spatial occlusion of multiple persons in crowded panoramic scenes, we introduce a panoramic adapt-focuser, achieving the size-adapting detection of individuals by comprehensively selecting and performing fine-grained detections on object-dense sub-regions identified through original detections. In addition, to mitigate information loss due to inaccurate individual localizations, we introduce a bi-propagation prototyper that promotes closed-loop interaction and informative consistency across different granularities by facilitating bidirectional information propagation among the individual, group, and global levels. Extensive experiments demonstrate the significant performance of AdaFPP and emphasize its powerful applicability for PAR.

Submitted 3 May, 2024; originally announced May 2024.

arXiv:2405.02285 [pdf, ps, other]

Special matrices over finite fields and their applications to quantum error-correcting codes

Authors: Meng Cao

Abstract: The matrix-product (MP) code $\mathcal{C}_{A,k}:=[\mathcal{C}_{1},\mathcal{C}_{2},\ldots,\mathcal{C}_{k}]\cdot A$ with a non-singular by column (NSC) matrix $A$ plays an important role in constructing good quantum error-correcting codes. In this paper, we study the MP code when the defining matrix $A$ satisfies the condition that $AA^†$ is $(D,τ)$-monomial. We give an explicit formula for calculating the dimension of the Hermitian hull of a MP code. We provide the necessary and sufficient conditions that a MP code is Hermitian dual-containing (HDC), almost Hermitian dual-containing (AHDC), Hermitian self-orthogonal (HSO), almost Hermitian self-orthogonal (AHSO), and Hermitian LCD, respectively. We theoretically determine the number of all possible ways involving the relationships among the constituent codes to yield a MP code with these properties, respectively. We give alternative necessary and sufficient conditions for a MP code to be AHDC and AHSO, respectively, and show several cases where a MP code is not AHDC or AHSO. We provide the construction methods of HDC and AHDC MP codes, including those with optimal minimum distance lower bounds.

Submitted 11 May, 2024; v1 submitted 3 May, 2024; originally announced May 2024.

arXiv:2404.18686 [pdf]

Dynamic temperature compensation for wavelength-stable entangled biphoton generation

Authors: Yuting Liu, Huibo Hong, Xiao Xiang, Runai Quan, Tao Liu, Mingtao Cao, Shougang Zhang, Ruifang Dong

Abstract: A dynamic temperature compensation method is presented to stabilize the wavelength of the entangled biphoton source, which is generated via the spontaneous parametric down-conversion based on a MgO:PPLN waveguide. Utilizing the dispersive Fourier transformation technique combined with a digital proportional-integral-differential algorithm, the small amount of wavelength variation can be instantly identified and then compensated with active temperature correction. The long-term wavelength stability, assessed through Allan deviation, shows nearly a hundredfold enhancement, reaching 2.00*10^(-7) at the averaging time of 10000 s. It offers a simple, ready-to-use solution for precise wavelength control in quantum information processing.

Submitted 29 April, 2024; originally announced April 2024.

arXiv:2404.18106 [pdf, other]

Semi-supervised Text-based Person Search

Authors: Daming Gao, Yang Bai, Min Cao, Hao Dou, Mang Ye, Min Zhang

Abstract: Text-based person search (TBPS) aims to retrieve images of a specific person from a large image gallery based on a natural language description. Existing methods rely on massive annotated image-text data to achieve satisfactory performance in fully-supervised learning. It poses a significant challenge in practice, as acquiring person images from surveillance videos is relatively easy, while obtaining annotated texts is challenging. The paper undertakes a pioneering initiative to explore TBPS under the semi-supervised setting, where only a limited number of person images are annotated with textual descriptions while the majority of images lack annotations. We present a two-stage basic solution based on generation-then-retrieval for semi-supervised TBPS. The generation stage enriches annotated data by applying an image captioning model to generate pseudo-texts for unannotated images. Later, the retrieval stage performs fully-supervised retrieval learning using the augmented data. Significantly, considering the noise interference of the pseudo-texts on retrieval learning, we propose a noise-robust retrieval framework that enhances the ability of the retrieval model to handle noisy data. The framework integrates two key strategies: Hybrid Patch-Channel Masking (PC-Mask) to refine the model architecture, and Noise-Guided Progressive Training (NP-Train) to enhance the training process. PC-Mask performs masking on the input data at both the patch-level and the channel-level to prevent overfitting noisy supervision. NP-Train introduces a progressive training schedule based on the noise level of pseudo-texts to facilitate noise-robust learning. Extensive experiments on multiple TBPS benchmarks show that the proposed framework achieves promising performance under the semi-supervised setting.

Submitted 28 April, 2024; originally announced April 2024.

Comments: 13 pages

arXiv:2404.14013 [pdf, ps, other]

A characterization of compactness via bilinear $T1$ theorem

Authors: Mingming Cao, Honghai Liu, Zengyan Si, Kôzô Yabuta

Abstract: We establish a bilinear $T1$ theorem to characterize the weighted compactness of bilinear Calderón--Zygmund operators. Let $T$ be a bilinear operator associated with a standard bilinear Calderón--Zygmund kernel. We demonstrate that $T$ can be extended to a compact bilinear operator from $L^{p_1}(w_1^{p_1}) \times L^{p_2}(w_2^{p_2})$ to $L^p(w^p)$ for all exponents $\frac{1}{p} = \frac{1}{p_1} + \frac{1}{p_2}$ with $1<p_1, p_2< \infty$ and for all weights $(w_1, w_2) \in A_{(p_1, p_2)}$ if and only if the following conditions hold: (i) $T$ is associated with a compact bilinear Calderón--Zygmund kernel, (ii) $T$ satisfies the weak compactness property, and (iii) $T(1,1), T^{*1}(1,1), T^{*2}(1,1) \in \mathrm{CMO}(\mathbb{R}^n)$.

Submitted 22 April, 2024; originally announced April 2024.

Comments: This is just a draft, but we post the file in its current form now, in response to several queries about the result and method. Eventually, these results will be a part of a more extensive work about compactness of bilinear singular integrals

MSC Class: 42B20; 42B35

arXiv:2404.09842 [pdf, other]

STMixer: A One-Stage Sparse Action Detector

Authors: Tao Wu, Mengqi Cao, Ziteng Gao, Gangshan Wu, Limin Wang

Abstract: Traditional video action detectors typically adopt the two-stage pipeline, where a person detector is first employed to generate actor boxes and then 3D RoIAlign is used to extract actor-specific features for classification. This detection paradigm requires multi-stage training and inference, and the feature sampling is constrained inside the box, failing to effectively leverage richer context information outside. Recently, a few query-based action detectors have been proposed to predict action instances in an end-to-end manner. However, they still lack adaptability in feature sampling and decoding, thus suffering from the issues of inferior performance or slower convergence. In this paper, we propose two core designs for a more flexible one-stage sparse action detector. First, we present a query-based adaptive feature sampling module, which endows the detector with the flexibility of mining a group of discriminative features from the entire spatio-temporal domain. Second, we devise a decoupled feature mixing module, which dynamically attends to and mixes video features along the spatial and temporal dimensions respectively for better feature decoding. Based on these designs, we instantiate two detection pipelines, that is, STMixer-K for keyframe action detection and STMixer-T for action tubelet detection. Without bells and whistles, our STMixer detectors obtain state-of-the-art results on five challenging spatio-temporal action detection benchmarks for keyframe action detection or action tube detection.

Submitted 15 April, 2024; originally announced April 2024.

Comments: Extended version of the paper arXiv:2303.15879 presented at CVPR 2023. Accepted by TPAMI 2024

arXiv:2404.06784 [pdf]

Statistical evaluation of 571 GaAs quantum point contact transistors showing the 0.7 anomaly in quantized conductance using millikelvin cryogenic on-chip multiplexing

Authors: Pengcheng Ma, Kaveh Delfanazari, Reuben K. Puddy, Jiahui Li, Moda Cao, Teng Yi, Jonathan P. Griffiths, Harvey E. Beere, David A. Ritchie, Michael J. Kelly, Charles G. Smith

Abstract: The mass production and the practical number of cryogenic quantum devices producible in a single chip are limited to the number of electrical contact pads and wiring of the cryostat or dilution refrigerator. It is, therefore, beneficial to contrast the measurements of hundreds of devices fabricated in a single chip in one cooldown process to promote the scalability, integrability, reliability, and reproducibility of quantum devices and to save evaluation time, cost and energy. Here, we use a cryogenic on-chip multiplexer architecture and investigate the statistics of the 0.7 anomaly observed on the first three plateaus of the quantized conductance of semiconductor quantum point contact (QPC) transistors. Our single chips contain 256 split gate field effect QPC transistors (QFET) each, with two 16-branch multiplexed source-drain and gate pads, allowing individual transistors to be selected, addressed and controlled through an electrostatic gate voltage process. A total of 1280 quantum transistors with nano-scale dimensions are patterned in 5 different chips of GaAs heterostructures. From the measurements of 571 functioning QPCs taken at temperatures T= 1.4 K and T= 40 mK, it is found that the spontaneous polarisation model and Kondo effect do not fit our results. Furthermore, some of the features in our data largely agreed with van Hove model with short-range interactions. Our approach provides further insight into the quantum mechanical properties and microscopic origin of the 0.7 anomaly in QPCs, paving the way for the development of semiconducting quantum circuits and integrated cryogenic electronics, for scalable quantum logic control, readout, synthesis, and processing applications.

Submitted 10 April, 2024; originally announced April 2024.

arXiv:2404.06350 [pdf, other]

Rolling Shutter Correction with Intermediate Distortion Flow Estimation

Authors: Mingdeng Cao, Sidi Yang, Yujiu Yang, Yinqiang Zheng

Abstract: This paper proposes to correct the rolling shutter (RS) distorted images by estimating the distortion flow from the global shutter (GS) to RS directly. Existing methods usually perform correction using the undistortion flow from the RS to GS. They initially predict the flow from consecutive RS frames, subsequently rescaling it as the displacement fields from the RS frame to the underlying GS image using time-dependent scaling factors. Following this, RS-aware forward warping is employed to convert the RS image into its GS counterpart. Nevertheless, this strategy is prone to two shortcomings. First, the undistortion flow estimation is rendered inaccurate by merely linear scaling the flow, due to the complex non-linear motion nature. Second, RS-aware forward warping often results in unavoidable artifacts. To address these limitations, we introduce a new framework that directly estimates the distortion flow and rectifies the RS image with the backward warping operation. More specifically, we first propose a global correlation-based flow attention mechanism to estimate the initial distortion flow and GS feature jointly, which are then refined by the following coarse-to-fine decoder layers. Additionally, a multi-distortion flow prediction strategy is integrated to mitigate the issue of inaccurate flow estimation further. Experimental results validate the effectiveness of the proposed method, which outperforms state-of-the-art approaches on various benchmarks while maintaining high efficiency. The project is available at https://github.com/ljzycmd/DFRSC.

Submitted 9 April, 2024; originally announced April 2024.

Comments: CVPR2024

arXiv:2404.02845 [pdf, other]

Cross-Modal Conditioned Reconstruction for Language-guided Medical Image Segmentation

Authors: Xiaoshuang Huang, Hongxiang Li, Meng Cao, Long Chen, Chenyu You, Dong An

Abstract: Recent developments underscore the potential of textual information in enhancing learning models for a deeper understanding of medical visual semantics. However, language-guided medical image segmentation still faces a challenging issue. Previous works employ implicit and ambiguous architectures to embed textual information. This leads to segmentation results that are inconsistent with the semantics represented by the language, sometimes even diverging significantly. To this end, we propose a novel cross-modal conditioned Reconstruction for Language-guided Medical Image Segmentation (RecLMIS) to explicitly capture cross-modal interactions, which assumes that well-aligned medical visual features and medical notes can effectively reconstruct each other. We introduce conditioned interaction to adaptively predict patches and words of interest. Subsequently, they are utilized as conditioning factors for mutual reconstruction to align with regions described in the medical notes. Extensive experiments demonstrate the superiority of our RecLMIS, surpassing LViT by 3.74% mIoU on the publicly available MosMedData+ dataset and achieving an average increase of 1.89% mIoU for cross-domain tests on our QATA-CoV19 dataset. Simultaneously, we achieve a relative reduction of 20.2% in parameter count and a 55.5% decrease in computational load. The code will be available at https://github.com/ShashankHuang/RecLMIS.

Submitted 7 July, 2024; v1 submitted 3 April, 2024; originally announced April 2024.

arXiv:2403.19238 [pdf, other]

Taming Lookup Tables for Efficient Image Retouching

Authors: Sidi Yang, Binxiao Huang, Mingdeng Cao, Yatai Ji, Hanzhong Guo, Ngai Wong, Yujiu Yang

Abstract: The widespread use of high-definition screens in edge devices, such as end-user cameras, smartphones, and televisions, is spurring a significant demand for image enhancement. Existing enhancement models often optimize for high performance while falling short of reducing hardware inference time and power consumption, especially on edge devices with constrained computing and storage resources. To this end, we propose Image Color Enhancement Lookup Table (ICELUT) that adopts LUTs for extremely efficient edge inference, without any convolutional neural network (CNN). During training, we leverage pointwise (1x1) convolution to extract color information, alongside a split fully connected layer to incorporate global information. Both components are then seamlessly converted into LUTs for hardware-agnostic deployment. ICELUT achieves near-state-of-the-art performance and remarkably low power consumption. We observe that the pointwise network structure exhibits robust scalability, upkeeping the performance even with a heavily downsampled 32x32 input image. These enable ICELUT, the first-ever purely LUT-based image enhancer, to reach an unprecedented speed of 0.4ms on GPU and 7ms on CPU, at least one order faster than any CNN solution. Codes are available at https://github.com/Stephen0808/ICELUT.

Submitted 13 July, 2024; v1 submitted 28 March, 2024; originally announced March 2024.

Comments: Accepted by ECCV2024

arXiv:2403.18167 [pdf, other]

Mechanistic Understanding and Mitigation of Language Model Non-Factual Hallucinations

Authors: Lei Yu, Meng Cao, Jackie Chi Kit Cheung, Yue Dong

Abstract: State-of-the-art language models (LMs) sometimes generate non-factual hallucinations that misalign with world knowledge. To explore the mechanistic causes of these hallucinations, we create diagnostic datasets with subject-relation queries and adapt interpretability methods to trace hallucinations through internal model representations. We discover two general and distinct mechanistic causes of hallucinations shared across LMs (Llama-2, Pythia, GPT-J): 1) knowledge enrichment hallucinations: insufficient subject attribute knowledge in lower layer MLPs, and 2) answer extraction hallucinations: failure to select the correct object attribute in upper layer attention heads. We also found these two internal mechanistic causes of hallucinations are reflected in external manifestations. Based on insights from our mechanistic analysis, we propose a novel hallucination mitigation method through targeted restoration of the LM's internal fact recall pipeline, demonstrating superior performance compared to baselines.

Submitted 17 June, 2024; v1 submitted 26 March, 2024; originally announced March 2024.

arXiv:2403.17297 [pdf, other]

InternLM2 Technical Report

Authors: Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, et al. (75 additional authors not shown)

Abstract: The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI). However, replicating such advancements in open-source models has been challenging. This paper introduces InternLM2, an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, long-context modeling, and open-ended subjective evaluations through innovative pre-training and optimization techniques. The pre-training process of InternLM2 is meticulously detailed, highlighting the preparation of diverse data types including text, code, and long-context data. InternLM2 efficiently captures long-term dependencies, initially trained on 4k tokens before advancing to 32k tokens in pre-training and fine-tuning stages, exhibiting remarkable performance on the 200k "Needle-in-a-Haystack" test. InternLM2 is further aligned using Supervised Fine-Tuning (SFT) and a novel Conditional Online Reinforcement Learning from Human Feedback (COOL RLHF) strategy that addresses conflicting human preferences and reward hacking. By releasing InternLM2 models in different training stages and model sizes, we provide the community with insights into the model's evolution.

Submitted 25 March, 2024; originally announced March 2024.

arXiv:2403.15805 [pdf, other]

AirCrab: A Hybrid Aerial-Ground Manipulator with An Active Wheel

Authors: Muqing Cao, Jiayan Zhao, Xinhang Xu, Lihua Xie

Abstract: Inspired by the behavior of birds, we present AirCrab, a hybrid aerial ground manipulator (HAGM) with a single active wheel and a 3-degree of freedom (3-DoF) manipulator. AirCrab leverages a single point of contact with the ground to reduce position drift and improve manipulation accuracy. The single active wheel enables locomotion on narrow surfaces without adding significant weight to the robot. To realize accurate attitude maintenance using propellers on the ground, we design a control allocation method for AirCrab that prioritizes attitude control and dynamically adjusts the thrust input to reduce energy consumption. Experiments verify the effectiveness of the proposed control method and the gain in manipulation accuracy with ground contact. A series of operations to complete the letters 'NTU' demonstrates the capability of the robot to perform challenging hybrid aerial-ground manipulation missions.

Submitted 23 March, 2024; originally announced March 2024.

arXiv:2403.14668 [pdf, other]

Predicting Learning Performance with Large Language Models: A Study in Adult Literacy

Authors: Liang Zhang, Jionghao Lin, Conrad Borchers, John Sabatini, John Hollander, Meng Cao, Xiangen Hu

Abstract: Intelligent Tutoring Systems (ITSs) have significantly enhanced adult literacy training, a key factor for societal participation, employment opportunities, and lifelong learning. Our study investigates the application of advanced AI models, including Large Language Models (LLMs) like GPT-4, for predicting learning performance in adult literacy programs in ITSs. This research is motivated by the potential of LLMs to predict learning performance based on their inherent reasoning and computational capabilities. By using reading comprehension datasets from the ITS, AutoTutor, we evaluate the predictive capabilities of GPT-4 versus traditional machine learning methods in predicting learning performance through five-fold cross-validation techniques. Our findings show that GPT-4 presents predictive abilities competitive with traditional machine learning methods such as Bayesian Knowledge Tracing, Performance Factor Analysis, Sparse Factor Analysis Lite (SPARFA-Lite), tensor factorization and eXtreme Gradient Boosting (XGBoost). While XGBoost (trained on a local machine) outperforms GPT-4 in predictive accuracy, GPT-4-selected XGBoost and its subsequent tuning on the GPT-4 platform demonstrate superior performance compared to local machine execution. Moreover, our investigation into hyper-parameter tuning by GPT-4 versus grid-search suggests comparable performance, albeit with less stability in the automated approach, using XGBoost as the case study. Our study contributes to the field by highlighting the potential of integrating LLMs with traditional machine learning models to enhance predictive accuracy and personalize adult literacy education, setting a foundation for future research in applying LLMs within ITSs.

Submitted 4 March, 2024; originally announced March 2024.

Comments: 26TH International Conference on Human-Computer Interaction

arXiv:2403.14416 [pdf, other]

Quantum Channel Simulation in Fidelity is no more difficult than State Splitting

Authors: Michael X. Cao, Rahul Jain, Marco Tomamichel

Abstract: Characterizing the minimal communication needed for quantum channel simulation is a fundamental task in quantum information theory. In this paper, we show that, in fidelity, quantum channel simulation can be directly achieved via quantum state splitting without using a technique known as the de Finetti reduction, and thus provide a pair of tighter one-shot bounds. Using the bounds, we also recover the quantum reverse Shannon theorem in a much simpler way.

Submitted 24 June, 2024; v1 submitted 21 March, 2024; originally announced March 2024.

arXiv:2403.14173 [pdf, other]

HCTO: Optimality-Aware LiDAR Inertial Odometry with Hybrid Continuous Time Optimization for Compact Wearable Mapping System

Authors: Jianping Li, Shenghai Yuan, Muqing Cao, Thien-Minh Nguyen, Kun Cao, Lihua Xie

Abstract: Compact wearable mapping systems (WMSs) have gained significant attention due to their convenience in various applications. Specifically, they provide an efficient way to collect prior maps for 3D structure inspection and robot-based "last-mile delivery" in complex environments. However, vibrations in human motion and the uneven distribution of point cloud features in complex environments often lead to rapid drift, which is a prevalent issue when applying existing LiDAR Inertial Odometry (LIO) methods on low-cost WMSs. To address these limitations, we propose a novel LIO for WMSs based on Hybrid Continuous Time Optimization (HCTO) considering the optimality of LiDAR correspondences. First, HCTO recognizes patterns in human motion (high-frequency part, low-frequency part, and constant velocity part) by analyzing raw IMU measurements. Second, HCTO constructs hybrid IMU factors according to different motion states, which enables robust and accurate estimation against vibration-induced noise in the IMU measurements. Third, the best point correspondences are selected using optimal design to achieve real-time performance and better odometry accuracy. We conduct experiments on head-mounted WMS datasets to evaluate the performance of our system, demonstrating significant advantages over state-of-the-art methods. Video recordings of experiments can be found on the project page of HCTO: https://github.com/kafeiyin00/HCTO.

Submitted 21 March, 2024; originally announced March 2024.

arXiv:2403.13839 [pdf, other]

depyf: Open the Opaque Box of PyTorch Compiler for Machine Learning Researchers

Authors: Kaichao You, Runsheng Bai, Meng Cao, Jianmin Wang, Ion Stoica, Mingsheng Long

Abstract: PyTorch 2.x introduces a compiler designed to accelerate deep learning programs. However, for machine learning researchers, adapting to the PyTorch compiler to its full potential can be challenging. The compiler operates at the Python bytecode level, making it appear as an opaque box. To address this, we introduce depyf, a tool designed to demystify the inner workings of the PyTorch compiler. depyf decompiles bytecode generated by PyTorch back into equivalent source code, and establishes connections between in-memory code objects and their on-disk source code counterparts. This feature enables users to step through the source code line by line using debuggers, thus enhancing their understanding of the underlying processes. Notably, depyf is non-intrusive and user-friendly, primarily relying on two convenient context managers for its core functionality. The project is openly available (https://github.com/thuml/depyf) and is recognized as a PyTorch ecosystem project (https://pytorch.org/ecosystem/).

Submitted 14 March, 2024; originally announced March 2024.

Comments: 16 pages, 2 figures

arXiv:2403.12549 [pdf, other]

Treewidth of generalized Hamming graph, bipartite Kneser graph and generalized Petersen graph

Authors: Yichen Wang, Mengyu Cao, Zequn Lv, Mei Lu

Abstract: Let $t,q$ and $n$ be positive integers. Write $[q] = \{1,2,\ldots,q\}$. The generalized Hamming graph $H(t,q,n)$ is the graph whose vertex set is the cartesian product of $n$ copies of $[q]$ $(q\ge 2)$, where two vertices are adjacent if their Hamming distance is at most $t$. In particular, $H(1,q,n)$ is the well-known Hamming graph and $H(1,2,n)$ is the hypercube. In 2006, Chandran and Kavitha described the asymptotic value of $tw(H(1,q,n))$, where $tw(G)$ denotes the treewidth of $G$. In this paper, we give the exact pathwidth of $H(t,2,n)$ and show that $tw(H(t,q,n)) = \Theta(tq^n/\sqrt{n})$ when $n$ goes to infinity. Based on those results, we show that the treewidth of the bipartite Kneser graph $BK(n,k)$ is $\binom{n}{k} - 1$ when $n$ is sufficiently large relative to $k$, and bounds on $tw(BK(2k+1,k))$ are given. Moreover, we present bounds on the treewidth of the generalized Petersen graph.

Submitted 19 March, 2024; originally announced March 2024.

arXiv:2403.11183 [pdf, other]

Decoding Continuous Character-based Language from Non-invasive Brain Recordings

Authors: Cenyuan Zhang, Xiaoqing Zheng, Ruicheng Yin, Shujie Geng, Jianhan Xu, Xuan Gao, Changze Lv, Zixuan Ling, Xuanjing Huang, Miao Cao, Jianfeng Feng

Abstract: Deciphering natural language from brain activity through non-invasive devices remains a formidable challenge. Previous non-invasive decoders either require multiple experiments with identical stimuli to pinpoint cortical regions and enhance signal-to-noise ratios in brain activity, or they are limited to discerning basic linguistic elements such as letters and words. We propose a novel approach to decoding continuous language from single-trial non-invasive fMRI recordings, in which a three-dimensional convolutional network augmented with information bottleneck is developed to automatically identify responsive voxels to stimuli, and a character-based decoder is designed for the semantic reconstruction of continuous language characterized by inherent character structures. The resulting decoder can produce intelligible textual sequences that faithfully capture the meaning of perceived speech both within and across subjects, while existing decoders exhibit significantly inferior performance in cross-subject contexts. The ability to decode continuous language from single trials across subjects demonstrates the promising applications of non-invasive language brain-computer interfaces in both healthcare and neuroscience.

Submitted 19 March, 2024; v1 submitted 17 March, 2024; originally announced March 2024.

arXiv:2403.09323 [pdf, other]

E2E-MFD: Towards End-to-End Synchronous Multimodal Fusion Detection

Authors: Jiaqing Zhang, Mingxiang Cao, Xue Yang, Weiying Xie, Jie Lei, Daixun Li, Wenbo Huang, Yunsong Li

Abstract: Multimodal image fusion and object detection are crucial for autonomous driving. While current methods have advanced the fusion of texture details and semantic information, their complex training processes hinder broader applications. Addressing this challenge, we introduce E2E-MFD, a novel end-to-end algorithm for multimodal fusion detection. E2E-MFD streamlines the process, achieving high performance with a single training phase. It employs synchronous joint optimization across components to avoid suboptimal solutions tied to individual tasks. Furthermore, it implements a comprehensive optimization strategy in the gradient matrix for shared parameters, ensuring convergence to an optimal fusion detection configuration. Our extensive testing on multiple public datasets reveals E2E-MFD's superior capabilities, showcasing not only visually appealing image fusion but also impressive detection outcomes, such as a 3.9% and 2.0% mAP50 increase on horizontal object detection dataset M3FD and oriented object detection dataset DroneVehicle, respectively, compared to state-of-the-art approaches. The code is released at https://github.com/icey-zhang/E2E-MFD.

Submitted 23 May, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

arXiv:2403.03416 [pdf, other]

On discrete-time polynomial dynamical systems on hypergraphs

Authors: Shaoxuan Cui, Guofeng Zhang, Hildeberto Jardón-Kojakhmetov, Ming Cao

Abstract: This paper studies the stability of discrete-time polynomial dynamical systems on hypergraphs by utilizing the Perron-Frobenius theorem for nonnegative tensors with respect to the tensors' Z-eigenvalues and Z-eigenvectors. Firstly, for a multilinear polynomial system on a uniform hypergraph, we study the stability of the origin of the corresponding systems. Next, we extend our results to non-homogeneous polynomial systems on non-uniform hypergraphs. We confirm that the local stability of any discrete-time polynomial system is in general dominated by pairwise terms. Assuming that the origin is locally stable, we construct a conservative (but explicit) region of attraction from the system parameters. Finally, we validate our results via some numerical examples.

Submitted 5 June, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

Comments: arXiv admin note: text overlap with arXiv:2401.03652

arXiv:2403.03048 [pdf, other]

Design of Stochastic Quantizers for Privacy Preservation

Authors: Le Liu, Yu Kawano, Ming Cao

Abstract: In this paper, we examine the role of stochastic quantizers for privacy preservation. We first employ a static stochastic quantizer and investigate its corresponding privacy-preserving properties. Specifically, we demonstrate that a sufficiently large quantization step guarantees $(0, \delta)$ differential privacy. Additionally, the degradation of control performance caused by quantization is evaluated as the tracking error of output regulation. These two analyses characterize the trade-off between privacy and control performance, determined by the quantization step. This insight enables us to use quantization intentionally as a means to achieve the seemingly conflicting two goals of maintaining control performance and preserving privacy at the same time; towards this end, we further investigate a dynamic stochastic quantizer. Under a stability assumption, the dynamic stochastic quantizer can enhance privacy, more than the static one, while achieving the same control performance. We further handle the unstable case by additionally applying input Gaussian noise.

Submitted 5 March, 2024; originally announced March 2024.

Comments: 11 pages, 4 figures

arXiv:2403.02146 [pdf, ps, other]

Reinforcement Learning for Inverse Non-Cooperative Linear-Quadratic Output-feedback Differential Games

Authors: Emin Martirosyan, Ming Cao

Abstract: In this paper, we address the inverse problem for linear-quadratic differential non-cooperative games with output-feedback. Given players' stabilizing feedback laws, the goal is to find cost function parameters that lead to a game for which the observed game dynamics are at a Nash equilibrium. Using the given feedback laws, we introduce a model-based algorithm that generates cost function parameters solving the above inverse problem. We introduce a correction procedure that at each iteration of the algorithm guarantees the existence of the feedback laws, which addresses a key challenge of output-feedback control designs. As an intermediate stage of the algorithm, we have developed a procedure for the initial stabilization of the multiple-input system with output-feedback information structure. We prove convergence and stability of the algorithm, and show how to generate new games with the necessary properties without needing to run the complete algorithm repeatedly. Then the algorithm is extended to a model-free version that uses data samples generated by unknown dynamics and has the same converging and stabilizing properties as the model-based version. Finally, we show how the inverse problem can be solved in a distributed manner and provide possible extensions. Simulation results validate the effectiveness of the proposed algorithms.

Submitted 4 March, 2024; originally announced March 2024.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2403.01322 [pdf, ps, other]

A Communication-Efficient Stochastic Gradient Descent Algorithm for Distributed Nonconvex Optimization

Authors: Antai Xie, Xinlei Yi, Xiaofan Wang, Ming Cao, Xiaoqiang Ren

Abstract: This paper studies distributed nonconvex optimization problems with stochastic gradients for a multi-agent system, in which each agent aims to minimize the sum of all agents' cost functions by using local compressed information exchange. We propose a distributed stochastic gradient descent (SGD) algorithm, suitable for a general class of compressors. We show that the proposed algorithm achieves the linear speedup convergence rate $\mathcal{O}(1/\sqrt{nT})$ for smooth nonconvex functions, where $T$ and $n$ are the number of iterations and agents, respectively. If the global cost function additionally satisfies the Polyak-Łojasiewicz condition, the proposed algorithm can linearly converge to a neighborhood of the global optimum, regardless of whether the stochastic gradient is unbiased or not. Numerical experiments are carried out to verify the efficiency of our algorithm.

Submitted 2 March, 2024; originally announced March 2024.

arXiv:2403.01225 [pdf, other]

A Cost-Effective Cooperative Exploration and Inspection Strategy for Heterogeneous Aerial System

Authors: Xinhang Xu, Muqing Cao, Shenghai Yuan, Thien Hoang Nguyen, Thien-Minh Nguyen, Lihua Xie

Abstract: In this paper, we propose a cost-effective strategy for heterogeneous UAV swarm systems for cooperative aerial inspection. Unlike previous swarm inspection works, the proposed method does not rely on precise prior knowledge of the environment and can complete full 3D surface coverage of objects of any shape. In this work, agents are partitioned into teams, with each drone assigned a different task, including mapping, exploration, and inspection. Task allocation is facilitated by assigning optimal inspection volumes to each team, following best-first rules. A voxel map-based representation of the environment is used for pathfinding, and a rule-based path-planning method is the core of this approach. We achieved the best performance in all challenging experiments with the proposed approach, surpassing all benchmark methods for similar tasks across multiple evaluation trials. The proposed method is open source at https://github.com/ntu-aris/caric_baseline and used as the baseline of the Cooperative Aerial Robots Inspection Challenge at the 62nd IEEE Conference on Decision and Control 2023.

Submitted 2 March, 2024; originally announced March 2024.

Comments: Baseline method of CARIC at CDC 2023, Singapore

arXiv:2402.13942 [pdf, other]

doi 10.1063/5.0207687

The Maintenance of Coherent Vortex Topology by Lagrangian Chaos in Drift-Rossby Wave Turbulence

Authors: Norman M. Cao, Di Qi

Abstract: This work introduces the "potential vorticity bucket brigade," a mechanism for explaining the resilience of vortex structures in magnetically confined fusion plasmas and geophysical flows. Drawing parallels with zonal jet formation, we show how inhomogeneous patterns of mixing can reinforce, rather than destroy, non-zonal flow structure. We accomplish this through an exact stochastic Lagrangian representation of vorticity transport, together with a near-integrability property, which relates coherent flow topology to fluid relabeling symmetries. We demonstrate these ideas in the context of gradient-driven magnetized plasma turbulence, though the tools we develop here are model-agnostic and applicable beyond the system studied here.

Submitted 3 June, 2024; v1 submitted 21 February, 2024; originally announced February 2024.

Journal ref: Physics of Fluids 36, 061701 (2024)

arXiv:2402.13822 [pdf, other]

MSTAR: Multi-Scale Backbone Architecture Search for Timeseries Classification

Authors: Tue M. Cao, Nhat H. Tran, Hieu H. Pham, Hung T. Nguyen, Le P. Nguyen

Abstract: Most of the previous approaches to Time Series Classification (TSC) highlight the significance of receptive fields and frequencies while overlooking the time resolution. Hence, they unavoidably suffer from scalability issues, as they integrate an extensive range of receptive fields into classification models. Other methods, while adapting better to large datasets, require manual design and still cannot reach the optimal architecture due to the uniqueness of each dataset. We overcome these challenges by proposing a novel multi-scale search space and a framework for Neural Architecture Search (NAS), which addresses both the problem of frequency and time resolution, discovering the suitable scale for a specific dataset. We further show that our model can serve as a backbone to employ a powerful Transformer module with both untrained and pre-trained weights. Our search space reaches state-of-the-art performance on four datasets in four different domains while introducing more than ten highly fine-tuned models for each dataset.

Submitted 21 February, 2024; originally announced February 2024.

arXiv:2402.11907 [pdf, other]

Direct Large Language Model Alignment Through Self-Rewarding Contrastive Prompt Distillation

Authors: Aiwei Liu, Haoping Bai, Zhiyun Lu, Xiang Kong, Simon Wang, Jiulong Shan, Meng Cao, Lijie Wen

Abstract: Aligning large language models (LLMs) with human expectations without human-annotated preference data is an important problem. In this paper, we propose a method to evaluate the response preference by using the output probabilities of response pairs under contrastive prompt pairs, which could achieve better performance on LLaMA2-7B and LLaMA2-13B compared to RLAIF. Based on this, we propose an automatic alignment method, Direct Large Model Alignment (DLMA). First, we use contrastive prompt pairs to automatically generate preference data. Then, we continue to evaluate the generated preference data using contrastive prompt pairs and calculate a self-rewarding score. Finally, we use the DPO algorithm to effectively align LLMs by combining this self-rewarding score. In the experimental stage, our DLMA method could surpass the RLHF method without relying on human-annotated preference data.

Submitted 19 February, 2024; originally announced February 2024.

Comments: 24 pages, 5 pages

MSC Class: 68T50 ACM Class: I.2.7

arXiv:2402.09752 [pdf]

Vector spectrometer with Hertz-level resolution and super-recognition capability

Authors: Ting Qing, Shupeng Li, Huashan Yang, Lihan Wang, Yijie Fang, Xiaohu Tang, Meihui Cao, Jianming Lu, Jijun He, Junqiu Liu, Yueguang Lyu, Shilong Pan

Abstract: High-resolution optical spectrometers are crucial in revealing intricate characteristics of signals, determining laser frequencies, measuring physical constants, identifying substances, and advancing biosensing applications. Conventional spectrometers, however, often grapple with inherent trade-offs among spectral resolution, wavelength range, and accuracy. Furthermore, even at high resolution, resolving overlapping spectral lines during spectroscopic analyses remains a huge challenge. Here, we propose a vector spectrometer with ultrahigh resolution, combining broadband optical frequency hopping, ultrafine microwave-photonic scanning, and vector detection. A programmable frequency-hopping laser was developed, facilitating a sub-Hz linewidth and Hz-level frequency stability, an improvement of four and six orders of magnitude, respectively, compared to those of state-of-the-art tunable lasers. We also designed an asymmetric optical transmitter and receiver to eliminate measurement errors arising from modulation nonlinearity and multi-channel crosstalk. The resultant vector spectrometer exhibits an unprecedented frequency resolution of 2 Hz, surpassing the state-of-the-art by four orders of magnitude, over a 33-nm range. Through high-resolution vector analysis, we observed that group delay information enhances the separation capability of overlapping spectral lines by over 47%, significantly streamlining the real-time identification of diverse substances. Our technique fills the gap in optical spectrometers with resolutions below 10 kHz and enables vector measurement to embrace revolution in functionality.

Submitted 6 March, 2024; v1 submitted 15 February, 2024; originally announced February 2024.

Comments: 21 pages, 6 figures

arXiv:2402.05431 [pdf, ps, other]

Dynamical quantum state tomography with time-dependent channels

Authors: Meng Cao, Yu Wang

Abstract: In this paper, we establish a dynamical quantum state tomography framework. Under this framework, it is feasible to obtain complete knowledge of any unknown state of a $d$-level system via only an arbitrary operator of certain types of IC-POVMs in dimension $d$. We show that under the time-dependent average channel, we can acquire a collection of projective operators that is informationally complete (IC) and thus obtain the corresponding IC-POVMs. We show that under certain conditions, it is possible to obtain infinite families of projective operators that are IC, and obtain infinite families of corresponding IC-POVMs; otherwise, Zauner's conjecture is incorrect. We also show how to simulate a SIC-POVM on any unknown quantum state by using the time-dependent average channel.

Submitted 8 February, 2024; originally announced February 2024.

Comments: 23 pages, 1 table
