Batch SLAM with PMBM Data Association Sampling and Graph-Based Optimization

Authors: Yu Ge, Ossi Kaltiokallio, Yuxuan Xia, Ángel F. García-Fernández, Hyowon Kim, Jukka Talvitie, Mikko Valkama, Henk Wymeersch, Lennart Svensson

Abstract: Simultaneous localization and mapping (SLAM) methods need to both solve the data association (DA) problem and the joint estimation of the sensor trajectory and the map, conditioned on a DA. In this paper, we propose a novel integrated approach to solve both the DA problem and the batch SLAM problem simultaneously, combining random finite set (RFS) theory and the graph-based SLAM approach. A sampling method based on the Poisson multi-Bernoulli mixture (PMBM) density is designed for dealing with the DA uncertainty, and a graph-based SLAM solver is applied for the conditional SLAM problem. In the end, a post-processing approach is applied to merge SLAM results from different iterations. Using synthetic data, it is demonstrated that the proposed SLAM approach achieves performance close to the posterior Cramér-Rao bound, and outperforms state-of-the-art RFS-based SLAM filters in high clutter and high process noise scenarios.

Submitted 16 July, 2024; originally announced July 2024.

arXiv:2407.11046 [pdf, other]

A Survey on LoRA of Large Language Models

Authors: Yuren Mao, Yuhang Ge, Yijiang Fan, Wenyi Xu, Yu Mi, Zhonghao Hu, Yunjun Gao

Abstract: Low-Rank Adaptation (LoRA), which updates the dense neural network layers with pluggable low-rank matrices, is one of the best-performing parameter-efficient fine-tuning paradigms. Furthermore, it has significant advantages in cross-task generalization and privacy preservation. Hence, LoRA has gained much attention recently, and the volume of related literature is growing exponentially. It is necessary to conduct a comprehensive overview of the current progress on LoRA. This survey categorizes and reviews the progress from the perspectives of (1) downstream adaptation improving variants that improve LoRA's performance on downstream tasks; (2) cross-task generalization methods that mix multiple LoRA plugins to achieve cross-task generalization; (3) efficiency-improving methods that boost the computation-efficiency of LoRA; (4) data privacy-preserving methods that use LoRA in federated learning; (5) application. Besides, this survey also discusses the future directions in this field.
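
As a rough illustration of the "pluggable low-rank matrices" mentioned in the abstract above, the following minimal sketch applies a low-rank update to a frozen dense layer. It is not taken from the survey: the helper names (`matvec`, `lora_forward`), the toy matrices, and the `alpha / r` scaling convention are assumptions chosen for illustration, following the commonly used LoRA formulation y = Wx + (alpha/r)·B(Ax).

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Forward pass of a dense layer with a low-rank plug-in (illustrative).

    W: frozen weight matrix, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r)
    alpha: scaling hyperparameter (assumed convention: update scaled by alpha / r)
    """
    r = len(A)                         # rank of the update
    base = matvec(W, x)                # output of the frozen layer
    update = matvec(B, matvec(A, x))   # low-rank path: B (A x)
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy example: d_in = 3, d_out = 2, rank r = 1.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
A = [[1.0, 1.0, 1.0]]
x = [1.0, 2.0, 3.0]

# With B set to zero the plug-in is inactive and the layer is unchanged.
y_unplugged = lora_forward(W, A, [[0.0], [0.0]], x, alpha=2.0)

# A non-zero B adds the low-rank correction on top of the frozen output.
y_plugged = lora_forward(W, A, [[1.0], [0.5]], x, alpha=2.0)
```

Only `A` and `B` would be trained in this sketch, so the number of trainable values is r·(d_in + d_out) rather than d_in·d_out.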

Submitted 8 July, 2024; originally announced July 2024.

arXiv:2407.10694 [pdf]

Features Reconstruction Disentanglement Cloth-Changing Person Re-Identification

Authors: Zhihao Chen, Yiyuan Ge, Qing Yue

Abstract: Cloth-changing person re-identification (CC-ReID) aims to retrieve specific pedestrians in a cloth-changing scenario. Its main challenge is to disentangle the clothing-related and clothing-unrelated features. Most existing approaches force the model to learn clothing-unrelated features by changing the color of the clothes. However, due to the lack of ground truth, these methods inevitably introduce noise, which destroys the discriminative features and leads to an uncontrollable disentanglement process. In this paper, we propose a new person re-identification network called features reconstruction disentanglement ReID (FRD-ReID), which can controllably decouple the clothing-unrelated and clothing-related features. Specifically, we first introduce the human parsing mask as the ground truth of the reconstruction process. At the same time, we propose the far away attention (FAA) mechanism and the person contour attention (PCA) mechanism for clothing-unrelated features and pedestrian contour features to improve the feature reconstruction efficiency. In the testing phase, we directly discard the clothing-related features for inference, which leads to a controllable disentanglement process. We conducted extensive experiments on the PRCC, LTCC, and Vc-Clothes datasets and demonstrated that our method outperforms existing state-of-the-art methods.

Submitted 15 July, 2024; originally announced July 2024.

Comments: 2024 International Conference on Intelligent Computing

arXiv:2407.09868 [pdf]

Separation of Sodium Signals Between Mono- and Bi-Exponential T2 Decays via Multi-TE Single-Quantum Sodium (23Na) MRI

Authors: Yongxian Qian, Ying-Chia Lin, Xingye Chen, Tiejun Zhao, Karthik Lakshmanan, Yulin Ge, Yvonne W. Lui, Fernando E. Boada

Abstract: Purpose. It is a long-standing pursuit in sodium (23Na) MRI to separate signals between mono- and bi-exponential T2 decays in the human brain, due to lack of clinically translational solutions under the restriction of intrinsically low signal-to-noise ratio (SNR). Here we propose a new technique called multi-TE single-quantum (MSQ) sodium MRI to address the challenge. Methods. We exploit an intrinsic difference in T2 decay between mono- and bi-exponential sodium signals by acquiring SQ images at multiple TEs and performing voxel-based matrix inversions on these SQ images. The MSQ method was then investigated on numerical models, agar phantoms, and human brains for feasibility on clinical scanners at 3T. Results. The whole-brain T2* spectrum of FID signals from the study subjects showed sparse peaks (2 to 4 peaks), suggesting a global set of T2* values (T2*fr, T2*bs, T2*bl) applicable to the separation. The simulations indicated a small impact (3.9 to 5.6 percent) of T2* variation on accuracy of the separation, and the phantom experiments showed a high accuracy of the separation, 95.8 percent for mono-T2 sodium and 72.5 to 80.4 percent for bi-T2 sodium. The human studies demonstrated feasibility of the separation and the potential to highlight abnormal brain regions in the bi-T2 sodium images. Conclusion. The MSQ technique has been shown, via the numerical simulations, phantom experiments, and human brain studies, to be able to separate mono- and bi-T2 sodium signals using a two-TE sampling scheme and a global set of T2* values. However, MSQ has limitations and requires caution in practice. Keywords. sodium MRI, single quantum MRI, triple quantum MRI, neuroimaging, neurodegeneration

Submitted 13 July, 2024; originally announced July 2024.

Comments: 37 pages and 14 figures

arXiv:2407.08683 [pdf, other]

SEED-Story: Multimodal Long Story Generation with Large Language Model

Authors: Shuai Yang, Yuying Ge, Yang Li, Yukang Chen, Yixiao Ge, Ying Shan, Yingcong Chen

Abstract: With the remarkable advancements in image generation and open-form text generation, the creation of interleaved image-text content has become an increasingly intriguing field. Multimodal story generation, characterized by producing narrative texts and vivid images in an interleaved manner, has emerged as a valuable and practical task with broad applications. However, this task poses significant challenges, as it necessitates the comprehension of the complex interplay between texts and images, and the ability to generate long sequences of coherent, contextually relevant texts and visuals. In this work, we propose SEED-Story, a novel method that leverages a Multimodal Large Language Model (MLLM) to generate extended multimodal stories. Our model, built upon the powerful comprehension capability of MLLM, predicts text tokens as well as visual tokens, which are subsequently processed with an adapted visual de-tokenizer to produce images with consistent characters and styles. We further propose a multimodal attention sink mechanism to enable the generation of stories with up to 25 sequences (only 10 for training) in a highly efficient autoregressive manner. Additionally, we present a large-scale and high-resolution dataset named StoryStream for training our model and quantitatively evaluating the task of multimodal story generation in various aspects.

Submitted 11 July, 2024; originally announced July 2024.

Comments: Our models, codes and datasets are released in https://github.com/TencentARC/SEED-Story

arXiv:2407.03842 [pdf, other]

Beyond Viewpoint: Robust 3D Object Recognition under Arbitrary Views through Joint Multi-Part Representation

Authors: Linlong Fan, Ye Huang, Yanqi Ge, Wen Li, Lixin Duan

Abstract: Existing view-based methods excel at recognizing 3D objects from predefined viewpoints, but their exploration of recognition under arbitrary views is limited. This is a challenging and realistic setting because each object has different viewpoint positions and quantities, and their poses are not aligned. However, most view-based methods, which aggregate multiple view features to obtain a global feature representation, struggle to address 3D object recognition under arbitrary views. Due to the unaligned inputs from arbitrary views, it is challenging to robustly aggregate features, leading to performance degradation. In this paper, we introduce a novel Part-aware Network (PANet), which is a part-based representation, to address these issues. This part-based representation aims to localize and understand different parts of 3D objects, such as airplane wings and tails. It has properties such as viewpoint invariance and rotation robustness, which give it an advantage in addressing the 3D object recognition problem under arbitrary views. Our results on benchmark datasets clearly demonstrate that our proposed method outperforms existing view-based aggregation baselines for the task of 3D object recognition under arbitrary views, even surpassing most fixed viewpoint methods.

Submitted 4 July, 2024; originally announced July 2024.

Comments: ECCV 2024

arXiv:2407.00736 [pdf, other]

Quantum Circuit Synthesis and Compilation Optimization: Overview and Prospects

Authors: Yan Ge, Wu Wenjie, Chen Yuheng, Pan Kaisen, Lu Xudong, Zhou Zixiang, Wang Yuhan, Wang Ruocheng, Yan Junchi

Abstract: Quantum computing is regarded as a promising paradigm that may overcome the current computational power bottlenecks in the post-Moore era. The increasing maturity of quantum processors, especially superconducting ones, provides more possibilities for the development and implementation of quantum algorithms. As the crucial stages for quantum algorithm implementation, the logic circuit design and quantum compiling have also received significant attention, which covers key technologies such as quantum logic circuit synthesis (also widely known as quantum architecture search) and optimization, as well as qubit mapping and routing. Recent studies suggest that the scale and precision of related algorithms are steadily increasing, especially with the integration of artificial intelligence methods. In this survey, we systematically review and summarize a vast body of literature, exploring the feasibility of an integrated design and optimization scheme that spans from the algorithmic level to quantum hardware, combining the steps of logic circuit design and compilation optimization. Leveraging the exceptional cognitive and learning capabilities of AI algorithms, one can reduce manual design costs, enhance the precision and efficiency of execution, and facilitate the implementation and validation of the superiority of quantum algorithms on hardware.

Submitted 30 June, 2024; originally announced July 2024.

Comments: 32 pages, 3 figures, 3 tables

arXiv:2406.19311 [pdf, other]

Zero-Query Adversarial Attack on Black-box Automatic Speech Recognition Systems

Authors: Zheng Fang, Tao Wang, Lingchen Zhao, Shenyi Zhang, Bowen Li, Yunjie Ge, Qi Li, Chao Shen, Qian Wang

Abstract: In recent years, extensive research has been conducted on the vulnerability of ASR systems, revealing that black-box adversarial example attacks pose significant threats to real-world ASR systems. However, most existing black-box attacks rely on queries to the target ASRs, which is impractical when queries are not permitted. In this paper, we propose ZQ-Attack, a transfer-based adversarial attack on ASR systems in the zero-query black-box setting. Through a comprehensive review and categorization of modern ASR technologies, we first meticulously select surrogate ASRs of diverse types to generate adversarial examples. Following this, ZQ-Attack initializes the adversarial perturbation with a scaled target command audio, rendering it relatively imperceptible while maintaining effectiveness. Subsequently, to achieve high transferability of adversarial perturbations, we propose a sequential ensemble optimization algorithm, which iteratively optimizes the adversarial perturbation on each surrogate model, leveraging collaborative information from other models. We conduct extensive experiments to evaluate ZQ-Attack. In the over-the-line setting, ZQ-Attack achieves a 100% success rate of attack (SRoA) with an average signal-to-noise ratio (SNR) of 21.91dB on 4 online speech recognition services, and attains an average SRoA of 100% and SNR of 19.67dB on 16 open-source ASRs. For commercial intelligent voice control devices, ZQ-Attack also achieves a 100% SRoA with an average SNR of 15.77dB in the over-the-air setting.

Submitted 27 June, 2024; originally announced June 2024.

Comments: To appear in the Proceedings of The ACM Conference on Computer and Communications Security (CCS), 2024

arXiv:2406.18165 [pdf]

Prediction of superconductivity in Bilayer Kagome borophene

Authors: Yifan Han, Yue Shang, Wenhui Wan, Yong Liu, Yanfeng Ge

Abstract: The element boron has long been central to two-dimensional superconducting materials, and numerous studies have demonstrated the presence of superconductivity in various boron-based structures. Recent work introduced a new variant: Bilayer Kagome borophene, characterized by its bilayer Kagome lattice with van Hove singularity. Using first-principles calculations, our research investigates the unique electronic structure and superconducting properties of Bilayer Kagome borophene (BK-borophene). BK-borophene is identified as a single-gap superconductor with an initial superconducting transition temperature (Tc) of 11.0 K. By strategically doping the material to align its Fermi level with the van Hove singularity, Tc is significantly enhanced to 30.0 K. The results contribute to the existing understanding of BK-borophene, highlighting its potential as a member of the expanding family of two-dimensional superconducting materials.

Submitted 26 June, 2024; originally announced June 2024.

Comments: 12 pages, 6 figures

arXiv:2406.18008 [pdf, other]

Rate-Distortion-Perception Tradeoff for Gaussian Vector Sources

Authors: Jingjing Qian, Sadaf Salehkalaibar, Jun Chen, Ashish Khisti, Wei Yu, Wuxian Shi, Yiqun Ge, Wen Tong

Abstract: This paper studies the rate-distortion-perception (RDP) tradeoff for a Gaussian vector source coding problem where the goal is to compress the multi-component source subject to distortion and perception constraints. The purpose of imposing a perception constraint is to ensure visually pleasing reconstructions. This paper studies this RDP setting with either the Kullback-Leibler (KL) divergence or Wasserstein-2 metric as the perception loss function, and shows that for Gaussian vector sources, jointly Gaussian reconstructions are optimal. We further demonstrate that the optimal tradeoff can be expressed as an optimization problem, which can be explicitly solved. An interesting property of the optimal solution is as follows. Without the perception constraint, the traditional reverse water-filling solution for characterizing the rate-distortion (RD) tradeoff of a Gaussian vector source states that the optimal rate allocated to each component depends on a constant, called the water-level. If the variance of a specific component is below the water-level, it is assigned a zero compression rate. However, with active distortion and perception constraints, we show that the optimal rates allocated to the different components are always positive. Moreover, the water-levels that determine the optimal rate allocation for different components are unequal. We further treat the special case of perceptually perfect reconstruction and study its RDP function in the high-distortion and low-distortion regimes to obtain insight into the structure of the optimal solution.
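
The classical reverse water-filling rule that the abstract above contrasts against can be sketched as follows. This covers only the traditional rate-distortion case (squared-error distortion, independent Gaussian components, no perception constraint); it is not the paper's RDP solution. The function name, the bisection search, and the example numbers are assumptions made for illustration.

```python
import math


def reverse_water_filling(variances, total_distortion):
    """Classical reverse water-filling for independent Gaussian components.

    Finds the water level theta with sum_i min(theta, var_i) = total_distortion,
    then assigns rate_i = max(0, 0.5 * log2(var_i / theta)) bits to component i.
    Returns (theta, rates).
    """
    if not 0.0 < total_distortion <= sum(variances):
        raise ValueError("total_distortion must lie in (0, sum(variances)]")

    # The distortion used grows monotonically with theta, so bisection works.
    lo, hi = 0.0, max(variances)
    for _ in range(200):
        theta = (lo + hi) / 2.0
        used = sum(min(theta, v) for v in variances)
        if used < total_distortion:
            lo = theta
        else:
            hi = theta
    theta = (lo + hi) / 2.0

    # Components whose variance is below the water level receive zero rate.
    rates = [max(0.0, 0.5 * math.log2(v / theta)) for v in variances]
    return theta, rates


# Example: three components, total distortion budget 1.5.
# Here 2 * theta + 0.25 = 1.5, so the water level is 0.625.
theta, rates = reverse_water_filling([4.0, 1.0, 0.25], 1.5)
```

In this example the third component (variance 0.25) lies below the water level and is assigned zero rate, which is the behaviour the abstract says no longer holds once the perception constraint is active.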

Submitted 25 June, 2024; originally announced June 2024.

arXiv:2406.17950 [pdf, other]

V2X Sidelink Positioning in FR1: From Ray-Tracing and Channel Estimation to Bayesian Tracking

Authors: Yu Ge, Maximilian Stark, Musa Furkan Keskin, Hui Chen, Guillaume Jornod, Thomas Hansen, Frank Hofmann, Henk Wymeersch

Abstract: Sidelink positioning research predominantly focuses on the snapshot positioning problem, often within the mmWave band. Only a limited number of studies have delved into vehicle-to-anything (V2X) tracking within sub-6 GHz bands. In this paper, we investigate the V2X sidelink tracking challenges over sub-6 GHz frequencies. We propose a Kalman-filter-based tracking approach that leverages the estimated error covariance lower bounds (EECLBs) as measurement covariance, alongside a gating method to augment tracking performance. Through simulations employing ray-tracing data and super-resolution channel parameter estimation, we validate the feasibility of sidelink tracking using our proposed tracking filter with two novel EECLBs. Additionally, we demonstrate the efficacy of the gating method in identifying line-of-sight paths and enhancing tracking performance.

Submitted 30 June, 2024; v1 submitted 25 June, 2024; originally announced June 2024.

arXiv:2406.12671 [pdf, other]

GeoBench: Benchmarking and Analyzing Monocular Geometry Estimation Models

Authors: Yongtao Ge, Guangkai Xu, Zhiyue Zhao, Libo Sun, Zheng Huang, Yanlong Sun, Hao Chen, Chunhua Shen

Abstract: Recent advances in discriminative and generative pretraining have yielded geometry estimation models with strong generalization capabilities. While discriminative monocular geometry estimation methods rely on large-scale fine-tuning data to achieve zero-shot generalization, several generative-based paradigms show the potential of achieving impressive generalization performance on unseen scenes by leveraging pre-trained diffusion models and fine-tuning on even a small scale of synthetic training data. Frustratingly, these models are trained with different recipes on different datasets, making it hard to find out the critical factors that determine the evaluation performance. Besides, current geometry evaluation benchmarks have two main drawbacks that may prevent the development of the field, i.e., limited scene diversity and unfavorable label quality. To resolve the above issues, (1) we build fair and strong baselines in a unified codebase for evaluating and analyzing the geometry estimation models; (2) we evaluate monocular geometry estimators on more challenging benchmarks for the geometry estimation task with diverse scenes and high-quality annotations. Our results reveal that discriminative models pre-trained on large data, such as DINOv2, can outperform generative counterparts with a small amount of high-quality synthetic data under the same training configuration, which suggests that fine-tuning data quality is a more important factor than the data scale and model architecture. Our observation also raises a question: if simply fine-tuning a general vision model such as DINOv2 using a small amount of synthetic depth data produces SOTA results, do we really need complex generative models for depth estimation? We believe this work can propel advancements in geometry estimation tasks as well as a wide range of downstream applications.

Submitted 20 June, 2024; v1 submitted 18 June, 2024; originally announced June 2024.

Comments: Code and Benchmark are available at: https://github.com/aim-uofa/GeoBench

arXiv:2406.12275 [pdf, other]

VoCo-LLaMA: Towards Vision Compression with Large Language Models

Authors: Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Ying Shan, Yansong Tang

Abstract: Vision-Language Models (VLMs) have achieved remarkable success in various multi-modal tasks, but they are often bottlenecked by the limited context window and high computational cost of processing high-resolution image inputs and videos. Vision compression can alleviate this problem by reducing the vision token count. Previous approaches compress vision tokens with external modules and force LLMs to understand the compressed ones, leading to visual information loss. However, the LLMs' understanding paradigm of vision tokens is not fully utilised in the compression learning process. We propose VoCo-LLaMA, the first approach to compress vision tokens using LLMs. By introducing Vision Compression tokens during the vision instruction tuning phase and leveraging attention distillation, our method distills how LLMs comprehend vision tokens into their processing of VoCo tokens. VoCo-LLaMA facilitates effective vision compression and improves the computational efficiency during the inference stage. Specifically, our method achieves minimal performance loss with a compression ratio of 576×, resulting in up to 94.8% fewer FLOPs and 69.6% acceleration in inference time. Furthermore, through continuous training using time-series compressed token sequences of video frames, VoCo-LLaMA demonstrates the ability to understand temporal correlations, outperforming previous methods on popular video question-answering benchmarks. Our approach presents a promising way to unlock the full potential of VLMs' contextual window, enabling more scalable multi-modal applications. The project page, along with the associated code, can be accessed via https://yxxxb.github.io/VoCo-LLaMA-page/.

Submitted 18 June, 2024; originally announced June 2024.

Comments: 18 pages, 5 figures

arXiv:2406.11602 [pdf, other]

Association between a Failed Prominence Eruption and the Drainage of Mass from Another Prominence

Authors: Jianchao Xue, Li Feng, Hui Li, Ping Zhang, Jun Chen, Guanglu Shi, Kaifan Ji, Ye Qiu, Chuan Li, Lei Lu, Beili Ying, Ying Li, Yu Huang, Youping Li, Jingwei Li, Jie Zhao, Dechao Song, Shuting Li, Zhengyuan Tian, Yingna Su, Qingmin Zhang, Yunyi Ge, Jiahui Shan, Qiao Li, Gen Li, et al. (9 additional authors not shown)

Abstract: Sympathetic eruptions of solar prominences have been studied for decades; however, it is usually difficult to identify their causal links. Here we present two failed prominence eruptions on 26 October 2022 and explore their connections. Using stereoscopic observations, we find that the south prominence (PRO-S) erupts with untwisting motions, flare ribbons occur underneath, and new connections are formed during the eruption. The north prominence (PRO-N) rises up along with PRO-S, and its upper part disappears due to catastrophic mass draining along an elongated structure after the failed eruption of PRO-S. We suggest that the eruption of PRO-S initiates due to a kink instability, further rises up, and fails to erupt due to reconnection with surrounding fields. The elongated structure connecting PRO-N overlies PRO-S, which causes the rising up of PRO-N along with PRO-S and mass drainage after the PRO-S eruption. This study suggests that a prominence may end its life through mass drainage forced by an eruption underneath.

Submitted 20 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

Comments: 15 pages, 7 figures, has been accepted by Solar Physics

arXiv:2406.02946 [pdf, other]

doi 10.3847/1538-4365/ad37bc

CAMEL. II. A 3D Coronal Mass Ejection Catalog Based on Coronal Mass Ejection Automatic Detection with Deep Learning

Authors: Jiahui Shan, Huapeng Zhang, Lei Lu, Yan Zhang, Li Feng, Yunyi Ge, Jianchao Xue, Shuting Li

Abstract: Coronal mass ejections (CMEs) are major drivers of geomagnetic storms, which may cause severe space weather effects. Automating the detection, tracking, and three-dimensional (3D) reconstruction of CMEs is important for operational predictions of CME arrivals. The COR1 coronagraphs on board the Solar Terrestrial Relations Observatory spacecraft have facilitated extensive polarization observations, which are very suitable for the establishment of a 3D CME system. We have developed such a 3D system comprising four modules: classification, segmentation, tracking, and 3D reconstructions. We generalize our previously pretrained classification model to classify COR1 coronagraph images. Subsequently, as there are no publicly available CME segmentation data sets, we manually annotate the structural regions of CMEs using Large Angle and Spectrometric Coronagraph C2 observations. Leveraging transformer-based models, we achieve state-of-the-art results in CME segmentation. Furthermore, we improve the tracking algorithm to solve the difficult separation task of multiple CMEs. In the final module, tracking results, combined with the polarization ratio technique, are used to develop the first single-view 3D CME catalog without requiring manual mask annotation. Our method provides higher precision in the automatic 2D CME catalog and more reliable physical parameters of CMEs, including 3D propagation direction and speed. The aforementioned 3D CME system can be applied to any coronagraph data with the capability of polarization measurements.

Submitted 5 June, 2024; originally announced June 2024.

arXiv:2406.02395 [pdf, other]

GrootVL: Tree Topology is All You Need in State Space Model

Authors: Yicheng Xiao, Lin Song, Shaoli Huang, Jiangshan Wang, Siyu Song, Yixiao Ge, Xiu Li, Ying Shan

Abstract: The state space models, employing recursively propagated features, demonstrate strong representation capabilities comparable to Transformer models and superior efficiency. However, constrained by the inherent geometric constraints of sequences, they still fall short in modeling long-range dependencies. To address this issue, we propose the GrootVL network, which first dynamically generates a tree topology based on spatial relationships and input features. Then, feature propagation is performed based on this graph, thereby breaking the original sequence constraints to achieve stronger representation capabilities. Additionally, we introduce a linear complexity dynamic programming algorithm to enhance long-range interactions without increasing computational cost. GrootVL is a versatile multimodal framework that can be applied to both visual and textual tasks. Extensive experiments demonstrate that our method significantly outperforms existing structured state space models on image classification, object detection and segmentation. Besides, by fine-tuning large language models, our approach achieves consistent improvements in multiple textual tasks at minor training cost.

Submitted 4 June, 2024; originally announced June 2024.

Comments: The code is available at https://github.com/EasonXiao-888/GrootVL

arXiv:2406.01371 [pdf, other]

An Origami-Inspired Endoscopic Capsule with Tactile Perception for Early Tissue Anomaly Detection

Authors: Yukun Ge, Rui Zong, Xiaoshuai Zhang, Thrishantha Nanayakkara

Abstract: Video Capsule Endoscopy (VCE) is currently one of the most effective methods for detecting intestinal diseases. However, it is challenging to detect early-stage small nodules with this method because they lack obvious color or shape features. In this letter, we present a new origami capsule endoscope to detect early small intestinal nodules using tactile sensing. Four soft tactile sensors made out of piezoresistive material feed four channels of phase-shifted data that are processed using a particle filter. The particle filter uses an importance assignment template designed using experimental data from five known sizes of nodules. Moreover, the proposed capsule can use shape changes to passively move forward or backward under peristalsis, enabling it to reach any position in the intestine for detection. Experimental results show that the proposed capsule can detect nodules of more than 3 mm diameter with 100% accuracy.

Submitted 3 June, 2024; originally announced June 2024.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2405.19519 [pdf, other]

Two-layer retrieval augmented generation framework for low-resource medical question-answering: proof of concept using Reddit data

Authors: Sudeshna Das, Yao Ge, Yuting Guo, Swati Rajwal, JaMor Hairston, Jeanne Powell, Drew Walker, Snigdha Peddireddy, Sahithi Lakamana, Selen Bozkurt, Matthew Reyna, Reza Sameni, Yunyu Xiao, Sangmi Kim, Rasheeta Chandler, Natalie Hernandez, Danielle Mowery, Rachel Wightman, Jennifer Love, Anthony Spadaro, Jeanmarie Perrone, Abeed Sarker

Abstract: Retrieval augmented generation (RAG) provides the capability to constrain generative model outputs, and mitigate the possibility of hallucination, by providing relevant in-context text. The number of tokens a generative large language model (LLM) can incorporate as context is finite, thus limiting the volume of knowledge from which to generate an answer. We propose a two-layer RAG framework for query-focused answer generation and evaluate a proof-of-concept for this framework in the context of query-focused summary generation from social media forums, focusing on emerging drug-related information. The evaluations demonstrate the effectiveness of the two-layer framework in resource-constrained settings, enabling researchers to obtain near real-time data from users.

Submitted 29 May, 2024; originally announced May 2024.

arXiv:2405.15287 [pdf, other]

StyleMaster: Towards Flexible Stylized Image Generation with Diffusion Models

Authors: Chengming Xu, Kai Hu, Donghao Luo, Jiangning Zhang, Wei Li, Yanhao Ge, Chengjie Wang

Abstract: Stylized Text-to-Image Generation (STIG) aims to generate images based on text prompts and style reference images. In this paper, we propose a novel framework dubbed StyleMaster for this task by leveraging pretrained Stable Diffusion (SD), which tries to solve previous problems such as insufficient style and inconsistent semantics. The enhancement lies in two novel modules, namely the multi-source style embedder and the dynamic attention adapter. In order to provide SD with better style embeddings, we propose the multi-source style embedder, which considers both global and local level visual information along with textual information, providing both complementary style-related and semantic-related knowledge. Additionally, aiming for a better balance between the adapter capacity and semantic control, the proposed dynamic attention adapter is applied to the diffusion UNet, in which adaptation weights are dynamically calculated based on the style embeddings. Two objective functions are introduced to optimize the model together with the denoising loss, which can further enhance semantic and style consistency. Extensive experiments demonstrate the superiority of StyleMaster over existing methods, rendering images with variable target styles while successfully maintaining the semantic information from the text prompts.

Submitted 24 May, 2024; originally announced May 2024.

arXiv:2405.12970 [pdf, ps, other]

Face Adapter for Pre-Trained Diffusion Models with Fine-Grained ID and Attribute Control

Authors: Yue Han, Junwei Zhu, Keke He, Xu Chen, Yanhao Ge, Wei Li, Xiangtai Li, Jiangning Zhang, Chengjie Wang, Yong Liu

Abstract: Current face reenactment and swapping methods mainly rely on GAN frameworks, but recent focus has shifted to pre-trained diffusion models for their superior generation capabilities. However, training these models is resource-intensive, and the results have not yet achieved satisfactory performance levels. To address this issue, we introduce Face-Adapter, an efficient and effective adapter designed for high-precision and high-fidelity face editing for pre-trained diffusion models. We observe that both face reenactment and swapping tasks essentially involve combinations of target structure, ID and attribute. We aim to sufficiently decouple the control of these factors to achieve both tasks in one model. Specifically, our method contains: 1) a Spatial Condition Generator that provides precise landmarks and background; 2) a Plug-and-play Identity Encoder that transfers face embeddings to the text space by a transformer decoder; 3) an Attribute Controller that integrates spatial conditions and detailed attributes. Face-Adapter achieves comparable or even superior performance in terms of motion control precision, ID retention capability, and generation quality compared to fully fine-tuned face reenactment/swapping models. Additionally, Face-Adapter seamlessly integrates with various StableDiffusion models.

Submitted 8 July, 2024; v1 submitted 21 May, 2024; originally announced May 2024.

Comments: Accepted to ECCV2024; Project Page: https://faceadapter.github.io/face-adapter.github.io/

arXiv:2405.11896 [pdf, other]

doi 10.1051/0004-6361/202450296

The Milky Way Atlas for Linear Filaments

Authors: Ke Wang, Yifei Ge, Tapas Baug

Abstract: Filamentary structure is important for the ISM and star formation. Galactic distribution of filaments may regulate the star formation rate in the Milky Way. However, interstellar filaments are intrinsically complex, making them difficult to study quantitatively. Here, we focus on linear filaments, the simplest morphology that can be treated as building blocks of any filamentary structure. We present the first catalog of 42 "straight-line" filaments across the full Galactic plane, identified by clustering of far-IR Herschel HiGAL clumps in position-position-velocity space. We use molecular line cubes to investigate the dynamics along the filaments; compare the filaments with Galactic spiral arms; and compare ambient magnetic fields with the filaments' orientation. The selected filaments show extreme linearity ($>$10), aspect ratio (7-48), and velocity coherence over a length of 3-40 pc (mostly $>$10 pc). About 1/3 of them are associated with spiral arms, but only one is located in an arm center, a.k.a. the "bones" of the Milky Way. A few of them extend perpendicular to the Galactic plane, and none is located in the Central Molecular Zone (CMZ) near the Galactic center. Along the filaments, prevalent periodic oscillation (both in velocity and density) is consistent with gas flows channeled by the filaments and feeding the clumps which harbor diverse star formation activities. No correlation is found between the filament orientations and the global magnetic field lines measured by Planck.
This work highlights some of the fundamental properties of molecular filaments and provides a golden sample for follow-up studies on star formation, ISM structure, and Milky Way structure.

Submitted 20 May, 2024; originally announced May 2024.

Comments: Accepted to A&A Letters. 15 pages, 6 figures, 1 table

Journal ref: A&A 686, L11 (2024)

arXiv:2405.09546 [pdf, other]

BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation

Authors: Yunhao Ge, Yihe Tang, Jiashu Xu, Cem Gokmen, Chengshu Li, Wensi Ai, Benjamin Jose Martinez, Arman Aydin, Mona Anvari, Ayush K Chakravarthy, Hong-Xing Yu, Josiah Wong, Sanjana Srivastava, Sharon Lee, Shengxin Zha, Laurent Itti, Yunzhu Li, Roberto Martín-Martín, Miao Liu, Pengchuan Zhang, Ruohan Zhang, Li Fei-Fei, Jiajun Wu

Abstract: The systematic evaluation and understanding of computer vision models under varying conditions require large amounts of data with comprehensive and customized labels, which real-world vision datasets rarely satisfy. While current synthetic data generators offer a promising alternative, particularly for embodied AI tasks, they often fall short for computer vision tasks due to low asset and rendering quality, limited diversity, and unrealistic physical properties. We introduce the BEHAVIOR Vision Suite (BVS), a set of tools and assets to generate fully customized synthetic data for systematic evaluation of computer vision models, based on the newly developed embodied AI benchmark, BEHAVIOR-1K. BVS supports a large number of adjustable parameters at the scene level (e.g., lighting, object placement), the object level (e.g., joint configuration, attributes such as "filled" and "folded"), and the camera level (e.g., field of view, focal length). Researchers can arbitrarily vary these parameters during data generation to perform controlled experiments. We showcase three example application scenarios: systematically evaluating the robustness of models across different continuous axes of domain shift, evaluating scene understanding models on the same set of images, and training and evaluating simulation-to-real transfer for a novel vision task: unary and binary state prediction. Project website: https://behavior-vision-suite.github.io/

Submitted 15 May, 2024; originally announced May 2024.

Comments: CVPR 2024 (Highlight). Project website: https://behavior-vision-suite.github.io/

arXiv:2405.07990 [pdf, other]

Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots

Authors: Chengyue Wu, Yixiao Ge, Qiushan Guo, Jiahao Wang, Zhixuan Liang, Zeyu Lu, Ying Shan, Ping Luo

Abstract: The remarkable progress of Multi-modal Large Language Models (MLLMs) has attracted significant attention due to their superior performance in visual contexts. However, their capabilities in turning visual figures into executable code have not been evaluated thoroughly. To address this, we introduce Plot2Code, a comprehensive visual coding benchmark designed for a fair and in-depth assessment of MLLMs. We carefully collect 132 manually selected high-quality matplotlib plots across six plot types from publicly available matplotlib galleries. For each plot, we carefully offer its source code and a descriptive instruction summarized by GPT-4. This approach enables Plot2Code to extensively evaluate MLLMs' code capabilities across various input modalities. Furthermore, we propose three automatic evaluation metrics, including code pass rate, text-match ratio, and GPT-4V overall rating, for a fine-grained assessment of the output code and rendered images. Instead of simply judging pass or fail, we employ GPT-4V to make an overall judgement between the generated and reference images, which has been shown to be consistent with human evaluation. The evaluation results, which include analyses of 14 MLLMs such as the proprietary GPT-4V, Gemini-Pro, and the open-sourced Mini-Gemini, highlight the substantial challenges presented by Plot2Code. With Plot2Code, we reveal that most existing MLLMs struggle with visual coding for text-dense plots, heavily relying on textual instruction.
We hope that the evaluation results from Plot2Code on visual coding will guide the future development of MLLMs. All data involved with Plot2Code are available at https://huggingface.co/datasets/TencentARC/Plot2Code.

Submitted 13 May, 2024; originally announced May 2024.

arXiv:2405.07027 [pdf, other]

TD-NeRF: Novel Truncated Depth Prior for Joint Camera Pose and Neural Radiance Field Optimization

Authors: Zhen Tan, Zongtan Zhou, Yangbing Ge, Zi Wang, Xieyuanli Chen, Dewen Hu

Abstract: The reliance on accurate camera poses is a significant barrier to the widespread deployment of Neural Radiance Fields (NeRF) models for 3D reconstruction and SLAM tasks. The existing method introduces monocular depth priors to jointly optimize the camera poses and NeRF, which fails to fully exploit the depth priors and neglects the impact of their inherent noise. In this paper, we propose Truncated Depth NeRF (TD-NeRF), a novel approach that enables training NeRF from unknown camera poses - by jointly optimizing learnable parameters of the radiance field and camera poses. Our approach explicitly utilizes monocular depth priors through three key advancements: 1) we propose a novel depth-based ray sampling strategy based on the truncated normal distribution, which improves the convergence speed and accuracy of pose estimation; 2) to circumvent local minima and refine depth geometry, we introduce a coarse-to-fine training strategy that progressively improves the depth precision; 3) we propose a more robust inter-frame point constraint that enhances robustness against depth noise during training. The experimental results on three datasets demonstrate that TD-NeRF achieves superior performance in the joint optimization of camera pose and NeRF, surpassing prior works, and generates more accurate depth geometry. The implementation of our method has been released at https://github.com/nubot-nudt/TD-NeRF.
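Editor's illustration: the abstract mentions ray sampling based on a truncated normal distribution. The sketch below is not the TD-NeRF implementation; it only shows, under assumed names and parameters (`prior_depth`, `sigma`, `near`, `far` are all hypothetical), how depths along one ray could be drawn from a normal distribution centred on a monocular depth prior and truncated to the ray bounds, here by simple rejection sampling.

```python
import random


def sample_truncated_normal(mean, sigma, low, high, rng):
    """Draw one value from N(mean, sigma^2) restricted to [low, high].

    Uses rejection sampling: redraw until the value falls inside the bounds.
    """
    if not low < high:
        raise ValueError("low must be smaller than high")
    while True:
        value = rng.gauss(mean, sigma)
        if low <= value <= high:
            return value


def sample_ray_depths(prior_depth, sigma, near, far, n_samples, rng=None):
    """Return sorted sample depths concentrated around a (hypothetical) depth prior."""
    rng = rng or random.Random()
    depths = [
        sample_truncated_normal(prior_depth, sigma, near, far, rng)
        for _ in range(n_samples)
    ]
    return sorted(depths)


if __name__ == "__main__":
    depths = sample_ray_depths(prior_depth=2.0, sigma=0.3, near=0.5, far=6.0,
                               n_samples=8, rng=random.Random(0))
    print(depths)
```

Rejection sampling is chosen only for brevity; it becomes slow when the prior lies far outside the bounds, in which case an inverse-CDF sampler would be the usual alternative.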

Submitted 11 May, 2024; originally announced May 2024.

arXiv:2405.06145 [pdf, other]

Reddit-Impacts: A Named Entity Recognition Dataset for Analyzing Clinical and Social Effects of Substance Use Derived from Social Media

Authors: Yao Ge, Sudeshna Das, Karen O'Connor, Mohammed Ali Al-Garadi, Graciela Gonzalez-Hernandez, Abeed Sarker

Abstract: Substance use disorders (SUDs) are a growing concern globally, necessitating enhanced understanding of the problem and its trends through data-driven research. Social media are unique and important sources of information about SUDs, particularly since the data in such sources are often generated by people with lived experiences. In this paper, we introduce Reddit-Impacts, a challenging Named Entity Recognition (NER) dataset curated from subreddits dedicated to discussions on prescription and illicit opioids, as well as medications for opioid use disorder. The dataset specifically concentrates on the lesser-studied, yet critically important, aspects of substance use--its clinical and social impacts. We collected data from chosen subreddits using the publicly available Application Programming Interface for Reddit. We manually annotated text spans representing clinical and social impacts reported by people who also reported personal nonmedical use of substances including but not limited to opioids, stimulants and benzodiazepines. Our objective is to create a resource that can enable the development of systems that can automatically detect clinical and social impacts of substance use from text-based social media data. The successful development of such systems may enable us to better understand how nonmedical use of substances affects individual health and societal dynamics, aiding the development of effective public health strategies.
In addition to creating the annotated data set, we applied several machine learning models to establish baseline performances. Specifically, we experimented with transformer models like BERT and RoBERTa, one few-shot learning model, DANN, by leveraging the full training dataset, and GPT-3.5 by using one-shot learning, for automatic NER of clinical and social impacts. The dataset has been made available through the 2024 SMM4H shared tasks.

Submitted 9 May, 2024; originally announced May 2024.

Comments: 7 pages, 1 figure, 4 tables

arXiv:2405.04007 [pdf, other]

SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing

Authors: Yuying Ge, Sijie Zhao, Chen Li, Yixiao Ge, Ying Shan

Abstract: In this technical report, we introduce SEED-Data-Edit: a unique hybrid dataset for instruction-guided image editing, which aims to facilitate image manipulation using open-form language. SEED-Data-Edit is composed of three distinct types of data: (1) High-quality editing data produced by an automated pipeline, ensuring a substantial volume of diverse image editing pairs. (2) Real-world scenario data collected from the internet, which captures the intricacies of user intentions for promoting the practical application of image editing in the real world. (3) High-precision multi-turn editing data annotated by humans, which involves multiple rounds of edits for simulating iterative editing processes. The combination of these diverse data sources makes SEED-Data-Edit a comprehensive and versatile dataset for training language-guided image editing models. We fine-tune a pretrained Multimodal Large Language Model (MLLM) that unifies comprehension and generation with SEED-Data-Edit. The instruction-tuned model demonstrates promising results, indicating the potential and effectiveness of SEED-Data-Edit in advancing the field of instructional image editing. The datasets are released at https://huggingface.co/datasets/AILab-CVC/SEED-Data-Edit.

Submitted 7 May, 2024; originally announced May 2024.

Comments: Technical Report; Dataset released in https://huggingface.co/datasets/AILab-CVC/SEED-Data-Edit

arXiv:2405.03119 [pdf, ps, other]

DAFT-Spread Affine Frequency Division Multiple Access for Downlink Transmission

Authors: Yiwei Tao, Miaowen Wen, Yao Ge, Tianqi Mao, Lixia Xiao, Jun Li

Abstract: Affine frequency division multiplexing (AFDM) and orthogonal AFDM access (O-AFDMA) are promising techniques based on chirp signals, which are able to suppress the performance deterioration caused by Doppler shifts in high-mobility scenarios. However, the high peak-to-average power ratio (PAPR) in AFDM or O-AFDMA is still a crucial problem, which severely limits their practical applications. In this paper, we propose a discrete affine Fourier transform (DAFT)-spread AFDMA scheme based on the properties of the AFDM systems, named DAFT-s-AFDMA, to significantly reduce the PAPR by resorting to the DAFT. We formulate the transmitted time-domain signals of the proposed DAFT-s-AFDMA schemes with localized and interleaved chirp subcarrier allocation strategies. Accordingly, we derive the guidelines for setting the DAFT parameters, revealing the insights of PAPR reduction. Finally, simulation results of PAPR comparison in terms of the complementary cumulative distribution function (CCDF) show that the proposed DAFT-s-AFDMA schemes with localized and interleaved strategies can both attain better PAPR performances than the conventional O-AFDMA scheme.
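Editor's illustration: the abstract compares schemes by the CCDF of the PAPR. As a minimal sketch of what that metric measures (not the paper's simulation code; the function names are hypothetical), the PAPR of one block of complex time-domain samples is its peak power divided by its mean power, and the empirical CCDF at a threshold is the fraction of blocks whose PAPR exceeds it.

```python
import math


def papr_db(samples):
    """PAPR of one block of complex (or real) time-domain samples, in dB."""
    powers = [abs(s) ** 2 for s in samples]
    mean_power = sum(powers) / len(powers)
    return 10.0 * math.log10(max(powers) / mean_power)


def ccdf(papr_values_db, threshold_db):
    """Empirical CCDF: fraction of blocks whose PAPR exceeds the threshold."""
    exceed = sum(1 for p in papr_values_db if p > threshold_db)
    return exceed / len(papr_values_db)


if __name__ == "__main__":
    constant_envelope = [1 + 0j, 1j, -1 + 0j, -1j]   # equal power in every sample
    single_peak = [2 + 0j, 0j, 0j, 0j]               # all power in one sample
    values = [papr_db(constant_envelope), papr_db(single_peak)]
    print(values, ccdf(values, threshold_db=3.0))
```

A constant-envelope block gives 0 dB, while concentrating all power in one of four samples gives 10·log10(4) dB, so a lower CCDF curve at a given threshold indicates better PAPR behaviour.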

Submitted 5 May, 2024; originally announced May 2024.

arXiv:2405.02604 [pdf, ps, other]

Interleave Frequency Division Multiplexing

Authors: Yuhao Chi, Lei Liu, Yao Ge, Xuehui Chen, Ying Li, Zhaoyang Zhang

Abstract: In this letter, we study interleave frequency division multiplexing (IFDM) for multicarrier modulation in static multipath and mobile time-varying channels, which outperforms orthogonal frequency division multiplexing (OFDM), orthogonal time frequency space (OTFS), and affine frequency division multiplexing (AFDM) by considering practical advanced detectors. The fundamental principle underlying existing modulation techniques is to establish sparse equivalent channel matrices in order to facilitate the design of low-complexity detection algorithms for signal recovery, making a trade-off between performance and implementation complexity. In contrast, the proposed IFDM establishes an equivalent fully dense and right-unitarily invariant channel matrix with the goal of achieving channel capacity, ensuring that the signals undergo sufficient statistical channel fading. Meanwhile, a low-complexity and replica maximum a posteriori (MAP)-optimal cross-domain memory approximate message passing (CD-MAMP) detector is proposed for IFDM by exploiting the sparsity of the time-domain channel and the unitary invariance in interleave-frequency-domain channel. Numerical results show that IFDM with extremely low-complexity CD-MAMP outperforms OFDM, OTFS, and AFDM with state-of-the-art orthogonal approximate message passing detectors, particularly at low velocities.

Submitted 4 May, 2024; originally announced May 2024.

Comments: Accepted by IEEE Wireless Communications Letters

arXiv:2405.01312 [pdf, other]

Privacy-Enhanced Database Synthesis for Benchmark Publishing

Authors: Yongrui Zhong, Yunqing Ge, Jianbin Qin, Shuyuan Zheng, Bo Tang, Yu-Xuan Qiu, Rui Mao, Ye Yuan, Makoto Onizuka, Chuan Xiao

Abstract: Benchmarking is crucial for evaluating a DBMS, yet existing benchmarks often fail to reflect the varied nature of user workloads. As a result, there is increasing momentum toward creating databases that incorporate real-world user data to more accurately mirror business environments. However, privacy concerns deter users from directly sharing their data, underscoring the importance of creating synthesized databases for benchmarking that also prioritize privacy protection. Differential privacy has become a key method for safeguarding privacy when sharing data, but the focus has largely been on minimizing errors in aggregate queries or classification tasks, with less attention given to benchmarking factors like runtime performance. This paper delves into the creation of privacy-preserving databases specifically for benchmarking, aiming to produce a differentially private database whose query performance closely resembles that of the original data. Introducing PrivBench, an innovative synthesis framework, we support the generation of high-quality data that maintains privacy. PrivBench uses sum-product networks (SPNs) to partition and sample data, enhancing data representation while securing privacy. The framework allows users to adjust the detail of SPN partitions and privacy settings, crucial for customizing privacy levels. We validate our approach, which uses the Laplace and exponential mechanisms, in maintaining privacy.
Our tests show that PrivBench effectively generates data that maintains privacy and excels in query performance, consistently reducing errors in query execution time, query cardinality, and KL divergence.
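Editor's illustration: the abstract names the Laplace mechanism as one of the building blocks. The sketch below is the textbook Laplace mechanism applied to a single count, not PrivBench itself; the function names and the `sensitivity` default are assumptions made for the example. Noise with scale sensitivity/epsilon is added to the true answer, so a smaller epsilon means more noise and stronger privacy.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Return the count perturbed by the Laplace mechanism.

    `sensitivity` is how much one individual's record can change the count
    (1 for a plain counting query, which is the assumption here).
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(7)
    print(private_count(100, epsilon=0.5, rng=rng))
```

The difference of two independent exponential variables with the same rate follows a Laplace distribution, which avoids the edge cases of inverse-CDF sampling with only the standard library.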

Submitted 2 May, 2024; originally announced May 2024.

arXiv:2405.01308 [pdf, ps, other]

Spectral and Imaging Observations of a C2.3 White-Light Flare from the Advanced Space-Based Solar Observatory (ASO-S) and the Chinese H$α$ Solar Explorer (CHASE)

Authors: Qiao Li, Ying Li, Yang Su, Dechao Song, Hui Li, Li Feng, Yu Huang, Youping Li, Jingwei Li, Jie Zhao, Lei Lu, Beili Ying, Jianchao Xue, Ping Zhang, Jun Tian, Xiaofeng Liu, Gen Li, Zhichen Jing, Shuting Li, Guanglu Shi, Zhengyuan Tian, Wei Chen, Yingna Su, Qingmin Zhang, Dong Li, et al. (5 additional authors not shown)

Abstract: Solar white-light flares are characterized by an enhancement in the optical continuum, which are usually large flares (say X- and M-class flares). Here we report a small C2.3 white-light flare (SOL2022-12-20T04:10) observed by the \emph{Advanced Space-based Solar Observatory} and the \emph{Chinese H$α$ Solar Explorer}. This flare exhibits an increase of $\approx$6.4\% in the photospheric Fe \textsc{i} line at 6569.2\,Å and {$\approx$3.2\%} in the nearby continuum. The continuum at 3600\,Å also shows an enhancement of $\approx$4.7\%. The white-light brightening kernels are mainly located at the flare ribbons and co-spatial with nonthermal hard X-ray sources, which implies that the enhanced white-light emissions are related to nonthermal electron-beam heating. At the brightening kernels, the Fe \textsc{i} line displays an absorption profile that has a good Gaussian shape, with a redshift up to $\approx$1.7 km s$^{-1}$, while the H$α$ line shows an emission profile though having a central reversal. The H$α$ line profile also shows a red or blue asymmetry caused by plasma flows with a velocity of several to tens of km s$^{-1}$. It is interesting to find that the H$α$ asymmetry is opposite at the conjugate footpoints. It is also found that the CHASE continuum increase seems to be related to the change of photospheric magnetic field. Our study provides comprehensive characteristics of a small white-light flare that help understand the energy release process of white-light flares.

Submitted 2 May, 2024; originally announced May 2024.

Comments: 23 pages, 6 figures, accepted by Solar Physics

arXiv:2404.19752 [pdf, other]

Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation

Authors: Yunhao Ge, Xiaohui Zeng, Jacob Samuel Huffman, Tsung-Yi Lin, Ming-Yu Liu, Yin Cui

Abstract: Existing automatic captioning methods for visual content face challenges such as lack of detail, content hallucination, and poor instruction following. In this work, we propose VisualFactChecker (VFC), a flexible training-free pipeline that generates high-fidelity and detailed captions for both 2D images and 3D objects. VFC consists of three steps: 1) proposal, where image-to-text captioning models propose multiple initial captions; 2) verification, where a large language model (LLM) utilizes tools such as object detection and VQA models to fact-check proposed captions; 3) captioning, where an LLM generates the final caption by summarizing caption proposals and the fact check verification results. In this step, VFC can flexibly generate captions in various styles following complex instructions. We conduct comprehensive captioning evaluations using four metrics: 1) CLIP-Score for image-text similarity; 2) CLIP-Image-Score for measuring the image-image similarity between the original and the reconstructed image generated by a text-to-image model using the caption; 3) human study on Amazon Mechanical Turk; 4) GPT-4V for fine-grained evaluation. Evaluation results show that VFC outperforms state-of-the-art open-sourced captioning methods for 2D images on the COCO dataset and 3D assets on the Objaverse dataset. Our study demonstrates that by combining open-source models into a pipeline, we can attain captioning capability comparable to proprietary models such as GPT-4V, despite being over 10x smaller in model size.

Submitted 30 April, 2024; originally announced April 2024.

Comments: CVPR 2024

arXiv:2404.16957 [pdf, other]

Attributing Responsibility in AI-Induced Incidents: A Computational Reflective Equilibrium Framework for Accountability

Authors: Yunfei Ge, Quanyan Zhu

Abstract: The pervasive integration of Artificial Intelligence (AI) has introduced complex challenges in the responsibility and accountability in the event of incidents involving AI-enabled systems. The interconnectivity of these systems, ethical concerns of AI-induced incidents, coupled with uncertainties in AI technology and the absence of corresponding regulations, have made traditional responsibility attribution challenging. To this end, this work proposes a Computational Reflective Equilibrium (CRE) approach to establish a coherent and ethically acceptable responsibility attribution framework for all stakeholders. The computational approach provides a structured analysis that overcomes the limitations of conceptual approaches in dealing with dynamic and multifaceted scenarios, showcasing the framework's explainability, coherence, and adaptivity properties in the responsibility attribution process. We examine the pivotal role of the initial activation level associated with claims in equilibrium computation. Using an AI-assisted medical decision-support system as a case study, we illustrate how different initializations lead to diverse responsibility distributions. The framework offers valuable insights into accountability in AI-induced incidents, facilitating the development of a sustainable and resilient system through continuous monitoring, revision, and reflection.

Submitted 25 April, 2024; originally announced April 2024.

arXiv:2404.16790 [pdf, other]

SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension

Authors: Bohao Li, Yuying Ge, Yi Chen, Yixiao Ge, Ruimao Zhang, Ying Shan

Abstract: Comprehending text-rich visual content is paramount for the practical application of Multimodal Large Language Models (MLLMs), since text-rich scenarios are ubiquitous in the real world, which are characterized by the presence of extensive texts embedded within images. Recently, the advent of MLLMs with impressive versatility has raised the bar for what we can expect from MLLMs. However, their proficiency in text-rich scenarios has yet to be comprehensively and objectively assessed, since current MLLM benchmarks primarily focus on evaluating general visual comprehension. In this work, we introduce SEED-Bench-2-Plus, a benchmark specifically designed for evaluating text-rich visual comprehension of MLLMs. Our benchmark comprises 2.3K multiple-choice questions with precise human annotations, spanning three broad categories: Charts, Maps, and Webs, each of which covers a wide spectrum of text-rich scenarios in the real world. These categories, due to their inherent complexity and diversity, effectively simulate real-world text-rich environments. We further conduct a thorough evaluation involving 34 prominent MLLMs (including GPT-4V, Gemini-Pro-Vision and Claude-3-Opus) and emphasize the current limitations of MLLMs in text-rich visual comprehension. We hope that our work can serve as a valuable addition to existing MLLM benchmarks, providing insightful observations and inspiring further research in the area of text-rich visual comprehension with MLLMs. The dataset and evaluation code can be accessed at https://github.com/AILab-CVC/SEED-Bench.

Submitted 25 April, 2024; originally announced April 2024.

arXiv:2404.14396 [pdf, other]

SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

Authors: Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, Ying Shan

Abstract: The rapid evolution of multimodal foundation models has demonstrated significant progress in vision-language understanding and generation, e.g., our previous work SEED-LLaMA. However, there remains a gap between its capability and the real-world applicability, primarily due to the model's limited capacity to effectively respond to various user instructions and interact with diverse visual data. In this work, we focus on bridging this gap through integrating two enhanced features: (1) comprehending images of arbitrary sizes and ratios, and (2) enabling multi-granularity image generation. We present a unified and versatile foundation model, namely, SEED-X, which is able to model multi-granularity visual semantics for comprehension and generation tasks. Besides the competitive results on public benchmarks, SEED-X demonstrates its effectiveness in handling real-world applications across various domains after instruction tuning. We hope that our work will inspire future research into what can be achieved by versatile multimodal foundation models in real-world applications. The models, codes, and datasets will be released in https://github.com/AILab-CVC/SEED-X.

Submitted 22 April, 2024; originally announced April 2024.

Comments: Project released at: https://github.com/AILab-CVC/SEED-X

arXiv:2404.13884 [pdf]

MambaUIE&SR: Unraveling the Ocean's Secrets with Only 2.8 GFLOPs

Authors: Zhihao Chen, Yiyuan Ge

Abstract: Underwater Image Enhancement (UIE) techniques aim to address the problem of underwater image degradation due to light absorption and scattering. In recent years, both Convolution Neural Network (CNN)-based and Transformer-based methods have been widely explored. In addition, combining CNN and Transformer can effectively combine global and local information for enhancement. However, this approach is still affected by the secondary complexity of the Transformer and cannot maximize the performance. Recently, the state-space model (SSM) based architecture Mamba has been proposed, which excels in modeling long distances while maintaining linear complexity. This paper explores the potential of this SSM-based model for UIE from both efficiency and effectiveness perspectives. However, the performance of directly applying Mamba is poor because local fine-grained features, which are crucial for image enhancement, cannot be fully utilized. Specifically, we customize the MambaUIE architecture for efficient UIE. Specifically, we introduce visual state space (VSS) blocks to capture global contextual information at the macro level while mining local information at the micro level. Also, for these two kinds of information, we propose a Dynamic Interaction Block (DIB) and Spatial feed-forward Network (SGFN) for intra-block feature aggregation. MambaUIE is able to efficiently synthesize global and local information and maintains a very small number of parameters with high accuracy. Experiments on UIEB datasets show that our method reduces GFLOPs by 67.4% (2.715G) relative to the SOTA method. To the best of our knowledge, this is the first UIE model constructed based on SSM that breaks the limitation of FLOPs on accuracy in UIE. The official repository of MambaUIE is at https://github.com/1024AILab/MambaUIE.

Submitted 24 May, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

Comments: arXiv admin note: text overlap with arXiv:2305.08824 by other authors

arXiv:2404.13600 [pdf, other]

Are We Ready for Planetary Exploration Robots? The TAIL-Plus Dataset for SLAM in Granular Environments

Authors: Zirui Wang, Chen Yao, Yangtao Ge, Guowei Shi, Ningbo Yang, Zheng Zhu, Kewei Dong, Hexiang Wei, Zhenzhong Jia, Jing Wu

Abstract: So far, planetary surface exploration depends on various mobile robot platforms. The autonomous navigation and decision-making of these mobile robots in complex terrains largely rely on their terrain-aware perception, localization and mapping capabilities. In this paper we release the TAIL-Plus dataset, a new challenging dataset in deformable granular environments for planetary exploration robots, which is an extension to our previous work, TAIL (Terrain-Aware multI-modaL) dataset. We conducted field experiments on beaches that are considered as planetary surface analog environments for diverse sandy terrains. In TAIL-Plus dataset, we provide more sequences with multiple loops and expand the scene from day to night. Benefiting from our sensor suite with modular design, we use both wheeled and quadruped robots for data collection. The sensors include a 3D LiDAR, three downward RGB-D cameras, a pair of global-shutter color cameras that can be used as a forward-looking stereo camera, an RTK-GPS device and an extra IMU. Our datasets are intended to help researchers developing multi-sensor simultaneous localization and mapping (SLAM) algorithms for robots in unstructured, deformable granular terrains. Our datasets and supplementary materials will be available at https://tailrobot.github.io/.

Submitted 21 April, 2024; originally announced April 2024.

Comments: Accepted to the IEEE ICRA Workshop on Field Robotics 2024

arXiv:2404.10291 [pdf, other]

Robust Snapshot Radio SLAM

Authors: Ossi Kaltiokallio, Elizaveta Rastorgueva-Foi, Jukka Talvitie, Yu Ge, Henk Wymeersch, Mikko Valkama

Abstract: The intrinsic geometric connections between millimeter-wave (mmWave) signals and the propagation environment can be leveraged for simultaneous localization and mapping (SLAM) in 5G and beyond networks. However, estimated channel parameters that are mismatched to the utilized geometric model can cause the SLAM solution to degrade. In this paper, we propose a robust snapshot radio SLAM algorithm for mixed line-of-sight (LoS) and non-line-of-sight (NLoS) environments that can estimate the unknown user equipment (UE) state, map of the environment as well as the presence of the LoS path. The proposed method can accurately detect outliers and the LoS path, enabling robust estimation in both LoS and NLoS conditions. The proposed method is validated using 60 GHz experimental data, indicating superior performance compared to the state-of-the-art.

Submitted 16 April, 2024; originally announced April 2024.

arXiv:2404.08882 [pdf, other]

Explanations of MTF discrepancy in grating-based X-ray differential phase contrast CT imaging

Authors: Yuhang Tan, Jiecheng Yang, Hairong Zheng, Dong Liang, Peiping Zhu, Yongshuai Ge

Abstract: As a multi-contrast X-ray computed tomography (CT) imaging system, the grating-based Talbot-Lau interferometer is able to generate the absorption contrast and differential phase contrast (DPC) images concurrently. However, experiments found that the absorption CT (ACT) images have better spatial resolution, i.e., higher modulation transfer function (MTF), than the differential phase contrast CT (DPCT) images. Until now, the root cause of such observed discrepancy has not been rigorously investigated. Through physical experiments, this study revealed that the phase grating in the Talbot-Lau interferometer induces direct superposition of paired split absorption signals and inverse superposition of paired split phase signals via diffraction. Further simulation experiments demonstrated that this splitting leads to a reduction in MTF in both ACT and DPCT images, with distinct superposition mechanisms contributing to the lower MTF in DPCT. Besides, such MTF discrepancy may also be affected to a minor extent by object composition, sample size, beam spectra and detector pixel size. Based on this study, the spatial resolution could be optimized when designing a grating-based DPC imaging system.

Submitted 12 April, 2024; originally announced April 2024.

Comments: 7 pages,3 figures

ACM Class: J.2

arXiv:2404.07855 [pdf, other]

Resolve Domain Conflicts for Generalizable Remote Physiological Measurement

Authors: Weiyu Sun, Xinyu Zhang, Hao Lu, Ying Chen, Yun Ge, Xiaolin Huang, Jie Yuan, Yingcong Chen

Abstract: Remote photoplethysmography (rPPG) technology has become increasingly popular due to its non-invasive monitoring of various physiological indicators, making it widely applicable in multimedia interaction, healthcare, and emotion analysis. Existing rPPG methods utilize multiple datasets for training to enhance the generalizability of models. However, they often overlook the underlying conflict issues across different datasets, such as (1) label conflict resulting from different phase delays between physiological signal labels and face videos at the instance level, and (2) attribute conflict stemming from distribution shifts caused by head movements, illumination changes, skin types, etc. To address this, we introduce the DOmain-HArmonious framework (DOHA). Specifically, we first propose a harmonious phase strategy to eliminate uncertain phase delays and preserve the temporal variation of physiological signals. Next, we design a harmonious hyperplane optimization that reduces irrelevant attribute shifts and encourages the model's optimization towards a global solution that fits more valid scenarios. Our experiments demonstrate that DOHA significantly improves the performance of existing methods under multiple protocols. Our code is available at https://github.com/SWY666/rPPG-DOHA.

Submitted 11 April, 2024; originally announced April 2024.

Comments: Accepted by ACM MM 2023

arXiv:2404.07019 [pdf, other]

Chiral Chaos Enhanced Sensing

Authors: Yun-Qiu Ge, Zhe Wang, Qian-Chuan Zhao, Jing Zhang, Yu-xi Liu

Abstract: Chirality refers to the property that an object and its mirror image cannot overlap each other by spatial rotation and translation, and can be found in various research fields. We here propose chiral chaos and construct a chiral chaotic device via coupled whispering gallery mode resonators, where routes to chaos exhibit pronounced chirality for two opposite pumping directions. The mechanism responsible for this phenomenon is that time-reversal symmetry of the traveling-wave light fields is broken by the Rayleigh scatterers inserted in resonators. Combining with the Lyapunov exponents, we propose metrics to measure the symmetry and chirality between different chaotic dynamics. We find that such a chiral chaotic device can be applied to achieve sensing with high sensitivity, wide detectable range, and strong robustness to the phase and orientation randomness of weak signals. Our work presents a promising candidate for on-chip sensing and may have applications in quantum networks and chaotic communications.

Submitted 10 April, 2024; originally announced April 2024.

arXiv:2404.06835 [pdf, other]

Tuning-Free Adaptive Style Incorporation for Structure-Consistent Text-Driven Style Transfer

Authors: Yanqi Ge, Jiaqi Liu, Qingnan Fan, Xi Jiang, Ye Huang, Shuai Qin, Hong Gu, Wen Li, Lixin Duan

Abstract: In this work, we target the task of text-driven style transfer in the context of text-to-image (T2I) diffusion models. The main challenge is consistent structure preservation while enabling effective style transfer effects. The past approaches in this field directly concatenate the content and style prompts for a prompt-level style injection, leading to unavoidable structure distortions. In this work, we propose a novel solution to the text-driven style transfer task, namely, Adaptive Style Incorporation (ASI), to achieve fine-grained feature-level style incorporation. It consists of the Siamese Cross-Attention (SiCA) to decouple the single-track cross-attention to a dual-track structure to obtain separate content and style features, and the Adaptive Content-Style Blending (AdaBlending) module to couple the content and style information from a structure-consistent manner. Experimentally, our method exhibits much better performance in both structure preservation and stylized effects.

Submitted 10 April, 2024; originally announced April 2024.

arXiv:2404.03443 [pdf, ps, other]

Part-Attention Based Model Make Occluded Person Re-Identification Stronger

Authors: Zhihao Chen, Yiyuan Ge

Abstract: The goal of occluded person re-identification (ReID) is to retrieve specific pedestrians in occluded situations. However, occluded person ReID still suffers from background clutter and low-quality local feature representations, which limits model performance. In our research, we introduce a new framework called PAB-ReID, which is a novel ReID model incorporating part-attention mechanisms to tackle the aforementioned issues effectively. Firstly, we introduce the human parsing label to guide the generation of more accurate human part attention maps. In addition, we propose a fine-grained feature focuser for generating fine-grained human local feature representations while suppressing background interference. Moreover, we also design a part triplet loss to supervise the learning of human local features, which optimizes intra/inter-class distance. We conducted extensive experiments on specialized occlusion and regular ReID datasets, showcasing that our approach outperforms the existing state-of-the-art methods.

Submitted 1 May, 2024; v1 submitted 4 April, 2024; originally announced April 2024.

Comments: Accepted By International Joint Conference on Neural Networks 2024

arXiv:2404.00308 [pdf, other]

ST-LLM: Large Language Models Are Effective Temporal Learners

Authors: Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, Ge Li

Abstract: Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation, prompting research efforts towards video LLMs to facilitate human-AI interaction at the video level. However, how to effectively encode and understand videos in video-based dialogue systems remains to be solved. In this paper, we investigate a straightforward yet unexplored question: Can we feed all spatial-temporal tokens into the LLM, thus delegating the task of video sequence modeling to the LLMs? Surprisingly, this simple approach yields significant improvements in video understanding. Based upon this, we propose ST-LLM, an effective video-LLM baseline with Spatial-Temporal sequence modeling inside LLM. Furthermore, to address the overhead and stability issues introduced by uncompressed video tokens within LLMs, we develop a dynamic masking strategy with tailor-made training objectives. For particularly long videos, we have also designed a global-local input module to balance efficiency and effectiveness. Consequently, we harness LLM for proficient spatial-temporal modeling, while upholding efficiency and stability. Extensive experimental results attest to the effectiveness of our method. Through a more concise model and training pipeline, ST-LLM establishes a new state-of-the-art result on VideoChatGPT-Bench and MVBench. Codes have been available at https://github.com/TencentARC/ST-LLM.

Submitted 30 March, 2024; originally announced April 2024.

arXiv:2403.19021 [pdf, other]

IDGenRec: LLM-RecSys Alignment with Textual ID Learning

Authors: Juntao Tan, Shuyuan Xu, Wenyue Hua, Yingqiang Ge, Zelong Li, Yongfeng Zhang

Abstract: Generative recommendation based on Large Language Models (LLMs) has transformed the traditional ranking-based recommendation style into a text-to-text generation paradigm. However, in contrast to standard NLP tasks that inherently operate on human vocabulary, current research in generative recommendations struggles to effectively encode recommendation items within the text-to-text framework using concise yet meaningful ID representations. To better align LLMs with recommendation needs, we propose IDGen, representing each item as a unique, concise, semantically rich, platform-agnostic textual ID using human language tokens. This is achieved by training a textual ID generator alongside the LLM-based recommender, enabling seamless integration of personalized recommendations into natural language generation. Notably, as user history is expressed in natural language and decoupled from the original dataset, our approach suggests the potential for a foundational generative recommendation model. Experiments show that our framework consistently surpasses existing models in sequential recommendation under standard experimental setting. Then, we explore the possibility of training a foundation recommendation model with the proposed method on data collected from 19 different datasets and test its recommendation performance on 6 unseen datasets across different platforms under a completely zero-shot setting. The results show that the zero-shot performance of the pre-trained foundation model is comparable to or even better than some traditional recommendation models based on supervised training, showing the potential of the IDGen paradigm serving as the foundation model for generative recommendation. Code and data are open-sourced at https://github.com/agiresearch/IDGenRec.

Submitted 17 May, 2024; v1 submitted 27 March, 2024; originally announced March 2024.

Comments: Accepted in SIGIR 2024

arXiv:2403.18691 [pdf, other]

Building defect conformal field theory from the Sachdev-Ye-Kitaev interactions

Authors: Yang Ge, Shao-Kai Jian

Abstract: The coupling between defects and extended critical degrees of freedom gives rise to the intriguing theory known as defect conformal field theory (CFT). In this work, we introduce a novel family of boundary and interface CFTs by coupling $N$ Majorana chains with SYK$_q$ interactions at the defect. Our analysis reveals that the interaction with $q=2$ constitutes a new marginal defect. Employing a versatile saddle point method, we compute unique entanglement characterizations, including the $g$-function and effective central charge, of the defect CFT. Furthermore, we analytically evaluate the transmission coefficient using CFT techniques. Surprisingly, the transmission coefficient deviates from the universal relation with the effective central charge across the defect at the large $N$ limit, suggesting that our defect CFT extends beyond all known examples of Gaussian defect CFT.

Submitted 27 March, 2024; originally announced March 2024.

Comments: 15 pages, 6 figures

arXiv:2403.18189 [pdf]

doi 10.1038/s41467-024-45318-8

Interfacial magnetic spin Hall effect in van der Waals Fe3GeTe2/MoTe2 heterostructure

Authors: Yudi Dai, Junlin Xiong, Yanfeng Ge, Bin Cheng, Lizheng Wang, Pengfei Wang, Zenglin Liu, Shengnan Yan, Cuiwei Zhang, Xianghan Xu, Youguo Shi, Sang-Wook Cheong, Cong Xiao, Shengyuan A. Yang, Shi-Jun Liang, Feng Miao

Abstract: The spin Hall effect (SHE) allows efficient generation of spin polarization or spin current through charge current and plays a crucial role in the development of spintronics. While SHE typically occurs in non-magnetic materials and is time-reversal even, exploring time-reversal-odd (T-odd) SHE, which couples SHE to magnetization in ferromagnetic materials, offers a new charge-spin conversion mechanism with new functionalities. Here, we report the observation of giant T-odd SHE in Fe3GeTe2/MoTe2 van der Waals heterostructure, representing a previously unidentified interfacial magnetic spin Hall effect (interfacial-MSHE). Through rigorous symmetry analysis and theoretical calculations, we attribute the interfacial-MSHE to a symmetry-breaking induced spin current dipole at the vdW interface. Furthermore, we show that this linear effect can be used for implementing multiply-accumulate operations and binary convolutional neural networks with cascaded multi-terminal devices. Our findings uncover an interfacial T-odd charge-spin conversion mechanism with promising potential for energy-efficient in-memory computing.

Submitted 26 March, 2024; originally announced March 2024.

Journal ref: Nature Communications 15, 1129 (2024)

arXiv:2403.17664 [pdf, other]

DiffFAE: Advancing High-fidelity One-shot Facial Appearance Editing with Space-sensitive Customization and Semantic Preservation

Authors: Qilin Wang, Jiangning Zhang, Chengming Xu, Weijian Cao, Ying Tai, Yue Han, Yanhao Ge, Hong Gu, Chengjie Wang, Yanwei Fu

Abstract: Facial Appearance Editing (FAE) aims to modify physical attributes, such as pose, expression and lighting, of human facial images while preserving attributes like identity and background, showing great importance in photography. In spite of the great progress in this area, current research generally faces three challenges: low generation fidelity, poor attribute preservation, and inefficient inference. To overcome the above challenges, this paper presents DiffFAE, a one-stage and highly-efficient diffusion-based framework tailored for high-fidelity FAE. For high-fidelity query attributes transfer, we adopt Space-sensitive Physical Customization (SPC), which ensures the fidelity and generalization ability by utilizing rendering texture derived from 3D Morphable Model (3DMM). In order to preserve source attributes, we introduce the Region-responsive Semantic Composition (RSC). This module is guided to learn decoupled source-regarding features, thereby better preserving the identity and alleviating artifacts from non-facial attributes such as hair, clothes, and background. We further introduce a consistency regularization for our pipeline to enhance editing controllability by leveraging prior knowledge in the attention matrices of diffusion model. Extensive experiments demonstrate the superiority of DiffFAE over existing methods, achieving state-of-the-art performance in facial appearance editing.

Submitted 26 March, 2024; originally announced March 2024.

arXiv:2403.16971 [pdf, other]

AIOS: LLM Agent Operating System

Authors: Kai Mei, Zelong Li, Shuyuan Xu, Ruosong Ye, Yingqiang Ge, Yongfeng Zhang

Abstract: The integration and deployment of large language model (LLM)-based intelligent agents have been fraught with challenges that compromise their efficiency and efficacy. Among these issues are sub-optimal scheduling and resource allocation of agent requests over the LLM, the difficulties in maintaining context during interactions between agent and LLM, and the complexities inherent in integrating heterogeneous agents with different capabilities and specializations. The rapid increase of agent quantity and complexity further exacerbates these issues, often leading to bottlenecks and sub-optimal utilization of resources. Inspired by these challenges, this paper presents AIOS, an LLM agent operating system, which embeds large language model into operating systems (OS) as the brain of the OS, enabling an operating system "with soul" -- an important step towards AGI. Specifically, AIOS is designed to optimize resource allocation, facilitate context switch across agents, enable concurrent execution of agents, provide tool service for agents, and maintain access control for agents. We present the architecture of such an operating system, outline the core challenges it aims to resolve, and provide the basic design and implementation of the AIOS. Our experiments on concurrent execution of multiple agents demonstrate the reliability and efficiency of our AIOS modules. Through this, we aim to not only improve the performance and efficiency of LLM agents but also to pioneer for better development and deployment of the AIOS ecosystem in the future. The project is open-source at https://github.com/agiresearch/AIOS.

Submitted 25 March, 2024; v1 submitted 25 March, 2024; originally announced March 2024.

Comments: 14 pages, 5 figures, 5 tables; comments and suggestions are appreciated

arXiv:2403.16875 [pdf, other]

TAIL: A Terrain-Aware Multi-Modal SLAM Dataset for Robot Locomotion in Deformable Granular Environments

Authors: Chen Yao, Yangtao Ge, Guowei Shi, Zirui Wang, Ningbo Yang, Zheng Zhu, Hexiang Wei, Yuntian Zhao, Jing Wu, Zhenzhong Jia

Abstract: Terrain-aware perception holds the potential to improve the robustness and accuracy of autonomous robot navigation in the wild, thereby facilitating effective off-road traversals. However, the lack of multi-modal perception across various motion patterns hinders the solutions of Simultaneous Localization And Mapping (SLAM), especially when confronting non-geometric hazards in demanding landscapes. In this paper, we first propose a Terrain-Aware multI-modaL (TAIL) dataset tailored to deformable and sandy terrains. It incorporates various types of robotic proprioception and distinct ground interactions for the unique challenges and benchmark of multi-sensor fusion SLAM. The versatile sensor suite comprises stereo frame cameras, multiple ground-pointing RGB-D cameras, a rotating 3D LiDAR, an IMU, and an RTK device. This ensemble is hardware-synchronized, well-calibrated, and self-contained. Utilizing both wheeled and quadrupedal locomotion, we efficiently collect comprehensive sequences to capture rich unstructured scenarios. It spans the spectrum of scope, terrain interactions, scene changes, ground-level properties, and dynamic robot characteristics. We benchmark several state-of-the-art SLAM methods against ground truth and provide performance validations. Corresponding challenges and limitations are also reported. All associated resources are accessible upon request at https://tailrobot.github.io/.

Submitted 25 March, 2024; originally announced March 2024.

Comments: Submitted to IEEE Robotics and Automation Letters

arXiv:2403.16411 [pdf, other]

A Geometric Perspective on Fusing Gaussian Distributions on Lie Groups

Authors: Yixiao Ge, Pieter van Goor, Robert Mahony

Abstract: Stochastic inference on Lie groups plays a key role in state estimation problems such as inertial navigation, visual inertial odometry, pose estimation in virtual reality, etc. A key problem is fusing independent concentrated Gaussian distributions defined at different reference points on the group. In this paper we approximate distributions at different points in the group in a single set of exponential coordinates and then use classical Gaussian fusion to obtain the fused posterior in those coordinates. We consider several approximations including the exact Jacobian of the change of coordinate map, first and second order Taylor expansions of the Jacobian, and parallel transport with and without curvature correction associated with the underlying geometry of the Lie group. Preliminary results on SO(3) demonstrate that a novel approximation using parallel transport with curvature correction achieves similar accuracy to the state-of-the-art optimisation based algorithms at a fraction of the computational cost.

Submitted 30 April, 2024; v1 submitted 25 March, 2024; originally announced March 2024.

Comments: Preprint for L-CSS

Showing 1–50 of 638 results for author: Ge, Y