subscribe to arXiv mailings

GRUtopia: Dream General Robots in a City at Scale

Authors: Hanqing Wang, Jiahe Chen, Wensi Huang, Qingwei Ben, Tai Wang, Boyu Mi, Tao Huang, Siheng Zhao, Yilun Chen, Sizhe Yang, Peizhou Cao, Wenye Yu, Zichao Ye, Jialun Li, Junfeng Long, Zirui Wang, Huiling Wang, Ying Zhao, Zhongying Tu, Yu Qiao, Dahua Lin, Jiangmiao Pang

Abstract: Recent works have been exploring the scaling laws in the field of Embodied AI. Given the prohibitive costs of collecting real-world data, we believe the Simulation-to-Real (Sim2Real) paradigm is a crucial step for scaling the learning of embodied models. This paper introduces project GRUtopia, the first simulated interactive 3D society designed for various robots. It features several advancements:… ▽ More Recent works have been exploring the scaling laws in the field of Embodied AI. Given the prohibitive costs of collecting real-world data, we believe the Simulation-to-Real (Sim2Real) paradigm is a crucial step for scaling the learning of embodied models. This paper introduces project GRUtopia, the first simulated interactive 3D society designed for various robots. It features several advancements: (a) The scene dataset, GRScenes, includes 100k interactive, finely annotated scenes, which can be freely combined into city-scale environments. In contrast to previous works mainly focusing on home, GRScenes covers 89 diverse scene categories, bridging the gap of service-oriented environments where general robots would be initially deployed. (b) GRResidents, a Large Language Model (LLM) driven Non-Player Character (NPC) system that is responsible for social interaction, task generation, and task assignment, thus simulating social scenarios for embodied AI applications. (c) The benchmark, GRBench, supports various robots but focuses on legged robots as primary agents and poses moderately challenging tasks involving Object Loco-Navigation, Social Loco-Navigation, and Loco-Manipulation. We hope that this work can alleviate the scarcity of high-quality data in this field and provide a more comprehensive assessment of Embodied AI research. The project is available at https://github.com/OpenRobotLab/GRUtopia. △ Less

Submitted 15 July, 2024; originally announced July 2024.

arXiv:2407.10701 [pdf, other]

DOCBENCH: A Benchmark for Evaluating LLM-based Document Reading Systems

Authors: Anni Zou, Wenhao Yu, Hongming Zhang, Kaixin Ma, Deng Cai, Zhuosheng Zhang, Hai Zhao, Dong Yu

Abstract: Recently, there has been a growing interest among large language model (LLM) developers in LLM-based document reading systems, which enable users to upload their own documents and pose questions related to the document contents, going beyond simple reading comprehension tasks. Consequently, these systems have been carefully designed to tackle challenges such as file parsing, metadata extraction, m… ▽ More Recently, there has been a growing interest among large language model (LLM) developers in LLM-based document reading systems, which enable users to upload their own documents and pose questions related to the document contents, going beyond simple reading comprehension tasks. Consequently, these systems have been carefully designed to tackle challenges such as file parsing, metadata extraction, multi-modal information understanding and long-context reading. However, no current benchmark exists to evaluate their performance in such scenarios, where a raw file and questions are provided as input, and a corresponding response is expected as output. In this paper, we introduce DocBench, a new benchmark designed to evaluate LLM-based document reading systems. Our benchmark involves a meticulously crafted process, including the recruitment of human annotators and the generation of synthetic questions. It includes 229 real documents and 1,102 questions, spanning across five different domains and four major types of questions. We evaluate both proprietary LLM-based systems accessible via web interfaces or APIs, and a parse-then-read pipeline employing open-source LLMs. Our evaluations reveal noticeable gaps between existing LLM-based document reading systems and human performance, underscoring the challenges of developing proficient systems. To summarize, DocBench aims to establish a standardized benchmark for evaluating LLM-based document reading systems under diverse real-world scenarios, thereby guiding future advancements in this research area. △ Less

Submitted 15 July, 2024; originally announced July 2024.

Comments: Work in progress

arXiv:2407.10184 [pdf, other]

doi 10.1145/3637528.3671661

Towards Robust Recommendation via Decision Boundary-aware Graph Contrastive Learning

Authors: Jiakai Tang, Sunhao Dai, Zexu Sun, Xu Chen, Jun Xu, Wenhui Yu, Lantao Hu, Peng Jiang, Han Li

Abstract: In recent years, graph contrastive learning (GCL) has received increasing attention in recommender systems due to its effectiveness in reducing bias caused by data sparsity. However, most existing GCL models rely on heuristic approaches and usually assume entity independence when constructing contrastive views. We argue that these methods struggle to strike a balance between semantic invariance an… ▽ More In recent years, graph contrastive learning (GCL) has received increasing attention in recommender systems due to its effectiveness in reducing bias caused by data sparsity. However, most existing GCL models rely on heuristic approaches and usually assume entity independence when constructing contrastive views. We argue that these methods struggle to strike a balance between semantic invariance and view hardness across the dynamic training process, both of which are critical factors in graph contrastive learning. To address the above issues, we propose a novel GCL-based recommendation framework RGCL, which effectively maintains the semantic invariance of contrastive pairs and dynamically adapts as the model capability evolves through the training process. Specifically, RGCL first introduces decision boundary-aware adversarial perturbations to constrain the exploration space of contrastive augmented views, avoiding the decrease of task-specific information. Furthermore, to incorporate global user-user and item-item collaboration relationships for guiding on the generation of hard contrastive views, we propose an adversarial-contrastive learning objective to construct a relation-aware view-generator. Besides, considering that unsupervised GCL could potentially narrower margins between data points and the decision boundary, resulting in decreased model robustness, we introduce the adversarial examples based on maximum perturbations to achieve margin maximization. We also provide theoretical analyses on the effectiveness of our designs. Through extensive experiments on five public datasets, we demonstrate the superiority of RGCL compared against twelve baseline models. △ Less

Submitted 14 July, 2024; originally announced July 2024.

Comments: KDD 2024

arXiv:2407.09932 [pdf, other]

Quantum Clock Synchronization Network with Silicon-chip Dual-Pumped Entangled Photon Source

Authors: J. A. Li, H. Han, X. P. Huang, B. Y. Tang, K. Guo, J. Q. Huang, S. Y. Xiong, W. R. Yu, Z. J. Zhang, J. B. Yang, B. Liu, H. Chen, Z. K. Lu

Abstract: In this paper, we propose a quantum clock synchronization (QCS) network scheme with silicon-chip dual-pumped entangled photon source. This scheme couples two pump beams into the silicon-based waveguide, where degenerate and non-degenerate spontaneous four-wave mixing (SFWM) occurs, generating entanglement between one signal channel and three idler channels. The entangled photons are distributed to… ▽ More In this paper, we propose a quantum clock synchronization (QCS) network scheme with silicon-chip dual-pumped entangled photon source. This scheme couples two pump beams into the silicon-based waveguide, where degenerate and non-degenerate spontaneous four-wave mixing (SFWM) occurs, generating entanglement between one signal channel and three idler channels. The entangled photons are distributed to remote users through the wavelength division multiplexing strategy to construct an entanglement distribution network, and the round-trip QCS is adopted to realize a QCS network that can serve multiple users. A proof-of-principle QCS network experiment is implemented among the server and multiple users (Alice, Bob, and Charlie) for 11.1 hours, where Alice and Charlie are 10 km away from the server and Bob is 25 km away from the server. The lowest time deviations (TDEV) between the server and each user (Alice, Bob, and Charlie) are 1.57 ps, 0.82 ps and 2.57 ps at the average time of 8000 s, 8000 s and 800 s respectively. The results show that the QCS network scheme with dual-pumped SFWM photon source proposed by us achieves high accuracy, and the channel resources used by n users are reduced by about 30% compared with other round-trip QCS schemes. △ Less

Submitted 13 July, 2024; originally announced July 2024.

arXiv:2407.09324 [pdf, other]

Provable Privacy Advantages of Decentralized Federated Learning via Distributed Optimization

Authors: Wenrui Yu, Qiongxiu Li, Milan Lopuhaä-Zwakenberg, Mads Græsbøll Christensen, Richard Heusdens

Abstract: Federated learning (FL) emerged as a paradigm designed to improve data privacy by enabling data to reside at its source, thus embedding privacy as a core consideration in FL architectures, whether centralized or decentralized. Contrasting with recent findings by Pasquini et al., which suggest that decentralized FL does not empirically offer any additional privacy or security benefits over centrali… ▽ More Federated learning (FL) emerged as a paradigm designed to improve data privacy by enabling data to reside at its source, thus embedding privacy as a core consideration in FL architectures, whether centralized or decentralized. Contrasting with recent findings by Pasquini et al., which suggest that decentralized FL does not empirically offer any additional privacy or security benefits over centralized models, our study provides compelling evidence to the contrary. We demonstrate that decentralized FL, when deploying distributed optimization, provides enhanced privacy protection - both theoretically and empirically - compared to centralized approaches. The challenge of quantifying privacy loss through iterative processes has traditionally constrained the theoretical exploration of FL protocols. We overcome this by conducting a pioneering in-depth information-theoretical privacy analysis for both frameworks. Our analysis, considering both eavesdropping and passive adversary models, successfully establishes bounds on privacy leakage. We show information theoretically that the privacy loss in decentralized FL is upper bounded by the loss in centralized FL. Compared to the centralized case where local gradients of individual participants are directly revealed, a key distinction of optimization-based decentralized FL is that the relevant information includes differences of local gradients over successive iterations and the aggregated sum of different nodes' gradients over the network. This information complicates the adversary's attempt to infer private data. To bridge our theoretical insights with practical applications, we present detailed case studies involving logistic regression and deep neural networks. These examples demonstrate that while privacy leakage remains comparable in simpler models, complex models like deep neural networks exhibit lower privacy risks under decentralized FL. △ Less

Submitted 12 July, 2024; originally announced July 2024.

arXiv:2407.09013 [pdf, ps, other]

Procedural Content Generation via Generative Artificial Intelligence

Authors: Xinyu Mao, Wanli Yu, Kazunori D Yamada, Michael R. Zielewski

Abstract: The attempt to utilize machine learning in PCG has been made in the past. In this survey paper, we investigate how generative artificial intelligence (AI), which saw a significant increase in interest in the mid-2010s, is being used for PCG. We review applications of generative AI for the creation of various types of content, including terrains, items, and even storylines. While generative AI is e… ▽ More The attempt to utilize machine learning in PCG has been made in the past. In this survey paper, we investigate how generative artificial intelligence (AI), which saw a significant increase in interest in the mid-2010s, is being used for PCG. We review applications of generative AI for the creation of various types of content, including terrains, items, and even storylines. While generative AI is effective for PCG, one significant issues it faces is that building high-performance generative AI requires vast amounts of training data. Because content generally highly customized, domain-specific training data is scarce, and straightforward approaches to generative AI models may not work well. For PCG research to advance further, issues related to limited training data must be overcome. Thus, we also give special consideration to research that addresses the challenges posed by limited training data. △ Less

Submitted 12 July, 2024; originally announced July 2024.

arXiv:2407.08910 [pdf, other]

doi 10.1145/3637528.3671611

PAIL: Performance based Adversarial Imitation Learning Engine for Carbon Neutral Optimization

Authors: Yuyang Ye, Lu-An Tang, Haoyu Wang, Runlong Yu, Wenchao Yu, Erhu He, Haifeng Chen, Hui Xiong

Abstract: Achieving carbon neutrality within industrial operations has become increasingly imperative for sustainable development. It is both a significant challenge and a key opportunity for operational optimization in industry 4.0. In recent years, Deep Reinforcement Learning (DRL) based methods offer promising enhancements for sequential optimization processes and can be used for reducing carbon emission… ▽ More Achieving carbon neutrality within industrial operations has become increasingly imperative for sustainable development. It is both a significant challenge and a key opportunity for operational optimization in industry 4.0. In recent years, Deep Reinforcement Learning (DRL) based methods offer promising enhancements for sequential optimization processes and can be used for reducing carbon emissions. However, existing DRL methods need a pre-defined reward function to assess the impact of each action on the final sustainable development goals (SDG). In many real applications, such a reward function cannot be given in advance. To address the problem, this study proposes a Performance based Adversarial Imitation Learning (PAIL) engine. It is a novel method to acquire optimal operational policies for carbon neutrality without any pre-defined action rewards. Specifically, PAIL employs a Transformer-based policy generator to encode historical information and predict following actions within a multi-dimensional space. The entire action sequence will be iteratively updated by an environmental simulator. Then PAIL uses a discriminator to minimize the discrepancy between generated sequences and real-world samples of high SDG. In parallel, a Q-learning framework based performance estimator is designed to estimate the impact of each action on SDG. Based on these estimations, PAIL refines generated policies with the rewards from both discriminator and performance estimator. PAIL is evaluated on multiple real-world application cases and datasets. The experiment results demonstrate the effectiveness of PAIL comparing to other state-of-the-art baselines. In addition, PAIL offers meaningful interpretability for the optimization in carbon neutrality. △ Less

Submitted 11 July, 2024; originally announced July 2024.

arXiv:2407.08421 [pdf, other]

X-ray spectral and timing evolution during the 2018 outburst of MAXI J1820+070

Authors: YaXing Li, Zhen Yan, ChenXu Gao, Wenfei Yu

Abstract: We made use high-cadence observations from the $Insight$-HXMT and $NICER$ to scrutinize the spectral and timing evolution during the 2018 outburst of the black hole X-ray binary (BHXRB) MAXI J1820+070. It's hardness-intensity diagram (HID) displays a ''q''-like track including all the spectral states, along a unique loop in the hard state. The tracks observed in the HID is anticipated in the evolu… ▽ More We made use high-cadence observations from the $Insight$-HXMT and $NICER$ to scrutinize the spectral and timing evolution during the 2018 outburst of the black hole X-ray binary (BHXRB) MAXI J1820+070. It's hardness-intensity diagram (HID) displays a ''q''-like track including all the spectral states, along a unique loop in the hard state. The tracks observed in the HID is anticipated in the evolution of the components responsible for Compton and reflection emission. This is substantiated by the relationship between the X-ray luminosity $L_\mathrm{X}$ and photon index $Γ$, as well as the relationship between X-ray luminosity $L_\mathrm{X}$ and the ratio of Compton to disk luminosities $L_\mathrm{C}/L_\mathrm{D}$. Both of these relationships exhibit a pattern reminiscent of HID. During the hard state, the hardness (also $Γ$) is determined by either reflection component ($R_{f}>1$ ) or Compton component ($R_{f}<1$) depending on the value reflection fraction $R_{f}$. So the distinctive evolution of $R_{f}$ leads to the unique loop in the HID (also in the $L_\mathrm{X}$--$Γ$ plane) of hard state. Additionally, we found a negative correlation between frequency of the type-C quasi-periodic oscillation (QPO) ($ν_{\mathrm{C,QPO}}$) and the optical depth of the Compton emission ($τ$), and a positive correlation between $ν_{\mathrm{C,QPO}}$ and $Γ$. These correlations strongly suggest a coupling between the QPO properties and the underlying process responsible for Comptonization. △ Less

Submitted 11 July, 2024; originally announced July 2024.

Comments: 14 pages, 10 figures, submitted to MNRAS

arXiv:2407.07775 [pdf, other]

Mobility VLA: Multimodal Instruction Navigation with Long-Context VLMs and Topological Graphs

Authors: Hao-Tien Lewis Chiang, Zhuo Xu, Zipeng Fu, Mithun George Jacob, Tingnan Zhang, Tsang-Wei Edward Lee, Wenhao Yu, Connor Schenck, David Rendleman, Dhruv Shah, Fei Xia, Jasmine Hsu, Jonathan Hoech, Pete Florence, Sean Kirmani, Sumeet Singh, Vikas Sindhwani, Carolina Parada, Chelsea Finn, Peng Xu, Sergey Levine, Jie Tan

Abstract: An elusive goal in navigation research is to build an intelligent agent that can understand multimodal instructions including natural language and image, and perform useful navigation. To achieve this, we study a widely useful category of navigation tasks we call Multimodal Instruction Navigation with demonstration Tours (MINT), in which the environment prior is provided through a previously recor… ▽ More An elusive goal in navigation research is to build an intelligent agent that can understand multimodal instructions including natural language and image, and perform useful navigation. To achieve this, we study a widely useful category of navigation tasks we call Multimodal Instruction Navigation with demonstration Tours (MINT), in which the environment prior is provided through a previously recorded demonstration video. Recent advances in Vision Language Models (VLMs) have shown a promising path in achieving this goal as it demonstrates capabilities in perceiving and reasoning about multimodal inputs. However, VLMs are typically trained to predict textual output and it is an open research question about how to best utilize them in navigation. To solve MINT, we present Mobility VLA, a hierarchical Vision-Language-Action (VLA) navigation policy that combines the environment understanding and common sense reasoning power of long-context VLMs and a robust low-level navigation policy based on topological graphs. The high-level policy consists of a long-context VLM that takes the demonstration tour video and the multimodal user instruction as input to find the goal frame in the tour video. Next, a low-level policy uses the goal frame and an offline constructed topological graph to generate robot actions at every timestep. We evaluated Mobility VLA in a 836m^2 real world environment and show that Mobility VLA has a high end-to-end success rates on previously unsolved multimodal instructions such as "Where should I return this?" while holding a plastic bin. A video demonstrating Mobility VLA can be found here: https://youtu.be/-Tof__Q8_5s △ Less

Submitted 12 July, 2024; v1 submitted 10 July, 2024; originally announced July 2024.

arXiv:2407.07304 [pdf, other]

Inference Performance Optimization for Large Language Models on CPUs

Authors: Pujiang He, Shan Zhou, Wenhuan Huang, Changqing Li, Duyi Wang, Bin Guo, Chen Meng, Sheng Gui, Weifei Yu, Yi Xie

Abstract: Large language models (LLMs) have shown exceptional performance and vast potential across diverse tasks. However, the deployment of LLMs with high performance in low-resource environments has garnered significant attention in the industry. When GPU hardware resources are limited, we can explore alternative options on CPUs. To mitigate the financial burden and alleviate constraints imposed by hardw… ▽ More Large language models (LLMs) have shown exceptional performance and vast potential across diverse tasks. However, the deployment of LLMs with high performance in low-resource environments has garnered significant attention in the industry. When GPU hardware resources are limited, we can explore alternative options on CPUs. To mitigate the financial burden and alleviate constraints imposed by hardware resources, optimizing inference performance is necessary. In this paper, we introduce an easily deployable inference performance optimization solution aimed at accelerating LLMs on CPUs. In this solution, we implement an effective way to reduce the KV cache size while ensuring precision. We propose a distributed inference optimization approach and implement it based on oneAPI Collective Communications Library. Furthermore, we propose optimization approaches for LLMs on CPU, and conduct tailored optimizations for the most commonly used models. The code is open-sourced at https://github.com/intel/xFasterTransformer. △ Less

Submitted 9 July, 2024; originally announced July 2024.

Comments: 5 pages, 6 figure, ICML 2024 on Foundation Models in the Wild

arXiv:2407.06222 [pdf, other]

Formalization of the Filter Extension Principle (FEP) in Coq

Authors: Guowei Dou, Wensheng Yu

Abstract: The Filter Extension Principle (FEP) asserts that every filter can be extended to an ultrafilter, which plays a crucial role in the quest for non-principal ultrafilters. Non-principal ultrafilters find widespread applications in logic, set theory, topology, model theory, and especially non-standard extensions of algebraic structures. Since non-principal ultrafilters are challenging to construct di… ▽ More The Filter Extension Principle (FEP) asserts that every filter can be extended to an ultrafilter, which plays a crucial role in the quest for non-principal ultrafilters. Non-principal ultrafilters find widespread applications in logic, set theory, topology, model theory, and especially non-standard extensions of algebraic structures. Since non-principal ultrafilters are challenging to construct directly, the Filter Extension Principle, stemming from the Axiom of Choice, holds significant value in obtaining them. This paper presents the formal verification of the Filter Extension Principle, implemented using the Coq proof assistant and grounded in axiomatic set theory. It offers formal descriptions for the concepts related to filter base, filter, ultrafilter and more. All relevant theorems, propositions, and the Filter Extension Principle itself are rigorously and formally verified. This work sets the stage for the formalization of non-standard analysis and a specific real number theory. △ Less

Submitted 5 July, 2024; originally announced July 2024.

Comments: Conference on Intelligent Networked Things, 2024 (CINT2024)

arXiv:2407.05540 [pdf, other]

GTP-4o: Modality-prompted Heterogeneous Graph Learning for Omni-modal Biomedical Representation

Authors: Chenxin Li, Xinyu Liu, Cheng Wang, Yifan Liu, Weihao Yu, Jing Shao, Yixuan Yuan

Abstract: Recent advances in learning multi-modal representation have witnessed the success in biomedical domains. While established techniques enable handling multi-modal information, the challenges are posed when extended to various clinical modalities and practical modalitymissing setting due to the inherent modality gaps. To tackle these, we propose an innovative Modality-prompted Heterogeneous Graph fo… ▽ More Recent advances in learning multi-modal representation have witnessed the success in biomedical domains. While established techniques enable handling multi-modal information, the challenges are posed when extended to various clinical modalities and practical modalitymissing setting due to the inherent modality gaps. To tackle these, we propose an innovative Modality-prompted Heterogeneous Graph for Omnimodal Learning (GTP-4o), which embeds the numerous disparate clinical modalities into a unified representation, completes the deficient embedding of missing modality and reformulates the cross-modal learning with a graph-based aggregation. Specially, we establish a heterogeneous graph embedding to explicitly capture the diverse semantic properties on both the modality-specific features (nodes) and the cross-modal relations (edges). Then, we design a modality-prompted completion that enables completing the inadequate graph representation of missing modality through a graph prompting mechanism, which generates hallucination graphic topologies to steer the missing embedding towards the intact representation. Through the completed graph, we meticulously develop a knowledge-guided hierarchical cross-modal aggregation consisting of a global meta-path neighbouring to uncover the potential heterogeneous neighbors along the pathways driven by domain knowledge, and a local multi-relation aggregation module for the comprehensive cross-modal interaction across various heterogeneous relations. We assess the efficacy of our methodology on rigorous benchmarking experiments against prior state-of-the-arts. In a nutshell, GTP-4o presents an initial foray into the intriguing realm of embedding, relating and perceiving the heterogeneous patterns from various clinical modalities holistically via a graph theory. Project page: https://gtp-4-o.github.io/. △ Less

Submitted 7 July, 2024; originally announced July 2024.

Comments: Accepted by ECCV2024

arXiv:2407.05413 [pdf, other]

SBoRA: Low-Rank Adaptation with Regional Weight Updates

Authors: Lai-Man Po, Yuyang Liu, Haoxuan Wu, Tianqi Zhang, Wing-Yin Yu, Zeyu Jiang, Kun Li

Abstract: This paper introduces Standard Basis LoRA (SBoRA), a novel parameter-efficient fine-tuning approach for Large Language Models that builds upon the pioneering works of Low-Rank Adaptation (LoRA) and Orthogonal Adaptation. SBoRA further reduces the computational and memory requirements of LoRA while enhancing learning performance. By leveraging orthogonal standard basis vectors to initialize one of… ▽ More This paper introduces Standard Basis LoRA (SBoRA), a novel parameter-efficient fine-tuning approach for Large Language Models that builds upon the pioneering works of Low-Rank Adaptation (LoRA) and Orthogonal Adaptation. SBoRA further reduces the computational and memory requirements of LoRA while enhancing learning performance. By leveraging orthogonal standard basis vectors to initialize one of the low-rank matrices, either A or B, SBoRA enables regional weight updates and memory-efficient fine-tuning. This approach gives rise to two variants, SBoRA-FA and SBoRA-FB, where only one of the matrices is updated, resulting in a sparse update matrix with a majority of zero rows or columns. Consequently, the majority of the fine-tuned model's weights remain unchanged from the pre-trained weights. This characteristic of SBoRA, wherein regional weight updates occur, is reminiscent of the modular organization of the human brain, which efficiently adapts to new tasks. Our empirical results demonstrate the superiority of SBoRA-FA over LoRA in various fine-tuning tasks, including commonsense reasoning and arithmetic reasoning. Furthermore, we evaluate the effectiveness of QSBoRA on quantized LLaMA models of varying scales, highlighting its potential for efficient adaptation to new tasks. Code is available at https://github.com/cityuhkai/SBoRA △ Less

Submitted 10 July, 2024; v1 submitted 7 July, 2024; originally announced July 2024.

Comments: 15 pages, 2 figures

arXiv:2407.05236 [pdf, other]

A timing view of the additional high-energy spectral component discovered in the black hole candidate Swift J1727.8-1613

Authors: Zi-Xu Yang, Liang Zhang, Shuang-Nan Zhang, L. Tao, Shu Zhang, Ruican Ma, Qingcui Bu, Yue Huang, He-Xin Liu, Wei Yu, Guang C. Xiao, Peng-Ju Wang, Hua Feng, Li-Ming Song, Xiang Ma, Mingyu Ge, QingChang Zhao, J. L. Qu

Abstract: We present an energy-dependent analysis for the type-C quasi-periodic oscillations (QPOs) observed in the black hole X-ray binary Swift J1727.8-1613 using Insight-HXMT observations. We find that the QPO fractional rms at energies above 40 keV is significantly higher than that below 20 keV. This is the first report of a high energy (HE)-rms excess in the rms spectrum of a black hole X-ray binary. I… ▽ More We present an energy-dependent analysis for the type-C quasi-periodic oscillations (QPOs) observed in the black hole X-ray binary Swift J1727.8-1613 using Insight-HXMT observations. We find that the QPO fractional rms at energies above 40 keV is significantly higher than that below 20 keV. This is the first report of a high energy (HE)-rms excess in the rms spectrum of a black hole X-ray binary. In the high energy band, an extra hard component is observed in additional to the standard thermal Comptonization component at similar energy band. The value of the QPO HE-rms excess is not only correlated with the disk parameters and the photon index of the standard Comptonization component, but also exhibits a moderate positive correlation with the flux of the additional hard spectral component. No features in the QPO phase-lag spectra are seen corresponding to the additional hard component. We propose that the additional hard component in the spectrum may originate from jet emission and the associated QPO HE-rms excess can be explained by the precession of the jet base. △ Less

Submitted 6 July, 2024; originally announced July 2024.

arXiv:2407.03971 [pdf, other]

MineNetCD: A Benchmark for Global Mining Change Detection on Remote Sensing Imagery

Authors: Weikang Yu, Xiaokang Zhang, Xiao Xiang Zhu, Richard Gloaguen, Pedram Ghamisi

Abstract: Monitoring changes triggered by mining activities is crucial for industrial controlling, environmental management and regulatory compliance, yet it poses significant challenges due to the vast and often remote locations of mining sites. Remote sensing technologies have increasingly become indispensable to detect and analyze these changes over time. We thus introduce MineNetCD, a comprehensive benc… ▽ More Monitoring changes triggered by mining activities is crucial for industrial controlling, environmental management and regulatory compliance, yet it poses significant challenges due to the vast and often remote locations of mining sites. Remote sensing technologies have increasingly become indispensable to detect and analyze these changes over time. We thus introduce MineNetCD, a comprehensive benchmark designed for global mining change detection using remote sensing imagery. The benchmark comprises three key contributions. First, we establish a global mining change detection dataset featuring more than 70k paired patches of bi-temporal high-resolution remote sensing images and pixel-level annotations from 100 mining sites worldwide. Second, we develop a novel baseline model based on a change-aware Fast Fourier Transform (ChangeFFT) module, which enhances various backbones by leveraging essential spectrum components within features in the frequency domain and capturing the channel-wise correlation of bi-temporal feature differences to learn change-aware representations. Third, we construct a unified change detection (UCD) framework that integrates over 13 advanced change detection models. This framework is designed for streamlined and efficient processing, utilizing the cloud platform hosted by HuggingFace. Extensive experiments have been conducted to demonstrate the superiority of the proposed baseline model compared with 12 state-of-the-art change detection approaches. Empirical studies on modularized backbones comprehensively confirm the efficacy of different representation learners on change detection. This contribution represents significant advancements in the field of remote sensing and change detection, providing a robust resource for future research and applications in global mining monitoring. Dataset and Codes are available via the link. △ Less

Submitted 4 July, 2024; originally announced July 2024.

arXiv:2407.02190 [pdf, other]

I2EKF-LO: A Dual-Iteration Extended Kalman Filter Based LiDAR Odometry

Authors: Wenlu Yu, Jie Xu, Chengwei Zhao, Lijun Zhao, Thien-Minh Nguyen, Shenghai Yuan, Mingming Bai, Lihua Xie

Abstract: LiDAR odometry is a pivotal technology in the fields of autonomous driving and autonomous mobile robotics. However, most of the current works focus on nonlinear optimization methods, and still existing many challenges in using the traditional Iterative Extended Kalman Filter (IEKF) framework to tackle the problem: IEKF only iterates over the observation equation, relying on a rough estimate of the… ▽ More LiDAR odometry is a pivotal technology in the fields of autonomous driving and autonomous mobile robotics. However, most of the current works focus on nonlinear optimization methods, and still existing many challenges in using the traditional Iterative Extended Kalman Filter (IEKF) framework to tackle the problem: IEKF only iterates over the observation equation, relying on a rough estimate of the initial state, which is insufficient to fully eliminate motion distortion in the input point cloud; the system process noise is difficult to be determined during state estimation of the complex motions; and the varying motion models across different sensor carriers. To address these issues, we propose the Dual-Iteration Extended Kalman Filter (I2EKF) and the LiDAR odometry based on I2EKF (I2EKF-LO). This approach not only iterates over the observation equation but also leverages state updates to iteratively mitigate motion distortion in LiDAR point clouds. Moreover, it dynamically adjusts process noise based on the confidence level of prior predictions during state estimation and establishes motion models for different sensor carriers to achieve accurate and efficient state estimation. Comprehensive experiments demonstrate that I2EKF-LO achieves outstanding levels of accuracy and computational efficiency in the realm of LiDAR odometry. Additionally, to foster community development, our code is open-sourced.https://github.com/YWL0720/I2EKF-LO. △ Less

Submitted 2 July, 2024; originally announced July 2024.

Comments: Accepted by IROS 2024

arXiv:2407.01950 [pdf, other]

LDP: A Local Diffusion Planner for Efficient Robot Navigation and Collision Avoidance

Authors: Wenhao Yu, Jie Peng, Huanyu Yang, Junrui Zhang, Yifan Duan, Jianmin Ji, Yanyong Zhang

Abstract: The conditional diffusion model has been demonstrated as an efficient tool for learning robot policies, owing to its advancement to accurately model the conditional distribution of policies. The intricate nature of real-world scenarios, characterized by dynamic obstacles and maze-like structures, underscores the complexity of robot local navigation decision-making as a conditional distribution pro… ▽ More The conditional diffusion model has been demonstrated as an efficient tool for learning robot policies, owing to its advancement to accurately model the conditional distribution of policies. The intricate nature of real-world scenarios, characterized by dynamic obstacles and maze-like structures, underscores the complexity of robot local navigation decision-making as a conditional distribution problem. Nevertheless, leveraging the diffusion model for robot local navigation is not trivial and encounters several under-explored challenges: (1) Data Urgency. The complex conditional distribution in local navigation needs training data to include diverse policy in diverse real-world scenarios; (2) Myopic Observation. Due to the diversity of the perception scenarios, diffusion decisions based on the local perspective of robots may prove suboptimal for completing the entire task, as they often lack foresight. In certain scenarios requiring detours, the robot may become trapped. To address these issues, our approach begins with an exploration of a diverse data generation mechanism that encompasses multiple agents exhibiting distinct preferences through target selection informed by integrated global-local insights. Then, based on this diverse training data, a diffusion agent is obtained, capable of excellent collision avoidance in diverse scenarios. Subsequently, we augment our Local Diffusion Planner, also known as LDP by incorporating global observations in a lightweight manner. This enhancement broadens the observational scope of LDP, effectively mitigating the risk of becoming ensnared in local optima and promoting more robust navigational decisions. △ Less

Submitted 2 July, 2024; originally announced July 2024.

Comments: 8 pages, 6 figures, accepted by IROS 2024

arXiv:2407.01875 [pdf, ps, other]

Spatio-Temporal Graphical Counterfactuals: An Overview

Authors: Mingyu Kang, Duxin Chen, Ziyuan Pu, Jianxi Gao, Wenwu Yu

Abstract: Counterfactual thinking is a critical yet challenging topic for artificial intelligence to learn knowledge from data and ultimately improve their performances for new scenarios. Many research works, including Potential Outcome Model and Structural Causal Model, have been proposed to realize it. However, their modelings, theoretical foundations and application approaches are usually different. More… ▽ More Counterfactual thinking is a critical yet challenging topic for artificial intelligence to learn knowledge from data and ultimately improve their performances for new scenarios. Many research works, including Potential Outcome Model and Structural Causal Model, have been proposed to realize it. However, their modelings, theoretical foundations and application approaches are usually different. Moreover, there is a lack of graphical approach to infer spatio-temporal counterfactuals, that considers spatial and temporal interactions between multiple units. Thus, in this work, our aim is to investigate a survey to compare and discuss different counterfactual models, theories and approaches, and further build a unified graphical causal frameworks to infer the spatio-temporal counterfactuals. △ Less

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2407.01029 [pdf, other]

EndoSparse: Real-Time Sparse View Synthesis of Endoscopic Scenes using Gaussian Splatting

Authors: Chenxin Li, Brandon Y. Feng, Yifan Liu, Hengyu Liu, Cheng Wang, Weihao Yu, Yixuan Yuan

Abstract: 3D reconstruction of biological tissues from a collection of endoscopic images is a key to unlock various important downstream surgical applications with 3D capabilities. Existing methods employ various advanced neural rendering techniques for photorealistic view synthesis, but they often struggle to recover accurate 3D representations when only sparse observations are available, which is usually… ▽ More 3D reconstruction of biological tissues from a collection of endoscopic images is a key to unlock various important downstream surgical applications with 3D capabilities. Existing methods employ various advanced neural rendering techniques for photorealistic view synthesis, but they often struggle to recover accurate 3D representations when only sparse observations are available, which is usually the case in real-world clinical scenarios. To tackle this {sparsity} challenge, we propose a framework leveraging the prior knowledge from multiple foundation models during the reconstruction process, dubbed as \textit{EndoSparse}. Experimental results indicate that our proposed strategy significantly improves the geometric and appearance quality under challenging sparse-view conditions, including using only three views. In rigorous benchmarking experiments against state-of-the-art methods, \textit{EndoSparse} achieves superior results in terms of accurate geometry, realistic appearance, and rendering efficiency, confirming the robustness to sparse-view limitations in endoscopic reconstruction. \textit{EndoSparse} signifies a steady step towards the practical deployment of neural 3D reconstruction in real-world clinical scenarios. Project page: https://endo-sparse.github.io/. △ Less

Submitted 1 July, 2024; originally announced July 2024.

Comments: Accpeted by MICCAI2024

arXiv:2407.00946 [pdf]

Atomic cluster expansion interatomic potential for defects and thermodynamics of Cu-W system

Authors: Jiahao Pan, Huiqun Cheng, Gaosheng Yan, Lei Zhang, Wenshan Yu, Shengping Shen

Abstract: The unique properties exhibited in immiscible metals, such as excellent strength, hardness, and radiation-damage tolerance, have stimulated the interest of many researchers. As a typical immiscible metal system, the Cu-W nano-multilayers combine the plasticity of copper and the strength of tungsten, making it a suitable candidate for applications in aerospace, nuclear fusion engineering, and elect… ▽ More The unique properties exhibited in immiscible metals, such as excellent strength, hardness, and radiation-damage tolerance, have stimulated the interest of many researchers. As a typical immiscible metal system, the Cu-W nano-multilayers combine the plasticity of copper and the strength of tungsten, making it a suitable candidate for applications in aerospace, nuclear fusion engineering, and electronic packaging etc. To understand the atomistic origin of the defects and thermodynamics of the Cu-W immiscible system, we have developed an accurate machine learning interatomic potential (ML-IAP) for Cu-W based on the atomic cluster expansion (ACE) method. The Cu-W ACE potential can faithfully reproduce the fundamental properties of Cu and W predicted by density functional theory (DFT). Moreover, the thermodynamical properties, such as the melting point, coefficient of thermal expansion, diffusion coefficient, and equation of the state curve of the Cu-W solid solution, are calculated and compared against DFT and experiments. Monte Carlo Molecular Dynamics (MC-MD) simulations performed with the Cu-W ACE potential predict the experimentally observed phase separation and uphill diffusion phenomena. Our findings not only provide an accurate ACE potential for describing the Cu-W immiscible system, but also shed light on understanding the atomistic mechanism during the Cu-W nano-multilayers formation process. △ Less

Submitted 30 June, 2024; originally announced July 2024.

Comments: 26 pages, 14 figures

arXiv:2407.00029 [pdf, other]

Distributed Inference Performance Optimization for LLMs on CPUs

Authors: Pujiang He, Shan Zhou, Changqing Li, Wenhuan Huang, Weifei Yu, Duyi Wang, Chen Meng, Sheng Gui

Abstract: Large language models (LLMs) hold tremendous potential for addressing numerous real-world challenges, yet they typically demand significant computational resources and memory. Deploying LLMs onto a resource-limited hardware device with restricted memory capacity presents considerable challenges. Distributed computing emerges as a prevalent strategy to mitigate single-node memory constraints and ex… ▽ More Large language models (LLMs) hold tremendous potential for addressing numerous real-world challenges, yet they typically demand significant computational resources and memory. Deploying LLMs onto a resource-limited hardware device with restricted memory capacity presents considerable challenges. Distributed computing emerges as a prevalent strategy to mitigate single-node memory constraints and expedite LLM inference performance. To reduce the hardware limitation burden, we proposed an efficient distributed inference optimization solution for LLMs on CPUs. We conduct experiments with the proposed solution on 5th Gen Intel Xeon Scalable Processors, and the result shows the time per output token for the LLM with 72B parameter is 140 ms/token, much faster than the average human reading speed about 200ms per token. △ Less

Submitted 16 May, 2024; originally announced July 2024.

Comments: 4 pages, 3 figures, Practical ML for Low Resource Settings Workshop @ ICLR 2024

arXiv:2406.20019 [pdf, other]

Capacity Bounds for Broadcast Channels with Bidirectional Conferencing Decoders

Authors: Reza K. Farsani, Wei Yu

Abstract: The two-user broadcast channel (BC) with receivers connected by cooperative links of given capacities, known as conferencing decoders, is considered. A novel outer bound on the capacity region is established. This outer bound is derived using multiple applications of the Csiszár-Körner identity. New achievable rate regions are also presented. A first achievable rate region is derived by applying M… ▽ More The two-user broadcast channel (BC) with receivers connected by cooperative links of given capacities, known as conferencing decoders, is considered. A novel outer bound on the capacity region is established. This outer bound is derived using multiple applications of the Csiszár-Körner identity. New achievable rate regions are also presented. A first achievable rate region is derived by applying Marton's coding as the transmission scheme, and quantize-bin-and-forward at one receiver first and then a combination of decode-and-forward and quantize-bin-and-forward at the other receiver as cooperative strategy. A second achievable rate region is given by applying a combination of decode-and-forward and quantize-bin-and-forward at one receiver first and then quantize-bin-and-forward at the other receiver. It is proved that the outer bound coincides with the first achievable rate region for a class of semi-deterministic BCs with degraded message sets. This is the first capacity result for the two-user BC with bidirectional conferencing decoders. A capacity result is also derived for a new class of more capable semi-deterministic BCs with both common and private messages and one-sided conferencing. For the Gaussian BC with conferencing decoders, if the noises at the decoders are fully correlated (i.e., the correlation is either 1 or -1), the new outer bound yields exact capacity region for two cases: i) BC with degraded message sets; ii) BC with one-sided conferencing from the weaker receiver to the stronger receiver. An interesting consequence of these results is that for a Gaussian BC with fully negatively correlated noises and conferencing decoders of fixed cooperation link capacities, it is possible to achieve a positive rate bounded away from zero using only infinitesimal amount of transmit power. △ Less

Submitted 28 June, 2024; originally announced June 2024.

arXiv:2406.19820 [pdf, other]

BeamAggR: Beam Aggregation Reasoning over Multi-source Knowledge for Multi-hop Question Answering

Authors: Zheng Chu, Jingchang Chen, Qianglong Chen, Haotian Wang, Kun Zhu, Xiyuan Du, Weijiang Yu, Ming Liu, Bing Qin

Abstract: Large language models (LLMs) have demonstrated strong reasoning capabilities. Nevertheless, they still suffer from factual errors when tackling knowledge-intensive tasks. Retrieval-augmented reasoning represents a promising approach. However, significant challenges still persist, including inaccurate and insufficient retrieval for complex questions, as well as difficulty in integrating multi-sourc… ▽ More Large language models (LLMs) have demonstrated strong reasoning capabilities. Nevertheless, they still suffer from factual errors when tackling knowledge-intensive tasks. Retrieval-augmented reasoning represents a promising approach. However, significant challenges still persist, including inaccurate and insufficient retrieval for complex questions, as well as difficulty in integrating multi-source knowledge. To address this, we propose Beam Aggregation Reasoning, BeamAggR, a reasoning framework for knowledge-intensive multi-hop QA. BeamAggR explores and prioritizes promising answers at each hop of question. Concretely, we parse the complex questions into trees, which include atom and composite questions, followed by bottom-up reasoning. For atomic questions, the LLM conducts reasoning on multi-source knowledge to get answer candidates. For composite questions, the LLM combines beam candidates, explores multiple reasoning paths through probabilistic aggregation, and prioritizes the most promising trajectory. Extensive experiments on four open-domain multi-hop reasoning datasets show that our method significantly outperforms SOTA methods by 8.5%. Furthermore, our analysis reveals that BeamAggR elicits better knowledge collaboration and answer aggregation. △ Less

Submitted 28 June, 2024; originally announced June 2024.

Comments: Accepted to ACL 2024

arXiv:2406.19627 [pdf]

Practical Power System Inertia Monitoring Based on Pumped Storage Hydropower Operation Signature

Authors: Hongyu Li, Chang Chen, Mark Baldwin, Shutang You, Wenpeng Yu, Lin Zhu, Yilu Liu

Abstract: This paper proposes a practical method to monitor power system inertia using Pumped Storage Hydropower (PSH) switching-off events. This approach offers real-time system-level inertia estimation with minimal expenses, no disruption, and the inclusion of behind-the-meter inertia. First, accurate inertia estimation is achieved through improved RoCoF calculation that accounts for pre-event RoCoF, redu… ▽ More This paper proposes a practical method to monitor power system inertia using Pumped Storage Hydropower (PSH) switching-off events. This approach offers real-time system-level inertia estimation with minimal expenses, no disruption, and the inclusion of behind-the-meter inertia. First, accurate inertia estimation is achieved through improved RoCoF calculation that accounts for pre-event RoCoF, reducing common random frequency fluctuations in practice. Second, PSH field data is analyzed, highlighting the benefits of using switching-off events for grid inertia estimation. Third, an event detection trigger is designed to capture pump switching-off events based on local and system features. Fourth, the method is validated on the U.S. Eastern Interconnection model with over 60,000 buses, demonstrating very high accuracy (3%-5% error rate). Finally, it is applied to the U.S. Western Interconnection, with field validation showing a 9.9% average absolute error rate. Despite challenges in practical power system inertia estimation, this method enhances decision-making for power grid reliability and efficiency, addressing challenges posed by renewable energy integration. △ Less

Submitted 1 July, 2024; v1 submitted 27 June, 2024; originally announced June 2024.

Comments: 8 pages, 15 figures

arXiv:2406.19483 [pdf, ps, other]

Localization in Multipath Environments via Active Sensing with Reconfigurable Intelligent Surfaces

Authors: Yinghan Li, Wei Yu

Abstract: This letter investigates an uplink pilot-based wireless indoor localization problem in a multipath environment for a single-input single-output (SISO) narrowband communication system aided by reconfigurable intelligent surface (RIS). The indoor localization problem is challenging because the uplink channel consists of multiple overlapping propagation paths with varying amplitudes and phases, which… ▽ More This letter investigates an uplink pilot-based wireless indoor localization problem in a multipath environment for a single-input single-output (SISO) narrowband communication system aided by reconfigurable intelligent surface (RIS). The indoor localization problem is challenging because the uplink channel consists of multiple overlapping propagation paths with varying amplitudes and phases, which are not easy to differentiate. This letter proposes the use of RIS capable of adaptively changing its reflection pattern to sense such a multiple-path environment. Toward this end, we train a long-short-term-memory (LSTM) based controller to perform adaptive sequential reconfigurations of the RIS over multiple stages and propose to group multiple pilots as input in each stage. Information from the multiple paths is captured by training the LSTM to generate multiple RIS configurations to align to the different paths within each stage. Experimental results show that the proposed approach is effective in significantly reducing training complexity while maintaining localization performance at fixed number of pilots. △ Less

Submitted 8 July, 2024; v1 submitted 27 June, 2024; originally announced June 2024.

arXiv:2406.18361 [pdf, other]

Stable Diffusion Segmentation for Biomedical Images with Single-step Reverse Process

Authors: Tianyu Lin, Zhiguang Chen, Zhonghao Yan, Weijiang Yu, Fudan Zheng

Abstract: Diffusion models have demonstrated their effectiveness across various generative tasks. However, when applied to medical image segmentation, these models encounter several challenges, including significant resource and time requirements. They also necessitate a multi-step reverse process and multiple samples to produce reliable predictions. To address these challenges, we introduce the first laten… ▽ More Diffusion models have demonstrated their effectiveness across various generative tasks. However, when applied to medical image segmentation, these models encounter several challenges, including significant resource and time requirements. They also necessitate a multi-step reverse process and multiple samples to produce reliable predictions. To address these challenges, we introduce the first latent diffusion segmentation model, named SDSeg, built upon stable diffusion (SD). SDSeg incorporates a straightforward latent estimation strategy to facilitate a single-step reverse process and utilizes latent fusion concatenation to remove the necessity for multiple samples. Extensive experiments indicate that SDSeg surpasses existing state-of-the-art methods on five benchmark datasets featuring diverse imaging modalities. Remarkably, SDSeg is capable of generating stable predictions with a solitary reverse step and sample, epitomizing the model's stability as implied by its name. The code is available at https://github.com/lin-tianyu/Stable-Diffusion-Seg △ Less

Submitted 9 July, 2024; v1 submitted 26 June, 2024; originally announced June 2024.

Comments: Accepted at MICCAI 2024. Code and citation info see https://github.com/lin-tianyu/Stable-Diffusion-Seg

arXiv:2406.18008 [pdf, other]

Rate-Distortion-Perception Tradeoff for Gaussian Vector Sources

Authors: Jingjing Qian, Sadaf Salehkalaibar, Jun Chen, Ashish Khisti, Wei Yu, Wuxian Shi, Yiqun Ge, Wen Tong

Abstract: This paper studies the rate-distortion-perception (RDP) tradeoff for a Gaussian vector source coding problem where the goal is to compress the multi-component source subject to distortion and perception constraints. The purpose of imposing a perception constraint is to ensure visually pleasing reconstructions. This paper studies this RDP setting with either the Kullback-Leibler (KL) divergence or… ▽ More This paper studies the rate-distortion-perception (RDP) tradeoff for a Gaussian vector source coding problem where the goal is to compress the multi-component source subject to distortion and perception constraints. The purpose of imposing a perception constraint is to ensure visually pleasing reconstructions. This paper studies this RDP setting with either the Kullback-Leibler (KL) divergence or Wasserstein-2 metric as the perception loss function, and shows that for Gaussian vector sources, jointly Gaussian reconstructions are optimal. We further demonstrate that the optimal tradeoff can be expressed as an optimization problem, which can be explicitly solved. An interesting property of the optimal solution is as follows. Without the perception constraint, the traditional reverse water-filling solution for characterizing the rate-distortion (RD) tradeoff of a Gaussian vector source states that the optimal rate allocated to each component depends on a constant, called the water-level. If the variance of a specific component is below the water-level, it is assigned a {zero} compression rate. However, with active distortion and perception constraints, we show that the optimal rates allocated to the different components are always {positive}. Moreover, the water-levels that determine the optimal rate allocation for different components are unequal. We further treat the special case of perceptually perfect reconstruction and study its RDP function in the high-distortion and low-distortion regimes to obtain insight to the structure of the optimal solution. △ Less

Submitted 25 June, 2024; originally announced June 2024.

arXiv:2406.17269 [pdf, other]

Elko as an inflaton candidate

Authors: Xinglong Chen, Cheng-Yang Lee, Yanjiao Ma, Haomin Rao, Wenqi Yu, Siyi Zhou

Abstract: Elko is a spin-half fermion with a two-fold Wigner degeneracy and Klein-Gordon dynamics. In this paper, we show that in a spatially flat FLRW space-time, slow-roll inflation can be initiated by the homogeneous Elko fields. The inflaton is a composite scalar field obtained by contracting the spinor field with its dual. This is possible because the background evolution as described by the Friedmann… ▽ More Elko is a spin-half fermion with a two-fold Wigner degeneracy and Klein-Gordon dynamics. In this paper, we show that in a spatially flat FLRW space-time, slow-roll inflation can be initiated by the homogeneous Elko fields. The inflaton is a composite scalar field obtained by contracting the spinor field with its dual. This is possible because the background evolution as described by the Friedmann equation is completely determined by the scalar field. This approach has the advantage that we do not need to specify the initial conditions for every component of the spinor fields. We derive the equation of motion for the inflaton and also show that this solution is an attractor. △ Less

Submitted 29 June, 2024; v1 submitted 25 June, 2024; originally announced June 2024.

Comments: 15 pages, 8 figures

arXiv:2406.15877 [pdf, other]

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Authors: Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, Thong Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu , et al. (8 additional authors not shown)

Abstract: Automated software engineering has been greatly empowered by the recent advances in Large Language Models (LLMs) for programming. While current benchmarks have shown that LLMs can perform various software engineering tasks like human developers, the majority of their evaluations are limited to short and self-contained algorithmic tasks. Solving challenging and practical programming tasks requires… ▽ More Automated software engineering has been greatly empowered by the recent advances in Large Language Models (LLMs) for programming. While current benchmarks have shown that LLMs can perform various software engineering tasks like human developers, the majority of their evaluations are limited to short and self-contained algorithmic tasks. Solving challenging and practical programming tasks requires the capability of utilizing diverse function calls as tools to efficiently implement functionalities like data analysis and web development. In addition, using multiple tools to solve a task needs compositional reasoning by accurately understanding complex instructions. Fulfilling both of these characteristics can pose a great challenge for LLMs. To assess how well LLMs can solve challenging and practical programming tasks, we introduce Bench, a benchmark that challenges LLMs to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained programming tasks. To evaluate LLMs rigorously, each programming task encompasses 5.6 test cases with an average branch coverage of 99%. In addition, we propose a natural-language-oriented variant of Bench, Benchi, that automatically transforms the original docstrings into short instructions only with essential information. Our extensive evaluation of 60 LLMs shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores up to 60%, significantly lower than the human performance of 97%. The results underscore the need for further advancements in this area. △ Less

Submitted 26 June, 2024; v1 submitted 22 June, 2024; originally announced June 2024.

Comments: 44 pages, 14 figures, 7 tables, built with love by the BigCode community :)

arXiv:2406.15704 [pdf, other]

video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models

Authors: Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, Chao Zhang

Abstract: Speech understanding as an element of the more generic video understanding using audio-visual large language models (av-LLMs) is a crucial yet understudied aspect. This paper proposes video-SALMONN, a single end-to-end av-LLM for video processing, which can understand not only visual frame sequences, audio events and music, but speech as well. To obtain fine-grained temporal information required b… ▽ More Speech understanding as an element of the more generic video understanding using audio-visual large language models (av-LLMs) is a crucial yet understudied aspect. This paper proposes video-SALMONN, a single end-to-end av-LLM for video processing, which can understand not only visual frame sequences, audio events and music, but speech as well. To obtain fine-grained temporal information required by speech understanding, while keeping efficient for other video elements, this paper proposes a novel multi-resolution causal Q-Former (MRC Q-Former) structure to connect pre-trained audio-visual encoders and the backbone large language model. Moreover, dedicated training approaches including the diversity loss and the unpaired audio-visual mixed training scheme are proposed to avoid frames or modality dominance. On the introduced speech-audio-visual evaluation benchmark, video-SALMONN achieves more than 25\% absolute accuracy improvements on the video-QA task and over 30\% absolute accuracy improvements on audio-visual QA tasks with human speech. In addition, video-SALMONN demonstrates remarkable video comprehension and reasoning abilities on tasks that are unprecedented by other av-LLMs. Our training code and model checkpoints are available at \texttt{\url{https://github.com/bytedance/SALMONN/}}. △ Less

Submitted 21 June, 2024; originally announced June 2024.

Comments: Accepted at ICML 2024. arXiv admin note: substantial text overlap with arXiv:2310.05863

arXiv:2406.12064 [pdf, other]

skandiver: a divergence-based analysis tool for identifying intercellular mobile genetic elements

Authors: Xiaolei Brian Zhang, Grace Oualline, Jim Shaw, Yun William Yu

Abstract: Mobile genetic elements (MGEs) are as ubiquitous in nature as they are varied in type, ranging from viral insertions to transposons to incorporated plasmids. Horizontal transfer of MGEs across bacterial species may also pose a significant threat to global health due to their capability to harbour antibiotic resistance genes. However, despite cheap and rapid whole genome sequencing, the varied natu… ▽ More Mobile genetic elements (MGEs) are as ubiquitous in nature as they are varied in type, ranging from viral insertions to transposons to incorporated plasmids. Horizontal transfer of MGEs across bacterial species may also pose a significant threat to global health due to their capability to harbour antibiotic resistance genes. However, despite cheap and rapid whole genome sequencing, the varied nature of MGEs makes it difficult to fully characterize them, and existing methods for detecting MGEs often don't agree on what should count. In this manuscript, we first define and argue in favor of a divergence-based characterization of mobile-genetic elements. Using that paradigm, we present skandiver, a tool designed to efficiently detect MGEs from whole genome assemblies without the need for gene annotation or markers. skandiver determines mobile elements via genome fragmentation, average nucleotide identity (ANI), and divergence time. By building on the scalable skani software for ANI computation, skandiver can query hundreds of complete assemblies against $>$65,000 representative genomes in a few minutes and 19 GB memory, providing scalable and efficient method for elucidating mobile element profiles in incomplete, uncharacterized genomic sequences. For isolated and integrated large plasmids (>10kbp), skandiver's recall was 48\% and 47\%, MobileElementFinder was 59\% and 17\%, and geNomad was 86\% and 32\%, respectively. For isolated large plasmids, skandiver's recall (48\%) is lower than state-of-the-art reference-based methods geNomad (86\%) and MobileElementFinder (59\%). However, skandiver achieves higher recall on integrated plasmids and, unlike other methods, without comparing against a curated database, making skandiver suitable for discovery of novel MGEs. Availability: https://github.com/YoukaiFromAccounting/skandiver △ Less

Submitted 17 June, 2024; originally announced June 2024.

Comments: 9 pages, 6 figures

arXiv:2406.12050 [pdf, other]

Learn Beyond The Answer: Training Language Models with Reflection for Mathematical Reasoning

Authors: Zhihan Zhang, Zhenwen Liang, Wenhao Yu, Dian Yu, Mengzhao Jia, Dong Yu, Meng Jiang

Abstract: Supervised fine-tuning enhances the problem-solving abilities of language models across various mathematical reasoning tasks. To maximize such benefits, existing research focuses on broadening the training set with various data augmentation techniques, which is effective for standard single-round question-answering settings. Our work introduces a novel technique aimed at cultivating a deeper under… ▽ More Supervised fine-tuning enhances the problem-solving abilities of language models across various mathematical reasoning tasks. To maximize such benefits, existing research focuses on broadening the training set with various data augmentation techniques, which is effective for standard single-round question-answering settings. Our work introduces a novel technique aimed at cultivating a deeper understanding of the training problems at hand, enhancing performance not only in standard settings but also in more complex scenarios that require reflective thinking. Specifically, we propose reflective augmentation, a method that embeds problem reflection into each training instance. It trains the model to consider alternative perspectives and engage with abstractions and analogies, thereby fostering a thorough comprehension through reflective reasoning. Extensive experiments validate the achievement of our aim, underscoring the unique advantages of our method and its complementary nature relative to existing augmentation techniques. △ Less

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.11551 [pdf, other]

Simple Yet Efficient: Towards Self-Supervised FG-SBIR with Unified Sample Feature Alignment

Authors: Jianan Jiang, Di Wu, Zhilin Jiang, Weiren Yu

Abstract: Fine-Grained Sketch-Based Image Retrieval (FG-SBIR) aims to minimize the distance between sketches and corresponding images in the embedding space. However, scalability is hindered by the growing complexity of solutions, mainly due to the abstract nature of fine-grained sketches. In this paper, we propose a simple yet efficient approach to narrow the gap between the two modes. It mainly facilitate… ▽ More Fine-Grained Sketch-Based Image Retrieval (FG-SBIR) aims to minimize the distance between sketches and corresponding images in the embedding space. However, scalability is hindered by the growing complexity of solutions, mainly due to the abstract nature of fine-grained sketches. In this paper, we propose a simple yet efficient approach to narrow the gap between the two modes. It mainly facilitates unified mutual information sharing both intra- and inter-samples, rather than treating them as a single feature alignment problem between modalities. Specifically, our approach includes: (i) Employing dual weight-sharing networks to optimize alignment within sketch and image domain, which also effectively mitigates model learning saturation issues. (ii) Introducing an objective optimization function based on contrastive loss to enhance the model's ability to align features intra- and inter-samples. (iii) Presenting a learnable TRSM combined of self-attention and cross-attention to promote feature representations among tokens, further enhancing sample alignment in the embedding space. Our framework achieves excellent results on CNN- and ViT-based backbones. Extensive experiments demonstrate its superiority over existing methods. We also introduce Cloths-V1, the first professional fashion sketches and images dataset, utilized to validate our method and will be beneficial for other applications. △ Less

Submitted 22 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

Comments: 10 pages,8 figures, 4 tables

arXiv:2406.11507 [pdf, other]

Prior Normality Prompt Transformer for Multi-class Industrial Image Anomaly Detection

Authors: Haiming Yao, Yunkang Cao, Wei Luo, Weihang Zhang, Wenyong Yu, Weiming Shen

Abstract: Image anomaly detection plays a pivotal role in industrial inspection. Traditional approaches often demand distinct models for specific categories, resulting in substantial deployment costs. This raises concerns about multi-class anomaly detection, where a unified model is developed for multiple classes. However, applying conventional methods, particularly reconstruction-based models, directly to… ▽ More Image anomaly detection plays a pivotal role in industrial inspection. Traditional approaches often demand distinct models for specific categories, resulting in substantial deployment costs. This raises concerns about multi-class anomaly detection, where a unified model is developed for multiple classes. However, applying conventional methods, particularly reconstruction-based models, directly to multi-class scenarios encounters challenges such as identical shortcut learning, hindering effective discrimination between normal and abnormal instances. To tackle this issue, our study introduces the Prior Normality Prompt Transformer (PNPT) method for multi-class image anomaly detection. PNPT strategically incorporates normal semantics prompting to mitigate the "identical mapping" problem. This entails integrating a prior normality prompt into the reconstruction process, yielding a dual-stream model. This innovative architecture combines normal prior semantics with abnormal samples, enabling dual-stream reconstruction grounded in both prior knowledge and intrinsic sample characteristics. PNPT comprises four essential modules: Class-Specific Normality Prompting Pool (CS-NPP), Hierarchical Patch Embedding (HPE), Semantic Alignment Coupling Encoding (SACE), and Contextual Semantic Conditional Decoding (CSCD). Experimental validation on diverse benchmark datasets and real-world industrial applications highlights PNPT's superior performance in multi-class industrial anomaly detection. △ Less

Submitted 17 June, 2024; originally announced June 2024.

Comments: Accepted by IEEE Transactions on Industrial Informatics

arXiv:2406.10780 [pdf, ps, other]

Colouring negative exact-distance graphs of signed graphs

Authors: Reza Naserasr, Patrice Ossona de Mendez, Daniel A. Quiroz, Robert Šámal, Weiqiang Yu

Abstract: The $k$-th exact-distance graph, of a graph $G$ has $V(G)$ as its vertex set, and $xy$ as an edge if and only if the distance between $x$ and $y$ is (exactly) $k$ in $G$. We consider two possible extensions of this notion for signed graphs. Finding the chromatic number of a negative exact-distance square of a signed graph is a weakening of the problem of finding the smallest target graph to which… ▽ More The $k$-th exact-distance graph, of a graph $G$ has $V(G)$ as its vertex set, and $xy$ as an edge if and only if the distance between $x$ and $y$ is (exactly) $k$ in $G$. We consider two possible extensions of this notion for signed graphs. Finding the chromatic number of a negative exact-distance square of a signed graph is a weakening of the problem of finding the smallest target graph to which the signed graph has a sign-preserving homomorphism. We study the chromatic number of negative exact-distance graphs of signed graphs that are planar, and also the relation of these chromatic numbers with the generalised colouring numbers of the underlying graphs. Our results are related to a theorem of Alon and Marshall about homomorphisms of signed graphs. △ Less

Submitted 15 June, 2024; originally announced June 2024.

Comments: 16 pages, 2 figures, 3 tables

MSC Class: 05C10; 05C12; 05C15; 05C22; 05C60

arXiv:2406.10744 [pdf, other]

Technique Report of CVPR 2024 PBDL Challenges

Authors: Ying Fu, Yu Li, Shaodi You, Boxin Shi, Linwei Chen, Yunhao Zou, Zichun Wang, Yichen Li, Yuze Han, Yingkai Zhang, Jianan Wang, Qinglin Liu, Wei Yu, Xiaoqian Lv, Jianing Li, Shengping Zhang, Xiangyang Ji, Yuanpei Chen, Yuhan Zhang, Weihang Peng, Liwen Zhang, Zhe Xu, Dingyong Gou, Cong Li, Senyan Xu , et al. (75 additional authors not shown)

Abstract: The intersection of physics-based vision and deep learning presents an exciting frontier for advancing computer vision technologies. By leveraging the principles of physics to inform and enhance deep learning models, we can develop more robust and accurate vision systems. Physics-based vision aims to invert the processes to recover scene properties such as shape, reflectance, light distribution, a… ▽ More The intersection of physics-based vision and deep learning presents an exciting frontier for advancing computer vision technologies. By leveraging the principles of physics to inform and enhance deep learning models, we can develop more robust and accurate vision systems. Physics-based vision aims to invert the processes to recover scene properties such as shape, reflectance, light distribution, and medium properties from images. In recent years, deep learning has shown promising improvements for various vision tasks, and when combined with physics-based vision, these approaches can enhance the robustness and accuracy of vision systems. This technical report summarizes the outcomes of the Physics-Based Vision Meets Deep Learning (PBDL) 2024 challenge, held in CVPR 2024 workshop. The challenge consisted of eight tracks, focusing on Low-Light Enhancement and Detection as well as High Dynamic Range (HDR) Imaging. This report details the objectives, methodologies, and results of each track, highlighting the top-performing solutions and their innovative approaches. △ Less

Submitted 12 July, 2024; v1 submitted 15 June, 2024; originally announced June 2024.

Comments: CVPR 2024 PBDL Challenges: https://pbdl-ws.github.io/pbdl2024/challenge/index.html

arXiv:2406.10583 [pdf, other]

Demonstration of neutron identification in neutrino interactions in the MicroBooNE liquid argon time projection chamber

Authors: MicroBooNE collaboration, P. Abratenko, O. Alterkait, D. Andrade Aldana, L. Arellano, J. Asaadi, A. Ashkenazi, S. Balasubramanian, B. Baller, A. Barnard, G. Barr, D. Barrow, J. Barrow, V. Basque, J. Bateman, O. Benevides Rodrigues, S. Berkman, A. Bhanderi, A. Bhat, M. Bhattacharya, M. Bishai, A. Blake, B. Bogart, T. Bolton, J. Y. Book , et al. (165 additional authors not shown)

Abstract: A significant challenge in measurements of neutrino oscillations is reconstructing the incoming neutrino energies. While modern fully-active tracking calorimeters such as liquid argon time projection chambers in principle allow the measurement of all final state particles above some detection threshold, undetected neutrons remain a considerable source of missing energy with little to no data const… ▽ More A significant challenge in measurements of neutrino oscillations is reconstructing the incoming neutrino energies. While modern fully-active tracking calorimeters such as liquid argon time projection chambers in principle allow the measurement of all final state particles above some detection threshold, undetected neutrons remain a considerable source of missing energy with little to no data constraining their production rates and kinematics. We present the first demonstration of tagging neutrino-induced neutrons in liquid argon time projection chambers using secondary protons emitted from neutron-argon interactions in the MicroBooNE detector. We describe the method developed to identify neutrino-induced neutrons and demonstrate its performance using neutrons produced in muon-neutrino charged current interactions. The method is validated using a small subset of MicroBooNE's total dataset. The selection yields a sample with $60\%$ of selected tracks corresponding to neutron-induced secondary protons. △ Less

Submitted 15 June, 2024; originally announced June 2024.

Report number: FERMILAB-PUB-24-0301

arXiv:2406.10222 [pdf, other]

Ultra-low noise laser and optical frequency comb-based timing system for the Black Hole Explorer (BHEX) mission

Authors: Hannah Tomio, Guangning Yang, Holly F. Leopardi, Kenji Numata, Anthony W. Yu, Andrew Attar, Xiaozhen Xu, Wei Lu, Cheryl Gramling, T. K. Sridharan, Peter Kurczynski

Abstract: In this effort, we demonstrate the performance of a highly stable time reference for the proposed Black Hole Explorer (BHEX) mission, a space-based extension to the Event Horizon Telescope (EHT) Very Long Baseline Interferometry (VLBI) project. This precision timing system is based on the use of a space-qualified, ultra-low noise laser developed as part of the Laser Interferometer Space Antenna (L… ▽ More In this effort, we demonstrate the performance of a highly stable time reference for the proposed Black Hole Explorer (BHEX) mission, a space-based extension to the Event Horizon Telescope (EHT) Very Long Baseline Interferometry (VLBI) project. This precision timing system is based on the use of a space-qualified, ultra-low noise laser developed as part of the Laser Interferometer Space Antenna (LISA) mission as the timing reference, and an optical frequency comb to transfer the stability of this laser to the microwave regime for instrumentation use. We describe the implementation of this system and experimental setup to characterize the stability performance. We present the results of this experiment that demonstrate the performance of this system meets requirements for the BHEX mission. △ Less

Submitted 14 June, 2024; originally announced June 2024.

Comments: To be published in the proceedings of SPIE Astronomical Telescopes + Instrumentation 2024

arXiv:2406.10123 [pdf, other]

Improving neutrino energy estimation of charged-current interaction events with recurrent neural networks in MicroBooNE

Authors: MicroBooNE collaboration, P. Abratenko, O. Alterkait, D. Andrade Aldana, L. Arellano, J. Asaadi, A. Ashkenazi, S. Balasubramanian, B. Baller, A. Barnard, G. Barr, D. Barrow, J. Barrow, V. Basque, J. Bateman, O. Benevides Rodrigues, S. Berkman, A. Bhanderi, A. Bhat, M. Bhattacharya, M. Bishai, A. Blake, B. Bogart, T. Bolton, J. Y. Book , et al. (164 additional authors not shown)

Abstract: We present a deep learning-based method for estimating the neutrino energy of charged-current neutrino-argon interactions. We employ a recurrent neural network (RNN) architecture for neutrino energy estimation in the MicroBooNE experiment, utilizing liquid argon time projection chamber (LArTPC) detector technology. Traditional energy estimation approaches in LArTPCs, which largely rely on reconstr… ▽ More We present a deep learning-based method for estimating the neutrino energy of charged-current neutrino-argon interactions. We employ a recurrent neural network (RNN) architecture for neutrino energy estimation in the MicroBooNE experiment, utilizing liquid argon time projection chamber (LArTPC) detector technology. Traditional energy estimation approaches in LArTPCs, which largely rely on reconstructing and summing visible energies, often experience sizable biases and resolution smearing because of the complex nature of neutrino interactions and the detector response. The estimation of neutrino energy can be improved after considering the kinematics information of reconstructed final-state particles. Utilizing kinematic information of reconstructed particles, the deep learning-based approach shows improved resolution and reduced bias for the muon neutrino Monte Carlo simulation sample compared to the traditional approach. In order to address the common concern about the effectiveness of this method on experimental data, the RNN-based energy estimator is further examined and validated with dedicated data-simulation consistency tests using MicroBooNE data. We also assess its potential impact on a neutrino oscillation study after accounting for all statistical and systematic uncertainties and show that it enhances physics sensitivity. This method has good potential to improve the performance of other physics analyses. △ Less

Submitted 14 June, 2024; originally announced June 2024.

Report number: FERMILAB-PUB-24-0287

arXiv:2406.09747 [pdf, ps, other]

Hybrid atom-photon entangling gates via Gaussian soft control

Authors: Wanrang Yu, Qiuyu Yin, Yanzhao Liang, Ning Ji, Thibault Vogt

Abstract: Hybrid atom-photon gates play an important role for the realization of a quantum interface capable of mapping atomic states to photons for communication across quantum networks. Here, we propose a feasible theoretical scheme for implementing a hybrid atom-photon controlled-Z gate between an atom and a microwave photon in a superconducting coplanar waveguide resonator based on the Gaussian soft con… ▽ More Hybrid atom-photon gates play an important role for the realization of a quantum interface capable of mapping atomic states to photons for communication across quantum networks. Here, we propose a feasible theoretical scheme for implementing a hybrid atom-photon controlled-Z gate between an atom and a microwave photon in a superconducting coplanar waveguide resonator based on the Gaussian soft control technique. The gate protocol employs a classical auxiliary field that induces an atomic transition between one state of the atomic qubit and Rydberg states for obtaining strong coupling of the atom and microwave resonator. By tailoring the amplitude of this field with Gaussian temporal modulation, the gate performances are improved in various aspects. Numerical simulations demonstrate that the controlled-Z gate based on Gaussian soft control is resilient to the variation of the atom-photon coupling strength, deviation in the gate time, and less sensitive to the Rydberg level shifts caused by stray electric fields. △ Less

Submitted 14 June, 2024; originally announced June 2024.

arXiv:2406.09742 [pdf, other]

IFA: Interaction Fidelity Attention for Entire Lifelong Behaviour Sequence Modeling

Authors: Wenhui Yu, Chao Feng, Yanze Zhang, Lantao Hu, Peng Jiang, Han Li

Abstract: The lifelong user behavior sequence provides abundant information of user preference and gains impressive improvement in the recommendation task, however increases computational consumption significantly. To meet the severe latency requirement in online service, a short sub-sequence is sampled based on similarity to the target item. Unfortunately, items not in the sub-sequence are abandoned, leadi… ▽ More The lifelong user behavior sequence provides abundant information of user preference and gains impressive improvement in the recommendation task, however increases computational consumption significantly. To meet the severe latency requirement in online service, a short sub-sequence is sampled based on similarity to the target item. Unfortunately, items not in the sub-sequence are abandoned, leading to serious information loss. In this paper, we propose a new efficient paradigm to model the full lifelong sequence, which is named as \textbf{I}nteraction \textbf{F}idelity \textbf{A}ttention (\textbf{IFA}). In IFA, we input all target items in the candidate set into the model at once, and leverage linear transformer to reduce the time complexity of the cross attention between the candidate set and the sequence without any interaction information loss. We also additionally model the relationship of all target items for optimal set generation, and design loss function for better consistency of training and inference. We demonstrate the effectiveness and efficiency of our model by off-line and online experiments in the recommender system of Kuaishou. △ Less

Submitted 14 June, 2024; originally announced June 2024.

Comments: 7 pages, 2 figures

arXiv:2406.09295 [pdf, other]

AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models

Authors: Yuhang Wu, Wenmeng Yu, Yean Cheng, Yan Wang, Xiaohan Zhang, Jiazheng Xu, Ming Ding, Yuxiao Dong

Abstract: Evaluating the alignment capabilities of large Vision-Language Models (VLMs) is essential for determining their effectiveness as helpful assistants. However, existing benchmarks primarily focus on basic abilities using nonverbal methods, such as yes-no and multiple-choice questions. In this paper, we address this gap by introducing AlignMMBench, a comprehensive alignment benchmark specifically des… ▽ More Evaluating the alignment capabilities of large Vision-Language Models (VLMs) is essential for determining their effectiveness as helpful assistants. However, existing benchmarks primarily focus on basic abilities using nonverbal methods, such as yes-no and multiple-choice questions. In this paper, we address this gap by introducing AlignMMBench, a comprehensive alignment benchmark specifically designed for emerging Chinese VLMs. This benchmark is meticulously curated from real-world scenarios and Chinese Internet sources, encompassing thirteen specific tasks across three categories, and includes both single-turn and multi-turn dialogue scenarios. Incorporating a prompt rewrite strategy, AlignMMBench encompasses 1,054 images and 4,978 question-answer pairs. To facilitate the evaluation pipeline, we propose CritiqueVLM, a rule-calibrated evaluator that exceeds GPT-4's evaluation ability. Finally, we report the performance of representative VLMs on AlignMMBench, offering insights into the capabilities and limitations of different VLM architectures. All evaluation codes and data are available on https://alignmmbench.github.io. △ Less

Submitted 13 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

arXiv:2406.09166 [pdf, other]

Fine-Grained Domain Generalization with Feature Structuralization

Authors: Wenlong Yu, Dongyue Chen, Qilong Wang, Qinghua Hu

Abstract: Fine-grained domain generalization (FGDG) is a more challenging task than traditional DG tasks due to its small inter-class variations and relatively large intra-class disparities. When domain distribution changes, the vulnerability of subtle features leads to a severe deterioration in model performance. Nevertheless, humans inherently demonstrate the capacity for generalizing to out-of-distributi… ▽ More Fine-grained domain generalization (FGDG) is a more challenging task than traditional DG tasks due to its small inter-class variations and relatively large intra-class disparities. When domain distribution changes, the vulnerability of subtle features leads to a severe deterioration in model performance. Nevertheless, humans inherently demonstrate the capacity for generalizing to out-of-distribution data, leveraging structured multi-granularity knowledge that emerges from discerning the commonality and specificity within categories. Likewise, we propose a Feature Structuralized Domain Generalization (FSDG) model, wherein features experience structuralization into common, specific, and confounding segments, harmoniously aligned with their relevant semantic concepts, to elevate performance in FGDG. Specifically, feature structuralization (FS) is accomplished through joint optimization of five constraints: a decorrelation function applied to disentangled segments, three constraints ensuring common feature consistency and specific feature distinctiveness, and a prediction calibration term. By imposing these stipulations, FSDG is prompted to disentangle and align features based on multi-granularity knowledge, facilitating robust subtle distinctions among categories. Extensive experimentation on three benchmarks consistently validates the superiority of FSDG over state-of-the-art counterparts, with an average improvement of 6.2% in FGDG performance. Beyond that, the explainability analysis on explicit concept matching intensity between the shared concepts among categories and the model channels, along with experiments on various mainstream model architectures, substantiates the validity of FS. △ Less

Submitted 17 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

arXiv:2406.07914 [pdf, other]

Can Large Language Models Understand Spatial Audio?

Authors: Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Jun Zhang, Lu Lu, Zejun Ma, Yuxuan Wang, Chao Zhang

Abstract: This paper explores enabling large language models (LLMs) to understand spatial information from multichannel audio, a skill currently lacking in auditory LLMs. By leveraging LLMs' advanced cognitive and inferential abilities, the aim is to enhance understanding of 3D environments via audio. We study 3 spatial audio tasks: sound source localization (SSL), far-field speech recognition (FSR), and lo… ▽ More This paper explores enabling large language models (LLMs) to understand spatial information from multichannel audio, a skill currently lacking in auditory LLMs. By leveraging LLMs' advanced cognitive and inferential abilities, the aim is to enhance understanding of 3D environments via audio. We study 3 spatial audio tasks: sound source localization (SSL), far-field speech recognition (FSR), and localisation-informed speech extraction (LSE), achieving notable progress in each task. For SSL, our approach achieves an MAE of $2.70^{\circ}$ on the Spatial LibriSpeech dataset, substantially surpassing the prior benchmark of about $6.60^{\circ}$. Moreover, our model can employ spatial cues to improve FSR accuracy and execute LSE by selectively attending to sounds originating from a specified direction via text prompts, even amidst overlapping speech. These findings highlight the potential of adapting LLMs to grasp physical audio concepts, paving the way for LLM-based agents in 3D environments. △ Less

Submitted 14 June, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

Comments: Accepted at Interspeech 2024

arXiv:2406.07333 [pdf, other]

Global-Regularized Neighborhood Regression for Efficient Zero-Shot Texture Anomaly Detection

Authors: Haiming Yao, Wei Luo, Yunkang Cao, Yiheng Zhang, Wenyong Yu, Weiming Shen

Abstract: Texture surface anomaly detection finds widespread applications in industrial settings. However, existing methods often necessitate gathering numerous samples for model training. Moreover, they predominantly operate within a close-set detection framework, limiting their ability to identify anomalies beyond the training dataset. To tackle these challenges, this paper introduces a novel zero-shot te… ▽ More Texture surface anomaly detection finds widespread applications in industrial settings. However, existing methods often necessitate gathering numerous samples for model training. Moreover, they predominantly operate within a close-set detection framework, limiting their ability to identify anomalies beyond the training dataset. To tackle these challenges, this paper introduces a novel zero-shot texture anomaly detection method named Global-Regularized Neighborhood Regression (GRNR). Unlike conventional approaches, GRNR can detect anomalies on arbitrary textured surfaces without any training data or cost. Drawing from human visual cognition, GRNR derives two intrinsic prior supports directly from the test texture image: local neighborhood priors characterized by coherent similarities and global normality priors featuring typical normal patterns. The fundamental principle of GRNR involves utilizing the two extracted intrinsic support priors for self-reconstructive regression of the query sample. This process employs the transformation facilitated by local neighbor support while being regularized by global normality support, aiming to not only achieve visually consistent reconstruction results but also preserve normality properties. We validate the effectiveness of GRNR across various industrial scenarios using eight benchmark datasets, demonstrating its superior detection performance without the need for training data. Remarkably, our method is applicable for open-set texture defect detection and can even surpass existing vanilla approaches that require extensive training. △ Less

Submitted 11 June, 2024; originally announced June 2024.

Comments: SUBMISSION TO IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS

arXiv:2406.06420 [pdf, other]

An Improved Empirical Fisher Approximation for Natural Gradient Descent

Authors: Xiaodong Wu, Wenyi Yu, Chao Zhang, Philip Woodland

Abstract: Approximate Natural Gradient Descent (NGD) methods are an important family of optimisers for deep learning models, which use approximate Fisher information matrices to pre-condition gradients during training. The empirical Fisher (EF) method approximates the Fisher information matrix empirically by reusing the per-sample gradients collected during back-propagation. Despite its ease of implementati… ▽ More Approximate Natural Gradient Descent (NGD) methods are an important family of optimisers for deep learning models, which use approximate Fisher information matrices to pre-condition gradients during training. The empirical Fisher (EF) method approximates the Fisher information matrix empirically by reusing the per-sample gradients collected during back-propagation. Despite its ease of implementation, the EF approximation has its theoretical and practical limitations. This paper first investigates the inversely-scaled projection issue of EF, which is shown to be a major cause of the poor empirical approximation quality. An improved empirical Fisher (iEF) method, motivated as a generalised NGD method from a loss reduction perspective, is proposed to address this issue, meanwhile retaining the practical convenience of EF. The exact iEF and EF methods are experimentally evaluated using practical deep learning setups, including widely-used setups for parameter-efficient fine-tuning of pre-trained models (T5-base with LoRA and Prompt-Tuning on GLUE tasks, and ViT with LoRA for CIFAR100). Optimisation experiments show that applying exact iEF as an optimiser provides strong convergence and generalisation. It achieves the best test performance and the lowest training loss for majority of the tasks, even when compared with well-tuned AdamW/Adafactor baselines. Additionally, under a novel empirical evaluation framework, the proposed iEF method shows consistently better approximation quality to the exact Natural Gradient updates than both EF and the more expensive sampled Fisher (SF). Further investigation also shows that the superior approximation quality of iEF is robust to damping across tasks and training stages. Improving existing approximate NGD optimisers with iEF is expected to lead to better convergence ability and stronger robustness to choice of damping. △ Less

Submitted 10 June, 2024; originally announced June 2024.

Comments: 33 pages, 11 figures, 7 tables

arXiv:2406.05491 [pdf, other]

One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models

Authors: Hao Fang, Jiawei Kong, Wenbo Yu, Bin Chen, Jiawei Li, Shutao Xia, Ke Xu

Abstract: Vision-Language Pre-training (VLP) models trained on large-scale image-text pairs have demonstrated unprecedented capability in many practical applications. However, previous studies have revealed that VLP models are vulnerable to adversarial samples crafted by a malicious adversary. While existing attacks have achieved great success in improving attack effect and transferability, they all focus o… ▽ More Vision-Language Pre-training (VLP) models trained on large-scale image-text pairs have demonstrated unprecedented capability in many practical applications. However, previous studies have revealed that VLP models are vulnerable to adversarial samples crafted by a malicious adversary. While existing attacks have achieved great success in improving attack effect and transferability, they all focus on instance-specific attacks that generate perturbations for each input sample. In this paper, we show that VLP models can be vulnerable to a new class of universal adversarial perturbation (UAP) for all input samples. Although initially transplanting existing UAP algorithms to perform attacks showed effectiveness in attacking discriminative models, the results were unsatisfactory when applied to VLP models. To this end, we revisit the multimodal alignments in VLP model training and propose the Contrastive-training Perturbation Generator with Cross-modal conditions (C-PGC). Specifically, we first design a generator that incorporates cross-modal information as conditioning input to guide the training. To further exploit cross-modal interactions, we propose to formulate the training objective as a multimodal contrastive learning paradigm based on our constructed positive and negative image-text pairs. By training the conditional generator with the designed loss, we successfully force the adversarial samples to move away from its original area in the VLP model's feature space, and thus essentially enhance the attacks. Extensive experiments show that our method achieves remarkable attack performance across various VLP models and Vision-and-Language (V+L) tasks. Moreover, C-PGC exhibits outstanding black-box transferability and achieves impressive results in fooling prevalent large VLP models including LLaVA and Qwen-VL. △ Less

Submitted 8 June, 2024; originally announced June 2024.

arXiv:2406.04737 [pdf, other]

doi 10.1109/TWC.2024.3425473

Fast-Fading Channel and Power Optimization of the Magnetic Inductive Cellular Network

Authors: Honglei Ma, Erwu Liu, Zhijun Fang, Rui Wang, Yongbin Gao, Wenjun Yu, Dongming Zhang

Abstract: The cellular network of magnetic Induction (MI) communication holds promise in long-distance underground environments. In the traditional MI communication, there is no fast-fading channel since the MI channel is treated as a quasi-static channel. However, for the vehicle (mobile) MI (VMI) communication, the unpredictable antenna vibration brings the remarkable fast-fading. As such fast-fading cann… ▽ More The cellular network of magnetic Induction (MI) communication holds promise in long-distance underground environments. In the traditional MI communication, there is no fast-fading channel since the MI channel is treated as a quasi-static channel. However, for the vehicle (mobile) MI (VMI) communication, the unpredictable antenna vibration brings the remarkable fast-fading. As such fast-fading cannot be modeled by the central limit theorem, it differs radically from other wireless fast-fading channels. Unfortunately, few studies focus on this phenomenon. In this paper, using a novel space modeling based on the electromagnetic field theorem, we propose a 3-dimension model of the VMI antenna vibration. By proposing ``conjugate pseudo-piecewise functions'' and boundary $p(x)$ distribution, we derive the cumulative distribution function (CDF), probability density function (PDF) and the expectation of the VMI fast-fading channel. We also theoretically analyze the effects of the VMI fast-fading on the network throughput, including the VMI outage probability which can be ignored in the traditional MI channel study. We draw several intriguing conclusions different from those in wireless fast-fading studies. For instance, the fast-fading brings more uniformly distributed channel coefficients. Finally, we propose the power control algorithm using the non-cooperative game and multiagent Q-learning methods to optimize the throughput of the cellular VMI network. Simulations validate the derivation and the proposed algorithm. △ Less

Submitted 7 July, 2024; v1 submitted 7 June, 2024; originally announced June 2024.

Comments: This work has been accepted by the IEEE TWC for publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2406.04649 [pdf, other]

SMART: Scene-motion-aware human action recognition framework for mental disorder group

Authors: Zengyuan Lai, Jiarui Yang, Songpengcheng Xia, Qi Wu, Zhen Sun, Wenxian Yu, Ling Pei

Abstract: Patients with mental disorders often exhibit risky abnormal actions, such as climbing walls or hitting windows, necessitating intelligent video behavior monitoring for smart healthcare with the rising Internet of Things (IoT) technology. However, the development of vision-based Human Action Recognition (HAR) for these actions is hindered by the lack of specialized algorithms and datasets. In this… ▽ More Patients with mental disorders often exhibit risky abnormal actions, such as climbing walls or hitting windows, necessitating intelligent video behavior monitoring for smart healthcare with the rising Internet of Things (IoT) technology. However, the development of vision-based Human Action Recognition (HAR) for these actions is hindered by the lack of specialized algorithms and datasets. In this paper, we innovatively propose to build a vision-based HAR dataset including abnormal actions often occurring in the mental disorder group and then introduce a novel Scene-Motion-aware Action Recognition Technology framework, named SMART, consisting of two technical modules. First, we propose a scene perception module to extract human motion trajectory and human-scene interaction features, which introduces additional scene information for a supplementary semantic representation of the above actions. Second, the multi-stage fusion module fuses the skeleton motion, motion trajectory, and human-scene interaction features, enhancing the semantic association between the skeleton motion and the above supplementary representation, thus generating a comprehensive representation with both human motion and scene information. The effectiveness of our proposed method has been validated on our self-collected HAR dataset (MentalHAD), achieving 94.9% and 93.1% accuracy in un-seen subjects and scenes and outperforming state-of-the-art approaches by 6.5% and 13.2%, respectively. The demonstrated subject- and scene- generalizability makes it possible for SMART's migration to practical deployment in smart healthcare systems for mental disorder patients in medical settings. The code and dataset will be released publicly for further research: https://github.com/Inowlzy/SMART.git. △ Less

Submitted 7 June, 2024; originally announced June 2024.

arXiv:2406.03834 [pdf, other]

The Broadband X-ray Spectral Properties during the Rising Phases of the Outburst of the New Black Hole X-ray Binary Candidate Swift J1727.8-1613

Authors: He-Xin Liu, Yan-Jun Xu, Shuang-Nan Zhang, Wei Yu, Yue Huang, Lian Tao, Liang Zhang, Zi-Xu Yang, Qing-Chang Zhao, Jin-Lu Qu, Li-Ming Song

Abstract: We report data analysis results about the outburst evolution and spectral properties during the hard state of the recently discovered X-ray transient Swift J1727.8-163 as observed by \emph{Insight}-HXMT and NuSTAR. We find that the broadband X-ray spectrum of Swift J1727.8-163 is more complex than the most typical spectral patterns of black hole X-ray binary systems, with not only a comparatively… ▽ More We report data analysis results about the outburst evolution and spectral properties during the hard state of the recently discovered X-ray transient Swift J1727.8-163 as observed by \emph{Insight}-HXMT and NuSTAR. We find that the broadband X-ray spectrum of Swift J1727.8-163 is more complex than the most typical spectral patterns of black hole X-ray binary systems, with not only a comparatively weaker reflection component but also an additional spectral continuum component, manifesting itself as a hard X-ray tail beyond the thermal Comptonization description detectable below 100 keV. This additional component can be phenomenologically well fitted by adding an extra power-law model with high energy exponential cutoff in the 2-120 keV energy band. We made an attempt to explain the broadband X-ray spectral continuum with a thermal/non-thermal hybrid plasma corona scenario , and find an ultra high compactness parameter ($l_{\rm s}\sim2000$) and a steep non-thermal electron distribution ($Γ_{\rm inj}>4$), suggesting the source was accreting with high Eddington rates and that the electron acceleration mechanism is not very efficient. We also present a detailed multi-epoch analysis of spectral properties using \emph{Insight}-HXMT data to investigate the evolution of the key physical properties regarding the disk and corona during the hard states. No significant variation is found with the inner disk radius and the coronal temperature during this time period, and the weak reflection and hard X-ray tail features are persistent. We discuss the physical implications of our spectral analysis results in the context of disk-corona relation, particle acceleration, and jet contribution, during the rise of a black hole X-ray binary in outburst. △ Less

Submitted 6 June, 2024; originally announced June 2024.

Comments: 16 pages, 6 figures

Showing 1–50 of 1,447 results for author: Yu, W