arXiv:2407.11921 [pdf, other]

IPA-NeRF: Illusory Poisoning Attack Against Neural Radiance Fields

Authors: Wenxiang Jiang, Hanwei Zhang, Shuo Zhao, Zhongwen Guo, Hao Wang

Abstract: Neural Radiance Field (NeRF) represents a significant advancement in computer vision, offering implicit neural network-based scene representation and novel view synthesis capabilities. Its applications span diverse fields including robotics, urban mapping, autonomous navigation, virtual reality/augmented reality, etc., some of which are considered high-risk AI applications. However, despite its widespread adoption, the robustness and security of NeRF remain largely unexplored. In this study, we contribute to this area by introducing the Illusory Poisoning Attack against Neural Radiance Fields (IPA-NeRF). This attack involves embedding a hidden backdoor view into NeRF, allowing it to produce predetermined outputs, i.e. illusory, when presented with the specified backdoor view while maintaining normal performance with standard inputs. Our attack is specifically designed to deceive users or downstream models at a particular position while ensuring that any abnormalities in NeRF remain undetectable from other viewpoints. Experimental results demonstrate the effectiveness of our Illusory Poisoning Attack, successfully presenting the desired illusory on the specified viewpoint without impacting other views. Notably, we achieve this attack by introducing small perturbations solely to the training set. The code can be found at https://github.com/jiang-wenxiang/IPA-NeRF.

Submitted 16 July, 2024; originally announced July 2024.

arXiv:2407.11895 [pdf, other]

OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces

Authors: Zehan Wang, Ziang Zhang, Hang Zhang, Luping Liu, Rongjie Huang, Xize Cheng, Hengshuang Zhao, Zhou Zhao

Abstract: Recently, human-computer interaction with various modalities has shown promising applications, like GPT-4o and Gemini. Given the foundational role of multimodal joint representation in understanding and generation pipelines, high-quality omni joint representations would be a step toward co-processing more diverse multimodal information. In this work, we present OmniBind, large-scale multimodal joint representation models ranging in scale from 7 billion to 30 billion parameters, which support 3D, audio, image, and language inputs. Due to the scarcity of data pairs across all modalities, instead of training large models from scratch, we propose remapping and binding the spaces of various pre-trained specialist models together. This approach enables "scaling up" by indirectly increasing the model parameters and the amount of seen data. To effectively integrate various spaces, we dynamically assign weights to different spaces by learning routers with two objectives: cross-modal overall alignment and language representation decoupling. Notably, since binding and routing spaces both only require lightweight networks, OmniBind is extremely training-efficient. Learning the largest 30B model requires merely unpaired unimodal data and approximately 3 days on a single 8-4090 node. Extensive experiments demonstrate the versatility and superiority of OmniBind as an omni representation model, highlighting its great potential for diverse applications, such as any-query and composable multimodal understanding.

Submitted 16 July, 2024; originally announced July 2024.

Comments: Homepage is http://omnibind.github.io

arXiv:2407.11775 [pdf, other]

doi 10.1038/s41467-024-50333-w

A cryogenic on-chip microwave pulse generator for large-scale superconducting quantum computing

Authors: Zenghui Bao, Yan Li, Zhiling Wang, Jiahui Wang, Jize Yang, Haonan Xiong, Yipu Song, Yukai Wu, Hongyi Zhang, Luming Duan

Abstract: For superconducting quantum processors, microwave signals are delivered to each qubit from room-temperature electronics to the cryogenic environment through coaxial cables. Limited by the heat load of cabling and the massive cost of electronics, such an architecture is not viable for millions of qubits required for fault-tolerant quantum computing. Monolithic integration of the control electronics and the qubits provides a promising solution, which, however, requires a coherent cryogenic microwave pulse generator that is compatible with superconducting quantum circuits. Here, we report such a signal source driven by digital-like signals, generating pulsed microwave emission with well-controlled phase, intensity, and frequency directly at millikelvin temperatures. We showcase high-fidelity readout of superconducting qubits with the microwave pulse generator. The device demonstrated here has a small footprint, negligible heat load, great flexibility to operate, and is fully compatible with today's superconducting quantum circuits, thus providing an enabling technology for large-scale superconducting quantum computers.

Submitted 16 July, 2024; originally announced July 2024.

Comments: 12 pages, 4 figures

Journal ref: Nat Commun 15, 5958 (2024)

arXiv:2407.11736 [pdf, other]

GV-Bench: Benchmarking Local Feature Matching for Geometric Verification of Long-term Loop Closure Detection

Authors: Jingwen Yu, Hanjing Ye, Jianhao Jiao, Ping Tan, Hong Zhang

Abstract: Visual loop closure detection is an important module in visual simultaneous localization and mapping (SLAM), which associates current camera observation with previously visited places. Loop closures correct drifts in trajectory estimation to build a globally consistent map. However, a false loop closure can be fatal, so verification is required as an additional step to ensure robustness by rejecting the false positive loops. Geometric verification has been a well-acknowledged solution that leverages spatial clues provided by local feature matching to find true positives. Existing feature matching methods focus on homography and pose estimation in long-term visual localization, lacking references for geometric verification. To fill the gap, this paper proposes a unified benchmark targeting geometric verification of loop closure detection under long-term conditional variations. Furthermore, we evaluate six representative local feature matching methods (handcrafted and learning-based) under the benchmark, with in-depth analysis for limitations and future directions.

Submitted 16 July, 2024; originally announced July 2024.

Comments: 9 pages, 11 figures, Accepted by IROS 2024

arXiv:2407.11734 [pdf, other]

Generating Multi-Modal and Multi-Attribute Single-Cell Counts with CFGen

Authors: Alessandro Palma, Till Richter, Hanyi Zhang, Manuel Lubetzki, Alexander Tong, Andrea Dittadi, Fabian Theis

Abstract: Generative modeling of single-cell RNA-seq data has shown invaluable potential in community-driven tasks such as trajectory inference, batch effect removal and gene expression generation. However, most recent deep models generating synthetic single cells from noise operate on pre-processed continuous gene expression approximations, ignoring the inherently discrete and over-dispersed nature of single-cell data, which limits downstream applications and hinders the incorporation of robust noise models. Moreover, crucial aspects of deep-learning-based synthetic single-cell generation remain underexplored, such as controllable multi-modal and multi-label generation and its role in the performance enhancement of downstream tasks. This work presents Cell Flow for Generation (CFGen), a flow-based conditional generative model for multi-modal single-cell counts, which explicitly accounts for the discrete nature of the data. Our results suggest improved recovery of crucial biological data characteristics while accounting for novel generative tasks such as conditioning on multiple attributes and boosting rare cell type classification via data augmentation. By showcasing CFGen on a diverse set of biological datasets and settings, we provide evidence of its value to the fields of computational biology and deep generative models.

Submitted 16 July, 2024; originally announced July 2024.

Comments: 28 pages, 12 figures

arXiv:2407.11727 [pdf, ps, other]

Measurement of the branching fraction of $D^+_s\to \ell^+ν_\ell$ via $e^+e^-\to D^{*+}_{s} D^{*-}_{s}$

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, O. Afedulidis, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, I. Balossino, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere , et al. (634 additional authors not shown)

Abstract: Based on $10.64~\mathrm{fb}^{-1}$ of $e^+e^-$ collision data taken at center-of-mass energies between 4.237 and 4.699 GeV with the BESIII detector, we study the leptonic $D^+_s$ decays using the $e^+e^-\to D^{*+}_{s} D^{*-}_{s}$ process. The branching fractions of $D_s^+\to\ell^+ν_{\ell}\,(\ell=μ,τ)$ are measured to be $\mathcal{B}(D_s^+\toμ^+ν_μ)=(\bfmuv)\%$ and $\mathcal{B}(D_s^+\toτ^+ν_τ)=(\bftauv)\%$, respectively. The product of the decay constant and Cabibbo-Kobayashi-Maskawa matrix element $|V_{cs}|$ is determined to be $f_{D_s^+}|V_{cs}|=(\mufdsxvcsresult)_{μν}~\mathrm{MeV}$ and $f_{D_s^+}|V_{cs}|=(\taufdsxvcsresult)_{τν}~\mathrm{MeV}$, respectively. Taking the value of $|V_{cs}|$ from a global fit in the Standard Model, we obtain ${f_{D^+_s}}=(\mufdsresult)_{μν}$\,MeV and ${f_{D^+_s}}=(\taufdsresult)_{τν}$\,MeV, respectively. Conversely, taking the value for $f_{D_s^+}$ from the latest lattice quantum chromodynamics calculation, we obtain $|V_{cs}| =(\muvcsresult)_{μν}$ and $|V_{cs}| = (\tauvcsresult)_{τν}$, respectively.

Submitted 16 July, 2024; originally announced July 2024.

Comments: 27 pages, 13 figures

arXiv:2407.11684 [pdf, ps, other]

$α$-SGHN: A Robust Model for Learning Particle Interactions in Lattice Systems

Authors: Yixian Gao, Ru Geng, Panayotis Kevrekidis, Hong-Kun Zhang, Jian Zu

Abstract: We propose an $α$-separable graph Hamiltonian network ($α$-SGHN) that reveals complex interaction patterns between particles in lattice systems. Utilizing trajectory data, $α$-SGHN infers potential interactions without prior knowledge about particle coupling, overcoming the limitations of traditional graph neural networks that require predefined links. Furthermore, $α$-SGHN preserves all conservation laws during trajectory prediction. Experimental results demonstrate that our model, incorporating structural information, outperforms baseline models based on conventional neural networks in predicting lattice systems. We anticipate that the results presented will be applicable beyond the specific onsite and inter-site interaction lattices studied, including the Frenkel-Kontorova model, the rotator lattice, and the Toda lattice.

Submitted 16 July, 2024; originally announced July 2024.

Comments: 17 pages

arXiv:2407.11682 [pdf, other]

MapDistill: Boosting Efficient Camera-based HD Map Construction via Camera-LiDAR Fusion Model Distillation

Authors: Xiaoshuai Hao, Ruikai Li, Hui Zhang, Dingzhe Li, Rong Yin, Sangil Jung, Seung-In Park, ByungIn Yoo, Haimei Zhao, Jing Zhang

Abstract: Online high-definition (HD) map construction is an important and challenging task in autonomous driving. Recently, there has been a growing interest in cost-effective multi-view camera-based methods without relying on other sensors like LiDAR. However, these methods suffer from a lack of explicit depth information, necessitating the use of large models to achieve satisfactory performance. To address this, we employ the Knowledge Distillation (KD) idea for efficient HD map construction for the first time and introduce a novel KD-based approach called MapDistill to transfer knowledge from a high-performance camera-LiDAR fusion model to a lightweight camera-only model. Specifically, we adopt the teacher-student architecture, i.e., a camera-LiDAR fusion model as the teacher and a lightweight camera model as the student, and devise a dual BEV transform module to facilitate cross-modal knowledge distillation while maintaining cost-effective camera-only deployment. Additionally, we present a comprehensive distillation scheme encompassing cross-modal relation distillation, dual-level feature distillation, and map head distillation. This approach alleviates knowledge transfer challenges between modalities, enabling the student model to learn improved feature representations for HD map construction. Experimental results on the challenging nuScenes dataset demonstrate the effectiveness of MapDistill, surpassing existing competitors by over 7.7 mAP or 4.5X speedup.

Submitted 16 July, 2024; originally announced July 2024.

Comments: Accepted by ECCV 2024

arXiv:2407.11619 [pdf, ps, other]

Strategic Littlestone Dimension: Improved Bounds on Online Strategic Classification

Authors: Saba Ahmadi, Kunhe Yang, Hanrui Zhang

Abstract: We study the problem of online binary classification in settings where strategic agents can modify their observable features to receive a positive classification. We model the set of feasible manipulations by a directed graph over the feature space, and assume the learner only observes the manipulated features instead of the original ones. We introduce the Strategic Littlestone Dimension, a new combinatorial measure that captures the joint complexity of the hypothesis class and the manipulation graph. We demonstrate that it characterizes the instance-optimal mistake bounds for deterministic learning algorithms in the realizable setting. We also achieve improved regret in the agnostic setting by a refined agnostic-to-realizable reduction that accounts for the additional challenge of not observing agents' original features. Finally, we relax the assumption that the learner knows the manipulation graph, instead assuming their knowledge is captured by a family of graphs. We derive regret bounds in both the realizable setting where all agents manipulate according to the same graph within the graph family, and the agnostic setting where the manipulation graphs are chosen adversarially and not consistently modeled by a single graph in the family.

Submitted 16 July, 2024; originally announced July 2024.

arXiv:2407.11529 [pdf, other]

Cross-Phase Mutual Learning Framework for Pulmonary Embolism Identification on Non-Contrast CT Scans

Authors: Bizhe Bai, Yan-Jie Zhou, Yujian Hu, Tony C. W. Mok, Yilang Xiang, Le Lu, Hongkun Zhang, Minfeng Xu

Abstract: Pulmonary embolism (PE) is a life-threatening condition where rapid and accurate diagnosis is imperative yet difficult due to predominantly atypical symptomatology. Computed tomography pulmonary angiography (CTPA) is acknowledged as the gold standard imaging tool in clinics, yet it can be contraindicated for emergency department (ED) patients and represents an onerous procedure, thus necessitating PE identification through non-contrast CT (NCT) scans. In this work, we explore the feasibility of applying a deep-learning approach to NCT scans for PE identification. We propose a novel Cross-Phase Mutual learNing framework (CPMN) that fosters knowledge transfer from CTPA to NCT, while concurrently conducting embolism segmentation and abnormality classification in a multi-task manner. The proposed CPMN leverages the Inter-Feature Alignment (IFA) strategy that enhances spatial contiguity and mutual learning between the dual-pathway network, while the Intra-Feature Discrepancy (IFD) strategy can facilitate precise segmentation of PE against complex backgrounds for single-pathway networks. For a comprehensive assessment of the proposed approach, a large-scale dual-phase dataset containing 334 PE patients and 1,105 normal subjects has been established. Experimental results demonstrate that CPMN achieves the leading identification performance, which is 95.4% and 99.6% in patient-level sensitivity and specificity on NCT scans, indicating the potential of our approach as an economical, accessible, and precise tool for PE identification in clinical practice.

Submitted 16 July, 2024; originally announced July 2024.

Comments: Early accept by MICCAI 2024

arXiv:2407.11509 [pdf, other]

Exact eigenstates with off-diagonal long-range order for interacting bosonic systems

Authors: C. H. Zhang, Z. Song

Abstract: Fermions and hardcore bosons share the same restriction: no more than one particle can occupy a single site in a lattice system. Specifically, in one dimension, two systems can share the same matrix representation. In this work, we investigate both the fermion and hardcore-boson models with nearest-neighbor (NN) interaction in a ring lattice. We construct the exact eigenstates of the hardcore-boson model with resonant NN interaction and show that they possess off-diagonal long-range order (ODLRO) in the thermodynamic limit. In comparison, the fermionic counterpart does not support such a feature due to the different particle statistics, although they share an identical energy spectrum. In addition, we examine the effect of the periodic boundary condition on the dynamics of the condensate states through numerical simulations.

Submitted 16 July, 2024; originally announced July 2024.

arXiv:2407.11425 [pdf]

doi 10.1038/s41598-024-60279-0

Incremental high average-utility itemset mining: survey and challenges

Authors: Jing Chen, Shengyi Yang, Weiping Ding, Peng Li, Aijun Liu, Hongjun Zhang, Tian Li

Abstract: The High Average Utility Itemset Mining (HAUIM) technique, a variation of High Utility Itemset Mining (HUIM), uses the average utility of the itemsets. Historically, most HAUIM algorithms were designed for static databases. However, practical applications like market basket analysis and business decision-making necessitate regular updates of the database with new transactions. As a result, researchers have developed incremental HAUIM (iHAUIM) algorithms to identify HAUIs in a dynamically updated database. Contrary to conventional methods that begin from scratch, the iHAUIM algorithm facilitates incremental changes and outputs, thereby reducing the cost of discovery. This paper provides a comprehensive review of the state-of-the-art iHAUIM algorithms, analyzing their unique characteristics and advantages. First, we explain the concept of iHAUIM, providing formulas and real-world examples for a more in-depth understanding. Subsequently, we categorize and discuss the key technologies used by varying types of iHAUIM algorithms, encompassing Apriori-based, Tree-based, and Utility-list-based techniques. Moreover, we conduct a critical analysis of each mining method's advantages and disadvantages. In conclusion, we explore potential future directions, research opportunities, and various extensions of the iHAUIM algorithm.

Submitted 16 July, 2024; originally announced July 2024.

Comments: 25 pages, 23 figures

arXiv:2407.11422 [pdf, other]

Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models

Authors: Jinrui Zhang, Teng Wang, Haigang Zhang, Ping Lu, Feng Zheng

Abstract: Large vision-language models (LVLMs) have shown promising performance on a variety of vision-language tasks. However, they remain susceptible to hallucinations, generating outputs misaligned with visual content or instructions. While various mitigation strategies have been proposed, they often neglect a key contributor to hallucinations: lack of fine-grained reasoning supervision during training. Without intermediate reasoning steps, models may establish superficial shortcuts between instructions and responses, failing to internalize the inherent reasoning logic. To address this challenge, we propose reflective instruction tuning, which integrates rationale learning into visual instruction tuning. Unlike previous methods that learn from responses only, our approach entails the model predicting rationales justifying why responses are correct or incorrect. This fosters a deeper engagement with the fine-grained reasoning underlying each response, thus enhancing the model's reasoning proficiency. To facilitate this approach, we propose REVERIE, the first large-scale instruction-tuning dataset with ReflEctiVE RatIonalE annotations. REVERIE comprises 115k machine-generated reasoning instructions, each meticulously annotated with a corresponding pair of correct and confusing responses, alongside comprehensive rationales elucidating the justification behind the correctness or erroneousness of each response. Experimental results on multiple LVLM benchmarks reveal that reflective instruction tuning with the REVERIE dataset yields a noticeable performance gain over the baseline model, demonstrating the effectiveness of reflecting from the rationales. Project page is at https://zjr2000.github.io/projects/reverie.

Submitted 16 July, 2024; originally announced July 2024.

Comments: To appear at ECCV 2024

arXiv:2407.11382 [pdf, other]

Segment, Lift and Fit: Automatic 3D Shape Labeling from 2D Prompts

Authors: Jianhao Li, Tianyu Sun, Zhongdao Wang, Enze Xie, Bailan Feng, Hongbo Zhang, Ze Yuan, Ke Xu, Jiaheng Liu, Ping Luo

Abstract: This paper proposes an algorithm for automatically labeling 3D objects from 2D point or box prompts, especially focusing on applications in autonomous driving. Unlike prior work, our auto-labeler predicts 3D shapes instead of bounding boxes and does not require training on a specific dataset. We propose a Segment, Lift, and Fit (SLF) paradigm to achieve this goal. Firstly, we segment high-quality instance masks from the prompts using the Segment Anything Model (SAM) and transform the remaining problem into predicting 3D shapes from given 2D masks. Due to the ill-posed nature of this problem, it presents a significant challenge as multiple 3D shapes can project into an identical mask. To tackle this issue, we then lift 2D masks to 3D forms and employ gradient descent to adjust their poses and shapes until the projections fit the masks and the surfaces conform to surrounding LiDAR points. Notably, since we do not train on a specific dataset, the SLF auto-labeler does not overfit to biased annotation patterns in the training set as other methods do. Thus, the generalization ability across different datasets improves. Experimental results on the KITTI dataset demonstrate that the SLF auto-labeler produces high-quality bounding box annotations, achieving an AP@0.5 IoU of nearly 90%. Detectors trained with the generated pseudo-labels perform nearly as well as those trained with actual ground-truth annotations. Furthermore, the SLF auto-labeler shows promising results in detailed shape predictions, providing a potential alternative for the occupancy annotation of dynamic objects.

Submitted 16 July, 2024; originally announced July 2024.

Comments: Accepted to ECCV 2024

arXiv:2407.11322 [pdf, ps, other]

Reconfigurable-Intelligent-Surface Assisted Orbital-Angular-Momentum Secure Communications

Authors: Minmin Wang, Liping Liang, Wenchi Cheng, Wei Zhang, Ruirui Chen, Hailin Zhang

Abstract: As a kind of wavefront with helical phase, orbital angular momentum (OAM) shows great potential to enhance the security of wireless communications due to its unique orthogonality and central hollow electromagnetic wave structure. Therefore, in this paper we propose the reconfigurable-intelligent-surface (RIS) assisted OAM scheme, where RIS is deployed to weaken the information acquisition at eavesdroppers by adjusting the OAM beams pointed to the eavesdropper and artificial noise (AN) is applied to interfere with the eavesdropper, thus significantly increasing the secrecy rates of short-range secure communications. Aiming at obtaining the maximum secrecy rate, we develop the Riemannian manifold conjugate gradient (RMCG) based alternative optimization (AO) algorithm to assign much power to low-order OAM-modes and optimize the OAM beams direction with the programmable RIS, thus respectively enhancing and weakening the received signal strength at the legitimate receiver and the eavesdropper. Numerical results show that our proposed scheme outperforms the existing works in terms of the secrecy rate and the eavesdropper's bit error rate.

Submitted 15 July, 2024; originally announced July 2024.

Comments: arXiv admin note: text overlap with arXiv:2406.05799

arXiv:2407.11087 [pdf, other]

Restore-RWKV: Efficient and Effective Medical Image Restoration with RWKV

Authors: Zhiwen Yang, Hui Zhang, Dan Zhao, Bingzheng Wei, Yan Xu

Abstract: Transformers have revolutionized medical image restoration, but the quadratic complexity still poses limitations for their application to high-resolution medical images. The recent advent of RWKV in the NLP field has attracted much attention as it can process long sequences efficiently. To leverage its advanced design, we propose Restore-RWKV, the first RWKV-based model for medical image restoration. Since the original RWKV model is designed for 1D sequences, we make two necessary modifications for modeling spatial relations in 2D images. First, we present a recurrent WKV (Re-WKV) attention mechanism that captures global dependencies with linear computational complexity. Re-WKV incorporates bidirectional attention as the basis for a global receptive field and recurrent attention to effectively model 2D dependencies from various scan directions. Second, we develop an omnidirectional token shift (Omni-Shift) layer that enhances local dependencies by shifting tokens from all directions and across a wide context range. These adaptations make the proposed Restore-RWKV an efficient and effective model for medical image restoration. Extensive experiments demonstrate that Restore-RWKV achieves superior performance across various medical image restoration tasks, including MRI image super-resolution, CT image denoising, PET image synthesis, and all-in-one medical image restoration. Code is available at: https://github.com/Yaziwel/Restore-RWKV.

Submitted 14 July, 2024; originally announced July 2024.

Comments: This paper introduces the first RWKV-based model for image restoration

arXiv:2407.10988 [pdf, other]

Residual resampling-based physics-informed neural network for neutron diffusion equations

Authors: Heng Zhang, Yun-Ling He, Dong Liu, Qin Hang, He-Min Yao, Di Xiang

Abstract: The neutron diffusion equation plays a pivotal role in the analysis of nuclear reactors. Nevertheless, employing the Physics-Informed Neural Network (PINN) method for its solution entails certain limitations. Traditional PINN approaches often utilize a fully connected network (FCN) architecture, which is susceptible to overfitting, training instability, and gradient vanishing issues as the network depth increases. These challenges result in accuracy bottlenecks in the solution. In response to these issues, the Residual-based Resample Physics-Informed Neural Network (R2-PINN) is proposed, an improved PINN architecture that replaces the FCN with a Convolutional Neural Network with a shortcut (S-CNN), incorporating skip connections to facilitate gradient propagation between network layers. Additionally, the incorporation of the Residual Adaptive Resampling (RAR) mechanism dynamically increases sampling points, enhancing the spatial representation capabilities and overall predictive accuracy of the model. The experimental results illustrate that our approach significantly improves the model's convergence capability, achieving high-precision predictions of physical fields. In comparison to traditional FCN-based PINN methods, R2-PINN effectively overcomes the limitations inherent in current methods, providing more accurate and robust solutions for neutron diffusion equations.

Submitted 23 June, 2024; originally announced July 2024.

arXiv:2407.10982 [pdf, other]

ARA-O-RAN: End-to-End Programmable O-RAN Living Lab for Agriculture and Rural Communities

Authors: Tianyi Zhang, Joshua Ofori Boateng, Taimoor UI Islam, Arsalan Ahmad, Hongwei Zhang, Daji Qiao

Abstract: As wireless networks evolve towards open architectures like O-RAN, testing and integration platforms are crucial to address challenges like interoperability. This paper describes ARA-O-RAN, a novel O-RAN testbed established through the NSF Platforms for Advanced Wireless Research (PAWR) ARA platform. ARA provides an at-scale rural wireless living lab focused on technologies for digital agriculture and rural communities. As an O-RAN Alliance certified Open Testing and Integration Centre (OTIC), ARA launched ARA-O-RAN -- the first public O-RAN testbed tailored to rural and agriculture use cases, together with the end-to-end, whole-stack programmability. ARA-O-RAN uniquely combines support for outdoor testing across a university campus, surrounding farmlands, and rural communities with a 50-node indoor sandbox. The testbed facilitates vital R&D to implement open architectures that can meet rural connectivity needs. The paper outlines ARA-O-RAN's hardware system design, software architecture, and enabled research experiments. It also discusses plans aligned with national spectrum policy and rural spectrum innovation. ARA-O-RAN exemplifies the value of purpose-built wireless testbeds in accelerating impactful wireless research.

Submitted 14 June, 2024; originally announced July 2024.

arXiv:2407.10957 [pdf, other]

Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes

Authors: Yaoting Wang, Peiwen Sun, Dongzhan Zhou, Guangyao Li, Honggang Zhang, Di Hu

Abstract: Traditional reference segmentation tasks have predominantly focused on silent visual scenes, neglecting the integral role of multimodal perception and interaction in human experiences. In this work, we introduce a novel task called Reference Audio-Visual Segmentation (Ref-AVS), which seeks to segment objects within the visual domain based on expressions containing multimodal cues. Such expressions are articulated in natural language forms but are enriched with multimodal cues, including audio and visual descriptions. To facilitate this research, we construct the first Ref-AVS benchmark, which provides pixel-level annotations for objects described in corresponding multimodal-cue expressions. To tackle the Ref-AVS task, we propose a new method that adequately utilizes multimodal cues to offer precise segmentation guidance. Finally, we conduct quantitative and qualitative experiments on three test subsets to compare our approach with existing methods from related tasks. The results demonstrate the effectiveness of our method, highlighting its capability to precisely segment objects using multimodal-cue expressions. Dataset is available at https://gewu-lab.github.io/Ref-AVS.

Submitted 15 July, 2024; originally announced July 2024.

Comments: Accepted by ECCV2024

arXiv:2407.10956 [pdf, other]

Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?

Authors: Ruisheng Cao, Fangyu Lei, Haoyuan Wu, Jixuan Chen, Yeqiao Fu, Hongcheng Gao, Xinzhuang Xiong, Hanchong Zhang, Yuchen Mao, Wenjing Hu, Tianbao Xie, Hongshen Xu, Danyang Zhang, Sida Wang, Ruoxi Sun, Pengcheng Yin, Caiming Xiong, Ansong Ni, Qian Liu, Victor Zhong, Lu Chen, Kai Yu, Tao Yu

Abstract: Data science and engineering workflows often span multiple stages, from warehousing to orchestration, using tools like BigQuery, dbt, and Airbyte. As vision language models (VLMs) advance in multimodal understanding and code generation, VLM-based agents could potentially automate these workflows by generating SQL queries, Python code, and GUI operations. This automation can improve the productivity of experts while democratizing access to large-scale data analysis. In this paper, we introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering workflows, featuring 494 real-world tasks in authentic computer environments and incorporating 20 enterprise-level professional applications. These tasks, derived from real-world use cases, evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems. To balance realistic simulation with evaluation simplicity, we devote significant effort to developing automatic configurations for task setup and carefully crafting evaluation metrics for each task. Furthermore, we supplement multimodal agents with comprehensive documents of these enterprise data software systems. Our empirical evaluation reveals that existing state-of-the-art LLM/VLM-based agents do not reliably automate full data workflows (14.0% success). Even with step-by-step guidance, these agents still underperform in tasks that require fine-grained, knowledge-intensive GUI actions (16.2%) and involve remote cloud-hosted workspaces (10.6%). We hope that Spider2-V paves the way for autonomous multimodal agents to transform the automation of data science and engineering workflow. Our code and data are available at https://spider2-v.github.io.

Submitted 15 July, 2024; originally announced July 2024.

Comments: 34 pages, 14 figures, 10 tables

arXiv:2407.10947 [pdf, other]

Can Textual Semantics Mitigate Sounding Object Segmentation Preference?

Authors: Yaoting Wang, Peiwen Sun, Yuanchao Li, Honggang Zhang, Di Hu

Abstract: The Audio-Visual Segmentation (AVS) task aims to segment sounding objects in the visual space using audio cues. However, in this work, it is recognized that previous AVS methods show a heavy reliance on detrimental segmentation preferences related to audible objects, rather than precise audio guidance. We argue that the primary reason is that audio lacks robust semantics compared to vision, especially in multi-source sounding scenes, resulting in weak audio guidance over the visual space. Motivated by the fact that text modality is well explored and contains rich abstract semantics, we propose leveraging text cues from the visual scene to enhance audio guidance with the semantics inherent in text. Our approach begins by obtaining scene descriptions through an off-the-shelf image captioner and prompting a frozen large language model to deduce potential sounding objects as text cues. Subsequently, we introduce a novel semantics-driven audio modeling module with a dynamic mask to integrate audio features with text cues, leading to representative sounding object features. These features not only encompass audio cues but also possess vivid semantics, providing clearer guidance in the visual space. Experimental results on AVS benchmarks validate that our method exhibits enhanced sensitivity to audio when aided by text cues, achieving highly competitive performance on all three subsets. Project page: https://github.com/GeWu-Lab/Sounding-Object-Segmentation-Preference

Submitted 15 July, 2024; originally announced July 2024.

Comments: Accepted by ECCV2024

arXiv:2407.10701 [pdf, other]

DOCBENCH: A Benchmark for Evaluating LLM-based Document Reading Systems

Authors: Anni Zou, Wenhao Yu, Hongming Zhang, Kaixin Ma, Deng Cai, Zhuosheng Zhang, Hai Zhao, Dong Yu

Abstract: Recently, there has been a growing interest among large language model (LLM) developers in LLM-based document reading systems, which enable users to upload their own documents and pose questions related to the document contents, going beyond simple reading comprehension tasks. Consequently, these systems have been carefully designed to tackle challenges such as file parsing, metadata extraction, multi-modal information understanding and long-context reading. However, no current benchmark exists to evaluate their performance in such scenarios, where a raw file and questions are provided as input, and a corresponding response is expected as output. In this paper, we introduce DocBench, a new benchmark designed to evaluate LLM-based document reading systems. Our benchmark involves a meticulously crafted process, including the recruitment of human annotators and the generation of synthetic questions. It includes 229 real documents and 1,102 questions, spanning across five different domains and four major types of questions. We evaluate both proprietary LLM-based systems accessible via web interfaces or APIs, and a parse-then-read pipeline employing open-source LLMs. Our evaluations reveal noticeable gaps between existing LLM-based document reading systems and human performance, underscoring the challenges of developing proficient systems. To summarize, DocBench aims to establish a standardized benchmark for evaluating LLM-based document reading systems under diverse real-world scenarios, thereby guiding future advancements in this research area.

Submitted 15 July, 2024; originally announced July 2024.

Comments: Work in progress

arXiv:2407.10691 [pdf, other]

$\texttt{MixGR}$: Enhancing Retriever Generalization for Scientific Domain through Complementary Granularity

Authors: Fengyu Cai, Xinran Zhao, Tong Chen, Sihao Chen, Hongming Zhang, Iryna Gurevych, Heinz Koeppl

Abstract: Recent studies show the growing significance of document retrieval in the generation of LLMs, i.e., RAG, within the scientific domain by bridging their knowledge gap. However, dense retrievers often struggle with domain-specific retrieval and complex query-document relationships, particularly when query segments correspond to various parts of a document. To alleviate such prevalent challenges, this paper introduces $\texttt{MixGR}$, which improves dense retrievers' awareness of query-document matching across various levels of granularity in queries and documents using a zero-shot approach. $\texttt{MixGR}$ fuses various metrics based on these granularities into a unified score that reflects a comprehensive query-document similarity. Our experiments demonstrate that $\texttt{MixGR}$ outperforms previous document retrieval by 24.7% and 9.8% on nDCG@5 with unsupervised and supervised retrievers, respectively, averaged on queries containing multiple subqueries from five scientific retrieval datasets. Moreover, the efficacy of two downstream scientific question-answering tasks highlights the advantage of $\texttt{MixGR}$ to boost the application of LLMs in the scientific domain.

Submitted 15 July, 2024; originally announced July 2024.

arXiv:2407.10670 [pdf, other]

Enhancing Retrieval and Managing Retrieval: A Four-Module Synergy for Improved Quality and Efficiency in RAG Systems

Authors: Yunxiao Shi, Xing Zi, Zijing Shi, Haimin Zhang, Qiang Wu, Min Xu

Abstract: Retrieval-augmented generation (RAG) techniques leverage the in-context learning capabilities of large language models (LLMs) to produce more accurate and relevant responses. Originating from the simple 'retrieve-then-read' approach, the RAG framework has evolved into a highly flexible and modular paradigm. A critical component, the Query Rewriter module, enhances knowledge retrieval by generating a search-friendly query. This method aligns input questions more closely with the knowledge base. Our research identifies opportunities to enhance the Query Rewriter module to Query Rewriter+ by generating multiple queries to overcome the Information Plateaus associated with a single query and by rewriting questions to eliminate Ambiguity, thereby clarifying the underlying intent. We also find that current RAG systems exhibit issues with Irrelevant Knowledge; to overcome this, we propose the Knowledge Filter. These two modules are both based on the instruction-tuned Gemma-2B model, which together enhance response quality. The final identified issue is Redundant Retrieval; we introduce the Memory Knowledge Reservoir and the Retriever Trigger to solve this. The former supports the dynamic expansion of the RAG system's knowledge base in a parameter-free manner, while the latter optimizes the cost for accessing external knowledge, thereby improving resource utilization and response efficiency. These four RAG modules synergistically improve the response quality and efficiency of the RAG system. The effectiveness of these modules has been validated through experiments and ablation studies across six common QA datasets. The source code can be accessed at https://github.com/Ancientshi/ERM4.

Submitted 15 July, 2024; originally announced July 2024.

Comments: ECAI2024 #1304

arXiv:2407.10421 [pdf, ps, other]

Constraining Weyl type f(Q,T) gravity with Big Bang Nucleosynthesis

Authors: Jian Ge, Lei Ming, Shi-Dong Liang, Hong-Hao Zhang, Tiberiu Harko

Abstract: The Weyl type $f(Q,T)$ modified gravity theory is an extension of the $f(Q)$ and $f(Q,T)$ type theories, where $T$ is the trace of the matter energy-momentum tensor, and the scalar non-metricity $Q$ is represented in its standard Weyl form, and it is fully determined by a vector field $ω_μ$. The theory can give a good description of the observational data, and of the evolution of the late-time Universe, including a geometric explanation of the dark energy. In this work we investigate the Big Bang Nucleosynthesis (BBN) constraints on several Weyl type $f(Q,T)$ gravity models. In particular, we consider the corrections that Weyl type $f(Q,T)$ terms induce on the freeze-out temperature $\mathcal{T}_f$, as compared to the standard $Λ$CDM results. We analyze in detail three distinct cosmological models, corresponding to specific choices of the functional form of $f(Q,T)$. The first model has a simple linear additive structure in $Q$ and $T$, the second model is multiplicative in $Q$ and $T$, while the third is additive in $T$ and the exponential of $Q$. For each $f(Q,T)$ we consider first the cosmological evolution in the radiation dominated era, and then we impose the observational bound on $\left|δ\mathcal{T}_f/ \mathcal{T}_f\right|$ to obtain constraints on the model parameters from the primordial abundances of the light elements such as helium-4, deuterium and lithium-7. The abundances of helium-4 and deuterium agree with theoretical predictions; however, the lithium problem, even if slightly alleviated, still persists for the considered Weyl type $f(Q,T)$ models. Generally, these models satisfy the BBN constraints, and thus they represent viable cosmologies describing the entire dynamical time scale of the evolution of the Universe.

Submitted 14 July, 2024; originally announced July 2024.

Comments: 19 pages, 12 figures

arXiv:2407.10352 [pdf]

Signature of Orbital Driven Finite Momentum Pairing in a 3D Ising Superconductor

Authors: F. Z. Yang, H. D. Zhang, Saswata Mandal, F. Y. Meng, G. Fabbris, A. Said, P. Mercado Lozano, A. Rajapitamahuni, E. Vescovo, C. Nelson, S. Lin, Y. Park, E. M. Clements, T. Z. Ward, H. -N. Lee, H. C. Lei, C. X. Liu, H. Miao

Abstract: The finite momentum superconducting pairing states (FMPs), where Cooper pairs carry non-zero momentum, are believed to give rise to exotic physical phenomena including the pseudogap phase of cuprate high-Tc superconductors and Majorana fermions in topological superconductivity. FMPs can emerge in intertwined electronic liquids with strong spin-spin interactions or be induced by lifting the spin degeneracy under magnetic field as originally proposed by Fulde-Ferrell and Larkin-Ovchinnikov. In quantum materials with strong Ising-type spin-orbit coupling, such as the 2D transition metal dichalcogenides (TMDs), the spin degree of freedom is frozen, enabling novel orbital driven FMPs via magnetoelectric effect. While evidence of orbital driven FMPs has been revealed in bilayer TMDs, its realization in 3D bulk materials remains an unresolved challenge. Here we report experimental signatures of FMP in a locally noncentrosymmetric bulk superconductor 4Hb-TaS2. Using hard X-ray diffraction and angle-resolved photoemission spectroscopy, we reveal unusual 2D chiral charge density wave (CDW) and weak interlayer hopping in 4Hb-TaS2. Below the superconducting transition temperature, the upper critical field, Hc2, increases linearly with decreasing temperature and well exceeds the Pauli limit, thus establishing the dominant orbital pair-breaking mechanism. Remarkably, we discover a field-induced superconductivity-to-superconductivity transition that breaks continuous rotational symmetry of the s-wave uniform pairing in the Bardeen-Cooper-Schrieffer theory down to the six-fold rotation symmetry. Combined with a Ginzburg-Landau free energy analysis that incorporates magnetoelectric effect, our observations provide strong evidence of orbital driven FMP in the 3D quantum heterostructure 4Hb-TaS2.

Submitted 14 July, 2024; originally announced July 2024.

arXiv:2407.10287 [pdf, ps, other]

Gauss Relations in Feynman Integrals

Authors: Tai-Fu Feng, Yang Zhou, Hai-Bin Zhang

Abstract: Embedding Feynman integrals in Grassmannians, we express Feynman integrals as linear combinations of generalized hypergeometric functions. Here we present general methods to obtain Gauss relations among those generalized hypergeometric functions. The hypergeometric expressions of Feynman integrals are analytically continued from some connected component to another by the Gauss inverse relations, then continued to the whole domain of definition by the Gauss-Kummer relations. The Laurent series of the Feynman integral around the time-space dimension $D=4$ is obtained by the Gauss adjacent relations where the coefficients of powers of $D-4$ are given as some finite linear combinations of hypergeometric functions with integer parameters. As an example, we illustrate how to use the method to obtain the analytic expression of the Feynman integral of one-loop self energy in its whole domain of definition.

Submitted 14 July, 2024; originally announced July 2024.

Comments: 75 pages, including text of 22 pages + 1 figure +appendices of 52 pages

arXiv:2407.10233 [pdf, other]

Visual Prompt Selection for In-Context Learning Segmentation

Authors: Wei Suo, Lanqing Lai, Mengyang Sun, Hanwang Zhang, Peng Wang, Yanning Zhang

Abstract: As a fundamental and extensively studied task in computer vision, image segmentation aims to locate and identify different semantic concepts at the pixel level. Recently, inspired by In-Context Learning (ICL), several generalist segmentation frameworks have been proposed, providing a promising paradigm for segmenting specific objects. However, existing works mostly ignore the value of visual prompts or simply apply similarity sorting to select contextual examples. In this paper, we focus on rethinking and improving the example selection strategy. By comprehensive comparisons, we first demonstrate that ICL-based segmentation models are sensitive to different contexts. Furthermore, empirical evidence indicates that the diversity of contextual prompts plays a crucial role in guiding segmentation. Based on the above insights, we propose a new stepwise context search method. Different from previous works, we construct a small yet rich candidate pool and adaptively search the well-matched contexts. More importantly, this method effectively reduces the annotation cost by compacting the search space. Extensive experiments show that our method is an effective strategy for selecting examples and enhancing segmentation performance.

Submitted 14 July, 2024; originally announced July 2024.

Comments: Accepted by ECCV2024

arXiv:2407.10199 [pdf, other]

Charge radii of $^{11-16}$C, $^{13-17}$N and $^{15-18}$O determined from their charge-changing cross-sections and the mirror-difference charge radii

Authors: J. W. Zhao, B. -H. Sun, I. Tanihata, J. Y. Xu, K. Y. Zhang, A. Prochazka, L. H. Zhu, S. Terashima, J. Meng, L. C. He, C. Y. Liu, G. S. Li, C. G. Lu, W. J. Lin, W. P. Lin, Z. Liu, P. P Ren, Z. Y. Sun, F. Wang, J. Wang, M. Wang, S. T. Wang, X. L. Wei, X. D. Xu, J. C. Zhang, et al. (2 additional authors not shown)

Abstract: Charge-changing cross-sections of $^{11-16}$C, $^{13-17}$N and $^{15-18}$O on a carbon target have been determined at energies around 300 MeV/nucleon. A nucleon separation energy dependent correction factor has been introduced to the Glauber model calculation for extracting the nuclear charge radii from the experimental CCCSs. The charge radii of $^{11}$C, $^{13,16}$N and $^{15}$O thus were determined for the first time. With the new radii, we studied the experimental mirror-difference charge radii ($ΔR_{\text {ch}}^{\text {mirror}}$) of $^{11}$B-$^{11}$C, $^{13}$C-$^{13}$N, $^{15}$N-$^{15}$O, $^{17}$N-$^{17}$Ne pairs for the first time. We find that the $ΔR_{\text {ch}}^{\text {mirror}}$, including both bound and weakly bound proton-rich mirror partners, are reproduced by the empirical relation to the isospin asymmetry predicted by the $ab$ $initio$ calculations.

Submitted 14 July, 2024; originally announced July 2024.

Comments: 3 figures, submitted to Physics Letters B

arXiv:2407.10124 [pdf, other]

Adaptive Model Predictive Control with Data-driven Error Model for Quadrupedal Locomotion

Authors: Xuanqi Zeng, Hongbo Zhang, Linzhu Yue, Zhitao Song, Linwei Zhang, Yun-Hui Liu

Abstract: Model Predictive Control (MPC) relies heavily on the robot model for its control law. However, a gap always exists between the reduced-order control model with uncertainties and the real robot, which degrades its performance. To address this issue, we propose a controller that integrates a data-driven error model into traditional MPC for quadruped robots. Our approach leverages real-world data from sensors to compensate for defects in the control model. Specifically, we employ the Autoregressive Moving Average Vector (ARMAV) model to construct the state error model of the quadruped robot using data. The predicted state errors are then used to adjust the predicted future robot states generated by MPC. By such an approach, our proposed controller can provide more accurate inputs to the system, enabling it to achieve desired states even in the presence of model parameter inaccuracies or disturbances. The proposed controller exhibits the capability to partially eliminate the disparity between the model and the real-world robot, thereby enhancing the locomotion performance of quadruped robots. We validate our proposed method through simulations and real-world experimental trials on a large-size quadruped robot that involves carrying a 20 kg un-modeled payload (84% of body weight).

Submitted 14 July, 2024; originally announced July 2024.

Comments: 7 pages, 7 figures, conference (ICRA 2024)

arXiv:2407.10081 [pdf, other]

All Roads Lead to Rome: Unveiling the Trajectory of Recommender Systems Across the LLM Era

Authors: Bo Chen, Xinyi Dai, Huifeng Guo, Wei Guo, Weiwen Liu, Yong Liu, Jiarui Qin, Ruiming Tang, Yichao Wang, Chuhan Wu, Yaxiong Wu, Hao Zhang

Abstract: Recommender systems (RS) are vital for managing information overload and delivering personalized content, responding to users' diverse information needs. The emergence of large language models (LLMs) offers a new horizon for redefining recommender systems with vast general knowledge and reasoning capabilities. Standing across this LLM era, we aim to integrate recommender systems into a broader picture, and pave the way for more comprehensive solutions for future research. Therefore, we first offer a comprehensive overview of the technical progression of recommender systems, particularly focusing on language foundation models and their applications in recommendation. We identify two evolution paths of modern recommender systems -- via list-wise recommendation and conversational recommendation. These two paths finally converge at LLM agents with superior capabilities of long-term memory, reflection, and tool intelligence. Along these two paths, we point out that the information effectiveness of the recommendation is increased, while the user's acquisition cost is decreased. Technical features, research methodologies, and inherent challenges for each milestone along the path are carefully investigated -- from traditional list-wise recommendation to LLM-enhanced recommendation to recommendation with LLM agents. Finally, we highlight several unresolved challenges crucial for the development of future personalization technologies and interfaces and discuss the future prospects.

Submitted 14 July, 2024; originally announced July 2024.

arXiv:2407.09984 [pdf, ps, other]

Stabilizing Dynamic Systems through Neural Network Learning: A Robust Approach

Authors: Yu Zhang, Haoyu Zhang, Yongxiang Zou, Houcheng Li, Long Cheng

Abstract: Point-to-point and periodic motions are ubiquitous in the world of robotics. To master these motions, Autonomous Dynamic System (ADS) based algorithms are fundamental in the domain of Learning from Demonstration (LfD). However, these algorithms face the significant challenge of balancing precision in learning with the maintenance of system stability. This paper addresses this challenge by presenting a novel ADS algorithm that leverages neural network technology. The proposed algorithm is designed to distill essential knowledge from demonstration data, ensuring stability during the learning of both point-to-point and periodic motions. For point-to-point motions, a neural Lyapunov function is proposed to align with the provided demonstrations. In the case of periodic motions, the neural Lyapunov function is used with the transversal contraction to ensure that all generated motions converge to a stable limit cycle. The model utilizes a streamlined neural network architecture, adept at achieving dual objectives: optimizing learning accuracy while maintaining global stability. To thoroughly assess the efficacy of the proposed algorithm, rigorous evaluations are conducted using the LASA dataset and a manually designed dataset. These assessments are complemented by empirical validation through robotic experiments, providing robust evidence of the algorithm's performance.

Submitted 13 July, 2024; originally announced July 2024.

Comments: arXiv admin note: text overlap with arXiv:2309.08849

arXiv:2407.09943 [pdf, other]

Minimizing PLM-Based Few-Shot Intent Detectors

Authors: Haode Zhang, Xiao-Ming Wu, Albert Y. S. Lam

Abstract: Recent research has demonstrated the feasibility of training efficient intent detectors based on pre-trained language models (PLMs) with limited labeled data. However, deploying these detectors in resource-constrained environments such as mobile devices poses challenges due to their large sizes. In this work, we aim to address this issue by exploring techniques to minimize the size of PLM-based intent detectors trained with few-shot data. Specifically, we utilize large language models (LLMs) for data augmentation, employ a cutting-edge model compression method for knowledge distillation, and devise a vocabulary pruning mechanism called V-Prune. Through these approaches, we successfully achieve a compression ratio of 21 in model memory usage, including both Transformer and the vocabulary, while maintaining almost identical performance levels on four real-world benchmarks.

Submitted 13 July, 2024; originally announced July 2024.

arXiv:2407.09942 [pdf, other]

Deterministic Benchmarking of Quantum Gates

Authors: Vinay Tripathi, Daria Kowsari, Kumar Saurav, Haimeng Zhang, Eli M. Levenson-Falk, Daniel A. Lidar

Abstract: We introduce deterministic benchmarking (DB), a protocol designed to identify the interplay of coherent and incoherent errors overlooked by randomized benchmarking (RB) and related benchmarking methods. DB provides a set of four parameters that characterize both incoherent and coherent errors in the single-qubit gate set. Furthermore, DB reveals asymmetries in gate performance induced by strong relaxation errors ($T_1$). We experimentally demonstrate DB using a superconducting transmon qubit and support these results with a simple analytical model and master equation simulations. Our findings uncover critical errors missed by conventional RB and point to strategies to mitigate these errors.

Submitted 13 July, 2024; originally announced July 2024.

Comments: 13 pages, 5 figures, comments are welcome

arXiv:2407.09770 [pdf, other]

doi 10.1002/qute.202300068

Scheme for measuring topological transitions in a continuous variable system

Authors: Bi-Yao Wang, Hao-Long Zhang, Shou-Bang Yang, Fan Wu, Zhen-Biao Yang, Shi-Biao Zheng

Abstract: We propose a scheme for measuring topological properties in a two-photon-driven Kerr-nonlinear resonator (KNR) subjected to a single-photon modulation. The topological properties are revealed through the observation of the Berry curvature and hence the first Chern number, as a nonadiabatic response of the physical observable to the change rate of the control parameter of the modulated drive. The parameter manifold, constructed from the system's Hamiltonian that determines its dynamics constrained in the state space spanned by the even and odd cat states as two basis states, is adjusted so that the degeneracy crossing the manifold indicates a topological transition. The scheme, with such continuous variable states in mesoscopic systems, provides a new perspective for exploration of the geometry and the related topology with complex systems.

Submitted 13 July, 2024; originally announced July 2024.

Journal ref: Advanced Quantum Technologies, 2023

arXiv:2407.09528 [pdf]

Prism XR -- A Curated Exhibition Experience in Virtual Reality with Peer Annotation Features and Virtual Guides for Art and Archaeology Classes

Authors: Huopu Zhang

Abstract: The Prism XR project is a curated exhibition experience in virtual reality (VR) for art and archaeology education with features designed for the enhancement of interactivity and collaborative learning. The project integrates peer annotations and a virtual exhibition guide to augment educational experiences. The peer annotation features are intended to facilitate visitor critiques and comments, which are pivotal in fostering a dialogue between the curator and the audience and a dialogue among the visitors in art and archaeology education, and which are demonstrated to have positive impacts on learning motivation and learning outcomes. The virtual exhibition guide is intended to address the issue of isolation in the virtual exhibition space and to increase interactivity in the virtual curatorial experiences.

Submitted 15 July, 2024; v1 submitted 24 June, 2024; originally announced July 2024.

arXiv:2407.09268 [pdf, other]

Region Attention Transformer for Medical Image Restoration

Authors: Zhiwen Yang, Haowei Chen, Ziniu Qian, Yang Zhou, Hui Zhang, Dan Zhao, Bingzheng Wei, Yan Xu

Abstract: Transformer-based methods have demonstrated impressive results in medical image restoration, attributed to the multi-head self-attention (MSA) mechanism in the spatial dimension. However, the majority of existing Transformers conduct attention within fixed and coarsely partitioned regions (e.g., the entire image or fixed patches), resulting in interference from irrelevant regions and fragmentation of continuous image content. To overcome these challenges, we introduce a novel Region Attention Transformer (RAT) that utilizes a region-based multi-head self-attention mechanism (R-MSA). The R-MSA dynamically partitions the input image into non-overlapping semantic regions using the robust Segment Anything Model (SAM) and then performs self-attention within these regions. This region partitioning is more flexible and interpretable, ensuring that only pixels from similar semantic regions complement each other, thereby eliminating interference from irrelevant regions. Moreover, we introduce a focal region loss to guide our model to adaptively focus on recovering high-difficulty regions. Extensive experiments demonstrate the effectiveness of RAT in various medical image restoration tasks, including PET image synthesis, CT image denoising, and pathological image super-resolution. Code is available at https://github.com/Yaziwel/Region-Attention-Transformer-for-Medical-Image-Restoration.git.

Submitted 12 July, 2024; originally announced July 2024.

Comments: This paper has been accepted by MICCAI 2024

arXiv:2407.09265 [pdf, other]

Novel structures and collapse of solitons in nonminimally gravitating dark matter halos

Authors: Jiajun Chen, Hong-Yi Zhang

Abstract: Ultralight dark matter simulations predict Bose-Einstein condensations with short-range correlation, known as solitons or boson stars, at the centers of dark matter halos. This paper investigates the formation and collapse of dark matter solitons influenced by nonminimal gravitational effects, characterized by gradient-dependent self-interactions of dark matter and an additional source in Poisson's equation for gravity. Our simulations suggest that the initial evolution of dark matter resembles that without nonminimal gravitational effects. However, regions with negative mass density may develop, and solitons will collapse when their densities reach certain critical values for both positive and negative coupling constants. With strong nonminimal coupling, structure growth could be significantly enhanced.

Submitted 12 July, 2024; originally announced July 2024.

Comments: 10 pages, 7 big figures

arXiv:2407.08924 [pdf, other]

Disassembling Obfuscated Executables with LLM

Authors: Huanyao Rong, Yue Duan, Hang Zhang, XiaoFeng Wang, Hongbo Chen, Shengchen Duan, Shen Wang

Abstract: Disassembly is a challenging task, particularly for obfuscated executables containing junk bytes, which are designed to induce disassembly errors. Existing solutions rely on heuristics or leverage machine learning techniques, but achieve only limited success. Fundamentally, such obfuscation cannot be defeated without in-depth understanding of the binary executable's semantics, which is made possible by the emergence of large language models (LLMs). In this paper, we present DisasLLM, a novel LLM-driven disassembler to overcome the challenge in analyzing obfuscated executables. DisasLLM consists of two components: an LLM-based classifier that determines whether an instruction in an assembly code snippet is correctly decoded, and a disassembly strategy that leverages this model to disassemble obfuscated executables end-to-end. We evaluated DisasLLM on a set of heavily obfuscated executables, on which it is shown to significantly outperform other state-of-the-art disassembly solutions.

Submitted 11 July, 2024; originally announced July 2024.

arXiv:2407.08882 [pdf, ps, other]

Emerging Practices for Large Multimodal Model (LMM) Assistance for People with Visual Impairments: Implications for Design

Authors: Jingyi Xie, Rui Yu, He Zhang, Sooyeon Lee, Syed Masum Billah, John M. Carroll

Abstract: People with visual impairments perceive their environment non-visually and often use AI-powered assistive tools to obtain textual descriptions of visual information. Recent large vision-language model-based AI-powered tools like Be My AI are more capable of understanding users' inquiries in natural language and describing the scene in audible text; however, the extent to which these tools are useful to visually impaired users is currently understudied. This paper aims to fill this gap. Our study with 14 visually impaired users reveals that they are adapting these tools organically -- not only can these tools facilitate complex interactions in household, spatial, and social contexts, but they also act as an extension of users' cognition, as if the cognition were distributed in the visual information. We also found that although the tools are currently not goal-oriented, users accommodate this limitation and embrace the tools' capabilities for broader use. These findings enable us to envision design implications for creating more goal-oriented, real-time processing, and reliable AI-powered assistive technology.

Submitted 11 July, 2024; originally announced July 2024.

arXiv:2407.08787 [pdf, other]

doi 10.1609/aaai.v38i5.28249

Data Adaptive Traceback for Vision-Language Foundation Models in Image Classification

Authors: Wenshuo Peng, Kaipeng Zhang, Yue Yang, Hao Zhang, Yu Qiao

Abstract: Vision-language foundation models have been incredibly successful in a wide range of downstream computer vision tasks using adaptation methods. However, due to the high cost of obtaining pre-training datasets, pairs with weak image-text correlation in the data exist in large numbers. We call them weak-paired samples. Due to the limitations of these weak-paired samples, pre-training models are unable to mine all the knowledge from pre-training data. The existing adaptation methods do not consider the missing knowledge, which may lead to crucial task-related knowledge for the downstream tasks being ignored. To address this issue, we propose a new adaptation framework called Data Adaptive Traceback (DAT). Specifically, we utilize a zero-shot-based method to extract the most downstream task-related subset of the pre-training data to enable the downstream tasks. Furthermore, we adopt a pseudo-label-based semi-supervised technique to reuse the pre-training images and a vision-language contrastive learning method to address the confirmation bias issue in semi-supervised learning. We conduct extensive experiments that show our proposed DAT approach meaningfully improves performance on various benchmark datasets over traditional adaptation methods by simply.

Submitted 11 July, 2024; originally announced July 2024.

Comments: 9 pages, 4 figures

arXiv:2407.08569 [pdf, other]

Approaching Outside: Scaling Unsupervised 3D Object Detection from 2D Scene

Authors: Ruiyang Zhang, Hu Zhang, Hang Yu, Zhedong Zheng

Abstract: Unsupervised 3D object detection aims to accurately detect objects in unstructured environments with no explicit supervisory signals. This task, given sparse LiDAR point clouds, often results in compromised performance for detecting distant or small objects due to the inherent sparsity and limited spatial resolution. In this paper, we make one of the early attempts to integrate LiDAR data with 2D images for unsupervised 3D detection and introduce a new method, dubbed LiDAR-2D Self-paced Learning (LiSe). We argue that RGB images serve as a valuable complement to LiDAR data, offering precise 2D localization cues, particularly when scarce LiDAR points are available for certain objects. Considering the unique characteristics of both modalities, our framework devises a self-paced learning pipeline that incorporates adaptive sampling and weak model aggregation strategies. The adaptive sampling strategy dynamically tunes the distribution of pseudo labels during training, countering the tendency of models to overfit easily detected samples, such as nearby and large-sized objects. By doing so, it ensures a balanced learning trajectory across varying object scales and distances. The weak model aggregation component consolidates the strengths of models trained under different pseudo label distributions, culminating in a robust and powerful final model. Experimental evaluations validate the efficacy of our proposed LiSe method, manifesting significant improvements of +7.1% AP$_{BEV}$ and +3.4% AP$_{3D}$ on nuScenes, and +8.3% AP$_{BEV}$ and +7.4% AP$_{3D}$ on Lyft compared to existing techniques.

Submitted 11 July, 2024; originally announced July 2024.

Comments: Accepted by ECCV'24, 18 pages, 5 figures, 6 tables

arXiv:2407.08265 [pdf, other]

Enhancing Thermal Infrared Tracking with Natural Language Modeling and Coordinate Sequence Generation

Authors: Miao Yan, Ping Zhang, Haofei Zhang, Ruqian Hao, Juanxiu Liu, Xiaoyang Wang, Lin Liu

Abstract: Thermal infrared (TIR) tracking is an essential topic in computer vision tasks because of its advantage of all-weather imaging. However, most conventional methods utilize only hand-crafted features, while deep learning-based correlation filtering methods are limited by simple correlation operations. Transformer-based methods ignore temporal and coordinate information, which is critical for TIR tracking that lacks texture and color information. In this paper, to address these issues, we apply natural language modeling to TIR tracking and propose a novel model called NLMTrack, which enhances the utilization of coordinate and temporal information. NLMTrack applies an encoder that unifies feature extraction and feature fusion, which simplifies the TIR tracking pipeline. To address the challenge of low detail and low contrast in TIR images, on the one hand, we design a multi-level progressive fusion module that enhances the semantic representation and incorporates multi-scale features. On the other hand, the decoder combines the TIR features and the coordinate sequence features using a causal transformer to generate the target sequence step by step. Moreover, we explore an adaptive loss aimed at elevating tracking accuracy and a simple template update strategy to accommodate the target's appearance variations. Experiments show that NLMTrack achieves state-of-the-art performance on multiple benchmarks. The code is publicly available at https://github.com/ELOESZHANG/NLMTrack.

Submitted 11 July, 2024; originally announced July 2024.

arXiv:2407.08093 [pdf, other]

MemWarp: Discontinuity-Preserving Cardiac Registration with Memorized Anatomical Filters

Authors: Hang Zhang, Xiang Chen, Renjiu Hu, Dongdong Liu, Gaolei Li, Rongguang Wang

Abstract: Many existing learning-based deformable image registration methods impose constraints on deformation fields to ensure they are globally smooth and continuous. However, this assumption does not hold in cardiac image registration, where different anatomical regions exhibit asymmetric motions during respiration and movements due to sliding organs within the chest. Consequently, such global constraints fail to accommodate local discontinuities across organ boundaries, potentially resulting in erroneous and unrealistic displacement fields. In this paper, we address this issue with MemWarp, a learning framework that leverages a memory network to store prototypical information tailored to different anatomical regions. MemWarp differs from earlier approaches in two main aspects: firstly, by decoupling feature extraction from similarity matching in moving and fixed images, it facilitates more effective utilization of feature maps; secondly, despite its capability to preserve discontinuities, it eliminates the need for segmentation masks during model inference. In experiments on a publicly available cardiac dataset, our method achieves considerable improvements in registration accuracy and produces realistic deformations, outperforming state-of-the-art methods with a remarkable 7.1% Dice score improvement over the runner-up semi-supervised method. Source code will be available at https://github.com/tinymilky/Mem-Warp.

Submitted 10 July, 2024; originally announced July 2024.

Comments: 11 pages, 2 figures, 2 tables

arXiv:2407.07959 [pdf, other]

Source Code Summarization in the Era of Large Language Models

Authors: Weisong Sun, Yun Miao, Yuekang Li, Hongyu Zhang, Chunrong Fang, Yi Liu, Gelei Deng, Yang Liu, Zhenyu Chen

Abstract: To support software developers in understanding and maintaining programs, various automatic (source) code summarization techniques have been proposed to generate a concise natural language summary (i.e., comment) for a given code snippet. Recently, the emergence of large language models (LLMs) has led to a great boost in the performance of code-related tasks. In this paper, we undertake a systematic and comprehensive study on code summarization in the era of LLMs, which covers multiple aspects involved in the workflow of LLM-based code summarization. Specifically, we begin by examining prevalent automated evaluation methods for assessing the quality of summaries generated by LLMs and find that the results of the GPT-4 evaluation method are most closely aligned with human evaluation. Then, we explore the effectiveness of five prompting techniques (zero-shot, few-shot, chain-of-thought, critique, and expert) in adapting LLMs to code summarization tasks. Contrary to expectations, advanced prompting techniques may not outperform simple zero-shot prompting. Next, we investigate the impact of LLMs' model settings (including top_p and temperature parameters) on the quality of generated summaries. We find the impact of the two parameters on summary quality varies by the base LLM and programming language, but their impacts are similar. Moreover, we canvass LLMs' abilities to summarize code snippets in distinct types of programming languages. The results reveal that LLMs perform suboptimally when summarizing code written in logic programming languages compared to other language types. Finally, we unexpectedly find that CodeLlama-Instruct with 7B parameters can outperform advanced GPT-4 in generating summaries describing code implementation details and asserting code properties. We hope that our findings can provide a comprehensive understanding of code summarization in the era of LLMs.

Submitted 9 July, 2024; originally announced July 2024.

Comments: Just accepted to the 47th International Conference on Software Engineering (ICSE 2025)

MSC Class: 68-04 ACM Class: D.2.3; I.2.7

arXiv:2407.07895 [pdf, other]

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Authors: Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, Chunyuan Li

Abstract: Visual instruction tuning has made considerable strides in enhancing the capabilities of Large Multimodal Models (LMMs). However, existing open LMMs largely focus on single-image tasks, and their applications to multi-image scenarios remain less explored. Additionally, prior LMM research separately tackles different scenarios, leaving it impossible to generalize across scenarios with new emerging capabilities. To this end, we introduce LLaVA-NeXT-Interleave, which simultaneously tackles Multi-image, Multi-frame (video), Multi-view (3D), and Multi-patch (single-image) scenarios in LMMs. To enable these capabilities, we regard the interleaved data format as a general template and compile the M4-Instruct dataset with 1,177.6k samples, spanning 4 primary domains with 14 tasks and 41 datasets. We also curate the LLaVA-Interleave Bench to comprehensively evaluate the multi-image performance of LMMs. Through extensive experiments, LLaVA-NeXT-Interleave achieves leading results in multi-image, video, and 3D benchmarks, while maintaining the performance of single-image tasks. Besides, our model also exhibits several emerging capabilities, e.g., transferring tasks across different settings and modalities. Code is available at https://github.com/LLaVA-VL/LLaVA-NeXT.

Submitted 10 July, 2024; originally announced July 2024.

Comments: Project Page: https://llava-vl.github.io/blog/2024-06-16-llava-next-interleave/

arXiv:2407.07731 [pdf, other]

Large spin-orbit torque in a-plane $α$-Fe$_{2}$O$_{3}$/Pt bilayers

Authors: Igor Lyalin, Hantao Zhang, Justin Michel, Daniel Russell, Fengyuan Yang, Ran Cheng, Roland K. Kawakami

Abstract: Realization of efficient spin-orbit torque switching of the Néel vector in insulating antiferromagnets is a challenge, often complicated by spurious effects. Quantifying the spin-orbit torques in antiferromagnet/heavy metal heterostructures is an important first step towards this goal. Here, we employ magneto-optic techniques to study damping-like spin-orbit torque (DL-SOT) in a-plane $α$-Fe$_2$O$_3$ (hematite) with a Pt spin-orbit overlayer. We find that the DL-SOT efficiency is two orders of magnitude larger than reported in c- and r-plane hematite/Pt using harmonic Hall techniques. The large magnitude of DL-SOT is supported by direct imaging of current-induced motion of antiferromagnetic domains that happens at moderate current densities. Our study introduces a new method for quantifying spin-orbit torque in antiferromagnets with a small canted moment and identifies a-plane $α$-Fe$_2$O$_3$ as a promising candidate to realize efficient SOT switching.

Submitted 10 July, 2024; originally announced July 2024.

Comments: 6 pages, 3 figures

arXiv:2407.07651 [pdf, other]

Study of the decay and production properties of $D_{s1}(2536)$ and $D_{s2}^*(2573)$

Authors: M. Ablikim, M. N. Achasov, P. Adlarson, O. Afedulidis, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, I. Balossino, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, A. Brueggemann, et al. (645 additional authors not shown)

Abstract: The $e^+e^-\rightarrow D_s^+D_{s1}(2536)^-$ and $e^+e^-\rightarrow D_s^+D^*_{s2}(2573)^-$ processes are studied using data samples collected with the BESIII detector at center-of-mass energies from 4.530 to 4.946 GeV. The absolute branching fractions of $D_{s1}(2536)^- \rightarrow \bar{D}^{*0}K^-$ and $D_{s2}^*(2573)^- \rightarrow \bar{D}^0K^-$ are measured for the first time to be $(35.9\pm 4.8\pm 3.5)\%$ and $(37.4\pm 3.1\pm 4.6)\%$, respectively. The measurements are in tension with predictions based on the assumption that the $D_{s1}(2536)$ and $D_{s2}^*(2573)$ are dominated by a bare $c\bar{s}$ component. The $e^+e^-\rightarrow D_s^+D_{s1}(2536)^-$ and $e^+e^-\rightarrow D_s^+D^*_{s2}(2573)^-$ cross sections are measured, and a resonant structure at around 4.6 GeV with a width of 50 MeV is observed for the first time with a statistical significance of $15σ$ in the $e^+e^-\rightarrow D_s^+D^*_{s2}(2573)^-$ process. It could be the $Y(4626)$ found by the Belle collaboration in the $D_s^+D_{s1}(2536)^{-}$ final state, since they have similar masses and widths. There is also evidence for a structure at around 4.75 GeV in both processes.

Submitted 10 July, 2024; originally announced July 2024.

arXiv:2407.07554 [pdf, other]

Beat-It: Beat-Synchronized Multi-Condition 3D Dance Generation

Authors: Zikai Huang, Xuemiao Xu, Cheng Xu, Huaidong Zhang, Chenxi Zheng, Jing Qin, Shengfeng He

Abstract: Dance, as an art form, fundamentally hinges on the precise synchronization with musical beats. However, achieving aesthetically pleasing dance sequences from music is challenging, with existing methods often falling short in controllability and beat alignment. To address these shortcomings, this paper introduces Beat-It, a novel framework for beat-specific, key pose-guided dance generation. Unlike prior approaches, Beat-It uniquely integrates explicit beat awareness and key pose guidance, effectively resolving two main issues: the misalignment of generated dance motions with musical beats, and the inability to map key poses to specific beats, critical for practical choreography. Our approach disentangles beat conditions from music using a nearest beat distance representation and employs a hierarchical multi-condition fusion mechanism. This mechanism seamlessly integrates key poses, beats, and music features, mitigating condition conflicts and offering rich, multi-conditioned guidance for dance generation. Additionally, a specially designed beat alignment loss ensures the generated dance movements remain in sync with the designated beats. Extensive experiments confirm Beat-It's superiority over existing state-of-the-art methods in terms of beat alignment and motion controllability.

Submitted 10 July, 2024; originally announced July 2024.

Comments: ECCV 2024

arXiv:2407.07351 [pdf, other]

Unity in Diversity: Multi-expert Knowledge Confrontation and Collaboration for Generalizable Vehicle Re-identification

Authors: Zhenyu Kuang, Hongyang Zhang, Lidong Cheng, Yinhao Liu, Yue Huang, Xinghao Ding

Abstract: Generalizable vehicle re-identification (ReID) aims to enable the well-trained model in diverse source domains to broadly adapt to unknown target domains without additional fine-tuning or retraining. However, it still faces the challenges of domain shift problem and has difficulty accurately generalizing to unknown target domains. This limitation occurs because the model relies heavily on primary domain-invariant features in the training data and pays less attention to potentially valuable secondary features. To solve this complex and common problem, this paper proposes the two-stage Multi-expert Knowledge Confrontation and Collaboration (MiKeCoCo) method, which incorporates multiple experts with unique perspectives into Contrastive Language-Image Pretraining (CLIP) and fully leverages high-level semantic knowledge for comprehensive feature representation. Specifically, we propose to construct the learnable prompt set of all specific-perspective experts by adversarial learning in the latent space of visual features during the first stage of training. The learned prompt set with high-level semantics is then utilized to guide representation learning of the multi-level features for final knowledge fusion in the next stage. In this process of knowledge fusion, although multiple experts employ different assessment ways to examine the same vehicle, their common goal is to confirm the vehicle's true identity. Their collective decision can ensure the accuracy and consistency of the evaluation results. Furthermore, we design different image inputs for two-stage training, which include image component separation and diversity enhancement in order to extract the ID-related prompt representation and to obtain feature representation highlighted by all experts, respectively. Extensive experimental results demonstrate that our method achieves state-of-the-art recognition performance.

Submitted 10 July, 2024; originally announced July 2024.
