
Differentiable Neural-Integrated Meshfree Method for Forward and Inverse Modeling of Finite Strain Hyperelasticity

Authors: Honghui Du, Binyao Guo, QiZhi He

Abstract: The present study aims to extend a novel physics-informed machine learning approach, the neural-integrated meshfree (NIM) method, to model finite-strain problems characterized by nonlinear elasticity and large deformations. To this end, hyperelastic material models are integrated into the loss function of the NIM method by employing a consistent local variational formulation. Thanks to its inherent differentiable programming capabilities, NIM circumvents the need to derive the Newton-Raphson linearization of the variational form and the resulting tangent stiffness matrix, which are typically required in traditional numerical methods. Additionally, NIM utilizes a hybrid neural-numerical approximation encoded with partition-of-unity basis functions, coined NeuroPU, to effectively represent the displacement and streamline the training process. NeuroPU can also be used to approximate unknown material fields, making NIM a unified framework for both forward and inverse modeling. For the imposition of displacement boundary conditions, this study introduces a new approach that incorporates singular kernel functions into the NeuroPU approximation, leveraging its unique ability to accommodate customized basis functions. Numerical experiments demonstrate the NIM method's capability in forward hyperelasticity modeling, achieving relative $L_2$ errors in the range $10^{-5}$ to $10^{-3}$, comparable to those of well-established finite element solvers.
Furthermore, NIM is applied to the complex task of identifying heterogeneous mechanical properties of hyperelastic materials from strain data, validating its effectiveness in the inverse modeling of nonlinear materials. To leverage GPU acceleration, NIM is fully implemented in the JAX deep learning framework, utilizing the accelerator-oriented array computation capabilities offered by JAX.

Submitted 15 July, 2024; originally announced July 2024.
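
The key idea that stresses (and tangents) come from differentiating a stored-energy function, rather than from a hand-derived linearization, can be sketched as follows. This assumes a compressible neo-Hookean model, which is an illustrative choice and not necessarily one of the paper's material models; central differences stand in for the automatic differentiation that a JAX implementation would use.

```python
import numpy as np

# Assumption: compressible neo-Hookean energy, chosen only for illustration.
def neo_hookean_energy(F, mu, lam):
    """Strain energy density W(F) = mu/2 (I1 - 3) - mu ln J + lam/2 (ln J)^2."""
    J = np.linalg.det(F)
    I1 = np.trace(F.T @ F)
    return 0.5 * mu * (I1 - 3.0) - mu * np.log(J) + 0.5 * lam * np.log(J) ** 2


def pk1_closed_form(F, mu, lam):
    """Hand-derived first Piola-Kirchhoff stress P = dW/dF."""
    J = np.linalg.det(F)
    F_inv_T = np.linalg.inv(F).T
    return mu * (F - F_inv_T) + lam * np.log(J) * F_inv_T


def pk1_from_energy(F, mu, lam, h=1e-6):
    """P obtained by differentiating W numerically with central differences.

    With JAX, this loop would be replaced by jax.grad applied to the energy.
    """
    P = np.zeros_like(F)
    for i in range(F.shape[0]):
        for j in range(F.shape[1]):
            dF = np.zeros_like(F)
            dF[i, j] = h
            P[i, j] = (neo_hookean_energy(F + dF, mu, lam)
                       - neo_hookean_energy(F - dF, mu, lam)) / (2.0 * h)
    return P
```

For a deformation gradient such as `F = [[1.2, 0.1, 0], [0.05, 0.9, 0], [0, 0, 1]]`, both functions agree to about six digits; differentiating once more in the same way would give the tangent that Newton-Raphson solvers normally require in closed form.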

arXiv:2407.10485 [pdf, other]

Effective Motion Modeling for UAV-platform Multiple Object Tracking with Re-Margin Loss

Authors: Mufeng Yao, Jinlong Peng, Qingdong He, Bo Peng, Hao Chen, Mingmin Chi, Chao Liu, Jon Atli Benediktsson

Abstract: Multiple object tracking (MOT) from unmanned aerial vehicle (UAV) platforms requires efficient motion modeling, because UAV-MOT faces tracking difficulties caused by large and irregular motion, as well as insufficient training due to the long-tailed motion distribution of current UAV-MOT datasets. Previous UAV-MOT methods either extract motion and detection features redundantly or supervise the motion model sparsely, which limits their tracking performance and speed. To this end, we propose a flowing-by-detection module that realizes accurate motion modeling at minimal cost. Focusing on the long-tailed motion problem ignored by previous works, we design a flow-guided margin loss that enables more complete training on large moving objects. Experiments on two widely used open-source datasets show that our proposed model can successfully track objects with large and irregular motion and outperforms existing state-of-the-art methods in UAV-MOT tasks.

Submitted 15 July, 2024; originally announced July 2024.

Comments: arXiv admin note: text overlap with arXiv:2308.07207

arXiv:2407.05368 [pdf, other]

Music Era Recognition Using Supervised Contrastive Learning and Artist Information

Authors: Qiqi He, Xuchen Song, Weituo Hao, Ju-Chiang Wang, Wei-Tsung Lu, Wei Li

Abstract: Does popular music from the 60s sound different from that of the 90s? A prior study has shown that there exist variations in patterns and regularities related to instrumentation changes and growing loudness across multi-decadal trends. This indicates that perceiving the era of a song from musical features such as audio and artist information is possible. Music era information can be an important feature for playlist generation and recommendation. However, the release year of a song is unavailable in many circumstances. This paper addresses the novel task of music era recognition. We formulate the task as a music classification problem and propose solutions based on supervised contrastive learning. An audio-based model is developed to predict the era from audio. For the case where artist information is available, we extend the audio-based model to take multimodal inputs and develop a framework, called MultiModal Contrastive (MMC) learning, to enhance the training. Experimental results on the Million Song Dataset demonstrate that the audio-based model achieves 54% accuracy within a 3-year tolerance range; incorporating artist information with the MMC framework for training yields a further 9% improvement.

Submitted 7 July, 2024; originally announced July 2024.
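
The abstract does not spell out its contrastive objective. As an assumption, the sketch below implements the widely used supervised contrastive (SupCon) loss, in which embeddings sharing a label (here, an era bin) are pulled together and all others pushed apart; it is not the authors' MMC code, and the temperature `tau` is a hypothetical default.

```python
import numpy as np


def supervised_contrastive_loss(z, labels, tau=0.1):
    """Supervised contrastive loss in the standard 'average over positives' form.

    z: (n, d) embeddings; labels: (n,) integer class ids (e.g. era bins).
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # project to unit sphere
    sim = z @ z.T / tau                                  # scaled cosine similarities
    n = z.shape[0]
    not_self = ~np.eye(n, dtype=bool)

    # Log-softmax of each row over all other samples (a != i), computed stably.
    masked = np.where(not_self, sim, -np.inf)
    row_max = masked.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(masked - row_max).sum(axis=1))
    log_prob = sim - log_denom[:, None]

    # Positives: same label, excluding the anchor itself.
    positives = (labels[:, None] == labels[None, :]) & not_self
    n_pos = positives.sum(axis=1)
    has_pos = n_pos > 0
    summed = np.where(positives, log_prob, 0.0).sum(axis=1)
    return -(summed[has_pos] / n_pos[has_pos]).mean()
```

With labels `[0, 0, 1, 1]`, embeddings that cluster by label give a loss near zero, while embeddings that mix the two labels give a loss near `1 / tau`.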

arXiv:2407.04498 [pdf, ps, other]

Global dynamics for the generalized chemotaxis-Navier-Stokes system in $\mathbb{R}^3$

Authors: Qingyou He, Ling-Yun Shou, Leyun Wu

Abstract: We consider the Cauchy problem for the three-dimensional generalized chemotaxis-Navier-Stokes system
\begin{equation*}
\begin{cases}
\partial_t n + u\cdot\nabla n = \Delta n - \nabla\cdot\big(\chi(c)\, n\nabla c\big),\\
\partial_t c + u\cdot\nabla c = \Delta c - n f(c),\\
\partial_t u + u\cdot\nabla u + \nabla P = -(-\Delta)^{\alpha} u - n\nabla\varphi,\\
\nabla\cdot u = 0.
\end{cases}
\end{equation*}
First, we study the time extensibility criteria for strong solutions, including the Prodi-Serrin type criterion ($\alpha>\frac{3}{4}$) and the Beirão da Veiga type criterion ($\alpha>\frac{1}{2}$). Furthermore, with Lions' dissipation exponent $\alpha\geq\frac{5}{4}$, we verify the global existence and uniqueness of strong solutions for arbitrarily large initial fluid velocity and oxygen concentration. These results reflect the influence of the generalized dissipation on the solutions of the coupled chemotaxis-fluid equations. Finally, in the regime of weaker dissipation ($\frac{3}{4}<\alpha<\frac{5}{4}$), we establish uniform regularity estimates for global strong solutions and further obtain optimal time-decay rates under the mild condition that the initial $L^2$ energy is small. To our knowledge, this is the first result concerning the global existence and large-time behavior of strong solutions for the three-dimensional chemotaxis-Navier-Stokes equations with possibly large oscillations.

Submitted 5 July, 2024; originally announced July 2024.

Comments: 39 pages
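
The generalized dissipation $-(-\Delta)^{\alpha}u$ is defined through its Fourier symbol: $\widehat{(-\Delta)^{\alpha}u}(\xi) = |\xi|^{2\alpha}\hat u(\xi)$, so $\alpha = 1$ recovers the classical viscous term. The minimal sketch below applies this multiplier on a 1-D periodic grid; the periodic 1-D setting is an assumption made only for illustration, since the paper works on $\mathbb{R}^3$.

```python
import numpy as np


def fractional_laplacian_periodic(u, alpha, length=2.0 * np.pi):
    """Apply (-Delta)^alpha to samples of a periodic 1-D function.

    Uses the Fourier-multiplier definition: each mode k is scaled by |k|^(2*alpha).
    """
    n = u.size
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=length / n)   # angular wavenumbers
    return np.real(np.fft.ifft(np.abs(k) ** (2.0 * alpha) * np.fft.fft(u)))
```

For `u = sin(3x)`, the operator returns `3**(2*alpha) * sin(3x)`; larger `alpha` damps high frequencies more strongly, which is why the global results above need `alpha` large enough.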

arXiv:2407.03612 [pdf, other]

Quantum phase transition in a quantum Rabi square with next-nearest-neighbor hopping

Authors: Yilun Xu, Feng-Xao Sun, Qiongyi He, Han Pu, Wei Zhang

Abstract: We propose a quantum Rabi square model in which both nearest-neighbor and next-nearest-neighbor photon hopping are allowed among four quantum Rabi systems located at the vertices of a square. By tuning the next-nearest hopping strength, we realize a first-order phase transition between the antiferromagnetic superradiant phase and the frustrated superradiant phase, as well as a second-order phase transition between the normal and superradiant phases. To understand the emergence of these phases, we show analytically that the effect induced by next-nearest hopping is equivalent to that of an artificial gauge phase. Our findings suggest that next-nearest-neighbor hopping can serve as an alternative to the gauge phase for realizing quantum control in quantum simulation and quantum materials applications, and that our model represents a basic building block for the frustrated $J_1$-$J_2$ quantum spin model on square lattices.

Submitted 3 July, 2024; originally announced July 2024.

arXiv:2407.03547 [pdf, ps, other]

Large Time Behavior of Solutions to Cauchy Problem for 1-D Compressible Isentropic Navier-Stokes/Allen-Cahn System

Authors: Yazhou Chen, Qiaolin He, Xiaoding Shi

Abstract: This paper is concerned with the large-time behavior of solutions to the Cauchy problem for the one-dimensional compressible Navier-Stokes/Allen-Cahn system describing an immiscible two-phase flow initially located near the phase separation state. Under the assumption that the initial data are a small perturbation of the constant state, we prove the global existence and uniqueness of solutions and establish time-decay rates for the solution as well as its higher-order spatial derivatives. Moreover, we show that the solutions of the system are time-asymptotically approximated by solutions of the modified parabolic system, and we obtain decay rates in $L^2$ and $L^1$. Furthermore, we show that the solution of the system is time-asymptotically approximated in $L^p$ ($1 \leq p \leq +\infty$) by diffusion waves.

Submitted 3 July, 2024; originally announced July 2024.

Comments: 26 pages

MSC Class: 35Q35; 35B65; 76N10; 35M10; 35B40; 35C20; 76T30

arXiv:2407.02784 [pdf]

Breeding the Cat Through Superposition of Two Schrodinger Kittens Based on Coupled Waveguides

Authors: Nuo Wang, Xinchen Zhang, Qi Liu, Fengxiao Sun, Qiongyi He, Ying Gu

Abstract: Optical Schrödinger cat (SC) states are highly anticipated because of their potential for realizing fault-tolerant quantum computing, but their practical merit appears only when the amplitude is larger than 2. However, such high-amplitude cats have not been prepared due to limitations rooted in the existing method. Here, we demonstrate a principle by which a large SC-like state can be generated by the superposition of two kittens, in which two nearby coherent states interfere and grow into an enlarged coherent-like state. Further, we propose a scheme to breed the cat beyond the limitations of former works with high probability by realizing the superposition of two SCs in coupled waveguides. The principle and scheme demonstrated here provide a new perspective on quantum superposition in phase space and a better solution for the efficient generation of SCs on chips.

Submitted 2 July, 2024; originally announced July 2024.
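
As background (not the authors' breeding scheme), one standard way to see why amplitude matters: the two components of a cat state $|\alpha\rangle + |-\alpha\rangle$ overlap as $|\langle\alpha|-\alpha\rangle| = e^{-2|\alpha|^2}$, which is about $3\times10^{-4}$ at $\alpha = 2$, so they become nearly orthogonal. The sketch below builds coherent and even cat states in a truncated Fock basis; the cutoff `n_max` is an arbitrary illustrative choice.

```python
import numpy as np


def coherent_state(alpha, n_max=60):
    """Fock-basis amplitudes of |alpha>, truncated at n_max photon numbers."""
    c = np.zeros(n_max, dtype=complex)
    c[0] = 1.0
    for n in range(1, n_max):
        c[n] = c[n - 1] * alpha / np.sqrt(n)   # builds alpha^n / sqrt(n!) recursively
    return np.exp(-abs(alpha) ** 2 / 2.0) * c


def even_cat_state(alpha, n_max=60):
    """Normalized even cat state proportional to |alpha> + |-alpha>."""
    psi = coherent_state(alpha, n_max) + coherent_state(-alpha, n_max)
    return psi / np.linalg.norm(psi)
```

The even cat has support only on even photon numbers, and the numerically computed overlap of its two components reproduces $e^{-2|\alpha|^2}$.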

arXiv:2407.02761 [pdf]

doi 10.1088/1361-648X/ad550a

Inducing superconductivity in quantum anomalous Hall regime

Authors: Yu Huang, Yu Fu, Peng Zhang, Kang L. Wang, Qing Lin He

Abstract: Interfacing a quantum anomalous Hall insulator with a conventional superconductor is known to be a promising route to realizing a topological superconductor, which has been pursued continuously for years. Such a proximity route depends to a great extent on control of the delicate interfacial coupling between the two constituents. However, a recent experiment reported a failure to reproduce such a topological superconductor, which was ascribed to the theoretical proposal's neglect of the electrical short caused by the superconductor. Here, we reproduce this topological superconductor with attention to interface control. The resulting conductance matrix over a wide magnetic field range agrees with the fingerprint of this topological superconductor. This allows us to develop a phase diagram that unveils three regions parameterized by various coupling limits, which not only supports the feasibility of fabricating the topological superconductor by proximity but also fully explains the origin of the previous debate. The present work provides a comprehensible guide to fabricating the topological superconductor.

Submitted 2 July, 2024; originally announced July 2024.

Comments: 17 pages, 4 figures

Journal ref: 2024 J. Phys.: Condens. Matter 36 37LT01

arXiv:2407.00294 [pdf, other]

Deep Neural Networks with Symplectic Preservation Properties

Authors: Qing He, Wei Cai

Abstract: We propose a deep neural network architecture designed such that its output forms an invertible symplectomorphism of the input. This design draws an analogy to the real-valued non-volume-preserving (real NVP) method used in normalizing flows. Using this type of neural network allows learning tasks on unknown Hamiltonian systems to be performed without breaking the inherent symplectic structure of the phase space.

Submitted 28 June, 2024; originally announced July 2024.

MSC Class: 37J11; 70H15; 68T07
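
The abstract does not give the layer design. A standard construction with both properties it names (an exact inverse, as in real NVP coupling layers, and symplecticity) is a "gradient shear" layer that updates $p \leftarrow p - \nabla V(q)$ or $q \leftarrow q + \nabla T(p)$. The sketch below uses a hypothetical log-cosh potential with random weights and checks numerically that the Jacobian $J$ of the composed map satisfies $J^\top \Omega J = \Omega$; it is an illustration, not the authors' network.

```python
import numpy as np


def grad_potential(x, W, b, a):
    """Gradient of the hypothetical potential V(x) = sum_k a_k log cosh(W_k . x + b_k)."""
    return W.T @ (a * np.tanh(W @ x + b))


class GradientShear:
    """Symplectic coupling: shear p by -grad V(q), or q by +grad T(p).

    As in a real NVP coupling, one half is updated from the other half,
    so the inverse is exact: just subtract the same shear.
    """

    def __init__(self, dim, hidden, update, rng, scale=0.5):
        self.W = scale * rng.normal(size=(hidden, dim))
        self.b = rng.normal(size=hidden)
        self.a = scale * rng.normal(size=hidden)
        self.update = update  # "p" or "q"

    def forward(self, q, p):
        if self.update == "p":
            return q, p - grad_potential(q, self.W, self.b, self.a)
        return q + grad_potential(p, self.W, self.b, self.a), p

    def inverse(self, q, p):
        if self.update == "p":
            return q, p + grad_potential(q, self.W, self.b, self.a)
        return q - grad_potential(p, self.W, self.b, self.a), p


def apply(layers, z, inverse=False):
    """Run z = (q, p) through the layers (or undo them in reverse order)."""
    d = z.size // 2
    q, p = z[:d], z[d:]
    for layer in (reversed(layers) if inverse else layers):
        q, p = layer.inverse(q, p) if inverse else layer.forward(q, p)
    return np.concatenate([q, p])


def numerical_jacobian(f, z, eps=1e-6):
    """Central-difference Jacobian of f at z."""
    J = np.zeros((z.size, z.size))
    for j in range(z.size):
        dz = np.zeros_like(z)
        dz[j] = eps
        J[:, j] = (f(z + dz) - f(z - dz)) / (2.0 * eps)
    return J
```

Each shear has Jacobian of the form `[[I, 0], [-H, I]]` or `[[I, H], [0, I]]` with a symmetric Hessian `H`, which is exactly why the composition stays symplectic.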

arXiv:2406.19859 [pdf, other]

MetaDesigner: Advancing Artistic Typography through AI-Driven, User-Centric, and Multilingual WordArt Synthesis

Authors: Jun-Yan He, Zhi-Qi Cheng, Chenyang Li, Jingdong Sun, Qi He, Wangmeng Xiang, Hanyuan Chen, Jin-Peng Lan, Xianhui Lin, Kang Zhu, Bin Luo, Yifeng Geng, Xuansong Xie, Alexander G. Hauptmann

Abstract: MetaDesigner revolutionizes artistic typography synthesis by leveraging the strengths of Large Language Models (LLMs) to drive a design paradigm centered around user engagement. At the core of this framework lies a multi-agent system comprising the Pipeline, Glyph, and Texture agents, which collectively enable the creation of customized WordArt, ranging from semantic enhancements to the imposition of complex textures. MetaDesigner incorporates a comprehensive feedback mechanism that harnesses insights from multimodal models and user evaluations to refine and enhance the design process iteratively. Through this feedback loop, the system adeptly tunes hyperparameters to align with user-defined stylistic and thematic preferences, generating WordArt that not only meets but exceeds user expectations of visual appeal and contextual relevance. Empirical validations highlight MetaDesigner's capability to effectively serve diverse WordArt applications, consistently producing aesthetically appealing and context-sensitive results.

Submitted 4 July, 2024; v1 submitted 28 June, 2024; originally announced June 2024.

Comments: 18 pages, 16 figures, Project: https://modelscope.cn/studios/WordArt/WordArt

arXiv:2406.15879 [pdf]

Robust Ptychographic Reconstruction with an Out-of-Focus Electron Probe

Authors: Shoucong Ning, Wenhui Xu, Pengju Sheng, Leyi Loh, Stephen Pennycook, Fucai Zhang, Michel Bosman, Qian He

Abstract: As a burgeoning technique, out-of-focus electron ptychography offers the potential for rapidly imaging large fields of view (FoV) at atomic scale using a single diffraction dataset. However, achieving robust out-of-focus ptychographic reconstruction poses a significant challenge due to the inherent scan instabilities of electron microscopes, compounded by unknown aberrations in the probe-forming lens. In this study, we substantially enhance the robustness of out-of-focus ptychographic reconstruction by extending our previous calibration method (the Fourier method), which was originally developed for the in-focus scenario. This extended Fourier method surpasses existing calibration techniques by providing more reliable and accurate initialization of scan positions and electron probes. Additionally, through extensive simulations, we comprehensively explore and recommend optimized experimental parameters for robust out-of-focus ptychography, including aperture size and defocus. Lastly, we conduct a comprehensive comparison between ptychographic reconstructions obtained with focused and defocused electron probes, particularly in the context of low-dose and precise phase imaging, using our calibration method as the basis for evaluation.

Submitted 22 June, 2024; originally announced June 2024.

Comments: 22 pages, 6 figures

arXiv:2406.15005 [pdf, other]

Manipulating Spectral Windings and Skin Modes through Nonconservative Couplings

Authors: Ningxin Kong, Chenghe Yu, Yilun Xu, Matteo Fadel, Xinyao Huang, Qiongyi He

Abstract: The discovery of the non-Hermitian skin effect (NHSE) has revolutionized our understanding of wave propagation in non-Hermitian systems, highlighting unexpected localization effects beyond conventional theories. Here, we discover that NHSE, accompanied by multi-type spectral phases, can be induced by manipulating nonconservative couplings. By characterizing the spectrum through the windings of the energy bands, we demonstrate that band structures with identical, opposite, and even twisted windings can be achieved. These inequivalent types of spectra originate from the multi-channel interference resulting from the interplay between conservative and nonconservative couplings. Associated with the multi-type spectra, unipolar and bipolar NHSE with different eigenmode localizations can be observed. Additionally, our findings link the nonreciprocal transmission properties of the system to multiple spectral phases, indicating a connection with the skin modes. This work paves new pathways for investigating non-Hermitian topological effects and manipulating nonreciprocal energy flow.

Submitted 21 June, 2024; originally announced June 2024.

Comments: 6 pages, 4 figures
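
The link between spectral winding and skin modes can be illustrated with the Hatano-Nelson chain, a standard minimal model with unequal rightward and leftward hopping. It is swapped in here for illustration only; the paper's nonconservative coupling scheme is not specified in the abstract. A nonzero winding of the periodic-boundary spectrum coincides with open-boundary eigenmodes piling up at one edge (the sign of the winding depends on the Fourier convention used).

```python
import numpy as np


def bloch_spectrum(t_right, t_left, k):
    """Periodic-boundary spectrum E(k) for plane waves psi_n ~ exp(i k n)."""
    return t_right * np.exp(-1j * k) + t_left * np.exp(1j * k)


def spectral_winding(t_right, t_left, e_ref=0.0, samples=2001):
    """Winding number of E(k) - e_ref as k traverses the Brillouin zone once."""
    k = np.linspace(0.0, 2.0 * np.pi, samples)          # endpoint included
    phase = np.unwrap(np.angle(bloch_spectrum(t_right, t_left, k) - e_ref))
    return int(round((phase[-1] - phase[0]) / (2.0 * np.pi)))


def obc_center_of_mass(t_right, t_left, n_sites=20):
    """Average position of |psi|^2 over all open-boundary eigenvectors."""
    H = np.zeros((n_sites, n_sites))
    for n in range(n_sites - 1):
        H[n + 1, n] = t_right   # amplitude hopping from site n to n + 1
        H[n, n + 1] = t_left    # amplitude hopping from site n + 1 to n
    _, vecs = np.linalg.eig(H)
    weights = np.abs(vecs) ** 2
    weights /= weights.sum(axis=0)                       # normalize each eigenvector
    return float((np.arange(n_sites) @ weights).mean())
```

With `t_right = 1.0` and `t_left = 0.5` the winding is `-1` and the open-chain modes crowd toward the right edge; swapping the two hoppings flips both the winding and the edge.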

arXiv:2406.13577 [pdf, other]

Genuine Multipartite Entanglement induced by a Thermal Acoustic Reservoir

Authors: Qing-Yang Qiu, Zhi-Guang Lu, Qiongyi He, Ying Wu, Xin-You Lü

Abstract: Genuine multipartite entanglement (GME) is not only of fundamental interest for the study of the quantum-to-classical transition, but is also essential for realizing universal quantum computing and quantum networks. Here we investigate the multipartite entanglement (ME) dynamics in a linear chain of N LC resonators interacting optomechanically with a common thermal acoustic reservoir. By presenting exact analytical solutions for the system evolution, we predict the periodic generation of non-Gaussian ME, including both discrete- and continuous-variable entanglement. Interestingly, GME is obtained even though the system is in a heat bath. The mechanism relies on the special acoustic environment featuring a frequency-comb structure. More importantly, our proposed model also allows the periodic generation of entangled multipartite cat states (MCSs), i.e., a typical GHZ state, with high fidelity. This work fundamentally broadens the field of ME and has wide applications in implementing thermal-noise-resistant quantum information processing and many-body quantum simulation.

Submitted 19 June, 2024; originally announced June 2024.

Comments: 25 pages, 9 figures

arXiv:2406.13171 [pdf]

Super-resolution 3D tomography of vector near-fields in dielectric resonators

Authors: Bingbing Zhu, Qingnan Cai, Yaxin Liu, Sheng Zhang, Weifeng Liu, Qiong He, Lei Zhou, Zhensheng Tao

Abstract: All-dielectric optical resonators, which exhibit exotic near-field distributions upon excitation, have emerged as low-loss, versatile, and highly adaptable components in nanophotonic structures for manipulating electromagnetic waves and enhancing light-matter interactions. However, full three-dimensional experimental characterization of near-fields within dielectric materials remains a significant challenge. Here, we develop a novel technique using high-order sideband generation to image near-field wave patterns inside dielectric optical resonators. By exploiting the phase sensitivity of various harmonic orders, which enables the detection of near-field distributions at distinct depths, we realize three-dimensional tomographic and super-resolution near-field imaging inside a micrometer-thick silicon anapole resonator. Furthermore, our method offers high-contrast polarization sensitivity and phase-resolving capability, providing comprehensive vectorial near-field information. Our approach can potentially be applied to diverse dielectric metamaterials and could become a valuable tool for comprehensive characterization of near-field wave phenomena within dielectric materials.

Submitted 18 June, 2024; originally announced June 2024.

Comments: 26 pages, 4 figures

arXiv:2406.11789 [pdf, other]

Quantum metrology with a squeezed Kerr oscillator

Authors: Jiajie Guo, Qiongyi He, Matteo Fadel

Abstract: We study the squeezing dynamics in a Kerr-nonlinear oscillator and quantify the metrological usefulness of the resulting states. Even though the nonlinearity limits the attainable squeezing by making the evolution non-Gaussian, the resulting states still have a high quantum Fisher information for sensing displacements. However, contrary to the Gaussian case, the amplitude of the displacement cannot be estimated by simple quadrature measurements. Therefore, we propose a measurement-after-interaction protocol in which a linear quadrature measurement is preceded by an additional nonlinear evolution, and show that it yields a significant sensitivity enhancement. Our results are robust against realistic imperfections such as energy relaxation, and can be implemented in state-of-the-art experimental setups.

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.10902 [pdf, other]

Light Up the Shadows: Enhance Long-Tailed Entity Grounding with Concept-Guided Vision-Language Models

Authors: Yikai Zhang, Qianyu He, Xintao Wang, Siyu Yuan, Jiaqing Liang, Yanghua Xiao

Abstract: Multi-Modal Knowledge Graphs (MMKGs) have proven valuable for various downstream tasks. However, scaling them up is challenging because building large-scale MMKGs often introduces mismatched images (i.e., noise). Most entities in KGs belong to the long tail, meaning there are few images of them available online. This scarcity makes it difficult to determine whether a found image matches the entity. To address this, we draw on the Triangle of Reference Theory and suggest enhancing vision-language models with concept guidance. Specifically, we introduce COG, a two-stage framework with COncept-Guided vision-language models. The framework comprises a Concept Integration module, which effectively identifies image-text pairs of long-tailed entities, and an Evidence Fusion module, which offers explainability and enables human verification. To demonstrate the effectiveness of COG, we create a dataset of 25k image-text pairs of long-tailed entities. Our comprehensive experiments show that COG not only improves the accuracy of recognizing long-tailed image-text pairs compared to baselines but also offers flexibility and explainability.

Submitted 16 June, 2024; originally announced June 2024.

arXiv:2406.10715 [pdf, other]

Chip-scale generation of 60-mode continuous-variable cluster states

Authors: Ze Wang, Kangkang Li, Yue Wang, Xin Zhou, Yinke Cheng, Boxuan Jing, Fengxiao Sun, Jincheng Li, Zhilin Li, Qihuang Gong, Qiongyi He, Bei-Bei Li, Qi-Fan Yang

Abstract: Increasing the number of entangled entities is crucial for achieving exponential computational speedups and secure quantum networks. Despite recent progress in generating large-scale entanglement through continuous-variable (CV) cluster states, translating these technologies to photonic chips has been hindered by decoherence, limiting the number of entangled entities to 8. Here, we demonstrate 60-mode CV cluster states in a chip-based optical microresonator pumped by chromatic lasers. Resonantly enhanced four-wave mixing processes establish entanglement between equidistant spectral quantum modes (qumodes), forming a quantum analogue of optical frequency combs. Decoherence is minimized to achieve unprecedented two-mode raw squeezing (>3 dB) from a chip. Using bichromatic and trichromatic pump lasers, we realize one- and two-dimensional cluster states with up to 60 qumodes. Our work provides a compact and scalable platform for constructing large-scale entangled quantum resources, which are appealing for performing computational and communication tasks with quantum advantages.

Submitted 15 June, 2024; originally announced June 2024.
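
Squeezing levels such as ">3 dB" are quoted relative to vacuum noise, $S_{\mathrm{dB}} = -10\log_{10}(V/V_{\mathrm{vac}})$, so 3 dB corresponds to roughly halving the quadrature variance. A minimal conversion helper is sketched below; for two-mode squeezing, `variance` would be that of a joint quadrature combination (such as a difference of quadratures), and which combination is measured is an assumption that depends on the setup.

```python
import math


def squeezing_db(variance, vacuum_variance=1.0):
    """Squeezing in dB relative to vacuum noise (positive means below vacuum)."""
    return -10.0 * math.log10(variance / vacuum_variance)


def variance_from_db(db, vacuum_variance=1.0):
    """Quadrature variance corresponding to a given squeezing level in dB."""
    return vacuum_variance * 10.0 ** (-db / 10.0)
```

For example, a variance of half the vacuum level gives about 3.01 dB, and 3 dB corresponds to about 0.501 of the vacuum variance.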

arXiv:2406.10517 [pdf, other]

ADSNet: Cross-Domain LTV Prediction with an Adaptive Siamese Network in Advertising

Authors: Ruize Wang, Hui Xu, Ying Cheng, Qi He, Xing Zhou, Rui Feng, Wei Xu, Lei Huang, Jie Jiang

Abstract: Advertising platforms have evolved toward estimating Lifetime Value (LTV) to better align with advertisers' true performance metric. However, the sparsity of real-world LTV data presents a significant challenge to LTV prediction models (i.e., pLTV), severely limiting their capabilities. Therefore, we propose to utilize external data, in addition to the internal data of the advertising platform, to expand the size of purchase samples and enhance the platform's LTV prediction model. To tackle the issue of data distribution shift between internal and external platforms, we introduce an Adaptive Difference Siamese Network (ADSNet), which employs cross-domain transfer learning to prevent negative transfer. Specifically, ADSNet is designed to learn information that is beneficial to the target domain. We introduce a gain evaluation strategy to calculate information gain, aiding the model in learning helpful information for the target domain and providing the ability to reject noisy samples, thus avoiding negative transfer. Additionally, we design a Domain Adaptation Module as a bridge connecting different domains, reducing the distribution distance between them and enhancing the consistency of the representation space distribution. We conduct extensive offline experiments and online A/B tests on a real advertising platform. Our proposed ADSNet method outperforms other methods, improving GINI by 2%. The ablation study highlights the importance of the gain evaluation strategy in rejecting negative-gain samples and improving model performance.
Additionally, ADSNet significantly improves long-tail prediction. The online A/B tests confirm ADSNet's efficacy, increasing online LTV by 3.47% and GMV by 3.89%.

Submitted 15 June, 2024; originally announced June 2024.

Comments: Accepted to KDD 2024

arXiv:2406.09422 [pdf, other]

LooPIN: A PinFi protocol for decentralized computing

Authors: Yunwei Mao, Qi He, Ju Li

Abstract: Networked computing power is a critical utility in the era of artificial intelligence. This paper presents a novel Physical Infrastructure Finance (PinFi) protocol designed to facilitate the distribution of computing power within networks in a decentralized manner. Addressing the core challenges of coordination, pricing, and liquidity in decentralized physical infrastructure networks (DePIN), the PinFi protocol introduces a distinctive dynamic pricing mechanism. It enables providers to allocate excess computing resources to a "dissipative" PinFi liquidity pool, distinct from traditional DeFi liquidity pools, ensuring seamless access for clients at equitable, market-based prices. This approach significantly reduces the costs of accessing computing power, potentially to as low as 1% compared to existing services, while simultaneously enhancing security and dependability. The PinFi protocol is poised to transform the dynamics of supply and demand in computing power networks, setting a new standard for efficiency and accessibility.

Submitted 29 March, 2024; originally announced June 2024.

arXiv:2406.08122 [pdf]

Fully Few-shot Class-incremental Audio Classification Using Expandable Dual-embedding Extractor

Authors: Yongjie Si, Yanxiong Li, Jialong Li, Jiaxin Tan, Qianhua He

Abstract: It is typically assumed that training data are sufficient in the base session of few-shot class-incremental audio classification. However, in some practical scenarios it is difficult to collect abundant samples for model training in the base session due to the data scarcity of some classes. This paper explores a new problem of fully few-shot class-incremental audio classification, with few training samples in all sessions. Moreover, we propose a method using an expandable dual-embedding extractor to solve it. The proposed model consists of an embedding extractor and an expandable classifier. The embedding extractor consists of a pretrained Audio Spectrogram Transformer (AST) and a finetuned AST. The expandable classifier consists of prototypes, each of which represents a class. Experiments are conducted on three datasets (LS-100, NSynth-100 and FSC-89). Results show that our method exceeds seven baseline methods in average accuracy with statistical significance. Code is at: https://github.com/YongjieSi/EDE.

Submitted 12 June, 2024; originally announced June 2024.

Comments: Accepted for publication on Interspeech 2024. 5 pages, 3 figures, 5 tables
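The abstract describes the expandable classifier only as a set of prototypes, one per class. A minimal sketch of how such a classifier can grow across sessions, assuming (not stated in the abstract) that each prototype is the mean of a class's few support embeddings and that queries go to the most cosine-similar prototype; the AST embedding extractor is replaced by hand-written vectors, and `PrototypeClassifier` is a hypothetical name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


class PrototypeClassifier:
    """Hypothetical expandable classifier: one prototype vector per class.

    Assumption: a prototype is the mean of the support embeddings of its class,
    and a query is assigned to the most cosine-similar prototype.
    """

    def __init__(self):
        self.prototypes = {}  # class label -> prototype vector

    def add_class(self, label, embeddings):
        """Expand the classifier with a new class from a few support embeddings."""
        n = len(embeddings)
        self.prototypes[label] = [sum(col) / n for col in zip(*embeddings)]

    def predict(self, query):
        """Return the label whose prototype is most similar to the query embedding."""
        return max(self.prototypes, key=lambda lbl: cosine(query, self.prototypes[lbl]))


# Base session: two classes with two support embeddings each.
clf = PrototypeClassifier()
clf.add_class("dog", [[1.0, 0.0], [0.9, 0.1]])
clf.add_class("cat", [[0.0, 1.0], [0.1, 0.9]])
print(clf.predict([0.8, 0.2]))   # dog

# Incremental session: a new class is added without touching existing prototypes.
clf.add_class("bird", [[-1.0, 0.0]])
print(clf.predict([-0.9, 0.1]))  # bird
```

Because adding a class only appends a prototype, old classes are left unchanged, which is one simple way to avoid overwriting earlier sessions when new classes arrive with very few samples.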

arXiv:2406.08119 [pdf]

Low-Complexity Acoustic Scene Classification Using Parallel Attention-Convolution Network

Authors: Yanxiong Li, Jiaxin Tan, Guoqing Chen, Jialong Li, Yongjie Si, Qianhua He

Abstract: This work is an improved version of the system that we submitted to Task 1 of the DCASE2023 challenge. We propose a method of low-complexity acoustic scene classification using a parallel attention-convolution network that consists of four modules: pre-processing, fusion, and global and local contextual information extraction. The proposed network efficiently captures global and local contextual information from each audio clip. In addition, we integrate other techniques into our method, such as knowledge distillation, data augmentation, and adaptive residual normalization. When evaluated on the official dataset of the DCASE2023 challenge, our method obtains the highest accuracy of 56.10% with 5.21 K parameters and 1.44 million multiply-accumulate operations. It exceeds the top two systems of the DCASE2023 challenge in accuracy and complexity, and obtains state-of-the-art results. Code is at: https://github.com/Jessytan/Low-complexity-ASC.

Submitted 12 June, 2024; originally announced June 2024.

Comments: Accepted for publication on Interspeech 2024. 5 pages, 4 figures, 3 tables

arXiv:2406.06464 [pdf, other]

Transforming Wearable Data into Health Insights using Large Language Model Agents

Authors: Mike A. Merrill, Akshay Paruchuri, Naghmeh Rezaei, Geza Kovacs, Javier Perez, Yun Liu, Erik Schenck, Nova Hammerquist, Jake Sunshine, Shyam Tailor, Kumar Ayush, Hao-Wei Su, Qian He, Cory Y. McLean, Mark Malhotra, Shwetak Patel, Jiening Zhan, Tim Althoff, Daniel McDuff, Xin Liu

Abstract: Despite the proliferation of wearable health trackers and the importance of sleep and exercise to health, deriving actionable personalized insights from wearable data remains a challenge because doing so requires non-trivial open-ended analysis of these data. The recent rise of large language model (LLM) agents, which can use tools to reason about and interact with the world, presents a promising opportunity to enable such personalized analysis at scale. Yet, the application of LLM agents in analyzing personal health is still largely untapped. In this paper, we introduce the Personal Health Insights Agent (PHIA), an agent system that leverages state-of-the-art code generation and information retrieval tools to analyze and interpret behavioral health data from wearables. We curate two benchmark question-answering datasets of over 4000 health insights questions. Based on 650 hours of human and expert evaluation we find that PHIA can accurately address over 84% of factual numerical questions and more than 83% of crowd-sourced open-ended questions. This work has implications for advancing behavioral health across the population, potentially enabling individuals to interpret their own wearable data, and paving the way for a new era of accessible, personalized wellness regimens that are informed by data-driven insights.

Submitted 11 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

Comments: 38 pages

arXiv:2406.03262 [pdf, other]

ADer: A Comprehensive Benchmark for Multi-class Visual Anomaly Detection

Authors: Jiangning Zhang, Haoyang He, Zhenye Gan, Qingdong He, Yuxuan Cai, Zhucun Xue, Yabiao Wang, Chengjie Wang, Lei Xie, Yong Liu

Abstract: Visual anomaly detection aims to identify anomalous regions in images through unsupervised learning paradigms, with increasing application demand and value in fields such as industrial inspection and medical lesion detection. Despite significant progress in recent years, there is a lack of comprehensive benchmarks to adequately evaluate the performance of various mainstream methods across different datasets under the practical multi-class setting. The absence of standardized experimental setups can lead to potential biases in training epochs, resolution, and metric results, resulting in erroneous conclusions. This paper addresses this issue by proposing a comprehensive visual anomaly detection benchmark, ADer, which is a modular framework that is highly extensible for new methods. The benchmark includes multiple datasets from industrial and medical domains, implementing fifteen state-of-the-art methods and nine comprehensive metrics. Additionally, we have open-sourced the GPU-assisted ADEval package (https://pypi.org/project/ADEval) to address the slow evaluation of metrics such as the time-consuming mAU-PRO on large-scale data, reducing evaluation time by more than 1000-fold. Through extensive experimental results, we objectively reveal the strengths and weaknesses of different methods and provide insights into the challenges and future directions of multi-class visual anomaly detection.
We hope that ADer will become a valuable resource for researchers and practitioners in the field, promoting the development of more robust and generalizable anomaly detection systems. Full code is attached in the Appendix and open-sourced at https://github.com/zhangzjn/ader.

Submitted 6 June, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

arXiv:2406.01902 [pdf, other]

Large Time Behavior and Sharp Interface Limit of Compressible Navier-Stokes/Allen-Cahn System for Interacting Shock Waves

Authors: Yazhou Chen, Qiaolin He, Xiaoding Shi, Xiaoping Wang

Abstract: In this paper, we study the large time behavior and sharp interface limit of the Cauchy problem for the compressible Navier-Stokes/Allen-Cahn system with interacting shock waves in the same family. This system is an important mathematical model for describing the motion of immiscible two-phase flow. The results show that, if the initial density and velocity are near the superposition of two shock waves in the same family, then there exists a unique global solution to the compressible Navier-Stokes/Allen-Cahn system, and this solution asymptotically converges to the superposition of the viscous shock wave and rarefaction wave moving in opposite directions. Moreover, this global-in-time solution converges to the entropy solution of the $p$-system in the $L^\infty$-norm as the thickness of the diffusion interface tends to zero.

Submitted 3 June, 2024; originally announced June 2024.

Comments: 41 pages, 2 figures

MSC Class: 35Q35; 35B65; 76N10; 35M10; 35B40; 35C20; 76T30

arXiv:2406.01103 [pdf, other]

Advancing DRL Agents in Commercial Fighting Games: Training, Integration, and Agent-Human Alignment

Authors: Chen Zhang, Qiang He, Zhou Yuan, Elvis S. Liu, Hong Wang, Jian Zhao, Yang Wang

Abstract: Deep Reinforcement Learning (DRL) agents have demonstrated impressive success in a wide range of game genres. However, existing research primarily focuses on optimizing DRL competence rather than addressing the challenge of prolonged player interaction. In this paper, we propose a practical DRL agent system for fighting games named Shūkai, which has been successfully deployed to Naruto Mobile, a popular fighting game with over 100 million registered users. Shūkai quantifies the state to enhance generalizability, introducing Heterogeneous League Training (HELT) to achieve balanced competence, generalizability, and training efficiency. Furthermore, Shūkai implements specific rewards to align the agent's behavior with human expectations. Shūkai's ability to generalize is demonstrated by its consistent competence across all characters, even though it was trained on only 13% of them. Additionally, HELT exhibits a remarkable 22% improvement in sample efficiency. Shūkai serves as a valuable training partner for players in Naruto Mobile, enabling them to enhance their abilities and skills.

Submitted 3 June, 2024; originally announced June 2024.

Comments: Accepted at ICML 2024

arXiv:2405.20081 [pdf, other]

NoiseBoost: Alleviating Hallucination with Noise Perturbation for Multimodal Large Language Models

Authors: Kai Wu, Boyuan Jiang, Zhengkai Jiang, Qingdong He, Donghao Luo, Shengzhi Wang, Qingwen Liu, Chengjie Wang

Abstract: Multimodal large language models (MLLMs) provide a powerful mechanism for understanding visual information, building on large language models. However, MLLMs are notorious for suffering from hallucinations, especially when generating lengthy, detailed descriptions of images. Our analysis reveals that hallucinations stem from the inherent summarization mechanism of large language models, leading to excessive dependence on linguistic tokens while neglecting visual information. In this paper, we propose NoiseBoost, a broadly applicable and simple method for alleviating hallucinations in MLLMs through the integration of noise feature perturbations. Noise perturbation acts as a regularizer, facilitating a balanced distribution of attention weights between visual and linguistic tokens. Despite its simplicity, NoiseBoost consistently enhances the performance of MLLMs across common training strategies, including supervised fine-tuning and reinforcement learning. Further, NoiseBoost is the first to enable semi-supervised learning for MLLMs, unleashing the power of unlabeled data. Comprehensive experiments demonstrate that NoiseBoost improves dense caption accuracy by 8.1% under human evaluation and achieves comparable results with 50% of the data by mining unlabeled data. Code and models are available at https://kaiwu5.github.io/noiseboost.

Submitted 31 May, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

Comments: 14 pages, 5 figures with supplementary material
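The NoiseBoost abstract names noise feature perturbation as a regularizer but does not give the noise distribution, its scale, or where it is injected. A minimal sketch under stated assumptions: zero-mean Gaussian noise with a hypothetical standard deviation `sigma` is added to visual token features during training only, and features pass through unchanged at inference. `perturb_visual_tokens` is a hypothetical helper, and the sketch does not model the attention rebalancing the authors report.

```python
import random


def perturb_visual_tokens(visual_tokens, sigma=0.1, training=True, rng=None):
    """Add zero-mean Gaussian noise to visual token features (hypothetical sketch).

    visual_tokens: list of feature vectors, one list of floats per visual token.
    Noise is applied only when training; otherwise a copy of the input is returned.
    Linguistic tokens are assumed to be handled elsewhere and left unperturbed.
    """
    if not training or sigma == 0:
        return [list(tok) for tok in visual_tokens]
    rng = rng or random.Random()
    return [[x + rng.gauss(0.0, sigma) for x in tok] for tok in visual_tokens]


tokens = [[0.5, -0.2, 1.0], [0.0, 0.3, -0.7]]
noisy = perturb_visual_tokens(tokens, sigma=0.1, rng=random.Random(0))  # training pass
clean = perturb_visual_tokens(tokens, training=False)                  # inference pass
```

Seeding a dedicated `random.Random` instance keeps the perturbation reproducible without touching global random state.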

arXiv:2405.19664 [pdf, other]

Quantum Zeno Effect on Genuine Tripartite Nonlocality and Entanglement in Quantum Dissipative System

Authors: Zi-Yu Xiong, Yong-Jun Xiao, Ye-Qi Zhang, Qi-Liang He

Abstract: As a precious global resource in quantum information, genuine tripartite nonlocality (GTN) can be quantified by the violation of the Svetlichny inequality. However, there is still no analytical expression for general three-qubit states due to the difficulty of theoretical calculations. In this paper, we achieve highly accurate numerical quantification of GTN for arbitrary three-qubit quantum states. As an example, we study the dynamics of GTN and genuine tripartite entanglement (GTE) for the W state. Moreover, the complementarity of GTN is verified by examining the nonlocality between the tripartite and the bipartite. Finally, we also find a useful strategy to protect the correlation of GTN and GTE under decoherence by utilizing the Zeno effect.

Submitted 29 May, 2024; originally announced May 2024.

Comments: 7 pages, 7 figures
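To make "quantified by the violation of the Svetlichny inequality" concrete, the sketch below evaluates the Svetlichny operator S = ABC + ABC' + AB'C + A'BC − A'B'C' − A'B'C − A'BC' − AB'C' for a given three-qubit state, whose expectation satisfies |⟨S⟩| ≤ 4 for hybrid local models and can reach 4√2 in quantum mechanics. It uses the textbook GHZ case with fixed x-y-plane settings; it is not the authors' numerical procedure, which optimizes over settings for arbitrary states. Note that with x-y-plane settings alone the W state gives ⟨S⟩ = 0, since every such observable product flips all three qubits and maps no W component onto another, so general measurement directions are needed there. Function names are illustrative.

```python
import cmath
import itertools
import math


def xy_observable(phi):
    """Qubit observable cos(phi)*sigma_x + sin(phi)*sigma_y as a 2x2 complex matrix."""
    return [[0, cmath.exp(-1j * phi)], [cmath.exp(1j * phi), 0]]


def correlator(psi, A, B, C):
    """<psi| A (x) B (x) C |psi> for 8 amplitudes indexed as 4*i + 2*j + k for |ijk>."""
    total = 0j
    for i, j, k, p, q, r in itertools.product(range(2), repeat=6):
        total += (psi[4 * i + 2 * j + k].conjugate()
                  * A[i][p] * B[j][q] * C[k][r]
                  * psi[4 * p + 2 * q + r])
    return total.real


def svetlichny(psi, a, a2, b, b2, c, c2):
    """Expectation of the Svetlichny operator for x-y-plane settings (radians)."""
    A, A2 = xy_observable(a), xy_observable(a2)
    B, B2 = xy_observable(b), xy_observable(b2)
    C, C2 = xy_observable(c), xy_observable(c2)

    def E(X, Y, Z):
        return correlator(psi, X, Y, Z)

    return (E(A, B, C) + E(A, B, C2) + E(A, B2, C) + E(A2, B, C)
            - E(A2, B2, C2) - E(A2, B2, C) - E(A2, B, C2) - E(A, B2, C2))


# GHZ state (|000> + |111>)/sqrt(2); here <ABC> = cos(a + b + c).
ghz = [0j] * 8
ghz[0] = ghz[7] = 1 / math.sqrt(2) + 0j
s_ghz = svetlichny(ghz, 0, math.pi / 2, -math.pi / 4, math.pi / 4, 0, math.pi / 2)
print(s_ghz)  # approximately 5.657, i.e. 4*sqrt(2), exceeding the bound of 4
```

With these angles each of the eight correlator terms contributes √2/2, which is why the GHZ state reaches the quantum maximum 4√2.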

arXiv:2405.19633 [pdf, other]

Phase transition and multistability in Dicke dimer

Authors: Yilun Xu, Feng-Xiao Sun, Wei Zhang, Qiongyi He, Han Pu

Abstract: The exotic phase transitions and multistabilities in atom-cavity coupled systems have attracted tremendous interest recently. In this work, we investigate the effect of photon hopping between two Dicke cavities, which induces rich quantum phases for steady states and dynamical processes. Starting from a generic dimer system where the two cavities are not necessarily identical, we analytically determine all possible steady-state phases, which are confirmed by numerical calculations. We then focus on the special case of two identical cavities, where all the steady states are confirmed by exact solutions. We show that photon hopping is a convenient and powerful tool to manipulate the quantum phases and induce multistable behavior in this system.

Submitted 29 May, 2024; originally announced May 2024.

arXiv:2405.17741 [pdf, other]

LoRA-Switch: Boosting the Efficiency of Dynamic LLM Adapters via System-Algorithm Co-design

Authors: Rui Kong, Qiyang Li, Xinyu Fang, Qingtian Feng, Qingfeng He, Yazhu Dong, Weijun Wang, Yuanchun Li, Linghe Kong, Yunxin Liu

Abstract: Recent literature has found that an effective method to customize or further improve large language models (LLMs) is to add dynamic adapters, such as low-rank adapters (LoRA) with Mixture-of-Experts (MoE) structures. Though such dynamic adapters incur modest computational complexity, they surprisingly lead to huge inference latency overhead, slowing down the decoding speed by 2.5+ times. In this paper, we analyze the fine-grained costs of the dynamic adapters and find that the fragmented CUDA kernel calls are the root cause. Therefore, we propose LoRA-Switch, a system-algorithm co-designed architecture for efficient dynamic adapters. Unlike most existing dynamic structures that adopt layer-wise or block-wise dynamic routing, LoRA-Switch introduces a token-wise routing mechanism. It switches the LoRA adapters and weights for each token and merges them into the backbone for inference. For efficiency, this switching is implemented with an optimized CUDA kernel, which fuses the merging operations for all LoRA adapters at once. Based on experiments with popular open-source LLMs on common benchmarks, our approach has demonstrated similar accuracy improvement as existing dynamic adapters, while reducing the decoding latency by more than 2.4 times.

Submitted 27 May, 2024; originally announced May 2024.
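LoRA-Switch routes each token to LoRA adapters and merges them into the backbone weight. The merge itself is the standard LoRA update W' = W + s·BA; the routing rule below (top-1 by a linear score) is an assumption, since the abstract does not specify it. This plain-Python sketch shows only that merging is mathematically equivalent to adding the adapter output B(Ax) to Wx; the latency gains the authors report come from their fused CUDA kernel, which is not modeled here. All function names are hypothetical.

```python
def matmul(X, Y):
    """Plain-Python matrix product of X (m x k) and Y (k x n)."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


def matvec(W, x):
    """Matrix-vector product W @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def merge_adapter(W, B, A, scale=1.0):
    """Return W + scale * (B @ A), the standard LoRA weight merge."""
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


def forward_token(x, W, adapters, router, scale=1.0):
    """Hypothetical token-wise top-1 routing: pick one adapter, merge it, apply it.

    adapters: list of (B, A) low-rank pairs; router: one score vector per adapter.
    Returns the output vector and the index of the selected adapter.
    """
    scores = [sum(r * xi for r, xi in zip(rv, x)) for rv in router]
    expert = max(range(len(scores)), key=scores.__getitem__)
    B, A = adapters[expert]
    return matvec(merge_adapter(W, B, A, scale), x), expert


W = [[1.0, 0.0], [0.0, 1.0]]
adapters = [([[1.0], [0.0]], [[0.0, 1.0]]),   # rank-1 adapter 0
            ([[0.0], [1.0]], [[1.0, 0.0]])]   # rank-1 adapter 1
router = [[1.0, 0.0], [0.0, 1.0]]
y, e = forward_token([2.0, 1.0], W, adapters, router)  # e == 0, y == [3.0, 1.0]
```

Merging trades a small weight update per routing decision for a single dense matmul, which is why fusing all merges into one kernel call matters for decoding latency.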

arXiv:2405.17718 [pdf, other]

AdapNet: Adaptive Noise-Based Network for Low-Quality Image Retrieval

Authors: Sihe Zhang, Qingdong He, Jinlong Peng, Yuxi Li, Zhengkai Jiang, Jiafu Wu, Mingmin Chi, Yabiao Wang, Chengjie Wang

Abstract: Image retrieval aims to identify visually similar images within a database using a given query image. Traditional methods typically employ both global and local features extracted from images for matching, and may also apply re-ranking techniques to enhance accuracy. However, these methods often fail to account for the noise present in query images, which can stem from natural or human-induced factors, thereby negatively impacting retrieval performance. To mitigate this issue, we introduce a novel setting for low-quality image retrieval, and propose an Adaptive Noise-Based Network (AdapNet) to learn robust abstract representations. Specifically, we devise a quality compensation block trained to compensate for various low-quality factors in input images. In addition, we introduce an innovative adaptive noise-based loss function, which dynamically adjusts its focus on the gradient in accordance with image quality, thereby augmenting the learning of unknown noisy samples during training and enhancing intra-class compactness. To assess the performance, we construct two datasets with low-quality queries, which are built by applying various types of noise to clean query images from the standard Revisited Oxford and Revisited Paris datasets. Comprehensive experimental results illustrate that AdapNet surpasses state-of-the-art methods on the Noise Revisited Oxford and Noise Revisited Paris benchmarks, while maintaining competitive performance on high-quality datasets. The code and constructed datasets will be made available.

Submitted 27 May, 2024; originally announced May 2024.

arXiv:2405.16265 [pdf, other]

MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time

Authors: Jikun Kang, Xin Zhe Li, Xi Chen, Amirreza Kazemi, Qianyi Sun, Boxing Chen, Dong Li, Xu He, Quan He, Feng Wen, Jianye Hao, Jun Yao

Abstract: Although Large Language Models (LLMs) achieve remarkable performance across various tasks, they often struggle with complex reasoning tasks, such as answering mathematical questions. Recent efforts to address this issue have primarily focused on leveraging mathematical datasets through supervised fine-tuning or self-improvement techniques. However, these methods often depend on high-quality datasets that are difficult to prepare, or they require substantial computational resources for fine-tuning. Inspired by findings that LLMs know how to produce the right answer but struggle to select the correct reasoning path, we propose a purely inference-based searching method -- MindStar (M*). This method formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths. We evaluate the M* framework on both the GSM8K and MATH datasets, comparing its performance with existing open and closed-source LLMs. Our results demonstrate that M* significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1, but with substantially reduced model size and computational costs.

Submitted 26 June, 2024; v1 submitted 25 May, 2024; originally announced May 2024.

arXiv:2405.15580 [pdf, other]

Open-Vocabulary SAM3D: Understand Any 3D Scene

Authors: Hanchen Tai, Qingdong He, Jiangning Zhang, Yijie Qian, Zhenyu Zhang, Xiaobin Hu, Yabiao Wang, Yong Liu

Abstract: Open-vocabulary 3D scene understanding presents a significant challenge in the field. Recent advancements have sought to transfer knowledge embedded in vision language models from the 2D domain to the 3D domain. However, these approaches often require learning prior knowledge from specific 3D scene datasets, which limits their applicability in open-world scenarios. The Segment Anything Model (SAM) has demonstrated remarkable zero-shot segmentation capabilities, prompting us to investigate its potential for comprehending 3D scenes without the need for training. In this paper, we introduce OV-SAM3D, a universal framework for open-vocabulary 3D scene understanding. This framework is designed to perform understanding tasks for any 3D scene without requiring prior knowledge of the scene. Specifically, our method is composed of two key sub-modules: first, we generate superpoints as the initial 3D prompts and refine these prompts using segment masks derived from SAM; second, we integrate a specially designed overlapping score table with open tags from the Recognize Anything Model (RAM) to produce final 3D instances with open-world labels. Empirical evaluations conducted on the ScanNet200 and nuScenes datasets demonstrate that our approach surpasses existing open-vocabulary methods in unknown open-world environments.

Submitted 21 June, 2024; v1 submitted 24 May, 2024; originally announced May 2024.

Comments: Project page: https://hithqd.github.io/projects/OV-SAM3D

arXiv:2405.15214 [pdf, other]

PointRWKV: Efficient RWKV-Like Model for Hierarchical Point Cloud Learning

Authors: Qingdong He, Jiangning Zhang, Jinlong Peng, Haoyang He, Yabiao Wang, Chengjie Wang

Abstract: Transformers have revolutionized point cloud learning, but their quadratic complexity hinders extension to long sequences and burdens limited computational resources. The recent advent of RWKV, a new breed of deep sequence models, has shown immense potential for sequence modeling in NLP tasks. In this paper, we present PointRWKV, a model of linear complexity derived from the RWKV model in the NLP field, with the modifications necessary for point cloud learning tasks. Specifically, taking the embedded point patches as input, we first propose to explore the global processing capabilities within PointRWKV blocks using modified multi-headed matrix-valued states and a dynamic attention recurrence mechanism. To extract local geometric features simultaneously, we design a parallel branch to encode the point cloud efficiently in a fixed radius near-neighbors graph with a graph stabilizer. Furthermore, we design PointRWKV as a multi-scale framework for hierarchical feature learning of 3D point clouds, facilitating various downstream tasks. Extensive experiments on different point cloud learning tasks show that our proposed PointRWKV outperforms its transformer- and Mamba-based counterparts while saving about 46% FLOPs, demonstrating its potential as an option for constructing foundational 3D models.

Submitted 24 May, 2024; originally announced May 2024.
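PointRWKV's local branch encodes the point cloud in a fixed radius near-neighbors graph. A minimal sketch of building such a graph by brute force, connecting every pair of points whose Euclidean distance is at most a chosen radius; the radius value is illustrative, and the graph stabilizer and feature encoding from the abstract are not modeled. `radius_graph` and `neighbor_lists` are hypothetical helpers.

```python
import math


def radius_graph(points, radius):
    """Brute-force fixed-radius near-neighbor graph.

    Returns undirected edges (i, j) with i < j whose Euclidean distance is <= radius.
    Runs in O(N^2); large clouds would normally use a spatial index such as a k-d tree.
    """
    edges = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= radius:
                edges.append((i, j))
    return edges


def neighbor_lists(num_points, edges):
    """Convert an undirected edge list into per-point adjacency lists."""
    nbrs = [[] for _ in range(num_points)]
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    return nbrs


cloud = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (2.0, 0.0, 0.0), (2.3, 0.0, 0.0)]
edges = radius_graph(cloud, radius=1.0)
print(edges)  # [(0, 1), (2, 3)]
```

Unlike a k-nearest-neighbor graph, a fixed-radius graph lets the neighborhood size vary with local point density, so isolated points may have few or no neighbors.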

arXiv:2405.14210 [pdf, other]

Eidos: Efficient, Imperceptible Adversarial 3D Point Clouds

Authors: Hanwei Zhang, Luo Cheng, Qisong He, Wei Huang, Renjue Li, Ronan Sicre, Xiaowei Huang, Holger Hermanns, Lijun Zhang

Abstract: Classification of 3D point clouds is a challenging machine learning (ML) task with important real-world applications in a spectrum from autonomous driving and robot-assisted surgery to earth observation from low orbit. As with other ML tasks, classification models are notoriously brittle in the presence of adversarial attacks. These are rooted in imperceptible changes to inputs with the effect that a seemingly well-trained model ends up misclassifying the input. This paper adds to the understanding of adversarial attacks by presenting Eidos, a framework providing Efficient Imperceptible aDversarial attacks on 3D pOint cloudS. Eidos supports a diverse set of imperceptibility metrics. It employs an iterative, two-step procedure to identify optimal adversarial examples, thereby enabling a runtime-imperceptibility trade-off. We provide empirical evidence relative to several popular 3D point cloud classification models and several established 3D attack methods, showing Eidos' superiority with respect to efficiency as well as imperceptibility.

Submitted 23 May, 2024; originally announced May 2024.

Comments: Preprint

arXiv:2405.14165 [pdf, other]

Spatial topological insulator

Authors: Qinghua He, Wenlong Gao, Feng Liu

Abstract: Traditional topological insulators often rely on band inversions driven by nonuniform hopping textures and spin-orbit coupling, as exemplified in the Su-Schrieffer-Heeger and Kane-Mele models. We present a novel approach utilizing the spatial nature of sublattice symmetry to induce nontrivial topological insulating properties characterized by second-order corner states without band inversion. To substantiate our proposal, we design a photonic crystal with non-primitive translational symmetry, demonstrating unique directional waveguide edge modes and localized corner modes.

Submitted 23 May, 2024; originally announced May 2024.

Comments: 3 pages, 3 figures

arXiv:2405.13902 [pdf, other]

LOGIN: A Large Language Model Consulted Graph Neural Network Training Framework

Authors: Yiran Qiao, Xiang Ao, Yang Liu, Jiarong Xu, Xiaoqian Sun, Qing He

Abstract: Recent prevailing works on graph machine learning typically follow a similar methodology that involves designing advanced variants of graph neural networks (GNNs) to maintain the superior performance of GNNs on different graphs. In this paper, we aim to streamline the GNN design process and leverage the advantages of Large Language Models (LLMs) to improve the performance of GNNs on downstream tasks. We formulate a new paradigm, coined "LLMs-as-Consultants," which integrates LLMs with GNNs in an interactive manner. A framework named LOGIN (LLM Consulted GNN training) is instantiated, empowering the interactive utilization of LLMs within the GNN training process. First, we attentively craft concise prompts for spotted nodes, carrying comprehensive semantic and topological information, and serving as input to LLMs. Second, we refine GNNs by devising a complementary coping mechanism that utilizes the responses from LLMs, depending on their correctness. We empirically evaluate the effectiveness of LOGIN on node classification tasks across both homophilic and heterophilic graphs. The results illustrate that even basic GNN architectures, when employed within the proposed LLMs-as-Consultants paradigm, can achieve comparable performance to advanced GNNs with intricate designs. Our codes are available at https://github.com/QiaoYRan/LOGIN.

Submitted 6 June, 2024; v1 submitted 22 May, 2024; originally announced May 2024.

arXiv:2405.12490 [pdf, other]

Customize Your Own Paired Data via Few-shot Way

Authors: Jinshu Chen, Bingchuan Li, Miao Hua, Panpan Xu, Qian He

Abstract: Existing solutions to image editing tasks suffer from several issues. Though achieving remarkably satisfying generated results, some supervised methods require huge amounts of paired training data, which greatly limits their usage. Unsupervised methods, on the other hand, take full advantage of large-scale pre-trained priors, and are thus strictly restricted to the domains on which the priors were trained, behaving badly in out-of-distribution cases. The task we focus on is how to enable users to customize their desired effects through only a few image pairs. In our proposed framework, a novel few-shot learning mechanism based on the directional transformations among samples is introduced and expands the learnable space exponentially. Adopting a diffusion model pipeline, we redesign the condition calculating modules in our model and apply several technical improvements. Experimental results demonstrate the capabilities of our method in various cases.

Submitted 21 May, 2024; originally announced May 2024.

Comments: Accepted by AI4CC CVPR2024 Workshop

arXiv:2405.04828 [pdf, other]

ChuXin: 1.6B Technical Report

Authors: Xiaomin Zhuang, Yufan Jiang, Qiaozhi He, Zhihua Wu

Abstract: In this report, we present ChuXin, an entirely open-source language model with a size of 1.6 billion parameters. Unlike the majority of works that only open-sourced the model weights and architecture, we have made everything needed to train a model available, including the training data, the training process, and the evaluation code. Our goal is to empower and strengthen the open research community, fostering transparency and enabling a new wave of innovation in the field of language modeling. Furthermore, we extend the context length to 1M tokens through lightweight continual pretraining and demonstrate strong needle-in-a-haystack retrieval performance. The weights for both models are available at Hugging Face to download and use.

Submitted 8 May, 2024; originally announced May 2024.

Comments: Technical Report

arXiv:2405.04451 [pdf, ps, other]

Analyticity for classical hard-core gases via recursion

Authors: Qidong He

Abstract: In the recent work of [Michelen, Perkins, Comm. Math. Phys. 399:1 (2023)], a new lower bound of $eC_φ(β)^{-1}$ is obtained for the positive activity up to which the pressure of a classical system of particles with repulsive pair interactions is analytic. In this paper, we extend their method to the class of radially symmetric, locally stable, and tempered pair potentials. Our main result is that the pressure of such systems is analytic for positive activities up to $e^{2(1-W(eA_φ(β)/C_φ(β)))}C_φ(β)^{-1}e^{-(βC+1)}$, where $C>0$ is the local stability constant, $W(\cdot)$ the Lambert $W$-function, and $A_φ(β)$ the contribution from the attractive part of the pair potential to the temperedness constant. In the high-temperature limit, our result improves the classical Penrose-Ruelle bound of $C_φ(β)^{-1}e^{-(βC+1)}$ by a factor of $e^{2}$. This proves the absence of phase transitions in these systems in the Lee-Yang sense for activities up to the above threshold.

Submitted 7 May, 2024; originally announced May 2024.

Comments: 19 pages

arXiv:2405.03349 [pdf, other]

Retinexmamba: Retinex-based Mamba for Low-light Image Enhancement

Authors: Jiesong Bai, Yuhao Yin, Qiyuan He, Yuanxian Li, Xiaofeng Zhang

Abstract: In the field of low-light image enhancement, both traditional Retinex methods and advanced deep learning techniques such as Retinexformer have shown distinct advantages and limitations. Traditional Retinex methods, designed to mimic the human eye's perception of brightness and color, decompose images into illumination and reflection components but struggle with noise management and detail preservation under low light conditions. Retinexformer enhances illumination estimation through traditional self-attention mechanisms, but faces challenges with insufficient interpretability and suboptimal enhancement effects. To overcome these limitations, this paper introduces the RetinexMamba architecture. RetinexMamba not only captures the physical intuitiveness of traditional Retinex methods but also integrates the deep learning framework of Retinexformer, leveraging the computational efficiency of State Space Models (SSMs) to enhance processing speed. This architecture features innovative illumination estimators and damage restorer mechanisms that maintain image quality during enhancement. Moreover, RetinexMamba replaces the IG-MSA (Illumination-Guided Multi-Head Attention) in Retinexformer with a Fused-Attention mechanism, improving the model's interpretability. Experimental evaluations on the LOL dataset show that RetinexMamba outperforms existing deep learning approaches based on Retinex theory in both quantitative and qualitative metrics, confirming its effectiveness and superiority in enhancing low-light images.

Submitted 19 May, 2024; v1 submitted 6 May, 2024; originally announced May 2024.

arXiv:2405.03261 [pdf, other]

A nonlinear criterion for characterizing high-dimensional multipartite entanglement

Authors: Shuheng Liu, Qiongyi He, Marcus Huber, Giuseppe Vitagliano

Abstract: Understanding entanglement of potentially high-dimensional multipartite quantum systems is crucial across different disciplines in quantum sciences. We take inspiration from covariance-matrix-based techniques to derive a nonlinear criterion that can be used to lower bound the dimensionality vector of mixed quantum states, revealing both the level of multipartiteness and the dimensionality of the entanglement in the quantum states. The technique is based on a system of inequalities that has to be satisfied by all quantum states with a given entanglement dimensionality vector, which can be checked via linear programming. We test our condition on paradigmatic classes of high-dimensional multipartite entangled states like imperfect Greenberger-Horne-Zeilinger (GHZ) states and find that, in comparison with other available criteria, our method provides a significant advantage, which is enhanced especially in the case that the dimensions of the individual particles are different from each other.

Submitted 6 May, 2024; originally announced May 2024.

arXiv:2405.02593 [pdf]

An Interdisciplinary Perspective of the Built-Environment Microbiome

Authors: John S. McAlister, Michael J. Blum, Yana Bromberg, Nina H. Fefferman, Qiang He, Eric Lofgren, Debra L. Miller, Courtney Schreiner, K. Selcuk Candan, Heather Szabo-Rogers, J. Michael Reed

Abstract: The built environment provides an excellent setting for interdisciplinary research on the dynamics of microbial communities. The system is simplified compared to many natural settings, and to some extent the entire environment can be manipulated, from architectural design, to materials use, air flow, human traffic, and capacity to disrupt microbial communities through cleaning. Here we provide an overview of the ecology of the microbiome in the built environment. We address niche space and refugia, population and community (metagenomic) dynamics, spatial ecology within a building, including the major microbial transmission mechanisms, as well as evolution. We also address the landscape ecology connecting microbiomes between physically separated buildings. At each stage we pay particular attention to the actual and potential interface between disciplines, such as ecology, epidemiology, materials science, and human social behavior. We end by identifying some opportunities for future interdisciplinary research on the microbiome of the built environment.

Submitted 4 May, 2024; originally announced May 2024.

Comments: 23 pages

arXiv:2405.00236 [pdf, other]

STT: Stateful Tracking with Transformers for Autonomous Driving

Authors: Longlong Jing, Ruichi Yu, Xu Chen, Zhengli Zhao, Shiwei Sheng, Colin Graber, Qi Chen, Qinru Li, Shangxuan Wu, Han Deng, Sangjin Lee, Chris Sweeney, Qiurui He, Wei-Chih Hung, Tong He, Xingyi Zhou, Farshid Moussavi, Zijian Guo, Yin Zhou, Mingxing Tan, Weilong Yang, Congcong Li

Abstract: Tracking objects in three-dimensional space is critical for autonomous driving. To ensure safety while driving, the tracker must be able to reliably track objects across frames and accurately estimate their states such as velocity and acceleration in the present. Existing works frequently focus on the association task while either neglecting the model performance on state estimation or deploying complex heuristics to predict the states. In this paper, we propose STT, a Stateful Tracking model built with Transformers, that can consistently track objects in the scenes while also predicting their states accurately. STT consumes rich appearance, geometry, and motion signals through long term history of detections and is jointly optimized for both data association and state estimation tasks. Since the standard tracking metrics like MOTA and MOTP do not capture the combined performance of the two tasks in the wider spectrum of object states, we extend them with new metrics called S-MOTA and MOTPS that address this limitation. STT achieves competitive real-time performance on the Waymo Open Dataset.

Submitted 30 April, 2024; originally announced May 2024.

Comments: ICRA 2024
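The abstract builds its new metrics on the standard CLEAR MOT scores MOTA and MOTP. The sketch below shows only those two baseline formulas, not S-MOTA or MOTPS, whose definitions are not given here. MOTA = 1 - (ΣFN + ΣFP + ΣIDSW) / ΣGT over all frames. MOTP is the mean localization distance over matched pairs; some benchmarks use IoU instead, so treat the distance form as an assumption. Function names and the toy counts are illustrative.

```python
def mota(gt_per_frame, fn_per_frame, fp_per_frame, idsw_per_frame):
    """CLEAR MOT accuracy: 1 - (sum FN + sum FP + sum ID switches) / sum GT."""
    total_gt = sum(gt_per_frame)
    if total_gt == 0:
        raise ValueError("MOTA is undefined without ground-truth objects")
    errors = sum(fn_per_frame) + sum(fp_per_frame) + sum(idsw_per_frame)
    return 1.0 - errors / total_gt

def motp(match_distances):
    """CLEAR MOT precision: mean localization error over all matched pairs."""
    if not match_distances:
        raise ValueError("MOTP is undefined without matches")
    return sum(match_distances) / len(match_distances)

# Two toy frames with 10 ground-truth objects each: 1 FN, 2 FP, 1 ID switch in total.
print(mota([10, 10], [1, 0], [0, 2], [0, 1]))  # 1 - 4/20
print(motp([0.5, 1.0, 1.5]))                   # mean of three matched distances
```

Neither score looks at estimated velocity or acceleration. This is the gap the abstract says S-MOTA and MOTPS are meant to close.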

arXiv:2404.18057 [pdf, other]

Efficient LLM Inference with Kcache

Authors: Qiaozhi He, Zhihua Wu

Abstract: Large Language Models (LLMs) have had a profound impact on AI applications, particularly in the domains of long-text comprehension and generation. KV Cache technology is one of the most widely used techniques in the industry. It ensures efficient sequence generation by caching previously computed KV states. However, it also introduces significant memory overhead. We discovered that KV Cache is not necessary and proposed a novel KCache technique to alleviate the memory bottleneck issue during LLM inference. KCache can be used directly for inference without any training process. Our evaluations show that KCache improves the throughput of popular LLMs by 40% compared with the baseline, while keeping accuracy.

Submitted 27 April, 2024; originally announced April 2024.

Comments: Technical Report, 8 pages
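To make the baseline concrete, here is a minimal single-head sketch of a standard KV cache, the technique the abstract says KCache improves on. It is not KCache itself. During autoregressive decoding, only the new token's key and value are projected at each step. Earlier K/V rows are reused rather than recomputed, so the cache grows linearly with sequence length; that growth is the memory overhead the abstract mentions. The toy dimensions, random weights, and the class name `KVCache` are hypothetical.

```python
import numpy as np

def attention(q, K, V):
    """Scaled dot-product attention for a single query vector q over rows of K, V."""
    scores = K @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ V

class KVCache:
    """Toy single-head cache holding key/value rows for every token seen so far."""
    def __init__(self, d):
        self.K = np.empty((0, d))
        self.V = np.empty((0, d))

    def step(self, x, Wq, Wk, Wv):
        # Project only the newest token; cached rows are appended to, never recomputed.
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        self.K = np.vstack([self.K, k])
        self.V = np.vstack([self.V, v])
        return attention(q, self.K, self.V)

def full_recompute(X, Wq, Wk, Wv):
    """Reference without caching: rebuild K and V for the whole prefix each step."""
    return attention(X[-1] @ Wq, X @ Wk, X @ Wv)

rng = np.random.default_rng(0)
d, T = 8, 5
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
X = rng.standard_normal((T, d))

cache = KVCache(d)
outs = [cache.step(X[t], Wq, Wk, Wv) for t in range(T)]
print(cache.K.shape)  # cache size grows with the number of decoded tokens
```

The cached and uncached paths produce the same attention outputs. The trade is roughly T×d extra stored floats per head and per K/V, in exchange for skipping the prefix projections at every step.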

arXiv:2404.16353 [pdf, other]

Rigorous derivation of a Hele-Shaw type model and its non-symmetric traveling wave solution

Authors: Yu Feng, Qingyou He, Jian-Guo Liu, Zhennan Zhou

Abstract: In this paper, we consider a Hele-Shaw model that describes tumor growth subject to nutrient supply. This model was recently studied in \cite{feng2022tumor} via asymptotic analysis. Our contributions are twofold: Firstly, we provide a rigorous derivation of this Hele-Shaw model by taking the incompressible limit of the porous medium reaction-diffusion equation, which solidifies the mathematical foundations of the model. Secondly, from a bifurcation theory perspective, we prove the existence of non-symmetric traveling wave solutions to the model, which reflect the intrinsic boundary instability in tumor growth dynamics.

Submitted 25 April, 2024; originally announced April 2024.

Comments: 23 pages, 2 figures

MSC Class: 35R35; 76D27; 92C10; 70K50

arXiv:2404.16022 [pdf, other]

PuLID: Pure and Lightning ID Customization via Contrastive Alignment

Authors: Zinan Guo, Yanze Wu, Zhuowei Chen, Lang Chen, Qian He

Abstract: We propose Pure and Lightning ID customization (PuLID), a novel tuning-free ID customization method for text-to-image generation. By incorporating a Lightning T2I branch with a standard diffusion one, PuLID introduces both contrastive alignment loss and accurate ID loss, minimizing disruption to the original model and ensuring high ID fidelity. Experiments show that PuLID achieves superior performance in both ID fidelity and editability. Another attractive property of PuLID is that the image elements (e.g., background, lighting, composition, and style) before and after the ID insertion are kept as consistent as possible. Codes and models will be available at https://github.com/ToTheBeginning/PuLID

Submitted 24 April, 2024; originally announced April 2024.

Comments: Tech Report. Codes and models will be available at https://github.com/ToTheBeginning/PuLID

arXiv:2404.15846 [pdf, other]

From Complex to Simple: Enhancing Multi-Constraint Complex Instruction Following Ability of Large Language Models

Authors: Qianyu He, Jie Zeng, Qianxi He, Jiaqing Liang, Yanghua Xiao

Abstract: It is imperative for Large language models (LLMs) to follow instructions with elaborate requirements (i.e. Complex Instructions Following). Yet, it remains under-explored how to enhance the ability of LLMs to follow complex instructions with multiple constraints. To bridge the gap, we initially study what training data is effective in enhancing complex constraints following abilities. We found that training LLMs with instructions containing multiple constraints enhances their understanding of complex instructions, especially those with lower complexity levels. The improvement can even generalize to compositions of out-of-domain constraints. Additionally, we further propose methods addressing how to obtain and utilize the effective training data. Finally, we conduct extensive experiments to prove the effectiveness of our methods in terms of overall performance and training efficiency. We also demonstrate that our methods improve models' ability to follow instructions generally and generalize effectively across out-of-domain, in-domain, and adversarial settings, while maintaining general capabilities.

Submitted 18 June, 2024; v1 submitted 24 April, 2024; originally announced April 2024.

arXiv:2404.14705 [pdf, other]

Think-Program-reCtify: 3D Situated Reasoning with Large Language Models

Authors: Qingrong He, Kejun Lin, Shizhe Chen, Anwen Hu, Qin Jin

Abstract: This work addresses the 3D situated reasoning task, which aims to answer questions given egocentric observations in a 3D environment. The task remains challenging as it requires comprehensive 3D perception and complex reasoning skills. End-to-end models trained on supervised data for 3D situated reasoning suffer from data scarcity and limited generalization ability. Inspired by the recent success of leveraging large language models (LLMs) for visual reasoning, we propose LLM-TPC, a novel framework that leverages the planning, tool usage, and reflection capabilities of LLMs through a Think-Program-reCtify loop. The Think phase first decomposes the compositional question into a sequence of steps, and then the Program phase grounds each step to a piece of code and calls carefully designed 3D visual perception modules. Finally, the Rectify phase adjusts the plan and code if the program fails to execute. Experiments and analysis on the SQA3D benchmark demonstrate the effectiveness, interpretability and robustness of our method. Our code is publicly available at https://qingrongh.github.io/LLM-TPC/.

Submitted 22 April, 2024; originally announced April 2024.

arXiv:2404.12754 [pdf, other]

Adaptive Regularization of Representation Rank as an Implicit Constraint of Bellman Equation

Authors: Qiang He, Tianyi Zhou, Meng Fang, Setareh Maghsudi

Abstract: Representation rank is an important concept for understanding the role of Neural Networks (NNs) in Deep Reinforcement Learning (DRL), which measures the expressive capacity of value networks. Existing studies focus on unboundedly maximizing this rank; nevertheless, that approach would introduce overly complex models in the learning, thus undermining performance. Hence, fine-tuning representation rank presents a challenging and crucial optimization problem. To address this issue, we find a guiding principle for adaptive control of the representation rank. We employ the Bellman equation as a theoretical foundation and derive an upper bound on the cosine similarity of the value-network representations of consecutive state-action pairs. We then leverage this upper bound to propose a novel regularizer, namely BEllman Equation-based automatic rank Regularizer (BEER). This regularizer adaptively regularizes the representation rank, thus improving the DRL agent's performance. We first validate the effectiveness of automatic control of rank on illustrative experiments. Then, we scale up BEER to complex continuous control tasks by combining it with the deterministic policy gradient method. Among 12 challenging DeepMind control tasks, BEER outperforms the baselines by a large margin. Besides, BEER demonstrates significant advantages in Q-value approximation. Our code is available at https://github.com/sweetice/BEER-ICLR2024.

Submitted 19 April, 2024; originally announced April 2024.

Comments: Accepted to ICLR 2024; Code: https://github.com/sweetice/BEER-ICLR2024
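The abstract's key quantity is the cosine similarity between value-network representations of consecutive state-action pairs, φ(s, a) and φ(s′, a′). The sketch below only measures that similarity and a numerical representation rank for a batch. It does not reproduce the paper's Bellman-derived upper bound or the BEER regularizer, whose exact form is not stated here. Function names, the tolerance, and the toy feature vectors are hypothetical. The toy example illustrates the link the abstract relies on: representations that collapse onto one direction have similarity near 1 and low rank.

```python
import numpy as np

def consecutive_cosine_similarity(phi, phi_next, eps=1e-12):
    """Mean row-wise cosine similarity between phi(s, a) and phi(s', a')."""
    dots = np.sum(phi * phi_next, axis=1)
    norms = np.linalg.norm(phi, axis=1) * np.linalg.norm(phi_next, axis=1)
    return float(np.mean(dots / np.maximum(norms, eps)))

def representation_rank(phi, tol=1e-6):
    """Numerical rank of a batch of representations (rows are samples)."""
    return int(np.linalg.matrix_rank(phi, tol=tol))

phi = np.array([[1.0, 0.0], [0.0, 1.0]])       # hypothetical phi(s, a) batch
phi_next = np.array([[1.0, 0.0], [1.0, 0.0]])  # hypothetical phi(s', a') batch
print(consecutive_cosine_similarity(phi, phi_next))
print(representation_rank(np.vstack([phi, phi_next])))
# Fully collapsed representations: similarity 1, rank 1.
print(consecutive_cosine_similarity(phi_next, phi_next))
print(representation_rank(np.vstack([phi_next, phi_next])))
```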

arXiv:2404.11326 [pdf, other]

Single-temporal Supervised Remote Change Detection for Domain Generalization

Authors: Qiangang Du, Jinlong Peng, Xu Chen, Qingdong He, Liren He, Qiang Nie, Wenbing Zhu, Mingmin Chi, Yabiao Wang, Chengjie Wang

Abstract: Change detection is widely applied in remote sensing image analysis. Existing methods require training models separately for each dataset, which leads to poor domain generalization. Moreover, these methods rely heavily on large amounts of high-quality pair-labelled data for training, which is expensive and impractical. In this paper, we propose a multimodal contrastive learning method (ChangeCLIP) based on visual-language pre-training for change detection domain generalization. Additionally, we propose a dynamic context optimization for prompt learning. Meanwhile, to address the data dependency issue of existing methods, we introduce a single-temporal and controllable AI-generated training strategy (SAIN). This allows us to train the model using a large number of single-temporal images without image pairs in the real world, achieving excellent generalization. Extensive experiments on a series of real change detection datasets validate the superiority and strong generalization of ChangeCLIP, outperforming state-of-the-art change detection methods. Code will be available.

Submitted 23 April, 2024; v1 submitted 17 April, 2024; originally announced April 2024.

Showing 1–50 of 752 results for author: He, Q