subscribe to arXiv mailings

ThinkGrasp: A Vision-Language System for Strategic Part Grasping in Clutter

Authors: Yaoyao Qian, Xupeng Zhu, Ondrej Biza, Shuo Jiang, Linfeng Zhao, Haojie Huang, Yu Qi, Robert Platt

Abstract: Robotic grasping in cluttered environments remains a significant challenge due to occlusions and complex object arrangements. We have developed ThinkGrasp, a plug-and-play vision-language grasping system that makes use of GPT-4o's advanced contextual reasoning for heavy clutter environment grasping strategies. ThinkGrasp can effectively identify and generate grasp poses for target objects, even wh… ▽ More Robotic grasping in cluttered environments remains a significant challenge due to occlusions and complex object arrangements. We have developed ThinkGrasp, a plug-and-play vision-language grasping system that makes use of GPT-4o's advanced contextual reasoning for heavy clutter environment grasping strategies. ThinkGrasp can effectively identify and generate grasp poses for target objects, even when they are heavily obstructed or nearly invisible, by using goal-oriented language to guide the removal of obstructing objects. This approach progressively uncovers the target object and ultimately grasps it with a few steps and a high success rate. In both simulated and real experiments, ThinkGrasp achieved a high success rate and significantly outperformed state-of-the-art methods in heavily cluttered environments or with diverse unseen objects, demonstrating strong generalization capabilities. △ Less

Submitted 15 July, 2024; originally announced July 2024.

Comments: Project Website:(https://h-freax.github.io/thinkgrasp_page/)

arXiv:2407.09868 [pdf]

Separation of Sodium Signals Between Mono- and Bi-Exponential T2 Decays via Multi-TE Single-Quantum Sodium (23Na) MRI

Authors: Yongxian Qian, Ying-Chia Lin, Xingye Chen, Tiejun Zhao, Karthik Lakshmanan, Yulin Ge, Yvonne W. Lui, Fernando E. Boada

Abstract: Purpose. It is a long standing pursuit in sodium (23Na) MRI to separate signals between mono and bi exponential T2 decays in the human brain, due to lack of clinically translational solutions under the restriction of intrinsically low signal to noise ratio (SNR). Here we propose a new technique called multi TE single quantum (MSQ) sodium MRI to address the challenge. Methods. We exploit an intrins… ▽ More Purpose. It is a long standing pursuit in sodium (23Na) MRI to separate signals between mono and bi exponential T2 decays in the human brain, due to lack of clinically translational solutions under the restriction of intrinsically low signal to noise ratio (SNR). Here we propose a new technique called multi TE single quantum (MSQ) sodium MRI to address the challenge. Methods. We exploit an intrinsic difference in T2 decay between mono and bi exponential sodium signals by acquiring SQ images at multiple TEs and performing voxel based matrix inversions on these SQ images. The MSQ method was then investigated on numerical models, agar phantoms, and human brains for the feasibility on clinical scanners at 3T. Results. The whole brain T2* spectrum of FID signals from the study subjects showed sparse peaks (2 to 4 peaks), suggesting a global set of T2* values (T2*fr, T2*bs, T2*bl) applicable to the separation. The simulations indicated a small impact (3.9 to 5.6 percent) of T2* variation on accuracy of the separation, and the phantom experiments showed a high accuracy of the separation, 95.8 percent for mono T2 sodium and 72.5 to 80.4 percent for biT2 sodium. The human studies demonstrated feasibility of the separation and potentials of highlighting abnormal brain regions in the biT2 sodium images. Conclusion. The MSQ technique has been shown, via the numerical simulations, phantom experiments, and human brain studies, to be able to separate mono and bi T2 sodium signals using a two TE sampling scheme and a global set of T2* values. However, MSQ has limitations and requires cautions in practice. Keywords. sodium MRI, single quantum MRI, triple quantum MRI, neuroimaging, neurodegeneration △ Less

Submitted 13 July, 2024; originally announced July 2024.

Comments: 37 pages and 14 figures

arXiv:2407.09613 [pdf, other]

Shadows Wreak Havocs in Transition Disks

Authors: Yansong Qian, Yanqin Wu

Abstract: We demonstrate that shadows cast on a proto-planetary disk can drive it eccentric. Stellar irradiation dominates heating across much of these disks, so an uneven illumination can have interesting dynamical effects. Here, we focus on transition disks. We carry out 3D Athena++ simulations, using a constant thermal relaxation time to describe the disk's response to changing stellar illumination. We f… ▽ More We demonstrate that shadows cast on a proto-planetary disk can drive it eccentric. Stellar irradiation dominates heating across much of these disks, so an uneven illumination can have interesting dynamical effects. Here, we focus on transition disks. We carry out 3D Athena++ simulations, using a constant thermal relaxation time to describe the disk's response to changing stellar illumination. We find that an asymmetric shadow, a feature commonly observed in real disks, perturbs the radial pressure gradient and distorts the fluid streamlines into a set of twisted ellipses. Interactions between these streamlines have a range of consequences. For a narrow ring, an asymmetric shadow can sharply truncate its inner edge, possibly explaining the steep density drop-offs observed in some disks and obviating the need for massive perturbers. For a wide ring, such a shadow can dismantle it into two (or possibly more) eccentric rings. These rings continuously exert torque on each other and drive gas accretion at a healthy rate, even in the absence of disk viscosity. Signatures of such twisted eccentric rings may have already been observed as, e.g., twisted velocity maps inside gas cavities. We advocate for more targeted observations, and for a better understanding on the origin of such shadows. △ Less

Submitted 12 July, 2024; originally announced July 2024.

Comments: submitted to AAS journal; 11 pages, 9 figures; feedbacks welcome

arXiv:2407.07844 [pdf, other]

OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion

Authors: Hao Wang, Pengzhen Ren, Zequn Jie, Xiao Dong, Chengjian Feng, Yinlong Qian, Lin Ma, Dongmei Jiang, Yaowei Wang, Xiangyuan Lan, Xiaodan Liang

Abstract: Open-vocabulary detection is a challenging task due to the requirement of detecting objects based on class names, including those not encountered during training. Existing methods have shown strong zero-shot detection capabilities through pre-training on diverse large-scale datasets. However, these approaches still face two primary challenges: (i) how to universally integrate diverse data sources… ▽ More Open-vocabulary detection is a challenging task due to the requirement of detecting objects based on class names, including those not encountered during training. Existing methods have shown strong zero-shot detection capabilities through pre-training on diverse large-scale datasets. However, these approaches still face two primary challenges: (i) how to universally integrate diverse data sources for end-to-end training, and (ii) how to effectively leverage the language-aware capability for region-level cross-modality understanding. To address these challenges, we propose a novel unified open-vocabulary detection method called OV-DINO, which pre-trains on diverse large-scale datasets with language-aware selective fusion in a unified framework. Specifically, we introduce a Unified Data Integration (UniDI) pipeline to enable end-to-end training and eliminate noise from pseudo-label generation by unifying different data sources into detection-centric data. In addition, we propose a Language-Aware Selective Fusion (LASF) module to enable the language-aware ability of the model through a language-aware query selection and fusion process. We evaluate the performance of the proposed OV-DINO on popular open-vocabulary detection benchmark datasets, achieving state-of-the-art results with an AP of 50.6\% on the COCO dataset and 40.0\% on the LVIS dataset in a zero-shot manner, demonstrating its strong generalization ability. Furthermore, the fine-tuned OV-DINO on COCO achieves 58.4\% AP, outperforming many existing methods with the same backbone. The code for OV-DINO will be available at \href{https://github.com/wanghao9610/OV-DINO}{https://github.com/wanghao9610/OV-DINO}. △ Less

Submitted 10 July, 2024; originally announced July 2024.

Comments: Technical Report

arXiv:2407.02477 [pdf, other]

Understanding Alignment in Multimodal LLMs: A Comprehensive Study

Authors: Elmira Amirloo, Jean-Philippe Fauconnier, Christoph Roesmann, Christian Kerl, Rinu Boney, Yusu Qian, Zirui Wang, Afshin Dehghan, Yinfei Yang, Zhe Gan, Peter Grasch

Abstract: Preference alignment has become a crucial component in enhancing the performance of Large Language Models (LLMs), yet its impact in Multimodal Large Language Models (MLLMs) remains comparatively underexplored. Similar to language models, MLLMs for image understanding tasks encounter challenges like hallucination. In MLLMs, hallucination can occur not only by stating incorrect facts but also by pro… ▽ More Preference alignment has become a crucial component in enhancing the performance of Large Language Models (LLMs), yet its impact in Multimodal Large Language Models (MLLMs) remains comparatively underexplored. Similar to language models, MLLMs for image understanding tasks encounter challenges like hallucination. In MLLMs, hallucination can occur not only by stating incorrect facts but also by producing responses that are inconsistent with the image content. A primary objective of alignment for MLLMs is to encourage these models to align responses more closely with image information. Recently, multiple works have introduced preference datasets for MLLMs and examined different alignment methods, including Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO). However, due to variations in datasets, base model types, and alignment methods, it remains unclear which specific elements contribute most significantly to the reported improvements in these works. In this paper, we independently analyze each aspect of preference alignment in MLLMs. We start by categorizing the alignment algorithms into two groups, offline (such as DPO), and online (such as online-DPO), and show that combining offline and online methods can improve the performance of the model in certain scenarios. We review a variety of published multimodal preference datasets and discuss how the details of their construction impact model performance. Based on these insights, we introduce a novel way of creating multimodal preference data called Bias-Driven Hallucination Sampling (BDHS) that needs neither additional annotation nor external models, and show that it can achieve competitive performance to previously published alignment work for multimodal models across a range of benchmarks. △ Less

Submitted 2 July, 2024; originally announced July 2024.

arXiv:2407.01509 [pdf, other]

MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs

Authors: Yusu Qian, Hanrong Ye, Jean-Philippe Fauconnier, Peter Grasch, Yinfei Yang, Zhe Gan

Abstract: We introduce MIA-Bench, a new benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to strictly adhere to complex instructions. Our benchmark comprises a diverse set of 400 image-prompt pairs, each crafted to challenge the models' compliance with layered instructions in generating accurate responses that satisfy specific requested patterns. Evaluation results fro… ▽ More We introduce MIA-Bench, a new benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to strictly adhere to complex instructions. Our benchmark comprises a diverse set of 400 image-prompt pairs, each crafted to challenge the models' compliance with layered instructions in generating accurate responses that satisfy specific requested patterns. Evaluation results from a wide array of state-of-the-art MLLMs reveal significant variations in performance, highlighting areas for improvement in instruction fidelity. Additionally, we create extra training data and explore supervised fine-tuning to enhance the models' ability to strictly follow instructions without compromising performance on other tasks. We hope this benchmark not only serves as a tool for measuring MLLM adherence to instructions, but also guides future developments in MLLM training methods. △ Less

Submitted 3 July, 2024; v1 submitted 1 July, 2024; originally announced July 2024.

arXiv:2407.00707 [pdf, other]

Deep learning quantum Monte Carlo for solids

Authors: Yubing Qian, Xiang Li, Zhe Li, Weiluo Ren, Ji Chen

Abstract: Deep learning has deeply changed the paradigms of many research fields. At the heart of chemical and physical sciences is the accurate ab initio calculation of many-body wavefunction, which has become one of the most notable examples to demonstrate the power of deep learning in science. In particular, the introduction of deep learning into quantum Monte Carlo (QMC) has significantly advanced the f… ▽ More Deep learning has deeply changed the paradigms of many research fields. At the heart of chemical and physical sciences is the accurate ab initio calculation of many-body wavefunction, which has become one of the most notable examples to demonstrate the power of deep learning in science. In particular, the introduction of deep learning into quantum Monte Carlo (QMC) has significantly advanced the frontier of ab initio calculation, offering a universal tool to solve the electronic structure of materials and molecules. Deep learning QMC architectures were initial designed and tested on small molecules, focusing on comparisons with other state-of-the-art ab initio methods. Methodological developments, including extensions to real solids and periodic models, have been rapidly progressing and reported applications are fast expanding. This review covers the theoretical foundation of deep learning QMC for solids, the neural network wavefunction ansatz, and various of other methodological developments. Applications on computing energy, electron density, electric polarization, force and stress of real solids are also reviewed. The methods have also been extended to other periodic systems and finite temperature calculations. The review highlights the potentials and existing challenges of deep learning QMC in materials chemistry and condensed matter physics. △ Less

Submitted 30 June, 2024; originally announced July 2024.

arXiv:2407.00105 [pdf, other]

Multiple Kronecker RLS fusion-based link propagation for drug-side effect prediction

Authors: Yuqing Qian, Ziyu Zheng, Prayag Tiwari, Yijie Ding, Quan Zou

Abstract: Drug-side effect prediction has become an essential area of research in the field of pharmacology. As the use of medications continues to rise, so does the importance of understanding and mitigating the potential risks associated with them. At present, researchers have turned to data-driven methods to predict drug-side effects. Drug-side effect prediction is a link prediction problem, and the rela… ▽ More Drug-side effect prediction has become an essential area of research in the field of pharmacology. As the use of medications continues to rise, so does the importance of understanding and mitigating the potential risks associated with them. At present, researchers have turned to data-driven methods to predict drug-side effects. Drug-side effect prediction is a link prediction problem, and the related data can be described from various perspectives. To process these kinds of data, a multi-view method, called Multiple Kronecker RLS fusion-based link propagation (MKronRLSF-LP), is proposed. MKronRLSF-LP extends the Kron-RLS by finding the consensus partitions and multiple graph Laplacian constraints in the multi-view setting. Both of these multi-view settings contribute to a higher quality result. Extensive experiments have been conducted on drug-side effect datasets, and our empirical results provide evidence that our approach is effective and robust. △ Less

Submitted 27 June, 2024; originally announced July 2024.

Comments: Transactions on Machine Learning Research (TMLR 2024)

arXiv:2406.13471 [pdf, other]

Diffusion-based Generative Modeling with Discriminative Guidance for Streamable Speech Enhancement

Authors: Chenda Li, Samuele Cornell, Shinji Watanabe, Yanmin Qian

Abstract: Diffusion-based generative models (DGMs) have recently attracted attention in speech enhancement research (SE) as previous works showed a remarkable generalization capability. However, DGMs are also computationally intensive, as they usually require many iterations in the reverse diffusion process (RDP), making them impractical for streaming SE systems. In this paper, we propose to use discriminat… ▽ More Diffusion-based generative models (DGMs) have recently attracted attention in speech enhancement research (SE) as previous works showed a remarkable generalization capability. However, DGMs are also computationally intensive, as they usually require many iterations in the reverse diffusion process (RDP), making them impractical for streaming SE systems. In this paper, we propose to use discriminative scores from discriminative models in the first steps of the RDP. These discriminative scores require only one forward pass with the discriminative model for multiple RDP steps, thus greatly reducing computations. This approach also allows for performance improvements. We show that we can trade off between generative and discriminative capabilities as the number of steps with the discriminative score increases. Furthermore, we propose a novel streamable time-domain generative model with an algorithmic latency of 50 ms, which has no significant performance degradation compared to offline models. △ Less

Submitted 19 June, 2024; originally announced June 2024.

arXiv:2406.11740 [pdf, other]

Imagination Policy: Using Generative Point Cloud Models for Learning Manipulation Policies

Authors: Haojie Huang, Karl Schmeckpeper, Dian Wang, Ondrej Biza, Yaoyao Qian, Haotian Liu, Mingxi Jia, Robert Platt, Robin Walters

Abstract: Humans can imagine goal states during planning and perform actions to match those goals. In this work, we propose Imagination Policy, a novel multi-task key-frame policy network for solving high-precision pick and place tasks. Instead of learning actions directly, Imagination Policy generates point clouds to imagine desired states which are then translated to actions using rigid action estimation.… ▽ More Humans can imagine goal states during planning and perform actions to match those goals. In this work, we propose Imagination Policy, a novel multi-task key-frame policy network for solving high-precision pick and place tasks. Instead of learning actions directly, Imagination Policy generates point clouds to imagine desired states which are then translated to actions using rigid action estimation. This transforms action inference into a local generative task. We leverage pick and place symmetries underlying the tasks in the generation process and achieve extremely high sample efficiency and generalizability to unseen configurations. Finally, we demonstrate state-of-the-art performance across various tasks on the RLbench benchmark compared with several strong baselines. △ Less

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.11364 [pdf, other]

AnoPatch: Towards Better Consistency in Machine Anomalous Sound Detection

Authors: Anbai Jiang, Bing Han, Zhiqiang Lv, Yufeng Deng, Wei-Qiang Zhang, Xie Chen, Yanmin Qian, Jia Liu, Pingyi Fan

Abstract: Large pre-trained models have demonstrated dominant performances in multiple areas, where the consistency between pre-training and fine-tuning is the key to success. However, few works reported satisfactory results of pre-trained models for the machine anomalous sound detection (ASD) task. This may be caused by the inconsistency of the pre-trained model and the inductive bias of machine audio, res… ▽ More Large pre-trained models have demonstrated dominant performances in multiple areas, where the consistency between pre-training and fine-tuning is the key to success. However, few works reported satisfactory results of pre-trained models for the machine anomalous sound detection (ASD) task. This may be caused by the inconsistency of the pre-trained model and the inductive bias of machine audio, resulting in inconsistency in data and architecture. Thus, we propose AnoPatch which utilizes a ViT backbone pre-trained on AudioSet and fine-tunes it on machine audio. It is believed that machine audio is more related to audio datasets than speech datasets, and modeling it from patch level suits the sparsity of machine audio. As a result, AnoPatch showcases state-of-the-art (SOTA) performances on the DCASE 2020 ASD dataset and the DCASE 2023 ASD dataset. We also compare multiple pre-trained models and empirically demonstrate that better consistency yields considerable improvement. △ Less

Submitted 17 June, 2024; originally announced June 2024.

Comments: Accepted by INTERSPEECH 2024

arXiv:2406.11134 [pdf, other]

Emergent Wigner phases in moiré superlattice from deep learning

Authors: Xiang Li, Yubing Qian, Weiluo Ren, Yang Xu, Ji Chen

Abstract: Moiré superlattice designed in stacked van der Waals material provides a dynamic platform for hosting exotic and emergent condensed matter phenomena. However, the relevance of strong correlation effects and the large size of moiré unit cells pose significant challenges for traditional computational techniques. To overcome these challenges, we develop an unsupervised deep learning approach to uncov… ▽ More Moiré superlattice designed in stacked van der Waals material provides a dynamic platform for hosting exotic and emergent condensed matter phenomena. However, the relevance of strong correlation effects and the large size of moiré unit cells pose significant challenges for traditional computational techniques. To overcome these challenges, we develop an unsupervised deep learning approach to uncover electronic phases emerging from moiré systems based on variational optimization of neural network many-body wavefunction. Our approach has identified diverse quantum states, including novel phases such as generalized Wigner crystals, Wigner molecular crystals, and previously unreported Wigner covalent crystals. These discoveries provide insights into recent experimental studies and suggest new phases for future exploration. They also highlight the crucial role of spin polarization in determining Wigner phases. More importantly, our proposed deep learning approach is proven general and efficient, offering a powerful framework for studying moiré physics. △ Less

Submitted 16 June, 2024; originally announced June 2024.

arXiv:2406.08812 [pdf, other]

Generating Speakers by Prompting Listener Impressions for Pre-trained Multi-Speaker Text-to-Speech Systems

Authors: Zhengyang Chen, Xuechen Liu, Erica Cooper, Junichi Yamagishi, Yanmin Qian

Abstract: This paper proposes a speech synthesis system that allows users to specify and control the acoustic characteristics of a speaker by means of prompts describing the speaker's traits of synthesized speech. Unlike previous approaches, our method utilizes listener impressions to construct prompts, which are easier to collect and align more naturally with everyday descriptions of speaker traits. We ado… ▽ More This paper proposes a speech synthesis system that allows users to specify and control the acoustic characteristics of a speaker by means of prompts describing the speaker's traits of synthesized speech. Unlike previous approaches, our method utilizes listener impressions to construct prompts, which are easier to collect and align more naturally with everyday descriptions of speaker traits. We adopt the Low-rank Adaptation (LoRA) technique to swiftly tailor a pre-trained language model to our needs, facilitating the extraction of speaker-related traits from the prompt text. Besides, different from other prompt-driven text-to-speech (TTS) systems, we separate the prompt-to-speaker module from the multi-speaker TTS system, enhancing system flexibility and compatibility with various pre-trained multi-speaker TTS systems. Moreover, for the prompt-to-speaker characteristic module, we also compared the discriminative method and flow-matching based generative method and we found that combining both methods can help the system simultaneously capture speaker-related information from prompts better and generate speech with higher fidelity. △ Less

Submitted 13 June, 2024; originally announced June 2024.

Comments: Accepted for presentation at Interspeech 2024 (with more analysis in the final Appendix part)

arXiv:2406.07855 [pdf, other]

VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment

Authors: Bing Han, Long Zhou, Shujie Liu, Sanyuan Chen, Lingwei Meng, Yanming Qian, Yanqing Liu, Sheng Zhao, Jinyu Li, Furu Wei

Abstract: With the help of discrete neural audio codecs, large language models (LLM) have increasingly been recognized as a promising methodology for zero-shot Text-to-Speech (TTS) synthesis. However, sampling based decoding strategies bring astonishing diversity to generation, but also pose robustness issues such as typos, omissions and repetition. In addition, the high sampling rate of audio also brings h… ▽ More With the help of discrete neural audio codecs, large language models (LLM) have increasingly been recognized as a promising methodology for zero-shot Text-to-Speech (TTS) synthesis. However, sampling based decoding strategies bring astonishing diversity to generation, but also pose robustness issues such as typos, omissions and repetition. In addition, the high sampling rate of audio also brings huge computational overhead to the inference process of autoregression. To address these issues, we propose VALL-E R, a robust and efficient zero-shot TTS system, building upon the foundation of VALL-E. Specifically, we introduce a phoneme monotonic alignment strategy to strengthen the connection between phonemes and acoustic sequence, ensuring a more precise alignment by constraining the acoustic tokens to match their associated phonemes. Furthermore, we employ a codec-merging approach to downsample the discrete codes in shallow quantization layer, thereby accelerating the decoding speed while preserving the high quality of speech output. Benefiting from these strategies, VALL-E R obtains controllablity over phonemes and demonstrates its strong robustness by approaching the WER of ground truth. In addition, it requires fewer autoregressive steps, with over 60% time reduction during inference. This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia. Audio samples will be available at: https://aka.ms/valler. △ Less

Submitted 12 June, 2024; originally announced June 2024.

Comments: 15 pages, 5 figures

arXiv:2406.07198 [pdf, other]

Target Speech Diarization with Multimodal Prompts

Authors: Yidi Jiang, Ruijie Tao, Zhengyang Chen, Yanmin Qian, Haizhou Li

Abstract: Traditional speaker diarization seeks to detect ``who spoke when'' according to speaker characteristics. Extending to target speech diarization, we detect ``when target event occurs'' according to the semantic characteristics of speech. We propose a novel Multimodal Target Speech Diarization (MM-TSD) framework, which accommodates diverse and multi-modal prompts to specify target events in a flexib… ▽ More Traditional speaker diarization seeks to detect ``who spoke when'' according to speaker characteristics. Extending to target speech diarization, we detect ``when target event occurs'' according to the semantic characteristics of speech. We propose a novel Multimodal Target Speech Diarization (MM-TSD) framework, which accommodates diverse and multi-modal prompts to specify target events in a flexible and user-friendly manner, including semantic language description, pre-enrolled speech, pre-registered face image, and audio-language logical prompts. We further propose a voice-face aligner module to project human voice and face representation into a shared space. We develop a multi-modal dataset based on VoxCeleb2 for MM-TSD training and evaluation. Additionally, we conduct comparative analysis and ablation studies for each category of prompts to validate the efficacy of each component in the proposed framework. Furthermore, our framework demonstrates versatility in performing various signal processing tasks, including speaker diarization and overlap speech detection, using task-specific prompts. MM-TSD achieves robust and comparable performance as a unified system compared to specialized models. Moreover, MM-TSD shows capability to handle complex conversations for real-world dataset. △ Less

Submitted 11 June, 2024; originally announced June 2024.

Comments: 13 pages, 7 figures

arXiv:2406.06350 [pdf, other]

Error Analysis and Numerical Algorithm for PDE Approximation with Hidden-Layer Concatenated Physics Informed Neural Networks

Authors: Yianxia Qian, Yongchao Zhang, Suchuan Dong

Abstract: We present the hidden-layer concatenated physics informed neural network (HLConcPINN) method, which combines hidden-layer concatenated feed-forward neural networks, a modified block time marching strategy, and a physics informed approach for approximating partial differential equations (PDEs). We analyze the convergence properties and establish the error bounds of this method for two types of PDEs… ▽ More We present the hidden-layer concatenated physics informed neural network (HLConcPINN) method, which combines hidden-layer concatenated feed-forward neural networks, a modified block time marching strategy, and a physics informed approach for approximating partial differential equations (PDEs). We analyze the convergence properties and establish the error bounds of this method for two types of PDEs: parabolic (exemplified by the heat and Burgers' equations) and hyperbolic (exemplified by the wave and nonlinear Klein-Gordon equations). We show that its approximation error of the solution can be effectively controlled by the training loss for dynamic simulations with long time horizons. The HLConcPINN method in principle allows an arbitrary number of hidden layers not smaller than two and any of the commonly-used smooth activation functions for the hidden layers beyond the first two, with theoretical guarantees. This generalizes several recent neural-network techniques, which have theoretical guarantees but are confined to two hidden layers in the network architecture and the $\tanh$ activation function. Our theoretical analyses subsequently inform the formulation of appropriate training loss functions for these PDEs, leading to physics informed neural network (PINN) type computational algorithms that differ from the standard PINN formulation. Ample numerical experiments are presented based on the proposed algorithm to validate the effectiveness of this method and confirm aspects of the theoretical analyses. △ Less

Submitted 10 June, 2024; originally announced June 2024.

Comments: 40 pages, 10 tables, 18 figures

arXiv:2406.05740 [pdf, ps, other]

Convergence of ZH-type nonmonotone descent method for Kurdyka-Łojasiewicz optimization problems

Authors: Yitian Qian, Ting Tao, Shaohua Pan, Houduo Qi

Abstract: This note concerns a class of nonmonotone descent methods for minimizing a proper lower semicontinuous Kurdyka-Ł$\ddot{o}$jasiewicz (KL) function $Φ$, whose iterate sequence obeys the ZH-type nonmonotone decrease condition and a relative error condition. We prove that the iterate sequence converges to a critical point of $Φ$, and if $Φ$ has the KL property of exponent $θ\in(0,1)$ at this critical… ▽ More This note concerns a class of nonmonotone descent methods for minimizing a proper lower semicontinuous Kurdyka-Ł$\ddot{o}$jasiewicz (KL) function $Φ$, whose iterate sequence obeys the ZH-type nonmonotone decrease condition and a relative error condition. We prove that the iterate sequence converges to a critical point of $Φ$, and if $Φ$ has the KL property of exponent $θ\in(0,1)$ at this critical point, the convergence has a linear rate for $θ\in(0,1/2]$ and a sublinear rate of exponent $\frac{1-θ}{1-2θ}$ for $θ\in(1/2,1)$. Our results first resolve the full convergence of the iterate sequence generated by the ZH-type nonmonotone descent method for nonconvex and nonsmooth optimization problems, and extend the full convergence of monotone descent methods for KL optimization problems to the ZH-type nonmonotone descent method. △ Less

Submitted 9 June, 2024; originally announced June 2024.

arXiv:2406.05370 [pdf, other]

VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers

Authors: Sanyuan Chen, Shujie Liu, Long Zhou, Yanqing Liu, Xu Tan, Jinyu Li, Sheng Zhao, Yao Qian, Furu Wei

Abstract: This paper introduces VALL-E 2, the latest advancement in neural codec language models that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity for the first time. Based on its predecessor, VALL-E, the new iteration introduces two significant enhancements: Repetition Aware Sampling refines the original nucleus sampling process by accounting for token repetition in… ▽ More This paper introduces VALL-E 2, the latest advancement in neural codec language models that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity for the first time. Based on its predecessor, VALL-E, the new iteration introduces two significant enhancements: Repetition Aware Sampling refines the original nucleus sampling process by accounting for token repetition in the decoding history. It not only stabilizes the decoding but also circumvents the infinite loop issue. Grouped Code Modeling organizes codec codes into groups to effectively shorten the sequence length, which not only boosts inference speed but also addresses the challenges of long sequence modeling. Our experiments on the LibriSpeech and VCTK datasets show that VALL-E 2 surpasses previous systems in speech robustness, naturalness, and speaker similarity. It is the first of its kind to reach human parity on these benchmarks. Moreover, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases. The advantages of this work could contribute to valuable endeavors, such as generating speech for individuals with aphasia or people with amyotrophic lateral sclerosis. See https://aka.ms/valle2 for demos of VALL-E 2. △ Less

Submitted 17 June, 2024; v1 submitted 8 June, 2024; originally announced June 2024.

Comments: Demo posted

arXiv:2406.05359 [pdf, other]

Towards Lightweight Speaker Verification via Adaptive Neural Network Quantization

Authors: Bei Liu, Haoyu Wang, Yanmin Qian

Abstract: Modern speaker verification (SV) systems typically demand expensive storage and computing resources, thereby hindering their deployment on mobile devices. In this paper, we explore adaptive neural network quantization for lightweight speaker verification. Firstly, we propose a novel adaptive uniform precision quantization method which enables the dynamic generation of quantization centroids custom… ▽ More Modern speaker verification (SV) systems typically demand expensive storage and computing resources, thereby hindering their deployment on mobile devices. In this paper, we explore adaptive neural network quantization for lightweight speaker verification. Firstly, we propose a novel adaptive uniform precision quantization method which enables the dynamic generation of quantization centroids customized for each network layer based on k-means clustering. By applying it to the pre-trained SV systems, we obtain a series of quantized variants with different bit widths. To enhance the performance of low-bit quantized models, a mixed precision quantization algorithm along with a multi-stage fine-tuning (MSFT) strategy is further introduced. Unlike uniform precision quantization, mixed precision approach allows for the assignment of varying bit widths to different network layers. When bit combination is determined, MSFT is employed to progressively quantize and fine-tune network in a specific order. Finally, we design two distinct binary quantization schemes to mitigate performance degradation of 1-bit quantized models: the static and adaptive quantizers. Experiments on VoxCeleb demonstrate that lossless 4-bit uniform precision quantization is achieved on both ResNets and DF-ResNets, yielding a promising compression ratio of around 8. Moreover, compared to uniform precision approach, mixed precision quantization not only obtains additional performance improvements with a similar model size but also offers the flexibility to generate bit combination for any desirable model size. In addition, our suggested 1-bit quantization schemes remarkably boost the performance of binarized models. Finally, a thorough comparison with existing lightweight SV systems reveals that our proposed models outperform all previous methods by a large margin across various model size ranges. △ Less

Submitted 18 June, 2024; v1 submitted 8 June, 2024; originally announced June 2024.

Comments: submitted to IEEE/ACM Transactions on Audio Speech and Language Processing (Under Review)

arXiv:2406.04660 [pdf, other]

URGENT Challenge: Universality, Robustness, and Generalizability For Speech Enhancement

Authors: Wangyou Zhang, Robin Scheibler, Kohei Saijo, Samuele Cornell, Chenda Li, Zhaoheng Ni, Anurag Kumar, Jan Pirklbauer, Marvin Sach, Shinji Watanabe, Tim Fingscheidt, Yanmin Qian

Abstract: The last decade has witnessed significant advancements in deep learning-based speech enhancement (SE). However, most existing SE research has limitations on the coverage of SE sub-tasks, data diversity and amount, and evaluation metrics. To fill this gap and promote research toward universal SE, we establish a new SE challenge, named URGENT, to focus on the universality, robustness, and generaliza… ▽ More The last decade has witnessed significant advancements in deep learning-based speech enhancement (SE). However, most existing SE research has limitations on the coverage of SE sub-tasks, data diversity and amount, and evaluation metrics. To fill this gap and promote research toward universal SE, we establish a new SE challenge, named URGENT, to focus on the universality, robustness, and generalizability of SE. We aim to extend the SE definition to cover different sub-tasks to explore the limits of SE models, starting from denoising, dereverberation, bandwidth extension, and declipping. A novel framework is proposed to unify all these sub-tasks in a single model, allowing the use of all existing SE approaches. We collected public speech and noise data from different domains to construct diverse evaluation data. Finally, we discuss the insights gained from our preliminary baseline experiments based on both generative and discriminative SE methods with 12 curated metrics. △ Less

Submitted 7 June, 2024; originally announced June 2024.

Comments: 6 pages, 3 figures, 3 tables. Accepted by Interspeech 2024. An extended version of the accepted manuscript with appendix

arXiv:2406.04588 [pdf, ps, other]

A majorized PAM method with subspace correction for low-rank composite factorization model

Authors: Ting Tao, Yitian Qian, Shaohua Pan

Abstract: This paper concerns a class of low-rank composite factorization models arising from matrix completion. For this nonconvex and nonsmooth optimization problem, we propose a proximal alternating minimization algorithm (PAMA) with subspace correction, in which a subspace correction step is imposed on every proximal subproblem so as to guarantee that the corrected proximal subproblem has a closed-form… ▽ More This paper concerns a class of low-rank composite factorization models arising from matrix completion. For this nonconvex and nonsmooth optimization problem, we propose a proximal alternating minimization algorithm (PAMA) with subspace correction, in which a subspace correction step is imposed on every proximal subproblem so as to guarantee that the corrected proximal subproblem has a closed-form solution. For this subspace correction PAMA, we prove the subsequence convergence of the iterate sequence, and establish the convergence of the whole iterate sequence and the column subspace sequences of factor pairs under the KL property of objective function and a restrictive condition that holds automatically for the column $\ell_{2,0}$-norm function. Numerical comparison with the proximal alternating linearized minimization method on one-bit matrix completion problems indicates that PAMA has an advantage in seeking lower relative error within less time. △ Less

Submitted 6 June, 2024; originally announced June 2024.

Comments: 28 pages, 3 figures

MSC Class: 49J52; 90C26; 49M27

arXiv:2406.04269 [pdf, other]

Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement

Authors: Wangyou Zhang, Kohei Saijo, Jee-weon Jung, Chenda Li, Shinji Watanabe, Yanmin Qian

Abstract: Deep learning-based speech enhancement (SE) models have achieved impressive performance in the past decade. Numerous advanced architectures have been designed to deliver state-of-the-art performance; however, their scalability potential remains unrevealed. Meanwhile, the majority of research focuses on small-sized datasets with restricted diversity, leading to a plateau in performance improvement.… ▽ More Deep learning-based speech enhancement (SE) models have achieved impressive performance in the past decade. Numerous advanced architectures have been designed to deliver state-of-the-art performance; however, their scalability potential remains unrevealed. Meanwhile, the majority of research focuses on small-sized datasets with restricted diversity, leading to a plateau in performance improvement. In this paper, we aim to provide new insights for addressing the above issues by exploring the scalability of SE models in terms of architectures, model sizes, compute budgets, and dataset sizes. Our investigation involves several popular SE architectures and speech data from different domains. Experiments reveal both similarities and distinctions between the scaling effects in SE and other tasks such as speech recognition. These findings further provide insights into the under-explored SE directions, e.g., larger-scale multi-domain corpora and efficiently scalable architectures. △ Less

Submitted 6 June, 2024; originally announced June 2024.

Comments: 5 pages, 3 figures, 4 tables, Accepted by Interspeech 2024

arXiv:2405.17809 [pdf, other]

TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation

Authors: Chenyang Le, Yao Qian, Dongmei Wang, Long Zhou, Shujie Liu, Xiaofei Wang, Midia Yousefi, Yanmin Qian, Jinyu Li, Sheng Zhao, Michael Zeng

Abstract: There is a rising interest and trend in research towards directly translating speech from one language to another, known as end-to-end speech-to-speech translation. However, most end-to-end models struggle to outperform cascade models, i.e., a pipeline framework by concatenating speech recognition, machine translation and text-to-speech models. The primary challenges stem from the inherent complex… ▽ More There is a rising interest and trend in research towards directly translating speech from one language to another, known as end-to-end speech-to-speech translation. However, most end-to-end models struggle to outperform cascade models, i.e., a pipeline framework by concatenating speech recognition, machine translation and text-to-speech models. The primary challenges stem from the inherent complexities involved in direct translation tasks and the scarcity of data. In this study, we introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion yet facilitates end-to-end inference through joint probability. Furthermore, we propose two separated encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process, making it highly suitable for scenarios such as video dubbing. Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model. △ Less

Submitted 28 May, 2024; originally announced May 2024.

Comments: Work in progress

arXiv:2405.17233 [pdf, other]

CLAQ: Pushing the Limits of Low-Bit Post-Training Quantization for LLMs

Authors: Haoyu Wang, Bei Liu, Hang Shao, Bo Xiao, Ke Zeng, Guanglu Wan, Yanmin Qian

Abstract: Parameter quantization for Large Language Models (LLMs) has attracted increasing attentions recently in reducing memory costs and improving computational efficiency. Early approaches have been widely adopted. However, the existing methods suffer from poor performance in low-bit (such as 2 to 3 bits) scenarios. In this paper, we present a novel and effective Column-Level Adaptive weight Quantizatio… ▽ More Parameter quantization for Large Language Models (LLMs) has attracted increasing attentions recently in reducing memory costs and improving computational efficiency. Early approaches have been widely adopted. However, the existing methods suffer from poor performance in low-bit (such as 2 to 3 bits) scenarios. In this paper, we present a novel and effective Column-Level Adaptive weight Quantization (CLAQ) framework by introducing three different types of adaptive strategies for LLM quantization. Firstly, a K-Means clustering based algorithm is proposed that allows dynamic generation of quantization centroids for each column of a parameter matrix. Secondly, we design an outlier-guided adaptive precision search strategy which can dynamically assign varying bit-widths to different columns. Finally, a dynamic outlier reservation scheme is developed to retain some parameters in their original float point precision, in trade off of boosted model performance. Experiments on various mainstream open source LLMs including LLaMA-1, LLaMA-2 and Yi demonstrate that our methods achieve the state-of-the-art results across different bit settings, especially in extremely low-bit scenarios. Code is available at https://github.com/fayuge/CLAQ. △ Less

Submitted 2 June, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

arXiv:2405.15580 [pdf, other]

Open-Vocabulary SAM3D: Understand Any 3D Scene

Authors: Hanchen Tai, Qingdong He, Jiangning Zhang, Yijie Qian, Zhenyu Zhang, Xiaobin Hu, Yabiao Wang, Yong Liu

Abstract: Open-vocabulary 3D scene understanding presents a significant challenge in the field. Recent advancements have sought to transfer knowledge embedded in vision language models from the 2D domain to 3D domain. However, these approaches often require learning prior knowledge from specific 3D scene datasets, which limits their applicability in open-world scenarios. The Segment Anything Model (SAM) has… ▽ More Open-vocabulary 3D scene understanding presents a significant challenge in the field. Recent advancements have sought to transfer knowledge embedded in vision language models from the 2D domain to 3D domain. However, these approaches often require learning prior knowledge from specific 3D scene datasets, which limits their applicability in open-world scenarios. The Segment Anything Model (SAM) has demonstrated remarkable zero-shot segmentation capabilities, prompting us to investigate its potential for comprehending 3D scenes without the need for training. In this paper, we introduce OV-SAM3D, a universal framework for open-vocabulary 3D scene understanding. This framework is designed to perform understanding tasks for any 3D scene without requiring prior knowledge of the scene. Specifically, our method is composed of two key sub-modules: First, we initiate the process by generating superpoints as the initial 3D prompts and refine these prompts using segment masks derived from SAM. Moreover, we then integrate a specially designed overlapping score table with open tags from the Recognize Anything Model (RAM) to produce final 3D instances with open-world label. Empirical evaluations conducted on the ScanNet200 and nuScenes datasets demonstrate that our approach surpasses existing open-vocabulary methods in unknown open-world environments. △ Less

Submitted 21 June, 2024; v1 submitted 24 May, 2024; originally announced May 2024.

Comments: Project page: https://hithqd.github.io/projects/OV-SAM3D

arXiv:2405.13080 [pdf, other]

EmInspector: Combating Backdoor Attacks in Federated Self-Supervised Learning Through Embedding Inspection

Authors: Yuwen Qian, Shuchi Wu, Kang Wei, Ming Ding, Di Xiao, Tao Xiang, Chuan Ma, Song Guo

Abstract: Federated self-supervised learning (FSSL) has recently emerged as a promising paradigm that enables the exploitation of clients' vast amounts of unlabeled data while preserving data privacy. While FSSL offers advantages, its susceptibility to backdoor attacks, a concern identified in traditional federated supervised learning (FSL), has not been investigated. To fill the research gap, we undertake… ▽ More Federated self-supervised learning (FSSL) has recently emerged as a promising paradigm that enables the exploitation of clients' vast amounts of unlabeled data while preserving data privacy. While FSSL offers advantages, its susceptibility to backdoor attacks, a concern identified in traditional federated supervised learning (FSL), has not been investigated. To fill the research gap, we undertake a comprehensive investigation into a backdoor attack paradigm, where unscrupulous clients conspire to manipulate the global model, revealing the vulnerability of FSSL to such attacks. In FSL, backdoor attacks typically build a direct association between the backdoor trigger and the target label. In contrast, in FSSL, backdoor attacks aim to alter the global model's representation for images containing the attacker's specified trigger pattern in favor of the attacker's intended target class, which is less straightforward. In this sense, we demonstrate that existing defenses are insufficient to mitigate the investigated backdoor attacks in FSSL, thus finding an effective defense mechanism is urgent. To tackle this issue, we dive into the fundamental mechanism of backdoor attacks on FSSL, proposing the Embedding Inspector (EmInspector) that detects malicious clients by inspecting the embedding space of local models. In particular, EmInspector assesses the similarity of embeddings from different local models using a small set of inspection images (e.g., ten images of CIFAR100) without specific requirements on sample distribution or labels. We discover that embeddings from backdoored models tend to cluster together in the embedding space for a given inspection image. Evaluation results show that EmInspector can effectively mitigate backdoor attacks on FSSL across various adversary settings. Our code is avaliable at https://github.com/ShuchiWu/EmInspector. △ Less

Submitted 21 May, 2024; originally announced May 2024.

Comments: 18 pages, 12 figures

arXiv:2405.12944 [pdf, other]

AMFD: Distillation via Adaptive Multimodal Fusion for Multispectral Pedestrian Detection

Authors: Zizhao Chen, Yeqiang Qian, Xiaoxiao Yang, Chunxiang Wang, Ming Yang

Abstract: Multispectral pedestrian detection has been shown to be effective in improving performance within complex illumination scenarios. However, prevalent double-stream networks in multispectral detection employ two separate feature extraction branches for multi-modal data, leading to nearly double the inference time compared to single-stream networks utilizing only one feature extraction branch. This i… ▽ More Multispectral pedestrian detection has been shown to be effective in improving performance within complex illumination scenarios. However, prevalent double-stream networks in multispectral detection employ two separate feature extraction branches for multi-modal data, leading to nearly double the inference time compared to single-stream networks utilizing only one feature extraction branch. This increased inference time has hindered the widespread employment of multispectral pedestrian detection in embedded devices for autonomous systems. To address this limitation, various knowledge distillation methods have been proposed. However, traditional distillation methods focus only on the fusion features and ignore the large amount of information in the original multi-modal features, thereby restricting the student network's performance. To tackle the challenge, we introduce the Adaptive Modal Fusion Distillation (AMFD) framework, which can fully utilize the original modal features of the teacher network. Specifically, a Modal Extraction Alignment (MEA) module is utilized to derive learning weights for student networks, integrating focal and global attention mechanisms. This methodology enables the student network to acquire optimal fusion strategies independent from that of teacher network without necessitating an additional feature fusion module. Furthermore, we present the SMOD dataset, a well-aligned challenging multispectral dataset for detection. Extensive experiments on the challenging KAIST, LLVIP and SMOD datasets are conducted to validate the effectiveness of AMFD. The results demonstrate that our method outperforms existing state-of-the-art methods in both reducing log-average Miss Rate and improving mean Average Precision. The code is available at https://github.com/bigD233/AMFD.git. △ Less

Submitted 21 May, 2024; originally announced May 2024.

arXiv:2405.12914 [pdf, other]

An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation

Authors: Zhiyu Tan, Mengping Yang, Luozheng Qin, Hao Yang, Ye Qian, Qiang Zhou, Cheng Zhang, Hao Li

Abstract: One critical prerequisite for faithful text-to-image generation is the accurate understanding of text inputs. Existing methods leverage the text encoder of the CLIP model to represent input prompts. However, the pre-trained CLIP model can merely encode English with a maximum token length of 77. Moreover, the model capacity of the text encoder from CLIP is relatively limited compared to Large Langu… ▽ More One critical prerequisite for faithful text-to-image generation is the accurate understanding of text inputs. Existing methods leverage the text encoder of the CLIP model to represent input prompts. However, the pre-trained CLIP model can merely encode English with a maximum token length of 77. Moreover, the model capacity of the text encoder from CLIP is relatively limited compared to Large Language Models (LLMs), which offer multilingual input, accommodate longer context, and achieve superior text representation. In this paper, we investigate LLMs as the text encoder to improve the language understanding in text-to-image generation. Unfortunately, training text-to-image generative model with LLMs from scratch demands significant computational resources and data. To this end, we introduce a three-stage training pipeline that effectively and efficiently integrates the existing text-to-image model with LLMs. Specifically, we propose a lightweight adapter that enables fast training of the text-to-image model using the textual representations from LLMs. Extensive experiments demonstrate that our model supports not only multilingual but also longer input context with superior image generation quality. △ Less

Submitted 21 May, 2024; originally announced May 2024.

Comments: Technical report. Project page: https://github.com/llm-conditioned-diffusion/llm-conditioned-diffusion

arXiv:2405.06510 [pdf, other]

UniDM: A Unified Framework for Data Manipulation with Large Language Models

Authors: Yichen Qian, Yongyi He, Rong Zhu, Jintao Huang, Zhijian Ma, Haibin Wang, Yaohua Wang, Xiuyu Sun, Defu Lian, Bolin Ding, Jingren Zhou

Abstract: Designing effective data manipulation methods is a long standing problem in data lakes. Traditional methods, which rely on rules or machine learning models, require extensive human efforts on training data collection and tuning models. Recent methods apply Large Language Models (LLMs) to resolve multiple data manipulation tasks. They exhibit bright benefits in terms of performance but still requir… ▽ More Designing effective data manipulation methods is a long standing problem in data lakes. Traditional methods, which rely on rules or machine learning models, require extensive human efforts on training data collection and tuning models. Recent methods apply Large Language Models (LLMs) to resolve multiple data manipulation tasks. They exhibit bright benefits in terms of performance but still require customized designs to fit each specific task. This is very costly and can not catch up with the requirements of big data lake platforms. In this paper, inspired by the cross-task generality of LLMs on NLP tasks, we pave the first step to design an automatic and general solution to tackle with data manipulation tasks. We propose UniDM, a unified framework which establishes a new paradigm to process data manipulation tasks using LLMs. UniDM formalizes a number of data manipulation tasks in a unified form and abstracts three main general steps to solve each task. We develop an automatic context retrieval to allow the LLMs to retrieve data from data lakes, potentially containing evidence and factual information. For each step, we design effective prompts to guide LLMs to produce high quality results. By our comprehensive evaluation on a variety of benchmarks, our UniDM exhibits great generality and state-of-the-art performance on a wide variety of data manipulation tasks. △ Less

Submitted 10 May, 2024; originally announced May 2024.

Comments: MLSys24

arXiv:2404.19040 [pdf, other]

GSTalker: Real-time Audio-Driven Talking Face Generation via Deformable Gaussian Splatting

Authors: Bo Chen, Shoukang Hu, Qi Chen, Chenpeng Du, Ran Yi, Yanmin Qian, Xie Chen

Abstract: We present GStalker, a 3D audio-driven talking face generation model with Gaussian Splatting for both fast training (40 minutes) and real-time rendering (125 FPS) with a 3$\sim$5 minute video for training material, in comparison with previous 2D and 3D NeRF-based modeling frameworks which require hours of training and seconds of rendering per frame. Specifically, GSTalker learns an audio-driven Ga… ▽ More We present GStalker, a 3D audio-driven talking face generation model with Gaussian Splatting for both fast training (40 minutes) and real-time rendering (125 FPS) with a 3$\sim$5 minute video for training material, in comparison with previous 2D and 3D NeRF-based modeling frameworks which require hours of training and seconds of rendering per frame. Specifically, GSTalker learns an audio-driven Gaussian deformation field to translate and transform 3D Gaussians to synchronize with audio information, in which multi-resolution hashing grid-based tri-plane and temporal smooth module are incorporated to learn accurate deformation for fine-grained facial details. In addition, a pose-conditioned deformation field is designed to model the stabilized torso. To enable efficient optimization of the condition Gaussian deformation field, we initialize 3D Gaussians by learning a coarse static Gaussian representation. Extensive experiments in person-specific videos with audio tracks validate that GSTalker can generate high-fidelity and audio-lips synchronized results with fast training and real-time rendering speed. △ Less

Submitted 29 April, 2024; originally announced April 2024.

arXiv:2404.16383 [pdf, other]

Euler band topology in spin-orbit coupled magnetic systems

Authors: Seung Hun Lee, Yuting Qian, Bohm-Jung Yang

Abstract: The Euler class characterizes the topology of two real bands isolated from other bands in two-dimensions. Despite various intriguing topological properties predicted up to now, the candidate real materials hosting electronic Euler bands are extremely rare. Here, we show that in a quantum spin Hall insulator with two-fold rotation $C_{2z}$ about the $z$-axis, a pair of bands with nontrivial… ▽ More The Euler class characterizes the topology of two real bands isolated from other bands in two-dimensions. Despite various intriguing topological properties predicted up to now, the candidate real materials hosting electronic Euler bands are extremely rare. Here, we show that in a quantum spin Hall insulator with two-fold rotation $C_{2z}$ about the $z$-axis, a pair of bands with nontrivial $\mathbb{Z}_2$ invariant turn into magnetic Euler bands under in-plane Zeeman field or in-plane ferromagnetic ordering. The resulting magnetic insulator generally carries a nontrivial second Stiefel-Whitney invariant. In particular, when the topmost pair of occupied bands carry a nonzero Euler number, the corresponding magnetic insulator can be called a magnetic Euler insulator. Moreover, the topological phase transition between a trivial magnetic insulator and a magnetic Stiefel-Whitney or Euler insulator is mediated by a stable topological semimetal phase in which Dirac nodes carrying non-Abelian topological charges exhibit braiding processes across the transition. Using the first-principles calculations, we propose various candidate materials hosting magnetic Euler bands. We especially show that ZrTe$_5$ bilayers under in-plane ferromagnetism are a candidate system for magnetic Stiefel-Whitney insulators in which the non-Abelian braiding-induced topological phase transitions can occur under pressure. △ Less

Submitted 25 April, 2024; originally announced April 2024.

arXiv:2404.14485 [pdf, other]

Enhanced Muonization by Active-Sterile Neutrino Mixing in Protoneutron Stars

Authors: Anupam Ray, Yong-Zhong Qian

Abstract: We study $ν_μ$-$ν_s$ and $\barν_μ$-$\barν_s$ mixing in the protoneutron star (PNS) created in a core-collapse supernova (CCSN). We point out the importance of the feedback on the general composition of the PNS in addition to the obvious feedback on the $ν_μ$ lepton number. We show that for our adopted mixing parameters $δm^2\sim 10^2\,$keV$^2$ and $\sin^2 2θ$ consistent with the current constraint… ▽ More We study $ν_μ$-$ν_s$ and $\barν_μ$-$\barν_s$ mixing in the protoneutron star (PNS) created in a core-collapse supernova (CCSN). We point out the importance of the feedback on the general composition of the PNS in addition to the obvious feedback on the $ν_μ$ lepton number. We show that for our adopted mixing parameters $δm^2\sim 10^2\,$keV$^2$ and $\sin^2 2θ$ consistent with the current constraints, sterile neutrino production is dominated by the Mikheyev$-$Smirnov$-$Wolfenstein conversion of $\barν_μ$ into $\barν_s$ and that the subsequent escape of $\barν_s$ increases the $ν_μ$ lepton number, which in turn enhances muonization of the PNS primarily through $ν_μ+n\to p+μ^-$. We find that this enhancement is amplified by diffusion of the $ν_e$ and $ν_μ$ lepton numbers. While these results are qualitatively robust, their quantitative effects on the dynamics and active neutrino emission of CCSNe should be evaluated by including $ν_μ$-$ν_s$ and $\barν_μ$-$\barν_s$ mixing in the simulations. △ Less

Submitted 22 April, 2024; originally announced April 2024.

Comments: 12 pages, 7 figures. Comments are welcome

Report number: N3AS-24-013

arXiv:2404.13279 [pdf, other]

Backdoor Attacks and Defenses on Semantic-Symbol Reconstruction in Semantic Communications

Authors: Yuan Zhou, Rose Qingyang Hu, Yi Qian

Abstract: Semantic communication is of crucial importance for the next-generation wireless communication networks. The existing works have developed semantic communication frameworks based on deep learning. However, systems powered by deep learning are vulnerable to threats such as backdoor attacks and adversarial attacks. This paper delves into backdoor attacks targeting deep learning-enabled semantic comm… ▽ More Semantic communication is of crucial importance for the next-generation wireless communication networks. The existing works have developed semantic communication frameworks based on deep learning. However, systems powered by deep learning are vulnerable to threats such as backdoor attacks and adversarial attacks. This paper delves into backdoor attacks targeting deep learning-enabled semantic communication systems. Since current works on backdoor attacks are not tailored for semantic communication scenarios, a new backdoor attack paradigm on semantic symbols (BASS) is introduced, based on which the corresponding defense measures are designed. Specifically, a training framework is proposed to prevent BASS. Additionally, reverse engineering-based and pruning-based defense strategies are designed to protect against backdoor attacks in semantic communication. Simulation results demonstrate the effectiveness of both the proposed attack paradigm and the defense strategies. △ Less

Submitted 20 April, 2024; originally announced April 2024.

Comments: This paper has been accepted by IEEE ICC 2024

arXiv:2404.11035 [pdf, other]

Approximate Wireless Communication for Lossy Gradient Updates in IoT Federated Learning

Authors: Xiang Ma, Haijian Sun, Rose Qingyang Hu, Yi Qian

Abstract: Federated learning (FL) has emerged as a distributed machine learning (ML) technique that can protect local data privacy for participating clients and improve system efficiency. Instead of sharing raw data, FL exchanges intermediate learning parameters, such as gradients, among clients. This article presents an efficient wireless communication approach tailored for FL parameter transmission, espec… ▽ More Federated learning (FL) has emerged as a distributed machine learning (ML) technique that can protect local data privacy for participating clients and improve system efficiency. Instead of sharing raw data, FL exchanges intermediate learning parameters, such as gradients, among clients. This article presents an efficient wireless communication approach tailored for FL parameter transmission, especially for Internet of Things (IoT) devices, to facilitate model aggregation. Our study considers practical wireless channels that can lead to random bit errors, which can substantially affect FL performance. Motivated by empirical gradient value distribution, we introduce a novel received bit masking method that confines received gradient values within prescribed limits. Moreover, given the intrinsic error resilience of ML gradients, our approach enables the delivery of approximate gradient values with errors without resorting to extensive error correction coding or retransmission. This strategy reduces computational overhead at both the transmitter and the receiver and minimizes communication latency. Consequently, our scheme is particularly well-suited for resource-constrained IoT devices. Additionally, we explore the inherent protection of the most significant bits (MSBs) through gray coding in high-order modulation. Our simulations demonstrate that our proposed scheme can effectively mitigate random bit errors in FL performance, achieving similar learning objectives, but with the 50% air time required by existing methods involving error correction and retransmission. △ Less

Submitted 16 April, 2024; originally announced April 2024.

Comments: submitted to IEEE journals for publication

arXiv:2404.06690 [pdf, other]

CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations

Authors: Leying Zhang, Yao Qian, Long Zhou, Shujie Liu, Dongmei Wang, Xiaofei Wang, Midia Yousefi, Yanmin Qian, Jinyu Li, Lei He, Sheng Zhao, Michael Zeng

Abstract: Recent advancements in zero-shot text-to-speech (TTS) modeling have led to significant strides in generating high-fidelity and diverse speech. However, dialogue generation, along with achieving human-like naturalness in speech, continues to be a challenge. In this paper, we introduce CoVoMix: Conversational Voice Mixture Generation, a novel model for zero-shot, human-like, multi-speaker, multi-rou… ▽ More Recent advancements in zero-shot text-to-speech (TTS) modeling have led to significant strides in generating high-fidelity and diverse speech. However, dialogue generation, along with achieving human-like naturalness in speech, continues to be a challenge. In this paper, we introduce CoVoMix: Conversational Voice Mixture Generation, a novel model for zero-shot, human-like, multi-speaker, multi-round dialogue speech generation. CoVoMix first converts dialogue text into multiple streams of discrete tokens, with each token stream representing semantic information for individual talkers. These token streams are then fed into a flow-matching based acoustic model to generate mixed mel-spectrograms. Finally, the speech waveforms are produced using a HiFi-GAN model. Furthermore, we devise a comprehensive set of metrics for measuring the effectiveness of dialogue modeling and generation. Our experimental results show that CoVoMix can generate dialogues that are not only human-like in their naturalness and coherence but also involve multiple talkers engaging in multiple rounds of conversation. This is exemplified by instances generated in a single channel where one speaker's utterance is seamlessly mixed with another's interjections or laughter, indicating the latter's role as an attentive listener. Audio samples are available at https://aka.ms/covomix. △ Less

Submitted 29 May, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

arXiv:2404.01462 [pdf, other]

OpenChemIE: An Information Extraction Toolkit For Chemistry Literature

Authors: Vincent Fan, Yujie Qian, Alex Wang, Amber Wang, Connor W. Coley, Regina Barzilay

Abstract: Information extraction from chemistry literature is vital for constructing up-to-date reaction databases for data-driven chemistry. Complete extraction requires combining information across text, tables, and figures, whereas prior work has mainly investigated extracting reactions from single modalities. In this paper, we present OpenChemIE to address this complex challenge and enable the extractio… ▽ More Information extraction from chemistry literature is vital for constructing up-to-date reaction databases for data-driven chemistry. Complete extraction requires combining information across text, tables, and figures, whereas prior work has mainly investigated extracting reactions from single modalities. In this paper, we present OpenChemIE to address this complex challenge and enable the extraction of reaction data at the document level. OpenChemIE approaches the problem in two steps: extracting relevant information from individual modalities and then integrating the results to obtain a final list of reactions. For the first step, we employ specialized neural models that each address a specific task for chemistry information extraction, such as parsing molecules or reactions from text or figures. We then integrate the information from these modules using chemistry-informed algorithms, allowing for the extraction of fine-grained reaction data from reaction condition and substrate scope investigations. Our machine learning models attain state-of-the-art performance when evaluated individually, and we meticulously annotate a challenging dataset of reaction schemes with R-groups to evaluate our pipeline as a whole, achieving an F1 score of 69.5%. Additionally, the reaction extraction results of \ours attain an accuracy score of 64.3% when directly compared against the Reaxys chemical database. We provide OpenChemIE freely to the public as an open-source package, as well as through a web interface. △ Less

Submitted 1 April, 2024; originally announced April 2024.

Comments: To be submitted to the Journal of Chemical Information and Modeling

arXiv:2404.00878 [pdf, other]

TryOn-Adapter: Efficient Fine-Grained Clothing Identity Adaptation for High-Fidelity Virtual Try-On

Authors: Jiazheng Xing, Chao Xu, Yijie Qian, Yang Liu, Guang Dai, Baigui Sun, Yong Liu, Jingdong Wang

Abstract: Virtual try-on focuses on adjusting the given clothes to fit a specific person seamlessly while avoiding any distortion of the patterns and textures of the garment. However, the clothing identity uncontrollability and training inefficiency of existing diffusion-based methods, which struggle to maintain the identity even with full parameter training, are significant limitations that hinder the wide… ▽ More Virtual try-on focuses on adjusting the given clothes to fit a specific person seamlessly while avoiding any distortion of the patterns and textures of the garment. However, the clothing identity uncontrollability and training inefficiency of existing diffusion-based methods, which struggle to maintain the identity even with full parameter training, are significant limitations that hinder the widespread applications. In this work, we propose an effective and efficient framework, termed TryOn-Adapter. Specifically, we first decouple clothing identity into fine-grained factors: style for color and category information, texture for high-frequency details, and structure for smooth spatial adaptive transformation. Our approach utilizes a pre-trained exemplar-based diffusion model as the fundamental network, whose parameters are frozen except for the attention layers. We then customize three lightweight modules (Style Preserving, Texture Highlighting, and Structure Adapting) incorporated with fine-tuning techniques to enable precise and efficient identity control. Meanwhile, we introduce the training-free T-RePaint strategy to further enhance clothing identity preservation while maintaining the realistic try-on effect during the inference. Our experiments demonstrate that our approach achieves state-of-the-art performance on two widely-used benchmarks. Additionally, compared with recent full-tuning diffusion-based methods, we only use about half of their tunable parameters during training. The code will be made publicly available at https://github.com/jiazheng-xing/TryOn-Adapter. △ Less

Submitted 31 March, 2024; originally announced April 2024.

arXiv:2404.00875 [pdf, other]

DPA-Net: Structured 3D Abstraction from Sparse Views via Differentiable Primitive Assembly

Authors: Fenggen Yu, Yiming Qian, Xu Zhang, Francisca Gil-Ureta, Brian Jackson, Eric Bennett, Hao Zhang

Abstract: We present a differentiable rendering framework to learn structured 3D abstractions in the form of primitive assemblies from sparse RGB images capturing a 3D object. By leveraging differentiable volume rendering, our method does not require 3D supervision. Architecturally, our network follows the general pipeline of an image-conditioned neural radiance field (NeRF) exemplified by pixelNeRF for col… ▽ More We present a differentiable rendering framework to learn structured 3D abstractions in the form of primitive assemblies from sparse RGB images capturing a 3D object. By leveraging differentiable volume rendering, our method does not require 3D supervision. Architecturally, our network follows the general pipeline of an image-conditioned neural radiance field (NeRF) exemplified by pixelNeRF for color prediction. As our core contribution, we introduce differential primitive assembly (DPA) into NeRF to output a 3D occupancy field in place of density prediction, where the predicted occupancies serve as opacity values for volume rendering. Our network, coined DPA-Net, produces a union of convexes, each as an intersection of convex quadric primitives, to approximate the target 3D object, subject to an abstraction loss and a masking loss, both defined in the image space upon volume rendering. With test-time adaptation and additional sampling and loss designs aimed at improving the accuracy and compactness of the obtained assemblies, our method demonstrates superior performance over state-of-the-art alternatives for 3D primitive abstraction from sparse views. △ Less

Submitted 2 April, 2024; v1 submitted 31 March, 2024; originally announced April 2024.

Comments: 14 pages

arXiv:2404.00589

Harnessing the Power of Large Language Model for Uncertainty Aware Graph Processing

Authors: Zhenyu Qian, Yiming Qian, Yuting Song, Fei Gao, Hai Jin, Chen Yu, Xia Xie

Abstract: Handling graph data is one of the most difficult tasks. Traditional techniques, such as those based on geometry and matrix factorization, rely on assumptions about the data relations that become inadequate when handling large and complex graph data. On the other hand, deep learning approaches demonstrate promising results in handling large graph data, but they often fall short of providing interpr… ▽ More Handling graph data is one of the most difficult tasks. Traditional techniques, such as those based on geometry and matrix factorization, rely on assumptions about the data relations that become inadequate when handling large and complex graph data. On the other hand, deep learning approaches demonstrate promising results in handling large graph data, but they often fall short of providing interpretable explanations. To equip the graph processing with both high accuracy and explainability, we introduce a novel approach that harnesses the power of a large language model (LLM), enhanced by an uncertainty-aware module to provide a confidence score on the generated answer. We experiment with our approach on two graph processing tasks: few-shot knowledge graph completion and graph classification. Our results demonstrate that through parameter efficient fine-tuning, the LLM surpasses state-of-the-art algorithms by a substantial margin across ten diverse benchmark datasets. Moreover, to address the challenge of explainability, we propose an uncertainty estimation based on perturbation, along with a calibration scheme to quantify the confidence scores of the generated answers. Our confidence measure achieves an AUC of 0.8 or higher on seven out of the ten datasets in predicting the correctness of the answer generated by LLM. △ Less

Submitted 12 April, 2024; v1 submitted 31 March, 2024; originally announced April 2024.

Comments: Because my organization does not allow members to privately upload papers to arXiv, I am requesting a withdrawal of my submission

arXiv:2403.17870 [pdf, other]

Boosting Diffusion Models with Moving Average Sampling in Frequency Domain

Authors: Yurui Qian, Qi Cai, Yingwei Pan, Yehao Li, Ting Yao, Qibin Sun, Tao Mei

Abstract: Diffusion models have recently brought a powerful revolution in image generation. Despite showing impressive generative capabilities, most of these models rely on the current sample to denoise the next one, possibly resulting in denoising instability. In this paper, we reinterpret the iterative denoising process as model optimization and leverage a moving average mechanism to ensemble all the prio… ▽ More Diffusion models have recently brought a powerful revolution in image generation. Despite showing impressive generative capabilities, most of these models rely on the current sample to denoise the next one, possibly resulting in denoising instability. In this paper, we reinterpret the iterative denoising process as model optimization and leverage a moving average mechanism to ensemble all the prior samples. Instead of simply applying moving average to the denoised samples at different timesteps, we first map the denoised samples to data space and then perform moving average to avoid distribution shift across timesteps. In view that diffusion models evolve the recovery from low-frequency components to high-frequency details, we further decompose the samples into different frequency components and execute moving average separately on each component. We name the complete approach "Moving Average Sampling in Frequency domain (MASF)". MASF could be seamlessly integrated into mainstream pre-trained diffusion models and sampling schedules. Extensive experiments on both unconditional and conditional diffusion models demonstrate that our MASF leads to superior performances compared to the baselines, with almost negligible additional complexity cost. △ Less

Submitted 26 March, 2024; originally announced March 2024.

Comments: CVPR 2024

arXiv:2403.16002 [pdf, other]

SDSTrack: Self-Distillation Symmetric Adapter Learning for Multi-Modal Visual Object Tracking

Authors: Xiaojun Hou, Jiazheng Xing, Yijie Qian, Yaowei Guo, Shuo Xin, Junhao Chen, Kai Tang, Mengmeng Wang, Zhengkai Jiang, Liang Liu, Yong Liu

Abstract: Multimodal Visual Object Tracking (VOT) has recently gained significant attention due to its robustness. Early research focused on fully fine-tuning RGB-based trackers, which was inefficient and lacked generalized representation due to the scarcity of multimodal data. Therefore, recent studies have utilized prompt tuning to transfer pre-trained RGB-based trackers to multimodal data. However, the m… ▽ More Multimodal Visual Object Tracking (VOT) has recently gained significant attention due to its robustness. Early research focused on fully fine-tuning RGB-based trackers, which was inefficient and lacked generalized representation due to the scarcity of multimodal data. Therefore, recent studies have utilized prompt tuning to transfer pre-trained RGB-based trackers to multimodal data. However, the modality gap limits pre-trained knowledge recall, and the dominance of the RGB modality persists, preventing the full utilization of information from other modalities. To address these issues, we propose a novel symmetric multimodal tracking framework called SDSTrack. We introduce lightweight adaptation for efficient fine-tuning, which directly transfers the feature extraction ability from RGB to other domains with a small number of trainable parameters and integrates multimodal features in a balanced, symmetric manner. Furthermore, we design a complementary masked patch distillation strategy to enhance the robustness of trackers in complex environments, such as extreme weather, poor imaging, and sensor failure. Extensive experiments demonstrate that SDSTrack outperforms state-of-the-art methods in various multimodal tracking scenarios, including RGB+Depth, RGB+Thermal, and RGB+Event tracking, and exhibits impressive results in extreme conditions. Our source code is available at https://github.com/hoqolo/SDSTrack. △ Less

Submitted 27 March, 2024; v1 submitted 24 March, 2024; originally announced March 2024.

Comments: Accepted by CVPR2024

arXiv:2403.13423 [pdf, other]

doi 10.1109/TASLP.2024.3350893

Advanced Long-Content Speech Recognition With Factorized Neural Transducer

Authors: Xun Gong, Yu Wu, Jinyu Li, Shujie Liu, Rui Zhao, Xie Chen, Yanmin Qian

Abstract: In this paper, we propose two novel approaches, which integrate long-content information into the factorized neural transducer (FNT) based architecture in both non-streaming (referred to as LongFNT ) and streaming (referred to as SLongFNT ) scenarios. We first investigate whether long-content transcriptions can improve the vanilla conformer transducer (C-T) models. Our experiments indicate that th… ▽ More In this paper, we propose two novel approaches, which integrate long-content information into the factorized neural transducer (FNT) based architecture in both non-streaming (referred to as LongFNT ) and streaming (referred to as SLongFNT ) scenarios. We first investigate whether long-content transcriptions can improve the vanilla conformer transducer (C-T) models. Our experiments indicate that the vanilla C-T models do not exhibit improved performance when utilizing long-content transcriptions, possibly due to the predictor network of C-T models not functioning as a pure language model. Instead, FNT shows its potential in utilizing long-content information, where we propose the LongFNT model and explore the impact of long-content information in both text (LongFNT-Text) and speech (LongFNT-Speech). The proposed LongFNT-Text and LongFNT-Speech models further complement each other to achieve better performance, with transcription history proving more valuable to the model. The effectiveness of our LongFNT approach is evaluated on LibriSpeech and GigaSpeech corpora, and obtains relative 19% and 12% word error rate reduction, respectively. Furthermore, we extend the LongFNT model to the streaming scenario, which is named SLongFNT , consisting of SLongFNT-Text and SLongFNT-Speech approaches to utilize long-content text and speech information. Experiments show that the proposed SLongFNT model achieves relative 26% and 17% WER reduction on LibriSpeech and GigaSpeech respectively while keeping a good latency, compared to the FNT baseline. Overall, our proposed LongFNT and SLongFNT highlight the significance of considering long-content speech and transcription knowledge for improving both non-streaming and streaming speech recognition systems. △ Less

Submitted 20 March, 2024; originally announced March 2024.

Comments: Accepted by TASLP 2024

Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 1803-1815, 2024

arXiv:2403.11210 [pdf, ps, other]

Error bounds for rank-one DNN reformulation of QAP and DC exact penalty approach

Authors: Yitian Qian, Shaohua Pan, Shujun Bi, Houduo Qi

Abstract: This paper concerns the quadratic assignment problem (QAP), a class of challenging combinatorial optimization problems. We provide an equivalent rank-one doubly nonnegative (DNN) reformulation with fewer equality constraints, and derive the local error bounds for its feasible set. By leveraging these error bounds, we prove that the penalty problem induced by the difference of convexity (DC) reform… ▽ More This paper concerns the quadratic assignment problem (QAP), a class of challenging combinatorial optimization problems. We provide an equivalent rank-one doubly nonnegative (DNN) reformulation with fewer equality constraints, and derive the local error bounds for its feasible set. By leveraging these error bounds, we prove that the penalty problem induced by the difference of convexity (DC) reformulation of the rank-one constraint is a global exact penalty, and so is the penalty problem for its Burer-Monteiro (BM) factorization. As a byproduct, we verify that the penalty problem for the rank-one DNN reformulation proposed in \cite{Jiang21} is a global exact penalty without the calmness assumption. Then, we develop a continuous relaxation approach by seeking approximate stationary points of a finite number of penalty problems for the BM factorization with an augmented Lagrangian method, whose asymptotic convergence certificate is also provided under a mild condition. Numerical comparison with Gurobi for \textbf{131} benchmark instances validates the efficiency of the proposed DC exact penalty approach. △ Less

Submitted 17 March, 2024; originally announced March 2024.

arXiv:2403.01686 [pdf, other]

doi 10.3847/2041-8213/ad319

AT2023lli: A Tidal Disruption Event with Prominent Optical Early Bump and Delayed Episodic X-ray Emission

Authors: Shifeng Huang, Ning Jiang, Jiazheng Zhu, Yibo Wang, Tinggui Wang, Shan-Qin Wang, Wen-Pei Gan, En-Wei Liang, Yu-Jing Qin, Zheyu Lin, Lin-Na Xu, Min-Xuan Cai, Ji-An Jiang, Xu Kong, Jiaxun Li, Long Li, Jian-Guo Wang, Ze-Lin Xu, Yongquan Xue, Ye-Fei Yuan, Jingquan Cheng, Lulu Fan, Jie Gao, Lei Hu, Weida Hu , et al. (20 additional authors not shown)

Abstract: High-cadence, multiwavelength observations have continuously revealed the diversity of tidal disruption events (TDEs), thus greatly advancing our knowledge and understanding of TDEs. In this work, we conducted an intensive optical-UV and X-ray follow-up campaign of TDE AT2023lli, and found a remarkable month-long bump in its UV/optical light curve nearly two months prior to maximum brightness. The… ▽ More High-cadence, multiwavelength observations have continuously revealed the diversity of tidal disruption events (TDEs), thus greatly advancing our knowledge and understanding of TDEs. In this work, we conducted an intensive optical-UV and X-ray follow-up campaign of TDE AT2023lli, and found a remarkable month-long bump in its UV/optical light curve nearly two months prior to maximum brightness. The bump represents the longest separation time from the main peak among known TDEs to date. The main UV/optical outburst declines as $t^{-4.10}$, making it one of the fastest decaying optically selected TDEs. Furthermore, we detected sporadic X-ray emission 30 days after the UV/optical peak, accompanied by a reduction in the period of inactivity. It is proposed that the UV/optical bump could be caused by the self-intersection of the stream debris, whereas the primary peak is generated by the reprocessed emission of the accretion process. In addition, our results suggest that episodic X-ray radiation during the initial phase of decline may be due to the patched obscurer surrounding the accretion disk, a phenomenon associated with the inhomogeneous reprocessing process. The double TDE scenario, in which two stars are disrupted in sequence, is also a possible explanation for producing the observed early bump and main peak. We anticipate that the multicolor light curves of TDEs, especially in the very early stages, and the underlying physics can be better understood in the near future with the assistance of dedicated surveys such as the deep high-cadence survey of the 2.5-meter Wide Field Survey Telescope (WFST). △ Less

Submitted 26 March, 2024; v1 submitted 3 March, 2024; originally announced March 2024.

Comments: 14 pages, 8 figures,accepted for publication by ApJL

arXiv:2403.01257 [pdf]

Secure and Scalable Network Slicing with Plug-and-Play Support for Power Distribution System Communication Networks

Authors: Jian Zhong, Chen Chen, Yuqi Qian, Yiheng Bian, Yuxiong Huang, Zhaohong Bie

Abstract: With the rapid development of power distribution systems (PDSs), the number of terminal devices and the types of delivered services involved are constantly growing. These trends make the operations of PDSs highly dependent on the support of advanced communication networks, which face two related challenges. The first is to provide sufficient flexibility, resilience, and security to meet varying de… ▽ More With the rapid development of power distribution systems (PDSs), the number of terminal devices and the types of delivered services involved are constantly growing. These trends make the operations of PDSs highly dependent on the support of advanced communication networks, which face two related challenges. The first is to provide sufficient flexibility, resilience, and security to meet varying demands and ensure the proper operation of gradually diversifying network services. The second is to realize the automatic identification of terminal devices, thus reducing the network maintenance burden. To solve these problems, this paper presents a novel multiservice network integration and device authentication slice-based network slicing scheme. In this scheme, the integration of PDS communication networks enables network resource sharing, and recovery from communication interruption is achieved through network slicing in the integrated network. Authentication servers periodically poll terminal devices, adjusting network slice ranges based on authentication results, thereby facilitating dynamic network slicing. Additionally, secure plug-and-play support for PDS terminal devices and network protection are achieved through device identification and dynamic adjustment of network slices. On this basis, a network optimization and upgrading methodology for load balancing and robustness enhancement is further proposed. This approach is designed to improve the performance of PDS communication networks, adapting to ongoing PDS development and the evolution of PDS services. The simulation results show that the proposed schemes endow a PDS communication network with favorable resource utilization, fault recovery, terminal device plug-and-play support, load balancing, and improved network robustness. △ Less

Submitted 2 March, 2024; originally announced March 2024.

arXiv:2403.00673 [pdf, other]

Snapshot Reinforcement Learning: Leveraging Prior Trajectories for Efficiency

Authors: Yanxiao Zhao, Yangge Qian, Tianyi Wang, Jingyang Shan, Xiaolin Qin

Abstract: Deep reinforcement learning (DRL) algorithms require substantial samples and computational resources to achieve higher performance, which restricts their practical application and poses challenges for further development. Given the constraint of limited resources, it is essential to leverage existing computational work (e.g., learned policies, samples) to enhance sample efficiency and reduce the c… ▽ More Deep reinforcement learning (DRL) algorithms require substantial samples and computational resources to achieve higher performance, which restricts their practical application and poses challenges for further development. Given the constraint of limited resources, it is essential to leverage existing computational work (e.g., learned policies, samples) to enhance sample efficiency and reduce the computational resource consumption of DRL algorithms. Previous works to leverage existing computational work require intrusive modifications to existing algorithms and models, designed specifically for specific algorithms, lacking flexibility and universality. In this paper, we present the Snapshot Reinforcement Learning (SnapshotRL) framework, which enhances sample efficiency by simply altering environments, without making any modifications to algorithms and models. By allowing student agents to choose states in teacher trajectories as the initial state to sample, SnapshotRL can effectively utilize teacher trajectories to assist student agents in training, allowing student agents to explore a larger state space at the early training phase. We propose a simple and effective SnapshotRL baseline algorithm, S3RL, which integrates well with existing DRL algorithms. Our experiments demonstrate that integrating S3RL with TD3, SAC, and PPO algorithms on the MuJoCo benchmark significantly improves sample efficiency and average return, without extra samples and additional computational resources. △ Less

Submitted 12 March, 2024; v1 submitted 1 March, 2024; originally announced March 2024.

Comments: Under review

arXiv:2403.00645 [pdf, ps, other]

doi 10.1016/j.jai.2024.03.004

Event-Triggered Robust Cooperative Output Regulation for a Class of Linear Multi-Agent Systems with an Unknown Exosystem

Authors: Yangyang Qian, Lu Liu

Abstract: This paper investigates the robust cooperative output regulation problem for a class of heterogeneous uncertain linear multi-agent systems with an unknown exosystem via event-triggered control (ETC). By utilizing the internal model approach and the adaptive control technique, a distributed adaptive internal model is constructed for each agent. Then, based on this internal model, a fully distribute… ▽ More This paper investigates the robust cooperative output regulation problem for a class of heterogeneous uncertain linear multi-agent systems with an unknown exosystem via event-triggered control (ETC). By utilizing the internal model approach and the adaptive control technique, a distributed adaptive internal model is constructed for each agent. Then, based on this internal model, a fully distributed ETC strategy composed of a distributed event-triggered adaptive output feedback control law and a distributed dynamic event-triggering mechanism is proposed, in which each agent updates its control input at its own triggering time instants. It is shown that under the proposed ETC strategy, the robust cooperative output regulation problem can be solved without requiring either the global information associated with the communication topology or the bounds of the uncertain or unknown parameters in each agent and the exosystem. A numerical example is provided to illustrate the effectiveness of the proposed control strategy. △ Less

Submitted 1 March, 2024; originally announced March 2024.

Comments: 13 pages, 8 figures

arXiv:2402.15173 [pdf, other]

Second-Order Fine-Tuning without Pain for LLMs:A Hessian Informed Zeroth-Order Optimizer

Authors: Yanjun Zhao, Sizhe Dang, Haishan Ye, Guang Dai, Yi Qian, Ivor W. Tsang

Abstract: Fine-tuning large language models (LLMs) with classic first-order optimizers entails prohibitive GPU memory due to the backpropagation process. Recent works have turned to zeroth-order optimizers for fine-tuning, which save substantial memory by using two forward passes. However, these optimizers are plagued by the heterogeneity of parameter curvatures across different dimensions. In this work, we… ▽ More Fine-tuning large language models (LLMs) with classic first-order optimizers entails prohibitive GPU memory due to the backpropagation process. Recent works have turned to zeroth-order optimizers for fine-tuning, which save substantial memory by using two forward passes. However, these optimizers are plagued by the heterogeneity of parameter curvatures across different dimensions. In this work, we propose HiZOO, a diagonal Hessian informed zeroth-order optimizer which is the first work to leverage the diagonal Hessian to enhance zeroth-order optimizer for fine-tuning LLMs. What's more, HiZOO avoids the expensive memory cost and only increases one forward pass per step. Extensive experiments on various models (350M~66B parameters) indicate that HiZOO improves model convergence, significantly reducing training steps and effectively enhancing model accuracy. Moreover, we visualize the optimization trajectories of HiZOO on test functions, illustrating its effectiveness in handling heterogeneous curvatures. Lastly, we provide theoretical proofs of convergence for HiZOO. Code is publicly available at https://anonymous.4open.science/r/HiZOO27F8. △ Less

Submitted 23 February, 2024; originally announced February 2024.

arXiv:2402.13220 [pdf, other]

How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts

Authors: Yusu Qian, Haotian Zhang, Yinfei Yang, Zhe Gan

Abstract: The remarkable advancements in Multimodal Large Language Models (MLLMs) have not rendered them immune to challenges, particularly in the context of handling deceptive information in prompts, thus producing hallucinated responses under such conditions. To quantitatively assess this vulnerability, we present MAD-Bench, a carefully curated benchmark that contains 850 test samples divided into 6 categ… ▽ More The remarkable advancements in Multimodal Large Language Models (MLLMs) have not rendered them immune to challenges, particularly in the context of handling deceptive information in prompts, thus producing hallucinated responses under such conditions. To quantitatively assess this vulnerability, we present MAD-Bench, a carefully curated benchmark that contains 850 test samples divided into 6 categories, such as non-existent objects, count of objects, spatial relationship, and visual confusion. We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4V, Gemini-Pro, to open-sourced models, such as LLaVA-1.5 and CogVLM. Empirically, we observe significant performance gaps between GPT-4V and other models; and previous robust instruction-tuned models, such as LRV-Instruction and LLaVA-RLHF, are not effective on this new benchmark. While GPT-4V achieves 75.02% accuracy on MAD-Bench, the accuracy of any other model in our experiments ranges from 5% to 35%. We further propose a remedy that adds an additional paragraph to the deceptive prompts to encourage models to think twice before answering the question. Surprisingly, this simple method can even double the accuracy; however, the absolute numbers are still too low to be satisfactory. We hope MAD-Bench can serve as a valuable benchmark to stimulate further research to enhance models' resilience against deceptive prompts. △ Less

Submitted 20 February, 2024; originally announced February 2024.

arXiv:2402.08875 [pdf, other]

Advancing Human Action Recognition with Foundation Models trained on Unlabeled Public Videos

Authors: Yang Qian, Yinan Sun, Ali Kargarandehkordi, Parnian Azizian, Onur Cezmi Mutlu, Saimourya Surabhi, Pingyi Chen, Zain Jabbar, Dennis Paul Wall, Peter Washington

Abstract: The increasing variety and quantity of tagged multimedia content on a variety of online platforms offer a unique opportunity to advance the field of human action recognition. In this study, we utilize 283,582 unique, unlabeled TikTok video clips, categorized into 386 hashtags, to train a domain-specific foundation model for action recognition. We employ VideoMAE V2, an advanced model integrating M… ▽ More The increasing variety and quantity of tagged multimedia content on a variety of online platforms offer a unique opportunity to advance the field of human action recognition. In this study, we utilize 283,582 unique, unlabeled TikTok video clips, categorized into 386 hashtags, to train a domain-specific foundation model for action recognition. We employ VideoMAE V2, an advanced model integrating Masked Autoencoders (MAE) with Vision Transformers (ViT), pre-trained on this diverse collection of unstructured videos. Our model, fine-tuned on established action recognition benchmarks such as UCF101 and HMDB51, achieves state-of-the-art results: 99.05% on UCF101, 86.08% on HMDB51, 85.51% on Kinetics-400, and 74.27% on Something-Something V2 using the ViT-giant backbone. These results highlight the potential of using unstructured and unlabeled videos as a valuable source of diverse and dynamic content for training foundation models. Our investigation confirms that while initial increases in pre-training data volume significantly enhance model performance, the gains diminish as the dataset size continues to expand. Our findings emphasize two critical axioms in self-supervised learning for computer vision: (1) additional pre-training data can yield diminishing benefits for some datasets and (2) quality is more important than quantity in self-supervised learning, especially when building foundation models. △ Less

Submitted 15 July, 2024; v1 submitted 13 February, 2024; originally announced February 2024.

Comments: 10 pages

Showing 1–50 of 636 results for author: Qian, Y