
FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models

Authors: Pengxiang Li, Zhi Gao, Bofei Zhang, Tao Yuan, Yuwei Wu, Mehrtash Harandi, Yunde Jia, Song-Chun Zhu, Qing Li

Abstract: Vision language models (VLMs) have achieved impressive progress in diverse applications, becoming a prevalent research direction. In this paper, we build FIRE, a feedback-refinement dataset, consisting of 1.1M multi-turn conversations that are derived from 27 source datasets, empowering VLMs to spontaneously refine their responses based on user feedback across diverse tasks. To scale up the data collection, FIRE is collected in two components: FIRE-100K and FIRE-1M, where FIRE-100K is generated by GPT-4V, and FIRE-1M is freely generated via models trained on FIRE-100K. Then, we build FIRE-Bench, a benchmark to comprehensively evaluate the feedback-refining capability of VLMs, which contains 11K feedback-refinement conversations as the test data, two evaluation settings, and a model to provide feedback for VLMs. We develop the FIRE-LLaVA model by fine-tuning LLaVA on FIRE-100K and FIRE-1M, which shows remarkable feedback-refining capability on FIRE-Bench and outperforms untrained VLMs by 50%, making more efficient user-agent interactions and underscoring the significance of the FIRE dataset.

Submitted 16 July, 2024; originally announced July 2024.

arXiv:2407.11502 [pdf, other]

How Control Information Influences Multilingual Text Image Generation and Editing?

Authors: Boqiang Zhang, Zuan Gao, Yadong Qu, Hongtao Xie

Abstract: Visual text generation has significantly advanced through diffusion models aimed at producing images with readable and realistic text. Recent works primarily use a ControlNet-based framework, employing standard font text images to control diffusion models. Recognizing the critical role of control information in generating high-quality text, we investigate its influence from three perspectives: input encoding, role at different stages, and output features. Our findings reveal that: 1) Input control information has unique characteristics compared to conventional inputs like Canny edges and depth maps. 2) Control information plays distinct roles at different stages of the denoising process. 3) Output control features significantly differ from the base and skip features of the U-Net decoder in the frequency domain. Based on these insights, we propose TextGen, a novel framework designed to enhance generation quality by optimizing control information. We improve input and output features using Fourier analysis to emphasize relevant information and reduce noise. Additionally, we employ a two-stage generation framework to align the different roles of control information at different stages. Furthermore, we introduce an effective and lightweight dataset for training. Our method achieves state-of-the-art performance in both Chinese and English text generation. The code and dataset will be made available.

Submitted 16 July, 2024; originally announced July 2024.

arXiv:2407.10892 [pdf, other]

First Measurement of Solar $^8$B Neutrino Flux through Coherent Elastic Neutrino-Nucleus Scattering in PandaX-4T

Authors: PandaX Collaboration, Zihao Bo, Wei Chen, Xun Chen, Yunhua Chen, Zhaokan Cheng, Xiangyi Cui, Yingjie Fan, Deqing Fang, Zhixing Gao, Lisheng Geng, Karl Giboni, Xunan Guo, Xuyuan Guo, Zichao Guo, Chencheng Han, Ke Han, Changda He, Jinrong He, Di Huang, Houqi Huang, Junting Huang, Ruquan Hou, Yu Hou, Xiangdong Ji, et al. (77 additional authors not shown)

Abstract: The PandaX-4T liquid xenon detector at the China Jinping Underground Laboratory is used to measure the solar $^8$B neutrino flux by detecting neutrinos through coherent scattering with xenon nuclei. Data samples requiring the coincidence of scintillation and ionization signals (paired), as well as unpaired ionization-only signals (US2), are selected with energy threshold of approximately 1.1 keV (0.33 keV) nuclear recoil energy. Combining the commissioning run and the first science run of PandaX-4T, a total exposure of 1.25 and 1.04 tonne$\cdot$year are collected for the paired and US2, respectively. After unblinding, 3 and 332 events are observed with an expectation of 2.8$\pm$0.5 and 251$\pm$32 background events, for the paired and US2 data, respectively. A combined analysis yields a best-fit $^8$B neutrino signal of 3.5 (75) events from the paired (US2) data sample, with $\sim$37\% uncertainty, and the background-only hypothesis is disfavored at 2.64$\sigma$ significance. This gives a solar $^8$B neutrino flux of ($8.4\pm3.1$)$\times$10$^6$ cm$^{-2}$s$^{-1}$, consistent with the standard solar model prediction. This is the first indication of solar $^8$B neutrino ``fog'' in a dark matter direct detection experiment.

Submitted 15 July, 2024; originally announced July 2024.

arXiv:2407.10459 [pdf, other]

DiffStega: Towards Universal Training-Free Coverless Image Steganography with Diffusion Models

Authors: Yiwei Yang, Zheyuan Liu, Jun Jia, Zhongpai Gao, Yunhao Li, Wei Sun, Xiaohong Liu, Guangtao Zhai

Abstract: Traditional image steganography focuses on concealing one image within another, aiming to avoid steganalysis by unauthorized entities. Coverless image steganography (CIS) enhances imperceptibility by not using any cover image. Recent works have utilized text prompts as keys in CIS through diffusion models. However, this approach faces three challenges: invalidated when private prompt is guessed, crafting public prompts for semantic diversity, and the risk of prompt leakage during frequent transmission. To address these issues, we propose DiffStega, an innovative training-free diffusion-based CIS strategy for universal application. DiffStega uses a password-dependent reference image as an image prompt alongside the text, ensuring that only authorized parties can retrieve the hidden information. Furthermore, we develop the Noise Flip technique to further secure the steganography against unauthorized decryption. To comprehensively assess our method across general CIS tasks, we create a dataset comprising various image steganography instances. Experiments indicate substantial improvements in our method over existing ones, particularly in aspects of versatility, password sensitivity, and recovery quality. Codes are available at https://github.com/evtricks/DiffStega.

Submitted 15 July, 2024; originally announced July 2024.

Comments: 9 pages, 7 figures; reference added; accepted at IJCAI2024 main track

arXiv:2407.09922 [pdf]

Transcranial low-level laser stimulation in near infrared-II region for brain safety and protection

Authors: Zhilin Li, Yongheng Zhao, Yiqing Hu, Yang Li, Keyao Zhang, Zhibing Gao, Lirou Tan, Hanli Liu, Xiaoli Li, Aihua Cao, Zaixu Cui, Chenguang Zhao

Abstract: Background: The use of near-infrared lasers for transcranial photobiomodulation (tPBM) offers a non-invasive method for influencing brain activity and is beneficial for various neurological conditions. Objective: To investigate the safety and neuroprotective properties of tPBM using near-infrared (NIR)-II laser stimulation. Methods: We conducted thirteen experiments involving multidimensional and quantitative methods and measured serum neurobiomarkers, performed electroencephalogram (EEG) and magnetic resonance imaging (MRI) scans, assessed executive functions, and collected a subjective questionnaire. Results: Significant reductions (n=15) in neuron specific enolase (NSE) levels were observed after treatment, indicating neuroprotective effects. No structural or functional brain abnormalities were observed, confirming the safety of tPBM. Additionally, cognitive and executive functions were not impaired, with participants' feedback indicating minimal discomfort. Conclusions: Our data indicate that NIR-II tPBM is safe with specific parameters, highlighting its potential for brain protection.

Submitted 13 July, 2024; originally announced July 2024.

arXiv:2407.09738 [pdf, other]

Sparse Asymptotic PCA: Identifying Sparse Latent Factors Across Time Horizon

Authors: Zhaoxing Gao

Abstract: This paper proposes a novel method for sparse latent factor modeling using a new sparse asymptotic Principal Component Analysis (APCA). This approach analyzes the co-movements of large-dimensional panel data systems over time horizons within a general approximate factor model framework. Unlike existing sparse factor modeling approaches based on sparse PCA, which assume sparse loading matrices, our sparse APCA assumes that factor processes are sparse over the time horizon, while the corresponding loading matrices are not necessarily sparse. This development is motivated by the observation that the assumption of sparse loadings may not be appropriate for financial returns, where exposure to market factors is generally universal and non-sparse. We propose a truncated power method to estimate the first sparse factor process and a sequential deflation method for multi-factor cases. Additionally, we develop a data-driven approach to identify the sparsity of risk factors over the time horizon using a novel cross-sectional cross-validation method. Theoretically, we establish that our estimators are consistent under mild conditions. Monte Carlo simulations demonstrate that the proposed method performs well in finite samples. Empirically, we analyze daily stock returns for a balanced panel of S&P 500 stocks from January 2004 to December 2016. Through textual analysis, we examine specific events associated with the identified sparse factors that systematically influence the stock market. Our approach offers a new pathway for economists to study and understand the systematic risks of economic and financial systems over time.
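The truncated power method mentioned in the abstract can be illustrated with a generic sketch. This is the textbook truncated power iteration for a k-sparse leading eigenvector, not the paper's estimator: the function name `truncated_power_method`, the uniform starting vector, and the fixed iteration count are assumptions made for illustration, and the input `A` is assumed to be a small symmetric matrix (for example, a T x T cross-product matrix of the panel).

```python
import math

def truncated_power_method(A, k, iters=100):
    """Approximate a k-sparse leading eigenvector of a symmetric matrix A.

    A is a list of lists; k is the number of nonzero entries to keep.
    Illustrative only: initialization and stopping rule are assumptions.
    """
    n = len(A)
    x = [1.0 / math.sqrt(n)] * n  # uniform unit-norm starting vector
    for _ in range(iters):
        # Power step: y = A x
        y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        # Truncation step: keep only the k entries of largest magnitude
        keep = set(sorted(range(n), key=lambda i: abs(y[i]), reverse=True)[:k])
        y = [y[i] if i in keep else 0.0 for i in range(n)]
        # Renormalize to unit length
        norm = math.sqrt(sum(v * v for v in y))
        if norm == 0.0:
            break
        x = [v / norm for v in y]
    return x

# Toy matrix with a strong 2x2 block in the first two coordinates
A = [
    [4.0, 3.0, 0.1, 0.1],
    [3.0, 4.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.0],
    [0.1, 0.1, 0.0, 1.0],
]
v = truncated_power_method(A, k=2)
```

On this toy matrix the iteration concentrates all weight on the first two coordinates. Multi-factor extensions would additionally need a deflation step, which is not shown here.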

Submitted 12 July, 2024; originally announced July 2024.

Comments: 66 pages, 6 figures

arXiv:2407.09694 [pdf, other]

Divide and Fuse: Body Part Mesh Recovery from Partially Visible Human Images

Authors: Tianyu Luan, Zhongpai Gao, Luyuan Xie, Abhishek Sharma, Hao Ding, Benjamin Planche, Meng Zheng, Ange Lou, Terrence Chen, Junsong Yuan, Ziyan Wu

Abstract: We introduce a novel bottom-up approach for human body mesh reconstruction, specifically designed to address the challenges posed by partial visibility and occlusion in input images. Traditional top-down methods, relying on whole-body parametric models like SMPL, falter when only a small part of the human is visible, as they require visibility of most of the human body for accurate mesh reconstruction. To overcome this limitation, our method employs a "Divide and Fuse (D&F)" strategy, reconstructing human body parts independently before fusing them, thereby ensuring robustness against occlusions. We design Human Part Parametric Models (HPPM) that independently reconstruct the mesh from a few shape and global-location parameters, without inter-part dependency. A specially designed fusion module then seamlessly integrates the reconstructed parts, even when only a few are visible. We harness a large volume of ground-truth SMPL data to train our parametric mesh models. To facilitate the training and evaluation of our method, we have established benchmark datasets featuring images of partially visible humans with HPPM annotations. Our experiments, conducted on these benchmark datasets, demonstrate the effectiveness of our D&F method, particularly in scenarios with substantial invisibility, where traditional approaches struggle to maintain reconstruction quality.

Submitted 12 July, 2024; originally announced July 2024.

Comments: Accepted by ECCV2024

arXiv:2407.06152 [pdf, other]

Uni-ELF: A Multi-Level Representation Learning Framework for Electrolyte Formulation Design

Authors: Boshen Zeng, Sian Chen, Xinxin Liu, Changhong Chen, Bin Deng, Xiaoxu Wang, Zhifeng Gao, Yuzhi Zhang, Weinan E, Linfeng Zhang

Abstract: Advancements in lithium battery technology heavily rely on the design and engineering of electrolytes. However, current schemes for molecular design and recipe optimization of electrolytes lack an effective computational-experimental closed loop and often fall short in accurately predicting diverse electrolyte formulation properties. In this work, we introduce Uni-ELF, a novel multi-level representation learning framework to advance electrolyte design. Our approach involves two-stage pretraining: reconstructing three-dimensional molecular structures at the molecular level using the Uni-Mol model, and predicting statistical structural properties (e.g., radial distribution functions) from molecular dynamics simulations at the mixture level. Through this comprehensive pretraining, Uni-ELF is able to capture intricate molecular and mixture-level information, which significantly enhances its predictive capability. As a result, Uni-ELF substantially outperforms state-of-the-art methods in predicting both molecular properties (e.g., melting point, boiling point, synthesizability) and formulation properties (e.g., conductivity, Coulombic efficiency). Moreover, Uni-ELF can be seamlessly integrated into an automatic experimental design workflow. We believe this innovative framework will pave the way for automated AI-based electrolyte design and engineering.

Submitted 8 July, 2024; originally announced July 2024.

arXiv:2407.05407 [pdf, other]

CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Authors: Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhifu Gao, Zhijie Yan

Abstract: Recent years have witnessed a trend that large language model (LLM) based text-to-speech (TTS) emerges into the mainstream due to their high naturalness and zero-shot capacity. In this paradigm, speech signals are discretized into token sequences, which are modeled by an LLM with text as prompts and reconstructed by a token-based vocoder to waveforms. Obviously, speech tokens play a critical role in LLM-based TTS models. Current speech tokens are learned in an unsupervised manner, which lacks explicit semantic information and alignment to the text. In this paper, we propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder. Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis. Experimental results show that supervised semantic tokens significantly outperform existing unsupervised tokens in terms of content consistency and speaker similarity for zero-shot voice cloning. Moreover, we find that utilizing large-scale data further improves the synthesis performance, indicating the scalable capacity of CosyVoice. To the best of our knowledge, this is the first attempt to involve supervised speech tokens into TTS models.

Submitted 9 July, 2024; v1 submitted 7 July, 2024; originally announced July 2024.

Comments: work in progress. arXiv admin note: substantial text overlap with arXiv:2407.04051

arXiv:2407.05395 [pdf, other]

Quantifying angular distributions in multinucleon transfer reactions with a semi-classical method

Authors: Zehong Liao, Zepeng Gao, Yu Yang, Yueping Fang, Jun Su, Long Zhu

Abstract: The multinucleon transfer (MNT) process in low-energy heavy ion collisions can be utilized to produce unknown nuclei far beyond the stability line. However, the reaction products exhibit broad angular and energy distributions, which could lower the experimental detection efficiency. We present a classical approach that employs a parameterized angular distribution to describe the complex issue. By analyzing limited experimental data on angular distribution, we proposed a three-parameter formula to calculate the angular distribution and identified the dependencies of the parameters. We also discuss the sensitivity of these parameters within this method. A comprehensive comparison with microscopic models and experimental data across a wide range of conditions is conducted. The proposed formula offers an efficient and straightforward way to determine the angular distribution of MNT products.

Submitted 7 July, 2024; originally announced July 2024.

Comments: 6 pages, 6 figures

arXiv:2407.04405 [pdf, other]

Discovering symbolic expressions with parallelized tree search

Authors: Kai Ruan, Ze-Feng Gao, Yike Guo, Hao Sun, Ji-Rong Wen, Yang Liu

Abstract: Symbolic regression plays a crucial role in modern scientific research thanks to its capability of discovering concise and interpretable mathematical expressions from data. A grand challenge lies in the arduous search for parsimonious and generalizable mathematical formulas, in an infinite search space, while intending to fit the training data. Existing algorithms have faced a critical bottleneck of accuracy and efficiency over a decade when handling problems of complexity, which essentially hinders the pace of applying symbolic regression for scientific exploration across interdisciplinary domains. To this end, we introduce a parallelized tree search (PTS) model to efficiently distill generic mathematical expressions from limited data. Through a series of extensive experiments, we demonstrate the superior accuracy and efficiency of PTS for equation discovery, which greatly outperforms the state-of-the-art baseline models on over 80 synthetic and experimental datasets (e.g., lifting its performance by up to 99% accuracy improvement and one-order of magnitude speed up). PTS represents a key advance in accurate and efficient data-driven discovery of symbolic, interpretable models (e.g., underlying physical laws) and marks a pivotal transition towards scalable symbolic learning.

Submitted 5 July, 2024; originally announced July 2024.

ACM Class: I.2

arXiv:2407.04100 [pdf, other]

C$^3$DG: Conditional Domain Generalization for Hyperspectral Imagery Classification with Convergence and Constrained-risk Theories

Authors: Zhe Gao, Bin Pan, Zhenwei Shi

Abstract: Hyperspectral imagery (HSI) classification may suffer the challenge of hyperspectral-monospectra, where different classes present similar spectra. Joint spatial-spectral feature extraction is a popular solution for the problem, but this strategy tends to inflate accuracy since test pixels may exist in training patches. Domain generalization methods show promising potential, but they still fail to distinguish similar spectra across varying domains; in addition, the theoretical support is usually ignored. In this paper, we only rely on spectral information to solve the hyperspectral-monospectra problem, and propose a Convergence and Error-Constrained Conditional Domain Generalization method for Hyperspectral Imagery Classification (C$^3$DG). The major contributions of this paper include two aspects: the Conditional Revising Inference Block (CRIB), and the corresponding theories for model convergence and generalization errors. CRIB is the kernel structure of the proposed method, which employs a shared encoder and multi-branch decoders to fully leverage the conditional distribution during training, achieving a decoupling that aligns with the generation mechanisms of HSI. Moreover, to ensure model convergence and maintain controllable error, we propose the optimization convergence theorem and risk upper bound theorem. In the optimization convergence theorem, we ensure the model convergence by demonstrating that the gradients of the loss terms are not contradictory. In the risk upper bound theorem, our theoretical analysis explores the relationship between test-time training and recent related work to establish a concrete bound for error. Experimental results on three benchmark datasets indicate the superiority of C$^3$DG.

Submitted 4 July, 2024; originally announced July 2024.

arXiv:2407.04051 [pdf, other]

FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

Authors: Keyu An, Qian Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Yue Gu, Ting He, Hangrui Hu, Kai Hu, Shengpeng Ji, Yabin Li, Zerui Li, Heng Lu, Haoneng Luo, Xiang Lv, Bin Ma, Ziyang Ma, Chongjia Ni, Changhe Song, Jiaqi Shi, Xian Shi, Hao Wang, Wen Wang, Yuxuan Wang, et al. (8 additional authors not shown)

Abstract: This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, speaking style, and speaker identity. SenseVoice-Small delivers exceptionally low-latency ASR for 5 languages, and SenseVoice-Large supports high-precision ASR for over 50 languages, while CosyVoice excels in multi-lingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction-following capabilities. The models related to SenseVoice and CosyVoice have been open-sourced on Modelscope and Huggingface, along with the corresponding training, inference, and fine-tuning codes released on GitHub. By integrating these models with LLMs, FunAudioLLM enables applications such as speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration, thereby pushing the boundaries of voice interaction technology. Demos are available at https://fun-audio-llm.github.io, and the code can be accessed at https://github.com/FunAudioLLM.

Submitted 10 July, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

Comments: Work in progress. Authors are listed in alphabetical order by family name

arXiv:2407.02893 [pdf, other]

An Uncertainty-guided Tiered Self-training Framework for Active Source-free Domain Adaptation in Prostate Segmentation

Authors: Zihao Luo, Xiangde Luo, Zijun Gao, Guotai Wang

Abstract: Deep learning models have exhibited remarkable efficacy in accurately delineating the prostate for diagnosis and treatment of prostate diseases, but challenges persist in achieving robust generalization across different medical centers. Source-free Domain Adaptation (SFDA) is a promising technique to adapt deep segmentation models to address privacy and security concerns while reducing domain shifts between source and target domains. However, recent literature indicates that the performance of SFDA remains far from satisfactory due to unpredictable domain gaps. Annotating a few target domain samples is acceptable, as it can lead to significant performance improvement with a low annotation cost. Nevertheless, due to extremely limited annotation budgets, careful consideration is needed in selecting samples for annotation. Inspired by this, our goal is to develop Active Source-free Domain Adaptation (ASFDA) for medical image segmentation. Specifically, we propose a novel Uncertainty-guided Tiered Self-training (UGTST) framework that combines efficient active sample selection, via entropy-based primary local peak filtering to aggregate global uncertainty and a diversity-aware redundancy filter, with a tiered self-learning strategy to achieve stable domain adaptation. Experimental results on cross-center prostate MRI segmentation datasets revealed that our method yielded marked advancements, with a mere 5% annotation, exhibiting an average Dice score enhancement of 9.78% and 7.58% in two target domains compared with state-of-the-art methods, on par with fully supervised learning. Code is available at: https://github.com/HiLab-git/UGTST
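As a rough illustration of entropy-based sample selection, the sketch below ranks predicted foreground-probability maps by their mean pixel-wise binary entropy and picks the most uncertain ones for annotation. It deliberately omits the local peak filtering and the diversity-aware redundancy filter described in the abstract; the function names and the list-of-lists map format are assumptions for illustration only.

```python
import math

def binary_entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a Bernoulli foreground probability p."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def mean_entropy(prob_map):
    """Average pixel-wise entropy of a 2D probability map (list of rows)."""
    values = [binary_entropy(p) for row in prob_map for p in row]
    return sum(values) / len(values)

def select_most_uncertain(prob_maps, budget):
    """Return indices of the `budget` maps with the highest mean entropy."""
    scores = [mean_entropy(m) for m in prob_maps]
    order = sorted(range(len(prob_maps)), key=lambda i: scores[i], reverse=True)
    return order[:budget]

# Two hypothetical predictions: one confident, one ambiguous
confident = [[0.01, 0.99], [0.02, 0.98]]
ambiguous = [[0.50, 0.55], [0.45, 0.50]]
picked = select_most_uncertain([confident, ambiguous], budget=1)
```

Here the ambiguous map (index 1) is selected, since probabilities near 0.5 carry the highest entropy.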

Submitted 4 July, 2024; v1 submitted 3 July, 2024; originally announced July 2024.

Comments: 11 pages, 3 figures, 2 tables, accepted to MICCAI 2024

arXiv:2407.02814 [pdf, other]

Images Speak Louder than Words: Understanding and Mitigating Bias in Vision-Language Model from a Causal Mediation Perspective

Authors: Zhaotian Weng, Zijun Gao, Jerone Andrews, Jieyu Zhao

Abstract: Vision-language models (VLMs) pre-trained on extensive datasets can inadvertently learn biases by correlating gender information with specific objects or scenarios. Current methods, which focus on modifying inputs and monitoring changes in the model's output probability scores, often struggle to comprehensively understand bias from the perspective of model components. We propose a framework that incorporates causal mediation analysis to measure and map the pathways of bias generation and propagation within VLMs. This approach allows us to identify the direct effects of interventions on model bias and the indirect effects of interventions on bias mediated through different model components. Our results show that image features are the primary contributors to bias, with significantly higher impacts than text features, specifically accounting for 32.57% and 12.63% of the bias in the MSCOCO and PASCAL-SENTENCE datasets, respectively. Notably, the image encoder's contribution surpasses that of the text encoder and the deep fusion encoder. Further experimentation confirms that contributions from both language and vision modalities are aligned and non-conflicting. Consequently, focusing on blurring gender representations within the image encoder, which contributes most to the model bias, reduces bias efficiently by 22.03% and 9.04% in the MSCOCO and PASCAL-SENTENCE datasets, respectively, with minimal performance loss or increased computational demands.

Submitted 3 July, 2024; originally announced July 2024.

ACM Class: I.2.7

arXiv:2407.01926 [pdf]

Chemical Shift Encoding based Double Bonds Quantification in Triglycerides using Deep Image Prior

Authors: Chaoxing Huang, Ziqiang Yu, Zijian Gao, Qiuyi Shen, Queenie Chan, Vincent Wai-Sun Wong, Winnie Chiu-Wing Chu, Weitian Chen

Abstract: This study evaluated a deep learning-based method using Deep Image Prior (DIP) to quantify triglyceride double bonds from chemical-shift encoded multi-echo gradient echo images without network training. We employed a cost function based on signal constraints to iteratively update the neural network on a single dataset. The method was validated using phantom experiments and in vivo scans. Results showed close alignment between measured and reference double bond values, with phantom experiments yielding a Pearson correlation coefficient of 0.96 (p = .0005). In vivo results demonstrated good agreement in subcutaneous fat. We conclude that Deep Image Prior shows feasibility for quantifying double bonds and fatty acid content from chemical-shift encoded multi-echo MRI.

Submitted 3 July, 2024; v1 submitted 1 July, 2024; originally announced July 2024.

arXiv:2407.01517 [pdf, other]

Centerline Boundary Dice Loss for Vascular Segmentation

Authors: Pengcheng Shi, Jiesi Hu, Yanwu Yang, Zilve Gao, Wei Liu, Ting Ma

Abstract: Vascular segmentation in medical imaging plays a crucial role in analysing morphological and functional assessments. Traditional methods, like the centerline Dice (clDice) loss, ensure topology preservation but falter in capturing geometric details, especially under translation and deformation. The combination of clDice with traditional Dice loss can lead to diameter imbalance, favoring larger vessels. Addressing these challenges, we introduce the centerline boundary Dice (cbDice) loss function, which harmonizes topological integrity and geometric nuances, ensuring consistent segmentation across various vessel sizes. cbDice enriches the clDice approach by including boundary-aware aspects, thereby improving geometric detail recognition. It matches the performance of the boundary difference over union (B-DoU) loss through a mask-distance-based approach, enhancing translation sensitivity. Crucially, cbDice incorporates radius information from vascular skeletons, enabling uniform adaptation to vascular diameter changes and maintaining balance in branch growth and fracture impacts. Furthermore, we conducted a theoretical analysis of clDice variants (cl-X-Dice). We validated cbDice's efficacy on three diverse vascular segmentation datasets, encompassing both 2D and 3D, and binary and multi-class segmentation. Particularly, the method integrated with cbDice demonstrated outstanding performance on the MICCAI 2023 TopCoW Challenge dataset. Our code is made publicly available at: https://github.com/PengchengShi1220/cbDice.

Submitted 1 July, 2024; originally announced July 2024.

Comments: accepted by MICCAI 2024

arXiv:2407.01304 [pdf, ps, other]

Heights and periods of algebraic cycles in families

Authors: Ziyang Gao, Shou-Wu Zhang

Abstract: We consider the Beilinson--Bloch heights and Abel--Jacobi periods of homologically trivial Chow cycles in families. For the Beilinson--Bloch heights, we show that for any $g\ge 2$, there is a Zariski open dense subset $U$ of $\mathcal{M}_g$, the coarse moduli of curves of genus $g$ over the rationals, such that the heights of Ceresa cycles and Gross--Schoen cycles over $U$ satisfy the Northcott property. For the Abel--Jacobi periods, we provide an algebraic criterion for the existence of a Zariski open dense subset of any family such that all cycles not defined over $\overline{\mathbb{Q}}$ are non-torsion and verify that this criterion holds for Ceresa cycles and Gross--Schoen cycles.

Submitted 1 July, 2024; originally announced July 2024.

Comments: Comments are welcome

arXiv:2407.01220 [pdf, other]

Fast and Efficient: Mask Neural Fields for 3D Scene Segmentation

Authors: Zihan Gao, Lingling Li, Licheng Jiao, Fang Liu, Xu Liu, Wenping Ma, Yuwei Guo, Shuyuan Yang

Abstract: Understanding 3D scenes is a crucial challenge in computer vision research with applications spanning multiple domains. Recent advancements in distilling 2D vision-language foundation models into neural fields, like NeRF and 3DGS, enable open-vocabulary segmentation of 3D scenes from 2D multi-view images without the need for precise 3D annotations. While effective, the per-pixel distillation of high-dimensional CLIP features introduces ambiguity and necessitates complex regularization strategies, adding inefficiencies during training. This paper presents MaskField, which enables fast and efficient 3D open-vocabulary segmentation with neural fields under weak supervision. Unlike previous methods, MaskField distills masks rather than dense high-dimensional CLIP features. MaskField employs neural fields as binary mask generators and supervises them with masks generated by SAM and classified by coarse CLIP features. MaskField overcomes the ambiguous object boundaries by naturally introducing SAM segmented object shapes without extra regularization during training. By circumventing the direct handling of high-dimensional CLIP features during training, MaskField is particularly compatible with explicit scene representations like 3DGS. Our extensive experiments show that MaskField not only surpasses prior state-of-the-art methods but also achieves remarkably fast convergence, outperforming previous methods with just 5 minutes of training. We hope that MaskField will inspire further exploration into how neural fields can be trained to comprehend 3D scenes from 2D models.

Submitted 1 July, 2024; originally announced July 2024.

Comments: 16 pages, 7 figures

arXiv:2407.00050 [pdf, other]

FoldToken2: Learning compact, invariant and generative protein structure language

Authors: Zhangyang Gao, Cheng Tan, Stan Z. Li

Abstract: The equivalent nature of 3D coordinates has posed long term challenges in protein structure representation learning, alignment, and generation. Can we create a compact and invariant language that equivalently represents protein structures? Towards this goal, we propose FoldToken2 to transfer equivariant structures into discrete tokens, while maintaining the recoverability of the original structures. From FoldToken1 to FoldToken2, we improve three key components: (1) invariant structure encoder, (2) vector-quantized compressor, and (3) equivalent structure decoder. We evaluate FoldToken2 on the protein structure reconstruction task and show that it outperforms the previous FoldToken1 by 20% in TMScore and 81% in RMSD. FoldToken2 is probably the first method that works well on both single-chain and multi-chain protein structure quantization. We believe that FoldToken2 will inspire further improvement in protein structure representation learning, structure alignment, and structure generation tasks.

Submitted 11 June, 2024; originally announced July 2024.

arXiv:2406.19853 [pdf, other]

YuLan: An Open-source Large Language Model

Authors: Yutao Zhu, Kun Zhou, Kelong Mao, Wentong Chen, Yiding Sun, Zhipeng Chen, Qian Cao, Yihan Wu, Yushuo Chen, Feng Wang, Lei Zhang, Junyi Li, Xiaolei Wang, Lei Wang, Beichen Zhang, Zican Dong, Xiaoxue Cheng, Yuhan Chen, Xinyu Tang, Yupeng Hou, Qiangqiang Ren, Xincheng Pang, Shufang Xie, Wayne Xin Zhao, Zhicheng Dou, et al. (13 additional authors not shown)

Abstract: Large language models (LLMs) have become the foundation of many applications, leveraging their extensive capabilities in processing and understanding natural language. While many open-source LLMs have been released with technical reports, the lack of training details hinders further research and development. This paper presents the development of YuLan, a series of open-source LLMs with $12$ billion parameters. The base model of YuLan is pre-trained on approximately $1.7$T tokens derived from a diverse corpus, including massive English, Chinese, and multilingual texts. We design a three-stage pre-training method to enhance YuLan's overall capabilities. Subsequent phases of training incorporate instruction-tuning and human alignment, employing a substantial volume of high-quality synthesized data. To facilitate the learning of complex and long-tail knowledge, we devise a curriculum-learning framework across these stages, which helps LLMs learn knowledge in an easy-to-hard manner. YuLan's training was finished in January 2024, and the model has achieved performance on par with state-of-the-art LLMs across various English and Chinese benchmarks. This paper outlines a comprehensive technical roadmap for developing LLMs from scratch. Our model and code are available at https://github.com/RUC-GSAI/YuLan-Chat.

Submitted 28 June, 2024; originally announced June 2024.

arXiv:2406.19130 [pdf, other]

Evidential Concept Embedding Models: Towards Reliable Concept Explanations for Skin Disease Diagnosis

Authors: Yibo Gao, Zheyao Gao, Xin Gao, Yuanye Liu, Bomin Wang, Xiahai Zhuang

Abstract: Due to the high stakes in medical decision-making, there is a compelling demand for interpretable deep learning methods in medical image analysis. Concept Bottleneck Models (CBM) have emerged as an active interpretable framework incorporating human-interpretable concepts into decision-making. However, their concept predictions may lack reliability when applied to clinical diagnosis, impeding the quality of concept explanations. To address this, we propose an evidential Concept Embedding Model (evi-CEM), which employs evidential learning to model the concept uncertainty. Additionally, we propose leveraging the concept uncertainty to rectify concept misalignments that arise when training CBMs using vision-language models without complete concept supervision. With the proposed methods, we can enhance the reliability of concept explanations for both supervised and label-efficient settings. Furthermore, we introduce concept uncertainty for effective test-time intervention. Our evaluation demonstrates that evi-CEM achieves superior performance in terms of concept prediction, and the proposed concept rectification effectively mitigates concept misalignments for label-efficient training. Our code is available at https://github.com/obiyoag/evi-CEM.

Submitted 27 June, 2024; originally announced June 2024.

Comments: accepted by MICCAI 2024

arXiv:2406.17810 [pdf, other]

PIC2O-Sim: A Physics-Inspired Causality-Aware Dynamic Convolutional Neural Operator for Ultra-Fast Photonic Device FDTD Simulation

Authors: Pingchuan Ma, Haoyu Yang, Zhengqi Gao, Duane S. Boning, Jiaqi Gu

Abstract: The finite-difference time-domain (FDTD) method, which is important in photonic hardware design flow, is widely adopted to solve time-domain Maxwell equations. However, FDTD is known for its prohibitive runtime cost, taking minutes to hours to simulate a single device. Recently, AI has been applied to realize orders-of-magnitude speedup in partial differential equation (PDE) solving. However, AI-based FDTD solvers for photonic devices have not been clearly formulated. Directly applying off-the-shelf models to predict the optical field dynamics shows unsatisfying fidelity and efficiency since the model primitives are agnostic to the unique physical properties of Maxwell equations and lack algorithmic customization. In this work, we thoroughly investigate the synergy between neural operator designs and the physical property of Maxwell equations and introduce a physics-inspired AI-based FDTD prediction framework PIC2O-Sim which features a causality-aware dynamic convolutional neural operator as its backbone model that honors the space-time causality constraints via careful receptive field configuration and explicitly captures the permittivity-dependent light propagation behavior via an efficient dynamic convolution operator. Meanwhile, we explore the trade-offs among prediction scalability, fidelity, and efficiency via a multi-stage partitioned time-bundling technique in autoregressive prediction. Multiple key techniques have been introduced to mitigate iterative error accumulation while maintaining efficiency advantages during autoregressive field prediction. Extensive evaluations on three challenging photonic device simulation tasks have shown the superiority of our PIC2O-Sim method, showing 51.2% lower roll-out prediction error, 23.5 times fewer parameters than state-of-the-art neural operators, providing 300-600x higher simulation speed than an open-source FDTD numerical solver.

Submitted 24 June, 2024; originally announced June 2024.

arXiv:2406.17626 [pdf, other]

CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference

Authors: Erxin Yu, Jing Li, Ming Liao, Siqi Wang, Zuchen Gao, Fei Mi, Lanqing Hong

Abstract: As large language models (LLMs) constantly evolve, ensuring their safety remains a critical research problem. Previous red-teaming approaches for LLM safety have primarily focused on single prompt attacks or goal hijacking. To the best of our knowledge, we are the first to study LLM safety in multi-turn dialogue coreference. We created a dataset of 1,400 questions across 14 categories, each featuring multi-turn coreference safety attacks. We then conducted detailed evaluations on five widely used open-source LLMs. The results indicated that under multi-turn coreference safety attacks, the highest attack success rate was 56% with the LLaMA2-Chat-7b model, while the lowest was 13.9% with the Mistral-7B-Instruct model. These findings highlight the safety vulnerabilities in LLMs during dialogue coreference interactions.

Submitted 25 June, 2024; originally announced June 2024.

Comments: Submitted to EMNLP 2024

arXiv:2406.17255 [pdf, other]

MPCODER: Multi-user Personalized Code Generator with Explicit and Implicit Style Representation Learning

Authors: Zhenlong Dai, Chang Yao, WenKang Han, Ying Yuan, Zhipeng Gao, Jingyuan Chen

Abstract: Large Language Models (LLMs) have demonstrated great potential for assisting developers in their daily development. However, most research focuses on generating correct code; how to use LLMs to generate personalized code has seldom been investigated. To bridge this gap, we propose MPCoder (Multi-user Personalized Code Generator) to generate personalized code for multiple users. To better learn coding style features, we utilize explicit coding style residual learning to capture the syntax code style standards and implicit style learning to capture the semantic code style conventions. We train a multi-user style adapter to better differentiate the implicit feature representations of different users through contrastive learning, ultimately enabling personalized code generation for multiple users. We further propose a novel evaluation metric for estimating similarities between codes of different coding styles. The experimental results show the effectiveness of our approach for this novel task.

Submitted 24 June, 2024; originally announced June 2024.

Comments: Accepted by ACL 2024, Main Conference

arXiv:2406.16603 [pdf, other]

Bipolarized Weyl semimetals and quantum crystal valley Hall effect in two-dimensional altermagnetic materials

Authors: Chao-Yang Tan, Ze-Feng Gao, Huan-Cheng Yang, Kai Liu, Peng-Jie Guo, Zhong-Yi Lu

Abstract: Magnetism and topology are two major areas of condensed matter physics. The combination of magnetism and topology gives rise to more novel physical effects, which have attracted strong theoretical and experimental attention. Recently, the concept of altermagnetism has been introduced, characterized by a dual nature: real-space antiferromagnetism and reciprocal-space anisotropic spin polarization. The amalgamation of altermagnetism with topology may lead to the emergence of previously unobserved topological phases and the associated physical effects. In this study, utilizing a four-band lattice model that incorporates altermagnetism and spin group symmetry, we demonstrate that type-I, type-II, and type-III bipolarized Weyl semimetals can exist in altermagnetic systems. Through first-principles electronic structure calculations, we predict four ideal two-dimensional type-I altermagnetic bipolarized Weyl semimetals Fe$_2$WTe$_4$ and Fe$_2$MoZ$_4$ (Z=S,Se,Te). More significantly, we introduce the quantum crystal valley Hall effect, a phenomenon achievable in three of these materials, namely Fe$_2$WTe$_4$, Fe$_2$MoS$_4$, and Fe$_2$MoTe$_4$, when spin-orbit coupling is considered. Furthermore, these materials have the potential to transition from a quantum crystal valley Hall phase to a Chern insulator phase under strain. In contrast, Fe$_2$MoSe$_4$ remains a Weyl semimetal under spin-orbit coupling but is distinguished by possessing only a single pair of Weyl points. Additionally, the position, polarization, and number of Weyl points in Fe$_2$WTe$_4$ and Fe$_2$MoZ$_4$ can be manipulated by adjusting the direction of the Néel vector. Consequently, Fe$_2$WTe$_4$ and Fe$_2$MoZ$_4$ emerge as promising experimental platforms for investigating the distinctive physical attributes of various altermagnetic topological phases.

Submitted 24 June, 2024; originally announced June 2024.

Comments: 7 pages, 5 figures

arXiv:2406.16189 [pdf, other]

Fuzzy Attention-based Border Rendering Network for Lung Organ Segmentation

Authors: Sheng Zhang, Yang Nan, Yingying Fang, Shiyi Wang, Xiaodan Xing, Zhifan Gao, Guang Yang

Abstract: Automatic lung organ segmentation on CT images is crucial for lung disease diagnosis. However, the unlimited voxel values and class imbalance of lung organs can lead to false-negative/positive and leakage issues in advanced methods. Additionally, some slender lung organs, e.g., bronchioles & arterioles, are easily lost during the recycled down/up-sample procedure, causing severe discontinuity issues. Motivated by these problems, this paper introduces an effective lung organ segmentation method called Fuzzy Attention-based Border Rendering (FABR) network. Since fuzzy logic can handle the uncertainty in feature extraction, the fusion of deep networks and fuzzy sets should be a viable solution for better performance. Meanwhile, unlike prior top-tier methods that operate on all regular dense points, our FABR depicts lung organ regions as cube-trees, focusing only on recycle-sampled border vulnerable points, rendering the severely discontinuous, false-negative/positive organ regions with a novel Global-Local Cube-tree Fusion (GLCF) module. Experimental results on four challenging datasets of airway & artery demonstrate that our method achieves favorable performance.

Submitted 1 July, 2024; v1 submitted 23 June, 2024; originally announced June 2024.

Comments: MICCAI 2024

arXiv:2406.14969 [pdf, other]

Uni-Mol2: Exploring Molecular Pretraining Model at Scale

Authors: Xiaohong Ji, Zhen Wang, Zhifeng Gao, Hang Zheng, Linfeng Zhang, Guolin Ke, Weinan E

Abstract: In recent years, pretraining models have made significant advancements in the fields of natural language processing (NLP), computer vision (CV), and life sciences. The significant advancements in NLP and CV are predominantly driven by the expansion of model parameters and data size, a phenomenon now recognized as the scaling laws. However, scaling laws in molecular pretraining models remain unexplored. In this work, we present Uni-Mol2, an innovative molecular pretraining model that leverages a two-track transformer to effectively integrate features at the atomic level, graph level, and geometry structure level. Along with this, we systematically investigate the scaling law within molecular pretraining models, characterizing the power-law correlations between validation loss and model size, dataset size, and computational resources. Consequently, we successfully scale Uni-Mol2 to 1.1 billion parameters through pretraining on 800 million conformations, making it the largest molecular pretraining model to date. Extensive experiments show consistent improvement in the downstream tasks as the model size grows. The Uni-Mol2 with 1.1B parameters also outperforms existing methods, achieving an average 27% improvement on the QM9 and 14% on COMPAS-1D dataset.

Submitted 1 July, 2024; v1 submitted 21 June, 2024; originally announced June 2024.

arXiv:2406.13436 [pdf, other]

What's Next? Exploring Utilization, Challenges, and Future Directions of AI-Generated Image Tools in Graphic Design

Authors: Yuying Tang, Mariana Ciancia, Zhigang Wang, Ze Gao

Abstract: Recent advancements in artificial intelligence, such as computer vision and deep learning, have led to the emergence of numerous generative AI platforms, particularly for image generation. However, the application of AI-generated image tools in graphic design has not been extensively explored. This study conducted semi-structured interviews with seven designers of varying experience levels to understand their current usage, challenges, and future functional needs for AI-generated image tools in graphic design. As our findings suggest, AI tools serve as creative partners in design, enhancing human creativity, offering strategic insights, and fostering team collaboration and communication. The findings provide guiding recommendations for the future development of AI-generated image tools, aimed at helping engineers optimize these tools to better meet the needs of graphic designers.

Submitted 19 June, 2024; originally announced June 2024.

arXiv:2406.13170 [pdf, other]

Amphista: Accelerate LLM Inference with Bi-directional Multiple Drafting Heads in a Non-autoregressive Style

Authors: Zeping Li, Xinlong Yang, Ziheng Gao, Ji Liu, Zhuang Liu, Dong Li, Jinzhang Peng, Lu Tian, Emad Barsoum

Abstract: Large Language Models (LLMs) inherently use autoregressive decoding, which lacks parallelism in inference and results in significantly slow inference speeds, especially when hardware parallel accelerators and memory bandwidth are not fully utilized. In this work, we propose Amphista, a speculative decoding algorithm that adheres to a non-autoregressive decoding paradigm. Owing to the increased parallelism, our method demonstrates higher efficiency in inference compared to autoregressive methods. Specifically, Amphista models an Auto-embedding Block capable of parallel inference, incorporating bi-directional attention to enable interaction between different drafting heads. Additionally, Amphista implements Staged Adaptation Layers to facilitate the transition of semantic information from the base model's autoregressive inference to the drafting heads' non-autoregressive speculation, thereby achieving paradigm transformation and feature fusion. We conduct a series of experiments on a suite of Vicuna models using MT-Bench and Spec-Bench. For the Vicuna 33B model, Amphista achieves up to 2.75$\times$ and 1.40$\times$ wall-clock acceleration compared to vanilla autoregressive decoding and Medusa, respectively, while preserving lossless generation quality.

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.12460 [pdf, other]

An extrapolation-driven network architecture for physics-informed deep learning

Authors: Yong Wang, Yanzhong Yao, Zhiming Gao

Abstract: Deep learning with physics-informed neural networks (PINNs) has emerged as a highly popular and effective approach for solving partial differential equations (PDEs). In this paper, we first investigate the extrapolation capability of the PINN method for time-dependent PDEs. Taking advantage of this extrapolation property, we can generalize the training result obtained in the time subinterval to the large interval by adding a correction term to the network parameters of the subinterval. The correction term is determined by further training with the sample points in the added subinterval. Secondly, by designing an extrapolation control function with special characteristics and combining it with the correction term, we construct a new neural network architecture whose network parameters are coupled with the time variable, which we call the extrapolation-driven network architecture. Based on this architecture, using a single neural network, we can obtain the overall PINN solution of the whole domain with the following two characteristics: (1) it completely inherits the local solution of the interval obtained from the previous training, (2) at the interval node, it strictly maintains the continuity and smoothness that the true solution has. The extrapolation-driven network architecture allows us to divide a large time domain into multiple subintervals and solve the time-dependent PDEs one by one in chronological order. This training scheme respects the causality principle and effectively overcomes the difficulties of the conventional PINN method in solving the evolution equation on a large time domain. Numerical experiments verify the performance of our proposed method.

Submitted 21 June, 2024; v1 submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.12414 [pdf, other]

Harnessing spontaneous emission of correlated photon pairs from ladder-type giant atoms

Authors: Zhao-Min Gao, Jia-Qi Li, Ying-Huan Wu, Wen-Xiao Liu, Xin Wang

Abstract: The realization of correlated multi-photon processes usually depends on the interaction between nonlinear media and atoms. However, the nonlinearity of optical materials is generally weak, making it still very challenging to achieve correlated multi-photon dynamics at the few-photon level. Meanwhile, studies of giant atoms, whose capability for multi-point coupling is a novel paradigm in quantum optics, mostly focus on the single photon field. In this work, using the method described in Phys. Rev. Res. 6, 013279 (2024), we reveal that the ladder-type three-level giant atom spontaneously emits strongly correlated photon pairs with high efficiency by designing and optimizing the target function. In addition, by encoding local phases into the optimal coupling sequence, directional two-photon correlated transfer can be achieved. This method does not require a nonlinear waveguide and can be realized in the conventional environment. We show that the photon pairs emitted in both the bidirectional and the chiral case exhibit strong correlation properties in both time and space. Such correlated photon pairs have great potential applications for quantum information processing. For example, numerical results show that our proposal can realize the two-photon mediated cascaded quantum system.

Submitted 18 June, 2024; originally announced June 2024.

Comments: 12 pages; 10 figures

arXiv:2406.11816 [pdf, other]

VideoLLM-online: Online Video Large Language Model for Streaming Video

Authors: Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, Mike Zheng Shou

Abstract: Recent Large Language Models have been enhanced with vision capabilities, enabling them to comprehend images, videos, and interleaved vision-language content. However, the learning methods of these large multimodal models typically treat videos as predetermined clips, making them less effective and efficient at handling streaming video inputs. In this paper, we propose a novel Learning-In-Video-Stream (LIVE) framework, which enables temporally aligned, long-context, and real-time conversation within a continuous video stream. Our LIVE framework comprises comprehensive approaches to achieve video streaming dialogue, encompassing: (1) a training objective designed to perform language modeling for continuous streaming inputs, (2) a data generation scheme that converts offline temporal annotations into a streaming dialogue format, and (3) an optimized inference pipeline to speed up the model responses in real-world video streams. With our LIVE framework, we built VideoLLM-online model upon Llama-2/Llama-3 and demonstrate its significant advantages in processing streaming videos. For instance, on average, our model can support streaming dialogue in a 5-minute video clip at over 10 FPS on an A100 GPU. Moreover, it also showcases state-of-the-art performance on public offline video benchmarks, such as recognition, captioning, and forecasting. The code, model, data, and demo have been made available at https://showlab.github.io/videollm-online.

Submitted 17 June, 2024; originally announced June 2024.

Comments: CVPR 2024. This arxiv version is upgraded with Llama-3

arXiv:2406.11204 [pdf]

Magnetically tunable optical bound states in the continuum with arbitrary polarization and intrinsic chirality

Authors: Qing-an Tu, Hongxin Zhou, Yan Meng, Maohua Gong, Zhen Gao

Abstract: Optical bound states in the continuum (BICs), which are exotic localized eigenstates embedded in the continuum spectrum and topological polarization singularity in momentum space, have attracted great attentions in both fundamental and applied physics. Here, based on magneto-optical photonic crystal slab placed in external magnetic fields to break the time-reversal symmetry, we theoretically demonstrate magnetically tunable BICs with arbitrary polarization covering the entire Poincaré sphere and efficient off-Γ chiral emission of circularly polarized states. More interestingly, by further breaking the in-plane inversion symmetry of the magneto-optical photonic crystal slab to generate a pair of circularly polarized states (C point) spawning from the eliminated BICs and tuning the external magnetic field strength to move one C point to the Γ point, one at-Γ intrinsic chiral BICs with near-unity circular dichroism exceeding 0.99 and a high quality factor of 46000 owning to the preserved out-of-plane mirror symmetry can be observed. These findings may lead to a plethora of potential applications in chiral-optical effects, structured light, and tunable optical devices.

Submitted 1 July, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

Comments: 13 pages, 4 figures

arXiv:2406.10840 [pdf, other]

CBGBench: Fill in the Blank of Protein-Molecule Complex Binding Graph

Authors: Haitao Lin, Guojiang Zhao, Odin Zhang, Yufei Huang, Lirong Wu, Zicheng Liu, Siyuan Li, Cheng Tan, Zhifeng Gao, Stan Z. Li

Abstract: Structure-based drug design (SBDD) aims to generate potential drugs that can bind to a target protein and is greatly expedited by the aid of AI techniques in generative models. However, a lack of systematic understanding persists due to the diverse settings, complex implementation, difficult reproducibility, and task singularity. Firstly, the absence of standardization can lead to unfair comparisons and inconclusive insights. To address this dilemma, we propose CBGBench, a comprehensive benchmark for SBDD, that unifies the task as a generative heterogeneous graph completion, analogous to fill-in-the-blank of the 3D complex binding graph. By categorizing existing methods based on their attributes, CBGBench facilitates a modular and extensible framework that implements various cutting-edge methods. Secondly, a single task on de novo molecule generation can hardly reflect their capabilities. To broaden the scope, we have adapted these models to a range of tasks essential in drug design, which are considered sub-tasks within the graph fill-in-the-blank tasks. These tasks include the generative designation of de novo molecules, linkers, fragments, scaffolds, and sidechains, all conditioned on the structures of protein pockets. Our evaluations are conducted with fairness, encompassing comprehensive perspectives on interaction, chemical properties, geometry authenticity, and substructure validity. We further provide the pre-trained versions of the state-of-the-art models and deep insights with analysis from empirical studies. The codebase for CBGBench is publicly accessible at https://github.com/Edapinenut/CBGBench.

Submitted 16 June, 2024; originally announced June 2024.

Comments: 9 pages main context

arXiv:2406.10188 [pdf, ps, other]

$L^{\vec{p}}-L^{\vec{q}}$ Boundedness of Multiparameter Forelli-Rudin Type Operators on the Siegel Upper Half-space

Authors: Hongheng Yin, Guan-Tie Deng, Zhi-Qiang Gao

Abstract: In this article, we present exactly when two classes of multiparameter Forelli-Rudin type integral operators are bounded from one weighted mixed-norm Lebesgue space $L^{\vec{p}}$ to another space $L^{\vec{q}}$ over the Siegel upper half-space.

Submitted 14 June, 2024; originally announced June 2024.

Comments: 22 pages

arXiv:2406.09953 [pdf, other]

DAG-Plan: Generating Directed Acyclic Dependency Graphs for Dual-Arm Cooperative Planning

Authors: Zeyu Gao, Yao Mu, Jinye Qu, Mengkang Hu, Lingyue Guo, Ping Luo, Yanfeng Lu

Abstract: Dual-arm robots offer enhanced versatility and efficiency over single-arm counterparts by enabling concurrent manipulation of multiple objects or cooperative execution of tasks using both arms. However, effectively coordinating the two arms for complex long-horizon tasks remains a significant challenge. Existing task planning methods predominantly focus on single-arm robots or rely on predefined bimanual operations, failing to fully leverage the capabilities of dual-arm systems. To address this limitation, we introduce DAG-Plan, a structured task planning framework tailored for dual-arm robots. DAG-Plan harnesses large language models (LLMs) to decompose intricate tasks into actionable sub-tasks represented as nodes within a directed acyclic graph (DAG). Critically, DAG-Plan dynamically assigns these sub-tasks to the appropriate arm based on real-time environmental observations, enabling parallel and adaptive execution. We evaluate DAG-Plan on the novel Dual-Arm Kitchen Benchmark, comprising 9 sequential tasks with 78 sub-tasks and 26 objects. Extensive experiments demonstrate the superiority of DAG-Plan over directly using LLM to generate plans, achieving nearly 50% higher efficiency compared to the single-arm task planning baseline and nearly double the success rate of the dual-arm task planning baseline.

Submitted 30 June, 2024; v1 submitted 14 June, 2024; originally announced June 2024.

Comments: 46 pages, 13 figures

arXiv:2406.09890 [pdf, other]

ALMA Lensing Cluster Survey: Physical characterization of near-infrared-dark intrinsically faint ALMA sources at z=2-4

Authors: Akiyoshi Tsujita, Kotaro Kohno, Shuo Huang, Masamune Oguri, Ken-ichi Tadaki, Ian Smail, Hideki Umehata, Zhen-Kai Gao, Wei-Hao Wang, Fengwu Sun, Seiji Fujimoto, Tao Wang, Ryosuke Uematsu, Daniel Espada, Francesco Valentino, Yiping Ao, Franz E. Bauer, Bunyo Hatsukade, Fumi Egusa, Yuri Nishimura, Anton M. Koekemoer, Daniel Schaerer, Claudia Lagos, Miroslava Dessauges-Zavadsky, Gabriel Brammer, et al. (11 additional authors not shown)

Abstract: We present results from Atacama Large Millimeter/submillimeter Array (ALMA) spectral line-scan observations at 3-mm and 2-mm bands of three near-infrared-dark (NIR-dark) galaxies behind two massive lensing clusters MACS J0417.5-1154 and RXC J0032.1+1808. Each of these three sources is a faint (de-lensed $S_{\text{1.2 mm}}$ $<$ 1 mJy) triply lensed system originally discovered in the ALMA Lensing Cluster Survey. We have successfully detected CO and [C I] emission lines and confirmed that their spectroscopic redshifts are $z=3.652$, 2.391, and 2.985. By utilizing a rich multi-wavelength data set, we find that the NIR-dark galaxies are located on the star formation main sequence in the intrinsic stellar mass range of log ($M_*$/$M_\odot$) = 9.8 - 10.4, which is about one order of magnitude lower than that of typical submillimeter galaxies (SMGs). These NIR-dark galaxies show a variety in gas depletion times and spatial extent of dust emission. One of the three is a normal star-forming galaxy with gas depletion time consistent with a scaling relation, and its infrared surface brightness is an order of magnitude smaller than that of typical SMGs. Since this galaxy has an elongated axis ratio of $\sim 0.17$, we argue that normal star-forming galaxies in an edge-on configuration can be heavily dust-obscured. This implies that existing deep WFC3/F160W surveys may miss a fraction of typical star-forming main-sequence galaxies due to their edge-on orientation.

Submitted 14 June, 2024; originally announced June 2024.

Comments: 23 pages, 10 figures, Submitted to ApJ

arXiv:2406.08418 [pdf, other]

OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

Authors: Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang Jin, Guanzhou Chen, Yinan He, Zhangwei Gao, Erfei Cui, Jiashuo Yu, Hao Tian, Jiasheng Zhou, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Zhenxiang Li, Pei Chu, Yi Wang, et al. (15 additional authors not shown)

Abstract: Image-text interleaved data, consisting of multiple images and texts arranged in a natural document format, aligns with the presentation paradigm of internet data and closely resembles human reading habits. Recent studies have shown that such data aids multimodal in-context learning and maintains the capabilities of large language models during multimodal fine-tuning. However, the limited scale and diversity of current image-text interleaved data restrict the development of multimodal large language models. In this paper, we introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset. Using an efficient data engine, we filter and extract large-scale high-quality documents, which contain 8.6 billion images and 1,696 billion text tokens. Compared to counterparts (e.g., MMC4, OBELICS), our dataset 1) has 15 times larger scales while maintaining good data quality; 2) features more diverse sources, including both English and non-English websites as well as video-centric websites; 3) is more flexible, easily degradable from an image-text interleaved format to pure text corpus and image-text pairs. Through comprehensive analysis and experiments, we validate the quality, usability, and effectiveness of the proposed dataset. We hope this could provide a solid data foundation for future multimodal model research. Code and data are released at https://github.com/OpenGVLab/OmniCorpus.

Submitted 12 July, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.07868 [pdf, other]

Bridging multiple worlds: multi-marginal optimal transport for causal partial-identification problem

Authors: Zijun Gao, Shu Ge, Jian Qian

Abstract: Under the prevalent potential outcome model in causal inference, each unit is associated with multiple potential outcomes but at most one of which is observed, leading to many causal quantities being only partially identified. The inherent missing data issue echoes the multi-marginal optimal transport (MOT) problem, where marginal distributions are known, but how the marginals couple to form the joint distribution is unavailable. In this paper, we cast the causal partial identification problem in the framework of MOT with $K$ margins and $d$-dimensional outcomes and obtain the exact partial identified set. In order to estimate the partial identified set via MOT, statistically, we establish a convergence rate of the plug-in MOT estimator for general quadratic objective functions and prove it is minimax optimal for a quadratic objective function stemming from the variance minimization problem with arbitrary $K$ and $d \le 4$. Numerically, we demonstrate the efficacy of our method over several real-world datasets where our proposal consistently outperforms the baseline by a significant margin (over 70%). In addition, we provide efficient off-the-shelf implementations of MOT with general objective functions.

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.07068 [pdf]

Emergent Moiré fringes in direct-grown quasicrystal

Authors: Jingwei Li, Kejie Bao, Honglin Sun, Xingxu Yan, Ting Huang, Qicheng Zhang, Yaoqiang Zhou, Zhenjing Liu, Paul Masih Das, Jiawen You, Jiong Zhao, Jianbin Xu, Xiaoqing Pan, Yongli Mi, Junyi Zhu, Zhaoli Gao

Abstract: Quasicrystals represent a category of rarely structured solids that challenge traditional periodicity in crystal materials. Recent advancements in the synthesis of two-dimensional (2D) van der Waals materials have paved the way for exploring the unique physical properties of these systems. Here, we report on the synthesis of 2D quasicrystals featuring 30° alternating twist angles between multiple graphene layers, using chemical vapor deposition (CVD). Strikingly, we observed periodic Moiré patterns in the quasicrystal, a finding that has not been previously reported in traditional alloy-based quasicrystals. The Moiré periodicity, varying with the parity of the constituent layers, aligns with the theoretical predictions that suggest a stress cancellation mechanism in force. The emergence of Moiré fringes is attributed to the spontaneous mismatched lattice constant in the oriented graphene layers, proving the existence of atomic relaxation. This phenomenon, which has been largely understudied in graphene systems with large twist angles, has now been validated through our use of scanning transmission electron microscopy (STEM). Our CVD-grown Moiré quasicrystal provides an ideal platform for exploring the unusual physical properties that arise from Moiré periodicity within quasicrystals.

Submitted 11 June, 2024; originally announced June 2024.

arXiv:2406.06986 [pdf, other]

DNN Partitioning, Task Offloading, and Resource Allocation in Dynamic Vehicular Networks: A Lyapunov-Guided Diffusion-Based Reinforcement Learning Approach

Authors: Zhang Liu, Hongyang Du, Junzhe Lin, Zhibin Gao, Lianfen Huang, Seyyedali Hosseinalipour, Dusit Niyato

Abstract: The rapid advancement of Artificial Intelligence (AI) has introduced Deep Neural Network (DNN)-based tasks to the ecosystem of vehicular networks. These tasks are often computation-intensive, requiring substantial computation resources, which are beyond the capability of a single vehicle. To address this challenge, Vehicular Edge Computing (VEC) has emerged as a solution, offering computing services for DNN-based tasks through resource pooling via Vehicle-to-Vehicle/Infrastructure (V2V/V2I) communications. In this paper, we formulate the problem of joint DNN partitioning, task offloading, and resource allocation in VEC as a dynamic long-term optimization. Our objective is to minimize the DNN-based task completion time while guaranteeing the system stability over time. To this end, we first leverage a Lyapunov optimization technique to decouple the original long-term optimization with stability constraints into a per-slot deterministic problem. Afterwards, we propose a Multi-Agent Diffusion-based Deep Reinforcement Learning (MAD2RL) algorithm, incorporating the innovative use of diffusion models to determine the optimal DNN partitioning and task offloading decisions. Furthermore, we integrate convex optimization techniques into MAD2RL as a subroutine to allocate computation resources, enhancing the learning efficiency. Through simulations under real-world movement traces of vehicles, we demonstrate the superior performance of our proposed algorithm compared to existing benchmark solutions.

Submitted 11 June, 2024; originally announced June 2024.

Comments: 16 pages, 9 figures, and with extra appendix

arXiv:2406.06867 [pdf]

Electrically Tunable Magnetoconductance of Close-Packed CVD Bilayer Graphene Layer Stacking Walls

Authors: Qicheng Zhang, Sheng Wang, Zhaoli Gao, Sebastian Hurtado-Parra, Joel Berry, Zachariah Addison, Paul Masih Das, William M. Parkin, Marija Drndic, James M. Kikkawa, Feng Wang, Eugene J. Mele, A. T. Charlie Johnson, Zhengtang Luo

Abstract: Quantum valley Hall (QVH) domain wall states are a new class of one-dimensional (1D) one-way conductors that are topologically protected in the absence of valley mixing. Development beyond a single QVH channel raises important new questions as to how QVH channels in close spatial proximity interact with each other, and how that interaction may be controlled. Scalable epitaxial bilayer graphene synthesis produces layer stacking wall (LSW) bundles, where QVH channels are bound, providing an excellent platform to study QVH channel interactions. Here we show that distinct strain sources lead to the formation of both well-separated LSWs and close packed LSW bundles. Comparative studies of electronic transport in these two regimes reveal that close-packed LSW bundles support electrically tunable magnetoconductance. The coexistence of different strain sources offers a potential pathway to realize scalable quantum transport platform based on LSWs where electrically tunability enables programmable functionality.

Submitted 10 June, 2024; originally announced June 2024.

arXiv:2406.05839 [pdf, other]

MaLa-ASR: Multimedia-Assisted LLM-Based ASR

Authors: Guanrou Yang, Ziyang Ma, Fan Yu, Zhifu Gao, Shiliang Zhang, Xie Chen

Abstract: As more and more information-rich data like video become available, utilizing multi-modal auxiliary information to enhance audio tasks has sparked widespread research interest. The recent surge in research on LLM-based audio models provides fresh perspectives for tackling audio tasks. Given that LLM can flexibly ingest multiple inputs, we propose MaLa-ASR, an LLM-based ASR model that can integrate textual keywords extracted from presentation slides to improve recognition of conference content. MaLa-ASR yields average WERs of 9.4% and 11.7% on the L95 and S95 subsets of the SlideSpeech corpus, representing a significant relative WER drop of 27.9% and 44.7% over the baseline model reported in SlideSpeech. MaLa-ASR underscores LLM's strong performance in speech tasks and the capability to integrate auxiliary information conveniently. By adding keywords to the input prompt, the biased word error rate (B-WER) reduces relatively by 46.0% and 44.2%, establishing a new SOTA on this dataset.

Submitted 13 June, 2024; v1 submitted 9 June, 2024; originally announced June 2024.

arXiv:2406.05688 [pdf, other]

Peer Review as A Multi-Turn and Long-Context Dialogue with Role-Based Interactions

Authors: Cheng Tan, Dongxin Lyu, Siyuan Li, Zhangyang Gao, Jingxuan Wei, Siqi Ma, Zicheng Liu, Stan Z. Li

Abstract: Large Language Models (LLMs) have demonstrated wide-ranging applications across various fields and have shown significant potential in the academic peer-review process. However, existing applications are primarily limited to static review generation based on submitted papers, which fail to capture the dynamic and iterative nature of real-world peer reviews. In this paper, we reformulate the peer-review process as a multi-turn, long-context dialogue, incorporating distinct roles for authors, reviewers, and decision makers. We construct a comprehensive dataset containing over 26,841 papers with 92,017 reviews collected from multiple sources, including the top-tier conference and prestigious journal. This dataset is meticulously designed to facilitate the applications of LLMs for multi-turn dialogues, effectively simulating the complete peer-review process. Furthermore, we propose a series of metrics to evaluate the performance of LLMs for each role under this reformulated peer-review setting, ensuring fair and comprehensive evaluations. We believe this work provides a promising perspective on enhancing the LLM-driven peer-review process by incorporating dynamic, role-based interactions. It aligns closely with the iterative and interactive nature of real-world academic peer review, offering a robust foundation for future research and development in this area. We open-source the dataset at https://github.com/chengtan9907/ReviewMT.

Submitted 9 June, 2024; originally announced June 2024.

Comments: Under review

arXiv:2406.05676 [pdf]

Chern insulator phase realized in dual-gate-tuned MnBi2Te4 thin films grown by molecular beam epitaxy

Authors: Yunhe Bai, Yuanzhao Li, Ruixuan Liu, Jianli Luan, Yang Chen, Wenyu Song, Peng-Fei Ji, Cui Ding, Zongwei Gao, Qinghua Zhang, Fanqi Meng, Bingbing Tong, Lin Li, Tianchen Zhu, Lin Gu, Lili Wang, Jinsong Zhang, Yayu Wang, Qi-Kun Xue, Ke He, Yang Feng, Xiao Feng

Abstract: The intrinsic magnetic order, large topological-magnetic gap and rich topological phases make MnBi2Te4 a wonderful platform to study exotic topological quantum states such as axion insulator and Chern insulator. To realize and manipulate these topological phases in a MnBi2Te4 thin film, precise manipulation of the electric field across the film is essential, which requires a dual-gate structure. In this work, we achieve dual-gate tuning of MnBi2Te4 thin films grown with molecular beam epitaxy on SrTiO3(111) substrates by applying the substrate and an AlOx layer as the gate dielectrics of bottom and top gates, respectively. Under magnetic field of 9T and temperature of 20 mK, the Hall and longitudinal resistivities of the films show inversed gate-voltage dependence, for both top- and bottom-gates, signifying the existence of the dissipationless edge state contributed by Chern insulator phase in the ferromagnetic configuration. The maximum of the Hall resistivity only reaches 0.8 h/e2, even with dual-gate tuning, probably due to the high density of bulk carriers introduced by secondary phases. In the antiferromagnetic state under zero magnetic field, the films show normal insulator behavior. The dual-gated MnBi2Te4 thin films lay the foundation for developing devices based on electrically tunable topological quantum states.

Submitted 9 June, 2024; originally announced June 2024.

Comments: 24 pages, 4 figures

arXiv:2406.04961 [pdf, other]

Multiplane Prior Guided Few-Shot Aerial Scene Rendering

Authors: Zihan Gao, Licheng Jiao, Lingling Li, Xu Liu, Fang Liu, Puhua Chen, Yuwei Guo

Abstract: Neural Radiance Fields (NeRF) have been successfully applied in various aerial scenes, yet they face challenges with sparse views due to limited supervision. The acquisition of dense aerial views is often prohibitive, as unmanned aerial vehicles (UAVs) may encounter constraints in perspective range and energy constraints. In this work, we introduce Multiplane Prior guided NeRF (MPNeRF), a novel approach tailored for few-shot aerial scene rendering-marking a pioneering effort in this domain. Our key insight is that the intrinsic geometric regularities specific to aerial imagery could be leveraged to enhance NeRF in sparse aerial scenes. By investigating NeRF's and Multiplane Image (MPI)'s behavior, we propose to guide the training process of NeRF with a Multiplane Prior. The proposed Multiplane Prior draws upon MPI's benefits and incorporates advanced image comprehension through a SwinV2 Transformer, pre-trained via SimMIM. Our extensive experiments demonstrate that MPNeRF outperforms existing state-of-the-art methods applied in non-aerial contexts, by tripling the performance in SSIM and LPIPS even with three views available. We hope our work offers insights into the development of NeRF-based applications in aerial scenes with limited data.

Submitted 7 June, 2024; originally announced June 2024.

Comments: 17 pages, 8 figures, accepted at CVPR 2024

Journal ref: CVPR 2024

arXiv:2406.04821 [pdf, other]

Deep Learning Powered Estimate of The Extrinsic Parameters on Unmanned Surface Vehicles

Authors: Yi Shen, Hao Liu, Chang Zhou, Wentao Wang, Zijun Gao, Qi Wang

Abstract: Unmanned Surface Vehicles (USVs) are pivotal in marine exploration, but their sensors' accuracy is compromised by the dynamic marine environment. Traditional calibration methods fall short in these conditions. This paper introduces a deep learning architecture that predicts changes in the USV's dynamic metacenter and refines sensors' extrinsic parameters in real time using a Time-Sequence General Regression Neural Network (GRNN) with Euler angles as input. Simulation data from Unity3D ensures robust training and testing. Experimental results show that the Time-Sequence GRNN achieves the lowest mean squared error (MSE) loss, outperforming traditional neural networks. This method significantly enhances sensor calibration for USVs, promising improved data accuracy in challenging maritime conditions. Future work will refine the network and validate results with real-world data.

Submitted 7 June, 2024; originally announced June 2024.

Comments: Accepted by The 9th Asia-Pacific Conference on Intelligent Robot Systems (ACIRS 2024)

arXiv:2406.04809 [pdf, other]

A Survey of Fragile Model Watermarking

Authors: Zhenzhe Gao, Yu Cheng, Zhaoxia Yin

Abstract: Model fragile watermarking, inspired by both the field of adversarial attacks on neural networks and traditional multimedia fragile watermarking, has gradually emerged as a potent tool for detecting tampering, and has witnessed rapid development in recent years. Unlike robust watermarks, which are widely used for identifying model copyrights, fragile watermarks for models are designed to identify whether models have been subjected to unexpected alterations such as backdoors, poisoning, and compression, among others. These alterations can pose unknown risks to model users, such as misidentifying stop signs as speed limit signs in classic autonomous driving scenarios. This paper provides an overview of the relevant work in the field of model fragile watermarking since its inception, categorizing the existing methods and revealing the developmental trajectory of the field, thus offering a comprehensive survey for future endeavors in model fragile watermarking.

Submitted 8 July, 2024; v1 submitted 7 June, 2024; originally announced June 2024.

Comments: Submitted to Signal Processing

arXiv:2406.04727 [pdf, other]

Predicting Polymer Properties Based on Multimodal Multitask Pretraining

Authors: Fanmeng Wang, Wentao Guo, Minjie Cheng, Shen Yuan, Hongteng Xu, Zhifeng Gao

Abstract: In the past few decades, polymers, high-molecular-weight compounds formed by bonding numerous identical or similar monomers covalently, have played an essential role in various scientific fields. In this context, accurate prediction of their properties is becoming increasingly crucial. Typically, the properties of a polymer, such as plasticity, conductivity, bio-compatibility, and so on, are highly correlated with its 3D structure. However, current methods for predicting polymer properties heavily rely on information from polymer SMILES sequences (P-SMILES strings) while ignoring crucial 3D structural information, leading to sub-optimal performance. In this work, we propose MMPolymer, a novel multimodal multitask pretraining framework incorporating both polymer 1D sequential information and 3D structural information to enhance downstream polymer property prediction tasks. Besides, to overcome the limited availability of polymer 3D data, we further propose the "Star Substitution" strategy to extract 3D structural information effectively. During pretraining, MMPolymer not only predicts masked tokens and recovers 3D coordinates but also achieves the cross-modal alignment of latent representation. Subsequently, we further fine-tune the pretrained MMPolymer for downstream polymer property prediction tasks in the supervised learning paradigm. Experimental results demonstrate that MMPolymer achieves state-of-the-art performance in various polymer property prediction tasks. Moreover, leveraging the pretrained MMPolymer and using only one modality (either P-SMILES string or 3D conformation) during fine-tuning can also surpass existing polymer property prediction methods, highlighting the exceptional capability of MMPolymer in polymer feature extraction and utilization. Our online platform for polymer property prediction is available at https://app.bohrium.dp.tech/mmpolymer.

Submitted 7 June, 2024; originally announced June 2024.

Showing 1–50 of 1,363 results for author: Gao, Z