arXiv:2407.12023 [pdf, other]

CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models

Authors: Zhong-Zhi Li, Ming-Liang Zhang, Fei Yin, Zhi-Long Ji, Jin-Feng Bai, Zhen-Ru Pan, Fan-Hu Zeng, Jian Xu, Jia-Xin Zhang, Cheng-Lin Liu

Abstract: Due to the rapid advancements in multimodal large language models, evaluating their multimodal mathematical capabilities continues to receive wide attention. Although datasets like MathVista provide benchmarks for assessing mathematical capabilities in multimodal scenarios, there is still a lack of corresponding evaluation tools and datasets for fine-grained assessment in the context of K12 education in the Chinese language. To systematically evaluate the capability of multimodal large models in solving Chinese multimodal mathematical problems, we propose a Chinese Multi-modal Math Skill Evaluation Benchmark, named CMMaTH, containing 23k multimodal K12 math related questions, forming the largest Chinese multimodal mathematical problem benchmark to date. CMMaTH questions span elementary to high school levels and provide increased diversity in problem types, solution objectives, visual elements, detailed knowledge points, and standard solution annotations. We have constructed an open-source tool GradeGPT integrated with the CMMaTH dataset, facilitating stable, rapid, and cost-free model evaluation. Our data and code are available.

Submitted 27 June, 2024; originally announced July 2024.

arXiv:2407.10671 [pdf, other]

Qwen2 Technical Report

Authors: An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, et al. (37 additional authors not shown)

Abstract: This report introduces the Qwen2 series, the latest addition to our large language models and large multimodal models. We release a comprehensive suite of foundational and instruction-tuned language models, encompassing a parameter range from 0.5 to 72 billion, featuring dense models and a Mixture-of-Experts model. Qwen2 surpasses most prior open-weight models, including its predecessor Qwen1.5, and exhibits competitive performance relative to proprietary models across diverse benchmarks on language understanding, generation, multilingual proficiency, coding, mathematics, and reasoning. The flagship model, Qwen2-72B, showcases remarkable performance: 84.2 on MMLU, 37.9 on GPQA, 64.6 on HumanEval, 89.5 on GSM8K, and 82.4 on BBH as a base language model. The instruction-tuned variant, Qwen2-72B-Instruct, attains 9.1 on MT-Bench, 48.1 on Arena-Hard, and 35.7 on LiveCodeBench. Moreover, Qwen2 demonstrates robust multilingual capabilities, proficient in approximately 30 languages, spanning English, Chinese, Spanish, French, German, Arabic, Russian, Korean, Japanese, Thai, Vietnamese, and more, underscoring its versatility and global reach. To foster community innovation and accessibility, we have made the Qwen2 model weights openly available on Hugging Face and ModelScope, and the supplementary materials including example code on GitHub. These platforms also include resources for quantization, fine-tuning, and deployment, facilitating a wide range of applications and research endeavors.

Submitted 17 July, 2024; v1 submitted 15 July, 2024; originally announced July 2024.

Comments: 25 pages, 1 figure

arXiv:2407.10435 [pdf]

Nontrivial impact of interlayer coupling on thermal conductivity: opposing trends in in-plane and out-of-plane phonons

Authors: H. F. Feng, B. Liu, J. L. Bai, X. Zhang, Z. X. Song, Zhi-Xin Guo

Abstract: The study of heat transport in two-dimensional (2D) materials reveals novel behaviors due to quantum confinement effects, where in-plane and out-of-plane phonons play crucial roles. In 2D materials like graphene, it is widely recognized that the out-of-plane vibrational mode is the primary contributor to thermal conductivity owing to the mirror symmetry. Based on this perspective, the introduction of interlayer coupling, which breaks this symmetry, is expected to induce a significant reduction in thermal conductivity within 2D materials. Nevertheless, recent studies have presented unexpected findings, indicating that interlayer coupling can actually increase thermal conductivity of 2D materials. This controversial result suggests a nontrivial underlying mechanism governing the effects of interlayer coupling on thermal conductivity in 2D materials, necessitating further exploration. In our work, we investigate the modulation of thermal conductivity through interlayer coupling in a sandwich structure composed of hexagonal boron nitride (h-BN) and bilayer graphene (BG), specifically a h-BN/BG/h-BN system. Through molecular dynamics simulations, we find that the thermal conductivity from out-of-plane phonons can be significantly reduced, while that from in-plane phonons can be significantly increased, as the interlayer coupling strength increases. This results in a nontrivial, coupling-strength-dependent overall thermal conductivity. The phonon spectrum analysis conducted using our modified package reveals that the upshift and flattening of the out-of-plane (ZA and ZO) phonon modes are mainly responsible for these variations, and the extent of the upshift and flattening is proportional to the strength of interlayer coupling. This work offers new insights into manipulating the thermal conductivity of 2D materials.

Submitted 15 July, 2024; originally announced July 2024.

Comments: 4 figures

arXiv:2407.09550 [pdf]

CAPM: Fast and Robust Verification on Maxpool-based CNN via Dual Network

Authors: Jia-Hau Bai, Chi-Ting Liu, Yu Wang, Fu-Chieh Chang, Pei-Yuan Wu

Abstract: This study uses CAPM (Convex Adversarial Polytope for Maxpool-based CNN) to improve the verified bound for general purpose maxpool-based convolutional neural networks (CNNs) under bounded norm adversarial perturbations. The maxpool function is decomposed as a series of ReLU functions to extend the convex relaxation technique to maxpool functions, by which the verified bound can be efficiently computed through a dual network. The experimental results demonstrate that this technique achieves state-of-the-art verification precision for maxpool-based CNNs and involves a much lower computational cost than current verification methods, such as DeepZ, DeepPoly and PRIMA. This method is also applicable to large-scale CNNs, which previous studies show to be often computationally prohibitively expensive. Under certain circumstances, CAPM is 40 times, 20 times or twice as fast and gives a significantly higher verification bound (CAPM 98% vs. PRIMA 76%/DeepPoly 73%/DeepZ 8%) as compared to PRIMA/DeepPoly/DeepZ. Furthermore, we present the time complexity of our algorithm as $O(W^2NK)$, where $W$ is the maximum width of the neural network, $N$ is the number of neurons, and $K$ is the size of the maxpool layer's kernel.

Submitted 27 June, 2024; originally announced July 2024.
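
The CAPM abstract above says the maxpool function is decomposed as a series of ReLU functions. A minimal sketch of one standard way to do this uses the identity max(a, b) = a + ReLU(b - a), folded over the pooling window. The function names are hypothetical, and the abstract does not confirm that this is the exact decomposition the authors use.

```python
from typing import Sequence


def relu(x: float) -> float:
    """Rectified linear unit: max(x, 0)."""
    return x if x > 0.0 else 0.0


def maxpool_via_relu(window: Sequence[float]) -> float:
    """Compute the maximum of a pooling window using only additions and ReLUs.

    Assumption: this chained form illustrates the general idea only; it is
    not taken from the CAPM paper.
    """
    if not window:
        raise ValueError("window must be non-empty")
    running = window[0]
    for value in window[1:]:
        # max(running, value) = running + ReLU(value - running)
        running = running + relu(value - running)
    return running


if __name__ == "__main__":
    print(maxpool_via_relu([3.0, -1.0, 7.0, 2.0]))
```

A window of size K needs K - 1 ReLU applications in this form, so any relaxation defined for ReLU can in principle be applied to each step.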

arXiv:2407.09021 [pdf, other]

Squeeze-and-Excite ResNet-Conformers for Sound Event Localization, Detection, and Distance Estimation for DCASE 2024 Challenge

Authors: Jun Wei Yeow, Ee-Leng Tan, Jisheng Bai, Santi Peksi, Woon-Seng Gan

Abstract: This technical report details our systems submitted for Task 3 of the DCASE 2024 Challenge: Audio and Audiovisual Sound Event Localization and Detection (SELD) with Source Distance Estimation (SDE). We address only the audio-only SELD with SDE (SELDDE) task in this report. We propose to improve the existing ResNet-Conformer architectures with Squeeze-and-Excitation blocks in order to introduce additional forms of channel- and spatial-wise attention. In order to improve SELD performance, we also utilize the Spatial Cue-Augmented Log-Spectrogram (SALSA) features over the commonly used log-mel spectra features for polyphonic SELD. We complement the existing Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) dataset with the audio channel swapping technique and synthesize additional data using the SpatialScaper generator. We also perform distance scaling in order to prevent large distance errors from contributing more towards the loss function. Finally, we evaluate our approach on the evaluation subset of the STARSS23 dataset.

Submitted 12 July, 2024; originally announced July 2024.

Comments: Technical report for DCASE 2024 Challenge Task 3

arXiv:2407.08183 [pdf, other]

The white-light superflares from cool stars in GWAC triggers

Authors: Guang-Wei Li, Liang Wang, Hai-Long Yuan, Li-Ping Xin, Jing Wang, Chao Wu, Hua-Li Li, Hasitieer Haerken, Wei-Hua Wang, Hong-Bo Cai, Xu-Hui Han, Yang Xu, Lei Huang, Xiao-Meng Lu, Jian-Ying Bai, Xiang-Yu Wang, Zi-Gao Dai, En-Wei Liang, Jian-Yan Wei

Abstract: M-type stars are the ones that flare most frequently, but how big their maximum flare energy can reach is still unknown. We present 163 flares from 162 individual M2 through L1-type stars that triggered the GWAC, with flare energies ranging from $10^{32.2}$ to $10^{36.4}$ erg. The flare amplitudes range from $\triangle G = 0.84$ to $\sim 10$ mag. Flare energy increases with stellar surface temperature ($T_{\rm eff}$) but both $\triangle G$ and equivalent duration $\log_{10}(ED)$ seem to be independent of $T_{\rm eff}$. Combining periods detected from light curves of TESS and K2, spectra from LAMOST, SDSS and the 2.16 m Telescope, and the Gaia DR3 data, we found that these GWAC flare stars are young. For the stars that have spectra, we found that these stars are in or very near to the saturation region, and $\log_{10}(L_{\rm Hα}/L_{\rm bol})$ is lower for M7-L1 stars than for M2-M6 stars. We also studied the relation between GWAC flare bolometric energy $E_{\rm bol}$ and stellar hemispherical area $S$, and found that $\log_{10}E_{\rm bol}$ (in erg) increases with increasing $S$ (in cm$^2$), and the maximum flare energy $\log_{10}E_{\rm bol, max} \geqslant \log_{10}S + 14.25$. For M7-L1 stars, there seem to be other factors limiting their maximum flare energies in addition to stellar hemispherical area.

Submitted 11 July, 2024; originally announced July 2024.

Comments: 18 pages, 11 figures, 4 tables

arXiv:2407.08120 [pdf, other]

Spectroastrometry and Reverberation Mapping (SARM) of Active Galactic Nuclei. I. The H$β$ Broad-line Region Structure and Black Hole Mass of Five Quasars

Authors: Yan-Rong Li, Chen Hu, Zhu-Heng Yao, Yong-Jie Chen, Hua-Rui Bai, Sen Yang, Pu Du, Feng-Na Fang, Yi-Xin Fu, Jun-Rong Liu, Yue-Chang Peng, Yu-Yang Songsheng, Yi-Lin Wang, Ming Xiao, Shuo Zhai, Hartmut Winkler, Jin-Ming Bai, Luis C. Ho, Romain G. Petrov, Jesus Aceituno, Jian-Min Wang

Abstract: We conduct a reverberation mapping (RM) campaign to spectroscopically monitor a sample of selected bright active galactic nuclei with large anticipated broad-line region (BLR) sizes adequate for spectroastrometric observations by the GRAVITY instrument on the Very Large Telescope Interferometer. We report the first results for five objects, IC 4329A, Mrk 335, Mrk 509, Mrk 1239, and PDS 456, among which Mrk 1239 and PDS 456 are for the first time spectroscopically monitored. We obtain multi-year monitoring data and perform multi-component spectral decomposition to extract the broad H$β$ profiles. We detect significant time lags between the H$β$ and continuum variations, generally obeying the previously established BLR size-luminosity relation. Velocity-resolved H$β$ time lags illustrate diverse, possibly evolving BLR kinematics. We further measure the H$β$ line widths from mean and rms spectra and the resulting virial products show good consistency among different seasons. Adopting a unity virial factor and the full width at half maximum of the broad H$β$ line from the mean spectrum as the measure of velocity, the obtained black hole mass averaged over seasons is $\log M_\bullet/M_\odot=8.02_{-0.14}^{+0.09}$, $6.92_{-0.12}^{+0.12}$, $8.01_{-0.25}^{+0.16}$, $7.44_{-0.14}^{+0.13}$, and $8.59_{-0.11}^{+0.07}$ for the five objects, respectively. The black hole mass estimations using other line width measures are also reported (up to the virial factors). For objects with previous RM campaigns, our mass estimates are in agreement with earlier results. In a companion paper, we will employ BLR dynamical modeling to directly infer the black hole mass and thereby determine the virial factors.

Submitted 10 July, 2024; originally announced July 2024.

Comments: 32 pages, 6 tables, 20 figures. To appear in ApJ

arXiv:2407.06964 [pdf, other]

Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach

Authors: Taolin Zhang, Jiawang Bai, Zhihe Lu, Dongze Lian, Genping Wang, Xinchao Wang, Shu-Tao Xia

Abstract: Recent works on parameter-efficient transfer learning (PETL) show the potential to adapt a pre-trained Vision Transformer to downstream recognition tasks with only a few learnable parameters. However, since they usually insert new structures into the pre-trained model, entire intermediate features of that model are changed and thus need to be stored to be involved in back-propagation, resulting in memory-heavy training. We solve this problem from a novel disentangled perspective, i.e., dividing PETL into two aspects: task-specific learning and pre-trained knowledge utilization. Specifically, we synthesize the task-specific query with a learnable and lightweight module, which is independent of the pre-trained model. The synthesized query equipped with task-specific knowledge serves to extract the useful features for downstream tasks from the intermediate representations of the pre-trained model in a query-only manner. Built upon these features, a customized classification head is proposed to make the prediction for the input sample. Because our method uses a lightweight architecture and avoids the use of heavy intermediate features for running gradient descent, it demonstrates limited memory usage in training. Extensive experiments manifest that our method achieves state-of-the-art performance under memory constraints, showcasing its applicability in real-world situations.

Submitted 14 July, 2024; v1 submitted 9 July, 2024; originally announced July 2024.

Comments: ECCV2024

arXiv:2407.05676 [pdf, other]

Continuous broadband Rydberg receiver using AC Stark shifts and Floquet States

Authors: Danni Song, Yuechun Jiao, Jinlian Hu, Yuwen Yin, Zhenhua Li, Yunhui He, Jingxu Bai, Jianming Zhao, Suotang Jia

Abstract: We demonstrate continuous broadband microwave receivers based on AC Stark shifts and Floquet states of Rydberg levels in a cesium atomic vapor cell. The resonant transition frequency of two adjacent Rydberg states 78$S_{1/2}$ and 78$P_{1/2}$ is tuned based on the AC Stark effect of a 70 MHz radio frequency (RF) field that is applied outside the vapor cell. Meanwhile, the Rydberg states also exhibit Floquet even-order sidebands that are used to extend the bandwidths further. We achieve microwave electric field measurements over 1.172 GHz of continuous frequency range. The sensitivity of the Rydberg receiver with heterodyne technique in the absence of RF field is 280.2 nV cm$^{-1}$ Hz$^{-1/2}$, while it is dramatically decreased with tuning the resonant transition frequency in the presence of RF field. Surprisingly, the sensitivity can be greatly improved if the microwave field couples the Floquet sideband transition. Achieving continuous-frequency, high-sensitivity microwave detection will promote the application of Rydberg receivers in radar and wireless communication.

Submitted 8 July, 2024; originally announced July 2024.

Comments: 5 pages, 4 figures

arXiv:2407.05414 [pdf, other]

Velocity-Resolved Ionization Mapping of Broad Line Region. I. Insights into Diverse Geometry and Kinematics

Authors: Sha-Sha Li, Hai-Cheng Feng, H. T. Liu, J. M. Bai, Xiang Ji, Cheng Cheng, Kai-Xing Lu, Jian-Guo Wang, Rui Li

Abstract: Broad emission lines of active galactic nuclei (AGNs) originate from the broad-line region (BLR), consisting of dense gas clouds in orbit around an accreting supermassive black hole. Understanding the geometry and kinematics of the region is crucial for gaining insights into the physics and evolution of AGNs. Conventional velocity-resolved reverberation mapping may face challenges in disentangling the degeneracy between intricate motion and geometry of this region. To address this challenge, new key constraints are required. Here, we report the discovery of an asymmetric BLR using a novel technique: velocity-resolved ionization mapping, which can map the distance of emitting gas clouds by measuring Hydrogen line ratios at different velocities. By analyzing spectroscopic monitoring data, we find that the Balmer decrement is anticorrelated with the continuum and correlated with the lags across broad emission line velocities. Some line ratio profiles deviate from the expectations for a symmetrically virialized BLR, suggesting that the red-shifted and blue-shifted gas clouds may not be equidistant from the supermassive black hole (SMBH). This asymmetric geometry might represent a formation imprint, provide new perspectives on the evolution of AGNs, and influence SMBH mass measurements.

Submitted 7 July, 2024; originally announced July 2024.

Comments: 20 pages, 10 figures, Accepted by ApJ

arXiv:2407.05369 [pdf, other]

A Novel Property of Generalized Fibonacci Sequence in Grids

Authors: Zixian Yang, Jianchao Bai

Abstract: The Fibonacci sequence, generated by summing the preceding two terms, is a classical sequence renowned for its elegant properties. In this paper, leveraging properties of generalized Fibonacci sequences and formulas for consecutive sums of equidistant subsequences, we investigate the ratio of the sum of numbers along the main diagonal and the sub-diagonal of odd-order grids containing generalized Fibonacci sequences. We show that this ratio is solely dependent on the order of the grid, providing a concise and splendid identity.

Submitted 7 July, 2024; originally announced July 2024.
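
The claim in the abstract above can be explored numerically. The sketch below rests on assumptions the abstract does not state: the grid is filled row by row with consecutive terms of a generalized Fibonacci sequence, and "sub-diagonal" is read as the anti-diagonal. The function names are hypothetical, and the paper's actual construction may differ.

```python
from fractions import Fraction
from typing import List


def generalized_fibonacci(first: int, second: int, count: int) -> List[int]:
    """Return the first `count` terms of G(k) = G(k-1) + G(k-2) from two seeds."""
    terms = [first, second]
    while len(terms) < count:
        terms.append(terms[-1] + terms[-2])
    return terms[:count]


def diagonal_ratio(order: int, first: int = 1, second: int = 1) -> Fraction:
    """Ratio of main-diagonal sum to anti-diagonal sum of an odd-order grid.

    Assumption: the grid is filled in row-major order; this layout is chosen
    for illustration and is not taken from the paper.
    """
    if order < 1 or order % 2 == 0:
        raise ValueError("order must be a positive odd integer")
    terms = generalized_fibonacci(first, second, order * order)
    grid = [terms[r * order:(r + 1) * order] for r in range(order)]
    main = sum(grid[i][i] for i in range(order))
    anti = sum(grid[i][order - 1 - i] for i in range(order))
    return Fraction(main, anti)


if __name__ == "__main__":
    # Compare different positive seeds for the same grid order.
    for seeds in [(1, 1), (1, 3), (2, 7)]:
        print(seeds, diagonal_ratio(3, *seeds), diagonal_ratio(5, *seeds))
```

Under these assumptions, a 3-by-3 grid seeded with 1, 1 has main-diagonal sum 1 + 5 + 34 = 40 and anti-diagonal sum 2 + 5 + 13 = 20, a ratio of 2. Other positive seeds give the same ratio for that order, which is consistent with the abstract's statement that the ratio depends only on the grid order.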

arXiv:2407.03654 [pdf, other]

Mixstyle based Domain Generalization for Sound Event Detection with Heterogeneous Training Data

Authors: Yang Xiao, Han Yin, Jisheng Bai, Rohan Kumar Das

Abstract: This work explores domain generalization (DG) for sound event detection (SED), advancing adaptability towards real-world scenarios. Our approach employs a mean-teacher framework with domain generalization to integrate heterogeneous training data, while preserving the SED model performance across the datasets. Specifically, we first apply mixstyle to the frequency dimension to adapt the mel-spectrograms from different domains. Next, we use the adaptive residual normalization method to generalize features across multiple domains by applying instance normalization in the frequency dimension. Lastly, we use the sound event bounding boxes method for post-processing. Our approach integrates features from bidirectional encoder representations from audio transformers and a convolutional recurrent neural network. We evaluate the proposed approach on DCASE 2024 Challenge Task 4 dataset, measuring polyphonic SED score (PSDS) on the DESED dataset and macro-average pAUC on the MAESTRO dataset. The results indicate that the proposed DG-based method improves both PSDS and macro-average pAUC compared to the challenge baseline.

Submitted 4 July, 2024; originally announced July 2024.

Comments: Submitted to DCASE WS 2024. 5 pages. arXiv admin note: text overlap with arXiv:2407.00291

arXiv:2407.00291 [pdf, other]

FMSG-JLESS Submission for DCASE 2024 Task4 on Sound Event Detection with Heterogeneous Training Dataset and Potentially Missing Labels

Authors: Yang Xiao, Han Yin, Jisheng Bai, Rohan Kumar Das

Abstract: This report presents the systems developed and submitted by Fortemedia Singapore (FMSG) and Joint Laboratory of Environmental Sound Sensing (JLESS) for DCASE 2024 Task 4. The task focuses on recognizing event classes and their time boundaries, given that multiple events can be present and may overlap in an audio recording. The novelty this year is a dataset with two sources, making it challenging to achieve good performance without knowing the source of the audio clips during evaluation. To address this, we propose a sound event detection method using domain generalization. Our approach integrates features from bidirectional encoder representations from audio transformers and a convolutional recurrent neural network. We focus on three main strategies to improve our method. First, we apply mixstyle to the frequency dimension to adapt the mel-spectrograms from different domains. Second, we compute the training loss of our model separately for each dataset over its corresponding classes. This independent learning framework helps the model extract domain-specific features effectively. Lastly, we use the sound event bounding boxes method for post-processing. Our proposed method shows superior macro-average pAUC and polyphonic SED score performance on the DCASE 2024 Challenge Task 4 validation dataset and public evaluation dataset.

Submitted 28 June, 2024; originally announced July 2024.

Comments: Technical report for DCASE 2024 Challenge Task 4

arXiv:2406.16855 [pdf, other]

DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation

Authors: Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, Shu-Tao Xia

Abstract: Personalized image generation holds great promise in assisting humans in everyday work and life due to its impressive function in creatively generating personalized content. However, current evaluations either are automated but misalign with humans or require human evaluations that are time-consuming and expensive. In this work, we present DreamBench++, a human-aligned benchmark automated by advanced multimodal GPT models. Specifically, we systematically design the prompts to let GPT be both human-aligned and self-aligned, empowered with task reinforcement. Further, we construct a comprehensive dataset comprising diverse images and prompts. By benchmarking 7 modern generative models, we demonstrate that DreamBench++ results in significantly more human-aligned evaluation, helping boost the community with innovative findings.

Submitted 24 June, 2024; originally announced June 2024.

Comments: Project page: https://dreambenchplus.github.io/

arXiv:2406.11045 [pdf, other]

Kolmogorov Arnold Informed neural network: A physics-informed deep learning framework for solving PDEs based on Kolmogorov Arnold Networks

Authors: Yizheng Wang, Jia Sun, Jinshuai Bai, Cosmin Anitescu, Mohammad Sadegh Eshaghi, Xiaoying Zhuang, Timon Rabczuk, Yinghua Liu

Abstract: AI for partial differential equations (PDEs) has garnered significant attention, particularly with the emergence of Physics-informed neural networks (PINNs). The recent advent of Kolmogorov-Arnold Network (KAN) indicates that there is potential to revisit and enhance the previously MLP-based PINNs. Compared to MLPs, KANs offer interpretability and require fewer parameters. PDEs can be described in various forms, such as strong form, energy form, and inverse form. While mathematically equivalent, these forms are not computationally equivalent, making the exploration of different PDE formulations significant in computational physics. Thus, we propose different PDE forms based on KAN instead of MLP, termed Kolmogorov-Arnold-Informed Neural Network (KINN). We systematically compare MLP and KAN in various numerical examples of PDEs, including multi-scale, singularity, stress concentration, nonlinear hyperelasticity, heterogeneous, and complex geometry problems. Our results demonstrate that KINN significantly outperforms MLP in terms of accuracy and convergence speed for numerous PDEs in computational solid mechanics, except for the complex geometry problem. This highlights KINN's potential for more efficient and accurate PDE solutions in AI for PDEs.

Submitted 16 June, 2024; originally announced June 2024.

arXiv:2406.10885 [pdf, other]

On the Role of Entity and Event Level Conceptualization in Generalizable Reasoning: A Survey of Tasks, Methods, Applications, and Future Directions

Authors: Weiqi Wang, Tianqing Fang, Haochen Shi, Baixuan Xu, Wenxuan Ding, Liyu Zhang, Wei Fan, Jiaxin Bai, Haoran Li, Xin Liu, Yangqiu Song

Abstract: Entity- and event-level conceptualization, as fundamental elements of human cognition, plays a pivotal role in generalizable reasoning. This process involves abstracting specific instances into higher-level concepts and forming abstract knowledge that can be applied in unfamiliar or novel situations, which can enhance models' inferential capabilities and support the effective transfer of knowledge across various domains. Despite its significance, there is currently a lack of a systematic overview that comprehensively examines existing works in the definition, execution, and application of conceptualization to enhance reasoning tasks. In this paper, we address this gap by presenting the first comprehensive survey of 150+ papers, categorizing various definitions, resources, methods, and downstream applications related to conceptualization into a unified taxonomy, with a focus on the entity and event levels. Furthermore, we shed light on potential future directions in this field and hope to garner more attention from the community.

Submitted 16 June, 2024; originally announced June 2024.

arXiv:2406.10701 [pdf, other]

MIND: Multimodal Shopping Intention Distillation from Large Vision-language Models for E-commerce Purchase Understanding

Authors: Baixuan Xu, Weiqi Wang, Haochen Shi, Wenxuan Ding, Huihao Jing, Tianqing Fang, Jiaxin Bai, Long Chen, Yangqiu Song

Abstract: Improving user experience and providing personalized search results in E-commerce platforms heavily rely on understanding purchase intention. However, existing methods for acquiring large-scale intentions bank on distilling large language models with human annotation for verification. Such an approach tends to generate product-centric intentions, overlook valuable visual information from product images, and incur high costs for scalability. To address these issues, we introduce MIND, a multimodal framework that allows Large Vision-Language Models (LVLMs) to infer purchase intentions from multimodal product metadata and prioritize human-centric ones. Using Amazon Review data, we apply MIND and create a multimodal intention knowledge base, which contains 1,264,441 intentions derived from 126,142 co-buy shopping records across 107,215 products. Extensive human evaluations demonstrate the high plausibility and typicality of our obtained intentions and validate the effectiveness of our distillation framework and filtering mechanism. Additional experiments reveal that our obtained intentions significantly enhance large language models in two intention comprehension tasks.

Submitted 15 June, 2024; originally announced June 2024.

Comments: 8 pages, 5 figures

arXiv:2406.10173 [pdf, other]

IntentionQA: A Benchmark for Evaluating Purchase Intention Comprehension Abilities of Language Models in E-commerce

Authors: Wenxuan Ding, Weiqi Wang, Sze Heng Douglas Kwok, Minghao Liu, Tianqing Fang, Jiaxin Bai, Junxian He, Yangqiu Song

Abstract: Enhancing Language Models' (LMs) ability to understand purchase intentions in E-commerce scenarios is crucial for their effective assistance in various downstream tasks. However, previous approaches that distill intentions from LMs often fail to generate meaningful and human-centric intentions applicable in real-world E-commerce contexts. This raises concerns about the true comprehension and utilization of purchase intentions by LMs. In this paper, we present IntentionQA, a double-task multiple-choice question answering benchmark to evaluate LMs' comprehension of purchase intentions in E-commerce. Specifically, LMs are tasked to infer intentions based on purchased products and utilize them to predict additional purchases. IntentionQA consists of 4,360 carefully curated problems across three difficulty levels, constructed using an automated pipeline to ensure scalability on large E-commerce platforms. Human evaluations demonstrate the high quality and low false-negative rate of our benchmark. Extensive experiments across 19 language models show that they still struggle with certain scenarios, such as understanding products and intentions accurately, jointly reasoning with products and intentions, and more, in which they fall far behind human performances. Our code and data are publicly available at https://github.com/HKUST-KnowComp/IntentionQA.

Submitted 14 June, 2024; originally announced June 2024.

arXiv:2406.09695 [pdf, other]

Machine learning-based Near-field Emitter Localization via Grouped Hybrid Analog and Digital Massive MIMO Receive Array

Authors: Yifan Li, Feng Shu, Jiatong Bai, Cunhua Pan, Yongpeng Wu, Yaoliang Song, Jiangzhou Wang

Abstract: A fully-digital massive MIMO receive array is promising to meet the high-resolution requirement of near-field (NF) emitter localization, but it also significantly increases hardware costs and algorithm complexity. In order to meet the future demand for green communication while maintaining high performance, the grouped hybrid analog and digital (HAD) structure is proposed for NF direction-of-arrival (DOA) estimation, which divides the large-scale receive array into small-scale groups, each containing several subarrays. Thus the NF DOA estimation problem is viewed as far-field (FF) within each group, and some existing methods such as MUSIC, Root-MUSIC, ESPRIT, etc., can be adopted. Then by angle calibration, a candidate position set is generated. To eliminate the phase ambiguity arising from the HAD structure and obtain the emitter position, two low-complexity clustering-based methods, minimum sample distance clustering (MSDC) and range scatter diagram (RSD) - angle scatter diagram (ASD)-based DBSCAN (RSD-ASD-DBSCAN), are proposed based on the distribution features of samples in the candidate position set. Then to further improve the localization accuracy, a model-driven regression network (RegNet) is designed, which consists of a multi-layer neural network (MLNN) for false solution elimination and a perceptron for angle fusion. Finally, the Cramer-Rao lower bound (CRLB) of NF emitter localization for the proposed grouped HAD structure is also derived. The simulation results show that the proposed methods can achieve the CRLB at different SNR regions, the RegNet has great performance advantages at low SNR regions, and the clustering-based methods have much lower complexity.

Submitted 13 June, 2024; originally announced June 2024.

arXiv:2406.07880 [pdf, other]

A Comprehensive Survey on Machine Learning Driven Material Defect Detection: Challenges, Solutions, and Future Prospects

Authors: Jun Bai, Di Wu, Tristan Shelley, Peter Schubel, David Twine, John Russell, Xuesen Zeng, Ji Zhang

Abstract: Material defects (MD) represent a primary challenge affecting product performance and giving rise to safety issues in related products. The rapid and accurate identification and localization of MD constitute crucial research endeavours in addressing contemporary challenges associated with MD. Although conventional non-destructive testing methods such as ultrasonic and X-ray approaches have mitigated issues related to low efficiency in manual inspections, they struggle to meet the diverse requirements of high precision, real-time speed, automation, and intelligence. In recent years, propelled by the swift advancement of machine learning (ML) technologies, particularly exemplified by deep learning, ML has swiftly emerged as the core technology and a prominent research direction for material defect detection (MDD). Through a comprehensive review of the latest literature, we systematically classify the ML techniques applied in MDD into five categories: unsupervised learning, supervised learning, semi-supervised learning, reinforcement learning, and generative learning. We provide a detailed analysis of the main principles and techniques used, together with the advantages and potential challenges associated with these techniques. Furthermore, the survey focuses on the techniques for defect detection in composite materials, which are important types of materials enjoying increasingly wide application in various industries such as aerospace, automotive, construction, and renewable energy. Finally, the survey explores potential future directions in MDD utilizing ML technologies. This comprehensive survey not only consolidates existing literature on ML-based MDD technologies but also serves as a foundational reference for future researchers and industrial practitioners, providing valuable insights and guidance in developing advanced and efficient MDD systems.

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.05696 [pdf, other]

Two Power Allocation and Beamforming Strategies for Active IRS-aided Wireless Network via Machine Learning

Authors: Qiankun Cheng, Jiatong Bai, Baihua Shi, Wei Gao, Feng Shu

Abstract: This paper models an active intelligent reflecting surface (IRS)-assisted wireless communication network, which has the ability to adjust power between the base station (BS) and the IRS. We aim to maximize the signal-to-noise ratio of the user by jointly designing the power allocation (PA) factor, the active IRS phase shift matrix, and the beamforming vector of the BS, subject to a total power constraint. To tackle this non-convex problem, we alternately optimize these variables. Firstly, the PA factor is designed via a polynomial regression method. Next, the BS beamforming vector and IRS phase shift matrix are obtained by Dinkelbach's transform and successive convex approximation methods. To reduce the high computational complexity of the above proposed algorithm, we maximize the achievable rate (AR) and use a closed-form fractional programming method to transform the original problem into an equivalent form. Then, we address this problem by iteratively optimizing auxiliary variables and the BS and IRS beamformings. Simulation results show that the proposed algorithms can effectively improve the AR performance compared to fixed PA strategies, aided by passive IRS, and without IRS.

Submitted 9 June, 2024; originally announced June 2024.

arXiv:2406.03797 [pdf, other]

Morpho-Photometric Classification of KiDS DR5 Sources Based on Neural Networks: A Comprehensive Star-Quasar-Galaxy Catalog

Authors: Hai-Cheng Feng, Rui Li, Nicola R. Napolitano, Sha-Sha Li, J. M. Bai, Ran Li, H. T. Liu, Kai-Xing Lu, Mario Radovich, Huan-Yuan Shan, Jian-Guo Wang, Wen-Zhe Xi, Ling-Hua Xie, Yang-Wei Zhang

Abstract: We present a novel multimodal neural network for classifying astronomical sources in multiband ground-based observations, from optical to near infrared, to separate sources into stars, galaxies and quasars. Our approach combines a convolutional neural network branch for learning morphological features from $r$-band images with an artificial neural network branch for extracting spectral energy distribution (SED) information. Specifically, we have used 9-band optical ($ugri$) and NIR ($ZYHJK_s$) data from the Kilo-Degree Survey (KiDS) Data Release 5. The two branches of the network are concatenated and fed into fully-connected layers for final classification. We train the network on a spectroscopically confirmed sample from the Sloan Digital Sky Survey cross-matched with KiDS. The trained model achieves 98.76% overall accuracy on an independent testing dataset, with F1 scores exceeding 95% for each class. Raising the output probability threshold, we obtain higher purity at the cost of a lower completeness. We have also validated the network using external catalogs cross-matched with KiDS, correctly classifying 99.74% of a pure star sample selected from Gaia parallaxes and proper motions, and 99.74% of an external galaxy sample from the Galaxy and Mass Assembly survey, adjusted for low-redshift contamination. We apply the trained network to 27,334,751 KiDS DR5 sources with $r \leqslant 23$ mag to generate a new classification catalog. This multimodal neural network successfully leverages both morphological and SED information to enable efficient and robust classification of stars, quasars, and galaxies in large photometric surveys.

Submitted 6 June, 2024; originally announced June 2024.

Comments: 18 pages, 12 figures, 2 tables, Submitted to ApJS

arXiv:2406.03127 [pdf, other]

Towards Real-world Scenario: Imbalanced New Intent Discovery

Authors: Shun Zhang, Chaoran Yan, Jian Yang, Jiaheng Liu, Ying Mo, Jiaqi Bai, Tongliang Li, Zhoujun Li

Abstract: New Intent Discovery (NID) aims at detecting known and previously undefined categories of user intent by utilizing limited labeled and massive unlabeled data. Most prior works operate under the unrealistic assumption that the distribution of both familiar and new intent classes is uniform, overlooking the skewed and long-tailed distributions frequently encountered in real-world scenarios. To bridge the gap, our work introduces the imbalanced new intent discovery (i-NID) task, which seeks to identify familiar and novel intent categories within long-tailed distributions. A new benchmark (ImbaNID-Bench) comprised of three datasets is created to simulate the real-world long-tail distributions. ImbaNID-Bench ranges from broad cross-domain to specific single-domain intent categories, providing a thorough representation of practical use cases. Besides, a robust baseline model ImbaNID is proposed to achieve cluster-friendly intent representations. It includes three stages: model pre-training, generation of reliable pseudo-labels, and robust representation learning that strengthens the model performance to handle the intricacies of real-world data distributions. Our extensive experiments on previous benchmarks and the newly established benchmark demonstrate the superior performance of ImbaNID in addressing the i-NID task, highlighting its potential as a powerful baseline for uncovering and categorizing user intents in imbalanced and long-tailed distributions (https://github.com/Zkdc/i-NID).

Submitted 5 June, 2024; originally announced June 2024.

Comments: ACL 2024

arXiv:2406.02993 [pdf, other]

Dual-color Q-switched mode-locking in an Erbium-doped fiber laser

Authors: Chenyue Lv, Baole Lu, Jintao Bai

Abstract: Q-switched mode-locking (QML) has been widely observed in various lasers, but its generation mechanism in passive mode-locking remains unclear. In this paper, we build a dual-color QML Erbium-doped fiber laser and find a bound-state-like envelope on the optical spectrum for the first time. Theoretically, the formation mechanism of QML is numerically investigated using the coupled Ginzburg-Landau equations. In addition, we demonstrate the existence of two QML pulse evolution patterns with gain or polarization state variations in simulation. Our results deepen the understanding of QML pulses in mode-locked fiber lasers and provide a foundation for studying mode-locking nonlinear evolutionary paths.

Submitted 5 June, 2024; originally announced June 2024.

arXiv:2405.19732 [pdf, other]

Two Optimizers Are Better Than One: LLM Catalyst Empowers Gradient-Based Optimization for Prompt Tuning

Authors: Zixian Guo, Ming Liu, Zhilong Ji, Jinfeng Bai, Yiwen Guo, Wangmeng Zuo

Abstract: Learning a skill generally relies on both practical experience by a doer and insightful high-level guidance by an instructor. Will this strategy also work well for solving complex non-convex optimization problems? Here, a common gradient-based optimizer acts like a disciplined doer, making a locally optimal update at each step. Recent methods utilize large language models (LLMs) to optimize solutions for concrete problems by inferring from natural language instructions, akin to a high-level instructor. In this paper, we show that these two optimizers are complementary to each other, suggesting a collaborative optimization approach. The gradient-based optimizer and LLM-based optimizer are combined in an interleaved manner. We instruct LLMs using task descriptions and timely optimization trajectories recorded during gradient-based optimization. Inferred results from LLMs are used as restarting points for the next stage of gradient optimization. By leveraging both the locally rigorous gradient-based optimizer and the high-level deductive LLM-based optimizer, our combined optimization method consistently yields improvements over competitive baseline prompt tuning methods. Our results demonstrate the synergistic effect of conventional gradient-based optimization and the inference ability of LLMs. The code is released at https://github.com/guozix/LLM-catalyst.

Submitted 6 June, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

arXiv:2405.15758 [pdf, other]

InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation

Authors: Yuchi Wang, Junliang Guo, Jianhong Bai, Runyi Yu, Tianyu He, Xu Tan, Xu Sun, Jiang Bian

Abstract: Recent talking avatar generation models have made strides in achieving realistic and accurate lip synchronization with the audio, but often fall short in controlling and conveying detailed expressions and emotions of the avatar, making the generated video less vivid and controllable. In this paper, we propose a novel text-guided approach for generating emotionally expressive 2D avatars, offering fine-grained control, improved interactivity, and generalizability to the resulting video. Our framework, named InstructAvatar, leverages a natural language interface to control the emotion as well as the facial motion of avatars. Technically, we design an automatic annotation pipeline to construct an instruction-video paired training dataset, equipped with a novel two-branch diffusion-based generator to predict avatars with audio and text instructions at the same time. Experimental results demonstrate that InstructAvatar produces results that align well with both conditions, and outperforms existing methods in fine-grained emotion control, lip-sync quality, and naturalness. Our project page is https://wangyuchi369.github.io/InstructAvatar/.

Submitted 24 May, 2024; originally announced May 2024.

Comments: Project page: https://wangyuchi369.github.io/InstructAvatar/

arXiv:2405.12072 [pdf, other]

Real topological phonons in 3D carbon allotropes

Authors: Xiaotian Wang, Jingbo Bai, Jianhua Wang, Zhenxiang Cheng, Shifeng Qian, Wenhong Wang, Gang Zhang, Zhi-Ming Yu, Yugui Yao

Abstract: There has been a significant focus on real topological systems that enjoy space-time inversion symmetry (PT) and lack spin-orbit coupling. While the theoretical classification of the real topology has been established, more progress has yet to be made in the materials realization of such real topological systems in three dimensions (3D). To address this crucial issue, by selecting the carbon-based material candidates as targets, we perform high-throughput computing to inspect the real topology in the phonon spectra of the 3D carbon allotropes in the Samara Carbon Allotrope Database (SACADA). Among 1192 kinds of 3D carbon allotropes, we find 65 real topological systems with a phononic real Chern insulating (PRCI) state, 2 real topological systems with a phononic real nodal line (PRNL) state, 10 real topological systems with a phononic real Dirac point (PRDP) state, and 8 real topological systems with a phononic real triple-point pair (PRTPP) state. This greatly expands the material candidates with real topology, especially for the gapless topological phonons. We exhibit the PRCI, PRNL, PRTPP, and PRDP states of 27-SG. 166-pcu-h, 1081-SG. 194-42T13-CA, 52-SG. 141-gis, and 132-SG. 191-3,4T157 as illustrative examples, and explore the second-order boundary mode, i.e., phononic hinge mode. Among the four examples, the materials 1081-SG. 194-42T13-CA and 52-SG. 141-gis are so ideal that the PRNL and PRTPP in them are well separated from other bands, and the phononic hinge mode can be clearly observed. This study aims to broaden the understanding of 3D topological phonons, and emphasizes the potential of 3D carbon allotropes as a valuable framework for exploring the fascinating physics related to phononic hinge modes and phononic real topology.

Submitted 20 May, 2024; originally announced May 2024.

arXiv:2405.10612 [pdf, other]

Not All Prompts Are Secure: A Switchable Backdoor Attack Against Pre-trained Vision Transformers

Authors: Sheng Yang, Jiawang Bai, Kuofeng Gao, Yong Yang, Yiming Li, Shu-tao Xia

Abstract: Given the power of vision transformers, a new learning paradigm, pre-training and then prompting, makes it more efficient and effective to address downstream visual recognition tasks. In this paper, we identify a novel security threat towards such a paradigm from the perspective of backdoor attacks. Specifically, an extra prompt token, called the switch token in this work, can turn the backdoor mode on, i.e., converting a benign model into a backdoored one. Once under the backdoor mode, a specific trigger can force the model to predict a target class. It poses a severe risk to the users of cloud APIs, since the malicious behavior cannot be activated and detected under the benign mode, thus making the attack very stealthy. To attack a pre-trained model, our proposed attack, named SWARM, learns a trigger and prompt tokens including a switch token. They are optimized with the clean loss, which encourages the model to always behave normally even when the trigger is present, and the backdoor loss, which ensures the backdoor can be activated by the trigger when the switch is on. Besides, we utilize the cross-mode feature distillation to reduce the effect of the switch token on clean samples. The experiments on diverse visual recognition tasks confirm the success of our switchable backdoor attack, i.e., achieving 95%+ attack success rate, and also being hard to detect and remove. Our code is available at https://github.com/20000yshust/SWARM.

Submitted 17 May, 2024; originally announced May 2024.

arXiv:2405.09981 [pdf, other]

Adversarial Robustness for Visual Grounding of Multimodal Large Language Models

Authors: Kuofeng Gao, Yang Bai, Jiawang Bai, Yong Yang, Shu-Tao Xia

Abstract: Multi-modal Large Language Models (MLLMs) have recently achieved enhanced performance across various vision-language tasks including visual grounding capabilities. However, the adversarial robustness of visual grounding remains unexplored in MLLMs. To fill this gap, we use referring expression comprehension (REC) as an example task in visual grounding and propose three adversarial attack paradigms as follows. Firstly, untargeted adversarial attacks induce MLLMs to generate incorrect bounding boxes for each object. Besides, exclusive targeted adversarial attacks cause all generated outputs to be the same target bounding box. In addition, permuted targeted adversarial attacks aim to permute all bounding boxes among different objects within a single image. Extensive experiments demonstrate that the proposed methods can successfully attack visual grounding capabilities of MLLMs. Our methods not only provide a new perspective for designing novel attacks but also serve as a strong baseline for improving the adversarial robustness for visual grounding of MLLMs.

Submitted 16 May, 2024; originally announced May 2024.

Comments: ICLR 2024 Workshop on Reliable and Responsible Foundation Models

arXiv:2405.09556 [pdf, other]

Co-learning-aided Multi-modal-deep-learning Framework of Passive DOA Estimators for a Heterogeneous Hybrid Massive MIMO Receiver

Authors: Jiatong Bai, Feng Shu, Qinghe Zheng, Bo Xu, Baihua Shi, Yiwen Chen, Weibin Zhang, Xianpeng Wang

Abstract: Due to their excellent performance in rate and resolution, fully-digital (FD) massive multiple-input multiple-output (MIMO) antenna arrays have been widely applied in data transmission and direction of arrival (DOA) measurements, etc. But they confront two main challenges: high computational complexity and circuit cost. The two problems may be addressed well by the hybrid analog-digital (HAD) structure. But there exists the problem of phase ambiguity for HAD, which leads to its low efficiency or high latency. Does there exist a MIMO structure with low cost, low complexity and high time efficiency at the same time? To satisfy the three properties, a novel heterogeneous hybrid MIMO receiver structure integrating FD and heterogeneous HAD ($\rm{H}^2$AD-FD) is proposed and a corresponding multi-modal (MD)-learning framework is developed. The framework includes three major stages: 1) generate the candidate sets via root multiple signal classification (Root-MUSIC) or deep learning (DL); 2) infer the class of true solutions from candidate sets using machine learning (ML) methods; 3) fuse the two-part true solutions to achieve a better DOA estimation. The above process forms two methods named MD-RootMUSIC and MDDL. To improve DOA estimation accuracy and reduce the clustering complexity, a co-learning-aided MD framework is proposed to form two enhanced methods named CoMDDL and CoMD-RootMUSIC. Moreover, the Cramer-Rao lower bound (CRLB) for the proposed $\rm{H}^2$AD-FD structure is also derived. Experimental results demonstrate that our proposed four methods could approach the CRLB for signal-to-noise ratio (SNR) > 0 dB and the proposed CoMDDL and MDDL perform better than CoMD-RootMUSIC and MD-RootMUSIC, particularly in the extremely low SNR region.

Submitted 12 June, 2024; v1 submitted 27 April, 2024; originally announced May 2024.

arXiv:2405.09425 [pdf, other]

doi 10.1109/IEEECONF59524.2023.10476865

Robust Covariance-Based Activity Detection for Massive Access

Authors: Jianan Bai, Erik G. Larsson

Abstract: The wireless channel is undergoing continuous changes, and the block-fading assumption, despite its popularity in theoretical contexts, never holds true in practical scenarios. This discrepancy is particularly critical for user activity detection in grant-free random access, where joint processing across multiple resource blocks is usually undesirable. In this paper, we propose employing a low-dimensional approximation of the channel to capture variations over time and frequency and robustify activity detection algorithms. This approximation entails projecting channel fading vectors onto their principal directions to minimize the approximation order. Through numerical examples, we demonstrate a substantial performance improvement achieved by the resulting activity detection algorithm.

Submitted 15 May, 2024; originally announced May 2024.

Comments: 5 pages, 11 figures. Asilomar SSC 2023 Conference

arXiv:2405.07551 [pdf, other]

MuMath-Code: Combining Tool-Use Large Language Models with Multi-perspective Data Augmentation for Mathematical Reasoning

Authors: Shuo Yin, Weihao You, Zhilong Ji, Guoqiang Zhong, Jinfeng Bai

Abstract: The tool-use Large Language Models (LLMs) that integrate with external Python interpreters have significantly enhanced mathematical reasoning capabilities for open-source LLMs, while tool-free methods chose another track: augmenting math reasoning data. However, a great method to integrate the above two research paths and combine their advantages remains to be explored. In this work, we first include new math questions via multi-perspective data augmenting methods and then synthesize code-nested solutions to them. The open LLMs (i.e., Llama-2) are finetuned on the augmented dataset to get the resulting models, MuMath-Code ($μ$-Math-Code). During the inference phase, our MuMath-Code generates code and interacts with the external Python interpreter to get the execution results. Therefore, MuMath-Code leverages the advantages of both the external tool and data augmentation. To fully leverage the advantages of our augmented data, we propose a two-stage training strategy: In Stage-1, we finetune Llama-2 on pure CoT data to get an intermediate model, which is then trained on the code-nested data in Stage-2 to get the resulting MuMath-Code. Our MuMath-Code-7B achieves 83.8 on GSM8K and 52.4 on MATH, while the MuMath-Code-70B model achieves new state-of-the-art performance among open methods, reaching 90.7% on GSM8K and 55.1% on MATH. Extensive experiments validate the combination of tool use and data augmentation, as well as our two-stage training strategy. We release the proposed dataset along with the associated code for public use.

Submitted 13 May, 2024; originally announced May 2024.

Comments: The state-of-the-art open-source tool-use LLMs for mathematical reasoning

arXiv:2405.07518 [pdf, other]

SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts

Authors: Raghu Prabhakar, Ram Sivaramakrishnan, Darshan Gandhi, Yun Du, Mingran Wang, Xiangyu Song, Kejie Zhang, Tianren Gao, Angela Wang, Karen Li, Yongning Sheng, Joshua Brot, Denis Sokolov, Apurv Vivek, Calvin Leung, Arjun Sabnis, Jiayu Bai, Tuowen Zhao, Mark Gottscho, David Jackson, Mark Luttrell, Manish K. Shah, Edison Chen, Kaizhao Liang, Swayambhoo Jain , et al. (5 additional authors not shown)

Abstract: Monolithic large language models (LLMs) like GPT-4 have paved the way for modern generative AI applications. Training, serving, and maintaining monolithic LLMs at scale, however, remains prohibitively expensive and challenging. The disproportionate increase in compute-to-memory ratio of modern AI accelerators has created a memory wall, necessitating new methods to deploy AI. Composition of Experts (CoE) is an alternative modular approach that lowers the cost and complexity of training and serving. However, this approach presents two key challenges when using conventional hardware: (1) without fused operations, smaller models have lower operational intensity, which makes high utilization more challenging to achieve; and (2) hosting a large number of models can be either prohibitively expensive or slow when dynamically switching between them. In this paper, we describe how combining CoE, streaming dataflow, and a three-tier memory system scales the AI memory wall. We describe Samba-CoE, a CoE system with 150 experts and a trillion total parameters. We deploy Samba-CoE on the SambaNova SN40L Reconfigurable Dataflow Unit (RDU) - a commercial dataflow accelerator architecture that has been co-designed for enterprise inference and training applications. The chip introduces a new three-tier memory system with on-chip distributed SRAM, on-package HBM, and off-package DDR DRAM. A dedicated inter-RDU network enables scaling up and out over multiple sockets.
We demonstrate speedups ranging from 2x to 13x on various benchmarks running on eight RDU sockets compared with an unfused baseline. We show that for CoE inference deployments, the 8-socket RDU Node reduces machine footprint by up to 19x, speeds up model switching time by 15x to 31x, and achieves an overall speedup of 3.7x over a DGX H100 and 6.6x over a DGX A100.

Submitted 13 May, 2024; originally announced May 2024.

arXiv:2405.07497 [pdf, other]

Towards Subgraph Isomorphism Counting with Graph Kernels

Authors: Xin Liu, Weiqi Wang, Jiaxin Bai, Yangqiu Song

Abstract: Subgraph isomorphism counting is known to be #P-complete and requires exponential time to find the exact solution. Utilizing representation learning has been shown to be a promising direction to represent substructures and approximate the solution. Graph kernels that implicitly capture the correlations among substructures in diverse graphs have exhibited great discriminative power in graph classification, so we investigate, for the first time, their potential in counting subgraph isomorphisms and further explore the augmentation of kernel capability through various variants, including polynomial and Gaussian kernels. Through comprehensive analysis, we enhance the graph kernels by incorporating neighborhood information. Finally, we present the results of extensive experiments to demonstrate the effectiveness of the enhanced graph kernels and discuss promising directions for future research.

Submitted 13 May, 2024; originally announced May 2024.

arXiv:2405.05840 [pdf, other]

FREmu: Power Spectrum Emulator for $f(R)$ Gravity

Authors: Jiachen Bai, Junqing Xia

Abstract: To investigate gravity in the non-linear regime of cosmic structure using measurements from Stage-IV surveys, it is imperative to accurately compute large-scale structure observables, such as non-linear matter power spectra, for gravity models that extend beyond general relativity. However, the theoretical predictions of non-linear observables are typically derived from N-body simulations, which demand substantial computational resources. In this study, we introduce a novel public emulator, termed FREmu, designed to provide rapid and precise forecasts of non-linear power spectra specifically for the Hu-Sawicki $f(R)$ gravity model across scales $0.0089 h \mathrm{Mpc}^{-1}<k<0.5 h \mathrm{Mpc}^{-1}$ and redshifts $0<z<3$. FREmu leverages Principal Component Analysis and Artificial Neural Networks to establish a mapping from parameters to power spectra, utilizing training data derived from the Quijote-MG simulation suite. With a parameter space encompassing 7 dimensions, including $Ω_m$, $Ω_b$, $h$, $n_s$, $σ_8$, $M_ν$ and $f_{R_0}$, the emulator achieves an accuracy exceeding 95% for the majority of cases, thus proving to be highly efficient for constraining parameters.

Submitted 5 June, 2024; v1 submitted 9 May, 2024; originally announced May 2024.

Comments: 12 pages, 5 figures, 1 table, accepted by The Astrophysical Journal (ApJ)

arXiv:2405.05806 [pdf, other]

MasterWeaver: Taming Editability and Identity for Personalized Text-to-Image Generation

Authors: Yuxiang Wei, Zhilong Ji, Jinfeng Bai, Hongzhi Zhang, Lei Zhang, Wangmeng Zuo

Abstract: Text-to-image (T2I) diffusion models have shown significant success in personalized text-to-image generation, which aims to generate novel images with human identities indicated by the reference images. Although several tuning-free methods have achieved promising identity fidelity, they usually suffer from overfitting issues. The learned identity tends to entangle with irrelevant information, resulting in unsatisfactory text controllability, especially on faces. In this work, we present MasterWeaver, a test-time tuning-free method designed to generate personalized images with both faithful identity fidelity and flexible editability. Specifically, MasterWeaver adopts an encoder to extract identity features and steers the image generation through additionally introduced cross attention. To improve editability while maintaining identity fidelity, we propose an editing direction loss for training, which aligns the editing directions of our MasterWeaver with those of the original T2I model. Additionally, a face-augmented dataset is constructed to facilitate disentangled identity learning, and further improve the editability. Extensive experiments demonstrate that our MasterWeaver can not only generate personalized images with faithful identity, but also exhibit superiority in text controllability. Our code will be publicly available at https://github.com/csyxwei/MasterWeaver.

Submitted 10 May, 2024; v1 submitted 9 May, 2024; originally announced May 2024.

Comments: 34 pages

arXiv:2405.05606 [pdf, other]

doi 10.1145/3626772.3661343

Optimizing E-commerce Search: Toward a Generalizable and Rank-Consistent Pre-Ranking Model

Authors: Enqiang Xu, Yiming Qiu, Junyang Bai, Ping Zhang, Dadong Miao, Songlin Wang, Guoyu Tang, Lin Liu, Mingming Li

Abstract: In large e-commerce platforms, search systems are typically composed of a series of modules, including recall, pre-ranking, and ranking phases. The pre-ranking phase, serving as a lightweight module, is crucial for filtering out the bulk of products in advance for the downstream ranking module. Industrial efforts on optimizing the pre-ranking model have predominantly focused on enhancing ranking consistency, model structure, and generalization towards long-tail items. Beyond these optimizations, meeting the system performance requirements presents a significant challenge. Contrasting with existing industry works, we propose a novel method: a Generalizable and RAnk-ConsistEnt Pre-Ranking Model (GRACE), which achieves: 1) Ranking consistency by introducing multiple binary classification tasks that predict whether a product is within the top-k results as estimated by the ranking model, which facilitates the addition of learning objectives on common point-wise ranking models; 2) Generalizability through contrastive learning of representation for all products by pre-training on a subset of ranking product embeddings; 3) Ease of implementation in feature construction and online deployment. Our extensive experiments demonstrate significant improvements in both offline metrics and online A/B test: a 0.75% increase in AUC and a 1.28% increase in CVR.

Submitted 13 May, 2024; v1 submitted 9 May, 2024; originally announced May 2024.

ACM Class: H.3.3
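
The ranking-consistency idea in the GRACE abstract (several binary tasks, each asking whether a product falls in the ranking model's top-k) can be illustrated with a small label-construction sketch. This is only an illustration under assumptions: the function name `topk_labels`, the choice of cutoffs, and the use of raw ranking-model scores are hypothetical and are not taken from the paper.

```python
def topk_labels(ranking_scores, ks):
    """Build one binary label per cutoff k for each product.

    ranking_scores: scores from the downstream ranking model (assumed input).
    ks: iterable of top-k cutoffs, one per binary classification task.
    Returns a list of label rows; row i holds the labels for product i.
    """
    # Sort product indices by descending ranking-model score.
    order = sorted(range(len(ranking_scores)),
                   key=lambda i: ranking_scores[i], reverse=True)
    # Map each product index to its 0-based position in that ordering.
    rank = {idx: pos for pos, idx in enumerate(order)}
    # Label is 1 when the product sits inside the top k, else 0.
    return [[1 if rank[i] < k else 0 for k in ks]
            for i in range(len(ranking_scores))]


if __name__ == "__main__":
    scores = [0.9, 0.1, 0.5, 0.7]
    print(topk_labels(scores, ks=(1, 2)))
```

A pre-ranking model could then be trained with one binary head per cutoff on these labels; how the paper combines the tasks is not stated in the abstract.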

arXiv:2405.05565 [pdf, other]

doi 10.1109/TGRS.2024.3406711

Array SAR 3D Sparse Imaging Based on Regularization by Denoising Under Few Observed Data

Authors: Yangyang Wang, Xu Zhan, Jing Gao, Jinjie Yao, Shunjun Wei, JianSheng Bai

Abstract: Array synthetic aperture radar (SAR) three-dimensional (3D) imaging can obtain 3D information of the target region, which is widely used in environmental monitoring and scattering information measurement. In recent years, with the development of compressed sensing (CS) theory, sparse signal processing is used in array SAR 3D imaging. Compared with matched filter (MF), sparse SAR imaging can effectively improve image quality. However, sparse imaging based on handcrafted regularization functions suffers from target information loss in few observed SAR data. Therefore, in this article, a general 3D sparse imaging framework based on Regularization by Denoising (RED) and proximal gradient descent type method for array SAR is presented. Firstly, we construct explicit prior terms via state-of-the-art denoising operators instead of regularization functions, which can improve the accuracy of sparse reconstruction and preserve the structure information of the target. Then, different proximal gradient descent type methods are presented, including a generalized alternating projection (GAP) and an alternating direction method of multiplier (ADMM), which is suitable for high-dimensional data processing. Additionally, the proposed method has robust convergence, which can achieve sparse reconstruction of 3D SAR in few observed SAR data. Extensive simulations and real data experiments are conducted to analyze the performance of the proposed method. The experimental results show that the proposed method has superior sparse reconstruction performance.

Submitted 26 May, 2024; v1 submitted 9 May, 2024; originally announced May 2024.

arXiv:2405.03349 [pdf, other]

Retinexmamba: Retinex-based Mamba for Low-light Image Enhancement

Authors: Jiesong Bai, Yuhao Yin, Qiyuan He, Yuanxian Li, Xiaofeng Zhang

Abstract: In the field of low-light image enhancement, both traditional Retinex methods and advanced deep learning techniques such as Retinexformer have shown distinct advantages and limitations. Traditional Retinex methods, designed to mimic the human eye's perception of brightness and color, decompose images into illumination and reflection components but struggle with noise management and detail preservation under low light conditions. Retinexformer enhances illumination estimation through traditional self-attention mechanisms, but faces challenges with insufficient interpretability and suboptimal enhancement effects. To overcome these limitations, this paper introduces the RetinexMamba architecture. RetinexMamba not only captures the physical intuitiveness of traditional Retinex methods but also integrates the deep learning framework of Retinexformer, leveraging the computational efficiency of State Space Models (SSMs) to enhance processing speed. This architecture features innovative illumination estimators and damage restorer mechanisms that maintain image quality during enhancement. Moreover, RetinexMamba replaces the IG-MSA (Illumination-Guided Multi-Head Attention) in Retinexformer with a Fused-Attention mechanism, improving the model's interpretability. Experimental evaluations on the LOL dataset show that RetinexMamba outperforms existing deep learning approaches based on Retinex theory in both quantitative and qualitative metrics, confirming its effectiveness and superiority in enhancing low-light images.

Submitted 19 May, 2024; v1 submitted 6 May, 2024; originally announced May 2024.

arXiv:2405.02942 [pdf, other]

Design, analysis, and manufacturing of a glass-plastic hybrid minimalist aspheric panoramic annular lens

Authors: Shaohua Gao, Qi Jiang, Yiqi Liao, Yi Qiu, Wanglei Ying, Kailun Yang, Kaiwei Wang, Benhao Zhang, Jian Bai

Abstract: We propose a high-performance glass-plastic hybrid minimalist aspheric panoramic annular lens (ASPAL) to solve several major limitations of the traditional panoramic annular lens (PAL), such as large size, high weight, and complex system. The field of view (FoV) of the ASPAL is 360°x(35°~110°) and the imaging quality is close to the diffraction limit. This large FoV ASPAL is composed of only 4 lenses. Moreover, we establish a physical structure model of PAL using the ray tracing method and study the influence of its physical parameters on compactness ratio. In addition, for the evaluation of local tolerances of annular surfaces, we propose a tolerance analysis method suitable for ASPAL. This analytical method can effectively analyze surface irregularities on annular surfaces and provide clear guidance on manufacturing tolerances for ASPAL. Benefiting from high-precision glass molding and injection molding aspheric lens manufacturing techniques, we finally manufactured 20 ASPALs in small batches. The weight of an ASPAL prototype is only 8.5 g. Our framework provides promising insights for the application of panoramic systems in space and weight-constrained environmental sensing scenarios such as intelligent security, micro-UAVs, and micro-robots.

Submitted 5 May, 2024; originally announced May 2024.

Comments: Accepted to Optics & Laser Technology

arXiv:2405.01074 [pdf, other]

Stability Analysis of Interacting Wireless Repeaters

Authors: Erik G. Larsson, Jianan Bai

Abstract: We consider a wireless network with multiple single-antenna repeaters that amplify and instantaneously re-transmit the signals they receive to improve the channel rank and system coverage. Due to the positive feedback formed by inter-repeater interference, stability could become a critical issue. We investigate the problem of determining the maximum amplification gain that the repeaters can use without breaking the system stability. Specifically, we obtain a bound by using the Gershgorin disc theorem, which reveals that the maximum amplification gain is restricted by the sum of channel amplitude gains. We show by case studies the usefulness of the so-obtained bound and provide insights on how the repeaters should be deployed.

Submitted 7 July, 2024; v1 submitted 2 May, 2024; originally announced May 2024.

Comments: Accepted to SPAWC 2024. 5 pages, 7 figures
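
The Gershgorin-based bound mentioned in the abstract above can be illustrated with a minimal sketch. The setup is an assumption made for illustration, not the paper's exact model: all repeaters share one amplification gain, `H[i][j]` is the complex channel gain from repeater `j` to repeater `i`, and the diagonal is zero. Under that assumption every eigenvalue of `gain * H` lies in a disc centred at the origin with radius `gain * sum_j |H[i][j]|`, so keeping the gain below the reciprocal of the largest row sum keeps the feedback loop stable. The function names are hypothetical.

```python
def gershgorin_gain_bound(H):
    """Sufficient bound on a common repeater gain (illustrative assumption).

    H: square list of lists of complex inter-repeater channel gains,
       with H[i][i] == 0 (no self-coupling assumed).
    Returns 1 / max_i sum_{j != i} |H[i][j]|.
    """
    radii = [sum(abs(h) for j, h in enumerate(row) if j != i)
             for i, row in enumerate(H)]
    return 1.0 / max(radii)


def loop_amplitude_after(H, gain, steps):
    """Push a unit excitation around the feedback loop `steps` times
    and return the largest remaining amplitude."""
    n = len(H)
    x = [1.0 + 0j] * n
    for _ in range(steps):
        x = [gain * sum(H[i][j] * x[j] for j in range(n)) for i in range(n)]
    return max(abs(v) for v in x)


if __name__ == "__main__":
    H = [[0, 0.2, 0.1],
         [0.3, 0, 0.1],
         [0.1, 0.2j, 0]]
    bound = gershgorin_gain_bound(H)
    # A gain safely below the bound makes the loop signal die out.
    print(bound, loop_amplitude_after(H, 0.9 * bound, 200))
```

The bound is sufficient rather than necessary: a gain above it may still be stable, since the true limit is set by the spectral radius of the coupling matrix.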

arXiv:2404.18700 [pdf, other]

Real-fluid Transport Property Computations Based on the Boltzmann-weighted Full-dimensional Potential Model

Authors: Xin Zhang, Junfeng Bai, Bowen Liu, Tong Zhu, Hao Zhao

Abstract: The intermolecular potential plays crucial roles in real-fluid interactions away from the ideal-gas equilibrium, such as supercritical fluid, high-enthalpy fluid, plasma interactions, etc. We propose a Boltzmann-weighted Full-dimensional (BWF) potential model for real-fluid computations. It includes diverse intermolecular interactions so as to determine the potential well, molecular diameter, dipole moment, polarizability of species without introducing bath gases, allowing more accurate descriptions of potential surfaces with more potential parameters. The anisotropy and temperature dependence of potential parameters are also considered by applying the Boltzmann weighting on all orientations. Through the high-level Symmetry-Adapted Perturbation Theory calculations, full-dimensional potential energy surface datasets are obtained in 432 orientations for each species. Subsequently, the Boltzmann-weighted Full-dimensional potential parameters are derived by training the dataset exceeding $5\times10^{6}$ data, including nonpolar and polar molecules, radicals, long-chain molecules, and ions. These BWF transport properties calculated by the BWF potential have been compared against the Lennard-Jones transport properties as well as experimental viscosity, mass diffusivity, and thermal conductivity coefficients. It shows discrepancies of viscosity coefficients within 1% and 5% for nonpolar and polar molecules, respectively.
Furthermore, this potential model is applied to study radicals, long-chain molecules, and ions, for which the experimental data is rarely accessed in high accuracy. It indicates significant prediction improvements of complex interactions between various particles. The new transport properties are also embedded to predict the laminar flame speeds and the flame extinction limits of methane, dimethyl ether, and n-heptane at elevated pressures, confirming its predictivity and effectiveness.

Submitted 29 April, 2024; originally announced April 2024.

Comments: 18 pages, 10 figures

MSC Class: 82 (Primary) ACM Class: J.2

arXiv:2404.18356 [pdf, other]

FEDQ-Trust: Efficient Data-Driven Trust Prediction for Mobile Edge-Based IoT Systems

Authors: Jiahui Bai, Hai Dong, Athman Bouguettaya

Abstract: We introduce FEDQ-Trust, an innovative data-driven trust prediction approach designed for mobile edge-based Internet of Things (IoT) environments. The decentralized nature of mobile edge environments introduces challenges due to variations in data distribution, impacting the accuracy and training efficiency of existing distributed data-driven trust prediction models. FEDQ-Trust effectively tackles the statistical heterogeneity challenges by integrating Federated Expectation-Maximization with Deep Q Networks. Federated Expectation-Maximization's robust handling of statistical heterogeneity significantly enhances trust prediction accuracy. Meanwhile, Deep Q Networks streamlines the model training process, efficiently reducing the number of training clients while maintaining model performance. We conducted a suite of experiments within simulated MEC-based IoT settings by leveraging two real-world IoT datasets. The experimental results demonstrate that our model achieved a significant convergence time reduction of 97% to 99% while ensuring a notable improvement of 8% to 14% in accuracy compared to state-of-the-art models.

Submitted 28 April, 2024; originally announced April 2024.

Comments: 14 pages, 6 figures, submitted to IEEE Transactions on Services Computing

arXiv:2404.16655 [pdf]

Rational Designing of Anthocyanidins-Directed Near-Infrared Two-Photon Fluorescence Probes

Authors: Xiu-e Zhang, Xue Wei, Wei-Bo Cui, Jin-Pu Bai, Aynur Matyusup, Jing-Fu Guo, Hui Li, Ai-Min Ren

Abstract: Recently, two-photon fluorescent probes based on anthocyanidins molecules have attracted extensive attention due to their outstanding photophysical properties. However, there are only a few two-photon excited fluorescent probes that really meet the requirements of relatively long emission wavelengths (>600 nm), large two-photon absorption (TPA) cross sections (300 GM), significant Stokes shift (>80 nm), and high fluorescence intensity. Herein, the photophysical properties of a series of anthocyanidins with the same substituents but different fluorophore skeletons were investigated in detail. Compared with b-series molecules, a-series molecules with a six-membered ring in the backbone have a slightly higher reorganization energy. This results in more energy loss upon light excitation, enabling the reaction products to detect NTR through a larger Stokes shift. More importantly, there is very little decrease in fluorescence intensity as the Stokes shift increases. These features are extremely valuable for high-resolution NTR detection. In light of this, novel 2a-n (n=1-5) compounds are designed, which are accomplished by inhibiting the twisted intramolecular charge transfer (TICT) effect through alkyl cyclization, azetidine ring and extending π conjugation. Among them, 2a-3 gains long emission spectrum (λem=691.42 nm), noticeable TPA cross section (957.36 GM), and large Stokes shift (110.88 nm), indicating that it serves as a promising candidate for two-photon fluorescent dyes.
It is hoped that this work will offer some insightful theoretical direction for the development of novel high performance anthocyanin fluorescent materials.

Submitted 25 April, 2024; originally announced April 2024.

arXiv:2404.10763 [pdf, other]

LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?

Authors: Yuchi Wang, Shuhuai Ren, Rundong Gao, Linli Yao, Qingyan Guo, Kaikai An, Jianhong Bai, Xu Sun

Abstract: Diffusion models have exhibited remarkable capabilities in text-to-image generation. However, their performance in image-to-text generation, specifically image captioning, has lagged behind Auto-Regressive (AR) models, casting doubt on their applicability for such tasks. In this work, we revisit diffusion models, highlighting their capacity for holistic context modeling and parallel decoding. With these benefits, diffusion models can alleviate the inherent limitations of AR methods, including their slow inference speed, error propagation, and unidirectional constraints. Furthermore, we identify the prior underperformance of diffusion models stemming from the absence of an effective latent space for image-text alignment, and the discrepancy between continuous diffusion processes and discrete textual data. In response, we introduce a novel architecture, LaDiC, which utilizes a split BERT to create a dedicated latent space for captions and integrates a regularization module to manage varying text lengths. Our framework also includes a diffuser for semantic image-to-text conversion and a Back&Refine technique to enhance token interactivity during inference. LaDiC achieves state-of-the-art performance for diffusion-based methods on the MS COCO dataset with 38.2 BLEU@4 and 126.2 CIDEr, demonstrating exceptional performance without pre-training or ancillary modules. This indicates strong competitiveness with AR models, revealing the previously untapped potential of diffusion models in image-to-text generation.

Submitted 16 April, 2024; originally announced April 2024.

arXiv:2404.08998 [pdf, other]

Dual-comb mode-locked Yb:CALGO laser based on cavity-shared configuration with separated end mirrors

Authors: Ruixin Tang, Ziyu Luo, Pengfei Li, Pengrun Ying, Haiyang Xie, Siyuan Xu, Hui Liu, Jintao Bai

Abstract: Dual-comb spectroscopy typically requires the utilization of two independent and phase-locked femtosecond lasers, resulting in a complex and expensive system that hinders its industrial applications. Single-cavity dual-comb lasers are considered one of the primary solutions to simplify the system. However, controlling the crucial parameter of difference in repetition rates remains challenging. In this study, we present a dual-comb mode-locked Yb:CALGO laser based on a cavity-shared configuration with separated end mirrors. We employ two pairs of end mirrors and two thin-film polarizers angled at 45 degrees to the cavity axis, which separates the cross-polarized laser modes. We achieve simultaneous operation of two combs at approximately 1040 nm with pulse durations of around 400 fs and an average power exceeding 1 W. The repetition rates are approximately 59 MHz and their difference can be easily tuned from zero up to the MHz range. By effectively canceling out common mode noises, we observe minimal fluctuation in the repetition rate difference with a standard deviation of about 1.9 Hz over ten minutes, while experiencing fluctuations in repetition rates as large as 90 Hz. We demonstrate the capabilities of this system by utilizing the free-running dual-comb setup for asynchronous optical sampling on a saturable absorber and measuring etalon transmission spectrum.
This system allows for simple and independent control of the repetition rates and their difference during operation, facilitating the selection of optimal repetition rate difference and implementation of phase-locking loops. This advancement paves the way for the development of simple yet high-performance dual-comb laser sources.

Submitted 13 April, 2024; originally announced April 2024.

arXiv:2404.08977 [pdf, other]

RoNID: New Intent Discovery with Generated-Reliable Labels and Cluster-friendly Representations

Authors: Shun Zhang, Chaoran Yan, Jian Yang, Changyu Ren, Jiaqi Bai, Tongliang Li, Zhoujun Li

Abstract: New Intent Discovery (NID) strives to identify known and reasonably deduce novel intent groups in the open-world scenario. But current methods face issues with inaccurate pseudo-labels and poor representation learning, creating a negative feedback loop that degrades overall model performance, including accuracy and the adjusted rand index. To address the aforementioned challenges, we propose a Robust New Intent Discovery (RoNID) framework optimized by an EM-style method, which focuses on constructing reliable pseudo-labels and obtaining cluster-friendly discriminative representations. RoNID comprises two main modules: reliable pseudo-label generation module and cluster-friendly representation learning module. Specifically, the pseudo-label generation module assigns reliable synthetic labels by solving an optimal transport problem in the E-step, which effectively provides high-quality supervised signals for the input of the cluster-friendly representation learning module. To learn cluster-friendly representation with strong intra-cluster compactness and large inter-cluster separation, the representation learning module combines intra-cluster and inter-cluster contrastive learning in the M-step to feed more discriminative features into the generation module. RoNID can be performed iteratively to ultimately yield a robust model with reliable pseudo-labels and cluster-friendly representations.
Experimental results on multiple benchmarks demonstrate our method brings substantial improvements over previous state-of-the-art methods by a large margin of +1~+4 points.

Submitted 18 April, 2024; v1 submitted 13 April, 2024; originally announced April 2024.

Comments: DASFAA 2024

arXiv:2404.07943 [pdf, other]

HomoGenius: a Foundation Model of Homogenization for Rapid Prediction of Effective Mechanical Properties using Neural Operators

Authors: Yizheng Wang, Xiang Li, Ziming Yan, Yuqing Du, Jinshuai Bai, Bokai Liu, Timon Rabczuk, Yinghua Liu

Abstract: Homogenization is an essential tool for studying multiscale physical phenomena. However, traditional numerical homogenization, heavily reliant on finite element analysis, incurs high computational cost, particularly when handling complex geometries, materials, and high-resolution problems. To address these limitations, we propose a numerical homogenization model based on operator learning: HomoGenius. The proposed model can quickly provide homogenization results for arbitrary geometries, materials, and resolutions, increasing the efficiency by a factor of 80 compared to traditional numerical homogenization methods. We validate the effectiveness of our model in predicting the effective elastic modulus of periodic materials (TPMS: Triply Periodic Minimal Surface), including complex geometries, various Poisson's ratios and elastic moduli, and different resolutions for training and testing. The results show that our model possesses high precision, high efficiency, and learning capability.

Submitted 18 March, 2024; originally announced April 2024.

arXiv:2404.07343 [pdf, other]

Monitoring AGNs with H$β$ Asymmetry. IV. First Reverberation Mapping Results of 14 AGNs

Authors: T. E. Zastrocky, Michael S. Brotherton, Pu Du, Jacob N. McLane, Kianna A. Olson, D. A. Dale, H. A. Kobulnicky, Jaya Maithil, My L. Nguyen, William T. Chick, David H. Kasper, Derek Hand, C. Adelman, Z. Carter, G. Murphree, M. Oeur, T. Roth, S. Schonsberg, M. J. Caradonna, J. Favro, A. J. Ferguson, I. M. Gonzalez, L. M. Hadding, H. D. Hagler, C. J. Rogers , et al. (19 additional authors not shown)

Abstract: We report first-time reverberation mapping results for 14 AGNs from the ongoing Monitoring AGNs with H$β$ Asymmetry campaign (MAHA). These results utilize optical spectra obtained with the Long Slit Spectrograph on the Wyoming Infrared 2.3m Telescope between 2017 November and 2023 May. MAHA combines long-duration monitoring with high cadence. We report results from multiple observing seasons for 9 of the 14 objects. These results include H$β$ time lags, supermassive black hole masses, and velocity-resolved time lags. The velocity-resolved lags allow us to investigate the kinematics of the broad-line region.

Submitted 10 April, 2024; originally announced April 2024.

Comments: 35 pages, 19 figures, accepted for publication in ApJ Supplement

arXiv:2404.07246 [pdf, other]

doi 10.1088/1674-4527/ad339f

Prospects of the multi-channel photometric survey telescope in the cosmological application of Type Ia supernovae

Authors: Zhenyu Wang, Ju-Jia Zhang, Xinzhong Er, Jinming Bai

Abstract: The Multi-channel Photometric Survey Telescope (Mephisto) is a real-time, three-color photometric system designed to capture the color evolution of stars and transients accurately. This telescope system can be crucial in cosmological distance measurements of low-redshift (low-$z$, $z \lesssim 0.1$) Type Ia supernovae (SNe Ia). To optimize the capabilities of this instrument, we perform a comprehensive simulation study before its official operation is scheduled to start. By considering the impact of atmospheric extinction, weather conditions, and the lunar phase at the observing site, together with the instrumental features, we simulate the light curves of SNe Ia obtained by Mephisto. The best strategy for SN Ia cosmology is to take images with an exposure time of 130 s and a cadence of 3 days. Under these conditions, Mephisto can obtain hundreds of high-quality SNe Ia to achieve a distance measurement better than $4.5\%$. Given the on-time spectral classification and monitoring of the Lijiang 2.4 m Telescope at the same observatory, Mephisto, in the whole operation, can significantly enrich the well-calibrated sample of supernovae at low-$z$ and improve the calibration accuracy of high-$z$ SNe Ia.

Submitted 17 April, 2024; v1 submitted 10 April, 2024; originally announced April 2024.

Comments: 15 pages, 7 figures

Showing 1–50 of 824 results for author: Bai, J