Performance analysis for a rotary compressor at high speed: experimental study and mathematical modeling

Authors: Chuntai Zheng, Wei Zhao, Benshuai Lyu, Keke Gao, Hongjun Cao, Lei Zhong, Yi Gao, Ren Liao

Abstract: This paper presents a comprehensive study of the performance of a rotary compressor over a rotational speed range of 80 Hz to 200 Hz through experimental tests and mathematical modeling. A compressor performance test rig was designed to conduct the performance tests, with fast-response pressure sensors and displacement sensors capturing the P-V diagram and dynamic motion of the moving components. Results show that the compressor efficiency degrades at high speeds due to the dominant loss factors of leakage and discharge power loss. Supercharging effects become significant at speeds above 160 Hz, and their net effect reduces the compressor efficiency, especially at high speeds. This study identifies and analyzes the loss factors on the mass flow rate and power consumption based on experimental data, and hypothesizes possible mechanisms for each loss factor, which can aid in the design of a high-speed rotary compressor with higher efficiency.

Submitted 13 July, 2024; originally announced July 2024.

arXiv:2407.04675 [pdf, other]

Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition

Authors: Ye Bai, Jingping Chen, Jitong Chen, Wei Chen, Zhuo Chen, Chuang Ding, Linhao Dong, Qianqian Dong, Yujiao Du, Kepan Gao, Lu Gao, Yi Guo, Minglun Han, Ting Han, Wenchao Hu, Xinying Hu, Yuxiang Hu, Deyu Hua, Lu Huang, Mingkun Huang, Youjia Huang, Jishuo Jin, Fanliu Kong, Zongwei Lan, Tianyu Li, et al. (30 additional authors not shown)

Abstract: Modern automatic speech recognition (ASR) models are required to accurately transcribe diverse speech signals (from different domains, languages, accents, etc.) given the specific contextual information in various application scenarios. Classic end-to-end models fused with extra language models perform well, but mainly in data matching scenarios and are gradually approaching a bottleneck. In this work, we introduce Seed-ASR, a large language model (LLM) based speech recognition model. Seed-ASR is developed based on the framework of audio conditioned LLM (AcLLM), leveraging the capabilities of LLMs by inputting continuous speech representations together with contextual information into the LLM. Through stage-wise large-scale training and the elicitation of context-aware capabilities in LLM, Seed-ASR demonstrates significant improvement over end-to-end models on comprehensive evaluation sets, including multiple domains, accents/dialects and languages. Additionally, Seed-ASR can be further deployed to support specific needs in various scenarios without requiring extra language models. Compared to recently released large ASR models, Seed-ASR achieves 10%-40% reduction in word (or character, for Chinese) error rates on Chinese and English public test sets, further demonstrating its powerful performance.

Submitted 10 July, 2024; v1 submitted 5 July, 2024; originally announced July 2024.

arXiv:2407.02773 [pdf, other]

OpenVNA: A Framework for Analyzing the Behavior of Multimodal Language Understanding System under Noisy Scenarios

Authors: Ziqi Yuan, Baozheng Zhang, Hua Xu, Zhiyun Liang, Kai Gao

Abstract: We present OpenVNA, an open-source framework designed for analyzing the behavior of multimodal language understanding systems under noisy conditions. OpenVNA serves as an intuitive toolkit tailored for researchers, facilitating convenient batch-level robustness evaluation and on-the-fly instance-level demonstration. It primarily features a benchmark Python library for assessing global model robustness, offering high flexibility and extensibility, thereby enabling customization with user-defined noise types and models. Additionally, a GUI-based interface has been developed to intuitively analyze local model behavior. In this paper, we delineate the design principles and utilization of the created library and GUI-based web platform. Currently, OpenVNA is publicly accessible at https://github.com/thuiar/OpenVNA, with a demonstration video available at https://youtu.be/0Z9cW7RGct4.

Submitted 2 July, 2024; originally announced July 2024.

Comments: 10 pages, 4 figures, to be published in ACL 2024 System Demonstration Track

arXiv:2407.02411 [pdf, other]

Video Watermarking: Safeguarding Your Video from (Unauthorized) Annotations by Video-based LLMs

Authors: Jinmin Li, Kuofeng Gao, Yang Bai, Jingyun Zhang, Shu-Tao Xia

Abstract: The advent of video-based Large Language Models (LLMs) has significantly enhanced video understanding. However, it has also raised some safety concerns regarding data protection, as videos can be more easily annotated, even without authorization. This paper introduces Video Watermarking, a novel technique to protect videos from unauthorized annotations by such video-based LLMs, especially concerning the video content and description, in response to specific queries. By imperceptibly embedding watermarks into key video frames with multi-modal flow-based losses, our method preserves the viewing experience while preventing misuse by video-based LLMs. Extensive experiments show that Video Watermarking significantly reduces the comprehensibility of videos with various video-based LLMs, demonstrating both stealth and robustness. In essence, our method provides a solution for securing video content, ensuring its integrity and confidentiality in the face of evolving video-based LLM technologies.

Submitted 2 July, 2024; v1 submitted 2 July, 2024; originally announced July 2024.

Comments: arXiv admin note: substantial text overlap with arXiv:2403.13507

arXiv:2406.12556 [pdf, other]

Towards Deep Application-Network Integration: Architectures, Progress and Opportunities

Authors: Berta Serracanta, Kai Gao, Jordi Ros-Giralt, Alberto Rodriguez-Natal, Luis M. Contreras, Richard Yang, Albert Cabellos

Abstract: With the rise of a new generation of applications (e.g., virtual and augmented reality, artificial intelligence, etc.) demanding stringent performance requirements, the need for networking solutions and architectures that can enable a higher Quality of Experience (QoE) is becoming increasingly important. While jointly optimizing application and network may increase the applications' QoE and simultaneously improve the utilization of network resources, such a paradigm has had limited success in real production networks. However, with the combination of revolutionary trends in (1) compute processing demands, (2) networking capabilities, and (3) sustainable business models, it is high time the community explores the full potential of deeper integration between application and network. In this paper, recent trends observed over the past few years are systematically reviewed. These include the paradigm shift in modern communication services towards computing-driven applications, such as on-site AI training, advances in programmable network technologies like Software Defined Networking (SDN), and new business models incentivizing collaboration and cooperation between parties. Following this, successful scenarios that benefit from various forms of deeper network-application integration are reported, highlighting their considerable potential. A unified framework is then introduced, providing an overview of possible architecture paradigms for network-application integration and bringing awareness to existing abstractions, mechanisms, tools, and their potential combinations. The paper concludes with a discussion of several remaining challenges in building practical network-application integrated systems.

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.10981 [pdf, other]

ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models

Authors: Kaifeng Gao, Jiaxin Shi, Hanwang Zhang, Chunping Wang, Jun Xiao

Abstract: With the advance of diffusion models, today's video generation has achieved impressive quality. But generating temporally consistent long videos is still challenging. A majority of video diffusion models (VDMs) generate long videos in an autoregressive manner, i.e., generating subsequent clips conditioned on the last frames of the previous clip. However, existing approaches all involve bidirectional computations, which restricts the receptive context of each autoregression step, and results in the model lacking long-term dependencies. Inspired by the huge success of large language models (LLMs) and following GPT (generative pre-trained transformer), we bring causal (i.e., unidirectional) generation into VDMs, and use past frames as prompt to generate future frames. For Causal Generation, we introduce causal temporal attention into VDM, which forces each generated frame to depend on its previous frames. For Frame as Prompt, we inject the conditional frames by concatenating them with noisy frames (frames to be generated) along the temporal axis. Consequently, we present Video Diffusion GPT (ViD-GPT). Based on the two key designs, in each autoregression step, it is able to acquire long-term context from prompting frames concatenated by all previously generated frames. Additionally, we bring the kv-cache mechanism to VDMs, which eliminates the redundant computation from overlapped frames, significantly boosting the inference speed. Extensive experiments demonstrate that our ViD-GPT achieves state-of-the-art performance both quantitatively and qualitatively on long video generation. Code will be available at https://github.com/Dawn-LX/Causal-VideoGen.

Submitted 16 June, 2024; originally announced June 2024.

Comments: Code will be available at https://github.com/Dawn-LX/Causal-VideoGen

arXiv:2406.08698 [pdf, other]

Constraints on Ultra Heavy Dark Matter Properties from Dwarf Spheroidal Galaxies with LHAASO Observations

Authors: Zhen Cao, F. Aharonian, Q. An, Axikegu, Y. X. Bai, Y. W. Bao, D. Bastieri, X. J. Bi, Y. J. Bi, J. T. Cai, Q. Cao, W. Y. Cao, Zhe Cao, J. Chang, J. F. Chang, A. M. Chen, E. S. Chen, Liang Chen, Lin Chen, Long Chen, M. J. Chen, M. L. Chen, Q. H. Chen, S. H. Chen, S. Z. Chen, et al. (255 additional authors not shown)

Abstract: In this work we search for signals generated by ultra-heavy dark matter in the Large High Altitude Air Shower Observatory (LHAASO) data. We look for possible gamma rays from dark matter annihilation or decay in 16 dwarf spheroidal galaxies in the field of view of LHAASO. Dwarf spheroidal galaxies are among the most promising targets for indirect detection of dark matter because they have low fluxes of astrophysical $\gamma$-ray background yet contain a large amount of dark matter. By analyzing more than 700 days of observational data at LHAASO, no significant dark matter signal from 1 TeV to 1 EeV is detected. Accordingly we derive the most stringent constraints on the ultra-heavy dark matter annihilation cross-section up to EeV. The constraints on the lifetime of dark matter in decay mode are also derived.

Submitted 12 June, 2024; originally announced June 2024.

Comments: 17 pages, 12 figures, accepted by PRL

arXiv:2406.05797 [pdf, other]

3D-MolT5: Towards Unified 3D Molecule-Text Modeling with 3D Molecular Tokenization

Authors: Qizhi Pei, Lijun Wu, Kaiyuan Gao, Jinhua Zhu, Rui Yan

Abstract: The integration of molecule and language has garnered increasing attention in molecular science. Recent advancements in Language Models (LMs) have demonstrated potential for the comprehensive modeling of molecule and language. However, existing works exhibit notable limitations. Most existing works overlook the modeling of 3D information, which is crucial for understanding molecular structures and functions. While some attempts have been made to leverage external structure encoding modules to inject the 3D molecular information into LMs, there exist obvious difficulties that hinder the integration of molecular structure and language text, such as modality alignment and separate tuning. To bridge this gap, we propose 3D-MolT5, a unified framework designed to model both 1D molecular sequence and 3D molecular structure. The key innovation lies in our methodology for mapping fine-grained 3D substructure representations (based on 3D molecular fingerprints) to a specialized 3D token vocabulary for 3D-MolT5. This 3D structure token vocabulary enables the seamless combination of 1D sequence and 3D structure representations in a tokenized format, allowing 3D-MolT5 to encode molecular sequence (SELFIES), molecular structure, and text sequences within a unified architecture. Alongside, we further introduce 1D and 3D joint pre-training to enhance the model's comprehension of these diverse modalities in a joint representation space and better generalize to various tasks for our foundation model. Through instruction tuning on multiple downstream datasets, our proposed 3D-MolT5 outperforms existing methods in molecular property prediction, molecule captioning, and text-based molecule generation tasks. Our code will be available on GitHub soon.

Submitted 9 June, 2024; originally announced June 2024.

Comments: 18 pages

arXiv:2406.05392 [pdf, other]

Deconstructing The Ethics of Large Language Models from Long-standing Issues to New-emerging Dilemmas

Authors: Chengyuan Deng, Yiqun Duan, Xin Jin, Heng Chang, Yijun Tian, Han Liu, Henry Peng Zou, Yiqiao Jin, Yijia Xiao, Yichen Wang, Shenghao Wu, Zongxing Xie, Kuofeng Gao, Sihong He, Jun Zhuang, Lu Cheng, Haohan Wang

Abstract: Large Language Models (LLMs) have achieved unparalleled success across diverse language modeling tasks in recent years. However, this progress has also intensified ethical concerns, impacting the deployment of LLMs in everyday contexts. This paper provides a comprehensive survey of ethical challenges associated with LLMs, from longstanding issues such as copyright infringement, systematic bias, and data privacy, to emerging problems like truthfulness and social norms. We critically analyze existing research aimed at understanding, examining, and mitigating these ethical risks. Our survey underscores the importance of integrating ethical standards and societal values into the development of LLMs, thereby guiding the development of responsible and ethically aligned language models.

Submitted 8 June, 2024; originally announced June 2024.

arXiv:2405.17213 [pdf]

Highly inhomogeneous interactions between background climate and urban warming across typical local climate zones in heatwave and non-heatwave days

Authors: Jing Kong, Yongling Zhao, Kai Gao, Dominik Strebel, Jan Carmeliet, Chengwang Lei

Abstract: Urban heat island (UHI) in conjunction with heatwave (HW) leads to exacerbation of thermal stress in urban areas. Prior research on UHI and HW has predominantly concentrated on examining the thermal conditions at the surface and near-surface, with few investigations extending to the radiative and dynamical interactions of UHI and HW, particularly with a focus on the inhomogeneities across local climate zones (LCZs). Here, we analyse the temperature disparity between HW and non-HW conditions across LCZs in the Sydney area by quantifying the contributions of individual radiative and dynamical processes using the coupled surface-atmosphere climate feedback-response analysis method (CFRAM). Three HW events in 2017, 2019, and 2020 are simulated using the Weather Research and Forecasting (WRF) model coupled with the Single-Layer Urban Canopy Model (SLUCM). The maximum temperature difference between HW and non-HW days may reach up to 10 K, with the increased net solar radiation during HWs being comparable to the typical level of anthropogenic heat flux in urban areas. It is also found that the reduction of clouds, the presence of vapor, and the increase of sensible heat contribute to the warming effect at different levels, with the contribution of clouds being the most dominant. Conversely, the generation of dry convection and the increase of latent heat flux lead to mitigating effects, with the latter being more dominant and capable of causing up to 10 K surface temperature difference between LCZ1 (compact high-rise) and LCZ9 (sparsely built). The differences in the contributions of climate feedback processes across different LCZs become more evident during more severe and humid HWs. These findings underscore the necessity of implementing local climate zone-tailored heat mitigation strategies.

Submitted 27 May, 2024; originally announced May 2024.

arXiv:2405.15826 [pdf, other]

3D Learnable Supertoken Transformer for LiDAR Point Cloud Scene Segmentation

Authors: Dening Lu, Jun Zhou, Kyle Gao, Linlin Xu, Jonathan Li

Abstract: 3D Transformers have achieved great success in point cloud understanding and representation. However, there is still considerable scope for further development in effective and efficient Transformers for large-scale LiDAR point cloud scene segmentation. This paper proposes a novel 3D Transformer framework, named 3D Learnable Supertoken Transformer (3DLST). The key contributions are summarized as follows. Firstly, we introduce the first Dynamic Supertoken Optimization (DSO) block for efficient token clustering and aggregating, where the learnable supertoken definition avoids the time-consuming pre-processing of traditional superpoint generation. Since the learnable supertokens can be dynamically optimized by multi-level deep features during network learning, they are tailored to the semantic homogeneity-aware token clustering. Secondly, an efficient Cross-Attention-guided Upsampling (CAU) block is proposed for token reconstruction from optimized supertokens. Thirdly, the 3DLST is equipped with a novel W-net architecture instead of the common U-net design, which is more suitable for Transformer-based feature learning. The SOTA performance on three challenging LiDAR datasets (airborne MultiSpectral LiDAR (MS-LiDAR) (89.3% of the average F1 score), DALES (80.2% of mIoU), and Toronto-3D dataset (80.4% of mIoU)) demonstrates the superiority of 3DLST and its strong adaptability to various LiDAR point cloud data (airborne MS-LiDAR, aerial LiDAR, and vehicle-mounted LiDAR data). Furthermore, 3DLST also achieves satisfactory results in terms of algorithm efficiency, which is up to 5x faster than previous best-performing methods.

Submitted 23 May, 2024; originally announced May 2024.

Comments: 13 pages, 10 figures, 7 tables

arXiv:2405.12775 [pdf, other]

Unsupervised Multimodal Clustering for Semantics Discovery in Multimodal Utterances

Authors: Hanlei Zhang, Hua Xu, Fei Long, Xin Wang, Kai Gao

Abstract: Discovering the semantics of multimodal utterances is essential for understanding human language and enhancing human-machine interactions. Existing methods manifest limitations in leveraging nonverbal information for discerning complex semantics in unsupervised scenarios. This paper introduces a novel unsupervised multimodal clustering method (UMC), making a pioneering contribution to this field. UMC introduces a unique approach to constructing augmentation views for multimodal data, which are then used to perform pre-training to establish well-initialized representations for subsequent clustering. An innovative strategy is proposed to dynamically select high-quality samples as guidance for representation learning, gauged by the density of each sample's nearest neighbors. Besides, it is equipped to automatically determine the optimal value for the top-$K$ parameter in each cluster to refine sample selection. Finally, both high- and low-quality samples are used to learn representations conducive to effective clustering. We build baselines on benchmark multimodal intent and dialogue act datasets. UMC shows remarkable improvements of 2-6% in clustering metrics over state-of-the-art methods, marking the first successful endeavor in this domain. The complete code and data are available at https://github.com/thuiar/UMC.

Submitted 21 May, 2024; originally announced May 2024.

Comments: Accepted by ACL 2024, Main Conference, Long Paper

arXiv:2405.11826 [pdf, other]

Data quality control system and long-term performance monitor of the LHAASO-KM2A

Authors: Zhen Cao, F. Aharonian, Axikegu, Y. X. Bai, Y. W. Bao, D. Bastieri, X. J. Bi, Y. J. Bi, W. Bian, A. V. Bukevich, Q. Cao, W. Y. Cao, Zhe Cao, J. Chang, J. F. Chang, A. M. Chen, E. S. Chen, H. X. Chen, Liang Chen, Lin Chen, Long Chen, M. J. Chen, M. L. Chen, Q. H. Chen, S. Chen, et al. (263 additional authors not shown)

Abstract: The KM2A is the largest sub-array of the Large High Altitude Air Shower Observatory (LHAASO). It consists of 5216 electromagnetic particle detectors (EDs) and 1188 muon detectors (MDs). The data recorded by the EDs and MDs are used to reconstruct primary information of cosmic ray and gamma-ray showers. This information is used for physical analysis in gamma-ray astronomy and cosmic ray physics. To ensure the reliability of the LHAASO-KM2A data, a three-level quality control system has been established. It is used to monitor the status of detector units, stability of reconstructed parameters and the performance of the array based on observations of the Crab Nebula and Moon shadow. This paper will introduce the control system and its application on the LHAASO-KM2A data collected from August 2021 to July 2023. During this period, the pointing and angular resolution of the array were stable. From the observations of the Moon shadow and Crab Nebula, the results achieved using the two methods are consistent with each other. According to the observation of the Crab Nebula at energies from 25 TeV to 100 TeV, the time averaged pointing errors are estimated to be $-0.003^{\circ} \pm 0.005^{\circ}$ and $0.001^{\circ} \pm 0.006^{\circ}$ in the R.A. and Dec directions, respectively.

Submitted 13 June, 2024; v1 submitted 20 May, 2024; originally announced May 2024.

Comments: 15 pages, 9 figures

arXiv:2405.11021 [pdf, other]

Enhanced 3D Urban Scene Reconstruction and Point Cloud Densification using Gaussian Splatting and Google Earth Imagery

Authors: Kyle Gao, Dening Lu, Hongjie He, Linlin Xu, Jonathan Li

Abstract: 3D urban scene reconstruction and modelling is a crucial research area in remote sensing with numerous applications in academia, commerce, industry, and administration. Recent advancements in view synthesis models have facilitated photorealistic 3D reconstruction solely from 2D images. Leveraging Google Earth imagery, we construct a 3D Gaussian Splatting model of the Waterloo region centered on the University of Waterloo and are able to achieve view-synthesis results far exceeding previous 3D view-synthesis results based on neural radiance fields, which we demonstrate in our benchmark. Additionally, we retrieved the 3D geometry of the scene using the 3D point cloud extracted from the 3D Gaussian Splatting model, which we benchmarked against our Multi-View-Stereo dense reconstruction of the scene, thereby reconstructing both the 3D geometry and photorealistic lighting of the large-scale urban scene through 3D Gaussian Splatting.

Submitted 1 June, 2024; v1 submitted 17 May, 2024; originally announced May 2024.

ACM Class: I.4; I.3

arXiv:2405.10612 [pdf, other]

Not All Prompts Are Secure: A Switchable Backdoor Attack Against Pre-trained Vision Transformers

Authors: Sheng Yang, Jiawang Bai, Kuofeng Gao, Yong Yang, Yiming Li, Shu-tao Xia

Abstract: Given the power of vision transformers, a new learning paradigm, pre-training and then prompting, makes it more efficient and effective to address downstream visual recognition tasks. In this paper, we identify a novel security threat towards such a paradigm from the perspective of backdoor attacks. Specifically, an extra prompt token, called the switch token in this work, can turn the backdoor mode on, i.e., converting a benign model into a backdoored one. Once under the backdoor mode, a specific trigger can force the model to predict a target class. It poses a severe risk to the users of cloud API, since the malicious behavior cannot be activated and detected under the benign mode, thus making the attack very stealthy. To attack a pre-trained model, our proposed attack, named SWARM, learns a trigger and prompt tokens including a switch token. They are optimized with the clean loss, which encourages the model to always behave normally even when the trigger is present, and the backdoor loss, which ensures the backdoor can be activated by the trigger when the switch is on. Besides, we utilize the cross-mode feature distillation to reduce the effect of the switch token on clean samples. The experiments on diverse visual recognition tasks confirm the success of our switchable backdoor attack, i.e., achieving 95%+ attack success rate, and also being hard to detect and remove. Our code is available at https://github.com/20000yshust/SWARM.

Submitted 17 May, 2024; originally announced May 2024.

arXiv:2405.09981 [pdf, other]

Adversarial Robustness for Visual Grounding of Multimodal Large Language Models

Authors: Kuofeng Gao, Yang Bai, Jiawang Bai, Yong Yang, Shu-Tao Xia

Abstract: Multi-modal Large Language Models (MLLMs) have recently achieved enhanced performance across various vision-language tasks including visual grounding capabilities. However, the adversarial robustness of visual grounding remains unexplored in MLLMs. To fill this gap, we use referring expression comprehension (REC) as an example task in visual grounding and propose three adversarial attack paradigms as follows. Firstly, untargeted adversarial attacks induce MLLMs to generate incorrect bounding boxes for each object. Besides, exclusive targeted adversarial attacks cause all generated outputs to be the same target bounding box. In addition, permuted targeted adversarial attacks aim to permute all bounding boxes among different objects within a single image. Extensive experiments demonstrate that the proposed methods can successfully attack visual grounding capabilities of MLLMs. Our methods not only provide a new perspective for designing novel attacks but also serve as a strong baseline for improving the adversarial robustness for visual grounding of MLLMs.

Submitted 16 May, 2024; originally announced May 2024.

Comments: ICLR 2024 Workshop on Reliable and Responsible Foundation Models

arXiv:2405.07691 [pdf, other]

Discovery of Very-high-energy Gamma-ray Emissions from the Low Luminosity AGN NGC 4278 by LHAASO

Authors: Zhen Cao, F. Aharonian, Q. An, Axikegu, Y. X. Bai, Y. W. Bao, D. Bastieri, X. J. Bi, Y. J. Bi, J. T. Cai, Q. Cao, W. Y. Cao, Zhe Cao, J. Chang, J. F. Chang, A. M. Chen, E. S. Chen, Liang Chen, Lin Chen, Long Chen, M. J. Chen, M. L. Chen, Q. H. Chen, S. H. Chen, S. Z. Chen, et al. (255 additional authors not shown)

Abstract: The first source catalog of the Large High Altitude Air Shower Observatory reported the detection of a very-high-energy gamma ray source, 1LHAASO J1219+2915. In this paper a further detailed study of the spectral and temporal behavior of this point-like source has been carried out. The best-fit position of the TeV source ($\rm{RA}=185.05^{\circ}\pm0.04^{\circ}$, $\rm{Dec}=29.25^{\circ}\pm0.03^{\circ}$) is compatible with NGC 4278 within $\sim0.03$ degree. Variation analysis shows an indication of the variability at a few months level in the TeV band, which is consistent with low frequency observations. Based on these observations, we report the detection of TeV $\gamma$-ray emissions from this low-luminosity AGN NGC 4278. The observations by LHAASO-WCDA during the active period have a significance level of $8.8\,\sigma$ with best-fit photon spectral index $\varGamma=2.56\pm0.14$ and a flux $f_{1-10\,\rm{TeV}}=(7.0\pm1.1_{\rm{sta}}\pm0.35_{\rm{syst}})\times10^{-13}\,\rm{photons\,cm^{-2}\,s^{-1}}$, or approximately $5\%$ of the Crab Nebula. The discovery of VHE emission from NGC 4278 indicates that the compact, weak radio jet can efficiently accelerate particles and emit TeV photons.

Submitted 13 May, 2024; originally announced May 2024.

Comments: 11 pages, 5 figures

arXiv:2405.07136 [pdf]

Extremely long transverse optical needle focus for reflective metalens enabled by monolayer MoS$_2$

Authors: Zhonglin Li, Kangyu Gao, Yingying Wang, Ruitong Bie, Dongliang Yang, Tianze Yu, Renxi Gao, Wenjun Liu, Bo Zhong, Linfeng Sun

Abstract: Line-scan mode facilitates fast-speed and high-throughput imaging by developing a suitable optical transverse needle focus. Metasurfaces with periodic structures such as diffractive rings, ellipses, and gratings could enable discrete focus evolving into line focus under momentum conservation, but still face the challenge of extremely low light power utilization brought by inevitably multiple high-order diffractions. In addition, the designed focus requires the selection of particular optical functional materials. High dielectric constants in atomic transition metal dichalcogenides make significant phase modulation by bringing phase singularity at zero-reflection possible. However, no light power is available for use at zero-reflection and a balance between phase and amplitude modulation is needed. In this work, the above issues are simultaneously solved by designing a monolayer MoS2 based Fresnel strip structure. An optical needle primary focus with a transverse length of 40 μm (~80 λ) is obtained, which is the longest value recorded so far, together with a sub-diffraction-limited lateral spot and a broad working wavelength range. This specially developed structure not only concentrates light power in primary diffraction by breaking the restriction of momentum conservation, but also guarantees a consistent phase across different strips. The novel optical manipulation approach provided here, together with the longer focus length for flat optics, will show promising applications in biology, oncology, nanofabrication, energy harvesting, and optical information processing.

Submitted 11 May, 2024; originally announced May 2024.

Comments: 22 pages, 5 figures

arXiv:2405.04434 [pdf, other]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Authors: DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Yang, Haowei Zhang, Honghui Ding, et al. (132 additional authors not shown)

Abstract: We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.

Submitted 19 June, 2024; v1 submitted 7 May, 2024; originally announced May 2024.

arXiv:2404.19387 [pdf, other]

Online Electricity Purchase for Data Center with Dynamic Virtual Battery from Flexibility Aggregation

Authors: Kekun Gao, Yuejun Yan, Yixuan Liu, Endong Liu, Pengcheng You

Abstract: As a critical component of modern infrastructure, data centers account for a huge amount of power consumption and greenhouse gas emission. This paper studies the electricity purchase strategy for a data center to lower its energy cost while integrating local renewable generation under uncertainty. To facilitate efficient and scalable decision-making, we propose a two-layer hierarchy where the lower layer consists of the operation of all electrical equipment in the data center and the upper layer determines the procurement and dispatch of electricity. At the lower layer, instead of device-level scheduling in real time, we propose to exploit the inherent flexibility in demand, such as thermostatically controlled loads and flexible computing tasks, and aggregate them into virtual batteries. By this means, the upper-layer decision only needs to take into account these virtual batteries, the size of which is generally small and independent of the data center scale. We further propose an online algorithm based on Lyapunov optimization to purchase electricity from the grid with a manageable energy cost, even though the prices, renewable availability, and battery specifications are uncertain and dynamic. In particular, we show that, under mild conditions, our algorithm can achieve bounded loss compared with the offline optimal cost, while strictly respecting battery operational constraints. Extensive simulation studies validate the theoretical analysis and illustrate the tradeoff between optimality and conservativeness.

Submitted 30 April, 2024; originally announced April 2024.

arXiv:2404.16565 [pdf, other]

PyRadar: Towards Automatically Retrieving and Validating Source Code Repository Information for PyPI Packages

Authors: Kai Gao, Weiwei Xu, Wenhao Yang, Minghui Zhou

Abstract: A package's source code repository records the development history of the package, providing indispensable information for the use and risk monitoring of the package. However, a package release often misses its source code repository due to the separation of the package's development platform from its distribution platform. Existing tools retrieve the release's repository information from its metadata, which suffers from two limitations: the metadata may lack the repository information or contain wrong information. Our analysis shows that existing tools can only retrieve repository information for up to 70.5% of PyPI releases. To address the limitations, this paper proposes PyRadar, a novel framework that utilizes the metadata and source distribution to retrieve and validate the repository information for PyPI releases. We start with an empirical study to compare four existing tools on 4,227,425 PyPI releases and analyze phantom files (files appearing in the release's distribution but not in the release's repository) in 14,375 correct package-repository links and 2,064 incorrect links. Based on the findings, we design PyRadar with three components, i.e., Metadata-based Retriever, Source Code Repository Validator, and Source Code-based Retriever. In particular, the Metadata-based Retriever combines best practices of existing tools and successfully retrieves repository information from the metadata for 72.1% of PyPI releases. The Source Code Repository Validator applies common machine learning algorithms on six crafted features and achieves an AUC of up to 0.995. The Source Code-based Retriever queries World of Code with the SHA-1 hashes of all Python files in the release's source distribution and retrieves repository information for 90.2% of packages in our dataset with an accuracy of 0.970. Both practitioners and researchers can employ PyRadar to better use PyPI packages.

Submitted 25 April, 2024; originally announced April 2024.

Comments: This paper has been accepted at FSE 2024

arXiv:2404.16557 [pdf, other]

Energy-Latency Manipulation of Multi-modal Large Language Models via Verbose Samples

Authors: Kuofeng Gao, Jindong Gu, Yang Bai, Shu-Tao Xia, Philip Torr, Wei Liu, Zhifeng Li

Abstract: Despite the exceptional performance of multi-modal large language models (MLLMs), their deployment requires substantial computational resources. Once malicious users induce high energy consumption and latency time (energy-latency cost), it will exhaust computational resources and harm availability of service. In this paper, we investigate this vulnerability for MLLMs, particularly image-based and video-based ones, and aim to induce high energy-latency cost during inference by crafting an imperceptible perturbation. We find that high energy-latency cost can be manipulated by maximizing the length of generated sequences, which motivates us to propose verbose samples, including verbose images and videos. Concretely, two modality non-specific losses are proposed, including a loss to delay end-of-sequence (EOS) token and an uncertainty loss to increase the uncertainty over each generated token. In addition, improving diversity is important to encourage longer responses by increasing the complexity, which inspires the following modality specific loss. For verbose images, a token diversity loss is proposed to promote diverse hidden states. For verbose videos, a frame feature diversity loss is proposed to increase the feature diversity among frames. To balance these losses, we propose a temporal weight adjustment algorithm. Experiments demonstrate that our verbose samples can largely extend the length of generated sequences.

Submitted 25 April, 2024; originally announced April 2024.

Comments: arXiv admin note: substantial text overlap with arXiv:2401.11170

arXiv:2404.16525 [pdf, other]

An efficient method to generate near-ideal hollow beams of different shapes for box potential of quantum gases

Authors: Tongtong Ren, Yirong Wang, Xiaoyu Dai, Xiaoxu Gao, Guangren Sun, Xue Zhao, Kuiyi Gao, Zhiyue Zheng, Wei Zhang

Abstract: Ultracold quantum gases are usually prepared in conservative traps for quantum simulation experiments. The atomic density inhomogeneity, together with the consequent position-dependent energy and time scales of cold atoms in traditional harmonic traps, makes it difficult to manipulate and detect the sample at a better level. These problems are partially solved by optical box traps of blue-detuned hollow beams. However, generating a high-quality hollow beam with high light efficiency for the box trap is challenging. Here, we present a scheme that combines the fixed optics, including axicons and prisms, to pre-shape a Gaussian beam into a hollow beam, with a digital micromirror device (DMD) to improve the quality of the hollow beam further, providing a nearly ideal optical potential of various shapes for preparing highly homogeneous cold atoms. The highest power-law exponent of potential walls can reach a value over 100, and the light efficiency from a Gaussian to a hollow beam is also improved compared to direct optical shaping by a mask or a DMD. Combined with a one-dimensional optical lattice, a nearly ideal two-dimensional uniform quantum gas with different geometrical boundaries can be prepared for exploring quantum many-body physics to an unprecedented level.

Submitted 25 April, 2024; originally announced April 2024.

arXiv:2404.14372 [pdf, other]

Beyond Scaling: Predicting Patent Approval with Domain-specific Fine-grained Claim Dependency Graph

Authors: Xiaochen Kev Gao, Feng Yao, Kewen Zhao, Beilei He, Animesh Kumar, Vish Krishnan, Jingbo Shang

Abstract: Model scaling is becoming the default choice for many language tasks due to the success of large language models (LLMs). However, it can fall short in specific scenarios where simple customized methods excel. In this paper, we delve into the patent approval prediction task and unveil that simple domain-specific graph methods outperform enlarging the model, using the intrinsic dependencies within the patent data. Specifically, we first extend the embedding-based state-of-the-art (SOTA) by scaling up its backbone model with various sizes of open-source LLMs, then explore prompt-based methods to harness proprietary LLMs' potential, but find the best results close to random guessing, underlining the ineffectiveness of model scaling-up. Hence, we propose a novel Fine-grained cLAim depeNdency (FLAN) Graph through meticulous patent data analyses, capturing the inherent dependencies across segments of the patent text. As it is model-agnostic, we apply cost-effective graph models to our FLAN Graph to obtain representations for approval prediction. Extensive experiments and detailed analyses prove that incorporating FLAN Graph via various graph models consistently outperforms all LLM baselines significantly. We hope that our observations and analyses in this paper can bring more attention to this challenging task and prompt further research into the limitations of LLMs. Our source code and dataset can be obtained from http://github.com/ShangDataLab/FLAN-Graph.

Submitted 22 April, 2024; originally announced April 2024.

Comments: 17 Pages, Under Review

arXiv:2404.11070 [pdf]

Sky-GVIO: an enhanced GNSS/INS/Vision navigation with FCN-based sky-segmentation in urban canyon

Authors: Jingrong Wang, Bo Xu, Ronghe Jin, Shoujian Zhang, Kefu Gao, Jingnan Liu

Abstract: Accurate, continuous, and reliable positioning is a critical component of achieving autonomous driving. However, in complex urban canyon environments, the vulnerability of a stand-alone sensor and non-line-of-sight (NLOS) caused by high buildings, trees, and elevated structures seriously affect positioning results. To address these challenges, a sky-view images segmentation algorithm based on Fully Convolutional Network (FCN) is proposed for GNSS NLOS detection. Building upon this, a novel NLOS detection and mitigation algorithm (named S-NDM) is extended to the tightly coupled Global Navigation Satellite Systems (GNSS), Inertial Measurement Units (IMU), and visual feature system which is called Sky-GVIO, with the aim of achieving continuous and accurate positioning in urban canyon environments. Furthermore, the system harmonizes Single Point Positioning (SPP) with Real-Time Kinematic (RTK) methodologies to bolster its operational versatility and resilience. In urban canyon environments, the positioning performance of S-NDM algorithm proposed in this paper is evaluated under different tightly coupled SPP-related and RTK-related models. The results exhibit that Sky-GVIO system achieves meter-level accuracy under SPP mode and sub-decimeter precision with RTK, surpassing the performance of GNSS/INS/Vision frameworks devoid of S-NDM. Additionally, the sky-view image dataset, inclusive of training and evaluation subsets, has been made publicly accessible for scholarly exploration at https://github.com/whuwangjr/sky-view-images .

Submitted 17 April, 2024; originally announced April 2024.

arXiv:2404.06758 [pdf, other]

Toward Holistic Planning and Control Optimization for Dual-Arm Rearrangement

Authors: Kai Gao, Zihe Ye, Duo Zhang, Baichuan Huang, Jingjin Yu

Abstract: Long-horizon task and motion planning (TAMP) is notoriously difficult to solve, let alone optimally, due to the tight coupling between the interleaved (discrete) task and (continuous) motion planning phases, where each phase on its own is frequently an NP-hard or even PSPACE-hard computational challenge. In this study, we tackle the even more challenging goal of jointly optimizing task and motion plans for a real dual-arm system in which the two arms operate in close vicinity to solve highly constrained tabletop multi-object rearrangement problems. Toward that, we construct a tightly integrated planning and control optimization pipeline, Makespan-Optimized Dual-Arm Planner (MODAP), that combines novel sampling techniques for task planning with state-of-the-art trajectory optimization techniques. Compared to the previous state of the art, MODAP produces task and motion plans that better coordinate a dual-arm system, delivering significant execution time improvements while simultaneously ensuring that the resulting time-parameterized trajectory conforms to specified acceleration and jerk limits.

Submitted 10 April, 2024; originally announced April 2024.

Comments: First three authors made equal contributions to this study

arXiv:2404.05211 [pdf, other]

Multi-level Graph Subspace Contrastive Learning for Hyperspectral Image Clustering

Authors: Jingxin Wang, Renxiang Guan, Kainan Gao, Zihao Li, Hao Li, Xianju Li, Chang Tang

Abstract: Hyperspectral image (HSI) clustering is a challenging task due to its high complexity. Although subspace clustering shows impressive performance for HSI, traditional methods tend to ignore the global-local interaction in HSI data. In this study, we proposed a multi-level graph subspace contrastive learning (MLGSC) for HSI clustering. The model is divided into the following main parts. Graph convolution subspace construction: utilizing spectral and texture features to construct two graph convolution views. Local-global graph representation: local graph representations were obtained by step-by-step convolutions and a more representative global graph representation was obtained using an attention-based pooling strategy. Multi-level graph subspace contrastive learning: multi-level contrastive learning was conducted to obtain local-global joint graph representations, to improve the consistency of the positive samples between views, and to obtain more robust graph embeddings. Specifically, graph-level contrastive learning is used to better learn global representations of HSI data. Node-level intra-view and inter-view contrastive learning is designed to learn joint representations of local regions of HSI. The proposed model is evaluated on four popular HSI datasets: Indian Pines, Pavia University, Houston, and Xu Zhou. The overall accuracies are 97.75%, 99.96%, 92.28%, and 95.73%, which significantly outperform the current state-of-the-art clustering methods.

Submitted 8 April, 2024; originally announced April 2024.

Comments: IJCNN 2024

arXiv:2404.04801 [pdf, ps, other]

doi 10.1007/s41605-024-00467-8

LHAASO-KM2A detector simulation using Geant4

Authors: Zhen Cao, F. Aharonian, Q. An, Axikegu, Y. X. Bai, Y. W. Bao, D. Bastieri, X. J. Bi, Y. J. Bi, J. T. Cai, Q. Cao, W. Y. Cao, Zhe Cao, J. Chang, J. F. Chang, A. M. Chen, E. S. Chen, Liang Chen, Lin Chen, Long Chen, M. J. Chen, M. L. Chen, Q. H. Chen, S. H. Chen, S. Z. Chen, et al. (254 additional authors not shown)

Abstract: KM2A is one of the main sub-arrays of LHAASO, working on gamma ray astronomy and cosmic ray physics at energies above 10 TeV. Detector simulation is an important foundation for estimating detector performance and data analysis. It is a big challenge to simulate the KM2A detector in the framework of Geant4 due to the need to track numerous photons from a large number of detector units (>6000) with large altitude difference (30 m) and huge coverage (1.3 km^2). In this paper, the design of the KM2A simulation code G4KM2A based on Geant4 is introduced. The process of G4KM2A is optimized mainly in memory consumption to avoid memory overflow. Some simplifications are used to significantly speed up the execution of G4KM2A. The running time is reduced by at least 30 times compared to full detector simulation. The particle distributions and the core/angle resolution comparison between simulation and experimental data of the full KM2A array are also presented, which show good agreement.

Submitted 7 April, 2024; originally announced April 2024.

arXiv:2403.20261 [pdf, other]

FABind+: Enhancing Molecular Docking through Improved Pocket Prediction and Pose Generation

Authors: Kaiyuan Gao, Qizhi Pei, Jinhua Zhu, Kun He, Lijun Wu

Abstract: Molecular docking is a pivotal process in drug discovery. While traditional techniques rely on extensive sampling and simulation governed by physical principles, these methods are often slow and costly. The advent of deep learning-based approaches has shown significant promise, offering increases in both accuracy and efficiency. Building upon the foundational work of FABind, a model designed with a focus on speed and accuracy, we present FABind+, an enhanced iteration that largely boosts the performance of its predecessor. We identify pocket prediction as a critical bottleneck in molecular docking and propose a novel methodology that significantly refines pocket prediction, thereby streamlining the docking process. Furthermore, we introduce modifications to the docking module to enhance its pose generation capabilities. In an effort to bridge the gap with conventional sampling/generative methods, we incorporate a simple yet effective sampling technique coupled with a confidence model, requiring only minor adjustments to the regression framework of FABind. Experimental results and analysis reveal that FABind+ remarkably outperforms the original FABind, achieves competitive state-of-the-art performance, and delivers insightful modeling strategies. This demonstrates FABind+ represents a substantial step forward in molecular docking and drug discovery. Our code is in https://github.com/QizhiPei/FABind.

Submitted 7 April, 2024; v1 submitted 29 March, 2024; originally announced March 2024.

Comments: 17 pages, 14 figures, 5 tables

arXiv:2403.14681 [pdf]

doi 10.4018/IJBAN.338367

AI Ethics: A Bibliometric Analysis, Critical Issues, and Key Gaps

Authors: Di Kevin Gao, Andrew Haverly, Sudip Mittal, Jiming Wu, Jingdao Chen

Abstract: Artificial intelligence (AI) ethics has emerged as a burgeoning yet pivotal area of scholarly research. This study conducts a comprehensive bibliometric analysis of the AI ethics literature over the past two decades. The analysis reveals a discernible tripartite progression, characterized by an incubation phase, followed by a subsequent phase focused on imbuing AI with human-like attributes, culminating in a third phase emphasizing the development of human-centric AI systems. After that, they present seven key AI ethics issues, encompassing the Collingridge dilemma, the AI status debate, challenges associated with AI transparency and explainability, privacy protection complications, considerations of justice and fairness, concerns about algocracy and human enfeeblement, and the issue of superintelligence. Finally, they identify two notable research gaps in AI ethics regarding the large ethics model (LEM) and AI identification and extend an invitation for further scholarly research.

Submitted 12 March, 2024; originally announced March 2024.

Journal ref: International Journal of Business Analytics (IJBAN), 2024, 11(1), 1-19

arXiv:2403.13507 [pdf, other]

FMM-Attack: A Flow-based Multi-modal Adversarial Attack on Video-based LLMs

Authors: Jinmin Li, Kuofeng Gao, Yang Bai, Jingyun Zhang, Shu-tao Xia, Yisen Wang

Abstract: Despite the remarkable performance of video-based large language models (LLMs), their adversarial threat remains unexplored. To fill this gap, we propose the first adversarial attack tailored for video-based LLMs by crafting flow-based multi-modal adversarial perturbations on a small fraction of frames within a video, dubbed FMM-Attack. Extensive experiments show that our attack can effectively induce video-based LLMs to generate incorrect answers when videos are added with imperceptible adversarial perturbations. Intriguingly, our FMM-Attack can also induce garbling in the model output, prompting video-based LLMs to hallucinate. Overall, our observations inspire a further understanding of multi-modal robustness and safety-related feature alignment across different modalities, which is of great importance for various large multi-modal models. Our code is available at https://github.com/THU-Kingmin/FMM-Attack.

Submitted 21 March, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

arXiv:2403.10943 [pdf, other]

MIntRec2.0: A Large-scale Benchmark Dataset for Multimodal Intent Recognition and Out-of-scope Detection in Conversations

Authors: Hanlei Zhang, Xin Wang, Hua Xu, Qianrui Zhou, Kai Gao, Jianhua Su, jinyue Zhao, Wenrui Li, Yanting Chen

Abstract: Multimodal intent recognition poses significant challenges, requiring the incorporation of non-verbal modalities from real-world contexts to enhance the comprehension of human intentions. Existing benchmark datasets are limited in scale and suffer from difficulties in handling out-of-scope samples that arise in multi-turn conversational interactions. We introduce MIntRec2.0, a large-scale benchmark dataset for multimodal intent recognition in multi-party conversations. It contains 1,245 dialogues with 15,040 samples, each annotated within a new intent taxonomy of 30 fine-grained classes. Besides 9,304 in-scope samples, it also includes 5,736 out-of-scope samples appearing in multi-turn contexts, which naturally occur in real-world scenarios. Furthermore, we provide comprehensive information on the speakers in each utterance, enriching its utility for multi-party conversational research. We establish a general framework supporting the organization of single-turn and multi-turn dialogue data, modality feature extraction, multimodal fusion, as well as in-scope classification and out-of-scope detection. Evaluation benchmarks are built using classic multimodal fusion methods, ChatGPT, and human evaluators. While existing methods incorporating nonverbal information yield improvements, effectively leveraging context information and detecting out-of-scope samples remains a substantial challenge. Notably, large language models exhibit a significant performance gap compared to humans, highlighting the limitations of machine learning methods in the cognitive intent understanding task. We believe that MIntRec2.0 will serve as a valuable resource, providing a pioneering foundation for research in human-machine conversational interactions, and significantly facilitating related applications. The full dataset and codes are available at https://github.com/thuiar/MIntRec2.0.

Submitted 27 June, 2024; v1 submitted 16 March, 2024; originally announced March 2024.

Comments: Accepted by ICLR 2024, Long Paper; The abstract is slightly modified due to the length limitation

arXiv:2403.10010 [pdf, other]

doi 10.1103/PhysRevLett.132.131002

Measurements of All-Particle Energy Spectrum and Mean Logarithmic Mass of Cosmic Rays from 0.3 to 30 PeV with LHAASO-KM2A

Authors: The LHAASO Collaboration, Zhen Cao, F. Aharonian, Q. An, A. Axikegu, Y. X. Bai, Y. W. Bao, D. Bastieri, X. J. Bi, Y. J. Bi, J. T. Cai, Q. Cao, W. Y. Cao, Zhe Cao, J. Chang, J. F. Chang, A. M. Chen, E. S. Chen, Liang Chen, Lin Chen, Long Chen, M. J. Chen, M. L. Chen, Q. H. Chen, S. H. Chen, et al. (256 additional authors not shown)

Abstract: We present measurements of the all-particle energy spectrum and mean logarithmic mass of cosmic rays in the energy range of 0.3-30 PeV using data collected from LHAASO-KM2A between September 2021 and December 2022, which is based on a nearly composition-independent energy reconstruction method, achieving unprecedented accuracy. Our analysis reveals the position of the knee at $3.67 \pm 0.05 \pm 0.15$ PeV. Below the knee, the spectral index is found to be $-2.7413 \pm 0.0004 \pm 0.0050$, while above the knee, it is $-3.128 \pm 0.005 \pm 0.027$, with the sharpness of the transition measured with a statistical error of 2%. The mean logarithmic mass of cosmic rays is almost heavier than helium in the whole measured energy range. It decreases from 1.7 at 0.3 PeV to 1.3 at 3 PeV, representing a 24% decline following a power law with an index of $-0.1200 \pm 0.0003 \pm 0.0341$. This is equivalent to an increase in abundance of light components. Above the knee, the mean logarithmic mass exhibits a power law trend towards heavier components, which is the reverse of the behavior observed in the all-particle energy spectrum. Additionally, the knee position and the change in power-law index are approximately the same. These findings suggest that the knee observed in the all-particle spectrum corresponds to the knee of the light component, rather than the medium-heavy components.

Submitted 26 March, 2024; v1 submitted 15 March, 2024; originally announced March 2024.

Comments: 8 pages, 3 figures

Journal ref: Physical Review Letters 132, 131002 (2024)
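
As a rough consistency check on the figures quoted in the abstract above, the two rounded endpoint values (mean logarithmic mass 1.7 at 0.3 PeV and 1.3 at 3 PeV) can be plugged into a two-point power-law estimate. This is only a sketch: it assumes a pure power law between the two endpoints, and the helper name `power_law_index` is hypothetical, not taken from the paper. The quoted index comes from a fit to the full data set, so only approximate agreement should be expected.

```python
import math


def power_law_index(x0, y0, x1, y1):
    """Index g of the power law y = y0 * (x / x0) ** g through two points."""
    return math.log(y1 / y0) / math.log(x1 / x0)


# Rounded endpoint values as quoted in the abstract (assumed exact here).
energy_low, mass_low = 0.3, 1.7    # PeV, mean logarithmic mass
energy_high, mass_high = 3.0, 1.3  # PeV, mean logarithmic mass

index = power_law_index(energy_low, mass_low, energy_high, mass_high)
decline = 1.0 - mass_high / mass_low  # fractional drop between the endpoints

print(f"two-point index: {index:.3f}")
print(f"fractional decline: {decline:.1%}")
```

With these rounded inputs the decline comes out near 24% and the index near -0.12, in line with the values stated in the abstract.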

arXiv:2403.06384 [pdf, other]

Precision Spectroscopy and Nuclear Structure Parameters in 7Li+ ion

Authors: Hua Guan, Xiao-Qiu Qi, Peng-Peng Zhou, Wei Sun, Shao-Long Chen, Xu-Rui Chang, Yao Huang, Pei-Pei Zhang, Zong-Chao Yan, G. W. F. Drake, Ai-Xi Chen, Zhen-Xiang Zhong, Ting-Yun Shi, Ke-Lin Gao

Abstract: The optical Ramsey technique is used to obtain precise measurements of the hyperfine splittings in the $2\,^3\!S_1$ and $2\,^3\!P_J$ states of $^7$Li$^+$. Together with bound-state quantum electrodynamic theory, the Zemach radius and quadrupole moment of the $^7$Li nucleus are determined to be $3.35(1)$~fm and $-3.86(5)$~fm$^2$ respectively, with the quadrupole moment deviating from the recommended value of $-4.00(3)$~fm$^2$ by $1.75\sigma$. Furthermore, we determine the quadrupole moment ratio of $^6$Li to $^7$Li as $0.101(13)$, exhibiting a $6\sigma$ deviation from the previously measured value of $0.020161(13)$ by LiF molecular spectroscopy. The results taken together provide a sensitive test of nuclear structure models.

Submitted 10 March, 2024; originally announced March 2024.

arXiv:2403.05551 [pdf]

A Bibliometric View of AI Ethics Development

Authors: Di Kevin Gao, Andrew Haverly, Sudip Mittal, Jingdao Chen

Abstract: Artificial Intelligence (AI) Ethics is a nascent yet critical research field. Recent developments in generative AI and foundational models necessitate a renewed look at the problem of AI Ethics. In this study, we perform a bibliometric analysis of AI Ethics literature for the last 20 years based on keyword search. Our study reveals a three-phase development in AI Ethics, namely an incubation phase, making AI human-like machines phase, and making AI human-centric machines phase. We conjecture that the next phase of AI ethics is likely to focus on making AI more machine-like as AI matches or surpasses humans intellectually, a term we coin as "machine-like human".

Submitted 8 February, 2024; originally announced March 2024.

arXiv:2403.01528 [pdf, other]

Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey

Authors: Qizhi Pei, Lijun Wu, Kaiyuan Gao, Jinhua Zhu, Yue Wang, Zun Wang, Tao Qin, Rui Yan

Abstract: The integration of biomolecular modeling with natural language (BL) has emerged as a promising interdisciplinary area at the intersection of artificial intelligence, chemistry and biology. This approach leverages the rich, multifaceted descriptions of biomolecules contained within textual data sources to enhance our fundamental understanding and enable downstream computational tasks such as biomolecule property prediction. The fusion of the nuanced narratives expressed through natural language with the structural and functional specifics of biomolecules described via various molecular modeling techniques opens new avenues for comprehensively representing and analyzing biomolecules. By incorporating the contextual language data that surrounds biomolecules into their modeling, BL aims to capture a holistic view encompassing both the symbolic qualities conveyed through language as well as quantitative structural characteristics. In this review, we provide an extensive analysis of recent advancements achieved through cross modeling of biomolecules and natural language. (1) We begin by outlining the technical representations of biomolecules employed, including sequences, 2D graphs, and 3D structures. (2) We then examine in depth the rationale and key objectives underlying effective multi-modal integration of language and molecular data sources. (3) We subsequently survey the practical applications enabled to date in this developing research area. (4) We also compile and summarize the available resources and datasets to facilitate future work. (5) Looking ahead, we identify several promising research directions worthy of further exploration and investment to continue advancing the field. The related resources and contents are being updated at https://github.com/QizhiPei/Awesome-Biomolecule-Language-Cross-Modeling.

Submitted 5 March, 2024; v1 submitted 3 March, 2024; originally announced March 2024.

Comments: Survey Paper. 25 pages, 9 figures, and 3 tables

arXiv:2402.17810 [pdf, other]

BioT5+: Towards Generalized Biological Understanding with IUPAC Integration and Multi-task Tuning

Authors: Qizhi Pei, Lijun Wu, Kaiyuan Gao, Xiaozhuan Liang, Yin Fang, Jinhua Zhu, Shufang Xie, Tao Qin, Rui Yan

Abstract: Recent research trends in computational biology have increasingly focused on integrating text and bio-entity modeling, especially in the context of molecules and proteins. However, previous efforts like BioT5 faced challenges in generalizing across diverse tasks and lacked a nuanced understanding of molecular structures, particularly in their textual representations (e.g., IUPAC). This paper introduces BioT5+, an extension of the BioT5 framework, tailored to enhance biological research and drug discovery. BioT5+ incorporates several novel features: integration of IUPAC names for molecular understanding, inclusion of extensive bio-text and molecule data from sources like bioRxiv and PubChem, multi-task instruction tuning for generality across tasks, and a numerical tokenization technique for improved processing of numerical data. These enhancements allow BioT5+ to bridge the gap between molecular representations and their textual descriptions, providing a more holistic understanding of biological entities, and largely improving the grounded reasoning of bio-text and bio-sequences. The model is pre-trained and fine-tuned with a large number of experiments, including 3 types of problems (classification, regression, generation), 15 kinds of tasks, and 21 total benchmark datasets, demonstrating remarkable performance and state-of-the-art results in most cases. BioT5+ stands out for its ability to capture intricate relationships in biological data, thereby contributing significantly to bioinformatics and computational biology. Our code is available at https://github.com/QizhiPei/BioT5.

Submitted 31 May, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

Comments: Accepted by ACL 2024 (Findings)

arXiv:2402.16088 [pdf]

Origin of giant magnetoresistance in layered nodal-line semimetal TaNiTe5 nanoflakes

Authors: Ding-Bang Zhou, Kuang-Hong Gao, Meng-Fan Zhao, Zhi-Yan Jia, Xiao-Xia Hu, Qian-Jin Guo, Hai-Yan Du, Xiao-Ping Chen, Zhi-Qing Li

Abstract: Layered transition metal chalcogenides have stimulated wide research interest due to their many exotic physical properties. In this paper, we studied the magnetotransport properties of exfoliated TaNiTe5, a recently discovered Dirac nodal-line semimetal. A giant positive magnetoresistance (MR) is observed when the current is parallel to the crystallographic c axis, while it is strongly diminished when the current flows along the a axis. The observed giant MR is gradually suppressed either on reducing the thickness of the nanoflake or on increasing temperature. By performing MR measurements in tilted magnetic fields, the interlayer coupling is found to be weakened both by reducing the thickness and by increasing temperature. We propose a mechanism of electron-electron interaction-assisted interlayer transport as an origin of the giant MR. The mechanism is likely to provide an explanation for the giant MR in other layered materials.

Submitted 18 May, 2024; v1 submitted 25 February, 2024; originally announced February 2024.

Comments: 21 pages, 7 figures, 1 table

arXiv:2402.01347 [pdf, ps, other]

doi 10.1103/PhysRevB.109.224508

Quantum Griffiths singularity in three-dimensional MoTiN superconducting films

Authors: Zi-Xiao Wang, Tian-Yu Jing, Zi-Yan Han, Kuang-Hong Gao, Song-Ci Li, Zhi-Qing Li

Abstract: Quantum Griffiths singularity (QGS) has been experimentally observed in a range of two-dimensional (2D) superconducting systems. Although it is theoretically suggested that the QGS also exists in three-dimensional (3D) superconductors, there is almost no experimental support for the theoretical prediction. In the present paper, we observe the occurrence of QGS in a series of $\sim$80-nm-thick Mo$_{0.8}$Ti$_{0.2}$N$_x$ ($0.84 \lesssim x \lesssim 1.12$) superconducting films near the field-driven superconductor-metal transition (SMT). These films have a NaCl structure and are 3D with respect to the superconductivity. For each film, the low-temperature magnetoresistance isotherms, measured in magnetic fields perpendicular or parallel to the film plane, do not cross at a single point but over a clearly wide region. The dynamical critical exponents $z\nu_{\perp}$ (for perpendicular field) and $z\nu_{\parallel}$ (for parallel field) obtained by analyzing the related magnetoresistance isotherms increase with decreasing temperature and tend to diverge as $T\rightarrow 0$ K. In addition, the effective resistivity data for the perpendicular and parallel field in the vicinity of the SMTs both obey an activated scaling based on the random transverse-field Ising model. We also fabricate a $\sim$80-nm-thick (Mo$_{0.8}$Ti$_{0.2}$)$_2$N$_{1.06}$ superconducting film with face-centered cubic structure at low nitrogen partial pressure. It is found that the low-temperature magnetoresistance isotherms for the perpendicular (parallel) field cross at a single point and the resistivity data for the perpendicular (parallel) field in the vicinity of the field-induced SMT obey the power-law scaling deduced from the dirty-boson model. Our results provide unambiguous experimental evidence for the existence of QGS in 3D superconductors.

Submitted 2 February, 2024; originally announced February 2024.

Comments: 11 pages and 9 Figures

Journal ref: Physical Review B 109, 224508 (2024)

arXiv:2401.13488 [pdf, other]

Fast Inverse Model Transformation: Algebraic Framework for Fast Data Plane Verification

Authors: Shenshen Chen, Jian Luo, Dong Guo, Kai Gao, Yang Richard Yang

Abstract: Data plane verification (DPV) analyzes routing tables and detects routing abnormalities and policy violations during network operation and planning. Thus, it has become an important tool to harden the networking infrastructure and the computing systems built on top of it. Substantial advancements have been made in the last decade and state-of-the-art DPV systems can achieve sub-microsecond verification for an update of a single forwarding rule. In this paper, we introduce fast inverse model transformation (FIMT), the first theoretical framework to systematically model and analyze centralized DPV systems. FIMT reveals the algebraic structure in the model update process, a key step in fast DPV systems. Thus, it can systematically analyze the correctness of several DPV systems, using algebraic properties. The theory also guides the design and implementation of NeoFlash, a refactored version of Flash with new optimization techniques. Evaluations show that NeoFlash outperforms existing state-of-the-art centralized DPV systems in various datasets and reveal insights into key techniques towards fast DPV.

Submitted 26 February, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

Comments: 12 pages pre-reference

arXiv:2401.11170 [pdf, other]

Inducing High Energy-Latency of Large Vision-Language Models with Verbose Images

Authors: Kuofeng Gao, Yang Bai, Jindong Gu, Shu-Tao Xia, Philip Torr, Zhifeng Li, Wei Liu

Abstract: Large vision-language models (VLMs) such as GPT-4 have achieved exceptional performance across various multi-modal tasks. However, the deployment of VLMs necessitates substantial energy consumption and computational resources. Once attackers maliciously induce high energy consumption and latency time (energy-latency cost) during inference of VLMs, it will exhaust computational resources. In this paper, we explore this attack surface about availability of VLMs and aim to induce high energy-latency cost during inference of VLMs. We find that high energy-latency cost during inference of VLMs can be manipulated by maximizing the length of generated sequences. To this end, we propose verbose images, with the goal of crafting an imperceptible perturbation to induce VLMs to generate long sentences during inference. Concretely, we design three loss objectives. First, a loss is proposed to delay the occurrence of end-of-sequence (EOS) token, where EOS token is a signal for VLMs to stop generating further tokens. Moreover, an uncertainty loss and a token diversity loss are proposed to increase the uncertainty over each generated token and the diversity among all tokens of the whole generated sequence, respectively, which can break output dependency at token-level and sequence-level. Furthermore, a temporal weight adjustment algorithm is proposed, which can effectively balance these losses. Extensive experiments demonstrate that our verbose images can increase the length of generated sequences by 7.87 times and 8.56 times compared to original images on MS-COCO and ImageNet datasets, which presents potential challenges for various applications. Our code is available at https://github.com/KuofengGao/Verbose_Images.

Submitted 22 March, 2024; v1 submitted 20 January, 2024; originally announced January 2024.

Comments: Accepted by ICLR 2024

arXiv:2401.02954 [pdf, other]

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

Authors: DeepSeek-AI, Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, Huazuo Gao, Kaige Gao, Wenjun Gao, Ruiqi Ge, Kang Guan, Daya Guo, Jianzhong Guo, Guangbo Hao, Zhewen Hao, Ying He, Wenjie Hu, Panpan Huang, Erhang Li, et al. (63 additional authors not shown)

Abstract: The rapid development of open-source large language models (LLMs) has been truly remarkable. However, the scaling law described in previous literature presents varying conclusions, which casts a dark cloud over scaling LLMs. We delve into the study of scaling laws and present our distinctive findings that facilitate scaling of large scale models in two commonly used open-source configurations, 7B and 67B. Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source language models with a long-term perspective. To support the pre-training phase, we have developed a dataset that currently consists of 2 trillion tokens and is continuously expanding. We further conduct supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) on DeepSeek LLM Base models, resulting in the creation of DeepSeek Chat models. Our evaluation results demonstrate that DeepSeek LLM 67B surpasses LLaMA-2 70B on various benchmarks, particularly in the domains of code, mathematics, and reasoning. Furthermore, open-ended evaluations reveal that DeepSeek LLM 67B Chat exhibits superior performance compared to GPT-3.5.

Submitted 5 January, 2024; originally announced January 2024.

arXiv:2312.14667 [pdf, other]

Token-Level Contrastive Learning with Modality-Aware Prompting for Multimodal Intent Recognition

Authors: Qianrui Zhou, Hua Xu, Hao Li, Hanlei Zhang, Xiaohan Zhang, Yifan Wang, Kai Gao

Abstract: Multimodal intent recognition aims to leverage diverse modalities such as expressions, body movements and tone of speech to comprehend the user's intent, constituting a critical task for understanding human language and behavior in real-world multimodal scenarios. Nevertheless, the majority of existing methods ignore potential correlations among different modalities and have limitations in effectively learning semantic features from nonverbal modalities. In this paper, we introduce a token-level contrastive learning method with modality-aware prompting (TCL-MAP) to address the above challenges. To establish an optimal multimodal semantic environment for text modality, we develop a modality-aware prompting module (MAP), which effectively aligns and fuses features from text, video and audio modalities with similarity-based modality alignment and cross-modality attention mechanism. Based on the modality-aware prompt and ground truth labels, the proposed token-level contrastive learning framework (TCL) constructs augmented samples and employs NT-Xent loss on the label token. Specifically, TCL capitalizes on the optimal textual semantic insights derived from intent labels to guide the learning processes of other modalities in return. Extensive experiments show that our method achieves remarkable improvements compared to state-of-the-art methods. Additionally, ablation analyses demonstrate the superiority of the modality-aware prompt over the handcrafted prompt, which holds substantial significance for multimodal prompt learning. The codes are released at https://github.com/thuiar/TCL-MAP.

Submitted 5 June, 2024; v1 submitted 22 December, 2023; originally announced December 2023.

Comments: Accepted by AAAI 2024 (Main Track, Long Paper)
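
The TCL-MAP abstract above mentions applying NT-Xent loss to the label token. For readers unfamiliar with it, the following is a minimal sketch of the generic NT-Xent (normalized temperature-scaled cross-entropy) loss as commonly defined for paired views. It is not the authors' implementation: the function names, the default temperature of 0.5, and the toy vectors are assumptions made for illustration, and all embeddings are assumed to be non-zero.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(views_a, views_b, temperature=0.5):
    """Generic NT-Xent loss; views_a[i] and views_b[i] form a positive pair."""
    z = list(views_a) + list(views_b)
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)  # index of the positive partner of i
        positive = cosine(z[i], z[j]) / temperature
        # The denominator covers every other embedding, positive included.
        others = [cosine(z[i], z[k]) / temperature
                  for k in range(2 * n) if k != i]
        total += math.log(sum(math.exp(s) for s in others)) - positive
    return total / (2 * n)


# Toy check: correctly paired views should score a lower loss.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = nt_xent(anchors, aligned)
loss_swapped = nt_xent(anchors, swapped)
print(loss_aligned < loss_swapped)
```

The loss pulls each embedding towards its paired view and pushes it away from every other embedding in the batch; the temperature controls how sharply dissimilar pairs are penalized.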

arXiv:2312.12123 [pdf, other]

Probabilistic Prediction of Longitudinal Trajectory Considering Driving Heterogeneity with Interpretability

Authors: Shuli Wang, Kun Gao, Lanfang Zhang, Yang Liu, Lei Chen

Abstract: Automated vehicles are envisioned to navigate safely in complex mixed-traffic scenarios alongside human-driven vehicles. To ensure a high degree of safety, accurately predicting the maneuvers of surrounding vehicles and their future positions is a critical task and attracts much attention. However, most existing studies focused on reasoning about positional information based on objective historical trajectories without fully considering the heterogeneity of driving behaviors. Therefore, this study proposes a trajectory prediction framework that combines Mixture Density Networks (MDN) and considers the driving heterogeneity to provide probabilistic and personalized predictions. Specifically, based on a certain length of historical trajectory data, the situation-specific driving preferences of each driver are identified, where key driving behavior feature vectors are extracted to characterize heterogeneity in driving behavior among different drivers. With the inputs of the short-term historical trajectory data and key driving behavior feature vectors, a probabilistic LSTMMD-DBV model combined with LSTM-based encoder-decoder networks and MDN layers is utilized to carry out personalized predictions. Finally, the SHapley Additive exPlanations (SHAP) method is employed to interpret the trained model for predictions. The proposed framework is tested based on a wide-range vehicle trajectory dataset. The results indicate that the proposed model can generate probabilistic future trajectories with remarkably improved predictions compared to existing benchmark models. Moreover, the results confirm that the additional input of driving behavior feature vectors representing the heterogeneity of driving behavior could provide more information and thus contribute to improving the prediction accuracy.

Submitted 19 December, 2023; originally announced December 2023.

Comments: 14 pages, 8 figures

arXiv:2312.05104

An Autonomous Driving Model Integrated with BEV-V2X Perception, Fusion Prediction of Motion and Occupancy, and Driving Planning, in Complex Traffic Intersections

Authors: Fukang Li, Wenlin Ou, Kunpeng Gao, Yuwen Pang, Yifei Li, Henry Fan

Abstract: The comprehensiveness of vehicle-to-everything (V2X) recognition enriches and holistically shapes the global Birds-Eye-View (BEV) perception, incorporating rich semantics and integrating driving scene information, thereby serving features of vehicle state prediction, decision-making and driving planning. Utilizing V2X message sets to form a BEV map proves to be an effective perception method for connected and automated vehicles (CAVs). Specifically, Map Msg. (MAP), Signal Phase And Timing (SPAT) and Roadside Information (RSI) contribute to the achievement of road connectivity, synchronized traffic signal navigation and obstacle warning. Moreover, harnessing time-sequential Basic Safety Msg. (BSM) data from multiple vehicles allows for real-time perception and future state prediction. Therefore, this paper develops a comprehensive autonomous driving model that relies on BEV-V2X perception, Interacting Multiple model Unscented Kalman Filter (IMM-UKF)-based fusion prediction, and deep reinforcement learning (DRL)-based decision making and planning. We integrated them into a DRL environment to develop an optimal set of unified driving behaviors that encompass obstacle avoidance, lane changes, overtaking, turning maneuvers, and synchronized traffic signal navigation. Consequently, a complex traffic intersection scenario was simulated, and the well-trained model was applied for driving planning. The observed driving behavior closely resembled that of an experienced driver, exhibiting anticipatory actions and revealing notable operational highlights of driving policy.

Submitted 22 April, 2024; v1 submitted 8 December, 2023; originally announced December 2023.

Comments: The content of the paper has not received unanimous consent from all the members and requires further evaluation prior to submission

arXiv:2311.16194 [pdf, other]

BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP

Authors: Jiawang Bai, Kuofeng Gao, Shaobo Min, Shu-Tao Xia, Zhifeng Li, Wei Liu

Abstract: Contrastive Vision-Language Pre-training, known as CLIP, has shown promising effectiveness in addressing downstream image recognition tasks. However, recent works revealed that the CLIP model can be implanted with a downstream-oriented backdoor. On downstream tasks, one victim model performs well on clean samples but predicts a specific target class whenever a specific trigger is present. For injecting a backdoor, existing attacks depend on a large amount of additional data to maliciously fine-tune the entire pre-trained CLIP model, which makes them inapplicable to data-limited scenarios. In this work, motivated by the recent success of learnable prompts, we address this problem by injecting a backdoor into the CLIP model in the prompt learning stage. Our method named BadCLIP is built on a novel and effective mechanism in backdoor attacks on CLIP, i.e., influencing both the image and text encoders with the trigger. It consists of a learnable trigger applied to images and a trigger-aware context generator, such that the trigger can change text features via trigger-aware prompts, resulting in a powerful and generalizable attack. Extensive experiments conducted on 11 datasets verify that the clean accuracy of BadCLIP is similar to that of advanced prompt learning methods and the attack success rate is higher than 99% in most cases. BadCLIP is also generalizable to unseen classes, and shows a strong generalization capability under cross-dataset and cross-domain settings.

Submitted 21 March, 2024; v1 submitted 26 November, 2023; originally announced November 2023.

Comments: 14 pages, 6 figures

arXiv:2311.12644 [pdf, other]

Careful Selection and Thoughtful Discarding: Graph Explicit Pooling Utilizing Discarded Nodes

Authors: Chuang Liu, Wenhang Yu, Kuang Gao, Xueqi Ma, Yibing Zhan, Jia Wu, Bo Du, Wenbin Hu

Abstract: Graph pooling has been increasingly recognized as crucial for Graph Neural Networks (GNNs) to facilitate hierarchical graph representation learning. Existing graph pooling methods commonly consist of two stages: selecting top-ranked nodes and discarding the remaining to construct coarsened graph representations. However, this paper highlights two key issues with these methods: 1) The process of selecting nodes to discard frequently employs additional Graph Convolutional Networks or Multilayer Perceptrons, lacking a thorough evaluation of each node's impact on the final graph representation and subsequent prediction tasks. 2) Current graph pooling methods tend to directly discard the noise segment (dropped) of the graph without accounting for the latent information contained within these elements. To address the first issue, we introduce a novel Graph Explicit Pooling (GrePool) method, which selects nodes by explicitly leveraging the relationships between the nodes and final representation vectors crucial for classification. The second issue is addressed using an extended version of GrePool (i.e., GrePool+), which applies a uniform loss on the discarded nodes. This addition is designed to augment the training process and improve classification accuracy. Furthermore, we conduct comprehensive experiments across 12 widely used datasets to validate our proposed method's effectiveness, including the Open Graph Benchmark datasets. Our experimental results uniformly demonstrate that GrePool outperforms 14 baseline methods for most datasets. Likewise, implementing GrePool+ enhances GrePool's performance without incurring additional computational costs.

Submitted 21 November, 2023; originally announced November 2023.

Comments: 14 pages, 7 figures, 4 tables. Submitting to Science China Information Sciences

arXiv:2310.18755 [pdf, other]

doi 10.1145/3604237.3626913

Deeper Hedging: A New Agent-based Model for Effective Deep Hedging

Authors: Kang Gao, Stephen Weston, Perukrishnen Vytelingum, Namid R. Stillman, Wayne Luk, Ce Guo

Abstract: We propose the Chiarella-Heston model, a new agent-based model for improving the effectiveness of deep hedging strategies. This model includes momentum traders, fundamental traders, and volatility traders. The volatility traders participate in the market by innovatively following a Heston-style volatility signal. The proposed model generalises both the extended Chiarella model and the Heston stochastic volatility model, and is calibrated to reproduce as many empirical stylized facts as possible. According to the stylised facts distance metric, the proposed model is able to reproduce more realistic financial time series than three baseline models: the extended Chiarella model, the Heston model, and the Geometric Brownian Motion. The proposed model is further validated by the Generalized Subtracted L-divergence metric. With the proposed Chiarella-Heston model, we generate a training dataset to train a deep hedging agent for optimal hedging strategies under various transaction cost levels. The deep hedging agent employs the Deep Deterministic Policy Gradient algorithm and is trained to maximize profits and minimize risks. Our testing results reveal that the deep hedging agent, trained with data generated by our proposed model, outperforms the baseline in most transaction cost levels. Furthermore, the testing process, which is conducted using empirical data, demonstrates the effective performance of the trained deep hedging agent in a realistic trading environment.

Submitted 28 October, 2023; originally announced October 2023.

Comments: Accepted in the 4th ACM International Conference on AI in Finance (ICAIF'23)

arXiv:2310.17082 [pdf, ps, other]

Does or did the supernova remnant Cassiopeia A operate as a PeVatron?

Authors: Zhen Cao, F. Aharonian, Q. An, Axikegu, Y. X. Bai, Y. W. Bao, D. Bastieri, X. J. Bi, Y. J. Bi, J. T. Cai, Q. Cao, W. Y. Cao, Zhe Cao, J. Chang, J. F. Chang, A. M. Chen, E. S. Chen, Liang Chen, Lin Chen, Long Chen, M. J. Chen, M. L. Chen, Q. H. Chen, S. H. Chen, S. Z. Chen , et al. (255 additional authors not shown)

Abstract: For decades, supernova remnants (SNRs) have been considered the prime sources of Galactic cosmic rays (CRs). But whether SNRs can accelerate CR protons to PeV energies and thus dominate CR flux up to the knee is currently under intensive theoretical and phenomenological debate. The direct test of the ability of SNRs to operate as CR PeVatrons can be provided by ultrahigh-energy (UHE; $E_\gamma \geq 100$ TeV) $\gamma$-rays. In this context, the historical SNR Cassiopeia A (Cas A) is considered one of the most promising targets for UHE observations. This paper presents the observation of Cas A and its vicinity by the LHAASO KM2A detector. The exceptional sensitivity of LHAASO KM2A in the UHE band, combined with the young age of Cas A, enabled us to derive stringent model-independent limits on the energy budget of UHE protons and nuclei accelerated by Cas A at any epoch after the explosion. The results challenge the prevailing paradigm that Cas A-type SNRs are major suppliers of PeV CRs in the Milky Way.

Submitted 25 October, 2023; originally announced October 2023.

Comments: 11 pages, 3 figures, Accepted by the APJL

arXiv:2310.16006 [pdf, other]

Machine-learning the phase diagram of a strongly-interacting Fermi gas

Authors: M. Link, K. Gao, A. Kell, M. Breyer, D. Eberz, B. Rauf, M. Köhl

Abstract: We determine the phase diagram of strongly correlated fermions in the crossover from Bose-Einstein condensates of molecules (BEC) to Cooper pairs of fermions (BCS) utilizing an artificial neural network. By applying advanced image recognition techniques to the momentum distribution of the fermions, a quantity which has been widely considered as featureless for providing information about the condensed state, we measure the critical temperature and show that it exhibits a maximum on the bosonic side of the crossover. Additionally, we back-analyze the trained neural network and demonstrate that it interprets physically relevant quantities.

Submitted 24 October, 2023; originally announced October 2023.

Journal ref: Phys. Rev. Lett. 130, 203401 (2023)

Showing 1–50 of 301 results for author: Gao, K