subscribe to arXiv mailings

States Hidden in Hidden States: LLMs Emerge Discrete State Representations Implicitly

Authors: Junhao Chen, Shengding Hu, Zhiyuan Liu, Maosong Sun

Abstract: Large Language Models (LLMs) exhibit various emergent abilities. Among these abilities, some might reveal the internal working mechanisms of models. In this paper, we uncover a novel emergent capability in models: the intrinsic ability to perform extended sequences of calculations without relying on chain-of-thought step-by-step solutions. Remarkably, the most advanced models can directly output t… ▽ More Large Language Models (LLMs) exhibit various emergent abilities. Among these abilities, some might reveal the internal working mechanisms of models. In this paper, we uncover a novel emergent capability in models: the intrinsic ability to perform extended sequences of calculations without relying on chain-of-thought step-by-step solutions. Remarkably, the most advanced models can directly output the results of two-digit number additions with lengths extending up to 15 addends. We hypothesize that the model emerges Implicit Discrete State Representations (IDSRs) within its hidden states and performs symbolic calculations internally. To test this hypothesis, we design a sequence of experiments that look into the hidden states. Specifically, we first confirm that IDSRs exist. Then, we provide interesting observations about the formation of IDSRs from layer, digit, and sequence perspectives. Finally, we confirm that models indeed use IDSRs to produce the final answers. However, we also discover that these state representations are far from lossless in current open-sourced models, leading to inaccuracies in their final performance. Our work presents a novel exploration of LLMs' symbolic calculation abilities and the underlying mechanisms. △ Less

Submitted 16 July, 2024; originally announced July 2024.

arXiv:2407.10233 [pdf, other]

Visual Prompt Selection for In-Context Learning Segmentation

Authors: Wei Suo, Lanqing Lai, Mengyang Sun, Hanwang Zhang, Peng Wang, Yanning Zhang

Abstract: As a fundamental and extensively studied task in computer vision, image segmentation aims to locate and identify different semantic concepts at the pixel level. Recently, inspired by In-Context Learning (ICL), several generalist segmentation frameworks have been proposed, providing a promising paradigm for segmenting specific objects. However, existing works mostly ignore the value of visual promp… ▽ More As a fundamental and extensively studied task in computer vision, image segmentation aims to locate and identify different semantic concepts at the pixel level. Recently, inspired by In-Context Learning (ICL), several generalist segmentation frameworks have been proposed, providing a promising paradigm for segmenting specific objects. However, existing works mostly ignore the value of visual prompts or simply apply similarity sorting to select contextual examples. In this paper, we focus on rethinking and improving the example selection strategy. By comprehensive comparisons, we first demonstrate that ICL-based segmentation models are sensitive to different contexts. Furthermore, empirical evidence indicates that the diversity of contextual prompts plays a crucial role in guiding segmentation. Based on the above insights, we propose a new stepwise context search method. Different from previous works, we construct a small yet rich candidate pool and adaptively search the well-matched contexts. More importantly, this method effectively reduces the annotation cost by compacting the search space. Extensive experiments show that our method is an effective strategy for selecting examples and enhancing segmentation performance. △ Less

Submitted 14 July, 2024; originally announced July 2024.

Comments: Accept by ECCV2024

arXiv:2407.07093 [pdf, other]

FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation

Authors: Liqun Ma, Mingjie Sun, Zhiqiang Shen

Abstract: This work presents a Fully BInarized Large Language Model (FBI-LLM), demonstrating for the first time how to train a large-scale binary language model from scratch (not the partial binary or ternary LLM like BitNet b1.58) to match the performance of its full-precision counterparts (e.g., FP16 or BF16) in transformer-based LLMs. It achieves this by employing an autoregressive distillation (AD) loss… ▽ More This work presents a Fully BInarized Large Language Model (FBI-LLM), demonstrating for the first time how to train a large-scale binary language model from scratch (not the partial binary or ternary LLM like BitNet b1.58) to match the performance of its full-precision counterparts (e.g., FP16 or BF16) in transformer-based LLMs. It achieves this by employing an autoregressive distillation (AD) loss with maintaining equivalent model dimensions (130M, 1.3B, 7B) and training data volume as regular LLM pretraining, while delivering competitive results in terms of perplexity and task-specific effectiveness. Intriguingly, by analyzing the training trajectory, we find that the pretrained weight is not necessary for training binarized LLMs from scratch. This research encourages a new computational framework and may facilitate the future design of specialized hardware tailored for fully 1-bit LLMs. We make all models, code, and training dataset fully accessible and transparent to support further research (Code: https://github.com/LiqunMa/FBI-LLM. Model: https://huggingface.co/LiqunMa/). △ Less

Submitted 9 July, 2024; originally announced July 2024.

Comments: Github at https://github.com/LiqunMa/FBI-LLM

arXiv:2407.07061 [pdf, other]

Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence

Authors: Weize Chen, Ziming You, Ran Li, Yitong Guan, Chen Qian, Chenyang Zhao, Cheng Yang, Ruobing Xie, Zhiyuan Liu, Maosong Sun

Abstract: The rapid advancement of large language models (LLMs) has paved the way for the development of highly capable autonomous agents. However, existing multi-agent frameworks often struggle with integrating diverse capable third-party agents due to reliance on agents defined within their own ecosystems. They also face challenges in simulating distributed environments, as most frameworks are limited to… ▽ More The rapid advancement of large language models (LLMs) has paved the way for the development of highly capable autonomous agents. However, existing multi-agent frameworks often struggle with integrating diverse capable third-party agents due to reliance on agents defined within their own ecosystems. They also face challenges in simulating distributed environments, as most frameworks are limited to single-device setups. Furthermore, these frameworks often rely on hard-coded communication pipelines, limiting their adaptability to dynamic task requirements. Inspired by the concept of the Internet, we propose the Internet of Agents (IoA), a novel framework that addresses these limitations by providing a flexible and scalable platform for LLM-based multi-agent collaboration. IoA introduces an agent integration protocol, an instant-messaging-like architecture design, and dynamic mechanisms for agent teaming and conversation flow control. Through extensive experiments on general assistant tasks, embodied AI tasks, and retrieval-augmented generation benchmarks, we demonstrate that IoA consistently outperforms state-of-the-art baselines, showcasing its ability to facilitate effective collaboration among heterogeneous agents. IoA represents a step towards linking diverse agents in an Internet-like environment, where agents can seamlessly collaborate to achieve greater intelligence and capabilities. Our codebase has been released at \url{https://github.com/OpenBMB/IoA}. △ Less

Submitted 10 July, 2024; v1 submitted 9 July, 2024; originally announced July 2024.

Comments: work in progress

arXiv:2407.06853 [pdf, other]

TimeTravel: Real-time Timing Drift Attack on System Time Using Acoustic Waves

Authors: Jianshuo Liu, Hong Li, Haining Wang, Mengjie Sun, Hui Wen, Jinfa Wang, Limin Sun

Abstract: Real-time Clock (RTC) has been widely used in various real-time systems to provide precise system time. In this paper, we reveal a new security vulnerability of the RTC circuit, where the internal storage time or timestamp can be arbitrarily modified forward or backward. The security threat of dynamic modifications of system time caused by this vulnerability is called TimeTravel. Based on acoustic… ▽ More Real-time Clock (RTC) has been widely used in various real-time systems to provide precise system time. In this paper, we reveal a new security vulnerability of the RTC circuit, where the internal storage time or timestamp can be arbitrarily modified forward or backward. The security threat of dynamic modifications of system time caused by this vulnerability is called TimeTravel. Based on acoustic resonance and piezoelectric effects, TimeTravel applies acoustic guide waves to the quartz crystal, thereby adjusting the characteristics of the oscillating signal transmitted into the RTC circuit. By manipulating the parameters of acoustic waves, TimeTravel can accelerate or decelerate the timing speed of system time at an adjustable rate, resulting in the relative drift of the timing, which can pose serious safety threats. To assess the severity of TimeTravel, we examine nine modules and two commercial devices under the RTC circuit. The experimental results show that TimeTravel can drift system time forward and backward at a chosen speed with a maximum 93% accuracy. Our analysis further shows that TimeTravel can maintain an attack success rate of no less than 77% under environments with typical obstacle items. △ Less

Submitted 9 July, 2024; originally announced July 2024.

Comments: Accepted by USENIX Security 2024 winter cycle and will appear in USENIX Security 2025

arXiv:2407.06529 [pdf]

Advanced Financial Fraud Detection Using GNN-CL Model

Authors: Yu Cheng, Junjie Guo, Shiqing Long, You Wu, Mengfang Sun, Rong Zhang

Abstract: The innovative GNN-CL model proposed in this paper marks a breakthrough in the field of financial fraud detection by synergistically combining the advantages of graph neural networks (gnn), convolutional neural networks (cnn) and long short-term memory (LSTM) networks. This convergence enables multifaceted analysis of complex transaction patterns, improving detection accuracy and resilience agains… ▽ More The innovative GNN-CL model proposed in this paper marks a breakthrough in the field of financial fraud detection by synergistically combining the advantages of graph neural networks (gnn), convolutional neural networks (cnn) and long short-term memory (LSTM) networks. This convergence enables multifaceted analysis of complex transaction patterns, improving detection accuracy and resilience against complex fraudulent activities. A key novelty of this paper is the use of multilayer perceptrons (MLPS) to estimate node similarity, effectively filtering out neighborhood noise that can lead to false positives. This intelligent purification mechanism ensures that only the most relevant information is considered, thereby improving the model's understanding of the network structure. Feature weakening often plagues graph-based models due to the dilution of key signals. In order to further address the challenge of feature weakening, GNN-CL adopts reinforcement learning strategies. By dynamically adjusting the weights assigned to central nodes, it reinforces the importance of these influential entities to retain important clues of fraud even in less informative data. Experimental evaluations on Yelp datasets show that the results highlight the superior performance of GNN-CL compared to existing methods. △ Less

Submitted 8 July, 2024; originally announced July 2024.

arXiv:2407.05693 [pdf, other]

Sub-SA: Strengthen In-context Learning via Submodular Selective Annotation

Authors: Jian Qian, Miao Sun, Sifan Zhou, Ziyu Zhao, Ruizhi Hun, Patrick Chiang

Abstract: In-context learning (ICL) leverages in-context examples as prompts for the predictions of Large Language Models (LLMs). These prompts play a crucial role in achieving strong performance. However, the selection of suitable prompts from a large pool of labeled examples often entails significant annotation costs. To address this challenge, we propose \textbf{Sub-SA} (\textbf{Sub}modular \textbf{S}ele… ▽ More In-context learning (ICL) leverages in-context examples as prompts for the predictions of Large Language Models (LLMs). These prompts play a crucial role in achieving strong performance. However, the selection of suitable prompts from a large pool of labeled examples often entails significant annotation costs. To address this challenge, we propose \textbf{Sub-SA} (\textbf{Sub}modular \textbf{S}elective \textbf{A}nnotation), a submodule-based selective annotation method. The aim of Sub-SA is to reduce annotation costs while improving the quality of in-context examples and minimizing the time consumption of the selection process. In Sub-SA, we design a submodular function that facilitates effective subset selection for annotation and demonstrates the characteristics of monotonically and submodularity from the theoretical perspective. Specifically, we propose \textbf{RPR} (\textbf{R}eward and \textbf{P}enalty \textbf{R}egularization) to better balance the diversity and representativeness of the unlabeled dataset attributed to a reward term and a penalty term, respectively. Consequently, the selection for annotations can be effectively addressed with a simple yet effective greedy search algorithm based on the submodular function. Finally, we apply the similarity prompt retrieval to get the examples for ICL. △ Less

Submitted 8 July, 2024; originally announced July 2024.

arXiv:2407.04709 [pdf, other]

Efficient 4D Radar Data Auto-labeling Method using LiDAR-based Object Detection Network

Authors: Min-Hyeok Sun, Dong-Hee Paek, Seung-Hyun Song, Seung-Hyun Kong

Abstract: Focusing on the strength of 4D (4-Dimensional) radar, research about robust 3D object detection networks in adverse weather conditions has gained attention. To train such networks, datasets that contain large amounts of 4D radar data and ground truth labels are essential. However, the existing 4D radar datasets (e.g., K-Radar) lack sufficient sensor data and labels, which hinders the advancement i… ▽ More Focusing on the strength of 4D (4-Dimensional) radar, research about robust 3D object detection networks in adverse weather conditions has gained attention. To train such networks, datasets that contain large amounts of 4D radar data and ground truth labels are essential. However, the existing 4D radar datasets (e.g., K-Radar) lack sufficient sensor data and labels, which hinders the advancement in this research domain. Furthermore, enlarging the 4D radar datasets requires a time-consuming and expensive manual labeling process. To address these issues, we propose the auto-labeling method of 4D radar tensor (4DRT) in the K-Radar dataset. The proposed method initially trains a LiDAR-based object detection network (LODN) using calibrated LiDAR point cloud (LPC). The trained LODN then automatically generates ground truth labels (i.e., auto-labels, ALs) of the K-Radar train dataset without human intervention. The generated ALs are used to train the 4D radar-based object detection network (4DRODN), Radar Tensor Network with Height (RTNH). The experimental results demonstrate that RTNH trained with ALs has achieved a similar detection performance to the original RTNH which is trained with manually annotated ground truth labels, thereby verifying the effectiveness of the proposed auto-labeling method. All relevant codes will be soon available at the following GitHub project: https://github.com/kaist-avelab/K-Radar △ Less

Submitted 13 May, 2024; originally announced July 2024.

Comments: Accept at IEEE IVS 2024

arXiv:2407.04211 [pdf, other]

TimeLDM: Latent Diffusion Model for Unconditional Time Series Generation

Authors: Jian Qian, Miao Sun, Sifan Zhou, Biao Wan, Minhao Li, Patrick Chiang

Abstract: Time series generation is a crucial research topic in the area of deep learning, which can be used for data augmentation, imputing missing values, and forecasting. Currently, latent diffusion models are ascending to the forefront of generative modeling for many important data representations. Being the most pivotal in the computer vision domain, latent diffusion models have also recently attracted… ▽ More Time series generation is a crucial research topic in the area of deep learning, which can be used for data augmentation, imputing missing values, and forecasting. Currently, latent diffusion models are ascending to the forefront of generative modeling for many important data representations. Being the most pivotal in the computer vision domain, latent diffusion models have also recently attracted interest in other communities, including NLP, Speech, and Geometric Space. In this work, we propose TimeLDM, a novel latent diffusion model for high-quality time series generation. TimeLDM is composed of a variational autoencoder that encodes time series into an informative and smoothed latent content and a latent diffusion model operating in the latent space to generate latent information. We evaluate the ability of our method to generate synthetic time series with simulated and realistic datasets, benchmark the performance against existing state-of-the-art methods. Qualitatively and quantitatively, we find that the proposed TimeLDM persistently delivers high-quality generated time series. Sores from Context-FID and Discriminative indicate that TimeLDM consistently and significantly outperforms current state-of-the-art benchmarks with an average improvement of 3.4$\times$ and 3.8$\times$, respectively. Further studies demonstrate that our method presents better performance on different lengths of time series data generation. To the best of our knowledge, this is the first study to explore the potential of the latent diffusion model for unconditional time series generation and establish a new baseline for synthetic time series. △ Less

Submitted 4 July, 2024; originally announced July 2024.

arXiv:2407.03883 [pdf, other]

Protecting Deep Learning Model Copyrights with Adversarial Example-Free Reuse Detection

Authors: Xiaokun Luan, Xiyue Zhang, Jingyi Wang, Meng Sun

Abstract: Model reuse techniques can reduce the resource requirements for training high-performance deep neural networks (DNNs) by leveraging existing models. However, unauthorized reuse and replication of DNNs can lead to copyright infringement and economic loss to the model owner. This underscores the need to analyze the reuse relation between DNNs and develop copyright protection techniques to safeguard… ▽ More Model reuse techniques can reduce the resource requirements for training high-performance deep neural networks (DNNs) by leveraging existing models. However, unauthorized reuse and replication of DNNs can lead to copyright infringement and economic loss to the model owner. This underscores the need to analyze the reuse relation between DNNs and develop copyright protection techniques to safeguard intellectual property rights. Existing white-box testing-based approaches cannot address the common heterogeneous reuse case where the model architecture is changed, and DNN fingerprinting approaches heavily rely on generating adversarial examples with good transferability, which is known to be challenging in the black-box setting. To bridge the gap, we propose NFARD, a Neuron Functionality Analysis-based Reuse Detector, which only requires normal test samples to detect reuse relations by measuring the models' differences on a newly proposed model characterization, i.e., neuron functionality (NF). A set of NF-based distance metrics is designed to make NFARD applicable to both white-box and black-box settings. Moreover, we devise a linear transformation method to handle heterogeneous reuse cases by constructing the optimal projection matrix for dimension consistency, significantly extending the application scope of NFARD. To the best of our knowledge, this is the first adversarial example-free method that exploits neuron functionality for DNN copyright protection. As a side contribution, we constructed a reuse detection benchmark named Reuse Zoo that covers various practical reuse techniques and popular datasets. Extensive evaluations on this comprehensive benchmark show that NFARD achieves F1 scores of 0.984 and 1.0 for detecting reuse relationships in black-box and white-box settings, respectively, while generating test suites 2 ~ 99 times faster than previous methods. △ Less

Submitted 4 July, 2024; originally announced July 2024.

Comments: 12 pages, 5 figures

arXiv:2407.02277 [pdf, other]

MelodyT5: A Unified Score-to-Score Transformer for Symbolic Music Processing

Authors: Shangda Wu, Yashan Wang, Xiaobing Li, Feng Yu, Maosong Sun

Abstract: In the domain of symbolic music research, the progress of developing scalable systems has been notably hindered by the scarcity of available training data and the demand for models tailored to specific tasks. To address these issues, we propose MelodyT5, a novel unified framework that leverages an encoder-decoder architecture tailored for symbolic music processing in ABC notation. This framework c… ▽ More In the domain of symbolic music research, the progress of developing scalable systems has been notably hindered by the scarcity of available training data and the demand for models tailored to specific tasks. To address these issues, we propose MelodyT5, a novel unified framework that leverages an encoder-decoder architecture tailored for symbolic music processing in ABC notation. This framework challenges the conventional task-specific approach, considering various symbolic music tasks as score-to-score transformations. Consequently, it integrates seven melody-centric tasks, from generation to harmonization and segmentation, within a single model. Pre-trained on MelodyHub, a newly curated collection featuring over 261K unique melodies encoded in ABC notation and encompassing more than one million task instances, MelodyT5 demonstrates superior performance in symbolic music processing via multi-task transfer learning. Our findings highlight the efficacy of multi-task transfer learning in symbolic music processing, particularly for data-scarce tasks, challenging the prevailing task-specific paradigms and offering a comprehensive dataset and framework for future explorations in this domain. △ Less

Submitted 3 July, 2024; v1 submitted 2 July, 2024; originally announced July 2024.

Comments: 9 pages, 2 figures, 3 tables, accepted by ISMIR 2024

arXiv:2407.00791 [pdf, other]

inlabru: software for fitting latent Gaussian models with non-linear predictors

Authors: Finn Lindgren, Fabian Bachl, Janine Illian, Man Ho Suen, Håvard Rue, Andrew E. Seaton

Abstract: The integrated nested Laplace approximation (INLA) method has become a popular approach for computationally efficient approximate Bayesian computation. In particular, by leveraging sparsity in random effect precision matrices, INLA is commonly used in spatial and spatio-temporal applications. However, the speed of INLA comes at the cost of restricting the user to the family of latent Gaussian mode… ▽ More The integrated nested Laplace approximation (INLA) method has become a popular approach for computationally efficient approximate Bayesian computation. In particular, by leveraging sparsity in random effect precision matrices, INLA is commonly used in spatial and spatio-temporal applications. However, the speed of INLA comes at the cost of restricting the user to the family of latent Gaussian models and the likelihoods currently implemented in {INLA}, the main software implementation of the INLA methodology. {inlabru} is a software package that extends the types of models that can be fitted using INLA by allowing the latent predictor to be non-linear in its parameters, moving beyond the additive linear predictor framework to allow more complex functional relationships. For inference it uses an approximate iterative method based on the first-order Taylor expansion of the non-linear predictor, fitting the model using INLA for each linearised model configuration. {inlabru} automates much of the workflow required to fit models using {R-INLA}, simplifying the process for users to specify, fit and predict from models. There is additional support for fitting joint likelihood models by building each likelihood individually. {inlabru} also supports the direct use of spatial data structures, such as those implemented in the {sf} and {terra} packages. In this paper we outline the statistical theory, model structure and basic syntax required for users to understand and develop their own models using {inlabru}. We evaluate the approximate inference method using a Bayesian method checking approach. We provide three examples modelling simulated spatial data that demonstrate the benefits of the additional flexibility provided by {inlabru}. △ Less

Submitted 30 June, 2024; originally announced July 2024.

MSC Class: 62-04

arXiv:2407.00200 [pdf, other]

Formation of Wind-Fed Black Hole High-mass X-ray Binaries: The Role of Roche-lobe-Overflow Post Black-Hole Formation

Authors: Zepei Xing, Tassos Fragos, Emmanouil Zapartas, Tom M. Kwan, Lixin Dai, Ilya Mandel, Matthias U. Kruckow, Max Briel, Jeff J. Andrews, Simone S. Bavera, Seth Gossage, Konstantinos Kovlakas, Kyle A. Rocha, Meng Sun, Philipp M. Srivastava

Abstract: The three dynamically confirmed wind-fed black hole high-mass X-ray binaries (BH-HMXBs) are suggested to all contain a highly spinning black hole (BH). However, based on the theories of efficient angular momentum transport inside the stars, we expect that the first-born BHs in binary systems should have low spins, which is consistent with gravitational-wave observations. As a result, the origin of… ▽ More The three dynamically confirmed wind-fed black hole high-mass X-ray binaries (BH-HMXBs) are suggested to all contain a highly spinning black hole (BH). However, based on the theories of efficient angular momentum transport inside the stars, we expect that the first-born BHs in binary systems should have low spins, which is consistent with gravitational-wave observations. As a result, the origin of the high BH spins measured in wind-fed BH-HMXBs remains a mystery. In this paper, we conduct a binary population synthesis study on wind-fed BH-HMXBs at solar metallicity with the use of the newly developed code POSYDON, considering three scenarios for BH accretion: Eddington-limited, moderately super-Eddington, and fully conservative accretion. Taking into account the conditions for accretion-disk formation, we find that regardless of the accretion model, these systems are more likely to have already experienced a phase of Roche-lobe overflow after the BH formation. To account for the extreme BH spins, highly conservative accretion onto BHs is required, when assuming the accreted material carries the specific angular momentum at the innermost stable orbit. Besides, in our simulations we found that the systems with donor stars within the mass range of $10-20\,M_{\odot}$ are prevalent, posing a challenge in explaining simultaneously all observed properties of the BH-HMXB in our Galaxy, Cygnus X-1, and potentially hinting that the accretion efficiency onto non-degenerate stars, before the formation of the BH, is also more conservative than assumed in our simulations. △ Less

Submitted 28 June, 2024; originally announced July 2024.

Comments: Submitted to A&A. Comments are welcome!

arXiv:2406.19661 [pdf, ps, other]

Heavy quarkonium spectral function in the spinning black hole background

Authors: Zhou-Run Zhu, Manman Sun, Rui Zhou, Zhuang Ma, Jinzhong Han

Abstract: In this paper, we study the dissociation of heavy quarkonium in the spinning black hole background. Specifically, we analyze the spectral function of charmonium and bottomonium in the spinning black hole background and examine how the angular momentum affects the dissociation of $J/Ψ$ and $Υ(1S)$. From the results, we find that the angular momentum decreases the peak height and expands the peak wi… ▽ More In this paper, we study the dissociation of heavy quarkonium in the spinning black hole background. Specifically, we analyze the spectral function of charmonium and bottomonium in the spinning black hole background and examine how the angular momentum affects the dissociation of $J/Ψ$ and $Υ(1S)$. From the results, we find that the angular momentum decreases the peak height and expands the peak width of the spectral function which indicates it enhances the dissociation of heavy vector mesons. Moreover, the angular momentum has a stronger dissociation effect in the transverse orientation, revealing the directional influence of angular momentum. △ Less

Submitted 28 June, 2024; originally announced June 2024.

Comments: 12 pages, 2 figures

arXiv:2406.19043 [pdf]

CMRxRecon2024: A Multi-Modality, Multi-View K-Space Dataset Boosting Universal Machine Learning for Accelerated Cardiac MRI

Authors: Zi Wang, Fanwen Wang, Chen Qin, Jun Lyu, Ouyang Cheng, Shuo Wang, Yan Li, Mengyao Yu, Haoyu Zhang, Kunyuan Guo, Zhang Shi, Qirong Li, Ziqiang Xu, Yajing Zhang, Hao Li, Sha Hua, Binghua Chen, Longyu Sun, Mengting Sun, Qin Li, Ying-Hua Chu, Wenjia Bai, Jing Qin, Xiahai Zhuang, Claudia Prieto , et al. (7 additional authors not shown)

Abstract: Cardiac magnetic resonance imaging (MRI) has emerged as a clinically gold-standard technique for diagnosing cardiac diseases, thanks to its ability to provide diverse information with multiple modalities and anatomical views. Accelerated cardiac MRI is highly expected to achieve time-efficient and patient-friendly imaging, and then advanced image reconstruction approaches are required to recover h… ▽ More Cardiac magnetic resonance imaging (MRI) has emerged as a clinically gold-standard technique for diagnosing cardiac diseases, thanks to its ability to provide diverse information with multiple modalities and anatomical views. Accelerated cardiac MRI is highly expected to achieve time-efficient and patient-friendly imaging, and then advanced image reconstruction approaches are required to recover high-quality, clinically interpretable images from undersampled measurements. However, the lack of publicly available cardiac MRI k-space dataset in terms of both quantity and diversity has severely hindered substantial technological progress, particularly for data-driven artificial intelligence. Here, we provide a standardized, diverse, and high-quality CMRxRecon2024 dataset to facilitate the technical development, fair evaluation, and clinical transfer of cardiac MRI reconstruction approaches, towards promoting the universal frameworks that enable fast and robust reconstructions across different cardiac MRI protocols in clinical practice. To the best of our knowledge, the CMRxRecon2024 dataset is the largest and most diverse publicly available cardiac k-space dataset. It is acquired from 330 healthy volunteers, covering commonly used modalities, anatomical views, and acquisition trajectories in clinical cardiac MRI workflows. Besides, an open platform with tutorials, benchmarks, and data processing tools is provided to facilitate data usage, advanced method development, and fair performance evaluation. △ Less

Submitted 27 June, 2024; originally announced June 2024.

Comments: 19 pages, 3 figures, 2 tables

arXiv:2406.17561 [pdf, other]

Improving density matrix electronic structure method by deep learning

Authors: Zechen Tang, Nianlong Zou, He Li, Yuxiang Wang, Zilong Yuan, Honggeng Tao, Yang Li, Zezhou Chen, Boheng Zhao, Minghui Sun, Hong Jiang, Wenhui Duan, Yong Xu

Abstract: The combination of deep learning and ab initio materials calculations is emerging as a trending frontier of materials science research, with deep-learning density functional theory (DFT) electronic structure being particularly promising. In this work, we introduce a neural-network method for modeling the DFT density matrix, a fundamental yet previously unexplored quantity in deep-learning electron… ▽ More The combination of deep learning and ab initio materials calculations is emerging as a trending frontier of materials science research, with deep-learning density functional theory (DFT) electronic structure being particularly promising. In this work, we introduce a neural-network method for modeling the DFT density matrix, a fundamental yet previously unexplored quantity in deep-learning electronic structure. Utilizing an advanced neural network framework that leverages the nearsightedness and equivariance properties of the density matrix, the method demonstrates high accuracy and excellent generalizability in multiple example studies, as well as capability to precisely predict charge density and reproduce other electronic structure properties. Given the pivotal role of the density matrix in DFT as well as other computational methods, the current research introduces a novel approach to the deep-learning study of electronic structure properties, opening up new opportunities for deep-learning enhanced computational materials study. △ Less

Submitted 25 June, 2024; originally announced June 2024.

arXiv:2406.15963 [pdf, other]

Effectiveness of ChatGPT in explaining complex medical reports to patients

Authors: Mengxuan Sun, Ehud Reiter, Anne E Kiltie, George Ramsay, Lisa Duncan, Peter Murchie, Rosalind Adam

Abstract: Electronic health records contain detailed information about the medical condition of patients, but they are difficult for patients to understand even if they have access to them. We explore whether ChatGPT (GPT 4) can help explain multidisciplinary team (MDT) reports to colorectal and prostate cancer patients. These reports are written in dense medical language and assume clinical knowledge, so t… ▽ More Electronic health records contain detailed information about the medical condition of patients, but they are difficult for patients to understand even if they have access to them. We explore whether ChatGPT (GPT 4) can help explain multidisciplinary team (MDT) reports to colorectal and prostate cancer patients. These reports are written in dense medical language and assume clinical knowledge, so they are a good test of the ability of ChatGPT to explain complex medical reports to patients. We asked clinicians and lay people (not patients) to review explanations and responses of ChatGPT. We also ran three focus groups (including cancer patients, caregivers, computer scientists, and clinicians) to discuss output of ChatGPT. Our studies highlighted issues with inaccurate information, inappropriate language, limited personalization, AI distrust, and challenges integrating large language models (LLMs) into clinical workflow. These issues will need to be resolved before LLMs can be used to explain complex personal medical information to patients. △ Less

Submitted 22 June, 2024; originally announced June 2024.

Comments: under review

arXiv:2406.15718 [pdf, other]

Beyond the Turn-Based Game: Enabling Real-Time Conversations with Duplex Models

Authors: Xinrong Zhang, Yingfa Chen, Shengding Hu, Xu Han, Zihang Xu, Yuanwei Xu, Weilin Zhao, Maosong Sun, Zhiyuan Liu

Abstract: As large language models (LLMs) increasingly permeate daily lives, there is a growing demand for real-time interactions that mirror human conversations. Traditional turn-based chat systems driven by LLMs prevent users from verbally interacting with the system while it is generating responses. To overcome these limitations, we adapt existing LLMs to \textit{duplex models} so that these LLMs can lis… ▽ More As large language models (LLMs) increasingly permeate daily lives, there is a growing demand for real-time interactions that mirror human conversations. Traditional turn-based chat systems driven by LLMs prevent users from verbally interacting with the system while it is generating responses. To overcome these limitations, we adapt existing LLMs to \textit{duplex models} so that these LLMs can listen for users while generating output and dynamically adjust themselves to provide users with instant feedback. % such as in response to interruptions. Specifically, we divide the queries and responses of conversations into several time slices and then adopt a time-division-multiplexing (TDM) encoding-decoding strategy to pseudo-simultaneously process these slices. Furthermore, to make LLMs proficient enough to handle real-time conversations, we build a fine-tuning dataset consisting of alternating time slices of queries and responses as well as covering typical feedback types in instantaneous interactions. Our experiments show that although the queries and responses of conversations are segmented into incomplete slices for processing, LLMs can preserve their original performance on standard benchmarks with a few fine-tuning steps on our dataset. Automatic and human evaluation indicate that duplex models make user-AI interactions more natural and human-like, and greatly improve user satisfaction compared to vanilla LLMs. Our duplex model and dataset will be released. △ Less

Submitted 21 June, 2024; originally announced June 2024.

arXiv:2406.14219 [pdf, other]

Proving Olympiad Algebraic Inequalities without Human Demonstrations

Authors: Chenrui Wei, Mengzhou Sun, Wei Wang

Abstract: Solving Olympiad-level mathematical problems represents a significant advancement in machine intelligence and automated reasoning. Current machine learning methods, however, struggle to solve Olympiad-level problems beyond Euclidean plane geometry due to a lack of large-scale, high-quality datasets. The challenge is even greater in algebraic systems, which involve infinite reasoning spaces within… ▽ More Solving Olympiad-level mathematical problems represents a significant advancement in machine intelligence and automated reasoning. Current machine learning methods, however, struggle to solve Olympiad-level problems beyond Euclidean plane geometry due to a lack of large-scale, high-quality datasets. The challenge is even greater in algebraic systems, which involve infinite reasoning spaces within finite conditions. To address these issues, we propose AIPS, an Algebraic Inequality Proving System capable of autonomously generating complex inequality theorems and effectively solving Olympiad-level inequality problems without requiring human demonstrations. During proof search in a mixed reasoning manner, a value curriculum learning strategy on generated datasets is implemented to improve proving performance, demonstrating strong mathematical intuitions. On a test set of 20 International Mathematical Olympiad-level inequality problems, AIPS successfully solved 10, outperforming state-of-the-art methods. Furthermore, AIPS automatically generated a vast array of non-trivial theorems without human intervention, some of which have been evaluated by professional contestants and deemed to reach the level of the International Mathematical Olympiad. Notably, one theorem was selected as a competition problem in a major city 2024 Mathematical Olympiad. △ Less

Submitted 20 June, 2024; originally announced June 2024.

Comments: 36 pages, 32 figures, 2 tables

MSC Class: 03B35; 68T05; 68T20 ACM Class: I.2.3; I.2.6; I.2.8

arXiv:2406.12646 [pdf, other]

An Empirical Study on the Fairness of Foundation Models for Multi-Organ Image Segmentation

Authors: Qin Li, Yizhe Zhang, Yan Li, Jun Lyu, Meng Liu, Longyu Sun, Mengting Sun, Qirong Li, Wenyue Mao, Xinran Wu, Yajing Zhang, Yinghua Chu, Shuo Wang, Chengyan Wang

Abstract: The segmentation foundation model, e.g., Segment Anything Model (SAM), has attracted increasing interest in the medical image community. Early pioneering studies primarily concentrated on assessing and improving SAM's performance from the perspectives of overall accuracy and efficiency, yet little attention was given to the fairness considerations. This oversight raises questions about the potenti… ▽ More The segmentation foundation model, e.g., Segment Anything Model (SAM), has attracted increasing interest in the medical image community. Early pioneering studies primarily concentrated on assessing and improving SAM's performance from the perspectives of overall accuracy and efficiency, yet little attention was given to the fairness considerations. This oversight raises questions about the potential for performance biases that could mirror those found in task-specific deep learning models like nnU-Net. In this paper, we explored the fairness dilemma concerning large segmentation foundation models. We prospectively curate a benchmark dataset of 3D MRI and CT scans of the organs including liver, kidney, spleen, lung and aorta from a total of 1056 healthy subjects with expert segmentations. Crucially, we document demographic details such as gender, age, and body mass index (BMI) for each subject to facilitate a nuanced fairness analysis. We test state-of-the-art foundation models for medical image segmentation, including the original SAM, medical SAM and SAT models, to evaluate segmentation efficacy across different demographic groups and identify disparities. Our comprehensive analysis, which accounts for various confounding factors, reveals significant fairness concerns within these foundational models. Moreover, our findings highlight not only disparities in overall segmentation metrics, such as the Dice Similarity Coefficient but also significant variations in the spatial distribution of segmentation errors, offering empirical evidence of the nuanced challenges in ensuring fairness in medical image segmentation. △ Less

Submitted 18 June, 2024; originally announced June 2024.

Comments: Accepted to MICCAI-2024

arXiv:2406.12349 [pdf, other]

Effective Generation of Feasible Solutions for Integer Programming via Guided Diffusion

Authors: Hao Zeng, Jiaqi Wang, Avirup Das, Junying He, Kunpeng Han, Haoyuan Hu, Mingfei Sun

Abstract: Feasible solutions are crucial for Integer Programming (IP) since they can substantially speed up the solving process. In many applications, similar IP instances often exhibit similar structures and shared solution distributions, which can be potentially modeled by deep learning methods. Unfortunately, existing deep-learning-based algorithms, such as Neural Diving and Predict-and-search framework,… ▽ More Feasible solutions are crucial for Integer Programming (IP) since they can substantially speed up the solving process. In many applications, similar IP instances often exhibit similar structures and shared solution distributions, which can be potentially modeled by deep learning methods. Unfortunately, existing deep-learning-based algorithms, such as Neural Diving and Predict-and-search framework, are limited to generating only partial feasible solutions, and they must rely on solvers like SCIP and Gurobi to complete the solutions for a given IP problem. In this paper, we propose a novel framework that generates complete feasible solutions end-to-end. Our framework leverages contrastive learning to characterize the relationship between IP instances and solutions, and learns latent embeddings for both IP instances and their solutions. Further, the framework employs diffusion models to learn the distribution of solution embeddings conditioned on IP representations, with a dedicated guided sampling strategy that accounts for both constraints and objectives. We empirically evaluate our framework on four typical datasets of IP problems, and show that it effectively generates complete feasible solutions with a high probability (> 89.7 \%) without the reliance of Solvers and the quality of solutions is comparable to the best heuristic solutions from Gurobi. Furthermore, by integrating our method's sampled partial solutions with the CompleteSol heuristic from SCIP, the resulting feasible solutions outperform those from state-of-the-art methods across all datasets, exhibiting a 3.7 to 33.7\% improvement in the gap to optimal values, and maintaining a feasible ratio of over 99.7\% for all datasets. △ Less

Submitted 18 June, 2024; originally announced June 2024.

Comments: Accepted to SIGKDD 2024

arXiv:2406.11933 [pdf, other]

Scaling Efficient Masked Autoencoder Learning on Large Remote Sensing Dataset

Authors: Fengxiang Wang, Hongzhen Wang, Di Wang, Zonghao Guo, Zhenyu Zhong, Long Lan, Jing Zhang, Zhiyuan Liu, Maosong Sun

Abstract: Masked Image Modeling (MIM) has emerged as a pivotal approach for developing foundational visual models in the field of remote sensing (RS). However, current RS datasets are limited in volume and diversity, which significantly constrains the capacity of MIM methods to learn generalizable representations. In this study, we introduce \textbf{RS-4M}, a large-scale dataset designed to enable highly ef… ▽ More Masked Image Modeling (MIM) has emerged as a pivotal approach for developing foundational visual models in the field of remote sensing (RS). However, current RS datasets are limited in volume and diversity, which significantly constrains the capacity of MIM methods to learn generalizable representations. In this study, we introduce \textbf{RS-4M}, a large-scale dataset designed to enable highly efficient MIM training on RS images. RS-4M comprises 4 million optical images encompassing abundant and fine-grained RS visual tasks, including object-level detection and pixel-level segmentation. Compared to natural images, RS images often contain massive redundant background pixels, which limits the training efficiency of the conventional MIM models. To address this, we propose an efficient MIM method, termed \textbf{SelectiveMAE}, which dynamically encodes and reconstructs a subset of patch tokens selected based on their semantic richness. SelectiveMAE roots in a progressive semantic token selection module, which evolves from reconstructing semantically analogical tokens to encoding complementary semantic dependencies. This approach transforms conventional MIM training into a progressive feature learning process, enabling SelectiveMAE to efficiently learn robust representations of RS images. Extensive experiments show that SelectiveMAE significantly boosts training efficiency by 2.2-2.7 times and enhances the classification, detection, and segmentation performance of the baseline MIM model.The dataset, source code, and trained models will be released. △ Less

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.11721 [pdf, other]

Zero-Shot Generalization during Instruction Tuning: Insights from Similarity and Granularity

Authors: Bingxiang He, Ning Ding, Cheng Qian, Jia Deng, Ganqu Cui, Lifan Yuan, Huan-ang Gao, Huimin Chen, Zhiyuan Liu, Maosong Sun

Abstract: Understanding alignment techniques begins with comprehending zero-shot generalization brought by instruction tuning, but little of the mechanism has been understood. Existing work has largely been confined to the task level, without considering that tasks are artificially defined and, to LLMs, merely consist of tokens and representations. This line of research has been limited to examining transfe… ▽ More Understanding alignment techniques begins with comprehending zero-shot generalization brought by instruction tuning, but little of the mechanism has been understood. Existing work has largely been confined to the task level, without considering that tasks are artificially defined and, to LLMs, merely consist of tokens and representations. This line of research has been limited to examining transfer between tasks from a task-pair perspective, with few studies focusing on understanding zero-shot generalization from the perspective of the data itself. To bridge this gap, we first demonstrate through multiple metrics that zero-shot generalization during instruction tuning happens very early. Next, we investigate the facilitation of zero-shot generalization from both data similarity and granularity perspectives, confirming that encountering highly similar and fine-grained training data earlier during instruction tuning, without the constraints of defined "tasks", enables better generalization. Finally, we propose a more grounded training data arrangement method, Test-centric Multi-turn Arrangement, and show its effectiveness in promoting continual learning and further loss reduction. For the first time, we show that zero-shot generalization during instruction tuning is a form of similarity-based generalization between training and test data at the instance level. We hope our analysis will advance the understanding of zero-shot generalization during instruction tuning and contribute to the development of more aligned LLMs. Our code is released at https://github.com/HBX-hbx/dynamics_of_zero-shot_generalization. △ Less

Submitted 17 June, 2024; originally announced June 2024.

Comments: 33 pages, 14 figures

arXiv:2406.11583 [pdf]

Where there's a will there's a way: ChatGPT is used more for science in countries where it is prohibited

Authors: Honglin Bao, Mengyi Sun, Misha Teplitskiy

Abstract: Regulating AI is a key societal challenge, but which regulation methods are effective is unclear. This study measures the effectiveness of restricting AI services geographically, focusing on ChatGPT. OpenAI restricts ChatGPT access in several countries, including China and Russia. If restrictions are effective, ChatGPT use should be minimal in these countries. We measured use with a classifier bas… ▽ More Regulating AI is a key societal challenge, but which regulation methods are effective is unclear. This study measures the effectiveness of restricting AI services geographically, focusing on ChatGPT. OpenAI restricts ChatGPT access in several countries, including China and Russia. If restrictions are effective, ChatGPT use should be minimal in these countries. We measured use with a classifier based on distinctive word usage found in early versions of ChatGPT, e.g. "delve." We trained the classifier on pre- and post-ChatGPT "polished" abstracts and found it outperformed GPTZero and ZeroGPT on validation sets, including papers with self-reported AI use. Applying the classifier to preprints from Arxiv, BioRxiv, and MedRxiv showed ChatGPT was used in about 12.6% of preprints by August 2023, with 7.7% higher usage in restricted countries. The gap appeared before China's first major legal LLM became widely available. To test the possibility that, due to high demand, use in restricted countries would have been even higher without restrictions, we compared Asian countries with high expected demand (where English is not an official language) and found that use was higher in those with restrictions. ChatGPT use was correlated with higher views and downloads, but not citations or journal placement. Overall, restricting ChatGPT geographically has proven ineffective in science and possibly other domains, likely due to widespread workarounds. △ Less

Submitted 27 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

Comments: Three figures, two tables, 21 pages, and a 19-page appendix

arXiv:2406.11317 [pdf, other]

GUICourse: From General Vision Language Models to Versatile GUI Agents

Authors: Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guirong Chen, Yupeng Huo, Yuan Yao, Yankai Lin, Zhiyuan Liu, Maosong Sun

Abstract: Utilizing Graphic User Interface (GUI) for human-computer interaction is essential for accessing a wide range of digital tools. Recent advancements in Vision Language Models (VLMs) highlight the compelling potential to develop versatile agents to help humans finish GUI navigation tasks. However, current VLMs are challenged in terms of fundamental abilities (OCR and grounding) and GUI knowledge (th… ▽ More Utilizing Graphic User Interface (GUI) for human-computer interaction is essential for accessing a wide range of digital tools. Recent advancements in Vision Language Models (VLMs) highlight the compelling potential to develop versatile agents to help humans finish GUI navigation tasks. However, current VLMs are challenged in terms of fundamental abilities (OCR and grounding) and GUI knowledge (the functions and control methods of GUI elements), preventing them from becoming practical GUI agents. To solve these challenges, we contribute GUICourse, a suite of datasets to train visual-based GUI agents from general VLMs. First, we introduce the GUIEnv dataset to strengthen the OCR and grounding capabilities of VLMs. Then, we introduce the GUIAct and GUIChat datasets to enrich their knowledge of GUI components and interactions. Experiments demonstrate that our GUI agents have better performance on common GUI tasks than their baseline VLMs. Even the small-size GUI agent (with 3.1B parameters) can still work well on single-step and multi-step GUI tasks. Finally, we analyze the different varieties in the training stage of this agent by ablation study. Our source codes and datasets are released at https://github.com/yiye3/GUICourse. △ Less

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.11309 [pdf, other]

BaFTA: Backprop-Free Test-Time Adaptation For Zero-Shot Vision-Language Models

Authors: Xuefeng Hu, Ke Zhang, Min Sun, Albert Chen, Cheng-Hao Kuo, Ram Nevatia

Abstract: Large-scale pretrained vision-language models like CLIP have demonstrated remarkable zero-shot image classification capabilities across diverse domains. To enhance CLIP's performance while preserving the zero-shot paradigm, various test-time prompt tuning methods have been introduced to refine class embeddings through unsupervised learning objectives during inference. However, these methods often… ▽ More Large-scale pretrained vision-language models like CLIP have demonstrated remarkable zero-shot image classification capabilities across diverse domains. To enhance CLIP's performance while preserving the zero-shot paradigm, various test-time prompt tuning methods have been introduced to refine class embeddings through unsupervised learning objectives during inference. However, these methods often encounter challenges in selecting appropriate learning rates to prevent collapsed training in the absence of validation data during test-time adaptation. In this study, we propose a novel backpropagation-free algorithm BaFTA for test-time adaptation of vision-language models. Instead of fine-tuning text prompts to refine class embeddings, our approach directly estimates class centroids using online clustering within a projected embedding space that aligns text and visual embeddings. We dynamically aggregate predictions from both estimated and original class embeddings, as well as from distinct augmented views, by assessing the reliability of each prediction using Rényi Entropy. Through extensive experiments, we demonstrate that BaFTA consistently outperforms state-of-the-art test-time adaptation methods in both effectiveness and efficiency. △ Less

Submitted 18 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

Comments: Preprint updated from our earlier manuscript submitted to ICLR 2024 (https://openreview.net/forum?id=KNtcoAM5Gy)

arXiv:2406.08903 [pdf, other]

Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models

Authors: Bowen Ping, Shuo Wang, Hanqing Wang, Xu Han, Yuzhuang Xu, Yukun Yan, Yun Chen, Baobao Chang, Zhiyuan Liu, Maosong Sun

Abstract: Fine-tuning is a crucial process for adapting large language models (LLMs) to diverse applications. In certain scenarios, such as multi-tenant serving, deploying multiple LLMs becomes necessary to meet complex demands. Recent studies suggest decomposing a fine-tuned LLM into a base model and corresponding delta weights, which are then compressed using low-rank or low-bit approaches to reduce costs… ▽ More Fine-tuning is a crucial process for adapting large language models (LLMs) to diverse applications. In certain scenarios, such as multi-tenant serving, deploying multiple LLMs becomes necessary to meet complex demands. Recent studies suggest decomposing a fine-tuned LLM into a base model and corresponding delta weights, which are then compressed using low-rank or low-bit approaches to reduce costs. In this work, we observe that existing low-rank and low-bit compression methods can significantly harm the model performance for task-specific fine-tuned LLMs (e.g., WizardMath for math problems). Motivated by the long-tail distribution of singular values in the delta weights, we propose a delta quantization approach using mixed-precision. This method employs higher-bit representation for singular vectors corresponding to larger singular values. We evaluate our approach on various fine-tuned LLMs, including math LLMs, code LLMs, chat LLMs, and even VLMs. Experimental results demonstrate that our approach performs comparably to full fine-tuned LLMs, surpassing both low-rank and low-bit baselines by a considerable margin. Additionally, we show that our method is compatible with various backbone LLMs, such as Llama-2, Llama-3, and Mistral, highlighting its generalizability. △ Less

Submitted 13 June, 2024; originally announced June 2024.

Comments: 12 pages

arXiv:2406.07597 [pdf, ps, other]

A Central Limit Theorem on Two-Sided Descents of Mallows Distributed Elements of Finite Coxeter Groups

Authors: Maxwell Sun

Abstract: The Mallows distribution is a non-uniform distribution, first introduced over permutations to study non-ranked data, in which permutations are weighted according to their length. It can be generalized to any Coxeter group, and we study the distribution of $\text{des}(w) + \text{des}(w^{-1})$ where $w$ is a Mallows distributed element of a finite irreducible Coxeter group. We show that the asymptot… ▽ More The Mallows distribution is a non-uniform distribution, first introduced over permutations to study non-ranked data, in which permutations are weighted according to their length. It can be generalized to any Coxeter group, and we study the distribution of $\text{des}(w) + \text{des}(w^{-1})$ where $w$ is a Mallows distributed element of a finite irreducible Coxeter group. We show that the asymptotic behavior of this statistic is Guassian. The proof uses a size-bias coupling with Stein's method. △ Less

Submitted 11 June, 2024; originally announced June 2024.

Comments: 49 pages, 1 figure. arXiv admin note: text overlap with arXiv:2005.09802 by other authors

arXiv:2406.07155 [pdf, other]

Scaling Large-Language-Model-based Multi-Agent Collaboration

Authors: Chen Qian, Zihao Xie, Yifei Wang, Wei Liu, Yufan Dang, Zhuoyun Du, Weize Chen, Cheng Yang, Zhiyuan Liu, Maosong Sun

Abstract: Pioneering advancements in large language model-powered agents have underscored the design pattern of multi-agent collaboration, demonstrating that collective intelligence can surpass the capabilities of each individual. Inspired by the neural scaling law, which posits that increasing neurons leads to emergent abilities, this study investigates whether a similar principle applies to increasing age… ▽ More Pioneering advancements in large language model-powered agents have underscored the design pattern of multi-agent collaboration, demonstrating that collective intelligence can surpass the capabilities of each individual. Inspired by the neural scaling law, which posits that increasing neurons leads to emergent abilities, this study investigates whether a similar principle applies to increasing agents in multi-agent collaboration. Technically, we propose multi-agent collaboration networks (MacNet), which utilize directed acyclic graphs to organize agents and streamline their interactive reasoning via topological ordering, with solutions derived from their dialogues. Extensive experiments show that MacNet consistently outperforms baseline models, enabling effective agent collaboration across various network topologies and supporting cooperation among more than a thousand agents. Notably, we observed a small-world collaboration phenomenon, where topologies resembling small-world properties achieved superior performance. Additionally, we identified a collaborative scaling law, indicating that normalized solution quality follows a logistic growth pattern as scaling agents, with collaborative emergence occurring much earlier than previously observed instances of neural emergence. The code and data will be available at https://github.com/OpenBMB/ChatDev. △ Less

Submitted 11 June, 2024; originally announced June 2024.

Comments: Work in progress; The code and data will be available at https://github.com/OpenBMB/ChatDev

arXiv:2406.05564 [pdf, other]

Automata Extraction from Transformers

Authors: Yihao Zhang, Zeming Wei, Meng Sun

Abstract: In modern machine (ML) learning systems, Transformer-based architectures have achieved milestone success across a broad spectrum of tasks, yet understanding their operational mechanisms remains an open problem. To improve the transparency of ML systems, automata extraction methods, which interpret stateful ML models as automata typically through formal languages, have proven effective for explaini… ▽ More In modern machine (ML) learning systems, Transformer-based architectures have achieved milestone success across a broad spectrum of tasks, yet understanding their operational mechanisms remains an open problem. To improve the transparency of ML systems, automata extraction methods, which interpret stateful ML models as automata typically through formal languages, have proven effective for explaining the mechanism of recurrent neural networks (RNNs). However, few works have been applied to this paradigm to Transformer models. In particular, understanding their processing of formal languages and identifying their limitations in this area remains unexplored. In this paper, we propose an automata extraction algorithm specifically designed for Transformer models. Treating the Transformer model as a black-box system, we track the model through the transformation process of their internal latent representations during their operations, and then use classical pedagogical approaches like L* algorithm to interpret them as deterministic finite-state automata (DFA). Overall, our study reveals how the Transformer model comprehends the structure of formal languages, which not only enhances the interpretability of the Transformer-based ML systems but also marks a crucial step toward a deeper understanding of how ML systems process formal languages. Code and data are available at https://github.com/Zhang-Yihao/Transfomer2DFA. △ Less

Submitted 8 June, 2024; originally announced June 2024.

arXiv:2406.03746 [pdf, other]

Efficient Knowledge Infusion via KG-LLM Alignment

Authors: Zhouyu Jiang, Ling Zhong, Mengshu Sun, Jun Xu, Rui Sun, Hui Cai, Shuhan Luo, Zhiqiang Zhang

Abstract: To tackle the problem of domain-specific knowledge scarcity within large language models (LLMs), knowledge graph-retrievalaugmented method has been proven to be an effective and efficient technique for knowledge infusion. However, existing approaches face two primary challenges: knowledge mismatch between public available knowledge graphs and the specific domain of the task at hand, and poor infor… ▽ More To tackle the problem of domain-specific knowledge scarcity within large language models (LLMs), knowledge graph-retrievalaugmented method has been proven to be an effective and efficient technique for knowledge infusion. However, existing approaches face two primary challenges: knowledge mismatch between public available knowledge graphs and the specific domain of the task at hand, and poor information compliance of LLMs with knowledge graphs. In this paper, we leverage a small set of labeled samples and a large-scale corpus to efficiently construct domain-specific knowledge graphs by an LLM, addressing the issue of knowledge mismatch. Additionally, we propose a three-stage KG-LLM alignment strategyto enhance the LLM's capability to utilize information from knowledge graphs. We conduct experiments with a limited-sample setting on two biomedical question-answering datasets, and the results demonstrate that our approach outperforms existing baselines. △ Less

Submitted 6 June, 2024; originally announced June 2024.

Comments: ACL2024 Findings

arXiv:2406.03488 [pdf, other]

Seq1F1B: Efficient Sequence-Level Pipeline Parallelism for Large Language Model Training

Authors: Ao Sun, Weilin Zhao, Xu Han, Cheng Yang, Zhiyuan Liu, Chuan Shi, Maosong Sun

Abstract: The emergence of large language models (LLMs) relies heavily on distributed training strategies, among which pipeline parallelism plays a crucial role. As LLMs' training sequence length extends to 32k or even 128k, the current pipeline parallel methods face severe bottlenecks, including high memory footprints and substantial pipeline bubbles, greatly hindering model scalability and training throug… ▽ More The emergence of large language models (LLMs) relies heavily on distributed training strategies, among which pipeline parallelism plays a crucial role. As LLMs' training sequence length extends to 32k or even 128k, the current pipeline parallel methods face severe bottlenecks, including high memory footprints and substantial pipeline bubbles, greatly hindering model scalability and training throughput. To enhance memory efficiency and training throughput, in this work, we introduce an efficient sequence-level one-forward-one-backward (1F1B) pipeline scheduling method tailored for training LLMs on long sequences named Seq1F1B. Seq1F1B decomposes batch-level schedulable units into finer sequence-level units, reducing bubble size and memory footprint. Considering that Seq1F1B may produce slight extra bubbles if sequences are split evenly, we design a computation-wise strategy to partition input sequences and mitigate this side effect. Compared to competitive pipeline baseline methods such as Megatron 1F1B pipeline parallelism, our method achieves higher training throughput with less memory footprint. Notably, Seq1F1B efficiently trains a LLM with 30B parameters on sequences up to 64k using 64 NVIDIA A100 GPUs without recomputation strategies, a feat unachievable with existing methods. Our source code is based on Megatron-LM, and now is avaiable at: https://github.com/MayDomine/Seq1F1B.git. △ Less

Submitted 6 June, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

Comments: 12 pages, 4 figures, 6 tables

arXiv:2406.02370 [pdf, other]

Query-based Semantic Gaussian Field for Scene Representation in Reinforcement Learning

Authors: Jiaxu Wang, Ziyi Zhang, Qiang Zhang, Jia Li, Jingkai Sun, Mingyuan Sun, Junhao He, Renjing Xu

Abstract: Latent scene representation plays a significant role in training reinforcement learning (RL) agents. To obtain good latent vectors describing the scenes, recent works incorporate the 3D-aware latent-conditioned NeRF pipeline into scene representation learning. However, these NeRF-related methods struggle to perceive 3D structural information due to the inefficient dense sampling in volumetric rend… ▽ More Latent scene representation plays a significant role in training reinforcement learning (RL) agents. To obtain good latent vectors describing the scenes, recent works incorporate the 3D-aware latent-conditioned NeRF pipeline into scene representation learning. However, these NeRF-related methods struggle to perceive 3D structural information due to the inefficient dense sampling in volumetric rendering. Moreover, they lack fine-grained semantic information included in their scene representation vectors because they evenly consider free and occupied spaces. Both of them can destroy the performance of downstream RL tasks. To address the above challenges, we propose a novel framework that adopts the efficient 3D Gaussian Splatting (3DGS) to learn 3D scene representation for the first time. In brief, we present the Query-based Generalizable 3DGS to bridge the 3DGS technique and scene representations with more geometrical awareness than those in NeRFs. Moreover, we present the Hierarchical Semantics Encoding to ground the fine-grained semantic features to 3D Gaussians and further distilled to the scene representation vectors. We conduct extensive experiments on two RL platforms including Maniskill2 and Robomimic across 10 different tasks. The results show that our method outperforms the other 5 baselines by a large margin. We achieve the best success rates on 8 tasks and the second-best on the other two tasks. △ Less

Submitted 9 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

arXiv:2406.02350 [pdf, other]

LlamaCare: A Large Medical Language Model for Enhancing Healthcare Knowledge Sharing

Authors: Maojun Sun

Abstract: Large language models (LLMs) have shown amazing capabilities in knowledge memorization and the present. However, when it comes to domain-specific knowledge and downstream tasks like medical, general LLMs are often unable to give precise answers. In addition, when people want LLMs to answer classification questions, they usually go through instruction tuning first. However, LLMs do not always give… ▽ More Large language models (LLMs) have shown amazing capabilities in knowledge memorization and the present. However, when it comes to domain-specific knowledge and downstream tasks like medical, general LLMs are often unable to give precise answers. In addition, when people want LLMs to answer classification questions, they usually go through instruction tuning first. However, LLMs do not always give a direct index of the categorization after instruction tuning. In this paper, we proposed LlamaCare, a fine-tuned medical language model, and Extended Classification Integration(ECI), a module to handle classification problems of LLMs. Our contributions are : (i) We fine-tuned a large language model of medical knowledge with very low carbon emissions and achieved similar performance with ChatGPT by a 24G GPU. (ii) We solved the problem of redundant categorical answers and improved the performance of LLMs by proposing a new module called Extended Classification Integration. (iii) We released our processed data for one-shot and few-shot training for some benchmarks such as PubMedQA and USMLE 1-3 step. Our method achieves a close performance comparable to some state-of-the-art models with the same quantity of parameters on benchmarks, while being more environmentally friendly by using less GPU computation time. Our models, codes, and datasets can be found at \url{https://github.com/Stephen-SMJ/LLamaCare}. △ Less

Submitted 5 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

arXiv:2406.01059 [pdf, other]

VIP: Versatile Image Outpainting Empowered by Multimodal Large Language Model

Authors: Jinze Yang, Haoran Wang, Zining Zhu, Chenglong Liu, Meng Wymond Wu, Zeke Xie, Zhong Ji, Jungong Han, Mingming Sun

Abstract: In this paper, we focus on resolving the problem of image outpainting, which aims to extrapolate the surrounding parts given the center contents of an image. Although recent works have achieved promising performance, the lack of versatility and customization hinders their practical applications in broader scenarios. Therefore, this work presents a novel image outpainting framework that is capable… ▽ More In this paper, we focus on resolving the problem of image outpainting, which aims to extrapolate the surrounding parts given the center contents of an image. Although recent works have achieved promising performance, the lack of versatility and customization hinders their practical applications in broader scenarios. Therefore, this work presents a novel image outpainting framework that is capable of customizing the results according to the requirement of users. First of all, we take advantage of a Multimodal Large Language Model (MLLM) that automatically extracts and organizes the corresponding textual descriptions of the masked and unmasked part of a given image. Accordingly, the obtained text prompts are introduced to endow our model with the capacity to customize the outpainting results. In addition, a special Cross-Attention module, namely Center-Total-Surrounding (CTS), is elaborately designed to enhance further the the interaction between specific space regions of the image and corresponding parts of the text prompts. Note that unlike most existing methods, our approach is very resource-efficient since it is just slightly fine-tuned on the off-the-shelf stable diffusion (SD) model rather than being trained from scratch. Finally, the experimental results on three commonly used datasets, i.e. Scenery, Building, and WikiArt, demonstrate our model significantly surpasses the SoTA methods. Moreover, versatile outpainting results are listed to show its customized ability. △ Less

Submitted 3 June, 2024; originally announced June 2024.

Comments: 15 pages

arXiv:2405.20603 [pdf]

Advancing Financial Risk Prediction Through Optimized LSTM Model Performance and Comparative Analysis

Authors: Ke Xu, Yu Cheng, Shiqing Long, Junjie Guo, Jue Xiao, Mengfang Sun

Abstract: This paper focuses on the application and optimization of LSTM model in financial risk prediction. The study starts with an overview of the architecture and algorithm foundation of LSTM, and then details the model training process and hyperparameter tuning strategy, and adjusts network parameters through experiments to improve performance. Comparative experiments show that the optimized LSTM model… ▽ More This paper focuses on the application and optimization of LSTM model in financial risk prediction. The study starts with an overview of the architecture and algorithm foundation of LSTM, and then details the model training process and hyperparameter tuning strategy, and adjusts network parameters through experiments to improve performance. Comparative experiments show that the optimized LSTM model shows significant advantages in AUC index compared with random forest, BP neural network and XGBoost, which verifies its efficiency and practicability in the field of financial risk prediction, especially its ability to deal with complex time series data, which lays a solid foundation for the application of the model in the actual production environment. △ Less

Submitted 30 May, 2024; originally announced May 2024.

arXiv:2405.17765 [pdf, other]

PTM-VQA: Efficient Video Quality Assessment Leveraging Diverse PreTrained Models from the Wild

Authors: Kun Yuan, Hongbo Liu, Mading Li, Muyi Sun, Ming Sun, Jiachao Gong, Jinhua Hao, Chao Zhou, Yansong Tang

Abstract: Video quality assessment (VQA) is a challenging problem due to the numerous factors that can affect the perceptual quality of a video, \eg, content attractiveness, distortion type, motion pattern, and level. However, annotating the Mean opinion score (MOS) for videos is expensive and time-consuming, which limits the scale of VQA datasets, and poses a significant obstacle for deep learning-based me… ▽ More Video quality assessment (VQA) is a challenging problem due to the numerous factors that can affect the perceptual quality of a video, \eg, content attractiveness, distortion type, motion pattern, and level. However, annotating the Mean opinion score (MOS) for videos is expensive and time-consuming, which limits the scale of VQA datasets, and poses a significant obstacle for deep learning-based methods. In this paper, we propose a VQA method named PTM-VQA, which leverages PreTrained Models to transfer knowledge from models pretrained on various pre-tasks, enabling benefits for VQA from different aspects. Specifically, we extract features of videos from different pretrained models with frozen weights and integrate them to generate representation. Since these models possess various fields of knowledge and are often trained with labels irrelevant to quality, we propose an Intra-Consistency and Inter-Divisibility (ICID) loss to impose constraints on features extracted by multiple pretrained models. The intra-consistency constraint ensures that features extracted by different pretrained models are in the same unified quality-aware latent space, while the inter-divisibility introduces pseudo clusters based on the annotation of samples and tries to separate features of samples from different clusters. Furthermore, with a constantly growing number of pretrained models, it is crucial to determine which models to use and how to use them. To address this problem, we propose an efficient scheme to select suitable candidates. Models with better clustering performance on VQA datasets are chosen to be our candidates. Extensive experiments demonstrate the effectiveness of the proposed method. △ Less

Submitted 27 May, 2024; originally announced May 2024.

Comments: CVPR 2024, 11 pages, 4 figures, 7 tables

arXiv:2405.17220 [pdf, other]

RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness

Authors: Tianyu Yu, Haoye Zhang, Yuan Yao, Yunkai Dang, Da Chen, Xiaoman Lu, Ganqu Cui, Taiwen He, Zhiyuan Liu, Tat-Seng Chua, Maosong Sun

Abstract: Learning from feedback reduces the hallucination of multimodal large language models (MLLMs) by aligning them with human preferences. While traditional methods rely on labor-intensive and time-consuming manual labeling, recent approaches employing models as automatic labelers have shown promising results without human intervention. However, these methods heavily rely on costly proprietary models l… ▽ More Learning from feedback reduces the hallucination of multimodal large language models (MLLMs) by aligning them with human preferences. While traditional methods rely on labor-intensive and time-consuming manual labeling, recent approaches employing models as automatic labelers have shown promising results without human intervention. However, these methods heavily rely on costly proprietary models like GPT-4V, resulting in scalability issues. Moreover, this paradigm essentially distills the proprietary models to provide a temporary solution to quickly bridge the performance gap. As this gap continues to shrink, the community is soon facing the essential challenge of aligning MLLMs using labeler models of comparable capability. In this work, we introduce RLAIF-V, a novel framework that aligns MLLMs in a fully open-source paradigm for super GPT-4V trustworthiness. RLAIF-V maximally exploits the open-source feedback from two perspectives, including high-quality feedback data and online feedback learning algorithm. Extensive experiments on seven benchmarks in both automatic and human evaluation show that RLAIF-V substantially enhances the trustworthiness of models without sacrificing performance on other tasks. Using a 34B model as labeler, RLAIF-V 7B model reduces object hallucination by 82.9\% and overall hallucination by 42.1\%, outperforming the labeler model. Remarkably, RLAIF-V also reveals the self-alignment potential of open-source MLLMs, where a 12B model can learn from the feedback of itself to achieve less than 29.5\% overall hallucination rate, surpassing GPT-4V (45.9\%) by a large margin. The results shed light on a promising route to enhance the efficacy of leading-edge MLLMs. △ Less

Submitted 27 May, 2024; originally announced May 2024.

Comments: Project Website: https://github.com/RLHF-V/RLAIF-V

arXiv:2405.14959 [pdf, other]

EvGGS: A Collaborative Learning Framework for Event-based Generalizable Gaussian Splatting

Authors: Jiaxu Wang, Junhao He, Ziyi Zhang, Mingyuan Sun, Jingkai Sun, Renjing Xu

Abstract: Event cameras offer promising advantages such as high dynamic range and low latency, making them well-suited for challenging lighting conditions and fast-moving scenarios. However, reconstructing 3D scenes from raw event streams is difficult because event data is sparse and does not carry absolute color information. To release its potential in 3D reconstruction, we propose the first event-based ge… ▽ More Event cameras offer promising advantages such as high dynamic range and low latency, making them well-suited for challenging lighting conditions and fast-moving scenarios. However, reconstructing 3D scenes from raw event streams is difficult because event data is sparse and does not carry absolute color information. To release its potential in 3D reconstruction, we propose the first event-based generalizable 3D reconstruction framework, called EvGGS, which reconstructs scenes as 3D Gaussians from only event input in a feedforward manner and can generalize to unseen cases without any retraining. This framework includes a depth estimation module, an intensity reconstruction module, and a Gaussian regression module. These submodules connect in a cascading manner, and we collaboratively train them with a designed joint loss to make them mutually promote. To facilitate related studies, we build a novel event-based 3D dataset with various material objects and calibrated labels of grayscale images, depth maps, camera poses, and silhouettes. Experiments show models that have jointly trained significantly outperform those trained individually. Our approach performs better than all baselines in reconstruction quality, and depth/intensity predictions with satisfactory rendering speed. △ Less

Submitted 3 June, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

arXiv:2405.11750 [pdf, other]

The Intermediate-Mass Black Hole Reverberation Mapping Project: Initial Results for a candidate IMBH in a nearby Seyfert 1 Galaxy

Authors: Wenwen Zuo, Hengxiao Guo, Jingbo Sun, Qi Yuan, Paulina Lira, Minfeng Gu, Philip G. Edwards, Alok C. Gupta, Shubham Kishore, Jamie Stevens, Tao An, Zhen-Yi Cai, Haicheng Feng, Luis C. Ho, Dragana Ilić, Andjelka B. Kovačević, ShaSha Li, Mar Mezcua, Luka Č. Popović, Mouyuan Sun, Tushar Tripathi, Vivian U., Oliver Vince, Jianguo Wang, Junxian Wang , et al. (3 additional authors not shown)

Abstract: To investigate the short-term variability and determine the size of the optical continuum emitting size of intermediate-mass black holes (IMBHs), we carried out high-cadence, multi-band photometric monitoring of a Seyfert 1 galaxy J0249-0815 across two nights, together with a one-night single-band preliminary test. The presence of the broad Ha component in our target was confirmed by recent Paloma… ▽ More To investigate the short-term variability and determine the size of the optical continuum emitting size of intermediate-mass black holes (IMBHs), we carried out high-cadence, multi-band photometric monitoring of a Seyfert 1 galaxy J0249-0815 across two nights, together with a one-night single-band preliminary test. The presence of the broad Ha component in our target was confirmed by recent Palomar/P200 spectroscopic observations, 23 years after Sloan Digital Sky Survey, ruling out the supernovae origin of the broad Ha line. The photometric experiment was primarily conducted utilizing four-channel imagers MuSCAT 3 & 4 mounted on 2-meter telescopes within the Las Cumbres Observatory Global Telescope Network. Despite the expectation of variability, we observed no significant variation (<1.4%) on timescales of 6-10 hours. This non-detection is likely due to substantial host galaxy light diluting the subtle AGN variability. Dual-band preliminary tests and tailored simulations may enhance the possibility of detecting variability and lag in future IMBH reverberation campaigns. △ Less

Submitted 19 May, 2024; originally announced May 2024.

Comments: 14 pages, 6 figures, submitted to ApJ, comments welcome

arXiv:2405.10621 [pdf, other]

Historically Relevant Event Structuring for Temporal Knowledge Graph Reasoning

Authors: Jinchuan Zhang, Bei Hui, Chong Mu, Ming Sun, Ling Tian

Abstract: Temporal Knowledge Graph (TKG) reasoning focuses on predicting events through historical information within snapshots distributed on a timeline. Existing studies mainly concentrate on two perspectives of leveraging the history of TKGs, including capturing evolution of each recent snapshot or correlations among global historical facts. Despite the achieved significant accomplishments, these models… ▽ More Temporal Knowledge Graph (TKG) reasoning focuses on predicting events through historical information within snapshots distributed on a timeline. Existing studies mainly concentrate on two perspectives of leveraging the history of TKGs, including capturing evolution of each recent snapshot or correlations among global historical facts. Despite the achieved significant accomplishments, these models still fall short of (1) investigating the influences of multi-granularity interactions across recent snapshots and (2) harnessing the expressive semantics of significant links accorded with queries throughout the entire history, especially events exerting a profound impact on the future. These inadequacies restrict representation ability to reflect historical dependencies and future trends thoroughly. To overcome these drawbacks, we propose an innovative TKG reasoning approach towards \textbf{His}torically \textbf{R}elevant \textbf{E}vents \textbf{S}tructuring ($\mathsf{HisRES}$). Concretely, $\mathsf{HisRES}$ comprises two distinctive modules excelling in structuring historically relevant events within TKGs, including a multi-granularity evolutionary encoder that captures structural and temporal dependencies of the most recent snapshots, and a global relevance encoder that concentrates on crucial correlations among events relevant to queries from the entire history. Furthermore, $\mathsf{HisRES}$ incorporates a self-gating mechanism for adaptively merging multi-granularity recent and historically relevant structuring representations. Extensive experiments on four event-based benchmarks demonstrate the state-of-the-art performance of $\mathsf{HisRES}$ and indicate the superiority and effectiveness of structuring historical relevance for TKG reasoning. △ Less

Submitted 17 May, 2024; originally announced May 2024.

arXiv:2405.09825 [pdf, other]

A Sample of Compact Object Candidates in Single-lined Spectroscopic Binaries from LAMOST Medium Resolution Survey

Authors: Hao-Bin Liu, Wei-Min Gu, Zhi-Xiang Zhang, Tuan Yi, Jin-Zhong Liu, Mouyuan Sun

Abstract: The stellar spectra from LAMOST Medium Resolution Survey can be used to search for compact objects in binaries. The LAMOST DR10 catalog includes > 980, 000 targets with multiple medium resolution spectra. We select the targets with large or rapid radial velocity variation, and obtained an input-sample of 1822 sources. We use light curves and spectra to identify and exclude eclipsing binaries and d… ▽ More The stellar spectra from LAMOST Medium Resolution Survey can be used to search for compact objects in binaries. The LAMOST DR10 catalog includes > 980, 000 targets with multiple medium resolution spectra. We select the targets with large or rapid radial velocity variation, and obtained an input-sample of 1822 sources. We use light curves and spectra to identify and exclude eclipsing binaries and double-lined spectroscopic binaries in the input-sample. We finally derive a catalog of 89 candidates with well-folded radial velocity, which are all single-lined spectroscopic binaries, indicating an unseen companion residing in each system. The mass function of each system can be well constrained based on the radial velocity curve. In our sample, 26 sources have mass function higher than 0.1 $M_{\odot}$, among which 18 sources have ellipsoidal type light curves. In our opinion, compact objects are likely existent in all these 26 binaries, which are worth follow-up identification. △ Less

Submitted 16 May, 2024; originally announced May 2024.

Comments: 20 pages, 10 figures, accepted for publication in ApJ

arXiv:2405.09738 [pdf, other]

X-ray Cool Core Remnants Heated by Strong Radio AGN Feedback

Authors: Wenhao Liu, Ming Sun, G. Mark Voit, Dharam Vir Lal, Paul Nulsen, Massimo Gaspari, Craig Sarazin, Steven Ehlert, Xianzhong Zheng

Abstract: Strong AGN heating provides an alternative means for the disruption of cluster cool cores (CCs) to cluster mergers. In this work we present a systematic Chandra study of a sample of 108 nearby ($z<0.1$) galaxy clusters, to investigate the effect of AGN heating on CCs. About 40% of clusters with small offsets between the BCG and the X-ray centre ($\le50$ kpc) have small CCs. For comparison, 14 of 1… ▽ More Strong AGN heating provides an alternative means for the disruption of cluster cool cores (CCs) to cluster mergers. In this work we present a systematic Chandra study of a sample of 108 nearby ($z<0.1$) galaxy clusters, to investigate the effect of AGN heating on CCs. About 40% of clusters with small offsets between the BCG and the X-ray centre ($\le50$ kpc) have small CCs. For comparison, 14 of 17 clusters with large offsets have small CCs, which suggests that mergers or sloshing can be efficient in reducing the CC size. Relaxed, small CC clusters generally have weak radio AGNs ($P_{1.4\rm GHz}<10^{23}$ W Hz$^{-1}$), and they show a lack of systems hosting a radio AGN with intermediate radio power ($2\times10^{23}<P_{1.4\rm GHz}<2\times10^{24}$ W Hz$^{-1}$). We found that the strongest circumnuclear ($<1$ kpc) X-ray emission only exists in clusters with strong radio AGN. The duty cycle of relaxed, small CC clusters is less than half of that for large CC clusters. It suggests that the radio activity of BCGs is affected by the properties of the surrounding gas beyond the central $\sim10$ kpc, and strong radio AGNs in small X-ray CCs fade more rapidly than those embedded in large X-ray CCs. A scenario is also presented for the transition of large CCs and coronae due to radio AGN feedback. We also present a detailed analysis of galaxy cluster 3C 129.1 as an example of a CC remnant possibly disrupted by radio AGN. △ Less

Submitted 15 May, 2024; originally announced May 2024.

Comments: 17 pages, 10 figures, Accepted for publication on MNRAS

arXiv:2405.05672 [pdf, other]

Multi-Stream Keypoint Attention Network for Sign Language Recognition and Translation

Authors: Mo Guan, Yan Wang, Guangkun Ma, Jiarui Liu, Mingzu Sun

Abstract: Sign language serves as a non-vocal means of communication, transmitting information and significance through gestures, facial expressions, and bodily movements. The majority of current approaches for sign language recognition (SLR) and translation rely on RGB video inputs, which are vulnerable to fluctuations in the background. Employing a keypoint-based strategy not only mitigates the effects of… ▽ More Sign language serves as a non-vocal means of communication, transmitting information and significance through gestures, facial expressions, and bodily movements. The majority of current approaches for sign language recognition (SLR) and translation rely on RGB video inputs, which are vulnerable to fluctuations in the background. Employing a keypoint-based strategy not only mitigates the effects of background alterations but also substantially diminishes the computational demands of the model. Nevertheless, contemporary keypoint-based methodologies fail to fully harness the implicit knowledge embedded in keypoint sequences. To tackle this challenge, our inspiration is derived from the human cognition mechanism, which discerns sign language by analyzing the interplay between gesture configurations and supplementary elements. We propose a multi-stream keypoint attention network to depict a sequence of keypoints produced by a readily available keypoint estimator. In order to facilitate interaction across multiple streams, we investigate diverse methodologies such as keypoint fusion strategies, head fusion, and self-distillation. The resulting framework is denoted as MSKA-SLR, which is expanded into a sign language translation (SLT) model through the straightforward addition of an extra translation network. We carry out comprehensive experiments on well-known benchmarks like Phoenix-2014, Phoenix-2014T, and CSL-Daily to showcase the efficacy of our methodology. Notably, we have attained a novel state-of-the-art performance in the sign language translation task of Phoenix-2014T. The code and models can be accessed at: https://github.com/sutwangyan/MSKA. △ Less

Submitted 9 May, 2024; originally announced May 2024.

Comments: 15 pages

arXiv:2405.05288 [pdf, other]

Learning Social Graph for Inactive User Recommendation

Authors: Nian Liu, Shen Fan, Ting Bai, Peng Wang, Mingwei Sun, Yanhu Mo, Xiaoxiao Xu, Hong Liu, Chuan Shi

Abstract: Social relations have been widely incorporated into recommender systems to alleviate data sparsity problem. However, raw social relations don't always benefit recommendation due to their inferior quality and insufficient quantity, especially for inactive users, whose interacted items are limited. In this paper, we propose a novel social recommendation method called LSIR (\textbf{L}earning \textbf{… ▽ More Social relations have been widely incorporated into recommender systems to alleviate data sparsity problem. However, raw social relations don't always benefit recommendation due to their inferior quality and insufficient quantity, especially for inactive users, whose interacted items are limited. In this paper, we propose a novel social recommendation method called LSIR (\textbf{L}earning \textbf{S}ocial Graph for \textbf{I}nactive User \textbf{R}ecommendation) that learns an optimal social graph structure for social recommendation, especially for inactive users. LSIR recursively aggregates user and item embeddings to collaboratively encode item and user features. Then, graph structure learning (GSL) is employed to refine the raw user-user social graph, by removing noisy edges and adding new edges based on the enhanced embeddings. Meanwhile, mimic learning is implemented to guide active users in mimicking inactive users during model training, which improves the construction of new edges for inactive users. Extensive experiments on real-world datasets demonstrate that LSIR achieves significant improvements of up to 129.58\% on NDCG in inactive user recommendation. Our code is available at~\url{https://github.com/liun-online/LSIR}. △ Less

Submitted 22 May, 2024; v1 submitted 7 May, 2024; originally announced May 2024.

Comments: This paper has been received by DASFAA 2024

arXiv:2405.04219 [pdf, other]

Iterative Experience Refinement of Software-Developing Agents

Authors: Chen Qian, Jiahao Li, Yufan Dang, Wei Liu, YiFei Wang, Zihao Xie, Weize Chen, Cheng Yang, Yingli Zhang, Zhiyuan Liu, Maosong Sun

Abstract: Autonomous agents powered by large language models (LLMs) show significant potential for achieving high autonomy in various scenarios such as software development. Recent research has shown that LLM agents can leverage past experiences to reduce errors and enhance efficiency. However, the static experience paradigm, reliant on a fixed collection of past experiences acquired heuristically, lacks it… ▽ More Autonomous agents powered by large language models (LLMs) show significant potential for achieving high autonomy in various scenarios such as software development. Recent research has shown that LLM agents can leverage past experiences to reduce errors and enhance efficiency. However, the static experience paradigm, reliant on a fixed collection of past experiences acquired heuristically, lacks iterative refinement and thus hampers agents' adaptability. In this paper, we introduce the Iterative Experience Refinement framework, enabling LLM agents to refine experiences iteratively during task execution. We propose two fundamental patterns: the successive pattern, refining based on nearest experiences within a task batch, and the cumulative pattern, acquiring experiences across all previous task batches. Augmented with our heuristic experience elimination, the method prioritizes high-quality and frequently-used experiences, effectively managing the experience space and enhancing efficiency. Extensive experiments show that while the successive pattern may yield superior results, the cumulative pattern provides more stable performance. Moreover, experience elimination facilitates achieving better performance using just 11.54% of a high-quality subset. △ Less

Submitted 7 May, 2024; originally announced May 2024.

Comments: Work in progress

arXiv:2405.02584 [pdf, other]

doi 10.3847/1538-4357/ad47c7

A new timescale-mass scaling for the optical variation of active galactic nuclei across the intermediate-mass to supermassive scales

Authors: Zhen-Bo Su, Zhen-Yi Cai, Mouyuan Sun, Hengxiao Guo, Wei-Min Gu, Jun-Xian Wang

Abstract: Variability of active galactic nuclei (AGNs) has long been servicing as an essential avenue of exploring the accretion physics of black hole (BH). There are two commonly used methods for analyzing AGN variability. First, the AGN variability, characterized by the structure function (SF) of a single band, can be well described by a damped random walk (DRW) process on timescales longer than ~weeks, s… ▽ More Variability of active galactic nuclei (AGNs) has long been servicing as an essential avenue of exploring the accretion physics of black hole (BH). There are two commonly used methods for analyzing AGN variability. First, the AGN variability, characterized by the structure function (SF) of a single band, can be well described by a damped random walk (DRW) process on timescales longer than ~weeks, shorter than which departures have been reported. Second, the color variation (CV) between two bands behaves timescale-dependent, raising challenges to the widely accepted reprocessing scenario. However, both the departure from the DRW process and the timescale-dependent CV are hitherto limited to AGNs, mainly quasars, at the supermassive scale. Here, utilizing the high-cadence multi-wavelength monitoring on NGC 4395 harboring an intermediate-mass BH, we unveil at the intermediate-mass scale for the first time, prominent departures from the DRW process at timescales shorter than ~hours in all three nights and bands, and plausible timescale-dependent CVs in the two longest nights of observation. Furthermore, comparing SFs of NGC 4395 to four AGNs at the supermassive scale, we suggest a new scaling relation between the timescale (\uptau; across nearly three orders of magnitude) and the BH mass (M_BH): \uptau \propto M^γ_BH where the exponent γis likely ~ 0.6 - 0.8. This exponent differs from most previous measurements, but confirms a few and is consistent with a recent theoretical prediction, suggesting a similar accretion process in AGNs across different mass scales. △ Less

Submitted 4 May, 2024; originally announced May 2024.

Comments: 22 pages, 12 figures, Accepted by ApJ

Journal ref: 2024, ApJ, 969, 78

arXiv:2404.19193 [pdf]

Tunable Collective Excitations in Epitaxial Perovskite Nickelates

Authors: Mengxia Sun, Xu He, Mingyao Chen, Chi Sin Tang, Xiongfang Liu, Liang Dai, Jishan Liu, Zhigang Zeng, Shuo Sun, Mark B. H. Breese, Chuanbing Cai, Yingge Du, Le Wang, Andrew T. S. Wee, Xinmao Yin

Abstract: The formation of plasmons through the collective excitation of charge density has generated intense discussions, offering insights to fundamental sciences and potential applications. While the underlying physical principles have been well-established, the effects of many-body interactions and orbital hybridization on plasmonic dynamics remain understudied. In this work, we present the observation… ▽ More The formation of plasmons through the collective excitation of charge density has generated intense discussions, offering insights to fundamental sciences and potential applications. While the underlying physical principles have been well-established, the effects of many-body interactions and orbital hybridization on plasmonic dynamics remain understudied. In this work, we present the observation of conventional metallic and correlated plasmons in epitaxial La1-xSrxNiO3 (LSNO) films with varying Sr doping concentrations (x = 0, 0.125, 0.25), unveiling their intriguing evolution. Unlike samples at other doping concentrations, the x = 0.125 intermediate doping sample does not exhibit the correlated plasmons despite showing high optical conductivity. Through a comprehensive experimental investigation using spectroscopic ellipsometry and X-ray absorption spectroscopy, the O2p-Ni3d orbital hybridization for LSNO with a doping concentration of x = 0.125 is found to be significantly enhanced, alongside a considerable weakening of its effective correlation U*. These factors account for the absence of correlated plasmons and the high optical conductivity observed in LSNO (0.125). Our results underscore the profound impact of orbital hybridization on the electronic structure and the formation of plasmon in strongly-correlated systems. This in turn suggest that LSNO could serve as a promising alternative material in optoelectronic devices. △ Less

Submitted 1 June, 2024; v1 submitted 29 April, 2024; originally announced April 2024.

arXiv:2404.18412 [pdf]

Uncovering an Interfacial Band Resulting from Orbital Hybridization in Nickelate Heterostructures

Authors: Mingyao Chen, Huimin Liu, Xu He, Minjuan Li, Chi Sin Tang, Mengxia Sun, Krishna Prasad Koirala, Mark E. Bowden, Yangyang Li, Xiongfang Liu, Difan Zhou, Shuo Sun, Mark B. H. Breese, Chuanbing Cai, Yingge Du, Andrew T. S. Wee, Le Wang, Xinmao Yin

Abstract: The interaction of atomic orbitals at the interface of perovskite oxide heterostructures has been investigated for its profound impact on the band structures and electronic properties, giving rise to unique electronic states and a variety of tunable functionalities. In this study, we conducted an extensive investigation of the optical and electronic properties of epitaxial NdNiO3 thin films grown… ▽ More The interaction of atomic orbitals at the interface of perovskite oxide heterostructures has been investigated for its profound impact on the band structures and electronic properties, giving rise to unique electronic states and a variety of tunable functionalities. In this study, we conducted an extensive investigation of the optical and electronic properties of epitaxial NdNiO3 thin films grown on a series of single crystal substrates. Unlike films synthesized on other substrates, NdNiO3 on SrTiO3 (NNO/STO) gives rise to a unique band structure which features an additional unoccupied band situated above the Fermi level. Our comprehensive investigation, which incorporated a wide array of experimental techniques and density functional theory calculations, revealed that the emergence of the interfacial band structure is primarily driven by the orbital hybridization between Ti 3d orbitals of the STO substrate and O 2p orbitals of the NNO thin film. Furthermore, exciton peaks have been detected in the optical spectra of the NNO/STO film, attributable to the pronounced electron-electron (e-e) and electron-hole (e-h) interactions propagating from the STO substrate into the NNO film. These findings underscore the substantial influence of interfacial orbital hybridization on the electronic structure of oxide thin-films, thereby offering key insights into tuning their interfacial properties. △ Less

Submitted 29 April, 2024; originally announced April 2024.

Comments: 26 pages,4 figures

arXiv:2404.18243 [pdf, other]

LEGENT: Open Platform for Embodied Agents

Authors: Zhili Cheng, Zhitong Wang, Jinyi Hu, Shengding Hu, An Liu, Yuge Tu, Pengkai Li, Lei Shi, Zhiyuan Liu, Maosong Sun

Abstract: Despite advancements in Large Language Models (LLMs) and Large Multimodal Models (LMMs), their integration into language-grounded, human-like embodied agents remains incomplete, hindering complex real-life task performance in physical environments. Existing integrations often feature limited open sourcing, challenging collective progress in this field. We introduce LEGENT, an open, scalable platfo… ▽ More Despite advancements in Large Language Models (LLMs) and Large Multimodal Models (LMMs), their integration into language-grounded, human-like embodied agents remains incomplete, hindering complex real-life task performance in physical environments. Existing integrations often feature limited open sourcing, challenging collective progress in this field. We introduce LEGENT, an open, scalable platform for developing embodied agents using LLMs and LMMs. LEGENT offers a dual approach: a rich, interactive 3D environment with communicable and actionable agents, paired with a user-friendly interface, and a sophisticated data generation pipeline utilizing advanced algorithms to exploit supervision from simulated worlds at scale. In our experiments, an embryonic vision-language-action model trained on LEGENT-generated data surpasses GPT-4V in embodied tasks, showcasing promising generalization capabilities. △ Less

Submitted 28 April, 2024; originally announced April 2024.

Comments: Demo Paper

Showing 1–50 of 1,414 results for author: Sun, M