Fast and Accurate Multi-Agent Trajectory Prediction For Crowded Unknown Scenes

Authors: Xiuye Tao, Huiping Li, Bin Liang, Yang Shi, Demin Xu

Abstract: This paper studies the problem of multi-agent trajectory prediction in crowded unknown environments. A novel energy function optimization-based framework is proposed to generate prediction trajectories. Firstly, a new energy function is designed for easier optimization. Secondly, an online optimization pipeline for calculating parameters and agents' velocities is developed. In this pipeline, we first design an efficient group division method based on Fréchet distance to classify agents online. Then a strategy for decoupling the optimization of velocities and critical parameters in the energy function is developed, where the salp swarm algorithm and gradient descent algorithms are integrated to solve the optimization problems more efficiently. Thirdly, we propose a similarity-based resample evaluation algorithm to predict agents' optimal goals, defined as the target-moving headings of agents, which effectively extracts hidden information in observed states and avoids learning agents' destinations via the training dataset in advance. Experiments and comparison studies verify the advantages of the proposed method in terms of prediction accuracy and speed.

Submitted 12 July, 2024; originally announced July 2024.

arXiv:2406.11181 [pdf, other]

General Scintillation for Gaussian Beam Propagating through Oceanic Turbulence and UWOC System Performance Evaluation

Authors: Yuxuan Li, Xiang Yi, Xinyue Tao, Ata Yalçın, Mingjian Cheng, Lu Zhang

Abstract: In this paper, we derive a general and exact closed-form expression for the scintillation index (SI) of a Gaussian beam propagating through weak oceanic turbulence, based on the general oceanic turbulence optical power spectrum (OTOPS) and the Rytov theory. Our universal expression not only includes existing Rytov variances but also accounts for actual cases where the Kolmogorov microscale is non-zero. The correctness and accuracy of our derivation are verified through comparison with the published work under identical conditions. By utilizing our derived expressions, we analyze the impact of various beam, propagation and oceanic turbulence parameters on both SI and bit error rate (BER) performance of underwater wireless optical communication (UWOC) systems. Numerical results demonstrate that the relationship between the Kolmogorov microscale and SI is nonlinear. Additionally, considering that certain oceanic turbulence parameters are related to depth, we use temperature and salinity data from Argo buoys deployed in real oceans to investigate the dependence of SI on depth. Our findings will contribute to the design and optimization of UWOC systems.

Submitted 16 June, 2024; originally announced June 2024.

arXiv:2406.10469 [pdf, other]

Object-Attribute-Relation Representation based Video Semantic Communication

Authors: Qiyuan Du, Yiping Duan, Qianqian Yang, Xiaoming Tao, Mérouane Debbah

Abstract: With the rapid growth of multimedia data volume, there is an increasing need for efficient video transmission in applications such as virtual reality and future video streaming services. Semantic communication is emerging as a vital technique for ensuring efficient and reliable transmission in low-bandwidth, high-noise settings. However, most current approaches focus on joint source-channel coding (JSCC) that depends on end-to-end training. These methods often lack an interpretable semantic representation and struggle with adaptability to various downstream tasks. In this paper, we introduce the use of object-attribute-relation (OAR) as a semantic framework for videos to facilitate low bit-rate coding and enhance the JSCC process for more effective video transmission. We utilize OAR sequences for both low bit-rate representation and generative video reconstruction. Additionally, we incorporate OAR into the image JSCC model to prioritize communication resources for areas more critical to downstream tasks. Our experiments on traffic surveillance video datasets assess the effectiveness of our approach in terms of video transmission performance. The empirical findings demonstrate that our OAR-based video coding method not only outperforms H.265 coding at lower bit-rates but also synergizes with JSCC to deliver robust and efficient video transmission.

Submitted 14 June, 2024; originally announced June 2024.

arXiv:2406.04277 [pdf, other]

VideoTetris: Towards Compositional Text-to-Video Generation

Authors: Ye Tian, Ling Yang, Haotian Yang, Yuan Gao, Yufan Deng, Jingmin Chen, Xintao Wang, Zhaochen Yu, Xin Tao, Pengfei Wan, Di Zhang, Bin Cui

Abstract: Diffusion models have demonstrated great success in text-to-video (T2V) generation. However, existing methods may face challenges when handling complex (long) video generation scenarios that involve multiple objects or dynamic changes in object numbers. To address these limitations, we propose VideoTetris, a novel framework that enables compositional T2V generation. Specifically, we propose spatio-temporal compositional diffusion to precisely follow complex textual semantics by manipulating and composing the attention maps of denoising networks spatially and temporally. Moreover, we propose an enhanced video data preprocessing step to improve the training data regarding motion dynamics and prompt understanding, equipped with a new reference frame attention mechanism to improve the consistency of auto-regressive video generation. Extensive experiments demonstrate that our VideoTetris achieves impressive qualitative and quantitative results in compositional T2V generation. Code is available at: https://github.com/YangLing0818/VideoTetris

Submitted 6 June, 2024; originally announced June 2024.

Comments: Code: https://github.com/YangLing0818/VideoTetris

arXiv:2405.19226 [pdf, other]

ContextBLIP: Doubly Contextual Alignment for Contrastive Image Retrieval from Linguistically Complex Descriptions

Authors: Honglin Lin, Siyu Li, Guoshun Nan, Chaoyue Tang, Xueting Wang, Jingxin Xu, Rong Yankai, Zhili Zhou, Yutong Gao, Qimei Cui, Xiaofeng Tao

Abstract: Image retrieval from contextual descriptions (IRCD) aims to identify an image within a set of minimally contrastive candidates based on linguistically complex text. Despite the success of VLMs, they still significantly lag behind human performance in IRCD. The main challenges lie in aligning key contextual cues in two modalities, where these subtle cues are concealed in tiny areas of multiple contrastive images and within the complex linguistics of textual descriptions. This motivates us to propose ContextBLIP, a simple yet effective method that relies on a doubly contextual alignment scheme for challenging IRCD. Specifically, 1) our model comprises a multi-scale adapter, a matching loss, and a text-guided masking loss. The adapter learns to capture fine-grained visual cues. The two losses enable iterative supervision for the adapter, gradually highlighting the focal patches of a single image to the key textual cues. We term this intra-contextual alignment. 2) Then, ContextBLIP further employs an inter-context encoder to learn dependencies among candidates, facilitating alignment between the text and multiple images. We term this step inter-contextual alignment. Consequently, the nuanced cues concealed in each modality can be effectively aligned. Experiments on two benchmarks show the superiority of our method. We observe that ContextBLIP can yield comparable results with GPT-4V, despite involving about 7,500 times fewer parameters.

Submitted 29 May, 2024; originally announced May 2024.

Comments: Accepted in ACL 2024 Findings

arXiv:2405.15403 [pdf, other]

Fine-Grained Dynamic Framework for Bias-Variance Joint Optimization on Data Missing Not at Random

Authors: Mingming Ha, Xuewen Tao, Wenfang Lin, Qionxu Ma, Wujiang Xu, Linxun Chen

Abstract: In most practical applications such as recommendation systems, display advertising, and so forth, the collected data often contains missing values and those missing values are generally missing-not-at-random, which deteriorates the prediction performance of models. Some existing estimators and regularizers attempt to achieve unbiased estimation to improve the predictive performance. However, the variances and generalization bounds of these methods are generally unbounded when the propensity scores tend to zero, compromising their stability and robustness. In this paper, we first theoretically reveal the limitations of regularization techniques. Besides, we further illustrate that, for more general estimators, unbiasedness will inevitably lead to unbounded variance. These general laws suggest that estimator design is not merely about eliminating bias, reducing variance, or simply achieving a bias-variance trade-off. Instead, it involves a quantitative joint optimization of bias and variance. Then, we develop a systematic fine-grained dynamic learning framework to jointly optimize bias and variance, which adaptively selects an appropriate estimator for each user-item pair according to the predefined objective function. With this operation, the generalization bounds and variances of models are reduced and bounded with theoretical guarantees. Extensive experiments are conducted to verify the theoretical results and the effectiveness of the proposed dynamic learning framework.

Submitted 24 May, 2024; originally announced May 2024.

arXiv:2405.15321 [pdf, other]

SG-Adapter: Enhancing Text-to-Image Generation with Scene Graph Guidance

Authors: Guibao Shen, Luozhou Wang, Jiantao Lin, Wenhang Ge, Chaozhe Zhang, Xin Tao, Yuan Zhang, Pengfei Wan, Zhongyuan Wang, Guangyong Chen, Yijun Li, Ying-Cong Chen

Abstract: Recent advancements in text-to-image generation have been propelled by the development of diffusion models and multi-modality learning. However, since text is typically represented sequentially in these models, it often falls short in providing accurate contextualization and structural control. As a result, the generated images do not consistently align with human expectations, especially in complex scenarios involving multiple objects and relationships. In this paper, we introduce the Scene Graph Adapter (SG-Adapter), leveraging the structured representation of scene graphs to rectify inaccuracies in the original text embeddings. The SG-Adapter's explicit and non-fully connected graph representation greatly improves on the fully connected, transformer-based text representations. This enhancement is particularly notable in maintaining precise correspondence in scenarios involving multiple relationships. To address the challenges posed by low-quality annotated datasets like Visual Genome, we have manually curated a highly clean, multi-relational scene graph-image paired dataset, MultiRels. Furthermore, we design three metrics derived from GPT-4V to effectively and thoroughly measure the correspondence between images and scene graphs. Both qualitative and quantitative results validate the efficacy of our approach in controlling the correspondence in multiple relationships.

Submitted 24 May, 2024; originally announced May 2024.

arXiv:2405.10514 [pdf, other]

Secrecy Performance Analysis of Multi-Functional RIS-Assisted NOMA Networks

Authors: Yingjie Pei, Wanli Ni, Jin Xu, Xinwei Yue, Xiaofeng Tao, Dusit Niyato

Abstract: Although a reconfigurable intelligent surface (RIS) can improve the secrecy communication performance of wireless users, it still faces challenges such as limited coverage and the double-fading effect. To address these issues, in this paper, we utilize a novel multi-functional RIS (MF-RIS) to enhance the secrecy performance of wireless users, and investigate the physical layer secrecy problem in non-orthogonal multiple access (NOMA) networks. Specifically, we derive closed-form expressions for the secrecy outage probability (SOP) and secrecy throughput of users in the MF-RIS-assisted NOMA networks with external and internal eavesdroppers. The asymptotic expressions for SOP and secrecy diversity order are also analyzed under high signal-to-noise ratio (SNR) conditions. Additionally, we examine the impact of receiver hardware limitations and error transmission-induced imperfect successive interference cancellation (SIC) on the secrecy performance. Numerical results indicate that: i) under the same power budget, the secrecy performance achieved by MF-RIS significantly outperforms active RIS and simultaneously transmitting and reflecting RIS; ii) with increasing power budget, residual interference caused by imperfect SIC surpasses thermal noise as the primary factor affecting secrecy capacity; and iii) deploying additional elements at the MF-RIS brings significant secrecy enhancements for the external eavesdropping scenario, in contrast to the internal eavesdropping case.

Submitted 16 May, 2024; originally announced May 2024.

Comments: 14 pages, 9 figures, submitted to IEEE Transactions on Wireless Communications

arXiv:2405.08096 [pdf, other]

Semantic MIMO Systems for Speech-to-Text Transmission

Authors: Zhenzi Weng, Zhijin Qin, Huiqiang Xie, Xiaoming Tao, Khaled B. Letaief

Abstract: Semantic communications have been utilized to execute numerous intelligent tasks by transmitting task-related semantic information instead of bits. In this article, we propose a semantic-aware speech-to-text transmission system for the single-user multiple-input multiple-output (MIMO) and multi-user MIMO communication scenarios, named SAC-ST. Particularly, a semantic communication system to serve the speech-to-text task at the receiver is first designed, which compresses the semantic information and generates the low-dimensional semantic features by leveraging the transformer module. In addition, a novel semantic-aware network is proposed to facilitate the transmission with high semantic fidelity to identify the critical semantic information and guarantee it is recovered accurately. Furthermore, we extend the SAC-ST with a neural network-enabled channel estimation network to mitigate the dependence on accurate channel state information and validate the feasibility of SAC-ST in practical communication environments. Simulation results show that the proposed SAC-ST outperforms the communication framework without the semantic-aware network for speech-to-text transmission over the MIMO channels in terms of the speech-to-text metrics, especially in the low signal-to-noise regime. Moreover, the SAC-ST with the developed channel estimation network is comparable to the SAC-ST with perfect channel state information.

Submitted 13 May, 2024; originally announced May 2024.

arXiv:2405.05795 [pdf, other]

Enhancing Suicide Risk Detection on Social Media through Semi-Supervised Deep Label Smoothing

Authors: Matthew Squires, Xiaohui Tao, Soman Elangovan, U Rajendra Acharya, Raj Gururajan, Haoran Xie, Xujuan Zhou

Abstract: Suicide is a prominent issue in society. Unfortunately, many people at risk for suicide do not receive the support required. Barriers to people receiving support include social stigma and lack of access to mental health care. With the popularity of social media, people have turned to online forums, such as Reddit, to express their feelings and seek support. This provides the opportunity to support people with the aid of artificial intelligence. Social media posts can be classified, using text classification, to help connect people with professional help. However, these systems fail to account for the inherent uncertainty in classifying mental health conditions. Unlike other areas of healthcare, mental health conditions have no objective measurements of disease, often relying on expert opinion. Thus, when formulating deep learning problems involving mental health, using hard, binary labels does not accurately represent the true nature of the data. In these settings, where human experts may disagree, fuzzy or soft labels may be more appropriate. The current work introduces a novel label smoothing method which we use to capture any uncertainty within the data. We test our approach on a five-label multi-class classification problem. We show that our semi-supervised deep label smoothing method improves classification accuracy above the existing state of the art. Where existing research reports an accuracy of 43% on the Reddit C-SSRS dataset, using empirical experiments to evaluate our novel label smoothing method, we improve upon this existing benchmark to 52%. These improvements in model performance have the potential to better support those experiencing mental distress. Future work should explore the use of probabilistic methods in both natural language processing and quantifying contributions of both epistemic and aleatoric uncertainty in noisy datasets.

Submitted 9 May, 2024; originally announced May 2024.

arXiv:2405.00181 [pdf, other]

Uncovering What, Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly

Authors: Hang Du, Sicheng Zhang, Binzhu Xie, Guoshun Nan, Jiayang Zhang, Junrui Xu, Hangyu Liu, Sicong Leng, Jiangming Liu, Hehe Fan, Dajiu Huang, Jing Feng, Linli Chen, Can Zhang, Xuhuan Li, Hao Zhang, Jianhang Chen, Qimei Cui, Xiaofeng Tao

Abstract: Video anomaly understanding (VAU) aims to automatically comprehend unusual occurrences in videos, thereby enabling various applications such as traffic surveillance and industrial manufacturing. While existing VAU benchmarks primarily concentrate on anomaly detection and localization, our focus is on greater practicality, prompting us to raise the following crucial questions: "what anomaly occurred?", "why did it happen?", and "how severe is this abnormal event?". In pursuit of these answers, we present a comprehensive benchmark for Causation Understanding of Video Anomaly (CUVA). Specifically, each instance of the proposed benchmark involves three sets of human annotations to indicate the "what", "why" and "how" of an anomaly, including 1) anomaly type, start and end times, and event descriptions, 2) natural language explanations for the cause of an anomaly, and 3) free text reflecting the effect of the abnormality. In addition, we also introduce MMEval, a novel evaluation metric designed to better align with human preferences for CUVA, facilitating the measurement of existing LLMs in comprehending the underlying cause and corresponding effect of video anomalies. Finally, we propose a novel prompt-based method that can serve as a baseline approach for the challenging CUVA. We conduct extensive experiments to show the superiority of our evaluation metric and the prompt-based approach. Our code and dataset are available at https://github.com/fesvhtr/CUVA.

Submitted 6 May, 2024; v1 submitted 30 April, 2024; originally announced May 2024.

Comments: Accepted in CVPR 2024, Codebase: https://github.com/fesvhtr/CUVA

arXiv:2404.16913 [pdf, other]

DE-CGAN: Boosting rTMS Treatment Prediction with Diversity Enhancing Conditional Generative Adversarial Networks

Authors: Matthew Squires, Xiaohui Tao, Soman Elangovan, Raj Gururajan, Haoran Xie, Xujuan Zhou, Yuefeng Li, U Rajendra Acharya

Abstract: Repetitive Transcranial Magnetic Stimulation (rTMS) is a well-supported, evidence-based treatment for depression. However, patterns of response to this treatment are inconsistent. Emerging evidence suggests that artificial intelligence can predict rTMS treatment outcomes for most patients using fMRI connectivity features. While these models can reliably predict treatment outcomes for many patients, for some underrepresented fMRI connectivity measures DNN models are unable to reliably predict treatment outcomes. As such, we propose a novel method, Diversity Enhancing Conditional Generative Adversarial Network (DE-CGAN), for oversampling these underrepresented examples. DE-CGAN creates synthetic examples in difficult-to-classify regions by first identifying these data points and then creating conditioned synthetic examples to enhance data diversity. Through empirical experiments we show that a classification model trained using a diversity enhanced training set outperforms traditional data augmentation techniques and existing benchmark results. This work shows that increasing the diversity of a training dataset can improve classification model performance. Furthermore, this work provides evidence for the utility of synthetic patients providing larger, more robust datasets for both AI researchers and psychiatrists to explore variable relationships.

Submitted 25 April, 2024; originally announced April 2024.

arXiv:2404.16687 [pdf, other]

NTIRE 2024 Quality Assessment of AI-Generated Content Challenge

Authors: Xiaohong Liu, Xiongkuo Min, Guangtao Zhai, Chunyi Li, Tengchuan Kou, Wei Sun, Haoning Wu, Yixuan Gao, Yuqin Cao, Zicheng Zhang, Xiele Wu, Radu Timofte, Fei Peng, Huiyuan Fu, Anlong Ming, Chuanming Wang, Huadong Ma, Shuai He, Zifei Dou, Shu Chen, Huacong Zhang, Haiyi Xie, Chengwei Wang, Baoying Chen, Jishen Zeng, et al. (89 additional authors not shown)

Abstract: This paper reports on the NTIRE 2024 Quality Assessment of AI-Generated Content Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2024. The challenge addresses a major problem in the field of image and video processing, namely, Image Quality Assessment (IQA) and Video Quality Assessment (VQA) for AI-Generated Content (AIGC). The challenge is divided into the image track and the video track. The image track uses the AIGIQA-20K, which contains 20,000 AI-Generated Images (AIGIs) generated by 15 popular generative models. The image track has a total of 318 registered participants. A total of 1,646 submissions were received in the development phase, and 221 submissions were received in the test phase. Finally, 16 participating teams submitted their models and fact sheets. The video track uses the T2VQA-DB, which contains 10,000 AI-Generated Videos (AIGVs) generated by 9 popular Text-to-Video (T2V) models. A total of 196 participants registered in the video track. A total of 991 submissions were received in the development phase, and 185 submissions were received in the test phase. Finally, 12 participating teams submitted their models and fact sheets. Some methods have achieved better results than baseline methods, and the winning methods in both tracks have demonstrated superior prediction performance on AIGC.

Submitted 7 May, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

arXiv:2404.16076 [pdf, other]

Semantic Evolvement Enhanced Graph Autoencoder for Rumor Detection

Authors: Xiang Tao, Liang Wang, Qiang Liu, Shu Wu, Liang Wang

Abstract: Due to the rapid spread of rumors on social media, rumor detection has become an extremely important challenge. Recently, numerous rumor detection models which utilize textual information and the propagation structure of events have been proposed. However, these methods overlook the importance of semantic evolvement information of events in the propagation process, which is often challenging to learn in supervised training paradigms and traditional rumor detection methods. To address this issue, we propose a novel semantic evolvement enhanced Graph Autoencoder for Rumor Detection (GARD) model in this paper. The model learns semantic evolvement information of events by capturing local semantic changes and global semantic evolvement information through specific graph autoencoder and reconstruction strategies. By combining semantic evolvement information and propagation structure information, the model achieves a comprehensive understanding of event propagation and performs accurate and robust detection, while also detecting rumors earlier by capturing semantic evolvement information in the early stages. Moreover, in order to enhance the model's ability to learn the distinct patterns of rumors and non-rumors, we introduce a uniformity regularizer to further improve the model's performance. Experimental results on three public benchmark datasets confirm the superiority of our GARD method over the state-of-the-art approaches in both overall performance and early rumor detection.

Submitted 24 April, 2024; originally announced April 2024.

arXiv:2404.11001 [pdf]

Modulation of the Octahedral Structure and Potential Superconductivity of La$_3$Ni$_2$O$_7$ through Strain Engineering

Authors: Zihao Huo, Zhihui Luo, Peng Zhang, Aiqin Yang, Zhengtao Liu, Xiangru Tao, Zihan Zhang, Shumin Guo, Qiwen Jiang, Wenxuan Chen, Dao-Xin Yao, Defang Duan, Tian Cui

Abstract: Recent transport measurements of La$_3$Ni$_2$O$_7$ uncover a "right-triangle" shape of the superconducting dome in the pressure-temperature (P-T) phase diagram. Motivated by this, we perform theoretical first-principles studies of La$_3$Ni$_2$O$_7$ with the pressure ranging from 0 to 100 GPa. Notably, we reveal a pressure dependence of the Ni-$d_{z^2}$ electron density at the Fermi energy ($n_z^{EF}$) that closely coincides with this shape. On this basis, we further explore the electronic structure under uniaxial stress. By tracking the stress response of $n_z^{EF}$, we propose that superconductivity can be achieved by applying only about 2 GPa of compression along the c axis. The idea is further exemplified from the perspectives of lattice distortion, band structure, Fermi surface and superconducting phase coherence. We also discuss the possible charge modulation under the stress and provide insight into the relation between $n_z^{EF}$ and the superconducting $T_c$ in the La$_3$Ni$_2$O$_7$ system. Our study provides a helpful guide for future experiments.

Submitted 8 July, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

arXiv:2404.09619 [pdf, other]

UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark

Authors: Zhaokun Zhou, Qiulin Wang, Bin Lin, Yiwei Su, Rui Chen, Xin Tao, Amin Zheng, Li Yuan, Pengfei Wan, Di Zhang

Abstract: As an alternative to expensive expert evaluation, Image Aesthetic Assessment (IAA) stands out as a crucial task in computer vision. However, traditional IAA methods are typically constrained to a single data source or task, restricting the universality and broader application. In this work, to better align with human aesthetics, we propose a Unified Multi-modal Image Aesthetic Assessment (UNIAA) framework, including a Multi-modal Large Language Model (MLLM) named UNIAA-LLaVA and a comprehensive benchmark named UNIAA-Bench. We choose MLLMs with both visual perception and language ability for IAA and establish a low-cost paradigm for transforming the existing datasets into unified and high-quality visual instruction tuning data, from which the UNIAA-LLaVA is trained. To further evaluate the IAA capability of MLLMs, we construct the UNIAA-Bench, which consists of three aesthetic levels: Perception, Description, and Assessment. Extensive experiments validate the effectiveness and rationality of UNIAA. UNIAA-LLaVA achieves competitive performance on all levels of UNIAA-Bench, compared with existing MLLMs. Specifically, our model performs better than GPT-4V in aesthetic perception and even approaches the junior-level human. We find MLLMs have great potential in IAA, yet there remains plenty of room for further improvement. The UNIAA-LLaVA and UNIAA-Bench will be released.

Submitted 15 April, 2024; originally announced April 2024.

arXiv:2404.07484 [pdf]

Multimodal Emotion Recognition by Fusing Video Semantic in MOOC Learning Scenarios

Authors: Yuan Zhang, Xiaomei Tao, Hanxu Ai, Tao Chen, Yanling Gan

Abstract: In the Massive Open Online Courses (MOOC) learning scenario, the semantic information of instructional videos has a crucial impact on learners' emotional state. Learners mainly acquire knowledge by watching instructional videos, and the semantic information in the videos directly affects learners' emotional states. However, few studies have paid attention to the potential influence of the semantic information of instructional videos on learners' emotional states. To deeply explore the impact of video semantic information on learners' emotions, this paper innovatively proposes a multimodal emotion recognition method by fusing video semantic information and physiological signals. We generate video descriptions through a pre-trained large language model (LLM) to obtain high-level semantic information about instructional videos. Using the cross-attention mechanism for modal interaction, the semantic information is fused with the eye movement and PhotoPlethysmoGraphy (PPG) signals to obtain the features containing the critical information of the three modes. The accurate recognition of learners' emotional states is realized through the emotion classifier. The experimental results show that our method has significantly improved emotion recognition performance, providing a new perspective and efficient method for emotion recognition research in MOOC learning scenarios. The method proposed in this paper not only contributes to a deeper understanding of the impact of instructional videos on learners' emotional states but also provides a beneficial reference for future research on emotion recognition in MOOC learning scenarios.

Submitted 11 April, 2024; originally announced April 2024.

arXiv:2404.06756 [pdf, other]

CrimeAlarm: Towards Intensive Intent Dynamics in Fine-grained Crime Prediction

Authors: Kaixi Hu, Lin Li, Qing Xie, Xiaohui Tao, Guandong Xu

Abstract: Granularity and accuracy are two crucial factors for crime event prediction. Within fine-grained event classification, multiple criminal intents may appear alternately in preceding sequential events and progress differently in subsequent ones. Such intensive intent dynamics make it hard for trained models to capture unobserved intents, and thus lead to sub-optimal generalization performance, especially in the intertwining of numerous potential events. To capture comprehensive criminal intents, this paper proposes a fine-grained sequential crime prediction framework, CrimeAlarm, that is equipped with a novel mutual distillation strategy inspired by curriculum learning. During the early training phase, spot-shared criminal intents are captured through high-confidence sequence samples. In the later phase, spot-specific intents are gradually learned by increasing the contribution of low-confidence sequences. Meanwhile, the output probability distributions are reciprocally learned between prediction networks to model unobserved criminal intents. Extensive experiments show that CrimeAlarm outperforms state-of-the-art methods in terms of NDCG@5, with improvements of 4.51% for the NYC16 and 7.73% for the CHI18 in accuracy measures.

Submitted 10 April, 2024; originally announced April 2024.

Comments: Accepted by DASFAA 2024

arXiv:2404.06692 [pdf, other]

Perception-Oriented Video Frame Interpolation via Asymmetric Blending

Authors: Guangyang Wu, Xin Tao, Changlin Li, Wenyi Wang, Xiaohong Liu, Qingqing Zheng

Abstract: Previous methods for Video Frame Interpolation (VFI) have encountered challenges, notably the manifestation of blur and ghosting effects. These issues can be traced back to two pivotal factors: unavoidable motion errors and misalignment in supervision. In practice, motion estimates often prove to be error-prone, resulting in misaligned features. Furthermore, the reconstruction loss tends to produce blurry results, particularly in misaligned regions. To mitigate these challenges, we propose a new paradigm called PerVFI (Perception-oriented Video Frame Interpolation). Our approach incorporates an Asymmetric Synergistic Blending module (ASB) that utilizes features from both sides to synergistically blend intermediate features. One reference frame emphasizes primary content, while the other contributes complementary information. To impose a stringent constraint on the blending process, we introduce a self-learned sparse quasi-binary mask which effectively mitigates ghosting and blur artifacts in the output. Additionally, we employ a normalizing flow-based generator and utilize the negative log-likelihood loss to learn the conditional distribution of the output, which further facilitates the generation of clear and fine details. Experimental results validate the superiority of PerVFI, demonstrating significant improvements in perceptual quality compared to existing methods. Codes are available at https://github.com/mulns/PerVFI

Submitted 9 April, 2024; originally announced April 2024.

Comments: Accepted by CVPR 2024

arXiv:2404.05386 [pdf, other]

MealRec$^+$: A Meal Recommendation Dataset with Meal-Course Affiliation for Personalization and Healthiness

Authors: Ming Li, Lin Li, Xiaohui Tao, Jimmy Xiangji Huang

Abstract: Meal recommendation, as a typical health-related recommendation task, contains complex relationships between users, courses, and meals. Among them, meal-course affiliation associates user-meal and user-course interactions. However, an extensive literature review demonstrates that there is a lack of publicly available meal recommendation datasets including meal-course affiliation. Meal recommendation research has been constrained in exploring the impact of cooperation between two levels of interaction on personalization and healthiness. To pave the way for meal recommendation research, we introduce a new benchmark dataset called MealRec$^+$. Due to constraints related to user health privacy and meal scenario characteristics, the collection of data that includes both meal-course affiliation and two levels of interactions is impeded. Therefore, a simulation method is adopted to derive meal-course affiliation and user-meal interaction from the user's dining sessions simulated based on user-course interaction data. Then, two well-known nutritional standards are used to calculate the healthiness scores of meals. Moreover, we experiment with several baseline models, including separate and cooperative interaction learning methods. Our experiment demonstrates that cooperating the two levels of interaction in appropriate ways is beneficial for meal recommendations. Furthermore, in response to the less healthy recommendation phenomenon found in the experiment, we explore methods to enhance the healthiness of meal recommendations. The dataset is available on GitHub (https://github.com/WUT-IDEA/MealRecPlus).

Submitted 27 April, 2024; v1 submitted 8 April, 2024; originally announced April 2024.

Comments: Accepted by SIGIR 2024

arXiv:2403.20193 [pdf, other]

Motion Inversion for Video Customization

Authors: Luozhou Wang, Guibao Shen, Yixun Liang, Xin Tao, Pengfei Wan, Di Zhang, Yijun Li, Yingcong Chen

Abstract: In this research, we present a novel approach to motion customization in video generation, addressing the widespread gap in the thorough exploration of motion representation within video generative models. Recognizing the unique challenges posed by video's spatiotemporal nature, our method introduces Motion Embeddings, a set of explicit, temporally coherent one-dimensional embeddings derived from a given video. These embeddings are designed to integrate seamlessly with the temporal transformer modules of video diffusion models, modulating self-attention computations across frames without compromising spatial integrity. Our approach offers a compact and efficient solution to motion representation and enables complex manipulations of motion characteristics through vector arithmetic in the embedding space. Furthermore, we identify the Temporal Discrepancy in video generative models, which refers to variations in how different motion modules process temporal relationships between frames. We leverage this understanding to optimize the integration of our motion embeddings. Our contributions include the introduction of a tailored motion embedding for customization tasks, insights into the temporal processing differences in video models, and a demonstration of the practical advantages and effectiveness of our method through extensive experiments.

Submitted 29 March, 2024; originally announced March 2024.

Comments: Project Page: https://wileewang.github.io/MotionInversion/

arXiv:2403.17735 [pdf, other]

Out-of-distribution Rumor Detection via Test-Time Adaptation

Authors: Xiang Tao, Mingqing Zhang, Qiang Liu, Shu Wu, Liang Wang

Abstract: Due to the rapid spread of rumors on social media, rumor detection has become an extremely important challenge. Existing methods for rumor detection have achieved good performance, as they have collected enough corpus from the same data distribution for model training. However, significant distribution shifts between the training data and real-world test data occur due to differences in news topics, social media platforms, languages and the variance in propagation scale caused by news popularity. This leads to a substantial decline in the performance of these existing methods in Out-Of-Distribution (OOD) situations. To address this problem, we propose a simple and efficient method named Test-time Adaptation for Rumor Detection under distribution shifts (TARD). This method models the propagation of news in the form of a propagation graph, and builds a propagation graph test-time adaptation framework, enhancing the model's adaptability and robustness when facing OOD problems. Extensive experiments conducted on two groups of datasets collected from real-world social platforms demonstrate that our framework outperforms the state-of-the-art methods in performance.

Submitted 26 March, 2024; originally announced March 2024.

arXiv:2403.15758 [pdf, ps, other]

An endpoint estimate for the maximal Calderón commutator with rough kernel

Authors: Guoen Hu, Xudong Lai, Xiangxing Tao, Qingying Xue

Abstract: In this paper, the authors consider the endpoint estimates for the maximal Calderón commutator defined by $$T_{Ω,\,a}^*f(x)=\sup_{ε>0}\Big|\int_{|x-y|>ε}\frac{Ω(x-y)}{|x-y|^{d+1}} \big(a(x)-a(y)\big)f(y)dy\Big|,$$ where $Ω$ is homogeneous of degree zero, integrable on $S^{d-1}$ and has vanishing moment of order one, and $a$ is a function on $\mathbb{R}^d$ such that $\nabla a\in L^{\infty}(\mathbb{R}^d)$. The authors prove that if $Ω\in L\log L(S^{d-1})$, then $T^*_{Ω,\,a}$ satisfies an endpoint estimate of $L\log\log L$ type.

Submitted 14 April, 2024; v1 submitted 23 March, 2024; originally announced March 2024.

Comments: 25 pages

MSC Class: 42B20

arXiv:2403.15283 [pdf, other]

Discovery of superconductivity in technetium-borides at moderate pressures

Authors: Xiangru Tao, Aiqin Yang, Yundi Quan, Biao Wan, Shuxiang Yang, Peng Zhang

Abstract: Advances in theoretical calculations have boosted the search for high temperature superconductors, such as sulfur hydrides and rare-earth polyhydrides. However, the extremely high pressures required to stabilize these superconductors have handicapped further implementations. Based upon thorough structural searches, we identified a series of unprecedented superconducting technetium-borides at moderate pressures, including TcB (P6$_3$/mmc) with superconducting transition temperature $T_{\text{c}}$ = 20.2 K at ambient pressure and TcB$_2$ (P6/mmm) with $T_{\text{c}}$ = 23.1 K at 20 GPa. Superconductivity in these technetium-borides mainly originates from the coupling between the low frequency vibrations of technetium atoms and the dominant technetium-4d electrons at the Fermi level. Our work therefore presents a fresh group in the family of superconducting borides, whose diversified crystal structures suggest rich possibilities in the discovery of other superconducting transition-metal-borides.

Submitted 22 March, 2024; originally announced March 2024.

arXiv:2403.15234 [pdf, other]

Shadow Generation for Composite Image Using Diffusion model

Authors: Qingyang Liu, Junqi You, Jianting Wang, Xinhao Tao, Bo Zhang, Li Niu

Abstract: In the realm of image composition, generating a realistic shadow for the inserted foreground remains a formidable challenge. Previous works have developed image-to-image translation models which are trained on paired training data. However, they struggle to generate shadows with accurate shapes and intensities, hindered by data scarcity and inherent task complexity. In this paper, we resort to a foundation model with rich prior knowledge of natural shadow images. Specifically, we first adapt ControlNet to our task and then propose intensity modulation modules to improve the shadow intensity. Moreover, we extend the small-scale DESOBA dataset to DESOBAv2 using a novel data acquisition pipeline. Experimental results on both DESOBA and DESOBAv2 datasets as well as real composite images demonstrate the superior capability of our model for the shadow generation task. The dataset, code, and model are released at https://github.com/bcmi/Object-Shadow-Generation-Dataset-DESOBAv2.

Submitted 22 March, 2024; originally announced March 2024.

Comments: accepted by CVPR2024

arXiv:2403.12372 [pdf, other]

Learning Transferable Time Series Classifier with Cross-Domain Pre-training from Language Model

Authors: Mingyue Cheng, Xiaoyu Tao, Qi Liu, Hao Zhang, Yiheng Chen, Chenyi Lei

Abstract: Advancements in self-supervised pre-training (SSL) have significantly advanced the field of learning transferable time series representations, which can be very useful in enhancing the downstream task. Despite being effective, most existing works struggle to achieve cross-domain SSL pre-training, missing valuable opportunities to integrate patterns and features from different domains. The main challenge lies in the significant differences in the characteristics of time-series data across different domains, such as variations in the number of channels and temporal resolution scales. To address this challenge, we propose CrossTimeNet, a novel cross-domain SSL learning framework to learn transferable knowledge from various domains to largely benefit the target downstream task. One of the key characteristics of CrossTimeNet is the newly designed time series tokenization module, which can effectively convert the raw time series into a sequence of discrete tokens based on a reconstruction optimization process. Besides, we highlight that predicting a high proportion of corrupted tokens can be very helpful for extracting informative patterns across different domains during SSL pre-training, which has been largely overlooked in past years. Furthermore, unlike previous works, our work treats the pre-trained language model (PLM) as the initialization of the encoder network, investigating the feasibility of transferring the knowledge learned by the PLM to the time series area. Through these efforts, the path to cross-domain pre-training of a generic time series model can be effectively paved. We conduct extensive experiments in a real-world scenario across various time series classification domains. The experimental results clearly confirm CrossTimeNet's superior performance.

Submitted 18 March, 2024; originally announced March 2024.

arXiv:2403.09222 [pdf, other]

A Robust Semantic Communication System for Image

Authors: Xiang Peng, Zhijin Qin, Xiaoming Tao, Jianhua Lu, Khaled B. Letaief

Abstract: Semantic communications have gained significant attention as a promising approach to address the transmission bottleneck, especially with the continuous development of 6G techniques. Distinct from the well-investigated physical channel impairments, this paper focuses on semantic impairments in images, particularly those arising from adversarial perturbations. Specifically, we propose a novel metric for quantifying the intensity of semantic impairment and develop a semantic impairment dataset. Furthermore, we introduce a deep learning enabled semantic communication system, termed DeepSC-RI, to enhance the robustness of image transmission, which incorporates a multi-scale semantic extractor with a dual-branch architecture for extracting semantics with varying granularity, thereby improving the robustness of the system. The fine-grained branch incorporates a semantic importance evaluation module to identify and prioritize crucial semantics, while the coarse-grained branch adopts a hierarchical approach for capturing the robust semantics. These two streams of semantics are seamlessly integrated via an advanced cross-attention-based semantic fusion module. Experimental results demonstrate the superior performance of DeepSC-RI under various levels of semantic impairment intensity.

Submitted 14 March, 2024; originally announced March 2024.

Comments: 6 pages

arXiv:2403.09157 [pdf, ps, other]

VM-UNET-V2 Rethinking Vision Mamba UNet for Medical Image Segmentation

Authors: Mingya Zhang, Yue Yu, Limei Gu, Tingsheng Lin, Xianping Tao

Abstract: In the field of medical image segmentation, models based on both CNN and Transformer have been thoroughly investigated. However, CNNs have limited modeling capabilities for long-range dependencies, making it challenging to exploit the semantic information within images fully. On the other hand, the quadratic computational complexity poses a challenge for Transformers. Recently, State Space Models (SSMs), such as Mamba, have been recognized as a promising method. They not only demonstrate superior performance in modeling long-range interactions, but also preserve a linear computational complexity. Inspired by the Mamba architecture, we propose Vision Mamba-UNetV2, in which the Visual State Space (VSS) Block is introduced to capture extensive contextual information and the Semantics and Detail Infusion (SDI) is introduced to augment the infusion of low-level and high-level features. We conduct comprehensive experiments on the ISIC17, ISIC18, CVC-300, CVC-ClinicDB, Kvasir, CVC-ColonDB and ETIS-LaribPolypDB public datasets. The results indicate that VM-UNetV2 exhibits competitive performance in medical image segmentation tasks. Our code is available at https://github.com/nobodyplayer1/VM-UNetV2.

Submitted 14 March, 2024; originally announced March 2024.

Comments: 12 pages, 4 figures

arXiv:2403.02910 [pdf, other]

ImgTrojan: Jailbreaking Vision-Language Models with ONE Image

Authors: Xijia Tao, Shuai Zhong, Lei Li, Qi Liu, Lingpeng Kong

Abstract: There has been an increasing interest in the alignment of large language models (LLMs) with human values. However, the safety issues of their integration with a vision module, or vision language models (VLMs), remain relatively underexplored. In this paper, we propose a novel jailbreaking attack against VLMs, aiming to bypass their safety barrier when a user inputs harmful instructions. A scenario where our poisoned (image, text) data pairs are included in the training data is assumed. By replacing the original textual captions with malicious jailbreak prompts, our method can perform jailbreak attacks with the poisoned images. Moreover, we analyze the effect of poison ratios and positions of trainable parameters on our attack's success rate. For evaluation, we design two metrics to quantify the success rate and the stealthiness of our attack. Together with a list of curated harmful instructions, a benchmark for measuring attack efficacy is provided. We demonstrate the efficacy of our attack by comparing it with baseline methods.

Submitted 5 March, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

arXiv:2402.17417 [pdf, other]

CARZero: Cross-Attention Alignment for Radiology Zero-Shot Classification

Authors: Haoran Lai, Qingsong Yao, Zihang Jiang, Rongsheng Wang, Zhiyang He, Xiaodong Tao, S. Kevin Zhou

Abstract: The advancement of Zero-Shot Learning in the medical domain has been driven forward by using pre-trained models on large-scale image-text pairs, focusing on image-text alignment. However, existing methods primarily rely on cosine similarity for alignment, which may not fully capture the complex relationship between medical images and reports. To address this gap, we introduce a novel approach called Cross-Attention Alignment for Radiology Zero-Shot Classification (CARZero). Our approach innovatively leverages cross-attention mechanisms to process image and report features, creating a Similarity Representation that more accurately reflects the intricate relationships in medical semantics. This representation is then linearly projected to form an image-text similarity matrix for cross-modality alignment. Additionally, recognizing the pivotal role of prompt selection in zero-shot learning, CARZero incorporates a Large Language Model-based prompt alignment strategy. This strategy standardizes diverse diagnostic expressions into a unified format for both training and inference phases, overcoming the challenges of manual prompt design. Our approach is simple yet effective, demonstrating state-of-the-art performance in zero-shot classification on five official chest radiograph diagnostic test sets, including remarkable results on datasets with long-tail distributions of rare diseases. This achievement is attributed to our new image-text alignment strategy, which effectively addresses the complex relationship between medical images and reports. Code and models are available at https://github.com/laihaoran/CARZero.

Submitted 24 March, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

arXiv:2402.14718 [pdf, other]

Quantum Annealing Inspired Algorithms for Track Reconstruction at High Energy Colliders

Authors: Hideki Okawa, Qing-Guo Zeng, Xian-Zhe Tao, Man-Hong Yung

Abstract: Charged particle reconstruction or track reconstruction is one of the most crucial components of pattern recognition in high energy collider physics. It is known for enormous consumption of computing resources, especially when the particle multiplicity is high. This will indeed be the case at future colliders such as the High Luminosity Large Hadron Collider and Super Proton Proton Collider. Track reconstruction can be formulated as a quadratic unconstrained binary optimization (QUBO) problem, for which various quantum algorithms have been investigated and evaluated with both the quantum simulator and hardware. Simulated bifurcation algorithms are a set of quantum annealing inspired algorithms, and are serious competitors to quantum annealing, other Ising machines and their classical counterparts. In this study, we show that the simulated bifurcation algorithms can be employed for solving the particle tracking problem. As the simulated bifurcation algorithms run on classical computers and are suitable for parallel processing and usage of graphical processing units, they can handle significantly large data at high speed. These algorithms exhibit comparable or sometimes improved reconstruction efficiency and purity relative to simulated annealing, while the running time can be reduced by as much as four orders of magnitude. These results suggest that QUBO models together with the quantum annealing inspired algorithms are valuable for current and future particle tracking problems.

Submitted 22 February, 2024; originally announced February 2024.

Comments: 10 pages, 4 figures

arXiv:2402.13471 [pdf]

Thermal transport in a 2D amorphous material

Authors: Yuxi Wang, Xingxing Zhang, Wujuan Yan, Nianjie Liang, Haiyu He, Xinwei Tao, Ang Li, Fuwei Yang, Buxuan Li, Te-Huan Liu, Jia Zhu, Wu Zhou, Wei Wang, Lin Zhou, Bai Song

Abstract: Two-dimensional (2D) crystals proved revolutionary soon after graphene was discovered in 2004. However, 2D amorphous materials only became accessible in 2020 and remain largely unexplored. In particular, the thermophysical properties of amorphous materials are of great interest upon transition from 3D to 2D. Here, we probe thermal transport in 2D amorphous carbon. A cross-plane thermal conductivity ($κ$) down to 0.079 $\rm{Wm}^{-1}K^{-1}$ is measured for van der Waals stacked multilayers at room temperature, which is among the lowest reported to date. Meanwhile, an unexpectedly high in-plane $κ$ is obtained for freestanding monolayers which is a few times larger than what is predicted by conventional wisdom for 3D amorphous carbon with similar $\rm{sp}^{2}$ fraction. Our molecular dynamics simulations reveal the role of disorder and highlight the impact of dimensionality. Amorphous materials at the 2D limit open up new avenues for understanding and manipulating heat at the atomic scale.

Submitted 22 March, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

arXiv:2402.13073 [pdf, other]

Towards Intelligent Communications: Large Model Empowered Semantic Communications

Authors: Huiqiang Xie, Zhijin Qin, Xiaoming Tao, Zhu Han

Abstract: Deep learning enabled semantic communications have shown great potential to significantly improve transmission efficiency and alleviate spectrum scarcity, by effectively exchanging the semantics behind the data. Recently, the emergence of large models, boasting billions of parameters, has unveiled remarkable human-like intelligence, offering a promising avenue for advancing semantic communication by enhancing semantic understanding and contextual understanding. This article systematically investigates the large model-empowered semantic communication systems from potential applications to system design. First, we propose a new semantic communication architecture that seamlessly integrates large models into semantic communication through the introduction of a memory module. Then, the typical applications are illustrated to show the benefits of the new architecture. Besides, we discuss the key designs in implementing the new semantic communication systems from module design to system training. Finally, the potential research directions are identified to boost the large model-empowered semantic communications.

Submitted 19 March, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

Comments: 7 pages, 6 figures

arXiv:2402.12398 [pdf, other]

Primary and Secondary Factor Consistency as Domain Knowledge to Guide Happiness Computing in Online Assessment

Authors: Xiaohua Wu, Lin Li, Xiaohui Tao, Frank Xing, Jingling Yuan

Abstract: Happiness computing based on large-scale online web data and machine learning methods is an emerging research topic that underpins a range of issues, from personal growth to social stability. Many advanced Machine Learning (ML) models with explanations are used to compute the happiness online assessment while maintaining high accuracy of results. However, domain knowledge constraints, such as the primary and secondary relations of happiness factors, are absent from these models, which limits the association between computing results and the right reasons for why they occurred. This article attempts to provide new insights into the explanation consistency from an empirical study perspective. Then we study how to represent and introduce domain knowledge constraints to make ML models more trustworthy. We achieve this through: (1) proving that multiple prediction models with additive factor attributions will have the desirable property of primary and secondary relations consistency, and (2) showing that factor relations with quantity can be represented as an importance distribution for encoding domain knowledge. Factor explanation difference is penalized by the Kullback-Leibler divergence-based loss among computing models. Experimental results using two online web datasets show that domain knowledge of stable factor relations exists. Using this knowledge not only improves happiness computing accuracy but also reveals more significative happiness factors for assisting decisions well.

Submitted 17 February, 2024; originally announced February 2024.

Comments: 12 pages

arXiv:2402.10097 [pdf, other]

Adaptive Federated Learning in Heterogeneous Wireless Networks with Independent Sampling

Authors: Jiaxiang Geng, Yanzhao Hou, Xiaofeng Tao, Juncheng Wang, Bing Luo

Abstract: Federated Learning (FL) algorithms commonly sample a random subset of clients to address the straggler issue and improve communication efficiency. While recent works have proposed various client sampling methods, they have limitations in joint system and data heterogeneity design, which may not align with practical heterogeneous wireless networks. In this work, we advocate a new independent client sampling strategy to minimize the wall-clock training time of FL, while considering data heterogeneity and system heterogeneity in both communication and computation. We first derive a new convergence bound for non-convex loss functions with independent client sampling and then propose an adaptive bandwidth allocation scheme. Furthermore, we propose an efficient independent client sampling algorithm based on the upper bounds on the convergence rounds and the expected per-round training time, to minimize the wall-clock time of FL, while considering both the data and system heterogeneity. Experimental results under practical wireless network settings with real-world prototype demonstrate that the proposed independent sampling scheme substantially outperforms the current best sampling schemes under various training models and datasets.

Submitted 13 May, 2024; v1 submitted 15 February, 2024; originally announced February 2024.

Comments: 6 pages, 5 figures, accepted for publication in IEEE International Conference on Communications (ICC)

arXiv:2402.07225 [pdf, other]

Rethinking Graph Masked Autoencoders through Alignment and Uniformity

Authors: Liang Wang, Xiang Tao, Qiang Liu, Shu Wu, Liang Wang

Abstract: Self-supervised learning on graphs can be bifurcated into contrastive and generative methods. Contrastive methods, also known as graph contrastive learning (GCL), have dominated graph self-supervised learning in the past few years, but the recent advent of graph masked autoencoder (GraphMAE) rekindles the momentum behind generative methods. Despite the empirical success of GraphMAE, there is still a dearth of theoretical understanding regarding its efficacy. Moreover, while both generative and contrastive methods have been shown to be effective, their connections and differences have yet to be thoroughly investigated. Therefore, we theoretically build a bridge between GraphMAE and GCL, and prove that the node-level reconstruction objective in GraphMAE implicitly performs context-level GCL. Based on our theoretical analysis, we further identify the limitations of the GraphMAE from the perspectives of alignment and uniformity, which have been considered as two key properties of high-quality representations in GCL. We point out that GraphMAE's alignment performance is restricted by the masking strategy, and the uniformity is not strictly guaranteed. To remedy the aforementioned limitations, we propose an Alignment-Uniformity enhanced Graph Masked AutoEncoder, named AUG-MAE. Specifically, we propose an easy-to-hard adversarial masking strategy to provide hard-to-align samples, which improves the alignment performance. Meanwhile, we introduce an explicit uniformity regularizer to ensure the uniformity of the learned representations. Experimental results on benchmark datasets demonstrate the superiority of our model over existing state-of-the-art methods.

Submitted 11 February, 2024; originally announced February 2024.

Comments: Accepted by AAAI 2024

arXiv:2402.03916 [pdf, other]

Can Large Language Models Detect Rumors on Social Media?

Authors: Qiang Liu, Xiang Tao, Junfei Wu, Shu Wu, Liang Wang

Abstract: In this work, we investigate to use Large Language Models (LLMs) for rumor detection on social media. However, it is challenging for LLMs to reason over the entire propagation information on social media, which contains news contents and numerous comments, due to LLMs may not concentrate on key clues in the complex propagation information, and have trouble in reasoning when facing massive and redundant information. Accordingly, we propose an LLM-empowered Rumor Detection (LeRuD) approach, in which we design prompts to teach LLMs to reason over important clues in news and comments, and divide the entire propagation information into a Chain-of-Propagation for reducing LLMs' burden. We conduct extensive experiments on the Twitter and Weibo datasets, and LeRuD outperforms several state-of-the-art rumor detection models by 3.2% to 7.7%. Meanwhile, by applying LLMs, LeRuD requires no data for training, and thus shows more promising rumor detection ability in few-shot or zero-shot scenarios.

Submitted 8 February, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

arXiv:2402.02950 [pdf, other]

Semantic Entropy Can Simultaneously Benefit Transmission Efficiency and Channel Security of Wireless Semantic Communications

Authors: Yankai Rong, Guoshun Nan, Minwei Zhang, Sihan Chen, Songtao Wang, Xuefei Zhang, Nan Ma, Shixun Gong, Zhaohui Yang, Qimei Cui, Xiaofeng Tao, Tony Q. S. Quek

Abstract: Recently proliferated deep learning-based semantic communications (DLSC) focus on how transmitted symbols efficiently convey a desired meaning to the destination. However, the sensitivity of neural models and the openness of wireless channels cause the DLSC system to be extremely fragile to various malicious attacks. This inspires us to ask a question: "Can we further exploit the advantages of transmission efficiency in wireless semantic communications while also alleviating its security disadvantages?". Keeping this in mind, we propose SemEntropy, a novel method that answers the above question by exploring the semantics of data for both adaptive transmission and physical layer encryption. Specifically, we first introduce semantic entropy, which indicates the expectation of various semantic scores regarding the transmission goal of the DLSC. Equipped with such semantic entropy, we can dynamically assign informative semantics to Orthogonal Frequency Division Multiplexing (OFDM) subcarriers with better channel conditions in a fine-grained manner. We also use the entropy to guide semantic key generation to safeguard communications over open wireless channels. By doing so, both transmission efficiency and channel security can be simultaneously improved. Extensive experiments over various benchmarks show the effectiveness of the proposed SemEntropy. We discuss the reason why our proposed method benefits secure transmission of DLSC, and also give some interesting findings, e.g., SemEntropy can keep the semantic accuracy remain 95% with 60% less transmission.

Submitted 6 February, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

Comments: 13 pages, 12 figures

arXiv:2401.17575 [pdf, other]

Can We Improve Channel Reciprocity via Loop-back Compensation for RIS-assisted Physical Layer Key Generation

Authors: Ningya Xu, Guoshun Nan, Xiaofeng Tao, Na Li, Pengxuan Mao, Tianyuan Yang

Abstract: Reconfigurable intelligent surface (RIS) facilitates the extraction of unpredictable channel features for physical layer key generation (PKG), securing communications among legitimate users with symmetric keys. Previous works have demonstrated that channel reciprocity plays a crucial role in generating symmetric keys in PKG systems, whereas, in reality, reciprocity is greatly affected by hardware interference and RIS-based jamming attacks. This motivates us to propose LoCKey, a novel approach that aims to improve channel reciprocity by mitigating interferences and attacks with a loop-back compensation scheme, thus maximizing the secrecy performance of the PKG system. Specifically, our proposed LoCKey is capable of effectively compensating for the CSI non-reciprocity by the combination of transmit-back signal value and error minimization module. Firstly, we introduce the entire flowchart of our method and provide an in-depth discussion of each step. Following that, we delve into a theoretical analysis of the performance optimizations when our LoCKey is applied for CSI reciprocity enhancement. Finally, we conduct experiments to verify the effectiveness of the proposed LoCKey in improving channel reciprocity under various interferences for RIS-assisted wireless communications. The results demonstrate a significant improvement in both the rate of key generation assisted by the RIS and the consistency of the generated keys, showing great potential for the practical deployment of our LoCKey in future wireless systems.

Submitted 30 April, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

Comments: Accepted by ICC 2024

arXiv:2401.15444 [pdf, other]

Towards Causal Classification: A Comprehensive Study on Graph Neural Networks

Authors: Simi Job, Xiaohui Tao, Taotao Cai, Lin Li, Haoran Xie, Jianming Yong

Abstract: The exploration of Graph Neural Networks (GNNs) for processing graph-structured data has expanded, particularly their potential for causal analysis due to their universal approximation capabilities. Anticipated to significantly enhance common graph-based tasks such as classification and prediction, the development of a causally enhanced GNN framework is yet to be thoroughly investigated. Addressing this shortfall, our study delves into nine benchmark graph classification models, testing their strength and versatility across seven datasets spanning three varied domains to discern the impact of causality on the predictive prowess of GNNs. This research offers a detailed assessment of these models, shedding light on their efficiency, and flexibility in different data environments, and highlighting areas needing advancement. Our findings are instrumental in furthering the understanding and practical application of GNNs in diverse datacentric fields.

Submitted 27 January, 2024; originally announced January 2024.

arXiv:2401.13425 [pdf, ps, other]

Two-dimensional ferromagnetic semiconductor Cr2XP: First-principles calculations and Monte Carlo simulations

Authors: Xiao-Ping Wei, Lan-Lan Du, Jiang-Liu Meng, Xiaoma Tao

Abstract: According to the Mermin Wagner theorem, two-dimensional material is difficult to have the Curie temperature above room temperature. By using the method of band engineering, we design a promising two-dimensional ferromagnetic semiconductor Cr2XP (X=P, As, Sb) with large magnetization, high Curie temperature and sizable band gap. The formation of gap is discussed in terms of the hybridizations, occupation and distribution of electronic states and charge transfer. Large magnetic moments about 6.16~6.37uB origin from the occupation of Cr-d electrons in crystal field. Competition and cooperation between d-d (Cr-d~Cr-d) and d-p-d (Cr-d~X-p~Cr-d) exchange interactions lead to the emergence of ferromagnetic ordering phase. Furthermore, Curie temperatures, approaching to 269 K, 332 K and 400 K for Cr2P2, Cr2AsP and Cr2SbP, are estimated by employing Monte Carlo simulation based on the Heisenberg model. Magnetic anisotropy energy of Cr2XP is determined by calculating the total energy dependence on the angle along different directions, and the origin is also discussed by the second-order perturbation theory. In addition, the Cr2XP possesses excellent thermodynamical, dynamical and mechanical stabilities, and can overcome their own gravity to keep their planar structure without the support of substrate. These above-mentioned advantages will offer some valuable hints for two-dimensional ferromagnetic semiconductor Cr2XP in spintronic devices.

Submitted 24 January, 2024; originally announced January 2024.

arXiv:2401.12483 [pdf, other]

Persona-centric Metamorphic Relation guided Robustness Evaluation for Multi-turn Dialogue Modelling

Authors: Yanbing Chen, Lin Li, Xiaohui Tao, Dong Zhou

Abstract: Recently there has been significant progress in the field of dialogue system thanks to the introduction of training paradigms such as fine-tune and prompt learning. Persona can function as the prior knowledge for maintaining the personality consistency of dialogue systems, which makes it perform well on accuracy. Nonetheless, the conventional reference-based evaluation method falls short in capturing the genuine text comprehension prowess of the model, significantly relying on the quality of data annotation. In contrast, the application of metamorphic testing offers a more profound insight into the model's distinct capabilities without necessitating supplementary annotation labels. This approach furnishes a more comprehensive portrayal of the model's intricacies and exposes intricacies concealed within reference-based validation techniques. Consequently, we introduce a persona-centric metamorphic relation construction for metamorphic testing, aimed at evaluating both the persona consistency and robustness of personalized dialogue models. For that reason, this work evaluates several widely used training paradigms including learning from scratch, pretrain + fine-tune and prompt learning in personalized dialogue retrieval to know if they are more robust or if they have the same flaws as their predecessor. Under three kinds of designed metamorphic relations with consistent outputs, our experimental results reveal that prompt learning shows stronger robustness compared to training from scratch and fine-tune. Although tested retrieval models gain competitively high retrieval accuracy according to the traditional reference-based validation, they are still fragile and demonstrate various unexpected behaviors, thus there is still room for future improvement in personalized dialogue retrieval.

Submitted 22 January, 2024; originally announced January 2024.

arXiv:2401.00859 [pdf, ps, other]

Federated Multi-View Synthesizing for Metaverse

Authors: Yiyu Guo, Zhijin Qin, Xiaoming Tao, Geoffrey Ye Li

Abstract: The metaverse is expected to provide immersive entertainment, education, and business applications. However, virtual reality (VR) transmission over wireless networks is data- and computation-intensive, making it critical to introduce novel solutions that meet stringent quality-of-service requirements. With recent advances in edge intelligence and deep learning, we have developed a novel multi-view synthesizing framework that can efficiently provide computation, storage, and communication resources for wireless content delivery in the metaverse. We propose a three-dimensional (3D)-aware generative model that uses collections of single-view images. These single-view images are transmitted to a group of users with overlapping fields of view, which avoids massive content transmission compared to transmitting tiles or whole 3D models. We then present a federated learning approach to guarantee an efficient learning process. The training performance can be improved by characterizing the vertical and horizontal data samples with a large latent feature space, while low-latency communication can be achieved with a reduced number of transmitted parameters during federated learning. We also propose a federated transfer learning framework to enable fast domain adaptation to different target domains. Simulation results have demonstrated the effectiveness of our proposed federated multi-view synthesizing framework for VR content delivery.

Submitted 18 December, 2023; originally announced January 2024.

arXiv:2312.16418 [pdf, other]

Refining Latent Homophilic Structures over Heterophilic Graphs for Robust Graph Convolution Networks

Authors: Chenyang Qiu, Guoshun Nan, Tianyu Xiong, Wendi Deng, Di Wang, Zhiyang Teng, Lijuan Sun, Qimei Cui, Xiaofeng Tao

Abstract: Graph convolution networks (GCNs) are extensively utilized in various graph tasks to mine knowledge from spatial data. Our study marks the pioneering attempt to quantitatively investigate the GCN robustness over omnipresent heterophilic graphs for node classification. We uncover that the predominant vulnerability is caused by the structural out-of-distribution (OOD) issue. This finding motivates us to present a novel method that aims to harden GCNs by automatically learning Latent Homophilic Structures over heterophilic graphs. We term such a methodology as LHS. To elaborate, our initial step involves learning a latent structure by employing a novel self-expressive technique based on multi-node interactions. Subsequently, the structure is refined using a pairwisely constrained dual-view contrastive learning approach. We iteratively perform the above procedure, enabling a GCN model to aggregate information in a homophilic way on heterophilic graphs. Armed with such an adaptable structure, we can properly mitigate the structural OOD threats over heterophilic graphs. Experiments on various benchmarks show the effectiveness of the proposed LHS approach for robust GCNs.

Submitted 27 December, 2023; originally announced December 2023.

Comments: To appear in the proceedings of AAAI-2024

arXiv:2312.16023 [pdf, other]

DocMSU: A Comprehensive Benchmark for Document-level Multimodal Sarcasm Understanding

Authors: Hang Du, Guoshun Nan, Sicheng Zhang, Binzhu Xie, Junrui Xu, Hehe Fan, Qimei Cui, Xiaofeng Tao, Xudong Jiang

Abstract: Multimodal Sarcasm Understanding (MSU) has a wide range of applications in the news field such as public opinion analysis and forgery detection. However, existing MSU benchmarks and approaches usually focus on sentence-level MSU. In document-level news, sarcasm clues are sparse or small and are often concealed in long text. Moreover, compared to sentence-level comments like tweets, which mainly focus on only a few trends or hot topics (e.g., sports events), content in the news is considerably diverse. Models created for sentence-level MSU may fail to capture sarcasm clues in document-level news. To fill this gap, we present a comprehensive benchmark for Document-level Multimodal Sarcasm Understanding (DocMSU). Our dataset contains 102,588 pieces of news with text-image pairs, covering 9 diverse topics such as health, business, etc. The proposed large-scale and diverse DocMSU significantly facilitates the research of document-level MSU in real-world scenarios. To take on the new challenges posed by DocMSU, we introduce a fine-grained sarcasm comprehension method to properly align the pixel-level image features with word-level textual features in documents. Experiments demonstrate the effectiveness of our method, showing that it can serve as a baseline approach to the challenging DocMSU. Our code and dataset are available at https://github.com/Dulpy/DocMSU.

Submitted 26 December, 2023; originally announced December 2023.

arXiv:2312.13316 [pdf, other]

ECAMP: Entity-centered Context-aware Medical Vision Language Pre-training

Authors: Rongsheng Wang, Qingsong Yao, Haoran Lai, Zhiyang He, Xiaodong Tao, Zihang Jiang, S. Kevin Zhou

Abstract: Despite significant advancements in medical vision-language pre-training, existing methods have largely overlooked the inherent entity-specific context within radiology reports and the complex cross-modality contextual relationships between text and images. To close this gap, we propose a novel Entity-centered Context-aware Medical Vision-language Pre-training (ECAMP) framework, which is designed to enable a more entity-centered and context-sensitive interpretation of medical data. Utilizing the recent powerful large language model, we distill entity-centered context from medical reports, which enables ECAMP to gain more effective supervision from the text modality. By further pre-training our model with carefully designed entity-aware, context-enhanced masked language modeling and context-guided super-resolution tasks, ECAMP significantly refines the interplay between text and image modalities, leading to an enhanced ability to extract entity-centered contextual features. Besides, our proposed multi-scale context fusion design also improves the semantic integration of both coarse and fine-level image representations, prompting better performance for multi-scale downstream applications. Combining these components leads to significant performance leaps over current state-of-the-art methods and establishes a new standard for cross-modality learning in medical imaging, whose effectiveness is demonstrated by our extensive experiments on various tasks including classification, segmentation, and detection across several public datasets. Code and models are available at https://github.com/ToniChopp/ECAMP.

Submitted 19 March, 2024; v1 submitted 20 December, 2023; originally announced December 2023.

arXiv:2312.13305 [pdf, other]

DVIS++: Improved Decoupled Framework for Universal Video Segmentation

Authors: Tao Zhang, Xingye Tian, Yikang Zhou, Shunping Ji, Xuebo Wang, Xin Tao, Yuan Zhang, Pengfei Wan, Zhongyuan Wang, Yu Wu

Abstract: We present the Decoupled VIdeo Segmentation (DVIS) framework, a novel approach for the challenging task of universal video segmentation, including video instance segmentation (VIS), video semantic segmentation (VSS), and video panoptic segmentation (VPS). Unlike previous methods that model video segmentation in an end-to-end manner, our approach decouples video segmentation into three cascaded sub-tasks: segmentation, tracking, and refinement. This decoupling design allows for simpler and more effective modeling of the spatio-temporal representations of objects, especially in complex scenes and long videos. Accordingly, we introduce two novel components: the referring tracker and the temporal refiner. These components track objects frame by frame and model spatio-temporal representations based on pre-aligned features. To improve the tracking capability of DVIS, we propose a denoising training strategy and introduce contrastive learning, resulting in a more robust framework named DVIS++. Furthermore, we evaluate DVIS++ in various settings, including open vocabulary and using a frozen pre-trained backbone. By integrating CLIP with DVIS++, we present OV-DVIS++, the first open-vocabulary universal video segmentation framework. We conduct extensive experiments on six mainstream benchmarks, including the VIS, VSS, and VPS datasets. Using a unified architecture, DVIS++ significantly outperforms state-of-the-art specialized methods on these benchmarks in both close- and open-vocabulary settings. Code: https://github.com/zhang-tao-whu/DVIS_Plus.

Submitted 19 December, 2023; originally announced December 2023.

arXiv:2312.12338 [pdf, other]

Smart Connected Farms and Networked Farmers to Tackle Climate Challenges Impacting Agricultural Production

Authors: Behzad J. Balabaygloo, Barituka Bekee, Samuel W. Blair, Suzanne Fey, Fateme Fotouhi, Ashish Gupta, Kevin Menke, Anusha Vangala, Jorge C. M. Palomares, Aaron Prestholt, Vishesh K. Tanwar, Xu Tao, Matthew E. Carroll, Sajal Das, Gil Depaula, Peter Kyveryga, Soumik Sarkar, Michelle Segovia, Simone Sylvestri, Corinne Valdivia, Asheesh K. Singh

Abstract: To meet the grand challenges of agricultural production, including climate change impacts on crop production, a tight integration of social science, technology and agriculture experts, including farmers, is needed. There are rapid advances in information and communication technology, precision agriculture and data analytics, which are creating a fertile field for the creation of smart connected farms (SCF) and networked farmers. A connected and coordinated farmer network provides unique advantages to farmers to enhance farm production and profitability, while tackling adverse climate events. The aim of this article is to provide a comprehensive overview of the state of the art in SCF, including the advances in engineering, computer sciences, data sciences, social sciences and economics, as well as data privacy, sharing and technology adoption.

Submitted 19 December, 2023; originally announced December 2023.

arXiv:2312.12148 [pdf, other]

Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment

Authors: Lingling Xu, Haoran Xie, Si-Zhao Joe Qin, Xiaohui Tao, Fu Lee Wang

Abstract: With the continuous growth in the number of parameters of transformer-based pretrained language models (PLMs), particularly the emergence of large language models (LLMs) with billions of parameters, many natural language processing (NLP) tasks have demonstrated remarkable success. However, the enormous size and computational demands of these models pose significant challenges for adapting them to specific downstream tasks, especially in environments with limited computational resources. Parameter Efficient Fine-Tuning (PEFT) offers an effective solution by reducing the number of fine-tuning parameters and memory usage while achieving comparable performance to full fine-tuning. The demands for fine-tuning PLMs, especially LLMs, have led to a surge in the development of PEFT methods, as depicted in Fig. 1. In this paper, we present a comprehensive and systematic review of PEFT methods for PLMs. We summarize these PEFT methods, discuss their applications, and outline future directions. Furthermore, we conduct experiments using several representative PEFT methods to better understand their effectiveness in parameter efficiency and memory efficiency. By offering insights into the latest advancements and practical applications, this survey serves as an invaluable resource for researchers and practitioners seeking to navigate the challenges and opportunities presented by PEFT in the context of PLMs.

Submitted 19 December, 2023; originally announced December 2023.

Comments: 20 pages, 4 figures

arXiv:2312.11391 [pdf, other]

FedCompetitors: Harmonious Collaboration in Federated Learning with Competing Participants

Authors: Shanli Tan, Hao Cheng, Xiaohu Wu, Han Yu, Tiantian He, Yew-Soon Ong, Chongjun Wang, Xiaofeng Tao

Abstract: Federated learning (FL) provides a privacy-preserving approach for collaborative training of machine learning models. Given the potential data heterogeneity, it is crucial to select appropriate collaborators for each FL participant (FL-PT) based on data complementarity. Recent studies have addressed this challenge. Similarly, it is imperative to consider the inter-individual relationships among FL-PTs where some FL-PTs engage in competition. Although FL literature has acknowledged the significance of this scenario, practical methods for establishing FL ecosystems remain largely unexplored. In this paper, we extend a principle from the balance theory, namely "the friend of my enemy is my enemy", to ensure the absence of conflicting interests within an FL ecosystem. The extended principle and the resulting problem are formulated via graph theory and integer linear programming. A polynomial-time algorithm is proposed to determine the collaborators of each FL-PT. The solution guarantees high scalability, allowing even competing FL-PTs to smoothly join the ecosystem without conflict of interest. The proposed framework jointly considers competition and data heterogeneity. Extensive experiments on real-world and synthetic data demonstrate its efficacy compared to five alternative approaches, and its ability to establish efficient collaboration networks among FL-PTs.

Submitted 18 December, 2023; originally announced December 2023.

Comments: Accepted to AAAI-2024

Showing 1–50 of 230 results for author: Tao, X