Baryon Acoustic Oscillations analyses with Density-Split Statistics

Authors: Tengpeng Xu, Yan-Chuan Cai, Yun Chen, Mark Neyrinck, Liang Gao, Qiao Wang

Abstract: Accurate modeling for the evolution of the Baryon Acoustic Oscillations (BAO) is essential for using it as a standard ruler to probe cosmology. We explore the non-linearity of the BAO in different environments using the density-split statistics and compare them to the case of the conventional two-point correlation function (2PCF). We detect density-dependent shifts for the position of the BAO with respect to its linear version using halos from N-body simulations. Around low/high-densities, the scale of the BAO expands/contracts due to non-linear peculiar velocities. As the simulation evolves from redshift 1 to 0, the difference in the magnitude of the shifts between high- and low-density regions increases from the sub-percent to the percent level. In contrast, the scale of the BAO does not evolve in the total 2PCF in the same redshift range. The width of the BAO around high density regions increases as the universe evolves, similar to the known broadening of the BAO in the 2PCF due to non-linear evolution. In contrast, the width is smaller and stable for low density regions. We discuss possible implications for the reconstructions of the BAO in light of our results.

Submitted 2 July, 2024; originally announced July 2024.

Comments: 16 pages, 10 figures

arXiv:2407.01675 [pdf, other]

Hawking Radiation of Nonrelativistic Scalars: Applications to Pion and Axion Production

Authors: Hao-Ran Cui, Yuhsin Tsai, Tao Xu

Abstract: In studying secondary gamma-ray emissions from Primordial Black Holes (PBHs), the production of scalar particles like pions and axion-like particles (ALPs) via Hawking radiation is crucial. While previous analyses assumed relativistic production, asteroid-mass PBHs, relevant to upcoming experiments like AMEGO-X, likely produce pions and ALPs non-relativistically when their masses exceed 10 MeV. To account for mass dependence in Hawking radiation, we revisit the greybody factors for massive scalars from Schwarzschild black holes, revealing significant mass corrections to particle production rates compared to the projected AMEGO-X sensitivity. We highlight the importance of considering non-relativistic $\pi^0$ production in interpreting PBH gamma-ray signals, essential for determining PBH properties. Additionally, we comment on the potential suppression of pion production due to form factor effects when producing extended objects via Hawking radiation. We also provide an example code for calculating the Hawking radiation spectrum of massive scalar particles.

Submitted 1 July, 2024; originally announced July 2024.

Comments: 16+2 pages, 8 figures. The numerical code is available at https://github.com/Haoran-Brook/HoRNS

arXiv:2407.01530 [pdf, other]

xLSTM-UNet can be an Effective 2D & 3D Medical Image Segmentation Backbone with Vision-LSTM (ViL) better than its Mamba Counterpart

Authors: Tianrun Chen, Chaotao Ding, Lanyun Zhu, Tao Xu, Deyi Ji, Yan Wang, Ying Zang, Zejian Li

Abstract: Convolutional Neural Networks (CNNs) and Vision Transformers (ViT) have been pivotal in biomedical image segmentation, yet their ability to manage long-range dependencies remains constrained by inherent locality and computational overhead. To overcome these challenges, in this technical report, we first propose xLSTM-UNet, a UNet structured deep learning neural network that leverages Vision-LSTM (xLSTM) as its backbone for medical image segmentation. xLSTM was recently proposed as the successor of Long Short-Term Memory (LSTM) networks and has demonstrated superior performance compared to Transformers and State Space Models (SSMs) like Mamba in Natural Language Processing (NLP) and image classification (as demonstrated in the Vision-LSTM, or ViL, implementation). Here, the xLSTM-UNet we designed extends this success to the biomedical image segmentation domain. By integrating the local feature extraction strengths of convolutional layers with the long-range dependency capturing abilities of xLSTM, xLSTM-UNet offers a robust solution for comprehensive image analysis. We validate the efficacy of xLSTM-UNet through experiments. Our findings demonstrate that xLSTM-UNet consistently surpasses the performance of leading CNN-based, Transformer-based, and Mamba-based segmentation networks on multiple biomedical segmentation datasets, including organs in abdomen MRI, instruments in endoscopic images, and cells in microscopic images. With comprehensive experiments performed, this technical report highlights the potential of xLSTM-based architectures in advancing biomedical image analysis in both 2D and 3D. The code, models, and datasets are publicly available at http://tianrun-chen.github.io/xLSTM-UNet/

Submitted 2 July, 2024; v1 submitted 1 July, 2024; originally announced July 2024.

arXiv:2407.01511 [pdf, other]

CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents

Authors: Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Philip Torr, Bernard Ghanem, Guohao Li

Abstract: The development of autonomous agents increasingly relies on Multimodal Language Models (MLMs) to perform tasks described in natural language with GUI environments, such as websites, desktop computers, or mobile phones. Existing benchmarks for MLM agents in interactive environments are limited by their focus on a single environment, lack of detailed and generalized evaluation methods, and the complexities of constructing tasks and evaluators. To overcome these limitations, we introduce Crab, the first agent benchmark framework designed to support cross-environment tasks, incorporating a graph-based fine-grained evaluation method and an efficient mechanism for task and evaluator construction. Our framework supports multiple devices and can be easily extended to any environment with a Python interface. Leveraging Crab, we developed a cross-platform Crab Benchmark-v0 comprising 100 tasks in computer desktop and mobile phone environments. We evaluated four advanced MLMs using different single and multi-agent system configurations on this benchmark. The experimental results demonstrate that the single agent with GPT-4o achieves the best completion ratio of 35.26%. All framework code, agent code, and task datasets are publicly available at https://github.com/camel-ai/crab.

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2406.18838 [pdf]

Electric-field control of the perpendicular magnetization switching in ferroelectric/ferrimagnet heterostructures

Authors: Pengfei Liu, Tao Xu, Qi Liu, Juncai Dong, Ting Lin, Qinhua Zhang, Xiukai Lan, Yu Sheng, Chunyu Wang, Jiajing Pei, Hongxin Yang, Lin Gu, Kaiyou Wang

Abstract: Electric field control of the magnetic state in ferrimagnets holds great promise for developing spintronic devices due to low power consumption. Here, we demonstrate a non-volatile reversal of perpendicular net magnetization in a ferrimagnet by manipulating the electric-field driven polarization within the Pb(Zr0.2Ti0.8)O3 (PZT)/CoGd heterostructure. Electron energy loss spectra and X-ray absorption spectra directly verify that the oxygen ion migration at the PZT/CoGd interface associated with reversing the polarization causes the enhanced/reduced oxidation in CoGd. Ab initio calculations further substantiate that the migrated oxygen ions can modulate the relative magnetization of Co/Gd sublattices, facilitating perpendicular net magnetization switching. Our findings offer an approach to effectively control ferrimagnetic net magnetization, holding significant implications for ferrimagnetic spintronic applications.

Submitted 26 June, 2024; originally announced June 2024.

Comments: 21 pages, 4 figures

arXiv:2406.18548 [pdf]

Exploration of Multi-Scale Image Fusion Systems in Intelligent Medical Image Analysis

Authors: Yuxiang Hu, Haowei Yang, Ting Xu, Shuyao He, Jiajie Yuan, Haozhang Deng

Abstract: The diagnosis of brain cancer relies heavily on medical imaging techniques, with MRI being the most commonly used. It is necessary to perform automatic segmentation of brain tumors on MRI images. This project intends to build an MRI algorithm based on U-Net. The residual network and the module used to enhance the context information are combined, and the void space convolution pooling pyramid is added to the network for processing. The brain glioma MRI image dataset provided by cancer imaging archives was experimentally verified. A multi-scale segmentation method based on a weighted least squares filter was used to complete the 3D reconstruction of brain tumors. Thus, the accuracy of three-dimensional reconstruction is further improved. Experiments show that the local texture features obtained by the proposed algorithm are similar to those obtained by laser scanning. The algorithm is improved by using the U-Net method and an accuracy of 0.9851 is obtained. This approach significantly enhances the precision of image segmentation and boosts the efficiency of image classification.

Submitted 23 May, 2024; originally announced June 2024.

arXiv:2406.17334 [pdf, other]

doi 10.1109/TSC.2023.3326539

Joint Admission Control and Resource Allocation of Virtual Network Embedding via Hierarchical Deep Reinforcement Learning

Authors: Tianfu Wang, Li Shen, Qilin Fan, Tong Xu, Tongliang Liu, Hui Xiong

Abstract: As an essential resource management problem in network virtualization, virtual network embedding (VNE) aims to allocate the finite resources of physical network to sequentially arriving virtual network requests (VNRs) with different resource demands. Since this is an NP-hard combinatorial optimization problem, many efforts have been made to provide viable solutions. However, most existing approaches have either ignored the admission control of VNRs, which has a potential impact on long-term performances, or not fully exploited the temporal and topological features of the physical network and VNRs. In this paper, we propose a deep Hierarchical Reinforcement Learning approach to learn a joint Admission Control and Resource Allocation policy for VNE, named HRL-ACRA. Specifically, the whole VNE process is decomposed into an upper-level policy for deciding whether to admit the arriving VNR or not and a lower-level policy for allocating resources of the physical network to meet the requirement of VNR through the HRL approach. Considering the proximal policy optimization as the basic training algorithm, we also adopt the average reward method to address the infinite horizon problem of the upper-level agent and design a customized multi-objective intrinsic reward to alleviate the sparse reward issue of the lower-level agent. Moreover, we develop a deep feature-aware graph neural network to capture the features of VNR and physical network and exploit a sequence-to-sequence model to generate embedding actions iteratively. Finally, extensive experiments are conducted in various settings, and show that HRL-ACRA outperforms state-of-the-art baselines in terms of both the acceptance ratio and long-term average revenue. Our code is available at https://github.com/GeminiLight/hrl-acra.

Submitted 25 June, 2024; originally announced June 2024.

Comments: Accepted by IEEE Transactions on Services Computing (TSC)

Journal ref: IEEE Transactions on Services Computing (Volume: 17, Issue: 3, May-June 2024)

arXiv:2406.16069 [pdf, other]

FastMem: Fast Memorization of Prompt Improves Context Awareness of Large Language Models

Authors: Junyi Zhu, Shuochen Liu, Yu Yu, Bo Tang, Yibo Yan, Zhiyu Li, Feiyu Xiong, Tong Xu, Matthew B. Blaschko

Abstract: Large language models (LLMs) excel in generating coherent text, but they often struggle with context awareness, leading to inaccuracies in tasks requiring faithful adherence to provided information. We introduce FastMem, a novel method designed to enhance instruction fine-tuned LLMs' context awareness through fast memorization of the prompt. FastMem maximizes the likelihood of the prompt before inference by fine-tuning only the last Feed-Forward Network (FFN) module. This targeted approach ensures efficient optimization without overfitting, significantly improving the model's ability to comprehend and accurately follow the context. Our experiments demonstrate substantial gains in reading comprehension, text summarization and adherence to output structures. For instance, FastMem improves the accuracy of Llama 3-8B-Inst on the NQ-SWAP dataset from 59.1% to 71.6%, and reduces the output structure failure rate of Qwen 1.5-4B-Chat from 34.9% to 25.5%. Extensive experimental results highlight FastMem's potential to offer a robust solution to enhance the reliability and accuracy of LLMs in various applications. Our code is available at: https://github.com/IAAR-Shanghai/FastMem

Submitted 23 June, 2024; originally announced June 2024.

arXiv:2406.15969 [pdf, other]

Single Element Error Correction in a Euclidean Distance Matrix

Authors: Abdo Alfakih, Woosuk L. Jung, Henry Wolkowicz, Tina Xu

Abstract: We consider the exact error correction of a noisy Euclidean distance matrix (EDM), where the elements are the squared distances between $n$ points in $R^d$. For our problem we are given two facts: (i) the embedding dimension, $d$, and (ii) exactly one distance in the data is corrupted by nonzero noise. But we do not know the magnitude or the position of the noise. Thus there is a combinatorial element to the problem. We present three solution techniques. These use three divide and conquer strategies in combination with three versions of facial reduction that use: exposing vectors, facial vectors, and Gale transforms. This sheds light on the connections between the various forms of facial reduction related to Gale transforms. Our highly successful empirics confirm the success of these approaches as we can solve huge problems of the order of $100,000$ nodes in approximately one minute to machine precision. Our algorithm depends on identifying whether a principal submatrix of the EDM contains the corrupted element. We provide a theorem for doing this that is related to the existing results for identifying yielding elements, i.e., we provide a characterization for guaranteeing the perturbed EDM remains an EDM with embedding dimension $d$. The characterization is particularly simple in the $d=2$ case. In addition, we characterize when the intuitive approach of the nearest EDM problem solves our problem. In fact, we show that this happens if, and only if, the original distance element is $0$, degenerate, and the perturbation is negative.

Submitted 22 June, 2024; originally announced June 2024.

MSC Class: 51K05; 90C26; 90C46; 65K10; 15A48; 90C22

arXiv:2406.14979 [pdf, other]

Retrieve-Plan-Generation: An Iterative Planning and Answering Framework for Knowledge-Intensive LLM Generation

Authors: Yuanjie Lyu, Zihan Niu, Zheyong Xie, Chao Zhang, Tong Xu, Yang Wang, Enhong Chen

Abstract: Despite the significant progress of large language models (LLMs) in various tasks, they often produce factual errors due to their limited internal knowledge. Retrieval-Augmented Generation (RAG), which enhances LLMs with external knowledge sources, offers a promising solution. However, these methods can be misled by irrelevant paragraphs in retrieved documents. Due to the inherent uncertainty in LLM generation, inputting the entire document may introduce off-topic information, causing the model to deviate from the central topic and affecting the relevance of the generated content. To address these issues, we propose the Retrieve-Plan-Generation (RPG) framework. RPG generates plan tokens to guide subsequent generation in the plan stage. In the answer stage, the model selects relevant fine-grained paragraphs based on the plan and uses them for further answer generation. This plan-answer process is repeated iteratively until completion, enhancing generation relevance by focusing on specific topics. To implement this framework efficiently, we utilize a simple but effective multi-task prompt-tuning method, enabling the existing LLMs to handle both planning and answering. We comprehensively compare RPG with baselines across 5 knowledge-intensive generation tasks, demonstrating the effectiveness of our approach.

Submitted 21 June, 2024; originally announced June 2024.

arXiv:2406.14106 [pdf, other]

EasyECR: A Library for Easy Implementation and Evaluation of Event Coreference Resolution Models

Authors: Yuncong Li, Tianhua Xu, Sheng-hua Zhong, Haiqin Yang

Abstract: Event Coreference Resolution (ECR) is the task of clustering event mentions that refer to the same real-world event. Despite significant advancements, ECR research faces two main challenges: limited generalizability across domains due to narrow dataset evaluations, and difficulties in comparing models within diverse ECR pipelines. To address these issues, we develop EasyECR, the first open-source library designed to standardize data structures and abstract ECR pipelines for easy implementation and fair evaluation. More specifically, EasyECR integrates seven representative pipelines and ten popular benchmark datasets, enabling model evaluations on various datasets and promoting the development of robust ECR pipelines. By conducting extensive evaluation via our EasyECR, we find that (i) the representative ECR pipelines cannot generalize across multiple datasets, hence evaluating ECR pipelines on multiple datasets is necessary, and (ii) all models in ECR pipelines have a great effect on pipeline performance; therefore, when one model in an ECR pipeline is compared, it is essential to ensure that the other models remain consistent. Additionally, reproducing ECR results is not trivial, and the developed library can help reduce this discrepancy. The experimental results provide valuable baselines for future research.

Submitted 20 June, 2024; originally announced June 2024.

Comments: 14 pages, 4 figures, 12 tables

arXiv:2406.13885 [pdf, other]

Knowledge Tagging System on Math Questions via LLMs with Flexible Demonstration Retriever

Authors: Hang Li, Tianlong Xu, Jiliang Tang, Qingsong Wen

Abstract: Knowledge tagging for questions plays a crucial role in contemporary intelligent educational applications, including learning progress diagnosis, practice question recommendations, and course content organization. Traditionally, these annotations are always conducted by pedagogical experts, as the task requires not only a strong semantic understanding of both question stems and knowledge definitions but also deep insights into connecting question-solving logic with corresponding knowledge concepts. With the recent emergence of advanced text encoding algorithms, such as pre-trained language models, many researchers have developed automatic knowledge tagging systems based on calculating the semantic similarity between the knowledge and question embeddings. In this paper, we explore automating the task using Large Language Models (LLMs), in response to the inability of prior encoding-based methods to deal with the hard cases which involve strong domain knowledge and complicated concept definitions. By showing the strong performance of zero- and few-shot results over math questions knowledge tagging tasks, we demonstrate LLMs' great potential in conquering the challenges faced by prior methods. Furthermore, by proposing a reinforcement learning-based demonstration retriever, we successfully exploit the great potential of different-sized LLMs in achieving better performance results while keeping the in-context demonstration usage efficiency high.

Submitted 19 June, 2024; originally announced June 2024.

Comments: 13 pages, 6 figures

arXiv:2406.13618 [pdf, other]

In-Context Former: Lightning-fast Compressing Context for Large Language Model

Authors: Xiangfeng Wang, Zaiyi Chen, Zheyong Xie, Tong Xu, Yongyi He, Enhong Chen

Abstract: With the rising popularity of Transformer-based large language models (LLMs), reducing their high inference costs has become a significant research focus. One effective approach is to compress the long input contexts. Existing methods typically leverage the self-attention mechanism of the LLM itself for context compression. While these methods have achieved notable results, the compression process still involves quadratic time complexity, which limits their applicability. To mitigate this limitation, we propose the In-Context Former (IC-Former). Unlike previous methods, IC-Former does not depend on the target LLMs. Instead, it leverages the cross-attention mechanism and a small number of learnable digest tokens to directly condense information from the contextual word embeddings. This approach significantly reduces inference time, which achieves linear growth in time complexity within the compression range. Experimental results indicate that our method requires only 1/32 of the floating-point operations of the baseline during compression and improves processing speed by 68 to 112 times while achieving over 90% of the baseline performance on evaluation metrics. Overall, our model effectively reduces compression costs and makes real-time compression scenarios feasible.
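
As a rough illustration of the complexity argument in this abstract, the sketch below counts only the multiply-adds needed to form attention scores: n x n pairs for self-attention over n context tokens, versus k x n pairs for cross-attention from k digest tokens. The function names, the head dimension d = 64, and k = 128 are assumptions chosen for illustration; this is not the paper's FLOP accounting and it does not attempt to reproduce the reported 1/32 figure.

```python
def self_attention_score_ops(n_tokens: int, dim: int) -> int:
    """Multiply-adds to compute the n x n score matrix Q K^T (hypothetical count)."""
    return n_tokens * n_tokens * dim


def cross_attention_score_ops(n_tokens: int, n_digest: int, dim: int) -> int:
    """Multiply-adds when only n_digest query tokens attend to n_tokens keys."""
    return n_digest * n_tokens * dim


DIM = 64        # assumed per-head dimension
N_DIGEST = 128  # assumed number of learnable digest tokens

for n in (512, 1024, 2048):
    quadratic = self_attention_score_ops(n, DIM)
    linear = cross_attention_score_ops(n, N_DIGEST, DIM)
    # Doubling n quadruples the first count but only doubles the second.
    print(n, quadratic, linear)
```

With the digest-token count held fixed, the cross-attention count grows linearly in the context length, which is the scaling behavior the abstract describes.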

Submitted 19 June, 2024; originally announced June 2024.

arXiv:2406.12975 [pdf, other]

SHIELD: Evaluation and Defense Strategies for Copyright Compliance in LLM Text Generation

Authors: Xiaoze Liu, Ting Sun, Tianyang Xu, Feijie Wu, Cunxiang Wang, Xiaoqian Wang, Jing Gao

Abstract: Large Language Models (LLMs) have transformed machine learning but raised significant legal concerns due to their potential to produce text that infringes on copyrights, resulting in several high-profile lawsuits. The legal landscape is struggling to keep pace with these rapid advancements, with ongoing debates about whether generated text might plagiarize copyrighted materials. Current LLMs may infringe on copyrights or overly restrict non-copyrighted texts, leading to these challenges: (i) the need for a comprehensive evaluation benchmark to assess copyright compliance from multiple aspects; (ii) evaluating robustness against safeguard bypassing attacks; and (iii) developing effective defenses targeted against the generation of copyrighted text. To tackle these challenges, we introduce a curated dataset to evaluate methods, test attack strategies, and propose lightweight, real-time defenses to prevent the generation of copyrighted text, ensuring the safe and lawful use of LLMs. Our experiments demonstrate that current LLMs frequently output copyrighted text, and that jailbreaking attacks can significantly increase the volume of copyrighted output. Our proposed defense mechanisms significantly reduce the volume of copyrighted text generated by LLMs by effectively refusing malicious requests. Code is publicly available at https://github.com/xz-liu/SHIELD

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.12020 [pdf, other]

When Box Meets Graph Neural Network in Tag-aware Recommendation

Authors: Fake Lin, Ziwei Zhao, Xi Zhu, Da Zhang, Shitian Shen, Xueying Li, Tong Xu, Suojuan Zhang, Enhong Chen

Abstract: The past year has witnessed the re-flourishment of tag-aware recommender systems supported by LLM-enriched tags. Unfortunately, though large efforts have been made, current solutions may fail to describe the diversity and uncertainty inherent in user preferences with only tag-driven profiles. Recently, with the development of geometry-based techniques, e.g., box embedding, the diversity of user preferences can now be fully modeled as the range within a box in high-dimensional space. However, a defect still exists, as these approaches are incapable of capturing high-order neighbor signals, i.e., semantic-rich multi-hop relations within the user-tag-item tripartite graph, which severely limits the effectiveness of user modeling. To deal with this challenge, in this paper, we propose a novel algorithm, called BoxGNN, to perform the message aggregation via a combination of logical operations, thereby incorporating high-order signals. Specifically, we first embed users, items, and tags as hyper-boxes rather than simple points in the representation space, and define two logical operations to facilitate the subsequent process. Next, we perform the message aggregation mechanism via the combination of logical operations, to obtain the corresponding high-order box representations. Finally, we adopt a volume-based learning objective with Gumbel smoothing techniques to refine the representation of boxes. Extensive experiments on two publicly available datasets and one LLM-enhanced e-commerce dataset have validated the superiority of BoxGNN compared with various state-of-the-art baselines. The code is released online.
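
The abstract describes embedding users, items, and tags as hyper-boxes and training with a volume-based objective. A minimal sketch of that geometry follows, assuming axis-aligned boxes stored as (min corner, max corner) and taking intersection as one logical operation. The abstract does not name its two operations, and the hard clamp used here is a simplification standing in for the Gumbel smoothing it mentions. All function names and sample boxes are hypothetical.

```python
def intersect(box_a, box_b):
    """Intersection of two axis-aligned boxes, each given as (min_corner, max_corner)."""
    (min_a, max_a), (min_b, max_b) = box_a, box_b
    lo = [max(a, b) for a, b in zip(min_a, min_b)]
    hi = [min(a, b) for a, b in zip(max_a, max_b)]
    return lo, hi


def volume(box):
    """Hard volume: product of side lengths, clamped at zero for empty boxes."""
    lo, hi = box
    vol = 1.0
    for low, high in zip(lo, hi):
        vol *= max(high - low, 0.0)
    return vol


user_box = ([0.0, 0.0], [2.0, 2.0])  # hypothetical user preference region
item_box = ([1.0, 1.0], [3.0, 4.0])  # hypothetical item region
far_box = ([5.0, 5.0], [6.0, 6.0])   # hypothetical item far from the user

# A larger overlap volume is read here as a stronger user-item match.
overlap_score = volume(intersect(user_box, item_box))
disjoint_score = volume(intersect(user_box, far_box))
```

The hard clamp gives zero volume, and therefore zero gradient, for disjoint boxes, which is one common motivation for smoothed volume formulations.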

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.08864 [pdf]

Research on Early Warning Model of Cardiovascular Disease Based on Computer Deep Learning

Authors: Yuxiang Hu, Jinxin Hu, Ting Xu, Bo Zhang, Jiajie Yuan, Haozhang Deng

Abstract: This project intends to study a cardiovascular disease risk early warning model based on one-dimensional convolutional neural networks. First, the missing values of 13 physiological and symptom indicators such as patient age, blood glucose, cholesterol, and chest pain were filled and Z-score standardization was applied. The convolutional neural network is converted into a 2D matrix, the convolution function of 1, 3, and 5 is used for the first-order convolution operation, and the Max Pooling algorithm is adopted for dimension reduction. The learning rate and output rate are then set, and the model is optimized by the Adam algorithm. The result of classification is output by a soft classifier. This study was conducted based on Statlog in the UCI database and heart disease database respectively. The empirical data indicate that the forecasting precision of this technique has been enhanced by 11.2%, relative to conventional approaches, while there is a significant improvement in the logarithmic curve fitting. The efficacy and applicability of the novel approach are corroborated through the examination employing a one-dimensional convolutional neural network.
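
The preprocessing step named in this abstract (filling missing values, then Z-score standardization) can be sketched as follows. The abstract does not say how missing values were filled, so mean imputation is an assumption here, and the function name and sample values are hypothetical.

```python
from statistics import mean, pstdev


def impute_and_standardize(column):
    """Fill None entries with the observed mean (assumed strategy), then z-score."""
    observed = [x for x in column if x is not None]
    fill_value = mean(observed)
    filled = [fill_value if x is None else x for x in column]

    center = mean(filled)
    spread = pstdev(filled)  # population standard deviation
    if spread == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in filled]
    return [(x - center) / spread for x in filled]


# Hypothetical "age" indicator with one missing entry.
ages = [52, None, 61, 45, 70]
z_ages = impute_and_standardize(ages)
```

After this step each indicator has zero mean and unit variance, so features measured on different scales (age, blood glucose, cholesterol) contribute comparably to the convolution inputs.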

Submitted 13 June, 2024; originally announced June 2024.

Comments: 6 pages
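
The preprocessing step named in the abstract above (filling missing values, then Z-score standardization) can be sketched as follows. This is a minimal illustration only: the abstract does not say how missing values were filled, so mean imputation is an assumption here, and the function name `fill_and_standardize` is hypothetical.

```python
from statistics import mean, pstdev


def fill_and_standardize(column):
    """Mean-impute missing entries (None), then Z-score the column.

    Assumption: mean imputation. The abstract does not specify the
    filling method actually used by the authors.
    """
    observed = [x for x in column if x is not None]
    fill_value = mean(observed)
    filled = [fill_value if x is None else x for x in column]

    mu = mean(filled)
    sigma = pstdev(filled)  # population standard deviation
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in filled]
    return [(x - mu) / sigma for x in filled]


# Example: one indicator (e.g. age) with a missing entry.
ages = [50, None, 70]
z_scores = fill_and_standardize(ages)
```

After this transformation each indicator has zero mean and unit variance, which keeps features measured on different scales (such as age and cholesterol) comparable when they are fed to a convolutional network.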

arXiv:2406.08358 [pdf, other]

From a Social Cognitive Perspective: Context-aware Visual Social Relationship Recognition

Authors: Shiwei Wu, Chao Zhang, Joya Chen, Tong Xu, Likang Wu, Yao Hu, Enhong Chen

Abstract: People's social relationships are often manifested through their surroundings, with certain objects or interactions acting as symbols for specific relationships, e.g., wedding rings, roses, hugs, or holding hands. This brings unique challenges to recognizing social relationships, requiring understanding and capturing the essence of these contexts from visual appearances. However, current methods of social relationship understanding rely on the basic classification paradigm of detected persons and objects, which fails to understand the comprehensive context and often overlooks decisive social factors, especially subtle visual cues. To highlight the social-aware context and intricate details, we propose a novel approach that recognizes Contextual Social Relationships (ConSoR) from a social cognitive perspective. Specifically, to incorporate social-aware semantics, we build a lightweight adapter upon the frozen CLIP to learn social concepts via our novel multi-modal side adapter tuning mechanism. Further, we construct social-aware descriptive language prompts (e.g., scene, activity, objects, emotions) with social relationships for each image, and then compel ConSoR to concentrate more intensively on the decisive visual social factors via visual-linguistic contrasting. Impressively, ConSoR outperforms previous methods with a 12.2% gain on the People-in-Social-Context (PISC) dataset and a 9.8% increase on the People-in-Photo-Album (PIPA) benchmark. Furthermore, we observe that ConSoR excels at finding critical visual evidence to reveal social relationships.

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.08090 [pdf, other]

From Sim-to-Real: Toward General Event-based Low-light Frame Interpolation with Per-scene Optimization

Authors: Ziran Zhang, Yongrui Ma, Yueting Chen, Feng Zhang, Jinwei Gu, Tianfan Xue, Shi Guo

Abstract: Video Frame Interpolation (VFI) is important for video enhancement, frame rate up-conversion, and slow-motion generation. The introduction of event cameras, which capture per-pixel brightness changes asynchronously, has significantly enhanced VFI capabilities, particularly for high-speed, nonlinear motions. However, these event-based methods encounter challenges in low-light conditions, notably trailing artifacts and signal latency, which hinder their direct applicability and generalization. Addressing these issues, we propose a novel per-scene optimization strategy tailored for low-light conditions. This approach utilizes the internal statistics of a sequence to handle degraded event data under low-light conditions, improving the generalizability to different lighting and camera settings. To evaluate its robustness in low-light conditions, we further introduce EVFI-LL, a unique RGB+Event dataset captured under low-light conditions. Our results demonstrate state-of-the-art performance in low-light environments. Both the dataset and the source code will be made publicly available upon publication. Project page: https://naturezhanghn.github.io/sim2real.

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.07579 [pdf, other]

GFPack++: Improving 2D Irregular Packing by Learning Gradient Field with Attention

Authors: Tianyang Xue, Lin Lu, Yang Liu, Mingdong Wu, Hao Dong, Yanbin Zhang, Renmin Han, Baoquan Chen

Abstract: 2D irregular packing is a classic combinatorial optimization problem with various applications, such as material utilization and texture atlas generation. This NP-hard problem requires efficient algorithms to optimize space utilization. Conventional numerical methods suffer from slow convergence and high computational cost. Existing learning-based methods, such as the score-based diffusion model, also have limitations, such as no rotation support, frequent collisions, poor adaptability to arbitrary boundaries, and slow inference. The difficulty of learning from teacher packing is to capture the complex geometric relationships among packing examples, which include the spatial (position, orientation) relationships of objects, their geometric features, and container boundary conditions. Representing these relationships in latent space is challenging. We propose GFPack++, an attention-based gradient field learning approach that addresses this challenge. It consists of two pivotal strategies: attention-based geometry encoding for effective feature encoding and attention-based relation encoding for learning complex relationships. We investigate the utilization distribution between the teacher and inference data and design a weighting function to prioritize tighter teacher data during training, enhancing learning effectiveness. Our diffusion model supports continuous rotation and outperforms existing methods on various datasets. We achieve higher space utilization than several widely used baselines, run an order of magnitude faster than the previous diffusion-based method, and show promising generalization to arbitrary boundaries. We plan to release our source code and datasets to support further research in this direction.

Submitted 9 June, 2024; originally announced June 2024.

arXiv:2406.04776 [pdf, ps, other]

OFDM-Standard Compatible SC-NOFS Waveforms for Low-Latency and Jitter-Tolerance Industrial IoT Communications

Authors: Tongyang Xu, Shuangyang Li, Jinhong Yuan

Abstract: Traditional communications focus on regular and orthogonal signal waveforms for simplified signal processing and improved spectral efficiency. In contrast, the next-generation communications would aim for irregular and non-orthogonal signal waveforms to introduce new capabilities. This work proposes a spectrally efficient irregular Sinc (irSinc) shaping technique, revisiting the traditional Sinc back to 1924, with the aim of enhancing performance in industrial Internet of things (IIoT). In time-critical IIoT applications, low-latency and time-jitter tolerance are two critical factors that significantly impact the performance and reliability. Recognizing the inevitability of latency and jitter in practice, this work aims to propose a waveform technique to mitigate these effects via reducing latency and enhancing the system robustness under time jitter effects. The utilization of irSinc yields a signal with increased spectral efficiency without sacrificing error performance. Integrating the irSinc in a two-stage framework, a single-carrier non-orthogonal frequency shaping (SC-NOFS) waveform is developed, showcasing perfect compatibility with 5G standards, enabling the direct integration of irSinc in existing industrial IoT setups. Through 5G standard signal configuration, our signal achieves faster data transmission within the same spectral bandwidth. Hardware experiments validate an 18% saving in timing resources, leading to either reduced latency or enhanced jitter tolerance.

Submitted 7 June, 2024; originally announced June 2024.

arXiv:2406.04129 [pdf, other]

LenslessFace: An End-to-End Optimized Lensless System for Privacy-Preserving Face Verification

Authors: Xin Cai, Hailong Zhang, Chenchen Wang, Wentao Liu, Jinwei Gu, Tianfan Xue

Abstract: Lensless cameras, innovatively replacing traditional lenses for ultra-thin, flat optics, encode light directly onto sensors, producing images that are not immediately recognizable. This compact, lightweight, and cost-effective imaging solution offers inherent privacy advantages, making it attractive for privacy-sensitive applications like face verification. Typical lensless face verification adopts a two-stage process of reconstruction followed by verification, incurring privacy risks from reconstructed faces and high computational costs. This paper presents an end-to-end optimization approach for privacy-preserving face verification directly on encoded lensless captures, ensuring that the entire software pipeline remains encoded with no visible faces as intermediate results. To achieve this, we propose several techniques to address unique challenges from the lensless setup which precludes traditional face detection and alignment. Specifically, we propose a face center alignment scheme, an augmentation curriculum to build robustness against variations, and a knowledge distillation method to smooth optimization and enhance performance. Evaluations under both simulation and real environment demonstrate our method outperforms two-stage lensless verification while enhancing privacy and efficiency. Project website: lenslessface.github.io.

Submitted 6 June, 2024; originally announced June 2024.

Comments: under review

arXiv:2406.02728 [pdf]

Impacts of Illuminance and Correlated Color Temperature on Cognitive Performance: A VR-Lighting Study

Authors: Armin Mostafavi, Milica Vujovic, Tong Bill Xu, Michael Hensel

Abstract: This study contributes to the ongoing exploration of methods to enhance the environmental design, cognitive function, and overall wellbeing, primarily focusing on understanding the modulation of human cognitive performance by artificial lighting conditions. In this investigation, participants (N=35) engaged with two distinct architectural contexts, each featuring five different lighting conditions within a virtual environment during specific daytime scenarios. As participants responded to a series of cognitive memory tests, we measured their test scores and the corresponding reaction times. The study's findings, particularly in Backward Digit Span Tasks (BDST) and Visual Memory Tasks (VMT), indicate that diverse lighting conditions significantly impacted cognitive performance at different times of the day. Notably, the BDST scores were mainly affected by lighting conditions in the afternoon session, whereas the VMT scores were primarily influenced in the morning sessions. This research offers support for architects and engineers as they develop lighting designs that are sensitive to the cognitive performance of occupants. It highlights the advantages of utilizing VR simulations in the AEC industry to assess the impact of lighting designs on users. Further research can lead to the development of lighting systems that can promote better cognitive function and overall wellbeing.

Submitted 4 June, 2024; originally announced June 2024.

arXiv:2406.01007 [pdf, other]

Measurement of Electron Antineutrino Oscillation Amplitude and Frequency via Neutron Capture on Hydrogen at Daya Bay

Authors: Daya Bay collaboration, F. P. An, W. D. Bai, A. B. Balantekin, M. Bishai, S. Blyth, G. F. Cao, J. Cao, J. F. Chang, Y. Chang, H. S. Chen, H. Y. Chen, S. M. Chen, Y. Chen, Y. X. Chen, Z. Y. Chen, J. Cheng, J. Cheng, Y. -C. Cheng, Z. K. Cheng, J. J. Cherwinka, M. C. Chu, J. P. Cummings, O. Dalager, F. S. Deng , et al. (177 additional authors not shown)

Abstract: This Letter reports the first measurement of the oscillation amplitude and frequency of reactor antineutrinos at Daya Bay via neutron capture on hydrogen using 1958 days of data. With over 3.6 million signal candidates, an optimized candidate selection, improved treatment of backgrounds and efficiencies, refined energy calibration, and an energy response model for the capture-on-hydrogen sensitive region, the relative $\overline{\nu}_{e}$ rates and energy spectra variation among the near and far detectors gives $\sin^2 2\theta_{13} = 0.0759_{-0.0049}^{+0.0050}$ and $\Delta m^2_{32} = (2.72^{+0.14}_{-0.15})\times10^{-3}$ eV$^2$ assuming the normal neutrino mass ordering, and $\Delta m^2_{32} = (-2.83^{+0.15}_{-0.14})\times10^{-3}$ eV$^2$ for the inverted neutrino mass ordering. This estimate of $\sin^2 2\theta_{13}$ is consistent with and essentially independent from the one obtained using the capture-on-gadolinium sample at Daya Bay. The combination of these two results yields $\sin^2 2\theta_{13} = 0.0833\pm0.0022$, which represents an 8% relative improvement in precision regarding the Daya Bay full 3158-day capture-on-gadolinium result.

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2406.01003 [pdf, other]

Uni-ISP: Unifying the Learning of ISPs from Multiple Cameras

Authors: Lingen Li, Mingde Yao, Xingyu Meng, Muquan Yu, Tianfan Xue, Jinwei Gu

Abstract: Modern end-to-end image signal processors (ISPs) can learn complex mappings from RAW/XYZ data to sRGB (or inverse), opening new possibilities in image processing. However, as the diversity of camera models continues to expand, developing and maintaining individual ISPs is not sustainable in the long term, which inherently lacks versatility, hindering the adaptability to multiple camera models. In this paper, we propose a novel pipeline, Uni-ISP, which unifies the learning of ISPs from multiple cameras, offering an accurate and versatile processor to multiple camera models. The core of Uni-ISP is leveraging device-aware embeddings through learning inverse/forward ISPs and its special training scheme. By doing so, Uni-ISP not only improves the performance of inverse/forward ISPs but also unlocks a variety of new applications inaccessible to existing learned ISPs. Moreover, since there is no dataset synchronously captured by multiple cameras for training, we construct a real-world 4K dataset, FiveCam, comprising more than 2,400 pairs of sRGB-RAW images synchronously captured by five smartphones. We conducted extensive experiments demonstrating Uni-ISP's accuracy in inverse/forward ISPs (with improvements of +1.5dB/2.4dB PSNR), its versatility in enabling new applications, and its adaptability to new camera models.

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2406.00448 [pdf, other]

Bilateral Guided Radiance Field Processing

Authors: Yuehao Wang, Chaoyi Wang, Bingchen Gong, Tianfan Xue

Abstract: Neural Radiance Fields (NeRF) achieves unprecedented performance in novel view synthesis, utilizing multi-view consistency. When capturing multiple inputs, image signal processing (ISP) in modern cameras will independently enhance them, including exposure adjustment, color correction, local tone mapping, etc. While these processing steps greatly improve image quality, they often break the multi-view consistency assumption, leading to "floaters" in the reconstructed radiance fields. To address this concern without compromising visual aesthetics, we aim to first disentangle the enhancement by ISP at the NeRF training stage and re-apply user-desired enhancements to the reconstructed radiance fields at the finishing stage. Furthermore, to make the re-applied enhancements consistent between novel views, we need to perform imaging signal processing in 3D space (i.e. "3D ISP"). For this goal, we adopt the bilateral grid, a locally-affine model, as a generalized representation of ISP processing. Specifically, we optimize per-view 3D bilateral grids with radiance fields to approximate the effects of camera pipelines for each input view. To achieve user-adjustable 3D finishing, we propose to learn a low-rank 4D bilateral grid from a given single view edit, lifting photo enhancements to the whole 3D scene. We demonstrate our approach can boost the visual quality of novel view synthesis by effectively removing floaters and performing enhancements from user retouching. The source code and our data are available at: https://bilarfpro.github.io.

Submitted 1 June, 2024; originally announced June 2024.

Comments: SIGGRAPH (ACM TOG), 2024. Project page: https://bilarfpro.github.io

arXiv:2405.21075 [pdf, other]

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Authors: Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Rongrong Ji, Xing Sun

Abstract: In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs) have emerged as a focal point in recent advancements. However, the predominant focus remains on developing their capabilities in static image understanding. The potential of MLLMs in processing sequential visual data is still insufficiently explored, highlighting the absence of a comprehensive, high-quality assessment of their performance. In this paper, we introduce Video-MME, the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in Video analysis. Our work is distinguished from existing benchmarks by four key features: 1) Diversity in video types, spanning 6 primary visual domains with 30 subfields to ensure broad scenario generalizability; 2) Duration in temporal dimension, encompassing short-, medium-, and long-term videos, ranging from 11 seconds to 1 hour, for robust contextual dynamics; 3) Breadth in data modalities, integrating multi-modal inputs besides video frames, including subtitles and audio, to unveil the all-round capabilities of MLLMs; 4) Quality in annotations, utilizing rigorous manual labeling by expert annotators to facilitate precise and reliable model assessment. 900 videos with a total of 254 hours are manually selected and annotated by repeatedly viewing all the video content, resulting in 2,700 question-answer pairs. With Video-MME, we extensively evaluate various state-of-the-art MLLMs, including GPT-4 series and Gemini 1.5 Pro, as well as open-source image models like InternVL-Chat-V1.5 and video models like LLaVA-NeXT-Video. Our experiments reveal that Gemini 1.5 Pro is the best-performing commercial model, significantly outperforming the open-source models. Our dataset along with these findings underscores the need for further improvements in handling longer sequences and multi-modal data. Project Page: https://video-mme.github.io

Submitted 16 June, 2024; v1 submitted 31 May, 2024; originally announced May 2024.

Comments: Project Page: https://video-mme.github.io

arXiv:2405.20974 [pdf, other]

SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales

Authors: Tianyang Xu, Shujin Wu, Shizhe Diao, Xiaoze Liu, Xingyao Wang, Yangyi Chen, Jing Gao

Abstract: Large language models (LLMs) often generate inaccurate or fabricated information and generally fail to indicate their confidence, which limits their broader applications. Previous work elicits confidence from LLMs by direct or self-consistency prompting, or constructing specific datasets for supervised finetuning. The prompting-based approaches have inferior performance, and the training-based approaches are limited to binary or inaccurate group-level confidence estimates. In this work, we present the advanced SaySelf, a training framework that teaches LLMs to express more accurate fine-grained confidence estimates. In addition, beyond the confidence scores, SaySelf initiates the process of directing LLMs to produce self-reflective rationales that clearly identify gaps in their parametric knowledge and explain their uncertainty. This is achieved by using an LLM to automatically summarize the uncertainties in specific knowledge via natural language. The summarization is based on the analysis of the inconsistency in multiple sampled reasoning chains, and the resulting data is utilized for supervised fine-tuning. Moreover, we utilize reinforcement learning with a meticulously crafted reward function to calibrate the confidence estimates, motivating LLMs to deliver accurate, high-confidence predictions and to penalize overconfidence in erroneous outputs. Experimental results in both in-distribution and out-of-distribution datasets demonstrate the effectiveness of SaySelf in reducing the confidence calibration error and maintaining the task performance. We show that the generated self-reflective rationales are reasonable and can further contribute to the calibration. The code is made public at https://github.com/xu1868/SaySelf.

Submitted 5 June, 2024; v1 submitted 31 May, 2024; originally announced May 2024.

Comments: The code is available at https://github.com/xu1868/SaySelf

arXiv:2405.18842 [pdf, other]

Descriptive Image Quality Assessment in the Wild

Authors: Zhiyuan You, Jinjin Gu, Zheyuan Li, Xin Cai, Kaiwen Zhu, Chao Dong, Tianfan Xue

Abstract: With the rapid advancement of Vision Language Models (VLMs), VLM-based Image Quality Assessment (IQA) seeks to describe image quality linguistically to align with human expression and capture the multifaceted nature of IQA tasks. However, current methods are still far from practical usage. First, prior works focus narrowly on specific sub-tasks or settings, which do not align with diverse real-world applications. Second, their performance is sub-optimal due to limitations in dataset coverage, scale, and quality. To overcome these challenges, we introduce Depicted image Quality Assessment in the Wild (DepictQA-Wild). Our method includes a multi-functional IQA task paradigm that encompasses both assessment and comparison tasks, brief and detailed responses, full-reference and non-reference scenarios. We introduce a ground-truth-informed dataset construction approach to enhance data quality, and scale up the dataset to 495K under the brief-detail joint framework. Consequently, we construct a comprehensive, large-scale, and high-quality dataset, named DQ-495K. We also retain image resolution during training to better handle resolution-related quality issues, and estimate a confidence score that is helpful to filter out low-quality responses. Experimental results demonstrate that DepictQA-Wild significantly outperforms traditional score-based methods, prior VLM-based IQA models, and proprietary GPT-4V in distortion identification, instant rating, and reasoning tasks. Our advantages are further confirmed by real-world applications including assessing the web-downloaded images and ranking model-processed images. Datasets and codes will be released in https://depictqa.github.io/depictqa-wild/.

Submitted 12 June, 2024; v1 submitted 29 May, 2024; originally announced May 2024.

arXiv:2405.18717 [pdf]

Silicon-integrated scandium-doped aluminum nitride electro-optic modulator

Authors: Tianqi Xu, Yushuai Liu, Yuanmao Pu, Yongxiang Yang, Qize Zhong, Xingyan Zhao, Yang Qiu, Yuan Dong, Tao Wu, Shaonan Zheng, Ting Hu

Abstract: Scandium-doped aluminum nitride (AlScN) with an asymmetric hexagonal wurtzite structure exhibits enhanced second-order nonlinear and piezoelectric properties compared to aluminum nitride (AlN), while maintaining a relatively large bandgap. It provides a promising platform for photonic integration and facilitates the seamless integration of passive and active functional devices. Here, we present the design, fabrication, and characterization of AlScN EO micro-ring modulators, introducing active functionalities to the chip-scale AlScN platform. These waveguide-integrated EO modulators employ sputtered AlScN thin films as the light-guiding medium, and the entire fabrication process is compatible with complementary metal oxide semiconductor (CMOS) technology. We characterize the high-frequency performance of an AlScN modulator for the first time, extracting a maximum in-device effective EO coefficient of 2.86 pm/V at 12 GHz. The devices show a minimum half-wave voltage-length product of 3.12 V*cm and a 3-dB modulation bandwidth of approximately 22 GHz. Our work provides a promising modulation scheme for cost-effective silicon-integrated photonics systems.

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2405.16789 [pdf, other]

NoteLLM-2: Multimodal Large Representation Models for Recommendation

Authors: Chao Zhang, Haoxin Zhang, Shiwei Wu, Di Wu, Tong Xu, Yan Gao, Yao Hu, Enhong Chen

Abstract: Large Language Models (LLMs) have demonstrated exceptional text understanding. Existing works explore their application in text embedding tasks. However, there are few works utilizing LLMs to assist multimodal representation tasks. In this work, we investigate the potential of LLMs to enhance multimodal representation in multimodal item-to-item (I2I) recommendations. One feasible method is the transfer of Multimodal Large Language Models (MLLMs) for representation tasks. However, pre-training MLLMs usually requires collecting high-quality, web-scale multimodal data, resulting in complex training procedures and high costs. This leads the community to rely heavily on open-source MLLMs, hindering customized training for representation scenarios. Therefore, we aim to design an end-to-end training method that customizes the integration of any existing LLMs and vision encoders to construct efficient multimodal representation models. Preliminary experiments show that fine-tuned LLMs in this end-to-end method tend to overlook image content. To overcome this challenge, we propose a novel training framework, NoteLLM-2, specifically designed for multimodal representation. We propose two ways to enhance the focus on visual information. The first method is based on the prompt viewpoint, which separates multimodal content into visual content and textual content. NoteLLM-2 adopts the multimodal In-Content Learning method to teach LLMs to focus on both modalities and aggregate key information. The second method is from the model architecture, utilizing a late fusion mechanism to directly fuse visual information into textual information. Extensive experiments have been conducted to validate the effectiveness of our method.

Submitted 26 May, 2024; originally announced May 2024.

Comments: 19 pages, 5 figures

arXiv:2405.16241 [pdf, other]

FastQuery: Communication-efficient Embedding Table Query for Private LLM Inference

Authors: Chenqi Lin, Tianshi Xu, Zebin Yang, Runsheng Wang, Ru Huang, Meng Li

Abstract: With the fast evolution of large language models (LLMs), privacy concerns with user queries arise as they may contain sensitive information. Private inference based on homomorphic encryption (HE) has been proposed to protect user query privacy. However, a private embedding table query has to be formulated as a HE-based matrix-vector multiplication problem and suffers from enormous computation and communication overhead. We observe the overhead mainly comes from the neglect of 1) the one-hot nature of user queries and 2) the robustness of the embedding table to low bit-width quantization noise. Hence, in this paper, we propose a private embedding table query optimization framework, dubbed FastQuery. FastQuery features a communication-aware embedding table quantization algorithm and a one-hot-aware dense packing algorithm to simultaneously reduce both the computation and communication costs. Compared to prior-art HE-based frameworks, e.g., Cheetah, Iron, and Bumblebee, FastQuery achieves more than $4.3\times$, $2.7\times$, $1.3\times$ latency reduction, respectively, and more than $75.7\times$, $60.2\times$, $20.2\times$ communication reduction, respectively, on both LLAMA-7B and LLAMA-30B.

Submitted 25 May, 2024; originally announced May 2024.

Comments: 6 pages, DAC2024

arXiv:2405.15153 [pdf, other]

Optimal Reference Nodes Deployment for Positioning Seafloor Anchor Nodes

Authors: Wei Huang, Pengfei Wu, Tianhe Xu, Hao Zhang, Kaitao Meng

Abstract: Seafloor anchor nodes, which form a geodetic network, are designed to provide surface and underwater users with positioning, navigation and timing (PNT) services. Due to the non-uniform distribution of underwater sound speed, accurate positioning of underwater anchor nodes is a challenging task. Traditional anchor node positioning typically uses cross or circular shapes; however, how to optimize the deployment of reference nodes for positioning underwater anchor nodes considering the variability of sound speed has not yet been studied. This paper focuses on the optimal reference nodes deployment strategies for time-of-arrival (TOA) localization in the three-dimensional (3D) underwater space. We adopt the criterion of minimizing the trace of the inverse Fisher information matrix (FIM) to determine optimal reference nodes deployment with Gaussian measurement noise, which is positively related to the signal propagation path. A comprehensive analysis of optimal reference-target geometries is provided in the general circumstance with no restriction on the number of reference nodes, elevation angle and reference-target range. A new semi-closed form solution is found to determine the optimal geometries. To demonstrate the findings in this paper, we conducted both simulations and sea trials on underwater anchor node positioning. Both the simulation and experiment results are consistent with theoretical analysis.

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2405.14569 [pdf, other]

PrivCirNet: Efficient Private Inference via Block Circulant Transformation

Authors: Tianshi Xu, Lemeng Wu, Runsheng Wang, Meng Li

Abstract: Homomorphic encryption (HE)-based deep neural network (DNN) inference protects data and model privacy but suffers from significant computation overhead. We observe transforming the DNN weights into circulant matrices converts general matrix-vector multiplications into HE-friendly 1-dimensional convolutions, drastically reducing the HE computation cost. Hence, in this paper, we propose PrivCirNet, a protocol/network co-optimization framework based on block circulant transformation. At the protocol level, PrivCirNet customizes the HE encoding algorithm that is fully compatible with the block circulant transformation and reduces the computation latency in proportion to the block size. At the network level, we propose a latency-aware formulation to search for the layer-wise block size assignment based on second-order information. PrivCirNet also leverages layer fusion to further reduce the inference cost. We compare PrivCirNet with the state-of-the-art HE-based framework Bolt (IEEE S&P 2024) and the HE-friendly pruning method SpENCNN (ICML 2023). For ResNet-18 and Vision Transformer (ViT) on Tiny ImageNet, PrivCirNet reduces latency by $5.0\times$ and $1.3\times$ with iso-accuracy over Bolt, respectively, and improves accuracy by $4.1\%$ and $12\%$ over SpENCNN, respectively. For MobileNetV2 on ImageNet, PrivCirNet achieves $1.7\times$ lower latency and $4.2\%$ better accuracy over Bolt and SpENCNN, respectively. Our code and checkpoints are available in the supplementary materials.
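The identity behind the abstract's key observation can be checked on a toy case. The Python sketch below builds a circulant matrix from its first column and confirms that multiplying it by a vector gives the same result as a 1-dimensional circular convolution, computed here as a polynomial product reduced modulo X^n - 1. All names and values are illustrative assumptions; the sketch works on plaintext integers and does not reproduce PrivCirNet's HE encoding or its block-wise treatment.

```python
def circulant(first_column):
    """Build the n x n circulant matrix C with C[i][j] = c[(i - j) mod n]."""
    n = len(first_column)
    return [[first_column[(i - j) % n] for j in range(n)] for i in range(n)]


def matvec(matrix, vector):
    """Plain dense matrix-vector product."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def circular_convolution(c, x):
    """Multiply the polynomials c(X) and x(X), then reduce modulo X^n - 1."""
    n = len(c)
    out = [0] * n
    for i, ci in enumerate(c):
        for j, xj in enumerate(x):
            out[(i + j) % n] += ci * xj
    return out


c = [1, 2, 3, 4]  # first column of the circulant weight matrix (toy values)
x = [5, 6, 7, 8]  # input vector (toy values)
dense = matvec(circulant(c), x)
conv = circular_convolution(c, x)
print(dense, conv)  # both are [66, 68, 66, 60]
```

A circulant block is fully described by n values instead of n^2, which is why the weight transformation reduces the work that has to be carried out under encryption.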

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2405.14455 [pdf, other]

TIGER: Text-Instructed 3D Gaussian Retrieval and Coherent Editing

Authors: Teng Xu, Jiamin Chen, Peng Chen, Youjia Zhang, Junqing Yu, Wei Yang

Abstract: Editing objects within a scene is a critical functionality required across a broad spectrum of applications in computer vision and graphics. As 3D Gaussian Splatting (3DGS) emerges as a frontier in scene representation, the effective modification of 3D Gaussian scenes has become increasingly vital. This process entails accurately retrieving the target objects and subsequently performing modifications based on instructions. Though available in pieces, existing techniques mainly embed sparse semantics into Gaussians for retrieval, and rely on an iterative dataset update paradigm for editing, leading to over-smoothing or inconsistency issues. To this end, this paper proposes a systematic approach, namely TIGER, for coherent text-instructed 3D Gaussian retrieval and editing. In contrast to the top-down language grounding approach for 3D Gaussians, we adopt a bottom-up language aggregation strategy to generate denser language-embedded 3D Gaussians that support open-vocabulary retrieval. To overcome the over-smoothing and inconsistency issues in editing, we propose a Coherent Score Distillation (CSD) that aggregates a 2D image editing diffusion model and a multi-view diffusion model for score distillation, producing multi-view consistent editing with much finer details. In various experiments, we demonstrate that our TIGER is able to accomplish more consistent and realistic edits than prior work.

Submitted 1 June, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

arXiv:2405.11531 [pdf, other]

Knowledge Graph Pruning for Recommendation

Authors: Fake Lin, Xi Zhu, Ziwei Zhao, Deqiang Huang, Yu Yu, Xueying Li, Tong Xu, Enhong Chen

Abstract: Recent years have witnessed the prosperity of knowledge graph based recommendation system (KGRS), which enriches the representation of users, items, and entities by structural knowledge with striking improvement. Nevertheless, its unaffordable computational cost still limits researchers from exploring more sophisticated models. We observe that the bottleneck for training efficiency arises from the knowledge graph, which is plagued by the well-known issue of knowledge explosion. Recently, some works have attempted to slim the inflated KG via summarization techniques. However, these summarized nodes may ignore the collaborative signals and deviate from the fact that nodes in a knowledge graph represent symbolic abstractions of entities from the real world. To this end, in this paper, we propose a novel approach called KGTrimmer for knowledge graph pruning tailored for recommendation, to remove the unessential nodes while minimizing performance degradation. Specifically, we design an importance evaluator from a dual-view perspective. For the collective view, we embrace the idea of collective intelligence by extracting community consensus based on abundant collaborative signals, i.e. nodes are considered important if they attract attention of numerous users. For the holistic view, we learn a global mask to identify the valueless nodes from their inherent properties or overall popularity. Next, we build an end-to-end importance-aware graph neural network, which injects filtered knowledge to enhance the distillation of valuable user-item collaborative signals. Ultimately, we generate a pruned knowledge graph with lightweight, stable, and robust properties to facilitate the follow-up recommendation task. Extensive experiments are conducted on three publicly available datasets to prove the effectiveness and generalization ability of KGTrimmer.

Submitted 19 May, 2024; originally announced May 2024.

arXiv:2405.10959 [pdf, other]

Foundation Models for Education: Promises and Prospects

Authors: Tianlong Xu, Richard Tong, Jing Liang, Xing Fan, Haoyang Li, Qingsong Wen

Abstract: With the advent of foundation models like ChatGPT, educators are excited about the transformative role that AI might play in propelling the next education revolution. The developing speed and the profound impact of foundation models in various industries force us to think deeply about the changes they will make to education, a domain that is critically important for the future of humans. In this paper, we discuss the strengths of foundation models, such as personalized learning, education inequality, and reasoning capabilities, as well as the development of agent architecture tailored for education, which integrates AI agents with pedagogical frameworks to create adaptive learning environments. Furthermore, we highlight the risks and opportunities of AI overreliance and creativity. Lastly, we envision a future where foundation models in education harmonize human and AI capabilities, fostering a dynamic, inclusive, and adaptive educational ecosystem.

Submitted 8 April, 2024; originally announced May 2024.

Comments: Accepted by IEEE Intelligent Systems

arXiv:2405.10879 [pdf, other]

One registration is worth two segmentations

Authors: Shiqi Huang, Tingfa Xu, Ziyi Shen, Shaheer Ullah Saeed, Wen Yan, Dean Barratt, Yipeng Hu

Abstract: The goal of image registration is to establish spatial correspondence between two or more images, traditionally through dense displacement fields (DDFs) or parametric transformations (e.g., rigid, affine, and splines). Rethinking the existing paradigms of achieving alignment via spatial transformations, we uncover an alternative but more intuitive correspondence representation: a set of corresponding regions-of-interest (ROI) pairs, which we demonstrate to have sufficient representational capability as other correspondence representation methods. Further, it is neither necessary nor sufficient for these ROIs to hold specific anatomical or semantic significance. In turn, we formulate image registration as searching for the same set of corresponding ROIs from both moving and fixed images - in other words, two multi-class segmentation tasks on a pair of images. For a general-purpose and practical implementation, we integrate the segment anything model (SAM) into our proposed algorithms, resulting in a SAM-enabled registration (SAMReg) that does not require any training data, gradient-based fine-tuning or engineered prompts. We experimentally show that the proposed SAMReg is capable of segmenting and matching multiple ROI pairs, which establish sufficiently accurate correspondences, in three clinical applications of registering prostate MR, cardiac MR and abdominal CT images. Based on metrics including Dice and target registration errors on anatomical structures, the proposed registration outperforms both intensity-based iterative algorithms and DDF-predicting learning-based networks, even yielding competitive performance with weakly-supervised registration which requires fully-segmented training data.
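For readers unfamiliar with the two evaluation metrics named in the abstract, the Python sketch below gives their textbook definitions: the Dice coefficient 2|A ∩ B| / (|A| + |B|) between two binary masks, and the target registration error as the mean Euclidean distance between corresponding landmarks. The function names, the flat-list mask representation, and the convention that two empty masks score 1.0 are assumptions for illustration, not details taken from the paper.

```python
import math


def dice(mask_a, mask_b):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flat binary masks."""
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def target_registration_error(points_a, points_b):
    """Mean Euclidean distance between corresponding landmark pairs."""
    distances = [math.dist(p, q) for p, q in zip(points_a, points_b)]
    return sum(distances) / len(distances)


print(dice([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.5
print(target_registration_error([(0, 0, 0), (1, 1, 1)], [(3, 4, 0), (1, 1, 1)]))  # 2.5
```

Higher Dice means better overlap of the matched regions, while lower target registration error means landmarks end up closer together after alignment.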

Submitted 17 May, 2024; originally announced May 2024.

Comments: Early Accepted by MICCAI2024

arXiv:2405.10493 [pdf, other]

Building imaginary-time thermal filed theory with artificial neural networks

Authors: Tian Xu, Lingxiao Wang, Lianyi He, Kai Zhou, Yin Jiang

Abstract: In this study, we introduce a novel approach in quantum field theories to estimate the action using artificial neural networks (ANNs). The estimation is achieved by learning on system configurations governed by the Boltzmann factor, $e^{-S}$, at different temperatures within the imaginary time formalism of thermal field theory. We focus on 0+1 dimensional quantum field with kink/anti-kink configurations to demonstrate the feasibility of the method. The continuous-mixture autoregressive networks (CANs) enable the construction of accurate effective actions with tractable probability density estimation. Our numerical results demonstrate that this methodology not only facilitates the construction of effective actions at specified temperatures but also adeptly estimates the action at intermediate temperatures using data from both lower and higher temperature ensembles. This capability is especially valuable for the detailed exploration of phase diagrams.

Submitted 16 May, 2024; originally announced May 2024.

Comments: 8 pages, 4 figures. Comments welcome!

Report number: RIKEN-iTHEMS-Report-24

arXiv:2405.09764 [pdf, other]

Clearing time randomization and transaction fees for auction market design

Authors: Thibaut Mastrolia, Tianrui Xu

Abstract: Flaws of a continuous limit order book mechanism raise the question of whether a continuous trading session and a periodic auction session would bring better efficiency. This paper goes further by designing a periodic auction when both a continuous market and a periodic auction market are available to traders. In a periodic auction, we discover that a strategic trader could take advantage of the accumulated information available along the auction duration by arriving at the latest moment before the auction closes, increasing the price impact on the market. Such price impact moves the clearing price away from the efficient price and may disturb the efficiency of a periodic auction market. We thus propose and quantify the effect of two remedies to mitigate these flaws: randomizing the auction's closing time and optimally designing a transaction fees policy. Our results show that these policies encourage a strategic trader to send their orders earlier to enhance the efficiency of the auction market, illustrated by data extracted from Alphabet and Apple stocks.

Submitted 15 May, 2024; originally announced May 2024.

Comments: 30 pages, 11 figures

arXiv:2405.08573 [pdf, other]

ViSTooth: A Visualization Framework for Tooth Segmentation on Panoramic Radiograph

Authors: Shenji Zhu, Miaoxin Hu, Tianya Pan, Yue Hong, Bin Li, Zhiguang Zhou, Ting Xu

Abstract: Tooth segmentation is a key step for computer aided diagnosis of dental diseases. Numerous machine learning models have been employed for tooth segmentation on dental panoramic radiographs. However, it is a difficult task to achieve accurate tooth segmentation due to complex tooth shapes, diverse tooth categories and incomplete sample sets for machine learning. In this paper, we propose ViSTooth, a visualization framework for tooth segmentation on dental panoramic radiographs. First, we employ Mask R-CNN to conduct preliminary tooth segmentation, and a set of domain metrics are proposed to estimate the accuracy of the segmented teeth, including tooth shape, tooth position and tooth angle. Then, we represent the teeth with high-dimensional vectors and visualize their distribution in a low-dimensional space, in which experts can easily observe those teeth with specific metrics. Further, we expand the sample set with the expert-specified teeth and train the tooth segmentation model iteratively. Finally, we conduct a case study and an expert study to demonstrate the effectiveness and usability of our ViSTooth, in aiding experts to implement accurate tooth segmentation guided by expert knowledge.

Submitted 14 May, 2024; originally announced May 2024.

arXiv:2405.07685 [pdf, other]

Comprehensive Analysis of Access Control Models in Edge Computing: Challenges, Solutions, and Future Directions

Authors: Tao Xue, Ying Zhang, Yanbin Wang, Wenbo Wang, Shuailou Li, Haibin Zhang

Abstract: Many contemporary applications, including smart homes and autonomous vehicles, rely on the Internet of Things technology. While cloud computing provides a multitude of valuable services for these applications, it generally imposes constraints on latency-sensitive applications due to the significant propagation delays. As a complementary technique to cloud computing, edge computing situates computing resources closer to the data sources, which reduces the latency and simultaneously alleviates the bandwidth pressure for the cloud and enhances data security. While edge computing offers significant advantages, it also presents significant challenges in access control -- a critical component for safeguarding data. For instance, it is crucial to implement access control mechanisms that are both effective and efficient on resource-constrained devices, ensuring high security without compromising the inherent low latency benefits of edge computing. These challenges drive the development of innovative access control solutions tailored to meet the unique requirements of edge computing environments. We classify related references from the perspectives of multiple data lifecycles (including data collection, storage, and usage), which thoroughly investigates the access control techniques and helps readers understand them systematically. Finally, we reflect on the classification and envisage future research directions.

Submitted 22 May, 2024; v1 submitted 13 May, 2024; originally announced May 2024.

arXiv:2405.07580 [pdf, other]

DynLLM: When Large Language Models Meet Dynamic Graph Recommendation

Authors: Ziwei Zhao, Fake Lin, Xi Zhu, Zhi Zheng, Tong Xu, Shitian Shen, Xueying Li, Zikai Yin, Enhong Chen

Abstract: The past year has witnessed considerable interest in Large Language Models (LLMs) for their potential applications in recommender systems, which may mitigate the persistent issue of data sparsity. Though large efforts have been made for user-item graph augmentation with better graph-based recommendation performance, they may fail to deal with the dynamic graph recommendation task, which involves both structural and temporal graph dynamics with inherent complexity in processing time-evolving data. To bridge this gap, in this paper, we propose a novel framework, called DynLLM, to deal with the dynamic graph recommendation task with LLMs. Specifically, DynLLM harnesses the power of LLMs to generate multi-faceted user profiles based on the rich textual features of historical purchase records, including crowd segments, personal interests, preferred categories, and favored brands, which in turn supplement and enrich the underlying relationships between users and items. Along this line, to fuse the multi-faceted profiles with temporal graph embedding, we engage LLMs to derive corresponding profile embeddings, and further employ a distilled attention mechanism to refine the LLM-generated profile embeddings for alleviating noisy signals, while also assessing and adjusting the relevance of each distilled facet embedding for seamless integration with temporal graph embedding from continuous time dynamic graphs (CTDGs). Extensive experiments on two real e-commerce datasets have validated the superior improvements of DynLLM over a wide range of state-of-the-art baseline methods.

Submitted 13 May, 2024; originally announced May 2024.

Comments: 11 pages, 5 figures

arXiv:2405.07303 [pdf, other]

Search for solar axions by Primakoff effect with the full dataset of the CDEX-1B Experiment

Authors: L. T. Yang, S. K. Liu, Q. Yue, K. J. Kang, Y. J. Li, H. P. An, Greeshma C., J. P. Chang, Y. H. Chen, J. P. Cheng, W. H. Dai, Z. Deng, C. H. Fang, X. P. Geng, H. Gong, Q. J. Guo, T. Guo, X. Y. Guo, L. He, J. R. He, J. W. Hu, H. X. Huang, T. C. Huang, L. Jiang, S. Karmakar , et al. (61 additional authors not shown)

Abstract: We present the first limit on $g_{Aγ}$ coupling constant using the Bragg-Primakoff conversion based on an exposure of 1107.5 kg days of data from the CDEX-1B experiment at the China Jinping Underground Laboratory. The data are consistent with the null signal hypothesis, and no excess signals are observed. Limits of the coupling $g_{Aγ}<2.08\times10^{-9}$ GeV$^{-1}$ (95% C.L.) are derived for axions with mass up to 100 eV/$c^2$. Within the hadronic model of KSVZ, our results exclude axion mass $>5.3~\rm{eV}/c^2$ at 95% C.L.

Submitted 12 May, 2024; originally announced May 2024.

Comments: 7 pages, 5 figures

arXiv:2405.06011 [pdf, other]

Exploring Dark Forces with Multimessenger Studies of Extreme Mass Ratio Inspirals

Authors: Badal Bhalla, Kuver Sinha, Tao Xu

Abstract: The exploration of dark sector interactions via gravitational waves (GWs) from binary inspirals has been a subject of recent interest. We study dark forces using extreme mass ratio inspirals (EMRIs), pointing out two issues of interest. Firstly, the innermost stable circular orbit (ISCO) of the EMRI, which sets the characteristic length scale of the system and hence the dark force range to which it exhibits enhanced sensitivity, probes force mediator masses that complement those studied with supermassive black hole (SMBH) or neutron star binaries. The LISA mission (the proposed $μ$Ares detector) will probe mediators with masses $m_V \sim 10^{-16}~{\rm eV}$ ($m_V \sim 10^{-18}~{\rm eV}$), corresponding to ISCOs of $10^6 M_\odot$ ($10^8 M_\odot$) central SMBHs. Secondly, while the sensitivity to dark couplings is typically limited by the uncertainty in the binary component masses, independent mass measurements of the central SMBH through reverberation mapping campaigns or the motion of dynamical tracers enable one to break this degeneracy. Our results, therefore, highlight the necessity for coordinated studies, loosely referred to as "multimessenger", between future $μ{\rm Hz}-{\rm mHz}$ GW observatories and ongoing and forthcoming SMBH mass measurement campaigns, including OzDES-RM, SDSS-RM, and SDSS-V Black Hole Mapper.

Submitted 9 May, 2024; originally announced May 2024.

Comments: 16+6 pages, 6 figures

Report number: CETUP-2023-013

arXiv:2405.05004 [pdf, other]

TENet: Targetness Entanglement Incorporating with Multi-Scale Pooling and Mutually-Guided Fusion for RGB-E Object Tracking

Authors: Pengcheng Shao, Tianyang Xu, Zhangyong Tang, Linze Li, Xiao-Jun Wu, Josef Kittler

Abstract: There is currently strong interest in improving visual object tracking by augmenting the RGB modality with the output of a visual event camera that is particularly informative about the scene motion. However, existing approaches perform event feature extraction for RGB-E tracking using traditional appearance models, which have been optimised for RGB only tracking, without adapting it for the intrinsic characteristics of the event data. To address this problem, we propose an Event backbone (Pooler), designed to obtain a high-quality feature representation that is cognisant of the innate characteristics of the event data, namely its sparsity. In particular, Multi-Scale Pooling is introduced to capture all the motion feature trends within event data through the utilisation of diverse pooling kernel sizes. The association between the derived RGB and event representations is established by an innovative module performing adaptive Mutually Guided Fusion (MGF). Extensive experimental results show that our method significantly outperforms state-of-the-art trackers on two widely used RGB-E tracking datasets, including VisEvent and COESOT, where the precision and success rates on COESOT are improved by 4.9% and 5.2%, respectively. Our code will be available at https://github.com/SSSpc333/TENet.

Submitted 8 May, 2024; originally announced May 2024.

arXiv:2405.04082 [pdf, other]

Logic-Skill Programming: An Optimization-based Approach to Sequential Skill Planning

Authors: Teng Xue, Amirreza Razmjoo, Suhan Shetty, Sylvain Calinon

Abstract: Recent advances in robot skill learning have unlocked the potential to construct task-agnostic skill libraries, facilitating the seamless sequencing of multiple simple manipulation primitives (a.k.a. skills) to tackle significantly more complex tasks. Nevertheless, determining the optimal sequence for independently learned skills remains an open problem, particularly when the objective is given solely in terms of the final geometric configuration rather than a symbolic goal. To address this challenge, we propose Logic-Skill Programming (LSP), an optimization-based approach that sequences independently learned skills to solve long-horizon tasks. We formulate a first-order extension of a mathematical program to optimize the overall cumulative reward of all skills within a plan, abstracted by the sum of value functions. To solve such programs, we leverage the use of tensor train factorization to construct the value function space, and rely on alternations between symbolic search and skill value optimization to find the appropriate skill skeleton and optimal subgoal sequence. Experimental results indicate that the obtained value functions provide a superior approximation of cumulative rewards compared to state-of-the-art reinforcement learning methods. Furthermore, we validate LSP in three manipulation domains, encompassing both prehensile and non-prehensile primitives. The results demonstrate its capability to identify the optimal solution over the full logic and geometric path. The real-robot experiments showcase the effectiveness of our approach to cope with contact uncertainty and external disturbances in the real world.

Submitted 4 June, 2024; v1 submitted 7 May, 2024; originally announced May 2024.

Comments: In Proc. Robotics: Science and Systems (RSS), 2024

arXiv:2405.02801 [pdf, other]

Mozart's Touch: A Lightweight Multi-modal Music Generation Framework Based on Pre-Trained Large Models

Authors: Tianze Xu, Jiajun Li, Xuesong Chen, Xinrui Yao, Shuchang Liu

Abstract: In recent years, AI-Generated Content (AIGC) has witnessed rapid advancements, facilitating the generation of music, images, and other forms of artistic expression across various industries. However, research on general multi-modal music generation models remains scarce. To fill this gap, we propose a multi-modal music generation framework, Mozart's Touch. It can generate music aligned with cross-modality inputs, such as images, videos and text. Mozart's Touch is composed of three main components: Multi-modal Captioning Module, Large Language Model (LLM) Understanding & Bridging Module, and Music Generation Module. Unlike traditional approaches, Mozart's Touch requires no training or fine-tuning of pre-trained models, offering efficiency and transparency through clear, interpretable prompts. We also introduce the "LLM-Bridge" method to resolve the heterogeneous representation problems between descriptive texts of different modalities. We conduct a series of objective and subjective evaluations on the proposed model, and results indicate that our model surpasses the performance of current state-of-the-art models. Our code and examples are available at: https://github.com/WangTooNaive/MozartsTouch

Submitted 7 May, 2024; v1 submitted 4 May, 2024; originally announced May 2024.

Comments: 7 pages, 2 figures, submitted to ACM MM 2024

arXiv:2405.02132 [pdf, other]

Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets

Authors: Xuelong Geng, Tianyi Xu, Kun Wei, Bingshen Mu, Hongfei Xue, He Wang, Yangze Li, Pengcheng Guo, Yuhang Dai, Longhao Li, Mingchen Shao, Lei Xie

Abstract: Large Language Models (LLMs) have demonstrated unparalleled effectiveness in various NLP tasks, and integrating LLMs with automatic speech recognition (ASR) is becoming a mainstream paradigm. Building upon this momentum, our research delves into an in-depth examination of this paradigm on a large open-source Chinese dataset. Specifically, our research aims to evaluate the impact of various configurations of speech encoders, LLMs, and projector modules in the context of the speech foundation encoder-LLM ASR paradigm. Furthermore, we introduce a three-stage training approach, expressly developed to enhance the model's ability to align auditory and textual information. The implementation of this approach, alongside the strategic integration of ASR components, enabled us to achieve SOTA performance on the AISHELL-1, Test_Net, and Test_Meeting test sets. Our analysis presents an empirical foundation for future research in LLM-based ASR systems and offers insights into optimizing performance using Chinese datasets. We will publicly release all scripts used for data preparation, training, inference, and scoring, as well as pre-trained models and training logs to promote reproducible research.

Submitted 6 May, 2024; v1 submitted 3 May, 2024; originally announced May 2024.

arXiv:2405.00168 [pdf, other]

Revisiting RGBT Tracking Benchmarks from the Perspective of Modality Validity: A New Benchmark, Problem, and Method

Authors: Zhangyong Tang, Tianyang Xu, Zhenhua Feng, Xuefeng Zhu, He Wang, Pengcheng Shao, Chunyang Cheng, Xiao-Jun Wu, Muhammad Awais, Sara Atito, Josef Kittler

Abstract: RGBT tracking draws increasing attention due to its robustness in multi-modality warranting (MMW) scenarios, such as nighttime and bad weather, where relying on a single sensing modality fails to ensure stable tracking results. However, the existing benchmarks predominantly consist of videos collected in common scenarios where both RGB and thermal infrared (TIR) information are of sufficient quality. This makes the data unrepresentative of severe imaging conditions, leading to tracking failures in MMW scenarios. To bridge this gap, we present a new benchmark, MV-RGBT, captured specifically in MMW scenarios. In contrast with the existing datasets, MV-RGBT comprises more object categories and scenes, providing a diverse and challenging benchmark. Furthermore, for severe imaging conditions of MMW scenarios, a new problem is posed, namely "when to fuse", to stimulate the development of fusion strategies for such data. We propose a new method based on a mixture of experts, namely MoETrack, as a baseline fusion strategy. In MoETrack, each expert generates independent tracking results along with the corresponding confidence score, which is used to control the fusion process. Extensive experimental results demonstrate the significant potential of MV-RGBT in advancing RGBT tracking and elicit the conclusion that fusion is not always beneficial, especially in MMW scenarios. Significantly, the proposed MoETrack method achieves new state-of-the-art results not only on MV-RGBT, but also on standard benchmarks, such as RGBT234, LasHeR, and the short-term split of VTUAV (VTUAV-ST). More information on MV-RGBT and the source code of MoETrack will be released at https://github.com/Zhangyong-Tang/MoETrack.

Submitted 30 April, 2024; originally announced May 2024.

arXiv:2404.17772 [pdf, other]

PWEXP: An R Package Using Piecewise Exponential Model for Study Design and Event/Timeline Prediction

Authors: Tianchen Xu, Rachael Wen

Abstract: Parametric assumptions such as exponential distribution are commonly used in clinical trial design and analysis. However, violation of distribution assumptions can introduce biases in sample size and power calculations. The piecewise exponential (PWE) hazard model partitions the hazard function into segments, each with a constant hazard, and is easy to interpret and compute. Due to its piecewise property, PWE can fit a wide range of survival curves and accurately predict the future number of events and analysis time in event-driven clinical trials, thus enabling more flexible and reliable study designs. Compared with other existing approaches, the PWE model provides a superior balance of flexibility and robustness in model fitting and prediction. The proposed PWEXP package is designed for estimating and predicting PWE hazard models for right-censored data. By utilizing well-established criteria such as AIC, BIC, and cross-validation log-likelihood, the PWEXP package chooses the optimal number of change-points and determines the optimal position of change-points. With its particular goodness-of-fit, the PWEXP package provides accurate and robust hazard estimation, which can be used for reliable power calculation during study design and timeline prediction during study conduct. The package also offers visualization functions to facilitate the interpretation of survival curve fitting results.

Submitted 27 April, 2024; originally announced April 2024.

Comments: 37 pages, 15 figures

MSC Class: 60-04; ACM Class: G.4

Showing 1–50 of 1,015 results for author: Xu, T