subscribe to arXiv mailings

R3D-AD: Reconstruction via Diffusion for 3D Anomaly Detection

Authors: Zheyuan Zhou, Le Wang, Naiyu Fang, Zili Wang, Lemiao Qiu, Shuyou Zhang

Abstract: 3D anomaly detection plays a crucial role in monitoring parts for localized inherent defects in precision manufacturing. Embedding-based and reconstruction-based approaches are among the most popular and successful methods. However, there are two major challenges to the practical application of the current approaches: 1) the embedded models suffer the prohibitive computational and storage due to t… ▽ More 3D anomaly detection plays a crucial role in monitoring parts for localized inherent defects in precision manufacturing. Embedding-based and reconstruction-based approaches are among the most popular and successful methods. However, there are two major challenges to the practical application of the current approaches: 1) the embedded models suffer the prohibitive computational and storage due to the memory bank structure; 2) the reconstructive models based on the MAE mechanism fail to detect anomalies in the unmasked regions. In this paper, we propose R3D-AD, reconstructing anomalous point clouds by diffusion model for precise 3D anomaly detection. Our approach capitalizes on the data distribution conversion of the diffusion process to entirely obscure the input's anomalous geometry. It step-wisely learns a strict point-level displacement behavior, which methodically corrects the aberrant points. To increase the generalization of the model, we further present a novel 3D anomaly simulation strategy named Patch-Gen to generate realistic and diverse defect shapes, which narrows the domain gap between training and testing. Our R3D-AD ensures a uniform spatial transformation, which allows straightforwardly generating anomaly results by distance comparison. Extensive experiments show that our R3D-AD outperforms previous state-of-the-art methods, achieving 73.4% Image-level AUROC on the Real3D-AD dataset and 74.9% Image-level AUROC on the Anomaly-ShapeNet dataset with an exceptional efficiency. △ Less

Submitted 15 July, 2024; originally announced July 2024.

Comments: ECCV 2024

arXiv:2407.10234 [pdf, other]

Understanding the non-trivial isoscalar pseudoscalar structures in the $K_S K_Sπ^0$ spectra in the $J/ψ$ radiative decay

Authors: Yin Cheng, Lin Qiu, Qiang Zhao

Abstract: Initiated by the recent observation of a flattened lineshape of $IJ^{PC}=00^{-+}$ around $1.4\sim 1.5$ GeV in the $K_S K_S π^0$ invariant mass spectrum by BESIII, we make a systematic partial wave analysis of $J/ψ\toγη_X\to γK\bar{K}π$ based on an isobaric approach. We demonstrate that in the scenario of the first radial excitations of the isoscalar pseudoscalar from the $K\bar{K}π$ threshold to a… ▽ More Initiated by the recent observation of a flattened lineshape of $IJ^{PC}=00^{-+}$ around $1.4\sim 1.5$ GeV in the $K_S K_S π^0$ invariant mass spectrum by BESIII, we make a systematic partial wave analysis of $J/ψ\toγη_X\to γK\bar{K}π$ based on an isobaric approach. We demonstrate that in the scenario of the first radial excitations of the isoscalar pseudoscalar from the $K\bar{K}π$ threshold to about 1.6 GeV the non-trivial $K_S K_S π^0$ invariant mass spectrum can be explained by the coupled-channel effects with the presence of the triangle singularity mechanism. It shows that a combined fit of the three-body and two-body spectra can be achieved which suggests that the one-state solution around $1.4\sim 1.5$ GeV proposed before still holds well. In particular, we show that the coupled-channel effects between the two most important quasi-two-body decay channels, $K^*\bar{K}+c.c.$ and $a_0(980)π$, can be well described by taking into account the one-loop corrections in the isobaric approach. This is because the isoscalar pseudoscalar states are coupled to the $K^*\bar{K}+c.c.$ and $a_0(980)π$ channels via the $P$ and $S$ waves, respectively. As a consequence, the coupled-channel effects can be largely absorbed into the redefinition of the tree-level effective couplings with the transition amplitudes computed to the order of one-loop corrections. Then, the coupled-channel effects can be estimated by the contributions from the one-loop rescattering amplitudes in comparison with the tree-level ones, where we find that the rescattering contributions from the $P$-wave into the $S$-wave, or vice verse, are apparently suppressed in the kinematic region near threshold. △ Less

Submitted 14 July, 2024; originally announced July 2024.

Comments: Revtex, 13 pages, 6 figures

arXiv:2407.07071 [pdf, other]

Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps

Authors: Yung-Sung Chuang, Linlu Qiu, Cheng-Yu Hsieh, Ranjay Krishna, Yoon Kim, James Glass

Abstract: When asked to summarize articles or answer questions given a passage, large language models (LLMs) can hallucinate details and respond with unsubstantiated answers that are inaccurate with respect to the input context. This paper describes a simple approach for detecting such contextual hallucinations. We hypothesize that contextual hallucinations are related to the extent to which an LLM attends… ▽ More When asked to summarize articles or answer questions given a passage, large language models (LLMs) can hallucinate details and respond with unsubstantiated answers that are inaccurate with respect to the input context. This paper describes a simple approach for detecting such contextual hallucinations. We hypothesize that contextual hallucinations are related to the extent to which an LLM attends to information in the provided context versus its own generations. Based on this intuition, we propose a simple hallucination detection model whose input features are given by the ratio of attention weights on the context versus newly generated tokens (for each attention head). We find that a linear classifier based on these lookback ratio features is as effective as a richer detector that utilizes the entire hidden states of an LLM or a text-based entailment model. The lookback ratio-based detector -- Lookback Lens -- is found to transfer across tasks and even models, allowing a detector that is trained on a 7B model to be applied (without retraining) to a larger 13B model. We further apply this detector to mitigate contextual hallucinations, and find that a simple classifier-guided decoding approach is able to reduce the amount of hallucination, for example by 9.6% in the XSum summarization task. △ Less

Submitted 9 July, 2024; originally announced July 2024.

Comments: The source code is available at https://github.com/voidism/Lookback-Lens

arXiv:2407.06174 [pdf, other]

The Tug-of-War Between Deepfake Generation and Detection

Authors: Hannah Lee, Changyeon Lee, Kevin Farhat, Lin Qiu, Steve Geluso, Aerin Kim, Oren Etzioni

Abstract: Multimodal generative models are rapidly evolving, leading to a surge in the generation of realistic video and audio that offers exciting possibilities but also serious risks. Deepfake videos, which can convincingly impersonate individuals, have particularly garnered attention due to their potential misuse in spreading misinformation and creating fraudulent content. This survey paper examines the… ▽ More Multimodal generative models are rapidly evolving, leading to a surge in the generation of realistic video and audio that offers exciting possibilities but also serious risks. Deepfake videos, which can convincingly impersonate individuals, have particularly garnered attention due to their potential misuse in spreading misinformation and creating fraudulent content. This survey paper examines the dual landscape of deepfake video generation and detection, emphasizing the need for effective countermeasures against potential abuses. We provide a comprehensive overview of current deepfake generation techniques, including face swapping, reenactment, and audio-driven animation, which leverage cutting-edge technologies like generative adversarial networks and diffusion models to produce highly realistic fake videos. Additionally, we analyze various detection approaches designed to differentiate authentic from altered videos, from detecting visual artifacts to deploying advanced algorithms that pinpoint inconsistencies across video and audio signals. The effectiveness of these detection methods heavily relies on the diversity and quality of datasets used for training and evaluation. We discuss the evolution of deepfake datasets, highlighting the importance of robust, diverse, and frequently updated collections to enhance the detection accuracy and generalizability. As deepfakes become increasingly indistinguishable from authentic content, developing advanced detection techniques that can keep pace with generation technologies is crucial. We advocate for a proactive approach in the "tug-of-war" between deepfake creators and detectors, emphasizing the need for continuous research collaboration, standardization of evaluation metrics, and the creation of comprehensive benchmarks. △ Less

Submitted 8 July, 2024; originally announced July 2024.

arXiv:2407.02616 [pdf]

Deep Learning Based Apparent Diffusion Coefficient Map Generation from Multi-parametric MR Images for Patients with Diffuse Gliomas

Authors: Zach Eidex, Mojtaba Safari, Jacob Wynne, Richard L. J. Qiu, Tonghe Wang, David Viar Hernandez, Hui-Kuo Shu, Hui Mao, Xiaofeng Yang

Abstract: Purpose: Apparent diffusion coefficient (ADC) maps derived from diffusion weighted (DWI) MRI provides functional measurements about the water molecules in tissues. However, DWI is time consuming and very susceptible to image artifacts, leading to inaccurate ADC measurements. This study aims to develop a deep learning framework to synthesize ADC maps from multi-parametric MR images. Methods: We pro… ▽ More Purpose: Apparent diffusion coefficient (ADC) maps derived from diffusion weighted (DWI) MRI provides functional measurements about the water molecules in tissues. However, DWI is time consuming and very susceptible to image artifacts, leading to inaccurate ADC measurements. This study aims to develop a deep learning framework to synthesize ADC maps from multi-parametric MR images. Methods: We proposed the multiparametric residual vision transformer model (MPR-ViT) that leverages the long-range context of ViT layers along with the precision of convolutional operators. Residual blocks throughout the network significantly increasing the representational power of the model. The MPR-ViT model was applied to T1w and T2- fluid attenuated inversion recovery images of 501 glioma cases from a publicly available dataset including preprocessed ADC maps. Selected patients were divided into training (N=400), validation (N=50) and test (N=51) sets, respectively. Using the preprocessed ADC maps as ground truth, model performance was evaluated and compared against the Vision Convolutional Transformer (VCT) and residual vision transformer (ResViT) models. Results: The results are as follows using T1w + T2-FLAIR MRI as inputs: MPR-ViT - PSNR: 31.0 +/- 2.1, MSE: 0.009 +/- 0.0005, SSIM: 0.950 +/- 0.015. In addition, ablation studies showed the relative impact on performance of each input sequence. Both qualitative and quantitative results indicate that the proposed MR- ViT model performs favorably against the ground truth data. Conclusion: We show that high-quality ADC maps can be synthesized from structural MRI using a MPR- VCT model. Our predicted images show better conformality to the ground truth volume than ResViT and VCT predictions. These high-quality synthetic ADC maps would be particularly useful for disease diagnosis and intervention, especially when ADC maps have artifacts or are unavailable. △ Less

Submitted 4 July, 2024; v1 submitted 2 July, 2024; originally announced July 2024.

Comments: arXiv admin note: text overlap with arXiv:2311.15044

arXiv:2407.02490 [pdf, other]

MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

Authors: Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu

Abstract: The computational challenges of Large Language Model (LLM) inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens (i.e., the pre-filling stage) on a single A100 GPU. Existing methods for speeding up prefi… ▽ More The computational challenges of Large Language Model (LLM) inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens (i.e., the pre-filling stage) on a single A100 GPU. Existing methods for speeding up prefilling often fail to maintain acceptable accuracy or efficiency when applied to long-context LLMs. To address this gap, we introduce MInference (Milliontokens Inference), a sparse calculation method designed to accelerate pre-filling of long-sequence processing. Specifically, we identify three unique patterns in long-context attention matrices-the A-shape, Vertical-Slash, and Block-Sparsethat can be leveraged for efficient sparse computation on GPUs. We determine the optimal pattern for each attention head offline and dynamically build sparse indices based on the assigned pattern during inference. With the pattern and sparse indices, we perform efficient sparse attention calculations via our optimized GPU kernels to significantly reduce the latency in the pre-filling stage of long-context LLMs. Our proposed technique can be directly applied to existing LLMs without any modifications to the pre-training setup or additional fine-tuning. By evaluating on a wide range of downstream tasks, including InfiniteBench, RULER, PG-19, and Needle In A Haystack, and models including LLaMA-3-1M, GLM4-1M, Yi-200K, Phi-3-128K, and Qwen2-128K, we demonstrate that MInference effectively reduces inference latency by up to 10x for pre-filling on an A100, while maintaining accuracy. Our code is available at https://aka.ms/MInference. △ Less

Submitted 2 July, 2024; originally announced July 2024.

arXiv:2407.01491 [pdf, other]

Expressive and Generalizable Low-rank Adaptation for Large Models via Slow Cascaded Learning

Authors: Siwei Li, Yifan Yang, Yifei Shen, Fangyun Wei, Zongqing Lu, Lili Qiu, Yuqing Yang

Abstract: Efficient fine-tuning plays a fundamental role in modern large models, with low-rank adaptation emerging as a particularly promising approach. However, the existing variants of LoRA are hampered by limited expressiveness, a tendency to overfit, and sensitivity to hyperparameter settings. This paper presents LoRA Slow Cascade Learning (LoRASC), an innovative technique designed to enhance LoRA's exp… ▽ More Efficient fine-tuning plays a fundamental role in modern large models, with low-rank adaptation emerging as a particularly promising approach. However, the existing variants of LoRA are hampered by limited expressiveness, a tendency to overfit, and sensitivity to hyperparameter settings. This paper presents LoRA Slow Cascade Learning (LoRASC), an innovative technique designed to enhance LoRA's expressiveness and generalization capabilities while preserving its training efficiency. Our approach augments expressiveness through a cascaded learning strategy that enables a mixture-of-low-rank adaptation, thereby increasing the model's ability to capture complex patterns. Additionally, we introduce a slow-fast update mechanism and cascading noisy tuning to bolster generalization. The extensive experiments on various language and vision datasets, as well as robustness benchmarks, demonstrate that the proposed method not only significantly outperforms existing baselines, but also mitigates overfitting, enhances model stability, and improves OOD robustness. Code will be release in https://github.com/microsoft/LoRASC very soon. △ Less

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2406.16864 [pdf, other]

StableNormal: Reducing Diffusion Variance for Stable and Sharp Normal

Authors: Chongjie Ye, Lingteng Qiu, Xiaodong Gu, Qi Zuo, Yushuang Wu, Zilong Dong, Liefeng Bo, Yuliang Xiu, Xiaoguang Han

Abstract: This work addresses the challenge of high-quality surface normal estimation from monocular colored inputs (i.e., images and videos), a field which has recently been revolutionized by repurposing diffusion priors. However, previous attempts still struggle with stochastic inference, conflicting with the deterministic nature of the Image2Normal task, and costly ensembling step, which slows down the e… ▽ More This work addresses the challenge of high-quality surface normal estimation from monocular colored inputs (i.e., images and videos), a field which has recently been revolutionized by repurposing diffusion priors. However, previous attempts still struggle with stochastic inference, conflicting with the deterministic nature of the Image2Normal task, and costly ensembling step, which slows down the estimation process. Our method, StableNormal, mitigates the stochasticity of the diffusion process by reducing inference variance, thus producing "Stable-and-Sharp" normal estimates without any additional ensembling process. StableNormal works robustly under challenging imaging conditions, such as extreme lighting, blurring, and low quality. It is also robust against transparent and reflective surfaces, as well as cluttered scenes with numerous objects. Specifically, StableNormal employs a coarse-to-fine strategy, which starts with a one-step normal estimator (YOSO) to derive an initial normal guess, that is relatively coarse but reliable, then followed by a semantic-guided refinement process (SG-DRN) that refines the normals to recover geometric details. The effectiveness of StableNormal is demonstrated through competitive performance in standard datasets such as DIODE-indoor, iBims, ScannetV2 and NYUv2, and also in various downstream tasks, such as surface reconstruction and normal enhancement. These results evidence that StableNormal retains both the "stability" and "sharpness" for accurate normal estimation. StableNormal represents a baby attempt to repurpose diffusion priors for deterministic estimation. To democratize this, code and models have been publicly available in hf.co/Stable-X △ Less

Submitted 24 June, 2024; originally announced June 2024.

Comments: HF Demo: hf.co/Stable-X, Video: https://www.youtube.com/watch?v=sylXTxG_U2U

arXiv:2406.15656 [pdf, other]

Adaptive Self-Supervised Consistency-Guided Diffusion Model for Accelerated MRI Reconstruction

Authors: Mojtaba Safari, Zach Eidex, Shaoyan Pan, Richard L. J. Qiu, Xiaofeng Yang

Abstract: Purpose: To propose a self-supervised deep learning-based compressed sensing MRI (DL-based CS-MRI) method named "Adaptive Self-Supervised Consistency Guided Diffusion Model (ASSCGD)" to accelerate data acquisition without requiring fully sampled datasets. Materials and Methods: We used the fastMRI multi-coil brain axial T2-weighted (T2-w) dataset from 1,376 cases and single-coil brain quantitative… ▽ More Purpose: To propose a self-supervised deep learning-based compressed sensing MRI (DL-based CS-MRI) method named "Adaptive Self-Supervised Consistency Guided Diffusion Model (ASSCGD)" to accelerate data acquisition without requiring fully sampled datasets. Materials and Methods: We used the fastMRI multi-coil brain axial T2-weighted (T2-w) dataset from 1,376 cases and single-coil brain quantitative magnetization prepared 2 rapid acquisition gradient echoes (MP2RAGE) T1 maps from 318 cases to train and test our model. Robustness against domain shift was evaluated using two out-of-distribution (OOD) datasets: multi-coil brain axial postcontrast T1 -weighted (T1c) dataset from 50 cases and axial T1-weighted (T1-w) dataset from 50 patients. Data were retrospectively subsampled at acceleration rates R in {2x, 4x, 8x}. ASSCGD partitions a random sampling pattern into two disjoint sets, ensuring data consistency during training. We compared our method with ReconFormer Transformer and SS-MRI, assessing performance using normalized mean squared error (NMSE), peak signal-to-noise ratio (PSNR), and structural similarity index (SSIM). Statistical tests included one-way analysis of variance (ANOVA) and multi-comparison Tukey's Honesty Significant Difference (HSD) tests. Results: ASSCGD preserved fine structures and brain abnormalities visually better than comparative methods at R = 8x for both multi-coil and single-coil datasets. It achieved the lowest NMSE at R in {4x, 8x}, and the highest PSNR and SSIM values at all acceleration rates for the multi-coil dataset. Similar trends were observed for the single-coil dataset, though SSIM values were comparable to ReconFormer at R in {2x, 8x}. These results were further confirmed by the voxel-wise correlation scatter plots. OOD results showed significant (p << 10^-5 ) improvements in undersampled image quality after reconstruction. △ Less

Submitted 21 June, 2024; originally announced June 2024.

arXiv:2406.11375 [pdf, other]

Boosting Scientific Concepts Understanding: Can Analogy from Teacher Models Empower Student Models?

Authors: Siyu Yuan, Cheng Jiayang, Lin Qiu, Deqing Yang

Abstract: Analogical reasoning plays a critical role in human cognition, enabling us to understand new concepts by associating them with familiar ones. Previous research in the AI community has mainly focused on identifying and generating analogies and then examining their quality under human evaluation, which overlooks the practical application of these analogies in real-world settings. Inspired by the hum… ▽ More Analogical reasoning plays a critical role in human cognition, enabling us to understand new concepts by associating them with familiar ones. Previous research in the AI community has mainly focused on identifying and generating analogies and then examining their quality under human evaluation, which overlooks the practical application of these analogies in real-world settings. Inspired by the human education process, in this paper, we propose to investigate how analogies created by teacher language models (LMs) can assist student LMs in understanding scientific concepts, thereby aligning more closely with practical scenarios. Our results suggest that free-form analogies can indeed aid LMs in understanding concepts. Additionally, analogies generated by student LMs can improve their own performance on scientific question answering, demonstrating their capability to use analogies for self-learning new knowledge. Resources are available at https://github.com/siyuyuan/SCUA. △ Less

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.09931 [pdf, other]

SCKansformer: Fine-Grained Classification of Bone Marrow Cells via Kansformer Backbone and Hierarchical Attention Mechanisms

Authors: Yifei Chen, Zhu Zhu, Shenghao Zhu, Linwei Qiu, Binfeng Zou, Fan Jia, Yunpeng Zhu, Chenyan Zhang, Zhaojie Fang, Feiwei Qin, Jin Fan, Changmiao Wang, Yu Gao, Gang Yu

Abstract: The incidence and mortality rates of malignant tumors, such as acute leukemia, have risen significantly. Clinically, hospitals rely on cytological examination of peripheral blood and bone marrow smears to diagnose malignant tumors, with accurate blood cell counting being crucial. Existing automated methods face challenges such as low feature expression capability, poor interpretability, and redund… ▽ More The incidence and mortality rates of malignant tumors, such as acute leukemia, have risen significantly. Clinically, hospitals rely on cytological examination of peripheral blood and bone marrow smears to diagnose malignant tumors, with accurate blood cell counting being crucial. Existing automated methods face challenges such as low feature expression capability, poor interpretability, and redundant feature extraction when processing high-dimensional microimage data. We propose a novel fine-grained classification model, SCKansformer, for bone marrow blood cells, which addresses these challenges and enhances classification accuracy and efficiency. The model integrates the Kansformer Encoder, SCConv Encoder, and Global-Local Attention Encoder. The Kansformer Encoder replaces the traditional MLP layer with the KAN, improving nonlinear feature representation and interpretability. The SCConv Encoder, with its Spatial and Channel Reconstruction Units, enhances feature representation and reduces redundancy. The Global-Local Attention Encoder combines Multi-head Self-Attention with a Local Part module to capture both global and local features. We validated our model using the Bone Marrow Blood Cell Fine-Grained Classification Dataset (BMCD-FGCD), comprising over 10,000 samples and nearly 40 classifications, developed with a partner hospital. Comparative experiments on our private dataset, as well as the publicly available PBC and ALL-IDB datasets, demonstrate that SCKansformer outperforms both typical and advanced microcell classification methods across all datasets. Our source code and private BMCD-FGCD dataset are available at https://github.com/JustlfC03/SCKansformer. △ Less

Submitted 14 June, 2024; originally announced June 2024.

Comments: 15 pages, 6 figures

arXiv:2406.05962 [pdf, other]

Data Caching for Enterprise-Grade Petabyte-Scale OLAP

Authors: Chunxu Tang, Bin Fan, Jing Zhao, Chen Liang, Yi Wang, Beinan Wang, Ziyue Qiu, Lu Qiu, Bowen Ding, Shouzhuo Sun, Saiguang Che, Jiaming Mai, Shouwei Chen, Yu Zhu, Jianjian Xie, Yutian, Sun, Yao Li, Yangjun Zhang, Ke Wang, Mingmin Chen

Abstract: With the exponential growth of data and evolving use cases, petabyte-scale OLAP data platforms are increasingly adopting a model that decouples compute from storage. This shift, evident in organizations like Uber and Meta, introduces operational challenges including massive, read-heavy I/O traffic with potential throttling, as well as skewed and fragmented data access patterns. Addressing these ch… ▽ More With the exponential growth of data and evolving use cases, petabyte-scale OLAP data platforms are increasingly adopting a model that decouples compute from storage. This shift, evident in organizations like Uber and Meta, introduces operational challenges including massive, read-heavy I/O traffic with potential throttling, as well as skewed and fragmented data access patterns. Addressing these challenges, this paper introduces the Alluxio local (edge) cache, a highly effective architectural optimization tailored for such environments. This embeddable cache, optimized for petabyte-scale data analytics, leverages local SSD resources to alleviate network I/O and API call pressures, significantly improving data transfer efficiency. Integrated with OLAP systems like Presto and storage services like HDFS, the Alluxio local cache has demonstrated its effectiveness in handling large-scale, enterprise-grade workloads over three years of deployment at Uber and Meta. We share insights and operational experiences in implementing these optimizations, providing valuable perspectives on managing modern, massive-scale OLAP workloads. △ Less

Submitted 9 June, 2024; originally announced June 2024.

Comments: Accepted to the USENIX Annual Technical Conference (USENIX ATC) 2024

arXiv:2406.02756 [pdf, other]

Aligning Large Language Models via Fine-grained Supervision

Authors: Dehong Xu, Liang Qiu, Minseok Kim, Faisal Ladhak, Jaeyoung Do

Abstract: Pre-trained large-scale language models (LLMs) excel at producing coherent articles, yet their outputs may be untruthful, toxic, or fail to align with user expectations. Current approaches focus on using reinforcement learning with human feedback (RLHF) to improve model alignment, which works by transforming coarse human preferences of LLM outputs into a feedback signal that guides the model learn… ▽ More Pre-trained large-scale language models (LLMs) excel at producing coherent articles, yet their outputs may be untruthful, toxic, or fail to align with user expectations. Current approaches focus on using reinforcement learning with human feedback (RLHF) to improve model alignment, which works by transforming coarse human preferences of LLM outputs into a feedback signal that guides the model learning process. However, because this approach operates on sequence-level feedback, it lacks the precision to identify the exact parts of the output affecting user preferences. To address this gap, we propose a method to enhance LLM alignment through fine-grained token-level supervision. Specifically, we ask annotators to minimally edit less preferred responses within the standard reward modeling dataset to make them more favorable, ensuring changes are made only where necessary while retaining most of the original content. The refined dataset is used to train a token-level reward model, which is then used for training our fine-grained Proximal Policy Optimization (PPO) model. Our experiment results demonstrate that this approach can achieve up to an absolute improvement of $5.1\%$ in LLM performance, in terms of win rate against the reference model, compared with the traditional PPO model. △ Less

Submitted 4 June, 2024; originally announced June 2024.

arXiv:2406.02536 [pdf, other]

Mitigate Position Bias in Large Language Models via Scaling a Single Dimension

Authors: Yijiong Yu, Huiqiang Jiang, Xufang Luo, Qianhui Wu, Chin-Yew Lin, Dongsheng Li, Yuqing Yang, Yongfeng Huang, Lili Qiu

Abstract: Large Language Models (LLMs) are increasingly applied in various real-world scenarios due to their excellent generalization capabilities and robust generative abilities. However, they exhibit position bias, also known as "lost in the middle", a phenomenon that is especially pronounced in long-context scenarios, which indicates the placement of the key information in different positions of a prompt… ▽ More Large Language Models (LLMs) are increasingly applied in various real-world scenarios due to their excellent generalization capabilities and robust generative abilities. However, they exhibit position bias, also known as "lost in the middle", a phenomenon that is especially pronounced in long-context scenarios, which indicates the placement of the key information in different positions of a prompt can significantly affect accuracy. This paper first explores the micro-level manifestations of position bias, concluding that attention weights are a micro-level expression of position bias. It further identifies that, in addition to position embeddings, causal attention mask also contributes to position bias by creating position-specific hidden states. Based on these insights, we propose a method to mitigate position bias by scaling this positional hidden states. Experiments on the NaturalQuestions Multi-document QA, KV retrieval, LongBench and timeline reorder tasks, using various models including RoPE models, context windowextended models, and Alibi models, demonstrate the effectiveness and generalizability of our approach. Our method can improve performance by up to 15.2% by modifying just one dimension of hidden states. Our code is available at https://aka.ms/PositionalHidden. △ Less

Submitted 4 June, 2024; originally announced June 2024.

arXiv:2405.19888 [pdf, other]

Parrot: Efficient Serving of LLM-based Applications with Semantic Variable

Authors: Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, Lili Qiu

Abstract: The rise of large language models (LLMs) has enabled LLM-based applications (a.k.a. AI agents or co-pilots), a new software paradigm that combines the strength of LLM and conventional software. Diverse LLM applications from different tenants could design complex workflows using multiple LLM requests to accomplish one task. However, they have to use the over-simplified request-level API provided by… ▽ More The rise of large language models (LLMs) has enabled LLM-based applications (a.k.a. AI agents or co-pilots), a new software paradigm that combines the strength of LLM and conventional software. Diverse LLM applications from different tenants could design complex workflows using multiple LLM requests to accomplish one task. However, they have to use the over-simplified request-level API provided by today's public LLM services, losing essential application-level information. Public LLM services have to blindly optimize individual LLM requests, leading to sub-optimal end-to-end performance of LLM applications. This paper introduces Parrot, an LLM service system that focuses on the end-to-end experience of LLM-based applications. Parrot proposes Semantic Variable, a unified abstraction to expose application-level knowledge to public LLM services. A Semantic Variable annotates an input/output variable in the prompt of a request, and creates the data pipeline when connecting multiple LLM requests, providing a natural way to program LLM applications. Exposing Semantic Variables to the public LLM service allows it to perform conventional data flow analysis to uncover the correlation across multiple LLM requests. This correlation opens a brand-new optimization space for the end-to-end performance of LLM-based applications. Extensive evaluations demonstrate that Parrot can achieve up to an order-of-magnitude improvement for popular and practical use cases of LLM applications. △ Less

Submitted 30 May, 2024; originally announced May 2024.

Comments: To appear on USENIX OSDI 2024

arXiv:2405.16337 [pdf, other]

Learning to Reason via Program Generation, Emulation, and Search

Authors: Nathaniel Weir, Muhammad Khalifa, Linlu Qiu, Orion Weller, Peter Clark

Abstract: Program synthesis with language models (LMs) has unlocked a large set of reasoning abilities; code-tuned LMs have proven adept at generating programs that solve a wide variety of algorithmic symbolic manipulation tasks (e.g. word concatenation). However, not all reasoning tasks are easily expressible as code, e.g. tasks involving commonsense reasoning, moral decision-making, and sarcasm understand… ▽ More Program synthesis with language models (LMs) has unlocked a large set of reasoning abilities; code-tuned LMs have proven adept at generating programs that solve a wide variety of algorithmic symbolic manipulation tasks (e.g. word concatenation). However, not all reasoning tasks are easily expressible as code, e.g. tasks involving commonsense reasoning, moral decision-making, and sarcasm understanding. Our goal is to extend an LM's program synthesis skills to such tasks and evaluate the results via pseudo-programs, namely Python programs where some leaf function calls are left undefined. To that end, we propose, Code Generation and Emulated EXecution (CoGEX). CoGEX works by (1) training LMs to generate their own pseudo-programs, (2) teaching them to emulate their generated program's execution, including those leaf functions, allowing the LM's knowledge to fill in the execution gaps; and (3) using them to search over many programs to find an optimal one. To adapt the CoGEX model to a new task, we introduce a method for performing program search to find a single program whose pseudo-execution yields optimal performance when applied to all the instances of a given dataset. We show that our approach yields large improvements compared to standard in-context learning approaches on a battery of tasks, both algorithmic and soft reasoning. This result thus demonstrates that code synthesis can be applied to a much broader class of problems than previously considered. Our released dataset, fine-tuned models, and implementation can be found at \url{https://github.com/nweir127/CoGEX}. △ Less

Submitted 28 May, 2024; v1 submitted 25 May, 2024; originally announced May 2024.

Comments: 16 pages, 10 figures

arXiv:2405.14486 [pdf, other]

RefChecker: Reference-based Fine-grained Hallucination Checker and Benchmark for Large Language Models

Authors: Xiangkun Hu, Dongyu Ru, Lin Qiu, Qipeng Guo, Tianhang Zhang, Yang Xu, Yun Luo, Pengfei Liu, Yue Zhang, Zheng Zhang

Abstract: Large Language Models (LLMs) have shown impressive capabilities but also a concerning tendency to hallucinate. This paper presents RefChecker, a framework that introduces claim-triplets to represent claims in LLM responses, aiming to detect fine-grained hallucinations. In RefChecker, an extractor generates claim-triplets from a response, which are then evaluated by a checker against a reference. W… ▽ More Large Language Models (LLMs) have shown impressive capabilities but also a concerning tendency to hallucinate. This paper presents RefChecker, a framework that introduces claim-triplets to represent claims in LLM responses, aiming to detect fine-grained hallucinations. In RefChecker, an extractor generates claim-triplets from a response, which are then evaluated by a checker against a reference. We delineate three task settings: Zero, Noisy and Accurate Context, to reflect various real-world use cases. We curated a benchmark spanning various NLP tasks and annotated 11k claim-triplets from 2.1k responses by seven LLMs. RefChecker supports both proprietary and open-source models as the extractor and checker. Experiments demonstrate that claim-triplets enable superior hallucination detection, compared to other granularities such as response, sentence and sub-sentence level claims. RefChecker outperforms prior methods by 6.8 to 26.1 points on our benchmark and the checking results of RefChecker are strongly aligned with human judgments. This work is open sourced at https://github.com/amazon-science/RefChecker △ Less

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2405.13634 [pdf, other]

Secure Communications in Near-Filed ISCAP Systems with Extremely Large-Scale Antenna Arrays

Authors: Zixiang Ren, Siyao Zhang, Xinmin Li, Ling Qiu, Jie Xu, Derrick Wing Kwan Ng

Abstract: This paper investigates secure communications in a near-field multi-functional integrated sensing, communication, and powering (ISCAP) system with an extremely large-scale antenna arrays (ELAA) equipped at the base station (BS). In this system, the BS sends confidential messages to a single communication user (CU), and at the same time wirelessly senses a point target and charges multiple energy r… ▽ More This paper investigates secure communications in a near-field multi-functional integrated sensing, communication, and powering (ISCAP) system with an extremely large-scale antenna arrays (ELAA) equipped at the base station (BS). In this system, the BS sends confidential messages to a single communication user (CU), and at the same time wirelessly senses a point target and charges multiple energy receivers (ERs). It is assumed that the ERs and the sensing target are potential eavesdroppers that may attempt to intercept the confidential messages intended for the CU. We consider the joint transmit beamforming design to support secure communications while ensuring the sensing and powering requirements. In particular, the BS transmits dedicated sensing/energy beams in addition to the information beam, which also play the role of artificial noise (AN) for effectively jamming potential eavesdroppers. Building upon this, we maximize the secrecy rate at the CU, subject to the maximum \ac{crb} constraints for target sensing and the minimum harvested energy constraints for the ERs. Although the formulated joint beamforming problem is non-convex and challenging to solve, we acquire the optimal solution via the semi-definite relaxation (SDR) and fractional programming techniques together with a one-dimensional (1D) search. Subsequently, we present two alternative designs based on zero-forcing (ZF) beamforming and maximum ratio transmission (MRT), respectively. Finally, our numerical results show that our proposed approaches exploit both the distance-domain resolution of near-field ELAA and the joint beamforming design for enhancing secure communication performance while ensuring the sensing and powering requirements in ISCAP, especially when the CU and the target and ER eavesdroppers are located at the same angle (but different distances) with respect to the BS. △ Less

Submitted 22 May, 2024; originally announced May 2024.

Comments: 6 pages

arXiv:2405.12470 [pdf, other]

Dynamical Geometry of the Haldane Model under a Quantum Quench

Authors: Liwei Qiu, Lih-King Lim, Xin Wan

Abstract: We explore the time evolution of a topological system when the system undergoes a sudden quantum quench within the same nontrivial phase. Using Haldane's honeycomb model as an example, we show that equilibrium states in a topological phase can be distinguished by geometrical features, such as the characteristic momentum at which the half-occupied edge modes cross, the associated edge-mode velocity… ▽ More We explore the time evolution of a topological system when the system undergoes a sudden quantum quench within the same nontrivial phase. Using Haldane's honeycomb model as an example, we show that equilibrium states in a topological phase can be distinguished by geometrical features, such as the characteristic momentum at which the half-occupied edge modes cross, the associated edge-mode velocity, and the winding vector about which the normalized pseudospin magnetic field winds along a great circle on the Bloch sphere. We generalize these geometrical quantities for non-equilibrium states and use them to visualize the quench dynamics of the topological system. In general, we find the pre-quench equilibrium state relaxes to the post-quench equilibrium state in an oscillatory fashion, whose amplitude decay as $t^{1/2}$. In the process, however, the characteristic winding vector of the non-equilibrium system can evolve to regimes that are not reachable with equilibrium states. △ Less

Submitted 20 May, 2024; originally announced May 2024.

Comments: 13 pages, 11 figures

arXiv:2405.09891 [pdf]

Adaptive Proton Therapy Using CBCT-Guided Digital Twins

Authors: Chih-Wei Chang, Zhen Tian, Richard L. J. Qiu, H. Scott McGinnis, Duncan Bohannon, Pretesh Patel, Yinan Wang, David S. Yu, Sagar A. Patel, Jun Zhou, Xiaofeng Yang

Abstract: This study aims to develop a digital twin (DT) framework to enhance adaptive proton stereotactic body radiation therapy (SBRT) for prostate cancer. Prostate SBRT has emerged as a leading option for external beam radiotherapy due to its effectiveness and reduced treatment duration. However, interfractional anatomy variations can impact treatment outcomes. This study seeks to address these uncertain… ▽ More This study aims to develop a digital twin (DT) framework to enhance adaptive proton stereotactic body radiation therapy (SBRT) for prostate cancer. Prostate SBRT has emerged as a leading option for external beam radiotherapy due to its effectiveness and reduced treatment duration. However, interfractional anatomy variations can impact treatment outcomes. This study seeks to address these uncertainties using DT concept, with the goal of improving treatment quality, potentially revolutionizing prostate radiotherapy to offer personalized treatment solutions. Our study presented a pioneering approach that leverages DT technology to enhance adaptive proton SBRT. The framework improves treatment plans by utilizing patient-specific CTV setup uncertainty, which is usually smaller than conventional clinical setups. This research contributes to the ongoing efforts to enhance the efficiency and efficacy of prostate radiotherapy, with ultimate goals of improving patient outcomes and life quality. △ Less

Submitted 17 May, 2024; v1 submitted 16 May, 2024; originally announced May 2024.

arXiv:2405.05945 [pdf, other]

Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers

Authors: Peng Gao, Le Zhuo, Dongyang Liu, Ruoyi Du, Xu Luo, Longtian Qiu, Yuhang Zhang, Chen Lin, Rongjie Huang, Shijie Geng, Renrui Zhang, Junlin Xi, Wenqi Shao, Zhengkai Jiang, Tianshuo Yang, Weicai Ye, He Tong, Jingwen He, Yu Qiao, Hongsheng Li

Abstract: Sora unveils the potential of scaling Diffusion Transformer for generating photorealistic images and videos at arbitrary resolutions, aspect ratios, and durations, yet it still lacks sufficient implementation details. In this technical report, we introduce the Lumina-T2X family - a series of Flow-based Large Diffusion Transformers (Flag-DiT) equipped with zero-initialized attention, as a unified f… ▽ More Sora unveils the potential of scaling Diffusion Transformer for generating photorealistic images and videos at arbitrary resolutions, aspect ratios, and durations, yet it still lacks sufficient implementation details. In this technical report, we introduce the Lumina-T2X family - a series of Flow-based Large Diffusion Transformers (Flag-DiT) equipped with zero-initialized attention, as a unified framework designed to transform noise into images, videos, multi-view 3D objects, and audio clips conditioned on text instructions. By tokenizing the latent spatial-temporal space and incorporating learnable placeholders such as [nextline] and [nextframe] tokens, Lumina-T2X seamlessly unifies the representations of different modalities across various spatial-temporal resolutions. This unified approach enables training within a single framework for different modalities and allows for flexible generation of multimodal data at any resolution, aspect ratio, and length during inference. Advanced techniques like RoPE, RMSNorm, and flow matching enhance the stability, flexibility, and scalability of Flag-DiT, enabling models of Lumina-T2X to scale up to 7 billion parameters and extend the context window to 128K tokens. This is particularly beneficial for creating ultra-high-definition images with our Lumina-T2I model and long 720p videos with our Lumina-T2V model. Remarkably, Lumina-T2I, powered by a 5-billion-parameter Flag-DiT, requires only 35% of the training computational costs of a 600-million-parameter naive DiT. Our further comprehensive analysis underscores Lumina-T2X's preliminary capability in resolution extrapolation, high-resolution editing, generating consistent 3D views, and synthesizing videos with seamless transitions. We expect that the open-sourcing of Lumina-T2X will further foster creativity, transparency, and diversity in the generative AI community. △ Less

Submitted 13 June, 2024; v1 submitted 9 May, 2024; originally announced May 2024.

Comments: Technical Report; Code at: https://github.com/Alpha-VLLM/Lumina-T2X

arXiv:2405.00241 [pdf]

Fast MRI Reconstruction Using Deep Learning-based Compressed Sensing: A Systematic Review

Authors: Mojtaba Safari, Zach Eidex, Chih-Wei Chang, Richard L. J. Qiu, Xiaofeng Yang

Abstract: Magnetic resonance imaging (MRI) has revolutionized medical imaging, providing a non-invasive and highly detailed look into the human body. However, the long acquisition times of MRI present challenges, causing patient discomfort, motion artifacts, and limiting real-time applications. To address these challenges, researchers are exploring various techniques to reduce acquisition time and improve t… ▽ More Magnetic resonance imaging (MRI) has revolutionized medical imaging, providing a non-invasive and highly detailed look into the human body. However, the long acquisition times of MRI present challenges, causing patient discomfort, motion artifacts, and limiting real-time applications. To address these challenges, researchers are exploring various techniques to reduce acquisition time and improve the overall efficiency of MRI. One such technique is compressed sensing (CS), which reduces data acquisition by leveraging image sparsity in transformed spaces. In recent years, deep learning (DL) has been integrated with CS-MRI, leading to a new framework that has seen remarkable growth. DL-based CS-MRI approaches are proving to be highly effective in accelerating MR imaging without compromising image quality. This review comprehensively examines DL-based CS-MRI techniques, focusing on their role in increasing MR imaging speed. We provide a detailed analysis of each category of DL-based CS-MRI including end-to-end, unroll optimization, self-supervised, and federated learning. Our systematic review highlights significant contributions and underscores the exciting potential of DL in CS-MRI. Additionally, our systematic review efficiently summarizes key results and trends in DL-based CS-MRI including quantitative metrics, the dataset used, acceleration factors, and the progress of and research interest in DL techniques over time. Finally, we discuss potential future directions and the importance of DL-based CS-MRI in the advancement of medical imaging. To facilitate further research in this area, we provide a GitHub repository that includes up-to-date DL-based CS-MRI publications and publicly available datasets - https://github.com/mosaf/Awesome-DL-based-CS-MRI. △ Less

Submitted 30 April, 2024; originally announced May 2024.

arXiv:2404.18388 [pdf, other]

SPECIAL: Synopsis Assisted Secure Collaborative Analytics

Authors: Chenghong Wang, Lina Qiu, Johes Bater, Yukui Luo

Abstract: Secure collaborative analytics (SCA) enable the processing of analytical SQL queries across multiple owners' data, even when direct data sharing is not feasible. Although essential for strong privacy, the large overhead from data-oblivious primitives in traditional SCA has hindered its practical adoption. Recent SCA variants that permit controlled leakages under differential privacy (DP) show a be… ▽ More Secure collaborative analytics (SCA) enable the processing of analytical SQL queries across multiple owners' data, even when direct data sharing is not feasible. Although essential for strong privacy, the large overhead from data-oblivious primitives in traditional SCA has hindered its practical adoption. Recent SCA variants that permit controlled leakages under differential privacy (DP) show a better balance between privacy and efficiency. However, they still face significant challenges, such as potentially unbounded privacy loss, suboptimal query planning, and lossy processing. To address these challenges, we introduce SPECIAL, the first SCA system that simultaneously ensures bounded privacy loss, advanced query planning, and lossless processing. SPECIAL employs a novel synopsis-assisted secure processing model, where a one-time privacy cost is spent to acquire private synopses (table statistics) from owner data. These synopses then allow SPECIAL to estimate (compaction) sizes for secure operations (e.g., filter, join) and index encrypted data without extra privacy loss. Crucially, these estimates and indexes can be prepared before runtime, thereby facilitating efficient query planning and accurate cost estimations. Moreover, by using one-sided noise mechanisms and private upper bound techniques, SPECIAL ensures strict lossless processing for complex queries (e.g., multi-join). Through a comprehensive benchmark, we show that SPECIAL significantly outperforms cutting-edge SCAs, with up to 80X faster query times and over 900X smaller memory for complex queries. Moreover, it also achieves up to an 89X reduction in privacy loss under continual processing. △ Less

Submitted 28 April, 2024; originally announced April 2024.

arXiv:2404.15946 [pdf]

Mammo-CLIP: Leveraging Contrastive Language-Image Pre-training (CLIP) for Enhanced Breast Cancer Diagnosis with Multi-view Mammography

Authors: Xuxin Chen, Yuheng Li, Mingzhe Hu, Ella Salari, Xiaoqian Chen, Richard L. J. Qiu, Bin Zheng, Xiaofeng Yang

Abstract: Although fusion of information from multiple views of mammograms plays an important role to increase accuracy of breast cancer detection, developing multi-view mammograms-based computer-aided diagnosis (CAD) schemes still faces challenges and no such CAD schemes have been used in clinical practice. To overcome the challenges, we investigate a new approach based on Contrastive Language-Image Pre-tr… ▽ More Although fusion of information from multiple views of mammograms plays an important role to increase accuracy of breast cancer detection, developing multi-view mammograms-based computer-aided diagnosis (CAD) schemes still faces challenges and no such CAD schemes have been used in clinical practice. To overcome the challenges, we investigate a new approach based on Contrastive Language-Image Pre-training (CLIP), which has sparked interest across various medical imaging tasks. By solving the challenges in (1) effectively adapting the single-view CLIP for multi-view feature fusion and (2) efficiently fine-tuning this parameter-dense model with limited samples and computational resources, we introduce Mammo-CLIP, the first multi-modal framework to process multi-view mammograms and corresponding simple texts. Mammo-CLIP uses an early feature fusion strategy to learn multi-view relationships in four mammograms acquired from the CC and MLO views of the left and right breasts. To enhance learning efficiency, plug-and-play adapters are added into CLIP image and text encoders for fine-tuning parameters and limiting updates to about 1% of the parameters. For framework evaluation, we assembled two datasets retrospectively. The first dataset, comprising 470 malignant and 479 benign cases, was used for few-shot fine-tuning and internal evaluation of the proposed Mammo-CLIP via 5-fold cross-validation. The second dataset, including 60 malignant and 294 benign cases, was used to test generalizability of Mammo-CLIP. Study results show that Mammo-CLIP outperforms the state-of-art cross-view transformer in AUC (0.841 vs. 0.817, 0.837 vs. 0.807) on both datasets. It also surpasses previous two CLIP-based methods by 20.3% and 14.3%. This study highlights the potential of applying the finetuned vision-language models for developing next-generation, image-text-based CAD schemes of breast cancer. △ Less

Submitted 24 April, 2024; originally announced April 2024.

arXiv:2404.11216 [pdf, other]

Position Engineering: Boosting Large Language Models through Positional Information Manipulation

Authors: Zhiyuan He, Huiqiang Jiang, Zilong Wang, Yuqing Yang, Luna Qiu, Lili Qiu

Abstract: The performance of large language models (LLMs) is significantly influenced by the quality of the prompts provided. In response, researchers have developed enormous prompt engineering strategies aimed at modifying the prompt text to enhance task performance. In this paper, we introduce a novel technique termed position engineering, which offers a more efficient way to guide large language models.… ▽ More The performance of large language models (LLMs) is significantly influenced by the quality of the prompts provided. In response, researchers have developed enormous prompt engineering strategies aimed at modifying the prompt text to enhance task performance. In this paper, we introduce a novel technique termed position engineering, which offers a more efficient way to guide large language models. Unlike prompt engineering, which requires substantial effort to modify the text provided to LLMs, position engineering merely involves altering the positional information in the prompt without modifying the text itself. We have evaluated position engineering in two widely-used LLM scenarios: retrieval-augmented generation (RAG) and in-context learning (ICL). Our findings show that position engineering substantially improves upon the baseline in both cases. Position engineering thus represents a promising new strategy for exploiting the capabilities of large language models. △ Less

Submitted 17 April, 2024; originally announced April 2024.

arXiv:2404.05609 [pdf, other]

Feedback Stability Under Mixed Gain and Phase Uncertainty

Authors: Jiajin Liang, Di Zhao, Li Qiu

Abstract: In this study, we investigate the robust feedback stability problem for multiple-input-multiple-output linear time-invariant systems involving sectored-disk uncertainty, namely, dynamic uncertainty subject to simultaneous gain and phase constraints. This problem is thereby called a sectored-disk problem. Employing a frequency-wise analysis approach, we derive a fundamental static matrix problem th… ▽ More In this study, we investigate the robust feedback stability problem for multiple-input-multiple-output linear time-invariant systems involving sectored-disk uncertainty, namely, dynamic uncertainty subject to simultaneous gain and phase constraints. This problem is thereby called a sectored-disk problem. Employing a frequency-wise analysis approach, we derive a fundamental static matrix problem that serves as a key component in addressing the feedback stability. The study of this matrix problem heavily relies on the Davis-Wielandt (DW) shells of matrices, providing a profound insight into matrices subjected to simultaneous gain and phase constraints. This understanding is pivotal for establishing a less conservative sufficient condition for the matrix sectored-disk problem, from which we formulate several robust feedback stability conditions against sectored-disk uncertainty. Finally, several conditions based on linear matrix inequalities are developed for efficient computation and verification of feedback robust stability against sectored-disk uncertainty. △ Less

Submitted 8 April, 2024; originally announced April 2024.

arXiv:2404.04286 [pdf, other]

Language Model Evolution: An Iterated Learning Perspective

Authors: Yi Ren, Shangmin Guo, Linlu Qiu, Bailin Wang, Danica J. Sutherland

Abstract: With the widespread adoption of Large Language Models (LLMs), the prevalence of iterative interactions among these models is anticipated to increase. Notably, recent advancements in multi-round self-improving methods allow LLMs to generate new examples for training subsequent models. At the same time, multi-agent LLM systems, involving automated interactions among agents, are also increasing in pr… ▽ More With the widespread adoption of Large Language Models (LLMs), the prevalence of iterative interactions among these models is anticipated to increase. Notably, recent advancements in multi-round self-improving methods allow LLMs to generate new examples for training subsequent models. At the same time, multi-agent LLM systems, involving automated interactions among agents, are also increasing in prominence. Thus, in both short and long terms, LLMs may actively engage in an evolutionary process. We draw parallels between the behavior of LLMs and the evolution of human culture, as the latter has been extensively studied by cognitive scientists for decades. Our approach involves leveraging Iterated Learning (IL), a Bayesian framework that elucidates how subtle biases are magnified during human cultural evolution, to explain some behaviors of LLMs. This paper outlines key characteristics of agents' behavior in the Bayesian-IL framework, including predictions that are supported by experimental verification with various LLMs. This theoretical framework could help to more effectively predict and guide the evolution of LLMs in desired directions. △ Less

Submitted 3 April, 2024; originally announced April 2024.

arXiv:2404.01617 [pdf, other]

LLM-ABR: Designing Adaptive Bitrate Algorithms via Large Language Models

Authors: Zhiyuan He, Aashish Gottipati, Lili Qiu, Francis Y. Yan, Xufang Luo, Kenuo Xu, Yuqing Yang

Abstract: We present LLM-ABR, the first system that utilizes the generative capabilities of large language models (LLMs) to autonomously design adaptive bitrate (ABR) algorithms tailored for diverse network characteristics. Operating within a reinforcement learning framework, LLM-ABR empowers LLMs to design key components such as states and neural network architectures. We evaluate LLM-ABR across diverse ne… ▽ More We present LLM-ABR, the first system that utilizes the generative capabilities of large language models (LLMs) to autonomously design adaptive bitrate (ABR) algorithms tailored for diverse network characteristics. Operating within a reinforcement learning framework, LLM-ABR empowers LLMs to design key components such as states and neural network architectures. We evaluate LLM-ABR across diverse network settings, including broadband, satellite, 4G, and 5G. LLM-ABR consistently outperforms default ABR algorithms. △ Less

Submitted 1 April, 2024; originally announced April 2024.

arXiv:2404.00998 [pdf, other]

LLM-RadJudge: Achieving Radiologist-Level Evaluation for X-Ray Report Generation

Authors: Zilong Wang, Xufang Luo, Xinyang Jiang, Dongsheng Li, Lili Qiu

Abstract: Evaluating generated radiology reports is crucial for the development of radiology AI, but existing metrics fail to reflect the task's clinical requirements. This study proposes a novel evaluation framework using large language models (LLMs) to compare radiology reports for assessment. We compare the performance of various LLMs and demonstrate that, when using GPT-4, our proposed metric achieves e… ▽ More Evaluating generated radiology reports is crucial for the development of radiology AI, but existing metrics fail to reflect the task's clinical requirements. This study proposes a novel evaluation framework using large language models (LLMs) to compare radiology reports for assessment. We compare the performance of various LLMs and demonstrate that, when using GPT-4, our proposed metric achieves evaluation consistency close to that of radiologists. Furthermore, to reduce costs and improve accessibility, making this method practical, we construct a dataset using LLM evaluation results and perform knowledge distillation to train a smaller model. The distilled model achieves evaluation capabilities comparable to GPT-4. Our framework and distilled model offer an accessible and efficient evaluation method for radiology report generation, facilitating the development of more clinically relevant models. The model will be further open-sourced and accessible. △ Less

Submitted 1 April, 2024; originally announced April 2024.

Comments: 11 pages, 6 figures

arXiv:2404.00269 [pdf, other]

IPoD: Implicit Field Learning with Point Diffusion for Generalizable 3D Object Reconstruction from Single RGB-D Images

Authors: Yushuang Wu, Luyue Shi, Junhao Cai, Weihao Yuan, Lingteng Qiu, Zilong Dong, Liefeng Bo, Shuguang Cui, Xiaoguang Han

Abstract: Generalizable 3D object reconstruction from single-view RGB-D images remains a challenging task, particularly with real-world data. Current state-of-the-art methods develop Transformer-based implicit field learning, necessitating an intensive learning paradigm that requires dense query-supervision uniformly sampled throughout the entire space. We propose a novel approach, IPoD, which harmonizes im… ▽ More Generalizable 3D object reconstruction from single-view RGB-D images remains a challenging task, particularly with real-world data. Current state-of-the-art methods develop Transformer-based implicit field learning, necessitating an intensive learning paradigm that requires dense query-supervision uniformly sampled throughout the entire space. We propose a novel approach, IPoD, which harmonizes implicit field learning with point diffusion. This approach treats the query points for implicit field learning as a noisy point cloud for iterative denoising, allowing for their dynamic adaptation to the target object shape. Such adaptive query points harness diffusion learning's capability for coarse shape recovery and also enhances the implicit representation's ability to delineate finer details. Besides, an additional self-conditioning mechanism is designed to use implicit predictions as the guidance of diffusion learning, leading to a cooperative system. Experiments conducted on the CO3D-v2 dataset affirm the superiority of IPoD, achieving 7.8% improvement in F-score and 28.6% in Chamfer distance over existing methods. The generalizability of IPoD is also demonstrated on the MVImgNet dataset. Our project page is at https://yushuang-wu.github.io/IPoD. △ Less

Submitted 30 March, 2024; originally announced April 2024.

Comments: CVPR 2024

arXiv:2404.00209 [pdf, other]

EventGround: Narrative Reasoning by Grounding to Eventuality-centric Knowledge Graphs

Authors: Cheng Jiayang, Lin Qiu, Chunkit Chan, Xin Liu, Yangqiu Song, Zheng Zhang

Abstract: Narrative reasoning relies on the understanding of eventualities in story contexts, which requires a wealth of background world knowledge. To help machines leverage such knowledge, existing solutions can be categorized into two groups. Some focus on implicitly modeling eventuality knowledge by pretraining language models (LMs) with eventuality-aware objectives. However, this approach breaks down k… ▽ More Narrative reasoning relies on the understanding of eventualities in story contexts, which requires a wealth of background world knowledge. To help machines leverage such knowledge, existing solutions can be categorized into two groups. Some focus on implicitly modeling eventuality knowledge by pretraining language models (LMs) with eventuality-aware objectives. However, this approach breaks down knowledge structures and lacks interpretability. Others explicitly collect world knowledge of eventualities into structured eventuality-centric knowledge graphs (KGs). However, existing research on leveraging these knowledge sources for free-texts is limited. In this work, we propose an initial comprehensive framework called EventGround, which aims to tackle the problem of grounding free-texts to eventuality-centric KGs for contextualized narrative reasoning. We identify two critical problems in this direction: the event representation and sparsity problems. We provide simple yet effective parsing and partial information extraction methods to tackle these problems. Experimental results demonstrate that our approach consistently outperforms baseline models when combined with graph neural network (GNN) or large language model (LLM) based graph reasoning models. Our framework, incorporating grounded knowledge, achieves state-of-the-art performance while providing interpretable evidence. △ Less

Submitted 7 July, 2024; v1 submitted 29 March, 2024; originally announced April 2024.

arXiv:2403.16353 [pdf, other]

Energy-Efficient Hybrid Beamforming with Dynamic On-off Control for Integrated Sensing, Communications, and Powering

Authors: Zeyu Hao, Yuan Fang, Xianghao Yu, Jie Xu, Ling Qiu, Lexi Xu, Shuguang Cui

Abstract: This paper investigates the energy-efficient hybrid beamforming design for a multi-functional integrated sensing, communications, and powering (ISCAP) system. In this system, a base station (BS) with a hybrid analog-digital (HAD) architecture sends unified wireless signals to communicate with multiple information receivers (IRs), sense multiple point targets, and wirelessly charge multiple energy… ▽ More This paper investigates the energy-efficient hybrid beamforming design for a multi-functional integrated sensing, communications, and powering (ISCAP) system. In this system, a base station (BS) with a hybrid analog-digital (HAD) architecture sends unified wireless signals to communicate with multiple information receivers (IRs), sense multiple point targets, and wirelessly charge multiple energy receivers (ERs) at the same time. To facilitate the energy-efficient design, we present a novel HAD architecture for the BS transmitter, which allows dynamic on-off control of its radio frequency (RF) chains and analog phase shifters (PSs) through a switch network. We also consider a practical and comprehensive power consumption model for the BS, by taking into account the power-dependent non-linear power amplifier (PA) efficiency, and the on-off non-transmission power consumption model of RF chains and PSs. We jointly design the hybrid beamforming and dynamic on-off control at the BS, aiming to minimize its total power consumption, while guaranteeing the performance requirements on communication rates, sensing Cramér-Rao bound (CRB), and harvested power levels. The formulation also takes into consideration the per-antenna transmit power constraint and the constant modulus constraints for the analog beamformer at the BS. The resulting optimization problem for ISCAP is highly non-convex. Please refer to the paper for a complete abstract. △ Less

Submitted 24 March, 2024; originally announced March 2024.

Comments: 13 pages, 6 figures, submitted to IEEE Transactions on Communications

arXiv:2403.12968 [pdf, other]

LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression

Authors: Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor Rühle, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, Dongmei Zhang

Abstract: This paper focuses on task-agnostic prompt compression for better generalizability and efficiency. Considering the redundancy in natural language, existing approaches compress prompts by removing tokens or lexical units according to their information entropy obtained from a causal language model such as LLaMa-7B. The challenge is that information entropy may be a suboptimal compression metric: (i)… ▽ More This paper focuses on task-agnostic prompt compression for better generalizability and efficiency. Considering the redundancy in natural language, existing approaches compress prompts by removing tokens or lexical units according to their information entropy obtained from a causal language model such as LLaMa-7B. The challenge is that information entropy may be a suboptimal compression metric: (i) it only leverages unidirectional context and may fail to capture all essential information needed for prompt compression; (ii) it is not aligned with the prompt compression objective. To address these issues, we propose a data distillation procedure to derive knowledge from an LLM to compress prompts without losing crucial information, and meantime, introduce an extractive text compression dataset. We formulate prompt compression as a token classification problem to guarantee the faithfulness of the compressed prompt to the original one, and use a Transformer encoder as the base architecture to capture all essential information for prompt compression from the full bidirectional context. Our approach leads to lower latency by explicitly learning the compression objective with smaller models such as XLM-RoBERTa-large and mBERT. We evaluate our method on both in-domain and out-of-domain datasets, including MeetingBank, LongBench, ZeroScrolls, GSM8K, and BBH. Despite its small size, our model shows significant performance gains over strong baselines and demonstrates robust generalization ability across different LLMs. Additionally, our model is 3x-6x faster than existing prompt compression methods, while accelerating the end-to-end latency by 1.6x-2.9x with compression ratios of 2x-5x. △ Less

Submitted 19 March, 2024; originally announced March 2024.

arXiv:2403.12370 [pdf, other]

XPose: eXplainable Human Pose Estimation

Authors: Luyu Qiu, Jianing Li, Lei Wen, Chi Su, Fei Hao, Chen Jason Zhang, Lei Chen

Abstract: Current approaches in pose estimation primarily concentrate on enhancing model architectures, often overlooking the importance of comprehensively understanding the rationale behind model decisions. In this paper, we propose XPose, a novel framework that incorporates Explainable AI (XAI) principles into pose estimation. This integration aims to elucidate the individual contribution of each keypoint… ▽ More Current approaches in pose estimation primarily concentrate on enhancing model architectures, often overlooking the importance of comprehensively understanding the rationale behind model decisions. In this paper, we propose XPose, a novel framework that incorporates Explainable AI (XAI) principles into pose estimation. This integration aims to elucidate the individual contribution of each keypoint to final prediction, thereby elevating the model's transparency and interpretability. Conventional XAI techniques have predominantly addressed tasks with single-target tasks like classification. Additionally, the application of Shapley value, a common measure in XAI, to pose estimation has been hindered by prohibitive computational demands. To address these challenges, this work introduces an innovative concept called Group Shapley Value (GSV). This approach strategically organizes keypoints into clusters based on their interdependencies. Within these clusters, GSV meticulously calculates Shapley value for keypoints, while for inter-cluster keypoints, it opts for a more holistic group-level valuation. This dual-level computation framework meticulously assesses keypoint contributions to the final outcome, optimizing computational efficiency. Building on the insights into keypoint interactions, we devise a novel data augmentation technique known as Group-based Keypoint Removal (GKR). This method ingeniously removes individual keypoints during training phases, deliberately preserving those with strong mutual connections, thereby refining the model's predictive prowess for non-visible keypoints. The empirical validation of GKR across a spectrum of standard approaches attests to its efficacy. GKR's success demonstrates how using Explainable AI (XAI) can directly enhance pose estimation models. △ Less

Submitted 18 March, 2024; originally announced March 2024.

arXiv:2403.12010 [pdf, other]

VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model

Authors: Qi Zuo, Xiaodong Gu, Lingteng Qiu, Yuan Dong, Zhengyi Zhao, Weihao Yuan, Rui Peng, Siyu Zhu, Zilong Dong, Liefeng Bo, Qixing Huang

Abstract: Generating multi-view images based on text or single-image prompts is a critical capability for the creation of 3D content. Two fundamental questions on this topic are what data we use for training and how to ensure multi-view consistency. This paper introduces a novel framework that makes fundamental contributions to both questions. Unlike leveraging images from 2D diffusion models for training,… ▽ More Generating multi-view images based on text or single-image prompts is a critical capability for the creation of 3D content. Two fundamental questions on this topic are what data we use for training and how to ensure multi-view consistency. This paper introduces a novel framework that makes fundamental contributions to both questions. Unlike leveraging images from 2D diffusion models for training, we propose a dense consistent multi-view generation model that is fine-tuned from off-the-shelf video generative models. Images from video generative models are more suitable for multi-view generation because the underlying network architecture that generates them employs a temporal module to enforce frame consistency. Moreover, the video data sets used to train these models are abundant and diverse, leading to a reduced train-finetuning domain gap. To enhance multi-view consistency, we introduce a 3D-Aware Denoising Sampling, which first employs a feed-forward reconstruction module to get an explicit global 3D model, and then adopts a sampling strategy that effectively involves images rendered from the global 3D model into the denoising sampling loop to improve the multi-view consistency of the final images. As a by-product, this module also provides a fast way to create 3D assets represented by 3D Gaussians within a few seconds. Our approach can generate 24 dense views and converges much faster in training than state-of-the-art approaches (4 GPU hours versus many thousand GPU hours) with comparable visual quality and consistency. By further fine-tuning, our approach outperforms existing state-of-the-art methods in both quantitative metrics and visual effects. Our project page is aigc3d.github.io/VideoMV. △ Less

Submitted 18 March, 2024; originally announced March 2024.

Comments: Project page: aigc3d.github.io/VideoMV/

arXiv:2403.11890 [pdf]

Dual-Energy Cone-Beam CT Using Two Complementary Limited-Angle Scans with A Projection-Consistent Diffusion Model

Authors: Junbo Peng, Chih-Wei Chang, Richard L. J. Qiu, Tonghe Wang, Justin Roper, Beth Ghavidel, Xiangyang Tang, Xiaofeng Yang

Abstract: Background: Dual-energy imaging on cone-beam CT (CBCT) scanners has great potential in different clinical applications, including image-guided surgery and adaptive proton therapy. However, the clinical practice of dual-energy CBCT (DE-CBCT) has been hindered by the requirement of sophisticated hardware components. Purpose: In this work, we aim to propose a practical solution for single-scan dual-e… ▽ More Background: Dual-energy imaging on cone-beam CT (CBCT) scanners has great potential in different clinical applications, including image-guided surgery and adaptive proton therapy. However, the clinical practice of dual-energy CBCT (DE-CBCT) has been hindered by the requirement of sophisticated hardware components. Purpose: In this work, we aim to propose a practical solution for single-scan dual-energy imaging on current CBCT scanners without hardware modifications, using two complementary limited-angle scans with a projection-consistent diffusion model. Methods: Our approach has two major components: data acquisition using two complementary limited-angle scans, and dual-energy projections restoration with subsequent FDK reconstruction. Two complementary scans at different kVps are performed in a single rotation by switching the tube voltage at the middle of the source trajectory, acquiring the mixed-spectra projection in a single CBCT scan. Full-sampled dual-energy projections are then restored by a projection-consistent diffusion model in a slice-by-slice manner, followed by the DE-CBCT reconstruction using the FDK algorithm. Results: The proposed method was evaluated in a simulation study of digital abdomen phantoms and a study of real rat data. In the simulation study, the proposed method produced DE-CBCT images at a mean absolute error (MAE) of 20 HU. In the small-animal study, reconstructed DE-CBCT images using the proposed method gave an MAE of 25 HU. Conclusion: This study demonstrates the feasibility of DE-CBCT imaging using two complementary limited-angle scans with a projection-consistent diffusion model in both half-fan and short scans. The proposed method may allow quantitative applications of DE-CBCT and enable DE-CBCT-based adaptive proton therapy. △ Less

Submitted 18 March, 2024; originally announced March 2024.

arXiv:2403.11465 [pdf]

Ultra-Long Homochiral Graphene Nanoribbons Grown Within h-BN Stacks for High-Performance Electronics

Authors: Bosai Lyu, Jiajun Chen, Sen Wang, Shuo Lou, Peiyue Shen, Jingxu Xie, Lu Qiu, Izaac Mitchell, Can Li, Cheng Hu, Xianliang Zhou, Kenji Watanabe, Takashi Taniguchi, Xiaoqun Wang, Jinfeng Jia, Qi Liang, Guorui Chen, Tingxin Li, Shiyong Wang, Wengen Ouyang, Oded Hod, Feng Ding, Michael Urbakh, Zhiwen Shi

Abstract: Van der Waals encapsulation of two-dimensional materials within hexagonal boron nitride (h-BN) stacks has proven to be a promising way to create ultrahigh-performance electronic devices. However, contemporary approaches for achieving van der Waals encapsulation, which involve artificial layer stacking using mechanical transfer techniques, are difficult to control, prone to contamination, and unsca… ▽ More Van der Waals encapsulation of two-dimensional materials within hexagonal boron nitride (h-BN) stacks has proven to be a promising way to create ultrahigh-performance electronic devices. However, contemporary approaches for achieving van der Waals encapsulation, which involve artificial layer stacking using mechanical transfer techniques, are difficult to control, prone to contamination, and unscalable. Here, we report on the transfer-free direct growth of high-quality graphene nanoribbons (GNRs) within h-BN stacks. The as-grown embedded GNRs exhibit highly desirable features being ultralong (up to 0.25 mm), ultranarrow ( < 5 nm), and homochiral with zigzag edges. Our atomistic simulations reveal that the mechanism underlying the embedded growth involves ultralow GNR friction when sliding between AA'-stacked h-BN layers. Using the grown structures, we demonstrate the transfer-free fabrication of embedded GNR field-effect devices that exhibit excellent performance at room temperature with mobilities of up to 4,600 $cm^{2} V^{-1} s^{-1}$ and on-off ratios of up to $10^{6}$. This paves the way to the bottom-up fabrication of high-performance electronic devices based on embedded layered materials. △ Less

Submitted 18 March, 2024; originally announced March 2024.

arXiv:2403.07858 [pdf, other]

Accelerating Biclique Counting on GPU

Authors: Linshan Qiu, Zhonggen Li, Xiangyu Ke, Lu Chen, Yunjun Gao

Abstract: Counting (p,q)-bicliques in bipartite graphs poses a foundational challenge with broad applications, from densest subgraph discovery in algorithmic research to personalized content recommendation in practical scenarios. Despite its significance, current leading (p,q)-biclique counting algorithms fall short, particularly when faced with larger graph sizes and clique scales. Fortunately, the problem… ▽ More Counting (p,q)-bicliques in bipartite graphs poses a foundational challenge with broad applications, from densest subgraph discovery in algorithmic research to personalized content recommendation in practical scenarios. Despite its significance, current leading (p,q)-biclique counting algorithms fall short, particularly when faced with larger graph sizes and clique scales. Fortunately, the problem's inherent structure, allowing for the independent counting of each biclique starting from every vertex, combined with a substantial set intersections, makes it highly amenable to parallelization. Recent successes in GPU-accelerated algorithms across various domains motivate our exploration into harnessing the parallelism power of GPUs to efficiently address the (p,q)-biclique counting challenge. We introduce GBC (GPU-based Biclique Counting), a novel approach designed to enable efficient and scalable (p,q)-biclique counting on GPUs. To address major bottleneck arising from redundant comparisons in set intersections (occupying an average of 90% of the runtime), we introduce a novel data structure that hashes adjacency lists into truncated bitmaps to enable efficient set intersection on GPUs via bit-wise AND operations. Our innovative hybrid DFS-BFS exploration strategy further enhances thread utilization and effectively manages memory constraints. A composite load balancing strategy, integrating pre-runtime and runtime workload allocation, ensures equitable distribution among threads. Additionally, we employ vertex reordering and graph partitioning strategies for improved compactness and scalability. Experimental evaluations on eight real-life and two synthetic datasets demonstrate that GBC outperforms state-of-the-art algorithms by a substantial margin. In particular, GBC achieves an average speedup of 497.8x, with the largest instance achieving a remarkable 1217.7x speedup when p = q = 8. △ Less

Submitted 20 March, 2024; v1 submitted 12 March, 2024; originally announced March 2024.

Comments: This paper has been accepted by ICDE24

arXiv:2403.06739 [pdf, other]

doi 10.1103/PhysRevD.109.114025

Energy loss of a heavy fermion in a collisional QED plasma

Authors: Yun Guo, Luhua Qiu, Ruizhe Zhao, Michael Strickland

Abstract: We compute the energy loss of heavy fermions moving in a plasma, taking into account the modification of the photon collective modes induced by collisions using a Bhatnagar-Gross-Krook collisional kernel. We include contributions from both hard and soft scatterings of the heavy fermion using a collisionally modified hard-thermal-loop resummed propagator. Using this method, one does not need to int… ▽ More We compute the energy loss of heavy fermions moving in a plasma, taking into account the modification of the photon collective modes induced by collisions using a Bhatnagar-Gross-Krook collisional kernel. We include contributions from both hard and soft scatterings of the heavy fermion using a collisionally modified hard-thermal-loop resummed propagator. Using this method, one does not need to introduce a separation scale between hard- and soft-momentum exchanges. To place our calculation in context, we review other theoretical approaches to computing the collisional energy loss of fermions and discuss the systematics and results obtained in each approach compared to using a resummed propagator for both hard and soft momentum exchanges. Our final results indicate that self-consistently including the effect of collisions in the self-energies of the resummed propagator results in an increased energy loss compared to using collisionless hard-thermal-loop propagators. The effect becomes larger as the magnitude of the coupling constant and the velocity of the fermion increase. △ Less

Submitted 15 June, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

Comments: 25 pages, 7 figures, final version published in PRD

Journal ref: Phys. Rev. D 109, 114025 (2024)

arXiv:2403.05747 [pdf]

doi 10.1021/acs.nanolett.3c05024

Switching intrinsic magnetic skyrmions with controllable magnetic anisotropy in van der Waals multiferroic heterostructures

Authors: Ze-quan Wang, Feng Xue, Liang Qiu, Zhe Wang, Ruqian Wu, Yusheng Hou

Abstract: Magnetic skyrmions, topologically nontrivial whirling spin textures at nanometer scales, have emerged as potential information carriers for spintronic devices. The ability to efficiently create and erase magnetic skyrmions is vital yet challenging for such applications. Based on first-principles studies, we find that switching between intrinsic magnetic skyrmion and high-temperature ferromagnetic… ▽ More Magnetic skyrmions, topologically nontrivial whirling spin textures at nanometer scales, have emerged as potential information carriers for spintronic devices. The ability to efficiently create and erase magnetic skyrmions is vital yet challenging for such applications. Based on first-principles studies, we find that switching between intrinsic magnetic skyrmion and high-temperature ferromagnetic states can be achieved in two-dimensional van der Waals (vdW) multiferroic heterostructure CrSeI/In2Te3 by reversing the ferroelectric polarization of In2Te3. The core mechanism of this switching is traced to the controllable magnetic anisotropy of CrSeI influenced by the ferroelectric polarization of In2Te3. We propose a useful descriptor linking the presence of magnetic skyrmions to magnetic parameters, and validate this connection through studies of a variety of similar vdW multiferroic heterostructures. Our work demonstrates that manipulating magnetic skyrmions via tunable magnetic anisotropies in vdW multiferroic heterostructures represents a highly promising and energy-efficient strategy for future development of spintronics. △ Less

Submitted 8 March, 2024; originally announced March 2024.

Comments: 20 Pages, 3 figures, 1 table; accepted by Nano Letters

arXiv:2402.18495 [pdf, other]

ROG$_{PL}$: Robust Open-Set Graph Learning via Region-Based Prototype Learning

Authors: Qin Zhang, Xiaowei Li, Jiexin Lu, Liping Qiu, Shirui Pan, Xiaojun Chen, Junyang Chen

Abstract: Open-set graph learning is a practical task that aims to classify the known class nodes and to identify unknown class samples as unknowns. Conventional node classification methods usually perform unsatisfactorily in open-set scenarios due to the complex data they encounter, such as out-of-distribution (OOD) data and in-distribution (IND) noise. OOD data are samples that do not belong to any known… ▽ More Open-set graph learning is a practical task that aims to classify the known class nodes and to identify unknown class samples as unknowns. Conventional node classification methods usually perform unsatisfactorily in open-set scenarios due to the complex data they encounter, such as out-of-distribution (OOD) data and in-distribution (IND) noise. OOD data are samples that do not belong to any known classes. They are outliers if they occur in training (OOD noise), and open-set samples if they occur in testing. IND noise are training samples which are assigned incorrect labels. The existence of IND noise and OOD noise is prevalent, which usually cause the ambiguity problem, including the intra-class variety problem and the inter-class confusion problem. Thus, to explore robust open-set learning methods is necessary and difficult, and it becomes even more difficult for non-IID graph data.To this end, we propose a unified framework named ROG$_{PL}$ to achieve robust open-set learning on complex noisy graph data, by introducing prototype learning. In specific, ROG$_{PL}$ consists of two modules, i.e., denoising via label propagation and open-set prototype learning via regions. The first module corrects noisy labels through similarity-based label propagation and removes low-confidence samples, to solve the intra-class variety problem caused by noise. The second module learns open-set prototypes for each known class via non-overlapped regions and remains both interior and border prototypes to remedy the inter-class confusion problem.The two modules are iteratively updated under the constraints of classification loss and prototype diversity loss. To the best of our knowledge, the proposed ROG$_{PL}$ is the first robust open-set node classification method for graph data with complex noise. △ Less

Submitted 29 February, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

Comments: 9 pages, 5 figures

arXiv:2402.17033 [pdf, other]

Mechanics and wrinkling patterns of pressurized bent tubes

Authors: Cesar L. Pastrana, Luyi Qiu, John W. Hutchinson, Ariel Amir, Ulrich Gerland

Abstract: Take a drinking straw and bend it from its ends. After sufficient bending, the tube buckles forming a kink, where the curvature is localized in a very small area. This instability, known generally as the Brazier effect, is inherent to thin-walled cylindrical shells, which are particularly ubiquitous in living systems, such as rod-shaped bacteria. However, tubular biological structures are often pr… ▽ More Take a drinking straw and bend it from its ends. After sufficient bending, the tube buckles forming a kink, where the curvature is localized in a very small area. This instability, known generally as the Brazier effect, is inherent to thin-walled cylindrical shells, which are particularly ubiquitous in living systems, such as rod-shaped bacteria. However, tubular biological structures are often pressurized, and the knowledge of the mechanical response upon bending in this scenario is limited. In this work, we use a computational model to study the mechanical response and the deformations as a result of bending pressurized tubes. In addition, we employ tension-field theory to describe the mechanical behaviour before and after the wrinkling transition. Furthermore, we investigate the development and evolution of wrinkle patterns beyond the instability, showing different wrinkled configurations. We discover the existence of a multi-wavelength mode following the purely sinusoidal wrinkles and anticipating the kinked configuration of the tube. △ Less

Submitted 10 April, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

Comments: 5 pages, 4 figures

arXiv:2402.05935 [pdf, other]

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

Authors: Dongyang Liu, Renrui Zhang, Longtian Qiu, Siyuan Huang, Weifeng Lin, Shitian Zhao, Shijie Geng, Ziyi Lin, Peng Jin, Kaipeng Zhang, Wenqi Shao, Chao Xu, Conghui He, Junjun He, Hao Shao, Pan Lu, Hongsheng Li, Yu Qiao, Peng Gao

Abstract: We propose SPHINX-X, an extensive Multimodality Large Language Model (MLLM) series developed upon SPHINX. To improve the architecture and training efficiency, we modify the SPHINX framework by removing redundant visual encoders, bypassing fully-padded sub-images with skip tokens, and simplifying multi-stage training into a one-stage all-in-one paradigm. To fully unleash the potential of MLLMs, we… ▽ More We propose SPHINX-X, an extensive Multimodality Large Language Model (MLLM) series developed upon SPHINX. To improve the architecture and training efficiency, we modify the SPHINX framework by removing redundant visual encoders, bypassing fully-padded sub-images with skip tokens, and simplifying multi-stage training into a one-stage all-in-one paradigm. To fully unleash the potential of MLLMs, we assemble a comprehensive multi-domain and multimodal dataset covering publicly available resources in language, vision, and vision-language tasks. We further enrich this collection with our curated OCR intensive and Set-of-Mark datasets, extending the diversity and generality. By training over different base LLMs including TinyLlama1.1B, InternLM2-7B, LLaMA2-13B, and Mixtral8x7B, we obtain a spectrum of MLLMs that vary in parameter size and multilingual capabilities. Comprehensive benchmarking reveals a strong correlation between the multi-modal performance with the data and parameter scales. Code and models are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory △ Less

Submitted 26 June, 2024; v1 submitted 8 February, 2024; originally announced February 2024.

Comments: Accepted by ICML 2024. Code and models are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory

arXiv:2402.03259 [pdf, other]

Meeting Bridges: Designing Information Artifacts that Bridge from Synchronous Meetings to Asynchronous Collaboration

Authors: Ruotong Wang, Lin Qiu, Justin Cranshaw, Amy X. Zhang

Abstract: A recent surge in remote meetings has led to complaints of ``Zoom fatigue'' and ``collaboration overload,'' negatively impacting worker productivity and well-being. One way to alleviate the burden of meetings is to de-emphasize their synchronous participation by shifting work to and enabling sensemaking during post-meeting asynchronous activities. Towards this goal, we propose the design concept o… ▽ More A recent surge in remote meetings has led to complaints of ``Zoom fatigue'' and ``collaboration overload,'' negatively impacting worker productivity and well-being. One way to alleviate the burden of meetings is to de-emphasize their synchronous participation by shifting work to and enabling sensemaking during post-meeting asynchronous activities. Towards this goal, we propose the design concept of meeting bridges, or information artifacts that can encapsulate meeting information towards bridging to and facilitating post-meeting activities. Through 13 interviews and a survey of 198 information workers, we learn how people use online meeting information after meetings are over, finding five main uses: as an archive, as task reminders, to onboard or support inclusion, for group sensemaking, and as a launching point for follow-on collaboration. However, we also find that current common meeting artifacts, such as notes and recordings, present challenges in serving as meeting bridges. After conducting co-design sessions with 16 participants, we distill key principles for the design of meeting bridges to optimally support asynchronous collaboration goals. Overall, our findings point to the opportunity of designing information artifacts that not only support users to access but also continue to transform and engage in meeting information post-meeting. △ Less

Submitted 5 February, 2024; originally announced February 2024.

Comments: accepted to CSCW 2024

arXiv:2402.02349 [pdf]

Vision Transformer-based Multimodal Feature Fusion Network for Lymphoma Segmentation on PET/CT Images

Authors: Huan Huang, Liheng Qiu, Shenmiao Yang, Longxi Li, Jiaofen Nan, Yanting Li, Chuang Han, Fubao Zhu, Chen Zhao, Weihua Zhou

Abstract: Background: Diffuse large B-cell lymphoma (DLBCL) segmentation is a challenge in medical image analysis. Traditional segmentation methods for lymphoma struggle with the complex patterns and the presence of DLBCL lesions. Objective: We aim to develop an accurate method for lymphoma segmentation with 18F-Fluorodeoxyglucose positron emission tomography (PET) and computed tomography (CT) images. Metho… ▽ More Background: Diffuse large B-cell lymphoma (DLBCL) segmentation is a challenge in medical image analysis. Traditional segmentation methods for lymphoma struggle with the complex patterns and the presence of DLBCL lesions. Objective: We aim to develop an accurate method for lymphoma segmentation with 18F-Fluorodeoxyglucose positron emission tomography (PET) and computed tomography (CT) images. Methods: Our lymphoma segmentation approach combines a vision transformer with dual encoders, adeptly fusing PET and CT data via multimodal cross-attention fusion (MMCAF) module. In this study, PET and CT data from 165 DLBCL patients were analyzed. A 5-fold cross-validation was employed to evaluate the performance and generalization ability of our method. Ground truths were annotated by experienced nuclear medicine experts. We calculated the total metabolic tumor volume (TMTV) and performed a statistical analysis on our results. Results: The proposed method exhibited accurate performance in DLBCL lesion segmentation, achieving a Dice similarity coefficient of 0.9173$\pm$0.0071, a Hausdorff distance of 2.71$\pm$0.25mm, a sensitivity of 0.9462$\pm$0.0223, and a specificity of 0.9986$\pm$0.0008. Additionally, a Pearson correlation coefficient of 0.9030$\pm$0.0179 and an R-square of 0.8586$\pm$0.0173 were observed in TMTV when measured on manual annotation compared to our segmentation results. Conclusion: This study highlights the advantages of MMCAF and vision transformer for lymphoma segmentation using PET and CT, offering great promise for computer-aided lymphoma diagnosis and treatment. △ Less

Submitted 4 February, 2024; originally announced February 2024.

Comments: 14 pages, 6 figures; reference added

arXiv:2401.17018 [pdf, other]

GPU-Accelerated Batch-Dynamic Subgraph Matching

Authors: Linshan Qiu, Lu Chen, Hailiang Jie, Xiangyu Ke, Yunjun Gao, Yang Liu, Zetao Zhang

Abstract: Subgraph matching has garnered increasing attention for its diverse real-world applications. Given the dynamic nature of real-world graphs, addressing evolving scenarios without incurring prohibitive overheads has been a focus of research. However, existing approaches for dynamic subgraph matching often proceed serially, retrieving incremental matches for each updated edge individually. This appro… ▽ More Subgraph matching has garnered increasing attention for its diverse real-world applications. Given the dynamic nature of real-world graphs, addressing evolving scenarios without incurring prohibitive overheads has been a focus of research. However, existing approaches for dynamic subgraph matching often proceed serially, retrieving incremental matches for each updated edge individually. This approach falls short when handling batch data updates, leading to a decrease in system throughput. Leveraging the parallel processing power of GPUs, which can execute a massive number of cores simultaneously, has been widely recognized for performance acceleration in various domains. Surprisingly, systematic exploration of subgraph matching in the context of batch-dynamic graphs, particularly on a GPU platform, remains untouched. In this paper, we bridge this gap by introducing an efficient framework, GAMMA (GPU-Accelerated Batch-Dynamic Subgraph Matching). Our approach features a DFS-based warp-centric batch-dynamic subgraph matching algorithm. To ensure load balance in the DFS-based search, we propose warp-level work stealing via shared memory. Additionally, we introduce coalesced search to reduce redundant computations. Comprehensive experiments demonstrate the superior performance of GAMMA. Compared to state-of-the-art algorithms, GAMMA showcases a performance improvement up to hundreds of times. △ Less

Submitted 30 January, 2024; originally announced January 2024.

Comments: This paper has been accepted by ICDE 2024

arXiv:2401.12433 [pdf, other]

A Novel Garment Transfer Method Supervised by Distilled Knowledge of Virtual Try-on Model

Authors: Naiyu Fang, Lemiao Qiu, Shuyou Zhang, Zili Wang, Kerui Hu, Jianrong Tan

Abstract: This paper proposes a novel garment transfer method supervised with knowledge distillation from virtual try-on. Our method first reasons the transfer parsing to provide shape prior to downstream tasks. We employ a multi-phase teaching strategy to supervise the training of the transfer parsing reasoning model, learning the response and feature knowledge from the try-on parsing reasoning model. To c… ▽ More This paper proposes a novel garment transfer method supervised with knowledge distillation from virtual try-on. Our method first reasons the transfer parsing to provide shape prior to downstream tasks. We employ a multi-phase teaching strategy to supervise the training of the transfer parsing reasoning model, learning the response and feature knowledge from the try-on parsing reasoning model. To correct the teaching error, it transfers the garment back to its owner to absorb the hard knowledge in the self-study phase. Guided by the transfer parsing, we adjust the position of the transferred garment via STN to prevent distortion. Afterward, we estimate a progressive flow to precisely warp the garment with shape and content correspondences. To ensure warping rationality, we supervise the training of the garment warping model using target shape and warping knowledge from virtual try-on. To better preserve body features in the transfer result, we propose a well-designed training strategy for the arm regrowth task to infer new exposure skin. Experiments demonstrate that our method has state-of-the-art performance compared with other virtual try-on and garment transfer methods in garment transfer, especially for preserving garment texture and body features. △ Less

Submitted 4 April, 2024; v1 submitted 22 January, 2024; originally announced January 2024.

arXiv:2401.11740 [pdf, other]

Multi-level Cross-modal Alignment for Image Clustering

Authors: Liping Qiu, Qin Zhang, Xiaojun Chen, Shaotian Cai

Abstract: Recently, the cross-modal pretraining model has been employed to produce meaningful pseudo-labels to supervise the training of an image clustering model. However, numerous erroneous alignments in a cross-modal pre-training model could produce poor-quality pseudo-labels and degrade clustering performance. To solve the aforementioned issue, we propose a novel \textbf{Multi-level Cross-modal Alignmen… ▽ More Recently, the cross-modal pretraining model has been employed to produce meaningful pseudo-labels to supervise the training of an image clustering model. However, numerous erroneous alignments in a cross-modal pre-training model could produce poor-quality pseudo-labels and degrade clustering performance. To solve the aforementioned issue, we propose a novel \textbf{Multi-level Cross-modal Alignment} method to improve the alignments in a cross-modal pretraining model for downstream tasks, by building a smaller but better semantic space and aligning the images and texts in three levels, i.e., instance-level, prototype-level, and semantic-level. Theoretical results show that our proposed method converges, and suggests effective means to reduce the expected clustering risk of our method. Experimental results on five benchmark datasets clearly show the superiority of our new method. △ Less

Submitted 22 January, 2024; originally announced January 2024.

arXiv:2401.10600 [pdf, other]

Changing-look NLS1 galaxies, their detection with SVOM, and the case of NGC 1566

Authors: D. W. Xu, S. Komossa, D. Grupe, J. Wang, L. P. Xin, X. H. Han, J. Y. Wei, J. Y. Bai, E. Bon, F. Cangemi, B. Cordier, M. Dennefeld, L. C. Gallo, W. Kollatschny, De-Feng Kong, M. W. Ochmann, Y. L. Qiu, N. Schartel

Abstract: We discuss applications of the study of the new and barely explored class of changing-look (CL) narrow-line Seyfert 1 (NLS1) galaxies and comment on their detection with the space mission SVOM (Space Variable Objects Monitor). We highlight the case of NGC 1566, which is outstanding in many respects, for instance as one of the nearest known CL AGN undergoing exceptional outbursts. Its NLS1 nature i… ▽ More We discuss applications of the study of the new and barely explored class of changing-look (CL) narrow-line Seyfert 1 (NLS1) galaxies and comment on their detection with the space mission SVOM (Space Variable Objects Monitor). We highlight the case of NGC 1566, which is outstanding in many respects, for instance as one of the nearest known CL AGN undergoing exceptional outbursts. Its NLS1 nature is discussed, and we take it as a nearby prototype for systems that could be discovered and studied in the near future, including with SVOM. Finally, we briefly examine the broader implications and applications of CL events in NLS1 galaxies and show that such systems, once discovered in larger numbers, will greatly advance our understanding of the physics of the environment of rapidly growing supermassive black holes. This White Paper is part of a sequence of publications which explore aspects of our understanding of (CL) NLS1 galaxy physics with future missions. △ Less

Submitted 19 January, 2024; originally announced January 2024.

Comments: 14 pages, 4 figures. Accepted for publication in the Universe Special Issue "A Multimessenger View of Supermassive Black Holes and the Quasar Main Sequence"

arXiv:2401.10278 [pdf, other]

EEGFormer: Towards Transferable and Interpretable Large-Scale EEG Foundation Model

Authors: Yuqi Chen, Kan Ren, Kaitao Song, Yansen Wang, Yifan Wang, Dongsheng Li, Lili Qiu

Abstract: Self-supervised learning has emerged as a highly effective approach in the fields of natural language processing and computer vision. It is also applicable to brain signals such as electroencephalography (EEG) data, given the abundance of available unlabeled data that exist in a wide spectrum of real-world medical applications ranging from seizure detection to wave analysis. The existing works lev… ▽ More Self-supervised learning has emerged as a highly effective approach in the fields of natural language processing and computer vision. It is also applicable to brain signals such as electroencephalography (EEG) data, given the abundance of available unlabeled data that exist in a wide spectrum of real-world medical applications ranging from seizure detection to wave analysis. The existing works leveraging self-supervised learning on EEG modeling mainly focus on pretraining upon each individual dataset corresponding to a single downstream task, which cannot leverage the power of abundant data, and they may derive sub-optimal solutions with a lack of generalization. Moreover, these methods rely on end-to-end model learning which is not easy for humans to understand. In this paper, we present a novel EEG foundation model, namely EEGFormer, pretrained on large-scale compound EEG data. The pretrained model cannot only learn universal representations on EEG signals with adaptable performance on various downstream tasks but also provide interpretable outcomes of the useful patterns within the data. To validate the effectiveness of our model, we extensively evaluate it on various downstream tasks and assess the performance under different transfer settings. Furthermore, we demonstrate how the learned model exhibits transferable anomaly detection performance and provides valuable interpretability of the acquired patterns via self-supervised learning. △ Less

Submitted 11 January, 2024; originally announced January 2024.

Comments: A preprint version of an ongoing work

Showing 1–50 of 359 results for author: Qiu, L