-
Deep Learning Based Apparent Diffusion Coefficient Map Generation1 from Multi-parametric MR Images for Patients with Diffuse Gliomas
Authors:
Zach Eidex,
Mojtaba Safari,
Jacob Wynne,
Richard L. J. Qiu,
Tonghe Wang,
David Viar Hernandez,
Hui-Kuo Shu,
Hui Mao,
Xiaofeng Yang
Abstract:
Purpose: Apparent diffusion coefficient (ADC) maps derived from diffusion weighted (DWI) MRI provides functional measurements about the water molecules in tissues. However, DWI is time consuming and very susceptible to image artifacts, leading to inaccurate ADC measurements. This study aims to develop a deep learning framework to synthesize ADC maps from multi-parametric MR images. Methods: We pro…
▽ More
Purpose: Apparent diffusion coefficient (ADC) maps derived from diffusion weighted (DWI) MRI provides functional measurements about the water molecules in tissues. However, DWI is time consuming and very susceptible to image artifacts, leading to inaccurate ADC measurements. This study aims to develop a deep learning framework to synthesize ADC maps from multi-parametric MR images. Methods: We proposed the multiparametric residual vision transformer model (MPR-ViT) that leverages the long-range context of ViT layers along with the precision of convolutional operators. Residual blocks throughout the network significantly increasing the representational power of the model. The MPR-ViT model was applied to T1w and T2- fluid attenuated inversion recovery images of 501 glioma cases from a publicly available dataset including preprocessed ADC maps. Selected patients were divided into training (N=400), validation (N=50) and test (N=51) sets, respectively. Using the preprocessed ADC maps as ground truth, model performance was evaluated and compared against the Vision Convolutional Transformer (VCT) and residual vision transformer (ResViT) models. Results: The results are as follows using T1w + T2-FLAIR MRI as inputs: MPR-ViT - PSNR: 31.0 +/- 2.1, MSE: 0.009 +/- 0.0005, SSIM: 0.950 +/- 0.015. In addition, ablation studies showed the relative impact on performance of each input sequence. Both qualitative and quantitative results indicate that the proposed MR- ViT model performs favorably against the ground truth data. Conclusion: We show that high-quality ADC maps can be synthesized from structural MRI using a MPR- VCT model. Our predicted images show better conformality to the ground truth volume than ResViT and VCT predictions. These high-quality synthetic ADC maps would be particularly useful for disease diagnosis and intervention, especially when ADC maps have artifacts or are unavailable.
△ Less
Submitted 2 July, 2024;
originally announced July 2024.
-
MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
Authors:
Huiqiang Jiang,
Yucheng Li,
Chengruidong Zhang,
Qianhui Wu,
Xufang Luo,
Surin Ahn,
Zhenhua Han,
Amir H. Abdi,
Dongsheng Li,
Chin-Yew Lin,
Yuqing Yang,
Lili Qiu
Abstract:
The computational challenges of Large Language Model (LLM) inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens (i.e., the pre-filling stage) on a single A100 GPU. Existing methods for speeding up prefi…
▽ More
The computational challenges of Large Language Model (LLM) inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens (i.e., the pre-filling stage) on a single A100 GPU. Existing methods for speeding up prefilling often fail to maintain acceptable accuracy or efficiency when applied to long-context LLMs. To address this gap, we introduce MInference (Milliontokens Inference), a sparse calculation method designed to accelerate pre-filling of long-sequence processing. Specifically, we identify three unique patterns in long-context attention matrices-the A-shape, Vertical-Slash, and Block-Sparsethat can be leveraged for efficient sparse computation on GPUs. We determine the optimal pattern for each attention head offline and dynamically build sparse indices based on the assigned pattern during inference. With the pattern and sparse indices, we perform efficient sparse attention calculations via our optimized GPU kernels to significantly reduce the latency in the pre-filling stage of long-context LLMs. Our proposed technique can be directly applied to existing LLMs without any modifications to the pre-training setup or additional fine-tuning. By evaluating on a wide range of downstream tasks, including InfiniteBench, RULER, PG-19, and Needle In A Haystack, and models including LLaMA-3-1M, GLM4-1M, Yi-200K, Phi-3-128K, and Qwen2-128K, we demonstrate that MInference effectively reduces inference latency by up to 10x for pre-filling on an A100, while maintaining accuracy. Our code is available at https://aka.ms/MInference.
△ Less
Submitted 2 July, 2024;
originally announced July 2024.
-
Expressive and Generalizable Low-rank Adaptation for Large Models via Slow Cascaded Learning
Authors:
Siwei Li,
Yifan Yang,
Yifei Shen,
Fangyun Wei,
Zongqing Lu,
Lili Qiu,
Yuqing Yang
Abstract:
Efficient fine-tuning plays a fundamental role in modern large models, with low-rank adaptation emerging as a particularly promising approach. However, the existing variants of LoRA are hampered by limited expressiveness, a tendency to overfit, and sensitivity to hyperparameter settings. This paper presents LoRA Slow Cascade Learning (LoRASC), an innovative technique designed to enhance LoRA's exp…
▽ More
Efficient fine-tuning plays a fundamental role in modern large models, with low-rank adaptation emerging as a particularly promising approach. However, the existing variants of LoRA are hampered by limited expressiveness, a tendency to overfit, and sensitivity to hyperparameter settings. This paper presents LoRA Slow Cascade Learning (LoRASC), an innovative technique designed to enhance LoRA's expressiveness and generalization capabilities while preserving its training efficiency. Our approach augments expressiveness through a cascaded learning strategy that enables a mixture-of-low-rank adaptation, thereby increasing the model's ability to capture complex patterns. Additionally, we introduce a slow-fast update mechanism and cascading noisy tuning to bolster generalization. The extensive experiments on various language and vision datasets, as well as robustness benchmarks, demonstrate that the proposed method not only significantly outperforms existing baselines, but also mitigates overfitting, enhances model stability, and improves OOD robustness. Code will be release in https://github.com/microsoft/LoRASC very soon.
△ Less
Submitted 1 July, 2024;
originally announced July 2024.
-
StableNormal: Reducing Diffusion Variance for Stable and Sharp Normal
Authors:
Chongjie Ye,
Lingteng Qiu,
Xiaodong Gu,
Qi Zuo,
Yushuang Wu,
Zilong Dong,
Liefeng Bo,
Yuliang Xiu,
Xiaoguang Han
Abstract:
This work addresses the challenge of high-quality surface normal estimation from monocular colored inputs (i.e., images and videos), a field which has recently been revolutionized by repurposing diffusion priors. However, previous attempts still struggle with stochastic inference, conflicting with the deterministic nature of the Image2Normal task, and costly ensembling step, which slows down the e…
▽ More
This work addresses the challenge of high-quality surface normal estimation from monocular colored inputs (i.e., images and videos), a field which has recently been revolutionized by repurposing diffusion priors. However, previous attempts still struggle with stochastic inference, conflicting with the deterministic nature of the Image2Normal task, and costly ensembling step, which slows down the estimation process. Our method, StableNormal, mitigates the stochasticity of the diffusion process by reducing inference variance, thus producing "Stable-and-Sharp" normal estimates without any additional ensembling process. StableNormal works robustly under challenging imaging conditions, such as extreme lighting, blurring, and low quality. It is also robust against transparent and reflective surfaces, as well as cluttered scenes with numerous objects. Specifically, StableNormal employs a coarse-to-fine strategy, which starts with a one-step normal estimator (YOSO) to derive an initial normal guess, that is relatively coarse but reliable, then followed by a semantic-guided refinement process (SG-DRN) that refines the normals to recover geometric details. The effectiveness of StableNormal is demonstrated through competitive performance in standard datasets such as DIODE-indoor, iBims, ScannetV2 and NYUv2, and also in various downstream tasks, such as surface reconstruction and normal enhancement. These results evidence that StableNormal retains both the "stability" and "sharpness" for accurate normal estimation. StableNormal represents a baby attempt to repurpose diffusion priors for deterministic estimation. To democratize this, code and models have been publicly available in hf.co/Stable-X
△ Less
Submitted 24 June, 2024;
originally announced June 2024.
-
Adaptive Self-Supervised Consistency-Guided Diffusion Model for Accelerated MRI Reconstruction
Authors:
Mojtaba Safari,
Zach Eidex,
Shaoyan Pan,
Richard L. J. Qiu,
Xiaofeng Yang
Abstract:
Purpose: To propose a self-supervised deep learning-based compressed sensing MRI (DL-based CS-MRI) method named "Adaptive Self-Supervised Consistency Guided Diffusion Model (ASSCGD)" to accelerate data acquisition without requiring fully sampled datasets. Materials and Methods: We used the fastMRI multi-coil brain axial T2-weighted (T2-w) dataset from 1,376 cases and single-coil brain quantitative…
▽ More
Purpose: To propose a self-supervised deep learning-based compressed sensing MRI (DL-based CS-MRI) method named "Adaptive Self-Supervised Consistency Guided Diffusion Model (ASSCGD)" to accelerate data acquisition without requiring fully sampled datasets. Materials and Methods: We used the fastMRI multi-coil brain axial T2-weighted (T2-w) dataset from 1,376 cases and single-coil brain quantitative magnetization prepared 2 rapid acquisition gradient echoes (MP2RAGE) T1 maps from 318 cases to train and test our model. Robustness against domain shift was evaluated using two out-of-distribution (OOD) datasets: multi-coil brain axial postcontrast T1 -weighted (T1c) dataset from 50 cases and axial T1-weighted (T1-w) dataset from 50 patients. Data were retrospectively subsampled at acceleration rates R in {2x, 4x, 8x}. ASSCGD partitions a random sampling pattern into two disjoint sets, ensuring data consistency during training. We compared our method with ReconFormer Transformer and SS-MRI, assessing performance using normalized mean squared error (NMSE), peak signal-to-noise ratio (PSNR), and structural similarity index (SSIM). Statistical tests included one-way analysis of variance (ANOVA) and multi-comparison Tukey's Honesty Significant Difference (HSD) tests. Results: ASSCGD preserved fine structures and brain abnormalities visually better than comparative methods at R = 8x for both multi-coil and single-coil datasets. It achieved the lowest NMSE at R in {4x, 8x}, and the highest PSNR and SSIM values at all acceleration rates for the multi-coil dataset. Similar trends were observed for the single-coil dataset, though SSIM values were comparable to ReconFormer at R in {2x, 8x}. These results were further confirmed by the voxel-wise correlation scatter plots. OOD results showed significant (p << 10^-5 ) improvements in undersampled image quality after reconstruction.
△ Less
Submitted 21 June, 2024;
originally announced June 2024.
-
Boosting Scientific Concepts Understanding: Can Analogy from Teacher Models Empower Student Models?
Authors:
Siyu Yuan,
Cheng Jiayang,
Lin Qiu,
Deqing Yang
Abstract:
Analogical reasoning plays a critical role in human cognition, enabling us to understand new concepts by associating them with familiar ones. Previous research in the AI community has mainly focused on identifying and generating analogies and then examining their quality under human evaluation, which overlooks the practical application of these analogies in real-world settings. Inspired by the hum…
▽ More
Analogical reasoning plays a critical role in human cognition, enabling us to understand new concepts by associating them with familiar ones. Previous research in the AI community has mainly focused on identifying and generating analogies and then examining their quality under human evaluation, which overlooks the practical application of these analogies in real-world settings. Inspired by the human education process, in this paper, we propose to investigate how analogies created by teacher language models (LMs) can assist student LMs in understanding scientific concepts, thereby aligning more closely with practical scenarios. Our results suggest that free-form analogies can indeed aid LMs in understanding concepts. Additionally, analogies generated by student LMs can improve their own performance on scientific question answering, demonstrating their capability to use analogies for self-learning new knowledge. Resources are available at https://github.com/siyuyuan/SCUA.
△ Less
Submitted 17 June, 2024;
originally announced June 2024.
-
SCKansformer: Fine-Grained Classification of Bone Marrow Cells via Kansformer Backbone and Hierarchical Attention Mechanisms
Authors:
Yifei Chen,
Zhu Zhu,
Shenghao Zhu,
Linwei Qiu,
Binfeng Zou,
Fan Jia,
Yunpeng Zhu,
Chenyan Zhang,
Zhaojie Fang,
Feiwei Qin,
Jin Fan,
Changmiao Wang,
Yu Gao,
Gang Yu
Abstract:
The incidence and mortality rates of malignant tumors, such as acute leukemia, have risen significantly. Clinically, hospitals rely on cytological examination of peripheral blood and bone marrow smears to diagnose malignant tumors, with accurate blood cell counting being crucial. Existing automated methods face challenges such as low feature expression capability, poor interpretability, and redund…
▽ More
The incidence and mortality rates of malignant tumors, such as acute leukemia, have risen significantly. Clinically, hospitals rely on cytological examination of peripheral blood and bone marrow smears to diagnose malignant tumors, with accurate blood cell counting being crucial. Existing automated methods face challenges such as low feature expression capability, poor interpretability, and redundant feature extraction when processing high-dimensional microimage data. We propose a novel fine-grained classification model, SCKansformer, for bone marrow blood cells, which addresses these challenges and enhances classification accuracy and efficiency. The model integrates the Kansformer Encoder, SCConv Encoder, and Global-Local Attention Encoder. The Kansformer Encoder replaces the traditional MLP layer with the KAN, improving nonlinear feature representation and interpretability. The SCConv Encoder, with its Spatial and Channel Reconstruction Units, enhances feature representation and reduces redundancy. The Global-Local Attention Encoder combines Multi-head Self-Attention with a Local Part module to capture both global and local features. We validated our model using the Bone Marrow Blood Cell Fine-Grained Classification Dataset (BMCD-FGCD), comprising over 10,000 samples and nearly 40 classifications, developed with a partner hospital. Comparative experiments on our private dataset, as well as the publicly available PBC and ALL-IDB datasets, demonstrate that SCKansformer outperforms both typical and advanced microcell classification methods across all datasets. Our source code and private BMCD-FGCD dataset are available at https://github.com/JustlfC03/SCKansformer.
△ Less
Submitted 14 June, 2024;
originally announced June 2024.
-
Data Caching for Enterprise-Grade Petabyte-Scale OLAP
Authors:
Chunxu Tang,
Bin Fan,
Jing Zhao,
Chen Liang,
Yi Wang,
Beinan Wang,
Ziyue Qiu,
Lu Qiu,
Bowen Ding,
Shouzhuo Sun,
Saiguang Che,
Jiaming Mai,
Shouwei Chen,
Yu Zhu,
Jianjian Xie,
Yutian,
Sun,
Yao Li,
Yangjun Zhang,
Ke Wang,
Mingmin Chen
Abstract:
With the exponential growth of data and evolving use cases, petabyte-scale OLAP data platforms are increasingly adopting a model that decouples compute from storage. This shift, evident in organizations like Uber and Meta, introduces operational challenges including massive, read-heavy I/O traffic with potential throttling, as well as skewed and fragmented data access patterns. Addressing these ch…
▽ More
With the exponential growth of data and evolving use cases, petabyte-scale OLAP data platforms are increasingly adopting a model that decouples compute from storage. This shift, evident in organizations like Uber and Meta, introduces operational challenges including massive, read-heavy I/O traffic with potential throttling, as well as skewed and fragmented data access patterns. Addressing these challenges, this paper introduces the Alluxio local (edge) cache, a highly effective architectural optimization tailored for such environments. This embeddable cache, optimized for petabyte-scale data analytics, leverages local SSD resources to alleviate network I/O and API call pressures, significantly improving data transfer efficiency. Integrated with OLAP systems like Presto and storage services like HDFS, the Alluxio local cache has demonstrated its effectiveness in handling large-scale, enterprise-grade workloads over three years of deployment at Uber and Meta. We share insights and operational experiences in implementing these optimizations, providing valuable perspectives on managing modern, massive-scale OLAP workloads.
△ Less
Submitted 9 June, 2024;
originally announced June 2024.
-
Aligning Large Language Models via Fine-grained Supervision
Authors:
Dehong Xu,
Liang Qiu,
Minseok Kim,
Faisal Ladhak,
Jaeyoung Do
Abstract:
Pre-trained large-scale language models (LLMs) excel at producing coherent articles, yet their outputs may be untruthful, toxic, or fail to align with user expectations. Current approaches focus on using reinforcement learning with human feedback (RLHF) to improve model alignment, which works by transforming coarse human preferences of LLM outputs into a feedback signal that guides the model learn…
▽ More
Pre-trained large-scale language models (LLMs) excel at producing coherent articles, yet their outputs may be untruthful, toxic, or fail to align with user expectations. Current approaches focus on using reinforcement learning with human feedback (RLHF) to improve model alignment, which works by transforming coarse human preferences of LLM outputs into a feedback signal that guides the model learning process. However, because this approach operates on sequence-level feedback, it lacks the precision to identify the exact parts of the output affecting user preferences. To address this gap, we propose a method to enhance LLM alignment through fine-grained token-level supervision. Specifically, we ask annotators to minimally edit less preferred responses within the standard reward modeling dataset to make them more favorable, ensuring changes are made only where necessary while retaining most of the original content. The refined dataset is used to train a token-level reward model, which is then used for training our fine-grained Proximal Policy Optimization (PPO) model. Our experiment results demonstrate that this approach can achieve up to an absolute improvement of $5.1\%$ in LLM performance, in terms of win rate against the reference model, compared with the traditional PPO model.
△ Less
Submitted 4 June, 2024;
originally announced June 2024.
-
Mitigate Position Bias in Large Language Models via Scaling a Single Dimension
Authors:
Yijiong Yu,
Huiqiang Jiang,
Xufang Luo,
Qianhui Wu,
Chin-Yew Lin,
Dongsheng Li,
Yuqing Yang,
Yongfeng Huang,
Lili Qiu
Abstract:
Large Language Models (LLMs) are increasingly applied in various real-world scenarios due to their excellent generalization capabilities and robust generative abilities. However, they exhibit position bias, also known as "lost in the middle", a phenomenon that is especially pronounced in long-context scenarios, which indicates the placement of the key information in different positions of a prompt…
▽ More
Large Language Models (LLMs) are increasingly applied in various real-world scenarios due to their excellent generalization capabilities and robust generative abilities. However, they exhibit position bias, also known as "lost in the middle", a phenomenon that is especially pronounced in long-context scenarios, which indicates the placement of the key information in different positions of a prompt can significantly affect accuracy. This paper first explores the micro-level manifestations of position bias, concluding that attention weights are a micro-level expression of position bias. It further identifies that, in addition to position embeddings, causal attention mask also contributes to position bias by creating position-specific hidden states. Based on these insights, we propose a method to mitigate position bias by scaling this positional hidden states. Experiments on the NaturalQuestions Multi-document QA, KV retrieval, LongBench and timeline reorder tasks, using various models including RoPE models, context windowextended models, and Alibi models, demonstrate the effectiveness and generalizability of our approach. Our method can improve performance by up to 15.2% by modifying just one dimension of hidden states. Our code is available at https://aka.ms/PositionalHidden.
△ Less
Submitted 4 June, 2024;
originally announced June 2024.
-
Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
Authors:
Chaofan Lin,
Zhenhua Han,
Chengruidong Zhang,
Yuqing Yang,
Fan Yang,
Chen Chen,
Lili Qiu
Abstract:
The rise of large language models (LLMs) has enabled LLM-based applications (a.k.a. AI agents or co-pilots), a new software paradigm that combines the strength of LLM and conventional software. Diverse LLM applications from different tenants could design complex workflows using multiple LLM requests to accomplish one task. However, they have to use the over-simplified request-level API provided by…
▽ More
The rise of large language models (LLMs) has enabled LLM-based applications (a.k.a. AI agents or co-pilots), a new software paradigm that combines the strength of LLM and conventional software. Diverse LLM applications from different tenants could design complex workflows using multiple LLM requests to accomplish one task. However, they have to use the over-simplified request-level API provided by today's public LLM services, losing essential application-level information. Public LLM services have to blindly optimize individual LLM requests, leading to sub-optimal end-to-end performance of LLM applications.
This paper introduces Parrot, an LLM service system that focuses on the end-to-end experience of LLM-based applications. Parrot proposes Semantic Variable, a unified abstraction to expose application-level knowledge to public LLM services. A Semantic Variable annotates an input/output variable in the prompt of a request, and creates the data pipeline when connecting multiple LLM requests, providing a natural way to program LLM applications. Exposing Semantic Variables to the public LLM service allows it to perform conventional data flow analysis to uncover the correlation across multiple LLM requests. This correlation opens a brand-new optimization space for the end-to-end performance of LLM-based applications. Extensive evaluations demonstrate that Parrot can achieve up to an order-of-magnitude improvement for popular and practical use cases of LLM applications.
△ Less
Submitted 30 May, 2024;
originally announced May 2024.
-
Learning to Reason via Program Generation, Emulation, and Search
Authors:
Nathaniel Weir,
Muhammad Khalifa,
Linlu Qiu,
Orion Weller,
Peter Clark
Abstract:
Program synthesis with language models (LMs) has unlocked a large set of reasoning abilities; code-tuned LMs have proven adept at generating programs that solve a wide variety of algorithmic symbolic manipulation tasks (e.g. word concatenation). However, not all reasoning tasks are easily expressible as code, e.g. tasks involving commonsense reasoning, moral decision-making, and sarcasm understand…
▽ More
Program synthesis with language models (LMs) has unlocked a large set of reasoning abilities; code-tuned LMs have proven adept at generating programs that solve a wide variety of algorithmic symbolic manipulation tasks (e.g. word concatenation). However, not all reasoning tasks are easily expressible as code, e.g. tasks involving commonsense reasoning, moral decision-making, and sarcasm understanding. Our goal is to extend an LM's program synthesis skills to such tasks and evaluate the results via pseudo-programs, namely Python programs where some leaf function calls are left undefined. To that end, we propose, Code Generation and Emulated EXecution (CoGEX). CoGEX works by (1) training LMs to generate their own pseudo-programs, (2) teaching them to emulate their generated program's execution, including those leaf functions, allowing the LM's knowledge to fill in the execution gaps; and (3) using them to search over many programs to find an optimal one. To adapt the CoGEX model to a new task, we introduce a method for performing program search to find a single program whose pseudo-execution yields optimal performance when applied to all the instances of a given dataset. We show that our approach yields large improvements compared to standard in-context learning approaches on a battery of tasks, both algorithmic and soft reasoning. This result thus demonstrates that code synthesis can be applied to a much broader class of problems than previously considered. Our released dataset, fine-tuned models, and implementation can be found at \url{https://github.com/nweir127/CoGEX}.
△ Less
Submitted 28 May, 2024; v1 submitted 25 May, 2024;
originally announced May 2024.
-
RefChecker: Reference-based Fine-grained Hallucination Checker and Benchmark for Large Language Models
Authors:
Xiangkun Hu,
Dongyu Ru,
Lin Qiu,
Qipeng Guo,
Tianhang Zhang,
Yang Xu,
Yun Luo,
Pengfei Liu,
Yue Zhang,
Zheng Zhang
Abstract:
Large Language Models (LLMs) have shown impressive capabilities but also a concerning tendency to hallucinate. This paper presents RefChecker, a framework that introduces claim-triplets to represent claims in LLM responses, aiming to detect fine-grained hallucinations. In RefChecker, an extractor generates claim-triplets from a response, which are then evaluated by a checker against a reference. W…
▽ More
Large Language Models (LLMs) have shown impressive capabilities but also a concerning tendency to hallucinate. This paper presents RefChecker, a framework that introduces claim-triplets to represent claims in LLM responses, aiming to detect fine-grained hallucinations. In RefChecker, an extractor generates claim-triplets from a response, which are then evaluated by a checker against a reference. We delineate three task settings: Zero, Noisy and Accurate Context, to reflect various real-world use cases. We curated a benchmark spanning various NLP tasks and annotated 11k claim-triplets from 2.1k responses by seven LLMs. RefChecker supports both proprietary and open-source models as the extractor and checker. Experiments demonstrate that claim-triplets enable superior hallucination detection, compared to other granularities such as response, sentence and sub-sentence level claims. RefChecker outperforms prior methods by 6.8 to 26.1 points on our benchmark and the checking results of RefChecker are strongly aligned with human judgments. This work is open sourced at https://github.com/amazon-science/RefChecker
△ Less
Submitted 23 May, 2024;
originally announced May 2024.
-
Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers
Authors:
Peng Gao,
Le Zhuo,
Dongyang Liu,
Ruoyi Du,
Xu Luo,
Longtian Qiu,
Yuhang Zhang,
Chen Lin,
Rongjie Huang,
Shijie Geng,
Renrui Zhang,
Junlin Xi,
Wenqi Shao,
Zhengkai Jiang,
Tianshuo Yang,
Weicai Ye,
He Tong,
Jingwen He,
Yu Qiao,
Hongsheng Li
Abstract:
Sora unveils the potential of scaling Diffusion Transformer for generating photorealistic images and videos at arbitrary resolutions, aspect ratios, and durations, yet it still lacks sufficient implementation details. In this technical report, we introduce the Lumina-T2X family - a series of Flow-based Large Diffusion Transformers (Flag-DiT) equipped with zero-initialized attention, as a unified f…
▽ More
Sora unveils the potential of scaling Diffusion Transformer for generating photorealistic images and videos at arbitrary resolutions, aspect ratios, and durations, yet it still lacks sufficient implementation details. In this technical report, we introduce the Lumina-T2X family - a series of Flow-based Large Diffusion Transformers (Flag-DiT) equipped with zero-initialized attention, as a unified framework designed to transform noise into images, videos, multi-view 3D objects, and audio clips conditioned on text instructions. By tokenizing the latent spatial-temporal space and incorporating learnable placeholders such as [nextline] and [nextframe] tokens, Lumina-T2X seamlessly unifies the representations of different modalities across various spatial-temporal resolutions. This unified approach enables training within a single framework for different modalities and allows for flexible generation of multimodal data at any resolution, aspect ratio, and length during inference. Advanced techniques like RoPE, RMSNorm, and flow matching enhance the stability, flexibility, and scalability of Flag-DiT, enabling models of Lumina-T2X to scale up to 7 billion parameters and extend the context window to 128K tokens. This is particularly beneficial for creating ultra-high-definition images with our Lumina-T2I model and long 720p videos with our Lumina-T2V model. Remarkably, Lumina-T2I, powered by a 5-billion-parameter Flag-DiT, requires only 35% of the training computational costs of a 600-million-parameter naive DiT. Our further comprehensive analysis underscores Lumina-T2X's preliminary capability in resolution extrapolation, high-resolution editing, generating consistent 3D views, and synthesizing videos with seamless transitions. We expect that the open-sourcing of Lumina-T2X will further foster creativity, transparency, and diversity in the generative AI community.
△ Less
Submitted 13 June, 2024; v1 submitted 9 May, 2024;
originally announced May 2024.
-
SPECIAL: Synopsis Assisted Secure Collaborative Analytics
Authors:
Chenghong Wang,
Lina Qiu,
Johes Bater,
Yukui Luo
Abstract:
Secure collaborative analytics (SCA) enable the processing of analytical SQL queries across multiple owners' data, even when direct data sharing is not feasible. Although essential for strong privacy, the large overhead from data-oblivious primitives in traditional SCA has hindered its practical adoption. Recent SCA variants that permit controlled leakages under differential privacy (DP) show a be…
▽ More
Secure collaborative analytics (SCA) enable the processing of analytical SQL queries across multiple owners' data, even when direct data sharing is not feasible. Although essential for strong privacy, the large overhead from data-oblivious primitives in traditional SCA has hindered its practical adoption. Recent SCA variants that permit controlled leakages under differential privacy (DP) show a better balance between privacy and efficiency. However, they still face significant challenges, such as potentially unbounded privacy loss, suboptimal query planning, and lossy processing. To address these challenges, we introduce SPECIAL, the first SCA system that simultaneously ensures bounded privacy loss, advanced query planning, and lossless processing. SPECIAL employs a novel synopsis-assisted secure processing model, where a one-time privacy cost is spent to acquire private synopses (table statistics) from owner data. These synopses then allow SPECIAL to estimate (compaction) sizes for secure operations (e.g., filter, join) and index encrypted data without extra privacy loss. Crucially, these estimates and indexes can be prepared before runtime, thereby facilitating efficient query planning and accurate cost estimations. Moreover, by using one-sided noise mechanisms and private upper bound techniques, SPECIAL ensures strict lossless processing for complex queries (e.g., multi-join). Through a comprehensive benchmark, we show that SPECIAL significantly outperforms cutting-edge SCAs, with up to 80X faster query times and over 900X smaller memory for complex queries. Moreover, it also achieves up to an 89X reduction in privacy loss under continual processing.
△ Less
Submitted 28 April, 2024;
originally announced April 2024.
-
Mammo-CLIP: Leveraging Contrastive Language-Image Pre-training (CLIP) for Enhanced Breast Cancer Diagnosis with Multi-view Mammography
Authors:
Xuxin Chen,
Yuheng Li,
Mingzhe Hu,
Ella Salari,
Xiaoqian Chen,
Richard L. J. Qiu,
Bin Zheng,
Xiaofeng Yang
Abstract:
Although fusion of information from multiple views of mammograms plays an important role to increase accuracy of breast cancer detection, developing multi-view mammograms-based computer-aided diagnosis (CAD) schemes still faces challenges and no such CAD schemes have been used in clinical practice. To overcome the challenges, we investigate a new approach based on Contrastive Language-Image Pre-tr…
▽ More
Although fusion of information from multiple views of mammograms plays an important role to increase accuracy of breast cancer detection, developing multi-view mammograms-based computer-aided diagnosis (CAD) schemes still faces challenges and no such CAD schemes have been used in clinical practice. To overcome the challenges, we investigate a new approach based on Contrastive Language-Image Pre-training (CLIP), which has sparked interest across various medical imaging tasks. By solving the challenges in (1) effectively adapting the single-view CLIP for multi-view feature fusion and (2) efficiently fine-tuning this parameter-dense model with limited samples and computational resources, we introduce Mammo-CLIP, the first multi-modal framework to process multi-view mammograms and corresponding simple texts. Mammo-CLIP uses an early feature fusion strategy to learn multi-view relationships in four mammograms acquired from the CC and MLO views of the left and right breasts. To enhance learning efficiency, plug-and-play adapters are added into CLIP image and text encoders for fine-tuning parameters and limiting updates to about 1% of the parameters. For framework evaluation, we assembled two datasets retrospectively. The first dataset, comprising 470 malignant and 479 benign cases, was used for few-shot fine-tuning and internal evaluation of the proposed Mammo-CLIP via 5-fold cross-validation. The second dataset, including 60 malignant and 294 benign cases, was used to test generalizability of Mammo-CLIP. Study results show that Mammo-CLIP outperforms the state-of-art cross-view transformer in AUC (0.841 vs. 0.817, 0.837 vs. 0.807) on both datasets. It also surpasses previous two CLIP-based methods by 20.3% and 14.3%. This study highlights the potential of applying the finetuned vision-language models for developing next-generation, image-text-based CAD schemes of breast cancer.
△ Less
Submitted 24 April, 2024;
originally announced April 2024.
-
Position Engineering: Boosting Large Language Models through Positional Information Manipulation
Authors:
Zhiyuan He,
Huiqiang Jiang,
Zilong Wang,
Yuqing Yang,
Luna Qiu,
Lili Qiu
Abstract:
The performance of large language models (LLMs) is significantly influenced by the quality of the prompts provided. In response, researchers have developed enormous prompt engineering strategies aimed at modifying the prompt text to enhance task performance. In this paper, we introduce a novel technique termed position engineering, which offers a more efficient way to guide large language models.…
▽ More
The performance of large language models (LLMs) is significantly influenced by the quality of the prompts provided. In response, researchers have developed enormous prompt engineering strategies aimed at modifying the prompt text to enhance task performance. In this paper, we introduce a novel technique termed position engineering, which offers a more efficient way to guide large language models. Unlike prompt engineering, which requires substantial effort to modify the text provided to LLMs, position engineering merely involves altering the positional information in the prompt without modifying the text itself. We have evaluated position engineering in two widely-used LLM scenarios: retrieval-augmented generation (RAG) and in-context learning (ICL). Our findings show that position engineering substantially improves upon the baseline in both cases. Position engineering thus represents a promising new strategy for exploiting the capabilities of large language models.
△ Less
Submitted 17 April, 2024;
originally announced April 2024.
-
Language Model Evolution: An Iterated Learning Perspective
Authors:
Yi Ren,
Shangmin Guo,
Linlu Qiu,
Bailin Wang,
Danica J. Sutherland
Abstract:
With the widespread adoption of Large Language Models (LLMs), the prevalence of iterative interactions among these models is anticipated to increase. Notably, recent advancements in multi-round self-improving methods allow LLMs to generate new examples for training subsequent models. At the same time, multi-agent LLM systems, involving automated interactions among agents, are also increasing in pr…
▽ More
With the widespread adoption of Large Language Models (LLMs), the prevalence of iterative interactions among these models is anticipated to increase. Notably, recent advancements in multi-round self-improving methods allow LLMs to generate new examples for training subsequent models. At the same time, multi-agent LLM systems, involving automated interactions among agents, are also increasing in prominence. Thus, in both short and long terms, LLMs may actively engage in an evolutionary process. We draw parallels between the behavior of LLMs and the evolution of human culture, as the latter has been extensively studied by cognitive scientists for decades. Our approach involves leveraging Iterated Learning (IL), a Bayesian framework that elucidates how subtle biases are magnified during human cultural evolution, to explain some behaviors of LLMs. This paper outlines key characteristics of agents' behavior in the Bayesian-IL framework, including predictions that are supported by experimental verification with various LLMs. This theoretical framework could help to more effectively predict and guide the evolution of LLMs in desired directions.
△ Less
Submitted 3 April, 2024;
originally announced April 2024.
-
LLM-ABR: Designing Adaptive Bitrate Algorithms via Large Language Models
Authors:
Zhiyuan He,
Aashish Gottipati,
Lili Qiu,
Francis Y. Yan,
Xufang Luo,
Kenuo Xu,
Yuqing Yang
Abstract:
We present LLM-ABR, the first system that utilizes the generative capabilities of large language models (LLMs) to autonomously design adaptive bitrate (ABR) algorithms tailored for diverse network characteristics. Operating within a reinforcement learning framework, LLM-ABR empowers LLMs to design key components such as states and neural network architectures. We evaluate LLM-ABR across diverse ne…
▽ More
We present LLM-ABR, the first system that utilizes the generative capabilities of large language models (LLMs) to autonomously design adaptive bitrate (ABR) algorithms tailored for diverse network characteristics. Operating within a reinforcement learning framework, LLM-ABR empowers LLMs to design key components such as states and neural network architectures. We evaluate LLM-ABR across diverse network settings, including broadband, satellite, 4G, and 5G. LLM-ABR consistently outperforms default ABR algorithms.
△ Less
Submitted 1 April, 2024;
originally announced April 2024.
-
LLM-RadJudge: Achieving Radiologist-Level Evaluation for X-Ray Report Generation
Authors:
Zilong Wang,
Xufang Luo,
Xinyang Jiang,
Dongsheng Li,
Lili Qiu
Abstract:
Evaluating generated radiology reports is crucial for the development of radiology AI, but existing metrics fail to reflect the task's clinical requirements. This study proposes a novel evaluation framework using large language models (LLMs) to compare radiology reports for assessment. We compare the performance of various LLMs and demonstrate that, when using GPT-4, our proposed metric achieves e…
▽ More
Evaluating generated radiology reports is crucial for the development of radiology AI, but existing metrics fail to reflect the task's clinical requirements. This study proposes a novel evaluation framework using large language models (LLMs) to compare radiology reports for assessment. We compare the performance of various LLMs and demonstrate that, when using GPT-4, our proposed metric achieves evaluation consistency close to that of radiologists. Furthermore, to reduce costs and improve accessibility, making this method practical, we construct a dataset using LLM evaluation results and perform knowledge distillation to train a smaller model. The distilled model achieves evaluation capabilities comparable to GPT-4. Our framework and distilled model offer an accessible and efficient evaluation method for radiology report generation, facilitating the development of more clinically relevant models. The model will be further open-sourced and accessible.
△ Less
Submitted 1 April, 2024;
originally announced April 2024.
-
IPoD: Implicit Field Learning with Point Diffusion for Generalizable 3D Object Reconstruction from Single RGB-D Images
Authors:
Yushuang Wu,
Luyue Shi,
Junhao Cai,
Weihao Yuan,
Lingteng Qiu,
Zilong Dong,
Liefeng Bo,
Shuguang Cui,
Xiaoguang Han
Abstract:
Generalizable 3D object reconstruction from single-view RGB-D images remains a challenging task, particularly with real-world data. Current state-of-the-art methods develop Transformer-based implicit field learning, necessitating an intensive learning paradigm that requires dense query-supervision uniformly sampled throughout the entire space. We propose a novel approach, IPoD, which harmonizes im…
▽ More
Generalizable 3D object reconstruction from single-view RGB-D images remains a challenging task, particularly with real-world data. Current state-of-the-art methods develop Transformer-based implicit field learning, necessitating an intensive learning paradigm that requires dense query-supervision uniformly sampled throughout the entire space. We propose a novel approach, IPoD, which harmonizes implicit field learning with point diffusion. This approach treats the query points for implicit field learning as a noisy point cloud for iterative denoising, allowing for their dynamic adaptation to the target object shape. Such adaptive query points harness diffusion learning's capability for coarse shape recovery and also enhances the implicit representation's ability to delineate finer details. Besides, an additional self-conditioning mechanism is designed to use implicit predictions as the guidance of diffusion learning, leading to a cooperative system. Experiments conducted on the CO3D-v2 dataset affirm the superiority of IPoD, achieving 7.8% improvement in F-score and 28.6% in Chamfer distance over existing methods. The generalizability of IPoD is also demonstrated on the MVImgNet dataset. Our project page is at https://yushuang-wu.github.io/IPoD.
△ Less
Submitted 30 March, 2024;
originally announced April 2024.
-
EventGround: Narrative Reasoning by Grounding to Eventuality-centric Knowledge Graphs
Authors:
Cheng Jiayang,
Lin Qiu,
Chunkit Chan,
Xin Liu,
Yangqiu Song,
Zheng Zhang
Abstract:
Narrative reasoning relies on the understanding of eventualities in story contexts, which requires a wealth of background world knowledge. To help machines leverage such knowledge, existing solutions can be categorized into two groups. Some focus on implicitly modeling eventuality knowledge by pretraining language models (LMs) with eventuality-aware objectives. However, this approach breaks down k…
▽ More
Narrative reasoning relies on the understanding of eventualities in story contexts, which requires a wealth of background world knowledge. To help machines leverage such knowledge, existing solutions can be categorized into two groups. Some focus on implicitly modeling eventuality knowledge by pretraining language models (LMs) with eventuality-aware objectives. However, this approach breaks down knowledge structures and lacks interpretability. Others explicitly collect world knowledge of eventualities into structured eventuality-centric knowledge graphs (KGs). However, existing research on leveraging these knowledge sources for free-texts is limited. In this work, we propose an initial comprehensive framework called EventGround, which aims to tackle the problem of grounding free-texts to eventuality-centric KGs for contextualized narrative reasoning. We identify two critical problems in this direction: the event representation and sparsity problems. We provide simple yet effective parsing and partial information extraction methods to tackle these problems. Experimental results demonstrate that our approach consistently outperforms baseline models when combined with graph neural network (GNN) or large language model (LLM) based graph reasoning models. Our framework, incorporating grounded knowledge, achieves state-of-the-art performance while providing interpretable evidence.
△ Less
Submitted 29 March, 2024;
originally announced April 2024.
-
Energy-Efficient Hybrid Beamforming with Dynamic On-off Control for Integrated Sensing, Communications, and Powering
Authors:
Zeyu Hao,
Yuan Fang,
Xianghao Yu,
Jie Xu,
Ling Qiu,
Lexi Xu,
Shuguang Cui
Abstract:
This paper investigates the energy-efficient hybrid beamforming design for a multi-functional integrated sensing, communications, and powering (ISCAP) system. In this system, a base station (BS) with a hybrid analog-digital (HAD) architecture sends unified wireless signals to communicate with multiple information receivers (IRs), sense multiple point targets, and wirelessly charge multiple energy…
▽ More
This paper investigates the energy-efficient hybrid beamforming design for a multi-functional integrated sensing, communications, and powering (ISCAP) system. In this system, a base station (BS) with a hybrid analog-digital (HAD) architecture sends unified wireless signals to communicate with multiple information receivers (IRs), sense multiple point targets, and wirelessly charge multiple energy receivers (ERs) at the same time. To facilitate the energy-efficient design, we present a novel HAD architecture for the BS transmitter, which allows dynamic on-off control of its radio frequency (RF) chains and analog phase shifters (PSs) through a switch network. We also consider a practical and comprehensive power consumption model for the BS, by taking into account the power-dependent non-linear power amplifier (PA) efficiency, and the on-off non-transmission power consumption model of RF chains and PSs. We jointly design the hybrid beamforming and dynamic on-off control at the BS, aiming to minimize its total power consumption, while guaranteeing the performance requirements on communication rates, sensing Cramér-Rao bound (CRB), and harvested power levels. The formulation also takes into consideration the per-antenna transmit power constraint and the constant modulus constraints for the analog beamformer at the BS. The resulting optimization problem for ISCAP is highly non-convex. Please refer to the paper for a complete abstract.
△ Less
Submitted 24 March, 2024;
originally announced March 2024.
-
LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression
Authors:
Zhuoshi Pan,
Qianhui Wu,
Huiqiang Jiang,
Menglin Xia,
Xufang Luo,
Jue Zhang,
Qingwei Lin,
Victor Rühle,
Yuqing Yang,
Chin-Yew Lin,
H. Vicky Zhao,
Lili Qiu,
Dongmei Zhang
Abstract:
This paper focuses on task-agnostic prompt compression for better generalizability and efficiency. Considering the redundancy in natural language, existing approaches compress prompts by removing tokens or lexical units according to their information entropy obtained from a causal language model such as LLaMa-7B. The challenge is that information entropy may be a suboptimal compression metric: (i)…
▽ More
This paper focuses on task-agnostic prompt compression for better generalizability and efficiency. Considering the redundancy in natural language, existing approaches compress prompts by removing tokens or lexical units according to their information entropy obtained from a causal language model such as LLaMa-7B. The challenge is that information entropy may be a suboptimal compression metric: (i) it only leverages unidirectional context and may fail to capture all essential information needed for prompt compression; (ii) it is not aligned with the prompt compression objective.
To address these issues, we propose a data distillation procedure to derive knowledge from an LLM to compress prompts without losing crucial information, and meantime, introduce an extractive text compression dataset. We formulate prompt compression as a token classification problem to guarantee the faithfulness of the compressed prompt to the original one, and use a Transformer encoder as the base architecture to capture all essential information for prompt compression from the full bidirectional context. Our approach leads to lower latency by explicitly learning the compression objective with smaller models such as XLM-RoBERTa-large and mBERT.
We evaluate our method on both in-domain and out-of-domain datasets, including MeetingBank, LongBench, ZeroScrolls, GSM8K, and BBH. Despite its small size, our model shows significant performance gains over strong baselines and demonstrates robust generalization ability across different LLMs. Additionally, our model is 3x-6x faster than existing prompt compression methods, while accelerating the end-to-end latency by 1.6x-2.9x with compression ratios of 2x-5x.
△ Less
Submitted 19 March, 2024;
originally announced March 2024.
-
XPose: eXplainable Human Pose Estimation
Authors:
Luyu Qiu,
Jianing Li,
Lei Wen,
Chi Su,
Fei Hao,
Chen Jason Zhang,
Lei Chen
Abstract:
Current approaches in pose estimation primarily concentrate on enhancing model architectures, often overlooking the importance of comprehensively understanding the rationale behind model decisions. In this paper, we propose XPose, a novel framework that incorporates Explainable AI (XAI) principles into pose estimation. This integration aims to elucidate the individual contribution of each keypoint…
▽ More
Current approaches in pose estimation primarily concentrate on enhancing model architectures, often overlooking the importance of comprehensively understanding the rationale behind model decisions. In this paper, we propose XPose, a novel framework that incorporates Explainable AI (XAI) principles into pose estimation. This integration aims to elucidate the individual contribution of each keypoint to final prediction, thereby elevating the model's transparency and interpretability. Conventional XAI techniques have predominantly addressed tasks with single-target tasks like classification. Additionally, the application of Shapley value, a common measure in XAI, to pose estimation has been hindered by prohibitive computational demands.
To address these challenges, this work introduces an innovative concept called Group Shapley Value (GSV). This approach strategically organizes keypoints into clusters based on their interdependencies. Within these clusters, GSV meticulously calculates Shapley value for keypoints, while for inter-cluster keypoints, it opts for a more holistic group-level valuation. This dual-level computation framework meticulously assesses keypoint contributions to the final outcome, optimizing computational efficiency. Building on the insights into keypoint interactions, we devise a novel data augmentation technique known as Group-based Keypoint Removal (GKR). This method ingeniously removes individual keypoints during training phases, deliberately preserving those with strong mutual connections, thereby refining the model's predictive prowess for non-visible keypoints. The empirical validation of GKR across a spectrum of standard approaches attests to its efficacy. GKR's success demonstrates how using Explainable AI (XAI) can directly enhance pose estimation models.
△ Less
Submitted 18 March, 2024;
originally announced March 2024.
-
VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model
Authors:
Qi Zuo,
Xiaodong Gu,
Lingteng Qiu,
Yuan Dong,
Zhengyi Zhao,
Weihao Yuan,
Rui Peng,
Siyu Zhu,
Zilong Dong,
Liefeng Bo,
Qixing Huang
Abstract:
Generating multi-view images based on text or single-image prompts is a critical capability for the creation of 3D content. Two fundamental questions on this topic are what data we use for training and how to ensure multi-view consistency. This paper introduces a novel framework that makes fundamental contributions to both questions. Unlike leveraging images from 2D diffusion models for training,…
▽ More
Generating multi-view images based on text or single-image prompts is a critical capability for the creation of 3D content. Two fundamental questions on this topic are what data we use for training and how to ensure multi-view consistency. This paper introduces a novel framework that makes fundamental contributions to both questions. Unlike leveraging images from 2D diffusion models for training, we propose a dense consistent multi-view generation model that is fine-tuned from off-the-shelf video generative models. Images from video generative models are more suitable for multi-view generation because the underlying network architecture that generates them employs a temporal module to enforce frame consistency. Moreover, the video data sets used to train these models are abundant and diverse, leading to a reduced train-finetuning domain gap. To enhance multi-view consistency, we introduce a 3D-Aware Denoising Sampling, which first employs a feed-forward reconstruction module to get an explicit global 3D model, and then adopts a sampling strategy that effectively involves images rendered from the global 3D model into the denoising sampling loop to improve the multi-view consistency of the final images. As a by-product, this module also provides a fast way to create 3D assets represented by 3D Gaussians within a few seconds. Our approach can generate 24 dense views and converges much faster in training than state-of-the-art approaches (4 GPU hours versus many thousand GPU hours) with comparable visual quality and consistency. By further fine-tuning, our approach outperforms existing state-of-the-art methods in both quantitative metrics and visual effects. Our project page is aigc3d.github.io/VideoMV.
△ Less
Submitted 18 March, 2024;
originally announced March 2024.
-
Accelerating Biclique Counting on GPU
Authors:
Linshan Qiu,
Zhonggen Li,
Xiangyu Ke,
Lu Chen,
Yunjun Gao
Abstract:
Counting (p,q)-bicliques in bipartite graphs poses a foundational challenge with broad applications, from densest subgraph discovery in algorithmic research to personalized content recommendation in practical scenarios. Despite its significance, current leading (p,q)-biclique counting algorithms fall short, particularly when faced with larger graph sizes and clique scales. Fortunately, the problem…
▽ More
Counting (p,q)-bicliques in bipartite graphs poses a foundational challenge with broad applications, from densest subgraph discovery in algorithmic research to personalized content recommendation in practical scenarios. Despite its significance, current leading (p,q)-biclique counting algorithms fall short, particularly when faced with larger graph sizes and clique scales. Fortunately, the problem's inherent structure, allowing for the independent counting of each biclique starting from every vertex, combined with a substantial set intersections, makes it highly amenable to parallelization. Recent successes in GPU-accelerated algorithms across various domains motivate our exploration into harnessing the parallelism power of GPUs to efficiently address the (p,q)-biclique counting challenge. We introduce GBC (GPU-based Biclique Counting), a novel approach designed to enable efficient and scalable (p,q)-biclique counting on GPUs. To address major bottleneck arising from redundant comparisons in set intersections (occupying an average of 90% of the runtime), we introduce a novel data structure that hashes adjacency lists into truncated bitmaps to enable efficient set intersection on GPUs via bit-wise AND operations. Our innovative hybrid DFS-BFS exploration strategy further enhances thread utilization and effectively manages memory constraints. A composite load balancing strategy, integrating pre-runtime and runtime workload allocation, ensures equitable distribution among threads. Additionally, we employ vertex reordering and graph partitioning strategies for improved compactness and scalability. Experimental evaluations on eight real-life and two synthetic datasets demonstrate that GBC outperforms state-of-the-art algorithms by a substantial margin. In particular, GBC achieves an average speedup of 497.8x, with the largest instance achieving a remarkable 1217.7x speedup when p = q = 8.
△ Less
Submitted 20 March, 2024; v1 submitted 12 March, 2024;
originally announced March 2024.
-
ROG$_{PL}$: Robust Open-Set Graph Learning via Region-Based Prototype Learning
Authors:
Qin Zhang,
Xiaowei Li,
Jiexin Lu,
Liping Qiu,
Shirui Pan,
Xiaojun Chen,
Junyang Chen
Abstract:
Open-set graph learning is a practical task that aims to classify the known class nodes and to identify unknown class samples as unknowns. Conventional node classification methods usually perform unsatisfactorily in open-set scenarios due to the complex data they encounter, such as out-of-distribution (OOD) data and in-distribution (IND) noise. OOD data are samples that do not belong to any known…
▽ More
Open-set graph learning is a practical task that aims to classify the known class nodes and to identify unknown class samples as unknowns. Conventional node classification methods usually perform unsatisfactorily in open-set scenarios due to the complex data they encounter, such as out-of-distribution (OOD) data and in-distribution (IND) noise. OOD data are samples that do not belong to any known classes. They are outliers if they occur in training (OOD noise), and open-set samples if they occur in testing. IND noise are training samples which are assigned incorrect labels. The existence of IND noise and OOD noise is prevalent, which usually cause the ambiguity problem, including the intra-class variety problem and the inter-class confusion problem. Thus, to explore robust open-set learning methods is necessary and difficult, and it becomes even more difficult for non-IID graph data.To this end, we propose a unified framework named ROG$_{PL}$ to achieve robust open-set learning on complex noisy graph data, by introducing prototype learning. In specific, ROG$_{PL}$ consists of two modules, i.e., denoising via label propagation and open-set prototype learning via regions. The first module corrects noisy labels through similarity-based label propagation and removes low-confidence samples, to solve the intra-class variety problem caused by noise. The second module learns open-set prototypes for each known class via non-overlapped regions and remains both interior and border prototypes to remedy the inter-class confusion problem.The two modules are iteratively updated under the constraints of classification loss and prototype diversity loss. To the best of our knowledge, the proposed ROG$_{PL}$ is the first robust open-set node classification method for graph data with complex noise.
△ Less
Submitted 29 February, 2024; v1 submitted 28 February, 2024;
originally announced February 2024.
-
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
Authors:
Dongyang Liu,
Renrui Zhang,
Longtian Qiu,
Siyuan Huang,
Weifeng Lin,
Shitian Zhao,
Shijie Geng,
Ziyi Lin,
Peng Jin,
Kaipeng Zhang,
Wenqi Shao,
Chao Xu,
Conghui He,
Junjun He,
Hao Shao,
Pan Lu,
Hongsheng Li,
Yu Qiao,
Peng Gao
Abstract:
We propose SPHINX-X, an extensive Multimodality Large Language Model (MLLM) series developed upon SPHINX. To improve the architecture and training efficiency, we modify the SPHINX framework by removing redundant visual encoders, bypassing fully-padded sub-images with skip tokens, and simplifying multi-stage training into a one-stage all-in-one paradigm. To fully unleash the potential of MLLMs, we…
▽ More
We propose SPHINX-X, an extensive Multimodality Large Language Model (MLLM) series developed upon SPHINX. To improve the architecture and training efficiency, we modify the SPHINX framework by removing redundant visual encoders, bypassing fully-padded sub-images with skip tokens, and simplifying multi-stage training into a one-stage all-in-one paradigm. To fully unleash the potential of MLLMs, we assemble a comprehensive multi-domain and multimodal dataset covering publicly available resources in language, vision, and vision-language tasks. We further enrich this collection with our curated OCR intensive and Set-of-Mark datasets, extending the diversity and generality. By training over different base LLMs including TinyLlama1.1B, InternLM2-7B, LLaMA2-13B, and Mixtral8x7B, we obtain a spectrum of MLLMs that vary in parameter size and multilingual capabilities. Comprehensive benchmarking reveals a strong correlation between the multi-modal performance with the data and parameter scales. Code and models are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory
△ Less
Submitted 26 June, 2024; v1 submitted 8 February, 2024;
originally announced February 2024.
-
Meeting Bridges: Designing Information Artifacts that Bridge from Synchronous Meetings to Asynchronous Collaboration
Authors:
Ruotong Wang,
Lin Qiu,
Justin Cranshaw,
Amy X. Zhang
Abstract:
A recent surge in remote meetings has led to complaints of ``Zoom fatigue'' and ``collaboration overload,'' negatively impacting worker productivity and well-being. One way to alleviate the burden of meetings is to de-emphasize their synchronous participation by shifting work to and enabling sensemaking during post-meeting asynchronous activities. Towards this goal, we propose the design concept o…
▽ More
A recent surge in remote meetings has led to complaints of ``Zoom fatigue'' and ``collaboration overload,'' negatively impacting worker productivity and well-being. One way to alleviate the burden of meetings is to de-emphasize their synchronous participation by shifting work to and enabling sensemaking during post-meeting asynchronous activities. Towards this goal, we propose the design concept of meeting bridges, or information artifacts that can encapsulate meeting information towards bridging to and facilitating post-meeting activities. Through 13 interviews and a survey of 198 information workers, we learn how people use online meeting information after meetings are over, finding five main uses: as an archive, as task reminders, to onboard or support inclusion, for group sensemaking, and as a launching point for follow-on collaboration. However, we also find that current common meeting artifacts, such as notes and recordings, present challenges in serving as meeting bridges. After conducting co-design sessions with 16 participants, we distill key principles for the design of meeting bridges to optimally support asynchronous collaboration goals. Overall, our findings point to the opportunity of designing information artifacts that not only support users to access but also continue to transform and engage in meeting information post-meeting.
△ Less
Submitted 5 February, 2024;
originally announced February 2024.
-
Vision Transformer-based Multimodal Feature Fusion Network for Lymphoma Segmentation on PET/CT Images
Authors:
Huan Huang,
Liheng Qiu,
Shenmiao Yang,
Longxi Li,
Jiaofen Nan,
Yanting Li,
Chuang Han,
Fubao Zhu,
Chen Zhao,
Weihua Zhou
Abstract:
Background: Diffuse large B-cell lymphoma (DLBCL) segmentation is a challenge in medical image analysis. Traditional segmentation methods for lymphoma struggle with the complex patterns and the presence of DLBCL lesions. Objective: We aim to develop an accurate method for lymphoma segmentation with 18F-Fluorodeoxyglucose positron emission tomography (PET) and computed tomography (CT) images. Metho…
▽ More
Background: Diffuse large B-cell lymphoma (DLBCL) segmentation is a challenge in medical image analysis. Traditional segmentation methods for lymphoma struggle with the complex patterns and the presence of DLBCL lesions. Objective: We aim to develop an accurate method for lymphoma segmentation with 18F-Fluorodeoxyglucose positron emission tomography (PET) and computed tomography (CT) images. Methods: Our lymphoma segmentation approach combines a vision transformer with dual encoders, adeptly fusing PET and CT data via multimodal cross-attention fusion (MMCAF) module. In this study, PET and CT data from 165 DLBCL patients were analyzed. A 5-fold cross-validation was employed to evaluate the performance and generalization ability of our method. Ground truths were annotated by experienced nuclear medicine experts. We calculated the total metabolic tumor volume (TMTV) and performed a statistical analysis on our results. Results: The proposed method exhibited accurate performance in DLBCL lesion segmentation, achieving a Dice similarity coefficient of 0.9173$\pm$0.0071, a Hausdorff distance of 2.71$\pm$0.25mm, a sensitivity of 0.9462$\pm$0.0223, and a specificity of 0.9986$\pm$0.0008. Additionally, a Pearson correlation coefficient of 0.9030$\pm$0.0179 and an R-square of 0.8586$\pm$0.0173 were observed in TMTV when measured on manual annotation compared to our segmentation results. Conclusion: This study highlights the advantages of MMCAF and vision transformer for lymphoma segmentation using PET and CT, offering great promise for computer-aided lymphoma diagnosis and treatment.
△ Less
Submitted 4 February, 2024;
originally announced February 2024.
-
GPU-Accelerated Batch-Dynamic Subgraph Matching
Authors:
Linshan Qiu,
Lu Chen,
Hailiang Jie,
Xiangyu Ke,
Yunjun Gao,
Yang Liu,
Zetao Zhang
Abstract:
Subgraph matching has garnered increasing attention for its diverse real-world applications. Given the dynamic nature of real-world graphs, addressing evolving scenarios without incurring prohibitive overheads has been a focus of research. However, existing approaches for dynamic subgraph matching often proceed serially, retrieving incremental matches for each updated edge individually. This appro…
▽ More
Subgraph matching has garnered increasing attention for its diverse real-world applications. Given the dynamic nature of real-world graphs, addressing evolving scenarios without incurring prohibitive overheads has been a focus of research. However, existing approaches for dynamic subgraph matching often proceed serially, retrieving incremental matches for each updated edge individually. This approach falls short when handling batch data updates, leading to a decrease in system throughput. Leveraging the parallel processing power of GPUs, which can execute a massive number of cores simultaneously, has been widely recognized for performance acceleration in various domains. Surprisingly, systematic exploration of subgraph matching in the context of batch-dynamic graphs, particularly on a GPU platform, remains untouched. In this paper, we bridge this gap by introducing an efficient framework, GAMMA (GPU-Accelerated Batch-Dynamic Subgraph Matching). Our approach features a DFS-based warp-centric batch-dynamic subgraph matching algorithm. To ensure load balance in the DFS-based search, we propose warp-level work stealing via shared memory. Additionally, we introduce coalesced search to reduce redundant computations. Comprehensive experiments demonstrate the superior performance of GAMMA. Compared to state-of-the-art algorithms, GAMMA showcases a performance improvement up to hundreds of times.
△ Less
Submitted 30 January, 2024;
originally announced January 2024.
-
A Novel Garment Transfer Method Supervised by Distilled Knowledge of Virtual Try-on Model
Authors:
Naiyu Fang,
Lemiao Qiu,
Shuyou Zhang,
Zili Wang,
Kerui Hu,
Jianrong Tan
Abstract:
This paper proposes a novel garment transfer method supervised with knowledge distillation from virtual try-on. Our method first reasons the transfer parsing to provide shape prior to downstream tasks. We employ a multi-phase teaching strategy to supervise the training of the transfer parsing reasoning model, learning the response and feature knowledge from the try-on parsing reasoning model. To c…
▽ More
This paper proposes a novel garment transfer method supervised with knowledge distillation from virtual try-on. Our method first reasons the transfer parsing to provide shape prior to downstream tasks. We employ a multi-phase teaching strategy to supervise the training of the transfer parsing reasoning model, learning the response and feature knowledge from the try-on parsing reasoning model. To correct the teaching error, it transfers the garment back to its owner to absorb the hard knowledge in the self-study phase. Guided by the transfer parsing, we adjust the position of the transferred garment via STN to prevent distortion. Afterward, we estimate a progressive flow to precisely warp the garment with shape and content correspondences. To ensure warping rationality, we supervise the training of the garment warping model using target shape and warping knowledge from virtual try-on. To better preserve body features in the transfer result, we propose a well-designed training strategy for the arm regrowth task to infer new exposure skin. Experiments demonstrate that our method has state-of-the-art performance compared with other virtual try-on and garment transfer methods in garment transfer, especially for preserving garment texture and body features.
△ Less
Submitted 4 April, 2024; v1 submitted 22 January, 2024;
originally announced January 2024.
-
Multi-level Cross-modal Alignment for Image Clustering
Authors:
Liping Qiu,
Qin Zhang,
Xiaojun Chen,
Shaotian Cai
Abstract:
Recently, the cross-modal pretraining model has been employed to produce meaningful pseudo-labels to supervise the training of an image clustering model. However, numerous erroneous alignments in a cross-modal pre-training model could produce poor-quality pseudo-labels and degrade clustering performance. To solve the aforementioned issue, we propose a novel \textbf{Multi-level Cross-modal Alignmen…
▽ More
Recently, the cross-modal pretraining model has been employed to produce meaningful pseudo-labels to supervise the training of an image clustering model. However, numerous erroneous alignments in a cross-modal pre-training model could produce poor-quality pseudo-labels and degrade clustering performance. To solve the aforementioned issue, we propose a novel \textbf{Multi-level Cross-modal Alignment} method to improve the alignments in a cross-modal pretraining model for downstream tasks, by building a smaller but better semantic space and aligning the images and texts in three levels, i.e., instance-level, prototype-level, and semantic-level. Theoretical results show that our proposed method converges, and suggests effective means to reduce the expected clustering risk of our method. Experimental results on five benchmark datasets clearly show the superiority of our new method.
△ Less
Submitted 22 January, 2024;
originally announced January 2024.
-
EEGFormer: Towards Transferable and Interpretable Large-Scale EEG Foundation Model
Authors:
Yuqi Chen,
Kan Ren,
Kaitao Song,
Yansen Wang,
Yifan Wang,
Dongsheng Li,
Lili Qiu
Abstract:
Self-supervised learning has emerged as a highly effective approach in the fields of natural language processing and computer vision. It is also applicable to brain signals such as electroencephalography (EEG) data, given the abundance of available unlabeled data that exist in a wide spectrum of real-world medical applications ranging from seizure detection to wave analysis. The existing works lev…
▽ More
Self-supervised learning has emerged as a highly effective approach in the fields of natural language processing and computer vision. It is also applicable to brain signals such as electroencephalography (EEG) data, given the abundance of available unlabeled data that exist in a wide spectrum of real-world medical applications ranging from seizure detection to wave analysis. The existing works leveraging self-supervised learning on EEG modeling mainly focus on pretraining upon each individual dataset corresponding to a single downstream task, which cannot leverage the power of abundant data, and they may derive sub-optimal solutions with a lack of generalization. Moreover, these methods rely on end-to-end model learning which is not easy for humans to understand. In this paper, we present a novel EEG foundation model, namely EEGFormer, pretrained on large-scale compound EEG data. The pretrained model cannot only learn universal representations on EEG signals with adaptable performance on various downstream tasks but also provide interpretable outcomes of the useful patterns within the data. To validate the effectiveness of our model, we extensively evaluate it on various downstream tasks and assess the performance under different transfer settings. Furthermore, we demonstrate how the learned model exhibits transferable anomaly detection performance and provides valuable interpretability of the acquired patterns via self-supervised learning.
△ Less
Submitted 11 January, 2024;
originally announced January 2024.
-
Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training
Authors:
Longtian Qiu,
Shan Ning,
Xuming He
Abstract:
Image captioning aims at generating descriptive and meaningful textual descriptions of images, enabling a broad range of vision-language applications. Prior works have demonstrated that harnessing the power of Contrastive Image Language Pre-training (CLIP) offers a promising approach to achieving zero-shot captioning, eliminating the need for expensive caption annotations. However, the widely obse…
▽ More
Image captioning aims at generating descriptive and meaningful textual descriptions of images, enabling a broad range of vision-language applications. Prior works have demonstrated that harnessing the power of Contrastive Image Language Pre-training (CLIP) offers a promising approach to achieving zero-shot captioning, eliminating the need for expensive caption annotations. However, the widely observed modality gap in the latent space of CLIP harms the performance of zero-shot captioning by breaking the alignment between paired image-text features. To address this issue, we conduct an analysis on the CLIP latent space which leads to two findings. Firstly, we observe that the CLIP's visual feature of image subregions can achieve closer proximity to the paired caption due to the inherent information loss in text descriptions. In addition, we show that the modality gap between a paired image-text can be empirically modeled as a zero-mean Gaussian distribution. Motivated by the findings, we propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap. In particular, we introduce a subregion feature aggregation to leverage local region information, which produces a compact visual representation for matching text representation. Moreover, we incorporate a noise injection and CLIP reranking strategy to boost captioning performance. We also extend our framework to build a zero-shot VQA pipeline, demonstrating its generality. Through extensive experiments on common captioning and VQA datasets such as MSCOCO, Flickr30k and VQAV2, we show that our method achieves remarkable performance improvements. Code is available at https://github.com/Artanic30/MacCap.
△ Less
Submitted 4 January, 2024;
originally announced January 2024.
-
Towards Fine-Grained Explainability for Heterogeneous Graph Neural Network
Authors:
Tong Li,
Jiale Deng,
Yanyan Shen,
Luyu Qiu,
Yongxiang Huang,
Caleb Chen Cao
Abstract:
Heterogeneous graph neural networks (HGNs) are prominent approaches to node classification tasks on heterogeneous graphs. Despite the superior performance, insights about the predictions made from HGNs are obscure to humans. Existing explainability techniques are mainly proposed for GNNs on homogeneous graphs. They focus on highlighting salient graph objects to the predictions whereas the problem…
▽ More
Heterogeneous graph neural networks (HGNs) are prominent approaches to node classification tasks on heterogeneous graphs. Despite the superior performance, insights about the predictions made from HGNs are obscure to humans. Existing explainability techniques are mainly proposed for GNNs on homogeneous graphs. They focus on highlighting salient graph objects to the predictions whereas the problem of how these objects affect the predictions remains unsolved. Given heterogeneous graphs with complex structures and rich semantics, it is imperative that salient objects can be accompanied with their influence paths to the predictions, unveiling the reasoning process of HGNs. In this paper, we develop xPath, a new framework that provides fine-grained explanations for black-box HGNs specifying a cause node with its influence path to the target node. In xPath, we differentiate the influence of a node on the prediction w.r.t. every individual influence path, and measure the influence by perturbing graph structure via a novel graph rewiring algorithm. Furthermore, we introduce a greedy search algorithm to find the most influential fine-grained explanations efficiently. Empirical results on various HGNs and heterogeneous graphs show that xPath yields faithful explanations efficiently, outperforming the adaptations of advanced GNN explanation approaches.
△ Less
Submitted 23 December, 2023;
originally announced December 2023.
-
A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise
Authors:
Chaoyou Fu,
Renrui Zhang,
Zihan Wang,
Yubo Huang,
Zhengye Zhang,
Longtian Qiu,
Gaoxiang Ye,
Yunhang Shen,
Mengdan Zhang,
Peixian Chen,
Sirui Zhao,
Shaohui Lin,
Deqiang Jiang,
Di Yin,
Peng Gao,
Ke Li,
Hongsheng Li,
Xing Sun
Abstract:
The surge of interest towards Multi-modal Large Language Models (MLLMs), e.g., GPT-4V(ision) from OpenAI, has marked a significant trend in both academia and industry. They endow Large Language Models (LLMs) with powerful capabilities in visual understanding, enabling them to tackle diverse multi-modal tasks. Very recently, Google released Gemini, its newest and most capable MLLM built from the gr…
▽ More
The surge of interest towards Multi-modal Large Language Models (MLLMs), e.g., GPT-4V(ision) from OpenAI, has marked a significant trend in both academia and industry. They endow Large Language Models (LLMs) with powerful capabilities in visual understanding, enabling them to tackle diverse multi-modal tasks. Very recently, Google released Gemini, its newest and most capable MLLM built from the ground up for multi-modality. In light of the superior reasoning capabilities, can Gemini challenge GPT-4V's leading position in multi-modal learning? In this paper, we present a preliminary exploration of Gemini Pro's visual understanding proficiency, which comprehensively covers four domains: fundamental perception, advanced cognition, challenging vision tasks, and various expert capacities. We compare Gemini Pro with the state-of-the-art GPT-4V to evaluate its upper limits, along with the latest open-sourced MLLM, Sphinx, which reveals the gap between manual efforts and black-box systems. The qualitative samples indicate that, while GPT-4V and Gemini showcase different answering styles and preferences, they can exhibit comparable visual reasoning capabilities, and Sphinx still trails behind them concerning domain generalizability. Specifically, GPT-4V tends to elaborate detailed explanations and intermediate steps, and Gemini prefers to output a direct and concise answer. The quantitative evaluation on the popular MME benchmark also demonstrates the potential of Gemini to be a strong challenger to GPT-4V. Our early investigation of Gemini also observes some common issues of MLLMs, indicating that there still remains a considerable distance towards artificial general intelligence. Our project for tracking the progress of MLLM is released at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.
△ Less
Submitted 20 December, 2023; v1 submitted 19 December, 2023;
originally announced December 2023.
-
High-Order Structure Based Middle-Feature Learning for Visible-Infrared Person Re-Identification
Authors:
Liuxiang Qiu,
Si Chen,
Yan Yan,
Jing-Hao Xue,
Da-Han Wang,
Shunzhi Zhu
Abstract:
Visible-infrared person re-identification (VI-ReID) aims to retrieve images of the same persons captured by visible (VIS) and infrared (IR) cameras. Existing VI-ReID methods ignore high-order structure information of features while being relatively difficult to learn a reasonable common feature space due to the large modality discrepancy between VIS and IR images. To address the above problems, we…
▽ More
Visible-infrared person re-identification (VI-ReID) aims to retrieve images of the same persons captured by visible (VIS) and infrared (IR) cameras. Existing VI-ReID methods ignore high-order structure information of features while being relatively difficult to learn a reasonable common feature space due to the large modality discrepancy between VIS and IR images. To address the above problems, we propose a novel high-order structure based middle-feature learning network (HOS-Net) for effective VI-ReID. Specifically, we first leverage a short- and long-range feature extraction (SLE) module to effectively exploit both short-range and long-range features. Then, we propose a high-order structure learning (HSL) module to successfully model the high-order relationship across different local features of each person image based on a whitened hypergraph network.This greatly alleviates model collapse and enhances feature representations. Finally, we develop a common feature space learning (CFL) module to learn a discriminative and reasonable common feature space based on middle features generated by aligning features from different modalities and ranges. In particular, a modality-range identity-center contrastive (MRIC) loss is proposed to reduce the distances between the VIS, IR, and middle features, smoothing the training process. Extensive experiments on the SYSU-MM01, RegDB, and LLCM datasets show that our HOS-Net achieves superior state-of-the-art performance. Our code is available at \url{https://github.com/Jaulaucoeng/HOS-Net}.
△ Less
Submitted 13 December, 2023; v1 submitted 12 December, 2023;
originally announced December 2023.
-
Secure Cell-Free Integrated Sensing and Communication in the Presence of Information and Sensing Eavesdroppers
Authors:
Zixiang Ren,
Jie Xu,
Ling Qiu,
Derrick Wing Kwan Ng
Abstract:
This paper studies a secure cell-free integrated sensing and communication (ISAC) system, in which multiple ISAC transmitters collaboratively send confidential information to multiple communication users (CUs) and concurrently conduct target detection. Different from prior works investigating communication security against potential information eavesdropping, we consider the security of both commu…
▽ More
This paper studies a secure cell-free integrated sensing and communication (ISAC) system, in which multiple ISAC transmitters collaboratively send confidential information to multiple communication users (CUs) and concurrently conduct target detection. Different from prior works investigating communication security against potential information eavesdropping, we consider the security of both communication and sensing in the presence of both information and sensing eavesdroppers that aim to intercept confidential communication information and extract target information, respectively. Towards this end, we optimize the joint information and sensing transmit beamforming at these ISAC transmitters for secure cell-free ISAC. Our objective is to maximize the detection probability over a designated sensing area while ensuring the minimum signal-to-interference-plus-noise-ratio (SINR) requirements at CUs. Our formulation also takes into account the maximum tolerable signal-to-noise ratio (SNR) at information eavesdroppers for ensuring the confidentiality of information transmission, and the maximum detection probability constraints at sensing eavesdroppers for preserving sensing privacy. The formulated secure joint transmit beamforming problem is highly non-convex due to the intricate interplay between the detection probabilities, beamforming vectors, and SINR constraints. Fortunately, through strategic manipulation and via applying the semidefinite relaxation (SDR) technique, we successfully obtain the globally optimal solution to the design problem by rigorously verifying the tightness of SDR. Furthermore, we present two alternative joint beamforming designs based on the sensing SNR maximization over the specific sensing area and the coordinated beamforming, respectively. Numerical results reveal the benefits of our proposed design over these alternative benchmarks.
△ Less
Submitted 7 December, 2023;
originally announced December 2023.
-
MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human Captures
Authors:
Zhangyang Xiong,
Chenghong Li,
Kenkun Liu,
Hongjie Liao,
Jianqiao Hu,
Junyi Zhu,
Shuliang Ning,
Lingteng Qiu,
Chongjie Wang,
Shijie Wang,
Shuguang Cui,
Xiaoguang Han
Abstract:
In this era, the success of large language models and text-to-image models can be attributed to the driving force of large-scale datasets. However, in the realm of 3D vision, while remarkable progress has been made with models trained on large-scale synthetic and real-captured object data like Objaverse and MVImgNet, a similar level of progress has not been observed in the domain of human-centric…
▽ More
In this era, the success of large language models and text-to-image models can be attributed to the driving force of large-scale datasets. However, in the realm of 3D vision, while remarkable progress has been made with models trained on large-scale synthetic and real-captured object data like Objaverse and MVImgNet, a similar level of progress has not been observed in the domain of human-centric tasks partially due to the lack of a large-scale human dataset. Existing datasets of high-fidelity 3D human capture continue to be mid-sized due to the significant challenges in acquiring large-scale high-quality 3D human data. To bridge this gap, we present MVHumanNet, a dataset that comprises multi-view human action sequences of 4,500 human identities. The primary focus of our work is on collecting human data that features a large number of diverse identities and everyday clothing using a multi-view human capture system, which facilitates easily scalable data collection. Our dataset contains 9,000 daily outfits, 60,000 motion sequences and 645 million frames with extensive annotations, including human masks, camera parameters, 2D and 3D keypoints, SMPL/SMPLX parameters, and corresponding textual descriptions. To explore the potential of MVHumanNet in various 2D and 3D visual tasks, we conducted pilot studies on view-consistent action recognition, human NeRF reconstruction, text-driven view-unconstrained human image generation, as well as 2D view-unconstrained human image and 3D avatar generation. Extensive experiments demonstrate the performance improvements and effective applications enabled by the scale provided by MVHumanNet. As the current largest-scale 3D human dataset, we hope that the release of MVHumanNet data with annotations will foster further innovations in the domain of 3D human-centric tasks at scale.
△ Less
Submitted 5 December, 2023;
originally announced December 2023.
-
SAMPro3D: Locating SAM Prompts in 3D for Zero-Shot Scene Segmentation
Authors:
Mutian Xu,
Xingyilang Yin,
Lingteng Qiu,
Yang Liu,
Xin Tong,
Xiaoguang Han
Abstract:
We introduce SAMPro3D for zero-shot 3D indoor scene segmentation. Given the 3D point cloud and multiple posed 2D frames of 3D scenes, our approach segments 3D scenes by applying the pretrained Segment Anything Model (SAM) to 2D frames. Our key idea involves locating 3D points in scenes as natural 3D prompts to align their projected pixel prompts across frames, ensuring frame-consistency in both pi…
▽ More
We introduce SAMPro3D for zero-shot 3D indoor scene segmentation. Given the 3D point cloud and multiple posed 2D frames of 3D scenes, our approach segments 3D scenes by applying the pretrained Segment Anything Model (SAM) to 2D frames. Our key idea involves locating 3D points in scenes as natural 3D prompts to align their projected pixel prompts across frames, ensuring frame-consistency in both pixel prompts and their SAM-predicted masks. Moreover, we suggest filtering out low-quality 3D prompts based on feedback from all 2D frames, for enhancing segmentation quality. We also propose to consolidate different 3D prompts if they are segmenting the same object, bringing a more comprehensive segmentation. Notably, our method does not require any additional training on domain-specific data, enabling us to preserve the zero-shot power of SAM. Extensive qualitative and quantitative results show that our method consistently achieves higher quality and more diverse segmentation than previous zero-shot or fully supervised approaches, and in many cases even surpasses human-level annotations. The project page can be accessed at https://mutianxu.github.io/sampro3d/.
△ Less
Submitted 29 November, 2023;
originally announced November 2023.
-
RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D
Authors:
Lingteng Qiu,
Guanying Chen,
Xiaodong Gu,
Qi Zuo,
Mutian Xu,
Yushuang Wu,
Weihao Yuan,
Zilong Dong,
Liefeng Bo,
Xiaoguang Han
Abstract:
Lifting 2D diffusion for 3D generation is a challenging problem due to the lack of geometric prior and the complex entanglement of materials and lighting in natural images. Existing methods have shown promise by first creating the geometry through score-distillation sampling (SDS) applied to rendered surface normals, followed by appearance modeling. However, relying on a 2D RGB diffusion model to…
▽ More
Lifting 2D diffusion for 3D generation is a challenging problem due to the lack of geometric prior and the complex entanglement of materials and lighting in natural images. Existing methods have shown promise by first creating the geometry through score-distillation sampling (SDS) applied to rendered surface normals, followed by appearance modeling. However, relying on a 2D RGB diffusion model to optimize surface normals is suboptimal due to the distribution discrepancy between natural images and normals maps, leading to instability in optimization. In this paper, recognizing that the normal and depth information effectively describe scene geometry and be automatically estimated from images, we propose to learn a generalizable Normal-Depth diffusion model for 3D generation. We achieve this by training on the large-scale LAION dataset together with the generalizable image-to-depth and normal prior models. In an attempt to alleviate the mixed illumination effects in the generated materials, we introduce an albedo diffusion model to impose data-driven constraints on the albedo component. Our experiments show that when integrated into existing text-to-3D pipelines, our models significantly enhance the detail richness, achieving state-of-the-art results. Our project page is https://aigc3d.github.io/richdreamer/.
△ Less
Submitted 24 December, 2023; v1 submitted 28 November, 2023;
originally announced November 2023.
-
Unified Medical Image Pre-training in Language-Guided Common Semantic Space
Authors:
Xiaoxuan He,
Yifan Yang,
Xinyang Jiang,
Xufang Luo,
Haoji Hu,
Siyun Zhao,
Dongsheng Li,
Yuqing Yang,
Lili Qiu
Abstract:
Vision-Language Pre-training (VLP) has shown the merits of analysing medical images, by leveraging the semantic congruence between medical images and their corresponding reports. It efficiently learns visual representations, which in turn facilitates enhanced analysis and interpretation of intricate imaging data. However, such observation is predominantly justified on single-modality data (mostly…
▽ More
Vision-Language Pre-training (VLP) has shown the merits of analysing medical images, by leveraging the semantic congruence between medical images and their corresponding reports. It efficiently learns visual representations, which in turn facilitates enhanced analysis and interpretation of intricate imaging data. However, such observation is predominantly justified on single-modality data (mostly 2D images like X-rays), adapting VLP to learning unified representations for medical images in real scenario remains an open challenge. This arises from medical images often encompass a variety of modalities, especially modalities with different various number of dimensions (e.g., 3D images like Computed Tomography). To overcome the aforementioned challenges, we propose an Unified Medical Image Pre-training framework, namely UniMedI, which utilizes diagnostic reports as common semantic space to create unified representations for diverse modalities of medical images (especially for 2D and 3D images). Under the text's guidance, we effectively uncover visual modality information, identifying the affected areas in 2D X-rays and slices containing lesion in sophisticated 3D CT scans, ultimately enhancing the consistency across various medical imaging modalities. To demonstrate the effectiveness and versatility of UniMedI, we evaluate its performance on both 2D and 3D images across 10 different datasets, covering a wide range of medical image tasks such as classification, segmentation, and retrieval. UniMedI has demonstrated superior performance in downstream tasks, showcasing its effectiveness in establishing a universal medical visual representation.
△ Less
Submitted 24 November, 2023;
originally announced November 2023.
-
Enhancing Uncertainty-Based Hallucination Detection with Stronger Focus
Authors:
Tianhang Zhang,
Lin Qiu,
Qipeng Guo,
Cheng Deng,
Yue Zhang,
Zheng Zhang,
Chenghu Zhou,
Xinbing Wang,
Luoyi Fu
Abstract:
Large Language Models (LLMs) have gained significant popularity for their impressive performance across diverse fields. However, LLMs are prone to hallucinate untruthful or nonsensical outputs that fail to meet user expectations in many real-world applications. Existing works for detecting hallucinations in LLMs either rely on external knowledge for reference retrieval or require sampling multiple…
▽ More
Large Language Models (LLMs) have gained significant popularity for their impressive performance across diverse fields. However, LLMs are prone to hallucinate untruthful or nonsensical outputs that fail to meet user expectations in many real-world applications. Existing works for detecting hallucinations in LLMs either rely on external knowledge for reference retrieval or require sampling multiple responses from the LLM for consistency verification, making these methods costly and inefficient. In this paper, we propose a novel reference-free, uncertainty-based method for detecting hallucinations in LLMs. Our approach imitates human focus in factuality checking from three aspects: 1) focus on the most informative and important keywords in the given text; 2) focus on the unreliable tokens in historical context which may lead to a cascade of hallucinations; and 3) focus on the token properties such as token type and token frequency. Experimental results on relevant datasets demonstrate the effectiveness of our proposed method, which achieves state-of-the-art performance across all the evaluation metrics and eliminates the need for additional information.
△ Less
Submitted 22 November, 2023;
originally announced November 2023.
-
Adversarial Prompt Tuning for Vision-Language Models
Authors:
Jiaming Zhang,
Xingjun Ma,
Xin Wang,
Lingyu Qiu,
Jiaqi Wang,
Yu-Gang Jiang,
Jitao Sang
Abstract:
With the rapid advancement of multimodal learning, pre-trained Vision-Language Models (VLMs) such as CLIP have demonstrated remarkable capacities in bridging the gap between visual and language modalities. However, these models remain vulnerable to adversarial attacks, particularly in the image modality, presenting considerable security risks. This paper introduces Adversarial Prompt Tuning (AdvPT…
▽ More
With the rapid advancement of multimodal learning, pre-trained Vision-Language Models (VLMs) such as CLIP have demonstrated remarkable capacities in bridging the gap between visual and language modalities. However, these models remain vulnerable to adversarial attacks, particularly in the image modality, presenting considerable security risks. This paper introduces Adversarial Prompt Tuning (AdvPT), a novel technique to enhance the adversarial robustness of image encoders in VLMs. AdvPT innovatively leverages learnable text prompts and aligns them with adversarial image embeddings, to address the vulnerabilities inherent in VLMs without the need for extensive parameter training or modification of the model architecture. We demonstrate that AdvPT improves resistance against white-box and black-box adversarial attacks and exhibits a synergistic effect when combined with existing image-processing-based defense techniques, further boosting defensive capabilities. Comprehensive experimental analyses provide insights into adversarial prompt tuning, a novel paradigm devoted to improving resistance to adversarial images through textual input modifications, paving the way for future robust multimodal learning research. These findings open up new possibilities for enhancing the security of VLMs. Our code is available at https://github.com/jiamingzhang94/Adversarial-Prompt-Tuning.
△ Less
Submitted 25 December, 2023; v1 submitted 19 November, 2023;
originally announced November 2023.
-
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models
Authors:
Ziyi Lin,
Chris Liu,
Renrui Zhang,
Peng Gao,
Longtian Qiu,
Han Xiao,
Han Qiu,
Chen Lin,
Wenqi Shao,
Keqin Chen,
Jiaming Han,
Siyuan Huang,
Yichi Zhang,
Xuming He,
Hongsheng Li,
Yu Qiao
Abstract:
We present SPHINX, a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings. First, for stronger vision-language alignment, we unfreeze the large language model (LLM) during pre-training, and introduce a weight mix strategy between LLMs trained by real-world and synthetic data. By directly integrating the weights from two domains…
▽ More
We present SPHINX, a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings. First, for stronger vision-language alignment, we unfreeze the large language model (LLM) during pre-training, and introduce a weight mix strategy between LLMs trained by real-world and synthetic data. By directly integrating the weights from two domains, the mixed LLM can efficiently incorporate diverse semantics with favorable robustness. Then, to enable multi-purpose capabilities, we mix a variety of tasks for joint visual instruction tuning, and design task-specific instructions to avoid inter-task conflict. In addition to the basic visual question answering, we include more challenging tasks such as region-level understanding, caption grounding, document layout detection, and human pose estimation, contributing to mutual enhancement over different scenarios. Additionally, we propose to extract comprehensive visual embeddings from various network architectures, pre-training paradigms, and information granularity, providing language models with more robust image representations. Based on our proposed joint mixing, SPHINX exhibits superior multi-modal understanding capabilities on a wide range of applications. On top of this, we further propose an efficient strategy aiming to better capture fine-grained appearances of high-resolution images. With a mixing of different scales and high-resolution sub-images, SPHINX attains exceptional visual parsing and reasoning performance on existing evaluation benchmarks. We hope our work may cast a light on the exploration of joint mixing in future MLLM research. Code is released at https://github.com/Alpha-VLLM/LLaMA2-Accessory.
△ Less
Submitted 13 November, 2023;
originally announced November 2023.
-
Sensing-Assisted Sparse Channel Recovery for Massive Antenna Systems
Authors:
Zixiang Ren,
Ling Qiu,
Jie Xu,
Derrick Wing Kwan Ng
Abstract:
This correspondence presents a novel sensing-assisted sparse channel recovery approach for massive antenna wireless communication systems. We focus on a fundamental configuration with one massive-antenna base station (BS) and one single-antenna communication user (CU). The wireless channel exhibits sparsity and consists of multiple paths associated with scatterers detectable via radar sensing. Und…
▽ More
This correspondence presents a novel sensing-assisted sparse channel recovery approach for massive antenna wireless communication systems. We focus on a fundamental configuration with one massive-antenna base station (BS) and one single-antenna communication user (CU). The wireless channel exhibits sparsity and consists of multiple paths associated with scatterers detectable via radar sensing. Under this setup, the BS first sends downlink pilots to the CU and concurrently receives the echo pilot signals for sensing the surrounding scatterers. Subsequently, the CU sends feedback information on its received pilot signal to the BS. Accordingly, the BS determines the sparse basis based on the sensed scatterers and proceeds to recover the wireless channel, exploiting the feedback information based on advanced compressive sensing (CS) algorithms. Numerical results show that the proposed sensing-assisted approach significantly increases the overall achievable rate than the conventional design relying on a discrete Fourier transform (DFT)-based sparse basis without sensing, thanks to the reduced training overhead and enhanced recovery accuracy with limited feedback.
△ Less
Submitted 10 November, 2023;
originally announced November 2023.
-
Joint Angle and Delay Cramér-Rao Bound Optimization for Integrated Sensing and Communications
Authors:
Chao Hu,
Yuan Fang,
Ling Qiu
Abstract:
In this paper, we study a multi-input multi-output (MIMO) beamforming design in an integrated sensing and communication (ISAC) system, in which an ISAC base station (BS) is used to communicate with multiple downlink users and simultaneously the communication signals are reused for sensing multiple targets. Our interested sensing parameters are the angle and delay information of the targets, which…
▽ More
In this paper, we study a multi-input multi-output (MIMO) beamforming design in an integrated sensing and communication (ISAC) system, in which an ISAC base station (BS) is used to communicate with multiple downlink users and simultaneously the communication signals are reused for sensing multiple targets. Our interested sensing parameters are the angle and delay information of the targets, which can be used to locate these targets. Under this consideration, we first derive the Cramér-Rao bound (CRB) for angle and delay estimation. Then, we optimize the transmit beamforming at the BS to minimize the CRB, subject to communication rate and power constraints. In particular, we obtain the optimal solution in closed-form in the case of single-target and single-user, and in the case of multi-target and multi-user scenario, the sparsity of the optimal solution is proven, leading to a reduction in computational complexity during optimization. The numerical results demonstrate that the optimized beamforming yields excellent positioning performance and effectively reduces the requirement for a large number of antennas at the BS.
△ Less
Submitted 13 November, 2023; v1 submitted 9 November, 2023;
originally announced November 2023.
-
StoryAnalogy: Deriving Story-level Analogies from Large Language Models to Unlock Analogical Understanding
Authors:
Cheng Jiayang,
Lin Qiu,
Tsz Ho Chan,
Tianqing Fang,
Weiqi Wang,
Chunkit Chan,
Dongyu Ru,
Qipeng Guo,
Hongming Zhang,
Yangqiu Song,
Yue Zhang,
Zheng Zhang
Abstract:
Analogy-making between narratives is crucial for human reasoning. In this paper, we evaluate the ability to identify and generate analogies by constructing a first-of-its-kind large-scale story-level analogy corpus, \textsc{StoryAnalogy}, which contains 24K story pairs from diverse domains with human annotations on two similarities from the extended Structure-Mapping Theory. We design a set of tes…
▽ More
Analogy-making between narratives is crucial for human reasoning. In this paper, we evaluate the ability to identify and generate analogies by constructing a first-of-its-kind large-scale story-level analogy corpus, \textsc{StoryAnalogy}, which contains 24K story pairs from diverse domains with human annotations on two similarities from the extended Structure-Mapping Theory. We design a set of tests on \textsc{StoryAnalogy}, presenting the first evaluation of story-level analogy identification and generation. Interestingly, we find that the analogy identification tasks are incredibly difficult not only for sentence embedding models but also for the recent large language models (LLMs) such as ChatGPT and LLaMa. ChatGPT, for example, only achieved around 30% accuracy in multiple-choice questions (compared to over 85% accuracy for humans). Furthermore, we observe that the data in \textsc{StoryAnalogy} can improve the quality of analogy generation in LLMs, where a fine-tuned FlanT5-xxl model achieves comparable performance to zero-shot ChatGPT.
△ Less
Submitted 23 October, 2023; v1 submitted 19 October, 2023;
originally announced October 2023.