-
V2I-Calib: A Novel Calibration Approach for Collaborative Vehicle and Infrastructure LiDAR Systems
Authors:
Qianxin Qu,
Yijin Xiong,
Xin Wu,
Hanyu Li,
Shichun Guo
Abstract:
Cooperative vehicle and infrastructure LiDAR systems hold great potential, yet their implementation faces numerous challenges. Calibration of LiDAR systems across heterogeneous vehicle and infrastructure endpoints is a critical step to ensure the accuracy and consistency of perception system data, necessitating calibration methods that are real-time and stable. To this end, this paper introduces a novel calibration method for cooperative vehicle and road infrastructure LiDAR systems, which exploits spatial association information between detection boxes. The method centers around a novel Overall IoU metric that reflects the correlation of targets between vehicle and infrastructure, enabling real-time monitoring of calibration results. We search for common matching boxes between vehicle and infrastructure nodes by constructing an affinity matrix. Subsequently, these matching boxes undergo extrinsic parameter computation and optimization. Comparative and ablation experiments on the DAIR-V2X dataset confirm the superiority of our method. To better reflect the differences in calibration results, we have categorized the calibration tasks on the DAIR-V2X dataset based on their level of difficulty, enriching the dataset's utility for future research. Our project is available at https://github.com/MassimoQu/v2i-calib .
Submitted 14 July, 2024;
originally announced July 2024.
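The abstract above does not define its Overall IoU metric, so the following is only a minimal sketch of the general idea of scoring a candidate calibration by box overlap. It assumes axis-aligned 2D boxes in bird's-eye view, a pure translation in place of a full extrinsic transform, and a plain mean over matched pairs; the names `iou_2d` and `mean_matched_iou` are hypothetical and not taken from the paper or its repository.

```python
def iou_2d(a, b):
    """IoU of two axis-aligned boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_matched_iou(vehicle_boxes, infra_boxes, matches, shift=(0.0, 0.0)):
    """Mean IoU over matched (vehicle_idx, infra_idx) pairs.

    `shift` is a stand-in for a candidate extrinsic: it translates every
    infrastructure box before the overlap is measured (assumption: the real
    method uses a full rigid transform on 3D boxes).
    """
    if not matches:
        return 0.0
    dx, dy = shift
    total = 0.0
    for vi, ii in matches:
        x0, y0, x1, y1 = infra_boxes[ii]
        moved = (x0 + dx, y0 + dy, x1 + dx, y1 + dy)
        total += iou_2d(vehicle_boxes[vi], moved)
    return total / len(matches)


vehicle = [(0.0, 0.0, 2.0, 2.0), (5.0, 5.0, 6.0, 6.0)]
infra = [(1.0, 0.0, 3.0, 2.0), (6.0, 5.0, 7.0, 6.0)]
pairs = [(0, 0), (1, 1)]

score_uncalibrated = mean_matched_iou(vehicle, infra, pairs)
score_calibrated = mean_matched_iou(vehicle, infra, pairs, shift=(-1.0, 0.0))
```

In this toy setup the score rises when the shift aligns the two sets of boxes, which is the sense in which an overlap score can monitor calibration quality.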
-
Quantum Clock Synchronization Network with Silicon-chip Dual-Pumped Entangled Photon Source
Authors:
J. A. Li,
H. Han,
X. P. Huang,
B. Y. Tang,
K. Guo,
J. Q. Huang,
S. Y. Xiong,
W. R. Yu,
Z. J. Zhang,
J. B. Yang,
B. Liu,
H. Chen,
Z. K. Lu
Abstract:
In this paper, we propose a quantum clock synchronization (QCS) network scheme with a silicon-chip dual-pumped entangled photon source. This scheme couples two pump beams into a silicon-based waveguide, where degenerate and non-degenerate spontaneous four-wave mixing (SFWM) occur, generating entanglement between one signal channel and three idler channels. The entangled photons are distributed to remote users through a wavelength division multiplexing strategy to construct an entanglement distribution network, and round-trip QCS is adopted to realize a QCS network that can serve multiple users. A proof-of-principle QCS network experiment is implemented among the server and multiple users (Alice, Bob, and Charlie) for 11.1 hours, where Alice and Charlie are 10 km away from the server and Bob is 25 km away from the server. The lowest time deviations (TDEV) between the server and each user (Alice, Bob, and Charlie) are 1.57 ps, 0.82 ps and 2.57 ps at averaging times of 8000 s, 8000 s and 800 s, respectively. The results show that the proposed QCS network scheme with a dual-pumped SFWM photon source achieves high accuracy, and the channel resources used by n users are reduced by about 30% compared with other round-trip QCS schemes.
Submitted 13 July, 2024;
originally announced July 2024.
-
MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts
Authors:
Zhenpeng Su,
Zijia Lin,
Xue Bai,
Xing Wu,
Yizhe Xiong,
Haoran Lian,
Guangyuan Ma,
Hui Chen,
Guiguang Ding,
Wei Zhou,
Songlin Hu
Abstract:
Scaling model capacity enhances its capabilities but significantly increases computation. Mixture-of-Experts models (MoEs) address this by allowing model capacity to scale without substantially increasing training or inference costs. Despite their promising results, MoE models encounter several challenges. Primarily, the dispersion of training tokens across multiple experts can lead to underfitting, particularly for infrequent tokens. Additionally, while fixed routing mechanisms can mitigate this issue, they compromise on the diversity of representations. In this paper, we propose MaskMoE, a method designed to enhance token-level learning by employing a routing masking technique within the Mixture-of-Experts model. MaskMoE is capable of maintaining representation diversity while achieving more comprehensive training. Experimental results demonstrate that our method outperforms previous dominant Mixture-of-Experts models in both perplexity (PPL) and downstream tasks.
Submitted 13 July, 2024;
originally announced July 2024.
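The MaskMoE abstract above mentions a routing mask but gives no construction, so the sketch below only illustrates the generic mechanism of masking router logits before top-1 expert selection. The function name `masked_top1_route` and the idea of giving a rare token a single visible expert while a frequent token sees several are assumptions made for illustration, not details confirmed by the abstract.

```python
import math


def masked_top1_route(logits, mask):
    """Pick one expert after hiding the experts whose mask entry is falsy.

    Returns (expert_index, gate_probability), where the probability comes
    from a softmax taken over the visible experts only.
    """
    if not any(mask):
        raise ValueError("at least one expert must be visible")
    masked = [l if m else -math.inf for l, m in zip(logits, mask)]
    peak = max(masked)
    # exp(-inf) evaluates to 0.0, so hidden experts get zero probability.
    weights = [math.exp(l - peak) for l in masked]
    total = sum(weights)
    probs = [w / total for w in weights]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


router_logits = [2.0, 0.5, 1.0, 3.0]

# Hypothetical frequent token: two experts visible, the router chooses.
frequent_choice = masked_top1_route(router_logits, [1, 0, 1, 0])

# Hypothetical rare token: one expert visible, so routing is fixed.
rare_choice = masked_top1_route(router_logits, [0, 1, 0, 0])
```

With a single visible expert the token always reaches the same expert, which concentrates its training signal. With several visible experts the router keeps some freedom, which is one way to read the abstract's trade-off between fuller training and representation diversity.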
-
Exploring Differences between Two Decades of Mental Health Related Emergency Department Visits by Youth via Recurrent Events Analyses
Authors:
Yi Xiong,
Joan Hu,
Rhonda Rosychuk
Abstract:
We aim to develop a tool for understanding how the mental health of youth aged less than 18 years evolves over time through administrative records of mental health related emergency department (MHED) visits over two decades. Administrative health data usually contain rich information for investigating public health issues; however, many restrictions and regulations apply to their use. Moreover, the data are usually not in a conventional format since administrative databases are created and maintained to serve non-research purposes and only information for people who seek health services is accessible. Analysis of administrative health data is thus challenging in general. In the MHED data analyses, we are particularly concerned with (i) evaluating dynamic patterns and impacts with doubly-censored recurrent event data, and (ii) re-calibrating estimators developed based on truncated data by leveraging summary statistics from the population. The findings are verified empirically via simulation. We have established the asymptotic properties of the inference procedures. The contributions of this paper are twofold. We present innovative strategies for processing doubly-censored recurrent event data and overcoming the truncation induced by the data collection. In addition, through exploring the pediatric MHED visit records, we provide new insights into changes in children's and youths' mental health over time.
Submitted 12 July, 2024;
originally announced July 2024.
-
OFDM Achieves the Lowest Ranging Sidelobe Under Random ISAC Signaling
Authors:
Fan Liu,
Ying Zhang,
Yifeng Xiong,
Shuangyang Li,
Weijie Yuan,
Feifei Gao,
Shi Jin,
Giuseppe Caire
Abstract:
This paper aims to answer a fundamental question in the area of Integrated Sensing and Communications (ISAC): What is the optimal communication-centric ISAC waveform for ranging? Towards that end, we first established a generic framework to analyze the sensing performance of communication-centric ISAC waveforms built upon orthonormal signaling bases and random data symbols. Then, we evaluated their ranging performance by adopting both the periodic and aperiodic auto-correlation functions (P-ACF and A-ACF), and defined the expectation of the integrated sidelobe level (EISL) as a sensing performance metric. On top of that, we proved that among all communication waveforms with cyclic prefix (CP), the orthogonal frequency division multiplexing (OFDM) modulation is the only globally optimal waveform that achieves the lowest ranging sidelobe for quadrature amplitude modulation (QAM) and phase shift keying (PSK) constellations, in terms of both the EISL and the sidelobe level at each individual lag of the P-ACF. As a step forward, we proved that among all communication waveforms without CP, OFDM is a locally optimal waveform for QAM/PSK in the sense that it achieves a local minimum of the EISL of the A-ACF. Finally, we demonstrated by numerical results that under QAM/PSK constellations, there is no other orthogonal communication-centric waveform that achieves a lower ranging sidelobe level than that of the OFDM, in terms of both P-ACF and A-ACF cases.
Submitted 9 July, 2024;
originally announced July 2024.
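The periodic auto-correlation (P-ACF) claim in the abstract above can be illustrated for one deterministic PSK case. This is a sketch, not the paper's proof or its EISL definition. It assumes a unitary inverse DFT as the OFDM modulator, no noise, and an integrated sidelobe level taken as the sum of squared P-ACF magnitudes at nonzero lags. All function names are illustrative.

```python
import cmath
import math


def idft_unitary(freq_symbols):
    """OFDM modulation: unitary inverse DFT of the frequency-domain symbols."""
    n = len(freq_symbols)
    return [
        sum(X * cmath.exp(2j * math.pi * k * t / n)
            for k, X in enumerate(freq_symbols)) / math.sqrt(n)
        for t in range(n)
    ]


def periodic_acf(x):
    """Periodic auto-correlation r[lag] = sum_t x[t] * conj(x[(t + lag) mod N])."""
    n = len(x)
    return [
        sum(x[t] * x[(t + lag) % n].conjugate() for t in range(n))
        for lag in range(n)
    ]


def integrated_sidelobe_level(x):
    """Sum of squared P-ACF magnitudes over all nonzero lags."""
    return sum(abs(v) ** 2 for v in periodic_acf(x)[1:])


# A fixed BPSK block (unit-modulus symbols), used two ways.
symbols = [complex(s) for s in (1, 1, 1, 1, 1, 1, 1, -1)]

# Single-carrier: the symbols are the time-domain samples themselves.
isl_single_carrier = integrated_sidelobe_level(symbols)

# OFDM: the symbols sit on subcarriers and are sent through the inverse DFT.
isl_ofdm = integrated_sidelobe_level(idft_unitary(symbols))
```

For unit-modulus symbols the P-ACF of the OFDM signal is the DFT of a flat power spectrum, so every nonzero-lag term cancels and `isl_ofdm` is zero up to rounding, while the single-carrier block has a sidelobe of 4 at each of its seven nonzero lags. QAM symbols do not have a flat power spectrum per realization, which is why the abstract states its result in expectation.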
-
MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions
Authors:
Xuan Ju,
Yiming Gao,
Zhaoyang Zhang,
Ziyang Yuan,
Xintao Wang,
Ailing Zeng,
Yu Xiong,
Qiang Xu,
Ying Shan
Abstract:
Sora's high-motion intensity and long consistent videos have significantly impacted the field of video generation, attracting unprecedented attention. However, existing publicly available datasets are inadequate for generating Sora-like videos, as they mainly contain short videos with low motion intensity and brief captions. To address these issues, we propose MiraData, a high-quality video dataset that surpasses previous ones in video duration, caption detail, motion strength, and visual quality. We curate MiraData from diverse, manually selected sources and meticulously process the data to obtain semantically consistent clips. GPT-4V is employed to annotate structured captions, providing detailed descriptions from four different perspectives along with a summarized dense caption. To better assess temporal consistency and motion intensity in video generation, we introduce MiraBench, which enhances existing benchmarks by adding 3D consistency and tracking-based motion strength metrics. MiraBench includes 150 evaluation prompts and 17 metrics covering temporal consistency, motion strength, 3D consistency, visual quality, text-video alignment, and distribution similarity. To demonstrate the utility and effectiveness of MiraData, we conduct experiments using our DiT-based video generation model, MiraDiT. The experimental results on MiraBench demonstrate the superiority of MiraData, especially in motion strength.
Submitted 8 July, 2024;
originally announced July 2024.
-
Extremely Large-Scale Dynamic Metasurface Antennas (XL-DMAs): Near-Field Modeling and Channel Estimation
Authors:
Songjie Yang,
Wanting Lyu,
Boyu Ning,
Yue Xiu,
Youzhi Xiong,
Hua Chen,
Chadi Assi,
Chau Yuen
Abstract:
Dynamic metasurface antennas (DMAs) represent a novel transceiver array architecture for extremely large-scale (XL) communications, offering the advantages of reduced power consumption and lower hardware costs compared to conventional arrays.
This paper focuses on near-field channel estimation for XL-DMAs. We begin by analyzing the near-field characteristics of uniform planar arrays (UPAs) and introducing the Oblong Approx. model. This model decouples elevation-azimuth (EL-AZ) parameters for XL-DMAs, providing an effective means to characterize the near-field effect. It offers simpler mathematical expressions than the second-order Taylor expansion model, all while maintaining negligible model errors for oblong-shaped arrays.
Building on the Oblong Approx. model, we propose an EL-AZ-decoupled estimation framework that involves near- and far-field parameter estimation for AZ/EL and EL/AZ directions, respectively. The former is formulated as a distributed compressive sensing problem, addressed using the proposed off-grid distributed orthogonal least squares algorithm, while the latter involves a straightforward parallelizable search. Crucially, we illustrate the viability of decoupled EL-AZ estimation for near-field UPAs, exhibiting commendable performance and linear complexity correlated with the number of metasurface elements.
Moreover, we design a measurement matrix optimization method with the Lorentzian constraint on DMAs and highlight the estimation performance degradation resulting from this constraint.
Submitted 6 July, 2024;
originally announced July 2024.
-
Non-Hermitian skin effect in arbitrary dimensions: non-Bloch band theory and classification
Authors:
Yuncheng Xiong,
Ze-Yu Xing,
Haiping Hu
Abstract:
Non-Hermitian skin effect (NHSE) is a distinctive phenomenon in non-Hermitian systems, characterized by a significant accumulation of eigenstates at system boundaries. While well-understood in one dimension via non-Bloch band theory, unraveling the NHSE in higher dimensions faces formidable challenges due to the diversity of open boundary conditions or lattice geometries and inevitable numerical errors. Key issues, including higher-dimensional non-Bloch band theory, geometric dependency, spectral convergence and stability, and a complete classification of NHSE, remain elusive. In this work, we address these challenges by presenting a geometry-adaptive non-Bloch band theory in arbitrary dimensions, through the lens of spectral potential. Our formulation accurately determines the energy spectra, density of states, and generalized Brillouin zone for a given geometry in the thermodynamic limit (TDL), revealing their geometric dependencies. Furthermore, we systematically classify the NHSE into critical and non-reciprocal types using net winding numbers. In the critical case, we identify novel scale-free skin modes residing on the boundary. In the nonreciprocal case, the skin modes manifest in various forms, including normal or anomalous corner modes, boundary modes or scale-free modes. We reveal the non-convergence and instability of the non-Bloch spectra in the presence of scale-free modes and attribute it to the non-exchangeability of the zero-perturbation limit and the TDL. The instability drives the energy spectra towards the Amoeba spectra in the critical case. Our findings provide a unified non-Bloch band theory governing the energy spectra, density of states, and generalized Brillouin zone in the TDL, offering a comprehensive understanding of NHSE in arbitrary dimensions.
Submitted 1 July, 2024;
originally announced July 2024.
-
Kramers Nonlinearity in PT Symmetric Magnets
Authors:
Oles Matsyshyn,
Ying Xiong,
Justin C. W. Song
Abstract:
Kramers degeneracies play an essential role in the spectrum of electronic materials. Here we argue that beyond spectral properties, Kramers degeneracy plays a critical role in the nonlinear response of PT symmetric magnets. In particular, we uncover a class of second-order Kramers nonlinearities that only arise in the presence of Kramers degeneracy, vanishing in non-degenerate PT symmetric materials. Kramers nonlinearities depend on a circular dichroism between PT-related Kramers states and enable tracing out the quantum geometry of the degenerate band structure. We find pronounced Kramers nonlinearities in the nonlinear polarization responses of even-layer antiferromagnetic MnBi$_2$Te$_4$ that enable identification of its antiferromagnetic order. This provides a novel means for diagnosing Kramers pairs and addressing the internal Kramers degree of freedom.
Submitted 27 June, 2024;
originally announced June 2024.
-
LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism
Authors:
Diandian Gu,
Peng Sun,
Qinghao Hu,
Ting Huang,
Xun Chen,
Yingtong Xiong,
Guoteng Wang,
Qiaoling Chen,
Shangchun Zhao,
Jiarui Fang,
Yonggang Wen,
Tianwei Zhang,
Xin Jin,
Xuanzhe Liu
Abstract:
Efficiently training LLMs with long sequences is important yet challenged by the massive computation and memory requirements. Sequence parallelism has been proposed to tackle these problems, but existing methods suffer from scalability or efficiency issues. We propose LoongTrain, a novel system to efficiently train LLMs with long sequences at scale. The core of LoongTrain is the 2D-Attention mechanism, which combines both head-parallel and context-parallel techniques to break the scalability constraints while maintaining efficiency. We introduce Double-Ring-Attention and analyze the performance of device placement strategies to further speed up training. We implement LoongTrain with the hybrid ZeRO and Selective Checkpoint++ techniques. Experimental results show that LoongTrain outperforms state-of-the-art baselines, i.e., DeepSpeed-Ulysses and Megatron Context Parallelism, in both end-to-end training speed and scalability, and improves Model FLOPs Utilization (MFU) by up to 2.88x.
Submitted 26 June, 2024;
originally announced June 2024.
-
Diverse Responses in Lattice Thermal Conductivity of $n$-type/$p$-type Semiconductors Driven by Asymmetric Electron-Phonon Interactions
Authors:
Jianshi Sun,
Shouhang Li,
Zhen Tong,
Cheng Shao,
Han Xie,
Meng An,
Chuang Zhang,
Xiongfei Zhu,
Chen Huang,
Yucheng Xiong,
Xiangjun Liu
Abstract:
Accurately assessing the impact of electron-phonon interaction (EPI) on the lattice thermal conductivity of semiconductors is crucial for the thermal management of electronic devices and a unified physical understanding of this issue is highly desired. In this work, we predict the lattice thermal conductivities of typical direct and indirect bandgap semiconductors accounting for EPI based on mode-level first-principles calculations. It is found that EPI has a larger effect on the lattice thermal conductivity of $p$-type doping compared to $n$-type doping in the same semiconductor at high charge carrier concentrations. The stronger EPI in $p$-type doping is attributed to the relatively higher electron density of states caused by the relatively larger $p$-orbital component. Furthermore, EPI has a stronger influence on the lattice thermal conductivity of $n$-type indirect bandgap semiconductors than $n$-type direct bandgap semiconductors. This is attributed to the relatively lower electron density of states in direct bandgap semiconductors stemming from the $s$-orbital component. This work reveals that there exist diverse responses in lattice thermal conductivity of $n$-type/$p$-type semiconductors, which can be attributed to asymmetric EPIs.
Submitted 17 June, 2024;
originally announced June 2024.
-
Towards Adaptive Neighborhood for Advancing Temporal Interaction Graph Modeling
Authors:
Siwei Zhang,
Xi Chen,
Yun Xiong,
Xixi Wu,
Yao Zhang,
Yongrui Fu,
Yinglong Zhao,
Jiawei Zhang
Abstract:
Temporal Graph Networks (TGNs) have demonstrated their remarkable performance in modeling temporal interaction graphs. These works can generate temporal node representations by encoding the surrounding neighborhoods for the target node. However, an inherent limitation of existing TGNs is their reliance on fixed, hand-crafted rules for neighborhood encoding, overlooking the necessity for an adaptive and learnable neighborhood that can accommodate both personalization and temporal evolution across different timestamps. In this paper, we aim to enhance existing TGNs by introducing an adaptive neighborhood encoding mechanism. We present SEAN, a flexible plug-and-play model that can be seamlessly integrated with existing TGNs, effectively boosting their performance. To achieve this, we decompose the adaptive neighborhood encoding process into two phases: (i) representative neighbor selection, and (ii) temporal-aware neighborhood information aggregation. Specifically, we propose the Representative Neighbor Selector component, which automatically pinpoints the most important neighbors for the target node. It offers a tailored understanding of each node's unique surrounding context, facilitating personalization. Subsequently, we propose a Temporal-aware Aggregator, which synthesizes neighborhood aggregation by selectively determining the utilization of aggregation routes and decaying the outdated information, allowing our model to adaptively leverage both the contextually significant and current information during aggregation. We conduct extensive experiments by integrating SEAN into three representative TGNs, evaluating their performance on four public datasets and one financial benchmark dataset introduced in this paper. The results demonstrate that SEAN consistently leads to performance improvements across all models, achieving SOTA performance and exceptional robustness.
Submitted 14 June, 2024;
originally announced June 2024.
-
RetinaGS: Scalable Training for Dense Scene Rendering with Billion-Scale 3D Gaussians
Authors:
Bingling Li,
Shengyi Chen,
Luchao Wang,
Kaimin Liao,
Sijie Yan,
Yuanjun Xiong
Abstract:
In this work, we explore the possibility of training high-parameter 3D Gaussian splatting (3DGS) models on large-scale, high-resolution datasets. We design a general model parallel training method for 3DGS, named RetinaGS, which uses a proper rendering equation and can be applied to any scene and arbitrary distribution of Gaussian primitives. It enables us to explore the scaling behavior of 3DGS in terms of primitive numbers and training resolutions that were difficult to explore before and surpass previous state-of-the-art reconstruction quality. We observe a clear positive trend of increasing visual quality when increasing primitive numbers with our method. We also demonstrate the first attempt at training a 3DGS model with more than one billion primitives on the full MatrixCity dataset that attains a promising visual quality.
Submitted 22 June, 2024; v1 submitted 17 June, 2024;
originally announced June 2024.
-
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs
Authors:
Ziyu Liu,
Tao Chu,
Yuhang Zang,
Xilin Wei,
Xiaoyi Dong,
Pan Zhang,
Zijian Liang,
Yuanjun Xiong,
Yu Qiao,
Dahua Lin,
Jiaqi Wang
Abstract:
Generating natural and meaningful responses to communicate with multi-modal human inputs is a fundamental capability of Large Vision-Language Models (LVLMs). While current open-source LVLMs demonstrate promising performance in simplified scenarios such as single-turn single-image input, they fall short in real-world conversation scenarios such as following instructions in a long context history with multiple turns and multiple images. Existing LVLM benchmarks primarily focus on single-choice questions or short-form responses, which do not adequately assess the capabilities of LVLMs in real-world human-AI interaction applications. Therefore, we introduce MMDU, a comprehensive benchmark, and MMDU-45k, a large-scale instruction tuning dataset, designed to evaluate and improve LVLMs' abilities in multi-turn and multi-image conversations. We employ a clustering algorithm to find the relevant images and textual descriptions from the open-source Wikipedia and construct the question-answer pairs by human annotators with the assistance of the GPT-4o model. MMDU has a maximum of 18k image+text tokens, 20 images, and 27 turns, which is at least 5x longer than previous benchmarks and poses challenges to current LVLMs. Our in-depth analysis of 15 representative LVLMs using MMDU reveals that open-source LVLMs lag behind closed-source counterparts due to limited conversational instruction tuning data. We demonstrate that fine-tuning open-source LVLMs on MMDU-45k significantly addresses this gap, generating longer and more accurate conversations, and improving scores on MMDU and existing benchmarks (MMStar: +1.1%, MathVista: +1.5%, ChartQA: +1.2%). Our contributions pave the way for bridging the gap between current LVLM models and real-world application demands. This project is available at https://github.com/Liuziyu77/MMDU.
Submitted 17 June, 2024;
originally announced June 2024.
-
Comparison of superconducting pairing in doped cuprates and nickelates within an extended Hubbard model
Authors:
Yicheng Xiong,
Hang Ma,
Hongxing Liu,
Runyu Ma,
Tianxing Ma
Abstract:
Within an extended Hubbard model, we investigate the superconducting pairing behavior of infinite-layer nickelate $\mathrm{NdNiO_2}$ and cuprate superconductors by using the determinant quantum Monte Carlo method. Our focus is on comparing their dominant pairing symmetries. The results indicate that the $d_{x^2-y^2}$ pairing interaction is significantly enhanced at low temperatures in both doped nickelates and cuprates, while other typical pairing symmetries are effectively suppressed, highlighting the dominance of the $d_{x^2-y^2}$ pairing form. Additionally, we find that the effective pairing interaction for $d_{x^2-y^2}$ pairing in doped nickelates is slightly lower than that in doped cuprates, which may be attributed to the different degrees of Fermi surface warping caused by the third-nearest-neighbor hopping $t''$. Further studies show that the hole doping and interaction strength have significant effects on the $d_{x^2-y^2}$ pairing interaction within the selected parameter range. The $d_{x^2-y^2}$ pairing interaction is notably weakened when the hole doping increases, while it is significantly enhanced with increasing Coulomb interaction strength $U$. This comparative analysis reveals the similarities and differences in the pairing behaviors of doped nickelates and cuprates, which may provide further insights into understanding the superconducting properties of these two classes of materials.
Submitted 12 June, 2024;
originally announced June 2024.
-
Image and Video Tokenization with Binary Spherical Quantization
Authors:
Yue Zhao,
Yuanjun Xiong,
Philipp Krähenbühl
Abstract:
We propose a new transformer-based image and video tokenizer with Binary Spherical Quantization (BSQ). BSQ projects the high-dimensional visual embedding to a lower-dimensional hypersphere and then applies binary quantization. BSQ is (1) parameter-efficient without an explicit codebook, (2) scalable to arbitrary token dimensions, and (3) compact: compressing visual data by up to 100$\times$ with minimal distortion. Our tokenizer uses a transformer encoder and decoder with simple block-wise causal masking to support variable-length videos as input. The resulting BSQ-ViT achieves state-of-the-art visual reconstruction quality on image and video reconstruction benchmarks with 2.4$\times$ throughput compared to the best prior methods. Furthermore, by learning an autoregressive prior for adaptive arithmetic coding, BSQ-ViT achieves comparable results on video compression with state-of-the-art video compression standards. BSQ-ViT also enables masked language models to achieve competitive image synthesis quality to GAN- and diffusion-based methods.
Submitted 11 June, 2024;
originally announced June 2024.
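The two steps named in the BSQ abstract above (project to a lower-dimensional hypersphere, then binarize) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the hypothetical function `bsq_quantize`, the hand-written projection matrix, the choice to rescale the sign vector by 1/sqrt(L) so the code stays on the unit sphere, and the bit-packing used to form a token index.

```python
import math

def l2_normalize(v):
    """Scale v to unit length so it lies on the hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def bsq_quantize(embedding, projection):
    """Illustrative binary spherical quantization (not the authors' code).

    projection: list of L rows of length d, a hypothetical linear map
    from the d-dimensional embedding to L dimensions.
    """
    # Step 1: project to L dimensions and place the result on the unit sphere.
    low = [sum(w * x for w, x in zip(row, embedding)) for row in projection]
    u = l2_normalize(low)
    # Step 2: binarize each coordinate by its sign.
    bits = [1 if x > 0 else 0 for x in u]
    # Assumed rescaling: +-1/sqrt(L) keeps the quantized code at unit norm.
    scale = 1.0 / math.sqrt(len(u))
    code = [scale if b else -scale for b in bits]
    # The L bits index a token directly, so no explicit codebook is stored.
    token = sum(b << i for i, b in enumerate(bits))
    return code, bits, token

embedding = [0.5, -1.0, 2.0, 0.1]
projection = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]  # toy 4 -> 3 map
code, bits, token = bsq_quantize(embedding, projection)
```

Because each token is just the L sign bits, the implicit vocabulary has 2^L entries without any stored codebook, which is one way to read the abstract's "parameter-efficient without an explicit codebook" claim.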
-
Mitigating Bias in Dataset Distillation
Authors:
Justin Cui,
Ruochen Wang,
Yuanhao Xiong,
Cho-Jui Hsieh
Abstract:
Dataset Distillation has emerged as a technique for compressing large datasets into smaller synthetic counterparts, facilitating downstream training tasks. In this paper, we study the impact of bias inside the original dataset on the performance of dataset distillation. With a comprehensive empirical evaluation on canonical datasets with color, corruption and background biases, we found that color and background biases in the original dataset will be amplified through the distillation process, resulting in a notable decline in the performance of models trained on the distilled dataset, while corruption bias is suppressed through the distillation process. To reduce bias amplification in dataset distillation, we introduce a simple yet highly effective approach based on a sample reweighting scheme utilizing kernel density estimation. Empirical results on multiple real-world and synthetic datasets demonstrate the effectiveness of the proposed method. Notably, on CMNIST with 5% bias-conflict ratio and IPC 50, our method achieves 91.5% test accuracy compared to 23.8% from vanilla DM, boosting the performance by 67.7 percentage points, whereas applying a state-of-the-art debiasing method on the same dataset only achieves 53.7% accuracy. Our findings highlight the importance of addressing biases in dataset distillation and provide a promising avenue to address bias amplification in the process.
Submitted 10 July, 2024; v1 submitted 6 June, 2024;
originally announced June 2024.
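The abstract above mentions sample reweighting based on kernel density estimation without giving details. One plausible reading, sketched below, is to give rare (low-density) samples larger weights. The inverse-density rule, the Gaussian kernel, the 1-D features, and the hypothetical function `kde_weights` are all assumptions for illustration, not the authors' scheme.

```python
import math

def kde_weights(features, bandwidth=0.5):
    """Weight each sample by the inverse of its Gaussian KDE density.

    features: list of 1-D feature values (a stand-in for whatever
    representation the real method uses; this is an assumption).
    """
    n = len(features)
    norm_const = n * bandwidth * math.sqrt(2.0 * math.pi)
    densities = []
    for x in features:
        # Sum of Gaussian kernels centred on every sample, including x itself,
        # so the density is always strictly positive.
        total = sum(math.exp(-0.5 * ((x - y) / bandwidth) ** 2) for y in features)
        densities.append(total / norm_const)
    inverse = [1.0 / d for d in densities]
    scale = sum(inverse)
    # Normalize so the weights sum to 1.
    return [w / scale for w in inverse]

# Three clustered "bias-aligned" samples and one isolated "bias-conflict" sample.
weights = kde_weights([0.0, 0.05, 0.1, 3.0])
```

In this toy input the isolated sample at 3.0 sits in a low-density region and therefore receives the largest weight, which is the intuition behind counteracting an over-represented bias.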
-
PointABM: Integrating Bidirectional State Space Model with Multi-Head Self-Attention for Point Cloud Analysis
Authors:
Jia-wei Chen,
Yu-jie Xiong,
Yong-bin Gao
Abstract:
Mamba, which builds on the state space model (SSM), has shown its strength in 3D point cloud analysis thanks to its linear complexity and strong classification results. Before Mamba, the Transformer had emerged as one of the most prominent and successful architectures for point cloud analysis. We present PointABM, a hybrid model that integrates the Mamba and Transformer architectures to enhance local features and improve the performance of 3D point cloud analysis. To enhance the extraction of global features, we introduce a bidirectional SSM (bi-SSM) framework, which comprises both a traditional token-forward SSM and an innovative backward SSM. To enhance the bi-SSM's capability of capturing more comprehensive features without disrupting the sequence relationships required by the bidirectional Mamba, we introduce the Transformer, utilizing its self-attention mechanism to process point clouds. Extensive experimental results demonstrate that integrating Mamba with the Transformer significantly enhances the model's capability to analyze 3D point clouds.
Submitted 10 June, 2024;
originally announced June 2024.
-
M2CVD: Multi-Model Collaboration for Code Vulnerability Detection
Authors:
Ziliang Wang,
Ge Li,
Jia Li,
Yingfei Xiong,
Jia Li,
Zhi Jin
Abstract:
Large Language Models (LLMs) have strong capabilities in code comprehension, but fine-tuning costs and semantic alignment issues limit their project-specific optimization; conversely, code models such as CodeBERT are easy to fine-tune, but they often struggle to learn vulnerability semantics from complex code languages. To address these challenges, this paper introduces the Multi-Model Collaborative Vulnerability Detection approach (M2CVD) that leverages the strong capability of analyzing vulnerability semantics from LLMs to improve the detection accuracy of code models. M2CVD employs a novel collaborative process: first enhancing the quality of the vulnerability semantic descriptions produced by LLMs through the code models' understanding of project code, and then using these improved vulnerability semantic descriptions to boost the detection accuracy of code models. We demonstrate M2CVD's effectiveness on two real-world datasets, where M2CVD significantly outperforms the baseline. In addition, we demonstrate that the M2CVD collaborative method can extend to other LLMs and code models to improve their accuracy in vulnerability detection tasks.
Submitted 9 June, 2024;
originally announced June 2024.
-
VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval
Authors:
Junjie Zhou,
Zheng Liu,
Shitao Xiao,
Bo Zhao,
Yongping Xiong
Abstract:
Multi-modal retrieval is becoming increasingly popular in practice. However, existing retrievers are mostly text-oriented and lack the capability to process visual information. Despite the presence of vision-language models like CLIP, current methods are severely limited in representing text-only and image-only data. In this work, we present a new embedding model, VISTA, for universal multi-modal retrieval. Our work makes three technical contributions. First, we introduce a flexible architecture which extends a powerful text encoder with image understanding capability by introducing visual token embeddings. Second, we develop two data generation strategies, which produce high-quality composed image-text data to facilitate the training of the embedding model. Third, we introduce a multi-stage training algorithm, which first aligns the visual token embedding with the text encoder using massive weakly labeled data, and then develops multi-modal representation capability using the generated composed image-text data. In our experiments, VISTA achieves superior performance across a variety of multi-modal retrieval tasks in both zero-shot and supervised settings. Our model, data, and source code are available at https://github.com/FlagOpen/FlagEmbedding.
Submitted 6 June, 2024;
originally announced June 2024.
-
MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding
Authors:
Junjie Zhou,
Yan Shu,
Bo Zhao,
Boya Wu,
Shitao Xiao,
Xi Yang,
Yongping Xiong,
Bo Zhang,
Tiejun Huang,
Zheng Liu
Abstract:
The evaluation of Long Video Understanding (LVU) performance poses an important but challenging research problem. Despite previous efforts, the existing video understanding benchmarks are severely constrained by several issues, especially the insufficient lengths of videos, a lack of diversity in video types and evaluation tasks, and the inappropriateness for evaluating LVU performances. To address the above problems, we propose a new benchmark, called MLVU (Multi-task Long Video Understanding Benchmark), for the comprehensive and in-depth evaluation of LVU. MLVU presents the following critical values: 1) The substantial and flexible extension of video lengths, which enables the benchmark to evaluate LVU performance across a wide range of durations. 2) The inclusion of various video genres, e.g., movies, surveillance footage, egocentric videos, cartoons, game videos, etc., which reflects the models' LVU performances in different scenarios. 3) The development of diversified evaluation tasks, which enables a comprehensive examination of MLLMs' key abilities in long-video understanding. The empirical study with 20 latest MLLMs reveals significant room for improvement in today's technique, as all existing methods struggle with most of the evaluation tasks and exhibit severe performance degradation when handling longer videos. Additionally, it suggests that factors such as context length, image-understanding quality, and the choice of LLM backbone can play critical roles in future advancements. We anticipate that MLVU will advance the research of long video understanding by providing a comprehensive and in-depth analysis of MLLMs.
Submitted 19 June, 2024; v1 submitted 6 June, 2024;
originally announced June 2024.
-
Giant enhancement of hole mobility for 4H-silicon carbide through suppressing interband electron-phonon scattering
Authors:
Jianshi Sun,
Shouhang Li,
Zhen Tong,
Cheng Shao,
Meng An,
Xiongfei Zhu,
Chuang Zhang,
Xiangchuan Chen,
Yucheng Xiong,
Thomas Frauenheim,
Xiangjun Liu
Abstract:
4H-Silicon Carbide (4H-SiC) possesses a high Baliga figure of merit, making it a promising material for power electronics. However, its applications are limited by its low hole mobility. Herein, we found that the hole mobility of 4H-SiC is mainly limited by the strong interband electron-phonon scattering using mode-level first-principles calculations. Our research indicates that applying compressive strain can reverse the sign of crystal-field splitting and change the ordering of electron bands close to the valence band maximum. Therefore, the interband electron-phonon scattering is severely suppressed, and the out-of-plane hole mobility of 4H-SiC can be enhanced by 200% with 2% uniaxial compressive strain applied. This work provides new insights into the electron transport mechanisms in semiconductors and suggests a strategy to improve hole mobility that could be applied to other semiconductors with hexagonal crystalline geometries.
Submitted 20 June, 2024; v1 submitted 4 June, 2024;
originally announced June 2024.
-
Bootstrap3D: Improving 3D Content Creation with Synthetic Data
Authors:
Zeyi Sun,
Tong Wu,
Pan Zhang,
Yuhang Zang,
Xiaoyi Dong,
Yuanjun Xiong,
Dahua Lin,
Jiaqi Wang
Abstract:
Recent years have witnessed remarkable progress in multi-view diffusion models for 3D content creation. However, there remains a significant gap in image quality and prompt-following ability compared to 2D diffusion models. A critical bottleneck is the scarcity of high-quality 3D assets with detailed captions. To address this challenge, we propose Bootstrap3D, a novel framework that automatically generates an arbitrary quantity of multi-view images to assist in training multi-view diffusion models. Specifically, we introduce a data generation pipeline that employs (1) 2D and video diffusion models to generate multi-view images based on constructed text prompts, and (2) our fine-tuned 3D-aware MV-LLaVA for filtering high-quality data and rewriting inaccurate captions. Leveraging this pipeline, we have generated 1 million high-quality synthetic multi-view images with dense descriptive captions to address the shortage of high-quality 3D data. Furthermore, we present a Training Timestep Reschedule (TTR) strategy that leverages the denoising process to learn multi-view consistency while maintaining the original 2D diffusion prior. Extensive experiments demonstrate that Bootstrap3D can generate high-quality multi-view images with superior aesthetic quality, image-text alignment, and maintained view consistency.
Submitted 31 May, 2024;
originally announced June 2024.
-
Some New Approaches to MPI Implementations
Authors:
Yuqing Xiong
Abstract:
This paper provides some new approaches to MPI implementations to improve MPI performance. These approaches include dynamically composable libraries, reducing average layer numbers of MPI libraries, and a single entity of MPI-network, MPI-protocol, and MPI.
Submitted 30 May, 2024;
originally announced May 2024.
-
A Full-duplex Speech Dialogue Scheme Based On Large Language Models
Authors:
Peng Wang,
Songshuo Lu,
Yaohua Tang,
Sijie Yan,
Yuanjun Xiong,
Wei Xia
Abstract:
We present a generative dialogue system capable of operating in a full-duplex manner, allowing for seamless interaction. It is based on a large language model (LLM) carefully aligned to be aware of a perception module, a motor function module, and the concept of a simple finite state machine (called neural FSM) with two states. The perception and motor function modules operate simultaneously, allowing the system to speak and listen to the user at the same time. The LLM generates textual tokens for inquiry responses and makes autonomous decisions to start responding to, wait for, or interrupt the user by emitting control tokens to the neural FSM. All these tasks of the LLM are carried out as next-token prediction on a serialized view of the dialogue in real time. In automatic quality evaluations simulating real-life interaction, the proposed system reduces the average conversation response latency by more than threefold compared with LLM-based half-duplex dialogue systems while responding within less than 500 milliseconds in more than 50% of evaluated interactions. Running an LLM with only 8 billion parameters, our system exhibits an 8% higher interruption precision rate than the best available commercial LLM for voice-based dialogue.
Submitted 29 May, 2024;
originally announced May 2024.
-
Can Graph Learning Improve Task Planning?
Authors:
Xixi Wu,
Yifei Shen,
Caihua Shan,
Kaitao Song,
Siwei Wang,
Bohang Zhang,
Jiarui Feng,
Hong Cheng,
Wei Chen,
Yun Xiong,
Dongsheng Li
Abstract:
Task planning is emerging as an important research topic alongside the development of large language models (LLMs). It aims to break down complex user requests into solvable sub-tasks, thereby fulfilling the original requests. In this context, the sub-tasks can be naturally viewed as a graph, where the nodes represent the sub-tasks, and the edges denote the dependencies among them. Consequently, task planning is a decision-making problem that involves selecting a connected path or subgraph within the corresponding graph and invoking it. In this paper, we explore graph learning-based methods for task planning, a direction that is orthogonal to the prevalent focus on prompt design. Our interest in graph learning stems from a theoretical discovery: the biases of attention and auto-regressive loss impede LLMs' ability to effectively navigate decision-making on graphs, which is adeptly addressed by graph neural networks (GNNs). This theoretical insight led us to integrate GNNs with LLMs to enhance overall performance. Extensive experiments demonstrate that GNN-based methods surpass existing solutions even without training, and minimal training can further enhance their performance. Additionally, our approach complements prompt engineering and fine-tuning techniques, with performance further enhanced by improved prompts or a fine-tuned model.
Submitted 29 May, 2024;
originally announced May 2024.
-
No Algorithmic Collusion in Two-Player Blindfolded Game with Thompson Sampling
Authors:
Ningyuan Chen,
Xuefeng Gao,
Yi Xiong
Abstract:
When two players are engaged in a repeated game with unknown payoff matrices, they may be completely unaware of the existence of each other and use multi-armed bandit algorithms to choose the actions, which is referred to as the "blindfolded game" in this paper. We show that when the players use Thompson sampling, the game dynamics converge to the Nash equilibrium under a mild assumption on the payoff matrices. Therefore, algorithmic collusion doesn't arise in this case despite the fact that the players do not intentionally deploy competitive strategies. To prove the convergence result, we find that the framework developed in stochastic approximation doesn't apply, because of the sporadic and infrequent updates of the inferior actions and the lack of Lipschitz continuity. We develop a novel sample-path-wise approach to show the convergence.
Submitted 23 May, 2024;
originally announced May 2024.
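The "blindfolded game" in the abstract above can be simulated in a few lines: each player runs an independent Thompson sampling bandit and never observes the other player. The sketch below is illustrative only. Bernoulli rewards, Beta(1, 1) priors, the example payoff matrices, and the hypothetical function `thompson_play` are assumptions, and the paper's exact setting and conditions may differ.

```python
import random

def thompson_play(payoff_a, payoff_b, rounds, seed=0):
    """Two independent Beta-Bernoulli Thompson samplers in a repeated game.

    payoff_a[i][j] / payoff_b[i][j]: assumed mean Bernoulli reward of
    player A / player B when A plays action i and B plays action j.
    Returns counts[i][j], the number of rounds each action pair was played.
    """
    rng = random.Random(seed)
    n_a, n_b = len(payoff_a), len(payoff_a[0])
    # Per-action Beta posterior parameters [alpha, beta], starting at Beta(1, 1).
    post_a = [[1, 1] for _ in range(n_a)]
    post_b = [[1, 1] for _ in range(n_b)]
    counts = [[0] * n_b for _ in range(n_a)]
    for _ in range(rounds):
        # Each player samples a mean for every own action and plays the argmax.
        i = max(range(n_a), key=lambda k: rng.betavariate(*post_a[k]))
        j = max(range(n_b), key=lambda k: rng.betavariate(*post_b[k]))
        r_a = 1 if rng.random() < payoff_a[i][j] else 0
        r_b = 1 if rng.random() < payoff_b[i][j] else 0
        # Each player updates only the action it played, using only its own reward.
        post_a[i][0] += r_a
        post_a[i][1] += 1 - r_a
        post_b[j][0] += r_b
        post_b[j][1] += 1 - r_b
        counts[i][j] += 1
    return counts

# Prisoner's-dilemma-like means: action 1 is dominant for both players.
payoff_a = [[0.6, 0.1], [0.9, 0.3]]
payoff_b = [[0.6, 0.9], [0.1, 0.3]]
counts = thompson_play(payoff_a, payoff_b, rounds=2000)
```

Inspecting `counts` after a long run shows how often each action pair was played; in this toy game the pair (1, 1) is the unique Nash equilibrium, while (0, 0) would be the collusive outcome.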
-
An Introduction to Vision-Language Modeling
Authors:
Florian Bordes,
Richard Yuanzhe Pang,
Anurag Ajay,
Alexander C. Li,
Adrien Bardes,
Suzanne Petryk,
Oscar Mañas,
Zhiqiu Lin,
Anas Mahmoud,
Bargav Jayaraman,
Mark Ibrahim,
Melissa Hall,
Yunyang Xiong,
Jonathan Lebensold,
Candace Ross,
Srihari Jayakumar,
Chuan Guo,
Diane Bouchacourt,
Haider Al-Tahan,
Karthik Padthe,
Vasu Sharma,
Hu Xu,
Xiaoqing Ellen Tan,
Megan Richards,
Samuel Lavoie
, et al. (16 additional authors not shown)
Abstract:
Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From having a visual assistant that could guide us through unfamiliar environments to generative models that produce images using only a high-level text description, the vision-language model (VLM) applications will significantly impact our relationship with technology. However, there are many challenges that need to be addressed to improve the reliability of those models. While language is discrete, vision evolves in a much higher dimensional space in which concepts cannot always be easily discretized. To better understand the mechanics behind mapping vision to language, we present this introduction to VLMs which we hope will help anyone who would like to enter the field. First, we introduce what VLMs are, how they work, and how to train them. Then, we present and discuss approaches to evaluate VLMs. Although this work primarily focuses on mapping images to language, we also discuss extending VLMs to videos.
Submitted 27 May, 2024;
originally announced May 2024.
-
Finetuning Large Language Model for Personalized Ranking
Authors:
Zhuoxi Bai,
Ning Wu,
Fengyu Cai,
Xinyi Zhu,
Yun Xiong
Abstract:
Large Language Models (LLMs) have demonstrated remarkable performance across various domains, motivating researchers to investigate their potential use in recommendation systems. However, directly applying LLMs to recommendation tasks has proven challenging due to the significant disparity between the data used for pre-training LLMs and the specific requirements of recommendation tasks. In this study, we introduce Direct Multi-Preference Optimization (DMPO), a streamlined framework designed to bridge the gap and enhance the alignment of LLMs for recommendation tasks. DMPO enhances the performance of LLM-based recommenders by simultaneously maximizing the probability of positive samples and minimizing the probability of multiple negative samples. We conducted experimental evaluations to compare DMPO against traditional recommendation methods and other LLM-based recommendation approaches. The results demonstrate that DMPO significantly improves the recommendation capabilities of LLMs across three real-world public datasets in few-shot scenarios. Additionally, the experiments indicate that DMPO exhibits superior generalization ability in cross-domain recommendations. A case study elucidates the reasons behind these consistent improvements and also underscores DMPO's potential as an explainable recommendation system.
Submitted 20 June, 2024; v1 submitted 25 May, 2024;
originally announced May 2024.
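The abstract above states only that DMPO simultaneously raises the probability of positive samples and lowers the probability of multiple negative samples. One plausible way to write such an objective is a DPO-style logistic loss in which the single rejected response is replaced by an average over several negatives. The sketch below shows that form. The averaging, the `beta` temperature, and the hypothetical function `dmpo_loss` are assumptions, not the paper's confirmed formulation.

```python
import math

def dmpo_loss(pos_logratio, neg_logratios, beta=1.0):
    """DPO-style loss with one positive and several negatives (assumed form).

    pos_logratio: log pi(y_pos | x) - log pi_ref(y_pos | x)
    neg_logratios: the same log-ratio for each negative sample
    """
    if not neg_logratios:
        raise ValueError("need at least one negative sample")
    avg_neg = sum(neg_logratios) / len(neg_logratios)
    margin = beta * (pos_logratio - avg_neg)
    # -log(sigmoid(margin)) computed as a numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: zero margin, loss equals log(2).
neutral = dmpo_loss(0.0, [0.0, 0.0, 0.0])
# Positive made more likely, negatives less likely: the loss drops.
improved = dmpo_loss(1.0, [-1.0, -2.0, -0.5])
```

The loss falls as the positive log-ratio rises or as the negatives' log-ratios fall, which matches the "maximize positives, minimize negatives" description in the abstract.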
-
RAEE: A Training-Free Retrieval-Augmented Early Exiting Framework for Efficient Inference
Authors:
Lianming Huang,
Shangyu Wu,
Yufei Cui,
Ying Xiong,
Xue Liu,
Tei-Wei Kuo,
Nan Guan,
Chun Jason Xue
Abstract:
Deploying large language model inference remains challenging due to its high computational overhead. Early exiting accelerates model inference by adaptively reducing the number of inference layers. Existing methods require training internal classifiers to determine whether to exit at each intermediate layer. However, such classifier-based early exiting frameworks require significant effort to design and train the classifiers. To address these limitations, this paper proposes RAEE, a training-free Retrieval-Augmented Early Exiting framework for efficient inference. First, this paper demonstrates that the early exiting problem can be modeled as a distribution prediction problem, where the distribution is approximated using similar data's exiting information. Next, the paper details the process of collecting exiting information to build the retrieval database. Finally, based on the pre-built retrieval database, RAEE leverages the retrieved similar data's exiting information to guide the backbone model to exit at the layer predicted by the approximated distribution. Experimental results demonstrate that the proposed RAEE can significantly accelerate inference. RAEE also achieves state-of-the-art zero-shot performance on 8 classification tasks.
Submitted 24 May, 2024;
originally announced May 2024.
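The retrieval step described in the RAEE abstract above (look up similar data, use their exit layers to approximate a distribution, exit at the predicted layer) can be sketched as a small nearest-neighbour routine. The Euclidean distance, the inverse-distance weighting, the tie-break toward the earlier layer, and the hypothetical function `predict_exit_layer` are assumptions for illustration, not the paper's procedure.

```python
import math
from collections import defaultdict

def predict_exit_layer(query, database, k=3):
    """Approximate an exit-layer distribution from the k nearest neighbours.

    database: list of (embedding, exit_layer) pairs collected offline
    (a hypothetical stand-in for the retrieval database).
    Returns (predicted_layer, distribution over retrieved layers).
    """
    # Retrieve the k entries closest to the query embedding.
    nearest = sorted(database, key=lambda item: math.dist(query, item[0]))[:k]
    weights = defaultdict(float)
    for embedding, layer in nearest:
        # Assumed weighting: closer neighbours contribute more.
        weights[layer] += 1.0 / (1.0 + math.dist(query, embedding))
    total = sum(weights.values())
    distribution = {layer: w / total for layer, w in weights.items()}
    # Pick the most probable layer; break ties toward the earlier layer.
    best = max(distribution, key=lambda layer: (distribution[layer], -layer))
    return best, distribution

database = [
    ([0.0, 0.0], 2),
    ([0.1, 0.0], 2),
    ([0.0, 0.2], 4),
    ([5.0, 5.0], 10),
]
layer, distribution = predict_exit_layer([0.0, 0.0], database, k=3)
```

No classifier is trained here: the only learned component is whatever produced the embeddings, which is consistent with the abstract's "training-free" framing.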
-
PT43D: A Probabilistic Transformer for Generating 3D Shapes from Single Highly-Ambiguous RGB Images
Authors:
Yiheng Xiong,
Angela Dai
Abstract:
Generating 3D shapes from single RGB images is essential in various applications such as robotics. Current approaches typically target images containing clear and complete visual descriptions of the object, without considering common realistic cases where observations of objects are largely occluded or truncated. We thus propose a transformer-based autoregressive model to generate the probabilistic distribution of 3D shapes conditioned on an RGB image containing potentially highly ambiguous observations of the object. To handle realistic scenarios such as occlusion or field-of-view truncation, we create simulated image-to-shape training pairs that enable improved fine-tuning for real-world scenarios. We then adopt cross-attention to effectively identify the most relevant region of interest from the input image for shape generation. This enables inference of sampled shapes with reasonable diversity and strong alignment with the input image. We train and test our model on our synthetic data, then fine-tune and test it on real-world data. Experiments demonstrate that our model outperforms the state of the art in both scenarios.
Submitted 20 May, 2024;
originally announced May 2024.
-
Proving Functional Program Equivalence via Directed Lemma Synthesis
Authors:
Yican Sun,
Ruyi Ji,
Jian Fang,
Xuanlin Jiang,
Mingshuai Chen,
Yingfei Xiong
Abstract:
Proving equivalence between functional programs is a fundamental problem in program verification, which often amounts to reasoning about algebraic data types (ADTs) and compositions of structural recursions. Modern theorem provers address this problem by applying structural induction, which is insufficient for proving many equivalence theorems. In such cases, one has to invent a set of lemmas, prove these lemmas by additional induction, and use these lemmas to prove the original theorem. There is, however, a lack of systematic understanding of what lemmas are needed for inductive proofs and how these lemmas can be synthesized automatically. This paper presents directed lemma synthesis, an effective approach to automating equivalence proofs by discovering critical lemmas using program synthesis techniques. We first identify two induction-friendly forms of propositions that give formal guarantees to the progress of the proof. We then propose two tactics that synthesize and apply lemmas, thereby transforming the proof goal into induction-friendly forms. Both tactics reduce lemma synthesis to a specialized class of program synthesis problems with efficient algorithms. Experimental results demonstrate the effectiveness of our approach: Compared to state-of-the-art equivalence checkers employing heuristic-based lemma enumeration, directed lemma synthesis saves 95.47% runtime on average and solves 38 more tasks over an extended version of the standard benchmark set.
Submitted 19 May, 2024;
originally announced May 2024.
-
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
Authors:
Tianhe Ren,
Qing Jiang,
Shilong Liu,
Zhaoyang Zeng,
Wenlong Liu,
Han Gao,
Hongjie Huang,
Zhengyu Ma,
Xiaoke Jiang,
Yihao Chen,
Yuda Xiong,
Hao Zhang,
Feng Li,
Peijun Tang,
Kent Yu,
Lei Zhang
Abstract:
This paper introduces Grounding DINO 1.5, a suite of advanced open-set object detection models developed by IDEA Research, which aims to advance the "Edge" of open-set object detection. The suite encompasses two models: Grounding DINO 1.5 Pro, a high-performance model designed for stronger generalization capability across a wide range of scenarios, and Grounding DINO 1.5 Edge, an efficient model optimized for faster speed demanded in many applications requiring edge deployment. The Grounding DINO 1.5 Pro model advances its predecessor by scaling up the model architecture, integrating an enhanced vision backbone, and expanding the training dataset to over 20 million images with grounding annotations, thereby achieving a richer semantic understanding. The Grounding DINO 1.5 Edge model, while designed for efficiency with reduced feature scales, maintains robust detection capabilities by being trained on the same comprehensive dataset. Empirical results demonstrate the effectiveness of Grounding DINO 1.5, with the Grounding DINO 1.5 Pro model attaining a 54.3 AP on the COCO detection benchmark and a 55.7 AP on the LVIS-minival zero-shot transfer benchmark, setting new records for open-set object detection. Furthermore, the Grounding DINO 1.5 Edge model, when optimized with TensorRT, achieves a speed of 75.2 FPS while attaining a zero-shot performance of 36.2 AP on the LVIS-minival benchmark, making it more suitable for edge computing scenarios. Model examples and demos with API will be released at https://github.com/IDEA-Research/Grounding-DINO-1.5-API
Submitted 31 May, 2024; v1 submitted 16 May, 2024;
originally announced May 2024.
-
Cooperative Visual-LiDAR Extrinsic Calibration Technology for Intersection Vehicle-Infrastructure: A review
Authors:
Xinyu Zhang,
Yijin Xiong,
Qianxin Qu,
Renjie Wang,
Xin Gao,
Jing Liu,
Shichun Guo,
Jun Li
Abstract:
In the typical urban intersection scenario, both vehicles and infrastructures are equipped with visual and LiDAR sensors. By successfully integrating the data from vehicle-side and road monitoring devices, a more comprehensive and accurate environmental perception and information acquisition can be achieved. The calibration of sensors, as an essential component of autonomous driving technology, has consistently drawn significant attention. Particularly in scenarios involving multiple sensors collaboratively perceiving and addressing localization challenges, the requirement for inter-sensor calibration becomes crucial. Recent years have witnessed the emergence of the concept of multi-end cooperation, where infrastructure captures and transmits surrounding environment information to vehicles, bolstering their perception capabilities while mitigating costs. However, this also poses technical complexities, underscoring the pressing need for diverse end calibration. Camera and LiDAR, the bedrock sensors in autonomous driving, exhibit expansive applicability. This paper comprehensively examines and analyzes the calibration of multi-end camera-LiDAR setups from vehicle, roadside, and vehicle-road cooperation perspectives, outlining their relevant applications and profound significance. Concluding with a summary, we present our future-oriented ideas and hypotheses.
Submitted 16 May, 2024;
originally announced May 2024.
-
Optical transition parameters of the silicon T centre
Authors:
Chloe Clear,
Sara Hosseini,
Amirhossein AlizadehKhaledi,
Nicholas Brunelle,
Austin Woolverton,
Joshua Kanaganayagam,
Moein Kazemi,
Camille Chartrand,
Mehdi Keshavarz,
Yihuang Xiong,
Oney O. Soykal,
Geoffroy Hautier,
Valentin Karassiouk,
Mike Thewalt,
Daniel Higginbottom,
Stephanie Simmons
Abstract:
The silicon T centre's narrow, telecommunications-band optical emission, long spin coherence, and direct photonic integration have spurred interest in this emitter as a spin-photon interface for distributed quantum computing and networking. However, key parameters of the T centre's spin-selective optical transitions remain undetermined or ambiguous in literature. In this paper we present a Hamiltonian of the T centre TX state and determine key parameters of the optical transition from T$_0$ to TX$_0$ from a combined analysis of published results, density functional theory, and new spectroscopy. We resolve ambiguous values of the internal defect potential in the literature, and we present the first measurements of electrically tuned T centre emission. As a result, we provide a model of the T centre's optical and spin properties under strain, electric, and magnetic fields that can be utilized for realizing quantum technologies.
Submitted 11 May, 2024;
originally announced May 2024.
-
Discovery of T center-like quantum defects in silicon
Authors:
Yihuang Xiong,
Jiongzhi Zheng,
Shay McBride,
Xueyue Zhang,
Sinéad M. Griffin,
Geoffroy Hautier
Abstract:
Quantum technologies would benefit from the development of high performance quantum defects acting as single-photon emitters or spin-photon interface. Finding such a quantum defect in silicon is especially appealing in view of its favorable spin bath and high processability. While some color centers in silicon have been emerging in quantum applications, there is still a need to search and develop new high performance quantum emitters. Searching a high-throughput computational database of more than 22,000 charged complex defects in silicon, we identify a series of defects formed by a group III element combined with carbon ((A-C)$\rm _{Si}$ with A=B,Al,Ga,In,Tl) and substituting on a silicon site. These defects are analogous structurally, electronically and chemically to the well-known T center in silicon ((C-C-H)$\rm_{Si}$) and their optical properties are mainly driven by an unpaired electron in a carbon $p$ orbital. They all emit in the telecom and some of these color centers show improved properties compared to the T center in terms of computed radiative lifetime or emission efficiency. We also show that the synthesis of hydrogenated T center-like defects followed by a dehydrogenation annealing step could be an efficient way of synthesis. All the T center-like defects show a higher symmetry than the T center making them easier to align with magnetic fields. Our work motivates further studies on the synthesis and control of this new family of quantum defects, and also demonstrates the use of high-throughput computational screening to detect new complex quantum defects.
Submitted 8 May, 2024;
originally announced May 2024.
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Authors:
DeepSeek-AI,
Aixin Liu,
Bei Feng,
Bin Wang,
Bingxuan Wang,
Bo Liu,
Chenggang Zhao,
Chengqi Dengr,
Chong Ruan,
Damai Dai,
Daya Guo,
Dejian Yang,
Deli Chen,
Dongjie Ji,
Erhang Li,
Fangyun Lin,
Fuli Luo,
Guangbo Hao,
Guanting Chen,
Guowei Li,
H. Zhang,
Hanwei Xu,
Hao Yang,
Haowei Zhang,
Honghui Ding
, et al. (132 additional authors not shown)
Abstract:
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.
Submitted 19 June, 2024; v1 submitted 7 May, 2024;
originally announced May 2024.
-
Improving the Ranging Performance of Random ISAC Signals Through Pulse Shaping Design
Authors:
Zihan Liao,
Fan Liu,
Shuangyang Li,
Yifeng Xiong,
Weijie Yuan,
Marco Lops
Abstract:
In this paper, we propose a novel pulse shaping design for single-carrier integrated sensing and communication (ISAC) transmission. Due to the communication information embedded in the ISAC signal, the resulting auto-correlation function (ACF) is determined by both the information-conveying random symbol sequence and the signaling pulse, where the former leads to random fluctuations in the sidelobes of the ACF, impairing the range estimation performance. To overcome this challenge, we first analyze the statistical characteristics of the random ACF under the symbol-wise pulse shaping (SWPS) regime. As a step further, we formulate an optimization problem to design ISAC pulse shaping filters, which minimizes the average integrated sidelobe level ratio (ISLR) while meeting the Nyquist criterion, subject to power and bandwidth constraints. We then show that the problem can be recast as a convex quadratic program by expressing it in the frequency domain, which can be readily solved through standard tools. Numerical results demonstrate that the proposed pulse shaping design achieves substantial ranging sidelobe reduction compared to the celebrated root-raised cosine (RRC) pulse shaping, given that the communication throughput is unchanged.
Submitted 6 May, 2024; v1 submitted 6 May, 2024;
originally announced May 2024.
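The integrated sidelobe level ratio (ISLR) named in the abstract above can be illustrated with a minimal sketch. This is not the paper's method: it assumes a real-valued, symbol-rate sequence with no pulse shaping, and it uses the common discrete definition of ISLR as sidelobe energy of the aperiodic auto-correlation divided by mainlobe energy. The function names `aperiodic_acf` and `islr` are hypothetical.

```python
def aperiodic_acf(seq):
    """Aperiodic auto-correlation r[k] for lags k = 0 .. len(seq) - 1.

    Assumption: seq is real-valued, so r[-k] == r[k] and only
    non-negative lags need to be stored.
    """
    n = len(seq)
    return [sum(seq[i] * seq[i + k] for i in range(n - k)) for k in range(n)]


def islr(seq):
    """Sidelobe energy over mainlobe energy of the aperiodic ACF.

    Assumed discrete definition (not taken from the paper):
        ISLR = sum_{k != 0} |r[k]|^2 / |r[0]|^2
    """
    r = aperiodic_acf(seq)
    mainlobe = r[0] ** 2
    if mainlobe == 0:
        raise ValueError("sequence has zero energy")
    # Each positive lag has a mirror-image negative lag, hence the factor 2.
    sidelobes = 2 * sum(v * v for v in r[1:])
    return sidelobes / mainlobe


if __name__ == "__main__":
    print(islr([1, 1]))
    print(islr([1, -1, 1, 1]))
```

A lower value means less energy leaks into range sidelobes. With random communication symbols the value fluctuates from one sequence to the next, which is why the abstract refers to an average ISLR.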
-
Scaffold-BPE: Enhancing Byte Pair Encoding with Simple and Effective Scaffold Token Removal
Authors:
Haoran Lian,
Yizhe Xiong,
Jianwei Niu,
Shasha Mo,
Zhenpeng Su,
Zijia Lin,
Peng Liu,
Hui Chen,
Guiguang Ding
Abstract:
Byte Pair Encoding (BPE) serves as a foundation method for text tokenization in the Natural Language Processing (NLP) field. Despite its wide adoption, the original BPE algorithm harbors an inherent flaw: it inadvertently introduces a frequency imbalance for tokens in the text corpus. Since BPE iteratively merges the most frequent token pair in the text corpus while keeping all tokens that have been merged in the vocabulary, it unavoidably holds tokens that primarily represent subwords of complete words and appear infrequently on their own in the text corpus. We term such tokens as Scaffold Tokens. Due to their infrequent appearance in the text corpus, Scaffold Tokens pose a learning imbalance issue for language models. To address that issue, we propose Scaffold-BPE, which incorporates a dynamic scaffold token removal mechanism by parameter-free, computation-light, and easy-to-implement modifications to the original BPE. This novel approach ensures the exclusion of low-frequency Scaffold Tokens from the token representations for the given texts, thereby mitigating the issue of frequency imbalance and facilitating model training. On extensive experiments across language modeling tasks and machine translation tasks, Scaffold-BPE consistently outperforms the original BPE, well demonstrating its effectiveness and superiority.
Submitted 27 April, 2024;
originally announced April 2024.
-
Temporal Scaling Law for Large Language Models
Authors:
Yizhe Xiong,
Xiansheng Chen,
Xin Ye,
Hui Chen,
Zijia Lin,
Haoran Lian,
Zhenpeng Su,
Jianwei Niu,
Guiguang Ding
Abstract:
Recently, Large Language Models (LLMs) have been widely adopted in a wide range of tasks, leading to increasing attention towards the research on how scaling LLMs affects their performance. Existing works, termed Scaling Laws, have discovered that the final test loss of LLMs scales as power-laws with model size, computational budget, and dataset size. However, the temporal change of the test loss of an LLM throughout its pre-training process remains unexplored, though it is valuable in many aspects, such as selecting better hyperparameters directly on the target LLM. In this paper, we propose the novel concept of Temporal Scaling Law, studying how the test loss of an LLM evolves as the training steps scale up. In contrast to modeling the test loss as a whole in a coarse-grained manner, we break it down and dive into the fine-grained test loss of each token position, and further develop a dynamic hyperbolic-law. Afterwards, we derive the much more precise temporal scaling law by studying the temporal patterns of the parameters in the dynamic hyperbolic-law. Results on both in-distribution (ID) and out-of-distribution (OOD) validation datasets demonstrate that our temporal scaling law accurately predicts the test loss of LLMs across training steps. Our temporal scaling law has broad practical applications. First, it enables direct and efficient hyperparameter selection on the target LLM, such as data mixture proportions. Secondly, viewing the LLM pre-training dynamics from the token position granularity provides some insights to enhance the understanding of LLM pre-training.
Submitted 16 June, 2024; v1 submitted 27 April, 2024;
originally announced April 2024.
-
VN-Net: Vision-Numerical Fusion Graph Convolutional Network for Sparse Spatio-Temporal Meteorological Forecasting
Authors:
Yutong Xiong,
Xun Zhu,
Ming Wu,
Weiqing Li,
Fanbin Mo,
Chuang Zhang,
Bin Zhang
Abstract:
Sparse meteorological forecasting is indispensable for fine-grained weather forecasting and deserves extensive attention. Recent studies have highlighted the potential of spatio-temporal graph convolutional networks (ST-GCNs) in predicting numerical data from ground weather stations. However, as one of the highest fidelity and lowest latency data, the application of the vision data from satellites in ST-GCNs remains unexplored. There are few studies to demonstrate the effectiveness of combining the above multi-modal data for sparse meteorological forecasting. Towards this objective, we introduce Vision-Numerical Fusion Graph Convolutional Network (VN-Net), which mainly utilizes: 1) Numerical-GCN (N-GCN) to adaptively model the static and dynamic patterns of spatio-temporal numerical data; 2) Vision-LSTM Network (V-LSTM) to capture multi-scale joint channel and spatial features from time series satellite images; 3) a GCN-based decoder to generate hourly predictions of specified meteorological factors. As far as we know, VN-Net is the first attempt to introduce GCN method to utilize multi-modal data for better handling sparse spatio-temporal meteorological forecasting. Our experiments on Weather2k dataset show VN-Net outperforms state-of-the-art by a significant margin on mean absolute error (MAE) and root mean square error (RMSE) for temperature, relative humidity, and visibility forecasting. Furthermore, we conduct interpretation analysis and design quantitative evaluation metrics to assess the impact of incorporating vision data.
Submitted 26 January, 2024;
originally announced April 2024.
-
Fundamental Limits of Communication-Assisted Sensing in ISAC Systems
Authors:
Fuwang Dong,
Fan Liu,
Shihang Liu,
Yifeng Xiong,
Weijie Yuan,
Yuanhao Cui
Abstract:
In this paper, we introduce a novel communication-assisted sensing (CAS) framework that explores the potential coordination gains offered by the integrated sensing and communication technique. The CAS system endows users with beyond-line-of-the-sight sensing capabilities, supported by a dual-functional base station that enables simultaneous sensing and communication. To delve into the system's fundamental limits, we characterize the information-theoretic framework of the CAS system in terms of rate-distortion theory. We reveal the achievable overall distortion between the target's state and the reconstructions at the end-user, referred to as the sensing quality of service, within a special case where the distortion metric is separable for sensing and communication processes. As a case study, we employ a typical application to demonstrate distortion minimization under the ISAC signaling strategy, showcasing the potential of CAS in enhancing sensing capabilities.
Submitted 23 April, 2024; v1 submitted 11 April, 2024;
originally announced April 2024.
-
Characterizing visual cortical magnification with topological smoothing and optimal transportation
Authors:
Yujian Xiong,
Yanshuai Tu,
Zhong-Lin Lu,
Yalin Wang
Abstract:
Human vision has different concentration on visual fields. Cortical magnification factor (CMF) is a popular measurement on visual acuity and cortex concentration. In order to achieve thorough measurement of CMF across the whole visual field, we propose a method to measure planar CMF upon retinotopic maps generated by pRF decoding, with help of our proposed methods: optimal transportation and topological smoothing. The optimal transportation re-calculates vertex location in retinotopic mapping, and topological smoothing guarantees topological conditions in retinotopic maps, which allow us to calculate planar CMF with the proposed 1-ring patch method. The pipeline was applied to the HCP 7T dataset, giving new planar results on CMF measurement across all 181 subjects, which illustrate novel concentration behavior on visual fields and their individual difference.
Submitted 9 April, 2024;
originally announced April 2024.
-
Invariant stability conditions on local $\mathbb{P}^1\times \mathbb{P}^1$ (after Del Monte-Longhi)
Authors:
Yirui Xiong
Abstract:
Let $X$ be the total space of the canonical bundle of $\mathbb{P}^1\times \mathbb{P}^1$. We study an invariant subspace of stability conditions on $X$ under an autoequivalence of $D^b(X)$. We describe the complete set of stable objects with respect to the invariant stability conditions and characterize the space of invariant stability conditions.
Submitted 8 April, 2024;
originally announced April 2024.
-
Reconstructing Retinal Visual Images from 3T fMRI Data Enhanced by Unsupervised Learning
Authors:
Yujian Xiong,
Wenhui Zhu,
Zhong-Lin Lu,
Yalin Wang
Abstract:
The reconstruction of human visual inputs from brain activity, particularly through functional Magnetic Resonance Imaging (fMRI), holds promising avenues for unraveling the mechanisms of the human visual system. Despite the significant strides made by deep learning methods in improving the quality and interpretability of visual reconstruction, there remains a substantial demand for high-quality, long-duration, subject-specific 7-Tesla fMRI experiments. The challenge arises in integrating diverse smaller 3-Tesla datasets or accommodating new subjects with brief and low-quality fMRI scans. In response to these constraints, we propose a novel framework that generates enhanced 3T fMRI data through an unsupervised Generative Adversarial Network (GAN), leveraging unpaired training across two distinct fMRI datasets in 7T and 3T, respectively. This approach aims to overcome the limitations of the scarcity of high-quality 7-Tesla data and the challenges associated with brief and low-quality scans in 3-Tesla experiments. In this paper, we demonstrate the reconstruction capabilities of the enhanced 3T fMRI data, highlighting its proficiency in generating superior input visual images compared to data-intensive methods trained and tested on a single subject.
Submitted 7 April, 2024;
originally announced April 2024.
-
A dataset of primary nasopharyngeal carcinoma MRI with multi-modalities segmentation
Authors:
Yin Li,
Qi Chen,
Kai Wang,
Meige Li,
Liping Si,
Yingwei Guo,
Yu Xiong,
Qixing Wang,
Yang Qin,
Ling Xu,
Patrick van der Smagt,
Jun Tang,
Nutan Chen
Abstract:
Multi-modality magnetic resonance imaging data with various sequences facilitate the early diagnosis, tumor segmentation, and disease staging in the management of nasopharyngeal carcinoma (NPC). The lack of publicly available, comprehensive datasets limits advancements in diagnosis, treatment planning, and the development of machine learning algorithms for NPC. Addressing this critical need, we introduce the first comprehensive NPC MRI dataset, encompassing MR axial imaging of 277 primary NPC patients. This dataset includes T1-weighted, T2-weighted, and contrast-enhanced T1-weighted sequences, totaling 831 scans. In addition to the corresponding clinical data, manually annotated and labeled segmentations by experienced radiologists offer high-quality data resources from untreated primary NPC.
Submitted 4 April, 2024;
originally announced April 2024.
-
GPU acceleration of ab initio simulations of large-scale identical particles based on path integral molecular dynamics
Authors:
Yunuo Xiong
Abstract:
Path integral Monte Carlo (PIMC) and path integral molecular dynamics (PIMD) provide the gold standard for the ab initio simulations of identical particles. In this work, we achieved significant GPU acceleration based on PIMD, which is equivalent to PIMC in the ab initio simulations, and developed an open-source PIMD code repository that does not rely on any other third-party library. Numerical experiments show that for a system of 1600 interacting identical bosons in a harmonic trap, using a single GPU and a single CPU, it only takes two hours to achieve satisfactory simulation accuracy. With the increase of the number of identical particles, the advantage of GPU acceleration over CPU becomes more obvious, making it possible to simulate tens of thousands of identical particles from first principles using a single GPU. For example, for a system of 10000 non-interacting bosons, numerical experiments show that it takes 23 hours to obtain a simulation that is highly consistent with the exact results. Our study shows that GPU acceleration can lay a solid foundation for the wide application of PIMD simulations for extremely large-scale identical particle quantum systems with more than 10,000 particles. Numerical experiments show that a 24GB GPU can simulate up to 40000 identical particles from first principles, and the GPU acceleration leads to a roughly linear relationship between the computation time and the number of identical particles. In addition, we have also successfully implemented simulations for fictitious identical particle thermodynamics using GPU to overcome the Fermion sign problem, which makes it promising to efficiently and accurately simulate tens of thousands of fermions based on GPU.
Submitted 3 April, 2024;
originally announced April 2024.
-
Characterizations of amorphic schemes and fusions of pairs
Authors:
Edwin R. van Dam,
Jack H. Koolen,
Yanzhen Xiong
Abstract:
An association scheme is called amorphic if every possible fusion of relations gives rise to a fusion scheme. We call a pair of relations fusing if fusing that pair gives rise to a fusion scheme. We define the fusing-relations graph on the set of relations, where a pair forms an edge if it fuses. We show that if the fusing-relations graph is connected but not a path, then the association scheme is amorphic. As a side result, we show that an association scheme in which at most one relation is not strongly regular of (negative) Latin square type, is amorphic.
Submitted 31 March, 2024;
originally announced April 2024.
-
Beyond the Known: Novel Class Discovery for Open-world Graph Learning
Authors:
Yucheng Jin,
Yun Xiong,
Juncheng Fang,
Xixi Wu,
Dongxiao He,
Xing Jia,
Bingchen Zhao,
Philip Yu
Abstract:
Node classification on graphs is of great importance in many applications. Due to the limited labeling capability and evolution in real-world open scenarios, novel classes can emerge on unlabeled testing nodes. However, little attention has been paid to novel class discovery on graphs. Discovering novel classes is challenging as novel and known class nodes are correlated by edges, which makes their representations indistinguishable when applying message passing GNNs. Furthermore, the novel classes lack labeling information to guide the learning process. In this paper, we propose a novel method Open-world gRAph neuraL network (ORAL) to tackle these challenges. ORAL first detects correlations between classes through semi-supervised prototypical learning. Inter-class correlations are subsequently eliminated by the prototypical attention network, leading to distinctive representations for different classes. Furthermore, to fully explore multi-scale graph features for alleviating label deficiencies, ORAL generates pseudo-labels by aligning and ensembling label estimations from multiple stacked prototypical attention networks. Extensive experiments on several benchmark datasets show the effectiveness of our proposed method.
Submitted 28 March, 2024;
originally announced March 2024.
-
Kernel entropy estimation for linear processes II
Authors:
Yudan Xiong,
Fangjun Xu
Abstract:
Let $X=\{X_n: n\in \mathbb{N}\}$ be a linear process with bounded probability density function $f(x)$. Under certain conditions, we use the kernel estimator \[ \frac{2}{n(n-1)h_n} \sum_{1\le i<j\le n}K\Big(\frac{X_i-X_j}{h_n}\Big) \] to estimate the quadratic functional of $\int_{\mathbb{R}}f^2(x)dx$ of the linear process $X=\{X_n: n\in \mathbb{N}\}$ and improve the corresponding results in [4].
Submitted 28 March, 2024;
originally announced March 2024.
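The kernel estimator written out in the abstract above translates directly into code. The sketch below evaluates $\frac{2}{n(n-1)h_n}\sum_{1\le i<j\le n}K\big(\frac{X_i-X_j}{h_n}\big)$ for a finite sample. The Gaussian kernel default and the function names are assumptions made for illustration; the abstract does not state which kernel or bandwidth sequence the paper uses.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as an assumed choice of K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def quadratic_functional_estimate(xs, h, kernel=gaussian_kernel):
    """Estimate the integral of f(x)^2 from samples xs with bandwidth h.

    Implements 2 / (n (n - 1) h) * sum_{i < j} K((x_i - x_j) / h).
    """
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    if h <= 0:
        raise ValueError("bandwidth must be positive")
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += kernel((xs[i] - xs[j]) / h)
    return 2.0 * total / (n * (n - 1) * h)


if __name__ == "__main__":
    sample = [0.1, -0.4, 0.7, 1.2, -0.9]
    print(quadratic_functional_estimate(sample, h=0.5))
```

The double loop costs O(n^2) kernel evaluations, which is fine for a sketch but would need vectorizing or binning for large samples.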