-
OVGNet: A Unified Visual-Linguistic Framework for Open-Vocabulary Robotic Grasping
Authors:
Li Meng,
Zhao Qi,
Lyu Shuchang,
Wang Chunlei,
Ma Yujing,
Cheng Guangliang,
Yang Chenguang
Abstract:
Recognizing and grasping novel-category objects remains a crucial yet challenging problem in real-world robotic applications. Despite its significance, limited research has been conducted in this specific domain. To address this, we propose a novel framework that seamlessly integrates open-vocabulary learning into robotic grasping, enabling robots to handle novel objects. Our contributions are threefold. First, we present a large-scale benchmark dataset specifically tailored for evaluating open-vocabulary grasping. Second, we propose a unified visual-linguistic framework that guides robots in grasping both base and novel objects. Third, we introduce two alignment modules designed to enhance visual-linguistic perception in the robotic grasping process. Extensive experiments validate the efficacy and utility of our approach. Notably, our framework achieves an average accuracy of 71.2% and 64.4% on base and novel categories in our new dataset, respectively.
Submitted 18 July, 2024;
originally announced July 2024.
-
Collaborative Fall Detection and Response using Wi-Fi Sensing and Mobile Companion Robot
Authors:
Yunwang Chen,
Yaozhong Kang,
Ziqi Zhao,
Yue Hong,
Lingxiao Meng,
Max Q. -H. Meng
Abstract:
This paper presents a collaborative fall detection and response system integrating Wi-Fi sensing with robotic assistance. The proposed system leverages channel state information (CSI) disruptions caused by movements to detect falls in non-line-of-sight (NLOS) scenarios, offering non-intrusive monitoring. In addition, a companion robot navigates to and responds to incidents autonomously, improving the efficiency of assistance in various environments. Experimental results demonstrate the effectiveness of the proposed system in detecting and responding to falls.
Submitted 17 July, 2024;
originally announced July 2024.
-
Sudden polarization angle jumps of the repeating fast radio burst FRB 20201124A
Authors:
J. R. Niu,
W. Y. Wang,
J. C. Jiang,
Y. Qu,
D. J. Zhou,
W. W. Zhu,
K. J. Lee,
J. L. Han,
B. Zhang,
D. Li,
S. Cao,
Z. Y. Fang,
Y. Feng,
Q. Y. Fu,
P. Jiang,
W. C. Jing,
J. Li,
Y. Li,
R. Luo,
L. Q. Meng,
C. C. Miao,
X. L. Miao,
C. H. Niu,
Y. C. Pan,
B. J. Wang
, et al. (19 additional authors not shown)
Abstract:
We report the first detection of polarization angle (PA) orthogonal jumps, a phenomenon previously only observed from radio pulsars, from a fast radio burst (FRB) source, FRB 20201124A. We find three cases of orthogonal jumps in over two thousand bursts, all resembling those observed in pulsar single pulses. We propose that the jumps are due to the superposition of two orthogonal emission modes that could only be produced in a highly magnetized plasma, and that they are caused by the line of sight sweeping across a rotating magnetosphere. The shortest jump timescale is of the order of one millisecond, which hints that the emission modes come from regions smaller than the light cylinder of most pulsars or magnetars. This discovery provides convincing evidence that FRB emission originates from the complex magnetosphere of a magnetar, suggesting an FRB emission mechanism that is analogous to that of radio pulsars despite a huge luminosity difference between the two types of objects.
Submitted 15 July, 2024;
originally announced July 2024.
-
Large Language Model-based FMRI Encoding of Language Functions for Subjects with Neurocognitive Disorder
Authors:
Yuejiao Wang,
Xianmin Gong,
Lingwei Meng,
Xixin Wu,
Helen Meng
Abstract:
Functional magnetic resonance imaging (fMRI) is essential for developing encoding models that identify functional changes in language-related brain areas of individuals with Neurocognitive Disorders (NCD). While large language model (LLM)-based fMRI encoding has shown promise, existing studies predominantly focus on healthy, young adults, overlooking older NCD populations and cognitive level correlations. This paper explores language-related functional changes in older NCD adults using LLM-based fMRI encoding and brain scores, addressing current limitations. We analyze the correlation between brain scores and cognitive scores at both whole-brain and language-related ROI levels. Our findings reveal that higher cognitive abilities correspond to better brain scores, with correlations peaking in the middle temporal gyrus. This study highlights the potential of fMRI encoding models and brain scores for detecting early functional changes in NCD patients.
Submitted 14 July, 2024;
originally announced July 2024.
-
Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System
Authors:
Lingwei Meng,
Jiawen Kang,
Yuejiao Wang,
Zengrui Jin,
Xixin Wu,
Xunying Liu,
Helen Meng
Abstract:
Multi-talker speech recognition and target-talker speech recognition, both of which involve transcription in multi-talker contexts, remain significant challenges. However, existing methods rarely attempt to address both tasks simultaneously. In this study, we propose a pioneering approach to empower Whisper, a speech foundation model, to tackle joint multi-talker and target-talker speech recognition tasks. Specifically, (i) we freeze Whisper and plug a Sidecar separator into its encoder to separate the mixed embedding for multiple talkers; (ii) a Target Talker Identifier is introduced to identify the embedding flow of the target talker on the fly, requiring only three seconds of enrollment speech as a cue; (iii) soft prompt tuning for the decoder is explored for better task adaptation. Our method outperforms previous methods on the two- and three-talker LibriMix and LibriSpeechMix datasets for both tasks, and delivers acceptable zero-shot performance on multi-talker ASR on the AishellMix Mandarin dataset.
Submitted 13 July, 2024;
originally announced July 2024.
-
Autoregressive Speech Synthesis without Vector Quantization
Authors:
Lingwei Meng,
Long Zhou,
Shujie Liu,
Sanyuan Chen,
Bing Han,
Shujie Hu,
Yanqing Liu,
Jinyu Li,
Sheng Zhao,
Xixin Wu,
Helen Meng,
Furu Wei
Abstract:
We present MELLE, a novel continuous-valued token-based language modeling approach for text-to-speech synthesis (TTS). MELLE autoregressively generates continuous mel-spectrogram frames directly from the text condition, bypassing the need for vector quantization, which was originally designed for audio compression and sacrifices fidelity compared to mel-spectrograms. Specifically, (i) instead of cross-entropy loss, we apply regression loss with a proposed spectrogram flux loss function to model the probability distribution of the continuous-valued tokens; (ii) we incorporate variational inference into MELLE to facilitate sampling mechanisms, thereby enhancing output diversity and model robustness. Experiments demonstrate that, compared to the two-stage codec language model VALL-E and its variants, the single-stage MELLE mitigates robustness issues by avoiding the inherent flaws of sampling discrete codes, achieves superior performance across multiple metrics, and, most importantly, offers a more streamlined paradigm. See https://aka.ms/melle for demos of our work.
Submitted 11 July, 2024;
originally announced July 2024.
-
Window-to-Window BEV Representation Learning for Limited FoV Cross-View Geo-localization
Authors:
Lei Cheng,
Teng Wang,
Lingquan Meng,
Changyin Sun
Abstract:
Cross-view geo-localization confronts significant challenges due to large perspective changes, especially when the ground-view query image has a limited field of view (FoV) with unknown orientation. To bridge the cross-view domain gap, we explore, for the first time, learning a BEV representation directly from the ground query image. However, the unknown orientation between ground and aerial images, combined with the absence of camera parameters, leads to ambiguity between BEV queries and ground references. To tackle this challenge, we propose a novel Window-to-Window BEV representation learning method, termed W2W-BEV, which adaptively matches BEV queries to ground references at window scale. Specifically, predefined BEV embeddings and extracted ground features are segmented into a fixed number of windows, and then the most similar ground window is chosen for each BEV feature based on a context-aware window matching strategy. Subsequently, cross-attention is performed between the matched BEV and ground windows to learn a robust BEV representation. Additionally, we use ground features along with predicted depth information to initialize the BEV embeddings, helping to learn more powerful BEV representations. Extensive experimental results on benchmark datasets demonstrate the significant superiority of W2W-BEV over previous state-of-the-art methods under challenging conditions of unknown orientation and limited FoV. Specifically, on the CVUSA dataset with a limited FoV of 90 degrees and unknown orientation, W2W-BEV achieves a significant improvement in R@1 accuracy, from 47.24% to 64.73% (+17.49%).
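The window-matching step can be illustrated with a minimal Python sketch. The abstract does not specify the context-aware matching strategy, so this sketch swaps in a simple stand-in: each window is mean-pooled and BEV windows are matched to ground windows by cosine similarity. All function names here are hypothetical and are not taken from the W2W-BEV code.

```python
import math
from typing import List, Sequence

Vector = List[float]


def split_windows(features: Sequence[Vector], n_windows: int) -> List[List[Vector]]:
    """Split a feature sequence into n_windows contiguous, equally sized windows."""
    if n_windows <= 0 or len(features) == 0 or len(features) % n_windows != 0:
        raise ValueError("features must be non-empty and divide evenly into windows")
    size = len(features) // n_windows
    return [list(features[i * size:(i + 1) * size]) for i in range(n_windows)]


def mean_pool(window: Sequence[Vector]) -> Vector:
    """Average the vectors in a window (assumed stand-in for a window descriptor)."""
    dim = len(window[0])
    return [sum(v[d] for v in window) / len(window) for d in range(dim)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def match_windows(bev: Sequence[Vector], ground: Sequence[Vector], n_windows: int) -> List[int]:
    """For each BEV window, return the index of the most similar ground window."""
    bev_desc = [mean_pool(w) for w in split_windows(bev, n_windows)]
    ground_desc = [mean_pool(w) for w in split_windows(ground, n_windows)]
    return [
        max(range(n_windows), key=lambda j: cosine(b, ground_desc[j]))
        for b in bev_desc
    ]
```

In the method described above, cross-attention would then be computed only between each BEV window and its matched ground window; that step is omitted from the sketch.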
Submitted 9 July, 2024;
originally announced July 2024.
-
LVLM-empowered Multi-modal Representation Learning for Visual Place Recognition
Authors:
Teng Wang,
Lingquan Meng,
Lei Cheng,
Changyin Sun
Abstract:
Visual place recognition (VPR) remains challenging due to significant viewpoint changes and appearance variations. Mainstream works tackle these challenges by developing various feature aggregation methods to transform deep features into robust and compact global representations. Unfortunately, satisfactory results cannot be achieved under challenging conditions. We start from a new perspective and attempt to build a discriminative global representation by fusing image data and text descriptions of the visual scene. The motivation is twofold: (1) current Large Vision-Language Models (LVLMs) demonstrate extraordinary emergent capability in visual instruction following, and thus provide an efficient and flexible way of generating text descriptions of images; (2) the text descriptions, which provide high-level scene understanding, show strong robustness against environment variations. Although promising, leveraging LVLMs to build multi-modal VPR solutions remains challenging with respect to efficient multi-modal fusion. Furthermore, LVLMs will inevitably produce some inaccurate descriptions, making it even harder. To tackle these challenges, we propose a novel multi-modal VPR solution. It first adapts pre-trained visual and language foundation models to VPR for extracting image and text features, which are then fed into the feature combiner to enhance each other. As the main component, the feature combiner first uses a token-wise attention block to adaptively recalibrate text tokens according to their relevance to the image data, and then applies an efficient cross-attention fusion module to propagate information across modalities. The enhanced multi-modal features are compressed into the feature descriptor for performing retrieval. Experimental results show that our method outperforms state-of-the-art methods by a large margin with a significantly smaller image descriptor dimension.
Submitted 9 July, 2024;
originally announced July 2024.
-
Internal structure of the $T_{cc}(3875)^+$ from its light-quark mass dependence
Authors:
Michael Abolnikov,
Vadim Baru,
Evgeny Epelbaum,
Arseniy A. Filin,
Christoph Hanhart,
Lu Meng
Abstract:
We employ a chiral effective field theory-based approach to connect $DD^*$ scattering observables at the physical and variable pion masses accessible in lattice QCD simulations. We incorporate all relevant scales associated with three-body $DDπ$ dynamics and the left-hand cut induced by the one-pion exchange for pion masses higher than the physical one, as required by analyticity and unitarity. By adjusting the contact interactions to match experimental data at the physical pion mass and lattice finite-volume energy levels at $m_π = 280$ MeV, we predict the trajectory of the $T_{cc}$ pole as a function of the pion mass, finding it consistent with the hadronic-molecule scenario. In particular, we find that the explicit treatment of the one-pion exchange has a pronounced effect on the pole trajectory for $m_π\gtrsim 230$ MeV by pushing it into the complex energy plane.
Submitted 5 July, 2024;
originally announced July 2024.
-
Fully heavy tetraquark resonant states with different flavors
Authors:
Wei-Lin Wu,
Yao Ma,
Yan-Ke Chen,
Lu Meng,
Shi-Lin Zhu
Abstract:
We use the quark potential model to calculate the mass spectrum of the S-wave fully heavy tetraquark systems with different flavors, including the $ bc\bar b\bar c, bb\bar c\bar c, cc\bar c\bar b $ and $ bb\bar b\bar c $ systems. We employ the Gaussian expansion method to solve the four-body Schrödinger equation, and the complex scaling method to identify resonant states. The $ bc\bar b\bar c, bb\bar c\bar c, cc\bar c\bar b $ and $ bb\bar b\bar c $ resonant states are obtained in the mass regions of $ (13.2,13.5) $, $ (13.3,13.6) $, $ (10.0,10.3) $, $ (16.5,16.7) $ GeV, respectively. Among these states, the $ bc\bar b\bar c $ tetraquark states are the most promising ones to be discovered in the near future. We recommend the experimental exploration of the $ 1^{++} $ and $ 2^{++} $ $ bc\bar b\bar c $ states with masses near $ 13.3 $ GeV in the $ J/ψΥ$ channel. From the root-mean-square radii, we find that all the resonant states we have identified are compact tetraquark states.
Submitted 25 June, 2024;
originally announced June 2024.
-
Mapping AI Ethics Narratives: Evidence from Twitter Discourse Between 2015 and 2022
Authors:
Mengyi Wei,
Puzhen Zhang,
Chuan Chen,
Dongsheng Chen,
Chenyu Zuo,
Liqiu Meng
Abstract:
Public participation is indispensable for an insightful understanding of the ethics issues raised by AI technologies. Twitter is selected in this paper to serve as an online public sphere for exploring discourse on AI ethics, facilitating broad and equitable public engagement in the development of AI technology. A research framework is proposed to demonstrate how to transform AI ethics-related discourse on Twitter into coherent and readable narratives. It consists of two parts: 1) combining neural networks with large language models to construct a topic hierarchy that contains popular topics of public concern without ignoring small but important voices, thus allowing a fine-grained exploration of meaningful information. 2) transforming fragmented and difficult-to-understand social media information into coherent and easy-to-read stories through narrative visualization, providing a new perspective for understanding the information in Twitter data. This paper aims to advocate for policy makers to enhance public oversight of AI technologies so as to promote their fair and sustainable development.
Submitted 20 June, 2024;
originally announced June 2024.
-
V3Det Challenge 2024 on Vast Vocabulary and Open Vocabulary Object Detection: Methods and Results
Authors:
Jiaqi Wang,
Yuhang Zang,
Pan Zhang,
Tao Chu,
Yuhang Cao,
Zeyi Sun,
Ziyu Liu,
Xiaoyi Dong,
Tong Wu,
Dahua Lin,
Zeming Chen,
Zhi Wang,
Lingchen Meng,
Wenhao Yao,
Jianwei Yang,
Sihong Wu,
Zhineng Chen,
Zuxuan Wu,
Yu-Gang Jiang,
Peixi Wu,
Bosong Chai,
Xuan Nie,
Longquan Yan,
Zeyu Wang,
Qifan Zhou
, et al. (9 additional authors not shown)
Abstract:
Detecting objects in real-world scenes is a complex task due to various challenges, including the vast range of object categories, and potential encounters with previously unknown or unseen objects. The challenges necessitate the development of public benchmarks and challenges to advance the field of object detection. Inspired by the success of previous COCO and LVIS Challenges, we organize the V3Det Challenge 2024 in conjunction with the 4th Open World Vision Workshop: Visual Perception via Learning in an Open World (VPLOW) at CVPR 2024, Seattle, US. This challenge aims to push the boundaries of object detection research and encourage innovation in this field. The V3Det Challenge 2024 consists of two tracks: 1) Vast Vocabulary Object Detection: This track focuses on detecting objects from a large set of 13204 categories, testing the detection algorithm's ability to recognize and locate diverse objects. 2) Open Vocabulary Object Detection: This track goes a step further, requiring algorithms to detect objects from an open set of categories, including unknown objects. In the following sections, we will provide a comprehensive summary and analysis of the solutions submitted by participants. By analyzing the methods and solutions presented, we aim to inspire future research directions in vast vocabulary and open-vocabulary object detection, driving progress in this field. Challenge homepage: https://v3det.openxlab.org.cn/challenge
Submitted 17 June, 2024;
originally announced June 2024.
-
VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment
Authors:
Bing Han,
Long Zhou,
Shujie Liu,
Sanyuan Chen,
Lingwei Meng,
Yanming Qian,
Yanqing Liu,
Sheng Zhao,
Jinyu Li,
Furu Wei
Abstract:
With the help of discrete neural audio codecs, large language models (LLMs) have increasingly been recognized as a promising methodology for zero-shot Text-to-Speech (TTS) synthesis. However, while sampling-based decoding strategies bring astonishing diversity to generation, they also pose robustness issues such as typos, omissions, and repetition. In addition, the high sampling rate of audio brings huge computational overhead to the autoregressive inference process. To address these issues, we propose VALL-E R, a robust and efficient zero-shot TTS system built upon VALL-E. Specifically, we introduce a phoneme monotonic alignment strategy to strengthen the connection between phonemes and the acoustic sequence, ensuring a more precise alignment by constraining the acoustic tokens to match their associated phonemes. Furthermore, we employ a codec-merging approach to downsample the discrete codes in the shallow quantization layer, thereby accelerating decoding while preserving the high quality of speech output. Benefiting from these strategies, VALL-E R obtains controllability over phonemes and demonstrates strong robustness by approaching the WER of ground truth. In addition, it requires fewer autoregressive steps, with over 60% time reduction during inference. This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia. Audio samples will be available at: https://aka.ms/valler.
Submitted 12 June, 2024;
originally announced June 2024.
-
Spectrum of the molecular hexaquarks
Authors:
Bo Wang,
Kan Chen,
Lu Meng,
Shi-Lin Zhu
Abstract:
We investigate the mass spectra of molecular-type hexaquark states in the dibaryon systems. These systems are composed of the charmed baryons $[Σ_c^{(\ast)}$, $Ξ_c^{(\prime,\ast)}]$, doubly charmed baryons $[Ξ_{cc}^{(\ast)}]$, and hyperons $[Σ^{(\ast)}$, $Ξ^{(\ast)}]$. We consider all possible combinations of particle-particle and particle-antiparticle pairs, including the S-wave spin multiplets in each combination. We establish the underlying connections among the molecular tetraquarks, pentaquarks, and hexaquarks with the effective quark-level interactions. We find that the existence of molecular states in $DD^\ast$, $D\bar{D}^\ast$, and $Σ_c\bar{D}^{(\ast)}$ systems leads to the emergence of a large number of deuteron-like hexaquarks in the heavy flavor sectors. Currently, there have been several experimental candidates for molecular tetraquarks and pentaquarks. The experimental search for near-threshold hexaquarks will further advance the establishment of the underlying dynamical picture of hadronic molecules and deepen our understanding of the role of spin-flavor symmetry in near-threshold residual strong interactions.
Submitted 11 June, 2024;
originally announced June 2024.
-
Training Dynamics of Nonlinear Contrastive Learning Model in the High Dimensional Limit
Authors:
Lineghuan Meng,
Chuang Wang
Abstract:
This letter presents a high-dimensional analysis of the training dynamics for a single-layer nonlinear contrastive learning model. The empirical distribution of the model weights converges to a deterministic measure governed by a McKean-Vlasov nonlinear partial differential equation (PDE). Under L2 regularization, this PDE reduces to a closed set of low-dimensional ordinary differential equations (ODEs), reflecting the evolution of the model performance during the training process. We analyze the locations of the ODEs' fixed points and their stability, unveiling several interesting findings. First, only the hidden variable's second moment affects feature learnability at the state with uninformative initialization. Second, higher moments influence the probability of feature selection by controlling the attraction region, rather than affecting local stability. Finally, independent noise added in data augmentation degrades performance, but negatively correlated noise can reduce the variance of gradient estimation, yielding better performance. Despite the simplicity of the analyzed model, it exhibits rich training dynamics, paving the way to understanding the more complex mechanisms behind practical large models.
Submitted 10 June, 2024;
originally announced June 2024.
-
Improve Mathematical Reasoning in Language Models by Automated Process Supervision
Authors:
Liangchen Luo,
Yinxiao Liu,
Rosanne Liu,
Samrat Phatale,
Harsh Lara,
Yunxuan Li,
Lei Shu,
Yun Zhu,
Lei Meng,
Jiao Sun,
Abhinav Rastogi
Abstract:
Complex multi-step reasoning tasks, such as solving mathematical problems or generating code, remain a significant hurdle for even the most advanced large language models (LLMs). Verifying LLM outputs with an Outcome Reward Model (ORM) is a standard inference-time technique aimed at enhancing the reasoning performance of LLMs. However, this still proves insufficient for reasoning tasks with a lengthy or multi-hop reasoning chain, where the intermediate outcomes are neither properly rewarded nor penalized. Process supervision addresses this limitation by assigning intermediate rewards during the reasoning process. To date, the methods used to collect process supervision data have relied on either human annotation or per-step Monte Carlo estimation, both prohibitively expensive to scale, thus hindering the broad application of this technique. In response to this challenge, we propose a novel divide-and-conquer style Monte Carlo Tree Search (MCTS) algorithm named OmegaPRM for the efficient collection of high-quality process supervision data. This algorithm swiftly identifies the first error in the Chain of Thought (CoT) with binary search and balances the positive and negative examples, thereby ensuring both efficiency and quality. As a result, we are able to collect over 1.5 million process supervision annotations to train a Process Reward Model (PRM). Utilizing this fully automated process supervision alongside the weighted self-consistency algorithm, we have enhanced the instruction-tuned Gemini Pro model's math reasoning performance, achieving a 69.4% success rate on the MATH benchmark, a 36% relative improvement from the 51% base model performance. Additionally, the entire process operates without any human intervention, making our method both financially and computationally cost-effective compared to existing methods.
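The binary-search idea and the relative-improvement figure can be illustrated with a minimal Python sketch. This is not the OmegaPRM implementation: `prefix_ok` is a hypothetical stand-in for a rollout-based check of whether a solution prefix can still reach the correct answer, and the sketch assumes that the empty prefix passes and that once a prefix fails, every longer prefix fails too.

```python
from typing import Callable, Optional, Sequence


def first_error_index(
    steps: Sequence[str],
    prefix_ok: Callable[[Sequence[str]], bool],
) -> Optional[int]:
    """Return the index of the first erroneous step, or None if no error is found.

    Assumes monotonicity: if a prefix fails prefix_ok, all longer prefixes fail.
    """
    if not steps or prefix_ok(steps):
        return None
    lo, hi = 0, len(steps)  # invariant: prefix of length lo passes, length hi fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if prefix_ok(steps[:mid]):
            lo = mid
        else:
            hi = mid
    return hi - 1  # the step that turned a passing prefix into a failing one


def relative_improvement(base: float, new: float) -> float:
    """Relative improvement of new over base, in percent."""
    return (new - base) / base * 100.0
```

With a chain of n steps, this needs on the order of log2(n) calls to `prefix_ok` rather than one call per step. Applying `relative_improvement(51.0, 69.4)` gives roughly 36, which matches the relative improvement quoted in the abstract.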
Submitted 5 June, 2024;
originally announced June 2024.
-
Proofread: Fixes All Errors with One Tap
Authors:
Renjie Liu,
Yanxiang Zhang,
Yun Zhu,
Haicheng Sun,
Yuanbo Zhang,
Michael Xuelin Huang,
Shanqing Cai,
Lei Meng,
Shumin Zhai
Abstract:
The impressive capabilities of Large Language Models (LLMs) provide a powerful approach to reimagine users' typing experience. This paper demonstrates Proofread, a novel Gboard feature powered by a server-side LLM, enabling seamless sentence-level and paragraph-level corrections with a single tap. We describe the complete system in this paper, from data generation and metrics design to model tuning and deployment. To obtain models with sufficient quality, we implement a careful data synthesis pipeline tailored to online use cases, design multifaceted metrics, and employ a two-stage tuning approach to acquire the dedicated LLM for the feature: Supervised Fine Tuning (SFT) for foundational quality, followed by Reinforcement Learning (RL) tuning for targeted refinement. Specifically, we find that sequential tuning on Rewrite and proofread tasks yields the best quality in the SFT stage, and propose global and direct rewards in the RL tuning stage to seek further improvement. Extensive experiments on a human-labeled golden set showed that our tuned PaLM2-XS model achieved an 85.56% good ratio. We launched the feature to Pixel 8 devices by serving the model on TPU v5 in Google Cloud, with thousands of daily active users. Serving latency was significantly reduced by quantization, bucket inference, text segmentation, and speculative decoding. Our demo is available on YouTube at https://youtu.be/4ZdcuiwFU7I
Submitted 6 June, 2024;
originally announced June 2024.
-
DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs
Authors:
Lingchen Meng,
Jianwei Yang,
Rui Tian,
Xiyang Dai,
Zuxuan Wu,
Jianfeng Gao,
Yu-Gang Jiang
Abstract:
Most large multimodal models (LMMs) are implemented by feeding visual tokens as a sequence into the first layer of a large language model (LLM). The resulting architecture is simple but significantly increases computation and memory costs, as it has to handle a large number of additional tokens in its input layer. This paper presents a new architecture, DeepStack, for LMMs. Considering $N$ layers in the language and vision transformer of LMMs, we stack the visual tokens into $N$ groups and feed each group to its aligned transformer layer \textit{from bottom to top}. Surprisingly, this simple method greatly enhances the power of LMMs to model interactions among visual tokens across layers, with minimal additional cost. We apply DeepStack to both the language and vision transformers in LMMs, and validate the effectiveness of DeepStack LMMs with extensive empirical results. Using the same context length, our DeepStack models with 7B and 13B parameters surpass their counterparts by \textbf{2.7} and \textbf{2.9} on average across \textbf{9} benchmarks, respectively. Using only one-fifth of the context length, DeepStack closely rivals the counterparts that use the full context length. These gains are particularly pronounced on high-resolution tasks, e.g., \textbf{4.2}, \textbf{11.0}, and \textbf{4.0} improvements on TextVQA, DocVQA, and InfoVQA compared to LLaVA-1.5-7B, respectively. We further apply DeepStack to vision transformer layers, which brings a similar improvement of \textbf{3.8} on average compared with LLaVA-1.5-7B.
Submitted 6 June, 2024;
originally announced June 2024.
-
A 0.96pJ/SOP, 30.23K-neuron/mm^2 Heterogeneous Neuromorphic Chip With Fullerene-like Interconnection Topology for Edge-AI Computing
Authors:
P. J. Zhou,
Q. Yu,
M. Chen,
Y. C. Wang,
L. W. Meng,
Y. Zuo,
N. Ning,
Y. Liu,
S. G. Hu,
G. C. Qiao
Abstract:
Edge-AI computing requires high energy efficiency, low power consumption, relatively high flexibility, and compact area, which challenges AI-chip design. This work presents a 0.96 pJ/SOP heterogeneous neuromorphic system-on-chip (SoC) with fullerene-like interconnection topology for edge-AI computing. The neuromorphic core integrates different technologies to augment computing energy efficiency, including sparse computing, partial membrane potential updates, and non-uniform weight quantization. Multiple neuromorphic cores and multi-mode routers form a fullerene-like network-on-chip (NoC). The average degree of communication nodes exceeds traditional topologies by 32%, with a minimal degree variance of 0.93, allowing advanced decentralized on-chip communication. Additionally, the NoC can be scaled up through extended off-chip high-level router nodes. A RISC-V CPU and a neuromorphic processor are tightly coupled and fabricated within a 5.42 mm^2 die area under 55 nm CMOS technology. The chip has a low power density of 0.52 mW/mm^2, a 67.5% reduction compared to related works, and achieves a high neuron density of 30.23 K/mm^2. Finally, the chip is demonstrated to be effective on different datasets and achieves 0.96 pJ/SOP energy efficiency.
Submitted 3 June, 2024;
originally announced June 2024.
-
Cross-Training with Multi-View Knowledge Fusion for Heterogenous Federated Learning
Authors:
Zhuang Qi,
Lei Meng,
Weihao He,
Ruohan Zhang,
Yu Wang,
Xin Qi,
Xiangxu Meng
Abstract:
Federated learning benefits from cross-training strategies, which enable models to train on data from distinct sources to improve the generalization capability. However, the data heterogeneity between sources may lead models to gradually forget previously acquired knowledge when undergoing cross-training to adapt to new tasks or data sources. We argue that integrating personalized and global knowledge to gather information from multiple perspectives could potentially improve performance. To achieve this goal, this paper presents a novel approach that enhances federated learning through a cross-training scheme incorporating multi-view information. Specifically, the proposed method, termed FedCT, includes three main modules. The consistency-aware knowledge broadcasting module aims to optimize model assignment strategies, which enhances collaborative advantages between clients and achieves an efficient federated learning process. The multi-view knowledge-guided representation learning module leverages fused prototypical knowledge from both global and local views to enhance the preservation of local knowledge before and after model exchange, as well as to ensure consistency between local and global knowledge. The mixup-based feature augmentation module aggregates rich information to further increase the diversity of feature spaces, which enables the model to better discriminate complex samples. Extensive experiments were conducted on four datasets in terms of performance comparison, ablation study, in-depth analysis and case study. The results demonstrated that FedCT alleviates knowledge forgetting from both local and global views, which enables it to outperform state-of-the-art methods.
Submitted 30 May, 2024;
originally announced May 2024.
-
Spectral multiplexing based on multi-distance lensless imaging
Authors:
Qijun You,
Lingshuo Meng,
Yun Gao,
Qing Liao,
Wei Cao,
Peixiang Lu
Abstract:
We have demonstrated the capability of spectral multiplexing in multi-distance diffractive imaging, enabling the reconstruction of samples with diverse spectral responses. While previous methods like ptychography utilize redundancy in radial diffraction data to achieve information multiplexing, they typically require capturing a substantial amount of diffraction data. In contrast, our approach effectively harnesses the redundant information in axial diffraction data. This significantly reduces the amount of diffraction data required and relaxes the stringent requirements on optical path stability.
Submitted 29 May, 2024;
originally announced May 2024.
-
Electron form factors in Basis Light-front Quantization
Authors:
Lingdi Meng,
Shuo Tang,
Zhi Hu,
Guo-Li Wang,
Yang Li,
Xingbo Zhao,
James P. Vary
Abstract:
In this paper, we evaluate the electromagnetic and gravitational form factors as well as the corresponding generalized parton distributions of the electron using the Basis Light-front Quantization approach to QED. We compare our results with those from light-front perturbation theory. We adopt a novel basis with its scale depending on the constituents' longitudinal momentum fraction. We show that this basis improves convergence of the form factors with increasing basis dimension, compared to that calculated in the original basis with fixed scale. These results both validate the BLFQ approach and provide guidance for its efficient implementation in solving light-front Hamiltonian mass eigenstates for more complex systems in QED and QCD.
Submitted 27 May, 2024;
originally announced May 2024.
-
Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection
Authors:
Yun Zhu,
Jia-Chen Gu,
Caitlin Sikora,
Ho Ko,
Yinxiao Liu,
Chu-Cheng Lin,
Lei Shu,
Liangchen Luo,
Lei Meng,
Bang Liu,
Jindong Chen
Abstract:
Large language models (LLMs) augmented with retrieval exhibit robust performance and extensive versatility by incorporating external contexts. However, the input length grows linearly in the number of retrieved documents, causing a dramatic increase in latency. In this paper, we propose a novel paradigm named Sparse RAG, which seeks to cut computation costs through sparsity. Specifically, Sparse RAG encodes retrieved documents in parallel, which eliminates latency introduced by long-range attention of retrieved documents. Then, LLMs selectively decode the output by only attending to highly relevant caches auto-regressively, which are chosen via prompting LLMs with special control tokens. Notably, Sparse RAG combines the assessment of each individual document and the generation of the response into a single process. The sparse mechanism reduces the number of documents loaded during decoding, which accelerates inference of the RAG system. Additionally, filtering out undesirable contexts enhances the model's focus on relevant context, inherently improving its generation quality. Evaluation results on two datasets show that Sparse RAG can strike an optimal balance between generation quality and computational efficiency, demonstrating its generalizability across both short- and long-form generation tasks.
Submitted 25 May, 2024;
originally announced May 2024.
-
Inducing ferroelectricity in NH$_4$I and NH$_4$Br via partial replacement of protons by deuterons
Authors:
Miao Miao Zhao,
Lei Meng,
Yi Yang Xu,
Na Du,
Fei Yen
Abstract:
While all of the polymorphs of NH$_4$I and NH$_4$Br are non-polar, a reversible electric polarization is established in the ordered $γ$ phases of (NH$_4$)$_{0.73}$(ND$_4$)$_{0.27}$I and (NH$_4$)$_{0.84}$(ND$_4$)$_{0.16}$Br (where D is $^2$H) via $dc$ electric fields. The presence of two groups of orbital magnetic moments appears to be responsible for the asymmetric lattice distortions. Our findings provide an alternative pathway for hydrogen-based materials to potentially add a ferroelectric functionality.
Submitted 24 May, 2024;
originally announced May 2024.
-
Electric Polarization and Magnetic Properties of (NH$_4$)$_{1-x}$K$_x$I (x = 0.05-0.17)
Authors:
Yi Yang Xu,
Lei Meng,
Miao Miao Zhao,
Chu Xin Peng,
Fei Yen
Abstract:
While all of the polymorphs of pure NH$_4$I and KI are non-polar, we identify that (NH$_4$)$_{0.95}$K$_{0.05}$I is ferroelectric and (NH$_4$)$_{0.87}$K$_{0.13}$I and (NH$_4$)$_{0.83}$K$_{0.17}$I are pyroelectric through measurements of their pyroelectric current and complex dielectric constant. The order to disorder phase transitions occur near 245 K. Magnetic susceptibility measurements indicate that the proton orbitals of the NH$_4$$^+$ continue to become ordered in the ground state in the (NH$_4$)$_{1-x}$K$_x$I system up to x <= 0.17. The polar phases are proposed to stem from K$^+$ ions disrupting the symmetry of proton-orbital-lattice interactions between the NH$_4$$^+$ and I$^-$ ions. Our work introduces a new pathway for the ordered phases of ammonium-based compounds to potentially become ferroelectric.
Submitted 24 May, 2024;
originally announced May 2024.
-
Maximum Manifold Capacity Representations in State Representation Learning
Authors:
Li Meng,
Morten Goodwin,
Anis Yazidi,
Paal Engelstad
Abstract:
The expanding research on manifold-based self-supervised learning (SSL) builds on the manifold hypothesis, which suggests that the inherent complexity of high-dimensional data can be unraveled through lower-dimensional manifold embeddings. Capitalizing on this, DeepInfomax with an unbalanced atlas (DIM-UA) has emerged as a powerful tool and yielded impressive results for state representations in reinforcement learning. Meanwhile, Maximum Manifold Capacity Representation (MMCR) presents a new frontier for SSL by optimizing class separability via manifold compression. However, MMCR demands extensive input views, resulting in significant computational costs and protracted pre-training durations. Bridging this gap, we present an innovative integration of MMCR into existing SSL methods, incorporating a discerning regularization strategy that enhances the lower bound of mutual information. We also propose a novel state representation learning method extending DIM-UA, embedding a nuclear norm loss to enforce manifold consistency robustly. In experiments on the Atari Annotated RAM Interface, our method improves on DIM-UA significantly with the same number of target encoding dimensions. The mean F1 score averaged over categories is 78%, compared to 75% for DIM-UA. There are also compelling gains when implementing SimCLR and Barlow Twins. This supports our SSL innovation as a paradigm shift, enabling more nuanced high-dimensional data representations.
Submitted 22 May, 2024;
originally announced May 2024.
-
Ferroelectricity Driven by Orbital Resonance of Protons in CH$_3$NH$_3$Cl and CH$_3$NH$_3$Br
Authors:
Chu Xin Peng,
Lei Meng,
Yi Yang Xu,
Tian Tian Xing,
Miao Miao Zhao,
Peng Ren,
Fei Yen
Abstract:
The $β$ and $γ$ phases of methylammonium chloride CH$_3$NH$_3$Cl and methylammonium bromide CH$_3$NH$_3$Br are identified to be ferroelectric $via$ pyroelectric current and dielectric constant measurements. The magnetic susceptibility also exhibits pronounced discontinuities at the Curie temperatures. We attribute the origin of spontaneous polarization to the emergence of two groups of proton orbital magnetic moments from the uncorrelated motion of the CH$_3$ and NH$_3$ groups in the $β$ and $γ$ phases. The two inequivalent frameworks of intermolecular orbital resonances interact with each other to distort the lattice in a non-centrosymmetric fashion. Our findings indicate that the structural instabilities in molecular frameworks are magnetic in origin and provide a new pathway toward uncovering new organic ferroelectrics.
Submitted 15 May, 2024;
originally announced May 2024.
-
VascularPilot3D: Toward a 3D fully autonomous navigation for endovascular robotics
Authors:
Song Jingwei,
Yang Keke,
Chen Han,
Liu Jiayi,
Gu Yinan,
Hui Qianxin,
Huang Yanqi,
Li Meng,
Zhang Zheng,
Cao Tuoyu,
Ghaffari Maani
Abstract:
This research reports VascularPilot3D, the first 3D fully autonomous endovascular robot navigation system. As an exploration toward autonomous guidewire navigation, VascularPilot3D is developed as a complete navigation system based on intra-operative imaging systems (fluoroscopic X-ray in this study) and typical endovascular robots. VascularPilot3D adopts previously researched fast 3D-2D vessel registration algorithms and guidewire segmentation methods as its perception modules. We additionally propose three modules: a topology-constrained 2D-3D instrument end-point lifting method, a tree-based fast path planning algorithm, and a prior-free endovascular navigation strategy. VascularPilot3D is compatible with most mainstream endovascular robots. Ex-vivo experiments validate that VascularPilot3D achieves a 100% success rate across 25 trials. It reduces the human surgeon's overall control loops by 18.38%. VascularPilot3D is promising for general clinical autonomous endovascular navigation.
Submitted 15 May, 2024;
originally announced May 2024.
-
Magnetic interactions based on proton orbital motion in CH$_3$NH$_3$PbI$_3$ and CH$_3$NH$_3$PbBr$_3$
Authors:
Lei Meng,
Miao Miao Zhao,
Yi Yang Xu,
Chu Xin Peng,
Yang Yang,
Tian Tian Xing,
Peng Ren,
Fei Yen
Abstract:
The microscopic origin of the remarkable optoelectronic properties of one of the most studied contemporary materials remains unclear. Here, we identify the existence of magnetic interactions between intermolecular proton orbitals in CH$_3$NH$_3$PbI$_3$ and CH$_3$NH$_3$PbBr$_3$. In particular, a unique sharp drop and a pronounced step-up discontinuity in the magnetic susceptibility at the tetragonal-to-cubic phase transitions are identified in CH$_3$NH$_3$PbI$_3$ and CH$_3$NH$_3$PbBr$_3$, respectively. The magnetic interactions in the orthorhombic and tetragonal phases are dependent on thermal history and lattice orientation while nearly independent of the applied external magnetic field. In CH$_3$NH$_3$PbBr$_3$, the CH$_3$ and NH$_3$$^+$ components reorient in an uncorrelated fashion, resulting in the cubic phase also exhibiting magnetic anisotropy. Our findings provide a potential link connecting the highly light-absorbing CH$_3$NH$_3$$^+$ and the exceptional properties of the charge carriers of the inorganic framework in hybrid perovskite solar cells.
Submitted 15 May, 2024;
originally announced May 2024.
-
Magnetic Properties of NH$_4$H$_2$PO$_4$ and KH$_2$PO$_4$: Emergence of Multiferroic Salts
Authors:
Lei Meng,
Chen He,
Wei Ji,
Fei Yen
Abstract:
We observe sharp step-down discontinuities in the magnetic susceptibility of NH$_4$H$_2$PO$_4$ and NH$_4$H$_2$PO$_4$-$d$$_{60}$ (60% deuterated) along the $a$ and $c$-axes occurring exactly at their antiferroelectric transition temperatures. For the case of KH$_2$PO$_4$, less pronounced discontinuities occur at the ferroelectric transition temperature. To explain this, we treat the acid protons as individual oscillators that generate current elements which translate to magnetic forces in near resonance with each other. With decreasing temperature, the resonant forces become more commensurate which amplifies a disproportionate drop off of two types of magnetic forces to eventually trigger the structural phase transitions. For the case of NH$_4$H$_2$PO$_4$, the associated internal magnetic field appears to aid the NH$_4$$^+$ to order at higher temperature. At 49 K, a shoulder-like anomaly in both NH$_4$H$_2$PO$_4$ and KH$_2$PO$_4$ is attributed to a possible onset of macroscopic quantum tunneling of protons. Our findings bring forth a new category of intrinsic multiferroic systems.
Submitted 14 May, 2024;
originally announced May 2024.
-
Highly Efficient Observation Process based on FFT Filtering for Robot Swarm Collaborative Navigation in Unknown Environments
Authors:
Chenxi Li,
Weining Lu,
Zhihao Ma,
Litong Meng,
Bin Liang
Abstract:
Collaborative path planning for robot swarms in complex, unknown environments without external positioning is a challenging problem. This requires robots to find safe directions based on real-time environmental observations, and to efficiently transfer and fuse these observations within the swarm. This study presents a filtering method based on Fast Fourier Transform (FFT) to address these two issues. We treat sensors' environmental observations as a digital sampling process. Then, we design two different types of filters for safe direction extraction, as well as for the compression and reconstruction of environmental data. The reconstructed data is mapped to the probabilistic domain, achieving efficient fusion of swarm observations and planning decisions. The computation time is only on the order of microseconds, and the data transmitted in communication systems is at the bit level. The performance of our algorithm in sensor data processing was validated in real-world experiments, and the effectiveness in swarm path optimization was demonstrated through extensive simulations.
Submitted 17 July, 2024; v1 submitted 13 May, 2024;
originally announced May 2024.
-
Dark Matter Physics in General NMSSM
Authors:
Lei Meng,
Junjie Cao,
Shenshen Yang
Abstract:
In the General Next-to-Minimal Supersymmetric Standard Model (GNMSSM), singlet particles may form a secluded sector of dark matter (DM), in which Singlino-like DM could achieve the observed relic abundance through various channels such as $\tilde{χ}_1^0 \tilde{χ}_1^0 \to h_s h_s, A_s A_s, h_s A_s$, where $h_s$ and $A_s$ represent singlet-dominated CP-even and CP-odd Higgs bosons. We provide analytical formulas for both the spin-independent and spin-dependent cross sections of Singlino DM scattering with nucleons, illustrating their dependence on the model's parameters in a clear manner. We also present analytic expressions for the annihilation cross sections of these three important channels. Based on these preparations, we conducted Bayesian analyses of the GNMSSM and concluded that the theory significantly favored Singlino-dominated DM over Bino-like DM across a much broader range of parameters. The combined results from our numerical analyses and the formulas distinctly highlight crucial aspects of DM physics within the GNMSSM.
Submitted 11 May, 2024;
originally announced May 2024.
-
Magnetoelectric Coupling Based on Protons in Ammonium Sulfate
Authors:
Lei Meng,
Chen He,
Fei Yen
Abstract:
Most ferroelectric crystals have their own set of unique characteristics and ammonium sulfate (NH$_4$)$_2$SO$_4$ is no exception. We report on two previously unidentified features in ammonium sulfate: 1) that there are at least two successive transitions instead of one occurring at the Curie temperature $T$$_C$ = 223 K according to dielectric constant measurements; and 2) pronounced step-like anomalies are found in the magnetic susceptibility exactly at $T$$_C$. To explain these results, we take into account that there exists a previously unidentified linear coupling between the magnetic and electric dipole moments of the NH$_4$$^+$ tetrahedra due to their rapid reorientations and distorted geometry, respectively. The magnetic moments are small, 0.0016 $μ$$_B$ for every $C$$_3$ reorientation, which involves three protons (H$^+$) undergoing orbital motion. Nevertheless, short-range correlations exist in the paraelectric phase because the magnetic moments are restricted to only point along 14 possible orientations due to the symmetry and periodic nature of the potential wells. At $T$$_C$, $C$$_2$ reorientations (involving four protons) are no longer energetically feasible so the reduction in the degrees of freedom to 8 further enhances the effect of the magnetic interactions. This triggers long-range ordering of the orbital moments in an antiferromagnetic configuration along the $ab$-plane, which via Dzyaloshinskii-Moriya interactions, end up canting slightly toward the $c$-axis direction. Since there exist two types of inequivalent NH$_4$$^+$ groups that reorient at different frequencies with temperature and do not have the same degree of distortion, the emerging polar phase is ferrielectric.
Submitted 6 May, 2024;
originally announced May 2024.
-
Magnetic Ordering of Ammonium Cations in NH$_4$I, NH$_4$Br and NH$_4$Cl
Authors:
Fei Yen,
Lei Meng,
Tian Gao,
Sixia Hu
Abstract:
The different types of magnetism arise mainly from how electrons move and interact with each other. In this work, we show how protons (H$^+$) also exhibit magnetic behavior. We measured the magnetic susceptibility of the ammonium halides and identified pronounced increases at 232 K, 233 K and 243 K for NH$_4$I, NH$_4$Br and NH$_4$Cl, respectively, which all coincide with the geometric ordering of their ammonium cations. With extensive literature establishing the fact that the ammonium cations exhibit rotational motion even towards the lowest temperatures, we take into account that the orbital motion of the protons carries a magnetic moment and find it to be larger than that of the paired electrons. Consequently, the structural phase transitions are magnetically-driven as the system attempts to lift 8-fold energy degeneracies of the proton orbitals via Jahn-Teller distortions. Our findings identify that NH$_4$$^+$ cations are capable of comprising magnetism which appears to be ubiquitous in ammonia-based molecular solids.
Submitted 6 May, 2024;
originally announced May 2024.
-
Probing the pole origin of $X(3872)$ with the coupled-channel dynamics
Authors:
Jun-Zhang Wang,
Zi-Yang Lin,
Yan-Ke Chen,
Lu Meng,
Shi-Lin Zhu
Abstract:
The $X(3872)$, as the first and the most crucial member in the exotic charmoniumlike $XYZ$ family, has been studied for a long time. However, its dynamical origin, whether stemming from a $D\bar{D}^*$ hadronic molecule or the first excited $P$-wave charmonium $χ_{c1}(2P)$, remains controversial. In this Letter, we demonstrate that the $X(3872)$ definitely does not result from the mass shift of the higher bare $χ_{c1}(2P)$ resonance pole in the coupled-channel dynamics involving a short-distance $c\bar{c}$ core and the long-distance $D\bar{D}^*$ channels. Instead, it originates from either the $D\bar{D}^*$ molecular pole or the shadow pole associated with the $P$-wave charmonium, which depends on the concrete coupling mode between the $c\bar{c}$ and $D\bar{D}^*$. In order to further explore the nature of $X(3872)$, we carefully investigate potential mechanisms that contribute to its pole width, which suggests that the coupled-channel dynamics plays a critical role in causing a noticeable discrepancy between the pole widths of $X(3872)$ and $T_{cc}^+$. Interestingly, we establish a quantitative connection among the dynamical origin of $X(3872)$, its pole width and the properties of the predicted new resonance. The precise measurement of the pole width of $X(3872)$ and the search for the new charmoniumlike resonance become highly significant and can be anticipated in future LHCb, BESIII and Belle II experiments.
Submitted 25 April, 2024;
originally announced April 2024.
-
Towards Better Text-to-Image Generation Alignment via Attention Modulation
Authors:
Yihang Wu,
Xiao Cao,
Kaixin Li,
Zitan Chen,
Haonan Wang,
Lei Meng,
Zhiyong Huang
Abstract:
In text-to-image generation tasks, advances in diffusion models have improved the fidelity of generated results. However, these models encounter challenges when processing text prompts containing multiple entities and attributes. The uneven distribution of attention results in entity leakage and attribute misalignment. Training from scratch to address this issue requires large amounts of labeled data and is resource-consuming. Motivated by this, we propose an attribution-focusing mechanism, a training-free phase-wise mechanism that modulates attention in diffusion models. One of our core ideas is to guide the model to concentrate on the corresponding syntactic components of the prompt at distinct timesteps. To achieve this, we incorporate a temperature control mechanism within the early phases of the self-attention modules to mitigate entity leakage issues. An object-focused masking scheme and a phase-wise dynamic weight control mechanism are integrated into the cross-attention modules, enabling the model to discern the affiliation of semantic information between entities more effectively. The experimental results in various alignment scenarios demonstrate that our model attains better image-text alignment with minimal additional computational cost.
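The temperature control mentioned in this abstract can be illustrated with a generic temperature-scaled softmax. This is a minimal sketch of the general technique only, not the authors' implementation: the function name `softmax_with_temperature`, the example scores, and the temperature values are assumptions chosen for illustration.

```python
import math


def softmax_with_temperature(scores, temperature=1.0):
    """Softmax over raw attention scores after dividing them by a temperature.

    Illustrative helper (hypothetical name, not from the paper): a temperature
    below 1 sharpens the distribution, a temperature above 1 flattens it.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical attention scores of one query against three keys.
scores = [2.0, 1.0, 0.5]
sharp = softmax_with_temperature(scores, temperature=0.5)
flat = softmax_with_temperature(scores, temperature=2.0)
```

With a lower temperature the largest score receives a greater share of the attention weight, which is one plausible way to keep attention concentrated on a single entity.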
Submitted 22 April, 2024;
originally announced April 2024.
-
A Survey of Distributed Graph Algorithms on Massive Graphs
Authors:
Lingkai Meng,
Yu Shao,
Long Yuan,
Longbin Lai,
Peng Cheng,
Xue Li,
Wenyuan Yu,
Wenjie Zhang,
Xuemin Lin,
Jingren Zhou
Abstract:
Distributed processing of large-scale graph data has many practical applications and has been widely studied. In recent years, many distributed graph processing frameworks and algorithms have been proposed. While considerable effort has been devoted to analyzing these, mostly on the basis of their programming models, less research focuses on understanding their challenges in distributed environments. Applying graph tasks to distributed environments is not easy: our analysis shows that it often faces numerous challenges, including parallelism, load balancing, communication overhead, and bandwidth. In this paper, we provide an extensive overview of the current state-of-the-art in this field by outlining the challenges and solutions of distributed graph algorithms. We first conduct a systematic analysis of the inherent challenges in distributed graph processing, followed by an overview of existing general solutions. Subsequently, we survey the challenges highlighted in recent distributed graph processing papers and the strategies adopted to address them. Finally, we discuss the current research trends and identify potential future opportunities.
Submitted 9 April, 2024;
originally announced April 2024.
-
The Relativistic Spin Precession in the Compact Double Neutron Star System PSR~J1946+2052
Authors:
Lingqi Meng,
Weiwei Zhu,
Michael Kramer,
Xueli Miao,
Gregory Desvignes,
Lijing Shao,
Huanchen Hu,
Paulo C. C. Freire,
Yongkun Zhang,
Mengyao Xue,
Ziyao Fang,
David J. Champion,
Mao Yuan,
Chenchen Miao,
Jiarui Niu,
Qiuyang Fu,
Jumei Yao,
Yanjun Guo,
Chengmin Zhang
Abstract:
We observe systematic profile changes in the visible pulsar of the compact double neutron star system PSR~J1946+2052 using observations with the Five-hundred-meter Aperture Spherical radio Telescope (FAST). The interpulse of PSR~J1946+2052 changed from a single-peak to a double-peak shape from 2018 to 2021. We attribute this evolution to the relativistic spin precession of the pulsar. With the high sensitivity of FAST, we also measure significant polarization for the first time, allowing us to model this with the precessional rotating vector model. Assuming, to the first order, a circular hollow-cone-like emission beam pattern and assuming the validity of general relativity, we derive the binary's orbital inclination angle (${63^\circ}^{+5^\circ}_{-3^\circ}$) and the pulsar's spin geometry. The pulsar's spin vector and the orbital angular momentum vector are found to be only slightly misaligned (${0.21^\circ}^{+0.28^\circ}_{-0.10^\circ}$). The quoted uncertainties do not reflect the systematic uncertainties introduced by our model assumptions. By simulating future observations of profile and polarization evolution, we estimate that we could constrain the precession rate within a $43\%$ uncertainty in 9 years. Hence, we suggest that the system's profile evolution could be combined with precise pulsar timing to test general relativity in the future.
Submitted 26 March, 2024;
originally announced March 2024.
-
Unifying Lane-Level Traffic Prediction from a Graph Structural Perspective: Benchmark and Baseline
Authors:
Shuhao Li,
Yue Cui,
Jingyi Xu,
Libin Li,
Lingkai Meng,
Weidong Yang,
Fan Zhang,
Xiaofang Zhou
Abstract:
Traffic prediction has long been a focal and pivotal area in research, witnessing significant strides from city-level to road-level predictions in recent years. With the advancement of Vehicle-to-Everything (V2X) technologies, autonomous driving, and large-scale models in the traffic domain, lane-level traffic prediction has emerged as an indispensable direction. However, further progress in this field is hindered by the absence of comprehensive and unified evaluation standards, coupled with limited public availability of data and code. This paper extensively analyzes and categorizes existing research in lane-level traffic prediction, establishes a unified spatial topology structure and prediction tasks, and introduces a simple baseline model, GraphMLP, based on graph structure and MLP networks. We have replicated code that is not publicly available in existing studies and, based on this, thoroughly and fairly assessed various models in terms of effectiveness, efficiency, and applicability, providing insights for practical applications. Additionally, we have released three new datasets and the corresponding code to accelerate progress in this field, all of which can be found at https://github.com/ShuhaoLii/TITS24LaneLevel-Traffic-Benchmark.
Submitted 22 March, 2024;
originally announced March 2024.
-
Don't Half-listen: Capturing Key-part Information in Continual Instruction Tuning
Authors:
Yongquan He,
Xuancheng Huang,
Minghao Tang,
Lingxun Meng,
Xiang Li,
Wei Lin,
Wenyuan Zhang,
Yifu Gao
Abstract:
Instruction tuning for large language models (LLMs) can drive them to produce results consistent with human goals in specific downstream tasks. However, the process of continual instruction tuning (CIT) for LLMs may bring about the catastrophic forgetting (CF) problem, where previously learned abilities are degraded. Recent methods try to alleviate the CF problem by modifying models or replaying data, which may only remember the surface-level pattern of instructions and get confused on held-out tasks. In this paper, we propose a novel continual instruction tuning method based on Key-part Information Gain (KPIG). Our method computes the information gain on masked parts to dynamically replay data and refine the training objective, which enables LLMs to capture task-aware information relevant to the correct response and alleviate overfitting to general descriptions in instructions. In addition, we propose two metrics, P-score and V-score, to measure the generalization and instruction-following abilities of LLMs. Experiments demonstrate our method achieves superior performance on both seen and held-out tasks.
Submitted 15 March, 2024;
originally announced March 2024.
-
PaddingFlow: Improving Normalizing Flows with Padding-Dimensional Noise
Authors:
Qinglong Meng,
Chongkun Xia,
Xueqian Wang
Abstract:
Normalizing flow is a generative modeling approach with efficient sampling. However, flow-based models suffer from two issues: 1) If the target distribution is supported on a manifold, flow-based models might perform badly because of the mismatch between the dimensions of the latent target distribution and the data distribution. 2) Discrete data might make flow-based models collapse into a degenerate mixture of point masses. To sidestep these two issues, we propose PaddingFlow, a novel dequantization method, which improves normalizing flows with padding-dimensional noise. To implement PaddingFlow, only the dimension of normalizing flows needs to be modified. Thus, our method is easy to implement and computationally cheap. Moreover, the padding-dimensional noise is only added to the padding dimension, which means PaddingFlow can dequantize without changing data distributions. Implementing existing dequantization methods requires changing data distributions, which might degrade performance. We validate our method on the main benchmarks of unconditional density estimation, including five tabular datasets and four image datasets for Variational Autoencoder (VAE) models, and on Inverse Kinematics (IK) experiments, which are conditional density estimation tasks. The results show that PaddingFlow can perform better in all experiments in this paper, which means PaddingFlow is widely suitable for various tasks. The code is available at: https://github.com/AdamQLMeng/PaddingFlow.
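The idea of adding noise only in extra padding dimensions can be sketched as follows. This is an illustrative assumption about the general scheme described in the abstract, not code from the PaddingFlow repository: the function name `pad_with_noise` and the parameters `pad_dims` and `sigma` are hypothetical, and the paper's actual noise design may differ.

```python
import random


def pad_with_noise(x, pad_dims=1, sigma=1.0, rng=None):
    """Append `pad_dims` Gaussian-noise coordinates to a data point.

    Hypothetical helper: the original coordinates are copied unchanged,
    so the data distribution itself is not perturbed; only the new
    (padding) dimensions carry noise.
    """
    rng = rng or random.Random()
    noise = [rng.gauss(0.0, sigma) for _ in range(pad_dims)]
    return list(x) + noise


# A 3-dimensional data point padded to 5 dimensions.
point = [0.0, 1.0, 1.0]
padded = pad_with_noise(point, pad_dims=2, sigma=0.1, rng=random.Random(0))
```

A flow trained on such padded points would operate in the enlarged dimension, and samples could be mapped back to data space by dropping the padding coordinates.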
Submitted 23 April, 2024; v1 submitted 12 March, 2024;
originally announced March 2024.
-
Dynamic Perturbation-Adaptive Adversarial Training on Medical Image Classification
Authors:
Shuai Li,
Xiaoguang Ma,
Shancheng Jiang,
Lu Meng
Abstract:
Remarkable successes were made in Medical Image Classification (MIC) recently, mainly due to wide applications of convolutional neural networks (CNNs). However, adversarial examples (AEs) exhibited imperceptible similarity with raw data, raising serious concerns about network robustness. Although adversarial training (AT), in response to malevolent AEs, was recognized as an effective approach to improve robustness, it was challenging to overcome the generalization decline of networks caused by the AT. In this paper, in order to preserve high generalization while improving robustness, we proposed a dynamic perturbation-adaptive adversarial training (DPAAT) method, which placed AT in a dynamic learning environment to generate adaptive data-level perturbations and provided a dynamically updated criterion by loss information collections to handle the disadvantage of fixed perturbation sizes in conventional AT methods and the dependence on external transference. Comprehensive testing on the dermatology HAM10000 dataset showed that the DPAAT not only achieved better robustness improvement and generalization preservation but also significantly enhanced mean average precision and interpretability on various CNNs, indicating its great potential as a generic adversarial training method for MIC.
Submitted 11 March, 2024;
originally announced March 2024.
-
A&B BNN: Add&Bit-Operation-Only Hardware-Friendly Binary Neural Network
Authors:
Ruichen Ma,
Guanchao Qiao,
Yian Liu,
Liwei Meng,
Ning Ning,
Yang Liu,
Shaogang Hu
Abstract:
Binary neural networks utilize 1-bit quantized weights and activations to reduce both the model's storage demands and computational burden. However, advanced binary architectures still incorporate millions of inefficient and non-hardware-friendly full-precision multiplication operations. A&B BNN is proposed to directly remove part of the multiplication operations in a traditional BNN and replace the rest with an equal number of bit operations, introducing the mask layer and the quantized RPReLU structure based on the normalizer-free network architecture. The mask layer can be removed during inference by leveraging the intrinsic characteristics of BNN with straightforward mathematical transformations to avoid the associated multiplication operations. The quantized RPReLU structure enables more efficient bit operations by constraining its slope to be integer powers of 2. The proposed network achieved 92.30%, 69.35%, and 66.89% on the CIFAR-10, CIFAR-100, and ImageNet datasets, respectively, which is competitive with the state-of-the-art. Ablation studies have verified the efficacy of the quantized RPReLU structure, leading to a 1.14% enhancement on ImageNet compared to using a fixed slope RLeakyReLU. The proposed add&bit-operation-only BNN offers an innovative approach for hardware-friendly network architecture.
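Why a slope restricted to integer powers of 2 permits bit operations can be seen in a small sketch. The example below assumes integer (fixed-point) activations, which the abstract does not specify, and the function name `scale_by_power_of_two` is hypothetical; it shows only the generic shift-for-multiply identity, not the paper's RPReLU.

```python
def scale_by_power_of_two(x: int, exponent: int) -> int:
    """Multiply integer x by 2**exponent using only shifts.

    Hypothetical helper for illustration. A negative exponent uses an
    arithmetic right shift, which rounds toward negative infinity.
    """
    if exponent >= 0:
        return x << exponent
    return x >> (-exponent)


# Slope 2**3 = 8 and slope 2**-2 = 0.25 applied to integer activations.
up = scale_by_power_of_two(12, 3)
down = scale_by_power_of_two(12, -2)
down_negative = scale_by_power_of_two(-13, -2)
```

Because the right shift floors its result, a negative input such as -13 scaled by 0.25 yields -4 rather than -3.25; any real design would need to account for that rounding.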
Submitted 6 March, 2024;
originally announced March 2024.
-
Identify the new state $Y(3872)$ as the P-wave $D\bar{D}^*/\bar{D}D^*$ resonance
Authors:
Zi-Yang Lin,
Jun-Zhang Wang,
Jian-Bo Cheng,
Lu Meng,
Shi-Lin Zhu
Abstract:
The BESIII Collaboration recently observed a new charmonium-like vector state $Y(3872)$ in $e^+e^-\rightarrow D\bar{D}$, which should be the first P-wave $D\bar{D}^*/\bar{D}D^*$ molecular resonance. The experimental and theoretical identification of the P-wave dimeson state holds paramount importance in enhancing our comprehension of the non-perturbative QCD and few-body physics. Its existence is firmly established in a unified meson-exchange model which simultaneously depicts the features of the $χ_{c1}(3872)$, $Z_c(3900)$ and $T_{cc}(3875)$. This scenario can be directly examined in the $e^+e^-\rightarrow D\bar{D}^*/\bar{D}D^*$ cross section to see whether a resonance exists at the threshold. The credibility of the investigations is also ensured by the fact that the P-wave interaction dominantly arises from the well-known long-range pion exchange. Additionally, the existence of the P-wave resonance only depends on the interaction strength and is less sensitive to the potential shapes. We extensively calculate all systems up to P-wave with various quantum numbers and predict a dense population of the $D\bar{D}^*/\bar{D}D^*$ and $DD^*$ states, where the S-wave $D\bar{D}^*/\bar{D}D^*$ state with $I^G (J^{PC})=0^- (1^{+-})$, P-wave $D\bar{D}^*/\bar{D}D^*$ state with $I^G(J^{PC})=0^+(0^{-+})$, and P-wave $DD^*$ state with $I(J^P)=0(0^-)$ are more likely to be observed in experiments.
Submitted 3 March, 2024;
originally announced March 2024.
-
Comparative Analysis of ImageNet Pre-Trained Deep Learning Models and DINOv2 in Medical Imaging Classification
Authors:
Yuning Huang,
Jingchen Zou,
Lanxi Meng,
Xin Yue,
Qing Zhao,
Jianqiang Li,
Changwei Song,
Gabriel Jimenez,
Shaowu Li,
Guanghui Fu
Abstract:
Medical image analysis frequently encounters data scarcity challenges. Transfer learning has been effective in addressing this issue while conserving computational resources. The recent advent of foundational models like DINOv2, which uses the vision transformer architecture, has opened new opportunities in the field and garnered significant interest. However, DINOv2's performance on clinical data still needs to be verified. In this paper, we performed a glioma grading task using three clinical modalities of brain MRI data. We compared the performance of various pre-trained deep learning models, including those based on ImageNet and DINOv2, in a transfer learning context. Our focus was on understanding the impact of the freezing mechanism on performance. We also validated our findings on three other types of public datasets: chest radiography, fundus radiography, and dermoscopy. Our findings indicate that in our clinical dataset, DINOv2's performance was not as strong as ImageNet-based pre-trained models, whereas in public datasets, DINOv2 generally outperformed other models, especially when using the frozen mechanism. Similar performance was observed with various sizes of DINOv2 models across different tasks. In summary, DINOv2 is viable for medical image classification tasks, particularly with data resembling natural images. However, its effectiveness may vary with data that significantly differs from natural images such as MRI. In addition, employing smaller versions of the model can be adequate for medical tasks, offering resource-saving benefits. Our code is available at https://github.com/GuanghuiFU/medical_DINOv2_eval.
Submitted 13 February, 2024; v1 submitted 12 February, 2024;
originally announced February 2024.
-
A Manifold Representation of the Key in Vision Transformers
Authors:
Li Meng,
Morten Goodwin,
Anis Yazidi,
Paal Engelstad
Abstract:
Vision Transformers implement multi-head self-attention via stacking multiple attention blocks. The query, key, and value are often intertwined and generated within those blocks via a single, shared linear transformation. This paper explores the concept of disentangling the key from the query and value, and adopting a manifold representation for the key. Our experiments reveal that decoupling and endowing the key with a manifold structure can enhance the model's performance. Specifically, ViT-B exhibits a 0.87% increase in top-1 accuracy, while Swin-T sees a boost of 0.52% in top-1 accuracy on the ImageNet-1K dataset, with eight charts in the manifold key. Our approach also yields positive results in object detection and instance segmentation tasks on the COCO dataset. We establish that these performance gains are not merely due to the simplicity of adding more parameters and computations. Future research may investigate strategies for cutting the budget of such representations and aim for further performance improvements based on our findings.
Submitted 7 June, 2024; v1 submitted 1 February, 2024;
originally announced February 2024.
-
Tighter Lower Bounds on Aperiodic Ambiguity Function and Their Asymptotic Achievability
Authors:
Lingsheng Meng,
Yong Liang Guan,
Yao Ge,
Zilong Liu,
Pingzhi Fan
Abstract:
This paper presents tighter lower bounds on the maximum aperiodic ambiguity function (AF) magnitude of unimodular sequences under certain delay-Doppler low ambiguity zones (LAZ). These bounds are derived by exploiting the upper and lower bounds on the Frobenius norm of the weighted auto- and cross-AF matrices, with the introduction of two weight vectors associated with the delay and Doppler shifts, respectively. As a second major contribution, we demonstrate that our derived lower bounds are asymptotically achievable with selected Chu sequence sets by analyzing their maximum auto- and cross-AF magnitudes within certain LAZ.
Submitted 18 July, 2024; v1 submitted 1 February, 2024;
originally announced February 2024.
-
Smart Fitting Room: A One-stop Framework for Matching-aware Virtual Try-on
Authors:
Mingzhe Yu,
Yunshan Ma,
Lei Wu,
Kai Cheng,
Xue Li,
Lei Meng,
Tat-Seng Chua
Abstract:
The development of virtual try-on has revolutionized online shopping by allowing customers to visualize themselves in various fashion items, thus extending the in-store try-on experience to cyberspace. Although virtual try-on has attracted considerable research initiatives, existing systems only focus on the quality of image generation, overlooking whether the fashion item is a good match to the given person and clothes. Recognizing this gap, we propose to design a one-stop Smart Fitting Room, with the novel formulation of matching-aware virtual try-on. Following this formulation, we design a Hybrid Matching-aware Virtual Try-On Framework (HMaVTON), which combines retrieval-based and generative methods to foster a more personalized virtual try-on experience. This framework integrates a hybrid mix-and-match module and an enhanced virtual try-on module. The former can recommend fashion items available on the platform to boost sales and generate clothes that meet the diverse tastes of consumers. The latter provides high-quality try-on effects, delivering a one-stop shopping service. To validate the effectiveness of our approach, we enlist the expertise of fashion designers for a professional evaluation, assessing the rationality and diversity of the clothes combinations and conducting an evaluation matrix analysis. Our method significantly enhances the practicality of virtual try-on. The code is available at https://github.com/Yzcreator/HMaVTON.
Submitted 20 April, 2024; v1 submitted 30 January, 2024;
originally announced January 2024.
-
Benchmark calculations of fully heavy compact and molecular tetraquark states
Authors:
Wei-Lin Wu,
Yan-Ke Chen,
Lu Meng,
Shi-Lin Zhu
Abstract:
We calculate the mass spectrum of the S-wave fully heavy tetraquark systems $ QQ\bar Q\bar Q~(Q=c,b) $ with both normal $ (J^{PC}=0^{++},1^{+-},2^{++}) $ and exotic $ (J^{PC}=0^{+-},1^{++},2^{+-}) $ C-parities using three different quark potential models (AL1, AP1, BGS). The exotic C-parity systems refer to the ones that cannot be composed of two S-wave ground heavy quarkonia. We incorporate the molecular dimeson and compact diquark-antidiquark spatial correlations simultaneously, thereby discerning the actual configurations of the states. We employ the Gaussian expansion method to solve the four-body Schrödinger equation, and the complex scaling method to identify the resonant states. The mass spectra in three different models qualitatively agree with each other. We obtain several resonant states with $ J^{PC} = 0^{++}, 1^{+-}, 2^{++}, 1^{++} $ in the mass region $(6.92,7.30)\, \mathrm{GeV}$, some of which are good candidates of the experimentally observed $X(6900)$ and $X(7200)$. We also obtain several exotic C-parity zero-width states with $ J^{PC}=0^{+-} $ and $ 2^{+-} $. These zero-width states have no corresponding S-wave diquarkonium threshold and can only decay strongly to final states with P-wave quarkonia. With the notation $T_{4Q,J(C)}(M)$, we deduce from the root mean square radii that the $ X(7200) $ candidates $ T_{4c,0(+)}(7173), T_{4c,2(+)}(7214) $ and the state $ T_{4c,1(-)}(7191) $ look like molecular states although most of the resonant and zero-width states are compact states.
Submitted 26 March, 2024; v1 submitted 26 January, 2024;
originally announced January 2024.
-
UNIT-DSR: Dysarthric Speech Reconstruction System Using Speech Unit Normalization
Authors:
Yuejiao Wang,
Xixin Wu,
Disong Wang,
Lingwei Meng,
Helen Meng
Abstract:
Dysarthric speech reconstruction (DSR) systems aim to automatically convert dysarthric speech into normal-sounding speech. The technology eases communication with speakers affected by the neuromotor disorder and enhances their social inclusion. NED-based (Neural Encoder-Decoder) systems have significantly improved the intelligibility of the reconstructed speech as compared with GAN-based (Generative Adversarial Network) approaches, but the approach is still limited by training inefficiency caused by the cascaded pipeline and auxiliary tasks of the content encoder, which may in turn affect the quality of reconstruction. Inspired by self-supervised speech representation learning and discrete speech units, we propose a Unit-DSR system, which harnesses the powerful domain-adaptation capacity of HuBERT for training efficiency improvement and utilizes speech units to constrain the dysarthric content restoration in a discrete linguistic space. Compared with NED approaches, the Unit-DSR system only consists of a speech unit normalizer and a Unit HiFi-GAN vocoder, which is considerably simpler without cascaded sub-modules or auxiliary tasks. Results on the UASpeech corpus indicate that Unit-DSR outperforms competitive baselines in terms of content restoration, reaching a 28.2% relative average word error rate reduction when compared to original dysarthric speech, and shows robustness against speed perturbation and noise.
Submitted 26 January, 2024;
originally announced January 2024.