-
SwitchCIT: Switching for Continual Instruction Tuning of Large Language Models
Authors:
Xinbo Wu,
Max Hartman,
Vidhata Arjun Jayaraman,
Lav R. Varshney
Abstract:
Large language models (LLMs) have exhibited impressive capabilities in various domains, particularly in general language understanding. However, these models, trained on massive text data, may not be finely optimized for specific tasks triggered by instructions. Continual instruction tuning is crucial to adapt LLMs to evolving tasks and domains, ensuring their effectiveness and relevance across a wide range of applications. In continual instruction tuning, where models are trained sequentially on different tasks, catastrophic forgetting can occur, degrading performance on previously learned tasks. This work addresses catastrophic forgetting in continual instruction learning for LLMs through a switching mechanism that routes computations to parameter-efficient tuned models. We demonstrate the effectiveness of our method through experiments on continual instruction tuning of different natural language generation tasks.
Submitted 16 July, 2024;
originally announced July 2024.
-
Measurement of the branching fraction of $D^+_s\to \ell^+ν_\ell$ via $e^+e^-\to D^{*+}_{s} D^{*-}_{s}$
Authors:
BESIII Collaboration,
M. Ablikim,
M. N. Achasov,
P. Adlarson,
O. Afedulidis,
X. C. Ai,
R. Aliberti,
A. Amoroso,
Q. An,
Y. Bai,
O. Bakina,
I. Balossino,
Y. Ban,
H. -R. Bao,
V. Batozskaya,
K. Begzsuren,
N. Berger,
M. Berlowski,
M. Bertani,
D. Bettoni,
F. Bianchi,
E. Bianco,
A. Bortone,
I. Boyko,
R. A. Briere, et al. (634 additional authors not shown)
Abstract:
Based on $10.64~\mathrm{fb}^{-1}$ of $e^+e^-$ collision data taken at center-of-mass energies between 4.237 and 4.699 GeV with the BESIII detector, we study the leptonic $D^+_s$ decays using the $e^+e^-\to D^{*+}_{s} D^{*-}_{s}$ process. The branching fractions of $D_s^+\to\ell^+ν_{\ell}\,(\ell=μ,τ)$ are measured to be $\mathcal{B}(D_s^+\toμ^+ν_μ)=(\bfmuv)\%$ and $\mathcal{B}(D_s^+\toτ^+ν_τ)=(\bftauv)\%$, respectively. The product of the decay constant and Cabibbo-Kobayashi-Maskawa matrix element $|V_{cs}|$ is determined to be $f_{D_s^+}|V_{cs}|=(\mufdsxvcsresult)_{μν}~\mathrm{MeV}$ and $f_{D_s^+}|V_{cs}|=(\taufdsxvcsresult)_{τν}~\mathrm{MeV}$, respectively. Taking the value of $|V_{cs}|$ from a global fit in the Standard Model, we obtain ${f_{D^+_s}}=(\mufdsresult)_{μν}$\,MeV and ${f_{D^+_s}}=(\taufdsresult)_{τν}$\,MeV, respectively. Conversely, taking the value for $f_{D_s^+}$ from the latest lattice quantum chromodynamics calculation, we obtain $|V_{cs}| =(\muvcsresult)_{μν}$ and $|V_{cs}| = (\tauvcsresult)_{τν}$, respectively.
Submitted 16 July, 2024;
originally announced July 2024.
-
Holographic Lifshitz flows
Authors:
Matteo Baggioli,
Oriol Pujolas,
Xin-Meng Wu
Abstract:
Without Lorentz symmetry, generic fixed points of the renormalization group (RG) are labelled by their dynamical (or `Lifshitz') exponent $z$. Hence, a rich variety of possible RG flows arises. The first example is already given by the standard non-relativistic limit, which can be viewed as the flow from a $z=1$ UV fixed point to a $z=2$ IR fixed point. In strongly coupled theories, there are good arguments suggesting that Lorentz invariance can emerge dynamically in the IR from a Lorentz violating UV. In this work, we perform a generic study of fixed points and the possible RG flows among them in a minimal bottom-up holographic model without Lorentz invariance, aiming to shed light on the possible options and the related phenomenology. We find: i) A minor generalization of previous models involving a massive vector field with allowed self-couplings leads to a much more efficient emergence of Lorentz invariance than in previous attempts. Moreover, we find that generically the larger the UV dynamical exponent $z_{UV}$, the faster the recovery of Lorentz symmetry in the IR. ii) We construct explicitly a holographic model with a line of fixed points, realizing different Lifshitz scaling along the line. iii) We also confirm the monotonicity of a recently proposed a-function along all our Lorentz violating RG flows.
Submitted 16 July, 2024;
originally announced July 2024.
-
Generalizing soft actor-critic algorithms to discrete action spaces
Authors:
Le Zhang,
Yong Gu,
Xin Zhao,
Yanshuo Zhang,
Shu Zhao,
Yifei Jin,
Xinxin Wu
Abstract:
ATARI is a suite of video games used by reinforcement learning (RL) researchers to test the effectiveness of learning algorithms. Receiving only the raw pixels and the game score, the agent learns to develop sophisticated strategies, even reaching a level comparable to that of a professional human games tester. Ideally, we also want an agent requiring very few interactions with the environment. Previous competitive model-free algorithms for the task use the value-based Rainbow algorithm without any policy head. In this paper, we change this by proposing a practical discrete variant of the soft actor-critic (SAC) algorithm. The new variant enables off-policy learning using policy heads for discrete domains. By incorporating it into the advanced Rainbow variant, i.e., the ``bigger, better, faster'' (BBF), the resulting SAC-BBF improves the previous state-of-the-art interquartile mean (IQM) from 1.045 to 1.088, and it achieves these results using only replay ratio (RR) 2. With the lower RR of 2, the training time of SAC-BBF is strictly one-third of the time required for BBF to achieve an IQM of 1.045 using RR 8. As a value of IQM greater than one indicates super-human performance, SAC-BBF is also the only model-free algorithm with a super-human level using only RR 2. The code is publicly available on GitHub at https://github.com/lezhang-thu/bigger-better-faster-SAC.
Submitted 7 July, 2024;
originally announced July 2024.
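The interquartile mean (IQM) quoted in the SAC-BBF abstract can be illustrated with a short sketch. It is not taken from the authors' repository: the function name `interquartile_mean` and the sample scores are hypothetical, and the trimming rule (drop the lowest and highest `n // 4` values) is a simplifying assumption that matches the usual definition only when the number of scores is a multiple of four.

```python
def interquartile_mean(scores):
    """Mean of the middle half of the scores (assumed trimming rule: n // 4 per tail)."""
    ordered = sorted(scores)
    n = len(ordered)
    if n == 0:
        raise ValueError("scores must be non-empty")
    k = n // 4  # number of values dropped from each end
    middle = ordered[k:n - k]
    return sum(middle) / len(middle)


# Hypothetical human-normalized scores; the outlier 100 does not affect the IQM.
example_scores = [0, 1, 2, 3, 4, 5, 6, 100]
example_iqm = interquartile_mean(example_scores)
```

Because the extreme quarters are discarded, a single very high or very low game score has no influence on the aggregate, which is one reason the statistic is used for benchmark suites of this kind.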
-
Globally-Constrained Decentralized Optimization with Variable Coupling
Authors:
Dandan Wang,
Xuyang Wu,
Zichong Ou,
Jie Lu
Abstract:
Many realistic decision-making problems in networked scenarios, such as formation control and collaborative task offloading, often involve complicatedly entangled local decisions, which, however, have not been sufficiently investigated yet. Motivated by this, we study a class of decentralized optimization problems with a variable coupling structure that is new to the literature. Specifically, we consider a network of nodes collaborating to minimize a global objective subject to a collection of global inequality and equality constraints, which are formed by the local objective and constraint functions of the nodes. On top of that, we allow such local functions of each node to depend on not only its own decision variable but the decisions of its neighbors as well. To address this problem, we propose a decentralized projected primal-dual algorithm. It first incorporates a virtual-queue technique with a primal-dual-primal scheme, and then linearizes the non-separable objective and constraint functions to enable decentralized implementation. Under mild conditions, we derive $O(1/k)$ convergence rates for both objective error and constraint violations. Finally, two numerical experiments corroborate our theoretical results and illustrate the competitive performance of the proposed algorithm.
Submitted 15 July, 2024;
originally announced July 2024.
-
Understanding Matrix Function Normalizations in Covariance Pooling through the Lens of Riemannian Geometry
Authors:
Ziheng Chen,
Yue Song,
Xiao-Jun Wu,
Gaowen Liu,
Nicu Sebe
Abstract:
Global Covariance Pooling (GCP) has been demonstrated to improve the performance of Deep Neural Networks (DNNs) by exploiting second-order statistics of high-level representations. GCP typically performs classification of the covariance matrices by applying matrix function normalization, such as matrix logarithm or power, followed by a Euclidean classifier. However, covariance matrices inherently lie in a Riemannian manifold, known as the Symmetric Positive Definite (SPD) manifold. The current literature does not provide a satisfactory explanation of why Euclidean classifiers can be applied directly to Riemannian features after the normalization of the matrix power. To mitigate this gap, this paper provides a comprehensive and unified understanding of the matrix logarithm and power from a Riemannian geometry perspective. The underlying mechanism of matrix functions in GCP is interpreted from two perspectives: one based on tangent classifiers (Euclidean classifiers on the tangent space) and the other based on Riemannian classifiers. Via theoretical analysis and empirical validation through extensive experiments on fine-grained and large-scale visual classification datasets, we conclude that the working mechanism of the matrix functions should be attributed to the Riemannian classifiers they implicitly respect.
Submitted 15 July, 2024;
originally announced July 2024.
-
Large Language Model-based FMRI Encoding of Language Functions for Subjects with Neurocognitive Disorder
Authors:
Yuejiao Wang,
Xianmin Gong,
Lingwei Meng,
Xixin Wu,
Helen Meng
Abstract:
Functional magnetic resonance imaging (fMRI) is essential for developing encoding models that identify functional changes in language-related brain areas of individuals with Neurocognitive Disorders (NCD). While large language model (LLM)-based fMRI encoding has shown promise, existing studies predominantly focus on healthy, young adults, overlooking older NCD populations and cognitive level correlations. This paper explores language-related functional changes in older NCD adults using LLM-based fMRI encoding and brain scores, addressing current limitations. We analyze the correlation between brain scores and cognitive scores at both whole-brain and language-related ROI levels. Our findings reveal that higher cognitive abilities correspond to better brain scores, with correlations peaking in the middle temporal gyrus. This study highlights the potential of fMRI encoding models and brain scores for detecting early functional changes in NCD patients.
Submitted 14 July, 2024;
originally announced July 2024.
-
V2I-Calib: A Novel Calibration Approach for Collaborative Vehicle and Infrastructure LiDAR Systems
Authors:
Qianxin Qu,
Yijin Xiong,
Xin Wu,
Hanyu Li,
Shichun Guo
Abstract:
Cooperative vehicle and infrastructure LiDAR systems hold great potential, yet their implementation faces numerous challenges. Calibration of LiDAR systems across heterogeneous vehicle and infrastructure endpoints is a critical step to ensure the accuracy and consistency of perception system data, necessitating calibration methods that are real-time and stable. To this end, this paper introduces a novel calibration method for cooperative vehicle and road infrastructure LiDAR systems, which exploits spatial association information between detection boxes. The method centers around a novel Overall IoU metric that reflects the correlation of targets between vehicle and infrastructure, enabling real-time monitoring of calibration results. We search for common matching boxes between vehicle and infrastructure nodes by constructing an affinity matrix. Subsequently, these matching boxes undergo extrinsic parameter computation and optimization. Comparative and ablation experiments on the DAIR-V2X dataset confirm the superiority of our method. To better reflect the differences in calibration results, we have categorized the calibration tasks on the DAIR-V2X dataset based on their level of difficulty, enriching the dataset's utility for future research. Our project is available at https://github.com/MassimoQu/v2i-calib .
Submitted 14 July, 2024;
originally announced July 2024.
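The V2I-Calib abstract mentions matching detection boxes through an affinity matrix and an IoU-based metric, but it does not define either one. The sketch below therefore shows only the standard pairwise IoU for axis-aligned 2D boxes and, as an assumption, uses that IoU as the affinity between vehicle-side and infrastructure-side boxes. The names `box_iou` and `affinity_matrix` and the box layout `(x_min, y_min, x_max, y_max)` are hypothetical and are not taken from the project repository.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    overlap_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = overlap_w * overlap_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def affinity_matrix(vehicle_boxes, infra_boxes):
    """Rows index vehicle boxes, columns index infrastructure boxes (assumed affinity: IoU)."""
    return [[box_iou(v, i) for i in infra_boxes] for v in vehicle_boxes]


# Hypothetical boxes already expressed in a shared coordinate frame.
vehicle = [(0.0, 0.0, 2.0, 2.0)]
infra = [(1.0, 1.0, 3.0, 3.0), (5.0, 5.0, 6.0, 6.0)]
scores = affinity_matrix(vehicle, infra)
```

In a real calibration pipeline the boxes would be oriented 3D boxes and the affinity would depend on the current extrinsic estimate; this sketch only fixes the idea of scoring every vehicle/infrastructure pair.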
-
Triggering the Untriggered: The First Einstein Probe-Detected Gamma-Ray Burst 240219A and Its Implications
Authors:
Yi-Han Iris Yin,
Bin-Bin Zhang,
Jun Yang,
Hui Sun,
Chen Zhang,
Yi-Xuan Shao,
You-Dong Hu,
Zi-Pei Zhu,
Dong Xu,
Li An,
He Gao,
Xue-Feng Wu,
Bing Zhang,
Alberto Javier Castro-Tirado,
Shashi B. Pandey,
Arne Rau,
Weihua Lei,
Wei Xie,
Giancarlo Ghirlanda,
Luigi Piro,
Paul O'Brien,
Eleonora Troja,
Peter Jonker,
Yun-Wei Yu,
Jie An, et al. (26 additional authors not shown)
Abstract:
The Einstein Probe (EP) achieved its first detection and localization of a bright X-ray flare, EP240219a, on February 19, 2024, during its commissioning phase. Subsequent targeted searches triggered by the EP240219a alert identified a faint, untriggered gamma-ray burst (GRB) in the archived data of Fermi/GBM, Swift/BAT, Insight-HXMT/HE and INTEGRAL/SPI-ACS. The EP/WXT light curve reveals a long duration of approximately 160 seconds with a slow decay, whereas the Fermi/GBM light curve shows a total duration of approximately 70 seconds. The peak in the Fermi/GBM light curve occurs slightly later with respect to the peak seen in the EP/WXT light curve. Our spectral analysis shows that a single cutoff power-law model effectively describes the joint EP/WXT-Fermi/GBM spectra in general, indicating coherent broad emission typical of GRBs. The model yielded a photon index of $\sim -1.70 \pm 0.05$ and a peak energy of $\sim 257 \pm 134$ keV. After detection of GRB 240219A, long-term observations identified several candidates in optical and radio wavelengths, none of which was confirmed as the afterglow counterpart during subsequent optical and near-infrared follow-ups. The analysis of GRB 240219A classifies it as an X-ray rich GRB with a high peak energy, presenting both challenges and opportunities for studying the physical origins of X-ray flashes (XRFs), X-ray rich GRBs (XRRs), and classical GRBs (C-GRBs). Furthermore, linking the cutoff power-law component to non-thermal synchrotron radiation suggests that the burst is driven by a Poynting flux-dominated outflow.
Submitted 14 July, 2024;
originally announced July 2024.
-
WPS-SAM: Towards Weakly-Supervised Part Segmentation with Foundation Models
Authors:
Xinjian Wu,
Ruisong Zhang,
Jie Qin,
Shijie Ma,
Cheng-Lin Liu
Abstract:
Segmenting and recognizing diverse object parts is crucial in computer vision and robotics. Despite significant progress in object segmentation, part-level segmentation remains underexplored due to complex boundaries and scarce annotated data. To address this, we propose a novel Weakly-supervised Part Segmentation (WPS) setting and an approach called WPS-SAM, built on the large-scale pre-trained vision foundation model, Segment Anything Model (SAM). WPS-SAM is an end-to-end framework designed to extract prompt tokens directly from images and perform pixel-level segmentation of part regions. During its training phase, it only uses weakly supervised labels in the form of bounding boxes or points. Extensive experiments demonstrate that, through exploiting the rich knowledge embedded in pre-trained foundation models, WPS-SAM outperforms other segmentation models trained with pixel-level strong annotations. Specifically, WPS-SAM achieves 68.93% mIOU and 79.53% mACC on the PartImageNet dataset, surpassing state-of-the-art fully supervised methods by approximately 4% in terms of mIOU.
Submitted 14 July, 2024;
originally announced July 2024.
-
Minimizing PLM-Based Few-Shot Intent Detectors
Authors:
Haode Zhang,
Xiao-Ming Wu,
Albert Y. S. Lam
Abstract:
Recent research has demonstrated the feasibility of training efficient intent detectors based on pre-trained language models (PLMs) with limited labeled data. However, deploying these detectors in resource-constrained environments such as mobile devices poses challenges due to their large sizes. In this work, we aim to address this issue by exploring techniques to minimize the size of PLM-based intent detectors trained with few-shot data. Specifically, we utilize large language models (LLMs) for data augmentation, employ a cutting-edge model compression method for knowledge distillation, and devise a vocabulary pruning mechanism called V-Prune. Through these approaches, we successfully achieve a compression ratio of 21 in model memory usage, including both Transformer and the vocabulary, while maintaining almost identical performance levels on four real-world benchmarks.
Submitted 13 July, 2024;
originally announced July 2024.
-
Detection of hidden emissions in two rotating radio transients with high surface magnetic fields
Authors:
S. B. Zhang,
X. Yang,
J. J. Geng,
Y. P. Yang,
X. F. Wu
Abstract:
Rotating Radio Transients (RRATs) are neutron stars emitting sporadic radio pulses. The unique emission of RRATs has been proposed to resemble that of known pulsar types, such as extreme nulling pulsars or pulsars with giant pulses. However, the presence of additional radiation beyond these sporadic pulses remains unclear. Through high-sensitivity observations and extended tracking, we detected sequential weak emissions in two RRATs with relatively high surface magnetic fields ($B_s > 10^{13}$ G): J1846-0257 and J1854+0306. These emissions show peak flux densities of 0.15 and 0.41 mJy, up to 687 and 512 times weaker than our detected RRAT single pulses, respectively. The weak emissions contribute small fractions (~ 16% and 5%) to the total radio pulse energy releases, contrasting significantly with giant-pulse pulsars where normal pulses dominate. Polarization analysis of J1854+0306 suggests that its sporadic RRAT pulses may originate from intermittent enhanced sparking processes due to magnetospheric evolution. Our findings indicate that some RRATs may represent a novel class of pulsars, distinct from any previously known subclass. Further observations of sources with similar rotational properties using high-sensitivity instruments could validate the generality of these hidden emissions.
Submitted 13 July, 2024;
originally announced July 2024.
-
Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System
Authors:
Lingwei Meng,
Jiawen Kang,
Yuejiao Wang,
Zengrui Jin,
Xixin Wu,
Xunying Liu,
Helen Meng
Abstract:
Multi-talker speech recognition and target-talker speech recognition, both of which involve transcription in multi-talker contexts, remain significant challenges. However, existing methods rarely attempt to address both tasks simultaneously. In this study, we propose a pioneering approach to empower Whisper, a speech foundation model, to tackle joint multi-talker and target-talker speech recognition tasks. Specifically, (i) we freeze Whisper and plug a Sidecar separator into its encoder to separate the mixed embedding for multiple talkers; (ii) a Target Talker Identifier is introduced to identify the embedding flow of the target talker on the fly, requiring only three-second enrollment speech as a cue; (iii) soft prompt tuning for the decoder is explored for better task adaptation. Our method outperforms previous methods on two- and three-talker LibriMix and LibriSpeechMix datasets for both tasks, and delivers acceptable zero-shot performance on multi-talker ASR on the AishellMix Mandarin dataset.
Submitted 13 July, 2024;
originally announced July 2024.
-
MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts
Authors:
Zhenpeng Su,
Zijia Lin,
Xue Bai,
Xing Wu,
Yizhe Xiong,
Haoran Lian,
Guangyuan Ma,
Hui Chen,
Guiguang Ding,
Wei Zhou,
Songlin Hu
Abstract:
Scaling model capacity enhances its capabilities but significantly increases computation. Mixture-of-Experts models (MoEs) address this by allowing model capacity to scale without substantially increasing training or inference costs. Despite their promising results, MoE models encounter several challenges. Primarily, the dispersion of training tokens across multiple experts can lead to underfitting, particularly for infrequent tokens. Additionally, while fixed routing mechanisms can mitigate this issue, they compromise on the diversity of representations. In this paper, we propose MaskMoE, a method designed to enhance token-level learning by employing a routing masking technique within the Mixture-of-Experts model. MaskMoE is capable of maintaining representation diversity while achieving more comprehensive training. Experimental results demonstrate that our method outperforms previous dominant Mixture-of-Experts models in both perplexity (PPL) and downstream tasks.
Submitted 13 July, 2024;
originally announced July 2024.
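The MaskMoE abstract does not describe how its routing masks are built, so the following is only a generic sketch of what applying a routing mask to router scores can look like. Everything here is an assumption for illustration: the function name `masked_top1_route`, the use of top-1 selection, and the boolean per-expert visibility list are hypothetical and are not claimed to be the authors' method.

```python
import math


def masked_top1_route(logits, visible):
    """Return the index of the best-scoring expert among those the mask leaves visible."""
    if len(logits) != len(visible):
        raise ValueError("logits and visible must have the same length")
    # Hidden experts get a score of -inf so they can never be selected.
    masked = [score if keep else -math.inf for score, keep in zip(logits, visible)]
    best = max(range(len(masked)), key=masked.__getitem__)
    if masked[best] == -math.inf:
        raise ValueError("the mask hides every expert")
    return best


# Hypothetical router scores for three experts; expert 0 is masked out for this token.
chosen = masked_top1_route([2.0, 1.0, 0.5], [False, True, True])
```

A mask that hides all but one expert reduces to fixed routing, while a mask that hides none reduces to ordinary learned routing; the abstract positions MaskMoE between these two extremes.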
-
Semi-supervised 3D Object Detection with PatchTeacher and PillarMix
Authors:
Xiaopei Wu,
Liang Peng,
Liang Xie,
Yuenan Hou,
Binbin Lin,
Xiaoshui Huang,
Haifeng Liu,
Deng Cai,
Wanli Ouyang
Abstract:
Semi-supervised learning aims to leverage abundant unlabeled data to improve model performance. Current semi-supervised 3D object detection methods typically use a teacher to generate pseudo labels for a student, and the quality of the pseudo labels is essential for the final performance. In this paper, we propose PatchTeacher, which focuses on partial scene 3D object detection to provide high-quality pseudo labels for the student. Specifically, we divide a complete scene into a series of patches and feed them to our PatchTeacher sequentially. PatchTeacher leverages the low memory consumption advantage of partial scene detection to process point clouds with a high-resolution voxelization, which can minimize the information loss of quantization and extract more fine-grained features. However, it is non-trivial to train a detector on fractions of the scene. Therefore, we introduce three key techniques, i.e., Patch Normalizer, Quadrant Align, and Fovea Selection, to improve the performance of PatchTeacher. Moreover, we devise PillarMix, a strong data augmentation strategy that mixes truncated pillars from different LiDAR scans to generate diverse training samples and thus help the model learn more general representations. Extensive experiments conducted on the Waymo and ONCE datasets verify the effectiveness and superiority of our method, and we achieve new state-of-the-art results, surpassing existing methods by a large margin. Codes are available at https://github.com/LittlePey/PTPM.
Submitted 13 July, 2024;
originally announced July 2024.
-
TASeg: Temporal Aggregation Network for LiDAR Semantic Segmentation
Authors:
Xiaopei Wu,
Yuenan Hou,
Xiaoshui Huang,
Binbin Lin,
Tong He,
Xinge Zhu,
Yuexin Ma,
Boxi Wu,
Haifeng Liu,
Deng Cai,
Wanli Ouyang
Abstract:
Training deep models for LiDAR semantic segmentation is challenging due to the inherent sparsity of point clouds. Utilizing temporal data is a natural remedy against the sparsity problem as it makes the input signal denser. However, previous multi-frame fusion algorithms fall short in utilizing sufficient temporal information due to the memory constraint, and they also ignore the informative temporal images. To fully exploit rich information hidden in long-term temporal point clouds and images, we present the Temporal Aggregation Network, termed TASeg. Specifically, we propose a Temporal LiDAR Aggregation and Distillation (TLAD) algorithm, which leverages historical priors to assign different aggregation steps for different classes. It can largely reduce memory and time overhead while achieving higher accuracy. Besides, TLAD trains a teacher injected with ground-truth priors to distill the model, further boosting the performance. To make full use of temporal images, we design a Temporal Image Aggregation and Fusion (TIAF) module, which can greatly expand the camera FOV and enhance the present features. Temporal LiDAR points in the camera FOV are used as mediums to transform temporal image features to the present coordinate for temporal multi-modal fusion. Moreover, we develop a Static-Moving Switch Augmentation (SMSA) algorithm, which utilizes sufficient temporal information to enable objects to switch their motion states freely, thus greatly increasing static and moving training samples. Our TASeg ranks 1st on three challenging tracks, i.e., SemanticKITTI single-scan track, multi-scan track and nuScenes LiDAR segmentation track, strongly demonstrating the superiority of our method. Codes are available at https://github.com/LittlePey/TASeg.
Submitted 12 July, 2024;
originally announced July 2024.
-
FastImpute: A Baseline for Open-source, Reference-Free Genotype Imputation Methods -- A Case Study in PRS313
Authors:
Aaron Ge,
Jeya Balasubramanian,
Xueyao Wu,
Peter Kraft,
Jonas S. Almeida
Abstract:
Genotype imputation enhances genetic data by predicting missing SNPs using reference haplotype information. Traditional methods leverage linkage disequilibrium (LD) to infer untyped SNP genotypes, relying on the similarity of LD structures between genotyped target sets and fully sequenced reference panels. Recently, reference-free deep learning-based methods have emerged, offering a promising alternative by predicting missing genotypes without external databases, thereby enhancing privacy and accessibility. However, these methods often produce models with tens of millions of parameters, leading to challenges such as the need for substantial computational resources to train and inefficiency for client-side deployment. Our study addresses these limitations by introducing a baseline for a novel genotype imputation pipeline that supports client-side imputation models generalizable across any genotyping chip and genomic region. This approach enhances patient privacy by performing imputation directly on edge devices. As a case study, we focus on PRS313, a polygenic risk score comprising 313 SNPs used for breast cancer risk prediction. Utilizing consumer genetic panels such as 23andMe, our model democratizes access to personalized genetic insights by allowing 23andMe users to obtain their PRS313 score. We demonstrate that simple linear regression can significantly improve the accuracy of PRS313 scores when calculated using SNPs imputed from consumer gene panels, such as 23andMe. Our linear regression model achieved an R^2 of 0.86, compared to 0.33 without imputation and 0.28 with simple imputation (substituting missing SNPs with the minor allele frequency). These findings suggest that popular SNP analysis libraries could benefit from integrating linear regression models for genotype imputation, providing a viable and lightweight alternative to reference-based imputation.
Submitted 12 July, 2024;
originally announced July 2024.
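The idea in the abstract above, predicting untyped SNPs from typed ones with linear regression and then computing a weighted risk score, can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the intercept column, the mean-dosage baseline, and the 0/1/2 dosage encoding are all assumptions made for the example.

```python
import numpy as np

def fit_linear_imputer(typed_train, untyped_train):
    """Least-squares fit that predicts untyped SNP dosages from typed ones.

    Rows are individuals, columns are SNPs (dosage encoding is an assumption).
    """
    # Append a column of ones so the model has an intercept term.
    design = np.hstack([typed_train, np.ones((typed_train.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(design, untyped_train, rcond=None)
    return coef

def impute(coef, typed):
    """Apply the fitted coefficients to new individuals' typed SNPs."""
    design = np.hstack([typed, np.ones((typed.shape[0], 1))])
    return design @ coef

def mean_impute(untyped_train, n_samples):
    """Baseline (hypothetical): fill every missing SNP with its training mean dosage."""
    return np.tile(untyped_train.mean(axis=0), (n_samples, 1))

def polygenic_score(dosages, weights):
    """Polygenic risk score as a weighted sum of SNP dosages."""
    return dosages @ weights
```

In practice the score would be computed on the concatenation of typed and imputed dosages, with per-SNP weights taken from the published risk model.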
-
Determination of the QCD running coupling in the entire perturbative regime from a single experiment using the Principle of Maximum Conformality
Authors:
Leonardo Di Giustino,
Stanley J. Brodsky,
Philip G. Ratcliffe,
Sheng-Quan Wang,
Xing-Gang Wu
Abstract:
We present a new approach for determining the strong coupling $α_s(Q)$ over the entire perturbative range of validity, for scales from $Λ_{\mathrm{QCD}}$ up to the Planck scale ${\sim}10^{19}$\,GeV, with the highest precision and using the data of just a single experiment. The results obtained with this method are consistent with world averages and exhibit improved precision with respect to previous reports.
\pacs{12.38.Bx, 13.66.Bc, 13.66.Jn, 13.87.-a, 11.10.Gh}
Submitted 11 July, 2024;
originally announced July 2024.
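For orientation, the textbook leading-order (one-loop) scale dependence of the strong coupling is reproduced here. It is standard QCD background rather than the Principle of Maximum Conformality procedure used in the paper; the reference scale $\mu$ and the number of active quark flavours $n_f$ are symbols introduced for this illustration.

```latex
\alpha_s(Q^2) \;=\; \frac{\alpha_s(\mu^2)}
  {1 + \dfrac{\beta_0}{4\pi}\,\alpha_s(\mu^2)\,\ln\!\left(\dfrac{Q^2}{\mu^2}\right)},
\qquad
\beta_0 \;=\; 11 - \frac{2}{3}\,n_f .
```

Because $\beta_0 > 0$ for $n_f \le 16$, the coupling decreases logarithmically as $Q$ grows, which is why a measurement at one scale constrains the coupling at others.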
-
Autoregressive Speech Synthesis without Vector Quantization
Authors:
Lingwei Meng,
Long Zhou,
Shujie Liu,
Sanyuan Chen,
Bing Han,
Shujie Hu,
Yanqing Liu,
Jinyu Li,
Sheng Zhao,
Xixin Wu,
Helen Meng,
Furu Wei
Abstract:
We present MELLE, a novel language modeling approach based on continuous-valued tokens for text-to-speech synthesis (TTS). MELLE autoregressively generates continuous mel-spectrogram frames directly from the text condition, bypassing the need for vector quantization, which was originally designed for audio compression and sacrifices fidelity compared to mel-spectrograms. Specifically, (i) instead of cross-entropy loss, we apply regression loss with a proposed spectrogram flux loss function to model the probability distribution of the continuous-valued tokens; (ii) we incorporate variational inference into MELLE to facilitate sampling mechanisms, thereby enhancing the output diversity and model robustness. Experiments demonstrate that, compared to the two-stage codec language models VALL-E and its variants, the single-stage MELLE mitigates robustness issues by avoiding the inherent flaws of sampling discrete codes, achieves superior performance across multiple metrics, and, most importantly, offers a more streamlined paradigm. See https://aka.ms/melle for demos of our work.
Submitted 11 July, 2024;
originally announced July 2024.
-
An Economic Framework for 6-DoF Grasp Detection
Authors:
Xiao-Ming Wu,
Jia-Feng Cai,
Jian-Jian Jiang,
Dian Zheng,
Yi-Lin Wei,
Wei-Shi Zheng
Abstract:
Robotic grasping in clutter is a fundamental task in robotic manipulation. In this work, we propose an economic framework for 6-DoF grasp detection, aiming to economize the resource cost in training while maintaining effective grasp performance. To begin with, we find that dense supervision is the bottleneck of current SOTA methods: it severely encumbers the overall training load and makes the training difficult to converge. To solve this problem, we first propose an economic supervision paradigm for efficient and effective grasping. This paradigm includes a well-designed supervision selection strategy, which selects key labels essentially free of ambiguity, and an economic pipeline to enable the training after selection. Furthermore, benefiting from the economic supervision, we can focus on a specific grasp, and thus we devise a focal representation module, which comprises an interactive grasp head and a composite score estimation to generate the specific grasp more accurately. Combining these components, we propose the EconomicGrasp framework. Our extensive experiments show that EconomicGrasp surpasses the SOTA grasp method by about 3 AP on average at extremely low resource cost: about 1/4 of the training time, 1/8 of the memory, and 1/30 of the storage. Our code is available at https://github.com/iSEE-Laboratory/EconomicGrasp.
Submitted 11 July, 2024;
originally announced July 2024.
-
Embedding groups into boundedly acyclic groups
Authors:
Fan Wu,
Xiaolei Wu,
Mengfei Zhao,
Zixiang Zhou
Abstract:
We show that the labeled Thompson groups and the twisted Brin--Thompson groups are boundedly acyclic. This allows us to prove several new embedding results for groups. First, every group of type $F_n$ embeds quasi-isometrically into a boundedly acyclic group of type $F_n$ that has no proper finite index subgroups. This improves a result of Bridson \cite{Br98} and a theorem of Fournier-Facio--Löh--Moraschini \cite[Theorem 2]{FFCM21}. Second, every group of type $F_n$ embeds quasi-isometrically into a $5$-uniformly perfect group of type $F_n$. Third, using Belk--Zaremsky's construction of twisted Brin--Thompson groups, we show that every finitely generated group embeds quasi-isometrically into a finitely generated boundedly acyclic simple group.
Submitted 10 July, 2024;
originally announced July 2024.
-
Study of the decay and production properties of $D_{s1}(2536)$ and $D_{s2}^*(2573)$
Authors:
M. Ablikim,
M. N. Achasov,
P. Adlarson,
O. Afedulidis,
X. C. Ai,
R. Aliberti,
A. Amoroso,
Q. An,
Y. Bai,
O. Bakina,
I. Balossino,
Y. Ban,
H. -R. Bao,
V. Batozskaya,
K. Begzsuren,
N. Berger,
M. Berlowski,
M. Bertani,
D. Bettoni,
F. Bianchi,
E. Bianco,
A. Bortone,
I. Boyko,
R. A. Briere,
A. Brueggemann
, et al. (645 additional authors not shown)
Abstract:
The $e^+e^-\rightarrow D_s^+D_{s1}(2536)^-$ and $e^+e^-\rightarrow D_s^+D^*_{s2}(2573)^-$ processes are studied using data samples collected with the BESIII detector at center-of-mass energies from 4.530 to 4.946~GeV. The absolute branching fractions of $D_{s1}(2536)^- \rightarrow \bar{D}^{*0}K^-$ and $D_{s2}^*(2573)^- \rightarrow \bar{D}^0K^-$ are measured for the first time to be $(35.9\pm 4.8\pm 3.5)\%$ and $(37.4\pm 3.1\pm 4.6)\%$, respectively. The measurements are in tension with predictions based on the assumption that the $D_{s1}(2536)$ and $D_{s2}^*(2573)$ are dominated by a bare $c\bar{s}$ component. The $e^+e^-\rightarrow D_s^+D_{s1}(2536)^-$ and $e^+e^-\rightarrow D_s^+D^*_{s2}(2573)^-$ cross sections are measured, and a resonant structure at around 4.6~GeV with a width of 50~MeV is observed for the first time with a statistical significance of $15σ$ in the $e^+e^-\rightarrow D_s^+D^*_{s2}(2573)^-$ process. It could be the $Y(4626)$ found by the Belle collaboration in the $D_s^+D_{s1}(2536)^{-}$ final state, since they have similar masses and widths. There is also evidence for a structure at around 4.75~GeV in both processes.
Submitted 10 July, 2024;
originally announced July 2024.
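The branching fractions above are quoted with separate statistical and systematic uncertainties. A common way to get a single overall uncertainty is to add the two in quadrature; the sketch below does that for the quoted numbers. Treating the two sources as independent is an assumption of this illustration, not a statement made in the abstract, and the dictionary keys are illustrative labels.

```python
import math

def combine_in_quadrature(stat, syst):
    """Total uncertainty assuming independent statistical and systematic parts."""
    return math.hypot(stat, syst)

# Central value, statistical and systematic uncertainty, all in percent,
# as quoted in the abstract.
measurements = {
    "Ds1(2536)- -> D*0bar K-": (35.9, 4.8, 3.5),
    "Ds2*(2573)- -> D0bar K-": (37.4, 3.1, 4.6),
}

for decay, (value, stat, syst) in measurements.items():
    total = combine_in_quadrature(stat, syst)
    print(f"{decay}: ({value:.1f} +/- {total:.1f})%")
```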
-
Optimization of noncollinear magnetic ordering temperature in Y-type hexaferrite by machine learning
Authors:
Yonghong Li,
Jing Zhang,
Linfeng Jiang,
Long Zhang,
Yugang Zhang,
Xueliang Wu,
Yisheng Chai,
Xiaoyuan Zhou,
Zizhen Zhou
Abstract:
Searching for the optimal doping compositions of the Y-type hexaferrite Ba2Mg2Fe12O22 to enhance the non-collinear magnetic transition temperature (TNC) remains a long-standing challenge. Instead of the conventional trial-and-error approach, the composition-property descriptor is established via a data-driven machine learning method named SISSO (sure independence screening and sparsifying operator). Based on the chosen efficient and physically interpretable descriptor, a series of Y-type hexaferrite compositions are predicted to hold high TNC, among which BaSrMg0.28Co1.72Fe10Al2O22 is then experimentally validated. Test results indicate that, under appropriate external magnetic field conditions, the TNC of this composition reaches up to 568 K, and its magnetic transition temperature is also elevated to 735 K. This work offers a machine learning-based route to develop room-temperature single-phase multiferroics for device applications.
Submitted 9 July, 2024;
originally announced July 2024.
-
Precision frequency tuning of tunable transmon qubits using alternating-bias assisted annealing
Authors:
Xiqiao Wang,
Joel Howard,
Eyob A. Sete,
Greg Stiehl,
Cameron Kopas,
Stefano Poletto,
Xian Wu,
Mark Field,
Nicholas Sharac,
Christopher Eckberg,
Hilal Cansizoglu,
Raja Katta,
Josh Mutus,
Andrew Bestwick,
Kameshwar Yadavalli,
David P. Pappas
Abstract:
Superconducting quantum processors are one of the leading platforms for realizing scalable fault-tolerant quantum computation (FTQC). The recent demonstration of post-fabrication tuning of Josephson junctions using the alternating-bias assisted annealing (ABAA) technique and a reduction in junction loss after ABAA illuminates a promising path towards precision tuning of qubit frequency while maintaining high coherence. Here, we demonstrate precision tuning of the maximum $|0\rangle\rightarrow |1\rangle$ transition frequency ($f_{01}^{\rm max}$) of tunable transmon qubits by performing ABAA at room temperature using commercially available test equipment. We characterize the impact of junction relaxation and aging on resistance spread after tuning, and demonstrate a frequency-equivalent tuning precision of 7.7 MHz ($0.17\%$) based on targeted resistance tuning on hundreds of qubits, with a resistance tuning range up to $18.5\%$. Cryogenic measurements on tuned and untuned qubits show evidence of improved coherence after ABAA with no significant impact on tunability. Despite a small global offset, we show an empirical $f_{01}^{\rm max}$ tuning precision of 18.4 MHz by tuning a set of multi-qubit processors targeting their designed Hamiltonians. We experimentally characterize high-fidelity parametric resonance iSWAP gates on two ABAA-tuned 9-qubit processors with fidelity as high as $99.51\pm 0.20\%$. On the best-performing device, we measured across the device a median fidelity of $99.22\%$ and an average fidelity of $99.13\pm 0.12 \%$. Yield modeling analysis predicts high detuning-edge-yield using ABAA beyond the 1000-qubit scale. These results demonstrate the cutting-edge capability of frequency targeting using ABAA and open up a new avenue to systematically improving Hamiltonian targeting and optimization for scaling high-performance superconducting quantum processors.
Submitted 8 July, 2024;
originally announced July 2024.
-
Tailor3D: Customized 3D Assets Editing and Generation with Dual-Side Images
Authors:
Zhangyang Qi,
Yunhan Yang,
Mengchen Zhang,
Long Xing,
Xiaoyang Wu,
Tong Wu,
Dahua Lin,
Xihui Liu,
Jiaqi Wang,
Hengshuang Zhao
Abstract:
Recent advances in 3D AIGC have shown promise in directly creating 3D objects from text and images, offering significant cost savings in animation and product design. However, detailed editing and customization of 3D assets remains a long-standing challenge. Specifically, 3D generation methods lack the ability to follow finely detailed instructions as precisely as their 2D image creation counterparts. Imagine obtaining a toy through 3D AIGC, but with undesired accessories and dressing. To tackle this challenge, we propose a novel pipeline called Tailor3D, which swiftly creates customized 3D assets from editable dual-side images. We aim to emulate a tailor's ability to locally change objects or perform overall style transfer. Unlike creating 3D assets from multiple views, using dual-side images eliminates conflicts on overlapping areas that occur when editing individual views. Specifically, it begins by editing the front view, then generates the back view of the object through multi-view diffusion. Afterward, it proceeds to edit the back view. Finally, a Dual-sided LRM is proposed to seamlessly stitch together the front and back 3D features, akin to a tailor sewing together the front and back of a garment. The Dual-sided LRM rectifies imperfect consistencies between the front and back views, enhancing editing capabilities and reducing memory burdens while seamlessly integrating them into a unified 3D representation with the LoRA Triplane Transformer. Experimental results demonstrate Tailor3D's effectiveness across various 3D generation and editing tasks, including 3D generative fill and style transfer. It provides a user-friendly, efficient solution for editing 3D assets, with each editing step taking only seconds to complete.
Submitted 8 July, 2024;
originally announced July 2024.
-
OMuSense-23: A Multimodal Dataset for Contactless Breathing Pattern Recognition and Biometric Analysis
Authors:
Manuel Lage Cañellas,
Le Nguyen,
Anirban Mukherjee,
Constantino Álvarez Casado,
Xiaoting Wu,
Praneeth Susarla,
Sasan Sharifipour,
Dinesh B. Jayagopi,
Miguel Bordallo López
Abstract:
In the domain of non-contact biometrics and human activity recognition, the lack of a versatile, multimodal dataset poses a significant bottleneck. To address this, we introduce the Oulu Multi Sensing (OMuSense-23) dataset, which includes biosignals obtained from a mmWave radar and an RGB-D camera. The dataset features data from 50 individuals in three distinct poses -- standing, sitting, and lying down -- each featuring four specific breathing pattern activities: regular breathing, reading, guided breathing, and apnea, encompassing both typical situations (e.g., sitting with normal breathing) and critical conditions (e.g., lying down without breathing). In our work, we present a detailed overview of the OMuSense-23 dataset, detailing the data acquisition protocol and describing the process for each participant. In addition, we provide a baseline evaluation of several data analysis tasks related to biometrics, breathing pattern recognition and pose identification. Our results achieve a pose identification accuracy of 87\% and a breathing pattern activity recognition accuracy of 83\% using features extracted from biosignals. The OMuSense-23 dataset is publicly available as a resource for other researchers and practitioners in the field.
Submitted 22 May, 2024;
originally announced July 2024.
-
C2C: Component-to-Composition Learning for Zero-Shot Compositional Action Recognition
Authors:
Rongchang Li,
Zhenhua Feng,
Tianyang Xu,
Linze Li,
Xiao-Jun Wu,
Muhammad Awais,
Sara Atito,
Josef Kittler
Abstract:
Compositional actions consist of dynamic (verbs) and static (objects) concepts. Humans can easily recognize unseen compositions using the learned concepts. For machines, solving such a problem requires a model to recognize unseen actions composed of previously observed verbs and objects, and thus requires a so-called compositional generalization ability. To facilitate this research, we propose a novel Zero-Shot Compositional Action Recognition (ZS-CAR) task. For evaluating the task, we construct a new benchmark, Something-composition (Sth-com), based on the widely used Something-Something V2 dataset. We also propose a novel Component-to-Composition (C2C) learning method to solve the new ZS-CAR task. C2C includes an independent component learning module and a composition inference module. Finally, we devise an enhanced training strategy to address the challenges of component variation between seen and unseen compositions and to handle the subtle balance between learning seen and unseen actions. The experimental results demonstrate that the proposed framework significantly surpasses the existing compositional generalization methods and sets a new state-of-the-art. The new Sth-com benchmark and code are available at https://github.com/RongchangLi/ZSCAR_C2C.
Submitted 8 July, 2024;
originally announced July 2024.
-
Learning with Alignments: Tackling the Inter- and Intra-domain Shifts for Cross-multidomain Facial Expression Recognition
Authors:
Yuxiang Yang,
Lu Wen,
Xinyi Zeng,
Yuanyuan Xu,
Xi Wu,
Jiliu Zhou,
Yan Wang
Abstract:
Facial Expression Recognition (FER) holds significant importance in human-computer interactions. Existing cross-domain FER methods often transfer knowledge solely from a single labeled source domain to an unlabeled target domain, neglecting the comprehensive information across multiple sources. Nevertheless, cross-multidomain FER (CMFER) is very challenging for (i) the inherent inter-domain shifts across multiple domains and (ii) the intra-domain shifts stemming from the ambiguous expressions and low inter-class distinctions. In this paper, we propose a novel Learning with Alignments CMFER framework, named LA-CMFER, to handle both inter- and intra-domain shifts. Specifically, LA-CMFER is constructed with a global branch and a local branch to extract features from the full images and local subtle expressions, respectively. Based on this, LA-CMFER presents a dual-level inter-domain alignment method to force the model to prioritize hard-to-align samples in knowledge transfer at a sample level while gradually generating a well-clustered feature space with the guidance of class attributes at a cluster level, thus narrowing the inter-domain shifts. To address the intra-domain shifts, LA-CMFER introduces a multi-view intra-domain alignment method with a multi-view clustering consistency constraint where a prediction similarity matrix is built to pursue consistency between the global and local views, thus refining pseudo labels and eliminating latent noise. Extensive experiments on six benchmark datasets have validated the superiority of our LA-CMFER.
Submitted 8 July, 2024;
originally announced July 2024.
-
Towards Bridging the Cross-modal Semantic Gap for Multi-modal Recommendation
Authors:
Xinglong Wu,
Anfeng Huang,
Hongwei Yang,
Hui He,
Yu Tai,
Weizhe Zhang
Abstract:
Multi-modal recommendation greatly enhances the performance of recommender systems by modeling the auxiliary information from multi-modality contents. Most existing multi-modal recommendation models primarily exploit multimedia information propagation processes to enrich item representations and directly utilize modal-specific embedding vectors independently obtained from upstream pre-trained models. However, this might be inappropriate since the abundant task-specific semantics remain unexplored, and the cross-modality semantic gap hinders the recommendation performance.
Inspired by the recent progress of the cross-modal alignment model CLIP, in this paper, we propose a novel \textbf{CLIP} \textbf{E}nhanced \textbf{R}ecommender (\textbf{CLIPER}) framework to bridge the semantic gap between modalities and extract fine-grained multi-view semantic information. Specifically, we introduce a multi-view modality-alignment approach for representation extraction and measure the semantic similarity between modalities. Furthermore, we integrate the multi-view multimedia representations into downstream recommendation models. Extensive experiments conducted on three public datasets demonstrate the consistent superiority of our model over state-of-the-art multi-modal recommendation models.
Submitted 7 July, 2024;
originally announced July 2024.
-
The Solution for the sequential task continual learning track of the 2nd Greater Bay Area International Algorithm Competition
Authors:
Sishun Pan,
Xixian Wu,
Tingmin Li,
Longfei Huang,
Mingxu Feng,
Zhonghua Wan,
Yang Yang
Abstract:
This paper presents a data-free, parameter-isolation-based continual learning algorithm we developed for the sequential task continual learning track of the 2nd Greater Bay Area International Algorithm Competition. The method learns an independent parameter subspace for each task within the network's convolutional and linear layers and freezes the batch normalization layers after the first task. Specifically, for the domain-incremental setting where all domains share a classification head, we freeze the shared classification head after the first task is completed, effectively solving the issue of catastrophic forgetting. Additionally, facing the challenge of domain-incremental settings that do not provide a task identity, we designed a task identity inference strategy that selects an appropriate mask matrix for each sample. Furthermore, we introduced a gradient supplementation strategy to enhance the importance of unselected parameters for the current task, facilitating learning for new tasks. We also implemented an adaptive importance scoring strategy that dynamically adjusts the number of parameters to optimize single-task performance while reducing parameter usage. Moreover, considering the limitations of storage space and inference time, we designed a mask matrix compression strategy to save storage space and improve the speed of encryption and decryption of the mask matrix. Our approach does not require expanding the core network or using external auxiliary networks or data, and performs well under both task-incremental and domain-incremental settings. This solution ultimately won a second-place prize in the competition.
Submitted 6 July, 2024;
originally announced July 2024.
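The parameter-isolation idea described above, one binary mask per task over shared weights, with masks stored compactly, can be sketched in a few lines. This is a generic illustration under stated assumptions, not the competition solution: the function names, the rule that earlier tasks' entries are frozen, and the use of bit-packing as the "compression" are all hypothetical choices made for the example.

```python
import numpy as np

def masked_update(weights, grad, task_mask, frozen_mask, lr=0.1):
    """Gradient step restricted to the current task's entries.

    Entries already claimed by earlier tasks (frozen_mask) are left untouched,
    which is what prevents forgetting in this sketch.
    """
    trainable = task_mask & ~frozen_mask
    return weights - lr * grad * trainable

def task_forward(weights, task_mask, x):
    """Forward pass of one linear layer using only the selected task's subspace."""
    return x @ (weights * task_mask)

def pack_mask(mask):
    """Store a boolean mask as bits (8 entries per byte); returns bytes and shape."""
    return np.packbits(mask.ravel()), mask.shape

def unpack_mask(packed, shape):
    """Inverse of pack_mask; drops the zero padding added by packbits."""
    n = int(np.prod(shape))
    return np.unpackbits(packed)[:n].astype(bool).reshape(shape)
```

When no task identity is given at inference time, some rule must pick which mask to pass to `task_forward`; the abstract mentions such a strategy but does not describe it, so it is left out here.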
-
The Solution for Language-Enhanced Image New Category Discovery
Authors:
Haonan Xu,
Dian Chao,
Xiangyu Wu,
Zhonghua Wan,
Yang Yang
Abstract:
Treating texts as images, combining prompts with textual labels for prompt tuning, and leveraging the alignment properties of CLIP have been successfully applied in zero-shot multi-label image recognition. Nonetheless, relying solely on textual labels to store visual information is insufficient for representing the diversity of visual objects. In this paper, we propose reversing the training process of CLIP and introducing the concept of Pseudo Visual Prompts. These prompts are initialized for each object category and pre-trained on large-scale, low-cost sentence data generated by large language models. This process mines the aligned visual information in CLIP and stores it in class-specific visual prompts. We then employ contrastive learning to transfer the stored visual information to the textual labels, enhancing their visual representation capacity. Additionally, we introduce a dual-adapter module that simultaneously leverages knowledge from the original CLIP and newly learned knowledge derived from downstream datasets. Benefiting from the pseudo visual prompts, our method surpasses the state-of-the-art not only on clean annotated text data but also on pseudo text data generated by large language models.
Submitted 6 July, 2024;
originally announced July 2024.
-
Data-Driven Prediction and Uncertainty Quantification of PWR Crud-Induced Power Shift Using Convolutional Neural Networks
Authors:
Aidan Furlong,
Farah Alsafadi,
Scott Palmtag,
Andrew Godfrey,
Xu Wu
Abstract:
The development of Crud-Induced Power Shift (CIPS) is an operational challenge in Pressurized Water Reactors that is due to the development of crud on the fuel rod cladding. The available predictive tools developed previously, usually based on fundamental physics, are computationally expensive and have shown differing degrees of accuracy. This work proposes a completely top-down approach to predict CIPS instances on an assembly level with reactor-specific calibration built-in. Built using artificial neural networks, this work uses a three-dimensional convolutional approach to leverage the image-like layout of the input data. As a classifier, the convolutional neural network model predicts whether a given assembly will experience CIPS as well as the time of occurrence during a given cycle. This surrogate model is both trained and tested using a combination of calculated core model parameters and measured plant data from Unit 1 of the Catawba Nuclear Station. After the evaluation of its performance using various metrics, Monte Carlo dropout is employed for extensive uncertainty quantification of the model predictions. The results indicate that this methodology could be a viable approach in predicting CIPS with an assembly-level resolution across both clean and afflicted cycles, while using limited computational resources.
Submitted 27 June, 2024;
originally announced July 2024.
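Monte Carlo dropout, which the abstract above uses for uncertainty quantification, keeps dropout active at prediction time and treats the spread of repeated stochastic forward passes as an uncertainty estimate. The sketch below shows the mechanism on a tiny two-layer classifier in NumPy. The network, its weights, the dropout rate, and the number of passes are placeholders chosen for illustration; the paper's model is a three-dimensional convolutional network that is not reproduced here.

```python
import numpy as np

def forward(x, w1, w2, drop_p, rng):
    """One stochastic forward pass: ReLU hidden layer, dropout, sigmoid output."""
    hidden = np.maximum(x @ w1, 0.0)
    if drop_p > 0.0:
        # Inverted dropout: zero units with probability drop_p, rescale the rest.
        keep = rng.random(hidden.shape) >= drop_p
        hidden = hidden * keep / (1.0 - drop_p)
    logits = hidden @ w2
    return 1.0 / (1.0 + np.exp(-logits))

def mc_dropout_predict(x, w1, w2, drop_p=0.2, n_passes=100, seed=0):
    """Mean prediction and per-sample standard deviation over repeated passes."""
    rng = np.random.default_rng(seed)
    samples = np.stack([forward(x, w1, w2, drop_p, rng) for _ in range(n_passes)])
    return samples.mean(axis=0), samples.std(axis=0)
```

A large standard deviation for a given input flags a prediction the model is less sure about; with `drop_p=0.0` every pass is identical and the spread collapses to zero.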
-
VCoME: Verbal Video Composition with Multimodal Editing Effects
Authors:
Weibo Gong,
Xiaojie Jin,
Xin Li,
Dongliang He,
Xinglong Wu
Abstract:
Verbal videos, featuring voice-overs or text overlays, provide valuable content but present significant challenges in composition, especially when incorporating editing effects to enhance clarity and visual appeal. In this paper, we introduce the novel task of verbal video composition with editing effects. This task aims to generate coherent and visually appealing verbal videos by integrating multimodal editing effects across textual, visual, and audio categories. To achieve this, we curate a large-scale dataset of video effects compositions from publicly available sources. We then formulate this task as a generative problem, involving the identification of appropriate positions in the verbal content and the recommendation of editing effects for these positions. To address this task, we propose VCoME, a general framework that employs a large multimodal model to generate editing effects for video composition. Specifically, VCoME takes in the multimodal video context and autoregressively outputs where to apply effects within the verbal content and which effects are most appropriate for each position. VCoME also supports prompt-based control of composition density and style, providing substantial flexibility for diverse applications. Through extensive quantitative and qualitative evaluations, we clearly demonstrate the effectiveness of VCoME. A comprehensive user study shows that our method produces videos of professional quality while being 85$\times$ more efficient than professional editors.
Submitted 5 July, 2024;
originally announced July 2024.
-
Second Place Solution of WSDM2023 Toloka Visual Question Answering Challenge
Authors:
Xiangyu Wu,
Zhouyang Chi,
Yang Yang,
Jianfeng Lu
Abstract:
In this paper, we present our solution for the WSDM2023 Toloka Visual Question Answering Challenge. Inspired by the application of multimodal pre-trained models to various downstream tasks (e.g., visual question answering, visual grounding, and cross-modal retrieval), we approached this competition as a visual grounding task, where the input is an image and a question, guiding the model to answer the question and display the answer as a bounding box on the image. We designed a three-stage solution for this task. Specifically, we used the visual-language pre-trained model OFA as the foundation. In the first stage, we constructed a large-scale synthetic dataset similar to the competition dataset and coarse-tuned the model to learn generalized semantic information. In the second stage, we treated the competition task as a visual grounding task, loaded the weights from the previous stage, and continued to fine-tune the model on the competition dataset, transferring the semantic information learned in the first stage to the competition task. Finally, we designed a bounding box matching and replacing post-processing strategy to correct the model's prediction results. Our team achieved a score of 76.342 on the final leaderboard, ranking second.
Submitted 5 July, 2024;
originally announced July 2024.
-
GSD: View-Guided Gaussian Splatting Diffusion for 3D Reconstruction
Authors:
Yuxuan Mu,
Xinxin Zuo,
Chuan Guo,
Yilin Wang,
Juwei Lu,
Xiaofeng Wu,
Songcen Xu,
Peng Dai,
Youliang Yan,
Li Cheng
Abstract:
We present GSD, a diffusion model approach based on Gaussian Splatting (GS) representation for 3D object reconstruction from a single view. Prior works suffer from inconsistent 3D geometry or mediocre rendering quality due to improper representations. We take a step towards resolving these shortcomings by utilizing the recent state-of-the-art 3D explicit representation, Gaussian Splatting, and an unconditional diffusion model. This model learns to generate 3D objects represented by sets of GS ellipsoids. With these strong generative 3D priors, though learning unconditionally, the diffusion model is ready for view-guided reconstruction without further model fine-tuning. This is achieved by propagating fine-grained 2D features through the efficient yet flexible splatting function and the guided denoising sampling process. In addition, a 2D diffusion model is further employed to enhance rendering fidelity, and improve reconstructed GS quality by polishing and re-using the rendered images. The final reconstructed objects explicitly come with high-quality 3D structure and texture, and can be efficiently rendered in arbitrary views. Experiments on the challenging real-world CO3D dataset demonstrate the superiority of our approach. Project page: https://yxmu.foo/GSD/
Submitted 10 July, 2024; v1 submitted 4 July, 2024;
originally announced July 2024.
-
Probing the equilibration of the QCD matter created in heavy-ion collisions with dileptons
Authors:
Xiang-Yu Wu,
Lipei Du,
Charles Gale,
Sangyong Jeon
Abstract:
A systematic study of intermediate invariant mass dilepton production in Pb+Pb collisions at $\sqrt{s_{NN}} = 5.02$ TeV is performed, using next-to-leading-order (NLO) thermal QCD dilepton emission rates with a multistage dynamical approach which includes event-by-event IP-Glasma initial conditions, relativistic viscous fluid dynamics, and a hadronic afterburner. Considering dilepton yield and anisotropic flow, special attention is paid to the out-of-equilibrium aspects, both thermal and chemical, and to the contribution of the Drell-Yan process. The relative contribution of each of those different channels to dilepton observables is calculated and discussed.
Submitted 4 July, 2024;
originally announced July 2024.
-
The Solution for the GAIIC2024 RGB-TIR object detection Challenge
Authors:
Xiangyu Wu,
Jinling Xu,
Longfei Huang,
Yang Yang
Abstract:
This report introduces a solution to the task of RGB-TIR object detection from the perspective of unmanned aerial vehicles. Unlike traditional object detection methods, RGB-TIR object detection aims to utilize both RGB and TIR images for complementary information during detection. The challenges of RGB-TIR object detection from the perspective of unmanned aerial vehicles include highly complex image backgrounds, frequent changes in lighting, and uncalibrated RGB-TIR image pairs. To address these challenges at the model level, we utilized a lightweight YOLOv9 model with extended multi-level auxiliary branches that enhance the model's robustness, making it more suitable for practical applications in unmanned aerial vehicle scenarios. For image fusion in RGB-TIR detection, we incorporated a fusion module into the backbone network to fuse images at the feature level, implicitly addressing calibration issues. Our proposed method achieved mAP scores of 0.516 and 0.543 on the A and B benchmarks, respectively, while maintaining the highest inference speed among all models.
Submitted 4 July, 2024;
originally announced July 2024.
-
Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning
Authors:
Thong Nguyen,
Yi Bin,
Xiaobao Wu,
Xinshuai Dong,
Zhiyuan Hu,
Khoi Le,
Cong-Duy Nguyen,
See-Kiong Ng,
Luu Anh Tuan
Abstract:
Data quality stands at the forefront of deciding the effectiveness of video-language representation learning. However, video-text pairs in previous data typically do not align perfectly with each other, which might lead to video-language representations that do not accurately reflect cross-modal semantics. Moreover, previous data also possess an uneven distribution of concepts, thereby hampering the downstream performance across unpopular subjects. To address these problems, we propose a contrastive objective with a subtractive angular margin to regularize cross-modal representations in their effort to reach perfect similarity. Furthermore, to adapt to the non-uniform concept distribution, we propose a multi-layer perceptron (MLP)-parameterized weighting function that maps loss values to sample weights, enabling dynamic adjustment of the model's focus throughout training. With the training guided by a small amount of unbiased meta-data and augmented by video-text data generated by a large vision-language model, we improve video-language representations and achieve superior performance on commonly used video question answering and text-video retrieval datasets.
Submitted 4 July, 2024;
originally announced July 2024.
-
Controlling quasi-parametric amplifications: From multiple PT-symmetry phase transitions to non-Hermitian sensing
Authors:
Xiaoxiong Wu,
Kai Bai,
Penghong Yu,
Zhaohui Dong,
Yanyan He,
Jingui Ma,
Vladislav V. Yakovlev,
Meng Xiao,
Xianfeng Chen,
Luqi Yuan
Abstract:
Quasi-parametric amplification (QPA) is a nonlinear interaction in which the idler wave is depleted through some loss mechanism. QPA plays an important role in signal amplification in ultrafast photonics and quantum light generation. The QPA process has a number of features characterized by the non-Hermitian parity-time ($\mathcal{PT}$) symmetry. In this report, we explore new interaction regimes and uncover multiple $\mathcal{PT}$-symmetry phase transitions in such a QPA process, where the transitions are particularly sensitive to external parameters. In particular, we demonstrate the feasibility of detecting $10^{-11}$ inhomogeneities of the doped absorber, which is an order of magnitude more sensitive than similar measurements performed in a linear absorption regime. In doing so, we reveal a family of $\mathcal{PT}$-symmetry phase transitions appearing in the QPA process and provide a novel nonlinear optical sensing mechanism for precise optical measurements.
Submitted 3 July, 2024;
originally announced July 2024.
-
Raw Text is All you Need: Knowledge-intensive Multi-turn Instruction Tuning for Large Language Model
Authors:
Xia Hou,
Qifeng Li,
Jian Yang,
Tongliang Li,
Linzheng Chai,
Xianjie Wu,
Hangyuan Ji,
Zhoujun Li,
Jixuan Nie,
Jingbo Dun,
Wenfeng Song
Abstract:
Instruction tuning is an effective technique that aligns the outputs of large language models (LLMs) with human preference. But how to generate the seasonal multi-turn dialogues from raw documents for instruction tuning still requires further exploration. In this paper, we present a novel framework named R2S that leverages the CoD-Chain of Dialogue logic to guide LLMs in generating knowledge-intensive multi-turn dialogues for instruction tuning. By integrating raw documents from both open-source datasets and domain-specific web-crawled documents into a benchmark K-BENCH, we cover diverse areas such as Wikipedia (English), Science (Chinese), and Artifacts (Chinese). Our approach first decides the logic flow of the current dialogue and then prompts LLMs to produce key phrases for sourcing relevant response content. This methodology enables the creation of the GINSTRUCT instruction dataset, retaining raw document knowledge within dialogue-style interactions. Utilizing this dataset, we fine-tune GLLM, a model designed to transform raw documents into structured multi-turn dialogues, thereby injecting comprehensive domain knowledge into the SFT model for enhanced instruction tuning. This work signifies a stride towards refining the adaptability and effectiveness of LLMs in processing and generating more accurate, contextually nuanced responses across various fields.
Submitted 3 July, 2024;
originally announced July 2024.
-
Measurement of the branching fraction of the decay $J/ψ\to p \bar{p} η$
Authors:
BESIII Collaboration,
M. Ablikim,
M. N. Achasov,
P. Adlarson,
O. Afedulidis,
X. C. Ai,
R. Aliberti,
A. Amoroso,
Q. An,
Y. Bai,
O. Bakina,
I. Balossino,
Y. Ban,
H. -R. Bao,
V. Batozskaya,
K. Begzsuren,
N. Berger,
M. Berlowski,
M. Bertani,
D. Bettoni,
F. Bianchi,
E. Bianco,
A. Bortone,
I. Boyko,
R. A. Briere
, et al. (639 additional authors not shown)
Abstract:
A high precision measurement of the branching fraction of the decay $J/ψ\to p \bar{p} η$ is performed using $(10 087 \pm 44) \times 10^6$ $J/ψ$ events recorded by the {BESIII} detector at the {BEPCII} storage ring. The branching fractions of the two decays $J/ψ\to p \bar{p} η(η\to γγ)$ and $J/ψ\to p \bar{p} η(η\to π^+ π^- π^0)$ are measured individually to be $\mathcal{B}(J/ψ\to p \bar{p} η(η\to γγ)) = (1.480 \pm 0.001 \pm 0.024)\times\,10^{-3}$ and $\mathcal{B}(J/ψ\to p \bar{p} η(η\to π^+ π^- π^0)) = (1.557 \pm 0.003 \pm 0.038)\times\,10^{-3}$, where the first uncertainties are statistical and the second systematic. Both results are compatible within their uncorrelated systematic uncertainties. The combined result is $\mathcal{B}(J/ψ\to p \bar{p} η)=(1.495 \pm 0.001 \pm 0.023)\times\,10^{-3}$ where the first uncertainty is the combined statistical uncertainty and the second one the combined systematic uncertainty of both analyses, incorporating correlations between them. In addition, the $p \bar{p}$ threshold region is investigated for a potential threshold enhancement, and no evidence for one is observed.
Submitted 3 July, 2024;
originally announced July 2024.
-
Fine-Grained Scene Image Classification with Modality-Agnostic Adapter
Authors:
Yiqun Wang,
Zhao Zhou,
Xiangcheng Du,
Xingjiao Wu,
Yingbin Zheng,
Cheng Jin
Abstract:
When dealing with the task of fine-grained scene image classification, most previous works place much emphasis on global visual features when doing multi-modal feature fusion. In other words, models are deliberately designed based on prior intuitions about the importance of different modalities. In this paper, we present a new multi-modal feature fusion approach named MAA (Modality-Agnostic Adapter), which lets the model adaptively learn the importance of different modalities in different cases, without fixing a prior setting in the model architecture. More specifically, we eliminate the modal differences in distribution and then use a modality-agnostic Transformer encoder for semantic-level feature fusion. Our experiments demonstrate that MAA achieves state-of-the-art results on benchmarks while using the same modalities as previous methods. Moreover, new modalities can be easily added when using MAA to further boost performance. Code is available at https://github.com/quniLcs/MAA.
Submitted 2 July, 2024;
originally announced July 2024.
-
Product Geometries on Cholesky Manifolds with Applications to SPD Manifolds
Authors:
Ziheng Chen,
Yue Song,
Xiao-Jun Wu,
Nicu Sebe
Abstract:
This paper presents two new metrics on the Symmetric Positive Definite (SPD) manifold via the Cholesky manifold, i.e., the space of lower triangular matrices with positive diagonal elements. We first show that the existing popular Riemannian metric on the Cholesky manifold can be generally characterized as the product metric of a Euclidean metric and a Riemannian metric on the space of n-dimensional positive vectors. Based on this analysis, we propose two novel metrics on the Cholesky manifold, i.e., the Diagonal Power Euclidean Metric and the Diagonal Generalized Bures-Wasserstein Metric, which are numerically more stable than the existing Cholesky metric. We also discuss the gyro structures and deformed metrics associated with our metrics. The gyro structures connect the linear and geometric properties, while the deformed metrics interpolate between our proposed metrics and the existing metric. Further, by Cholesky decomposition, the proposed deformed metrics and gyro structures are pulled back to SPD manifolds. Compared with existing Riemannian metrics on SPD manifolds, our metrics are easy to use, computationally efficient, and numerically stable.
Submitted 2 July, 2024;
originally announced July 2024.
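The product-metric view described in the abstract above can be illustrated with a minimal sketch. It assumes that the "existing popular Riemannian metric" refers to the log-Cholesky metric, whose geodesic distance splits into a Euclidean term on the strictly lower-triangular entries and a log-domain term on the positive diagonal. The function names are illustrative only, and the two metrics proposed in the paper are not implemented here.

```python
import numpy as np


def log_cholesky_distance(L, K):
    """Distance between two Cholesky factors (lower triangular, positive diagonal).

    Assumes the log-Cholesky metric: a Euclidean metric on the strictly
    lower-triangular part combined with a log-based metric on the diagonal.
    """
    L = np.asarray(L, dtype=float)
    K = np.asarray(K, dtype=float)
    # Euclidean factor: difference of the strictly lower-triangular entries.
    strict = np.tril(L, k=-1) - np.tril(K, k=-1)
    # Positive-vector factor: diagonals compared in log coordinates.
    diag = np.log(np.diag(L)) - np.log(np.diag(K))
    return float(np.sqrt(np.sum(strict ** 2) + np.sum(diag ** 2)))


def spd_distance(P, Q):
    """Pull the distance back to SPD matrices via Cholesky decomposition."""
    return log_cholesky_distance(np.linalg.cholesky(P), np.linalg.cholesky(Q))
```

Because the two terms are computed independently and summed, replacing the diagonal term with a different metric on positive vectors gives a different product metric on the same space, which is the kind of freedom the abstract describes.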
-
The Solution for Temporal Sound Localisation Task of ICCV 1st Perception Test Challenge 2023
Authors:
Yurui Huang,
Yang Yang,
Shou Chen,
Xiangyu Wu,
Qingguo Chen,
Jianfeng Lu
Abstract:
In this paper, we propose a solution for improving the quality of temporal sound localization. We employ a multimodal fusion approach to combine visual and audio features. High-quality visual features are extracted using a state-of-the-art self-supervised pre-training network, resulting in efficient video feature representations. At the same time, audio features serve as complementary information to help the model better localize the start and end of sounds. The fused features are fed into a multi-scale Transformer for training. On the final test dataset, we achieved a mean average precision (mAP) of 0.33, obtaining the second-best performance in this track.
Submitted 1 July, 2024;
originally announced July 2024.
-
Purple-teaming LLMs with Adversarial Defender Training
Authors:
Jingyan Zhou,
Kun Li,
Junan Li,
Jiawen Kang,
Minda Hu,
Xixin Wu,
Helen Meng
Abstract:
Existing efforts in safeguarding LLMs are limited in actively exposing the vulnerabilities of the target LLM and readily adapting to newly emerging safety risks. To address this, we present Purple-teaming LLMs with Adversarial Defender training (PAD), a pipeline designed to safeguard LLMs by novelly incorporating the red-teaming (attack) and blue-teaming (safety training) techniques. In PAD, we automatically collect conversational data that cover the vulnerabilities of an LLM around specific safety risks in a self-play manner, where the attacker aims to elicit unsafe responses and the defender generates safe responses to these attacks. We then update both modules in a generative adversarial network style by training the attacker to elicit more unsafe responses and updating the defender to identify them and explain the unsafe reason. Experimental results demonstrate that PAD significantly outperforms existing baselines in both finding effective attacks and establishing a robust safe guardrail. Furthermore, our findings indicate that PAD excels in striking a balance between safety and overall model quality. We also reveal key challenges in safeguarding LLMs, including defending multi-turn attacks and the need for more delicate strategies to identify specific risks.
Submitted 1 July, 2024;
originally announced July 2024.
-
Joint Design of Conventional Public Transport Network and Mobility on Demand
Authors:
Xiaoyi Wu,
Nisrine Mouhrim,
Andrea Araldo,
Yves Molenbruch,
Dominique Feillet,
Kris Braekers
Abstract:
Conventional Public Transport (PT) is based on fixed lines, running with routes and schedules determined a priori. In low-demand areas, conventional PT is inefficient. There, Mobility on Demand (MoD) could serve users more efficiently and with an improved quality of service (QoS). The idea of integrating MoD into PT is therefore abundantly discussed by researchers and practitioners, mainly in the form of adding MoD on top of PT. Efficiency can instead be gained if conventional PT lines are also redesigned after integrating MoD in the first or last mile. In this paper, we focus on this re-design problem. We devise a bilevel optimization problem where, given a certain initial design, the upper level determines stop selection and frequency settings, while the lower level routes a fleet of MoD vehicles. We propose a solution method based on Particle Swarm Optimization (PSO) for the upper level, while we adopt Large Neighborhood Search (LNS) in the lower level. Our solution method is computationally efficient, and we test it in simulations with up to 10k travel requests. Results show important operational cost savings obtained by appropriately reducing the conventional PT coverage after integrating MoD, while preserving QoS.
Submitted 1 July, 2024;
originally announced July 2024.
-
LASSI: An LLM-based Automated Self-Correcting Pipeline for Translating Parallel Scientific Codes
Authors:
Matthew T. Dearing,
Yiheng Tao,
Xingfu Wu,
Zhiling Lan,
Valerie Taylor
Abstract:
This paper addresses the problem of providing a novel approach to sourcing significant training data for LLMs focused on science and engineering. In particular, a crucial challenge is sourcing parallel scientific codes in the ranges of millions to billions of codes. To tackle this problem, we propose an automated pipeline framework, called LASSI, designed to translate between parallel programming languages by bootstrapping existing closed- or open-source LLMs. LASSI incorporates autonomous enhancement through self-correcting loops where errors encountered during compilation and execution of generated code are fed back to the LLM through guided prompting for debugging and refactoring. We highlight the bi-directional translation of existing GPU benchmarks between OpenMP target offload and CUDA to validate LASSI.
The results of evaluating LASSI with different application codes across four LLMs demonstrate the effectiveness of LASSI for generating executable parallel codes, with 80% of OpenMP to CUDA translations and 85% of CUDA to OpenMP translations producing the expected output. We also observe approximately 78% of OpenMP to CUDA translations and 62% of CUDA to OpenMP translations execute within 10% of or at a faster runtime than the original benchmark code in the same language.
Submitted 30 June, 2024;
originally announced July 2024.
-
First Place Solution of 2023 Global Artificial Intelligence Technology Innovation Competition Track 1
Authors:
Xiangyu Wu,
Hailiang Zhang,
Yang Yang,
Jianfeng Lu
Abstract:
In this paper, we present our champion solution to the Global Artificial Intelligence Technology Innovation Competition Track 1: Medical Imaging Diagnosis Report Generation. We select CPT-BASE as our base model for the text generation task. During the pre-training stage, we remove the masked language modeling task of CPT-BASE and instead reconstruct the vocabulary, adopting a span mask strategy and gradually increasing the masking ratio to perform the denoising auto-encoder pre-training task. In the fine-tuning stage, we design iterative retrieval augmentation and noise-aware similarity bucket prompt strategies. The retrieval augmentation constructs a mini-knowledge base, enriching the input information of the model, while the similarity bucket further perceives the noise information within the mini-knowledge base, guiding the model to generate higher-quality diagnostic reports based on the similarity prompts. Surprisingly, our single model achieved a score of 2.321 on leaderboard A, and the multiple model fusion scores are 2.362 and 2.320 on the A and B leaderboards, respectively, securing first place in the rankings.
Submitted 3 July, 2024; v1 submitted 1 July, 2024;
originally announced July 2024.
-
EMIF: Evidence-aware Multi-source Information Fusion Network for Explainable Fake News Detection
Authors:
Qingxing Dong,
Mengyi Zhang,
Shiyuan Wu,
Xiaozhen Wu
Abstract:
Extensive research on automatic fake news detection has been conducted due to the significant detrimental effects of fake news proliferation. Most existing approaches rely on a single source of evidence, such as comments or relevant news, to derive explanatory evidence for decision-making, demonstrating exceptional performance. However, their single evidence source suffers from two critical drawbacks: (i) noise abundance, and (ii) resilience deficiency. Inspired by the natural process of fake news identification, we propose an Evidence-aware Multi-source Information Fusion (EMIF) network that jointly leverages user comments and relevant news to make precise decisions and extract reliable evidence. To accomplish this, we first construct a co-attention network to capture general semantic conflicts between comments and original news. Meanwhile, a divergence selection module is employed to identify the top-K relevant news articles with content that deviates the most from the original news, which ensures the acquisition of multiple pieces of evidence with higher objectivity. Finally, we utilize an inconsistency loss function within the evidence fusion layer to strengthen the consistency of the two types of evidence, both negating the authenticity of the same news. Extensive experiments and ablation studies on the real-world dataset FibVID show the effectiveness of our proposed model. Notably, EMIF shows remarkable robustness even in scenarios where a particular source of information is inadequate.
Submitted 1 July, 2024;
originally announced July 2024.
-
Observation of the Electromagnetic Dalitz Transition $h_c \rightarrow e^+e^-η_c$
Authors:
BESIII Collaboration,
M. Ablikim,
M. N. Achasov,
P. Adlarson,
S. Ahmed,
M. Albrecht,
R. Aliberti,
A. Amoroso,
M. R. An,
Q. An,
X. H. Bai,
Y. Bai,
O. Bakina,
R. Baldini Ferroli,
I. Balossino,
Y. Ban,
K. Begzsuren,
N. Berger,
M. Bertani,
D. Bettoni,
F. Bianchi,
J. Bloms,
A. Bortone,
I. Boyko,
R. A. Briere
, et al. (495 additional authors not shown)
Abstract:
Using $(27.12\pm 0.14)\times10^8$ $ψ(3686)$ decays and data samples of $e^+e^-$ collisions with $\sqrt{s}$ from 4.130 to 4.780 GeV collected with the BESIII detector, we report the first observation of the electromagnetic Dalitz transition $h_c\to e^+e^-η_c$ with a statistical significance of $5.4σ$. We measure the ratio of the branching fractions $\frac{\mathcal{B}(h_c\rightarrow e^+e^-η_c)}{\mathcal{B}(h_c\rightarrow γη_c)}$ separately for the $h_c$ samples produced via $ψ(3686)\toπ^0h_c$ and $e^+e^-\toπ^+π^-h_c$. The average ratio is determined to be $(0.59\pm0.10(\text{stat.})\pm0.04(\text{syst.}))\%$, where the uncertainty includes both statistical and systematic components.
Submitted 2 July, 2024; v1 submitted 28 June, 2024;
originally announced July 2024.