-
MMGA: Multimodal Learning with Graph Alignment
Authors:
Xuan Yang,
Quanjin Tao,
Xiao Feng,
Donghong Cai,
Xiang Ren,
Yang Yang
Abstract:
Multimodal pre-training breaks down the modality barriers and allows the individual modalities to be mutually augmented with information, resulting in significant advances in representation learning. However, graph modality, as a very general and important form of data, cannot be easily interacted with other modalities because of its non-regular nature. In this paper, we propose MMGA (Multimodal l…
▽ More
Multimodal pre-training breaks down the modality barriers and allows the individual modalities to be mutually augmented with information, resulting in significant advances in representation learning. However, graph modality, as a very general and important form of data, cannot be easily interacted with other modalities because of its non-regular nature. In this paper, we propose MMGA (Multimodal learning with Graph Alignment), a novel multimodal pre-training framework to incorporate information from graph (social network), image and text modalities on social media to enhance user representation learning. In MMGA, a multi-step graph alignment mechanism is proposed to add the self-supervision from graph modality to optimize the image and text encoders, while using the information from the image and text modalities to guide the graph encoder learning. We conduct experiments on the dataset crawled from Instagram. The experimental results show that MMGA works well on the dataset and improves the fans prediction task's performance. We release our dataset, the first social media multimodal dataset with graph, of 60,000 users labeled with specific topics based on 2 million posts to facilitate future research.
△ Less
Submitted 31 October, 2022; v1 submitted 18 October, 2022;
originally announced October 2022.
-
Retrofitting Multilingual Sentence Embeddings with Abstract Meaning Representation
Authors:
Deng Cai,
Xin Li,
Jackie Chun-Sing Ho,
Lidong Bing,
Wai Lam
Abstract:
We introduce a new method to improve existing multilingual sentence embeddings with Abstract Meaning Representation (AMR). Compared with the original textual input, AMR is a structured semantic representation that presents the core concepts and relations in a sentence explicitly and unambiguously. It also helps reduce surface variations across different expressions and languages. Unlike most prior…
▽ More
We introduce a new method to improve existing multilingual sentence embeddings with Abstract Meaning Representation (AMR). Compared with the original textual input, AMR is a structured semantic representation that presents the core concepts and relations in a sentence explicitly and unambiguously. It also helps reduce surface variations across different expressions and languages. Unlike most prior work that only evaluates the ability to measure semantic similarity, we present a thorough evaluation of existing multilingual sentence embeddings and our improved versions, which include a collection of five transfer tasks in different downstream applications. Experiment results show that retrofitting multilingual sentence embeddings with AMR leads to better state-of-the-art performance on both semantic textual similarity and transfer tasks. Our codebase and evaluation scripts can be found at \url{https://github.com/jcyk/MSE-AMR}.
△ Less
Submitted 18 October, 2022;
originally announced October 2022.
-
Continuous Diagnosis and Prognosis by Controlling the Update Process of Deep Neural Networks
Authors:
Chenxi Sun,
Hongyan Li,
Moxian Song,
Derun Cai,
Baofeng Zhang,
Shenda Hong
Abstract:
Continuous diagnosis and prognosis are essential for intensive care patients. It can provide more opportunities for timely treatment and rational resource allocation, especially for sepsis, a main cause of death in ICU, and COVID-19, a new worldwide epidemic. Although deep learning methods have shown their great superiority in many medical tasks, they tend to catastrophically forget, over fit, and…
▽ More
Continuous diagnosis and prognosis are essential for intensive care patients. It can provide more opportunities for timely treatment and rational resource allocation, especially for sepsis, a main cause of death in ICU, and COVID-19, a new worldwide epidemic. Although deep learning methods have shown their great superiority in many medical tasks, they tend to catastrophically forget, over fit, and get results too late when performing diagnosis and prognosis in the continuous mode. In this work, we summarized the three requirements of this task, proposed a new concept, continuous classification of time series (CCTS), and designed a novel model training method, restricted update strategy of neural networks (RU). In the context of continuous prognosis, our method outperformed all baselines and achieved the average accuracy of 90%, 97%, and 85% on sepsis prognosis, COVID-19 mortality prediction, and eight diseases classification. Superiorly, our method can also endow deep learning with interpretability, having the potential to explore disease mechanisms and provide a new horizon for medical research. We have achieved disease staging for sepsis and COVID-19, discovering four stages and three stages with their typical biomarkers respectively. Further, our method is a data-agnostic and model-agnostic plug-in, it can be used to continuously prognose other diseases with staging and even implement CCTS in other fields.
△ Less
Submitted 6 October, 2022;
originally announced October 2022.
-
Multi-fidelity Monte Carlo: a pseudo-marginal approach
Authors:
Diana Cai,
Ryan P. Adams
Abstract:
Markov chain Monte Carlo (MCMC) is an established approach for uncertainty quantification and propagation in scientific applications. A key challenge in applying MCMC to scientific domains is computation: the target density of interest is often a function of expensive computations, such as a high-fidelity physical simulation, an intractable integral, or a slowly-converging iterative algorithm. Thu…
▽ More
Markov chain Monte Carlo (MCMC) is an established approach for uncertainty quantification and propagation in scientific applications. A key challenge in applying MCMC to scientific domains is computation: the target density of interest is often a function of expensive computations, such as a high-fidelity physical simulation, an intractable integral, or a slowly-converging iterative algorithm. Thus, using an MCMC algorithms with an expensive target density becomes impractical, as these expensive computations need to be evaluated at each iteration of the algorithm. In practice, these computations often approximated via a cheaper, low-fidelity computation, leading to bias in the resulting target density. Multi-fidelity MCMC algorithms combine models of varying fidelities in order to obtain an approximate target density with lower computational cost. In this paper, we describe a class of asymptotically exact multi-fidelity MCMC algorithms for the setting where a sequence of models of increasing fidelity can be computed that approximates the expensive target density of interest. We take a pseudo-marginal MCMC approach for multi-fidelity inference that utilizes a cheaper, randomized-fidelity unbiased estimator of the target fidelity constructed via random truncation of a telescoping series of the low-fidelity sequence of models. Finally, we discuss and evaluate the proposed multi-fidelity MCMC approach on several applications, including log-Gaussian Cox process modeling, Bayesian ODE system identification, PDE-constrained optimization, and Gaussian process regression parameter inference.
△ Less
Submitted 4 October, 2022;
originally announced October 2022.
-
Towards Explainable 3D Grounded Visual Question Answering: A New Benchmark and Strong Baseline
Authors:
Lichen Zhao,
Daigang Cai,
Jing Zhang,
Lu Sheng,
Dong Xu,
Rui Zheng,
Yinjie Zhao,
Lipeng Wang,
Xibo Fan
Abstract:
Recently, 3D vision-and-language tasks have attracted increasing research interest. Compared to other vision-and-language tasks, the 3D visual question answering (VQA) task is less exploited and is more susceptible to language priors and co-reference ambiguity. Meanwhile, a couple of recently proposed 3D VQA datasets do not well support 3D VQA task due to their limited scale and annotation methods…
▽ More
Recently, 3D vision-and-language tasks have attracted increasing research interest. Compared to other vision-and-language tasks, the 3D visual question answering (VQA) task is less exploited and is more susceptible to language priors and co-reference ambiguity. Meanwhile, a couple of recently proposed 3D VQA datasets do not well support 3D VQA task due to their limited scale and annotation methods. In this work, we formally define and address a 3D grounded VQA task by collecting a new 3D VQA dataset, referred to as FE-3DGQA, with diverse and relatively free-form question-answer pairs, as well as dense and completely grounded bounding box annotations. To achieve more explainable answers, we labelled the objects appeared in the complex QA pairs with different semantic types, including answer-grounded objects (both appeared and not appeared in the questions), and contextual objects for answer-grounded objects. We also propose a new 3D VQA framework to effectively predict the completely visually grounded and explainable answer. Extensive experiments verify that our newly collected benchmark datasets can be effectively used to evaluate various 3D VQA methods from different aspects and our newly proposed framework also achieves state-of-the-art performance on the new benchmark dataset. Both the newly collected dataset and our codes will be publicly available at http://github.com/zlccccc/3DGQA.
△ Less
Submitted 24 September, 2022;
originally announced September 2022.
-
A Depth-Progressive Initialization Strategy for Quantum Approximate Optimization Algorithm
Authors:
Xinwei Lee,
Ningyi Xie,
Yoshiyuki Saito,
Dongsheng Cai,
Nobuyoshi Asai
Abstract:
The quantum approximate optimization algorithm (QAOA) is known for its capability and universality in solving combinatorial optimization problems on near-term quantum devices. The results yielded by QAOA depend strongly on its initial variational parameters. Hence, parameters selection for QAOA becomes an active area of research as bad initialization might deteriorate the quality of the results, e…
▽ More
The quantum approximate optimization algorithm (QAOA) is known for its capability and universality in solving combinatorial optimization problems on near-term quantum devices. The results yielded by QAOA depend strongly on its initial variational parameters. Hence, parameters selection for QAOA becomes an active area of research as bad initialization might deteriorate the quality of the results, especially at great circuit depths. We first discuss on the patterns of optimal parameters in QAOA in two directions: the angle index and the circuit depth. Then, we discuss on the symmetries and periodicity of the expectation that is used to determine the bounds of the search space. Based on the patterns in optimal parameters and the bounds restriction, we propose a strategy which predicts the new initial parameters by taking the difference between previous optimal parameters. Unlike most other strategies, the strategy we propose does not require multiple trials to ensure success. It only requires one prediction when progressing to the next depth. We compare this strategy with our previously proposed strategy and the layerwise strategy on solving the Max-cut problem, in terms of the approximation ratio and the optimization cost. We also address the non-optimality in previous parameters, which is seldom discussed in other works, despite its importance in explaining the behavior of variational quantum algorithms.
△ Less
Submitted 27 September, 2022; v1 submitted 22 September, 2022;
originally announced September 2022.
-
Towards In-distribution Compatibility in Out-of-distribution Detection
Authors:
Boxi Wu,
Jie Jiang,
Haidong Ren,
Zifan Du,
Wenxiao Wang,
Zhifeng Li,
Deng Cai,
Xiaofei He,
Binbin Lin,
Wei Liu
Abstract:
Deep neural network, despite its remarkable capability of discriminating targeted in-distribution samples, shows poor performance on detecting anomalous out-of-distribution data. To address this defect, state-of-the-art solutions choose to train deep networks on an auxiliary dataset of outliers. Various training criteria for these auxiliary outliers are proposed based on heuristic intuitions. Howe…
▽ More
Deep neural network, despite its remarkable capability of discriminating targeted in-distribution samples, shows poor performance on detecting anomalous out-of-distribution data. To address this defect, state-of-the-art solutions choose to train deep networks on an auxiliary dataset of outliers. Various training criteria for these auxiliary outliers are proposed based on heuristic intuitions. However, we find that these intuitively designed outlier training criteria can hurt in-distribution learning and eventually lead to inferior performance. To this end, we identify three causes of the in-distribution incompatibility: contradictory gradient, false likelihood, and distribution shift. Based on our new understandings, we propose a new out-of-distribution detection method by adapting both the top-design of deep models and the loss function. Our method achieves in-distribution compatibility by pursuing less interference with the probabilistic characteristic of in-distribution features. On several benchmarks, our method not only achieves the state-of-the-art out-of-distribution detection performance but also improves the in-distribution accuracy.
△ Less
Submitted 29 August, 2022;
originally announced August 2022.
-
Graph R-CNN: Towards Accurate 3D Object Detection with Semantic-Decorated Local Graph
Authors:
Honghui Yang,
Zili Liu,
Xiaopei Wu,
Wenxiao Wang,
Wei Qian,
Xiaofei He,
Deng Cai
Abstract:
Two-stage detectors have gained much popularity in 3D object detection. Most two-stage 3D detectors utilize grid points, voxel grids, or sampled keypoints for RoI feature extraction in the second stage. Such methods, however, are inefficient in handling unevenly distributed and sparse outdoor points. This paper solves this problem in three aspects. 1) Dynamic Point Aggregation. We propose the patc…
▽ More
Two-stage detectors have gained much popularity in 3D object detection. Most two-stage 3D detectors utilize grid points, voxel grids, or sampled keypoints for RoI feature extraction in the second stage. Such methods, however, are inefficient in handling unevenly distributed and sparse outdoor points. This paper solves this problem in three aspects. 1) Dynamic Point Aggregation. We propose the patch search to quickly search points in a local region for each 3D proposal. The dynamic farthest voxel sampling is then applied to evenly sample the points. Especially, the voxel size varies along the distance to accommodate the uneven distribution of points. 2) RoI-graph Pooling. We build local graphs on the sampled points to better model contextual information and mine point relations through iterative message passing. 3) Visual Features Augmentation. We introduce a simple yet effective fusion strategy to compensate for sparse LiDAR points with limited semantic cues. Based on these modules, we construct our Graph R-CNN as the second stage, which can be applied to existing one-stage detectors to consistently improve the detection performance. Extensive experiments show that Graph R-CNN outperforms the state-of-the-art 3D detection models by a large margin on both the KITTI and Waymo Open Dataset. And we rank first place on the KITTI BEV car detection leaderboard. Code will be available at \url{https://github.com/Nightmare-n/GraphRCNN}.
△ Less
Submitted 6 August, 2022;
originally announced August 2022.
-
SC6D: Symmetry-agnostic and Correspondence-free 6D Object Pose Estimation
Authors:
Dingding Cai,
Janne Heikkilä,
Esa Rahtu
Abstract:
This paper presents an efficient symmetry-agnostic and correspondence-free framework, referred to as SC6D, for 6D object pose estimation from a single monocular RGB image. SC6D requires neither the 3D CAD model of the object nor any prior knowledge of the symmetries. The pose estimation is decomposed into three sub-tasks: a) object 3D rotation representation learning and matching; b) estimation of…
▽ More
This paper presents an efficient symmetry-agnostic and correspondence-free framework, referred to as SC6D, for 6D object pose estimation from a single monocular RGB image. SC6D requires neither the 3D CAD model of the object nor any prior knowledge of the symmetries. The pose estimation is decomposed into three sub-tasks: a) object 3D rotation representation learning and matching; b) estimation of the 2D location of the object center; and c) scale-invariant distance estimation (the translation along the z-axis) via classification. SC6D is evaluated on three benchmark datasets, T-LESS, YCB-V, and ITODD, and results in state-of-the-art performance on the T-LESS dataset. Moreover, SC6D is computationally much more efficient than the previous state-of-the-art method SurfEmb. The implementation and pre-trained models are publicly available at https://github.com/dingdingcai/SC6D-pose.
△ Less
Submitted 18 September, 2022; v1 submitted 3 August, 2022;
originally announced August 2022.
-
Accelerating Vertical Federated Learning
Authors:
Dongqi Cai,
Tao Fan,
Yan Kang,
Lixin Fan,
Mengwei Xu,
Shangguang Wang,
Qiang Yang
Abstract:
Privacy, security and data governance constraints rule out a brute force process in the integration of cross-silo data, which inherits the development of the Internet of Things. Federated learning is proposed to ensure that all parties can collaboratively complete the training task while the data is not out of the local. Vertical federated learning is a specialization of federated learning for dis…
▽ More
Privacy, security and data governance constraints rule out a brute force process in the integration of cross-silo data, which inherits the development of the Internet of Things. Federated learning is proposed to ensure that all parties can collaboratively complete the training task while the data is not out of the local. Vertical federated learning is a specialization of federated learning for distributed features. To preserve privacy, homomorphic encryption is applied to enable encrypted operations without decryption. Nevertheless, together with a robust security guarantee, homomorphic encryption brings extra communication and computation overhead. In this paper, we analyze the current bottlenecks of vertical federated learning under homomorphic encryption comprehensively and numerically. We propose a straggler-resilient and computation-efficient accelerating system that reduces the communication overhead in heterogeneous scenarios by 65.26% at most and reduces the computation overhead caused by homomorphic encryption by 40.66% at most. Our system can improve the robustness and efficiency of the current vertical federated learning framework without loss of security.
△ Less
Submitted 21 January, 2024; v1 submitted 23 July, 2022;
originally announced July 2022.
-
Towards Efficient Adversarial Training on Vision Transformers
Authors:
Boxi Wu,
Jindong Gu,
Zhifeng Li,
Deng Cai,
Xiaofei He,
Wei Liu
Abstract:
Vision Transformer (ViT), as a powerful alternative to Convolutional Neural Network (CNN), has received much attention. Recent work showed that ViTs are also vulnerable to adversarial examples like CNNs. To build robust ViTs, an intuitive way is to apply adversarial training since it has been shown as one of the most effective ways to accomplish robust CNNs. However, one major limitation of advers…
▽ More
Vision Transformer (ViT), as a powerful alternative to Convolutional Neural Network (CNN), has received much attention. Recent work showed that ViTs are also vulnerable to adversarial examples like CNNs. To build robust ViTs, an intuitive way is to apply adversarial training since it has been shown as one of the most effective ways to accomplish robust CNNs. However, one major limitation of adversarial training is its heavy computational cost. The self-attention mechanism adopted by ViTs is a computationally intense operation whose expense increases quadratically with the number of input patches, making adversarial training on ViTs even more time-consuming. In this work, we first comprehensively study fast adversarial training on a variety of vision transformers and illustrate the relationship between the efficiency and robustness. Then, to expediate adversarial training on ViTs, we propose an efficient Attention Guided Adversarial Training mechanism. Specifically, relying on the specialty of self-attention, we actively remove certain patch embeddings of each layer with an attention-guided dropping strategy during adversarial training. The slimmed self-attention modules accelerate the adversarial training on ViTs significantly. With only 65\% of the fast adversarial training time, we match the state-of-the-art results on the challenging ImageNet benchmark.
△ Less
Submitted 21 July, 2022;
originally announced July 2022.
-
DID-M3D: Decoupling Instance Depth for Monocular 3D Object Detection
Authors:
Liang Peng,
Xiaopei Wu,
Zheng Yang,
Haifeng Liu,
Deng Cai
Abstract:
Monocular 3D detection has drawn much attention from the community due to its low cost and setup simplicity. It takes an RGB image as input and predicts 3D boxes in the 3D space. The most challenging sub-task lies in the instance depth estimation. Previous works usually use a direct estimation method. However, in this paper we point out that the instance depth on the RGB image is non-intuitive. It…
▽ More
Monocular 3D detection has drawn much attention from the community due to its low cost and setup simplicity. It takes an RGB image as input and predicts 3D boxes in the 3D space. The most challenging sub-task lies in the instance depth estimation. Previous works usually use a direct estimation method. However, in this paper we point out that the instance depth on the RGB image is non-intuitive. It is coupled by visual depth clues and instance attribute clues, making it hard to be directly learned in the network. Therefore, we propose to reformulate the instance depth to the combination of the instance visual surface depth (visual depth) and the instance attribute depth (attribute depth). The visual depth is related to objects' appearances and positions on the image. By contrast, the attribute depth relies on objects' inherent attributes, which are invariant to the object affine transformation on the image. Correspondingly, we decouple the 3D location uncertainty into visual depth uncertainty and attribute depth uncertainty. By combining different types of depths and associated uncertainties, we can obtain the final instance depth. Furthermore, data augmentation in monocular 3D detection is usually limited due to the physical nature, hindering the boost of performance. Based on the proposed instance depth disentanglement strategy, we can alleviate this problem. Evaluated on KITTI, our method achieves new state-of-the-art results, and extensive ablation studies validate the effectiveness of each component in our method. The codes are released at https://github.com/SPengLiang/DID-M3D.
△ Less
Submitted 22 July, 2022; v1 submitted 18 July, 2022;
originally announced July 2022.
-
MLP-GAN for Brain Vessel Image Segmentation
Authors:
Bin Xie,
Hao Tang,
Bin Duan,
Dawen Cai,
Yan Yan
Abstract:
Brain vessel image segmentation can be used as a promising biomarker for better prevention and treatment of different diseases. One successful approach is to consider the segmentation as an image-to-image translation task and perform a conditional Generative Adversarial Network (cGAN) to learn a transformation between two distributions. In this paper, we present a novel multi-view approach, MLP-GA…
▽ More
Brain vessel image segmentation can be used as a promising biomarker for better prevention and treatment of different diseases. One successful approach is to consider the segmentation as an image-to-image translation task and perform a conditional Generative Adversarial Network (cGAN) to learn a transformation between two distributions. In this paper, we present a novel multi-view approach, MLP-GAN, which splits a 3D volumetric brain vessel image into three different dimensional 2D images (i.e., sagittal, coronal, axial) and then feed them into three different 2D cGANs. The proposed MLP-GAN not only alleviates the memory issue which exists in the original 3D neural networks but also retains 3D spatial information. Specifically, we utilize U-Net as the backbone for our generator and redesign the pattern of skip connection integrated with the MLP-Mixer which has attracted lots of attention recently. Our model obtains the ability to capture cross-patch information to learn global information with the MLP-Mixer. Extensive experiments are performed on the public brain vessel dataset that show our MLP-GAN outperforms other state-of-the-art methods. We release our code at https://github.com/bxie9/MLP-GAN
△ Less
Submitted 26 October, 2022; v1 submitted 17 July, 2022;
originally announced July 2022.
-
Identifying Source Speakers for Voice Conversion based Spoofing Attacks on Speaker Verification Systems
Authors:
Danwei Cai,
Zexin Cai,
Ming Li
Abstract:
An automatic speaker verification system aims to verify the speaker identity of a speech signal. However, a voice conversion system could manipulate a person's speech signal to make it sound like another speaker's voice and deceive the speaker verification system. Most countermeasures for voice conversion-based spoofing attacks are designed to discriminate bona fide speech from spoofed speech for…
▽ More
An automatic speaker verification system aims to verify the speaker identity of a speech signal. However, a voice conversion system could manipulate a person's speech signal to make it sound like another speaker's voice and deceive the speaker verification system. Most countermeasures for voice conversion-based spoofing attacks are designed to discriminate bona fide speech from spoofed speech for speaker verification systems. In this paper, we investigate the problem of source speaker identification -- inferring the identity of the source speaker given the voice converted speech. To perform source speaker identification, we simply add voice-converted speech data with the label of source speaker identity to the genuine speech dataset during speaker embedding network training. Experimental results show the feasibility of source speaker identification when training and testing with converted speeches from the same voice conversion model(s). In addition, our results demonstrate that having more converted utterances from various voice conversion model for training helps improve the source speaker identification performance on converted utterances from unseen voice conversion models.
△ Less
Submitted 31 October, 2022; v1 submitted 17 June, 2022;
originally announced June 2022.
-
Automatic Prosody Annotation with Pre-Trained Text-Speech Model
Authors:
Ziqian Dai,
Jianwei Yu,
Yan Wang,
Nuo Chen,
Yanyao Bian,
Guangzhi Li,
Deng Cai,
Dong Yu
Abstract:
Prosodic boundary plays an important role in text-to-speech synthesis (TTS) in terms of naturalness and readability. However, the acquisition of prosodic boundary labels relies on manual annotation, which is costly and time-consuming. In this paper, we propose to automatically extract prosodic boundary labels from text-audio data via a neural text-speech model with pre-trained audio encoders. This…
▽ More
Prosodic boundary plays an important role in text-to-speech synthesis (TTS) in terms of naturalness and readability. However, the acquisition of prosodic boundary labels relies on manual annotation, which is costly and time-consuming. In this paper, we propose to automatically extract prosodic boundary labels from text-audio data via a neural text-speech model with pre-trained audio encoders. This model is pre-trained on text and speech data separately and jointly fine-tuned on TTS data in a triplet format: {speech, text, prosody}. The experimental results on both automatic evaluation and human evaluation demonstrate that: 1) the proposed text-speech prosody annotation framework significantly outperforms text-only baselines; 2) the quality of automatic prosodic boundary annotations is comparable to human annotations; 3) TTS systems trained with model-annotated boundaries are slightly better than systems that use manual ones.
△ Less
Submitted 16 June, 2022;
originally announced June 2022.
-
Learning to Break the Loop: Analyzing and Mitigating Repetitions for Neural Text Generation
Authors:
Jin Xu,
Xiaojiang Liu,
Jianhao Yan,
Deng Cai,
Huayang Li,
Jian Li
Abstract:
While large-scale neural language models, such as GPT2 and BART, have achieved impressive results on various text generation tasks, they tend to get stuck in undesirable sentence-level loops with maximization-based decoding algorithms (\textit{e.g.}, greedy search). This phenomenon is counter-intuitive since there are few consecutive sentence-level repetitions in human corpora (e.g., 0.02\% in Wik…
▽ More
While large-scale neural language models, such as GPT2 and BART, have achieved impressive results on various text generation tasks, they tend to get stuck in undesirable sentence-level loops with maximization-based decoding algorithms (\textit{e.g.}, greedy search). This phenomenon is counter-intuitive since there are few consecutive sentence-level repetitions in human corpora (e.g., 0.02\% in Wikitext-103). To investigate the underlying reasons for generating consecutive sentence-level repetitions, we study the relationship between the probabilities of the repetitive tokens and their previous repetitions in the context. Through our quantitative experiments, we find that 1) Language models have a preference to repeat the previous sentence; 2) The sentence-level repetitions have a \textit{self-reinforcement effect}: the more times a sentence is repeated in the context, the higher the probability of continuing to generate that sentence; 3) The sentences with higher initial probabilities usually have a stronger self-reinforcement effect. Motivated by our findings, we propose a simple and effective training method \textbf{DITTO} (Pseu\underline{D}o-Repet\underline{IT}ion Penaliza\underline{T}i\underline{O}n), where the model learns to penalize probabilities of sentence-level repetitions from pseudo repetitive data. Although our method is motivated by mitigating repetitions, experiments show that DITTO not only mitigates the repetition issue without sacrificing perplexity, but also achieves better generation quality. Extensive experiments on open-ended text generation (Wikitext-103) and text summarization (CNN/DailyMail) demonstrate the generality and effectiveness of our method.
△ Less
Submitted 9 October, 2022; v1 submitted 6 June, 2022;
originally announced June 2022.
-
AUTM Flow: Atomic Unrestricted Time Machine for Monotonic Normalizing Flows
Authors:
Difeng Cai,
Yuliang Ji,
Huan He,
Qiang Ye,
Yuanzhe Xi
Abstract:
Nonlinear monotone transformations are used extensively in normalizing flows to construct invertible triangular mappings from simple distributions to complex ones. In existing literature, monotonicity is usually enforced by restricting function classes or model parameters and the inverse transformation is often approximated by root-finding algorithms as a closed-form inverse is unavailable. In thi…
▽ More
Nonlinear monotone transformations are used extensively in normalizing flows to construct invertible triangular mappings from simple distributions to complex ones. In existing literature, monotonicity is usually enforced by restricting function classes or model parameters and the inverse transformation is often approximated by root-finding algorithms as a closed-form inverse is unavailable. In this paper, we introduce a new integral-based approach termed "Atomic Unrestricted Time Machine (AUTM)", equipped with unrestricted integrands and easy-to-compute explicit inverse. AUTM offers a versatile and efficient way to the design of normalizing flows with explicit inverse and unrestricted function classes or parameters. Theoretically, we present a constructive proof that AUTM is universal: all monotonic normalizing flows can be viewed as limits of AUTM flows. We provide a concrete example to show how to approximate any given monotonic normalizing flow using AUTM flows with guaranteed convergence. The result implies that AUTM can be used to transform an existing flow into a new one equipped with explicit inverse and unrestricted parameters. The performance of the new approach is evaluated on high dimensional density estimation, variational inference and image generation. Experiments demonstrate superior speed and memory efficiency of AUTM.
△ Less
Submitted 5 June, 2022;
originally announced June 2022.
-
Data-driven Construction of Hierarchical Matrices with Nested Bases
Authors:
Difeng Cai,
Hua Huang,
Edmond Chow,
Yuanzhe Xi
Abstract:
Hierarchical matrices provide a powerful representation for significantly reducing the computational complexity associated with dense kernel matrices. For general kernel functions, interpolation-based methods are widely used for the efficient construction of hierarchical matrices. In this paper, we present a fast hierarchical data reduction (HiDR) procedure with $O(n)$ complexity for the memory-ef…
▽ More
Hierarchical matrices provide a powerful representation for significantly reducing the computational complexity associated with dense kernel matrices. For general kernel functions, interpolation-based methods are widely used for the efficient construction of hierarchical matrices. In this paper, we present a fast hierarchical data reduction (HiDR) procedure with $O(n)$ complexity for the memory-efficient construction of hierarchical matrices with nested bases where $n$ is the number of data points. HiDR aims to reduce the given data in a hierarchical way so as to obtain $O(1)$ representations for all nearfield and farfield interactions. Based on HiDR, a linear complexity $\mathcal{H}^2$ matrix construction algorithm is proposed. The use of data-driven methods enables {better efficiency than other general-purpose methods} and flexible computation without accessing the kernel function. Experiments demonstrate significantly improved memory efficiency of the proposed data-driven method compared to interpolation-based methods over a wide range of kernels. Though the method is not optimized for any special kernel, benchmark experiments for the Coulomb kernel show that the proposed general-purpose algorithm offers competitive performance for hierarchical matrix construction compared to several state-of-the-art algorithms for the Coulomb kernel.
△ Less
Submitted 3 June, 2022;
originally announced June 2022.
-
FedAdapter: Efficient Federated Learning for Modern NLP
Authors:
Dongqi Cai,
Yaozong Wu,
Shangguang Wang,
Felix Xiaozhu Lin,
Mengwei Xu
Abstract:
Transformer-based pre-trained models have revolutionized NLP for superior performance and generality. Fine-tuning pre-trained models for downstream tasks often requires private data, for which federated learning is the de-facto approach (i.e., FedNLP). However, our measurements show that FedNLP is prohibitively slow due to the large model sizes and the resultant high network/computation cost. Towa…
▽ More
Transformer-based pre-trained models have revolutionized NLP for superior performance and generality. Fine-tuning pre-trained models for downstream tasks often requires private data, for which federated learning is the de-facto approach (i.e., FedNLP). However, our measurements show that FedNLP is prohibitively slow due to the large model sizes and the resultant high network/computation cost. Towards practical FedNLP, we identify as the key building blocks adapters, small bottleneck modules inserted at a variety of model layers. A key challenge is to properly configure the depth and width of adapters, to which the training speed and efficiency is highly sensitive. No silver-bullet configuration exists: the optimal choice varies across downstream NLP tasks, desired model accuracy, and mobile resources. To automate adapter configuration, we propose FedAdapter, a framework that enhances the existing FedNLP with two novel designs. First, FedAdapter progressively upgrades the adapter configuration throughout a training session; the principle is to quickly learn shallow knowledge by only training fewer and smaller adapters at the model's top layers, and incrementally learn deep knowledge by incorporating deeper and larger adapters. Second, FedAdapter continuously profiles future adapter configurations by allocating participant devices to trial groups. Extensive experiments show that FedAdapter can reduce FedNLP's model convergence delay to no more than several hours, which is up to 155.5$\times$ faster compared to vanilla FedNLP and 48$\times$ faster compared to strong baselines.
△ Less
Submitted 8 May, 2023; v1 submitted 20 May, 2022;
originally announced May 2022.
-
Neural Collapse Inspired Attraction-Repulsion-Balanced Loss for Imbalanced Learning
Authors:
Liang Xie,
Yibo Yang,
Deng Cai,
Xiaofei He
Abstract:
Class imbalance distribution widely exists in real-world engineering. However, the mainstream optimization algorithms that seek to minimize error will trap the deep learning model in sub-optimums when facing extreme class imbalance. It seriously harms the classification precision, especially on the minor classes. The essential reason is that the gradients of the classifier weights are imbalanced a…
▽ More
Class imbalance distribution widely exists in real-world engineering. However, the mainstream optimization algorithms that seek to minimize error will trap the deep learning model in sub-optimums when facing extreme class imbalance. It seriously harms the classification precision, especially on the minor classes. The essential reason is that the gradients of the classifier weights are imbalanced among the components from different classes. In this paper, we propose Attraction-Repulsion-Balanced Loss (ARB-Loss) to balance the different components of the gradients. We perform experiments on the large-scale classification and segmentation datasets and our ARB-Loss can achieve state-of-the-art performance via only one-stage training instead of 2-stage learning like nowadays SOTA works.
△ Less
Submitted 21 February, 2023; v1 submitted 19 April, 2022;
originally announced April 2022.
-
Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning
Authors:
Minghao Chen,
Fangyun Wei,
Chong Li,
Deng Cai
Abstract:
Prior works on action representation learning mainly focus on designing various architectures to extract the global representations for short video clips. In contrast, many practical applications such as video alignment have strong demand for learning dense representations for long videos. In this paper, we introduce a novel contrastive action representation learning (CARL) framework to learn fram…
▽ More
Prior works on action representation learning mainly focus on designing various architectures to extract the global representations for short video clips. In contrast, many practical applications such as video alignment have strong demand for learning dense representations for long videos. In this paper, we introduce a novel contrastive action representation learning (CARL) framework to learn frame-wise action representations, especially for long videos, in a self-supervised manner. Concretely, we introduce a simple yet efficient video encoder that considers spatio-temporal context to extract frame-wise representations. Inspired by the recent progress of self-supervised learning, we present a novel sequence contrastive loss (SCL) applied on two correlated views obtained through a series of spatio-temporal data augmentations. SCL optimizes the embedding space by minimizing the KL-divergence between the sequence similarity of two augmented views and a prior Gaussian distribution of timestamp distance. Experiments on FineGym, PennAction and Pouring datasets show that our method outperforms previous state-of-the-art by a large margin for downstream fine-grained action classification. Surprisingly, although without training on paired videos, our approach also shows outstanding performance on video alignment and fine-grained frame retrieval tasks. Code and models are available at https://github.com/minghchen/CARL_code.
△ Less
Submitted 28 March, 2022;
originally announced March 2022.
-
Linearizing Transformer with Key-Value Memory
Authors:
Yizhe Zhang,
Deng Cai
Abstract:
Efficient transformer variants with linear time complexity have been developed to mitigate the quadratic computational overhead of the vanilla transformer. Among them are low-rank projection methods such as Linformer and kernel-based Transformers. Despite their unique merits, they usually suffer from a performance drop comparing with the vanilla transformer on many sequence generation tasks, and o…
▽ More
Efficient transformer variants with linear time complexity have been developed to mitigate the quadratic computational overhead of the vanilla transformer. Among them are low-rank projection methods such as Linformer and kernel-based Transformers. Despite their unique merits, they usually suffer from a performance drop comparing with the vanilla transformer on many sequence generation tasks, and often fail to obtain computation gain when the generation is short. We propose MemSizer, an approach towards closing the performance gap while improving the efficiency even with short generation. It projects the source sequences into lower dimension representations like Linformer, while enjoying efficient recurrent-style incremental computation similar to kernel-based transformers. This yields linear computation time and constant memory complexity at inference time. MemSizer also employs a lightweight multi-head mechanism which renders the computation as light as a single-head model. We demonstrate that MemSizer provides an improved balance between efficiency and accuracy over the vanilla transformer and other efficient transformer variants in three typical sequence generation tasks, including machine translation, abstractive text summarization, and language modeling.
△ Less
Submitted 12 October, 2022; v1 submitted 23 March, 2022;
originally announced March 2022.
-
CLRNet: Cross Layer Refinement Network for Lane Detection
Authors:
Tu Zheng,
Yifei Huang,
Yang Liu,
Wenjian Tang,
Zheng Yang,
Deng Cai,
Xiaofei He
Abstract:
Lane is critical in the vision navigation system of the intelligent vehicle. Naturally, lane is a traffic sign with high-level semantics, whereas it owns the specific local pattern which needs detailed low-level features to localize accurately. Using different feature levels is of great importance for accurate lane detection, but it is still under-explored. In this work, we present Cross Layer Ref…
▽ More
Lane is critical in the vision navigation system of the intelligent vehicle. Naturally, lane is a traffic sign with high-level semantics, whereas it owns the specific local pattern which needs detailed low-level features to localize accurately. Using different feature levels is of great importance for accurate lane detection, but it is still under-explored. In this work, we present Cross Layer Refinement Network (CLRNet) aiming at fully utilizing both high-level and low-level features in lane detection. In particular, it first detects lanes with high-level semantic features then performs refinement based on low-level features. In this way, we can exploit more contextual information to detect lanes while leveraging local detailed lane features to improve localization accuracy. We present ROIGather to gather global context, which further enhances the feature representation of lanes. In addition to our novel network design, we introduce Line IoU loss which regresses the lane line as a whole unit to improve the localization accuracy. Experiments demonstrate that the proposed method greatly outperforms the state-of-the-art lane detection approaches.
△ Less
Submitted 19 March, 2022;
originally announced March 2022.
-
Sparse Fuse Dense: Towards High Quality 3D Detection with Depth Completion
Authors:
Xiaopei Wu,
Liang Peng,
Honghui Yang,
Liang Xie,
Chenxi Huang,
Chengqi Deng,
Haifeng Liu,
Deng Cai
Abstract:
Current LiDAR-only 3D detection methods inevitably suffer from the sparsity of point clouds. Many multi-modal methods are proposed to alleviate this issue, while different representations of images and point clouds make it difficult to fuse them, resulting in suboptimal performance. In this paper, we present a novel multi-modal framework SFD (Sparse Fuse Dense), which utilizes pseudo point clouds…
▽ More
Current LiDAR-only 3D detection methods inevitably suffer from the sparsity of point clouds. Many multi-modal methods are proposed to alleviate this issue, while different representations of images and point clouds make it difficult to fuse them, resulting in suboptimal performance. In this paper, we present a novel multi-modal framework SFD (Sparse Fuse Dense), which utilizes pseudo point clouds generated from depth completion to tackle the issues mentioned above. Different from prior works, we propose a new RoI fusion strategy 3D-GAF (3D Grid-wise Attentive Fusion) to make fuller use of information from different types of point clouds. Specifically, 3D-GAF fuses 3D RoI features from the couple of point clouds in a grid-wise attentive way, which is more fine-grained and more precise. In addition, we propose a SynAugment (Synchronized Augmentation) to enable our multi-modal framework to utilize all data augmentation approaches tailored to LiDAR-only methods. Lastly, we customize an effective and efficient feature extractor CPConv (Color Point Convolution) for pseudo point clouds. It can explore 2D image features and 3D geometric features of pseudo point clouds simultaneously. Our method holds the highest entry on the KITTI car 3D object detection leaderboard, demonstrating the effectiveness of our SFD. Codes are available at https://github.com/LittlePey/SFD.
△ Less
Submitted 4 July, 2022; v1 submitted 18 March, 2022;
originally announced March 2022.
-
WeakM3D: Towards Weakly Supervised Monocular 3D Object Detection
Authors:
Liang Peng,
Senbo Yan,
Boxi Wu,
Zheng Yang,
Xiaofei He,
Deng Cai
Abstract:
Monocular 3D object detection is one of the most challenging tasks in 3D scene understanding. Due to the ill-posed nature of monocular imagery, existing monocular 3D detection methods highly rely on training with the manually annotated 3D box labels on the LiDAR point clouds. This annotation process is very laborious and expensive. To dispense with the reliance on 3D box labels, in this paper we e…
▽ More
Monocular 3D object detection is one of the most challenging tasks in 3D scene understanding. Due to the ill-posed nature of monocular imagery, existing monocular 3D detection methods highly rely on training with the manually annotated 3D box labels on the LiDAR point clouds. This annotation process is very laborious and expensive. To dispense with the reliance on 3D box labels, in this paper we explore the weakly supervised monocular 3D detection. Specifically, we first detect 2D boxes on the image. Then, we adopt the generated 2D boxes to select corresponding RoI LiDAR points as the weak supervision. Eventually, we adopt a network to predict 3D boxes which can tightly align with associated RoI LiDAR points. This network is learned by minimizing our newly-proposed 3D alignment loss between the 3D box estimates and the corresponding RoI LiDAR points. We will illustrate the potential challenges of the above learning problem and resolve these challenges by introducing several effective designs into our method. Codes will be available at https://github.com/SPengLiang/WeakM3D.
△ Less
Submitted 15 March, 2022;
originally announced March 2022.
-
A Next-Generation Liquid Xenon Observatory for Dark Matter and Neutrino Physics
Authors:
J. Aalbers,
K. Abe,
V. Aerne,
F. Agostini,
S. Ahmed Maouloud,
D. S. Akerib,
D. Yu. Akimov,
J. Akshat,
A. K. Al Musalhi,
F. Alder,
S. K. Alsum,
L. Althueser,
C. S. Amarasinghe,
F. D. Amaro,
A. Ames,
T. J. Anderson,
B. Andrieu,
N. Angelides,
E. Angelino,
J. Angevaare,
V. C. Antochi,
D. Antón Martin,
B. Antunovic,
E. Aprile,
H. M. Araújo
, et al. (572 additional authors not shown)
Abstract:
The nature of dark matter and properties of neutrinos are among the most pressing issues in contemporary particle physics. The dual-phase xenon time-projection chamber is the leading technology to cover the available parameter space for Weakly Interacting Massive Particles (WIMPs), while featuring extensive sensitivity to many alternative dark matter candidates. These detectors can also study neut…
▽ More
The nature of dark matter and properties of neutrinos are among the most pressing issues in contemporary particle physics. The dual-phase xenon time-projection chamber is the leading technology to cover the available parameter space for Weakly Interacting Massive Particles (WIMPs), while featuring extensive sensitivity to many alternative dark matter candidates. These detectors can also study neutrinos through neutrinoless double-beta decay and through a variety of astrophysical sources. A next-generation xenon-based detector will therefore be a true multi-purpose observatory to significantly advance particle physics, nuclear physics, astrophysics, solar physics, and cosmology. This review article presents the science cases for such a detector.
△ Less
Submitted 4 March, 2022;
originally announced March 2022.
-
OVE6D: Object Viewpoint Encoding for Depth-based 6D Object Pose Estimation
Authors:
Dingding Cai,
Janne Heikkilä,
Esa Rahtu
Abstract:
This paper proposes a universal framework, called OVE6D, for model-based 6D object pose estimation from a single depth image and a target object mask. Our model is trained using purely synthetic data rendered from ShapeNet, and, unlike most of the existing methods, it generalizes well on new real-world objects without any fine-tuning. We achieve this by decomposing the 6D pose into viewpoint, in-p…
▽ More
This paper proposes a universal framework, called OVE6D, for model-based 6D object pose estimation from a single depth image and a target object mask. Our model is trained using purely synthetic data rendered from ShapeNet, and, unlike most of the existing methods, it generalizes well on new real-world objects without any fine-tuning. We achieve this by decomposing the 6D pose into viewpoint, in-plane rotation around the camera optical axis and translation, and introducing novel lightweight modules for estimating each component in a cascaded manner. The resulting network contains less than 4M parameters while demonstrating excellent performance on the challenging T-LESS and Occluded LINEMOD datasets without any dataset-specific training. We show that OVE6D outperforms some contemporary deep learning-based pose estimation methods specifically trained for individual objects or datasets with real-world training data.
The implementation and the pre-trained model will be made publicly available.
△ Less
Submitted 7 April, 2022; v1 submitted 2 March, 2022;
originally announced March 2022.
-
Towards Effective Resource Procurement in MEC: a Resource Re-selling Framework
Authors:
Marie Siew,
Shikhar Sharma,
Kun Guo,
Desmond Cai,
Wanli Wen,
Carlee Joe-Wong,
Tony Q. S. Quek
Abstract:
On-demand and resource reservation pricing models have been widely used in cloud computing, catering to different user requirements. Nevertheless, in Multi-Access Edge Computing (MEC), as the edge has limited resources compared to the cloud, on-demand users may not get their jobs served on time, or at all, if too many resources were reserved by reservation plan users. Concurrently, reservation pla…
▽ More
On-demand and resource reservation pricing models have been widely used in cloud computing, catering to different user requirements. Nevertheless, in Multi-Access Edge Computing (MEC), as the edge has limited resources compared to the cloud, on-demand users may not get their jobs served on time, or at all, if too many resources were reserved by reservation plan users. Concurrently, reservation plan users may possess excess un-utilized quota. To optimize this resource mismatch scenario, we propose a Sharing Quota Model (SQM) where reservation plan users can re-sell unused resource quota to on-demand users, with the mobile network operator (MNO) taking a commission. To analyze the user's aggregate behavior at equilibrium and investigate the MNO's incentive of allowing re-selling, we formulate a 3-stage non-cooperative Stackelberg Game. Solving this game, we characterize the optimal strategies of buyers and re-sellers. We show that on aggregate, users' optimal strategies give rise to 4 disjoint regions, dependent on the MNO's prices and supply levels. Based on this, we characterise the MNO's optimal prices for on-demand users. Numerical results show that having both the sharing and on-demand pool gives the MNO an optimal revenue when the on-demand pool's supply is low, and when the MNO's commission is low.
△ Less
Submitted 8 November, 2023; v1 submitted 1 March, 2022;
originally announced March 2022.
-
Measuring and Reducing Model Update Regression in Structured Prediction for NLP
Authors:
Deng Cai,
Elman Mansimov,
Yi-An Lai,
Yixuan Su,
Lei Shu,
Yi Zhang
Abstract:
Recent advance in deep learning has led to the rapid adoption of machine learning-based NLP models in a wide range of applications. Despite the continuous gain in accuracy, backward compatibility is also an important aspect for industrial applications, yet it received little research attention. Backward compatibility requires that the new model does not regress on cases that were correctly handled…
▽ More
Recent advance in deep learning has led to the rapid adoption of machine learning-based NLP models in a wide range of applications. Despite the continuous gain in accuracy, backward compatibility is also an important aspect for industrial applications, yet it received little research attention. Backward compatibility requires that the new model does not regress on cases that were correctly handled by its predecessor. This work studies model update regression in structured prediction tasks. We choose syntactic dependency parsing and conversational semantic parsing as representative examples of structured prediction tasks in NLP. First, we measure and analyze model update regression in different model update settings. Next, we explore and benchmark existing techniques for reducing model update regression including model ensemble and knowledge distillation. We further propose a simple and effective method, Backward-Congruent Re-ranking (BCR), by taking into account the characteristics of structured prediction. Experiments show that BCR can better mitigate model update regression than model ensemble and knowledge distillation approaches.
△ Less
Submitted 8 October, 2022; v1 submitted 7 February, 2022;
originally announced February 2022.
-
A Survey on Retrieval-Augmented Text Generation
Authors:
Huayang Li,
Yixuan Su,
Deng Cai,
Yan Wang,
Lemao Liu
Abstract:
Recently, retrieval-augmented text generation attracted increasing attention of the computational linguistics community. Compared with conventional generation models, retrieval-augmented text generation has remarkable advantages and particularly has achieved state-of-the-art performance in many NLP tasks. This paper aims to conduct a survey about retrieval-augmented text generation. It firstly hig…
▽ More
Recently, retrieval-augmented text generation attracted increasing attention of the computational linguistics community. Compared with conventional generation models, retrieval-augmented text generation has remarkable advantages and particularly has achieved state-of-the-art performance in many NLP tasks. This paper aims to conduct a survey about retrieval-augmented text generation. It firstly highlights the generic paradigm of retrieval-augmented generation, and then it reviews notable approaches according to different tasks including dialogue response generation, machine translation, and other generation tasks. Finally, it points out some important directions on top of recent methods to facilitate future research.
△ Less
Submitted 13 February, 2022; v1 submitted 2 February, 2022;
originally announced February 2022.
-
Synthetic topology and Floquet dynamic quantum phase transition in a periodically driven Raman lattice
Authors:
De-Huan Cai,
Wei Yi
Abstract:
Stimulated by the recent progress in engineering topological band structures in cold atomic gases, we study the dynamic topological phenomena for atoms loaded in a periodically driven optical lattice. When the frequency of the periodic modulation is low, the time-dependent Hamiltonian can be mapped to a two-dimensional topological insulator, with the discretized frequency components playing the ro…
▽ More
Stimulated by the recent progress in engineering topological band structures in cold atomic gases, we study the dynamic topological phenomena for atoms loaded in a periodically driven optical lattice. When the frequency of the periodic modulation is low, the time-dependent Hamiltonian can be mapped to a two-dimensional topological insulator, with the discretized frequency components playing the role of an additional, synthetic dimension. In the high-frequency limit, we derive the effective Floquet Hamiltonian of the system, and reveal the occurrence of Floquet dynamic quantum phase transitions -- an emergent topological phenomenon in the micromotion of the Floquet dynamics. Addressing the relation between the topology of the effective Floquet Hamiltonian and the presence of dynamic topological phenomena, we demonstrate that the topologically non-trivial nature of the Floquet Hamiltonian is a sufficient but not necessary condition for the onset of the Floquet dynamic quantum phase transition. We further discuss the relation of the topology of the Floquet Hamiltonian with the existence of dynamic skyrmion structures in the emergent momentum-time manifold of the micromotion, as well as the fate of these dynamic topological phenomena when the modulation frequency decreases away from the high-frequency limit. Finally, making use of the rich level structures of $^{171}$Yb atoms, we show that the system under study can be implemented in a one-dimensional Raman lattice where states in the $^1S_0$ ground-state manifold are coupled by Raman beams with periodically modulated amplitudes.
△ Less
Submitted 28 April, 2022; v1 submitted 30 December, 2021;
originally announced December 2021.
-
TopNet: Learning from Neural Topic Model to Generate Long Stories
Authors:
Yazheng Yang,
Boyuan Pan,
Deng Cai,
Huan Sun
Abstract:
Long story generation (LSG) is one of the coveted goals in natural language processing. Different from most text generation tasks, LSG requires to output a long story of rich content based on a much shorter text input, and often suffers from information sparsity. In this paper, we propose \emph{TopNet} to alleviate this problem, by leveraging the recent advances in neural topic modeling to obtain…
▽ More
Long story generation (LSG) is one of the coveted goals in natural language processing. Different from most text generation tasks, LSG requires to output a long story of rich content based on a much shorter text input, and often suffers from information sparsity. In this paper, we propose \emph{TopNet} to alleviate this problem, by leveraging the recent advances in neural topic modeling to obtain high-quality skeleton words to complement the short input. In particular, instead of directly generating a story, we first learn to map the short text input to a low-dimensional topic distribution (which is pre-assigned by a topic model). Based on this latent topic distribution, we can use the reconstruction decoder of the topic model to sample a sequence of inter-related words as a skeleton for the story. Experiments on two benchmark datasets show that our proposed framework is highly effective in skeleton word selection and significantly outperforms the state-of-the-art models in both automatic evaluation and human evaluation.
△ Less
Submitted 14 December, 2021;
originally announced December 2021.
-
Label Hierarchy Transition: Delving into Class Hierarchies to Enhance Deep Classifiers
Authors:
Renzhen Wang,
De cai,
Kaiwen Xiao,
Xixi Jia,
Xiao Han,
Deyu Meng
Abstract:
Hierarchical classification aims to sort the object into a hierarchical structure of categories. For example, a bird can be categorized according to a three-level hierarchy of order, family, and species. Existing methods commonly address hierarchical classification by decoupling it into a series of multi-class classification tasks. However, such a multi-task learning strategy fails to fully exploi…
▽ More
Hierarchical classification aims to sort the object into a hierarchical structure of categories. For example, a bird can be categorized according to a three-level hierarchy of order, family, and species. Existing methods commonly address hierarchical classification by decoupling it into a series of multi-class classification tasks. However, such a multi-task learning strategy fails to fully exploit the correlation among various categories across different levels of the hierarchy. In this paper, we propose Label Hierarchy Transition (LHT), a unified probabilistic framework based on deep learning, to address the challenges of hierarchical classification. The LHT framework consists of a transition network and a confusion loss. The transition network focuses on explicitly learning the label hierarchy transition matrices, which has the potential to effectively encode the underlying correlations embedded within class hierarchies. The confusion loss encourages the classification network to learn correlations across different label hierarchies during training. The proposed framework can be readily adapted to any existing deep network with only minor modifications. We experiment with a series of public benchmark datasets for hierarchical classification problems, and the results demonstrate the superiority of our approach beyond current state-of-the-art methods. Furthermore, we extend our proposed LHT framework to the skin lesion diagnosis task and validate its great potential in computer-aided diagnosis. The code of our method is available at \href{https://github.com/renzhenwang/label-hierarchy-transition}{https://github.com/renzhenwang/label-hierarchy-transition}.
△ Less
Submitted 31 October, 2023; v1 submitted 4 December, 2021;
originally announced December 2021.
-
Energy-Efficient Design for a NOMA assisted STAR-RIS Network with Deep Reinforcement Learning
Authors:
Yi Guo,
Fang Fang,
Donghong Cai,
Zhiguo Ding
Abstract:
Simultaneous transmitting and reflecting reconfigurable intelligent surfaces (STAR-RISs) has been considered as a promising auxiliary device to enhance the performance of the wireless network, where users located at the different sides of the surfaces can be simultaneously served by the transmitting and reflecting signals. In this paper, the energy efficiency (EE) maximization problem for a non-or…
▽ More
Simultaneous transmitting and reflecting reconfigurable intelligent surfaces (STAR-RISs) has been considered as a promising auxiliary device to enhance the performance of the wireless network, where users located at the different sides of the surfaces can be simultaneously served by the transmitting and reflecting signals. In this paper, the energy efficiency (EE) maximization problem for a non-orthogonal multiple access (NOMA) assisted STAR-RIS downlink network is investigated. Due to the fractional form of the EE, it is challenging to solve the EE maximization problem by the traditional convex optimization solutions. In this work, a deep deterministic policy gradient (DDPG)-based algorithm is proposed to maximize the EE by jointly optimizing the transmission beamforming vectors at the base station and the coefficients matrices at the STAR-RIS. Simulation results demonstrate that the proposed algorithm can effectively maximize the system EE considering the time-varying channels.
△ Less
Submitted 30 November, 2021;
originally announced November 2021.
-
GRecX: An Efficient and Unified Benchmark for GNN-based Recommendation
Authors:
Desheng Cai,
Jun Hu,
Quan Zhao,
Shengsheng Qian,
Quan Fang,
Changsheng Xu
Abstract:
In this paper, we present GRecX, an open-source TensorFlow framework for benchmarking GNN-based recommendation models in an efficient and unified way. GRecX consists of core libraries for building GNN-based recommendation benchmarks, as well as the implementations of popular GNN-based recommendation models. The core libraries provide essential components for building efficient and unified benchmar…
▽ More
In this paper, we present GRecX, an open-source TensorFlow framework for benchmarking GNN-based recommendation models in an efficient and unified way. GRecX consists of core libraries for building GNN-based recommendation benchmarks, as well as the implementations of popular GNN-based recommendation models. The core libraries provide essential components for building efficient and unified benchmarks, including FastMetrics (efficient metrics computation libraries), VectorSearch (efficient similarity search libraries for dense vectors), BatchEval (efficient mini-batch evaluation libraries), and DataManager (unified dataset management libraries). Especially, to provide a unified benchmark for the fair comparison of different complex GNN-based recommendation models, we design a new metric GRMF-X and integrate it into the FastMetrics component. Based on a TensorFlow GNN library tf_geometric, GRecX carefully implements a variety of popular GNN-based recommendation models. We carefully implement these baseline models to reproduce the performance reported in the literature, and our implementations are usually more efficient and friendly. In conclusion, GRecX enables uses to train and benchmark GNN-based recommendation baselines in an efficient and unified way. We conduct experiments with GRecX, and the experimental results show that GRecX allows us to train and benchmark GNN-based recommendation baselines in an efficient and unified way. The source code of GRecX is available at https://github.com/maenzhier/GRecX.
△ Less
Submitted 22 February, 2022; v1 submitted 19 November, 2021;
originally announced November 2021.
-
Exploring Dense Retrieval for Dialogue Response Selection
Authors:
Tian Lan,
Deng Cai,
Yan Wang,
Yixuan Su,
Heyan Huang,
Xian-Ling Mao
Abstract:
Recent progress in deep learning has continuously improved the accuracy of dialogue response selection. In particular, sophisticated neural network architectures are leveraged to capture the rich interactions between dialogue context and response candidates. While remarkably effective, these models also bring in a steep increase in computational cost. Consequently, such models can only be used as…
▽ More
Recent progress in deep learning has continuously improved the accuracy of dialogue response selection. In particular, sophisticated neural network architectures are leveraged to capture the rich interactions between dialogue context and response candidates. While remarkably effective, these models also bring in a steep increase in computational cost. Consequently, such models can only be used as a re-rank module in practice. In this study, we present a solution to directly select proper responses from a large corpus or even a nonparallel corpus that only consists of unpaired sentences, using a dense retrieval model. To push the limits of dense retrieval, we design an interaction layer upon the dense retrieval models and apply a set of tailor-designed learning strategies. Our model shows superiority over strong baselines on the conventional re-rank evaluation setting, which is remarkable given its efficiency. To verify the effectiveness of our approach in realistic scenarios, we also conduct full-rank evaluation, where the target is to select proper responses from a full candidate pool that may contain millions of candidates and evaluate them fairly through human annotations. Our proposed model notably outperforms pipeline baselines that integrate fast recall and expressive re-rank modules. Human evaluation results show that enlarging the candidate pool with nonparallel corpora improves response quality further.
△ Less
Submitted 25 April, 2022; v1 submitted 13 October, 2021;
originally announced October 2021.
-
Multilingual AMR Parsing with Noisy Knowledge Distillation
Authors:
Deng Cai,
Xin Li,
Jackie Chun-Sing Ho,
Lidong Bing,
Wai Lam
Abstract:
We study multilingual AMR parsing from the perspective of knowledge distillation, where the aim is to learn and improve a multilingual AMR parser by using an existing English parser as its teacher. We constrain our exploration in a strict multilingual setting: there is but one model to parse all different languages including English. We identify that noisy input and precise output are the key to s…
▽ More
We study multilingual AMR parsing from the perspective of knowledge distillation, where the aim is to learn and improve a multilingual AMR parser by using an existing English parser as its teacher. We constrain our exploration in a strict multilingual setting: there is but one model to parse all different languages including English. We identify that noisy input and precise output are the key to successful distillation. Together with extensive pre-training, we obtain an AMR parser whose performances surpass all previously published results on four different foreign languages, including German, Spanish, Italian, and Chinese, by large margins (up to 18.8 \textsc{Smatch} points on Chinese and on average 11.3 \textsc{Smatch} points). Our parser also achieves comparable performance on English to the latest state-of-the-art English-only parser.
△ Less
Submitted 13 October, 2021; v1 submitted 30 September, 2021;
originally announced September 2021.
-
Multi-Task Pre-Training for Plug-and-Play Task-Oriented Dialogue System
Authors:
Yixuan Su,
Lei Shu,
Elman Mansimov,
Arshit Gupta,
Deng Cai,
Yi-An Lai,
Yi Zhang
Abstract:
Pre-trained language models have been recently shown to benefit task-oriented dialogue (TOD) systems. Despite their success, existing methods often formulate this task as a cascaded generation problem which can lead to error accumulation across different sub-tasks and greater data annotation overhead. In this study, we present PPTOD, a unified plug-and-play model for task-oriented dialogue. In add…
▽ More
Pre-trained language models have been recently shown to benefit task-oriented dialogue (TOD) systems. Despite their success, existing methods often formulate this task as a cascaded generation problem which can lead to error accumulation across different sub-tasks and greater data annotation overhead. In this study, we present PPTOD, a unified plug-and-play model for task-oriented dialogue. In addition, we introduce a new dialogue multi-task pre-training strategy that allows the model to learn the primary TOD task completion skills from heterogeneous dialog corpora. We extensively test our model on three benchmark TOD tasks, including end-to-end dialogue modelling, dialogue state tracking, and intent classification. Experimental results show that PPTOD achieves new state of the art on all evaluated tasks in both high-resource and low-resource scenarios. Furthermore, comparisons against previous SOTA methods show that the responses generated by PPTOD are more factually correct and semantically coherent as judged by human annotators.
△ Less
Submitted 1 March, 2022; v1 submitted 29 September, 2021;
originally announced September 2021.
-
Scaling properties of scale-free networks in degree-thresholding renormalization flows
Authors:
Dan Chen,
Defu Cai,
Housheng Su
Abstract:
We study the statistical properties of observables of scale-free networks in the degree-thresholding renormalization (DTR) flows. For BA scale-free networks with different sizes, we find that their structural and dynamical observables have similar scaling behavior in the DTR flow. The finite-size scaling analysis confirms this view and reveals a scaling function with a single scaling exponent that…
▽ More
We study the statistical properties of observables of scale-free networks in the degree-thresholding renormalization (DTR) flows. For BA scale-free networks with different sizes, we find that their structural and dynamical observables have similar scaling behavior in the DTR flow. The finite-size scaling analysis confirms this view and reveals a scaling function with a single scaling exponent that collectively captures the changes of these observables. Furthermore, for the scale-free network with a single initial size, we use its DTR snapshots as the original networks in the DTR flows, then perform a similar finite-size scaling analysis. Interestingly, the initial network and its snapshots share the same scaling exponent as the BA synthetic network. Our findings have important guiding significance for analyzing the structure and dynamic behavior of large-scale networks. Such as, in large-scale simulation scenarios with high time complexity, the DTR snapshot could serve as a substitute or guide for the initial network and then quickly explore the scaling behavior of initial networks.
△ Less
Submitted 28 November, 2022; v1 submitted 25 September, 2021;
originally announced September 2021.
-
A simple transcendental travelling wave solution and stability study for the thermophoretic motion with variable heat transmission factors on substrate-supported grapheme sheet
Authors:
Yue Chan,
Daoju Cai,
Kaisheng Cai,
Shern-Long Lee,
Rumiao Lin,
Yong Ren
Abstract:
Manually tailored wrinkled graphene sheets hold great promise in fabricating smart solid-state devices. In this paper, we employ an energy method to transform the original third-order partial differential equation (pde), i.e. Eq. (1) into the first-order pde, i.e. Eq. (8) for the thermophoretic motion of substrate-supported graphene sheets, which can be solved in terms of semi-group and transcende…
▽ More
Manually tailored wrinkled graphene sheets hold great promise in fabricating smart solid-state devices. In this paper, we employ an energy method to transform the original third-order partial differential equation (pde), i.e. Eq. (1) into the first-order pde, i.e. Eq. (8) for the thermophoretic motion of substrate-supported graphene sheets, which can be solved in terms of semi-group and transcendental solutions. Unlike soliton solutions derived using other more sophisticated techniques [9, 23], the present transcendental solution can be easily solved numerically and provides physical insights. Most importantly, we verify that the formation of various forms for wrinkling wave solutions can be determined by the evolution of equilibrium points for Eq. (1). This sheds a light on modifying the heat sources in order to control the configuration of wrinkle waves that has not been previously addressed.
△ Less
Submitted 19 September, 2021;
originally announced September 2021.
-
Exploiting Reasoning Chains for Multi-hop Science Question Answering
Authors:
Weiwen Xu,
Yang Deng,
Huihui Zhang,
Deng Cai,
Wai Lam
Abstract:
We propose a novel Chain Guided Retriever-reader ({\tt CGR}) framework to model the reasoning chain for multi-hop Science Question Answering. Our framework is capable of performing explainable reasoning without the need of any corpus-specific annotations, such as the ground-truth reasoning chain, or human-annotated entity mentions. Specifically, we first generate reasoning chains from a semantic g…
▽ More
We propose a novel Chain Guided Retriever-reader ({\tt CGR}) framework to model the reasoning chain for multi-hop Science Question Answering. Our framework is capable of performing explainable reasoning without the need of any corpus-specific annotations, such as the ground-truth reasoning chain, or human-annotated entity mentions. Specifically, we first generate reasoning chains from a semantic graph constructed by Abstract Meaning Representation of retrieved evidence facts. A \textit{Chain-aware loss}, concerning both local and global chain information, is also designed to enable the generated chains to serve as distant supervision signals for training the retriever, where reinforcement learning is also adopted to maximize the utility of the reasoning chains. Our framework allows the retriever to capture step-by-step clues of the entire reasoning process, which is not only shown to be effective on two challenging multi-hop Science QA tasks, namely OpenBookQA and ARC-Challenge, but also favors explainability.
△ Less
Submitted 7 September, 2021;
originally announced September 2021.
-
The DKU-DukeECE System for the Self-Supervision Speaker Verification Task of the 2021 VoxCeleb Speaker Recognition Challenge
Authors:
Danwei Cai,
Ming Li
Abstract:
This report describes the submission of the DKU-DukeECE team to the self-supervision speaker verification task of the 2021 VoxCeleb Speaker Recognition Challenge (VoxSRC). Our method employs an iterative labeling framework to learn self-supervised speaker representation based on a deep neural network (DNN). The framework starts with training a self-supervision speaker embedding network by maximizi…
▽ More
This report describes the submission of the DKU-DukeECE team to the self-supervision speaker verification task of the 2021 VoxCeleb Speaker Recognition Challenge (VoxSRC). Our method employs an iterative labeling framework to learn self-supervised speaker representation based on a deep neural network (DNN). The framework starts with training a self-supervision speaker embedding network by maximizing agreement between different segments within an utterance via a contrastive loss. Taking advantage of DNN's ability to learn from data with label noise, we propose to cluster the speaker embedding obtained from the previous speaker network and use the subsequent class assignments as pseudo labels to train a new DNN. Moreover, we iteratively train the speaker network with pseudo labels generated from the previous step to bootstrap the discriminative power of a DNN. Also, visual modal data is incorporated in this self-labeling framework. The visual pseudo label and the audio pseudo label are fused with a cluster ensemble algorithm to generate a robust supervisory signal for representation learning. Our submission achieves an equal error rate (EER) of 5.58% and 5.59% on the challenge development and test set, respectively.
△ Less
Submitted 7 September, 2021;
originally announced September 2021.
-
The DKU-DukeECE-Lenovo System for the Diarization Task of the 2021 VoxCeleb Speaker Recognition Challenge
Authors:
Weiqing Wang,
Danwei Cai,
Qingjian Lin,
Lin Yang,
Junjie Wang,
Jin Wang,
Ming Li
Abstract:
This report describes the submission of the DKU-DukeECE-Lenovo team to the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2021 track 4. Our system including a voice activity detection (VAD) model, a speaker embedding model, two clustering-based speaker diarization systems with different similarity measurements, two different overlapped speech detection (OSD) models, and a target-speaker voice act…
▽ More
This report describes the submission of the DKU-DukeECE-Lenovo team to the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2021 track 4. Our system including a voice activity detection (VAD) model, a speaker embedding model, two clustering-based speaker diarization systems with different similarity measurements, two different overlapped speech detection (OSD) models, and a target-speaker voice activity detection (TS-VAD) model. Our final submission, consisting of 5 independent systems, achieves a DER of 5.07% on the challenge test set.
△ Less
Submitted 6 September, 2021; v1 submitted 5 September, 2021;
originally announced September 2021.
-
GRP-FED: Addressing Client Imbalance in Federated Learning via Global-Regularized Personalization
Authors:
Yen-Hsiu Chou,
Shenda Hong,
Chenxi Sun,
Derun Cai,
Moxian Song,
Hongyan Li
Abstract:
Since data is presented long-tailed in reality, it is challenging for Federated Learning (FL) to train across decentralized clients as practical applications. We present Global-Regularized Personalization (GRP-FED) to tackle the data imbalanced issue by considering a single global model and multiple local models for each client. With adaptive aggregation, the global model treats multiple clients f…
▽ More
Since data is presented long-tailed in reality, it is challenging for Federated Learning (FL) to train across decentralized clients as practical applications. We present Global-Regularized Personalization (GRP-FED) to tackle the data imbalanced issue by considering a single global model and multiple local models for each client. With adaptive aggregation, the global model treats multiple clients fairly and mitigates the global long-tailed issue. Each local model is learned from the local data and aligns with its distribution for customization. To prevent the local model from just overfitting, GRP-FED applies an adversarial discriminator to regularize between the learned global-local features. Extensive results show that our GRP-FED improves under both global and local scenarios on real-world MIT-BIH and synthesis CIFAR-10 datasets, achieving comparable performance and addressing client imbalance.
△ Less
Submitted 31 August, 2021;
originally announced August 2021.
-
ASR-GLUE: A New Multi-task Benchmark for ASR-Robust Natural Language Understanding
Authors:
Lingyun Feng,
Jianwei Yu,
Deng Cai,
Songxiang Liu,
Haitao Zheng,
Yan Wang
Abstract:
Language understanding in speech-based systems have attracted much attention in recent years with the growing demand for voice interface applications. However, the robustness of natural language understanding (NLU) systems to errors introduced by automatic speech recognition (ASR) is under-examined. %To facilitate the research on ASR-robust general language understanding, In this paper, we propose…
▽ More
Language understanding in speech-based systems have attracted much attention in recent years with the growing demand for voice interface applications. However, the robustness of natural language understanding (NLU) systems to errors introduced by automatic speech recognition (ASR) is under-examined. %To facilitate the research on ASR-robust general language understanding, In this paper, we propose ASR-GLUE benchmark, a new collection of 6 different NLU tasks for evaluating the performance of models under ASR error across 3 different levels of background noise and 6 speakers with various voice characteristics. Based on the proposed benchmark, we systematically investigate the effect of ASR error on NLU tasks in terms of noise intensity, error type and speaker variants. We further purpose two ways, correction-based method and data augmentation-based method to improve robustness of the NLU systems. Extensive experimental results and analysises show that the proposed methods are effective to some extent, but still far from human performance, demonstrating that NLU under ASR error is still very challenging and requires further research.
△ Less
Submitted 16 March, 2022; v1 submitted 30 August, 2021;
originally announced August 2021.
-
An Iterative Improvement Method for HHL algorithm for Solving Linear System of Equations
Authors:
Yoshiyuki Saito,
Xinwei Lee,
Dongsheng Cai,
Nobuyoshi Asai
Abstract:
We propose an iterative improvement method for the Harrow-Hassidim-Lloyd (HHL) algorithm to solve a linear system of equations. This is a quantum-classical hybrid algorithm. The accuracy is essential to solve the linear system of equations. However, the accuracy of the HHL algorithm is limited by the number of quantum bits used to express the eigenvalues of the matrix. Our iterative method improve…
▽ More
We propose an iterative improvement method for the Harrow-Hassidim-Lloyd (HHL) algorithm to solve a linear system of equations. This is a quantum-classical hybrid algorithm. The accuracy is essential to solve the linear system of equations. However, the accuracy of the HHL algorithm is limited by the number of quantum bits used to express the eigenvalues of the matrix. Our iterative method improves the accuracy of the HHL solutions, and gives higher accuracy which surpasses the accuracy limited by the number of quantum bits. In practical HHL algorithm, a huge number of measurements is required to obtain good accuracy, even if we provide a sufficient number of quantum bits for the eigenvalue expression, since the solution is statistically processed from the measurements. Our improved iterative method can reduce the number of measurements. Moreover, the sign information for each eigenstate of the solution is lost once the measurement is made, although the sign is significant. Therefore, the naïve iterative method of the HHL algorithm may slow down, especially, when the solution includes wrong signs. In this paper, we propose and evaluate an improved iterative method for the HHL algorithm that is robust against the sign information loss, in terms of the number of iterations and the computational accuracy.
△ Less
Submitted 17 August, 2021;
originally announced August 2021.
-
Parameters Fixing Strategy for Quantum Approximate Optimization Algorithm
Authors:
Xinwei Lee,
Yoshiyuki Saito,
Dongsheng Cai,
Nobuyoshi Asai
Abstract:
The quantum approximate optimization algorithm (QAOA) has numerous promising applications in solving the combinatorial optimization problems on near-term Noisy Intermediate Scalable Quantum (NISQ) devices. QAOA has a quantum-classical hybrid structure. Its quantum part consists of a parameterized alternating operator ansatz, and its classical part comprises an optimization algorithm, which optimiz…
▽ More
The quantum approximate optimization algorithm (QAOA) has numerous promising applications in solving the combinatorial optimization problems on near-term Noisy Intermediate Scalable Quantum (NISQ) devices. QAOA has a quantum-classical hybrid structure. Its quantum part consists of a parameterized alternating operator ansatz, and its classical part comprises an optimization algorithm, which optimizes the parameters to maximize the expectation value of the problem Hamiltonian. This expectation value depends highly on the parameters, this implies that a set of good parameters leads to an accurate solution. However, at large circuit depth of QAOA, it is difficult to achieve global optimization due to the multiple occurrences of local minima or maxima. In this paper, we propose a parameters fixing strategy which gives high approximation ratio on average, even at large circuit depths, by initializing QAOA with the optimal parameters obtained from the previous depths. We test our strategy on the Max-cut problem of certain classes of graphs such as the 3-regular graphs and the Erdös-Rényi graphs.
△ Less
Submitted 11 August, 2021;
originally announced August 2021.
-
CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention
Authors:
Wenxiao Wang,
Lu Yao,
Long Chen,
Binbin Lin,
Deng Cai,
Xiaofei He,
Wei Liu
Abstract:
Transformers have made great progress in dealing with computer vision tasks. However, existing vision transformers do not yet possess the ability of building the interactions among features of different scales, which is perceptually important to visual inputs. The reasons are two-fold: (1) Input embeddings of each layer are equal-scale, so no cross-scale feature can be extracted; (2) to lower the…
▽ More
Transformers have made great progress in dealing with computer vision tasks. However, existing vision transformers do not yet possess the ability of building the interactions among features of different scales, which is perceptually important to visual inputs. The reasons are two-fold: (1) Input embeddings of each layer are equal-scale, so no cross-scale feature can be extracted; (2) to lower the computational cost, some vision transformers merge adjacent embeddings inside the self-attention module, thus sacrificing small-scale (fine-grained) features of the embeddings and also disabling the cross-scale interactions. To this end, we propose Cross-scale Embedding Layer (CEL) and Long Short Distance Attention (LSDA). On the one hand, CEL blends each embedding with multiple patches of different scales, providing the self-attention module itself with cross-scale features. On the other hand, LSDA splits the self-attention module into a short-distance one and a long-distance counterpart, which not only reduces the computational burden but also keeps both small-scale and large-scale features in the embeddings. Through the above two designs, we achieve cross-scale attention. Besides, we put forward a dynamic position bias for vision transformers to make the popular relative position bias apply to variable-sized images. Hinging on the cross-scale attention module, we construct a versatile vision architecture, dubbed CrossFormer, which accommodates variable-sized inputs. Extensive experiments show that CrossFormer outperforms the other vision transformers on image classification, object detection, instance segmentation, and semantic segmentation tasks. The code has been released: https://github.com/cheerss/CrossFormer.
△ Less
Submitted 8 October, 2021; v1 submitted 31 July, 2021;
originally announced August 2021.
-
Hybrid A Posteriori Error Estimators for Conforming Finite Element Approximations to Stationary Convection-Diffusion-Reaction equations
Authors:
Difeng Cai,
Zhiqiang Cai
Abstract:
We consider the a posteriori error estimation for convection-diffusion-reaction equations in both diffusion-dominated and convection/reaction-dominated regimes. We present an explicit hybrid estimator, which, in each regime, is proved to be reliable and efficient with constants independent of the parameters in the underlying problem. For convection-dominated problems, the norm introduced by Verf{ü…
▽ More
We consider the a posteriori error estimation for convection-diffusion-reaction equations in both diffusion-dominated and convection/reaction-dominated regimes. We present an explicit hybrid estimator, which, in each regime, is proved to be reliable and efficient with constants independent of the parameters in the underlying problem. For convection-dominated problems, the norm introduced by Verf{ü}rth \cite{verf2005confusion} is used to measure the approximation error. Various numerical experiments are performed to (1) demonstrate the robustness of the hybrid estimator; (2) show that the hybrid estimator is more accurate than the explicit residual estimator and is less sensitive to the size of reaction, even though both of them are robust.
△ Less
Submitted 15 July, 2021; v1 submitted 13 July, 2021;
originally announced July 2021.
-
Learning to Affiliate: Mutual Centralized Learning for Few-shot Classification
Authors:
Yang Liu,
Weifeng Zhang,
Chao Xiang,
Tu Zheng,
Deng Cai,
Xiaofei He
Abstract:
Few-shot learning (FSL) aims to learn a classifier that can be easily adapted to accommodate new tasks not seen during training, given only a few examples. To handle the limited-data problem in few-shot regimes, recent methods tend to collectively use a set of local features to densely represent an image instead of using a mixed global feature. They generally explore a unidirectional query-to-supp…
▽ More
Few-shot learning (FSL) aims to learn a classifier that can be easily adapted to accommodate new tasks not seen during training, given only a few examples. To handle the limited-data problem in few-shot regimes, recent methods tend to collectively use a set of local features to densely represent an image instead of using a mixed global feature. They generally explore a unidirectional query-to-support paradigm in FSL, e.g., find the nearest/optimal support feature for each query feature and aggregate these local matches for a joint classification. In this paper, we propose a new method Mutual Centralized Learning (MCL) to fully affiliate the two disjoint sets of dense features in a bidirectional paradigm. We associate each local feature with a particle that can bidirectionally random walk in a discrete feature space by the affiliations. To estimate the class probability, we propose the features' accessibility that measures the expected number of visits to the support features of that class in a Markov process. We relate our method to learning a centrality on an affiliation network and demonstrate its capability to be plugged in existing methods by highlighting centralized local features. Experiments show that our method achieves the state-of-the-art on both miniImageNet and tieredImageNet.
△ Less
Submitted 18 March, 2022; v1 submitted 10 June, 2021;
originally announced June 2021.