arXiv:2406.19776 [pdf, other]

MDF: A Dynamic Fusion Model for Multi-modal Fake News Detection

Authors: Hongzhen Lv, Wenzhong Yang, Fuyuan Wei, Jiaren Peng, Haokun Geng

Abstract: Fake news detection has received increasing attention from researchers in recent years, especially multi-modal fake news detection containing both text and images. However, many previous works have fed two modal features, text and image, into a binary classifier after a simple concatenation or attention mechanism, in which the features contain a large amount of noise inherent in the data, which in turn leads to intra- and inter-modal uncertainty. In addition, although many methods based on simply splicing two modalities have achieved more prominent results, these methods ignore the drawback of holding fixed weights across modalities, which would lead to some features with higher impact factors being ignored. To alleviate the above problems, we propose a new dynamic fusion framework dubbed MDF for fake news detection. As far as we know, it is the first attempt at a dynamic fusion framework in the field of fake news detection. Specifically, our model consists of two main components: (1) UEM, an uncertainty modeling module employing a multi-head attention mechanism to model intra-modal uncertainty; and (2) DFN, a dynamic fusion module based on D-S evidence theory for dynamically fusing the weights of two modalities, text and image. In order to present better results for the dynamic fusion framework, we use GAT for inter-modal uncertainty and weight modeling before DFN. Extensive experiments on two benchmark datasets demonstrate the effectiveness and superior performance of the MDF framework. We also conducted a systematic ablation study to gain insight into our motivation and architectural design. We make our model publicly available at https://github.com/CoisiniStar/MDF

Submitted 28 June, 2024; originally announced June 2024.
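
The MDF abstract above says its fusion module is based on D-S (Dempster-Shafer) evidence theory. As a rough illustration of that underlying idea only, the sketch below applies the textbook Dempster rule of combination to two mass functions over the frame {real, fake}, with `theta` holding the mass left on the whole frame (uncertainty). The function name, the dictionary layout, and the example masses are assumptions made for illustration; this is not the authors' DFN implementation.

```python
def dempster_combine(m1, m2):
    """Combine two mass functions over {real, fake} plus the full frame 'theta'.

    Textbook Dempster rule; the names and data layout are hypothetical.
    """
    # Conflict: mass that the two sources assign to contradictory labels.
    conflict = m1["real"] * m2["fake"] + m1["fake"] * m2["real"]
    if conflict >= 1.0:
        raise ValueError("total conflict: the two sources cannot be combined")
    norm = 1.0 - conflict
    combined = {}
    for label in ("real", "fake"):
        # Agreement on the label, or one source committed and the other uncertain.
        combined[label] = (
            m1[label] * m2[label]
            + m1[label] * m2["theta"]
            + m1["theta"] * m2[label]
        ) / norm
    # Both sources uncertain.
    combined["theta"] = m1["theta"] * m2["theta"] / norm
    return combined


# Illustrative (made-up) masses for a text source and an image source.
text_mass = {"real": 0.6, "fake": 0.1, "theta": 0.3}
image_mass = {"real": 0.2, "fake": 0.3, "theta": 0.5}
fused = dempster_combine(text_mass, image_mass)
print(fused)
```

In this toy case the fused uncertainty mass is smaller than either input's, because a source that commits to a label outweighs one that stays uncertain.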

arXiv:2406.18118 [pdf, other]

SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance

Authors: Caishuang Huang, Wanxu Zhao, Rui Zheng, Huijie Lv, Shihan Dou, Sixian Li, Xiao Wang, Enyu Zhou, Junjie Ye, Yuming Yang, Tao Gui, Qi Zhang, Xuanjing Huang

Abstract: As the development of large language models (LLMs) rapidly advances, securing these models effectively without compromising their utility has become a pivotal area of research. However, current defense strategies against jailbreak attacks (i.e., efforts to bypass security protocols) often suffer from limited adaptability, restricted general capability, and high cost. To address these challenges, we introduce SafeAligner, a methodology implemented at the decoding stage to fortify defenses against jailbreak attacks. We begin by developing two specialized models: the Sentinel Model, which is trained to foster safety, and the Intruder Model, designed to generate riskier responses. SafeAligner leverages the disparity in security levels between the responses from these models to differentiate between harmful and beneficial tokens, effectively guiding the safety alignment by altering the output token distribution of the target model. Extensive experiments show that SafeAligner can increase the likelihood of beneficial tokens, while reducing the occurrence of harmful ones, thereby ensuring secure alignment with minimal loss to generality.

Submitted 28 June, 2024; v1 submitted 26 June, 2024; originally announced June 2024.
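
The SafeAligner abstract above describes shifting the target model's output token distribution using the disparity between a safety-trained model and a riskier one. The abstract does not state the update rule, so the sketch below assumes a simple one: add a scaled difference of the two auxiliary distributions to the target distribution, clip at zero, and renormalize. The function name, the `alpha` parameter, and the toy probabilities are all hypothetical.

```python
def guided_distribution(p_target, p_sentinel, p_intruder, alpha=0.5):
    """Shift p_target toward tokens the sentinel prefers over the intruder.

    Assumed additive rule for illustration; not taken from the paper.
    """
    adjusted = [
        max(t + alpha * (s - i), 0.0)  # clip so no token gets negative mass
        for t, s, i in zip(p_target, p_sentinel, p_intruder)
    ]
    total = sum(adjusted)
    if total == 0.0:
        # Degenerate case: fall back to the unmodified target distribution.
        return list(p_target)
    return [a / total for a in adjusted]


# Toy three-token vocabulary: index 0 = "Sure", 1 = "Sorry", 2 = "Here".
p_target = [0.5, 0.2, 0.3]
p_sentinel = [0.1, 0.8, 0.1]
p_intruder = [0.7, 0.05, 0.25]
guided = guided_distribution(p_target, p_sentinel, p_intruder)
print(guided)
```

With these made-up numbers the refusal-style token gains probability and the compliance-style token loses it, which is the qualitative behavior the abstract describes.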

arXiv:2403.17297 [pdf, other]

InternLM2 Technical Report

Authors: Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, et al. (75 additional authors not shown)

Abstract: The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI). However, replicating such advancements in open-source models has been challenging. This paper introduces InternLM2, an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, long-context modeling, and open-ended subjective evaluations through innovative pre-training and optimization techniques. The pre-training process of InternLM2 is meticulously detailed, highlighting the preparation of diverse data types including text, code, and long-context data. InternLM2 efficiently captures long-term dependencies, initially trained on 4k tokens before advancing to 32k tokens in pre-training and fine-tuning stages, exhibiting remarkable performance on the 200k "Needle-in-a-Haystack" test. InternLM2 is further aligned using Supervised Fine-Tuning (SFT) and a novel Conditional Online Reinforcement Learning from Human Feedback (COOL RLHF) strategy that addresses conflicting human preferences and reward hacking. By releasing InternLM2 models in different training stages and model sizes, we provide the community with insights into the model's evolution.

Submitted 25 March, 2024; originally announced March 2024.

arXiv:2403.04780 [pdf, other]

MuseGraph: Graph-oriented Instruction Tuning of Large Language Models for Generic Graph Mining

Authors: Yanchao Tan, Hang Lv, Xinyi Huang, Jiawei Zhang, Shiping Wang, Carl Yang

Abstract: Graphs with abundant attributes are essential in modeling interconnected entities and improving predictions in various real-world applications. Traditional Graph Neural Networks (GNNs), which are commonly used for modeling attributed graphs, need to be re-trained every time they are applied to different graph tasks and datasets. Although the emergence of Large Language Models (LLMs) has introduced a new paradigm in natural language processing, the generative potential of LLMs in graph mining remains largely under-explored. To this end, we propose a novel framework MuseGraph, which seamlessly integrates the strengths of GNNs and LLMs and facilitates a more effective and generic approach for graph mining across different tasks and datasets. Specifically, we first introduce a compact graph description via the proposed adaptive input generation to encapsulate key information from the graph under the constraints of language token limitations. Then, we propose a diverse instruction generation mechanism, which distills the reasoning capabilities from LLMs (e.g., GPT-4) to create task-specific Chain-of-Thought-based instruction packages for different graph tasks. Finally, we propose a graph-aware instruction tuning with a dynamic instruction package allocation strategy across tasks and datasets, ensuring the effectiveness and generalization of the training process. Our experimental results demonstrate significant improvements in different graph tasks, showcasing the potential of our MuseGraph in enhancing the accuracy of graph-oriented downstream tasks while keeping the generation powers of LLMs.

Submitted 13 March, 2024; v1 submitted 2 March, 2024; originally announced March 2024.

arXiv:2402.19282 [pdf, other]

WanJuan-CC: A Safe and High-Quality Open-sourced English Webtext Dataset

Authors: Jiantao Qiu, Haijun Lv, Zhenjiang Jin, Rui Wang, Wenchang Ning, Jia Yu, ChaoBin Zhang, Zhenxiang Li, Pei Chu, Yuan Qu, Jin Shi, Lindong Lu, Runyu Peng, Zhiyuan Zeng, Huanze Tang, Zhikai Lei, Jiawei Hong, Keyu Chen, Zhaoye Fei, Ruiliang Xu, Wei Li, Zhongying Tu, Lin Dahua, Yu Qiao, Hang Yan, et al. (1 additional author not shown)

Abstract: This paper presents WanJuan-CC, a safe and high-quality open-sourced English webtext dataset derived from Common Crawl data. The study addresses the challenges of constructing large-scale pre-training datasets for language models, which require vast amounts of high-quality data. A comprehensive process was designed to handle Common Crawl data, including extraction, heuristic rule filtering, fuzzy deduplication, content safety filtering, and data quality filtering. From approximately 68 billion original English documents, we obtained 2.22T Tokens of safe data and selected 1.0T Tokens of high-quality data as part of WanJuan-CC. We have open-sourced 100B Tokens from this dataset. The paper also provides statistical information related to data quality, enabling users to select appropriate data according to their needs. To evaluate the quality and utility of the dataset, we trained 1B-parameter and 3B-parameter models using WanJuan-CC and another dataset, RefinedWeb. Results show that WanJuan-CC performs better on validation datasets and downstream tasks.

Submitted 17 March, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

arXiv:2402.17791 [pdf, other]

doi 10.1109/tnnls.2024.3363695

Label Informed Contrastive Pretraining for Node Importance Estimation on Knowledge Graphs

Authors: Tianyu Zhang, Chengbin Hou, Rui Jiang, Xuegong Zhang, Chenghu Zhou, Ke Tang, Hairong Lv

Abstract: Node Importance Estimation (NIE) is a task of inferring importance scores of the nodes in a graph. Due to the availability of richer data and knowledge, recent research interests of NIE have been dedicated to knowledge graphs for predicting future or missing node importance scores. Existing state-of-the-art NIE methods train the model by available labels, and they consider every interested node equally before training. However, the nodes with higher importance often require or receive more attention in real-world scenarios, e.g., people may care more about the movies or webpages with higher importance. To this end, we introduce Label Informed ContrAstive Pretraining (LICAP) to the NIE problem for being better aware of the nodes with high importance scores. Specifically, LICAP is a novel type of contrastive learning framework that aims to fully utilize the continuous labels to generate contrastive samples for pretraining embeddings. Considering the NIE problem, LICAP adopts a novel sampling strategy called top nodes preferred hierarchical sampling to first group all interested nodes into a top bin and a non-top bin based on node importance scores, and then divide the nodes within the top bin into several finer bins also based on the scores. The contrastive samples are generated from those bins, and are then used to pretrain node embeddings of knowledge graphs via a newly proposed Predicate-aware Graph Attention Networks (PreGAT), so as to better separate the top nodes from non-top nodes, and distinguish the top nodes within the top bin by keeping the relative order among finer bins. Extensive experiments demonstrate that the LICAP pretrained embeddings can further boost the performance of existing NIE methods and achieve the new state-of-the-art performance regarding both regression and ranking metrics. The source code for reproducibility is available at https://github.com/zhangtia16/LICAP

Submitted 26 February, 2024; originally announced February 2024.

Comments: Accepted by IEEE TNNLS
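
The LICAP abstract above describes "top nodes preferred hierarchical sampling": nodes are first split into a top bin and a non-top bin by importance score, and the top bin is then divided into finer bins. The sketch below shows only that two-level binning step in plain Python. The function name and the `top_count` and `num_fine_bins` parameters are hypothetical, and the equal-size split of the top bin is an assumption; the abstract does not say how the finer bins are sized.

```python
import math


def hierarchical_bins(scores, top_count, num_fine_bins):
    """Split nodes into finer top bins and one non-top bin by importance score.

    scores: dict mapping node id -> importance score.
    Returns (fine_bins, non_top), with fine_bins ordered from highest scores down.
    """
    # Rank node ids from most to least important.
    ranked = sorted(scores, key=scores.get, reverse=True)
    top, non_top = ranked[:top_count], ranked[top_count:]
    # Assumed policy: equal-size consecutive chunks of the ranked top bin.
    size = math.ceil(len(top) / num_fine_bins)
    fine_bins = [top[i:i + size] for i in range(0, len(top), size)]
    return fine_bins, non_top


# Toy graph: ten nodes whose score equals their index.
scores = {f"n{i}": float(i) for i in range(10)}
fine_bins, non_top = hierarchical_bins(scores, top_count=4, num_fine_bins=2)
print(fine_bins)
print(non_top)
```

Contrastive pairs could then be drawn so that nodes in the same fine bin count as positives and nodes in other bins as negatives; that pairing policy is likewise an assumption and is not spelled out in the abstract.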

arXiv:2402.16717 [pdf, other]

CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models

Authors: Huijie Lv, Xiao Wang, Yuansen Zhang, Caishuang Huang, Shihan Dou, Junjie Ye, Tao Gui, Qi Zhang, Xuanjing Huang

Abstract: Adversarial misuse, particularly through 'jailbreaking' that circumvents a model's safety and ethical protocols, poses a significant challenge for Large Language Models (LLMs). This paper delves into the mechanisms behind such successful attacks, introducing a hypothesis for the safety mechanism of aligned LLMs: intent security recognition followed by response generation. Grounded in this hypothesis, we propose CodeChameleon, a novel jailbreak framework based on personalized encryption tactics. To elude the intent security recognition phase, we reformulate tasks into a code completion format, enabling users to encrypt queries using personalized encryption functions. To guarantee response generation functionality, we embed a decryption function within the instructions, which allows the LLM to decrypt and execute the encrypted queries successfully. We conduct extensive experiments on 7 LLMs, achieving state-of-the-art average Attack Success Rate (ASR). Remarkably, our method achieves an 86.6% ASR on GPT-4-1106.

Submitted 26 February, 2024; originally announced February 2024.

arXiv:2401.16762 [pdf, other]

Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image Personalization

Authors: Henglei Lv, Jiayu Xiao, Liang Li, Qingming Huang

Abstract: Diffusion-based text-to-image personalization has achieved great success in generating subjects specified by users among various contexts. Even so, existing finetuning-based methods still suffer from model overfitting, which greatly harms the generative diversity, especially when given subject images are few. To this end, we propose Pick-and-Draw, a training-free semantic guidance approach to boost identity consistency and generative diversity for personalization methods. Our approach consists of two components: appearance picking guidance and layout drawing guidance. As for the former, we construct an appearance palette with visual features from the reference image, where we pick local patterns for generating the specified subject with consistent identity. As for layout drawing, we outline the subject's contour by referring to a generative template from the vanilla diffusion model, and inherit the strong image prior to synthesize diverse contexts according to different text conditions. The proposed approach can be applied to any personalized diffusion models and requires as few as a single reference image. Qualitative and quantitative experiments show that Pick-and-Draw consistently improves identity consistency and generative diversity, pushing the trade-off between subject fidelity and image-text fidelity to a new Pareto frontier.

Submitted 30 January, 2024; originally announced January 2024.

arXiv:2401.05702 [pdf, other]

Video Anomaly Detection and Explanation via Large Language Models

Authors: Hui Lv, Qianru Sun

Abstract: Video Anomaly Detection (VAD) aims to localize abnormal events on the timeline of long-range surveillance videos. Anomaly-scoring-based methods have been prevailing for years but suffer from the high complexity of thresholding and low explainability of detection results. In this paper, we conduct pioneer research on equipping video-based large language models (VLLMs) in the framework of VAD, making the VAD model free from thresholds and able to explain the reasons for the detected anomalies. We introduce a novel network module Long-Term Context (LTC) to mitigate the incapability of VLLMs in long-range context modeling. We design a three-phase training method to improve the efficiency of fine-tuning VLLMs by substantially minimizing the requirements for VAD data and lowering the costs of annotating instruction-tuning data. Our trained model achieves the top performance on the anomaly videos of the UCF-Crime and TAD benchmarks, with the AUC improvements of +3.86% and +4.96%, respectively. More impressively, our approach can provide textual explanations for detected anomalies.

Submitted 11 January, 2024; originally announced January 2024.

Comments: 9 pages, 6 figures

arXiv:2311.13562 [pdf, other]

Soulstyler: Using Large Language Model to Guide Image Style Transfer for Target Object

Authors: Junhao Chen, Peng Rong, Jingbo Sun, Chao Li, Xiang Li, Hongwu Lv

Abstract: Image style transfer occupies an important place in both computer graphics and computer vision. However, most current methods require reference to stylized images and cannot individually stylize specific objects. To overcome this limitation, we propose the "Soulstyler" framework, which allows users to guide the stylization of specific objects in an image through simple textual descriptions. We introduce a large language model to parse the text and identify stylization goals and specific styles. Combined with a CLIP-based semantic visual embedding encoder, the model understands and matches text and image content. We also introduce a novel localized text-image block matching loss that ensures that style transfer is performed only on specified target objects, while non-target regions remain in their original style. Experimental results demonstrate that our model is able to accurately perform style transfer on target objects according to textual descriptions without affecting the style of background regions. Our code will be available at https://github.com/yisuanwang/Soulstyler.

Submitted 29 November, 2023; v1 submitted 22 November, 2023; originally announced November 2023.

Comments: 5 pages, 3 figures, ICASSP 2024

arXiv:2310.14278 [pdf, other]

doi 10.1109/TASLP.2024.3389630

Conversational Speech Recognition by Learning Audio-textual Cross-modal Contextual Representation

Authors: Kun Wei, Bei Li, Hang Lv, Quan Lu, Ning Jiang, Lei Xie

Abstract: Automatic Speech Recognition (ASR) in conversational settings presents unique challenges, including extracting relevant contextual information from previous conversational turns. Due to irrelevant content, error propagation, and redundancy, existing methods struggle to extract longer and more effective contexts. To address this issue, we introduce a novel conversational ASR system, extending the Conformer encoder-decoder model with cross-modal conversational representation. Our approach leverages a cross-modal extractor that combines pre-trained speech and text models through a specialized encoder and a modal-level mask input. This enables the extraction of richer historical speech context without explicit error propagation. We also incorporate conditional latent variational modules to learn conversational level attributes such as role preference and topic coherence. By introducing both cross-modal and conversational representations into the decoder, our model retains context over longer sentences without information loss, achieving relative accuracy improvements of 8.8% and 23% on Mandarin conversation datasets HKUST and MagicData-RAMC, respectively, compared to the standard Conformer model.

Submitted 27 April, 2024; v1 submitted 22 October, 2023; originally announced October 2023.

Comments: TASLP

Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024

arXiv:2310.10195 [pdf, other]

AdaLomo: Low-memory Optimization with Adaptive Learning Rate

Authors: Kai Lv, Hang Yan, Qipeng Guo, Haijun Lv, Xipeng Qiu

Abstract: Large language models have achieved remarkable success, but their extensive parameter size necessitates substantial memory for training, thereby setting a high threshold. While the recently proposed low-memory optimization (LOMO) reduces memory footprint, its optimization technique, akin to stochastic gradient descent, is sensitive to hyper-parameters and exhibits suboptimal convergence, failing to match the performance of the prevailing optimizer for large language models, AdamW. Through empirical analysis of the Adam optimizer, we found that, compared to momentum, the adaptive learning rate is more critical for bridging the gap. Building on this insight, we introduce the low-memory optimization with adaptive learning rate (AdaLomo), which offers an adaptive learning rate for each parameter. To maintain memory efficiency, we employ non-negative matrix factorization for the second-order moment estimation in the optimizer state. Additionally, we suggest the use of a grouped update normalization to stabilize convergence. Our experiments with instruction-tuning and further pre-training demonstrate that AdaLomo achieves results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models. The code is accessible at https://github.com/OpenLMLab/LOMO.

Submitted 6 June, 2024; v1 submitted 16 October, 2023; originally announced October 2023.

Comments: ACL 2024 camera ready version
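
The AdaLomo abstract above mentions non-negative matrix factorization of the second-order moment to keep optimizer state small. One common way to do this, used here as an assumption because the abstract does not give the exact scheme, is the Adafactor-style rank-1 factorization: store only the row sums and column sums of the squared-gradient matrix and rebuild an approximation from their outer product. For an m x n weight matrix this stores m + n numbers instead of m * n. The function names are hypothetical, and moving averages over steps are omitted for brevity.

```python
def factor_second_moment(grad_sq):
    """Return (row_sums, col_sums) of a non-negative matrix given as nested lists."""
    row_sums = [sum(row) for row in grad_sq]
    col_sums = [sum(col) for col in zip(*grad_sq)]
    return row_sums, col_sums


def reconstruct(row_sums, col_sums):
    """Rank-1 approximation: outer(row_sums, col_sums) / total mass.

    Assumes the matrix is not all zeros, so the total is positive.
    """
    total = sum(row_sums)
    return [[r * c / total for c in col_sums] for r in row_sums]


# Toy gradient whose elementwise square happens to be exactly rank 1.
grad = [[1.0, 2.0], [2.0, 4.0]]
grad_sq = [[g * g for g in row] for row in grad]
rows, cols = factor_second_moment(grad_sq)
approx = reconstruct(rows, cols)
print(approx)
```

The reconstruction is exact here only because the toy matrix is rank 1; for general gradients it is an approximation.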

arXiv:2310.08872 [pdf, other]

R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image Generation

Authors: Jiayu Xiao, Henglei Lv, Liang Li, Shuhui Wang, Qingming Huang

Abstract: Recent text-to-image (T2I) diffusion models have achieved remarkable progress in generating high-quality images given text-prompts as input. However, these models fail to convey appropriate spatial composition specified by a layout instruction. In this work, we probe into zero-shot grounded T2I generation with diffusion models, that is, generating images corresponding to the input layout information without training auxiliary modules or finetuning diffusion models. We propose a Region and Boundary (R&B) aware cross-attention guidance approach that gradually modulates the attention maps of diffusion model during generative process, and assists the model to synthesize images (1) with high fidelity, (2) highly compatible with textual input, and (3) interpreting layout instructions accurately. Specifically, we leverage the discrete sampling to bridge the gap between consecutive attention maps and discrete layout constraints, and design a region-aware loss to refine the generative layout during diffusion process. We further propose a boundary-aware loss to strengthen object discriminability within the corresponding regions. Experimental results show that our method outperforms existing state-of-the-art zero-shot grounded T2I generation methods by a large margin both qualitatively and quantitatively on several benchmarks.

Submitted 27 November, 2023; v1 submitted 13 October, 2023; originally announced October 2023.

Comments: Preprint. Under review. Project page: https://sagileo.github.io/Region-and-Boundary

arXiv:2310.02064 [pdf, ps, other]

Auction Design for Bidders with Ex Post ROI Constraints

Authors: Hongtao Lv, Xiaohui Bei, Zhenzhe Zheng, Fan Wu

Abstract: Motivated by practical constraints in online advertising, we investigate single-parameter auction design for bidders with constraints on their Return On Investment (ROI) -- a targeted minimum ratio between the obtained value and the payment. We focus on ex post ROI constraints, which require the ROI condition to be satisfied for every realized value profile. With ROI-constrained bidders, we first provide a full characterization of the allocation and payment rules of dominant-strategy incentive compatible (DSIC) auctions. In particular, we show that given any monotone allocation rule, the corresponding DSIC payment should be the Myerson payment with a rebate for each bidder to meet their ROI constraints. Furthermore, we also determine the optimal auction structure when the item is sold to a single bidder under a mild regularity condition. This structure entails a randomized allocation scheme and a first-price payment rule, which differs from the deterministic Myerson auction and previous works on ex ante ROI constraints.

Submitted 3 October, 2023; originally announced October 2023.

Comments: Accepted by WINE2023

arXiv:2309.13373 [pdf, other]

Asca: less audio data is more insightful

Authors: Xiang Li, Junhao Chen, Chao Li, Hongwu Lv

Abstract: Audio recognition in specialized areas such as birdsong and submarine acoustics faces challenges in large-scale pre-training due to the limitations in available samples imposed by sampling environments and specificity requirements. While the Transformer model excels in audio recognition, its dependence on vast amounts of data becomes restrictive in resource-limited settings. Addressing this, we introduce the Audio Spectrogram Convolution Attention (ASCA) based on CoAtNet, integrating a Transformer-convolution hybrid architecture, novel network design, and attention techniques, further augmented with data enhancement and regularization strategies. On BirdCLEF2023 and AudioSet (Balanced), ASCA achieved accuracies of 81.2% and 35.1%, respectively, significantly outperforming competing methods. The unique structure of our model enriches output, enabling generalization across various audio detection tasks. Our code can be found at https://github.com/LeeCiang/ASCA.

Submitted 23 September, 2023; originally announced September 2023.

Comments: 6 pages, 3 figures

arXiv:2308.12647 [pdf, other]

Multitasking Evolutionary Algorithm Based on Adaptive Seed Transfer for Combinatorial Problem

Authors: Haoyuan Lv, Ruochen Liu

Abstract: Evolutionary computing (EC) is widely used in dealing with combinatorial optimization problems (COP). Traditional EC methods can only solve a single task in a single run, while real-life scenarios often need to solve multiple COPs simultaneously. In recent years, evolutionary multitasking optimization (EMTO) has become an emerging topic in the EC community. And many methods have been designed to deal with multiple COPs concurrently through exchanging knowledge. However, many-task optimization, cross-domain knowledge transfer, and negative transfer are still significant challenges in this field. A new evolutionary multitasking algorithm based on adaptive seed transfer (MTEA-AST) is developed for multitasking COPs in this work. First, a dimension unification strategy is proposed to unify the dimensions of different tasks. And then, an adaptive task selection strategy is designed to capture the similarity between the target task and other online optimization tasks. The calculated similarity is exploited to select suitable source tasks for the target one and determine the transfer strength. Next, a task transfer strategy is established to select seeds from source tasks and correct unsuitable knowledge in seeds to suppress negative transfer. Finally, the experimental results indicate that MTEA-AST can adaptively transfer knowledge in both same-domain and cross-domain many-task environments. And the proposed method shows competitive performance compared to other state-of-the-art EMTOs in experiments consisting of four COPs.

Submitted 24 August, 2023; originally announced August 2023.

arXiv:2304.04972 [pdf, other]

Federated Learning with Classifier Shift for Class Imbalance

Authors: Yunheng Shen, Haoxiang Wang, Hairong Lv

Abstract: Federated learning aims to learn a global model collaboratively while the training data belongs to different clients and is not allowed to be exchanged. However, the statistical heterogeneity challenge on non-IID data, such as class imbalance in classification, will cause client drift and significantly reduce the performance of the global model. This paper proposes a simple and effective approach named FedShift which adds the shift on the classifier output during the local training phase to alleviate the negative impact of class imbalance. We theoretically prove that the classifier shift in FedShift can make the local optimum consistent with the global optimum and ensure the convergence of the algorithm. Moreover, our experiments indicate that FedShift significantly outperforms the other state-of-the-art federated learning approaches on various datasets regarding accuracy and communication efficiency.

Submitted 11 April, 2023; originally announced April 2023.

arXiv:2303.12369 [pdf, other]

Unbiased Multiple Instance Learning for Weakly Supervised Video Anomaly Detection

Authors: Hui Lv, Zhongqi Yue, Qianru Sun, Bin Luo, Zhen Cui, Hanwang Zhang

Abstract: Weakly Supervised Video Anomaly Detection (WSVAD) is challenging because the binary anomaly label is only given on the video level, but the output requires snippet-level predictions. So, Multiple Instance Learning (MIL) is prevailing in WSVAD. However, MIL is notoriously known to suffer from many false alarms because the snippet-level detector is easily biased towards the abnormal snippets with simple context, confused by the normality with the same bias, and missing the anomaly with a different pattern. To this end, we propose a new MIL framework: Unbiased MIL (UMIL), to learn unbiased anomaly features that improve WSVAD. At each MIL training iteration, we use the current detector to divide the samples into two groups with different context biases: the most confident abnormal/normal snippets and the rest ambiguous ones. Then, by seeking the invariant features across the two sample groups, we can remove the variant context biases. Extensive experiments on benchmarks UCF-Crime and TAD demonstrate the effectiveness of our UMIL. Our code is provided at https://github.com/ktr-hubrt/UMIL.

Submitted 22 March, 2023; originally announced March 2023.

Comments: 11 pages, 10 figures

arXiv:2302.09902 [pdf, other]

doi 10.1109/TCAD.2022.3207316

Variation Enhanced Attacks Against RRAM-based Neuromorphic Computing System

Authors: Hao Lv, Bing Li, Lei Zhang, Cheng Liu, Ying Wang

Abstract: The RRAM-based neuromorphic computing system has amassed explosive interests for its superior data processing capability and energy efficiency than traditional architectures, and thus being widely used in many data-centric applications. The reliability and security issues of the NCS therefore become an essential problem. In this paper, we systematically investigated the adversarial threats to the RRAM-based NCS and observed that the RRAM hardware feature can be leveraged to strengthen the attack effect, which has not been granted sufficient attention by previous algorithmic attack methods. Thus, we proposed two types of hardware-aware attack methods with respect to different attack scenarios and objectives. The first is adversarial attack, VADER, which perturbs the input samples to mislead the prediction of neural networks. The second is fault injection attack, EFI, which perturbs the network parameter space such that a specified sample will be classified to a target label, while maintaining the prediction accuracy on other samples. Both attack methods leverage the RRAM properties to improve the performance compared with the conventional attack methods. Experimental results show that our hardware-aware attack methods can achieve nearly 100% attack success rate with extremely low operational cost, while maintaining the attack stealthiness.

Submitted 20 February, 2023; originally announced February 2023.

Comments: submitted to IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

arXiv:2302.08062 [pdf]

doi 10.1111/2041-210X.14229

Fossil Image Identification using Deep Learning Ensembles of Data Augmented Multiviews

Authors: Chengbin Hou, Xinyu Lin, Hanhui Huang, Sheng Xu, Junxuan Fan, Yukun Shi, Hairong Lv

Abstract: Identification of fossil species is crucial to evolutionary studies. Recent advances from deep learning have shown promising prospects in fossil image identification. However, the quantity and quality of labeled fossil images are often limited due to fossil preservation, conditioned sampling, and expensive and inconsistent label annotation by domain experts, which pose great challenges to training deep learning based image classification models. To address these challenges, we follow the idea of the wisdom of crowds and propose a multiview ensemble framework, which collects Original (O), Gray (G), and Skeleton (S) views of each fossil image reflecting its different characteristics to train multiple base models, and then makes the final decision via soft voting. Experiments on the largest fusulinid dataset with 2400 images show that the proposed OGS consistently outperforms baselines (using a single model for each view), and obtains superior or comparable performance compared to OOO (using three base models for three the same Original views). Besides, as the training data decreases, the proposed framework achieves more gains. While considering the identification consistency estimation with respect to human experts, OGS receives the highest agreement with the original labels of dataset and with the re-identifications of two human experts. The validation performance provides a quantitative estimation of consistency across different experts and genera.
We conclude that the proposed framework can present state-of-the-art performance in the fusulinid fossil identification case study. This framework is designed for general fossil identification and it is expected to see applications to other fossil datasets in future work. The source code is publicly available at https://github.com/houchengbin/Fossil-Image-Identification to benefit future research in fossil image identification.

Submitted 1 February, 2024; v1 submitted 15 February, 2023; originally announced February 2023.

Comments: published in Methods in Ecology and Evolution

Journal ref: Methods in Ecology and Evolution, 14, 3020-3034 (2023)

arXiv:2302.07493 [pdf, other]

Adaptive incentive for cross-silo federated learning: A multi-agent reinforcement learning approach

Authors: Shijing Yuan, Hongze Liu, Hongtao Lv, Zhanbo Feng, Jie Li, Hongyang Chen, Chentao Wu

Abstract: Cross-silo federated learning (FL) is a typical FL that enables organizations (e.g., financial or medical entities) to train global models on isolated data. Reasonable incentive is key to encouraging organizations to contribute data. However, existing works on incentivizing cross-silo FL lack consideration of the environmental dynamics (e.g., precision of the trained global model and data owned by uncertain clients during the training processes). Moreover, most of them assume that organizations share private information, which is unrealistic. To overcome these limitations, we propose a novel adaptive mechanism for cross-silo FL, towards incentivizing organizations to contribute data to maximize their long-term payoffs in a real dynamic training environment. The mechanism is based on multi-agent reinforcement learning, which learns near-optimal data contribution strategy from the history of potential games without organizations' private information. Experiments demonstrate that our mechanism achieves adaptive incentive and effectively improves the long-term payoffs for organizations.

Submitted 15 February, 2023; originally announced February 2023.

arXiv:2301.00615 [pdf, other]

doi 10.1145/3603269.3604850

ChameleMon: Shifting Measurement Attention as Network State Changes

Authors: Kaicheng Yang, Yuhan Wu, Ruijie Miao, Tong Yang, Zirui Liu, Zicang Xu, Rui Qiu, Yikai Zhao, Hanglong Lv, Zhigang Ji, Gaogang Xie

Abstract: Flow-level network measurement is critical to many network applications. Among various measurement tasks, packet loss detection and heavy-hitter detection are two most important measurement tasks, which we call the two key tasks. In practice, the two key tasks are often required at the same time, but existing works seldom handle both tasks. In this paper, we design ChameleMon to support the two key tasks simultaneously. One key design/novelty of ChameleMon is to shift measurement attention as network state changes, through two dimensions of dynamics: 1) dynamically allocating memory between the two key tasks; 2) dynamically monitoring the flows of importance. To realize the key design, we propose a key technique, leveraging Fermat's little theorem to devise a flexible data structure, namely FermatSketch. FermatSketch is dividable, additive, and subtractive, supporting the two key tasks. We have fully implemented a ChameleMon prototype on a testbed with a Fat-tree topology. We conduct extensive experiments and the results show ChameleMon supports the two key tasks with low memory/bandwidth overhead, and more importantly, it can automatically shift measurement attention as network state changes.

Submitted 20 July, 2023; v1 submitted 2 January, 2023; originally announced January 2023.

Comments: This is a preprint of ChameleMon: Shifting Measurement Attention as Network State Changes, to appear in SIGCOMM 2023

Journal ref: ACM SIGCOMM (2023) 881-903

arXiv:2211.16716 [pdf, other]

Automated Generating Natural Language Requirements based on Domain Ontology

Authors: Ziyan Zhao, Li Zhang, Xiaoyun Gao, Xiaoli Lian, Heyang Lv, Lin Shi

Abstract: Software requirements specification is undoubtedly critical for the whole software life-cycle. Nowadays, writing software requirements specifications primarily depends on human work. Although massive studies have been proposed to fasten the process via proposing advanced elicitation and analysis techniques, it is still a time-consuming and error-prone task that needs to take domain knowledge and business information into consideration. In this paper, we propose an approach, named ReqGen, which can provide recommendations by automatically generating natural language requirements specifications based on certain given keywords. Specifically, ReqGen consists of three critical steps. First, keywords-oriented knowledge is selected from domain ontology and is injected to the basic Unified pre-trained Language Model (UniLM) for domain fine-tuning. Second, a copy mechanism is integrated to ensure the occurrence of keywords in the generated statements. Finally, a requirement syntax constrained decoding is designed to close the semantic and syntax distance between the candidate and reference specifications. Experiments on two public datasets from different groups and domains show that ReqGen outperforms six popular natural language generation approaches with respect to the hard constraint of keywords (phrases) inclusion, BLEU, ROUGE and syntax compliance. We believe that ReqGen can promote the efficiency and intelligence of specifying software requirements.

Submitted 29 November, 2022; originally announced November 2022.

arXiv:2211.16251 [pdf, other]

Utility Maximizer or Value Maximizer: Mechanism Design for Mixed Bidders in Online Advertising

Authors: Hongtao Lv, Zhilin Zhang, Zhenzhe Zheng, Jinghan Liu, Chuan Yu, Lei Liu, Lizhen Cui, Fan Wu

Abstract: Digital advertising constitutes one of the main revenue sources for online platforms. In recent years, some advertisers tend to adopt auto-bidding tools to facilitate advertising performance optimization, making the classical utility maximizer model in auction theory not fit well. Some recent studies proposed a new model, called value maximizer, for auto-bidding advertisers with return-on-investment (ROI) constraints. However, the model of either utility maximizer or value maximizer could only characterize partial advertisers in real-world advertising platforms. In a mixed environment where utility maximizers and value maximizers coexist, the truthful ad auction design would be challenging since bidders could manipulate both their values and affiliated classes, leading to a multi-parameter mechanism design problem. In this work, we address this issue by proposing a payment rule which combines the corresponding ones in classical VCG and GSP mechanisms in a novel way. Based on this payment rule, we propose a truthful auction mechanism with an approximation ratio of $2$ on social welfare, which is close to the lower bound of at least $\frac{5}{4}$ that we also prove. The designed auction mechanism is a generalization of VCG for utility maximizers and GSP for value maximizers.

Submitted 30 November, 2022; v1 submitted 29 November, 2022; originally announced November 2022.

Comments: accepted by AAAI2023

arXiv:2210.04287 [pdf, other]

Learning to Decompose Visual Features with Latent Textual Prompts

Authors: Feng Wang, Manling Li, Xudong Lin, Hairong Lv, Alexander G. Schwing, Heng Ji

Abstract: Recent advances in pre-training vision-language models like CLIP have shown great potential in learning transferable visual representations. Nonetheless, for downstream inference, CLIP-like models suffer from either 1) degraded accuracy and robustness in the case of inaccurate text descriptions during retrieval-based inference (the challenge for zero-shot protocol); or 2) breaking the well-established vision-language alignment (the challenge for linear probing). To address them, we propose Decomposed Feature Prompting (DeFo). DeFo leverages a flexible number of learnable embeddings as textual input while maintaining the vision-language dual-model architecture, which enables the model to learn decomposed visual features with the help of feature-level textual prompts. We further use an additional linear layer to perform classification, allowing a scalable size of language inputs. Our empirical study shows DeFo's significance in improving the vision-language models. For example, DeFo obtains 73.2% test accuracy on ImageNet with a ResNet-50 backbone without tuning any pretrained weights of both the vision and language encoder, outperforming zero-shot CLIP by a large margin of 15.0%, and outperforming state-of-the-art vision-language prompt tuning method by 7.6%.

Submitted 9 October, 2022; originally announced October 2022.

arXiv:2209.13116 [pdf, other]

Spatio-Temporal Relation Learning for Video Anomaly Detection

Authors: Hui Lv, Zhen Cui, Biao Wang, Jian Yang

Abstract: Anomaly identification is highly dependent on the relationship between the object and the scene, as different/same object actions in same/different scenes may lead to various degrees of normality and anomaly. Therefore, object-scene relation actually plays a crucial role in anomaly detection but is inadequately explored in previous works. In this paper, we propose a Spatial-Temporal Relation Learning (STRL) framework to tackle the video anomaly detection task. First, considering dynamic characteristics of the objects as well as scene areas, we construct a Spatio-Temporal Auto-Encoder (STAE) to jointly exploit spatial and temporal evolution patterns for representation learning. For better pattern extraction, two decoding branches are designed in the STAE module, i.e. an appearance branch capturing spatial cues by directly predicting the next frame, and a motion branch focusing on modeling the dynamics via optical flow prediction. Then, to well concretize the object-scene relation, a Relation Learning (RL) module is devised to analyze and summarize the normal relations by introducing the Knowledge Graph Embedding methodology. Specifically in this process, the plausibility of object-scene relation is measured by jointly modeling object/scene features and optimizable object-scene relation maps. Extensive experiments are conducted on three public datasets, and the superior performance over the state-of-the-art methods demonstrates the effectiveness of our method.

Submitted 26 September, 2022; originally announced September 2022.

Comments: 8 pages, 5 figures, Journal

arXiv:2209.08933 [pdf, ps, other]

Estimating Brain Age with Global and Local Dependencies

Authors: Yanwu Yang, Xutao Guo, Zhikai Chang, Chenfei Ye, Yang Xiang, Haiyan Lv, Ting Ma

Abstract: The brain age has been proven to be a phenotype of relevance to cognitive performance and brain disease. Achieving accurate brain age prediction is an essential prerequisite for optimizing the predicted brain-age difference as a biomarker. As a comprehensive biological characteristic, the brain age is hard to be exploited accurately with models using feature engineering and local processing such as local convolution and recurrent operations that process one local neighborhood at a time. Instead, Vision Transformers learn global attentive interaction of patch tokens, introducing less inductive bias and modeling long-range dependencies. In terms of this, we proposed a novel network for learning brain age interpreting with global and local dependencies, where the corresponding representations are captured by Successive Permuted Transformer (SPT) and convolution blocks. The SPT brings computation efficiency and locates the 3D spatial information indirectly via continuously encoding 2D slices from different views. Finally, we collect a large cohort of 22645 subjects with ages ranging from 14 to 97 and our network performed the best among a series of deep learning methods, yielding a mean absolute error (MAE) of 2.855 in validation set, and 2.911 in an independent test set.

Submitted 19 September, 2022; originally announced September 2022.

arXiv:2207.06718 [pdf]

doi 10.1109/IECON49645.2022.9968471

Hardware-in-the-Loop Simulation for Evaluating Communication Impacts on the Wireless-Network-Controlled Robots

Authors: Honghao Lv, Zhibo Pang, Ming Xiao, Geng Yang

Abstract: More and more robot automation applications have changed to wireless communication, and network performance has a growing impact on robotic systems. This study proposes a hardware-in-the-loop (HiL) simulation methodology for connecting the simulated robot platform to real network devices. This project seeks to provide robotic engineers and researchers with the capability to experiment without heavily modifying the original controller and get more realistic test results that correlate with actual network conditions. We deployed this HiL simulation system in two common cases for wireless-network-controlled robotic applications: (1) safe multi-robot coordination for mobile robots, and (2) human-motion-based teleoperation for manipulators. The HiL simulation system is deployed and tested under various network conditions in all circumstances. The experiment results are analyzed and compared with the previous simulation methods, demonstrating that the proposed HiL simulation methodology can identify a more reliable communication impact on robot systems.

Submitted 28 September, 2022; v1 submitted 14 July, 2022; originally announced July 2022.

Comments: 6 pages, 11 figures, to appear in 48th Annual Conference of the Industrial Electronics Society IECON 2022 Conference

arXiv:2207.01261 [pdf, other]

Minimizing Sequential Confusion Error in Speech Command Recognition

Authors: Zhanheng Yang, Hang Lv, Xiong Wang, Ao Zhang, Lei Xie

Abstract: Speech command recognition (SCR) has been commonly used on resource constrained devices to achieve hands-free user experience. However, in real applications, confusion among commands with similar pronunciations often happens due to the limited capacity of small models deployed on edge devices, which drastically affects the user experience. In this paper, inspired by the advances of discriminative training in speech recognition, we propose a novel minimize sequential confusion error (MSCE) training criterion particularly for SCR, aiming to alleviate the command confusion problem. Specifically, we aim to improve the ability of discriminating the target command from other commands on the basis of MCE discriminative criteria. We define the likelihood of different commands through connectionist temporal classification (CTC). During training, we propose several strategies to use prior knowledge creating a confusing sequence set for similar-sounding command instead of creating the whole non-target command set, which can better save the training resources and effectively reduce command confusion errors. Specifically, we design and compare three different strategies for confusing set construction. By using our proposed method, we can relatively reduce the False Reject Rate (FRR) by 33.7% at 0.01 False Alarm Rate (FAR) and confusion errors by 18.28% on our collected speech command set.

Submitted 4 July, 2022; originally announced July 2022.

Comments: Accepted by Interspeech 2022

arXiv:2203.16539 [pdf, other]

doi 10.3389/fphy.2022.843932

Identification of diffracted vortex beams at different propagation distances using deep learning

Authors: Heng Lv, Yan Guo, Zi-Xiang Yang, Chunling Ding, Wu-Hao Cai, Chenglong You, Rui-Bo Jin

Abstract: Orbital angular momentum of light is regarded as a valuable resource in quantum technology, especially in quantum communication and quantum sensing and ranging. However, the OAM state of light is susceptible to undesirable experimental conditions such as propagation distance and phase distortions, which hinders the potential for the realistic implementation of relevant technologies. In this article, we exploit an enhanced deep learning neural network to identify different OAM modes of light at multiple propagation distances with phase distortions. Specifically, our trained deep learning neural network can efficiently identify the vortex beam's topological charge and propagation distance with 97% accuracy. Our technique has important implications for OAM based communication and sensing protocols.

Submitted 30 March, 2022; originally announced March 2022.

Comments: 9 pages, 4 figures

Journal ref: Frontiers in Physics 10, 843932 (2022)

arXiv:2203.15455 [pdf, other]

WeNet 2.0: More Productive End-to-End Speech Recognition Toolkit

Authors: Binbin Zhang, Di Wu, Zhendong Peng, Xingchen Song, Zhuoyuan Yao, Hang Lv, Lei Xie, Chao Yang, Fuping Pan, Jianwei Niu

Abstract: Recently, we made available WeNet, a production-oriented end-to-end speech recognition toolkit, which introduces a unified two-pass (U2) framework and a built-in runtime to address the streaming and non-streaming decoding modes in a single model. To further improve ASR performance and facilitate various production requirements, in this paper, we present WeNet 2.0 with four important updates. (1) We propose U2++, a unified two-pass framework with bidirectional attention decoders, which includes the future contextual information by a right-to-left attention decoder to improve the representative ability of the shared encoder and the performance during the rescoring stage. (2) We introduce an n-gram based language model and a WFST-based decoder into WeNet 2.0, promoting the use of rich text data in production scenarios. (3) We design a unified contextual biasing framework, which leverages user-specific context (e.g., contact lists) to provide rapid adaptation ability for production and improves ASR accuracy in both with-LM and without-LM scenarios. (4) We design a unified IO to support large-scale data for effective model training. In summary, the brand-new WeNet 2.0 achieves up to 10% relative recognition performance improvement over the original WeNet on various corpora and makes available several important production-oriented features.

Submitted 5 July, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

arXiv:2110.14636 [pdf, other]

Pay attention to emoji: Feature Fusion Network with EmoGraph2vec Model for Sentiment Analysis

Authors: Xiaowei Yuan, Jingyuan Hu, Xiaodan Zhang, Honglei Lv

Abstract: With the explosive growth of social media, opinionated postings with emojis have increased explosively. Many emojis are used to express emotions, attitudes, and opinions. Emoji representation learning can be helpful to improve the performance of emoji-related natural language processing tasks, especially in text sentiment analysis. However, most studies have only utilized the fixed descriptions provided by the Unicode Consortium without consideration of actual usage scenarios. As for the sentiment analysis task, many researchers ignore the emotional impact of the interaction between text and emojis. It results that the emotional semantics of emojis cannot be fully explored. In this work, we propose a method called EmoGraph2vec to learn emoji representations by constructing a co-occurrence graph network from social data and enriching the semantic information based on an external knowledge base EmojiNet to embed emoji nodes. Based on EmoGraph2vec model, we design a novel neural network to incorporate text and emoji information into sentiment analysis, which uses a hybrid-attention module combined with TextCNN-based classifier to improve performance. Experimental results show that the proposed model can outperform several baselines for sentiment analysis on benchmark datasets. Additionally, we conduct a series of ablation and comparison experiments to investigate the effectiveness and interpretability of our model.

Submitted 23 May, 2022; v1 submitted 27 October, 2021; originally announced October 2021.

Comments: Camera-ready version accepted by ICPR 2022

arXiv:2110.14227

doi 10.1007/978-3-030-92307-5_1

Emoji-based Co-attention Network for Microblog Sentiment Analysis

Authors: Xiaowei Yuan, Jingyuan Hu, Xiaodan Zhang, Honglei Lv, Hao Liu

Abstract: Emojis are widely used in online social networks to express emotions, attitudes, and opinions. As emotional-oriented characters, emojis can be modeled as important features of emotions towards the recipient or subject for sentiment analysis. However, existing methods mainly take emojis as heuristic information that fails to resolve the problem of ambiguity noise. Recent studies have utilized emojis as an independent input to classify text sentiment but they ignore the emotional impact of the interaction between text and emojis. As a result, the emotional semantics of emojis cannot be fully explored. In this paper, we propose an emoji-based co-attention network that learns the mutual emotional semantics between text and emojis on microblogs. Our model adopts the co-attention mechanism based on bidirectional long short-term memory incorporating the text and emojis, and integrates a squeeze-and-excitation block in a convolutional neural network classifier to increase its sensitivity to emotional semantic features. Experimental results show that the proposed method can significantly outperform several baselines for sentiment analysis on short texts of social media.

Submitted 14 January, 2022; v1 submitted 27 October, 2021; originally announced October 2021.

Comments: There are technical details that need to be changed, and the replacement version will take time to complete

arXiv:2110.03370 [pdf, other]

WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition

Authors: Binbin Zhang, Hang Lv, Pengcheng Guo, Qijie Shao, Chao Yang, Lei Xie, Xin Xu, Hui Bu, Xiaoyu Chen, Chenchen Zeng, Di Wu, Zhendong Peng

Abstract: In this paper, we present WenetSpeech, a multi-domain Mandarin corpus consisting of 10000+ hours high-quality labeled speech, 2400+ hours weakly labeled speech, and about 10000 hours unlabeled speech, with 22400+ hours in total. We collect the data from YouTube and Podcast, which covers a variety of speaking styles, scenarios, domains, topics, and noisy conditions. An optical character recognition (OCR) based method is introduced to generate the audio/text segmentation candidates for the YouTube data on its corresponding video captions, while a high-quality ASR transcription system is used to generate audio/text pair candidates for the Podcast data. Then we propose a novel end-to-end label error detection approach to further validate and filter the candidates. We also provide three manually labelled high-quality test sets along with WenetSpeech for evaluation -- Dev for cross-validation purpose in training, Test_Net, collected from Internet for matched test, and Test_Meeting, recorded from real meetings for more challenging mismatched test. Baseline systems trained with WenetSpeech are provided for three popular speech recognition toolkits, namely Kaldi, ESPnet, and WeNet, and recognition results on the three test sets are also provided as benchmarks. To the best of our knowledge, WenetSpeech is the current largest open-sourced Mandarin speech corpus with transcriptions, which benefits research on production-level speech recognition.

Submitted 23 February, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

arXiv:2110.00959 [pdf, other]

Boost Neural Networks by Checkpoints

Authors: Feng Wang, Guoyizhe Wei, Qiao Liu, Jinxiang Ou, Xian Wei, Hairong Lv

Abstract: Training multiple deep neural networks (DNNs) and averaging their outputs is a simple way to improve the predictive performance. Nevertheless, the multiplied training cost prevents this ensemble method from being practical and efficient. Several recent works attempt to save and ensemble the checkpoints of DNNs, which only requires the same computational cost as training a single network. However, these methods suffer from either marginal accuracy improvements due to the low diversity of checkpoints or high risk of divergence due to the cyclical learning rates they adopted. In this paper, we propose a novel method to ensemble the checkpoints, where a boosting scheme is utilized to accelerate model convergence and maximize the checkpoint diversity. We theoretically prove that it converges by reducing exponential loss. The empirical evaluation also indicates our proposed ensemble outperforms single model and existing ensembles in terms of accuracy and efficiency. With the same training budget, our method achieves 4.16% lower error on Cifar-100 and 6.96% on Tiny-ImageNet with ResNet-110 architecture. Moreover, the adaptive sample weights in our method make it an effective solution to address the imbalanced class distribution. In the experiments, it yields up to 5.02% higher accuracy over single EfficientNet-B0 on the imbalanced datasets.

Submitted 25 October, 2021; v1 submitted 3 October, 2021; originally announced October 2021.

arXiv:2109.07045 [pdf, ps, other]

Uncertainty Quantification in Medical Image Segmentation with Multi-decoder U-Net

Authors: Yanwu Yang, Xutao Guo, Yiwei Pan, Pengcheng Shi, Haiyan Lv, Ting Ma

Abstract: Accurate medical image segmentation is crucial for diagnosis and analysis. However, the models without calibrated uncertainty estimates might lead to errors in downstream analysis and exhibit low levels of robustness. Estimating the uncertainty in the measurement is vital to making definite, informed conclusions. Especially, it is difficult to make accurate predictions on ambiguous areas and focus boundaries for both models and radiologists, even harder to reach a consensus with multiple annotations. In this work, the uncertainty under these areas is studied, which introduces significant information with anatomical structure and is as important as segmentation performance. We exploit the medical image segmentation uncertainty quantification by measuring segmentation performance with multiple annotations in a supervised learning manner and propose a U-Net based architecture with multiple decoders, where the image representation is encoded with the same encoder, and segmentation referring to each annotation is estimated with multiple decoders. Nevertheless, a cross-loss function is proposed for bridging the gap between different branches. The proposed architecture is trained in an end-to-end manner and able to improve predictive uncertainty estimates. The model achieves comparable performance with fewer parameters to the integrated training model that ranked the runner-up in the MICCAI-QUBIQ 2020 challenge.

Submitted 14 September, 2021; originally announced September 2021.

Comments: MICCAI_QUBIQ challenge, conference, Uncertainty qualification

arXiv:2108.10623 [pdf, other]

Data-Free Evaluation of User Contributions in Federated Learning

Authors: Hongtao Lv, Zhenzhe Zheng, Tie Luo, Fan Wu, Shaojie Tang, Lifeng Hua, Rongfei Jia, Chengfei Lv

Abstract: Federated learning (FL) trains a machine learning model on mobile devices in a distributed manner using each device's private data and computing resources. A critical issue is to evaluate individual users' contributions so that (1) users' effort in model training can be compensated with proper incentives and (2) malicious and low-quality users can be detected and removed. The state-of-the-art solutions require a representative test dataset for the evaluation purpose, but such a dataset is often unavailable and hard to synthesize. In this paper, we propose a method called Pairwise Correlated Agreement (PCA) based on the idea of peer prediction to evaluate user contribution in FL without a test dataset. PCA achieves this using the statistical correlation of the model parameters uploaded by users. We then apply PCA to designing (1) a new federated learning algorithm called Fed-PCA, and (2) a new incentive mechanism that guarantees truthfulness. We evaluate the performance of PCA and Fed-PCA using the MNIST dataset and a large industrial product recommendation dataset. The results demonstrate that our Fed-PCA outperforms the canonical FedAvg algorithm and other baseline methods in accuracy, and at the same time, PCA effectively incentivizes users to behave truthfully.

Submitted 24 August, 2021; originally announced August 2021.

Comments: accepted by WiOpt 2021

arXiv:2106.03593 [pdf, other]

Neural Auction: End-to-End Learning of Auction Mechanisms for E-Commerce Advertising

Authors: Xiangyu Liu, Chuan Yu, Zhilin Zhang, Zhenzhe Zheng, Yu Rong, Hongtao Lv, Da Huo, Yiqing Wang, Dagui Chen, Jian Xu, Fan Wu, Guihai Chen, Xiaoqiang Zhu

Abstract: In e-commerce advertising, it is crucial to jointly consider various performance metrics, e.g., user experience, advertiser utility, and platform revenue. Traditional auction mechanisms, such as GSP and VCG auctions, can be suboptimal due to their fixed allocation rules to optimize a single performance metric (e.g., revenue or social welfare). Recently, data-driven auctions, learned directly from auction outcomes to optimize multiple performance metrics, have attracted increasing research interests. However, the procedure of auction mechanisms involves various discrete calculation operations, making it challenging to be compatible with continuous optimization pipelines in machine learning. In this paper, we design Deep Neural Auctions (DNAs) to enable end-to-end auction learning by proposing a differentiable model to relax the discrete sorting operation, a key component in auctions. We optimize the performance metrics by developing deep models to efficiently extract contexts from auctions, providing rich features for auction design. We further integrate the game theoretical conditions within the model design, to guarantee the stability of the auctions. DNAs have been successfully deployed in the e-commerce advertising system at Taobao. Experimental evaluation results on both large-scale data set as well as online A/B test demonstrated that DNAs significantly outperformed other mechanisms widely adopted in industry.

Submitted 13 July, 2021; v1 submitted 7 June, 2021; originally announced June 2021.

Comments: To appear in the Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2021

arXiv:2104.06813 [pdf, other]

doi 10.1145/3394171.3416277

Global Information Guided Video Anomaly Detection

Authors: Hui Lv, Chunyan Xu, Zhen Cui

Abstract: Video anomaly detection (VAD) is currently a challenging task due to the complexity of anomaly as well as the lack of labor-intensive temporal annotations. In this paper, we propose an end-to-end Global Information Guided (GIG) anomaly detection framework for anomaly detection using the video-level annotations (i.e., weak labels). We propose to first mine the global pattern cues by leveraging the weak labels in a GIG module. Then we build a spatial reasoning module to measure the relevance between vectors in spatial domain with the global cue vectors, and select the most related feature vectors for temporal anomaly detection. The experimental results on the CityScene challenge demonstrate the effectiveness of our model.

Submitted 14 April, 2021; originally announced April 2021.

arXiv:2104.06689 [pdf, other]

Learning Normal Dynamics in Videos with Meta Prototype Network

Authors: Hui Lv, Chen Chen, Zhen Cui, Chunyan Xu, Yong Li, Jian Yang

Abstract: Frame reconstruction (current or future frame) based on Auto-Encoder (AE) is a popular method for video anomaly detection. With models trained on the normal data, the reconstruction errors of anomalous scenes are usually much larger than those of normal ones. Previous methods introduced the memory bank into AE, for encoding diverse normal patterns across the training videos. However, they are memory-consuming and cannot cope with unseen new scenarios in the testing data. In this work, we propose a dynamic prototype unit (DPU) to encode the normal dynamics as prototypes in real time, free from extra memory cost. In addition, we introduce meta-learning to our DPU to form a novel few-shot normalcy learner, namely Meta-Prototype Unit (MPU). It enables the fast adaption capability on new scenes by only consuming a few iterations of update. Extensive experiments are conducted on various benchmarks. The superior performance over the state-of-the-art demonstrates the effectiveness of our method.

Submitted 10 May, 2021; v1 submitted 14 April, 2021; originally announced April 2021.

Comments: 9 pages, 4 figures, 6 tables

arXiv:2103.09063 [pdf, other]

An Asynchronous WFST-Based Decoder For Automatic Speech Recognition

Authors: Hang Lv, Zhehuai Chen, Hainan Xu, Daniel Povey, Lei Xie, Sanjeev Khudanpur

Abstract: We introduce an asynchronous dynamic decoder, which adopts an efficient A* algorithm to incorporate big language models in the one-pass decoding for large vocabulary continuous speech recognition. Unlike standard one-pass decoding with an on-the-fly composition decoder, which might induce a significant computation overhead, the asynchronous dynamic decoder has a novel design where it has two fronts, with one performing "exploration" and the other "backfill". The computation of the two fronts alternates in the decoding process, resulting in more effective pruning than the standard one-pass decoding with an on-the-fly composition decoder. Experiments show that the proposed decoder works notably faster than the standard one-pass decoding with an on-the-fly composition decoder, while the acceleration will be more obvious with the increment of data complexity.

Submitted 16 March, 2021; originally announced March 2021.

Comments: 5 pages, 5 figures, icassp

arXiv:2102.04488 [pdf, other]

Wake Word Detection with Streaming Transformers

Authors: Yiming Wang, Hang Lv, Daniel Povey, Lei Xie, Sanjeev Khudanpur

Abstract: Modern wake word detection systems usually rely on neural networks for acoustic modeling. Transformers have recently shown superior performance over LSTM and convolutional networks in various sequence modeling tasks with their better temporal modeling power. However, it is not clear whether this advantage still holds for short-range temporal modeling like wake word detection. Besides, the vanilla Transformer is not directly applicable to the task due to its non-streaming nature and the quadratic time and space complexity. In this paper we explore the performance of several variants of chunk-wise streaming Transformers tailored for wake word detection in a recently proposed LF-MMI system, including looking-ahead to the next chunk, gradient stopping, different positional embedding methods and adding same-layer dependency between chunks. Our experiments on the Mobvoi wake word dataset demonstrate that our proposed Transformer model outperforms the baseline convolution network by 25% on average in false rejection rate at the same false alarm rate with a comparable model size, while still maintaining linear complexity w.r.t. the sequence length.

Submitted 8 February, 2021; originally announced February 2021.

Comments: Accepted at IEEE ICASSP 2021. 5 pages, 3 figures

arXiv:2101.08387 [pdf]

doi 10.1007/s10462-022-10283-5

A Survey on Ensemble Learning under the Era of Deep Learning

Authors: Yongquan Yang, Haijun Lv, Ning Chen

Abstract: Due to the dominant position of deep learning (mostly deep neural networks) in various artificial intelligence applications, recently, ensemble learning based on deep neural networks (ensemble deep learning) has shown significant performances in improving the generalization of learning system. However, since modern deep neural networks usually have millions to billions of parameters, the time and space overheads for training multiple base deep learners and testing with the ensemble deep learner are far greater than that of traditional ensemble learning. Though several algorithms of fast ensemble deep learning have been proposed to promote the deployment of ensemble deep learning in some applications, further advances still need to be made for many applications in specific fields, where the developing time and computing resources are usually restricted or the data to be processed is of large dimensionality. An urgent problem that needs to be solved is how to take the significant advantages of ensemble deep learning while reducing the required expenses so that many more applications in specific fields can benefit from it. For the alleviation of this problem, it is essential to know about how ensemble learning has developed under the era of deep learning. Thus, in this article, we present fundamental discussions focusing on data analyses of published works, methodologies, recent advances and unattainability of traditional ensemble learning and ensemble deep learning. We hope this article will be helpful to realize the intrinsic problems and technical challenges faced by future developments of ensemble learning under the era of deep learning.

Submitted 27 September, 2022; v1 submitted 20 January, 2021; originally announced January 2021.

Comments: 47 pages, 8 figures, 15 tables

ACM Class: A.1

Journal ref: Artificial Intelligence Review, 2022

arXiv:2012.01295 [pdf, other]

Generating Descriptions for Sequential Images with Local-Object Attention and Global Semantic Context Modelling

Authors: Jing Su, Chenghua Lin, Mian Zhou, Qingyun Dai, Haoyu Lv

Abstract: In this paper, we propose an end-to-end CNN-LSTM model for generating descriptions for sequential images with a local-object attention mechanism. To generate coherent descriptions, we capture global semantic context using a multi-layer perceptron, which learns the dependencies between sequential images. A paralleled LSTM network is exploited for decoding the sequence descriptions. Experimental results show that our model outperforms the baseline across three different evaluation metrics on the datasets published by Microsoft.

Submitted 2 December, 2020; originally announced December 2020.

Comments: Accepted by INLG 2018

arXiv:2011.09301 [pdf, other]

Context-aware RNNLM Rescoring for Conversational Speech Recognition

Authors: Kun Wei, Pengcheng Guo, Hang Lv, Zhen Tu, Lei Xie

Abstract: Conversational speech recognition is regarded as a challenging task due to its free-style speaking and long-term contextual dependencies. Prior work has explored the modeling of long-range context through RNNLM rescoring with improved performance. To further take advantage of the persisted nature during a conversation, such as topics or speaker turn, we extend the rescoring procedure to a new context-aware manner. For RNNLM training, we capture the contextual dependencies by concatenating adjacent sentences with various tag words, such as speaker or intention information. For lattice rescoring, the lattice of adjacent sentences are also connected with the first-pass decoded result by tag words. Besides, we also adopt a selective concatenation strategy based on tf-idf, making the best use of contextual similarity to improve transcription performance. Results on four different conversation test sets show that our approach yields up to 13.1% and 6% relative char-error-rate (CER) reduction compared with 1st-pass decoding and common lattice-rescoring, respectively.

Submitted 18 November, 2020; originally announced November 2020.

arXiv:2008.08944 [pdf, other]

doi 10.1109/TIP.2021.3072863

Localizing Anomalies from Weakly-Labeled Videos

Authors: Hui Lv, Chuanwei Zhou, Chunyan Xu, Zhen Cui, Jian Yang

Abstract: Video anomaly detection under video-level labels is currently a challenging task. Previous works have made progress on discriminating whether a video sequence contains anomalies. However, most of them fail to accurately localize the anomalous events within videos in the temporal domain. In this paper, we propose a Weakly Supervised Anomaly Localization (WSAL) method focusing on temporally localizing anomalous segments within anomalous videos. Inspired by the appearance difference in anomalous videos, the evolution of adjacent temporal segments is evaluated for the localization of anomalous segments. To this end, a high-order context encoding model is proposed to not only extract semantic representations but also measure the dynamic variations so that the temporal context could be effectively utilized. In addition, in order to fully utilize the spatial context information, the immediate semantics are directly derived from the segment representations. The dynamic variations as well as the immediate semantics, are efficiently aggregated to obtain the final anomaly scores. An enhancement strategy is further proposed to deal with noise interference and the absence of localization guidance in anomaly detection. Moreover, to facilitate the diversity requirement for anomaly detection benchmarks, we also collect a new traffic anomaly (TAD) dataset which specifies in the traffic conditions, differing greatly from the current popular anomaly detection evaluation benchmarks. Extensive experiments are conducted to verify the effectiveness of different components, and our proposed method achieves new state-of-the-art performance on the UCF-Crime and TAD datasets.

Submitted 14 April, 2021; v1 submitted 20 August, 2020; originally announced August 2020.

arXiv:2007.13058 [pdf]

Do recommender systems function in the health domain: a system review

Authors: Jia Su, Yi Guan, Yuge Li, Weile Chen, He Lv, Yageng Yan

Abstract: Recommender systems have fulfilled an important role in everyday life. Recommendations such as news by Google, videos by Netflix, goods by e-commerce providers, etc. have heavily changed everyone's lifestyle. Health domains contain similar decision-making problems such as what to eat, how to exercise, and what is the proper medicine for a patient. Recently, studies focused on recommender systems to solve health problems have attracted attention. In this paper, we review aspects of health recommender systems including interests, methods, evaluation, future challenges and trend issues. We find that 1) health recommender systems have their own health concern limitations that cause them to focus on less-risky recommendations such as diet recommendation; 2) traditional recommender methods such as content-based and collaborative filtering methods can hardly handle health constraints, but knowledge-based methods function more than ever; 3) evaluating a health recommendation is more complicated than evaluating a commercial one because multiple dimensions in addition to accuracy should be considered. Recommender systems can function well in the health domain after the solution of several key problems. Our work is a systematic review of health recommender system studies; we show current conditions and future directions. It is believed that this review will help domain researchers and promote health recommender systems to the next step.

Submitted 26 July, 2020; originally announced July 2020.

Comments: 32 pages, 1 table, 1 figure, 38 discussed articles

MSC Class: 68U35 ACM Class: H.4.0

arXiv:2005.08347 [pdf, other]

Wake Word Detection with Alignment-Free Lattice-Free MMI

Authors: Yiming Wang, Hang Lv, Daniel Povey, Lei Xie, Sanjeev Khudanpur

Abstract: Always-on spoken language interfaces, e.g. personal digital assistants, rely on a wake word to start processing spoken input. We present novel methods to train a hybrid DNN/HMM wake word detection system from partially labeled training data, and to use it in on-line applications: (i) we remove the prerequisite of frame-level alignments in the LF-MMI training algorithm, permitting the use of un-transcribed training examples that are annotated only for the presence/absence of the wake word; (ii) we show that the classical keyword/filler model must be supplemented with an explicit non-speech (silence) model for good performance; (iii) we present an FST-based decoder to perform online detection. We evaluate our methods on two real data sets, showing 50%--90% reduction in false rejection rates at pre-specified false alarm rates over the best previously published figures, and re-validate them on a third (large) data set.

Submitted 28 July, 2020; v1 submitted 17 May, 2020; originally announced May 2020.

Comments: Accepted at Interspeech 2020. 5 pages, 3 figures

arXiv:1911.07706 [pdf, other]

Mechanism Design with Predicted Task Revenue for Bike Sharing Systems

Authors: Hongtao Lv, Chaoli Zhang, Zhenzhe Zheng, Tie Luo, Fan Wu, Guihai Chen

Abstract: Bike sharing systems have been widely deployed around the world in recent years. A core problem in such systems is to reposition the bikes so that the distribution of bike supply is reshaped to better match the dynamic bike demand. When the bike-sharing company or platform is able to predict the revenue of each reposition task based on historic data, an additional constraint is to cap the payment for each task below its predicted revenue. In this paper, we propose an incentive mechanism called TruPreTar to incentivize users to park bicycles at locations desired by the platform toward rebalancing supply and demand. TruPreTar possesses four important economic and computational properties such as truthfulness and budget feasibility. Furthermore, we prove that even when the payment budget is tight, the total revenue still exceeds or equals the budget. Otherwise, TruPreTar achieves 2-approximation as compared to the optimal (revenue-maximizing) solution, which is close to the lower bound of at least $\sqrt{2}$ that we also prove. Using an industrial dataset obtained from a large bike-sharing company, our experiments show that TruPreTar is effective in rebalancing bike supply and demand and, as a result, generates high revenue that outperforms several benchmark mechanisms.

Submitted 3 July, 2023; v1 submitted 18 November, 2019; originally announced November 2019.

Comments: Accepted by AAAI 2020; This is the full version that contains all the proofs

arXiv:1909.08723 [pdf, other]

Espresso: A Fast End-to-end Neural Speech Recognition Toolkit

Authors: Yiming Wang, Tongfei Chen, Hainan Xu, Shuoyang Ding, Hang Lv, Yiwen Shao, Nanyun Peng, Lei Xie, Shinji Watanabe, Sanjeev Khudanpur

Abstract: We present Espresso, an open-source, modular, extensible end-to-end neural automatic speech recognition (ASR) toolkit based on the deep learning library PyTorch and the popular neural machine translation toolkit fairseq. Espresso supports distributed training across GPUs and computing nodes, and features various decoding approaches commonly employed in ASR, including look-ahead word-based language model fusion, for which a fast, parallelized decoder is implemented. Espresso achieves state-of-the-art ASR performance on the WSJ, LibriSpeech, and Switchboard data sets among other end-to-end systems without data augmentation, and is 4--11x faster for decoding than similar systems (e.g. ESPnet).

Submitted 14 October, 2019; v1 submitted 18 September, 2019; originally announced September 2019.

Comments: Accepted to ASRU 2019

Showing 1–50 of 52 results for author: Lv, H