Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation

Authors: Mengzhe Geng, Xurong Xie, Jiajun Deng, Zengrui Jin, Guinan Li, Tianzi Wang, Shujie Hu, Zhaoqing Li, Helen Meng, Xunying Liu

Abstract: The application of data-intensive automatic speech recognition (ASR) technologies to dysarthric and elderly adult speech is confronted by their mismatch against healthy and nonaged voices, data scarcity and large speaker-level variability. To this end, this paper proposes two novel data-efficient methods to learn homogeneous dysarthric and elderly speaker-level features for rapid, on-the-fly test-time adaptation of DNN/TDNN and Conformer ASR models. These include: 1) speaker-level variance-regularized spectral basis embedding (VR-SBE) features that exploit a special regularization term to enforce homogeneity of speaker features in adaptation; and 2) feature-based learning hidden unit contributions (f-LHUC) transforms that are conditioned on VR-SBE features. Experiments are conducted on four tasks across two languages: the English UASpeech and TORGO dysarthric speech datasets, and the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech corpora. The proposed on-the-fly speaker adaptation techniques consistently outperform baseline iVector and xVector adaptation by statistically significant word or character error rate reductions up to 5.32% absolute (18.57% relative) and batch-mode LHUC speaker adaptation by 2.24% absolute (9.20% relative), while operating with real-time factor speed-ups of up to 33.6 times over xVectors during adaptation. The efficacy of the proposed adaptation techniques is demonstrated in a comparison against current ASR technologies including SSL pre-trained systems on UASpeech, where our best system produces a state-of-the-art WER of 23.33%. Analyses show VR-SBE features and f-LHUC transforms are insensitive to speaker-level data quantity in test-time adaptation. t-SNE visualization reveals they have stronger speaker-level homogeneity than baseline iVectors, xVectors and batch-mode LHUC transforms.

Submitted 8 July, 2024; originally announced July 2024.

Comments: In submission to IEEE/ACM Transactions on Audio, Speech, and Language Processing

arXiv:2406.10160 [pdf, other]

One-pass Multiple Conformer and Foundation Speech Systems Compression and Quantization Using An All-in-one Neural Model

Authors: Zhaoqing Li, Haoning Xu, Tianzi Wang, Shoukang Hu, Zengrui Jin, Shujie Hu, Jiajun Deng, Mingyu Cui, Mengzhe Geng, Xunying Liu

Abstract: We propose a novel one-pass multiple ASR systems joint compression and quantization approach using an all-in-one neural model. A single compression cycle allows multiple nested systems with varying Encoder depths, widths, and quantization precision settings to be simultaneously constructed without the need to train and store individual target systems separately. Experiments consistently demonstrate the multiple ASR systems compressed in a single all-in-one model produced a word error rate (WER) comparable to, or lower by up to 1.01% absolute (6.98% relative) than individually trained systems of equal complexity. A 3.4x overall system compression and training time speed-up was achieved. Maximum model size compression ratios of 12.8x and 3.93x were obtained over the baseline Switchboard-300hr Conformer and LibriSpeech-100hr fine-tuned wav2vec2.0 models, respectively, incurring no statistically significant WER increase.

Submitted 14 June, 2024; originally announced June 2024.

Comments: Accepted by Interspeech 2024

arXiv:2406.10152 [pdf, other]

Joint Speaker Features Learning for Audio-visual Multichannel Speech Separation and Recognition

Authors: Guinan Li, Jiajun Deng, Youjun Chen, Mengzhe Geng, Shujie Hu, Zhe Li, Zengrui Jin, Tianzi Wang, Xurong Xie, Helen Meng, Xunying Liu

Abstract: This paper proposes joint speaker feature learning methods for zero-shot adaptation of audio-visual multichannel speech separation and recognition systems. xVector and ECAPA-TDNN speaker encoders are connected using purpose-built fusion blocks and tightly integrated with the complete system training. Experiments conducted on LRS3-TED data simulated multichannel overlapped speech suggest that joint speaker feature learning consistently improves speech separation and recognition performance over the baselines without joint speaker feature estimation. Further analyses reveal performance improvements are strongly correlated with increased inter-speaker discrimination measured using cosine similarity. The best-performing joint speaker feature learning adapted system outperformed the baseline fine-tuned WavLM model by statistically significant WER reductions of 21.6% and 25.3% absolute (67.5% and 83.5% relative) on Dev and Test sets after incorporating WavLM features and video modality.

Submitted 14 June, 2024; originally announced June 2024.

Comments: Accepted by Interspeech 2024

arXiv:2406.10034 [pdf, other]

Towards Effective and Efficient Non-autoregressive Decoding Using Block-based Attention Mask

Authors: Tianzi Wang, Xurong Xie, Zhaoqing Li, Shoukang Hu, Zengrui Jin, Jiajun Deng, Mingyu Cui, Shujie Hu, Mengzhe Geng, Guinan Li, Helen Meng, Xunying Liu

Abstract: This paper proposes a novel non-autoregressive (NAR) block-based Attention Mask Decoder (AMD) that flexibly balances performance-efficiency trade-offs for Conformer ASR systems. AMD performs parallel NAR inference within contiguous blocks of output labels that are concealed using attention masks, while conducting left-to-right AR prediction and history context amalgamation between blocks. A beam search algorithm is designed to leverage a dynamic fusion of CTC, AR Decoder, and AMD probabilities. Experiments on the LibriSpeech-100hr corpus suggest the tripartite Decoder incorporating the AMD module produces a maximum decoding speed-up ratio of 1.73x over the baseline CTC+AR decoding, while incurring no statistically significant word error rate (WER) increase on the test sets. When operating with the same decoding real-time factors, statistically significant WER reductions of up to 0.7% and 0.3% absolute (5.3% and 6.1% relative) were obtained over the CTC+AR baseline.

Submitted 16 July, 2024; v1 submitted 14 June, 2024; originally announced June 2024.

Comments: 5 pages, 2 figures, 2 tables, Interspeech24 conference

arXiv:2406.08911 [pdf, other]

An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios

Authors: Cheng Gong, Erica Cooper, Xin Wang, Chunyu Qiang, Mengzhe Geng, Dan Wells, Longbiao Wang, Jianwu Dang, Marc Tessier, Aidan Pine, Korin Richmond, Junichi Yamagishi

Abstract: Self-supervised learning (SSL) representations from massively multilingual models offer a promising solution for low-resource language speech tasks. Despite advancements, language adaptation in TTS systems remains an open problem. This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system proposed in our previous work. We conducted experiments on 12 languages using limited data with various fine-tuning configurations. We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance. Additionally, we find that the fine-tuning dataset size and number of speakers influence adaptability. Surprisingly, we also observed that using paired data for fine-tuning is not always optimal compared to audio-only data. Beyond speech intelligibility, our analysis covers speaker similarity, language identification, and predicted MOS.

Submitted 13 June, 2024; originally announced June 2024.

Comments: Accepted to Interspeech 2024

arXiv:2405.19323 [pdf, other]

Are Large Language Models Chameleons?

Authors: Mingmeng Geng, Sihong He, Roberto Trotta

Abstract: Do large language models (LLMs) have their own worldviews and personality tendencies? Simulations in which an LLM was asked to answer subjective questions were conducted more than 1 million times. Comparison of the responses from different LLMs with real data from the European Social Survey (ESS) suggests that the effect of prompts on bias and variability is fundamental, highlighting major cultural, age, and gender biases. Methods for measuring the difference between LLMs and survey data are discussed, such as calculating weighted means and a new proposed measure inspired by Jaccard similarity. We conclude that it is important to analyze the robustness and variability of prompts before using LLMs to model individual decisions or collective behavior, as their imitation abilities are approximate at best.

Submitted 29 May, 2024; originally announced May 2024.

Comments: 16 pages, 8 figures

arXiv:2405.07639 [pdf]

Unveiling the Magmatic Architecture Beneath Oceanus Procellarum: Insights from GRAIL Mission Data

Authors: Meixia Geng, Qingjie Yang, Chaouki Kasmi, J. Kim Welford, Alexander L. Peace

Abstract: The Oceanus Procellarum region, characterized by its vast basaltic plains and pronounced volcanic activity, serves as a focal point for understanding the volcanic history of the Moon. Leveraging the Gravity Recovery and Interior Laboratory (GRAIL) mission data, we imaged the magmatic structures beneath the Oceanus Procellarum region. Our 3D density models uncover pronounced linear magmatic structures along the Procellarum's western border and significant intrusions within the northern and southern Marius Hills. Crucially, they reveal three narrow near-horizontal sheeted magmatic structures, 80-150 km long, extending from near-surface to 6-7 km depth, which we identified as sill-like magmatic conduits. These magmatic conduits connect the Marius Hills' northern and southern intrusions and bridge them with the Procellarum's western border structures. These discoveries suggest that sill-like magmatic conduits likely serve as central pathways facilitating magma transport across various volcanic systems and furthermore indicate widespread magmatic connectivity beneath the Oceanus Procellarum.

Submitted 13 May, 2024; originally announced May 2024.

Comments: 30 pages, 6 figures, and 1 table

arXiv:2405.05814 [pdf]

MSDiff: Multi-Scale Diffusion Model for Ultra-Sparse View CT Reconstruction

Authors: Pinhuang Tan, Mengxiao Geng, Jingya Lu, Liu Shi, Bin Huang, Qiegen Liu

Abstract: Computed Tomography (CT) technology reduces radiation hazards to the human body through sparse sampling, but fewer sampling angles pose challenges for image reconstruction. Score-based generative models are widely used in sparse-view CT reconstruction, but their performance diminishes significantly with a sharp reduction in projection angles. Therefore, we propose an ultra-sparse view CT reconstruction method utilizing multi-scale diffusion models (MSDiff), designed to concentrate on the global distribution of information and facilitate the reconstruction of sparse views with local image characteristics. Specifically, the proposed model ingeniously integrates information from both comprehensive sampling and selectively sparse sampling techniques. Through precise adjustments in the diffusion model, it is capable of extracting diverse noise distributions, furthering the understanding of the overall structure of images, and aiding the fully sampled model in recovering image information more effectively. By leveraging the inherent correlations within the projection data, we have designed an equidistant mask, enabling the model to focus its attention more effectively. Experimental results demonstrated that the multi-scale model approach significantly improved the quality of image reconstruction under ultra-sparse angles, with good generalization across various datasets.

Submitted 9 May, 2024; originally announced May 2024.

arXiv:2405.05763 [pdf]

DP-MDM: Detail-Preserving MR Reconstruction via Multiple Diffusion Models

Authors: Mengxiao Geng, Jiahao Zhu, Xiaolin Zhu, Qiqing Liu, Dong Liang, Qiegen Liu

Abstract: Detail features of magnetic resonance images play a crucial role in accurate medical diagnosis and treatment, as they capture subtle changes that pose challenges for doctors when performing precise judgments. However, the widely utilized naive diffusion model has limitations, as it fails to accurately capture more intricate details. To enhance the quality of MRI reconstruction, we propose a comprehensive detail-preserving reconstruction method using multiple diffusion models to extract structure and detail features in k-space domain instead of image domain. Moreover, virtual binary modal masks are utilized to refine the range of values in k-space data through highly adaptive center windows, which allows the model to focus its attention more efficiently. Last but not least, an inverted pyramid structure is employed, where the top-down image information gradually decreases, enabling a cascade representation. The framework effectively represents multi-scale sampled data, taking into account the sparsity of the inverted pyramid architecture, and utilizes cascade training data distribution to represent multi-scale data. Through a step-by-step refinement approach, the method refines the approximation of details. Finally, the proposed method was evaluated by conducting experiments on clinical and public datasets. The results demonstrate that the proposed method outperforms other methods.

Submitted 9 May, 2024; originally announced May 2024.

arXiv:2404.16687 [pdf, other]

NTIRE 2024 Quality Assessment of AI-Generated Content Challenge

Authors: Xiaohong Liu, Xiongkuo Min, Guangtao Zhai, Chunyi Li, Tengchuan Kou, Wei Sun, Haoning Wu, Yixuan Gao, Yuqin Cao, Zicheng Zhang, Xiele Wu, Radu Timofte, Fei Peng, Huiyuan Fu, Anlong Ming, Chuanming Wang, Huadong Ma, Shuai He, Zifei Dou, Shu Chen, Huacong Zhang, Haiyi Xie, Chengwei Wang, Baoying Chen, Jishen Zeng, et al. (89 additional authors not shown)

Abstract: This paper reports on the NTIRE 2024 Quality Assessment of AI-Generated Content Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2024. This challenge is to address a major challenge in the field of image and video processing, namely, Image Quality Assessment (IQA) and Video Quality Assessment (VQA) for AI-Generated Content (AIGC). The challenge is divided into the image track and the video track. The image track uses the AIGIQA-20K, which contains 20,000 AI-Generated Images (AIGIs) generated by 15 popular generative models. The image track has a total of 318 registered participants. A total of 1,646 submissions are received in the development phase, and 221 submissions are received in the test phase. Finally, 16 participating teams submitted their models and fact sheets. The video track uses the T2VQA-DB, which contains 10,000 AI-Generated Videos (AIGVs) generated by 9 popular Text-to-Video (T2V) models. A total of 196 participants have registered in the video track. A total of 991 submissions are received in the development phase, and 185 submissions are received in the test phase. Finally, 12 participating teams submitted their models and fact sheets. Some methods have achieved better results than baseline methods, and the winning methods in both tracks have demonstrated superior prediction performance on AIGC.

Submitted 7 May, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

arXiv:2404.08627 [pdf, other]

Is ChatGPT Transforming Academics' Writing Style?

Authors: Mingmeng Geng, Roberto Trotta

Abstract: Based on one million arXiv papers submitted from May 2018 to January 2024, we assess the textual density of ChatGPT's writing style in their abstracts by means of a statistical analysis of word frequency changes. Our model is calibrated and validated on a mixture of real abstracts and ChatGPT-modified abstracts (simulated data) after a careful noise analysis. We find that ChatGPT is having an increasing impact on arXiv abstracts, especially in the field of computer science, where the fraction of ChatGPT-revised abstracts is estimated to be approximately 35%, if we take the output of one of the simplest prompts, "revise the following sentences", as a baseline. We conclude with an analysis of both positive and negative aspects of the penetration of ChatGPT into academics' writing style.

Submitted 12 April, 2024; originally announced April 2024.

Comments: 15 pages, 19 figures

arXiv:2403.03382 [pdf, other]

Adaptive Discovering and Merging for Incremental Novel Class Discovery

Authors: Guangyao Chen, Peixi Peng, Yangru Huang, Mengyue Geng, Yonghong Tian

Abstract: One important desideratum of lifelong learning aims to discover novel classes from unlabelled data in a continuous manner. The central challenge is twofold: discovering and learning novel classes while mitigating the issue of catastrophic forgetting of established knowledge. To this end, we introduce a new paradigm called Adaptive Discovering and Merging (ADM) to discover novel categories adaptively in the incremental stage and integrate novel knowledge into the model without affecting the original knowledge. To discover novel classes adaptively, we decouple representation learning and novel class discovery, and use Triple Comparison (TC) and Probability Regularization (PR) to constrain the probability discrepancy and diversity for adaptive category assignment. To merge the learned novel knowledge adaptively, we propose a hybrid structure with base and novel branches named Adaptive Model Merging (AMM), which reduces the interference of the novel branch on the old classes to preserve the previous knowledge, and merges the novel branch to the base model without performance loss and parameter growth. Extensive experiments on several datasets show that ADM significantly outperforms existing class-incremental Novel Class Discovery (class-iNCD) approaches. Moreover, our AMM also benefits the class-incremental Learning (class-IL) task by alleviating the catastrophic forgetting problem.

Submitted 5 March, 2024; originally announced March 2024.

Comments: AAAI 2024. arXiv admin note: text overlap with arXiv:2207.08605 by other authors

arXiv:2401.00662 [pdf, other]

Enhancing Pre-trained ASR System Fine-tuning for Dysarthric Speech Recognition using Adversarial Data Augmentation

Authors: Huimeng Wang, Zengrui Jin, Mengzhe Geng, Shujie Hu, Guinan Li, Tianzi Wang, Haoning Xu, Xunying Liu

Abstract: Automatic recognition of dysarthric speech remains a highly challenging task to date. Neuro-motor conditions and co-occurring physical disabilities create difficulty in large-scale data collection for ASR system development. Adapting SSL pre-trained ASR models to limited dysarthric speech via data-intensive parameter fine-tuning leads to poor generalization. To this end, this paper presents an extensive comparative study of various data augmentation approaches to improve the robustness of pre-trained ASR model fine-tuning to dysarthric speech. These include: a) conventional speaker-independent perturbation of impaired speech; b) speaker-dependent speed perturbation, or GAN-based adversarial perturbation of normal, control speech based on their time alignment against parallel dysarthric speech; c) novel Spectral basis GAN-based adversarial data augmentation operating on non-parallel data. Experiments conducted on the UASpeech corpus suggest GAN-based data augmentation consistently outperforms fine-tuned Wav2vec2.0 and HuBERT models using no data augmentation and speed perturbation across different data expansion operating points by statistically significant word error rate (WER) reductions up to 2.01% and 0.96% absolute (9.03% and 4.63% relative) respectively on the UASpeech test set of 16 dysarthric speakers. After cross-system outputs rescoring, the best system produced the lowest published WER of 16.53% (46.47% on very low intelligibility) on UASpeech.

Submitted 31 December, 2023; originally announced January 2024.

Comments: To appear at IEEE ICASSP 2024

arXiv:2312.11562 [pdf, other]

A Survey of Reasoning with Foundation Models

Authors: Jiankai Sun, Chuanyang Zheng, Enze Xie, Zhengying Liu, Ruihang Chu, Jianing Qiu, Jiaqi Xu, Mingyu Ding, Hongyang Li, Mengzhe Geng, Yue Wu, Wenhai Wang, Junsong Chen, Zhangyue Yin, Xiaozhe Ren, Jie Fu, Junxian He, Wu Yuan, Qi Liu, Xihui Liu, Yu Li, Hao Dong, Yu Cheng, Ming Zhang, Pheng Ann Heng, et al. (9 additional authors not shown)

Abstract: Reasoning, a crucial ability for complex problem-solving, plays a pivotal role in various real-world settings such as negotiation, medical diagnosis, and criminal investigation. It serves as a fundamental methodology in the field of Artificial General Intelligence (AGI). With the ongoing development of foundation models, e.g., Large Language Models (LLMs), there is a growing interest in exploring their abilities in reasoning tasks. In this paper, we introduce seminal foundation models proposed or adaptable for reasoning, highlighting the latest advancements in various reasoning tasks, methods, and benchmarks. We then delve into the potential future directions behind the emergence of reasoning abilities within foundation models. We also discuss the relevance of multimodal learning, autonomous agents, and super alignment in the context of reasoning. By discussing these future research directions, we hope to inspire researchers in their exploration of this field, stimulate further advancements in reasoning with foundation models, and contribute to the development of AGI.

Submitted 25 January, 2024; v1 submitted 17 December, 2023; originally announced December 2023.

Comments: 20 Figures, 160 Pages, 750+ References, Project Page https://github.com/reasoning-survey/Awesome-Reasoning-Foundation-Models

arXiv:2312.08641 [pdf, other]

Towards Automatic Data Augmentation for Disordered Speech Recognition

Authors: Zengrui Jin, Xurong Xie, Tianzi Wang, Mengzhe Geng, Jiajun Deng, Guinan Li, Shujie Hu, Xunying Liu

Abstract: Automatic recognition of disordered speech remains a highly challenging task to date due to data scarcity. This paper presents a reinforcement learning (RL) based on-the-fly data augmentation approach for training state-of-the-art PyChain TDNN and end-to-end Conformer ASR systems on such data. The handcrafted temporal and spectral mask operations in the standard SpecAugment method that are task and system dependent, together with additionally introduced minimum and maximum cut-offs of these time-frequency masks, are now automatically learned using an RNN-based policy controller and tightly integrated with ASR system training. Experiments on the UASpeech corpus suggest the proposed RL-based data augmentation approach consistently produced performance superior or comparable to that obtained using expert or handcrafted SpecAugment policies. Our RL auto-augmented PyChain TDNN system produced an overall WER of 28.79% on the UASpeech test set of 16 dysarthric speakers.

Submitted 13 December, 2023; originally announced December 2023.

Comments: To appear at IEEE ICASSP 2024

arXiv:2311.18442 [pdf, other]

Electronic Phase Propagation Speed in BaFe$_2$As$_2$ Revealed by Dilatometry

Authors: Xin Qin, Xingyu Wang, Wenshan Hong, Mengqiao Geng, Yuan Li, Huiqian Luo, Shiliang Li, Yang Liu

Abstract: Thermal expansion offers deep insights into phase transitions in condensed matter physics. Utilizing an advanced AC-temperature dilatometer with picometer resolution, this study clearly resolves the antiferromagnetic and structural transition in BaFe$_2$As$_2$. The implementation of temperature oscillation reveals a hysteresis near the transition temperature $T_\mathrm{N}$ with unprecedented resolution. Unexpectedly, we find that the hysteretic width exhibits a universal dependence on the parameters of temperature oscillation and the sample's longitudinal dimension, which in turn reveals a finite transition speed. Our quantitative analysis shows that this phase boundary propagates at a mere 188 $μ$m/s - a speed seven orders of magnitude slower than acoustic waves. It suggests a hidden thermodynamic constraint imposed by the electronic degrees of freedom. Our research not only sheds light on the dynamics of phase transitions between different correlated phases, but also establishes high precision dilatometry as a powerful tool for material studies. This measurement technique, when properly modified, can be extended to studies of other material properties such as piezoelectric, magneto-restriction, elastic modulus, etc.

Submitted 26 March, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

arXiv:2311.16641 [pdf, other]

A High Resolution Dilatometer Using Optical Fiber Interferometer

Authors: Xin Qin, Guoxin Cao, Mengqiao Geng, Shengchun Liu, Yang Liu

Abstract: We introduce a high performance differential dilatometer based on an all-fiber Michelson interferometer at cryogenic temperature with $10^{-10}$ resolution in $δL/L$. It resolves the linear thermal expansion coefficient by measuring the oscillating changes of sample thickness and sample temperature with the interferometer and in-situ thermometer, respectively. By measuring the linear thermal expansion coefficient $α$ near the antiferromagnetic transition region of BaFe$_2$As$_2$ as a demonstration, we show our dilatometer is able to measure thin samples with sub-pm-level length change resolution and mK-level temperature resolution. Although there is a residual background thermal expansion of a few nm/K in the measurement result, our new dilatometer is still a powerful tool for the study of phase transitions in condensed matter physics. Thanks to its extremely high resolution and contactless nature, it has especially significant advantages for fragile materials with sub-100$μ$m thickness and for integration with multiple synchronous measurements and tuning. The prototype design of this setup can be further improved in many aspects for specific applications.

Submitted 13 May, 2024; v1 submitted 28 November, 2023; originally announced November 2023.

arXiv:2311.09667 [pdf, other]

Repetitive nonoverlapping sequential pattern mining

Authors: Meng Geng, Youxi Wu, Yan Li, Jing Liu, Philippe Fournier-Viger, Xingquan Zhu, Xindong Wu

Abstract: Sequential pattern mining (SPM) is an important branch of knowledge discovery that aims to mine frequent sub-sequences (patterns) in a sequential database. Various SPM methods have been investigated, and most of them are classical SPM methods, since these methods only consider whether or not a given pattern occurs within a sequence. Classical SPM can only find the common features of sequences, but it ignores the number of occurrences of the pattern in each sequence, i.e., the degree of interest of specific users. To solve this problem, this paper addresses the issue of repetitive nonoverlapping sequential pattern (RNP) mining and proposes the RNP-Miner algorithm. To reduce the number of candidate patterns, RNP-Miner adopts an itemset pattern join strategy. To improve the efficiency of support calculation, RNP-Miner utilizes the candidate support calculation algorithm based on the position dictionary. To validate the performance of RNP-Miner, 10 competitive algorithms and 20 sequence databases were selected. The experimental results verify that RNP-Miner outperforms the other algorithms, and using RNPs can achieve a better clustering performance than raw data and classical frequent patterns. All the algorithms were developed using the PyCharm environment and can be downloaded from https://github.com/wuc567/Pattern-Mining/tree/master/RNP-Miner.

Submitted 16 November, 2023; originally announced November 2023.

arXiv:2309.10986 [pdf, other]

Research on the Impact of Executive Shareholding on New Investment in Enterprises Based on Multivariable Linear Regression Model

Authors: Shanyi Zhou, Ning Yan, Zhijun Li, Mo Geng, Xulong Zhang, Hongbiao Si, Lihua Tang, Wenyuan Sun, Longda Zhang, Yi Cao

Abstract: Based on principal-agent theory and optimal contract theory, companies use the method of increasing executives' shareholding to stimulate collaborative innovation. However, from the aspect of agency costs between management and shareholders (i.e. the first type) and between major shareholders and minority shareholders (i.e. the second type), the interests of management, shareholders and creditors will be unbalanced with the change of the marginal utility of executive equity incentives. In order to establish the correlation between the proportion of shares held by executives and investments in corporate innovation, we have chosen a range of publicly listed companies within China's A-share market as the focus of our study. Employing a multi-variable linear regression model, we aim to analyze this relationship thoroughly. The following models were developed: (1) the impact model of executive shareholding on corporate innovation investment; (2) the impact model of executive shareholding on two types of agency costs; (3) a model to examine the mediating influence of the two categories of agency costs. Following both correlation and regression analyses, the findings confirm a meaningful and positive correlation between executives' shareholding and the augmentation of corporate innovation investments. Additionally, the results indicate that executive shareholding contributes to the reduction of the first type of agency cost, thereby fostering corporate innovation investment. However, simultaneously, it leads to an escalation in the second type of agency cost, thus impeding corporate innovation investment.

Submitted 19 September, 2023; originally announced September 2023.

Comments: Accepted by the 7th APWeb-WAIM International Joint Conference on Web and Big Data (APWeb 2023)

arXiv:2308.03963 [pdf, other]

doi 10.1103/PhysRevB.108.134110

Influence of electronic entropy on Hellmann-Feynman forces in ab initio molecular dynamics with large temperature changes

Authors: Ming Geng, Chris E. Mohn

Abstract: The Z method is a popular atomistic simulation method for determining the melting temperature where a sequence of molecular dynamics runs are carried out to target the lowest system energy where the solid always melts. Homogeneous melting at the limit of critical superheating, Th, is accompanied by a drop in temperature as kinetic energy is converted to potential energy and the equilibrium melting temperature, Tm, can be calculated directly from the liquid state. Implementations of the Z method interfaced with modern ab initio electronic structure packages use Hellmann-Feynman forces to propagate the ions in the microcanonical (NVE) ensemble where the Mermin free energy plus the ionic kinetic energy is conserved. The electronic temperature, Tel, is kept fixed along the trajectory which may introduce some spurious ion-electron interactions in MD runs with large temperature changes such as often seen in homogeneous melting and freezing processes in the NVE ensemble. We estimate systematic errors in the calculated melting temperature due to the choice of Tel for two main mantle components, SiO2 and CaSiO3 at high pressure. Comparison of the calculated melting temperature from runs where Tel=Th and Tel=Tm, representing reasonable upper and lower boundaries respectively for the choice of Tel, shows that the difference in melting temperature is 200-300 K for our two test systems. The melting temperature decreases with increasing Tel due to the increasing entropic stabilisation of the liquid and the systems melt typically about 3 times faster in MD runs with Tel = Th compared to runs where Tel = Tm. A careful choice of electron temperature in BOMD simulations where the ions are propagated using Hellmann-Feynman forces with the Mermin free energy + the ionic kinetic energy being conserved is therefore essential for the critical evaluation of the Z method and in particular at very high temperatures.

Submitted 20 September, 2023; v1 submitted 7 August, 2023; originally announced August 2023.

Comments: 10 figures, 17 pages

arXiv:2308.03018 [pdf, other]

Recurrent Spike-based Image Restoration under General Illumination

Authors: Lin Zhu, Yunlong Zheng, Mengyue Geng, Lizhi Wang, Hua Huang

Abstract: Spike camera is a new type of bio-inspired vision sensor that records light intensity in the form of a spike array with high temporal resolution (20,000 Hz). This new paradigm of vision sensor offers significant advantages for many vision tasks such as high speed image reconstruction. However, existing spike-based approaches typically assume that the scenes are with sufficient light intensity, which is usually unavailable in many real-world scenarios such as rainy days or dusk scenes. To unlock more spike-based application scenarios, we propose a Recurrent Spike-based Image Restoration (RSIR) network, which is the first work towards restoring clear images from spike arrays under general illumination. Specifically, to accurately describe the noise distribution under different illuminations, we build a physical-based spike noise model according to the sampling process of the spike camera. Based on the noise model, we design our RSIR network which consists of an adaptive spike transformation module, a recurrent temporal feature fusion module, and a frequency-based spike denoising module. Our RSIR can process the spike array in a recursive manner to ensure that the spike temporal information is well utilized. In the training process, we generate the simulated spike data based on our noise model to train our network. Extensive experiments on real-world datasets with different illuminations demonstrate the effectiveness of the proposed network. The code and dataset are released at https://github.com/BIT-Vision/RSIR.

Submitted 6 August, 2023; originally announced August 2023.

Comments: Accepted by ACM MM 2023

arXiv:2307.05444 [pdf, other]

doi 10.1103/PhysRevB.109.024106

Ab initio constraints on silica melting to 500 GPa

Authors: Ming Geng, Chris E. Mohn

Abstract: The melting curve of pure silica (SiO$_2$) was determined using {\it ab initio} density functional theory together with the solid-liquid coexisting approach, thermodynamic integration and the Z method. The melting curves are consistent with a smooth slow increase in a large region from 50 GPa (dT/dP $\approx$ 15 K/GPa) to about 500 GPa (dT/dP $\approx$ 5 K/GPa) without any abrupt changes at around 120 GPa and 300 GPa as seen in some recent experimental and computational studies. The topography of the melting curve above 50 GPa is consistent with a gradual change in the distribution of the Si coordination numbers in the liquid state and the absence of large changes in the density following solid-solid phase transitions. The pair distribution functions show that the structural correlation in the liquid is mainly short-ranged and that the Si-O bond is stiff. The densification of the melt structure with pressure above 50 GPa is therefore due to an increase in 7- and 8-fold coordinated silicon.

Submitted 18 November, 2023; v1 submitted 7 July, 2023; originally announced July 2023.

arXiv:2307.02909 [pdf, other]

Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition

Authors: Guinan Li, Jiajun Deng, Mengzhe Geng, Zengrui Jin, Tianzi Wang, Shujie Hu, Mingyu Cui, Helen Meng, Xunying Liu

Abstract: Accurate recognition of cocktail party speech containing overlapping speakers, noise and reverberation remains a highly challenging task to date. Motivated by the invariance of visual modality to acoustic signal corruption, an audio-visual multi-channel speech separation, dereverberation and recognition approach featuring a full incorporation of visual information into all system components is proposed in this paper. The efficacy of the video input is consistently demonstrated in mask-based MVDR speech separation, DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-end and Conformer ASR back-end. Audio-visual integrated front-end architectures performing speech separation and dereverberation in a pipelined or joint fashion via mask-based WPD are investigated. The error cost mismatch between the speech enhancement front-end and ASR back-end components is minimized by end-to-end jointly fine-tuning using either the ASR cost function alone, or its interpolation with the speech enhancement loss. Experiments were conducted on the mixture overlapped and reverberant speech data constructed using simulation or replay of the Oxford LRS2 dataset. The proposed audio-visual multi-channel speech separation, dereverberation and recognition systems consistently outperformed the comparable audio-only baseline by 9.1% and 6.2% absolute (41.7% and 36.0% relative) word error rate (WER) reductions. Consistent speech enhancement improvements were also obtained on PESQ, STOI and SRMR scores.

Submitted 6 July, 2023; originally announced July 2023.

Comments: IEEE/ACM Transactions on Audio, Speech, and Language Processing

arXiv:2306.15265 [pdf, other]

Hyper-parameter Adaptation of Conformer ASR Systems for Elderly and Dysarthric Speech Recognition

Authors: Tianzi Wang, Shoukang Hu, Jiajun Deng, Zengrui Jin, Mengzhe Geng, Yi Wang, Helen Meng, Xunying Liu

Abstract: Automatic recognition of disordered and elderly speech remains a highly challenging task to date due to data scarcity. Parameter fine-tuning is often used to exploit the large quantities of non-aged and healthy speech pre-trained models, while neural architecture hyper-parameters are set using expert knowledge and remain unchanged. This paper investigates hyper-parameter adaptation for Conformer ASR systems that are pre-trained on the Librispeech corpus before being domain adapted to the DementiaBank elderly and UASpeech dysarthric speech datasets. Experimental results suggest that hyper-parameter adaptation produced word error rate (WER) reductions of 0.45% and 0.67% over parameter-only fine-tuning on DBank and UASpeech tasks respectively. An intuitive correlation is found between the performance improvements by hyper-parameter domain adaptation and the relative utterance length ratio between the source and target domain data.

Submitted 27 June, 2023; originally announced June 2023.

Comments: 5 pages, 3 figures, 3 tables, accepted by Interspeech2023

arXiv:2306.14608 [pdf, other]

Factorised Speaker-environment Adaptive Training of Conformer Speech Recognition Systems

Authors: Jiajun Deng, Guinan Li, Xurong Xie, Zengrui Jin, Mingyu Cui, Tianzi Wang, Shujie Hu, Mengzhe Geng, Xunying Liu

Abstract: Rich sources of variability in natural speech present significant challenges to current data intensive speech recognition technologies. To model both speaker and environment level diversity, this paper proposes a novel Bayesian factorised speaker-environment adaptive training and test time adaptation approach for Conformer ASR models. Speaker and environment level characteristics are separately modeled using compact hidden output transforms, which are then linearly or hierarchically combined to represent any speaker-environment combination. Bayesian learning is further utilized to model the adaptation parameter uncertainty. Experiments on the 300-hr WHAM noise corrupted Switchboard data suggest that factorised adaptation consistently outperforms the baseline and speaker label only adapted Conformers by up to 3.1% absolute (10.4% relative) word error rate reductions. Further analysis shows the proposed method offers potential for rapid adaptation to unseen speaker-environment conditions.

Submitted 26 June, 2023; originally announced June 2023.

Comments: Accepted by INTERSPEECH 2023

arXiv:2306.06564

Guarding Quantum Key Distribution with integrated Magnetic-free Nonreciprocal Structures

Authors: Qiang Liu, Yinming Huang, Tingting Luo, Chunfeng Huang, Minming Geng, Zhenrong Zhang, Kejin Wei

Abstract: Inserting nonreciprocal devices at the doorways of Alice and Bob is a widely recognized countermeasure against quantum hacking attacks in quantum key distribution (QKD) systems. However, traditional integrated nonreciprocal devices, which are typically based on magneto-optical effects, face challenges in compatibility with current semiconductor integration technology. As a result, earlier chip-based QKD systems were unable to integrate nonreciprocal components and were vulnerable to injecting-type attacks. Based on the actual parameters of SOI integration, we employed the inverse design with the direct binary search algorithm to construct several magnetic-free nonreciprocal devices, facilitating their integration into chip-based QKD systems while meeting various chip configuration design requirements. The designed devices have sizes of only a few square micrometers, yet the quasi-isolator can achieve an isolation level exceeding 27 dB. To demonstrate their practical utility in QKD, we employed the designed devices to safeguard the QKD system against Trojan-horse attacks. The simulation results demonstrate that our proposed devices effectively secure the BB84 and measure-device-independent QKD systems against Trojan-horse attacks.

Submitted 4 August, 2023; v1 submitted 10 June, 2023; originally announced June 2023.

Comments: We have found that the presented structure is a mode converter which is suitable for guarding quantum key distribution

arXiv:2305.10659 [pdf, other]

Use of Speech Impairment Severity for Dysarthric Speech Recognition

Authors: Mengzhe Geng, Zengrui Jin, Tianzi Wang, Shujie Hu, Jiajun Deng, Mingyu Cui, Guinan Li, Jianwei Yu, Xurong Xie, Xunying Liu

Abstract: A key challenge in dysarthric speech recognition is the speaker-level diversity attributed to both speaker-identity associated factors such as gender, and speech impairment severity. Most prior research addressing this issue focused on using speaker-identity only. To this end, this paper proposes a novel set of techniques to use both severity and speaker-identity in dysarthric speech recognition: a) multitask training incorporating severity prediction error; b) speaker-severity aware auxiliary feature adaptation; and c) structured LHUC transforms separately conditioned on speaker-identity and severity. Experiments conducted on UASpeech suggest incorporating additional speech impairment severity into state-of-the-art hybrid DNN, E2E Conformer and pre-trained Wav2vec 2.0 ASR systems produced statistically significant WER reductions up to 4.78% (14.03% relative). Using the best system the lowest published WER of 17.82% (51.25% on very low intelligibility) was obtained on UASpeech.

Submitted 17 May, 2023; originally announced May 2023.

Comments: Accepted to INTERSPEECH2023

arXiv:2304.11384 [pdf, other]

Large Language Models are Few-Shot Summarizers: Multi-Intent Comment Generation via In-Context Learning

Authors: Mingyang Geng, Shangwen Wang, Dezun Dong, Haotian Wang, Ge Li, Zhi Jin, Xiaoguang Mao, Xiangke Liao

Abstract: Code comment generation aims at generating natural language descriptions for a code snippet to facilitate developers' program comprehension activities. Despite being studied for a long time, a bottleneck for existing approaches is that given a code snippet, they can only generate one comment while developers usually need to know information from diverse perspectives such as what is the functionality of this code snippet and how to use it. To tackle this limitation, this study empirically investigates the feasibility of utilizing large language models (LLMs) to generate comments that can fulfill developers' diverse intents. Our intuition is based on the facts that (1) the code and its pairwise comment are used during the pre-training process of LLMs to build the semantic connection between the natural language and programming language, and (2) comments in the real-world projects, which are collected for the pre-training, usually contain different developers' intents. We thus postulate that the LLMs can already understand the code from different perspectives after the pre-training. Indeed, experiments on two large-scale datasets demonstrate the rationale of our insights: by adopting the in-context learning paradigm and giving adequate prompts to the LLM (e.g., providing it with ten or more examples), the LLM can significantly outperform a state-of-the-art supervised learning approach on generating comments with multiple intents. Results also show that customized strategies for constructing the prompts and post-processing strategies for reranking the results can both boost the LLM's performance, which sheds light on future research directions for using LLMs to achieve comment generation.

Submitted 14 June, 2023; v1 submitted 22 April, 2023; originally announced April 2023.

Comments: Accepted by the 46th International Conference on Software Engineering (ICSE 2024)

arXiv:2302.14564 [pdf, other]

Exploring Self-supervised Pre-trained ASR Models For Dysarthric and Elderly Speech Recognition

Authors: Shujie Hu, Xurong Xie, Zengrui Jin, Mengzhe Geng, Yi Wang, Mingyu Cui, Jiajun Deng, Xunying Liu, Helen Meng

Abstract: Automatic recognition of disordered and elderly speech remains a highly challenging task to date due to the difficulty in collecting such data in large quantities. This paper explores a series of approaches to integrate domain adapted SSL pre-trained models into TDNN and Conformer ASR systems for dysarthric and elderly speech recognition: a) input feature fusion between standard acoustic frontends and domain adapted wav2vec2.0 speech representations; b) frame-level joint decoding of TDNN systems separately trained using standard acoustic features alone and with additional wav2vec2.0 features; and c) multi-pass decoding involving the TDNN/Conformer system outputs to be rescored using domain adapted wav2vec2.0 models. In addition, domain adapted wav2vec2.0 representations are utilized in acoustic-to-articulatory (A2A) inversion to construct multi-modal dysarthric and elderly speech recognition systems. Experiments conducted on the UASpeech dysarthric and DementiaBank Pitt elderly speech corpora suggest TDNN and Conformer ASR systems integrating domain adapted wav2vec2.0 models consistently outperform the standalone wav2vec2.0 models by statistically significant WER reductions of 8.22% and 3.43% absolute (26.71% and 15.88% relative) on the two tasks respectively. The lowest published WERs of 22.56% (52.53% on very low intelligibility, 39.09% on unseen words) and 18.17% are obtained on the UASpeech test set of 16 dysarthric speakers, and the DementiaBank Pitt test set respectively.

Submitted 22 June, 2023; v1 submitted 28 February, 2023; originally announced February 2023.

Comments: accepted by ICASSP 2023

arXiv:2211.01646 [pdf, other]

Adversarial Data Augmentation Using VAE-GAN for Disordered Speech Recognition

Authors: Zengrui Jin, Xurong Xie, Mengzhe Geng, Tianzi Wang, Shujie Hu, Jiajun Deng, Guinan Li, Xunying Liu

Abstract: Automatic recognition of disordered speech remains a highly challenging task to date. The underlying neuro-motor conditions, often compounded with co-occurring physical disabilities, lead to the difficulty in collecting large quantities of impaired speech required for ASR system development. This paper presents novel variational auto-encoder generative adversarial network (VAE-GAN) based personalized disordered speech augmentation approaches that simultaneously learn to encode, generate and discriminate synthesized impaired speech. Separate latent features are derived to learn dysarthric speech characteristics and phoneme context representations. Self-supervised pre-trained Wav2vec 2.0 embedding features are also incorporated. Experiments conducted on the UASpeech corpus suggest the proposed adversarial data augmentation approach consistently outperformed the baseline speed perturbation and non-VAE GAN augmentation methods with trained hybrid TDNN and End-to-end Conformer systems. After LHUC speaker adaptation, the best system using VAE-GAN based augmentation produced an overall WER of 27.78% on the UASpeech test set of 16 dysarthric speakers, and the lowest published WER of 57.31% on the subset of speakers with "Very Low" intelligibility.

Submitted 19 March, 2023; v1 submitted 3 November, 2022; originally announced November 2022.

Comments: Submitted to ICASSP 2023

arXiv:2208.13259 [pdf, other]

Bayesian Neural Network Language Modeling for Speech Recognition

Authors: Boyang Xue, Shoukang Hu, Junhao Xu, Mengzhe Geng, Xunying Liu, Helen Meng

Abstract: State-of-the-art neural network language models (NNLMs) represented by long short term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming highly complex. They are prone to overfitting and poor generalization when given limited training data. To this end, an overarching full Bayesian learning framework encompassing three methods is proposed in this paper to account for the underlying uncertainty in LSTM-RNN and Transformer LMs. The uncertainty over their model parameters, choice of neural activations and hidden output representations are modeled using Bayesian, Gaussian Process and variational LSTM-RNN or Transformer LMs respectively. Efficient inference approaches were used to automatically select the optimal network internal components to be Bayesian learned using neural architecture search. A minimal number of Monte Carlo parameter samples as low as one was also used. These allow the computational costs incurred in Bayesian NNLM training and evaluation to be minimized. Experiments are conducted on two tasks: AMI meeting transcription and Oxford-BBC LipReading Sentences 2 (LRS2) overlapped speech recognition using state-of-the-art LF-MMI trained factored TDNN systems featuring data augmentation, speaker adaptation and audio-visual multi-channel beamforming for overlapped speech. Consistent performance improvements over the baseline LSTM-RNN and Transformer LMs with point estimated model parameters and drop-out regularization were obtained across both tasks in terms of perplexity and word error rate (WER). In particular, on the LRS2 data, statistically significant WER reductions up to 1.3% and 1.2% absolute (12.1% and 11.3% relative) were obtained over the baseline LSTM-RNN and Transformer LMs respectively after model combination between Bayesian NNLMs and their respective baselines.

Submitted 28 August, 2022; originally announced August 2022.

arXiv:2206.13232 [pdf, other]

Conformer Based Elderly Speech Recognition System for Alzheimer's Disease Detection

Authors: Tianzi Wang, Jiajun Deng, Mengzhe Geng, Zi Ye, Shoukang Hu, Yi Wang, Mingyu Cui, Zengrui Jin, Xunying Liu, Helen Meng

Abstract: Early diagnosis of Alzheimer's disease (AD) is crucial in facilitating preventive care to delay further progression. This paper presents the development of a state-of-the-art Conformer based speech recognition system built on the DementiaBank Pitt corpus for automatic AD detection. The baseline Conformer system trained with speed perturbation and SpecAugment based data augmentation is significantly improved by incorporating a set of purposefully designed modeling features, including neural architecture search based auto-configuration of domain-specific Conformer hyper-parameters in addition to parameter fine-tuning; fine-grained elderly speaker adaptation using learning hidden unit contributions (LHUC); and two-pass cross-system rescoring based combination with hybrid TDNN systems. An overall word error rate (WER) reduction of 13.6% absolute (34.8% relative) was obtained on the evaluation data of 48 elderly speakers. Using the final systems' recognition outputs to extract textual features, the best-published speech recognition based AD detection accuracy of 91.7% was obtained.

Submitted 23 June, 2022; originally announced June 2022.

Comments: 5 pages, 1 figure, accepted by INTERSPEECH 2022

arXiv:2206.12045 [pdf, other]

Confidence Score Based Conformer Speaker Adaptation for Speech Recognition

Authors: Jiajun Deng, Xurong Xie, Tianzi Wang, Mingyu Cui, Boyang Xue, Zengrui Jin, Mengzhe Geng, Guinan Li, Xunying Liu, Helen Meng

Abstract: A key challenge for automatic speech recognition (ASR) systems is to model the speaker level variability. In this paper, compact speaker dependent learning hidden unit contributions (LHUC) are used to facilitate both speaker adaptive training (SAT) and test time unsupervised speaker adaptation for state-of-the-art Conformer based end-to-end ASR systems. The sensitivity during adaptation to supervision error rate is reduced using confidence score based selection of the more "trustworthy" subset of speaker specific data. A confidence estimation module is used to smooth the over-confident Conformer decoder output probabilities before serving as confidence scores. The increased data sparsity due to speaker level data selection is addressed using Bayesian estimation of LHUC parameters. Experiments on the 300-hour Switchboard corpus suggest that the proposed LHUC-SAT Conformer with confidence score based test time unsupervised adaptation outperformed the baseline speaker independent and i-vector adapted Conformer systems by up to 1.0%, 1.0%, and 1.2% absolute (9.0%, 7.9%, and 8.9% relative) word error rate (WER) reductions on the NIST Hub5'00, RT02, and RT03 evaluation sets respectively. Consistent performance improvements were retained after external Transformer and LSTM language models were used for rescoring.

Submitted 23 June, 2022; originally announced June 2022.

Comments: It's accepted to INTERSPEECH 2022. arXiv admin note: text overlap with arXiv:2206.11596

arXiv:2206.11596 [pdf, other]

doi 10.21437/Interspeech.2022-696

Two-pass Decoding and Cross-adaptation Based System Combination of End-to-end Conformer and Hybrid TDNN ASR Systems

Authors: Mingyu Cui, Jiajun Deng, Shoukang Hu, Xurong Xie, Tianzi Wang, Shujie Hu, Mengzhe Geng, Boyang Xue, Xunying Liu, Helen Meng

Abstract: Fundamental modelling differences between hybrid and end-to-end (E2E) automatic speech recognition (ASR) systems create large diversity and complementarity among them. This paper investigates multi-pass rescoring and cross adaptation based system combination approaches for hybrid TDNN and Conformer E2E ASR systems. In multi-pass rescoring, a state-of-the-art hybrid LF-MMI trained CNN-TDNN system featuring speed perturbation, SpecAugment and Bayesian learning hidden unit contributions (LHUC) speaker adaptation was used to produce initial N-best outputs before being rescored by the speaker adapted Conformer system using a 2-way cross system score interpolation. In cross adaptation, the hybrid CNN-TDNN system was adapted to the 1-best output of the Conformer system or vice versa. Experiments on the 300-hour Switchboard corpus suggest that the combined systems derived using either of the two system combination approaches outperformed the individual systems. The best combined system obtained using multi-pass rescoring produced statistically significant word error rate (WER) reductions of 2.5% to 3.9% absolute (22.5% to 28.9% relative) over the stand alone Conformer system on the NIST Hub5'00, Rt03 and Rt02 evaluation data.

Submitted 23 June, 2022; originally announced June 2022.

Comments: Accepted to ISCA 2022

arXiv:2206.07327 [pdf, other]

Exploiting Cross-domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition

Authors: Shujie Hu, Xurong Xie, Mengzhe Geng, Mingyu Cui, Jiajun Deng, Guinan Li, Tianzi Wang, Xunying Liu, Helen Meng

Abstract: Articulatory features are inherently invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition (ASR) systems designed for normal speech. Their practical application to atypical task domains such as elderly and disordered speech across languages is often limited by the difficulty in collecting such specialist data from target speakers. This paper presents a cross-domain and cross-lingual A2A inversion approach that utilizes the parallel audio and ultrasound tongue imaging (UTI) data of the 24-hour TaL corpus in A2A model pre-training before being cross-domain and cross-lingual adapted to three datasets across two languages: the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech corpora; and the English TORGO dysarthric speech data, to produce UTI based articulatory features. Experiments conducted on three tasks suggested incorporating the generated articulatory features consistently outperformed the baseline TDNN and Conformer ASR systems constructed using acoustic features only by statistically significant word or character error rate reductions up to 4.75%, 2.59% and 2.07% absolute (14.69%, 10.64% and 22.72% relative) after data augmentation, speaker adaptation and cross system multi-pass decoding were applied.

Submitted 22 June, 2023; v1 submitted 15 June, 2022; originally announced June 2022.

Comments: accepted by INTERSPEECH 2023

arXiv:2205.08318 [pdf]

doi 10.1007/s11128-022-03459-z

Two-party secure semiquantum summation against the collective-dephasing noise

Authors: Tian-Yu Ye, Tian-Jie Xu, Mao-Jie Geng, Ying Chen

Abstract: In this paper, we propose a two-party semiquantum summation protocol, where two classical users can accomplish the summation of their private binary sequences with the assistance of a quantum semi-honest third party (TP). The term 'semi-honest' implies that TP cannot conspire with others but is able to implement all kinds of attacks. This protocol employs logical qubits as traveling particles to overcome the negative influence of collective-dephasing noise and does not require any two parties to pre-share a random secret key. The security analysis shows that this protocol can effectively prevent the outside attacks from Eve and the participant attacks from TP. Moreover, TP has no knowledge about the summation results.

Submitted 14 May, 2022; originally announced May 2022.

Comments: 9 pages, 2 tables

Journal ref: Quantum Information Processing, 2022, 21:118

arXiv:2205.08317 [pdf]

doi 10.1360/SSPMA-2021-0188

Quantum dialogue based on quantum encryption with single photons in both polarization and spatial-mode degrees of freedom

Authors: Tian-Yu Ye, Mao-Jie Geng, Tian-Jie Xu, Ying Chen

Abstract: In this paper, a novel information leakage resistant quantum dialogue (QD) protocol with single photons in both polarization and spatial-mode degrees of freedom is proposed, which utilizes quantum encryption technology to overcome the information leakage problem. In the proposed QD protocol, during the transmission process, the single photons in both polarization and spatial-mode degrees of freedom used for encoding two communicants' private classical bits are protected by both quantum encryption technology and decoy photon technology. For avoiding the information leakage problem, the initial states of the single photons in both polarization and spatial-mode degrees of freedom used for encoding two communicants' private classical bits are shared between two communicants through quantum key encryption and decryption. The information-theoretical efficiency of the proposed QD protocol is as high as 40%.

Submitted 14 May, 2022; originally announced May 2022.

Comments: 7 pages

Journal ref: Scientia Sinica Physica, Mechanica & Astronomica, 2021, 51(10): 100311

arXiv:2205.06445 [pdf, other]

doi 10.1109/TASLP.2023.3323888

Personalized Adversarial Data Augmentation for Dysarthric and Elderly Speech Recognition

Authors: Zengrui Jin, Mengzhe Geng, Jiajun Deng, Tianzi Wang, Shujie Hu, Guinan Li, Xunying Liu

Abstract: Despite the rapid progress of automatic speech recognition (ASR) technologies targeting normal speech, accurate recognition of dysarthric and elderly speech remains a highly challenging task to date. It is difficult to collect large quantities of such data for ASR system development due to the mobility issues often found among these users. To this end, data augmentation techniques play a vital role. In contrast to existing data augmentation techniques only modifying the speaking rate or overall shape of spectral contour, fine-grained spectro-temporal differences between dysarthric, elderly and normal speech are modelled using a novel set of speaker dependent (SD) generative adversarial networks (GAN) based data augmentation approaches in this paper. These flexibly allow both: a) temporal or speed perturbed normal speech spectra to be modified and closer to those of an impaired speaker when parallel speech data is available; and b) for non-parallel data, the SVD decomposed normal speech spectral basis features to be transformed into those of a target elderly speaker before being re-composed with the temporal bases to produce the augmented data for state-of-the-art TDNN and Conformer ASR system training. Experiments are conducted on four tasks: the English UASpeech and TORGO dysarthric speech corpora; the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech datasets.
The proposed GAN based data augmentation approaches consistently outperform the baseline speed perturbation method by up to 0.91% and 3.0% absolute (9.61% and 6.4% relative) WER reduction on the TORGO and DementiaBank data respectively. Consistent performance improvements are retained after applying LHUC based speaker adaptation.

Submitted 23 June, 2022; v1 submitted 13 May, 2022; originally announced May 2022.

Comments: arXiv admin note: text overlap with arXiv:2202.10290

arXiv:2205.04927 [pdf]

Semiquantum private comparison based on Bell states without quantum measurements from the classical user

Authors: Mao-Jie Geng, Xia Li, Tian-Yu Ye

Abstract: In this paper, we propose a novel semiquantum private comparison (SQPC) protocol based on Bell states, which enables one quantum user and one classical user to compare the equality of their private inputs with the help of a semi-honest quantum third party (TP). TP is assumed to be semi-honest in the sense that she may take all possible attacks to steal users' private inputs except conspiring with anyone. The security analysis validates that our protocol can resist not only the attacks from internal participants but also the attacks from an external eavesdropper. Besides, our protocol only asks TP to perform Bell basis measurements but doesn't need quantum entanglement swapping; and it releases the classical user from conducting quantum measurements and having a quantum memory. Moreover, our protocol can take advantage over previous SQPC protocols based on Bell states in qubit efficiency. Finally, our protocol can be generalized into its counterpart of the collective-dephasing noise quantum channel.

Submitted 25 September, 2023; v1 submitted 10 May, 2022; originally announced May 2022.

Comments: 17 pages, 1 figure, 3 tables

arXiv:2203.14593 [pdf, other]

On-the-Fly Feature Based Rapid Speaker Adaptation for Dysarthric and Elderly Speech Recognition

Authors: Mengzhe Geng, Xurong Xie, Rongfeng Su, Jianwei Yu, Zengrui Jin, Tianzi Wang, Shujie Hu, Zi Ye, Helen Meng, Xunying Liu

Abstract: Accurate recognition of dysarthric and elderly speech remains a challenging task to date. Speaker-level heterogeneity attributed to accent or gender, when aggregated with age and speech impairment, creates large diversity among these speakers. Scarcity of speaker-level data limits the practical use of data-intensive model based speaker adaptation methods. To this end, this paper proposes two novel forms of data-efficient, feature-based on-the-fly speaker adaptation methods: variance-regularized spectral basis embedding (SVR) and spectral feature driven f-LHUC transforms. Experiments conducted on UASpeech dysarthric and DementiaBank Pitt elderly speech corpora suggest the proposed on-the-fly speaker adaptation approaches consistently outperform baseline iVector adapted hybrid DNN/TDNN and E2E Conformer systems by statistically significant WER reduction of 2.48%-2.85% absolute (7.92%-8.06% relative), and offline model based LHUC adaptation by 1.82% absolute (5.63% relative) respectively.

Submitted 28 May, 2023; v1 submitted 28 March, 2022; originally announced March 2022.

Comments: Accepted to INTERSPEECH 2023

arXiv:2203.10274 [pdf, other]

Exploiting Cross Domain Acoustic-to-articulatory Inverted Features For Disordered Speech Recognition

Authors: Shujie Hu, Shansong Liu, Xurong Xie, Mengzhe Geng, Tianzi Wang, Shoukang Hu, Mingyu Cui, Xunying Liu, Helen Meng

Abstract: Articulatory features are inherently invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition (ASR) systems for normal speech. Their practical application to disordered speech recognition is often limited by the difficulty in collecting such specialist data from impaired speakers. This paper presents a cross-domain acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel acoustic-articulatory data of the 15-hour TORGO corpus in model training before being cross-domain adapted to the 102.7-hour UASpeech corpus to produce articulatory features. Mixture density networks based neural A2A inversion models were used. A cross-domain feature adaptation network was also used to reduce the acoustic mismatch between the TORGO and UASpeech data. On both tasks, incorporating the A2A generated articulatory features consistently outperformed the baseline hybrid DNN/TDNN, CTC and Conformer based end-to-end systems constructed using acoustic features only. The best multi-modal system incorporating video modality and the cross-domain articulatory features as well as data augmentation and learning hidden unit contributions (LHUC) speaker adaptation produced the lowest published word error rate (WER) of 24.82% on the 16 dysarthric speakers of the benchmark UASpeech task.

Submitted 19 March, 2022; originally announced March 2022.

Comments: accepted by ICASSP 2022

arXiv:2202.10290 [pdf, other]

Speaker Adaptation Using Spectro-Temporal Deep Features for Dysarthric and Elderly Speech Recognition

Authors: Mengzhe Geng, Xurong Xie, Zi Ye, Tianzi Wang, Guinan Li, Shujie Hu, Xunying Liu, Helen Meng

Abstract: Despite the rapid progress of automatic speech recognition (ASR) technologies targeting normal speech in recent decades, accurate recognition of dysarthric and elderly speech remains a highly challenging task to date. Sources of heterogeneity commonly found in normal speech including accent or gender, when further compounded with the variability over age and speech pathology severity level, create large diversity among speakers. To this end, speaker adaptation techniques play a key role in personalization of ASR systems for such users. Motivated by the spectro-temporal level differences between dysarthric, elderly and normal speech that systematically manifest in articulatory imprecision, decreased volume and clarity, slower speaking rates and increased dysfluencies, novel spectro-temporal subspace basis deep embedding features derived using SVD speech spectrum decomposition are proposed in this paper to facilitate auxiliary feature based speaker adaptation of state-of-the-art hybrid DNN/TDNN and end-to-end Conformer speech recognition systems. Experiments were conducted on four tasks: the English UASpeech and TORGO dysarthric speech corpora; the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech datasets. The proposed spectro-temporal deep feature adapted systems outperformed baseline i-Vector and xVector adaptation by up to 2.63% absolute (8.63% relative) reduction in word error rate (WER).
Consistent performance improvements were retained after model based speaker adaptation using learning hidden unit contributions (LHUC) was further applied. The best speaker adapted system using the proposed spectral basis embedding features produced the lowest published WER of 25.05% on the UASpeech test set of 16 dysarthric speakers.

Submitted 17 March, 2022; v1 submitted 21 February, 2022; originally announced February 2022.

Comments: In submission to IEEE/ACM Transactions on Audio Speech and Language Processing

arXiv:2201.05845 [pdf, other]

doi 10.1109/TASLP.2021.3091805

Recent Progress in the CUHK Dysarthric Speech Recognition System

Authors: Shansong Liu, Mengzhe Geng, Shoukang Hu, Xurong Xie, Mingyu Cui, Jianwei Yu, Xunying Liu, Helen Meng

Abstract: Despite the rapid progress of automatic speech recognition (ASR) technologies in the past few decades, recognition of disordered speech remains a highly challenging task to date. Disordered speech presents a wide spectrum of challenges to current data intensive deep neural networks (DNNs) based ASR technologies that predominantly target normal speech. This paper presents recent research efforts at the Chinese University of Hong Kong (CUHK) to improve the performance of disordered speech recognition systems on the largest publicly available UASpeech dysarthric speech corpus. A set of novel modelling techniques including neural architectural search, data augmentation using spectra-temporal perturbation, model based speaker adaptation and cross-domain generation of visual features within an audio-visual speech recognition (AVSR) system framework were employed to address the above challenges. The combination of these techniques produced the lowest published word error rate (WER) of 25.21% on the UASpeech test set of 16 dysarthric speakers, and an overall WER reduction of 5.4% absolute (17.6% relative) over the CUHK 2018 dysarthric speech recognition system featuring a 6-way DNN system combination and cross adaptation of out-of-domain normal speech data trained systems. Bayesian model adaptation further allows rapid adaptation to individual dysarthric speakers to be performed using as little as 3.06 seconds of speech. The efficacy of these techniques was further demonstrated on a CUDYS Cantonese dysarthric speech recognition task.

Submitted 26 February, 2022; v1 submitted 15 January, 2022; originally announced January 2022.

arXiv:2201.05562 [pdf, other]

doi 10.21437/Interspeech.2020-1161

Investigation of Data Augmentation Techniques for Disordered Speech Recognition

Authors: Mengzhe Geng, Xurong Xie, Shansong Liu, Jianwei Yu, Shoukang Hu, Xunying Liu, Helen Meng

Abstract: Disordered speech recognition is a highly challenging task. The underlying neuro-motor conditions of people with speech disorders, often compounded with co-occurring physical disabilities, lead to the difficulty in collecting large quantities of speech required for system development. This paper investigates a set of data augmentation techniques for disordered speech recognition, including vocal tract length perturbation (VTLP), tempo perturbation and speed perturbation. Both normal and disordered speech were exploited in the augmentation process. Variability among impaired speakers in both the original and augmented data was modeled using learning hidden unit contributions (LHUC) based speaker adaptive training. The final speaker adapted system constructed using the UASpeech corpus and the best augmentation approach based on speed perturbation produced up to 2.92% absolute (9.3% relative) word error rate (WER) reduction over the baseline system without data augmentation, and gave an overall WER of 26.37% on the test set containing 16 dysarthric speakers.

Submitted 14 January, 2022; originally announced January 2022.

Comments: Proceedings of INTERSPEECH 2020

arXiv:2201.05554 [pdf, other]

doi 10.21437/Interspeech.2021-60

Spectro-Temporal Deep Features for Disordered Speech Assessment and Recognition

Authors: Mengzhe Geng, Shansong Liu, Jianwei Yu, Xurong Xie, Shoukang Hu, Zi Ye, Zengrui Jin, Xunying Liu, Helen Meng

Abstract: Automatic recognition of disordered speech remains a highly challenging task to date. Sources of variability commonly found in normal speech including accent, age or gender, when further compounded with the underlying causes of speech impairment and varying severity levels, create large diversity among speakers. To this end, speaker adaptation techniques play a vital role in current speech recognition systems. Motivated by the spectro-temporal level differences between disordered and normal speech that systematically manifest in articulatory imprecision, decreased volume and clarity, slower speaking rates and increased dysfluencies, novel spectro-temporal subspace basis embedding deep features derived by SVD decomposition of speech spectrum are proposed to facilitate both accurate speech intelligibility assessment and auxiliary feature based speaker adaptation of state-of-the-art hybrid DNN and end-to-end disordered speech recognition systems. Experiments conducted on the UASpeech corpus suggest the proposed spectro-temporal deep feature adapted systems consistently outperformed baseline i-Vector adaptation by up to 2.63% absolute (8.6% relative) reduction in word error rate (WER) with or without data augmentation. Learning hidden unit contribution (LHUC) based speaker adaptation was further applied. The final speaker adapted system using the proposed spectral basis embedding features gave an overall WER of 25.6% on the UASpeech test set of 16 dysarthric speakers.

Submitted 14 January, 2022; originally announced January 2022.

Comments: Proceedings of INTERSPEECH 2021

arXiv:2201.04787 [pdf]

doi 10.1360/SSPMA-2022-0025

Semiquantum Private Comparison of Size Relationship Based on d-level Single-Particle States

Authors: Mao-Jie Geng, Tian-Jie Xu, Ying Chen, Tian-Yu Ye

Abstract: In this paper, we propose a novel semiquantum private comparison (SQPC) protocol of size relationship based on d-level single-particle states. The designed protocol can compare the size relationship of different privacy messages from two classical users with the help of a semi-honest third party (TP), who is permitted to misbehave on her own but cannot be in collusion with anyone else. The correctness analysis shows that this protocol can gain correct comparison results. The security analysis shows that this protocol can resist well-known outside attacks and participant attacks. Moreover, this protocol can guarantee that TP does not know the accurate comparison results. Compared with the only existing SQPC protocol of size relationship (Quantum Inf. Process. 20:124 (2021)), this protocol has advantages in terms of initial quantum resource, TP's measurement operations and TP's knowledge about the comparison results.

Submitted 12 January, 2022; originally announced January 2022.

Comments: 12 pages, 2 figures, 2 tables

Journal ref: Scientia Sinica Physica, Mechanica & Astronomica, 2022, 52(9): 290311

arXiv:2201.03943 [pdf, other]

Neural Architecture Search For LF-MMI Trained Time Delay Neural Networks

Authors: Shoukang Hu, Xurong Xie, Mingyu Cui, Jiajun Deng, Shansong Liu, Jianwei Yu, Mengzhe Geng, Xunying Liu, Helen Meng

Abstract: State-of-the-art automatic speech recognition (ASR) system development is data and computation intensive. The optimal design of deep neural networks (DNNs) for these systems often requires expert knowledge and empirical evaluation. In this paper, a range of neural architecture search (NAS) techniques are used to automatically learn two types of hyper-parameters of factored time delay neural networks (TDNN-Fs): i) the left and right splicing context offsets; and ii) the dimensionality of the bottleneck linear projection at each hidden layer. These techniques include the differentiable neural architecture search (DARTS) method integrating architecture learning with lattice-free MMI training; Gumbel-Softmax and pipelined DARTS methods reducing the confusion over candidate architectures and improving the generalization of architecture selection; and Penalized DARTS incorporating resource constraints to balance the trade-off between performance and system complexity. Parameter sharing among TDNN-F architectures allows an efficient search over up to 7^28 different systems. Statistically significant word error rate (WER) reductions of up to 1.2% absolute and relative model size reduction of 31% were obtained over a state-of-the-art 300-hour Switchboard corpus trained baseline LF-MMI TDNN-F system featuring speed perturbation, i-Vector and learning hidden unit contribution (LHUC) based speaker adaptation as well as RNNLM rescoring.
Performance contrasts on the same task against recent end-to-end systems reported in the literature suggest the best NAS auto-configured system achieves state-of-the-art WERs of 9.9% and 11.1% on the NIST Hub5'00 and Rt03s test sets respectively with up to 96% model size reduction. Further analysis using Bayesian learning shows that ...

Submitted 28 March, 2022; v1 submitted 8 January, 2022; originally announced January 2022.

Comments: Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP). arXiv admin note: text overlap with arXiv:2007.08818

arXiv:2201.00936 [pdf, other]

doi 10.1103/PhysRevA.105.012421

Experimental secure quantum key distribution in presence of polarization-dependent loss

Authors: Chunfeng Huang, Ye Chen, Long Jin, Minming Geng, Junwei Wang, Zhenrong Zhang, Kejin Wei

Abstract: Quantum key distribution (QKD) is theoretically secure using the principle of quantum mechanics; therefore, QKD is a promising solution for the future of secure communication. Although several experimental demonstrations of QKD have been reported, they have not considered the polarization-dependent loss in state preparation in the key-rate estimation. In this study, we experimentally characterized polarization-dependent loss in realistic state-preparation devices and verified that a considerable PDL exists in fiber- and silicon-based polarization modulators. Hence, the security of such QKD systems is compromised because of the secure key rate overestimation. Furthermore, we report a decoy-state BB84 QKD experiment considering polarization-dependent loss. Finally, we achieved rigorous finite-key security bound over up to 75 km fiber links by applying a recently proposed security proof. This study considers more realistic source flaws than most previous experiments; thus, it is crucial toward a secure QKD with imperfect practical devices.

Submitted 3 January, 2022; originally announced January 2022.

arXiv:2112.14379 [pdf, other]

Background-aware Classification Activation Map for Weakly Supervised Object Localization

Authors: Lei Zhu, Qi She, Qian Chen, Xiangxi Meng, Mufeng Geng, Lujia Jin, Zhe Jiang, Bin Qiu, Yunfei You, Yibao Zhang, Qiushi Ren, Yanye Lu

Abstract: Weakly supervised object localization (WSOL) relaxes the requirement of dense annotations for object localization by using image-level classification masks to supervise its learning process. However, current WSOL methods suffer from excessive activation of background locations and need post-processing to obtain the localization mask. This paper attributes these issues to the unawareness of background cues, and proposes the background-aware classification activation map (B-CAM) to simultaneously learn localization scores of both object and background with only image-level labels. In our B-CAM, two image-level features, aggregated by pixel-level features of potential background and object locations, are used to purify the object feature from the object-related background and to represent the feature of the pure-background sample, respectively. Then based on these two features, both the object classifier and the background classifier are learned to determine the binary object localization mask. Our B-CAM can be trained in an end-to-end manner based on a proposed stagger classification loss, which not only improves object localization but also suppresses the background activation. Experiments show that our B-CAM outperforms one-stage WSOL methods on the CUB-200, OpenImages and VOC2012 datasets.

Submitted 28 December, 2021; originally announced December 2021.

arXiv:2112.05874 [pdf]

doi 10.1007/s11128-022-03615-5

Single-state multi-party semiquantum key agreement protocol based on multi-particle GHZ entangled states

Authors: Tian-Jie Xu, Ying Chen, Mao-Jie Geng, Tian-Yu Ye

Abstract: In this paper, we first put forward a novel single-state three-party semiquantum key agreement (SQKA) protocol with three-particle GHZ entangled states. Unlike previous quantum key agreement (QKA) protocols, the proposed single-state three-party SQKA protocol can realize the goal that a quantum party and two classical parties who only possess limited quantum capabilities equally contribute to the generation of a shared private key over quantum channels. Detailed security analysis shows that the proposed single-state three-party SQKA protocol is secure against several famous attacks from an outside eavesdropper, such as the Trojan horse attack, the entangle-measure attack, the measure-resend attack and the intercept-resend attack. Moreover, it can resist the participant attack, which means that the shared private key cannot be determined fully by any nontrivial subset of three parties. The proposed single-state three-party SQKA protocol has the following nice features: (1) it only employs one kind of three-particle GHZ entangled states as initial quantum resource; (2) it doesn't need pre-shared keys among different parties; (3) it doesn't need unitary operations or quantum entanglement swapping. Finally, we generalize the proposed single-state three-party SQKA protocol into the case of N-party by only employing one kind of N-particle GHZ entangled states as initial quantum resource, which inherits the nice features of its three-party counterpart.

Submitted 30 July, 2022; v1 submitted 10 December, 2021; originally announced December 2021.

Comments: 21 pages, 2 figures, 2 tables

Journal ref: Quantum Information Processing, 2022, 21:266

Showing 1–50 of 69 results for author: Geng, M