Showing 1–27 of 27 results for author: Zheng, T F

  1. arXiv:2406.19706  [pdf, other]

    cs.SD eess.AS

    SAML: Speaker Adaptive Mixture of LoRA Experts for End-to-End ASR

    Authors: Qiuming Zhao, Guangzhi Sun, Chao Zhang, Mingxing Xu, Thomas Fang Zheng

    Abstract: Mixture-of-experts (MoE) models have achieved excellent results in many tasks. However, conventional MoE models are often very large, making them challenging to deploy on resource-constrained edge devices. In this paper, we propose a novel speaker adaptive mixture of LoRA experts (SAML) approach, which uses low-rank adaptation (LoRA) modules as experts to reduce the number of trainable parameters…

    Submitted 28 June, 2024; originally announced June 2024.

    Comments: 5 pages, accepted by Interspeech 2024. arXiv admin note: substantial text overlap with arXiv:2309.09136

  2. arXiv:2309.09136  [pdf, other]

    cs.SD cs.AI eess.AS

    Enhancing Quantised End-to-End ASR Models via Personalisation

    Authors: Qiuming Zhao, Guangzhi Sun, Chao Zhang, Mingxing Xu, Thomas Fang Zheng

    Abstract: Recent end-to-end automatic speech recognition (ASR) models have become increasingly large, making them particularly challenging to deploy on resource-constrained devices. Model quantisation is an effective solution that sometimes causes the word error rate (WER) to increase. In this paper, a novel strategy of personalisation for a quantised model (PQM) is proposed, which combines speaker ad…

    Submitted 16 September, 2023; originally announced September 2023.

    Comments: 5 pages, submitted to ICASSP 2024

  3. arXiv:2111.12324  [pdf, other]

    cs.SD eess.AS

    How Speech is Recognized to Be Emotional - A Study Based on Information Decomposition

    Authors: Haoran Sun, Lantian Li, Thomas Fang Zheng, Dong Wang

    Abstract: The way that humans encode their emotion into speech signals is complex. For instance, an angry man may increase his pitch and speaking rate, and use impolite words. In this paper, we present a preliminary study on various emotional factors and investigate how each of them impacts modern emotion recognition systems. The key tool of our study is the SpeechFlow model presented recently, by which we…

    Submitted 24 November, 2021; originally announced November 2021.

  4. arXiv:2110.05087  [pdf]

    cs.SD eess.AS

    A Multi-Resolution Front-End for End-to-End Speech Anti-Spoofing

    Authors: Wei Liu, Meng Sun, Xiongwei Zhang, Hugo Van hamme, Thomas Fang Zheng

    Abstract: The choice of an optimal time-frequency resolution is usually a difficult but important step in tasks involving speech signal classification, e.g., speech anti-spoofing. The variations of the performance with different choices of time-frequency resolutions can be as large as those with different model architectures, which makes it difficult to judge what the improvement actually comes from when a n…

    Submitted 11 October, 2021; originally announced October 2021.

    Comments: submitted to ICASSP 2022

  5. Attack on practical speaker verification system using universal adversarial perturbations

    Authors: Weiyi Zhang, Shuning Zhao, Le Liu, Jianmin Li, Xingliang Cheng, Thomas Fang Zheng, Xiaolin Hu

    Abstract: In authentication scenarios, applications of practical speaker verification systems usually require a person to read a dynamic authentication text. Previous studies played an audio adversarial example as a digital signal to perform physical attacks, which would be easily rejected by audio replay detection modules. This work shows that by playing our crafted adversarial perturbation as a separate s…

    Submitted 19 May, 2021; originally announced May 2021.

    Comments: 6 pages, 2 figures

  6. arXiv:2012.12468  [pdf, other]

    cs.SD eess.AS

    CN-Celeb: multi-genre speaker recognition

    Authors: Lantian Li, Ruiqi Liu, Jiawen Kang, Yue Fan, Hao Cui, Yunqi Cai, Ravichander Vipperla, Thomas Fang Zheng, Dong Wang

    Abstract: Research on speaker recognition is extending to address the vulnerability in the wild conditions, among which genre mismatch is perhaps the most challenging, for instance, enrollment with reading speech while testing with conversational or singing audio. This mismatch leads to complex and composite inter-session variations, both intrinsic (i.e., speaking style, physiological status) and extrinsic…

    Submitted 24 November, 2021; v1 submitted 22 December, 2020; originally announced December 2020.

    Comments: submitted to Speech Communication

  7. arXiv:2010.14243  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    Squeezing value of cross-domain labels: a decoupled scoring approach for speaker verification

    Authors: Lantian Li, Yang Zhang, Jiawen Kang, Thomas Fang Zheng, Dong Wang

    Abstract: Domain mismatch often occurs in real applications and causes serious performance reduction on speaker verification systems. The common wisdom is to collect cross-domain data and train a multi-domain PLDA model, with the hope to learn a domain-independent speaker subspace. In this paper, we firstly present an empirical study to show that simply adding cross-domain data does not help performance in…

    Submitted 27 October, 2020; originally announced October 2020.

    Comments: Submitted to ICASSP 2021

  8. arXiv:2010.14242  [pdf, other]

    cs.SD cs.LG eess.AS

    Deep generative factorization for speech signal

    Authors: Haoran Sun, Lantian Li, Yunqi Cai, Yang Zhang, Thomas Fang Zheng, Dong Wang

    Abstract: Various information factors are blended in speech signals, which forms the primary difficulty for most speech information processing tasks. An intuitive idea is to factorize speech signal into individual information factors (e.g., phonetic content and speaker trait), though it turns out to be highly challenging. This paper presents a speech factorization approach based on a novel factorial discrim…

    Submitted 27 October, 2020; originally announced October 2020.

    Comments: Submitted to ICASSP 2021

  9. arXiv:2009.06863  [pdf]

    eess.AS cs.CR cs.SD

    When Automatic Voice Disguise Meets Automatic Speaker Verification

    Authors: Linlin Zheng, Jiakang Li, Meng Sun, Xiongwei Zhang, Thomas Fang Zheng

    Abstract: The technique of transforming voices in order to hide the real identity of a speaker is called voice disguise. Automatic voice disguise (AVD), which modifies the spectral and temporal characteristics of voices with miscellaneous algorithms, is easily conducted with software accessible to the public. AVD has posed a great threat to both human listening and automatic speaker verification (AS…

    Submitted 15 September, 2020; originally announced September 2020.

    Comments: accepted for publication

    Journal ref: IEEE Transactions on Information Forensics and Security, 2020

  10. arXiv:1803.00886  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Deep factorization for speech signal

    Authors: Lantian Li, Dong Wang, Yixiang Chen, Ying Shi, Zhiyuan Tang, Thomas Fang Zheng

    Abstract: Various informative factors are mixed in speech signals, leading to great difficulty when decoding any of the factors. An intuitive idea is to factorize each speech frame into individual informative factors, though it turns out to be highly difficult. Recently, we found that speaker traits, which were assumed to be long-term distributional properties, are actually short-time patterns, and can be learn…

    Submitted 27 February, 2018; originally announced March 2018.

    Comments: Accepted by ICASSP 2018. arXiv admin note: substantial text overlap with arXiv:1706.01777

  11. arXiv:1711.00366  [pdf, other]

    cs.SD cs.LG eess.AS

    Full-info Training for Deep Speaker Feature Learning

    Authors: Lantian Li, Zhiyuan Tang, Dong Wang, Thomas Fang Zheng

    Abstract: Recent studies have shown that speaker patterns can be learned from very short speech segments (e.g., 0.3 seconds) by a carefully designed convolutional & time-delay deep neural network (CT-DNN) model. By enforcing the model to discriminate the speakers in the training data, frame-level speaker features can be derived from the last hidden layer. In spite of its good performance, a potential…

    Submitted 27 February, 2018; v1 submitted 31 October, 2017; originally announced November 2017.

    Comments: Accepted by ICASSP 2018

  12. arXiv:1710.01789  [pdf, ps, other]

    cs.CL

    Enhanced Neural Machine Translation by Learning from Draft

    Authors: Aodong Li, Shiyue Zhang, Dong Wang, Thomas Fang Zheng

    Abstract: Neural machine translation (NMT) has recently achieved impressive results. A potential problem of the existing NMT algorithm, however, is that the decoding is conducted from left to right, without considering the right context. This paper proposes a two-stage approach to solve the problem. In the first stage, a conventional attention-based NMT system is used to produce a draft translation, and in…

    Submitted 4 October, 2017; originally announced October 2017.

  13. arXiv:1706.07861  [pdf, other]

    cs.SD cs.CL

    Cross-lingual Speaker Verification with Deep Feature Learning

    Authors: Lantian Li, Dong Wang, Askar Rozi, Thomas Fang Zheng

    Abstract: Existing speaker verification (SV) systems often suffer from performance degradation if there is any language mismatch between model training, speaker enrollment, and test. A major cause of this degradation is that most existing SV methods rely on a probabilistic model to infer the speaker factor, so any significant change in the distribution of the speech signal will impact the inference. Recentl…

    Submitted 22 June, 2017; originally announced June 2017.

  14. arXiv:1706.07859  [pdf, other]

    cs.SD cs.CL

    Deep Speaker Verification: Do We Need End to End?

    Authors: Dong Wang, Lantian Li, Zhiyuan Tang, Thomas Fang Zheng

    Abstract: End-to-end learning treats the entire system as a whole adaptable black box, which, if sufficient data are available, may learn a system that works very well for the target task. This principle has recently been applied to several prototype studies on speaker verification (SV), where the feature learning and classifier are learned together with an objective function that is consistent with the ev…

    Submitted 22 June, 2017; originally announced June 2017.

  15. arXiv:1706.02101  [pdf, other]

    cs.SD

    A Study on Replay Attack and Anti-Spoofing for Automatic Speaker Verification

    Authors: Lantian Li, Yixiang Chen, Dong Wang, Thomas Fang Zheng

    Abstract: For practical automatic speaker verification (ASV) systems, replay attack poses a true risk. By replaying a pre-recorded speech signal of the genuine speaker, ASV systems tend to be easily fooled. An effective replay detection method is therefore highly desirable. In this study, we investigate a major difficulty in replay detection: the over-fitting problem caused by variability factors in speech…

    Submitted 7 June, 2017; originally announced June 2017.

  16. arXiv:1609.08419  [pdf, other]

    cs.SD cs.AI cs.LO

    Decision Making Based on Cohort Scores for Speaker Verification

    Authors: Lantian Li, Renyu Wang, Gang Wang, Caixia Wang, Thomas Fang Zheng

    Abstract: Decision making is an important component in a speaker verification system. For the conventional GMM-UBM architecture, the decision is usually conducted based on the log likelihood ratio of the test utterance against the GMM of the claimed speaker and the UBM. This single-score decision is simple but tends to be sensitive to the complex variations in speech signals (e.g. text content, channel, spe…

    Submitted 27 September, 2016; originally announced September 2016.

    Comments: APSIPA ASC 2016

  17. arXiv:1603.09460  [pdf, ps, other]

    cs.CL cs.NE

    System Combination for Short Utterance Speaker Recognition

    Authors: Lantian Li, Dong Wang, Xiaodong Zhang, Thomas Fang Zheng, Panshi Jin

    Abstract: For text-independent short-utterance speaker recognition (SUSR), the performance often degrades dramatically. This paper presents a combination approach to the SUSR tasks with two phonetic-aware systems: one is the DNN-based i-vector system and the other is our recently proposed subregion-based GMM-UBM system. The former employs phone posteriors to construct an i-vector model in which the shared s…

    Submitted 27 September, 2016; v1 submitted 31 March, 2016; originally announced March 2016.

    Comments: APSIPA ASC 2016

  18. arXiv:1511.06066  [pdf, ps, other]

    cs.CL cs.LG

    Transfer Learning for Speech and Language Processing

    Authors: Dong Wang, Thomas Fang Zheng

    Abstract: Transfer learning is a vital technique that generalizes models trained for one setting or task to other settings or tasks. For example in speech recognition, an acoustic model trained for one language can be used to recognize speech in another language, with little or no re-training data. Transfer learning is closely related to multi-task learning (cross-lingual vs. multilingual), and is tradition…

    Submitted 19 November, 2015; originally announced November 2015.

    Comments: 13 pages, APSIPA 2015

  19. arXiv:1510.05940  [pdf, other]

    cs.SD cs.LG

    Max-margin Metric Learning for Speaker Recognition

    Authors: Lantian Li, Dong Wang, Chao Xing, Thomas Fang Zheng

    Abstract: Probabilistic linear discriminant analysis (PLDA) is a popular normalization approach for the i-vector model, and has delivered state-of-the-art performance in speaker recognition. A potential problem of the PLDA model, however, is that it essentially assumes Gaussian distributions over speaker vectors, which is not always true in practice. Additionally, the objective function is not directly rela…

    Submitted 31 March, 2016; v1 submitted 20 October, 2015; originally announced October 2015.

  20. arXiv:1510.05937  [pdf, ps, other]

    cs.SD cs.LG

    Binary Speaker Embedding

    Authors: Lantian Li, Dong Wang, Chao Xing, Kaimin Yu, Thomas Fang Zheng

    Abstract: The popular i-vector model represents speakers as low-dimensional continuous vectors (i-vectors), and hence it is a way of continuous speaker embedding. In this paper, we investigate binary speaker embedding, which transforms i-vectors to binary vectors (codes) by a hash function. We start from locality sensitive hashing (LSH), a simple binarization approach where binary codes are derived from a s…

    Submitted 31 March, 2016; v1 submitted 20 October, 2015; originally announced October 2015.

  21. arXiv:1509.01183  [pdf, ps, other]

    cs.DC cs.DB

    Parallel Knowledge Embedding with MapReduce on a Multi-core Processor

    Authors: Miao Fan, Qiang Zhou, Thomas Fang Zheng, Ralph Grishman

    Abstract: This article firstly attempts to explore parallel algorithms of learning distributed representations for both entities and relations in large-scale knowledge repositories with MapReduce programming model on a multi-core processor. We accelerate the training progress of a canonical knowledge embedding method, i.e. translating embedding (TransE) model, by dividing a whole knowledge…

    Submitted 3 September, 2015; originally announced September 2015.

  22. arXiv:1505.06427  [pdf, other]

    cs.CL cs.LG cs.NE

    Deep Speaker Vectors for Semi Text-independent Speaker Verification

    Authors: Lantian Li, Dong Wang, Zhiyong Zhang, Thomas Fang Zheng

    Abstract: Recent research shows that deep neural networks (DNNs) can be used to extract deep speaker vectors (d-vectors) that preserve speaker characteristics and can be used in speaker verification. This new method has been tested on text-dependent speaker verification tasks, and improvement was reported when combined with the conventional i-vector method. This paper extends the d-vector approach to semi…

    Submitted 24 May, 2015; originally announced May 2015.

  23. arXiv:1505.03823  [pdf, other]

    cs.CL cs.IR

    Distant Supervision for Entity Linking

    Authors: Miao Fan, Qiang Zhou, Thomas Fang Zheng

    Abstract: Entity linking is an indispensable operation of populating knowledge repositories for information extraction. It studies how to align a textual entity mention with its corresponding disambiguated entry in a knowledge repository. In this paper, we propose a new paradigm named distantly supervised entity linking (DSEL), in the sense that the disambiguated entities that belong to a huge knowledge reposi…

    Submitted 4 August, 2015; v1 submitted 14 May, 2015; originally announced May 2015.

  24. arXiv:1505.02433  [pdf, other]

    cs.AI

    Probabilistic Belief Embedding for Knowledge Base Completion

    Authors: Miao Fan, Qiang Zhou, Andrew Abel, Thomas Fang Zheng, Ralph Grishman

    Abstract: This paper contributes a novel embedding model which measures the probability of each belief $\langle h,r,t,m\rangle$ in a large-scale knowledge repository via simultaneously learning distributed representations for entities ($h$ and $t$), relations ($r$), and the words in relation mentions ($m$). It facilitates knowledge completion by means of simple vector operations to discover new beliefs. Giv…

    Submitted 22 May, 2015; v1 submitted 10 May, 2015; originally announced May 2015.

    Comments: arXiv admin note: text overlap with arXiv:1503.08155

  25. arXiv:1504.01684  [pdf, other]

    cs.AI cs.CL

    Large Margin Nearest Neighbor Embedding for Knowledge Representation

    Authors: Miao Fan, Qiang Zhou, Thomas Fang Zheng, Ralph Grishman

    Abstract: The traditional way of storing facts in triplets (head_entity, relation, tail_entity), abbreviated as (h, r, t), makes the knowledge intuitive to display and easy for humans to acquire, but hard for AI machines to compute or reason over. Inspired by the success in applying Distributed Representations to AI-related fields, recent studies expect to represent each entity and relatio…

    Submitted 7 April, 2015; originally announced April 2015.

    Comments: arXiv admin note: text overlap with arXiv:1503.08155

  26. arXiv:1503.08155  [pdf, other]

    cs.AI cs.CL

    Learning Embedding Representations for Knowledge Inference on Imperfect and Incomplete Repositories

    Authors: Miao Fan, Qiang Zhou, Thomas Fang Zheng

    Abstract: This paper considers the problem of knowledge inference on large-scale imperfect repositories with incomplete coverage by means of embedding entities and relations at the first attempt. We propose IIKE (Imperfect and Incomplete Knowledge Embedding), a probabilistic model which measures the probability of each belief, i.e. $\langle h,r,t\rangle$, in large-scale knowledge bases such as NELL and Free…

    Submitted 27 March, 2015; originally announced March 2015.

  27. arXiv:1411.4455  [pdf, other]

    cs.CL cs.LG

    Errata: Distant Supervision for Relation Extraction with Matrix Completion

    Authors: Miao Fan, Deli Zhao, Qiang Zhou, Zhiyuan Liu, Thomas Fang Zheng, Edward Y. Chang

    Abstract: The essence of distantly supervised relation extraction is that it is an incomplete multi-label classification problem with sparse and noisy features. To tackle the sparsity and noise challenges, we propose solving the classification problem using matrix completion on a factorized matrix of minimized rank. We formulate relation classification as completing the unknown labels of testing items (entity…

    Submitted 17 November, 2014; originally announced November 2014.