Showing 1–11 of 11 results for author: Jalal, M A

  1. arXiv:2406.17159  [pdf, other]

    eess.AS cs.MM cs.SD

    Exploring compressibility of transformer based text-to-music (TTM) models

    Authors: Vasileios Moschopoulos, Thanasis Kotsiopoulos, Pablo Peso Parada, Konstantinos Nikiforidis, Alexandros Stergiadis, Gerasimos Papakostas, Md Asif Jalal, Jisi Zhang, Anastasios Drosou, Karthikeyan Saravanan

    Abstract: State-of-the-art Text-To-Music (TTM) generative AI models are large and require desktop or server-class compute, making them infeasible for deployment on mobile phones. This paper presents an analysis of trade-offs between model compression and generation performance of TTM models. We study compression through knowledge distillation and specific modifications that enable applicability over the var…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: Proceedings of INTERSPEECH 2024

  2. arXiv:2401.13146  [pdf, other]

    eess.AS cs.CL cs.SD

    Locality enhanced dynamic biasing and sampling strategies for contextual ASR

    Authors: Md Asif Jalal, Pablo Peso Parada, George Pavlidis, Vasileios Moschopoulos, Karthikeyan Saravanan, Chrysovalantis-Giorgos Kontoulis, Jisi Zhang, Anastasios Drosou, Gil Ho Lee, Jungin Lee, Seokyeong Jung

    Abstract: Automatic Speech Recognition (ASR) still faces challenges when recognizing time-variant rare phrases. Contextual biasing (CB) modules bias the ASR model towards such contextually relevant phrases. During training, a list of biasing phrases is selected from a large pool of phrases following a sampling strategy. In this work we first analyse different sampling strategies to provide insights into the t…

    Submitted 23 January, 2024; originally announced January 2024.

    Comments: Accepted for IEEE ASRU 2023

  3. arXiv:2401.12085  [pdf, other]

    eess.AS cs.SD

    Consistency Based Unsupervised Self-training For ASR Personalisation

    Authors: Jisi Zhang, Vandana Rajan, Haaris Mehmood, David Tuckey, Pablo Peso Parada, Md Asif Jalal, Karthikeyan Saravanan, Gil Ho Lee, Jungin Lee, Seokyeong Jung

    Abstract: On-device Automatic Speech Recognition (ASR) models trained on speech data of a large population might underperform for individuals unseen during training. This is due to a domain shift between user data and the original training data, which differ in the user's speaking characteristics and environmental acoustic conditions. ASR personalisation is a solution that aims to exploit user data to improve model…

    Submitted 22 January, 2024; originally announced January 2024.

    Comments: Accepted for IEEE ASRU 2023

  4. arXiv:2307.13343  [pdf, other]

    eess.AS cs.CR cs.SD

    On-Device Speaker Anonymization of Acoustic Embeddings for ASR based on Flexible Location Gradient Reversal Layer

    Authors: Md Asif Jalal, Pablo Peso Parada, Jisi Zhang, Karthikeyan Saravanan, Mete Ozay, Myoungji Han, Jung In Lee, Seokyeong Jung

    Abstract: Smart devices serviced by large-scale AI models necessitate user data transfer to the cloud for inference. For speech applications, this means transferring private user information, e.g., speaker identity. Our paper proposes a privacy-enhancing framework that targets speaker identity anonymization while preserving speech recognition accuracy for our downstream task - Automatic Speech Recognition…

    Submitted 25 July, 2023; originally announced July 2023.

    Comments: Proceedings of INTERSPEECH 2023

  5. arXiv:2306.17500  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Empirical Interpretation of the Relationship Between Speech Acoustic Context and Emotion Recognition

    Authors: Anna Ollerenshaw, Md Asif Jalal, Rosanna Milner, Thomas Hain

    Abstract: Speech emotion recognition (SER) is vital for obtaining emotional intelligence and understanding the contextual meaning of speech. Variations of consonant-vowel (CV) phonemic boundaries can enrich acoustic context with linguistic cues, which impacts SER. In practice, speech emotions are treated as single labels over an acoustic segment for a given time duration. However, phone boundaries within sp…

    Submitted 30 June, 2023; originally announced June 2023.

  6. arXiv:2303.00550  [pdf, other]

    eess.AS cs.SD

    Towards domain generalisation in ASR with elitist sampling and ensemble knowledge distillation

    Authors: Rehan Ahmad, Md Asif Jalal, Muhammad Umar Farooq, Anna Ollerenshaw, Thomas Hain

    Abstract: Knowledge distillation has been widely used for model compression and domain adaptation for speech applications. In the presence of multiple teachers, knowledge can easily be transferred to the student by averaging the models' outputs. However, previous research shows that the student does not adapt well with such a combination. This paper proposes to use an elitist sampling strategy at the output of ens…

    Submitted 1 March, 2023; originally announced March 2023.

  7. arXiv:2211.02000  [pdf, other]

    cs.SD cs.CL eess.AS

    Dynamic Kernels and Channel Attention for Low Resource Speaker Verification

    Authors: Anna Ollerenshaw, Md Asif Jalal, Thomas Hain

    Abstract: State-of-the-art speaker verification frameworks have typically focused on developing increasingly deeper (more layers) and wider (more channels) models to improve their verification performance. Instead, this paper proposes an approach to increase the model resolution capability using attention-based dynamic kernels in a convolutional neural network to adapt the model parameters…

    Submitted 27 February, 2023; v1 submitted 3 November, 2022; originally announced November 2022.

  8. arXiv:2211.01993  [pdf, other]

    cs.CL cs.SD eess.AS

    Probing Statistical Representations For End-To-End ASR

    Authors: Anna Ollerenshaw, Md Asif Jalal, Thomas Hain

    Abstract: End-to-End automatic speech recognition (ASR) models aim to learn a generalised speech representation to perform recognition. In this domain there is little research to analyse internal representation dependencies and their relationship to modelling approaches. This paper investigates cross-domain language model dependencies within transformer architectures using SVCCA and uses these insights to e…

    Submitted 3 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  9. A cross-corpus study on speech emotion recognition

    Authors: Rosanna Milner, Md Asif Jalal, Raymond W. M. Ng, Thomas Hain

    Abstract: For speech emotion datasets, it has been difficult to acquire large quantities of reliable data, and acted emotions may be over the top compared to less expressive emotions displayed in everyday life. Lately, larger datasets with natural emotions have been created. Instead of ignoring smaller, acted datasets, this study investigates whether information learnt from acted emotions is useful for detec…

    Submitted 5 July, 2022; originally announced July 2022.

    Comments: ASRU 2019

    Journal ref: IEEE Workshop on Automatic Speech Recognition and Understanding 2019

  10. Insights on Neural Representations for End-to-End Speech Recognition

    Authors: Anna Ollerenshaw, Md Asif Jalal, Thomas Hain

    Abstract: End-to-end automatic speech recognition (ASR) models aim to learn a generalised speech representation. However, there are limited tools available to understand the internal functions and the effect of hierarchical dependencies within the model architecture. It is crucial to understand the correlations between the layer-wise representations, to derive insights on the relationship between neural rep…

    Submitted 19 May, 2022; originally announced May 2022.

    Comments: Submitted to Interspeech 2021

    Journal ref: Proc. Interspeech 2021, 4079-4083

  11. arXiv:2102.11420  [pdf, other]

    cs.SD eess.AS

    Investigating Deep Neural Structures and their Interpretability in the Domain of Voice Conversion

    Authors: Samuel J. Broughton, Md Asif Jalal, Roger K. Moore

    Abstract: Generative Adversarial Networks (GANs) are machine learning networks based around creating synthetic data. Voice Conversion (VC) is a subset of voice translation that involves translating the paralinguistic features of a source speaker to a target speaker while preserving the linguistic information. The aim of non-parallel conditional GANs for VC is to translate an acoustic speech feature sequence…

    Submitted 22 February, 2021; originally announced February 2021.

    Comments: For demo, see https://samuelbroughton.github.io/interpretability-demo-2020/