Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection

Authors: Shruti Palaskar, Oggi Rudovic, Sameer Dharur, Florian Pesce, Gautam Krishna, Aswin Sivaraman, Jack Berkowitz, Ahmed Hussen Abdelaziz, Saurabh Adya, Ahmed Tewfik

Abstract: Although Large Language Models (LLMs) have shown promise for human-like conversations, they are primarily pre-trained on text data. Incorporating audio or video improves performance, but collecting large-scale multimodal data and pre-training multimodal LLMs is challenging. To this end, we propose a Fusion Low Rank Adaptation (FLoRA) technique that efficiently adapts a pre-trained unimodal LLM to consume new, previously unseen modalities via low rank adaptation. For device-directed speech detection, using FLoRA, the multimodal LLM achieves 22% relative reduction in equal error rate (EER) over the text-only approach and attains performance parity with its full fine-tuning (FFT) counterpart while needing to tune only a fraction of its parameters. Furthermore, with the newly introduced adapter dropout, FLoRA is robust to missing data, improving over FFT by 20% lower EER and 56% lower false accept rate. The proposed approach scales well for model sizes from 16M to 3B parameters.

Submitted 13 June, 2024; originally announced June 2024.

Comments: Accepted at Interspeech 2024

arXiv:2310.15261 [pdf, ps, other]

Modality Dropout for Multimodal Device Directed Speech Detection using Verbal and Non-Verbal Features

Authors: Gautam Krishna, Sameer Dharur, Oggi Rudovic, Pranay Dighe, Saurabh Adya, Ahmed Hussen Abdelaziz, Ahmed H Tewfik

Abstract: Device-directed speech detection (DDSD) is the binary classification task of distinguishing between queries directed at a voice assistant versus side conversation or background speech. State-of-the-art DDSD systems use verbal cues, e.g., acoustic, text and/or automatic speech recognition system (ASR) features, to classify speech as device-directed or otherwise, and often have to contend with one or more of these modalities being unavailable when deployed in real-world settings. In this paper, we investigate fusion schemes for DDSD systems that can be made more robust to missing modalities. Concurrently, we study the use of non-verbal cues, specifically prosody features, in addition to verbal cues for DDSD. We present different approaches to combine scores and embeddings from prosody with the corresponding verbal cues, finding that prosody improves DDSD performance by up to 8.5% in terms of false acceptance rate (FA) at a given fixed operating point via non-linear intermediate fusion, while our use of modality dropout techniques improves the performance of these models by 7.4% in terms of FA when evaluated with missing modalities during inference time.

Submitted 23 October, 2023; originally announced October 2023.

Comments: 5 pages

arXiv:2310.05886 [pdf, other]

doi 10.1109/ICASSP48485.2024.10447222

Streaming Anchor Loss: Augmenting Supervision with Temporal Significance

Authors: Utkarsh Oggy Sarawgi, John Berkowitz, Vineet Garg, Arnav Kundu, Minsik Cho, Sai Srujana Buddi, Saurabh Adya, Ahmed Tewfik

Abstract: Streaming neural network models for fast frame-wise responses to various speech and sensory signals are widely adopted on resource-constrained platforms. Hence, increasing the learning capacity of such streaming models (i.e., by adding more parameters) to improve the predictive power may not be viable for real-world tasks. In this work, we propose a new loss, Streaming Anchor Loss (SAL), to better utilize the given learning capacity by encouraging the model to learn more from essential frames. More specifically, our SAL and its focal variations dynamically modulate the frame-wise cross entropy loss based on the importance of the corresponding frames so that a higher loss penalty is assigned for frames within the temporal proximity of semantically critical events. Therefore, our loss ensures that the model training focuses on predicting the relatively rare but task-relevant frames. Experimental results with standard lightweight convolutional and recurrent streaming networks on three different speech based detection tasks demonstrate that SAL enables the model to learn the overall task more effectively with improved accuracy and latency, without any additional data, model parameters, or architectural changes.

Submitted 18 April, 2024; v1 submitted 9 October, 2023; originally announced October 2023.

Comments: Published at IEEE ICASSP 2024, please see https://ieeexplore.ieee.org/abstract/document/10447222

ACM Class: I.2.6; I.5.1; I.5.4; I.6.5

Journal ref: In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6110-6114). IEEE

arXiv:2309.04842 [pdf, other]

Leveraging Large Language Models for Exploiting ASR Uncertainty

Authors: Pranay Dighe, Yi Su, Shangshang Zheng, Yunshu Liu, Vineet Garg, Xiaochuan Niu, Ahmed Tewfik

Abstract: While large language models excel in a variety of natural language processing (NLP) tasks, to perform well on spoken language understanding (SLU) tasks, they must either rely on off-the-shelf automatic speech recognition (ASR) systems for transcription, or be equipped with an in-built speech modality. This work focuses on the former scenario, where the LLM's accuracy on SLU tasks is constrained by the accuracy of a fixed ASR system on the spoken input. Specifically, we tackle the speech-intent classification task, where a high word-error-rate can limit the LLM's ability to understand the spoken intent. Instead of chasing a high accuracy by designing complex or specialized architectures regardless of deployment costs, we seek to answer how far we can go without substantially changing the underlying ASR and LLM, which can potentially be shared by multiple unrelated tasks. To this end, we propose prompting the LLM with an n-best list of ASR hypotheses instead of only the error-prone 1-best hypothesis. We explore prompt-engineering to explain the concept of n-best lists to the LLM, followed by the finetuning of Low-Rank Adapters on the downstream tasks. Our approach using n-best lists proves to be effective on a device-directed speech detection task as well as on a keyword spotting task, where systems using n-best list prompts outperform those using the 1-best ASR hypothesis, thus paving the way for an efficient method to exploit ASR uncertainty via LLMs for speech-based applications.

Submitted 12 September, 2023; v1 submitted 9 September, 2023; originally announced September 2023.

Comments: Added references

arXiv:2302.10450 [pdf, other]

Automotive RADAR sub-sampling via object detection networks: Leveraging prior signal information

Authors: Madhumitha Sakthi, Ahmed Tewfik, Marius Arvinte, Haris Vikalo

Abstract: Automotive radar has increasingly attracted attention due to growing interest in autonomous driving technologies. Acquiring situational awareness using multimodal data collected at high sampling rates by various sensing devices including cameras, LiDAR, and radar requires considerable power, memory and compute resources which are often limited at an edge device. In this paper, we present a novel adaptive radar sub-sampling algorithm designed to identify regions that require more detailed/accurate reconstruction based on prior environmental conditions' knowledge, enabling near-optimal performance at considerably lower effective sampling rates. Designed to robustly perform under variable weather conditions, the algorithm was shown on the Oxford raw radar and RADIATE dataset to achieve accurate reconstruction utilizing only 10% of the original samples in good weather and 20% in extreme (snow, fog) weather conditions. A further modification of the algorithm incorporates object motion to enable reliable identification of important regions. This includes monitoring possible future occlusions caused by objects detected in the present frame. Finally, we train a YOLO network on the RADIATE dataset to perform object detection directly on RADAR data and obtain a 6.6% AP50 improvement over the baseline Faster R-CNN network.

Submitted 21 February, 2023; originally announced February 2023.

arXiv:2210.12134 [pdf, other]

Audio-to-Intent Using Acoustic-Textual Subword Representations from End-to-End ASR

Authors: Pranay Dighe, Prateeth Nayak, Oggi Rudovic, Erik Marchi, Xiaochuan Niu, Ahmed Tewfik

Abstract: Accurate prediction of the user intent to interact with a voice assistant (VA) on a device (e.g. on the phone) is critical for achieving naturalistic, engaging, and privacy-centric interactions with the VA. To this end, we present a novel approach to predict the user's intent (the user speaking to the device or not) directly from acoustic and textual information encoded at subword tokens which are obtained via an end-to-end ASR model. Modeling the subword tokens directly, compared to modeling the phonemes and/or full words, has at least two advantages: (i) it provides a unique vocabulary representation, where each token has a semantic meaning, in contrast to the phoneme-level representations, (ii) each subword token has a reusable "sub"-word acoustic pattern (that can be used to construct multiple full words), resulting in a much smaller vocabulary space than that of the full words. To learn the subword representations for the audio-to-intent classification, we extract: (i) acoustic information from an E2E-ASR model, which provides frame-level CTC posterior probabilities for the subword tokens, and (ii) textual information from a pre-trained continuous bag-of-words model capturing the semantic meaning of the subword tokens. The key to our approach is the way it combines acoustic subword-level posteriors with text information using the notion of positional-encoding in order to account for multiple ASR hypotheses simultaneously. We show that our approach provides more robust and richer representations for audio-to-intent classification, and is highly accurate, correctly mitigating 93.3% of unintended user audio from invoking the smart assistant at 99% true positive rate.

Submitted 21 October, 2022; originally announced October 2022.

arXiv:2207.04394 [pdf, other]

Radiomics-Guided Global-Local Transformer for Weakly Supervised Pathology Localization in Chest X-Rays

Authors: Yan Han, Gregory Holste, Ying Ding, Ahmed Tewfik, Yifan Peng, Zhangyang Wang

Abstract: Before the recent success of deep learning methods for automated medical image analysis, practitioners used handcrafted radiomic features to quantitatively describe local patches of medical images. However, extracting discriminative radiomic features relies on accurate pathology localization, which is difficult to acquire in real-world settings. Despite advances in disease classification and localization from chest X-rays, many approaches fail to incorporate clinically-informed domain knowledge. For these reasons, we propose a Radiomics-Guided Transformer (RGT) that fuses global image information with local knowledge-guided radiomics information to provide accurate cardiopulmonary pathology localization and classification without any bounding box annotations. RGT consists of an image Transformer branch, a radiomics Transformer branch, and fusion layers that aggregate image and radiomic information. Using the learned self-attention of its image branch, RGT extracts a bounding box for which to compute radiomic features, which are further processed by the radiomics branch; learned image and radiomic features are then fused and mutually interact via cross-attention layers. Thus, RGT utilizes a novel end-to-end feedback loop that can bootstrap accurate pathology localization only using image-level disease labels. Experiments on the NIH ChestXRay dataset demonstrate that RGT outperforms prior works in weakly supervised disease localization (by an average margin of 3.6% over various intersection-over-union thresholds) and classification (by 1.1% in average area under the receiver operating characteristic curve). We publicly release our codes and pre-trained models at https://github.com/VITA-Group/chext.

Submitted 19 October, 2022; v1 submitted 10 July, 2022; originally announced July 2022.

arXiv:2204.02455 [pdf, other]

Improving Voice Trigger Detection with Metric Learning

Authors: Prateeth Nayak, Takuya Higuchi, Anmol Gupta, Shivesh Ranjan, Stephen Shum, Siddharth Sigtia, Erik Marchi, Varun Lakshminarasimhan, Minsik Cho, Saurabh Adya, Chandra Dhir, Ahmed Tewfik

Abstract: Voice trigger detection is an important task, which enables activating a voice assistant when a target user speaks a keyword phrase. A detector is typically trained on speech data independent of speaker information and used for the voice trigger detection task. However, such a speaker independent voice trigger detector typically suffers from performance degradation on speech from underrepresented groups, such as accented speakers. In this work, we propose a novel voice trigger detector that can use a small number of utterances from a target speaker to improve detection accuracy. Our proposed model employs an encoder-decoder architecture. While the encoder performs speaker independent voice trigger detection, similar to the conventional detector, the decoder predicts a personalized embedding for each utterance. A personalized voice trigger score is then obtained as a similarity score between the embeddings of enrollment utterances and a test utterance. The personalized embedding allows adapting to the target speaker's speech when computing the voice trigger score, hence improving voice trigger detection accuracy. Experimental results show that the proposed approach achieves a 38% relative reduction in false rejection rate (FRR) compared to a baseline speaker independent voice trigger model.

Submitted 13 September, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

Comments: Accepted at InterSpeech 2022

arXiv:2203.15975 [pdf, other]

Device-Directed Speech Detection: Regularization via Distillation for Weakly-Supervised Models

Authors: Vineet Garg, Ognjen Rudovic, Pranay Dighe, Ahmed H. Abdelaziz, Erik Marchi, Saurabh Adya, Chandra Dhir, Ahmed Tewfik

Abstract: We address the problem of detecting speech directed to a device that does not contain a specific wake-word. Specifically, we focus on audio coming from a touch-based invocation. Mitigating virtual assistant (VA) activation due to accidental button presses is critical for user experience. While the majority of approaches to false trigger mitigation (FTM) are designed to detect the presence of a target keyword, inferring user intent in the absence of a keyword is difficult. This also poses a challenge when creating the training/evaluation data for such systems due to inherent ambiguity in the user's data. To this end, we propose a novel FTM approach that uses weakly-labeled training data obtained with a newly introduced data sampling strategy. While this sampling strategy reduces data annotation efforts, the data labels are noisy as the data are not annotated manually. We use these data to train an acoustics-only model for the FTM task by regularizing its loss function via knowledge distillation from an ASR-based (LatticeRNN) model. This improves the model decisions, resulting in 66% gain in accuracy, as measured by equal-error-rate (EER), over the base acoustics-only model. We also show that the ensemble of the LatticeRNN and acoustic-distilled models brings further accuracy improvement of 20%.

Submitted 29 March, 2022; originally announced March 2022.

Comments: Submitted to INTERSPEECH 2022

arXiv:2203.03905 [pdf, other]

End-to-end system for object detection from sub-sampled radar data

Authors: Madhumitha Sakthi, Ahmed Tewfik, Marius Arvinte, Haris Vikalo

Abstract: Robust and accurate sensing is of critical importance for advancing autonomous automotive systems. The need to acquire situational awareness in complex urban conditions using sensors such as radar has motivated research on power and latency-efficient signal acquisition methods. In this paper, we present an end-to-end signal processing pipeline, capable of operating in extreme weather conditions, that relies on sub-sampled radar data to perform object detection in vehicular settings. The results of the object detection are further utilized to sub-sample forthcoming radar data, which stands in contrast to prior work where the sub-sampling relies on image information. We show robust detection based on radar data reconstructed using 20% of samples under extreme weather conditions such as snow or fog, and on low-illuminated nights. Additionally, we generate 20% sampled radar data in a fine-tuning set and show 1.1% gain in AP50 across scenes and 3% AP50 gain in motorway condition.

Submitted 8 March, 2022; originally announced March 2022.

Comments: Submitted to EUSIPCO 2022

arXiv:2104.04968 [pdf, other]

Knowledge-Augmented Contrastive Learning for Abnormality Classification and Localization in Chest X-rays with Radiomics using a Feedback Loop

Authors: Yan Han, Chongyan Chen, Ahmed Tewfik, Benjamin Glicksberg, Ying Ding, Yifan Peng, Zhangyang Wang

Abstract: Building a highly accurate predictive model for classification and localization of abnormalities in chest X-rays usually requires a large number of manually annotated labels and pixel regions (bounding boxes) of abnormalities. However, it is expensive to acquire such annotations, especially the bounding boxes. Recently, contrastive learning has shown strong promise in leveraging unlabeled natural images to produce highly generalizable and discriminative features. However, extending its power to the medical image domain is under-explored and highly non-trivial, since medical images are much less amenable to data augmentations. In contrast, their prior knowledge, as well as radiomic features, is often crucial. To bridge this gap, we propose an end-to-end semi-supervised knowledge-augmented contrastive learning framework that simultaneously performs disease classification and localization tasks. The key knob of our framework is a unique positive sampling approach tailored for medical images, which seamlessly integrates radiomic features as a knowledge augmentation. Specifically, we first apply an image encoder to classify the chest X-rays and to generate the image features. We next leverage Grad-CAM to highlight the crucial (abnormal) regions for chest X-rays (even when unannotated), from which we extract radiomic features. The radiomic features are then passed through another dedicated encoder to act as the positive sample for the image features generated from the same chest X-ray. In this way, our framework constitutes a feedback loop for image and radiomic modality features to mutually reinforce each other. Their contrasting yields knowledge-augmented representations that are both robust and interpretable. Extensive experiments on the NIH Chest X-ray dataset demonstrate that our approach outperforms existing baselines in both classification and localization tasks.

Submitted 4 May, 2022; v1 submitted 11 April, 2021; originally announced April 2021.

Comments: Accepted by WACV 2022

arXiv:2103.02087 [pdf, other]

Deep J-Sense: Accelerated MRI Reconstruction via Unrolled Alternating Optimization

Authors: Marius Arvinte, Sriram Vishwanath, Ahmed H. Tewfik, Jonathan I. Tamir

Abstract: Accelerated multi-coil magnetic resonance imaging reconstruction has seen a substantial recent improvement combining compressed sensing with deep learning. However, most of these methods rely on estimates of the coil sensitivity profiles, or on calibration data for estimating model parameters. Prior work has shown that these methods degrade in performance when the quality of these estimators is poor or when the scan parameters differ from the training conditions. Here we introduce Deep J-Sense as a deep learning approach that builds on unrolled alternating minimization and increases robustness: our algorithm refines both the magnetization (image) kernel and the coil sensitivity maps. Experimental results on a subset of the knee fastMRI dataset show that this increases reconstruction performance and provides a significant degree of robustness to varying acceleration factors and calibration region sizes.

Submitted 11 April, 2021; v1 submitted 2 March, 2021; originally announced March 2021.

arXiv:2103.00383 [pdf, other]

Brain Signals to Rescue Aphasia, Apraxia and Dysarthria Speech Recognition

Authors: Gautam Krishna, Mason Carnahan, Shilpa Shamapant, Yashitha Surendranath, Saumya Jain, Arundhati Ghosh, Co Tran, Jose del R Millan, Ahmed H Tewfik

Abstract: In this paper, we propose a deep learning-based algorithm to improve the performance of automatic speech recognition (ASR) systems for aphasia, apraxia, and dysarthria speech by utilizing electroencephalography (EEG) features recorded synchronously with aphasia, apraxia, and dysarthria speech. We demonstrate a significant decoding performance improvement of more than 50% during test time for the isolated speech recognition task, and we also provide preliminary results indicating performance improvement for the more challenging continuous speech recognition task by utilizing EEG features. The results presented in this paper show the first step towards demonstrating the possibility of utilizing non-invasive neural signals to design a real-time robust speech prosthetic for stroke survivors recovering from aphasia, apraxia, and dysarthria. Our aphasia, apraxia, and dysarthria speech-EEG data set will be released to the public to help further advance this interesting and crucial research.

Submitted 17 July, 2021; v1 submitted 27 February, 2021; originally announced March 2021.

Comments: Accepted to IEEE EMBC 2021

arXiv:2101.04269 [pdf, other]

Pneumonia Detection on Chest X-ray using Radiomic Features and Contrastive Learning

Authors: Yan Han, Chongyan Chen, Ahmed H Tewfik, Ying Ding, Yifan Peng

Abstract: Chest X-ray has become one of the most common medical diagnoses due to its noninvasiveness. The number of chest X-ray images has skyrocketed, but reading chest X-rays is still performed manually by radiologists, which creates huge burnout and delays. Traditionally, radiomics, as a subfield of radiology that can extract a large number of quantitative features from medical images, demonstrated its potential to facilitate medical imaging diagnosis before the deep learning era. With the rise of deep learning, the explainability of deep neural networks on chest X-ray diagnosis remains opaque. In this study, we propose a novel framework that leverages radiomics features and contrastive learning to detect pneumonia in chest X-ray. Experiments on the RSNA Pneumonia Detection Challenge dataset show that our model achieves superior results to several state-of-the-art models (> 10% in F1-score) and increases the model's interpretability.

Submitted 4 May, 2022; v1 submitted 11 January, 2021; originally announced January 2021.

Comments: Accepted for ISBI 2021

arXiv:2012.12843 [pdf, other]

EQ-Net: A Unified Deep Learning Framework for Log-Likelihood Ratio Estimation and Quantization

Authors: Marius Arvinte, Ahmed H. Tewfik, Sriram Vishwanath

Abstract: In this work, we introduce EQ-Net: the first holistic framework that solves both the tasks of log-likelihood ratio (LLR) estimation and quantization using a data-driven method. We motivate our approach with theoretical insights on two practical estimation algorithms at the ends of the complexity spectrum and reveal a connection between the complexity of an algorithm and the information bottleneck method: simpler algorithms admit smaller bottlenecks when representing their solution. This motivates us to propose a two-stage algorithm that uses LLR compression as a pretext task for estimation and is focused on low-latency, high-performance implementations via deep neural networks. We carry out extensive experimental evaluation and demonstrate that our single architecture achieves state-of-the-art results on both tasks when compared to previous methods, with gains in quantization efficiency as high as 20% and reduced estimation latency by up to 60% when measured on general purpose and graphical processing units (GPU). In particular, our approach reduces the GPU inference latency by more than two times in several multiple-input multiple-output (MIMO) configurations. Finally, we demonstrate that our scheme is robust to distributional shifts and retains a significant part of its performance when evaluated on 5G channel models, as well as channel estimation errors.

Submitted 3 May, 2021; v1 submitted 23 December, 2020; originally announced December 2020.

arXiv:2011.12506 [pdf, other]

Using Radiomics as Prior Knowledge for Thorax Disease Classification and Localization in Chest X-rays

Authors: Yan Han, Chongyan Chen, Liyan Tang, Mingquan Lin, Ajay Jaiswal, Song Wang, Ahmed Tewfik, George Shih, Ying Ding, Yifan Peng

Abstract: Chest X-ray has become one of the most common medical diagnoses due to its noninvasiveness. The number of chest X-ray images has skyrocketed, but reading chest X-rays is still performed manually by radiologists, which creates huge burnout and delays. Traditionally, radiomics, as a subfield of radiology that can extract a large number of quantitative features from medical images, demonstrated its potential to facilitate medical imaging diagnosis before the deep learning era. In this paper, we develop an end-to-end framework, ChexRadiNet, that can utilize the radiomics features to improve the abnormality classification performance. Specifically, ChexRadiNet first applies a light-weight but efficient triplet-attention mechanism to classify the chest X-rays and highlight the abnormal regions. Then it uses the generated class activation map to extract radiomic features, which further guides our model to learn more robust image features. After a number of iterations and with the help of radiomic features, our framework can converge to more accurate image regions. We evaluate the ChexRadiNet framework using three public datasets: NIH ChestX-ray, CheXpert, and MIMIC-CXR. We find that ChexRadiNet outperforms the state-of-the-art on both disease detection (0.843 in AUC) and localization (0.679 in T(IoU) = 0.1). We will make the code publicly available at https://github.com/bionlplab/lung_disease_detection_amia2021, with the hope that this method can facilitate the development of automatic systems with a higher-level understanding of the radiological world.

Submitted 9 July, 2021; v1 submitted 24 November, 2020; originally announced November 2020.

Comments: Accepted by AMIA 2021

arXiv:2010.02367 [pdf, other]

Automotive Radar Data Acquisition using Object Detection

Authors: Madhumitha Sakthi, Ahmed Tewfik

Abstract: The growing urban complexity demands an efficient algorithm to acquire and process various sensor information from autonomous vehicles. In this paper, we introduce an algorithm to utilize object detection results from the image to adaptively sample and acquire radar data using Compressed Sensing (CS). This novel algorithm is motivated by the hypothesis that with a limited sampling budget, allocating more sampling budget to areas with the object as opposed to a uniform sampling ultimately improves relevant object detection performance. We improve detection performance by dynamically allocating a lower sampling rate to objects such as buses than pedestrians leading to better reconstruction than baseline across areas with objects of interest. We automate the sampling rate allocation using linear programming and show significant time savings while reducing the radar block size by a factor of 2. We also analyze a Binary Permuted Diagonal measurement matrix for radar acquisition which is hardware-efficient and show its performance is similar to Gaussian and Binary Permuted Block Diagonal matrix. Our experiments on the Oxford radar dataset show an effective reconstruction of objects of interest with 10% sampling rate. Finally, we develop a transformer-based 2D object detection network using the NuScenes radar and image data.

Submitted 1 March, 2021; v1 submitted 5 October, 2020; originally announced October 2020.

Comments: Submitted to EUSIPCO 2021

arXiv:2008.07621 [pdf, other]

Speech Recognition using EEG signals recorded using dry electrodes

Authors: Gautam Krishna, Co Tran, Mason Carnahan, Morgan M Hagood, Ahmed H Tewfik

Abstract: In this paper, we demonstrate speech recognition using electroencephalography (EEG) signals obtained using dry electrodes on a limited English vocabulary consisting of three vowels and one word using a deep learning model. We demonstrate a test accuracy of 79.07 percent on a subset vocabulary consisting of two English vowels. Our results demonstrate the feasibility of using EEG signals recorded using dry electrodes for performing the task of speech recognition.

Submitted 13 August, 2020; originally announced August 2020.

arXiv:2006.03638 [pdf, other]

Robust Face Verification via Disentangled Representations

Authors: Marius Arvinte, Ahmed H. Tewfik, Sriram Vishwanath

Abstract: We introduce a robust algorithm for face verification, i.e., deciding whether two images are of the same person or not. Our approach is a novel take on the idea of using deep generative networks for adversarial robustness. We use the generative model during training as an online augmentation method instead of a test-time purifier that removes adversarial noise. Our architecture uses a contrastive loss term and a disentangled generative model to sample negative pairs. Instead of randomly pairing two real images, we pair an image with its class-modified counterpart while keeping its content (pose, head tilt, hair, etc.) intact. This enables us to efficiently sample hard negative pairs for the contrastive loss. We experimentally show that, when coupled with adversarial training, the proposed scheme converges with a weak inner solver and has a higher clean and robust accuracy than state-of-the-art methods when evaluated against white-box physical attacks.

Submitted 23 June, 2020; v1 submitted 5 June, 2020; originally announced June 2020.

Comments: Preprint

arXiv:2006.02902 [pdf]

Constrained Variational Autoencoder for improving EEG based Speech Recognition Systems

Authors: Gautam Krishna, Co Tran, Mason Carnahan, Ahmed Tewfik

Abstract: In this paper we introduce a recurrent neural network (RNN) based variational autoencoder (VAE) model with a new constrained loss function that can generate more meaningful electroencephalography (EEG) features from raw EEG features to improve the performance of EEG based speech recognition systems. We demonstrate that both continuous and isolated speech recognition systems trained and tested using EEG features generated from raw EEG features using our VAE model results in improved performance and we demonstrate our results for a limited English vocabulary consisting of 30 unique sentences for continuous speech recognition and for an English vocabulary consisting of 2 unique sentences for isolated speech recognition. We compare our method with another recently introduced method described by authors in [1] to improve the performance of EEG based continuous speech recognition systems and we demonstrate that our method outperforms their method as vocabulary size increases when trained and tested using the same data set. Even though we demonstrate results only for automatic speech recognition (ASR) experiments in this paper, the proposed VAE model with constrained loss function can be extended to a variety of other EEG based brain computer interface (BCI) applications.

Submitted 1 June, 2020; originally announced June 2020.

Comments: Under Review. arXiv admin note: substantial text overlap with arXiv:2006.01260

arXiv:2006.01262 [pdf, other]

Predicting Different Acoustic Features from EEG and towards direct synthesis of Audio Waveform from EEG

Authors: Gautam Krishna, Co Tran, Mason Carnahan, Ahmed Tewfik

Abstract: In [1,2] authors provided preliminary results for synthesizing speech from electroencephalography (EEG) features where they first predict acoustic features from EEG features and then the speech is reconstructed from the predicted acoustic features using the Griffin-Lim reconstruction algorithm. In this paper we first introduce a deep learning model that takes raw EEG waveform signals as input and directly produces audio waveform as output. We then demonstrate predicting 16 different acoustic features from EEG features. We demonstrate our results for both spoken and listen conditions in this paper. The results presented in this paper show how different acoustic features are related to non-invasive neural EEG signals recorded during speech perception and production.

Submitted 29 May, 2020; originally announced June 2020.

Comments: Under Review

arXiv:2006.01261 [pdf, other]

Understanding effect of speech perception in EEG based speech recognition systems

Authors: Gautam Krishna, Co Tran, Mason Carnahan, Ahmed Tewfik

Abstract: The electroencephalography (EEG) signals recorded in parallel with speech are used to perform isolated and continuous speech recognition. During speaking process, one also hears his or her own speech and this speech perception is also reflected in the recorded EEG signals. In this paper we investigate whether it is possible to separate out this speech perception component from EEG signals in order to design more robust EEG based speech recognition systems. We further demonstrate predicting EEG signals recorded in parallel with speaking from EEG signals recorded in parallel with passive listening and vice versa with very low normalized root mean squared error (RMSE). We finally demonstrate both isolated and continuous speech recognition using EEG signals recorded in parallel with listening, speaking and improve the previous connectionist temporal classification (CTC) model results demonstrated by authors in [1] using their data set.

Submitted 29 May, 2020; originally announced June 2020.

Comments: Under Review

arXiv:2006.01260 [pdf, other]

Improving EEG based continuous speech recognition using GAN

Authors: Gautam Krishna, Co Tran, Mason Carnahan, Ahmed Tewfik

Abstract: In this paper we demonstrate that it is possible to generate more meaningful electroencephalography (EEG) features from raw EEG features using generative adversarial networks (GAN) to improve the performance of EEG based continuous speech recognition systems. We improve the results demonstrated by authors in [1] using their data sets for some of the test time experiments and for other cases our results were comparable with theirs. Our proposed approach can be implemented without using any additional sensor information, whereas in [1] authors used additional features like acoustic or articulatory information to improve the performance of EEG based continuous speech recognition systems.

Submitted 29 May, 2020; originally announced June 2020.

Comments: Under Review

arXiv:2005.11235 [pdf, other]

Predicting Video features from EEG and Vice versa

Authors: Gautam Krishna, Co Tran, Mason Carnahan, Ahmed Tewfik

Abstract: In this paper we explore predicting facial or lip video features from electroencephalography (EEG) features and predicting EEG features from recorded facial or lip video frames using deep learning models. The subjects were asked to read out loud English sentences shown to them on a computer screen and their simultaneous EEG signals and facial video frames were recorded. Our model was able to generate very broad characteristics of the facial or lip video frame from input EEG features. Our results demonstrate the first step towards synthesizing high quality facial or lip video from recorded EEG features. We demonstrate results for a data set consisting of seven subjects.

Submitted 16 May, 2020; originally announced May 2020.

Comments: under review

arXiv:2004.04731 [pdf, other]

Advancing Speech Synthesis using EEG

Authors: Gautam Krishna, Co Tran, Mason Carnahan, Ahmed Tewfik

Abstract: In this paper we introduce an attention-regression model to demonstrate predicting acoustic features from electroencephalography (EEG) features recorded in parallel with spoken sentences. First we demonstrate predicting acoustic features directly from EEG features using our attention model and then we demonstrate predicting acoustic features from EEG features using a two-step approach where in the first step we use our attention model to predict articulatory features from EEG features and then in the second step another attention-regression model is trained to transform the predicted articulatory features to acoustic features. Our proposed attention-regression model demonstrates superior performance compared to the regression model introduced by authors in [1] when tested using their data set for the majority of the subjects during test time. The results presented in this paper further advance the work described by authors in [1].

Submitted 3 May, 2020; v1 submitted 9 April, 2020; originally announced April 2020.

Comments: Under review

arXiv:2003.04733 [pdf, other]

Speaker Identification using EEG

Authors: Gautam Krishna, Co Tran, Mason Carnahan, Ahmed Tewfik

Abstract: In this paper we explore speaker identification using electroencephalography (EEG) signals. The performance of speaker identification systems degrades in presence of background noise; this paper demonstrates that EEG features can be used to enhance the performance of speaker identification systems operating in presence and absence of background noise. The paper further demonstrates that in presence of high background noise, a speaker identification system using only EEG features as input demonstrates better performance than the system using only acoustic features as input.

Submitted 6 March, 2020; originally announced March 2020.

arXiv:2003.00007 [pdf, other]

Generating EEG features from Acoustic features

Authors: Gautam Krishna, Co Tran, Mason Carnahan, Yan Han, Ahmed H Tewfik

Abstract: In this paper we demonstrate predicting electroencephalography (EEG) features from acoustic features using recurrent neural network (RNN) based regression model and generative adversarial network (GAN). We predict various types of EEG features from acoustic features. We compare our results with the previously studied problem on speech synthesis using EEG and our results demonstrate that EEG features can be generated from acoustic features with lower root mean square error (RMSE) and normalized RMSE values compared to generating acoustic features from EEG features (i.e., speech synthesis using EEG) when tested using the same data sets.

Submitted 18 March, 2020; v1 submitted 29 February, 2020; originally announced March 2020.

arXiv:2002.12504 [pdf, other]

Detecting Patch Adversarial Attacks with Image Residuals

Authors: Marius Arvinte, Ahmed Tewfik, Sriram Vishwanath

Abstract: We introduce an adversarial sample detection algorithm based on image residuals, specifically designed to guard against patch-based attacks. The image residual is obtained as the difference between an input image and a denoised version of it, and a discriminator is trained to distinguish between clean and adversarial samples. More precisely, we use a wavelet domain algorithm for denoising images and demonstrate that the obtained residuals act as a digital fingerprint for adversarial attacks. To emulate the limitations of a physical adversary, we evaluate the performance of our approach against localized (patch-based) adversarial attacks, including in settings where the adversary has complete knowledge about the detection scheme. Our results show that the proposed detection method generalizes to previously unseen, stronger attacks and that it is able to reduce the success rate (conversely, increase the computational effort) of an adaptive attacker.

Submitted 2 March, 2020; v1 submitted 27 February, 2020; originally announced February 2020.

arXiv:2002.03851 [pdf, other]

Continuous Silent Speech Recognition using EEG

Authors: Gautam Krishna, Co Tran, Mason Carnahan, Ahmed Tewfik

Abstract: In this paper we explore continuous silent speech recognition using electroencephalography (EEG) signals. We implemented a connectionist temporal classification (CTC) automatic speech recognition (ASR) model to translate EEG signals recorded in parallel while subjects were reading English sentences in their mind without producing any voice to text. Our results demonstrate the feasibility of using EEG signals for performing continuous silent speech recognition. We demonstrate our results for a limited English vocabulary consisting of 30 unique sentences.

Submitted 4 May, 2020; v1 submitted 6 February, 2020; originally announced February 2020.

arXiv:2001.00501 [pdf, other]

EEG based Continuous Speech Recognition using Transformers

Authors: Gautam Krishna, Co Tran, Mason Carnahan, Ahmed H Tewfik

Abstract: In this paper we investigate continuous speech recognition using electroencephalography (EEG) features using a recently introduced end-to-end transformer based automatic speech recognition (ASR) model. Our results demonstrate that the transformer based model demonstrates faster training compared to recurrent neural network (RNN) based sequence-to-sequence EEG models and better performance during inference time for smaller test set vocabulary, but as we increase the vocabulary size, the performance of the RNN based models was better than that of the transformer based model on a limited English vocabulary.

Submitted 5 May, 2020; v1 submitted 31 December, 2019; originally announced January 2020.

arXiv:1912.07730 [pdf, other]

Continuous Speech Recognition using EEG and Video

Authors: Gautam Krishna, Mason Carnahan, Co Tran, Ahmed H Tewfik

Abstract: In this paper we investigate whether electroencephalography (EEG) features can be used to improve the performance of continuous visual speech recognition systems. We implemented a connectionist temporal classification (CTC) based end-to-end automatic speech recognition (ASR) model for performing recognition. Our results demonstrate that EEG features are helpful in enhancing the performance of continuous visual speech recognition systems.

Submitted 27 December, 2019; v1 submitted 16 December, 2019; originally announced December 2019.

Comments: On preparation for submission to EUSIPCO 2020. arXiv admin note: text overlap with arXiv:1911.11610, arXiv:1911.04261

arXiv:1911.11610 [pdf, other]

Improving EEG based Continuous Speech Recognition

Authors: Gautam Krishna, Co Tran, Mason Carnahan, Yan Han, Ahmed H Tewfik

Abstract: In this paper we introduce various techniques to improve the performance of electroencephalography (EEG) features based continuous speech recognition (CSR) systems. A connectionist temporal classification (CTC) based automatic speech recognition (ASR) system was implemented for performing recognition. We introduce techniques to initialize the weights of the recurrent layers in the encoder of the CTC model with more meaningful weights rather than with random weights and we make use of an external language model to improve the beam search during decoding time. We finally study the problem of predicting articulatory features from EEG features in this paper.

Submitted 23 December, 2019; v1 submitted 24 November, 2019; originally announced November 2019.

Comments: On preparation for submission to EUSIPCO 2020. arXiv admin note: text overlap with arXiv:1911.04261, arXiv:1906.08871

arXiv:1911.04261 [pdf, other]

Voice Activity Detection in presence of background noise using EEG

Authors: Gautam Krishna, Co Tran, Mason Carnahan, Yan Han, Ahmed H Tewfik

Abstract: In this paper we demonstrate that performance of voice activity detection (VAD) system operating in presence of background noise can be improved by concatenating acoustic input features with electroencephalography (EEG) features. We also demonstrate that VAD using only EEG features shows better performance than VAD using only acoustic features in presence of background noise. We implemented a recurrent neural network (RNN) based VAD system and we demonstrate our results for two different data sets recorded in presence of different noise conditions in this paper. We finally demonstrate the ability to predict whether a person wishes to continue speaking a sentence or not from EEG features.

Submitted 14 March, 2020; v1 submitted 8 November, 2019; originally announced November 2019.

Comments: On preparation for submission to EUSIPCO 2020. arXiv admin note: text overlap with arXiv:1906.08871, arXiv:1909.09132

arXiv:1909.09132 [pdf, other]

Spoken Speech Enhancement using EEG

Authors: Gautam Krishna, Co Tran, Yan Han, Mason Carnahan, Ahmed H Tewfik

Abstract: In this paper we demonstrate spoken speech enhancement using electroencephalography (EEG) signals using a generative adversarial network (GAN) based model, gated recurrent unit (GRU) regression based model, temporal convolutional network (TCN) regression model and finally using a mixed TCN GRU regression model. We compare our EEG based speech enhancement results with traditional log minimum mean-square error (MMSE) speech enhancement algorithm and our proposed methods demonstrate significant improvement in speech enhancement quality compared to the traditional method. Our overall results demonstrate that EEG features can be used to clean speech recorded in presence of background noise. To the best of our knowledge this is the first time a spoken speech enhancement is demonstrated using EEG features recorded in parallel with spoken speech.

Submitted 19 April, 2020; v1 submitted 13 September, 2019; originally announced September 2019.

arXiv:1908.05743 [pdf, other]

State-of-the-art Speech Recognition using EEG and Towards Decoding of Speech Spectrum From EEG

Authors: Gautam Krishna, Yan Han, Co Tran, Mason Carnahan, Ahmed H Tewfik

Abstract: In this paper we first demonstrate continuous noisy speech recognition using electroencephalography (EEG) signals on English vocabulary using different types of state of the art end-to-end automatic speech recognition (ASR) models, we further provide results obtained using EEG data recorded under different experimental conditions. We finally demonstrate decoding of speech spectrum from EEG signals using a long short term memory (LSTM) based regression model and Generative Adversarial Network (GAN) based model. Our results demonstrate the feasibility of using EEG signals for continuous noisy speech recognition under different experimental conditions and we provide preliminary results for synthesis of speech from EEG features.

Submitted 4 March, 2020; v1 submitted 14 August, 2019; originally announced August 2019.

arXiv:1906.08871 [pdf, other]

Advancing Speech Recognition With No Speech Or With Noisy Speech

Authors: Gautam Krishna, Co Tran, Mason Carnahan, Ahmed H Tewfik

Abstract: In this paper we demonstrate end-to-end continuous speech recognition (CSR) using electroencephalography (EEG) signals with no speech signal as input. An attention model based automatic speech recognition (ASR) and connectionist temporal classification (CTC) based ASR systems were implemented for performing recognition. We further demonstrate CSR for noisy speech by fusing with EEG features.

Submitted 14 March, 2020; v1 submitted 17 June, 2019; originally announced June 2019.

Comments: Extended version of our accepted IEEE EUSIPCO 2019 paper with additional results for CTC model based recognition. arXiv admin note: substantial text overlap with arXiv:1906.08045, arXiv:1906.08044

arXiv:1906.08045 [pdf, other]

Speech Recognition With No Speech Or With Noisy Speech Beyond English

Authors: Gautam Krishna, Co Tran, Yan Han, Mason Carnahan, Ahmed H Tewfik

Abstract: In this paper we demonstrate continuous noisy speech recognition using connectionist temporal classification (CTC) model on limited Chinese vocabulary using electroencephalography (EEG) features with no speech signal as input and we further demonstrate single CTC model based continuous noisy speech recognition on limited joint English and Chinese vocabulary using EEG features with no speech signal as input. We demonstrate our results using various EEG feature sets recently introduced in [1] as well as we propose a new deep learning architecture in this paper which can perform continuous speech recognition using raw EEG signals on limited joint English and Chinese vocabulary.

Submitted 26 February, 2020; v1 submitted 17 June, 2019; originally announced June 2019.

Comments: arXiv admin note: text overlap with arXiv:1906.08871

arXiv:1906.08044 [pdf, other]

Robust End-to-End Speaker Verification Using EEG

Authors: Yan Han, Gautam Krishna, Co Tran, Mason Carnahan, Ahmed H Tewfik

Abstract: In this paper we demonstrate that performance of a speaker verification system can be improved by concatenating electroencephalography (EEG) signal features with speech signal features or only using EEG signal features. We use state-of-the-art end-to-end deep learning model for performing speaker verification and we demonstrate our results for noisy speech. Our results indicate that EEG signals can improve the robustness of speaker verification systems, especially in noisier environments.

Submitted 9 June, 2020; v1 submitted 17 June, 2019; originally announced June 2019.

Comments: Accepted for EUSIPCO 2020

arXiv:1906.07849 [pdf, other]

Deep Learning-Based Quantization of L-Values for Gray-Coded Modulation

Authors: Marius Arvinte, Sriram Vishwanath, Ahmed H. Tewfik

Abstract: In this work, a deep learning-based quantization scheme for log-likelihood ratio (L-value) storage is introduced. We analyze the dependency between the average magnitude of different L-values from the same quadrature amplitude modulation (QAM) symbol and show they follow a consistent ordering. Based on this we design a deep autoencoder that jointly compresses and separately reconstructs each L-value, allowing the use of a weighted loss function that aims to more accurately reconstruct low magnitude inputs. Our method is shown to be competitive with state-of-the-art maximum mutual information quantization schemes, reducing the required memory footprint by a ratio of up to two with a loss of performance smaller than 0.1 dB with less than two effective bits per L-value or smaller than 0.04 dB with 2.25 effective bits. We experimentally show that our proposed method is a universal compression scheme in the sense that after training on an LDPC-coded Rayleigh fading scenario we can reuse the same network without further training on other channel models and codes while preserving the same performance benefits.

Submitted 9 May, 2021; v1 submitted 18 June, 2019; originally announced June 2019.

Comments: Submitted to IEEE Globecom 2019
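
The abstract above relies on the observation that the L-values of different bit positions in one Gray-coded QAM symbol have different average magnitudes. The following is a minimal sketch of that effect, not the authors' experiment: it assumes a Gray-labelled 4-PAM alphabet (one axis of a square 16-QAM), an AWGN channel rather than the fading channels studied in the paper, equiprobable symbols, and an illustrative noise level. The names `CONSTELLATION`, `l_values`, and `average_magnitudes` are hypothetical.

```python
import math
import random

# Assumed Gray labelling for 4-PAM: neighbouring levels differ in exactly one bit.
CONSTELLATION = {-3.0: (0, 0), -1.0: (0, 1), 1.0: (1, 1), 3.0: (1, 0)}


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a short list."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def l_values(y, sigma):
    """Exact L-values log P(b=0|y) / P(b=1|y) for both bits of one received sample.

    Assumes real-valued AWGN with standard deviation sigma and equiprobable symbols.
    """
    result = []
    for bit in range(2):
        metrics = {0: [], 1: []}
        for level, label in CONSTELLATION.items():
            # Log-likelihood of y given this level, up to a common constant.
            metrics[label[bit]].append(-((y - level) ** 2) / (2 * sigma ** 2))
        result.append(log_sum_exp(metrics[0]) - log_sum_exp(metrics[1]))
    return result


def average_magnitudes(sigma, trials, rng):
    """Monte Carlo estimate of the mean |L-value| for each bit position."""
    levels = list(CONSTELLATION)
    totals = [0.0, 0.0]
    for _ in range(trials):
        y = rng.choice(levels) + rng.gauss(0.0, sigma)
        for index, value in enumerate(l_values(y, sigma)):
            totals[index] += abs(value)
    return [total / trials for total in totals]


if __name__ == "__main__":
    rng = random.Random(0)
    sign_bit, amplitude_bit = average_magnitudes(sigma=0.5, trials=5000, rng=rng)
    print(f"mean |L| sign bit: {sign_bit:.2f}, amplitude bit: {amplitude_bit:.2f}")
```

In this toy setting the sign bit carries the larger average magnitude, which is the kind of per-position structure a weighted reconstruction loss could exploit; the ordering reported in the paper for its own channel models should be taken from the paper itself.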

arXiv:1903.04656 [pdf, other]

Deep Log-Likelihood Ratio Quantization

Authors: Marius Arvinte, Ahmed H. Tewfik, Sriram Vishwanath

Abstract: In this work, a deep learning-based method for log-likelihood ratio (LLR) lossy compression and quantization is proposed, with emphasis on a single-input single-output uncorrelated fading communication setting. A deep autoencoder network is trained to compress, quantize and reconstruct the bit log-likelihood ratios corresponding to a single transmitted symbol. Specifically, the encoder maps to a latent space with dimension equal to the number of sufficient statistics required to recover the inputs - equal to three in this case - while the decoder aims to reconstruct a noisy version of the latent representation with the purpose of modeling quantization effects in a differentiable way. Simulation results show that, when applied to a standard rate-1/2 low-density parity-check (LDPC) code, a finite precision compression factor of nearly three times is achieved when storing an entire codeword, with an incurred loss of performance lower than 0.1 dB compared to straightforward scalar quantization of the log-likelihood ratios.

Submitted 9 May, 2021; v1 submitted 11 March, 2019; originally announced March 2019.

Comments: Accepted for publication at EUSIPCO 2019. Camera-ready version

arXiv:1903.00739 [pdf, other]

Speech Recognition with no speech or with noisy speech

Authors: Gautam Krishna, Co Tran, Jianguo Yu, Ahmed H Tewfik

Abstract: The performance of automatic speech recognition (ASR) systems degrades in the presence of noisy speech. This paper demonstrates that using electroencephalography (EEG) can help automatic speech recognition systems overcome performance loss in the presence of noise. The paper also shows that distillation training of automatic speech recognition systems using EEG features will increase their performance. Finally, we demonstrate the ability to recognize words from EEG with no speech signal on a limited English vocabulary with high accuracy.

Submitted 2 March, 2019; originally announced March 2019.

Comments: Accepted for ICASSP 2019

arXiv:1703.00134 [pdf, other]

Collision Resolution and Interference Elimination in Multiaccess Communication Networks

Authors: Naeem Akl, Ahmed Tewfik

Abstract: We define a multiaccess communication scheme that effectively eliminates interference and resolves collisions in many-to-one and many-to-many communication scenarios. Each transmitter is uniquely identified by a steering vector. All signals issued from a specific transmitter will be steered into the same single-dimensional or double-dimensional subspace at all receivers hearing this transmission. This subspace is orthogonal to the noise subspace at a receiver and the signals within the subspace can be extracted using the root-MUSIC method. Given high SNR, local channel knowledge and strict synchronization, the algorithm asymptotically achieves full network capacity on condition that a channel remains constant within a single time slot. Without synchronization, the worst case asymptotic performance is still greater than the $50\%$ throughput achieved by collision resolution algorithms and interference management techniques like interference alignment.

Submitted 28 February, 2017; originally announced March 2017.

arXiv:1605.00153 [pdf, other]

doi 10.1109/TWC.2014.042914.130861

Primary Traffic Characterization and Secondary Transmissions

Authors: Yingxi Liu, Ahmed Tewfik

Abstract: Channel idle time distribution based secondary transmission strategies have been studied intensively in the literature. Under various performance metrics, the ultimate performance of secondary devices is eventually dictated by the presumed channel idle time distribution. Such distributions can take any arbitrary form in practice. In this work, we study idle time distributions in wireless local area networks (WLAN) using a large amount of channel idle time data collected in real-world WLAN networks. We demonstrate with experimental data that the channel idle time distribution can be closely modeled by a hyper-exponential distribution. Furthermore, one can treat the primary packet arrival process as a semi-Markov modulated Poisson process. Several secondary transmission strategies are discussed under this model. It is shown that using only one hyper-exponential distribution, the secondary user can achieve a desirable performance when the primary packet arrival process is stationary. However, experimental data suggests that in practice, this process is not stationary and the secondary user can experience a large performance loss with a stationary transmission strategy. We propose a novel transmission strategy that achieves suboptimal secondary user performance when the idle time distribution is not stationary. The performances of secondary transmission strategies are demonstrated using experimental data.

Submitted 30 April, 2016; originally announced May 2016.

Journal ref: IEEE Transactions on Wireless Communications, vol. 13, no. 6, pp. 3003-3016, June 2014
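
The hyper-exponential idle-time model named in the abstract above is a finite mixture of exponential distributions. The sketch below shows how such a model could be sampled and how a secondary user might estimate the chance that an idle period lasts a little longer. It is an illustration under assumptions, not the paper's transmission strategy: the mixture weights and rates are made-up values rather than fitted WLAN parameters, and the function names are hypothetical.

```python
import math
import random


def hyperexp_survival(t, weights, rates):
    """P(T > t) for a hyper-exponential idle time: sum_i w_i * exp(-rate_i * t)."""
    return sum(w * math.exp(-r * t) for w, r in zip(weights, rates))


def prob_idle_continues(elapsed, tau, weights, rates):
    """P(T > elapsed + tau | T > elapsed): the chance that a channel which has
    already been idle for `elapsed` stays idle for a further `tau`."""
    return (hyperexp_survival(elapsed + tau, weights, rates)
            / hyperexp_survival(elapsed, weights, rates))


def sample_idle_time(weights, rates, rng):
    """Draw one idle time: pick a mixture branch, then draw an exponential from it."""
    rate = rng.choices(rates, weights=weights, k=1)[0]
    return rng.expovariate(rate)


if __name__ == "__main__":
    # Hypothetical parameters: mostly short gaps, occasionally long ones.
    weights = [0.7, 0.3]
    rates = [10.0, 0.5]  # events per unit time

    rng = random.Random(1)
    samples = [sample_idle_time(weights, rates, rng) for _ in range(20000)]
    print("empirical mean idle time:", sum(samples) / len(samples))
    print("model mean idle time:", sum(w / r for w, r in zip(weights, rates)))
    for elapsed in (0.0, 0.5, 1.0):
        print(elapsed, prob_idle_continues(elapsed, 0.1, weights, rates))
```

Unlike a single exponential, the mixture is not memoryless: the longer the channel has been idle, the more likely the idle period belongs to the slow branch, so the conditional probability of remaining idle grows with the elapsed time.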

Showing 1–43 of 43 results for author: Tewfik, A