subscribe to arXiv mailings

Speech Robust Bench: A Robustness Benchmark For Speech Recognition

Authors: Muhammad A. Shah, David Solans Noguero, Mikko A. Heikkila, Nicolas Kourtellis

Abstract: As Automatic Speech Recognition (ASR) models become ever more pervasive, it is important to ensure that they make reliable predictions under corruptions present in the physical and digital world. We propose Speech Robust Bench (SRB), a comprehensive benchmark for evaluating the robustness of ASR models to diverse corruptions. SRB is composed of 69 input perturbations which are intended to simulate… ▽ More As Automatic Speech Recognition (ASR) models become ever more pervasive, it is important to ensure that they make reliable predictions under corruptions present in the physical and digital world. We propose Speech Robust Bench (SRB), a comprehensive benchmark for evaluating the robustness of ASR models to diverse corruptions. SRB is composed of 69 input perturbations which are intended to simulate various corruptions that ASR models may encounter in the physical and digital world. We use SRB to evaluate the robustness of several state-of-the-art ASR models and observe that model size and certain modeling choices such as discrete representations, and self-training appear to be conducive to robustness. We extend this analysis to measure the robustness of ASR models on data from various demographic subgroups, namely English and Spanish speakers, and males and females, and observed noticeable disparities in the model's robustness across subgroups. We believe that SRB will facilitate future research towards robust ASR models, by making it easier to conduct comprehensive and comparable robustness evaluations. △ Less

Submitted 8 March, 2024; originally announced March 2024.

arXiv:2403.06659 [pdf, other]

Zero-Shot ECG Classification with Multimodal Learning and Test-time Clinical Knowledge Enhancement

Authors: Che Liu, Zhongwei Wan, Cheng Ouyang, Anand Shah, Wenjia Bai, Rossella Arcucci

Abstract: Electrocardiograms (ECGs) are non-invasive diagnostic tools crucial for detecting cardiac arrhythmic diseases in clinical practice. While ECG Self-supervised Learning (eSSL) methods show promise in representation learning from unannotated ECG data, they often overlook the clinical knowledge that can be found in reports. This oversight and the requirement for annotated samples for downstream tasks… ▽ More Electrocardiograms (ECGs) are non-invasive diagnostic tools crucial for detecting cardiac arrhythmic diseases in clinical practice. While ECG Self-supervised Learning (eSSL) methods show promise in representation learning from unannotated ECG data, they often overlook the clinical knowledge that can be found in reports. This oversight and the requirement for annotated samples for downstream tasks limit eSSL's versatility. In this work, we address these issues with the Multimodal ECG Representation Learning (MERL}) framework. Through multimodal learning on ECG records and associated reports, MERL is capable of performing zero-shot ECG classification with text prompts, eliminating the need for training data in downstream tasks. At test time, we propose the Clinical Knowledge Enhanced Prompt Engineering (CKEPE) approach, which uses Large Language Models (LLMs) to exploit external expert-verified clinical knowledge databases, generating more descriptive prompts and reducing hallucinations in LLM-generated content to boost zero-shot classification. Based on MERL, we perform the first benchmark across six public ECG datasets, showing the superior performance of MERL compared against eSSL methods. Notably, MERL achieves an average AUC score of 75.2% in zero-shot classification (without training data), 3.2% higher than linear probed eSSL methods with 10\% annotated training data, averaged across all six datasets. Code and models are available at https://github.com/cheliu-computation/MERL △ Less

Submitted 2 July, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

Comments: Accepted by ICML2024

arXiv:2402.17050 [pdf, other]

Reinforcement Learning Based Oscillation Dampening: Scaling up Single-Agent RL algorithms to a 100 AV highway field operational test

Authors: Kathy Jang, Nathan Lichtlé, Eugene Vinitsky, Adit Shah, Matthew Bunting, Matthew Nice, Benedetto Piccoli, Benjamin Seibold, Daniel B. Work, Maria Laura Delle Monache, Jonathan Sprinkle, Jonathan W. Lee, Alexandre M. Bayen

Abstract: In this article, we explore the technical details of the reinforcement learning (RL) algorithms that were deployed in the largest field test of automated vehicles designed to smooth traffic flow in history as of 2023, uncovering the challenges and breakthroughs that come with developing RL controllers for automated vehicles. We delve into the fundamental concepts behind RL algorithms and their app… ▽ More In this article, we explore the technical details of the reinforcement learning (RL) algorithms that were deployed in the largest field test of automated vehicles designed to smooth traffic flow in history as of 2023, uncovering the challenges and breakthroughs that come with developing RL controllers for automated vehicles. We delve into the fundamental concepts behind RL algorithms and their application in the context of self-driving cars, discussing the developmental process from simulation to deployment in detail, from designing simulators to reward function shaping. We present the results in both simulation and deployment, discussing the flow-smoothing benefits of the RL controller. From understanding the basics of Markov decision processes to exploring advanced techniques such as deep RL, our article offers a comprehensive overview and deep dive of the theoretical foundations and practical implementations driving this rapidly evolving field. We also showcase real-world case studies and alternative research projects that highlight the impact of RL controllers in revolutionizing autonomous driving. From tackling complex urban environments to dealing with unpredictable traffic scenarios, these intelligent controllers are pushing the boundaries of what automated vehicles can achieve. Furthermore, we examine the safety considerations and hardware-focused technical details surrounding deployment of RL controllers into automated vehicles. As these algorithms learn and evolve through interactions with the environment, ensuring their behavior aligns with safety standards becomes crucial. We explore the methodologies and frameworks being developed to address these challenges, emphasizing the importance of building reliable control systems for automated vehicles. △ Less

Submitted 14 May, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

arXiv:2402.17043 [pdf, other]

Traffic Control via Connected and Automated Vehicles: An Open-Road Field Experiment with 100 CAVs

Authors: Jonathan W. Lee, Han Wang, Kathy Jang, Amaury Hayat, Matthew Bunting, Arwa Alanqary, William Barbour, Zhe Fu, Xiaoqian Gong, George Gunter, Sharon Hornstein, Abdul Rahman Kreidieh, Nathan Lichtlé, Matthew W. Nice, William A. Richardson, Adit Shah, Eugene Vinitsky, Fangyu Wu, Shengquan Xiang, Sulaiman Almatrudi, Fahd Althukair, Rahul Bhadani, Joy Carpio, Raphael Chekroun, Eric Cheng , et al. (39 additional authors not shown)

Abstract: The CIRCLES project aims to reduce instabilities in traffic flow, which are naturally occurring phenomena due to human driving behavior. These "phantom jams" or "stop-and-go waves,"are a significant source of wasted energy. Toward this goal, the CIRCLES project designed a control system referred to as the MegaController by the CIRCLES team, that could be deployed in real traffic. Our field experim… ▽ More The CIRCLES project aims to reduce instabilities in traffic flow, which are naturally occurring phenomena due to human driving behavior. These "phantom jams" or "stop-and-go waves,"are a significant source of wasted energy. Toward this goal, the CIRCLES project designed a control system referred to as the MegaController by the CIRCLES team, that could be deployed in real traffic. Our field experiment leveraged a heterogeneous fleet of 100 longitudinally-controlled vehicles as Lagrangian traffic actuators, each of which ran a controller with the architecture described in this paper. The MegaController is a hierarchical control architecture, which consists of two main layers. The upper layer is called Speed Planner, and is a centralized optimal control algorithm. It assigns speed targets to the vehicles, conveyed through the LTE cellular network. The lower layer is a control layer, running on each vehicle. It performs local actuation by overriding the stock adaptive cruise controller, using the stock on-board sensors. The Speed Planner ingests live data feeds provided by third parties, as well as data from our own control vehicles, and uses both to perform the speed assignment. The architecture of the speed planner allows for modular use of standard control techniques, such as optimal control, model predictive control, kernel methods and others, including Deep RL, model predictive control and explicit controllers. Depending on the vehicle architecture, all onboard sensing data can be accessed by the local controllers, or only some. Control inputs vary across different automakers, with inputs ranging from torque or acceleration requests for some cars, and electronic selection of ACC set points in others. The proposed architecture allows for the combination of all possible settings proposed above. Most configurations were tested throughout the ramp up to the MegaVandertest. △ Less

Submitted 26 February, 2024; originally announced February 2024.

arXiv:2401.12974 [pdf, other]

SegmentAnyBone: A Universal Model that Segments Any Bone at Any Location on MRI

Authors: Hanxue Gu, Roy Colglazier, Haoyu Dong, Jikai Zhang, Yaqian Chen, Zafer Yildiz, Yuwen Chen, Lin Li, Jichen Yang, Jay Willhite, Alex M. Meyer, Brian Guo, Yashvi Atul Shah, Emily Luo, Shipra Rajput, Sally Kuehn, Clark Bulleit, Kevin A. Wu, Jisoo Lee, Brandon Ramirez, Darui Lu, Jay M. Levin, Maciej A. Mazurowski

Abstract: Magnetic Resonance Imaging (MRI) is pivotal in radiology, offering non-invasive and high-quality insights into the human body. Precise segmentation of MRIs into different organs and tissues would be highly beneficial since it would allow for a higher level of understanding of the image content and enable important measurements, which are essential for accurate diagnosis and effective treatment pla… ▽ More Magnetic Resonance Imaging (MRI) is pivotal in radiology, offering non-invasive and high-quality insights into the human body. Precise segmentation of MRIs into different organs and tissues would be highly beneficial since it would allow for a higher level of understanding of the image content and enable important measurements, which are essential for accurate diagnosis and effective treatment planning. Specifically, segmenting bones in MRI would allow for more quantitative assessments of musculoskeletal conditions, while such assessments are largely absent in current radiological practice. The difficulty of bone MRI segmentation is illustrated by the fact that limited algorithms are publicly available for use, and those contained in the literature typically address a specific anatomic area. In our study, we propose a versatile, publicly available deep-learning model for bone segmentation in MRI across multiple standard MRI locations. The proposed model can operate in two modes: fully automated segmentation and prompt-based segmentation. Our contributions include (1) collecting and annotating a new MRI dataset across various MRI protocols, encompassing over 300 annotated volumes and 8485 annotated slices across diverse anatomic regions; (2) investigating several standard network architectures and strategies for automated segmentation; (3) introducing SegmentAnyBone, an innovative foundational model-based approach that extends Segment Anything Model (SAM); (4) comparative analysis of our algorithm and previous approaches; and (5) generalization analysis of our algorithm across different anatomical locations and MRI sequences, as well as an external dataset. We publicly release our model at https://github.com/mazurowski-lab/SegmentAnyBone. △ Less

Submitted 23 January, 2024; originally announced January 2024.

Comments: 15 pages, 15 figures

arXiv:2401.09666 [pdf, other]

Traffic Smoothing Controllers for Autonomous Vehicles Using Deep Reinforcement Learning and Real-World Trajectory Data

Authors: Nathan Lichtlé, Kathy Jang, Adit Shah, Eugene Vinitsky, Jonathan W. Lee, Alexandre M. Bayen

Abstract: Designing traffic-smoothing cruise controllers that can be deployed onto autonomous vehicles is a key step towards improving traffic flow, reducing congestion, and enhancing fuel efficiency in mixed autonomy traffic. We bypass the common issue of having to carefully fine-tune a large traffic microsimulator by leveraging real-world trajectory data from the I-24 highway in Tennessee, replayed in a o… ▽ More Designing traffic-smoothing cruise controllers that can be deployed onto autonomous vehicles is a key step towards improving traffic flow, reducing congestion, and enhancing fuel efficiency in mixed autonomy traffic. We bypass the common issue of having to carefully fine-tune a large traffic microsimulator by leveraging real-world trajectory data from the I-24 highway in Tennessee, replayed in a one-lane simulation. Using standard deep reinforcement learning methods, we train energy-reducing wave-smoothing policies. As an input to the agent, we observe the speed and distance of only the vehicle in front, which are local states readily available on most recent vehicles, as well as non-local observations about the downstream state of the traffic. We show that at a low 4% autonomous vehicle penetration rate, we achieve significant fuel savings of over 15% on trajectories exhibiting many stop-and-go waves. Finally, we analyze the smoothing effect of the controllers and demonstrate robustness to adding lane-changing into the simulation as well as the removal of downstream information. △ Less

Submitted 17 January, 2024; originally announced January 2024.

Comments: Accepted to be published as part of the 26th IEEE International Conference on Intelligent Transportation Systems (ITSC) 2023, Bilbao, Spain, September 24-28, 2023

arXiv:2312.09369 [pdf, other]

Audio-visual fine-tuning of audio-only ASR models

Authors: Avner May, Dmitriy Serdyuk, Ankit Parag Shah, Otavio Braga, Olivier Siohan

Abstract: Audio-visual automatic speech recognition (AV-ASR) models are very effective at reducing word error rates on noisy speech, but require large amounts of transcribed AV training data. Recently, audio-visual self-supervised learning (SSL) approaches have been developed to reduce this dependence on transcribed AV data, but these methods are quite complex and computationally expensive. In this work, we… ▽ More Audio-visual automatic speech recognition (AV-ASR) models are very effective at reducing word error rates on noisy speech, but require large amounts of transcribed AV training data. Recently, audio-visual self-supervised learning (SSL) approaches have been developed to reduce this dependence on transcribed AV data, but these methods are quite complex and computationally expensive. In this work, we propose replacing these expensive AV-SSL methods with a simple and fast \textit{audio-only} SSL method, and then performing AV supervised fine-tuning. We show that this approach is competitive with state-of-the-art (SOTA) AV-SSL methods on the LRS3-TED benchmark task (within 0.5% absolute WER), while being dramatically simpler and more efficient (12-30x faster to pre-train). Furthermore, we show we can extend this approach to convert a SOTA audio-only ASR model into an AV model. By doing so, we match SOTA AV-SSL results, even though no AV data was used during pre-training. △ Less

Submitted 14 December, 2023; originally announced December 2023.

arXiv:2312.03020 [pdf]

Enhanced Breast Cancer Tumor Classification using MobileNetV2: A Detailed Exploration on Image Intensity, Error Mitigation, and Streamlit-driven Real-time Deployment

Authors: Aaditya Surya, Aditya Shah, Jarnell Kabore, Subash Sasikumar

Abstract: This research introduces a sophisticated transfer learning model based on Google's MobileNetV2 for breast cancer tumor classification into normal, benign, and malignant categories, utilizing a dataset of 1576 ultrasound images (265 normal, 891 benign, 420 malignant). The model achieves an accuracy of 0.82, precision of 0.83, recall of 0.81, ROC-AUC of 0.94, PR-AUC of 0.88, and MCC of 0.74. It exam… ▽ More This research introduces a sophisticated transfer learning model based on Google's MobileNetV2 for breast cancer tumor classification into normal, benign, and malignant categories, utilizing a dataset of 1576 ultrasound images (265 normal, 891 benign, 420 malignant). The model achieves an accuracy of 0.82, precision of 0.83, recall of 0.81, ROC-AUC of 0.94, PR-AUC of 0.88, and MCC of 0.74. It examines image intensity distributions and misclassification errors, offering improvements for future applications. Addressing dataset imbalances, the study ensures a generalizable model. This work, using a dataset from Baheya Hospital, Cairo, Egypt, compiled by Walid Al-Dhabyani et al., emphasizes MobileNetV2's potential in medical imaging, aiming to improve diagnostic precision in oncology. Additionally, the paper explores Streamlit-based deployment for real-time tumor classification, demonstrating MobileNetV2's applicability in medical imaging and setting a benchmark for future research in oncology diagnostics. △ Less

Submitted 6 January, 2024; v1 submitted 5 December, 2023; originally announced December 2023.

arXiv:2312.01529 [pdf, other]

T3D: Towards 3D Medical Image Understanding through Vision-Language Pre-training

Authors: Che Liu, Cheng Ouyang, Yinda Chen, Cesar César Quilodrán-Casas, Lei Ma, Jie Fu, Yike Guo, Anand Shah, Wenjia Bai, Rossella Arcucci

Abstract: Expert annotation of 3D medical image for downstream analysis is resource-intensive, posing challenges in clinical applications. Visual self-supervised learning (vSSL), though effective for learning visual invariance, neglects the incorporation of domain knowledge from medicine. To incorporate medical knowledge into visual representation learning, vision-language pre-training (VLP) has shown promi… ▽ More Expert annotation of 3D medical image for downstream analysis is resource-intensive, posing challenges in clinical applications. Visual self-supervised learning (vSSL), though effective for learning visual invariance, neglects the incorporation of domain knowledge from medicine. To incorporate medical knowledge into visual representation learning, vision-language pre-training (VLP) has shown promising results in 2D image. However, existing VLP approaches become generally impractical when applied to high-resolution 3D medical images due to GPU hardware constraints and the potential loss of critical details caused by downsampling, which is the intuitive solution to hardware constraints. To address the above limitations, we introduce T3D, the first VLP framework designed for high-resolution 3D medical images. T3D incorporates two text-informed pretext tasks: (\lowerromannumeral{1}) text-informed contrastive learning; (\lowerromannumeral{2}) text-informed image restoration. These tasks focus on learning 3D visual representations from high-resolution 3D medical images and integrating clinical knowledge from radiology reports, without distorting information through forced alignment of downsampled volumes with detailed anatomical text. Trained on a newly curated large-scale dataset of 3D medical images and radiology reports, T3D significantly outperforms current vSSL methods in tasks like organ and tumor segmentation, as well as disease classification. This underlines T3D's potential in representation learning for 3D medical image analysis. All data and code will be available upon acceptance. △ Less

Submitted 5 December, 2023; v1 submitted 3 December, 2023; originally announced December 2023.

arXiv:2311.18539 [pdf, other]

Bridging Both Worlds in Semantics and Time: Domain Knowledge Based Analysis and Correlation of Industrial Process Attacks

Authors: Moses Ike, Kandy Phan, Anwesh Badapanda, Matthew Landen, Keaton Sadoski, Wanda Guo, Asfahan Shah, Saman Zonouz, Wenke Lee

Abstract: Modern industrial control systems (ICS) attacks infect supervisory control and data acquisition (SCADA) hosts to stealthily alter industrial processes, causing damage. To detect attacks with low false alarms, recent work detects attacks in both SCADA and process data. Unfortunately, this led to the same problem - disjointed (false) alerts, due to the semantic and time gap in SCADA and process beha… ▽ More Modern industrial control systems (ICS) attacks infect supervisory control and data acquisition (SCADA) hosts to stealthily alter industrial processes, causing damage. To detect attacks with low false alarms, recent work detects attacks in both SCADA and process data. Unfortunately, this led to the same problem - disjointed (false) alerts, due to the semantic and time gap in SCADA and process behavior, i.e., SCADA execution does not map to process dynamics nor evolve at similar time scales. We propose BRIDGE to analyze and correlate SCADA and industrial process attacks using domain knowledge to bridge their unique semantic and time evolution. This enables operators to tie malicious SCADA operations to their adverse process effects, which reduces false alarms and improves attack understanding. BRIDGE (i) identifies process constraints violations in SCADA by measuring actuation dependencies in SCADA process-control, and (ii) detects malicious SCADA effects in processes via a physics-informed neural network that embeds generic knowledge of inertial process dynamics. BRIDGE then dynamically aligns both analysis (i and ii) in a time-window that adjusts their time evolution based on process inertial delays. We applied BRIDGE to 11 diverse real-world industrial processes, and adaptive attacks inspired by past events. BRIDGE correlated 98.3% of attacks with 0.8% false positives (FP), compared to 78.3% detection accuracy and 13.7% FP of recent work. △ Less

Submitted 3 December, 2023; v1 submitted 30 November, 2023; originally announced November 2023.

arXiv:2310.07161 [pdf, ps, other]

Psychoacoustic Challenges Of Speech Enhancement On VoIP Platforms

Authors: Joseph Konan, Ojas Bhargave, Shikhar Agnihotri, Shuo Han, Yunyang Zeng, Ankit Shah, Bhiksha Raj

Abstract: Within the ambit of VoIP (Voice over Internet Protocol) telecommunications, the complexities introduced by acoustic transformations merit rigorous analysis. This research, rooted in the exploration of proprietary sender-side denoising effects, meticulously evaluates platforms such as Google Meets and Zoom. The study draws upon the Deep Noise Suppression (DNS) 2020 dataset, ensuring a structured ex… ▽ More Within the ambit of VoIP (Voice over Internet Protocol) telecommunications, the complexities introduced by acoustic transformations merit rigorous analysis. This research, rooted in the exploration of proprietary sender-side denoising effects, meticulously evaluates platforms such as Google Meets and Zoom. The study draws upon the Deep Noise Suppression (DNS) 2020 dataset, ensuring a structured examination tailored to various denoising settings and receiver interfaces. A methodological novelty is introduced via the Oaxaca decomposition, traditionally an econometric tool, repurposed herein to analyze acoustic-phonetic perturbations within VoIP systems. To further ground the implications of these transformations, psychoacoustic metrics, specifically PESQ and STOI, were harnessed to furnish a comprehensive understanding of speech alterations. Cumulatively, the insights garnered underscore the intricate landscape of VoIP-influenced acoustic dynamics. In addition to the primary findings, a multitude of metrics are reported, extending the research purview. Moreover, out-of-domain benchmarking for both time and time-frequency domain speech enhancement models is included, thereby enhancing the depth and applicability of this inquiry. Repository: github.com/deepology/VoIP-DNS-Challenge △ Less

Submitted 21 November, 2023; v1 submitted 10 October, 2023; originally announced October 2023.

arXiv:2310.05932 [pdf, other]

A Multi-Agent Systems Approach for Peer-to-Peer Energy Trading in Dairy Farming

Authors: Mian Ibad Ali Shah, Abdul Wahid, Enda Barrett, Karl Mason

Abstract: To achieve desired carbon emission reductions, integrating renewable generation and accelerating the adoption of peer-to-peer energy trading is crucial. This is especially important for energy-intensive farming, like dairy farming. However, integrating renewables and peer-to-peer trading presents challenges. To address this, we propose the Multi-Agent Peer-to-Peer Dairy Farm Energy Simulator (MAPD… ▽ More To achieve desired carbon emission reductions, integrating renewable generation and accelerating the adoption of peer-to-peer energy trading is crucial. This is especially important for energy-intensive farming, like dairy farming. However, integrating renewables and peer-to-peer trading presents challenges. To address this, we propose the Multi-Agent Peer-to-Peer Dairy Farm Energy Simulator (MAPDES), enabling dairy farms to participate in peer-to-peer markets. Our strategy reduces electricity costs and peak demand by approximately 30% and 24% respectively, while increasing energy sales by 37% compared to the baseline scenario without P2P trading. This demonstrates the effectiveness of our approach. △ Less

Submitted 21 August, 2023; originally announced October 2023.

Comments: Proc. of the Artificial Intelligence for Sustainability, ECAI 2023, Eunika et al. (eds.), Sep 30- Oct 1, 2023, https://sites.google.com/view/ai4s. 2023

arXiv:2309.14460 [pdf, other]

Online Active Learning For Sound Event Detection

Authors: Mark Lindsey, Ankit Shah, Francis Kubala, Richard M. Stern

Abstract: Data collection and annotation is a laborious, time-consuming prerequisite for supervised machine learning tasks. Online Active Learning (OAL) is a paradigm that addresses this issue by simultaneously minimizing the amount of annotation required to train a classifier and adapting to changes in the data over the duration of the data collection process. Prior work has indicated that fluctuating clas… ▽ More Data collection and annotation is a laborious, time-consuming prerequisite for supervised machine learning tasks. Online Active Learning (OAL) is a paradigm that addresses this issue by simultaneously minimizing the amount of annotation required to train a classifier and adapting to changes in the data over the duration of the data collection process. Prior work has indicated that fluctuating class distributions and data drift are still common problems for OAL. This work presents new loss functions that address these challenges when OAL is applied to Sound Event Detection (SED). Experimental results from the SONYC dataset and two Voice-Type Discrimination (VTD) corpora indicate that OAL can reduce the time and effort required to train SED classifiers by a factor of 5 for SONYC, and that the new methods presented here successfully resolve issues present in existing OAL methods. △ Less

Submitted 25 September, 2023; originally announced September 2023.

Comments: Submitted to ICASSP 2024. Publication will belong to IEEE

arXiv:2309.13227 [pdf, other]

Importance of negative sampling in weak label learning

Authors: Ankit Shah, Fuyu Tang, Zelin Ye, Rita Singh, Bhiksha Raj

Abstract: Weak-label learning is a challenging task that requires learning from data "bags" containing positive and negative instances, but only the bag labels are known. The pool of negative instances is usually larger than positive instances, thus making selecting the most informative negative instance critical for performance. Such a selection strategy for negative instances from each bag is an open prob… ▽ More Weak-label learning is a challenging task that requires learning from data "bags" containing positive and negative instances, but only the bag labels are known. The pool of negative instances is usually larger than positive instances, thus making selecting the most informative negative instance critical for performance. Such a selection strategy for negative instances from each bag is an open problem that has not been well studied for weak-label learning. In this paper, we study several sampling strategies that can measure the usefulness of negative instances for weak-label learning and select them accordingly. We test our method on CIFAR-10 and AudioSet datasets and show that it improves the weak-label classification performance and reduces the computational cost compared to random sampling methods. Our work reveals that negative instances are not all equally irrelevant, and selecting them wisely can benefit weak-label learning. △ Less

Submitted 22 September, 2023; originally announced September 2023.

arXiv:2309.04641 [pdf, other]

Exploring Domain-Specific Enhancements for a Neural Foley Synthesizer

Authors: Ashwin Pillay, Sage Betko, Ari Liloia, Hao Chen, Ankit Shah

Abstract: Foley sound synthesis refers to the creation of authentic, diegetic sound effects for media, such as film or radio. In this study, we construct a neural Foley synthesizer capable of generating mono-audio clips across seven predefined categories. Our approach introduces multiple enhancements to existing models in the text-to-audio domain, with the goal of enriching the diversity and acoustic charac… ▽ More Foley sound synthesis refers to the creation of authentic, diegetic sound effects for media, such as film or radio. In this study, we construct a neural Foley synthesizer capable of generating mono-audio clips across seven predefined categories. Our approach introduces multiple enhancements to existing models in the text-to-audio domain, with the goal of enriching the diversity and acoustic characteristics of the generated foleys. Notably, we utilize a pre-trained encoder that retains acoustical and musical attributes in intermediate embeddings, implement class-conditioning to enhance differentiability among foley classes in their intermediate representations, and devise an innovative transformer-based architecture for optimizing self-attention computations on very large inputs without compromising valuable information. Subsequent to implementation, we present intermediate outcomes that surpass the baseline, discuss practical challenges encountered in achieving optimal results, and outline potential pathways for further research. △ Less

Submitted 8 September, 2023; originally announced September 2023.

arXiv:2304.09756 [pdf, other]

Contactless Human Activity Recognition using Deep Learning with Flexible and Scalable Software Define Radio

Authors: Muhammad Zakir Khan, Jawad Ahmad, Wadii Boulila, Matthew Broadbent, Syed Aziz Shah, Anis Koubaa, Qammer H. Abbasi

Abstract: Ambient computing is gaining popularity as a major technological advancement for the future. The modern era has witnessed a surge in the advancement in healthcare systems, with viable radio frequency solutions proposed for remote and unobtrusive human activity recognition (HAR). Specifically, this study investigates the use of Wi-Fi channel state information (CSI) as a novel method of ambient sens… ▽ More Ambient computing is gaining popularity as a major technological advancement for the future. The modern era has witnessed a surge in the advancement in healthcare systems, with viable radio frequency solutions proposed for remote and unobtrusive human activity recognition (HAR). Specifically, this study investigates the use of Wi-Fi channel state information (CSI) as a novel method of ambient sensing that can be employed as a contactless means of recognizing human activity in indoor environments. These methods avoid additional costly hardware required for vision-based systems, which are privacy-intrusive, by (re)using Wi-Fi CSI for various safety and security applications. During an experiment utilizing universal software-defined radio (USRP) to collect CSI samples, it was observed that a subject engaged in six distinct activities, which included no activity, standing, sitting, and leaning forward, across different areas of the room. Additionally, more CSI samples were collected when the subject walked in two different directions. This study presents a Wi-Fi CSI-based HAR system that assesses and contrasts deep learning approaches, namely convolutional neural network (CNN), long short-term memory (LSTM), and hybrid (LSTM+CNN), employed for accurate activity recognition. The experimental results indicate that LSTM surpasses current models and achieves an average accuracy of 95.3% in multi-activity classification when compared to CNN and hybrid techniques. In the future, research needs to study the significance of resilience in diverse and dynamic environments to identify the activity of multiple users. △ Less

Submitted 18 April, 2023; originally announced April 2023.

arXiv:2303.09048 [pdf, other]

Improving Perceptual Quality, Intelligibility, and Acoustics on VoIP Platforms

Authors: Joseph Konan, Ojas Bhargave, Shikhar Agnihotri, Hojeong Lee, Ankit Shah, Shuo Han, Yunyang Zeng, Amanda Shu, Haohui Liu, Xuankai Chang, Hamza Khalid, Minseon Gwak, Kawon Lee, Minjeong Kim, Bhiksha Raj

Abstract: In this paper, we present a method for fine-tuning models trained on the Deep Noise Suppression (DNS) 2020 Challenge to improve their performance on Voice over Internet Protocol (VoIP) applications. Our approach involves adapting the DNS 2020 models to the specific acoustic characteristics of VoIP communications, which includes distortion and artifacts caused by compression, transmission, and plat… ▽ More In this paper, we present a method for fine-tuning models trained on the Deep Noise Suppression (DNS) 2020 Challenge to improve their performance on Voice over Internet Protocol (VoIP) applications. Our approach involves adapting the DNS 2020 models to the specific acoustic characteristics of VoIP communications, which includes distortion and artifacts caused by compression, transmission, and platform-specific processing. To this end, we propose a multi-task learning framework for VoIP-DNS that jointly optimizes noise suppression and VoIP-specific acoustics for speech enhancement. We evaluate our approach on a diverse VoIP scenarios and show that it outperforms both industry performance and state-of-the-art methods for speech enhancement on VoIP applications. Our results demonstrate the potential of models trained on DNS-2020 to be improved and tailored to different VoIP platforms using VoIP-DNS, whose findings have important applications in areas such as speech recognition, voice assistants, and telecommunication. △ Less

Submitted 15 March, 2023; originally announced March 2023.

Comments: Under review at European Association for Signal Processing. 5 pages

arXiv:2303.03591 [pdf, other]

Approach to Learning Generalized Audio Representation Through Batch Embedding Covariance Regularization and Constant-Q Transforms

Authors: Ankit Shah, Shuyi Chen, Kejun Zhou, Yue Chen, Bhiksha Raj

Abstract: General-purpose embedding is highly desirable for few-shot even zero-shot learning in many application scenarios, including audio tasks. In order to understand representations better, we conducted a thorough error analysis and visualization of HEAR 2021 submission results. Inspired by the analysis, this work experiments with different front-end audio preprocessing methods, including Constant-Q Tra… ▽ More General-purpose embedding is highly desirable for few-shot even zero-shot learning in many application scenarios, including audio tasks. In order to understand representations better, we conducted a thorough error analysis and visualization of HEAR 2021 submission results. Inspired by the analysis, this work experiments with different front-end audio preprocessing methods, including Constant-Q Transform (CQT) and Short-time Fourier transform (STFT), and proposes a Batch Embedding Covariance Regularization (BECR) term to uncover a more holistic simulation of the frequency information received by the human auditory system. We tested the models on the suite of HEAR 2021 tasks, which encompass a broad category of tasks. Preliminary results show (1) the proposed BECR can incur a more dispersed embedding on the test set, (2) BECR improves the PaSST model without extra computation complexity, and (3) STFT preprocessing outperforms CQT in all tasks we tested. Github:https://github.com/ankitshah009/general_audio_embedding_hear_2021 △ Less

Submitted 6 March, 2023; originally announced March 2023.

Comments: Technical report, 10 pages

arXiv:2302.10915 [pdf, other]

Conformers are All You Need for Visual Speech Recognition

Authors: Oscar Chang, Hank Liao, Dmitriy Serdyuk, Ankit Shah, Olivier Siohan

Abstract: Visual speech recognition models extract visual features in a hierarchical manner. At the lower level, there is a visual front-end with a limited temporal receptive field that processes the raw pixels depicting the lips or faces. At the higher level, there is an encoder that attends to the embeddings produced by the front-end over a large temporal receptive field. Previous work has focused on impr… ▽ More Visual speech recognition models extract visual features in a hierarchical manner. At the lower level, there is a visual front-end with a limited temporal receptive field that processes the raw pixels depicting the lips or faces. At the higher level, there is an encoder that attends to the embeddings produced by the front-end over a large temporal receptive field. Previous work has focused on improving the visual front-end of the model to extract more useful features for speech recognition. Surprisingly, our work shows that complex visual front-ends are not necessary. Instead of allocating resources to a sophisticated visual front-end, we find that a linear visual front-end paired with a larger Conformer encoder results in lower latency, more efficient memory usage, and improved WER performance. We achieve a new state-of-the-art of 12.8% WER for visual speech recognition on the TED LRS3 dataset, which rivals the performance of audio-only models from just four years ago. △ Less

Submitted 12 December, 2023; v1 submitted 16 February, 2023; originally announced February 2023.

arXiv:2302.09516 [pdf, other]

A Bibliography of Multiple Sclerosis Lesions Detection Methods using Brain MRIs

Authors: Atif Shah, Maged S. Al-Shaibani, Moataz Ahmad, Reem Bunyan

Abstract: Introduction: Multiple Sclerosis (MS) is a chronic disease that affects millions of people across the globe. MS can critically affect different organs of the central nervous system such as the eyes, the spinal cord, and the brain. Background: To help physicians in diagnosing MS lesions, computer-aided methods are widely used. In this regard, a considerable research has been carried out in the ar… ▽ More Introduction: Multiple Sclerosis (MS) is a chronic disease that affects millions of people across the globe. MS can critically affect different organs of the central nervous system such as the eyes, the spinal cord, and the brain. Background: To help physicians in diagnosing MS lesions, computer-aided methods are widely used. In this regard, a considerable research has been carried out in the area of automatic detection and segmentation of MS lesions in magnetic resonance images (MRIs). Methodology: In this study, we review the different approaches that have been used in computer-aided detection and segmentation of MS lesions. Our review resulted in categorizing MS lesion segmentation approaches into six broad categories: data-driven, statistical, supervised machine learning, unsupervised machine learning, fuzzy, and deep learning-based techniques. We critically analyze the different techniques under these approaches and highlight their strengths and weaknesses. Results: From the study, we observe that a considerable amount of work, around 25% of related literature, is focused on statistical-based MS lesion segmentation techniques, followed by 21.15% for data-driven based methods, 19.23% for deep learning and 15.38% for supervised methods. Implication: The study points out the challenges/gaps to be addressed in future research. The study shows the work which has been done in last one decade in detection and segmentation of MS lesions. The results show that, in recent years, deep learning methods are outperforming all the others methods. △ Less

Submitted 19 February, 2023; originally announced February 2023.

arXiv:2211.03333 [pdf]

Learning From Alarms: A Robust Learning Approach for Accurate Photoplethysmography-Based Atrial Fibrillation Detection using Eight Million Samples Labeled with Imprecise Arrhythmia Alarms

Authors: Cheng Ding, Zhicheng Guo, Cynthia Rudin, Ran Xiao, Amit Shah, Duc H. Do, Randall J Lee, Gari Clifford, Fadi B Nahab, Xiao Hu

Abstract: Atrial fibrillation (AF) is a common cardiac arrhythmia with serious health consequences if not detected and treated early. Detecting AF using wearable devices with photoplethysmography (PPG) sensors and deep neural networks has demonstrated some success using proprietary algorithms in commercial solutions. However, further advancement of this paradigm of continuous AF detection in ambulatory sett… ▽ More Atrial fibrillation (AF) is a common cardiac arrhythmia with serious health consequences if not detected and treated early. Detecting AF using wearable devices with photoplethysmography (PPG) sensors and deep neural networks has demonstrated some success using proprietary algorithms in commercial solutions. However, further advancement of this paradigm of continuous AF detection in ambulatory settings, towards a population-wide screening use case, still faces several challenges, one of which is the lack of large-scale labeled training data. To address this challenge, in this study, we propose to leverage AF alarms from bedside patient monitors to label concurrent PPG signals, resulting in the largest PPG-AF dataset so far (8.5M 30-second records from 24100 patients) and demonstrating a practical approach to build large labeled PPG datasets. Furthermore, we recognize that the AF labels thus obtained contain errors because of false AF alarms generated from imperfect built-in algorithms from bedside monitors. Dealing with label noise with unknown distribution characteristics in this case requires advanced algorithms. We, therefore, introduce and open source a novel loss design, the cluster membership consistency (CMC) loss, to mitigate label errors. By comparing CMC with state-of-the-art methods selected from a noisy label competition, we demonstrate its superiority in multiple aspects including handling label noise in PPG data, resilience to poor-quality signals, and computational efficiency. △ Less

Submitted 12 November, 2023; v1 submitted 7 November, 2022; originally announced November 2022.

arXiv:2210.14446 [pdf, other]

Smart Speech Segmentation using Acousto-Linguistic Features with look-ahead

Authors: Piyush Behre, Naveen Parihar, Sharman Tan, Amy Shah, Eva Sharma, Geoffrey Liu, Shuangyu Chang, Hosam Khalil, Chris Basoglu, Sayan Pathak

Abstract: Segmentation for continuous Automatic Speech Recognition (ASR) has traditionally used silence timeouts or voice activity detectors (VADs), which are both limited to acoustic features. This segmentation is often overly aggressive, given that people naturally pause to think as they speak. Consequently, segmentation happens mid-sentence, hindering both punctuation and downstream tasks like machine tr… ▽ More Segmentation for continuous Automatic Speech Recognition (ASR) has traditionally used silence timeouts or voice activity detectors (VADs), which are both limited to acoustic features. This segmentation is often overly aggressive, given that people naturally pause to think as they speak. Consequently, segmentation happens mid-sentence, hindering both punctuation and downstream tasks like machine translation for which high-quality segmentation is critical. Model-based segmentation methods that leverage acoustic features are powerful, but without an understanding of the language itself, these approaches are limited. We present a hybrid approach that leverages both acoustic and language information to improve segmentation. Furthermore, we show that including one word as a look-ahead boosts segmentation quality. On average, our models improve segmentation-F0.5 score by 9.8% over baseline. We show that this approach works for multiple languages. For the downstream task of machine translation, it improves the translation BLEU score by an average of 1.05 points. △ Less

Submitted 27 October, 2022; v1 submitted 25 October, 2022; originally announced October 2022.

arXiv:2208.01637 [pdf, other]

Comparative Analysis of State-of-the-Art Deep Learning Models for Detecting COVID-19 Lung Infection from Chest X-Ray Images

Authors: Zeba Ghaffar, Pir Masoom Shah, Hikmat Khan, Syed Farhan Alam Zaidi, Abdullah Gani, Izaz Ahmad Khan, Munam Ali Shah, Saif ul Islam

Abstract: The ongoing COVID-19 pandemic has already taken millions of lives and damaged economies across the globe. Most COVID-19 deaths and economic losses are reported from densely crowded cities. It is comprehensible that the effective control and prevention of epidemic/pandemic infectious diseases is vital. According to WHO, testing and diagnosis is the best strategy to control pandemics. Scientists wor… ▽ More The ongoing COVID-19 pandemic has already taken millions of lives and damaged economies across the globe. Most COVID-19 deaths and economic losses are reported from densely crowded cities. It is comprehensible that the effective control and prevention of epidemic/pandemic infectious diseases is vital. According to WHO, testing and diagnosis is the best strategy to control pandemics. Scientists worldwide are attempting to develop various innovative and cost-efficient methods to speed up the testing process. This paper comprehensively evaluates the applicability of the recent top ten state-of-the-art Deep Convolutional Neural Networks (CNNs) for automatically detecting COVID-19 infection using chest X-ray images. Moreover, it provides a comparative analysis of these models in terms of accuracy. This study identifies the effective methodologies to control and prevent infectious respiratory diseases. Our trained models have demonstrated outstanding results in classifying the COVID-19 infected chest x-rays. In particular, our trained models MobileNet, EfficentNet, and InceptionV3 achieved a classification average accuracy of 95\%, 95\%, and 94\% test set for COVID-19 class classification, respectively. Thus, it can be beneficial for clinical practitioners and radiologists to speed up the testing, detection, and follow-up of COVID-19 cases. △ Less

Submitted 30 June, 2022; originally announced August 2022.

arXiv:2207.04156 [pdf, other]

Automated Audio Captioning and Language-Based Audio Retrieval

Authors: Clive Gomes, Hyejin Park, Patrick Kollman, Yi Song, Iffanice Houndayi, Ankit Shah

Abstract: This project involved participation in the DCASE 2022 Competition (Task 6) which had two subtasks: (1) Automated Audio Captioning and (2) Language-Based Audio Retrieval. The first subtask involved the generation of a textual description for audio samples, while the goal of the second was to find audio samples within a fixed dataset that match a given description. For both subtasks, the Clotho data… ▽ More This project involved participation in the DCASE 2022 Competition (Task 6) which had two subtasks: (1) Automated Audio Captioning and (2) Language-Based Audio Retrieval. The first subtask involved the generation of a textual description for audio samples, while the goal of the second was to find audio samples within a fixed dataset that match a given description. For both subtasks, the Clotho dataset was used. The models were evaluated on BLEU1, BLEU2, BLEU3, ROUGEL, METEOR, CIDEr, SPICE, and SPIDEr scores for audio captioning and R1, R5, R10 and mARP10 scores for audio retrieval. We have conducted a handful of experiments that modify the baseline models for these tasks. Our final architecture for Automated Audio Captioning is close to the baseline performance, while our model for Language-Based Audio Retrieval has surpassed its counterpart. △ Less

Submitted 15 May, 2023; v1 submitted 8 July, 2022; originally announced July 2022.

Comments: DCASE 2022 Competition (Task 6)

arXiv:2206.04632 [pdf, other]

Temporal Logic Imitation: Learning Plan-Satisficing Motion Policies from Demonstrations

Authors: Yanwei Wang, Nadia Figueroa, Shen Li, Ankit Shah, Julie Shah

Abstract: Learning from demonstration (LfD) has succeeded in tasks featuring a long time horizon. However, when the problem complexity also includes human-in-the-loop perturbations, state-of-the-art approaches do not guarantee the successful reproduction of a task. In this work, we identify the roots of this challenge as the failure of a learned continuous policy to satisfy the discrete plan implicit in the… ▽ More Learning from demonstration (LfD) has succeeded in tasks featuring a long time horizon. However, when the problem complexity also includes human-in-the-loop perturbations, state-of-the-art approaches do not guarantee the successful reproduction of a task. In this work, we identify the roots of this challenge as the failure of a learned continuous policy to satisfy the discrete plan implicit in the demonstration. By utilizing modes (rather than subgoals) as the discrete abstraction and motion policies with both mode invariance and goal reachability properties, we prove our learned continuous policy can simulate any discrete plan specified by a linear temporal logic (LTL) formula. Consequently, an imitator is robust to both task- and motion-level perturbations and guaranteed to achieve task success. Project page: https://yanweiw.github.io/tli/ △ Less

Submitted 14 December, 2022; v1 submitted 9 June, 2022; originally announced June 2022.

Comments: CoRL 2022 Oral Talk

arXiv:2206.02358 [pdf, other]

doi 10.1109/TCSII.2022.3181132

Implementation of a Modified U-Net for Medical Image Segmentation on Edge Devices

Authors: Owais Ali, Hazrat Ali, Syed Ayaz Ali Shah, Aamir Shahzad

Abstract: Deep learning techniques, particularly convolutional neural networks, have shown great potential in computer vision and medical imaging applications. However, deep learning models are computationally demanding as they require enormous computational power and specialized processing hardware for model training. To make these models portable and compatible for prototyping, their implementation on low… ▽ More Deep learning techniques, particularly convolutional neural networks, have shown great potential in computer vision and medical imaging applications. However, deep learning models are computationally demanding as they require enormous computational power and specialized processing hardware for model training. To make these models portable and compatible for prototyping, their implementation on low-power devices is imperative. In this work, we present the implementation of Modified U-Net on Intel Movidius Neural Compute Stick 2 (NCS-2) for the segmentation of medical images. We selected U-Net because, in medical image segmentation, U-Net is a prominent model that provides improved performance for medical image segmentation even if the dataset size is small. The modified U-Net model is evaluated for performance in terms of dice score. Experiments are reported for segmentation task on three medical imaging datasets: BraTs dataset of brain MRI, heart MRI dataset, and Ziehl-Neelsen sputum smear microscopy image (ZNSDB) dataset. For the proposed model, we reduced the number of parameters from 30 million in the U-Net model to 0.49 million in the proposed architecture. Experimental results show that the modified U-Net provides comparable performance while requiring significantly lower resources and provides inference on the NCS-2. The maximum dice scores recorded are 0.96 for the BraTs dataset, 0.94 for the heart MRI dataset, and 0.74 for the ZNSDB dataset. △ Less

Submitted 6 June, 2022; originally announced June 2022.

Comments: Preprint of paper accepted in IEEE Transactions on Circuits and Systems II: Express Brief

arXiv:2204.09909 [pdf, other]

An Efficient End-to-End Deep Neural Network for Interstitial Lung Disease Recognition and Classification

Authors: Masum Shah Junayed, Afsana Ahsan Jeny, Md Baharul Islam, Ikhtiar Ahmed, A F M Shahen Shah

Abstract: The automated Interstitial Lung Diseases (ILDs) classification technique is essential for assisting clinicians during the diagnosis process. Detecting and classifying ILDs patterns is a challenging problem. This paper introduces an end-to-end deep convolution neural network (CNN) for classifying ILDs patterns. The proposed model comprises four convolutional layers with different kernel sizes and R… ▽ More The automated Interstitial Lung Diseases (ILDs) classification technique is essential for assisting clinicians during the diagnosis process. Detecting and classifying ILDs patterns is a challenging problem. This paper introduces an end-to-end deep convolution neural network (CNN) for classifying ILDs patterns. The proposed model comprises four convolutional layers with different kernel sizes and Rectified Linear Unit (ReLU) activation function, followed by batch normalization and max-pooling with a size equal to the final feature map size well as four dense layers. We used the ADAM optimizer to minimize categorical cross-entropy. A dataset consisting of 21328 image patches of 128 CT scans with five classes is taken to train and assess the proposed model. A comparison study showed that the presented model outperformed pre-trained CNNs and five-fold cross-validation on the same dataset. For ILDs pattern classification, the proposed approach achieved the accuracy scores of 99.09% and the average F score of 97.9%, outperforming three pre-trained CNNs. These outcomes show that the proposed model is relatively state-of-the-art in precision, recall, f score, and accuracy. △ Less

Submitted 21 April, 2022; originally announced April 2022.

Comments: Turkish Journal of Electrical Engineering and Computer Sciences

arXiv:2204.06322 [pdf, other]

Production federated keyword spotting via distillation, filtering, and joint federated-centralized training

Authors: Andrew Hard, Kurt Partridge, Neng Chen, Sean Augenstein, Aishanee Shah, Hyun Jin Park, Alex Park, Sara Ng, Jessica Nguyen, Ignacio Lopez Moreno, Rajiv Mathews, Françoise Beaufays

Abstract: We trained a keyword spotting model using federated learning on real user devices and observed significant improvements when the model was deployed for inference on phones. To compensate for data domains that are missing from on-device training caches, we employed joint federated-centralized training. And to learn in the absence of curated labels on-device, we formulated a confidence filtering str… ▽ More We trained a keyword spotting model using federated learning on real user devices and observed significant improvements when the model was deployed for inference on phones. To compensate for data domains that are missing from on-device training caches, we employed joint federated-centralized training. And to learn in the absence of curated labels on-device, we formulated a confidence filtering strategy based on user-feedback signals for federated distillation. These techniques created models that significantly improved quality metrics in offline evaluations and user-experience metrics in live A/B experiments. △ Less

Submitted 29 June, 2022; v1 submitted 11 April, 2022; originally announced April 2022.

Comments: Accepted to Interspeech 2022

arXiv:2204.04802 [pdf, other]

On the pragmatism of using binary classifiers over data intensive neural network classifiers for detection of COVID-19 from voice

Authors: Ankit Shah, Hira Dhamyal, Yang Gao, Daniel Arancibia, Mario Arancibia, Bhiksha Raj, Rita Singh

Abstract: Lately, there has been a global effort by multiple research groups to detect COVID-19 from voice. Different researchers use different kinds of information from the voice signal to achieve this. Various types of phonated sounds and the sound of cough and breath have all been used with varying degree of success in automated voice-based COVID-19 detection apps. In this paper, we show that detecting C… ▽ More Lately, there has been a global effort by multiple research groups to detect COVID-19 from voice. Different researchers use different kinds of information from the voice signal to achieve this. Various types of phonated sounds and the sound of cough and breath have all been used with varying degree of success in automated voice-based COVID-19 detection apps. In this paper, we show that detecting COVID-19 from voice does not require custom-made non-standard features or complicated neural network classifiers rather it can be successfully done with just standard features and simple binary classifiers. In fact, we show that the latter is not only more accurate and interpretable but also more computationally efficient in that they can be run locally on small devices. We demonstrate this on a human-curated dataset of over 1000 subjects, collected and calibrated in clinical settings. △ Less

Submitted 25 October, 2022; v1 submitted 10 April, 2022; originally announced April 2022.

Comments: Submitted to ICASSP 2022

arXiv:2203.10363 [pdf, other]

Towards Device Efficient Conditional Image Generation

Authors: Nisarg A. Shah, Gaurav Bharaj

Abstract: We present a novel algorithm to reduce tensor compute required by a conditional image generation autoencoder without sacrificing quality of photo-realistic image generation. Our method is device agnostic, and can optimize an autoencoder for a given CPU-only, GPU compute device(s) in about normal time it takes to train an autoencoder on a generic workstation. We achieve this via a two-stage novel s… ▽ More We present a novel algorithm to reduce tensor compute required by a conditional image generation autoencoder without sacrificing quality of photo-realistic image generation. Our method is device agnostic, and can optimize an autoencoder for a given CPU-only, GPU compute device(s) in about normal time it takes to train an autoencoder on a generic workstation. We achieve this via a two-stage novel strategy where, first, we condense the channel weights, such that, as few as possible channels are used. Then, we prune the nearly zeroed out weight activations, and fine-tune the autoencoder. To maintain image quality, fine-tuning is done via student-teacher training, where we reuse the condensed autoencoder as the teacher. We show performance gains for various conditional image generation tasks: segmentation mask to face images, face images to cartoonization, and finally CycleGAN-based model over multiple compute devices. We perform various ablation studies to justify the claims and design choices, and achieve real-time versions of various autoencoders on CPU-only devices while maintaining image quality, thus enabling at-scale deployment of such autoencoders. △ Less

Submitted 13 October, 2022; v1 submitted 19 March, 2022; originally announced March 2022.

Comments: British Machine Vision Conference 2022

arXiv:2203.02483 [pdf, other]

Ontological Learning from Weak Labels

Authors: Larry Tang, Po Hao Chou, Yi Yu Zheng, Ziqian Ge, Ankit Shah, Bhiksha Raj

Abstract: Ontologies encompass a formal representation of knowledge through the definition of concepts or properties of a domain, and the relationships between those concepts. In this work, we seek to investigate whether using this ontological information will improve learning from weakly labeled data, which are easier to collect since it requires only the presence or absence of an event to be known. We use… ▽ More Ontologies encompass a formal representation of knowledge through the definition of concepts or properties of a domain, and the relationships between those concepts. In this work, we seek to investigate whether using this ontological information will improve learning from weakly labeled data, which are easier to collect since it requires only the presence or absence of an event to be known. We use the AudioSet ontology and dataset, which contains audio clips weakly labeled with the ontology concepts and the ontology providing the "Is A" relations between the concepts. We first re-implemented the model proposed by soundevent_ontology with modification to fit the multi-label scenario and then expand on that idea by using a Graph Convolutional Network (GCN) to model the ontology information to learn the concepts. We find that the baseline Siamese does not perform better by incorporating ontology information in the weak and multi-label scenario, but that the GCN does capture the ontology knowledge better for weak, multi-labeled data. In our experiments, we also investigate how different modules can tolerate noises introduced from weak labels and better incorporate ontology information. Our best Siamese-GCN model achieves mAP=0.45 and AUC=0.87 for lower-level concepts and mAP=0.72 and AUC=0.86 for higher-level concepts, which is an improvement over the baseline Siamese but about the same as our models that do not use ontology information. △ Less

Submitted 4 March, 2022; originally announced March 2022.

arXiv:2203.00845 [pdf, other]

Can No-reference features help in Full-reference image quality estimation?

Authors: Saikat Dutta, Sourya Dipta Das, Nisarg A. Shah

Abstract: Development of perceptual image quality assessment (IQA) metrics has been of significant interest to computer vision community. The aim of these metrics is to model quality of an image as perceived by humans. Recent works in Full-reference IQA research perform pixelwise comparison between deep features corresponding to query and reference images for quality prediction. However, pixelwise feature c… ▽ More Development of perceptual image quality assessment (IQA) metrics has been of significant interest to computer vision community. The aim of these metrics is to model quality of an image as perceived by humans. Recent works in Full-reference IQA research perform pixelwise comparison between deep features corresponding to query and reference images for quality prediction. However, pixelwise feature comparison may not be meaningful if distortion present in query image is severe. In this context, we explore utilization of no-reference features in Full-reference IQA task. Our model consists of both full-reference and no-reference branches. Full-reference branches use both distorted and reference images, whereas No-reference branch only uses distorted image. Our experiments show that use of no-reference features boosts performance of image quality assessment. Our model achieves higher SRCC and KRCC scores than a number of state-of-the-art algorithms on KADID-10K and PIPAL datasets. △ Less

Submitted 1 March, 2022; originally announced March 2022.

Comments: Code to be updated on: https://github.com/saikatdutta/nr-in-friqa

arXiv:2202.07983 [pdf, other]

doi 10.1109/TMI.2022.3172773

ADAM Challenge: Detecting Age-related Macular Degeneration from Fundus Images

Authors: Huihui Fang, Fei Li, Huazhu Fu, Xu Sun, Xingxing Cao, Fengbin Lin, Jaemin Son, Sunho Kim, Gwenole Quellec, Sarah Matta, Sharath M Shankaranarayana, Yi-Ting Chen, Chuen-heng Wang, Nisarg A. Shah, Chia-Yen Lee, Chih-Chung Hsu, Hai Xie, Baiying Lei, Ujjwal Baid, Shubham Innani, Kang Dang, Wenxiu Shi, Ravi Kamble, Nitin Singhal, Ching-Wei Wang , et al. (6 additional authors not shown)

Abstract: Age-related macular degeneration (AMD) is the leading cause of visual impairment among elderly in the world. Early detection of AMD is of great importance, as the vision loss caused by this disease is irreversible and permanent. Color fundus photography is the most cost-effective imaging modality to screen for retinal disorders. Cutting edge deep learning based algorithms have been recently develo… ▽ More Age-related macular degeneration (AMD) is the leading cause of visual impairment among elderly in the world. Early detection of AMD is of great importance, as the vision loss caused by this disease is irreversible and permanent. Color fundus photography is the most cost-effective imaging modality to screen for retinal disorders. Cutting edge deep learning based algorithms have been recently developed for automatically detecting AMD from fundus images. However, there are still lack of a comprehensive annotated dataset and standard evaluation benchmarks. To deal with this issue, we set up the Automatic Detection challenge on Age-related Macular degeneration (ADAM), which was held as a satellite event of the ISBI 2020 conference. The ADAM challenge consisted of four tasks which cover the main aspects of detecting and characterizing AMD from fundus images, including detection of AMD, detection and segmentation of optic disc, localization of fovea, and detection and segmentation of lesions. As part of the challenge, we have released a comprehensive dataset of 1200 fundus images with AMD diagnostic labels, pixel-wise segmentation masks for both optic disc and AMD-related lesions (drusen, exudates, hemorrhages and scars, among others), as well as the coordinates corresponding to the location of the macular fovea. A uniform evaluation framework has been built to make a fair comparison of different models using this dataset. During the challenge, 610 results were submitted for online evaluation, with 11 teams finally participating in the onsite challenge. This paper introduces the challenge, the dataset and the evaluation methods, as well as summarizes the participating methods and analyzes their results for each task. In particular, we observed that the ensembling strategy and the incorporation of clinical domain knowledge were the key to improve the performance of the deep learning models. △ Less

Submitted 6 May, 2022; v1 submitted 16 February, 2022; originally announced February 2022.

Comments: 31 pages, 17 figures

arXiv:2202.05662 [pdf, other]

A Novel Chaos-based Light-weight Image Encryption Scheme for Multi-modal Hearing Aids

Authors: Awais Aziz Shah, Ahsan Adeel, Jawad Ahmad, Ahmed Al-Dubai, Mandar Gogate, Abhijeet Bishnu, Muhammad Diyan, Tassadaq Hussain, Kia Dashtipour, Tharm Ratnarajah, Amir Hussain

Abstract: Multimodal hearing aids (HAs) aim to deliver more intelligible audio in noisy environments by contextually sensing and processing data in the form of not only audio but also visual information (e.g. lip reading). Machine learning techniques can play a pivotal role for the contextually processing of multimodal data. However, since the computational power of HA devices is low, therefore this data mu… ▽ More Multimodal hearing aids (HAs) aim to deliver more intelligible audio in noisy environments by contextually sensing and processing data in the form of not only audio but also visual information (e.g. lip reading). Machine learning techniques can play a pivotal role for the contextually processing of multimodal data. However, since the computational power of HA devices is low, therefore this data must be processed either on the edge or cloud which, in turn, poses privacy concerns for sensitive user data. Existing literature proposes several techniques for data encryption but their computational complexity is a major bottleneck to meet strict latency requirements for development of future multi-modal hearing aids. To overcome this problem, this paper proposes a novel real-time audio/visual data encryption scheme based on chaos-based encryption using the Tangent-Delay Ellipse Reflecting Cavity-Map System (TD-ERCS) map and Non-linear Chaotic (NCA) Algorithm. The results achieved against different security parameters, including Correlation Coefficient, Unified Averaged Changed Intensity (UACI), Key Sensitivity Analysis, Number of Changing Pixel Rate (NPCR), Mean-Square Error (MSE), Peak Signal to Noise Ratio (PSNR), Entropy test, and Chi-test, indicate that the newly proposed scheme is more lightweight due to its lower execution time as compared to existing schemes and more secure due to increased key-space against modern brute-force attacks. △ Less

Submitted 11 February, 2022; originally announced February 2022.

arXiv:2201.02449 [pdf, ps, other]

Online 3-Axis Magnetometer Hard-Iron and Soft-Iron Bias and Angular Velocity Sensor Bias Estimation Using Angular Velocity Sensors for Improved Dynamic Heading Accuracy

Authors: Andrew R. Spielvogel, Abhimanyu S. Shah, Louis L. Whitcomb

Abstract: This article addresses the problem of dynamic on-line estimation and compensation of hard-iron and soft-iron biases of 3-axis magnetometers under dynamic motion in field robotics, utilizing only biased measurements from a 3-axis magnetometer and a 3-axis angular rate sensor. The proposed magnetometer and angular velocity bias estimator (MAVBE) utilizes a 15-state process model encoding the nonline… ▽ More This article addresses the problem of dynamic on-line estimation and compensation of hard-iron and soft-iron biases of 3-axis magnetometers under dynamic motion in field robotics, utilizing only biased measurements from a 3-axis magnetometer and a 3-axis angular rate sensor. The proposed magnetometer and angular velocity bias estimator (MAVBE) utilizes a 15-state process model encoding the nonlinear process dynamics for the magnetometer signal subject to angular velocity excursions, while simultaneously estimating 9 magnetometer bias parameters and 3 angular rate sensor bias parameters, within an extended Kalman filter framework. Bias parameter local observability is numerically evaluated. The bias-compensated signals, together with 3-axis accelerometer signals, are utilized to estimate bias compensated magnetic geodetic heading. Performance of the proposed MAVBE method is evaluated in comparison to the widely cited magnetometer-only TWOSTEP method in numerical simulations, laboratory experiments, and full-scale field trials of an instrumented autonomous underwater vehicle in the Chesapeake Bay, MD, USA. For the proposed MAVBE, (i) instrument attitude is not required to estimate biases, and the results show that (ii) the biases are locally observable, (iii) the bias estimates converge rapidly to true bias parameters, (iv) only modest instrument excitation is required for bias estimate convergence, and (v) compensation for magnetometer hard-iron and soft-iron biases dramatically improves dynamic heading estimation accuracy. △ Less

Submitted 7 January, 2022; originally announced January 2022.

Comments: Preprint of an article accepted for publication in Field Robotics, https://FieldRobotics.net, Special Issue in Unmanned Marine Systems. Submitted January 16, 2021; Revised May 28, 2021; Accepted August 2, 2021

arXiv:2112.10074 [pdf, other]

doi 10.59275/j.melba.2022-354b

QU-BraTS: MICCAI BraTS 2020 Challenge on Quantifying Uncertainty in Brain Tumor Segmentation - Analysis of Ranking Scores and Benchmarking Results

Authors: Raghav Mehta, Angelos Filos, Ujjwal Baid, Chiharu Sako, Richard McKinley, Michael Rebsamen, Katrin Datwyler, Raphael Meier, Piotr Radojewski, Gowtham Krishnan Murugesan, Sahil Nalawade, Chandan Ganesh, Ben Wagner, Fang F. Yu, Baowei Fei, Ananth J. Madhuranthakam, Joseph A. Maldjian, Laura Daza, Catalina Gomez, Pablo Arbelaez, Chengliang Dai, Shuo Wang, Hadrien Reynaud, Yuan-han Mo, Elsa Angelini , et al. (67 additional authors not shown)

Abstract: Deep learning (DL) models have provided state-of-the-art performance in various medical imaging benchmarking challenges, including the Brain Tumor Segmentation (BraTS) challenges. However, the task of focal pathology multi-compartment segmentation (e.g., tumor and lesion sub-regions) is particularly challenging, and potential errors hinder translating DL models into clinical workflows. Quantifying… ▽ More Deep learning (DL) models have provided state-of-the-art performance in various medical imaging benchmarking challenges, including the Brain Tumor Segmentation (BraTS) challenges. However, the task of focal pathology multi-compartment segmentation (e.g., tumor and lesion sub-regions) is particularly challenging, and potential errors hinder translating DL models into clinical workflows. Quantifying the reliability of DL model predictions in the form of uncertainties could enable clinical review of the most uncertain regions, thereby building trust and paving the way toward clinical translation. Several uncertainty estimation methods have recently been introduced for DL medical image segmentation tasks. Developing scores to evaluate and compare the performance of uncertainty measures will assist the end-user in making more informed decisions. In this study, we explore and evaluate a score developed during the BraTS 2019 and BraTS 2020 task on uncertainty quantification (QU-BraTS) and designed to assess and rank uncertainty estimates for brain tumor multi-compartment segmentation. This score (1) rewards uncertainty estimates that produce high confidence in correct assertions and those that assign low confidence levels at incorrect assertions, and (2) penalizes uncertainty measures that lead to a higher percentage of under-confident correct assertions. We further benchmark the segmentation uncertainties generated by 14 independent participating teams of QU-BraTS 2020, all of which also participated in the main BraTS segmentation task. Overall, our findings confirm the importance and complementary value that uncertainty estimates provide to segmentation algorithms, highlighting the need for uncertainty quantification in medical image analyses. Finally, in favor of transparency and reproducibility, our evaluation code is made publicly available at: https://github.com/RagMeh11/QU-BraTS. △ Less

Submitted 23 August, 2022; v1 submitted 19 December, 2021; originally announced December 2021.

Comments: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA): https://www.melba-journal.org/papers/2022:026.html

Journal ref: Machine.Learning.for.Biomedical.Imaging. 1 (2022)

arXiv:2111.08606 [pdf]

Advancement of Deep Learning in Pneumonia and Covid-19 Classification and Localization: A Qualitative and Quantitative Analysis

Authors: Aakash Shah, Manan Shah

Abstract: Around 450 million people are affected by pneumonia every year which results in 2.5 million deaths. Covid-19 has also affected 181 million people which has lead to 3.92 million casualties. The chances of death in both of these diseases can be significantly reduced if they are diagnosed early. However, the current methods of diagnosing pneumonia (complaints + chest X-ray) and covid-19 (RT-PCR) requ… ▽ More Around 450 million people are affected by pneumonia every year which results in 2.5 million deaths. Covid-19 has also affected 181 million people which has lead to 3.92 million casualties. The chances of death in both of these diseases can be significantly reduced if they are diagnosed early. However, the current methods of diagnosing pneumonia (complaints + chest X-ray) and covid-19 (RT-PCR) require the presence of expert radiologists and time, respectively. With the help of Deep Learning models, pneumonia and covid-19 can be detected instantly from Chest X-rays or CT scans. This way, the process of diagnosing Pneumonia/Covid-19 can be made more efficient and widespread. In this paper, we aim to elicit, explain, and evaluate, qualitatively and quantitatively, major advancements in deep learning methods aimed at detecting or localizing community-acquired pneumonia (CAP), viral pneumonia, and covid-19 from images of chest X-rays and CT scans. Being a systematic review, the focus of this paper lies in explaining deep learning model architectures which have either been modified or created from scratch for the task at hand wiwth focus on generalizability. For each model, this paper answers the question of why the model is designed the way it is, the challenges that a particular model overcomes, and the tradeoffs that come with modifying a model to the required specifications. A quantitative analysis of all models described in the paper is also provided to quantify the effectiveness of different models with a similar goal. Some tradeoffs cannot be quantified, and hence they are mentioned explicitly in the qualitative analysis, which is done throughout the paper. By compiling and analyzing a large quantum of research details in one place with all the datasets, model architectures, and results, we aim to provide a one-stop solution to beginners and current researchers interested in this field. △ Less

Submitted 16 November, 2021; originally announced November 2021.

Comments: 20 pages, 5 figures, 5 tables

Report number: CDTM-D-21-00047R2

arXiv:2110.09584 [pdf, other]

Set-based State Estimation with Probabilistic Consistency Guarantee under Epistemic Uncertainty

Authors: Shen Li, Theodoros Stouraitis, Michael Gienger, Sethu Vijayakumar, Julie A. Shah

Abstract: Consistent state estimation is challenging, especially under the epistemic uncertainties arising from learned (nonlinear) dynamic and observation models. In this work, we propose a set-based estimation algorithm, named Gaussian Process-Zonotopic Kalman Filter (GP-ZKF), that produces zonotopic state estimates while respecting both the epistemic uncertainties in the learned models and aleatoric unce… ▽ More Consistent state estimation is challenging, especially under the epistemic uncertainties arising from learned (nonlinear) dynamic and observation models. In this work, we propose a set-based estimation algorithm, named Gaussian Process-Zonotopic Kalman Filter (GP-ZKF), that produces zonotopic state estimates while respecting both the epistemic uncertainties in the learned models and aleatoric uncertainties. Our method guarantees probabilistic consistency, in the sense that the true states are bounded by sets (zonotopes) across all time steps, with high probability. We formally relate GP-ZKF with the corresponding stochastic approach, GP-EKF, in the case of learned (nonlinear) models. In particular, when linearization errors and aleatoric uncertainties are omitted and epistemic uncertainties are simplified, GP-ZKF reduces to GP-EKF. We empirically demonstrate our method's efficacy in both a simulated pendulum domain and a real-world robot-assisted dressing domain, where GP-ZKF produced more consistent and less conservative set-based estimates than all baseline stochastic methods. △ Less

Submitted 25 February, 2022; v1 submitted 18 October, 2021; originally announced October 2021.

Comments: Published at IEEE Robotics and Automation Letters, 2022. Video: https://www.youtube.com/watch?v=CvIPJlALaFU Copyright: 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any media, including reprinting/republishing for any purposes, creating new works, for resale or redistribution, or reuse of any copyrighted component of this work

arXiv:2110.04678 [pdf, other]

An Overview of Techniques for Biomarker Discovery in Voice Signal

Authors: Rita Singh, Ankit Shah, Hira Dhamyal

Abstract: This paper reflects on the effect of several categories of medical conditions on human voice, focusing on those that may be hypothesized to have effects on voice, but for which the changes themselves may be subtle enough to have eluded observation in standard analytical examinations of the voice signal. It presents three categories of techniques that can potentially uncover such elusive biomarkers… ▽ More This paper reflects on the effect of several categories of medical conditions on human voice, focusing on those that may be hypothesized to have effects on voice, but for which the changes themselves may be subtle enough to have eluded observation in standard analytical examinations of the voice signal. It presents three categories of techniques that can potentially uncover such elusive biomarkers and allow them to be measured and used for predictive and diagnostic purposes. These approaches include proxy techniques, model-based analytical techniques and data-driven AI techniques. △ Less

Submitted 9 October, 2021; originally announced October 2021.

Comments: Last two authors contributed equally to the paper

arXiv:2106.12864 [pdf, other]

A Systematic Collection of Medical Image Datasets for Deep Learning

Authors: Johann Li, Guangming Zhu, Cong Hua, Mingtao Feng, BasheerBennamoun, Ping Li, Xiaoyuan Lu, Juan Song, Peiyi Shen, Xu Xu, Lin Mei, Liang Zhang, Syed Afaq Ali Shah, Mohammed Bennamoun

Abstract: The astounding success made by artificial intelligence (AI) in healthcare and other fields proves that AI can achieve human-like performance. However, success always comes with challenges. Deep learning algorithms are data-dependent and require large datasets for training. The lack of data in the medical imaging field creates a bottleneck for the application of deep learning to medical image analy… ▽ More The astounding success made by artificial intelligence (AI) in healthcare and other fields proves that AI can achieve human-like performance. However, success always comes with challenges. Deep learning algorithms are data-dependent and require large datasets for training. The lack of data in the medical imaging field creates a bottleneck for the application of deep learning to medical image analysis. Medical image acquisition, annotation, and analysis are costly, and their usage is constrained by ethical restrictions. They also require many resources, such as human expertise and funding. That makes it difficult for non-medical researchers to have access to useful and large medical data. Thus, as comprehensive as possible, this paper provides a collection of medical image datasets with their associated challenges for deep learning research. We have collected information of around three hundred datasets and challenges mainly reported between 2013 and 2020 and categorized them into four categories: head & neck, chest & abdomen, pathology & blood, and ``others''. Our paper has three purposes: 1) to provide a most up to date and complete list that can be used as a universal reference to easily find the datasets for clinical image analysis, 2) to guide researchers on the methodology to test and evaluate their methods' performance and robustness on relevant datasets, 3) to provide a ``route'' to relevant algorithms for the relevant medical topics, and challenge leaderboards. △ Less

Submitted 24 June, 2021; originally announced June 2021.

Comments: This paper has been submitted to one journal

arXiv:2105.08819 [pdf, other]

Fast and Accurate Quantized Camera Scene Detection on Smartphones, Mobile AI 2021 Challenge: Report

Authors: Andrey Ignatov, Grigory Malivenko, Radu Timofte, Sheng Chen, Xin Xia, Zhaoyan Liu, Yuwei Zhang, Feng Zhu, Jiashi Li, Xuefeng Xiao, Yuan Tian, Xinglong Wu, Christos Kyrkou, Yixin Chen, Zexin Zhang, Yunbo Peng, Yue Lin, Saikat Dutta, Sourya Dipta Das, Nisarg A. Shah, Himanshu Kumar, Chao Ge, Pei-Lin Wu, Jin-Hua Du, Andrew Batutin , et al. (6 additional authors not shown)

Abstract: Camera scene detection is among the most popular computer vision problem on smartphones. While many custom solutions were developed for this task by phone vendors, none of the designed models were available publicly up until now. To address this problem, we introduce the first Mobile AI challenge, where the target is to develop quantized deep learning-based camera scene classification solutions th… ▽ More Camera scene detection is among the most popular computer vision problem on smartphones. While many custom solutions were developed for this task by phone vendors, none of the designed models were available publicly up until now. To address this problem, we introduce the first Mobile AI challenge, where the target is to develop quantized deep learning-based camera scene classification solutions that can demonstrate a real-time performance on smartphones and IoT platforms. For this, the participants were provided with a large-scale CamSDD dataset consisting of more than 11K images belonging to the 30 most important scene categories. The runtime of all models was evaluated on the popular Apple Bionic A11 platform that can be found in many iOS devices. The proposed solutions are fully compatible with all major mobile AI accelerators and can demonstrate more than 100-200 FPS on the majority of recent smartphone platforms while achieving a top-3 accuracy of more than 98%. A detailed description of all models developed in the challenge is provided in this paper. △ Less

Submitted 17 May, 2021; originally announced May 2021.

Comments: Mobile AI 2021 Workshop and Challenges: https://ai-benchmark.com/workshops/mai/2021/. arXiv admin note: substantial text overlap with arXiv:2105.08630; text overlap with arXiv:2105.07825, arXiv:2105.07809, arXiv:2105.08629

arXiv:2104.05778 [pdf, other]

Efficient Space-time Video Super Resolution using Low-Resolution Flow and Mask Upsampling

Authors: Saikat Dutta, Nisarg A. Shah, Anurag Mittal

Abstract: This paper explores an efficient solution for Space-time Super-Resolution, aiming to generate High-resolution Slow-motion videos from Low Resolution and Low Frame rate videos. A simplistic solution is the sequential running of Video Super Resolution and Video Frame interpolation models. However, this type of solutions are memory inefficient, have high inference time, and could not make the proper… ▽ More This paper explores an efficient solution for Space-time Super-Resolution, aiming to generate High-resolution Slow-motion videos from Low Resolution and Low Frame rate videos. A simplistic solution is the sequential running of Video Super Resolution and Video Frame interpolation models. However, this type of solutions are memory inefficient, have high inference time, and could not make the proper use of space-time relation property. To this extent, we first interpolate in LR space using quadratic modeling. Input LR frames are super-resolved using a state-of-the-art Video Super-Resolution method. Flowmaps and blending mask which are used to synthesize LR interpolated frame is reused in HR space using bilinear upsampling. This leads to a coarse estimate of HR intermediate frame which often contains artifacts along motion boundaries. We use a refinement network to improve the quality of HR intermediate frame via residual learning. Our model is lightweight and performs better than current state-of-the-art models in REDS STSR Validation set. △ Less

Submitted 8 June, 2021; v1 submitted 12 April, 2021; originally announced April 2021.

Comments: Accepted at NTIRE Workshop, CVPR 2021. Code and models: https://github.com/saikatdutta/FMU_STSR

arXiv:2104.01511 [pdf, other]

Late fusion of machine learning models using passively captured interpersonal social interactions and motion from smartphones predicts decompensation in heart failure

Authors: Ayse S. Cakmak, Samuel Densen, Gabriel Najarro, Pratik Rout, Christopher J. Rozell, Omer T. Inan, Amit J. Shah, Gari D. Clifford

Abstract: Objective: Worldwide, heart failure (HF) is a major cause of morbidity and mortality and one of the leading causes of hospitalization. Early detection of HF symptoms and pro-active management may reduce adverse events. Approach: Twenty-eight participants were monitored using a smartphone app after discharge from hospitals, and each clinical event during the enrollment (N=110 clinical events) was r… ▽ More Objective: Worldwide, heart failure (HF) is a major cause of morbidity and mortality and one of the leading causes of hospitalization. Early detection of HF symptoms and pro-active management may reduce adverse events. Approach: Twenty-eight participants were monitored using a smartphone app after discharge from hospitals, and each clinical event during the enrollment (N=110 clinical events) was recorded. Motion, social, location, and clinical survey data collected via the smartphone-based monitoring system were used to develop and validate an algorithm for predicting or classifying HF decompensation events (hospitalizations or clinic visit) versus clinic monitoring visits in which they were determined to be compensated or stable. Models based on single modality as well as early and late fusion approaches combining patient-reported outcomes and passive smartphone data were evaluated. Results: The highest AUCPr for classifying decompensation with a late fusion approach was 0.80 using leave one subject out cross-validation. Significance: Passively collected data from smartphones, especially when combined with weekly patient-reported outcomes, may reflect behavioral and physiological changes due to HF and thus could enable prediction of HF decompensation. △ Less

Submitted 3 April, 2021; originally announced April 2021.

arXiv:2103.09289 [pdf, other]

Colorectal Cancer Segmentation using Atrous Convolution and Residual Enhanced UNet

Authors: Nisarg A. Shah, Divij Gupta, Romil Lodaya, Ujjwal Baid, Sanjay Talbar

Abstract: Colorectal cancer is a leading cause of death worldwide. However, early diagnosis dramatically increases the chances of survival, for which it is crucial to identify the tumor in the body. Since its imaging uses high-resolution techniques, annotating the tumor is time-consuming and requires particular expertise. Lately, methods built upon Convolutional Neural Networks(CNNs) have proven to be at pa… ▽ More Colorectal cancer is a leading cause of death worldwide. However, early diagnosis dramatically increases the chances of survival, for which it is crucial to identify the tumor in the body. Since its imaging uses high-resolution techniques, annotating the tumor is time-consuming and requires particular expertise. Lately, methods built upon Convolutional Neural Networks(CNNs) have proven to be at par, if not better in many biomedical segmentation tasks. For the task at hand, we propose another CNN-based approach, which uses atrous convolutions and residual connections besides the conventional filters. The training and inference were made using an efficient patch-based approach, which significantly reduced unnecessary computations. The proposed AtResUNet was trained on the DigestPath 2019 Challenge dataset for colorectal cancer segmentation with results having a Dice Coefficient of 0.748. △ Less

Submitted 16 March, 2021; originally announced March 2021.

Comments: 5th IAPR International Conference on Computer Vision and Image Processing, 12 pages

arXiv:2011.04988 [pdf, other]

AIM 2020 Challenge on Rendering Realistic Bokeh

Authors: Andrey Ignatov, Radu Timofte, Ming Qian, Congyu Qiao, Jiamin Lin, Zhenyu Guo, Chenghua Li, Cong Leng, Jian Cheng, Juewen Peng, Xianrui Luo, Ke Xian, Zijin Wu, Zhiguo Cao, Densen Puthussery, Jiji C V, Hrishikesh P S, Melvin Kuriakose, Saikat Dutta, Sourya Dipta Das, Nisarg A. Shah, Kuldeep Purohit, Praveen Kandula, Maitreya Suin, A. N. Rajagopalan , et al. (10 additional authors not shown)

Abstract: This paper reviews the second AIM realistic bokeh effect rendering challenge and provides the description of the proposed solutions and results. The participating teams were solving a real-world bokeh simulation problem, where the goal was to learn a realistic shallow focus technique using a large-scale EBB! bokeh dataset consisting of 5K shallow / wide depth-of-field image pairs captured using th… ▽ More This paper reviews the second AIM realistic bokeh effect rendering challenge and provides the description of the proposed solutions and results. The participating teams were solving a real-world bokeh simulation problem, where the goal was to learn a realistic shallow focus technique using a large-scale EBB! bokeh dataset consisting of 5K shallow / wide depth-of-field image pairs captured using the Canon 7D DSLR camera. The participants had to render bokeh effect based on only one single frame without any additional data from other cameras or sensors. The target metric used in this challenge combined the runtime and the perceptual quality of the solutions measured in the user study. To ensure the efficiency of the submitted models, we measured their runtime on standard desktop CPUs as well as were running the models on smartphone GPUs. The proposed solutions significantly improved the baseline results, defining the state-of-the-art for practical bokeh effect rendering problem. △ Less

Submitted 10 November, 2020; originally announced November 2020.

Comments: Published in ECCV 2020 Workshop (Advances in Image Manipulation), https://data.vision.ee.ethz.ch/cvl/aim20/

arXiv:2010.06659 [pdf, other]

Towards Data-efficient Modeling for Wake Word Spotting

Authors: Yixin Gao, Yuriy Mishchenko, Anish Shah, Spyros Matsoukas, Shiv Vitaladevuni

Abstract: Wake word (WW) spotting is challenging in far-field not only because of the interference in signal transmission but also the complexity in acoustic environments. Traditional WW model training requires large amount of in-domain WW-specific data with substantial human annotations therefore it is hard to build WW models without such data. In this paper we present data-efficient solutions to address t… ▽ More Wake word (WW) spotting is challenging in far-field not only because of the interference in signal transmission but also the complexity in acoustic environments. Traditional WW model training requires large amount of in-domain WW-specific data with substantial human annotations therefore it is hard to build WW models without such data. In this paper we present data-efficient solutions to address the challenges in WW modeling, such as domain-mismatch, noisy conditions, limited annotation, etc. Our proposed system is composed of a multi-condition training pipeline with a stratified data augmentation, which improves the model robustness to a variety of predefined acoustic conditions, together with a semi-supervised learning pipeline to accurately extract the WW and confusable examples from untranscribed speech corpus. Starting from only 10 hours of domain-mismatched WW audio, we are able to enlarge and enrich the training dataset by 20-100 times to capture the acoustic complexity. Our experiments on real user data show that the proposed solutions can achieve comparable performance of a production-grade model by saving 97\% of the amount of WW-specific data collection and 86\% of the bandwidth for annotation. △ Less

Submitted 13 October, 2020; originally announced October 2020.

Journal ref: Proc. ICASSP 2020

arXiv:2009.12798 [pdf, other]

AIM 2020: Scene Relighting and Illumination Estimation Challenge

Authors: Majed El Helou, Ruofan Zhou, Sabine Süsstrunk, Radu Timofte, Mahmoud Afifi, Michael S. Brown, Kele Xu, Hengxing Cai, Yuzhong Liu, Li-Wen Wang, Zhi-Song Liu, Chu-Tak Li, Sourya Dipta Das, Nisarg A. Shah, Akashdeep Jassal, Tongtong Zhao, Shanshan Zhao, Sabari Nathan, M. Parisa Beham, R. Suganya, Qing Wang, Zhongyun Hu, Xin Huang, Yaning Li, Maitreya Suin , et al. (12 additional authors not shown)

Abstract: We review the AIM 2020 challenge on virtual image relighting and illumination estimation. This paper presents the novel VIDIT dataset used in the challenge and the different proposed solutions and final evaluation results over the 3 challenge tracks. The first track considered one-to-one relighting; the objective was to relight an input photo of a scene with a different color temperature and illum… ▽ More We review the AIM 2020 challenge on virtual image relighting and illumination estimation. This paper presents the novel VIDIT dataset used in the challenge and the different proposed solutions and final evaluation results over the 3 challenge tracks. The first track considered one-to-one relighting; the objective was to relight an input photo of a scene with a different color temperature and illuminant orientation (i.e., light source position). The goal of the second track was to estimate illumination settings, namely the color temperature and orientation, from a given image. Lastly, the third track dealt with any-to-any relighting, thus a generalization of the first track. The target color temperature and orientation, rather than being pre-determined, are instead given by a guide image. Participants were allowed to make use of their track 1 and 2 solutions for track 3. The tracks had 94, 52, and 56 registered participants, respectively, leading to 20 confirmed submissions in the final competition stage. △ Less

Submitted 27 September, 2020; originally announced September 2020.

Comments: ECCVW 2020. Data and more information on https://github.com/majedelhelou/VIDIT

arXiv:2008.07742 [pdf, other]

UDC 2020 Challenge on Image Restoration of Under-Display Camera: Methods and Results

Authors: Yuqian Zhou, Michael Kwan, Kyle Tolentino, Neil Emerton, Sehoon Lim, Tim Large, Lijiang Fu, Zhihong Pan, Baopu Li, Qirui Yang, Yihao Liu, Jigang Tang, Tao Ku, Shibin Ma, Bingnan Hu, Jiarong Wang, Densen Puthussery, Hrishikesh P S, Melvin Kuriakose, Jiji C V, Varun Sundar, Sumanth Hegde, Divya Kothandaraman, Kaushik Mitra, Akashdeep Jassal , et al. (20 additional authors not shown)

Abstract: This paper is the report of the first Under-Display Camera (UDC) image restoration challenge in conjunction with the RLQ workshop at ECCV 2020. The challenge is based on a newly-collected database of Under-Display Camera. The challenge tracks correspond to two types of display: a 4k Transparent OLED (T-OLED) and a phone Pentile OLED (P-OLED). Along with about 150 teams registered the challenge, ei… ▽ More This paper is the report of the first Under-Display Camera (UDC) image restoration challenge in conjunction with the RLQ workshop at ECCV 2020. The challenge is based on a newly-collected database of Under-Display Camera. The challenge tracks correspond to two types of display: a 4k Transparent OLED (T-OLED) and a phone Pentile OLED (P-OLED). Along with about 150 teams registered the challenge, eight and nine teams submitted the results during the testing phase for each track. The results in the paper are state-of-the-art restoration performance of Under-Display Camera Restoration. Datasets and paper are available at https://yzhouas.github.io/projects/UDC/udc.html. △ Less

Submitted 18 August, 2020; originally announced August 2020.

Comments: 15 pages

arXiv:2008.03790 [pdf, other]

doi 10.21437/Interspeech.2020-1491

Accurate Detection of Wake Word Start and End Using a CNN

Authors: Christin Jose, Yuriy Mishchenko, Thibaud Senechal, Anish Shah, Alex Escott, Shiv Vitaladevuni

Abstract: Small footprint embedded devices require keyword spotters (KWS) with small model size and detection latency for enabling voice assistants. Such a keyword is often referred to as \textit{wake word} as it is used to wake up voice assistant enabled devices. Together with wake word detection, accurate estimation of wake word endpoints (start and end) is an important task of KWS. In this paper, we prop… ▽ More Small footprint embedded devices require keyword spotters (KWS) with small model size and detection latency for enabling voice assistants. Such a keyword is often referred to as \textit{wake word} as it is used to wake up voice assistant enabled devices. Together with wake word detection, accurate estimation of wake word endpoints (start and end) is an important task of KWS. In this paper, we propose two new methods for detecting the endpoints of wake words in neural KWS that use single-stage word-level neural networks. Our results show that the new techniques give superior accuracy for detecting wake words' endpoints of up to 50 msec standard error versus human annotations, on par with the conventional Acoustic Model plus HMM forced alignment. To our knowledge, this is the first study of wake word endpoints detection methods for single-stage neural KWS. △ Less

Submitted 9 August, 2020; originally announced August 2020.

Comments: Proceedings of INTERSPEECH

Journal ref: Interspeech 2020

arXiv:2008.02567 [pdf, other]

doi 10.3390/s20092653

An Intelligent Non-Invasive Real Time Human Activity Recognition System for Next-Generation Healthcare

Authors: William Taylor, Syed Aziz Shah, Kia Dashtipour, Adnan Zahid, Qammer H. Abbasi, Muhammad Ali Imran

Abstract: Human motion detection is getting considerable attention in the field of Artificial Intelligence (AI) driven healthcare systems. Human motion can be used to provide remote healthcare solutions for vulnerable people by identifying particular movements such as falls, gait and breathing disorders. This can allow people to live more independent lifestyles and still have the safety of being monitored i… ▽ More Human motion detection is getting considerable attention in the field of Artificial Intelligence (AI) driven healthcare systems. Human motion can be used to provide remote healthcare solutions for vulnerable people by identifying particular movements such as falls, gait and breathing disorders. This can allow people to live more independent lifestyles and still have the safety of being monitored if more direct care is needed. At present wearable devices can provide real time monitoring by deploying equipment on a person's body. However, putting devices on a person's body all the time make it uncomfortable and the elderly tends to forget it to wear as well in addition to the insecurity of being tracked all the time. This paper demonstrates how human motions can be detected in quasi-real-time scenario using a non-invasive method. Patterns in the wireless signals presents particular human body motions as each movement induces a unique change in the wireless medium. These changes can be used to identify particular body motions. This work produces a dataset that contains patterns of radio wave signals obtained using software defined radios (SDRs) to establish if a subject is standing up or sitting down as a test case. The dataset was used to create a machine learning model, which was used in a developed application to provide a quasi-real-time classification of standing or sitting state. The machine learning model was able to achieve 96.70 % accuracy using the Random Forest algorithm using 10 fold cross validation. A benchmark dataset of wearable devices was compared to the proposed dataset and results showed the proposed dataset to have similar accuracy of nearly 90 %. The machine learning models developed in this paper are tested for two activities but the developed system is designed and applicable for detecting and differentiating x number of activities. △ Less

Submitted 6 August, 2020; originally announced August 2020.

Comments: 20 pages 18 figures, journal

Journal ref: Sensors 2020, 20(9), 2653

Showing 1–50 of 69 results for author: Shah, A