subscribe to arXiv mailings

Trajectory Tracking for Unmanned Aerial Vehicles in 3D Spaces under Motion Constraints

Authors: Saurabh Kumar, Shashi Ranjan Kumar, Abhinav Sinha

Abstract: This article presents a three-dimensional nonlinear trajectory tracking control strategy for unmanned aerial vehicles (UAVs) in the presence of spatial constraints. As opposed to many existing control strategies, which do not consider spatial constraints, the proposed strategy considers spatial constraints on each degree of freedom movement of the UAV. Such consideration makes the design appealing… ▽ More This article presents a three-dimensional nonlinear trajectory tracking control strategy for unmanned aerial vehicles (UAVs) in the presence of spatial constraints. As opposed to many existing control strategies, which do not consider spatial constraints, the proposed strategy considers spatial constraints on each degree of freedom movement of the UAV. Such consideration makes the design appealing for many practical applications, such as pipeline inspection, boundary tracking, etc. The proposed design accounts for the limited information about the inertia matrix, thereby affirming its inherent robustness against unmodeled dynamics and other imperfections. We rigorously show that the UAV will converge to its desired path by maintaining bounded position, orientation, and linear and angular speeds. Finally, we demonstrate the effectiveness of the proposed strategy through various numerical simulations. △ Less

Submitted 15 July, 2024; originally announced July 2024.

arXiv:2407.08655 [pdf, other]

SPOCKMIP: Segmentation of Vessels in MRAs with Enhanced Continuity using Maximum Intensity Projection as Loss

Authors: Chethan Radhakrishna, Karthikesh Varma Chintalapati, Sri Chandana Hudukula Ram Kumar, Raviteja Sutrave, Hendrik Mattern, Oliver Speck, Andreas Nürnberger, Soumick Chatterjee

Abstract: Identification of vessel structures of different sizes in biomedical images is crucial in the diagnosis of many neurodegenerative diseases. However, the sparsity of good-quality annotations of such images makes the task of vessel segmentation challenging. Deep learning offers an efficient way to segment vessels of different sizes by learning their high-level feature representations and the spatial… ▽ More Identification of vessel structures of different sizes in biomedical images is crucial in the diagnosis of many neurodegenerative diseases. However, the sparsity of good-quality annotations of such images makes the task of vessel segmentation challenging. Deep learning offers an efficient way to segment vessels of different sizes by learning their high-level feature representations and the spatial continuity of such features across dimensions. Semi-supervised patch-based approaches have been effective in identifying small vessels of one to two voxels in diameter. This study focuses on improving the segmentation quality by considering the spatial correlation of the features using the Maximum Intensity Projection~(MIP) as an additional loss criterion. Two methods are proposed with the incorporation of MIPs of label segmentation on the single~(z-axis) and multiple perceivable axes of the 3D volume. The proposed MIP-based methods produce segmentations with improved vessel continuity, which is evident in visual examinations of ROIs. Patch-based training is improved by introducing an additional loss term, MIP loss, to penalise the predicted discontinuity of vessels. A training set of 14 volumes is selected from the StudyForrest dataset comprising of 18 7-Tesla 3D Time-of-Flight~(ToF) Magnetic Resonance Angiography (MRA) images. The generalisation performance of the method is evaluated using the other unseen volumes in the dataset. It is observed that the proposed method with multi-axes MIP loss produces better quality segmentations with a median Dice of $80.245 \pm 0.129$. Also, the method with single-axis MIP loss produces segmentations with a median Dice of $79.749 \pm 0.109$. Furthermore, a visual comparison of the ROIs in the predicted segmentation reveals a significant improvement in the continuity of the vessels when MIP loss is incorporated into training. △ Less

Submitted 11 July, 2024; originally announced July 2024.

arXiv:2407.06868 [pdf, other]

Energy Efficient Fair STAR-RIS for Mobile Users

Authors: Ashok S. Kumar, Nancy Nayak, Sheetal Kalyani, Himal A. Suraweera

Abstract: In this work, we propose a method to improve the energy efficiency and fairness of simultaneously transmitting and reflecting reconfigurable intelligent surfaces (STAR-RIS) for mobile users, ensuring reduced power consumption while maintaining reliable communication. To achieve this, we introduce a new parameter known as the subsurface assignment variable, which determines the number of STAR-RIS e… ▽ More In this work, we propose a method to improve the energy efficiency and fairness of simultaneously transmitting and reflecting reconfigurable intelligent surfaces (STAR-RIS) for mobile users, ensuring reduced power consumption while maintaining reliable communication. To achieve this, we introduce a new parameter known as the subsurface assignment variable, which determines the number of STAR-RIS elements allocated to each user. We then formulate a novel optimization problem by concurrently optimizing the phase shifts of the STAR-RIS and subsurface assignment variable. We leverage the deep reinforcement learning (DRL) technique to address this optimization problem. The DRL model predicts the phase shifts of the STAR-RIS and efficiently allocates elements of STAR-RIS to the users. Additionally, we incorporate a penalty term in the DRL model to facilitate intelligent deactivation of STAR-RIS elements when not in use to enhance energy efficiency. Through extensive experiments, we show that the proposed method can achieve fairly high and nearly equal data rates for all users in both the transmission and reflection spaces in an energy-efficient manner. △ Less

Submitted 9 July, 2024; originally announced July 2024.

arXiv:2407.04444 [pdf, other]

TokenVerse: Unifying Speech and NLP Tasks via Transducer-based ASR

Authors: Shashi Kumar, Srikanth Madikeri, Juan Zuluaga-Gomez, Iuliia Nigmatulina, Esaú Villatoro-Tello, Sergio Burdisso, Petr Motlicek, Karthik Pandia, Aravind Ganapathiraju

Abstract: In traditional conversational intelligence from speech, a cascaded pipeline is used, involving tasks such as voice activity detection, diarization, transcription, and subsequent processing with different NLP models for tasks like semantic endpointing and named entity recognition (NER). Our paper introduces TokenVerse, a single Transducer-based model designed to handle multiple tasks. This is achie… ▽ More In traditional conversational intelligence from speech, a cascaded pipeline is used, involving tasks such as voice activity detection, diarization, transcription, and subsequent processing with different NLP models for tasks like semantic endpointing and named entity recognition (NER). Our paper introduces TokenVerse, a single Transducer-based model designed to handle multiple tasks. This is achieved by integrating task-specific tokens into the reference text during ASR model training, streamlining the inference and eliminating the need for separate NLP models. In addition to ASR, we conduct experiments on 3 different tasks: speaker change detection, endpointing, and NER. Our experiments on a public and a private dataset show that the proposed method improves ASR by up to 7.7% in relative WER while outperforming the cascaded pipeline approach in individual task performance. Additionally, we present task transfer learning to a new task within an existing TokenVerse. △ Less

Submitted 5 July, 2024; originally announced July 2024.

Comments: 5 pages, double column

arXiv:2407.04439 [pdf, other]

XLSR-Transducer: Streaming ASR for Self-Supervised Pretrained Models

Authors: Shashi Kumar, Srikanth Madikeri, Juan Zuluaga-Gomez, Esaú Villatoro-Tello, Iuliia Nigmatulina, Petr Motlicek, Manjunath K E, Aravind Ganapathiraju

Abstract: Self-supervised pretrained models exhibit competitive performance in automatic speech recognition on finetuning, even with limited in-domain supervised data for training. However, popular pretrained models are not suitable for streaming ASR because they are trained with full attention context. In this paper, we introduce XLSR-Transducer, where the XLSR-53 model is used as encoder in transducer set… ▽ More Self-supervised pretrained models exhibit competitive performance in automatic speech recognition on finetuning, even with limited in-domain supervised data for training. However, popular pretrained models are not suitable for streaming ASR because they are trained with full attention context. In this paper, we introduce XLSR-Transducer, where the XLSR-53 model is used as encoder in transducer setup. Our experiments on the AMI dataset reveal that the XLSR-Transducer achieves 4% absolute WER improvement over Whisper large-v2 and 8% over a Zipformer transducer model trained from scratch.To enable streaming capabilities, we investigate different attention masking patterns in the self-attention computation of transformer layers within the XLSR-53 model. We validate XLSR-Transducer on AMI and 5 languages from CommonVoice under low-resource scenarios. Finally, with the introduction of attention sinks, we reduce the left context by half while achieving a relative 12% improvement in WER. △ Less

Submitted 5 July, 2024; originally announced July 2024.

Comments: 5 pages, double column

arXiv:2406.12405 [pdf]

On The Effective Rate and Error Rate Analysis over Fluctuating Nakagami-m Fading Channel

Authors: Manpreet Kaur, Puspraj Singh Chauhan, Sandeep Kumar, Pappu Kumar Verma

Abstract: This paper provides a detailed analysis of the important performance metrics like effective capacity and symbol error rate over fluctuating Nakagami-m fading channel. This distribution is obtained from the ratio of two random variables, following the Nakagami-m distribution and the uniform distribution. Our study derives exact analytical expressions for the EC and SER under different modulation sc… ▽ More This paper provides a detailed analysis of the important performance metrics like effective capacity and symbol error rate over fluctuating Nakagami-m fading channel. This distribution is obtained from the ratio of two random variables, following the Nakagami-m distribution and the uniform distribution. Our study derives exact analytical expressions for the EC and SER under different modulation schemes, considering the effect of channel parameters. Recognising the importance of additive Laplacian noise in today scenario, it has been considered for the error performance analysis of the system. This work may be utilised for the design and optimization of the systems operating in environments characterized by fluctuating Nakagami-m fading. △ Less

Submitted 18 June, 2024; originally announced June 2024.

Comments: 18 pages

arXiv:2406.11768 [pdf, other]

GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities

Authors: Sreyan Ghosh, Sonal Kumar, Ashish Seth, Chandra Kiran Reddy Evuru, Utkarsh Tyagi, S Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha

Abstract: Perceiving and understanding non-speech sounds and non-verbal speech is essential to making decisions that help us interact with our surroundings. In this paper, we propose GAMA, a novel General-purpose Large Audio-Language Model (LALM) with Advanced Audio Understanding and Complex Reasoning Abilities. We build GAMA by integrating an LLM with multiple types of audio representations, including feat… ▽ More Perceiving and understanding non-speech sounds and non-verbal speech is essential to making decisions that help us interact with our surroundings. In this paper, we propose GAMA, a novel General-purpose Large Audio-Language Model (LALM) with Advanced Audio Understanding and Complex Reasoning Abilities. We build GAMA by integrating an LLM with multiple types of audio representations, including features from a custom Audio Q-Former, a multi-layer aggregator that aggregates features from multiple layers of an audio encoder. We fine-tune GAMA on a large-scale audio-language dataset, which augments it with audio understanding capabilities. Next, we propose CompA-R (Instruction-Tuning for Complex Audio Reasoning), a synthetically generated instruction-tuning (IT) dataset with instructions that require the model to perform complex reasoning on the input audio. We instruction-tune GAMA with CompA-R to endow it with complex reasoning abilities, where we further add a soft prompt as input with high-level semantic evidence by leveraging event tags of the input audio. Finally, we also propose CompA-R-test, a human-labeled evaluation dataset for evaluating the capabilities of LALMs on open-ended audio question-answering that requires complex reasoning. Through automated and expert human evaluations, we show that GAMA outperforms all other LALMs in literature on diverse audio understanding tasks by margins of 1%-84%. Further, GAMA IT-ed on CompA-R proves to be superior in its complex reasoning and instruction following capabilities. △ Less

Submitted 17 June, 2024; originally announced June 2024.

Comments: Project Website: https://sreyan88.github.io/gamaaudio/

arXiv:2406.09443 [pdf, other]

Comparative Analysis of Personalized Voice Activity Detection Systems: Assessing Real-World Effectiveness

Authors: Satyam Kumar, Sai Srujana Buddi, Utkarsh Oggy Sarawgi, Vineet Garg, Shivesh Ranjan, Ognjen, Rudovic, Ahmed Hussen Abdelaziz, Saurabh Adya

Abstract: Voice activity detection (VAD) is a critical component in various applications such as speech recognition, speech enhancement, and hands-free communication systems. With the increasing demand for personalized and context-aware technologies, the need for effective personalized VAD systems has become paramount. In this paper, we present a comparative analysis of Personalized Voice Activity Detection… ▽ More Voice activity detection (VAD) is a critical component in various applications such as speech recognition, speech enhancement, and hands-free communication systems. With the increasing demand for personalized and context-aware technologies, the need for effective personalized VAD systems has become paramount. In this paper, we present a comparative analysis of Personalized Voice Activity Detection (PVAD) systems to assess their real-world effectiveness. We introduce a comprehensive approach to assess PVAD systems, incorporating various performance metrics such as frame-level and utterance-level error rates, detection latency and accuracy, alongside user-level analysis. Through extensive experimentation and evaluation, we provide a thorough understanding of the strengths and limitations of various PVAD variants. This paper advances the understanding of PVAD technology by offering insights into its efficacy and viability in practical applications using a comprehensive set of metrics. △ Less

Submitted 11 June, 2024; originally announced June 2024.

arXiv:2406.09167 [pdf, other]

Vision Transformer Segmentation for Visual Bird Sound Denoising

Authors: Sahil Kumar, Jialu Li, Youshan Zhang

Abstract: Audio denoising, especially in the context of bird sounds, remains a challenging task due to persistent residual noise. Traditional and deep learning methods often struggle with artificial or low-frequency noise. In this work, we propose ViTVS, a novel approach that leverages the power of the vision transformer (ViT) architecture. ViTVS adeptly combines segmentation techniques to disentangle clean… ▽ More Audio denoising, especially in the context of bird sounds, remains a challenging task due to persistent residual noise. Traditional and deep learning methods often struggle with artificial or low-frequency noise. In this work, we propose ViTVS, a novel approach that leverages the power of the vision transformer (ViT) architecture. ViTVS adeptly combines segmentation techniques to disentangle clean audio from complex signal mixtures. Our key contributions encompass the development of ViTVS, introducing comprehensive, long-range, and multi-scale representations. These contributions directly tackle the limitations inherent in conventional approaches. Extensive experiments demonstrate that ViTVS outperforms state-of-the-art methods, positioning it as a benchmark solution for real-world bird sound denoising applications. Source code is available at: https://github.com/aiai-4/ViVTS. △ Less

Submitted 13 June, 2024; originally announced June 2024.

Comments: INTERSPEECH 2024

arXiv:2406.07910 [pdf, other]

Demonstration of Safe Electromagnetic Radiation Emitted by 5G Active Antenna Systems

Authors: Sumit Kumar, Chandan Kumar Sheemar, Abdelrahman Astro, Jorge Querol, Symeon Chatzinotas

Abstract: The careful planning and safe deployment of 5G technologies will bring enormous benefits to society and the economy. Higher frequency, beamforming, and small-cells are key technologies that will provide unmatched throughput and seamless connectivity to 5G users. Superficial knowledge of these technologies has raised concerns among the general public about the harmful effects of radiation. Several… ▽ More The careful planning and safe deployment of 5G technologies will bring enormous benefits to society and the economy. Higher frequency, beamforming, and small-cells are key technologies that will provide unmatched throughput and seamless connectivity to 5G users. Superficial knowledge of these technologies has raised concerns among the general public about the harmful effects of radiation. Several standardization bodies are active to put limits on the emissions which are based on a defined set of radiation measurement methodologies. However, due to the peculiarity of 5G such as dynamicity of the beams, network densification, Time Division Duplexing mode of operation, etc, using existing EMF measurement methods may provide inaccurate results. In this context, we discuss our experimental studies aimed towards the measurement of radiation caused by beam-based transmissions from a 5G base station equipped with an Active Antenna System(AAS). We elaborate on the shortcomings of current measurement methodologies and address several open questions. Next, we demonstrate that using user-specific downlink beamforming, not only better performance is achieved compared to non-beamformed downlink, but also the radiation in the vicinity of the intended user is significantly decreased. Further, we show that under weak reception conditions, an uplink transmission can cause significantly high radiation in the vicinity of the user equipment. We believe that our work will help in clearing several misleading concepts about the 5G EMF radiation effects. We conclude the work by providing guidelines to improve the methodology of EMF measurement by considering the spatiotemporal dynamicity of the 5G transmission. △ Less

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.04432 [pdf, other]

LipGER: Visually-Conditioned Generative Error Correction for Robust Automatic Speech Recognition

Authors: Sreyan Ghosh, Sonal Kumar, Ashish Seth, Purva Chiniya, Utkarsh Tyagi, Ramani Duraiswami, Dinesh Manocha

Abstract: Visual cues, like lip motion, have been shown to improve the performance of Automatic Speech Recognition (ASR) systems in noisy environments. We propose LipGER (Lip Motion aided Generative Error Correction), a novel framework for leveraging visual cues for noise-robust ASR. Instead of learning the cross-modal correlation between the audio and visual modalities, we make an LLM learn the task of vis… ▽ More Visual cues, like lip motion, have been shown to improve the performance of Automatic Speech Recognition (ASR) systems in noisy environments. We propose LipGER (Lip Motion aided Generative Error Correction), a novel framework for leveraging visual cues for noise-robust ASR. Instead of learning the cross-modal correlation between the audio and visual modalities, we make an LLM learn the task of visually-conditioned (generative) ASR error correction. Specifically, we instruct an LLM to predict the transcription from the N-best hypotheses generated using ASR beam-search. This is further conditioned on lip motions. This approach addresses key challenges in traditional AVSR learning, such as the lack of large-scale paired datasets and difficulties in adapting to new domains. We experiment on 4 datasets in various settings and show that LipGER improves the Word Error Rate in the range of 1.1%-49.2%. We also release LipHyp, a large-scale dataset with hypothesis-transcription pairs that is additionally equipped with lip motion cues to promote further research in this space △ Less

Submitted 6 June, 2024; originally announced June 2024.

Comments: InterSpeech 2024. Code and Data: https://github.com/Sreyan88/LipGER

arXiv:2405.18297 [pdf, other]

Artificial Intelligence Satellite Telecommunication Testbed using Commercial Off-The-Shelf Chipsets

Authors: Luis M. Garces, Amirhossein Nik, Flor Ortiz, Juan A. Vásquez-Peralvo, Jorge L. Gonzalez, Mouhamad Chehailty, Marcele Kuhfuss, Eva Lagunas, Jan Thoemel, Sumit Kumar, Vishal Singh, Juan C. Duncan, Sahar Malmir, Swetha Varadajulu, Jorge Querol, Symeon Chatzinotas

Abstract: The Artificial Intelligence Satellite Telecommunications Testbed (AISTT), part of the ESA project SPAICE, is focused on the transformation of the satellite payload by using artificial intelligence (AI) and machine learning (ML) methodologies over available commercial off-the-shelf (COTS) AI chips for on-board processing. The objectives include validating artificial intelligence-driven SATCOM scena… ▽ More The Artificial Intelligence Satellite Telecommunications Testbed (AISTT), part of the ESA project SPAICE, is focused on the transformation of the satellite payload by using artificial intelligence (AI) and machine learning (ML) methodologies over available commercial off-the-shelf (COTS) AI chips for on-board processing. The objectives include validating artificial intelligence-driven SATCOM scenarios such as interference detection, spectrum sharing, radio resource management, decoding, and beamforming. The study highlights hardware selection and payload architecture. Preliminary results show that ML models significantly improve signal quality, spectral efficiency, and throughput compared to conventional payload. Moreover, the testbed aims to evaluate the performance and application of AI-capable COTS chips in onboard SATCOM contexts. △ Less

Submitted 28 May, 2024; originally announced May 2024.

Comments: Submitted to SPAICE Conference 2024: AI in and for Space, 5 pages, 3 figures

Journal ref: SPAICE Conference 2024

arXiv:2405.06726 [pdf, other]

Region of Attraction Estimation for Free-Floating Systems under Time-Varying LQR Control

Authors: Lasse Shala, Shubham Vyas, Mohamed Khalil Ben-Larbi, Shivesh Kumar, Enrico Stoll

Abstract: Future Active Debris Removal (ADR) and On Orbit Servicing (OOS) missions demand for elaborate closed loop controllers. Feasible control architectures should take into consideration the inherent coupling of the free floating dynamics and the kinematics of the system. Recently, Time-Varying Linear Quadratic Regulators (TVLQR) have been used to stabilize underactuated systems that exhibit a similar k… ▽ More Future Active Debris Removal (ADR) and On Orbit Servicing (OOS) missions demand for elaborate closed loop controllers. Feasible control architectures should take into consideration the inherent coupling of the free floating dynamics and the kinematics of the system. Recently, Time-Varying Linear Quadratic Regulators (TVLQR) have been used to stabilize underactuated systems that exhibit a similar kinodynamic coupling. Furthermore, this control approach integrates synergistically with Lyapunov based region of attraction (ROA) estimation, which, in the context of ADR and OOS, allows for reasoning about composability of different sub-maneuvers. In this paper, TVLQR was used to stabilize an ADR detumbling maneuver in simulation. Moreover, the ROA of the closed loop dynamics was estimated using a probabilistic method. In order to demonstrate the real-world applicability for free floating robots, further experiments were conducted onboard a free floating testbed. △ Less

Submitted 10 May, 2024; originally announced May 2024.

arXiv:2405.05937 [pdf, other]

Dynamics of a Towed Cable with Sensor-Array for Underwater Target Motion Analysis

Authors: Rohit Kumar Singh, Subrata Kumar, Shovan Bhaumik

Abstract: During a war situation, many times an underwater target motion analysis (TMA) is performed using bearing-only measurements, obtained from a sensor array, which is towed by an own-ship with the help of a connected cable. It is well known that the own-ship is required to perform a manoeuvre in order to make the system observable and localise the target successfully. During the maneuver, it is import… ▽ More During a war situation, many times an underwater target motion analysis (TMA) is performed using bearing-only measurements, obtained from a sensor array, which is towed by an own-ship with the help of a connected cable. It is well known that the own-ship is required to perform a manoeuvre in order to make the system observable and localise the target successfully. During the maneuver, it is important to know the location of the sensor array with respect to the own-ship. This paper develops a dynamic model of a cable-sensor array system to localise the sensor array, which is towed behind a sea-surface vessel. We adopt a lumped-mass approach to represent the towed cable. The discretized cable elements are modelled as an interconnected rigid body, kinematically related to one another. The governing equations are derived by balancing the moments acting on each node. The derived dynamics are solved simultaneously for all the nodes to determine the orientation of the cable and sensor array. The position of the sensor array obtained from this proposed model will further be used by TMA algorithms to enhance the accuracy of the tracking system. △ Less

Submitted 9 May, 2024; originally announced May 2024.

arXiv:2404.09493 [pdf, ps, other]

Novel entropy difference-based EEG channel selection technique for automated detection of ADHD

Authors: Shishir Maheshwari, Kandala N V P S Rajesh, Vivek Kanhangad, U Rajendra Acharya, T Sunil Kumar

Abstract: Attention deficit hyperactivity disorder (ADHD) is one of the common neurodevelopmental disorders in children. This paper presents an automated approach for ADHD detection using the proposed entropy difference (EnD)- based encephalogram (EEG) channel selection approach. In the proposed approach, we selected the most significant EEG channels for the accurate identification of ADHD using an EnD-base… ▽ More Attention deficit hyperactivity disorder (ADHD) is one of the common neurodevelopmental disorders in children. This paper presents an automated approach for ADHD detection using the proposed entropy difference (EnD)- based encephalogram (EEG) channel selection approach. In the proposed approach, we selected the most significant EEG channels for the accurate identification of ADHD using an EnD-based channel selection approach. Secondly, a set of features is extracted from the selected channels and fed to a classifier. To verify the effectiveness of the channels selected, we explored three sets of features and classifiers. More specifically, we explored discrete wavelet transform (DWT), empirical mode decomposition (EMD) and symmetrically-weighted local binary pattern (SLBP)-based features. To perform automated classification, we have used k-nearest neighbor (k-NN), Ensemble classifier, and support vectors machine (SVM) classifiers. Our proposed approach yielded the highest accuracy of 99.29% using the public database. In addition, the proposed EnD-based channel selection has consistently provided better classification accuracies than the entropy-based channel selection approach. Also, the developed method △ Less

Submitted 15 April, 2024; originally announced April 2024.

arXiv:2404.00122 [pdf, other]

AgileFormer: Spatially Agile Transformer UNet for Medical Image Segmentation

Authors: Peijie Qiu, Jin Yang, Sayantan Kumar, Soumyendu Sekhar Ghosh, Aristeidis Sotiras

Abstract: In the past decades, deep neural networks, particularly convolutional neural networks, have achieved state-of-the-art performance in a variety of medical image segmentation tasks. Recently, the introduction of the vision transformer (ViT) has significantly altered the landscape of deep segmentation models. There has been a growing focus on ViTs, driven by their excellent performance and scalabilit… ▽ More In the past decades, deep neural networks, particularly convolutional neural networks, have achieved state-of-the-art performance in a variety of medical image segmentation tasks. Recently, the introduction of the vision transformer (ViT) has significantly altered the landscape of deep segmentation models. There has been a growing focus on ViTs, driven by their excellent performance and scalability. However, we argue that the current design of the vision transformer-based UNet (ViT-UNet) segmentation models may not effectively handle the heterogeneous appearance (e.g., varying shapes and sizes) of objects of interest in medical image segmentation tasks. To tackle this challenge, we present a structured approach to introduce spatially dynamic components to the ViT-UNet. This adaptation enables the model to effectively capture features of target objects with diverse appearances. This is achieved by three main components: \textbf{(i)} deformable patch embedding; \textbf{(ii)} spatially dynamic multi-head attention; \textbf{(iii)} deformable positional encoding. These components were integrated into a novel architecture, termed AgileFormer. AgileFormer is a spatially agile ViT-UNet designed for medical image segmentation. Experiments in three segmentation tasks using publicly available datasets demonstrated the effectiveness of the proposed method. The code is available at \href{https://github.com/sotiraslab/AgileFormer}{https://github.com/sotiraslab/AgileFormer}. △ Less

Submitted 29 March, 2024; originally announced April 2024.

arXiv:2403.13036 [pdf]

doi 10.1016/j.knosys.2023.110984

Intelligent fault diagnosis of worm gearbox based on adaptive CNN using amended gorilla troop optimization with quantum gate mutation strategy

Authors: Govind Vashishtha, Sumika Chauhan, Surinder Kumar, Rajesh Kumar, Radoslaw Zimroz, Anil Kumar

Abstract: The worm gearbox is a high-speed transmission system that plays a vital role in various industries. Therefore it becomes necessary to develop a robust fault diagnosis scheme for worm gearbox. Due to advancements in sensor technology, researchers from academia and industries prefer deep learning models for fault diagnosis purposes. The optimal selection of hyperparameters (HPs) of deep learning mod… ▽ More The worm gearbox is a high-speed transmission system that plays a vital role in various industries. Therefore it becomes necessary to develop a robust fault diagnosis scheme for worm gearbox. Due to advancements in sensor technology, researchers from academia and industries prefer deep learning models for fault diagnosis purposes. The optimal selection of hyperparameters (HPs) of deep learning models plays a significant role in stable performance. Existing methods mainly focused on manual tunning of these parameters, which is a troublesome process and sometimes leads to inaccurate results. Thus, exploring more sophisticated methods to optimize the HPs automatically is important. In this work, a novel optimization, i.e. amended gorilla troop optimization (AGTO), has been proposed to make the convolutional neural network (CNN) adaptive for extracting the features to identify the worm gearbox defects. Initially, the vibration and acoustic signals are converted into 2D images by the Morlet wavelet function. Then, the initial model of CNN is developed by setting hyperparameters. Further, the search space of each Hp is identified and optimized by the developed AGTO algorithm. The classification accuracy has been evaluated by AGTO-CNN, which is further validated by the confusion matrix. The performance of the developed model has also been compared with other models. The AGTO algorithm is examined on twenty-three classical benchmark functions and the Wilcoxon test which demonstrates the effectiveness and dominance of the developed optimization algorithm. The results obtained suggested that the AGTO-CNN has the highest diagnostic accuracy and is stable and robust while diagnosing the worm gearbox. △ Less

Submitted 19 March, 2024; originally announced March 2024.

Journal ref: Knowledge-Based Systems Volume 280, 25 November 2023, 110984

arXiv:2403.10966 [pdf, other]

Robust Co-Design of Canonical Underactuated Systems for Increased Certifiable Stability

Authors: Federico Girlanda, Lasse Shala, Shivesh Kumar, Frank Kirchner

Abstract: Optimal behaviours of a system to perform a specific task can be achieved by leveraging the coupling between trajectory optimization, stabilization, and design optimization. This approach is particularly advantageous for underactuated systems, which are systems that have fewer actuators than degrees of freedom and thus require for more elaborate control systems. This paper proposes a novel co-desi… ▽ More Optimal behaviours of a system to perform a specific task can be achieved by leveraging the coupling between trajectory optimization, stabilization, and design optimization. This approach is particularly advantageous for underactuated systems, which are systems that have fewer actuators than degrees of freedom and thus require for more elaborate control systems. This paper proposes a novel co-design algorithm, namely Robust Trajectory Control with Design optimization (RTC-D). An inner optimization layer (RTC) simultaneously performs direct transcription (DIRTRAN) to find a nominal trajectory while computing optimal hyperparameters for a stabilizing time-varying linear quadratic regulator (TVLQR). RTC-D augments RTC with a design optimization layer, maximizing the system's robustness through a time-varying Lyapunov-based region of attraction (ROA) analysis. This analysis provides a formal guarantee of stability for a set of off-nominal states. The proposed algorithm has been tested on two different underactuated systems: the torque-limited simple pendulum and the cart-pole. Extensive simulations of off-nominal initial conditions demonstrate improved robustness, while real-system experiments show increased insensitivity to torque disturbances. △ Less

Submitted 16 March, 2024; originally announced March 2024.

Comments: Copr. 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. PREPRINT

arXiv:2402.08231 [pdf, ps, other]

doi 10.1109/OJVT.2023.3349151

Asynchronous Distributed Coordinated Hybrid Precoding in Multi-cell mmWave Wireless Networks

Authors: Meesam Jafri, Suraj Srivastava, Sunil Kumar, Aditya K. Jagannatham, Lajos Hanzo

Abstract: Asynchronous distributed hybrid beamformers (ADBF) are conceived for minimizing the total transmit power subject to signal-to-interference-plus-noise ratio (SINR) constraints at the users. Our design requires only limited information exchange between the base stations (BSs) of the mmWave multi-cell coordinated (MCC) networks considered. To begin with, a semidefinite relaxation (SDR)-based fully-di… ▽ More Asynchronous distributed hybrid beamformers (ADBF) are conceived for minimizing the total transmit power subject to signal-to-interference-plus-noise ratio (SINR) constraints at the users. Our design requires only limited information exchange between the base stations (BSs) of the mmWave multi-cell coordinated (MCC) networks considered. To begin with, a semidefinite relaxation (SDR)-based fully-digital (FD) beamformer is designed for a centralized MCC system. Subsequently, a Bayesian learning (BL) technique is harnessed for decomposing the FD beamformer into its analog and baseband components and construct a hybrid transmit precoder (TPC). However, the centralized TPC design requires global channel state information (CSI), hence it results in a high signaling overhead. An alternating direction based method of multipliers (ADMM) technique is developed for a synchronous distributed beamformer (SDBF) design, which relies only on limited information exchange among the BSs, thus reducing the signaling overheads required by the centralized TPC design procedure. However, the SDBF design is challenging, since it requires the updates from the BSs to be strictly synchronized. As a remedy, an ADBF framework is developed that mitigates the inter-cell interference (ICI) and also control the asynchrony in the system. Furthermore, the above ADBF framework is also extended to the robust ADBF (R-ADBF) algorithm that incorporates the CSI uncertainty into the design procedure for minimizing the the worst-case transmit power. Our simulation results illustrate both the enhanced performance and the improved convergence properties of the ADMM-based ADBF and R-ADBF schemes. △ Less

Submitted 13 February, 2024; originally announced February 2024.

Journal ref: IEEE Open Journal of Vehicular Technology, vol. 5, pp. 200-218, 2024

arXiv:2402.06176 [pdf, other]

Cooperative Nonlinear Guidance Strategies for Guaranteed Pursuit-Evasion

Authors: Saurabh Kumar, Shashi Ranjan Kumar, Abhinav Sinha

Abstract: This paper addresses the pursuit-evasion problem involving three agents -- a purser, an evader, and a defender. We develop cooperative guidance laws for the evader-defender team that guarantee that the defender intercepts the pursuer before it reaches the vicinity of the evader. Unlike heuristic methods, optimal control, differential game formulation, and recently proposed time-constrained guidanc… ▽ More This paper addresses the pursuit-evasion problem involving three agents -- a purser, an evader, and a defender. We develop cooperative guidance laws for the evader-defender team that guarantee that the defender intercepts the pursuer before it reaches the vicinity of the evader. Unlike heuristic methods, optimal control, differential game formulation, and recently proposed time-constrained guidance techniques, we propose a geometric solution to safeguard the evader from the pursuer's incoming threat. The proposed strategy is computationally efficient and expected to be scalable as the number of agents increases. Another alluring feature of the proposed strategy is that the evader-defender team does not require the knowledge of the pursuer's strategy and that the pursuer's interception is guaranteed from arbitrary initial engagement geometries. We further show that the necessary error variables for the evader-defender team vanish within a time that can be exactly prescribed prior to the three-body engagement. Finally, we demonstrate the efficacy of the proposed cooperative defense strategy via simulation in diverse engagement scenarios. △ Less

Submitted 8 February, 2024; originally announced February 2024.

arXiv:2402.05918 [pdf, other]

Consensus-driven Deviated Pursuit for Guaranteed Simultaneous Interception of Moving Targets

Authors: Abhinav Sinha, Dwaipayan Mukherjee, Shashi Ranjan Kumar

Abstract: This work proposes a cooperative strategy that employs deviated pursuit guidance to simultaneously intercept a moving (but not manoeuvring) target. As opposed to many existing cooperative guidance strategies which use estimates of time-to-go, based on proportional-navigation guidance, the proposed strategy uses an exact expression for time-to-go to ensure simultaneous interception. The guidance de… ▽ More This work proposes a cooperative strategy that employs deviated pursuit guidance to simultaneously intercept a moving (but not manoeuvring) target. As opposed to many existing cooperative guidance strategies which use estimates of time-to-go, based on proportional-navigation guidance, the proposed strategy uses an exact expression for time-to-go to ensure simultaneous interception. The guidance design considers nonlinear engagement kinematics, allowing the proposed strategy to remain effective over a large operating regime. Unlike existing strategies on simultaneous interception that achieve interception at the average value of their initial time-to-go estimates, this work provides flexibility in the choice of impact time. By judiciously choosing the edge weights of the communication network, a weighted consensus in time-to-go can be achieved. It has been shown that by allowing an edge weight to be negative, consensus in time-to-go can even be achieved for an impact time that lies outside the convex hull of the set of initial time-to-go values of the individual interceptors. The bounds on such negative weights have been analysed for some special graphs, using Nyquist criterion. Simulations are provided to vindicate the efficacy of the proposed strategy. △ Less

Submitted 8 February, 2024; originally announced February 2024.

arXiv:2402.05273 [pdf, other]

ASCENT: A Context-Aware Spectrum Coexistence Design and Implementation Toolset for Policymakers in Satellite Bands

Authors: Ta-seen Reaz Niloy, Saurav Kumar, Aniruddha Hore, Zoheb Hassan, Carl Dietrich, Eric W. Burger, Jeffrey H. Reed, Vijay K. Shah

Abstract: This paper introduces ASCENT (context Aware Spectrum Coexistence Design and Implementation) toolset, an advanced context-aware terrestrial satellite spectrum sharing toolset designed for researchers, policymakers, and regulators. It serves two essential purposes (a) evaluating the potential for harmful interference to primary users in satellite bands and (b) facilitating the analysis, design, and… ▽ More This paper introduces ASCENT (context Aware Spectrum Coexistence Design and Implementation) toolset, an advanced context-aware terrestrial satellite spectrum sharing toolset designed for researchers, policymakers, and regulators. It serves two essential purposes (a) evaluating the potential for harmful interference to primary users in satellite bands and (b) facilitating the analysis, design, and implementation of diverse regulatory policies on spectrum usage and sharing. Notably, ASCENT implements a closed-loop feedback system that allows dynamic adaptation of policies according to a wide range of contextual factors (e.g., weather, buildings, summer/winter foliage, etc.) and feedback on the impact of these policies through realistic simulation. Specifically, ASCENT comprises the following components (i) interference evaluation tool for evaluating interference at the incumbents in a spectrum-sharing environment while taking the underlying contexts, (ii) dynamic spectrum access (DSA) framework for providing context-aware instructions to adapt networking parameters and control secondary terrestrial network's access to the shared spectrum band according to context aware prioritization, (iii) Context broker to acquire essential and relevant contexts from external context information providers; and (iv) DSA Database to store dynamic and static contexts and the regulator's policy information. The closed-loop feedback system of ASCENT is implemented by integrating these components in a modular software architecture. A case study of sharing the lower 12 GHz Ku band (12.2-12.7 GHz) with the 5G terrestrial cellular network is considered, and the usability of ASCENT is demonstrated by dynamically changing exclusion zone's radius in different weather conditions. △ Less

Submitted 15 February, 2024; v1 submitted 7 February, 2024; originally announced February 2024.

arXiv:2402.01933 [pdf, other]

ToMoBrush: Exploring Dental Health Sensing using a Sonic Toothbrush

Authors: Kuang Yuan, Mohamed Ibrahim, Yiwen Song, Guoxiang Deng, Suvendra Vijayan, Robert Nerone, Akshay Gadre, Swarun Kumar

Abstract: Early detection of dental disease is crucial to prevent adverse outcomes. Today, dental X-rays are currently the most accurate gold standard for dental disease detection. Unfortunately, regular X-ray exam is still a privilege for billions of people around the world. In this paper, we ask: "Can we develop a low-cost sensing system that enables dental self-examination in the comfort of one's home?"… ▽ More Early detection of dental disease is crucial to prevent adverse outcomes. Today, dental X-rays are currently the most accurate gold standard for dental disease detection. Unfortunately, regular X-ray exam is still a privilege for billions of people around the world. In this paper, we ask: "Can we develop a low-cost sensing system that enables dental self-examination in the comfort of one's home?" This paper presents ToMoBrush, a dental health sensing system that explores using off-the-shelf sonic toothbrushes for dental condition detection. Our solution leverages the fact that a sonic toothbrush produces rich acoustic signals when in contact with teeth, which contain important information about each tooth's status. ToMoBrush extracts tooth resonance signatures from the acoustic signals to characterize varied dental health conditions of the teeth. We evaluate ToMoBrush on 19 participants and dental-standard models for detecting common dental problems including caries, calculus, and food impaction, achieving a detection ROC-AUC of 0.90, 0.83, and 0.88 respectively. Interviews with dental experts validate ToMoBrush's potential in enhancing at-home dental healthcare. △ Less

Submitted 2 February, 2024; originally announced February 2024.

ACM Class: J.3; C.3; H.5.2

arXiv:2401.08468 [pdf, other]

Keep or toss? A nonparametric score to evaluate solutions for noisy ICA

Authors: Syamantak Kumar, Purnamrita Sarkar, Peter Bickel, Derek Bean

Abstract: Independent Component Analysis (ICA) was introduced in the 1980's as a model for Blind Source Separation (BSS), which refers to the process of recovering the sources underlying a mixture of signals, with little knowledge about the source signals or the mixing process. While there are many sophisticated algorithms for estimation, different methods have different shortcomings. In this paper, we deve… ▽ More Independent Component Analysis (ICA) was introduced in the 1980's as a model for Blind Source Separation (BSS), which refers to the process of recovering the sources underlying a mixture of signals, with little knowledge about the source signals or the mixing process. While there are many sophisticated algorithms for estimation, different methods have different shortcomings. In this paper, we develop a nonparametric score to adaptively pick the right algorithm for ICA with arbitrary Gaussian noise. The novelty of this score stems from the fact that it just assumes a finite second moment of the data and uses the characteristic function to evaluate the quality of the estimated mixing matrix without any knowledge of the parameters of the noise distribution. In addition, we propose some new contrast functions and algorithms that enjoy the same fast computability as existing algorithms like FASTICA and JADE but work in domains where the former may fail. While these also may have weaknesses, our proposed diagnostic, as shown by our simulations, can remedy them. Finally, we propose a theoretical framework to analyze the local and global convergence properties of our algorithms. △ Less

Submitted 9 February, 2024; v1 submitted 16 January, 2024; originally announced January 2024.

arXiv:2401.05376 [pdf, other]

Eating Speed Measurement Using Wrist-Worn IMU Sensors in Free-Living Environments

Authors: Chunzhuo Wang, T. Sunil Kumar, Walter De Raedt, Guido Camps, Hans Hallez, Bart Vanrumste

Abstract: Eating speed is an important indicator that has been widely scrutinized in nutritional studies. The relationship between eating speed and several intake-related problems such as obesity, diabetes, and oral health has received increased attention from researchers. However, existing studies mainly use self-reported questionnaires to obtain participants' eating speed, where they choose options from s… ▽ More Eating speed is an important indicator that has been widely scrutinized in nutritional studies. The relationship between eating speed and several intake-related problems such as obesity, diabetes, and oral health has received increased attention from researchers. However, existing studies mainly use self-reported questionnaires to obtain participants' eating speed, where they choose options from slow, medium, and fast. Such a non-quantitative method is highly subjective and coarse in individual level. In this study, we propose a novel approach to measure eating speed in free-living environments automatically and objectively using wrist-worn inertial measurement unit (IMU) sensors. Specifically, a temporal convolutional network combined with a multi-head attention module (TCN-MHA) is developed to detect bites (including eating and drinking gestures) from free-living IMU data. The predicted bite sequences are then clustered to eating episodes. Eating speed is calculated by using the time taken to finish the eating episode to divide the number of bites. To validate the proposed approach on eating speed measurement, a 7-fold cross validation is applied to the self-collected fine-annotated full-day-I (FD-I) dataset, and a hold-out experiment is conducted on the full-day-II (FD-II) dataset. The two datasets are collected from 61 participants in free-living environments with a total duration of 513 h, which are publicly available. Experimental results shows that the proposed approach achieves a mean absolute percentage error (MAPE) of 0.110 and 0.146 in the FD-I and FD-II datasets, respectively, showcasing the feasibility of automated eating speed measurement. To the best of our knowledge, this is the first study investigating automated eating speed measurement. △ Less

Submitted 15 December, 2023; originally announced January 2024.

arXiv:2312.09842 [pdf, ps, other]

On the compression of shallow non-causal ASR models using knowledge distillation and tied-and-reduced decoder for low-latency on-device speech recognition

Authors: Nagaraj Adiga, Jinhwan Park, Chintigari Shiva Kumar, Shatrughan Singh, Kyungmin Lee, Chanwoo Kim, Dhananjaya Gowda

Abstract: Recently, the cascaded two-pass architecture has emerged as a strong contender for on-device automatic speech recognition (ASR). A cascade of causal and shallow non-causal encoders coupled with a shared decoder enables operation in both streaming and look-ahead modes. In this paper, we propose shallow cascaded model by combining various model compression techniques such as knowledge distillation,… ▽ More Recently, the cascaded two-pass architecture has emerged as a strong contender for on-device automatic speech recognition (ASR). A cascade of causal and shallow non-causal encoders coupled with a shared decoder enables operation in both streaming and look-ahead modes. In this paper, we propose shallow cascaded model by combining various model compression techniques such as knowledge distillation, shared decoder, and tied-and-reduced transducer network in order to reduce the model footprint. The shared decoder is changed into a tied-and-reduced network. The cascaded two-pass model is further compressed using knowledge distillation using a Kullback-Leibler divergence loss on the model posteriors. We demonstrate a 50% reduction in the size of a 41 M parameter cascaded teacher model with no noticeable degradation in ASR accuracy and a 30% reduction in latency △ Less

Submitted 15 December, 2023; originally announced December 2023.

arXiv:2311.16409 [pdf, other]

A Deep Q-Learning based, Base-Station Connectivity-Aware, Decentralized Pheromone Mobility Model for Autonomous UAV Networks

Authors: Shreyas Devaraju, Alexander Ihler, Sunil Kumar

Abstract: UAV networks consisting of low SWaP (size, weight, and power), fixed-wing UAVs are used in many applications, including area monitoring, search and rescue, surveillance, and tracking. Performing these operations efficiently requires a scalable, decentralized, autonomous UAV network architecture with high network connectivity. Whereas fast area coverage is needed for quickly sensing the area, stron… ▽ More UAV networks consisting of low SWaP (size, weight, and power), fixed-wing UAVs are used in many applications, including area monitoring, search and rescue, surveillance, and tracking. Performing these operations efficiently requires a scalable, decentralized, autonomous UAV network architecture with high network connectivity. Whereas fast area coverage is needed for quickly sensing the area, strong node degree and base station (BS) connectivity are needed for UAV control and coordination and for transmitting sensed information to the BS in real time. However, the area coverage and connectivity exhibit a fundamental trade-off: maintaining connectivity restricts the UAVs' ability to explore. In this paper, we first present a node degree and BS connectivity-aware distributed pheromone (BS-CAP) mobility model to autonomously coordinate the UAV movements in a decentralized UAV network. This model maintains a desired connectivity among 1-hop neighbors and to the BS while achieving fast area coverage. Next, we propose a deep Q-learning policy based BS-CAP model (BSCAP-DQN) to further tune and improve the coverage and connectivity trade-off. Since it is not practical to know the complete topology of such a network in real time, the proposed mobility models work online, are fully distributed, and rely on neighborhood information. Our simulations demonstrate that both proposed models achieve efficient area coverage and desired node degree and BS connectivity, improving significantly over existing schemes. △ Less

Submitted 27 November, 2023; originally announced November 2023.

arXiv:2311.11521 [pdf]

On the Effective throughput of Shadowed Beaulieu-Xie fading channel

Authors: Manpreet Kaur, Sandeep Kumar, Poonam Yadav, Puspraj Singh Chauhan

Abstract: Given the imperative for advanced wireless networks in the next generation and the rise of real-time applications within wireless communication, there is a notable focus on investigating data rate performance across various fading scenarios. This research delved into analyzing the effective throughput of the shadowed Beaulieu-Xie (SBX) composite fading channel using the PDF-based approach. To get… ▽ More Given the imperative for advanced wireless networks in the next generation and the rise of real-time applications within wireless communication, there is a notable focus on investigating data rate performance across various fading scenarios. This research delved into analyzing the effective throughput of the shadowed Beaulieu-Xie (SBX) composite fading channel using the PDF-based approach. To get the simplified relationship between the performance parameter and channel parameters, the low-SNR and the high-SNR approximation of the effective rate are also provided. The proposed formulations are evaluated for different values of system parameters to study their impact on the effective throughput. Also, the impact of the delay parameter on the EC is investigated. Monte-Carlo simulations are used to verify the facticity of the deduced equations. △ Less

Submitted 19 November, 2023; originally announced November 2023.

Comments: 18 pages

arXiv:2310.08846 [pdf, other]

Speaking rate attention-based duration prediction for speed control TTS

Authors: Jesuraj Bandekar, Sathvik Udupa, Abhayjeet Singh, Anjali Jayakumar, Deekshitha G, Sandhya Badiger, Saurabh Kumar, Pooja VH, Prasanta Kumar Ghosh

Abstract: With the advent of high-quality speech synthesis, there is a lot of interest in controlling various prosodic attributes of speech. Speaking rate is an essential attribute towards modelling the expressivity of speech. In this work, we propose a novel approach to control the speaking rate for non-autoregressive TTS. We achieve this by conditioning the speaking rate inside the duration predictor, all… ▽ More With the advent of high-quality speech synthesis, there is a lot of interest in controlling various prosodic attributes of speech. Speaking rate is an essential attribute towards modelling the expressivity of speech. In this work, we propose a novel approach to control the speaking rate for non-autoregressive TTS. We achieve this by conditioning the speaking rate inside the duration predictor, allowing implicit speaking rate control. We show the benefits of this approach by synthesising audio at various speaking rate factors and measuring the quality of speaking rate-controlled synthesised speech. Further, we study the effect of the speaking rate distribution of the training data towards effective rate control. Finally, we fine-tune a baseline pretrained TTS model to obtain speaking rate control TTS. We provide various analyses to showcase the benefits of using this proposed approach, along with objective as well as subjective metrics. We find that the proposed methods have higher subjective scores and lower speaker rate errors across many speaking rate factors over the baseline. △ Less

Submitted 13 October, 2023; originally announced October 2023.

Comments: \c{opyright} 20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

arXiv:2310.08753 [pdf, other]

CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models

Authors: Sreyan Ghosh, Ashish Seth, Sonal Kumar, Utkarsh Tyagi, Chandra Kiran Evuru, Ramaneswaran S, S. Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha

Abstract: A fundamental characteristic of audio is its compositional nature. Audio-language models (ALMs) trained using a contrastive approach (e.g., CLAP) that learns a shared representation between audio and language modalities have improved performance in many downstream applications, including zero-shot audio classification, audio retrieval, etc. However, the ability of these models to effectively perfo… ▽ More A fundamental characteristic of audio is its compositional nature. Audio-language models (ALMs) trained using a contrastive approach (e.g., CLAP) that learns a shared representation between audio and language modalities have improved performance in many downstream applications, including zero-shot audio classification, audio retrieval, etc. However, the ability of these models to effectively perform compositional reasoning remains largely unexplored and necessitates additional research. In this paper, we propose CompA, a collection of two expert-annotated benchmarks with a majority of real-world audio samples, to evaluate compositional reasoning in ALMs. Our proposed CompA-order evaluates how well an ALM understands the order or occurrence of acoustic events in audio, and CompA-attribute evaluates attribute-binding of acoustic events. An instance from either benchmark consists of two audio-caption pairs, where both audios have the same acoustic events but with different compositions. An ALM is evaluated on how well it matches the right audio to the right caption. Using this benchmark, we first show that current ALMs perform only marginally better than random chance, thereby struggling with compositional reasoning. Next, we propose CompA-CLAP, where we fine-tune CLAP using a novel learning method to improve its compositional reasoning abilities. To train CompA-CLAP, we first propose improvements to contrastive training with composition-aware hard negatives, allowing for more focused training. Next, we propose a novel modular contrastive loss that helps the model learn fine-grained compositional understanding and overcomes the acute scarcity of openly available compositional audios. CompA-CLAP significantly improves over all our baseline models on the CompA benchmark, indicating its superior compositional reasoning capabilities. △ Less

Submitted 18 June, 2024; v1 submitted 12 October, 2023; originally announced October 2023.

Comments: ICLR 2024

arXiv:2310.00395 [pdf]

doi 10.5121/ijcnc.2023.15506

Analysis of system capacity and spectral efficiency of fixed-grid network

Authors: Adarsha M, S. Malathi, Santosh Kumar

Abstract: In this article, the performance of a fixed grid network is examined for various modulation formats to estimate the system's capacity and spectral efficiency. The optical In-phase Quadrature Modulator structure is used to build a fixed grid network modulation, and the homodyne detection approach is used for the receiver. Data multiplexing is accomplished using the Polarization Division Multiplexed… ▽ More In this article, the performance of a fixed grid network is examined for various modulation formats to estimate the system's capacity and spectral efficiency. The optical In-phase Quadrature Modulator structure is used to build a fixed grid network modulation, and the homodyne detection approach is used for the receiver. Data multiplexing is accomplished using the Polarization Division Multiplexed technology. 100 Gbps, 150 Gbps, and 200 Gbps data rates are transmitted under these circumstances utilizing various modulation formats. Various pre-processing and signal recovery steps are explained by using modern digital signal processing systems. The achieved spectrum efficiencies for PM-QPSK, PM-8 QAM, and PM-16 QAM, respectively, were 2, 3, and 4 bits/s/Hz. Different modulation like PM-QPSK, PM-8-QAM, and PM-16-QAM each has system capacities of 8-9, 12-13.5, and 16-18 Tbps and it reaches transmission distances of 3000, 1300, and 700 kilometers with acceptable Bit Error Rate less than equal to 2*10-3 respectively. Peak optical power for received signal detection and full width at half maximum is noted for the different modulations under a fixed grind network. △ Less

Submitted 30 September, 2023; originally announced October 2023.

Journal ref: International Journal of Computer Networks & Communications (IJCNC) Vol.15, No.5, September 2023

arXiv:2309.14923 [pdf, other]

ML-based PBCH symbol detection and equalization for 5G Non-Terrestrial Networks

Authors: Inés Larráyoz-Arrigote, Marcele O. K. Mendonca, Alejandro Gonzalez-Garrido, Jevgenij Krivochiza, Sumit Kumar, Jorge Querol, Joel Grotz, Stefano Andrenacci, Symeon Chatzinotas

Abstract: This paper delves into the application of Machine Learning (ML) techniques in the realm of 5G Non-Terrestrial Networks (5G-NTN), particularly focusing on symbol detection and equalization for the Physical Broadcast Channel (PBCH). As 5G-NTN gains prominence within the 3GPP ecosystem, ML offers significant potential to enhance wireless communication performance. To investigate these possibilities,… ▽ More This paper delves into the application of Machine Learning (ML) techniques in the realm of 5G Non-Terrestrial Networks (5G-NTN), particularly focusing on symbol detection and equalization for the Physical Broadcast Channel (PBCH). As 5G-NTN gains prominence within the 3GPP ecosystem, ML offers significant potential to enhance wireless communication performance. To investigate these possibilities, we present ML-based models trained with both synthetic and real data from a real 5G over-the-satellite testbed. Our analysis includes examining the performance of these models under various Signal-to-Noise Ratio (SNR) scenarios and evaluating their effectiveness in symbol enhancement and channel equalization tasks. The results highlight the ML performance in controlled settings and their adaptability to real-world challenges, shedding light on the potential benefits of the application of ML in 5G-NTN. △ Less

Submitted 26 September, 2023; originally announced September 2023.

arXiv:2309.13716 [pdf, other]

MOSAIC: Multi-Object Segmented Arbitrary Stylization Using CLIP

Authors: Prajwal Ganugula, Y S S S Santosh Kumar, N K Sagar Reddy, Prabhath Chellingi, Avinash Thakur, Neeraj Kasera, C Shyam Anand

Abstract: Style transfer driven by text prompts paved a new path for creatively stylizing the images without collecting an actual style image. Despite having promising results, with text-driven stylization, the user has no control over the stylization. If a user wants to create an artistic image, the user requires fine control over the stylization of various entities individually in the content image, which… ▽ More Style transfer driven by text prompts paved a new path for creatively stylizing the images without collecting an actual style image. Despite having promising results, with text-driven stylization, the user has no control over the stylization. If a user wants to create an artistic image, the user requires fine control over the stylization of various entities individually in the content image, which is not addressed by the current state-of-the-art approaches. On the other hand, diffusion style transfer methods also suffer from the same issue because the regional stylization control over the stylized output is ineffective. To address this problem, We propose a new method Multi-Object Segmented Arbitrary Stylization Using CLIP (MOSAIC), that can apply styles to different objects in the image based on the context extracted from the input prompt. Text-based segmentation and stylization modules which are based on vision transformer architecture, were used to segment and stylize the objects. Our method can extend to any arbitrary objects, styles and produce high-quality images compared to the current state of art methods. To our knowledge, this is the first attempt to perform text-guided arbitrary object-wise stylization. We demonstrate the effectiveness of our approach through qualitative and quantitative analysis, showing that it can generate visually appealing stylized images with enhanced control over stylization and the ability to generalize to unseen object classes. △ Less

Submitted 24 September, 2023; originally announced September 2023.

Comments: Camera ready, New Ideas in Vision Transformers workshop, ICCV 2023

arXiv:2309.09836 [pdf, other]

RECAP: Retrieval-Augmented Audio Captioning

Authors: Sreyan Ghosh, Sonal Kumar, Chandra Kiran Reddy Evuru, Ramani Duraiswami, Dinesh Manocha

Abstract: We present RECAP (REtrieval-Augmented Audio CAPtioning), a novel and effective audio captioning system that generates captions conditioned on an input audio and other captions similar to the audio retrieved from a datastore. Additionally, our proposed method can transfer to any domain without the need for any additional fine-tuning. To generate a caption for an audio sample, we leverage an audio-t… ▽ More We present RECAP (REtrieval-Augmented Audio CAPtioning), a novel and effective audio captioning system that generates captions conditioned on an input audio and other captions similar to the audio retrieved from a datastore. Additionally, our proposed method can transfer to any domain without the need for any additional fine-tuning. To generate a caption for an audio sample, we leverage an audio-text model CLAP to retrieve captions similar to it from a replaceable datastore, which are then used to construct a prompt. Next, we feed this prompt to a GPT-2 decoder and introduce cross-attention layers between the CLAP encoder and GPT-2 to condition the audio for caption generation. Experiments on two benchmark datasets, Clotho and AudioCaps, show that RECAP achieves competitive performance in in-domain settings and significant improvements in out-of-domain settings. Additionally, due to its capability to exploit a large text-captions-only datastore in a training-free fashion, RECAP shows unique capabilities of captioning novel audio events never seen during training and compositional audios with multiple events. To promote research in this space, we also release 150,000+ new weakly labeled captions for AudioSet, AudioCaps, and Clotho. △ Less

Submitted 6 June, 2024; v1 submitted 18 September, 2023; originally announced September 2023.

Comments: ICASSP 2024. Code and data: https://github.com/Sreyan88/RECAP

arXiv:2308.11673 [pdf, other]

WEARS: Wearable Emotion AI with Real-time Sensor data

Authors: Dhruv Limbani, Daketi Yatin, Nitish Chaturvedi, Vaishnavi Moorthy, Pushpalatha M, Harichandana BSS, Sumit Kumar

Abstract: Emotion prediction is the field of study to understand human emotions. Existing methods focus on modalities like text, audio, facial expressions, etc., which could be private to the user. Emotion can be derived from the subject's psychological data as well. Various approaches that employ combinations of physiological sensors for emotion recognition have been proposed. Yet, not all sensors are simp… ▽ More Emotion prediction is the field of study to understand human emotions. Existing methods focus on modalities like text, audio, facial expressions, etc., which could be private to the user. Emotion can be derived from the subject's psychological data as well. Various approaches that employ combinations of physiological sensors for emotion recognition have been proposed. Yet, not all sensors are simple to use and handy for individuals in their daily lives. Thus, we propose a system to predict user emotion using smartwatch sensors. We design a framework to collect ground truth in real-time utilizing a mix of English and regional language-based videos to invoke emotions in participants and collect the data. Further, we modeled the problem as binary classification due to the limited dataset size and experimented with multiple machine-learning models. We also did an ablation study to understand the impact of features including Heart Rate, Accelerometer, and Gyroscope sensor data on mood. From the experimental results, Multi-Layer Perceptron has shown a maximum accuracy of 93.75 percent for pleasant-unpleasant (high/low valence classification) moods. △ Less

Submitted 22 August, 2023; originally announced August 2023.

arXiv:2308.09809 [pdf, other]

Adaptive Timers and Buffer Optimization for Layer-2 Protocols in 5G Non-Terrestrial Networks

Authors: Chandan Kumar Sheemar, Sumit Kumar, Jorge Querol, Symeon Chatzinotas

Abstract: Interest in the integration of Terrestrial Networks (TN) and Non-Terrestrial Networks (NTN); primarily satellites; has been rekindled due to the potential of NTN to provide ubiquitous coverage. Especially with the peculiar and flexible physical layer properties of 5G-NR, now direct access to 5G services through satellites could become possible. However, the large Round-Trip Delays (RTD) in NTNs re… ▽ More Interest in the integration of Terrestrial Networks (TN) and Non-Terrestrial Networks (NTN); primarily satellites; has been rekindled due to the potential of NTN to provide ubiquitous coverage. Especially with the peculiar and flexible physical layer properties of 5G-NR, now direct access to 5G services through satellites could become possible. However, the large Round-Trip Delays (RTD) in NTNs require a re-evaluation of the design of RLC and PDCP layers timers ( and associated buffers), in particular for the regenerative payload satellites which have limited computational resources, and hence need to be optimally utilized. Our aim in this work is to initiate a new line of research for emerging NTNs with limited resources from a higher-layer perspective. To this end, we propose a novel and efficient method for optimally designing the RLC and PDCP layers' buffers and timers without the need for intensive computations. This approach is relevant for low-cost satellites, which have limited computational and energy resources. The simulation results show that the proposed methods can significantly improve the performance in terms of resource utilization and delays. △ Less

Submitted 18 August, 2023; originally announced August 2023.

arXiv:2308.08579 [pdf, other]

IRS Assisted MIMO Full Duplex: Rate Analysis and Beamforming Under Imperfect CSI

Authors: Chandan Kumar Sheemar, Sourabh Solanki, Jorge Querol, Sumit Kumar, Symeon Chatzinotas

Abstract: Intelligent reflecting surfaces (IRS) have emerged as a promising technology to enhance the performance of wireless communication systems. By actively manipulating the wireless propagation environment, IRS enables efficient signal transmission and reception. In recent years, the integration of IRS with full-duplex (FD) communication has garnered significant attention due to its potential to furthe… ▽ More Intelligent reflecting surfaces (IRS) have emerged as a promising technology to enhance the performance of wireless communication systems. By actively manipulating the wireless propagation environment, IRS enables efficient signal transmission and reception. In recent years, the integration of IRS with full-duplex (FD) communication has garnered significant attention due to its potential to further improve spectral and energy efficiencies. IRS-assisted FD systems combine the benefits of both IRS and FD technologies, providing a powerful solution for the next generation of cellular systems. In this manuscript, we present a novel approach to jointly optimize active and passive beamforming in a multiple-input-multiple-output (MIMO) FD system assisted by an IRS for weighted sum rate (WSR) maximization. Given the inherent difficulty in obtaining perfect channel state information (CSI) in practical scenarios, we consider imperfect CSI and propose a statistically robust beamforming strategy to maximize the ergodic WSR. Additionally, we analyze the achievable WSR for an IRS-assisted MIMO FD system under imperfect CSI by deriving both the lower and upper bounds. To tackle the problem of ergodic WSR maximization, we employ the concept of expected weighted minimum mean squared error (EWMMSE), which exploits the information of the expected error covariance matrices and ensures convergence to a local optimum. We evaluate the effectiveness of our proposed design through extensive simulations. The results demonstrate that our robust approach yields significant performance improvements compared to the simplistic beamforming approach that disregards CSI errors, while also outperforming the robust half-duplex (HD) system considerably △ Less

Submitted 16 August, 2023; originally announced August 2023.

Comments: arXiv admin note: substantial text overlap with arXiv:2308.08016

arXiv:2308.08016 [pdf, other]

Robust Beamforming for IRS Aided MIMO Full Duplex Systems

Authors: Chandan Kumar Sheemar, Jorge Querol, Sourabh Solanki, Sumit Kumar, Symeon Chatzinotas

Abstract: In this paper, a novel robust beamforming for an intelligent reflecting surface (IRS) assisted FD system is presented. Since perfect channel state information (CSI) is often challenging to acquire in practice, we consider the case of imperfect CSI and adopt a statistically robust beamforming approach to maximize the ergodic weighted sum rate (WSR). We also analyze the achievable WSR of an IRS-assi… ▽ More In this paper, a novel robust beamforming for an intelligent reflecting surface (IRS) assisted FD system is presented. Since perfect channel state information (CSI) is often challenging to acquire in practice, we consider the case of imperfect CSI and adopt a statistically robust beamforming approach to maximize the ergodic weighted sum rate (WSR). We also analyze the achievable WSR of an IRS-assisted FD with imperfect CSI, for which the lower and the upper bounds are derived. The ergodic WSR maximization problem is tackled based on the expected Weighted Minimum Mean Squared Error (WMMSE), which is guaranteed to converge to a local optimum. The effectiveness of the proposed design is investigated with extensive simulation results. It is shown that our robust design achieves significant performance gain compared to the naive beamforming approaches and considerably outperforms the robust Half-Duplex (HD) system. △ Less

Submitted 15 August, 2023; originally announced August 2023.

arXiv:2308.06300 [pdf]

Automatic Classification of Blood Cell Images Using Convolutional Neural Network

Authors: Rabia Asghar, Sanjay Kumar, Paul Hynds, Abeera Mahfooz

Abstract: Human blood primarily comprises plasma, red blood cells, white blood cells, and platelets. It plays a vital role in transporting nutrients to different organs, where it stores essential health-related data about the human body. Blood cells are utilized to defend the body against diverse infections, including fungi, viruses, and bacteria. Hence, blood analysis can help physicians assess an individu… ▽ More Human blood primarily comprises plasma, red blood cells, white blood cells, and platelets. It plays a vital role in transporting nutrients to different organs, where it stores essential health-related data about the human body. Blood cells are utilized to defend the body against diverse infections, including fungi, viruses, and bacteria. Hence, blood analysis can help physicians assess an individual's physiological condition. Blood cells have been sub-classified into eight groups: Neutrophils, eosinophils, basophils, lymphocytes, monocytes, immature granulocytes (promyelocytes, myelocytes, and metamyelocytes), erythroblasts, and platelets or thrombocytes on the basis of their nucleus, shape, and cytoplasm. Traditionally, pathologists and hematologists in laboratories have examined these blood cells using a microscope before manually classifying them. The manual approach is slower and more prone to human error. Therefore, it is essential to automate this process. In our paper, transfer learning with CNN pre-trained models. VGG16, VGG19, ResNet-50, ResNet-101, ResNet-152, InceptionV3, MobileNetV2, and DenseNet-20 applied to the PBC dataset's normal DIB. The overall accuracy achieved with these models lies between 91.375 and 94.72%. Hence, inspired by these pre-trained architectures, a model has been proposed to automatically classify the ten types of blood cells with increased accuracy. A novel CNN-based framework has been presented to improve accuracy. The proposed CNN model has been tested on the PBC dataset normal DIB. The outcomes of the experiments demonstrate that our CNN-based framework designed for blood cell classification attains an accuracy of 99.91% on the PBC dataset. Our proposed convolutional neural network model performs competitively when compared to earlier results reported in the literature. △ Less

Submitted 21 August, 2023; v1 submitted 11 August, 2023; originally announced August 2023.

Comments: 15

arXiv:2308.06296 [pdf]

Classification of White Blood Cells Using Machine and Deep Learning Models: A Systematic Review

Authors: Rabia Asghar, Sanjay Kumar, Paul Hynds, Arslan Shaukat

Abstract: Machine learning (ML) and deep learning (DL) models have been employed to significantly improve analyses of medical imagery, with these approaches used to enhance the accuracy of prediction and classification. Model predictions and classifications assist diagnoses of various cancers and tumors. This review presents an in-depth analysis of modern techniques applied within the domain of medical imag… ▽ More Machine learning (ML) and deep learning (DL) models have been employed to significantly improve analyses of medical imagery, with these approaches used to enhance the accuracy of prediction and classification. Model predictions and classifications assist diagnoses of various cancers and tumors. This review presents an in-depth analysis of modern techniques applied within the domain of medical image analysis for white blood cell classification. The methodologies that use blood smear images, magnetic resonance imaging (MRI), X-rays, and similar medical imaging domains are identified and discussed, with a detailed analysis of ML/DL techniques applied to the classification of white blood cells (WBCs) representing the primary focus of the review. The data utilized in this research has been extracted from a collection of 136 primary papers that were published between the years 2006 and 2023. The most widely used techniques and best-performing white blood cell classification methods are identified. While the use of ML and DL for white blood cell classification has concurrently increased and improved in recent year, significant challenges remain - 1) Availability of appropriate datasets remain the primary challenge, and may be resolved using data augmentation techniques. 2) Medical training of researchers is recommended to improve current understanding of white blood cell structure and subsequent selection of appropriate classification models. 3) Advanced DL networks including Generative Adversarial Networks, R-CNN, Fast R-CNN, and faster R-CNN will likely be increasingly employed to supplement or replace current techniques. △ Less

Submitted 21 August, 2023; v1 submitted 11 August, 2023; originally announced August 2023.

arXiv:2307.16676 [pdf, other]

doi 10.1109/IROS55552.2023.10342187

End-to-End Reinforcement Learning for Torque Based Variable Height Hopping

Authors: Raghav Soni, Daniel Harnack, Hauke Isermann, Sotaro Fushimi, Shivesh Kumar, Frank Kirchner

Abstract: Legged locomotion is arguably the most suited and versatile mode to deal with natural or unstructured terrains. Intensive research into dynamic walking and running controllers has recently yielded great advances, both in the optimal control and reinforcement learning (RL) literature. Hopping is a challenging dynamic task involving a flight phase and has the potential to increase the traversability… ▽ More Legged locomotion is arguably the most suited and versatile mode to deal with natural or unstructured terrains. Intensive research into dynamic walking and running controllers has recently yielded great advances, both in the optimal control and reinforcement learning (RL) literature. Hopping is a challenging dynamic task involving a flight phase and has the potential to increase the traversability of legged robots. Model based control for hopping typically relies on accurate detection of different jump phases, such as lift-off or touch down, and using different controllers for each phase. In this paper, we present a end-to-end RL based torque controller that learns to implicitly detect the relevant jump phases, removing the need to provide manual heuristics for state detection. We also extend a method for simulation to reality transfer of the learned controller to contact rich dynamic tasks, resulting in successful deployment on the robot after training without parameter tuning. △ Less

Submitted 18 December, 2023; v1 submitted 31 July, 2023; originally announced July 2023.

Comments: Update publication info. Cite as: R. Soni, D. Harnack, H. Isermann, S. Fushimi, S. Kumar and F. Kirchner, "End-to-End Reinforcement Learning for Torque Based Variable Height Hopping," 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 2023, pp. 7531-7538, doi: 10.1109/IROS55552.2023.10342187

Journal ref: End-to-End Reinforcement Learning for Torque Based Variable Height Hopping, 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 2023, pp. 7531-7538

arXiv:2307.16033 [pdf, other]

CoVid-19 Detection leveraging Vision Transformers and Explainable AI

Authors: Pangoth Santhosh Kumar, Kundrapu Supriya, Mallikharjuna Rao K, Taraka Satya Krishna Teja Malisetti

Abstract: Lung disease is a common health problem in many parts of the world. It is a significant risk to people health and quality of life all across the globe since it is responsible for five of the top thirty leading causes of death. Among them are COVID 19, pneumonia, and tuberculosis, to name just a few. It is critical to diagnose lung diseases in their early stages. Several different models including… ▽ More Lung disease is a common health problem in many parts of the world. It is a significant risk to people health and quality of life all across the globe since it is responsible for five of the top thirty leading causes of death. Among them are COVID 19, pneumonia, and tuberculosis, to name just a few. It is critical to diagnose lung diseases in their early stages. Several different models including machine learning and image processing have been developed for this purpose. The earlier a condition is diagnosed, the better the patient chances of making a full recovery and surviving into the long term. Thanks to deep learning algorithms, there is significant promise for the autonomous, rapid, and accurate identification of lung diseases based on medical imaging. Several different deep learning strategies, including convolutional neural networks (CNN), vanilla neural networks, visual geometry group based networks (VGG), and capsule networks , are used for the goal of making lung disease forecasts. The standard CNN has a poor performance when dealing with rotated, tilted, or other aberrant picture orientations. As a result of this, within the scope of this study, we have suggested a vision transformer based approach end to end framework for the diagnosis of lung disorders. In the architecture, data augmentation, training of the suggested models, and evaluation of the models are all included. For the purpose of detecting lung diseases such as pneumonia, Covid 19, lung opacity, and others, a specialised Compact Convolution Transformers (CCT) model have been tested and evaluated on datasets such as the Covid 19 Radiography Database. The model has achieved a better accuracy for both its training and validation purposes on the Covid 19 Radiography Database. △ Less

Submitted 6 May, 2024; v1 submitted 29 July, 2023; originally announced July 2023.

arXiv:2307.14164 [pdf, ps, other]

Towards Continuous Time Finite Horizon LQR Control in SE(3)

Authors: Shivesh Kumar, Andreas Mueller, Patrick Wensing, Frank Kirchner

Abstract: The control of free-floating robots requires dealing with several challenges. The motion of such robots evolves on a continuous manifold described by the Special Euclidean Group of dimension 3, known as SE(3). Methods from finite horizon Linear Quadratic Regulators (LQR) control have gained recent traction in the robotics community. However, such approaches are inherently solving an unconstrained… ▽ More The control of free-floating robots requires dealing with several challenges. The motion of such robots evolves on a continuous manifold described by the Special Euclidean Group of dimension 3, known as SE(3). Methods from finite horizon Linear Quadratic Regulators (LQR) control have gained recent traction in the robotics community. However, such approaches are inherently solving an unconstrained optimization problem and hence are unable to respect the manifold constraints imposed by the group structure of SE(3). This may lead to small errors, singularity problems and double cover issues depending on the choice of coordinates to model the floating base motion. In this paper, we propose the use of canonical exponential coordinates of SE(3) and the associated Exponential map along with its differentials to embed this structure in the theory of finite horizon LQR controllers. △ Less

Submitted 26 July, 2023; originally announced July 2023.

Comments: IEEE International Conference on Robotics and Automation (ICRA) 2023 Workshop on Geometric Representations The Roles of Modern Screw Theory, Lie algebra, and Geometric Algebra in Robotics

arXiv:2307.09425 [pdf, other]

Musical Excellence of Mridangam: an introductory review

Authors: Arvind Shankar Kumar

Abstract: This is an introductory review of Musical Excellence of Mridangam by Dr. Umayalpuram K Sivaraman, Dr. T Ramasami and Dr. Naresh, which is a scientific treatise exploring the unique tonal properties of the ancient Indian classical percussive instrument -- the Mridangam. This review aims to bridge the gap between the primary intended audience of Musical Excellence of Mridangam - listeners, artistes… ▽ More This is an introductory review of Musical Excellence of Mridangam by Dr. Umayalpuram K Sivaraman, Dr. T Ramasami and Dr. Naresh, which is a scientific treatise exploring the unique tonal properties of the ancient Indian classical percussive instrument -- the Mridangam. This review aims to bridge the gap between the primary intended audience of Musical Excellence of Mridangam - listeners, artistes and makers -- and the scientific rigour with which the original treatise is written, by first introducing the concepts of musical analysis and then presenting and explaining the discoveries made within this context. The first three chapters of this review introduce the basic scientific concepts used in Musical Excellence of Mridangam and provides background to previous scientific research into this instrument, starting from the seminal work of Dr. CV Raman. This also includes brief discussions of the corresponding chapters in Musical Excellence of Mridangam. The next chapters all serve the purpose of explaining the main scientific results presented in Musical Excellence of Mridangam in each of the corresponding chapters in the treatise, and finally summarizing the relevance of the work. △ Less

Submitted 11 July, 2023; originally announced July 2023.

arXiv:2307.07948 [pdf, ps, other]

Model Adaptation for ASR in low-resource Indian Languages

Authors: Abhayjeet Singh, Arjun Singh Mehta, Ashish Khuraishi K S, Deekshitha G, Gauri Date, Jai Nanavati, Jesuraja Bandekar, Karnalius Basumatary, Karthika P, Sandhya Badiger, Sathvik Udupa, Saurabh Kumar, Savitha, Prasanta Kumar Ghosh, Prashanthi V, Priyanka Pai, Raoul Nanavati, Rohan Saxena, Sai Praneeth Reddy Mora, Srinivasa Raghavan

Abstract: Automatic speech recognition (ASR) performance has improved drastically in recent years, mainly enabled by self-supervised learning (SSL) based acoustic models such as wav2vec2 and large-scale multi-lingual training like Whisper. A huge challenge still exists for low-resource languages where the availability of both audio and text is limited. This is further complicated by the presence of multiple… ▽ More Automatic speech recognition (ASR) performance has improved drastically in recent years, mainly enabled by self-supervised learning (SSL) based acoustic models such as wav2vec2 and large-scale multi-lingual training like Whisper. A huge challenge still exists for low-resource languages where the availability of both audio and text is limited. This is further complicated by the presence of multiple dialects like in Indian languages. However, many Indian languages can be grouped into the same families and share the same script and grammatical structure. This is where a lot of adaptation and fine-tuning techniques can be applied to overcome the low-resource nature of the data by utilising well-resourced similar languages. In such scenarios, it is important to understand the extent to which each modality, like acoustics and text, is important in building a reliable ASR. It could be the case that an abundance of acoustic data in a language reduces the need for large text-only corpora. Or, due to the availability of various pretrained acoustic models, the vice-versa could also be true. In this proposed special session, we encourage the community to explore these ideas with the data in two low-resource Indian languages of Bengali and Bhojpuri. These approaches are not limited to Indian languages, the solutions are potentially applicable to various languages spoken around the world. △ Less

Submitted 16 July, 2023; originally announced July 2023.

Comments: ASRU Special session overview paper

arXiv:2307.07414 [pdf, other]

An Embedded Auto-Calibrated Offset Current Compensation Technique for PPG/fNIRS System

Authors: Sadan Saquib Khan, Sumit Kumar, Benish Jan, Laxmeesha Somappa, Shahid Malik

Abstract: Usually, the current generated by the photodiode proportional to the oxygenated blood in the photoplethysmography (PPG) and functional infrared spectroscopy (fNIRS) based recording systems is small as compared to the offset-current. The offset current is the combination of the dark current of the photodiode, the current due to ambient light, and the current due to the reflected light from fat and… ▽ More Usually, the current generated by the photodiode proportional to the oxygenated blood in the photoplethysmography (PPG) and functional infrared spectroscopy (fNIRS) based recording systems is small as compared to the offset-current. The offset current is the combination of the dark current of the photodiode, the current due to ambient light, and the current due to the reflected light from fat and skull . The relatively large value of the offset current limits the amplification of the signal current and affects the overall performance of the PPG/fNIRS recording systems. In this paper, we present a mixed-signal auto-calibrated offset current compensation technique for PPG and fNIRS recording systems. The system auto-calibrates the offset current, compensates using a dual discrete loop technique, and amplifies the signal current. Thanks to the amplification, the system provides better sensitivity. A prototype of the system is built and tested for PPG signal recording. The prototype is developed for a 3.3 V single supply. The results show that the proposed system is able to effectively compensate for the offset current. △ Less

Submitted 14 July, 2023; originally announced July 2023.

arXiv:2307.07395 [pdf, other]

Flexible Beamforming in B5G for Improving Tethered UAV Coverage over Smart Environments

Authors: Abdu Saif, Nor Shahida Mohd Shah, Soreen Ameen Fattah, Saeed Hamood Alsamhi, Santosh Kumar, Ali Saad Al khuraib

Abstract: Unmanned Aerial Vehicles (UAVs) are being used for wireless communications in smart environments. However, the need for mobility, scalability of data transmission over wide areas, and the required coverage area make UAV beamforming essential for better coverage and user experience. To this end, we propose a flexible beamforming approach to improve tethered UAV coverage quality and maximize the num… ▽ More Unmanned Aerial Vehicles (UAVs) are being used for wireless communications in smart environments. However, the need for mobility, scalability of data transmission over wide areas, and the required coverage area make UAV beamforming essential for better coverage and user experience. To this end, we propose a flexible beamforming approach to improve tethered UAV coverage quality and maximize the number of users experiencing the minimum required rate in any target environment. Our solution demonstrates a significant achievement in flexible beamforming in smart environments, including urban, suburban, dense, and high-rise urban. Furthermore, the beamforming gain is mainly concentrated in the target to improve the coverage area based on various scenarios. Simulation results show that the proposed approach can achieve a significantly received flexible power beam that focuses the transmitted signal towards the receiver and improves received power by reducing signal power spread. In the case of no beamforming, signal power spreads out as distance increases, reducing the signal strength. Furthermore, our proposed solution is suitable for improving UAV coverage and reliability in smart and harsh environments. △ Less

Submitted 14 July, 2023; originally announced July 2023.

Comments: 6 pages, 7 figures

arXiv:2305.18419 [pdf, other]

Semantic Segmentation with Bidirectional Language Models Improves Long-form ASR

Authors: W. Ronny Huang, Hao Zhang, Shankar Kumar, Shuo-yiin Chang, Tara N. Sainath

Abstract: We propose a method of segmenting long-form speech by separating semantically complete sentences within the utterance. This prevents the ASR decoder from needlessly processing faraway context while also preventing it from missing relevant context within the current sentence. Semantically complete sentence boundaries are typically demarcated by punctuation in written text; but unfortunately, spoken… ▽ More We propose a method of segmenting long-form speech by separating semantically complete sentences within the utterance. This prevents the ASR decoder from needlessly processing faraway context while also preventing it from missing relevant context within the current sentence. Semantically complete sentence boundaries are typically demarcated by punctuation in written text; but unfortunately, spoken real-world utterances rarely contain punctuation. We address this limitation by distilling punctuation knowledge from a bidirectional teacher language model (LM) trained on written, punctuated text. We compare our segmenter, which is distilled from the LM teacher, against a segmenter distilled from a acoustic-pause-based teacher used in other works, on a streaming ASR pipeline. The pipeline with our segmenter achieves a 3.2% relative WER gain along with a 60 ms median end-of-segment latency reduction on a YouTube captioning task. △ Less

Submitted 28 May, 2023; originally announced May 2023.

Comments: Interspeech 2023. First 3 authors contributed equally

arXiv:2305.12712 [pdf, other]

doi 10.1109/INDICON56171.2022.10039921

LEAN: Light and Efficient Audio Classification Network

Authors: Shwetank Choudhary, CR Karthik, Punuru Sri Lakshmi, Sumit Kumar

Abstract: Over the past few years, audio classification task on large-scale dataset such as AudioSet has been an important research area. Several deeper Convolution-based Neural networks have shown compelling performance notably Vggish, YAMNet, and Pretrained Audio Neural Network (PANN). These models are available as pretrained architecture for transfer learning as well as specific audio task adoption. In t… ▽ More Over the past few years, audio classification task on large-scale dataset such as AudioSet has been an important research area. Several deeper Convolution-based Neural networks have shown compelling performance notably Vggish, YAMNet, and Pretrained Audio Neural Network (PANN). These models are available as pretrained architecture for transfer learning as well as specific audio task adoption. In this paper, we propose a lightweight on-device deep learning-based model for audio classification, LEAN. LEAN consists of a raw waveform-based temporal feature extractor called as Wave Encoder and logmel-based Pretrained YAMNet. We show that using a combination of trainable wave encoder, Pretrained YAMNet along with cross attention-based temporal realignment, results in competitive performance on downstream audio classification tasks with lesser memory footprints and hence making it suitable for resource constraints devices such as mobile, edge devices, etc . Our proposed system achieves on-device mean average precision(mAP) of .445 with a memory footprint of a mere 4.5MB on the FSD50K dataset which is an improvement of 22% over baseline on-device mAP on same dataset. △ Less

Submitted 22 May, 2023; originally announced May 2023.

Comments: Accepted at INDICON 2022

arXiv:2305.08373 [pdf, other]

doi 10.1109/LRA.2023.3269296

AcroMonk: A Minimalist Underactuated Brachiating Robot

Authors: Mahdi Javadi, Daniel Harnack, Paula Stocco, Shivesh Kumar, Shubham Vyas, Daniel Pizzutilo, Frank Kirchner

Abstract: Brachiation is a dynamic, coordinated swinging maneuver of body and arms used by monkeys and apes to move between branches. As a unique underactuated mode of locomotion, it is interesting to study from a robotics perspective since it can broaden the deployment scenarios for humanoids and animaloids. While several brachiating robots of varying complexity have been proposed in the past, this paper p… ▽ More Brachiation is a dynamic, coordinated swinging maneuver of body and arms used by monkeys and apes to move between branches. As a unique underactuated mode of locomotion, it is interesting to study from a robotics perspective since it can broaden the deployment scenarios for humanoids and animaloids. While several brachiating robots of varying complexity have been proposed in the past, this paper presents the simplest possible prototype of a brachiation robot, using only a single actuator and unactuated grippers. The novel passive gripper design allows it to snap on and release from monkey bars, while guaranteeing well defined start and end poses of the swing. The brachiation behavior is realized in three different ways, using trajectory optimization via direct collocation and stabilization by a model-based time-varying linear quadratic regulator (TVLQR) or model-free proportional derivative (PD) control, as well as by a reinforcement learning (RL) based control policy. The three control schemes are compared in terms of robustness to disturbances, mass uncertainty, and energy consumption. The system design and controllers have been open-sourced. Due to its minimal and open design, the system can serve as a canonical underactuated platform for education and research. △ Less

Submitted 15 May, 2023; originally announced May 2023.

Comments: The open-source implementation is available at https://github.com/dfki-ric-underactuated-lab/acromonk and a video demonstration of the experiments can be accessed at https://youtu.be/FIcDNtJo9Jc}

Journal ref: journal={IEEE Robotics and Automation Letters}, year={2023}, volume={8}, number={6}, pages={3637-3644}

Showing 1–50 of 141 results for author: Kumar, S