Showing 1–23 of 23 results for author: Adavanne, S

  1. arXiv:2306.09126

    cs.SD cs.CV cs.MM eess.AS eess.IV

    STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

    Authors: Kazuki Shimada, Archontis Politis, Parthasaarathy Sudarsanam, Daniel Krause, Kengo Uchida, Sharath Adavanne, Aapo Hakala, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Tuomas Virtanen, Yuki Mitsufuji

    Abstract: While direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded in a microphone array, sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker. This paper proposes an audio-visual sound event localization and detection (SELD) task, which uses multichannel audio and video information…

    Submitted 14 November, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

    Comments: 27 pages, 9 figures, accepted for publication in NeurIPS 2023 Track on Datasets and Benchmarks

  2. arXiv:2211.02336

    cs.SD eess.AS

    Improving Speech Prosody of Audiobook Text-to-Speech Synthesis with Acoustic and Textual Contexts

    Authors: Detai Xin, Sharath Adavanne, Federico Ang, Ashish Kulkarni, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: We present a multi-speaker Japanese audiobook text-to-speech (TTS) system that leverages multimodal context information of preceding acoustic context and bilateral textual context to improve the prosody of synthetic speech. Previous work either uses unilateral or single-modality context, which does not fully represent the context information. The proposed method uses an acoustic context encoder an…

    Submitted 4 November, 2022; originally announced November 2022.

  3. arXiv:2206.04305

    eess.AS cs.CL cs.SD

    Context-based out-of-vocabulary word recovery for ASR systems in Indian languages

    Authors: Arun Baby, Saranya Vinnaitherthan, Akhil Kerhalkar, Pranav Jawale, Sharath Adavanne, Nagaraj Adiga

    Abstract: Detecting and recovering out-of-vocabulary (OOV) words is always challenging for Automatic Speech Recognition (ASR) systems. Many existing methods focus on modeling OOV words by modifying acoustic and language models and integrating context words cleverly into models. To train such complex models, we need a large amount of data with context words, additional training time, and increased model size…

    Submitted 9 June, 2022; originally announced June 2022.

    Comments: 12 pages

  4. arXiv:2206.01948

    eess.AS cs.SD

    STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events

    Authors: Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, Tuomas Virtanen

    Abstract: This report presents the Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) dataset for sound event localization and detection, comprised of spatial recordings of real scenes collected in various interiors of two different sites. The dataset is captured with a high resolution spherical microphone array and delivered in two 4-channel formats, first-order Ambisonics and tetrahedral microphone arr…

    Submitted 2 September, 2022; v1 submitted 4 June, 2022; originally announced June 2022.

  5. arXiv:2111.00030

    eess.AS cs.SD

    Differentiable Tracking-Based Training of Deep Learning Sound Source Localizers

    Authors: Sharath Adavanne, Archontis Politis, Tuomas Virtanen

    Abstract: Data-based and learning-based sound source localization (SSL) has shown promising results in challenging conditions, and is commonly set as a classification or a regression problem. Regression-based approaches have certain advantages over classification-based, such as continuous direction-of-arrival estimation of static and moving sources. However, multi-source scenarios require multiple regressor…

    Submitted 29 October, 2021; originally announced November 2021.

    Comments: Submitted to IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA2021)

  6. arXiv:2106.10870

    eess.AS cs.CL cs.SD

    Non-native English lexicon creation for bilingual speech synthesis

    Authors: Arun Baby, Pranav Jawale, Saranya Vinnaitherthan, Sumukh Badam, Nagaraj Adiga, Sharath Adavanne

    Abstract: Bilingual English speakers speak English as one of their languages. Their English is of a non-native kind, and their conversations are of a code-mixed fashion. The intelligibility of a bilingual text-to-speech (TTS) system for such non-native English speakers depends on a lexicon that captures the phoneme sequence used by non-native speakers. However, due to the lack of non-native English lexicon,…

    Submitted 21 June, 2021; originally announced June 2021.

    Comments: Accepted for Presentation at Speech Synthesis Workshop (SSW), 2021 (August 2021)

  7. arXiv:2106.06999

    eess.AS cs.SD

    A Dataset of Dynamic Reverberant Sound Scenes with Directional Interferers for Sound Event Localization and Detection

    Authors: Archontis Politis, Sharath Adavanne, Daniel Krause, Antoine Deleforge, Prerak Srivastava, Tuomas Virtanen

    Abstract: This report presents the dataset and baseline of Task 3 of the DCASE2021 Challenge on Sound Event Localization and Detection (SELD). The dataset is based on emulation of real recordings of static or moving sound events under real conditions of reverberation and ambient noise, using spatial room impulse responses captured in a variety of rooms and delivered in two spatial formats. The acoustical sy…

    Submitted 4 July, 2021; v1 submitted 13 June, 2021; originally announced June 2021.

  8. Overview and Evaluation of Sound Event Localization and Detection in DCASE 2019

    Authors: Archontis Politis, Annamaria Mesaros, Sharath Adavanne, Toni Heittola, Tuomas Virtanen

    Abstract: Sound event localization and detection is a novel area of research that emerged from the combined interest of analyzing the acoustic scene in terms of the spatial and temporal activity of sounds of interest. This paper presents an overview of the first international evaluation on sound event localization and detection, organized as a task of the DCASE 2019 Challenge. A large-scale realistic datase…

    Submitted 11 January, 2021; v1 submitted 6 September, 2020; originally announced September 2020.

  9. arXiv:2006.01919

    eess.AS cs.SD

    A Dataset of Reverberant Spatial Sound Scenes with Moving Sources for Sound Event Localization and Detection

    Authors: Archontis Politis, Sharath Adavanne, Tuomas Virtanen

    Abstract: This report presents the dataset and the evaluation setup of the Sound Event Localization & Detection (SELD) task for the DCASE 2020 Challenge. The SELD task refers to the problem of trying to simultaneously classify a known set of sound event classes, detect their temporal activations, and estimate their spatial directions or locations while they are active. To train and test SELD systems, datase…

    Submitted 27 June, 2020; v1 submitted 2 June, 2020; originally announced June 2020.

  10. arXiv:2006.01463

    cs.SD eess.AS

    An ASR Guided Speech Intelligibility Measure for TTS Model Selection

    Authors: Arun Baby, Saranya Vinnaitherthan, Nagaraj Adiga, Pranav Jawale, Sumukh Badam, Sharath Adavanne, Srikanth Konjeti

    Abstract: The perceptual quality of neural text-to-speech (TTS) is highly dependent on the choice of the model during training. Selecting the model using a training-objective metric such as the least mean squared error does not always correlate with human perception. In this paper, we propose an objective metric based on the phone error rate (PER) to select the TTS model with the best speech intelligibility…

    Submitted 2 June, 2020; originally announced June 2020.

    Comments: Submitted to INTERSPEECH 2020

  11. arXiv:1905.08546

    cs.SD eess.AS

    A multi-room reverberant dataset for sound event localization and detection

    Authors: Sharath Adavanne, Archontis Politis, Tuomas Virtanen

    Abstract: This paper presents the sound event localization and detection (SELD) task setup for the DCASE 2019 challenge. The goal of the SELD task is to detect the temporal activities of a known set of sound event classes, and further localize them in space when active. As part of the challenge, a synthesized dataset with each sound event associated with a spatial coordinate represented using azimuth and el…

    Submitted 24 May, 2019; v1 submitted 21 May, 2019; originally announced May 2019.

  12. arXiv:1904.12769

    cs.SD cs.LG eess.AS

    Localization, Detection and Tracking of Multiple Moving Sound Sources with a Convolutional Recurrent Neural Network

    Authors: Sharath Adavanne, Archontis Politis, Tuomas Virtanen

    Abstract: This paper investigates the joint localization, detection, and tracking of sound events using a convolutional recurrent neural network (CRNN). We use a CRNN previously proposed for the localization and detection of stationary sources, and show that the recurrent layers enable the spatial tracking of moving sources when trained with dynamic scenes. The tracking performance of the CRNN is compared w…

    Submitted 29 April, 2019; originally announced April 2019.

  13. Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks

    Authors: Sharath Adavanne, Archontis Politis, Joonas Nikunen, Tuomas Virtanen

    Abstract: In this paper, we propose a convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3D) space. The proposed network takes a sequence of consecutive spectrogram time-frames as input and maps it to two outputs in parallel. As the first output, the sound event detection (SED) is performed as a multi-labe…

    Submitted 17 December, 2018; v1 submitted 30 June, 2018; originally announced July 2018.

    Comments: Published in Journal of Selected Topics in Signal Processing 2018

  14. arXiv:1801.09522

    cs.SD cs.LG eess.AS

    Multichannel Sound Event Detection Using 3D Convolutional Neural Networks for Learning Inter-channel Features

    Authors: Sharath Adavanne, Archontis Politis, Tuomas Virtanen

    Abstract: In this paper, we propose a stacked convolutional and recurrent neural network (CRNN) with a 3D convolutional neural network (CNN) in the first layer for the multichannel sound event detection (SED) task. The 3D CNN enables the network to simultaneously learn the inter- and intra-channel features from the input multichannel audio. In order to evaluate the proposed method, multichannel audio datase…

    Submitted 29 January, 2018; originally announced January 2018.

  15. arXiv:1710.10059

    cs.SD cs.LG eess.AS

    Direction of arrival estimation for multiple sound sources using convolutional recurrent neural network

    Authors: Sharath Adavanne, Archontis Politis, Tuomas Virtanen

    Abstract: This paper proposes a deep neural network for estimating the directions of arrival (DOA) of multiple sound sources. The proposed stacked convolutional and recurrent neural network (DOAnet) generates a spatial pseudo-spectrum (SPS) along with the DOA estimates in both azimuth and elevation. We avoid any explicit feature extraction step by using the magnitudes and phases of the spectrograms of all t…

    Submitted 5 August, 2018; v1 submitted 27 October, 2017; originally announced October 2017.

    Comments: EUSIPCO 2018

  16. arXiv:1710.02998

    cs.SD eess.AS

    Sound event detection using weakly labeled dataset with stacked convolutional and recurrent neural network

    Authors: Sharath Adavanne, Tuomas Virtanen

    Abstract: This paper proposes a neural network architecture and training scheme to learn the start and end time of sound events (strong labels) in an audio recording given just the list of sound events existing in the audio without time information (weak labels). We achieve this by using a stacked convolutional and recurrent neural network with two prediction layers in sequence one for the strong followed b…

    Submitted 9 October, 2017; originally announced October 2017.

    Comments: Accepted in Detection and Classification of Acoustic Scenes and Events (DCASE 2017)

  17. arXiv:1710.02997

    cs.SD eess.AS

    A report on sound event detection with different binaural features

    Authors: Sharath Adavanne, Tuomas Virtanen

    Abstract: In this paper, we compare the performance of using binaural audio features in place of single-channel features for sound event detection. Three different binaural features are studied and evaluated on the publicly available TUT Sound Events 2017 dataset of length 70 minutes. Sound event detection is performed separately with single-channel and binaural features using stacked convolutional and recu…

    Submitted 9 October, 2017; originally announced October 2017.

    Comments: Technical report for the top performing method in Task 3: Real life sound event detection challenge, at Detection and classification of acoustic scene and events (DCASE) 2017

  18. arXiv:1706.10006

    cs.SD cs.CL cs.LG

    Automated Audio Captioning with Recurrent Neural Networks

    Authors: Konstantinos Drossos, Sharath Adavanne, Tuomas Virtanen

    Abstract: We present the first approach to automated audio captioning. We employ an encoder-decoder scheme with an alignment model in between. The input to the encoder is a sequence of log mel-band energies calculated from an audio file, while the output is a sequence of words, i.e. a caption. The encoder is a multi-layered, bi-directional gated recurrent unit (GRU) and the decoder a multi-layered GRU with…

    Submitted 24 October, 2017; v1 submitted 29 June, 2017; originally announced June 2017.

    Comments: Presented at the 11th IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017

  19. arXiv:1706.02293

    cs.SD cs.LG

    Sound Event Detection in Multichannel Audio Using Spatial and Harmonic Features

    Authors: Sharath Adavanne, Giambattista Parascandolo, Pasi Pertilä, Toni Heittola, Tuomas Virtanen

    Abstract: In this paper, we propose the use of spatial and harmonic features in combination with long short term memory (LSTM) recurrent neural network (RNN) for automatic sound event detection (SED) task. Real life sound recordings typically have many overlapping sound events, making it hard to recognize with just mono channel audio. Human listeners have been successfully recognizing the mixture of overlap…

    Submitted 7 June, 2017; originally announced June 2017.

  20. arXiv:1706.02292

    cs.SD cs.LG

    Stacked Convolutional and Recurrent Neural Networks for Music Emotion Recognition

    Authors: Miroslav Malik, Sharath Adavanne, Konstantinos Drossos, Tuomas Virtanen, Dasa Ticha, Roman Jarina

    Abstract: This paper studies the emotion recognition from musical tracks in the 2-dimensional valence-arousal (V-A) emotional space. We propose a method based on convolutional (CNN) and recurrent neural networks (RNN), having significantly fewer parameters compared with the state-of-the-art method for the same task. We utilize one CNN layer followed by two branches of RNNs trained separately for arousal and…

    Submitted 7 June, 2017; originally announced June 2017.

    Comments: Accepted for Sound and Music Computing (SMC 2017)

  21. arXiv:1706.02291

    cs.SD cs.LG

    Sound Event Detection Using Spatial Features and Convolutional Recurrent Neural Network

    Authors: Sharath Adavanne, Pasi Pertilä, Tuomas Virtanen

    Abstract: This paper proposes to use low-level spatial features extracted from multichannel audio for sound event detection. We extend the convolutional recurrent neural network to handle more than one type of these multichannel features by learning from each of them separately in the initial stages. We show that instead of concatenating the features of each channel into a single feature vector the network…

    Submitted 7 June, 2017; originally announced June 2017.

    Comments: Accepted for IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017)

  22. arXiv:1706.02047

    cs.SD cs.LG

    Stacked Convolutional and Recurrent Neural Networks for Bird Audio Detection

    Authors: Sharath Adavanne, Konstantinos Drossos, Emre Çakır, Tuomas Virtanen

    Abstract: This paper studies the detection of bird calls in audio segments using stacked convolutional and recurrent neural networks. Data augmentation by blocks mixing and domain adaptation using a novel method of test mixing are proposed and evaluated in regard to making the method robust to unseen data. The contributions of two kinds of acoustic features (dominant frequency and log mel-band energy) and t…

    Submitted 7 June, 2017; originally announced June 2017.

    Comments: Accepted for European Signal Processing Conference 2017

  23. arXiv:1703.02317

    cs.SD cs.LG stat.ML

    Convolutional Recurrent Neural Networks for Bird Audio Detection

    Authors: Emre Çakır, Sharath Adavanne, Giambattista Parascandolo, Konstantinos Drossos, Tuomas Virtanen

    Abstract: Bird sounds possess distinctive spectral structure which may exhibit small shifts in spectrum depending on the bird species and environmental conditions. In this paper, we propose using convolutional recurrent neural networks on the task of automated bird audio detection in real-life environments. In the proposed method, convolutional layers extract high dimensional, local frequency shift invarian…

    Submitted 7 March, 2017; originally announced March 2017.

    Comments: Submitted to EUSIPCO 2017 Special Session on Bird Audio Signal Processing