Soundscape Captioning using Sound Affective Quality Network and Large Language Model

Authors: Yuanbo Hou, Qiaoqiao Ren, Andrew Mitchell, Wenwu Wang, Jian Kang, Tony Belpaeme, Dick Botteldooren

Abstract: We live in a rich and varied acoustic world, which is experienced by individuals or communities as a soundscape. Computational auditory scene analysis, disentangling acoustic scenes by detecting and classifying events, focuses on objective attributes of sounds, such as their category and temporal characteristics, ignoring the effect of sounds on people and failing to explore the relationship between sounds and the emotions they evoke within a context. To fill this gap and to automate soundscape analysis, which traditionally relies on labour-intensive subjective ratings and surveys, we propose the soundscape captioning (SoundSCap) task. SoundSCap generates context-aware soundscape descriptions by capturing the acoustic scene, event information, and the corresponding human affective qualities. To this end, we propose an automatic soundscape captioner (SoundSCaper) composed of an acoustic model, SoundAQnet, and a general large language model (LLM). SoundAQnet simultaneously models multi-scale information about acoustic scenes, events, and perceived affective qualities, while LLM generates soundscape captions by parsing the information captured by SoundAQnet to a common language. The soundscape caption's quality is assessed by a jury of 16 audio/soundscape experts. The average score (out of 5) of SoundSCaper-generated captions is lower than the score of captions generated by two soundscape experts by 0.21 and 0.25, respectively, on the evaluation set and the model-unknown mixed external dataset with varying lengths and acoustic properties, but the differences are not statistically significant. Overall, SoundSCaper-generated captions show promising performance compared to captions annotated by soundscape experts. The models' code, LLM scripts, human assessment data and instructions, and expert evaluation statistics are all publicly available.

Submitted 9 June, 2024; originally announced June 2024.

Comments: Code: https://github.com/Yuanbo2020/SoundSCaper

arXiv:2405.09708 [pdf, ps, other]

doi 10.1109/LRA.2024.3401117

No More Mumbles: Enhancing Robot Intelligibility through Speech Adaptation

Authors: Qiaoqiao Ren, Yuanbo Hou, Dick Botteldooren, Tony Belpaeme

Abstract: Spoken language interaction is at the heart of interpersonal communication, and people flexibly adapt their speech to different individuals and environments. It is surprising that robots, and by extension other digital devices, are not equipped to adapt their speech and instead rely on fixed speech parameters, which often hinder comprehension by the user. We conducted a speech comprehension study involving 39 participants who were exposed to different environmental and contextual conditions. During the experiment, the robot articulated words using different vocal parameters, and the participants were tasked with both recognising the spoken words and rating their subjective impression of the robot's speech. The experiment's primary outcome shows that spaces with good acoustic quality positively correlate with intelligibility and user experience. However, increasing the distance between the user and the robot degraded the user experience, while distracting background sounds significantly reduced speech recognition accuracy and user satisfaction. We next built an adaptive voice for the robot. For this, the robot needs to know how difficult it is for a user to understand spoken language in a particular setting. We present a prediction model that rates how annoying the ambient acoustic environment is and, consequently, how hard it is to understand someone in this setting. Then, we develop a convolutional neural network model to adapt the robot's speech parameters to different users and spaces, while taking into account the influence of ambient acoustics on intelligibility. Finally, we present an evaluation with 27 users, demonstrating superior intelligibility and user experience with adaptive voice parameters compared to fixed voice.

Submitted 15 May, 2024; originally announced May 2024.

Comments: IEEE Robotics and Automation Letters (IEEE RAL)

arXiv:2404.17394 [pdf, other]

Child Speech Recognition in Human-Robot Interaction: Problem Solved?

Authors: Ruben Janssens, Eva Verhelst, Giulio Antonio Abbo, Qiaoqiao Ren, Maria Jose Pinto Bernal, Tony Belpaeme

Abstract: Automated Speech Recognition shows superhuman performance for adult English speech on a range of benchmarks, but disappoints when fed children's speech. This has long sat in the way of child-robot interaction. Recent evolutions in data-driven speech recognition, including the availability of Transformer architectures and unprecedented volumes of training data, might mean a breakthrough for child speech recognition and social robot applications aimed at children. We revisit a study on child speech recognition from 2017 and show that indeed performance has increased, with newcomer OpenAI Whisper doing markedly better than leading commercial cloud services. While transcription is not perfect yet, the best model recognises 60.3% of sentences correctly barring small grammatical differences, with sub-second transcription time running on a local GPU, showing potential for usable autonomous child-robot speech interactions.

Submitted 26 April, 2024; originally announced April 2024.

Comments: Presented at 2024 International Symposium on Technological Advances in Human-Robot Interaction

arXiv:2311.08957 [pdf, other]

I Was Blind but Now I See: Implementing Vision-Enabled Dialogue in Social Robots

Authors: Giulio Antonio Abbo, Tony Belpaeme

Abstract: In the rapidly evolving landscape of human-computer interaction, the integration of vision capabilities into conversational agents stands as a crucial advancement. This paper presents an initial implementation of a dialogue manager that leverages the latest progress in Large Language Models (e.g., GPT-4, IDEFICS) to enhance the traditional text-based prompts with real-time visual input. LLMs are used to interpret both textual prompts and visual stimuli, creating a more contextually aware conversational agent. The system's prompt engineering, incorporating dialogue with summarisation of the images, ensures a balance between context preservation and computational efficiency. Six interactions with a Furhat robot powered by this system are reported, illustrating and discussing the results obtained. By implementing this vision-enabled dialogue system, the paper envisions a future where conversational agents seamlessly blend textual and visual modalities, enabling richer, more context-aware dialogues.

Submitted 15 November, 2023; originally announced November 2023.

Comments: 8 pages, 3 figures

arXiv:2211.00730 [pdf, other]

Tactile interaction with a robot leads to increased risk-taking

Authors: Qiaoqiao Ren, Tony Belpaeme

Abstract: Tactile interaction plays a crucial role in interactions between people. Touch can, for example, help people calm down and lower physiological stress responses. Consequently, it is believed that tactile and haptic interaction matter also in human-robot interaction. We study if the intensity of the tactile interaction has an impact on people, and do so by studying whether different intensities of tactile interaction modulate physiological measures and task performance. We use a paradigm in which a small humanoid robot is used to encourage risk-taking behaviour, relying on peer encouragement to take more risks which might lead to a higher pay-off, but potentially also to higher losses. For this, the Balloon Analogue Risk Task (BART) is used as a proxy for the propensity to take risks. We study four conditions, one control condition in which the task is completed without a robot, and three experimental conditions in which a robot is present that encourages risk-taking behaviour with different degrees of tactile interaction. The results show that both low-intensity and high-intensity tactile interaction increase people's risk-taking behaviour. However, low-intensity tactile interaction increases comfort and lowers stress, whereas high-intensity touch does not.

Submitted 1 November, 2022; originally announced November 2022.

Comments: 10 pages, 5 figures, conference

MSC Class: International conference of social robotics

arXiv:2210.11161 [pdf, ps, other]

From Modelling to Understanding Children's Behaviour in the Context of Robotics and Social Artificial Intelligence

Authors: Serge Thill, Vicky Charisi, Tony Belpaeme, Ana Paiva

Abstract: Understanding and modelling children's cognitive processes and their behaviour in the context of their interaction with robots and social artificial intelligence systems is a fundamental prerequisite for meaningful and effective robot interventions. However, children's development involves complex faculties such as exploration, creativity and curiosity which are challenging to model. Also, children often express themselves in a playful way which is different from typical adult behaviour. Different children also have different needs, and it remains a challenge in the current state of the art that those of neurodiverse children are under-addressed. With this workshop, we aim to promote a common ground among different disciplines such as developmental sciences, artificial intelligence and social robotics and discuss cutting-edge research in the area of user modelling and adaptive systems for children.

Submitted 20 October, 2022; originally announced October 2022.

Comments: Accepted proposal for a workshop to be held in conjunction with the 14th International Conference on Social Robotics (ICSR'22)

arXiv:2108.05709 [pdf, other]

To Rate or Not To Rate: Investigating Evaluation Methods for Generated Co-Speech Gestures

Authors: Pieter Wolfert, Jeffrey M. Girard, Taras Kucherenko, Tony Belpaeme

Abstract: While automatic performance metrics are crucial for machine learning of artificial human-like behaviour, the gold standard for evaluation remains human judgement. The subjective evaluation of artificial human-like behaviour in embodied conversational agents is however expensive and little is known about the quality of the data it returns. Two approaches to subjective evaluation can be largely distinguished, one relying on ratings, the other on pairwise comparisons. In this study we use co-speech gestures to compare the two against each other and answer questions about their appropriateness for evaluation of artificial behaviour. We consider their ability to rate quality, but also aspects pertaining to the effort of use and the time required to collect subjective data. We use crowd sourcing to rate the quality of co-speech gestures in avatars, assessing which method picks up more detail in subjective assessments. We compared gestures generated by three different machine learning models with various levels of behavioural quality. We found that both approaches were able to rank the videos according to quality and that the rankings were significantly correlated, showing that in terms of quality there is no preference of one method over the other. We also found that pairwise comparisons were slightly faster and came with improved inter-rater reliability, suggesting that for small-scale studies pairwise comparisons are to be favoured over ratings.

Submitted 13 August, 2021; v1 submitted 12 August, 2021; originally announced August 2021.

Comments: accepted for publication at International Conference for Multimodal Interaction (ICMI'21)

arXiv:2101.03769 [pdf, ps, other]

doi 10.1109/THMS.2022.3149173

A Review of Evaluation Practices of Gesture Generation in Embodied Conversational Agents

Authors: Pieter Wolfert, Nicole Robinson, Tony Belpaeme

Abstract: Embodied conversational agents (ECA) are often designed to produce nonverbal behavior to complement or enhance their verbal communication. One such form of nonverbal behavior is co-speech gesturing, which involves movements that the agent makes with its arms and hands that are paired with verbal communication. Co-speech gestures for ECAs can be created using different generation methods, divided into rule-based and data-driven processes, with the latter gaining traction because of the increasing interest from the applied machine learning community. However, reports on gesture generation methods use a variety of evaluation measures, which hinders comparison. To address this, we present a systematic review on co-speech gesture generation methods for iconic, metaphoric, deictic, and beat gestures, including reported evaluation methods. We review 22 studies that have an ECA with a human-like upper body that uses co-speech gesturing in social human-agent interaction. This includes studies that use human participants to evaluate performance. We found most studies use a within-subject design and rely on a form of subjective evaluation, but without a systematic approach. We argue that the field requires more rigorous and uniform tools for co-speech gesture evaluation, and formulate recommendations for empirical evaluation, including standardized phrases and example scenarios to help systematically test generative models across studies. Furthermore, we also propose a checklist that can be used to report relevant information for the evaluation of generative models, as well as to evaluate co-speech gesture use.

Submitted 1 March, 2022; v1 submitted 11 January, 2021; originally announced January 2021.

Comments: 11 pages, accepted for publication in IEEE Transactions on Human-Machine Systems

arXiv:1909.04747 [pdf, other]

Recognizing Human Internal States: A Conceptor-Based Approach

Authors: Madeleine Bartlett, Daniel Hernandez Garcia, Serge Thill, Tony Belpaeme

Abstract: The past few decades have seen increased interest in the application of social robots to interventions for Autism Spectrum Disorder as behavioural coaches [4]. We consider that robots embedded in therapies could also provide quantitative diagnostic information by observing patient behaviours. The social nature of ASD symptoms means that, to achieve this, robots need to be able to recognize the internal states their human interaction partners are experiencing, e.g. states of confusion, engagement etc. Approaching this problem can be broken down into two questions: (1) what information, accessible to robots, can be used to recognize internal states, and (2) how can a system classify internal states such that it allows for sufficiently detailed diagnostic information? In this paper we discuss these two questions in depth and propose a novel, conceptor-based classifier. We report the initial results of this system in a proof-of-concept study and outline plans for future work.

Submitted 9 September, 2019; originally announced September 2019.

Comments: 4 pages, 1 figure, HRI conference workshop

Report number: SREC/2019/04

arXiv:1812.07613 [pdf]

Proceedings of the Workshop on Social Robots in Therapy: Focusing on Autonomy and Ethical Challenges

Authors: Pablo G. Esteban, Daniel Hernández García, Hee Rin Lee, Pauline Chevalier, Paul Baxter, Cindy L. Bethel, Jainendra Shukla, Joan Oliver, Domènec Puig, Jason R. Wilson, Linda Tickle-Degnen, Madeleine Bartlett, Tony Belpaeme, Serge Thill, Kim Baraka, Francisco S. Melo, Manuela Veloso, David Becerra, Maja Matarić, Eduard Fosch-Villaronga, Jordi Albo-Canals, Gloria Beraldo, Emanuele Menegatti, Valentina De Tommasi, Roberto Mancin, et al. (13 additional authors not shown)

Abstract: Robot-Assisted Therapy (RAT) has successfully been used in HRI research by including social robots in health-care interventions by virtue of their ability to engage human users in both social and emotional dimensions. Research projects on this topic exist all over the globe in the USA, Europe, and Asia. All of these projects have the overall ambitious goal to increase the well-being of a vulnerable population. Typical work in RAT is performed using remote-controlled robots, a technique called Wizard-of-Oz (WoZ). The robot is usually controlled, unbeknownst to the patient, by a human operator. However, WoZ has been demonstrated to not be a sustainable technique in the long term. Providing the robots with autonomy (while remaining under the supervision of the therapist) has the potential to lighten the therapist's burden, not only in the therapeutic session itself but also in longer-term diagnostic tasks. Therefore, there is a need for exploring several degrees of autonomy in social robots used in therapy. Increasing the autonomy of robots might also bring about a new set of challenges. In particular, there will be a need to answer new ethical questions regarding the use of robots with a vulnerable population, as well as a need to ensure ethically-compliant robot behaviours. Therefore, in this workshop we want to gather findings and explore which degree of autonomy might help to improve health-care interventions and how we can overcome the ethical challenges inherent to it.

Submitted 18 December, 2018; originally announced December 2018.

Comments: 25 pages, editors for the proceedings: Pablo G. Esteban, Daniel Hernández García, Hee Rin Lee, Pauline Chevalier, Paul Baxter, Cindy Bethel

arXiv:1712.02421 [pdf, other]

The Free-play Sandbox: a Methodology for the Evaluation of Social Robotics and a Dataset of Social Interactions

Authors: Séverin Lemaignan, Charlotte Edmunds, Emmanuel Senft, Tony Belpaeme

Abstract: Evaluating human-robot social interactions in a rigorous manner is notoriously difficult: studies are either conducted in labs with constrained protocols to allow for robust measurements and a degree of replicability, but at the cost of ecological validity; or in the wild, which leads to superior experimental realism, but often with limited replicability and at the expense of rigorous interaction metrics. We introduce a novel interaction paradigm, designed to elicit rich and varied social interactions while having desirable scientific properties (replicability, clear metrics, possibility of either autonomous or Wizard-of-Oz robot behaviours). This paradigm focuses on child-robot interactions, and builds on a sandboxed free-play environment. We present the rationale and design of the interaction paradigm, its methodological and technical aspects (including the open-source implementation of the software platform), as well as two large open datasets acquired with this paradigm, and meant to act as experimental baselines for future research.

Submitted 6 December, 2017; originally announced December 2017.

Comments: conference paper

Showing 1–11 of 11 results for author: Belpaeme, T