
arXiv:1902.07262 [pdf, other]

BusyHands: A Hand-Tool Interaction Database for Assembly Tasks Semantic Segmentation

Authors: Roy Shilkrot, Zhi Chai, Minh Hoai

Abstract: Visual segmentation has seen tremendous advancement recently with ready solutions for a wide variety of scene types, including human hands and other body parts. However, focus on segmentation of human hands while performing complex tasks, such as manual assembly, is still severely lacking. Segmenting hands from tools, work pieces, background and other body parts is extremely difficult because of self-occlusions and intricate hand grips and poses. In this paper we introduce BusyHands, a large open dataset of pixel-level annotated images of hands performing 13 different tool-based assembly tasks, from both real-world captures and virtual-world renderings. A total of 7906 samples are included in our first-in-kind dataset, with both RGB and depth images as obtained from a Kinect V2 camera and Blender. We evaluate several state-of-the-art semantic segmentation methods on our dataset as a proposed performance benchmark.

Submitted 19 February, 2019; originally announced February 2019.

Comments: 10 pages, 8 figures

arXiv:1812.11090 [pdf, other]

Enhanced Touchable Projector-depth System with Deep Hand Pose Estimation

Authors: Zhi Chai, Roy Shilkrot

Abstract: Touchable projection with structured light range cameras is a prolific medium for large interaction surfaces, affording multiple simultaneous users and simple, cheap setup. However, robust touch detection in such projector-depth systems is difficult to achieve due to measurement noise. We propose a novel combination of surface touch detection and a deep network for hand pose estimation, which aids in detecting both on- and above-surface hand gestures, disambiguating multiple touch fingers, as well as recovering fingertip positions in the face of noisy input. We present the details of our GPU-accelerated system and an evaluation of its performance, as well as applications such as an enhanced virtual keyboard that utilizes the added features.

Submitted 28 December, 2018; originally announced December 2018.

Comments: 9 pages, 15 figures

arXiv:1812.03415 [pdf, other]

doi 10.1109/ICASSP.2019.8682937

Increase Apparent Public Speaking Fluency By Speech Augmentation

Authors: Sagnik Das, Nisha Gandhi, Tejas Naik, Roy Shilkrot

Abstract: Fluent and confident speech is desirable to every speaker. But professional speech delivery requires a great deal of experience and practice. In this paper, we propose a speech stream manipulation system which can help non-professional speakers to produce fluent, professional-like speech content, in turn contributing towards better listener engagement and comprehension. We propose to achieve this task by manipulating the disfluencies in human speech, like the sounds 'uh' and 'um', the filler words and awkward long silences. Given any unrehearsed speech we segment and silence the filled pauses and doctor the duration of imposed silence as well as other long pauses ('disfluent') by a predictive model learned using a professional speech dataset. Finally, we output an audio stream in which the speaker sounds more fluent, confident and practiced compared to the original speech he/she recorded. According to our quantitative evaluation, we significantly increase the fluency of speech by reducing the rate of pauses and fillers.

Submitted 3 August, 2019; v1 submitted 8 December, 2018; originally announced December 2018.

Journal ref: 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Showing 1–3 of 3 results for author: Shilkrot, R