Acknowledgment of Emotional States: Generating Validating Responses for Empathetic Dialogue

Authors: Zi Haur Pang, Yahui Fu, Divesh Lala, Keiko Ochi, Koji Inoue, Tatsuya Kawahara

Abstract: In the realm of human-AI dialogue, the facilitation of empathetic responses is important. Validation is one of the key communication techniques in psychology, which entails recognizing, understanding, and acknowledging others' emotional states, thoughts, and actions. This study introduces the first framework designed to engender empathetic dialogue with validating responses. Our approach incorporates a tripartite module system: 1) validation timing detection, 2) users' emotional state identification, and 3) validating response generation. Utilizing the Japanese EmpatheticDialogues dataset - a textual-based dialogue dataset consisting of 8 emotional categories from Plutchik's wheel of emotions - the Task Adaptive Pre-Training (TAPT) BERT-based model outperforms both the random baseline and ChatGPT, in terms of F1-score, in all modules. The model's efficacy is further confirmed by its application to the TUT Emotional Storytelling Corpus (TESC), a speech-based dialogue dataset, where it surpasses both the random baseline and ChatGPT. This consistent performance across both textual and speech-based dialogues underscores the effectiveness of our framework in fostering empathetic human-AI communication.

Submitted 20 February, 2024; originally announced February 2024.

Comments: This paper has been accepted for presentation at International Workshop on Spoken Dialogue Systems Technology 2024 (IWSDS 2024)

arXiv:2402.01509 [pdf, other]

Advancing Brain Tumor Inpainting with Generative Models

Authors: Ruizhi Zhu, Xinru Zhang, Haowen Pang, Chundan Xu, Chuyang Ye

Abstract: Synthesizing healthy brain scans from diseased brain scans offers a potential solution to address the limitations of general-purpose algorithms, such as tissue segmentation and brain extraction algorithms, which may not effectively handle diseased images. We consider this a 3D inpainting task and investigate the adaptation of 2D inpainting methods to meet the requirements of 3D magnetic resonance imaging (MRI) data. Our contributions encompass potential modifications tailored to MRI-specific needs, and we conducted evaluations of multiple inpainting techniques using the BraTS2023 Inpainting datasets to assess their efficacy and limitations.

Submitted 2 February, 2024; originally announced February 2024.

arXiv:2312.08704 [pdf, other]

PairingNet: A Learning-based Pair-searching and -matching Network for Image Fragments

Authors: Rixin Zhou, Ding Xia, Yi Zhang, Honglin Pang, Xi Yang, Chuntao Li

Abstract: In this paper, we propose a learning-based image fragment pair-searching and -matching approach to solve the challenging restoration problem. Existing works use rule-based methods to match similar contour shapes or textures, whose hyperparameters are difficult to tune for extensive data and which are computationally time-consuming. Therefore, we propose a neural network that can effectively utilize neighbor textures with contour shape information to fundamentally improve performance. First, we employ a graph-based network to extract the local contour and texture features of fragments. Then, for the pair-searching task, we adopt a linear transformer-based module to integrate these local features and use contrastive loss to encode the global features of each fragment. For the pair-matching task, we design a weighted fusion module to dynamically fuse extracted local contour and texture features, and formulate a similarity matrix for each pair of fragments to calculate the matching score and infer the adjacent segment of contours. To faithfully evaluate our proposed network, we created a new image fragment dataset through an algorithm we designed that tears complete images into irregular fragments. The experimental results show that our proposed network achieves excellent pair-searching accuracy, reduces matching errors, and significantly reduces computational time. Details, source code, and data are available in our supplementary material.

Submitted 14 December, 2023; originally announced December 2023.

Comments: 14 pages, 16 figures, 4 tables

arXiv:2312.05941 [pdf, other]

ASH: Animatable Gaussian Splats for Efficient and Photoreal Human Rendering

Authors: Haokai Pang, Heming Zhu, Adam Kortylewski, Christian Theobalt, Marc Habermann

Abstract: Real-time rendering of photorealistic and controllable human avatars stands as a cornerstone in Computer Vision and Graphics. While recent advances in neural implicit rendering have unlocked unprecedented photorealism for digital avatars, real-time performance has mostly been demonstrated for static scenes only. To address this, we propose ASH, an animatable Gaussian splatting approach for photorealistic rendering of dynamic humans in real-time. We parameterize the clothed human as animatable 3D Gaussians, which can be efficiently splatted into image space to generate the final rendering. However, naively learning the Gaussian parameters in 3D space poses a severe challenge in terms of compute. Instead, we attach the Gaussians onto a deformable character model, and learn their parameters in 2D texture space, which allows leveraging efficient 2D convolutional architectures that easily scale with the required number of Gaussians. We benchmark ASH with competing methods on pose-controllable avatars, demonstrating that our method outperforms existing real-time methods by a large margin and shows comparable or even better results than offline methods.

Submitted 15 April, 2024; v1 submitted 10 December, 2023; originally announced December 2023.

Comments: For project page, see https://vcai.mpi-inf.mpg.de/projects/ash/

arXiv:2310.17901 [pdf, other]

Improving the Knowledge Gradient Algorithm

Authors: Yang Le, Gao Siyang, Ho Chin Pang

Abstract: The knowledge gradient (KG) algorithm is a popular policy for the best arm identification (BAI) problem. It is built on the simple idea of always choosing the measurement that yields the greatest expected one-step improvement in the estimate of the best mean of the arms. In this research, we show that this policy has limitations that prevent the algorithm from being asymptotically optimal. We next provide a remedy for it, by following the one-step look-ahead manner of KG, but instead choosing the measurement that yields the greatest one-step improvement in the probability of selecting the best arm. The new policy is called improved knowledge gradient (iKG). iKG can be shown to be asymptotically optimal. In addition, we show that compared to KG, it is easier to extend iKG to variant problems of BAI, with the $ε$-good arm identification and feasible arm identification as two examples. The superior performance of iKG on these problems is further demonstrated using numerical examples.

Submitted 27 October, 2023; originally announced October 2023.

Comments: 32 pages, 42 figures

arXiv:2309.17448 [pdf, other]

SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation

Authors: Zhongang Cai, Wanqi Yin, Ailing Zeng, Chen Wei, Qingping Sun, Yanjun Wang, Hui En Pang, Haiyi Mei, Mingyuan Zhang, Lei Zhang, Chen Change Loy, Lei Yang, Ziwei Liu

Abstract: Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications. Despite encouraging progress, current state-of-the-art methods still depend largely on a confined set of training datasets. In this work, we investigate scaling up EHPS towards the first generalist foundation model (dubbed SMPLer-X), with up to ViT-Huge as the backbone and training with up to 4.5M instances from diverse data sources. With big data and the large model, SMPLer-X exhibits strong performance across diverse test benchmarks and excellent transferability to even unseen environments. 1) For the data scaling, we perform a systematic investigation on 32 EHPS datasets, including a wide range of scenarios that a model trained on any single dataset cannot handle. More importantly, capitalizing on insights obtained from the extensive benchmarking process, we optimize our training scheme and select datasets that lead to a significant leap in EHPS capabilities. 2) For the model scaling, we take advantage of vision transformers to study the scaling law of model sizes in EHPS. Moreover, our finetuning strategy turns SMPLer-X into specialist models, allowing them to achieve further performance boosts. Notably, our foundation model SMPLer-X consistently delivers state-of-the-art results on seven benchmarks such as AGORA (107.2 mm NMVE), UBody (57.4 mm PVE), EgoBody (63.6 mm PVE), and EHF (62.3 mm PVE without finetuning). Homepage: https://caizhongang.github.io/projects/SMPLer-X/

Submitted 30 October, 2023; v1 submitted 29 September, 2023; originally announced September 2023.

Comments: Homepage: https://caizhongang.github.io/projects/SMPLer-X/

arXiv:2309.10684 [pdf, other]

Locally Stylized Neural Radiance Fields

Authors: Hong-Wing Pang, Binh-Son Hua, Sai-Kit Yeung

Abstract: In recent years, there has been increasing interest in applying stylization on 3D scenes from a reference style image, in particular onto neural radiance fields (NeRF). While performing stylization directly on NeRF guarantees appearance consistency over arbitrary novel views, it is a challenging problem to guide the transfer of patterns from the style image onto different parts of the NeRF scene. In this work, we propose a stylization framework for NeRF based on local style transfer. In particular, we use a hash-grid encoding to learn the embedding of the appearance and geometry components, and show that the mapping defined by the hash table allows us to control the stylization to a certain extent. Stylization is then achieved by optimizing the appearance branch while keeping the geometry branch fixed. To support local style transfer, we propose a new loss function that utilizes a segmentation network and bipartite matching to establish region correspondences between the style image and the content images obtained from volume rendering. Our experiments show that our method yields plausible stylization results with novel view synthesis while having flexible controllability via manipulating and customizing the region correspondences.

Submitted 19 September, 2023; originally announced September 2023.

Comments: ICCV 2023

arXiv:2308.04322 [pdf, other]

Domain Adaptive Person Search via GAN-based Scene Synthesis for Cross-scene Videos

Authors: Huibing Wang, Tianxiang Cui, Mingze Yao, Huijuan Pang, Yushan Du

Abstract: Person search has recently been a challenging task in the computer vision domain, which aims to search specific pedestrians from real cameras. Nevertheless, most surveillance videos comprise only a handful of images of each pedestrian, which often feature identical backgrounds and clothing. Hence, it is difficult to learn more discriminative features for person search in real scenes. To tackle this challenge, we draw on Generative Adversarial Networks (GAN) to synthesize data from surveillance videos. GAN has thrived in computer vision problems because it produces high-quality images efficiently. We merely alter the popular Fast R-CNN model, which is capable of processing videos and yielding accurate detection outcomes. In order to appropriately relieve the pressure brought by the two-stage model, we design an Assisted-Identity Query Module (AIDQ) to provide positive images for the behind part. Besides, we propose a novel GAN-based Scene Synthesis model that can synthesize high-quality cross-id person images for person search tasks. In order to facilitate the feature learning of the GAN-based Scene Synthesis model, we adopt an online learning strategy that collaboratively learns the synthesized images and original images. Extensive experiments on two widely used person search benchmarks, CUHK-SYSU and PRW, have shown that our method has achieved great performance, and the extensive ablation study further shows that our GAN-synthetic data can effectively increase the variability of the datasets and be more realistic.

Submitted 8 August, 2023; originally announced August 2023.

arXiv:2307.09621 [pdf, other]

Conditional 360-degree Image Synthesis for Immersive Indoor Scene Decoration

Authors: Ka Chun Shum, Hong-Wing Pang, Binh-Son Hua, Duc Thanh Nguyen, Sai-Kit Yeung

Abstract: In this paper, we address the problem of conditional scene decoration for 360-degree images. Our method takes a 360-degree background photograph of an indoor scene and generates decorated images of the same scene in the panorama view. To do this, we develop a 360-aware object layout generator that learns latent object vectors in the 360-degree view to enable a variety of furniture arrangements for an input 360-degree background image. We use this object layout to condition a generative adversarial network to synthesize images of an input scene. To further reinforce the generation capability of our model, we develop a simple yet effective scene emptier that removes the generated furniture and produces an emptied scene for our model to learn a cyclic constraint. We train the model on the Structure3D dataset and show that our model can generate diverse decorations with controllable object layout. Our method achieves state-of-the-art performance on the Structure3D dataset and generalizes well to the Zillow indoor scene dataset. Our user study confirms the immersive experiences provided by the realistic image quality and furniture layout in our generation results. Our implementation will be made available.

Submitted 18 July, 2023; originally announced July 2023.

Comments: ICCV 2023

arXiv:2212.07651 [pdf, other]

Two-stage Contextual Transformer-based Convolutional Neural Network for Airway Extraction from CT Images

Authors: Yanan Wu, Shuiqing Zhao, Shouliang Qi, Jie Feng, Haowen Pang, Runsheng Chang, Long Bai, Mengqi Li, Shuyue Xia, Wei Qian, Hongliang Ren

Abstract: Accurate airway extraction from computed tomography (CT) images is a critical step for planning navigation bronchoscopy and quantitative assessment of airway-related chronic obstructive pulmonary disease (COPD). Existing methods struggle to segment the airway sufficiently, especially the high-generation airway, under the constraint of limited labels, and cannot meet the needs of clinical use in COPD. We propose a novel two-stage 3D contextual transformer-based U-Net for airway segmentation using CT images. The method consists of two stages, performing initial and refined airway segmentation. The two stages share the same subnetwork, with different airway masks as input. A contextual transformer block is applied in both the encoder and decoder paths of the subnetwork to achieve high-quality airway segmentation effectively. In the first stage, the total airway mask and CT images are provided to the subnetwork; in the second stage, the intrapulmonary airway mask and corresponding CT scans are provided. The predictions of the two stages are then merged as the final prediction. Extensive experiments were performed on in-house and multiple public datasets. Quantitative and qualitative analyses demonstrate that our proposed method extracts many more branches and a greater length of the airway tree while achieving state-of-the-art airway segmentation performance. The code is available at https://github.com/zhaozsq/airway_segmentation.

Submitted 15 December, 2022; originally announced December 2022.

arXiv:2211.16544 [pdf, other]

Towards Transcervical Ultrasound Image Guidance for Transoral Robotic Surgery

Authors: Wanwen Chen, Megha Kalia, Qi Zeng, Emily H. T. Pang, Razeyeh Bagherinasab, Thomas D. Milner, Farahna Sabiq, Eitan Prisman, Septimiu E. Salcudean

Abstract: Purpose: Trans-oral robotic surgery (TORS) using the da Vinci surgical robot is a new minimally-invasive surgery method to treat oropharyngeal tumors, but it is a challenging operation. Augmented reality (AR) based on intra-operative ultrasound (US) has the potential to enhance the visualization of the anatomy and cancerous tumors to provide additional tools for decision-making in surgery. Methods: We propose and carry out preliminary evaluations of a US-guided AR system for TORS, with the transducer placed on the neck for a transcervical view. Firstly, we perform a novel MRI-transcervical 3D US registration study. Secondly, we develop a US-robot calibration method with an optical tracker and an AR system to display the anatomy mesh model in the real-time endoscope images inside the surgeon console. Results: Our AR system reaches a mean projection error of 26.81 and 27.85 pixels for the projection from the US to stereo cameras in a water bath experiment. The average target registration error for MRI to 3D US is 8.90 mm for the 3D US transducer and 5.85 mm for freehand 3D US, and the average distance between the vessel centerlines is 2.32 mm. Conclusion: We demonstrate the first proof-of-concept transcervical US-guided AR system for TORS and the feasibility of trans-cervical 3D US-MRI registration. Our results show that trans-cervical 3D US is a promising technique for TORS image guidance.

Submitted 31 March, 2023; v1 submitted 29 November, 2022; originally announced November 2022.

Comments: 12 pages, 8 figures. Accepted by Information Processing for Computer Assisted Interventions (IPCAI 2023)

arXiv:2209.10529 [pdf, other]

Benchmarking and Analyzing 3D Human Pose and Shape Estimation Beyond Algorithms

Authors: Hui En Pang, Zhongang Cai, Lei Yang, Tianwei Zhang, Ziwei Liu

Abstract: 3D human pose and shape estimation (a.k.a. "human mesh recovery") has achieved substantial progress. Researchers mainly focus on the development of novel algorithms, while less attention has been paid to other critical factors involved. This could lead to less optimal baselines, hindering the fair and faithful evaluations of newly designed methodologies. To address this problem, this work presents the first comprehensive benchmarking study from three under-explored perspectives beyond algorithms. 1) Datasets. An analysis on 31 datasets reveals the distinct impacts of data samples: datasets featuring critical attributes (i.e. diverse poses, shapes, camera characteristics, backbone features) are more effective. Strategic selection and combination of high-quality datasets can yield a significant boost to the model performance. 2) Backbones. Experiments with 10 backbones, ranging from CNNs to transformers, show the knowledge learnt from a proximity task is readily transferable to human mesh recovery. 3) Training strategies. Proper augmentation techniques and loss designs are crucial. With the above findings, we achieve a PA-MPJPE of 47.3 mm on the 3DPW test set with a relatively simple model. More importantly, we provide strong baselines for fair comparisons of algorithms, and recommendations for building effective training configurations in the future. Codebase is available at http://github.com/smplbody/hmr-benchmarks

Submitted 21 September, 2022; originally announced September 2022.

Comments: Submission to 36th Conference on Neural Information Processing Systems (NeurIPS 2022) Track on Datasets and Benchmarks

arXiv:2204.05445 [pdf, other]

Small Footprint Multi-channel ConvMixer for Keyword Spotting with Centroid Based Awareness

Authors: Dianwen Ng, Jin Hui Pang, Yang Xiao, Biao Tian, Qiang Fu, Eng Siong Chng

Abstract: It is critical for a keyword spotting model to have a small footprint as it typically runs on-device with low computational resources. However, maintaining the previous SOTA performance with reduced model size is challenging. In addition, a far-field and noisy environment with interference from multiple signals aggravates the problem, causing the accuracy to degrade significantly. In this paper, we present a multi-channel ConvMixer for speech command recognition. The novel architecture introduces an additional audio channel mixing for channel audio interaction in a multi-channel audio setting to achieve better noise-robust features with more efficient computation. Besides, we propose a centroid based awareness component to enhance the system by equipping it with additional spatial geometry information in the latent feature projection space. We evaluate our model using the new MISP challenge 2021 dataset. Our model achieves significant improvement against the official baseline with a 55% gain in the competition score (0.152) on raw microphone array input and a 63% (0.126) boost upon front-end speech enhancement.

Submitted 11 April, 2022; originally announced April 2022.

Comments: submitted to INTERSPEECH 2022

arXiv:2108.01806 [pdf, other]

Neural Scene Decoration from a Single Photograph

Authors: Hong-Wing Pang, Yingshu Chen, Phuoc-Hieu Le, Binh-Son Hua, Duc Thanh Nguyen, Sai-Kit Yeung

Abstract: Furnishing and rendering indoor scenes has been a long-standing task for interior design, where artists create a conceptual design for the space, build a 3D model of the space, decorate, and then perform rendering. Although the task is important, it is tedious and requires tremendous effort. In this paper, we introduce a new problem of domain-specific indoor scene image synthesis, namely neural scene decoration. Given a photograph of an empty indoor space and a list of decorations with layout determined by the user, we aim to synthesize a new image of the same space with desired furnishing and decorations. Neural scene decoration can be applied to create conceptual interior designs in a simple yet effective manner. Our approach to this research problem is a novel scene generation architecture that transforms an empty scene and an object layout into a realistic furnished scene photograph. We demonstrate the performance of our proposed method by comparing it with conditional image synthesis baselines built upon prevailing image translation approaches both qualitatively and quantitatively. We conduct extensive experiments to further validate the plausibility and aesthetics of our generated scenes. Our implementation is available at https://github.com/hkust-vgd/neural_scene_decoration.

Submitted 25 July, 2022; v1 submitted 3 August, 2021; originally announced August 2021.

Comments: ECCV 2022 paper. 14 pages of main content, 4 pages of references, and 11 pages of appendix

arXiv:2105.02409 [pdf, other]

Multimedia Edge Computing

Authors: Zhi Wang, Wenwu Zhu, Lifeng Sun, Han Hu, Ge Ma, Ming Ma, Haitian Pang, Jiahui Ye, Hongshan Li

Abstract: In this paper, we investigate the recent studies on multimedia edge computing, from sensing not only traditional visual/audio data but also individuals' geographical preference and mobility behaviors, to performing distributed machine learning over such data using the joint edge and cloud infrastructure and using evolutional strategies like reinforcement learning and online learning at edge devices to optimize the quality of experience for multimedia services at the last mile proactively. We provide both a retrospective view of recent rapid migration (resp. merge) of cloud multimedia to (resp. and) edge-aware multimedia and insights on the fundamental guidelines for designing multimedia edge computing strategies that target satisfying the changing demand of quality of experience. By showing the recent research studies and industrial solutions, we also provide future directions towards high-quality multimedia services over edge computing.

Submitted 5 May, 2021; originally announced May 2021.

Comments: 20 pages, 9 figures. arXiv admin note: text overlap with arXiv:1702.07627

arXiv:1909.07541 [pdf, other]

A*3D Dataset: Towards Autonomous Driving in Challenging Environments

Authors: Quang-Hieu Pham, Pierre Sevestre, Ramanpreet Singh Pahwa, Huijing Zhan, Chun Ho Pang, Yuda Chen, Armin Mustafa, Vijay Chandrasekhar, Jie Lin

Abstract: With the increasing global popularity of self-driving cars, there is an immediate need for challenging real-world datasets for benchmarking and training various computer vision tasks such as 3D object detection. Existing datasets either represent simple scenarios or provide only day-time data. In this paper, we introduce a new challenging A*3D dataset which consists of RGB images and LiDAR data with significant diversity of scene, time, and weather. The dataset consists of high-density images ($\approx~10$ times more than the pioneering KITTI dataset), heavy occlusions, a large number of night-time frames ($\approx~3$ times the nuScenes dataset), addressing the gaps in the existing datasets to push the boundaries of tasks in autonomous driving research to more challenging highly diverse environments. The dataset contains $39\text{K}$ frames, $7$ classes, and $230\text{K}$ 3D object annotations. An extensive 3D object detection benchmark evaluation on the A*3D dataset for various attributes such as high density, day-time/night-time, gives interesting insights into the advantages and limitations of training and testing 3D object detection in real-world settings.

Submitted 16 September, 2019; originally announced September 2019.

Comments: A new 3D dataset by I2R, A*STAR for autonomous driving

arXiv:1805.09249 [pdf, other]

Multi-User Cooperative Mobile Video Streaming: Performance Analysis and Online Mechanism Design

Authors: Lin Gao, Ming Tang, Haitian Pang, Jianwei Huang, Lifeng Sun

Abstract: Adaptive bitrate streaming enables video users to adapt their playing bitrates to the real-time network conditions, hence achieving the desirable quality-of-experience (QoE). In a multi-user wireless scenario, however, existing single-user based bitrate adaptation methods may fail to provide the desirable QoE, due to lack of consideration of multi-user interactions (such as the multi-user interferences and network congestion). In this work, we propose a novel user cooperation framework based on user-provided networking for multi-user mobile video streaming over wireless cellular networks. The framework enables nearby mobile video users to crowdsource their cellular links and resources for cooperative video streaming. We first analyze the social welfare performance bound of the proposed cooperative streaming system by introducing a virtual time-slotted system. Then, we design a low complexity Lyapunov-based online algorithm, which can be implemented in an online and distributed manner without the complete future and global network information. Numerical results show that the proposed online algorithm achieves an average 97% of the theoretical maximum social welfare. We further conduct experiments with real data traces, to compare our proposed online algorithm with the existing online algorithms in the literature. Experiment results show that our algorithm outperforms the existing algorithms in terms of both the achievable bitrate (with an average gain of 20% - 30%) and social welfare (with an average gain of 10% - 50%).

Submitted 21 May, 2018; originally announced May 2018.

Comments: This manuscript serves as the online technical report for the paper published in IEEE Transactions on Mobile Computing

arXiv:1805.08008 [pdf, other]

Performance Bound Analysis for Crowdsourced Mobile Video Streaming

Authors: Lin Gao, Ming Tang, Haitian Pang, Jianwei Huang, Lifeng Sun

Abstract: Adaptive bitrate (ABR) streaming enables video users to adapt the playing bitrate to the real-time network conditions to achieve the desirable quality of experience (QoE). In this work, we propose a novel crowdsourced streaming framework for multi-user ABR video streaming over wireless networks. This framework enables the nearby mobile video users to crowdsource their radio links and resources for cooperative video streaming. We focus on analyzing the social welfare performance bound of the proposed crowdsourced streaming system. Directly solving this bound is challenging due to the asynchronous operations of users. To this end, we introduce a virtual time-slotted system with synchronized operations, and formulate the associated social welfare optimization problem as a linear program. We show that the optimal social welfare performance of the virtual system provides effective upper and lower bounds for the optimal performance (bound) of the original asynchronous system, and hence characterizes the feasible performance region of the proposed crowdsourced streaming system. The performance bounds derived in this work can serve as a benchmark for the future online algorithm design and incentive mechanism design.

Submitted 21 May, 2018; originally announced May 2018.

Comments: This manuscript serves as the online technical report for the paper published in the IEEE Conference on Information Sciences and Systems (CISS 2016)

arXiv:1709.00273 [pdf, ps, other]

When Data Sponsoring Meets Edge Caching: A Game-Theoretic Analysis

Authors: Haitian Pang, Lin Gao, Qinghua Ding, Lifeng Sun

Abstract: Data sponsoring is a widely-used incentive method in today's cellular networks, where video content providers (CPs) cover part or all of the cellular data cost for mobile users so as to attract more video users and increase data traffic. In the forthcoming 5G cellular networks, edge caching is emerging as a promising technique to deliver videos with lower cost and higher quality. The key idea is to cache video contents on edge networks (e.g., femtocells and WiFi access points) in advance and deliver the cached contents to local video users directly (without involving cellular data cost for users). In this work, we aim to study how the edge caching will affect the CP's data sponsoring strategy as well as the users' behaviors and the data market. Specifically, we consider a single CP who offers both the edge caching service and the data sponsoring service to a set of heterogeneous mobile video users (with different mobility and video request patterns). We formulate the interactions of the CP and the users as a two-stage Stackelberg game, where the CP (leader) determines the budgets (efforts) for both services in Stage I, and the users (followers) decide whether and which service(s) they would like to subscribe to. We analyze the sub-game perfect equilibrium (SPE) of the proposed game systematically. Our analysis and experimental results show that by introducing the edge caching, the CP can increase his revenue by 105%.

Submitted 1 September, 2017; originally announced September 2017.

Comments: 6 pages, accepted by GLOBECOM 2017

arXiv:1704.01079 [pdf, other]

Homotopy Parametric Simplex Method for Sparse Learning

Authors: Haotian Pang, Robert Vanderbei, Han Liu, Tuo Zhao

Abstract: High dimensional sparse learning has imposed a great computational challenge to large scale data analysis. In this paper, we are interested in a broad class of sparse learning approaches formulated as linear programs parametrized by a {\em regularization factor}, and solve them by the parametric simplex method (PSM). Our parametric simplex method offers significant advantages over other competing methods: (1) PSM naturally obtains the complete solution path for all values of the regularization parameter; (2) PSM provides a high precision dual certificate stopping criterion; (3) PSM yields sparse solutions through very few iterations, and the solution sparsity significantly reduces the computational cost per iteration. Particularly, we demonstrate the superiority of PSM over various sparse learning approaches, including Dantzig selector for sparse linear regression, LAD-Lasso for sparse robust linear regression, CLIME for sparse precision matrix estimation, sparse differential network estimation, and sparse Linear Programming Discriminant (LPD) analysis. We then provide sufficient conditions under which PSM always outputs sparse solutions such that its computational performance can be significantly boosted. Thorough numerical experiments are provided to demonstrate the outstanding performance of the PSM method.

Submitted 27 November, 2017; v1 submitted 4 April, 2017; originally announced April 2017.

Comments: Accepted by NIPS 2017

arXiv:1703.06648 [pdf, other]

Multi-Dimensional Auction Mechanisms for Crowdsourced Mobile Video Streaming

Authors: Ming Tang, Haitian Pang, Shou Wang, Lin Gao, Jianwei Huang, Lifeng Sun

Abstract: Crowdsourced mobile video streaming enables nearby mobile video users to aggregate network resources to improve their video streaming performances. However, users are often selfish and may not be willing to cooperate without proper incentives. Designing an incentive mechanism for such a scenario is challenging due to the users' asynchronous downloading behaviors and their private valuations for multi-bitrate coded videos. In this work, we propose both single-object and multi-object multi-dimensional auction mechanisms, through which users sell the opportunities for downloading single and multiple video segments with multiple bitrates, respectively. Both auction mechanisms can achieve truthfulness (i.e., truthful private information revelation) and efficiency (i.e., social welfare maximization). Simulations with real traces show that crowdsourced mobile streaming facilitated by the auction mechanisms outperforms noncooperative streaming by 48.6% (on average) in terms of social welfare. To evaluate the real-world performance, we also construct a demo system for crowdsourced mobile streaming and implement our proposed auction mechanism. Experiments over the demo system further show that those users who provide resources to others and those users who receive help can increase their welfare by 15.5% and 35.4% (on average) via cooperation, respectively.

Submitted 7 July, 2018; v1 submitted 20 March, 2017; originally announced March 2017.

arXiv:1611.00211 [pdf, ps, other]

Joint Optimization of Data Sponsoring and Edge Caching for Mobile Video Delivery

Authors: Haitian Pang, Lin Gao, Lifeng Sun

Abstract: In this work, we study the joint optimization of edge caching and data sponsoring for a video content provider (CP), aiming at reducing the content delivery cost and increasing the CP's revenue. Specifically, we formulate the joint optimization problem as a two-stage decision problem for the CP. In Stage I, the CP determines the edge caching policy (for a relatively long time period). In Stage II, the CP decides the real-time data sponsoring strategy for each content request within the period. We first propose a Lyapunov-based online sponsoring strategy in Stage II, which reaches 90% of the offline maximum performance (benchmark). We then solve the edge caching problem in Stage I based on the online sponsoring strategy proposed in Stage II, and show that the optimal caching policy depends on the aggregate user request for each content in each location. Simulations show that such a joint optimization can increase the CP's revenue by 30%~100%, compared with pure data sponsoring (i.e., without edge caching).

Submitted 1 November, 2016; originally announced November 2016.

Comments: accepted by GLOBECOM 2016

arXiv:1606.04195 [pdf, other]

Social- and Mobility-Aware Device-to-Device Content Delivery

Authors: Zhi Wang, Lifeng Sun, Miao Zhang, Haitian Pang, Erfang Tian, Wenwu Zhu

Abstract: Mobile online social network services have seen a rapid increase, in which the huge amount of user-generated social media contents propagating between users via social connections has significantly challenged the traditional content delivery paradigm: First, replicating all of the contents generated by users to edge servers that well "fit" the receivers becomes difficult due to the limited bandwidth and storage capacities. Motivated by device-to-device (D2D) communication that allows users with smart devices to transfer content directly, we propose replicating bandwidth-intensive social contents in a device-to-device manner. Based on large-scale measurement studies on social content propagation and user mobility patterns in edge-network regions, we observe that (1) Device-to-device replication can significantly help users download social contents from nearby neighboring peers; (2) Both social propagation and mobility patterns affect how contents should be replicated; (3) The replication strategies depend on regional characteristics ({\em e.g.}, how users move across regions). Using these measurement insights, we propose a joint \emph{propagation- and mobility-aware} content replication strategy for edge-network regions, in which social contents are assigned to users in edge-network regions according to a joint consideration of social graph, content propagation and user mobility. We formulate the replication scheduling as an optimization problem and design a distributed algorithm that uses only historical, local and partial information to solve it. Trace-driven experiments further verify the superiority of our proposal: compared with conventional pure movement-based and popularity-based approaches, our design can significantly ($2-4$ times) improve the amount of social contents successfully delivered by device-to-device replication.

Submitted 13 June, 2016; originally announced June 2016.

arXiv:1207.4129 [pdf]

Recovering Articulated Object Models from 3D Range Data

Authors: Dragomir Anguelov, Daphne Koller, Hoi-Cheung Pang, Praveen Srinivasan, Sebastian Thrun

Abstract: We address the problem of unsupervised learning of complex articulated object models from 3D range data. We describe an algorithm whose input is a set of meshes corresponding to different configurations of an articulated object. The algorithm automatically recovers a decomposition of the object into approximately rigid parts, the location of the parts in the different object instances, and the articulated object skeleton linking the parts. Our algorithm first registers all the meshes using an unsupervised non-rigid technique described in a companion paper. It then segments the meshes using a graphical model that captures the spatial contiguity of parts. The segmentation is done using the EM algorithm, iterating between finding a decomposition of the object into rigid parts, and finding the location of the parts in the object instances. Although the graphical model is densely connected, the object decomposition step can be performed optimally and efficiently, allowing us to identify a large number of object parts while avoiding local maxima. We demonstrate the algorithm on real world datasets, recovering a 15-part articulated model of a human puppet from just 7 different puppet configurations, as well as a 4-part model of a flexing arm where significant non-rigid deformation was present.

Submitted 11 July, 2012; originally announced July 2012.

Comments: Appears in Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence (UAI2004)

Report number: UAI-P-2004-PG-18-26

Showing 1–24 of 24 results for author: Pang, H