research-article

Single-Stage Multi-human Parsing via Point Sets and Center-based Offsets

Authors:

Junliang Xing, and

Jian Zhao

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

October 2023

Pages 1863 - 1873

https://doi.org/10.1145/3581783.3611993

Published: 27 October 2023

Abstract

This work studies the multi-human parsing problem. Existing methods, which follow either top-down or bottom-up two-stage paradigms, usually incur high computational costs. We instead present a high-performance Single-stage Multi-human Parsing (SMP) deep architecture that decouples the multi-human parsing problem into two fine-grained sub-problems, i.e., locating the human body and locating its parts. SMP leverages the point features at the barycenter positions to obtain their segmentation and then generates a series of offsets from the barycenter of the human body to the barycenters of parts, thus matching human bodies and parts without a grouping process. Within the SMP architecture, we propose a Refined Feature Retain module to extract the global feature of instances through generated mask attention, and a Mask of Interest Reclassify module as a trainable plug-in module to refine the classification results with the predicted segmentation. Extensive experiments on the MHPv2.0 dataset demonstrate the effectiveness and efficiency of the proposed method, which surpasses the state-of-the-art method by 2.1% in AP^p_50, 1.0% in AP^p_vol, and 1.2% in PCP_50. Moreover, SMP also achieves superior performance on DensePose-COCO, verifying the generalization of the model. In particular, the proposed method requires fewer training epochs and a less complex model architecture. Our code is released at https://github.com/cjm-sfw/SMP.
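The center-based matching idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the linked repository for that); it is a toy example under stated assumptions: the function names `barycenter` and `match_parts_to_bodies` are hypothetical, masks are binary arrays, and each predicted offset is resolved to the nearest candidate part barycenter, which is one simple way to pair bodies with parts without a separate grouping step.

```python
import numpy as np


def barycenter(mask: np.ndarray) -> np.ndarray:
    """Return the (row, col) barycenter of a non-empty binary mask."""
    rows, cols = np.nonzero(mask)
    return np.array([rows.mean(), cols.mean()])


def match_parts_to_bodies(body_centers, offsets, part_centers):
    """Pair each body with one candidate part per part class.

    body_centers: (B, 2) barycenters of the human bodies.
    offsets:      (B, K, 2) offsets from each body barycenter to the
                  expected barycenter of each of K part classes
                  (assumed here to come from a network head).
    part_centers: (P, 2) barycenters of candidate part masks.
    Returns a (B, K) array of indices into part_centers.
    """
    body_centers = np.asarray(body_centers, dtype=float)
    offsets = np.asarray(offsets, dtype=float)
    part_centers = np.asarray(part_centers, dtype=float)
    # Expected part locations: body barycenter plus predicted offset.
    targets = body_centers[:, None, :] + offsets                      # (B, K, 2)
    # Distance from every expected location to every candidate part.
    diffs = targets[:, :, None, :] - part_centers[None, None, :, :]   # (B, K, P, 2)
    return np.linalg.norm(diffs, axis=-1).argmin(axis=-1)             # (B, K)


if __name__ == "__main__":
    bodies = [[10.0, 10.0], [10.0, 30.0]]
    # Two illustrative part classes per body, e.g. head and torso.
    offs = [[[-5.0, 0.0], [3.0, 0.0]], [[-5.0, 0.0], [3.0, 0.0]]]
    parts = [[5.2, 9.8], [13.1, 10.3], [4.9, 30.2], [12.8, 29.7]]
    print(match_parts_to_bodies(bodies, offs, parts))
```

Because every part is reached directly from its body's barycenter, the pairing is a single nearest-neighbour lookup rather than a clustering pass over all detected parts.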

Supplementary Material

MP4 File (1327-video.mp4)

Presentation video - short version

In this talk, we presented SMP, a simple yet efficient single-stage method for multi-person human parsing that achieves better accuracy and faster inference speed compared to existing methods. The framework's modularity and efficiency make it a promising approach for future research on related problems in computer vision.



Cited By

Ganga B, B.T. L, K.R. V (2024). Object detection and crowd analysis using deep learning techniques: Comprehensive review and future directions. Neurocomputing 597 (127932). Online publication date: Sep-2024
https://doi.org/10.1016/j.neucom.2024.127932

Index Terms

Single-Stage Multi-human Parsing via Point Sets and Center-based Offsets
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision


Information & Contributors

Information

Published In


MM '23: Proceedings of the 31st ACM International Conference on Multimedia

October 2023

9913 pages

ISBN:9798400701085

DOI:10.1145/3581783

General Chairs:
Abdulmotaleb El Saddik, University of Ottawa, Canada & MBZUAI, UAE
Tao Mei, HiDream.ai, China
Rita Cucchiara, University of Modena and Reggio Emilia, Italy

Program Chairs:
Marco Bertini, University of Florence, Italy
Diana Patricia Tobon Vallejo, Universidad de Medellin, Colombia
Pradeep K. Atrey, University at Albany, State University of New York, USA
M. Shamim Hossain, King Saud University, KSA

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States



Qualifiers

Research-article

Funding Sources

National Key R&D Program of China
Young Elite Scientist Sponsorship Program of China Association for Science and Technology
Young Elite Scientist Sponsorship Program of Beijing Association for Science and Technology
Natural Science Foundation of China under Grant
National Nature Fund

Conference

MM '23

Sponsor:

SIGMM

MM '23: The 31st ACM International Conference on Multimedia

October 29 - November 3, 2023

Ottawa ON, Canada

Acceptance Rates

Overall Acceptance Rate 995 of 4,171 submissions, 24%


Bibliometrics

Total Citations: 1
Total Downloads: 83
Downloads (Last 12 months): 83
Downloads (Last 6 weeks): 6

