research-article

Single-Stage Multi-human Parsing via Point Sets and Center-based Offsets

Published: 27 October 2023

    Abstract

    This work studies the multi-human parsing problem. Existing methods, which follow either top-down or bottom-up two-stage paradigms, usually incur high computational costs. We instead present a high-performance Single-stage Multi-human Parsing (SMP) deep architecture that decouples the multi-human parsing problem into two fine-grained sub-problems, i.e., locating the human body and locating its parts. SMP uses the point features at barycenter positions to obtain the corresponding segmentation masks and then generates a series of offsets from the barycenter of each human body to the barycenters of its parts, thus matching human bodies and parts without a grouping process. Within the SMP architecture, we propose a Refined Feature Retain module, which extracts the global feature of each instance through generated mask attention, and a Mask of Interest Reclassify module, a trainable plug-in module that refines the classification results using the predicted segmentation. Extensive experiments on the MHPv2.0 dataset demonstrate the effectiveness and efficiency of the proposed method, which surpasses the state-of-the-art method by 2.1% in AP^p_50, 1.0% in AP^p_vol, and 1.2% in PCP_50. Moreover, SMP also achieves superior performance on DensePose-COCO, supporting the generalization ability of the model. In particular, the proposed method requires fewer training epochs and a less complex model architecture. Our code is released at https://github.com/cjm-sfw/SMP.
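
The center-based offset idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `barycenter` and `match_parts_to_bodies`, the dictionary-based inputs, and the nearest-point assignment rule are all assumptions made for illustration. The sketch only shows how offsets predicted from a body barycenter could associate part barycenters with bodies without a separate grouping stage.

```python
from math import hypot


def barycenter(mask):
    """Return the (row, col) mean of all foreground pixels in a binary mask."""
    points = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not points:
        raise ValueError("mask has no foreground pixels")
    n = len(points)
    return (sum(r for r, _ in points) / n, sum(c for _, c in points) / n)


def match_parts_to_bodies(body_centers, predicted_offsets, part_centers):
    """Assign each part to the body whose center + offset lands nearest to it.

    body_centers:      {body_id: (row, col)} barycenters of body masks
    predicted_offsets: {body_id: [(d_row, d_col), ...]} offsets from a body
                       barycenter to its part barycenters (hypothetical
                       stand-in for the network's offset output)
    part_centers:      {part_id: (row, col)} barycenters of part masks
    Returns {part_id: body_id}.
    """
    assignment = {}
    for part_id, (pr, pc) in part_centers.items():
        best_body, best_dist = None, float("inf")
        for body_id, (br, bc) in body_centers.items():
            for dr, dc in predicted_offsets[body_id]:
                # Distance between the predicted part location and the real one.
                dist = hypot(br + dr - pr, bc + dc - pc)
                if dist < best_dist:
                    best_body, best_dist = body_id, dist
        assignment[part_id] = best_body
    return assignment


# Toy example with two people standing side by side (made-up coordinates).
body_centers = {"person_a": (10.0, 10.0), "person_b": (10.0, 40.0)}
predicted_offsets = {
    "person_a": [(-6.0, 0.0), (5.0, 1.0)],
    "person_b": [(-6.0, 0.0), (5.0, -1.0)],
}
part_centers = {"head_0": (4.2, 39.5), "legs_0": (15.3, 10.8), "head_1": (3.9, 10.1)}
matches = match_parts_to_bodies(body_centers, predicted_offsets, part_centers)
```

In this toy setup each part is attached to the person whose predicted part location is closest, so `head_0` goes to `person_b` while `legs_0` and `head_1` go to `person_a`. A real system would also use part categories and confidence scores, which this sketch omits.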

    Supplementary Material

    MP4 File (1327-video.mp4)
    Presentation video (short version). In this talk, we present SMP, a simple yet efficient single-stage method for multi-human parsing that achieves better accuracy and faster inference than existing methods. The framework's modularity and efficiency make it a promising approach for future research on related problems in computer vision.


    Cited By

    • (2024) Object detection and crowd analysis using deep learning techniques: Comprehensive review and future directions. Neurocomputing 597 (127932). DOI: 10.1016/j.neucom.2024.127932. Online publication date: Sep-2024


    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN: 9798400701085
    DOI: 10.1145/3581783


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. multi-human parsing
    2. neural networks
    3. offsets
    4. point sets

    Qualifiers

    • Research-article

    Funding Sources

    • National Key R&D Program of China
    • Young Elite Scientist Sponsorship Program of China Association for Science and Technology
    • Young Elite Scientist Sponsorship Program of Beijing Association for Science and Technology
    • Natural Science Foundation of China under Grant
    • National Nature Fund

    Conference

    MM '23
    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 995 of 4,171 submissions, 24%



    Article Metrics

    • Downloads (last 12 months): 83
    • Downloads (last 6 weeks): 6
