OccGaussian: 3D Gaussian Splatting for Occluded Human Rendering

Jingrui Ye (yjr22@mails.tsinghua.edu.cn), Zhongkai Zhang (zzk21@mails.tsinghua.edu.cn), Yujiao Jiang (jiangyj20@mails.tsinghua.edu.cn), Qingmin Liao (liaoqm@tsinghua.edu.cn), Wenming Yang (yang.wenming@sz.tsinghua.edu.cn), and Zongqing Lu (luzq@sz.tsinghua.edu.cn)
Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
Abstract.

Rendering dynamic 3D humans from monocular videos is crucial for applications such as virtual reality and digital entertainment. Most methods assume the human is in an unobstructed scene, whereas in real-life scenarios various objects may occlude body parts. A previous method uses NeRF with surface-based rendering to recover the occluded areas, but it requires more than one day to train and several seconds to render, failing to meet the requirements of real-time interactive applications. To address these issues, we propose OccGaussian, based on 3D Gaussian Splatting, which can be trained within 6 minutes and produces high-quality human renderings at up to 160 FPS with occluded input. OccGaussian initializes 3D Gaussian distributions in the canonical space and performs an occlusion feature query at occluded regions, where an aggregated pixel-aligned feature is extracted to compensate for the missing information. We then use a Gaussian Feature MLP to further process the feature, along with occlusion-aware loss functions, to better perceive the occluded area. Extensive experiments on both simulated and real-world occlusions demonstrate that our method achieves comparable or even superior performance compared to the state-of-the-art method, while improving training and inference speeds by 250x and 800x, respectively. Our code will be available for research purposes.

human modeling, rendering under occlusion, 3D Gaussian Splatting
Conference: Proceedings of the 32nd ACM International Conference on Multimedia; 28 October - 1 November 2024; Melbourne, Australia. CCS: Computing methodologies → Rendering
Figure 1. Overview of OccGaussian. We develop an efficient method for rendering humans under severe occlusions with the help of 3D Gaussian Splatting (Kerbl et al., 2023). Given a real-world monocular sequence of a dynamic human with a tracked skeleton and foreground masks, our method trains within 6 minutes on a single GPU and supports rendering at up to 160 FPS. Meanwhile, OccGaussian achieves comparable or better rendering quality than the state-of-the-art method (Xiang et al., 2023b), which needs over one day to train and several seconds to render a single image.

1. Introduction

Rendering high-quality humans has long played an important role in industries such as movies, games, and entertainment, where the quality of the human model directly affects the visual experience. Because of the intricate topology of the human body, reconstructing realistic humans in real-life scenarios remains a formidable challenge.

With the emergence of NeRF and its variants (Mildenhall et al., 2021; Fridovich-Keil et al., 2022; Zhang et al., 2022b; Barron et al., 2022; Müller et al., 2022), progress in human rendering has become increasingly rapid, and many methods are now capable of rendering photorealistic humans (Weng et al., 2022; Yu et al., 2023; Pan et al., 2023; Xu et al., 2023b). However, the datasets used by these methods are all captured in ideal laboratory environments, where all parts of the human are free from occlusion in spacious scenes. In real-world scenarios, there is no guarantee that the scene is entirely unobstructed during capture; various objects may be present and occlude parts of the human body. Training with such occluded images seriously degrades the rendering quality of these methods. Moreover, laboratory environments allow dense, synchronized multi-view capture, which is difficult to achieve in real-life scenes, where we can only capture from a single view. Because of the lack of ground truth in occluded regions and the limited information from a single view, previous methods fail to handle occluded data.

To address these drawbacks, OccNeRF (Xiang et al., 2023b) stands out as the first dynamic human rendering method for occluded environments. It introduces a surface-based rendering strategy, replacing points with sub-regions of sampled points to achieve better adjustments in occluded regions. Although OccNeRF achieves decent rendering results in occluded areas, it inherits the limitations of NeRF, which requires hundreds of MLP forward passes for each pixel during rendering. As a result, OccNeRF needs at least one day of training and places high demands on GPU memory. Like NeRF, OccNeRF also needs a large MLP for inference, resulting in slow rendering. These shortcomings restrict the application of OccNeRF in real-world scenarios.

Point-based rendering has emerged as an effective alternative to NeRFs (Rückert et al., 2022; Su et al., 2023; Xu et al., 2022; Zheng et al., 2023a). With the recently proposed 3D Gaussian Splatting (3DGS) model (Kerbl et al., 2023), training takes only a few minutes, rendering is tens of times faster than the best NeRF method (Müller et al., 2022), and rendering quality for static scenes is state of the art. Subsequent works (Hu and Liu, 2023; Hu et al., 2023b; Qian et al., 2023a, b) that apply 3DGS to human rendering have also demonstrated that, compared to NeRF, 3DGS renders more quickly with shorter training time while maintaining competitive rendering performance.

To address the slow training and rendering of OccNeRF, we propose OccGaussian to render occluded humans in monocular videos. Leveraging the capabilities of 3DGS, our approach significantly reduces training and rendering time, decreases memory consumption, and improves rendering quality. We propose aggregated pixel-aligned features from the K-nearest visible points to substitute for occluded points, thus better utilizing local information to compensate for the lack of ground truth in occluded regions. Meanwhile, we update the weights of each point, allowing features from highly visible areas to be more dominant. Finally, since conventional losses used in human rendering tasks are ineffective in occluded settings, we design an occlusion loss and a consistency loss to encourage the network not to remove occluded points excessively, resulting in a more complete rendering.

Our experiments show that, in the single-view occluded human rendering task, compared to the state-of-the-art method OccNeRF (Xiang et al., 2023b), OccGaussian trains in minutes, which is 250 times faster than OccNeRF, and renders at up to 160 FPS, an 800-fold improvement. Our method maintains comparable rendering quality and even outperforms OccNeRF. In summary, our work makes the following contributions:

  1. We propose OccGaussian, which, to the best of our knowledge, is the first work that applies 3DGS to render humans in occlusion scenarios. It enables rapid training (6 to 13 minutes) and real-time rendering (up to 169 FPS), making it more convenient for real-time applications.

  2. We propose a K-nearest feature query in occluded regions, combined with an aggregated pixel-aligned feature. We also design an occlusion loss and a consistency loss, which help render appropriate textures in occluded regions.

  3. Experiments on two datasets demonstrate that OccGaussian achieves SOTA occluded human rendering quality while ensuring rapid training and real-time rendering.

2. Related Works

Point-Based Rendering and Neural Radiance Field. Point sampling has always been an indispensable part of geometric rendering. Gross and Pfister (Gross and Pfister, 2011) provide a detailed overview of traditional point cloud rendering algorithms. Recently, there has been an increasing focus on point-based differentiable rendering. DSS (Yifan et al., 2019) projects the point cloud onto a 3D grid and generates differentiable surface patches at the projected positions. NPBG (Aliev et al., 2020) introduces a multi-scale rendering strategy to render point clouds at different resolutions. NeRF (Mildenhall et al., 2021) pioneered neural radiance fields, combining them with volumetric rendering to obtain pixel colors by aggregating sampled points along rays. Subsequent works (Barron et al., 2021, 2022; Zhang et al., 2022b; Fridovich-Keil et al., 2022) continuously improve the rendering quality and training/inference speed of NeRF. 3DGS (Kerbl et al., 2023) has recently emerged as a transformative approach in point-based rendering, leveraging a Gaussian ellipsoid representation to better balance rendering efficiency and quality.

Human avatar rendering. NeRF was initially proposed for rendering static scenes, which it models implicitly with a neural network. Based on parametric human models (Loper et al., 2023; Li et al., 2017; Pavlakos et al., 2019), numerous works have applied NeRF to dynamic human rendering (Peng et al., 2021b; Wang et al., 2021; Peng et al., 2021a; Kwon et al., 2021; Yu et al., 2023; Chen et al., 2022; Weng et al., 2023; Mihajlovic et al., 2022; Chen et al., 2023b; Zheng et al., 2022; Xu et al., 2023b). Among these works, TransHuman (Pan et al., 2023) introduced the Transformer (Vaswani et al., 2017) to capture global relationships between body parts. SHERF (Hu et al., 2023a) can train on a single image and reconstruct an animatable 3D human. UV Volumes (Chen et al., 2023a) leverages pre-defined UV human maps and sparse 3D convolutions for feature encoding, which accelerates rendering but does not shorten training time. Some methods (Geng et al., 2023; Jiang et al., 2023a) apply a variant of NeRF (Müller et al., 2022) to accelerate training and inference, but with poorer generalization. With the advent of 3D Gaussian Splatting (Kerbl et al., 2023), many works have adapted it to human rendering (Hu et al., 2023b; Hu and Liu, 2023; Qian et al., 2023b; Li et al., 2023a; Qian et al., 2023a; Li et al., 2023b; Jiang et al., 2023b; Moreau et al., 2023; Pang et al., 2023; Xu et al., 2023a; Yuan et al., 2023; Zheng et al., 2023b; Kocabas et al., 2023; Lei et al., 2023), significantly improving training/inference speed and rendering quality compared to NeRF-based methods. However, all these methods can only be trained on data captured in ideal environments without occlusions, where all parts of a person are clearly visible. In contrast, the data used in our work are occluded images of humans from a single view, with the aim of obtaining complete and high-quality rendering results for better application in real-world scenarios.

Occluded Human Reconstruction. Prior works primarily focused on pose estimation or human reconstruction for occluded humans. DensePose (Güler et al., 2018) introduced convolution networks to learn the mapping from 2D image to 3D human surface, processing dense coordinates at multiple frames per second to achieve dynamic pose estimation. Recent works (Zhou et al., 2021; Yang et al., 2022; Huang et al., 2022; Shim et al., 2022; Zhang et al., 2022a; Wang et al., 2023; Liu et al., 2023) have shown improved performance in estimating human poses under occlusions, exhibiting robustness even in outdoor scenes.

However, the works above reconstruct parametric models of humans, which represent only rough body shapes and poses and cannot recover clothing details or facial expressions. To address this, Sun et al. (Sun et al., 2021) use sparse-view sequences as input and employ a layer-by-layer scene decoupling strategy for the reconstruction and rendering of people and objects. Xiang et al. (Xiang et al., 2023b) proposed OccNeRF, which combines surface-based rendering with visibility attention to render occluded humans. It can partially recover occluded regions, but it still inherits NeRF’s drawbacks: it is slow during training and inference and requires significant GPU memory due to its ray sampling approach. Wild2Avatar (Xiang et al., 2023a) is another neural rendering method for occluded humans. It introduces an occlusion-aware scene parameterization that decomposes the scene into three parts: occlusion, human body, and background. However, Wild2Avatar still employs NeRF for rendering and therefore cannot avoid slow training and inference.

Figure 2. OccGaussian Framework. We initialize 3D Gaussian distributions in the canonical space, then transform the points from canonical space to posed space through linear blend skinning according to the SMPL parameters. Meanwhile, the input image is encoded to obtain feature maps; we then project the points onto the 2D image plane and perform a feature query in the occluded regions, extracting aggregated pixel-aligned features for each occluded point. We concatenate this feature with the embedded occluded point and feed them into an MLP to predict the spherical harmonic coefficients $f$ and the opacity $\alpha$. Following 3DGS (Kerbl et al., 2023), we apply the tile-based differentiable rasterizer to achieve fast rendering, and use adaptive density control during training. Besides the standard loss functions, we also design an occlusion loss and a consistency loss to prevent the model from learning background information in occluded regions.

3. Methods

In this section, we start by briefly reviewing the linear blend skinning (LBS) function and 3D Gaussian Splatting (Kerbl et al., 2023) in section 3.1. We then present OccGaussian by introducing 3D Gaussian Forward Skinning (section 3.2), Occlusion Feature Query (section 3.3), Gaussian Feature MLP (section 3.4), and our novel loss functions (section 3.5) to achieve high rendering quality as well as fast training/inference under occlusions. An overview of OccGaussian is shown in Figure 2.

3.1. Preliminary

Parametric model of human. Parametric human models (Loper et al., 2023; Pavlakos et al., 2019; Romero et al., 2022; Xu et al., 2020) describe human shape and pose using a set of parameters, enabling modeling, animation, and rendering. The most widely used is the SMPL model (Loper et al., 2023). SMPL defines shape parameters $\beta \in \mathbb{R}^{10}$ and pose parameters $\theta \in \mathbb{R}^{72}$, and uses the function $M(\beta, \theta)$ to output the $N = 6890$ vertices of a mesh. The Linear Blend Skinning (LBS) algorithm transforms the SMPL vertices $x^{c}$ in canonical space to points $x^{p}$ in posed space:

(1) $x^{p} = \sum_{k=1}^{K} \omega_{k} \left( G_{k}(J, \theta)\, x^{c} + b_{k}(J, \theta, \beta) \right)$

where $J$ represents the positions of the $K$ joints, $\omega_{k}$ is the skinning weight of the $k$-th joint for the SMPL vertex, and $G_{k}(\cdot)$ and $b_{k}(\cdot)$ denote the transformation matrix and translation vector of joint $k$, respectively.
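As an illustration of Eq. (1), the following is a minimal NumPy sketch of linear blend skinning. It is not the authors' implementation: the function name `linear_blend_skinning` and the array layouts are assumptions, and the per-joint transforms `G` and `b` are taken as given rather than derived from the SMPL joints and pose.

```python
import numpy as np

def linear_blend_skinning(x_c, weights, G, b):
    """Map canonical points to posed space as a weighted sum over joints.

    x_c:     (N, 3) canonical points
    weights: (N, K) skinning weights, each row summing to 1
    G:       (K, 3, 3) per-joint transformation matrices G_k (assumed given)
    b:       (K, 3) per-joint translation vectors b_k (assumed given)
    """
    # Apply every joint transform to every point: shape (N, K, 3).
    per_joint = np.einsum("kij,nj->nki", G, x_c) + b[None, :, :]
    # Blend the K candidate positions with the skinning weights: shape (N, 3).
    return np.einsum("nk,nki->ni", weights, per_joint)

# Toy example with two joints: joint 0 is the identity, joint 1 shifts by +2 in x.
x_c = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
weights = np.array([[0.5, 0.5], [1.0, 0.0]])
G = np.stack([np.eye(3), np.eye(3)])
b = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
x_p = linear_blend_skinning(x_c, weights, G, b)
```

In this toy setup the first point is influenced equally by both joints and so moves halfway along the shift, while the second point follows joint 0 only and stays in place.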

3D Gaussian Splatting (Kerbl et al., 2023). NeRF (Mildenhall et al., 2021) represents the scene using a 5D neural radiance field and then computes pixel color and opacity through volume rendering. In contrast, 3DGS employs unstructured, explicit 3D Gaussian distributions to represent the scene, which are differentiable and easy to project. 3DGS models the geometry as a set of 3D Gaussian functions that do not require normals. Each Gaussian is defined by a covariance matrix $\Sigma$ in world space, with its mean $\mu$ as the center:

(2) $G(x) = \frac{1}{(2\pi)^{\frac{3}{2}} \left|\Sigma\right|^{\frac{1}{2}}} e^{-\frac{1}{2}(x-\mu)^{T} \Sigma^{-1} (x-\mu)}$

To render, the points must be projected onto the image plane. Given a transformation matrix $W$ from world coordinates to camera coordinates, the covariance matrix $\Sigma'$ in camera coordinates is computed as $\Sigma' = J W \Sigma W^{T} J^{T}$, where $J$ is the Jacobian matrix of the affine approximation of the projection (not to be confused with the joint positions in Eq. (1)). Since the covariance matrix must be positive semi-definite, 3DGS represents $\Sigma$ by a scaling matrix $S$ and a rotation matrix $R$: $\Sigma = R S S^{T} R^{T}$, which are stored as a scaling vector $s \in \mathbb{R}^{3}$ and a rotation quaternion $q \in \mathbb{R}^{4}$, respectively. With these formulas, we can transform the 3D Gaussians from world coordinates to ray coordinates:

(3) $G'(x) = \frac{1}{(2\pi)^{\frac{3}{2}} \left|\Sigma'\right|^{\frac{1}{2}}} e^{-\frac{1}{2}(x-\mu')^{T} {\Sigma'}^{-1} (x-\mu')}$

where $\mu' = c(W\mu + t)$, $c(\cdot)$ denotes the projection function, and $t$ is the translation vector. After projection, we compute the number of overlapping Gaussians at each pixel, as well as the color $c_{i}$ and opacity $\alpha_{i}$ of all points at that pixel, and sort them by depth. Finally, the $N$ ordered Gaussians are blended to obtain the pixel color:

(4) $\hat{C} = \sum_{i=1}^{N} c_{i} \alpha_{i} \prod_{j=1}^{i-1} (1 - \alpha_{j})$
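The front-to-back blending of Eq. (4) can be sketched for a single pixel as follows. This is a simplified illustration and not the tile-based CUDA rasterizer of 3DGS: the function name `composite` is hypothetical, and the colors and opacities are assumed to be already sorted from nearest to farthest, with each Gaussian attenuated by the opacities of the Gaussians in front of it.

```python
import numpy as np

def composite(colors, alphas):
    """Blend depth-sorted Gaussians at one pixel, nearest first.

    colors: (N, 3) RGB color of each Gaussian
    alphas: (N,) opacity of each Gaussian at this pixel
    """
    # Transmittance in front of Gaussian i: product of (1 - alpha_j) for j < i.
    transmittance = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    # Each Gaussian contributes its color times alpha_i times that transmittance.
    blend_weights = alphas * transmittance
    return blend_weights @ colors

# Toy example: a half-transparent red Gaussian in front of a green one.
colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
alphas = np.array([0.5, 0.5])
pixel = composite(colors, alphas)
```

Here the red Gaussian contributes with weight 0.5 and the green one with weight 0.5 x 0.5 = 0.25, since half of the light is already absorbed in front of it.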

3DGS initializes Gaussian distributions with sparse point clouds from SfM (Schonberger and Frahm, 2016), and the color of each point is represented by spherical harmonic (SH) coefficients. 3DGS also proposes adaptive density control: it clones points with large gradients and small scaling matrices, and splits points with large gradients and large scaling matrices. In addition, after every 100 iterations, points with opacities below a threshold are pruned.
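The covariance parameterization and projection reviewed in this subsection can be sketched as below. This is a minimal sketch under stated assumptions: the quaternion is taken in (w, x, y, z) order, the function names are hypothetical, and the Jacobian `J` is passed in as a given 2x3 matrix rather than computed from the camera model.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)  # normalize so the result is a rotation
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def covariance_3d(s, q):
    """Build Sigma = R S S^T R^T from a scaling vector s and a quaternion q."""
    R = quat_to_rotmat(q)
    S = np.diag(s)
    return R @ S @ S.T @ R.T

def project_covariance(cov, W, J):
    """Compute Sigma' = J W Sigma W^T J^T for a given view matrix W and Jacobian J."""
    return J @ W @ cov @ W.T @ J.T

# Toy example: axis-aligned Gaussian, identity view, orthographic-like Jacobian.
s = np.array([1.0, 2.0, 3.0])
q_identity = np.array([1.0, 0.0, 0.0, 0.0])
cov = covariance_3d(s, q_identity)
J = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
cov_2d = project_covariance(cov, np.eye(3), J)

# A 90-degree rotation about the z axis, used to check the quaternion convention.
q_z90 = np.array([np.sqrt(0.5), 0.0, 0.0, np.sqrt(0.5)])
R_z90 = quat_to_rotmat(q_z90)
```

Because the covariance is assembled as a product of a matrix with its own transpose, it is symmetric and positive semi-definite by construction, which is the reason for storing `s` and `q` instead of optimizing the matrix entries directly.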

3.2. 3D Gaussian Forward Skinning

3DGS and its variants (Kerbl et al., 2023; Luiten et al., 2023; Wu et al., 2023; Yang et al., 2023) achieve fast rendering of static or dynamic scenes by splatting a set of 3D Gaussian points. Following GauHuman (Hu and Liu, 2023), we similarly represent the human with Gaussian distributions in canonical space and map the points to the posed space of each frame using the LBS transformation. Owing to the favorable properties of 3D Gaussians, such as rotational invariance, we can directly rotate and translate the mean and covariance matrix of each point:

(5) $x^{p} = G(J^{p}, \theta^{p})\, x^{c} + b(J^{p}, \theta^{p}, \beta^{p})$
$\Sigma^{p} = G(J^{p}, \theta^{p})\, \Sigma^{c}\, G(J^{p}, \theta^{p})^{T},$

where $x^{p}$, $\Sigma^{p}$ and $x^{c}$, $\Sigma^{c}$ are the mean and covariance matrix of points in posed and canonical space, respectively. $G(J^{p}, \theta^{p}) = \sum_{k=1}^{K} w_{k} G_{k}(J^{p}, \theta^{p})$ is the rotation matrix and $b(J^{p}, \theta^{p}, \beta^{p}) = \sum_{k=1}^{K} w_{k} b_{k}(J^{p}, \theta^{p}, \beta^{p})$ is the translation vector, where $K$ is the number of joints. $G_{k}(J^{p}, \theta^{p})$ and $b_{k}(J^{p}, \theta^{p}, \beta^{p})$ are the transformation matrix and translation vector of joint $k$, respectively, and $w_{k}$ is the LBS weight.
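Eq. (5) can be sketched in NumPy as follows, as an illustration rather than the authors' code. The function name `skin_gaussians` and the array layouts are assumptions, and the per-joint transforms are again taken as given inputs.

```python
import numpy as np

def skin_gaussians(mu_c, cov_c, weights, G, b):
    """Move Gaussian means and covariances from canonical to posed space.

    mu_c:    (N, 3) canonical means
    cov_c:   (N, 3, 3) canonical covariance matrices
    weights: (N, K) LBS weights w_k for each Gaussian
    G:       (K, 3, 3) per-joint transformation matrices (assumed given)
    b:       (K, 3) per-joint translation vectors (assumed given)
    """
    # Blend the per-joint transforms with the LBS weights.
    G_blend = np.einsum("nk,kij->nij", weights, G)  # (N, 3, 3)
    b_blend = weights @ b                            # (N, 3)
    # Mean: x^p = G x^c + b.
    mu_p = np.einsum("nij,nj->ni", G_blend, mu_c) + b_blend
    # Covariance: Sigma^p = G Sigma^c G^T (the translation does not affect it).
    cov_p = G_blend @ cov_c @ np.transpose(G_blend, (0, 2, 1))
    return mu_p, cov_p

# Toy example: one joint rotating 90 degrees about z and lifting by 1 in z.
G = np.array([[[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]])
b = np.array([[0.0, 0.0, 1.0]])
weights = np.array([[1.0]])
mu_c = np.array([[1.0, 0.0, 0.0]])
cov_c = np.diag([4.0, 1.0, 1.0])[None]
mu_p, cov_p = skin_gaussians(mu_c, cov_c, weights, G, b)
```

In the toy example the Gaussian is elongated along x in canonical space and ends up elongated along y after the rotation, while its mean is rotated and then translated.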

LBS Weight Field and Pose Refinement. Following GauHuman (Hu and Liu, 2023), we employ $\text{MLP}_{\Phi_{\text{lbs}}}(\cdot)$ to predict an LBS weight offset for each SMPL vertex. For each point, we find its nearest SMPL vertex with weight $w_{k}^{\text{SMPL}}$ and add the predicted offset $\text{MLP}_{\Phi_{\text{lbs}}}(\gamma(x^{c}))$:

(6) $w_{k} = \frac{e^{\log(w_{k}^{\text{SMPL}} + 10^{-8}) + \text{MLP}_{\Phi_{\text{lbs}}}(\gamma(x^{c}))[k]}}{\sum_{k'=1}^{K} e^{\log(w_{k'}^{\text{SMPL}} + 10^{-8}) + \text{MLP}_{\Phi_{\text{lbs}}}(\gamma(x^{c}))[k']}},$

where $\gamma(\cdot)$ is the positional encoding. For the body pose $\theta$, we also adopt the pose refinement from GauHuman (Hu and Liu, 2023), adding an $\text{MLP}_{\Phi_{\text{pose}}}(\cdot)$ to correct the SMPL pose $\bm{\theta}^{\text{SMPL}}$:

(7) $\bm{\theta} = \bm{\theta}^{\text{SMPL}} \otimes \text{MLP}_{\Phi_{\text{pose}}}(\bm{\theta}).$

With the LBS offset and pose refinement modules, the SMPL parameters become more accurate, which leads to better rendering results. Previous works (Chen et al., 2021; Yang et al., 2021; Jiang et al., 2023a; Weng et al., 2022) have also demonstrated their effectiveness. After training, the canonical points, along with the transformation matrices $\{R_{i}\}_{i=1}^{N}$ and translation vectors $\{T_{i}\}_{i=1}^{N}$, are saved for inference, where $N$ is the number of views. During inference, the points $x^{c}$ can be easily transformed to the posed space $x^{p}$ at view $i$: $x^{p} = R_{i} x^{c} + T_{i}$, which further increases the rendering speed.
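The LBS weight refinement of Eq. (6) amounts to a softmax over the log of the SMPL weights plus a predicted offset. A minimal sketch is given below; the function name `refine_lbs_weights` is hypothetical, and the MLP output is replaced by a plain array `offsets`, so this only illustrates the normalization and not the network itself.

```python
import numpy as np

def refine_lbs_weights(w_smpl, offsets, eps=1e-8):
    """Combine SMPL skinning weights with predicted offsets via a softmax.

    w_smpl:  (N, K) weights of the nearest SMPL vertex for each point
    offsets: (N, K) stand-in for the MLP output (an assumption of this sketch)
    """
    logits = np.log(w_smpl + eps) + offsets
    # Subtracting the row maximum leaves the softmax unchanged but avoids overflow.
    logits = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

# With zero offsets the SMPL weights are recovered (up to the small epsilon).
w_smpl = np.array([[0.7, 0.3, 0.0]])
w_same = refine_lbs_weights(w_smpl, np.zeros((1, 3)))

# An offset of log(7/3) on the second joint balances the first two weights.
offsets = np.array([[0.0, np.log(7.0 / 3.0), 0.0]])
w_new = refine_lbs_weights(w_smpl, offsets)
```

Working in log space means a zero offset leaves the SMPL weights essentially untouched, so the network only needs to learn corrections, and the output always remains a valid set of weights that sums to one.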

Figure 3. Occlusion Feature Query. For each occluded point (green), we first query its K-nearest visible points (orange) among all visible points, and then project these K-nearest visible points onto the feature maps to extract pixel-aligned features.

3.3. Occlusion Feature Query

Motivation. Although previous human rendering methods based on NeRF or 3DGS achieve great rendering quality in non-occluded environments, their performance deteriorates significantly under even slight occlusion. If 3DGS is applied directly to render occluded humans, it completely fails to recover the occluded regions. Because 3DGS is based on point rendering and the points are independent of each other, a point under occlusion receives no ground truth supervision. Such occluded points are treated as blank background during training, which drives their opacity close to zero. During adaptive density control, the opacity of these points falls below the threshold and they are therefore pruned. Even if these points are not pruned, their spherical harmonic coefficients still represent blank information that cannot contribute to rendering.

K-nearest Occluded Points Query. To solve this problem, we draw inspiration from traditional image inpainting methods, which exploit the redundancy inherent in an image by using information from known parts to predict the missing regions. Considering the prior that human body structure and clothing texture are most similar in nearby regions, we fill the occluded regions using information from the nearest non-occluded parts. First, using the provided camera parameters, we project the 3D points after LBS transformation, $x \in \mathbb{R}^{N \times 3}$, onto the 2D image plane and determine their visibility based on whether they lie within the foreground mask $\alpha_{fg}$. This gives us $N_1$ visible points $x_{seen} \in \mathbb{R}^{N_1 \times 3}$ and $N_2$ occluded points $x_{occ} \in \mathbb{R}^{N_2 \times 3}$, where $N = N_1 + N_2$. Since the occluded points lack ground truth, we follow the principle of image inpainting and replace these blank points with information from their nearest visible points. Specifically, for each occluded point, we query its K-nearest neighbors among all visible points $x_{seen}$, obtaining $N_2 \times K$ points denoted as $x_{knn} \in \mathbb{R}^{N_2 \times K \times 3}$, where $K$ is the number of nearest visible points. Based on our experiments, we set the hyperparameter $K = 3$.
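
The visibility split and the K-nearest query can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes points are already in camera space, a 3x3 pinhole intrinsic matrix `K_cam`, nearest-pixel mask lookup, and a brute-force distance search; all function names are hypothetical.

```python
import numpy as np

def split_by_visibility(x, K_cam, fg_mask):
    """Return (seen_idx, occ_idx) for camera-space points x of shape (N, 3).

    A point counts as visible when its projection lands on a foreground pixel.
    Points that project outside the image are treated as occluded here.
    """
    uvw = x @ K_cam.T                      # homogeneous pixel coordinates
    uv = uvw[:, :2] / uvw[:, 2:3]          # perspective divide -> (u, v)
    cols = np.round(uv[:, 0]).astype(int)
    rows = np.round(uv[:, 1]).astype(int)
    H, W = fg_mask.shape
    inside = (rows >= 0) & (rows < H) & (cols >= 0) & (cols < W)
    visible = np.zeros(len(x), dtype=bool)
    visible[inside] = fg_mask[rows[inside], cols[inside]] > 0
    return np.flatnonzero(visible), np.flatnonzero(~visible)

def knn_visible(x, seen_idx, occ_idx, k=3):
    """For each occluded point, return indices (into x) of its k nearest visible points."""
    diff = x[occ_idx][:, None, :] - x[seen_idx][None, :, :]   # (N2, N1, 3)
    dist2 = (diff ** 2).sum(axis=-1)                           # (N2, N1)
    nearest = np.argsort(dist2, axis=1)[:, :k]                 # (N2, k)
    return seen_idx[nearest]

# Toy example: only the top image row is foreground.
K_cam = np.eye(3)
fg_mask = np.zeros((4, 4))
fg_mask[0, :] = 1
x = np.array([[0.0, 0.0, 1.0],
              [1.0, 0.0, 1.0],
              [2.0, 0.0, 1.0],
              [3.0, 0.0, 1.0],
              [1.2, 2.0, 1.0]])            # projects to row 2, outside the mask
seen_idx, occ_idx = split_by_visibility(x, K_cam, fg_mask)
knn_idx = knn_visible(x, seen_idx, occ_idx, k=3)
x_knn = x[knn_idx]                         # (N2, K, 3)
```

The brute-force search costs $O(N_1 N_2)$ per frame, which is acceptable for a few thousand points; a spatial index would be a natural replacement for larger point sets. The sketch also assumes at least `k` visible points exist.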

Aggregated Pixel-Aligned Feature. At the same time, we encode the input image $I \in \mathbb{R}^{H \times W \times 3}$ using a deep convolutional encoder $\Phi$ to obtain the feature maps $I_{fea} \in \mathbb{R}^{H \times W \times C}$, where $C$ is the channel dimension. Similar to the approach in (Zhao et al., 2022; Hu et al., 2023a; Mihajlovic et al., 2022), we project the K-nearest points $x_{knn} \in \mathbb{R}^{N_2 \times K \times 3}$ onto the feature maps and bilinearly interpolate the grid values to extract the pixel-aligned features for each point, denoted as $h \in \mathbb{R}^{N_2 \times K \times C}$. The pixel-aligned features help incorporate the local texture information from 2D images into 3D points. The feature extraction process can be formulated as:

(8) $h = \Gamma(\Pi(I_{fea}; x_{knn}))$

where $\Pi(\cdot)$ denotes the 3D-to-2D projection and $\Gamma(\cdot)$ is the bilinear interpolation. Figure 3 illustrates the process above.
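
Eq. (8) can be sketched in NumPy as a projection followed by bilinear sampling of the feature map. The pinhole intrinsics `K_cam`, the clamping at image borders, and the function names are assumptions made for illustration; the encoder $\Phi$ is replaced by a hand-written toy feature map.

```python
import numpy as np

def bilinear_sample(feat, uv):
    """Gamma: sample feat (H, W, C) at pixel coords uv (..., 2), u = column, v = row."""
    H, W, _ = feat.shape
    u = np.clip(uv[..., 0], 0.0, W - 1.0)   # clamp to the image (assumption)
    v = np.clip(uv[..., 1], 0.0, H - 1.0)
    u0 = np.floor(u).astype(int)
    v0 = np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, W - 1)
    v1 = np.minimum(v0 + 1, H - 1)
    wu = (u - u0)[..., None]                # horizontal interpolation weight
    wv = (v - v0)[..., None]                # vertical interpolation weight
    top = (1 - wu) * feat[v0, u0] + wu * feat[v0, u1]
    bottom = (1 - wu) * feat[v1, u0] + wu * feat[v1, u1]
    return (1 - wv) * top + wv * bottom

def pixel_aligned_features(feat, x_knn, K_cam):
    """h = Gamma(Pi(I_fea; x_knn)) for x_knn of shape (N2, K, 3)."""
    uvw = x_knn @ K_cam.T                   # Pi: pinhole projection
    uv = uvw[..., :2] / uvw[..., 2:3]
    return bilinear_sample(feat, uv)        # (N2, K, C)

# Toy feature map with H = W = 2 and C = 1.
feat = np.array([[[0.0], [1.0]],
                 [[2.0], [3.0]]])
x_knn = np.array([[[0.5, 0.5, 1.0],
                   [1.0, 0.0, 1.0]]])      # (N2=1, K=2, 3)
h = pixel_aligned_features(feat, x_knn, np.eye(3))
```

The first point falls at the center of the four pixels and receives their average, while the second lands exactly on a pixel and receives that pixel's feature unchanged.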

However, some of the K-nearest visible points may have been occluded for a long time in previous frames, which seriously reduces their credibility. Following OccNeRF (Xiang et al., 2023b), we propose occlusion-aware aggregation to refine the K-nearest features. We define visibility weights $\rho \in \mathbb{R}^{N \times 1}$ on the canonical SMPL vertices. During training, whenever a point is visible, its weight is incremented by 1. We then use these weights to weight the features $h$ of the K-nearest visible points, obtaining $h_{agg} \in \mathbb{R}^{N_2 \times C}$:

(9) $h_{agg} = \sum_{i=1}^{K} \rho_i h_i,$

Note that our visibility weights will no longer update after the adaptive density control begins.
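
The visibility counting and the weighted sum of Eq. (9) can be sketched as below. The sketch applies Eq. (9) literally, without normalizing the weights, and assumes that `knn_idx` indexes into the same point set on which $\rho$ is defined; both function names are hypothetical.

```python
import numpy as np

def update_visibility(rho, visible_idx):
    """Increment the weight of every point that is visible in the current frame."""
    rho = rho.copy()
    rho[visible_idx] += 1.0          # visible_idx is assumed to hold unique indices
    return rho

def aggregate(h, rho, knn_idx):
    """Eq. (9): h_agg = sum_i rho_i * h_i over the K neighbours of each occluded point."""
    w = rho[knn_idx]                 # (N2, K) visibility weights of the neighbours
    return (w[..., None] * h).sum(axis=1)   # (N2, C)

# Three hypothetical training frames over four points.
rho = np.zeros(4)
for visible_idx in ([0, 1], [0, 1], [1, 2]):
    rho = update_visibility(rho, np.array(visible_idx))

h = np.array([[[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0]]])         # (N2=1, K=3, C=2)
knn_idx = np.array([[0, 1, 2]])
h_agg = aggregate(h, rho, knn_idx)
```

Neighbours that were seen in more frames contribute more to the aggregated feature, which is the intended effect of the occlusion-aware weighting.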

3.4. Gaussian Feature MLP

Previous works (Wu et al., 2023; Yang et al., 2023) follow the approach of 3DGS (Kerbl et al., 2023) and learn view-dependent colors by storing features within each 3D Gaussian. The stored color feature $f$ consists of a set of spherical harmonic coefficients, and the color at each point is computed as the dot product of the spherical harmonic basis functions, evaluated at the view direction, with these coefficients:

(10) $c = (\gamma(d), f)$

Here, $d$ represents the view direction, i.e., the direction from the 3D Gaussian towards the camera center, and $\gamma$ denotes the spherical harmonic basis functions. While this approach is straightforward and efficient for rendering static scenes with 3DGS, we find it unsuitable for occluded regions. As our input is a monocular video, the view direction in world space is single and fixed, which leads to poor generalization to unseen test views. Additionally, directly replacing occluded points with information from their nearest neighbors, though simple, relies too heavily on local information. Therefore, we add multi-layer perceptrons (MLPs) to further model the colors of occluded regions. Since the positional information of occluded regions, i.e., mean and covariance, is not affected by occlusion, we only employ MLPs to learn the spherical harmonic coefficients $f$ and opacity $\alpha$ of occluded points. Specifically, we concatenate the aforementioned aggregated pixel-aligned feature $h_{agg} \in \mathbb{R}^{N_2 \times C}$ with the embedded occluded 3D points $x_{occ}$, and feed the result into $MLP_{shs}$ and $MLP_{opacity}$ to predict the spherical harmonic coefficients $f_{occ} \in \mathbb{R}^{3 \times 16}$ and opacity $\alpha_{occ} \in \mathbb{R}^{1}$ for each occluded point:

(11) $f_{occ} = MLP_{shs}(h_{agg}, \gamma(x_{occ}))$
$\alpha_{occ} = MLP_{opacity}(h_{agg}, \gamma(x_{occ})),$

where $\gamma(\cdot)$ represents the positional encoding (Vaswani et al., 2017). We replace the spherical harmonic coefficients and opacity of each occluded point with the outputs of $MLP_{shs}$ and $MLP_{opacity}$.
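
The data flow of Eq. (11) can be illustrated with a small NumPy sketch. Everything about the network itself is an assumption here: the number of encoding frequencies, the hidden width of 64, the ReLU activation, the sigmoid on the opacity output, and the random untrained weights are placeholders chosen only to show the input and output shapes.

```python
import numpy as np

def positional_encoding(x, num_freqs=4):
    """Sinusoidal encoding [x, sin(2^j x), cos(2^j x)] for j = 0..num_freqs-1."""
    freqs = 2.0 ** np.arange(num_freqs)
    ang = x[..., None] * freqs                                   # (N, 3, L)
    enc = np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)    # (N, 3, 2L)
    return np.concatenate([x, enc.reshape(x.shape[0], -1)], axis=-1)

def init_mlp(rng, sizes):
    """Random weights for a fully connected network (untrained placeholder)."""
    return [(rng.normal(0.0, 0.1, (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def mlp(params, z):
    """ReLU hidden layers followed by a linear output layer."""
    for W, b in params[:-1]:
        z = np.maximum(z @ W + b, 0.0)
    W, b = params[-1]
    return z @ W + b

rng = np.random.default_rng(0)
N2, C = 5, 8                                    # toy sizes
h_agg = rng.normal(size=(N2, C))                # aggregated pixel-aligned features
x_occ = rng.normal(size=(N2, 3))                # occluded point positions

# Concatenate the fused feature with the embedded positions.
z = np.concatenate([h_agg, positional_encoding(x_occ)], axis=-1)

mlp_shs = init_mlp(rng, [z.shape[1], 64, 3 * 16])
mlp_opacity = init_mlp(rng, [z.shape[1], 64, 1])

f_occ = mlp(mlp_shs, z).reshape(N2, 3, 16)      # SH coefficients per occluded point
alpha_occ = 1.0 / (1.0 + np.exp(-mlp(mlp_opacity, z)))   # sigmoid keeps opacity in (0, 1)
```

The two heads share the same input but have separate weights, mirroring the two functions $MLP_{shs}$ and $MLP_{opacity}$ in Eq. (11).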

3.5. Training Strategy

Loss Function. Following other methods that use 3DGS to render humans, we introduce the RGB loss $\mathcal{L}_{rgb}$, mask loss $\mathcal{L}_{mask}$, SSIM loss $\mathcal{L}_{SSIM}$ (Wang et al., 2004), and LPIPS loss $\mathcal{L}_{LPIPS}$ (Zhang et al., 2018) to supervise the network on the visible human parts, for which ground truth is available. We define the combination of these losses as $\mathcal{L}_{standard}$:

(12) $\mathcal{L}_{standard} = \mathcal{L}_{rgb} + \lambda_1 \mathcal{L}_{mask} + \lambda_2 \mathcal{L}_{SSIM} + \lambda_3 \mathcal{L}_{LPIPS},$
Figure 4. The process of $\mathcal{L}_{occ}$. The observation points are projected to get a rough human body mask $\alpha_{body}$. By performing a logical OR operation between this mask and the inverse of the foreground mask $\alpha_{fg}$, we obtain an approximate mask $\alpha_{occ}$ describing occluded objects. Then we render the occluded points and calculate the loss between the two.

However, for occluded regions where ground truth does not exist, these losses become ineffective. Hence, it is necessary to design losses specifically for occluded areas. Inspired by the occlusion decoupling loss proposed in Wild2Avatar (Xiang et al., 2023a), we design the occlusion loss $\mathcal{L}_{occ}$, whose process is shown in Figure 4. We first project the points onto the image plane to obtain a body mask $\alpha_{body}$ that roughly describes the outline of the human body. For the occluded foreground mask $\alpha_{fg}$, we compute its inversion and perform an XOR operation between $\alpha_{body}$ and the inversion of $\alpha_{fg}$, resulting in an occlusion mask $\alpha_{occ}$ that roughly describes the occluded region. Since the points are projected onto the image plane, we can extract the points that lie in the occluded region, render them, and calculate the loss between the rendered alpha mask $\alpha_{render\_occ}$ and the occlusion mask $\alpha_{occ}$. In summary, $\alpha_{occ}$ and $\mathcal{L}_{occ}$ are defined as:

(13) $\alpha_{occ} = \alpha_{body} \odot (\sim \alpha_{fg})$
$\mathcal{L}_{occ} = MSE(\alpha_{render\_occ}, \alpha_{occ}),$

where $\odot$ is the XOR operation. In addition, we observe that after the adaptive density control, some points with small opacity in occluded regions are pruned, although they should have been retained. For these points under the opacity threshold $\epsilon$, we render them to obtain $\hat{C}_{s\_opacity}$ and $\hat{\alpha}_{s\_opacity}$, and calculate the consistency loss $\mathcal{L}_{con}$ with the RGB ground truth $C$ and mask ground truth $\alpha$:

(14) $\mathcal{L}_{con} = \left| \hat{C}_{s\_opacity} - C \right| + \lambda_{con} MSE(\hat{\alpha}_{s\_opacity}, \alpha),$

where we set the opacity threshold as $\epsilon = 0.05$ and $\lambda_{con} = 0.1$. Our total loss function $\mathcal{L}_{total}$ is defined as:

(15) $\mathcal{L}_{total} = \mathcal{L}_{standard} + \lambda_4 \mathcal{L}_{occ} + \mathcal{L}_{con},$

Here we set $\lambda_1 = \lambda_2 = \lambda_3 = \lambda_4 = 0.1$. For more details about the loss functions, please refer to the appendix.
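
The way Eqs. (12), (14) and (15) combine the individual terms can be sketched as follows. The individual RGB, mask, SSIM and LPIPS values are passed in as plain numbers, the occlusion mask $\alpha_{occ}$ is taken as a given input, and the absolute difference in Eq. (14) is reduced with a mean; that reduction and all function names are assumptions of this sketch.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two arrays."""
    return float(np.mean((np.asarray(a, float) - np.asarray(b, float)) ** 2))

def standard_loss(l_rgb, l_mask, l_ssim, l_lpips, lam1=0.1, lam2=0.1, lam3=0.1):
    """Eq. (12): weighted sum of the losses on visible regions."""
    return l_rgb + lam1 * l_mask + lam2 * l_ssim + lam3 * l_lpips

def occlusion_loss(alpha_render_occ, alpha_occ):
    """Eq. (13), second line: MSE between rendered alpha and the occlusion mask."""
    return mse(alpha_render_occ, alpha_occ)

def consistency_loss(c_hat, c_gt, a_hat, a_gt, lam_con=0.1):
    """Eq. (14): L1 colour term (mean-reduced here) plus weighted mask MSE."""
    l1 = float(np.mean(np.abs(np.asarray(c_hat, float) - np.asarray(c_gt, float))))
    return l1 + lam_con * mse(a_hat, a_gt)

def total_loss(l_standard, l_occ, l_con, lam4=0.1):
    """Eq. (15): total training objective."""
    return l_standard + lam4 * l_occ + l_con

# Toy values, chosen only to make the arithmetic easy to follow.
l_std = standard_loss(0.7, 1.0, 1.0, 1.0)
l_occ = occlusion_loss([0.0, 1.0], [0.0, 0.0])
l_con = consistency_loss([[0.5, 0.5]], [[0.0, 1.0]], [1.0, 0.0], [1.0, 1.0])
l_total = total_loss(l_std, l_occ, l_con)
```

With all four $\lambda$ weights at 0.1, the RGB term dominates $\mathcal{L}_{standard}$, while $\mathcal{L}_{con}$ enters the total with unit weight.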

3D Gaussian Optimization. 3DGS uses SfM (Schonberger and Frahm, 2016) sparse point clouds to initialize the Gaussian distributions. In the human rendering task, it is more reasonable to initialize the Gaussian distributions using the $N = 6890$ SMPL vertices in the canonical space, which contain more prior information about the human body (Hu and Liu, 2023; Qian et al., 2023b). Additionally, we follow the adaptive density control in GauHuman (Hu and Liu, 2023), which involves: (1) constraining split and clone based on KL divergence, (2) introducing a merge operation to merge redundant points, and (3) pruning the points that are far from the SMPL surface. Please refer to the appendix for details.

Figure 5. Qualitative results between our OccGaussian and OccNeRF on ZJU-MoCap and OcMotion datasets.
Table 1. Quantitative comparison of ours and baseline methods on the ZJU-MoCap and OcMotion datasets. The best results are in bold. LPIPS is reported as 1000 × LPIPS.

| Method | ZJU-MoCap PSNR↑ | ZJU-MoCap SSIM↑ | ZJU-MoCap LPIPS↓ | ZJU-MoCap Train | ZJU-MoCap FPS | OcMotion PSNR↑ | OcMotion SSIM↑ | OcMotion LPIPS↓ | OcMotion Train | OcMotion FPS |
|---|---|---|---|---|---|---|---|---|---|---|
| HumanNeRF (Weng et al., 2022) | 20.67 | 0.9509 | - | - | - | 19.57 | 0.9575 | - | - | - |
| OccNeRF (Xiang et al., 2023b) | 22.40 | **0.9562** | 43.01 | 28h∼40h | 0.20 | 21.01 | **0.9668** | 38.14 | 25h | 0.16 |
| OccGaussian (Ours) | **23.29** | 0.9482 | **41.93** | **6m** | **169** | **21.76** | 0.9657 | **32.18** | **13m** | **163** |

4. Experiments

4.1. Implementation Details

We adopt the pre-trained ResNet18 (He et al., 2016) as the 2D image encoder. We train for 2400 iterations on the ZJU-MoCap dataset (Peng et al., 2021b). For the OcMotion dataset (Huang et al., 2022), whose occlusions are more complex and varied, we train for 5000 iterations. Using the loss functions proposed in Section 3.5, we optimize OccGaussian with the Adam optimizer (Kingma and Ba, 2014) and dynamically adjust the learning rate based on the training step. Please refer to the appendix for more details.

4.2. Datasets

ZJU-MoCap (Peng et al., 2021b). The ZJU-MoCap dataset is a widely used benchmark in human modeling that supplies human masks and SMPL parameters. We select six human subjects (377, 386, 387, 392, 393, 394) for our experiments and adopt the same training and testing setting as OccNeRF (Xiang et al., 2023b), i.e., the first camera is used for training, and the remaining cameras are used for evaluation. Since ZJU-MoCap is captured in a lab environment without occlusion, we simulate occlusions on the training data following OccNeRF. This is done by artificially placing a rectangular barrier between the camera and the human, which is centered at the mean center of all valid pixels in the video frame and obscures 50% of the valid pixels. The length of the rectangle equals the length of the image, and the center and width vary for different subjects. We keep the obstacle stationary and add this occlusion to 80% of the training frames. More details are provided in the appendix.
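
One possible way to build such a simulated barrier is sketched below. The orientation (a vertical band spanning the full image height), the greedy widening rule, and the function name are assumptions of this illustration and are not taken from the OccNeRF protocol; with discrete pixels the band hides at least, rather than exactly, the target fraction.

```python
import numpy as np

def band_occluder(fg_mask, target=0.5):
    """Full-height vertical band centred on the mean foreground column.

    The band is widened symmetrically until it hides at least `target` of the
    foreground pixels. fg_mask is assumed to contain at least one foreground pixel.
    """
    fg = np.asarray(fg_mask).astype(bool)
    H, W = fg.shape
    cols = np.nonzero(fg)[1]                       # column index of each foreground pixel
    center = int(round(float(cols.mean())))
    total = fg.sum()
    for half in range(W):
        lo = max(center - half, 0)
        hi = min(center + half + 1, W)
        if fg[:, lo:hi].sum() >= target * total:
            break
    occ = np.zeros((H, W), dtype=bool)
    occ[:, lo:hi] = True
    return occ

# Toy silhouette occupying columns 2..6 of a 4 x 8 image.
fg = np.zeros((4, 8), dtype=bool)
fg[:, 2:7] = True
occ = band_occluder(fg)
fg_occluded = fg & ~occ                            # foreground mask after occlusion
hidden_fraction = (fg & occ).sum() / fg.sum()
```

Keeping the same band for every frame corresponds to the stationary obstacle described above; applying it to only a subset of frames would be handled outside this function.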

OcMotion (Huang et al., 2022). The OcMotion dataset is built for human pose estimation under occlusion and contains 43 motions and 300K frames with 3D annotations. This dataset better reflects what happens when people encounter occlusion in real-life scenarios. Following OccNeRF, we evaluate on two videos with different levels of occlusion: 500 frames from video 11, camera 2 are defined as the mild occlusion video, and 540 frames from video 14, camera 4 are defined as the severe occlusion video. We use the camera parameters and SMPL parameters provided by OcMotion.

4.3. Comparison and Metrics

The state-of-the-art occluded human rendering method is OccNeRF (Xiang et al., 2023b), which serves as our main baseline. We also test the rendering performance of 3DGS-Avatar (Qian et al., 2023b) directly on occluded humans. All methods use the same training and evaluation settings, including a single training view, foreground human masks, and SMPL/camera parameters during training. Methods are compared qualitatively and quantitatively. For qualitative evaluation, we synthesize novel views to compare rendering quality. For quantitative evaluation, we consider three commonly used metrics: peak signal-to-noise ratio (PSNR), structural similarity (SSIM) (Wang et al., 2004), and LPIPS (Zhang et al., 2018). To demonstrate the efficiency of our approach, we also report the training time and the rendering frames per second (FPS). Since the OcMotion dataset does not have occlusion-free ground truth, we only calculate metrics on the visible area for this dataset.
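
As an illustration of the first metric, PSNR with an optional visibility mask can be computed as in the sketch below. It uses the standard definition $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$ and assumes images scaled to $[0, 1]$; restricting the error to a boolean mask is one plausible way to evaluate only the visible area, not necessarily the exact evaluation code used here.

```python
import numpy as np

def psnr(pred, gt, mask=None, max_val=1.0):
    """PSNR in dB between images of shape (H, W, 3); mask is an optional (H, W) boolean array.

    The error must be non-zero, otherwise the ratio is undefined.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    sq_err = (pred - gt) ** 2
    if mask is not None:
        sq_err = sq_err[np.asarray(mask, dtype=bool)]   # keep only masked pixels
    return 10.0 * np.log10(max_val ** 2 / sq_err.mean())

# Toy images: a uniform error of 0.1 gives an MSE of 0.01.
gt = np.full((2, 2, 3), 0.5)
pred = gt + 0.1
score_full = psnr(pred, gt)

# A large error at one pixel is ignored when that pixel is masked out.
pred_bad = pred.copy()
pred_bad[0, 0] = 0.0
visible = np.ones((2, 2), dtype=bool)
visible[0, 0] = False
score_visible = psnr(pred_bad, gt, mask=visible)
score_unmasked = psnr(pred_bad, gt)
```

Masking matters for datasets without occlusion-free ground truth, because pixels covered by the occluder would otherwise be compared against the occluder's colour rather than the person's.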

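Table 2. Quantitative comparison between ours and OccNeRF on ZJU-MoCap and OcMotion datasets.

| ZJU-MoCap | Subject 377 PSNR | Subject 377 SSIM | Subject 377 LPIPS | Subject 386 PSNR | Subject 386 SSIM | Subject 386 LPIPS | Subject 387 PSNR | Subject 387 SSIM | Subject 387 LPIPS |
|---|---|---|---|---|---|---|---|---|---|
| OccNeRF (Xiang et al., 2023b) | 23.37 | 0.9648 | 34.23 | 23.43 | 0.9629 | 41.87 | 22.15 | 0.9506 | 44.58 |
| OccGaussian | 24.33 | 0.9589 | 32.43 | 24.11 | 0.9544 | 39.36 | 23.02 | 0.9422 | 44.47 |

| ZJU-MoCap | Subject 392 PSNR | Subject 392 SSIM | Subject 392 LPIPS | Subject 393 PSNR | Subject 393 SSIM | Subject 393 LPIPS | Subject 394 PSNR | Subject 394 SSIM | Subject 394 LPIPS |
|---|---|---|---|---|---|---|---|---|---|
| OccNeRF (Xiang et al., 2023b) | 22.13 | 0.9578 | 44.56 | 21.40 | 0.9484 | 47.82 | 21.95 | 0.9527 | 45.02 |
| OccGaussian | 22.92 | 0.9481 | 44.07 | 22.50 | 0.9413 | 46.87 | 22.84 | 0.9444 | 44.38 |

| OcMotion | Video Mild PSNR | Video Mild SSIM | Video Mild LPIPS | Video Severe PSNR | Video Severe SSIM | Video Severe LPIPS |
|---|---|---|---|---|---|---|
| OccNeRF (Xiang et al., 2023b) | 21.55 | 0.9700 | 36.07 | 20.48 | 0.9637 | 40.21 |
| OccGaussian | 22.59 | 0.9702 | 30.04 | 20.93 | 0.9612 | 34.32 |
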
Figure 6. Qualitative results of ablation study on the Video Severe of OcMotion dataset.

4.4. Qualitative Results

In Figure 5, we present the novel view rendering results of our OccGaussian and OccNeRF. For the ZJU-MoCap dataset with simulated occlusions, both OccGaussian and OccNeRF are capable of rendering a mostly complete body geometry. However, OccNeRF sometimes fails to fill in reasonable details in occluded regions, which results in poor continuity of the generated textures and in artifacts or floaters on the body surface. By enhancing the input 2D observation information, OccGaussian complements the texture in occluded regions and better recovers facial expressions and clothing details with fewer artifacts. For the OcMotion dataset with real-world occlusions, although the rendering quality declines somewhat, OccGaussian is still able to render a relatively complete body with the occluded areas recovered. In contrast, OccNeRF misses certain body parts (such as hands) and produces considerably more artifacts and noise. This shows that our method still performs well in real-world scenarios. Please refer to the appendix for more qualitative results.

4.5. Quantitative Results

We summarize the overall novel view synthesis results of our OccGaussian, OccNeRF and HumanNeRF in Table 1; the metrics are mean values over all subjects. Benefiting from 3DGS, our OccGaussian can be trained within minutes, while OccNeRF requires over one day to train, which corresponds to a speedup of nearly 250 times. Moreover, OccNeRF takes 5 to 6 seconds to render an image when synthesizing novel views, limiting its real-world applications. In contrast, our method achieves a maximum FPS of 169 and is capable of rendering hundreds of frames in seconds, which is 800 times faster than OccNeRF. For the evaluation metrics, our approach achieves SOTA in both PSNR and LPIPS compared to OccNeRF, demonstrating that OccGaussian remains competitive in rendering quality while significantly reducing the time for training and rendering. We do not report some metrics for HumanNeRF because its LPIPS is not measured in OccNeRF, and its training time and FPS are roughly equivalent to those of OccNeRF. Table 2 provides a more detailed comparison between OccGaussian and OccNeRF, showing that our approach outperforms OccNeRF in PSNR and LPIPS across all subjects.
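
The speed ratios can be recomputed from the Train and FPS columns of Table 1. The sketch below is plain arithmetic on those reported values, with hypothetical helper names. The per-dataset ratios it yields vary, so the quoted factors are best read as approximate.

```python
def speedup_train(baseline_hours, ours_minutes):
    """Ratio of baseline training time to ours, both converted to minutes."""
    return baseline_hours * 60.0 / ours_minutes

def speedup_render(ours_fps, baseline_fps):
    """Ratio of our rendering frame rate to the baseline frame rate."""
    return ours_fps / baseline_fps

# ZJU-MoCap: OccNeRF trains for 28h to 40h, OccGaussian for 6 minutes.
zju_train = (speedup_train(28, 6), speedup_train(40, 6))
# OcMotion: OccNeRF trains for 25h, OccGaussian for 13 minutes.
ocm_train = speedup_train(25, 13)

# Rendering: FPS values from Table 1.
zju_fps = speedup_render(169, 0.20)
ocm_fps = speedup_render(163, 0.16)
```

On ZJU-MoCap the training ratio lies between 280 and 400, and the rendering ratio is about 845, which is in line with the order of magnitude stated above.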

4.6. Ablation Studies

We conduct ablation experiments by removing the components proposed in Section 3 to demonstrate their effectiveness in improving rendering quality. We present qualitative results on the severe occlusion video from OcMotion in Figure 6. Since the OcMotion dataset does not capture complete, unoccluded images, we provide quantitative results on the ZJU-MoCap dataset in Table 3, where the metrics are mean values over all subjects.

Occlusion Loss $\mathcal{L}_{occ}$ and Consistency Loss $\mathcal{L}_{con}$. Our proposed $\mathcal{L}_{occ}$ and $\mathcal{L}_{con}$ contribute to improving rendering quality in occluded areas. After removing these two losses, the density of the rendering decreases slightly near the occluded regions, and there is a slight decline in the metrics.

Aggregated Pixel-Aligned Feature. The aggregated pixel-aligned feature makes fuller use of the features of visible points to recover the occluded region. After deactivating it, we replace the spherical harmonic coefficients and opacity of each occluded point with those of its KNN visible points, weighted by their respective distances. With the feature disabled, a large part of the occluded region is missing, and the model only partially renders these areas with unrealistic artifacts.

KNN Occluded Points Query. With the KNN occluded points query disabled, the model is equivalent to rendering the occluded region directly without any processing. In this case, the occluded region is treated as background during training, and its appearance is not rendered at all. The metrics are also the worst among all variants.

Table 3. Quantitative results of ablation study on the ZJU-MoCap dataset.

| Variant | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| Full model (Ours) | 23.29 | 0.9482 | 41.93 |
| w/o $\mathcal{L}_{occ}$ and $\mathcal{L}_{con}$ | 23.03 | 0.9471 | 42.84 |
| w/o aggregated pixel-aligned feature | 22.58 | 0.9460 | 46.04 |
| w/o KNN occluded points query | 18.42 | 0.9367 | 53.84 |

5. Discussions

Rendering high-quality humans from monocular occluded videos is exceptionally challenging: occlusions vary in shape, and the occluded region has neither ground truth nor information from other views to supplement it. By querying the features of neighboring visible points, our OccGaussian is able to render a complete human body geometry. However, if a region has been occluded for an extended period, OccGaussian still cannot fully recover it, which results in some defects because the supervision in such regions is too weak. This issue may be addressed by incorporating temporal information (Wu et al., 2023). Additionally, our method requires relatively accurate human poses and camera parameters to project 3D points, so in-the-wild videos with inaccurate priors can also degrade the rendering quality.

6. Conclusion

We propose OccGaussian, the first method to render humans in monocular videos with occlusions using 3D Gaussian Splatting. While previous methods are too time-consuming in training and inference to meet the requirements of real-time applications, we achieve fast training ($6 \sim 13$ minutes) and real-time rendering (169 FPS). Specifically, we perform a feature query in the occluded region and feed the aggregated pixel-aligned feature of the K-nearest visible points into MLPs to learn the information of invisible points. Moreover, we design specialized loss functions for the occluded region, which make the rendering more complete. In our experiments, we compare OccGaussian with the SOTA method under both simulated and real-world occlusions. The experiments show that OccGaussian achieves SOTA performance while maintaining fast training and real-time rendering.

References

  • Aliev et al. (2020) Kara-Ali Aliev, Artem Sevastopolsky, Maria Kolos, Dmitry Ulyanov, and Victor Lempitsky. 2020. Neural point-based graphics. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXII 16. Springer, 696–712.
  • Barron et al. (2021) Jonathan T Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P Srinivasan. 2021. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5855–5864.
  • Barron et al. (2022) Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. 2022. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5470–5479.
  • Chen et al. (2023b) Jianchuan Chen, Wentao Yi, Liqian Ma, Xu Jia, and Huchuan Lu. 2023b. GM-NeRF: Learning Generalizable Model-based Neural Radiance Fields from Multi-view Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 20648–20658.
  • Chen et al. (2022) Mingfei Chen, Jianfeng Zhang, Xiangyu Xu, Lijuan Liu, Yujun Cai, Jiashi Feng, and Shuicheng Yan. 2022. Geometry-guided progressive nerf for generalizable and efficient neural human rendering. In European Conference on Computer Vision. Springer, 222–239.
  • Chen et al. (2021) Xu Chen, Yufeng Zheng, Michael J Black, Otmar Hilliges, and Andreas Geiger. 2021. Snarf: Differentiable forward skinning for animating non-rigid neural implicit shapes. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 11594–11604.
  • Chen et al. (2023a) Yue Chen, Xuan Wang, Xingyu Chen, Qi Zhang, Xiaoyu Li, Yu Guo, Jue Wang, and Fei Wang. 2023a. UV Volumes for real-time rendering of editable free-view human performance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16621–16631.
  • Fridovich-Keil et al. (2022) Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. 2022. Plenoxels: Radiance fields without neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5501–5510.
  • Geng et al. (2023) Chen Geng, Sida Peng, Zhen Xu, Hujun Bao, and Xiaowei Zhou. 2023. Learning neural volumetric representations of dynamic humans in minutes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8759–8770.
  • Gross and Pfister (2011) Markus Gross and Hanspeter Pfister. 2011. Point-based graphics. Elsevier.
  • Güler et al. (2018) Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. 2018. Densepose: Dense human pose estimation in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition. 7297–7306.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
  • Hu et al. (2023b) Liangxiao Hu, Hongwen Zhang, Yuxiang Zhang, Boyao Zhou, Boning Liu, Shengping Zhang, and Liqiang Nie. 2023b. Gaussianavatar: Towards realistic human avatar modeling from a single video via animatable 3d gaussians. arXiv preprint arXiv:2312.02134 (2023).
  • Hu et al. (2023a) Shoukang Hu, Fangzhou Hong, Liang Pan, Haiyi Mei, Lei Yang, and Ziwei Liu. 2023a. SHERF: Generalizable Human NeRF from a Single Image. arXiv preprint arXiv:2303.12791 (2023).
  • Hu and Liu (2023) Shoukang Hu and Ziwei Liu. 2023. Gauhuman: Articulated gaussian splatting from monocular human videos. arXiv preprint arXiv:2312.02973 (2023).
  • Huang et al. (2022) Buzhen Huang, Yuan Shu, Jingyi Ju, and Yangang Wang. 2022. Occluded Human Body Capture with Self-Supervised Spatial-Temporal Motion Prior. arXiv preprint arXiv:2207.05375 (2022).
  • Jiang et al. (2023a) Tianjian Jiang, Xu Chen, Jie Song, and Otmar Hilliges. 2023a. Instantavatar: Learning avatars from monocular video in 60 seconds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16922–16932.
  • Jiang et al. (2023b) Yuheng Jiang, Zhehao Shen, Penghao Wang, Zhuo Su, Yu Hong, Yingliang Zhang, Jingyi Yu, and Lan Xu. 2023b. Hifi4g: High-fidelity human performance rendering via compact gaussian splatting. arXiv preprint arXiv:2312.03461 (2023).
  • Kerbl et al. (2023) Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 2023. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Transactions on Graphics 42, 4 (2023).
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • Kocabas et al. (2023) Muhammed Kocabas, Jen-Hao Rick Chang, James Gabriel, Oncel Tuzel, and Anurag Ranjan. 2023. Hugs: Human gaussian splats. arXiv preprint arXiv:2311.17910 (2023).
  • Kwon et al. (2021) Youngjoong Kwon, Dahun Kim, Duygu Ceylan, and Henry Fuchs. 2021. Neural human performer: Learning generalizable radiance fields for human performance rendering. Advances in Neural Information Processing Systems 34 (2021), 24741–24752.
  • Lei et al. (2023) Jiahui Lei, Yufu Wang, Georgios Pavlakos, Lingjie Liu, and Kostas Daniilidis. 2023. Gart: Gaussian articulated template models. arXiv preprint arXiv:2311.16099 (2023).
  • Li et al. (2023a) Mingwei Li, Jiachen Tao, Zongxin Yang, and Yi Yang. 2023a. Human101: Training 100+ fps human gaussians in 100s from 1 view. arXiv preprint arXiv:2312.15258 (2023).
  • Li et al. (2017) Tianye Li, Timo Bolkart, Michael J Black, Hao Li, and Javier Romero. 2017. Learning a model of facial shape and expression from 4D scans. ACM Trans. Graph. 36, 6 (2017), 194–1.
  • Li et al. (2023b) Zhe Li, Zerong Zheng, Lizhen Wang, and Yebin Liu. 2023b. Animatable gaussians: Learning pose-dependent gaussian maps for high-fidelity human avatar modeling. arXiv preprint arXiv:2311.16096 (2023).
  • Liu et al. (2023) Hanbing Liu, Jun-Yan He, Zhi-Qi Cheng, Wangmeng Xiang, Qize Yang, Wenhao Chai, Gaoang Wang, Xu Bao, Bin Luo, Yifeng Geng, et al. 2023. Posynda: Multi-hypothesis pose synthesis domain adaptation for robust 3d human pose estimation. In Proceedings of the 31st ACM International Conference on Multimedia. 5542–5551.
  • Loper et al. (2023) Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. 2023. SMPL: A skinned multi-person linear model. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2. 851–866.
  • Luiten et al. (2023) Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. 2023. Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. arXiv preprint arXiv:2308.09713 (2023).
  • Mihajlovic et al. (2022) Marko Mihajlovic, Aayush Bansal, Michael Zollhoefer, Siyu Tang, and Shunsuke Saito. 2022. KeypointNeRF: Generalizing image-based volumetric avatars using relative spatial encoding of keypoints. In European conference on computer vision. Springer, 179–197.
  • Mildenhall et al. (2021) Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. 2021. Nerf: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 65, 1 (2021), 99–106.
  • Moreau et al. (2023) Arthur Moreau, Jifei Song, Helisa Dhamo, Richard Shaw, Yiren Zhou, and Eduardo Pérez-Pellitero. 2023. Human gaussian splatting: Real-time rendering of animatable avatars. arXiv preprint arXiv:2311.17113 (2023).
  • Müller et al. (2022) Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. 2022. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (ToG) 41, 4 (2022), 1–15.
  • Pan et al. (2023) Xiao Pan, Zongxin Yang, Jianxin Ma, Chang Zhou, and Yi Yang. 2023. Transhuman: A transformer-based human representation for generalizable neural human rendering. In Proceedings of the IEEE/CVF International conference on computer vision. 3544–3555.
  • Pang et al. (2023) Haokai Pang, Heming Zhu, Adam Kortylewski, Christian Theobalt, and Marc Habermann. 2023. Ash: Animatable gaussian splats for efficient and photoreal human rendering. arXiv preprint arXiv:2312.05941 (2023).
  • Pavlakos et al. (2019) Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed AA Osman, Dimitrios Tzionas, and Michael J Black. 2019. Expressive body capture: 3d hands, face, and body from a single image. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10975–10985.
  • Peng et al. (2021a) Sida Peng, Junting Dong, Qianqian Wang, Shangzhan Zhang, Qing Shuai, Xiaowei Zhou, and Hujun Bao. 2021a. Animatable neural radiance fields for modeling dynamic human bodies. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 14314–14323.
  • Peng et al. (2021b) Sida Peng, Yuanqing Zhang, Yinghao Xu, Qianqian Wang, Qing Shuai, Hujun Bao, and Xiaowei Zhou. 2021b. Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9054–9063.
  • Qian et al. (2023a) Shenhan Qian, Tobias Kirschstein, Liam Schoneveld, Davide Davoli, Simon Giebenhain, and Matthias Nießner. 2023a. Gaussianavatars: Photorealistic head avatars with rigged 3d gaussians. arXiv preprint arXiv:2312.02069 (2023).
  • Qian et al. (2023b) Zhiyin Qian, Shaofei Wang, Marko Mihajlovic, Andreas Geiger, and Siyu Tang. 2023b. 3dgs-avatar: Animatable avatars via deformable 3d gaussian splatting. arXiv preprint arXiv:2312.09228 (2023).
  • Romero et al. (2022) Javier Romero, Dimitrios Tzionas, and Michael J Black. 2022. Embodied hands: Modeling and capturing hands and bodies together. arXiv preprint arXiv:2201.02610 (2022).
  • Rückert et al. (2022) Darius Rückert, Linus Franke, and Marc Stamminger. 2022. Adop: Approximate differentiable one-pixel point rendering. ACM Transactions on Graphics (ToG) 41, 4 (2022), 1–14.
  • Schonberger and Frahm (2016) Johannes L Schonberger and Jan-Michael Frahm. 2016. Structure-from-motion revisited. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4104–4113.
  • Shim et al. (2022) Gyumin Shim, Minsoo Lee, and Jaegul Choo. 2022. Refu: Refine and fuse the unobserved view for detail-preserving single-image 3d human reconstruction. In Proceedings of the 30th ACM International Conference on Multimedia. 6850–6859.
  • Su et al. (2023) Shih-Yang Su, Timur Bagautdinov, and Helge Rhodin. 2023. NPC: Neural Point Characters from Video. arXiv preprint arXiv:2304.02013 (2023).
  • Sun et al. (2021) Guoxing Sun, Xin Chen, Yizhang Chen, Anqi Pang, Pei Lin, Yuheng Jiang, Lan Xu, Jingyi Yu, and Jingya Wang. 2021. Neural free-viewpoint performance rendering under complex human-object interactions. In Proceedings of the 29th ACM International Conference on Multimedia. 4651–4660.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
  • Wang et al. (2023) Haonan Wang, Jie Liu, Jie Tang, and Gangshan Wu. 2023. Lightweight Super-Resolution Head for Human Pose Estimation. In Proceedings of the 31st ACM International Conference on Multimedia. 2353–2361.
  • Wang et al. (2021) Liao Wang, Ziyu Wang, Pei Lin, Yuheng Jiang, Xin Suo, Minye Wu, Lan Xu, and Jingyi Yu. 2021. ibutter: Neural interactive bullet time generator for human free-viewpoint rendering. In Proceedings of the 29th ACM International Conference on Multimedia. 4641–4650.
  • Wang et al. (2004) Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. 2004. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13, 4 (2004), 600–612.
  • Weng et al. (2022) Chung-Yi Weng, Brian Curless, Pratul P Srinivasan, Jonathan T Barron, and Ira Kemelmacher-Shlizerman. 2022. Humannerf: Free-viewpoint rendering of moving people from monocular video. In Proceedings of the IEEE/CVF conference on computer vision and pattern Recognition. 16210–16220.
  • Weng et al. (2023) Chung-Yi Weng, Pratul P Srinivasan, Brian Curless, and Ira Kemelmacher-Shlizerman. 2023. PersonNeRF: Personalized Reconstruction from Photo Collections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 524–533.
  • Wu et al. (2023) Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 2023. 4d gaussian splatting for real-time dynamic scene rendering. arXiv preprint arXiv:2310.08528 (2023).
  • Xiang et al. (2023a) Tiange Xiang, Adam Sun, Scott Delp, Kazuki Kozuka, Li Fei-Fei, and Ehsan Adeli. 2023a. Wild2Avatar: Rendering Humans Behind Occlusions. arXiv preprint arXiv:2401.00431 (2023).
  • Xiang et al. (2023b) Tiange Xiang, Adam Sun, Jiajun Wu, Ehsan Adeli, and Li Fei-Fei. 2023b. Rendering humans from object-occluded monocular videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3239–3250.
  • Xu et al. (2020) Hongyi Xu, Eduard Gabriel Bazavan, Andrei Zanfir, William T Freeman, Rahul Sukthankar, and Cristian Sminchisescu. 2020. Ghum & ghuml: Generative 3d human shape and articulated pose models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6184–6193.
  • Xu et al. (2022) Qiangeng Xu, Zexiang Xu, Julien Philip, Sai Bi, Zhixin Shu, Kalyan Sunkavalli, and Ulrich Neumann. 2022. Point-nerf: Point-based neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5438–5448.
  • Xu et al. (2023a) Yuelang Xu, Benwang Chen, Zhe Li, Hongwen Zhang, Lizhen Wang, Zerong Zheng, and Yebin Liu. 2023a. Gaussian head avatar: Ultra high-fidelity head avatar via dynamic gaussians. arXiv preprint arXiv:2312.03029 (2023).
  • Xu et al. (2023b) Yuelang Xu, Hongwen Zhang, Lizhen Wang, Xiaochen Zhao, Han Huang, Guojun Qi, and Yebin Liu. 2023b. LatentAvatar: Learning Latent Expression Code for Expressive Neural Head Avatar. arXiv preprint arXiv:2305.01190 (2023).
  • Yang et al. (2022) Kaibing Yang, Renshu Gu, Maoyu Wang, Masahiro Toyoura, and Gang Xu. 2022. LASOR: Learning accurate 3D human pose and shape via synthetic occlusion-aware data and neural mesh rendering. IEEE Transactions on Image Processing 31 (2022), 1938–1948.
  • Yang et al. (2023) Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. 2023. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. arXiv preprint arXiv:2309.13101 (2023).
  • Yang et al. (2021) Ze Yang, Shenlong Wang, Sivabalan Manivasagam, Zeng Huang, Wei-Chiu Ma, Xinchen Yan, Ersin Yumer, and Raquel Urtasun. 2021. S3: Neural shape, skeleton, and skinning fields for 3d human modeling. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 13284–13293.
  • Yifan et al. (2019) Wang Yifan, Felice Serena, Shihao Wu, Cengiz Öztireli, and Olga Sorkine-Hornung. 2019. Differentiable surface splatting for point-based geometry processing. ACM Transactions on Graphics (TOG) 38, 6 (2019), 1–14.
  • Yu et al. (2023) Zhengming Yu, Wei Cheng, Xian Liu, Wayne Wu, and Kwan-Yee Lin. 2023. MonoHuman: Animatable Human Neural Field from Monocular Video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16943–16953.
  • Yuan et al. (2023) Ye Yuan, Xueting Li, Yangyi Huang, Shalini De Mello, Koki Nagano, Jan Kautz, and Umar Iqbal. 2023. Gavatar: Animatable 3d gaussian avatars with implicit mesh learning. arXiv preprint arXiv:2312.11461 (2023).
  • Zhang et al. (2022a) Juze Zhang, Jingya Wang, Ye Shi, Fei Gao, Lan Xu, and Jingyi Yu. 2022a. Mutual Adaptive Reasoning for Monocular 3D Multi-Person Pose Estimation. In Proceedings of the 30th ACM International Conference on Multimedia. 1788–1796.
  • Zhang et al. (2022b) Jiahui Zhang, Fangneng Zhan, Rongliang Wu, Yingchen Yu, Wenqing Zhang, Bai Song, Xiaoqin Zhang, and Shijian Lu. 2022b. Vmrf: View matching neural radiance fields. In Proceedings of the 30th ACM International Conference on Multimedia. 6579–6587.
  • Zhang et al. (2018) Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. 2018. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition. 586–595.
  • Zhao et al. (2022) Fuqiang Zhao, Wei Yang, Jiakai Zhang, Pei Lin, Yingliang Zhang, Jingyi Yu, and Lan Xu. 2022. Humannerf: Efficiently generated human radiance field from sparse inputs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7743–7753.
  • Zheng et al. (2023b) Shunyuan Zheng, Boyao Zhou, Ruizhi Shao, Boning Liu, Shengping Zhang, Liqiang Nie, and Yebin Liu. 2023b. Gps-gaussian: Generalizable pixel-wise 3d gaussian splatting for real-time human novel view synthesis. arXiv preprint arXiv:2312.02155 (2023).
  • Zheng et al. (2022) Yufeng Zheng, Victoria Fernández Abrevaya, Marcel C Bühler, Xu Chen, Michael J Black, and Otmar Hilliges. 2022. Im avatar: Implicit morphable head avatars from videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13545–13555.
  • Zheng et al. (2023a) Yufeng Zheng, Wang Yifan, Gordon Wetzstein, Michael J Black, and Otmar Hilliges. 2023a. Pointavatar: Deformable point-based head avatars from videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 21057–21067.
  • Zhou et al. (2021) Qiang Zhou, Shiyin Wang, Yitong Wang, Zilong Huang, and Xinggang Wang. 2021. Human de-occlusion: Invisible perception and recovery for humans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3691–3701.

Appendix A Supplementary Material

A.1. Definition of Occlusions

Our definition of occlusion follows OccNeRF (Xiang et al., 2023b). For simulated occlusions (ZJU-MoCap), we define the extent of occlusion as $1-\frac{\text{occluded pixels}}{\text{valid pixels}}$. For real-world occlusions (OcMotion), where there is no reference for the occluded body, we utilize 2D projections of the ground truth SMPL mesh. In this case, the occlusion extent is defined as $1-\frac{\text{visible pixels}\cap\text{SMPL pixels}}{\text{SMPL pixels}}$. Using the above formula, the occlusion extents for video Mild and video Severe are $17\%$ and $79\%$, respectively.
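
The real-world definition can be illustrated with a minimal sketch, assuming boolean NumPy masks for the visible pixels and for the 2D projection of the SMPL mesh. The function name `occlusion_extent_smpl` and the toy masks are hypothetical and are not taken from the authors' code; how per-frame values are pooled into a per-video figure is also an assumption (here, pixels are simply pooled over whatever array is passed in).

```python
import numpy as np

def occlusion_extent_smpl(visible_mask, smpl_mask):
    """Return 1 - |visible AND SMPL| / |SMPL| for one frame or a stack of frames."""
    visible_mask = np.asarray(visible_mask, dtype=bool)
    smpl_mask = np.asarray(smpl_mask, dtype=bool)
    smpl_pixels = smpl_mask.sum()
    if smpl_pixels == 0:
        raise ValueError("SMPL projection is empty")
    visible_on_body = np.logical_and(visible_mask, smpl_mask).sum()
    return 1.0 - visible_on_body / smpl_pixels

# Toy example: a 4x4 body projection inside a 6x6 image.
smpl = np.zeros((6, 6), dtype=bool)
smpl[1:5, 1:5] = True            # 16 body pixels
visible = np.zeros((6, 6), dtype=bool)
visible[:, :3] = True            # only columns 0..2 are unoccluded
extent = occlusion_extent_smpl(visible, smpl)
print(f"occlusion extent: {extent:.2f}")
```

In the toy example, half of the projected body pixels fall in the visible columns, so the extent is one half.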

A.2. The Structures of ${MLP}_{opacity}$ and ${MLP}_{shs}$

We use ${MLP}_{shs}$ and ${MLP}_{opacity}$ to learn the spherical harmonic coefficients $f$ and opacity $\alpha$ of the invisible points, respectively. The structures of ${MLP}_{opacity}$ and ${MLP}_{shs}$ are identical; the architecture is outlined in Figure 7. It is a five-layer MLP with an input layer, an output layer, and three hidden layers. The hidden layers have dimension 256, and residual connections are employed within the three hidden layers. The outputs of ${MLP}_{opacity}$ and ${MLP}_{shs}$ are $\alpha\in\mathcal{R}^{1}$ and $f\in\mathcal{R}^{3\times 16}$, respectively, and the learning rate of the MLPs is set to $5\times 10^{-4}$.
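
A rough forward-pass sketch of such a network is given below in NumPy. Only the layer count, the hidden width of 256, the residual connections in the hidden layers, and the output sizes come from the text; the ReLU activation, the exact placement of the residual additions, the initialisation, and the input feature size `feat_dim` are assumptions made for illustration.

```python
import numpy as np

HIDDEN = 256  # hidden width stated in the text

def init_mlp(in_dim, out_dim, rng):
    """Create weights for an input layer, three hidden layers and an output layer."""
    dims = [in_dim, HIDDEN, HIDDEN, HIDDEN, HIDDEN, out_dim]
    return [
        (rng.normal(0.0, np.sqrt(2.0 / d_in), size=(d_in, d_out)), np.zeros(d_out))
        for d_in, d_out in zip(dims[:-1], dims[1:])
    ]

def mlp_forward(params, x):
    relu = lambda v: np.maximum(v, 0.0)   # activation choice is an assumption
    w, b = params[0]
    h = relu(x @ w + b)                   # input layer
    for w, b in params[1:4]:              # three hidden layers
        h = h + relu(h @ w + b)           # residual connection (assumed placement)
    w, b = params[4]
    return h @ w + b                      # output layer, left without activation

rng = np.random.default_rng(0)
feat_dim = 32                             # hypothetical input feature size
mlp_opacity = init_mlp(feat_dim, 1, rng)
mlp_shs = init_mlp(feat_dim, 3 * 16, rng)

x = rng.normal(size=(5, feat_dim))        # features of 5 invisible points
alpha = mlp_forward(mlp_opacity, x)                 # shape (5, 1)
f = mlp_forward(mlp_shs, x).reshape(-1, 3, 16)      # shape (5, 3, 16)
```

The two networks share the same structure and differ only in their output size, which mirrors the description above.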

Figure 7. The network structure of ${MLP}_{opacity}$ and ${MLP}_{shs}$.

A.3. 3D Gaussian Optimization

The superior rendering quality of 3DGS (Kerbl et al., 2023) relies heavily on the adaptive density control of 3D Gaussians. Following Gauhuman (Hu and Liu, 2023), we select points with large KL divergence and positional gradients to perform the split and clone. The KL divergence of two Gaussians is calculated as:

(16) $KL(G(\bm{x}_{0})|G(\bm{x}_{1}))=\frac{1}{2}\Big(tr(\bm{\Sigma}_{1}^{-1}\bm{\Sigma}_{0})+\ln\frac{\det\bm{\Sigma}_{1}}{\det\bm{\Sigma}_{0}}+(\bm{\mu}_{1}-\bm{\mu}_{0})^{T}\bm{\Sigma}_{1}^{-1}(\bm{\mu}_{1}-\bm{\mu}_{0})-3\Big),$

where $\bm{\mu}_{0}$, $\bm{\Sigma}_{0}$, $\bm{\mu}_{1}$, $\bm{\Sigma}_{1}$ are the means and covariance matrices of the two 3D Gaussians $G(\bm{x}_{0})$ and $G(\bm{x}_{1})$. We further follow the merge operation from Gauhuman: 3D Gaussians with 1) large position gradients, 2) small scaling magnitude, and 3) KL divergence less than 0.1 will be merged. Two Gaussians are merged by averaging their means, opacity, and SH coefficients.
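
As a minimal sketch, the closed-form KL divergence between two 3D Gaussians (with the log-determinant term $\ln\det\bm{\Sigma}_{1}-\ln\det\bm{\Sigma}_{0}$) and the averaging merge can be written in NumPy as follows. The function names, the dictionary layout of a Gaussian, and the toy values are hypothetical; the gradient and scale criteria for merging are not reproduced here.

```python
import numpy as np

KL_MERGE_THRESHOLD = 0.1  # merge threshold stated in the text

def gaussian_kl(mu0, cov0, mu1, cov1):
    """KL(G0 || G1) for two 3D Gaussians with means mu and covariances cov."""
    mu0, mu1 = np.asarray(mu0, float), np.asarray(mu1, float)
    cov0, cov1 = np.asarray(cov0, float), np.asarray(cov1, float)
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    trace_term = np.trace(cov1_inv @ cov0)
    _, logdet0 = np.linalg.slogdet(cov0)   # log-determinants for numerical stability
    _, logdet1 = np.linalg.slogdet(cov1)
    maha_term = diff @ cov1_inv @ diff
    return 0.5 * (trace_term + (logdet1 - logdet0) + maha_term - 3.0)

def merge_pair(g_a, g_b):
    """Merge two Gaussians by averaging their means, opacity and SH coefficients."""
    return {key: 0.5 * (g_a[key] + g_b[key]) for key in ("mean", "opacity", "shs")}

eye = np.eye(3)
kl_same = gaussian_kl([0, 0, 0], eye, [0, 0, 0], eye)    # identical Gaussians
kl_shift = gaussian_kl([0, 0, 0], eye, [1, 0, 0], eye)   # unit shift of the mean

a = {"mean": np.zeros(3), "opacity": 0.2, "shs": np.zeros((3, 16))}
b = {"mean": np.ones(3), "opacity": 0.6, "shs": np.ones((3, 16))}
merged = merge_pair(a, b) if kl_same < KL_MERGE_THRESHOLD else None
```

The divergence vanishes for identical Gaussians and grows with the Mahalanobis distance between the means, so a small threshold selects pairs that are nearly redundant.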

A.4. Details of Loss Functions

Photometric Loss. Given the ground truth target image $C$ and predicted image $\hat{C}$, we apply the photometric loss:

(17) $\mathcal{L}_{rgb}=|\hat{C}-C|.$

Mask Loss. We also leverage the human region masks for Human NeRF optimization. The mask loss is defined as:

(18) $\mathcal{L}_{mask}=||\hat{M}-M||_{2},$

where $\hat{M}$ is the accumulated volume density and $M$ is the ground truth binary mask label.
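
The photometric and mask terms can be sketched in NumPy as below. The reduction used for the photometric term (a mean over pixels and channels) is an assumption, since the formula does not state one, and the function names and toy arrays are hypothetical.

```python
import numpy as np

def photometric_loss(pred_rgb, gt_rgb):
    """L1 difference |C_hat - C|, averaged over pixels and channels (assumed reduction)."""
    diff = np.asarray(pred_rgb, float) - np.asarray(gt_rgb, float)
    return np.abs(diff).mean()

def mask_loss(pred_mask, gt_mask):
    """L2 norm ||M_hat - M||_2 taken over all pixels."""
    diff = np.asarray(pred_mask, float) - np.asarray(gt_mask, float)
    return np.sqrt((diff ** 2).sum())

# Toy 2x2 example.
pred_rgb = np.zeros((2, 2, 3))
gt_rgb = np.full((2, 2, 3), 0.5)
pred_mask = np.array([[1.0, 0.0], [0.0, 0.0]])
gt_mask = np.array([[1.0, 1.0], [0.0, 0.0]])

l_rgb = photometric_loss(pred_rgb, gt_rgb)
l_mask = mask_loss(pred_mask, gt_mask)
```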

SSIM Loss. We further employ SSIM (Wang et al., 2004) to ensure the structural similarity between ground truth and synthesized images:

(19) $\mathcal{L}_{SSIM}=\text{SSIM}(\hat{C},C).$

LPIPS Loss. The perceptual loss LPIPS (Zhang et al., 2018) is also utilized to ensure the quality of the rendered image:

(20) $\mathcal{L}_{LPIPS}=\text{LPIPS}(\hat{C},C).$

Figure 8. Qualitative results of ablation study on training frames.

Details about $\alpha_{body}$. In order to obtain an approximate description of the complete body, we project 3D points onto the 2D image plane and set the radius of each point to two pixels to ensure that adjacent points are connected to each other. This yields a complete body mask, but due to the large radius of each point, the mask appears too bulky. Therefore, we further perform an erosion operation on the mask, with a $5\times 5$ erosion kernel, to obtain our $\alpha_{body}$. The masks before and after erosion are shown in Figure 9.
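
The two steps (splatting projected points as discs of radius two pixels, then eroding with a $5\times 5$ kernel) can be sketched in plain NumPy as follows. The helper names, the treatment of pixels outside the image as background, and the toy grid of already-projected points are assumptions; the camera projection itself is omitted.

```python
import numpy as np

def splat_points(points_2d, height, width, radius=2):
    """Rasterise projected points as filled discs of the given pixel radius."""
    mask = np.zeros((height, width), dtype=bool)
    ys, xs = np.mgrid[0:height, 0:width]
    for u, v in points_2d:                       # (u, v) = (column, row) in pixels
        mask |= (xs - u) ** 2 + (ys - v) ** 2 <= radius ** 2
    return mask

def erode(mask, kernel=5):
    """Binary erosion with a square kernel; outside the image counts as background."""
    pad = kernel // 2
    padded = np.pad(mask, pad, mode="constant", constant_values=False)
    out = np.ones_like(mask, dtype=bool)
    for dy in range(kernel):
        for dx in range(kernel):
            # A pixel survives only if every pixel under the kernel is set.
            out &= padded[dy:dy + mask.shape[0], dx:dx + mask.shape[1]]
    return out

# Toy example: points on a regular grid with a spacing of two pixels.
points = [(u, v) for u in range(8, 24, 2) for v in range(8, 24, 2)]
body_mask = splat_points(points, height=32, width=32, radius=2)
alpha_body = erode(body_mask, kernel=5)
print(body_mask.sum(), alpha_body.sum())
```

In the toy example the discs fuse into one connected region, and erosion trims the inflated border while keeping the interior intact.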

Figure 9. The uneroded mask and eroded mask $\alpha_{body}$.
Figure 10. More qualitative results between our OccGaussian and OccNeRF on ZJU-MoCap and OcMotion datasets.

A.5. Ablation Results for K-nearest Occluded Points Query

When conducting the Occlusion Feature Query in section 3.3, we need to find the nearest k visible points for each occluded point. To this end, we conducted ablation experiments to find the optimal value of k. We performed this ablation experiment on subject 386 of the ZJU-MoCap dataset, and as shown in Table 4, the metrics are relatively superior when selecting k=3.

Table 4. Quantitative results of ablation study for K-nearest Occluded Points Query.
Setting                 PSNR↑    SSIM↑     LPIPS↓
k=1 nearest points      24.11    0.9544    39.36
k=3 nearest points      24.05    0.9536    39.75
k=5 nearest points      24.06    0.9538    39.12
k=8 nearest points      24.08    0.9541    39.27
k=10 nearest points     24.05    0.9542    39.38
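
For reference, the neighbour search itself can be sketched with a brute-force NumPy query. The use of Euclidean distance in 3D, the averaging of neighbour features, and all names and toy values below are assumptions for illustration; the actual aggregation is defined in Section 3.3.

```python
import numpy as np

def knn_visible(occluded_xyz, visible_xyz, k=3):
    """Indices and distances of the k nearest visible points for each occluded point."""
    occluded_xyz = np.asarray(occluded_xyz, float)
    visible_xyz = np.asarray(visible_xyz, float)
    # Pairwise Euclidean distances, shape (num_occluded, num_visible).
    dists = np.linalg.norm(occluded_xyz[:, None, :] - visible_xyz[None, :, :], axis=-1)
    idx = np.argsort(dists, axis=1)[:, :k]
    return idx, np.take_along_axis(dists, idx, axis=1)

def aggregate_features(visible_feats, idx):
    """Average the features of the selected neighbours (aggregation rule assumed)."""
    return np.asarray(visible_feats, float)[idx].mean(axis=1)

# Toy example: visible points on the x axis, one occluded point at x = 1.2.
visible_xyz = np.array([[0, 0, 0], [1, 0, 0], [2, 0, 0], [3, 0, 0], [10, 0, 0]], float)
visible_feats = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
occluded_xyz = np.array([[1.2, 0.0, 0.0]])

idx, dist = knn_visible(occluded_xyz, visible_xyz, k=3)
feat = aggregate_features(visible_feats, idx)
```

A brute-force search costs memory proportional to the product of the two point counts; for large point sets a spatial index would be the usual replacement.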

A.6. More Qualitative Results on ZJU-MoCap and OcMotion

Figure 10 shows more qualitative rendering results of our OccGaussian and the SOTA method OccNeRF. It can be observed that compared to OccNeRF, our method does not produce unnecessary artifacts, exhibits better texture continuity in occluded regions, and yields more realistic rendering results.

Figure 11. Qualitative results between our OccGaussian and 3DGS-Avatar on the ZJU-MoCap dataset.

A.7. Other Results based on 3DGS

We also apply another human rendering method based on 3D Gaussian Splatting directly to humans under occlusion, demonstrating that our OccGaussian can efficiently render a more complete human body in occluded scenes. We choose 3DGS-Avatar (Qian et al., 2023b), which utilizes 3D Gaussian Splatting to reconstruct clothed human avatars from monocular videos. The experiments on the ZJU-MoCap dataset are shown in Figure 11. It can be seen that our OccGaussian renders a complete human with high quality, filling appropriate textures in the occluded areas. Meanwhile, 3DGS-Avatar fails to render the human body under occlusion, leaving large blank areas without color in many regions.

A.8. Ablation Results for Training Frames

On subject 387 of the ZJU-MoCap dataset, we conduct an ablation study using different numbers of training frames. As shown in Figure 8 and Table 5, training with more frames improves the rendering quality of the model. Training with 200 frames does not result in a complete human silhouette and renders a lot of artifacts. Training with 300 and 400 frames gives a more complete human body, but some parts of the body are still fragmented and lose a lot of detail. Our 540-frame model better recovers the overall texture of the clothes and facial details, giving the most realistic rendering results.

Table 5. Quantitative results of ablation study for different training frames.
Setting       PSNR↑    SSIM↑     LPIPS↓
200 frames    21.15    0.9279    61.04
300 frames    21.87    0.9367    49.44
400 frames    22.66    0.9418    44.35
540 frames    23.02    0.9422    44.47