SMPLX-Lite: A Realistic and Drivable Avatar Benchmark with Rich Geometry and Texture Annotations

Yujiao Jiang1     Qingmin Liao1     Zhaolong Wang1,2     Xiangru Lin2
Zongqing Lu1     Yuxi Zhao1     Hanqing Wei3     Jingrui Ye1     Yu Zhang2     Zhijing Shao2,4∗
1Shenzhen International Graduate School, Tsinghua University    2Prometheus Vision Technology Co., Ltd.
3Beijing University of Aeronautics and Astronautics
4The Hong Kong University of Science and Technology (Guangzhou)
jiangyj20@mails.tsinghua.edu.cn, neil.szj@prometh.xyz
Abstract

Recovering photorealistic and drivable full-body avatars is crucial for numerous applications, including virtual reality, 3D games, and tele-presence. Most methods, whether reconstruction or generation, require large numbers of human motion sequences and corresponding textured meshes. To easily learn a drivable avatar, a reasonable parametric body model with unified topology is paramount. However, existing human body datasets provide either images or textured models and lack parametric models that fit clothing well. We propose a new parametric model, SMPLX-Lite-D, which can fit the detailed geometry of the scanned mesh while maintaining stable geometry in the face, hand and foot regions. We present the SMPLX-Lite dataset, the most comprehensive clothed avatar dataset, with multi-view RGB sequences, keypoint annotations, textured scanned meshes, and textured SMPLX-Lite-D models. With the SMPLX-Lite dataset, we train a conditional variational autoencoder model that takes human pose and facial keypoints as input and generates a photorealistic drivable human avatar.

Index Terms:
Drivable Avatar, Dataset, Reconstruction

I Introduction

Figure 1: SMPLX-Lite dataset has multiple types of data formats and annotations and is the most comprehensive dataset currently available in the drivable avatar area. We demonstrate a) color image, b) keypoints, c) SMPL-X registration, d) scanned mesh with e) scanned texture, and f) SMPLX-Lite-D mesh with g) SMPLX-Lite-D texture.

Existing methods generally reconstruct clothed human models from images or videos. One class of methods based on the neural radiance field[1, 2, 3] utilizes an implicit functional representation[4, 5] that allows pixel-level alignment with the image, but lacks an explicit geometric representation.

Another class of methods is based on parametric models (e.g., SMPL[6], SMPL-X[7]), which use low-dimensional parametric models as human body priors and learn to fit the model parameters to align them with the person in the picture by training. These template models[6, 7] derived from large amounts of data can be flexibly controlled by low-dimensional pose and shape parameters, which can capture non-rigid deformation well and reduce artifacts from linear transformation. The popular SMPL model learns better pose and shape blend shapes on top of linear blend skinning, and can fit various deformation through pose and shape parameters.

Most methods for reconstructing 3D humans from images align the deformed SMPL model with 2D images and joints by predicting the SMPL parameters [8, 9]. However, because the parametric model is unclothed, these methods can only recover minimally dressed human meshes, not clothed ones. Other methods for reconstructing clothed human bodies extend the SMPL model into SMPL-D, which represents clothes by vertex displacement[10, 11]. IPNet[10] divides body and clothes into two layers, fitted with SMPL and SMPL-D respectively. Similarly, CAPE[11] employs a CVAE to generate meshes conditioned on pose, clothes type, and clothes shape, thereby producing the clothes vertex displacement. The reconstructed models can also deform the clothes mesh in different poses through the skeleton and skinning weights of the internal human model. However, the results are often poor because wrinkles and deformation of clothes are harder to control than the human body, so a large amount of data is needed for learning, along with corresponding scanned models for supervision. Clothes such as skirts and coats are also difficult to reconstruct due to the limitations of vertex displacement.

To utilize the advantages of the approaches above, recent work has attempted to combine the two representations. ARCH[12] and ARCH++[13] use human prior knowledge to transform a human body in any posture into a canonical space, and then learn implicit representations for reconstruction. These methods produce pixel-aligned models and can theoretically be reposed by changing model parameters. However, since there is no learning to infer pose-dependent clothing deformation, these methods simply apply articulated deformation to the reconstructed model. This results in an unrealistic pose-related distortion that lacks fine details of the garment.

Since the SMPL[6] model has only 24 joints and does not accommodate facial expressions or finger movements, the SMPL-X[7] model, which combines body, face, and hands, is increasingly adopted in the pursuit of better character fitting. However, challenges arise when fitting vertices using the SMPL-X model, including eye deformation and lip flipping. To address these concerns, we propose the SMPLX-Lite model, optimized for vertex fitting based on SMPL-X, while retaining the facial expression and hand action representation capabilities of the SMPL-X model.

In order to get an animatable human avatar, previous methods usually required reconstructing a character template for a single person and then modeling pose-dependent dynamic distortions. Recent works suggest that we can learn the deformation of a general character template from scanned data[11, 14] or RGB video data[15, 16] to get a drivable avatar directly. These methods usually require a large amount of data to train an avatar associated with a person, and when the data is insufficient, problems arise with over-fitting and posture generalization capabilities. So we introduce the SMPLX-Lite dataset, which uses 32 4K RGB cameras to capture over 20k frames of action sequences simultaneously, containing 5 characters (3 male and 2 female, wearing various types of clothes) and 15 different action types, and performs a series of data processing operations, i.e., image segmentation, 3D model reconstruction, pose estimation, SMPLX-Lite-D model fitting and texture map fitting. We have packaged all these annotated data into the SMPLX-Lite dataset to advance research in this field, making it possible that just a simple baseline can generate avatars with good results.

To underscore the contribution of the SMPLX-Lite dataset to the community, we develop a conditional variational autoencoder network using this dataset as a foundation following [17, 18]. Our method uses pose parameters, facial keypoints and view direction as conditions to generate a character model with texture based on the corresponding pose. This greatly simplifies the process of driving the character model. Compared with CAPE[11], our recovered avatar has finer geometry and photorealistic texture, making it more lifelike and directly applicable in industrial settings.

Our contributions can be summarized as follows:

  • We collect the most comprehensive and photorealistic avatar dataset to date, containing multi-view segmented image sequences, 3D keypoint annotation, textured scanned model and fitted SMPLX-Lite-D model with texture maps.

  • We propose the SMPLX-Lite model, optimized for vertex fitting based on the SMPL-X model, which is the first SMPL-X-based model to use vertex displacement to fit clothes.

  • We introduce a multi-stage fitting procedure capturing fine geometry details like facial expressions and cloth wrinkles. Compared with the SMPL-X model, it greatly reduces the difficulty of vertex fitting while retaining the details of facial expressions and hand movements.

  • We propose a CVAE model that receives driving input by facial keypoints and pose parameters to produce a photorealistic avatar.

II SMPLX-Lite Dataset

We present the SMPLX-Lite dataset, currently the most comprehensive captured human avatar dataset. Please refer to the suppl. for a detailed comparison with other datasets containing human model fits, and for a demo dataset for inspection. Our dataset contains multi-view segmented image sequences, 3D keypoint annotations, reconstructed textured scanned meshes, fitted SMPLX-Lite-D models and texture maps. In this section, we describe in detail how the dataset is captured and organized, and the procedure for obtaining the annotations.

II-A Data Capture

We employ 32 calibrated cameras to simultaneously capture 4096x3000 image sequences of 15 different actions, being performed by 5 subjects (3 male, 2 female) in daily clothes. The image sequences include 15 kinds of actions in daily scenes, such as discussion, debate, public speaking, phone conversations and stretching, which significantly enhances the authentic, diverse and generalizable nature of an avatar. For the convenience of statistics and processing, we select over 200 consecutive frames for each action sequence and eventually collect over 20k frames. Each frame has 32 views of the raw image, as well as all annotation results from post-processing.

II-B Data Process

Figure 2: Data Process Pipeline. Our pipeline produces a variety of data annotations, including 3D keypoints, SMPL-X parameters, textured scanned models, and textured SMPLX-Lite-D models.

Textured Mesh Reconstruction.  We utilize 32 RGB cameras with 48 additional IR cameras and random pattern projectors for reconstruction. Following [19], we first obtain the initial depth map from the IR images through the stereo matching algorithm[20] and then convert the depth map into a point cloud, which is later turned to the initial mesh by Poisson Surface Reconstruction (PSR) [21]. The obtained mesh has some mismatches w.r.t. the actual shape due to the accumulated error. We employ differentiable rendering [22] to optimize the vertex positions of the mesh geometry while extracting the texture of the mesh surface. Through these processes, we obtain the mesh model with higher accuracy and high-quality texture extremely close to the real picture.

3D Human Pose Estimation.  We use OpenPose[23] to estimate the 2D human joints in each view. Given the 2D keypoints of the person from each camera view, our accurate camera intrinsic and extrinsic parameters from calibration enable the calculation of 3D keypoints by triangulation. However, 2D keypoints estimated from some views may be unreliable due to occlusion and limited camera field of view. Consequently, it is crucial to select highly confident views for each keypoint during triangulation. We employ the RANSAC[24] method to select reliable views; see suppl. for the detailed process. Subsequently, EasyMocap[25] is used to fit the SMPL-X[7] model to every frame under the supervision of 2D and 3D keypoints.
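As an illustration of the view-selection idea, the following is a minimal sketch of robust multi-view triangulation. It is not the authors' procedure (which is described in their suppl.): it assumes 3×4 projection matrices per view, uses linear (DLT) triangulation, and replaces random sampling with an exhaustive search over view pairs, which is a deterministic RANSAC-style variant. The function names and the pixel threshold are hypothetical.

```python
import itertools
import numpy as np

def triangulate_dlt(proj_mats, pts2d):
    """Linear (DLT) triangulation of one 3D point from two or more views."""
    rows = []
    for P, (u, v) in zip(proj_mats, pts2d):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    # The homogeneous solution is the right singular vector of the
    # smallest singular value.
    _, _, vt = np.linalg.svd(np.stack(rows))
    X = vt[-1]
    return X[:3] / X[3]

def reprojection_errors(proj_mats, pts2d, X):
    """Pixel distance between observed keypoints and the reprojection of X."""
    proj = np.einsum('vij,j->vi', proj_mats, np.append(X, 1.0))
    uv = proj[:, :2] / proj[:, 2:3]
    return np.linalg.norm(uv - pts2d, axis=1)

def ransac_triangulate(proj_mats, pts2d, thresh_px=5.0):
    """Hypothesise a 3D point from every view pair, keep the hypothesis with
    the most inlier views, then re-triangulate from those inliers."""
    proj_mats = np.asarray(proj_mats, dtype=float)
    pts2d = np.asarray(pts2d, dtype=float)
    best_inliers = None
    for i, j in itertools.combinations(range(len(proj_mats)), 2):
        X = triangulate_dlt(proj_mats[[i, j]], pts2d[[i, j]])
        inliers = reprojection_errors(proj_mats, pts2d, X) < thresh_px
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    X = triangulate_dlt(proj_mats[best_inliers], pts2d[best_inliers])
    return X, best_inliers
```

With 32 views there are 496 pairs per keypoint, so the exhaustive search stays cheap; a per-view detector confidence could additionally be used to pre-filter views, which this sketch omits.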

SMPLX-Lite Model Transfer.  SMPL-X has $N=10475$ vertices and $K=54$ joints, and is defined as a function $M(\beta,\theta,\psi):\mathbb{R}^{|\beta|\times|\theta|\times|\psi|}\rightarrow\mathbb{R}^{3N}$, where $\theta$, $\beta$, and $\psi$ are the pose, shape, and expression parameters respectively. More specifically,

$$M(\beta,\theta,\psi)=W\left(T_{P}(\beta,\theta,\psi),J(\beta),\theta,\mathcal{W}\right) \tag{1}$$

where $W(\cdot)$ is a standard linear blend skinning (LBS) function. Its components are:

$$T_{P}(\beta,\theta,\psi)=\bar{T}+B_{S}(\beta;\mathcal{S})+B_{E}(\psi;\mathcal{E})+B_{P}(\theta;\mathcal{P}) \tag{2}$$
$$J(\beta)=\mathcal{J}\left(\bar{T}+B_{S}(\beta;\mathcal{S})\right) \tag{3}$$

and the blend weights $\mathcal{W}\in\mathbb{R}^{N\times K}$. Methods using SMPL-X plus vertex displacement to fit clothes extend Eq.(2) to

$$T_{P}(\beta,\theta,\psi,D)=\bar{T}+B_{S}(\beta;\mathcal{S})+B_{E}(\psi;\mathcal{E})+B_{P}(\theta;\mathcal{P})+D. \tag{4}$$

The SMPL-X model with vertex displacement, shown in Fig.3a, exhibits face flipping and distortion in the eyes, ears, mouth, nose and feet. In response to these issues, we propose the SMPLX-Lite model, which greatly reduces the difficulty of vertex fitting while preserving the facial expression and hand gesture fitting capabilities of the SMPL-X model. The iterative process entails vertex deletion, face reconstruction, and face flattening, ultimately yielding the SMPLX-Lite model with a reduced vertex count of $N_{v}=8452$. Refer to suppl. for details.

As the number of vertices decreases, adjustments to the matrices $\mathcal{S}$, $\mathcal{E}$, $\mathcal{P}$, $\mathcal{J}$, $\mathcal{W}$ in Eqs.(1)-(3) are vital for ensuring that the transferred model inherits the control parameters of SMPL-X and the linear blend skinning function. Once all coefficient matrices are transferred, the SMPLX-Lite model can be used in the same way as SMPL-X, and, following Eq.(4), vertex displacement can be added to the model to fit clothes. The subsequent analysis demonstrates the efficacy of this model in vertex fitting.
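To make the role of the displacement term concrete, here is a minimal NumPy sketch of the template sum with per-vertex displacement $D$ followed by linear blend skinning. It is an illustration under simplifying assumptions, not the SMPL-X or SMPLX-Lite implementation: the blend-shape offsets are passed in as precomputed arrays, and the per-joint transforms are assumed to be 4×4 matrices that already map the rest pose to the posed space (the kinematic-chain computation from $\theta$ and $J(\beta)$ is omitted). All function and argument names are hypothetical.

```python
import numpy as np

def posed_template(T_bar, B_S, B_E, B_P, D):
    """Rest-pose template plus shape, expression and pose offsets,
    plus the clothing displacement D. All inputs have shape (N, 3)."""
    return T_bar + B_S + B_E + B_P + D

def linear_blend_skinning(verts, weights, joint_transforms):
    """Blend per-joint rigid transforms with the skinning weights.

    verts:            (N, 3) rest-pose vertices
    weights:          (N, K) blend weights, each row summing to 1
    joint_transforms: (K, 4, 4) transforms from rest pose to posed space
    """
    ones = np.ones((len(verts), 1))
    verts_h = np.concatenate([verts, ones], axis=1)                    # (N, 4)
    per_vertex = np.einsum('nk,kij->nij', weights, joint_transforms)   # (N, 4, 4)
    posed = np.einsum('nij,nj->ni', per_vertex, verts_h)               # (N, 4)
    return posed[:, :3]
```

Because $D$ is added before skinning, the displaced clothing vertices follow the same skeleton and blend weights as the underlying body, which is what allows a fitted model to be re-posed.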

SMPLX-Lite-D Fit.  The purpose of vertex fitting is to fully capture the fine geometry details of the scanned meshes in a unified mesh topology and texture UV layout. After the 3D pose estimation in Section II-B, we obtain a starting mesh that is close to the scanned mesh but lacks surface details. We solve for the vertex fit in two stages. In the first stage, we adopt the method from [26]: we warp the mesh through predefined embedded nodes and solve for the warp field. In the second stage, we directly solve for the remaining vertex shifts. The detailed procedures and the impact of each registration step are illustrated in suppl.

Figure 3: Comparison between SMPL-X and SMPLX-Lite model fit results. (a) SMPL-X vertex fit. (b) SMPLX-Lite vertex fit.

How will this dataset be useful to the community?  Significant effort has been dedicated to collecting and processing the most comprehensive 3D moving human avatar dataset with clothes and textures to date. The SMPLX-Lite dataset has significant implications for Drivable Textured Avatar Reconstruction, as it provides multi-view images, reconstructed textured models, and fitted clothed parametric models with texture maps. These diverse data types can be leveraged to reconstruct photorealistic drivable avatars, offering researchers a wider spectrum of supervision options than datasets that offer only raw pictures [27] or only reconstructed textured models [11]. This broadens the range of network structures that can be used, potentially enabling multiple stages of network training.

Besides, the SMPLX-Lite dataset is also pertinent to other important areas such as 3D Human Body Reconstruction and Novel View Synthesis. Moreover, researchers are encouraged to explore further applications of this dataset.

II-C Dataset Evaluation

TABLE I: Dataset Evaluation Results. We render textured models, compare them with captured images, and compare the geometry of the fitted SMPLX-Lite-D model with scanned mesh to get chamfer distance (CD, $\times 10^{-3}$).

Sub.   Scan PSNR↑   Scan SSIM↑   SMPLX-Lite-D PSNR↑   SMPLX-Lite-D SSIM↑   SMPLX-Lite-D CD↓
WZL    28.92        0.9714       28.61                0.9706               6.7372
LDF    28.33        0.9706       27.80                0.9675               8.2897
ZX     28.95        0.9760       28.52                0.9749               6.8234
LW     27.21        0.9754       26.67                0.9744               6.4386
ZC     27.51        0.9623       27.05                0.9602               6.9238

We present the evaluation results in Tab.I, including peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and chamfer distance (CD).
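For readers who want to reproduce this kind of evaluation, the sketch below shows one common way to compute PSNR between a rendered and a captured image, and a symmetric chamfer distance between two point sets. The paper does not state which chamfer variant (squared or unsquared, summed or averaged) or which units are used, so the definition here is an assumption, as are the function names; SSIM is omitted because it is usually taken from an image-processing library.

```python
import numpy as np

def psnr(img_a, img_b, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    diff = np.asarray(img_a, dtype=float) - np.asarray(img_b, dtype=float)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float('inf')
    return 10.0 * np.log10(max_val ** 2 / mse)

def chamfer_distance(pts_a, pts_b):
    """Symmetric chamfer distance (assumed variant): mean nearest-neighbour
    Euclidean distance from A to B plus the same from B to A."""
    # Brute-force (A, B) distance matrix; fine for small sets, but a KD-tree
    # is preferable for full-resolution meshes.
    d = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

In practice the point sets would be sampled from the fitted SMPLX-Lite-D surface and the scanned mesh surface rather than taken directly from mesh vertices, since the two meshes have different topologies.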

III Method

To demonstrate the effectiveness of the SMPLX-Lite dataset, we use a simple baseline model to generate drivable avatars and show all avatars in the dataset being driven by the same sequence of actions. Our approach builds on the network structure in [17, 18], with substantial simplifications and adaptations.

III-A Network Structure

Figure 4: Method Overview. The CVAE model generates mesh and texture maps via a decoder, which employs pose and face keypoints as driving signals, overlays camera view information, and utilizes latent codes sampled from the distribution obtained by the encoder. The output mesh obtained by LBS, together with the texture map and camera parameters, undergoes the differentiable renderer to produce photorealistic rendered images. The entire training process is end-to-end, and mesh, texture, and final rendered images are all supervisable.

The model employed is a conditional variational autoencoder (CVAE), consisting of an encoder $E$ and a decoder $D$, both implemented using convolutional neural networks. See Fig.4 for the overview of our method.

The encoder takes as input the mean texture map $\bar{\boldsymbol{T}}$ for each individual in the dataset and the T-pose mesh $\boldsymbol{G_{i}}\in\mathbb{R}^{N\times 3}$, derived by applying inverse LBS to the SMPLX-Lite-D model for each frame. Rendering $\boldsymbol{G_{i}}$ onto a position map $P(\boldsymbol{G_{i}})$ in UV space yields a feature map of the same size as the mean texture map, which is subsequently merged with the mean texture map along the channel dimension and fed into the encoder:

$$E(\bar{\boldsymbol{T}},P(\boldsymbol{G_{i}}))\rightarrow\boldsymbol{\mu_{i}},\boldsymbol{\sigma_{i}}$$

The encoder outputs the mean $\boldsymbol{\mu_{i}}$ and standard deviation $\boldsymbol{\sigma_{i}}$ of a Gaussian distribution, which is trained to align as closely as possible with the standard normal distribution $\mathcal{N}(0,1)$ and is then sampled to obtain the latent code $\boldsymbol{z_{i}}$:

$$\boldsymbol{z_{i}}\sim\mathcal{N}(\boldsymbol{\mu_{i}},\boldsymbol{\sigma_{i}}^{2})$$

Following [17], we utilize readily available pose parameters $\boldsymbol{\theta_{i}}$ and face keypoints $\boldsymbol{f_{i}}$ as driving signals. These are rendered to a position map in UV space and merged into feature maps, while the T-pose vertex coordinates are used to generate view-information feature maps. These driving signals and view feature maps serve as conditions and are combined with the latent code before being fed into the decoder to predict the T-pose mesh offsets $\boldsymbol{\hat{G}^{T}_{i}}$ and the view-dependent texture map $\boldsymbol{\hat{T}_{i}}$:

$$D(W(\boldsymbol{\mathcal{F}_{\theta_{i}}},\boldsymbol{\mathcal{F}_{f_{i}}},\boldsymbol{\mathcal{F}_{v_{i}}}),\boldsymbol{z_{i}})\rightarrow\boldsymbol{\hat{G}^{T}_{i}},\boldsymbol{\hat{T}_{i}},$$

where $\boldsymbol{\mathcal{F}_{\theta_{i}}}$, $\boldsymbol{\mathcal{F}_{f_{i}}}$ and $\boldsymbol{\mathcal{F}_{v_{i}}}$ are the pose, face keypoint and view feature maps, and $W$ here denotes a $1\times 1$ convolution (not the skinning function of Eq.(1)). We use the decoder to predict offsets because the network fits residuals better than it fits the vertex locations directly. By adding these offsets to the T-pose template $\boldsymbol{\bar{G}_{T}}$ and transforming the result with the pose parameters $\boldsymbol{\theta_{i}}$ through LBS, the final reconstructed posed mesh $\boldsymbol{\hat{G}_{i}}$ is obtained:

$$\boldsymbol{\hat{G}_{i}}=LBS(\boldsymbol{\theta_{i}},\boldsymbol{\bar{G}_{T}}+\boldsymbol{\hat{G}^{T}_{i}}). \tag{5}$$

The dataset provided allows for the supervision of $\boldsymbol{\hat{G}_{i}}$ and $\boldsymbol{\hat{T}_{i}}$ through the model geometry of SMPLX-Lite-D and the associated texture map, as well as of the rendered images $\boldsymbol{\hat{I}_{i}}$, obtained through differentiable rendering, with the captured images $\boldsymbol{I_{i}}$. This multi-faceted supervision facilitates the creation of high-quality drivable reconstructed human models.

During inference, the latent code $\boldsymbol{z}$ is sampled from a standard normal distribution without the need for an encoder. The decoder $D$ takes $\boldsymbol{z}$ along with the driving signal and view information as input to generate the geometry and texture of the person under the corresponding pose.
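The difference between training-time and inference-time sampling can be sketched as follows. This assumes the standard VAE reparameterisation trick, which the text does not spell out, and uses hypothetical function names and latent shapes.

```python
import numpy as np

def sample_latent_train(mu, sigma, rng):
    """Training: reparameterised sample z = mu + sigma * eps with
    eps ~ N(0, I), where mu and sigma come from the encoder."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def sample_latent_inference(shape, rng):
    """Inference: no encoder is run, so z is drawn from the prior N(0, I)."""
    return rng.standard_normal(shape)

# Usage with a hypothetical 8-dimensional latent code.
rng = np.random.default_rng(0)
z_train = sample_latent_train(np.zeros(8), np.ones(8), rng)
z_drive = sample_latent_inference((8,), rng)
```

Sampling from the prior only works well if the KL term has pulled the encoder's output distribution close to $\mathcal{N}(0,1)$ during training.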

III-B Loss

The loss function we use is:

$$\mathcal{L}=\lambda_{G}\mathcal{L}_{G}+\lambda_{T}\mathcal{L}_{T}+\lambda_{lap}\mathcal{L}_{lap}+\lambda_{KL}\mathcal{L}_{KL}, \tag{6}$$

where $\mathcal{L}_{G}=||\boldsymbol{G}_{gt}-\hat{\boldsymbol{G}}||_{2}^{2}$ is the L2 distance between the vertices of the reconstructed model and the ground-truth (gt) SMPLX-Lite-D model, $\mathcal{L}_{T}=||(\boldsymbol{T}_{gt}-\hat{\boldsymbol{T}})\odot\boldsymbol{M_{T}}||_{2}^{2}$ is the L2 loss between the predicted texture map and the gt texture map in the valid UV area given by the mask $\boldsymbol{M_{T}}$, $\mathcal{L}_{lap}=||L(\boldsymbol{G}_{gt})-L(\hat{\boldsymbol{G}})||_{2}^{2}$ is the Laplacian term used to ensure the smoothness of the model, and $\mathcal{L}_{KL}$ is the KL term of the standard VAE model[28].
If gt images are used for supervision, $\mathcal{L}_{T}$ can be replaced with the image loss $\mathcal{L}_{I}=||\boldsymbol{I}_{gt}-\hat{\boldsymbol{I}}||_{2}^{2}$ plus the image mask loss $\mathcal{L}_{M}=||\boldsymbol{M}_{gt}-\hat{\boldsymbol{M}}_{\boldsymbol{I}}||_{2}^{2}$.
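The terms of the loss can be sketched in NumPy as below. This is an illustrative sketch rather than the authors' training code: it assumes a uniform (umbrella) graph Laplacian for $L(\cdot)$, the closed-form KL divergence between a diagonal Gaussian and $\mathcal{N}(0,1)$, and hypothetical loss weights, none of which are specified in the text.

```python
import numpy as np

def laplacian(G, adj):
    """Uniform Laplacian coordinates (assumed form of L): each vertex minus
    the mean of its neighbours. adj is an (N, N) 0/1 adjacency matrix."""
    return G - (adj @ G) / adj.sum(axis=1, keepdims=True)

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over dimensions."""
    return 0.5 * np.sum(mu ** 2 + sigma ** 2 - 1.0 - 2.0 * np.log(sigma))

def total_loss(G_gt, G_hat, T_gt, T_hat, M_T, adj, mu, sigma, lam):
    """Weighted sum of geometry, masked texture, Laplacian and KL terms."""
    L_G = np.sum((G_gt - G_hat) ** 2)
    L_T = np.sum(((T_gt - T_hat) * M_T) ** 2)   # only valid UV texels count
    L_lap = np.sum((laplacian(G_gt, adj) - laplacian(G_hat, adj)) ** 2)
    L_KL = kl_to_standard_normal(mu, sigma)
    return (lam['G'] * L_G + lam['T'] * L_T
            + lam['lap'] * L_lap + lam['KL'] * L_KL)

# Hypothetical weights, chosen only for illustration.
example_weights = {'G': 1.0, 'T': 1.0, 'lap': 0.1, 'KL': 0.01}
```

When images rather than texture maps are used for supervision, the masked texture term would be swapped for the image and mask terms in the same weighted sum.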

IV Experiments

In this section, we present the results of our photorealistic avatar driving method on the SMPLX-Lite dataset. Subsequently, we compare our method with two baselines in both novel view and novel pose synthesis experiments, evaluating geometry and texture generation.

IV-A Reconstruction & Driving

We utilize 9 actions in the dataset for training and the others for testing. Subsequent experiments involve training individual character models on the training set and evaluating their reconstruction and driving effects on the test set.

To begin, we assess the method's ability to reconstruct mesh and texture for new actions of the same person on the test set, which involves using the encoder $E$ to generate a latent code $\boldsymbol{z}$ with the same distribution as the training data. Additionally, we evaluate the driving effect of the model by using the driving signal of the test set to drive the characters. Unlike reconstruction, driving requires randomly sampling the latent code $\boldsymbol{z}$ from the standard normal distribution without the encoder $E$.

The photorealistic reconstructed and driving results, along with quantitative evaluations for all subjects, are presented in Fig.5 and Tab.II, respectively. It is worth noting that driving is marginally less effective than reconstructing due to the absence of hidden space information associated with the character. Furthermore, we test the effect of using the same new sequence of actions to drive five trained character models and present the full results in suppl.

Figure 5: Qualitative Results: (a) Recon, (b) Driving, (c) GT. Both rendered images are really close to the captured image, perfectly recovering clothing details, finger movements and facial expressions.
TABLE II: Quantitative Evaluation. CD ($\times 10^{-3}$) is the distance between the generated model and the SMPLX-Lite-D model.

| Sub. | Reconstruction PSNR↑ | Reconstruction SSIM↑ | Reconstruction CD↓ | Driving PSNR↑ | Driving SSIM↑ | Driving CD↓ |
|---|---|---|---|---|---|---|
| WZL | 26.60 | 0.9454 | 4.1098 | 26.54 | 0.9443 | 4.2264 |
| LDF | 25.69 | 0.9394 | 3.9842 | 24.94 | 0.9307 | 4.2419 |
| ZX | 25.33 | 0.9382 | 4.7422 | 24.71 | 0.9312 | 4.7719 |
| LW | 23.38 | 0.9397 | 5.2762 | 22.42 | 0.9304 | 5.4222 |
| ZC | 24.48 | 0.935 | 4.6135 | 23.49 | 0.9254 | 4.6383 |
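
For readers unfamiliar with the CD metric, the following is a minimal sketch of a symmetric chamfer distance between two point sets. The exact variant used for the tables (squared or unsquared distances, sum or mean reduction, sampling density) is not stated in this section, so the choices below are assumptions made for illustration only.

```python
import numpy as np


def chamfer_distance(a, b):
    """Symmetric chamfer distance between point sets a (n, 3) and b (m, 3).

    Assumed variant: mean squared nearest-neighbor distance, summed over
    both directions. Brute force, so only suitable for small point sets.
    """
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)  # (n, m)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```

For dense meshes, a spatial index such as a KD-tree would replace the brute-force distance matrix.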

IV-B Comparison with Baselines

Figure 6: (a) Novel View Synthesis. (b) Novel Pose Synthesis. Our results in both experiments show clearer and more realistic textures and accurately reconstruct finger movements.

We also compare our method with two baselines, Neural Body (NB)[29] and Ani-NeRF (AN)[16]. Following NB’s setting, our method achieves the best LPIPS and chamfer distance in both novel view and novel pose synthesis experiments, with competitive PSNR and SSIM, as shown in Fig.6 and Tab.III.

TABLE III: Quantitative Results. Our method outperforms the baselines in terms of LPIPS and chamfer distance (CD, $\times 10^{-3}$), while also achieving competitive PSNR and SSIM.

| Method | Novel View PSNR↑ | Novel View SSIM↑ | Novel View LPIPS↓ | Novel View CD↓ | Novel Pose PSNR↑ | Novel Pose SSIM↑ | Novel Pose LPIPS↓ | Novel Pose CD↓ |
|---|---|---|---|---|---|---|---|---|
| NB | 31.29 | 0.9707 | \ul{0.0789} | \ul{11.490} | 29.27 | 0.9616 | \ul{0.0841} | \ul{11.732} |
| AN | 28.05 | 0.9500 | 0.0981 | 16.285 | 26.14 | 0.9382 | 0.1119 | 18.170 |
| Ours | \ul{30.14} | \ul{0.9607} | 0.0567 | 7.1586 | \ul{27.97} | \ul{0.9485} | 0.0675 | 8.7690 |

The robust and highly generalizable nature of our approach enables it to capture intricate details and high-frequency information, leading to clearer textures and hand movements. In contrast, the baselines produce notably blurry results in both experiments, particularly in the hand area, with AN displaying abnormally twisted arms and fingers in Fig.6b. Besides, the meshes generated by our method appear smoother and retain a higher level of detail, as depicted in Fig.7.

Figure 7: Mesh Generation. Our method enables the reconstruction of smoother surface and finer geometric details.

V Conclusion

We propose the SMPLX-Lite model, which simplifies the methods using vertex displacement to fit clothes, while retaining the advantages of the SMPL-X model. This paves the way for the generation of the proposed SMPLX-Lite dataset, which stands as the most comprehensive and fairly photorealistic textured clothed avatar dataset currently available, supporting the advancement of the research community. Leveraging this dataset, we introduce a CVAE-based textured human model driving algorithm, showcasing the substantial advantage of SMPLX-Lite dataset in label richness and photorealism. Notably, our driving algorithm utilizes solely the captured images and textured SMPLX-Lite-D model in the dataset. Additionally, the SMPLX-Lite dataset includes annotations for 2D/3D keypoints and SMPL-X model, high-precision scanned models, and corresponding texture maps, which are invaluable data contributing to pertinent research endeavors.

Acknowledgment

This work was partly supported by the National Natural Science Foundation of China under U23B2030 and the Special Foundations for the Development of Strategic Emerging Industries of Shenzhen (Nos.JSGG20211108092812020 & CJGJZD20210408092804011).

References

  • [1] S. Saito, Z. Huang, R. Natsume, S. Morishima, A. Kanazawa, and H. Li, “Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization,” in ICCV, 2019.
  • [2] S. Saito, T. Simon, J. Saragih, and H. Joo, “Pifuhd: Multi-level pixel-aligned implicit function for high-resolution 3d human digitization,” in CVPR, 2020.
  • [3] Z. Dong, C. Guo, J. Song, X. Chen, A. Geiger, and O. Hilliges, “Pina: Learning a personalized implicit neural avatar from a single rgb-d video sequence,” in CVPR, 2022.
  • [4] Z. Chen and H. Zhang, “Learning implicit fields for generative shape modeling,” in CVPR, 2019.
  • [5] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove, “Deepsdf: Learning continuous signed distance functions for shape representation,” in CVPR, 2019.
  • [6] M. Loper, N. Mahmood, J. Romero, and M. J. Black, “SMPL: A skinned multi-person linear model,” ACM SIGGRAPH Asia, 2015.
  • [7] G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, M. Black et al., “Expressive body capture: 3D hands, face, and body from a single image,” in CVPR, 2019.
  • [8] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black, “Keep it smpl: Automatic estimation of 3d human pose and shape from a single image,” ECCV, 2016.
  • [9] N. Kolotouros, G. Pavlakos, M. Black, and K. Daniilidis, “Learning to reconstruct 3d human pose and shape via model-fitting in the loop,” in ICCV, 2019.
  • [10] B. Bhatnagar, C. Sminchisescu, C. Theobalt, and G. Pons-Moll, “Combining implicit function learning and parametric models for 3d human reconstruction,” in ECCV, 2020.
  • [11] Q. Ma, J. Yang, A. Ranjan, S. Pujades, G. Pons-Moll, and M. J. Black, “Learning to Dress 3D People in Generative Clothing,” in CVPR, 2020.
  • [12] Z. Huang, Y. Xu, C. Lassner, H. Li, and T. Tung, “Arch: Animatable reconstruction of clothed humans,” in CVPR, 2020.
  • [13] T. He, Y. Xu, S. Saito et al., “Arch++: Animation-ready clothed human reconstruction revisited,” in ICCV, 2021.
  • [14] S. Saito, J. Yang, Q. Ma, and M. J. Black, “SCANimate: Weakly supervised learning of skinned clothed avatar networks,” in CVPR, 2021.
  • [15] L. Liu, M. Habermann, V. Rudnev, K. Sarkar, J. Gu, and C. Theobalt, “Neural actor: Neural free-view synthesis of human actors with pose control,” ACM SIGGRAPH Asia, 2021.
  • [16] S. Peng, J. Dong, Q. Wang, S. Zhang, Q. Shuai, X. Zhou, and H. Bao, “Animatable neural radiance fields for modeling dynamic human bodies,” in ICCV, 2021.
  • [17] T. M. Bagautdinov, C. Wu, T. Simon, F. Prada et al., “Driving-signal aware full-body avatars,” ACM TOG, 2021.
  • [18] D. Xiang, F. Prada, T. Bagautdinov, W. Xu, Y. Dong, H. Wen, J. Hodgins, and C. Wu, “Modeling clothing as a separate layer for an animatable human avatar,” ACM TOG, 2021.
  • [19] A. Collet, M. Chuang, P. Sweeney, D. Gillett, D. Evseev, D. Calabrese, H. Hoppe, A. Kirk, and S. Sullivan, “High-quality streamable free-viewpoint video,” ACM TOG, 2015.
  • [20] J. Li, P. Wang, P. Xiong, T. Cai, Z. Yan, L. Yang, J. Liu, H. Fan, and S. Liu, “Practical stereo matching via cascaded recurrent network with adaptive correlation,” in CVPR, 2022.
  • [21] M. Kazhdan, M. Bolitho, and H. Hoppe, “Poisson Surface Reconstruction,” in Symposium on Geometry Processing, 2006.
  • [22] S. Laine, J. Hellsten, T. Karras, Y. Seol, J. Lehtinen, and T. Aila, “Modular primitives for high-performance differentiable rendering,” ACM TOG, 2020.
  • [23] Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh, “Openpose: Realtime multi-person 2d pose estimation using part affinity fields,” IEEE Tran. on PAMI, 2019.
  • [24] M. A. Fischler and R. C. Bolles, “Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography,” Commun. ACM, 1981.
  • [25] Q. Shuai, Q. Fang, J. Dong, S. Peng, D. Huang et al., “Easymocap - make human motion capture easier.” Github, 2021. [Online]. Available: https://github.com/zju3dv/EasyMocap
  • [26] H. Li, B. Adams, L. J. Guibas, and M. Pauly, “Robust single-view geometry and motion reconstruction,” ACM TOG, 2009.
  • [27] W. Cheng, S. Xu, J. Piao, C. Qian et al., “Generalizable neural performer: Learning robust radiance fields for human novel view synthesis,” arXiv preprint arXiv:2204.11798, 2022.
  • [28] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
  • [29] S. Peng, Y. Zhang, Y. Xu, Q. Wang et al., “Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans,” in CVPR, 2021.
  • [30] “Renderpeople,” https://renderpeople.com/, 2018.
  • [31] F. Bogo, J. Romero, G. Pons-Moll, and M. J. Black, “Dynamic FAUST: Registering human bodies in motion,” in CVPR, 2017.
  • [32] C. Zhang, S. Pujades, M. J. Black, and G. Pons-Moll, “Detailed, accurate, human shape estimation from clothed 3d scan sequences,” in CVPR, 2017.
  • [33] P. Patel, C.-H. P. Huang, J. Tesch, D. T. Hoffmann, S. Tripathi, and M. J. Black, “AGORA: Avatars in geography optimized for regression analysis,” in CVPR, 2021.
  • [34] Z. Yu, J. S. Yoon, I. K. Lee, P. Venkatesh, J. Park, J. Yu, and H. S. Park, “Humbi: A large multiview dataset of human body expressions,” in CVPR, 2020.
  • [35] T. Yu, Z. Zheng, K. Guo, P. Liu, Q. Dai, and Y. Liu, “Function4d: Real-time human volumetric capture from very sparse consumer rgbd sensors,” in CVPR, 2021.
  • [36] Z. Cai, D. Ren, A. Zeng, Z. Lin, T. Yu, W. Wang, X. Fan, Y. Gao, Y. Yu, L. Pan, F. Hong, M. Zhang, C. C. Loy, L. Yang, and Z. Liu, “Humman: Multi-modal 4d human dataset for versatile sensing and modeling,” 2022.
  • [37] G. Tiwari, B. L. Bhatnagar, T. Tung, and G. Pons-Moll, “Sizer: A dataset and model for parsing 3d clothing and learning size sensitive 3d clothing,” in ECCV, 2020.
  • [38] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, “Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments,” IEEE Trans. on PAMI, vol. 36, no. 7, pp. 1325–1339, 2014.
Figure 8: SMPLX-Lite provides over 20k high-resolution scan models of 5 subjects performing 15 types of actions.

We provide additional dataset processing details (Appendix A), extended dataset evaluation results (Appendix B), extended diverse dataset visualization (Appendix C) and extended experiment results (Appendix D). Actors and actresses participating in SMPLX-Lite are well-informed and acknowledge that the data will be made public for research purposes.

Appendix A Additional Dataset Processing Details

A-A 3D Human Pose Estimation

We provide a detailed description of the RANSAC[24] algorithm mentioned in Sec.3.2 of the main paper in Algorithm 1.

Algorithm 1 3D Keypoints Estimation by RANSAC
Input: Detected 2D keypoints $K_{2D}$, camera views $C$, number of keypoints $J$, projection matrix $P$, reprojection threshold $\tau$, RANSAC confidence $p$, sample views $V$, number of sample views $v$, maximum number of iterations $I$, minimum reprojection error $e$, iteration $i$
Output: Estimated 3D keypoints $K_{3D}$, reasonable 2D reprojection keypoints $K_{repro}$
1:  for all $j \in range(0, J)$ do
2:     $I = 10000,\ i = 0,\ e = 1000$
3:     while $i \leq I$ do
4:        $V = RANDOM\_SELECT(C, v)$
5:        $\hat{K_{3D}^{j}} = TRIANGULATE(K_{2D,V}^{j}, P_{V})$
6:        $\hat{K_{repro}^{j}} = REPROJECTION(\hat{K_{3D}^{j}}, P)$
7:        $D_{V} = DISTANCE(K_{2D}^{j}, \hat{K_{repro}^{j}})$
8:        $\hat{C}, N_{\hat{C}} = SELECT(D, \tau)$
9:        if $N_{\hat{C}} > 3$ then
10:           $\hat{K_{3D}^{j}} = TRIANGULATE(K_{2D,\hat{C}}^{j}, P_{\hat{C}})$
11:           $\hat{K_{repro}^{j}} = REPROJECTION(\hat{K_{3D}^{j}}, P)$
12:           $D = MEAN\_DISTANCE(K_{2D}^{j}, \hat{K_{repro}^{j}})$
13:           if $D < e$ then
14:              $e = D,\ Inliers = N_{\hat{C}}$
15:              $I = \frac{\log(1-p)}{\log(1-(Inliers/|C|)^{v})}$
16:              $K_{3D}^{j} = \hat{K_{3D}^{j}},\ K_{repro}^{j} = \hat{K_{repro}^{j}}$
17:           end if
18:        end if
19:        $i$ += 1
20:     end while
21:  end for
22:  return $K_{3D}$, $K_{repro}$
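
As an illustration, the inner loop of Algorithm 1 for a single keypoint can be sketched in NumPy as follows. This is a sketch under stated assumptions rather than the authors' implementation: triangulation uses a plain linear (DLT) solve, the mean distance in line 12 is taken over the inlier views, and the helper names (`triangulate`, `reproject`, `ransac_triangulate`) are hypothetical.

```python
import numpy as np


def triangulate(points_2d, proj_mats):
    """Linear (DLT) triangulation of one 3D point from two or more views."""
    rows = []
    for (x, y), P in zip(points_2d, proj_mats):
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]
    return X[:3] / X[3]


def reproject(point_3d, proj_mats):
    """Project one 3D point into every view; returns an (n_views, 2) array."""
    X = np.append(point_3d, 1.0)
    proj = proj_mats @ X
    return proj[:, :2] / proj[:, 2:3]


def ransac_triangulate(points_2d, proj_mats, tau=5.0, p=0.99, v=2,
                       max_iters=10000, min_inliers=3, seed=0):
    """RANSAC triangulation of a single keypoint observed in all views.

    points_2d: (n_views, 2) detections, proj_mats: (n_views, 3, 4).
    Returns the best 3D point and its mean inlier reprojection error.
    """
    rng = np.random.default_rng(seed)
    n_views = len(proj_mats)
    best_point, best_err = None, np.inf
    iters, i = max_iters, 0
    while i <= iters:
        # Hypothesis from a random subset of v views.
        sample = rng.choice(n_views, size=v, replace=False)
        X = triangulate(points_2d[sample], proj_mats[sample])
        dist = np.linalg.norm(reproject(X, proj_mats) - points_2d, axis=1)
        inliers = np.flatnonzero(dist < tau)
        if len(inliers) > min_inliers:
            # Refit on all inlier views and score the refined point.
            X = triangulate(points_2d[inliers], proj_mats[inliers])
            dist = np.linalg.norm(reproject(X, proj_mats) - points_2d, axis=1)
            err = dist[inliers].mean()
            if err < best_err:
                best_point, best_err = X, err
                # Adaptive iteration bound (line 15 of Algorithm 1).
                ratio = len(inliers) / n_views
                denom = np.log(max(1.0 - ratio ** v, 1e-12))
                iters = min(iters, int(np.ceil(np.log(1.0 - p) / denom)))
        i += 1
    return best_point, best_err
```

Running this function once per keypoint reproduces the outer loop over $j$. The iteration bound is clamped so that it can only shrink, which is a common safeguard and an assumption beyond what Algorithm 1 states.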

A-B SMPLX-Lite Model Transfer

Figure 9: Model Transfer Process. (a) SMPL-X model. (b) Intermediate model after vertex reduction. (c) Intermediate model after face flattening. (d) Intermediate model after toe seam process. (e) Final SMPLX-Lite model.

We present the new SMPLX-Lite parametric model, which is derived from SMPL-X. The model aims to capture the intricate geometry of the scanned mesh, while also ensuring stable geometry in critical areas such as the nose, mouth, and feet, as well as preserving the overall facial and finger shapes. The entire process is depicted in Fig.9.

First, we eliminate the vertices within the eyeballs, cochlea, lips, nostrils, and toe seam region from the SMPL-X (9a) model that are either hidden or folded. Subsequently, we connect the edge vertices to create faces, while keeping the remaining vertices and topology unaltered. The resulting model (9b, 9c) still has a large depression area, which could affect vertex fitting. Consequently, we flatten the faces in these particular regions to achieve a smoother surface, ensuring a uniform vertex distribution. Nonetheless, it is observed that the vertex and face distribution remains uneven during the fitting process, resulting in clustering of some vertices and severe distortion of corresponding faces (refer to Fig.9d). To address this issue, we undertake multiple rounds of vertex deletion, face reconstruction, and face flattening to obtain a more suitable model for vertex fitting, which we designate as the SMPLX-Lite model (9e).

Subsequently, the reduction in the number of vertices necessitates adjustments to the matrices $\mathcal{S}$, $\mathcal{E}$, $\mathcal{P}$, $\mathcal{J}$, and $\mathcal{W}$, as described in Sec.2.2 of the main paper, to ensure that the transferred model inherits the control parameters of SMPL-X and the linear blend skinning function. Initially, we resize these matrices to $\mathbb{R}^{N\times *}$, where $* \in \{3|\beta|, 3|\psi|, 3\times 9K, K, K\}$, to ensure that the number of rows remains consistent across all matrices. Then, for the $\mathcal{S}$, $\mathcal{E}$, $\mathcal{P}$, and $\mathcal{W}$ matrices, we determine the nearest neighbor on the SMPL-X model for each vertex of the SMPLX-Lite model, and use the corresponding row in the original matrix to populate the new matrix. However, for the $\mathcal{J}$ matrix, using the nearest neighbor will result in a loss of regression coefficients for certain vertices to joints. To circumvent this, we identify the nearest neighbor on the SMPLX-Lite model for each vertex of the SMPL-X model, and subsequently aggregate the rows corresponding to the same point on the SMPLX-Lite model as a row of the new matrix.
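
The two transfer rules can be sketched as follows. This is a minimal illustration that assumes every matrix has already been reshaped to one row per vertex and that "aggregate" means summing rows; the function names are hypothetical, and the brute-force nearest-neighbor search is only practical for small meshes.

```python
import numpy as np


def nearest_indices(queries, targets):
    """Index of the nearest target point for each query point (brute force)."""
    d2 = ((queries[:, None, :] - targets[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def transfer_per_vertex(matrix_full, verts_full, verts_lite):
    """S, E, P, W: each Lite vertex copies the row of its nearest SMPL-X vertex."""
    idx = nearest_indices(verts_lite, verts_full)
    return matrix_full[idx]


def transfer_joint_regressor(J_full, verts_full, verts_lite):
    """J: each SMPL-X vertex adds its row to its nearest Lite vertex, so no
    regression coefficient is dropped (rows are summed, by assumption)."""
    idx = nearest_indices(verts_full, verts_lite)
    J_lite = np.zeros((len(verts_lite), J_full.shape[1]), dtype=J_full.dtype)
    np.add.at(J_lite, idx, J_full)  # accumulates rows that share an index
    return J_lite
```

Summing in the second rule preserves each column total of the regressor, which is why the search direction is reversed for $\mathcal{J}$. A KD-tree would be the usual replacement for the dense distance matrix at full SMPL-X resolution.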

A-C SMPLX-Lite-D fit

We describe in detail the two stages of the SMPLX-Lite-D fitting process introduced in Sec.2.2 of the main paper.

Stage 1: Embedded Nodes. The embedded nodes are initialized on the T-pose mesh without clustering vertices by radius as done in [26]. Instead, we cluster vertices by connectivity. The unbalanced distribution of embedded nodes is naturally adapted to the distortion ability of SMPLX-Lite mesh surface.

  1) We initialize a candidate set $S_0$ with all the vertices on the mesh. We randomly select one vertex from the candidate set $S_0$ as a new embedded node and remove $k$ levels of neighbor vertices from the candidate set, forming the remaining set $S_1$. By $k$ levels of neighbors, we refer to at least $k$ jumps from the selected vertex to the neighbor vertex. In practice, we use $k = 2$.

  2) Repeat step 1) until the candidate set is empty.

  3) For an embedded node $x_i$, we define a base radius $r_i$ as the average radius of its $k$ levels of neighbors. We define the weight of an embedded node w.r.t. a vertex $v_j$ by their geodesic distance $d_{ij}$:

    $$w(v_j, r_i) = \max\left(0, \left(1 - \frac{d_{ij}^2}{(\alpha r_i)^2}\right)^3\right),$$

    where $\alpha$ controls how far an embedded node can affect. In practice, we have $\alpha = 1.5$.

  4) The smooth term between embedded nodes is defined in a similar way to step 3). Two embedded nodes are considered neighbors if the geodesic distance between them is less than twice the largest base radius of them.
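
Steps 1) to 3) above can be sketched in Python as follows. The sketch assumes that the neighbors removed in step 1) are the vertices reachable within $k$ edge hops of the selected vertex (including the vertex itself) and that geodesic distances are supplied by the caller; `k_ring`, `sample_embedded_nodes` and `node_weight` are hypothetical names.

```python
import random
from collections import deque

import numpy as np


def k_ring(adjacency, start, k):
    """Vertices reachable from `start` in at most k edge hops (assumption)."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        v, depth = frontier.popleft()
        if depth == k:
            continue
        for n in adjacency[v]:
            if n not in seen:
                seen.add(n)
                frontier.append((n, depth + 1))
    return seen


def sample_embedded_nodes(adjacency, k=2, seed=0):
    """Pick nodes by connectivity until the candidate set is empty."""
    rng = random.Random(seed)
    candidates = set(adjacency)
    nodes = []
    while candidates:
        v = rng.choice(sorted(candidates))
        nodes.append(v)
        candidates -= k_ring(adjacency, v, k)
    return nodes


def node_weight(d_ij, r_i, alpha=1.5):
    """Influence of a node with base radius r_i on a vertex at geodesic
    distance d_ij; zero once d_ij reaches alpha * r_i."""
    return np.maximum(0.0, (1.0 - d_ij ** 2 / (alpha * r_i) ** 2) ** 3)
```

Because sampling follows mesh connectivity rather than a fixed radius, densely tessellated regions such as hands and faces receive more nodes, which matches the unbalanced distribution described in Stage 1.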

Upon initializing the embedded nodes on the T-pose mesh, we record the selected vertex indices and weights. When applied to a posed SMPLX-Lite model, the embedded nodes’ positions are initialized using the corresponding vertex positions on the posed mesh. Please note that even though the embedded nodes have the same initial positions as the chosen vertices, they are not bound to remain the same during the subsequent solving iterations.

To solve for the rotation and translation of the embedded nodes, our objective is to minimize the distance of warped vertices towards their nearest match on the scanned mesh. We refer the readers to [26] for more details.

Stage 2: Vertex Shifts. After the fitting with embedded nodes, we only need to solve for tiny vertex shifts to ultimately capture the fine geometry details. With the final shifts regularized by a Laplacian matrix initialized from the resulting mesh of Stage 1, the fitted mesh is denoted by SMPLX-Lite-D.

We present our fit pipeline in Fig.10. Stage 1 reasonably fits the scanned mesh, while Stage 2 presents more geometry details.

Figure 10: SMPLX-Lite-D Fit Pipeline. (a) The scanned mesh. (b) The embedded nodes that control the warping in fitting Stage 1. (c) The result of fitting Stage 1, where joints are fitted and the surface presents a few details. (d) The final result of fitting Stage 2. More geometry details can be seen in the hair and cloth wrinkles.

We also compare the results of our two-stage fitting vs. direct vertex fitting (stage 2 only) in Fig. 11. Direct vertex fitting may generate undesirable artifacts in many regions.

Figure 11: SMPLX-Lite-D Fit Comparison. (a) Ear w/o Stage 1. (b) Ear w/ Stage 1. (c) Hand w/o Stage 1. (d) Hand w/ Stage 1. Large distortion is beyond the applicability of the Laplacian matrix as a regularizer, leading to undesirable artifacts in the finger and ear areas, which are solved by Stage 1.

Appendix B Extended Dataset Evaluation Results

Figure 12: Fitting Results of Different Models. (a) SMPL model cannot control facial expressions and hand movements. (b) SMPL-X model has overly complex faces and toes, making it unsuitable for vertex fitting. (c) SMPLX-Lite model, plus vertex displacement (d, SMPLX-Lite-D), can fit the scanned mesh (e) perfectly, especially in hand regions.
Figure 13: Qualitative Results and Difference Heatmap. (b) Recon. (c) Driving. (d) GT. (e) GT vs Recon. (f) GT vs Driving. (g) Recon vs Driving. Both rendered images are really close to the captured image, perfectly recovering clothing details, finger movements, and facial expressions.
Figure 14: We show the adhesion in the hand and underarm areas of the scanned mesh (left), whereas the SMPLX-Lite-D model (right) does not have adhesion.

The comparison of the SMPLX-Lite dataset with other datasets containing human model fits is presented in Tab.IV. As discussed in the main paper, the SMPLX-Lite dataset offers a range of valuable components, including multi-view images, reconstructed textured models, and fitted clothed parametric models with texture maps. This variety of data types allows for the reconstruction of photorealistic drivable avatars, thereby providing researchers with a broader spectrum of supervision methods compared to datasets that only offer raw images [27] or solely reconstructed textured models [11]. In contrast, other datasets featuring both RGB images and scanned textured meshes are either synthetic or lack registered parametric models. Importantly, these datasets are unable to furnish a parametric model that both facilitates control over facial expressions and hand movements and achieves vertex alignment. The fitting results of different parametric models are compared in Fig.12. Notably, our registered SMPLX-Lite-D models enable multiple supervision methods, such as direct supervision of 3D mesh and texture, as well as supervision with 2D images.

TABLE IV: Comparison with existing datasets containing human model fits. SMPLX-Lite has multiple data types and annotations, and supports multiple tasks. Registered: parametric model fit; Vertex Fit: whether the parametric model can fit clothes; K3D: 3D keypoints; Act: action label; Sequence: sequential data. “Facebook†” means the data used in [17, 18], which is not publicly available.

| Dataset | RGB | Mesh | Texture | Registered | Vertex Fit | Large Pose Variation | K3D | Act | Sequence |
|---|---|---|---|---|---|---|---|---|---|
| RenderPeople[30] | ✓ | ✓ | ✓ | × | × | ✓ | ✓ | × | ✓ |
| DFAUST[31] | ✓ | ✓ | ✓ | Dyna | × | ✓ | ✓ | ✓ | ✓ |
| BUFF[32] | ✓ | ✓ | ✓ | SMPL | × | ✓ | ✓ | ✓ | ✓ |
| AGORA[33] | ✓ | ✓ | ✓ | SMPL-X&SMPL | × | × | ✓ | × | × |
| HUMBI[34] | ✓ | ✓ | ✓ | SMPL | × | ✓ | ✓ | × | ✓ |
| THuman2.0[35] | × | ✓ | ✓ | SMPL-X | × | ✓ | × | × | × |
| ZJU LightStage[29] | ✓ | ✓ | ✓ | SMPL-X | × | ✓ | ✓ | ✓ | ✓ |
| GeneBody[27] | ✓ | × | × | SMPL-X | × | ✓ | × | × | ✓ |
| HuMMan[36] | ✓ | ✓ | ✓ | SMPL | × | ✓ | ✓ | ✓ | ✓ |
| Sizer[37] | ✓ | ✓ | ✓ | SMPL-G | ✓ | × | × | × | × |
| CAPE[11] | × | ✓ | × | SMPL-D | ✓ | ✓ | ✓ | ✓ | ✓ |
| Facebook†[17, 18] | ✓ | ✓ | ✓ | ✓ | ✓ | × | ✓ | × | ✓ |
| Ours | ✓ | ✓ | ✓ | SMPLX-Lite-D | ✓ | ✓ | ✓ | ✓ | ✓ |

Next, we provide more detailed dataset evaluation results. We utilize 8 telephoto cameras and 24 standard cameras to capture the full body and local details simultaneously. The PSNR and SSIM results of the telephoto cameras are lower than those of the standard cameras because they capture finer details, as shown in Fig.15 and the per-action evaluation table.

The per-action evaluation table gives a complete list of the average results per action for each subject. The names of the 15 actions are “01 discussion”, “02 debating”, “03 presentation”, “04 eating”, “05 directions”, “06 greeting”, “07 purchasing”, “08 posing”, “09 waiting”, “10 walking”, “11 walking dog”, “12 phoning”, “13 taking photo”, “14 turning around”, “15 stretching”. Some of the actions follow those in [38].

Appendix C Extended Diverse Dataset Visualization

We present multi-view visualization in Fig.15 and reconstructed high-resolution scan models of 5 subjects in Fig.8.

Figure 15: Multi-View Capture. SMPLX-Lite deploys 24 standard cameras and 8 telephoto cameras to capture multi-view synchronized RGB sequences. We show several frames of images from a part of these cameras.

Appendix D Extended Experiments Results

Our experiment settings are as follows: $epoch = 5$, $batch\_size = 1$, $lr = 5e-4$, $\lambda_G = 0.5$, $\lambda_T = 5$, $\lambda_{lap} = 50$, $\lambda_{KL} = 1$. We utilize AdamW as the optimizer and ExponentialLR as the scheduler with $\gamma = 0.9$.
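
To make these settings concrete, the sketch below computes the learning rate seen in each epoch and combines loss terms with the listed weights. It assumes that the scheduler is stepped once per epoch and that the total loss is a plain weighted sum of the four terms; both are assumptions, and the names `exponential_lr`, `weighted_loss` and `WEIGHTS` are hypothetical.

```python
def exponential_lr(base_lr, gamma, epoch):
    """Learning rate after `epoch` exponential decay steps."""
    return base_lr * gamma ** epoch


def weighted_loss(terms, weights):
    """Weighted sum of named loss terms (assumed combination rule)."""
    return sum(weights[name] * value for name, value in terms.items())


# Loss weights as listed above: geometry, texture, Laplacian, KL.
WEIGHTS = {"G": 0.5, "T": 5.0, "lap": 50.0, "KL": 1.0}

# One learning rate per epoch for the 5 training epochs.
schedule = [exponential_lr(5e-4, 0.9, e) for e in range(5)]
```

Under these assumptions the learning rate decays from $5 \times 10^{-4}$ to $5 \times 10^{-4} \cdot 0.9^{4}$ over the five epochs.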

D-A Driving Results

We train a drivable model for each subject and use the same driving signal to drive all the models, as shown in Fig.16. Driven by the same signal, all reconstructed human models can present corresponding actions and facial expressions, and the geometry and texture of clothes change reasonably with the change of pose.

Figure 16: Driving results of 5 models by the same driving signal. Each column represents a different driving signal.

D-B Driving vs. Reconstruction

We further visualize the qualitative results from Fig.5 of the main paper in Fig.13. We nonlinearly transform the difference between each pair of images to obtain heatmaps. The heatmaps show that the driving results are very close to the reconstruction results and that both restore the captured image, recovering clothing details, finger movements, and facial expressions.

D-C Ablation Study

We perform ablation experiments to compare the effects of texture supervision and image supervision. The experiment settings are as follows: Tex: $\lambda_{G}=0.5$, $\lambda_{T}=5$; Img: $\lambda_{G}=0.5$, $\lambda_{I}=5$, $\lambda_{M}=5$; Both: $\lambda_{G}=0.5$, $\lambda_{T}=2.5$, $\lambda_{I}=2.5$, $\lambda_{M}=2.5$. As the results in Tab. V demonstrate, texture map supervision works better than image supervision. Supervising with both while simply averaging the loss weights gives the worst result.
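The three settings can be read as different weightings of the same loss terms. The sketch below encodes them as dictionaries and forms a weighted sum, assuming the ablated part of the objective is $\sum_k \lambda_k L_k$ over the terms listed for each setting. The names `ABLATION_WEIGHTS` and `weighted_loss` are hypothetical, and the per-term loss values are placeholders rather than measured quantities.

```python
# Loss weights per ablation setting, keyed by the subscripts used in the text.
ABLATION_WEIGHTS = {
    "Tex": {"G": 0.5, "T": 5.0},
    "Img": {"G": 0.5, "I": 5.0, "M": 5.0},
    "Both": {"G": 0.5, "T": 2.5, "I": 2.5, "M": 2.5},
}


def weighted_loss(terms, weights):
    """Sum of lambda_k * L_k over the terms enabled in `weights`."""
    return sum(weight * terms[name] for name, weight in weights.items())


# Placeholder per-term loss values, purely illustrative (not measured).
terms = {"G": 0.2, "T": 0.1, "I": 0.1, "M": 0.1}
totals = {name: weighted_loss(terms, w) for name, w in ABLATION_WEIGHTS.items()}
```

Terms absent from a setting are simply not evaluated, which mirrors switching that form of supervision off.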

TABLE V: Ablation Study Results. We use either texture loss or image loss to supervise the generated texture map, and we also test the results of using both. The best results can be attained by using texture map supervision, which is only possible with our dataset.
Supervise PSNR↑ SSIM↑ CD↓ ($\times 10^{-3}$)
Texture 26.17 0.9396 4.5589
Image 26.01 0.9335 9.0415
Both 19.32 0.5925 40.000
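For readers unfamiliar with the PSNR column, the sketch below gives the standard definition, $10\log_{10}(\mathrm{MAX}^2/\mathrm{MSE})$, for two equally sized images. It assumes pixel values in [0, 1] stored as nested lists; whether the authors evaluate full frames or masked regions is not stated, so this is only the generic formula, and the function name `psnr` is hypothetical.

```python
import math


def psnr(img_a, img_b, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    flat_a = [p for row in img_a for p in row]
    flat_b = [p for row in img_b for p in row]
    if len(flat_a) != len(flat_b) or not flat_a:
        raise ValueError("images must be non-empty and equally sized")
    mse = sum((a - b) ** 2 for a, b in zip(flat_a, flat_b)) / len(flat_a)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example: a uniform error of 0.1 on a [0, 1] scale.
toy_psnr = psnr([[0.0, 0.0]], [[0.1, 0.1]])
```

Higher values indicate a smaller mean squared error, which is why the column is marked with an upward arrow.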

Appendix E Discussion of Limitations

In this section, we discuss several limitations of the SMPLX-Lite dataset and driving method.

As shown in Fig. 14, the scanned mesh reconstructed from the depth map and the point cloud exhibits adhesion between surfaces that lie very close together, such as at the hands and underarms, whereas the fitted parametric model SMPLX-Lite-D does not. Therefore, chamfer distance (CD) may not be the most appropriate evaluation metric, since it does not reflect this advantage of our fitted model. A more reasonable metric is needed to evaluate the quality of the fitted mesh.
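To make the metric concrete, here is a minimal brute-force sketch of a symmetric chamfer distance between two point sets. Several variants exist (squared or unsquared distances, sums or means); this version uses the mean unsquared nearest-neighbor distance in each direction, which may differ from the exact variant used for the reported numbers. The function name and the toy points are assumptions.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric chamfer distance between two non-empty 3D point sets.

    Assumption: mean nearest-neighbor Euclidean distance, summed over both
    directions. Brute force, O(len(a) * len(b)), for illustration only.
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


# Toy point sets in meters, purely illustrative.
fitted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
scanned = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.1)]
cd = chamfer_distance(fitted, scanned)
```

Because every point is matched only to its nearest neighbor, a scan with fused fingers or underarms penalizes a fitted mesh that correctly keeps those surfaces apart, which is the limitation discussed above.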

As for the driving method, our proposed one is only a preliminary baseline. It works well overall, but artifacts can occur when it is driven with out-of-distribution actions. In addition, facial expression control in the current algorithm is still elementary. To get a drivable model with good generalization capabilities, a large amount of data is needed to train the neural network, which our dataset now provides.

In future studies, we will further increase the diversity and number of action sequences and improve the SMPLX-Lite-D fitting results. We will improve the baseline driving algorithm to take full advantage of the diverse data in the SMPLX-Lite dataset and achieve better driving results, and we will consider decoupling expressions from whole-body poses to produce more vivid facial expressions. We will also consider training usable models with less data in order to reduce training time.

TABLE VI: Complete Dataset Evaluation Results. We render the textured models of 5 subjects to 32 views (8 telephoto cameras and 24 standard cameras) and compare them with the captured images to obtain PSNR and SSIM; we compare the geometry of the fitted SMPLX-Lite-D model with the scanned mesh to obtain the chamfer distance (CD, $\times 10^{-3}$ m). Std denotes the average over the 24 standard cameras, and Tele denotes the average over the 8 telephoto cameras.
Columns: Subject, Act, Scan PSNR↑ (Std, Tele), Scan SSIM↑ (Std, Tele), SMPLX-Lite-D PSNR↑ (Std, Tele), SMPLX-Lite-D SSIM↑ (Std, Tele), CD↓ ($\times 10^{-3}$)
WZL 01 30.26 25.93 0.9803 0.9425 30.00 25.59 0.9797 0.9407 6.734
02 30.31 26.00 0.9805 0.9419 30.00 25.47 0.9799 0.9400 6.756
03 30.23 25.94 0.9803 0.9415 29.83 25.32 0.9796 0.9396 6.829
04 30.11 25.77 0.9800 0.9416 29.79 25.23 0.9796 0.9400 6.534
05 30.05 25.59 0.9798 0.9424 29.74 25.06 0.9792 0.9406 6.683
06 29.61 24.90 0.9791 0.9412 29.36 24.51 0.9786 0.9399 6.465
07 30.11 25.86 0.9802 0.9448 29.80 25.47 0.9796 0.9436 6.746
08 29.68 24.69 0.9790 0.9410 29.44 24.33 0.9785 0.9398 6.644
09 30.07 25.77 0.9800 0.9457 29.82 25.43 0.9795 0.9442 6.657
10 29.66 26.76 0.9798 0.9567 29.43 26.43 0.9792 0.9553 6.899
11 29.23 27.12 0.9784 0.9672 29.04 26.83 0.9779 0.9659 7.042
12 29.86 25.59 0.9797 0.9440 29.60 25.20 0.9791 0.9425 6.648
13 29.86 25.71 0.9797 0.9431 29.55 25.18 0.9791 0.9414 6.703
14 30.08 26.58 0.9803 0.9561 29.82 26.19 0.9797 0.9542 6.859
15 29.96 25.89 0.9796 0.9432 29.68 25.44 0.9789 0.9416 6.858
LDF 01 28.72 27.03 0.9750 0.9572 28.47 26.63 0.9743 0.9561 6.902
02 28.82 27.11 0.9749 0.9566 28.59 26.70 0.9744 0.9551 6.834
03 28.78 27.29 0.9747 0.9566 28.54 26.85 0.9742 0.9553 6.852
04 28.83 27.13 0.9747 0.9551 28.61 26.74 0.9742 0.9538 6.873
05 28.59 27.00 0.9741 0.9548 28.35 26.57 0.9735 0.9536 7.004
06 28.21 26.91 0.9734 0.9564 27.98 26.54 0.9729 0.9553 6.882
07 28.72 26.92 0.9740 0.9551 28.50 26.56 0.9735 0.9543 6.888
08 28.48 26.78 0.9731 0.9536 28.21 26.31 0.9726 0.9523 6.840
09 28.80 26.96 0.9746 0.9576 28.58 26.59 0.9740 0.9559 7.078
10 29.15 28.75 0.9780 0.9761 28.93 28.46 0.9773 0.9751 9.649
11 28.88 28.14 0.9774 0.9710 28.65 27.82 0.9768 0.9697 9.112
12 28.86 27.28 0.9745 0.9569 28.64 26.88 0.9739 0.9556 7.215
13 28.54 26.98 0.9731 0.9550 27.37 25.83 0.9655 0.9456 6.929
14 28.80 27.12 0.9751 0.9606 28.61 26.79 0.9744 0.9590 7.363
15 28.12 26.90 0.9721 0.9547 27.81 26.42 0.9712 0.9532 7.198
ZX 01 29.83 27.91 0.9799 0.9646 29.43 27.07 0.9790 0.9623 6.885
02 29.52 27.56 0.9795 0.9650 29.18 26.84 0.9786 0.9625 6.785
03 29.48 27.69 0.9796 0.9664 29.20 27.09 0.9791 0.9654 6.700
04 29.51 27.58 0.9791 0.9653 29.12 26.83 0.9783 0.9629 6.633
05 29.36 27.52 0.9788 0.9645 28.99 26.69 0.9779 0.9623 6.775
06 28.67 26.79 0.9768 0.9621 28.37 26.09 0.9765 0.9604 6.514
07 29.87 28.40 0.9792 0.9696 29.56 27.82 0.9786 0.9681 6.795
08 28.84 27.36 0.9778 0.9644 28.51 26.58 0.9768 0.9618 6.768
09 29.23 27.44 0.9796 0.9654 28.91 26.76 0.9789 0.9635 6.869
10 29.66 27.91 0.9806 0.9682 29.34 27.35 0.9797 0.9663 6.911
11 29.37 28.19 0.9803 0.9728 29.16 27.73 0.9795 0.9713 7.110
12 29.52 27.71 0.9794 0.9653 29.10 26.85 0.9785 0.9628 6.825
13 28.94 27.91 0.9796 0.9684 28.58 27.22 0.9788 0.9665 6.924
14 29.38 27.69 0.9800 0.9668 29.10 27.09 0.9791 0.9645 6.941
15 29.42 27.75 0.9788 0.9648 28.95 26.76 0.9776 0.9622 6.917
LW 01 28.05 25.45 0.9799 0.9625 27.70 24.67 0.9794 0.9612 6.312
02 28.49 26.70 0.9775 0.9657 28.12 26.31 0.9770 0.9646 6.192
03 28.29 26.24 0.9768 0.9652 27.90 25.82 0.9763 0.9642 6.254
04 28.08 25.68 0.9800 0.9602 27.67 24.67 0.9791 0.9579 6.379
05 28.07 25.38 0.9798 0.9603 27.59 24.40 0.9790 0.9581 6.394
06 27.45 25.10 0.9786 0.9577 27.07 24.20 0.9780 0.9558 6.184
07 27.49 25.27 0.9798 0.9644 27.10 24.50 0.9789 0.9627 6.429
08 27.96 25.26 0.9789 0.9560 27.44 24.22 0.9779 0.9540 6.488
09 27.99 25.20 0.9806 0.9630 27.52 24.25 0.9797 0.9608 6.622
10 27.82 25.05 0.9806 0.9654 27.42 24.24 0.9798 0.9635 6.517
11 28.07 25.93 0.9800 0.9705 27.70 25.24 0.9792 0.9692 6.691
12 28.31 25.47 0.9798 0.9631 27.86 24.74 0.9788 0.9614 6.609
13 27.32 24.54 0.9796 0.9649 26.83 23.89 0.9788 0.9633 6.282
14 27.75 25.00 0.9805 0.9673 27.33 24.18 0.9797 0.9656 6.478
15 27.71 25.33 0.9790 0.9572 27.21 24.25 0.9782 0.9551 6.316
ZC 01 29.01 25.78 0.9712 0.9358 28.61 25.07 0.9704 0.9344 6.958
02 28.04 25.43 0.9705 0.9367 27.67 24.81 0.9698 0.9357 7.033
03 27.79 25.69 0.9706 0.9378 27.61 25.14 0.9699 0.9368 7.038
04 27.63 25.69 0.9704 0.9378 27.40 25.13 0.9696 0.9372 7.110
05 27.94 25.19 0.9672 0.9350 27.55 24.56 0.9663 0.9342 6.907
06 28.02 25.15 0.9691 0.9286 27.42 24.33 0.9640 0.9181 6.706
07 28.16 25.51 0.9710 0.9371 27.40 24.62 0.9610 0.9158 6.404
08 28.40 25.35 0.9698 0.9299 27.97 24.62 0.9689 0.9290 6.862
09 28.35 25.47 0.9706 0.9341 28.06 24.85 0.9698 0.9333 7.099
10 28.64 26.70 0.9724 0.9570 28.28 26.35 0.9718 0.9561 6.684
11 28.23 25.32 0.9722 0.9415 27.93 24.78 0.9714 0.9408 6.973
12 28.43 25.69 0.9710 0.9380 28.07 25.05 0.9700 0.9373 7.132
13 27.99 25.38 0.9698 0.9464 27.52 24.69 0.9688 0.9448 6.665
14 28.56 25.46 0.9750 0.9515 28.19 24.89 0.9740 0.9501 7.033
15 28.16 25.37 0.9699 0.9356 27.76 24.72 0.9689 0.9344 7.013
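Because Std and Tele are averages over 24 and 8 cameras respectively, a single figure over all 32 views can be recovered as a camera-count-weighted mean. The sketch below shows that arithmetic for one row of the table (subject WZL, act 01, Scan PSNR). The function name `all_view_mean` is hypothetical, and the result is a derived illustration rather than a number reported by the authors.

```python
NUM_STD = 24   # standard cameras
NUM_TELE = 8   # telephoto cameras


def all_view_mean(std_mean, tele_mean):
    """Mean over all 32 views, given the per-group means from the table."""
    total = NUM_STD * std_mean + NUM_TELE * tele_mean
    return total / (NUM_STD + NUM_TELE)


# Scan PSNR for subject WZL, act 01: Std = 30.26, Tele = 25.93.
wzl_01_scan_psnr = all_view_mean(30.26, 25.93)
print(f"{wzl_01_scan_psnr:.4f}")
```

The same helper applies to the SSIM columns, since they are also reported as per-group averages.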