3D Geometry-aware Deformable Gaussian Splatting for Dynamic View Synthesis

Zhicheng Lu1,*, Xiang Guo1,*, Le Hui1,†, Tianrui Chen1,2, Min Yang2, Xiao Tang2, Feng Zhu2, Yuchao Dai1,†
* Equal contributions. † Corresponding authors.
1Northwestern Polytechnical University 2Samsung R&D Institute
{zhichenglu, guoxiang, cherryxchen}@mail.nwpu.edu.cn
{daiyuchao, huile}@nwpu.edu.cn
   {min16.yang, xiao1.tang, f15.zhu}@samsung.com
Abstract

In this paper, we propose a 3D geometry-aware deformable Gaussian Splatting method for dynamic view synthesis. Existing neural radiance field (NeRF) based solutions learn the deformation in an implicit manner, which cannot incorporate 3D scene geometry. Therefore, the learned deformation is not necessarily geometrically coherent, which results in unsatisfactory dynamic view synthesis and 3D dynamic reconstruction. Recently, 3D Gaussian Splatting has provided a new representation of the 3D scene, building upon which the 3D geometry can be exploited in learning the complex 3D deformation. Specifically, the scenes are represented as a collection of 3D Gaussians, where each 3D Gaussian is optimized to move and rotate over time to model the deformation. To enforce the 3D scene geometry constraint during deformation, we explicitly extract 3D geometry features and integrate them in learning the 3D deformation. In this way, our solution achieves 3D geometry-aware deformation modeling, which enables improved dynamic view synthesis and 3D dynamic reconstruction. Extensive experimental results on both synthetic and real datasets demonstrate the superiority of our solution, which achieves new state-of-the-art performance. The project is available at https://npucvr.github.io/GaGS/.

1 Introduction

Dynamic View Synthesis (DVS) aims at rendering novel photorealistic views at arbitrary viewpoints and any input time step given a monocular video of a dynamic scene, which has broad applications in virtual reality and augmented reality. Recently, empowered with effective representations such as neural radiance fields (NeRF) [33] and Gaussian Splatting [22], novel view synthesis for static scenes has been greatly advanced. However, this success cannot be extended to its dynamic counterpart directly. The main obstacle is the difficulty of modeling and representing the scene deformation. Because of the inherent motion/shape ambiguity in monocular dynamic 3D representation, dynamic scene modeling and synthesis are more challenging, especially for monocular video with limited observations.

Figure 1: Geometric information exploited by different methods. a) Early dynamic NeRF methods such as D-NeRF [40] directly encode the coordinate $\mathbf{p}$ of the sample point as the input feature for the deformation network. b) Interpolation is used to fuse features from neighbouring grids, and multiscale interpolation enhances the local geometry information [17, 30, 12, 57]. c) We propose to voxelize a set of Gaussian distributions and use a sparse convolution network to extract geometry-aware features for deformation learning.

In addressing the above challenges, one common strategy is to represent the dynamic scenes as a combination of a static canonical field and a deformation model [40, 36, 37, 55, 12, 17, 18, 57, 30], whereas the bottleneck lies in representing the diverse and complex real-world 3D deformation. To represent geometrically consistent 3D deformation, the local geometric/structural information is critical, since the deformations of the objects in the real world are highly correlated to their 3D structures. Furthermore, the motions of the object points are deeply coupled with the motions of their neighboring points. Thus, how to incorporate the local geometric information to learn locally smooth and consistent 3D deformations becomes the research focus in DVS.

Recently, different deformation models have utilized the local geometric information, but they all have their limitations. As shown in Fig. 1 a), originally in D-NeRF [40], the feature (positional encoding) of each sampled point is extracted independently of the others. Subsequent works notice that this method cannot handle complex dynamic scenes, since the extracted features contain little information from neighboring points. In Fig. 1 b), interpolation is introduced to fuse features of neighboring grids. NDVG [17] and RoDynRF [30] gradually increase the voxel resolution, so that the large voxels used early on cover a larger area and introduce local smoothness at the early stage of training. However, this strategy covers only a limited range of local areas and does not help at later training stages. TiNeuVox [12] and SUDS [57] interpolate at multiple scales. Nevertheless, the interpolation operation is rather simple for extracting local geometric information and introduces non-smooth results and artifacts [20, 4].

In modeling the nonrigid deformation, it is crucial to account for the consistency in the motion of local neighborhoods. A point-level MLP has a limited receptive field and therefore cannot capture the local geometric features of point clouds. To utilize the local geometric information effectively, we propose to use 3D sparse convolution. As shown in Fig. 1 c), building upon the recent explicit point cloud based Gaussian Splatting representation, we introduce a sparse convolution network to extract 3D geometry-aware features. Compared with simple feature interpolation, the convolutional neural network is superior in extracting local information and has a much larger receptive field. Also, we treat the 3D Gaussian distributions as point clouds, which enables sparse 3D convolution for time and memory efficiency. Note that FDNeRF [18] uses a 3D U-Net to inpaint the missing area in the voxel grid. However, this inpainting network is not used for deformation modeling, and the rendering speed and voxel resolution are also limited.

Originally in Gaussian Splatting [22], the rotation parameter of each Gaussian is represented by a quaternion. However, the quaternion representation for rotation is discontinuous in parameter space for neural network learning [74]. We introduce the continuous 6D rotation [74] to ensure that the network learns a continuous function in the parameter space, which accurately represents the rotational states of each Gaussian at different times.

Overall, our method has two main components: a Gaussian canonical field and a deformation field. The Gaussian canonical field consists of 3D Gaussian distributions and a geometry-aware feature learning network. The explicit 3D Gaussian distributions represent the geometry of the canonical scene, and the sparse 3D CNN extracts local structural/geometric information for each Gaussian. The deformation field estimates a transformation for each Gaussian in the canonical field, which transfers the Gaussian from the canonical field to the given timestamp. Finally, we use 3D Gaussian splatting to render images for the given timestamp.

Our main contributions are summarized as:

  • We propose a geometry-aware feature extraction network based on 3D Gaussian distribution to better utilize local geometric information.

  • We propose to use a continuous 6D rotation representation and a modified density control strategy to adapt Gaussian splatting to dynamic scenes.

  • Extensive experiments on both synthetic and real datasets show that our method surpasses competing methods by a wide margin.

2 Related Work

Figure 2: The pipeline of our proposed 3D geometry-aware deformable Gaussian splatting. In the Gaussian canonical field, we reconstruct a static scene in canonical space using 3D Gaussian distributions. We extract positional features using an MLP, as well as local geometric features using a 3D U-Net, and fuse them with another MLP to form the geometry-aware features. In the deformation field, taking the geometry-aware features and timestamp $t$, an MLP estimates the 3D Gaussian deformation, which transfers the canonical 3D Gaussian distributions to timestamp $t$. Finally, a rasterizer renders the transformed 3D Gaussians to images.

2.1 Novel View Synthesis

Novel View Synthesis (NVS) is a well-known task in both computer vision and graphics [6, 9, 23, 16]. Surveys such as [47, 52, 53] provide comprehensive discussions. Explicit NVS methods generally reconstruct an explicit 3D model of a scene in the form of point clouds [1], voxels, or meshes [44, 45, 54, 19]. Once the geometry of the scene is represented, novel view images can be rendered from arbitrary viewpoints via manipulating the camera pose parameters. Other methods  [21, 38, 10, 44, 45, 13, 64] tackle NVS by estimating depth maps using multi-view geometry, whereas the features are aggregated from co-visible frames.

Neural Radiance Fields (NeRF) [33] is a groundbreaking approach that utilizes Multi-Layer Perceptrons (MLPs) to represent scenes implicitly. This methodology enables the modeling of a 5D radiance field, resulting in the impressive synthesis of views for static scenes. Numerous subsequent works expand the capabilities of NeRF by adapting it to various scenarios, such as handling larger and unbounded scenes [71, 51, 63, 43, 32], scene editing and relighting [5, 49, 73, 65], [3, 20, 4], and improving the generalization ability [8, 56, 70, 60]. Meanwhile, researchers focus on achieving more efficient rendering and optimization in a NeRF-like framework. Some works [35, 27, 39, 29, 69, 31, 14, 7] investigate efficient sampling methods along each ray for color accumulation, while [41, 42] partition the scene into multiple sub-regions as an efficient pre-processing step, and [68, 50, 34, 14, 7] exploit voxel-grid representations to speed up the optimization. Very recently, 3D-GS [22] proposes to use 3D Gaussian distributions to represent the scene, obtaining promising results. However, these methods are mainly applicable to static scenes, and fail in scenes with dynamic objects.

2.2 Dynamic View Synthesis

A recent trend in NVS is to extend the success in static NVS to dynamic NVS. One viable strategy is to construct a 4D spatial-temporal representation. Yoon et al. [67] combine single-view and multi-view depth to achieve NVS by 3D warping. Gao et al. [15] use a time-invariant model and a time-varying model to represent the static part and dynamic part of a scene, respectively, and use scene flow for motion modeling. NeRFlow [11] proposes a 4D spatial-temporal representation of a dynamic scene. Xian et al. [62] map a spatial-temporal location to the color and volume density by a 4D spatial-temporal radiance field. NSFF [26] represents a dynamic scene as a continuously changing function, encompassing various aspects of the scene, including appearance, geometry, and 3D scene motion. DCT-NeRF [58] uses the Discrete Cosine Transform (DCT) to replace the scene flow in NSFF [26] to enable smoother motion trajectories. HexPlane [7] and K-Planes [14] project the 4D spatial-temporal space onto multiple 2D planes.

On the other hand, works such as [40, 36, 37, 48, 55, 17, 18, 12, 30, 57] decode the dynamic scene with a canonical field and a deformation field. Along this pipeline, D-NeRF [40] first proposes the canonical-based framework. However, the deformation network utilizes positional features with little geometry information, which cannot handle complex dynamic scenarios well. Nerfies [36] proposes a coarse-to-fine optimization method for coordinate-based models that allows for more robust optimization. HyperNeRF [37] lifts the canonical field into a higher dimensional space to handle topological changes. NDVG [17] and RoDynRF [30] gradually increase the voxel resolution, which has two benefits. TiNeuVox [12] and SUDS [57] interpolate the features at multiple scales for deformation learning. The multi-scale interpolation covers a larger receptive field, which benefits modeling varying motions.

Very recently, with the stunning debut of 3D Gaussian Splatting [22], some works have introduced this point-based representation into their pipelines to synthesize high-fidelity images of a dynamic scene. Deformable 3DGS [66] proposes a deformable 3D Gaussian framework with a novel annealing smoothing training mechanism, which achieves real-time rendering in dynamic scenes. Wu et al. [61] introduce a 4D Gaussian Splatting representation and utilize a deformation field to model both Gaussian motions and shape changes. However, the multi-scale HexPlane interpolation has limited ability to extract the geometry information, which is still insufficient for modeling complex motions. The projection-based representation compresses the 3D space to 2D space, losing 3D geometric information for deformation learning. In contrast, our canonical-based method can fully leverage the 3D information in 3D space.

3 Method

In this section, we present our 3D geometry-aware deformable Gaussian Splatting solution for dynamic view synthesis; an overview of our pipeline is illustrated in Fig. 2. Given a set of images or a monocular video of a dynamic scene, with a time label for each frame and known camera intrinsic and extrinsic parameters, our goal is to synthesize a novel view from any desired viewpoint at any desired time. Our method mainly consists of two core components: the Gaussian canonical field is used to learn the reconstruction of static scenes, while the deformation field is used to learn object deformation. First, we review the static 3D Gaussian splatting in Sec. 3.1. Then, we introduce the proposed Gaussian canonical field in Sec. 3.2, which consists of 3D Gaussian distributions and a geometry feature learning network. Next, in Sec. 3.3, we propose a 3D geometry-aware deformation field to learn transformations for given time steps, which transform our canonical 3D Gaussian distributions to the corresponding times. In Sec. 3.4, we explain the process of rendering images from the transformed 3D Gaussian distributions. Finally, we present our losses and density control modifications in Sec. 3.5.

3.1 Preliminary

3D-GS [22] represents the scene with sparse 3D Gaussian distributions. Each Gaussian has an anisotropic covariance $\mathbf{\Sigma}\in\mathbb{R}^{3\times 3}$ and a mean value $\mu\in\mathbb{R}^{3}$:

$$\mathbf{G}(\mathbf{x})=e^{-\frac{1}{2}(\mathbf{x}-\mu)^{\top}\mathbf{\Sigma}^{-1}(\mathbf{x}-\mu)}. \tag{1}$$

The covariance matrix $\mathbf{\Sigma}$ can be decomposed into a scaling matrix $\mathbf{S}\in\mathbb{R}^{3\times 3}$ and a rotation matrix $\mathbf{R}\in\mathrm{SO}(3)$. This ensures that the covariance matrix is positive semi-definite, while reducing the learning difficulty of 3D Gaussians:

$$\mathbf{\Sigma}=\mathbf{R}\mathbf{S}\mathbf{S}^{\top}\mathbf{R}^{\top}. \tag{2}$$
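As an illustration of Eq. (2), the following minimal NumPy sketch builds a covariance matrix from a scale vector and a rotation matrix and checks that it is symmetric and positive semi-definite. The function name `build_covariance` and the example values are hypothetical and are not taken from the authors' implementation.

```python
import numpy as np

def build_covariance(scale, rotation):
    """Return Sigma = R S S^T R^T for a single Gaussian.

    scale:    (3,) per-axis scales, used as the diagonal of S.
    rotation: (3, 3) rotation matrix R.
    """
    S = np.diag(scale)
    M = rotation @ S          # R S
    return M @ M.T            # (R S)(R S)^T = R S S^T R^T

# Hypothetical example: a 90-degree rotation about the z-axis.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
s = np.array([2.0, 1.0, 0.5])

Sigma = build_covariance(s, R)
eigenvalues = np.linalg.eigvalsh(Sigma)  # all non-negative by construction
```

Because the matrix is formed as a product of a matrix with its own transpose, it cannot have negative eigenvalues, whatever values the optimizer assigns to the scale and rotation.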

To render an image from a designated viewpoint, the covariance matrix $\mathbf{\Sigma}^{\prime}$ in camera coordinates can be calculated given a viewing transformation $\mathbf{W}$, following [75]:

$$\mathbf{\Sigma}^{\prime}=\mathbf{J}\mathbf{W}\mathbf{\Sigma}\mathbf{W}^{\top}\mathbf{J}^{\top}, \tag{3}$$

where $\mathbf{J}$ is the Jacobian of the affine approximation of the projective transformation, and $\mathbf{W}$ is the world-to-camera transformation matrix.
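A small sketch may make Eq. (3) more concrete. It assumes a pinhole camera with focal lengths `fx` and `fy`, takes $\mathbf{W}$ to be the rotation part of the world-to-camera transform, and keeps only the two image-plane rows of the Jacobian, so the result is a 2×2 covariance. These choices and all names are illustrative assumptions rather than details stated in the text.

```python
import numpy as np

def project_covariance(Sigma, W, mean_cam, fx, fy):
    """Compute Sigma' = J W Sigma W^T J^T for one Gaussian.

    Sigma:    (3, 3) covariance in world coordinates.
    W:        (3, 3) rotation part of the world-to-camera transform.
    mean_cam: (3,) Gaussian mean expressed in camera coordinates.
    fx, fy:   focal lengths of an assumed pinhole camera.
    """
    x, y, z = mean_cam
    # Jacobian of the perspective projection (u, v) = (fx*x/z, fy*y/z),
    # evaluated at the Gaussian mean.
    J = np.array([[fx / z, 0.0, -fx * x / z**2],
                  [0.0, fy / z, -fy * y / z**2]])
    T = J @ W
    return T @ Sigma @ T.T

# Hypothetical example: a unit Gaussian on the optical axis at depth 2.
Sigma_2d = project_covariance(np.eye(3), np.eye(3),
                              np.array([0.0, 0.0, 2.0]), 100.0, 100.0)
```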

Each Gaussian is parameterized into the following attributes: position $\mathbf{x}\in\mathbb{R}^{3}$, color defined by spherical harmonics coefficients $\mathbf{c}\in\mathbb{R}^{k}$, rotation $\mathbf{r}\in\mathbb{R}^{4}$, scale $\mathbf{s}\in\mathbb{R}^{3}$, and opacity $o\in\mathbb{R}$. Point-based $\alpha$-blending and volumetric rendering like NeRF [33] essentially share the same image formation model for the splatting process. Specifically, the color $\mathbf{C}$ of each pixel is influenced by the related Gaussians:

$$\mathbf{C}=\sum_{i=1}^{N}\mathbf{T}_{i}\alpha_{i}\mathbf{c}_{i}, \tag{4}$$

where $\alpha_{i}$ represents the density of the Gaussian point computed by a Gaussian with covariance $\mathbf{\Sigma}$ multiplied by its opacity.
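The blending in Eq. (4) can be sketched as follows. The text does not define $\mathbf{T}_i$; the sketch assumes the usual front-to-back compositing, in which $\mathbf{T}_i=\prod_{j<i}(1-\alpha_j)$ is the transmittance accumulated over the Gaussians in front of the $i$-th one. The function name and example values are hypothetical.

```python
import numpy as np

def composite(colors, alphas):
    """Blend N Gaussians sorted front to back into one pixel color.

    colors: (N, 3) per-Gaussian colors c_i.
    alphas: (N,)   per-Gaussian alpha values in [0, 1].
    """
    # Assumed transmittance: T_i = prod_{j < i} (1 - alpha_j), with T_1 = 1.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    weights = transmittance * alphas
    return weights @ colors

# Hypothetical example: a red Gaussian in front of a green one.
colors = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
alphas = np.array([0.5, 0.5])
pixel = composite(colors, alphas)
```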

3.2 Gaussian Canonical Field

In this section, we first reconstruct a static scene in canonical space. Then, we propose a geometric branch, which enables geometry feature learning of the 3D Gaussian distributions for the subsequent deformation field.

Gaussian parameters. Similar to 3D-GS [22], each Gaussian in the canonical space is characterized by position, color, scale, and opacity. For rotation, we follow [74] and use a continuous 6D rotation representation. Compared with the quaternion representation used in 3D-GS, the 6D rotation representation helps our method estimate the deformation of each Gaussian from canonical space to time-space, especially by helping the neural network learn smooth rotation variation over time. Specifically, we set learnable parameters $\left[a_{1},a_{2}\right]$ for each Gaussian to denote its rotation in canonical space, where $a_{1}$ and $a_{2}$ are three-dimensional column vectors. They are initialized to $[1~0~0]^{\top}$ and $[0~1~0]^{\top}$, respectively, corresponding precisely to the identity rotation matrix. The mapping from this 6D representation vector to an $\mathrm{SO}(3)$ matrix is defined as [74]:

$$f_{\text{V2M}}\left(\left[\begin{array}{cc}|&|\\ a_{1}&a_{2}\\ |&|\end{array}\right]\right)=\left[\begin{array}{ccc}|&|&|\\ b_{1}&b_{2}&b_{3}\\ |&|&|\end{array}\right], \tag{5}$$

$$b_{i}=\left[\left\{\begin{array}{ll}\mathcal{N}(a_{1})&\text{if}~i=1\\ \mathcal{N}(a_{2}-(b_{1}\cdot a_{2})b_{1})&\text{if}~i=2\\ b_{1}\times b_{2}&\text{if}~i=3\end{array}\right.\right]^{\top}, \tag{6}$$

where $\mathcal{N}(\cdot)$ denotes a normalization function, "$\cdot$" represents the inner product of two vectors, and "$\times$" represents the vector cross product. V2M in $f_{\text{V2M}}$ denotes the transform from a 6D vector to a rotation matrix.
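The mapping in Eqs. (5) and (6) is a Gram-Schmidt orthogonalization followed by a cross product. The NumPy sketch below is one way to write it, assuming that $\mathcal{N}(\cdot)$ is division by the Euclidean norm; the function names and the second test input are hypothetical.

```python
import numpy as np

def normalize(v):
    """N(.) in Eq. (6), assumed here to be division by the Euclidean norm."""
    return v / np.linalg.norm(v)

def rotation_6d_to_matrix(a1, a2):
    """Map the 6D representation [a1, a2] to a 3x3 rotation matrix."""
    b1 = normalize(a1)
    b2 = normalize(a2 - np.dot(b1, a2) * b1)  # remove the component along b1
    b3 = np.cross(b1, b2)                     # completes a right-handed frame
    return np.stack([b1, b2, b3], axis=1)     # b1, b2, b3 as columns

# The initialization described in the text yields the identity rotation.
R_identity = rotation_6d_to_matrix(np.array([1.0, 0.0, 0.0]),
                                   np.array([0.0, 1.0, 0.0]))

# An arbitrary (hypothetical) 6D input still yields a valid rotation.
R_general = rotation_6d_to_matrix(np.array([1.0, 2.0, 0.5]),
                                  np.array([-0.3, 1.0, 2.0]))
```

Any pair of non-parallel, non-zero vectors maps to a valid rotation, so a network can output the six numbers freely without an explicit constraint.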

Geometry feature learning. To capture the local geometric structure of the canonical scene, we regard the 3D Gaussians as a 3D point cloud, i.e., we only use the 3D coordinates of the 3D Gaussians. In order to handle a large number of points, we leverage a simple two-branch structure: the geometric branch learns local features of point clouds across different receptive fields, while the identity branch preserves the independent point-level features at high resolution. By integrating the geometric branch and identity branch, we can efficiently obtain point-level features at high resolution while embedding the local geometric information of the point cloud.

The geometric branch leverages the sparse convolution [28] on the sparse voxels to extract local geometric features at different receptive fields. Given the point cloud $\textbf{P}\in\mathbb{R}^{N\times 3}$, we first transform the high-resolution point clouds into low-resolution voxels by dividing the space through a fixed grid size $s$:

$$\textbf{V}=\operatorname{floor}(\textbf{P}/s), \tag{7}$$

where the size of $\textbf{V}$ is $M\times 3$ and $M$ is the number of voxels. Then, we construct a sparse 3D U-Net by stacking a set of sparse convolutions with skip connections. Taking $\textbf{V}$ as input, we apply the sparse 3D U-Net to aggregate local features (dubbed $\textbf{F}_{v}\in\mathbb{R}^{M\times C}$) of the point clouds.
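The voxelization in Eq. (7) can be sketched in NumPy as below. Merging points that fall into the same cell with `np.unique` is an assumption about how the $M$ occupied voxels arise from the $N$ points; a sparse convolution library would typically handle this step internally. The function name and example points are hypothetical.

```python
import numpy as np

def voxelize(points, grid_size):
    """Quantize N points into M occupied voxels.

    Returns the (M, 3) integer voxel coordinates and an (N,) index
    that maps every point to the voxel containing it.
    """
    coords = np.floor(points / grid_size).astype(np.int64)   # Eq. (7)
    voxels, point_to_voxel = np.unique(coords, axis=0, return_inverse=True)
    return voxels, point_to_voxel.reshape(-1)

# Hypothetical example: four points, two of which share a voxel.
points = np.array([[0.1, 0.2, 0.3],
                   [0.4, 0.1, 0.2],
                   [1.2, 0.0, 0.0],
                   [-0.1, 0.0, 0.0]])
voxels, point_to_voxel = voxelize(points, grid_size=0.5)
```

The returned index is what later allows voxel features to be copied back to the points inside each voxel.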

The identity branch uses a multi-layer perceptron (MLP) to map the 3D coordinates of the point cloud into the embedding space (dubbed $\textbf{F}_{p}\in\mathbb{R}^{N\times C}$) to maintain the independence of point features. To accurately characterize the local geometric structure of the canonical scene, we fuse the voxel features, which carry local information, onto the point features. Specifically, we transform the voxel features $\textbf{F}_{v}$ back to the corresponding points to obtain point-level features $\textbf{F}_{p}^{\prime}\in\mathbb{R}^{N\times C}$ by assigning each voxel feature to the points within that voxel. Finally, we concatenate $\textbf{F}_{p}^{\prime}$ and $\textbf{F}_{p}$ and apply an MLP layer to obtain the fused point-level feature:

$$\textbf{F}_{\text{fuse}}=\operatorname{MLP}(\operatorname{Concat}(\textbf{F}_{p}^{\prime},\textbf{F}_{p})). \tag{8}$$
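The fusion in Eq. (8) amounts to a gather, a concatenation, and a small network. In the sketch below, random arrays stand in for the outputs of the sparse 3D U-Net and the identity-branch MLP, and a single linear layer with ReLU stands in for the fusion MLP. The sizes, names, and layer choice are all illustrative assumptions.

```python
import numpy as np

def fuse_features(voxel_feats, point_to_voxel, point_feats, weight, bias):
    """Return F_fuse = MLP(Concat(F_p', F_p)) with a one-layer stand-in MLP."""
    gathered = voxel_feats[point_to_voxel]                     # F_p', shape (N, C)
    concat = np.concatenate([gathered, point_feats], axis=1)   # shape (N, 2C)
    return np.maximum(concat @ weight + bias, 0.0)             # linear layer + ReLU

rng = np.random.default_rng(0)
N, M, C = 4, 3, 8                                  # hypothetical sizes
point_to_voxel = np.array([1, 1, 2, 0])            # voxel index of each point
F_v = rng.normal(size=(M, C))                      # stand-in for U-Net output
F_p = rng.normal(size=(N, C))                      # stand-in for identity branch
W_fuse = rng.normal(size=(2 * C, C))
b_fuse = np.zeros(C)

F_fuse = fuse_features(F_v, point_to_voxel, F_p, W_fuse, b_fuse)
```

Points 0 and 1 lie in the same voxel and therefore receive the same voxel feature, while their identity-branch features keep them distinguishable after fusion.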

3.3 Deformation Field

In this section, we propose a deformation field that estimates the deformation of each 3D Gaussian in the canonical space based on a given time $t$.

Deformation estimation. We adopt an MLP as the decoder $\mathcal{G}_{\Phi}$, which takes as input the geometry feature learned from the geometric branch in the Gaussian canonical field, the position of each Gaussian, and the timestamp, and outputs the deformation of each Gaussian from canonical space to time $t$, including the position deformation $\Delta\mathbf{x_{t}}\in\mathbb{R}^{3}$, rotation deformation $\Delta\mathbf{r_{t}}\in\mathbb{R}^{6}$, and scale deformation $\Delta\mathbf{s_{t}}\in\mathbb{R}^{3}$:

$$\Delta\mathbf{x_t}, \Delta\mathbf{r_t}, \Delta\mathbf{s_t} = \mathcal{G}_{\Phi}(\mathbf{F}_{\text{fuse}}, \gamma(\mathbf{x}), \gamma(t)), \tag{9}$$

where $\gamma(\cdot)$ denotes the positional encoding in NeRF [33], which maps a one-dimensional signal from $\mathbb{R}$ into a higher-dimensional space $\mathbb{R}^{2L}$:

$$\gamma(p) = \left(\sin(2^{0}\pi p), \cos(2^{0}\pi p), \ldots, \sin(2^{L-1}\pi p), \cos(2^{L-1}\pi p)\right). \tag{10}$$
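As a concrete reading of Eq. (10), the following NumPy sketch applies the encoding to every coordinate of its input independently. The function name `positional_encoding` is hypothetical; the ordering of the sine and cosine terms follows Eq. (10).

```python
import numpy as np

def positional_encoding(p, num_freqs):
    """Apply Eq. (10) to each entry along the last axis of p.

    p: array of shape (..., D); returns an array of shape (..., D * 2 * num_freqs).
    """
    p = np.asarray(p, dtype=np.float64)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi               # 2^0 * pi, ..., 2^(L-1) * pi
    angles = p[..., None] * freqs                               # (..., D, L)
    enc = np.stack([np.sin(angles), np.cos(angles)], axis=-1)   # (..., D, L, 2): sin, cos interleaved
    return enc.reshape(*p.shape[:-1], -1)

enc_x = positional_encoding(np.zeros((4, 3)), num_freqs=10)  # 3D positions
enc_t = positional_encoding(np.zeros((4, 1)), num_freqs=6)   # scalar timestamps
```

With $L=10$ a 3D position becomes a 60-dimensional vector, and with $L=6$ a timestamp becomes a 12-dimensional vector.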

Note that we keep the color parameters $\mathbf{c}$ and the opacity $o$ of the canonical 3D Gaussian distributions constant over time. These two factors are closely tied to the physical properties of the Gaussian distributions, and we want each distribution to represent the same object area throughout the timeline.

Transformation. Using the deformation estimated for time $t$ above, we transform the 3D Gaussian distributions to the current time by

$$\begin{aligned}
\mathbf{x}_t &= \mathbf{x} + \Delta\mathbf{x_t}, \\
\mathbf{s}_t &= \mathbf{s} + \Delta\mathbf{s_t}, \\
\mathbf{r}_t &= f_{\text{V2M}}(\Delta\mathbf{r_t}) \times f_{\text{V2M}}(\mathbf{r}).
\end{aligned} \tag{11}$$
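The transformation in Eq. (11) can be illustrated as follows. This is a sketch under two assumptions that the text above does not confirm: $f_{\text{V2M}}$ is taken to be the Gram-Schmidt construction commonly used with 6D rotation vectors, and $\times$ is read as a matrix product. The names `v2m` and `transform` are hypothetical.

```python
import numpy as np

def v2m(r6):
    """Assumed f_V2M: turn a 6D vector into a rotation matrix via Gram-Schmidt."""
    a1, a2 = r6[..., :3], r6[..., 3:]
    b1 = a1 / np.linalg.norm(a1, axis=-1, keepdims=True)
    a2 = a2 - np.sum(b1 * a2, axis=-1, keepdims=True) * b1   # remove the component along b1
    b2 = a2 / np.linalg.norm(a2, axis=-1, keepdims=True)
    b3 = np.cross(b1, b2)                                    # completes a right-handed frame
    return np.stack([b1, b2, b3], axis=-1)                   # columns are b1, b2, b3

def transform(x, s, r6, dx, ds, dr6):
    """Eq. (11): move canonical Gaussians to time t."""
    x_t = x + dx
    s_t = s + ds
    r_t = v2m(dr6) @ v2m(r6)   # assumption: "x" in Eq. (11) is a matrix product
    return x_t, s_t, r_t

identity6 = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0])   # 6D vector that maps to the identity
rot = v2m(np.random.default_rng(0).normal(size=(5, 6)))
```

Under this construction any 6D input with linearly independent halves yields a valid rotation, so the decoder output needs no explicit normalization constraint.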

3.4 Rasterization

Once the attributes $(\mathbf{x}_t, \mathbf{c}, \mathbf{r}_t, \mathbf{s}_t, o)$ of each Gaussian are prepared, we use the differentiable tile rasterizer [22] to render the image from any desired viewpoint at this timestamp:

$$\hat{\mathbf{C}}_t = \operatorname{Rasterizer}(\mathbf{x}_t, \mathbf{c}, \mathbf{r}_t, \mathbf{s}_t, o, \mathbf{K}, [\mathbf{R}|\mathbf{T}]), \tag{12}$$

where $\mathbf{K}$ and $[\mathbf{R}|\mathbf{T}]$ represent the camera's intrinsic and extrinsic parameters, respectively.

3.5 Optimization

To optimize the model, we use a photometric loss and a motion loss, and we adapt the density control from 3D-GS [22] with our modifications.

Photometric loss. The photometric loss consists of the $L_1$ loss and the structural similarity loss $L_{D\text{-}SSIM}$ between the rendered image $\hat{\mathbf{C}}_t$ and the ground truth image $\mathbf{C}_t$:

$$L_{photo} = (1-\lambda) L_1 + \lambda L_{D\text{-}SSIM}. \tag{13}$$
Table 1: Quantitative comparison between our method and competing methods on the D-NeRF dataset. The best results are highlighted in bold.
| Method | Hell Warrior PSNR↑ | Hell Warrior SSIM↑ | Hell Warrior LPIPS↓ | Mutant PSNR↑ | Mutant SSIM↑ | Mutant LPIPS↓ | Hook PSNR↑ | Hook SSIM↑ | Hook LPIPS↓ | Bouncing Balls PSNR↑ | Bouncing Balls SSIM↑ | Bouncing Balls LPIPS↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3D-GS [22] | 15.3924 | 0.8776 | 0.1300 | 21.7554 | 0.9359 | 0.0575 | 18.6933 | 0.8733 | 0.1144 | 22.5575 | 0.9485 | 0.0647 |
| D-NeRF [40] | 25.0293 | 0.9506 | 0.0691 | 31.2900 | 0.9739 | 0.0268 | 29.2567 | 0.9650 | 0.1174 | 38.9300 | 0.9900 | 0.1031 |
| TiNeuVox-B [12] | 28.2058 | 0.9661 | 0.0631 | 33.9029 | 0.9771 | 0.0301 | 31.7929 | 0.9718 | 0.0436 | 40.8536 | 0.9913 | 0.0401 |
| NDVG [17] | 26.4933 | 0.9600 | 0.0670 | 34.4131 | 0.9801 | 0.0270 | 30.0009 | 0.9626 | 0.0463 | 37.5157 | 0.9874 | 0.0751 |
| FDNeRF [18] | 27.7120 | 0.9665 | 0.0508 | 34.9727 | 0.9810 | 0.0312 | 32.2867 | 0.9756 | 0.0388 | 40.0191 | 0.9912 | 0.0395 |
| 4D-GS [61] | 28.1196 | 0.9730 | 0.0276 | 38.3411 | 0.9936 | 0.0062 | 33.1560 | 0.9810 | 0.0168 | 40.7418 | 0.9941 | 0.0105 |
| Ours | 32.2712 | 0.9835 | 0.0164 | 41.4284 | 0.9969 | 0.0029 | 36.9647 | 0.9916 | 0.0076 | 43.5929 | 0.9960 | 0.0061 |

| Method | Lego PSNR↑ | Lego SSIM↑ | Lego LPIPS↓ | T-Rex PSNR↑ | T-Rex SSIM↑ | T-Rex LPIPS↓ | Stand Up PSNR↑ | Stand Up SSIM↑ | Stand Up LPIPS↓ | Jumping Jacks PSNR↑ | Jumping Jacks SSIM↑ | Jumping Jacks LPIPS↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3D-GS [22] | 23.0991 | 0.9329 | 0.0567 | 25.7496 | 0.9567 | 0.0474 | 19.3779 | 0.9200 | 0.0909 | 20.7163 | 0.9227 | 0.0980 |
| D-NeRF [40] | 21.6427 | 0.8394 | 0.1654 | 31.7568 | 0.9767 | 0.0396 | 32.7992 | 0.9818 | 0.0215 | 32.8031 | 0.9810 | 0.0373 |
| TiNeuVox-B [12] | 25.1748 | 0.9217 | 0.0689 | 32.7750 | 0.9783 | 0.0307 | 36.2031 | 0.9859 | 0.0199 | 34.7390 | 0.9823 | 0.0328 |
| NDVG [17] | 25.0416 | 0.9395 | 0.0534 | 32.6229 | 0.9781 | 0.0330 | 33.2158 | 0.9793 | 0.0302 | 31.2530 | 0.9737 | 0.0398 |
| FDNeRF [18] | 25.2700 | 0.9390 | 0.0460 | 30.7068 | 0.9731 | 0.0368 | 36.9107 | 0.9878 | 0.0188 | 33.5521 | 0.9812 | 0.0329 |
| 4D-GS [61] | 25.4024 | 0.9434 | 0.0377 | 33.3912 | 0.9869 | 0.0130 | 38.2610 | 0.9923 | 0.0071 | 35.6656 | 0.9882 | 0.0159 |
| Ours | 25.4411 | 0.9474 | 0.0329 | 39.0285 | 0.9952 | 0.0052 | 42.2101 | 0.9966 | 0.0028 | 37.9604 | 0.9928 | 0.0088 |
Table 2: Quantitative comparison between our method and competing methods on the HyperNeRF dataset. The best results are highlighted in bold.

| Method | Chicken PSNR↑ | Chicken MS-SSIM↑ | 3D Printer PSNR↑ | 3D Printer MS-SSIM↑ | Broom PSNR↑ | Broom MS-SSIM↑ | Peel Banana PSNR↑ | Peel Banana MS-SSIM↑ |
|---|---|---|---|---|---|---|---|---|
| TiNeuVox [12] | 28.2861 | 0.9474 | 22.7514 | 0.8392 | 21.2682 | 0.6832 | 24.5136 | 0.8743 |
| NDVG [17] | 27.0536 | 0.9390 | 22.4196 | 0.8389 | 21.4658 | 0.7028 | 22.8204 | 0.8279 |
| FDNeRF [18] | 27.9627 | 0.9438 | 22.8027 | 0.8453 | 21.9091 | 0.7154 | 24.2515 | 0.8645 |
| 3D-GS [22] | 20.8915 | 0.7426 | 18.3991 | 0.6114 | 20.3953 | 0.6598 | 20.5654 | 0.8094 |
| Ours | 28.5342 | 0.9331 | 22.0403 | 0.8098 | 20.8994 | 0.5241 | 25.5785 | 0.9067 |
Table 3: Quantitative comparison on HyperNeRF dataset: Average on Cut Lemon, Chicken, 3D Printer, and Split Cookie. The best results are highlighted in bold.
| Method | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| TiNeuVox-B [12] | 27.16 | 0.76 | 0.40 |
| 3D-GS [22] | 21.26 | 0.69 | 0.40 |
| 4D-GS [61] | 26.98 | 0.78 | 0.31 |
| Ours | 27.52 | 0.80 | 0.25 |

Regularization. We assume that, in a scene, the proportion of dynamic points is much smaller than that of static points, and that the motion amplitude of dynamic points is not too large. In other words, the points in a scene should remain as static as possible, which we encourage with a motion loss:

$$L_{motion} = \left\| \Delta\mathbf{x_t} \right\|_1. \tag{14}$$

Total loss. The total loss we used is defined as follows,

$$L = L_{photo} + \omega L_{motion}, \tag{15}$$

where $\omega$ is a trade-off parameter that balances the two terms.
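To make Eqs. (13) to (15) concrete, here is a minimal NumPy sketch of how the terms combine. Several choices are assumptions rather than details stated here: the SSIM value is passed in precomputed, $L_{D\text{-}SSIM}$ is taken as $1 - \text{SSIM}$ (the convention used in the 3D-GS code base), and the $L_1$ norm of $\Delta\mathbf{x_t}$ is averaged over Gaussians. The defaults $\lambda = 0.2$ and $\omega = 0.01$ are the values reported in the supplementary material; the name `total_loss` is hypothetical.

```python
import numpy as np

def total_loss(rendered, target, ssim_value, delta_x, lam=0.2, omega=0.01):
    """Combine the photometric loss (Eq. 13) and the motion loss (Eq. 14) as in Eq. (15)."""
    l1 = np.mean(np.abs(rendered - target))             # L_1 term
    d_ssim = 1.0 - ssim_value                           # assumption: L_D-SSIM = 1 - SSIM
    l_photo = (1.0 - lam) * l1 + lam * d_ssim           # Eq. (13)
    l_motion = np.abs(delta_x).sum(axis=-1).mean()      # assumption: per-Gaussian L1 norm, averaged
    return l_photo + omega * l_motion                   # Eq. (15)

img = np.zeros((8, 8, 3))
loss_static = total_loss(img, img, ssim_value=1.0, delta_x=np.zeros((2, 3)))
loss_moving = total_loss(img, img, ssim_value=1.0, delta_x=np.ones((2, 3)))
```

With a perfect rendering the only remaining cost comes from the motion term, which is what pushes points that need not move toward zero displacement.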

Density control.

Refer to caption
Figure 3: Our density control is designed for dynamic scenes. We control the densification of Gaussian distributions according to their transformed parameters at timestamp $t$ rather than their parameters in the canonical space.

3D-GS has shown that adaptive density control is essential for achieving high rendering quality. On the one hand, Gaussians need to populate empty areas that lack geometric features; for such under-reconstructed regions, 3D-GS simply creates a copy of the Gaussian. On the other hand, large Gaussians in regions with high variance need to be split into smaller ones. As in 3D-GS, we replace such Gaussians with two new ones, divide their scale by a factor of $\phi = 1.6$, and initialize their positions by using the original 3D Gaussian as a PDF for sampling.

Our method differs from 3D-GS in the following aspects. In 3D-GS, there is only a single set of Gaussians. In our case, however, we initialize the Gaussians in the canonical space, estimate their deformations, and transform their attributes into the space of a given timestamp. As shown in Fig. 3, we use the Gaussians at the current moment to render the image. Therefore, we decide whether a Gaussian needs density control based on its attributes (such as scale) at the current timestamp rather than its canonical attributes. Afterward, we invert the transformation to map the split/cloned Gaussians back to the canonical space.
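The idea of deciding densification in the time-$t$ space and then mapping the new Gaussians back can be sketched as below. This is a simplified illustration, not the authors' implementation: the function name, the `needs_densify` mask (standing in for the gradient-based criterion of 3D-GS), and `scale_threshold` are hypothetical; rotation is ignored when sampling; and each new Gaussian is assumed to inherit its parent's deformation so that the inverse transformation is a simple subtraction.

```python
import numpy as np

def densify_at_time_t(x, s, dx, ds, needs_densify, scale_threshold, phi=1.6, rng=None):
    """Choose split vs. clone from the transformed attributes, then return canonical attributes."""
    rng = np.random.default_rng() if rng is None else rng
    x_t, s_t = x + dx, s + ds                       # attributes at timestamp t (Eq. 11, no rotation)
    large = s_t.max(axis=1) > scale_threshold       # decision uses the time-t scale, not s
    split_mask = needs_densify & large
    clone_mask = needs_densify & ~large

    # Clone: copy the Gaussian at time t, then subtract the deformation again.
    clone_x = x_t[clone_mask] - dx[clone_mask]
    clone_s = s_t[clone_mask] - ds[clone_mask]

    # Split: two samples per parent, drawn from an axis-aligned Gaussian at time t.
    idx = np.repeat(np.flatnonzero(split_mask), 2)
    new_x_t = x_t[idx] + rng.normal(size=(idx.size, 3)) * s_t[idx]
    new_s_t = s_t[idx] / phi                        # shrink by phi = 1.6
    split_x = new_x_t - dx[idx]                     # assumed inverse transform to canonical space
    split_s = new_s_t - ds[idx]                     # a real system must keep scales positive
    return {"split_mask": split_mask, "clone_x": clone_x, "clone_s": clone_s,
            "split_x": split_x, "split_s": split_s}

x = np.zeros((2, 3))
s = np.array([[0.8, 0.8, 0.8], [0.1, 0.1, 0.1]])
dx = np.ones((2, 3))
ds = np.array([[0.8, 0.8, 0.8], [0.0, 0.0, 0.0]])
out = densify_at_time_t(x, s, dx, ds, np.array([True, True]), scale_threshold=1.0,
                        rng=np.random.default_rng(0))
```

In this toy case the first Gaussian has a canonical scale of 0.8, below the threshold, but a time-$t$ scale of 1.6, so it is split only because the decision looks at the transformed attributes.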

4 Experiments

4.1 Dataset

In this paper, we use both synthetic and real datasets to evaluate our method. The synthetic dataset D-NeRF [40] contains 8 dynamic scenes: Hell Warrior, Mutant, Hook, Bouncing Balls, Lego, T-Rex, Stand Up, and Jumping Jacks. The real dataset is the one proposed by HyperNeRF [37], including interp-cut-lemon, interp-cut-lemon1, vrig-chicken, vrig-3dprinter, misc-split-cookie, and misc-split-cookie. Following previous works [22], we report three evaluation metrics: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS) [72].
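Of the three metrics, PSNR is simple enough to state directly. The sketch below uses the standard definition $10 \log_{10}(\text{MAX}^2 / \text{MSE})$ and assumes images scaled to $[0, 1]$; the function name `psnr` is hypothetical, and the evaluation code used for the tables may differ in details such as per-image averaging.

```python
import numpy as np

def psnr(rendered, target, max_val=1.0):
    """Peak Signal-to-Noise Ratio in dB for images with values in [0, max_val]."""
    mse = np.mean((np.asarray(rendered) - np.asarray(target)) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

# A uniform error of 0.1 gives an MSE of 0.01.
example = psnr(np.full((4, 4), 0.1), np.zeros((4, 4)))
```

Because the scale is logarithmic, a gain of 3 dB corresponds to roughly halving the mean squared error.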

Refer to caption
Figure 4: Qualitative comparisons between baselines and our method on the synthetic dataset.
Refer to caption
Figure 5: Qualitative comparisons between baselines and our method on the HyperNeRF real dataset [37].

4.2 Implementation Details

Our implementation is based on 3D-GS [22]. We train for a total of 40000 iterations: the first 3000 iterations optimize only the static scene, after which the deformation field is added to optimize the dynamic scene. We use the Adam optimizer, and the learning rate of our network decays exponentially from 8e-4 to 1.6e-6. We use a 2-layer MLP with a width of 64 for the front point feature extraction and a 3-layer MLP with a width of 64 for the back point feature fusion. A 5-layer MLP with a width of 256 and a skip connection serves as the decoder. For the positional encoding, we use $L=10$ for position $\mathbf{x}$ and $L=6$ for timestamp $t$. For the D-NeRF dataset, which does not provide point clouds, we randomly initialize 150000 points. For the HyperNeRF dataset, we use the point cloud provided with the dataset as the initial point cloud. All experiments are run on a single RTX 4090 GPU.
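One plausible reading of the learning-rate schedule is a log-linear interpolation between the two endpoints, as sketched below. The exact form is an assumption: the text only gives the start and end values, and it does not say whether the decay starts at iteration 0 or after the 3000 static-only iterations. The function name `exp_decay_lr` is hypothetical.

```python
import math

def exp_decay_lr(step, lr_init=8e-4, lr_final=1.6e-6, max_steps=40000):
    """Exponential decay: interpolate linearly in log space between lr_init and lr_final."""
    t = min(max(step / max_steps, 0.0), 1.0)   # training progress clipped to [0, 1]
    return math.exp((1.0 - t) * math.log(lr_init) + t * math.log(lr_final))

lr_start = exp_decay_lr(0)
lr_mid = exp_decay_lr(20000)
lr_end = exp_decay_lr(40000)
```

Under this form the rate at the midpoint is the geometric mean of the two endpoints, about 3.6e-5.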

4.3 Quantitative Results

Synthetic scenes. We compare our method with recent state-of-the-art methods, including 3D-GS, D-NeRF, TiNeuVox, NDVG, FDNeRF, and 4D-GS, on the D-NeRF dataset. Table 1 lists the results for each scene. Our method is better than the other methods on all three metrics among physical canonical-based methods. On average, our method significantly improves PSNR compared with the static Gaussian baseline, 3D-GS; since 3D-GS inherently cannot model the deformation of a dynamic scene, it performs poorly in dynamic view synthesis. The computational costs are as follows: training time of around 2 h (averaged over the D-NeRF dataset), rendering speed of 12 FPS (fixed viewpoint), and model size of 34 MB (point cloud) + 14 MB (network).

Real scenes. We further compare our method with closely related works on the real-scene dataset proposed by [37]. Table 2 reports the detailed results on Chicken, 3D Printer, Broom, and Peel Banana, and Table 3 reports the average results over Cut Lemon, Chicken, 3D Printer, and Split Cookie. Our method achieves competitive performance compared with other state-of-the-art methods. Compared with synthetic datasets, real datasets are more challenging due to the narrow camera viewing range and pose ambiguity. The quantitative results demonstrate the effectiveness of the proposed method in real scenes.

4.4 Visualization Results

Visual comparison. In addition to quantitative results, we provide visualization results of different methods. For better comparison, we show the rendered images of each synthetic scene from the same viewpoint in Fig. 4. The images rendered by our method are closer to the ground truth images, indicating that our method can recover accurate and detailed images. In addition, we provide visualization results of the real scenes in Fig. 5. Compared with TiNeuVox [12], our method can recover the detailed structure of dynamic objects, such as the chicken and the banana.

Gaussian visualization. To verify the effectiveness of our method, we show the 3D point cloud formed by the 3D Gaussians, using only their 3D coordinates. As shown in Fig. 7, we provide the point clouds of different methods on the synthetic dataset, including 3D-GS [22], 4D-GS [61], and ours. Note that the color of the point cloud is generated from the 3D coordinates. Since 3D-GS cannot model dynamic scenes, the quality of its point cloud is poor. Comparing 4D-GS with ours, it can be observed that the point cloud of our method has a clear local geometric structure.

Refer to caption
Figure 6: Visualization of learned geometry-aware features.
Refer to caption
Figure 7: Visualization of learned Gaussians, colored with position coordinates.

4.5 Ablation Study

We conduct ablation studies on the synthetic dataset ($800 \times 800$) to verify the effectiveness of our proposed components. In Table 4, the vanilla model is a simple MLP model without our components.

Effect of geometric-aware features. To learn the geometric information of the object in our Gaussian canonical field, we voxelize the 3D Gaussian distributions and extract geometric-aware features using our 3D U-Net. To demonstrate the effectiveness of this design, we test our method with the geometric branch removed and leave the other components unchanged. In Table 4, ours full has a clear advantage over w/o geo. feat. (38.0134 vs. 37.5757 dB PSNR), and our geometry branch plays the most important role among the components studied in the ablations.

In Fig. 6, we visualize the learned geometric-aware features. We color the point clouds with the learned features, which show meaningful geometric information. Interestingly, there is an obvious difference in the learned features between the moving objects (the bucket of the Lego and the T-Rex body) and the static objects (the body of the Lego and the ground in T-Rex). Our geometric-aware features also reflect the local geometric structure. For example, the spines of the bones on the T-Rex tail have similar features, while the smooth part of the tail bones has other patterns.

Different geometric features. We also conduct experiments that replace our geometry branch with a PointNet-like architecture and with plane projection (2D CNN). Compared with these results (denoted "PointNet feat." and "Plane feat." in Table 4), our method achieves PSNR gains of about 1.28 dB and 2.11 dB, respectively.

Table 4: Ablation study in terms of average PSNR, SSIM, and LPIPS. The best results are highlighted in bold.

| Method | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| w/o geo. feat. | 37.5757 | 0.9841 | 0.0173 |
| w/o 6D rotation | 37.8750 | 0.9851 | 0.0154 |
| canonical DC | 37.8026 | 0.9847 | 0.0166 |
| vanilla | 35.2307 | 0.9793 | 0.0242 |
| PointNet feat. | 36.7353 | 0.9826 | 0.0184 |
| Plane feat. | 35.9054 | 0.9811 | 0.0212 |
| ours full | 38.0134 | 0.9853 | 0.0153 |

6D representation. To study the effect of the 6D representation of the rotation parameters of the 3D Gaussians, we conduct an experiment that replaces the 6D vector with the quaternion $\mathbf{q}$ used in the original 3D-GS. To deform the 3D Gaussians in canonical space, the deformation field estimates $\Delta\mathbf{q_t}$ and obtains $\mathbf{q_t} = \mathbf{q} + \Delta\mathbf{q_t}$ using the quaternion add operation. In Table 4, the quaternion variant (w/o 6D rotation) shows a performance drop (37.8750 vs. 38.0134 dB PSNR), which supports the effectiveness of the 6D representation.

Density control. For density control, we test a setting that uses only the 3D Gaussians in canonical space, without considering the transformed 3D Gaussians at other timestamps. In Table 4, canonical DC shows a performance drop, as the canonical 3D Gaussians alone cannot reflect the over/under-reconstruction information at all timestamps for dynamic scenes.

5 Conclusion

In this paper, we have proposed a 3D geometry-aware Gaussian Splatting solution for dynamic view synthesis. We addressed the limitations of existing approaches from two perspectives: 1) we introduced 3D sparse convolution to extract local structural information effectively and efficiently for deformation learning, and 2) we represented dynamic scenes as a collection of deforming 3D Gaussian distributions, which are optimized to deform (move, rotate, and scale) over time. Experimental results on synthetic and real datasets demonstrate the superiority of our solution in dynamic view synthesis and 3D reconstruction. We plan to further investigate explicit motion modeling by exploiting foreground and background motion segmentation cues.

Acknowledgments

We thank the area chairs and the reviewers for their insightful and positive feedback. We also appreciate the reference provided by Ziyi Yang’s work. This work was supported in part by the National Science Fund of China (Grant Nos. 62271410, 62306238) and the Fundamental Research Funds for the Central Universities.

Supplementary Material

This supplementary material provides additional implementation details and experimental results. First, we provide the implementation details of our proposed method. Then, we provide additional experimental results in the form of visualization and discuss the limitations and impacts of our method. We conclude with discussions on future work. The source code, network model, and results will be released.

6 Implementation Details

6.1 Loss Function

We apply the photometric loss and regularization for our optimization:

$$L_{total} = L_{photo} + \omega L_{motion}, \tag{16}$$
$$L_{photo} = (1-\lambda) L_{rgb} + \lambda L_{D\text{-}SSIM}, \tag{17}$$

where $L_{rgb}$ is the $L_1$ loss and $L_{D\text{-}SSIM}$ is the structural similarity loss between the rendered image $\hat{\mathbf{C}}_t$ and the ground truth image $\mathbf{C}_t$. Generally, within a dynamic scene, the proportion of dynamic points is much smaller than that of static points, and the motion amplitude of dynamic points is not too large. We exploit this fact by introducing the motion regularization term $L_{motion} = \left\| \Delta\mathbf{x_t} \right\|_1$. In our experiments, we set $\lambda = 0.2$ and $\omega = 0.01$.

6.2 Network Architecture

Here, we introduce the network architecture adopted in our method. The Gaussian Canonical Field consists of two branches: the geometric branch and the identity branch. As shown in Fig. 8, the geometric branch takes the positions of voxel points as input and outputs the geometric features $f_{geo}$. It is composed of three main parts, namely DownVoxelBlock, ResidualBlock, and UpVoxelBlock, whose specific structures are shown in Fig. 9. For the identity branch, we use a simple MLP to obtain the embedding features $f_{identity}$, which maintains the independence of point features. We then concatenate the features from the geometric branch and the identity branch and pass them through another MLP to obtain the fused features $\mathbf{F}_{\text{fuse}}$. Finally, we feed the fused features $\mathbf{F}_{\text{fuse}}$, the encoded Gaussian positions $\gamma(x)$, and the encoded time $\gamma(t)$ into a decoder to obtain the deformations of position, rotation, and scale from the canonical space to the time space. Fig. 11 shows the specific structure of the MLPs. The intermediate hidden layers are shown in blue, and the number inside each block indicates the vector's dimension. All layers are standard fully-connected layers, and black arrows between layers indicate ReLU activations. $\gamma(\cdot)$ is the positional encoding function; we use $L=10$ for position and $L=6$ for timestamp. Similar to NeRF [33], we use a skip connection that concatenates the input to the third layer.
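The decoder described above can be sketched as a plain NumPy forward pass. This is an illustration under assumptions, not the architecture of Fig. 11: the fused feature width (64), the interpretation of "5 layers" as five hidden layers followed by a linear output head, the weight initialization, and the names `init_decoder` and `decode` are all hypothetical. The 12 outputs are split into 3 (position), 6 (rotation), and 3 (scale).

```python
import numpy as np

def init_decoder(in_dim, width=256, depth=5, skip_at=2, out_dim=12, seed=0):
    """Random weights for an MLP whose third hidden layer also receives the input."""
    rng = np.random.default_rng(seed)
    layers, d = [], in_dim
    for i in range(depth):
        if i == skip_at:
            d += in_dim                                   # skip connection widens this layer's input
        layers.append((rng.normal(0.0, 0.01, (d, width)), np.zeros(width)))
        d = width
    head = (rng.normal(0.0, 0.01, (width, out_dim)), np.zeros(out_dim))
    return layers, head, skip_at

def decode(decoder, f_fuse, enc_x, enc_t):
    """Return (delta_x, delta_r, delta_s) for each Gaussian."""
    layers, head, skip_at = decoder
    inp = np.concatenate([f_fuse, enc_x, enc_t], axis=-1)
    h = inp
    for i, (w, b) in enumerate(layers):
        if i == skip_at:
            h = np.concatenate([h, inp], axis=-1)         # re-inject the input at the third layer
        h = np.maximum(h @ w + b, 0.0)                    # fully-connected layer + ReLU
    out = h @ head[0] + head[1]
    return out[..., :3], out[..., 3:9], out[..., 9:12]

n = 5
f_fuse = np.zeros((n, 64))   # assumed feature width
enc_x = np.zeros((n, 60))    # gamma(x) with L = 10
enc_t = np.zeros((n, 12))    # gamma(t) with L = 6
decoder = init_decoder(in_dim=64 + 60 + 12)
delta_x, delta_r, delta_s = decode(decoder, f_fuse, enc_x, enc_t)
```

Re-injecting the encoded input midway, as in NeRF, keeps the high-frequency position and time signals available to the deeper layers.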

Refer to caption
Figure 8: Overall architecture of the geometric branch, which captures local geometric features using a 3D U-Net.
Refer to caption
Figure 9: Detailed structure of DownVoxelBlock, ResidualBlock, and UpVoxelBlock.
Refer to caption (rows: Lego, Peel Banana; columns: iteration = 0, 3000, 12000, 30000, 40000)
Figure 10: Visualization of Canonical Point Cloud. We show the evolution of point clouds in the canonical space with respect to the number of iterations.
Refer to caption
Figure 11: Detailed structure of MLPs we have used in our method.

7 Results and Discussions

7.1 Results on Neural 3D Video dataset

We further evaluate our method on the Neural 3D Video dataset [25], which includes several videos captured with a synchronized, fixed GoPro camera system. We evaluate on four scenes: Cook Spinach, Cut Roast Beef, Flame Steak, and Sear Steak. Each scene includes 17 to 20 cameras for training and one central camera for evaluation. Following previous works, we downsample the images to $1352 \times 1014$ and report the per-scene PSNR, SSIM, and LPIPS for each method in Table 5. We find that our method struggles on these long sequences: although it maintains high-fidelity reconstruction in static regions, its capability is severely limited in dynamic regions.

Table 5: Quantitative results on scenes from the Neural 3D Video Synthesis dataset.

| Method | Cook Spinach PSNR↑ | Cook Spinach SSIM↑ | Cook Spinach LPIPS↓ | Cut Roast Beef PSNR↑ | Cut Roast Beef SSIM↑ | Cut Roast Beef LPIPS↓ | Flame Steak PSNR↑ | Flame Steak SSIM↑ | Flame Steak LPIPS↓ | Sear Steak PSNR↑ | Sear Steak SSIM↑ | Sear Steak LPIPS↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MixVoxels [59] | 31.39 | 0.931 | 0.113 | 31.38 | 0.928 | 0.111 | 30.15 | 0.938 | 0.108 | 30.85 | 0.940 | 0.103 |
| K-Planes [14] | 31.23 | 0.926 | 0.114 | 31.87 | 0.928 | 0.114 | 31.49 | 0.940 | 0.102 | 30.28 | 0.937 | 0.104 |
| Hexplanes [7] | 31.05 | 0.928 | 0.114 | 30.83 | 0.927 | 0.115 | 30.42 | 0.939 | 0.104 | 30.00 | 0.939 | 0.105 |
| Hyperreel [2] | 31.77 | 0.932 | 0.090 | 32.25 | 0.936 | 0.086 | 31.48 | 0.939 | 0.083 | 31.88 | 0.942 | 0.080 |
| NeRFPlayer [48] | 30.58 | 0.929 | 0.113 | 29.35 | 0.908 | 0.144 | 31.93 | 0.950 | 0.088 | 29.13 | 0.908 | 0.138 |
| StreamRF [24] | 30.89 | 0.914 | 0.162 | 30.75 | 0.917 | 0.154 | 31.37 | 0.923 | 0.152 | 31.60 | 0.925 | 0.147 |
| SWAGS [46] | 31.96 | 0.946 | 0.094 | 31.84 | 0.945 | 0.099 | 32.18 | 0.953 | 0.087 | 32.21 | 0.950 | 0.092 |
| Ours | 31.39 | 0.947 | 0.144 | 29.87 | 0.944 | 0.156 | 31.35 | 0.954 | 0.129 | 32.62 | 0.955 | 0.130 |

7.2 More Visualization Results

Point Cloud. For the D-NeRF synthetic scenes [40], we randomly initialize 150000 points as the initial point cloud. We visualize the point cloud of the scene in the canonical space at different iterations. Fig. 10 shows that we can reconstruct the scene even from a random point cloud. Moreover, in complex scenes such as Peel Banana in the HyperNeRF dataset [37], we can reconstruct the scene even if the input point cloud contains no dynamic parts, as also shown in Fig. 10. Our supplementary video presents the trajectory of the scene's point cloud as it evolves over time and is available on our homepage: https://npucvr.github.io/GaGS/.

Refer to caption (rows: Sear Steak, Flame Steak, Cook Spinach; columns: $p_0$, $p_1$, $p_2$, $p_3$, $p_4$)
Figure 12: Results on Neu3DV dataset.

Qualitative Results. We show more qualitative comparisons in Fig. 13 and Fig. 14 for the D-NeRF synthetic dataset [40] and the HyperNeRF dataset [37]. In our supplementary video, we also showcase the temporal interpolation capability of our method, keeping the camera viewpoint fixed while time evolves. Additionally, we demonstrate novel view synthesis, keeping the time fixed and observing the scene from arbitrary viewpoints.

Temporal Interpolation. We show the temporal interpolation ability of our method. In Fig. 15 and Fig. 16, we fix the camera viewpoint and show the results over time on the D-NeRF synthetic dataset [40] and the HyperNeRF dataset [37]. Our method shows strong temporal interpolation ability on both synthetic and real datasets. More results are presented on our homepage.

7.3 Limitations and Impacts

Limitations. First, our method represents the deformation of Gaussians from the canonical space to the time space. It can therefore only track a point that exists in the scene from start to finish, and it cannot depict a point that abruptly emerges or disappears at a specific moment. Second, our method essentially describes the motion and deformation of points in the canonical space, which requires precise camera poses to be acquired in advance. In dynamic scene modeling, however, obtaining accurate camera poses is inherently challenging, and our approach is constrained by this limitation. Last, our method struggles to describe excessively complex motions and long videos, such as rapid movements of objects within the scene. In these cases the network has difficulty estimating point motions, which ultimately leads to failures; Fig. 12 shows some cases from the test camera on the Neu3DV dataset [25]. Due to the lack of explicit motion modeling, our method has insufficient capability to capture fine-grained movements over long temporal sequences. However, it can still describe general motions, such as the swinging of curtains and human body movements.

Broader Impacts Our proposed method can be applied to various industries, including visual effects synthesis in the film industry, game modeling, autonomous driving simulation, and more. For the film industry and game modeling, dynamic scenes can be synthesized by our method. In autonomous driving simulation, our proposed method can provide more data from different viewpoints, which will contribute to the advancement of autonomous driving.

7.4 Future Work

In the future, we plan to exploit a motion mask to distinguish the dynamic points from the static points of the scene, which will reduce the computational cost by estimating the deformation of dynamic points only. We will also investigate explicit motion modeling by exploiting foreground and background motion segmentation cues.
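As a rough illustration of the motion-mask idea, the sketch below evaluates a deformation only for points flagged as dynamic and passes static points through unchanged. This is an assumption about how such a mask could be used, not an implemented part of our method; `deform_dynamic_only`, `toy_shift`, and the boolean mask are hypothetical names introduced here.

```python
def deform_dynamic_only(canonical_points, is_dynamic, deform_fn, t):
    """Apply deform_fn only to points flagged as dynamic.

    Returns the deformed points and the number of deformation evaluations,
    which equals the number of dynamic points.
    """
    if len(canonical_points) != len(is_dynamic):
        raise ValueError("mask and points must have the same length")
    deformed = []
    num_evaluated = 0
    for point, dynamic in zip(canonical_points, is_dynamic):
        if dynamic:
            deformed.append(deform_fn(point, t))
            num_evaluated += 1
        else:
            # Static points keep their canonical position at every time step.
            deformed.append(point)
    return deformed, num_evaluated


def toy_shift(point, t):
    """Hypothetical deformation: move the point along y by t."""
    x, y, z = point
    return (x, y + t, z)


points = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (2.0, 2.0, 2.0)]
mask = [True, False, True]
deformed, num_evaluated = deform_dynamic_only(points, mask, toy_shift, 0.5)
```

In this toy case only two of the three points are passed to the deformation function, which is where the saving would come from when most of the scene is static.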

Figure 13: Qualitative comparison on the D-NeRF synthetic dataset. We show synthesized images on the D-NeRF synthetic dataset of our method and other competing methods. Columns: GT, Ours, NDVG [17], 4D-GS [61], FDNeRF [18], TiNeuVox-B [12]. Rows: Hell Warrior, Hook, Jumping Jacks, Lego, Mutant, Stand Up, T-Rex, Bouncing Balls.
Figure 14: Qualitative comparison on the HyperNeRF dataset. We show synthesized images on the HyperNeRF dataset of our method and other competing methods. Columns: GT, Ours, NDVG [17], 3D-GS [22], FDNeRF [18], TiNeuVox-B [12]. Rows: Broom, Peel Banana, Chicken, 3D Printer.
Figure 15: Temporal Interpolation Capability on the D-NeRF synthetic dataset. We show the temporal interpolation capabilities of our method. Specifically, we showcase our ability to perform time interpolation by maintaining a fixed camera viewpoint while observing the temporal changes in scene content. Columns: $t_0$, $t_1$, $t_2$, $t_3$, $t_4$, $t_5$. Rows: Bouncing Balls, Hell Warrior, Hook, Jumping Jacks, Lego, Mutant, Stand Up, T-Rex.
Figure 16: Temporal Interpolation Capability on the HyperNeRF dataset. We show the temporal interpolation capabilities of our method. Specifically, we showcase our ability to perform time interpolation by maintaining a fixed camera viewpoint while observing the temporal changes in scene content. Columns: $t_0$, $t_1$, $t_2$, $t_3$. Rows: Peel Banana, Chicken, Split Cookie, Cut Lemon.

References

  • Aliev et al. [2020] Kara-Ali Aliev, Artem Sevastopolsky, Maria Kolos, Dmitry Ulyanov, and Victor Lempitsky. Neural point-based graphics. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
  • Attal et al. [2023] Benjamin Attal, Jia-Bin Huang, Christian Richardt, Michael Zollhoefer, Johannes Kopf, Matthew O’Toole, and Changil Kim. HyperReel: High-fidelity 6-dof video with ray-conditioned sampling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • Barron et al. [2021] Jonathan T Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P Srinivasan. Mip-NeRF: A multiscale representation for anti-aliasing neural radiance fields. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2021.
  • Barron et al. [2023] Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman. Zip-NeRF: Anti-aliased grid-based neural radiance fields. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2023.
  • Boss et al. [2021] Mark Boss, Raphael Braun, Varun Jampani, Jonathan T Barron, Ce Liu, and Hendrik Lensch. NeRD: Neural reflectance decomposition from image collections. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2021.
  • Buehler et al. [2001] Chris Buehler, Michael Bosse, Leonard McMillan, Steven Gortler, and Michael Cohen. Unstructured lumigraph rendering. In Proceedings of the Conference on Computer Graphics and Interactive Techniques, 2001.
  • Cao and Johnson [2023] Ang Cao and Justin Johnson. HexPlane: A fast representation for dynamic scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • Chen et al. [2021] Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. MVSNeRF: Fast generalizable radiance field reconstruction from multi-view stereo. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2021.
  • Chen and Williams [1993] Shenchang Eric Chen and Lance Williams. View interpolation for image synthesis. In Proceedings of the Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), 1993.
  • Choi et al. [2019] Inchang Choi, Orazio Gallo, Alejandro Troccoli, Min H Kim, and Jan Kautz. Extreme view synthesis. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019.
  • Du et al. [2021] Yilun Du, Yinan Zhang, Hong-Xing Yu, Joshua B Tenenbaum, and Jiajun Wu. Neural radiance flow for 4D view synthesis and video processing. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2021.
  • Fang et al. [2022] Jiemin Fang, Taoran Yi, Xinggang Wang, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Matthias Nießner, and Qi Tian. Fast dynamic radiance fields with time-aware neural voxels. In Proceedings of the Conference on Computer Graphics and Interactive Techniques in Asia (SIGGRAPH ASIA), 2022.
  • Flynn et al. [2016] John Flynn, Ivan Neulander, James Philbin, and Noah Snavely. DeepStereo: Learning to predict new views from the world’s imagery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • Fridovich-Keil et al. [2023] Sara Fridovich-Keil, Giacomo Meanti, Frederik Rahbæk Warburg, Benjamin Recht, and Angjoo Kanazawa. K-Planes: Explicit radiance fields in space, time, and appearance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • Gao et al. [2021] Chen Gao, Ayush Saraf, Johannes Kopf, and Jia-Bin Huang. Dynamic view synthesis from dynamic monocular video. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2021.
  • Greene [1986] Ned Greene. Environment mapping and other applications of world projections. IEEE Computer Graphics and Applications, 1986.
  • Guo et al. [2022] Xiang Guo, Guanying Chen, Yuchao Dai, Xiaoqing Ye, Jiadai Sun, Xiao Tan, and Errui Ding. Neural deformable voxel grid for fast optimization of dynamic view synthesis. In Proceedings of the Asian Conference on Computer Vision (ACCV), 2022.
  • Guo et al. [2023] Xiang Guo, Jiadai Sun, Yuchao Dai, Guanying Chen, Xiaoqing Ye, Xiao Tan, Errui Ding, Yumeng Zhang, and Jingdong Wang. Forward flow for novel view synthesis of dynamic scenes. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2023.
  • Hedman et al. [2018] Peter Hedman, Julien Philip, True Price, Jan-Michael Frahm, George Drettakis, and Gabriel Brostow. Deep blending for free-viewpoint image-based rendering. ACM Transactions on Graphics (TOG), 2018.
  • Hu et al. [2023] Wenbo Hu, Yuling Wang, Lin Ma, Bangbang Yang, Lin Gao, Xiao Liu, and Yuewen Ma. Tri-MipRF: Tri-mip representation for efficient anti-aliasing neural radiance fields. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2023.
  • Kalantari et al. [2016] Nima Khademi Kalantari, Ting-Chun Wang, and Ravi Ramamoorthi. Learning-based view synthesis for light field cameras. ACM Transactions on Graphics (TOG), 2016.
  • Kerbl et al. [2023] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (TOG), 2023.
  • Levoy and Hanrahan [1996] Marc Levoy and Pat Hanrahan. Light field rendering. In Proceedings of the Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), 1996.
  • Li et al. [2022a] Lingzhi Li, Zhen Shen, Zhongshu Wang, Li Shen, and Ping Tan. Streaming radiance fields for 3D video synthesis. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2022a.
  • Li et al. [2022b] Tianye Li, Mira Slavcheva, Michael Zollhoefer, Simon Green, Christoph Lassner, Changil Kim, Tanner Schmidt, Steven Lovegrove, Michael Goesele, Richard Newcombe, et al. Neural 3D video synthesis from multi-view video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022b.
  • Li et al. [2021] Zhengqi Li, Simon Niklaus, Noah Snavely, and Oliver Wang. Neural scene flow fields for space-time view synthesis of dynamic scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • Lindell et al. [2021] David B Lindell, Julien NP Martel, and Gordon Wetzstein. AutoInt: Automatic integration for fast neural volume rendering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • Liu et al. [2015] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • Liu et al. [2020] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • Liu et al. [2023] Yu-Lun Liu, Chen Gao, Andreas Meuleman, Hung-Yu Tseng, Ayush Saraf, Changil Kim, Yung-Yu Chuang, Johannes Kopf, and Jia-Bin Huang. Robust dynamic radiance fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • Lombardi et al. [2021] Stephen Lombardi, Tomas Simon, Gabriel Schwartz, Michael Zollhoefer, Yaser Sheikh, and Jason Saragih. Mixture of volumetric primitives for efficient neural rendering. ACM Transactions on Graphics (TOG), 2021.
  • Martin-Brualla et al. [2021] Ricardo Martin-Brualla, Noha Radwan, Mehdi SM Sajjadi, Jonathan T Barron, Alexey Dosovitskiy, and Daniel Duckworth. NeRF in the Wild: Neural radiance fields for unconstrained photo collections. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • Mildenhall et al. [2020] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
  • Müller et al. [2022] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (TOG), 2022.
  • Neff et al. [2021] Thomas Neff, Pascal Stadlbauer, Mathias Parger, Andreas Kurz, Joerg H. Mueller, Chakravarty R. Alla Chaitanya, Anton S. Kaplanyan, and Markus Steinberger. DONeRF: Towards real-time rendering of compact neural radiance fields using depth oracle networks. Computer Graphics Forum (CGF), 2021.
  • Park et al. [2021a] Keunhong Park, Utkarsh Sinha, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2021a.
  • Park et al. [2021b] Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Ricardo Martin-Brualla, and Steven M Seitz. HyperNeRF: A higher-dimensional representation for topologically varying neural radiance fields. ACM Transactions on Graphics (TOG), 2021b.
  • Penner and Zhang [2017] Eric Penner and Li Zhang. Soft 3D reconstruction for view synthesis. ACM Transactions on Graphics (TOG), 2017.
  • Piala and Clark [2021] Martin Piala and Ronald Clark. TermiNeRF: Ray termination prediction for efficient neural rendering. In Proceedings of the International Conference on 3D Vision (3DV), 2021.
  • Pumarola et al. [2021] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-NeRF: Neural radiance fields for dynamic scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • Rebain et al. [2021] Daniel Rebain, Wei Jiang, Soroosh Yazdani, Ke Li, Kwang Moo Yi, and Andrea Tagliasacchi. DeRF: Decomposed radiance fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • Reiser et al. [2021] Christian Reiser, Songyou Peng, Yiyi Liao, and Andreas Geiger. KiloNeRF: Speeding up neural radiance fields with thousands of tiny MLPs. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2021.
  • Rematas et al. [2022] Konstantinos Rematas, Andrew Liu, Pratul P. Srinivasan, Jonathan T. Barron, Andrea Tagliasacchi, Thomas Funkhouser, and Vittorio Ferrari. Urban radiance fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • Riegler and Koltun [2020] Gernot Riegler and Vladlen Koltun. Free view synthesis. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
  • Riegler and Koltun [2021] Gernot Riegler and Vladlen Koltun. Stable view synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • Shaw et al. [2023] Richard Shaw, Jifei Song, Arthur Moreau, Michal Nazarczuk, Sibi Catley-Chandar, Helisa Dhamo, and Eduardo Perez-Pellitero. SWAGS: Sampling windows adaptively for dynamic 3D Gaussian splatting. arXiv preprint arXiv:2312.13308, 2023.
  • Shum and Kang [2000] Harry Shum and Sing Bing Kang. Review of image-based rendering techniques. In Visual Communications and Image Processing (VCIP), 2000.
  • Song et al. [2023] Liangchen Song, Anpei Chen, Zhong Li, Zhang Chen, Lele Chen, Junsong Yuan, Yi Xu, and Andreas Geiger. NeRFPlayer: A streamable dynamic scene representation with decomposed neural radiance fields. IEEE Transactions on Visualization and Computer Graphics (TVCG), 2023.
  • Srinivasan et al. [2021] Pratul P Srinivasan, Boyang Deng, Xiuming Zhang, Matthew Tancik, Ben Mildenhall, and Jonathan T Barron. NeRV: Neural reflectance and visibility fields for relighting and view synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • Sun et al. [2022] Cheng Sun, Min Sun, and Hwann-Tzong Chen. Direct Voxel Grid Optimization: Super-fast convergence for radiance fields reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • Tancik et al. [2022] Matthew Tancik, Vincent Casser, Xinchen Yan, Sabeek Pradhan, Ben Mildenhall, Pratul P Srinivasan, Jonathan T Barron, and Henrik Kretzschmar. Block-NeRF: Scalable large scene neural view synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • Tewari et al. [2020] Ayush Tewari, Ohad Fried, Justus Thies, Vincent Sitzmann, Stephen Lombardi, Kalyan Sunkavalli, Ricardo Martin-Brualla, Tomas Simon, Jason Saragih, Matthias Nießner, et al. State of the art on neural rendering. In Computer Graphics Forum (CGF), 2020.
  • Tewari et al. [2021] Ayush Tewari, O Fried, J Thies, V Sitzmann, S Lombardi, Z Xu, T Simon, M Nießner, E Tretschk, L Liu, et al. Advances in neural rendering. In Proceedings of the Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), 2021.
  • Thies et al. [2019] Justus Thies, Michael Zollhöfer, and Matthias Nießner. Deferred neural rendering: Image synthesis using neural textures. ACM Transactions on Graphics (TOG), 2019.
  • Tretschk et al. [2021] Edgar Tretschk, Ayush Tewari, Vladislav Golyanik, Michael Zollhöfer, Christoph Lassner, and Christian Theobalt. Non-rigid neural radiance fields: Reconstruction and novel view synthesis of a dynamic scene from monocular video. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2021.
  • Trevithick and Yang [2021] Alex Trevithick and Bo Yang. GRF: Learning a general radiance field for 3D representation and rendering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2021.
  • Turki et al. [2023] Haithem Turki, Jason Y Zhang, Francesco Ferroni, and Deva Ramanan. SUDS: Scalable urban dynamic scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • Wang et al. [2021a] Chaoyang Wang, Ben Eckart, Simon Lucey, and Orazio Gallo. Neural trajectory fields for dynamic novel view synthesis. arXiv preprint arXiv:2105.05994, 2021a.
  • Wang et al. [2023] Feng Wang, Sinan Tan, Xinghang Li, Zeyue Tian, Yafei Song, and Huaping Liu. Mixed neural voxels for fast multi-view video synthesis. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2023.
  • Wang et al. [2021b] Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P Srinivasan, Howard Zhou, Jonathan T Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. IBRNet: Learning multi-view image-based rendering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021b.
  • Wu et al. [2023] Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4D Gaussian splatting for real-time dynamic scene rendering. arXiv preprint arXiv:2310.08528, 2023.
  • Xian et al. [2021] Wenqi Xian, Jia-Bin Huang, Johannes Kopf, and Changil Kim. Space-time neural irradiance fields for free-viewpoint video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • Xiangli et al. [2022] Yuanbo Xiangli, Linning Xu, Xingang Pan, Nanxuan Zhao, Anyi Rao, Christian Theobalt, Bo Dai, and Dahua Lin. BungeeNeRF: Progressive neural radiance field for extreme multi-scale scene rendering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • Xu et al. [2019] Zexiang Xu, Sai Bi, Kalyan Sunkavalli, Sunil Hadap, Hao Su, and Ravi Ramamoorthi. Deep view synthesis from sparse photometric images. ACM Transactions on Graphics (TOG), 2019.
  • Yang et al. [2022] Wenqi Yang, Guanying Chen, Chaofeng Chen, Zhenfang Chen, and Kwan-Yee K Wong. S3-NeRF: Neural reflectance field from shading and shadow under a single viewpoint. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2022.
  • Yang et al. [2023] Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3D Gaussians for high-fidelity monocular dynamic scene reconstruction. arXiv preprint arXiv:2309.13101, 2023.
  • Yoon et al. [2020] Jae Shin Yoon, Kihwan Kim, Orazio Gallo, Hyun Soo Park, and Jan Kautz. Novel view synthesis of dynamic scenes with globally coherent depths from a monocular camera. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • Yu et al. [2021a] Alex Yu, Sara Fridovich-Keil, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021a.
  • Yu et al. [2021b] Alex Yu, Ruilong Li, Matthew Tancik, Hao Li, Ren Ng, and Angjoo Kanazawa. Plenoctrees for real-time rendering of neural radiance fields. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2021b.
  • Yu et al. [2021c] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelNeRF: Neural radiance fields from one or few images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021c.
  • Zhang et al. [2020] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. NeRF++: Analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492, 2020.
  • Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • Zhang et al. [2021] Xiuming Zhang, Pratul P Srinivasan, Boyang Deng, Paul Debevec, William T Freeman, and Jonathan T Barron. NeRFactor: Neural factorization of shape and reflectance under an unknown illumination. ACM Transactions on Graphics (TOG), 2021.
  • Zhou et al. [2019] Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • Zwicker et al. [2001] M. Zwicker, H. Pfister, J. van Baar, and M. Gross. EWA volume splatting. In Proceedings of IEEE Visualization (VIS), 2001.