Figure 1. Our method achieves high-quality novel-view synthesis given a challenging monocular video as input. In contrast, other Gaussian representations arrive at poor local minima, while NeRF methods are on-par but exhibit slow rendering, poor tracking, and lack an editable and compositional structure.

Dynamic Gaussian Marbles for Novel View Synthesis of Casual Monocular Videos

Colton Stearns (ORCID 0000-0002-3297-2870), Stanford University, Stanford, CA 94305, USA, coltongs@stanford.edu
Adam Harley (ORCID 0000-0002-9851-4645), Stanford University, Stanford, CA 94305, USA, aharley@cs.stanford.edu
Mikaela Uy (ORCID 0009-0009-4917-7724), Stanford University, Stanford, CA 94305, USA, mikacuy@stanford.edu
Florian Dubost (ORCID 0000-0002-7035-2680), Google, Mountain View, USA, fdubost@google.com
Federico Tombari (ORCID 0000-0001-5598-5212), Google, Zurich, Switzerland, tombari@google.com
Gordon Wetzstein (ORCID 0000-0002-9243-6885), Stanford University, Stanford, CA 94305, USA, gordon.wetzstein@stanford.edu
Leonidas Guibas (ORCID 0000-0002-8315-4886), Stanford University, Stanford, CA 94305, USA
Abstract.

Gaussian splatting has become a popular representation for novel-view synthesis, exhibiting clear strengths in efficiency, photometric quality, and compositional editability. Following its success, many works have extended Gaussians to 4D, showing that dynamic Gaussians maintain these benefits while also tracking scene geometry far better than alternative representations. Yet, these methods assume dense multi-view videos as supervision, constraining their use to controlled capture settings. In this work, we are interested in extending the capability of Gaussian scene representations to casually captured monocular videos. We show that existing 4D Gaussian methods dramatically fail in this setup because the monocular setting is underconstrained. Building off this finding, we propose Dynamic Gaussian Marbles (DGMarbles), consisting of three core modifications that target the difficulties of the monocular setting. First, DGMarbles uses isotropic Gaussian “marbles”, reducing the degrees of freedom of each Gaussian, and constraining the optimization to focus on motion and appearance over local shape. Second, DGMarbles employs a hierarchical divide-and-conquer learning strategy to efficiently guide the optimization towards solutions with globally coherent motion. Finally, DGMarbles adds image-level and geometry-level priors into the optimization, including a tracking loss that takes advantage of recent progress in point tracking. By constraining the optimization in these ways, DGMarbles learns Gaussian trajectories that enable novel-view rendering and accurately capture the 3D motion of the scene elements. We evaluate on the (monocular) Nvidia Dynamic Scenes dataset and the DyCheck iPhone dataset, and show that DGMarbles significantly outperforms other Gaussian baselines in quality, and is on par with non-Gaussian representations, all while maintaining the efficiency, compositionality, editability, and tracking benefits of Gaussians.

Gaussian splatting, neural rendering, novel view synthesis, inverse graphics, video editing
Submission ID: 1022. Journal: TOG. Copyright: none.

1. Introduction

It is very challenging to convert everyday monocular videos of dynamic scenes into reconstructions which are renderable from alternative viewpoints. Doing so seems to require extracting 3D geometry, motion, and radiance, all from pixels alone. Achieving this in a robust way would greatly extend current capabilities in video production, 3D content creation, virtual reality, and synthetic data generation, as well as advance computer vision.

In recent years, the research community has made tremendous progress in building renderable 3D representations from multi-view captures. For instance, Gaussian Splatting (Kerbl et al., 2023) has emerged as a leading solution for novel-view synthesis of static scenes. By representing the 3D space with a collection of 3D Gaussians and “splatting” these onto the image plane, Gaussian Splatting achieves high-quality photometric reconstruction and efficient rendering. Another useful feature of the Gaussian representation is that it is compositional: a scene can be edited by, for example, moving (or removing) the Gaussians that make up an object. Many works have since extended Gaussian Splatting to the 4D setting, allowing dynamic scenes to be reconstructed such that 3D content is tracked and rendered with impressive accuracy (Luiten et al., 2024; Huang et al., 2023; Duisterhof et al., 2023; Lin et al., 2023). Yet, while impressive, these methods apply only to settings with multiple simultaneous viewpoints of the scene (i.e., a multi-camera setup), which limits their use to purpose-built capture environments.

In this work, we are interested in using Gaussians for simple, casual, monocular captures, where a single camera is being moved smoothly about a dynamic scene (e.g., by a human). Our core finding is that while current methods for dynamic or 4D Gaussians are highly underconstrained in the absence of multi-view information, we can recover similar constraints using off-the-shelf methods for estimating depth and motion, along with standard geometry-based regularizations on scene structure. We demonstrate our findings through a method we call DGMarbles.

Compared to related Gaussian-based representations of dynamic scenes, DGMarbles contributes changes to the core representation, the learning strategy, and the objective function, with the aim of guiding the optimization process to arrive at solutions which reasonably generalize to novel views. First, DGMarbles removes the anisotropic nature of typical Gaussians, and simply uses isotropic “marbles”. We find Gaussian marbles are a better choice for the underconstrained monocular setting. Second, we employ a divide-and-conquer learning algorithm. Intuitively, we divide a long video into subsequences and optimize each one independently, and then merge pairs of subsequences until we reach a desired temporal horizon. This strategy takes advantage of the fact that it is easier to solve for motion and geometry within shorter time horizons, and converts long-sequence tracking into a task of gluing together neighboring subsequences. Third, we make use of freely available priors in both image space and 3D space. In the image plane, we use the off-the-shelf models SegmentAnything (Kirillov et al., 2023; Yang et al., 2023a), CoTracker (Karaev et al., 2023), and DepthAnything (Yang et al., 2024a), and guide our 3D representation according to these 2D cues. In 3D space, we regularize Gaussian trajectories with geometric priors, including local isometry, global isometry, depth total variation, and Chamfer distance.

We show that DGMarbles greatly outperforms other dynamic Gaussian methods in the casual monocular setting. Specifically, we evaluate on the Nvidia Dynamic Scenes dataset and DyCheck iPhone dataset, which we modify into strictly-monocular datasets. Furthermore, we show that we are on-par with NeRF-based methods, while retaining our key advantages over them, namely efficient rendering, tracking, and editability.

Figure 2. DGMarbles overview. At training time (left), we take as input a video, and optimize a Gaussian-based reconstruction of these data. We begin by initializing a set of Gaussians for each frame, and subsequently employ a bottom-up divide-and-conquer strategy to merge sets of Gaussians, by attributing increasingly long motion trajectories to them. Motion estimation is achieved by optimizing a rendering loss (i.e., color reconstruction), a tracking loss (i.e., Gaussians should move similarly to point tracks), and geometry-based losses (e.g., the scene surface should move locally rigidly). At the end of training (right), each Gaussian has a multi-frame trajectory, and we can render the set of Gaussians at any timestep and any viewpoint.

2. Related Work

2.1. Gaussian Splatting

Gaussians have long been an attractive representation for modeling the surfaces of 3D scenes (Blinn, 1982), thanks to their efficiency, interpretability, and compositionality. The key idea is to represent a scene using a set of anisotropic Gaussians, equipped with opacity and color attributes, enabling not only color rendering but a variety of applications in both graphics and computer vision, such as scene editing and pose estimation (Keselman and Hebert, 2022, 2023). Gaussian scene representations have received great attention in the past year, in particular due to 3D Gaussian Splatting (3DGS) (Kerbl et al., 2023), which differentiably “splats” the Gaussians onto the image plane, with a very efficient GPU implementation. Many works have explored the advantages of 3DGS in greater depth, including its compositionality (Ye et al., 2023; Yu et al., 2023), speed (Lee et al., 2023a; Morgenstern et al., 2023; Niedermayr et al., 2023) and quality (Lee et al., 2024; Zhang et al., 2024; Feng et al., 2024), and it has also been adapted into many downstream applications such as pose estimation (Fan et al., 2024), SLAM (Matsuki et al., 2024), semantic scene understanding (Zhou et al., 2023; Cen et al., 2023; Ye et al., 2023), human avatar animation (Li et al., 2024; Qian et al., 2024), text-to-3D (Tang et al., 2023; Yi et al., 2024), and more.

2.2. Gaussians for Dynamic Scenes

Many works have begun extending the 3DGS representation to the 4D domain, aiming to solve the challenge of dynamic scene reconstruction. These works largely differ from one another in how they represent and learn motion.

One popular direction is to model motion as a set of per-Gaussian 3D trajectories through time (Luiten et al., 2024; Sun et al., 2024; Duisterhof et al., 2023), and to learn motion by sequentially optimizing for per-Gaussian offsets into the next frame. Notably, these methods have shown impressive tracking. Other works extend this motion representation, and push for a more compact set of trajectories via sparse control points (Huang et al., 2023) or an explicit motion basis (Das et al., 2023; Katsumata et al., 2023; Lin et al., 2023; Li et al., 2023a; Yu et al., 2023). Similar to these works, we represent motion as 3D trajectories through time. However, we greatly differ in how we learn motion, as we use a divide-and-conquer learning strategy as well as various unique 2D and 3D priors.

Another line of work (Wu et al., 2023; Yang et al., 2023b; Liang et al., 2023; Guo et al., 2024) defines motion as a time-conditioned deformation network that warps a canonical set of Gaussians into each timeframe. While a shared global deformation network is an efficient and compact representation of motion, learning the appropriate deformations is challenging; in particular, the deformation network may collapse into a local minimum while jointly optimizing across all timeframes. Lastly, a few works (Duan et al., 2024; Yang et al., 2024b) directly model Gaussians that extend across space and time, i.e., Gaussians with means and covariances in 4D.

Although previous methods present different motion representations, they largely address the same multi-camera setting. In contrast, our work tackles the more challenging monocular setting. We note that there are concurrent works (Das et al., 2023; Katsumata et al., 2023) that tackle the pseudo-monocular setting and showcase results on datasets with “teleporting” cameras or large amounts of effective multi-view information; please refer to DyCheck (Gao et al., 2022) for a thorough overview of this phenomenon. In contrast, our approach is intended for any casual monocular video.

2.3. Other Neural Scene Representations for Dynamic Scenes

Many earlier works have explored different neural scene representations for dynamic scenes. One family of works extends neural radiance fields (NeRFs) (Mildenhall et al., 2020) to 4D (Ramasinghe et al., 2024; Wang et al., 2021a; Song et al., 2023; Li et al., 2021; Gao et al., 2021, 2022; Cao and Johnson, 2023; Fridovich-Keil et al., 2023; Bui et al., 2023; Jang and Kim, 2022), treating time as a fourth dimension and as an additional coordinate in the neural field. Another approach is to combine a “canonical” 3D NeRF with a time-conditioned deformation field (Liu et al., 2023; Fang et al., 2022; Park et al., 2021a, b; Johnson et al., 2023; Kirschstein et al., 2023; Wang et al., 2023; Tretschk et al., 2020). The deformation field can help disentangle motion and geometry, resulting in a more constrained and better-behaved scene, especially in the monocular case. A few works have explored NeRF-based representations for the casual monocular setting: DynIBaR and MonoNeRF (Li et al., 2023b; Tian et al., 2023) showed compelling results by combining NeRF with image-based rendering (Wang et al., 2021b), and Wang et al. (2024) used diffusion to regularize a 4D NeRF (Cao and Johnson, 2023; Fridovich-Keil et al., 2023). Finally, some concurrent works explore alternate plane-based and feed-forward representations for the casual monocular setting (Zhao et al., 2024; Lee et al., 2023b).

In contrast to these representations, dynamic Gaussians have advantages in tracking, fast rendering, and compositional editability.

3. Preliminaries

3.1. 3D Gaussian Splatting

3D Gaussian splatting (Kerbl et al., 2023) is a differentiable rendering pipeline that represents a scene as a collection of 3D Gaussians and “splats” them onto the image plane. Concretely, a 3D scene is represented by 3D Gaussians, $\mathcal{G}$, with each Gaussian parameterized by its mean $\mu\in\mathbb{R}^{3}$, rotation $R\in\mathbb{R}^{3\times 3}$, scale $s\in\mathbb{R}^{3}$, color $c\in\mathbb{R}^{3}$, and opacity $\alpha\in\mathbb{R}$. Importantly, the scale and rotation can be composed into a 3D covariance matrix, $\Sigma=RSS^{T}R^{T}$, where $S$ is the $3\times 3$ diagonal scaling matrix.
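As a minimal NumPy sketch of this composition, the covariance can be assembled from a rotation and a per-axis scale vector as shown here. The function name and the example values are illustrative assumptions and are not taken from the 3DGS implementation.

```python
import numpy as np

def covariance_from_scale_rotation(R: np.ndarray, s: np.ndarray) -> np.ndarray:
    """Compose Sigma = R S S^T R^T from a 3x3 rotation R and per-axis scales s."""
    S = np.diag(s)  # 3x3 diagonal scaling matrix
    return R @ S @ S.T @ R.T

# Illustrative example: a 90-degree rotation about the z-axis with distinct scales.
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
s = np.array([1.0, 2.0, 3.0])
Sigma = covariance_from_scale_rotation(R, s)

# With R = I and equal scales, the covariance reduces to s^2 * I (a sphere).
Sigma_iso = covariance_from_scale_rotation(np.eye(3), np.full(3, 0.5))
```

The rotation only reorients the squared scales, so the result is always symmetric and positive semi-definite, which is why this parameterization is convenient for gradient-based optimization.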

Given 3D Gaussians $\mathcal{G}$ and a camera viewing transformation $W$, the covariance matrix in camera coordinates can be computed as $\Sigma^{\prime}=JW\Sigma W^{T}J^{T}$, where $J$ is the Jacobian of the approximately-affine projective transformation. Given camera-aligned Gaussians, the pipeline executes a highly-efficient differentiable tile-based rasterization of Gaussians. Specifically, the image is divided into $16\times 16$ tiles, and for each tile, all influencing Gaussians are depth-sorted and alpha-composited in the image plane. In contrast to volumetric rendering approaches (Mildenhall et al., 2020), Gaussian Splatting is extremely efficient, and often renders over $100$ times faster than its volumetric counterpart.
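The projection of a covariance into the image plane can be sketched as follows. This assumes a pinhole camera with focal lengths `fx` and `fy`, takes $W$ to be the 3x3 rotation part of the viewing transformation, and uses the standard 2x3 Jacobian of perspective projection evaluated at the Gaussian's camera-space mean. All names here are illustrative.

```python
import numpy as np

def project_covariance(Sigma, W, t_cam, fx, fy):
    """Return the 2x2 image-plane covariance Sigma' = J W Sigma W^T J^T.

    Sigma : (3, 3) world-space covariance
    W     : (3, 3) rotation part of the world-to-camera transform
    t_cam : (3,)   Gaussian mean expressed in camera coordinates
    fx, fy: focal lengths in pixels (assumed pinhole model)
    """
    x, y, z = t_cam
    # Jacobian of (x, y, z) -> (fx * x / z, fy * y / z) at the Gaussian mean.
    J = np.array([[fx / z, 0.0,    -fx * x / z**2],
                  [0.0,    fy / z, -fy * y / z**2]])
    return J @ W @ Sigma @ W.T @ J.T

# Illustrative example: a small sphere on the optical axis, 2 units from the camera.
Sigma2d = project_covariance(0.01 * np.eye(3), np.eye(3),
                             np.array([0.0, 0.0, 2.0]), fx=100.0, fy=100.0)
```

For a Gaussian on the optical axis the off-diagonal Jacobian terms vanish, so a sphere projects to a circle whose variance scales with $(f/z)^2$.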

4. Method

Figure 3. We visualize novel view synthesis of standard anisotropic Gaussians and our Gaussian Marbles after training on a single monocular image (along with depth estimated from this image) for 100K iterations. While the training view is near perfect for both, anisotropic Gaussians lead to undesirable artifacts in the novel view, whereas Gaussian Marbles generalize well.

We provide an overview of DGMarbles in Figure 2. We take as input a casually captured monocular video (i.e., a sequence of images captured by a single camera traversing a dynamic scene). We begin by initializing a set of Gaussian “marbles” for each frame. We consider these initial marbles to have trajectories of length 1. We next seek to merge these disjoint sets of short-trajectory marbles into much longer trajectories. We use a bottom-up divide-and-conquer merging strategy as depicted in Figure 4: we take two temporally adjacent marble sets, and merge them into a single set of marbles with trajectories of doubled length, and iterate this until we have fewer sets of marbles with much longer trajectories. Each iteration of the merging stage involves a short optimization, where we use rendering losses, tracking losses, and geometric regularizations, to guide the marble sets into correspondence. At inference, we use the learned Gaussian trajectories to render into any timestep.

4.1. Dynamic Gaussian Marbles

4.1.1. Definition

Following Kerbl et al. (2023), our scene representation is a set of Gaussians, $\mathcal{G}$. Different from the original formulation, our Gaussians are isotropic: each Gaussian’s orientation is the identity matrix (i.e., $R=\mathbf{I}$), and the scale can be written as a scalar value (i.e., $s\in\mathbb{R}^{1}$). To emphasize their spherical shape, we use the name Gaussian marbles. We assign each Gaussian marble to a semantic instance (where instances are provided by an off-the-shelf image segmentation method, described later), denoted as $y\in\mathbb{N}$. Finally, to make each Gaussian marble “dynamic”, we equip it with a trajectory, represented as a sequence of translations mapping from its initial position $\mu$ to its position at every other timestep. We denote the sequence of translations over a $T$ frame sequence as $\Delta\mathbf{X}\in\mathbb{R}^{T\times 3}$.
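One way to picture this definition is as a small record type. The sketch below is only an illustration: the class name, field names, and the 0-based frame index are assumptions, not the authors' data layout.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianMarble:
    mu: np.ndarray       # (3,) initial position
    scale: float         # single isotropic scale s
    color: np.ndarray    # (3,) RGB color c
    opacity: float       # opacity alpha
    instance: int        # segmentation instance id y
    delta_x: np.ndarray  # (T, 3) per-frame translations, i.e. Delta X

    def position(self, t: int) -> np.ndarray:
        """Position at frame index t (0-based here): mu plus the t-th translation."""
        return self.mu + self.delta_x[t]

    def covariance(self) -> np.ndarray:
        """Isotropic covariance: with R = I and a scalar scale, Sigma = s^2 * I."""
        return (self.scale ** 2) * np.eye(3)

# Illustrative marble with a two-frame trajectory.
marble = GaussianMarble(
    mu=np.array([1.0, 2.0, 3.0]), scale=0.2, color=np.array([0.8, 0.1, 0.1]),
    opacity=0.9, instance=4,
    delta_x=np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0]]),
)
```

Compared with an anisotropic Gaussian, the shape is described by one number instead of a rotation plus three scales, so most of the free parameters sit in the trajectory.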

4.1.2. Why Isotropic Marbles?

While anisotropic Gaussians are far more expressive, we find that the extra degrees of freedom are poorly suited for the underconstrained monocular setting. We refer to Figure 3 as a simple illustration of this phenomenon. In this toy experiment, we train anisotropic Gaussians and our Gaussian Marbles on a single monocular image for 100K iterations. As observed, anisotropic Gaussians fit the training image in a manner that does not generalize to new views, leading to obvious visual artifacts. In contrast, the simpler marbles generalize to novel views.

Figure 4. Our divide and conquer learning algorithm iteratively estimates motion between pairs of Gaussian sets, merges the sets, and performs a global adjustment on the Gaussian marbles within the merged sets.

4.2. Divide-and-Conquer Motion Estimation

4.2.1. Overview

Our learning strategy divides the input video into short subsequences, and then optimizes the joining of these subsequences, rather than attempting to optimize the full video at once. For a subsequence containing frames $i$ to $j$ inclusive, we denote a corresponding set of Gaussian marbles as $\mathcal{G}_{ij}$, meaning that $\mathcal{G}_{ij}$ only contains trajectories that travel across the timespan between frames $i$ and $j$. As outlined in Figure 4, the learning algorithm consists of three iterative stages: motion estimation, merging, and global adjustment.

4.2.2. Initializing Gaussian Marbles

We initialize a distinct set of Gaussian marbles per frame, yielding a sequence of Gaussian sets $[\mathcal{G}_{11},\mathcal{G}_{22},...,\mathcal{G}_{TT}]$. As mentioned earlier, $\mathcal{G}_{ij}$ denotes that each Gaussian trajectory only covers the subsequence of frames $i\to j$; thus, our initial $\mathcal{G}_{ii}$ trivially contain trajectories of length 1 (i.e., coordinates for one timestep).

We achieve this initialization as follows. For each frame of the video, we obtain a monocular (or LiDAR) depth map as well as off-the-shelf temporally-consistent segmentations from the SAM-driven TrackAnything model (Kirillov et al., 2023; Yang et al., 2023a). We then unproject the depth map into a point cloud, and perform outlier removal and downsampling. Then, for each point coordinate $p$, we initialize a Gaussian marble with mean $\mu=p$, color $c$ as the pixel color, instance class $y$ as the segmentation prediction, and we follow the original protocol (Kerbl et al., 2023) to initialize scales and opacities. Finally, we initialize the sequence of translations $\Delta\mathbf{X}=[\mathbf{0}]$, i.e., as a length-1 sequence of zero translation.
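The unprojection step can be sketched as follows, assuming a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`. The regular `stride` subsampling and the positive-depth filter are simple stand-ins for the downsampling and outlier removal mentioned above, whose exact form is not specified here.

```python
import numpy as np

def unproject_depth(depth, fx, fy, cx, cy, stride=1):
    """Lift a depth map (H, W) to camera-space 3D points, one per sampled pixel."""
    H, W = depth.shape
    # Pixel grid: v indexes rows (y), u indexes columns (x).
    v, u = np.mgrid[0:H:stride, 0:W:stride]
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    # Stand-in for outlier removal: keep only points with valid (positive) depth.
    return pts[pts[:, 2] > 0]

# Illustrative example: a flat 4x6 depth map at depth 2.
depth = np.full((4, 6), 2.0)
points = unproject_depth(depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
points_sparse = unproject_depth(depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0, stride=2)
```

Each surviving point would then seed one marble, with the pixel color and segmentation label read from the same pixel and a single zero translation as its trajectory.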

4.2.3. Motion Estimation Phase

While training on a video with $T$ frames, we will always have a list of Gaussian Marble sets:

$$\mathbf{G}=[\mathcal{G}_{1K},\;\mathcal{G}_{(K+1)(2K)},\;\mathcal{G}_{(2K+1)(3K)},\;...,\;\mathcal{G}_{(cK+1)T}]$$

with each set of Gaussian marbles covering a length $K$ subsequence. To reduce notation and create a simple working example, we will proceed in this section using $K=2$ and $T=8$, giving us the sequence $\mathbf{G}=[\mathcal{G}_{12},\;\mathcal{G}_{34},\;\mathcal{G}_{56},\;\mathcal{G}_{78}]$.

In the motion estimation phase, we begin by forming pairs of adjacent Gaussian marble sets, i.e., $[(\mathcal{G}^{a}_{12},\;\mathcal{G}^{b}_{34}),\;(\mathcal{G}^{a}_{56},\;\mathcal{G}^{b}_{78})]$, where $a$ and $b$ denote whether a set appears earlier or later than its partner. Our goal is to learn a mapping for every Gaussian in $\mathcal{G}^{a}$ into every frame covered by $\mathcal{G}^{b}$, and vice versa. To learn these motions, we will render $\mathcal{G}^{a}$ into frames covered by $\mathcal{G}^{b}$ and apply the gradient update only to the Gaussian trajectories in $\mathcal{G}^{a}$, and vice versa.
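The bookkeeping for one level of this scheme can be sketched with frame intervals alone. Both helper names are hypothetical, and leaving an unpaired trailing set untouched is an assumption about how uneven lengths might be handled.

```python
def partition_frames(T: int, K: int) -> list[tuple[int, int]]:
    """Split frames 1..T into consecutive intervals of length K (the last may be shorter)."""
    return [(start, min(start + K - 1, T)) for start in range(1, T + 1, K)]

def pair_adjacent(sets: list[tuple[int, int]]) -> list[tuple[tuple[int, int], tuple[int, int]]]:
    """Form (earlier, later) pairs of neighboring intervals; a trailing odd set stays unpaired."""
    return [(sets[k], sets[k + 1]) for k in range(0, len(sets) - 1, 2)]

# The working example from the text: K = 2 and T = 8.
sets = partition_frames(8, 2)
pairs = pair_adjacent(sets)
```

In each pair, the first interval plays the role of the earlier set (superscript a) and the second the later set (superscript b).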

More concretely, in the case of $\mathcal{G}^{a}_{12}$, we start by extending the trajectories of the marbles, using a constant-velocity assumption: this results in an expanded trajectory, i.e.,

$$\Delta\mathbf{X}=[\Delta\mathbf{x}_{1},\Delta\mathbf{x}_{2}]\;\;\to\;\;\Delta\mathbf{X}=[\Delta\mathbf{x}_{1},\Delta\mathbf{x}_{2},\Delta\mathbf{x}_{3}^{\text{init}}]$$

We then render $\mathcal{G}^{a}_{12}$ into frame $3$, and compute our optimization objectives (to be described in the next subsection), and backpropagate gradient updates into $\Delta\mathbf{x}_{3}$, i.e., the translation into frame 3. We end this optimization after a fixed number of iterations, $\eta$. We repeat this for each missing frame in the sequence, until we have a trajectory that covers all frames in $\mathcal{G}^{b}$.
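A plausible reading of the constant-velocity initialization is linear extrapolation from the last two translations, sketched below. The exact rule is not spelled out in the text, so both the extrapolation formula and the zero-velocity fallback for length-1 trajectories are assumptions.

```python
import numpy as np

def extend_constant_velocity(delta_x: np.ndarray) -> np.ndarray:
    """Append one initial translation to a (T, 3) trajectory by linear extrapolation."""
    if len(delta_x) >= 2:
        # Next offset = last offset + (last offset - previous offset).
        nxt = 2.0 * delta_x[-1] - delta_x[-2]
    else:
        # A length-1 trajectory has no velocity estimate; assume the marble stays put.
        nxt = delta_x[-1].copy()
    return np.vstack([delta_x, nxt[None]])

# Illustrative example: a marble that moved 0.1 along x between frames 1 and 2.
traj = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]])
traj_ext = extend_constant_velocity(traj)
single_ext = extend_constant_velocity(np.zeros((1, 3)))
```

The appended row is only a starting point; the subsequent optimization for the fixed number of iterations would refine it. Extending the later set backward in time would mirror the same idea on the first two rows.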

4.2.4. Merging

The result of motion estimation is that we have two sets of Gaussian marbles which reconstruct the same subsequence. In other words, each pair $(\mathcal{G}^{a}_{ij},\mathcal{G}^{b}_{ij})$ covers the same interval $[i,j]$. Because they cover the same subsequence, we can trivially merge the pair by taking the union of all the Gaussian marbles, $\mathcal{G}_{ij}=\mathcal{G}^{a}_{ij}\cup\mathcal{G}^{b}_{ij}$, yielding a set twice the size of the original sets. To avoid excessive computational burden, we drop Gaussians of low opacity and small scale, and additionally perform random downsampling, to keep the set size constant.
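The merge can be sketched on marble sets stored as dictionaries of arrays. The thresholds `min_opacity` and `min_scale` are hypothetical values, and treating a marble as droppable when either quantity falls below its threshold is one possible reading of the pruning rule.

```python
import numpy as np

def merge_and_downsample(set_a, set_b, target_size,
                         min_opacity=0.05, min_scale=1e-3, seed=0):
    """Union two marble sets, prune weak marbles, then subsample to target_size."""
    # Union: concatenate every per-marble attribute along the marble axis.
    merged = {k: np.concatenate([set_a[k], set_b[k]], axis=0) for k in set_a}
    # Prune marbles with low opacity or small scale (thresholds are assumed).
    keep = (merged["opacity"] >= min_opacity) & (merged["scale"] >= min_scale)
    idx = np.flatnonzero(keep)
    # Random downsampling so the set size stays constant across merges.
    if len(idx) > target_size:
        rng = np.random.default_rng(seed)
        idx = np.sort(rng.choice(idx, size=target_size, replace=False))
    return {k: v[idx] for k, v in merged.items()}

# Illustrative example: two sets of three marbles with 4-frame trajectories.
set_a = {"opacity": np.array([0.9, 0.01, 0.8]), "scale": np.array([0.1, 0.1, 0.1]),
         "delta_x": np.zeros((3, 4, 3))}
set_b = {"opacity": np.array([0.7, 0.6, 0.5]), "scale": np.array([0.1, 1e-5, 0.1]),
         "delta_x": np.zeros((3, 4, 3))}
merged = merge_and_downsample(set_a, set_b, target_size=3)
```

Because both inputs already carry trajectories over the same interval, concatenation is all the union requires.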

4.2.5. Global Adjustment Phase

After merging sets of Gaussian marbles, there is no guarantee that the new resulting set still satisfies our optimization objectives. Thus, we jointly optimize all Gaussian properties of the newly merged set. Specifically, for the new $\mathcal{G}_{ij}$, we repeatedly randomly sample a frame within $[i,j]$, and render all Gaussians into this frame, and we backpropagate gradient updates to Gaussian colors, scales, opacities, and trajectory offsets. We repeat this global adjustment for $\beta$ iterations.

4.2.6. Why divide and conquer?

Our divide-and-conquer learning strategy guides the underconstrained optimization problem toward more realistic solutions. In particular, the motion estimation benefits from the locality and smoothness of adding a single additional frame at a time, similar to Dynamic 3D Gaussians (Luiten et al., 2024), while the global adjustment phase contributes global coherence, similar to 4D Gaussians (Wu et al., 2023). By alternating between the two phases, we aim to get the best of both worlds.
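To make the alternation concrete, the sketch below lists which frame intervals exist after each round of pairwise merging. It tracks intervals only and performs no optimization. Running until a single set spans the whole video, and carrying an unpaired trailing set forward unchanged, are both assumptions; the method may instead stop at a shorter temporal horizon.

```python
def merge_levels(T: int) -> list[list[tuple[int, int]]]:
    """Return the list of frame intervals present after each round of pairwise merging."""
    sets = [(t, t) for t in range(1, T + 1)]  # one length-1 set per frame
    levels = [sets]
    while len(sets) > 1:
        # Each round: motion estimation between neighbors, merge, then global adjustment.
        merged = [(sets[k][0], sets[k + 1][1]) for k in range(0, len(sets) - 1, 2)]
        if len(sets) % 2 == 1:
            merged.append(sets[-1])  # carry an unpaired trailing set forward
        sets = merged
        levels.append(sets)
    return levels

# For T = 8, trajectory lengths double each round: 1 -> 2 -> 4 -> 8.
levels = merge_levels(8)
```

Under these assumptions the number of rounds grows logarithmically with the video length, and every round only has to bridge the boundary between two already-consistent neighbors.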

Reported: mPSNR ↑ / LPIPS ↓

DyCheck iPhone, without camera pose

| Method | Apple | Block | Spin | Paper Windmill | Space-Out | Teddy | Wheel | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Dyn. Gaussians | 7.96 / 0.775 | 7.13 / 0.737 | 9.15 / 0.635 | 6.732 / 0.736 | 7.42 / 0.698 | 7.75 / 0.709 | 7.03 / 0.641 | 7.60 / 0.704 |
| 4D Gaussians | 14.44 / 0.716 | 12.30 / 0.706 | 12.77 / 0.697 | 14.46 / 0.790 | 14.93 / 0.640 | 11.86 / 0.729 | 10.99 / 0.803 | 13.11 / 0.726 |
| DGMarbles (ours) | 16.28 / 0.460 | 15.76 / 0.353 | 17.38 / 0.370 | 14.94 / 0.420 | 15.41 / 0.410 | 13.20 / 0.433 | 13.36 / 0.403 | 15.19 / 0.407 |

DyCheck iPhone, with camera pose

| Method | Apple | Block | Spin | Paper Windmill | Space-Out | Teddy | Wheel | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Dyn. Gaussians | 7.65 / 0.766 | 7.55 / 0.684 | 8.08 / 0.651 | 6.24 / 0.729 | 6.79 / 0.733 | 7.41 / 0.690 | 7.28 / 0.593 | 7.29 / 0.692 |
| 4D Gaussians | 15.41 / 0.450 | 11.28 / 0.633 | 14.42 / 0.339 | 15.60 / 0.297 | 14.60 / 0.372 | 12.36 / 0.466 | 11.79 / 0.436 | 13.64 / 0.428 |
| DGMarbles (ours) | 17.57 / 0.463 | 16.88 / 0.427 | 15.49 / 0.412 | 18.67 / 0.392 | 15.99 / 0.446 | 13.57 / 0.547 | 14.04 / 0.367 | 16.03 / 0.436 |

Nvidia Dynamic Scenes

| Method | Balloon1 | Balloon2 | Jumping | Playground | Skating | Truck | Umbrella | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Dyn. Gaussians | 8.68 / 0.660 | 13.70 / 0.375 | 11.11 / 0.592 | 11.91 / 0.424 | 13.32 / 0.449 | 15.58 / 0.377 | 10.20 / 0.743 | 12.07 / 0.517 |
| 4D Gaussians | 14.11 / 0.404 | 18.56 / 0.239 | 17.32 / 0.326 | 13.51 / 0.341 | 19.41 / 0.218 | 21.25 / 0.172 | 19.00 / 0.346 | 17.59 / 0.292 |
| DGMarbles (ours) | 23.58 / 0.152 | 22.42 / 0.232 | 20.43 / 0.173 | 17.20 / 0.307 | 24.22 / 0.119 | 26.41 / 0.109 | 23.20 / 0.246 | 22.49 / 0.191 |
Table 1. We report PSNR and LPIPS metrics of DGMarbles and Gaussian baselines on the DyCheck iPhone dataset with camera pose, the DyCheck iPhone dataset without camera pose, and the Nvidia Dynamic Scenes dataset. Overall, DGMarbles significantly outperforms the previous Gaussian baselines.

4.3. Losses

At each optimization step of our divide-and-conquer algorithm, we employ a variety of loss terms to help drive the Gaussians towards a realistic factorization of scene geometry and motion.

4.3.1. Tracking Loss

Building off of recent advances in point tracking  (Harley et al., 2022; Karaev et al., 2023), we regularize the Gaussian marble trajectories to agree with off-the-shelf 2D point tracks. When optimizing the Gaussians into a target timestep j𝑗jitalic_j, we use CoTracker (Karaev et al., 2023) to estimate a 100×100100100100\times 100100 × 100 grid of point tracks in adjacent the frames [jw,j+w]𝑗𝑤𝑗𝑤[j-w,j+w][ italic_j - italic_w , italic_j + italic_w ], with w=12𝑤12w=12italic_w = 12. Then, we randomly sample a source frame i[jw,j+w]𝑖𝑗𝑤𝑗𝑤i\in[j-w,j+w]italic_i ∈ [ italic_j - italic_w , italic_j + italic_w ]. Next, for source i𝑖iitalic_i and target j𝑗jitalic_j, we use our learned Gaussian trajectories to map the 3D Gaussians into frames i𝑖iitalic_i and j𝑗jitalic_j, and then project the Gaussians into the image plane, computing the Gaussian 2D means, depths, and 2D covariances. Finally, we regularize the 2D Gaussian motion from source to target to match the point tracks – for each tracked point pijsubscript𝑝𝑖𝑗p_{i\to j}italic_p start_POSTSUBSCRIPT italic_i → italic_j end_POSTSUBSCRIPT from frame i𝑖iitalic_i to j𝑗jitalic_j, we find the K=32𝐾32K=32italic_K = 32 nearest Gaussians in 2D, and compute a loss that discourages these Gaussians from changing their distance to the tracked point:

(1) track=pPg𝒩(pi)αiDiμipi2Djμjpj2,subscripttracksubscript𝑝𝑃subscript𝑔𝒩subscript𝑝𝑖subscriptsuperscript𝛼𝑖normsubscript𝐷𝑖subscriptnormsubscriptsuperscript𝜇𝑖subscript𝑝𝑖2subscript𝐷𝑗subscriptnormsubscriptsuperscript𝜇𝑗subscript𝑝𝑗2\mathcal{L}_{\text{track}}=\sum_{p\in P}\sum_{g\in\mathcal{N}(p_{i})}\alpha^{% \prime}_{i}\;\big{\|}\;D_{i}||\mu^{\prime}_{i}-p_{i}||_{2}-D_{j}||\mu^{\prime}% _{j}-p_{j}||_{2}\;\big{\|},caligraphic_L start_POSTSUBSCRIPT track end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_p ∈ italic_P end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_g ∈ caligraphic_N ( italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT italic_α start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∥ italic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | | italic_μ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | | start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT - italic_D start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT | | italic_μ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT - italic_p start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT | | start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∥ ,

where $\mu'_i$ is the projected location of a Gaussian center $\mu_i$, $D_i$ is the Gaussian's depth in frame $i$, $\alpha'_i$ is the Gaussian's opacity contribution to $p_i$, $P$ is the set of point tracks, and $\mathcal{N}(p)$ is the set of Gaussians that neighbor a pixel $p$.
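
To make the structure of Eq. (1) concrete, the following NumPy sketch evaluates the tracking loss for a single source/target frame pair. It is an illustration under stated assumptions rather than the authors' implementation: the function name `tracking_loss`, the dense `(T, G)` layout of the opacity contributions, and the brute-force nearest-neighbor search are choices made here for readability, and $D$ is read as the per-frame Gaussian depth.

```python
import numpy as np

def tracking_loss(mu_i, mu_j, depth_i, depth_j, alpha_i, p_i, p_j, k=32):
    """Illustrative evaluation of Eq. (1) for one (source i, target j) pair.

    mu_i, mu_j:       (G, 2) projected Gaussian means in frames i and j.
    depth_i, depth_j: (G,)   Gaussian depths in frames i and j (assumed meaning of D).
    alpha_i:          (T, G) opacity contribution of each Gaussian to each tracked point
                             in frame i (dense layout is an assumption of this sketch).
    p_i, p_j:         (T, 2) tracked point locations in frames i and j.
    """
    k = min(k, mu_i.shape[0])
    # Pairwise 2D distances between tracked points and Gaussians in frame i: (T, G).
    dist_i = np.linalg.norm(p_i[:, None, :] - mu_i[None, :, :], axis=-1)
    # Indices of the k nearest Gaussians to each tracked point in frame i: (T, k).
    nn = np.argsort(dist_i, axis=1)[:, :k]
    rows = np.arange(p_i.shape[0])[:, None]
    d_i = dist_i[rows, nn]
    # Distances of the same Gaussians to the same tracked points in frame j: (T, k).
    d_j = np.linalg.norm(p_j[:, None, :] - mu_j[nn], axis=-1)
    # Depth-scaled change in 2D distance, weighted by the opacity contribution.
    residual = np.abs(depth_i[nn] * d_i - depth_j[nn] * d_j)
    return float(np.sum(alpha_i[rows, nn] * residual))
```

With unit depths, translating the Gaussians and the tracked points by the same 2D offset leaves every distance unchanged, so the loss vanishes; translating the Gaussians while the tracks stay fixed yields a positive loss.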

4.3.2. Rendering Losses

At each training iteration, we render an image, disparity map, and segmentation map. For each, we compute a standard L1 loss with the ground truth image, the initial disparity estimation, and the off-the-shelf instance segmentation.
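
As a rough sketch of how the three rendering terms might be combined, the snippet below sums weighted L1 differences between rendered and reference maps. The function name `rendering_loss`, the dictionary keys, and the use of a per-pixel mean (rather than a sum) are assumptions of this example; the weights in the usage note are the ones reported in Section 5.2.

```python
import numpy as np

def rendering_loss(rendered, targets, weights):
    """Weighted sum of mean absolute (L1) errors over the rendered maps.

    rendered, targets: dicts mapping a map name to an array of identical shape.
    weights:           dict mapping the same names to scalar loss weights.
    """
    total = 0.0
    for name, weight in weights.items():
        # Mean absolute error between the rendered map and its reference.
        total += weight * float(np.mean(np.abs(rendered[name] - targets[name])))
    return total
```

For example, calling it with `weights = {"image": 0.7, "disparity": 0.1, "segmentation": 0.4}` reproduces the relative weighting of the photometric, depth, and segmentation terms given in the implementation details.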

4.3.3. Geometry Losses

Isometry Loss

Following previous works (Prokudin et al., 2023; Luiten et al., 2024), we regularize our Gaussian marbles to follow locally rigid motion. In particular, we penalize Gaussians for moving in a manner that breaks isometric deformation on local neighborhoods. Specifically, when rendering into frame $j$, we select a random source timestep $i$ and compute a local neighborhood isometry loss as follows:

(2) $\mathcal{L}_{\text{iso-local}} = \sum_{g^a \in \mathcal{G}} \sum_{g^b \in \mathcal{N}(g^a)} \big| \, \|\mu^a_i - \mu^b_i\| - \|\mu^a_j - \mu^b_j\| \, \big|$

where $\mu^a_i$ and $\mu^b_i$ are the means of the Gaussian marbles $g^a$ and $g^b$ at timestep $i$.
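
The local isometry term can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the helper names, the neighborhood size `k=8`, the choice to build neighborhoods from the source-frame means, and the brute-force $O(G^2)$ neighbor search are all assumptions made for this example.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbors of each point, excluding itself: (G, k)."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    return np.argsort(dist, axis=1)[:, :k]

def local_isometry_loss(mu_i, mu_j, k=8):
    """Illustrative local isometry loss between timesteps i and j.

    mu_i, mu_j: (G, 3) Gaussian means at timesteps i and j.
    """
    nn = knn_indices(mu_i, k)  # neighborhoods built in frame i (assumption)
    d_i = np.linalg.norm(mu_i[:, None, :] - mu_i[nn], axis=-1)  # (G, k)
    d_j = np.linalg.norm(mu_j[:, None, :] - mu_j[nn], axis=-1)  # (G, k)
    # Penalize any change in neighbor distances between the two timesteps.
    return float(np.sum(np.abs(d_i - d_j)))
```

A rigid motion (rotation plus translation) preserves all pairwise distances and therefore incurs no penalty, whereas a uniform scaling of the scene does.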

In addition to local isometry, we incorporate an instance isometry loss that guides each unique semantic instance to move in a nearly-isometric manner. That is, when rendering into frame $j$, we select a random frame $i$ and compute the instance isometry loss as follows:

(3) $\mathcal{L}_{\text{iso-instance}} = \sum_{g^a \in \mathcal{G}} \sum_{g^b \in Y(g^a)} \big| \, \|\mu^a_i - \mu^b_i\| - \|\mu^a_j - \mu^b_j\| \, \big|$

where $Y(g)$ denotes all Gaussians with the same semantic instance label as $g$. Taken together, our final isometry loss is a weighted combination of the two:

(4) $\mathcal{L}_{\text{iso}} = \lambda \mathcal{L}_{\text{iso-local}} + \sigma \mathcal{L}_{\text{iso-instance}}$
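
The instance-level term and the weighted combination can be sketched as follows. The function names and the dense all-pairs formulation are assumptions of this example (with 200,000 Gaussians, a practical implementation would presumably subsample pairs); the default weights are the local and per-instance isometry weights reported in Section 5.2.

```python
import numpy as np

def pairwise_dist(x):
    """All pairwise Euclidean distances between rows of x: (G, G)."""
    return np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)

def instance_isometry_loss(mu_i, mu_j, labels):
    """Illustrative instance isometry loss.

    mu_i, mu_j: (G, 3) Gaussian means at timesteps i and j.
    labels:     (G,)   integer instance label of each Gaussian.
    """
    same_instance = labels[:, None] == labels[None, :]  # (G, G) mask for Y(g)
    change = np.abs(pairwise_dist(mu_i) - pairwise_dist(mu_j))
    # Sum distance changes over ordered pairs that share an instance label.
    return float(np.sum(change[same_instance]))

def isometry_loss(local_term, instance_term, lam=4000.0, sigma=3.0):
    """Weighted combination of the local and instance isometry terms."""
    return lam * local_term + sigma * instance_term
```

Because only pairs within the same instance are compared, two instances may move rigidly and independently of one another at zero cost.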

3D Alignment Loss

When merging two distinct sets of Gaussian marbles, it is important that the two sets align not only in the projected image plane, but also in 3D space. Notably, without guiding the optimization towards 3D alignment, we find the resulting merge is “cloudy” in 3D (or in novel views), even if the training-view 2D projection is sharp.

Our 3D alignment loss consists of two parts. First, we reduce the total variation of all Gaussian depths, to bring more Gaussians to the surface of the scene. Concretely, for each pixel, we regularize the Gaussians contributing to that pixel to have a similar depth:

(5) $\mathcal{L}_{\text{TV-depth}} = \sum_{p \in P} \sum_{g^a \in \alpha(p)} \alpha^{\prime a}_{p} \, \big| D^a - \bar{D} \big|$

where $\alpha(p)$ indicates the subset of Gaussians that contribute to the pixel $p$, $\alpha^{\prime a}_{p}$ is the opacity contribution of Gaussian $a$ on pixel $p$, and $\bar{D}$ is the weighted-mean depth of all contributing Gaussians, i.e. $\bar{D} = \sum_{g^a \in \alpha(p)} \alpha^{\prime a}_{p} D^a$.
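
A compact way to read the depth total-variation term is as an opacity-weighted absolute deviation from the per-pixel mean depth. The sketch below assumes a dense `(P, G)` matrix of opacity contributions with zeros for Gaussians that do not touch a pixel; this layout and the function name `depth_tv_loss` are illustrative choices, not the authors' implementation.

```python
import numpy as np

def depth_tv_loss(alpha, depth):
    """Illustrative depth total-variation loss.

    alpha: (P, G) opacity contribution of each Gaussian to each pixel
           (zero where a Gaussian does not contribute).
    depth: (G,)   depth of each Gaussian in the rendered view.
    """
    # Weighted-mean depth per pixel, following the definition of D-bar: (P,).
    mean_depth = alpha @ depth
    # Absolute deviation of every Gaussian's depth from each pixel's mean: (P, G).
    deviation = np.abs(depth[None, :] - mean_depth[:, None])
    return float(np.sum(alpha * deviation))
```

Note that the mean depth is only a true average when the contributions at a pixel sum to one; for partially transparent pixels the unnormalized sum is biased towards zero, which is a property of the formula as written rather than of this sketch.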

Second, we include a weakly-weighted Chamfer loss to directly align the two sets of Gaussians. Concretely, we divide the set of Gaussians $\mathcal{G}$ into two random subsets of equal size, $\mathcal{G}^a$ and $\mathcal{G}^b$. Then, we compute a 2-way Chamfer distance between the coordinates of these sets, using the means of the Gaussians as the coordinates:

(6) $\mathcal{L}_{\text{chamfer}} = \sum_{g^a \in \mathcal{G}^a} \min_{g^b \in \mathcal{G}^b} \big\| \mu^a - \mu^b \big\|_2 + \sum_{g^b \in \mathcal{G}^b} \min_{g^a \in \mathcal{G}^a} \big\| \mu^a - \mu^b \big\|_2$

In all, our 3D alignment loss is a weighted linear combination of the depth total-variation and Chamfer losses:

(7) $\mathcal{L}_{\text{alignment}} = \xi \mathcal{L}_{\text{TV-depth}} + \phi \mathcal{L}_{\text{chamfer}}$
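
The Chamfer term and the final combination can be sketched as follows. The function names and the brute-force distance matrix are assumptions of this example; when the number of Gaussians is odd, one Gaussian is simply left out of the split. The default weights are the DyCheck values from Section 5.2 (the Nvidia experiments use different ones).

```python
import numpy as np

def chamfer_loss(mu, rng):
    """Illustrative 2-way Chamfer distance between two random halves of the Gaussians.

    mu:  (G, 3) Gaussian means.
    rng: numpy.random.Generator used to draw the random split.
    """
    perm = rng.permutation(mu.shape[0])
    half = mu.shape[0] // 2
    set_a, set_b = mu[perm[:half]], mu[perm[half:2 * half]]
    # Distances between every mean in A and every mean in B: (half, half).
    dist = np.linalg.norm(set_a[:, None, :] - set_b[None, :, :], axis=-1)
    # Nearest neighbor in B for each element of A, plus the reverse direction.
    return float(dist.min(axis=1).sum() + dist.min(axis=0).sum())

def alignment_loss(tv_term, chamfer_term, xi=20.0, phi=1.0):
    """Weighted combination of the depth total-variation and Chamfer terms."""
    return xi * tv_term + phi * chamfer_term
```

Splitting a single merged set at random means both halves sample the same surfaces, so the Chamfer distance is small only when Gaussians from the two merged sets occupy the same regions of 3D space.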

5. Experiments

PCK-T @5% ↑
Method        Apple   Block   Paper   Space   Spin    Teddy   Wheel   Mean
Nerfies       0.400   0.239   0.091   0.795   0.115   0.795   0.147   0.301
HyperNeRF     0.214   0.048   0.069   0.765   0.076   0.698   0.238   0.369
Dyn. Gauss.   0.075   0.047   0.056   0.107   0.065   0.167   0.039   0.079
4D Gauss.     0.000   0.000   0.000   0.229   0.033   0.133   0.076   0.073
Ours          0.615   0.827   0.537   0.847   0.387   0.808   0.568   0.656
Table 2. We report the tracking metric PCK-T @5% for DGMarbles and baselines on the DyCheck iPhone dataset in the setting without camera pose. DGMarbles significantly outperforms previous methods in tracking.

5.1. Datasets

We evaluate our method and a set of competitive baselines on the standard Nvidia Dynamic Scenes (Yoon et al., 2020) and DyCheck iPhone (Gao et al., 2022) datasets. However, each of these popular datasets contains multi-view information. Thus, as we will discuss, we modify the training and evaluation protocol to emulate a more monocular setting.

5.1.1. Nvidia Dynamic Scenes Dataset

The Nvidia Dynamic Scenes dataset (Yoon et al., 2020) consists of seven videos, each between 90 and 200 frames long and captured with a rig of 12 calibrated cameras. We evaluate on all seven captures: Balloon1, Balloon2, Jumping, Playground, Skating, Truck, and Umbrella. Importantly, previously benchmarked evaluations on the Nvidia dataset sample a different training camera at each timestep, resulting in a “monocular teleporting camera” (Gao et al., 2022). We consider this setting unrealistic, and instead train on the video stream from a single camera (camera 4). We use the video streams from cameras 3, 5, and 6 for evaluation.

5.1.2. DyCheck iPhone Dataset

The DyCheck iPhone dataset (Gao et al., 2022) consists of seven casually-captured iPhone videos, each with up to two novel-view and time-synchronized validation videos. We evaluate on all scenes: Apple, Block, Paper Windmill, Space Out, Spin, Teddy, and Wheel. Unlike the Nvidia Dynamic Scenes dataset, the iPhone dataset is truly monocular, i.e. only permitting models to train on a single camera stream. Nevertheless, the training camera follows a purposeful 3D trajectory that circumnavigates the scene, maximally gathering multi-view information. While a valid monocular setting, the camera’s calculated motion is not representative of casually-captured videos (e.g., as might be found on YouTube). Thus, we evaluate in two settings. First, we follow the official benchmark and use the video stream and camera motion provided. Second, we remove camera poses, offloading the camera motion into the learned 4D scene representation’s dynamics. We find this setting interesting because it simulates additional dynamic content, where previously “static” regions of the scene now have rigid dynamics equal to the inverse camera motion, which must be solved by the scene representation itself.
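
To illustrate what the pose-free setting asks of the scene representation, the sketch below expresses static world points in the coordinate frame of a moving camera. It is not part of the method (in the pose-free setting this motion must be learned, since no poses are given); the function name and the 4x4 world-to-camera matrix convention are assumptions of this example.

```python
import numpy as np

def fold_camera_into_scene(points_world, world_to_cam):
    """Trajectories of static points as seen from a camera fixed at the identity pose.

    points_world: (N, 3)    static points in world coordinates.
    world_to_cam: (T, 4, 4) world-to-camera transforms, i.e. inverse camera poses.
    Returns:      (T, N, 3) apparent per-frame positions of the points.
    """
    rotation = world_to_cam[:, :3, :3]    # (T, 3, 3)
    translation = world_to_cam[:, :3, 3]  # (T, 3)
    # Apply each frame's rigid transform to every point.
    return np.einsum("tij,nj->tni", rotation, points_world) + translation[:, None, :]
```

For instance, if the camera translates by one unit along $+x$, its world-to-camera transform translates by one unit along $-x$, and every static point appears to move by $-1$ in $x$: exactly the inverse camera motion that the representation must absorb.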

5.2. Implementation Details

We initialize each frame with 200,000 Gaussians and fine-tune the initialization for 40 optimization steps. For each frame, we run $\eta=100$ optimization steps during the motion estimation stage, and $\beta=40$ steps during the global adjustment stage. After each merge, we downsample back to 200,000 Gaussians. We set the tracking loss weight to 1.0, the rendered photometric loss weight to 0.7, the rendered depth-map loss weight to 0.1, the rendered segmentation loss weight to 0.4, the local isometry loss weight to 4000, and the per-instance isometry loss weight to 3.0. For DGMarbles and all baselines, we allocate a maximum compute budget of 24 training hours on a single NVIDIA A5000 GPU.

On the DyCheck dataset, we set the depth total-variation loss weight to 20 and the Chamfer loss weight to 1. Furthermore, we use the provided iPhone LiDAR as depth maps. Finally, we stop our divide-and-conquer curriculum after learning subsequences of length 8 on the foreground and length 32 on the background. Continuing the divide-and-conquer optimization (e.g., leading to full-length trajectories) is possible, but we found this degrades the visual quality slightly.

On the Nvidia dataset, we set the depth total-variation loss weight to 60 and the Chamfer loss weight to 0. We estimate monocular depth using the method of Yang et al. (2024a), and we stop the divide-and-conquer learning curriculum after learning subsequences of length 32, 8, 16, 4, 32, 32, and 16 for the scenes Balloon1, Balloon2, Jumping, Playground, Skating, Truck, and Umbrella, respectively.
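
For reference, the hyperparameters listed in this section can be collected into a single configuration. The sketch below is only a restatement of the values above; the dictionary layout, key names, and the helper `loss_weights_for` are hypothetical and do not correspond to any released configuration file.

```python
# Settings shared by all experiments (values taken from the text above).
COMMON = {
    "gaussians_per_frame": 200_000,
    "init_finetune_steps": 40,
    "motion_estimation_steps": 100,  # eta
    "global_adjustment_steps": 40,   # beta
    "loss_weights": {
        "tracking": 1.0,
        "photometric": 0.7,
        "depth": 0.1,
        "segmentation": 0.4,
        "iso_local": 4000.0,
        "iso_instance": 3.0,
    },
}

# Dataset-specific alignment weights and curriculum stopping lengths.
PER_DATASET = {
    "dycheck": {
        "alignment_weights": {"tv_depth": 20.0, "chamfer": 1.0},
        "max_subsequence": {"foreground": 8, "background": 32},
    },
    "nvidia": {
        "alignment_weights": {"tv_depth": 60.0, "chamfer": 0.0},
        "max_subsequence": {
            "Balloon1": 32, "Balloon2": 8, "Jumping": 16, "Playground": 4,
            "Skating": 32, "Truck": 32, "Umbrella": 16,
        },
    },
}

def loss_weights_for(dataset):
    """Merge the shared loss weights with the dataset-specific alignment weights."""
    weights = dict(COMMON["loss_weights"])
    weights.update(PER_DATASET[dataset]["alignment_weights"])
    return weights
```

Keeping the per-dataset values separate makes the two differences between the setups explicit: the Nvidia experiments disable the Chamfer term entirely and use a per-scene stopping length for the curriculum.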

5.3. Dynamic Novel View Synthesis with Gaussians

We evaluate DGMarbles against the recent Dynamic 3D Gaussians (Luiten et al., 2024) and 4D Gaussians (Wu et al., 2023). We report the standard metrics mPSNR and LPIPS on novel view synthesis in Table 1. We see that DGMarbles significantly outperforms both Gaussian baselines on average across both datasets. In particular, DGMarbles significantly improves over baselines in settings with less multi-view information, i.e. the iPhone evaluation without camera pose and the Nvidia Dynamic Scenes evaluation.

We visualize the results of DGMarbles and the baselines on the iPhone dataset (without camera pose) in Figure 1 and Figure 5. As shown, the existing Gaussian baselines exhibit poor novel view synthesis in this monocular setting, further emphasizing their need for strong multi-view supervision. In particular, 4D Gaussians converges to a local minimum, averaging the static information over all frames instead of correctly learning motion. On the other hand, the Gaussians in Dynamic 3D Gaussians immediately diverge from the scene geometry, overfitting to the training view in a manner that does not correctly render into novel views. We also compare with depth warping (Niklaus and Liu, 2020), and show that, by tracking and aggregating the scene over a larger time horizon, DGMarbles covers more of the scene’s content than per-frame warping, as illustrated by the missing/occluded regions in the first column of Figure 5.

We additionally provide a qualitative comparison of DGMarbles and 4D Gaussians on a real-world video as shown in Figure 7. As illustrated, 4D Gaussians fails to overfit even on the training images. Instead, the method again falls into a blurry and static local minimum that averages frames in the video. In contrast, DGMarbles attains a high-quality reconstruction, and also exhibits reasonable novel-view synthesis and tracking.

5.4. Tracking and Editing with Gaussians

In addition to novel view synthesis, dynamic Gaussians are well suited for dense point tracking. In Table 2, we report tracking on the DyCheck iPhone dataset in the setting without camera pose. We follow the official DyCheck evaluation and report the percentage of correct keypoints tracked (PCK-T) at a 5% threshold; we note that all DyCheck keypoints are in training-view images. As shown, DGMarbles significantly outperforms all other NeRF and Gaussian methods in tracking. We also visualize dense DGMarbles point tracks for both the training view and a novel view in Figure 6. We see that DGMarbles successfully tracks the dense scene geometry. Furthermore, the visualization shows a clear distinction between foreground motion and the rigid background motion (which equals the inverse camera motion due to holding out camera pose information).
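
As a rough illustration of the metric, the sketch below computes a PCK-style score for a batch of transferred keypoints. It assumes that the 5% threshold is measured relative to the longer image side, which should be checked against the official DyCheck evaluation code; the function name `pck_t` and the argument layout are choices made for this example.

```python
import numpy as np

def pck_t(pred, gt, image_hw, ratio=0.05):
    """Fraction of predicted keypoints within a distance threshold of the ground truth.

    pred, gt: (K, 2) predicted and ground-truth keypoint locations in pixels.
    image_hw: (height, width) of the image.
    ratio:    threshold as a fraction of the longer image side (assumption).
    """
    threshold = ratio * max(image_hw)
    error = np.linalg.norm(pred - gt, axis=-1)  # per-keypoint pixel error
    return float(np.mean(error <= threshold))
```

For a 100 x 200 image the threshold is 10 pixels, so a keypoint that is off by 5 pixels counts as correct and one that is off by 15 pixels does not.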

In Figure 7, we show that our tracking permits editing videos in a temporally consistent manner. We color the tiger blue in the first frame, and the Gaussian marbles propagate the edit throughout the entire video. This emphasizes that dynamic Gaussians are a good choice of scene representation for editing applications.

5.5. Dynamic Novel View Synthesis with NeRF

We compare DGMarbles with competitive NeRF baselines in Table 3. As reported, DGMarbles is on par with NeRF baselines. Interestingly, we find that DGMarbles does very well on the iPhone evaluation without camera pose, and we speculate that the volumetric NeRF approaches struggle in the absence of a static background (i.e. when the entire scene moves), while Gaussians can better handle the more expansive dynamic region. In contrast, when the multi-view iPhone camera poses are provided, and thus stronger multi-view supervision is present, DGMarbles does worse than its NeRF counterparts. Still, we again emphasize that DGMarbles exhibits significantly faster rendering (see Figure 1), better tracking (see Table 2), and more editability (see Figure 7) than the NeRFs.

Method      iPhone (+ pose)   iPhone (- pose)   Nvidia          Mean
Nerfies     16.45 / 0.339     14.60 / 0.483     21.40 / 0.190   17.48 / 0.337
HyperNeRF   16.81 / 0.332     14.97 / 0.474     21.73 / 0.167   17.83 / 0.324
T-NeRF      17.43 / 0.508     14.54 / 0.574     21.40 / 0.171   17.79 / 0.418
Ours        16.03 / 0.436     15.19 / 0.407     22.49 / 0.191   17.90 / 0.345
Table 3. We report mPSNR ↑ / LPIPS ↓. DGMarbles is on par with NeRF baselines for the task of novel view synthesis.

5.6. Ablations

In Table 4, we ablate various parts of DGMarbles and report the outcomes on three scenes from the Nvidia dataset. As shown, each of our design choices is important in achieving high-quality novel view synthesis. In particular, the table suggests that our global adjustment phase and our isometry loss are of principal importance.

mPSNR ↑ / mLPIPS ↓
Method                 Balloon1        Skating         Truck           Mean
No Segmentation        22.78 / 0.137   23.55 / 0.156   26.00 / 0.106   24.11 / 0.133
No Tracking            22.19 / 0.179   23.79 / 0.115   24.52 / 0.118   23.50 / 0.137
No Isometry            22.75 / 0.147   22.22 / 0.166   24.20 / 0.136   23.06 / 0.150
No Motion Estimation   22.83 / 0.182   23.16 / 0.149   25.12 / 0.136   23.70 / 0.156
No Global Adjustment   22.07 / 0.344   23.42 / 0.283   23.74 / 0.424   23.08 / 0.350
DGMarbles              23.58 / 0.152   24.22 / 0.119   26.41 / 0.109   24.74 / 0.127
Table 4. We ablate various components of DGMarbles, showing that each component is important to achieve high-quality novel view synthesis.

6. Limitations and Conclusion

We present DGMarbles, an attempt to bring dynamic Gaussians to the challenging setting of casual monocular video captures. DGMarbles introduces isotropic Gaussian “marbles”, a divide-and-conquer learning strategy, and various 2D priors, achieving novel view synthesis that is significantly better than previous Gaussian methods. Furthermore, DGMarbles is well-suited for tracking and editing, and significantly outperforms previous reconstruction methods on tracking accuracy. Nevertheless, DGMarbles has limitations and does not comprehensively solve the extremely challenging problem of open-world dynamic monocular novel view synthesis. Since DGMarbles relies on 2D image priors, errors in the 2D predictions, such as poor depth estimation or poor segmentation, can lead the optimization to suboptimal results. Similarly, our geometric priors may guide the optimization incorrectly in scenes with rapid and non-rigid motion, a setting where further progress in 3D priors and visual tracking will be vital. We hope DGMarbles provides a significant step forward in bringing Gaussian representations to the challenging setting of general monocular novel-view synthesis.

References

  • Blinn (1982) James F Blinn. 1982. A generalization of algebraic surface drawing. ACM transactions on graphics (TOG) 1, 3 (1982), 235–256.
  • Bui et al. (2023) Minh-Quan Viet Bui, Jongmin Park, Jihyong Oh, and Munchurl Kim. 2023. DyBluRF: Dynamic Deblurring Neural Radiance Fields for Blurry Monocular Video. arXiv preprint arXiv:2312.13528 (2023).
  • Cao and Johnson (2023) Ang Cao and Justin Johnson. 2023. HexPlane: A Fast Representation for Dynamic Scenes. CVPR (2023).
  • Cen et al. (2023) Jiazhong Cen, Jiemin Fang, Chen Yang, Lingxi Xie, Xiaopeng Zhang, Wei Shen, and Qi Tian. 2023. Segment Any 3D Gaussians. arXiv preprint arXiv:2312.00860 (2023).
  • Das et al. (2023) Devikalyan Das, Christopher Wewer, Raza Yunus, Eddy Ilg, and Jan Eric Lenssen. 2023. Neural parametric gaussians for monocular non-rigid object reconstruction. arXiv preprint arXiv:2312.01196 (2023).
  • Duan et al. (2024) Yuanxing Duan, Fangyin Wei, Qiyu Dai, Yuhang He, Wenzheng Chen, and Baoquan Chen. 2024. 4D Gaussian Splatting: Towards Efficient Novel View Synthesis for Dynamic Scenes. ArXiv abs/2402.03307 (2024). https://api.semanticscholar.org/CorpusID:267411895
  • Duisterhof et al. (2023) Bardienus Pieter Duisterhof, Zhao Mandi, Yunchao Yao, Jia-Wei Liu, Mike Zheng Shou, Shuran Song, and Jeffrey Ichnowski. 2023. MD-Splatting: Learning Metric Deformation from 4D Gaussians in Highly Deformable Scenes. ArXiv abs/2312.00583 (2023). https://api.semanticscholar.org/CorpusID:265551723
  • Fan et al. (2024) Zhiwen Fan, Wenyan Cong, Kairun Wen, Kevin Wang, Jian Zhang, Xinghao Ding, Danfei Xu, Boris Ivanovic, Marco Pavone, Georgios Pavlakos, Zhangyang Wang, and Yue Wang. 2024. InstantSplat: Unbounded Sparse-view Pose-free Gaussian Splatting in 40 Seconds. arXiv:2403.20309 [cs.CV]
  • Fang et al. (2022) Jiemin Fang, Taoran Yi, Xinggang Wang, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Matthias Nießner, and Qi Tian. 2022. Fast Dynamic Radiance Fields with Time-Aware Neural Voxels. In SIGGRAPH Asia 2022 Conference Papers.
  • Feng et al. (2024) Qiyuan Feng, Geng-Chen Cao, Hao-Xiang Chen, Tai-Jiang Mu, Ralph R. Martin, and Shi-Min Hu. 2024. A New Split Algorithm for 3D Gaussian Splatting. ArXiv abs/2403.09143 (2024). https://api.semanticscholar.org/CorpusID:268384828
  • Fridovich-Keil et al. (2023) Sara Fridovich-Keil, Giacomo Meanti, Frederik Rahbæk Warburg, Benjamin Recht, and Angjoo Kanazawa. 2023. K-Planes: Explicit Radiance Fields in Space, Time, and Appearance. In CVPR.
  • Gao et al. (2021) Chen Gao, Ayush Saraf, Johannes Kopf, and Jia-Bin Huang. 2021. Dynamic View Synthesis from Dynamic Monocular Video. In Proceedings of the IEEE International Conference on Computer Vision.
  • Gao et al. (2022) Hang Gao, Ruilong Li, Shubham Tulsiani, Bryan Russell, and Angjoo Kanazawa. 2022. Monocular Dynamic View Synthesis: A Reality Check. In NeurIPS.
  • Guo et al. (2024) Zhiyang Guo, Wen gang Zhou, Li Li, Min Wang, and Houqiang Li. 2024. Motion-aware 3D Gaussian Splatting for Efficient Dynamic Scene Reconstruction. ArXiv abs/2403.11447 (2024). https://api.semanticscholar.org/CorpusID:268512916
  • Harley et al. (2022) Adam W Harley, Zhaoyuan Fang, and Katerina Fragkiadaki. 2022. Particle Video Revisited: Tracking Through Occlusions Using Point Trajectories. In ECCV.
  • Huang et al. (2023) Yi-Hua Huang, Yang-Tian Sun, Ziyi Yang, Xiaoyang Lyu, Yan-Pei Cao, and Xiaojuan Qi. 2023. SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes. arXiv preprint arXiv:2312.14937 (2023).
  • Jang and Kim (2022) Hankyu Jang and Daeyoung Kim. 2022. D-TensoRF: Tensorial Radiance Fields for Dynamic Scenes. ArXiv abs/2212.02375 (2022). https://api.semanticscholar.org/CorpusID:254247189
  • Johnson et al. (2023) Erik C.M. Johnson, Marc Habermann, Soshi Shimada, Vladislav Golyanik, and Christian Theobalt. 2023. Unbiased 4D: Monocular 4D Reconstruction with a Neural Deformation Model. CVPR Workshop (2023).
  • Karaev et al. (2023) Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. 2023. CoTracker: It is Better to Track Together. arXiv:2307.07635 (2023).
  • Katsumata et al. (2023) Kai Katsumata, Duc Minh Vo, and Hideki Nakayama. 2023. An Efficient 3D Gaussian Representation for Monocular/Multi-view Dynamic Scenes. ArXiv abs/2311.12897 (2023). https://api.semanticscholar.org/CorpusID:265351835
  • Kerbl et al. (2023) Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 2023. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Transactions on Graphics 42, 4 (July 2023). https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/
  • Keselman and Hebert (2022) Leonid Keselman and Martial Hebert. 2022. Approximate differentiable rendering with algebraic surfaces. In European Conference on Computer Vision (ECCV). Springer, 596–614.
  • Keselman and Hebert (2023) Leonid Keselman and Martial Hebert. 2023. Flexible techniques for differentiable rendering with 3d gaussians. arXiv preprint arXiv:2308.14737 (2023).
  • Kirillov et al. (2023) Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. 2023. Segment Anything. arXiv:2304.02643 (2023).
  • Kirschstein et al. (2023) Tobias Kirschstein, Shenhan Qian, Simon Giebenhain, Tim Walter, and Matthias Nießner. 2023. NeRSemble: Multi-View Radiance Field Reconstruction of Human Heads. ACM Trans. Graph. 42, 4, Article 161 (jul 2023), 14 pages. https://doi.org/10.1145/3592455
  • Lee et al. (2024) Byeonghyeon Lee, Howoong Lee, Xiangyu Sun, Usman Ali, and Eunbyung Park. 2024. Deblurring 3D Gaussian Splatting. arXiv:2401.00834 [cs.CV]
  • Lee et al. (2023a) Joo Chan Lee, Daniel Rho, Xiangyu Sun, Jong Hwan Ko, and Eunbyung Park. 2023a. Compact 3D Gaussian Representation for Radiance Field. arXiv preprint arXiv:2311.13681 (2023).
  • Lee et al. (2023b) Yao-Chih Lee, Zhoutong Zhang, Kevin Blackburn-Matzen, Simon Niklaus, Jianming Zhang, Jia-Bin Huang, and Feng Liu. 2023b. Fast View Synthesis of Casual Videos. arXiv preprint arXiv:2312.02135 (2023).
  • Li et al. (2023a) Zhan Li, Zhang Chen, Zhong Li, and Yi Xu. 2023a. Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis. arXiv preprint arXiv:2312.16812 (2023).
  • Li et al. (2021) Zhengqi Li, Simon Niklaus, Noah Snavely, and Oliver Wang. 2021. Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Li et al. (2023b) Zhengqi Li, Qianqian Wang, Forrester Cole, Richard Tucker, and Noah Snavely. 2023b. DynIBaR: Neural Dynamic Image-Based Rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Li et al. (2024) Zhe Li, Zerong Zheng, Lizhen Wang, and Yebin Liu. 2024. Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Liang et al. (2023) Yiqing Liang, Numair Khan, Zhengqin Li, Thu Nguyen-Phuoc, Douglas Lanman, James Tompkin, and Lei Xiao. 2023. GauFRe: Gaussian Deformation Fields for Real-time Dynamic Novel View Synthesis. ArXiv abs/2312.11458 (2023). https://api.semanticscholar.org/CorpusID:266359262
  • Lin et al. (2023) Youtian Lin, Zuozhuo Dai, Siyu Zhu, and Yao Yao. 2023. Gaussian-Flow: 4D Reconstruction with Dynamic 3D Gaussian Particle. arXiv:2312.03431 (2023).
  • Liu et al. (2023) Yu-Lun Liu, Chen Gao, Andreas Meuleman, Hung-Yu Tseng, Ayush Saraf, Changil Kim, Yung-Yu Chuang, Johannes Kopf, and Jia-Bin Huang. 2023. Robust Dynamic Radiance Fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Luiten et al. (2024) Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. 2024. Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis. In 3DV.
  • Matsuki et al. (2024) Hidenobu Matsuki, Riku Murai, Paul H. J. Kelly, and Andrew J. Davison. 2024. Gaussian Splatting SLAM. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Mildenhall et al. (2020) Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. 2020. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In ECCV.
  • Morgenstern et al. (2023) Wieland Morgenstern, Florian Barthel, Anna Hilsmann, and Peter Eisert. 2023. Compact 3D Scene Representation via Self-Organizing Gaussian Grids. arXiv:2312.13299 [cs.CV]
  • Niedermayr et al. (2023) Simon Niedermayr, Josef Stumpfegger, and Rüdiger Westermann. 2023. Compressed 3D Gaussian Splatting for Accelerated Novel View Synthesis. arXiv:2401.02436 [cs.CV]
  • Niklaus and Liu (2020) Simon Niklaus and Feng Liu. 2020. Softmax Splatting for Video Frame Interpolation. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Park et al. (2021a) Keunhong Park, Utkarsh Sinha, Jonathan T. Barron, Sofien Bouaziz, Dan B Goldman, Steven M. Seitz, and Ricardo Martin-Brualla. 2021a. Nerfies: Deformable Neural Radiance Fields. ICCV (2021).
  • Park et al. (2021b) Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T. Barron, Sofien Bouaziz, Dan B Goldman, Ricardo Martin-Brualla, and Steven M. Seitz. 2021b. HyperNeRF: A Higher-Dimensional Representation for Topologically Varying Neural Radiance Fields. ACM Trans. Graph. 40, 6, Article 238 (dec 2021).
  • Prokudin et al. (2023) Sergey Prokudin, Qianli Ma, Maxime Raafat, Julien Valentin, and Siyu Tang. 2023. Dynamic Point Fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 7964–7976.
  • Qian et al. (2024) Zhiyin Qian, Shaofei Wang, Marko Mihajlovic, Andreas Geiger, and Siyu Tang. 2024. 3DGS-Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting. (2024).
  • Ramasinghe et al. (2024) Sameera Ramasinghe, Violetta Shevchenko, Gil Avraham, and Anton van den Hengel. 2024. BLiRF: Band limited radiance fields for dynamic scene modeling. In AAAI 2024. https://www.amazon.science/publications/blirf-band-limited-radiance-fields-for-dynamic-scene-modeling
  • Song et al. (2023) Liangchen Song, Anpei Chen, Zhong Li, Zhang Chen, Lele Chen, Junsong Yuan, Yi Xu, and Andreas Geiger. 2023. NeRFPlayer: A Streamable Dynamic Scene Representation with Decomposed Neural Radiance Fields. IEEE Transactions on Visualization and Computer Graphics 29, 5 (2023), 2732–2742. https://doi.org/10.1109/TVCG.2023.3247082
  • Sun et al. (2024) Jiakai Sun, Han Jiao, Guangyuan Li, Zhanjie Zhang, Lei Zhao, and Wei Xing. 2024. 3dgstream: On-the-fly training of 3d gaussians for efficient streaming of photo-realistic free-viewpoint videos. arXiv preprint arXiv:2403.01444 (2024).
  • Tang et al. (2023) Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. 2023. DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation. arXiv preprint arXiv:2309.16653 (2023).
  • Tian et al. (2023) Fengrui Tian, Shaoyi Du, and Yueqi Duan. 2023. MonoNeRF: Learning a Generalizable Dynamic Radiance Field from Monocular Videos. In Proceedings of the International Conference on Computer Vision (ICCV).
  • Tretschk et al. (2020) Edgar Tretschk, Ayush Tewari, Vladislav Golyanik, Michael Zollhöfer, Christoph Lassner, and Christian Theobalt. 2020. Non-Rigid Neural Radiance Fields: Reconstruction and Novel View Synthesis of a Dynamic Scene From Monocular Video. arXiv:2012.12247 [cs.CV]
  • Wang et al. (2021a) Chaoyang Wang, Ben Eckart, Simon Lucey, and Orazio Gallo. 2021a. Neural Trajectory Fields for Dynamic Novel View Synthesis. ArXiv Preprint. arXiv:2105.05994
  • Wang et al. (2023) Chaoyang Wang, Lachlan Ewen MacDonald, László A. Jeni, and Simon Lucey. 2023. Flow Supervision for Deformable NeRF. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 21128–21137.
  • Wang et al. (2024) Chaoyang Wang, Peiye Zhuang, Aliaksandr Siarohin, Junli Cao, Guocheng Qian, Hsin-Ying Lee, and Sergey Tulyakov. 2024. Diffusion Priors for Dynamic View Synthesis from Monocular Videos. ArXiv abs/2401.05583 (2024). https://api.semanticscholar.org/CorpusID:266933409
  • Wang et al. (2021b) Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul Srinivasan, Howard Zhou, Jonathan T. Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. 2021b. IBRNet: Learning Multi-View Image-Based Rendering. In CVPR.
  • Wu et al. (2023) Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 2023. 4D Gaussian Splatting for Real-Time Dynamic Scene Rendering. arXiv preprint arXiv:2310.08528 (2023).
  • Yang et al. (2023a) Jinyu Yang, Mingqi Gao, Zhe Li, Shang Gao, Fangjing Wang, and Feng Zheng. 2023a. Track Anything: Segment Anything Meets Videos. arXiv:2304.11968 [cs.CV]
  • Yang et al. (2024a) Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. 2024a. Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data. In CVPR.
  • Yang et al. (2023b) Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. 2023b. Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction. arXiv preprint arXiv:2309.13101 (2023).
  • Yang et al. (2024b) Zeyu Yang, Hongye Yang, Zijie Pan, and Li Zhang. 2024b. Real-time Photorealistic Dynamic Scene Representation and Rendering with 4D Gaussian Splatting. In International Conference on Learning Representations (ICLR).
  • Ye et al. (2023) Mingqiao Ye, Martin Danelljan, Fisher Yu, and Lei Ke. 2023. Gaussian Grouping: Segment and Edit Anything in 3D Scenes. arXiv preprint arXiv:2312.00732 (2023).
  • Yi et al. (2024) Taoran Yi, Jiemin Fang, Junjie Wang, Guanjun Wu, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Qi Tian, and Xinggang Wang. 2024. GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models. In CVPR.
  • Yoon et al. (2020) Jae Shin Yoon, Kihwan Kim, Orazio Gallo, Hyun Soo Park, and Jan Kautz. 2020. Novel View Synthesis of Dynamic Scenes With Globally Coherent Depths From a Monocular Camera. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020), 5335–5344. https://api.semanticscholar.org/CorpusID:214795169
  • Yu et al. (2023) Heng Yu, Joel Julin, Zoltan A Milacski, Koichiro Niinuma, and Laszlo A Jeni. 2023. CoGS: Controllable Gaussian Splatting. arXiv (2023).
  • Zhang et al. (2024) Jiahui Zhang, Fangneng Zhan, Muyu Xu, Shijian Lu, and Eric P. Xing. 2024. FreGS: 3D Gaussian Splatting with Progressive Frequency Regularization. ArXiv abs/2403.06908 (2024). https://api.semanticscholar.org/CorpusID:268363429
  • Zhao et al. (2024) Xiaoming Zhao, Alex Colburn, Fangchang Ma, Miguel Ángel Bautista, Joshua M. Susskind, and Alexander G. Schwing. 2024. Pseudo-Generalized Dynamic View Synthesis from a Video. In ICLR.
  • Zhou et al. (2023) Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Zehao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, and Achuta Kadambi. 2023. Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields. arXiv preprint arXiv:2312.03203 (2023).
Figure 5. We visualize novel-view synthesis results of DGMarbles and baselines on various scenes of the DyCheck iPhone dataset (in the setting without camera pose).
Figure 6. We visualize dense point tracking of DGMarbles on two scenes from the DyCheck iPhone dataset (in the setting where camera pose is withheld).
Figure 7. DGMarbles reconstructs a sequence in a manner that tracks and aggregates 3D content, allowing both novel-view synthesis and downstream edits.