Figure 1. Our method achieves high-quality novel-view synthesis given a challenging monocular video as input. In contrast, other Gaussian representations arrive at poor local minima, while NeRF methods are on par in quality but exhibit slow rendering and poor tracking, and lack an editable and compositional structure.

Dynamic Gaussian Marbles for Novel View Synthesis of Casual Monocular Videos

Colton Stearns (ORCID 0000-0002-3297-2870), Stanford University, Stanford, CA 94305, USA, coltongs@stanford.edu
Adam Harley (ORCID 0000-0002-9851-4645), Stanford University, Stanford, CA 94305, USA, aharley@cs.stanford.edu
Mikaela Uy (ORCID 0009-0009-4917-7724), Stanford University, Stanford, CA 94305, USA, mikacuy@stanford.edu
Florian Dubost (ORCID 0000-0002-7035-2680), Google, Mountain View, USA, fdubost@google.com
Federico Tombari (ORCID 0000-0001-5598-5212), Google, Zurich, Switzerland, tombari@google.com
Gordon Wetzstein (ORCID 0000-0002-9243-6885), Stanford University, Stanford, CA 94305, USA, gordon.wetzstein@stanford.edu
Leonidas Guibas (ORCID 0000-0002-8315-4886), Stanford University, Stanford, CA 94305, USA

Abstract.

Gaussian splatting has become a popular representation for novel-view synthesis, exhibiting clear strengths in efficiency, photometric quality, and compositional editability. Following its success, many works have extended Gaussians to 4D, showing that dynamic Gaussians maintain these benefits while also tracking scene geometry far better than alternative representations. Yet, these methods assume dense multi-view videos as supervision, constraining their use to controlled capture settings. In this work, we are interested in extending the capability of Gaussian scene representations to casually captured monocular videos. We show that existing 4D Gaussian methods fail dramatically in this setup because the monocular setting is underconstrained. Building off this finding, we propose Dynamic Gaussian Marbles (DGMarbles), consisting of three core modifications that target the difficulties of the monocular setting. First, DGMarbles uses isotropic Gaussian “marbles”, reducing the degrees of freedom of each Gaussian, and constraining the optimization to focus on motion and appearance over local shape. Second, DGMarbles employs a hierarchical divide-and-conquer learning strategy to efficiently guide the optimization towards solutions with globally coherent motion. Finally, DGMarbles adds image-level and geometry-level priors into the optimization, including a tracking loss that takes advantage of recent progress in point tracking. By constraining the optimization in these ways, DGMarbles learns Gaussian trajectories that enable novel-view rendering and accurately capture the 3D motion of the scene elements. We evaluate on the (monocular) Nvidia Dynamic Scenes dataset and the DyCheck iPhone dataset, and show that DGMarbles significantly outperforms other Gaussian baselines in quality, and is on par with non-Gaussian representations, all while maintaining the efficiency, compositionality, editability, and tracking benefits of Gaussians.

Gaussian splatting, neural rendering, novel view synthesis, inverse graphics, video editing

Journal: TOG

1. Introduction

It is very challenging to convert everyday monocular videos of dynamic scenes into reconstructions which are renderable from alternative viewpoints. Doing so seems to require extracting 3D geometry, motion, and radiance, all from pixels alone. Achieving this in a robust way would greatly extend current capabilities in video production, 3D content creation, virtual reality, and synthetic data generation, as well as advance computer vision.

In recent years, the research community has made tremendous progress in building renderable 3D representations from multi-view captures. For instance, Gaussian Splatting (Kerbl et al., 2023) has emerged as a leading solution for novel-view synthesis of static scenes. By representing the 3D space with a collection of 3D Gaussians and “splatting” these onto the image plane, Gaussian Splatting achieves high-quality photometric reconstruction and efficient rendering. Another useful feature of the Gaussian representation is that it is compositional: a scene can be edited by, for example, moving (or removing) the Gaussians that make up an object. Many works have since extended Gaussian Splatting to the 4D setting, allowing dynamic scenes to be reconstructed such that 3D content is tracked and rendered with impressive accuracy (Luiten et al., 2024; Huang et al., 2023; Duisterhof et al., 2023; Lin et al., 2023). Yet these methods only apply when there are multiple simultaneous viewpoints of the scene (i.e., a multi-camera setup), which limits their use to purpose-built capture environments.

In this work, we are interested in using Gaussians for simple, casual, monocular captures, where a single camera is being moved smoothly about a dynamic scene (e.g., by a human). Our core finding is that while current methods for dynamic or 4D Gaussians are highly underconstrained in the absence of multi-view information, we can recover similar constraints using off-the-shelf methods for estimating depth and motion, along with standard geometry-based regularizations on scene structure. We demonstrate our findings through a method we call DGMarbles.

Compared to related Gaussian-based representations of dynamic scenes, DGMarbles contributes changes to the core representation, the learning strategy, and the objective function, with the aim of guiding the optimization process to arrive at solutions which reasonably generalize to novel views. First, DGMarbles removes the anisotropic nature of typical Gaussians, and simply uses isotropic “marbles”. We find Gaussian marbles are a better choice for the underconstrained monocular setting. Second, we employ a divide-and-conquer learning algorithm. Intuitively, we divide a long video into subsequences and optimize each one independently, and then merge pairs of subsequences until we reach a desired temporal horizon. This strategy takes advantage of the fact that it is easier to solve for motion and geometry within shorter time horizons, and converts long-sequence tracking into a task of gluing together neighboring subsequences. Third, we make use of freely-available priors in both image space and 3D space. In the image plane, we use off-the-shelf models SegmentAnything (Kirillov et al., 2023; Yang et al., 2023a), CoTracker (Karaev et al., 2023), and DepthAnything (Yang et al., 2024a), and guide our 3D representation according to these 2D cues. In 3D space, we regularize Gaussian trajectories with geometric priors, including local isometry, global isometry, depth total variation, and chamfer distance.

We show that DGMarbles greatly outperforms other dynamic Gaussian methods in the casual monocular setting. Specifically, we evaluate on the Nvidia Dynamic Scenes dataset and DyCheck iPhone dataset, which we modify into strictly-monocular datasets. Furthermore, we show that we are on-par with NeRF-based methods, while retaining our key advantages over them, namely efficient rendering, tracking, and editability.

2. Related Work

2.1. Gaussian Splatting

Gaussian-based representations have long been attractive for modeling the surfaces of 3D scenes (Blinn, 1982), thanks to their efficiency, interpretability, and compositionality. The key idea is to represent a scene using a set of anisotropic Gaussians, equipped with opacity and color attributes, enabling not only color rendering but a variety of applications in both graphics and computer vision, such as scene editing and pose estimation (Keselman and Hebert, 2022, 2023). Gaussian scene representations have received great attention in the past year, in particular due to 3D Gaussian Splatting (3DGS) (Kerbl et al., 2023), which differentiably “splats” the Gaussians onto the image plane, with a very efficient GPU implementation. Many works have delved deeper into the advantages of 3DGS, including its compositionality (Ye et al., 2023; Yu et al., 2023), speed (Lee et al., 2023a; Morgenstern et al., 2023; Niedermayr et al., 2023) and quality (Lee et al., 2024; Zhang et al., 2024; Feng et al., 2024), and it has also been adapted into many downstream applications such as pose estimation (Fan et al., 2024), SLAM (Matsuki et al., 2024), semantic scene understanding (Zhou et al., 2023; Cen et al., 2023; Ye et al., 2023), human avatar animation (Li et al., 2024; Qian et al., 2024), text-to-3D (Tang et al., 2023; Yi et al., 2024), and more.

2.2. Gaussians for Dynamic Scenes

Many works have begun extending the 3DGS representation to the 4D domain, aiming to solve the challenge of dynamic scene reconstruction. These works largely differ from one another in how they represent and learn motion.

One popular direction is to model motion as a set of per-Gaussian 3D trajectories through time (Luiten et al., 2024; Sun et al., 2024; Duisterhof et al., 2023), and to learn motion by sequentially optimizing for per-Gaussian offsets into the next frame. Notably, these methods have shown impressive tracking. Other works extend this motion representation, and push for a more compact set of trajectories via sparse control points (Huang et al., 2023) or an explicit motion basis (Das et al., 2023; Katsumata et al., 2023; Lin et al., 2023; Li et al., 2023a; Yu et al., 2023). Similar to these works, we represent motion as 3D trajectories through time. However, we greatly differ in how we learn motion, as we use a divide-and-conquer learning strategy as well as various unique 2D and 3D priors.

Another line of work (Wu et al., 2023; Yang et al., 2023b; Liang et al., 2023; Guo et al., 2024) defines motion as a time-conditioned deformation network that warps a canonical set of Gaussians into each timeframe. While a shared global deformation network is an efficient and compact representation of motion, learning the appropriate deformations is challenging; in particular, the deformation network may collapse into a local minimum while jointly optimizing across all timeframes. Lastly, a few works (Duan et al., 2024; Yang et al., 2024b) directly model Gaussians that extend across space and time, i.e. Gaussians with means and covariances in 4D.

Although previous methods present different motion representations, they largely address the same multi-camera setting. In contrast, our work tackles the more challenging monocular setting. We note that there are concurrent works (Das et al., 2023; Katsumata et al., 2023) that tackle the pseudo-monocular setting and showcase results on datasets with “teleporting” cameras or large amounts of effective multi-view information. Please refer to DyCheck (Gao et al., 2022) for a thorough overview of this phenomenon. In contrast, our approach is intended for any casual monocular video.

2.3. Other Neural Scene Representations for Dynamic Scenes

Many earlier works have explored different neural scene representations for dynamic scenes. One family of works extends neural radiance fields (NeRFs) (Mildenhall et al., 2020) to 4D (Ramasinghe et al., 2024; Wang et al., 2021a; Song et al., 2023; Li et al., 2021; Gao et al., 2021, 2022; Cao and Johnson, 2023; Fridovich-Keil et al., 2023; Bui et al., 2023; Jang and Kim, 2022), treating time as a fourth dimension and as an additional coordinate in the neural field. Another approach is to combine a “canonical” 3D NeRF with a time-conditioned deformation field (Liu et al., 2023; Fang et al., 2022; Park et al., 2021a, b; Johnson et al., 2023; Kirschstein et al., 2023; Wang et al., 2023; Tretschk et al., 2020). The deformation field can help disentangle motion and geometry, resulting in a more constrained and better-behaved scene, especially in the monocular case. A few works explored NeRF-based representations for the casual monocular setting: DynIBaR and MonoNeRF (Li et al., 2023b; Tian et al., 2023) showed compelling results by combining NeRF with image-based rendering (Wang et al., 2021b), and Wang et al. (2024) used diffusion to regularize a 4D NeRF (Cao and Johnson, 2023; Fridovich-Keil et al., 2023). Finally, some concurrent works explore alternate plane-based and feed-forward representations for the casual monocular setting (Zhao et al., 2024; Lee et al., 2023b).

In contrast to these representations, dynamic Gaussians have advantages in tracking, fast rendering, and compositional editability.

3. Preliminaries

3.1. 3D Gaussian Splatting

3D Gaussian splatting (Kerbl et al., 2023) is a differentiable rendering pipeline that represents a scene as a collection of 3D Gaussians and “splats” them onto the image plane. Concretely, a 3D scene is represented by 3D Gaussians, $\mathcal{G}$ , with each Gaussian parameterized by its mean $\mu\in\mathbb{R}^{3}$ , rotation $R\in\mathbb{R}^{3\times 3}$ , scale $s\in\mathbb{R}^{3}$ , color $c\in\mathbb{R}^{3}$ , and opacity $\alpha\in\mathbb{R}$ . Importantly, the scale and rotation can be composed into a 3D covariance matrix, $\Sigma=RSS^{T}R^{T}$ , where $S$ is the $3\times 3$ diagonal scaling matrix.
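To make the covariance composition concrete, the following is a minimal pure-Python sketch of $\Sigma=RSS^{T}R^{T}$. The helper names (`matmul`, `transpose`, `covariance_from_rotation_scale`) are ours for illustration and do not come from any 3DGS codebase, which would typically compute this on the GPU from a quaternion.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two small dense matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    """Return the transpose of a nested-list matrix."""
    return [list(row) for row in zip(*a)]


def covariance_from_rotation_scale(rotation: Matrix,
                                   scale: Sequence[float]) -> Matrix:
    """Return Sigma = R S S^T R^T, where S = diag(scale)."""
    n = len(scale)
    s = [[scale[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    rs = matmul(rotation, s)
    # (R S)(R S)^T equals R S S^T R^T, so one product suffices.
    return matmul(rs, transpose(rs))
```

Because the result is formed as $(RS)(RS)^{T}$, it is symmetric and positive semi-definite by construction, which is the reason for optimizing $R$ and $s$ rather than the entries of $\Sigma$ directly.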

Given 3D Gaussians $\mathcal{G}$ and a camera viewing transformation $W$ , the covariance matrix in camera coordinates can be computed as $\Sigma^{\prime}=JW\Sigma W^{T}J^{T}$ , where $J$ is the Jacobian of the approximately-affine projective transformation. Given camera-aligned Gaussians, the pipeline executes a highly-efficient differentiable tile-based rasterization of Gaussians. Specifically, the image is divided into $16\times 16$ tiles, and for each tile, all influencing Gaussians are depth-sorted and alpha-composited in the image plane. In contrast to volumetric rendering approaches (Mildenhall et al., 2020), Gaussian Splatting is extremely efficient, and often renders over $100$ times faster than its volumetric counterpart.
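The depth-sorted alpha compositing step can be illustrated for a single pixel as below. This is a simplified sketch rather than the tiled GPU rasterizer: the input format (a list of `(depth, alpha, color)` tuples, with `alpha` already evaluated at the pixel) is an assumption made for the example.

```python
from typing import Sequence, Tuple

Color = Tuple[float, float, float]
Splat = Tuple[float, float, Color]  # (depth, alpha at this pixel, RGB color)


def composite_front_to_back(splats: Sequence[Splat]) -> Color:
    """Alpha-composite the splats covering one pixel, nearest first."""
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed by nearer splats
    for _depth, alpha, color in sorted(splats, key=lambda s: s[0]):
        weight = transmittance * alpha
        for channel in range(3):
            pixel[channel] += weight * color[channel]
        transmittance *= 1.0 - alpha
    return (pixel[0], pixel[1], pixel[2])
```

For instance, a half-transparent red splat in front of an opaque blue one yields an equal mix of the two colors, regardless of the order in which the splats are passed in.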

4. Method

We provide an overview of DGMarbles in Figure 2. We take as input a casually captured monocular video (i.e., a sequence of images captured by a single camera traversing a dynamic scene). We begin by initializing a set of Gaussian “marbles” for each frame. We consider these initial marbles to have trajectories of length 1. We next seek to merge these disjoint sets of short-trajectory marbles into much longer trajectories. We use a bottom-up divide-and-conquer merging strategy as depicted in Figure 4: we take two temporally adjacent marble sets, and merge them into a single set of marbles with trajectories of doubled length, and iterate this until we have fewer sets of marbles with much longer trajectories. Each iteration of the merging stage involves a short optimization, where we use rendering losses, tracking losses, and geometric regularizations, to guide the marble sets into correspondence. At inference, we use the learned Gaussian trajectories to render into any timestep.

4.1. Dynamic Gaussian Marbles

4.1.1. Definition

Following Kerbl et al. (2023), our scene representation is a set of Gaussians, $\mathcal{G}$ . Different from the original formulation, our Gaussians are isotropic: each Gaussian’s orientation is the identity matrix (i.e. $R=\mathbf{I}$ ), and the scale can be written as a scalar value (i.e. $s\in\mathbb{R}^{1}$ ). To emphasize their spherical shape, we use the name Gaussian marbles. We assign each Gaussian marble to a semantic instance (where instances are provided by an off-the-shelf image segmentation method, described later), denoted as $y\in\mathbb{N}$ . Finally, to make each Gaussian marble “dynamic”, we equip it with a trajectory, represented as a sequence of translations mapping from its initial position $\mu$ to its position at every other timestep. We denote the sequence of translations over a $T$ frame sequence as $\Delta\mathbf{X}\in\mathbb{R}^{T\times 3}$ .
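One way to picture this representation is as a small record per marble. The sketch below is illustrative only: the field names, and the `first_frame` bookkeeping field in particular, are our assumptions and not the authors' data layout.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class GaussianMarble:
    """One isotropic Gaussian with a per-frame translation trajectory."""
    mean: Vec3            # initial position mu
    scale: float          # single scalar scale s (isotropic)
    color: Vec3
    opacity: float
    instance: int         # semantic instance label y
    first_frame: int = 0  # hypothetical: frame index of translations[0]
    translations: List[Vec3] = field(
        default_factory=lambda: [(0.0, 0.0, 0.0)])

    def position_at(self, frame: int) -> Vec3:
        """Position at `frame`: mu plus the translation stored for it."""
        idx = frame - self.first_frame
        if not 0 <= idx < len(self.translations):
            raise IndexError(f"frame {frame} is outside this trajectory")
        dx, dy, dz = self.translations[idx]
        return (self.mean[0] + dx, self.mean[1] + dy, self.mean[2] + dz)

    def covariance(self) -> List[List[float]]:
        """With R = I and a scalar scale, Sigma reduces to s^2 * I."""
        s2 = self.scale ** 2
        return [[s2 if i == j else 0.0 for j in range(3)] for i in range(3)]
```

Compared with an anisotropic Gaussian, the shape is described by one number instead of a rotation plus three scales, which is the reduction in degrees of freedom discussed in the text.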

4.1.2. Why Isotropic Marbles?

While anisotropic Gaussians are far more expressive, we find that the extra degrees of freedom are poorly suited for the underconstrained monocular setting. We refer to Figure 3 as a simple illustration of this phenomenon. In this toy experiment, we train anisotropic Gaussians and our Gaussian marbles on a single monocular image for 100K iterations. As observed, anisotropic Gaussians fit the training image in a manner that does not generalize to new views, leading to obvious visual artifacts. In contrast, the simpler marbles generalize to novel views.

4.2. Divide-and-Conquer Motion Estimation

4.2.1. Overview

Our learning strategy divides the input video into short subsequences, and then optimizes the joining of these subsequences, rather than attempting to optimize the full video at once. For a subsequence containing frames $i$ to $j$ inclusive, we denote the corresponding set of Gaussian marbles as $\mathcal{G}_{ij}$ , meaning that $\mathcal{G}_{ij}$ only contains trajectories that span frames $i$ to $j$ . As outlined in Figure 4, the learning algorithm consists of three iterative stages: motion estimation, merging, and global adjustment.

4.2.2. Initializing Gaussian Marbles

We initialize a distinct set of Gaussian marbles per frame, yielding a sequence of Gaussian sets $[\mathcal{G}_{11},\mathcal{G}_{22},\ldots,\mathcal{G}_{TT}]$ . As mentioned earlier, $\mathcal{G}_{ij}$ denotes that each Gaussian trajectory only covers the subsequence of frames $i\to j$ ; thus, our initial $\mathcal{G}_{ii}$ trivially contain trajectories of length 1 (i.e., coordinates for one timestep).

We achieve this initialization as follows. For each frame of the video, we obtain a monocular (or LiDAR) depthmap as well as off-the-shelf temporally-consistent segmentations from the SAM-driven TrackAnything model (Kirillov et al., 2023; Yang et al., 2023a). We then unproject the depth map into a point cloud, and perform outlier removal and downsampling. Then, for each point coordinate $p$ , we initialize a Gaussian marble with mean $\mu=p$ , color $c$ as the pixel color, instance class $y$ as the segmentation prediction, and we follow the original protocol (Kerbl et al., 2023) to initialize scales and opacities. Finally, we initialize the sequence of translations $\Delta\mathbf{X}=[\mathbf{0}]$ , i.e. as a length-1 sequence of 0 translation.
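The unprojection step can be sketched with a standard pinhole camera model. The intrinsics (`fx`, `fy`, `cx`, `cy`) and the `stride` argument are assumptions for this example; the `stride` subsampling is only a crude stand-in for the outlier removal and downsampling described above.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def unproject_depth(depth: Sequence[Sequence[float]],
                    fx: float, fy: float, cx: float, cy: float,
                    stride: int = 1) -> List[Point]:
    """Lift a depth map to camera-space points with a pinhole model.

    Pixels with non-positive depth are skipped as invalid.
    """
    points: List[Point] = []
    for v in range(0, len(depth), stride):
        for u in range(0, len(depth[v]), stride):
            z = depth[v][u]
            if z <= 0.0:
                continue
            # Invert u = fx * x / z + cx and v = fy * y / z + cy.
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points
```

Each returned point would then seed one marble, with its color and instance label read from the same pixel.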

4.2.3. Motion Estimation Phase

While training on a video with $T$ frames, we will always have a list of Gaussian Marble sets:

$$\mathbf{G}=[\mathcal{G}_{1K},\;\mathcal{G}_{(K+1)(2K)},\;\mathcal{G}_{(2K+1)(3K)},\;\ldots,\;\mathcal{G}_{(cK+1)T}]$$

with each set of Gaussian marbles covering a length- $K$ subsequence. To reduce notation and create a simple working example, we will proceed in this section using $K=2$ and $T=8$ , giving us the sequence $\mathbf{G}=[\mathcal{G}_{12},\;\mathcal{G}_{34},\;\mathcal{G}_{56},\;\mathcal{G}_{78}]$ .

In the motion estimation phase, we begin by forming pairs of adjacent Gaussian marble sets, i.e. $[(\mathcal{G}^{a}_{12},\;\mathcal{G}^{b}_{34}),\;(\mathcal{G}^{a}_{56},\;\mathcal{G}^{b}_{78})]$ , where $a$ and $b$ denote whether a set appears earlier or later than its partner. Our goal is to learn a mapping for every Gaussian in $\mathcal{G}^{a}$ into every frame covered by $\mathcal{G}^{b}$ , and vice versa. To learn these motions, we will render $\mathcal{G}^{a}$ into frames covered by $\mathcal{G}^{b}$ and apply the gradient update only to the Gaussian trajectories in $\mathcal{G}^{a}$ , and vice versa.
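The bookkeeping behind this pairing can be sketched as follows, using inclusive 1-indexed frame intervals to stand in for the marble sets. How an odd number of sets is handled is not stated in the text, so leaving the trailing set unpaired is our assumption.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # inclusive (first_frame, last_frame), 1-indexed


def partition(num_frames: int, k: int) -> List[Interval]:
    """Split frames 1..T into consecutive subsequences of length k.

    The final interval is shorter when k does not divide T.
    """
    return [(start, min(start + k - 1, num_frames))
            for start in range(1, num_frames + 1, k)]


def adjacent_pairs(
        intervals: List[Interval]) -> List[Tuple[Interval, Interval]]:
    """Pair each earlier set (a) with its later neighbour (b).

    A trailing interval without a partner is left out of this round
    (an assumption made for this sketch).
    """
    return [(intervals[i], intervals[i + 1])
            for i in range(0, len(intervals) - 1, 2)]
```

With $K=2$ and $T=8$ , `partition(8, 2)` reproduces the four intervals of the working example, and `adjacent_pairs` groups them into the two $(a, b)$ pairs.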

More concretely, in the case of $\mathcal{G}^{a}_{12}$ , we start by extending the trajectories of the marbles, using a constant-velocity assumption: this results in an expanded trajectory, i.e.

$$\Delta\mathbf{X}=[\Delta\mathbf{x}_{1},\Delta\mathbf{x}_{2}]\;\;\to\;\;\Delta\mathbf{X}=[\Delta\mathbf{x}_{1},\Delta\mathbf{x}_{2},\Delta\mathbf{x}_{3}^{\text{init}}]$$

We then render $\mathcal{G}^{a}_{12}$ into frame $3$ , and compute our optimization objectives (to be described in the next subsection), and backpropagate gradient updates into $\Delta\textbf{x}_{3}$ , i.e. the translation into frame 3. We end this optimization after a fixed number of iterations, $\eta$ . We repeat this for each missing frame in the sequence, until we have a trajectory that covers all frames in $\mathcal{G}^{b}$ .
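A minimal sketch of the trajectory extension is given below. We interpret the constant-velocity assumption as linear extrapolation of the last two translations, and we hold a length-1 trajectory still because it carries no velocity estimate; both choices are our assumptions about the initialization, not details confirmed by the text.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def extrapolate(prev: Vec3, last: Vec3) -> Vec3:
    """Continue the step from `prev` to `last` by one more frame."""
    return tuple(2.0 * l - p for l, p in zip(last, prev))


def extend_trajectory(translations: List[Vec3],
                      forward: bool = True) -> List[Vec3]:
    """Return a copy with one extra constant-velocity translation.

    forward=True appends a translation for the next frame (the G^a case);
    forward=False prepends one for the previous frame (the G^b case).
    """
    if forward:
        prev = translations[-2] if len(translations) > 1 else translations[-1]
        return translations + [extrapolate(prev, translations[-1])]
    nxt = translations[1] if len(translations) > 1 else translations[0]
    return [extrapolate(nxt, translations[0])] + translations
```

The appended value only plays the role of $\Delta\mathbf{x}_{3}^{\text{init}}$ ; it is the subsequent $\eta$ optimization steps that refine it.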

4.2.4. Merging

The result of motion estimation is that we have two sets of Gaussian marbles which reconstruct the same subsequence. In other words, each pair $(\mathcal{G}^{a}_{ij},\mathcal{G}^{b}_{ij})$ covers the same interval $[i,j]$ . Because they cover the same subsequence, we can trivially merge the pair by taking the union of all the Gaussian marbles, $\mathcal{G}_{ij}=\mathcal{G}^{a}_{ij}\cup\mathcal{G}^{b}_{ij}$ , yielding a set twice the size of the original sets. To avoid excessive computational burden, we drop Gaussians of low opacity and small scale, and additionally perform random downsampling, to keep the set size constant.
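The merge itself can be sketched in a few lines. The dictionary record format and the `min_opacity` and `min_scale` thresholds are hypothetical values chosen for illustration, and dropping a marble when it falls below either threshold is one possible reading of the pruning rule.

```python
import random
from typing import Dict, List, Sequence

Marble = Dict[str, float]  # hypothetical minimal record: opacity and scale


def merge_marble_sets(set_a: Sequence[Marble], set_b: Sequence[Marble],
                      max_size: int, min_opacity: float = 0.05,
                      min_scale: float = 1e-3, seed: int = 0) -> List[Marble]:
    """Union two marble sets, prune weak marbles, then downsample."""
    merged = list(set_a) + list(set_b)
    kept = [m for m in merged
            if m["opacity"] >= min_opacity and m["scale"] >= min_scale]
    if len(kept) > max_size:
        # Random downsampling keeps the set size constant across merges.
        kept = random.Random(seed).sample(kept, max_size)
    return kept
```

Passing the pre-merge set size as `max_size` keeps memory and rendering cost flat as trajectories grow longer.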

4.2.5. Global Adjustment Phase

After merging sets of Gaussian marbles, there is no guarantee that the new resulting set still satisfies our optimization objectives. Thus, we jointly optimize all Gaussian properties of the newly merged set. Specifically, for the new $\mathcal{G}_{ij}$ , we repeatedly randomly sample a frame within $[i,j]$ , and render all Gaussians into this frame, and we backpropagate gradient updates to Gaussian colors, scales, opacities, and trajectory offsets. We repeat this global adjustment for $\beta$ iterations.
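Putting the three phases together, the order of operations can be sketched as a plan over frame intervals. Only the bookkeeping is shown: each listed phase would run its own optimization. Processing one pair at a time, and carrying an unpaired trailing set into the next round unchanged, are assumptions made for this sketch.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # inclusive frame range covered by one marble set


def divide_and_conquer_plan(num_frames: int, max_length: int) -> List[tuple]:
    """List the phases visited when merging per-frame sets bottom-up.

    Merging stops once trajectories reach `max_length` frames.
    """
    sets: List[Interval] = [(t, t) for t in range(1, num_frames + 1)]
    plan: List[tuple] = []
    length = 1
    while length < max_length and len(sets) > 1:
        merged_sets: List[Interval] = []
        for i in range(0, len(sets) - 1, 2):
            a, b = sets[i], sets[i + 1]
            merged = (a[0], b[1])
            plan.append(("motion_estimation", a, b))
            plan.append(("merge", merged))
            plan.append(("global_adjustment", merged))
            merged_sets.append(merged)
        if len(sets) % 2 == 1:
            merged_sets.append(sets[-1])  # no partner in this round
        sets = merged_sets
        length *= 2
    return plan
```

Because the trajectory length doubles each round, a video of $T$ frames needs on the order of $\log_{2}T$ rounds to reach full-length trajectories.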

4.2.6. Why divide and conquer?

Our divide-and-conquer learning strategy guides the underconstrained optimization problem toward more realistic solutions. In particular, the motion estimation benefits from the locality and smoothness of adding a single additional frame at a time, similar to Dynamic 3D Gaussians (Luiten et al., 2024), while the global adjustment phase contributes global coherence, similar to 4D Gaussians (Wu et al., 2023). By alternating between the two phases, we aim to get the best of both worlds.

Reported: mPSNR $\uparrow$ / LPIPS $\downarrow$
DyCheck iPhone - Without Camera Pose
Method	Apple	Block	Spin	Paper Windmill	Space-Out	Teddy	Wheel	Mean
Dyn. Gaussians	7.96 / 0.775	7.13 / 0.737	9.15 / 0.635	6.732 / 0.736	7.42 / 0.698	7.75 / 0.709	7.03 / 0.641	7.60 / 0.704
4D Gaussians	14.44 / 0.716	12.30 / 0.706	12.77 / 0.697	14.46 / 0.790	14.93 / 0.640	11.86 / 0.729	10.99 / 0.803	13.11 / 0.726
DGMarbles (ours)	16.28 / 0.460	15.76 / 0.353	17.38 / 0.370	14.94 / 0.420	15.41 / 0.410	13.20 / 0.433	13.36 / 0.403	15.19 / 0.407
DyCheck iPhone - With Camera Pose
Method	Apple	Block	Spin	Paper Windmill	Space-Out	Teddy	Wheel	Mean
Dyn. Gaussians	7.65 / 0.766	7.55 / 0.684	8.08 / 0.651	6.24 / 0.729	6.79 / 0.733	7.41 / 0.690	7.28 / 0.593	7.29 / 0.692
4D Gaussians	15.41 / 0.450	11.28 / 0.633	14.42 / 0.339	15.60 / 0.297	14.60 / 0.372	12.36 / 0.466	11.79 / 0.436	13.64 / 0.428
DGMarbles (ours)	17.57 / 0.463	16.88 / 0.427	15.49 / 0.412	18.67 / 0.392	15.99 / 0.446	13.57 / 0.547	14.04 / 0.367	16.03 / 0.436
Nvidia Dynamic Scenes
Method	Balloon1	Balloon2	Jumping	Playground	Skating	Truck	Umbrella	Mean
Dyn. Gaussians	8.68 / 0.660	13.70 / 0.375	11.11 / 0.592	11.91 / 0.424	13.32 / 0.449	15.58 / 0.377	10.20 / 0.743	12.07 / 0.517
4D Gaussians	14.11 / 0.404	18.56 / 0.239	17.32 / 0.326	13.51 / 0.341	19.41 / 0.218	21.25 / 0.172	19.00 / 0.346	17.59 / 0.292
DGMarbles (ours)	23.58 / 0.152	22.42 / 0.232	20.43 / 0.173	17.20 / 0.307	24.22 / 0.119	26.41 / 0.109	23.20 / 0.246	22.49 / 0.191

Table 1. We report PSNR and LPIPS metrics of DGMarbles and Gaussian baselines on the DyCheck iPhone dataset without camera pose, the DyCheck iPhone dataset with camera pose, and the Nvidia Dynamic Scenes dataset. Overall, DGMarbles significantly outperforms the previous Gaussian baselines.

4.3. Losses

At each optimization step of our divide-and-conquer algorithm, we employ a variety of loss terms to help drive the Gaussians towards a realistic factorization of scene geometry and motion.

4.3.1. Tracking Loss

Building on recent advances in point tracking (Harley et al., 2022; Karaev et al., 2023), we regularize the Gaussian marble trajectories to agree with off-the-shelf 2D point tracks. When optimizing the Gaussians into a target timestep $j$ , we use CoTracker (Karaev et al., 2023) to estimate a $100\times 100$ grid of point tracks in the adjacent frames $[j-w,j+w]$ , with $w=12$ . Then, we randomly sample a source frame $i\in[j-w,j+w]$ . Next, for source $i$ and target $j$ , we use our learned Gaussian trajectories to map the 3D Gaussians into frames $i$ and $j$ , and then project the Gaussians into the image plane, computing the Gaussian 2D means, depths, and 2D covariances. Finally, we regularize the 2D Gaussian motion from source to target to match the point tracks: for each tracked point $p_{i\to j}$ from frame $i$ to $j$ , we find the $K=32$ nearest Gaussians in 2D, and compute a loss that discourages these Gaussians from changing their distance to the tracked point:

$$\mathcal{L}_{\text{track}}=\sum_{p\in P}\sum_{g\in\mathcal{N}(p_{i})}\alpha^{\prime}_{i}\;\big\|\;D_{i}\|\mu^{\prime}_{i}-p_{i}\|_{2}-D_{j}\|\mu^{\prime}_{j}-p_{j}\|_{2}\;\big\|, \tag{1}$$

where $\mu^{\prime}_{i}$ is the projected location of a Gaussian center $\mu_{i}$ , $D_{i}$ is the Gaussian’s depth in frame $i$ , $\alpha^{\prime}_{i}$ is the Gaussian’s opacity contribution to $p_{i}$ , $P$ is the set of point tracks, and $\mathcal{N}(p)$ is the set of Gaussians that neighbor a pixel $p$ .
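For a single tracked point, the inner sum of the tracking loss can be sketched as below. The dictionary keys are hypothetical names for quantities the rasterizer would supply, and the search for the nearest Gaussians is assumed to have been done already.

```python
import math
from typing import Dict, Sequence, Tuple

Vec2 = Tuple[float, float]


def tracking_loss(track_i: Vec2, track_j: Vec2,
                  neighbors: Sequence[Dict[str, object]]) -> float:
    """Tracking loss for one tracked point over its neighbouring Gaussians.

    Each neighbour is a dict with hypothetical keys:
      "alpha":   opacity contribution at the source pixel,
      "mean_i":  projected 2D mean in frame i,  "depth_i": depth in frame i,
      "mean_j":  projected 2D mean in frame j,  "depth_j": depth in frame j.
    """
    loss = 0.0
    for g in neighbors:
        # Depth-scaled 2D distance between the Gaussian and the tracked point.
        dist_i = g["depth_i"] * math.dist(g["mean_i"], track_i)
        dist_j = g["depth_j"] * math.dist(g["mean_j"], track_j)
        loss += g["alpha"] * abs(dist_i - dist_j)
    return loss
```

The loss is zero when a Gaussian keeps the same depth-scaled offset from the tracked point in both frames, so it constrains relative motion rather than pinning each Gaussian onto a track.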

4.3.2. Rendering Losses

At each training iteration, we render an image, disparity map, and segmentation map. For each, we compute a standard L1 loss against the ground truth image, the initial disparity estimate, and the off-the-shelf instance segmentation, respectively.

4.3.3. Geometry Losses

Isometry Loss

Following previous works (Prokudin et al., 2023; Luiten et al., 2024), we regularize our Gaussian marbles to follow locally rigid motion. In particular, we penalize Gaussians for moving in a manner that breaks isometric deformation on local neighborhoods. Specifically, when rendering into frame $j$ , we select a random source timestep $i$ and compute a local neighborhood isometry loss as follows:

$$\mathcal{L}_{\text{iso-local}}=\sum_{g^{a}\in\mathcal{G}}\sum_{g^{b}\in\mathcal{N}(g^{a})}\big|\,\|\mu^{a}_{i}-\mu^{b}_{i}\|-\|\mu^{a}_{j}-\mu^{b}_{j}\|\,\big| \tag{2}$$

where $\mu^{a}_{i}$ and $\mu^{b}_{i}$ are the means of the Gaussian marbles $g^{a}$ and $g^{b}$ at timestep $i$ .

In addition to local isometry, we incorporate an instance isometry loss that guides each unique semantic instance to move in a nearly-isometric manner. That is, when rendering into frame $j$ , we select a random frame $i$ and compute the instance isometry loss as follows:

$$\mathcal{L}_{\text{iso-instance}}=\sum_{g^{a}\in\mathcal{G}}\sum_{g^{b}\in Y(g^{a})}\big|\,\|\mu^{a}_{i}-\mu^{b}_{i}\|-\|\mu^{a}_{j}-\mu^{b}_{j}\|\,\big| \tag{3}$$

where $Y(g)$ denotes all Gaussians with the same semantic instance label. Taken together, our final isometry loss is a weighted combination of the two:

$$\mathcal{L}_{\text{iso}}=\lambda\mathcal{L}_{\text{iso-local}}+\sigma\mathcal{L}_{\text{iso-instance}} \tag{4}$$
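Both isometry terms share the same pairwise form and differ only in which pairs are summed, as the sketch below shows. The brute-force nearest-neighbour search, and the choice to compute neighbourhoods in the source frame $i$ , are our assumptions; a practical implementation would batch this on the GPU and likely subsample the instance pairs.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Neighbors = Dict[int, List[int]]


def pairwise_isometry(pos_i: Sequence[Vec3], pos_j: Sequence[Vec3],
                      neighbors: Neighbors) -> float:
    """Sum of | ||mu_a_i - mu_b_i|| - ||mu_a_j - mu_b_j|| | over pairs."""
    return sum(abs(math.dist(pos_i[a], pos_i[b])
                   - math.dist(pos_j[a], pos_j[b]))
               for a, nbrs in neighbors.items() for b in nbrs)


def knn_neighbors(positions: Sequence[Vec3], k: int) -> Neighbors:
    """Indices of the k nearest other points for every point."""
    result: Neighbors = {}
    for a, pa in enumerate(positions):
        others = sorted((b for b in range(len(positions)) if b != a),
                        key=lambda b: math.dist(pa, positions[b]))
        result[a] = others[:k]
    return result


def instance_neighbors(labels: Sequence[int]) -> Neighbors:
    """For every point, all other points sharing its instance label."""
    return {a: [b for b, lb in enumerate(labels) if lb == la and b != a]
            for a, la in enumerate(labels)}


def isometry_loss(pos_i: Sequence[Vec3], pos_j: Sequence[Vec3],
                  labels: Sequence[int], k: int,
                  lam: float, sigma: float) -> float:
    """Weighted sum of the local and per-instance isometry terms."""
    local = pairwise_isometry(pos_i, pos_j, knn_neighbors(pos_i, k))
    instance = pairwise_isometry(pos_i, pos_j, instance_neighbors(labels))
    return lam * local + sigma * instance
```

Any rigid motion of the whole point set leaves every pairwise distance unchanged and therefore incurs zero loss, while stretching or shearing is penalized.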

3D Alignment Loss

When merging two distinct sets of Gaussian marbles, it is important that the two sets not only align in the projected image plane, but also in 3D space. Notably, without guiding the optimization towards 3D alignment, we find the resulting merge is “cloudy” in 3D (or in novel views), even if the training-view 2D projection is sharp.

Our 3D alignment loss consists of two parts. First, we reduce the total variation of all Gaussian depths, to bring more Gaussians to the surface of the scene. Concretely, for each pixel, we regularize the Gaussians contributing to that pixel to have a similar depth:

$$\mathcal{L}_{\text{TV-depth}}=\sum_{p\in P}\sum_{g^{a}\in\alpha(p)}\alpha^{\prime a}_{p}\;\big|D^{a}-\bar{D}\big| \tag{5}$$

where $\alpha(p)$ indicates the subset of Gaussians that contribute to the pixel $p$ , $\alpha^{\prime a}_{p}$ is the opacity contribution of Gaussian $a$ on pixel $p$ , and $\bar{D}$ is the weighted-mean depth of all contributing Gaussians, i.e. $\bar{D}=\sum_{g^{a}\in\alpha(p)}\alpha^{\prime a}_{p}D^{a}$ .
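The depth total-variation term can be sketched per pixel as follows. The code follows the formula as written, so the opacity contributions are assumed to sum to one for $\bar{D}$ to be a true weighted mean; the list-of-pairs input format is our choice for the example.

```python
from typing import Sequence, Tuple

Contribution = Tuple[float, float]  # (opacity contribution, depth)


def depth_tv_pixel(contribs: Sequence[Contribution]) -> float:
    """Opacity-weighted absolute deviation from the weighted-mean depth."""
    mean_depth = sum(alpha * depth for alpha, depth in contribs)
    return sum(alpha * abs(depth - mean_depth) for alpha, depth in contribs)


def depth_tv(pixels: Sequence[Sequence[Contribution]]) -> float:
    """Sum the per-pixel term over all pixels."""
    return sum(depth_tv_pixel(contribs) for contribs in pixels)
```

A pixel explained by a single surface layer incurs no penalty, whereas a pixel blended from Gaussians at very different depths, the “cloudy” case, is penalized in proportion to the spread.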

Second, we include a weakly-weighted Chamfer loss to directly align the two sets of Gaussians. Concretely, we divide the set of Gaussians $\mathcal{G}$ into two random subsets of equal size, $\mathcal{G}^{a}$ and $\mathcal{G}^{b}$ . Then, we compute a 2-way Chamfer distance between the coordinates of these sets, using the means of the Gaussians as the coordinates:

$$\mathcal{L}_{\text{chamfer}}=\sum_{g^{a}\in\mathcal{G}^{a}}\min_{g^{b}\in\mathcal{G}^{b}}\big\|\mu^{a}-\mu^{b}\big\|_{2}+\sum_{g^{b}\in\mathcal{G}^{b}}\min_{g^{a}\in\mathcal{G}^{a}}\big\|\mu^{a}-\mu^{b}\big\|_{2} \tag{6}$$
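A brute-force sketch of the two-way Chamfer distance and the random split is given below. It runs in quadratic time and is meant only to illustrate the definition; the `seed` argument and the choice to drop one element when the set size is odd are our assumptions.

```python
import math
import random
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def chamfer_distance(a: Sequence[Vec3], b: Sequence[Vec3]) -> float:
    """Two-way Chamfer distance between two non-empty point sets."""
    forward = sum(min(math.dist(p, q) for q in b) for p in a)
    backward = sum(min(math.dist(p, q) for p in a) for q in b)
    return forward + backward


def random_halves(means: Sequence[Vec3],
                  seed: int = 0) -> Tuple[List[Vec3], List[Vec3]]:
    """Split the Gaussian means into two random subsets of equal size."""
    shuffled = list(means)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:2 * half]
```

Since the two halves are drawn at random from the merged set, minimizing this term pulls the whole set toward a single consistent surface rather than aligning one fixed subset to another.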

In all, our 3D alignment loss is a weighted linear combination of the depth total-variation loss and the Chamfer loss:

$$\mathcal{L}_{\text{alignment}}=\xi\mathcal{L}_{\text{TV-depth}}+\phi\mathcal{L}_{\text{chamfer}} \tag{7}$$

5. Experiments

	PCK-T @5% $\uparrow$
Method	Apple	Block	Paper	Space	Spin	Teddy	Wheel	Mean
Nerfies	0.400	0.239	0.091	0.795	0.115	0.795	0.147	0.301
HyperNeRF	0.214	0.048	0.069	0.765	0.076	0.698	0.238	0.369
Dyn. Gauss.	0.075	0.047	0.056	0.107	0.065	0.167	0.039	0.079
4D Gauss.	0.000	0.000	0.000	0.229	0.033	0.133	0.076	0.073
Ours	0.615	0.827	0.537	0.847	0.387	0.808	0.568	0.656

Table 2. We report the tracking metric PCK-T @5% for DGMarbles and baselines on the DyCheck iPhone dataset in the setting without camera pose. DGMarbles significantly outperforms previous methods in tracking.

5.1. Datasets

We evaluate our method and a set of competitive baselines on the standard Nvidia Dynamic Scenes (Yoon et al., 2020) and DyCheck iPhone (Gao et al., 2022) datasets. However, each of these popular datasets contains multi-view information. Thus, as we will discuss, we modify the training and evaluation protocol to emulate a more monocular setting.

5.1.1. Nvidia Dynamic Scenes Dataset

The Nvidia Dynamic Scenes dataset (Yoon et al., 2020) consists of seven videos, each between 90 and 200 frames and captured with a rig consisting of 12 calibrated cameras. We evaluate on seven captures: Balloon1, Balloon2, Jumping, Playground, Skating, Truck, and Umbrella. Importantly, the previously benchmarked evaluations on the Nvidia dataset sample a different training camera at each timestep, resulting in a “monocular teleporting camera” (Gao et al., 2022). We consider this setting unrealistic, and hence we instead train on the video stream from a single camera, specifically camera 4. We use the video streams from cameras 3, 5, and 6 for evaluation.

5.1.2. DyCheck iPhone Dataset

The DyCheck iPhone dataset (Gao et al., 2022) consists of seven casually-captured iPhone videos, each with up to two novel-view and time-synchronized validation videos. We evaluate on all scenes: Apple, Block, Paper Windmill, Space Out, Spin, Teddy, and Wheel. Unlike the Nvidia Dynamic Scenes dataset, the iPhone dataset is truly monocular, i.e. only permitting models to train on a single camera stream. Nevertheless, the training camera follows a purposeful 3D trajectory that circumnavigates the scene, maximally gathering multi-view information. While a valid monocular setting, the camera’s calculated motion is not representative of casually-captured videos (e.g., as might be found on YouTube). Thus, we evaluate in two settings. First, we follow the official benchmark and use the video stream and camera motion provided. Second, we remove camera poses, offloading the camera motion into the learned 4D scene representation’s dynamics. We find this setting interesting because it simulates additional dynamic content, where previously “static” regions of the scene now have rigid dynamics equal to the inverse camera motion, which must be solved by the scene representation itself.

5.2. Implementation Details

We initialize each frame with 200,000 Gaussians and fine-tune the initialization for $40$ optimization steps. For each frame, we run $\eta=100$ optimization steps during the motion estimation stage, and $\beta=40$ steps during the global adjustment stage. After each merge, we downsample back to 200,000 Gaussians. We set the tracking loss weight to $1.0$ , the rendered photometric loss weight to $0.7$ , the rendered depthmap loss weight to $0.1$ , the rendered segmentation loss weight to $0.4$ , the local isometry loss weight to $4000$ , and the per-instance isometry loss weight to $3.0$ . For DGMarbles and all baselines, we allocate a maximum compute budget of 24 training hours on a single NVIDIA A5000 GPU.

On the DyCheck dataset, we set the depth total-variation loss weight to $20$ and the Chamfer loss weight to $1$. Furthermore, we use the provided iPhone LiDAR as depth maps. Finally, we stop our divide-and-conquer curriculum after learning subsequences of length 8 on the foreground and length 32 on the background. Continuing the divide-and-conquer optimization (e.g., leading to full-length trajectories) is possible, but we found that this slightly degrades visual quality.

On the Nvidia dataset, we set the depth total-variation loss weight to $60$ and the Chamfer loss weight to $0$. We estimate monocular depth using Depth Anything (Yang et al., 2024a), and we stop the divide-and-conquer learning curriculum after learning subsequences of length 32, 8, 16, 4, 32, 32, and 16 for the scenes Balloon1, Balloon2, Jumping, Playground, Skating, Truck, and Umbrella, respectively.
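The per-scene stopping lengths can be read as the last level of a curriculum over subsequence lengths. The following sketch assumes that the curriculum starts from single frames and that each pairwise merge doubles the subsequence length; both are assumptions made for illustration, and the function names are hypothetical.

```python
# Stopping lengths for the Nvidia scenes, as listed in the text.
NVIDIA_STOP_LENGTH = {
    "Balloon1": 32, "Balloon2": 8, "Jumping": 16, "Playground": 4,
    "Skating": 32, "Truck": 32, "Umbrella": 16,
}

def curriculum_lengths(stop_length):
    """Subsequence lengths visited, assuming pairwise merges double the length."""
    lengths, n = [], 1
    while n <= stop_length:
        lengths.append(n)
        n *= 2
    return lengths

def partition(num_frames, length):
    """Split frame indices into contiguous subsequences of at most `length` frames."""
    return [list(range(start, min(start + length, num_frames)))
            for start in range(0, num_frames, length)]

for scene, stop in NVIDIA_STOP_LENGTH.items():
    print(scene, curriculum_lengths(stop))
```

Stopping early leaves the video covered by several shorter subsequences rather than one set of full-length trajectories.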

5.3. Dynamic Novel View Synthesis with Gaussians

We evaluate DGMarbles against the recent Dynamic 3D Gaussians (Luiten et al., 2024) and 4D Gaussians (Wu et al., 2023). We report the standard metrics mPSNR and LPIPS on novel view synthesis in Table 1. We see that DGMarbles significantly outperforms both Gaussian baselines on average across both datasets. In particular, DGMarbles significantly improves over baselines in settings with less multi-view information, i.e. the iPhone evaluation without camera pose and the Nvidia Dynamic Scenes evaluation.
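As a reminder of how a masked PSNR is computed, a minimal sketch follows. It assumes that the "m" prefix denotes the covisibility-masked variant from the DyCheck benchmark, that intensities lie in [0, 1], and that images are given as flat sequences; `masked_psnr` is a hypothetical helper, not the official evaluation code.

```python
import math

def masked_psnr(pred, target, mask, max_val=1.0):
    """PSNR in dB over the pixels where `mask` is truthy."""
    sq_err = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    if not sq_err:
        raise ValueError("mask selects no pixels")
    mse = sum(sq_err) / len(sq_err)
    if mse == 0.0:
        return math.inf  # identical images inside the mask
    return 10.0 * math.log10(max_val ** 2 / mse)

# The second pixel is masked out, so its large error does not affect the score.
score = masked_psnr([0.5, 0.0], [0.4, 1.0], [1, 0])
```

Masking restricts the metric to regions that the training views actually observe, so a method is not penalized for content it could never have seen.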

We visualize the results of DGMarbles and the baselines on the iPhone dataset (without camera pose) in Figure 1 and Figure 5. As shown, the existing Gaussian baselines exhibit poor novel view synthesis in this monocular setting, further emphasizing their need for strong multi-view supervision. In particular, 4D Gaussians converges to a local minimum, averaging the static information over all frames instead of correctly learning motion. On the other hand, the Gaussians in Dynamic 3D Gaussians immediately diverge from the scene geometry, overfitting to the training view in a manner that does not correctly render into novel views. We also compare with depth warping (Niklaus and Liu, 2020), and show that, by tracking and aggregating the scene over a larger time horizon, DGMarbles covers more of the scene’s content than per-frame warping, as illustrated in the missing/occluded regions in the first column in Figure 5.

We additionally provide a qualitative comparison of DGMarbles and 4D Gaussians on a real-world video as shown in Figure 7. As illustrated, 4D Gaussians fails to overfit even on the training images. Instead, the method again falls into a blurry and static local minimum that averages frames in the video. In contrast, DGMarbles attains a high-quality reconstruction, and also exhibits reasonable novel-view synthesis and tracking.

5.4. Tracking and Editing with Gaussians

In addition to novel view synthesis, dynamic Gaussians are well suited for dense point tracking. In Table 2, we report tracking on the DyCheck iPhone dataset in the setting without camera pose. We follow the official DyCheck evaluation and report the percentage of correct keypoints tracked (PCK-T) at a $5\%$ interval; we note that all DyCheck keypoints are in training-view images. As shown, DGMarbles significantly outperforms all other NeRF and Gaussian methods in tracking. We also visualize dense DGMarbles point tracks for both the training view and a novel view in Figure 6. We see that DGMarbles successfully tracks the dense scene geometry. Furthermore, the visualization shows a clear distinction between foreground motion and the rigid background motion (which equals the inverse camera motion due to holding out camera pose information).
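The PCK-T metric can be sketched as below. We assume that a keypoint counts as correct when its predicted 2D location lies within 5% of the longer image side from the ground truth; this threshold convention and the `pck_t` name are assumptions for illustration and should be checked against the official DyCheck evaluation code.

```python
import math

def pck_t(pred_kps, gt_kps, image_wh, ratio=0.05):
    """Fraction of keypoints within ratio * max(image_wh) pixels of ground truth."""
    if len(pred_kps) != len(gt_kps) or not gt_kps:
        raise ValueError("need equally sized, non-empty keypoint lists")
    threshold = ratio * max(image_wh)
    hits = sum(math.dist(p, g) <= threshold for p, g in zip(pred_kps, gt_kps))
    return hits / len(gt_kps)

# With a 100x200 image the threshold is 10 pixels: one of two keypoints is correct.
score = pck_t([(0, 0), (0, 0)], [(3, 4), (30, 40)], image_wh=(100, 200))
```

For a Gaussian representation, the predicted keypoint would come from projecting the tracked 3D trajectory into the target frame.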

In Figure 7, we show that our tracking permits editing videos in a temporally consistent manner. We color the tiger blue in the first frame, and the Gaussian marbles propagate the edit throughout the entire video. This emphasizes that dynamic Gaussians are a good choice of scene representation for editing applications.
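The reason a single-frame edit propagates can be illustrated with a toy sketch. It assumes that each Gaussian carries one time-invariant color alongside its per-frame positions; the data layout and the `propagate_color_edit` helper are hypothetical simplifications, not the actual DGMarbles data structures.

```python
def propagate_color_edit(colors, trajectories, selected, new_color):
    """Recolor the selected Gaussians once and pair the result with every frame.

    colors:       one color per Gaussian (shared across time)
    trajectories: list over frames, each a list of per-Gaussian positions
    selected:     indices of the Gaussians chosen in the edited frame
    """
    edited = list(colors)
    for idx in selected:
        edited[idx] = new_color
    # Every frame reuses the same edited colors, so the edit follows the motion.
    return [list(zip(frame_positions, edited)) for frame_positions in trajectories]

frames = propagate_color_edit(
    colors=["orange", "green"],
    trajectories=[[(0, 0, 0), (1, 0, 0)], [(0, 1, 0), (1, 1, 0)]],
    selected=[0],
    new_color="blue",
)
```

Because the edit is stored on the Gaussian rather than on pixels, temporal consistency follows directly from the learned trajectories.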

5.5. Dynamic Novel View Synthesis with NeRF

We compare DGMarbles with competitive NeRF baselines in Table 3. As reported, DGMarbles is on-par with NeRF baselines. Interestingly, we find that DGMarbles does very well on the iPhone evaluation without camera pose, and we speculate that the volumetric NeRF approaches struggle in the absence of a static background (i.e. when the entire scene moves), while Gaussians can better handle the more expansive dynamic region. In contrast, when the multi-view iPhone camera poses are provided, and thus stronger multi-view supervision is present, DGMarbles does worse than NeRF counterparts. Still, we again emphasize that DGMarbles exhibits significantly faster rendering (see Figure 1), better tracking (see Table 2), and more editability (see Figure 7) than the NeRFs.

Method	iPhone ( $+$ pose)	iPhone ( $-$ pose)	Nvidia	Mean
Nerfies	16.45 / 0.339	14.60 / 0.483	21.40 / 0.190	17.48 / 0.337
HyperNeRF	16.81 / 0.332	14.97 / 0.474	21.73 / 0.167	17.83 / 0.324
T-NeRF	17.43 / 0.508	14.54 / 0.574	21.40 / 0.171	17.79 / 0.418
Ours	16.03 / 0.436	15.19 / 0.407	22.49 / 0.191	17.90 / 0.345

Table 3. We report mPSNR $\uparrow$ / LPIPS $\downarrow$. DGMarbles is on-par with NeRF baselines for the task of novel view synthesis.

5.6. Ablations

In Table 4, we ablate various parts of DGMarbles and report the outcomes on three scenes from the Nvidia dataset. As shown, each of our design choices is important in achieving high quality novel view synthesis. In particular, the table suggests that our global adjustment phase as well as isometry loss are of principal importance.

	mPSNR $\uparrow$ / mLPIPS $\downarrow$
Method	Balloon1	Skating	Truck	Mean
No Segmentation	22.78 / 0.137	23.55 / 0.156	26.00 / 0.106	24.11 / 0.133
No Tracking	22.19 / 0.179	23.79 / 0.115	24.52 / 0.118	23.50 / 0.137
No Isometry	22.75 / 0.147	22.22 / 0.166	24.20 / 0.136	23.06 / 0.150
No Motion Estimation	22.83 / 0.182	23.16 / 0.149	25.12 / 0.136	23.70 / 0.156
No Global Adjustment	22.07 / 0.344	23.42 / 0.283	23.74 / 0.424	23.08 / 0.350
DGMarbles	23.58 / 0.152	24.22 / 0.119	26.41 / 0.109	24.74 / 0.127

Table 4. We ablate various components of DGMarbles, showing that each component is important to achieve high quality novel view synthesis.

6. Limitations and Conclusion

We present DGMarbles, an attempt to bring dynamic Gaussians to the challenging setting of casual monocular video captures. DGMarbles introduces isotropic Gaussian “marbles”, a divide-and-conquer learning strategy, and various 2D priors, achieving novel view synthesis that is significantly better than previous Gaussian methods. Furthermore, DGMarbles is well-suited for tracking and editing, and significantly outperforms previous reconstruction methods on tracking accuracy. Nevertheless, DGMarbles has limitations and does not comprehensively solve the extremely challenging problem of open-world dynamic and monocular novel view synthesis. Since DGMarbles relies on 2D image priors, errors in the 2D predictions such as poor depth estimation or poor segmentation can lead the optimization into suboptimal results. Similarly, our geometric priors may guide optimization incorrectly in scenes with rapid and non-rigid motion, a setting where further progress in 3D priors and visual tracking will be vital. We hope DGMarbles provides a significant step forward in bringing Gaussian representations to the challenging setting of general monocular novel-view synthesis.

References

Blinn (1982) James F Blinn. 1982. A generalization of algebraic surface drawing. ACM transactions on graphics (TOG) 1, 3 (1982), 235–256.
Bui et al. (2023) Minh-Quan Viet Bui, Jongmin Park, Jihyong Oh, and Munchurl Kim. 2023. DyBluRF: Dynamic Deblurring Neural Radiance Fields for Blurry Monocular Video. arXiv preprint arXiv:2312.13528 (2023).
Cao and Johnson (2023) Ang Cao and Justin Johnson. 2023. HexPlane: A Fast Representation for Dynamic Scenes. CVPR (2023).
Cen et al. (2023) Jiazhong Cen, Jiemin Fang, Chen Yang, Lingxi Xie, Xiaopeng Zhang, Wei Shen, and Qi Tian. 2023. Segment Any 3D Gaussians. arXiv preprint arXiv:2312.00860 (2023).
Das et al. (2023) Devikalyan Das, Christopher Wewer, Raza Yunus, Eddy Ilg, and Jan Eric Lenssen. 2023. Neural parametric gaussians for monocular non-rigid object reconstruction. arXiv preprint arXiv:2312.01196 (2023).
Duan et al. (2024) Yuanxing Duan, Fangyin Wei, Qiyu Dai, Yuhang He, Wenzheng Chen, and Baoquan Chen. 2024. 4D Gaussian Splatting: Towards Efficient Novel View Synthesis for Dynamic Scenes. ArXiv abs/2402.03307 (2024). https://api.semanticscholar.org/CorpusID:267411895
Duisterhof et al. (2023) Bardienus Pieter Duisterhof, Zhao Mandi, Yunchao Yao, Jia-Wei Liu, Mike Zheng Shou, Shuran Song, and Jeffrey Ichnowski. 2023. MD-Splatting: Learning Metric Deformation from 4D Gaussians in Highly Deformable Scenes. ArXiv abs/2312.00583 (2023). https://api.semanticscholar.org/CorpusID:265551723
Fan et al. (2024) Zhiwen Fan, Wenyan Cong, Kairun Wen, Kevin Wang, Jian Zhang, Xinghao Ding, Danfei Xu, Boris Ivanovic, Marco Pavone, Georgios Pavlakos, Zhangyang Wang, and Yue Wang. 2024. InstantSplat: Unbounded Sparse-view Pose-free Gaussian Splatting in 40 Seconds. arXiv:2403.20309 [cs.CV]
Fang et al. (2022) Jiemin Fang, Taoran Yi, Xinggang Wang, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Matthias Nießner, and Qi Tian. 2022. Fast Dynamic Radiance Fields with Time-Aware Neural Voxels. In SIGGRAPH Asia 2022 Conference Papers.
Feng et al. (2024) Qiyuan Feng, Geng-Chen Cao, Hao-Xiang Chen, Tai-Jiang Mu, Ralph R. Martin, and Shi-Min Hu. 2024. A New Split Algorithm for 3D Gaussian Splatting. ArXiv abs/2403.09143 (2024). https://api.semanticscholar.org/CorpusID:268384828
Fridovich-Keil et al. (2023) Sara Fridovich-Keil, Giacomo Meanti, Frederik Rahbæk Warburg, Benjamin Recht, and Angjoo Kanazawa. 2023. K-Planes: Explicit Radiance Fields in Space, Time, and Appearance. In CVPR.
Gao et al. (2021) Chen Gao, Ayush Saraf, Johannes Kopf, and Jia-Bin Huang. 2021. Dynamic View Synthesis from Dynamic Monocular Video. In Proceedings of the IEEE International Conference on Computer Vision.
Gao et al. (2022) Hang Gao, Ruilong Li, Shubham Tulsiani, Bryan Russell, and Angjoo Kanazawa. 2022. Monocular Dynamic View Synthesis: A Reality Check. In NeurIPS.
Guo et al. (2024) Zhiyang Guo, Wengang Zhou, Li Li, Min Wang, and Houqiang Li. 2024. Motion-aware 3D Gaussian Splatting for Efficient Dynamic Scene Reconstruction. ArXiv abs/2403.11447 (2024). https://api.semanticscholar.org/CorpusID:268512916
Harley et al. (2022) Adam W Harley, Zhaoyuan Fang, and Katerina Fragkiadaki. 2022. Particle Video Revisited: Tracking Through Occlusions Using Point Trajectories. In ECCV.
Huang et al. (2023) Yi-Hua Huang, Yang-Tian Sun, Ziyi Yang, Xiaoyang Lyu, Yan-Pei Cao, and Xiaojuan Qi. 2023. SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes. arXiv preprint arXiv:2312.14937 (2023).
Jang and Kim (2022) Hankyu Jang and Daeyoung Kim. 2022. D-TensoRF: Tensorial Radiance Fields for Dynamic Scenes. ArXiv abs/2212.02375 (2022). https://api.semanticscholar.org/CorpusID:254247189
Johnson et al. (2023) Erik C.M. Johnson, Marc Habermann, Soshi Shimada, Vladislav Golyanik, and Christian Theobalt. 2023. Unbiased 4D: Monocular 4D Reconstruction with a Neural Deformation Model. CVPR Workshop (2023).
Karaev et al. (2023) Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. 2023. CoTracker: It is Better to Track Together. arXiv:2307.07635 (2023).
Katsumata et al. (2023) Kai Katsumata, Duc Minh Vo, and Hideki Nakayama. 2023. An Efficient 3D Gaussian Representation for Monocular/Multi-view Dynamic Scenes. ArXiv abs/2311.12897 (2023). https://api.semanticscholar.org/CorpusID:265351835
Kerbl et al. (2023) Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 2023. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Transactions on Graphics 42, 4 (July 2023). https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/
Keselman and Hebert (2022) Leonid Keselman and Martial Hebert. 2022. Approximate differentiable rendering with algebraic surfaces. In European Conference on Computer Vision (ECCV). Springer, 596–614.
Keselman and Hebert (2023) Leonid Keselman and Martial Hebert. 2023. Flexible techniques for differentiable rendering with 3d gaussians. arXiv preprint arXiv:2308.14737 (2023).
Kirillov et al. (2023) Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. 2023. Segment Anything. arXiv:2304.02643 (2023).
Kirschstein et al. (2023) Tobias Kirschstein, Shenhan Qian, Simon Giebenhain, Tim Walter, and Matthias Nießner. 2023. NeRSemble: Multi-View Radiance Field Reconstruction of Human Heads. ACM Trans. Graph. 42, 4, Article 161 (jul 2023), 14 pages. https://doi.org/10.1145/3592455
Lee et al. (2024) Byeonghyeon Lee, Howoong Lee, Xiangyu Sun, Usman Ali, and Eunbyung Park. 2024. Deblurring 3D Gaussian Splatting. arXiv:2401.00834 [cs.CV]
Lee et al. (2023a) Joo Chan Lee, Daniel Rho, Xiangyu Sun, Jong Hwan Ko, and Eunbyung Park. 2023a. Compact 3D Gaussian Representation for Radiance Field. arXiv preprint arXiv:2311.13681 (2023).
Lee et al. (2023b) Yao-Chih Lee, Zhoutong Zhang, Kevin Blackburn-Matzen, Simon Niklaus, Jianming Zhang, Jia-Bin Huang, and Feng Liu. 2023b. Fast View Synthesis of Casual Videos. arXiv preprint arXiv:2312.02135 (2023).
Li et al. (2023a) Zhan Li, Zhang Chen, Zhong Li, and Yi Xu. 2023a. Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis. arXiv preprint arXiv:2312.16812 (2023).
Li et al. (2021) Zhengqi Li, Simon Niklaus, Noah Snavely, and Oliver Wang. 2021. Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Li et al. (2023b) Zhengqi Li, Qianqian Wang, Forrester Cole, Richard Tucker, and Noah Snavely. 2023b. DynIBaR: Neural Dynamic Image-Based Rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Li et al. (2024) Zhe Li, Zerong Zheng, Lizhen Wang, and Yebin Liu. 2024. Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Liang et al. (2023) Yiqing Liang, Numair Khan, Zhengqin Li, Thu Nguyen-Phuoc, Douglas Lanman, James Tompkin, and Lei Xiao. 2023. GauFRe: Gaussian Deformation Fields for Real-time Dynamic Novel View Synthesis. ArXiv abs/2312.11458 (2023). https://api.semanticscholar.org/CorpusID:266359262
Lin et al. (2023) Youtian Lin, Zuozhuo Dai, Siyu Zhu, and Yao Yao. 2023. Gaussian-Flow: 4D Reconstruction with Dynamic 3D Gaussian Particle. arXiv:2312.03431 (2023).
Liu et al. (2023) Yu-Lun Liu, Chen Gao, Andreas Meuleman, Hung-Yu Tseng, Ayush Saraf, Changil Kim, Yung-Yu Chuang, Johannes Kopf, and Jia-Bin Huang. 2023. Robust Dynamic Radiance Fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Luiten et al. (2024) Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. 2024. Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis. In 3DV.
Matsuki et al. (2024) Hidenobu Matsuki, Riku Murai, Paul H. J. Kelly, and Andrew J. Davison. 2024. Gaussian Splatting SLAM. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Mildenhall et al. (2020) Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. 2020. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In ECCV.
Morgenstern et al. (2023) Wieland Morgenstern, Florian Barthel, Anna Hilsmann, and Peter Eisert. 2023. Compact 3D Scene Representation via Self-Organizing Gaussian Grids. arXiv:2312.13299 [cs.CV]
Niedermayr et al. (2023) Simon Niedermayr, Josef Stumpfegger, and Rüdiger Westermann. 2023. Compressed 3D Gaussian Splatting for Accelerated Novel View Synthesis. arXiv:2401.02436 [cs.CV]
Niklaus and Liu (2020) Simon Niklaus and Feng Liu. 2020. Softmax Splatting for Video Frame Interpolation. In IEEE Conference on Computer Vision and Pattern Recognition.
Park et al. (2021a) Keunhong Park, Utkarsh Sinha, Jonathan T. Barron, Sofien Bouaziz, Dan B Goldman, Steven M. Seitz, and Ricardo Martin-Brualla. 2021a. Nerfies: Deformable Neural Radiance Fields. ICCV (2021).
Park et al. (2021b) Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T. Barron, Sofien Bouaziz, Dan B Goldman, Ricardo Martin-Brualla, and Steven M. Seitz. 2021b. HyperNeRF: A Higher-Dimensional Representation for Topologically Varying Neural Radiance Fields. ACM Trans. Graph. 40, 6, Article 238 (dec 2021).
Prokudin et al. (2023) Sergey Prokudin, Qianli Ma, Maxime Raafat, Julien Valentin, and Siyu Tang. 2023. Dynamic Point Fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 7964–7976.
Qian et al. (2024) Zhiyin Qian, Shaofei Wang, Marko Mihajlovic, Andreas Geiger, and Siyu Tang. 2024. 3DGS-Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting. (2024).
Ramasinghe et al. (2024) Sameera Ramasinghe, Violetta Shevchenko, Gil Avraham, and Anton van den Hengel. 2024. BLiRF: Band limited radiance fields for dynamic scene modeling. In AAAI 2024. https://www.amazon.science/publications/blirf-band-limited-radiance-fields-for-dynamic-scene-modeling
Song et al. (2023) Liangchen Song, Anpei Chen, Zhong Li, Zhang Chen, Lele Chen, Junsong Yuan, Yi Xu, and Andreas Geiger. 2023. NeRFPlayer: A Streamable Dynamic Scene Representation with Decomposed Neural Radiance Fields. IEEE Transactions on Visualization and Computer Graphics 29, 5 (2023), 2732–2742. https://doi.org/10.1109/TVCG.2023.3247082
Sun et al. (2024) Jiakai Sun, Han Jiao, Guangyuan Li, Zhanjie Zhang, Lei Zhao, and Wei Xing. 2024. 3dgstream: On-the-fly training of 3d gaussians for efficient streaming of photo-realistic free-viewpoint videos. arXiv preprint arXiv:2403.01444 (2024).
Tang et al. (2023) Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. 2023. DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation. arXiv preprint arXiv:2309.16653 (2023).
Tian et al. (2023) Fengrui Tian, Shaoyi Du, and Yueqi Duan. 2023. MonoNeRF: Learning a Generalizable Dynamic Radiance Field from Monocular Videos. In Proceedings of the International Conference on Computer Vision (ICCV).
Tretschk et al. (2020) Edgar Tretschk, Ayush Tewari, Vladislav Golyanik, Michael Zollhöfer, Christoph Lassner, and Christian Theobalt. 2020. Non-Rigid Neural Radiance Fields: Reconstruction and Novel View Synthesis of a Dynamic Scene From Monocular Video. arXiv:2012.12247 [cs.CV]
Wang et al. (2021a) Chaoyang Wang, Ben Eckart, Simon Lucey, and Orazio Gallo. 2021a. Neural Trajectory Fields for Dynamic Novel View Synthesis. ArXiv Preprint. arXiv:2105.05994
Wang et al. (2023) Chaoyang Wang, Lachlan Ewen MacDonald, László A. Jeni, and Simon Lucey. 2023. Flow Supervision for Deformable NeRF. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 21128–21137.
Wang et al. (2024) Chaoyang Wang, Peiye Zhuang, Aliaksandr Siarohin, Junli Cao, Guocheng Qian, Hsin-Ying Lee, and S. Tulyakov. 2024. Diffusion Priors for Dynamic View Synthesis from Monocular Videos. ArXiv abs/2401.05583 (2024). https://api.semanticscholar.org/CorpusID:266933409
Wang et al. (2021b) Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul Srinivasan, Howard Zhou, Jonathan T. Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. 2021b. IBRNet: Learning Multi-View Image-Based Rendering. In CVPR.
Wu et al. (2023) Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 2023. 4D Gaussian Splatting for Real-Time Dynamic Scene Rendering. arXiv preprint arXiv:2310.08528 (2023).
Yang et al. (2023a) Jinyu Yang, Mingqi Gao, Zhe Li, Shang Gao, Fangjing Wang, and Feng Zheng. 2023a. Track Anything: Segment Anything Meets Videos. arXiv:2304.11968 [cs.CV]
Yang et al. (2024a) Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. 2024a. Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data. In CVPR.
Yang et al. (2023b) Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. 2023b. Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction. arXiv preprint arXiv:2309.13101 (2023).
Yang et al. (2024b) Zeyu Yang, Hongye Yang, Zijie Pan, and Li Zhang. 2024b. Real-time Photorealistic Dynamic Scene Representation and Rendering with 4D Gaussian Splatting. In International Conference on Learning Representations (ICLR).
Ye et al. (2023) Mingqiao Ye, Martin Danelljan, Fisher Yu, and Lei Ke. 2023. Gaussian Grouping: Segment and Edit Anything in 3D Scenes. arXiv preprint arXiv:2312.00732 (2023).
Yi et al. (2024) Taoran Yi, Jiemin Fang, Junjie Wang, Guanjun Wu, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Qi Tian, and Xinggang Wang. 2024. GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models. In CVPR.
Yoon et al. (2020) Jae Shin Yoon, Kihwan Kim, Orazio Gallo, Hyun Soo Park, and Jan Kautz. 2020. Novel View Synthesis of Dynamic Scenes With Globally Coherent Depths From a Monocular Camera. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020), 5335–5344. https://api.semanticscholar.org/CorpusID:214795169
Yu et al. (2023) Heng Yu, Joel Julin, Zoltan A Milacski, Koichiro Niinuma, and Laszlo A Jeni. 2023. CoGS: Controllable Gaussian Splatting. arXiv (2023).
Zhang et al. (2024) Jiahui Zhang, Fangneng Zhan, Muyu Xu, Shijian Lu, and Eric P. Xing. 2024. FreGS: 3D Gaussian Splatting with Progressive Frequency Regularization. ArXiv abs/2403.06908 (2024). https://api.semanticscholar.org/CorpusID:268363429
Zhao et al. (2024) Xiaoming Zhao, Alex Colburn, Fangchang Ma, Miguel Ángel Bautista, Joshua M. Susskind, and Alexander G. Schwing. 2024. Pseudo-Generalized Dynamic View Synthesis from a Video. In ICLR.
Zhou et al. (2023) Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Zehao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, and Achuta Kadambi. 2023. Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields. arXiv preprint arXiv:2312.03203 (2023).