Synthesizing consistent and photorealistic 3D scenes is an open problem in computer vision. Video diffusion models generate impressive videos but cannot directly synthesize 3D representations, i.e., the generated sequences lack 3D consistency. In addition, directly training generative 3D models is challenging due to a lack of 3D training data at scale. In this work, we present Generative Gaussian Splatting (GGS), a novel approach that integrates a 3D representation with a pre-trained latent video diffusion model. Specifically, our model synthesizes a feature field parameterized via 3D Gaussian primitives. The feature field is then either rendered to feature maps and decoded into multi-view images, or directly upsampled into a 3D radiance field. We evaluate our approach on two common benchmark datasets for scene synthesis, RealEstate10K and ScanNet++, and find that our proposed GGS model significantly improves both the 3D consistency of the generated multi-view images and the quality of the generated 3D scenes over all relevant baselines. Compared to a similar model without 3D representation, GGS improves FID on the generated 3D scenes by 20% on both RealEstate10K and ScanNet++.
While existing pose-conditional video diffusion models achieve high photorealism, they often lack 3D consistency and cannot directly synthesize 3D representations. We therefore propose to directly integrate an explicit 3D representation with a pre-trained latent video diffusion model. Our approach, GGS, improves the 3D consistency of the generated images and naturally allows training with additional depth supervision where available. We further design a custom decoder that directly predicts the decoded 3D representation of the scene from the generated feature maps.
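Because the scene is carried by an explicit splat representation, a depth map can be rasterized alongside the images and supervised wherever sensor depth exists. The following is a minimal sketch of such a masked depth loss, assuming per-pixel rendered depth, ground-truth depth, and a validity mask; the helper name and tensor layout are illustrative assumptions, not the paper's implementation.

```python
import torch

def depth_supervision_loss(rendered_depth: torch.Tensor,
                           gt_depth: torch.Tensor,
                           valid_mask: torch.Tensor) -> torch.Tensor:
    """Masked L1 depth loss (illustrative sketch, not the GGS training code).

    rendered_depth: depth rasterized from the Gaussian splats, shape (B, H, W)
    gt_depth:       sensor or estimated depth, shape (B, H, W)
    valid_mask:     1 where ground-truth depth is available, 0 elsewhere
    """
    mask = valid_mask.float()
    # Penalize depth errors only at pixels with valid supervision.
    diff = (rendered_depth - gt_depth).abs() * mask
    return diff.sum() / mask.sum().clamp(min=1.0)
```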
Our approach, GGS, directly synthesizes a 3D representation, parameterized by a set of Gaussian splats \(\{\mathbf{g}^m\}\), from a set of posed input images. Specifically, during training we consider a set of posed images \(\{\mathbf{I}^m\}\) with associated camera poses \(\{\mathbf{p}^m\}\) and corresponding Plücker embeddings \(\{\mathbf{P}^m\}\). The images are first encoded into a latent representation \(\{\mathbf{z}_0^m\}\), which is then partitioned into \(K\) reference images and \(L\) target images. We introduce noise only to the latents of the target images \(\{\mathbf{z}_{tgt,0}^l\}_{l=1}^L\), while leaving the reference latents noise-free. To ensure compatibility with the pretrained image-to-video diffusion model, we duplicate the reference latents across the channel dimension and concatenate zeros for the target latents. The resulting latents, along with the noise level \(\sigma_t\) and the Plücker embeddings, are fed into a U-Net architecture that produces intermediate per-latent feature maps. These feature maps are subsequently processed by an epipolar transformer \(\mathcal{T}_{epi}\) to predict the parameters of the Gaussian feature splats \(\{\mathbf{g}^m\}\). We render both feature maps \(\{\mathbf{f}^m\}\) and low-resolution images \(\{\mathbf{I}_{LR}^m\}\) for the input views, as well as low-resolution images for \(J\) novel views \(\{\mathbf{I}_{nv,LR}^j\}_{j=1}^J\), to regularize the 3D representation. Finally, the rendered feature maps are decoded into a weighted combination of sampled noise \(\boldsymbol{\xi}^m\) and the input latent to predict the noise-free latents \(\{\hat{\mathbf{z}}_0^m\}\).
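To make the data flow above concrete, here is a minimal PyTorch-style sketch of a single GGS denoising step. The submodules (`unet`, `epipolar_transformer`, `splat_renderer`, `latent_decoder`) are passed in as callables and, like the tensor shapes, are assumptions for illustration rather than the released architecture.

```python
import torch

def ggs_denoising_step(latents, pluecker, sigma_t, ref_idx, tgt_idx,
                       unet, epipolar_transformer, splat_renderer, latent_decoder):
    """Sketch of one GGS denoising step (module names and shapes are assumptions).

    latents:  (M, C, h, w) encoded views z_0^m
    pluecker: (M, 6, h, w) Pluecker ray embeddings P^m
    ref_idx / tgt_idx: index tensors for the K reference and L target views
    """
    # Noise is added only to the target latents; reference latents stay clean.
    noisy = latents.clone()
    noisy[tgt_idx] = latents[tgt_idx] + sigma_t * torch.randn_like(latents[tgt_idx])

    # Match the pretrained image-to-video conditioning format: duplicate the
    # reference latents along the channel dimension, concatenate zeros for targets.
    cond = torch.zeros_like(latents)
    cond[ref_idx] = latents[ref_idx]
    unet_in = torch.cat([noisy, cond, pluecker], dim=1)

    # The U-Net yields per-latent feature maps; the epipolar transformer turns
    # them into parameters of the Gaussian feature splats {g^m}.
    feats = unet(unet_in, sigma_t)
    splats = epipolar_transformer(feats, pluecker)

    # Render feature maps from the splats (low-resolution RGB for input and
    # novel views would be rendered here as well to regularize the 3D scene),
    # then decode them, mixing in sampled noise and the input latents, to
    # predict the noise-free latents.
    rendered_feats = splat_renderer(splats)
    z0_hat = latent_decoder(rendered_feats, noisy)
    return splats, z0_hat
```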
Existing pose-conditional diffusion models without a 3D representation often struggle to generate 3D-consistent sequences. While CameraCtrl can generate reasonable sequences, it may not follow the given camera trajectory accurately; for example, compare the position of the chair in the lower part of the image at the end of the sequence. ViewCrafter follows the trajectory more closely but often fails to preserve the appearance of the content. It also relies on accurate depth estimates, and incorrect depth predictions can introduce artifacts in the generated sequences (see bottom row).
Below, we show 3D Gaussian splats generated from a single image using GGS. We also provide the reference images and generated feature splats.
LatentSplat performs well for small camera baselines, but its GAN-based generative decoder struggles with large viewpoint extrapolations. In contrast, our diffusion model can generate reasonable and consistent images even for larger extrapolations. Similar to the single-image setting, ViewCrafter follows the camera trajectory well overall but can alter the generated content between frames.
Below, we compare the 3D scenes generated from two images for GGS and the strongest baseline, ViewCrafter. Results were obtained by running Splatfacto as an off-the-shelf 3D reconstruction algorithm. Because GGS directly generates a 3D representation with its 3D decoder branch, we can leverage the generated 3D splats as initialization for the reconstruction, as sketched below.
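For reference, a small helper like the one below could export the generated splat centers and colors to an ASCII PLY point cloud, which a reconstruction pipeline can then consume as initial geometry instead of random or SfM points. The function name and the (N, 3) array layout are assumptions for illustration; this is not part of the GGS codebase or the Splatfacto CLI.

```python
import numpy as np

def export_splats_to_ply(means, colors, path="ggs_init.ply"):
    """Write Gaussian splat centers and RGB colors (in [0, 1]) to an ASCII PLY.

    means, colors: (N, 3) arrays; hypothetical helper for seeding a
    reconstruction pipeline with the splats generated by GGS.
    """
    rgb = (np.clip(colors, 0.0, 1.0) * 255).astype(np.uint8)
    header = "\n".join([
        "ply", "format ascii 1.0", f"element vertex {len(means)}",
        "property float x", "property float y", "property float z",
        "property uchar red", "property uchar green", "property uchar blue",
        "end_header",
    ])
    with open(path, "w") as f:
        f.write(header + "\n")
        for (x, y, z), (r, g, b) in zip(means, rgb):
            f.write(f"{x} {y} {z} {r} {g} {b}\n")
```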
Lastly, we show results for the conditional variant of our GGS model. We autoregressively generate larger scenes using only 5 reference images.
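Conceptually, the autoregressive scheme can be summarized by a loop like the one below, where `generate_chunk` stands in for GGS sampling conditioned on the current references and `plan_cameras` for a trajectory planner; both names, and the bookkeeping around them, are illustrative assumptions.

```python
def autoregressive_generation(reference_views, generate_chunk, plan_cameras,
                              num_chunks=4, max_references=5):
    """Hypothetical sketch of autoregressive scene growth with GGS.

    Each round generates a new chunk of views along a planned camera
    trajectory; a subset of the newly generated views is then promoted to
    references for the next round, keeping the reference set small.
    """
    references = list(reference_views)
    scene_views = list(reference_views)
    for _ in range(num_chunks):
        cameras = plan_cameras(references)               # next target poses
        new_views = generate_chunk(references, cameras)  # sample target views
        scene_views.extend(new_views)
        # One possible strategy: keep only the most recent views as references.
        references = (references + new_views)[-max_references:]
    return scene_views
```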
@InProceedings{Schwarz2025ggs,
author = {Schwarz, Katja and Müller, Norman and Kontschieder, Peter},
title = {Generative Gaussian Splatting: Generating 3D Scenes with Video Diffusion Priors},
booktitle = {arXiv.org (ARXIV)},
year = {2025}
}