Synthesizing consistent and photorealistic 3D scenes is an open problem in computer vision. Video diffusion models generate impressive videos but cannot directly synthesize 3D representations, i.e., the generated sequences lack 3D consistency. In addition, directly training generative 3D models is challenging due to a lack of 3D training data at scale. In this work, we present Generative Gaussian Splatting (GGS), a novel approach that integrates a 3D representation with a pre-trained latent video diffusion model. Specifically, our model synthesizes a feature field parameterized via 3D Gaussian primitives. The feature field is then either rendered to feature maps and decoded into multi-view images, or directly upsampled into a 3D radiance field. We evaluate our approach on two common benchmark datasets for scene synthesis, RealEstate10K and ScanNet++, and find that our proposed GGS model significantly improves both the 3D consistency of the generated multi-view images and the quality of the generated 3D scenes over all relevant baselines. Compared to a similar model without 3D representation, GGS improves FID on the generated 3D scenes by 20% on both RealEstate10K and ScanNet++.
While existing pose-conditional video diffusion models achieve high photorealism, they often lack 3D consistency and cannot directly synthesize 3D representations. We therefore propose to directly integrate an explicit 3D representation with a pre-trained latent video diffusion model. Our approach, GGS, improves the 3D consistency of the generated images and naturally allows training with additional depth supervision where available. We further design a custom decoder that directly predicts the decoded 3D representation of the scene from the generated feature maps.
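Because the scene is carried by an explicit splat representation, a depth map can be rasterized alongside the images and supervised wherever sensor depth exists. The following is a minimal sketch of such a masked depth loss, assuming per-pixel rendered depth, ground-truth depth, and a validity mask; the helper name and tensor layout are illustrative assumptions, not the paper's implementation.

```python
import torch

def depth_supervision_loss(rendered_depth: torch.Tensor,
                           gt_depth: torch.Tensor,
                           valid_mask: torch.Tensor) -> torch.Tensor:
    """Masked L1 depth loss (illustrative sketch, not the GGS training code).

    rendered_depth: depth rasterized from the Gaussian splats, shape (B, H, W)
    gt_depth:       sensor or estimated depth, shape (B, H, W)
    valid_mask:     1 where ground-truth depth is available, 0 elsewhere
    """
    mask = valid_mask.float()
    # Penalize depth errors only at pixels with valid supervision.
    diff = (rendered_depth - gt_depth).abs() * mask
    return diff.sum() / mask.sum().clamp(min=1.0)
```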
Our approach, GGS, directly synthesizes a 3D representation, parameterized by a set of Gaussian splats \(\{\mathbf{g}^m\}\), from a set of posed input images. Specifically, during training we consider a set of posed images \(\{\mathbf{I}^m\}\) with associated camera poses \(\{\mathbf{p}^m\}\) and corresponding Plücker embeddings \(\{\mathbf{P}^m\}\). The images are first encoded into a latent representation \(\{\mathbf{z}_0^m\}\), which is then partitioned into \(K\) reference images and \(L\) target images. We introduce noise only to the latents of the target images \(\{\mathbf{z}_{tgt,0}^l\}_{l=1}^L\), while leaving the reference latents noise-free. To ensure compatibility with the pretrained image-to-video diffusion model, we duplicate the reference latents across the channel dimension and concatenate zeros for the target latents. The resulting latents, along with the noise level \(\sigma_t\) and the Plücker embeddings, are fed into a U-Net architecture that produces intermediate per-latent feature maps. These feature maps are subsequently processed by an epipolar transformer \(\mathcal{T}_{epi}\) to predict the parameters of the Gaussian feature splats \(\{\mathbf{g}^m\}\). We render both feature maps \(\{\mathbf{f}^m\}\) and low-resolution images \(\{\mathbf{I}_{LR}^m\}\) for the input views, as well as low-resolution images for \(J\) novel views \(\{\mathbf{I}_{nv,LR}^j\}_{j=1}^J\), to regularize the 3D representation. Finally, the rendered feature maps are decoded into a weighted combination of sampled noise \(\boldsymbol{\xi}^m\) and the input latent to predict the noise-free latents \(\{\hat{\mathbf{z}}_0^m\}\).
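To make the data flow above concrete, here is a minimal PyTorch-style sketch of a single GGS denoising step. The submodules (`unet`, `epipolar_transformer`, `splat_renderer`, `latent_decoder`) are passed in as callables and, like the tensor shapes, are assumptions for illustration rather than the released architecture.

```python
import torch

def ggs_denoising_step(latents, pluecker, sigma_t, ref_idx, tgt_idx,
                       unet, epipolar_transformer, splat_renderer, latent_decoder):
    """Sketch of one GGS denoising step (module names and shapes are assumptions).

    latents:  (M, C, h, w) encoded views z_0^m
    pluecker: (M, 6, h, w) Pluecker ray embeddings P^m
    ref_idx / tgt_idx: index tensors for the K reference and L target views
    """
    # Noise is added only to the target latents; reference latents stay clean.
    noisy = latents.clone()
    noisy[tgt_idx] = latents[tgt_idx] + sigma_t * torch.randn_like(latents[tgt_idx])

    # Match the pretrained image-to-video conditioning format: duplicate the
    # reference latents along the channel dimension, concatenate zeros for targets.
    cond = torch.zeros_like(latents)
    cond[ref_idx] = latents[ref_idx]
    unet_in = torch.cat([noisy, cond, pluecker], dim=1)

    # The U-Net yields per-latent feature maps; the epipolar transformer turns
    # them into parameters of the Gaussian feature splats {g^m}.
    feats = unet(unet_in, sigma_t)
    splats = epipolar_transformer(feats, pluecker)

    # Render feature maps from the splats (low-resolution RGB for input and
    # novel views would be rendered here as well to regularize the 3D scene),
    # then decode them, mixing in sampled noise and the input latents, to
    # predict the noise-free latents.
    rendered_feats = splat_renderer(splats)
    z0_hat = latent_decoder(rendered_feats, noisy)
    return splats, z0_hat
```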
Existing pose-conditional diffusion models without a 3D representation often struggle to generate 3D-consistent sequences. While CameraCtrl can generate reasonable sequences, it may not follow the given camera trajectory accurately; for example, compare the position of the chair in the lower part of the image at the end of the sequence. ViewCrafter follows the trajectory more closely but often fails to preserve the appearance of the content. It also relies on accurate depth estimates, and incorrect depth predictions can introduce artifacts in the generated sequences (see bottom row).
Below, we show 3D Gaussian splats generated from a single image using GGS. We also provide the reference images and generated feature splats.
LatentSplat performs well for small camera baselines, but its GAN-based generative decoder struggles with large viewpoint extrapolations. In contrast, our diffusion model can generate reasonable and consistent images even for larger extrapolations. Similar to the single-image setting, ViewCrafter follows the camera trajectory well overall but can alter the generated content between frames.
Below, we compare the 3D scenes generated from two images for GGS and the strongest baseline, ViewCrafter. Results were obtained by running Splatfacto as an off-the-shelf 3D reconstruction algorithm. Because GGS directly generates a 3D representation with its 3D decoder branch, we can leverage the generated 3D splats as initialization for the reconstruction, as sketched below.
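For reference, a small helper like the one below could export the generated splat centers and colors to an ASCII PLY point cloud, which a reconstruction pipeline can then consume as initial geometry instead of random or SfM points. The function name and the (N, 3) array layout are assumptions for illustration; this is not part of the GGS codebase or the Splatfacto CLI.

```python
import numpy as np

def export_splats_to_ply(means, colors, path="ggs_init.ply"):
    """Write Gaussian splat centers and RGB colors (in [0, 1]) to an ASCII PLY.

    means, colors: (N, 3) arrays; hypothetical helper for seeding a
    reconstruction pipeline with the splats generated by GGS.
    """
    rgb = (np.clip(colors, 0.0, 1.0) * 255).astype(np.uint8)
    header = "\n".join([
        "ply", "format ascii 1.0", f"element vertex {len(means)}",
        "property float x", "property float y", "property float z",
        "property uchar red", "property uchar green", "property uchar blue",
        "end_header",
    ])
    with open(path, "w") as f:
        f.write(header + "\n")
        for (x, y, z), (r, g, b) in zip(means, rgb):
            f.write(f"{x} {y} {z} {r} {g} {b}\n")
```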
Lastly, we show results for the conditional variant of our GGS model. We autoregressively generate larger scenes using only 5 reference images.
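Conceptually, the autoregressive scheme can be summarized by a loop like the one below, where `generate_chunk` stands in for GGS sampling conditioned on the current references and `plan_cameras` for a trajectory planner; both names, and the bookkeeping around them, are illustrative assumptions.

```python
def autoregressive_generation(reference_views, generate_chunk, plan_cameras,
                              num_chunks=4, max_references=5):
    """Hypothetical sketch of autoregressive scene growth with GGS.

    Each round generates a new chunk of views along a planned camera
    trajectory; a subset of the newly generated views is then promoted to
    references for the next round, keeping the reference set small.
    """
    references = list(reference_views)
    scene_views = list(reference_views)
    for _ in range(num_chunks):
        cameras = plan_cameras(references)               # next target poses
        new_views = generate_chunk(references, cameras)  # sample target views
        scene_views.extend(new_views)
        # One possible strategy: keep only the most recent views as references.
        references = (references + new_views)[-max_references:]
    return scene_views
```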
@InProceedings{Schwarz2025ggs,
author = {Schwarz, Katja and Müller, Norman and Kontschieder, Peter},
title = {Generative Gaussian Splatting: Generating 3D Scenes with Video Diffusion Priors},
booktitle = {arXiv.org (ARXIV)},
year = {2025}
}