We introduce a recipe for generating immersive 3D worlds from a single image by framing the task as an in-context learning problem for 2D inpainting models. This approach requires minimal training and reuses existing generative models. Our process involves two steps: generating a coherent panorama with a pre-trained diffusion model, and lifting it into 3D with a metric depth estimator. We then fill unobserved regions by conditioning the inpainting model on rendered point clouds, which requires only minimal fine-tuning. Evaluated on both synthetic and real images, our method produces high-quality 3D environments suitable for VR display. By explicitly modeling the 3D structure of the generated environment from the start, our approach consistently outperforms state-of-the-art video-synthesis-based methods across multiple quantitative image-quality metrics.
Our key insight is that the task of generating a 3D environment from a single image, which is inherently complex and ambiguous, can be decomposed into a series of more manageable sub-problems, each of which can be addressed with existing techniques.
We address 2D panorama synthesis as an in-context, zero-shot task for existing inpainting models. By leveraging a vision-language model to generate prompts, our method produces high-fidelity panoramic images without additional training. Specifically, the vision-language model generates a non-specific prompt for scene extension, ensuring that key features or properties, e.g., a man holding tulips, are not duplicated. Moreover, we create separate prompts for the upper (sky) and lower (ground) sections of the panorama. To anchor sky and ground synthesis, the input image is duplicated onto the backside of the panorama. The progressive synthesis process begins with the sky and ground, maximizing global context and ensuring a coherent panorama. The backside anchor is then removed, and the remaining regions of the panorama are generated by rendering and outpainting perspective images.

In the second stage, the generated panorama is lifted into an approximately metric three-dimensional space. We first apply monocular, metric depth estimation to rendered images. This works sufficiently well for images rendered from the panorama, but typically leaves empty regions in previously occluded areas or at large depth discontinuities that emerge when the camera is shifted, i.e., undergoes a translation. We identify this as another inpainting task and demonstrate that the inpainting model quickly adapts to this setting when fine-tuned with appropriate masks derived from the rendered point clouds.

Lastly, we use the lifted point cloud and the generated images to reconstruct a 3D scene parameterized by Gaussian splats. The scene is viewable and navigable within a 2-meter cube on a VR headset.
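To make the mask-derivation step concrete, the following is a minimal sketch, not the authors' code, of how an inpainting mask could be obtained from a rendered point cloud: depth from the source view is back-projected to 3D, the points are splatted into a translated camera, and pixels that receive no points form the mask the inpainting model is asked to fill. The image size, focal length, and 20 cm translation below are illustrative assumptions.

import numpy as np

H, W = 256, 256
f = 200.0                      # assumed focal length in pixels
K = np.array([[f, 0, W / 2],
              [0, f, H / 2],
              [0, 0, 1]])

def lift_to_points(depth: np.ndarray) -> np.ndarray:
    """Back-project a depth map of the source view into 3D camera coordinates."""
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T
    return rays * depth.reshape(-1, 1)

def hole_mask(points: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Project points into a camera translated by t; return True where no point lands."""
    p = points - t                              # new camera at position t, identity rotation
    z = p[:, 2]
    valid = z > 1e-6
    proj = p[valid] @ K.T
    u = np.round(proj[:, 0] / proj[:, 2]).astype(int)
    v = np.round(proj[:, 1] / proj[:, 2]).astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    covered = np.zeros((H, W), dtype=bool)
    covered[v[inside], u[inside]] = True
    return ~covered                             # pixels the inpainting model must fill

# Toy example: two fronto-parallel planes with a sharp depth discontinuity,
# which disoccludes when the camera is shifted 20 cm sideways.
depth = np.full((H, W), 2.0)
depth[:, W // 2:] = 4.0
mask = hole_mask(lift_to_points(depth), t=np.array([0.2, 0.0, 0.0]))
print(f"{mask.mean():.1%} of pixels need inpainting after the shift")

In this toy setup the holes appear along the depth discontinuity and at the image border that the translation exposes, which is exactly the kind of mask used to fine-tune the inpainting model.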
While existing approaches can generate compelling videos from a single input image, they are typically unable to produce a fully immersive scene, notably struggling to outpaint in the direction opposite the initial view. Further, the generated videos often lack consistency, resulting in artifacts in the reconstructed 3D scenes.
Our pipeline extends naturally to text-to-scene synthesis: we first generate an image from the given text prompt and then run the image-to-scene pipeline described above.
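As a minimal sketch of this extension, assuming the Hugging Face diffusers library for the text-to-image step; image_to_scene is a hypothetical stand-in for the image-to-scene pipeline described above, not an API provided with the paper, and the model name and prompt are illustrative.

import torch
from diffusers import StableDiffusionPipeline

# Generate a single image from the text prompt.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")
image = pipe("a sunlit forest clearing with a small wooden cabin").images[0]

# scene = image_to_scene(image)  # hypothetical entry point: panorama -> depth -> Gaussian splats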
@InProceedings{Schwarz2025worlds,
author = {Schwarz, Katja and Rozumny, Denis and Rota Bulo, Samuel and Porzi, Lorenzo and Kontschieder, Peter},
title = {A Recipe for Generating 3D Worlds From a Single Image},
booktitle = {arXiv.org (ARXIV)},
year = {2025}
}