Modern learning-based approaches to 3D-aware image synthesis achieve high photorealism and 3D-consistent viewpoint changes for the generated images. Existing approaches represent instances in a shared canonical space. However, for in-the-wild datasets a shared canonical system can be difficult to define or might not even exist. In this work, we instead model instances in view space, alleviating the need for posed images and learned camera distributions. We find that in this setting, existing GAN-based methods are prone to generating flat geometry and struggle with distribution coverage. We hence propose WildFusion, a new approach to 3D-aware image synthesis based on latent diffusion models (LDMs). We first train an autoencoder that infers a compressed latent representation, which additionally captures the images’ underlying 3D structure and enables not only reconstruction but also novel view synthesis. To learn a faithful 3D representation, we leverage cues from monocular depth prediction. Then, we train a diffusion model in the 3D-aware latent space, thereby enabling synthesis of high-quality 3D-consistent image samples, outperforming recent state-of-the-art GAN-based methods. Importantly, our 3D-aware LDM is trained without any direct supervision from multiview images or 3D geometry and does not require posed images or learned pose or camera distributions. It directly learns a 3D representation without relying on canonical camera coordinates. This opens up promising research avenues for scalable 3D-aware image synthesis and 3D content creation from in-the-wild image data.
While existing 3D-aware generative models achieve high photorealism and 3D-consistent viewpoint control, the vast majority of approaches only consider single-class, aligned data such as human or cat faces. We identify two main causes for this: (i) existing methods model instances in a shared canonical space, which is hard to define for in-the-wild data and requires posed images or learned camera distributions, and (ii) GAN-based training tends to produce flat geometry and poor distribution coverage on diverse, multi-modal datasets.
Our 3D-aware LDM, called WildFusion, follows the two-stage approach of LDMs. First, we train a powerful 3D-aware autoencoder on large collections of unposed images that simultaneously compresses the input and enables novel-view synthesis. The autoencoder is trained with pixel-space reconstruction losses on the input views, while novel views are supervised adversarially; because the novel views are only judged by a discriminator, the autoencoder learns novel-view synthesis without any multiview supervision. Adding monocular depth cues helps the model learn a faithful 3D representation and further improves novel-view synthesis. In the second stage, we train a diffusion model in the compressed, 3D-aware latent space. This enables us to synthesize novel samples and turns the novel-view synthesis system, i.e., our autoencoder, into a 3D-aware generative model.
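To make the first-stage objective more concrete, below is a minimal PyTorch-style sketch of how the three training signals could be combined. The interfaces ae.encode, ae.render, the discriminator disc, the frozen monocular depth predictor depth_net, and the loss weights are illustrative placeholders under our assumptions, not the actual WildFusion implementation.

```python
import torch
import torch.nn.functional as F

def autoencoder_step(ae, disc, depth_net, images, sample_pose):
    """One hypothetical first-stage training step (generator side)."""
    latent = ae.encode(images)                          # compressed 3D-aware latent

    # (1) Pixel-space reconstruction loss on the input view (modeled in view space).
    recon, recon_depth = ae.render(latent, pose=None)   # pose=None -> input viewpoint
    loss_recon = F.l1_loss(recon, images)

    # (2) Adversarial loss on a rendered novel view; no multiview ground truth needed.
    novel, _ = ae.render(latent, pose=sample_pose(images.shape[0]))
    loss_adv = F.softplus(-disc(novel)).mean()          # non-saturating GAN loss

    # (3) Monocular depth cue from an off-the-shelf predictor on the input view.
    with torch.no_grad():
        depth_target = depth_net(images)
    loss_depth = F.l1_loss(recon_depth, depth_target)

    # Loss weights are placeholders; the discriminator is updated in a separate step.
    return loss_recon + 0.1 * loss_adv + 0.1 * loss_depth
```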
We find that existing GAN-based approaches suffer from very low sample diversity (mode collapse) on multi-modal datasets with complex camera distributions, such as ImageNet. The results below show samples from WildFusion and 3DGP, the strongest baseline, where each row corresponds to samples of one class. While 3DGP collapses and produces almost identical samples within classes, WildFusion produces diverse, high-quality samples because it builds on latent diffusion models.
Our 3D-aware autoencoder simultaneously compresses the input and enables novel-view synthesis. Notably, it is trained from large collections of unposed images without any direct multiview supervision. The learned compressed 3D-aware latent space can then be used to train a latent diffusion model. In addition, we can leverage our autoencoder to perform novel-view synthesis from a single image more efficiently than common GAN-based methods, which rely on GAN inversion. We show pairs of input images and synthesized novel views from our autoencoder below.
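To illustrate why this is more efficient than GAN-inversion pipelines: a single-image novel view requires only one encoder pass and one rendering call, whereas inversion optimizes a latent code per image. The sketch below assumes the same placeholder ae.encode / ae.render interface as above.

```python
import torch

@torch.no_grad()
def novel_view_from_single_image(ae, image, pose):
    """Feed-forward novel-view synthesis: encode once, then render at the target pose.
    No per-image latent optimization (as in GAN inversion) is required."""
    latent = ae.encode(image.unsqueeze(0))        # (1, C, H, W) -> 3D-aware latent
    novel_view, _ = ae.render(latent, pose=pose)  # render at the requested viewpoint
    return novel_view
```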
We train a latent diffusion model in the compressed 3D-aware latent space of our autoencoder. The resulting 3D-aware LDM enables high-quality 3D-aware image synthesis with reasonable geometry, strong distribution coverage, and high sample diversity.
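The second stage reduces to standard latent diffusion training on the frozen autoencoder's latents. Below is a minimal epsilon-prediction sketch; the denoiser network and the noise schedule alphas_cumprod are assumed components for illustration, not the released code.

```python
import torch
import torch.nn.functional as F

def ldm_training_step(ae, denoiser, images, alphas_cumprod):
    """One hypothetical denoising training step on the 3D-aware latents."""
    with torch.no_grad():
        z0 = ae.encode(images)                               # frozen first-stage encoder
    t = torch.randint(0, alphas_cumprod.shape[0], (z0.shape[0],), device=z0.device)
    a_bar = alphas_cumprod[t].view(-1, *([1] * (z0.dim() - 1)))
    noise = torch.randn_like(z0)
    zt = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise    # forward diffusion q(z_t | z_0)
    return F.mse_loss(denoiser(zt, t), noise)                # predict the added noise
```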
Using WildFusion, we can interpolate in a semantically meaningful way between two given images while simultaneously changing the viewpoint. Note that the geometry also changes accordingly. Specifically, we encode both images into the latent space, map the latents into the diffusion model's Gaussian prior via DDIM inversion, interpolate the resulting encodings, and generate the corresponding 3D images along the interpolation path.
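A hedged sketch of this interpolation procedure is shown below. Here, ddim_invert and ddim_sample stand in for standard deterministic DDIM inversion and sampling routines, and slerp is the usual spherical interpolation for Gaussian samples; none of these names refer to the actual released code.

```python
import torch

def slerp(a, b, t, eps=1e-7):
    """Spherical interpolation between two Gaussian prior samples of the same shape."""
    a_flat, b_flat = a.flatten(), b.flatten()
    cos_omega = torch.dot(a_flat / (a_flat.norm() + eps), b_flat / (b_flat.norm() + eps))
    omega = torch.acos(torch.clamp(cos_omega, -1.0 + eps, 1.0 - eps))
    return (torch.sin((1.0 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)

@torch.no_grad()
def interpolate_3d(ae, denoiser, ddim_invert, ddim_sample, img_a, img_b, pose, n_steps=8):
    """Encode both images, invert to the Gaussian prior, interpolate, and re-generate."""
    xT_a = ddim_invert(denoiser, ae.encode(img_a))        # latent -> prior (inverse DDIM)
    xT_b = ddim_invert(denoiser, ae.encode(img_b))
    frames = []
    for t in torch.linspace(0.0, 1.0, n_steps):
        z0 = ddim_sample(denoiser, slerp(xT_a, xT_b, t))  # prior -> 3D-aware latent
        frames.append(ae.render(z0, pose=pose)[0])        # render at the chosen viewpoint
    return frames
```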
We can further use WildFusion to perform 3D-aware generative image resampling. Given an image, we forward diffuse its latent encoding for varying numbers of steps and re-generate from the partially noised encodings. Depending on how far we diffuse, we control how strongly the sample adheres to the input image. For the samples below, we gradually increase the number of diffusion steps from left to right.
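This resampling strategy is conceptually similar to SDEdit applied in the 3D-aware latent space. A minimal sketch follows, assuming the same placeholder components as above plus a reverse_diffuse_from routine that runs the reverse process from an intermediate timestep.

```python
import torch

@torch.no_grad()
def resample_3d(ae, denoiser, reverse_diffuse_from, image, alphas_cumprod, t_start):
    """Partially noise the latent up to t_start, then re-generate from there.
    Larger t_start -> more variation, weaker adherence to the input image."""
    z0 = ae.encode(image.unsqueeze(0))
    a_bar = alphas_cumprod[t_start]
    zt = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * torch.randn_like(z0)  # forward diffuse
    z0_new = reverse_diffuse_from(denoiser, zt, t_start)                  # reverse diffusion
    resampled_view, _ = ae.render(z0_new, pose=None)                      # render input view
    return resampled_view
```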
@InProceedings{Schwarz2024ICLR,
author = {Schwarz, Katja and Kim, Seung Wook and Gao, Jun and Fidler, Sanja and Geiger, Andreas and Kreis, Karsten},
title = {WildFusion: Learning 3D-Aware Latent Diffusion Models in View Space},
booktitle = {International Conference on Learning Representations (ICLR)},
year = {2024}
}