User description

We present GANcraft, an unsupervised neural rendering framework for generating photorealistic images of large 3D block worlds such as those created in Minecraft. Our method takes a semantic block world as input, where each block is assigned a semantic label such as dirt, grass, or water. We represent the world as a continuous volumetric function and train our model to render view-consistent photorealistic images for a user-controlled camera. In the absence of paired ground truth real images for the block world, we devise a training technique based on pseudo-ground truth and adversarial training. This stands in contrast to prior work on neural rendering for view synthesis, which requires ground truth images to estimate scene geometry and view-dependent appearance. In addition to camera trajectory, GANcraft allows user control over both scene semantics and output style. Experimental results with comparison to strong baselines show the effectiveness of GANcraft on this novel task of photorealistic 3D block world synthesis. The project website is available at https://nvlabs.github.io/GANcraft/.

Imagine a world where every Minecrafter is a 3D painter! Advances in 2D image-to-image translation [3, 22, 50] have enabled users to paint photorealistic images by drawing simple sketches similar to those created in Microsoft Paint. Despite these innovations, creating a realistic 3D scene remains a painstaking task, out of the reach of most people. It requires years of expertise, professional software, a library of digital assets, and a lot of development time. In contrast, building 3D worlds with blocks, say physical LEGOs or their digital counterpart, is so easy and intuitive that even a toddler can do it. Wouldn't it be great if we could build a simple 3D world made of blocks representing various materials (like Fig. 1 (insets)), feed it to an algorithm, and receive a realistic-looking 3D world featuring tall green trees, ice-capped mountains, and the blue sea (like Fig. 1)? With such a method, we could perform world-to-world translation to convert the worlds of our imagination to reality. Needless to say, such an ability would have many applications, from entertainment and education to rapid prototyping for artists.

In this paper, we propose GANcraft, a method that produces realistic renderings of semantically labeled 3D block worlds, such as those from Minecraft (www.minecraft.net). Minecraft, the best-selling video game of all time with over 200 million copies sold and over 120 million monthly users [2], is a sandbox video game in which a user can explore a procedurally generated 3D world made up of blocks arranged on a regular grid, while modifying and building structures with blocks. Minecraft provides blocks representing various building materials: grass, dirt, water, sand, snow, etc. Each block is assigned a simple texture, and the game is known for its distinctive cartoonish look. While one might discount Minecraft as a simple game with simple mechanics, it is, in fact, a very popular 3D content creation tool. Minecrafters have faithfully recreated large cities and famous landmarks, including the Eiffel Tower! The block world representation is intuitive to manipulate, which makes it well-suited as the medium for our world-to-world translation task.
We focus on generating natural landscapes, which have also been studied in several prior works on image-to-image translation [3, 50]. At first glance, generating a photorealistic 3D world from a semantic block world seems to be a task of translating a sequence of projected 2D segmentation maps of the 3D block world, i.e., a direct application of image-to-image translation. This approach, however, immediately runs into several serious issues. First, obtaining paired ground truth training data of the 3D block world, segmentation labels, and corresponding real images is extremely costly, if not impossible. Second, existing image-to-image translation models [21, 50, 62, 72] do not generate consistent views [36]; each image is translated independently of the others. While the recent world-consistent vid2vid work [36] overcomes the issue of view consistency, it requires paired ground truth 3D training data. Even the most recent neural rendering approaches based on neural radiance fields, such as NeRF [39], NSVF [31], and NeRF-W [37], require real images of a scene and associated camera parameters, and are best suited for view interpolation. As there is no paired 3D and ground truth real image data, as summarized in Table 1, none of the existing techniques can be used to solve this new task. This requires us to employ ad hoc adaptations to make our problem setting as similar to these methods' requirements as possible, e.g., training them on real segmentations instead of Minecraft segmentations.

In the absence of ground truth data, we propose a framework to train our model using pseudo-ground truth photorealistic images for sampled camera views. Our framework uses ideas from image-to-image translation and improves upon work in 3D view synthesis to produce view-consistent photorealistic renderings of input Minecraft worlds, as shown in Fig. 1. Although we demonstrate our results using Minecraft, our method works with other 3D block world representations, such as voxels. We chose Minecraft because it is a popular platform available to a wide audience.

Our key contributions include:
• The novel task of producing view-consistent photorealistic renderings of user-created 3D semantic block worlds, or world-to-world translation, a 3D extension of image-to-image translation.
• A framework for training neural renderers in the absence of ground truth data. This is enabled by using pseudo-ground truth images generated by a pretrained image synthesis model (Section 3.1).
• A new neural rendering network architecture trained with adversarial losses (Section 3.2) that extends recent work in 2D and 3D neural rendering [20, 31, 37, 39, 45] to produce state-of-the-art results, and that can be conditioned on a style image (Section 4).

2D image-to-image translation. The GAN framework [16] has enabled multiple methods to successfully map an image from one domain to another with high fidelity, e.g., from input segmentation maps to photorealistic images. This task can be performed in the supervised setting [22, 26, 35, 50, 62, 73], where example pairs of corresponding images are available, as well as in the unsupervised setting [14, 21, 32, 33, 35, 54, 72], where only two sets of images are available. Methods operating in the supervised setting use stronger losses such as the L1 or perceptual loss [23], in conjunction with the adversarial loss. As paired data is unavailable in the unsupervised setting, works typically rely on a shared-latent space assumption [32] or cycle-consistency losses [72].
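As a concrete illustration of how such supervised objectives are typically assembled, the snippet below sketches an L1 reconstruction term, a VGG-based perceptual term, and an adversarial term combined into a single generator loss. This is a minimal sketch assuming PyTorch and torchvision; the generator, discriminator, feature layers, and loss weights are illustrative placeholders rather than the configuration of any particular cited method.

```python
# Minimal sketch of a supervised image-to-image translation objective:
# L1 reconstruction + VGG perceptual loss + adversarial loss.
# Architectures and loss weights are illustrative placeholders.
import torch
import torch.nn.functional as F
import torchvision.models as models

# Frozen ImageNet-pretrained VGG19 features used for the perceptual term.
vgg_features = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def perceptual_loss(fake, real):
    # Compare intermediate VGG feature maps of generated and real images.
    return F.l1_loss(vgg_features(fake), vgg_features(real))

def generator_loss(generator, discriminator, seg_map, real_img,
                   lambda_l1=10.0, lambda_perc=10.0):
    fake_img = generator(seg_map)                   # translate segmentation -> image
    adv = -discriminator(fake_img, seg_map).mean()  # hinge/least-squares variants are also common
    l1 = F.l1_loss(fake_img, real_img)              # paired pixel-level supervision
    perc = perceptual_loss(fake_img, real_img)
    return adv + lambda_l1 * l1 + lambda_perc * perc
```

Each cited method uses its own discriminator design and loss weighting; the sketch only shows how paired data makes the reconstruction and perceptual terms possible.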
For a comprehensive overview of image-to-image translation methods, please refer to the survey of Liu et al. [34]. Our problem naturally falls into the unsupervised setting, as we do not possess real-world images corresponding to the Minecraft 3D world. To facilitate learning a view-consistent mapping, we employ pseudo-ground truths during training, which are predicted by a pretrained supervised image-to-image translation method.

Pseudo-ground truths were first explored in prior work on self-training, or bootstrap learning [38, 67] (see https://ruder.io/semi-supervised/ for an overview). More recently, this technique has been adopted in several unsupervised domain adaptation works [13, 27, 56, 61, 65, 70, 74]. They use a deep learning model trained on the 'source' domain to obtain predictions on the new 'target' domain, treat these predictions as ground truth labels, or pseudo labels, and finetune the deep learning model on such self-labeled data. In our problem setting, we have segmentation maps obtained from the Minecraft world but do not possess the corresponding real images. We use SPADE [50], a conditional GAN model trained to generate landscape images from input segmentation maps, to generate pseudo-ground truth images. This yields a pseudo pair: an input Minecraft segmentation mask and the corresponding pseudo-ground truth image. The pseudo pairs enable us to use stronger supervision, such as L1, L2, and perceptual [23] losses, in our world-to-world translation framework, resulting in improved output image quality. This idea of using pretrained GAN models to generate training data has also been explored in the very recent works of Pan et al. [48] and Zhang et al. [71], which use a pretrained StyleGAN [24, 25] as a multi-view data generator to train an inverse graphics model.

3D neural rendering. A number of works have explored combining the strengths of the traditional graphics pipeline, such as 3D-aware projection, with the synthesis capabilities of neural networks to produce view-consistent outputs. By introducing differentiable 3D projection and using trainable layers that operate in the 3D and 2D feature spaces, several recent methods [4, 18, 43, 44, 57, 63] are able to model the geometry and appearance of 3D scenes from 2D images. Some works have successfully combined neural rendering with adversarial training [18, 43, 44, 45, 55], thereby removing the constraint that training images be posed and come from the same scene. However, the under-constrained nature of the problem has limited the application of these methods to single objects, synthetic data, or small-scale simple scenes. As shown later in Section 4, we find that adversarial training alone is not enough to produce good results in our setting. This is because our input scenes are larger and more complex, the available training data is highly diverse, and there are considerable gaps in scene composition and camera pose distribution between the block world and the real images. Most recently, NeRF [39] demonstrated state-of-the-art results in novel view synthesis by encoding the scene in the weights of a neural network that produces the volume density and view-dependent radiance at every spatial location.
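To make the NeRF-style formulation above concrete, the sketch below shows the core computation in simplified form: an MLP maps a positionally encoded 3D point and a viewing direction to a volume density and a view-dependent color, and samples along a ray are alpha-composited with the standard volume-rendering weights. The layer widths, number of encoding frequencies, and uniform ray sampling are illustrative assumptions, not the exact architecture of NeRF [39] or of our model.

```python
# Simplified NeRF-style radiance field and volume rendering along one ray.
# Layer widths, encoding frequencies, and sampling are illustrative.
import torch
import torch.nn as nn

def positional_encoding(x, num_freqs=6):
    # Map each coordinate to [x, sin(2^k x), cos(2^k x)] features.
    feats = [x]
    for k in range(num_freqs):
        feats += [torch.sin((2.0 ** k) * x), torch.cos((2.0 ** k) * x)]
    return torch.cat(feats, dim=-1)

class RadianceField(nn.Module):
    def __init__(self, num_freqs=6, hidden=128):
        super().__init__()
        in_dim = 3 * (1 + 2 * num_freqs)
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.sigma_head = nn.Linear(hidden, 1)       # volume density
        self.color_head = nn.Linear(hidden + 3, 3)   # view-dependent radiance

    def forward(self, points, view_dirs):
        h = self.trunk(positional_encoding(points))
        sigma = torch.relu(self.sigma_head(h))
        rgb = torch.sigmoid(self.color_head(torch.cat([h, view_dirs], dim=-1)))
        return sigma, rgb

def render_ray(field, origin, direction, near=0.1, far=10.0, num_samples=64):
    # Sample points uniformly along the ray and alpha-composite their colors.
    t = torch.linspace(near, far, num_samples)
    points = origin + t[:, None] * direction          # (num_samples, 3)
    dirs = direction.expand_as(points)
    sigma, rgb = field(points, dirs)
    delta = t[1] - t[0]
    alpha = 1.0 - torch.exp(-sigma.squeeze(-1) * delta)      # opacity per sample
    surviving = torch.cumprod(1.0 - alpha + 1e-10, dim=0)
    trans = torch.cat([torch.ones(1), surviving[:-1]])        # transmittance up to each sample
    weights = alpha * trans                                   # contribution of each sample
    return (weights[:, None] * rgb).sum(dim=0)                # composited pixel color
```

A call such as render_ray(RadianceField(), torch.zeros(3), torch.tensor([0., 0., 1.])) composites 64 samples along a single ray into one RGB value; in practice this is batched over all pixels of an image.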
The remarkable synthesis ability of NeRF has inspired a large number of follow-up works that improve the output quality [31, 69], make it faster to train and evaluate [30, 31, 42, 52, 60], extend it to deformable objects [15, 28, 49, 51, 64], account for lighting [6, 9, 37, 58] and compositionality [17, 45, 47, 68], and add generative capabilities [11, 45, 55].

Most relevant to our work are NSVF [31], NeRF-W [37], and GIRAFFE [45]. NSVF [31] reduces the computational cost of NeRF by representing the scene as a set of voxel-bounded implicit fields organized in a sparse voxel octree, which is obtained by pruning an initially dense cuboid of voxels. NeRF-W [37] learns image-dependent appearance embeddings, allowing it to learn from unstructured photo collections and produce style-conditioned outputs. These works on novel view synthesis learn the geometry and appearance of scenes from ground truth posed images. In our setting, the problem is inverted: we are given coarse voxel geometry and segmentation labels as input, without any corresponding real images.

Similar to NSVF [31], we assign learnable features to each corner of the voxels to encode geometry and appearance. In contrast, we do not learn the 3D voxel structure of the scene from scratch, but instead implicitly refine the provided coarse input geometry (e.g., the shape and opacity of trees represented by blocky voxels) during the course of training. Prior work by Riegler et al. [53] also used a mesh obtained by multi-view stereo as coarse input geometry. Similar to NeRF-W [37], we use a style-conditioned network. This allows us to learn consistent geometry while accounting for the view inconsistency of SPADE [50]. Like neural point-based graphics [4] and GIRAFFE [45], we use differentiable projection to obtain features for image pixels and then use a CNN to convert the 2D feature grid into an image. Like GIRAFFE [45], we use an adversarial loss in training. We, however, learn on large, complex scenes and produce higher-resolution outputs (1024×2048 original image size in Fig. 1, versus 64×64 or 256×256 pixels in GIRAFFE), in which case the adversarial loss alone fails to produce good results.

3 Neural Rendering of Minecraft Worlds

Our goal is to convert a scene represented by semantically labeled blocks (or voxels), such as the maps from Minecraft, to a photorealistic 3D scene that can be consistently rendered from arbitrary viewpoints (as shown in Fig. 1). In this paper, we focus on landscape scenes that are orders of magnitude larger than the single objects or scenes typically used in the training and evaluation of previous neural rendering works. In all of our experiments, we use voxel grids of 512×512×256 blocks (512×512 blocks horizontally, 256 blocks vertically). Given that each Minecraft block is considered to have a size of 1 cubic meter [1], each scene covers an area of 262,144 m².
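As a rough illustration of the per-corner voxel features discussed above (in the spirit of NSVF [31]), the sketch below stores a learnable feature vector at every corner of a voxel grid and trilinearly interpolates these vectors at arbitrary continuous sample points. The grid resolution, feature width, and interpolation helper are illustrative assumptions and do not reproduce the exact parameterization used by GANcraft.

```python
# Sketch: learnable features at voxel corners, queried by trilinear interpolation.
# Grid resolution and feature width are illustrative.
import torch
import torch.nn as nn

class VoxelCornerFeatures(nn.Module):
    def __init__(self, nx=32, ny=32, nz=16, feat_dim=16):
        super().__init__()
        # One feature vector per grid corner: (nx+1) x (ny+1) x (nz+1) corners.
        self.features = nn.Parameter(
            0.01 * torch.randn(nx + 1, ny + 1, nz + 1, feat_dim))
        self.register_buffer("size", torch.tensor([nx, ny, nz], dtype=torch.float32))

    def forward(self, points):
        # points: (N, 3) in voxel units, inside [0, nx] x [0, ny] x [0, nz].
        p = torch.minimum(torch.maximum(points, torch.zeros(3)), self.size - 1e-5)
        i0 = p.floor().long()        # lower corner index of the enclosing voxel
        frac = p - i0.float()        # fractional position inside the voxel
        feat = 0.0
        for dx in (0, 1):
            for dy in (0, 1):
                for dz in (0, 1):
                    # Trilinear weight of this corner for every query point.
                    w = (frac[:, 0] if dx else 1 - frac[:, 0]) \
                        * (frac[:, 1] if dy else 1 - frac[:, 1]) \
                        * (frac[:, 2] if dz else 1 - frac[:, 2])
                    corner = self.features[i0[:, 0] + dx, i0[:, 1] + dy, i0[:, 2] + dz]
                    feat = feat + w[:, None] * corner
        return feat                  # (N, feat_dim)
```

The interpolated features remain differentiable with respect to the corner parameters, so they can be optimized jointly with the rest of the renderer.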