Coordinate-based neural representations for low-dimensional signals are becoming increasingly popular in computer vision and graphics. In particular, these fully-connected networks can represent 3D scenes more compactly than voxel grids but are still easy to optimize with gradient-based methods. We presented Neural Radiance Fields (NeRF) earlier this year, a technique for achieving photorealistic view synthesis of complex objects and scenes. Since then, we have continued to investigate and explain the capabilities of these coordinate-based networks.

### Researchers

- Ben Mildenhall, UC Berkeley
- Ren Ng, UC Berkeley
- Jon Barron, Google

Work done in collaboration with Pratul Srinivasan, Matthew Tancik, Ravi Ramamoorthi, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Terrance Wang, and Divi Schmidt.

### Overview

View synthesis is the problem of rendering arbitrary new views of a static scene, given a fixed set of input images and their camera poses. Solving this problem allows people to view photorealistic recreations of complex objects or interesting places, without requiring a digital artist to spend tens or hundreds of hours designing 3D models.

In NeRF, we represent a static scene as a continuous 5D function (a 3D location plus a 2D viewing direction) that outputs a color along any outgoing direction at each point in space, and an opacity at each point controlling how much of that color is accumulated by a ray passing through the point. We optimize a deep fully-connected neural network without any convolutional layers (often referred to as a multilayer perceptron, or MLP) to represent this function by regressing from a single 5D coordinate to an output color and opacity.
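As a rough illustration, this mapping from a 5D coordinate to a color and density can be written as a small fully-connected network. The sketch below is a minimal NumPy version; the layer sizes, initialization, and the absence of a positional encoding are simplifications, and the actual NeRF MLP is deeper and encodes its inputs first:

```python
import numpy as np

def init_mlp(layer_sizes, rng):
    # Random weights for a small fully-connected network (illustrative only).
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        params.append((rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_out)),
                       np.zeros(n_out)))
    return params

def nerf_mlp(params, coords):
    # coords: (N, 5) array of (x, y, z, theta, phi) samples.
    h = coords
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)       # ReLU hidden layers
    W, b = params[-1]
    out = h @ W + b                          # (N, 4) raw outputs
    rgb = 1.0 / (1.0 + np.exp(-out[:, :3]))  # sigmoid keeps colors in [0, 1]
    sigma = np.maximum(out[:, 3], 0.0)       # densities must be non-negative
    return rgb, sigma

rng = np.random.default_rng(0)
params = init_mlp([5, 64, 64, 4], rng)
rgb, sigma = nerf_mlp(params, rng.normal(size=(8, 5)))
```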

To render this neural radiance field from a particular viewpoint, we trace camera rays through the scene to generate a sampled set of 3D points, pass them through the neural network to produce an output set of colors and densities, and use classical volume rendering techniques to accumulate those colors and densities into a 2D image. Because this process is naturally differentiable, we can use gradient descent to optimize the model by minimizing the error between each observed image and the corresponding view rendered from our representation. This encourages the network to predict a coherent model of the scene, assigning high volume densities and accurate colors to the locations that contain the true underlying scene content.
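The accumulation step can be sketched with standard volume rendering quadrature: each sample's opacity comes from its density and segment length, and colors are composited front to back, weighted by the transmittance remaining along the ray. A minimal NumPy version (the function name and inputs are illustrative):

```python
import numpy as np

def volume_render(rgb, sigma, deltas):
    # rgb: (N, 3) sample colors, front to back along one ray.
    # sigma: (N,) volume densities; deltas: (N,) segment lengths.
    alpha = 1.0 - np.exp(-sigma * deltas)        # opacity of each segment
    trans = np.cumprod(1.0 - alpha + 1e-10)      # light surviving past each segment
    trans = np.concatenate([[1.0], trans[:-1]])  # T_i excludes segment i itself
    weights = alpha * trans
    return (weights[:, None] * rgb).sum(axis=0)  # composited pixel color
```

With a nearly opaque red sample in front, the composited pixel is dominated by red, since later samples receive almost no transmittance.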

A key detail of NeRF is that we pass the input coordinates through a positional encoding before feeding them into the fully-connected network. Our follow-up work reframed this positional encoding as a special case of a Fourier feature mapping and explored why it enables the network to encode much higher-resolution scene detail.
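A common form of this positional encoding maps each input coordinate through sines and cosines at exponentially spaced frequencies. A minimal NumPy sketch (the function name and frequency count are illustrative):

```python
import numpy as np

def positional_encoding(x, num_freqs=10):
    # Maps each coordinate p to (sin(2^k * pi * p), cos(2^k * pi * p))
    # for k = 0 .. num_freqs - 1.
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi   # (num_freqs,)
    angles = x[..., None] * freqs                   # (..., D, num_freqs)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)           # (..., 2 * D * num_freqs)

coords = np.random.default_rng(0).uniform(size=(5, 3))
enc = positional_encoding(coords)
```

Each 3D input point expands into a 60-dimensional feature vector here, giving the network access to high-frequency variations of the input.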

### Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains

We show that passing input points v through a simple Fourier feature mapping of the form

γ(v) = [cos(2πBv), sin(2πBv)]^T

(for a tall and skinny random matrix B) enables a fully-connected MLP network to learn high-frequency functions in low-dimensional problem domains. Using tools from the neural tangent kernel (NTK) literature, we show that a standard MLP fails to learn high frequencies both in theory and in practice. To overcome this spectral bias, we use a Fourier feature mapping to transform the effective NTK into a stationary kernel with a tunable bandwidth. We suggest an approach for selecting problem-specific Fourier features that greatly improves the performance of MLPs for low-dimensional regression tasks relevant to the computer vision and graphics communities.
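A minimal sketch of this mapping, assuming the entries of B are drawn from a Gaussian whose scale acts as the tunable bandwidth (the scale value below is illustrative):

```python
import numpy as np

def fourier_features(v, B):
    # gamma(v) = [cos(2 * pi * B v), sin(2 * pi * B v)]
    proj = 2.0 * np.pi * v @ B.T
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)

rng = np.random.default_rng(0)
# "Tall and skinny": many random frequency rows, few input dimensions.
# The Gaussian scale (10.0 here) controls the bandwidth of the effective kernel.
B = rng.normal(0.0, 10.0, size=(256, 2))
feats = fourier_features(rng.uniform(size=(4, 2)), B)   # (4, 512)
```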

### Learned Initializations for Optimizing Coordinate-Based Neural Representations

Optimizing a coordinate-based network from randomly initialized weights for each new signal is inefficient. We propose applying standard meta-learning algorithms to learn the initial weight parameters for these fully-connected networks based on the underlying class of signals being represented (e.g., images of faces or 3D models of chairs). Despite requiring only a minor change in implementation, using these learned initial weights enables faster convergence during optimization and can serve as a strong prior over the signal class being modeled, resulting in better generalization when only partial observations of a given signal are available. We explore these benefits across a variety of tasks, including representing 2D images, reconstructing CT scans, and recovering 3D shapes and scenes from 2D image observations.
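As one concrete instance of such an algorithm, a Reptile-style outer loop nudges a shared initialization toward the weights obtained by fitting each sampled signal. The sketch below uses a hypothetical linear least-squares "signal" model purely for brevity; the actual work applies standard meta-learning algorithms to full coordinate-based MLPs:

```python
import numpy as np

def sgd_fit(theta, signal_xy, lr=1e-2, steps=32):
    # Inner loop: fit one signal from the current initialization by
    # gradient descent on a squared-error loss (toy linear model).
    x, y = signal_xy
    for _ in range(steps):
        grad = 2.0 * x.T @ (x @ theta - y) / len(y)
        theta = theta - lr * grad
    return theta

def reptile(theta, tasks, rng, outer_lr=0.1, outer_steps=100):
    # Outer loop (Reptile): move the shared initialization a fraction
    # of the way toward each task's adapted weights.
    for _ in range(outer_steps):
        task = tasks[rng.integers(len(tasks))]
        adapted = sgd_fit(theta.copy(), task)
        theta = theta + outer_lr * (adapted - theta)
    return theta
```

When the signals share structure (here, weight vectors clustered around a common center), the meta-learned initialization lands near that shared structure, so each new signal needs far fewer optimization steps.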