Rendering a Frame in Gaussian Splatting

December 2024
Methods like NeRF depend on a continuous representation of the space to be able to reconstruct the scene using ray casting. On the other hand, point-based rendering methods render disconnected and unstructured geometry. In their simplest form, this is just rasterizing a set of points with a fixed size [@grossman1998point]. A problem with this kind of rendering is that it suffers from holes, causes aliasing, and is strictly discontinuous. To overcome these limitations, it has been proposed to "splat" the point primitives using ellipsoids, discs, or surfels to fill an extent larger than a pixel [@botsch2005high; @pfister2000surfels; @zwicker2001surface; @ren2002object].
Other alternatives that have been proposed are the differentiable point-based rendering techniques [@wang2019differentiable]. From them, the one we are most interested in is point-based $\alpha$-blending, which shares the same image formation model as NeRF-style volumetric rendering. In the volumetric case, the color of a pixel is obtained by sampling along the ray:
$$C = \sum_{i=1}^{N} T_i\left(1 - e^{-\sigma_i\delta_i}\right)c_i, \qquad T_i = \exp\left(-\sum_{j=1}^{i-1}\sigma_j\delta_j\right),$$
where $\sigma_i$ and $c_i$ are the density and color at each sample, $\delta_i$ is the distance between consecutive samples, and $T_i$ is the accumulated transmittance. Now, a point-based approach computes the color of a pixel by blending $N$ ordered points that overlap it:
$$C = \sum_{i=1}^{N} c_i\,\alpha_i\prod_{j=1}^{i-1}\left(1 - \alpha_j\right).$$
This time, $c_i$ is the color of each point and $\alpha_i$ is its opacity once projected onto the pixel.
Even though they are similar, the nature of the formulas is totally different. Volumetric rendering forces us to take samples along the entire ray, even in regions where there is almost no volume, which introduces noise and carries a large computational cost. Alternatively, this point-based method uses a finite collection of objects that are easier to process, as we can see in fast-rendering algorithms [@GaussianSplatting; @lassner2021pulsar; @ruckert2022adop]. Furthermore, as mentioned in [@GaussianSplatting], its unstructured and discrete nature naturally allows the creation, destruction, and displacement of geometry, which is ideal for scene reconstruction.
Finally, it is worth noting that using 3D Gaussians to represent objects instead of points had already been explored, as seen in [@rhodin2015versatile; @wang2023voge]. However, 3DGS was the first to use them for reconstructing an entire scene.
Gaussian Splatting, also known as 3DGS, is a procedure designed to obtain a representation of a real scene starting from a sparse point cloud and calibrated cameras obtained by SfM [@sfm]. To achieve this, the authors of the original paper chose to model the geometry as a collection of unnormalized 3D Gaussians, which are easy to project onto a 2D plane.
The Gaussians are defined in Local Space with coordinates $p = (p_0, p_1, p_2)^T$; each one is centered at a point $p_k$ and has a covariance matrix $\Sigma_k$:
$$G_{\Sigma_k}(p - p_k) = e^{-\frac{1}{2}(p - p_k)^T\Sigma_k^{-1}(p - p_k)}.$$
Each Gaussian also has an associated opacity value $\sigma_k$ and a view-dependent color, which we will discuss when we get to Spherical Harmonics.
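To make the definition concrete, here is a minimal numpy sketch (the function name and the example covariance are ours, not from any particular implementation) that evaluates an unnormalized 3D Gaussian given its center and covariance:

```python
import numpy as np

def gaussian_3d_unnormalized(p, center, cov):
    """Evaluate exp(-0.5 (p - center)^T cov^{-1} (p - center)); no normalization factor."""
    d = np.asarray(p, dtype=float) - np.asarray(center, dtype=float)
    return float(np.exp(-0.5 * d @ np.linalg.inv(cov) @ d))

# Example: an axis-aligned Gaussian evaluated at its center (value 1) and one unit away.
cov = np.diag([0.5, 0.5, 2.0])
print(gaussian_3d_unnormalized([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], cov))  # 1.0
print(gaussian_3d_unnormalized([1.0, 0.0, 0.0], [0.0, 0.0, 0.0], cov))  # exp(-1)
```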
In the following sections, we will discuss some of the most important steps in this method and how the formulas were derived.
Ultimately, our goal is to project the Gaussians onto a plane to determine the color of a pixel. In order to do so, 3DGS performs a series of coordinate transformations on each Gaussian and then blends the projected Gaussians per pixel, as we derive next.

We define a 'Ray Space' similar to the approach in [@zwicker2001ewaVolumeSplatting], given by the triplet $x = (x_0, x_1, x_2)^T$: the first two coordinates identify a point on the projection plane and, with it, the viewing ray that passes through that point. The third coordinate, $x_2$, measures the distance along that ray, so integrating over $x_2$ amounts to integrating along the viewing ray.
In [@zwicker2001ewaVolumeSplatting] the volume rendering equation is presented, which describes the light intensity $I$ reaching a point $\hat{x} = (x_0, x_1)$ of the projection plane:
$$I(\hat{x}) = \int_0^{\infty} c(\hat{x}, x_2)\, g(\hat{x}, x_2)\, e^{-\int_0^{x_2} g(\hat{x}, t)\,dt}\; dx_2.$$
Here, $g$ is the extinction function, which models how much light the volume absorbs at each point, and $c$ is the emission coefficient, which models the light emitted at each point.
Now, given that we are working with Gaussians, it is assumed that the extinction function can be expressed as a weighted sum of Gaussian reconstruction kernels,
$$g(x) = \sum_{k} \sigma_k\, G_{\Sigma''_k}(x - x_k),$$
where $\sigma_k$ is the opacity of the $k$-th Gaussian and $G_{\Sigma''_k}(x - x_k)$ is that Gaussian expressed in 'Ray Space' (we will derive its center $x_k$ and covariance $\Sigma''_k$ later). Substituting this into the volume rendering equation and writing $x = (\hat{x}, x_2)$, we obtain:
$$I(\hat{x}) = \sum_{k} \sigma_k \int_0^{\infty} c(\hat{x}, x_2)\, G_{\Sigma''_k}(x - x_k)\, e^{-\int_0^{x_2} \sum_{j}\sigma_j\, G_{\Sigma''_j}\left((\hat{x}, t) - x_j\right)\,dt}\; dx_2.$$
As explained in the article, to be able to compute this function numerically, splatting algorithms rely on several simplifying assumptions. First, it is assumed that the reconstruction kernels have local support and that these supports do not overlap along a viewing ray, so the Gaussians can be sorted by depth and handled one at a time.
To account for view-dependent effects, we go a step further and assume that the emission coefficient is constant over the entire support of each kernel, given the direction from the camera center to the particle's position. We denote this as $c_k(d_k)$, where $d_k$ is the unit vector pointing from the camera center to the center of the $k$-th Gaussian.
Finally, the exponential function is approximated using the first two terms of its Taylor expansion, and self-occlusion is ignored. Under these assumptions, we can rewrite the previous equation as:
$$I(\hat{x}) = \sum_{k} c_k(d_k)\,\sigma_k\, q_k(\hat{x}) \prod_{j=1}^{k-1}\left(1 - \sigma_j\, q_j(\hat{x})\right),$$
where $q_k(\hat{x}) = \int_{\mathbb{R}} G_{\Sigma''_k}(x - x_k)\, dx_2$ is the reconstruction kernel integrated along the viewing ray and the Gaussians are indexed in order of increasing depth.
This function forms the foundation of all splatting algorithms. We can see that it has the same shape as the original point-based blending equation that we gave for this method, with colors $c_k(d_k)$ and effective opacities $\alpha_k(\hat{x}) = \sigma_k\, q_k(\hat{x})$.
Given that we chose Gaussians as kernels, estimating $q_k$ boils down to knowing the expression of each Gaussian in 'Ray Space' and integrating it along the depth axis.

We will follow the procedure used in [@zwicker2001ewaVolumeSplatting], but for unnormalized Gaussians. Given an affine mapping $\Phi(p) = Mp + b$, an unnormalized Gaussian with center $p_k$ and covariance $\Sigma_k$ transforms as
$$G_{\Sigma_k}\left(\Phi^{-1}(x) - p_k\right) = G_{M\Sigma_k M^T}\left(x - \Phi(p_k)\right),$$
that is, the center is mapped by $\Phi$ and the covariance becomes $M\Sigma_k M^T$; unlike the normalized case, no determinant factor appears. So, if our projection is an affine transformation, we can easily obtain the new expression for the Gaussian.
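As a quick numerical check of this property, the following sketch (function names are ours) maps a Gaussian through an arbitrary affine transformation and verifies that evaluating the transformed Gaussian at $\Phi(p)$ gives the same value as evaluating the original at $p$:

```python
import numpy as np

def transform_gaussian_affine(center, cov, M, b):
    """Push an unnormalized Gaussian through x = M p + b:
    the center is mapped by the affine map, the covariance becomes M cov M^T."""
    return M @ center + b, M @ cov @ M.T

def eval_gaussian(x, center, cov):
    d = x - center
    return float(np.exp(-0.5 * d @ np.linalg.inv(cov) @ d))

rng = np.random.default_rng(0)
M = rng.normal(size=(3, 3))                 # any invertible matrix
b = rng.normal(size=3)
center, cov = rng.normal(size=3), np.diag([0.3, 1.0, 2.5])

p = rng.normal(size=3)                      # a sample point in the original space
new_center, new_cov = transform_gaussian_affine(center, cov, M, b)
print(eval_gaussian(p, center, cov))                   # original Gaussian at p
print(eval_gaussian(M @ p + b, new_center, new_cov))   # transformed Gaussian at Phi(p): same value
```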
As is usual in Computer Graphics, the first transformation that we have to do is a Viewing Transformation, from Local Space to Camera Space, passing through Global Space. We denote the camera coordinates by a vector $u = (u_0, u_1, u_2)^T$, obtained through the affine mapping
$$u = \varphi(p) = Wp + d,$$
where $W$ is the rotation matrix and $d$ the translation vector that encode the camera pose. By the previous result, the Gaussian in Camera Space is centered at $u_k = \varphi(p_k)$ and has covariance $\Sigma'_k = W\Sigma_k W^T$.
The next step is to transform the Gaussians into 'Ray Space' using a projective transformation. However, this transformation depends on the kind of projection being performed, so we will cover two of the most relevant projections in computer graphics.
The typical method for projecting Gaussians involves using the intrinsic
parameters of the camera. A detailed explanation of how the projection
transformation is constructed for this kind of perspective projection
can be found here. In our case, to
take a more general approach, we will use the bounding values for the
screen and depth: right, left, top, bottom, near and far, which are
commonly used when modeling cameras in real-time applications. These
will be represented by $r$, $l$, $t$, $b$, $n$ and $f$, respectively.
Additionally, to stay as close as possible to existing implementations of Gaussian Splatting renderers, we will assume that the camera is oriented towards the positive Z-axis, and we will flip the Y-axis. This convention aligns with how most Gaussian scenes are produced.
With the assumptions mentioned, an orthographic projection to Clip Space is given by the following affine transformation:
$$u \;\longmapsto\; \begin{pmatrix} \frac{2}{r-l} & 0 & 0 \\ 0 & -\frac{2}{t-b} & 0 \\ 0 & 0 & \frac{2}{f-n} \end{pmatrix} u + \begin{pmatrix} -\frac{r+l}{r-l} \\ \frac{t+b}{t-b} \\ -\frac{f+n}{f-n} \end{pmatrix},$$
which maps the visible volume to the cube $[-1,1]^3$ (the sign in the second row accounts for the flipped Y-axis).
The next step is to finally transform the Clip Space coordinates into 'Ray Space'. For an orthographic camera the viewing rays are parallel to the depth axis, so this change of coordinates is also affine: the first two Clip Space coordinates are kept, and the third one is rescaled so that it measures the distance along the viewing ray.
Since both of these transformations are affine, their composition is also affine. The combined transformation is the map $m$ that takes Camera Space directly into 'Ray Space':
$$x = m(u) = Mu + b.$$
We will denote the new matrix as $M$ and the new translation as $b$, so that, applying the result for affine mappings, the Gaussian in 'Ray Space' is centered at $x_k = m(u_k)$ and has covariance
$$\Sigma''_k = M\,\Sigma'_k\,M^T,$$
where $\Sigma'_k = W\Sigma_k W^T$ is its covariance in Camera Space.
In contrast to orthographic projection, perspective projection is not an affine transformation. It starts with a matrix multiplication and translation but requires a "perspective division". Keeping the same assumptions as before, the perspective projection is as follows:
$$u \;\longmapsto\; \begin{pmatrix} \dfrac{2n}{r-l}\,\dfrac{u_0}{u_2} - \dfrac{r+l}{r-l} \\[3mm] -\dfrac{2n}{t-b}\,\dfrac{u_1}{u_2} + \dfrac{t+b}{t-b} \\[3mm] \dfrac{f+n}{f-n} - \dfrac{2fn}{(f-n)\,u_2} \end{pmatrix}.$$
Next, transforming these coordinates into 'Ray Space' is done exactly as in the orthographic case.
The composite function $m$ that takes Camera Space into 'Ray Space' is, in this case, no longer affine because of the perspective division, so we cannot apply the result for affine mappings directly.
To address this issue, the authors of [@zwicker2001ewaVolumeSplatting] introduced the local affine approximation of $m$ at the center of each Gaussian, given by its first-order Taylor expansion:
$$m_k(u) = x_k + J_k\,(u - u_k),$$
where $x_k = m(u_k)$ and $J_k$ is the Jacobian of $m$ evaluated at $u_k$.
Using the same procedure as before, the approximate representation of the original Gaussian in the 'Ray Space' created by perspective projection is given by
$$G_{\Sigma''_k}(x - x_k), \qquad \Sigma''_k = J_k\,\Sigma'_k\,J_k^T,$$
where $\Sigma'_k = W\Sigma_k W^T$ is, as before, the covariance of the Gaussian in Camera Space.
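Since the exact form of $m$ depends on the projection conventions, one simple way to picture the local affine approximation is to compute $J_k$ numerically. The sketch below does so by central differences on an illustrative pinhole-style mapping (the mapping, the focal lengths and all names are assumptions for the example, not the formulas of any specific renderer):

```python
import numpy as np

def m_perspective(u, focal_x=1.0, focal_y=1.0):
    """Illustrative Camera Space -> 'Ray Space' mapping for a camera looking down +Z:
    the first two coordinates are the projected position, the third is kept as depth."""
    return np.array([focal_x * u[0] / u[2], focal_y * u[1] / u[2], u[2]])

def jacobian_at(f, u, eps=1e-5):
    """Numerical 3x3 Jacobian of f at u via central differences."""
    u = np.asarray(u, dtype=float)
    J = np.zeros((3, 3))
    for i in range(3):
        e = np.zeros(3)
        e[i] = eps
        J[:, i] = (f(u + e) - f(u - e)) / (2 * eps)
    return J

u_k = np.array([0.3, -0.2, 2.0])          # Gaussian center in Camera Space
J_k = jacobian_at(m_perspective, u_k)      # local affine approximation of m around u_k
Sigma_cam = np.diag([0.05, 0.05, 0.2])     # covariance already in Camera Space (W Sigma W^T)
Sigma_ray = J_k @ Sigma_cam @ J_k.T        # covariance of the Gaussian in 'Ray Space'
print(J_k)
print(Sigma_ray)
```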
Returning to the Rendering Equation, we still have to compute $q_k$, the integral of each Gaussian over the ray. It turns out that integrating a 3D Gaussian along one of the coordinate axes yields a 2D Gaussian. Specifically, by integrating the resulting Gaussians in 'Ray Space' along the depth axis (the $x_2$ coordinate), we obtain
$$\int_\mathbb{R}G_{\Sigma''_k}(x-x_k)\,dx_2 = G_{\hat{\Sigma}_k}(\hat{x} - \hat{x}_k),$$
where $\hat{x} = (x_0, x_1)^T$, $\hat{x}_k$ contains the first two coordinates of $x_k$, and $\hat{\Sigma}_k$ is the $2\times 2$ matrix obtained by removing the third row and column of $\Sigma''_k$.
In summary, given the matrix $W$ of the Viewing Transformation and the matrix $J_k$ of the projection to 'Ray Space' (the constant matrix $M$ in the orthographic case, the Jacobian of the local affine approximation in the perspective case), the covariance of the $k$-th Gaussian in 'Ray Space' is
$$\Sigma''_k = J_k\,W\,\Sigma_k\,W^T J_k^T.$$
From this, we can obtain the covariance of the Gaussian in Screen Space and use it to obtain the integrated reconstruction kernel:
$$q_k(\hat{x}) = G_{\hat{\Sigma}_k}(\hat{x} - \hat{x}_k).$$
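Putting these pieces together, a compact sketch (with names of our own choosing) of how the Screen Space covariance and the integrated kernel can be computed:

```python
import numpy as np

def screen_space_covariance(Sigma_local, W, J):
    """Sigma'' = J W Sigma W^T J^T in 'Ray Space'; dropping its third row and
    column gives the 2x2 covariance of the projected Gaussian."""
    Sigma_ray = J @ W @ Sigma_local @ W.T @ J.T
    return Sigma_ray[:2, :2]

def q_k(x_hat, x_hat_k, Sigma_hat):
    """Integrated reconstruction kernel: an unnormalized 2D Gaussian in Screen Space."""
    d = np.asarray(x_hat) - np.asarray(x_hat_k)
    return float(np.exp(-0.5 * d @ np.linalg.inv(Sigma_hat) @ d))

# Tiny usage example with made-up matrices.
W = np.eye(3)                              # viewing rotation
J = np.array([[1.0, 0.0, -0.15],           # Jacobian of the projection at the Gaussian center
              [0.0, 1.0, 0.10],
              [0.0, 0.0, 1.0]])
Sigma_local = np.diag([0.04, 0.09, 0.25])
Sigma_hat = screen_space_covariance(Sigma_local, W, J)
print(q_k([0.1, 0.0], [0.0, 0.0], Sigma_hat))
```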
When examining implementations of renderers for Gaussian Splatting, you'll often see that the third row of the matrix $J_k$ is set to zero or omitted entirely. This is consistent with the derivation above, since the third row and column of $\Sigma''_k$ are discarded anyway when forming $\hat{\Sigma}_k$.
Aside from the algebraic derivation, this optimization makes sense conceptually because, if we bypass the detailed steps of the Rendering Equation, we could directly transform the Gaussians to Screen Space. That is, changing $J_k$ for the $2\times 3$ matrix $\hat{J}_k$ formed by its first two rows, we can compute
$$\hat{\Sigma}_k = \hat{J}_k\,W\,\Sigma_k\,W^T \hat{J}_k^T,$$
which is a $2\times 2$ matrix: exactly the covariance of the projected 2D Gaussian.
When calculating the final color of a Gaussian in the context of the splatting equation, we need two ingredients: its effective opacity at the pixel and its view-dependent color.
To compute the effective opacity of a Gaussian, we need to use its covariance $\hat{\Sigma}_k$ in Screen Space together with its opacity $\sigma_k$:
$$\alpha_k(\hat{x}) = \sigma_k\, G_{\hat{\Sigma}_k}(\hat{x} - \hat{x}_k),$$
where we can see that it varies depending on the position $\hat{x}$ of the pixel being evaluated.
To evaluate the Gaussian we have to invert the matrix $\hat{\Sigma}_k$; its inverse is the so-called Conic Matrix that shows up in many implementations.
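In practice the inverse is usually precomputed once per Gaussian; a minimal sketch, assuming the $2\times 2$ covariance is already available (the clamp value is an illustrative choice):

```python
import numpy as np

def conic_from_covariance(Sigma_hat):
    """Invert the 2x2 Screen Space covariance analytically; the inverse is the Conic Matrix."""
    a, b, c = Sigma_hat[0, 0], Sigma_hat[0, 1], Sigma_hat[1, 1]
    det = a * c - b * b
    return np.array([[c, -b], [-b, a]]) / det

def effective_opacity(x_hat, x_hat_k, conic, sigma_k):
    """alpha_k(x_hat) = sigma_k * exp(-0.5 d^T conic d), capped slightly below 1."""
    d = np.asarray(x_hat) - np.asarray(x_hat_k)
    return min(0.99, sigma_k * float(np.exp(-0.5 * d @ conic @ d)))

Sigma_hat = np.array([[2.0, 0.3], [0.3, 1.0]])
conic = conic_from_covariance(Sigma_hat)
print(effective_opacity([10.5, 20.0], [10.0, 20.0], conic, sigma_k=0.8))
```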
If we look back at the rendering equation, we can see that the color is represented by the emission coefficient $c_k(d_k)$, which depends on the direction from the camera center to the Gaussian. To encode this view-dependent color, 3DGS uses Spherical Harmonics.
As noted in [@SH_1987], for representing functions defined on the unit sphere, Spherical Harmonics are the analogue of the Fourier series. While Spherical Harmonics are typically complex-valued, for color representation only the real-valued components are necessary.
Given the parametric representation of the unit sphere using the angles $\theta \in [0, \pi]$ and $\varphi \in [0, 2\pi)$, a point on the sphere can be written as
$$s(\theta, \varphi) = (\sin\theta\cos\varphi,\; \sin\theta\sin\varphi,\; \cos\theta)^T.$$
Using this notation, the real valued Spherical Harmonics are expressed as follows:
$$Y_\ell^m(\theta, \varphi) = \begin{cases} \sqrt{2}\,K_\ell^m \cos(m\varphi)\,P_\ell^m(\cos\theta) & m > 0, \\ K_\ell^0\,P_\ell^0(\cos\theta) & m = 0, \\ \sqrt{2}\,K_\ell^{|m|}\sin(|m|\varphi)\,P_\ell^{|m|}(\cos\theta) & m < 0, \end{cases}$$
where $P_\ell^m$ are the associated Legendre polynomials and $K_\ell^m = \sqrt{\frac{2\ell+1}{4\pi}\,\frac{(\ell-|m|)!}{(\ell+|m|)!}}$ is a normalization constant.
These functions also form an orthonormal basis for the Hilbert space of functions defined over the unit sphere. This means that for any vector-valued function $f$ on the sphere (in our case, one component per color channel) with finite norm, we can write
$$f(\theta, \varphi) = \sum_{\ell=0}^{\infty}\sum_{m=-\ell}^{\ell} c_\ell^m\,Y_\ell^m(\theta, \varphi),$$
where the coefficients are obtained by projecting $f$ onto each basis function,
$$c_\ell^m = \int_{0}^{2\pi}\int_{0}^{\pi} f(\theta, \varphi)\,Y_\ell^m(\theta, \varphi)\,\sin\theta\;d\theta\,d\varphi.$$
Therefore, the vector function $f$ is completely determined by its coefficients $c_\ell^m$, where each coefficient is itself a vector with one value per color channel.
For a more comprehensive explanation of Spherical Harmonics and Associated Legendre Polynomials, refer to [@SH_Gritty], which was written specifically for this purpose.
Returning to encoding the color of our Gaussians, recall that it depends on the direction from the camera center to the position of the Gaussian, and that this direction can be represented by a unit vector $d_k$ or, equivalently, by the pair of angles $(\theta, \varphi)$ of its spherical coordinates. The color can therefore be treated as a function on the unit sphere and expanded in the Spherical Harmonics basis.
Since Spherical Harmonics involve infinite sums, it is not feasible to compute them in full. For this reason, the authors of 3DGS chose to limit the sum to degree $\ell = 3$, which amounts to $16$ coefficients per color channel, i.e., $48$ values per Gaussian.
In many implementations of Gaussian Splatting, Spherical Harmonics are not computed using the trigonometric functions of $\theta$ and $\varphi$; instead, they are evaluated as polynomials in the components of the unit direction vector, with the normalization constants precomputed and hard-coded.
Summing up, given a Gaussian with its set of SH coefficients $c_\ell^m$ and a viewing direction $d_k$, its color is evaluated as
$$c_k(d_k) = \sum_{\ell=0}^{3}\sum_{m=-\ell}^{\ell} c_\ell^m\,Y_\ell^m(d_k).$$
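To illustrate the polynomial form, here is a sketch that evaluates a color from SH coefficients up to degree 1 only (the constants are the standard real SH normalizations; coefficient ordering and sign conventions vary between implementations, so this is not drop-in code for any particular renderer):

```python
import numpy as np

SH_C0 = 0.28209479177387814        # sqrt(1 / (4*pi))
SH_C1 = 0.4886025119029199         # sqrt(3 / (4*pi))

def sh_color(coeffs, d):
    """Evaluate an RGB color from SH coefficients up to degree 1.

    coeffs: array of shape (4, 3), one row per basis function (order 1, y, z, x here),
            one column per color channel.
    d:      view direction from the camera center to the Gaussian (normalized inside).
    Higher degrees add more polynomial terms in x, y, z in exactly the same way.
    """
    x, y, z = np.asarray(d, dtype=float) / np.linalg.norm(d)
    basis = np.array([SH_C0, SH_C1 * y, SH_C1 * z, SH_C1 * x])
    return basis @ coeffs          # shape (3,): RGB

coeffs = np.zeros((4, 3))
coeffs[0] = [0.5, 0.2, 0.1]        # constant (view-independent) part
coeffs[3] = [0.3, 0.0, 0.0]        # term proportional to x: redder when seen from +X
print(sh_color(coeffs, [1.0, 0.0, 0.0]))
print(sh_color(coeffs, [-1.0, 0.0, 0.0]))
```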
A critical issue arises when transforming Gaussians in 3D space. Since the SH coefficients are defined in the Local Space of each Gaussian, rotating the Gaussian in Global Space would cause the coefficients to no longer represent the correct color. This is evident when considering that if a Gaussian were rotated 180° about an axis, we would expect to see the opposite side's color; however, without changing the SH coefficients, the color we see remains the same as the Gaussian rotates.
To solve this problem, the authors of [@xie2023physgaussian] proposed a simple method. Instead of rotating the SH basis (which is impractical, since the basis functions are precomputed), we apply the inverse rotation to the view direction before evaluating the color: if the Gaussian has been rotated by $R$, the Spherical Harmonics are evaluated at $R^{-1}d_k = R^T d_k$, which yields the same result as rotating the basis itself.
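A sketch of that trick, assuming the Gaussian's accumulated rotation is available as a $3\times 3$ matrix `R` (how the rotation is stored is implementation-dependent):

```python
import numpy as np

def view_direction_for_sh(camera_center, gaussian_center, R):
    """Direction used to evaluate the SH color: instead of rotating the SH basis, the
    world-space view direction is brought back into the Gaussian's original frame by
    applying the inverse rotation (R^T for a rotation matrix)."""
    d = np.asarray(gaussian_center, dtype=float) - np.asarray(camera_center, dtype=float)
    return R.T @ (d / np.linalg.norm(d))
```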
We have covered all that there is to the rendering equation. Using the notation introduced over the previous sections, we have that the final color $C$ of a pixel $\hat{x}$ is
$$C(\hat{x}) = \sum_{k=1}^{N} c_k(d_k)\,\alpha_k(\hat{x})\prod_{j=1}^{k-1}\left(1 - \alpha_j(\hat{x})\right), \qquad \alpha_k(\hat{x}) = \sigma_k\,G_{\hat{\Sigma}_k}(\hat{x} - \hat{x}_k),$$
where the Gaussians are sorted from front to back.
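A straightforward (and deliberately unoptimized) sketch of how a single pixel can be accumulated front to back, including the common early exit once the pixel is almost opaque:

```python
import numpy as np

def blend_pixel(colors, alphas, opacity_limit=0.999):
    """Front-to-back alpha blending of Gaussians already sorted by depth.

    colors: (N, 3) array with c_k(d_k) for each Gaussian covering the pixel.
    alphas: (N,) array with the effective opacities alpha_k(x_hat) at this pixel.
    """
    C = np.zeros(3)
    transmittance = 1.0                    # prod_{j<k} (1 - alpha_j)
    for c_k, a_k in zip(colors, alphas):
        C += c_k * a_k * transmittance
        transmittance *= (1.0 - a_k)
        if 1.0 - transmittance > opacity_limit:
            break                          # pixel is effectively opaque
    return C

colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
alphas = np.array([0.6, 0.9])
print(blend_pixel(colors, alphas))         # mostly red, with some green showing through
```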
All that remains is to implement the rendering equation. There are several approaches to achieve this, but we will focus on two primary methods: one that leans on a graphics API, and another that is implemented directly with GPU compute.
If you are using a Graphics API like OpenGL, it is desirable to offload as much of the rendering work as possible to existing GPU functionality. Let us assume that we need to render $N$ Gaussians; the steps are the following:
-   Sorting the Gaussians. The first step is sorting all the Gaussians by the distance from the camera center to their centers, to fulfill one of the rendering equation's assumptions. This can be done either on the CPU or the GPU.

-   Color calculation. In the Vertex Shader, we can use the Spherical Harmonics to evaluate the color of each Gaussian based on the position of the camera and the position of the Gaussian in World Space.

-   Projection to Screen Space. Continuing in the Vertex Shader, we obtain $\hat{\Sigma}_k$ and calculate the bounding square of the Gaussian in Screen Space (see the sketch below). Because the Vertex Shader's output is expected in Clip Space, we have to transform the results to the intervals $[-1,1]\times[-1,1]$. Using this information, we create a camera-facing quad for each Gaussian and pass it to the Fragment Shader.

-   Pixel shading. Once in the Fragment Shader, what is left to do is to evaluate the effective opacity $\alpha_k(\hat{x})$. Since the calculations are done in Clip Space and not Screen Space, extra care is required to ensure accuracy. One optimization that some renderers use is to assume that the corners of the quad lie at a fixed distance from the center (usually a value bigger than 1) and that the Conic Matrix is the identity; this gives good results, but we have to remember that it is only an approximation. Once the opacity is obtained, it is stored in the alpha channel, and the color obtained in the previous step is multiplied by the opacity and stored in the RGB channels.

-   $\alpha$-blending. We can let the graphics API handle this by disabling the depth test and activating alpha blending with the blending function set to addition. For each fragment, the new color obtained from the Fragment Shader is multiplied by one minus the $\alpha$ value already stored in the framebuffer and added to the color stored in the buffer.
This setup ensures that the final color of each pixel is rendered correctly because the Gaussians are processed from front to back. Note, however, that the formula assumes per-pixel sorting of the Gaussians, whereas in this approach, sorting is done globally. Consequently, the results are approximate.
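For the 'Projection to Screen Space' step above, the size of the quad is usually derived from the 2D covariance. A minimal sketch, assuming a $3\sigma$ cutoff (the exact cutoff is a design choice):

```python
import numpy as np

def bounding_quad_half_extent(Sigma_hat, n_sigma=3.0):
    """Half side length of a camera-facing quad that covers the projected Gaussian
    up to n_sigma standard deviations, taken from the largest eigenvalue."""
    return n_sigma * np.sqrt(np.max(np.linalg.eigvalsh(Sigma_hat)))

Sigma_hat = np.array([[9.0, 2.0], [2.0, 4.0]])      # 2D covariance in pixel units
r = bounding_quad_half_extent(Sigma_hat)
x_hat_k = np.array([320.0, 180.0])                  # projected Gaussian center
corners = x_hat_k + r * np.array([[-1, -1], [1, -1], [1, 1], [-1, 1]])
print(corners)                                      # the quad sent to the Fragment Shader
```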
For an implementation directly on the GPU, the authors of the original 3D Gaussian Splatting paper proposed a method using tiles. Below is a breakdown of the approach, which begins by transferring all relevant information to the GPU:
-   Screen Division. The entire screen is divided into $16\times 16$ pixel tiles. After that, a culling process is performed, keeping only the Gaussians whose $99\%$ confidence region intersects the view frustum. Additionally, Gaussians at extreme positions are also discarded, since computing $\hat{\Sigma}_k$ becomes unstable for them.

-   Pre-Sorting. This step is a key component of the method. Each time a Gaussian overlaps a tile, it is duplicated on the GPU and a key is assigned to it, storing all of these instances in a single array. The keys are formed by combining the ID of the tile the instance overlaps with the Gaussian's depth from the camera: the Tile ID occupies the most significant bits and the depth the least significant ones (see the sketch below).

-   Sorting. Once the duplication finishes, a single GPU Radix Sort [@merrill2010GPURadixSort] is applied to the array. Since Radix Sort is stable, the sorted array is divided into blocks where all Gaussians in each block belong to the same tile and are sorted by depth. Once again, because the sorting is per tile and not per pixel, the final color will be an approximation in some situations, as noted by the authors of [@GaussianSplatting].

-   Rasterization. A thread block is launched for each tile, and the corresponding section of the sorted array is loaded into shared memory. Within the thread block, a straightforward implementation of the rendering equation is used to obtain the color of all pixels in the tile, accumulating Gaussians front to back until some desired opacity close to 1 is reached.
Although this method produces excellent results, duplicating Gaussians for each tile significantly increases VRAM usage. As a result, this approach is only practical on high-end GPUs.
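To make the key construction concrete, here is a CPU-side numpy sketch of the idea (the real implementation builds the keys and runs the radix sort on the GPU, and its exact bit layout may differ; this is only an illustration):

```python
import numpy as np

def build_keys(tile_ids, depths):
    """Pack (tile id, depth) into one 64-bit key per duplicated Gaussian: tile id in the
    high 32 bits, the depth's float bits in the low 32 bits (monotonic for positive
    depths), so sorting by key groups Gaussians by tile and orders them front to back."""
    depth_bits = np.asarray(depths, dtype=np.float32).view(np.uint32).astype(np.uint64)
    return (np.asarray(tile_ids, dtype=np.uint64) << np.uint64(32)) | depth_bits

tile_ids = np.array([3, 1, 3, 1])
depths = np.array([2.5, 0.7, 1.1, 4.2], dtype=np.float32)
order = np.argsort(build_keys(tile_ids, depths), kind="stable")  # stand-in for the GPU Radix Sort
print(order)   # Gaussians grouped per tile, each group sorted by depth
```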
As an extra, to generate a depth map of the Gaussians from a given camera center, we can adapt the rendering equation following the approach in [@DepthNormalizaion]:
$$D(\hat{x}) = \frac{\displaystyle\sum_{k=1}^{N} z_k\,\alpha_k(\hat{x})\prod_{j=1}^{k-1}\left(1 - \alpha_j(\hat{x})\right)}{\displaystyle\sum_{k=1}^{N} \alpha_k(\hat{x})\prod_{j=1}^{k-1}\left(1 - \alpha_j(\hat{x})\right)}.$$
Now $z_k$ denotes the depth of the $k$-th Gaussian from the camera center, and the denominator normalizes by the accumulated opacity so that the depth remains meaningful for pixels that are not fully covered.
Notably, the terms in this equation align with those used for rendering the pixel color. This shared structure enables the depth map to be generated in the same render pass as the image, providing computational efficiency.
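Since the weights are shared, the depth can be accumulated alongside the color in the same per-pixel loop. Extending the earlier blending sketch (the normalization by the accumulated opacity follows the formula above):

```python
import numpy as np

def blend_pixel_with_depth(colors, depths, alphas, eps=1e-8):
    """Accumulate color and normalized depth for one pixel in a single front-to-back pass."""
    C, D, acc_alpha, transmittance = np.zeros(3), 0.0, 0.0, 1.0
    for c_k, z_k, a_k in zip(colors, depths, alphas):
        w = a_k * transmittance            # same weight used for color and depth
        C += w * c_k
        D += w * z_k
        acc_alpha += w
        transmittance *= (1.0 - a_k)
    return C, D / (acc_alpha + eps)

colors = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
depths = np.array([1.5, 3.0])
alphas = np.array([0.5, 0.8])
print(blend_pixel_with_depth(colors, depths, alphas))
```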