NeuS: Learning Neural Implicit Surfaces by Volume Rendering for Multi-view Reconstruction
* Authors: [[Peng Wang]], [[Lingjie Liu]], [[Yuan Liu]], [[Christian Theobalt]], [[Taku Komura]], [[Wenping Wang]]
First Impressions
NeuS represents the surface of a 3D object with a signed distance function (SDF) and proposes an s-density volume rendering scheme, built on the derivative of the sigmoid function, so that training directly yields an SDF. It also proposes an unbiased weight function to handle rays that pass through multiple surfaces.
Note
The idea behind the s-density $\phi_s$: we want to derive a density from the SDF. The density should be largest on the object surface, and the surface corresponds to an SDF value of 0, so we need a unimodal function that peaks at 0, such as the $\phi_s$ chosen in the paper.
The weight function must satisfy two conditions:
- Unbiasedness: the weight should be maximal on the object surface, so that the point where the ray crosses the zero-level set of the SDF (i.e., the predicted surface) contributes most to the pixel color.
- Occlusion-awareness: if two depth values $t_0$ and $t_1$ along a ray have the same SDF value, the point closer to the view point should contribute more to the final color. Such a configuration indicates that the ray passes through two surfaces, and the color of the surface nearer the viewing position matters more.
TL;DR
- Goal: reconstruct objects and scenes with high fidelity from 2D image inputs.
- Existing neural surface reconstruction approaches require foreground masks as supervision and easily get trapped in local minima, so they struggle to reconstruct objects with severe self-occlusion or thin structures.
- Extracting high-quality surfaces from NeRF is difficult because its density representation lacks sufficient surface constraints.
- NeuS represents a surface as the zero-level set of a signed distance function (SDF) and develops a new volume rendering method to train a neural SDF representation.
- The conventional volume rendering method causes inherent geometric errors (i.e., bias) for surface reconstruction.
- NeuS proposes a new formulation that is free of bias in the first order of approximation, leading to more accurate surface reconstruction even without mask supervision.
Introduction
IDR
- The surface rendering method used in IDR considers only a single surface intersection point per ray. It produces impressive reconstruction results but fails to reconstruct objects with complex structures that cause abrupt depth changes.
NeRF
- a surface extracted as a level set of the density field learned by NeRF contains conspicuous noise in some planar regions
NeuS
- uses the signed distance function (SDF) for surface representation
- uses a novel volume rendering scheme to learn a neural SDF representation
- NeuS is capable of reconstructing complex 3D objects and scenes with severe occlusions and delicate structures, even without foreground masks as supervision
Method
Rendering Procedure
Scene representation
- $f: \mathbb{R}^3 \rightarrow \mathbb{R}$ maps a spatial position $\mathbf{x} \in \mathbb{R}^3$ to its signed distance to the object surface
- $c: \mathbb{R}^3 \times \mathbb{S}^2 \rightarrow \mathbb{R}^3$ encodes the color associated with a point $\mathbf{x} \in \mathbb{R}^3$ and a viewing direction $\mathbf{v} \in \mathbb{S}^2$
The surface $\mathcal{S}$ of the object is represented by the zero-level set of its SDF
$$ \mathcal{S}=\left\{\mathbf{x} \in \mathbb{R}^3 \mid f(\mathbf{x})=0\right\} $$
In order to apply volume rendering to training the SDF network, NeuS first introduces the S-density: a probability density $\phi_s(f(\mathbf{x}))$, where $\phi_s(x)=s e^{-s x} /\left(1+e^{-s x}\right)^2$ is the derivative of the sigmoid function $\Phi_s(x)=\left(1+e^{-s x}\right)^{-1}$, i.e., $\phi_s(x)=\Phi_s^{\prime}(x)$ (the logistic density distribution).
The standard deviation of $\phi_s(x)$ is $1/s$, which is a trainable parameter; $1/s$ approaches zero as the network training converges.
The zero-level set of the network-encoded SDF is expected to represent an accurately reconstructed surface $\mathcal{S}$, with its induced S-density $\phi_s(f(\mathbf{x}))$ assuming prominently high values near the surface.
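A minimal NumPy sketch (mine, not the paper's code) of $\phi_s$: the density peaks exactly at $f=0$ with height $s/4$, so it collapses onto the surface as the trainable $1/s$ shrinks:

```python
import numpy as np

def phi_s(x, s):
    # logistic density: the derivative of the sigmoid Phi_s(x) = 1 / (1 + exp(-s*x))
    return s * np.exp(-s * x) / (1.0 + np.exp(-s * x)) ** 2

sdf_vals = np.linspace(-0.5, 0.5, 101)
for s in (1.0, 10.0, 100.0):
    d = phi_s(sdf_vals, s)
    # the peak sits at sdf = 0 (the surface) with height s / 4, so the density
    # concentrates on the surface as 1/s approaches zero during training
    print(f"s={s:>5}: argmax at sdf={sdf_vals[np.argmax(d)]:+.2f}, peak={d.max():.2f}")
```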
Rendering
Given a pixel, denote the ray emitted from the pixel as $\{\mathbf{p}(t)=\mathbf{o}+t \mathbf{v} \mid t \geq 0\}$, where $\mathbf{o}$ is the camera center and $\mathbf{v}$ is the unit direction of the ray. We accumulate the colors along the ray by
$$ C(\mathbf{o}, \mathbf{v})=\int_0^{+\infty} w(t) c(\mathbf{p}(t), \mathbf{v}) \mathrm{d} t $$
where $C(\mathbf{o}, \mathbf{v})$ is the output color for this pixel, $w(t)$ a weight for the point $\mathbf{p}(t)$, and $c(\mathbf{p}(t), \mathbf{v})$ the color at the point $\mathbf{p}(t)$ along the viewing direction $\mathbf{v}$.
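A hedged sketch of how this integral is approximated numerically; the midpoint quadrature and the helper name `render_pixel_color` are my choices, not the paper's:

```python
import numpy as np

def render_pixel_color(t_vals, w_vals, c_vals):
    """Midpoint-rule approximation of C(o, v) = ∫ w(t) c(p(t), v) dt.

    t_vals : (n,)   sample depths along the ray
    w_vals : (n,)   weights w(t_i)
    c_vals : (n, 3) radiance c(p(t_i), v)
    """
    dt = np.diff(t_vals)                          # interval lengths, (n-1,)
    w_mid = 0.5 * (w_vals[1:] + w_vals[:-1])      # midpoint weights
    c_mid = 0.5 * (c_vals[1:] + c_vals[:-1])      # midpoint colors
    return np.sum((w_mid * dt)[:, None] * c_mid, axis=0)

t = np.linspace(0.0, 2.0, 64)
w = np.exp(-0.5 * ((t - 1.0) / 0.05) ** 2)        # weight bump near a surface at t = 1
w /= np.sum(0.5 * (w[1:] + w[:-1]) * np.diff(t))  # normalize so the weights integrate to 1
c = np.tile([0.8, 0.2, 0.1], (len(t), 1))         # constant radiance along the ray
print(render_pixel_color(t, w, c))                # ≈ [0.8, 0.2, 0.1]
```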
Requirements on weight function
Unbiased
- $w(t)$ attains a locally maximal value at a surface intersection point $\mathbf{p}(t^*)$, i.e., a point on the zero-level set of the SDF, $f(\mathbf{p}(t^*))=0$
- this guarantees that the intersection of the camera ray with the zero-level set of the SDF contributes most to the pixel color
Occlusion-aware
- Given any two depth values $t_0$ and $t_1$ satisfying $f(\mathbf{p}(t_0)) = f(\mathbf{p}(t_1))$, $w(t_0) > 0$, $w(t_1) > 0$, and $t_0 < t_1$, it holds that $w(t_0) > w(t_1)$. That is, when two points have the same SDF value (and thus the same SDF-induced S-density value), the point nearer to the view point should contribute more to the final output color.
- this ensures that when a ray sequentially passes multiple surfaces, the rendering procedure correctly uses the color of the surface nearest to the camera to compute the output color
Naive solution: $w(t)=T(t) \sigma(t)$, where $T(t)=\exp \left(-\int_0^t \sigma(u) \mathrm{d} u\right)$ as in standard volume rendering; this is occlusion-aware but biased.
NeuS's solution:
$$ w(t)=T(t) \rho(t), \quad \text{where } T(t)=\exp \left(-\int_0^t \rho(u) \mathrm{d} u\right) $$
$$ \rho(t)=\max \left(\frac{-\frac{\mathrm{d} \Phi_s}{\mathrm{d} t}(f(\mathbf{p}(t)))}{\Phi_s(f(\mathbf{p}(t)))}, 0\right) $$
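The paper discretizes this into alpha-compositing, with $\alpha_i$ the clamped relative decrease of $\Phi_s(f)$ between consecutive samples and $w_i = T_i \alpha_i$. A NumPy sketch (the toy SDF values and $s$ are illustrative):

```python
import numpy as np

def neus_weights(sdf_vals, s):
    """Discrete NeuS weights from SDF values f(p(t_i)) sampled along a ray.

    alpha_i = max((Phi_s(f_i) - Phi_s(f_{i+1})) / Phi_s(f_i), 0), then
    front-to-back compositing gives w_i = T_i * alpha_i.
    """
    phi = 1.0 / (1.0 + np.exp(-s * sdf_vals))                      # Phi_s(f(p(t_i)))
    alpha = np.clip((phi[:-1] - phi[1:]) / (phi[:-1] + 1e-8), 0.0, 1.0)
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alpha[:-1])))  # T_i
    return trans * alpha

# toy ray crossing two surfaces: the SDF changes sign twice going inward
sdf = np.array([0.3, 0.1, -0.1, -0.3, -0.1, 0.1, 0.3, 0.1, -0.1, -0.3])
print(neus_weights(sdf, s=10.0).round(3))
```

On this toy ray the weight mass concentrates at the first zero crossing, while the second surface only receives the residual transmittance, which is exactly the unbiased and occlusion-aware behavior required above.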
Training
$n$: number of points sampled per ray; $m$: batch size (number of rays)
$$ \mathcal{L}=\mathcal{L}_{\text{color}}+\lambda \mathcal{L}_{\text{reg}}+\beta \mathcal{L}_{\text{mask}} $$
color loss
$$ \mathcal{L}_{\text {color }}=\frac{1}{m} \sum_k \mathcal{R}\left(\hat{C}_k, C_k\right) $$
$\mathcal{R}$ is the L1 loss
Eikonal term
$$ \mathcal{L}_{\text{reg}}=\frac{1}{nm} \sum_{k, i}\left(\left\|\nabla f\left(\hat{\mathbf{p}}_{k, i}\right)\right\|_2-1\right)^2 $$
mask loss
$$ \mathcal{L}_{\text{mask}}=\operatorname{BCE}\left(M_k, \hat{O}_k\right) $$
where $M_k \in \{0,1\}$ is the ground-truth mask value and $\hat{O}_k=\sum_i T_{k,i} \alpha_{k,i}$ is the accumulated weight (opacity) along ray $k$; this term is optional and only used when masks are available.
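A PyTorch sketch of the full objective; the tensor shapes, the `lam`/`beta` defaults, and the numerical clamp on the opacity are my assumptions rather than the paper's code. `pred_opacity` here is the accumulated weight $\hat{O}_k$ produced by the renderer.

```python
import torch
import torch.nn.functional as F

def neus_loss(pred_rgb, gt_rgb, sdf_grad, pred_opacity, gt_mask,
              lam=0.1, beta=0.1):
    """Total loss L = L_color + lambda * L_reg + beta * L_mask.

    pred_rgb, gt_rgb : (m, 3)    rendered / ground-truth pixel colors
    sdf_grad         : (m, n, 3) gradients of f at the n*m sampled points
    pred_opacity     : (m,)      accumulated ray weight O_k
    gt_mask          : (m,)      mask value M_k in {0, 1}
    """
    color = F.l1_loss(pred_rgb, gt_rgb)                      # L_color, R = L1
    eikonal = ((sdf_grad.norm(dim=-1) - 1.0) ** 2).mean()    # L_reg (Eikonal term)
    mask = F.binary_cross_entropy(                           # L_mask
        pred_opacity.clamp(1e-4, 1.0 - 1e-4), gt_mask.float())
    return color + lam * eikonal + beta * mask
```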