also been explored. Height-map representations [3] work with compact data structures allowing scalability, and are especially suited for modeling large buildings with floors and walls, since these appear as clear discontinuities in the height-map. Multi-layered height-maps support reconstruction of more complex 3D scenes such as balconies, doorways, and arches [3]. While these methods support compression of surface data for simple scenes, the 2.5D representation fails to model complex 3D environments efficiently.

Point-based representations are more amenable to the input acquired from depth/range sensors. [18] used a point-based method and a custom structured light sensor to demonstrate in-hand online 3D scanning. Online model rendering required an intermediate volumetric data structure. Interestingly, an offline volumetric method [2] was used for higher quality final output, which nicely highlights the computational and quality trade-offs between point-based and volumetric methods. [22] took this one step further, demonstrating higher quality scanning of small objects using a higher resolution custom structured light camera, sensor drift correction, and higher quality surfel-based [15] rendering. These systems however focus on scanning single small objects. Further, the sensors produce less noise than consumer depth cameras (due to dynamic rather than fixed structured light patterns), making model denoising less challenging.

Beyond reducing computational complexity, point-based methods lower the memory overhead associated with volumetric (regular grid) approaches, as long as overlapping points are merged. Such methods have therefore been used in larger-sized reconstructions [4, 20]. However, a clear trade-off becomes apparent in terms of scale versus speed and quality. For example, [4] allows for reconstructions of entire floors of a building (with support for loop closure and bundle adjustment), but the frame rate is limited (∼3 Hz) and an unoptimized surfel map representation for merging 3D points can take seconds to compute. [20] uses a multi-level surfel representation that achieves interactive rates (∼10 Hz) but requires an intermediate octree representation, which limits scalability and adds computational complexity.

In this paper we present an online reconstruction system also based around a flat, point-based representation, rather than any spatial data structure. A key contribution is that our system is memory-efficient, supporting spatially extended reconstructions without trading reconstruction quality or frame rate. As we will show, the ability to directly render the representation using the standard graphics pipeline, without converting between multiple representations, enables efficient implementation of all central operations, i.e., camera pose estimation, data association, denoising and fusion through data accumulation, and outlier removal.

A core technical contribution is leveraging a fusion method that closely resembles [2] but removes the voxel grid altogether. Despite the lack of a spatial data structure, our system still captures many benefits of volumetric fusion, with competitive performance and quality compared to previous online systems, allowing for the accumulation of denoised 3D models over time that exploit redundant samples, model measurement uncertainty, and make no topological assumptions.

The simplicity of our approach allows us to tackle another fundamental challenge of online reconstruction systems: the assumption of a static scene. Most previous systems make this assumption or treat dynamic content as outliers [18, 22]; only KinectFusion [6] is at least capable of reconstructing moving objects in a scene, provided a static pre-scan of the background is first acquired. Instead, we leverage the immediacy of our representation to design a method that not only robustly segments dynamic objects in the scene, which greatly improves the robustness of camera pose estimation, but also continuously updates the global reconstruction, regardless of whether objects are added or removed. Our approach is further able to detect when a moving object has become static or a stationary object has become dynamic.

The ability to support reconstructions at a quality comparable to the state of the art, without trading real-time performance, together with extended spatial scale and support for dynamic scenes, provides unique capabilities over prior work. We conclude with results from reconstructing a variety of static and dynamic scenes of different scales, and an experimental comparison to related systems.

2. System Overview

Our high-level approach shares commonalities with existing incremental reconstruction systems (presented previously): we use samples from a moving depth sensor; first preprocess the depth data; then estimate the current six degree-of-freedom (6DoF) pose of the sensor relative to the scene; and finally use this pose to convert depth samples into a unified coordinate space and fuse them into an accumulated global model. Unlike prior systems, we adopt a purely point-based representation throughout our pipeline, carefully designed to support data fusion with quality comparable to online volumetric methods, whilst enabling real-time reconstructions at extended scales and in dynamic scenes.

Our choice of representation makes our pipeline extremely amenable to implementation using commodity graphics hardware. The main system pipeline, as shown in Fig. 1, is based on the following steps:

Figure 1. Main system pipeline.

Depth Map Preprocessing  Using the intrinsic parameters of the camera, each input depth map from the depth
sensor is transformed into a set of 3D points, stored in a 2D vertex map. Corresponding normals are computed from central differences of the denoised vertex positions, and per-point radii are computed as a function of depth and gradient (stored in respective normal and radius maps).

Depth Map Fusion  Given a valid camera pose, input points are fused into the global model. The global model is simply a list of 3D points with associated attributes. Points evolve from unstable to stable status based on the confidence they have gathered (essentially a function of how often they are observed by the sensor). Data fusion first projectively associates each point in the input depth map with the set of points in the global model, by rendering the model as an index map. If corresponding points are found, the most reliable point is merged with the new point estimate using a weighted average. If no reliable corresponding points are found, the new point estimate is added to the global model as an unstable point. The global model is cleaned up over time to remove outliers due to visibility and temporal constraints. Sec. 4 discusses our point-based data fusion in detail.

Camera Pose Estimation  All established (high confidence) model points are passed to the visualization stage, which reconstructs dense surfaces using a surface splatting technique (see Sec. 5). To estimate the 6DoF camera pose, the model points are projected from the previous camera pose, and a pyramid-based dense iterative closest point (ICP) [11] alignment is performed using this rendered model map and the input depth map. This provides a new relative rigid 6DoF transformation that maps from the previous to the new global camera pose. Pose estimation occurs prior to data fusion, to ensure the correct projection during data association.

Dynamics Estimation  A key feature of our method is the automatic detection of dynamic changes in the scene, to update the global reconstruction and support robust camera tracking. Dynamic objects are initially indicated by outliers in point correspondences during ICP. Starting from these areas, we perform a point-based region growing procedure to identify dynamic regions. These regions are excluded from the camera pose estimate, and their corresponding points in the global model are reset to unstable status, leading to a natural propagation of scene changes into our depth map fusion. For more detail, see Sec. 6.
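To make the per-frame control flow concrete, the following is a minimal sketch of the main loop implied by the four stages above. The helper names (preprocess, estimate_pose, segment_dynamics, render_index_map, fuse) are hypothetical placeholders for the stages described in Secs. 3-6, not the authors' implementation.

```python
# Hypothetical per-frame loop mirroring Fig. 1 (illustrative only).
def process_frame(depth_raw, K, global_model, T_prev):
    # Sec. 3: vertex/normal/radius maps from the raw depth image.
    frame = preprocess(depth_raw, K)

    # Sec. 5: dense hierarchical ICP against a splat rendering of the model.
    T_cur, icp_status = estimate_pose(frame, global_model, T_prev, K)

    # Sec. 6: grow dynamic regions starting from ICP outliers ("no_corr" samples).
    dynamics_map = segment_dynamics(frame, icp_status)

    # Sec. 4: projective data association via an index map, then merge/add/remove.
    index_map = render_index_map(global_model, T_cur, K, supersample=4)
    fuse(frame, dynamics_map, index_map, global_model, T_cur)

    return T_cur
```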
3. Depth Map Preprocessing

We denote a 2D pixel as u = (x, y) ∈ R². D_i ∈ R is the raw depth map at time frame i. Given the intrinsic camera calibration matrix K_i, we transform D_i into a corresponding vertex map V_i by converting each depth sample D_i(u) into a vertex position v_i(u) = D_i(u) K_i⁻¹ (uᵀ, 1)ᵀ ∈ R³ in camera space. A corresponding normal map N_i is determined from central differences of the vertex map. A copy of the depth map (and hence the associated vertices and normals) is also denoised using a bilateral filter [21] (for camera pose estimation later).

The 6DoF camera pose transformation comprises a rotation matrix R_i ∈ SO(3) and a translation vector t_i ∈ R³, computed per frame i as T_i = [R_i, t_i] ∈ SE(3). A vertex is converted to global coordinates as v_i^g = T_i v_i. The associated normal is converted to global coordinates as n_i^g(u) = R_i n_i(u). Multi-scale pyramids V_i^l and N_i^l are computed from the vertex and normal maps for hierarchical ICP, where l ∈ {0, 1, 2} and l = 0 denotes the original input resolution (e.g. 640×480 for Kinect or 200×200 for PMD CamBoard).

Each input vertex also has an associated radius r_i(u) ∈ R (collectively stored in a radius map R_i ∈ R), determined as in [22]. To prevent arbitrarily large radii from oblique views, we clamp radii for grazing observations exceeding 75°.

In the remainder, we omit time frame indices i for clarity, unless we refer to two different time frames at once.
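As an illustration of this preprocessing step, here is a minimal numpy sketch of back-projecting a depth map with the intrinsics K and computing central-difference normals. The exact radius formula of [22] is not reproduced; the radius line below is a simplified stand-in, and border handling via np.roll is a shortcut a real implementation would replace with explicit masking.

```python
import numpy as np

def preprocess_depth(D, K):
    """Back-project depth map D (H x W, metres) using intrinsics K (3 x 3).

    Returns a vertex map V (H x W x 3), normal map N (H x W x 3) and a
    simplified radius map R (H x W). Invalid depth is assumed to be 0.
    """
    H, W = D.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    xs, ys = np.meshgrid(np.arange(W), np.arange(H))

    # v(u) = D(u) * K^-1 * (u, 1)^T, expanded per component.
    V = np.dstack(((xs - cx) / fx * D, (ys - cy) / fy * D, D))

    # Central differences along x and y give two tangent vectors; their
    # cross product is the (unnormalised) surface normal.
    # (np.roll wraps at the image border; border pixels should be masked.)
    dx = np.roll(V, -1, axis=1) - np.roll(V, 1, axis=1)
    dy = np.roll(V, -1, axis=0) - np.roll(V, 1, axis=0)
    N = np.cross(dx, dy)
    N /= np.linalg.norm(N, axis=2, keepdims=True) + 1e-12

    # Simplified per-point radius: grows with depth and with obliqueness,
    # clamped at a 75-degree grazing angle as in the text.
    view = -V / (np.linalg.norm(V, axis=2, keepdims=True) + 1e-12)
    cos_theta = np.clip(np.abs(np.sum(N * view, axis=2)),
                        np.cos(np.radians(75.0)), 1.0)
    R = D / (np.sqrt(2.0) * fx * cos_theta)

    return V, N, R
```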
4. Depth Map Fusion

Our system maintains a single global model, which is simply an unstructured set of points P̄_k, each with associated position v̄_k ∈ R³, normal n̄_k ∈ R³, radius r̄_k ∈ R, confidence counter c̄_k ∈ R, and time stamp t̄_k ∈ N, stored in a flat array indexed by k ∈ N.

New measurements v are either added as or merged with unstable points, or they get merged with stable model points. Merging v with a point P̄_k in the global model increases the confidence counter c̄_k. Eventually an unstable point changes its status to stable: points with c̄_k ≥ c_stable are considered stable (in practice c_stable = 10). In specific temporal or geometric conditions, points are removed from the global model.
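The flat point array maps naturally to a structure-of-arrays layout on the GPU. The snippet below is a minimal CPU-side sketch of such a layout; the field grouping, fixed capacity, and method names are illustrative assumptions, not the authors' data structure.

```python
import numpy as np

class GlobalModel:
    """Flat, index-addressable point list: position, normal, radius,
    confidence counter and time stamp per point (structure of arrays).
    Capacity growth and compaction are omitted for brevity."""

    def __init__(self, capacity=1_000_000):
        self.size = 0
        self.position = np.zeros((capacity, 3), dtype=np.float32)
        self.normal = np.zeros((capacity, 3), dtype=np.float32)
        self.radius = np.zeros(capacity, dtype=np.float32)
        self.confidence = np.zeros(capacity, dtype=np.float32)
        self.timestamp = np.zeros(capacity, dtype=np.int32)

    def add(self, v, n, r, alpha, t):
        # New measurements enter as unstable points with confidence alpha.
        k = self.size
        self.position[k], self.normal[k] = v, n
        self.radius[k], self.confidence[k], self.timestamp[k] = r, alpha, t
        self.size += 1
        return k

    def is_stable(self, k, c_stable=10.0):
        return self.confidence[k] >= c_stable
```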
4.1. Data Association

After estimation of the camera pose of the current input frame (see Sec. 5), each vertex v^g and its associated normal and radius are integrated into the global model.

In a first step, for each valid vertex v^g, we find potential corresponding points in the global model. Given the inverse global camera pose T⁻¹ and intrinsics K, each point P̄_k in the global model can be projected onto the image plane of the current physical camera view, where the respective point index k is stored: we render all model points into a sparse index map I. Unlike the splat-based dense surface reconstruction renderer used in other parts of our pipeline (see Sec. 5), this stage renders each point index into a single pixel to reveal the actual surface sample distribution.

As nearby model points may project onto the same pixel, we increase the precision of I by supersampling, representing I at 4 × 4 times the resolution of the input depth map. We start identifying model points near v^g(u) by collecting point indices within the 4×4 neighborhood around each input pixel location u (suitably coordinate-transformed from D to I).
Amongst those points, we determine a single corresponding model point by applying the following criteria:

1. Discard points that are more than ±δ_depth away from the viewing ray v^g(u) (the sensor line of sight), with δ_depth adapted according to sensor uncertainty (i.e. as a function of depth for triangulation-based methods [13]).
2. Discard points whose normals have an angle larger than δ_norm to the normal n^g(u). (We use δ_norm = 20°.)
3. From the remaining points, select the ones with the highest confidence count.
4. If multiple such points exist, select the one closest to the viewing ray through v^g(u).
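A compact sketch of this projective association is given below, assuming the supersampled index map has already been rendered and that accessors into the hypothetical GlobalModel arrays from above are available. Treating criterion 1 as a signed offset along the viewing ray is one possible reading; the δ_norm = 20° default follows the text, while delta_depth is passed in.

```python
import numpy as np

def find_corresponding(u, v_g, n_g, index_map, model, delta_depth,
                       delta_norm_deg=20.0):
    """Pick one model point for input vertex v_g at pixel u = (x, y).

    index_map is 4x-supersampled: entry -1 means 'no point projects here',
    otherwise it stores a global model index k. Normals are assumed unit length.
    """
    x, y = u
    # Candidate indices from the 4x4 block of supersampled pixels.
    block = index_map[4 * y:4 * y + 4, 4 * x:4 * x + 4].ravel()
    candidates = [int(k) for k in block if k >= 0]

    ray = v_g / (np.linalg.norm(v_g) + 1e-12)          # sensor line of sight
    cos_min = np.cos(np.radians(delta_norm_deg))
    best, best_key = None, None
    for k in candidates:
        p, n, c = model.position[k], model.normal[k], model.confidence[k]
        if abs(np.dot(p - v_g, ray)) > delta_depth:     # criterion 1
            continue
        if np.dot(n, n_g) < cos_min:                    # criterion 2
            continue
        dist_to_ray = np.linalg.norm(p - np.dot(p, ray) * ray)
        key = (-c, dist_to_ray)                         # criteria 3 and 4
        if best_key is None or key < best_key:
            best, best_key = k, key
    return best                                         # None if nothing survives
```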
4.2. Point Averaging with Sensor Uncertainty

If a corresponding model point P̄_k is found during data association, it is averaged with the input vertex v^g(u) and normal n^g(u) as follows:

v̄_k ← (c̄_k v̄_k + α v^g(u)) / (c̄_k + α),    n̄_k ← (c̄_k n̄_k + α n^g(u)) / (c̄_k + α),    (1)

c̄_k ← c̄_k + α,    t̄_k ← t,    (2)

where t is a new time stamp. Our weighted average is distinct from that of the original KinectFusion system [11], as we introduce an explicit sample confidence α. This applies a Gaussian weight to the current depth measurement as α = e^(−γ²/(2σ²)), where γ is the normalized radial distance of the current depth measurement from the camera center, and σ = 0.6 is derived empirically. This approach weights measurements based on the assumption that measurements closer to the sensor center will increase in accuracy [2]. As shown in Fig. 2, modeling this sensor uncertainty leads to higher quality denoising.

Figure 2. Weighted averaging of points using our method (left) and the method of [11] (right).

Since the noise level of the input measurement increases as a function of depth [13], we apply Eqs. (1) only if the radius of the new point is not significantly larger than the radius of the model point, i.e., if r(u) ≤ (1 + δ_r) r̄; we empirically chose δ_r = 1/2. This ensures that we always refine details, but never coarsen the global model. We apply the time stamp and the confidence counter updates according to Eqs. (2) irrespectively.

If no corresponding model point has been identified, a new unstable point is added to the global model with c̄_k = α, containing the input vertex, normal, and radius.
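A minimal sketch of this update rule is shown below, reusing the hypothetical GlobalModel arrays sketched earlier. Here gamma is the normalized radial distance of the pixel from the camera center, and the σ = 0.6 and δ_r = 1/2 constants follow the text; re-normalising the averaged normal is an implementation choice, not part of Eq. (1).

```python
import numpy as np

def merge_measurement(model, k, v_g, n_g, r_new, gamma, t,
                      sigma=0.6, delta_r=0.5):
    """Apply Eqs. (1) and (2): confidence-weighted running average of a
    model point with a new measurement, using a Gaussian sample weight."""
    alpha = np.exp(-gamma ** 2 / (2.0 * sigma ** 2))
    c = model.confidence[k]

    # Eq. (1): only refine geometry if the new point is not much coarser
    # than the stored one, i.e. r_new <= (1 + delta_r) * r_model.
    if r_new <= (1.0 + delta_r) * model.radius[k]:
        model.position[k] = (c * model.position[k] + alpha * v_g) / (c + alpha)
        n = (c * model.normal[k] + alpha * n_g) / (c + alpha)
        model.normal[k] = n / (np.linalg.norm(n) + 1e-12)  # optional re-normalisation

    # Eq. (2): confidence and time stamp are updated irrespectively.
    model.confidence[k] = c + alpha
    model.timestamp[k] = t
```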
4.3. Removing Points

So far we have merged or added new measurements to the global model. Another key step is to remove points from our global model under various conditions:

1. Points that remain in the unstable state for a long time are likely outliers or artifacts from moving objects and will be removed after t_max time steps.
2. For stable model points that are merged with new data, we remove all model points that lie in front of these newly merged points, as these are free-space violations. To find these points to remove, we use the index map again and search the neighborhood around the pixel location that the merged point projects onto¹. This is similar in spirit to the free-space carving method of [2], but avoids expensive voxel space traversal.
3. If, after averaging, a stable point has neighboring points (identified again via the index map) with very similar position and normal and their radii overlap, then we merge these redundant neighboring points to further simplify the model.

Points are first marked for removal, and in a second pass the list is sorted (using a fast radix sort implementation), moving all marked points to the end; finally, the marked items are deleted.

¹ Backfacing points that are close to the merged points remain protected; such points may occur in regions of high curvature or around thin geometry in the presence of noise and slight registration errors. Furthermore, we protect points that would be consistent with direct neighbor pixels in D, to avoid spurious removal of points around depth discontinuities.

5. Camera Pose Estimation

Following the approach of KinectFusion [11], our camera pose estimation uses dense hierarchical ICP to align the bilaterally filtered input depth map D_i (of the current frame i) with the reconstructed model, by rendering the model into a virtual depth map, or model map, D̂_{i−1}, as seen from the previous frame's camera pose T_{i−1}. We use 3 hierarchy levels, with the finest level at the camera's resolution; unstable model points are ignored. The registration transformation provides the relative change from T_{i−1} to T_i.

While KinectFusion employs raycasting of the (implicit) voxel-based reconstruction, we render our explicit, point-based representation using a simple surface-splatting technique: we render overlapping, disk-shaped surface splats that are spanned by the model point's position v̄, radius r̄, and orientation n̄. Unlike more refined surface-splatting techniques, such as EWA Surface Splatting [27], we do not perform blending and analytical prefiltering of splats, but trade local surface reconstruction quality for performance by simply rendering opaque splats.

We use the same point-based renderer for user feedback, but add Phong shading of surface splats, and also overlay the dynamic regions of the input depth map.
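To illustrate the alignment step, the following is a minimal sketch of one Gauss-Newton iteration of point-to-plane ICP against such a rendered model map, using a small-angle linearization of the pose. The correspondence arrays are assumed to have been produced by projective association (with dynamic and unstable points already excluded); this is a generic formulation, not the authors' GPU implementation.

```python
import numpy as np

def icp_step(src_pts, dst_pts, dst_nrm):
    """One point-to-plane ICP iteration.

    src_pts: Nx3 input-frame points (already transformed by the current
    pose estimate); dst_pts/dst_nrm: Nx3 associated model points and
    normals. Returns a 4x4 incremental transform.
    """
    # Residual: signed distance of each source point to the tangent plane
    # of its corresponding model point.
    r = np.sum((dst_pts - src_pts) * dst_nrm, axis=1)

    # Jacobian rows for parameters (omega, t) under the small-angle model
    # T(x) ~ (I + [omega]_x) p + t.
    J = np.hstack((np.cross(src_pts, dst_nrm), dst_nrm))      # N x 6

    x = np.linalg.solve(J.T @ J + 1e-9 * np.eye(6), J.T @ r)  # damped normal equations
    omega, t = x[:3], x[3:]

    # Exact rotation from the axis-angle increment (Rodrigues' formula).
    theta = np.linalg.norm(omega)
    K = np.array([[0, -omega[2], omega[1]],
                  [omega[2], 0, -omega[0]],
                  [-omega[1], omega[0], 0]])
    R = np.eye(3) if theta < 1e-12 else (
        np.eye(3) + np.sin(theta) / theta * K
        + (1 - np.cos(theta)) / theta ** 2 * (K @ K))

    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T
```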
6. Dynamics Estimation

The system as described above already has limited support for dynamic objects, in that unstable points must gain confidence to be promoted to stable model points, and so fast-moving objects will be added and then deleted from the global model. In this section we describe additional steps that lead to an explicit classification of observed points as being part of a dynamic object. In addition, we aim at segmenting entire objects whose surface is partially moving and removing them from the global point model.

We build upon an observation by Izadi et al. [6]: when performing ICP, failure of data association to find model correspondences for input points is a strong indication that these points are depth samples belonging to dynamic objects. Accordingly, we retrieve this information by constructing an ICP status map S (with elements s_i(u)) that encodes for each depth sample the return state of ICP's search for a corresponding model point in the data association step:

no_input: v_k(u) is invalid or missing.
no_cand: No stable model points in proximity of v_k(u).
no_corr: Stable model points in proximity of, but no valid ICP correspondence for, v_k(u).
corr: Otherwise, ICP found a correspondence.

Input points marked as no_corr are a strong initial estimate of parts of the scene that move independently of camera motion, i.e. dynamic objects in the scene. We use these points to seed our segmentation method based on hierarchical region growing (see below). It creates a dynamics map X, storing flags x_i(u), that segments the current input frame into static and dynamic points. The region growing aims at marking complete objects as dynamic even if only parts of them actually move. (Note that this high-level view on dynamics is an improvement over the limited handling of dynamics in previous approaches, e.g., [6].)

In the depth map fusion stage, model points that are merged with input points marked as dynamic are potentially demoted to unstable points using the following rule:

if x_i(u) ∧ (c̄_k ≥ c_stable + 1) then c̄_k ← 1    (3)

Thus, the state change from static to dynamic is reflected immediately in the model. A critical aspect is the offset of +1 in Eq. (3): it ensures that any dynamic point that has sufficiently grown in confidence (potentially because it is now static) is allowed to be added to the global model for at least one iteration; otherwise, a surface that has once been classified as dynamic would never be able to be re-added to the global model, as it would always be inconsistent with the model, leading to a no_corr classification.

For the bulk of the time, however, dynamic points remain unstable and as such are not considered for camera pose estimation (see Sec. 5), which greatly improves the accuracy and robustness of T.
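The seeding and demotion logic can be summarized in a few lines. The sketch below assumes the ICP status map and the grown dynamics map are available as label/boolean images, and reuses the hypothetical GlobalModel from above.

```python
import numpy as np

NO_INPUT, NO_CAND, NO_CORR, CORR = 0, 1, 2, 3

def seed_dynamics(status_map):
    """Initial dynamic seeds: samples near stable geometry for which ICP
    found no valid correspondence (the no_corr case)."""
    return status_map == NO_CORR

def demote_dynamic_points(model, merged_indices, dynamic_flags, c_stable=10.0):
    """Eq. (3): a stable model point merged with a dynamic input sample is
    reset to an unstable point (confidence 1). The +1 offset lets a point
    that regained enough confidence re-enter the model for one iteration."""
    for k, is_dyn in zip(merged_indices, dynamic_flags):
        if is_dyn and model.confidence[k] >= c_stable + 1.0:
            model.confidence[k] = 1.0
```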
Hierarchical Region Growing  The remainder of this section explains the region-growing-based segmentation approach that computes the map X.

The goal is essentially to find connected components in D. In the absence of explicit neighborhood relations in the point data, we perform region growing based on point attribute similarity. Starting from the seed points marked in X, we agglomerate points whose position and normal are within given thresholds of the vertex v(u) and normal n(u) of a neighbor with x(u) = true.

To accelerate the process, we start at a downsampled X², and repeatedly upsample until we reach X⁰ = X, each time resuming region growing. (We reuse the input pyramids built for camera pose estimation.)

We improve robustness to camera noise and occlusions by removing stray no_corr points through morphological erosion at the coarsest pyramid level X², after initializing it from S. This also ensures that X² covers only the inner region of dynamic objects.
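A single-level version of this flood fill over the vertex and normal maps might look as follows; the two similarity thresholds and the 4-connected neighborhood are illustrative choices, and the hierarchical coarse-to-fine pass and the morphological erosion are omitted for brevity.

```python
from collections import deque
import numpy as np

def grow_dynamic_region(V, N, seeds, pos_thresh=0.05, dot_thresh=0.95):
    """Flood-fill the seed mask over pixels whose 3D position and normal
    are similar to an already-dynamic neighbor.

    V, N: HxWx3 vertex and normal maps; seeds: HxW boolean seed mask.
    Returns the grown boolean dynamics map X.
    """
    H, W = seeds.shape
    X = seeds.copy()
    queue = deque(zip(*np.nonzero(seeds)))
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):   # 4-connected
            ny, nx = y + dy, x + dx
            if not (0 <= ny < H and 0 <= nx < W) or X[ny, nx]:
                continue
            close = np.linalg.norm(V[ny, nx] - V[y, x]) < pos_thresh
            aligned = np.dot(N[ny, nx], N[y, x]) > dot_thresh
            if close and aligned:
                X[ny, nx] = True
                queue.append((ny, nx))
    return X
```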
7. Results

We have tested our system on a variety of scenes (see Table 1).

Scene           #frames input/processed (fps in./proc.)   #model points   Avg. timings [ms]: ICP / Dyn-Seg. / Fusion
Sim             950/950 (15/15)                            467,200         18.90 / 2.03 / 11.50
Flowerpot       600/480 (30/24)                            496,260         15.87 / 1.90 / 6.89
Teapot          1000/923 (30/27)                           191,459         15.20 / 1.60 / 5.56
Large Office    11892/6704 (30/17)                         4,610,800       21.75 / 2.39 / 13.90
Moving Person   912/623 (30/20)                            210,500         15.92 / 3.23 / 16.61
Ballgame        1886/1273 (30/21)                          350,940         16.74 / 3.15 / 17.66
PMD             4101/4101 (27/27)                          280,050         10.70 / 0.73 / 3.06

Table 1. Results from test scenes obtained on a PC equipped with an Intel i7 8-core CPU and an NVidia GTX 680 GPU. Input frames have a size of 640 × 480 pixels, except for the PMD scene, which uses a frame size of 200 × 200.

Figure 3. The synthetic scene Sim. Left: error in the final global model based on ground truth camera transformations. Right: final error based on ICP pose estimation².

² Rendered using CloudCompare, http://www.danielgm.net/cc/.

Fig. 3 shows a synthetic scene Sim. We generated rendered depth maps for a virtual camera rotating around
this scene and used these as input to our system. This gave us ground truth camera transformations T_i^GT and ground truth scene geometry. Using T_i^GT, the points in the resulting global model have a mean position error of 0.019 mm. This demonstrates only minimal error for our point-based data fusion approach. The camera transformations T_i obtained from ICP have a mean position error of 0.87 cm and a mean viewing direction error of 0.1 degrees. This results in a mean position error of 0.20 cm for global model points.

The Flowerpot and Teapot scenes shown in Fig. 4 were recorded by Nguyen et al. [13]. Objects are placed on a turntable which is rotated around a stationary Kinect camera. Vicon is used for ground truth pose estimation of the Kinect, which is compared to the ICP estimates of our method and of the original KinectFusion system (Fig. 5).

Figure 4. The scenes Flowerpot (top row) and Teapot (bottom row). A and B show reconstruction results of the original KinectFusion system. The other images show our method (middle: Phong-shaded surfels, right: model points colored with surface normals).

Figure 5. Tracking errors for the original KinectFusion system compared to our point-based approach. Tracking results were computed on the Flowerpot sequence, by subtracting Vicon ground truth data from the resulting per-frame 3D camera position. For each system, error is computed as the absolute distance between the estimated camera position and the ground truth position (after aligning both coordinate spaces manually). Where the error of the original KinectFusion exceeds that of the new system, the gap is colored blue. Where the error of our method exceeds the original, the gap is colored red. Note that our method is similar in performance, with the largest delta being ∼1 cm.

Fig. 6 shows that the number of global model points for these scenes remains roughly constant after one full turn of the turntable. This demonstrates that new points are not continuously added; the global model is refined but kept compact. Note that one Kinect camera input frame provides up to 307,200 input points, but the total number of points in the final global teapot model is less than 300,000.

Figure 6. The number of global model points stored on the GPU plotted over time for the Flowerpot and Teapot scenes. Note that after the completion of one full turn of the turntable, the number of points converges instead of continuously growing.

The Large Office scene shown in Fig. 7 consists of two rooms with a total spatial extent of approximately 10 m × 6 m × 2.5 m. A predefined volumetric grid with 32-bit voxels and 512 MB of GPU memory would result in a voxel size of more than 1 cm³. In contrast, our system does not define the scene extents in advance: the global model grows as required. Furthermore, it does not limit the size of representable details; Fig. 7 shows close-ups of details on the millimeter scale (e.g. the telephone keys). The 4.6 million global model points reported in Tab. 1 can be stored in 110 MB of GPU memory using 3 floating point values for the point position, 2 for the normalized point normal, 1 for the radius, and one extra byte for a confidence counter. Additionally, RGB colors can be stored for each global point, to texture the final model (see Fig. 7, far right). Rather than merge RGB samples, we currently simply store the last one.
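As a quick check of this memory figure (assuming 4-byte floats for the six position, normal, and radius values plus one byte for the confidence counter):

4,610,800 points × (6 × 4 B + 1 B) = 4,610,800 × 25 B ≈ 115 × 10⁶ bytes, i.e. roughly the 110 MB reported above when expressed in binary megabytes.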
In the Moving Person scene shown in Fig. 8, the person first sits in front of the sensor and is reconstructed before moving out of view. Since the moving person occupies much of the field of view, leaving only a few reliable points for ICP, camera tracking fails with previous approaches (see e.g. Izadi et al. [6], Fig. 8). Our system segments the moving person and ignores dynamic scene parts in the ICP stage, thereby ensuring robustness to dynamic motion.

The Ballgame scene shown in Fig. 9 shows two people playing with a ball across a table. Our region growing approach segments dynamics on the object level instead of just
Figure 7. The Large Office scene, consisting of two large rooms and connecting corridors. A: overview; B and C: dynamically moving objects during acquisition. Note the millimeter scale of the phone's keypad. Other close-ups are also shown (right column: RGB textured).
There are many areas for future work. For example, whilst our system scales to large scenes, there is the additional possibility of adding mechanisms for streaming subsets of points (from GPU to CPU), especially once they are significantly far away from the current pose. This would help increase performance, and the point-based data would clearly incur low overhead in terms of CPU-GPU bandwidth. Another issue is sensor drift, which we do not currently tackle, instead focusing on the data representation. Drift in larger environments can become an issue and remains an interesting direction for future work. Here again the point-based representation might be more amenable to correction after loop closure detection than resampling a dense voxel grid.
Acknowledgements

This research has partly been funded by the German Research Foundation (DFG), grant GRK-1564 Imaging New Modalities, and by the FP7 EU collaborative project BEAMING (248620). We thank Jens Orthmann for his work on the GPU framework osgCompute.
References

[1] P. Besl and N. McKay. A method for registration of 3-D shapes. IEEE Trans. Pattern Anal. and Mach. Intell., 14(2):239–256, 1992.
[2] B. Curless and M. Levoy. A volumetric method for building complex models from range images. In Proc. Comp. Graph. & Interact. Techn., pages 303–312, 1996.
[3] D. Gallup, M. Pollefeys, and J.-M. Frahm. 3D reconstruction using an n-layer heightmap. In Pattern Recognition, pages 1–10. Springer, 2010.
[4] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox. RGB-D mapping: Using Kinect-style depth cameras for dense 3D modeling of indoor environments. Int. J. Robotics Research, 31:647–663, Apr. 2012.
[5] H. Hoppe, T. DeRose, T. Duchamp, J. McDonald, and W. Stuetzle. Surface reconstruction from unorganized points. Computer Graphics (Proc. SIGGRAPH), 26(2), 1992.
[6] S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison, and A. Fitzgibbon. KinectFusion: Real-time 3D reconstruction and interaction using a moving depth camera. In Proc. ACM Symp. User Interface Softw. & Tech., pages 559–568, 2011.
[7] M. Kazhdan, M. Bolitho, and H. Hoppe. Poisson surface reconstruction. In Proc. EG Symp. Geom. Proc., 2006.
[8] A. Kolb, E. Barth, R. Koch, and R. Larsen. Time-of-flight cameras in computer graphics. Computer Graphics Forum, 29(1):141–159, 2010.
[9] M. Levoy, K. Pulli, B. Curless, S. Rusinkiewicz, D. Koller, L. Pereira, M. Ginzton, S. Anderson, J. Davis, J. Ginsberg, et al. The Digital Michelangelo Project: 3D scanning of large statues. In Proc. Comp. Graph. & Interact. Techn., pages 131–144, 2000.
[10] W. Lorensen and H. Cline. Marching cubes: A high resolution 3D surface construction algorithm. Computer Graphics, 21(4):163–169, 1987.
[11] R. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In Proc. IEEE Int. Symp. Mixed and Augm. Reality, pages 127–136, 2011.
[12] R. Newcombe, S. Lovegrove, and A. Davison. DTAM: Dense tracking and mapping in real-time. In Proc. IEEE Int. Conf. Comp. Vision, pages 2320–2327, 2011.
[13] C. Nguyen, S. Izadi, and D. Lovell. Modeling Kinect sensor noise for improved 3D reconstruction and tracking. In Proc. Int. Conf. 3D Imaging, Modeling, Processing, Vis. & Transmission, pages 524–530, 2012.
[14] S. Parker, P. Shirley, Y. Livnat, C. Hansen, and P.-P. Sloan. Interactive ray tracing for isosurface rendering. In Proc. IEEE Vis., pages 233–238, 1998.
[15] H. Pfister, M. Zwicker, J. van Baar, and M. Gross. Surfels: Surface elements as rendering primitives. In Proc. Conf. Comp. Graphics & Interact. Techn., pages 335–342, 2000.
[16] M. Pollefeys, D. Nistér, J. Frahm, A. Akbarzadeh, P. Mordohai, B. Clipp, C. Engels, D. Gallup, S. Kim, P. Merrell, et al. Detailed real-time urban 3D reconstruction from video. Int. J. Comp. Vision, 78(2):143–167, 2008.
[17] H. Roth and M. Vona. Moving volume KinectFusion. In British Machine Vision Conf., 2012.
[18] S. Rusinkiewicz, O. Hall-Holt, and M. Levoy. Real-time 3D model acquisition. ACM Trans. Graph. (Proc. SIGGRAPH), 21(3):438–446, 2002.
[19] S. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In Proc. IEEE Conf. Comp. Vision & Pat. Rec., volume 1, pages 519–528, 2006.
[20] J. Stückler and S. Behnke. Integrating depth and color cues for dense multi-resolution scene mapping using RGB-D cameras. In Proc. IEEE Int. Conf. Multisensor Fusion & Information Integration, pages 162–167, 2012.
[21] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In Proc. Int. Conf. Computer Vision, pages 839–846, 1998.
[22] T. Weise, T. Wismer, B. Leibe, and L. Van Gool. In-hand scanning with online loop closure. In Proc. IEEE Int. Conf. Computer Vision Workshops, pages 1630–1637, 2009.
[23] T. Whelan, M. Kaess, M. Fallon, H. Johannsson, J. Leonard, and J. McDonald. Kintinuous: Spatially extended KinectFusion. Technical report, CSAIL, MIT, 2012.
[24] C. Yang and G. Medioni. Object modelling by registration of multiple range images. Image and Vision Computing, 10(3):145–155, 1992.
[25] C. Zach. Fast and high quality fusion of depth maps. In Proc. Int. Symp. on 3D Data Processing, Visualization and Transmission (3DPVT), volume 1, 2008.
[26] M. Zeng, F. Zhao, J. Zheng, and X. Liu. Octree-based fusion for realtime 3D reconstruction. Graph. Models, 75(3):126–136, 2013.
[27] M. Zwicker, H. Pfister, J. van Baar, and M. Gross. Surface splatting. In Computer Graphics (Proc. SIGGRAPH), pages 371–378, 2001.