1. Introduction
Minimally invasive surgery is a surgical technique in which modern medical instruments are introduced through small incisions in the body surface and manipulated inside the body under hand–eye coordination [1]. Compared with traditional open surgery or early minimally invasive surgery, modern minimally invasive surgery offers more precise operation, less bleeding and faster postoperative recovery. It is therefore increasingly accepted by patients and widely used in endoluminal procedures. However, when performing complex operations through the 2D display of the endoscopic video stream, surgeons are prone to disorientation and occasional hand–eye mismatch, and it is difficult to determine the lesion location by empirically matching the endoscopic field of view with preoperative images, which can easily lead to intraoperative errors.
In recent years, minimally invasive surgery has been gradually integrated with computer three-dimensional (3D) reconstruction technology. For example, surgeons combine surgical experience with image processing techniques to stereoscopically locate the lesion area through the endoscope system, overcoming limitations of traditional surgery [2]. To help surgeons rehearse the actual operation, the digital 3D reconstruction model can be printed at full scale with 3D printing technology [3]. Additionally, the 3D model allows the surgeon to explain the patient’s condition and surgical plan visually [4], facilitating communication between surgeon and patient and enhancing the patient’s confidence in treatment. At present, researchers have proposed various computer-vision-based methods to recover the 3D surface structure of the surgical scene in minimally invasive surgery; these are mainly based on laser scanning, coded structured light, time-of-flight cameras and video cameras. Among them, surface reconstruction based on endoscopic video has clear advantages: it provides intraoperative information without disturbing internal structures, and no additional hardware needs to be introduced into the existing surgical platform. Although endoscopic video provides on-site feedback for surgeons during surgery, the video information alone cannot meet their needs. First, a two-dimensional image contains no explicit depth information, so surgeons must estimate depth from experience. In addition, the field of view of the endoscope is very narrow, and it is difficult for the surgeon to accurately determine the position and orientation of the endoscope and surgical instruments. More importantly, owing to the complex environment of the human lumen, the number of cameras available to capture the luminal surface also limits the real-time performance and robustness of 3D reconstruction.
Monocular vision recovers three-dimensional structure from images captured by a single camera. There are two main ways to realize monocular 3D modeling in the lumen environment. One is to exploit the information in the lumen image itself and recover the 3D features of the lumen with a specific algorithm. The other is to calibrate the camera parameters of the endoscope system and obtain the depth of the measured points. Because the monocular approach has a simple hardware structure, is convenient to use and yields data that are easy to process, most existing research reconstructs the inner cavity with monocular vision algorithms.
In order to improve the accuracy of 3D reconstruction, Wu et al. [5] proposed in 2010 combining the shape-from-shading (SFS) method with structure from motion for 3D reconstruction of the inner cavity. The method uses the iterative closest point algorithm to reduce coordinate-system conversion errors across multiple artificial spine images, improve the matching rate and recover the bone boundary lines.
In 2012, Ciuti et al. [
6,
7] proposed a complete set of SFS calibration methods. Assuming that the light source is close to the organ surface and far from the optical center, the spatial three-dimensional coordinates are obtained by triangulating the parts of the organ surface with specular highlights. Without any preoperative data, the endoscopic device performs 3D measurement along the computed trajectory and finally realizes automatic navigation of the capsule. However, the magnetically levitated capsule cannot maintain the ideal state during movement, and the calibration accuracy needs to be further improved. In the same year, Tokgozoglu et al. [8] proposed an SFS method based on color projection, which minimizes the intensity changes caused by different surface characteristics. In 2015, Goncalves et al. [9] proposed a perspective shape-from-shading (PSFS) algorithm based on near-light-source perspective mapping to handle radial distortion and the reduced resolution at image edges. The method establishes a radial distortion model, compensates for the reduced edge resolution, and completes the three-dimensional reconstruction of the knee bone. In 2016, Lei et al. [10] proposed a perspective-mapping SFS method based on photometric calibration to reconstruct organ surfaces. Combined with an optical flow method, it converts relative changes of the gray gradient field into absolute changes, which improves the stability of organ surface reconstruction. In 2018, Turan et al. [11] applied this method to gastrointestinal surface reconstruction. However, the gastrointestinal surface is not smooth: the uneven surface increases the rate of change of the gradient vectors, and the measured gray values fall below the true values, resulting in large reconstruction errors.
To sum up, the difficulty of the SFS algorithm in 3D reconstruction of the inner cavity is that a single two-dimensional image can map to multiple surface shapes. Moreover, the brightness equation provides only one equation in two unknowns, so the surface orientation cannot be determined from the brightness equation alone. On the other hand, the SFS algorithm is easy to combine with other, complementary methods for 3D reconstruction, and it can produce dense estimates on smooth surfaces. SLAM, first proposed in the 1980s, refers to the technology in which an agent equipped with specific sensors moves through an unknown environment, localizing itself while incrementally building a map [12]; it is widely used for real-time reconstruction of endoscopic scenes.
In 2015, Lin et al. [
13] proposed recovering the 3D surface structure of the abdominal surgical scene based on SLAM, with improvements in the texture handling of the lumen image, the selection of the green channel and the processing of reflective areas, and introduced a new type of image feature, namely branch points of blood vessels. After the vascular feature points are detected, branch segments are jointly detected and matched to associate the vascular features across images. Finally, 3D blood vessels are recovered from each frame, and the vessels from different viewpoints are fused through vessel matching to obtain a global 3D vascular network.
In 2016, Yang [
14] proposed endoscope localization and construction of a gastrointestinal feature map based on monocular SLAM. In this method, the Oriented FAST and Rotated BRIEF (ORB) algorithm is selected for feature point detection for its efficiency and matching accuracy. Combined with a local pose optimization algorithm and triangulation with minimum geometric distance, the large amount of redundant data is handled by reselecting key frames and screening feature points. However, the environment is the intestinal tract, where the endoscope trajectory is not closed and locally tends to be straight, unlike the closed loops found in most lumen environments.
In 2019, Mahmoud et al. [
15] proposed dense three-dimensional reconstruction of the abdominal cavity based on monocular ORB-SLAM. First, the camera poses of the key frames are estimated using the detection and matching process of sparse ORB-SLAM, and key frames are selected according to a parallax criterion. Then, a variational method combining zero-mean normalized cross-correlation (ZNCC) and a gradient-robust kernel-norm regularizer is used to compute the dense matching between key frames in parallel. The method uses monocular video input and does not require any reference points or external trackers. It has been verified and evaluated on porcine abdominal video sequences, showing robustness to severe illumination changes and varied scene textures. The main limitation of the system is that the texture feature description of the soft-tissue surface is not representative, and the reconstructed texture is distorted.
In the same year, Xie et al. [
16] combined measurement data of the endoscope in the gastrointestinal tract and introduced a local pose optimization algorithm and a triangulation algorithm with minimum geometric distance for pose optimization and spatial point positioning. In 2021, Lamarca et al. [17] first proposed an algorithm for tracking and mapping of deforming scenes from monocular sequences, which runs in real time in deforming scenes and divides the computation into two parallel threads. The deformation tracking thread estimates the camera pose and the deformation of the scene, while the deformation mapping thread is applied to the pose estimation of the endoscope so as to better adapt to the deforming lumen and generate an accurate 3D model of the human lumen. However, the method is easily affected by uneven illumination, which degrades the visual texture, and it is not suitable for reconstructing lumens undergoing non-isometric deformation.
In minimally invasive surgery, human tissue deforms and bleeds, often lacks strong edge features, and exhibits highlights and specular reflections. In this complex surgical environment, monocular SLAM is highly robust and can process soft-tissue image sequences in real time. Therefore, 3D texture reconstruction of the abdominal cavity based on monocular vision SLAM for minimally invasive surgery is proposed in this paper. The rest of this paper is organized as follows:
Section 2 briefly introduces the relevant methods and the improvements proposed in this paper.
Section 3, Section 4 and Section 5 describe the improved abdominal cavity feature tracking, mapping and optimization, and Poisson surface reconstruction and texture mapping, respectively, together with the experimental results and analysis.
Section 6 summarizes the conclusions and future work.
4. Abdominal Cavity Mapping and Optimization
Compared with traditional 3D reconstruction methods that use multi-frame static abdominal images, the monocular SLAM system can optimize the camera pose and eliminate cumulative error. By selecting key frames and using a bag-of-words model and BA optimization, the system reduces the error accumulated during abdominal cavity map construction and obtains a sparse three-dimensional point cloud of the abdominal surface, which lays the foundation for dense reconstruction.
4.1. Construction of Abdominal Cavity Bag-of-Words Model
The bag-of-words (BoW) model [26] is a technique that uses a visual dictionary to convert images into sparse vectors, which allows large image data sets to be processed more efficiently. Words in the visual dictionary are derived from the descriptors of ORB features: a word represents a cluster of descriptors of multiple similar features, and the dictionary contains all words. In the SLAM system, features are extracted from each key frame and their descriptors are computed. All features of the current frame are looked up in the dictionary, a word vector is constructed and added to the image database for querying. When comparing two images, we mainly consider their similarity, that is, the distance between their word vectors. Typically, for the latest key frame, a set of key frames with high similarity is retrieved as loop-closure candidate frames, and the high-quality key frames are retained after verification and screening.
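As an illustration of this lookup step, the following Python/OpenCV sketch quantizes the ORB descriptors of one key frame into a term-frequency word vector. It is a minimal sketch assuming a flat array of word centroids (`vocabulary`) trained offline; the flat lookup is an illustrative simplification of the k-ary tree dictionary described below rather than the implementation used in this paper.

```python
import numpy as np
import cv2

def bow_vector(frame_gray, vocabulary):
    """Quantize a frame's ORB descriptors into a normalized bag-of-words histogram.

    `vocabulary` is assumed to be an (n_words, 32) uint8 array of ORB word
    centroids obtained offline by clustering descriptors of training images.
    """
    orb = cv2.ORB_create(nfeatures=1000)
    _, desc = orb.detectAndCompute(frame_gray, None)
    if desc is None:                      # no features detected in this frame
        return np.zeros(len(vocabulary))
    # Assign every descriptor to its nearest word centroid (Hamming distance)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    words = [m.trainIdx for m in matcher.match(desc, vocabulary)]
    vec = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return vec / vec.sum()                # term-frequency weighting
```

In a full system, the word vector would additionally carry inverse-document-frequency weights and be stored in the inverse index described below.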
In this paper, 1500 sequential images of the human body are extracted from the Hamlyn endoscopic video database, a large number of feature points are generated from the image data and organized and clustered into a certain structure, and a vocabulary dedicated to minimally invasive surgery is trained. The k-ary tree structure is simple and practical and is well suited to representing the bag of words: it offers logarithmic query efficiency, and it can also be queried directly from a given layer when prior information is available, further improving efficiency.
Figure 7 shows the structure of the k-ary tree dictionary. Starting from the root node, each node in a layer is split into k child nodes until the set depth d is reached; the leaf nodes stored at the dth layer are the clustered words. To build a dictionary tree with branching factor k and depth d, the specific process is as follows:
(1) The root node represents the set of all features; the K-means algorithm is used to cluster them into k classes, forming the first layer.
(2) For each node of the first layer, the K-means algorithm is applied again to split its features into k child nodes, producing the next layer.
(3) On each new layer, repeat the second step until the depth of the tree reaches the dth layer.
In the whole tree structure, the leaf-layer nodes are the words, and the intermediate nodes (cluster centers) generated while building the dictionary are used to look up words quickly. Each node stores its parent node index, a flag indicating whether it is a leaf, its descriptor, a weight and a semantic label. The vocabulary words are the leaf nodes of the tree. The inverse index stores, for each word, the images in which it appears together with its weight in those images. The direct index stores the features of each image and their associated nodes at a certain level of the vocabulary tree.
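The following Python sketch illustrates the hierarchical construction above and the resulting logarithmic word lookup. It is illustrative only: the values of k and depth, and the use of Euclidean K-means on float-cast binary descriptors, are simplifying assumptions, whereas DBoW2-style vocabularies cluster the binary ORB descriptors directly and also store the weights and indices described above.

```python
import numpy as np
from sklearn.cluster import KMeans

class VocabNode:
    """One node of the k-ary vocabulary tree (k branches, depth d)."""
    def __init__(self):
        self.center = None       # cluster centre (float-cast ORB descriptor)
        self.children = []       # k child nodes; empty list => leaf (word)
        self.word_id = None      # assigned only to leaf nodes

def build_tree(descriptors, k=10, depth=5, level=0, words=None):
    """Hierarchically cluster descriptors into a k-ary tree of the given depth."""
    if words is None:
        words = []
    node = VocabNode()
    if level == depth or len(descriptors) < k:
        node.word_id = len(words)        # leaf node: one visual word
        words.append(node)
        return node, words
    km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(descriptors)
    for c in range(k):
        child, words = build_tree(descriptors[km.labels_ == c],
                                  k, depth, level + 1, words)
        child.center = km.cluster_centers_[c]
        node.children.append(child)
    return node, words

def lookup(root, desc):
    """Descend the tree, picking the nearest child per layer (O(k*d) lookup)."""
    node = root
    while node.children:
        dists = [np.linalg.norm(desc - ch.center) for ch in node.children]
        node = node.children[int(np.argmin(dists))]
    return node.word_id

# Usage (illustrative): descriptors is an (N, 32) array of training ORB descriptors
# root, words = build_tree(descriptors.astype(float), k=10, depth=5)
# word_id = lookup(root, new_descriptor.astype(float))
```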
The bag-of-words vector is sparse, so only the indices and values of its non-zero elements need to be stored. Given two bag-of-words vectors $\mathbf{v}_1$ and $\mathbf{v}_2$, a score $D$ in the interval $[0,1]$ is obtained using the $L_1$ norm and is defined as the similarity of the two vectors:

$$D(\mathbf{v}_1,\mathbf{v}_2) = 1 - \frac{1}{2}\left\| \frac{\mathbf{v}_1}{\left\|\mathbf{v}_1\right\|} - \frac{\mathbf{v}_2}{\left\|\mathbf{v}_2\right\|} \right\|_1 \quad (16)$$

In Formula (16), the greater the value of $D$, the more similar $\mathbf{v}_1$ and $\mathbf{v}_2$ are. Therefore, by comparing the similarity of bag-of-words vectors, two abdominal images can be considered similar if their similarity score reaches the set threshold.
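The score of Formula (16), in the form commonly used by DBoW2-style bag-of-words libraries, can be computed directly from two word vectors. The short function below is a plain NumPy transcription; dense vectors are used for brevity (an implementation would store only non-zero entries), and the threshold value is illustrative.

```python
import numpy as np

def bow_similarity(v1, v2):
    """Similarity of Formula (16): 1 - 0.5 * || v1/||v1|| - v2/||v2|| ||_1."""
    v1 = v1 / np.abs(v1).sum()            # L1-normalize both word vectors
    v2 = v2 / np.abs(v2).sum()
    return 1.0 - 0.5 * np.abs(v1 - v2).sum()

# Two key frames are treated as loop-closure candidates when the score
# exceeds a preset threshold (the value here is illustrative):
# is_candidate = bow_similarity(vec_a, vec_b) > 0.3
```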
When the ORB algorithm extracts 392 feature points and 456 matches are generated, brute-force matching takes 46.62 ms to complete, while BoW matching takes 40.23 ms. When the AKAZE-ORB algorithm extracts 587 feature points and 531 matches are generated, brute-force matching takes 41.52 ms, while BoW matching takes 36.18 ms. This shows that BoW matching can noticeably reduce the feature matching time.
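For reference, a brute-force baseline of this kind can be set up with OpenCV as sketched below. The frame file names are placeholders, plain ORB and AKAZE detectors stand in for the ORB and AKAZE-ORB configurations discussed above, and the BoW-accelerated matching (which restricts candidate pairs to features sharing vocabulary nodes) is not reproduced here.

```python
import time
import cv2

def brute_force_match(img1, img2, detector):
    """Detect features and time brute-force Hamming matching between two frames."""
    kp1, des1 = detector.detectAndCompute(img1, None)
    kp2, des2 = detector.detectAndCompute(img2, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    t0 = time.perf_counter()
    matches = matcher.match(des1, des2)
    elapsed_ms = (time.perf_counter() - t0) * 1000.0
    return matches, elapsed_ms

# Two consecutive laparoscopic frames (file names are placeholders)
img1 = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)
for name, det in [("ORB", cv2.ORB_create(nfeatures=600)),
                  ("AKAZE", cv2.AKAZE_create())]:
    matches, ms = brute_force_match(img1, img2, det)
    print(f"{name}: {len(matches)} matches in {ms:.2f} ms")
```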
4.2. BA Optimization
In the process of constructing the abdominal 3D point cloud map, in order to avoid tracking failure when the current frame yields few features or is weakly correlated with historical frames, a new abdominal key frame needs to be inserted as soon as possible to update the visual covisibility map. To keep abdominal feature tracking stable, the system in this paper also removes redundant key frames during local abdominal map construction, which speeds up building the 3D texture model. As key frames of the abdominal images are continuously added, the error in the camera poses and 3D point coordinates computed from adjacent frames grows larger and larger. In this paper, the BA algorithm [27] is used to construct a least-squares problem that is solved iteratively to reduce the cumulative error and optimize the local map.
Suppose there are $m$ three-dimensional points in abdominal space, of which a point $P_i$ has coordinates $\mathbf{P}_i = [X_i, Y_i, Z_i]^{T}$ and its projection has pixel coordinates $\mathbf{u}_i = [u_i, v_i]^{T}$. Then, the relationship between the pixel position and the spatial point position is shown in Formula (9):

$$\mathbf{u}_i = \frac{1}{s_i}\,\mathbf{K}\exp\left(\boldsymbol{\xi}^{\wedge}\right)\mathbf{P}_i \quad (9)$$

where $s_i$ is the depth of the point $P_i$, $\mathbf{K}$ is the camera intrinsic matrix, $\boldsymbol{\xi}$ is the Lie algebra of the camera pose and $\boldsymbol{\xi}^{\wedge}$ is its matrix form. After conversion to matrix form with homogeneous coordinates, Formula (9) becomes:

$$s_i\,\mathbf{u}_i = \mathbf{K}\exp\left(\boldsymbol{\xi}^{\wedge}\right)\mathbf{P}_i$$
Because of noise in the camera observations and the unknown pose, Formula (9) cannot be satisfied exactly. Therefore, in this paper, the sum of the errors is turned into the corresponding least-squares problem, from which the optimal camera pose can be obtained. Local optimization drives the re-projection error as close to zero as possible, yielding the optimal camera parameters and the coordinates of the three-dimensional space points. Therefore, the BA algorithm jointly optimizes the camera pose and the positions of the feature points, which improves the positioning accuracy in abdominal space.
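This least-squares problem can be written in the re-projection-error form commonly used for monocular BA (notation as in Formula (9)):

$$\boldsymbol{\xi}^{*},\{\mathbf{P}_i^{*}\} = \arg\min_{\boldsymbol{\xi},\,\mathbf{P}_i} \frac{1}{2}\sum_{i=1}^{m}\left\| \mathbf{u}_i - \frac{1}{s_i}\,\mathbf{K}\exp\left(\boldsymbol{\xi}^{\wedge}\right)\mathbf{P}_i \right\|_{2}^{2}$$

Minimizing this cost jointly over the key-frame poses and map points is what reduces the drift accumulated in the local abdominal map.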
4.3. Local Configuration of Abdominal Cavity Surface
This paper selects Dataset15 (the 15th video) of the Hamlyn laparoscopic video dataset to verify and analyze the feasibility and effectiveness of the point cloud map construction method designed in this paper.
Figure 8 shows the sparse reconstruction of the abdominal surface obtained with the traditional ORB algorithm and the AKAZE-ORB algorithm, where the green marks show the trajectory of the laparoscope, the red points represent map points currently being reconstructed, and the black points represent map points that have already been reconstructed. The blue lines indicate the camera poses at the key frames, which together form the motion trajectory of the camera. It can be seen that the monocular SLAM abdominal 3D reconstruction system obtains a 3D point cloud of abdominal feature points together with the motion trajectory of the laparoscope, but the resulting point cloud is very sparse. The AKAZE-ORB algorithm yields a denser point cloud than the original system, but it is still unable to produce a dense abdominal point cloud map.
5. Poisson Surface Reconstruction and Texture Mapping
Although the SLAM-based 3D reconstruction of the abdominal cavity surface yields the endoscope motion trajectory in real time and a 3D point cloud based on feature points, the sparse point cloud alone cannot provide a dense reconstruction. Therefore, a dense abdominal cavity model is obtained by Poisson surface reconstruction and texture mapping.
The approach of Poisson surface reconstruction [
28] is based on the observation that the (inward pointing) normal field of the boundary of a solid can be interpreted as the gradient of the solid’s indicator function [
29]. Thus, given a set of oriented points sampling the boundary, a vector field $\vec{V}$ is constructed and the indicator function $\chi$ whose gradient best approximates $\vec{V}$ is sought, which leads to the Poisson equation $\Delta \chi = \nabla \cdot \vec{V}$. Since this problem generally has no exact solution, it is solved by projection onto the function space in a least-squares sense, and the minimizer $\tilde{\chi}$ of the following equation is obtained:

$$\tilde{\chi} = \arg\min_{\chi}\left\| \nabla \chi - \vec{V} \right\|^{2}$$
Finally, the reconstructed surface model is obtained by extracting an isosurface from the indicator function. The isosurface should lie close to the positions of the input samples, so that the Poisson surface reflects the true surface of the point cloud model being reconstructed.
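As an illustration of this step, a point cloud with estimated normals can be meshed with an off-the-shelf Poisson solver. The sketch below uses the Open3D library; the file names and parameters (for example, the octree `depth`) are illustrative choices rather than the configuration used in this paper.

```python
import numpy as np
import open3d as o3d

# Load the reconstructed abdominal point cloud (file name is illustrative)
pcd = o3d.io.read_point_cloud("abdominal_points.ply")

# Poisson reconstruction requires consistently oriented normals
pcd.estimate_normals(
    search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=0.01, max_nn=30))
pcd.orient_normals_consistent_tangent_plane(20)

# Solve the Poisson equation on an octree of the given depth and extract the isosurface
mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=9)

# Remove poorly supported (low-density) vertices, which often form spurious surface
dens = np.asarray(densities)
mesh.remove_vertices_by_mask(dens < np.quantile(dens, 0.02))
o3d.io.write_triangle_mesh("abdominal_mesh.ply", mesh)
```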
In order to verify the feature extraction and matching performance of the AKAZE-ORB algorithm proposed in this paper, the Hamlyn laparoscopic video data set is used to construct a sparse 3D point cloud map.
Figure 9 and Figure 10 show the abdominal Poisson surfaces reconstructed by different algorithms on Dataset1 and Dataset2, respectively. Poisson surface reconstruction fits all points as closely as possible to the implicit function; in doing so it modifies the original vertex data, which makes it robust to outliers and produces a very smooth surface.
In Figure 9 and Figure 10, (a) shows the 3D reconstruction results of the abdominal cavity for the classical SLAM system, and (b) shows the 3D reconstruction results for the improved SLAM (ISLAM) system proposed in this paper. From the reconstruction results, it can be seen that the abdominal mesh model reconstructed by the classical SLAM system has holes and surface mesh errors, and the reconstructed surface has uneven parts. For example, the areas marked in red are sparse, sunken parts of the mesh that leave obvious gaps in the reconstructed abdominal model. In contrast, the model surface reconstructed by our ISLAM system is smooth and retains the relevant contour details, which reduces the generation of holes in the reconstructed surface. The mesh is denser in the red areas, characterizes the geometry of the abdominal surface better, makes the abdominal model more realistic, smooth and delicate, and achieves a more accurate reconstruction of the abdominal model.
Figure 11 and Figure 12 show the textured abdominal reconstruction results on the two data sets, respectively. From the texture mapping results, it can be seen that when feature extraction and matching are performed with the AKAZE-ORB algorithm, the texture mapping effect is better than that of the classical SLAM system. The reconstruction of the classical SLAM system is less complete and has difficulty characterizing blood vessels and tissue features, whereas the surface reconstructed by the ISLAM system is smooth, natural and realistic, with fewer holes, for three-dimensional visualization of the abdominal model. Additionally, the mapped texture transitions smoothly and naturally, with strong realism.