1. Introduction
It is an important issue in social welfare to help visually impaired people live and travel. Governments and welfare departments of various countries have issued many policies or carried out considerable infrastructure to facilitate these people, especially the visually impaired. Nevertheless, the daily lives, especially outdoor walking and traveling, of the visually impaired is still significantly limited because of their physiological and psychological factors. Thus, it is crucial to find an effective auxiliary method. The most effective navigation method for the visually impacted people at present is to train guide dogs. However, the popularization and promotion of guide dogs among the visually impaired come at the cost of high expenses, long time, and low success rate; moreover, there are no adequate laws, regulations, and assurance to keep visually impaired and the dogs from the interference of the other people or vehicles.
More recently, there has been growing interest in wearable blindness-assistive devices, which have already appeared and have even been for sale. These wearable blindness-assistive devices can recognize faces, texts, traffic signals, and banknotes with high precision. As an important branch of wearable blindness-assistive devices, wearable navigation devices (WNDs) for blind people still rely on traditional methods, such as ultrasonic obstacle avoidance, GPS positioning, inertial odometry, and other indoor localization methods [
1,
2,
3], which cannot meet the accuracy requirements of walking navigation and have significant limitation in the indoor environment. Visual navigation has become a hot research topic in recent years, which has considerable potential as WNDs.
Figure 1 shows a simple WND system concept based on computer vision. GPS (Global Positioning System), images, and inertial information are collected and input into a microcomputer controller unit simultaneously. Then, positioning and navigation are performed based on the data of multi-sensor fusion among these sensors. The navigation information is sent via Bluetooth to other auxiliary devices worn by the visually impaired, such as earphones, smartwatches, and guide cane [
4]. Additionally, the visual presentation-based brain-computer interface [
5], which has become popular in recent years, can help visually impaired people restore their vision. It is also suitable as an output device for navigation information, especially on the WNDs that can build three-dimensional semantic scene information.
Considering that WND frequently works (especially navigates) in unknown environments, and as blind people cannot identify surroundings directly, it needs not only to determine its position, posture, and trajectory but also to establish a global map in real-time, which is precisely the same as the SLAM problem in the robotics field. The concept of SLAM was put forward in the 1980s [
6] and detailed the most fundamental problem that robots need to face when moving and recognizing their position. The earliest SLAM schemes were realized by simple position sensors such as sonars and odometers, which are usually bulky in volume. According to the reports of recent researches, SLAM schemes can also be established based on vision sensors. Visual SLAM, including Monocular, Stereo, RGB-D cameras, and Lidars, has shown significant advantages in intelligent mobile robots and autonomous vehicles because of robustness, intelligence, and adaption. Some SLAM schemes, such as ORB-SLAM [
7,
8], LSD-SLAM [
9], ElasticFusion [
10], and SVO [
11], are mature.
There are many significant pieces of research about the WNDs combined with SLAM. Kaiser et al. [
12] described a wearable navigation system based on an a priori map established by the SLAM process of a robot. It was necessary to create a map in advance when applying it to navigation for the visually impaired people, since the early SLAM system usually relied on high-precision but heavy equipment. Lee et al. [
13] proposed a method of using RGB-D cameras and some other equipment to realize real-time navigation without the need to establish a priori map through the SLAM scheme. According to their description, the program can achieve a processing speed of 28.6 Hz. However, this solution requires visually impaired people to interact with a handheld smart device, especially when setting a destination, which may bring potential inconvenience to the user. Zhang et al. [
14] also used RGB-D cameras to realize navigation without prior maps. Different from the work of Lee et al., they proposed the method using voice and text recognition to determine the destination. Especially, geometric features at some specific locations are utilized to extract semantic information so that the user could follow the preset path correctly. Although Zhang et al. considered the perception of the visually impaired people during navigation, the information they provided was not clear enough. Moreover, since some other information such as the door numbers is collected by another independent camera, the user may need to wear more equipment, which would also cause some inconvenience.
The SLAM scheme applied to the WND, according to many previous studies, is different from the general SLAM scheme. The latter only needs to build a topological map and not know what objects are on the map. In the process of positioning and navigation, the WND needs to recognize various items on the map in real-time to determine whether the object is an obstacle. Therefore, a 3D semantic scene establishment solution that can extract semantic information from the map and perform localization simultaneously and in real-time is urgently required.
Much work has been done in the field of 3D semantic scene establishment. The semantic information labels were tagged by the decision-level tree random forest pixel-level semantic prediction in the early years. SLAM++, proposed by Salas-Moreno et al. [
15], is the earliest prototype, which is based on the dense reconstruction of scenes and optimized by setting camera nodes and object nodes after detecting a specific object. The work of Zhang et al. [
14] also proposed the idea of using semantic information to assist positioning, but they only used simple geometric features to obtain semantic information, which could not achieve 3D semantic scene recognition. However, CNN (Convolutional Neural Network) has become a popular method for image target recognition and detection with machine learning technology developed in recent years. There are many studies on CNN image semantic segmentation, such as FCN [
16], DeepLab [
17], ResNet [
18], and PSPNet [
19]. The CNN image detection and semantic segmentation are applied to autonomous driving and robotics widely. It is a considerable suit choice for us to use CNN to realize 3D semantic scene identification.
This paper has proposed a real-time 3D semantic scene SLAM solution for WNDs, integrating a semantic segmentation network and a SLAM system to construct a new semantic visual SLAM system. Firstly, we have built a wearable navigation system. We extend the data structure of map points by probability fusion between SLAM’s Mapping processing and semantic information and finally construct a SLAM system for the WNDs with semantic label output. We have done accuracy and speed evaluation using the TUM RGB-D database [
20] on the wearable navigation system. This SLAM system can generate three kinds of maps, including the sparse map, dense map, and semantic map. Compared with the general semantic SLAM scheme, our work can ensure the speed of real-time operation on low-cost devices. Finally, we set up a test scenario to simulate navigation for visually impaired people through voice enlightening.
2. Real-Time Semantic Visual SLAM
The real-time semantic vision SLAM solution for WNDs consists of two parts: one is a high-precision SLAM system, the other is a real-time semantic segmentation network. The combination of the SLAM system and semantic segmentation network will be discussed in the third part.
2.1. Real-Time SLAM System
The SLAM system is the fundamental part of the Semantic visual SLAM system, which is positively correlated with the accuracy and final performance of the WND system. There are currently three methods for constructing the SLAM scheme: the first is the feature-based method, which extracts some feature points with descriptors in the image, and matches these feature points between different images for tracking and mapping; the second is the direct method, which directly calculates the luminosity changes of some pixel blocks (not only using feature points); the third is the optical flow method, which uses the optical flow changes of feature points, pixel gradient points, and even the entire picture. The semantic information needs to be fused through data association, which further requires data association and localization estimation to be decoupled. The feature-based method needs to match feature points with the current position, demonstrating compatibility with semantic information fusion.
In fact, the feature-based method is not highly efficient. As is shown in
Figure 2, we have done an operation speed test with a single thread for some popular feature extractors, including Star (or CenSurE, Center Surround Extremes, by Agrawal et al. [
21]), GFTT (Good Feature to Track by Shi et al. [
22]), SIFT (by Lowe et al. [
23]), SURF (by Bay et al. [
24,
25]), ORB (by Rublee et al. [
26]), BRISK (by Leutenegger et al. [
27]), and FAST (by Rosten et al. [
28]). The feature extraction is such a time-consuming task that only a few extractors can meet real-time operation requirements. However, FAST and another extractor with better performance, Star, are both feature-only extractors without descriptors, which cannot be used directly in SLAM. Considering not only to meet the application requirements of SLAM but also to satisfy the real-time performance on the low-cost devices, we finally chose the ORB feature extractor.
It is worth noting that in recent years, a SLAM solution based on the ORB feature extractor has been proposed. As one of the most famous feature-based visual SLAM solutions, ORB-SLAM, presented by Mur-Artal et al. [
7,
8], has shown significant operating speed, robustness, and localizing accuracy. There are numerous designs in the ORB-SLAM to improve efficiency, which can be executed using a CPU and reach at least 30 FPS. The real-time nature of ORB-SLAM is precisely in line with the real-time requirements of this work. At the same time, it also provides interfaces for monocular, stereo, and RGB-D cameras, which provides the foundation of SLAM construction. However, the final result of the ORB-SLAM has limited readability and practicality, which needs to be processed. We first established a dense global map based on the original ORB-SLAM to ensure that the generated semantic map can match the global map. In addition, we reassigned the threads that SLAM runs and performed further processing during the extraction of keyframes to ensure the efficient operation of the entire program, which are introduced in the next section.
2.2. Framework of Semantic Visual SLAM
As is shown in
Figure 3, the system can be divided into two parts: the SLAM end and the semantic segmentation end. The SLAM end can also be divided into three threads: Keyframe, Local Mapping, and Loop Closing. Then, the global map is constructed through global pose optimization and Loop Correction. In the meantime, semantic segmentation threads are expanded into this system. The semantic segmentation result can be mapped to a dense semantic map through the Cloud Map Generation step, which combines the generated semantic results with the depth information from the RGB-D camera and generates the corresponding position and posture information from the global map of the SLAM end.
Semantic segmentation usually takes up too much computing resource, which causes the overall system to run at a slower speed. Here, we have applied two methods in this system to solve this problem.
Maximizing the use of computing resources is a critical way to improve operational efficiency. Although localization and navigation are the essential parts of the WNDS, it is impossible to devote all computing resources to solve SLAM. Current small high-performance computing devices have begun to use GPUs to enhance AI (Artificial Intelligence) capabilities, which can usually operate in parallel. The computing ability of the GPU can be used as much as possible to handle the task of semantic segmentation, to save CPU computing resources. Due to the requirement for complex matrix operations, the semantic segmentation coincides with the functional characteristics of the GPU as well. In short, the considerate allocation of computing power by the three threads of SLAM and the semantic segmentation thread can effectively ensure the real-time performance of the system.
Secondly, many redundant frames appear in SLAM, and the semantic segmentation of these frames also causes low efficiency. Considering the step of extracting keyframes in the SLAM thread, it is feasible to record the timestamps of keyframes and then index the images near the time stamps for semantic segmentation. The keyframe refers to the graph node captured during the mapping process associated with the previous keyframe, which, at the same time, has a significant difference. The cloud points mapped by the keyframes are finally combined into a point cloud map as a result. Therefore, semantic segmentation is performed on the images closest to the keyframes so that the map with semantic information can be established through point cloud reconstruction. Semantic segmentation of images near keyframes can minimize the amount of calculation and improve the efficiency of segmentation.
2.3. Semantic Segmentation
Semantic segmentation by the decision-level tree random forests has many limitations on the application. With the rapid development of CNN, Neural Networks are gradually becoming more popular in the field of the pixel-level segment. Moreover, the development of the Semantic Segment also aims to apply intelligent mobile robots and automatic driving vehicles. We can choose the base segment network from existing research and improve it. A highly efficient but succinct image semantic segmentation network is needed for the parallel process between SLAM and segment network. The target detection network can replace pixel-level semantic segmentation in some schemes, but it focuses on enhancing the performance of SLAM and has little effect on generating semantic maps.
Although there are many excellent CNN image segmentation networks, most of them rely on high-performance computing equipment for training and operation. ENet (proposed by Paszke et al. [
29]) is an efficient pixel-level semantic segmentation network. It uses a 34-layer residual network to form an asymmetric Encoder-Decoder structure. Due to the small number of network layers, the training speed of this network is fast, and the training model is small. To improve the accuracy and ensure the running speed, we have made a simple change of this network, mainly to increase the number of convolution layers but retain the asymmetric structure of the network.
As shown in
Figure 4, five kinds of convolution layers are adapted to establish the Semantic Segmentation network. The entire network can be divided into seven stages, where stages 1–5 constitute the encoder and stages 6–7 constitute the decoder. The order of the internal convolution layers of stages 3–5 is the same. Although it is not deliberately emphasized in the figure, except for the initial layer of the first stage, each layer adopts a Bottleneck structure. The Bottleneck structure is shown in
Figure 4b. The initialization layer, which can be seen as
Figure 4c, directly concatenates the result of convolution using a 3 × 3 (stride 2) kernel together with the max-pooling result to achieve rapid compression and reduce the storage space of a single image. There are three types of convolution kernels used in this network, including standard kernels, dilated kernels [
30,
31], and asymmetric kernels [
31], which are shown in
Figure 4d. The use of multiple convolution kernels can effectively expand the receptive field, which can significantly increase the speed and efficiency of the network while ensuring the accuracy of segmentation.
2.4. Probabilistic Data Association
The single semantic segmentation has not yet made the computer controller match the segmentation result with the detected terrain.
Section 2.2 mentioned that the semantic information is extended to three-dimensional space based on the localization results and point cloud information, which only generates a separate semantic map. Therefore, another method is needed to associate the generated semantic map with the point cloud map, which is the so-called data association. Probabilistic Data Association was first proposed by Bowman et al. [
32]. It extracts the location-related information by searching the maximum probability coincident position of the current coordinates in the point cloud map. Moreover, since the input sizes of the SLAM end and the segmentation network do not match, the two need to be processed uniformly.
When building a semantic visual SLAM system, data association is the most crucial part. In the particular scenarios applied to WNDs, this data association does not need to achieve high accuracy or correct the result of SLAM. WNDs need to perceive the semantic information of the surrounding environment in a three-dimensional manner, which means the devices need to know what they have seen and how far these objects are. Consequently, only relying on semantic segmentation of captured images cannot achieve this goal. The so-called data association refers to correlating semantic recognition results with the results of the cloud points map, which is equivalent to extending semantic information to three-dimensional space. The general SLAM solution solves the classic positioning and mapping problems. The mobile sensor moves in an unknown environment and establishes a total of M static landmark sets
. The corresponding measurement value set of the sensor is
. The task of SLAM is to use the estimated landmark L to restore the sensor trajectory position and attitude set
.
The Semantic SLAM needs not only to be able to restore posture through landmarks but also to combine semantic labels with map points. Therefore, it is necessary to associate semantics signs, landmarks, and restored poses in the set
:
where
are the prior estimated value and
is the maximum likelihood estimate based on
. It is worth noting that the landmarks here are those that already contain semantic information. The keyframe records the association between the restored pose
and the landmark
containing semantic information. In the process of loop closing detection, keyframes with the same semantic distribution are pre-screened as the key detection object. The semantic information will be connected with the generated point cloud image by fusing data association with landmarks. Therefore, when indexing semantic information, the keyframes that can be matched with the coordinates in the map and the corresponding semantics are extracted.
In order to make SLAM better cooperate with the semantic segmentation network, some changes in semantic information are necessary. On the one hand, the input images need to be segmented, and on the other hand, the landmarks need data expansion.
The image input size is set to 472 × 472 (pixels) to prevent the directionality problem of the images input to the network. The image input size of SLAM is the resolution of the camera 640 × 480. We unified the two by cutting. The center of these two input images coincide with each other, and the edge parts of the network input were ignored. The landmarks at the ignored positions are uniformly classified as “unclassified” to ensure matrix operations. This cutting method also avoids the mismatch, to a certain extent, that may occur when the network recognizes objects with incomplete edges on the image.
The semantic labels of landmarks are variable to avoid the classification errors that often occur. Each landmark continuously records the number of times where a keypoint is classified as a specific category during the entire SLAM process. With the work of the backend non-real-time optimization process, the largest category with a number of occurrences above a certain threshold other than uncategorized is selected as its own classification. If the mismatches are distributed with the normal distribution, this method can exclude most mismatch cases.
4. Conclusions
This paper has proposed a scheme to apply mature SLAM and semantic segmentation methods to wearable assistive devices for visually impaired people. We chose SLAM based on ORB feature extraction as the basis and constructed a real-time visual semantic SLAM solution for WNDs through a fast and efficient semantic segmentation network. To ensure the real-time performance of the system running on low-cost embedded devices, we made full use of the computing resources of the device and divided SLAM and semantic segmentation into multiple threads, which were allocated to different types of computing units according to their characteristics. The experimental results show that even under the condition of retaining the visual graphical interface, the program can still run on embedded devices at a processing speed of about 25 Fps. At the same time, the error is limited to centimeter-level. Finally, the generated semantic map and the SLAM cloud map are associated with each other through probabilistic association, which can further utilize mature speech synthesis solutions to broadcast location and object information to the visually impaired and further enlighten them to reach the destination. In general, the experimental results show that this system can be run in real-time on the platform we built, and the semantic map and SLAM map can be established synchronously.
Our solution shows the potential of semantic SLAM in wearable navigation devices, which can still improve in several aspects. Firstly, consider the miniaturized, lightweight, and integrated design. In fact, there are many small high-performance chips and high-precision cameras that can be used to develop wearable navigation devices. Secondly, the user experience can be improved through developing the navigation program and enhancing the human-computer interaction and appearance of the production. In addition, research in the fields of semantic SLAM fusion and multi-sensor fusion will also help further to improve the efficiency and accuracy of wearable navigation devices. Limited by the computing power and accuracy of the device, we can only implement voice-heuristic navigation currently. Accurate navigation requires the use of multi-sensor fusion, integrating GPS, IMU, and other indoor localization methods to obtain high-precision location information. Moreover, the interaction method between wearable devices and visually impaired people is also an important research topic. Simplifying the interaction operation as much as possible and increasing voice communication and text understanding can provide greater convenience for the visually impaired. In the future, we will try to utilize the GPS module and inertial measurement units while reducing the size and weight of the system to improve the operating accuracy and application scenarios of the devices.