Article

Efficient Robot Localization Through Deep Learning-Based Natural Fiduciary Pattern Recognition

by Ramón Alberto Mena-Almonte 1, Ekaitz Zulueta 2,*, Ismael Etxeberria-Agiriano 3 and Unai Fernandez-Gamiz 4

1 Instituto Tecnológico de Las Américas (ITLA), La Caleta, Boca Chica 11606, Dominican Republic
2 System Engineering and Automation Control Department, University of the Basque Country (UPV/EHU), 01006 Vitoria-Gasteiz, Spain
3 Department of Computer Languages and Systems, University College of Engineering, University of the Basque Country (UPV/EHU), 01006 Vitoria-Gasteiz, Spain
4 Department of Energy Engineering, University of the Basque Country (UPV/EHU), 01006 Vitoria-Gasteiz, Spain
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(3), 467; https://doi.org/10.3390/math13030467
Submission received: 6 November 2024 / Revised: 2 January 2025 / Accepted: 29 January 2025 / Published: 30 January 2025

Abstract

This paper introduces an efficient localization algorithm for robotic systems that uses deep learning to identify and exploit natural fiduciary patterns within the environment. Diverging from conventional localization techniques that depend on artificial markers, this method capitalizes on inherent environmental features to enhance both accuracy and computational efficiency. By integrating advanced deep learning frameworks with natural scene analysis, the proposed algorithm enables robust, real-time localization in dynamic and unstructured settings. The resulting approach offers significant improvements in adaptability, precision, and operational efficiency, representing a substantial contribution to the field of autonomous robotics. We aim to analyze an automotive manufacturing scenario in which a robot must localize itself relative to a moving target. To work with a simpler and more accessible scenario, we have chosen a demonstrative context consisting of a laboratory wall containing several elements. This paper focuses on the first part of the case study, with a continuation planned for future work. It demonstrates a scenario in which a camera mounted on a robot captures images of the underside of a car (which we assume to be represented by a gray painted surface with specific elements described in Materials and Methods). These images are processed by a convolutional neural network (CNN) designed to detect the most distinctive features of the environment. The extracted information is crucial, as the identified characteristic areas serve as reference points for the real-time localization of the industrial robot. Using deep learning, specifically convolutional neural networks, we demonstrate the potential of leveraging natural fiduciary patterns for efficient and accurate robot localization. The experimental results suggest that this approach is not only feasible but also scalable across a wide range of applications, including industrial automation, autonomous vehicles, and aerospace navigation. As robots increasingly operate in environments where computational efficiency and adaptability are paramount, our methodology offers a viable solution that enhances localization without compromising accuracy or speed. We also highlight an algorithm that enables the application of the proposed natural-fiduciary-pattern method, based on neural networks, to more complex scenarios, as well as the efficiency of the method for robot localization compared with other approaches.

1. Introduction

Robot localization is a critical challenge in robotics, involving the determination of a robot’s position and orientation within an environment [1]. Accurate localization is essential for the successful navigation and operation of autonomous robots, particularly in complex and dynamic environments [2]. Conventional localization methods often depend on artificial landmarks or sophisticated sensor systems, which can be costly and prone to failure in real-world conditions. In contrast, natural fiduciary pattern identification offers a promising, cost-effective, and scalable alternative by leveraging inherent features of the environment for localization; these features provide the two key elements of localization and navigation, namely reliable reference points within the environment and the means to improve the precision of trajectory planning for autonomous systems [3].
The novelty of this work lies in the development and implementation of a deep learning-based localization algorithm that leverages natural fiduciary patterns instead of artificial markers, a significant departure from conventional approaches. This algorithm capitalizes on the unique, inherent features of the environment to achieve robust and precise localization, even in dynamic and unstructured scenarios. Unlike existing methods, our approach integrates advanced convolutional neural networks (CNNs) for feature extraction and real-time processing, enabling high accuracy and computational efficiency.
Localization is crucial across various robotic applications, including industrial robots, autonomous vehicles, drones, and service robots [4]. It ensures that robots can perform tasks with high precision and can navigate safely, avoid obstacles, and interact effectively with their surroundings. Conventional localization techniques include GPS-based systems, which are reliable outdoors; beacon-based systems; and visual odometry, which tracks movement using cameras and complex algorithms.
Despite their usefulness, these methods have limitations. GPS signals can be unreliable indoors, and beacon systems demand substantial infrastructure setup and maintenance. Visual odometry, though valuable, can be computationally intensive and susceptible to errors in environments with poor lighting or repetitive textures [5]. These limitations have driven research into alternative approaches that utilize natural environmental features to enhance localization accuracy and robustness.
Localization algorithms usually need information about the environment, which can be collected with different sensors such as artificial vision or lidar systems. One of the easiest solutions is a camera-based reference tracking system. Such pattern- or reference-tracking algorithms follow the track of a pattern, but they require modifying the environment with artificial fiduciary pattern images such as AprilTags or ArUco tags [6,7,8]. These approaches therefore impose changes on the environment, which are sometimes impractical, not allowed, or impossible. An adequate alternative is to identify which references already present in the scene can be easily recognized. To identify the most relevant elements in a visual scene, a human could estimate the best references.
The problem with this approach is that it is time consuming, and human estimation is not necessarily the best option for a vision algorithm. After these considerations, the authors conclude that the easiest way to define how well a visual element is detected is to use a candidate identification algorithm such as a pre-trained convolutional neural network (CNN). This approach identifies the most relevant elements in a visual scene and lets us automate image labeling in subsequent training processes for visual navigation algorithms. We propose the term natural fiduciary pattern to refer to these visual elements. Natural fiduciary patterns are distinctive, recognizable features within the environment that can serve as reference points for localization. They include natural textures, structural elements, and unique objects naturally present in the environment. Utilizing these patterns involves several key stages: feature detection, pattern recognition, and localization computation.
The rapid development of deep learning has revolutionized several areas within computer vision and robotics, where robust localization algorithms have become increasingly central to industrial automation, autonomous navigation, and complex robotic systems. The goal of accurate localization has spurred innovations that allow robots to interpret, understand, and navigate environments with heightened precision and efficiency [9,10]. This work contributes to this body of research by exploring natural fiduciary patterns, a novel approach that leverages inherent environmental features instead of traditional artificial markers for localization.
This paper presents an innovative approach to robotic localization by introducing an efficient algorithm that leverages deep learning to identify and utilize natural fiduciary patterns in the environment. Unlike traditional methods reliant on artificial markers, our approach harnesses the inherent characteristics of the surroundings to improve accuracy and computational efficiency. The motivation for this work arises from the challenges of achieving precise and adaptable localization in dynamic, unstructured settings, such as automotive manufacturing. By focusing on the development of a robust algorithm for real-time localization, the paper addresses the need for scalable solutions in industrial and autonomous applications, offering significant advancements in precision, adaptability, and operational efficiency.

2. Background and Relevance

Traditional localization techniques, such as Visual Simultaneous Localization and Mapping (Visual SLAM), have proven effective in providing spatial awareness in robotics by generating maps from unknown environments. However, these methods often rely on computationally intensive processes that require extensive sensor data fusion and high processing power. Recent research has focused on improving computational efficiency without sacrificing accuracy by integrating deep learning models, especially convolutional neural networks (CNNs), which excel at spatial feature extraction and interpretation [11]. Table 1 presents a summary of the most current technologies in robot localization.
The innovation of employing natural fiduciary patterns represents a departure from conventional marker-based localization, wherein physical markers like AprilTags or ArUco markers are placed in predefined locations to assist with object identification and spatial referencing [6]. Instead, our approach encourages robots to autonomously recognize and utilize features already present in their environment, facilitating adaptability across various settings, from industrial floors to complex urban landscapes [12,13].
CNNs have emerged as critical tools in image-based tasks, providing robustness to transformations in scale, rotation, and lighting conditions. These properties make CNNs particularly suited for feature extraction and recognition tasks in dynamic environments, where traditional localization algorithms may fall short. The effectiveness of CNNs in analyzing complex, high-dimensional data has been demonstrated in several studies focused on autonomous navigation and robotic manipulation, where they significantly enhance the robustness and accuracy of object detection [14]. Through CNNs, robots can interpret their surroundings in real-time, allowing them to detect “characteristic” areas that act as natural fiduciary patterns.
In this study, we utilize a VGG16-based network architecture to analyze images and extract these characteristic regions autonomously. The application of CNNs for this purpose builds on prior work showing their capacity to process visual information in ways that mimic the hierarchical structure of the human visual system, capturing spatial features that are critical for reliable localization [15].

3. Application Case

We aim to analyze an automotive manufacturing scenario in which a robot must localize itself relative to a moving target. To work with a simpler and more accessible scenario, we have chosen a demonstrative context consisting of a laboratory wall containing several elements. High localization accuracy is expected, on the order of 1 mm relative to the environment. This precision is essential because the robot is intended to perform tasks such as screwing components onto the underside of a car.
The above-described industrial robot will be oriented using images on the underside of a car, captured by a camera mounted on the robot. These images are then processed by a pre-trained neural network, designed to identify the most distinctive features of the surroundings. The goal is to pinpoint key areas that can be utilized for more precise localization, enabling the industrial robot to differentiate and use specific environmental cues.
This paper will focus on the first part of the case study, with a continuation planned for future work. It will demonstrate a scenario in which a camera is mounted on a robot, capturing images of the underside of a car (which we assume to be represented by a gray painted surface with specific elements to be described in Section 4). These images are processed by a convolutional neural network (CNN) designed to detect the most distinctive features of the environment. The extracted information is crucial, as the identified characteristic areas will serve as reference points for the real-time localization of the industrial robot.
Once the most characteristic elements of the scenario have been identified, we process the whole dataset of images to automatically label the positions of these characteristic elements in the images. In this way, we obtain an automatic labeling algorithm, and the previously trained CNN can be used to bootstrap an object detection neural network such as YOLO.
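As an illustration of this labeling step, the following minimal MATLAB sketch packages automatically detected zones into a combined datastore of the kind accepted by object-detector training routines in the Computer Vision Toolbox. The file names, box coordinates, and the label name naturalPattern are hypothetical placeholders; this is not the authors' pipeline.

```matlab
% Minimal sketch (not the authors' pipeline): package automatically labeled
% zones into a combined datastore for object-detector training.
% Assumptions: the image files exist and zoneBoxes{i} holds the M-by-4
% [x y w h] boxes that the pre-trained CNN marked as characteristic in image i.
imageFiles = ["scene_001.png"; "scene_002.png"];             % hypothetical files
zoneBoxes  = {[120 80 40 40]; [300 200 50 35; 60 50 45 45]}; % hypothetical boxes

imds = imageDatastore(imageFiles);
tbl  = table(zoneBoxes, 'VariableNames', {'naturalPattern'});
blds = boxLabelDatastore(tbl);

trainingData = combine(imds, blds);   % pairs each image with its labeled boxes
% trainingData can then be passed to a detector training routine such as
% trainYOLOv4ObjectDetector (Computer Vision Toolbox).
```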
The pre-trained network employed here builds on advancements in CNNs, particularly those tailored for spatial feature extraction, as explored by researchers in computer vision and robotic localization [16]. The network’s task is to identify the most characteristic regions within the captured images, an application that has shown promise in complex environments like manufacturing, where visual distinctions are critical for automation accuracy [17]. In this approach, regions identified as distinctive and relevant are labeled for future use by an anthropomorphic robot mounted on a mobile platform.
Once labeled, these significant areas allow the mobile robotic platform to precisely carry out screw-fastening tasks in pre-identified locations. The choice of an anthropomorphic robot ensures versatility in motion, aligning with previous findings that highlight the advantages of flexible, human-like machines in manufacturing environments requiring adaptable movement [18]. This structured, image-driven localization strategy, supported by robust image processing and spatial recognition algorithms, is a scalable solution in robotic automation, with applications that extend beyond automotive to other sectors of high-precision manufacturing.

4. Materials and Methods

To carry out our proposal, a robot with 6 degrees of freedom is utilized; Figure 1 provides a picture of this robot. The commands and control methods used previously caused errors and did not achieve the desired results for this project. Therefore, position control for this robot is implemented in MATLAB 2024b. The project involves studying the forward and inverse kinematics of the KUKA LBR iiwa 7 R800 model (KUKA, Augsburg, Germany). Although this robot has 7 degrees of freedom, one joint is locked to simplify the equations and facilitate the calculations.
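As a sketch of how this kinematic study can be set up, the following MATLAB fragment (Robotics System Toolbox) loads the built-in iiwa 7 model and solves forward and inverse kinematics. The target pose, the weights, and the note on locking a joint are illustrative assumptions, not the authors' implementation.

```matlab
% Minimal sketch (assumptions noted): forward and inverse kinematics of the
% KUKA LBR iiwa 7 R800 using MATLAB's Robotics System Toolbox.
robot = loadrobot("kukaIiwa7", "DataFormat", "row");
q0 = homeConfiguration(robot);

% Forward kinematics: pose of the final body in the chain.
eeName = robot.BodyNames{end};          % end-effector body name taken from the model
T = getTransform(robot, q0, eeName);

% Inverse kinematics toward an example target pose (values are illustrative).
ik = inverseKinematics("RigidBodyTree", robot);
weights = [0.25 0.25 0.25 1 1 1];       % [orientation, position] weights
targetPose = trvec2tform([0.4 0.1 0.5]);
[qSol, solInfo] = ik(eeName, targetPose, weights, q0);
% To emulate the "locked" seventh joint described above, one could additionally
% constrain that joint's bounds (e.g., generalizedInverseKinematics with a
% constraintJointBounds object) so that the solver keeps it fixed.
```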
A gray painted wall has been set up with three AprilTags and a vertical gutter. The experiment is indifferent to the specific objects placed on the wall; in fact, the AprilTags were used for a different purpose, and any other objects could have been placed in the environment. At no point are the AprilTags’ codes decoded or otherwise utilized. Figure 1b provides a picture of this scenario. The wall has a smooth background to ensure that the neural network focuses only on these objects. The goal is to observe which areas are identified more effectively by the neural network and then use those positions as a reference mechanism for the robot’s spatial and temporal localization.
Figure 1a shows the KUKA robot with 6 degrees of freedom, and Figure 1b shows the gray painted wall set up with three AprilTags and a vertical gutter.
The robot is equipped with a camera and programmed to take photos of the study wall described above. Photos are taken at different positions and saved for later use. Each photo’s filename encodes the capture position, which eases the process of training the neural network.
Procedure 1. Image/coordinates capturing process.
  • Input: None.
  • Step 1: The robot starts from an initial known position.
  • Step 2: The robot captures images labeled with positions of the scene.
  • Output: Image dataset labeled with associated positions.
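A minimal MATLAB sketch of this capture loop is given below; the webcam interface, the grid of positions, and the file-name pattern are assumptions used only to show how each image can carry its position in its name.

```matlab
% Minimal sketch of Procedure 1 (assumptions: a 'webcam' support-package
% camera and a precomputed list of robot positions 'poses' in millimeters).
cam   = webcam;                           % camera mounted on the robot
poses = [0 0; 10 0; 20 0; 0 10; 10 10];   % illustrative (x, y) grid, mm

outDir = "wall_dataset";
if ~exist(outDir, "dir"), mkdir(outDir); end

for k = 1:size(poses, 1)
    % (the robot motion command toward poses(k,:) would go here)
    img = snapshot(cam);
    % Encode the position in the filename so the label travels with the image.
    fname = sprintf("img_x%+07.2f_y%+07.2f.png", poses(k,1), poses(k,2));
    imwrite(img, fullfile(outDir, fname));
end
```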
Procedure 2. Robot localization using natural fiduciary pattern identification.
  • Input: Image dataset labeled with associated positions; pre-trained CNN.
  • Step 1: The CNN is trained using the image dataset labeled with associated positions.
  • Step 2: Once the CNN is trained, the most significant zones are identified. A heat map is obtained with predicted error for each position. The best error predictions give the most significant patterns.
  • Output: Localization error heat map; trained CNN.
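The heat-map step of Procedure 2 can be sketched as follows in MATLAB; net, patches, truePos, and the patch grid size are assumed to come from the training stage described below and are purely illustrative.

```matlab
% Minimal sketch of Procedure 2's heat-map step (not the authors' code).
% Assumptions: 'net' is the trained regression CNN, 'patches' is an
% H-by-W-by-3-by-N array of wall patches, and 'truePos' their true (x, y).
predPos = predict(net, patches);              % N-by-2 predicted positions
errXY   = abs(predPos - truePos);             % per-patch |error| in x and y

% Arrange the per-patch x-error back onto the wall grid (an nRows-by-nCols
% patch layout, with N = nRows*nCols assumed) and display it as a heat map.
nRows = 8; nCols = 10;                        % illustrative grid size
heatX = reshape(errXY(:,1), nRows, nCols);
imagesc(heatX); axis image; colorbar;
title("Prediction error in X per wall patch");
% Low-error regions mark the most significant natural fiduciary patterns.
```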
Computational requirements:
  • A KUKA robot with 6 degrees of freedom (LBR iiwa 7 R800 model);
  • A high-resolution camera;
  • Programming language: MATLAB (desktop);
  • Custom-developed software for dividing the images;
  • A pre-trained VGG16 CNN model (training time: 35 min 11 s);
  • An HP laptop with a Core i5 processor, 8 GB of RAM, and a 356 GB disk.
A neural network has been designed to analyze photos, which must be resized to dimensions that the network can process. The neural network selected for training is VGG16 [14,24]. This network is designed for photos with a resolution of 128 × 128, but it can be adjusted to train with photos of 64 × 64 or 32 × 32, for example. The original photos have a resolution of 640 × 480, which is not compatible with the network.
Segmenting the photos will not only allow them to be used for training the network but will also increase the amount of input data available for training [25,26]. To accomplish this, two programs have been designed to divide the original photos. There are two programs because two training sessions will be conducted with two different image sizes: 64 × 64 and 128 × 128 [27]. The programs are identical, so only the 64 × 64 program will be explained.
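A minimal MATLAB sketch of this division step is shown below (not the authors' program); the input file name is assumed, and the 128 × 128 variant only changes the tile size.

```matlab
% Minimal sketch of the image-division step: cut a 640-by-480 capture into
% non-overlapping 64-by-64 tiles; the 128-by-128 variant only changes 'tile'.
img  = imread(fullfile("wall_dataset", "img_x+000.00_y+000.00.png")); % assumed name
tile = 64;
[h, w, ~] = size(img);

tiles = {};
for r = 1:tile:(h - tile + 1)
    for c = 1:tile:(w - tile + 1)
        tiles{end+1} = img(r:r+tile-1, c:c+tile-1, :);   %#ok<SAGROW>
    end
end
% A 640 x 480 photo yields floor(480/64) * floor(640/64) = 7 * 10 = 70 tiles,
% multiplying the amount of training data as described above.
```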
As a final optimization step, the coordinate data are normalized, significantly improving the efficiency of the neural network by reducing the computational overhead during the error estimation process, particularly when calculating the mean and variance. This approach not only accelerates training time but also ensures higher precision in the subsequent localization task [28].
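As an illustration, assuming the patch positions are stored in an N × 2 matrix Y, the normalization and its inverse can be written as follows; this is a standard z-score sketch, not the authors' exact code.

```matlab
% Minimal sketch: z-score normalization of the (x, y) targets before training,
% with the statistics kept so that predictions can be de-normalized.
% 'Y' is assumed to be an N-by-2 matrix of patch positions in millimeters.
muY    = mean(Y, 1);
sigmaY = std(Y, 0, 1);
Ynorm  = (Y - muY) ./ sigmaY;        % used as the regression responses

% After inference: recover metric coordinates from a normalized prediction.
predMM = predict(net, patches) .* sigmaY + muY;
```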
This work proposes a novel localization algorithm that utilizes natural fiduciary patterns, reversing the conventional approach. Instead of placing artificial markers in predefined locations, this method employs deep learning and computer vision techniques to allow the robot to identify its surroundings and infer possible paths autonomously [11,23,29,30].
Our preliminary work considered the use of AprilTags, as can be seen in the provided pictures. However, we ended up discarding their interpretation, treating those tags simply as plain environment elements; any visual element would work equally well in our experiments.
To contextualize this approach, we begin with a review of recent advancements in related technologies, including Visual SLAM, object identification, and ArUco markers. The methodology involves programming a KUKA LBR iiwa 7 R800 robot using MATLAB to capture images of a smooth wall containing three reference tags and a vertical gutter. These images will be segmented using MATLAB [31], and the resulting segments will be used to train an artificial neural network. Various techniques will be employed to analyze the network’s ability to differentiate features and infer localization [23]. Finally, we will present our findings, conclusions, and an algorithm detailing the steps required for robot localization using the proposed method.
In this study, the neural network used is a custom-trained convolutional neural network (CNN) based on VGG16 [11,14] and designed for image regression tasks. This type of network typically consists of multiple convolutional layers that extract spatial features from the input images, followed by fully connected layers that process these features to output continuous values. The network is trained to predict the position of objects within images, making it suitable for tasks that require precise localization. The operations involved, such as reading images and predicting positions, show that the network is tailored for image regression, emphasizing its capability to handle spatial data and provide accurate positional outputs. Table 2 provides the detailed characteristics of the CNN (VGG16) network, and its layer composition is depicted in Figure 2.
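The following MATLAB sketch shows one way such a VGG-style regression network can be declared and trained with the Deep Learning Toolbox. It is a shortened stack in the spirit of Table 2, with assumed training options, and with Xtrain and Ynorm taken to be the tiles and normalized coordinates prepared earlier; it is not the authors' exact configuration.

```matlab
% Minimal sketch (assumed): a VGG-style convolutional network with a 2-output
% regression head for (x, y), trained on 128x128 tiles. Xtrain is 128x128x3xN,
% Ynorm is N-by-2 (normalized coordinates).
layers = [
    imageInputLayer([128 128 3])
    convolution2dLayer(3, 64, "Padding", "same")
    reluLayer
    convolution2dLayer(3, 64, "Padding", "same")
    reluLayer
    maxPooling2dLayer(2, "Stride", 2)
    convolution2dLayer(3, 128, "Padding", "same")
    reluLayer
    convolution2dLayer(3, 128, "Padding", "same")
    reluLayer
    maxPooling2dLayer(2, "Stride", 2)
    convolution2dLayer(3, 256, "Padding", "same")
    reluLayer
    maxPooling2dLayer(2, "Stride", 2)
    fullyConnectedLayer(100)
    reluLayer
    dropoutLayer(0.5)
    fullyConnectedLayer(2)          % regression output: (x, y)
    regressionLayer
    ];

options = trainingOptions("adam", "MaxEpochs", 20, ...
    "MiniBatchSize", 64, "Plots", "training-progress", "Verbose", false);
net = trainNetwork(Xtrain, Ynorm, layers, options);
```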
Figure 3 illustrates the training process of the convolutional neural network (CNN) VGG16, detailing how the model iteratively updates its parameters throughout training. This visualization is essential for comprehending the network’s progression towards optimal performance and the refinement of its feature extraction capabilities.
In Figure 3, we present the results of our training process for 64 × 64 and 128 × 128 size images, measuring the error along both the X and Y axes. The intensity of the blue color represents the magnitude of the error: the deeper the blue, the lower the reading error in the network’s analysis.
The main mathematical formulation of VGG16 is as follows.

1. Convolution operation: each convolution layer performs the operation

$$ y_{i,j,q,p+1} = \sum_{m=1}^{N_{\mathrm{rows}}} \sum_{n=1}^{N_{\mathrm{columns}}} \sum_{k=1}^{N_{\mathrm{channels}}} y_{i+m,\,j+n,\,k,\,p}\, w_{m,n,k,q,p} $$

where
  • $y_{i,j,q,p}$ is the value of the $p$th tensor at position $(i, j)$ in the $q$th channel; in the first layer, this input is the image data.
  • $w_{m,n,k,q,p}$ are the weights of the $q$th filter of the $p$th layer, indexed by spatial position $(m, n)$ and input channel $k$.

2. Activation function (ReLU):

$$ y_{i,j,q,p+1} = \max\left(0,\; y_{i,j,q,p}\right) $$

3. Max-pooling operation:

$$ y_{i,j,q,p+1} = \max_{\substack{m = 1,\dots,N_{\mathrm{rows}} \\ n = 1,\dots,N_{\mathrm{columns}} \\ k = 1,\dots,N_{\mathrm{channels}}}} \left( y_{i+m,\,j+n,\,k,\,p}\, w_{m,n,k,q,p} \right) $$

4. Dense layers:

$$ y_{i,p+1} = \sum_{j=1}^{N_{\mathrm{inputs},p}} w_{i,j,p}\, y_{j,p} + b_{i,p} $$

where
  • $y_{j,p}$ is the $j$th value of the $p$th tensor.
  • $w_{i,j,p}$ is the $(i, j)$ weight of the $p$th layer.
  • $b_{i,p}$ is the $i$th bias of the $p$th layer.
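For readers who prefer a concrete numerical check, the short MATLAB sketch below applies the four operations above to a toy single-channel input; the values, filter, and layer sizes are arbitrary illustrations, not parameters of the trained network.

```matlab
% Toy numerical illustration of the four operations above on one channel.
% All values are arbitrary; this is not the trained network.
x = magic(4);                         % 4x4 "image" patch
w = [1 0; 0 -1];                      % 2x2 convolution filter

y = conv2(x, rot90(w, 2), 'valid');   % correlation written via conv2 (3x3 output)
y = max(y, 0);                        % ReLU
pooled = max(y(1:2, 1:2), [], 'all'); % max pooling over one 2x2 window

W = rand(2, numel(y));  b = rand(2, 1);
z = W * y(:) + b;                     % dense layer producing a 2-vector
```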

5. Results

In Figure 4, where the network measures the error in X from the 64 × 64 resolution images, a darker blue is shown in the space used by the gutter on the test wall, indicating that the network recognizes it. As for the three AprilTags, nothing seems to be distinguishable. Now let us look at Figure 5, where the network measures the error in Y from the 64 × 64 resolution images.
In Y, the network distinguishes neither the gutter nor the AprilTags. The lighter blue color appears in the center of the wall, which is logical: there, the network indicates that there is nothing to recognize. The gutter is vertical, not horizontal, which is why the network is unable to read it along this axis.
In Figure 6, where the network measures the error in X from the 128 × 128 resolution images, the blue spots at the positions of the AprilTags and the gutter on the test wall indicate that the network recognizes them; the bluest areas coincide with the spaces occupied by the AprilTags.
Regarding the Y-axis, the result in Figure 7 remains very similar to that of the Y-axis with the 64 × 64 resolution images. The AprilTag on the right is somewhat more distinguishable, and the other two AprilTags can be vaguely discerned, but the overall error remains quite high.
From the above, regarding the neural network’s reading, we conclude that it recognizes the elements placed on the test wall much better through the X-axis than through the Y-axis, for both the 64 × 64 and 128 × 128 resolutions. For the area of the wall where there are no elements, the network effectively communicates this information, indicated by the lighter blue areas.
Figure 8 presents the errors produced by the neural networks for 64 × 64 and 128 × 128 images, visualized as heat maps. The top graph shows the error committed by the network trained with 64 × 64 resolution images, and the bottom graph shows the error committed by the network trained with 128 × 128 resolution images. Both networks make more errors where the AprilTags are located on the left side of the test wall, and the neural network processing the 64 × 64 images clearly tends to fail more.
The results observed, both in the charts showing the neural networks’ visualizations and in the heat maps, indicate that the network working with 128 × 128 images performs better in terms of calculated error. However, the fact that the network has a higher error when visualizing objects on the wall implies that it detects the existence of objects that are difficult to distinguish but are present, which aligns with the goal of our research.
As a summary and for clarification, the VGG16 model is trained and tested with 64 × 64 and 128 × 128 images. To verify the recognition error, we use two mechanisms: the errors in X and Y images (Figure 4, Figure 5, Figure 6 and Figure 7) and the heatmap (Figure 8). Regarding the recognition errors, the network with 128 × 128 images performs better in the X error (Figure 6) as it allows for two things:
  • The position of the objects in the scenario. The fact that the objects in the scenario are positioned in their original locations is a fundamental aspect of our study, as it enables accurate localization and the mapping of potential trajectories.
  • The shape of natural fiducial patterns is another key aspect. Their form indicates the space occupied by these objects, which is an essential element for accurate localization and trajectory mapping.
Regarding the heat map, the VGG16 model shows a higher reading error for the two AprilTags placed on the board, which does not occur with the metal gutter. This suggests that simple shapes, such as the bar, can be recognized better by the CNN.

6. Discussion

The neural network processing the 128 × 128 images shows a lower error in the Y-axis, indicating that it distinguishes the objects on the wall more effectively and is therefore more useful for our purposes. Objects that remain indistinct to the network cannot serve as reliable references, so the robot cannot localize itself from them; instead, it must rely on the more easily identifiable features for accurate positioning within the environment.
Thus, the neural network achieves better performance (minimal error) where there are no objects (in empty space) or where the objects have simple shapes, such as the gutter; note that the heat maps of both networks do not show high error levels in the space these occupy.

7. Conclusions

In this work, we have demonstrated the potential of leveraging natural fiduciary patterns for efficient and accurate robot localization. The analysis conducted with deep learning, specifically convolutional neural networks (CNNs) [29,32], and in particular the identification of significant and non-significant areas in the robot’s surroundings, provides a clear pathway for future enhancements of localization algorithms [2].
The experimental results suggest that this approach is not only feasible but also scalable across a wide range of applications, including industrial automation [33], autonomous vehicles, and aerospace navigation. As robots increasingly operate in environments where computational efficiency and adaptability are paramount, our methodology offers a viable solution to enhance localization without compromising accuracy or speed [10].
Future work will focus on improving the algorithm’s ability to generalize across different environments and exploring hybrid approaches that combine natural fiduciary patterns with other localization techniques, such as Visual SLAM [33], to further increase robustness.
The areas identified by the neural network as most characteristic, as reported in this work, will be labeled and used as a reference mechanism for the robotic system [11,13].

8. General Applications and Future Directions

The implementation of natural fiduciary patterns has far-reaching implications for fields that demand high precision in localization, such as autonomous driving, medical robotics, and aerospace navigation. By moving away from artificial markers, the proposed method offers a scalable, adaptable solution that can be deployed across varying contexts, enabling robots to navigate seamlessly in diverse environments [5,34]. Additionally, the reduction in reliance on structured environments makes this method particularly suitable for industrial automation, where flexibility and rapid adaptation are increasingly vital.
Looking ahead, this work highlights several areas for improvement and expansion. One promising direction involves integrating natural fiduciary patterns with established techniques such as Visual SLAM, which could lead to hybrid systems that leverage the strengths of both approaches. By combining CNN-based feature extraction with traditional mapping, future localization algorithms could achieve higher levels of robustness, particularly in environments subject to frequent changes or those with limited visual features [35].

Author Contributions

Conceptualization, E.Z. and U.F.-G.; methodology, R.A.M.-A. and E.Z.; investigation, E.Z. and R.A.M.-A.; writing—original draft preparation, R.A.M.-A., I.E.-A., and U.F.-G.; writing—review and editing, U.F.-G.; supervision, E.Z.; project administration, I.E.-A.; funding acquisition, E.Z. All authors have read and agreed to the published version of the manuscript.

Funding

The authors appreciate the support of the government of the Basque Country through the research programs Elkartek Deepbask (Grant No. KK-2024/00069) and the GIZAPRE Consolidated Research Group (IT-1435-22). This paper is part of Project PID2021-126937NB-I00 (PALEOCROSS), funded by MCIN/AEI/10.13039/501100011033 and by “ERDF A way of making Europe”.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Durrant-Whyte, H.; Bailey, T. Simultaneous localization and mapping: Part I. IEEE Robot. Autom. Mag. 2006, 13, 99–110. [Google Scholar] [CrossRef]
  2. Thrun, S.; Burgard, W.; Fox, D. Probabilistic Robotics Cambridge; MIT Press: Cambridge, MA, USA, 2005. [Google Scholar]
  3. Hartley, R.; Zisserman, A. Multiple View Geometry in Computer Vision; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
  4. Siciliano, B. Springer Handbook of Robotics; Springer: Berlin/Heidelberg, Germany, 2008; Volume 2, pp. 15–35. [Google Scholar]
  5. Scaramuzza, D.; Fraundorfer, F. Visual odometry [tutorial]. IEEE Robot. Autom. Mag. 2011, 18, 80–92. [Google Scholar] [CrossRef]
  6. Garrido-Jurado, S.; Muñoz-Salinas, R.; Madrid-Cuevas, F.J.; Marín-Jiménez, M.J. Automatic generation and detection of highly reliable fiducial markers under occlusion. Pattern Recognit. 2014, 47, 2280–2292. [Google Scholar] [CrossRef]
  7. Muñoz-Salinas, R.; Marín-Jimenez, M.J.; Yeguas-Bolivar, E.; Medina-Carnicer, R. Mapping and localization from planar markers. Pattern Recognit. 2018, 73, 158–171. [Google Scholar] [CrossRef]
  8. Olson, E. AprilTag: A robust and flexible visual fiducial system. In Proceedings of the 2011 IEEE international conference on robotics and automation, Shanghai, China, 9–13 May 2011; pp. 3400–3407. [Google Scholar]
  9. Mur-Artal, R.; Tardós, J.D. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef]
  10. Cadena, C.; Carlone, L.; Carrillo, H.; Latif, Y.; Scaramuzza, D.; Neira, J.; Reid, I.; Leonard, J.J. Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age. IEEE Trans. Robot. 2016, 32, 1309–1332. [Google Scholar] [CrossRef]
  11. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  12. Leutenegger, S.; Chli, M.; Siegwart, R.Y. BRISK: Binary robust invariant scalable keypoints. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2548–2555. [Google Scholar]
  13. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef]
  14. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  15. Bay, H.; Tuytelaars, T.; Van Gool, L. SURF: Speeded up robust features. In Computer Vision–ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, 7–13 May 2006. Proceedings, Part. I 9; Springer: Berlin/Heidelberg, Germany, 2006; pp. 404–417. [Google Scholar]
  16. Liu, J.; Liu, Z. The Vision-Based Target Recognition, Localization, and Control for Harvesting Robots: A Review. Int. J. Precis. Eng. Manuf. 2024, 25, 409–428. [Google Scholar] [CrossRef]
  17. Zhou, H.; Yang, G.; Wang, B.; Li, X.; Wang, R.; Huang, X.; Wu, H.; Wang, X.V. An attention-based deep learning approach for inertial motion recognition and estimation in human-robot collaboration. J. Manuf. Syst. 2023, 67, 97–110. [Google Scholar] [CrossRef]
  18. Zhang, C.; Li, M.; Chen, Y.; Yang, Z.; He, B.; Li, X.; Xie, J.; Xu, G. An anthropomorphic robotic hand with a soft-rigid hybrid structure and positive-negative pneumatic actuation. IEEE Robot. Autom. Lett. 2023, 8, 4346–4353. [Google Scholar] [CrossRef]
  19. Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.M.; Tardós, J.D. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
  20. Schops, T.; Schönberger, J.L.; Galliani, S.; Sattler, T.; Schindler, K.; Pollefeys, M.; Geiger, A. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3260–3269. [Google Scholar]
  21. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  22. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571. [Google Scholar]
  23. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  24. Yosinski, J.; Clune, J.; Bengio, Y.; Lipson, H. How transferable are features in deep neural networks? Adv. Neural Inf. Process. Syst. 2014, 27, 1–9. [Google Scholar]
  25. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  26. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5-9 October 2015, Proceedings, Part III 18; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  27. Shorten, C.; Khoshgoftaar, T.M. A survey on image data augmentation for deep learning. J. Big Data 2019, 6, 1–48. [Google Scholar] [CrossRef]
  28. Bengio, Y.; Courville, A.; Vincent, P. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1798–1828. [Google Scholar] [CrossRef]
  29. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  30. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  31. Kaess, M.; Ranganathan, A.; Dellaert, F. iSAM: Incremental smoothing and mapping. IEEE Trans. Robot. 2008, 24, 1365–1378. [Google Scholar] [CrossRef]
  32. Wang, Q.; Yuan, C.; Liu, Y. Learning deep conditional neural network for image segmentation. IEEE Trans. Multimed. 2019, 21, 1839–1852. [Google Scholar] [CrossRef]
  33. Zhang, Z. Microsoft kinect sensor and its effect. IEEE Multimedia 2012, 19, 4–10. [Google Scholar] [CrossRef]
  34. Kendall, A.; Gal, Y. What uncertainties do we need in bayesian deep learning for computer vision? Adv. Neural Inf. Process. Syst. 2017, 30, 1–11. [Google Scholar]
  35. Engel, J.; Schöps, T.; Cremers, D. LSD-SLAM: Large-scale direct monocular SLAM. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2014; pp. 834–849. [Google Scholar]
Figure 1. This is the working scenario: (a) the robot used and (b) the study wall.
Figure 2. VGG16 layer sequence.
Figure 3. Training process.
Figure 4. Error in X during training with 64 × 64 resolution images.
Figure 5. Error in Y during training with 64 × 64 resolution images.
Figure 6. Error in X during training with 128 × 128 resolution images.
Figure 7. Error in Y during training with 128 × 128 resolution images.
Figure 8. Heat map of the trained neural network.
Table 1. Leading robot localization techniques.

Patterns | Short Description | Recent Advancements | Refs.
Visual SLAM | Estimates the position of a robot from visual data, typically obtained from cameras, while constructing a map of the surrounding environment. Applied in robotics, augmented reality (AR), and autonomous vehicles. | Feature-based methods: identifying and tracking visual features over time; recent advancements include improved versions of ORB-SLAM and new algorithms. Direct methods: these operate directly on pixel intensities rather than extracting features; examples include updated versions of DSO and new direct methods. Deep learning-based approaches: learning-based methods leveraging convolutional neural networks (CNNs) for feature extraction and end-to-end SLAM solutions. | [12,13,14]
Object Identification | Object identification involves recognizing and categorizing objects within an image or video. | Conventional methods: techniques such as improved versions of HOG and DPM. Deep learning: CNNs have revolutionized the field, with updated models like YOLOv5, Faster R-CNN, and SSD providing state-of-the-art performance. | [15,16]
ArUco Markers | ArUco markers are binary square fiducial markers used for camera pose estimation and calibration. | Detection algorithms: recent improvements in ArUco marker detection algorithms and OpenCV library implementations; used in augmented reality, robotics, and computer vision applications like camera calibration and 3D scene reconstruction. | [17,18]
AprilTag | AprilTag is another type of fiducial marker designed for robust and efficient detection. | Tag design: updated designs of AprilTags ensuring high detection rates and minimal false positives. Detection performance: recent updates in the AprilTag library offer fast and reliable detection, suitable for real-time applications in robotics and AR. | [19]
SURF (Speeded-Up Robust Features) | SURF is a feature detector and descriptor used for object recognition, image registration, and 3D reconstruction. | Feature detection: recent improvements in SURF for detecting points of interest. Descriptor: updated SURF descriptors that describe the neighborhood around each detected feature, allowing for robust matching. | [20]
ORB (Oriented FAST and Rotated BRIEF) | ORB is an efficient alternative to SIFT and SURF, providing comparable performance at a lower computational cost. | Feature detection: recent improvements in ORB for detecting keypoints. Descriptor: updated BRIEF descriptors that are rotated according to the keypoint orientation to achieve rotation invariance. | [21,22]
SIFT (Scale-Invariant Feature Transform) | SIFT is a robust feature detection and description algorithm widely used in computer vision. | Feature detection: recent advancements in SIFT for detecting scale-invariant keypoints. Descriptor: updated SIFT descriptors based on the gradient histogram of the keypoint neighborhood, providing robustness to scale, rotation, and affine transformations. | [23]
Table 2. Architecture details of the VGG16 CNN layers utilized.

1. Input Layer: 128 × 128 × 3
2. Conv2D Layer: 128 × 128 × 64 (Weights: 3 × 3 × 3, Bias: 1 × 1 × 64), ReLU
3. Conv2D Layer: 128 × 128 × 64 (Weights: 3 × 3 × 64, Bias: 1 × 1 × 64), ReLU
4. MaxPooling2D: 64 × 64 × 64
5. Conv2D Layer: 64 × 64 × 128 (Weights: 3 × 3 × 64, Bias: 1 × 1 × 128), ReLU
6. Conv2D Layer: 64 × 64 × 128 (Weights: 3 × 3 × 128, Bias: 1 × 1 × 128), ReLU
7. MaxPooling2D: 32 × 32 × 128
8. Conv2D Layer: 32 × 32 × 256 (Weights: 3 × 3 × 128, Bias: 1 × 1 × 256), ReLU
9. Conv2D Layer: 32 × 32 × 256 (Weights: 3 × 3 × 256, Bias: 1 × 1 × 256), ReLU
10. MaxPooling2D: 16 × 16 × 256
11. Conv2D Layer: 16 × 16 × 512 (Weights: 3 × 3 × 256, Bias: 1 × 1 × 512), ReLU
12. Conv2D Layer: 16 × 16 × 512 (Weights: 3 × 3 × 512, Bias: 1 × 1 × 512), ReLU
13. MaxPooling2D: 8 × 8 × 512
14. Conv2D Layer: 8 × 8 × 512 (Weights: 3 × 3 × 512, Bias: 1 × 1 × 512), ReLU
15. MaxPooling2D: 4 × 4 × 512
16. Fully Connected Layer: 1 × 1 × 1000 (Weights: 1000 × 81, Bias: 1000 × 1), ReLU, Dropout (50%)
17. Fully Connected Layer: 1 × 1 × 100 (Weights: 100 × 81, Bias: 100 × 1), ReLU, Dropout (50%)
18. Fully Connected Layer: 1 × 1 × 2 (Weights: 2 × 100, Bias: 2 × 1), Mean-Squared Error Output
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
