1. Introduction
Technology now reaches every aspect of daily life, and the volume of information being generated has grown drastically as a result. Confidentiality is particularly important for institutions and corporate enterprises, since it protects them from exposure to their competitors; these establishments must therefore safeguard their data. Several hardware platforms are available for this purpose, including microprocessor-based systems such as microcontrollers, and field-programmable gate arrays (FPGAs), which are reprogrammable integrated circuits. FPGAs have several characteristics that make them useful across a range of applications. One is their inherent speed: because their logic operates in parallel, FPGAs can execute many functions simultaneously.
Recent advancements in artificial neural networks (ANNs) have revolutionized various applications, particularly speech recognition, machine translation, and scene analysis, by using deep learning algorithms for effective sequence data processing. Deep learning architectures, characterized by multiple convolutional layers and pooling mechanisms, enhance feature extraction and prediction accuracy. The Tensor Processing Unit (TPU) has emerged as a powerful architecture for executing neural network models efficiently, while the integration of deep learning in medical applications highlights the need for tailored solutions. Running deep learning algorithms on central processing units (CPUs) and graphics processing units (GPUs) raises several issues, primarily with respect to speed and power efficiency. FPGAs are therefore being considered as a promising alternative for real-time embedded systems, offering a trade-off between performance and power efficiency [1].
Power consumption is a critical issue in cloud-based deep neural network (DNN) processing, primarily due to energy-intensive data movement within data centres. Strategies such as storing weights and intermediate results in on-chip buffers can mitigate this concern by reducing the time and power needed for data retrieval. Optimizing the utilization of processing elements (PEs) is essential for enhancing throughput, which is influenced by the number of PEs and their operational efficiency in meeting DNN model requirements. Additionally, the architecture of convolutional PE arrays is designed to minimise resource consumption while maintaining performance, allowing simultaneous execution of multiple convolution operations. Various hardware accelerators, including GPUs, FPGAs, and ASICs, are employed to improve computational efficiency across different industries. GPUs are optimized for parallel processing tasks, while FPGAs offer flexibility and lower power consumption. ASICs provide specialized performance for specific computations but lack the versatility of GPUs. This paper discusses the importance of sensitivity, area, throughput, and latency in evaluating machine learning systems, emphasizing the need for future designs to focus on reconfigurable architectures and parallel processing to enhance real-time inference capabilities. The proposed novel hardware architecture for perceptrons aims to improve accuracy and efficiency, particularly in applications like heart attack risk assessments, while also integrating self-healing mechanisms to enhance reliability in neural networks [2].
FPGAs offer significant advantages in deep learning implementations, particularly in mobile and embedded platforms, outperforming traditional GPU setups in efficiency. The flexibility of FPGAs allows for optimized resource allocations, although challenges remain in managing the network topology and size. Effective implementation strategies involve balancing flexibility and optimization, utilizing automatic mapping solutions, and considering fixed-point versus floating-point coding to manage network complexity. The development of frameworks like CNNLab facilitates the testing of various topologies, while performance evaluation metrics focus on resource consumption and energy efficiency. Future developments in programmable deep learning architectures are expected to leverage the advantages of FPGAs to meet the demands of emerging applications across diverse fields [3].
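To make the fixed-point versus floating-point trade-off concrete, the following minimal Python sketch quantizes values to a signed fixed-point format and computes a dot product with integer arithmetic only. The 16-bit word with 8 fractional bits, and the function names `to_fixed`, `to_float`, and `fixed_dot`, are illustrative assumptions and are not taken from any of the cited works.

```python
def to_fixed(x: float, frac_bits: int = 8, total_bits: int = 16) -> int:
    """Quantize a float to a signed fixed-point integer, saturating on overflow.

    The word length and fractional bits are assumed values for illustration.
    """
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    q = round(x * (1 << frac_bits))
    return max(lo, min(hi, q))


def to_float(q: int, frac_bits: int = 8) -> float:
    """Convert a fixed-point integer back to a float."""
    return q / (1 << frac_bits)


def fixed_dot(weights, inputs, frac_bits: int = 8) -> float:
    """Dot product accumulated in integers, as multiply-accumulate hardware would."""
    acc = 0
    for w, x in zip(weights, inputs):
        acc += to_fixed(w, frac_bits) * to_fixed(x, frac_bits)
    # The product of two values with frac_bits fractional bits
    # carries 2 * frac_bits fractional bits.
    return acc / (1 << (2 * frac_bits))
```

With 8 fractional bits the rounding error per value is at most 1/512, which is the kind of accuracy cost that is weighed against the smaller multipliers and memories that fixed-point arithmetic allows.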
In short, our work builds upon these existing efforts by proposing a novel hardware acceleration approach for deep learning-based morphological biometric recognition. This paper introduces a novel biometric face recognition system designed to enhance security in high-security buildings. Unlike traditional systems that rely on internet connectivity, our system operates independently, ensuring robust security. The key contributions of the proposed method are summarized as follows:
Develop a Building Management System (BMS) implemented on an FPGA to control access and prevent unauthorized entries.
Optimize a deep learning model for FPGA hardware, achieving high-speed and accurate face recognition.
Explore hardware acceleration techniques to significantly improve the system’s processing speed.
Conduct extensive simulations and experiments using MATLAB R2021b (9.11) and various deep learning architectures (such as GoogLeNet, SqueezeNet, AlexNet, ResNet, and VGG-16) on the AT&T Face Database to assess performance.
Deploy the system on three different platforms (Raspberry Pi, ZYBO Z7, and PYNQ Z2) to evaluate feasibility and performance.
By combining these advancements, our system offers a fast, efficient, and reliable solution for secure access control in high-security environments. The rest of this paper is organized as follows: Section 2 presents related work. The materials and methods are explained in Section 3. The results are presented and discussed in Section 4. Finally, conclusions are drawn.
2. Related Works
Biometric facial recognition is a technology used to identify and authenticate human faces accurately. Facial scanning can be used to confirm a person's identity, for example to verify that a person does not hold multiple driver's licenses or state IDs and does not appear in a law enforcement database. Facial recognition also plays a vital role in future intelligent vehicle applications by determining whether a person is authorised to drive. Technology businesses still struggle to create a facial recognition security application that is both efficient and precise. Such an application must detect and verify a driver's identity, which is crucial for identifying suspects and improving public safety. Mobile facial recognition software helps law enforcement quickly identify nearby individuals from a safe distance. The main importance of face recognition therefore lies in security. The remainder of this section examines several studies that use different methodologies for facial recognition.
In [4], Wang et al. proposed a face detection acceleration system that uses collaboration between hardware and software to maximise the benefits of both the ARM processor and the FPGA. The MTCNN cascaded deep CNN framework was accelerated, resulting in a complete face detection system based on ZYNQ. The design consists of two main components: software and hardware. Face detection uses the MTCNN framework, in which a three-layer cascaded deep convolutional network predicts facial positions and keypoint coordinates from coarse to fine. The MTCNN model was trained on the WIDER FACE dataset. The system was deployed on the ZYNQ 7020 SoC platform using standard C (C99) within the Xilinx SDK software. The hardware integrates the ADV7511 controller and a video buffer for real-time display of images from an SD card. Direct memory access of video data is achieved using an ARM Cortex-A9 processor and VDMA as an AXI slave device. The study in [5] introduced a method for designing CNNs as hardware accelerators, focussing on optimising precision, power consumption, and space for resource-limited settings like IoT and healthcare. It highlighted the role of FPGAs in addressing the computational challenges of complex deep-learning models. The method segments input feature maps into smaller windows, which allows the multiplication units to be used efficiently through a control unit and a synchronization buffer, enhancing data processing efficiency. This method reached an accuracy of 98.79% on the CIFAR-10 dataset.
Zhang et al. [6] developed an end-to-end object detection accelerator tailored for FPGA-only devices, emphasising deep learning algorithms such as YOLO v2 and YOLO v3. It tackles the issues of deploying deep convolutional neural networks (CNNs) on resource-constrained hardware by suggesting a versatile architecture that combines CNN computation and post-processing to reduce latency. Key contributions feature a design that minimises DSP and memory use while delivering high throughput and low latency in object detection tasks. The accelerator uses optimisations like fusing batch normalisations with convolution operations and applying a quantisation method to boost performance while handling resource limitations. The hardware design includes efficient data transfer mechanisms, a processing element (PE) array for better convolution computation, and an on-chip memory system that minimises access latency. The accelerator reaches a maximum throughput of 914 GOP/s, showcasing its ability for complex object detection tasks. Performance evaluations show that the design surpasses current FPGA accelerators in throughput and resource efficiency, emphasising the need for effective post-processing algorithms to achieve low-latency requirements. The paper ends with a discussion on future work, focussing on advanced quantisation techniques and the implementation of more object detection algorithms to boost efficiency and performance. This method reached a Mean Average Precision of approximately 76.7% for YOLO v2, 53.5% for YOLO v2 tiny, 75.4% for YOLO v3, and 58.1% for YOLO v3 tiny on the VOC dataset.
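The idea of fusing batch normalisation into the preceding convolution can be illustrated with a short Python sketch. It uses the standard folding identity for inference-time batch normalisation and a single output channel reduced to a dot product; it is not the implementation from [6], and the names `conv_channel`, `batch_norm`, and `fold_bn` are hypothetical.

```python
import math


def conv_channel(weights, bias, patch):
    """One output value of a convolution: dot product of kernel and patch plus bias."""
    return sum(w * x for w, x in zip(weights, patch)) + bias


def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalisation of a single value."""
    return gamma * (y - mean) / math.sqrt(var + eps) + beta


def fold_bn(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold the batch-norm parameters into the convolution weights and bias.

    After folding, conv_channel(new_w, new_b, patch) equals
    batch_norm(conv_channel(weights, bias, patch), ...), so the separate
    normalisation step is no longer needed at inference time.
    """
    scale = gamma / math.sqrt(var + eps)
    new_weights = [w * scale for w in weights]
    new_bias = (bias - mean) * scale + beta
    return new_weights, new_bias
```

Because the folded layer is an ordinary convolution, the hardware needs no dedicated normalisation stage, which is one reason this kind of fusion is attractive when DSP and memory resources are limited.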
Yu et al. in [7] introduced an architecture for face detection with deep CNNs. The proposed method features a cascading structure of three networks (12-Net, 24-Net, and 48-Net) that enables the early rejection of non-face candidates, reducing the computational load. Models were assessed using the FDDB dataset. The paper highlights the significance of hardware factors for the effective deployment of CNNs. Using Vivado HLS, the hardware implementation shows a performance of 76.8 GFLOPS. This research connects deep learning models with FPGA implementations, achieving 83%.
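The early-rejection principle behind such a cascade can be sketched in a few lines of Python. The stage scoring functions and thresholds below are placeholders standing in for networks such as 12-Net, 24-Net, and 48-Net; the function `run_cascade` is a hypothetical illustration and not the design from [7].

```python
def run_cascade(candidates, stages):
    """Pass candidates through successive stages, dropping rejects early.

    `stages` is a list of (score_fn, threshold) pairs ordered from the
    cheapest stage to the most expensive one. Returns the surviving
    candidates and the number of candidates each stage had to evaluate.
    """
    survivors = list(candidates)
    evaluated_per_stage = []
    for score_fn, threshold in stages:
        evaluated_per_stage.append(len(survivors))
        survivors = [c for c in survivors if score_fn(c) >= threshold]
        if not survivors:
            break  # nothing left to pass to later, costlier stages
    return survivors, evaluated_per_stage
```

The per-stage counts show where the saving comes from: the later, more expensive stages only see the candidates that the earlier, cheaper stages did not reject.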
Said et al. [8] introduced a deep learning model architecture utilising CNNs. The model also included a Logistic Regression Classifier to classify the features learnt by the CNN. This system's performance was assessed with the ORL face and AR face datasets. The experimental results show this approach reached a validation accuracy of 98.7%. Hangaragi et al. [9] presented a framework for face detection and recognition using a face mesh and deep neural network. The model trained the deep neural network using real-time captured images and the Labelled Wild Face (LWF) dataset. If the test image's face landmarks match those of any training images, the model identifies the person; if not, it outputs "unknown". The study shows a precision rate of 94.23% for facial recognition using the proposed method.
Teoh et al. proposed a facial recognition system using deep learning techniques in [10]. The authors examine the challenges of face identification, such as misalignments, position variations, lighting variations, and expression fluctuations. They describe how deep learning can effectively tackle these problems and achieve enhanced accuracy in face recognition. The study examines deep learning for training neural networks and compares the performance of OpenCV and MATLAB in image processing tasks, highlighting OpenCV’s superior speed and efficiency. The authors explain how to execute the proposed face recognition system, which uses deep learning for face detection and identification. Real-time video recognition achieved an accuracy of 86.7%.
Albdairi et al. [11] proposed a method for face recognition and ethnicity identification using a deep convolutional neural network (DCNN) model grounded in deep learning. This method outperformed traditional techniques in accurately determining a person’s ethnicity through facial analysis. The authors explored high-performance computing hardware, specifically field-programmable gate arrays (FPGAs), to meet the computational needs of the DCNN model, comparing their performance with graphics processing units (GPUs). The trials utilised a dataset of 3141 facial photos from three different countries. This dataset was specifically created to identify ethnic groups. The results show that the DCNN model achieves an F1 score of 94.6% and an accuracy of 96.9% when implemented on FPGAs.
Valenzuela et al. exploited the structure of a smart imaging sensor (SIS) for real-time face identification in [12]. The SIS featured a custom smart pixel in the analogue domain capable of calculating local spatial gradients, alongside image classification performed by a digital coprocessor. The smart pixel used spatial gradients to derive a simplified version of local binary patterns (LBPs) called ringed LBPs (RLBPs). This facial recognition method involved three steps: feature extraction with RLBP, feature vector computation through RLBP histograms, and classification via linear discriminant analysis and nearest neighbour criteria. Accuracy reached 96.0%.
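The three-step pipeline of pattern extraction, histogram features, and nearest-neighbour matching can be sketched as follows. Note the assumptions: this Python sketch uses the standard 3x3 LBP operator in place of the ringed variant (RLBP) of [12], omits the linear discriminant analysis step, and uses an L1 distance between histograms; all function names are hypothetical.

```python
# Clockwise neighbour offsets (row, column) around the centre pixel.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(img, r, c):
    """Standard 8-bit LBP code: one bit per neighbour that is >= the centre."""
    centre = img[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(OFFSETS):
        if img[r + dr][c + dc] >= centre:
            code |= 1 << bit
    return code


def lbp_histogram(img):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    return hist


def nearest_neighbour(query_hist, gallery):
    """Return the gallery label whose histogram is closest in L1 distance."""
    def l1(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))
    return min(gallery, key=lambda label: l1(query_hist, gallery[label]))
```

Here `gallery` is assumed to be a dictionary mapping identity labels to previously computed histograms. The per-pixel comparison in `lbp_code` is the part that a smart pixel can evaluate locally from spatial gradients.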
Kulesza et al. [13] suggested a facial recognition system with a single-board computer. The efficacy of various single-board computers, such as Raspberry Pi, Banana Pi, and Nvidia Jetson Nano, was assessed. The authors conducted a comparison between two face detection algorithms: a Haar feature-based cascade classifier and a multitask cascaded convolutional neural network (MTCNN). The authors employed the FaceNet algorithm for face identification, which trains a model to map face images to a condensed Euclidean space where distances represent a measure of facial similarity. This system underwent training and testing using a confidential database and obtained a monitoring accuracy of more than 97% in identifying individuals entering a room. Melzi et al. in [14] introduced FRCSyn-onGoing, an ongoing challenge designed to assess and thoroughly evaluate real and synthetic data to enhance face recognition algorithms. The challenge focused on addressing concerns regarding data privacy, demographic biases, the ability to generalize to new situations, and performance constraints in difficult settings, including age differences, position changes, and occlusions. Their paper argues in favour of information fusion at different levels, including the input data, where a combination of actual and synthetic domains is suggested for specific tasks. The findings acquired in FRCSyn-onGoing, in conjunction with the proposed public ongoing benchmark, make a substantial contribution to the utilization of synthetic data for enhancing facial recognition technology.
Hammouche et al., in [15], introduced a face recognition system combining a Gabor filter bank with a deep learning method called Sparse AutoEncoder (SAE). Features were reduced in dimensionality using Principal Component Analysis and linear discriminant analysis (PCA+LDA). The matching stage uses the cosine Mahalanobis distance. Seven publicly available face databases were used for the experiments, achieving full accuracy.
Moon et al. [16] implemented a face anti-spoofing technique using CNNs to analyse the colour and texture features of face images. This method used a local binary pattern descriptor to extract features from the brightness and colour difference channels. It also examined the Cb, S, and V bands in the colour spaces. This methodology was assessed using the CASIA-FASD dataset. The authors evaluated the feasibility of applying the proposed method on an AI FPGA board, confirming its effectiveness for edge computing uses.
Dang [17] offers an effective deep-learning approach for a smart attendance system with improved facial recognition features. The author uses an enhanced FaceNet model architecture featuring a MobileNetV2 backbone and an SSD component that employs depth-wise separable convolutions. This design choice minimises size and computational complexity while achieving an accuracy above 95% and a processing speed of 25 FPS. The solution successfully addresses the limitations of memory and storage on mobile devices for identifying individuals, and it is well suited to low-capacity hardware and systems with limited resources. The author demonstrates the use of deep learning facial recognition technology in developing an advanced automatic attendance system.
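The size reduction offered by depth-wise separable convolutions follows from a simple parameter count, shown in the Python sketch below. The layer dimensions in the example (a 3x3 kernel with 32 input and 64 output channels) are assumed values chosen for illustration and are not taken from [17]; biases are ignored.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


def separable_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a depth-wise separable convolution (biases ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Assumed example layer: 3 x 3 kernel, 32 input channels, 64 output channels.
standard = standard_conv_params(3, 32, 64)     # 18432 weights
separable = separable_conv_params(3, 32, 64)   # 2336 weights
```

For this assumed layer the separable form needs roughly one eighth of the weights, which is why the technique suits devices with limited memory and storage.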
Tsai et al. in [18] proposed a system based on the FaceNet method, which extracts distinctive high-dimensional features from facial images and enables similarity to be computed through distance metrics. The model integrated separable convolution layers and fire modules, replacing conventional convolution and pooling layers to enhance efficiency and optimize memory usage. Additionally, the FPGA implementation ensured low-power operation, while the control system oversaw memory preparation and task scheduling, ensuring effective communication between the HPS and FPGA through a lightweight AXI bus. This system achieves 99.2% accuracy on the LFW dataset and 94% on the VGGFace2 dataset, with a total of 1.4 million parameters and 274 million GOPs. The hardware design was executed using Intel Quartus II, with comprehensive specifications and performance metrics available in accompanying documentation.
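Once faces are mapped to feature vectors, identification reduces to a distance comparison, as the following Python sketch illustrates. The three-dimensional embeddings, the threshold of 1.0, and the function names are assumptions made for the example; real FaceNet-style embeddings have many more dimensions, and the threshold would be tuned on validation data.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify(query, gallery, threshold=1.0):
    """Return (label, distance) of the closest enrolled embedding.

    `gallery` maps identity labels to enrolled embeddings. If even the
    closest match is farther than `threshold` (an assumed value), the
    face is reported as "unknown".
    """
    q = l2_normalize(query)
    best_label, best_dist = None, float("inf")
    for label, emb in gallery.items():
        d = euclidean(q, l2_normalize(emb))
        if d < best_dist:
            best_label, best_dist = label, d
    if best_dist > threshold:
        return "unknown", best_dist
    return best_label, best_dist
```

Because enrolling a new person only adds an entry to `gallery`, this style of matching does not require retraining the network that produces the embeddings.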
A study by Wang et al. [19] presented an approach to designing CNNs as hardware accelerators. It emphasized the utility of FPGAs as adaptable platforms for implementing deep learning models, aiming to optimize resource utilization. This architecture featured a control unit that optimally selected input feature windows and kernels, streamlining the convolution process by applying kernels to smaller, manageable windows. This design minimises the number of multiplication units needed, thereby lowering costs and improving efficiency. Implementation on the Altera 10 GX FPGA demonstrated effective resource management with low utilization percentages, achieving an accuracy of 98.79% on the CIFAR-10 dataset.
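The window-based scheme can be pictured as a single small bank of multipliers that is reused for every window of the feature map. The Python sketch below illustrates that idea only, assuming a stride of 1 and no padding; it is not the architecture from [19], and the function names are hypothetical.

```python
def extract_window(fmap, r, c, k):
    """Select the k x k window whose top-left corner is at (r, c)."""
    return [row[c:c + k] for row in fmap[r:r + k]]


def mac_window(window, kernel):
    """Multiply-accumulate one window with the kernel.

    In hardware this corresponds to one pass through a bank of k * k
    multipliers followed by an adder tree.
    """
    return sum(
        w * kv
        for w_row, k_row in zip(window, kernel)
        for w, kv in zip(w_row, k_row)
    )


def conv2d_windowed(fmap, kernel):
    """Convolution (CNN convention, no kernel flip) built from reused windows."""
    k = len(kernel)
    out_h = len(fmap) - k + 1
    out_w = len(fmap[0]) - k + 1
    return [
        [mac_window(extract_window(fmap, r, c, k), kernel) for c in range(out_w)]
        for r in range(out_h)
    ]
```

The number of multipliers needed is fixed by the kernel size rather than by the feature map size; selecting which window and kernel to feed in next is the job that the text assigns to the control unit.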
Al Amin et al., in [20], conducted a study on integrating deep learning techniques, particularly CNN and YOLO algorithms, into Advanced Driving Assistance Systems (ADASs) for traffic light detection and classification. It highlights the shortcomings of current algorithms, especially the Single Shot MultiBox Detector (SSD), which has difficulty detecting small objects and relies heavily on large, annotated datasets. This solution used the YOLO v3 tiny algorithm on the Xilinx Kria KV260 FPGA board to achieve real-time performance and to optimise resource use for autonomous vehicles. This system’s implementation included dataset preparation, model training, and deployment on the FPGA board. The Bosch Small Traffic Light Dataset (BSTLD) includes 13,427 images with annotated traffic light signals, crucial for training the YOLO model. The training process aimed to enhance the model’s performance, and the architecture was prepared for deployment on the FPGA. Experimental evaluations show a processing time of about 1.996 s per image, achieving a speed of 15 frames per second, which is suitable for real-time applications. Accuracy is approximately 99%.
Kabir [21] introduced a reconfigurable memory-centric array processor architecture tailored for deep learning applications on FPGAs, aiming to mitigate the von Neumann memory bottleneck and to enhance processing speeds. The architecture leveraged a single-instruction multiple-data (SIMD) design, which is particularly effective for the high operational intensity of CNNs. This work explored various FPGA-based accelerators and automated frameworks that enhance performance and energy efficiency for deep neural networks. It discussed the limitations of existing processing-in-memory (PIM) designs in utilizing Block RAM (BRAM) efficiently and presented a comprehensive design standard for evaluating and guiding future PIM developments. The architecture’s scalability and optimizations, demonstrated through practical implementations like PiCaSO and IMAGine, highlight its potential to achieve high performance in deep learning tasks. This paper showcased the effectiveness of a custom architecture in addressing the demands of modern deep learning applications.
Wu et al. [22] introduced a framework for gaze estimation using an FPGA, improving its use in areas like smart classrooms and advertising research. The authors combine gaze estimation algorithms with block-wise convolutions and various convolution types to improve system performance and to address on-chip memory limitations in FPGAs. Key contributions include the use of block-wise convolutions to improve computational efficiency; a hybrid architecture that combines depthwise separable and standard convolutions to lower resource usage while preserving performance; and the incorporation of head pose information to enhance gaze estimation accuracy, especially during head movements. The system runs at 32 frames per second on the ZYNQ7035 CPU, consuming an average of 6.4 watts, showcasing its effectiveness in real-time processing with little accuracy loss compared to traditional methods. The study shows how FPGA technology improves gaze estimation algorithms.
Teboulbi et al. [23] introduced a method to enhance facial point detection (FPD) using deep CNNs on FPGA-based systems-on-chip (SoCs). The proposed method employs dynamic partial reconfiguration and a hybrid architecture to address the significant computing needs of DCNNs. This includes the GPU software for FPD, CNN acceleration through high-level synthesis, and a DPR architecture to boost performance. Accuracy reached 89.01%, precision was 91.63%, and recall stood at 90.25%.
6. Conclusions
This paper presented a secure, standalone biometric system for building access control. The system leverages FPGA technology to implement a Building Management System (BMS) that can independently verify authorized personnel without relying on internet connectivity. Its core component is a facial recognition module built on a machine learning algorithm modified for FPGA implementation, which improves performance, accuracy, and speed. To select the most suitable network architecture, we evaluated GoogLeNet, SqueezeNet, AlexNet, ResNet, and VGG-16 on two datasets. AlexNet and ResNet demonstrated superior performance in terms of accuracy and efficiency. We further tested the deployment of these models on three different hardware platforms: Raspberry Pi, ZYBO Z7, and PYNQ Z2. The results of these evaluations will inform the final selection of the optimal hardware platform for our system.
Future research directions for this work encompass expanding the scope by incorporating larger and more diverse datasets, including images with varying lighting conditions, poses, occlusions, and ethnicities, to improve robustness and generalizability.
In addition, advanced FPGA architectures and hardware acceleration techniques can be implemented to further improve the real-time performance of the facial recognition system. Moreover, power optimization techniques can be investigated to minimise power consumption for deployment in resource-constrained edge computing environments.