1. Introduction
Drone technology has advanced considerably in recent years. Drones are becoming increasingly useful in places that humans cannot reach quickly and effectively. When drone technology is applied in fields such as industry, government agencies, and military sites, the scope, potential, and scale of global reach increase. Drones can reach the most remote places while requiring little manpower, effort, energy, or time.
The next generation of drones will primarily focus on propulsion, size, and autonomy. The type of drone is determined by the technology employed to fly it [1]. Multirotor drones are the most common type of drone and the one most professionals use; video surveillance and aerial photography are just two of their many applications. A multirotor platform allows precise framing and positioning of the camera, resulting in crisp aerial photos. Multirotors are also the most cost-effective to build and fly, and quadcopters are the most widely used form of multirotor. Multirotors, however, have several drawbacks: they offer limited endurance, speed, and flight time. The maximum flight time of a typical multirotor is 20–30 min with a minimal payload.
Interpreting the articulation of the human body in an image and detecting a person’s motion in a video sequence acquired by a multirotor quadcopter drone [2] is a difficult research topic. Human activities can be hard to discern under conditions such as visual blur, perspective distortion, low-resolution scenes, and occlusions.
Many effective research efforts in human activity recognition have been carried out with various deep learning approaches on numerous common video datasets. Because of the endless postures and articulated nature of the human body, human activity recognition is a complex and time-consuming research challenge. Researchers are increasingly interested in human activities in order to develop autonomous driving systems that rely on a variety of data collection methods. This necessitates examining images and videos captured by aerial cameras [
3].
Many datasets are available on the internet to research the behavior of people, automobiles, and the road environment. These datasets are crucial for the study of autonomous vehicle systems, including buses and self-driving vehicles. Numerous academics are now working on real-time research projects and commercial projects in fields such as search and rescue, crowd control, situational awareness, surveillance, and sports activity recording [
4,
5,
6].
To perform at a high level in these challenging environments, one needs efficient and suitable algorithms as well as training data. The challenging missions need datasets containing a variety of human perspectives. The vast majority of datasets focus on identifying aerial or ground-based activity. Finding a dataset containing video sequences of ground-level and aerial-view-based human activities is difficult.
The Microsoft Common Objects in Context (MS-COCO) dataset [
7], which includes numerous views and footage of one or more individuals, was utilized to research human identification, human activity recognition, and other topics. Most of the activities in this dataset are captured from different perspectives: aerial platforms, and ground-level views photographed by a fixed or airborne camera. This paper focuses on single and multiple human activity recognition by utilizing spatial and temporal features. To improve the accuracy and speed of human recognition, it uses a feature descriptor followed by instance segmentation, and a bidirectional long short-term memory (Bi-LSTM) network for activity recognition.
The major key contributions of the present research work can be summarized as follows:
After the video stream is divided into frames, a preprocessing pipeline is applied to increase classification efficiency; rectangular regions of interest (ROIs) are produced based on Sobel edge detection, which speeds up processing, since the research is concerned only with persons and their behavior.
Following that, the HOG descriptor is utilized to extract features from the preprocessed frames to enhance the performance of the model. The local appearance and shape of an object in an image can be described by the distribution of intensity gradients or edge directions. These descriptors are computed by dividing the image into small connected regions called cells and computing, for each cell, a histogram of gradient directions or edge orientations over all pixels in the cell. The descriptor is the concatenation of these histograms.
The extracted HOG features are then passed to the Mask-RCNN framework, pretrained on the MS-COCO dataset, to improve prediction accuracy and achieve state-of-the-art human detection.
Finally, Bi-LSTMs, as opposed to baseline LSTMs (which use only past information), use both past and future information when the entire sequence of time series data is available. Because of the additional context provided, the network can make more accurate predictions. The model employs convolutional kernels of various sizes, which allows it to capture various temporal local dependencies in sequential data.
Some of the limitations of previous work are as follows: the mask-based object segmentation and tracking proposed in [8] do not incorporate the temporal dimension or global optimization techniques. The model in [9] was unable to learn a recurrent representation of the modulation parameters to adapt the FCN to temporal input. Another video instance segmentation model [10] did not use object proposal and detection with spatio-temporal features, an end-to-end trainable matching criterion, or motion information for better recognition and identity association. Human group detection and tracking for event detection using the HOG-based action recognition paradigm [11] was unsuccessful. The deep learning-based object detection model using Mask-RCNN [12] failed to improve accuracy and performance for real-time research applications. Robust pedestrian detection using a recursive convolutional neural network [13] and human detection and tracking with deep convolutional neural networks under noise and occluded scenes [14] did not fully exploit spatial and temporal features or increase the accuracy of detecting and tracking humans.
Detecting complex activities becomes time-consuming and difficult as more features are gathered. Such handcrafted methods have limited accuracy and call for specialized training and expertise in the relevant field. This is where the proposed deep learning-based HAR with histograms proves beneficial. Deep learning models such as Mask-RCNN and bidirectional LSTM automatically learn the features required to make accurate predictions from histogram data computed directly from the raw frames. This enables new and large datasets to be used for HAR. Drone-captured YouTube-Aerial data is used, which results in an efficient model. The model is also capable of learning high-level features that can be utilized in complex HAR.
The remainder of this paper is organized as follows. Related research on object detection and segmentation, human detection, the histogram of oriented gradients (HOG), instance segmentation techniques, the regional convolutional neural network (RCNN), and activity recognition in videos is reviewed in
Section 2. The problem definition is described in
Section 3. The proposed methodology is presented in
Section 4.
Section 5 and
Section 6 provide descriptions of the HOG and the mask-regional convolutional neural network (Mask-RCNN).
Section 7 discusses the long short-term memory (LSTM) and Bi-LSTM architectures.
Section 8 describes the dataset, the metrics that were employed, the experimental findings, and comparisons to earlier models. The work’s conclusion is presented in
Section 9.
2. Literature Survey
Using Microsoft Kinect, Stone et al. [
15] employed a two-stage fall detection model for senior citizens. In the first stage, a person’s vertical position relative to the ground is identified using individual depth frames; in the second stage, time-series segmentation of this vertical position is performed. This approach is especially beneficial for older persons, as it produces superior results for actions such as standing, lying down, and sitting, and it outperformed conventional fall detection mechanisms.
In order to handle complicated behaviors and group dynamics in streaming sequences, Zhuang et al. [
16] proposed a method that combines differential recurrent convolutional neural networks (DRCNN) and stacked differential long short-term memory (DLSTM).
Human motions and actions are essential for understanding video content and human activities, according to Cheng et al. [17]. The suggested model focuses on interactions between humans, with a restricted number of individuals cooperating to achieve a common objective. Motion trajectories in this model were represented with a Gaussian model, which enhances recognition accuracy.
Social signal processing (SSP) is a novel perspective on automated human activity surveillance that combines psychological principles that are both affective and social, according to Cristani et al. [18]. This model relies on nonverbal cues typically used in socially aware systems, such as body gestures and posture, facial expression, vocal aspects, and gaze.
By leveraging the Kinect sensor, Yoon et al. [19] developed a procedure for computer vision applications that overcomes fundamental computer vision problems. This method consists of preprocessing, object tracking and recognition, human activity analysis, indoor 3D mapping, and hand gesture analysis.
Ling et al. [20] developed a prototype, using a gym as an example, to simultaneously distinguish different movements and actions. The prototype includes color-intensity-based segmentation of human gestures into temporal sequences and motion-based algorithms for efficient human action segmentation, making the shape and features of human actions apparent.
An optical flow-based approach to identify activity in footage was put forth by Shao et al. [21]. Using the optical flow approach, the monitored points of interest are sorted into several clusters with the k-means method. The displacements of each cluster projection are crucial in determining the direction, geometric location, and principal component of each cluster. These estimates reveal the cluster with the highest likelihood of encompassing high-activity video events.
Object segmentation in videos: there are two methods for segmenting video objects: unsupervised and semi-supervised. Semi-supervised segmentation of video objects [
22] emphasizes mask-based object segmentation and tracking. Temporal consistency, motion cues, and visual similarity are collected from the video to locate the related object [
23,
24,
25]. Segmentation is applied to a single foreground item in an unsupervised setting [
26,
27,
28]. The suggested algorithms ignore semantic categories in both circumstances and treat the target items as general objects. Using instance segmentation to recognize objects in videos is also becoming more popular.
Object detection in videos: detecting objects in video streams is referred to as video object detection. It was originally posed as a visual challenge in ImageNet [
29]. Object identity information is frequently employed to improve the robustness of detection methods [
30,
31,
32]. The evaluation metric is confined to per-frame detection; object tracking is not required.
Human detection using HOG: the research field of object detection is wide; however, we list only a few pertinent articles on person detection here. A polynomial support vector machine (SVM)-based person detector using rectified Haar wavelets as input descriptors is described by O Pinheiro et al. [
33], with a part (sub-window)-based variant in [
34]. Dalal and Triggs [
35] adopt a more direct technique, extracting edge images and using chamfer distance to compare them to a collection of learned exemplars. Dai et al. [
36] employed AdaBoost to train a series of increasingly complex region rejection rules based on space-time disparities and Haar-like wavelets, resulting in an efficient moving person detector. Dai et al. [
37] developed a parts-based technique with detectors for heads, faces, and front and side profiles of lower and upper body parts, combining binary-thresholded gradient magnitudes and orientation position histograms.
Instance segmentation: this separates pixels into semantic classes, after which it creates instances of objects [
38]. A common two-stage approach combines a region proposal network (RPN) that generates object proposals with the aggregation of region of interest (RoI) features to predict object classes, bounding boxes, and masks [
39,
40]. Many techniques, such as RCNN, use segment proposals to implement this approach. Bottom-up segments were employed in the previous approach [
41]. Subsequent works [
42,
43] offered segment candidates, which fast R-CNN classified. Dai et al. [
44] proposed a method that uses a sophisticated multi-stage cascade to predict segment proposals from bounding-box proposals, with classification as the final step. Ren et al. [
45] recently proposed a prototype for “Fully Convolutional Instance Segmentation (FCIS)” that integrated an object identification system and a segment proposal system. The simple principle [
46] is to predict fully convolutional output channels that are location sensitive. The detector proposed here, in contrast, has a simpler structural design with a single detection window, yet performs substantially better on pedestrian imagery.
RCNN: the region-based convolutional neural network (R-CNN) architecture, described in [
47], is used to determine the bounding box of an object and to handle a large number of potential object regions, as well as to evaluate convolutional networks separately on each RoI [
48,
49]. Region of interest pooling (RoIPool) is used by R-CNN to swiftly, accurately, and efficiently build RoIs on feature maps [
50]. For more reliable and adaptable subsequent enhancements, R-CNN employs an RPN with an attention mechanism.
Activity recognition in videos: the recurrent neural network (RNN) is a very effective and extensively used network architecture in sequential modeling applications such as human activity recognition in videos. The LSTM is an RNN-based network that is widely used for learning motion properties in video-based activity recognition [
51]. This can also be leveraged to mitigate gradient expansion and gradient vanishing difficulties during the training phase to some extent. Based on LSTM architecture, another study proposed a video as an ordered sequence using bidirectional LSTM architecture [
52]. Bi-LSTMs outperform unidirectional LSTMs in terms of prediction. Bi-LSTM architecture has been used in numerous video-related applications, such as video super-resolution, object segmentation in video, spatiotemporal feature learning for gesture identification, and fine-grained action detection. Long-term dependencies are well handled by Bi-LSTMs. Unlike LSTMs (which only use past data), Bi-LSTMs employ both past and future data when the entire time-series sequence is available, allowing the network to generate more accurate predictions. Bidirectional LSTMs have subsequently been used to predict frame-wise phoneme classification, network-wide traffic speed, and other variables [
53]. Only a few research papers in the field of activity recognition make proper use of the Bi-LSTM network. We present a novel model that uses HOG with Mask-RCNN architecture for edge detection and segmentation of humans in images, as well as Bi-LSTM architectures for learning spatiotemporal aspects of neighboring video frames.
4. Proposed Methodology
The proposed method is shown in
Figure 1, which takes a video stream as input and splits it into several frames. To reduce training and detection time, the preprocessing pipeline method is used. The rectangular region of interest is produced by this technique, which is based on Sobel edge detection [
54]. A smaller zone of interest is included in this extracted rectangular portion, which means there are fewer pixels to process, resulting in faster processing. Because we are interested in persons and their behavior, this enables us to apply a more straightforward preprocessing method that merely extracts the RoIs.
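For illustration, the snippet below is a minimal sketch of such a Sobel-based ROI extraction, assuming OpenCV and a single BGR frame; the edge threshold and the synthetic test frame are illustrative choices, not values reported in the paper.

```python
import cv2
import numpy as np

def extract_roi(frame, edge_thresh=50):
    """Crop a rectangular ROI around strong Sobel edges (illustrative threshold)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)   # horizontal gradient
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)   # vertical gradient
    magnitude = np.hypot(gx, gy)                      # edge strength per pixel
    mask = (magnitude > edge_thresh).astype(np.uint8)
    coords = cv2.findNonZero(mask)
    if coords is None:                                # no strong edges: keep the full frame
        return frame
    x, y, w, h = cv2.boundingRect(coords)             # tight rectangle around the edge pixels
    return frame[y:y + h, x:x + w]

# Synthetic frame with one bright region standing in for a person
frame = np.zeros((480, 640, 3), dtype=np.uint8)
cv2.rectangle(frame, (200, 150), (400, 360), (255, 255, 255), -1)
print(extract_roi(frame).shape)   # a crop smaller than the full 480 x 640 frame
```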
A multirotor quadcopter-mounted drone can employ the pretrained Mask-RCNN with HOG features for person detection in the captured images. The next stage is to extract features with HOG [
55]. This stage extracts an object’s features, which are subsequently communicated to the Mask-RCNN framework in order to increase prediction accuracy.
On established benchmarks, the suggested approach utilizes the Mask-RCNN network to produce state-of-the-art results for human detection. For object detection tasks, this architecture was trained on the MS-COCO dataset. Mask-RCNN has seldom reached comparable outcomes on aerial imagery because of its complicated nature, the diverse object scales, and the limited pool of annotated data; here, its characteristics are captured by the HOG technique. This study investigates object region proposal creation, pixel-based segmentation, RoI alignment, bounding box regression, and classification to recognize humans in UAV recordings. A SoftMax classifier is used to classify people among various objects, and RoIPool is used to derive features from the bounding boxes.
5. HOG Descriptor
Feature engineering is the branch of applied machine learning that involves extracting additional features from existing raw data to enhance the performance of a model. The histogram of oriented gradients (HOG) is a long-established technique for feature extraction. The following sections cover the foundations and functionality of the HOG feature representation.
The following are the guiding design principles for computer vision features:
5.1. Interpretation of Feature Descriptor
A feature descriptor is a concise summary of a frame that contains only the information essential to identifying its objects (such as the object’s shape, color, edges, background, and so on). HOG is among the most widely used feature descriptor algorithms, alongside the scale-invariant feature transform (SIFT), sped-up robust features (SURF), and others.
5.2. Principles of HOG
The HOG feature descriptor is commonly used in computer vision object detection tasks; it identifies patterns in image data and extracts them.
The following are some ways in which HOG differs from other feature descriptors. Its main priorities are the shape and structure of an object. HOG extracts the magnitude and direction (gradient and orientation) of the edges, determining for each pixel both the edge strength and the edge direction. The directions are computed in localized regions of a frame; that is, the frame is divided into a large number of smaller regions, and the magnitude and direction are analyzed for each of these regions.
HOG then builds a separate histogram for each of these regions. A “histogram of oriented gradients” is a histogram produced using the magnitudes and directions of the pixel values.
Finally, the fundamental principle of HOG is that it keeps count of the gradient orientations occurring in particular localized regions of a frame.
5.3. HOG Calculation
The input frame for identifying the HOG features with a resolution of 298 × 169 pixels is shown in
Figure 2.
5.3.1. Data Preprocessing (64 × 128)
Preprocessing data is a key stage in most machine learning studies, including those working with images. HOG preprocessing maintains a fixed, uniform aspect ratio for each image patch regardless of image size; in our case, the patches must have a 1:2 aspect ratio. They can, for example, be 200 × 400, 256 × 512, or 1000 × 2000, but not 106 × 220. In order to extract the features and make calculations easier,
Figure 3 shows how the frame is split into 8 × 8 and 16 × 16 patches, with HOG computed on a frame with a 1:2 width-to-height ratio. The patch sizes are based on the input image size and the required output feature vector length. Patches at various scales are typically analyzed and tested at multiple image locations; the only constraint is that the patches under consideration have a fixed aspect ratio.
5.3.2. Evaluation of Magnitudes (X and Y Direction)
In this stage, the directional change (gradient) along the X and Y axes is determined for every individual pixel in the frame. Consider a small portion of an image such as the one in
Figure 4a. The pixel matrix shown in
Figure 4b is a matrix that depicts the pixel values of the chosen patch.
The directional change (gradient/magnitude) for the highlighted pixel value 95 will now be computed for both the X and Y axes. Subtract the value of the left pixel from the value of the right pixel to determine a single pixel’s magnitude in the X-direction. Subtract the value of the bottom pixel from the value of the top pixel to obtain the magnitude in the Y-direction.
Hence, the magnitude in the X-direction is G_x = 90 − 82 = 8, and the magnitude in the Y-direction is G_y = 68 − 62 = 6.
These two values store the magnitudes in the X and Y directions, respectively. A Sobel kernel of size 1 implements the same methodology. The process is repeated for all pixels in the image. The difference in intensity along the edges is particularly sharp, resulting in a larger magnitude. The magnitude and orientation of the gradient are then determined from these measurements.
5.3.3. Evaluation of Magnitude and Direction
Use the Pythagoras theorem to determine the magnitude and orientation of each pixel value. Consider the right-angle triangle shown in
Figure 5.
In this figure, the gradients G_x and G_y (8 and 6 in our case) form the base and the perpendicular. According to the Pythagorean theorem, the following Equation (1) is used to calculate the total gradient magnitude:

G = √(G_x² + G_y²)  (1)

Hence, the total magnitude of the gradient is √(8² + 6²) = 10.
The direction (or orientation) of the same pixel must now be calculated. To do so, the following Equation (2) determines the angle using the arctangent:

θ = tan⁻¹(G_y / G_x)  (2)

Substituting the values above gives an orientation of 36.88° (≈37°). This method allows us to determine each pixel’s gradient magnitude and direction, and these values are used to build the histograms in the following step.
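As a sanity check, the following small Python snippet reproduces the numbers above for the highlighted pixel (G_x = 8, G_y = 6, magnitude 10, orientation ≈ 37°); the neighbour values are the ones quoted in the text.

```python
import math

def pixel_gradient(left, right, top, bottom):
    """Centred differences and the magnitude (Eq. 1) and orientation (Eq. 2) of one pixel."""
    gx = right - left                                   # gradient along X
    gy = top - bottom                                   # gradient along Y
    magnitude = math.hypot(gx, gy)                      # sqrt(gx**2 + gy**2)
    orientation = math.degrees(math.atan2(gy, gx))      # angle in degrees
    return gx, gy, magnitude, orientation

# Neighbour values of the highlighted pixel in Figure 4b: left 82, right 90, top 68, bottom 62
print(pixel_gradient(82, 90, 68, 62))   # -> (8, 6, 10.0, ~36.87)
```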
5.3.4. Evaluation of Histogram of Magnitudes in 8 × 8 Cells (9 × 1)
It is important to understand what a histogram is and how one is built from magnitudes and orientations before computing the histograms themselves.
- 1.
Construct histograms with magnitudes and orientations:
A histogram is a graphical illustration of how frequently each value range (bin) occurs for a given set of continuous data. We use this method to determine the orientation of each pixel and log the occurrences of these values in a 9 × 1 matrix of bins, as shown in
Figure 6. We use a bin size of 20° and a bin count of 9.
The image’s histogram must then be created as the next step. As illustrated in
Figure 7, partition the entire image into 8 × 8 cells and compute a HOG for each cell. As a result, each cell receives a 9 × 1 matrix, giving histograms for every smaller patch of the overall image. The cell size can be modified, for instance from 8 × 8 to 16 × 16 or 32 × 32, or vice versa. After this stage, the histograms must be normalized.
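A minimal sketch of the per-cell 9-bin histogram follows; the bin centres (10°, 30°, …, 170°) and the split of each magnitude vote between the two nearest centres are assumptions consistent with the 20° bin size stated above, not necessarily the paper’s exact interpolation scheme.

```python
import numpy as np

def cell_histogram(magnitude, orientation, bins=9, bin_size=20.0):
    """9-bin orientation histogram for one 8x8 cell (unsigned gradients, 0-180 degrees)."""
    hist = np.zeros(bins)
    centres = bin_size * (np.arange(bins) + 0.5)           # 10, 30, ..., 170 degrees
    for mag, ang in zip(magnitude.ravel(), orientation.ravel()):
        ang = ang % 180.0                                   # fold into the unsigned range
        j = int(ang // bin_size) % bins                     # bin whose range contains ang
        frac = (ang - centres[j]) / bin_size                # signed distance to its centre
        if frac >= 0:                                       # share the vote with the next bin
            hist[j] += mag * (1 - frac)
            hist[(j + 1) % bins] += mag * frac
        else:                                               # share the vote with the previous bin
            hist[j] += mag * (1 + frac)
            hist[(j - 1) % bins] += mag * (-frac)
    return hist

# Example: one 8x8 cell of random magnitudes and orientations
rng = np.random.default_rng(0)
print(cell_histogram(rng.random((8, 8)), rng.uniform(0, 180, (8, 8))))
```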
5.3.5. Normalization of Magnitudes for a 16 × 16 Cell (36 × 1)
For normalizing each block, Dalal and Triggs provided four potential strategies. Assume that ‖v‖_k is the k-norm of v for k = 1, 2, where v is the non-normalized vector containing all of a block’s histograms and ε is a small constant added to prevent a division-by-zero error. The L2-norm normalization factor is calculated using the following Equation (3):

L2-norm: f = v / √(‖v‖₂² + ε²)  (3)

L2-Hys: the L2-norm result is clipped and renormalized afterward; in this instance, the maximum values of v are restricted to 0.2.
When compared to non-normalized data, the four approaches discussed above perform significantly better. L2-Hys can be produced by measuring the L2-norm, clipping the outcome, and then re-normalizing. According to Dalal and Triggs, the efficiency of the schemes L2-Hys, L2-norm, and L1-sqrt (stated in Equation (5)) is comparable, while the performance of the remaining scheme, L1-norm (stated in Equation (4)), is noticeably poorer:

L1-norm: f = v / (‖v‖₁ + ε)  (4)

L1-sqrt: f = √(v / (‖v‖₁ + ε))  (5)
The range of pixel intensity values can be changed using a technique called normalization, such as histogram stretching. Normalization is necessary since the magnitudes, for a single image, are sensitive to contrast, brightness, and general illumination. This suggests that while certain areas of a picture are brilliant, others are not. We may not be able to obtain correct histograms as a result of these variances. Though this cannot be eradicated, by using 16 × 16 blocks and gradient normalization, we can greatly reduce the variances in the lighting. The following
Figure 8 illustrates how 16 × 16 blocks are produced:
Each 8 × 8 cell produces a 9 × 1 matrix, which is used to build a histogram, and joining four 8 × 8 cells produces a 16 × 16 block. Therefore, we can work with either one 36 × 1 matrix or four 9 × 1 matrices. To normalize this matrix, each of the retrieved values is divided by the square root of the sum of the squares of these vector values. Consider the mathematical representation of such a vector, u = (u₁, u₂, …, u₃₆).
Now, we use the following Equation (6) to determine the square root of the sum of the squares of the values in the vector above:

n = √(u₁² + u₂² + ⋯ + u₃₆²)  (6)
Finally, as shown in Equation (7), divide each of the vector’s values u by this number n to obtain the normalized vector with dimensions 36 × 1:

û = (u₁/n, u₂/n, …, u₃₆/n)  (7)
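A short sketch of this block-level normalization, assuming the four 9 × 1 cell histograms of a 16 × 16 block are already available; the small ε term is added only to avoid division by zero.

```python
import numpy as np

def normalize_block(cell_hists, eps=1e-6):
    """Concatenate four 9x1 cell histograms into a 36x1 block vector and L2-normalize it."""
    u = np.concatenate(cell_hists)              # shape (36,)
    n = np.sqrt(np.sum(u ** 2) + eps ** 2)      # Eq. (6): root of the sum of squares
    return u / n                                # Eq. (7): each value divided by n

# Four example 9-bin cell histograms forming one 16x16 block
block = [np.random.rand(9) for _ in range(4)]
v = normalize_block(block)
print(v.shape, np.linalg.norm(v))               # (36,) and a norm close to 1
```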
5.3.6. Produce HOG Features for Complete Image
The process of developing the histogram features for the entire image is complete at this stage. To generate features for the complete image, we must now merge the 16 × 16 chunks of the single image for which we previously created histogram features. A 64 × 128 image will need 105 (or 7 × 15) 16 × 16 blocks, as seen in
Figure 9. There will be a 36 × 1 feature vector per 105 blocks.
As a result, there are 105 × 36 × 1 = 3780 features in total. When the HOG features are built for an image, their count can be checked against this total (a library-based check is sketched after Algorithm 1). The following Algorithm 1 describes the detailed process of the HOG descriptor:
Algorithm 1 HOG Descriptor |
Input: Aerial videos dataset |
Output: HOG features of all frames |
Steps: |
1. Extract frames from each video in the dataset |
2. Data preprocessing: Resize all frames to a 1:2 ratio of height and width (i.e., 64 × 128) |
3. Calculate the gradients of each pixel in each block of the frame in the X and Y directions |
(a) |
(b) |
(c) |
4. Calculate the magnitude and angle (direction) of each pixel using Equations (1) and (2) |
5. Divide the gradients matrices into 8 × 8 cells to form a block to calculate a 9-point histogram for each block |
6. Let the number of bins and step size be |
(a) Number of bins = 9 (orientations between 0° and 180°) |
(b) Step size = 180°/Number of bins (i.e., 20°) |
For all values in a block calculate the following, |
For each bin, |
i. The bin boundaries |
ii. Each bin center value be |
ii. |
7. For each cell in the block, calculate and values and append them to the array at the index of and bin calculated for each bin |
(a) |
(b) |
(c) |
8. Let |
Normalize each block by L2-norm using Equation (3) |
9. Calculate the value of ‘n’ to normalize using Equation (6) and calculate normalized vector using Equation (7) where |
|
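The 3780-feature count can be cross-checked with an off-the-shelf implementation; the sketch below uses scikit-image’s hog function as a stand-in for Algorithm 1 (it is not the paper’s own code), with 9 bins, 8 × 8 cells, and 2 × 2-cell blocks.

```python
import numpy as np
from skimage.feature import hog

# Dummy grayscale patch of height 128 and width 64 (the 1:2 width-to-height ratio used above)
patch = np.random.rand(128, 64)

features = hog(
    patch,
    orientations=9,            # 9 bins of 20 degrees each
    pixels_per_cell=(8, 8),    # 8x8 cells
    cells_per_block=(2, 2),    # 16x16 blocks made of four cells
    block_norm='L2-Hys',       # clipped and renormalized L2 norm
    feature_vector=True,
)
print(features.shape)          # (3780,) = 105 blocks x 36 values per block
```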
6. Mask-RCNN
The implementation of instance segmentation is currently among the most difficult tasks in computer vision. The mask-region convolutional neural network (Mask-RCNN) is a deep neural network architecture designed to efficiently perform instance segmentation. It is a two-stage detector: the first stage proposes regions, and the second stage classifies those regions and refines the location prediction. In an image or video, it can identify several objects; when an image is provided as input, it outputs object masks, bounding boxes, and classes. It uses a fully convolutional network (FCN) to predict the mask for each class separately. The Mask-RCNN-based methodology is preferred for this model over single-shot object detectors such as the you only look once (YOLO) frameworks, which are better suited for real-time localization, because its maximum training input image size is 1024 × 1024 whereas YOLO takes 416 × 416 inputs. Since we use high-resolution images, this architecture helps segment humans more efficiently than the other frameworks.
According to [
56], object detection based on DCNNs as well as conventional traditional object detection (such as Oxford-MKL [
57], DPM [
58], NLPR-HOGLBP [
59], and selective search [
60]) are discussed. The essential distinction between the two is the revival of deep learning, which replaces handcrafted object detection features with learned ones. High detection accuracy is the primary benefit of two-stage detectors, and sluggish detection speed is their primary drawback. Examples of two-stage object detection architectures are RCNN [
47], SPPNet [
50], Fast RCNN [
34], Faster RCNN [
45], Mask RCNN [
40], and RFCN [
36]. Other designs are single-stage object detectors that use DCNNs to directly locate and classify objects without a separate proposal step. One-stage object detection directly generates the class probabilities and location coordinates of an object in a single stage; the region proposal step, which makes two-stage detection more complicated, is not necessary. The main benefit is fast detection, although a two-stage design typically provides higher detection accuracy. Examples of one-stage object detection include OverFeat [
61], YOLO series [
62,
63,
64], SSD [
65], DSSD [
66], FSSD [
67], and DSOD [
68].
In Mask-RCNN, there are two basic implementation steps. In the first step, the region proposal network (RPN) suggests object bounding boxes using the input image as a starting point. Based on the first stage’s proposals, the second step determines the object’s class, refines the bounding box, and generates a pixel-level mask for the object. Both stages are connected by a backbone framework, a feature pyramid network (FPN)-style deep neural network, which comprises the three components listed below.
Bottom-up pathway: it retrieves features from the original frame in a bottom-up fashion. Any convolutional neural network (ConvNet), including visual geometry group network (VGG-net) [
69] and residual network (ResNet) [
70], can be used.
Top-bottom pathway: this leads to a feature pyramid map with the same size as the previous pathway.
Lateral connections: these are convolution and addition operations between the corresponding levels of the two pathways; their primary objective is to enhance the features exchanged between the different levels of the two paths.
RPN, a compact neural network, first scans all top-down FPN levels (also termed feature maps) and proposes regions that may contain objects. A technique is needed to link newly found features to their raw image positions when examining the feature map; this is where anchors come in. Anchors are a set of bounding boxes with predefined scales and positions, defined without regard to the image’s content. Individual anchors are assigned to ground-truth bounding boxes and labeled as object or background according to their intersection over union (IoU) with the ground truth. RPN employs anchors of various scales associated with different feature map layers to locate an object on a feature map and determine the size of its bounding box. To maintain the feature’s position with respect to the object in the original image, convolution, downsampling, and upsampling are used. The algorithmic implementation of Mask-RCNN is represented in the following Algorithm 2 (a sketch of its box-geometry computations is given after the algorithm).
Algorithm 2 Procedure for instance segmentation using Mask-RCNN |
Input: Dataset of images with histograms |
Output: Bounding box, mask, class, and score |
Steps: |
1. For each image repeat the following steps 2 to 14 |
2. Let upper-left coordinates and lower-right coordinates of predicted and ground truth bounding boxes be |
(a) |
(b) |
3. requires to meet and |
(a) , , |
(b) , |
4. Area of |
5. Area of |
6. Intersection I between and |
(a) |
(b) |
(c) |
7. Locating the small enclosed box’s coordinates : |
(a) |
(b) |
8. Determine the area of |
9. Determining i and j’s center coordinates |
(a) |
(b) |
(c) |
10. Calculating the distance between centers: |
(a) |
11. where |
12. Perform the non-maximal-suppression Algorithm 3 to choose the highest scoring bounding box |
13. Calculate loss function using classification loss, bounding box loss and mask loss by Equations (8)–(11) |
14. For each RoI, create a mask, class label, bounding box, and score |
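A compact sketch of the geometric quantities in Algorithm 2 (areas, intersection, IoU, and the distance between box centres), assuming the usual (x1, y1, x2, y2) upper-left/lower-right box convention; it is illustrative and omits the loss and mask steps.

```python
import math

def box_metrics(p, g):
    """IoU and centre distance between a predicted box p and a ground-truth box g,
    each given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    px1, py1, px2, py2 = p
    gx1, gy1, gx2, gy2 = g
    # Areas of the two boxes (steps 4-5)
    area_p = (px2 - px1) * (py2 - py1)
    area_g = (gx2 - gx1) * (gy2 - gy1)
    # Intersection rectangle and IoU (step 6)
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    iou = inter / (area_p + area_g - inter)
    # Box centres and the distance between them (steps 9-10)
    pc = ((px1 + px2) / 2, (py1 + py2) / 2)
    gc = ((gx1 + gx2) / 2, (gy1 + gy2) / 2)
    dist = math.hypot(pc[0] - gc[0], pc[1] - gc[1])
    return iou, dist

print(box_metrics((10, 10, 60, 110), (20, 15, 70, 120)))
```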
In the second stage, other neural networks take into account the suggested regions created in the first stage. They propagate to various feature map level areas, scan these areas, and generate multi-category classified object classes, bounding boxes, and masks. This is similar to RPN, except instead of anchors, RoIAlign is used to determine the relative areas of the feature map, and a branch is used to generate masks for each item at the pixel level. The most important feature of Mask-RCNN is the ability to instruct the neural network’s various layers to learn features with different scales, such as RoIAlign and anchors.
6.1. Backbone
As a feature extractor, a standard CNN is used as the backbone. In this work, ResNet101 detects each successive layer’s low- and high-level features. The image is transformed from 640 × 480 × 3 (RGB) to a 32 × 32 × 2048 feature map as the data travels through the backbone network. This serves as the starting point for the subsequent steps.
6.2. Feature Pyramid Network (FPN)
Mask-RCNN’s FPN is an extension that can represent objects at many scales. To improve conventional feature extraction, it introduces a second pyramid that passes the top-level features from the first pyramid down to the lower layers. With this approach, features at each level have access to both low-level and high-level information.
6.3. Region Proposal Network (RPN)
In a sliding-window fashion, this compact neural network scans the feature maps for regions (anchors) that may contain objects. We choose the top anchors predicted to contain objects and then refine their size and location. The non-max suppression (NMS) technique defined in Algorithm 3 (and sketched in code after it) keeps the anchor with the highest foreground score among heavily overlapping anchors and rejects the others. The final RoIs are then passed to the subsequent stage.
Algorithm 3 A non-maximal suppression Algorithm (NMS) |
Input: |
a list of boxes, their scores, and the IoU threshold T |
(For example, T = 0.5) |
M: max selected boxes |
Output: |
a group of bounding boxes that have been checked off |
Algorithm: Steps to calculate NMS |
1. Arrange the bounding boxes based on their score |
2. Repeat till there are no more boxes present: |
(a) Select the box with the best score. Name it b_best. |
(b) Remove the remaining boxes b with |
IoU(b, b_best) ≥ T. |
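A direct, self-contained sketch of Algorithm 3 follows: boxes are sorted by score, the best box is kept, and any remaining box whose IoU with it reaches the threshold T is discarded.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) form."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, t=0.5, max_boxes=100):
    """Keep the highest-scoring boxes, dropping any box whose IoU with a kept box >= t."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    while order and len(kept) < max_boxes:
        best = order.pop(0)                 # box with the best remaining score
        kept.append(best)
        order = [i for i in order if iou(boxes[i], boxes[best]) < t]
    return kept

boxes = [(10, 10, 60, 110), (12, 14, 62, 112), (200, 40, 260, 160)]
print(nms(boxes, scores=[0.9, 0.8, 0.7]))   # -> [0, 2]
```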
6.4. Bounding Box Regressor and RoI Classifier
Each RoI received from the RPN generates two outputs in this step: a class assignment, which the network uses to classify the region into a particular category, and a bounding box refinement, which, similarly to the RPN, refines the position and dimensions of the bounding box.
6.5. RoI Pooling
This step crops and resizes the area of the feature map corresponding to each refined bounding box. This is accomplished by applying bilinear interpolation at sampling points of the feature map, a procedure known as region of interest align (RoIAlign); TensorFlow’s crop-and-resize feature is used to perform this.
6.6. Segmentation Masks
This convolutional network takes the positive regions chosen by the RoI classifier as input and generates masks for them. The generated masks have a low resolution of only 28 × 28 pixels. During the training phase, the ground-truth masks are scaled down to 28 × 28 to calculate the loss, and during the inference phase the predicted masks are scaled up to the size of the RoI bounding box to produce the final masks.
6.6.1. Loss Functions
In Mask-RCNN, a multi-task loss function that incorporates classification, localization, and segmentation losses is used for each sampled RoI, as shown in Equation (8):

L = L_cls + L_box + L_mask  (8)

where:
L_cls = classification loss;
L_box = bounding-box regression loss;
L_mask = mask loss.
The total loss is obtained by taking the average of all losses across all samples.
6.6.2. Classification Loss
The RoI classification loss L_cls is a logarithmic loss determined using Equation (9):

L_cls(p, s) = −log p_s  (9)

where:
s = the RoI’s true class;
p = (p_0, …, p_k): the predicted probability distribution over the k + 1 classes.
6.6.3. Bounding-Box Regression Loss
Using Equation (10), the RoI bounding-box regression loss L_box is determined as follows:

L_box(t, t^s) = Σ_{i ∈ {x, y, w, h}} smooth_L1(t_i^s − t_i)  (10)

where:
s = the RoI’s true class;
t = the regression targets of the actual bounding box;
t^s = the bounding-box regression predicted for class s.
6.6.4. The Mask Loss
The RoI mask loss L_mask is the mean binary cross-entropy loss, calculated using Equation (11):

L_mask = −(1/m²) Σ_{1 ≤ i,j ≤ m} [y_{ij} log ŷ_{ij}^s + (1 − y_{ij}) log(1 − ŷ_{ij}^s)]  (11)

where:
s = the RoI’s true class;
y, ŷ^s = the true mask of the RoI and the mask predicted for class s, respectively.
Each RoI is associated with an m × m mask, where the entries of the actual mask give the ground-truth labels and the entries of the predicted mask give the expected values.
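A per-RoI sketch of Equations (8)–(11) follows, with a smooth-L1 box term and a mean binary cross-entropy mask term; the toy inputs (3 classes, 28 × 28 masks) are placeholders, not values from the paper.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 penalty used for box regression."""
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)

def roi_loss(p, s, t_pred_s, t_true, mask_pred_s, mask_true, eps=1e-7):
    """Per-RoI multi-task loss (Eq. 8) = L_cls (Eq. 9) + L_box (Eq. 10) + L_mask (Eq. 11)."""
    l_cls = -np.log(p[s] + eps)                              # log loss on the true class s
    l_box = np.sum(smooth_l1(t_pred_s - t_true))             # regression over (x, y, w, h)
    mp = np.clip(mask_pred_s, eps, 1 - eps)
    l_mask = -np.mean(mask_true * np.log(mp) +               # mean binary cross-entropy
                      (1 - mask_true) * np.log(1 - mp))
    return l_cls + l_box + l_mask

# Toy example: 3 classes, true class 1, 28x28 masks
p = np.array([0.1, 0.8, 0.1])
loss = roi_loss(p, 1, np.array([0.1, 0.2, 0.05, -0.1]), np.zeros(4),
                np.full((28, 28), 0.7), np.ones((28, 28)))
print(loss)
```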
6.7. Training Phase
The total number of images used for training, validation, and testing is 29,979 which is shown in
Table 1. After initializing our model with weights from the MS-COCO dataset, we trained the network. Out of 29,979 total images, we used 17,987 training images and 2998 validation images to train the model leveraging stochastic gradient descent. The hyperparameters used for the model’s implementation are an input image size of 224 × 224, the Adam optimizer, a learning rate of 0.001, a batch size of 128, a categorical cross-entropy loss, 20 epochs, and 100 training steps per epoch. These are used to fine-tune the Mask-RCNN, which is pretrained on MS-COCO weights, and the Bi-LSTM model; a configuration sketch is given below.
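One way to wire up the stated hyperparameters in Keras is sketched below for the Bi-LSTM classifier; the sequence length, feature dimension, class count, and random data are placeholders, not details taken from the paper.

```python
import numpy as np
import tensorflow as tf

# Toy stand-ins for the real data: 512 sequences of 16 per-frame feature vectors
num_classes = 8    # placeholder class count, not taken from the paper
x = np.random.rand(512, 16, 128).astype("float32")
y = tf.keras.utils.to_categorical(np.random.randint(num_classes, size=512), num_classes)

inputs = tf.keras.Input(shape=(16, 128))
h = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))(inputs)
outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(h)
model = tf.keras.Model(inputs, outputs)

# Hyperparameters as stated in the text: Adam, learning rate 0.001, batch size 128,
# categorical cross-entropy, 20 epochs, 100 training steps per epoch
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])
ds = tf.data.Dataset.from_tensor_slices((x, y)).shuffle(512).batch(128).repeat()
model.fit(ds, epochs=20, steps_per_epoch=100)
```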
6.8. Testing Phase
In the testing phase, we used 8994 images to test the trained model. In this testing data, each UAV image has a class label, masked segment, and bounding box that are predicted using the trained model. The predicted bounding boxes and labels should correspond to those in the dataset to evaluate the performance of the trained model for human activity recognition.
7. Bidirectional Long Short Term Memory (Bi-LSTM)
Bidirectional long short-term memory, often known as Bi-LSTM, is an LSTM model extension. Bi-LSTMs, unlike baseline LSTMs (which train a model in a single direction, i.e., forward, and only use past data), train a model in two ways, forward and backward, as seen in the following
Figure 10. It uses two LSTMs, one for the forward process and another for the backward process. The following section provides a detailed explanation of how LSTM works. When the complete set of time-series data is accessible, the model learns a sequence of inputs from past to future in the forward direction, and from future to past in the backward direction. Because it executes processing in both directions, the calculation of the output frame at timestamp ‘
t’ is dependent on the previous frame at time ‘t − 1’ and the next frame at time ‘t + 1’.
To preserve past and future information, this method employs two hidden states, one for the forward pass and the other for the backward pass. These states are integrated to allow the network to produce more accurate predictions, a step known as merging. Merging can be accomplished using the sum, average, multiplication, or concatenation functions; concatenation is the default technique. The following Algorithm 4 describes the Bi-LSTM procedure for human activity recognition, and a short Keras sketch of these merge options is given after it.
Algorithm 4 Procedure for Bi-LSTM model |
Input: |
Input layers count |
Hidden layers count |
Output layers count |
Data set instances count |
Output: |
Weights are associated with all of the inputs from all layers |
Steps: |
1. Forward pass: |
Run every input value for a single slice with 1 < = t < = T and determine all predicted results. |
(a) for i = 1 to |
(b) for j = 1 to calculating the forward pass for the forward hidden layer’s activation function using the Equation (19) (from t = 1 to t = T) |
(c) end for |
(d) for j = to 1 calculating the backward pass for the backward hidden layer’s activation function using the Equation (19) (from t = T to t = 1) |
(e) end for |
(f) end for |
(g) for i = 1 to calculating the forward pass for the output layer using the previously stored activations using the Equation (20) |
(h) end for |
2. Backward pass: |
Calculate the portion of the objective function derivative for the forward-pass time slice with 1 <= t <= T. |
(a) for i = to 1 calculating the backward pass for the output layer using the previously stored activations using the Equation (20) |
(b) end for |
(c) for i = 1 to |
(d) for j = 1 to calculating the backward pass for the forward hidden layer’s activation function using the Equation (19) (from t = T to t = 1) |
(e) end for |
(f) for j = to 1 calculating the forward pass for the backward hidden layer’s activation function using the Equation (19) (from t = 1 to t = T) |
(g) end for |
(h) end for |
3. Update the weights of the network using each pass Equation (16). |
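As noted before Algorithm 4, the forward and backward hidden states can be merged in several ways; the short sketch below shows how these map onto Keras’ merge_mode options for the Bidirectional wrapper (the input shape is illustrative).

```python
import tensorflow as tf

# Input: sequences of 16 timesteps with 128 features each (illustrative shape)
inputs = tf.keras.Input(shape=(16, 128))
for mode in ("sum", "ave", "mul", "concat"):   # 'concat' is the Keras default
    merged = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64), merge_mode=mode)(inputs)
    print(mode, merged.shape)                  # 'concat' yields 128 units, the others 64
```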
7.1. LSTM Architecture
LSTM functions similarly to RNN, but it has one essential feature that distinguishes it from RNN: it saves information for future cell processing. The three gates of an LSTM cell are the forget gate, input gate, and output gate. The internal process of an LSTM cell is shown in
Figure 11.
It has a memory pool with two key state vectors.
- 1.
Short-term state: a hidden state is also known as a short-term state; the output at the current time step is kept in this state. h_{t−1} represents the preceding timestamp’s short-term state, while h_t represents the current timestamp’s.
- 2.
Long-term state (C_t): a cell state is another name for the long-term state. This state retains or rejects data as it passes through the network and is intended for long-term storage. C_{t−1} represents the preceding timestamp’s long-term state, while C_t represents the current timestamp’s.
All timestamps and information are included in the cell state. The decision to read, write, or store is based on activation functions whose outputs lie between (0, 1), as shown in
Figure 11.
7.2. Forget Gate
This is the first gate in the LSTM cell. It determines whether the previous timestamp’s information should be retained or ignored. The forget gate, Equation (12), is as follows:

f_t = σ(x_t W_f + h_{t−1} U_f)  (12)

where:
x_t = the current timestamp t input;
W_f = the input’s weight matrix;
h_{t−1} = the previous timestamp’s short-term (hidden) state;
U_f = the short-term state’s weight matrix.
The sigmoid activation function is then applied, yielding a value of f_t between (0, 1). The previous timestamp’s long-term state is then multiplied by it, as indicated in the two limiting cases given by Equations (13) and (14):

C_{t−1} × f_t = 0, if f_t = 0  (13)

C_{t−1} × f_t = C_{t−1}, if f_t = 1  (14)
If the value is 0, everything is forgotten; if it is 1, nothing is forgotten.
7.3. Input Gate
This is used to manage the flow of input values into the cell and to quantify the importance of the most recent information. The following Equation (15) is the input gate equation:

i_t = σ(x_t W_i + h_{t−1} U_i)  (15)

where:
x_t = the current timestamp t input;
W_i = the matrix of input weights;
h_{t−1} = the previous timestamp’s short-term (hidden) state;
U_i = the weight matrix for the short-term state.
Applying the sigmoid activation function yields the value of i_t at timestamp ‘t’, which lies between (0, 1).
Latest information (or new information):
This most recent information, given by Equation (16), is a function of the short-term state at timestamp ‘t − 1’ and the input x_t at timestamp ‘t’. It is needed to update the long-term state. After the tanh activation function is applied, the value of the most recent information falls between (−1, 1):

C̃_t = tanh(x_t W_c + h_{t−1} U_c)  (16)
This information is removed from the long-term state if C̃_t is negative, and it is added to the long-term state at the present timestamp if C̃_t is positive. The following Equation (17) updates the long-term state to include C̃_t:

C_t = f_t × C_{t−1} + i_t × C̃_t  (17)

where C_t represents the long-term state at the current timestamp and the other terms are the previously determined values.
7.4. Output Gate
This is utilized to determine how the current internal long-term state contributes to the next short-term state and to govern the output activation of the LSTM unit. The output gate, Equation (18), is as follows:

o_t = σ(x_t W_o + h_{t−1} U_o)  (18)
This formula is comparable to the forget and input gates. Applying the sigmoid activation function yields an output value o_t between 0 and 1. The current short-term state h_t is then computed from o_t and the tanh of the revised long-term state using the following Equation (19):

h_t = o_t × tanh(C_t)  (19)
That is, the short-term state is a function of the current output gate and the tanh of the long-term state. Then, using the following Equation (20), the SoftMax activation function is applied to the short-term state h_t to obtain the output of the current timestamp; the prediction is the class with the highest score in the output:

Output_t = SoftMax(h_t)  (20)
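A NumPy sketch of one LSTM timestep following Equations (12)–(19), with the SoftMax output of Equation (20) applied to the resulting short-term state; the weight shapes and toy dimensions are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM timestep following Eqs. (12)-(19).

    W, U, b hold per-gate parameters keyed by 'f', 'i', 'c', 'o';
    shapes: W[k] is (input_dim, hidden), U[k] is (hidden, hidden), b[k] is (hidden,).
    """
    f_t = sigmoid(x_t @ W['f'] + h_prev @ U['f'] + b['f'])    # forget gate, Eq. (12)
    i_t = sigmoid(x_t @ W['i'] + h_prev @ U['i'] + b['i'])    # input gate, Eq. (15)
    c_hat = np.tanh(x_t @ W['c'] + h_prev @ U['c'] + b['c'])  # candidate info, Eq. (16)
    c_t = f_t * c_prev + i_t * c_hat                          # long-term state, Eq. (17)
    o_t = sigmoid(x_t @ W['o'] + h_prev @ U['o'] + b['o'])    # output gate, Eq. (18)
    h_t = o_t * np.tanh(c_t)                                  # short-term state, Eq. (19)
    return h_t, c_t

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy dimensions: 4 input features, 3 hidden units
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(4, 3)) for k in 'fico'}
U = {k: rng.normal(size=(3, 3)) for k in 'fico'}
b = {k: np.zeros(3) for k in 'fico'}
h, c = lstm_step(rng.normal(size=4), np.zeros(3), np.zeros(3), W, U, b)
print(softmax(h))   # Eq. (20): class scores from the current short-term state
```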