1. Introduction
In China’s agricultural industry, fruit cultivation ranks among the top three sectors, with orchards distributed mainly across plains and hilly mountainous regions. Statistics show that China’s citrus planting area covers 45.5031 million hectares and produces 60.0389 million tons, ranking first globally. With continued agricultural development, both the planting area and yield of citrus have increased annually. Manual harvesting accounts for more than 30% of the total cost of citrus cultivation, so reducing harvesting costs is crucial for improving planting efficiency and promoting the healthy and vigorous development of the citrus industry. However, agricultural machinery production and research in hilly mountainous areas are still at an early stage and are constrained by the terrain, hindering the deployment of agricultural mechanization and automation. With advances in science and technology, traditional agricultural production equipment can no longer meet the demands of modern agricultural production; consequently, the development of intelligent agricultural machinery has become a key focus area, and research on it is steadily increasing. Achieving automatic harvesting in orchards first requires solving the problems of visual recognition and localization [
1]. Currently, fruit detection methods are divided into two categories: deep learning methods and non-deep learning methods [
2].
Among non-deep learning methods for fruit detection, Chaivivatrakul et al. [
3] proposed a texture-based detection pipeline comprising feature extraction, feature classification, fruit point localization, morphological closing, and region extraction, achieving accuracies of 85% for pineapples and 100% for bitter gourds. Fu et al. [
4] researched various RGB-D sensors for fruit detection, integrating two or more types of image features to achieve object detection. Lin et al. [
5] proposed a method that performs probabilistic segmentation of red–green–blue images, applies the resulting masks to the original images to obtain filtered depth images, and then clusters the depth images, achieving detection accuracies of 86.4% for peppers and 88.6% for eggplants. Bulanon et al. [
6] introduced a method that uses fuzzy logic to merge thermal and visible images of orange tree crowns, improving fruit detection accuracy.
In recent years, deep learning has been extensively applied in computer vision and has driven technological advancements, as well as innovative applications, in many fields [
7]. Object detection algorithms have evolved with the rise of convolutional neural networks in computer vision [
8] and can be categorized into single- and two-stage methods. Single-stage object detection algorithms directly predict target bounding boxes and categories in one processing step, offering simpler algorithmic structures and faster inference speeds [
9]; representative algorithms include YOLOv3 [
10], SSD [
11], and RetinaNet [
12]. Strategies for improving single-stage detection algorithms mostly focus on enhancing feature extraction by replacing backbone networks, yet these methods still face challenges in detecting targets with small exposed areas in practical applications [
13]. Two-stage object detection algorithms first generate region proposals and then classify and regress within the proposed regions. However, this approach can lose the spatial relationships of local targets within the overall image and lacks end-to-end training, leading to fragmented training processes and larger parameter counts, which severely limit inference speed. Representative algorithms in this category include Fast R-CNN [
14], Faster R-CNN [
15], and Mask R-CNN [
16]. Deep learning object detection is widely applied in fields such as autonomous driving, medical image analysis, drone surveillance, and automated harvesting.
Scholars around the world have conducted extensive research on fruit recognition and detection using both two- and single-stage detection methods: Wang et al. [
17] fine-tuned pre-scan frames, designed a dual NMS algorithm to address occlusion and overlap issues, and removed redundant rectangular boxes to ensure high detection accuracy and low miss rates. Gai et al. [
18] integrated DenseNet inter-layer density into CSPDarkNet53 and changed anchor boxes to circular marking boxes suitable for target shapes, resulting in improved detection speed and accuracy for small targets in an enhanced YOLOv4 network. Nan et al. [
19] proposed GF-SPP using average and global average pooling, treating features obtained from global average pooling as independent channels to enhance both the average and maximum pooling features, simultaneously achieving multi-scale feature fusion and enhancement, demonstrating good performance in dragon fruit detection with the improved WGB-YOLO. Wang et al. [
20] reduced detection model parameters and enhanced detection efficiency using channel pruning algorithms to trim the YOLOv5 model, efficiently detecting apple fruits and aiding orchard management optimization for growers. Bai et al. [
21] proposed a real-time strawberry seedling detection YOLO algorithm, integrating Swin Transformer prediction heads on YOLO v7’s high-resolution feature map to enhance spatial location information utilization, thereby improving the detection accuracy for small target flowers and fruits in complex scenes of similar colors and occlusion overlaps. Chen et al. [
22] applied the YOLOv4 model to propose a Yangmei tree detection method based on drone images, using Leaky_ReLU activation functions to accelerate feature extraction speeds and retain the most accurate prediction boxes through DIoU NMS. Zhong et al. [
23] transformed the CBS module in the backbone network into a multi-branch large-kernel downsampling module to strengthen the receptive field of the network, achieving a mAP of 64.0% on a mango dataset. Zhao et al. [
24] improved YOLOv5 by using the lightweight ShuffleNetv2 network as the backbone and adding an attention mechanism in the convolution module to optimize target detection accuracy, which not only enhanced the detection accuracy of pomegranates but also increased detection speed. Lawal [
25] added PANet in the Neck to establish an accurate and fast algorithm for gourd fruit detection that can handle leaf occlusion and fruit overlap. Sun et al. [
26] replaced the backbone network of YOLOv5 with the lightweight GhostNet model, which effectively recognized passion fruits in complex environments and improved recognition speed, providing technical support for orchard-picking robots.
Overall, non-deep learning methods for fruit detection tend to have low accuracy, while two-stage deep learning methods are often too large in terms of parameters, resulting in slower inference speeds, making them unsuitable for automated harvesting machines. Additionally, the complexity of YOLOv7 and YOLOv8 has significantly increased, which places higher demands on hardware, making them less practical for agricultural applications. Considering the environmental factors in this study, such as fluctuating lighting conditions and the occlusion and overlap between branches and fruits, along with the need to balance model size with device limitations, we chose YOLOv5 as the base network. YOLOv5 offers faster speeds, moderate model size, and satisfactory accuracy. To further address the challenges posed by the complex orchard environment and ensure practical deployment, we applied the concept of weighted convolutions to share feature parameters, enabling more efficient information capture and improving self-attention mechanisms [
27,
28,
29]. By strengthening the relationships between input vectors, we enhanced the feature extraction for small targets. Ultimately, this approach aims to develop a highly accurate and robust object detection model tailored to the current environment. The main contributions of this paper are as follows:
- (1)
In traditional convolutional layers, information loss often occurs after multiple layers, especially in the extraction of fine-grained features critical for small object detection. To address this, we replaced 2D convolutions with receptive field convolutions with full 3D weights (RFCF) [
30], which assign a full set of 3D weights across the receptive field, increasing the network’s ability to capture spatial information over a larger area. Unlike standard convolutions, which may lose spatial relationships, RFCF retains and emphasizes the importance of features within the receptive field by distributing weights more effectively across three dimensions, enhancing the model’s ability to detect small and occluded objects such as citrus fruits in complex environments. This approach also improves parameter sharing and reduces redundant feature loss.
- (2)
Softmax attention is computationally expensive, while plain linear attention, though cheaper, tends to lose focus on the most relevant features. To mitigate these issues, we introduced the focused linear attention (FLA) module [
31], which uses mapping functions that simplify attention calculations and recover feature information more efficiently. By balancing the trade-off between focusing on critical areas of the input (such as small, overlapping fruits) and maintaining feature diversity, FLA enhances the network’s ability to distinguish between objects in cluttered scenes while keeping computational costs low. Specifically, this module applies a focused attention mechanism that prioritizes the most relevant parts of the image, allowing for a more precise and diverse feature extraction while still maintaining a high processing speed, which is crucial for real-time agricultural applications.
- (3)
In YOLO-based networks, the placement and sizing of anchor boxes are critical for effective object detection, particularly when dealing with objects of various sizes. To optimize the anchor boxes for our dataset, we applied K-means++ clustering [
32], which provides more appropriate anchor sizes based on the distribution of object sizes in our training data. This technique ensures that the anchor boxes are better suited to detecting citrus fruits of varying sizes in orchards. Additionally, we improved the bounding box loss function by adopting Focal-EIoU [
33] (Focal and Efficient IoU), which places greater emphasis on the overlap area of bounding boxes and improves localization accuracy. This modification enhances the model’s ability to accurately predict object boundaries, especially in scenarios involving occlusions or overlapping objects.
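To make contribution (2) more concrete, the sketch below implements the core of a focused linear attention computation in NumPy. The "focusing" kernel and the power p = 3 follow the general idea of focused linear attention [31], but the function names, shapes, and constants here are our illustrative choices, not the exact implementation used in the network:

```python
import numpy as np

def focused_map(x, p=3, eps=1e-6):
    """Illustrative 'focusing' kernel: sharpen the feature direction by
    raising it to a power while preserving the original norm."""
    x = np.maximum(x, 0) + eps                        # keep features non-negative
    xp = x ** p
    return (xp / np.linalg.norm(xp, axis=-1, keepdims=True)
            * np.linalg.norm(x, axis=-1, keepdims=True))

def focused_linear_attention(q, k, v, p=3, eps=1e-6):
    """Linear attention with the focused kernel for (seq_len, dim) inputs.
    Cost is O(N * d^2) in sequence length N, not O(N^2 * d) as in softmax
    attention, because keys and values are summarized once."""
    q, k = focused_map(q, p), focused_map(k, p)
    kv = k.T @ v                                      # (d, d) summary, linear in N
    z = 1.0 / (q @ k.sum(axis=0) + eps)               # per-query normalizer
    return (q @ kv) * z[:, None]
```

Because the kernelized similarities are non-negative, each output row is a normalized mixture of the value rows, so the mechanism keeps the convex-combination character of attention while avoiding the quadratic cost.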
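For contribution (3), anchor reclustering can be sketched as K-means++ seeding followed by Lloyd iterations, using 1 − IoU between (width, height) pairs as the distance, as is common for YOLO-style anchors. The helper names and parameters below are our own illustration, not the paper's code:

```python
import numpy as np

def iou_wh(box, anchors):
    """IoU between one (w, h) pair and an array of (w, h) anchors,
    assuming all boxes share a common top-left corner."""
    inter = np.minimum(box[0], anchors[:, 0]) * np.minimum(box[1], anchors[:, 1])
    union = box[0] * box[1] + anchors[:, 0] * anchors[:, 1] - inter
    return inter / union

def kmeans_pp_anchors(boxes, k=9, iters=100, seed=0):
    """Cluster (w, h) pairs with K-means++ seeding and 1 - IoU distance."""
    rng = np.random.default_rng(seed)
    centres = boxes[rng.integers(len(boxes))][None]          # first centre
    while len(centres) < k:                                  # ++ seeding step
        d2 = np.array([np.min(1 - iou_wh(b, centres)) ** 2 for b in boxes])
        centres = np.vstack([centres,
                             boxes[rng.choice(len(boxes), p=d2 / d2.sum())]])
    for _ in range(iters):                                   # Lloyd iterations
        assign = np.array([np.argmax(iou_wh(b, centres)) for b in boxes])
        new = np.array([boxes[assign == j].mean(axis=0) if np.any(assign == j)
                        else centres[j] for j in range(k)])
        if np.allclose(new, centres):
            break
        centres = new
    return centres[np.argsort(centres.prod(axis=1))]         # small to large
```

Sorting the resulting anchors by area matches the YOLO convention of assigning small anchors to high-resolution detection heads.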
4. Discussion
In recent years, the YOLO series models have been widely applied to various object detection tasks, including applications in agriculture such as fruit recognition, ripeness detection, and pest identification. The quality and suitability of datasets for specific tasks are crucial. This paper aimed to address the issues of missed detections and false positives caused by environmental factors like changes in natural lighting, leaf occlusion, and fruit overlap from a detection perspective. To mitigate these issues, a citrus fruit dataset for hilly and mountainous regions was constructed under different weather conditions and shooting angles. However, embedding the trained weights into a mobile picking robot requires balancing detection accuracy with model size, which is challenging. Our model improvements focus on the following aspects:
(1) Inspired by the modifications proposed by Chen et al. [
47], we applied K-means++ clustering to recluster the anchor boxes to better fit the target objects in our study. This aligns the predicted box sizes more closely with the actual detection samples, effectively improving final recognition accuracy and the precise localization of small objects without increasing the difficulty or time required for network training.
(2) Compared to the citrus fruit detection algorithm by Lin et al. [
48], our focus is on detecting fruits across the entire tree rather than a single cluster. To improve parameter sharing, we replaced the original convolutions with receptive field convolutions with full 3D weights (RFCF). This modification effectively reduces the impact of leaf occlusion and fruit overlap. Additionally, we introduced a focused linear attention (FLA) module to enhance recognition accuracy within smaller receptive fields, thereby improving detection precision for small target fruits.
(3) In contrast to the real-time orchard citrus detection algorithm by Deng et al. [
49], we optimized our model using the Focal-EIoU loss function, which helps avoid undesirable optimization directions during network training. Moreover, we introduced an additional regularization term
within this loss function to mitigate issues related to smaller datasets and environmental noise.
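Assuming the standard Focal-EIoU formulation of Zhang et al. (the EIoU penalty terms reweighted by IoU^γ), a minimal single-pair sketch might look like the following. The coordinate convention (x1, y1, x2, y2), the default gamma, and the function name are illustrative, and the additional regularization term mentioned above is omitted:

```python
def focal_eiou(box, gt, gamma=0.5, eps=1e-7):
    """Single-pair Focal-EIoU sketch for (x1, y1, x2, y2) boxes:
    loss = IoU**gamma * (1 - IoU + centre, width, and height penalties)."""
    # intersection and IoU
    xi1, yi1 = max(box[0], gt[0]), max(box[1], gt[1])
    xi2, yi2 = min(box[2], gt[2]), min(box[3], gt[3])
    inter = max(0.0, xi2 - xi1) * max(0.0, yi2 - yi1)
    area_b = (box[2] - box[0]) * (box[3] - box[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    iou = inter / (area_b + area_g - inter + eps)
    # smallest enclosing box of the pair
    cw = max(box[2], gt[2]) - min(box[0], gt[0])
    ch = max(box[3], gt[3]) - min(box[1], gt[1])
    # squared centre distance plus separate width/height penalties (EIoU)
    rho2 = ((box[0] + box[2] - gt[0] - gt[2]) ** 2
            + (box[1] + box[3] - gt[1] - gt[3]) ** 2) / 4.0
    dw = (box[2] - box[0]) - (gt[2] - gt[0])
    dh = (box[3] - box[1]) - (gt[3] - gt[1])
    eiou = (1 - iou + rho2 / (cw ** 2 + ch ** 2 + eps)
            + dw ** 2 / (cw ** 2 + eps) + dh ** 2 / (ch ** 2 + eps))
    return iou ** gamma * eiou          # focal reweighting by overlap quality
```

The IoU^γ factor down-weights low-overlap (low-quality) box pairs so that gradient updates concentrate on anchors that already overlap their targets well.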
To visually demonstrate the detection performance of the improved algorithm, multiple sets of comparative experiments were conducted under different environmental conditions (variations in light intensity and spatial distribution). Actual target counts were compared against the detection outcomes of the comparative algorithms and the proposed algorithm, and the relative error rate of each result was computed as the absolute difference between the target count and the detected count divided by the target count. Comparing these relative error rates across algorithms visually illustrates the effectiveness of the improved network.
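The relative error rate defined here reduces to a one-line helper; the counts in the comment are hypothetical, not figures from the experiments:

```python
def relative_error_rate(target_count, detected_count):
    """Relative error rate: |target - detected| / target, per the definition above."""
    return abs(target_count - detected_count) / target_count

# hypothetical example: 100 fruits present, 92 detected -> 0.08 (8%)
```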
From
Table 7, it is evident that under various environmental factors and operational conditions, the basic YOLOv5 and the latest, best-performing YOLOv8 algorithms were significantly affected by light and occlusion in orchard environments. These factors led to omission errors when identifying and detecting small targets like citrus fruits. The improved algorithm led to a notable increase in the number of detections, demonstrating its clear advantages with regard to detection accuracy.
The average relative error rates of the three object detection models under the various conditions were 22.1%, 16.8%, and 8.2%, respectively; the relative error rate of the improved model was thus significantly lower. Overall, the other advanced algorithms performed only moderately under complex conditions and struggled with the precise localization of small, occluded targets. By emphasizing feature enhancement and comprehensively integrating contextual information, the improved algorithm aggregated and strengthened feature diversity while maintaining information continuity, yielding a marked increase in detection accuracy. Its relative error rates under the various conditions were also noticeably lower than those of the other algorithms, with reductions in average relative error rate of 13.9 and 8.6 percentage points, respectively, highlighting the high detection precision of the improved model and providing more accurate and effective data input for future work.
Future work should incorporate additional occlusion-handling mechanisms, such as multi-scale contextual modeling or the use of 3D information, to better handle heavily occluded objects. It should also enhance the model’s ability to discriminate between similar objects under varying lighting conditions by integrating more advanced data augmentation or contrast-enhancement strategies during preprocessing. Finally, integrating depth information and investigating the precise localization of citrus harvesting points can help achieve fast and accurate 3D positioning for harvesting.