1. Introduction
In China’s agricultural industry, fruit cultivation ranks among the top three sectors, with orchards distributed mainly across plains and hilly mountainous regions. Statistics show that China’s citrus planting area covers 45.5031 million hectares and produces 60.0389 million tons, ranking first globally. With continued agricultural development, both the planting area and yield of citrus have increased annually. Manual harvesting accounts for more than 30% of the total cost of citrus cultivation, so reducing harvesting costs is crucial for improving planting efficiency and promoting the healthy and vigorous development of the citrus industry. However, agricultural machinery production and research in hilly mountainous areas are still at an early stage and are constrained by the terrain, hindering the deployment of agricultural mechanization and automation. With advances in science and technology, traditional agricultural production equipment can no longer meet the demands of modern agricultural production; consequently, the development of intelligent agricultural machinery has become a key focus area, and research on it is steadily increasing. Achieving automatic harvesting in orchards first requires solving the problems of visual recognition and localization [
1]. Currently, fruit detection methods are divided into two categories: deep learning methods and non-deep learning methods [
2].
Among non-deep learning methods for fruit detection, Chaivivatrakul et al. [
3] proposed a texture-based detection pipeline comprising feature extraction, feature classification, fruit point localization, morphological closing, and region extraction, achieving accuracies of 85% for pineapples and 100% for bitter gourds. Fu et al. [
4] researched various RGB-D sensors for fruit detection, integrating two or more types of image features to achieve object detection. Lin et al. [
5] proposed a method that performs probabilistic segmentation of red–green–blue images, applies the resulting masks to the original images to obtain filtered depth images, and then clusters the depth images, achieving detection accuracies of 86.4% for peppers and 88.6% for eggplants. Bulanon et al. [
6] introduced a method that uses fuzzy logic to merge thermal and visible images of orange tree crowns, improving fruit detection accuracy.
In recent years, deep learning has been extensively applied in computer vision and has driven technological advancements, as well as innovative applications, in many fields [
7]. Object detection algorithms have evolved with the rise of convolutional neural networks in computer vision [
8] and can be categorized into single- and two-stage methods. Single-stage object detection algorithms directly predict target bounding boxes and categories in one processing step, offering simpler algorithmic structures and faster inference speeds [
9]; representative algorithms include YOLOv3 [
10], SSD [
11], and RetinaNet [
12]. Strategies for improving single-stage detection algorithms mostly focus on enhancing feature extraction by replacing backbone networks, yet these methods still face challenges in detecting targets with small exposed areas in practical applications [
13]. Two-stage object detection algorithms first generate region proposals and then classify and regress within the proposed regions. However, this approach can lose the spatial relationships of local targets within the overall image and lacks end-to-end training, leading to fragmented training processes and larger parameter counts, which severely limit inference speed. Representative algorithms in this category include Fast R-CNN [
14], Faster R-CNN [
15], and Mask R-CNN [
16]. Deep learning object detection is widely applied in fields such as autonomous driving, medical image analysis, drone surveillance, and automated harvesting.
Scholars around the world have conducted extensive research on fruit recognition and detection using both two- and single-stage detection methods: Wang et al. [
17] fine-tuned pre-scan frames, designed a dual NMS algorithm to address occlusion and overlap issues, and removed redundant rectangular boxes to ensure high detection accuracy and low miss rates. Gai et al. [
18] integrated DenseNet inter-layer density into CSPDarkNet53 and changed anchor boxes to circular marking boxes suitable for target shapes, resulting in improved detection speed and accuracy for small targets in an enhanced YOLOv4 network. Nan et al. [
19] proposed GF-SPP using average and global average pooling, treating features obtained from global average pooling as independent channels to enhance both the average and maximum pooling features, simultaneously achieving multi-scale feature fusion and enhancement, demonstrating good performance in dragon fruit detection with the improved WGB-YOLO. Wang et al. [
20] reduced detection model parameters and enhanced detection efficiency using channel pruning algorithms to trim the YOLOv5 model, efficiently detecting apple fruits and aiding orchard management optimization for growers. Bai et al. [
21] proposed a real-time strawberry seedling detection YOLO algorithm, integrating Swin Transformer prediction heads on YOLO v7’s high-resolution feature map to enhance spatial location information utilization, thereby improving the detection accuracy for small target flowers and fruits in complex scenes of similar colors and occlusion overlaps. Chen et al. [
22] applied the YOLOv4 model to propose a Yangmei tree detection method based on drone images, using Leaky_ReLU activation functions to accelerate feature extraction speeds and retain the most accurate prediction boxes through DIoU NMS. Zhong et al. [
23] transformed the CBS module in the backbone network into a multi-branch large-kernel downsampling module to strengthen the receptive field of the network, achieving a mAP of 64.0% on a mango dataset. Zhao et al. [
24] improved YOLOv5 by using the lightweight ShuffleNetv2 network as the backbone and adding an attention mechanism in the convolution module to optimize target detection accuracy, which not only enhanced the detection accuracy of pomegranates but also increased detection speed. Lawal [
25] added PANet in the Neck to establish an accurate and fast algorithm for gourd fruit detection that can handle leaf occlusion and fruit overlap. Sun et al. [
26] replaced the backbone network of YOLOv5 with the lightweight GhostNet model, which effectively recognized passion fruits in complex environments and improved recognition speed, providing technical support for orchard-picking robots.
Overall, non-deep learning methods for fruit detection tend to have low accuracy, while two-stage deep learning methods are often too large in terms of parameters, resulting in slower inference speeds, making them unsuitable for automated harvesting machines. Additionally, the complexity of YOLOv7 and YOLOv8 has significantly increased, which places higher demands on hardware, making them less practical for agricultural applications. Considering the environmental factors in this study, such as fluctuating lighting conditions and the occlusion and overlap between branches and fruits, along with the need to balance model size with device limitations, we chose YOLOv5 as the base network. YOLOv5 offers faster speeds, moderate model size, and satisfactory accuracy. To further address the challenges posed by the complex orchard environment and ensure practical deployment, we applied the concept of weighted convolutions to share feature parameters, enabling more efficient information capture and improving self-attention mechanisms [
27,
28,
29]. By strengthening the relationships between input vectors, we enhanced the feature extraction for small targets. Ultimately, this approach aims to develop a highly accurate and robust object detection model tailored to the current environment. The main contributions of this paper are as follows:
- (1)
In traditional convolutional layers, information loss often occurs after multiple layers, especially in the extraction of fine-grained features critical for small object detection. To address this, we replaced 2D convolutions with receptive field convolutions with full 3D weights (RFCF) [
30], which assign a full set of 3D weights across the receptive field, increasing the network’s ability to capture spatial information over a larger area. Unlike standard convolutions, which may lose spatial relationships, RFCF retains and emphasizes the importance of features within the receptive field by distributing weights more effectively across three dimensions, enhancing the model’s ability to detect small and occluded objects such as citrus fruits in complex environments. This approach also improves parameter sharing and reduces redundant feature loss.
- (2)
Softmax attention is computationally expensive, while plain linear attention, though cheaper, tends to lose focus on the most relevant features. To mitigate these issues, we introduced the focused linear attention (FLA) module [
31], which uses mapping functions that simplify attention calculations and recover feature information more efficiently. By balancing the trade-off between focusing on critical areas of the input (such as small, overlapping fruits) and maintaining feature diversity, FLA enhances the network’s ability to distinguish between objects in cluttered scenes while keeping computational costs low. Specifically, this module applies a focused attention mechanism that prioritizes the most relevant parts of the image, allowing for a more precise and diverse feature extraction while still maintaining a high processing speed, which is crucial for real-time agricultural applications.
- (3)
In YOLO-based networks, the placement and sizing of anchor boxes are critical for effective object detection, particularly when dealing with objects of various sizes. To optimize the anchor boxes for our dataset, we applied K-means++ clustering [
32], which provides more appropriate anchor sizes based on the distribution of object sizes in our training data. This technique ensures that the anchor boxes are better suited to detecting citrus fruits of varying sizes in orchards. Additionally, we improved the bounding box loss function by adopting Focal-EIoU [
33] (Focal and Efficient IoU), which places greater emphasis on the overlap area of bounding boxes and improves localization accuracy. This modification enhances the model’s ability to accurately predict object boundaries, especially in scenarios involving occlusions or overlapping objects.
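To make contribution (2) more concrete, the sketch below implements the core of a focused linear attention computation in NumPy. The "focusing" kernel and the power p = 3 follow the general idea of focused linear attention [31], but the function names, shapes, and constants here are our illustrative choices, not the exact implementation used in the network:

```python
import numpy as np

def focused_map(x, p=3, eps=1e-6):
    """Illustrative 'focusing' kernel: sharpen the feature direction by
    raising it to a power while preserving the original norm."""
    x = np.maximum(x, 0) + eps                        # keep features non-negative
    xp = x ** p
    return (xp / np.linalg.norm(xp, axis=-1, keepdims=True)
            * np.linalg.norm(x, axis=-1, keepdims=True))

def focused_linear_attention(q, k, v, p=3, eps=1e-6):
    """Linear attention with the focused kernel for (seq_len, dim) inputs.
    Cost is O(N * d^2) in sequence length N, not O(N^2 * d) as in softmax
    attention, because keys and values are summarized once."""
    q, k = focused_map(q, p), focused_map(k, p)
    kv = k.T @ v                                      # (d, d) summary, linear in N
    z = 1.0 / (q @ k.sum(axis=0) + eps)               # per-query normalizer
    return (q @ kv) * z[:, None]
```

Because the kernelized similarities are non-negative, each output row is a normalized mixture of the value rows, so the mechanism keeps the convex-combination character of attention while avoiding the quadratic cost.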
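For contribution (3), anchor reclustering can be sketched as K-means++ seeding followed by Lloyd iterations, using 1 − IoU between (width, height) pairs as the distance, as is common for YOLO-style anchors. The helper names and parameters below are our own illustration, not the paper's code:

```python
import numpy as np

def iou_wh(box, anchors):
    """IoU between one (w, h) pair and an array of (w, h) anchors,
    assuming all boxes share a common top-left corner."""
    inter = np.minimum(box[0], anchors[:, 0]) * np.minimum(box[1], anchors[:, 1])
    union = box[0] * box[1] + anchors[:, 0] * anchors[:, 1] - inter
    return inter / union

def kmeans_pp_anchors(boxes, k=9, iters=100, seed=0):
    """Cluster (w, h) pairs with K-means++ seeding and 1 - IoU distance."""
    rng = np.random.default_rng(seed)
    centres = boxes[rng.integers(len(boxes))][None]          # first centre
    while len(centres) < k:                                  # ++ seeding step
        d2 = np.array([np.min(1 - iou_wh(b, centres)) ** 2 for b in boxes])
        centres = np.vstack([centres,
                             boxes[rng.choice(len(boxes), p=d2 / d2.sum())]])
    for _ in range(iters):                                   # Lloyd iterations
        assign = np.array([np.argmax(iou_wh(b, centres)) for b in boxes])
        new = np.array([boxes[assign == j].mean(axis=0) if np.any(assign == j)
                        else centres[j] for j in range(k)])
        if np.allclose(new, centres):
            break
        centres = new
    return centres[np.argsort(centres.prod(axis=1))]         # small to large
```

Sorting the resulting anchors by area matches the YOLO convention of assigning small anchors to high-resolution detection heads.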
4. Discussion
In recent years, the YOLO series models have been widely applied to various object detection tasks, including applications in agriculture such as fruit recognition, ripeness detection, and pest identification. The quality and suitability of datasets for specific tasks are crucial. This paper aimed to address the issues of missed detections and false positives caused by environmental factors like changes in natural lighting, leaf occlusion, and fruit overlap from a detection perspective. To mitigate these issues, a citrus fruit dataset for hilly and mountainous regions was constructed under different weather conditions and shooting angles. However, embedding the trained weights into a mobile picking robot requires balancing detection accuracy with model size, which is challenging. Our model improvements focus on the following aspects:
(1) Inspired by the modifications proposed by Chen et al. [
47], we applied K-means++ clustering to recluster the anchor boxes to better fit the target objects in our study. This aligns the predicted box sizes more closely with the actual detection samples, effectively improving final recognition accuracy and the precise localization of small objects without increasing the difficulty or time required for network training.
(2) Compared to the citrus fruit detection algorithm by Lin et al. [
48], our focus is on detecting fruits across the entire tree rather than a single cluster. To improve parameter sharing, we replaced the original convolutions with receptive field convolutions with full 3D weights (RFCF). This modification effectively reduces the impact of leaf occlusion and fruit overlap. Additionally, we introduced a focused linear attention (FLA) module to enhance recognition accuracy within smaller receptive fields, thereby improving detection precision for small target fruits.
(3) In contrast to the real-time orchard citrus detection algorithm by Deng et al. [
49], we optimized our model using the Focal-EIoU loss function, which helps avoid undesirable optimization directions during network training. Moreover, we introduced an additional regularization term
within this loss function to mitigate issues related to smaller datasets and environmental noise.
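Assuming the standard Focal-EIoU formulation of Zhang et al. (the EIoU penalty terms reweighted by IoU^γ), a minimal single-pair sketch might look like the following. The coordinate convention (x1, y1, x2, y2), the default gamma, and the function name are illustrative, and the additional regularization term mentioned above is omitted:

```python
def focal_eiou(box, gt, gamma=0.5, eps=1e-7):
    """Single-pair Focal-EIoU sketch for (x1, y1, x2, y2) boxes:
    loss = IoU**gamma * (1 - IoU + centre, width, and height penalties)."""
    # intersection and IoU
    xi1, yi1 = max(box[0], gt[0]), max(box[1], gt[1])
    xi2, yi2 = min(box[2], gt[2]), min(box[3], gt[3])
    inter = max(0.0, xi2 - xi1) * max(0.0, yi2 - yi1)
    area_b = (box[2] - box[0]) * (box[3] - box[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    iou = inter / (area_b + area_g - inter + eps)
    # smallest enclosing box of the pair
    cw = max(box[2], gt[2]) - min(box[0], gt[0])
    ch = max(box[3], gt[3]) - min(box[1], gt[1])
    # squared centre distance plus separate width/height penalties (EIoU)
    rho2 = ((box[0] + box[2] - gt[0] - gt[2]) ** 2
            + (box[1] + box[3] - gt[1] - gt[3]) ** 2) / 4.0
    dw = (box[2] - box[0]) - (gt[2] - gt[0])
    dh = (box[3] - box[1]) - (gt[3] - gt[1])
    eiou = (1 - iou + rho2 / (cw ** 2 + ch ** 2 + eps)
            + dw ** 2 / (cw ** 2 + eps) + dh ** 2 / (ch ** 2 + eps))
    return iou ** gamma * eiou          # focal reweighting by overlap quality
```

The IoU^γ factor down-weights low-overlap (low-quality) box pairs so that gradient updates concentrate on anchors that already overlap their targets well.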
To visually demonstrate the detection performance of the improved algorithm, multiple sets of comparative experiments were conducted under different environmental conditions (variations in light intensity and spatial distribution). Actual target counts were compared against the detection outcomes of the comparative algorithms and the proposed algorithm, and the relative error rate of each result was computed as the absolute difference between the target count and the detected count divided by the target count. Comparing these relative error rates across algorithms visually illustrates the effectiveness of the improved network.
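The relative error rate defined here reduces to a one-line helper; the counts in the comment are hypothetical, not figures from the experiments:

```python
def relative_error_rate(target_count, detected_count):
    """Relative error rate: |target - detected| / target, per the definition above."""
    return abs(target_count - detected_count) / target_count

# hypothetical example: 100 fruits present, 92 detected -> 0.08 (8%)
```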
From
Table 7, it is evident that under various environmental factors and operational conditions, the basic YOLOv5 and the latest, best-performing YOLOv8 algorithms were significantly affected by light and occlusion in orchard environments. These factors led to omission errors when identifying and detecting small targets like citrus fruits. The improved algorithm led to a notable increase in the number of detections, demonstrating its clear advantages with regard to detection accuracy.
The average relative error rates of the three object detection models under the various conditions were 22.1%, 16.8%, and 8.2%, respectively; the relative error rate of the improved model was thus significantly lower. Overall, the other advanced algorithms performed only moderately under complex conditions and struggled with the precise localization of small, occluded targets. By emphasizing feature enhancement and comprehensively integrating contextual information, the improved algorithm aggregated and strengthened feature diversity while maintaining information continuity, yielding a marked increase in detection accuracy. Its relative error rates under the various conditions were also noticeably lower than those of the other algorithms, with reductions in average relative error rate of 13.9 and 8.6 percentage points, respectively, highlighting the high detection precision of the improved model and providing more accurate and effective data input for future work.
Future work should incorporate additional occlusion-handling mechanisms, such as multi-scale contextual modeling or the use of 3D information, to better handle heavily occluded objects. It should also enhance the model’s ability to discriminate between similar objects under varying lighting conditions by integrating more advanced data augmentation or contrast-enhancement strategies during preprocessing. Finally, integrating depth information and investigating the precise localization of citrus harvesting points can help achieve fast and accurate 3D positioning for harvesting.