Article

Bud-YOLOv8s: A Potato Bud-Eye-Detection Algorithm Based on Improved YOLOv8s

School of Computer Science and Technology, Shandong University of Technology, Zibo 255000, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(13), 2541; https://doi.org/10.3390/electronics13132541
Submission received: 11 May 2024 / Revised: 19 June 2024 / Accepted: 26 June 2024 / Published: 28 June 2024

Abstract:
The key to intelligent seed potato cutting technology lies in the accurate and rapid identification of potato bud eyes. Existing detection algorithms suffer from low recognition accuracy and high model complexity, resulting in an increased miss rate. To address these issues, this study proposes a potato bud-eye-detection algorithm based on an improved YOLOv8s. First, by integrating the Faster Neural Network (FasterNet) with the Efficient Multi-scale Attention (EMA) module, a novel Faster Block-EMA network structure is designed to replace the bottleneck components within the C2f module of YOLOv8s. This enhancement improves the model’s feature-extraction capability and computational efficiency for bud detection. Second, this study introduces a weighted bidirectional feature pyramid network (BiFPN) to optimize the neck network, achieving multi-scale fusion of potato bud eye features while significantly reducing the model’s parameters, computation, and size due to its flexible network topology. Finally, the Efficient Intersection over Union (EIoU) loss function is employed to optimize the bounding box regression process, further enhancing the model’s localization capability. The experimental results show that the improved model achieves a mean average precision (mAP@0.5) of 98.1% with a model size of only 11.1 MB. Compared to the baseline model, the mAP@0.5 and mAP@0.5:0.95 were improved by 3.1% and 4.5%, respectively, while the model’s parameters, size, and computation were reduced by 49.1%, 48.1%, and 31.1%, respectively. Additionally, compared to the YOLOv3, YOLOv5s, YOLOv6s, YOLOv7-tiny, and YOLOv8m algorithms, the mAP@0.5 was improved by 4.6%, 3.7%, 5.6%, 5.2%, and 3.3%, respectively. Therefore, the proposed algorithm not only significantly enhances the detection accuracy, but also greatly reduces the model complexity, providing essential technical support for the application and deployment of intelligent potato cutting technology.

1. Introduction

As one of the world’s most important food crops, the potato plays a crucial role in ensuring food security and improving people’s quality of life due to its high yield, strong resistance, and high nutritional value. Especially in China, the area and scale of potato production are among the highest in the world, highlighting its strategic position in the national food industry [1,2]. However, potato cultivation is mainly based on asexual propagation through tubers, where potatoes with bud eyes are cut into small pieces for planting. Although the planting and harvesting processes have been gradually mechanized, the core step of seed potato preparation—cutting—still relies heavily on manual labor. This not only results in low efficiency and high labor intensity, but also directly affects the planting efficiency and emergence rate of seed potatoes [3].
Currently, the functions of automatic cutting machines on the market are generally limited, and they usually lack the ability to accurately detect potato bud eyes [4,5,6]. Due to the small size of the bud eyes and their low contrast with the background of the potato skin, it is often difficult to accurately identify the position of the bud eyes. This not only reduces the germination rate of seed potatoes, but also results in a significant waste of resources [7]. This situation underlines the urgent need for the introduction of intelligent agricultural technologies in potato production and demonstrates the great potential for their application and development [8]. To address this core issue in potato cultivation, in-depth research on bud-eye-detection algorithms has become essential. This not only promotes the advancement of intelligent cutting technology and enhances planting efficiency, but also plays a crucial role in the overall transformation of agricultural automation. Against this backdrop, the research on potato bud-eye-detection technology has increasingly garnered attention as a key task in the fields of agricultural automation and intelligent recognition. Traditional techniques in this domain mainly rely on image processing and machine vision technologies. However, with the rapid development of computer vision and deep learning, potato bud eye recognition based on object-detection technology has gradually become a research hotspot.
In recent years, significant progress has been made in the field of potato bud eye recognition. However, numerous challenges remain. Firstly, traditional image-processing methods, while advantageous in terms of computational efficiency and real-time processing, are limited by their reliance on manually designed features. This reliance restricts their generalization capability and detection accuracy in complex scenarios [9]. Secondly, while machine vision technology enhances recognition accuracy and automation through automated feature extraction, this improvement entails increased computational resource requirements and extended training periods [10]. Although current potato bud-eye-detection technologies based on deep learning have achieved substantial improvements in accuracy and robustness, the complexity of their model network structures results in a dependence on large datasets and high-performance computing resources. This dependency affects the model’s operational efficiency and real-time response, making it challenging to achieve efficient real-time operation in resource-constrained agricultural applications [11,12]. Therefore, developing a bud-eye-detection algorithm that combines high efficiency and high accuracy is crucial for meeting actual production needs and advancing intelligent cutting technology. To address the limitations of existing technologies, this paper proposes an improved algorithm based on the YOLOv8s [13] model, named Bud-YOLOv8s. This model introduces a lightweight network structure and an efficient attention mechanism, aiming to enhance bud eye detection accuracy while reducing model complexity and improving operational efficiency. The main contributions and methods of this paper are as follows:
  • A novel network structure, Faster Block-EMA, is proposed. This structure effectively integrates the Faster Block from the Faster Neural Network (FasterNet) and the new Efficient Multi-scale Attention (EMA) mechanism to enhance the model’s focus on bud eyes and its feature-extraction capabilities.
  • The C2f module of YOLOv8s is improved by replacing the bottleneck structure in the C2f module with the Faster Block-EMA structure, forming the C2f-Faster-EMA module. This improvement not only increases the detection accuracy of bud eyes, but also optimizes the model’s operational efficiency.
  • A weighted bidirectional feature pyramid network (BiFPN) is introduced to improve the neck network of YOLOv8, effectively fusing bud eye features at different scales and significantly reducing the model’s parameters and size. Additionally, the Efficient Intersection over Union (EIoU) loss function is employed to optimize the bounding box regression task, further enhancing the model’s localization capability.
The rest of this paper is structured as follows: Section 2 reviews related research on potato bud eye detection. Section 3 introduces an improved model for potato bud eye detection and details the specific improvement methods. Section 4 describes the dataset, experimental environment, and parameter settings and outlines the evaluation criteria for the experiments. Section 5 analyzes the experimental results and presents data and conclusions from various comparative experiments. Section 6 summarizes the results of the full text and looks forward to future research directions.

2. Related Work

Early research on potato bud eye recognition primarily focused on traditional machine vision techniques and image-processing methods. For example, Li et al. [14] conducted three-dimensional geometric analysis of the saturation component (S) to extract feature parameters related to bud eyes and performed comprehensive longitudinal and transverse evaluations to achieve accurate bud eye recognition. Lyu et al. [15] utilized Gabor feature image filtering and bud eye feature analysis to remove boundary-connected regions of the potato and extract bud eye regions. However, this algorithm exhibits a high misrecognition rate when processing potatoes with damaged skin or insect holes, necessitating further improvement. Zhang et al. [16] used median filtering to remove image noise and Otsu’s method for image segmentation to separate the potato from the background. They then combined local binary patterns (LBPs) to extract features from both new and old seed potato bud eyes and non-bud eye regions, using a support vector machine (SVM) for sample feature training and classification, thus improving the accuracy of bud eye recognition. Xi et al. [17] applied chaotic variable optimization to the K-means algorithm, mapping it into the K-means search space to achieve global optimization and avoid local minima, resulting in fast segmentation of potato bud eyes. However, this method requires manual elimination of overlapping bud eyes. Yang et al. [18] combined optimal segmentation thresholds with the Canny edge detection algorithm to generate segmentation masks and edge masks on grayscale images, respectively, and then combined the two masks to complete potato bud eye detection. Nevertheless, the Canny edge detector may fail to detect subtle differences between the boundaries of sprouting and non-sprouting regions, posing a risk of missed detections.
Despite the improvements in accuracy and efficiency of bud eye recognition achieved by traditional image-processing and machine vision techniques, they still have limitations in handling complex surface features and reducing misrecognition rates. These methods often lack robustness and typically require considerable manual intervention. In contrast, the application of deep learning algorithms in the agricultural sector has demonstrated significant advantages. With their ability to autonomously learn feature representations and process hierarchical information, deep learning algorithms can effectively tackle challenges posed by complex scenes and large-scale data, thereby improving recognition accuracy and reducing the need for manual intervention. Object-detection technology, a crucial application of deep learning, is primarily divided into two-stage and single-stage algorithms. Two-stage object-detection algorithms, such as the R-CNN [19], Fast R-CNN [20], and Faster R-CNN [21], first generate candidate regions in an image and then classify and regress the positions of these regions. For example, Xi et al. [22] improved the Faster R-CNN algorithm for potato bud eye recognition, achieving a detection accuracy of 96.32%, an improvement of 5.98%. However, despite their high robustness, these detection algorithms have large model sizes and computational demands, requiring long running times, which makes them unsuitable for real-time production.
On the other hand, single-stage object-detection algorithms, such as SSD [23] and the YOLO series [24], treat object detection as a regression problem by dividing the image into grids to predict the categories and positions of objects. These algorithms have fast detection speeds, making them suitable for real-time scenarios. Shi et al. [25] constructed a bud eye dataset with occlusions and mechanical damage and used the high-performance feature-extraction capabilities of the YOLOv3 network model to achieve fast and accurate potato bud eye recognition. However, this study only used the original model without making specific improvements for bud eye detection. Huang et al. [26] introduced GhostNetV2 to replace the backbone network of YOLOv4 and used the SIoU loss function to improve model convergence speed, successfully reducing model parameters and enhancing bud eye detection performance. However, this study did not deeply evaluate the model’s generalization ability in different environments and complex backgrounds, limiting its applicability in broader scenarios. Li et al. [27] constructed a diversified potato bud eye dataset and trained it using the YOLOx network model. This model performed well on samples with occlusions, multiple bud eyes, and complex conditions. Compared to YOLOv3 and YOLOv4, this model showed superior accuracy and speed. Nevertheless, this study also used the original model without specialized optimization or improvement for bud eye features. Zhang et al. [28] integrated the CBAM attention mechanism and the BiFPN network structure into the YOLOv5s model, enhancing the capture of small object information and optimizing decoupled heads for faster convergence and improved performance, significantly increasing the average precision of bud eye detection. However, while the introduction of attention modules brought high precision, it also increased the parameter count and computational resource consumption. Additionally, Zhang et al. [29] improved the backbone network of the YOLOv7 model by introducing a self-attention mechanism and replacing the ELAN-H module in the original head layer with the Inception Next module, optimizing the bounding box loss function. This approach enhanced the focus on bud eye features and accelerated model convergence, but also introduced additional computational load and parameters, impacting the real-time performance and efficiency of the object detector.

3. Methods

This study aims to accurately identify potato bud eyes using object-detection technology, providing technical support for intelligent potato cutting equipment. This task requires that the object-detection algorithm balances high accuracy, real-time response, and low computational cost to meet the deployment conditions of embedded platforms.
Among the wide range of object-detection algorithms, the You Only Look Once (YOLO) series is renowned for its efficiency and accuracy in single-stage detection. Compared to traditional two-stage detection algorithms such as Faster R-CNN, YOLO maintains high detection accuracy while offering faster detection speeds. In particular, YOLOv8, as the latest iteration, has integrated various optimization strategies, further enhancing the model’s performance and applicability. This makes the YOLOv8 algorithm particularly outstanding in real-time application scenarios such as autonomous driving [30], video surveillance [31], and drone detection [32].
Considering the constraints on hardware resources, the real-time requirements of the actual task, and the excellent performance and flexibility of YOLOv8, this study selects YOLOv8 as the baseline model. This choice not only ensures the efficiency and accuracy of the current experiments, but also lays a solid foundation for further research and algorithm improvements.

3.1. Principle of YOLOv8 Network

The latest version of the YOLO algorithm, YOLOv8, developed by the Ultralytics team in 2023, introduces new features and optimizations based on YOLOv5, aiming to improve performance and flexibility [33]. The YOLOv8 algorithm offers five model sizes of increasing network complexity (n, s, m, l, and x), catering to different application requirements and resource constraints. The n model is the smallest, designed for resource-limited environments, trading some accuracy for extremely high running speed. The s model is lightweight, balancing detection speed and accuracy. The m model is medium-sized, improving accuracy compared to the s model at the cost of some operational efficiency. The l model has larger parameters and computational demands, suitable for applications requiring higher accuracy. The x model is the largest, ideal for offline analysis or tasks with extremely high accuracy requirements.
This study employs the s version of the YOLOv8 series (YOLOv8s), which, as a balanced option, ensures high detection speed while controlling computational overhead. This makes it highly suitable for applications like potato bud eye detection, where efficiency is critical and hardware resources are limited. The network structure of YOLOv8s, shown in Figure 1, mainly comprises three components: the backbone network, the neck network, and the detection head.
The backbone network is the core of the YOLOv8 architecture, responsible for feature extraction from the input image. This network employs the C2f module based on the Cross-Stage Partial Network (CSP) [34] architecture concept, integrated with the Efficient Layer Aggregation Network (ELAN) design idea from YOLOv7 [35], replacing the C3 module used in YOLOv5. This improvement allows the model to remain lightweight while capturing richer gradient flow information. At the end of the backbone network, an SPPF module is employed, which increases the model’s ability to capture multi-level feature information through successive transmission and hierarchical concatenation of the max-pooling layers.
The neck layer utilizes a Path Aggregation Network (PANet) [36] structure based on the feature pyramid network (FPN) [37] design, which features a bidirectional feature flow design from bottom-up and top-down. This structure preserves detailed and semantic information of the image while effectively integrating the features extracted by the backbone network.
The detection head adopts a decoupled head structure, separating the tasks of classification and bounding box regression. This separation accelerates model convergence and improves detection accuracy. Additionally, YOLOv8 uses an anchor-free detection method that directly predicts object locations, reducing the number of anchor boxes and further enhancing the model’s detection speed and accuracy.

3.2. Bud-YOLOv8s Detection Algorithm

After an in-depth exploration of the application of the YOLOv8s algorithm in potato bud eye detection, we identified certain limitations. To address these issues, we propose the Bud-YOLOv8s bud-eye-detection algorithm, aiming to enhance the model’s detection performance while reducing computational costs. The overall network architecture of the improved Bud-YOLOv8s is shown in Figure 2.
The numerous bottleneck structures stacked within the C2f module of the YOLOv8s model improve detection performance, but they also introduce channel information redundancy, increase the computational cost, and enlarge the model size. These factors limit its deployment on resource-constrained devices [38]. To improve the efficiency of the model and reduce information redundancy, this paper adopts the Faster Block from FasterNet [39] to enhance the C2f module. The Faster Block employs efficient partial convolution (PConv) operations, which reduce the model parameters and computational complexity while improving bud eye detection accuracy. PConv is a variant of the convolution operation that captures spatial information by performing a standard convolution on only a subset of the input channels, leaving the remaining channels unchanged. This strategy not only maintains model performance, but also effectively reduces computational cost. To further enhance the model’s ability to detect small bud eye targets in complex backgrounds, we integrated the Efficient Multi-scale Attention (EMA) [40] module with the Faster Block, forming a new Faster Block-EMA structure to replace the bottleneck part in the C2f module. The EMA mechanism processes channel and spatial information in parallel by grouping feature maps, reducing network complexity. It leverages cross-spatial learning to merge the outputs of the two parallel branches, thus focusing more effectively on key features of the bud eyes and enhancing the feature-extraction capabilities. This combination reduces model complexity while significantly improving detection performance, making it particularly suitable for small target detection tasks such as potato bud eyes.
The small size, variable morphology, and high similarity to the surrounding skin of potato bud eyes present challenges for the PANet neck network in the YOLOv8s model during the feature fusion stage. The PANet struggles to effectively handle the multi-scale features of bud eyes, resulting in frequent false positives and missed detections in complex backgrounds [29]. To improve the multi-scale feature fusion of bud eyes, we adopted the bidirectional feature pyramid network (BiFPN) [41] to enhance the neck layer of YOLOv8s. BiFPN, with its multi-scale feature fusion strategy, not only preserves the original information of the bud eye images, but also simplifies the network structure through cross-scale connections.
In addition, since bud eyes are densely distributed in images, the current bounding box loss function used by YOLOv8s lacks precise localization capability when dealing with highly overlapping detection boxes. This limitation affects the accurate identification of bud eye positions and shapes. To address this issue, we introduced the Efficient IoU (EIoU) [42] loss function to optimize the bounding box regression task. The EIoU takes into account both the overlap area and the aspect ratio of the detection boxes, allowing for more accurate localization of the target boxes, thereby improving the precision of potato bud eye detection.

3.3. C2f-Faster-EMA Module

The C2f module in YOLOv8s, as an improved form of the CSP structure, utilizes depthwise separable convolution and residual connection techniques. As shown in Figure 1, it comprises CBS modules, Split layers, bottleneck sequences, and Concat layers for efficient feature extraction and information transfer. However, due to the excessive stacking of bottleneck structures within the C2f module, redundant or irrelevant feature information may be present in the feature maps. This redundancy not only increases the computational burden of the model, but also interferes with the efficiency of feature extraction for potato bud eyes, affecting detection accuracy.
To address the redundancy in the C2f module and enhance the accuracy of potato bud eye detection, we propose the C2f-Faster-EMA network module. This module optimizes the C2f structure by significantly reducing the number of parameters and the computational complexity of the model, thereby improving operational efficiency. The detailed design is illustrated in Figure 3a. The core idea is to integrate the Faster Block with the EMA mechanism, replacing the redundant bottleneck parts within the C2f module. This integration aims to improve the overall performance of the model. The structure of the Faster Block-EMA is depicted in Figure 3b. The Faster Block employs partial convolution (PConv) operations, which selectively process image regions, thereby reducing redundant computations and lowering the model’s computational cost and size. This selective processing allows for more precise and efficient extraction of spatial features of bud eyes, enhancing both detection performance and computational efficiency. In addition, the EMA mechanism dynamically adjusts the weight distribution, enabling the model to focus more on the significant features of potato bud eyes while effectively suppressing the background noise of the skin. This dynamic adjustment significantly improves the detection of small target bud eyes. The integration of these two components substantially enhances the performance of the model, improving the feature-extraction capabilities for bud eyes and achieving a lightweight design. This makes the model more efficient and practical for real-world applications.

3.3.1. FasterNet Neural Network

In the field of computer vision, the performance of network architectures has improved with the increasing depth of neural networks and the expansion of feature map channels. However, this has also resulted in significant information redundancy and increased computational costs. Particularly for the YOLOv8s model, its considerable size and computational demands limit its deployment and application on resource-constrained platforms. To reduce computational costs and enhance operational speed, this paper proposes an improved method by introducing a lightweight backbone network, FasterNet, to optimize the C2f module. As the core component of the C2f-Faster-EMA network module, it reduces computational redundancy by improving traditional convolution operations, thereby enhancing the operational efficiency of the potato bud-eye-detection model.
Traditional convolution operations typically perform full convolution on all input channels when processing images, which increases computational complexity and memory access costs. To address this issue, FasterNet introduces the concept of partial convolution (PConv), as illustrated in Figure 4. PConv performs regular convolution operations on only a subset of the input channels to extract spatial features, while leaving the remaining channels unchanged. This strategy not only maintains the model’s performance, but also effectively reduces the computational load. In terms of memory optimization, PConv uses the first or last contiguous channel as a representative for the entire feature map during computation, thereby avoiding unnecessary memory access. This further reduces computational costs and enhances the model’s operational efficiency.
The calculations for PConv’s floating-point operations (FLOPs) and memory access are shown in Equations (1) and (2), respectively. These equations demonstrate the extent of optimization in computational load and memory access compared to traditional convolution.
$h \times w \times k^2 \times c_p^2$
$h \times w \times 2c_p + k^2 \times c_p^2 \approx h \times w \times 2c_p$
where $h$ and $w$ are the height and width of the feature map, respectively, $c_p$ is the number of channels involved in the partial convolution, and $k$ is the convolution kernel size. Specifically, when opting to convolve just $1/4$ of the input channels, the computational load of PConv is only $1/16$ of that of a regular convolution. Therefore, the implementation of FasterNet significantly enhances the feasibility of deploying efficient models on devices with limited resources.
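To make the idea concrete, the following is a minimal PyTorch sketch of a partial convolution layer, assuming the split is taken over the first $c_p$ channels; it illustrates the principle rather than reproducing the FasterNet reference implementation, and the class and parameter names are ours.

```python
import torch
import torch.nn as nn

class PConvSketch(nn.Module):
    """Illustrative partial convolution (PConv): apply a regular k x k convolution
    to only the first c_p input channels and pass the remaining channels through
    unchanged. Sketch only, not the FasterNet reference implementation."""
    def __init__(self, channels: int, partial_ratio: float = 0.25, kernel_size: int = 3):
        super().__init__()
        self.cp = max(1, int(channels * partial_ratio))   # channels that are convolved
        self.conv = nn.Conv2d(self.cp, self.cp, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.cp, x.shape[1] - self.cp], dim=1)
        x1 = self.conv(x1)                   # spatial features from the c_p channels
        return torch.cat([x1, x2], dim=1)    # remaining channels left untouched

# With partial_ratio = 1/4, the convolution FLOPs scale with c_p^2, i.e. roughly
# 1/16 of a full convolution with the same kernel size, as noted above.
```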

3.3.2. EMA Mechanism

Due to the small size of bud eyes and their resemblance to the potato skin background, neural networks face challenges in distinguishing bud eyes from the background. This often results in insufficient extraction of texture and contour information, leading to false detections and misses. To address this issue, this paper introduces the Efficient Multi-scale Attention (EMA) mechanism, in conjunction with the Faster Block, building on the improvements to the C2f module by FasterNet. The EMA mechanism dynamically adjusts feature map weights, enhancing focus on critical bud eye features and suppressing background noise, thereby effectively reducing false positives and negatives.
Compared to other attention mechanisms, EMA avoids the strategy of channel reduction. Instead, it reshapes some channels as batch dimensions, groups the channels, and utilizes parallel convolution kernels to capture features at different scales. This design retains channel information while reducing computational costs. As illustrated in Figure 5, the EMA module first divides the input feature map X’s channels C into G groups, each containing C / G channels. This grouping strategy distributes spatial semantic features evenly across each feature group, aiding the network in learning diverse semantic information and reducing computational complexity by decreasing the number of channels per group. Furthermore, the EMA module consists of two parallel sub-networks that process the outputs of 1 × 1 and 3 × 3 convolution kernels, respectively. The 1 × 1 convolution kernels capture local interactions between channels, while the 3 × 3 kernels capture broader spatial contextual information. By processing channel and spatial information in parallel, the EMA structure circumvents the performance degradation caused by complex sequential processing and deep convolutions.
Finally, the cross-spatial learning component employs 2D Global Average Pooling (GAP) and the Softmax function to encode global spatial information. It then uses matrix dot product operations to fuse the outputs of the two parallel branches, generating the final attention-weighted feature map. This design achieves cross-spatial feature interaction and information sharing, significantly enhancing the network’s ability to extract detailed information such as the texture and edges of bud eyes. The aforementioned method can be represented by the following equation:
$Z_c = \frac{1}{H \times W} \sum_{j=1}^{H} \sum_{i=1}^{W} x_c(i, j)$
where $Z_c$ represents the global average of channel $c$ over all spatial positions after GAP processing; $x_c(i, j)$ represents the value of channel $c$ at spatial position $(i, j)$ in the input feature map; and $H$ and $W$ are the height and width of the feature map, respectively.
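As a rough illustration of the grouping, parallel 1 × 1 and 3 × 3 branches, and cross-spatial GAP fusion described above, a simplified PyTorch sketch is given below. It deliberately omits several details of the official EMA implementation (such as the directional pooling branches), and all module and variable names are ours.

```python
import torch
import torch.nn as nn

class EMASketch(nn.Module):
    """Simplified sketch of the EMA idea: reshape channel groups into the batch
    dimension, process them with parallel 1x1 and 3x3 branches, then fuse the
    branches via 2D global average pooling, softmax, and matrix dot products."""
    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        cg = channels // groups
        self.branch1x1 = nn.Conv2d(cg, cg, kernel_size=1)            # cross-channel interaction
        self.branch3x3 = nn.Conv2d(cg, cg, kernel_size=3, padding=1) # broader spatial context
        self.gap = nn.AdaptiveAvgPool2d(1)                            # 2D GAP as in the equation above
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        g = self.groups
        xg = x.reshape(b * g, c // g, h, w)          # channel groups folded into batch dim
        out1 = self.branch1x1(xg)
        out2 = self.branch3x3(xg)
        # Cross-spatial learning: GAP + softmax on one branch, dot product with the other
        w1 = self.softmax(self.gap(out1).reshape(b * g, 1, c // g))
        w2 = self.softmax(self.gap(out2).reshape(b * g, 1, c // g))
        f1 = out1.reshape(b * g, c // g, h * w)
        f2 = out2.reshape(b * g, c // g, h * w)
        attn = (torch.matmul(w1, f2) + torch.matmul(w2, f1)).reshape(b * g, 1, h, w)
        return (xg * attn.sigmoid()).reshape(b, c, h, w)
```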
Overall, the EMA mechanism enhances the expressive capability and computational efficiency of convolutional neural networks through feature grouping, parallel sub-networks, and cross-spatial learning, without significantly increasing computational load. Its lightweight and efficient features make it easily integrable into existing convolutional network architectures, offering new possibilities for more effective and accurate models. By integrating the EMA mechanism with the Faster Block, Bud-YOLOv8s not only boosts operational efficiency, but also significantly improves the accuracy of potato bud eye detection.

3.4. Weighted Bidirectional Feature Pyramid Network

In the realm of object detection, although deep networks excel at capturing more complex, high-dimensional features, they face difficulties when dealing with small targets such as bud eyes. Due to their small size, small targets inherently possess less feature information, which tends to be lost during the successive layers of transmission and feature fusion in deep networks. Additionally, features of small targets are often overshadowed or obscured by those of larger targets, leading to frequent false positives and missed detections. YOLOv8s incorporates the Path Aggregation Network (PANet), which builds upon the feature pyramid network (FPN) by adding an additional bottom-up path aggregation to improve feature fusion. However, not all nodes significantly contribute to the feature network, potentially affecting the efficiency of bud eye feature fusion. Moreover, the added paths and nodes in PANet, based on the FPN, complicate the model structure, resulting in an increase in parameters and computational load. Figure 6a,b display the structures of the FPN and PANet, respectively.
To address these issues and enhance the model’s capability to fuse features of bud eyes across different scales, this paper adopts the weighted bidirectional feature pyramid network (BiFPN) to replace the neck network PANet in the YOLOv8 model. BiFPN simplifies the bidirectional network structure and reduces redundancy by removing nodes with only a single input edge, as these nodes lack multi-scale feature fusion and contribute minimally to the feature network. Unlike PANet, BiFPN not only retains both top-down and bottom-up pathways, but also introduces a weighted feature fusion strategy. When fusing features of bud eyes at different scales, BiFPN does not simply sum or concatenate them. Instead, it assigns different weights based on the importance of features at different scales. This allows the model to focus more on bud eye feature maps that contribute more significantly to the image, thus enhancing the representational capacity of features. As illustrated in Figure 7, BiFPN treats each pathway as a feature network layer and repeats the same layer multiple times to achieve higher levels of feature fusion. The computation formula is as follows:
$O = \sum_i \frac{\omega_i}{\varepsilon + \sum_j \omega_j} \times I_i$
where $O$ is the fused output, $\omega_i$ and $\omega_j$ are the learnable fusion weights corresponding to the inputs $I_i$ and $I_j$, and $\varepsilon$ is a very small constant used to avoid numerical instability. To ensure $\omega_i \geq 0$, a ReLU activation function is applied to each $\omega_i$.
The formulas for cross-scale connections and weighted feature fusion in BiFPN are as follows:
$P_i^{td} = \mathrm{Conv}\left(\frac{\omega_1 \times P_i^{in} + \omega_2 \times \mathrm{Resize}(P_{i+1}^{in})}{\omega_1 + \omega_2 + \varepsilon}\right), \qquad P_i^{out} = \mathrm{Conv}\left(\frac{\omega_1' \times P_i^{in} + \omega_2' \times P_i^{td} + \omega_3' \times \mathrm{Resize}(P_{i-1}^{out})}{\omega_1' + \omega_2' + \omega_3' + \varepsilon}\right)$
where $\mathrm{Conv}$ corresponds to the convolution operation, $\mathrm{Resize}$ indicates the upsampling or downsampling operation, $P_i^{in}$ represents the input feature at level $i$, $P_i^{td}$ denotes the intermediate top-down feature at level $i$, and $P_i^{out}$ represents the output feature at level $i$.
This formula demonstrates how BiFPN calculates the weighted sum of inputs from different paths and scales, optimizing the feature representation for improved detection accuracy. This approach ensures that the most relevant and useful features are emphasized, enhancing the overall performance of the model in detecting potato bud eyes.
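A minimal PyTorch sketch of the fast normalized (weighted) fusion step described by the formulas above is given below; it assumes the input feature maps have already been aligned in resolution and channel count by the Resize and Conv operations, and the class name is ours.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Sketch of BiFPN-style fast normalized fusion for n feature maps of the
    same shape: learnable non-negative weights normalized by their sum plus a
    small epsilon."""
    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs: list) -> torch.Tensor:
        w = torch.relu(self.weights)        # ReLU keeps the weights non-negative
        w = w / (w.sum() + self.eps)        # fast normalized fusion
        return sum(wi * xi for wi, xi in zip(w, inputs))

# Usage sketch (feature maps assumed already resized to a common shape):
# fuse = WeightedFusion(num_inputs=2)
# p_td = fuse([p_in, p_higher_upsampled])
```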

3.5. EIoU Loss Function

In potato bud eye detection, bounding box accuracy is crucial for the model’s localization capability. This depends primarily on the precise regression of the bounding box. Bounding boxes not only define the position of the target in the image, but also affect the accuracy of target classification. Therefore, the bounding box loss function plays an essential role in optimizing the bounding box regression process. It guides the model to progressively approximate true values during training, directly impacting target detection performance.
The original YOLOv8s model employs the Complete-IoU (CIoU) loss function during training, which builds on the Distance-IoU (DIoU) [43] by adding a loss term for the aspect ratio. This is intended to increase the penalty when there is a significant difference in the aspect ratio between the predicted and target boxes, thereby encouraging the model to better fit the target box’s aspect ratio during training. The CIoU loss function is calculated as follows:
$L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v, \qquad \alpha = \frac{v}{(1 - IoU) + v}, \qquad v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$
where $IoU$ represents the Intersection over Union between the predicted box and the ground truth box; $b$ and $b^{gt}$ are the central points of the predicted box and the ground truth box, respectively; $\rho$ is the Euclidean distance between the two central points; $c$ is the diagonal length of the smallest enclosing box covering the two boxes; $\alpha$ is a weight coefficient, and $v$ measures the consistency of the width-to-height ratio between the predicted box and the ground truth box; $w$, $h$, $w^{gt}$, and $h^{gt}$ represent the width and height of the predicted box and the ground truth box, respectively.
However, although the CIoU loss function considers the aspect ratio between bounding boxes, the penalty term introduced in the CIoU becomes zero when the aspect ratios of the predicted and actual boxes are the same or linearly proportional, thus becoming ineffective. Additionally, the formula in the CIoU reflects the difference in the aspect ratio rather than the actual differences in the width and height relative to their respective confidence, which can affect the accuracy of the detection box in dense bud eye areas. To enhance the model’s localization ability in densely populated bud eye areas and address the issue of overlapping detection boxes, this paper adopts the Efficient IoU (EIoU) loss function to improve the bounding box regression process. The EIoU loss function, building on the CIoU penalty, separates the aspect ratio factors of the predicted and actual boxes and explicitly introduces loss terms for the width and height. This modification allows the loss function to more precisely reflect the differences in width and height between the predicted and actual boxes. Such improvements help enhance the regression precision and convergence speed of the detection boxes, thereby improving the overall detection accuracy in dense bud eye areas. The calculation formula is as follows:
$L_{EIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{w_c^2 + h_c^2} + \frac{\rho^2(w, w^{gt})}{w_c^2} + \frac{\rho^2(h, h^{gt})}{h_c^2}$
where $w_c$ and $h_c$ are the width and height of the smallest enclosing box covering the predicted box and the ground truth box, respectively.
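The EIoU penalty above can be sketched in PyTorch as follows; this is an illustrative computation for boxes in (x1, y1, x2, y2) format, not the exact loss implementation used during training.

```python
import torch

def eiou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Illustrative EIoU loss: IoU term, center-distance term, and explicit
    width and height penalty terms, for boxes of shape (..., 4) in xyxy format."""
    # Intersection over union
    ix1 = torch.max(pred[..., 0], target[..., 0])
    iy1 = torch.max(pred[..., 1], target[..., 1])
    ix2 = torch.min(pred[..., 2], target[..., 2])
    iy2 = torch.min(pred[..., 3], target[..., 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    w1, h1 = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    w2, h2 = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    iou = inter / (w1 * h1 + w2 * h2 - inter + eps)

    # Width and height of the smallest enclosing box (w_c, h_c)
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])

    # Squared distance between box centers
    cx1, cy1 = (pred[..., 0] + pred[..., 2]) / 2, (pred[..., 1] + pred[..., 3]) / 2
    cx2, cy2 = (target[..., 0] + target[..., 2]) / 2, (target[..., 1] + target[..., 3]) / 2
    rho2 = (cx1 - cx2) ** 2 + (cy1 - cy2) ** 2

    return (1 - iou
            + rho2 / (cw ** 2 + ch ** 2 + eps)
            + (w1 - w2) ** 2 / (cw ** 2 + eps)
            + (h1 - h2) ** 2 / (ch ** 2 + eps))
```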

4. Data Preparation and Model Training

In the construction and training of deep learning models, the data preparation and precise model training settings are crucial. This section will detail the data collection, preprocessing methods, experimental environment, and specific parameter settings for training the potato bud eye detection model. The quality of the data and the effectiveness of preprocessing directly impact the model’s learning ability and final detection accuracy, while an appropriate training environment and parameter configuration are key to ensuring effective learning. This paper will also elaborate on the evaluation metrics used to measure the performance improvements of the model. Through these comprehensive measures, this study aims to develop a potato bud-eye-detection model with high robustness and strong generalization capabilities.

4.1. Data Collection and Preprocessing

In the field of potato seed bud eye detection, due to the lack of publicly available standard datasets, the dataset used in this study primarily comes from two sources: images obtained through market channels and those directly captured from purchased potatoes. The aim was to photograph under diverse environmental conditions to closely mimic the complex situations encountered in actual application scenarios. Given the significant impact of image quality on the detection effectiveness of bud eyes, we rigorously screened the collected images. We discarded those with excessive background noise and blurred pixels, ultimately selecting 845 high-quality images for this research. These images were then annotated using the open-source tool LabelImg, with bud eye areas precisely marked and labeled uniformly as “bud”, generating corresponding txt format label files for model training.
In target detection, model training accuracy often depends on a large volume of annotated image data. However, with limited dataset sizes, complex model designs can lead to overfitting. To address this, data augmentation techniques were employed to increase the number and diversity of the training samples, enhancing the model’s robustness and generalization ability. Specifically, augmentation included Gaussian noise, random rotations, translations, occlusions, and brightness adjustments, resulting in 3395 augmented images. Following this, the dataset was divided into training, validation, and test sets with a ratio of 7:1:2. Figure 8 displays some results of the data augmentation, and Table 1 details the dataset distribution.
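As an illustration only, a pipeline covering the augmentation categories listed above could be assembled with the albumentations library as sketched below; the specific transforms and parameter values used in this study are not implied, and the label field names are hypothetical.

```python
import albumentations as A

# Hypothetical augmentation pipeline covering the categories mentioned above
# (Gaussian noise, rotation, translation, occlusion, brightness adjustment).
augment = A.Compose(
    [
        A.GaussNoise(p=0.3),                                   # Gaussian noise
        A.Rotate(limit=30, p=0.5),                             # random rotation
        A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.0,
                           rotate_limit=0, p=0.5),             # translation
        A.CoarseDropout(p=0.3),                                # random occlusion
        A.RandomBrightnessContrast(p=0.5),                     # brightness adjustment
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Usage sketch: augmented = augment(image=image, bboxes=bboxes, class_labels=labels)
```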

4.2. Experimental Environment and Parameter Settings

The experimental setup for this study was as follows: The system operated on Ubuntu 20.04, equipped with a 12-core Intel Xeon(R) Platinum 8255C processor, 43 GB of RAM, and an NVIDIA GeForce RTX 3090 graphics card. The deep learning framework used was PyTorch 1.13.1, and the software environment included CUDA version 11.7 and Python 3.9.18.
To ensure the rigor and fairness of the experiments, the following training parameters were uniformly set: The input image size was fixed at 640 × 640 pixels; the initial learning rate was set at 0.01; the optimizer was configured as SGD with a momentum of 0.937 and weight decay coefficient of 0.0005. The batch size was set at 16, and the training was conducted over 300 epochs.
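For reference, these settings map onto the Ultralytics YOLOv8 Python training API roughly as sketched below; the dataset configuration file name (bud.yaml) is hypothetical, and the call is illustrative rather than the exact training script used in this study.

```python
from ultralytics import YOLO

# Illustrative training call mirroring the hyperparameters listed above.
model = YOLO("yolov8s.yaml")   # or a modified Bud-YOLOv8s model configuration
model.train(
    data="bud.yaml",           # hypothetical dataset config file
    imgsz=640,
    epochs=300,
    batch=16,
    optimizer="SGD",
    lr0=0.01,
    momentum=0.937,
    weight_decay=0.0005,
)
```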

4.3. Evaluation Metrics

To comprehensively assess the performance of the Bud-YOLOv8s model in the task of potato bud eye detection, we utilized precision (P), recall (R), average precision (AP), and mean average precision (mAP) as the accuracy metrics for the model. Additionally, GFLOPs, the number of parameters, and the model size were used to evaluate the model’s complexity and operational efficiency, with the parameter volume and model size measured in megabytes (MB). The definitions are as follows:
(1) Precision (P) is the ratio of correctly identified positive samples to the total number of samples identified as positive by the model.
$P = \frac{TP}{TP + FP}$
(2) Recall (R) refers to the ratio of the number of positive samples identified by the model to the actual number of positive samples.
$R = \frac{TP}{TP + FN}$
where $TP$ is the number of correctly detected bud eyes; $FP$ is the number of regions wrongly detected as bud eyes; and $FN$ is the number of missed bud eyes.
(3) Average precision (AP) is the area under the PR curve for a single category, where a higher AP value indicates a better model.
$AP = \int_0^1 P(R)\, dR$
(4) Mean average precision (mAP) is the average of the AP values across all categories, used to express the performance of multi-label detection. The higher the mAP, the better the model performance is.
$mAP = \frac{1}{n} \sum_{i=1}^{n} AP_i$
We used the mAP@0.5 and mAP@0.5:0.95 as the primary performance evaluation metrics. The former is the mean average precision when the IoU threshold is fixed at 0.5, and the latter is the average mAP value when the IoU threshold varies from 0.50 to 0.95.
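For illustration, the precision, recall, and AP definitions above can be computed from raw detection counts and a precision-recall curve as sketched below; this is a simplified sketch (plain trapezoidal integration) rather than the exact evaluation code of any particular toolkit.

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    """Precision and recall from detection counts, as defined above."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    return p, r

def average_precision(recalls: np.ndarray, precisions: np.ndarray) -> float:
    """Area under the precision-recall curve, assuming recalls are sorted
    in ascending order; simple trapezoidal approximation."""
    return float(np.trapz(precisions, recalls))

# mAP is the mean of the per-class AP values; with the single "bud" class used
# in this study, mAP equals AP.
```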

5. Experimental Results and Analysis

Model performance evaluation is a crucial step in verifying research hypotheses and assessing the effectiveness of technical improvements. In this section, we present a comparative performance analysis of the original YOLOv8s model and our proposed enhanced model, Bud-YOLOv8s, in the task of potato bud eye detection. Through a series of detailed experimental analyses, including generalization ability analysis, ablation studies, visualization heatmaps, and comparisons with other existing techniques, we will demonstrate the advantages of the Bud-YOLOv8s model in various aspects.

5.1. Analysis of Model Generalization before and after Improvement

In this study, both the original YOLOv8s and the improved Bud-YOLOv8s models were trained on the potato bud eye dataset. As illustrated in Figure 9, the improved model exhibited significant performance enhancements across various evaluation metrics, such as precision, recall, and mean average precision (mAP).
To visually demonstrate the detection performance and generalization ability of the improved model, we conducted a comparative analysis of the two models on the test set. The detection results for some images are shown in Figure 10. The results indicate that the original YOLOv8s model struggles to accurately identify bud eyes in low-light potato images due to the loss of detail information. For instance, in Figure 10e, the YOLOv8s model exhibits significant missed detections. This challenge is particularly evident in the edge regions of potatoes where the bud eye features are more blurred, leading to frequent missed detections by the original model. For example, in Figure 10f,g, the bud eyes in the edge regions of the potatoes are not accurately detected, with only the more prominent bud eyes being recognized. The Bud-YOLOv8s model is enhanced through the integration of FasterNet and the EMA mechanism, which optimizes the C2f module. This results in improved operational efficiency and adaptive feature channel importance, enabling the model to concentrate on the key features of potato bud eyes. As shown in Figure 10i–k, the Bud-YOLOv8s model excels not only in handling bud eye features in low-light images, but also in accurately detecting each bud eye in edge regions. This indicates that the improved model enhances the extraction of bud eye features, thereby effectively reducing missed detections. Additionally, in regions with dense bud eyes, the confusion and interference between features lead to severe overlapping of detection boxes with the original model, preventing precise localization of each bud eye. In Figure 10h, the detection boxes fail to accurately distinguish each closely adjacent bud eye. In contrast, the improved model uses the BiFPN feature fusion network to enhance the fusion of bud eye features at different scales and optimizes the boundary box regression with the EIoU loss function. This allows the model to accurately locate each bud eye even in densely populated areas, as shown in Figure 10l.

5.2. Comparison of Improvement Methods and Results

In this section, we present a comparative analysis of the performance of various improvement strategies applied to the backbone and neck networks of YOLOv8. This analysis aims to validate the effectiveness of our techniques in enhancing the model’s overall performance. The backbone network’s primary function is to extract deep features from the input image, while the neck network fuses and transmits these features. Consequently, improvements to these two components can significantly enhance the overall performance of the YOLOv8 model. We will detail the improvement methods for each component and evaluate their effects through experiments. To ensure the reliability and validity of our results, we will use a standardized potato eye dataset and the evaluation metrics described in Section 4.3.

5.2.1. Backbone Improvement Experiment

In this study, we performed an in-depth analysis and enhancement of the C2f module in the YOLOv8s backbone network, proposing several competitive improvement schemes to validate the superiority of the final technique. The specific comparative experiments included (1) the baseline YOLOv8s model, (2) replacing the bottleneck in C2f with the Dilated Re-parameterization Block from UniRepLKNet [44] (referred to as C2f-DRB), (3) enhancing C2f using the Diverse Branch Block [45] (referred to as C2f-DBB), (4) optimizing C2f with Deformable Attention [46] (referred to as C2f-DAttention), (5) improving C2f with the Faster Block from FasterNet (referred to as C2f-Faster), and (6) replacing C2f with a combination of the FasterNet and EMA modules (referred to as C2f-Faster-EMA). The experimental results are shown in Table 2.
In the backbone network improvement experiments, although C2f-DRB significantly reduced the model parameters and computational load, it showed a 1.8% drop in the mAP@0.5 compared to the baseline model, indicating a trade-off between accuracy and computational efficiency. C2f-DBB slightly improved model accuracy, but resulted in a significant increase in the parameters and computational load, making it unsuitable for agricultural production applications. In contrast, C2f-DAttention achieved improvements in accuracy, recall, and mAP, but still displayed mediocre overall performance. After optimizing the C2f structure with the Faster Block (C2f-Faster), the parameter count and computational complexity were reduced by 25.3% and 22.4%, respectively, with a 2.3% increase in the mAP@0.5, along with significant improvements in precision and recall. Finally, incorporating the EMA mechanism into C2f-Faster (C2f-Faster-EMA) led to a notable increase in recall and average precision with only a slight increase in parameter count. Therefore, our proposed C2f-Faster-EMA improvement scheme for YOLOv8s balances model lightweighting and performance, demonstrating its superiority.

5.2.2. Neck-End Improvement Experiment

In this experiment, we applied various feature pyramid networks (FPNs) to improve the neck network of YOLOv8s and conducted a comparative performance analysis. The models evaluated included the YOLOv8s baseline model, BiFPN, Efficient RepGFPN [47], Asymptotic feature pyramid network (AFPN) [48], and Rep-PAN [49].
As shown in Table 3, the model using BiFPN for the neck network achieved the highest precision, slightly exceeding the baseline YOLOv8s model and significantly outperforming the AFPN and Rep-PAN. Although the accuracy of RepGFPN was close to that of BiFPN, its computational efficiency was lower. Despite BiFPN’s mAP@0.5 being slightly lower than that of the baseline model, its detection accuracy remained high. Its overall performance was superior to that of the other feature fusion networks. In terms of computational efficiency metrics, specifically the number of parameters (Params) and floating-point operations per second (FLOPs), BiFPN demonstrated a significant advantage over YOLOv8s and the other models. Compared to the baseline model, the BiFPN-enhanced neck network showed a 33.8% reduction in the parameters and an 11.9% reduction in the floating-point operations. This indicates that BiFPN maintains high performance while offering improved computational efficiency. Additionally, the size of the BiFPN-enhanced model was reduced by 33.2% compared to the YOLOv8s baseline, further highlighting BiFPN’s suitability for deployment in resource-constrained environments. In conclusion, considering all metrics and the resource consumption, BiFPN exhibited the most balanced and efficient performance among the evaluated models.

5.3. Ablation Study and Comparative Analysis

To validate the effectiveness of the proposed improvement strategies and their specific impact on the performance of the original network, we conducted five sets of ablation experiments under the same dataset and hyperparameter configuration conditions. These experiments covered the following five model variants: the original YOLOv8s model (YOLOv8s), the C2f structure improved using FasterNet (YOLOv8s-F), the C2f structure further optimized with the combination of FasterNet and the EMA module (YOLOv8s-FM), replacing the original neck structure with the BiFPN feature fusion network (YOLOv8s-FMB), and using the EIoU loss function to optimize the bounding box regression process (YOLOv8s-FMBE). Each improvement aimed to enhance model efficiency, accuracy, or both. The detailed data from the ablation experiments are summarized in Table 4, with each metric representing the model’s performance on the test set.
According to Table 4, the following can be seen:
  • By using FasterNet to improve the C2f structure of YOLOv8s (YOLOv8s-F), the model’s precision (P) and recall (R) increased by 2.9% and 4.1%, respectively. The mean average precision (mAP@0.5 and mAP@0.5:0.95) also improved by 2.3% and 3.6%, respectively. This indicates that the FasterNet backbone significantly enhanced the spatial feature-extraction capabilities of the YOLOv8s backbone network. Additionally, the model’s parameter count and size decreased by 2.82 MB and 5.4 MB, respectively, and the computation cost was reduced by 7.0 GFLOPs. This demonstrates that FasterNet effectively replaced the redundant bottleneck structure in C2f by performing convolutions (PConv) on only a portion of the channels, thereby reducing computation costs and avoiding unnecessary memory access, which in turn improved the model’s operational efficiency.
  • Further optimizing the C2f structure with the EMA mechanism (YOLOv8s-FM) resulted in a slight increase in the parameter count and computation cost compared to YOLOv8s-F. Under this improvement, P remained at 97.4%, R increased to 95.4%, and the mAP@0.5 and mAP@0.5:0.95 increased by 0.5% and 1%, respectively. This indicates that the EMA mechanism effectively improved the feature recognition ability by optimizing weight allocation to focus more on the bud eye features in the image, validating its effectiveness in enhancing model performance.
  • Replacing the YOLOv8s neck structure with the BiFPN feature fusion network (YOLOv8s-FMB) improved the efficiency of feature fusion with its efficient network topology, resulting in reductions in the parameter count, size, and computation cost by 32.1%, 31.1%, and 11.3%, respectively. Although there were slight decreases in P, the mAP@0.5, and the mAP@0.5:0.95, they remained at high levels. This indicates that the BiFPN feature fusion network played a crucial role in enhancing model lightweighting and computational efficiency while maintaining excellent detection performance, increasing the model’s adaptability in resource-constrained environments.
  • Finally, using the EIoU loss function to optimize the bounding box regression task (YOLOv8s-FMBE) increased P to 97.8%, with a slight decrease in R. The mAP@0.5 improved by 0.6%, enhancing the detection box localization ability and achieving the best overall performance for the model.
To more intuitively compare the effects of the improvements in the ablation experiments, Figure 11 shows the mean average precision (mAP) trends of each model during training. The results indicate that the mAP of all models rises rapidly in the early stages of training and converges at around 200 epochs, with the improved YOLOv8s-FMBE model achieving the highest mAP.
Figure 12 illustrates the trends in bounding box loss for each model during training. It is evident that, with each successive improvement, the final bounding box loss values of the models decrease, indicating a continuous enhancement in the bounding box detection capability of the bud-eye-detection models throughout the improvement process. Notably, the final YOLOv8s-FMBE model not only has the lowest bounding box loss, but also shows the greatest reduction in loss. This suggests that the improvements made with the EIoU loss function have significantly optimized the model’s bounding box regression performance.
The analysis of the above results indicates that our improvement strategies not only significantly enhanced the detection performance, but also substantially reduced the model parameters and improved the computational efficiency. This makes the model more suitable for efficient and high-precision potato bud eye detection tasks.

5.4. Comparative Analysis of Visualization Heatmaps

To quantitatively assess the performance of the improved YOLOv8 model in recognizing features of potato seed bud eyes, this study utilized the extended Gradient-weighted Class Activation Mapping (XGrad-CAM) [50] technique to generate heatmaps that visualize the model’s focus areas, as illustrated in Figure 13.
XGrad-CAM, an enhanced version of Grad-CAM [51], modifies gradient contributions to provide more detailed visual outputs, thus enhancing our ability to analyze and interpret model predictive behavior. The visualization results show that the original model often biased its focus towards the epidermal information surrounding the bud eyes, increasing the risk of false positives and missed detections. In contrast, the improved model focuses more effectively on the bud eye information, successfully suppressing interference from the surrounding epidermis. The highlighted areas of bud eye features at the edges are more focused and pronounced in the improved model, suggesting that the C2f-Faster-EMA module enhances the Bud-YOLOv8s model’s attention to bud eye features, making it more sensitive and accurate in recognizing edge features. Additionally, in densely populated bud eye areas, the improved model displays clearer and more dispersed highlighted areas than the original model, indicating that BiFPN effectively integrates bud eye features of different scales, whereas the EIoU loss function enhances the model’s localization ability in dense areas. Consequently, Bud-YOLOv8s proves more effective in distinguishing and locating closely adjacent bud eyes.
Thus, through the visualization analysis enabled by XGrad-CAM technology, we have gained a deeper understanding of the subtle changes in the model’s focus on potato bud eye features, further validating the effectiveness of the improvements made to the Bud-YOLOv8s model.

5.5. Comparison of Experimental Results of Different Models

To further validate the effectiveness of the algorithm proposed in this study, we compared the improved model with current mainstream detection algorithms. The algorithms compared were YOLOv3, YOLOv3-tiny, YOLOv5s, YOLOv6s, YOLOv7-tiny, YOLOv8m, and the pre-improvement YOLOv8s. To ensure a fair comparison, all models were tested using the same dataset and in the same training environment. The experimental results are summarized in Table 5.
As indicated in Table 5, the improved model Bud-YOLOv8s demonstrates the highest recognition accuracy in the potato bud-eye-detection task. Compared to YOLOv3, YOLOv3-tiny, YOLOv5s, YOLOv6s, YOLOv7-tiny, and YOLOv8m, the mAP@0.5 improved by 4.6, 5.4, 3.7, 5.6, 5.2, and 3.3 percentage points, respectively, while the mAP@0.5:0.95 increased by 0.3, 14.5, 7.5, 11.1, 22.8, and 0.6 percentage points, respectively. Furthermore, in terms of model memory consumption and the number of parameters, Bud-YOLOv8s exhibits lower values than the comparative models.
To further illustrate the detection performance of the enhanced model relative to other models, we randomly selected images from the test set for inference; the detection results are depicted in Figure 14. The figure shows that, while YOLOv3, YOLOv5s, and the baseline YOLOv8s model can identify most bud eye targets, they tend to overlook targets with less prominent features. The other detection models exhibited more severe missed detections and frequently misidentified potato skin features as bud eyes. With the incorporation of the C2f-Faster-EMA network module, the Bud-YOLOv8s model focuses more sharply on subtle bud eye features and therefore detects bud eyes more effectively, particularly at the edges. Additionally, with the improvements to the feature fusion network and the loss function, Bud-YOLOv8s significantly enhances its localization capability in dense bud eye areas. Overall, the Bud-YOLOv8s model demonstrates a clear performance advantage in potato bud-eye-detection tasks, improving both detection precision and robustness.
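A sketch of how such test-set inference might be run with the Ultralytics API is shown below; the weight path, image folder, and confidence threshold are illustrative assumptions, not the authors' exact settings:

```python
from ultralytics import YOLO

# Hypothetical path to the trained Bud-YOLOv8s weights and to the test images.
model = YOLO("runs/detect/train/weights/best.pt")
results = model.predict(source="test_images/", conf=0.25, imgsz=640, save=True)

for r in results:
    # r.boxes holds the detected bud-eye boxes: corner coordinates and confidences.
    print(r.path, r.boxes.xyxy.tolist(), r.boxes.conf.tolist())
```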

6. Conclusions

This study introduces a potato bud-eye-detection algorithm based on an improved YOLOv8s model, aimed at enhancing the accuracy and efficiency of potato bud eye detection. First, we integrated the partial convolution concept from FasterNet with the EMA attention mechanism to reduce redundancy in the C2f module and enhance the model's focus on bud eye features, thereby improving computational efficiency and feature-extraction capability. Second, we improved the neck of YOLOv8s with a BiFPN feature fusion network, whose multi-scale feature fusion mechanism better integrates bud eye features of different sizes while simplifying the network structure. Finally, the EIoU loss function was used to optimize the bounding box regression process, enhancing the localization ability of the predicted boxes.
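To make the first of these components concrete, the snippet below sketches the partial convolution (PConv) operator from FasterNet [39] in PyTorch: only a fraction of the channels is convolved, and the rest are passed through unchanged, which is the source of the parameter and FLOP savings. The 1/4 split ratio and the module name are illustrative; the actual Faster Block-EMA further adds pointwise convolutions and the EMA attention module.

```python
import torch
import torch.nn as nn

class PartialConv(nn.Module):
    """Partial convolution: apply a 3x3 conv to 1/n_div of the channels
    and pass the remaining channels through unchanged."""

    def __init__(self, channels: int, n_div: int = 4):
        super().__init__()
        self.conv_ch = channels // n_div
        self.pass_ch = channels - self.conv_ch
        self.conv = nn.Conv2d(self.conv_ch, self.conv_ch, kernel_size=3,
                              padding=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.conv_ch, self.pass_ch], dim=1)
        return torch.cat((self.conv(x1), x2), dim=1)

# Quick shape check: a 64-channel feature map keeps its shape,
# but only 16 of its channels pass through the 3x3 convolution.
out = PartialConv(64)(torch.randn(1, 64, 80, 80))
print(out.shape)   # torch.Size([1, 64, 80, 80])
```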
In the experiments, we conducted generalization tests, ablation studies, and comparative analysis with visualization heatmaps to verify the effectiveness of the improved algorithm. Compared to the baseline YOLOv8s model, the improved algorithm achieved a mean average precision (mAP@0.5) of 98.1% on the potato bud eye dataset, an increase of 3.1 percentage points. At the same time, the model's parameter count was reduced by 49.1%, its computational cost by 31.1%, and its size to 11.1 MB, a decrease of 48.1%. These improvements indicate that our algorithm offers higher detection accuracy and lower hardware requirements in practical applications. Additionally, compared to current mainstream detection models, our model demonstrates superior performance, laying the foundation for implementing intelligent seed potato cutting technology.
Currently, our improved algorithm has primarily been tested in indoor environments, and its performance in the complex and variable natural conditions outdoors remains to be fully validated. Furthermore, variations in bud eye morphology and characteristics across different varieties and growth stages may affect the algorithm’s generalization ability. To address these limitations, we plan to introduce more diverse datasets in future work and verify the algorithm’s stability under varying environmental conditions, while continuing to explore more advanced network structures and optimization methods to further enhance the model’s generalization capability and robustness.

Author Contributions

Conceptualization, methodology, and supervision, W.L.; software, validation, and visualization, W.L. and Z.L.; formal analysis, W.L.; investigation, data curation, and resources, W.L., S.Z., T.Q. and J.Z.; writing—original draft preparation, W.L.; writing—review and editing, W.L.; funding acquisition, Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China (2022YFE0107300), and the Key R&D Program of Shandong Province (Soft Science Project) (2023RKY01010).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

The authors would like to thank the anonymous reviewers for their constructive comments and recommendations.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Cui, Y.; Chen, Y. An Analysis of Supply and Demand of Potato and Its Products in the World and China. J. Hebei Agric. Univ. (Soc. Sci.) 2024, 26, 59–69.
2. Wang, Z.J.; Liu, H.; Zeng, F.K.; Yang, Y.C.; Xu, D.; Zhao, Y.C.; Liu, X.F.; Kaur, L.; Liu, G.; Singh, J. Potato Processing Industry in China: Current Scenario, Future Trends and Global Impact. Potato Res. 2023, 66, 543–562.
3. Li, Z.; Wen, X.; Lyu, J.; Li, J.; Yi, S.; Qiao, D. Analysis and Prospect of Research Progress on Key Technologies and Equipments of Mechanization of Potato Planting. Trans. Chin. Soc. Agric. Mach. 2019, 50, 1–16.
4. Gao, Q.; Gao, A.; Meng, Y. Research Status and Development Trend of Potato Planter. For. Mach. Woodwork. Equip. 2023, 51, 11–14.
5. Wang, X.; Zhu, S.; Li, X.; Li, T.; Wang, L.; Hu, Z. Design and Experiment of Directional Arrangement Vertical and Horizontal Cutting of Seed Potato Cutter. Trans. Chin. Soc. Agric. Mach. 2020, 51, 334–345.
6. Feng, W.; Li, P.; Zhang, X.; Zhong, W.; Wang, P.; Cui, J. Design and Experiment of Intelligent Cutting Machine for Potato Seed. J. Agric. Mech. Res. 2022, 44, 124–129+134.
7. Li, Y.; Wang, N.; Pu, L.; Guo, Y. Research Status of Potato Planting Machinery at Home and Abroad. Agric. Eng. 2022, 12, 15–20.
8. Ünal, Z.; Kızıldeniz, T. Smart agriculture practices in potato production. In Potato Production Worldwide; Academic Press: Cambridge, MA, USA, 2023; pp. 317–329.
9. Liang, L.; Mao, L.; Gao, N. Recognition and Location Method of Potato Image Bud Eye Based on Kirsch Operator and Mathematical Morphology. Microcomput. Appl. 2024, 40, 92–95.
10. Tian, H.; Zhao, J.; Pu, F. A Method for Recognizing Potato's Bud Eye. Acta Agric. Zhejiangensis 2016, 28, 1947–1953.
11. Xi, R.; Hou, J.; Lou, W. Potato Bud Detection with Improved Faster R-CNN. Trans. ASABE 2020, 63, 557–569.
12. Chen, Z.; Zhang, W.; Zhang, T.; Xu, Y.; Lü, Z.; Lei, C. Potato Seed Tuber Sprout Eye Detection Based on YOLOv3 Algorithm. J. Agric. Mech. Res. 2022, 44, 19–23+30.
13. Reis, D.; Kupec, J.; Hong, J.; Daoudi, A. Real-Time Flying Object Detection with YOLOv8. arXiv 2023, arXiv:2305.09972.
14. Li, Y.; Li, T.; Niu, Z.; Wu, Y.; Zhang, Z.; Hou, J. Potato bud eyes recognition based on three-dimensional geometric features of color saturation. Trans. Chin. Soc. Agric. Eng. 2018, 34, 158–164.
15. Lyu, Z.; Qi, X.; Zhang, W.; Liu, Z.; Zheng, W.; Mu, G. Buds Recognition of Potato Images Based on Gabor Feature. J. Agric. Mech. Res. 2021, 43, 203–207.
16. Zhang, J.; Yang, T. Potato Bud Eye Recognition Based on LBP and SVM. J. Shandong Agric. Univ. (Nat. Sci. Ed.) 2020, 51, 5.
17. Rui, X.; Jialin, H.; Licheng, L. Fast segmentation on potato buds with chaos optimization-based K-means algorithm. Trans. Chin. Soc. Agric. Eng. 2019, 35, 190–196.
18. Yang, Y.; Zhao, X.; Huang, M.; Wang, X.; Zhu, Q. Multispectral image based germination detection of potato by using supervised multiple threshold segmentation model and Canny edge detector. Comput. Electron. Agric. 2021, 182, 106041.
19. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
20. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
21. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
22. Xi, R.; Jiang, K.; Zhang, W.; Lyu, Z.; Hou, J. Recognition Method for Potato Buds Based on Improved Faster R-CNN. Trans. Chin. Soc. Agric. Mach. 2020, 51, 216–223.
23. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 21–37.
24. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
25. Shi, F.; Wang, H.; Huang, H. Research on potato buds detection and recognition based on convolutional neural network. J. Chin. Agric. Mech. 2022, 43, 159–165.
26. Huang, J.; Wang, X.; Wu, H.; Liu, S.; Yang, X.; Liu, W. Detecting potato seed bud eye using lightweight convolutional neural network (CNN). Trans. Chin. Soc. Agric. Eng. 2023, 39, 172–182.
27. Li, H.; Feng, Q.; Yang, S. YOLOx-based Potato Bud Eye Recognition Detection. Agric. Equip. Veh. Eng. 2024, 62, 12–17.
28. Zhang, W.; Zeng, X.; Liu, S.; Mu, G.; Zhang, H.; Guo, Z. Detection Method of Potato Seed Bud Eye Based on Improved YOLO v5s. Trans. Chin. Soc. Agric. Mach. 2023, 54, 260–269.
29. Zhang, W.; Zhang, H.; Liu, S.; Zeng, X.; Mu, G.; Zhang, T. Detection of Potato Seed Buds Based on an Improved YOLOv7 Model. Trans. Chin. Soc. Agric. Eng. 2023, 39, 148–158.
30. Jiao, Y.; Xing, L. Vehicle Target Detection Research Based on Enhanced YOLOv8. In Proceedings of the 2024 4th International Conference on Neural Networks, Information and Communication Engineering (NNICE), Guangzhou, China, 19–21 January 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1427–1432.
31. Sharma, N.; Baral, S.; Paing, M.P.; Chawuthai, R. Parking Time Violation Tracking Using YOLOv8 and Tracking Algorithms. Sensors 2023, 23, 5843.
32. Li, Y.; Fan, Q.; Huang, H.; Han, Z.; Gu, Q. A Modified YOLOv8 Detection Network for UAV Aerial Image Recognition. Drones 2023, 7, 304.
33. Terven, J.; Córdova-Esparza, D.M.; Romero-González, J.A. A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716.
34. Wu, X.; Sahoo, D.; Hoi, S.C. Recent advances in deep learning for object detection. Neurocomputing 2020, 396, 39–64.
35. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475.
36. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
37. Lin, T.Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
38. Geng, H.; Liu, Z.; Jiang, J.; Fan, Z.; Li, J. Embedded Road Crack Detection Algorithm Based on Improved YOLOv8. J. Comput. Appl. 2024, 44, 1613–1618.
39. Chen, J.; Kao, S.H.; He, H.; Zhuo, W.; Wen, S.; Lee, C.H.; Chan, S.H.G. Run, Don't Walk: Chasing Higher FLOPS for Faster Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 12021–12031.
40. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–5.
41. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790.
42. Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157.
43. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. Proc. AAAI Conf. Artif. Intell. 2020, 34, 12993–13000.
44. Ding, X.; Zhang, Y.; Ge, Y.; Zhao, S.; Song, L.; Yue, X.; Shan, Y. UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio Video Point Cloud Time-Series and Image Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 5513–5524.
45. Ding, X.; Zhang, X.; Han, J.; Ding, G. Diverse Branch Block: Building a Convolution as an Inception-Like Unit. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 10886–10895.
46. Xia, Z.; Pan, X.; Song, S.; Li, L.E.; Huang, G. DAT++: Spatially Dynamic Vision Transformer with Deformable Attention. arXiv 2023, arXiv:2309.01430.
47. Xu, X.; Jiang, Y.; Chen, W.; Huang, Y.; Zhang, Y.; Sun, X. DAMO-YOLO: A Report on Real-Time Object Detection Design. arXiv 2022, arXiv:2211.15444.
48. Yang, G.; Lei, J.; Zhu, Z.; Cheng, S.; Feng, Z.; Liang, R. AFPN: Asymptotic Feature Pyramid Network for Object Detection. In Proceedings of the 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Honolulu, HI, USA, 1–4 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 2184–2189.
49. Weng, K.; Chu, X.; Xu, X.; Huang, J.; Wei, X. EfficientRep: An Efficient Repvgg-style ConvNets with Hardware-aware Neural Network Design. arXiv 2023, arXiv:2302.00386.
50. Fu, R.; Hu, Q.; Dong, X.; Guo, Y.; Gao, Y.; Li, B. Axiom-Based Grad-CAM: Towards Accurate Visualization and Explanation of CNNs. arXiv 2020, arXiv:2008.02312.
51. Selvaraju, R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. Int. J. Comput. Vis. 2020, 128, 336–359.
Figure 1. YOLOv8s network architecture.
Figure 2. Schematic diagram of Bud-YOLOv8s network architecture.
Figure 3. The C2f-Faster-EMA and Faster Block-EMA network structures are shown in (a,b).
Figure 4. Schematic diagram of the FasterNet architecture.
Figure 5. Schematic structure of EMA module.
Figure 6. Schematic diagram of feature fusion network structure. (a) FPN structure. (b) PANet structure.
Figure 7. Network structure of BiFPN.
Figure 8. Exemplary illustrations of data augmentation.
Figure 9. Comparative analysis of training metrics curves before and after model improvement.
Figure 10. Comparative illustration of detection results before and after model improvement. (a–d) The original images. (e–h) The detection results of the baseline YOLOv8s model. (i–l) The detection results of the improved Bud-YOLOv8s model.
Figure 11. Comparison of mAP curves in ablation experiments.
Figure 12. Comparison of training loss curves in ablation experiments.
Figure 13. XGrad-CAM images of YOLOv8s and Bud-YOLOv8s.
Figure 14. Comparison of detection results of different algorithms.
Table 1. Distribution of potato image datasets.

| Dataset | Original Images | Augmented Images |
|---|---|---|
| Training Set | 592 | 2376 |
| Validation Set | 85 | 339 |
| Test Set | 168 | 680 |
| Total | 845 | 3395 |
| Image Input Size | 640 × 640 × 3 | 640 × 640 × 3 |
Table 2. Comparison of improved performance in different backbone networks.

| Model | Precision (%) | Recall (%) | mAP@0.5 (%) | Params (M) | FLOPs (G) | Size (MB) |
|---|---|---|---|---|---|---|
| YOLOv8s | 94.5 | 90.5 | 95.0 | 11.14 | 28.6 | 21.4 |
| C2f-DRB | 92.9 | 88.1 | 93.2 | 9.37 | 24.6 | 18.1 |
| C2f-DBB | 94.3 | 90.9 | 95.3 | 16.20 | 40.4 | 34.5 |
| C2f-DAttention | 96.2 | 91.1 | 95.7 | 11.40 | 28.9 | 21.9 |
| C2f-Faster | 97.4 | 94.6 | 97.3 | 8.32 | 21.6 | 16.0 |
| C2f-Faster-EMA | 97.4 | 95.4 | 97.8 | 8.35 | 22.2 | 16.1 |
Table 3. Comparison of improved performance in different feature pyramid networks.

| Model | P (%) | R (%) | mAP@0.5 (%) | Params (M) | FLOPs (G) | Size (MB) |
|---|---|---|---|---|---|---|
| YOLOv8s | 94.5 | 90.5 | 95.0 | 11.14 | 28.6 | 21.4 |
| YOLOv8s + AFPN | 93.2 | 86.9 | 92.3 | 8.87 | 25.4 | 17.2 |
| YOLOv8s + Rep-PAN | 92.6 | 88.9 | 93.1 | 10.59 | 26.7 | 20.5 |
| YOLOv8s + RepGFPN | 95.0 | 87.8 | 94.0 | 12.25 | 29.7 | 23.6 |
| YOLOv8s + BiFPN | 95.1 | 89.9 | 94.7 | 7.37 | 25.2 | 14.3 |
Table 4. Ablation experiment results.

| Model | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Params (M) | FLOPs (G) | Size (MB) |
|---|---|---|---|---|---|---|---|
| YOLOv8s | 94.5 | 90.5 | 95.0 | 73.1 | 11.14 | 28.6 | 21.4 |
| YOLOv8s-F | 97.4 | 94.6 | 97.3 | 76.7 | 8.32 | 21.6 | 16.0 |
| YOLOv8s-FM | 97.4 | 95.4 | 97.8 | 77.7 | 8.35 | 22.2 | 16.1 |
| YOLOv8s-FMB | 97.3 | 95.6 | 97.7 | 77.4 | 5.67 | 19.7 | 11.1 |
| YOLOv8s-FMBE | 97.8 | 95.5 | 98.1 | 77.6 | 5.67 | 19.7 | 11.1 |
Table 5. Comparison of experimental results of different models.

| Model | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Params (M) | FLOPs (G) | Size (MB) |
|---|---|---|---|---|---|---|---|
| YOLOv3 | 94.5 | 86.8 | 93.5 | 77.3 | 10.37 | 283.0 | 198 |
| YOLOv3-tiny | 92.1 | 88.2 | 92.7 | 63.1 | 12.13 | 19.0 | 23.2 |
| YOLOv5s | 95 | 89.7 | 94.4 | 70.1 | 9.12 | 24.0 | 35.3 |
| YOLOv6s | 91.4 | 66.5 | 92.5 | 66.5 | 16.31 | 44.2 | 31.3 |
| YOLOv7-tiny | 91.7 | 88.2 | 92.9 | 54.8 | 6.01 | 13.0 | 11.6 |
| YOLOv8s | 94.5 | 90.5 | 95.0 | 73.1 | 11.14 | 28.6 | 21.4 |
| YOLOv8m | 95 | 89.7 | 94.8 | 77.0 | 25.89 | 79.1 | 49.6 |
| Ours | 97.8 | 95.5 | 98.1 | 77.6 | 5.67 | 19.7 | 11.1 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
