1. Introduction
Agriculture continues to play a crucial role in economies relying on agricultural activities [1], particularly in countries like Morocco, where it constituted approximately 10.33% of the Gross Domestic Product (GDP) as of 2022 [2]. The sector is a significant source of employment in Morocco, providing nearly 33.3% of jobs and accounting for over 23% of exports [3]. Fresh strawberries rank among the top 10 most exported fruits and vegetables from the country: revenue from strawberry production consistently ranges from USD 40 to USD 70 million per year and is still increasing by an average of 3% annually [4]. However, the production of such soft fruits faces a major challenge from pest infestation and diseases, whose management can account for nearly half of the production costs [5].
Recently, innovative techniques have been developed to improve agricultural practices, giving birth to the term “precision agriculture” [6]. The introduction of new tools and technologies, such as sophisticated irrigation systems and modern agricultural machinery, has significantly increased productivity and the ability to feed a growing population. However, plant detection over large areas remains a challenging task for farmers, particularly for small plants such as strawberry plants. This issue has prompted numerous studies aimed at finding solutions, including research on plant segmentation [7], fruit counting [8], and other related approaches [9].
In red fruit industries, such as the tomato industry [10,11], the integration of robotics with machine learning algorithms enables a more precise, efficient, and sustainable approach, assisting farmers in making informed decisions to enhance yields, optimize the use of agricultural inputs, and reduce their environmental impact [12]. These human-like, perceptive, high-capability machines have shown great potential, especially in detection tasks. Malik et al. [10] introduced an advanced yet easy-to-embed algorithm for detecting ripe tomatoes, utilizing enhancements in the HSV (Hue, Saturation, Value) color space and watershed segmentation. This approach reached a detection accuracy of up to 81.6% and was designed to be applicable to all red fruits. However, such methods, despite acceptable performance under certain data acquisition conditions, do not generalize effectively. Hence, recent studies have concentrated on universal feature extraction methods to address the constraints of traditional image detection algorithms.
To the best of our knowledge, Sa et al. [13] pioneered the application of deep learning networks for fruit detection. These techniques are known for their ability to learn complex features autonomously and can be used to build end-to-end non-linear models with high precision for detection and other applications. One such model was designed by Lamb et al. [14], who used a convolutional neural network for strawberry fruit detection. Their network was based on three sets of convolutional layers (classification, detection, and prediction) and could be embedded in a Raspberry Pi 3B. The detection speed achieved by this model was 1.63 frames per second, with an average precision (AP) of 84.2%. Another example was introduced by Yu et al. [15], who employed the widely recognized Mask R-CNN framework [16] with a ResNet-50 backbone [17] combined with a feature pyramid network (FPN) [18] for feature extraction, attaining an AP of 95.78% for strawberry fruits in an unstructured environment. This method handles overlapping and hidden fruits under varying illumination intensities. Moreover, the processing time for 640 × 480 pixel images was just 0.13 s on a high-performance computer. However, these multi-stage detection methods demand substantial resources for region proposal selection, which limits detection speed. Consequently, they are not suitable for real-time field detection applications.
Delving further into the realm of real-time detection, Redmon et al. introduced the YOLO (You Only Look Once) architecture [19]. Unlike traditional methods that predict a proposal and subsequently refine it, YOLO directly predicts the final detections by dividing the image into a grid and performing a prediction for each grid cell. This makes the YOLO network the most representative real-time-capable network [20]. Currently, there are 10 versions that have improved upon the performance of the original YOLO. YOLOv3 was the first YOLO model to be widely adopted in real-time applications. It demonstrates remarkable results in detecting small objects thanks to its deep architecture, which incorporates prediction layers at three scales. Prico et al. [21] proposed a real-time weed detection system for green onion crops using YOLOv3. The system reached an AP of 93.81% and an F1 score of 0.94 on a five-minute UAV video with a resolution of 864 × 688 pixels using a high-performance computer. In another study, Liu et al. [22] used a circular bounding box with the YOLOv3 model, achieving a mean AP of 0.96 on 3648 × 2056 pixel images at a detection speed of 0.054 s per image.
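The grid-based prediction scheme described above can be illustrated with a minimal sketch of how a YOLOv2/v3-style head decodes one grid cell's raw outputs into an image-space box. The grid size, anchor dimensions, and raw values below are hypothetical and chosen only for illustration; they are not taken from this paper.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def decode_cell(tx, ty, tw, th, cell_x, cell_y, grid, anchor_w, anchor_h, img_size):
    """Decode one grid cell's raw outputs (tx, ty, tw, th) into an
    image-space box (cx, cy, w, h): sigmoid offsets keep the centre
    inside the cell, and the anchor is scaled exponentially."""
    stride = img_size / grid              # pixels covered by one cell
    cx = (cell_x + sigmoid(tx)) * stride  # box centre, x (pixels)
    cy = (cell_y + sigmoid(ty)) * stride  # box centre, y (pixels)
    w = anchor_w * math.exp(tw)           # box width (pixels)
    h = anchor_h * math.exp(th)           # box height (pixels)
    return cx, cy, w, h

# Hypothetical raw outputs for the cell at (5, 7) of a 13 x 13 grid
# on a 416 x 416 input, with a 116 x 90 anchor.
box = decode_cell(0.2, -0.1, 0.4, 0.3, 5, 7, 13, 116, 90, 416)
```

Because every cell is decoded independently in a single forward pass, no separate region-proposal stage is needed, which is what gives the family its real-time character.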
Regarding strawberry-related applications, He et al. [23] applied the YOLOv4 network for strawberry maturity detection and localization for robotic harvesting in RGB images. Their approach employed a two-stage deep learning methodology utilizing YOLOv4 and YOLOv4-tiny models to identify mature strawberries and estimate their centers from both RGB and depth images. They reached an AP of 91.73% with a processing speed of 55.19 ms per image for maturity detection and a mean AP of 86.45% with a 4.16 ms processing time per image for strawberry localization. Xie et al. [24] proposed a method for the rapid and accurate identification of strawberry fruit in greenhouses using an improved YOLOv5s approach. Their method, called YOLO-SR (YOLO–Strawberry Ripeness), replaced the C3 module with a lightweight multi-path aggregation network based on Channel Shuffle, enhancing feature extraction capability. This method achieved a mAP of 94.3% and an F1 score of 93.7%. However, running these algorithms in a practical setting on small robots is very difficult because the YOLO network needs a powerful GPU (Graphics Processing Unit) with over 4 gigabytes of memory, which most embedded boards cannot provide [20].
As a solution, researchers have developed lightweight versions of the YOLO network, introducing the term YOLO-tiny. In their study, Zhang et al. [25] designed and implemented a lightweight deep neural network (RTSD-Net) for real-time strawberry detection on embedded devices such as the Jetson Nano. By reducing the number of convolutional layers and the cross-stage partial network of a YOLOv4-tiny architecture, and by using TensorRT to accelerate the model, they achieved a mean AP of 82.44% on an aerial strawberry dataset at a processing speed of 25.2 frames per second (FPS). These results were obtained on a dataset comprising only three types of objects with a uniform image background, which may not be representative of all scenarios. Additionally, the YOLO-tiny network architecture has fewer layers and parameters, allowing for quicker inference but increasing the risk of information loss in complex environments [26]. To the authors’ best knowledge, there have been no previous studies on the use of YOLO models for strawberry plant detection in complex environments for pesticide spraying. Consequently, this paper explores the possibility of using the YOLO model for detecting strawberry plants in such applications.
The pesticide-spraying process is a complex task that requires instant decision-making by the robot during a survey [27], in contrast to applications such as yield prediction [28] or disease detection [29], which can be performed offline using a high-performance computer. Consequently, only a limited set of detection models is suitable. This study highlights the strengths and weaknesses of YOLO versions already implemented in robotic platforms for instant decision-making, offering a data-driven basis for selecting the most suitable model for pesticide-spraying robots.
For embedded systems with limited computational resources, YOLOv3 is the most suitable YOLO version in terms of memory and computational requirements compared with later versions. Several studies [11,20,22,30] used embedded YOLO and confirmed a good balance between speed and accuracy for YOLOv3 in real-time applications, even though newer versions existed. Based on that, this study aims to develop a real-time plant detection algorithm that can be implemented in an agricultural pesticide-spraying robot using an NVIDIA Jetson TX2 GPU. The developed method, based on YOLOv3, was specifically applied to strawberry plants within a greenhouse environment. To assess the effectiveness of our implementation, we used the Open Neural Network Exchange (ONNX) representation. To demonstrate the performance of our proposal, a set of real-world experiments was carried out in a didactic strawberry greenhouse, which revealed that the proposed model can reach a mAP of over 97% with a processing time of 15 ms per frame.
This paper is organized as follows: Section 2 presents a brief discussion of the robot and the YOLOv3 model used in our application. The main results are detailed in Section 3, and a section on the main conclusions ends this paper.
4. Results and Discussion
The performance of the best model extracted from the 10 different executions on the validation set is listed in Table 5, indicating a precision of 0.73 and a recall of 0.95, leading to an F1-score of 0.83 for the YOLOv3 model. YOLOv3-tiny showed weaker results, with a precision of 0.60 and a recall of 0.71, leading to an F1-score of 0.65. These results are expected given the small training dataset and the reduced number of layers in the tiny version. An F1-score of 0.83 indicates that YOLOv3 achieves a good balance between precision and recall on this dataset, showing its efficiency in identifying true positives while minimizing false positives.
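The reported F1-scores can be checked directly, since the F1-score is simply the harmonic mean of precision and recall:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Validation metrics reported in Table 5
f1_yolov3 = f1_score(0.73, 0.95)  # rounds to 0.83
f1_tiny = f1_score(0.60, 0.71)    # rounds to 0.65
```

Both values agree with the scores reported above, confirming the internal consistency of Table 5.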
Based on that, real-world experiments were conducted using the strawberry plants in the greenhouse to verify the model’s generalization performance. First, we performed an in-lab test using the tools shown in Figure 7, including the NVIDIA Jetson TX2 board, a web camera, a monitor for visualization, a keyboard, and a mouse. To assess the effectiveness of our implementation, the ONNX representation was used. To assess the accuracy, suitability, and consistency of the proposed model, three alternative contemporary scenarios were trained and tested on the same datasets: traditional YOLOv3, YOLOv3-tiny, and YOLOv3-tiny with ONNX. The experiment was performed using two types of data, i.e., recorded video and live video. We encountered certain challenges with the robot, as it is still in the development stage, so we conducted a simulation utilizing the tools illustrated in Figure 7, positioning the camera at an optimal altitude to cover two rows of the plantation.
Table 6 summarizes numerical evidence on the performance of the model compared with the other scenarios. The model achieves a mAP of more than 97% with a processing time of about 15 milliseconds per frame.
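As a rough sanity check on the real-time claim (assuming the reported per-frame latency covers the full inference pipeline), a processing time of about 15 ms per frame corresponds to a throughput well above common real-time video rates:

```python
def throughput_fps(ms_per_frame: float) -> float:
    """Convert a per-frame processing time in milliseconds
    to an equivalent throughput in frames per second."""
    return 1000.0 / ms_per_frame

fps = throughput_fps(15.0)  # well above a typical 30 FPS camera rate
```

This margin leaves headroom for the spraying controller to consume detections between frames.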
The results shown in Table 6 demonstrate real-time plant detection with a mAP above 96%, which reveals the effectiveness and utility of the proposed model. Moreover, it shows superior performance compared with the other three scenarios for both types of data. When considering processing speed, however, YOLOv3-tiny can achieve faster rates thanks to its balance between detection speed and accuracy, particularly when utilizing the ONNX representation. Nevertheless, in applications such as robotic pesticide spraying, detection accuracy holds greater significance, as the robot must spray precisely without missing substantial areas of the plants. Additionally, YOLOv3-tiny versions are characterized by compact networks and fewer prediction stages, which risks information loss on objects of various sizes, given the varying sizes of the plants. This issue is particularly significant in our case, where a limited dataset was available to train the model. Furthermore, at typical robot navigation speeds, a difference of 2 FPS is less critical than an approximate 6% increase in accuracy. It is important to note that these results were obtained by running the models on the NVIDIA Jetson TX2 board rather than a computer.
Figure 8 shows examples of online real-time detection in the greenhouse environment under different daytime illumination conditions in the morning and afternoon.
It is also noticeable that the mAP is affected by the type of data used, possibly because of variations in acquisition resolution. As illustrated in Table 6, testing the model on higher-resolution videos yields superior results across all scenarios. This improvement could also be attributed to the stability of the acquisition process; for instance, the recorded video exhibits less vibration than the live video feed. Furthermore, the quality of the training data may contribute to these outcomes. In addition, greater flexibility and diversity in factors such as angle, lighting conditions, focal length, sensitivity, and exposure duration during image capture result in a larger volume of collected data, and employing a variety of data augmentation techniques enhances the effectiveness of model training, leading to higher detection accuracy. In summary, the results of this study indicate that YOLOv3 with the ONNX representation notably enhances detection accuracy. Additionally, it significantly boosts image processing speed, which makes it suitable for implementation on embedded boards such as the one used in this study.
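For reference, the mAP values discussed throughout this section are means of per-class average precision (AP), each computed as the area under a precision-recall curve. A minimal sketch of the common all-points interpolated AP computation is shown below; the precision-recall values are hypothetical and serve only to illustrate the procedure, not to reproduce Table 6.

```python
def average_precision(recalls, precisions):
    """Area under a precision-recall curve using all-points
    interpolation: precision is made monotonically non-increasing
    from right to left, then the area is summed over recall steps.
    `recalls` must be sorted in increasing order."""
    r = [0.0] + list(recalls) + [1.0]   # pad the recall axis
    p = [0.0] + list(precisions) + [0.0]
    # Enforce a non-increasing precision envelope (right to left)
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the area over segments where recall changes
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

# Hypothetical precision-recall curve for one class
ap = average_precision([0.2, 0.5, 0.8, 0.95], [1.0, 0.9, 0.8, 0.7])
```

The mAP is then the mean of such AP values over all object classes in the dataset.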