Open Access Paper
YOLO-pest: a real-time multi-class crop pest detection model
Shifeng Dong, Jie Zhang, Fenmei Wang, Xiaodong Wang
Proceedings Volume 12260, International Conference on Computer Application and Information Security (ICCAIS 2021); 1226003 (24 May 2022). https://doi.org/10.1117/12.2637467
Event: International Conference on Computer Application and Information Security (ICCAIS 2021), 2021, Wuhan, China
Abstract
Crop pest control is important for crop yield, but the large number of pest classes and their high similarity in appearance make precise pest recognition challenging. In recent years, deep-learning based object detection algorithms have achieved excellent results; the YOLO detector, for example, balances accuracy and speed. However, YOLO performs well on normal-size objects but has low precision on small objects, and its accuracy drops notably on pest datasets, which feature large scale variation and many classes. To address multi-scale pest detection, we propose a detector named YOLO-pest, based on YOLOv4, to improve pest detection performance. Our approach uses the lightweight but efficient MobileNetv3 backbone and a lite fusion feature pyramid network. The improved detector significantly increases accuracy while maintaining fast detection speed. Experiments on our constructed Croppest12 dataset show that the improved algorithm outperforms the compared methods.

1. INTRODUCTION

Crop pests have a significant impact on crop yields and the agricultural economy. Controlling them requires distinguishing pest categories so that precise treatment can be applied. However, there are many pest categories with highly similar morphologies, and non-specialists cannot reliably distinguish them. Traditional pest recognition relies mainly on human experience; it is inaccurate and labor-intensive, which undermines precision pest control. It is therefore essential to develop a method that detects multi-class crop pests accurately and in real time.

In recent years, with the development of deep convolutional neural networks (DCNNs), object detection has made great progress [1, 2]. DCNN-based detection algorithms extract pest features automatically, eliminating the subjectivity of manual feature engineering [3], and can thus accurately identify the species and number of pests. However, most existing recognition methods are designed for generic images collected from the Internet as training datasets [4, 5], and the major advances in object detection have been made on such common datasets. Among these methods, two-stage detectors such as Faster R-CNN [1], R-FCN [6], and Cascade R-CNN [7] are popular for pest detection because of their high detection accuracy. One-stage detectors such as YOLO [2, 8, 9], SSD [10], and RetinaNet [11] are less time-consuming because of their simpler networks, but lose accuracy. In either case, a considerable gap remains between these detectors and practical pest detection.

In this paper, we propose a crop pest detection framework named YOLO-pest. YOLO-pest replaces the YOLOv4 backbone network with MobileNetv3 to significantly reduce the number of parameters, and introduces a lite fusion FPN (Lite-fusion FPN) architecture. We also built a pest image dataset named Croppest12, containing several forms of 12 common crop pests. YOLO-pest achieves 70.07% mAP on Croppest12, only 2.4 points lower than YOLOv4, while its model size is only 46.9M, 198.8M less than YOLOv4.

2. METHOD

The proposed YOLO-pest improves YOLOv4 [12] in two main aspects: it replaces the backbone network with MobileNetv3 [13], and it introduces the Lite-fusion FPN. The network framework is shown in Figure 1.

Figure 1. YOLO-pest framework.

2.1 Backbone

MobileNetv3 [13] adopts depthwise separable convolution and improves feature extraction with inverted residual modules with linear bottlenecks. Input images are fed into the backbone feature extraction network, which is built from bneck blocks. Each bneck block first expands the channel dimension of the input feature map and then applies depthwise separable convolution, while a squeeze-and-excite (SE) [14] attention module balances the weights of each channel of the feature map.
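
To make the bneck structure concrete, the following PyTorch sketch shows a MobileNetv3-style inverted residual block with an SE module. This is our illustration, not the authors' code; the class names, channel counts, and kernel parameters are assumptions.

```python
import torch
import torch.nn as nn

class SEModule(nn.Module):
    """Squeeze-and-excite: reweight channels by globally pooled statistics."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Hardsigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # per-channel reweighting

class Bneck(nn.Module):
    """Inverted residual block: expand -> depthwise conv -> SE -> linear project."""
    def __init__(self, in_ch, exp_ch, out_ch, kernel=3, stride=1):
        super().__init__()
        self.use_res = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            # 1x1 pointwise expansion (up-dimensions the feature map)
            nn.Conv2d(in_ch, exp_ch, 1, bias=False),
            nn.BatchNorm2d(exp_ch),
            nn.Hardswish(),
            # depthwise convolution (the "separable" spatial step)
            nn.Conv2d(exp_ch, exp_ch, kernel, stride, kernel // 2,
                      groups=exp_ch, bias=False),
            nn.BatchNorm2d(exp_ch),
            SEModule(exp_ch),
            nn.Hardswish(),
            # 1x1 linear projection (linear bottleneck: no activation)
            nn.Conv2d(exp_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_res else y
```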

The backbone uses the h-swish activation function, a modification of the swish activation function. Equation (1) gives the swish activation function, with the sigmoid σ defined in equation (2).

$$\mathrm{swish}(x) = x \cdot \sigma(\kappa x) \tag{1}$$

$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{2}$$

where x is the input, κ is a hyper-parameter that adjusts the slope of the activation function, and σ is the sigmoid function. h-swish replaces the σ(κx) term in swish with a ReLU6-based approximation, as shown in equations (3) and (4).

$$\mathrm{ReLU6}(x) = \min(\max(x, 0), 6) \tag{3}$$

$$\text{h-swish}(x) = x \cdot \frac{\mathrm{ReLU6}(x + 3)}{6} \tag{4}$$

The ReLU6 activation limits the term ReLU6(x + 3)/6 to the range [0, 1], so it can stand in for the sigmoid function. At the same time, h-swish reduces the number of activation functions in the bneck structure to 16 while matching the accuracy of 32 swish activations, reducing the complexity of the network.
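
As a minimal illustration of equations (1)-(4), the activations can be written directly in PyTorch. This is our sketch; the parameter k (κ) defaulting to 1 is an assumption.

```python
import torch

def swish(x, k=1.0):
    """Equation (1): swish(x) = x * sigmoid(k * x)."""
    return x * torch.sigmoid(k * x)

def relu6(x):
    """Equation (3): clamp the input to [0, 6]."""
    return torch.clamp(x, min=0.0, max=6.0)

def h_swish(x):
    """Equation (4): piecewise-linear approximation of swish.
    ReLU6(x + 3) / 6 stays in [0, 1], standing in for the sigmoid."""
    return x * relu6(x + 3.0) / 6.0
```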

2.2 Lite fusion feature pyramid network

Since pests appear at very different scales, a single-scale convolution kernel cannot adapt to images with multi-angle and multi-scale variation, so a feature pyramid network (FPN) architecture [15] is needed. Shallow FPN layers have higher resolution and carry clearer location information, deep layers carry rich semantic information, and feature layers at different scales capture different feature information, making the network more adaptable to objects of different sizes.

To counter the accuracy loss caused by multi-scale pests, we construct a lightweight multi-layer fusion module for the feature pyramid network, shown in Figure 2. First, the 52×52 feature map is downsampled with a 2×2 average pooling layer, so the fusion operation receives shallow visual information and preserves more detailed features. Second, the 13×13 feature map, which carries high-level semantic information and global object information, is upsampled to 26×26. Finally, the three 26×26 feature maps are concatenated into one, and the fused map is resampled to 52×52 and 13×13 to rebuild the remaining pyramid levels. A sketch of this module is given below.
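
A minimal PyTorch sketch of this fusion module follows. The class name LiteFusion and the exact resampling operators used to rebuild the 52×52 and 13×13 levels are our assumptions, since the paper does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LiteFusion(nn.Module):
    """Fuse three pyramid levels (52x52, 26x26, 13x13) at the 26x26 scale."""
    def __init__(self):
        super().__init__()
        # 2x2 average pooling: 52x52 -> 26x26, keeping shallow detail
        self.down = nn.AvgPool2d(kernel_size=2, stride=2)

    def forward(self, p52, p26, p13):
        d52 = self.down(p52)                                       # 52x52 -> 26x26
        u13 = F.interpolate(p13, scale_factor=2, mode="nearest")   # 13x13 -> 26x26
        fused = torch.cat([d52, p26, u13], dim=1)                  # concat on channels
        # Resample the fused map to rebuild the other two pyramid levels
        # (the choice of operators here is an assumption, not from the paper)
        out52 = F.interpolate(fused, scale_factor=2, mode="nearest")  # -> 52x52
        out13 = F.max_pool2d(fused, kernel_size=2, stride=2)          # -> 13x13
        return out52, fused, out13
```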

Figure 2. The architecture of the proposed lite fusion module.

3. EXPERIMENTS

3.1 Experiment platform and dataset

Experiment platform. The model is trained on Ubuntu 18.04 using the PyTorch framework, with an Intel Core i7-10700 CPU, an NVIDIA TITAN RTX GPU (24GB), CUDA 10.0, cuDNN 7.6, and Python 3.7. Input images are resized to 640×640, the batch size is set to 16, the initial learning rate is 0.001, the IoU threshold is set to 0.5, and all models are trained for 100 epochs with these settings; a configuration sketch follows.
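
For reference, the reported hyper-parameters can be summarized in a configuration dictionary. This is an illustrative sketch, not the authors' training script; the key names are ours.

```python
# Hypothetical training configuration mirroring the settings reported above.
train_cfg = {
    "input_size": (640, 640),   # images resized before training
    "batch_size": 16,
    "learning_rate": 1e-3,      # initial learning rate
    "iou_threshold": 0.5,       # IoU threshold used for evaluation
    "epochs": 100,
    "framework": "PyTorch",
    "gpu": "NVIDIA TITAN RTX (24GB)",
}
```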

Dataset. We collected 11,130 pest images at a resolution of 1944×2592 using light-trap pest image acquisition equipment. Agricultural experts annotated the images with pest bounding boxes and classes using the open-source labeling tool LabelImg, yielding the dataset named Croppest12. Table 1 lists the pest names, their instance counts, and the average height and width of each category. As Table 1 shows, instance counts range from 198 to 18,463, 8 pest categories have fewer than 1,000 instances, and most pest boxes are under 100 pixels in width and height, which is small relative to the 1944×2592 images.

Table 1. Statistics of Croppest12.

Classes | Pest name | Instances | Average width (pixels) | Average height (pixels)
AS | Agrotis segetum | 815 | 80.8 | 63.2
AT | Agrotis tokionis | 243 | 97.1 | 76.3
AE | Agrotis exclamationis | 285 | 90.1 | 69.5
XC | Xestia c-nigrum | 362 | 78.3 | 60.6
HO | Holotrichia oblita | 446 | 70.2 | 54.9
HP | Holotrichia parallela | 5186 | 66.3 | 52.0
AC | Anomala corpulenta | 18463 | 60.6 | 47.7
GO | Gryllotalpa orientalis | 3237 | 119.8 | 92.1
PC | Pleonomus canaliculatus | 228 | 69.1 | 54.4
AS | Agriotes subrittatus | 3615 | 44.8 | 35.8
MC | Melanotus caudex | 437 | 41.5 | 32.2
SF | Spodoptera frugiperda | 198 | 53.5 | 41.6

3.2 Evaluation metrics

To evaluate the performance of the algorithm, Precision (P) and Recall (R) are used to quantitatively assess the model, computed as in equation (5), where TP, FP, and FN denote the numbers of correctly detected targets, false detections, and missed targets, respectively.

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN} \tag{5}$$

$$mAP = \frac{1}{C}\sum_{i=1}^{C} AP_i \tag{6}$$

Average Precision (AP) evaluates the performance of the model on the test set for a single class, computed from the shape of the precision-recall (PR) curve. Multi-class detection results are usually measured with mean Average Precision (mAP); in equation (6), C is the number of classes. In addition, we measure detection speed as the number of images processed per second (FPS).
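
A minimal sketch of these metrics in Python, assuming the per-class APs have already been computed from the PR curves:

```python
def precision_recall(tp, fp, fn):
    """Equation (5): precision and recall from detection counts."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    return p, r

def mean_average_precision(ap_per_class):
    """Equation (6): mAP averages per-class AP over C classes."""
    return sum(ap_per_class) / len(ap_per_class)

# Example: three classes with APs of 0.475, 0.682, 0.689 (the AS, AT, AE
# values of our method in Table 3) give an mAP of about 0.615.
print(mean_average_precision([0.475, 0.682, 0.689]))
```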

3.3 Experiment results

As shown in Table 2, we compare the parameter counts, mAP, and FPS of Faster R-CNN, SSD, YOLOv3, and YOLO-pest under the same training settings. Compared with Faster R-CNN, YOLO-pest is about 5 points higher in mAP and about 40 FPS faster in inference, which meets the real-time detection requirement. Table 3 lists the scientific name and instance count of each pest category, along with the per-class AP of the different methods; our method surpasses the others on almost all pest categories.

Table 2. Performances of different models.

Method | Input image size | Backbone | Params (M) | FPS | mAP (%)
Faster R-CNN | 1280×800 | ResNet-50 | 41.25 | 22.3 | 65.16
YOLOv3 | 416×416 | Darknet53 | 62.3 | 54.2 | 62.22
SSD | 512×512 | VGG16 | 36.04 | 38.7 | 62.17
YOLO-pest (ours) | 608×608 | MobileNetv3 | 46.9 | 62.5 | 70.07

Table 3. Performances on single pest classes.

Classes | Pest name | Instances | SSD AP (%) | YOLOv3 AP (%) | Ours AP (%)
AS | Agrotis segetum | 815 | 40.1 | 34.8 | 47.5
AT | Agrotis tokionis | 243 | 45.2 | 54.8 | 68.2
AE | Agrotis exclamationis | 285 | 68.9 | 56.3 | 68.9
XC | Xestia c-nigrum | 362 | 50.9 | 53.3 | 62.8
HO | Holotrichia oblita | 446 | 51.0 | 55.9 | 58.6
HP | Holotrichia parallela | 5186 | 84.1 | 81.0 | 85.2
AC | Anomala corpulenta | 18463 | 91.4 | 87.7 | 93.7
GO | Gryllotalpa orientalis | 3237 | 92.7 | 93.1 | 94.0
PC | Pleonomus canaliculatus | 228 | 54.0 | 50.6 | 61.7
AS | Agriotes subrittatus | 3615 | 70.0 | 71.8 | 76.0
MC | Melanotus caudex | 437 | 41.4 | 57.7 | 61.7
SF | Spodoptera frugiperda | 198 | 56.4 | 49.7 | 62.6

Table 4 shows the results of the ablation experiments on YOLO-pest. "MobileNetv3" denotes replacing the CSPDarknet53 backbone of YOLOv4 with the MobileNetv3 structure, and "Lite-fusion FPN" denotes replacing the PANet FPN of YOLOv4 with the Lite-fusion FPN fusion network. The ablations compare parameters, FPS, and mAP under each structure combination. The original YOLOv4 has 245.7M parameters; replacing CSPDarknet53 with MobileNetv3 shrinks the model to only 47.6M, but the mAP drops to 68.6%. Replacing only the PANet FPN with the Lite-fusion FPN leaves the model size almost unchanged (244.5M) at 71.2% mAP, 2.6 points higher than the MobileNetv3 variant, indicating that the Lite-fusion FPN can indeed improve the model with almost no effect on model size. Finally, combining MobileNetv3 and the Lite-fusion FPN yields 70.1% mAP with a model size of only 46.9M, preserving a high mAP even though the model size and parameter count are significantly reduced. Detection results are shown in Figure 3: our method accurately detects most pest targets, meeting practical application requirements.

Figure 3. Some detection results of our model on Croppest12.

Table 4. Ablation experiment.

Method | Params (M) | FPS | mAP (%)
YOLOv4 | 245.7 | 45.3 | 72.5
YOLOv4 + MobileNetv3 | 47.6 | 59.6 | 68.6
YOLOv4 + Lite-fusion FPN | 244.5 | 51.8 | 71.2
YOLO-pest (MobileNetv3 + Lite-fusion FPN) | 46.9 | 62.5 | 70.1

4. CONCLUSION

To address the complex network structures and redundant parameters of existing detection algorithms, we propose a lightweight pest detection method that achieves efficient real-time detection of multi-scale objects. Based on YOLOv4, it alleviates the large-model problem by replacing the backbone feature extraction network, and improves the feature pyramid to strengthen the expression of semantic and location information. Experimental results show that YOLO-pest has fewer parameters than other mainstream algorithms while balancing detection accuracy and speed, giving it good engineering application value.

REFERENCES

[1] Ren, S., He, K., Girshick, R. and Sun, J., "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Trans. Pattern Anal. Mach. Intell., 1137-1149 (2017). https://doi.org/10.1109/TPAMI.2016.2577031
[2] Redmon, J. and Farhadi, A., "YOLOv3: An incremental improvement," arXiv:1804.02767 (2018).
[3] Tetila, E. C., Machado, B. B. and Astolfi, G., "Detection and classification of soybean pests using deep learning with UAV images," Computers and Electronics in Agriculture, 105836 (2020). https://doi.org/10.1016/j.compag.2020.105836
[4] Wang, F., Jiang, M. and Qian, C., "Residual attention network for image classification," IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 6450-6458 (2017).
[5] Xie, Q., Luong, M. T. and Hovy, E., "Self-training with noisy student improves ImageNet classification," arXiv:1911.04252 (2020).
[6] Dai, J., Li, Y., He, K. and Sun, J., "R-FCN: Object detection via region-based fully convolutional networks," Annual Conf. on Neural Information Processing Systems, 379-387 (2016).
[7] Cai, Z. and Vasconcelos, N., "Cascade R-CNN: Delving into high quality object detection," IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 6154-6162 (2018).
[8] Redmon, J., Divvala, S., Girshick, R. and Farhadi, A., "You only look once: Unified, real-time object detection," arXiv:1506.02640 (2016).
[9] Redmon, J. and Farhadi, A., "YOLO9000: Better, faster, stronger," IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 6517-6525 (2017).
[10] Liu, W., Anguelov, D. and Erhan, D., "SSD: Single shot multibox detector," European Conf. on Computer Vision (ECCV), 21-37 (2016).
[11] Lin, T. Y., Goyal, P., Girshick, R. B. and Dollár, P., "Focal loss for dense object detection," IEEE Inter. Conf. on Computer Vision, 2999-3007 (2017).
[12] Bochkovskiy, A., Wang, C. Y. and Liao, H. Y. M., "YOLOv4: Optimal speed and accuracy of object detection," arXiv:2004.10934 (2020).
[13] Howard, A., Sandler, M. and Chu, G., "Searching for MobileNetV3," IEEE Inter. Conf. on Computer Vision, 1314-1324 (2019).
[14] Hu, J., Shen, L., Albanie, S., Sun, G. and Wu, E., "Squeeze-and-excitation networks," IEEE Trans. Pattern Anal. Mach. Intell., 2011-2023 (2020).
[15] Lin, T. Y., Dollár, P., Girshick, R., He, K., Hariharan, B. and Belongie, S., "Feature pyramid networks for object detection," IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 936-944 (2017).