1. Introduction
Abundant agricultural resources are a cornerstone of the sustenance of human society [1,2]. Sustaining agricultural resources to meet societal demands is a critical challenge, particularly as human civilization undergoes a significant shift toward urbanization [3,4]. Crop classification in large-scale cultivation is a pivotal task within this context. In recent years, with the rapid advancement of hyperspectral imaging sensors, hyperspectral imagery (HSI) has become widely acknowledged in agriculture for its substantial advantages in acquiring rich spectral information about land cover [5]. In particular, HSI excels at capturing the detailed and discriminative features essential for crop classification, showing clear advantages over earlier methods based on multispectral and optical images [6]. Thanks to the significant achievements of machine learning (ML) and deep learning (DL) in hyperspectral image classification (HSIC), monitoring large-scale agricultural land and gaining insights into crop cultivation patterns have become feasible and easy to implement [7,8,9].
Heilongjiang Province is China’s most significant agricultural province and a major commodity grain production area [10]. It possesses some of the world’s most fertile black soil, offering abundant agricultural resources [11]. In contrast to the small and scattered cropland in other regions, the region, situated in the Sanjiang Plain, features extensive and flat croplands [12,13]. In this area, residential zones occupy far less land than agricultural regions. It is one of the few areas in China suitable for large-scale mechanized agricultural cultivation [14]. Nonetheless, the area has grappled with the pressing issue of diminishing farmland due to population outmigration and soil erosion [15,16,17]. In China, with a population exceeding 1.4 billion, food security faces a substantial risk from the depletion of the non-renewable black soil resource. To safeguard arable land area and food production, annual investigations of the agricultural crop planting structure and farmland statistics are conducted in the region [18]. This typically requires individuals with professional knowledge to conduct on-site surveys and interpret multiple types of remote sensing images. Therefore, employing ML and DL for crop classification holds practical value, as it significantly reduces manual annotation costs [19,20,21,22,23].
With the continued efforts of researchers, various ML methods for crop classification with HSI have been proposed. Rao et al. constructed a spectral dictionary that encompasses the main crop types [24]. This method aims to achieve crop classification by leveraging the unique spectral reflectance of crops. However, such methods are limited because numerous unknown factors influence crop spectral characteristics. As a result, researchers have turned to utilizing the spatial and spectral information of HSI simultaneously to assist in classification. Zhang et al. employed both the spatial texture features and spectral features of crops to construct an optimal feature band set [25]. Classification was achieved through band selection and an object-oriented approach.
In recent years, a plethora of DL methods have been applied to HSIC, yielding remarkable results [26,27,28,29]. Compared to traditional machine learning classification methods, they can extract more sophisticated and representative spatial–spectral features [30,31,32]. The widely used Indian Pines (IP) dataset is built on an agricultural setting and can therefore be regarded as a subject for in-depth exploration of methods utilizing HSI for crop classification. As a representative DL technique, the optimized transformer model proposed by Hong et al. (SpectralFormer) extracts global and local information for HSIC [33]; this method can attain an overall classification accuracy of 81.76% on the IP dataset with only 695 training samples. Sun et al. utilized a module composed of a convolutional neural network (CNN) and a transformer to capture both spatial–spectral features and high-level semantic features (SSFTT). The model achieved an accuracy of 97.47% on the IP dataset using 1024 labeled samples for training [34]. It is evident that current mainstream methods have achieved near-perfect classification results on this dataset. However, this also implies that the IP dataset has lost its ability to serve as a benchmark for measuring the performance of classification methods. Unfortunately, most traditional HSI datasets, such as Salinas and Yellow River Estuary, face similar issues: limited labeled samples and ease of fitting constrain their classification potential.
To study practical issues, researchers have had to rely on specially designed experiments on these overly optimistic datasets. In fact, agricultural scenarios offer an ideal subject for such studies. In other words, the issues that researchers attempt to simulate are widespread in rural areas. More specifically, in regions with lower human activity, a multitude of unknown land cover types with extremely uneven distributions coexist. This not only introduces intricate spatial–spectral information but also results in chaotic boundary areas [24]. Furthermore, other practical issues can be summarized as follows: (1) Mixing of Crops. Different types of crops are planted in neighboring regions with such similar spectral characteristics that they are hard to differentiate. (2) Complex Geographic Environment. Variations in the growth status of crops at different locations result in inconsistent spectral characteristics. Soil types, moisture conditions, and fertilizer usage also have an impact. (3) Uncertain Crop Growth Stages. Crops exhibit different spectral characteristics at various growth stages [35,36]. (4) Vegetation Obstruction. Mutual obstruction between crops or vegetation obstructing crops can result in the loss of spectral information [37]. In existing datasets, the aforementioned challenges are not usually encountered simultaneously, and these issues are typically avoided during scene selection and annotation. A classification scenario that retains these challenges contributes significantly to enhancing the generalization capability of classification methods, providing more effective support for practical crop classification tasks.
It is crucial to note that, in actual agricultural crop planting structure surveys and farmland area statistics, the focus of the classification task is to determine the type of crop over a large area, rather than the growth status of the crops. In the past, researchers have been dedicated to dividing these datasets into more numerous and finer categories. For instance, corn is divided into ‘corn-notill’ and ‘corn-mintill’ categories in the Indian Pines dataset. Such a requirement makes the already time-consuming and labor-intensive annotation task even more challenging [38,39]. Therefore, traditional datasets focused on agricultural areas comprise small-sized images and represent limited actual land areas [40]. However, this contradicts current demands. Benefiting from hyperspectral imaging systems carried by unmanned aerial vehicles (UAVs), researchers are attempting to address this contradiction through the use of high spatial resolution HSI [41,42,43]. As a representative of this approach, the WHU-Hi dataset has played a crucial role in supporting precise crop identification. However, utilizing UAVs to monitor the agricultural resources of a region or even an entire province would incur substantial costs. As a result, HSI obtained from satellites continues to be the primary focus of our current research. It provides a cost-effective means to obtain multitemporal images of the same region and same-temporal images of large-scale regions.
To assist researchers interested in agricultural scene classification, a large-scale crop classification HSI dataset referred to as HLJ is introduced in this paper. It comprises two HSI scenes, namely HLJ-Raohe and HLJ-Yan, captured in Heilongjiang Province, China, as depicted in Figure 1. Considering that the core task of crop classification is to distinguish agricultural areas including several major crops from non-agricultural areas, these two scene images are intentionally selected from two real rural areas. In this region, the variety of land cover types is limited, comprising crops, natural vegetation, and artificial structures, but the cultivation area of the crops is extremely extensive. Given this scenario, the two images contain seven and eight categories, respectively, sufficient to cover the main land cover types in this region. Crop cultivation in this region depends on the type of land and topography, leading to the intermixing of different crops and making the situation quite complex in practice. Therefore, in the annotation process, we emphasized annotating the boundary segments and obtained accurately labeled ground truth images through on-site surveys and the integration of multitemporal images. Additionally, as this dataset is primarily intended for crop classification tasks, and the predominant land cover in the area is arable land, annotated crop samples account for a large proportion of the entire image. The main contributions of this article can be summarized as follows:
- (1) A large-scale crop classification dataset, named the HLJ dataset, is introduced. Owing to the diversity of land cover types in agricultural regions, this dataset poses several practical challenges, such as uneven distribution of crops, uncertain crop growth stages, and mixed planting, and presents an elevated level of complexity in classification.
- (2) This is a large-scale dataset that covers a wide range of rural areas, including a sufficiently representative selection of land cover types in the region. These diverse land cover types contribute to an exceptionally rich set of spectral information. Furthermore, the proposed dataset contains a sufficient number of accurately labeled samples, with 319,685 and 318,942 in the two images, respectively. The reliability of these samples stems from on-site surveys and comprehensive analysis of multitemporal images.
- (3) A comprehensive validation of the HLJ dataset was conducted by employing several representative methods for basic classification experiments (e.g., SpectralFormer and SSFTT) and comparing the classification results among different datasets using the same methods. This process affirmed the research value inherent in the issues encompassed by the dataset and its suitability as a benchmark dataset for hyperspectral image classification.
2. Construction of the HLJ Dataset
The HLJ dataset is a satellite-based hyperspectral dataset primarily designed for the classification of large-scale agricultural crops. It was acquired in Heilongjiang Province, located in the northeastern region of China, known for its extensive and concentrated croplands [44]. In this dataset, Raohe County and Yian County in particular have been selected as representatives. They are significant grain-producing regions in Heilongjiang Province, providing the most authentic depiction of the agricultural characteristics in this area. Aside from small and concentrated artificial structures, the dataset mainly consists of large-scale cultivated farmlands and natural vegetation.
The two images in the HLJ dataset were acquired using the Advanced Hyperspectral Imager (AHSI) sensor. This sensor divides the visible and near-infrared (VNIR) spectrum into 76 bands with a spectral resolution of 10 nm. Similarly, the shortwave infrared (SWIR) spectrum is segmented into 90 bands, each with a spectral resolution of 20 nm. Given the unique spectral characteristics exhibited by crops at various growth stages, the dataset was captured during the growth and maturity stages, offering a wealth of distinctive spectral information [45,46].
The construction of the HLJ dataset is divided into four main parts as shown in Figure 2: data collection, data preprocessing, sample annotation, and experimental agreement. Section 2.1 presents details about the acquisition of the data. In Section 2.2, details about the preprocessing and the annotation of the proposed dataset are provided. A comprehensive evaluation experiment of the HLJ dataset is introduced in Section 3.
2.1. The Acquisition of HLJ Dataset
The HLJ-Raohe dataset was captured by the ZY1-02D satellite on 30 September 2022, in Raohe County. Located in the northeastern part of Heilongjiang Province and adjacent to the Ussuri River, Raohe County covers an area of 6765 square kilometers (133°2′E–133°9′E, 47°1′N–47°6′N). The average elevation in this area is 149 m, with a minimum elevation of 45 m and a maximum elevation of 933 m. The terrain is diverse, including four main types: mountainous hills, plateaus, plains, and wetlands. The dataset was acquired during the maturation stage of the crops, at a time when the crops were not yet harvested, resulting in significant variations in spectral information. The data was captured under favorable weather conditions with good visibility. The image has a size of 897 × 483 pixels and contains 151 spectral bands, covering a wavelength range of 400 to 2500 nm. It is worth noting that the following bands have been removed: Bands 98–102 and 125–132. The HSI acquired by the satellite has a spatial resolution of 30 m. The land cover types are categorized into seven representative classes: Rice, Soybean, Corn, Wetland, River, Built-up land, and Forest. The pseudocolor image and ground truth map are illustrated in Figure 3.
The HLJ-Yan dataset was captured by the ZY1-02D satellite on 10 July 2022, in Yian County. Located in the western part of Heilongjiang Province, Yian County covers an area of 3678 square kilometers (124°8′E–125°6′E, 47°3′N–47°7′N). The average elevation in this area is 205 m, with a minimum elevation of 154 m and a maximum elevation of 308 m. The primary landforms in this area consist of floodplains, mountainous hills, plains, and wetlands. The dataset was captured during the growth stage of the crops and, at this time, different crops were in varying stages of growth due to differences in planting times. The image has a size of 843 × 719 pixels and contains 149 spectral bands after the removal of broken bands, covering a wavelength range of 400 to 2500 nm. The 17 removed bands include Bands 98–103, 125–133, 165, and 166. The HSI acquired by the satellite has a spatial resolution of 30 m. The land cover types are categorized into eight representative classes: Rice, Soybean, Corn, River, Built-up land, Saline–alkali land, Channel, and Forest. The pseudocolor image and ground truth map of HLJ-Yan are depicted in Figure 4.
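The band counts reported for HLJ-Yan can be checked with a short calculation. The following is a minimal sketch, assuming that the 76 VNIR and 90 SWIR bands of the AHSI sensor are concatenated into a single 1-based index running from 1 to 166 and that the removed bands refer to that index; the function names are hypothetical and are not part of any released tooling for the dataset.

```python
# Band bookkeeping for HLJ-Yan, using only figures stated in the text.
VNIR_BANDS = 76  # AHSI visible and near-infrared bands
SWIR_BANDS = 90  # AHSI shortwave infrared bands


def expand_ranges(ranges):
    """Expand inclusive (start, end) pairs of 1-based band indices into a set."""
    bands = set()
    for start, end in ranges:
        bands.update(range(start, end + 1))
    return bands


def kept_band_indices(total_bands, removed_ranges):
    """Return the sorted 1-based indices of the bands that survive removal."""
    removed = expand_ranges(removed_ranges)
    return [b for b in range(1, total_bands + 1) if b not in removed]


total = VNIR_BANDS + SWIR_BANDS
# Removed bands for HLJ-Yan as listed in the text: 98-103, 125-133, 165, 166.
yan_removed = [(98, 103), (125, 133), (165, 165), (166, 166)]
yan_kept = kept_band_indices(total, yan_removed)

print(total, len(expand_ranges(yan_removed)), len(yan_kept))  # 166 17 149
```

Under these assumptions, the 17 removed bands leave 149 of the 166 sensor bands, which agrees with the figure reported for HLJ-Yan.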
2.2. The Data Preprocessing and Annotation Details of the HLJ Dataset
Combining the requirements of the crop structure survey task and the demands of hyperspectral classification methods, task-specific annotations were conducted on the two images. In HLJ-Raohe and HLJ-Yan, 319,685 and 318,942 pixels were labeled, respectively. The category information of the HLJ dataset is detailed in Table 1 and Table 2. Because the crop structure survey task does not require overly detailed classification, we avoided further subdivision within the same crop. As a result, the number of categories in the dataset may be smaller compared to traditional datasets. Additionally, considering the dataset’s goal of reflecting real planting conditions, we minimized human adjustments to annotated details, especially at the boundaries. Therefore, the distribution of crops in the dataset may be uneven and the number of samples for different categories may be unbalanced. The annotation process was arranged as follows. First, five non-professional volunteers participated in the annotation task. They utilized hyperspectral and multispectral images from the same region at different times to perform initial annotations on different but overlapping areas. Subsequently, a comparative analysis of the preliminary annotation results was conducted. For areas with discrepancies and for boundary regions, a secondary annotation and discussion were carried out. Additionally, for areas where determination was challenging, three researchers conducted on-site surveys to obtain the final reliable results. In HLJ-Yan, due to dense vegetation in certain image regions causing severe pixel mixing, annotated samples from these areas were excluded. Therefore, the proportion of annotated samples is slightly lower in HLJ-Yan.
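The annotated proportion of each scene follows directly from the image sizes, the spatial resolution, and the labeled pixel counts given in the text. The sketch below is only an illustration of that arithmetic; the dictionary layout and the function name `scene_summary` are hypothetical, and the footprint estimate assumes square 30 m pixels with no resampling.

```python
PIXEL_SIZE_M = 30  # spatial resolution stated for both scenes

# (image height, image width, labeled pixels) as reported in the text.
SCENES = {
    "HLJ-Raohe": (897, 483, 319_685),
    "HLJ-Yan": (843, 719, 318_942),
}


def scene_summary(height, width, labeled, pixel_size_m=PIXEL_SIZE_M):
    """Return total pixels, approximate footprint in km^2, and labeled fraction."""
    total_pixels = height * width
    area_km2 = total_pixels * pixel_size_m ** 2 / 1_000_000
    return {
        "total_pixels": total_pixels,
        "area_km2": area_km2,
        "labeled_fraction": labeled / total_pixels,
    }


summaries = {name: scene_summary(*dims) for name, dims in SCENES.items()}

for name, s in summaries.items():
    print(f"{name}: {s['total_pixels']} px, "
          f"{s['area_km2']:.1f} km^2, "
          f"{100 * s['labeled_fraction']:.1f}% labeled")
```

With the reported figures, roughly 74% of the HLJ-Raohe pixels and roughly 53% of the HLJ-Yan pixels carry a label, which is consistent with the lower annotated proportion described for HLJ-Yan.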
4. Discussion
In this study, a large-scale HSI dataset for crop classification is introduced. Through data preprocessing and meticulous sample annotation, we established two major study areas in Heilongjiang Province, located in northeast China. The images of these areas have large spatial dimensions, covering distinct growth stages of various crops and containing abundant spatial–spectral information. Each image in this dataset provides over 300,000 annotated samples for interpretation. In the fundamental classification experiments, eight classical methods completed successful classification on the HLJ dataset, which demonstrates the applicability of this dataset as a benchmark dataset. At the same time, this dataset poses practical challenges for many HSIC methods owing to the intensive cultivation and uneven distribution that characterize agricultural scenarios. For instance, as shown in the HLJ-Raohe classification results presented in Figure 10, most methods generate significant misclassifications for Corn and Soybean. Additionally, categories such as rivers and artificial structures cannot be accurately determined due to their considerably fewer samples compared to other classes. Addressing these challenges requires specific enhancements and modifications to existing hyperspectral classification methods.
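The effect of class imbalance on the reported metrics can be illustrated with a small confusion matrix. The sketch below uses invented counts for three classes and is not taken from the experiments in this paper; it only shows how overall accuracy can remain fairly high while a minority class such as River is poorly recognized.

```python
def overall_accuracy(cm):
    """Fraction of all samples on the diagonal of a confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def per_class_accuracy(cm):
    """Recall of each class; rows are true classes, columns are predictions."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]


def average_accuracy(cm):
    """Unweighted mean of the per-class accuracies."""
    accs = per_class_accuracy(cm)
    return sum(accs) / len(accs)


# Hypothetical counts, invented for illustration only.
classes = ["Corn", "Soybean", "River"]
cm = [
    [80, 20, 0],  # true Corn, often confused with Soybean
    [25, 75, 0],  # true Soybean, often confused with Corn
    [2, 3, 5],    # true River, a minority class with few samples
]

print(f"OA = {overall_accuracy(cm):.3f}")
print(f"AA = {average_accuracy(cm):.3f}")
for name, acc in zip(classes, per_class_accuracy(cm)):
    print(f"{name}: {acc:.2f}")
```

Because overall accuracy is dominated by the large classes, reporting per-class and average accuracy alongside it gives a clearer picture on an unbalanced dataset.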
The results of the data distribution visualization for the HLJ dataset and other related datasets are presented in Figure 14. In the HLJ dataset, samples within the same category lie close together, while different categories intertwine. For example, Corn and Soybean overlap in both the HLJ-Raohe and the HLJ-Yan datasets. This distribution pattern is consistent with the suboptimal performance of various classification methods on these classes in the classification experiments. In the other datasets, by contrast, the categories are clearly separated from one another while intra-class distances remain small. These characteristics make those datasets easier to model. As shown in Table 5 and Table 6, with the same experimental configuration, the HLJ dataset maintains a higher classification difficulty than the other datasets, both overall and for individual classes.
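The notion of small intra-class distances combined with large or small inter-class separation can be made concrete with a simple ratio. The sketch below is not the visualization method used for Figure 14; it is a stand-in measure, assuming low-dimensional feature vectors per sample, and the 2-D points are invented for illustration.

```python
import math


def centroid(samples):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(samples)
    return [sum(values) / n for values in zip(*samples)]


def mean_intra_class_distance(samples):
    """Average Euclidean distance from each sample to its class centroid."""
    c = centroid(samples)
    return sum(math.dist(s, c) for s in samples) / len(samples)


def separability(class_a, class_b):
    """Centroid distance divided by the mean intra-class spread.

    Larger values indicate classes that are easier to tell apart.
    """
    between = math.dist(centroid(class_a), centroid(class_b))
    within = (mean_intra_class_distance(class_a)
              + mean_intra_class_distance(class_b)) / 2
    return between / within


# Hypothetical 2-D feature embeddings, invented for illustration only.
corn = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
soybean = [(0.5, 0.5), (1.5, 0.5), (0.5, 1.5), (1.5, 1.5)]  # overlaps Corn
river = [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0), (11.0, 11.0)]  # far away

print(round(separability(corn, soybean), 2))
print(round(separability(corn, river), 2))
```

In this toy setting, the intertwined pair yields a ratio near one, whereas the well-separated pair yields a much larger value, mirroring the contrast described between the HLJ dataset and the other datasets.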
The results of the spectral curve visualization for the HLJ dataset are given in Figure 15. Clearly, within specific wavelength ranges, the spectral curves exhibit notable overlap, with similar values at peaks and troughs. In these bands, models struggle to extract distinctive features, especially in those minute yet crucial segments. This places a higher demand on the model’s ability to maintain precision and sensitivity towards the informative spectral bands.
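One common way to quantify how strongly two spectral curves overlap is the spectral angle between the mean spectra of two classes. The sketch below is offered only as an illustration of that measure and is not part of the analysis in this paper; the five-band reflectance values are invented and do not come from the HLJ dataset.

```python
import math


def mean_spectrum(pixels):
    """Band-wise mean of a list of pixel spectra of equal length."""
    return [sum(band) / len(pixels) for band in zip(*pixels)]


def spectral_angle(a, b):
    """Angle in radians between two spectra; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cos_theta)


# Hypothetical reflectance values over five bands, invented for illustration.
corn_pixels = [[0.05, 0.08, 0.40, 0.30, 0.15], [0.06, 0.09, 0.42, 0.31, 0.16]]
soybean_pixels = [[0.05, 0.09, 0.41, 0.29, 0.15], [0.06, 0.08, 0.43, 0.30, 0.14]]
river_pixels = [[0.08, 0.06, 0.03, 0.02, 0.01], [0.09, 0.07, 0.03, 0.02, 0.01]]

corn = mean_spectrum(corn_pixels)
soybean = mean_spectrum(soybean_pixels)
river = mean_spectrum(river_pixels)

print(f"Corn vs Soybean: {spectral_angle(corn, soybean):.4f} rad")
print(f"Corn vs River:   {spectral_angle(corn, river):.4f} rad")
```

A small angle between two class means indicates that their curves nearly coincide in shape, which is the situation in which a classifier must rely on subtle differences in a few informative bands.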
The HLJ dataset proposed in this study is primarily designed to meet the demands of crop structure investigation in the northeast region. This task only requires accurate classification of major crops such as rice, soybeans, and corn, while maintaining limited focus on other land covers. Constrained by the difficulty of annotation, the labeling process did not involve detailed categorization of more varieties, and different varieties of a single crop were not distinguished. In addition, the impact of more practical factors on hyperspectral interpretation needs further exploration in future research.
5. Conclusions
In this paper, with the purpose of solving the difficulties encountered in crop structure investigation for northeast China, a large-scale HSI benchmark dataset for crop classification is proposed, namely the HLJ dataset. Acquired from the ZY1-02D satellite, the dataset reflects the realistic agricultural characteristics of a vast agricultural region, represented by two carefully selected HSIs. By accurately labeling a total of over 600,000 samples within the entire dataset, including the boundaries of distinct land covers, the limitations that a lack of sample diversity and an inadequate number of annotated samples impose on the development of DL for HSIC are addressed. This has been validated through visualizing the features and spectral curves of the samples. Additionally, through the basic classification experiments conducted with eight mainstream classification methods, it is found that the mainstream DL methods achieved more than 80% classification accuracy using 10% of the labeled samples for training. This further confirms the feasibility and research potential of the HLJ dataset as a benchmark dataset for HSI classification. At the same time, compared with the existing traditional datasets, the HLJ dataset presents practical problems such as uneven sample distribution and intensive and mixed crop cultivation, and their coexistence brings new challenges to HSIC techniques. The HLJ dataset not only serves as a benchmark for measuring the performance of HSIC algorithms, but is also suitable as a research object for a wide range of practical tasks, such as crop structure survey, long-tailed distribution classification, and open-set classification. In the future, a more in-depth interpretation of this dataset will contribute to enhancing the scientific planning level of agriculture in China, thereby promoting sustainable agriculture and ensuring global food security.