1. Introduction
Native semiarid grasslands and savannas across the globe are increasingly affected by woody plant encroachment, a phenomenon that leads to fundamental state shifts whereby herbaceous-dominated landscapes are converted to landscapes more similar to forests and dense shrublands. Increases in woody cover can result in myriad significant changes to a region's ecology, economy, and culture [1]. The Edwards Plateau ecological region of central Texas (93,000 km²), USA, has historically been maintained as a savanna community by fuel–fire feedbacks driven by the grazing habits of native megafauna. However, over the last 150 years, in response to the combined effects of overgrazing, fire suppression, and climate change, this region of Texas has experienced accelerated encroachment of woody plants [2]. The four main woody species encroaching into native savannas in this region are live oak (Quercus virginiana), blueberry juniper (Juniperus ashei), redberry juniper (Juniperus pinchotii), and honey mesquite (Prosopis glandulosa). The effects of this expansion on the region's hydrological system include an increase in baseflow, facilitated by the natural karst landscape [3]. Furthermore, the increased density of woody vegetation across the region has made much of the land inhospitable to cattle ranching, affecting the livelihood of landowners and the local economy. As a result of the declining grazing pressure [3], herbaceous plants have expanded concomitantly with woody plants, creating a dynamic multilevel plant community structure. This herbaceous vegetation consists mainly of shade-tolerant C3 species, which have replaced the historically more abundant C4 species in response to the greater canopy cover across the landscape [4], thus changing grazing patterns and affecting the landscape's biodiversity.
The creation of accurate maps depicting the spread of woody plants across the Edwards Plateau provides a basis for identifying and classifying the encroaching species, which can help landowners better manage their properties. Understanding the patterns and structure of woody plant expansion over large areas also enables scientists and conservationists to scale up the impact of woody plant encroachment into native grasslands and savannas, making the phenomenon more accessible to the general public. Instead of traditional field sampling methods, a more practical and economical means of studying large areas is now offered by remote sensing, especially given the growing abundance of affordable aerial drones on the market. In 2016, two months after mandating a nationwide drone registry, the Federal Aviation Administration (FAA) reported a greater number of registered drones than traditional aircraft [5]. The most popular drones for ecological purposes are "micro-drones" (weighing under 2 kg) equipped with a stabilized camera system (including 4K video), granting flight times of ~10–30 min and costing between ~USD 300 and 5000 [5]. Furthermore, thanks to advances in autonomous flight, camera control technology, and image-stitching software, novice users can learn to fly and begin taking aerial images within a few hours [5]. The Edwards Plateau is particularly well suited to species classification via this methodology. First, it is characterized by a low diversity of woody plant species [6]. Second, each of the aforementioned four tree species displays a unique phenology: honey mesquite is characterized by long periods of senescence, live oak by very short periods of senescence, and blueberry and redberry juniper by no period of senescence, as shown in Figure 1. These differences make it possible for land managers and scientists to use cost-effective image-acquisition methods for classifying vegetation during the transitional period from winter into late spring. In addition, drone imagery can be used to create 3D point clouds through photogrammetry, thereby adding vertical structure information, which improves classification results.
For a number of land cover classification projects, nonparametric machine learning algorithms, such as support vector machines (SVMs), random forest (RF), and deep learning convolutional neural networks (CNNs), have demonstrated significantly higher levels of classification accuracy than traditional parametric statistical methods such as maximum likelihood (ML) [7,8]. Nonparametric classifiers, such as SVM and RF, do not require that training data be normally distributed and are not based on statistical parameters. These features increase the robustness of the output in cases where training data are limited and there is a mixed set of input variables [9]. Support vector machines are particularly useful for classifying multidimensional data with limited training samples [10], and RF retains a strong position as a classifier because of its ease of parameterization and good performance for both simple and complex classification functions [11]. Of the three algorithms, however, it is the deep learning CNNs that have demonstrated the greatest accuracy in identifying complex patterns in image classification [12,13]. CNNs are nevertheless computationally intensive and require large quantities of labeled training data to perform well [13]. Furthermore, the hidden layers in CNNs can cause the user to lose insight into how the model can be improved; the resulting black-box approach can lead to overfitting and reduced performance when the model is applied to new data.
Woody species classification studies have achieved adequate levels of accuracy using multispectral and hyperspectral imagery from aerial and satellite platforms [14,15,16]. Ref. [14] used SVM and RF algorithms to classify tree species in the Southern Alps of Italy, using a combination of airborne multispectral, hyperspectral, and lidar data. In another study, carried out by [15], SVM, RF, and a CNN were used with airborne hyperspectral imagery to classify five tree species in Karkonosze National Park, Poland. Finally, [16] classified ten tree species in a temperate Austrian forest by means of WorldView-2 satellite imagery using object- and pixel-based methodologies in combination with RF. They found that object-based classification applied to high-resolution imagery is more accurate than pixel-based approaches, specifically when the pixel size is significantly smaller than the classes of interest [16]. However, few studies have tested the feasibility of using affordable high-resolution RGB imagery from drones to map species-specific woody encroachment in semiarid grasslands and savannas.
Previous research into semiarid woody plant species classification has garnered acceptable results using a variety of sensors and platforms; however, few of these studies used drones [17,18]. Using five-band RapidEye satellite data (5 m spatial resolution), five woody plant species were mapped in a dry forest in Botswana [17]. In a study of the effects of climate and land use change on woody plant species distributions in Mediterranean woodlands and semiarid shrublands, 1 m resolution hyperspectral data were gathered across a 43 km long strip, and 247 trees comprising seven woody species were identified for classification using SVM [18]. In [19], a customized sensor providing a red-edge band was attached to a drone to map mortality rates of three woody plant species in a dry forest in Peru using object-based image analysis. Our study is unique in execution and purpose: it uses only RGB imagery gathered from a drone, for the purpose of classifying a semiarid region affected by woody plant encroachment. Furthermore, the results will inform land managers in these landscapes of whether it is worth investing in a cost-effective drone to better manage the advancement of woody plant species into their properties.
Recently published works using RGB imagery acquired from drones to classify woody plant species have garnered good results but have focused on natural forests and areas of high humidity [20,21,22,23]. The authors of [20] captured leaf-on and leaf-off images of a mixed deciduous–pine forest located in humid Kyoto, Japan, classifying seven tree species using a CNN, with an SVM for comparison. The CNN outperformed the SVM in both the leaf-on and leaf-off seasons, attaining its highest accuracy (97.6%) during the leaf-off season. A similar study in a subtropical region of eastern China used three deep learning models to classify ten tree species in the "National Garden City" of Lin'an, finding accuracies as high as 92.6% [21]. The authors of [22] also used very high resolution RGB imagery and CNNs for mapping nine woody plant species over 51 ha in two temperate forest regions of Germany, attaining a mean F1-score of 73%. Finally, a study in the subalpine region of the northwestern Alps classified five tree species encroaching into an abandoned native grassland using RF, finding that a pixel-based approach attained the highest accuracy of 86% [23]. We sought to combine the findings of these papers, including the use of leaf-off imagery, deep learning applications, and pixel- and object-based approaches, and apply them to semiarid grasslands and savannas, which supply a majority of the world's animal products and have been fundamentally changed by woody plant encroachment [24].
Our study had the overall goal of determining whether affordable RGB drone imagery can be used for classifying woody plant species encroaching on semiarid grasslands and savannas. Specific objectives were to (1) compare traditional (ML) and classical machine learning (SVM and RF) algorithms; (2) compare pixel-based, object-based, and post-processing methodologies; (3) develop a methodology for classifying plant species that combines a deep learning approach with drone imagery; and (4) assess classification accuracies and develop recommendations.
2. Materials and Methods
2.1. Study Area
The study site (Figure 2) is located within the Sonora A&M AgriLife Research Station (latitude 30.27, longitude −100.57), which occupies approximately 1401 ha (3462 acres) in the Edwards Plateau ecoregion of central Texas. This ecoregion is generally classified as semiarid, receiving on average 550–600 mm of rain per year. Rainfall is distributed fairly evenly across the year, with the warmer months (May to October) seeing slightly higher amounts. Soils in this region are clayey and dark in color; most soil depths measure less than 254 mm (10 in.), but in some areas they are greater than 508 mm (20 in.). The site is characterized by karst topography, generally a mix of limestone and dolomite, which can be seen rising above the surface in many areas. The region's vegetation structure consists of mixed grasses (tall, medium, and short) and forbs intermixed with woody species [25].
2.2. Data Collection and Preprocessing
The drone imagery was acquired between 10 a.m. and 12 p.m. on 3 December 2018, with a DJI Phantom 3 Pro drone (Shenzhen Dajiang Baiwang Technology Co., Ltd., Shenzhen, China) carrying a Sony 1/2.3" CMOS RGB sensor with a FOV 94° 20 mm lens (Sony Semiconductor Solutions Corporation, Kanagawa, Japan). A total of 999 images were collected at a flying height of 50 m, covering approximately 140 ha (346 acres) of the 3462-acre site in six consecutive flights, each taking approximately 14 min, for a total flight time of about 84 min. Weather conditions were sunny with no wind. The flight took place in December to take advantage of the individual woody species' unique ecophysiologies: during this season, the mesquite trees are leaf-off, the live oaks show a decline in greenness, and the junipers are dark shades of green.
Using the structure-from-motion (SfM) image reconstruction workflow implemented in Pix4Dmapper software (version 4.6.4), we created a 3D point cloud (point density 195.6 points/m²), a geometrically corrected orthomosaic, a digital surface model (DSM), and a digital terrain model (DTM) over the study site. The gridded datasets (orthomosaic, DSM, and DTM) had an effective ground sample distance of 2.45 cm/pixel.
Additional information layers, shown in Figure 3, were created by means of feature engineering; these were used for the traditional and classical machine learning classifications but left out of the VGG-19 CNN classification to highlight the capability of the CNN to detect complex class patterns from standalone RGB imagery. For the particular camera used, the blue channel has a response curve of 400–560 nm with a peak at 460 nm, the green channel has a response curve of 400–640 nm with a peak at 540 nm, and the red channel has a response curve of 400–700 nm with a peak at 580 nm. The additional bands, a green–red difference band, a chlorophyll band, and a canopy height model (CHM, computed as the DSM minus the DTM), were derived from these channels, where G = green wavelength (peak response at 540 nm), R = red wavelength (peak response at 580 nm), and B = blue wavelength (peak response at 460 nm).
The three features were selected for different purposes. The green–red difference and chlorophyll features distinguish more finely between the various levels of greenness among tree species [26], while the canopy height model (CHM) captures differences in the height of woody plants. The three features were stacked along with the RGB bands, creating the six-band dataset shown in Figure 3. The stacked data file was then resampled from 2.45 cm to 5 cm, 10 cm, and 15 cm to reduce the size of the data. Based on image interpretation considering the presence of mixed pixels, the ability to accurately identify tree species, and overall image quality, we selected the resampled 10 cm mosaic for classification. Lastly, a simple decision tree classifier was used to mask out pixels with CHM values under 0.5 m, the majority of which represent bare ground and small herbaceous species. A similar step was carried out in [27], where a CHM was used to mask out values over 1 m and thereby exclude tree species. For our site, a 1 m mask was unnecessary, as the terrain is generally flat and many of the herbaceous species are shorter than 50 cm; furthermore, a 1 m mask would have concealed a large number of young trees that should have been included in the classification.
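As a rough illustration of this feature engineering step, the NumPy sketch below derives the three additional bands, stacks them with the RGB channels, and applies the 0.5 m CHM mask. Since the text does not spell out the exact band formulas, the green–red difference is implemented as a simple difference and the chlorophyll feature as a normalized green–red index; these forms, along with the array names and shapes, are assumptions.

```python
import numpy as np

# Stand-ins for the real rasters (hypothetical shapes); in practice these
# would be read from the Pix4D outputs resampled to 10 cm.
rng = np.random.default_rng(0)
rgb = rng.random((500, 500, 3)) * 1023      # 10-bit R, G, B digital numbers
dsm = rng.random((500, 500)) * 10           # digital surface model (m)
dtm = np.zeros((500, 500))                  # digital terrain model (m)

R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]

# Green-red difference (assumed here to be a simple band difference).
green_red_diff = G - R

# Chlorophyll feature (assumed here to be a normalized green-red index;
# the exact formula used in the study is not reproduced in the text).
chlorophyll = (G - R) / (G + R + 1e-6)

# Canopy height model: surface elevation minus terrain elevation.
chm = dsm - dtm

# Stack the RGB bands and the three engineered features into six bands.
stacked = np.dstack([R, G, B, green_red_diff, chlorophyll, chm])

# Mask out pixels under 0.5 m, which mostly represent bare ground and
# short herbaceous cover.
stacked[chm < 0.5] = np.nan
```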
2.3. Species Classification Using Traditional and Classical Machine Learning Methods
The classification scheme we selected included five classes: ground, shadow, juniper (both blueberry and redberry species), live oak, and mesquite. The shadow class was included to distinguish dark void areas from tree canopies and dark green junipers (most shadows cast on open ground were removed by the CHM mask). The ground class was included for the same reason: to distinguish sunlit void areas from tree canopies, live oaks, and leaf-off mesquites. Classification accuracy was assessed for the following five methodologies: (1) pixel-based classification with ML, SVM, and RF; (2) small-object-based classification with ML, SVM, and RF; (3) large-object-based classification with ML, SVM, and RF; (4) pixel-based classification with a post-processing majority filter; and (5) VGG-19 CNN deep learning classification.
For the training data, we selected 600 pixels per class on the 10 cm six-band orthomosaic, on the basis of the key characteristics for training area datasets listed in [28] as well as an expert understanding of the landscape and image analysis. We evaluated the separability of the collected training data based on the Jeffries–Matusita (J–M) distance, which calculates interclass separability using the spectral and physical band statistics of the training area dataset. The range of values produced by this method is 0.0–2.0, with a score of 1.9 or better being ideal [29]. The J–M distance was calculated via the following formulas:

$$JM_{xy} = 2\left(1 - e^{-B}\right)$$

$$B = \frac{1}{8}(x - y)^{T}\left(\frac{\Sigma_x + \Sigma_y}{2}\right)^{-1}(x - y) + \frac{1}{2}\ln\left(\frac{\left|\frac{\Sigma_x + \Sigma_y}{2}\right|}{\sqrt{\left|\Sigma_x\right|\left|\Sigma_y\right|}}\right)$$

where x is the mean spectral response vector of the first class; y is the mean spectral response vector of the second class; Σx is the covariance matrix of sample x; and Σy is the covariance matrix of sample y.

The J–M distance between all classes was greater than 1.9 except between juniper and live oak. The result for the latter two was expected because of their similar heights and spectral components (Table 1). A dendrogram, commonly used to hierarchically cluster taxa or classes, yielded results similar to those of the separability report, showing a close similarity between the juniper and live oak classes due to their spectral and physical similarities (Figure 4). In contrast, the J–M distance between classes with very different spectral signals, such as mesquite and shadow, was large.
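For readers who want to reproduce the separability check, a minimal NumPy sketch of the J–M calculation is given below, assuming per-class training pixels are available as arrays of shape (n_pixels, 6); the class arrays and their values are hypothetical.

```python
import numpy as np

def jeffries_matusita(mean_x, mean_y, cov_x, cov_y):
    """J-M distance between two classes from their mean spectral response
    vectors and covariance matrices; the result ranges from 0.0 to 2.0."""
    mean_cov = (cov_x + cov_y) / 2.0
    diff = mean_x - mean_y
    # Bhattacharyya distance between the two class distributions.
    b = diff @ np.linalg.inv(mean_cov) @ diff / 8.0 + 0.5 * np.log(
        np.linalg.det(mean_cov)
        / np.sqrt(np.linalg.det(cov_x) * np.linalg.det(cov_y))
    )
    return 2.0 * (1.0 - np.exp(-b))

# Hypothetical training pixels for two classes (600 pixels x 6 bands each).
rng = np.random.default_rng(0)
juniper = rng.normal(0.40, 0.05, (600, 6))
live_oak = rng.normal(0.45, 0.05, (600, 6))
jm = jeffries_matusita(juniper.mean(0), live_oak.mean(0),
                       np.cov(juniper, rowvar=False),
                       np.cov(live_oak, rowvar=False))
```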
Ideally, to ensure unbiased assessment of the accuracy of our classification results, in situ ground reference test pixels would have been collected [30]. However, because these were not available, we instead used visual interpretation of the original 2.45 cm raw images, in combination with expert knowledge of the landscape and the tree species, as a basis for collecting testing data on our 10 cm stacked orthomosaic (this method has been verified as a viable alternative to ground reference test information) [31]. We calculated the number of total testing pixels using the following formula [32] (p. 137):

$$N = \frac{B\,\Pi_i\left(1 - \Pi_i\right)}{b_i^{2}}$$

where N = number of testing samples; Πi = the proportion of class i (the worst case, Πi = 0.5, is assumed); B = the upper (α/k) × 100th percentile of the chi-square distribution (χ²) with 1 degree of freedom, k being the number of classes; and b = the desired precision.
The formula requires a minimum of 757 total testing pixels; with 5 classes, we decided to select 1000 testing pixels, 200 for each class. We therefore generated 5000 random pixels across the orthomosaic and reduced them to 200 pixels per class, evenly distributed across the image (avoiding mixed pixels and class boundaries).
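The sample-size computation can be reproduced with SciPy as sketched below; the α and b values shown are assumptions for illustration, so the result will match the 757-pixel minimum only under the precision level actually chosen in the study.

```python
from scipy.stats import chi2

k = 5           # number of classes
alpha = 0.05    # confidence level (assumed)
b = 0.05        # desired precision (assumed)
pi = 0.5        # worst-case class proportion, which maximizes N

# B = upper (alpha/k) x 100th percentile of chi-square with 1 d.f.
B = chi2.ppf(1 - alpha / k, df=1)
N = B * pi * (1 - pi) / b**2
print(f"minimum testing pixels: {N:.0f}")
```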
2.3.1. Image Segmentation
To enable comparisons between pixel-based and object-based approaches to species classification, we carried out two separate segmentations in ESRI ArcMap®. The specific segmentation algorithm used was mean shift, which iteratively moves a window toward the area of maximum pixel density and groups pixels around the resulting centroids. ArcMap's implementation of mean shift segments the image on the basis of minimum object size, spectral detail (ranging from 0 to 20), and spatial detail (ranging from 0 to 20). We performed parameterization iteratively, using varying values and visually interpreting the results. The final two segmentations used the same spectral detail and spatial detail values of 17. The first segmentation limited object size to a minimum of 10 pixels (referred to as small-object-based classification because it consists of only the smallest discernible objects). The second segmentation limited object size to a minimum of 100 pixels (referred to as large-object-based classification because it was the largest setting possible without taking in a large number of mixed-class objects).
Figure 5 illustrates the two segmentations.
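ArcMap's mean shift implementation is proprietary, but the idea can be sketched with scikit-learn's MeanShift clustering applied to a joint spatial–spectral feature space, as below; the tile size, weights, and bandwidth settings are illustrative assumptions, not the parameters used in the study.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# Hypothetical small tile (rows x cols x bands) cut from the stacked mosaic.
rng = np.random.default_rng(0)
tile = rng.random((60, 60, 6))
rows, cols, bands = tile.shape

# Joint spatial-spectral features; the two weights loosely mimic ArcMap's
# "spatial detail" and "spectral detail" settings (assumed values).
yy, xx = np.mgrid[:rows, :cols]
spatial_w, spectral_w = 0.05, 1.0
features = np.column_stack([
    xx.ravel() * spatial_w,
    yy.ravel() * spatial_w,
    tile.reshape(-1, bands) * spectral_w,
])

bandwidth = estimate_bandwidth(features, quantile=0.1, n_samples=500)
labels = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit_predict(features)
segments = labels.reshape(rows, cols)  # one integer segment id per pixel
```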
2.3.2. Image Classification
We applied support vector machines, random forest, and maximum likelihood algorithms to classify species over our study site using both pixel-based and object-based methodologies. The support vector machines algorithm [33] is a nonparametric method that makes no assumptions about the underlying distributions of the training data; instead, it finds the optimal hyperplane that separates the training dataset into discrete predefined classes [34]. The random forest algorithm [35] is an ensemble classifier that combines a number of decision trees (the number set by the user); these decision trees split at nodes on the basis of a random subset of predictors, the result being the sum of the majority votes from all the individual decision trees. The maximum likelihood algorithm was chosen because of its popularity in analyzing remotely sensed images and its reliability in classifying a variety of cover types and conditions [36]. It works by using training data to determine class-specific mean vectors and variance–covariance matrices, producing probability density functions. These density functions are used to calculate the probability that each pixel in an image belongs to each predetermined class, with the class of highest probability being assigned to that pixel in the output classification image.
We used the ML, SVM, and RF algorithms for standard pixel-based classification on the stacked orthomosaic and performed parameter tuning for each algorithm by cross-validation on three unique subsamples of the training data (each accounting for 10%, or 60 pixels, per class), visually assessing the results.
For the final ML classifications, no probability threshold was set, so that every pixel would be classified, and the data scaling factor was set to 1023, corresponding to the 10-bit radiometric resolution of our camera sensor.
The final SVM classifications used a radial basis function (RBF) kernel, whose value depends only on the distance between pairs of input samples in feature space. The gamma value, which controls the curvature of the separating hyperplane (higher values result in greater curvature and a tighter fit around the data values), was set to 0.091; the penalty parameter, which balances the trade-off between training errors and rigid margins, was set to 100. Increasing the penalty parameter increases the effect misclassified points have on the hyperplane, which can lead to overfitting.
The final RF classifications used 1000 decision trees, each grown from a random vector sampled from the training data with the same distribution. Each tree had a maximum depth of 50, which caps the number of decision nodes before a vote is cast for the output classification image.
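The pixel-based classifiers and their reported parameters translate directly into scikit-learn, as sketched below; the training arrays are random placeholders, and QuadraticDiscriminantAnalysis stands in for the Gaussian maximum likelihood classifier, since both fit class-wise mean vectors and covariance matrices.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Placeholder training data: 600 pixels per class x 5 classes, 6 bands each.
rng = np.random.default_rng(0)
X_train = rng.random((3000, 6))
y_train = np.repeat(np.arange(5), 600)

# RBF-kernel SVM with the reported gamma and penalty parameter C.
svm = SVC(kernel="rbf", gamma=0.091, C=100).fit(X_train, y_train)

# Random forest with 1000 trees capped at a depth of 50.
rf = RandomForestClassifier(n_estimators=1000, max_depth=50,
                            random_state=0).fit(X_train, y_train)

# Gaussian maximum likelihood analogue.
ml = QuadraticDiscriminantAnalysis().fit(X_train, y_train)
```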
A majority filter was applied to the resulting classified images, by means of a 3 × 3 kernel moving across the image at a 1-pixel step interval. For replacement to occur, the neighboring pixels surrounding the central pixel must have a simple class majority and be contiguous around the center of the filter kernel to minimize the corruption of cellular spatial patterns.
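A simple version of this majority filter can be written with SciPy's generic_filter, as sketched below; unlike the GIS implementation described above, this sketch enforces only the strict-majority condition, not contiguity around the kernel center.

```python
import numpy as np
from scipy.ndimage import generic_filter

def majority(window):
    """Replace the center pixel only when a strict majority of the
    3 x 3 window (at least 5 of 9 pixels) agrees on one class."""
    center = int(window[window.size // 2])
    counts = np.bincount(window.astype(int))
    winner = int(counts.argmax())
    return winner if counts[winner] > window.size // 2 else center

# Hypothetical classified map with integer class ids 0-4.
classified = np.random.default_rng(0).integers(0, 5, (200, 200))
smoothed = generic_filter(classified, majority, size=3, mode="nearest")
```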
We also applied object-oriented classification to the two segmented images (small- and large-object segmentations) using ML, SVM, and RF algorithms. The same parameters applied to each algorithm in the pixel-based classification were applied to both the small- and large-object segmented images.
2.4. Species Classification Using a Deep Learning Method
A variety of deep learning convolutional neural network (CNN) models have been proposed for various environmental applications, including object detection and semantic segmentation. Popular ones include VGG [37], ResNet [38], Inception V3 [39], and Xception [40]. The main advantage of deep learning CNNs over classical machine learning models lies in their automatic feature extraction, scalability, and high performance. In this study, we evaluated the VGG-19 CNN because of its popular use in remote sensing applications and its excellent results [41,42,43,44].
2.4.1. VGG-19 Model Overview
Developed in 2014, VGG-19 has proven over the years to be a competitive deep learning model for classifying land cover and tree species classes from remotely sensed images [41,42,43,44]. Our implementation uses the VGG-19 CNN within a SegNet encoder–decoder architecture, whereby the encoding CNN learns relevant low-level features about the target classes from the input RGB image and passes the information on to the decoder, which formulates a prediction for every pixel in the RGB image using max-pooling indices. Specifically, the VGG-19 CNN takes 224 × 224 RGB input images and moves them through 16 convolutional layers (VGG-19 has 19 weight layers in total, including 3 fully connected layers) with a fixed kernel size of 3 × 3 and a stride of 1 pixel. Within every convolutional layer, each neuron/pixel is connected to a few nearby neurons/pixels in the previous layer, with the exact same set of weights used for each local connection. This process enhances the capability of the model to identify local features, uninfluenced by pixels elsewhere in the image, and it gives features an equal chance of being detected anywhere throughout the image. At the end of each convolutional stack, the volume size is reduced by a 2 × 2 max-pooling filter with a stride of 2, which retains the most prominent feature (largest value) in each patch. The stack of convolutional layers is followed by three fully connected layers, such that each neuron in a layer is connected to every neuron in the previous layer with its own weight. Lastly, a softmax layer normalizes the output of the network into a probability distribution over the predicted output classes for each pixel.
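Since the study's implementation lives in MATLAB, the PyTorch sketch below is only an illustrative analogue of the architecture just described: a pretrained VGG-19 encoder feeding a deliberately simplified decoder (bilinear upsampling in place of SegNet's pooling-index unpooling), ending in per-pixel class scores.

```python
import torch
import torch.nn as nn
from torchvision import models

class VGG19Segmenter(nn.Module):
    """Minimal encoder-decoder sketch: a pretrained VGG-19 encoder with a
    simplified decoder; SegNet proper unpools with stored max-pool indices."""

    def __init__(self, num_classes=5):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
        self.encoder = vgg.features  # 16 conv layers (3x3 kernels), 5 max pools
        self.decoder = nn.Sequential(
            nn.Conv2d(512, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=32, mode="bilinear", align_corners=False),
            nn.Conv2d(64, num_classes, kernel_size=1),  # per-pixel class scores
        )

    def forward(self, x):                      # x: (batch, 3, 224, 224)
        return self.decoder(self.encoder(x))   # (batch, num_classes, 224, 224)

model = VGG19Segmenter(num_classes=5)
scores = model(torch.randn(1, 3, 224, 224))
probs = scores.softmax(dim=1)  # softmax over classes for each pixel
```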
2.4.2. Training Label Collection
Collecting training samples for the VGG-19 semantic segmentation method required a different approach. Rather than collecting single pixels, we created sample images from full drone scenes in which every pixel was labeled with one of the five classes. For labeling convenience, fifty 512 × 512-pixel images were labeled in the MATLAB® Image Labeler application. The images covered a total of 635 m², including 127 junipers, 103 live oaks, and 36 mesquites (appropriately representing each tree species' relative abundance in our study area). To create a training set for model fitting, we subsampled each of the 50 images to randomly select 1000 images of 224 × 224 pixels. Figure 6 shows a sample of the labeled data.
2.4.3. VGG-19 Model Training
We trained the VGG-19 model within the MATLAB programming environment using the computer vision and deep learning toolboxes. Since VGG-19 is a pretrained model, our training process was in essence an adaptation of the model to new classes on which it was not originally trained, often referred to as transfer learning. Prior to training, the collected data, comprising the 1000 raw 224 × 224 RGB images and their associated labeled images, were split into training (75%) and validation (25%) sets. We then set up the training process in MATLAB by specifying the input training and validation data, the required input size (224 × 224), the pretrained model (VGG-19), and the number of semantic classes (in our case, 5). Considering the unbalanced sampling across the five semantic classes, we incorporated class weights, calculated as the inverse frequency of each class, to strengthen the robustness of our model. Furthermore, to enhance the accuracy of the network, we augmented the training data by randomly shifting, rotating, and reflecting the images to create different versions of the data [45,46]. With the model fully specified, we trained it using mini-batch stochastic gradient descent with momentum (SGDM) as the optimizer. Learning parameters, including schedule type, rate, factor, and mini-batch size, followed the inputs set by [47]. Model training was accomplished over 100 epochs on a 64-bit Dell workstation (Intel® Xeon® processor with 256 GB RAM, NVIDIA™ Quadro K5200 GPU with 8 GB RAM) and took about 2 days to complete. Once training was completed, we applied the model to the 10 cm RGB orthomosaic to generate a field-level classification map.
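The corresponding training setup, inverse-frequency class weights plus SGDM, looks roughly as follows in PyTorch; the pixel counts, learning rate, and batch contents are placeholder assumptions, since the actual run used MATLAB with the parameters of [47].

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder per-class pixel counts; inverse-frequency weighting gives
# rare classes (e.g., mesquite) more influence on the loss.
pixel_counts = torch.tensor([4.0e6, 1.5e6, 1.2e6, 9.0e5, 3.0e5])
class_weights = pixel_counts.sum() / (len(pixel_counts) * pixel_counts)
criterion = nn.CrossEntropyLoss(weight=class_weights)

model = VGG19Segmenter(num_classes=5)  # from the sketch above
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

# Placeholder batch of 224 x 224 tiles with per-pixel class labels.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 5, (8, 224, 224))
loader = DataLoader(TensorDataset(images, labels), batch_size=4)

for epoch in range(100):                 # 100 epochs, as in the text
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)    # weighted per-pixel loss
        loss.backward()
        optimizer.step()
```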
2.5. Accuracy Assessment
For consistency, the accuracy of each methodology was assessed by producing an error matrix using the same 1000-point testing dataset. These error matrices include both overall accuracies (OA) and kappa coefficients (KC), which were calculated using the following formulas:

$$OA = \frac{TC}{N}$$

$$KC = \frac{N\sum_{i=1}^{r} X_{ii} - \sum_{i=1}^{r}\left(X_{i+} \times X_{+i}\right)}{N^{2} - \sum_{i=1}^{r}\left(X_{i+} \times X_{+i}\right)}$$

where TC = total number of correctly labeled pixels; N = total number of testing pixels; r = number of rows and columns in the confusion matrix; Xii = observation in row i and column i; Xi+ = marginal total of row i; and X+i = marginal total of column i. OA and KC are standard metrics for assessing image classification accuracy: OA quantifies the accuracy of the entire product, while KC is a more robust measure that accounts for the probability of chance agreement.
We also included user's (UA) and producer's (PA) accuracies in our assessments to provide a more comprehensive evaluation of the quality of our products. The UA takes the point of view of the user of the classified map: it is the probability that a classified pixel is in fact the class it claims to be. The PA takes the point of view of the map maker: it is the probability that real features on the ground are correctly shown on the classified map. They were calculated using the following formulas:

$$UA = \frac{A}{B} \qquad PA = \frac{A}{C}$$

where A = the number of pixels correctly identified in a given map class; B = the total number of pixels claimed by the map to be in that class; and C = the total number of pixels in the reference class.
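All four metrics follow mechanically from the error matrix, as the short NumPy sketch below shows; the example matrix is hypothetical.

```python
import numpy as np

def accuracy_metrics(cm):
    """OA, KC, and per-class UA/PA from an error matrix whose rows are
    map classes and whose columns are reference classes."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()                                 # total testing pixels
    tc = np.trace(cm)                            # correctly labeled pixels
    oa = tc / n
    chance = (cm.sum(axis=1) * cm.sum(axis=0)).sum()
    kc = (n * tc - chance) / (n**2 - chance)
    ua = np.diag(cm) / cm.sum(axis=1)            # A / B per map class
    pa = np.diag(cm) / cm.sum(axis=0)            # A / C per reference class
    return oa, kc, ua, pa

# Hypothetical 3-class error matrix.
cm = [[180, 10, 10],
      [15, 170, 15],
      [5, 20, 175]]
oa, kc, ua, pa = accuracy_metrics(cm)
```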
5. Conclusions
Our findings show that RGB drone sensors are indeed capable of providing highly accurate classifications of woody plant species in semiarid landscapes, comparable to, and in some respects greater than, those achieved by aerial imagery using hyperspectral sensors in more diverse landscapes. One reason for the high quality of the classifications we obtained could be the innate seasonal diversity of the woody plant species found in the semiarid Edwards Plateau. Using drones enables continuous surveillance of an area, with little preflight planning required. This advantage over aerial or satellite imagery means that land managers and scientists can more easily investigate site-specific and species-specific problems, including the woody plant encroachment that has seriously affected many such areas.
The best overall accuracies are achieved by pairing drone imagery with a deep learning convolutional neural network algorithm, such as VGG-19, which can be replicated by labeling new images every time new data are gathered. However, the tedious nature of labeling a large number of training samples, the GPU requirements, and the long model training times make this method difficult to implement. For this reason, further studies in these landscapes should test other deep learning models to see which performs best, and should train models with smaller samples of training data to see how accuracies and processing times are affected. Testing different models and varying amounts of training data is needed to find a balance between long processing times and acceptable classification accuracies. Another avenue that would add to our knowledge base would be to gather data during the summer months, when all the woody plant species are in full leaf, to gain insights into possible limitations of RGB drone imagery in these landscapes. Additionally, yet another avenue for increasing our understanding of the capability of high-resolution RGB drone imagery would be to perform similar studies in other semiarid grassland and savanna regions of the world with a different variety of encroaching woody plant species.
An exciting future possibility would be to expand the study through data fusion with National Agricultural Imagery Program (NAIP) imagery, which provides countrywide 1 m ground-sample-distance imagery every 3 years. Data fusion would allow plot-level evaluation of woody plant encroachment to be applied across the entire Edwards Plateau region of Texas and other semiarid regions of the continental United States (CONUS). Additionally, because NAIP imagery is acquired every 3 years, it allows change to be tracked, thereby mapping the spread of woody species across semiarid regions of the CONUS. Furthermore, NAIP imagery is publicly available on Google Earth Engine (GEE), giving interested parties the ability to use Google's powerful cloud computing capabilities.
Finally, our recommendation to those working in these landscapes is to use an object-oriented SVM classifier due to its ease of implementation, fast turnaround time, relatively high overall and tree class accuracy, and accurate depiction of tree crowns. This study provides important groundwork for such efforts by demonstrating that using RGB imagery captured by drones is a cost-effective and time-efficient strategy for classifying woody plant species in semiarid grasslands and savannas.