Introduction

Wildfires represent a critical ecological and environmental challenge, impacting ecosystems and human communities globally. This study focuses on vegetation fires, highlighting their frequency, spread, and management strategies. Forest loss and degradation, the emission of large quantities of gases and aerosols, and declining biodiversity have been identified as significant contributors to increased vulnerability to fires (Albar et al. 2018). The global occurrence of wildfires varies considerably, with estimates suggesting that they affect between 300 and 400 million hectares annually, depending on region and local conditions (van Lierop et al. 2015; Attri et al. 2020). Over 80% of global wildfires occur in savannahs and grasslands, mainly in South America, Australia, Africa, and South Asia, while forest and shrub-dominated regions account for about 20% (Schultz et al. 2008). Substantial funds are allocated annually to fire management to reduce or prevent the adverse consequences of wildfires (Thomas et al. 2017). Wildfire events lead to the death and displacement of fauna (Tien Bui et al. 2016; Bhujel et al. 2017), pose risks to the lives and livelihoods of local communities, affect soil fertility and water cycles, release harmful pollutants, including particulate matter (Shahdeo et al. 2020), that may contribute to global warming, and result in the loss of vegetation cover (Martell 2007; Usoltsev et al. 2020; Shobairi et al. 2022; Anees et al. 2022b, 2024; Akram et al. 2022; Aslam et al. 2022; Khan et al. 2024). Advances in remote sensing technologies have contributed significantly to the monitoring and evaluation of vegetation fires (Gitas et al. 2012). Previous research has leveraged multi-temporal and multi-sensor remote sensing data to assess and monitor vegetation fires (Table 1).

Table 1 List of sensors used for fire monitoring

Vegetation fires result from a complex network of interactions among natural variables, including climate and weather conditions (Andreevich et al. 2020), fuel composition, and topography. Ignition sources include hot surfaces, electrical sparks, flames, friction, static electricity, mechanical impacts (such as machinery contact or falling rocks), and natural events like lightning (Vadrevu et al. 2008; Bui et al. 2017; Nami et al. 2018). Human activities are globally recognized as predominant causes of fires, and practices such as slash-and-burn agriculture are widespread in South and Southeast Asia. This study, however, focuses on how climatic factors, rather than direct human interventions, predominantly influence fire dynamics in Pakistan. While acknowledging the significant impact of human activities on fire occurrence in regions such as the Eastern Ghats and northeast India (Vadrevu et al. 2008), Sarawak in Malaysia (Kleinman et al. 1995), and the Chittagong hill tracts in Bangladesh (Borggaard et al. 2003), our analysis examines how environmental variables (Anees et al. 2022b) such as temperature, humidity, and solar radiation shape the region’s fire ecology. Topographic factors such as aspect, slope (Muhammad et al. 2023), and elevation are also considered for their effects on the extent of burnt areas and fire intensity, based on comparisons across different studies (Nunes et al. 2016; Pan et al. 2023).

Various models have been documented in the literature, each focusing on a distinct phase of the fire control cycle. These include vegetation fire occurrence models (Botequim et al. 2017), vegetation fire spread models (Zhai et al. 2020), deployment and dispatch models, vegetation fire damage models, and decision and information systems as technological support platforms (Marques et al. 2012; Duff and Tolhurst 2015). The studies describing these models briefly discuss the prominent algorithms in each category, including supervised, unsupervised, and agent-based modeling approaches, and provide references on the fundamentals of machine learning. Supervised learning establishes a mapping between labeled input data and the corresponding known output. When the target variable is continuous, the task is a regression analysis, with applications including fire vulnerability, fire occurrence, fire spread and burn area estimation, smoke and emissions prediction, and climate change assessment (Jain et al. 2020). Unsupervised learning aims to uncover patterns and relationships within data without using a specific target or outcome variable to guide the learning process. It is applicable to tasks involving clustering and dimensionality reduction. Clustering tasks in this context are used for fire mapping, fire detection, prediction of burnt areas, and fire weather prediction (Bot & Borges 2022). Some fire prediction algorithms, prominent for their computational speed and simplicity, use both supervised and unsupervised learning techniques to determine vegetation fire risks. These include neural networks, decision trees, random forest (Eslami et al. 2021), regression trees, and classification algorithms (Cabral et al. 2018), along with K-nearest neighbor, support vector machines, K-means clustering, self-organizing maps, autoencoders, hidden Markov models, and hard competitive learning (Arnold et al. 2014).

A prominent gap exists in long-term, predictive studies integrating environmental, meteorological, and human factors, particularly across broader geographical scales (Sohail et al. 2023). This gap highlights the need for enhanced predictive modeling to inform proactive fire management strategies. In response to these gaps, our research aims to (1) compile a comprehensive dataset of historical fire incidents in Pakistan from 2001 to 2022; (2) develop a predictive model for wildfire occurrences using MODIS data, incorporating various environmental and meteorological variables to forecast spatial and temporal patterns; and (3) conduct a long-term trend analysis to evaluate the frequency, distribution, and severity of wildfires in Pakistan over the past two decades.

Materials and methods

Study area

The research focused on Pakistan, covering the period from 2001 to 2022. Pakistan is located in the western zone of South Asia, northeast of the Arabian Sea, between latitudes 24° and 37° N and longitudes 62° and 75° E (Qasim et al. 2014). Pakistan covers an area of 875,832 km². Forests cover 2113 km², croplands cover 176,976 km², and other vegetation covers 261,755 km². According to MODIS data, there were 208,943 fire events recorded in Pakistan from 2001 to 2022, including 642 in forests, 158,474 in croplands, and 31,484 in other vegetation types. Figure 1 shows classifications of forested land, cropland, and other vegetated land.

Fig. 1 Study area map along with various LULC classes

The country is known for its diverse landscapes, which include towering mountains in the north and expansive arid regions in the southwest. It has four distinct seasons: a mild and dry winter (December to February), a hot and dry spring (March to May), a rainy season (June to August), and a post-monsoon season (September to November) (Begum et al. 2011). Pakistan’s forest cover is only 4.5%, a substantial concern considering the country’s agricultural-driven economy and location within the South Asian Ecological Zone (Oliveira et al. 2011). Throughout the latter half of the twentieth century, evidence indicated an escalating incidence of wildfires in Pakistan, contributing to increased burn area (Rafaqat et al. 2022a, b). Characterized by its lowest elevation at sea level and vulnerability to desertification, the eastern region of Pakistan requires targeted conservation and fire prevention strategies, particularly considering the availability of remote sensing technologies and worldwide databases that provide opportunities for a more detailed identification of factors causing fires and enhanced prediction models (Rafaqat et al. 2022a, b). This region is particularly vulnerable to wildfires due to its dry environment with little rainfall and susceptibility to desertification (Kattel et al. 2019; Anees et al. 2022a).

Datasets

Handling of response variable

This study analyzes historical fire data for the period from January 2001 to December 2022. We used the MODIS fire product from the Fire Information for Resource Management System (FIRMS), which provides information on active fires detected by the MODIS instruments aboard NASA’s Aqua and Terra satellites (https://firms.modaps.eosdis.nasa.gov) (Zhang et al. 2021). We combined the monthly global 500 m gridded product with 1 km MODIS active fire observations to enhance the spatial analysis of the MCD64A1 Version 6 Burned Area data product (Giglio et al. 2018). This product facilitates the identification of per-pixel burned areas, detecting thermal anomalies and fire locations at a moderate resolution (Katagis and Gitas 2022). We used these data to evaluate fire regimes on a national to continental scale, identify global hot spots of fire, and monitor trends in global vegetation fire occurrences (Giglio et al. 2006; Chuvieco et al. 2008). All fire events reported with a confidence level exceeding 50% were considered for detailed analysis. The analysis followed a grid-based approach, examining each 1 × 1 km grid cell for vegetation fire occurrence, labeled “1” for presence and “0” for absence. Analyzing land use and land cover was crucial for understanding the distribution and types of vegetation affected by fires. The International Geosphere-Biosphere Programme (IGBP) classification scheme of the MODIS product MCD12Q1 was used in the study (Liang et al. 2015; Badshah et al. 2024). This product provides land cover data at 500 m resolution (Sulla-Menashe and Friedl 2018). The dataset, available from the LP DAAC website (https://lpdaac.usgs.gov/), greatly aided in identifying the surfaces beneath the various vegetation types in the study area (Usoltsev et al. 2022; Zhao et al. 2022).

The land cover data for the research area (Table 2) were mosaicked and reprojected using the Hierarchical Data Format-Earth Observing System (HDF-EOS) to Grid (HEG) tools. This step was crucial for achieving an accurate and coherent spatial representation of land cover types. Grid cells in the study area were divided into categories according to the land cover type affected by a vegetation fire: forest, cropland, and other vegetation. In total, 512 of 642 forest cells, 124,179 of 158,474 cropland cells, and 22,663 of 31,484 other-vegetation cells were marked as “fire cells” and assigned the value “1.”
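The grid-based labeling described above can be sketched as follows. This is a minimal illustration, not the authors' processing chain: the record layout `(x_km, y_km, confidence)`, the function name `label_fire_cells`, and the toy detections are all hypothetical, and coordinates are assumed to be already projected to kilometres.

```python
import math

CONFIDENCE_THRESHOLD = 50  # detections must exceed 50% confidence (from the text)


def label_fire_cells(detections, all_cells):
    """Return {cell: 1 or 0} for every 1 x 1 km cell in all_cells.

    detections: iterable of (x_km, y_km, confidence) tuples (hypothetical layout).
    all_cells:  iterable of (col, row) integer cell indices covering the area.
    """
    fire_cells = set()
    for x_km, y_km, confidence in detections:
        if confidence > CONFIDENCE_THRESHOLD:
            # floor() maps a projected coordinate to its 1 km cell index
            fire_cells.add((math.floor(x_km), math.floor(y_km)))
    return {cell: int(cell in fire_cells) for cell in all_cells}


# Toy data: the second detection is dropped because its confidence is below 50.
detections = [(0.4, 0.7, 80), (2.1, 1.9, 45), (2.6, 1.2, 95)]
cells = [(c, r) for c in range(3) for r in range(2)]
labels = label_fire_cells(detections, cells)
```

A cell is labeled "1" as soon as one sufficiently confident detection falls inside it, so repeated detections in the same cell do not change the label.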

Table 2 Descriptions of vegetation types

During the dataset development, we created two random subsets of the actual MODIS vegetation fire ignition spots that were detected. We allocated 70% of this data for training the models and the remaining 30% for testing their performance. This division is standard practice in machine learning to validate models effectively, ensuring they can generalize well to new, unseen data. Using a 70–30 split, we aim to provide a robust dataset for training while retaining sufficient data for an accurate assessment of model performance in real-world scenarios (Rubí et al. 2023).
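A 70–30 random split of this kind can be sketched in a few lines. The function name `train_test_split`, the fixed seed, and the integer stand-ins for fire points are assumptions made for illustration; the study's own tooling is not specified at this point in the text.

```python
import random


def train_test_split(samples, train_fraction=0.7, seed=42):
    """Shuffle a copy of samples and cut it into training and testing parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


points = list(range(1000))  # stand-ins for labeled grid cells
train, test = train_test_split(points)
```

Fixing the seed makes the partition reproducible, which helps when several models are compared on the same held-out data.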

Selection and handling of predictor variables

This study used the Shuttle Radar Topography Mission (SRTM) Digital Elevation Model (DEM) dataset to investigate the influence of elevation, slope, and aspect (Fig. 2) on vegetation fires. The SRTM dataset, downloaded from the SRTM Data Portal (January 1, 2023), provides highly accurate nationwide coverage.

Fig. 2 Topographical factors. A Elevation. B Aspect. C Slope

The historical monthly climatic data were downloaded from two sources: WorldClim (https://www.worldclim.org/) (Barreto and Armenteras 2020) and the ERA5 climate reanalysis (https://cds.climate.copernicus.eu/) (Zhang et al. 2021) (accessed on January 1, 2023). Key climatic variables extracted from WorldClim include minimum temperature (°C), maximum temperature (°C), and precipitation (mm), provided in GeoTIFF format with a spatial resolution of approximately 2.5 arc-minutes (~ 21 km²). Additional climatic variables sourced from the ERA5 reanalysis include the northward and eastward components of the 10 m wind (m/s), skin temperature (°C), surface net solar radiation (W/m²), surface net thermal radiation (W/m²), surface pressure (hPa), soil temperature (°C), and forecast albedo (unitless). These variables are provided in NetCDF format with a spatial resolution of about 9 km². All data were preprocessed in RStudio using the “raster” and “ncdf4” packages, alongside the ArcGIS software (Table 3).

Table 3 Descriptions of independent variables

Detection of violations of assumptions about independent variables

A linear regression model may encounter multicollinearity, characterized by a substantial correlation among its independent variables. This multicollinearity has the potential to distort the model’s estimation and impede accurate predictions (Chang et al. 2013). The correlation matrix shown in Fig. 3 uses a color scale ranging from blue (low correlation) to red (high correlation) to identify significant correlations between variables. Each cell in the matrix represents the correlation coefficient between two variables, providing a visual aid to detect potential multicollinearity issues. Analysis of multicollinearity involves assessing variance inflation factors (VIF) and tolerance levels (TOL), which are commonly utilized to evaluate the relationships among independent variables. It is widely acknowledged that a TOL value below 0.1 and a VIF value exceeding 10 indicate the presence of multicollinearity (Bui et al. 2019; Li et al. 2022). These thresholds suggest that multicollinearity could significantly impact the reliability of regression and classification model estimates. TOL and VIF are computed as follows (Eqs. 1 and 2):

$$\text{TOL}=1-{\text{R}}^{2}$$
(1)
$$VIF=\frac{1}{1 - {\text{R}}^{2} }=\frac{1}{TOL}$$
(2)

where \({R}^{2}\) is the coefficient of determination obtained by regressing one independent variable on all the other independent variables.
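As a small worked example of Eqs. 1 and 2, the sketch below converts an \({R}^{2}\) value into TOL and VIF and applies the thresholds quoted above (TOL below 0.1, VIF above 10). The function names and the sample \({R}^{2}\) values are illustrative assumptions, not results from this study.

```python
def tolerance(r_squared):
    """Eq. 1: TOL = 1 - R^2."""
    return 1.0 - r_squared


def vif(r_squared):
    """Eq. 2: VIF = 1 / TOL (undefined when R^2 equals 1)."""
    return 1.0 / tolerance(r_squared)


def is_multicollinear(r_squared, tol_min=0.1, vif_max=10.0):
    """Flag a predictor using the thresholds TOL < 0.1 or VIF > 10."""
    return tolerance(r_squared) < tol_min or vif(r_squared) > vif_max


moderate = is_multicollinear(0.50)  # TOL = 0.5, VIF = 2
severe = is_multicollinear(0.95)    # TOL is about 0.05, VIF is about 20
```

Because VIF is the reciprocal of TOL, the two thresholds describe the same boundary, \({R}^{2}=0.9\).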

Fig. 3 Spearman rank correlation heat maps for a forest, b crop, and c other vegetation

Mann–Kendall mutation test

The Mann–Kendall mutation test is a statistical method used to analyze temporal fluctuations and detect significant trends or “mutational changes” in time series data. Mutational changes are substantial alterations in the trend of the data, such as shifts from increasing to decreasing values or vice versa, which may indicate environmental or systemic changes. The method is valued for its straightforward implementation, high precision, broad applicability across diverse datasets, minimal need for human intervention, and efficient validation (Yue et al. 2002). Let x denote a time series of n samples. Analyzing its temporal patterns provides insight into the historical evolution of the environmental system that generated the data, including weather variables and MODIS-detected changes (Mehmood et al. 2024d). The test first computes the sequence used to detect mutations (Eq. 3):

$${d}_{k}={\sum }_{i=1}^{k}{\gamma }_{i} \left(k=2, 3\dots , n\right).$$
(3)

where \({\gamma }_{i}\) is, in the usual formulation of the test, the number of earlier observations \({x}_{j}\) (\(j<i\)) that are smaller than \({x}_{i}\). Assuming the observations are independent and identically distributed, \({d}_{k}\) is standardized as (Zhang et al. 2020):

$$U{F}_{k}=\frac{{d}_{k}-E\left({d}_{k}\right)}{\sqrt{Var\left({d}_{k}\right)}}$$
(4)

where \(E({d}_{k})\) is the expected value and \(Var({d}_{k})\) is the variance of \({d}_{k}\), and \(U{F}_{k}\) follows a standard normal distribution. The forward statistic UF is computed from the time series in its original order \({x}_{1}, {x}_{2}, \dots , {x}_{n}\). The same procedure is then applied to the reversed sequence \({x}_{n}, {x}_{n-1}, \dots , {x}_{1}\) to obtain the backward statistic UB. At each step, \({d}_{k}\) is compared with its expected mean and variance to identify deviations that suggest trends. A UF or UB value greater than 0 indicates an upward trend in the time series, and a value less than 0 indicates a downward trend. When these values exceed the critical threshold at the chosen significance level, the upward or downward trend is significant. The region in which the curve lies beyond the threshold line is the time region of the mutation (Feng et al. 2016).
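The forward and backward statistics can be sketched as below. The text does not give closed forms for the mean and variance, so the sketch assumes the standard sequential Mann–Kendall expressions \(E({d}_{k})=k(k-1)/4\) and \(Var({d}_{k})=k(k-1)(2k+5)/72\), with the first value set to 0 by convention. Function names and the toy series are hypothetical.

```python
import math


def forward_statistic(series):
    """Standardized rank-sum statistic for k = 1..n (first value is 0)."""
    n = len(series)
    uf = [0.0] * n
    d_k = 0
    for k in range(1, n):  # 0-based position k is sample number k + 1
        # gamma: number of earlier values smaller than the current one
        d_k += sum(1 for j in range(k) if series[k] > series[j])
        m = k + 1  # samples seen so far
        mean = m * (m - 1) / 4.0                # assumed E(d_k)
        var = m * (m - 1) * (2 * m + 5) / 72.0  # assumed Var(d_k)
        uf[k] = (d_k - mean) / math.sqrt(var)
    return uf


def mann_kendall_mutation(series):
    """Return (UF, UB); UB uses the reversed series, negated and re-ordered."""
    uf = forward_statistic(series)
    ub = [-v for v in forward_statistic(series[::-1])][::-1]
    return uf, ub


rising_uf, rising_ub = mann_kendall_mutation([1, 2, 3, 4, 5])
falling_uf, _ = mann_kendall_mutation([5, 4, 3, 2, 1])
```

In practice the two curves are plotted against time together with the ± 1.96 lines, and their crossing points are inspected as candidate mutation times.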

Methodological overview of machine learning models

Logistic regression

Logistic regression (LR) is a classical statistical method for modeling a binary outcome as a function of one or more independent variables (Balboa et al. 2024). It has proved effective in different geographic locations for predicting fire occurrence and analyzing its driving variables at different topographical levels (Garcia et al. 1995; Martínez et al. 2009), and many researchers have demonstrated the applicability of the model (Oliveira et al. 2012; Rodrigues and De la Riva 2014). The formula for LR is:

$$\text{Logit}\left(P\right)=\mathit{ln}\left(\frac{P}{1-P}\right)={a}_{0}+{a}_{1}{x}_{1}+{a}_{2}{x}_{2}+\dots +{a}_{n}{x}_{n}$$
(5)

where \(P\) is the probability of vegetation fire occurrence, \(n\) is the number of variables, \({a}_{0}\) is the intercept, \({a}_{1}, {a}_{2}, \dots , {a}_{n}\) are the coefficients of the variables, and \({x}_{1}, {x}_{2}, \dots , {x}_{n}\) are the factors that influence the occurrence of vegetation fires (Peng et al. 2002; Zhang et al. 2021).
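To make the logit relationship concrete, the sketch below inverts it to obtain a fire probability from a linear predictor. It assumes the usual form with an intercept; the function name and all coefficient and predictor values are invented for illustration and are not fitted values from this study.

```python
import math


def fire_probability(intercept, coefficients, predictors):
    """Invert the logit: P = 1 / (1 + exp(-(a0 + a1*x1 + ... + an*xn)))."""
    logit = intercept + sum(a * x for a, x in zip(coefficients, predictors))
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical coefficients and standardized predictor values
p = fire_probability(0.3, [1.2, -0.5], [0.4, 2.0])
recovered_logit = math.log(p / (1.0 - p))  # should return the linear predictor
```

A cell would then be classified as a fire cell when \(P\) exceeds a chosen threshold, commonly 0.5.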

Random forest

The random forest (RF) model was employed to determine the variables that drive vegetation fires and their respective influences on the probability of vegetation fires in the geographical areas of Pakistan. The RF model, presented by Breiman (2001), employs multiple decision trees to train and predict samples, rendering it a classifier (Haddouchi and Berrado 2019). RF is a machine learning method based on an ensemble of classification and regression trees (CARTs). Each tree in the RF model is built using bootstrap samples, enhancing the model’s robustness against outliers and variability, which is critical for predictive accuracy in forest fire forecasting (Su et al. 2018; Zhang et al. 2022). The RF model is a fast machine-learning approach that can handle many input factors and delivers high predicted accuracy (Sarkar et al. 2024). Still, it is sensitive to the danger of overfitting (Luo et al. 2024).

$$h\left(x\right)=\frac{1}{T}{\sum }_{t=1}^{T}h\left(x,{\theta }_{t}\right)$$
(6)

Hyperparameter tuning was critical to deriving the final models (Probst et al. 2019; Mehmood et al. 2024a, b, c). The number of trees (n = 1000), the maximum tree depth (8), and the minimum node size (7 samples per leaf node) were optimized for forest and crop fire prediction; for other vegetation, the minimum node size was 6. In Eq. 6, the final prediction is the mean of the predictions of the individual regression subtrees \(\{h (x,{\theta }_{t})\}\), where \(T\) is the number of decision trees, \({\theta }_{t}\) is an independently and identically distributed random vector, and \(x\) is the input vector. The predictive performance of the model depends on the number of random features and trees (Segal and Xiao 2011).

eXtreme Gradient Boosting

eXtreme Gradient Boosting (XGBoost) is a gradient-boosting decision tree (GBDT) algorithm (Chen and Guestrin 2016). It uses a second-order Taylor expansion to optimize the loss function and offers improved computing efficiency and generalization ability compared with other machine learning algorithms (Xie et al. 2022). The XGBoost model is expressed as:

$$\begin{array}{c}{\widehat{y}}_{i}={\sum }_{k=1}^{K} {f}_{k}\left({x}_{i}\right),{f}_{k}\in F\end{array}$$
(7)

Here, \({\widehat{y}}_{i}\) is the predicted value for the ith sample, \(K\) denotes the number of decision trees, \({x}_{i}\) is the input data for the ith sample, \({f}_{k}\left({x}_{i}\right)\) is the output of the \(k\) th decision tree, generated in the \(k\) th iteration, and \({f}_{k}\) belongs to the tree collection space \(F\) (Luo et al. 2024).

The objective function for XGBoost is:

$$\begin{array}{c}Obj={\sum }_{i=1}^{N}l\left({y}_{i}, {\widehat{y}}_{i}\right)+{\sum }_{k=1}^{K} \Omega \left({f}_{k}\right)={\sum }_{i=1}^{N} l\left[{y}_{i},{\widehat{y}}_{i}^{\left(t-1\right)}+{f}_{t}\left({x}_{i}\right)\right]+{\sum }_{k=1}^{K} \Omega \left({f}_{k}\right)\end{array}$$
(8)

In Eq. (8), the first term is the loss function, which measures the difference between the predicted and observed values. The second term is a regularization term that controls the complexity of the model, guides the construction of the tree structure, and helps prevent overfitting (Piraei et al. 2023).
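The second-order Taylor expansion mentioned above can be written out as a worked step. Following the formulation of Chen and Guestrin (2016), and treating the symbols \({g}_{i}\) and \({h}_{i}\) as notation introduced here rather than taken from this paper, the objective at iteration \(t\) is approximated by:

$$Ob{j}^{\left(t\right)}\simeq {\sum }_{i=1}^{N}\left[l\left({y}_{i},{\widehat{y}}_{i}^{\left(t-1\right)}\right)+{g}_{i}{f}_{t}\left({x}_{i}\right)+\frac{1}{2}{h}_{i}{f}_{t}^{2}\left({x}_{i}\right)\right]+\Omega \left({f}_{t}\right)$$

where \({g}_{i}\) and \({h}_{i}\) are the first and second derivatives of the loss \(l({y}_{i},{\widehat{y}}_{i}^{(t-1)})\) with respect to the previous prediction \({\widehat{y}}_{i}^{(t-1)}\). Since the first term inside the brackets does not depend on \({f}_{t}\), only the terms in \({g}_{i}\) and \({h}_{i}\) and the regularization term need to be minimized when the new tree is built.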

Support vector machines

Support vector machines (SVMs) are widely used for pattern classification and nonlinear regression. SVMs are based on the principle of structural risk minimization (Jodhani et al. 2024). The fundamental concept is to construct a classification hyperplane that serves as a decision boundary; maximizing the distance between positive and negative examples yields superior generalization accuracy (Naderpour et al. 2019). SVMs handle data in high-dimensional spaces by employing kernel functions to tackle diverse nonlinear problems (Rossi and Villa 2006). For a two-class SVM, consider a training set \(T=\{({x}_{1},{y}_{1}),\dots ,({x}_{l},{y}_{l})\}\in (X\times Y{)}^{l}\), where \({x}_{i}\in X={R}^{n}\) is the feature vector and \({y}_{i}\in \{1,-1\}\) is the class label for \(i=1,2,\dots ,l\). Given the penalty parameter \(C\) and the kernel function \(K(x,{x}^{\prime})\), the optimization problem is formulated and solved as follows (Boubeta et al. 2015):

$$\begin{array}{c}\underset{\alpha }{\mathit{min}}\ \frac{1}{2}{\sum }_{i=1}^{l}{\sum }_{j=1}^{l}{y}_{i}{y}_{j}{\alpha }_{i}{\alpha }_{j}K\left({x}_{i},{x}_{j}\right)-{\sum }_{j=1}^{l}{\alpha }_{j}\end{array}$$
(9)
$$\begin{array}{c}s.t.\ {\sum }_{i=1}^{l}{y}_{i}{\alpha }_{i}=0,\ 0\le {\alpha }_{i}\le C,\ i=1,\dots ,l\end{array}$$
(10)

The optimal solution \({\alpha }^{*}=({\alpha }_{1}^{*}, \dots , {\alpha }_{l}^{*}{)}^{T}\) is obtained. A positive component \({\alpha }_{j}^{*}\) with \(0<{\alpha }_{j}^{*}<C\) is then selected, and the threshold is computed as follows (Pang et al. 2022):

$$\begin{array}{c}{b}^{*}={y}_{j}-{\sum }_{i=1}^{l} {y}_{i}{\alpha }_{i}^{*}K\left({x}_{i},{x}_{j}\right)\end{array}$$
(11)

Finally, the decision function is constructed:

$$f\left(x\right)=sgn\left({\sum }_{i=1}^{l} {\alpha }_{i}^{*}{y}_{i}K\left(x,{x}_{i}\right)+{b}^{*}\right)$$
(12)
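The decision function can be evaluated directly once the multipliers and the threshold are known. The sketch below assumes a radial basis function kernel, which the text does not specify, and uses hand-picked support vectors, multipliers, and bias rather than values obtained by solving the optimization problem. All names are hypothetical.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """Assumed kernel: K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def decision(x, support_vectors, labels, alphas, bias, kernel=rbf_kernel):
    """Sign of sum_i alpha_i * y_i * K(x, x_i) + b; a score of 0 maps to +1."""
    score = sum(a * y * kernel(x, sv)
                for a, y, sv in zip(alphas, labels, support_vectors)) + bias
    return 1 if score >= 0 else -1


# Hand-picked (not trained) values: one "no fire" and one "fire" support vector
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
labels = [-1, 1]
alphas = [1.0, 1.0]
near_fire = decision((1.9, 2.1), support_vectors, labels, alphas, bias=0.0)
near_no_fire = decision((0.1, 0.0), support_vectors, labels, alphas, bias=0.0)
```

Only training points with nonzero multipliers contribute to the sum, which is why the prediction cost depends on the number of support vectors rather than on the size of the training set.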

Model performance evaluation methods

Accuracy serves as a metric for evaluating categorical models, representing the percentage of correctly predicted outputs by the model as follows (Shao et al. 2023):

$$\begin{array}{c}Accuracy=\frac{TP +TN}{TP+TN+FP+FN}\end{array}$$
(13)

\(TP\) is the number of true positive cases, \(TN\) is the number of true negative cases, \(FP\) is the number of false positive cases, and \(FN\) is the number of false negative cases (Pang et al. 2022). Recall, or sensitivity, which is also reported among our evaluation metrics in Table 5, measures the proportion of actual positives that the model correctly identifies (Eq. 14). The F1 score, which combines precision and recall into a single metric, is particularly useful for imbalanced datasets (Eq. 15).

$$\text{Sensitivity}=\frac{TP}{TP+FN}$$
(14)

The F1 score, combining precision, \(TP/(TP+FP)\), and recall, is computed as:

$$\begin{array}{c}F1\ Score=2\cdot \frac{\text{Precision}\cdot \text{Recall}}{\text{Precision}+\text{Recall}}\end{array}$$
(15)

The kappa coefficient measures the agreement between predicted and observed classes beyond that expected by chance and is used to assess the reliability of the classification. It is given by (Watson and Petrie 2010):

$$\begin{array}{c}Kappa=\frac{{P}_{0}-{P}_{E}}{1-{P}_{E}}\end{array}$$
(16)

where \({P}_{0}\) is the observed accuracy of the prediction and \({P}_{E}\) is the probability of chance agreement, derived from the class probabilities; kappa thus accounts for both the observed and the expected agreement. Kappa coefficients are grouped into five categories representing varying degrees of accuracy: 0.0 to 0.20 for extremely low accuracy, 0.21 to 0.40 for medium accuracy, 0.41 to 0.60 for high accuracy, 0.61 to 0.80 for excellent accuracy, and 0.81 to 1 for virtually perfect accuracy (Landis and Koch 1977).
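The metrics in Eqs. 13 to 16 all follow from the four confusion-matrix counts, as the sketch below shows. The chance agreement is computed from the row and column totals, which is the usual reading of "derived from the class probabilities"; the function name and the example counts are illustrative, not results from Table 5.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, precision, F1, and kappa from raw counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # recall
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # Chance agreement from the marginal totals of both classes
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    kappa = (accuracy - p_e) / (1 - p_e)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "precision": precision, "f1": f1, "kappa": kappa}


# Hypothetical counts for 100 test cells
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

With these counts the accuracy is 0.85 and the chance agreement is 0.5, giving a kappa of 0.7, which falls in the 0.61 to 0.80 category listed above.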

The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity), illustrating the trade-off between the two rates across different classification thresholds (Carter et al. 2016). Sensitivity is defined in Eq. 14, and specificity is the proportion of actual negatives that are correctly identified, \(TN/(TN+FP)\) (El Emam et al. 2001; Pang et al. 2022). The area under the ROC curve (AUC) quantifies the overall ability of the model to discriminate between classes (Muschelli III 2020). AUC values are grouped by predictive power: 0.5–0.85 denotes medium performance, 0.85–0.95 signifies high performance, and 1.0 indicates ideal performance (Yingyongyudha et al. 2016; Sun et al. 2021). Figure 4 illustrates the workflow followed in this study.
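The AUC can also be read as the probability that a randomly chosen fire cell receives a higher predicted score than a randomly chosen non-fire cell, with ties counted as one half. The sketch below computes it from that pairwise definition; the function name and the toy scores are assumptions for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


auc_mixed = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])    # 3 of 4 pairs correct
auc_perfect = roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
auc_tied = roc_auc([1, 0], [0.5, 0.5])
```

The pairwise loop is quadratic in the number of samples, so it suits small checks; rank-based formulas give the same value more efficiently on large test sets.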

Fig. 4 Flowchart illustrating the stages involved in data processing and the outputs

Results

This study examined multicollinearity among the environmental and topographic factors. For all three vegetation types (forest, crop, and other vegetation), the tolerance (TOL) values exceed 0.1 and the variance inflation factors (VIF) are below 10, as shown in Table 4. This indicates the absence of problematic multicollinearity among the factors that may initiate fires, so these variables can inform fire risk assessments within the constraints of this study area and period.

Table 4 Results of multicollinearity analysis

Mann–Kendall mutation

The Mann–Kendall test applied to vegetation fires in Pakistan from 2001 to 2022 reveals fluctuating but overall upward trends in fire hotspots. Specifically, from 2006 to 2007, UF values were negative, indicating a temporary decline in fire occurrences. Conversely, from 2001 to 2006 and from 2008 to 2022, UF values were consistently above zero, demonstrating a rising trend in the frequency of fires. Notably, the UF curve exceeds the critical value at the 0.05 significance level (± 1.96), suggesting that the decline and rise in fire frequencies are statistically significant. These trends are shown in Fig. 5. Figure 6 depicts the temporal evolution of vegetation fires from 2001 to 2022, with a legend distinguishing forest fires, crop fires, and other vegetation fires.

Fig. 5 Mann–Kendall mutation test curve illustrating the temporal trends of fire hotspots from 2001 to 2022

Fig. 6 The incidence of vegetation fires during the same period, categorized by fire types such as forest fires, crop fires, and other vegetation fires, highlighting spatial variations

The cumulative anomaly curve of the vegetation fire points in Pakistan remained negative, indicating a consistent buildup of negative anomalies from 2001 to 2022, as shown in Fig. 7. The Mann–Kendall test shows a substantial increasing trend in vegetation fires, while the curve’s position below zero indicates consistent deviations from the expected values. These anomalies suggest that hotspot counts frequently fall below expectations, which requires further investigation of specific time frames and environmental variables. The point at which the UF and UB curves meet the confidence line supports the detection of an essential change in the number of national hotspots between 2001 and 2022.
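A cumulative anomaly curve is commonly built by summing the departures of each year's count from the long-term mean. The sketch below assumes that definition, which the text does not spell out; the function name and the toy counts are illustrative only.

```python
def cumulative_anomaly(counts):
    """Running sum of (value - mean) over the series."""
    mean = sum(counts) / len(counts)
    curve, running = [], 0.0
    for value in counts:
        running += value - mean
        curve.append(running)
    return curve


# Toy series of steadily rising annual hotspot counts
curve = cumulative_anomaly([2, 4, 6, 8])
```

Because the departures from the mean sum to zero, the curve always ends at zero; for a steadily rising series such as this one it stays below zero in between, with its minimum near the point where the counts cross the long-term mean.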

Fig. 7 Cumulative anomaly curve for vegetation fires

Logistic regression

To assess prediction accuracy across vegetation types, a logistic regression model was fitted for each type. For forest fires, the model achieved 81.6% accuracy and an AUC of 87.3%; for crop fires, 70.4% accuracy and an AUC of 72.6%; and for other vegetation fires, 66.6% accuracy and an AUC of 74.2%. These results are presented in Table 5. The ROC curves showing the prediction rates of the four modeling approaches are given in Fig. 8, which illustrates how effectively each approach distinguishes between the presence and absence of fire. In addition, Fig. 9 shows that the importance of initiating factors varies considerably across vegetation types. While weather-related variables tend to dominate in all categories, their impact is not uniformly distributed. For forest fires, variables such as wind speed (wind V), Soil Temp, and Tmin are the most influential, whereas for crop fires, net thermal and ppt take precedence. This variation underscores the complexity of fire risk factors and the need to tailor fire management strategies to specific vegetation types and environmental conditions.

Table 5 Results from the evaluation of the four models for different types of vegetation

Fig. 8 ROC curves showing the prediction rates of the four models: a forest, b crop, and c other vegetation

Fig. 9 The importance of initiating factor indicators in the LR model: a forest, b crop, and c other vegetation

Random forest

The present study used the tidymodels and ranger packages to build RF models for predicting forest, crop, and other vegetation fires. The model’s configuration space was searched to find the parameter set giving the best accuracy (Table 5). With these parameters, the forest fire model achieved an accuracy of 87.5% and an AUC of 93.4%; the crop fire model, 84.0% accuracy and an AUC of 90.6%; and the other vegetation fire model, 83.1% accuracy and an AUC of 90.7%. Figure 8 demonstrates that the RF model outperformed the other three models in terms of accuracy and AUC. Hence, we deemed the RF model the most appropriate of the four models for predicting vegetation fires in Pakistan. Figure 10 shows the variable importance for forests, crops, and other vegetation.

Fig. 10 Importance of initiating factor indicators in the RF model: a forest, b crop, and c other vegetation

Support vector machine

In this section, SVM models are used to predict forest, crop, and other vegetation fires, and their accuracy and generalizability are assessed. The forest fire model achieved an accuracy of 78.7% and an AUC of 83.6%; the crop fire model, 74.5% accuracy and an AUC of 80.7%; and the other vegetation fire model, 68.7% accuracy and an AUC of 74.8%. The ROC curves of the SVM models are shown in Fig. 8. Overall, the SVM models provided useful predictive capability for the different types of vegetation fire, underscoring their potential utility in designing targeted and effective fire prevention and management strategies. Further investigation of feature influence using advanced interpretative methods could enhance the model’s applicability and provide deeper insights into the critical factors driving vegetation fire risks.

eXtreme Gradient Boosting

This section reports how well the XGBoost models predict the different types of vegetation fire. The accuracy and performance of the models were evaluated repeatedly to ensure that they were suitable for diverse prediction conditions. The XGBoost model’s accuracy and AUC scores are shown in Table 5. The forest fire model showed an accuracy of 86.0% with an AUC of 92.6%; for crop fires, the model achieved an accuracy of 83.9% with an AUC of 90.0%; and for other vegetation fires, it recorded an accuracy of 79.4% with an AUC of 87.6%. Figure 8 displays the ROC curves, illustrating the predictive performance of the XGBoost model across different types of vegetation. The results show that the XGBoost models are the second best for vegetation fire prediction in Pakistan using this set of variables for fires from 2001 to 2022. The model could be used to improve management and mitigation approaches for vegetation fire.

Vegetation fire risk assessment

Based on the accuracy of the four models, we selected the RF model, which performed best, to estimate the probability of vegetation fire across the whole country. We used ArcGIS 10.8 to map the potential danger of vegetation fires in Pakistan. The legend values in Fig. 11 represent the predicted probability of vegetation fire, where a probability of 1 indicates the highest likelihood of occurrence. Red regions, with values from 0.8 to 1, indicate high danger, where vegetation fires are very likely to occur. Figure 11 shows that vegetation fires in Pakistan are concentrated in specific regions. These regions include the northwest, covering various districts of Khyber Pakhtunkhwa (KP), such as Malakand Division, Bannu, Parachinar, Tank, and Kohat. Additionally, the northeast region, comprising Azad Jammu and Kashmir (AJK) and Gilgit-Baltistan (GB), shows a high incidence of vegetation fires. The southeast region, which includes Punjab and Sindh, along with Islamabad, Dera Ghazi Khan, Multan, Karachi, Hyderabad, and Mirpur Khas, also experiences many vegetation fires. Lastly, the southwestern region, specifically Baluchistan, including Quetta, is prone to vegetation fires. In general, the likelihood of vegetation fires is greater in the western areas of Pakistan than in the eastern regions, and greater in southern Pakistan than in the northern areas.

Fig. 11

Vegetation fire risk assessment map
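The step from predicted probabilities to map classes can be sketched as follows. Only the 0.8 to 1 "high danger" band is taken from the text; the four lower equal-width bands, their labels, and the sample cell values are assumptions made for illustration and do not describe the legend actually used in Fig. 11.

```python
from collections import Counter

# Lower bound of each class, highest first. Only the 0.8-1.0 band comes
# from the text; the remaining equal-width bands are assumptions.
RISK_BANDS = [
    (0.8, "high"),
    (0.6, "elevated"),
    (0.4, "moderate"),
    (0.2, "low"),
    (0.0, "very low"),
]


def risk_class(probability):
    """Map a predicted fire probability in [0, 1] to a risk class label."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    for lower, label in RISK_BANDS:
        if probability >= lower:
            return label


def summarize(probabilities):
    """Count how many map cells fall into each risk class."""
    return Counter(risk_class(p) for p in probabilities)


# Hypothetical per-cell probabilities, e.g. from a classifier's predicted
# fire probability for each grid cell.
cell_probabilities = [0.05, 0.35, 0.62, 0.81, 0.97, 1.0]
print(summarize(cell_probabilities))
```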

Discussion

In our study, we examined the various factors influencing vegetation fire risk. Our analysis incorporates a detailed evaluation of meteorological variables such as average annual daily high temperature, annual average relative humidity, total annual precipitation, and average annual wind speed. We also considered broader climatic factors, topographical features, and different vegetation types as significant determinants of fire risk (Li et al. 2022). We selected nine variables for analyzing forest and crop fires and seven for other types of vegetation, based on their demonstrated association with fire occurrences and their statistical significance in preliminary models. The nine variables are soil temperature, minimum temperature, the northward and eastward components of the 10 m wind, precipitation, surface net thermal radiation, slope, aspect, and elevation. For other vegetation types, elevation and minimum temperature were less significant in predicting fire ignition. These factors were crucial in training our machine learning models to predict vegetation fires and were instrumental in developing risk maps with the RF model. Fire factors and conditions vary by area (Abid 2021), primarily because of country-specific environmental and socioeconomic variables and the characteristics of the region investigated (Oliveira et al. 2012; Sun et al. 2023). According to Chang et al. (2013), land use intensity, precipitation, and vegetation type are the key variables affecting fires in Durango State, Mexico. In northeast China, fuel moisture, vegetation type, and human activity greatly influence man-made fires. In eastern Kentucky, height and slope are the most influential variables affecting vegetation fires. The most critical factors affecting vegetation fires in Swaziland are elevation, mean annual rainfall, mean annual temperature, and land cover (Dlamini 2010).

This study tested four machine learning methods for predicting fire occurrence and showed each model’s strengths and applications for vegetation fire in Pakistan. The classic LR models provide a good prediction, with a predicted accuracy of 81.6% for forest fires, 69.2% for crop fires, and 66.5% for other vegetation fires. LR performed reliably in many predictive modeling situations, although it did not outperform the more complex models, and it may not capture complex non-linear interactions as effectively as more advanced algorithms (Khalaji et al. 2022). The literature often acknowledges that advanced machine learning models, such as RF, SVM, and XGBoost, perform better than LR, particularly in complex prediction tasks like vegetation fire risk mapping and vulnerability assessment. This study demonstrated that RF exhibited outstanding results, achieving an accuracy of 87.5% for forest fires, 84.0% for crop fires, and 81.7% for other vegetation. This is consistent with previous studies in environmental modeling, which show a tendency toward RF and similar ensemble techniques. These methods are favored for their ability to quickly analyze data with many variables and to capture a wide variety of interactions (Shmuel and Heifetz 2022). An integrated approach may also show outstanding results in the context of the SVM model, which demonstrated functional flexibility in previous research (Rodrigues and De la Riva 2014).

XGBoost is a powerful technique for predictive analysis and has shown outstanding results in several fields, including vegetation fire prediction. The XGBoost models achieved high accuracy and ROC AUC values, which aligns with literature indicating their value for accurate overall classification (Mohajane et al. 2021; Mehmood et al. 2024a, b, c). The model’s ability to handle incomplete data and its efficient use of gradient boosting make it a valuable tool for assessing environmental risks. Research has identified the capacities required to develop effective approaches to managing and mitigating risks in different vegetation environments (Tehrany et al. 2019). Comparing the models reveals differences in complexity, interpretability, and predictive capability. While models like RF and XGBoost can show better predictive accuracy, LR provides a more interpretable framework, which is essential to policy formulation and strategic decision-making (Peng et al. 2021).

Furthermore, the SVM model uses a kernel method, which provides a highly flexible solution for non-linear environmental problems. It can therefore be highly beneficial in analyzing datasets with complex feature associations (Lopez-Martin et al. 2019). Our research methodology also included hyperparameter tuning, which significantly impacts model performance. Modifying the parameters of models like RF and XGBoost (e.g., number of trees, depth of trees, or minimum size for a node) significantly affects their accuracy and ability to generalize. This practice is supported by research highlighting the importance of model optimization (Jiang et al. 2022). According to the Mann–Kendall mutation test, vegetation fires showed an unstable increasing trend in Pakistan. This test is flexible and responsive, which is necessary to consistently capture temporal fluctuations and different kinds of change (Vadrevu et al. 2019; Mehmood et al. 2024a, b, c).
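To make the trend analysis concrete, the sketch below computes the standard Mann–Kendall trend statistic S and its normal approximation Z. This is the basic trend test, not the sequential (mutation) variant referred to above, and it assumes no tied values when computing the variance; the annual fire counts are hypothetical and are not data from this study.

```python
import math


def mann_kendall(series):
    """Return (S, Z) for the standard Mann-Kendall trend test.

    S sums the signs of all pairwise later-minus-earlier differences.
    Z applies the usual continuity correction; the variance formula
    assumes the series has no tied values.
    """
    n = len(series)
    if n < 3:
        raise ValueError("need at least three observations")
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    return s, z


# Hypothetical annual fire counts with a fluctuating upward tendency.
annual_fires = [12, 15, 11, 18, 21, 19, 25, 24, 30, 28]
s_stat, z_stat = mann_kendall(annual_fires)
print(f"S = {s_stat}, Z = {z_stat:.2f}")
# |Z| > 1.96 indicates a significant monotonic trend at the 5% level.
```

A positive Z points to an increasing trend and a negative Z to a decreasing one; the sequential variant applies the same rank-based idea progressively along the series to locate where the trend changes.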

These results suggest that future studies should use a broader range of data sources, including remote sensing data and socio-economic aspects, to enhance the accuracy and applicability of prediction models. It is also worth taking advantage of advances in hybrid models, which combine several techniques to enhance prediction accuracy while remaining accessible. The prediction and analysis of vegetation fires therefore remain a significant area of research with substantial potential to avert disasters and protect natural resources. The consistent and dependable performance of advanced machine learning models in vegetation fire prediction opens numerous possibilities for future research and practical implementation. Both scholars and practitioners can contribute to more efficient methods for mitigating fire hazards and minimizing the impact of vegetation fire through continuous improvement of these models and their integration with comprehensive data sources.

Conclusion

This research applied feature selection techniques to identify the most important variables associated with vegetation fire incidents in Pakistan. The key factors influencing the occurrence of vegetation fires were meteorological and topographical, including soil temperature, minimum temperature, the northward and eastward components of the 10 m wind, precipitation, surface net thermal radiation, and slope. We constructed four types of prediction model for each kind of vegetation fire (forest, crop, and other vegetation) using the following ML algorithms: logistic regression (LR), random forest (RF), support vector machine (SVM), and eXtreme Gradient Boosting (XGBoost). The RF model demonstrated the best overall predictive capability, with an accuracy of 87.5% for forest fires, 84% for crop fires, and 83.1% for other vegetation fires. Hence, given its balance of computational speed and minimal variable requirements, the RF model is the most efficient choice for vegetation fire prediction in Pakistan. Using the RF probabilities, we created a map illustrating the annual likelihood of vegetation fires occurring throughout Pakistan during the study period. The study has significant implications for wildfire management policy and strategy. Because these algorithms predict fires with good accuracy, they can help governments and firefighting agencies allocate resources and devise preventative measures. The study’s long-term trend analysis shows an unstable increasing trend in vegetation fires in Pakistan, underscoring the importance of adaptable and flexible models that reflect temporal fluctuations and changes in fire dynamics. Further research should include remote sensing and socio-economic elements to enhance predictive model accuracy and applicability. Hybrid models, which integrate multiple machine learning methods, can improve prediction accuracy while remaining user-friendly.