Toward Safer Roads: Predicting the Severity of Traffic Accidents in Montreal Using Machine Learning

Muktar, Bappa; Fono, Vincent

doi:10.3390/electronics13153036


by Bappa Muktar * and Vincent Fono

Department of Computer Science, University of Quebec in Outaouais (UQO), 283 Boul. Alexandre-Taché, Gatineau, QC J8X 3X7, Canada

* Author to whom correspondence should be addressed.

Electronics 2024, 13(15), 3036; https://doi.org/10.3390/electronics13153036

Submission received: 12 June 2024 / Revised: 24 July 2024 / Accepted: 30 July 2024 / Published: 1 August 2024

(This article belongs to the Special Issue Advances in Artificial Intelligence Engineering)


Abstract:

Traffic accidents are among the most common causes of death worldwide. According to statistics from the World Health Organization (WHO), 50 million people are involved in traffic accidents every year. Canada, particularly Montreal, is not immune to this problem. Data from the Société de l’Assurance Automobile du Québec (SAAQ) show that there were 392 deaths on Québec roads in 2022, 38 of them related to the city of Montreal. This value represents an increase of 29.3% for the city of Montreal compared with the average for the years 2017 to 2021. In this context, it is important to take concrete measures to improve traffic safety in the city of Montreal. In this article, we present a web-based solution based on machine learning that predicts the severity of traffic accidents in Montreal. This solution uses a dataset of traffic accidents that occurred in Montreal between 2012 and 2021. By predicting the severity of accidents, our approach aims to identify key factors that influence whether an accident is serious or not. Understanding these factors can help authorities implement targeted interventions to prevent severe accidents and allocate resources more effectively during emergency responses. Classification algorithms such as eXtreme Gradient Boosting (XGBoost), Categorical Boosting (CatBoost), Random Forest (RF), and Gradient Boosting (GB) were used to develop the prediction model. Performance metrics such as precision, recall, F1 score, and accuracy were used to evaluate the prediction model. The performance analysis shows an excellent accuracy of 96% for the prediction model based on the XGBoost classifier. The other models (CatBoost, RF, GB) achieved 95%, 93%, and 89% accuracy, respectively. The prediction model based on the XGBoost classifier was deployed using a client–server web application managed by Swagger-UI, Angular, and the Flask Python framework. 
This study makes significant contributions to the field by employing an ensemble of supervised machine learning algorithms, achieving a high prediction accuracy, and developing a real-time prediction web application. This application enables quicker and more effective responses from emergency services, potentially reducing the impact of severe accidents and improving overall traffic safety.

Keywords:

traffic accidents; Montreal; machine learning; severity prediction; road safety; classification algorithms

1. Introduction

Traffic accidents remain a leading cause of death worldwide [1] and represent a significant burden on the global economy. The 2023 World Road Safety Report shows that road accidents cause 1.19 million deaths annually [2]. According to the WHO, this figure represents a slight improvement, as the annual number of deaths decreased by 0.06 million compared with the findings of the 2015 Global Road Safety Report. Despite these advances, the impact of traffic accidents on mobility is profound and highlights the urgent need for concerted efforts to halve the number of traffic deaths and injuries by 2030 [2]. In Canada, the ratio of traffic accident injuries to deaths was particularly high, with 108,018 people injured in 2021, about 66 times the number of deaths [3].

The successful deployment of an intelligent transportation system (ITS) that ensures safety and comfort for road users depends on the development of an accurate and fast algorithm for predicting accident severity. This feature can significantly help various government agencies by allowing them to assess the severity of accidents whose impact is initially unknown. For example, if the severity of an accident is preassessed as serious, emergency responders can proactively prepare the necessary medical equipment, thereby improving the efficiency of their response.

A key challenge in accident management is predicting the severity of the accident. Severity is typically considered a dependent variable, with the factors contributing to the accident treated as independent variables or predictors. Researchers analyze traffic accident data to identify key factors that influence these incidents and develop strategies to improve traffic safety. Factors that influence the severity and frequency of accidents include weather conditions, road conditions, speed limits, etc. This information is collected in extensive databases and analyzed using various analytical methods.

Although numerous studies have sought to predict the severity of traffic accidents [4,5,6,7,8], most have relied on a single classifier, predominantly RF. It has been observed that the precision and generalization ability of these prediction models rarely exceed 90%.

The increasing frequency of traffic accidents worldwide poses a significant public health challenge and results in millions of deaths and injuries each year. Montreal recorded a notable 29.3% increase in traffic fatalities in 2022 compared with the 2017–2021 average, highlighting the urgent need for improved road safety measures. The aim of these measures is to reduce the number of accidents, save lives, and improve the safety and comfort of all road users. This highlights the need for a more reliable approach to predicting accident severity to enable preventive improvements in road safety.

Novelty and Contributions of the Research:

This study introduces several novel contributions to the field of traffic accident severity prediction. Unlike previous research that predominantly relied on single classifiers, our work employs an ensemble of supervised machine learning algorithms, including XGBoost, CatBoost, RF, and Gradient Boosting, to enhance prediction accuracy. Furthermore, we focus specifically on the city of Montreal, utilizing a comprehensive dataset spanning from 2012 to 2021. Our approach not only predicts accident severity with high accuracy but also identifies key factors influencing accident severity. This dual focus provides actionable insights for traffic safety interventions, making our study unique in its methodological rigor and practical applicability. Additionally, the development and deployment of a real-time web application for predicting accident severity is an innovative aspect that integrates advanced machine learning techniques with practical tools for immediate use by traffic authorities.

The remainder of this paper is organized as follows: Section 2 provides a comprehensive literature review focusing on relevant research in this area. Section 3 provides an overview of the datasets used in this research. In Section 4, we describe the step-by-step process of developing the Montreal Predictor web app. Section 5 presents the results, including interpretation. Finally, Section 6 concludes the paper by summarizing the key findings and discussing possible avenues for future research.

2. Related Work

In recent years, machine learning has garnered significant attention within the scientific community, leading to its application across various computer science disciplines, particularly in solving prediction problems [9,10,11,12,13]. This trend has notably impacted the field of traffic safety, where predicting traffic accidents and assessing their severity have emerged as critical areas of interest. Numerous studies have employed machine learning models to enhance the understanding of variables influencing accident severity and to develop more accurate prediction tools.

Several studies have employed various machine learning models to predict traffic accident severity. Ahmed, S. et al. have conducted a comprehensive analysis using the New Zealand traffic accident dataset (2016–2020), evaluating models such as RF, Decision Jungle (DJ), AdaBoost (AB), XGBoost, Light Gradient Boosting Machine (LGBM), and CatBoost. Their findings indicated that RF was the most effective model, achieving an accuracy of 81.45% [14]. Wu, P. et al. have described an ensemble learning technique that included traffic data and geometric road orientation, improving accuracy and lowering variance [15]. Similarly, Gan, J. et al. have applied the Deep Forests algorithm to the UK road safety dataset, highlighting the algorithm’s superior stability and accuracy with minimal hyperparameter tuning [16].

Deep learning approaches have demonstrated significant promise in traffic accident prediction. Dong, C. et al. have developed a deep learning model that incorporates a multivariate negative binomial (MVNB) model, which outperformed other deep learning and Support Vector Machine (SVM) models in predicting accident severity [17]. This model effectively reduces input dimensionality while retaining essential information. Similarly, Zhang, C. et al. have combined an artificial neural network (ANN) with an improved metaheuristic algorithm to predict accident severity using Washington State’s Interstate Highway dataset [18]. Their findings emphasize the critical role of vehicle-related factors over road-related factors in accident severity prediction.

Numerous studies focused on particular regions have provided important information about predicting the severity of traffic accidents. Yang, J. et al. have utilized the Chinese National Car Accident In-Depth Investigation System (2018–2020) to demonstrate the superior performance of the RF algorithm [19]. Similarly, Gupta, U. et al. have analyzed UK traffic accident data (2005–2017) using multiple machine learning models, successfully identifying accident severity hotspots [20]. Furthermore, Paul, A. et al. have investigated the factors influencing accident severity within the UK road accident database, emphasizing the impact of variables such as lighting conditions, driver age, and vehicle type [21].

Studies on the prediction of traffic accidents have made extensive use of ANNs. Gatarić, D. et al. have utilized an ANN to predict traffic disruptions in Serbia and Bosnia and Herzegovina, achieving high accuracy with $r^{2}$ values up to 0.990 [22]. Similarly, Sowdagur, J.A. et al. have applied a multilayer perceptron (MLP) ANN to predict accident severity in Mauritius, resulting in an accuracy rate of 84.1% [23]. This performance surpassed other models, such as SVM, Gradient Boosting, and logistic regression.

Pedestrian safety has been a critical focus in traffic accident prediction studies. Meocci, M. et al. have developed a Gradient-Boosting-based prediction model using the ISTAT dataset from Italy, effectively predicting pedestrian accident risk and identifying high-risk areas [24]. Similarly, Islam, M. et al. have analyzed crash severity and hotspots in Saudi Arabia using Gradient Boosting and RF models [25]. Their study identified critical factors and locations associated with severe accidents, providing valuable insights for targeted interventions.

Tree-based ensemble models have gained prominence for predicting traffic accident severity. In a study conducted in Saudi Arabia’s Qassim Province, Aldhari, I. et al. applied RF, XGBoost, and logistic regression, ultimately determining that XGBoost yielded the highest accuracy [26]. Similarly, Shen, Y. et al. integrated Graph Convolutional Networks (GCN) with Long Short-Term Memory (LSTM) and RF models to forecast traffic conditions in urban highway tunnels in Nanjing City, underscoring the critical role of precise predictive models in effectively managing tunnel operations during accidents [27].

Comparative studies have extensively evaluated the performance of various machine learning models in predicting accident severity. For instance, Zhang, J. et al. have conducted a comparative analysis of statistical methods and machine learning techniques for predicting accident severity in Florida [28]. Their findings highlighted the superior accuracy of machine learning models, particularly RF. Similarly, Infante, P. et al. have analyzed traffic accident severity in Portugal using logistic regression alongside machine learning models, demonstrating that both approaches performed comparably well on balanced datasets [29].

Innovative approaches have investigated traffic accident prediction and emergency management. Mansoor, U. et al. have introduced a two-layer ensemble machine learning model for predicting accident severity, demonstrating superior performance in both accuracy and F1 score [30]. Similarly, Vijithasena, R. et al. have employed data visualization and machine learning techniques to analyze traffic accident severity in the USA, achieving high accuracy with the RF algorithm [31]. In another study, Wahab et al. predicted motorcycle accident severity in Ghana using J48 Decision Tree, RF, and IBk models, with RF emerging as the most accurate model [32].

Table 1 briefly summarizes these studies based on their focus, data used, models evaluated, and key findings related to urban traffic safety.

Analysis of comparative studies:

Although the studies summarized in Table 1 have made significant strides in understanding and predicting accident severity, our research distinguishes itself by using multiple ensemble learning algorithms, including XGBoost, which has demonstrated superior accuracy (96%) compared with the other models. Moreover, our study integrates a real-time prediction web application, providing a practical tool for authorities to implement data-driven safety measures. This integration of advanced machine learning techniques with a real-time web application underscores the relevance and novelty of our work in advancing traffic safety in urban environments.

3. Data Overview

In this section, we provide an overview of the dataset. The discussion covers two aspects. First, we examine the provenance of the data used in this study, which gives insight into their reliability. Second, we describe the dataset in detail, including its characteristics and relevance to the research objectives.

3.1. Data Source

This study uses data on traffic accidents in Montreal from 2012 to 2021. The data were compiled from incident reports (R1) from the Montreal Police Service (Service de Police de la Ville de Montréal (SPVM)). They were then organized by the SAAQ and standardized in a database. This dataset is publicly available on the Québec Data web portal [33] and is distributed under an attribution license (CC-BY 4.0) [34].

To analyze the geographical distribution of collisions, a special geomatics method was used. This method was developed specifically for Montreal's urban road network and deliberately excludes incidents on highways. The geolocation of collisions was determined using various parameters from the SAAQ reports, including civic number, street, or intersection. Quality and precision indices were also integrated into the analysis to assess the reliability and accuracy of the location data in relation to the road network. This step was crucial to ensure the high precision of the geolocation results.

3.2. Data Description

In this section, we describe the attributes finally used from the initial dataset and the process of selecting these attributes. Our initial dataset contained 68 attributes, but through a rigorous feature selection process, we narrowed it down to the most relevant attributes for our predictive model.

Feature Selection Process:

We used the chi-square statistical method (see Section 4.4 for more details) to evaluate the importance of each attribute relative to the target variable, Severity. This method enabled us to identify the 30 attributes most strongly associated with accident severity. From these 30 attributes, we further refined our selection based on domain knowledge and data quality, ultimately retaining the most informative features to limit bias in the model.

The final attributes used in our model are as follows:

street_name (RUE_ACCDN): Name of the street where the collision occurred.
collision_near (ACCDN_PRES_DE): Landmark near the collision site.
collision_type (CD_GENRE_ACCDN): Type of collision.
surface_condition (CD_ETAT_SURFC): Condition of the road surface.
road_category (CD_CATEG_ROUTE): Category of the road.
longitudinal_location (CD_LOCLN_ACCDN): Longitudinal location.
weather_conditions (CD_COND_METEO): Weather conditions.
light_cars_trucks_count (nb_automobile_camion_leger): Number of light cars and trucks involved.
heavy_trucks_count (nb_camionLourd_tractRoutier): Number of heavy trucks involved.
bicycle_count (nb_bicyclette): Number of bikes involved.
motorcycle_count (nb_motocyclette): Number of motorcycles involved.
emergency_vehicle_count (nb_urgence): Number of emergency vehicles involved.
unspecified_vehicle_count (nb_veh_non_precise): Number of unspecified vehicles involved.
authorized_speed (VITESSE_AUTOR): Authorized speed on the road.
x_coordinate (LOC_X): X coordinate (NAD83 MTM8).
y_coordinate (LOC_Y): Y coordinate (NAD83 MTM8).
hour (HR_ACCDN): Hour of the collision.

The target variable (Severity) is divided into five classes: Damage Below Reporting Threshold, Property Damage Only, Minor, Serious, and Fatal.

Figure 1 below shows the distribution of different classes within the Severity attribute using a pie chart.

Analysis of Figure 1 shows a significant imbalance between classes within the severity attribute. Therefore, implementing a data balancing strategy is crucial to improve the performance of the predictive model. This issue is discussed in more detail in the methodology section of our article.

A summary of the characteristics of the dataset can also be found in Table 2 below.

An examination of Table 2 shows that the dataset contains 218,272 rows and 68 attributes (columns). It includes a mix of data types that include both numeric or continuous variables (int64, float64) and categorical or discrete variables (object). This variety of data types suggests a rich store of information covering a wide range of aspects, from geographical and temporal data (e.g., date and location of the accident) to specific accident details (e.g., the type of vehicles involved and the severity of the accident). The significant number of categorical variables highlights the need for appropriate data processing, particularly the encoding of these variables, when preparing the data for the predictive model.

4. Methodology

This section describes our approach to conceptualizing traffic accident prediction in Montreal as a multiclass classification problem. Each accident is assigned a severity level that classifies it into one of five different categories: damage below the reporting threshold; property damage only; and minor, serious, and fatal accidents. This categorization forms the basis of our multiclass classification task.

Our research includes a thorough evaluation of several machine learning algorithms: XGBoost, CatBoost, RF, and GB. These algorithms are used to analyze the traffic accident dataset in Montreal with the aim of predicting accident severity with high accuracy. The main goal is to determine which algorithm performs best in this particular context.

The following subsections provide a comprehensive description of the methods implemented in developing the predictive model. This includes the setup and application design, our data preprocessing methods, the feature selection process, exploratory data analysis, construction of the predictive model, and its subsequent validation and evaluation phases. The ultimate goal is to establish a prediction system that is not only accurate but also tailored to the specific characteristics of traffic accidents in Montreal.

4.1. Setup and Application Design

To ensure the reproducibility of our work and to provide clarity on the application design, this section elaborates on the setup used to produce our results. Our application design comprises multiple interconnected components, each playing a critical role in the functionality and performance of the overall system.

Hardware and Software Configuration:

GPU: NVIDIA GeForce GTX 1650 (manufactured by NVIDIA Corporation, Santa Clara, CA, USA).
RAM: 32 GB.
Storage: 1 TB SSD.
Operating System: Windows 11.
Programming Language: Python 3.11.
Libraries: Matplotlib 3.9.1, Seaborn 0.13.2, Pandas 2.0.2, NumPy 1.23.5, Scikit-learn 1.2.1, XGBoost 2.1.0, CatBoost 1.2.5, Flask 3.0.3, Angular 14.2.0, Swagger-UI (OpenAPI 3.0.3), NodeJS v18.16.0, Pickle-Mixin 1.0.2, Requests 2.32.3.

Data Preprocessing Steps:

Data cleaning: Removal of duplicates and irrelevant columns using Pandas.
Handling missing values: Implementing imputation strategies for categorical and numerical data.
Data balancing: Employing the SMOTE-ENN algorithm to address class imbalances.
Feature selection: Utilizing the chi-square statistical method to identify the most relevant features.

Machine Learning Models:

Models used: XGBoost, CatBoost, Random Forest, Gradient Boosting.
Training and testing split: 80% training, 20% testing.
Evaluation metrics: accuracy, precision, recall, F1 score.
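
As a minimal sketch of the split and metrics listed above, the following plain-Python example shuffles record indices into an 80/20 partition and computes accuracy, precision, recall, and F1 score from true and predicted labels. The helper names (split_indices, macro_scores), the seed, and the toy labels are illustrative assumptions. The averaging scheme used for the multiclass metrics is not stated here, so macro averaging is assumed.

```python
import random
from collections import Counter


def split_indices(n, test_fraction=0.2, seed=42):
    """Shuffle the indices 0..n-1 and return (train_indices, test_indices)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n * test_fraction))
    return indices[n_test:], indices[:n_test]


def macro_scores(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for c in labels:
        p = tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0
        r = tp[c] / (tp[c] + fn[c]) if (tp[c] + fn[c]) else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n_labels = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n_labels,
        "recall": sum(recalls) / n_labels,
        "f1": sum(f1s) / n_labels,
    }


# Toy labels for three severity classes (hypothetical values).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
scores = macro_scores(y_true, y_pred)
train_idx, test_idx = split_indices(10)
```

With imbalanced classes such as Fatal, macro averaging weights every class equally, which is one reason accuracy alone can look better than the per-class scores.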

Training Environment:

Software: Jupyter 1.0.0, Notebook 7.0.8, Anaconda 23.3.1

Web Application Design:

Backend:
- Framework: Flask.
- API management: Swagger-UI for API documentation and testing.
- Model deployment: Integration of the trained XGBoost model for real-time predictions.
Frontend:
- Framework: Angular.
- User interface: Interactive forms for data input and real-time feedback on predictions.

4.2. Data Preprocessing

In this subsection, we summarize the essential preprocessing steps performed to prepare our dataset for predictive modeling.

First, temporal attributes such as Collision_Hour and Collision_Date were converted into numerical representations (time, day of the week, day, and month) using the to_datetime function from the Pandas library. This standardization helps to adapt the data to the model requirements.

To improve the efficiency and accuracy of the model, redundant data, including duplicate records and irrelevant columns, were removed. Specifically, we removed 143 duplicate records out of 172,759 from the final dataset using the Pandas drop_duplicates function. Irrelevant data were identified by analyzing missing values and using the Pandas duplicated function. For instance, features with more than 50% missing values were considered to have minimal impact on the outcome and were removed. Additionally, correlation analysis was performed using the chi-square test, and features with a correlation coefficient above 6% were considered for removal to prevent multicollinearity issues.
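
The date conversion and duplicate removal described above can be sketched with Pandas as follows. This is an illustration under stated assumptions, not the study's actual code: the column names (collision_date, collision_hour), the string formats, and the three toy records are hypothetical.

```python
import pandas as pd

# Toy records; column names and formats are illustrative assumptions.
df = pd.DataFrame({
    "collision_date": ["2021-01-15", "2021-01-15", "2020-07-04"],
    "collision_hour": ["16:30", "16:30", "08:05"],
    "severity": ["Minor", "Minor", "Serious"],
})

# Remove exact duplicate rows (the second row repeats the first).
df = df.drop_duplicates().reset_index(drop=True)

# Parse the date and derive numeric calendar features.
dates = pd.to_datetime(df["collision_date"], format="%Y-%m-%d")
df["month"] = dates.dt.month
df["day"] = dates.dt.day
df["day_of_week"] = dates.dt.dayofweek  # Monday = 0, Sunday = 6

# Parse the time of day and keep only the hour as an integer.
df["hour"] = pd.to_datetime(df["collision_hour"], format="%H:%M").dt.hour
```

Passing an explicit format to to_datetime avoids ambiguous day/month parsing when the raw strings follow a known layout.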

Categorical variables were encoded numerically to ensure compatibility with our predictive models. For example, the Severity attribute was encoded using Label Encoding, as shown in Table 3, where each severity level is assigned a unique numerical value to facilitate model processing.
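
As a hedged illustration of label encoding for the target, the sketch below maps each severity level to an integer and back. The specific codes are an assumption chosen to follow increasing severity and may not match the values in Table 3. An encoder that assigns codes alphabetically, such as scikit-learn's LabelEncoder, would produce a different numbering.

```python
import pandas as pd

# Hypothetical code assignment, ordered by increasing severity.
SEVERITY_CODES = {
    "Damage Below Reporting Threshold": 0,
    "Property Damage Only": 1,
    "Minor": 2,
    "Serious": 3,
    "Fatal": 4,
}
CODE_TO_SEVERITY = {code: name for name, code in SEVERITY_CODES.items()}

severity = pd.Series(["Minor", "Fatal", "Property Damage Only", "Minor"])

# Encode the labels for training, then decode predictions for display.
encoded = severity.map(SEVERITY_CODES)
decoded = encoded.map(CODE_TO_SEVERITY)
```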

4.3. Dealing with Missing Values

In this subsection, we explain our method to address missing data in the dataset. Columns with missing values are identified, and the percentage of missing data per column is calculated using the Python Pandas library. The proportion of missing data is expressed as the ratio of missing values per column to the total number of rows, multiplied by 100.

To address missing data issues, we implemented an imputation strategy [35,36,37] subject to the following rules:

Columns with more than 50% missing values are deleted. Table 4 shows the attributes with more than 50% missing values that were removed from the dataset.
For numeric columns that represent categorical variables, missing values are replaced with the value from the previous row (using the fillna method from the Python Pandas library with method = 'ffill'). This method is chosen to preserve the order of the data wherever possible, assuming that adjacent entries are likely to have similar or identical categorizations, which is common with time series or ordered datasets. Table 5 below shows the attributes where this imputation strategy was applied, indicating the number and percentage of missing values imputed.
For purely numeric columns, missing values are replaced with the column mean. This approach is used to maintain the overall distribution and central tendency of the data, which is important to avoid biasing results in predictive modeling. However, we are aware of the potential biases that this method introduces and therefore limit its application to columns where the mean is a representative summary statistic of the underlying distribution. Table 6 shows the attributes where this imputation strategy was applied.
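
The three imputation rules can be sketched as follows. The column names and values are hypothetical, and which columns count as categorical codes versus purely numeric is assigned by hand here. The sketch calls DataFrame.ffill(), which is equivalent to fillna with method = 'ffill' and also works in newer Pandas releases where that argument is deprecated.

```python
import numpy as np
import pandas as pd

# Toy data; names and values are illustrative assumptions.
df = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
    "surface_code": [11.0, np.nan, 12.0, np.nan],     # numeric codes for categories
    "authorized_speed": [50.0, np.nan, 30.0, 70.0],   # purely numeric
})

# Percentage of missing values per column: missing rows / total rows * 100.
missing_pct = df.isna().mean() * 100

# Rule 1: drop columns with more than 50% missing values.
df = df.drop(columns=missing_pct[missing_pct > 50].index)

# Rule 2: forward-fill numeric columns that encode categories.
categorical_code_cols = ["surface_code"]
df[categorical_code_cols] = df[categorical_code_cols].ffill()

# Rule 3: fill purely numeric columns with the column mean.
numeric_cols = ["authorized_speed"]
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
```

Forward filling leaves a leading missing value untouched because there is no previous row to copy from, so a real pipeline would need a fallback for that case.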

Solving the Data Imbalance Problem:

This subsection explains the methodology for resolving data imbalance issues related to the Severity attribute. Data imbalance can significantly impact the accuracy of a predictive model, often leading to a bias toward the more common classes. This problem is particularly pronounced in the Fatal category of the Severity attribute. Despite its critical importance in representing fatal accidents, its relative rarity in the dataset risks the model dismissing it as an outlier, which in turn biases predictions toward more common categories. To address this imbalance, we evaluated the effectiveness of four different balancing algorithms: Synthetic Minority Oversampling Technique (SMOTE) [38], SMOTE combined with Tomek Links (SMOTE-Tomek) [39], SMOTE combined with Edited Nearest Neighbors (SMOTE-ENN) [40], and Adaptive Synthetic Sampling approach (ADASYN) [41]. These methods were tested for their ability to handle data imbalances in conjunction with an RF classifier. The algorithm that best balanced the data while maintaining high accuracy was then selected.

Table 7 below summarizes the performance of data balancing algorithms based on accuracy.

In Table 7, the SMOTE-ENN data balancing algorithm shows superior performance in terms of accuracy compared with the other tested algorithms. The high performance of the SMOTE-ENN algorithm comes from its combination of Synthetic Minority Oversampling Technique (SMOTE) and Edited Nearest Neighbors (ENN), which allows it to generate new samples while reducing noise. Unlike SMOTE alone, which merely creates synthetic samples, SMOTE-ENN improves the quality of training data by eliminating ambiguous or noisy samples, enabling better model generalization to unseen data. This hybrid approach ensures a good balance between precision and recall, which is crucial for dealing with imbalanced datasets in severe cases. In summary, SMOTE-ENN’s ability to generate samples while cleaning the dataset makes it a more effective method, as demonstrated by the superior accuracy achieved in our experiments.
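
To make the two stages concrete, the following NumPy sketch implements a simplified version of each: SMOTE-style interpolation between minority-class neighbors, followed by ENN-style removal of samples whose label disagrees with the majority of their nearest neighbors. It is a teaching sketch under stated assumptions, not the implementation used in the experiments. The function names, the two-class toy data, the neighbor count k = 3, and the majority-vote cleaning rule are all illustrative choices.

```python
import numpy as np


def smote(X_min, n_new, k=3, rng=None):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(rng)
    # Pairwise distances within the minority class; ignore self-distances.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    k = min(k, len(X_min) - 1)
    neighbors = np.argsort(dist, axis=1)[:, :k]
    # Pick a random base point and one of its k nearest neighbors.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]
    # Place the synthetic point at a random position on the joining segment.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])


def edited_nearest_neighbors(X, y, k=3):
    """Drop samples whose label differs from the majority label of their k neighbors."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]
    n_classes = int(y.max()) + 1
    keep = np.array([
        np.bincount(y[nb], minlength=n_classes).argmax() == label
        for nb, label in zip(neighbors, y)
    ])
    return X[keep], y[keep]


# Toy two-class data: 40 majority points and 8 minority points (hypothetical).
rng = np.random.default_rng(0)
X_major = rng.normal(loc=0.0, scale=1.0, size=(40, 2))
X_minor = rng.normal(loc=4.0, scale=1.0, size=(8, 2))
X = np.vstack([X_major, X_minor])
y = np.array([0] * 40 + [1] * 8)

# Stage 1: oversample the minority class up to the majority count.
X_syn = smote(X_minor, n_new=32, k=3, rng=1)
X_bal = np.vstack([X, X_syn])
y_bal = np.concatenate([y, np.ones(32, dtype=int)])

# Stage 2: clean ambiguous samples from the oversampled data.
X_clean, y_clean = edited_nearest_neighbors(X_bal, y_bal, k=3)
```

Because the cleaning stage removes samples after oversampling, the final class counts need not be equal, which is consistent with the unequal class sizes a SMOTE-ENN pass can produce.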

After balancing with the SMOTE-ENN algorithm, the classes Damage Below Reporting Threshold, Property Damage Only, Minor, Serious, and Fatal contain 6750, 4828, 20,084, 64,832, and 76,265 records, respectively, resulting in a rebalanced dataset of 172,759 records from an initial 218,272 records. This approach helped mitigate the bias toward more frequent classes and improved the model's ability to accurately predict severe accidents.

The initial evaluation of balancing methods using RF was conducted to ensure a robust baseline assessment of these techniques. RF, known for its general applicability and ease of interpretation, provided an effective comparison framework. However, based on comprehensive performance analysis, XGBoost was ultimately selected as the superior model due to its higher accuracy and robustness, making it more suitable for our final deployment.

4.4. Feature Selection Using the Chi-Square Statistical Method

In this subsection, we present the approach used to select the thirty input variables that are most strongly associated with the target variable Severity for our prediction model. To achieve this, we use the chi-square statistical method to quantify the importance of input variables relative to the target variable. The selection of features based on the chi-square statistical method is supported by existing studies [42,43,44], which highlight its effectiveness in classification problems with multiple input variables. The chi-square statistic is given by Equation (1).

$$\chi^{2} = \sum_{i = 1}^{n} \frac{(O_{i} - E_{i})^{2}}{E_{i}} \tag{1}$$

where the following are used:

$χ^{2}$ is the chi-square statistic.
n is the number of observation categories.
$O_{i}$ is the observed frequency in category i.
$E_{i}$ is the expected frequency in category i under the null hypothesis that the input variable and the target variable are independent.
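
As a worked example of Equation (1), the sketch below computes the statistic for a contingency table of one input variable against the target, treating each cell as one category i and deriving the expected counts from the row and column totals. The function name, the feature names, and all counts are hypothetical.

```python
def chi_square(observed):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count if the row and column variables were independent.
            e = row_totals[i] * col_totals[j] / total
            statistic += (o - e) ** 2 / e
    return statistic


# Rows: two values of a feature; columns: non-severe vs. severe (toy counts).
tables = {
    "surface_condition": [[90, 10], [60, 40]],  # severity share differs by row
    "unrelated_feature": [[30, 10], [60, 20]],  # same severity share in each row
}
scores = {name: chi_square(table) for name, table in tables.items()}

# Rank features from most to least associated with the target.
ranking = sorted(scores, key=scores.get, reverse=True)
```

For the first table the expected counts are 75, 25, 75, and 25, so the statistic is 3 + 9 + 3 + 9 = 24, while the second table matches its expected counts exactly and scores 0.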

Table 8 below shows the list of the thirty most relevant input variables, ordered in descending order of correlation with the target variable, according to the importance measure derived from the chi-square statistical method.

4.5. Exploratory Data Analysis

In this subsection, we present visual analysis charts that examine the temporal dynamics of accident severity. The following graphics are discussed:

Hourly Accident Severity Distribution: This chart illustrates the distribution of accident severity throughout the day, categorized by each hour.
Weekly Accident Severity Distribution: This chart shows the distribution of accident severity across the days of the week and provides insight into daily patterns.
Monthly Accident Severity Distribution: This chart shows how accident severity varies from month to month and highlights possible seasonal trends.
Yearly Accident Severity Distribution: This chart shows annual accident severity.
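
The counts behind charts of this kind can be sketched with a Pandas cross-tabulation, shown here for the hourly view. The column names and the six toy records are hypothetical, and the same pattern would apply to day of week, month, or year.

```python
import pandas as pd

# Toy records; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "hour": [8, 8, 16, 16, 16, 23],
    "severity": [
        "Minor", "Property Damage Only", "Minor", "Minor", "Serious", "Fatal",
    ],
})

# Count accidents per hour (rows) and severity class (columns).
hourly = pd.crosstab(df["hour"], df["severity"])

# Hour with the largest total number of accidents.
peak_hour = hourly.sum(axis=1).idxmax()

# Share of each severity class within every hour (each row sums to 1).
share = hourly.div(hourly.sum(axis=1), axis=0)
```

Calling hourly.plot(kind="bar", stacked=True) would draw a stacked bar chart from this table when Matplotlib is installed. The share table separates changes in the severity mix from changes in total volume.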

Figure 2 shows the fluctuations in accident frequency over the course of the day, with a clear peak between 3:00 p.m. and 5:00 p.m. This peak can be observed primarily in the categories Damage Below Reporting Threshold, Property Damage Only, and Minor accidents, which is probably related to the increased traffic volume in the evening rush hour. Although Minor accidents are more common, Serious accidents show a similar pattern, with the number of incidents increasing over the same period.

Although the Fatal accidents category is the least common, it is of significant importance to public safety initiatives. In contrast to other categories, fatal accidents do not follow a recognizable daily routine. This suggests that fatal accidents may be influenced less by traffic volume and more by other factors not listed in this graph, such as driving under the influence of alcohol or impaired driving ability.

Figure 3 shows that accidents in the categories Damage Below Reporting Threshold, Property Damage Only, and Minor occur most frequently on weekdays and peak on Friday. This pattern is associated with increased traffic as Montreal residents travel to daily activities such as work and school. The increase on Fridays is likely due to people traveling after work and starting their weekend plans. Serious and fatal accidents show a different trend: their frequency is lowest on weekdays and increases significantly on weekends, particularly on Saturdays. This observation suggests that although the total number of accidents is lower on weekends, the proportion of serious accidents is higher. Possible explanations for this shift include different traffic patterns, weekend driving behavior, and other socioenvironmental factors unique to Montreal. Overall, the data suggest that the risk of an accident is highest on Friday, while the likelihood of a serious or fatal accident increases over the weekend, with Saturday being particularly dangerous.

Figure 4 makes it clear that incidents in the categories Damage Below Reporting Threshold, Property Damage Only, and Minor are the most common types of accidents during the year. This observation suggests that most accidents in the city of Montreal are not serious in nature. Conversely, accidents that are classified as serious and fatal occur less frequently but show a consistent pattern over time. The data show a seasonal variation, with the total number of accidents peaking in the winter months of January and February, followed by another peak in the summer months of June and July. These fluctuations are due to adverse weather conditions or an increase in travel activity. In addition, there is a noticeable increase in accidents in December, which is believed to be due to increased travel activity related to preparations for end-of-year celebrations, including Christmas and New Year.

Figure 5 shows the annual distribution of accident severity in the city of Montreal from 2012 to 2021. A trend can be seen in which accidents with pure property damage are the most common, followed by minor injuries. Serious injuries and incidents with damage below the reporting threshold are in the middle range, and fatal accidents are the rarest. The year 2013 was characterized by an exceptionally high number of accidents of all levels of severity, particularly those involving pure property damage. From 2014 onwards, there was a general decline in the number of accidents across all categories, with slight fluctuations. There was a slight increase in accidents in 2018 and 2019, but this trend reversed in 2020. Overall, the city of Montreal shows a positive trend with falling accident numbers, indicating possible improvements in road safety and in the effectiveness of the safety measures implemented. The sharp decline in 2020 could also reflect the impact of external factors such as policy changes, technological advances in vehicle safety, or reduced traffic due to circumstances such as the COVID-19 pandemic.

Impact of Exploratory Data Analysis on Data Preparation and Model Performance

The exploratory data analysis (EDA) not only provided valuable insights into the temporal patterns and distribution of accident severity but also guided the preprocessing steps for the data used in our machine learning models. For example, identifying peak hours for different types of accidents influenced the creation of time-based features, while the observed seasonal trends informed the inclusion of weather-related variables. Understanding these patterns allowed us to prepare the data more effectively, leading to improved model performance and more accurate predictions of accident severity.
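As an illustration of how such temporal patterns can be turned into model inputs, the following minimal Python sketch derives time-based features from an accident timestamp. The helper `time_features` and the feature names are hypothetical and are not necessarily those used in the study; the 3:00 p.m. to 5:00 p.m. window mirrors the peak seen in Figure 2.

```python
from datetime import datetime

def time_features(timestamp: datetime) -> dict:
    """Derive illustrative time-based features from an accident timestamp."""
    hour = timestamp.hour
    weekday = timestamp.weekday()  # Monday = 0, Sunday = 6
    return {
        "hour": hour,
        "day_of_week": weekday,
        "month": timestamp.month,
        "is_weekend": weekday >= 5,
        # Afternoon peak window from the hourly distribution (3 p.m. to 5 p.m.)
        "is_afternoon_peak": 15 <= hour < 17,
    }

features = time_features(datetime(2019, 12, 6, 16, 30))  # a Friday afternoon
print(features)
```

Boolean flags of this kind let a tree-based classifier split directly on the rush-hour and weekend effects discussed above.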

4.6. Development of the Predictive Model

This subsection describes the methodology used to create a predictive model to assess the severity of accidents in Montreal. Our model development strategy is based on the application of four machine learning algorithms: XGBoost, CatBoost, RF, and GB. The selection of these algorithms is based on the existing literature [28,45,46,47], highlighting their effectiveness in addressing classification challenges.

We evaluate the performance of these algorithms to determine the most effective classifier. The optimal classifier is then integrated into a web application designed to predict the severity of accidents in Montreal.

In our research, we divided the dataset into 80% for training and 20% for testing. This distribution was chosen because it is a widely accepted practice in machine learning, ensuring a substantial amount of data for training while reserving a significant portion for testing. Additionally, we experimented with other splits, such as 70–30, to evaluate the robustness of our model. Our findings indicated that the 80–20 split consistently provided the most reliable performance metrics. We assessed model performance using metrics such as accuracy, recall, precision, and F1 score.
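The split described above can be sketched as follows. This is a minimal, standard-library illustration assuming a simple shuffled split with a fixed seed; the function name `train_test_split`, the seed, and the stand-in records are assumptions, and the study may have used a library routine or a stratified split instead.

```python
import random

def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle the rows and split them into training and test subsets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(1000))  # stand-in for accident records
train_rows, test_rows = train_test_split(records, test_ratio=0.2)
print(len(train_rows), len(test_rows))
```

Passing `test_ratio=0.3` reproduces the alternative 70–30 split mentioned above.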

The following part of this section provides an overview of the learning algorithms used in this study.

4.6.1. Gradient Boosting (GB)

Friedman introduced the GB machine learning method [48]. This technique involves a boosting process that sequentially creates decision trees. Each tree in the series is intended to correct the errors of its predecessors by building on the information they provided. The process involves adding one weak learner at a time to an incremental additive model while leaving the existing trees unchanged.

Training the GB model proceeds over a series of iterations. In each iteration, a new tree is fitted to the residual errors of the current ensemble, that is, to the negative gradient of the loss function with respect to the current predictions. This ensures that subsequent trees focus on the cases that the model still predicts poorly. The contribution of each new tree, scaled by a learning rate, is added to the cumulative output of the existing trees, continually improving the accuracy of the overall model. Ultimately, the final model represents a weighted sum of all trees, optimized to achieve the best possible classification accuracy for all samples.
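To make the additive procedure concrete, the following toy sketch implements one common formulation of Gradient Boosting: squared loss on a single numeric feature, where each new weak learner (a one-split stump) is fitted to the residuals of the current ensemble. The data, function names, and settings are illustrative assumptions, not the configuration used in this study.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on a 1-D feature that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Additive model: each stump is fitted to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)
    trees = []  # existing stumps are never modified, only appended to
    predict = lambda x: base + learning_rate * sum(tree(x) for tree in trees)
    for _ in range(n_trees):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        trees.append(fit_stump(xs, residuals))
    return predict

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
model = gradient_boost(xs, ys)
mean_y = sum(ys) / len(ys)
baseline_mse = sum((y - mean_y) ** 2 for y in ys) / len(ys)
boosted_mse = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(baseline_mse, boosted_mse)
```

For classification, the same loop is applied to a differentiable classification loss such as log loss, with the trees fitted to its negative gradient.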

4.6.2. Extreme Gradient Boosting (XGBoost)

XGBoost is a powerful ensemble learning technique based on Friedman’s Gradient Boosting framework [48]. In 2016, Chen and Guestrin introduced enhancements to the original Gradient Boosting Decision Tree (GBDT) algorithm, resulting in the development of XGBoost [49]. Both XGBoost and traditional Gradient Boosting are tree-based ensemble methods that combine the predictions of multiple trees to improve classification accuracy. In general, the prediction $\hat{y}_i$ of such an ensemble can be expressed as the sum of the scores $f_k(x_i)$ produced by all trees. XGBoost builds a series of Classification and Regression Trees (CARTs) sequentially, parallelizing the search for splits within each tree, and aggregates their results to form the final prediction. The fundamental equation for Gradient Boosting models is given by Equation (2):

$$\hat{y}_i = \sum_{k=1}^{T} f_k(x_i), \quad f_k \in F \tag{2}$$

where $T$ represents the number of trees, $f_k$ is the $k$-th tree, and $F$ denotes the space of all possible trees. This model is optimized using the objective function given by Equation (3):

$$Obj(\theta) = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{T} \Omega(f_k) \tag{3}$$

The first term in the objective function is the loss function, which quantifies the difference between the true targets ($y_i$) and the predicted values ($\hat{y}_i$). The second term is a regularization component that manages the model’s complexity and prevents overfitting. Unlike standard Gradient Boosting, which primarily employs a learning rate (L) for regularization, XGBoost incorporates an additional regularization term defined by Equation (4):

$$\Omega(f_k) = \gamma t + \frac{1}{2} \lambda \sum_{j=1}^{t} C_j^2 \tag{4}$$

Here, $t$ is the number of leaves of the tree, $C_j$ represents the score of the $j$-th leaf, and $\lambda$ and $\gamma$ are regularization parameters. XGBoost is renowned for its high accuracy, efficiency, and ease of use compared with other machine learning algorithms, enabling it to achieve superior performance over traditional GBDT and other widely used models. In our case study, XGBoost achieved the highest accuracy of the four evaluated classifiers (Section 5).
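As a small worked example of the regularized objective, the sketch below evaluates the penalty of a tree as gamma times its number of leaves plus half of lambda times the sum of its squared leaf scores, and adds the penalties of all trees to the summed loss. The leaf scores, loss values, and parameter values are made up for illustration only.

```python
def regularization(leaf_scores, gamma, lam):
    """Omega(f) = gamma * t + 0.5 * lambda * sum of squared leaf scores."""
    t = len(leaf_scores)  # number of leaves in the tree
    return gamma * t + 0.5 * lam * sum(c ** 2 for c in leaf_scores)

def objective(losses, trees, gamma, lam):
    """Total objective: summed per-sample loss plus the penalty of every tree."""
    return sum(losses) + sum(regularization(tree, gamma, lam) for tree in trees)

# Two hypothetical trees, described only by their leaf scores
trees = [[0.5, -0.5, 1.0], [0.2, -0.2]]
penalty = regularization(trees[0], gamma=1.0, lam=2.0)
total = objective([0.5, 0.25], trees, gamma=1.0, lam=2.0)
print(penalty, total)
```

For the first tree, the penalty is 1.0 × 3 + 0.5 × 2.0 × (0.25 + 0.25 + 1.0) = 4.5: larger trees and larger leaf scores are both penalized, which is how the term discourages overfitting.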

4.6.3. Categorical Boosting (CatBoost)

CatBoost is a Gradient Boosting decision tree algorithm specifically designed to process categorical data. Rather than relying solely on one-hot encoding, it encodes categorical features with ordered target statistics and applies one-hot encoding only to features with few categories. Bentéjac et al. report that implementing minimum variance sampling in node splitting significantly improves model performance by reducing the amount of data required for each boosting iteration [50].

4.6.4. Random Forest (RF)

RF is a versatile and powerful machine learning algorithm that uses various decision trees to create a forest. This ensemble method uses the technique of bagging or bootstrap aggregation to improve both the robustness and accuracy of predictions for classification and regression tasks [51]. By training each tree on a random subset of the data and aggregating its predictions, RF significantly reduces the risk of overfitting, making it a reliable choice for complex data-driven problems. Its ability to process large datasets with high-dimensional features, coupled with inherent feature selection capabilities, makes it an indispensable tool in the arsenal of modern data scientists and researchers. The algorithm’s efficiency in creating highly accurate models while maintaining interpretability makes it particularly valuable for both scientific research and practical applications.
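The bagging idea can be sketched in a few lines. In the toy example below, each base learner is trained on a bootstrap sample (drawn with replacement) and the ensemble predicts by majority vote. The base learner is a deliberately simple two-class threshold rule rather than a full decision tree, and the random feature selection at each split that a real Random Forest adds is omitted; the data and names are illustrative assumptions.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) items with replacement (the 'bootstrap' in bagging)."""
    return [rng.choice(rows) for _ in rows]

def fit_stump(sample):
    """Toy two-class learner: threshold halfway between the two class means."""
    by_label = {}
    for x, label in sample:
        by_label.setdefault(label, []).append(x)
    if len(by_label) < 2:  # the bootstrap sample contains a single class
        only = next(iter(by_label))
        return lambda x: only
    (low_label, low_mean), (high_label, high_mean) = sorted(
        ((label, sum(v) / len(v)) for label, v in by_label.items()),
        key=lambda pair: pair[1],
    )
    threshold = (low_mean + high_mean) / 2
    return lambda x: low_label if x <= threshold else high_label

def bagged_predict(models, x):
    """Aggregate the individual predictions by majority vote."""
    return Counter(model(x) for model in models).most_common(1)[0][0]

rows = [(1.0, "minor"), (1.5, "minor"), (2.0, "minor"),
        (6.0, "serious"), (6.5, "serious"), (7.0, "serious")]
rng = random.Random(0)
models = [fit_stump(bootstrap_sample(rows, rng)) for _ in range(25)]
print(bagged_predict(models, 1.2), bagged_predict(models, 6.8))
```

Because every learner sees a slightly different sample, their individual errors tend to average out in the vote, which is the source of the variance reduction described above.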

5. Results and Discussion

This section describes the performance results of our machine learning models, followed by an analysis of these results.

5.1. Results

The classification report summary (Table 9), together with Figure 6, provides a detailed assessment of the predictive performance of each model.

5.1.1. Interpretation of Results

In Table 9, the ensemble methods XGBoost, CatBoost, RF, and GB demonstrated varied performances, each excelling in different metrics (Figure 6). XGBoost exhibited a high weighted average for precision, recall, F1 score, and accuracy, all at 0.96, indicating strong consistency of predictions across different classes. CatBoost also performed robustly, with weighted average scores around 0.94 to 0.95, reflecting its capability to handle different classifications effectively.
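For readers less familiar with the weighted averages reported here, the following sketch shows how they can be computed: per-class precision, recall, and F1 score are averaged using each class's share of the test samples (its support) as the weight. The labels and predictions are a made-up toy example, not data from this study.

```python
from collections import Counter

def weighted_report(y_true, y_pred):
    """Per-class precision/recall/F1 averaged with class support as weights."""
    support = Counter(y_true)
    totals = {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        precision = tp / predicted if predicted else 0.0
        recall = tp / n
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        for name, value in (("precision", precision), ("recall", recall), ("f1", f1)):
            totals[name] += value * n / len(y_true)
    totals["accuracy"] = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return totals

y_true = ["minor"] * 4 + ["serious"] * 2
y_pred = ["minor", "minor", "minor", "serious", "serious", "minor"]
report = weighted_report(y_true, y_pred)
print(report)
```

Note that the support-weighted recall is always equal to the overall accuracy, which is why these two values coincide for each model.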

RF showed good performance with a weighted average accuracy of 0.93, though with slightly lower precision and recall than the top performers. GB, however, demonstrated relatively lower performance, with weighted average precision, recall, and accuracy values between 0.88 and 0.89, suggesting a decline in prediction consistency across classes.

Despite addressing the class imbalance issue, the precision and recall for the Damage Below Reporting Threshold and Property Damage Only categories were below 0.8 for some models, including XGBoost and CatBoost (Table 9). This result may stem from the intrinsic difficulty of distinguishing these less severe accidents, whose features are subtle and overlap with those of other classes. The relatively low metrics for these categories highlight the challenge of accurately predicting these classes, which can affect the model’s reliability in providing precise predictions for the most frequent accident types in the dataset. These issues might arise from the algorithms’ sensitivity to minor variations in feature importance or their handling of noise in the data.

Overall, CatBoost and XGBoost emerged as the most robust models, achieving the highest scores in most categories and effectively balancing precision and recall. This performance indicates that boosting methods like XGBoost and CatBoost are particularly effective for this classification task. RF also demonstrated strong performance, especially in maintaining high recall and precision across all categories, making it suitable for applications where reducing false negatives is crucial.

It is important to note that the superior performance of XGBoost can be attributed to its advanced handling of missing data, effective regularization techniques that prevent overfitting, and efficient computational speed. Unlike other models, XGBoost’s Gradient Boosting framework is particularly adept at capturing complex patterns in the data, resulting in higher precision, recall, and F1 scores. Additionally, the comprehensive feature engineering and inclusion of temporal and spatial dependencies further enhanced its predictive power, making it the most effective model for this classification task.

Based on the performance analysis, we selected the XGBoost model for deployment in a web application.

5.1.2. Key Factors Influencing Accident Severity

In this section, we provide a detailed analysis of the key factors that influence accident severity, which is critical for understanding and mitigating the impacts of traffic accidents. Our study utilizes the chi-square statistical method to identify and quantify the importance of each factor.
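As a reminder of how such a score is obtained, the sketch below computes the classical Pearson chi-square statistic for a contingency table that cross-tabulates one categorical factor against the severity class. The counts are invented for illustration, and the exact variant of the chi-square computation used in the study may differ.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table.

    Rows are the values of one factor; columns are the severity classes.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if factor and severity were independent
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic

# Hypothetical counts: rows = (adverse weather, clear weather),
# columns = (severe, not severe)
dependent = chi_square([[30, 10], [20, 40]])
independent = chi_square([[10, 20], [20, 40]])
print(dependent, independent)
```

A larger statistic indicates a stronger departure from independence between the factor and accident severity, so factors can be ranked by their scores.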

Key factors:

Table 10 below illustrates the key factors influencing accident severity in our model based on their chi-square scores.

Detailed Analysis:

To understand the key factors influencing accident severity, we performed a detailed analysis of the attribute values associated with severe accidents. For example, we graphically visualized the impact of weather conditions on the severity of traffic accidents. This analysis revealed that some attributes, such as weather conditions, road surface conditions, and the type of collision, play a significant role in determining the severity of an accident. For instance, severe accidents are more likely to occur under adverse weather conditions and poor road surface conditions. Additionally, collisions involving heavy trucks or multiple vehicles tend to result in more severe outcomes.

Identifying these key factors is crucial as it provides actionable insights for traffic safety authorities. By understanding the circumstances that increase the likelihood of severe accidents, it is possible to apply targeted interventions. For example, improving road maintenance during adverse weather conditions or enforcing stricter speed limits in areas prone to severe accidents can significantly reduce the risk of severe outcomes.

5.1.3. Comparison of the Results with a Previous Study in the Literature

The performance results of our study compare favorably with previous research that uses traffic accident data from the city of Montreal for accident prediction. The previous study [52] used a special version of the RF algorithm called Random Forest Balanced, which achieved a traffic accident detection rate of 85% and a false positive rate of 13%. By comparison, our application of the standard RF algorithm, using a balanced dataset, achieved a weighted average accuracy of 0.93, with precision and recall rates of 0.92 and 0.93, respectively. Furthermore, the current study demonstrates a robust ability to classify the severity of traffic accidents into different categories and achieves an effective trade-off in identifying all classes associated with the severity of accidents in Montreal.

We attribute the increased accuracy in our model to several factors. Firstly, we utilized enhanced data preprocessing techniques, which included more sophisticated handling of missing values and outliers, to provide a cleaner and more representative dataset for training. Secondly, we employed feature engineering methods that better captured the complexities of traffic data, such as temporal and spatial dependencies, which are crucial for accurate predictions. Lastly, we rigorously optimized the tuning of hyperparameters in the RF model to suit the specific characteristics of our data, unlike the generalized approach used in the previous study. For example, the RF classifier was configured with 800 estimators and a maximum depth of 20. The XGBoost classifier, which was ultimately selected for deployment, was fine-tuned with 1000 estimators, a maximum depth of 20, and a learning rate of 0.1. These improvements not only led to more precise predictions but also ensured higher performance consistency across various severity classes compared with the modified approach reported in [52].
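The reported settings can be written down as follows. This is a sketch only: the numeric values are those stated above, whereas the assumption that all other hyperparameters keep their library defaults, and the helper `build_models`, are ours for illustration. Calling `build_models` requires scikit-learn and the xgboost package to be installed.

```python
# Hyperparameter values as reported in the text
RF_PARAMS = {"n_estimators": 800, "max_depth": 20}
XGB_PARAMS = {"n_estimators": 1000, "max_depth": 20, "learning_rate": 0.1}

def build_models():
    """Instantiate both classifiers (hypothetical helper; other settings left at defaults)."""
    # Imported lazily so the configuration above can be inspected without the libraries
    from sklearn.ensemble import RandomForestClassifier
    from xgboost import XGBClassifier

    return RandomForestClassifier(**RF_PARAMS), XGBClassifier(**XGB_PARAMS)

print(RF_PARAMS, XGB_PARAMS)
```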

5.1.4. Real-Time Prediction Web Application

In this subsection, we provide a comprehensive explanation of the real-time prediction web application mentioned in the introduction and conclusions. This application is an integral part of our study, leveraging the XGBoost model, which demonstrated the highest accuracy among the evaluated classifiers.

The application is optimized for quick response times, providing predictions within milliseconds. This efficiency is achieved through the Flask framework [53], which ensures fast server-client communication.

The application follows a client-server architecture. On the server side, the Flask framework handles the data processing and model inference. On the client side, Angular ensures a responsive and interactive user experience. Swagger-UI facilitates seamless API interactions, allowing for easy integration and testing of the prediction model. Figure 7 illustrates the architecture of the application.
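A server-side prediction endpoint of this kind could look roughly like the sketch below. It is not the deployed code: the route `/predict`, the input fields in `FEATURES`, the ordering of `SEVERITY_LABELS`, and the `StubModel` stand-in are all assumptions made for illustration. The validation logic is kept in a plain function so that it can be exercised without a running server; `create_app` requires Flask.

```python
SEVERITY_LABELS = [
    "Damage Below Reporting Threshold", "Property Damage Only",
    "Minor", "Serious", "Fatal",
]  # assumed class order
FEATURES = ["hour", "day_of_week", "month"]  # hypothetical input fields

def predict_severity(payload, model):
    """Validate a JSON-like payload and map the model's class index to a label."""
    missing = [name for name in FEATURES if name not in payload]
    if missing:
        return {"error": f"missing fields: {', '.join(missing)}"}, 400
    row = [payload[name] for name in FEATURES]
    class_index = int(model.predict([row])[0])
    return {"severity": SEVERITY_LABELS[class_index]}, 200

def create_app(model):
    """Build the Flask app; Flask is imported lazily so the helper above stays testable."""
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        body, status = predict_severity(request.get_json(force=True), model)
        return jsonify(body), status

    return app

class StubModel:
    """Stand-in for the trained classifier; always predicts class index 3."""
    def predict(self, rows):
        return [3 for _ in rows]

result = predict_severity({"hour": 16, "day_of_week": 4, "month": 12}, StubModel())
print(result)
```

In a real deployment, the stub would be replaced by the trained XGBoost classifier loaded once at start-up, so that each request only incurs the cost of a single inference.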

The web application features a user-friendly interface built using Swagger-UI and Angular. Users can input data through a form, which is then processed by the server. The interface provides real-time feedback, displaying the predicted severity of the accident. Figure 8 illustrates the user interface of the application.

By predicting the severity of traffic accidents in real time, this application provides valuable insights that can help authorities implement targeted interventions. It enables quicker and more effective responses from emergency services, potentially reducing the impact of severe accidents and improving overall traffic safety.

To illustrate the practical applications, we included visual data (Figure 9) that map the severity of accidents across different regions of Montreal. This visualization helps identify hotspots of severe accidents and supports data-driven decision making for traffic safety enhancements.

6. Conclusions and Future Work

Traffic accidents in Montreal pose a significant challenge each year, resulting in numerous fatalities and substantial socioeconomic impacts. To address this issue, our study presents a machine-learning-based approach to enhance urban traffic safety. We developed a web application based on a predictive model to forecast the severity of traffic accidents in Montreal. This model leverages real-world data collected from traffic accidents in Montreal between 2012 and 2021, utilizing various machine learning algorithms such as XGBoost, CatBoost, RF, and GB. Our analysis revealed that the XGBoost model outperformed the others, achieving an accuracy of 96%, compared with 95%, 93%, and 89% for the CatBoost, RF, and GB models, respectively.

The significance of these results lies in the model’s ability to identify key factors influencing accident severity. By understanding these factors, authorities can implement targeted interventions to prevent severe accidents, allocate resources more effectively during emergency responses, and develop strategic policies to reduce traffic accident fatalities. This predictive capability is crucial for enhancing overall road safety and ensuring a safer environment for all road users.

Building on the results of the comparative analysis of these prediction models, we developed a web application leveraging the Python Flask framework and Swagger-UI/Angular. This application provides an effective tool for the Montreal city government to formulate and implement strategies to improve road safety, reduce fatalities, and enhance the overall experience for road users. The deployment of the XGBoost model in a user-friendly web interface ensures accessibility and practical application of the predictive insights.

Looking forward, we plan to expand the scope of our application to other provinces in Canada. Additionally, we intend to enhance the application with a comprehensive graphical interface using the Angular framework. This extension will facilitate the seamless integration of additional features, such as exploratory data analysis and real-time accident geolocation, directly into the user interface. These enhancements will further support data-driven decision making and promote proactive measures for traffic safety improvement.

Author Contributions

Conceptualization, B.M.; Methodology, B.M. and V.F.; Software, B.M.; Investigation, B.M.; Writing—original draft, Bappa Muktar; Writing—review & editing, V.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The dataset used in this study is available on the Données Québec website at https://www.donneesquebec.ca/recherche/dataset/vmtl-collisions-routieres under the attribution license (CC-BY 4.0), accessed on 20 December 2023.

Acknowledgments

This research would not have been possible without access to traffic accident data provided by the city of Montreal. The authors would like to thank the city of Montreal for facilitating access to its traffic accident datasets.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] Zhang, G.; Yau, K.K.; Chen, G. Risk factors associated with traffic violations and accident severity in China. Accid. Anal. Prev. 2013, 59, 18–25.
[2] World Health Organization. Global Status Report on Road Safety 2023. Available online: https://www.who.int/teams/social-determinants-of-health/safety-and-mobility/global-status-report-on-road-safety-2023 (accessed on 20 December 2023).
[3] Transport Canada. Canadian Motor Vehicle Traffic Collision Statistics 2021. Available online: https://tc.canada.ca/en/road-transportation/statistics-data/canadian-motor-vehicle-traffic-collision-statistics-2021 (accessed on 20 December 2023).
[4] Alkheder, S.; Taamneh, M.; Taamneh, S. Severity prediction of traffic accident using an artificial neural network. J. Forecast. 2017, 36, 100–108.
[5] Çeven, S.; Albayrak, A. Traffic accident severity prediction with ensemble learning methods. Comput. Electr. Eng. 2024, 114, 109101.
[6] Hashmienejad, S.H.A.; Hasheminejad, S.M.H. Traffic accident severity prediction using a novel multi-objective genetic algorithm. Int. J. Crashworthiness 2017, 22, 425–440.
[7] Sameen, M.I.; Pradhan, B. Severity prediction of traffic accidents with recurrent neural networks. Appl. Sci. 2017, 7, 476.
[8] Yan, M.; Shen, Y. Traffic accident severity prediction based on random forest. Sustainability 2022, 14, 1729.
[9] Dhanya, K.; Vajipayajula, S.; Srinivasan, K.; Tibrewal, A.; Kumar, T.S.; Kumar, T.G. Detection of Network Attacks using Machine Learning and Deep Learning Models. Procedia Comput. Sci. 2023, 218, 57–66.
[10] Filali, A.; Mlika, Z.; Cherkaoui, S.; Kobbane, A. Preemptive SDN load balancing with machine learning for delay sensitive applications. IEEE Trans. Veh. Technol. 2020, 69, 15947–15963.
[11] Hammouri, A.; Hammad, M.; Alnabhan, M.; Alsarayrah, F. Software bug prediction using machine learning approach. Int. J. Adv. Comput. Sci. Appl. 2018, 9, 78–83.
[12] Kumar, R.; Kumar, P.; Kumar, Y. Time series data prediction using IoT and machine learning technique. Procedia Comput. Sci. 2020, 167, 373–381.
[13] Muktar, B.; Fono, V.; Zongo, M. Predictive Modeling of Signal Degradation in Urban VANETs Using Artificial Neural Networks. Electronics 2023, 12, 3928.
[14] Ahmed, S.; Hossain, M.A.; Ray, S.K.; Bhuiyan, M.M.I.; Sabuj, S.R. A study on road accident prediction and contributing factors using explainable machine learning models: Analysis and performance. Transp. Res. Interdiscip. Perspect. 2023, 19, 100814.
[15] Wu, P.; Meng, X.; Song, L. A novel ensemble learning method for crash prediction using road geometric alignments and traffic data. J. Transp. Saf. Secur. 2020, 12, 1128–1146.
[16] Gan, J.; Li, L.; Zhang, D.; Yi, Z.; Xiang, Q. An alternative method for traffic accident severity prediction: Using deep forests algorithm. J. Adv. Transp. 2020, 2020, 1257627.
[17] Dong, C.; Shao, C.; Li, J.; Xiong, Z. An improved deep learning model for traffic crash prediction. J. Adv. Transp. 2018, 2018, 3869106.
[18] Zhang, C.; He, J.; Wang, Y.; Yan, X.; Zhang, C.; Chen, Y.; Liu, Z.; Zhou, B. A crash severity prediction method based on improved neural network and factor Analysis. Discret. Dyn. Nat. Soc. 2020, 2020, 4013185.
[19] Yang, J.; Han, S.; Chen, Y. Prediction of Traffic Accident Severity Based on Random Forest. J. Adv. Transp. 2023, 2023, 7641472.
[20] Gupta, U.; Varun, M.; Srinivasa, G. A Comprehensive Study of Road Traffic Accidents: Hotspot Analysis and Severity Prediction Using Machine Learning. In Proceedings of the 2022 IEEE Bombay Section Signature Conference (IBSSC), Mumbai, India, 8–10 December 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–6.
[21] Paul, A.K.; Boni, P.K.; Islam, M.Z. A Data-Driven Study to Investigate the Causes of Severity of Road Accidents. In Proceedings of the 2022 13th International Conference on Computing Communication and Networking Technologies (ICCCNT), Kharagpur, India, 3–5 October 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–7.
[22] Gatarić, D.; Ruškić, N.; Aleksić, B.; Đurić, T.; Pezo, L.; Lončar, B.; Pezo, M. Predicting Road Traffic Accidents—Artificial Neural Network Approach. Algorithms 2023, 16, 257.
[23] Sowdagur, J.A.; Rozbully-Sowdagur, B.T.B.; Suddul, G. An Artificial Neural Network Approach for Road Accident Severity Prediction. In Proceedings of the 2022 IEEE Zooming Innovation in Consumer Technologies Conference (ZINC), Novi Sad, Serbia, 25–26 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 267–270.
[24] Meocci, M.; Branzi, V.; Martini, G.; Arrighi, R.; Petrizzo, I. A predictive pedestrian crash model based on artificial intelligence techniques. Appl. Sci. 2021, 11, 11364.
[25] Islam, M.K.; Reza, I.; Gazder, U.; Akter, R.; Arifuzzaman, M.; Rahman, M.M. Predicting road crash severity using classifier models and crash hotspots. Appl. Sci. 2022, 12, 11354.
[26] Aldhari, I.; Almoshaogeh, M.; Jamal, A.; Alharbi, F.; Alinizzi, M.; Haider, H. Severity Prediction of Highway Crashes in Saudi Arabia Using Machine Learning Techniques. Appl. Sci. 2022, 13, 233.
[27] Shen, Y.; Zheng, C.; Wu, F. Study on Traffic Accident Forecast of Urban Excess Tunnel Considering Missing Data Filling. Appl. Sci. 2023, 13, 6773.
[28] Zhang, J.; Li, Z.; Pu, Z.; Xu, C. Comparing prediction performance for crash injury severity among various machine learning and statistical methods. IEEE Access 2018, 6, 60079–60087.
[29] Infante, P.; Jacinto, G.; Afonso, A.; Rego, L.; Nogueira, V.; Quaresma, P.; Saias, J.; Santos, D.; Nogueira, P.; Silva, M.; et al. Comparison of statistical and machine-learning models on road traffic accident severity classification. Computers 2022, 11, 80.
[30] Mansoor, U.; Ratrout, N.T.; Rahman, S.M.; Assi, K. Crash severity prediction using two-layer ensemble machine learning model for proactive emergency management. IEEE Access 2020, 8, 210750–210762.
[31] Vijithasena, R.; Herath, W. Data Visualization and Machine Learning Approach for Analyzing Severity of Road Accidents. In Proceedings of the 2022 International Conference for Advancement in Technology (ICONAT), Goa, India, 21–22 January 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–6.
[32] Wahab, L.; Jiang, H. A comparative study on machine learning based algorithms for prediction of motorcycle crash severity. PLoS ONE 2019, 14, e0214966.
[33] Ville de Montréal. Collisions Routières, [Jeu de données]. Dans Données Québec, 2018. Mis à jour le 19 Décembre 2022. 2022. Available online: https://www.donneesquebec.ca/recherche/dataset/vmtl-collisions-routieres (accessed on 19 December 2023).
[34] Licenses, Creative Commons. Attribution 4.0 International (CC BY 4.0). Creative Commons License. 2013. Available online: https://creativecommons.org/licenses/by/4.0/deed.en (accessed on 20 December 2023).
[35] McKinney, W. An improved air quality index machine learning-based forecasting with multivariate data imputation approach. Atmosphere. Sci. Comput. 2022, 13, 1144.
[36] Emmanuel, T.; Maupong, T.; Mpoeleng, D.; Semong, T.; Mphago, B.; Tabona, O. A survey on missing data in machine learning. J. Big Data 2021, 8, 140.
[37] Nijman, S.; Leeuwenberg, A.; Beekers, I.; Verkouter, I.; Jacobs, J.; Bots, M.; Asselbergs, F.; Moons, K.; Debray, T. Missing data is poorly handled and reported in prediction model studies using machine learning: A literature review. J. Clin. Epidemiol. 2022, 142, 218–229.
[38] Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
[39] Swana, E.F.; Doorsamy, W.; Bokoro, P. Tomek link and SMOTE approaches for machine fault classification with an imbalanced dataset. Sensors 2022, 22, 3246.
[40] Muntasir Nishat, M.; Faisal, F.; Jahan Ratul, I.; Al-Monsur, A.; Ar-Rafi, A.M.; Nasrullah, S.M.; Reza, M.T.; Khan, M.R.H. A comprehensive investigation of the performances of different machine learning classifiers with SMOTE-ENN oversampling technique and hyperparameter optimization for imbalanced heart failure dataset. Sci. Program. 2022, 2022, 3649406.
[41] He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–8 June 2008; IEEE: Piscataway, NJ, USA, 2008; pp. 1322–1328.
[42] Ray, S.; Alshouiliy, K.; Roy, A.; AlGhamdi, A.; Agrawal, D.P. Chi-squared based feature selection for stroke prediction using AzureML. In Proceedings of the 2020 Intermountain Engineering, Technology and Computing (IETC), Orem, UT, USA, 2–3 October 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–6.
[43] Spencer, R.; Thabtah, F.; Abdelhamid, N.; Thompson, M. Exploring feature selection and classification methods for predicting heart disease. Digit. Health 2020, 6, 2055207620914777.
[44] Thaseen, I.S.; Kumar, C.A. Intrusion detection model using fusion of chi-square feature selection and multi class SVM. J. King Saud Univ.-Comput. Inf. Sci. 2017, 29, 462–472.
[45] Guo, M.; Yuan, Z.; Janson, B.; Peng, Y.; Yang, Y.; Wang, W. Older pedestrian traffic crashes severity analysis based on an emerging machine learning XGBoost. Sustainability 2021, 13, 926.
[46] Dong, S.; Khattak, A.; Ullah, I.; Zhou, J.; Hussain, A. Predicting and analyzing road traffic injury severity using boosting-based ensemble learning models with SHAPley Additive exPlanations. Int. J. Environ. Res. Public Health 2022, 19, 2925.
[47] Lu, P.; Zheng, Z.; Ren, Y.; Zhou, X.; Keramati, A.; Tolliver, D.; Huang, Y. A gradient boosting crash prediction approach for highway-rail grade crossing crash analysis. J. Adv. Transp. 2020, 2020, 6751728.
[48] Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
[49] Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM Sigkdd International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
[50] Bentéjac, C.; Csörgo, A.; Martínez-Muñoz, G. A comparative analysis of gradient boosting algorithms. Artif. Intell. Rev. 2021, 54, 1937–1967.
[51] Sarveshvar, M.; Gogoi, A.; Chaubey, A.K.; Rohit, S.; Mahesh, T. Performance of different machine learning techniques for the prediction of heart diseases. In Proceedings of the 2021 International Conference on Forensics, Analytics, Big Data, Security (FABS), Bengaluru, India, 21–22 December 2021; IEEE: Piscataway, NJ, USA, 2021; Volume 1, pp. 1–4.
[52] Hébert, A.; Guédon, T.; Glatard, T.; Jaumard, B. High-resolution road vehicle collision prediction for the city of montreal. In Proceedings of the 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 9–12 December 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1804–1813.
[53] Mufid, M.R.; Basofi, A.; Al Rasyid, M.U.H.; Rochimansyah, I.F. Design an mvc model using python for flask framework development. In Proceedings of the 2019 International Electronics Symposium (IES), Surabaya, Indonesia, 27–28 September 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 214–219.

Figure 1. Distribution of accident rates by severity level.

Figure 2. Hourly distribution of accident severity.

Figure 3. Weekly distribution of accident severity.

Figure 4. Monthly distribution of accident severity.

Figure 5. Yearly distribution of accident severity.

Figure 6. Comparison of classification algorithm performance.

Figure 7. Montreal severity accident predictor web application architecture.

Figure 8. Montreal severity accident predictor web application interface.

Figure 9. Spatial distribution of accident severity in Montreal.

Table 1. Comparative analysis of machine learning approaches for predicting the severity of traffic accidents.

Study	Focus	Data Used	Models Evaluated	Key Findings
[14]	Prediction of traffic accidents	New Zealand dataset (2016–2020)	RF, DJ, AdaBoost, XGBoost, LGBM, CatBoost	RF most effective with 81.45% accuracy. Importance of road category and vehicle number.
[15]	Accident prediction based on road and traffic data	Not specified	Ensemble learning CPM-GAs	Improved accuracy and reduced variance in predictions.
[16]	Predicting the severity of a traffic accident	UK road safety dataset	Deep Forests	Superior stability and accuracy with minimal hyperparameters.
[17]	Traffic accident prediction with deep learning	Data from Knox County, Tennessee	Improved deep learning model, MVNB	The model is characterized by prediction accuracy and dimensionality reduction.
[18]	Accident severity prediction	I5 interstate highway, Washington State (2011–2015)	Improved neural network	The focus is on vehicle-related versus road-related factors.
[19]	Predicting the severity of a traffic accident	Chinese National Car Accident In-Depth Investigation System (2018–2020)	RF	The RF algorithm is superior in predicting severity.
[20]	Analysis of traffic accidents	UK dataset (2005–2017)	Naive-Bayes, LR, AdaBoost, XGBoost, RF	Insights into accident severity and hotspot identification.
[21]	Causes of the severity of a traffic accident	UK road accident database	NCA, k-nearest neighbors, Individual Conditional Expectation	Identified significant factors influencing the severity of the accident.
[22]	Traffic accident prediction with ANN	Serbia and Bosnia and Herzegovina	ANN	High accuracy in predicting accident events and severity.
[23]	Predicting the severity of road accidents in Mauritius	Not specified	ANN (MLP)	MLP outperforms other models with an accuracy of 84.1%.
[24]	Pedestrian crash model	Italy, ISTAT dataset (5 years)	Gradient Boosting	Effective in predicting the risk of pedestrian accidents.
[25]	Analysis of crash severity and hotspots	Al-Ahsa, Saudi Arabia (2016–2018)	Gradient Boosting, RF, logistic regression	Identified factors and hotspots for severe road traffic crashes (RTCs).
[26]	Severity of highway accident in Saudi Arabia	Qassim Province (2017–2019)	RF, XGBoost, logistic regression	XGBoost is the most accurate at predicting accident severity.
[27]	Traffic accident forecast in tunnels	YingTian Street Tunnel, Nanjing	GCN-LSTM, BP neural network, RF	The RF model excels at predicting the duration of an accident.
[28]	Predicting the severity of injuries in an accident	Highway divergence areas, Florida	K-Nearest Neighbor, Decision Tree, RF, SVM	RF most effective; highlights overfitting problems.
[29]	Classification of the severity of a traffic accident	Setúbal, Portugal (2016–2019)	Logistic regression, machine learning models	Comparing performance between models on balanced datasets.
[30]	Accident severity prediction for emergency management	Great Britain (2011–2016)	Two-layer ensemble model	Superior performance in accuracy and F1 score.
[31]	Analysis of the severity of traffic accidents	USA (2016–2019)	Random Forest	High accuracy in predicting accident severity.
[32]	Predicting the severity of a motorcycle accident in Ghana	Ghana (2011–2015)	J48 Decision Tree, RF, IBk	RF is the most accurate in predicting severity.
Current Work	Accident severity prediction in Montreal	Montreal collision data (2012–2021)	XGBoost, CatBoost, RF, GB	XGBoost achieved the highest accuracy (96%) in predicting accident severity.

Table 2. Summary of the dataset.

Description	Value
Number of rows	218,272
Number of columns	68
Type of data	float64, int64, object
Categorical variables	15 (type object)
Numerical variables	53 (29 int64, 24 float64)

Table 3. Example of coding for the Severity attribute.

Severity of the Accident	Numerical Coding
Damage Below Reporting Threshold	0
Property Damage Only	1
Minor	2
Serious	3
Fatal	4
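
The coding in Table 3 amounts to an ordinal mapping from severity label to integer. The following is a minimal sketch in Python, assuming the labels appear as plain strings exactly as in the table. The names `SEVERITY_CODES` and `encode_severity` are illustrative and are not taken from the article's code.

```python
# Ordinal coding of the Severity attribute, as listed in Table 3.
SEVERITY_CODES = {
    "Damage Below Reporting Threshold": 0,
    "Property Damage Only": 1,
    "Minor": 2,
    "Serious": 3,
    "Fatal": 4,
}


def encode_severity(labels):
    """Map severity labels to integer codes; an unknown label raises KeyError."""
    return [SEVERITY_CODES[label] for label in labels]


# Hypothetical sample records, for illustration only.
print(encode_severity(["Minor", "Fatal", "Property Damage Only"]))
```

Raising on unknown labels, rather than silently assigning a default code, is a design choice of this sketch: it surfaces spelling variants in the raw data before they reach the classifier.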

Table 4. Attributes with more than 50% missing values dropped from the dataset.

Attribute	Number of Missing	Percentage
kilometer_marker	218,161	99.949146
road_direction	217,882	99.821324
civic_number_suffix	217,828	99.796584
road_number	217,550	99.669220
construction_zone	213,368	97.753262
special_situation	213,077	97.619942
positioning	169,056	77.451987
road_surface	165,001	75.594213
distance_in_meters	157,580	72.194326
cardinal_point_code	150,120	68.776572
civic_number	124,781	57.167662
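
The Percentage column in Table 4 is consistent with the number of missing values divided by the 218,272 rows reported in Table 2. The sketch below reproduces that arithmetic and the 50% drop rule for a small subset of attributes. It is an illustration under these assumptions, not the authors' preprocessing code, and the helper names `missing_percentage` and `columns_to_drop` are hypothetical.

```python
N_ROWS = 218_272  # number of rows reported in Table 2

# Missing-value counts copied from Tables 4 and 5 (a subset, for illustration).
missing_counts = {
    "kilometer_marker": 218_161,
    "civic_number": 124_781,
    "type_of_marker": 82_307,
    "weather_conditions": 13_602,
}


def missing_percentage(n_missing, n_rows=N_ROWS):
    """Share of rows with a missing value, in percent."""
    return 100.0 * n_missing / n_rows


def columns_to_drop(counts, threshold=50.0, n_rows=N_ROWS):
    """Names of attributes whose missing share exceeds the threshold (in percent)."""
    return [
        name
        for name, n_missing in counts.items()
        if missing_percentage(n_missing, n_rows) > threshold
    ]


for name, n_missing in missing_counts.items():
    print(f"{name}: {missing_percentage(n_missing):.6f}%")
print("dropped:", columns_to_drop(missing_counts))
```

With these inputs, `kilometer_marker` and `civic_number` exceed the threshold and are dropped, while `type_of_marker` and `weather_conditions` stay below it and remain candidates for imputation.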

Table 5. Categorical attributes with missing values for imputation.

Attribute	Number of Missing	Percentage
type_of_marker	82,307	37.708456
collision_near	71,083	32.566248
road_configuration	21,972	10.066339
longitudinal_location	17,763	8.138011
weather_conditions	13,602	6.231674
lighting	12,919	5.918762
surface_condition	12,760	5.845917
street_name	12,298	5.634255
collision_type	10,067	4.612135
road_aspect	9917	4.543414
environment	7055	3.232206
road_category	6355	2.911505
detached_location	19	0.008705
administrative_region	8	0.003665
county_name	8	0.003665
municipality_code	7	0.003207

Table 6. Numerical attributes with missing values for imputation.

Attribute	Number of Missing	Percentage
authorized_speed	80,885	37.056975
x_coordinate	11	0.005040
y_coordinate	11	0.005040
longitude	11	0.005040
latitude	11	0.005040

Table 7. Performance of data balancing algorithms.

Balancing Algorithm	Accuracy
SMOTE-ENN	0.985085
SMOTE-Tomek	0.895400
SMOTE	0.867106
ADASYN	0.811218

Table 8. The 30 most important attributes selected using the chi-square method.

Feature	Chi-Square Score	Percentage
Collision_Near	495,129.765158	30.526103
Street_Name	175,285.888584	10.806854
Num_Serious_Injuries	169,805.678768	10.468984
Num_Deaths	162,050.724638	9.990870
Total_Victims	151,493.855851	9.340010
Num_Minor_Injuries	150,924.871613	9.304931
Pedestrian_Deaths	98,471.014493	6.071007
Total_Pedestrian_Victims	31,182.458448	1.922484
Pedestrian_Injuries	30,277.689032	1.866702
Longitudinal_Location	24,494.472488	1.510151
Bicycle_Deaths	20,159.420290	1.242883
Bicycle_Injuries	16,883.042752	1.040886
Total_Bicycle_Victims	16,847.829913	1.038715
X_Coordinate	14,566.502900	0.898065
Motorcycle_Deaths	11,630.434783	0.717048
Bicycle_Count	10,794.691447	0.665522
Unspecified_Vehicle_Count	8160.399934	0.503111
Y_Coordinate	6016.332705	0.370923
Total_Motorcycle_Victims	4873.843542	0.300486
Motorcycle_Injuries	4830.242330	0.297798
Road_Category	3951.186923	0.243601
Emergency_Vehicle_Count	2492.677718	0.153680
Heavy_Trucks_Count	2031.992082	0.125278
Motorcycle_Count	1912.272853	0.117897
Light_Cars_Trucks_Count	1911.036696	0.117821
Collision_Type	1561.271344	0.096257
Hour	1255.653908	0.077414
Authorized_Speed	1127.238142	0.069497
Surface_Condition	950.290263	0.058588
Weather_Conditions	915.358973	0.056434
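
As a rough illustration of how scores like those in Table 8 can be obtained, the sketch below computes the Pearson chi-square statistic of a feature-by-class contingency table and then expresses each score as a share of the summed scores. The percentages listed in Table 8 add up to roughly 100, which suggests the Percentage column was derived in this way. The implementation used in the article is not shown in this section, and library routines may compute the statistic differently, so the function name `chi_square_score` and the toy data are assumptions made purely for illustration.

```python
from collections import Counter


def chi_square_score(feature, target):
    """Pearson chi-square statistic of the feature-by-target contingency table."""
    n = len(feature)
    observed = Counter(zip(feature, target))  # joint counts; absent pairs count as 0
    feature_totals = Counter(feature)
    target_totals = Counter(target)
    score = 0.0
    for f_value, f_total in feature_totals.items():
        for t_value, t_total in target_totals.items():
            expected = f_total * t_total / n  # count expected under independence
            score += (observed[(f_value, t_value)] - expected) ** 2 / expected
    return score


# Toy data (hypothetical): four accidents, two severity classes.
severity = [0, 0, 1, 1]
features = {
    "fully_dependent": ["a", "a", "b", "b"],
    "partly_dependent": ["a", "a", "a", "b"],
    "independent": ["a", "b", "a", "b"],
}

scores = {name: chi_square_score(values, severity) for name, values in features.items()}
total = sum(scores.values())
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: score={score:.4f}, share={100 * score / total:.2f}%")
```

A feature that is independent of the class scores zero, and the score grows as the observed counts move away from those expected under independence, which is why attributes tied closely to the outcome rank at the top of such a list.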

Table 9. Summary of the classification report.

Class	Precision	Recall	F1 Score	Support	Accuracy
Results for XGBoost
Damage Below Reporting Threshold	0.79	0.75	0.77	1385
Property Damage Only	0.66	0.56	0.61	907
Minor	0.89	0.84	0.86	3974
Serious	0.97	1.00	0.98	12953
Fatal	1.00	1.00	1.00	15346
Weighted Avg	0.96	0.96	0.96	34565	0.96
Results for CatBoost
Damage Below Reporting Threshold	0.76	0.72	0.74	1385
Property Damage Only	0.62	0.43	0.51	907
Minor	0.86	0.78	0.82	3974
Serious	0.95	1.00	0.97	12953
Fatal	1.00	1.00	1.00	15346
Weighted Avg	0.94	0.95	0.94	34565	0.95
Results for RF
Damage Below Reporting Threshold	0.75	0.70	0.73	1385
Property Damage Only	0.62	0.31	0.42	907
Minor	0.81	0.65	0.72	3974
Serious	0.91	0.99	0.95	12953
Fatal	0.99	1.00	1.00	15346
Weighted Avg	0.92	0.93	0.92	34565	0.93
Results for GB
Damage Below Reporting Threshold	0.75	0.72	0.73	1385
Property Damage Only	0.56	0.42	0.48	907
Minor	0.76	0.59	0.67	3974
Serious	0.88	0.92	0.90	12953
Fatal	0.94	0.98	0.96	15346
Weighted Avg	0.88	0.89	0.88	34565	0.89
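
The "Weighted Avg" rows in Table 9 can be checked directly from the per-class rows: each metric is averaged with the class support as the weight. The sketch below does this for the XGBoost block and also computes the unweighted (macro) average for comparison. The macro average is not reported in the table and is included here only as an illustration. The names `xgboost_report`, `weighted_average`, and `macro_average` are hypothetical.

```python
# Per-class (precision, recall, F1, support) for XGBoost, copied from Table 9.
xgboost_report = {
    "Damage Below Reporting Threshold": (0.79, 0.75, 0.77, 1385),
    "Property Damage Only": (0.66, 0.56, 0.61, 907),
    "Minor": (0.89, 0.84, 0.86, 3974),
    "Serious": (0.97, 1.00, 0.98, 12953),
    "Fatal": (1.00, 1.00, 1.00, 15346),
}

SUPPORT = 3  # index of the support column in each tuple


def weighted_average(report, metric_index):
    """Average of one metric over classes, weighted by class support."""
    total_support = sum(row[SUPPORT] for row in report.values())
    weighted_sum = sum(row[metric_index] * row[SUPPORT] for row in report.values())
    return weighted_sum / total_support


def macro_average(report, metric_index):
    """Unweighted average of one metric over classes."""
    return sum(row[metric_index] for row in report.values()) / len(report)


for index, metric in enumerate(["precision", "recall", "f1"]):
    print(
        f"{metric}: weighted={weighted_average(xgboost_report, index):.2f}, "
        f"macro={macro_average(xgboost_report, index):.2f}"
    )
```

The weighted values round to the 0.96 reported in the table. The Serious and Fatal classes together account for about 82% of the 34,565 test samples, so they dominate the weighted figures. The macro averages are lower because they give equal weight to the smaller classes, where precision and recall are weaker.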

Table 10. The key factors influencing the severity of traffic accidents.

Feature	Chi-Square Score	Percentage
Collision_Near	495,129.765158	30.526103
Street_Name	175,285.888584	10.806854
Longitudinal_Location	24,494.472488	1.510151
Road_Category	3951.186923	0.243601
Emergency_Vehicle_Count	2492.677718	0.153680
Heavy_Trucks_Count	2031.992082	0.125278
Motorcycle_Count	1912.272853	0.117897
Light_Cars_Trucks_Count	1911.036696	0.117821
Collision_Type	1561.271344	0.096257
Surface_Condition	950.290263	0.058588
Weather_Conditions	915.358973	0.056434

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Muktar, B.; Fono, V. Toward Safer Roads: Predicting the Severity of Traffic Accidents in Montreal Using Machine Learning. Electronics 2024, 13, 3036. https://doi.org/10.3390/electronics13153036

