Machine Learning Statistical Model Using Transportation Data
Also, most of the columns were irrelevant and contained more than 60% missing
values, so I decided to drop those features.
Geographical heatmap of accidents in each state
Predictive Analysis
► Predictive analytics uses mathematical modeling tools to generate predictions
about an unknown fact, characteristic, or event. “It’s about taking the data that
you know exists and building a mathematical model from that data to help you
make predictions about somebody not yet in that data set,” Goulding explains.
► An analyst’s role in predictive analysis is to assemble and organize the data,
identify which type of mathematical model applies to the case at hand, and
then draw the necessary conclusions from the results. They are often also
tasked with communicating those conclusions to stakeholders effectively and
engagingly.
► “The tools we’re using for predictive analytics now have improved and
become much more sophisticated,” Goulding says, explaining that these
advanced models have allowed us to “handle massive amounts of data in ways
we couldn’t before.”
► Examples: Linear Regression, Logistic Regression, Decision Trees, Random
Forest, Support Vector Machines, etc.
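To make this concrete, here is a minimal sketch of a predictive model: a logistic regression fit on synthetic data. The synthetic dataset and parameters are illustrative assumptions, not the project's actual accident data.

# A minimal predictive-analytics sketch: fit a logistic regression and
# score it on held-out data (synthetic data stands in for the real set).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                  # learn from known data
print("Test accuracy:", model.score(X_test, y_test))  # predict on unseen data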
Cluster Analysis
► Clustering is the process of dividing a population or set of data points into
groups so that data points in the same group are more similar to one another
than to data points in other groups. It essentially groups objects based on
their similarity and dissimilarity.
► Cluster analysis itself is not one specific algorithm but the general task to be
solved. It can be achieved by various algorithms that differ significantly in
their understanding of what constitutes a cluster and how to efficiently find
them. Popular notions of clusters include groups with small distances between
cluster members, dense areas of the data space, intervals, or particular
statistical distributions.
► Clustering can therefore be formulated as a multi-objective
optimization problem. The appropriate clustering algorithm and parameter
settings (including parameters such as the distance function to use, a density
threshold or the number of expected clusters) depend on the individual data
set and intended use of the results.
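As a minimal sketch of cluster analysis, the snippet below runs k-means, one popular distance-based clustering algorithm, on synthetic data; the choice of k=4 and the synthetic blobs are illustrative assumptions, and in practice k is one of the parameters that must be tuned to the dataset.

# A minimal clustering sketch: k-means groups points by small distances
# to a cluster centroid (one common notion of a cluster).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)       # cluster assignment for each point
print(kmeans.cluster_centers_)       # one centroid per discovered cluster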
Random Forest
► Random Forest is a supervised machine learning algorithm. This technique can be
used for both regression and classification tasks but generally performs better on
classification tasks. As the name suggests, the Random Forest technique considers
multiple decision trees before giving an output, so it is essentially an ensemble of
decision trees.
► This technique is based on the idea that a greater number of trees will converge
on the right decision. For classification, it uses a voting system to decide the
class, whereas for regression it takes the mean of the outputs of all the
decision trees.
► It works well with large, high-dimensional datasets. The random forest
algorithm is an extension of the bagging method, as it utilizes both bagging and
feature randomness to create an uncorrelated forest of decision trees. Feature
randomness, also known as feature bagging or "the random subspace method",
generates a random subset of features at each split, which ensures low correlation
among the decision trees.
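A minimal sketch of these ideas using scikit-learn's RandomForestClassifier; the synthetic data and hyperparameters are illustrative assumptions. Setting max_features="sqrt" enables the feature bagging described above: each split considers only a random subset of features.

# A minimal random forest sketch: an ensemble of decision trees built
# with bagging plus per-split feature randomness.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 100 trees; each is trained on a bootstrap sample and considers a random
# sqrt-sized subset of features at every split (the random subspace method).
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            random_state=42)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))  # class chosen by vote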
Random Forest Results
K-Nearest Neighbors
► The k-nearest neighbor algorithm, also known as KNN or k-NN, is a
non-parametric, supervised learning classifier that uses proximity to classify or
predict the grouping of an individual data point. It can be used for both regression
and classification problems, but it is most commonly used as a classification
algorithm, based on the assumption that similar points can be found close together.
► For classification problems, a class label is assigned by majority vote; that is, the
label that is most frequently represented around a given data point is used. While
technically this is referred to as "plurality voting," the term "majority vote" is more
commonly used in the literature.
► The difference between these terms is that "majority voting" technically requires a
majority of more than 50%, which only works when there are exactly two options.
When there are multiple classes, say four categories, you don't always need 50% of
the vote to make a decision about a class; you could assign a class label with a vote
of more than 25%.
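A minimal KNN sketch with scikit-learn; the synthetic data and k=5 are illustrative assumptions. Each test point is labeled by a plurality vote among its five nearest training points (Euclidean distance by default).

# A minimal k-nearest neighbors sketch: classify by proximity.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# k=5: the predicted label is the most common label among the 5 closest
# training points to each query point.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))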
KNeighborsClassifier
Variable Selection Method
► Feature or variable selection methods are used to select the specific features from our dataset that are useful and important
for the model to learn and predict. Feature selection is therefore an important step in the development of a machine
learning model; its goal is to identify the best set of features for building that model.
► Some popular techniques of feature selection in machine learning are:
• Filter methods
• Wrapper methods
• Embedded methods
► Filter Methods
• These methods are generally used during the pre-processing step. They select features from the dataset
irrespective of any machine learning algorithm.
• Techniques such as: information gain, chi-square, variance threshold, mean absolute difference, etc. (two of these are
sketched after this list).
► Wrapper methods:
• Wrapper methods, also referred to as greedy algorithms, train the algorithm using a subset of features in an iterative
manner. Based on the conclusions drawn from the previous round of training, features are added or removed.
• Techniques such as: forward selection, backward elimination, bi-directional elimination, etc.
► Embedded methods:
• In embedded methods, the feature selection algorithm is blended into the learning algorithm itself, giving the model its own
built-in feature selection. Embedded methods overcome the drawbacks of filter and wrapper methods and merge
their advantages.
• Techniques such as: regularization, tree-based methods.
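The sketch below illustrates two of the filter techniques named above on synthetic data: a variance threshold and an information-gain-style criterion (mutual information). The threshold values, k, and dataset are illustrative assumptions.

# A minimal filter-method sketch: select features without any model.
from sklearn.datasets import make_classification
from sklearn.feature_selection import (SelectKBest, VarianceThreshold,
                                       mutual_info_classif)

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=42)

# Filter 1: drop near-constant features (variance below the threshold).
X_var = VarianceThreshold(threshold=0.1).fit_transform(X)

# Filter 2: keep the 5 features scoring highest on mutual information
# with the target (an information-gain-style criterion).
X_best = SelectKBest(mutual_info_classif, k=5).fit_transform(X, y)

print(X_var.shape, X_best.shape)  # reduced feature matrices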
Variable selection using SequentialFeatureSelection
► Sequential feature selection algorithms are a family of greedy search algorithms
used to reduce an initial d-dimensional feature space to a k-dimensional feature
subspace, where k < d. Feature selection algorithms are designed to automatically
select a subset of features that are most relevant to the problem.
► A wrapper approach, such as sequential feature selection, is especially useful when
embedded feature selection, such as a regularization penalty like LASSO, is not
applicable.
► SFAs, in a nutshell, remove or add features one at a time based on classifier
performance until a feature subset of the desired size k is reached.
► There are basically four types of SFAs:
1. Sequential Forward Selection (SFS)
2. Sequential Backward Selection (SBS)
3. Sequential Forward Floating Selection (SFFS)
4. Sequential Backward Floating Selection (SBFS)
► The one we have employed in our project is Sequential Forward Selection, using
the Mlxtend feature-selection library to select the best features for the model.
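A minimal sketch of Sequential Forward Selection with Mlxtend's SequentialFeatureSelector; the estimator, k_features=5, scoring metric, and synthetic data are illustrative assumptions rather than the project's actual configuration.

# A minimal SFS sketch: starting from an empty set, greedily add the
# feature that most improves cross-validated accuracy until k are chosen.
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=15, n_informative=5,
                           random_state=42)

sfs = SFS(KNeighborsClassifier(n_neighbors=5),
          k_features=5,        # stop once 5 features are selected
          forward=True,        # Sequential Forward Selection
          floating=False,      # no floating step (plain SFS, not SFFS)
          scoring="accuracy",
          cv=5)
sfs = sfs.fit(X, y)
print("Selected feature indices:", sfs.k_feature_idx_)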
Testing the Model on Variables Selected by the Algorithm
Decision Tree
► A decision tree is a decision support tool that uses a tree-like model of decisions and their possible
consequences, including chance event outcomes, resource costs, and utility. It is one way to display
an algorithm that only contains conditional control statements. Decision trees are commonly used
in operations research, specifically in decision analysis, to help identify a strategy most likely to reach
a goal, but they are also a popular tool in machine learning. A decision tree is a flowchart-like structure in
which each internal node represents a "test" on an attribute (e.g. whether a coin flip comes up heads or
tails), each branch represents the outcome of the test, and each leaf node represents a class label
(decision taken after computing all attributes). The paths from root to leaf represent classification
rules. In decision analysis, a decision tree and the closely related influence diagram are used as a
visual and analytical decision support tool, where the expected values (or expected utility) of
competing alternatives are calculated.
► A decision tree consists of three types of nodes: decision nodes, chance nodes, and end nodes.
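As a brief illustration, the sketch below fits a shallow scikit-learn DecisionTreeClassifier on the Iris dataset and prints its flowchart-like rules; the dataset and depth are illustrative assumptions.

# A minimal decision tree sketch: each internal node tests one attribute,
# each leaf holds a class label, and root-to-leaf paths are the rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X, y)
print(export_text(tree))  # text rendering of the learned classification rules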