Machine Learning For Beginners
1.0.0 - Introduction
1.1.0 - Course purpose
Welcome to my Machine Learning Primer course! Whether you're looking to build Machine Learning algorithms yourself or just want to see
how to use Machine Learning in your organization, you're in the right place. The first 4 chapters will have information for everyone;
from chapter 5 onwards we'll start actually implementing algorithms, so if you only want to learn about how ML is used in industry,
you can watch until then.
Machine Learning is an ever-evolving subject with a constantly growing body of knowledge; there will always be more to learn.
The purpose of this course is to give you a practical understanding of the basics of Machine Learning so you can decide
whether this material is for you and in what direction you'd like to continue learning. We won't focus on the math and theory behind
Machine Learning; for that I recommend the excellent book Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow, 2nd
Edition (affiliate link below).
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, by Aurélien Géron
https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1492032646?crid=18FZPXM3VIMFT&dchild=1&keywords=hands+on+machine+learning+with+scikit-learn+and+tensorflow+2&qid=1620018613&sprefix=hands+on+%2Caps%2C197&sr=8-3&linkCode=ll1&tag=shashankstore-20&linkId=738fc1e1fcd7646120a3b625e9d395db&language=en_US&ref_=as_li_ss_tl
2.0.0 - ML Basics
Machine learning (ML) is the study of computer algorithms that can improve automatically through
experience and by the use of data - Wikipedia
What does this mean? Let's think of it from the perspective of a problem. Say I work at Nike and am in charge of building a
section of their website. When I go through the website and hover over the different menus, they behave differently but
deterministically, meaning that if I input a certain command (hover over the menu) then a definite outcome will occur.
But let's say the problem now is to predict which shoes to stock in a location based on a bunch of customer data from a region. If
you have 10,000 dimensions of customer data (things like gender, Instagram likes, favorite sports, average household income...
stuff like that), it would be impossible, or at least very tedious, to hard-code rules for how to stock a store across 10,000
dimensions of data. We might be able to do it for 30 stores or so, but what happens when the data changes or we need to roll this
out nationally? The hard-coded solution is not at all scalable. Instead, we would use a machine learning algorithm to ingest all of this
data and then produce a prediction based on the results of maybe 10 hard-coded stores.
Essentially a Machine Learning algorithm takes in a bunch of inputs in the form of data, and then produces an output that maximizes
some target metric (user retention, user happiness etc.) and can improve itself through more data being added to the algorithm.
AI Created Outfits
At Nordstrom, digital stylists help customers to feel good and look their best by creating outstanding outfits
through a variety of styling experiences: one-on-one virtual styling help, try before you buy with Trunk Club,
personally curated looks, thematic outfit curations, and outfits that showcase the versatility of an individual product.
https://medium.com/tech-at-nordstrom/ai-created-outfits-9529300a1af3
So when should you use Machine Learning? Broadly, in two situations:
You cannot code the rules: This includes things like facial recognition or voice recognition, where very subtle differences between
individual faces and voices can render a rules-based approach unusable.
You cannot scale: This includes things like the previous Nike example. If there are thousands of variables that you need to
consider, it might not be possible or economical to hard-code rules, and machine learning might be necessary.
Machine learning might also be a great version 2 for a product that you're creating. If you wanted to create a stock trading bot that
traded stocks based on whether the President of the United States mentioned them positively or negatively in a Tweet, then you
could start with a simple version that just counts the number of "positive" words in the Tweet vs. the number of "negative" words in
the Tweet, and makes a decision based on that. Of course, this approach will quickly run into the many quirks of grammar, rendering
it inaccurate at best. After a version 1 with this simple approach, you could always build a version 2 with a proper sentiment
analysis algorithm.
(c. 1700 B.C.E.) Note: The Code of Hammurabi was a compilation of almost three hundred laws on every aspect of life. Much can be learned both about Mesopotamian
life and ideals through these laws.
http://www.wright.edu/~christopher.oldstone-moore/Hamm.htm
Basically, creators are responsible for the soundness of what they create. Now for better or worse, that's not exactly how society
today works, and this is even less true in the realm of software. Machine Learning algorithms have become so ubiquitous that they
form a large portion, if not the majority, of trading volume on the world's largest stock markets.
Algorithms for facial recognition have been used by law enforcement even when it's debatable how well they can recognize
individuals.
All this is to say, consider how your algorithms are trained and how they'll be used as you're building them.
CRISP-DM:
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
💡 Remember: You'll very rarely be asked to build a Machine Learning algorithm. You're generally being asked to solve a
problem, and ML is just the tool being used to solve it.
You'll notice that there are arrows going back and forth between Business Understanding and Data Understanding. This is because
people might ask you to solve a problem that you don't actually have the data to solve. That's why it's a
back-and-forth between understanding what your business objective is and what data you actually have to solve for it.
In order to keep this course a bit more focused on the Machine Learning, I'm going to skip over this section of the process, but will
have a video in the future where I go over this process.
1. Translates all of your data into a format that the ML algorithm can work with
a. Alphabetical characters typically need to be translated into numbers, and dates need to be encoded to be recognized as
dates by the computer
2. Cleans up junk or meaningless values from your dataset so you don't "confuse" your algorithm
3. Adds new information to your dataset so that your algorithm can derive better insights
Although jumping straight into modeling is very tempting, your algorithm's performance will be largely dependent on how well you've
been able to clean your data so don't skimp on this step.
4.4.0 - Modeling
This is generally considered the most fun part. I like to think of Data Preparation as the meat and potatoes of CRISP-DM, and
Modeling as the spices, you can't eat spices raw, but your meal won't taste that interesting without spices.
Modeling is where the rubber hits the road for all the work we've done up until now.
Supervised → This is the first paradigm and refers to a problem where we're trying to match a set of data to an expected set of
outcomes. The expected set of outcomes is called our "Target Variable" and is what we're trying to predict. An example would
be trying to classify images as hotdogs or not hotdogs given we have a bunch of labeled images for the algorithm to reference.
Within the Supervised paradigm, we have two problem types:
Regression: A regression problem is one where the data we're matching to is numerical. An example would be predicting
how much different customers might spend given attributes about said customers
Classification: A classification problem is one where we try and classify things into groups instead of predicting numerical
values. The quintessential example of this is spam mail classification. In this example, we're not predicting a numerical
value but rather a category, "spam" or "not spam". Another would be our "hotdog/not hotdog" example
Unsupervised → This is basically the opposite of a supervised algorithm and refers to a problem where we don't have any
labeled data and are telling the algorithm to try and find natural patterns in the data. In other words, we don't have a target
variable. This would be like taking a bunch of customer data and telling an algorithm to find natural groups based on the data.
We can use this to discover patterns in the data.
Clustering: Just like it sounds, clustering is a problem type where we're trying to group data into natural groups using a
machine learning algorithm. A common application of clustering algorithms is in the domain of marketing. By taking a bunch
of customer data you can then cluster people into groups (often called archetypes) that can be marketed to in different
ways. If you've ever seen an advertisement on Instagram or another social network for a product you were just talking to
your friends about, there's probably a clustering algorithm that placed you into a group of people who would also be
interested in that product. (Basically your phone is not listening to you, algorithms are just that intelligent, and you're just that
predictable).
Reinforcement → Reinforcement learning is a form of learning where the algorithm is asked to solve a problem and is
rewarded for certain types of behavior and punished for other types of behavior. We won't be covering reinforcement learning in
this course.
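To make the first two paradigms concrete, here's a minimal sketch (X would be a feature table and y a column of labels; both, and the choice of algorithms, are just for illustration):

# Supervised: we hand the algorithm the data AND the answers (the target y)
from sklearn.linear_model import LogisticRegression
supervised_model = LogisticRegression().fit(X, y)

# Unsupervised: we hand it only the data and ask it to find natural groups
from sklearn.cluster import KMeans
clusters = KMeans(n_clusters=3).fit_predict(X)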
After we've trained our model and taken some initial measurements, we might want to improve its performance through a process
called Hyperparameter Tuning.
What exactly are hyperparameters? To learn this it will help to take a quick look inside a machine learning algorithm. A machine
learning algorithm can be thought of as a complicated math equation with a bunch of different variables (this isn't strictly accurate
but works for the sake of our example). During training, our machine learning algorithm is trying out a bunch of different values for
each of the variables in said equation until it is able to minimize some other equation called a 'cost function'. The values that it's
changing are the 'parameters' of the machine learning algorithm. Ergo, a hyperparameter is a value that is set before training and controls the learning process itself, rather than being learned from the data (for example, how many trees a Random Forest builds).
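As a rough sketch (the model, the values, and the training data here are purely illustrative), hyperparameters are the knobs we set up front, while parameters are what training figures out:

from sklearn.ensemble import RandomForestRegressor

# n_estimators and max_depth are hyperparameters: we choose them before training
model = RandomForestRegressor(n_estimators=200, max_depth=10)

# fitting learns the parameters (the splits inside each tree) from the data
model.fit(X_train, y_train)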
4.5.0 - Evaluation
After training our algorithm, we'll want to determine how well it performed. To do this, we'll test out the performance of our algorithm
on the test dataset we separated from the rest of our training dataset in the Data Preparation step of this process. We will then
compare the predictive power of our machine learning algorithm on the training set vs the test dataset. There are different
measurements we'll use in order to determine how well our algorithm performs. The point of measuring the performance of our
algorithm against a test dataset is to simulate performance of the algorithm in the real world with a dataset the algorithm has never
seen. We'll use different measures depending on whether we're working with a regression or classification problem. I'll go over the
details of each of these methods in the following chapters, where we actually solve one of each of these problems.
Regression: Mean Absolute Error
Classification: Accuracy Score, Confusion Matrix
You might be asking yourself why it's so important to "independently" test the performance of our algorithms. Machine Learning
algorithms often perform better when they are exposed to more data so why are we purposefully preventing the algorithm from
seeing a section of the data? We do this to prevent something called "Overfitting". Overfitting is a phenomenon that occurs when a
machine learning algorithm matches the dataset it's trained on too closely. This is not good because an overfitted
algorithm can't generalize and therefore won't be able to make predictions on data that it hasn't seen before. "Underfitting" is the
exact opposite concept where an algorithm generalizes too much and can't give us anything nearing an accurate prediction. By
removing some data in the form of our test dataset we can compare the results on the test dataset to ensure we're not overfitting our
data (underfitting is prevented by tuning our hyperparameters while training the machine learning algorithm on the training dataset).
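In practice, the tell-tale sign of overfitting is a model that scores much better on the training data than on the test data. A sketch, assuming a fitted regression model and the usual train/test variables:

from sklearn.metrics import mean_absolute_error

train_error = mean_absolute_error(y_train, model.predict(X_train))
test_error = mean_absolute_error(y_test, model.predict(X_test))

# A test error far worse than the training error suggests overfitting
print(train_error, test_error)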
https://www.youtube.com/watch?v=sZDgJKI8DAM
https://deepnote.com/referral?token=4e36c4ca45cb
By signing up for a free account using the affiliate link, you can also help me prove to future channel sponsors that people like the
content we're making here, so I can continue to make courses like this.
Deepnote basically provides an online Jupyter notebook with extra features like collaboration tools, easy integrations with other data
sources, autosaving, easy sharing, and GPUs if you need them.
Once you've created an account, you can access the project with all the code using the link below:
https://deepnote.com/project/Machine-Learning-Fundamentals-zcnNQ1sgRByLq4s4BYEdVQ/%2FChapter_
4-Data_Cleaning.ipynb
import pandas as pd
import numpy as np
career_data = pd.read_csv("data/career_data.csv")
career_data.head()
https://www.kaggle.com/promptcloud/careerbuilder-job-listing-2020/tasks?taskId=3662
Each row should represent a new observation (for a certain column or combination of columns, there shouldn't be duplicate
rows)
In this dataset, each observation is separated by the uniq_id column. Each row represents a different job listing.
In the world of data science, we refer to individual columns in a properly formatted dataset as 'features' or 'variables'. You'll hear me
using this term a lot more.
career_data.nunique()
As you can see we have a couple of columns with only one unique value, let's try systematically getting rid of them. First we need to
isolate a list of columns with only one unique value:
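That cell isn't shown in this excerpt, but one way to build the list looks like this (a sketch):

# Keep the name of every column that has exactly one unique value
unique_counts = career_data.nunique()
columns_to_drop = unique_counts[unique_counts == 1].index.tolist()
columns_to_drop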
And finally let's go ahead and drop these columns; remember to use the columns parameter in the drop() method. Additionally, I
advocate for making a copy of your dataframe at each major step you take, because you'll often want to revise a step, and having
a variable that saves one iteration of your dataframe can save you from having to re-run all prior code at once. This might not work
if your data is too large, though.
career_data_dropped_cols = career_data.copy()
career_data_dropped_cols = career_data_dropped_cols.drop(columns=columns_to_drop)
career_data_dropped_cols
One thing to note: this method doesn't drop columns where all the values are null. We can do that by using the dropna method.
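If we wanted to do that here, a sketch would look like the line below (shown for reference only):

# axis=1 looks at columns, how="all" only drops a column if every value is null
career_data_dropped_cols.dropna(axis=1, how="all")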
Awesome, as you can see we've dropped about 33% of our columns. This will make our data much easier to work with.
As a bit of a sidenote, variability in individual columns is what gives machine learning algorithms their predictive power. There's
information in variability, and processes like Principal Component Analysis (which is beyond the scope of this course) rely on this
variability to eke out information about a dataset.
Numerical
These are essentially numerical values that you can perform math on. The "math" distinction is important. For
example, the American postal code (also known as a zip code) is a five-digit number but shouldn't be encoded as a
numerical value because you can't perform math on a zip code (90210 is no higher or lower than 01931). Additionally, zip
codes in the Northeastern United States usually have leading 0's, which will be removed if you encode them as numerical data
types (01931 will turn into 1931)
Dates
Dates are a way we can measure temporal, or time based data. It's key that you understand your data well enough to
understand what format a date is in, and use features in Python to parse out those dates.
String
This is a catchall data type for anything that can't quite fit into the other two data types. It refers to data that is stored as text.
career_data_dropped_cols.dtypes
As you can see, most of our data uses the object data type. This generally corresponds to a 'string' or text-based data type in
Python. You'll notice that there are some columns that should probably be dates and numbers.
Let's take a look at the salary_offered column. It seems to be formatted as a string with dollar signs and commas. This won't mean
much to a machine learning algorithm, so let's convert it to a float (number with a decimal in it). We'll need to take this string and
remove the dollar sign and commas and then convert it to a float like-so:
career_data_dtype_change = career_data_dropped_cols.copy()
career_data_dtype_change["salary_offered"] = career_data_dtype_change["salary_offered"].apply(lambda x: x.replace('$', '').replace(',', '')
career_data_dtype_change["salary_offered"]
Let's now focus on the dates and a column that isn't a string but should be one. Towards the end of the dataset, you'll see a
postal_code column. In the United States (which is the region this dataset covers) postal codes, more commonly known as 'zip
codes', are a five digit number from '00000' to '99999'. They are generally grouped geographically and can even have leading
zeroes in the northeast (many of Boston's zip codes have leading zeroes). When Python reads in something as a numerical data
type, it will drop leading zeroes that aren't preceded by a decimal place. For this reason, plus the fact that you can't perform math on a
zip code (90210 is no better or worse than 02112), zip codes, although composed entirely of digits, should be
recognized as strings.
In order to ensure that the data is imported correctly, we will actually add the information to convert postal_code to a string in the
statement where we import our CSV.
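A sketch of what that import statement might look like (reusing the path from earlier):

# Force postal_code to be read as a string so leading zeroes survive
career_data = pd.read_csv("data/career_data.csv", dtype={"postal_code": str})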
We can parse dates on import using the parse_dates argument in the read_csv method, but I'll be using a different method here since
we have dates in a couple of different formats.
Let's start by taking a quick look at the formats that our date columns have adopted:
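A quick peek at the two date-like columns we've mentioned (a sketch; your notebook may have more):

career_data_dtype_change[["crawl_timestamp", "postdate_yyyymmdd"]].head()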
As suspected, it looks like the data is coming in a couple of different formats, let's see if we can have Pandas automatically
determine the format of any of these.
pd.to_datetime(career_data_dtype_change["crawl_timestamp"])
It looks like Pandas was not only able to recognize this as a date time, but also recognized the time zone as well. Pretty cool! Let's
create a new dataframe and save this over the old unformatted column.
career_data_dtype_change["crawl_timestamp"] = pd.to_datetime(career_data_dropped_cols["crawl_timestamp"])
career_data_dtype_change
The format we used for the above column will also work for all of the other date columns except one. postdate_yyyymmdd is formatted
the same as the other columns, but because there isn't any separator between the components of the dates, Pandas can't seem to
automatically convert it.
We'll have to take a look at the documentation to figure out how to convert this:
Looking through the documentation, we'll be using the format parameter like so:
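A sketch of that conversion (the %Y%m%d format string is an assumption based on the column name):

career_data_dtype_change["postdate_yyyymmdd"] = pd.to_datetime(career_data_dtype_change["postdate_yyyymmdd"], format="%Y%m%d")

Next, let's check how much of each column is missing. The missing_values list used below isn't shown in this excerpt; something like the following is assumed:

# Values we'll treat as "missing" (an assumption; adjust this list to your data)
missing_values = [np.nan, None, ""]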
career_data_dtype_change.isin(missing_values).mean().sort_values(ascending=False) * 100
When you can't find a way to bring in more data, one solution to a bunch of missing data is to just drop it; that's the first technique
we'll be using. Let's drop any columns where more than 86% of the data is missing:
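A sketch of isolating those column names:

# Names of the columns where more than 86% of the values are missing
missing_pct = career_data_dtype_change.isin(missing_values).mean() * 100
columns_to_drop = missing_pct[missing_pct > 86].index.tolist()
columns_to_drop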
career_data_missing_values = career_data_dtype_change.copy()
career_data_missing_values = career_data_missing_values.drop(columns=columns_to_drop)
career_data_missing_values
Imputation
When you have too much missing data, as in the case of the columns we just dropped, you might not have the option to keep said data.
When less data is missing, there are a couple of tricks you can use in order to keep a larger dataset. In this course, we'll just go over
the most basic one: Imputation. Imputation is simply the process of replacing missing values in a dataset with another value. The
most common values to impute are the mean, median, or mode of the column in question. By doing this, you can keep the column in
question while also putting a relatively good estimate of what the value would have been in place of any missing values.
If we run the following script again, we'll see that we have three columns with missing information.
career_data_missing_values.isin(missing_values).mean().sort_values(ascending=False) * 100
Let's focus on the inferred_city column. You'll see that we have quite a few cities here, and this is a text-based column, so using a median
or mean value for imputation doesn't make much sense. That leaves us with using the mode, but blindly using the
mode of the entire column when we can probably make a more educated guess is not a great idea. Remember GIGO,
garbage in, garbage out: we need to be very careful what we put into our algorithms. It looks like we have a column called
inferred_state. Given that there's a correlation between the state and city columns, we might be able to take the mode city of each
state and map it over the missing values in the inferred_city column.
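A sketch of building that mapping (state_to_city is a name introduced here for illustration):

# Most common (mode) city for each state, as a state -> city dictionary
state_to_city = (
    career_data_missing_values
    .dropna(subset=["inferred_city"])
    .groupby("inferred_state")["inferred_city"]
    .agg(lambda cities: cities.mode().iloc[0])
    .to_dict()
)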
Now that we have a dictionary of how we want to map missing values in the inferred_city column based on the inferred_state
column, let's apply this for any missing values that we might encounter.
career_data_missing_values["inferred_city"] = career_data_missing_values["inferred_city"].fillna(career_data_missing_values["inferred_state
career_data_missing_values
Label Encoding
Houston → 1
Dallas → 2
Austin → 3
When your machine learning algorithm sees this it might assume that there is some relation between the numbers in this column
that doesn't actually exist (Houston is better than Austin because it's number one or Dallas is 2 times Houston because it's a 2). It is
for this reason that we'll generally want to avoid Label Encoding. For practice let's try and implement this on our inferred_state
column.
from sklearn.preprocessing import LabelEncoder

career_data_encoded = career_data_missing_values.copy()
le = LabelEncoder()
career_data_encoded["inferred_state"] = le.fit_transform(career_data_encoded["inferred_state"])
career_data_encoded["inferred_state"]
One Hot Encoding
Instead of mapping each category to a single number, One Hot Encoding creates one new column per category. Take this 'City' column:

index  City
1      Dallas
2      Houston
3      Dallas
4      Dallas
5      Austin

One Hot Encoding it gives us one column per city, with a 1 marking which city each row belongs to:

index  Dallas  Houston  Austin
1      1       0        0
2      0       1        0
3      1       0        0
4      1       0        0
5      0       0        1
As you can see we now have expressed the 'City' column in three different columns called dummy variables, one for each value of
city. This helps avoid the ordinality problem we had earlier. The problem with One Hot Encoding though is that if you use it on a
column with a lot of categories, it can lead to an explosion of features, which can really slow down your processing time and make
running a machine learning algorithm, or even performing basic operations on your data, very hard. We can perform this operation
on our inferred_state data now by using the get_dummies method from Pandas.
pd.get_dummies(career_data_encoded["inferred_state"], prefix="state")
Another problem with One Hot Encoding is the Dummy Variable Trap. When your machine learning algorithm looks through the
three 'City' columns that have been created, if it reads any two of them, it'll know what value is in the third. This ability of certain
columns to predict the values in another column is called multicollinearity, and it's a big no-no in the world of machine learning. Your
simplest defense is to drop one of the dummy columns (the drop_first=True argument we'll pass to get_dummies later does exactly this), since the remaining columns still carry all of the information.
Hash Encoding
Hash Encoding is one method to help deal with the explosion of features that can result from One Hot Encoding. It's out of scope for
this course but I wanted to bring your attention to it. Essentially, you take all of your categories and input them into a hash function
that will reduce the number of categories to something more manageable. As you can imagine, taking 100 categories and mapping
them to a smaller number of categories can lead to two categories ending up with the same value, in something called a 'collision'. It's not a
magic bullet, but it is one way to deal with the explosion of features that might otherwise result from One Hot Encoding.
5.2.9 - Multicollinearity
Multicollinearity refers to multiple features being highly correlated with one another in a model. As we mentioned earlier, this is
something we are trying to avoid because it makes it difficult for the model to determine which variable is affecting the target
variable. There are a couple of ways to limit multicollinearity with the VIF or Variance Inflation Factor probably being the most
popular.
https://online.stat.psu.edu/stat462/node/180/
The above source has probably the best explanation of VIF that I've seen myself.
Like Hash Encoding, this is also something that would be beyond the scope of an ML primer course, but I'd like you to be aware of
it.
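If you're curious what that looks like in code, here's a minimal sketch using statsmodels (it assumes X is an all-numeric feature DataFrame; a VIF above roughly 5-10 is usually taken as a sign of multicollinearity):

from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame({
    "feature": X.columns,
    "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
vif.sort_values("VIF", ascending=False)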
A great example: if you were working with a dataset of car data where you're trying to predict the price of a car based on
some engine metrics, you could use domain knowledge to determine that "displacement" (how much air is displaced by an engine) is
correlated with price. You could then calculate displacement for each row of data based on other information you might have. The
Kaggle link below has code written out that explains this exact process.
Creating Features
Explore and run machine learning code with Kaggle Notebooks | Using data from FE Course Data
https://www.kaggle.com/ryanholbrook/creating-features?scriptVersionId=78174956&cellId=6
5.2.11 - Scaling
Many machine learning algorithms won't perform well if your numerical data is all on different scales. This is why we often want to
scale our data. Scaling is the process of limiting the numerical values of a column to a predefined range, usually 0-1. If we were to
scale the below table, we'd end up with:
Prescaled Scaled
1 0
4 0.375
2 0.125
9 1
This uses a method called MinMaxScaling, which scales your values such that 1 represents the max value and 0 represents the
min value in your column. Let's scale our Unix timestamp column.
from sklearn.preprocessing import MinMaxScaler

career_data_scaled = career_data_encoded.copy()
scaler = MinMaxScaler()
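The scaling cell itself isn't shown above; a sketch would look like this (the exact name of the Unix timestamp column is an assumption):

# fit_transform expects a 2D input, hence the double brackets
career_data_scaled[["crawl_timestamp_unix"]] = scaler.fit_transform(career_data_scaled[["crawl_timestamp_unix"]])
career_data_scaled["crawl_timestamp_unix"]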
The exact percentage you need to split your dataset by varies depending on who you ask but most people will say something less
than 33% for your test dataset and the rest for your training. Remember, the more data you train on, broadly speaking, the better
your algorithm will perform, but at the same time you want to have enough test data to fully verify the performance of your algorithm.
Like everything in Machine Learning, a lot of the complexity comes from making calls as to what your tradeoffs will be. Scikit Learn
defaults to a 25% test size.
To fully demonstrate this process, let's pretend that we're trying to predict incomes, so that's our target variable or y.
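A sketch of the split, pretending salary_offered is the income we're after (scikit-learn's default test size of 25% is used here):

from sklearn.model_selection import train_test_split

y = career_data_scaled["salary_offered"]
X = career_data_scaled.drop(columns="salary_offered")

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)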
6.0.0 - Regression
Regression, a supervised technique for predicting a numerical variable. Today we'll be using a dataset from Kaggle to try and predict
the prices of cars. The dataset is available for download here:
https://www.kaggle.com/adityadesai13/used-car-dataset-ford-and-mercedes
1. Unique Values
2. Missing Data
3. Target Column
4. Feature Engineering
5. Data Encoding
6. Scaling
import pandas as pd
data = pd.read_csv("data/bmw.csv")
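Presumably the same unique-value check we used in the last chapter (a sketch):

data.nunique()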
It looks like every column has more than 1 unique value, so there’s nothing to do for this step.
data.isin(missing_values).mean().sort_values(ascending=False) * 100
It looks like every row of every column is full. Awesome, another step we can skip!
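Next comes the target column. That cell isn't shown in these notes; a sketch, assuming the price column is named price:

X = data.drop(columns="price")
y = data["price"]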
# Feature Engineering
# We're going to add a classification that I manually put together
X["model"] = X["model"].str.strip()
As a side note, this is why I picked the BMW dataset. As a luxury automaker, they have fewer models than a company like VW.
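The actual grouping I put together isn't reproduced in these notes; a purely hypothetical sketch of how such a mapping could be applied looks like this:

# Hypothetical grouping of models into broader classes (illustrative values only)
model_groups = {
    "1 Series": "Sedan/Hatchback",
    "3 Series": "Sedan/Hatchback",
    "X1": "SUV",
    "X5": "SUV",
    "i3": "Electric",
}
X["model_group"] = X["model"].map(model_groups).fillna("Other")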
X = pd.get_dummies(X, drop_first=True)
X
6.1.6 - Scaling
We’ll be using MinMaxScaling on our data.
scaler = MinMaxScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
X
This is the same thing you learned in Stats 101. Basically, using multiple variables we can build a linear model that uses
coefficients attached to each variable, plus some constant, to try and predict an outcome. In reality there are several assumptions
that you need to check prior to using a linear model, but I want to get you up and running by learning the syntax and
operation of Scikit-Learn, after which you can start researching how individual algorithms work.
Random Forest
The Random Forest algorithm is a staple of any Data Scientist's toolbox. It essentially works by creating a forest of decision
trees, which are themselves another machine learning algorithm. Because a random forest is an algorithm of algorithms, it's
called an ensemble algorithm.
Boosting
Boosting is a technique that has risen to prominence over the last couple of years for its dominance in Kaggle competitions.
They're lightweight, fast, and easy to implement. They can work a couple of different ways, but essentially boosting
algorithms start off with a simple model, then create another model that predicts some quantity from the previous model; in the
case of the XGBoost algorithm we'll be using, that quantity is the previous model's errors.
As you can see, there are many different metrics we can use to determine how well our model is performing. We'll use Mean
Absolute Error because it's very easy to understand. The Mean Absolute Error, or MAE, is the average amount by which our
predictions are above or below the true value.
Before we implement an algorithm, let's go over the basic syntax we'll follow in order to implement most of our algorithms.
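Here's a sketch of that pattern, using an untuned Random Forest (variable names are illustrative, and the same four steps apply to every other algorithm in this chapter):

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

random_forest = RandomForestRegressor()                       # 1. instantiate the algorithm
random_forest.fit(X_train, y_train)                           # 2. fit it to the training data
random_forest_predictions = random_forest.predict(X_test)     # 3. predict on unseen test data
mean_absolute_error(y_test, random_forest_predictions)        # 4. measure performance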
Let's see if we can tune our Random Forest model to perform better.
We'll be using a method called GridSearchCV . Essentially we feed it a list of hyperparameters we want it to test out with our algorithm
and then it will go through every combination of hyperparameters and run our algorithm on each one. This process can take a while
to complete so I’ve created a significantly smaller list of hyperparameters to test than we would in reality so that you can run this
faster than I did.
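A sketch of that search (the grid below is deliberately tiny and its values are illustrative; rf_random is the fitted search object used in the next cell):

from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10],
    "min_samples_split": [2, 5],
}

rf_random = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid=param_grid,
    scoring="neg_mean_absolute_error",
    cv=3,
    n_jobs=-1,
)
rf_random.fit(X_train, y_train)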
rf_random.best_params_
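We can then train a fresh Random Forest with those best hyperparameters (a sketch; the variable name matches the prediction cell below):

perfect_random_forest = RandomForestRegressor(**rf_random.best_params_, random_state=42)
perfect_random_forest.fit(X_train, y_train)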
perfect_random_forest_predictions = perfect_random_forest.predict(X_test)
mean_absolute_error(y_test, perfect_random_forest_predictions)
It looks like our result is slightly better. Not bad, not amazing.
Mushroom Classification
Safe to eat or deadly poison?
https://www.kaggle.com/uciml/mushroom-classification
Like we did with the regression problem, let's outline the steps we want to follow prior to actually executing on them. It looks like
this dataset only has categorical data, so we'll probably just have to make a bunch of dummy variables. Hopefully our algorithms will
work well with all of the columns of data we'll have to make!
2. Missing data
5. Train-Test Split
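The loading cell isn't shown in these notes; a sketch (the file name is an assumption, and missing_values is the same list we defined back in Chapter 5):

data = pd.read_csv("data/mushrooms.csv")
data.head()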
data.isin(missing_values).mean().sort_values(ascending=False) * 100
X = data.drop(columns="class")
y = data["class"]
X = pd.get_dummies(X, drop_first=True)
X
Logistic Regression
Boosting
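The individual cells aren't all reproduced here; as a sketch, the Logistic Regression baseline looks like this (the Random Forest and boosting models below follow the same fit/predict pattern):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

log_reg_predictions = log_reg.predict(X_test)
accuracy_score(y_test, log_reg_predictions)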
from sklearn.ensemble import RandomForestClassifier

random_forest_class = RandomForestClassifier()
random_forest_class.fit(X_train, y_train)
rf_predictions = random_forest_class.predict(X_test)
accuracy_score(y_test, rf_predictions)
7.2.3 - LightGBM
LightGBM is a Gradient Boosting Model created by Microsoft which is generally considered to be lighter and faster than XGBoost. If
you’re interested in learning more about LightGBM then check out this Kaggle notebook linked below.
https://www.kaggle.com/prashant111/lightgbm-classifier-in-python
We implement this algorithm the same way we’ve been implementing all of our algorithms, by fitting our model to the training data
and then predicting the classes using the newly fitted model.
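A sketch of that (light_gbm_predictions is the name used by the confusion matrix plot below; LGBMClassifier comes from the lightgbm package):

from lightgbm import LGBMClassifier

light_gbm = LGBMClassifier()
light_gbm.fit(X_train, y_train)

light_gbm_predictions = light_gbm.predict(X_test)
accuracy_score(y_test, light_gbm_predictions)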
            Predicted NO   Predicted YES
Actual NO   21             31
Actual YES  41             7
As you can see, it basically maps out all of the predictions along with whether each one was correct or not. We want as many values as
possible on the main diagonal of the table, as these are the correct predictions.
I modified the code from this Medium post to create a confusion matrix plotter:
import numpy as np
import seaborn as sns
from sklearn.metrics import confusion_matrix
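The function definition isn't reproduced in these notes; a sketch of what it might look like (a seaborn heatmap version, not the exact code from the post):

import matplotlib.pyplot as plt

def confusion_matrix_plotter(predictions, y_true):
    # Draw the confusion matrix as an annotated heatmap
    cm = confusion_matrix(y_true, predictions)
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.show()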
confusion_matrix_plotter(light_gbm_predictions, y_test)
True Positive Rate = True Positives / (False Negatives + True Positives)
This is the proportion of actual positive data points that are correctly predicted as positive.
True Negative Rate = True Negatives / (True Negatives + False Positives)
This is the same as the True Positive Rate except with negative values.
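The ROC plot below needs fpr, tpr, and roc_auc. The cell that computes them isn't shown above; a sketch (it assumes the fitted light_gbm model and treats the poisonous class 'p' as the positive label):

from sklearn.metrics import roc_curve, auc

probabilities = light_gbm.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, probabilities, pos_label="p")
roc_auc = auc(fpr, tpr)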
# method I: plt
import matplotlib.pyplot as plt
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([-0.1, 1.1])
plt.ylim([0.0, 1.1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
7.3.3 - F1 Score
This is the harmonic mean of precision and recall, combining the two into a single score (precision is the share of predicted
positives that are actually positive; recall is the same as the True Positive Rate above). A higher F1 score means your model is performing better.
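In code, the F1 score is one call away (a sketch, again treating 'p' as the positive class): F1 = 2 * (precision * recall) / (precision + recall).

from sklearn.metrics import f1_score

f1_score(y_test, light_gbm_predictions, pos_label="p")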