
🤖

Machine Learning Course Notes


1.0.0 - Introduction
1.1.0 - Course purpose
1.2.0 - Deepnote Sponsor Spotlight
1.3.0 - ML in 4 Lines of Code
2.0.0 - ML Basics
3.0.0 - ML in Business
3.1.0 - How to know when to use ML
3.2.0 - Ethics in Machine Learning
4.0.0 - Theory Behind ML
4.1.0 - Holistically Design a Machine Learning Algorithm Using CRISP-DM
4.2.0 - Business Understanding and Data Understanding
4.3.0 - Data Preparation
4.4.0 - Modeling
4.4.1 - Determining Which Model to Use
4.4.2 - Implementing a Model
4.5.0 - Evaluation
5.0.0 - Data Cleaning and Environment Setup
5.1.0 - Setting up an Environment
5.2.0 - Data Cleaning Techniques
5.2.1 - Initial Setup and Imports
5.2.2 - Basic Data Format
5.2.3 - Remove Columns with One Unique Value
5.2.4 - Data Types
5.2.5 - Parsing Dates
5.2.6 - Missing Data
5.2.7 - Select Target Column
5.2.8 - Data Encoding
5.2.9 - Multicollinearity
5.2.10 - Feature Engineering
5.2.11 - Scaling
5.2.12 - Train-Test Split
6.0.0 - Regression
6.1.0 - Data Cleaning: Regression
6.1.1 - Remove Columns with One Unique Value
6.1.2 - Missing Data
6.1.3 - Target Column
6.1.4 - Feature Engineering
6.1.5 - Data Encoding
6.1.6 - Scaling
6.1.7 - Train-Test Split
6.2.0 - Model Selection: Regression

6.3.0 - Model Implementation and Evaluation: Regression
6.3.1 - Linear Regression
6.3.2 - Random Forest
6.3.3 - XGBoost
6.4.0 - Hyperparameter Tuning
6.5.0 - Conclusion: Regression
7.0.0 - Classification Practice
7.1.0 - Data Cleaning: Classification
7.1.1 - Removing Columns with one Unique Value
7.1.2 - Missing Data
7.1.3 - Select Target Column
7.1.4 - Data Encoding
7.1.5 - Train-Test Split
7.2.0 - Model Selection: Classification
7.2.1 - Logistic Regression
7.2.2 - Random Forest Classifier
7.2.3 - LightGBM
7.3.0 - Model Evaluation: Classification
7.3.1 - Confusion Matrix
7.3.2 - Area Under the Curve (AUC)
7.3.3 - F1 Score
7.4.0 - Conclusion: Classification
8.0.0 - Course Conclusion

1.0.0 - Introduction
1.1.0 - Course purpose
Welcome to my Machine Learning Primer course! Whether you’re looking to build out Machine Learning algorithms or want to see how to use Machine Learning in your organization, you’re in the right place. The first 4 chapters will have information for everyone; from chapter 5 onwards we’ll start actually implementing algorithms, so if you’re only looking to learn how ML is used in industry, you can watch until then.
Machine Learning is an ever-evolving subject: there is an infinite amount to learn and a constantly growing body of knowledge. The purpose of this course is to give you a practical understanding of the basics of Machine Learning so you can decide if this material is for you and in what direction you’d like to continue learning. We won’t focus on the math and theory behind Machine Learning; for that I recommend the excellent book Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow, 2nd Edition (affiliate link below).

Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems [Géron, Aurélien]
https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1492032646?crid=18FZPXM3VIMFT&dchild=1&keywords=hands+on+machine+learning+with+scikit-learn+and+tensorflow+2&qid=1620018613&sprefix=hands+on+%2Caps%2C197&sr=8-3&linkCode=ll1&tag=shashankstore-20&linkId=738fc1e1fcd7646120a3b625e9d395db&language=en_US&ref_=as_li_ss_tl

1.2.0 - Deepnote Sponsor Spotlight


Before we get started I want to take a chance to thank this channel's first real sponsor Deepnote! Deepnote is an online notebook
that makes it easier for you and your team to collaborate on data projects to bring better insights to your organization. I'll actually be
running all the code in this video through Deepnote so y'all will get to see how well the tool works. Thanks again to Deepnote for
helping bring this course to life.

1.3.0 - ML in 4 Lines of Code

Although machine learning can sound incredibly complicated, it doesn't have to be. Over the years many libraries have come out
that have drastically simplified the process of creating machine learning algorithms such that you can create a decent model in 5
minutes with 4 lines of code. Don't believe me? Let's try it. I'll build a model that predicts the price of diamonds based on several
factors and tell you which factor is most important.

! pip install pycaret ## This line doesn't count


from pycaret.datasets import get_data # This line doesn't count either
dataset = get_data('diamond')

from pycaret.regression import * # Last uncounted line


exp_reg101 = setup(data = dataset, target = 'Price', session_id=123)
ada = create_model('ada')

plot_model(ada, plot='feature')

2.0.0 - ML Basics
Machine learning (ML) is the study of computer algorithms that can improve automatically through
experience and by the use of data - Wikipedia

What does this mean? Let's think of it from the perspective of a problem. Let's say I work at Nike and am in charge of building a section of their website. When I go through the website and hover over the different menus, they behave differently but deterministically, meaning that if I input a certain command (hover over the menu) then a definite outcome will occur.

But let's say the problem now is to predict which shoes to stock in a location based on a bunch of customer data in a region. Say you have 10,000 dimensions of customer data (things like gender, Instagram likes, favorite sports, average household income... stuff like that); it would be impossible, or at least very tedious, to hard-code rules to determine how to stock a store for 10,000 dimensions of data. We might be able to do it for 30 stores or so, but what happens when the data changes or we need to roll this out nationally? The hard-coded solution is not at all scalable. Instead, we would use a machine learning algorithm to ingest all of this data and then produce a prediction based on the results of maybe 10 hard-coded stores.
Essentially, a Machine Learning algorithm takes in a bunch of inputs in the form of data, and then produces an output that maximizes some target metric (user retention, user happiness, etc.), and it can improve itself as more data is added to the algorithm.

3.0.0 - ML in Business
Machine learning is already used in all aspects of business by companies looking to use data to get closer to their customers or to improve decision making.
Spotify, YouTube, Netflix, or any other tech-first media company will use machine learning algorithms to help improve recommendations for their customers. What song you should listen to next, which artist should be suggested: this is all decided based on a series of parameters that Data Scientists and Machine Learning Engineers are able to figure out using hundreds of billions of rows of data on hundreds of millions of customers. Spotify, for example, has found that the magic number for whether you might like a song or not is around 30 seconds of listening.
Nordstrom, the company I work for, is a major North American fashion retailer and uses Machine Learning to assist our digital stylists in helping customers with their fit.

AI Created Outfits
At Nordstrom, digital stylists help our customers to feel good and look their best by creating outstanding outfits
through a variety of styling experiences: one-on-one virtual styling help, try before you buy with Trunk Club,
personally curated looks, thematic outfit curations, outfits to showcase the versatility of an individual product,
https://medium.com/tech-at-nordstrom/ai-created-outfits-9529300a1af3

3.1.0 - How to know when to use ML


Machine learning is a tool and like any tool should be used for situations it's best equipped to handle. These are software solutions
where coding the rules by which the software should work is too cumbersome or can't be scaled.

You cannot code the rules: This includes things like facial recognition or voice recognition, where very subtle differences between individual faces and voices can render a rules-based approach unusable.

You cannot scale: This includes things like the earlier Nike example. If there are thousands of variables that you need to consider, it might not be possible or economical to hard-code rules, and machine learning might be necessary.

When to Use Machine Learning


It is important to remember that ML is not a solution for every type of problem. There are certain cases where robust solutions can be developed without using ML
techniques. For example, you don't need ML if you can determine a target value by using simple rules, computations, or predetermined steps that can be programmed
without needing any data-driven learning.
https://docs.aws.amazon.com/machine-learning/latest/dg/when-to-use-machine-learning.html

Machine learning might also be a great version 2 for a product that you're creating. If you wanted to create a stock trading bot that
traded stocks based on whether the President of the United States mentioned them positively or negatively in a Tweet, then you
could start with a simple version that just counts the number of "positive" words in the Tweet vs. the number of "negative" words in
the Tweet and makes a decision based on that. Of course, this approach will quickly fail with the many quirks of grammar rendering
this method inaccurate at best. After a version 1 with this simple approach, you could always try version 2 with a proper sentiment
analysis algorithm.
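To make that concrete, a "version 1" might look something like the sketch below; the word lists and the example Tweet are invented purely for illustration.

# A minimal sketch of the word-count "version 1" described above.
# The word lists and the example Tweet are made up for illustration.
positive_words = {"great", "love", "fantastic", "win", "strong"}
negative_words = {"bad", "terrible", "weak", "disaster", "failing"}

def naive_signal(tweet: str) -> str:
    words = [word.strip(".,!?").lower() for word in tweet.split()]
    positives = sum(word in positive_words for word in words)
    negatives = sum(word in negative_words for word in words)
    if positives > negatives:
        return "buy"
    if negatives > positives:
        return "sell"
    return "hold"

print(naive_signal("Fantastic quarter for this company, great products!"))  # buy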

3.2.0 - Ethics in Machine Learning


The Code of Hammurabi was a series of laws from the ancient civilization of Babylonia. It contains roughly 300 laws covering how all aspects of ancient life were to be governed. The full text is linked below:

(c. 1700 B.C.E.) Note: The Code of Hammurabi was a compilation of almost three hundred laws on every aspect of life. Much can be learned both about Mesopotamian
life and ideals through these laws.
http://www.wright.edu/~christopher.oldstone-moore/Hamm.htm

Law 229 is one I find particularly interesting:

If a builder builds a house for a man and does not make its construction sound, and the house which
he has built collapses and causes the death of the owner of the house, the builder shall be put to
death.

Basically, creators are responsible for the soundness of what they create. Now for better or worse, that's not exactly how society
today works, and this is even less true in the realm of software. Machine Learning algorithms have become so ubiquitous that they
form a large portion if not the majority of trading volume on the world's largest stock markets.

The stockmarket is now run by computers, algorithms and passive managers


FIFTY YEARS ago investing was a distinctly human affair. "People would have to take each other out, and
dealers would entertain fund managers, and no one would know what the prices were," says Ray Dalio, who
worked on the trading floor of the New York Stock Exchange ( NYSE) in the early 1970s before founding
https://www.economist.com/briefing/2019/10/05/the-stockmarket-is-now-run-by-computers-algorithms-and-p
assive-managers

Algorithms for facial recognition have been used by law enforcement even when it's debatable how well they can recognize
individuals.

Amazon's Face Recognition Falsely Matched 28 Members of Congress With Mugshots


Amazon's face surveillance technology is the target of growing opposition nationwide, and today, there are 28
more causes for concern. In a test the ACLU recently conducted of the facial recognition tool, called
"Rekognition," the software incorrectly matched 28 members of Congress, identifying them as other people who
https://www.aclu.org/blog/privacy-technology/surveillance-technologies/amazons-face-recognition-falsely-ma
tched-28

All this is to say, consider how your algorithms are trained and how they'll be used as you're building them.

4.0.0 - Theory Behind ML


4.1.0 - Holistically Design a Machine Learning Algorithm Using CRISP-DM
I'm a big-picture guy, and firmly believe that you need to keep a firm eye on the greater impact of your work in order to drive real value. With the ease with which we can deploy ML algorithms these days, it's important to understand that Machine Learning projects are only useful insofar as they serve some real-world purpose. I don't want to teach you to just throw ML at a problem and hope it works. To this end, we'll be using a framework developed in the 1990s called CRISP-DM. It's a cycle we can use to map the lifecycle of many data-centric projects, and it's what I use for almost all of my data projects in my daily life. The name is an acronym that stands for Cross Industry Standard Process for Data Mining.

CRISP-DM:

Business Understanding

Data Understanding

Data Preprocessing

Modeling

Evaluation

Deployment → This is a bit out of the scope of this course

4.2.0 - Business Understanding and Data Understanding
The first step of this process is Business Understanding and Data Understanding. I like to say that this is the step where a lot of ML projects stall out without anyone realizing it. When we talk about Business Understanding, we're referring to the process by which you, as the data expert, sit down and understand the scope of the problem, define the objectives, and then check to see if you have the data necessary to accomplish them.

💡 Remember: You'll very rarely be asked to build a Machine Learning algorithm. You're generally being asked to solve a problem, and ML just happens to be the tool being used to solve it.

You'll notice that there are arrows going back and forth between Business Understanding and Data Understanding. This is because people might request that you solve a problem that you don't actually have the data to solve, which is why it's a back-and-forth between understanding what your business objective is and what data you actually have to solve it with.

In order to keep this course a bit more focused on Machine Learning, I'm going to skip over this section of the process, but I will have a video in the future where I go over it in more detail.

4.3.0 - Data Preparation


Data cleaning can be tedious but is one of the most important parts of the process. There's a saying in the data science community:
Garbage in Garbage Out or GIGO. Preparing your data serves three major functions:

1. Translates all of your data into a format that the ML algorithm can work with

a. Alphabetical characters typically need to be translated into numbers, and dates need to be encoded to be recognized as
dates by the computer

2. Cleans up junk or meaningless values from your dataset so you don't "confuse" your algorithm

3. Adds new information to your dataset so that your algorithm can derive better insights

Although jumping straight into modeling is very tempting, your algorithm's performance will be largely dependent on how well you've
been able to clean your data so don't skimp on this step.

One more thing we'll want to do in the Data Preparation step is to split our data into a 'training' dataset and a 'testing' dataset. The
training dataset is what we give our algorithm so that it can 'learn' from the data. After we've trained our machine learning algorithm
on the training dataset, we then see how it performed vs some of our testing data.

4.4.0 - Modeling
This is generally considered the most fun part. I like to think of Data Preparation as the meat and potatoes of CRISP-DM, and Modeling as the spices: you can't eat spices on their own, but your meal won't taste that interesting without them.

Modeling is where the rubber hits the road for all the work we've done up until now.

4.4.1 - Determining Which Model to Use


The first step in the modeling section of the process is to determine what type of model we'll be using. There are three major
paradigms of Machine Learning algorithms, each of which has a couple of problem types.

Supervised → This is the first paradigm and refers to a problem where we're trying to match a set of data to an expected set of
outcomes. The expected set of outcomes is called our "Target Variable" and is what we're trying to predict. An example would
be trying to classify images as hotdogs or not hotdogs given we have a bunch of labeled images for the algorithm to reference.
Within the Supervised paradigm, we have two problem types:

Regression: A regression problem is one where the data we're matching to is numerical. An example would be predicting
how much different customers might spend given attributes about said customers

Classification: A classification problem is one where we try and classify things into groups instead of predicting numerical
values. The quintessential example of this is spam mail classification. In this example, we're not predicting a numerical
value but rather a category, "spam" or "not spam". Another would be our "hotdog/not hotdog" example

Unsupervised → This is basically the opposite of a supervised algorithm and refers to a problem where we don't have any
labeled data and are telling the algorithm to try and find natural patterns in the data. In other words, we don't have a target
variable. This would be like taking a bunch of customer data and telling an algorithm to find natural groups based on the data.
We can use this to discover patterns in the data.

Clustering: Just like it sounds, clustering is a problem type where we're trying to group data into natural groups using a machine learning algorithm (see the short sketch after this list). A common application of clustering algorithms is in the domain of marketing. By taking a bunch of customer data you can then cluster people into groups (often called archetypes) that can be marketed to in different ways. If you've ever seen an advertisement on Instagram or another social network for a product you were just talking to your friends about, there's probably a clustering algorithm that placed you into a group of people who would also be interested in that product. (Basically, your phone is not listening to you; algorithms are just that intelligent, and you're just that predictable.)

Reinforcement → Reinforcement learning is a form of learning where the algorithm is asked to solve a problem and is
rewarded for certain types of behavior and punished for other types of behavior. We won't be covering reinforcement learning in
this course.
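As mentioned above, here's a minimal sketch of the clustering idea using scikit-learn's KMeans; the customer data and the choice of 2 clusters are made up purely for illustration.

import pandas as pd
from sklearn.cluster import KMeans

# Made-up customer data purely for illustration
customers = pd.DataFrame({
    "annual_spend":    [200, 250, 1800, 1900, 2100, 90],
    "visits_per_year": [4, 5, 30, 28, 35, 2],
})

# Ask KMeans to find 2 natural groups; note there is no target variable involved
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
customers["segment"] = kmeans.fit_predict(customers)
print(customers)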

4.4.2 - Implementing a Model


After we've cleaned our data and selected our model or models that we will use, we need to actually implement the model. This is
where we'll "train" our model. Training refers to the process by which we reveal our training data to the machine learning algorithm
and if it's a supervised algorithm, have it try and match (fit) to the target variable, and if it's an unsupervised algorithm, have it try
and find patterns in the data.

After we've trained our model and taken some initial measurements, we might want to improve its performance through a process
called Hyperparameter Tuning.

What exactly are hyperparameters? To learn this it will help to take a quick look inside a machine learning algorithm. A machine
learning algorithm can be thought of as a complicated math equation with a bunch of different variables (this isn't strictly accurate
but works for the sake of our example). During training, our machine learning algorithm is trying out a bunch of different values for
each of the variables in said equation until it is able to minimize some other equation called a 'cost function'. The values that it's
changing are the 'parameters' of the machine learning algorithm. Ergo, a hyperparameter is a value that is used to control the
overall training process. Hyperparameter tuning is the process by which we iterate through multiple hyperparameters in order to
create the best "learning environment" for our machine learning algorithm.

4.5.0 - Evaluation
After training our algorithm, we'll want to determine how well it performed. To do this, we'll test out the performance of our algorithm
on the test dataset we separated from the rest of our training dataset in the Data Preparation step of this process. We will then
compare the predictive power of our machine learning algorithm on the training set vs the test dataset. There are different
measurements we'll use in order to determine how well our algorithm performs. The point of measuring the performance of our
algorithm against a test dataset is to simulate performance of the algorithm in the real world with a dataset the algorithm has never
seen. We'll use different measures depending on whether we're working with a regression or classification problem. I'll go over the details of each of these methods in the coming chapters where we actually solve one of each of these problems.

Regression

Root Mean Squared Error

Mean Squared Error

Classification

Accuracy Score

Confusion Matrix

Precision/Recall and Harmonic/F1 Score

Receiver Operating Characteristics (ROC) curve
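All of these are available in sklearn.metrics. As a quick preview with made-up numbers (the arrays below are invented purely for illustration; we'll compute these for real later in the course):

import numpy as np
from sklearn.metrics import (mean_squared_error, accuracy_score,
                             confusion_matrix, f1_score, roc_auc_score)

# Regression: made-up true values and predictions
y_true_reg = np.array([100, 150, 200])
y_pred_reg = np.array([110, 140, 210])
mse = mean_squared_error(y_true_reg, y_pred_reg)   # Mean Squared Error
rmse = np.sqrt(mse)                                # Root Mean Squared Error

# Classification: made-up true labels and predictions
y_true_cls = [1, 0, 1, 1, 0]
y_pred_cls = [1, 0, 0, 1, 0]
print(accuracy_score(y_true_cls, y_pred_cls))
print(confusion_matrix(y_true_cls, y_pred_cls))
print(f1_score(y_true_cls, y_pred_cls))
print(roc_auc_score(y_true_cls, y_pred_cls))       # area under the ROC curve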

You might be asking yourself why it's so important to "independently" test the performance of our algorithms. Machine Learning
algorithms often perform better when they are exposed to more data so why are we purposefully preventing the algorithm from
seeing a section of the data? We do this to prevent something called "Overfitting". Overfitting is a phenomenon that occurs when a machine learning algorithm matches the dataset it's trained on too closely, learning its noise and quirks rather than the underlying pattern. This is not good because an overfitted
algorithm can't generalize and therefore won't be able to make predictions on data that it hasn't seen before. "Underfitting" is the
exact opposite concept where an algorithm generalizes too much and can't give us anything nearing an accurate prediction. By
removing some data in the form of our test dataset we can compare the results on the test dataset to ensure we're not overfitting our
data (underfitting is prevented by tuning our hyperparameters while training the machine learning algorithm on the training dataset).

5.0.0 - Data Cleaning and Environment Setup


We're now finally getting to the application section of the course. This is the section where you'll be expected to code up your
results. If you don't know how to code in Python, don't worry, I have a FREE course linked below that has all of the information you'll
need and more in order to get ready for the rest of this course.

Python for Data Analysts and Data Scientists


What YOU need to know to get started using Python to Analyze data. If you have any questions please feel free
to comment below or drop me a message. Download...

https://www.youtube.com/watch?v=sZDgJKI8DAM

5.1.0 - Setting up an Environment


Normally I would run this on my local machine, but to simplify the process of getting off the ground we're going to be using
Deepnote. If you want to get started feel free to use my affiliate link below:

You have been referred!
Sign up with this link to get 20 extra-powerful compute hours

https://deepnote.com/referral?token=4e36c4ca45cb

By signing up for a free account using the affiliate link, you can also help me prove to future channel sponsors that people like the content we’re making here, so I can continue to make courses like this.
Deepnote basically provides an online Jupyter notebook with extra features like collaboration tools, easy integrations with other data
sources, autosaving, easy sharing, and GPUs if you need them.
Once you've created an account, you can access the project with all the code using the link below:

Machine Learning Fundamentals


Managed notebooks for data scientists and researchers.

https://deepnote.com/project/Machine-Learning-Fundamentals-zcnNQ1sgRByLq4s4BYEdVQ/%2FChapter_
4-Data_Cleaning.ipynb

And simply duplicate the notebook.

5.2.0 - Data Cleaning Techniques


For this chapter, we'll be using the Chapter_5-Data_Cleaning.ipynb notebook.

5.2.1 - Initial Setup and Imports


Pandas is the standard library for tabular data manipulation in Python and we'll be using that here.
We'll want to import the Pandas library and then import our csv and take a quick look at it using the .head() method:

import pandas as pd
import numpy as np

career_data = pd.read_csv("data/career_data.csv")
career_data.head()

We'll be working with a Kaggle dataset that I modified slightly for the sake of this course. You can find the original dataset here:

Careerbuilder Job Listing 2020


A full sample of CareerBuilder Dataset from 2020

https://www.kaggle.com/promptcloud/careerbuilder-job-listing-2020/tasks?taskId=3662

5.2.2 - Basic Data Format


There are many different data types that Data Scientists work with; for the sake of simplicity, we'll only focus on normal tabular data types (nothing like geospatial data). The basic format that data going into a machine learning algorithm needs to follow is outlined
below.

Each column should represent a variable or feature of the dataset

In this dataset it looks like we have 30 columns of data

One or a set of columns should be what we're trying to predict

This only applies if you're implementing a supervised algorithm

Each row should represent a new observation (for a certain column or combination of columns, there shouldn't be duplicate
rows)

In this dataset, each observation is identified by the uniq_id column. Each row represents a different job listing.

In the world of data science, we refer to individual columns in a properly formatted dataset as 'features' or 'variables'. You'll hear me using these terms a lot more.

5.2.3 - Remove Columns with One Unique Value


I generally try and remove as much data as possible from a dataset as early as possible. As you work with larger and larger
datasets, you'll run into performance bottlenecks and time constraints unless you find a way to limit the amount of data that you are
working with. To this end, we'll want to remove columns where there is only one unique value as that means that this column has no
predictive power (if every value is the same, then the machine learning algorithm won't be able to predict anything using these
values).
We can determine the number of unique values in our dataset by using this line of code:

career_data.nunique()

As you can see we have a couple of columns with only one unique value, let's try systematically getting rid of them. First we need to
isolate a list of columns with only one unique value:

columns_to_drop = career_data.nunique()[career_data.nunique() == 1].index


columns_to_drop

And finally let's go ahead and drop these columns; remember to use the columns parameter in the drop() method. Additionally, I advocate for making copies of your dataframes at each major step you take, because you'll often want to revise a step, and having a variable that saves one iteration of your dataframe can save you from having to rerun all prior code at once. This might not work if your data is too large though.

career_data_dropped_cols = career_data.copy()
career_data_dropped_cols = career_data_dropped_cols.drop(columns=columns_to_drop)
career_data_dropped_cols

One thing to note, this method doesn't drop columns where all the values are null, we can do that by using the dropna method.

career_data_dropped_cols = career_data_dropped_cols.dropna(axis=1, how='all')
career_data_dropped_cols

Awesome, as you can see we've dropped about 33% of our columns. This will make our data much easier to work with.
As a bit of a sidenote, variability in individual columns is what gives machine learning algorithms their predictive power. There's
information in variability, and processes like Principal Component Analysis (which is beyond the scope of this course) rely on this
variability to eke out information about a dataset.

5.2.4 - Data Types


We want to verify our data types to ensure they've been recognized by Python as the right data types. While there are dozens of
different types of data, at the core, for our purposes, there are only three:

Numerical

These are essentially values that you can perform math on. The "math" distinction is important. For example, the American Postal Code (also known as a Zip Code) is a five-digit number but shouldn't be encoded as a numerical value because you can't perform math on a zip code (90210 is no higher or lower than 01931). Additionally, zip codes in the Northeastern United States usually have leading 0's, which will be removed if you encode them as numerical data types (01931 will turn into 1931).

Dates

Formatting dates is the bane of any data scientist's existence


- Shashank Kalanithi

Dates are a way we can measure temporal, or time based data. It's key that you understand your data well enough to
understand what format a date is in, and use features in Python to parse out those dates.

String

This is a catchall data type for anything that can't quite fit into the other two data types. It refers to data that is stored as text.

To check the data types, we'll use the dtypes attribute.

career_data_dropped_cols.dtypes

As you can see, most of our data uses the object data type. This is generally correlated to a 'string' or text based data type in
Python. You'll notice that there are some columns that should probably be dates and numbers.

Let's take a look at the salary_offered column. It seems to be formatted as a string with dollar signs and commas. This won't mean
much to a machine learning algorithm, so let's convert it to a float (number with a decimal in it). We'll need to take this string and
remove the dollar sign and commas and then convert it to a float like so:

career_data_dtype_change = career_data_dropped_cols.copy()
career_data_dtype_change["salary_offered"] = career_data_dtype_change["salary_offered"].apply(lambda x: x.replace('$', '').replace(',', '')
career_data_dtype_change["salary_offered"]

Let's now focus on the dates and a column that isn't a string but should be one. Towards the end of the dataset, you'll see a postal_code column. In the United States (which is the region this dataset covers) postal codes, more commonly known as 'zip codes', are a five-digit number from '00000' to '99999'. They are generally grouped geographically and can even have leading zeroes in the northeast (many of Boston's zip codes have leading zeroes). When Python reads in something as a numerical data type, it will drop leading zeroes that aren't preceded by a decimal place. For this reason, plus the fact that you can't perform math on a zip code (90210 is no better or worse than 02112), zip codes, although composed entirely of integers, should be recognized as strings.
In order to ensure that the data is imported correctly, we will actually add the information to convert postal_code to a string in the
statement where we import our CSV.

career_data = pd.read_csv("data/career_data.csv", dtype={"postal_code":str})

5.2.5 - Parsing Dates


Next we'll look at parsing dates. Date formatting can be the bane of a data scientist's existence because formats can be very ambiguous depending on where you live and where your dataset is from.

We can parse dates on import using the parse_dates argument in the read_csv method, but I'll be using a different method here since
we have dates in a couple of different formats.
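For reference, parsing a date column directly on import would look something like this (using a column name from our dataset as the example):

# Alternative: have Pandas parse a date column while reading the CSV
career_data = pd.read_csv("data/career_data.csv", parse_dates=["crawl_timestamp"])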
Let's start by taking a quick look at the formats that our date columns have adopted:

career_data_dtype_change[['crawl_timestamp', 'valid_through', 'postdate_yyyymmdd', 'last_expiry_check_date', 'latest_expiry_check_date']]  # plus the remaining date columns in the dataset

As suspected, it looks like the data is coming in a couple of different formats, let's see if we can have Pandas automatically
determine the format of any of these.

pd.to_datetime(career_data_dtype_change["crawl_timestamp"])

It looks like Pandas was not only able to recognize this as a date time, but also recognized the time zone as well. Pretty cool! Let's
create a new dataframe and save this over the old unformatted column.

career_data_dtype_change["crawl_timestamp"] = pd.to_datetime(career_data_dropped_cols["crawl_timestamp"])
career_data_dtype_change

The format we used for the above column will also work for all of the other date columns except one. postdate_yyyymmdd is formatted
the same as the other columns, but because there isn't any separator between the components of the dates, Pandas can't seem to
automatically convert it.
We'll have to take a look at the documentation to figure out how to convert this:

pandas.to_datetime - pandas 1.3.5 documentation


If True, use a cache of unique, converted dates to apply the datetime conversion. May produce significant speed-up when parsing duplicate date strings, especially ones
with timezone offsets. The cache is only used when there are at least 50 values. The presence of out-of-bounds values will render the cache unusable and may slow down
parsing.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html

Looking through the documentation, we'll be using the format parameter like so:

career_data_dtype_change["postdate_yyyymmdd"] = pd.to_datetime(career_data_dtype_change["postdate_yyyymmdd"], format="%Y%m%d")


career_data_dtype_change

5.2.6 - Missing Data


Next we'll want to deal with missing data. You could make an entire chapter on just dealing with missing data so for the sake of
brevity we'll just go over some basic techniques and I'll release a video on more advanced techniques in the future. First we'll need
to determine the percentage of missing values in our dataset per column. Missing data can come in many forms, let's create a list of
missing data values and then look for columns with the most missing data values as a percentage of all of their values.

missing_values = [np.nan, "", " ", None]

career_data_dtype_change.isin(missing_values).mean().sort_values(ascending=False) * 100

It looks like some columns have a bunch of missing values. When this many values are missing, it is important to determine what
the cause of the missing data might be and if there is any way to obtain said missing data.

When you can't find a way to bring in more data, one solution to a bunch of missing data is to just drop it; that's the first technique we'll be using. Let's drop any columns where more than 86% of the data is missing:

columns_to_drop = career_data_dtype_change.isin(missing_values).mean()[(career_data_dtype_change.isin(missing_values).mean()) > 0.86].index

career_data_missing_values = career_data_dtype_change.copy()
career_data_missing_values = career_data_missing_values.drop(columns=columns_to_drop)
career_data_missing_values

Imputation
When you have too much missing data, as in the case of the columns we just dropped, you might not have the option to keep said data.
When less data is missing, there are a couple of tricks you can use in order to keep a larger dataset. In this course, we'll just go over
the most basic one: Imputation. Imputation is simply the process of replacing missing values in a dataset with another value. The
most common values to impute are the mean, median, or mode of the column in question. By doing this, you can keep the column in
question while also putting a relatively good estimate of what the value would have been in place of any missing values.
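For numerical columns, scikit-learn's SimpleImputer can do mean or median imputation for you. Here's a minimal sketch on a toy column (the toy dataframe is just for illustration; we'll handle our actual missing column differently below):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy example: fill missing numerical values with the column's median
df = pd.DataFrame({"salary": [50000, np.nan, 62000, 58000, np.nan]})
imputer = SimpleImputer(strategy="median")
df[["salary"]] = imputer.fit_transform(df[["salary"]])
print(df)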

If we run the following script again, we'll see that we have three columns with missing information.

missing_values = [np.nan, "", " ", None]

career_data_missing_values.isin(missing_values).mean().sort_values(ascending=False) * 100

Let's focus on the inferred_city column. You'll see that we have quite a few cities here, and since this is a text-based column, using a median or mean value for imputation doesn't make much sense. That leaves us with using the mode, but blindly using the mode of the entire column when we can probably make a more educated guess is not a great idea. Remember GIGO, garbage in garbage out; we need to be very careful about what we put into our algorithms. It looks like we have a column called inferred_state. Given that there's a correlation between the state and city columns, we can take the mode city of each state and map it over the missing values in the inferred_city column.

inferred_city_mapping = career_data_missing_values.groupby(['inferred_state'])["inferred_city"].agg(lambda x: x.value_counts().index[0]).to_dict()


inferred_city_mapping

Now that we have a dictionary of how we want to map missing values in the inferred_city column based on the inferred_state column, let's apply this to any missing values that we might encounter.

career_data_missing_values["inferred_city"] = career_data_missing_values["inferred_city"].fillna(career_data_missing_values["inferred_state
career_data_missing_values

And that's one way to do basic imputation.

5.2.7 - Select Target Column


If we have a problem where we'll be using Supervised learning, then we'll want to split out our target variables from our predictor
variables. This is relatively simple as we'll just create a new series with our target variable and call it y and then set all other
columns in one dataframe as our X .
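In code, and assuming for the sake of this example that salary_offered is the target we want to predict, that would look something like this:

# Split the target column out from the predictor columns
# (salary_offered is assumed as the target for this example)
y = career_data_missing_values["salary_offered"]
X = career_data_missing_values.drop(columns="salary_offered")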

5.2.8 - Data Encoding


Many machine learning algorithms require all of their inputs and outputs to be numeric. How do we use categorical data if all data
has to be numeric? We systematically turn those categorical values into numeric values through a process called 'encoding'.

Label Encoding

Label Encoding refers to the practice of simply assigning an arbitrary number to each category in a column. This is the simplest form
of encoding but comes with a major drawback in that it imposes 'Ordinality' on your data. Say you have three cities in a column and
you encode them like so:

Houston → 1

Dallas → 2

Austin → 3

When your machine learning algorithm sees this it might assume that there is some relation between the numbers in this column
that doesn't actually exist (Houston is better than Austin because it's number one or Dallas is 2 times Houston because it's a 2). It is
for this reason that we'll generally want to avoid Label Encoding. For practice let's try and implement this on our inferred_state column.

career_data_encoded = career_data_missing_values.copy()

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

career_data_encoded["inferred_state"] = le.fit_transform(career_data_encoded["inferred_state"])

career_data_encoded["inferred_state"]

One Hot Encoding


One Hot Encoding avoids the above problem by creating a new column for every category in a feature and then assigning each row
a 1 or 0 depending on whether that value is present or not.

index    City
1        Dallas
2        Houston
3        Dallas
4        Dallas
5        Austin

For example, the above table would be converted to:

index    City_Dallas    City_Houston    City_Austin
1        1              0               0
2        0              1               0
3        1              0               0
4        1              0               0
5        0              0               1

As you can see we have now expressed the 'City' column in three different columns called dummy variables, one for each value of city. This helps avoid the ordinality problem we had earlier. The problem with One Hot Encoding, though, is that if you use it on a column with a lot of categories, it can lead to an explosion of features, which can really slow down your processing time and make running a machine learning algorithm, or even performing basic operations on your data, very hard. We can perform this operation on our inferred_state data now by using the get_dummies method from Pandas.

pd.get_dummies(career_data_encoded["inferred_state"], prefix="state")

Dummy Variable Trap

Another problem with One Hot encoding is the Dummy Variable Trap. When your machine learning algorithm looks through the
three 'City' columns that have been created, if it reads any two of them, it'll know what value is in the third. This ability for certain
columns to predict the values in another column is called multicollinearity and is a big no-no in the world of machine learning. Your
algorithm is supposed to predict the value of your target column or cluster your data using your predictor variables; if columns are
self-predicting, then the algorithm won't be able to eke out the desired pattern from the accidental one you created in your dummy
variables. The simple solution to this is to just drop one of the dummy variables you created. For the above example, we can add in
the drop_first parameter to remove the first dummy variable that's created.

pd.get_dummies(career_data_encoded["inferred_state"], prefix="state_", drop_first=True)

Hash Encoding

Hash Encoding is one method to help deal with the explosion of features that can result from One Hot Encoding. It's out of scope for
this course but I wanted to bring your attention to it. Essentially, you take all of your categories and input them into a hash function
that will reduce the number of categories to something more manageable. As you can imagine, taking 100 categories and mapping
them to a smaller number of categories can lead to two categories having the same value, something called a 'collision'. It's not a magic bullet, but it is one way to deal with the explosion of features that can result from One Hot Encoding.
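For the curious, here's a minimal sketch of the idea using scikit-learn's FeatureHasher (the state values and the choice of 8 output columns are arbitrary examples, and this isn't something we'll apply to our career dataset):

from sklearn.feature_extraction import FeatureHasher

# Hash each category into a fixed number of columns (8 here, chosen arbitrarily)
hasher = FeatureHasher(n_features=8, input_type="string")
states = [["Texas"], ["California"], ["New York"], ["Texas"]]
hashed = hasher.transform(states)
print(hashed.toarray())  # 4 rows x 8 columns, no matter how many distinct categories exist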

5.2.9 - Multicollinearity
Multicollinearity refers to multiple features being highly correlated to one another in a model. Like we mentioned earlier this is
something we are trying to avoid because it makes it difficult for the model to determine which variable is affecting the target
variable. There are a couple of ways to detect and limit multicollinearity, with the VIF, or Variance Inflation Factor, probably being the most popular.

10.7 - Detecting Multicollinearity Using Variance Inflation Factors


Okay, now that we know the effects that multicollinearity can have on our regression analyses and subsequent
conclusions, how do we tell when it exists? That is, how can we tell if multicollinearity is present in our data?

https://online.stat.psu.edu/stat462/node/180/

The above source has probably the best explanation of VIF that I've seen myself.

Like Hash Encoding, this is also something that would be beyond the scope of an ML primer course, but I'd like you to be aware of
it.
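For reference only, here's a minimal sketch of computing VIF with statsmodels on a made-up numerical dataframe (the columns are invented for illustration, not taken from our career dataset):

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Made-up numerical features purely for illustration
features = pd.DataFrame({
    "height_cm": [170, 180, 165, 175, 190],
    "height_in": [66.9, 70.9, 65.0, 68.9, 74.8],  # nearly a duplicate of height_cm, so its VIF should be huge
    "weight_kg": [65, 85, 55, 75, 95],
})

# One VIF value per feature; values above roughly 5-10 are commonly treated as a warning sign
vif = pd.Series(
    [variance_inflation_factor(features.values, i) for i in range(features.shape[1])],
    index=features.columns,
)
print(vif)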

5.2.10 - Feature Engineering


Feature engineering is a term you'll hear thrown around a lot and basically refers to the process of combining domain knowledge
with mathematical knowledge to transform features in your dataset to be more informative.

A great example: if you were working with a dataset of car data where you're trying to predict the price of a car based on some engine metrics, you could use domain knowledge to determine that "displacement" (how much air is displaced by an engine) is correlated with price. You could then calculate displacement for each row of data based on other information you might have. The link below on Kaggle has code written out that explains this exact process, and a short sketch of the idea follows after the link.

Creating Features
Explore and run machine learning code with Kaggle Notebooks | Using data from FE Course Data

https://www.kaggle.com/ryanholbrook/creating-features?scriptVersionId=78174956&cellId=6
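To make that concrete, here's a hypothetical sketch: the column names and numbers are invented, but if a car dataset had bore, stroke, and cylinder-count columns, displacement could be derived from them with the standard engine formula.

import numpy as np
import pandas as pd

# Hypothetical engine data; the column names and values are made up for illustration
cars = pd.DataFrame({
    "bore_cm":   [8.4, 9.5, 7.6],
    "stroke_cm": [8.2, 9.2, 7.9],
    "cylinders": [4, 8, 4],
})

# Displacement = cross-sectional area of one cylinder * stroke * number of cylinders
cars["displacement_cc"] = (np.pi / 4) * cars["bore_cm"] ** 2 * cars["stroke_cm"] * cars["cylinders"]
cars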

5.2.11 - Scaling
Many machine learning algorithms won't perform well if your numerical data is all on different scales. This is why we often want to
scale our data. Scaling is the process of limiting the numerical values of a column to a predefined range, usually 0-1. If we were to
scale the below table, we'd end up with:

Prescaled    Scaled
1            0
4            0.375
5            0.5
2            0.125
9            1

This uses a method called MinMaxScaling, which scales your values such that '1' represents the max value and '0' represents the min value in your column. Let's scale our Unix timestamp column.

from sklearn.preprocessing import MinMaxScaler

career_data_scaled = career_data_encoded.copy()

scaler = MinMaxScaler()

# df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)


career_data_scaled["post_date_unix_time"] = scaler.fit_transform(career_data_scaled[["post_date_unix_time"]])
career_data_scaled

5.2.12 - Train-Test Split


After we perform all of our data transformations, it's time to split our data into a training and testing dataset. Remember, the training
dataset is what we're training our algorithms on, and we test its performance on the testing dataset.

The exact percentage you need to split your dataset by varies depending on who you ask but most people will say something less
than 33% for your test dataset and the rest for your training. Remember, the more data you train on, broadly speaking, the better
your algorithm will perform, but at the same time you want to have enough test data to fully verify the performance of your algorithm.
Like everything in Machine Learning, a lot of the complexity comes from making calls as to what your tradeoffs will be. Scikit Learn
defaults to a 25% test size.

To fully demonstrate this process, let's pretend that we're trying to predict incomes, so that's our target variable or y.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(career_data_scaled.drop(columns="salary_offered"), career_data_scaled["salary_offered"])

6.0.0 - Regression
Regression is a supervised technique for predicting a numerical variable. Today we'll be using a dataset from Kaggle to try and predict
the prices of cars. The dataset is available for download here:

100,000 UK Used Car Data set


100,000 scraped used car listings, cleaned and split into car make.

https://www.kaggle.com/adityadesai13/used-car-dataset-ford-and-mercedes

6.1.0 - Data Cleaning: Regression


As I mentioned earlier, data cleaning is one of the most important steps in machine learning. Let's outline the basic process that we'll
be following:

1. Remove Columns with One Unique Value

2. Missing Data

3. Target Column

4. Feature Engineering

5. Data Encoding

6. Scaling

7. Train-Test Split

6.1.1 - Remove Columns with One Unique Value


First let’s check how many unique values exist per column:

import pandas as pd
data = pd.read_csv("data/bmw.csv")

# Unique Values per Column


data.nunique()

It looks like every column has more than 1 unique value, so there’s nothing to do for this step.

6.1.2 - Missing Data


Next let’s check and see which columns have missing data in them:

import numpy as np

missing_values = [np.nan, "", " ", None]

data.isin(missing_values).mean().sort_values(ascending=False) * 100

It looks like every row of every column is full. Awesome, another step we can skip!

6.1.3 - Target Column


We want to predict the price column so let’s pull that out:

# Separating out the Target Column


X = data.drop(columns="price")
y = data["price"]

6.1.4 - Feature Engineering


If you remember, earlier we talked about feature engineering and how it’s essentially using domain knowledge to add or enhance the features in your dataset. I’ve taken the liberty of classifying every value in the model column as a car type (I’m a car guy so this was pretty easy). Let’s map this dictionary to a new column.

car_type = {'5 Series':'sedan',


'6 Series':'coupe',
'1 Series':'coupe',
'7 Series':'sedan',
'2 Series':'coupe',
'4 Series':'coupe',
'X3':'suv',
'3 Series':'sedan',
'X5':'suv',
'X4':'suv',
'i3':'electric',
'X1':'suv',
'M4':'sports',
'X2':'suv',
'X6':'suv',
'8 Series':'coupe',
'Z4':'convertible',
'X7':'suv',
'M5':'sports',
'i8':'electric',
'M2':'sports',
'M3':'sports',
'M6':'sports',
'Z3':'convertible'}

# Feature Engineering
# We're going to add a classification that I manually put together
X["model"] = X["model"].str.strip()

X["car_type"] = X["model"].map(car_type)
X

As a side note, this is why I picked the BMW dataset. As a luxury automaker, they have fewer models than a company like VW.

6.1.5 - Data Encoding


Now let’s encode our data. We’ll be using OneHotEncoding.

X = pd.get_dummies(X, drop_first=True)
X

6.1.6 - Scaling
We’ll be using MinMaxScaling on our data.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
X

6.1.7 - Train-Test Split


And finally, let’s split our dataset into a training dataset and a testing dataset:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

6.2.0 - Model Selection: Regression


In this course I won't be going in depth as to how individual algorithms work as I'll just release separate videos about that. The idea
with this course is to give you the ammunition you need in order to get started with machine learning and to that end I believe it's
more important to understand the overall process vs. how individual algorithms work.
To illustrate how different algorithms work, we'll try three different algorithms:

Multiple Linear Regression

This is the same thing you learned in Stats 101. Basically, using multiple variables we can build a linear model that uses
coefficients attached to each variable plus some constant to try and predict an outcome. In reality there are several factors
that you need to check prior to using a linear model, but I want to get you up and running with the syntax and operation of scikit-learn, after which you can start researching how individual algorithms work.

Random Forest

The Random Forest algorithm is a staple of any Data Scientist's toolbox. It essentially works by creating a forest of decision
trees, which are themselves another machine learning algorithm. Because a random forest is an algorithm of algorithms, it's
called an ensemble algorithm.

Boosting

Boosting is a technique that has risen to prominence over the last couple of years for its dominance in Kaggle competitions. Boosting algorithms are lightweight, fast, and easy to implement. They can work a couple of different ways, but essentially they start off with a simple model, then create another model that predicts a metric from the previous model; in the case of the XGBoost algorithm we'll be using, that metric is the errors.

6.3.0 - Model Implementation and Evaluation: Regression

Before we implement our algorithms, it's important to determine which metric we'll use to evaluate the accuracy of these models.

3.3. Metrics and scoring: quantifying the quality of predictions


Model selection and evaluation using tools, such as model_selection.GridSearchCV
andmodel_selection.cross_val_score , take a scoring parameter that controls what metric they apply to the
estimators evaluated. For the most common use cases, you can designate a scorer object with the scoring
https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics

As you can see there are many different metrics we can use to determine how well our model is performing. We'll use Mean Absolute Error because it's very easy to understand: the Mean Absolute Error, or MAE, is the average amount by which our predictions fall above or below the true values.

Before we implement an algorithm, let's go over the basic syntax we'll follow in order to implement most of our algorithms.

### Import our Machine Learning Algorithm


### Import our metric

# Create a model object


# Fit the model object to our data (this is the training phase)

# Create predictions with your newly trained model


# Measure the efficacy of your algorithm using your metric

6.3.1 - Linear Regression


Let's now implement our Linear Regression algorithm. This one will be pretty simple and easy to implement.

### Import our Machine Learning Algorithm


from sklearn.linear_model import LinearRegression
### Import our metric
from sklearn.metrics import mean_absolute_error

# Create a model object


linear_regressor = LinearRegression()

# Fit the object to our data (this is the training phase)


linear_regressor.fit(X_train, y_train)

# Create predictions with your newly trained model


linear_predictions = linear_regressor.predict(X_test)

# Measure the efficacy of your algorithm using your metric


mean_absolute_error(y_test, linear_predictions)

6.3.2 - Random Forest

### Import our Machine Learning Algorithm


from sklearn.ensemble import RandomForestRegressor
### Import our metric
from sklearn.metrics import mean_absolute_error

# Create a model object


random_forest_regressor = RandomForestRegressor(n_estimators=1000)

# Fit the object to our data (this is the training phase)


random_forest_regressor.fit(X_train, y_train)

# Create predictions with your newly trained model


random_forest_predictions = random_forest_regressor.predict(X_test)

# Measure the efficacy of your algorithm using your metric


mean_absolute_error(y_test, random_forest_predictions)

6.3.3 - XGBoost

! pip install xgboost

### Import our Machine Learning Algorithm


from xgboost import XGBRegressor
### Import our metric
from sklearn.metrics import mean_absolute_error

# Create a model object


boost_model = XGBRegressor()

# Fit the object to our data (this is the training phase)


boost_model.fit(X_train, y_train)

# Create predictions with your newly trained model


boost_predictions = boost_model.predict(X_test)

# Measure the efficacy of your algorithm using your metric


mean_absolute_error(y_test, boost_predictions)

Let's see if we can tune our Random Forest model to perform better.

6.4.0 - Hyperparameter Tuning


Great job! You've officially run your first machine learning algorithms, and not just simple ones, but advanced ones like XGBoost. Of course, the skill of a Data Scientist shows not just in the implementation of the algorithms, but in the tuning of the algorithms as well. If you remember from before, hyperparameters are the parameters that control the learning of the algorithm, basically how it gets trained. We'll be using this excellent guide from Will Koehrsen on hyperparameter tuning.

Hyperparameter Tuning the Random Forest in Python


I have included Python code in this article where it is most instructive. Full code and data to follow along can be
found on the project Github page. The best way to think about hyperparameters is like the settings of an
algorithm that can be adjusted to optimize performance, just as we might turn the knobs of an AM radio to get a
https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2a
a77dd74

We'll be using a method called GridSearchCV . Essentially we feed it a list of hyperparameters we want it to test out with our algorithm
and then it will go through every combination of hyperparameters and run our algorithm on each one. This process can take a while
to complete so I’ve created a significantly smaller list of hyperparameters to test than we would in reality so that you can run this
faster than I did.

from sklearn.model_selection import GridSearchCV


# https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74

# Number of trees in the random forest
n_estimators = [1500, 1600]


# Number of features to consider at every split
max_features = ['auto']
# Maximum number of levels in tree
max_depth = [80, 90]
# Minimum number of samples required to split a node
min_samples_split = [5]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1]
# Method of selecting samples for training each tree
bootstrap = [True]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf,
'bootstrap': bootstrap}
random_grid

# Use the parameter grid to search for the best hyperparameters
# First create the base model to tune
rf = RandomForestRegressor()
# Grid search of parameters, using 3-fold cross validation,
# trying every combination in the grid and using all available cores
rf_random = GridSearchCV(estimator = rf, param_grid = random_grid, cv = 3, verbose=2, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train, y_train)

We can then get a list of the best hyperparameters using this:

rf_random.best_params_
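
Since GridSearchCV refits the model with the best combination on the full training set by default ( refit=True ), we could also use that fitted model directly instead of re-typing the parameters. A minimal sketch:

from sklearn.metrics import mean_absolute_error

# The best model, already refit on X_train with the winning hyperparameters
best_rf = rf_random.best_estimator_
mean_absolute_error(y_test, best_rf.predict(X_test))

Below, we rebuild the model explicitly with those best values so you can see them spelled out.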

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

perfect_random_forest = RandomForestRegressor(n_estimators=1600, min_samples_split=5, min_samples_leaf=1,
                                              max_features='auto', max_depth=90, bootstrap=True)
perfect_random_forest.fit(X_train, y_train)

perfect_random_forest_predictions = perfect_random_forest.predict(X_test)

mean_absolute_error(y_test, perfect_random_forest_predictions)

This will take about 15 minutes to run.

It looks like our result is slightly better; not bad, but not amazing either.

6.5.0 - Conclusion: Regression


Congratulations, you've run through a major part of the machine learning process. In a real-life situation you'd also want to get your results into a presentable or deployable format, but that could happen in so many ways that it wouldn't make sense to cover them all in this course. Let's try our hand at a classification problem now.

7.0.0 - Classification Practice


Classification problems are the other major type of Supervised Learning task. This is a task where you need to classify data into different groups. Things can get quite complicated when items belong to more than one group, so we'll be sticking to the basics of classifying things into just two groups, otherwise known as Binary Classification.

7.1.0 - Data Cleaning: Classification


For our classification problem we’ll be working with a mushroom dataset. This dataset has a bunch of information on mushrooms
which we must use to determine if they’re edible or not. The original dataset can be found here:

Mushroom Classification
Safe to eat or deadly poison?

https://www.kaggle.com/uciml/mushroom-classification

Like we did with the regression problem, let's outline the steps we want to follow before actually executing them. It looks like this dataset only has categorical data, so we'll probably just have to make a bunch of dummy variables. Hopefully our algorithms will work well with all of the columns we'll end up creating!

1. Remove columns with one unique value

2. Missing data

3. Select target column



4. Data Encoding

5. Train-Test Split

7.1.1 - Removing Columns with one Unique Value


In this dataset we have one column with only one unique value. To remove it, we can use the code below.

# Find the columns with only one unique value and drop them
data = data.drop(columns=data.nunique()[data.nunique() == 1].index)

7.1.2 - Missing Data


We’ll check for missing data the same way we did earlier in the course for the regression problems.

missing_values = [np.nan, "", " ", None]

data.isin(missing_values).mean().sort_values(ascending=False) * 100

It looks like we don’t have any missing values.

7.1.3 - Select Target Column


Our target column that we’re trying to predict is class . Let’s separate it out from the rest of the columns.

X = data.drop(columns="class")
y = data["class"]

7.1.4 - Data Encoding


This dataset is almost entirely categorical variables, but luckily it seems that there aren’t too many unique values per column. Let’s
also make sure to avoid the Dummy Variable Trap by using the drop_first parameter.

X = pd.get_dummies(X, drop_first=True)
X
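
If you want to see exactly what drop_first does, here's a tiny toy example (the color column and its values are made up purely for illustration):

import pandas as pd

toy = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})
# Without drop_first we get one column per category: color_blue, color_green, color_red
pd.get_dummies(toy)
# With drop_first the first category (blue) is dropped, so a row of all zeros means "blue";
# this removes the redundant column behind the Dummy Variable Trap
pd.get_dummies(toy, drop_first=True)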

7.1.5 - Train-Test Split


And finally, let’s do our train-test split.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
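
One optional tweak, not shown above: passing random_state makes the split reproducible, and stratify=y keeps the proportion of edible and poisonous mushrooms the same in both the train and test sets. A sketch, where 42 is just an arbitrary seed:

# Reproducible, stratified version of the same split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)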

7.2.0 - Model Selection: Classification


As I mentioned earlier, we’ll just do basic overviews of algorithms here but I’m more concerned with you learning the framework
upon which you can implement these algorithms. For this task, we’ll also select three algorithms:

Logistic Regression

Random Forest Classifier

Boosting



3.3. Metrics and scoring: quantifying the quality of predictions

Model selection and evaluation tools such as model_selection.GridSearchCV and model_selection.cross_val_score take a scoring parameter that controls what metric they apply to the estimators being evaluated.
https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics

7.2.1 - Logistic Regression


Although it’s called “Logistic Regression”, this is actually an algorithm for classification problems. The implementation will be very similar to our regression algorithms, with the main change being the evaluation metric we decide to use.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# If you see a convergence warning, try LogisticRegression(max_iter=1000)
log_reg = LogisticRegression().fit(X_train, y_train)

log_predictions = log_reg.predict(X_test)
accuracy_score(y_test, log_predictions)

Wow, it looks like we have a very good algorithm.
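
An accuracy this high is worth a quick sanity check. One optional way to do that is cross-validation, which trains and scores the model on several different splits of the data; max_iter=1000 here is just an assumption to avoid convergence warnings:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation; each score is the accuracy on a held-out fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
scores.mean()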

7.2.2 - Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

random_forest_class = RandomForestClassifier()
random_forest_class.fit(X_train, y_train)
# Predict with the Random Forest model (not the Logistic Regression model)
rf_predictions = random_forest_class.predict(X_test)
accuracy_score(y_test, rf_predictions)

7.2.3 - LightGBM
LightGBM is a gradient boosting framework created by Microsoft that is generally considered to be lighter and faster than XGBoost. If you're interested in learning more about LightGBM, check out the Kaggle notebook linked below.

LightGBM Classifier in Python


Explore and run machine learning code with Kaggle Notebooks | Using data from Breast Cancer Prediction Dataset

https://www.kaggle.com/prashant111/lightgbm-classifier-in-python

! pip install lightgbm

import lightgbm as lgb


clf = lgb.LGBMClassifier()
clf.fit(X_train, y_train)
light_gbm_predictions = clf.predict(X_test)
accuracy_score(light_gbm_predictions, y_test)

We implement this algorithm the same way we’ve been implementing all of our algorithms, by fitting our model to the training data
and then predicting the classes using the newly fitted model.

7.3.0 - Model Evaluation: Classification


Classification presents some unique problems when it comes to evaluating algorithms. We've been using the accuracy score to judge the performance of our machine learning algorithms, but this actually isn't always the best metric to use. Intuitively, you'd think accuracy is always what we're striving for with classification, but take the example of cancer diagnosis. If 99% of the population doesn't have cancer, then an algorithm that always predicts "no cancer" would be 99% accurate, even though it's wrong in 100% of the cases that matter.
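
Here's a small, hypothetical sketch of that accuracy trap, using made-up labels where 1 means "has cancer" and a "model" that always predicts 0:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 1000 made-up patients, 1% of whom actually have cancer
y_true = np.array([1] * 10 + [0] * 990)
y_always_no = np.zeros(1000, dtype=int)  # a "model" that always predicts "no cancer"

accuracy_score(y_true, y_always_no)  # 0.99, looks great
recall_score(y_true, y_always_no)    # 0.0, it misses every real case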



Metrics to Evaluate your Machine Learning Algorithm

Evaluating your machine learning algorithm is an essential part of any project. Your model may give you satisfying results when evaluated using one metric, say accuracy_score, but may give poor results when evaluated against other metrics such as logarithmic_loss.
https://towardsdatascience.com/metrics-to-evaluate-your-machine-learning-algorithm-f10ba6e38234

7.3.1 - Confusion Matrix


The confusion matrix is the most obvious answer to the question of wanting to know the true efficacy of a classification algorithm. It
looks like this:

n = 100        Predicted NO    Predicted YES

Actual NO      21              31

Actual YES     41              7

As you can see, it maps out all of the predictions along with whether each one was correct. We want as many values as possible on the main diagonal of the table, since those are the correct predictions.

I modified the code from this Medium post to create a confusion matrix plotter:

Confusion Matrix Visualization


How to add a label and percentage to a confusion matrix plotted using a Seaborn heatmap. Plus some additional
options. One great tool for evaluating the behavior and understanding the effectiveness of a binary or categorical
classifier is the Confusion Matrix.
https://medium.com/@dtuk81/confusion-matrix-visualization-fc31e3f30fea

import numpy as np
import seaborn as sns
from sklearn.metrics import confusion_matrix

def confusion_matrix_plotter(predictions, actuals):
    # Build the matrix from the function's arguments rather than hard-coded variables
    cf_matrix = confusion_matrix(actuals, predictions)
    group_names = ['True Neg', 'False Pos', 'False Neg', 'True Pos']
    group_counts = ["{0:0.0f}".format(value) for value in
                    cf_matrix.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in
                         cf_matrix.flatten() / np.sum(cf_matrix)]
    labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
              zip(group_names, group_counts, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    return sns.heatmap(cf_matrix, annot=labels, fmt="", cmap='Blues')

confusion_matrix_plotter(light_gbm_predictions, y_test)
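
If you just want the raw counts rather than a plot, sklearn's confusion_matrix can be unpacked directly. A small sketch; because the labels sort as ['e', 'p'], 'e' acts as the negative class and 'p' (poisonous) as the positive class:

from sklearn.metrics import confusion_matrix

# ravel() flattens the 2x2 matrix into (tn, fp, fn, tp)
tn, fp, fn, tp = confusion_matrix(y_test, light_gbm_predictions).ravel()
print(tn, fp, fn, tp)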

7.3.2 - Area Under the Curve (AUC)


True Positive Rate (Sensitivity)

$$\text{True Positive Rate} = \frac{\text{True Positive}}{\text{False Negative} + \text{True Positive}}$$

This is the proportion of actual positive data points that are correctly predicted as positive.

True Negative Rate (Specificity)

$$\text{True Negative Rate} = \frac{\text{True Negative}}{\text{True Negative} + \text{False Positive}}$$

This is the same idea as the True Positive Rate, except for the negative class: the proportion of actual negative points that are correctly predicted as negative.

False Positive Rate



$$\text{False Positive Rate} = \frac{\text{False Positive}}{\text{True Negative} + \text{False Positive}}$$

This is the proportion of actual negative points that are incorrectly predicted as positive.


Now we can plot the True Positive Rate against the False Positive Rate at different classification thresholds. This is called the ROC curve, and the area under it (AUC) represents the ability of your algorithm to separate the classes from one another.

import sklearn.metrics as metrics

# Calculate the fpr and tpr for all thresholds of the classification
probs = random_forest_class.predict_proba(X_test)
preds = probs[:, 1]  # probability of the second class ('p' = poisonous), which we treat as positive
fpr, tpr, threshold = metrics.roc_curve(y_test, preds, pos_label='p')
roc_auc = metrics.auc(fpr, tpr)

# Plot the ROC curve with matplotlib
import matplotlib.pyplot as plt
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([-0.1, 1.1])
plt.ylim([0.0, 1.1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
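
If you just want the number without the plot, scikit-learn can compute the AUC directly. A quick sketch; in the binary case roc_auc_score expects the score of the class with the greater label, which here is 'p':

from sklearn.metrics import roc_auc_score

# preds is the predicted probability of the 'p' class from above
roc_auc_score(y_test, preds)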

7.3.3 - F1 Score
This is the harmonic mean of precision and recall, so it balances how many of the predicted positives were correct (precision) against how many of the actual positives were found (recall). A higher F1 score means your model is performing better.

from sklearn.metrics import f1_score

# Our labels are strings, so tell sklearn that 'p' (poisonous) is the positive class
f1_score(y_true=y_test, y_pred=light_gbm_predictions, pos_label='p')
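
To see precision, recall, and F1 for both classes at once, scikit-learn's classification_report is a handy summary:

from sklearn.metrics import classification_report

# Precision, recall, F1, and support for 'e' and 'p'
print(classification_report(y_test, light_gbm_predictions))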

7.4.0 - Conclusion: Classification


As you can probably tell, Classification problems have unique challenges that aren’t faced in regression problems. If you want to
study further, then look into multi-class classification problems as the way you design algorithms will be slightly different for those.

8.0.0 - Course Conclusion


You did it! You’ve reached the end of the course. I hope you learned a bunch and also remember that this is just the beginning of the
beginning of your journey into Machine Learning. There’s a lot more content to cover and I’ll be adding more videos on my channel
to cover it so stay tuned!

