1 - Course Slides - Data Science and ML Fundamentals
Lester has led a startup fintech data science team and has held VP roles in quantitative analysis and strategy at Wells Fargo. He has held various data consulting roles in other Fortune 500 companies and is an expert in the application of machine learning techniques.
Distinguish data science from business intelligence
Outline the data science and machine learning process
Describe basic data science terms, roles, skills, and applications
Data science is all about creating data-driven insights that help us deal with uncertainty.
Start-up: What proportion of our crowd-sourced investors invested $200 or less?
Bank: What proportion of our loans were issued to at-risk customers?
Store Chain: What proportion of our Q1 forecasted sales come from the Pet Food category?
Financial Institution: Based on past transactions, which of these new transactions are likely fraudulent?
Manufacturing Company: Based on sensor data, when is this critical machine component likely to wear out?
E-commerce Company: Based on sales data, which of our high-value customers are most likely to leave?
Descriptive: Provides a view of the facts of who, where, when, how many, and what exactly happened.
Diagnostic: Provides an analysis to tell us why something is happening (what was the leading cause?).
Predictive: Provides a probable state of the future or an unknown variable.
Prescriptive: Provides the best course of action in order to achieve a given outcome.
[Venn diagram: Data Science sits at the overlap of Machine Learning, Software Dev, and Computer Science & Coding]
BIDA™ - Business Intelligence & Data Analysis
The Data Science Process
Step (Definition / Scenario):
• Data Collection and Storage: Capture information, ensure quality of data, and store data into a database. / Collect transaction data, usernames, credit history. Identify past fraud.
• Transform Data for Projects: Optimize data for the project we're working on and select features of interest. / Combine or manipulate datasets, filter out items, or adjust formatting.
• Statistical & Predictive Analysis: Build models and algorithms that spot patterns in our data. / Train a model to identify the leading indicators of fraudulent transactions.
• Model Evaluation: Test how well the model is performing. / Which model is best at identifying fraudulent transactions? Optimize for business objectives.
• Data Visualization: Present results using data visualization.
• Share Insights: Share dashboards and reports with business users for use in decision making, and deploy our models into operations. / Share real-time information identifying risky transactions.
Data Science Skills
1. Load & Clean Data: Ensure clean and tidy data. Remove errors. Deal with missing data points.
2. Exploratory Data Analysis: What can we learn at a glance? Explore data types or obvious relationships.
3. Feature Engineering: Manipulate input data into an optimal format for analysis. This may include categorization, scaling, one-hot encoding, etc.
4. Model Building: Build models that can analyze data, make predictions or quantify uncertainty. Regression, classification, etc.
5. Model Evaluation & Visualization: Evaluate and compare model performance. Visualize and communicate results.
Model evaluation is the most important part of the data science process from a leadership perspective.
Solve technical challenges and select the best analytical approach.
Provide insight on business objectives, project goals and business costs.
Projects and models that deliver targeted value to the business.
Business leaders and data science teams should work closely to align priorities, objectives and measures of success.
Business leaders need a basic understanding of model outputs, and their impact on decision making.
Identify Fraudulent Transactions (using transaction data): introduces false alarms, flagging transactions that are not fraud.
Automated Nut Filtering (using laser scans of nuts): sorting into high quality and lower quality introduces poor nut selection.
The business must quantify the cost of false alarms and the benefits of correct identification.
As we chase higher accuracy, the marginal cost of the time and resources needed to achieve it grows.
Business leaders should work closely with DS teams to ensure expectations, objectives and resources are aligned.
Confusion matrix layout (spam prediction):

                          Prediction: NOT SPAM (-)    Prediction: SPAM (+)
Actual: NOT SPAM (-)      True Negative               False Positive
Actual: SPAM (+)          False Negative              True Positive

[Chart: BBQ Sales forecast plotted against Temperature]
Leaders should understand the basics of model evaluation to help challenge and discuss outcomes.
The Data Science Process: Data Collection and Storage → Transform Data for Projects → Stat & Predictive Analysis → Model Evaluation / Data Visualization → Share Insights
The goal of regression is to assess the relationship between one or more input variables (X) and a continuous output variable (Y).
Our line of best fit allows us to make predictions about the value of the target variable in a given scenario.
How do we decide where exactly the line of best fit sits? Generally, we try to minimize the size of errors in our predictions.
Errors represent the amount by which the target variable is different from the value predicted.
Metric                        Sensitive to Outliers
Sum of Squared Error (SSE)    Yes
Sum of Absolute Error (SAE)   No
Mean Absolute Error (MAE)     No

The most common approach minimizes the squared errors, and is known as Ordinary Least Squares.
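As a minimal sketch of Ordinary Least Squares, we can fit a straight line to a small hypothetical dataset with NumPy and compute the error metrics above (the X and Y values here are invented for illustration):

```python
import numpy as np

# Hypothetical sample: input variable (X) vs continuous target (Y)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Ordinary Least Squares: the slope and intercept that minimize squared errors
slope, intercept = np.polyfit(X, Y, deg=1)

predictions = slope * X + intercept
errors = Y - predictions                 # amount by which target differs from prediction
sse = np.sum(errors ** 2)                # Sum of Squared Error
mae = np.mean(np.abs(errors))            # Mean Absolute Error
```

On this toy data the fitted line has a slope of roughly 2, so each unit increase in X predicts about a 2-unit increase in Y.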
[Chart: Marketing Scenario, Sales data]
Coefficient of Determination (R2) is one of the most used metrics to evaluate regression models.
R2 measures how close the data are to the fitted regression line. In other words, how much of the variability
in Y is explained by changes in X.
[Scatter plots with fitted lines: R2 = 1, R2 = 0.86, R2 = 0]
Higher R2 indicates better fit of the model, and therefore smaller errors.
R2 can sometimes be biased, so a related measure called Adjusted R2 can also be used.
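R2 can be computed directly from the residuals: one minus the ratio of the residual sum of squares to the total sum of squares. A small sketch with invented actual and predicted values:

```python
import numpy as np

# Hypothetical observed values and a model's predictions for them
y_actual = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred   = np.array([2.8, 5.3, 6.9, 9.4, 10.6])

# R2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_actual - y_pred) ** 2)
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
```

Here nearly all of the variability in Y is explained by the model, so R2 is close to 1.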
Models need to be tested on new data, before we allow them to make real world decisions.
Available Sample Data: approx. 80% is used to teach (train) the model (Training Data); approx. 20% is used to test how well the model performs (Testing Data).
Training Data: used to teach the model what the relationship looks like.
Testing Data: used to test model performance on new data.
Real World Data: used to make real world predictions and decisions.
It is important that models are tested on data they have never seen before.
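The 80/20 split can be sketched with scikit-learn's `train_test_split`; the dataset here is randomly generated purely for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 100 rows, 3 features, one binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# Hold out ~20% of rows that the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

Fixing `random_state` makes the split reproducible, which helps when comparing candidate models.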
[Chart: total error against model complexity, with high bias (underfitting) at low complexity and high variance (overfitting) at high complexity]
Values sampled repeatedly; common in medicine.
Useful for smooth, non-linear relationships.
Best fit may differ by sample region.
Provides a level of certainty with outputs.
Binary Classification: classification tasks that have two class labels (e.g. SPAM vs NOT SPAM). Outcomes must be ONE of the two classes.
Multi-Class Classification: has more than two class labels. Outcomes must be ONE of a range of classes.
Multi-Label Classification: has two or more class labels. Outcome can be ONE or MORE of the class labels.
In the rest of this course, we’ll explore the most common classification algorithms.
Once we understand each technique, we’ll compare and contrast the benefits, and outputs.
Logistic Regression probabilities are estimated using one or more input variables
[Chart: logistic curve of spam probability against % of words misspelt, with a 0.5 threshold separating SPAM from NOT SPAM predictions]
Logistic Regression uses a curved line to summarize our observed data points
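A minimal sketch of the misspelling example using scikit-learn, with invented training data (the % of misspelt words and the spam labels are illustrative only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: % of words misspelt vs spam label (1 = SPAM)
X = np.array([[0], [1], [2], [3], [8], [10], [12], [15]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)

# Estimated probability of spam for an email with 9% of words misspelt
p_spam = model.predict_proba([[9]])[0, 1]
label = "SPAM" if p_spam >= 0.5 else "NOT SPAM"
```

The fitted S-shaped curve maps any misspelling rate to a probability between 0 and 1, and the 0.5 threshold turns that probability into a class prediction.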
The decision tree algorithm can be used to predict both categorical or numeric outcomes.
[Diagrams: candidate decision trees for email spam detection. A new email received is tested at each node with Yes/No questions such as "Spelling errors?", "Grammatical errors?" and "Suspicious domain?", leading to further nodes or to leaf outcomes of Spam or Not Spam. Asking the questions in different orders produces different candidate trees.]
We choose the model which best separates the two classes, in this case Spam and Not Spam.
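A small sketch of a decision tree on invented spam features. Here the toy rule is that an email with at least two warning signs is spam; an unconstrained tree can learn this from the examples:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical email features: [spelling_errors, suspicious_domain, grammar_errors]
# (1 = yes, 0 = no); label 1 = Spam, 0 = Not Spam
X = [
    [1, 1, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1],   # two or more flags -> spam
    [0, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0],   # zero or one flag  -> not spam
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Classify a new email: spelling errors and a suspicious domain, but no grammar errors
pred = int(tree.predict([[1, 1, 0]])[0])
```

Each internal node the tree learns corresponds to one of the Yes/No questions above; the leaves carry the Spam / Not Spam verdicts.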
KNN assigns output classes based on the most similar observations in our sample space.
[Charts: sample points plotted against Frequency of Grammatical Errors (Y); a new point is assigned the majority class of its nearest neighbours]
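KNN is simple enough to sketch directly in NumPy. The two clusters below (low vs high error rates) are invented for illustration, with k = 3 neighbours:

```python
import numpy as np

# Hypothetical samples: [spelling error rate, grammatical error rate]; 1 = spam
points = np.array([[0.5, 0.8], [1.0, 0.6], [0.7, 1.1],    # not spam: low error rates
                   [7.5, 8.0], [8.2, 7.4], [7.9, 8.6]])   # spam: high error rates
labels = np.array([0, 0, 0, 1, 1, 1])

def knn_predict(query, k=3):
    # distance from the query point to every sample point
    dists = np.linalg.norm(points - query, axis=1)
    nearest = labels[np.argsort(dists)[:k]]    # labels of the k closest points
    return int(np.round(nearest.mean()))       # majority vote

pred = knn_predict(np.array([7.0, 7.2]))
```

A query near the high-error cluster inherits the spam label from its three closest neighbours.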
SVM models try to maximize the separation or margin between classes in the sample space.
Support Vector Machines are an extension of Support Vector Classifiers.
[Diagram: a support vector classifier with two permitted outliers, showing the margin between Class 1 and Class 2]
• By allowing outliers, we make the model less sensitive to the training data.
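A minimal linear SVM sketch on invented 2-D data; the `C` parameter controls how strictly margin violations (outliers) are penalized:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical 2-D points for two well-separated classes
X = np.array([[1.0, 1.0], [1.5, 0.5], [0.8, 1.2], [1.2, 0.9],
              [4.0, 4.2], [4.5, 3.8], [3.9, 4.5], [4.2, 4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Linear SVM: find the boundary that maximizes the margin between the classes.
# A smaller C permits more outliers, making the model less sensitive to training data.
svm = SVC(kernel="linear", C=1.0).fit(X, y)
pred = int(svm.predict([[4.1, 4.1]])[0])
```

Lowering `C` trades a few misclassified training points for a wider, more robust margin.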
Naïve Bayes is a probabilistic model based on Bayes theorem which calculates conditional probabilities.
Conditional probability is the probability of one event, given the probabilities of other events.
P(A|B) = P(B|A) P(A) / P(B)

Where A is the hypothesis or outcome variable, and B is the evidence or features.

Did you know? The "Naïve" refers to the model's assumption that all input variables are independent.
For example, when we observe a phrase in a potential SPAM email, we might ask:
The Gaussian Naïve Bayes is an extension of Naïve Bayes, and is used to model normally distributed variables.
We plot our sample data and observe two very different but
overlapping distributions.
The Gaussian Naïve Bayes also captures prior probabilities, allowing us to capture additional information.
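A sketch of Gaussian Naïve Bayes on two overlapping normal distributions, mirroring the slide. The feature values are randomly generated for illustration; the learned priors simply reflect the class frequencies in the sample:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Two overlapping normal distributions for a single hypothetical feature
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(2.0, 1.0, 50),    # class 0, centred at 2
                    rng.normal(6.0, 1.0, 50)])   # class 1, centred at 6
y = np.array([0] * 50 + [1] * 50)

model = GaussianNB().fit(X.reshape(-1, 1), y)

priors = model.class_prior_            # prior probabilities captured from the sample
p = model.predict_proba([[6.5]])[0, 1] # probability that a value of 6.5 is class 1
```

A value near the centre of the class-1 distribution receives a class-1 probability close to 1, because both the likelihood and the prior support it.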
The confusion matrix helps us compare the predictions we control, vs the actual outcomes that we don’t.
It can help us understand the quality of our predictions, or the trade-offs we must make.
Prediction: Negative (0) Not Spam vs Positive (1) Spam

Scenario 1: Suppose we predict roughly 50% of emails are SPAM, based on several input variables.

                         Predicted Not Spam (0)   Predicted Spam (1)
Actual Not Spam (0)      True Negative: 35        False Positive: 11
Actual Spam (1)          False Negative: 14       True Positive: 40

• Overall, 75 (40 + 35) out of 100 predictions were correct.
• But there are 14 Spam emails that we didn't detect.

Scenario 2: What if we want to increase the number of actual Spam emails that we detect? We previously missed 14 of them!

                         Predicted Not Spam (0)   Predicted Spam (1)
Actual Not Spam (0)      True Negative: 17        False Positive: 29
Actual Spam (1)          False Negative: 7        True Positive: 47

• There are now only 7 missed SPAM emails! Success!
• But now we created 18 more false alarms. This may get annoying for users as they have to search in their junk.
Our model cannot be perfect, and the confusion matrix helps us understand the trade offs.
Understanding Trade Offs
But how do we decide which outcome to favor? False Negatives are undesirable in disease detection. We cannot afford to miss bad outcomes.
[Confusion matrix grid: Prediction Negative (0) / Positive (1) against Actual outcomes]
We must be clear on what we want to achieve, and the costs and benefits of each type of error.
There are several metrics and techniques that can help summarize the observations in the confusion matrix:
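Using the first confusion matrix from the slides (TN = 35, FP = 11, FN = 14, TP = 40), the common summary metrics work out as follows:

```python
# Counts taken from the slide's first spam confusion matrix
tn, fp, fn, tp = 35, 11, 14, 40

accuracy  = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that were correct
precision = tp / (tp + fp)                   # of emails predicted spam, how many really were
recall    = tp / (tp + fn)                   # of actual spam emails, how many we caught
f1        = 2 * precision * recall / (precision + recall)  # balance of precision and recall
```

Accuracy here is 0.75 (75 correct out of 100), while recall of about 0.74 reflects the 14 spam emails the model missed.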
Underfitting and Overfitting can help us describe classification model outputs too.
Underfitting means the model under-generalises the data. Overfitting means the model learns the training data too well, and misses the more general relationship.
[Charts: KNN decision boundaries with 15 Nearest Neighbours (smoother) vs 1 Nearest Neighbour (overfit to individual points)]
We must find a balance to ensure our model performs according to our evaluation metrics.
The Data Science Process: Data Collection and Storage → Transform Data for Projects → Statistical & Predictive Analysis → Model Evaluation / Data Visualization → Share Insights
[Table row of sample values: 2, 15, 527, 13,000, NO]
It is important that we investigate all errors to understand why they occur and how they should or should not affect our analysis.
• Some errors are clearly incorrect, like a customer age of 186.
• Some errors are suspicious, like a credit score of 999.
• Some errors are ambiguous, like an applicant age of 15.
Duplicated Data
[Sample CSV rows; two out of three rows seem to be duplicated]
We should investigate the reason for the duplicated data.
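A small pandas sketch of spotting and removing duplicated rows; the customer IDs and dates below are invented for illustration:

```python
import pandas as pd

# Hypothetical CSV rows: two of the three are identical
df = pd.DataFrame({
    "customer_id": [101, 101, 102],
    "signup_date": ["2014-01-23", "2014-01-23", "2014-02-01"],
})

dupes = int(df.duplicated().sum())   # count of repeated rows (worth investigating first)
deduped = df.drop_duplicates()       # keep only the first occurrence of each row
```

Counting duplicates before dropping them keeps a record of how much data was affected, which helps explain why the duplication happened.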
EDA helps us gain initial insights from our data, that help us implement our model.
What types of data are present?
How can we describe the data in each feature?
Are there any obvious relationships?
Understanding the data types helps us understand limitations and challenges we may encounter.
Continuous
Can be measured on a scale or timeline (e.g. 0 to 20, or Jan '20 to Jul '21). Continuous variables are the easiest to work with.
Categorical
Categorical features tell us which bucket a data point falls into (e.g. "Student", "4", "X Large").
Numbers are not always continuous: with counts like 1, 326, 4,237 we cannot have 0.5 of a customer, so the scale is not strictly continuous.
Dates are not always continuous: when datapoints belong in buckets, they are considered categorical data.
Exploratory data analysis helps us be curious, ask questions and uncover patterns in our data.
Positive Correlations tell us that as one variable increases, the other tends to also.
Negative Correlations tell us that as one variable increases, the other decreases.
Zero Correlation tells us that one variable has no impact on the other.
[Scatter plots illustrating correlations of 1.00, 0.93, 0.00, -0.50 and -1.00]
Correlation can have a max value of 1. Correlation can have a min value of -1.
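Correlation coefficients are easy to compute with NumPy; the data below is invented to show a strong positive and a strong negative relationship:

```python
import numpy as np

# Hypothetical data: y_pos rises with x, y_neg falls with x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pos = np.array([2.0, 4.1, 5.9, 8.2, 10.0])
y_neg = np.array([9.8, 8.1, 6.0, 3.9, 2.2])

# Pearson correlation coefficients, bounded between -1 and 1
corr_pos = np.corrcoef(x, y_pos)[0, 1]
corr_neg = np.corrcoef(x, y_neg)[0, 1]
```

Values near +1 or -1 indicate nearly perfect linear relationships; values near 0 indicate no linear relationship.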
[Heatmap: correlations between Titanic passenger features (Passenger ID, Ticket Class, Age, Siblings/Spouse, Parents/Children, Fare, Family) and the target, Survived; scale from -0.6 to 0.9]
[Scatter plots: clear non-linear relationship; no relationship; exponential relationship; changing variance; Company Valuation example]
In summary, we are eliminating some columns (features) from our dataset.
[Diagram: features X1 to X5, with some columns removed]
Common Feature Selection methods are Principal Component Analysis and Feature Importance.
Feature Engineering
Feature engineering is the process of modifying the structure or contents of our data to make it more
suitable for analysis, or to help improve the performance of a model.
[Diagrams: numeric values (e.g. standardized scores from -3 to 3, or ages from 30 to 90) grouped into category bins 0 and 1]
Grouping or binning helps us simplify our data to make it more digestible, or remove some unnecessary detail.
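Binning can be sketched with pandas' `cut`, which assigns each value to a labelled bucket; the ages and bin edges below are invented for illustration:

```python
import pandas as pd

# Hypothetical ages grouped into three labelled bins
ages = pd.Series([34, 47, 52, 68, 71, 89])
bins = pd.cut(ages, bins=[30, 50, 70, 90], labels=["30-50", "50-70", "70-90"])

# How many data points fall into each bucket
counts = bins.value_counts().to_dict()
```

The detail of individual ages is traded away for a simpler, more digestible categorical feature.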
High Cardinality
Features with many unique values are referred to as high cardinality. High cardinality provides lots of detail, but results in a small sample set per category.
Solutions: It can help to reduce the number of categories, for example mapping each Zip Code to a broader Region:

Zip Code   Region
22261      22
23621      23
25612      25
23261      23
25211      25
22515      22
26612      26
Color    Red   Yellow   Green
Red      1     0        0
Red      1     0        0
Yellow   0     1        0
Green    0     0        1
Yellow   0     1        0
Dummy Variable Encoding achieves a similar outcome, instead removing the final column.
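Both encodings are available through pandas' `get_dummies`. Note that `drop_first=True` removes the first category column rather than the final one, but the effect is the same: one redundant column is dropped.

```python
import pandas as pd

colors = pd.DataFrame({"color": ["Red", "Red", "Yellow", "Green", "Yellow"]})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(colors["color"])

# Dummy variable encoding: drop one column, since it is implied by the others
dummy = pd.get_dummies(colors["color"], drop_first=True)
```

Dropping a column avoids redundancy: if an email is neither Red nor Yellow, it must be Green, so the third column carries no extra information.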
Calculations can help us extract new information or summarize the data we have available.
Start Date    First Review   Review Wait (Months)   Part 1   Part 2   Part 3   Part 4   Avg Performance
01 Apr 2020   01 Sep 2020    5                      30%      20%      45%      50%      36%
01 Jun 2020   15 Jun 2020    0.5                    80%      60%      70%      80%      73%
Training & Testing
Models need to be tested before we use them to predict real-world outcomes.
[Diagram: the available dataset split into training and testing segments]
Tests must be carried out on new data that the model has never seen before.
A further technique: Training, Validation & Test splits the dataset into 3 segments.
What do they all have in common? All techniques are trying to validate results on new, unseen data.
[Diagram: cross-validation, where the sample is split into Data A, Data B, Data C and Data D and each segment is held out in turn, before the final model is used on Live Data]
The purpose of clustering is to group data points into those with similar characteristics.
K Means Clustering
[Charts: Netflix viewer data plotted as Income against Age; the algorithm groups viewers into clusters such as Students, Professionals, Executives and Retirees]
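A minimal K Means sketch on invented age/income data with two obvious groups; the viewer values are illustrative only:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical viewer data: [age, income in $k], with two clear groups
viewers = np.array([[20, 15], [22, 18], [19, 12],     # student-like viewers
                    [45, 90], [50, 110], [48, 95]])   # professional-like viewers

# K Means groups the points into k clusters of similar characteristics
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(viewers)
labels = km.labels_
```

No target variable is provided: the algorithm discovers the groups purely from the similarity of the data points, which is what makes clustering unsupervised.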
Hierarchical Clustering
Feature Selection: 50 Features (Financial Ratios) reduced to 10 Key Features.
Benefits:
• Improve analysis results
Variable Reduction algorithms are designed to reduce the number of features.
Reinforcement Learning is where machines learn how to navigate scenarios through repetition.
• No memory loss
• Computers hold no recent memory biases
• Computational superiority
Neural Networks are inspired by the structures of neurons in our brains. They consist of nodes organized
into layers.
Deep Learning is an extension of Neural Networks, where the model may retrain itself multiple times.
Deep Learning models are less reliant on humans and may be able to train themselves.
A basic rule-based model. Bots make trades when certain conditions are met.
The basic principles in both scenarios are the same: Computers are simply following instructions.
BIDA Courses:
• Enterprise BI
• Advanced Tableau – LOD Calculations
• Advanced Power BI
• Case Study: Financial Statements in Power BI
• Case Study: Trading Dashboard in Tableau
• SQL Fundamentals
• Tableau Fundamentals
• Power BI Fundamentals
• Power Pivot Fundamentals
• Power Query Fundamentals
• Intro to Business Intelligence
BI Roles:
• Data Architect
• Data Engineer / SQL Developer
• Database Admin (DBA)
• Data Visualization Specialist
• Business Intelligence Developer or Generalist
• Data Analyst