Module 2
• Used for binary classification, i.e. when outcomes can take only one of two
values (yes/no, click/no click, healthy/sick)
• Among the most useful and popular methods (along with ordinary least
squares) in statistics, data science, and academia
• Is it really machine learning?
• Early development in the 19th century
Logistic Regression
• A single “neuron”: inputs are combined through “weights”, which are the
model’s “parameters”
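A minimal sketch of this “single neuron” view in Python: the model multiplies each input by a learned weight, adds a bias, and squashes the sum through the sigmoid to get a probability. All names and numbers below (predict_click_probability, the toy weights) are illustrative, not from the slides.

import math

def sigmoid(z):
    # Squashes any real number into (0, 1), so the output reads as a probability
    return 1.0 / (1.0 + math.exp(-z))

def predict_click_probability(features, weights, bias):
    # The "single neuron": a weighted sum of the inputs plus a bias,
    # passed through the sigmoid. The weights and bias are the model's
    # learned "parameters".
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Toy example with made-up parameters: two features (e.g. age, past clicks)
p = predict_click_probability([35, 2], weights=[-0.01, 0.8], bias=-0.5)
print(f"P(click) = {p:.2f}")  # classify as "click" if p > 0.5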
• Boosting
• Support vector machines (SVM)
• Neural nets - much more complicated and varied than covered here
• Many, many other regression techniques
• LASSO, Ridge, weighted regression, kernel regression
• Want to learn more?
• Statistics or Intro CS courses on machine learning
AI Fundamentals for Non-Data Scientists
Intro to Model Selection
• For any prediction problem, there are many algorithms and methods
available - decision trees, random forests, neural networks, and more
• Model evaluation and selection is done by evaluating model performance
on a validation dataset
• Holdout validation: Partition available data into a training dataset and a
holdout; evaluate model performance on holdout
• Cross-validation: Create a number of partitions (validation datasets) from
the training dataset; fit the model to the training data minus one partition;
evaluate it against that held-out partition; repeat for each partition and
average the results to obtain the cross-validation error (sketched in code below)
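A minimal sketch of both approaches, assuming scikit-learn and a synthetic dataset (everything here is illustrative):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic labeled data standing in for a real prediction problem
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Holdout validation: partition the data, train on one part,
# evaluate on the held-out part
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("Holdout accuracy:", model.score(X_holdout, y_holdout))

# Cross-validation: rotate the validation partition and average the results
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print("5-fold CV accuracy:", scores.mean())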
Data vs. Model Decision
What About Unstructured Data?
• When data are not structured, features have to be “engineered” from the
data
• A time-consuming and challenging process that often requires domain
expertise
• One of the most difficult parts of the ML process, where data scientists
spend a lot of their time
• Can be as much an art as a science
Feature Engineering
Example: Features you might engineer from real estate pictures
• Take individual images and extract individual features
• Requires knowledge of real estate
• Requires access to a realtor and a software developer
• A good amount of guessing is involved; you are very likely to miss critical features
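As a hedged illustration of what “engineering” features from a listing photo might look like, here is a sketch using the Pillow imaging library. The specific features (aspect ratio, brightness, contrast) and the file name are guesses a non-expert might make, which is exactly the slide’s point about missing critical features.

from PIL import Image, ImageStat

def engineer_photo_features(path):
    # Hand-crafted features a realtor might or might not endorse; illustrative only
    img = Image.open(path).convert("RGB")
    width, height = img.size
    gray = ImageStat.Stat(img.convert("L"))  # grayscale pixel statistics
    return {
        "aspect_ratio": width / height,
        "mean_brightness": gray.mean[0],      # bright rooms may signal natural light
        "brightness_stddev": gray.stddev[0],  # contrast, a rough proxy for detail
    }

# features = engineer_photo_features("listing_photo.jpg")  # hypothetical file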
AI Fundamentals
Deep Learning
Deep Learning
• Image recognition
• Detecting fake news
• Detecting knockoffs from luxury products
AI Fundamentals
Evaluating ML Performance
Example: “Identify Fraudulent Credit Card Transactions”

Actual        Model Prediction
Fraudulent    Fraudulent
Legitimate    Legitimate
Legitimate    Legitimate
Legitimate    Legitimate
Fraudulent    Fraudulent
Fraudulent    Legitimate    (false negative: missed fraud)
Legitimate    Legitimate
Legitimate    Fraudulent    (false positive: legitimate transaction flagged)
Legitimate    Legitimate
Legitimate    Legitimate
Evaluating ML Performance
Tradeoffs Between Loss Functions
• What are the relative costs of false negatives and false positives? In the
fraud example above, a false negative is a missed fraudulent transaction,
while a false positive means declining a legitimate customer
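A minimal sketch that tallies the errors from the table above and weights them by hypothetical costs (the dollar figures are made up for illustration):

# (actual, predicted) pairs from the fraud example above
pairs = [
    ("Fraudulent", "Fraudulent"), ("Legitimate", "Legitimate"),
    ("Legitimate", "Legitimate"), ("Legitimate", "Legitimate"),
    ("Fraudulent", "Fraudulent"), ("Fraudulent", "Legitimate"),  # false negative
    ("Legitimate", "Legitimate"), ("Legitimate", "Fraudulent"),  # false positive
    ("Legitimate", "Legitimate"), ("Legitimate", "Legitimate"),
]

false_negatives = sum(1 for a, p in pairs if a == "Fraudulent" and p == "Legitimate")
false_positives = sum(1 for a, p in pairs if a == "Legitimate" and p == "Fraudulent")

# Hypothetical costs: an undetected fraudulent charge costs far more than
# the annoyance of declining one legitimate transaction
COST_FN, COST_FP = 500, 5
print("Expected cost:", false_negatives * COST_FN + false_positives * COST_FP)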
Training Data
• The data that the algorithm uses to learn the best mapping between the
inputs and the right predictions, or outputs
• Training data is the key to building ML algorithms, but we care most about
performance on out-of-sample data
• The point is to predict outcomes where we don’t already know what is going
to happen
The Over-Fitting Problem
• Over-fitting is the danger that the model performs well on training data but
not on other data sets
• ML engineers try to avoid fitting the model to the point that it picks up noise
in the training data
• The goal is to balance using the training data to build an accurate model
with having a model that still performs well on out-of-sample data
• Example: Studying the test vs. studying the material
The Over-Fitting Problem
• The challenge is in capturing the relevant aspects of the underlying
relationship vs. capturing the noise in the training data
• This is called the “bias-variance” tradeoff
The “Bias-Variance” Tradeoff
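A minimal sketch of the tradeoff using NumPy: the true signal is a straight line, so a degree-1 fit (high bias, low variance) generalizes well, while a high-degree polynomial (low bias, high variance) chases the training noise and does worse on fresh data. The setup is synthetic and illustrative.

import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 15)
y_train = 2 * x_train + rng.normal(0, 0.2, size=15)   # line plus noise
x_test = np.linspace(0, 1, 100)
y_test = 2 * x_test + rng.normal(0, 0.2, size=100)    # fresh out-of-sample data

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)     # fit a polynomial
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
# The degree-9 fit "memorizes" the training noise: lower training error,
# higher test error, the over-fitting pattern described above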
• Test data (also called a “hold-out sample”) is a data set that is not used to
train or build the model, but can be used to validate the model
• Validating performance on a data set that’s not used to build the model (test
data) helps ensure the model also works well on new, out-of-sample data
Performance Tradeoffs with Training and Test Data
Where Does Test Data Come From?
• One common approach is for ML engineers to start with all of the data for
which they have labels, and then divide it up into training and test data (e.g.
conduct a 70/30 split)
• Example: Insurance data to predict accident likelihood
• Take all historical data and divide it up
• Everything up until the last 6 months is used as training data
• Everything from 6 months ago to the present day is used as test data
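A minimal sketch of that temporal split, assuming a hypothetical pandas table of claims with a date column (file name and column names are illustrative):

import pandas as pd

# Hypothetical historical data with a `date` column and an accident label
df = pd.read_csv("claims.csv", parse_dates=["date"])

cutoff = df["date"].max() - pd.DateOffset(months=6)
train = df[df["date"] < cutoff]    # everything up until six months ago
test = df[df["date"] >= cutoff]    # the most recent six months

print(len(train), "training rows;", len(test), "test rows")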
AI Fundamentals
Examples of End-to-End AI Workflow