CS464
Introduction to Machine Learning
Bilkent University

CS464
•  An undergraduate-level introductory course that aims to give a broad overview of many concepts and algorithms in machine learning.

•  The goal is to provide students with a deep understanding of the subject matter and skills to apply these concepts to real-world problems.
Prerequisites
•  Basic algorithms, data structures

•  Basic probability and statistics

•  Basic linear algebra

•  Good programming skills


Term Projects
•  “Photobomb Detector”

•  “Facial Expression Recognition”

•  “Predicting the Stack Exchange Questions’ Category”

•  Some ideas from Stanford ML course:
–  http://cs229.stanford.edu/proj2019aut/index.html
TO DO
•  Enroll on Moodle
•  Read the syllabus:
–  https://stars.bilkent.edu.tr/syllabus/view/CS/464/
•  Read the course information page:
–  http://ciceklab.cs.bilkent.edu.tr/ercumentcicek/cs-464-introduction-to-machine-learning-fall-2020/
•  Start talking to your classmates about project ideas and start establishing teams:
–  Groups of 5 people from the same section.
Data-rich Era
•  The information volume grows exponentially
–  Most data will never be seen by humans
•  Information complexity is also increasing greatly
–  Most data cannot be comprehended by humans directly
•  In the computationally enabled, data-rich era, machine learning is an essential component of the new scientific toolkit
Genome Cost
Genome Data Size
Learning from Data
•  Many applications require gaining insights from massive,
noisy data sets
•  Commercial applications
–  Speech recognition
•  https://youtu.be/2U1ImKsecB8
–  Consumer data (online advertising, viral marketing,…)
–  Health records (emergency rehospitalization,…)
•  Security related applications
–  Spam filtering, fraud detection…
–  Surveillance …
•  Science
–  Biology, astronomy, neuroscience, geology, …
–  Social science, economics
Spam or Ham
[Figure: example e-mails labeled "ham" and "spam"]

Text Classification
[Figure: classifying incoming e-mail]
Reading Zip Codes
•  Classifying handwritten digits from binary images
Recommendation Systems
Statistical Machine Translation
Recognizing Faces

Figure taken from: https://github.com/deepinsight/insightface


Semantic Segmentation

Figure taken from: https://github.com/facebookresearch/detectron2


Image/Video Captioning

Automatically captioned: “Two pizzas sitting on top of a stove top oven”


Autopilot

Figure taken from: https://www.tesla.com/autopilot


Clustering Gene Expressions

Verhaak et al., Cancer Cell, 2010.


Analyzing fMRI Data

Mitchell et al., Science, 2009.

•  Predict activation patterns for nouns


Speech Recognition

https://youtu.be/D5VN56jQMWM
IBM Watson

•  IBM Watson won Jeopardy! in February 2011


AlphaGo

•  AlphaGo beat a team of five leading Go players


Health
Many Other Applications
•  Information Retrieval
•  Online advertising
•  Finance
•  Fraud detection
•  Economics
•  Social Sciences
Growth of Machine Learning
•  Machine learning is the preferred approach to
–  Speech recognition, natural language processing
–  Computer vision
–  Medical outcomes analysis
–  Robot control
–  Computational biology
–  Sensor networks
•  This trend is accelerating
–  Big data
–  Improved machine learning algorithms
–  Faster computers
–  Good open-source software
Why “Learn” ?
•  Learning is useful when:
–  Human expertise does not exist (navigating on Mars),
–  Humans are unable to explain their expertise (speech
recognition, object recognition)
–  Solution needs to be adapted to particular cases
(personalization)
Experience = data
•  Humans learn to improve their skills through practice

•  The machine learning goal is to write algorithms that improve with experience
–  Experience is encoded as examples, or data
–  Autonomous learning must gain from experience

•  Where does training data come from?
–  Human annotation
–  Automatic collection
Objectives of Machine Learning
•  Algorithms:
–  design of efficient, accurate, and general learning algorithms to
deal with large-scale problems
–  make accurate predictions (unseen examples)
–  handle a variety of different learning problems

•  Theoretical questions:
–  What can be learned? Under what conditions?
–  What learning guarantees can be given?
–  What can we say about the inherent ease or difficulty of
learning problems?
Machine Learning?
•  Role of Probability and Statistics: Inference from a sample

•  Role of Computer Science: Efficient algorithms to
–  solve the optimization problem
–  represent and evaluate the model for inference
Definitions and Terminology
•  Example:
–  an object or instance in the data
•  Features (denoted by X):
–  the set of attributes, often represented as a vector, associated with an example, e.g., height and weight for gender prediction
•  Labels (Y):
–  in classification, the category associated with an object
–  in regression, real-valued numbers
Definitions and Terminology
•  Training set: Data used for training the learning algorithm

•  Test data: Data used for evaluating the learned model
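A minimal sketch of building non-overlapping training and test sets (the function and variable names here are illustrative, not from the course):

```python
import numpy as np

def train_test_split(X, y, test_fraction=0.2, seed=0):
    """Randomly split the data into non-overlapping training and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))             # shuffle example indices
    n_test = int(len(y) * test_fraction)
    test, train = idx[:n_test], idx[n_test:]  # disjoint index sets
    return X[train], y[train], X[test], y[test]

# Toy usage: 10 examples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)
X_tr, y_tr, X_te, y_te = train_test_split(X, y)
print(len(X_tr), len(X_te))  # 8 2
```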
Feature Extraction and Mapping
•  How you represent your data as a set of features is very important.
Example: Recognizing Flowers
Three types of Iris flowers: Setosa, Versicolor, Virginica
[Figure: example images of each species]

Flower Recognition: Features
[Figure: feature measurements for Setosa, Versicolor, and Virginica]
Example: A Digit Recognizer
•  Imagine you are asked to write a program to recognize handwritten digits

•  Now let's use the terminology to define the problem
Terminology Example
Classification Task:
Given a handwritten digit, classify it into {0, 1, …, 9}
[Figure: an image of the digit 3]
Digit Recognizer
•  Each handwritten digit image is an example (instance, or object) in your data
Labels (Space of outputs)
•  Labels are all digits: {0, 1, …, 9}
–  Label of the example shown: 6
Features (Space of Inputs)
We could divide the image into a 16x16 grid of pixels and encode each pixel intensity as a numeric value.
[Figure: a 16x16 grid mapped to a 256 x 1 feature vector of pixel intensities, e.g., (0, …, 1.2, 3.2, 4.5, 1.2, …, 0)]
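A minimal sketch of this feature mapping, assuming the digit is already loaded as a 16x16 array of grayscale intensities:

```python
import numpy as np

def image_to_features(image):
    """Map a 16x16 grayscale image to a 256 x 1 feature vector.

    Each pixel intensity becomes one numeric feature, read in
    row-major order (an arbitrary but fixed choice).
    """
    assert image.shape == (16, 16)
    return image.reshape(256, 1).astype(float)

# Hypothetical example: a random array standing in for a scanned digit
image = np.random.rand(16, 16)
x = image_to_features(image)
print(x.shape)  # (256, 1)
```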
Feature Mapping
•  How you represent your data as features is very important.
Training Set/Test Set
[Figure: a training set (+ their labels) and a test set (+ their labels); the two should not overlap]
Terminology Example: Digit Recognizer

•  Training set (+ labels): build the classifier using the features and the labels in the training set
•  Test set (+ labels): predict the labels of unseen examples from their features, then compare your predictions with the actual labels

How would you classify a new example? Predict its label?
Nearest Neighbor Classifiers
•  Basic idea:
–  If it walks like a duck, quacks like a duck, then it’s
probably a duck

compute
distance test
sample

training choose k of the


samples “nearest” samples
k-Nearest Neighbor (kNN)
•  Assumes all instances are points in n-dimensional space

•  A distance measure is needed to determine the "closeness" of instances

•  Classify an instance by finding its nearest neighbors and picking the most popular class among the neighbors
1- Nearest Neighbor Classifier Algorithm
•  The training data D = {(x1, y1), (x2, y2), …, (xn, yn)}
•  Learning algorithm:
–  Store the training examples
•  Prediction algorithm:
–  To classify a new example xnew, find the training example (xi, yi) that is nearest to xnew
–  Guess the class ynew = yi
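A minimal NumPy sketch of the k-NN prediction rule just described, using Euclidean distance (the function and variable names are illustrative; k=1 reduces to the 1-NN algorithm above):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=1):
    """Classify x_new by majority vote among its k nearest
    training examples under Euclidean distance."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # distance to every training point
    nearest = np.argsort(dists)[:k]                  # indices of the k closest samples
    votes = Counter(y_train[i] for i in nearest)     # class counts among the neighbors
    return votes.most_common(1)[0][0]                # the most popular class

# Toy usage: four 2-D points, two classes
X_train = np.array([[0., 0.], [0., 1.], [5., 5.], [6., 5.]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([5.5, 4.8]), k=3))  # -> 1
```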


k-NN Classifier
•  To classify a new input vector x, examine the k closest training data points to x and assign the object to the most frequently occurring class
[Figure: training points in a 2-D feature space with axes X1 and X2]
[Figures: decision boundaries of the 1-Nearest Neighbor (k=1), 3-Nearest Neighbor (k=3), and 5-Nearest Neighbor (k=5) classifiers]
Example: Handwritten Digits
•  16x16 bitmaps
•  8-bit grayscale
•  Euclidean distance over pixels

Accuracy: 7-NN ~95.2%, SVM ~95.8%, Humans ~97.5%
Performance Measures
•  For binary classification, the outcome counts are:
–  True positives (a)
–  False negatives (b)
–  False positives (c)
–  True negatives (d)
•  Precision (selectivity)
–  a/(a+c)
•  Recall (sensitivity, TPR)
–  a/(a+b)
•  Specificity
–  d/(c+d)
•  FPR? (c/(c+d) = 1 - specificity)
•  There is often a trade-off between precision and recall
–  Receiver Operating Characteristic (ROC) curves are used to accurately assess performance in relation to this trade-off
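A small sketch computing these measures from 0/1 label arrays (all names here are illustrative):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Compute precision, recall, specificity, and FPR from 0/1 labels."""
    tp = np.sum((y_pred == 1) & (y_true == 1))  # a: true positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # b: false negatives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # c: false positives
    tn = np.sum((y_pred == 0) & (y_true == 0))  # d: true negatives
    return {
        "precision": tp / (tp + fp),     # a/(a+c)
        "recall": tp / (tp + fn),        # a/(a+b)
        "specificity": tn / (tn + fp),   # d/(c+d)
        "fpr": fp / (fp + tn),           # c/(c+d)
    }

y_true = np.array([1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0])
print(binary_metrics(y_true, y_pred))
```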
Performance Measures
[Figure: table of confusion-matrix-derived performance measures; source: Wikipedia]
Relationship to Modeling
Functional Representation
•  Variable X is related to variable Y, and we would like to learn this relationship from the data
•  Example: X = [weight, blood glucose, …] and Y is whether the person will have diabetes or not
•  We assume there is a relationship between X and Y:
–  it is less likely to see certain X co-occur with "low risk" and unlikely to see some other X co-occur with "high risk"
•  This relationship is encapsulated by the joint distribution P(X, Y)
Formulation of Learning
Multitude of Learning Frameworks
•  Presence/absence of labeled data
Multitude of Learning Frameworks
•  Type of Labels:
Multitude of Learning Frameworks
•  Problems also differ in the protocol for obtaining
data
–  Passive Learning
–  Active Learning
•  and in assumptions in data
–  Batch
–  Online
Examples of Regression Problems
•  Predict tomorrow’s stock market price given current
market conditions and other possible side information.
•  Predict the age of a viewer watching a given video on
YouTube.
•  Predict the location in 3D space of a robot arm end
effector, given control signals (torques) sent to its
various motors.
•  Predict the amount of prostate specific antigen (PSA) in
the body as a function of a number of different clinical
measurements.
•  Predict the temperature at any location inside a building
using weather data, time, door sensors, etc.
Linear Regression
•  Model: y = w^T x + ε
–  y: outcome variable (real-valued label)
–  w: coefficient vector
–  x: feature vector
–  ε: residual error
•  Optimization problem: Given instances of x and y, compute the coefficient vector that minimizes the residual error
•  Common assumption: the residual error has a Gaussian (normal) distribution
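A minimal least-squares sketch of this setup on synthetic data (np.linalg.lstsq performs the minimization; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data following the model y = w^T x + Gaussian residual error
n, d = 100, 3
X = rng.normal(size=(n, d))                     # feature vectors
w_true = np.array([1.5, -2.0, 0.5])             # true coefficient vector
y = X @ w_true + rng.normal(scale=0.1, size=n)  # real-valued labels

# Coefficient vector minimizing the sum of squared residuals
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)  # close to w_true
```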
What Can Go Wrong?
Overfitting
[Figure: fits of a 14-degree polynomial and a 20-degree polynomial to the same data]

•  Which model is more likely to be generalizable?
–  If we try to fit the model to the training data too accurately, we will end up with an overly complex model that considers every bit of variation in the data as information
•  Occam's Razor: prefer the simpler model
Overfitting Example: k-NN

[Figure: training data and MAP estimates of the decision boundary for K=1 and K=5]

•  What happens if we grow K further?
•  How about when K=N (the number of samples)?
Overfitting vs. Underfitting
[Figure: misclassification rate of K-NN on training and test samples as a function of K; to assess overfitting, reserve a portion of the samples as test samples]
Bias and Variance
Model Selection
(Do not confuse this K with the K in K-NN!)

•  K-fold Cross-Validation
–  Randomly partition the training samples into K groups
–  Repeat for i = 1 to K:
•  Reserve the ith group as the test set
•  Train on the remaining K-1 groups
–  Results in one prediction per sample
•  Use a measure of classification accuracy to assess performance
•  Repeat the above procedure multiple times
•  Report the mean and variance of the performance figures across these multiple runs
•  K = 5 is usually a good choice
–  K = N => Leave-One-Out Cross-Validation
•  Why repeat multiple times? Why variance?
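A minimal NumPy sketch of repeated K-fold cross-validation as outlined above (kfold_cv and predict_fn are illustrative names, not from the course):

```python
import numpy as np

def kfold_cv(X, y, predict_fn, k=5, repeats=10, seed=0):
    """Repeated K-fold cross-validation.

    predict_fn(X_train, y_train, X_test) -> predicted labels.
    Returns the mean and variance of accuracy across the repeats."""
    rng = np.random.default_rng(seed)
    n = len(y)
    accuracies = []
    for _ in range(repeats):
        folds = np.array_split(rng.permutation(n), k)  # random partition into K groups
        preds = np.empty(n, dtype=y.dtype)
        for i in range(k):                              # reserve the ith group as test set
            test = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            preds[test] = predict_fn(X[train], y[train], X[test])
        accuracies.append(np.mean(preds == y))          # one prediction per sample
    return np.mean(accuracies), np.var(accuracies)

# Toy usage with a 1-NN prediction rule
def one_nn(X_tr, y_tr, X_te):
    d = np.linalg.norm(X_te[:, None, :] - X_tr[None, :, :], axis=2)
    return y_tr[np.argmin(d, axis=1)]

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(20, 2)), rng.normal(size=(20, 2)) + 3])
y = np.array([0] * 20 + [1] * 20)
print(kfold_cv(X, y, one_nn))  # (mean accuracy, variance)
```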
Curse of Dimensionality
•  The space grows exponentially with the number of
dimensions (features)
–  If the number of training samples stays constant as the
number of features goes up, the data gets sparser
–  Distance/locality-based classifiers (e.g., k-NN) are particularly vulnerable to the curse of dimensionality
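An illustration of this sparsity effect (not from the slides): with a fixed sample size, nearest and farthest distances concentrate as the dimensionality grows:

```python
import numpy as np

rng = np.random.default_rng(2)

# Fixed number of samples, growing number of features: the gap between
# the nearest and the farthest neighbor shrinks, so "closeness" loses meaning.
for d in [2, 10, 100, 1000]:
    X = rng.random((100, d))                       # 100 points in the unit cube
    dists = np.linalg.norm(X - X[0], axis=1)[1:]   # distances from the first point
    print(d, round(dists.min() / dists.max(), 3))  # ratio approaches 1 as d grows
```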
Breaking The Curse
•  Feature Selection
–  Explicitly aim at reducing the number of features by trying and evaluating models with different combinations of a few features
•  Computationally difficult since the number of combinations is
exponential
–  A parsimonious (less complex) model is always more
desirable for interpretability and generalizability
•  Dimensionality Reduction
–  Pre-process the data to obtain lower-dimensional
representations by discovering latent patterns
•  Principal Component Analysis (PCA)/ Singular Value
Decomposition (SVD)
Dimensionality Reduction via PCA

[Figure: 2-D data cloud with its mean and the first and second principal components]
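A minimal PCA sketch via SVD of the mean-centered data (illustrative, not the course's implementation):

```python
import numpy as np

def pca(X, n_components=2):
    """Project X onto its top principal components.

    Returns the projected data, the component directions, and the mean."""
    mean = X.mean(axis=0)
    Xc = X - mean                                      # center the data at the mean
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # SVD of the centered data
    components = Vt[:n_components]                     # directions of maximal variance
    return Xc @ components.T, components, mean

# Toy usage: correlated 2-D data reduced to 1 dimension
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [1.0, 0.3]])
Z, comps, mu = pca(X, n_components=1)
print(Z.shape)  # (200, 1)
```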
Interpretability vs. Accuracy
•  Data mining: Find interpretable patterns
–  Emphasis on understanding
•  Machine learning: Make accurate predictions
–  Emphasis on utility
More Data - Cleverer Algorithm
•  Lack of a sufficient number of examples is usually the biggest challenge in machine learning
–  Algorithmic sophistication can address this problem only
to a certain extent, since increasing complexity can
result in overfitting
•  The classifier should get better as it sees more
examples
–  If not, can we say it is a good algorithm?
•  Extreme case: Deep Learning (Convolutional
Neural Networks)
–  Works poorly with few samples
–  Works great if you have millions of samples
AI vs ML vs DL
•  Nice blog post by NVIDIA:
•  https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/
What will be Covered?
•  This class should give you the basic foundation for applying machine learning
•  Supervised learning
–  Bayesian learning
–  Linear models for regression and classification
–  Instance-based learning
–  Support vector machines
–  Decision Trees
–  Ensemble models
–  Deep Learning
•  Unsupervised learning
–  Clustering
–  Dimensionality reduction
•  Model selection and evaluation
•  Sequential Models
–  Hidden Markov Models
•  Additional topics (if time permits)
