Introduction to
Machine Learning
CS464
Bilkent University
CS464
• An undergraduate-level introductory course that aims to give a broad overview of many concepts and algorithms in machine learning.
Text Classification
• Classify each incoming e-mail as ham or spam
Reading Zip Codes
• Classifying handwritten digits from binary images
Recommendation Systems
Statistical Machine Translation
Recognizing Faces
Mitchell et al., Science, 2009.
https://youtu.be/D5VN56jQMWM
IBM Watson
• Theoretical questions:
– What can be learned? Under what conditions?
– What learning guarantees can be given?
– What can we say about the inherent ease or difficulty of
learning problems?
Machine Learning?
• Role of Probability and Statistics: Inference from a sample
[Figure: iris flowers of three species: Setosa, Versicolor, and Virginica]
Example: A Digit Recognizer
• Imagine you are asked to write a program to recognize handwritten digits
[Image: a handwritten digit 3]
Digit Recognizer
• Each handwritten digit image is an example (instance, or object) in your data
Labels (Space of outputs)
• Labels are all ten digits, 0 through 9
• Given a new image, predict its label
Nearest Neighbor Classifiers
• Basic idea:
– If it walks like a duck, quacks like a duck, then it’s
probably a duck
[Figure: compute the distance from the test sample to each training sample]
• Prediction algorithm (sketched below):
– Classify a new example x_new by finding the training example (x_i, y_i) whose x_i is nearest to x_new, and predict its label y_i
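A minimal NumPy sketch of the nearest-neighbor rule (the function and array names are illustrative, not from the slides):

```python
import numpy as np

def nn_predict(X_train, y_train, x_new):
    """Predict the label of x_new as the label of its nearest training example."""
    # Euclidean distance from x_new to every training example
    dists = np.linalg.norm(X_train - x_new, axis=1)
    return y_train[np.argmin(dists)]
```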
k-NN Classifier
[Figures: decision boundaries of the 1-nearest-neighbor (k = 1), 3-nearest-neighbor (k = 3), and 5-nearest-neighbor (k = 5) classifiers]
Example: Handwritten Digits
• 16x16 bitmaps
• 8-bit grayscale
• Euclidean distance over pixels (see the sketch below)
Source: Wikipedia
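A sketch of this digit classifier, assuming the training images are stored as rows of a matrix of flattened 16x16 bitmaps (names are illustrative):

```python
import numpy as np
from collections import Counter

def knn_digit(X_train, y_train, image, k=3):
    """Classify a 16x16 grayscale digit image by majority vote among its k
    nearest training images, using Euclidean distance over raw pixel values."""
    x = image.reshape(-1).astype(float)           # flatten 16x16 -> 256-dim vector
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to each training image
    nearest = np.argsort(dists)[:k]               # indices of the k closest images
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]  # majority label
```

With k = 1 this reduces to the nearest-neighbor rule from the previous slides.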
Relationship to Modeling
Functional Representation
• Variable X is related to variable Y, and we would like to learn
this relationship from the data
• Example: X = [weight, blood glucose, ...] and Y is whether the person will have diabetes or not.
• We assume there is a relationship between X and Y:
– it is less likely to see certain X co-occur with “low risk” and unlikely to
see some other X co-occur with “high risk".
y = w^T x, where y is the outcome variable (real-valued labels), w is the coefficient vector, and x is the feature vector
• To assess overfitting, reserve a portion of the labeled samples as test samples
• Misclassification rate: the fraction of test samples whose predicted label differs from the true label (see the sketch below)
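A one-line NumPy sketch of this metric (the function name is illustrative):

```python
import numpy as np

def misclassification_rate(y_true, y_pred):
    """Fraction of test samples whose predicted label differs from the true label."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))
```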
Bias and Variance
Model Selection
• K-fold Cross-Validation, sketched in code below (do not confuse this K with the k in k-NN!)
– Randomly partition the training samples into K groups
– Repeat for i = 1 to K:
• Reserve the i-th group as the test set
• Train on the remaining K-1 groups
– Results in one prediction per sample
• Use a measure of classification accuracy to assess performance
– Repeat the above procedure multiple times
• Report the mean and variance of performance figures across these multiple runs
• K = 5 is usually a good choice
o K = N => Leave-One-Out Cross-Validation
• Why repeat multiple times?
• Why variance?
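A NumPy sketch of one run of the procedure; train_and_eval is a placeholder for fitting and scoring whatever classifier is under evaluation, and the whole function would itself be repeated multiple times as the slide suggests:

```python
import numpy as np

def k_fold_cv(X, y, train_and_eval, K=5, seed=0):
    """K-fold cross-validation: every sample is used exactly once as a test sample.
    train_and_eval(X_tr, y_tr, X_te, y_te) should fit a model and return its
    accuracy on the held-out fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)  # random partition into K groups
    scores = []
    for i in range(K):
        test = folds[i]                                  # reserve the i-th group
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        scores.append(train_and_eval(X[train], y[train], X[test], y[test]))
    return np.mean(scores), np.var(scores)               # summary across the K folds
```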
Curse of Dimensionality
• The space grows exponentially with the number of
dimensions (features)
– If the number of training samples stays constant as the
number of features goes up, the data gets sparser
– Distance/locality-based classifiers (e.g., k-NN) are particularly vulnerable to the curse of dimensionality (see the sketch below)
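A quick illustration (not from the slides): for random points in the unit hypercube, the gap between the nearest and farthest neighbor shrinks as the dimension grows, so distance-based notions of locality degrade:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((1000, d))                     # 1000 uniform samples in [0, 1]^d
    dists = np.linalg.norm(X[1:] - X[0], axis=1)  # distances from one point to the rest
    # In high dimensions this ratio approaches 1: "nearest" is barely closer than "farthest"
    print(f"d={d:4d}  nearest/farthest distance ratio: {dists.min() / dists.max():.3f}")
```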
Breaking The Curse
• Feature Selection
– Explicitly reduce the number of features by training and evaluating models on different small subsets of the features
• Computationally difficult, since the number of feature subsets is exponential
– A parsimonious (less complex) model is generally more desirable for interpretability and generalizability
• Dimensionality Reduction
– Pre-process the data to obtain lower-dimensional
representations by discovering latent patterns
• Principal Component Analysis (PCA) / Singular Value Decomposition (SVD); a code sketch follows below
Dimensionality Reduction via PCA
[Figure: 2-D data cloud with its mean and the first two principal component directions marked]
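A minimal PCA sketch via SVD of the mean-centered data, in line with the PCA/SVD approach named on the slides (the function name is illustrative):

```python
import numpy as np

def pca(X, n_components=2):
    """Project X onto its top principal components via SVD of the centered data."""
    mean = X.mean(axis=0)
    Xc = X - mean                                      # center the data at the mean
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt: principal directions
    components = Vt[:n_components]                     # directions of maximal variance
    return Xc @ components.T, components, mean         # projections, directions, mean
```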
Interpretability vs. Accuracy
• Data mining: Find interpretable patterns
– Emphasis on understanding
• Machine learning: Make accurate predictions
– Emphasis on utility
More Data vs. Cleverer Algorithm
• A lack of sufficient training examples is usually the biggest challenge in machine learning
– Algorithmic sophistication can address this problem only
to a certain extent, since increasing complexity can
result in overfitting
• The classifier should get better as it sees more
examples
– If not, can we say it is a good algorithm?
• Extreme case: Deep Learning (Convolutional
Neural Networks)
– Works poorly with few samples
– Works great if you have millions of samples
AI vs ML vs DL
• Nice blog post by NVIDIA:
• https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/
What will be Covered?
• This class should give you the basic foundation for applying machine learning
• Supervised learning
– Bayesian learning
– Linear models for regression and classification
– Instance-based learning
– Support vector machines
– Decision Trees
– Ensemble models
– Deep Learning
• Unsupervised learning
– Clustering
– Dimensionality reduction
• Model selection and evaluation
• Sequential Models
– Hidden Markov Models
• Additional topics (if time permits)