2024 Machine Learning Intro
Introduction to Machine
Learning
Benjamin Rosman
Benjamin.Rosman1@wits.ac.za
“Write a rap about purple ducks robbing a bank.”
ChatGPT / Midjourney
Machine Learning – COMS3007
Course Details
• Lecturer:
• Prof. Benjamin Rosman
• Contact details:
• The course will be run through the Moodle page
• Email: Benjamin.Rosman1@wits.ac.za
• Lecture Venues and Times:
• Lectures will be every Friday from 10h15-12h00 in FNB35.
• Tutorials and Labs:
• Tuts/labs are on Tuesdays from 14h15-16h00 in the ground
floor MSL labs. There will be weekly hand-ins and
occasional quizzes.
Course Outline
1. Introduction to Machine Learning
2. Naïve Bayes and Probability
3. Decision Trees
4. Linear Regression
5. Logistic Regression
6. Neural Networks
7. K-means Clustering
8. Practical Application of ML Methods
9. Principal Components Analysis
10. Reinforcement Learning
Assessments
• Labs: 10%
• Tests/quizzes: 20%
• Assignments: 30%
• Exam: 40%
Textbooks
There is no prescribed textbook. We will loosely be following these
textbooks:
• Machine Learning, Tom M. Mitchell, McGraw-Hill, 1997
• Pattern Recognition and Machine Learning, Christopher M. Bishop,
Springer, 2006.
• Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew
G. Barto, The MIT Press, 1998.
There are many texts on Machine Learning in the library and on the web
that have information on the above topics.
You can also look up the Coursera course “Machine Learning” by Andrew
Ng (Stanford University) which follows similar content.
What is expected of you?
• You can collaborate unless otherwise stated, but
make sure you acknowledge!
• Labs and the assignment will be done in groups
• Programming in python will be expected but not
taught
• All labs and assignments require programming
• I advise working in groups a lot: you will learn from
each other
• If you use a tool such as ChatGPT, please
acknowledge when you did, and list the prompts
that you used.
Mathematics
This is a maths-heavy course!
Prediction
Hand-Written Digits
We need a model:
• Modelling assumptions (or knowledge about the problem) go here!
• y = f(x; θ)
• θ = parameters of model
[Figure: data points in a 2D feature space, with one axis labelled 𝑥1]
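As a minimal sketch of the y = f(x; θ) idea: the straight-line form of f and the parameter values below are illustrative assumptions, not part of the course material.

```python
# A model y = f(x; theta): the modelling assumption here is that f is a
# straight line, so theta = (w, b) are the parameters to be learned.

def f(x, theta):
    """Predict y from feature x using parameters theta = (w, b)."""
    w, b = theta
    return w * x + b

theta = (2.0, 1.0)    # parameters chosen by hand; learning would fit these to data
print(f(3.0, theta))  # -> 7.0
```

Learning then amounts to choosing θ so that the model's predictions match the training data.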
An example: Training
[Figure: labelled training points are added to the 2D feature space]
We want the model to generalise to any point in this space that we
haven’t seen yet.
An example: Querying
[Figure: a new, unlabelled query point, marked “?”]
Categories of ML
• Supervised learning
• Predict output y when given input x
• Learn from labelled data: {(xi, yi)} (Make predictions!)
• Classification: y categorical
• Regression: y real-valued
• Unsupervised learning (Understand data!)
• Learn from unlabelled data: {xi}
• Clustering
• Learning some structure in the data
• Semi-supervised learning (Combine the above!)
• Only some labels provided
• Reinforcement learning
• Learn from rewards (typically delayed)
• Generate own data (experience) through interacting with an environment
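As a toy sketch of the supervised setting above, learning from labelled pairs (xi, yi) and then predicting y for a new x, here is a 1-nearest-neighbour predictor. The data, the labels, and the choice of 1-NN are all illustrative assumptions, not the course's method.

```python
# Supervised learning sketch: the "model" memorises labelled pairs (x, y)
# and predicts the label of the closest training point (1-nearest-neighbour).

def nearest_neighbour_predict(train, x_new):
    """train: list of (x, y) pairs with numeric 1D features; returns a label."""
    closest = min(train, key=lambda pair: abs(pair[0] - x_new))
    return closest[1]

# Labelled data {(x_i, y_i)}: y is categorical, so this is classification.
labelled = [(1.0, "cat"), (2.0, "cat"), (8.0, "dog"), (9.0, "dog")]
print(nearest_neighbour_predict(labelled, 1.5))  # -> cat
print(nearest_neighbour_predict(labelled, 8.5))  # -> dog
```

Unsupervised learning would receive only the x values, with no labels to predict.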
Examples:
• Learn to fly a helicopter
• Learn to make coffee
• Learn to play chess
Generalising
During training, only a small fraction of the possible data will
be provided.
Trade-off between
• Expressive: accurately capture distinctions in data
• Sparse: not need prohibitive amounts of data
• Feature selection
• Autonomously identify important dimensions
• Feature learning
• Combine simpler features into more complex ones
• E.g. deep learning (when we talk about neural networks)
Data
For any ML algorithm to work, we need data, and
more is always better. In ML, we “let the data do the
talking”.
Much work goes into collecting data sets. For large
models (many parameters), we may need many
millions of examples to learn a good model.
But, how do we know how well
the model will generalise?
Pro tip: never trust people
who mess this up!
Splitting the Data
Typically divide the full data set into three:
• Training data: learn the model parameters
• This is the core learning part, and so it needs the most data
• +/- 60% of the data
• Validation data: learn the model hyperparameters
• Hyperparameters are values set before training begins, e.g. the
degree of the polynomial, the complexity of the neural network
• +/- 20% of the data
• Testing data: report quality of model
• This is used to report an unbiased evaluation of the final model
• +/- 20% of the data
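The 60/20/20 split above might be sketched as follows; the function name and the shuffling seed are illustrative.

```python
# Split a data set into training (~60%), validation (~20%) and testing (~20%).
# Shuffling first matters: if the data are ordered (e.g. by class), an
# unshuffled split would give biased subsets.
import random

def split_data(data, seed=0):
    data = data[:]                    # copy so the caller's list is untouched
    random.Random(seed).shuffle(data)
    n = len(data)
    n_train = int(0.6 * n)
    n_val = int(0.2 * n)
    train = data[:n_train]
    val = data[n_train:n_train + n_val]
    test = data[n_train + n_val:]     # not touched until the very end!
    return train, val, test

train, val, test = split_data(list(range(100)))
print(len(train), len(val), len(test))  # -> 60 20 20
```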
Why split the data?
[Figure: a red model curve passing exactly through the blue training points,
with unseen points marked x]
The red model fits the blue training points perfectly, so those points
will not give a reliable estimate of how well the model will generalise.
Instead, we want to test it on new data points that it has never seen
during training. This gives a better idea of its performance.
Similarly, we may be learning the hyperparameter M, the degree of the model, by
training a straight-line model (M=1), a quadratic model (M=2), … up to M=9 and then
seeing which is best. We can train them all on the same training data, but we need
separate validation data to choose the best one. Again, we can’t just report its
performance on that data, as the choice is already biased towards it. So we then
need a different testing set to report final scores.
The test data must not be touched until the very end! It is the “blind/surprise test”.
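The degree-selection procedure above can be sketched like this; the synthetic sin(2πx) data, the split sizes, and the use of numpy.polyfit are assumptions for illustration.

```python
# Choose the hyperparameter M (polynomial degree) on validation data:
# fit every candidate on the same training set, pick the degree with the
# lowest validation error, then report performance once on the test set.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 30)

x_train, t_train = x[:18], t[:18]   # ~60%
x_val, t_val = x[18:24], t[18:24]   # ~20%
x_test, t_test = x[24:], t[24:]     # ~20%, untouched until the end

def val_error(M):
    w = np.polyfit(x_train, t_train, M)   # fit a degree-M polynomial
    return np.mean((np.polyval(w, x_val) - t_val) ** 2)

best_M = min(range(1, 10), key=val_error)
w = np.polyfit(x_train, t_train, best_M)
test_mse = np.mean((np.polyval(w, x_test) - t_test) ** 2)
print(best_M, test_mse)
```

The validation score of the winning degree is biased (it was used to choose), which is exactly why the final number comes from the test set.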
Example: Polynomial Curve Fitting
Simple regression (supervised learning) problem
[Figure: data points sampled from the true unknown function sin(2πx);
vertical axis: the target label t (= y), horizontal axis: the feature x (1D)]
Goal: given a new x, predict t (target)
A Polynomial Function
Assume the function is polynomial:
y(x, w) = w₀ + w₁x + w₂x² + … + w_M x^M
Define the error E(w) between the predicted value y(xₙ, w) and the true value tₙ,
summed over every data point. The error is squared so it is symmetrical, and the
½ makes the maths simpler after differentiating:
E(w) = ½ Σₙ (y(xₙ, w) − tₙ)²
Learning:
• Find the weight vector w to minimise the error E(w)
• E(w) is quadratic in w, so E′(w) is linear in w
• Unique solution w*
More on this example in the linear
regression lecture.
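Since E(w) is quadratic with a unique minimiser w*, the fit reduces to solving a linear least-squares problem. A minimal sketch, using numpy rather than any course-specified library:

```python
# Fit w* minimising E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2 by least squares
# on the design matrix whose columns are the powers of x.
import numpy as np

def fit_polynomial(x, t, M):
    """Return the weight vector w* for a degree-M polynomial fit."""
    X = np.vander(x, M + 1, increasing=True)   # columns: x^0, x^1, ..., x^M
    w, *_ = np.linalg.lstsq(X, t, rcond=None)
    return w

x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
t = np.sin(2 * np.pi * x)
w = fit_polynomial(x, t, 3)
y_pred = np.vander(x, 4, increasing=True) @ w
# Residuals are (numerically) zero here: by symmetry, an odd cubic passes
# exactly through these five points.
print(y_pred - t)
```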
Model Selection
Choosing M (polynomial order)
For M = 9, E(w*) = 0! But the goal is to generalise!
Training vs Testing Error
To compare errors across data sets of different size N, define the
root-mean-square error: E_RMS = √(2E(w*)/N)
Overfitting: high error on test data, low error on training data
More data means:
• Less severe over-fitting
• A more complex model can be fitted
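A small experiment, on assumed synthetic data, illustrating the effect above: the training error can only decrease as M grows (an M = 9 polynomial interpolates 10 training points), while the test error need not.

```python
# Compare training vs testing RMS error for polynomial models of
# increasing degree M on noisy samples of sin(2*pi*x).
import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(0, 1, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 10)
x_test = rng.uniform(0, 1, 50)
t_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 50)

def rms(x, t, w):
    """Root-mean-square error of polynomial w on data (x, t)."""
    return np.sqrt(np.mean((np.polyval(w, x) - t) ** 2))

for M in (1, 3, 9):
    w = np.polyfit(x_train, t_train, M)
    print(M, rms(x_train, t_train, w), rms(x_test, t_test, w))
```

With 10 training points, the M = 9 fit drives the training error to essentially zero, which is the overfitting signature from the slide: low training error, high test error.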