Fundamentals of AI UPDATED
Programmes: BSc. ITS II, BSc. ICTM II, & BSc. ICTB II
Code of conduct
▶ Observe the following code of conduct;
• Be in class on time. Latecomers will not be allowed in.
• Mute or switch off your mobile phones while in class.
• All communications concerning CSS 128 lectures, tutorials,
notes, assignments, tests etc. will be done via CRs.
• Any excuse for not attending lecture or tutorial sessions
should be communicated at the beginning of the lecture/tutorial
via CR.
• Use English for all communications concerning CSS 128.
• Strictly adhere to the University academic timetable and
deadlines.
• No substitute assignment/test will be given to any student who
fails to write one without good reason.
• Read all references provided.
• Violation of academic integrity will not be tolerated and will be
dealt with severely in accordance with MU academic regulations.
• Any communications via emails, including submission of
assignments, MUST be done via student’s respective MU email
(@mustudent.ac.tz) and not otherwise.
Code of Conduct - Cont’d
▶ Course assessment:
• Quiz - Many.
• 2 Assignments @ 10%.
• 2 tests @ 15%.
• University Examination (UE) - 50%.
▶ Marks for assignments, tests or UE cannot be compromised or
negotiated.
▶ Hope to enjoy your maximum cooperation.
Requirements for the course
▶ Requirements
i. Basic knowledge of Python programming language
ii. Python language, at least Python v3.8
iii. Jupyter-notebook
iv. Python libraries/packages including numpy, matplotlib and pandas,
installed with a package manager such as pip3 or conda
v. Scikit-learn library
vi. TensorFlow framework
vii. Alternatively, you may install the latest Anaconda platform
which contains Python, Jupyter-notebook and several libraries
viii. Fairly powerful computer
Artificial intelligence (AI)
Benefits of applying AI
▶ AI improves accuracy: Intelligent computer-based agents
produce more accurate output than humans when they are
correctly programmed and given reliable input (data).
▶ Improves human safety and enables difficult and dangerous
exploration: AI makes it possible to perform tasks that would
otherwise be impossible, risky or dangerous to the lives of
human operators.
▶ Increases efficiency: Computer-based intelligent agents are
faster than human beings, have far greater computational
power, can work continuously (24 hours a day, 7 days a week)
without resting, and can process very large amounts of data
accurately, e.g. AI-driven chatbots and virtual assistants.
▶ Automation: AI systems are effective at automating
repetitive and time-consuming tasks, which reduces human
intervention and cost, improves productivity and frees
human resources for more strategic and creative tasks,
e.g. assembly lines, manufacturing plants and web data
extraction.
Challenges in applying AI
▶ Algorithmic bias: Since AI models (trained algorithms) are
developed from the data fed to them, if a large number of
examples in the learning process come from a certain group,
the resulting models tend to be biased against other groups.
• It leads to unfair or discriminatory outcomes in domains like
hiring, criminal justice, lending, traffic, healthcare.
▶ Lack of transparency and explainability: Many AI models are
complex, and it is difficult to understand how they reach
decisions based on what they have learned from data.
• They are therefore considered "black boxes".
▶ Ethical concerns: The use of AI raises ethical dilemmas such
as privacy, surveillance, and the potential for misuse with
many countries still lacking comprehensive guidelines and
frameworks to guarantee ethical use of AI.
▶ Regulation and Governance issues: Determining
responsibility for decisions and actions made by AI systems
is complex, especially in cases of system errors or misuse.
▶ Security and robustness issues: AI systems can be vulnerable
to data theft and/or manipulation, which can lead to incorrect
results or to malicious attacks such as unauthorized access to
systems and automated cyberattacks.
▶ Job displacement: Automation and AI technologies have the
potential to replace certain jobs, which can lead to
workforce disruptions.
▶ Data privacy issues: AI systems may collect users’
information without their consent. In addition, many
countries lack regulations which protect users’ privacy
rights with regard to the use of AI technology.
▶ High development cost: Developing and implementing AI
solutions can be very expensive in terms of time, money and
labour incurred on data collection and processing activities,
computational power etc.
▶ Energy consumption: Training and running AI systems require
a great deal of computing power and electricity, and the
resulting CO2 emissions significantly contribute to carbon
footprint, which aggravates climate change.
Applications of AI
Definitions of key terms
▶ Sample: An instance of an entity (record) that possesses
values for all features of the data set. It is also referred
to as observation, example or tuple.
▶ Prediction: It is a final output label, class or value
predicted by the model.
▶ Prediction error: A measure of distance (difference) between
predicted output/value and the target (expected) output. It
is commonly referred to as the loss or loss value, and the
function that computes it as the loss function.
▶ Training: The process of passing training data to an AI
algorithm so that it finds patterns in the training data.
▶ Class: One of a set of possible labels to choose from in a
classification problem. Examples: "Positive", "Negative",
"Dog", "Cat", "Girl", "Boy".
▶ Ground truth: A set of all target values for a data set.
▶ Target: It is the output that the model should ideally have
predicted according to an external source of data. It is
also referred to as expected value.
▶ Feature: Any column of the input matrix that we
are using as an independent variable. It is also known as
an attribute or measurement
• The number of features in a data set is called its dimension
▶ Classification: It is a process of categorizing data into
predefined classes.
▶ Binary classification: A classification that involves only
two classes, for instance "Positive", "Negative" or "True",
"False" or "Yes", "No" or "Fraudulent", "Genuine".
▶ Accuracy: Percentage of data (validation or test set) that
is correctly predicted (classified or detected) by AI model.
▶ Hyperparameters: Configuration settings of the AI
learning algorithm that a user can adjust to affect the
performance of the model
▶ Model parameter: Variables of the AI learning algorithm
estimated from training data which minimize total loss and
specify how to transfer the input data into desired output.
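The definition of accuracy above can be sketched in a few lines of Python. This is a minimal illustration, not a library routine: the function name `accuracy` and the five sample labels are made up for the example.

```python
def accuracy(targets, predictions):
    """Return the percentage of predictions that match their targets."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    correct = sum(1 for t, p in zip(targets, predictions) if t == p)
    return 100.0 * correct / len(targets)


# Hypothetical ground truth and model predictions for five samples
targets = ["Positive", "Negative", "Positive", "Negative", "Positive"]
predictions = ["Positive", "Negative", "Negative", "Negative", "Positive"]

print(accuracy(targets, predictions))  # 4 of 5 correct -> 80.0
```

Here the targets play the role of the ground truth and each element of `predictions` is one prediction of a binary classifier.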
Installing Python
Managing packages - Windows OS
Intelligent agent and environment
▶ Common characteristics of intelligent agents include;
• Adaptation based on experience
• Real time problem solving
• Analysis of error or success rates
• Use of memory-based storage and retrieval
▶ A software agent receives keystrokes, file contents, network
packets as sensory inputs.
▶ It acts on the environment by displaying output on the screen,
writing files and sending network packets.
▶ The agent function for an artificial agent is implemented by
an agent program.
▶ A rational agent is the one that does the right thing.
▶ The agent’s rationality is determined by;
• Performance measure used to define the criterion of success
• Agent’s prior knowledge of environment
• The actions that the agent can perform
• The agent’s percept sequence to date (i.e. history of all
inputs the agent has perceived)
▶ When designing an agent, the task environment must be specified
clearly.
▶ The agent’s task environment must specify the Performance,
Environment, Actuators, Sensors, abbreviated as PEAS
▶ Performance measure evaluates the behaviour of the agent in
the environment
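A PEAS description can be written down as a small data structure. The sketch below is only an illustration: the class name `TaskEnvironment` and the spam-filter entries are assumptions chosen for the example and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class TaskEnvironment:
    """PEAS description of an agent's task environment."""
    performance: list
    environment: list
    actuators: list
    sensors: list


# Illustrative values only (a software agent; the entries are assumptions)
spam_filter = TaskEnvironment(
    performance=["spam blocked", "legitimate mail delivered"],
    environment=["mail server", "user inbox"],
    actuators=["move message to spam folder", "deliver message"],
    sensors=["message headers", "message text"],
)

print(spam_filter.sensors)
```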
▶ Task environments:
• Fully observable environment: Complete state of an environment
is available at any given time.
• Partially observable environment: Only part of the state is
available to the agent, because its sensors are noisy or
inaccurate, or because some states are missing.
• Single agent environment: One agent operates in the
environment
• Multi-agent environment: Multiple agents operate in the task
environment
• Deterministic environment: The next state of the environment
is completely determined by the environment’s current state
and the action executed by the agent
• Known environment: The outcomes for all actions are given
• Unknown environment: The environment is unknown to an agent,
and therefore it has to learn how to work in order to make good
decisions
• Stochastic environment: The next state of the environment is
not completely determined by the current state
♦ A partially observable environment may appear stochastic
• Uncertain environment: Not fully observable or not
deterministic
• Dynamic environment: The task environment changes while the
agent is deliberating.
• Static environment: The environment does not change while the
agent is deliberating.
♦ Static environments are easier for an agent to deal with
• Cooperative environment: Increase of performance by one agent
maximizes performance measure of all agents
♦ Avoiding collisions by a self-driving car maximizes performance of
all agents in the environment
• Competitive environment: An increased performance by one agent
decreases performance of an opponent agent in the environment.
▶ Goal-based agent
▶ Utility-based agents
• Knowing the goal is not enough for an agent to generate high
quality behaviour.
• The agent needs to choose the action with the best utility
• The agent needs to know the utility of each possible action (choice)
• This lets it distinguish between more and less desirable actions
• An agent’s utility function is an internalization of the
performance measure
• This enables the agent to choose the action that maximizes its
utility.
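The idea of choosing the action with the highest utility can be sketched as follows. The function name `choose_action`, the route names and the utility values are hypothetical and serve only to show the selection step.

```python
def choose_action(actions, utility):
    """Return the action with the highest utility."""
    return max(actions, key=utility)


# Hypothetical utilities for a route-planning agent (higher is better)
route_utility = {"highway": 0.7, "city centre": 0.4, "back roads": 0.9}

best = choose_action(route_utility.keys(), route_utility.get)
print(best)  # back roads
```

A goal-based agent would treat all three routes as equally acceptable if they reach the destination, whereas the utility values let the agent rank them.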
Intelligent agent and environment: Learning agent
▶ It is an agent that has the capability to learn from the
environment and its experience in order to improve its
performance (i.e., maximize its performance measure).
▶ Learning allows an agent to acquire more knowledge about its
environment, which was initially unknown to it, and so become
more competent (master it better).
▶ Key conceptual components of learning agent;
• Learning element: Responsible for making improvements
• Performance element: Responsible for selecting external
actions
• Critic: Provides feedback on the agent’s actions to determine how
its performance element should be modified to do better in the future
• Problem generator: Responsible for suggesting actions that
will lead to new and informative experiences.
▶ The design of the learning element depends on the design of
the performance element
▶ A critic tells the learning element how well (or badly) the
agent is doing with respect to a fixed performance standard.
Problem solving agent
Machine Learning (ML)
▶ A branch of artificial intelligence (AI) that provides
systems with the ability to automatically learn and improve
performance through exposure to data or experience without
being explicitly programmed.
▶ It involves training a learning algorithm using appropriate
data for the algorithm to acquire the knowledge (experience).
▶ The data used in the process of training a machine learning
algorithm is collectively called the learning set or dataset.
▶ The acquired knowledge (experience) will be used in the
future to perform a similar task on new but similar data.
▶ The learning set can consist of images, text, numerical
values, sound etc.
▶ During the training process, a learning algorithm finds patterns
in the data, that is, how the input corresponds to the target.
▶ The goal of ML is to develop a model (trained algorithm) with
the capability to perform the task for which it was developed.
▶ Supervised learning
• It is the most common type of ML.
• An agent (learning algorithm) observes several examples of
labeled input-output pairs to learn the relationship (pattern)
between them.
• It uses labeled data, i.e. data in which each input (example)
has a known output (target), to learn a mapping function that
turns the input variable (x) into the output variable (y).
• Two types of supervised learning are;
♦ Classification
♦ Numeric prediction (regression)
▶ Classification
• The most common type of supervised learning, in which a model (a
classifier) predicts categorical (discrete) and unordered class
labels
• Examples, a model classifies human images into "Male" or
"Female", or medical data into "Positive" or "Negative" for a
positively and negatively diagnosed patient, respectively.
• Dataset contains a set of predefined classes, and each input
(example) belongs to one of these predefined classes.
• The model predicts the category of the new data sample
(observation).
• Common applications of classification tasks include;
♦ Target marketing
♦ Medical diagnosis
♦ Fraud detection
♦ Performance prediction
♦ Loan payment
♦ Spam filtering
♦ Gender classification
♦ Weather forecasting
Machine Learning (ML): Supervised learning
▶ Numeric prediction (regression)
• A type of supervised learning that learns a function mapping
input x to an ordered, continuous (real-valued) output variable y.
• It is also known as regression analysis.
• A model (predictor) predicts a quantity or size as a real value,
which may be an integer or floating-point value.
• A common example of a numeric prediction algorithm is linear
regression
• Common applications involve model’s prediction of;
♦ Life expectancy (age) based on eating pattern, medication, disease
state and other living conditions.
♦ House price based on house size (number of rooms), age, floor
size, location, roof type, owner’s job etc.
♦ Salary of an employee given academic qualifications
♦ Crop yield based on soil precipitation, rainfall, fertilizer,
temperature etc.
Machine Learning (ML): Unsupervised learning
▶ A type of machine learning that develops a model without
relying on labeled data.
▶ The most common examples of unsupervised learning algorithms
include K-means and hierarchical clustering (K-nearest neighbor,
despite the similar name, is a supervised algorithm)
▶ Three main types of unsupervised machine learning are;
• Clustering (data segmentation): It finds potentially similar
objects (data samples) among the input examples and groups them
into one cluster.
♦ Each cluster contains objects that are similar to one another but
dissimilar to objects in other clusters.
♦ Common applications include in biology, customer segmentation, web
searches, fraud detection (in banking transactions, phone calls,
Internet usage)
• Dimensionality reduction: Involves reducing number of
attributes (features) per data sample in a high dimensional
dataset by removing uninformative, noisy or redundant features
in order to produce a reduced or compressed data structure with
fewer features.
♦ Aims at improving the model’s accuracy, processing speed and data
visualization, and at reducing storage space and model complexity.
• Density estimation: It is the problem of reconstructing the
probability density function using a set of given data points
Machine Learning: Reinforcement learning
Dataset (learning set)
Dataset: Splitting of the dataset
▶ Dataset for ML/DL is commonly split (divided) into three
categories/sets;
• Training set
• Validation set
• Test set
▶ Care must be taken when splitting the dataset into these sets in
order to avoid overlapping (shared) samples and human bias.
▶ All sets must have similar features. Differences in their
features, improper splitting etc. are likely to cause
overfitting or underfitting.
▶ Training set:
• Comprises the bulk of the entire dataset, about 70-90% of the
entire dataset.
• A large size ensures that the model learns as many patterns
(features) from the data as possible.
• Must contain as many of the patterns found in the population as
possible.
• When limited (small) data is available, data augmentation
technique is used to increase the size of the dataset.
▶ Validation set
• A set used to evaluate the performance of the model at the end
of each epoch during training.
• It comprises about 10% to 30% of the entire dataset.
• It must not contain any samples used during training in order
not to leak information of the training data.
• Validation usually runs for one epoch, with the total number of
iterations equal to the number of samples in the validation set.
• When you have a large validation set, increase the batch size,
which in turn changes the number of iterations.
▶ Test set
• The smallest of all sets.
• A set used to check model’s performance after it has been
trained and evaluated.
• It consists of up to 10% of the entire dataset.
• It is applied to verify the model’s performance after training
is complete, usually on the best performing model.
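A three-way split along these lines can be sketched with the standard library alone. The function name `split_dataset`, the 80/10/10 proportions and the fixed seed are assumptions for this example; shuffling before slicing is one simple way to reduce human bias, and slicing a single shuffled list guarantees the sets share no samples.

```python
import random


def split_dataset(samples, train=0.8, validation=0.1, seed=42):
    """Shuffle samples and split them into training, validation and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    training_set = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_val]
    test_set = shuffled[n_train + n_val:]  # whatever remains
    return training_set, validation_set, test_set


data = list(range(100))  # stand-in for 100 samples
training_set, validation_set, test_set = split_dataset(data)
print(len(training_set), len(validation_set), len(test_set))  # 80 10 10
```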
Public (benchmark) datasets
▶ MNIST Dataset:
• A dataset of handwritten digits containing 60,000 training
images and 10,000 testing images.
• Commonly used for classification tasks.
▶ Fake News Detection Dataset: A csv file that contains 7796
rows with four columns, news, title, news text and result.
▶ Titanic Dataset: Contains information like name, age, sex,
number of siblings etc for about 891 passengers in the
training set and 418 passengers in the testing set.
▶ Credit card Fraud Detection Dataset:
• Contains transactions made by credit cards.
• They are labeled as Fraudulent or genuine.
▶ Stanford Dogs Dataset: Contains 20,580 images for 120
different dog breed categories.
▶ ImageNet Dataset:
• One of the most common, popular and largest image datasets for
computer vision.
• Contains over 1.28 million images for training and 50,000
images for validation for 1000 categories.
▶ Common Objects in Context (MS COCO) Dataset: A large scale
object detection and segmentation dataset with over 300,000
images covering 80 categories of common objects (with label
ids running up to 90) such as person, laptop, bicycle, car,
airplane, bus, train, bird, cat, dog, giraffe, horse, sheep,
cow, elephant, cup, fork, chair, book etc.
▶ CIFAR-10 Dataset:
• Contains small, low resolution (32 x 32) colour
images of 10 classes of objects.
• Contains 50,000 and 10,000 images for training and testing
respectively.
• The ten classes are airplane, automobile, bird, cat, deer, dog,
frog, horse, ship and truck.
▶ CIFAR-100: Like CIFAR-10, but for 100 different classes.
Arrays
numpy
pandas
▶ It provides easy-to-use data structures and data analysis
tools.
▶ In ML, a substantial amount of time is spent on preparing the
data as well as analyzing basic trend and patterns.
▶ This is where pandas earns the attention of ML experts.
▶ It is built on top of numpy to provide additional higher
level data manipulation and analysis tools that make working
with tabular data even more convenient.
▶ You can use it to read data from a broad range of data
sources like CSV, SQL databases, JSON files and Excel.
▶ It is simple to use and allows developers to manage complex
data operations with just one or two commands.
▶ When data is clean, and structured, every row represents an
observation and every column a feature, and both rows and
columns can have labels.
▶ A commonly used alias for pandas is pd.
Manipulation of CSV files
▶ Start up jupyter-notebook
▶ Import pandas
import pandas as pd
▶ Load the data;
variablename = pd.read_csv("filepath")
▶ variablename is a variable that holds the retrieved data, and
filepath is the complete path of the csv file
▶ For instance;
irisdata = pd.read_csv("iris.csv")
▶ You may optionally add an argument for encoding format like;
irisdata = pd.read_csv("iris.csv", encoding="utf-8")
▶ Windows uses \ while Linux uses / to specify the file path
▶ Row selection - Counting of rows starts from 0
irisdata[:] - Selects all rows
irisdata[:5] - Selects the first five rows
irisdata[5:10] - Selects the rows at positions 5 to 9 (position 10 is excluded)
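The loading and row-slicing steps can be tried without a file on disk. In this sketch the CSV text is a small stand-in for iris.csv (the six rows and the dotted column names are assumptions for the example), and `io.StringIO` takes the place of the file path.

```python
import io

import pandas as pd

# A few illustrative rows standing in for the contents of iris.csv
csv_text = """sepal.length,sepal.width,petal.length,petal.width,variety
5.1,3.5,1.4,0.2,Setosa
4.9,3.0,1.4,0.2,Setosa
7.0,3.2,4.7,1.4,Versicolor
6.4,3.2,4.5,1.5,Versicolor
6.3,3.3,6.0,2.5,Virginica
5.8,2.7,5.1,1.9,Virginica
"""

# read_csv accepts a file path or any file-like object
irisdata = pd.read_csv(io.StringIO(csv_text))

first_two = irisdata[:2]   # rows at positions 0 and 1
middle = irisdata[2:5]     # rows at positions 2, 3 and 4 (5 is excluded)

print(len(first_two), len(middle))  # 2 3
print(middle.index.tolist())        # [2, 3, 4]
```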
▶ Column and row selection
irisdata[["columnName"]] - Displays the selected column
irisdata[["columnName1","columnName2"]] - Selected columns
▶ iloc indexer
Enables selection of rows and columns by integer position,
where 0 denotes the first row
irisdata.iloc[N]-Displays column names and respective
values for the sample at position N
irisdata.iloc[[0]]-Displays the row at position 0 (the first
sample) as a DataFrame
irisdata.iloc[[0,2]]-Displays rows for samples 0 and 2
irisdata.iloc[0:65] - Displays values for all columns from
the 0th to the 64th samples (position 65 is excluded)
irisdata.iloc[[3,4],[1,2]]-Displays values of columns 1 and 2
for samples 3 and 4 (counting from 0)
irisdata.iloc[0:66,-3]-Displays values of the third column
from right for the first 66 samples.
▶ Row and column selection
irisdata.iloc[:66,-3:]-Displays values of the last three
columns for the first 66 rows
irisdata.iloc[:66,1:3] - Displays values of the columns at positions
1 and 2 for the first 66 rows (position 3 is excluded)
irisdata.head()- Displays the first five rows of the data set
irisdata.head(N)-Displays the first N rows of the data set
irisdata.tail()-Displays the last five rows of the data
irisdata.tail(N)-Displays the last N rows of the data
irisdata.info()-Describes the columns of the data set, their
type, number of values for each column, index
number (order) etc.
irisdata.shape - Returns a tuple holding the number of rows and
columns in the data set
irisdata.columns-Produces an index with list of column names.
irisdata['variety'].value_counts()- Extracts classes and their
respective number of samples
irisdata.describe()-Calculates some statistical data like
percentile, mean and std of the numerical
values of the series or DataFrame
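The selection and inspection calls listed above can be checked on a small in-memory DataFrame. The four rows below are made up for the example; only the column name `variety` is taken from the slides.

```python
import pandas as pd

# Small made-up DataFrame; column names mirror the iris example
irisdata = pd.DataFrame({
    "sepal.length": [5.1, 4.9, 7.0, 6.4],
    "petal.length": [1.4, 1.4, 4.7, 4.5],
    "variety": ["Setosa", "Setosa", "Versicolor", "Versicolor"],
})

print(irisdata.shape)                  # (4, 3)
print(irisdata.iloc[0]["variety"])     # Setosa
print(irisdata.iloc[[0, 2]].shape)     # (2, 3)
print(irisdata.iloc[:3, 1:3].shape)    # (3, 2): end positions are excluded
print(irisdata.iloc[:3, -2:].columns.tolist())  # ['petal.length', 'variety']
print(irisdata["variety"].value_counts().to_dict())
```

Note that `1:3` and `-2:` select the same columns here only because this frame has exactly three columns.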
Python Virtual environment - On Windows
▶ You may specify the version of Python to be installed in the
virtual environment as well by running the command;
conda create -n css128pyenv pip python=3.8.5
▶ Activating a virtual environment created;
conda activate css128pyenv
▶ Once activated, the name of your PVE (e.g. css128pyenv)
should be displayed within brackets at the beginning of your
Anaconda CMD path specifier.
Running ML Projects: Steps
1. Create a folder (its name should not contain spaces or special
characters) to save your files and data. Don’t use python as
the folder name as it may be misinterpreted as the python command
2. Use cd command to navigate to the folder created in step 1.
3. Activate your Python virtual environment (PVE)
4. Start up Jupyter-notebook by executing the command;
jupyter-notebook
5. On the window that appears, click New tab, then select
Notebook
6. Select the appropriate PVE of your choice
7. Rename the notebook file (File tab > Rename)
8. In the first cell, write all necessary import statements
9. Import the data into the notebook file
10. Write the rest of the code in the cells
11. Execute each cell before moving to the next cell by clicking
Run button.
Function
▶ Function: A relationship between variables.
▶ It relates an input or set of inputs to a unique output.
▶ x and f(x) denote the input and output of the function,
respectively.
▶ Each input must correspond to exactly one output.
▶ Considering a function which takes an input and squares it to
give the output, 4 is the output if we input 2.
▶ y = f(x) denotes a function in which y depends on x, that is
y is a function of x.
▶ x and y are the independent and dependent variables, respectively.
▶ f(x) = 3x + 10 is a function that takes the input x, multiplies
it by 3 and adds 10.
▶ f(x) can also be written as f.
▶ The set of a function’s possible inputs is called its domain, and
the set of its outputs is called its range.
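The two functions mentioned above translate directly into Python. This is a small sketch; the names `square`, `domain` and `value_range` are chosen for the example.

```python
def f(x):
    """f(x) = 3x + 10: multiply the input by 3, then add 10."""
    return 3 * x + 10


def square(x):
    """Return the square of the input."""
    return x ** 2


print(square(2))  # 4
print(f(2))       # 16

# Each input in the domain maps to exactly one output in the range
domain = [0, 1, 2, 3]
value_range = [f(x) for x in domain]
print(value_range)  # [10, 13, 16, 19]
```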
Linear function
Linear equation
▶ Linear equation is an equation in which the highest power of
the variable is always 1
▶ The standard (general) form of a linear equation in one
variable is Ax + B = 0, where A and B are real numbers and x
is a single variable.
▶ The standard form of a linear equation in two variables is
Ax + By = C, where A and B are coefficients of variables x
and y, respectively, and C is a constant.
▶ Linear equation formula is the way of expressing a linear
equation.
▶ A linear equation can be expressed in the standard form; the
slope-intercept form, or the point-slope form.
• A standard form; either Ax + B = 0 or Ax + By = C.
• The slope-intercept form; y = mx + c, where m is the slope
(m ≠ 0) and c is the y-intercept, i.e., the point at which the
straight line crosses the y axis (y value when x is 0).
• Point-slope form; y − y1 = m(x − x1) (where m = slope and
(x1, y1) is a point on the line)
Slope, y-intercept, and equation of the line
▶ The slope (gradient) of a linear equation is the amount by
which the line is rising or falling.
▶ It is the change in y coordinate with respect to the change
in x coordinate.
▶ It represents the rate of change of y for each unit increase
in x .
▶ If (x1, y1) and (x2, y2) are any two points on a line then its
slope (m) is calculated using the formula;
$m = \frac{y_2 - y_1}{x_2 - x_1} = \frac{\Delta y}{\Delta x}$
▶ Slope of a line between two points is said to be the rise of
the line from one point to another along y-axis over the run
along x-axis;
$m = \frac{\text{Rise}}{\text{Run}} = \frac{\text{vertical numerical difference}}{\text{horizontal numerical difference}}$
▶ The slope of a line gives the measure of its steepness and
direction
▶ Given a straight line with points P(4, 2) and R(6, 8), find its
slope, equation of the line, y-intercept and plot its graph.
• Taking P as (x1, y1) and R as (x2, y2), the slope of the line is
$m = \frac{y_2 - y_1}{x_2 - x_1} = \frac{8 - 2}{6 - 4} = \frac{6}{2} = 3$
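The slope calculation for P and R can be checked with a short function, which also recovers the y-intercept from the slope-intercept form y = mx + c. The function name `line_through` is an assumption for this sketch.

```python
def line_through(p, r):
    """Return slope m and y-intercept c of the line through points p and r."""
    (x1, y1), (x2, y2) = p, r
    if x2 == x1:
        raise ValueError("vertical line: slope is undefined")
    m = (y2 - y1) / (x2 - x1)
    c = y1 - m * x1  # from y1 = m*x1 + c
    return m, c


m, c = line_through((4, 2), (6, 8))
print(m, c)  # 3.0 -10.0
```

With m = 3 and c = −10 the equation of the line is y = 3x − 10.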
Linear function and graph
x -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10
y -25 -22 -19 -16 -13 -10 -7 -4 -1 2 5 8 11 14 17 20
Linear graph
Linear Regression
Linear Regression: Simple linear regression
Linear Regression: Inspection method
Example 1
The figures show the output (in thousands of tons) and expenditure
on energy ($) for a firm over ten monthly periods.
Output (x) 20 22 25 26 21 23 28 20 25 29
Expenditure (y) 106 138 158 172 120 142 184 102 165 192
Linear Regression: Semi-averages method
Example 2
Twelve administrative trainees in a company took an aptitude test
in two parts, one designed to test the ability to do appropriate
calculations and the other designed to test skill in interpreting
results. Their scores were as follows;
Trainee (x) A B C D E F G H I J K L
Calculation score 23 56 74 29 82 45 36 51 60 55 52 88
Interpretation score 16 38 65 39 32 51 11 19 47 54 43 50
Linear Regression: Least square method
▶ The main drawback of the semi-averages method is that it relies
on only two mean points, which may distort the regression
line (equation) when there are extreme values in x and y.
▶ The least squares method, also known as ordinary least squares
(OLS), addresses this drawback by involving all values and
is thus considered more robust than (superior to) other methods.
▶ It attempts to provide the best-fitting (regression) line for
a given set of data by minimizing the sum of the squares of
the vertical deviations (residuals) of each data point from
the line, known as the sum of squared residuals (SSR) (see Fig. 11).
▶ The deviation of a point residing on the line is 0.
▶ As deviations are squared, there is no cancellation between
positive and negative values of the deviations.
▶ Using this method, the values of a and b of regression line
y = a + bx are obtained using the following formulae;
$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}; \quad a = \frac{\sum y}{n} - b\,\frac{\sum x}{n}$ (1)
a = ȳ − b x̄ (3)
▶ Where;
x̄ is the mean value for x values
ȳ is the mean value for y values
▶ Estimates a and b are the values that minimize the sum of
squared residuals (SSR)
Linear Regression: Least square method
Solution for Example 1
Given equation of the line y = a + bx , find the value of a and b.
x y xy x²
20 106 2120 400
22 138 3036 484
25 158 3950 625
26 172 4472 676
21 120 2520 441
23 142 3266 529
28 184 5152 784
20 102 2040 400
25 165 4125 625
29 192 5568 841
$\sum x = 239$, $\sum y = 1479$, $\sum xy = 36249$, $\sum x^2 = 5805$
Formula
$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}; \quad a = \frac{\sum y}{n} - b\,\frac{\sum x}{n}$
b = 9.6975
Having obtained b, we can obtain a as follows;
$a = \frac{1479}{10} - 9.6975 \times \frac{239}{10}$
a = 147.9 − 9.6975 × 23.9
a = 147.9 − 231.77
a = −83.87
Equation of the line;
y = −83.87 + 9.6975x
y = 9.6975x − 83.87
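The least squares formulae for a and b can be turned into a short function and applied to the x and y columns of the solution table. This is a minimal sketch using only built-in Python; the function name `least_squares` is an assumption, and the data are copied from the table rows of the solution for Example 1.

```python
def least_squares(xs, ys):
    """Return (a, b) of the line y = a + b*x using the least squares formulae."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b


# x and y columns as listed in the solution table for Example 1
x = [20, 22, 25, 26, 21, 23, 28, 20, 25, 29]
y = [106, 138, 158, 172, 120, 142, 184, 102, 165, 192]

a, b = least_squares(x, y)
print(round(a, 2), round(b, 4))
```

Running the function on the table columns is a quick way to cross-check a hand calculation.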
Interpolation and extrapolation
Correlation techniques
Correlation techniques: Coefficient of correlation (correlation coefficient)
▶ Coefficient of correlation: A statistical measure which
expresses the strength of the linear correlation between
bivariate data (variables) x and y.
▶ r is a dimensionless quantity as it does not depend on the
units employed.
▶ Represented by the symbol r, the correlation coefficient lies
between -1 and +1, that is −1 ≤ r ≤ +1.
▶ r = 0 signifies that there is no linear correlation between
the bivariate variables.
▶ Correlation coefficient is commonly computed using the
following formula;
$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2} \times \sqrt{n\sum y^2 - (\sum y)^2}}$ (4)
Correlation techniques: Computing r and R
Example 4
The data in Table 3 relates the weekly maintenance cost ($) to the
age (in months) of ten machines of similar type in a manufacturing
company. Calculate the correlation coefficient (r ) and coefficient
of determination (R) between age and cost.
Machine 1 2 3 4 5 6 7 8 9 10
Age (x ) 5 10 15 20 30 30 30 50 50 60
Cost (y) 190 240 250 300 310 335 300 300 350 395
Correlation coefficient and R
Solution for Example 4
x y xy x² y²
5 190 950 25 36100
10 240 2400 100 57600
15 250 3750 225 62500
20 300 6000 400 90000
30 310 9300 900 96100
30 335 10050 900 112225
30 300 9000 900 90000
50 300 15000 2500 90000
50 350 17500 2500 122500
60 395 23700 3600 156025
$\sum x = 300$, $\sum y = 2970$, $\sum xy = 97650$, $\sum x^2 = 12050$, $\sum y^2 = 913050$
Formula
$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2} \times \sqrt{n\sum y^2 - (\sum y)^2}}$
Calculations
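$r = \frac{10 \times 97650 - 300 \times 2970}{\sqrt{10 \times 12050 - 300^2} \times \sqrt{10 \times 913050 - 2970^2}}$
$r = \frac{976500 - 891000}{\sqrt{120500 - 90000} \times \sqrt{9130500 - 8820900}}$
$r = \frac{85500}{\sqrt{30500} \times \sqrt{309600}}$
$r = \frac{85500}{174.64 \times 556.417}$
$r = \frac{85500}{97172.66}$
r = 0.88
Then $R = r^2 = 0.88^2 = 0.774$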
R = 0.774 means that 77.4% of the variation in maintenance cost is
explained by age
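The same calculation can be reproduced in Python as a cross-check. This sketch uses only the standard library; the function name `correlation` is an assumption, and the age and cost values are copied from the x and y columns of the solution table.

```python
from math import sqrt


def correlation(xs, ys):
    """Return the correlation coefficient r of paired values xs and ys."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


# Age (x) and cost (y) columns as listed in the solution table for Example 4
age = [5, 10, 15, 20, 30, 30, 30, 50, 50, 60]
cost = [190, 240, 250, 300, 310, 335, 300, 300, 350, 395]

r = correlation(age, cost)
print(round(r, 2), round(r ** 2, 3))  # 0.88 0.774
```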
Linear Regression: Practical
Linear Regression: Simple linear regression
▶ Simple linear regression is a linear regression of the form
y = c + bx where;
• y is the dependent variable, also known as outcome or response
• x is the independent variable (predictor)
• b is the slope of the line also known as regression coefficient
• c is the intercept
▶ In ML, y and x are commonly known as target and feature
vector, respectively.
▶ The regression line is the line that best fits the data points
in a plot of the predicted variable (y) against the predictor
variable (x).
▶ Steps to observe when implementing linear regression;
• Install and import the packages needed, including numpy, pandas,
matplotlib and scikit-learn (which provides LinearRegression)
• Provide data to work with, and make appropriate transformation
• Visualize the data points to see if the data is suitable for
the linear regression
• Create the regression model using the linear regression
algorithm and fit it with the existing data
• Assess the results of the model fitting to know whether it is
satisfactory
• Apply the model for prediction
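▶ The fitting and prediction steps can be sketched without any external library by using the closed-form least-squares estimates of b and c. This is only an illustration under stated assumptions: the function names fit_line and predict are hypothetical, the data are the age and cost values from the solution table of Example 4, and the age of 40 months is an arbitrary query value. In practice, scikit-learn's LinearRegression performs the same ordinary least-squares fit.

```python
def fit_line(x, y):
    """Least-squares estimates of slope b and intercept c for y = c + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    b = s_xy / s_xx
    # Intercept: the fitted line passes through the point of means
    c = mean_y - b * mean_x
    return b, c

def predict(x_new, b, c):
    """Apply the fitted model to a new value of the predictor."""
    return c + b * x_new

# Provide data to work with (age in months, weekly cost in $)
age = [5, 10, 15, 20, 30, 30, 30, 50, 50, 60]
cost = [190, 240, 250, 300, 310, 335, 300, 300, 350, 395]

# Create the model and fit it with the existing data
b, c = fit_line(age, cost)

# Apply the model for prediction (40 months is an arbitrary example)
predicted_cost = predict(40, b, c)
print(f"b = {b:.4f}, c = {c:.4f}, predicted cost = {predicted_cost:.2f}")
```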
▶ Data structure: A collection of values stored in one variable,
with common examples being arrays, dictionaries, lists and stacks.
▶ Array: A data structure that stores more than one value at a
time.
• We refer to the objects within an array as its elements.
• We refer to the elements of an array by numbering them and
then indexing them.
• If we have n elements in the sequence, we think of them as
being numbered from 0 to n - 1.
▶ One-dimensional array: A one-dimensional (1D) array is an
array that stores a sequence of (references to) objects.
• Elements of a 1D array are indexed by a single integer
• values = [100, 200, 300, 400]
• values[2] # will display 300
▶ Two-dimensional array: A two-dimensional (2D) array is an
array of (references to) one-dimensional arrays.
• The elements of a 2D array are indexed by a pair of integers:
the first specifying a row, and the second specifying a column.
• T = [[11, 12, 5, 2], [15, 6, 10], [10, 8, 12, 5], [12, 15, 8, 6]]
• T[0] produces [11, 12, 5, 2]
• T[0][1] produces 12
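▶ The indexing rules above can be tried directly with Python lists. A minimal sketch: the 1D array is named values rather than list so that Python's built-in list type is not shadowed, and the other variable names are illustrative.

```python
# A 1D array as a Python list
values = [100, 200, 300, 400]
third = values[2]               # index 2 refers to the third element
last = values[len(values) - 1]  # indices run from 0 to n - 1

# A 2D array as a list of lists, indexed as T[row][column]
T = [[11, 12, 5, 2], [15, 6, 10], [10, 8, 12, 5], [12, 15, 8, 6]]
first_row = T[0]      # the whole first row
element = T[0][1]     # row 0, column 1

print(third, last, first_row, element)
```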
Classification
▶ Classification algorithms predict one of a number of discrete
outcomes.
▶ For example, an email client sorts mail into personal mail
and junk mail: two outcomes.
▶ Classification involves machine learning algorithms which learn
to associate some input with some output, given a training set of
examples of inputs x and outputs y.
▶ The output y is commonly provided by a human "supervisor".
▶ Most classification algorithms are based on estimating a
probability distribution p(y | x).
▶ This can be done by using maximum likelihood estimation to
find the best parameter vector θ for a parametric family of
distributions p(y | x; θ).
▶ Popular classification algorithms include;
• Logistic regression
• K-nearest neighbors (KNN)
• Support vector machine (SVM)
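▶ As an illustration of maximum likelihood estimation, logistic regression models p(y = 1 | x; θ) with a sigmoid function and chooses θ = (w, b) to maximise the log-likelihood of the training labels. The sketch below does this by plain gradient ascent for a single feature. The training set, learning rate and number of steps are hypothetical values chosen for the example, and the function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Estimate theta = (w, b) by gradient ascent on the mean log-likelihood."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the mean log-likelihood with respect to w and b
        grad_w = sum((y - sigmoid(w * x + b)) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((y - sigmoid(w * x + b)) for x, y in zip(xs, ys)) / n
        w += lr * grad_w
        b += lr * grad_b
    return w, b

# Hypothetical training set: larger x values tend to belong to class 1
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(xs, ys)
p_low = sigmoid(w * 0.5 + b)    # estimated p(y = 1 | x = 0.5)
p_high = sigmoid(w * 4.0 + b)   # estimated p(y = 1 | x = 4.0)
print(f"p(y=1 | x=0.5) = {p_low:.3f}, p(y=1 | x=4.0) = {p_high:.3f}")
```

▶ A sample is then assigned to class 1 when the estimated probability exceeds 0.5, and to class 0 otherwise.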
Classification - Binary classification
▶ If we have two possible classes (class 0 and class 1), then
we need only specify the probability of one of these classes.
▶ The probability of one class determines that of another class
as the two probabilities must add up to 1.
▶ This is binary classification, in which a classifier outputs
one of two labels (output values), commonly represented as
{1, 0}, {+1, -1}, {True, False}, {blue, red}, etc.
▶ In binary classification, the two labels are often referred
to as the positive and negative classes, respectively, such
as {+1, -1}.
▶ A support vector machine (SVM) is driven by the linear function
w^T x + b, where w is the weight vector, w^T x is the weighted sum
of the input x, and b is a constant called the bias.
▶ However, SVM does not output class probabilities, but class
identity.
▶ When w^T x + b is positive, SVM predicts the presence of the
positive class.
▶ When w^T x + b is negative, SVM predicts the presence of the
negative class.
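▶ The decision rule can be sketched in a few lines. The weights and bias below are hypothetical; in a real SVM they are learned from training data. The function name svm_predict is illustrative, and treating a score of exactly 0 as the negative class is an assumption of this sketch.

```python
def svm_predict(w, x, b):
    """Return +1 if w^T x + b is positive, otherwise -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else -1

# Hypothetical parameters for a two-feature problem
w = [2.0, -1.0]
b = -0.5

label_a = svm_predict(w, [1.0, 0.5], b)  # score = 2.0 - 0.5 - 0.5 = 1.0
label_b = svm_predict(w, [0.0, 1.0], b)  # score = 0.0 - 1.0 - 0.5 = -1.5
print(label_a, label_b)
```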
▶ SVM is suitable for binary classifications, i.e.,
classification in which the outcome consists of two values.
▶ Binary classification considers a predictor (classifier) of
the form f : R^D → {+1, -1}.
▶ Each sample (data point) x_n is represented as a feature
vector of D real numbers.
▶ The two labels should not be confused with ordinary positive or
negative numbers; they represent two distinct classes.
▶ For instance, in a cancer detection task, a patient with cancer
can be labeled +1, and one without it -1.
▶ SVM solves a classification task using a set of examples
x_n ∈ R^D, along with their corresponding (binary) labels
y_n ∈ {+1, −1}.
▶ Given a training data set consisting of example-label pairs
{(x_1, y_1), ..., (x_N, y_N)}, we estimate the parameters of the
model which produce the smallest classification error.
▶ Intuitively, binary classes can be represented in a plane,
and separated by a hyperplane (linear separator) as shown in
Fig. 14.
▶ Every data sample x_n is a two-dimensional location
(x_n^(1), x_n^(2)), and the corresponding binary label y_n is
either a red disc or a blue disc.
▶ An xy-plane represents data examples consisting of two
classes which contain features arranged in a way that can be
separated (classified) by drawing a straight line.
▶ The idea behind many classification algorithms is to
represent data in R^D and then partition this space such that
examples with the same label are in the same partition
(side).
▶ With binary classifications, the hyperplane in the xy-plane
divides the space into two partitions corresponding to the
positive and negative classes, respectively.
Support vector machines (SVMs)
Standardization of dataset
▶ Standardization aims at making raw data more amenable to MLAs.
▶ Common standardization techniques include vectorization,
normalization/scaling, data cleaning and feature cleaning.
▶ Since each feature contains values in their own range, it is
recommended to feed the MLAs with small values in similar
scale (range) because MLAs work better with smaller values
than larger ones.
▶ It is therefore important to scale the values of all features
of the dataset into the same standard (scale).
▶ Data normalization achieves this by transforming each feature
(x) so that it has a mean of 0 and a standard deviation of 1.
▶ Several Python libraries provide many tools (functions) to
perform data standardization.
▶ Ensure that the data values have the following characteristics;
• Contain small values, typically in the range 0-1.
• Be homogeneous: All features take values in the same range
• Where stricter normalization is needed, the values of each
feature have a mean of 0 and a standard deviation of 1.
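▶ Two common ways of bringing a feature onto a standard scale are min-max scaling (values in the range 0-1) and z-score standardization (mean 0, standard deviation 1). The sketch below implements both with the standard library only. The feature values are hypothetical and the function names are illustrative; libraries such as scikit-learn provide equivalent tools.

```python
import statistics

def min_max_scale(values):
    """Rescale values linearly so that they lie in the range 0-1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Rescale values to have mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

# Hypothetical raw values of one feature
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]

scaled = min_max_scale(heights_cm)
standardized = z_score(heights_cm)
print(scaled)
print([round(z, 3) for z in standardized])
```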
Data encoding
▶ Since MLAs process only numerical values (numbers) and not
text, it is important to encode any categorical values
(labels) of any feature into some form of numerical values.
▶ The three classes (labels) of the iris dataset, i.e., [setosa,
versicolor, virginica], must be encoded into a certain form of
numbers to be fed as input to any MLA.
▶ Two methods commonly used to encode categorical data into
numerical format are;
• One-hot-encoded format
✓ Given [’Setosa’, ’Versicolor’, ’Virginica’], a prediction [0, 1,
0] means the model has predicted ’Versicolor’ while [1, 0, 0] means
’Setosa’.
• Class indices: Class labels are represented using integers,
commonly starting with 0.
✓ The [’Setosa’, ’Versicolor’, ’Virginica’] can be represented as
[0,1,2]
▶ As with data standardization, Python offers a wide range of
libraries with tools to perform these encodings automatically.
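▶ Both encodings can be written by hand in a few lines, which helps to show what the library tools do. A minimal sketch in plain Python; the helper names class_to_index and one_hot are illustrative, and the list of labels to encode is hypothetical.

```python
classes = ["Setosa", "Versicolor", "Virginica"]

# Class indices: each label is mapped to an integer starting at 0
class_to_index = {name: i for i, name in enumerate(classes)}

def one_hot(label):
    """Return a vector with a 1 at the position of the label's class index."""
    vector = [0] * len(classes)
    vector[class_to_index[label]] = 1
    return vector

# Hypothetical labels to encode
labels = ["Versicolor", "Setosa", "Virginica"]
indices = [class_to_index[label] for label in labels]
encoded = [one_hot(label) for label in labels]

# Decoding a one-hot prediction back to a class name
predicted = classes[[0, 1, 0].index(1)]
print(indices, encoded, predicted)
```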
Model evaluation and Evaluation metrics
▶ Models must be evaluated to assess their performance on new
(unseen) data.
▶ Although you may measure the performance of the model on the
training set (data used to develop the model), common practice is
to use the validation set and the test set.
▶ A model’s performance is estimated using evaluation
(performance) metrics appropriate for the task.
▶ Evaluation metric: A quantitative measure used to
assess the performance and effectiveness of a statistical or
machine learning model.
▶ The final performance estimate of the model is made on
out-of-sample data, i.e., data that was not "seen" (used)
during the training process, such as the test set.
▶ Different sets of evaluation metrics exist for classification
tasks and numeric prediction tasks.
▶ It is important to choose the right evaluation metrics for a
particular ML task in order to assess the model objectively.
Evaluation metrics - Classification tasks
▶ Several evaluation metrics are used to evaluate performance
of classification models. These include;
• True Positive (TP): A correct prediction (classification) of a
positive class, i.e. both the actual (target) value and the
predicted value are positive.
✓ A model correctly predicts a positive sample as positive.
✓ A model correctly predicts a Setosa sample as Setosa
• True Negative (TN): A correct prediction of a negative
class, i.e. both the actual value and the predicted value are
negative.
✓ The ground truth contains no object and the model predicts no object
✓ The model predicts a no-cancer sample as no-cancer
• False Positive (FP): An incorrect prediction of a negative
sample as positive, i.e. the model predicts a non-existing
object or misclassifies an existing object.
✓ The ground truth is negative but the model predicts a positive sample
✓ The model predicts a Setosa sample as Versicolor (a false
positive for the Versicolor class)
✓ The model predicts a no-cancer sample as cancer
• False Negative (FN): An incorrect prediction by the model of a
positive class as negative.
✓ A ground-truth positive sample is falsely (incorrectly)
rejected by the model.
✓ The model predicts a cancer sample as no-cancer.
▶ TP, TN, FP and FN are often organized into a confusion
matrix.
▶ Confusion matrix: It is a table describing the performance
of a machine learning model, such as a classification model.
▶ TP, FN, FP, and TN are combined to produce more
informative metrics. These include;
• Recall: Also referred to as sensitivity.
✓ It is defined as the proportion of ground-truth positive samples
that the model correctly classifies as positive.
✓ It is expressed as;
\[
\text{Recall} = \frac{TP}{TP + FN} = \frac{TP}{\text{All ground-truth positives}} \tag{5}
\]
✓ Ground-truth: A set of true (target) values for all samples in a
given data set.
• Precision: It is the proportion of true positive samples among
all positive predictions (TP + FP) made by the model.
\[
\text{Precision} = \frac{TP}{TP + FP} \tag{6}
\]
• Accuracy: It is the proportion of predictions which give the
correct answer among all predictions made by the model.
\[
\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \tag{7}
\]
✓ Describes the model’s performance across all classes in the dataset.
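▶ The three metrics can be computed directly from lists of actual and predicted labels. A minimal sketch in plain Python: the label lists are hypothetical, class 1 is assumed to be the positive class, and the function name confusion_counts is illustrative.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary classification task."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn

# Hypothetical ground-truth labels and model predictions
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(tp, tn, fp, fn, recall, precision, accuracy)
```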
Evaluation metrics - Classification tasks
\[
\frac{150}{165} = 0.9090 = 90.90\%
\]
Evaluation metrics - Numeric prediction
(Regression) tasks
▶ Given Table 4 below (n = 5), compute the values of MAE,
MSE and RMSE.
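▶ The three metrics can be computed as in the sketch below, which assumes the standard definitions: MAE is the mean of the absolute errors, MSE is the mean of the squared errors, and RMSE is the square root of MSE. The actual and predicted values are hypothetical (n = 5) and are not the values of Table 4.

```python
import math

# Hypothetical actual and predicted values, n = 5
actual = [3.0, 5.0, 2.5, 7.0, 4.5]
predicted = [2.5, 5.0, 4.0, 8.0, 4.0]

n = len(actual)
errors = [a - p for a, p in zip(actual, predicted)]

mae = sum(abs(e) for e in errors) / n   # mean absolute error
mse = sum(e ** 2 for e in errors) / n   # mean squared error
rmse = math.sqrt(mse)                   # root mean squared error
print(f"MAE = {mae:.3f}, MSE = {mse:.3f}, RMSE = {rmse:.3f}")
```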
Deep learning
▶ Compared to conventional ML algorithms and other types of
ANNs, DL offers a number of advantages including;
• They learn and extract features from data automatically (i.e.
they do not need manually extracted features)
• They are well suited to parallelization (i.e., parallel
processing) on single or multiple central processing units (CPUs),
graphics processing units (GPUs) and tensor processing units (TPUs)
• They can be trained on additional data without starting from
scratch.
• They can be trained on one dataset and applied to other,
similar data
• They can efficiently handle large and high-dimensional data
including audio, text, images, etc.
▶ Major types of DL models include;
• Convolutional neural networks (CNNs)
• Recurrent neural networks
• Recursive neural networks
• Autoencoders
• Deep belief networks (DBNs)
• Generative adversarial networks (GANs)
• Deep Q-learning
▶ Of all types of DL algorithms, CNN is commonly used for image
processing tasks including image classification because;
• Their neurons are arranged in three dimensions (width, height,
depth), which matches the structure of an image
✓ This shape enables CNNs to process images efficiently.
• They extract features from images automatically, as manual feature
extraction is time-consuming and labour-intensive
• They have locally connected patches of neurons, which reduce the
number of connections between layers, the number of parameters to
train and the amount of computation.
• They can learn translation-invariant local features, so a
feature learned in one part of an image can be recognized when it
appears in other parts.
• They learn features hierarchically, with early layers learning
generic features about objects, while deeper layers learn more
specific features
• They are designed to exploit the spatial relationships between
pixels.
▶ Examples of CNN architectures include AlexNet, Residual
Network (ResNet), Inception-ResNet, Visual Geometry Group
(VGG), MobileNet and GoogLeNet.
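▶ To illustrate how local connectivity reduces the number of parameters to train, the sketch below compares a fully connected layer with a convolutional layer applied to the same input. The image size, number of units, kernel size and number of filters are hypothetical values chosen for the example, and the function names are illustrative.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units

def conv2d_params(kernel_h, kernel_w, in_channels, n_filters):
    """Weights plus one bias per filter; independent of the image size."""
    return kernel_h * kernel_w * in_channels * n_filters + n_filters

# Hypothetical input: a 28 x 28 grayscale image (one channel)
height, width, channels = 28, 28, 1

# Fully connected layer with 32 units on the flattened image
dense = dense_params(height * width * channels, 32)

# Convolutional layer with 32 filters of size 3 x 3
conv = conv2d_params(3, 3, channels, 32)

print(f"dense layer: {dense} parameters, conv layer: {conv} parameters")
```

▶ Each filter is connected only to a small patch of the input and is reused across the whole image, which is why the convolutional layer needs far fewer parameters.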