
Fundamentals of AI UPDATED


Mzumbe University (MU)

CSS 128: Fundamentals of Artificial Intelligence

Kilima, Frank Godlove

Programmes: BSc. ITS II, BSc. ICTM II, & BSc. ICTB II

June 15, 2024

1 / 132
Code of conduct
▶ Observe the following code of conduct;
• Be in class on time. Latecomers will not be allowed in.
• Mute or switch off your mobile phones while in class.
• All communications concerning CSS 128 lectures, tutorials,
notes, assignments, tests etc. will be done via CRs.
• Any excuse for not attending lecture or tutorial sessions
should be communicated at the beginning of the lecture/tutorial
via CR.
• Use English for all communications concerning CSS 128.
• Strictly adhere to the University academic timetable and
• No substitute assignment/test will be given to any student who
misses one without good reason.
• Read all references provided.
• Violation of academic integrity will not be tolerated, and will be
dealt with severely in accordance with MU academic regulations.
• Any communications via emails, including submission of
assignments, MUST be done via student’s respective MU email
(@mustudent.ac.tz) and not otherwise.

2 / 132
Code of Conduct - Cont’d

▶ Course assessment:
• Quiz - Many.
• 2 Assignments @ 10%.
• 2 tests @ 15%.
• University Examination (UE) - 50%.
▶ Marks for assignments, tests or UE cannot be compromised or
negotiated.
▶ Hope to enjoy your maximum cooperation.

3 / 132
Requirements for the course

▶ Requirements
i. Basic knowledge of Python programming language
ii. Python language, at least Python v3.8
iii. Jupyter-notebook
iv. Python libraries/packages including numpy, matplotlib and pandas,
and the package managers pip3 and conda
v. Scikit-learn library
vi. TensorFlow framework
vii. Alternatively, you may install the latest Anaconda platform
which contains Python, Jupyter-notebook and several libraries
viii. Fairly powerful computer
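
A quick way to confirm that a computer meets the software requirements above is a short check script. The following is a minimal sketch using only the Python standard library. The import names in REQUIRED are assumptions about how the listed packages are imported (for example, scikit-learn is imported as "sklearn" and Jupyter Notebook as "notebook").

```python
import importlib.util
import sys

# Import names assumed for the packages listed in the requirements
# (scikit-learn is imported as "sklearn", Jupyter Notebook as "notebook").
REQUIRED = ["numpy", "matplotlib", "pandas", "sklearn", "tensorflow", "notebook"]


def python_ok(minimum=(3, 8)):
    """Return True if the running interpreter is at least `minimum`."""
    return sys.version_info[:2] >= minimum


def missing_packages(names):
    """Return the names that cannot be found on this installation."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python 3.8 or newer:", python_ok())
    print("Missing packages:", missing_packages(REQUIRED))
```

Any name reported as missing can then be installed with conda or pip.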

4 / 132
Artificial intelligence (AI)

▶ It has many definitions including;

• A branch of science that deals with designing and developing
machines which can think and have a capability to react like
human beings
• The art and science of creating machines which perform
functions that require intelligence when performed by people.
• A way to develop machines which think and behave intelligently
• The efforts to automate intellectual tasks normally performed
by humans.
• The theory and development of computer systems able to perform
tasks which normally require intelligence when performed by
human beings, such as visual perception, speech recognition,
decision making, speech translation, game playing etc.

5 / 132
Benefits of applying AI
▶ AI improves accuracy: Intelligent computer-based agents
produce more accurate output than humans when they are
correctly programmed and given reliable input (data).
▶ Improves human safety and enables difficult and dangerous
exploration: Makes it possible to perform tasks which were
otherwise impossible, risky or dangerous to the lives of
human operators.
▶ Increases efficiency: Computer-based intelligent agents are
faster than human beings, have far greater computational
power, can work for longer (24 hours a day, 7 days a week)
without resting, and can process very large amounts of data
accurately. E.g. AI-driven chatbots and virtual assistants.
▶ Automation: AI systems are effective in automating
repetitive and time-consuming tasks, which reduces human
intervention and cost, improves productivity and frees
human resources for more strategic and creative tasks
- Assembly lines and manufacturing plants, web data
6 / 132
Challenges in applying AI

▶ Complexity in the design of AI architectures: Modern AI
architectures may have hundreds, thousands, tens of thousands
or millions of parameters for optimal performance (high
▶ Need for voluminous data: Training AI algorithms such as
deep learning algorithms to high accuracy may require huge
amounts of data, ranging from thousands to hundreds of
thousands or even millions of data samples.
• Collecting appropriate data may be challenging (unavailability
of data), financially expensive and time consuming
• Data processing and preprocessing are laborious, financially
expensive and time consuming, and require large storage space
▶ Need for huge computational power: Training AI algorithms
requires powerful processors, Graphics Processing Units
(GPUs) and Tensor Processing Units (TPUs), which are expensive
to buy or to rent as subscription services

7 / 132
Challenges in applying AI
▶ Algorithmic bias: Since AI models (trained algorithms) are
developed from the data fed to them, if a large share of the
examples in the learning process come from a certain group,
the resulting models tend to be biased against other groups
• It leads to unfair or discriminatory outcomes in domains like
hiring, criminal justice, lending, traffic, healthcare.
▶ Lack of transparency and explainability: Many AI models are
complex, and it is difficult to explain how they reach
decisions based on what they have learned from data.
• They are therefore often described as "black boxes".
▶ Ethical concerns: The use of AI raises ethical dilemmas such
as privacy, surveillance, and the potential for misuse with
many countries still lacking comprehensive guidelines and
frameworks to guarantee ethical use of AI.
▶ Regulation and Governance issues: Determining
responsibility for decisions and actions made by AI systems
is complex, especially in cases of system errors or misuse.
8 / 132
Challenges in applying AI
▶ Security and robustness issues: AI systems can be vulnerable
to data theft and/or manipulation which can lead to malicious
attacks such as unauthorized access to systems and automated
cyberattacks or incorrect results.
▶ Job displacement: Automation and AI technologies have the
potential to replace certain jobs, which can lead to
workforce disruptions.
▶ Data privacy issues: AI systems may collect users’
information without their consent. In addition, many
countries lack regulations which protect users’ privacy
rights with regard to the use of AI technology.
▶ High development cost: Developing and implementing AI
solutions can be very expensive in terms of time, money and
labour incurred on data collection and processing activities,
computational power etc.
▶ Energy consumption: Training and running AI systems requires
a great deal of computing power and electricity, and the
resulting CO2 emissions add to the carbon footprint, which
aggravates climate change.
9 / 132
Applications of AI

▶ With relevant examples and citations, precisely discuss at
least 10 different applications (uses) of AI. Submission
deadline: Monday, 18th March 2024 at 4:30PM.

10 / 132
Definitions of key terms

▶ AI algorithm: An organized set of procedures which is
run on a particular dataset to recognize (learn) patterns and
rules (from the dataset) so as to create an AI model which can
make predictions on new data
▶ AI model: The output of the trained algorithm, which
acts like a program that can be run on new data (similar to
the data used to train the algorithm) to make predictions.
▶ Dataset: A collection of data (images, audio, video,
numerical data points) that is used to develop AI models
• An AI algorithm is trained on dataset to become an AI model
Model = f (AI algorithm + data)
• AI algorithms provide a type of automatic programming where AI
models represent the program
• Developing any AI model requires a sufficient amount of
appropriate, high-quality data

11 / 132
Definitions of key terms
▶ Sample: An instance of an entity (record) that possesses
values for all features of the data set. It is also referred
to as observation, example or tuple.
▶ Prediction: It is a final output label, class or value
predicted by the model.
▶ Prediction error: A measure of distance (difference) between
the predicted output/value and the target (expected) output. It
is commonly referred to as the loss or loss value, and it is
computed by a loss function.
▶ Training: The process of passing training data to an AI
algorithm so that it finds patterns in the training data.
▶ Class: One of the set of possible labels to choose from in a
classification problem. Examples: "Positive", "Negative",
"Dog", "Cat", "Girl", "Boy".
▶ Ground truth: A set of all target values for a data set.
▶ Target: It is the output that the model should ideally have
predicted according to an external source of data. It is
also referred to as expected value.
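
The idea of prediction error as a distance between predictions and targets can be illustrated with one common choice of loss, the mean squared error. This is a minimal sketch: the slides do not name a specific loss function, so mean squared error and the example numbers are assumptions chosen for illustration.

```python
def mean_squared_error(targets, predictions):
    """Average of the squared differences between targets and predictions."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return sum(squared_errors) / len(targets)


# Hypothetical target values (ground truth) and model predictions.
targets = [3.0, 5.0, 7.0]
predictions = [2.0, 5.0, 9.0]

loss = mean_squared_error(targets, predictions)  # (1 + 0 + 4) / 3
print("Loss:", loss)
```

A loss of zero means every prediction equals its target; larger values mean the predictions are further from the ground truth.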
12 / 132
Definitions of key terms
▶ Feature: It is any column value in the input matrix that we
are using as an independent variable. It is also known as
attribute or measurement
• The number of features in a data set is called dimension
▶ Classification: It is a process of categorizing data into
predefined classes.
▶ Binary classification: A classification that involves only
two classes, for instance "Positive", "Negative" or "True",
"False" or "Yes", "No" or "Fraudulent", "Genuine".
▶ Accuracy: Percentage of data (validation or test set) that
is correctly predicted (classified or detected) by AI model.
▶ Hyperparameters: Configuration settings of the AI
learning algorithm that a user can adjust to affect the
performance of the model
▶ Model parameters: Variables of the AI learning algorithm
estimated from training data which minimize total loss and
specify how to transform the input data into the desired output.
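
The definition of accuracy above can be written directly as a short function. This is a minimal sketch for a binary classification case; the labels and predictions are made-up values used only for illustration.

```python
def accuracy(targets, predictions):
    """Percentage of samples whose predicted label equals the target label."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    correct = sum(1 for t, p in zip(targets, predictions) if t == p)
    return 100.0 * correct / len(targets)


# Hypothetical test-set labels and the labels predicted by a model.
targets = ["Positive", "Negative", "Positive", "Negative"]
predictions = ["Positive", "Negative", "Negative", "Negative"]

score = accuracy(targets, predictions)  # 3 of the 4 labels match
print("Accuracy (%):", score)
```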
13 / 132
Installing Python

▶ Python is one of the most widely used programming languages in
the field of AI.
▶ You can easily download Python from the Internet and install it
on your computer.
▶ However, a convenient method of installing Python is to
install a Python distribution known as Anaconda
▶ Anaconda comes shipped with a recent Python version and many
other packages/libraries necessary for Python programming.
▶ It also provides tools such as conda and pip which are used
to install, update, or remove many other packages not shipped
with standard Python.
▶ It also comes with many libraries pre installed, such as
Jupyter Notebook which is useful in interacting with Python.

14 / 132
Installing Python

▶ Start up the Anaconda set up file

▶ After installing Anaconda, run one of the following commands
to update it;
conda update -n base conda
conda update conda
▶ Use conda command to install, update or remove packages using
this syntax;
conda install package name
▶ To install numpy, run the following command;
conda install numpy
▶ Alternatively, you may use pip tool to install, update or
remove packages
▶ To install numpy using pip, run the following command;
pip install numpy

15 / 132
Managing packages - Windows OS

▶ To install packages, use the pip or conda tools.
conda install package name
pip install package name
▶ For example, to install numpy run the command;
conda install numpy
▶ You can also use pip tool;
pip install numpy
▶ You can also run this command
python -m pip install package name
▶ To upgrade a package such as pip run the command;
python -m pip install --upgrade package name
▶ package name is the name of any package/library you want to
install or upgrade

16 / 132
Installing Python

▶ To update a package you may run the following command;

python -m pip install --upgrade packagename
▶ For instance, to update pip, run the following command;
python -m pip install --upgrade pip
▶ You may start up Anaconda Navigator by clicking the
appropriate icon
▶ Alternatively, you may start up Anaconda prompt, which is
similar to Windows CMD
▶ From there, you may start up many libraries such as Jupyter
▶ You may also use it to get into Python command prompt by
running the following command;

17 / 132
Intelligent agent and environment

▶ An intelligent agent is a machine that has sensors to
perceive its environment and actuators to act upon the
environment.
▶ Sensors allow agents to sense the environment and perceive
its present state.
• Examples of sensors are cameras, microphones, rain sensors,
NLP, infrared range finders
▶ Actuators allow agents to take actions with reference to the
environment perceived.
• Examples of actuators are speakers, motors and screens.
▶ They also possess effectors to affect the environment, e.g.
move, catch/hold things etc.
• Examples of effectors are fingers and wheels
▶ An intelligent agent works without assistance, interprets
inputs, senses its environment, makes choices and acts to
achieve a desired goal

18 / 132
Intelligent agent and environment
▶ Common characteristics of intelligent agents include;
• Adaptation based on experience
• Real time problem solving
• Analysis of error or success rates
• Use of memory-based storage and retrieval
▶ A software agent receives keystrokes, file contents, network
packets as sensory inputs.
▶ It acts on the environment by displaying output on the screen,
writing files and sending network packets.
▶ The agent function for an artificial agent is implemented by
an agent program.
▶ A rational agent is the one that does the right thing.
▶ The agent’s rationality is determined by;
• Performance measure used to define the criterion of success
• Agent’s prior knowledge of environment
• The actions that the agent can perform
• The agent’s percept sequence to date (i.e. history of all
inputs the agent has perceived)
19 / 132
Intelligent agent and environment
▶ Designing an agent requires specifying its task environment
▶ The task environment specification covers the Performance measure,
Environment, Actuators and Sensors, abbreviated as PEAS
▶ Performance measure evaluates the behaviour of the agent in
the environment

Figure 1: PEAS description of the task environment for a self driving


20 / 132
Intelligent agent and environment

▶ Task environments:
• Fully observable environment: The complete state of the
environment is available to the agent at any given time.
• Partially observable environment: Parts of the state are
missing because sensors are noisy or inaccurate, or because
some states cannot be sensed.
• Single-agent environment: One agent operates in the task
environment.
• Multi-agent environment: Multiple agents operate in the task
environment.
• Deterministic environment: The next state of the environment
is completely determined by the environment’s current state
and the action of the agent.
• Known environment: The outcomes for all actions are given
• Unknown environment: The environment is unknown to the agent,
and therefore it has to learn how it works in order to make good
decisions.

21 / 132
Intelligent agent and environment

▶ Task environments:
• Stochastic environment: The next state of the environment is
not completely determined by the current state
♦ It contains partially observable
• Uncertain environment: Not fully observable or not
deterministic.
• Dynamic environment: The task environment changes while the
agent is deliberating.
• Static environment: The environment does not change while the
agent is deliberating.
♦ Static environments are easy for an agent to deal with
• Cooperative environment: Increase of performance by one agent
maximizes performance measure of all agents
♦ Avoiding collisions by a self-driving car maximizes performance of
all agents in the environment
• Competitive environment: An increased performance by one agent
decreases performance of an opponent agent in the environment.

22 / 132
Intelligent agent and environment

▶ AI’s goal is to design an agent program that implements the
agent function - the mapping from percepts to actions.
▶ The agent program runs on a computing device (the
architecture), such as a normal PC, robot etc.
agent = architecture + program
▶ The program chosen must be appropriate for the given architecture
▶ The architecture makes the percepts from the sensors available
to the program, runs the program, and feeds the program’s action
choices to the actuators
▶ Major types of agent programs, based on how they select
actions;
• Simple reflex agents
• Model-based reflex agents
• Goal-based agents
• Utility-based agents

23 / 132
Intelligent agent and environment

▶ Simple reflex agents

• An agent selects actions on the basis of the current percept,
ignoring the rest of the percept history.
▶ Model-based reflex agents
• The agent attempts to handle partial observability of its task
environment by keeping track of the unobserved environment
• The agent maintains internal state that depends on the percept
history to reflect some unobserved aspects of the current state
• The agent must update knowledge about its environment with time
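
The difference between the two agent types above can be sketched in a few lines. The example assumes a hypothetical two-location vacuum-cleaner world (locations "A" and "B", each "Clean" or "Dirty"), which is not part of these slides; the action names are also illustrative.

```python
def simple_reflex_agent(percept):
    """Choose an action from the current percept only (no memory)."""
    location, status = percept
    if status == "Dirty":
        return "Suck"
    return "Right" if location == "A" else "Left"


class ModelBasedReflexAgent:
    """Keeps an internal state: the last known status of each location."""

    def __init__(self):
        self.model = {"A": None, "B": None}  # None means "not observed yet"

    def act(self, percept):
        location, status = percept
        self.model[location] = status          # update knowledge from the percept
        if status == "Dirty":
            self.model[location] = "Clean"     # expected effect of the Suck action
            return "Suck"
        if all(s == "Clean" for s in self.model.values()):
            return "NoOp"                      # everything is known to be clean
        return "Right" if location == "A" else "Left"


agent = ModelBasedReflexAgent()
for percept in [("A", "Clean"), ("B", "Dirty"), ("B", "Clean")]:
    print(percept, "->", agent.act(percept))
```

The simple reflex agent keeps moving between clean locations forever, while the model-based agent uses its internal state to stop once both locations are known to be clean.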

24 / 132
Intelligent agent and environment

▶ Goal-based agent
▶ Utility-based agents
• Knowing the goal is not enough for an agent to generate a high
quality behaviour.
• An agent needs to choose the action with the best utility
• Agent needs to know utility of each possible action (choice)
• It will distinguish between more and less desirable actions
• An agent’s utility function is an internalization of the
performance measure
• This enables the agent to choose the action that maximizes its
utility
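
Choosing the action with the best utility can be sketched as taking the maximum over the available actions. The routes, travel times and the utility function below are hypothetical values chosen only to illustrate the idea.

```python
def choose_action(actions, utility):
    """Return the action with the highest utility."""
    return max(actions, key=utility)


# Hypothetical travel times (in minutes) for three routes to the same goal.
route_minutes = {"highway": 25, "city_centre": 40, "back_roads": 32}


def utility(route):
    """Shorter trips are more desirable, so utility is the negative time."""
    return -route_minutes[route]


best = choose_action(route_minutes, utility)
print("Chosen route:", best)
```

All three routes reach the goal, so a goal-based agent could pick any of them; the utility function is what lets the agent prefer the quickest one.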

25 / 132
Intelligent agent and environment: Learning
▶ A learning agent is an agent that has the capability to learn
from the environment and its experience in order to improve its
performance (i.e., maximize its performance measure).
▶ Learning allows an agent to acquire more knowledge about its
environment, which was initially unknown to it, and so become
more competent (master it better).
▶ Key conceptual components of learning agent;
• Learning element: Responsible for making improvements
• Performance element: Responsible for selecting external
actions
• Critic: Provides feedback on the agent’s actions to determine how
its performance element should be modified to do better in the future
• Problem generator: Responsible for suggesting actions that
will lead to new and informative experiences.
▶ The design of the learning element depends on the design of
the performance element
▶ A critic tells the learning element how well (or badly) the
agent is doing with respect to a fixed performance standard.
26 / 132
Intelligent agent and environment

Figure 2: General architecture of the learning agent

27 / 132
Problem solving agent

▶ Intelligent agents are expected to maximize their performance
measure, which is a measure that evaluates the behaviour of
an agent in its environment.
▶ The agent determines the goal and aims at satisfying it by
only taking actions that help it to satisfy the goal (i.e.,
rejecting actions which do not help it to satisfy its goal).
▶ Problem solving involves understanding, representation, formulation and
▶ Most problems being solved are complex, with partial data.
▶ Problem solving aims at finding a set of actions so as to
reach a goal.
▶ The agent’s task is to find out which actions to take, and
how to act now and in the future, to satisfy the goal.
▶ The agent needs to decide which actions and states it should
consider to satisfy the goal.

28 / 132
Problem solving agent

▶ Most problems being solved are complex, with partial data.

▶ Major steps of problem solving involve;
• Problem formulation (definition): A process of deciding what
actions and states to consider given a goal.
• Problem analysis and representation: Problem is represented in
a way that determines a solution
• Planning: Process of determining a sequence of actions
(solution) to be taken to solve a problem.
♦ It involves search for an optimal solution
♦ Solution to a problem is a sequence of actions that leads from
initial state to a goal state.
♦ Solution quality is measured by the path cost function
♦ An optimal solution has the lowest path cost among all possible
solutions.
• Execution: Carrying out actions determined in planning
• Evaluating solution: Assessing how good the solution is.
• Consolidating gains: Modifying actions previously done in order
to get a better solution based on the evaluation made on the

29 / 132
Problem solving agent

▶ What constitutes a problem?

• Initial state: The state of an environment an agent starts in.
An agent will pass through different states to reach the goal
• A description of the possible actions available to the agent.
• A description of what each action does
• The goal test: Determining whether a given state is a goal
state or not.
• Path cost: Cost for each path (solution) to reach the goal
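
The components listed above (initial state, actions, what each action does, goal test and path cost) can be put together in a small search example. This is a minimal sketch that uses uniform-cost search, one of several possible search strategies; the graph of states and step costs is hypothetical.

```python
import heapq

# Hypothetical state space: state -> {next state reachable by one action: step cost}
GRAPH = {
    "A": {"B": 1, "C": 4},
    "B": {"C": 2, "D": 5},
    "C": {"D": 1},
    "D": {},
}


def uniform_cost_search(graph, initial_state, is_goal):
    """Return (path, cost) of the lowest-cost path to a goal state."""
    frontier = [(0, initial_state, [initial_state])]  # (path cost, state, path)
    best_cost = {initial_state: 0}
    while frontier:
        cost, state, path = heapq.heappop(frontier)
        if is_goal(state):                            # the goal test
            return path, cost
        if cost > best_cost.get(state, float("inf")):
            continue                                  # a cheaper path was already found
        for next_state, step_cost in graph[state].items():
            new_cost = cost + step_cost
            if new_cost < best_cost.get(next_state, float("inf")):
                best_cost[next_state] = new_cost
                heapq.heappush(frontier, (new_cost, next_state, path + [next_state]))
    return None, float("inf")                         # no solution exists


path, cost = uniform_cost_search(GRAPH, "A", lambda state: state == "D")
print("Solution:", path, "with path cost", cost)
```

Here the solution is reported as the sequence of states visited; each step from one state to the next corresponds to one action.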

30 / 132
Machine Learning (ML)
▶ A branch of artificial intelligence (AI) that provides
systems with ability to automatically learn and improve
performance through exposure to data or experience without
being explicitly programmed.
▶ It involves training a learning algorithm using appropriate
data for the algorithm to acquire the knowledge (experience).
▶ The data used in the process of training a machine learning
algorithm is collectively called the learning set or dataset.
▶ The acquired knowledge (experience) will be used in the
future to perform a similar task on new but similar data.
▶ The learning set can consist of images, text, numerical
values, sound etc.
▶ During training process, a learning algorithm finds patterns
from data - that is how the input corresponds to the target.
▶ The goal of ML is to develop a model (trained algorithm) with
the capability to perform a task for which it was developed.

31 / 132
Machine Learning (ML)

▶ ML is well suited to complex data sets that have huge numbers
of variables and features.
▶ Unlike classical programming, an ML algorithm finds the rules
from the data.
▶ An ML system is not explicitly programmed, but trained by
passing it many relevant data examples, allowing
it to make observations and establish statistical structure.
▶ Learning process of the ML algorithm involves activities such
as understanding, memorization, knowledge acquisition,
knowledge management etc.
▶ Three major types of machine learning are;
• Supervised learning
• Unsupervised learning
• Reinforcement learning

32 / 132
Machine Learning (ML)

▶ Supervised learning
• It is the most common type of ML.
• An agent (learning algorithm) observes several examples of
labeled input-output pairs to learn relationship (pattern)
between them.
• It uses labeled data- data in which each input data (example)
has a known output (target), to learn a mapping function that
turns input variable (x ) into output variable (y).
• Two types of supervised learning are;
♦ Classification
♦ Numeric prediction (regression)

33 / 132
Machine Learning (ML)
▶ Classification
• The most common type of supervised learning, in which a model (a
classifier) predicts categorical (discrete) and unordered class
labels.
• For example, a model classifies human images into "Male" or
"Female", or medical data into "Positive" or "Negative" for a
positively and negatively diagnosed patient, respectively.
• Dataset contains a set of predefined classes, and each input
(example) belongs to one of these predefined classes.
• The model predicts the category of the new data sample
• Common applications of classification tasks include;
♦ Target marketing
♦ Medical diagnosis
♦ Fraud detection
♦ Performance prediction
♦ Loan payment
♦ Spam filtering
♦ Gender classification
♦ Weather forecasting

34 / 132
Machine Learning (ML): Supervised learning

▶ It produces categorical value rather than numerical value

▶ Class labels can be represented by using values such as 1,2,3
or A, B, C where ordering among class values has no meaning.
▶ Examples of classification algorithms include
• Logistic regression (linear classifier)
• Support vector machine (SVM)
• Decision tree
• Random forest
• Artificial neural network (ANN)
• Convolutional neural network (CNN)

35 / 132
Machine Learning (ML): Supervised learning

Figure 3: Demonstration of a classification task

36 / 132
Machine Learning (ML): Supervised learning
▶ Numeric prediction (regression)
• A type of supervised learning that learns a function mapping input x
to an ordered continuous output (real value) variable y.
• It is also known as regression analysis.
• The output of the numeric prediction is a real (continuous)
value.
• A model (predictor) predicts a quantity or size as a real value,
which may be an integer or floating-point value.
• A common example of a numeric prediction algorithm is linear
regression.
• Common applications involve a model’s prediction of;
♦ Life expectancy (age) based on eating pattern, medication, disease
state and other living conditions.
♦ House price based on house size (number of rooms), age, floor
size, location, roof type, owner’s job etc.
♦ Salary of an employee given academic qualifications
♦ Crop yield based on soil precipitation, rainfall, fertilizer,
temperature etc.
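
Numeric prediction can be illustrated with simple linear regression on one input feature, fitted by ordinary least squares. This is a minimal sketch in plain Python; the data points are made up for illustration and follow the line y = 2x + 1 exactly.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predict a real-valued output for a new input x."""
    return slope * x + intercept


# Hypothetical training data: input feature x and continuous target y.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print("Prediction for x = 5:", predict(5.0, slope, intercept))
```

The slope and intercept are the model parameters learned from the training data; the prediction for a new input is a real number rather than a class label.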

37 / 132
Machine Learning (ML): Unsupervised learning
▶ A type of machine learning that develops a model without
relying on labeled data.
▶ The most common example of an unsupervised learning algorithm
is K-means (K-nearest neighbor, despite the similar name, is a
supervised algorithm).
▶ Three main types of unsupervised machine learning are;
• Clustering (data segmentation): It finds potentially similar
objects (data samples) among the input examples and groups them
into one cluster.
♦ Each cluster contains objects that share some similarities but are
dissimilar to objects in other clusters.
♦ Common applications include biology, customer segmentation, web
searches, fraud detection (in banking transactions, phone calls,
Internet usage)
• Dimensionality reduction: Involves reducing number of
attributes (features) per data sample in a high dimensional
dataset by removing uninformative, noisy or redundant features
in order to produce a reduced or compressed data structure with
fewer features.
♦ Aims at improving the model’s accuracy, processing speed and data
visualization, and at reducing storage space and model complexity.
• Density estimation: It is the problem of reconstructing the
probability density function using a set of given data points
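
Clustering without labels can be illustrated with a small K-means example on one-dimensional data. This is a minimal sketch: the data points and the starting centroids are hypothetical, and the starting centroids are fixed (rather than random) so that the result is reproducible.

```python
def k_means_1d(points, centroids, iterations=10):
    """Cluster 1-D points around the given starting centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical unlabeled data with two visible groups.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])
print("Centroids:", centroids)
print("Clusters:", clusters)
```

No target labels are given; the two groups emerge only from the distances between the data points.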
38 / 132
Machine Learning: Reinforcement learning

▶ An agent learns from direct interaction with a dynamic
environment.
▶ The agent gets feedback about the outcome of its actions as
reward for good actions or punishment for bad actions
▶ No correct and precise labeled data is provided, and the agent
acts based on past experience and on feedback from the environment
▶ It is given little knowledge to enable it to start operating in
the environment, and it will gain more knowledge through
experience from the environment
▶ Examples of reinforcement learning algorithms are Q-learning
and deep Q-learning
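
The reward-driven learning described above can be sketched with tabular Q-learning on a tiny environment. Everything about the environment is an assumption made for illustration: five states on a line, a reward of 1 for reaching the last state, and purely random exploration during training.

```python
import random

N_STATES = 5          # states 0..4 on a line; state 4 is the goal
ACTIONS = (-1, +1)    # index 0 = move left, index 1 = move right


def step(state, action):
    """Apply an action and return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Learn a Q-table from random interaction with the environment."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _episode in range(episodes):
        state = 0
        for _step in range(1000):                 # safety cap on episode length
            a = rng.randrange(len(ACTIONS))       # explore with a random action
            next_state, reward, done = step(state, ACTIONS[a])
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])   # Q-learning update
            state = next_state
            if done:
                break
    return q


q_table = train()
# Greedy policy for the non-goal states: pick the action with the larger Q-value.
policy = ["right" if right > left else "left" for left, right in q_table[:-1]]
print("Learned policy:", policy)
```

The agent is never told that moving right is correct; it discovers this from the rewards it receives.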

39 / 132
Dataset (learning set)

▶ Data (textual, images, voice, numerical etc.) is an integral
part of AI applications, specifically for machine learning
▶ A collection of data used to develop an ML model is called
dataset or learning set.
▶ The dataset used must be relevant to the problem addressed
and adequate (sufficient).
▶ There should exist a relationship (pattern, association)
between input data and output (labels or numerical values)
for every instance of the dataset.
▶ Features (attributes, measurements or dimensions) of each
input data sample (example or observation) must bear
relationship with its output (label or numeric value).
▶ Dataset must be properly processed for the training task.

40 / 132
Dataset (learning set)

Table 1: Body Mass Index (BMI)

#    Weight (kg)   Height (m)   BMI    Category
1    50            1.75         16.2   Underweight
2    54            1.63         20.6   Normal
3    56            1.7          19.6   Underweight
4    77            1.73         24.4   Normal
5    59            1.75         19.2   Underweight
6    79            1.47         36.6   Obese
7    83            1.5          37.4   Obese
8    74            1.6          29.2   Overweight
9    86            1.75         28.1   Overweight
10   90            1.52         39.1   Obese
11   41            1.5          18.2   Underweight
12   77            1.91         21.1   Normal
13   95            2.03         23.1   Normal
41 / 132
Dataset (learning set)

▶ If there exists no relationship/pattern between input features
and output for samples of the dataset, then the model will be
unable to learn from training data.
▶ This will cause the model to underfit the training data, as
it will perform random guessing.
▶ An underfit model shows low accuracy and a high prediction
error (loss value) that does not improve with training
▶ The value for every input feature must be within the
acceptable range for that particular feature.
▶ Collection and preprocessing of data for ML applications are
often time consuming and may be financially expensive.

42 / 132
Dataset: Splitting of the dataset
▶ Dataset for ML/DL is commonly split (divided) into three
sets;
• Training set
• Validation set
• Test set
▶ Care must be taken when splitting dataset into these sets in
order to avoid overlapping samples and human bias.
▶ All sets must have similar features. Differences in their
features, improper splitting etc. are likely to cause
overfitting or underfitting.
▶ Training set:
• Comprises the bulk of the entire dataset, about 70-90% of the
entire dataset.
• Large size ensures that the model will learn as many patterns
(features) from the data as possible, and learn them sufficiently.
• Must contain as many possible patterns of the population as
possible.
• When limited (small) data is available, data augmentation
technique is used to increase the size of the dataset.
43 / 132
Dataset: Splitting of the dataset

▶ Data augmentation: A technique of generating more training
data from existing training samples by augmenting samples via
a number of random transformations that yield
believable-looking images.
▶ It is performed on training set and not on validation or test
sets.
▶ Many ML frameworks, specifically for DL, have built-in features
for data augmentation, so you don’t need to implement them
yourself
▶ Training set must be passed to the model several times for
the model to learn patterns existing in the data.
▶ Epoch: A single cycle consisting of iterations required to
train the model (once) on the entire training set.
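
The data augmentation idea described above can be illustrated with one simple random transformation, a horizontal flip. This is a minimal sketch in which an image is represented as a nested list of pixel values; the tiny image and the 50% flip probability are assumptions for illustration, and real projects would normally use the augmentation tools of their framework.

```python
import random


def flip_horizontal(image):
    """Return a mirrored copy of the image (each row reversed)."""
    return [list(reversed(row)) for row in image]


def augment(images, seed=0):
    """Keep every original image and randomly add a flipped copy of some."""
    rng = random.Random(seed)
    augmented = []
    for image in images:
        augmented.append(image)
        if rng.random() < 0.5:            # apply the transformation at random
            augmented.append(flip_horizontal(image))
    return augmented


# Hypothetical 2 x 3 grayscale "image".
image = [[1, 2, 3],
         [4, 5, 6]]

print(flip_horizontal(image))
print("Training samples after augmentation:", len(augment([image, image, image])))
```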

44 / 132
Splitting of the dataset

▶ An iteration is a sub-cycle (part of an epoch) which feeds a
batch (bunch) of training set into the model.
▶ The amount of data fed into the model in a single
iteration is called batch size.
▶ To train the model sufficiently, you need to run training in
several epochs.
▶ The number of iterations per epoch is obtained by dividing total
samples in the training set by batch size.
▶ A training set with 256 examples and a batch size of 4
samples will have 256 / 4 = 64 iterations per epoch to train on
the whole training set once.
▶ If 10 epochs are needed, then there will be 64 x 10 = 640
iterations in total.
▶ Batch size, iterations and epochs are some of the hyperparameters
to be tuned during model optimization.
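
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch; rounding up when the training set size is not an exact multiple of the batch size is an assumption added here (the last batch is then smaller than the others).

```python
import math


def iterations_per_epoch(n_samples, batch_size):
    """Number of batches needed to pass over the whole training set once."""
    return math.ceil(n_samples / batch_size)


per_epoch = iterations_per_epoch(256, 4)   # 256 / 4 = 64
total = per_epoch * 10                     # 64 x 10 = 640 over 10 epochs

print("Iterations per epoch:", per_epoch)
print("Total iterations for 10 epochs:", total)
```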

45 / 132
Splitting of the dataset

▶ Validation set
• A set used to evaluate the performance of the model at the end
of each epoch during training.
• It comprises about 10% to 30% of the entire dataset.
• It must not contain any samples used during training in order
not to leak information of the training data.
• Usually one epoch, with total iterations equal to the number
of samples in the validation set.
• When you have a large validation set, increase the batch size, and
therefore change the number of iterations.
▶ Test set
• The smallest of all sets.
• A set used to check model’s performance after it has been
trained and evaluated.
• It consists of up to 10% of the entire dataset.
• It is applied to verify the model’s performance after training
is complete, usually on the best performing model.
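
Splitting a dataset into the three sets described above can be sketched as follows. The 80/10/10 proportions and the fixed random seed are assumptions chosen for illustration; they fall within the ranges given above. Shuffling first and then slicing guarantees that no sample appears in more than one set.

```python
import random


def split_dataset(samples, train=0.8, validation=0.1, seed=0):
    """Shuffle the samples and split them into train/validation/test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)       # avoid ordering (human) bias
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    train_set = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_val]
    test_set = shuffled[n_train + n_val:]       # whatever remains
    return train_set, validation_set, test_set


# Hypothetical dataset of 100 samples, identified here by their index.
train_set, validation_set, test_set = split_dataset(range(100))
print(len(train_set), len(validation_set), len(test_set))
```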

46 / 132
Public (benchmark) datasets

▶ Collecting data for machine learning is the most challenging

▶ There are various public datasets for different ML tasks.
▶ Several of them are preloaded with different frameworks or
can be downloaded and deployed into models.
▶ These datasets are suitable for ML beginners who learn how to
develop, configure and optimize ML models.
▶ Examples of these public datasets are;

47 / 132
Public (benchmark) datasets

▶ Boston Housing Dataset: Contains information collected by the US
Census Service concerning housing in Boston, Massachusetts.
▶ Google-Landmarks-v2: For landmarks recognition and
▶ IRIS Dataset:
• It is a simple and beginner friendly dataset containing
information about sepal and petal length and width for three
flower species.
• Each flower species has 50 rows (instances), that is, there
are 150 rows in total.
• The three flower species are Setosa, Versicolour and Virginica.

48 / 132
Public (benchmark) datasets

▶ MNIST Dataset:
• A dataset of handwritten digits containing 60,000 training
images and 10,000 testing images.
• Commonly used for classification tasks.
▶ Fake News Detection Dataset: A csv file that contains 7796
rows with four columns, news, title, news text and result.
▶ Titanic Dataset: Contains information like name, age, sex,
number of siblings etc for about 891 passengers in the
training set and 418 passengers in the testing set.
▶ Credit card Fraud Detection Dataset:
• Contains transactions made by credit cards.
• They are labeled as Fraudulent or genuine.
▶ Stanford Dogs Dataset: Contains 20,580 images for 120
different dog breed categories.

49 / 132
Public (benchmark) datasets
▶ ImageNet Dataset:
• One of the most common, popular and largest image datasets for
computer vision.
• Contains over 1.28 million images for training and 50,000
images for validation for 1000 categories.
▶ Common Objects in Context (MS COCO) Dataset: A large-scale
object detection and segmentation dataset with over 300,000
images for 90 classes of common objects such as person,
laptop, bicycle, car, airplane, bus, train, bird, cat, dog,
giraffe, horse, sheep, cow, elephant, cup, fork, chair and book.
▶ CIFAR-10 Dataset:
• Contains a dataset for small, low resolution (32 x 32) colour
images of 10 classes of objects.
• Contains 50,000 and 10,000 images for training and testing,
respectively.
• The ten classes are airplane, automobile, bird, cat, deer, dog,
frog, horse, ship and truck.
▶ CIFAR-100: Like CIFAR-10, but for 100 different classes.
50 / 132

▶ An array is a collection of items stored at contiguous memory
locations.
▶ An array is a special variable, which can hold more than one
value at a time.
▶ The idea is to store multiple items of the same type together.
▶ An array in Python can be created by importing the array module.
▶ NumPy’s arrays have fixed size and are homogeneous, i.e.
elements must be of the same type.
▶ They are an ordered collection of elements with every value
being of the same data type.

51 / 132

▶ NumPy is a Python library used for working with arrays.
▶ It contains multidimensional arrays for storing and
manipulating data for scientific computations.
▶ It provides efficient ways of working with the multidimensional
array data structure.
▶ It comes with built in functions for various computational
operations, such as arange().
▶ NumPy is a general-purpose array-processing package
▶ It provides a high-performance multidimensional array object
and tools for working with these arrays.
▶ A commonly used alias for numpy is np.
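A minimal sketch of the points above, using only standard NumPy calls (`arange`, `reshape`, `sum`); the variable names are illustrative.

```python
import numpy as np

values = np.arange(6)          # 1D array holding 0, 1, 2, 3, 4, 5
grid = values.reshape(2, 3)    # same data viewed as 2 rows x 3 columns

print(values.dtype)            # every element shares one data type
print(grid.shape)              # number of rows and columns
print(grid * 10)               # element-wise arithmetic without a loop
print(grid.sum(axis=0))        # sum of each column
```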

52 / 132
▶ pandas provides easy-to-use data structures and data analysis
tools.
▶ In ML, a substantial amount of time is spent on preparing the
data as well as analyzing basic trends and patterns.
▶ This is where pandas receives the attention of ML experts.
▶ It is built on top of numpy to provide additional higher
level data manipulation and analysis tools that make working
with tabular data even more convenient.
▶ You can use it to read data from a broad range of data
sources like CSV, SQL databases, JSON files and Excel.
▶ It is simple to use and allows developers to manage complex
data operations with just one or two commands.
▶ When data is clean, and structured, every row represents an
observation and every column a feature, and both rows and
columns can have labels.
▶ A commonly used alias for pandas is pd.

53 / 132
Manipulation of CSV files

▶ Displaying rows in the CSV file

▶ Printing all values of the nth column
▶ Using pandas to view column names and values.
▶ Special functions

54 / 132
Manipulation of CSV files
▶ Start up jupyter-notebook
▶ Import pandas
import pandas as pd
▶ Load the data;
variablename = pd.read_csv("filepath")
▶ variablename is a variable that holds the retrieved data, and
filepath is the complete path of the CSV file
▶ For instance;
irisdata = pd.read_csv("iris.csv")
▶ You may optionally add an argument for the encoding format, like;
irisdata = pd.read_csv("iris.csv", encoding="utf-8")
▶ Windows uses \ while Linux uses / to specify the file path
▶ Row selection - Counting of rows starts from 0
irisdata[:] - Selects all rows
irisdata[:5] - Selects the first five rows
irisdata[5:10] - Selects the rows at positions 5 to 9

55 / 132
Manipulation of CSV files
▶ Column and row selection
irisdata[["columnName"]] - Displays the selected column
irisdata[["columnName1","columnName2"]] - Displays the selected columns
▶ iloc indexer
Enables selection of rows and columns by integer position,
where 0 means the first row
irisdata.iloc[N] - Displays column names and respective
values for the sample at position N
irisdata.iloc[[0]] - Displays the row at position 0
irisdata.iloc[[0,2]] - Displays the rows at positions 0 and 2
irisdata.iloc[0:65] - Displays values for all columns for the
samples at positions 0 to 64
irisdata.iloc[[3,4],[1,2]] - Displays values of the columns at
positions 1 and 2 for the samples at positions 3 and 4
irisdata.iloc[0:66,-3] - Displays values of the third column
from the right for the first 66 samples.

56 / 132
Manipulation of CSV files
▶ Row and column selection
irisdata.iloc[:66,-3:]-Displays values of the last three
columns for the first 66 rows
irisdata.iloc[:66,1:3] - Displays values of the columns at
positions 1 and 2 for the first 66 rows
irisdata.head()- Displays the first five rows of the data set
irisdata.head(N)-Displays the first N rows of the data set
irisdata.tail()-Displays the last five rows of the data
irisdata.tail(N)-Displays the last N rows of the data
irisdata.info()-Describes the columns of the data set, their
type, number of values for each column, index
number (order) etc.
irisdata.shape - Produces a tuple with the number of rows and
columns in a dataset
irisdata.columns - Produces an index with the list of column names.
irisdata['variety'].value_counts() - Extracts classes and their
respective number of samples
irisdata.describe()-Calculates some statistical data like
percentile, mean and std of the numerical
values of the series or DataFrame
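The selections above can be tried without a file on disk. The sketch below builds a tiny six-row stand-in for iris.csv in memory; the rows and the column names other than `variety` are assumptions made for illustration, not the real dataset.

```python
import io
import pandas as pd

# Hypothetical six-row stand-in for iris.csv (column names are assumed).
csv_text = """sepal.length,sepal.width,petal.length,petal.width,variety
5.1,3.5,1.4,0.2,Setosa
4.9,3.0,1.4,0.2,Setosa
7.0,3.2,4.7,1.4,Versicolor
6.4,3.2,4.5,1.5,Versicolor
6.3,3.3,6.0,2.5,Virginica
5.8,2.7,5.1,1.9,Virginica
"""
irisdata = pd.read_csv(io.StringIO(csv_text))

first_rows = irisdata[:2]                   # rows at positions 0 and 1
two_columns = irisdata.iloc[:, 1:3]         # columns at positions 1 and 2
last_three = irisdata.iloc[:, -3:]          # last three columns
class_counts = irisdata["variety"].value_counts()

print(irisdata.shape)                       # (rows, columns)
print(list(two_columns.columns))
print(list(last_three.columns))
print(class_counts)
```

Comparing the two printed column lists shows that `1:3` and `-3:` select different columns, so the two slices are not interchangeable.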
57 / 132
Python Virtual environment - On Windows

▶ A Python virtual environment (PVE) is an isolated space for a
Python project.
▶ It ensures that each ML project has its own set of packages
and dependencies that will not disrupt any of the other
projects.
▶ It also provides greater control over Python projects and
handling of Python packages.
▶ You can create as many Python virtual environments (PVEs) as
you want.
▶ To create a PVE in Windows, run this command;
conda create -n envi_name
▶ To create a Python virtual environment called css128pyenv,
execute the following command;
conda create -n css128pyenv

58 / 132
Python Virtual environment - On Windows
▶ You may specify the version of Python to be installed in the
virtual environment as well by running the command;
conda create -n css128pyenv pip python=3.8.5
▶ Activating a virtual environment created;
conda activate css128pyenv
▶ Once activated, the name of your PVE (e.g. css128pyenv)
should be displayed within brackets at the beginning of your
Anaconda command prompt.

▶ To quit from the virtual environment run the command;

conda deactivate
▶ It is highly recommended that you create and use your own
virtual environment, rather than base environment
▶ To delete a PVE, execute the command;
conda env remove --name env_name
▶ To remove the css128pyenv, execute the command;
conda env remove --name css128pyenv
59 / 132
Python Virtual environment - On Windows
▶ You need to link your newly created PVE (css128pyenv) with
Jupyter notebook by running the following command;
▶ Install ipykernel
pip install ipykernel
▶ Run the following command;
python -m ipykernel install --user --name=css128pyenv
▶ Make sure that css128pyenv exists, (i.e., has been created).
▶ Avoid putting space either side of the =
▶ After creating a PVE, you may be required to install many
libraries that were running well on base environment
▶ To view the list of PVEs created in the system (using conda);
conda info --envs
▶ To get into Python command line interface (CLI), execute the
▶ To quit from Python CLI execute the command;
60 / 132
Running ML Projects
▶ Remember to activate your PVE whenever you want to run your
project or install packages
▶ Use conda (installed with Anaconda) or pip to get the
libraries (packages) installed, upgraded or removed.
▶ You need to install pip by running this command;
conda install pip
pip install --user --upgrade pip
▶ Run these commands to install/update the following libraries
pip install numpy
pip install pandas
pip install matplotlib
pip install jupyter
pip install --user --upgrade jupyter
pip install -U scikit-learn
▶ You may use conda instead of pip to install packages
▶ Remember, a conda environment created without specifying
packages does not contain pip until you install it

61 / 132
Running ML Projects: Steps
1. Create a folder (should not contain space or special
character) to save your files and data. Don’t use python as
folder name as it may be misinterpreted as python command
2. Use cd command to navigate to the folder created in step 1.
3. Activate your Python virtual environment (PVE)
4. Start up Jupyter-notebook by executing the command;
5. On the window that appears, click New tab, then select
6. Select the appropriate PVE of your choice
7. Rename the notebook file (File tab > Rename)
8. In the first cell, write all necessary import statements
9. Import the data into the notebook file
10. Write the rest of the code in the cells
11. Execute each cell before moving to the next cell by clicking
Run button.
62 / 132
Running ML Projects: Steps

Figure 4: Jupyter Notebook window

63 / 132
Running ML Projects: Steps

Figure 5: Jupyter Notebook window

64 / 132
▶ Function: A relationship between variables.
▶ It relates an input or set of inputs to a unique output.
▶ x and f(x) denote the input and output of the function,
respectively.
▶ Each input must correspond to a unique output.
▶ Considering a function which takes an input and squares it to
give the output, 4 is the output if we input 2.
▶ y = f(x) denotes a function in which y depends on x, that is
y is a function of x.
▶ x and y are the independent and dependent variables, respectively.
▶ f(x) = 3x + 10 is a function that takes the input x, multiplies
it by 3 and adds 10.
▶ f(x) can also be written as f.
▶ The set of a function's inputs is called the domain, and the
set of its outputs is called the range.
65 / 132

▶ Given $f(x) = x^2 + 2x - 1$, find $f(1)$.
▶ Given $f(x) = -x^2 + 6x - 11$, find;
i. $f(2)$
ii. $f(t)$
iii. $f(x - 3)$
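The numeric parts of these exercises can be checked with two small Python functions. The second function is named `g` here only to keep it apart from the first; the symbolic parts (ii and iii) are left for working by hand.

```python
def f(x):
    """f(x) = x^2 + 2x - 1"""
    return x**2 + 2*x - 1

def g(x):
    """g(x) = -x^2 + 6x - 11 (the second function in the exercise)"""
    return -x**2 + 6*x - 11    # note: -x**2 means -(x**2) in Python

print(f(1))
print(g(2))
```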

66 / 132
Linear function

▶ A linear function is a function whose variables have a
highest power equal to one.
▶ A linear function may take the form f(x) = y = mx + c, where m
and c are constants and m is a non-zero real number.
▶ Similarly, a linear function is commonly represented as
f (x ) = y = ax + b
▶ c (or b) can be any real number, while m (or a) must be non-zero.
▶ The relationship between x and y can be represented in a
graphical form as a straight (linear) line.
▶ Examples of linear functions are;
(i.) y = 15x + 10
(ii.) k = 2r + 3
(iii.) 2x = 3 − 5y
(iv.) y = 5 − 3x
(v.) y = −3 + x
(vi.) y = 9x + 12

67 / 132
Linear equation
▶ Linear equation is an equation in which the highest power of
the variable is always 1
▶ The standard (general) form of a linear equation in one
variable is Ax + B = 0, where A and B are real numbers and x
is a single variable.
▶ The standard form of a linear equation in two variables is
Ax + By = C, where A and B are coefficients of variables x
and y, respectively, and C is a constant.
▶ Linear equation formula is the way of expressing a linear
equation.
▶ A linear equation can be expressed in the standard form, the
slope-intercept form, or the point-slope form.
• Standard form; either Ax + B = 0 or Ax + By = C.
• Slope-intercept form; y = mx + c, where m is the slope, with
m ≠ 0, and c is the y-intercept, i.e., the point at which the
straight line crosses the y-axis (the y value when x is 0).
• Point-slope form; y − y1 = m(x − x1), where m is the slope and
(x1, y1) is a point on the line.
68 / 132
Slope, y-intercept, and equation of the line
▶ The slope (gradient) of a linear equation is the amount by
which the line is rising or falling.
▶ It is the change in y coordinate with respect to the change
in x coordinate.
▶ It represents the rate of change of y for each unit increase
in x .
▶ If (x1, y1) and (x2, y2) are any two points on a line, then its
slope (m) is calculated using the formula;
$$m = \frac{y_2 - y_1}{x_2 - x_1} = \frac{\Delta y}{\Delta x}$$
▶ Slope of a line between two points is said to be the rise of
the line from one point to another along the y-axis over the run
along the x-axis;
$$m = \frac{\text{Rise}}{\text{Run}} = \frac{\text{vertical numerical difference}}{\text{horizontal numerical difference}}$$
▶ The slope of a line gives the measure of its steepness and
direction.
▶ Given a straight line with points P(4, 2) and R(6, 8), find its
slope, equation of the line, y-intercept and plot its graph.
• Taking P as (x1, y1) and R as (x2, y2), the slope of the line is
$$m = \frac{y_2 - y_1}{x_2 - x_1} = \frac{8 - 2}{6 - 4} = \frac{6}{2} = 3$$

69 / 132
Slope, y-intercept, and equation of the line

▶ Equation of the line and y-intercept

• Given the form y = mx + c, equation of the line aims at finding
the values of m, c, and establishing an equation for finding
the value of y, given x and vice versa.
• With m = 3, substituting it in the general form y = mx + c, we
get y = 3x + c
• Since the linear relationship must hold for every point (x, y) on
the line, substituting the values of point P(4, 2) into y = 3x + c,
we get 2 = 3 × 4 + c, which gives 2 = 12 + c, and finally c = −10
• Thus, given y = 3x + c, the equation of the line becomes
y = 3x − 10
• y-intercept is therefore −10
• You can similarly use point R(6, 8) to compute the equation of the
line.

70 / 132
Slope, y-intercept, and equation of the line

▶ You may also obtain equation of the line by substituting
point P or R into the formula for slope, given m = 3.
▶ Using point P(4, 2), equation of the line is obtained as;
• $3 = \frac{y - 2}{x - 4}$
• 3(x − 4) = y − 2
• 3x − 12 = y − 2
• 3x − 12 + 2 = y
• 3x − 10 = y
• y = 3x − 10
• From equation of the line y = 3x − 10, y-intercept is -10.
▶ The y-intercept of a linear function is the point on the
y-axis where the straight line crosses it.
▶ In the graph presented in Fig. 6, the straight line crosses
the y-axis at −10.
▶ In the above example, y = 3x − 10, for every unit increase in
x , y increases by 3 (which is the value of slope).
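The worked example can be reproduced with a short helper. The function name `line_through` is an illustrative choice; it applies the slope formula and then solves y1 = m·x1 + c for c.

```python
def line_through(p, r):
    """Return (slope, intercept) of the straight line through points p and r."""
    (x1, y1), (x2, y2) = p, r
    if x2 == x1:
        raise ValueError("vertical line: slope is undefined")
    m = (y2 - y1) / (x2 - x1)    # rise over run
    c = y1 - m * x1              # rearranged from y1 = m*x1 + c
    return m, c

m, c = line_through((4, 2), (6, 8))    # points P and R from the example
print(f"slope = {m}, y-intercept = {c}")
```

The same helper can be reused for any other pair of points with different x values.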

71 / 132
Linear function and graph

Plot graph of the linear function y = 3x − 10

Table 2: Table of values for y = 3x − 10

x -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10
y -25 -22 -19 -16 -13 -10 -7 -4 -1 2 5 8 11 14 17 20
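A table of values like Table 2 can be generated rather than computed by hand. This is a minimal sketch; the names `xs` and `ys` are illustrative.

```python
def f(x):
    """The linear function y = 3x - 10."""
    return 3 * x - 10

xs = list(range(-5, 11))         # x from -5 to 10 inclusive
ys = [f(x) for x in xs]

for x, y in zip(xs, ys):
    print(f"{x:>3} {y:>4}")      # one (x, y) pair per line
```

With matplotlib installed, passing `xs` and `ys` to `matplotlib.pyplot.plot` would draw the straight line through these points.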

72 / 132
Linear graph

Figure 6: Graph of the linear equation y = 3x − 10

73 / 132
Slope, y-intercept, and equation of the line

▶ If two lines, L1 and L2, are parallel, then their slopes are
equal, i.e., m1 = m2
▶ If two lines, L1 and L2, are perpendicular, then the product
of their slopes must be equal to −1, i.e., m1 × m2 = −1.
▶ Given two points, (1, 2) and (3, 8), find;
• Gradient
• Equation of the line
• y-intercept
• Plot the graph of the line

74 / 132
Linear Regression

▶ Quite often, there exists a relationship between two or more
variables.
▶ For instance, weights of people depend to some degree on
their heights.
▶ Variable is a symbol that can assume any of the prescribed
set of values, called domain.
▶ When data in the scatter diagram is approximated using a
straight line (example in Fig. 7), the variables involved
are said to exhibit linear relationship.
▶ If relationship between variables in the scatter diagram is
not linear (example in Fig. 8), then it is said to be
nonlinear relationship.

75 / 132
Linear Regression

Figure 7: Linear relationship between weight and height

76 / 132
Linear Regression

Figure 8: Nonlinear relationship

77 / 132
Linear Regression

▶ Regression is concerned with obtaining a mathematical
equation which describes the relationship among variables.
▶ Linear regression is a class of regression used to model
relationships between two variables using linear equations.
▶ It defines a function that describes relationship between
variables x and y.
▶ Used to estimate the value of dependent variable (y) given an
independent variable (x ), or vice versa.
▶ The line used to estimate the value of y from x on the
linear graph is called regression line of y on x .
▶ The line used to estimate the value of x from a given value
of y is called regression line of x on y.

78 / 132
Linear Regression: Simple linear regression

▶ It is a predictive analysis in machine learning (ML) which is
used to observe the following;
• Whether a set of predictor (independent) variables does well in
predicting the outcome (dependent) variable.
• Which of the predictor variables are significant in terms of
predicting the outcome variable.
▶ Usually, linear regression is used with one dependent
variable (Y ), and one or more predictor variables.

79 / 132
Linear Regression

▶ Fitting a linear regression: The process of obtaining a
linear regression relationship for a given set of data.
▶ Three methods commonly used to fit a regression line to a
given set of (bivariate) data are;
• Inspection method
• Semi-averages method
• Least squares method

80 / 132
Linear Regression: Inspection method

▶ It is the simplest method of fitting a regression line.

▶ It involves an individual judgment to draw an approximating
line to fit a set of data.
▶ It consists of plotting a scatter diagram for the relevant
data.
▶ Then drawing a line that most suitably fits the data in the
scatter diagram.
▶ Disadvantage:
• Allows different fit lines to be drawn by different people on
the same data
▶ It can be improved by plotting a mean point for x and y and
ensure that the regression line passes through it.

81 / 132
Linear Regression: Inspection method
Example 1
The figures show the output (in thousands of tons) and expenditure
on energy ($) for a firm over ten monthly periods.

Output (x)       20  22  25  26  21  23  28  20  25  29
Expenditure (y) 106 138 158 172 120 142 184 102 165 192

Use the above data to;

a. Draw a scatter diagram for the data
b. Calculate the mean point of the data and plot it on the
c. Use the inspection method to fit the regression line which
passes through the mean point
d. Find the equation of the line y = a + bx
e. Estimate the energy expenditure if the following month’s
output is planned at 27000 tons.

82 / 132
Linear Regression: Inspection method

Figure 9: Scatter diagram for output vs expenditure

83 / 132
Linear Regression: Inspection method

Figure 10: Scatter diagram for output vs expenditure

84 / 132
Linear Regression: Semi-averages method

i. Sort the (bivariate) data into size order by x-value
ii. Split the data up into two equal groups, a lower and an upper
group
iii. If there is an odd number of items, ignore the central item.
iv. Calculate the mean point for each group
v. Plot the above mean points on graph within suitably scaled
axes and join them with a straight line.
vi. The plotted line is the required y on x regression line

85 / 132
Linear Regression: Semi-averages method
Example 2
Twelve administrative trainees in a company took an aptitude test
in two parts, one designed to test the ability to do appropriate
calculations and the other designed to test skill in interpreting
results. Their scores were as follows;

Trainee (x) A B C D E F G H I J K L
Calculation score 23 56 74 29 82 45 36 51 60 55 52 88
Interpretation score 16 38 65 39 32 51 11 19 47 54 43 50

Use the above data to;

a. Obtain the interpretation on calculation regression line,
using the method of semi-averages and plot it, together with
the original data on a scatter diagram.
b. Trainee N obtained 72 in the calculations test but was absent
for the interpretation test. Use the regression line to
estimate trainee N’s interpretation score.
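The semi-averages steps can be sketched in Python for the data of Example 2. The helper names `mean_point` and `semi_averages` are illustrative, and the sketch assumes the two group means have different x values.

```python
calculation = [23, 56, 74, 29, 82, 45, 36, 51, 60, 55, 52, 88]
interpretation = [16, 38, 65, 39, 32, 51, 11, 19, 47, 54, 43, 50]

def mean_point(pairs):
    """Return the mean point (mean x, mean y) of a list of (x, y) pairs."""
    xs, ys = zip(*pairs)
    return sum(xs) / len(xs), sum(ys) / len(ys)

def semi_averages(xs, ys):
    """Fit y = a + b*x through the mean points of the lower and upper halves."""
    pairs = sorted(zip(xs, ys))      # step i: order by x-value
    half = len(pairs) // 2
    lower = pairs[:half]             # step ii: lower group
    upper = pairs[-half:]            # upper group (central item skipped if odd)
    lower_mean = mean_point(lower)   # step iv: mean point of each group
    upper_mean = mean_point(upper)
    b = (upper_mean[1] - lower_mean[1]) / (upper_mean[0] - lower_mean[0])
    a = lower_mean[1] - b * lower_mean[0]
    return a, b, lower_mean, upper_mean

a, b, lower_mean, upper_mean = semi_averages(calculation, interpretation)
estimate_n = a + b * 72              # trainee N scored 72 in calculations
print(lower_mean, upper_mean)
print(round(a, 2), round(b, 4), round(estimate_n, 1))
```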

86 / 132
Linear Regression: Least square method
▶ The main drawback of the semi-averages method is that it relies
on only two mean points, which may distort the regression
line (equation) when there are extreme values in x and y.
▶ Least square method, also known as ordinary least square
(OLS), addresses this drawback by involving all values and
thus is considered more robust (superior) than other methods.
▶ It attempts to provide the best-fitting (regression) line for
a given set of data by minimizing the sum of the squares of
the vertical deviations (residuals), known as sum of squared
residuals (SSR), from each data point (see Fig. 11).
▶ The deviation of a point residing on the line is 0.
▶ As deviations are squared, there is no cancellation between
positive and negative values of the deviations.
▶ Using this method, the values of a and b of regression line
y = a + bx are obtained using the following formulae;
$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}; \qquad a = \frac{\sum y}{n} - b\,\frac{\sum x}{n} \qquad (1)$$

87 / 132
Linear Regression: Least square method

▶ The values of b and a in equation 1 can also be computed
using the following formulas;
$$b = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \qquad (2)$$
$$a = \bar{y} - b\,\bar{x} \qquad (3)$$
▶ Where;
x̄ is the mean value for x values
ȳ is the mean value for y values
▶ Estimates a and b are the values that minimize the sum of
squared residuals (SSR)

88 / 132
Linear Regression: Simple linear regression

Figure 11: Demonstration of SSR

89 / 132
Linear Regression: Simple linear regression
▶ Simple linear regression estimates exactly how much y will
change when x changes by a certain amount.
▶ Generally, the data does not fall exactly on a line, and therefore
the regression equation should include an explicit error term ei
▶ The equation of the line can therefore be represented as
yi = b0 + b1 xi + ei
▶ The fitted values, also known as predicted values, are
typically denoted as ŷi = b̂0 + b̂1 xi
▶ The notations b̂0 and b̂1 indicate that the coefficients are
estimated.
▶ Actual response: the real value of y; predicted value: the y value
predicted by the linear regression, denoted as ŷ
▶ The residual ei, represented by the vertical dashed grey lines
in Fig. 11, is computed by subtracting the predicted value (ŷ)
from the real value (y), i.e., ei = real value − predicted value
▶ Since the ith real y value is yi, and the ith predicted y value is
f(xi) = ŷi = b0 + b1 xi, then ei = yi − f(xi) = yi − ŷi = yi − (b0 + b1 xi)

90 / 132
Linear Regression: Least square method
Solution for Example 1
Given equation of the line y = a + bx , find the value of a and b.
  x     y     xy    x^2
 20   106   2120    400
 22   138   3036    484
 25   158   3950    625
 26   172   4472    676
 21   120   2520    441
 23   142   3266    529
 28   184   5152    784
 20   102   2040    400
 25   165   4125    625
 29   192   5568    841
Σx = 239   Σy = 1479   Σxy = 36249   Σx^2 = 5805

Formula
$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}; \qquad a = \frac{\sum y}{n} - b\,\frac{\sum x}{n}$$

91 / 132
Linear Regression: Least square method

Solution for Example 1

$$b = \frac{10 \times 36249 - 239 \times 1479}{10 \times 5805 - 57121} = \frac{362490 - 353481}{58050 - 57121} = \frac{9009}{929}$$
b = 9.6975
Having obtained b, we can obtain a as follows;
$$a = \frac{1479}{10} - 9.6975 \times \frac{239}{10}$$
a = 147.9 − 9.6975 × 23.9
a = 147.9 − 231.77
a = −83.87
Equation of the line;
y = −83.87 + 9.6975x
y = 9.6975x − 83.87
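The least squares calculation can be checked with a few lines of Python. This sketch uses the x and y columns of the solution table for Example 1; the function name `least_squares` is illustrative.

```python
x = [20, 22, 25, 26, 21, 23, 28, 20, 25, 29]
y = [106, 138, 158, 172, 120, 142, 184, 102, 165, 192]

def least_squares(xs, ys):
    """Return (a, b) of the least squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(xi * yi for xi, yi in zip(xs, ys))
    sum_x2 = sum(xi * xi for xi in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b

a, b = least_squares(x, y)
estimate = a + b * 27            # output planned at 27 thousand tons
print(round(b, 4), round(a, 2), round(estimate, 2))
```

Since 27 lies inside the observed range of x (20 to 29), the last value is an interpolated estimate.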

92 / 132
Interpolation and extrapolation

▶ Interpolation: An estimation of a value carried out within the
range of values given for the independent variable (x). For
instance, in Example 1, the values of x used to calculate y
ranged from 20 to 29. Any estimate for y based on a value of
x within this range is known as an interpolated estimate.
▶ Extrapolation: An estimation for a dependent variable that
is based on values of the independent variable outside the
range considered in the calculation of the appropriate
regression line. In Example 1, estimating a value of y given
x = 35 is an extrapolation.
• Extrapolation is most commonly used for forecasting

93 / 132
Correlation techniques

▶ Correlation is a technique used to describe the strength of
the relationship between two variables by measuring the
degree of 'scatter' of the data values.
▶ Although regression analysis identifies a relationship for a
given set of bivariate data, it does not give any indication
of how good this relationship might be.
▶ Correlation provides a measure of how well a least squares
regression line ’fits’ the given set of data.
▶ The less scattered the data values are (i.e., the closer the
data points are to the regression line), the stronger the
correlation is.
▶ The stronger the correlation, the more confidence one would
have in using the regression line for estimation.

94 / 132
Correlation techniques: Coefficient of
correlation (correlation coefficient)
▶ Coefficient of correlation: A statistical measure which
expresses the strength of the linear correlation between
bivariate data (variables) x and y.
▶ r is a dimensionless quantity as it does not depend on the
units employed.
▶ Represented by symbol r , or R, correlation coefficient lies
from -1 to +1, that is −1 ≤ r ≤ +1.
▶ r = 0 signifies that there is no correlation between
bivariate variables.
▶ Correlation coefficient is commonly computed using the
following formula;
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}} \qquad (4)$$
▶ where n is the number of bivariate data pairs (x, y).

95 / 132
Correlation techniques: The correlation

▶ Positive (direct) correlation:

• Correlation in which an increase in the value of one variable
(say x) is associated with an increase in the value of the other
variable (say y).
• Correlation coefficient, r, takes a value between 0 and +1, with
r = +1 signifying perfect positive correlation.
• Examples of positive correlation include;
♦ Age/education of an employee and salary
♦ Machine maintenance cost and age
♦ Number of vehicles registered and road accident deaths
♦ Living costs and size of family (number of family members)
♦ Global temperature and sea levels
♦ Global population and environmental degradation
♦ Employees and wage bill

96 / 132
Correlation techniques: The correlation

▶ Negative (inverse) correlation:

• Correlation in which an increase in the value of one variable
(say x) is associated with a decrease in the value of the other
variable (say y).
• Correlation coefficient, r, takes a value between 0 and −1, with
r = −1 signifying perfect negative correlation.
• Examples of negative correlation include;
♦ Student’s absenteeism and grades
♦ Car speed and travel time
♦ Reading time and mistakes made
♦ Loan payment and debt left
♦ Size of the herd and vegetation

97 / 132
Correlation techniques

Figure 12: Correlation techniques

98 / 132
Correlation techniques

Figure 13: Correlation types

99 / 132
Correlation techniques: Coefficient of determination

▶ A statistical measure expressing how much of the variation in
one variable (such as the dependent variable) is explained by
variations in an independent variable (predictor).
▶ It is represented by R (or $r^2$) and computed by squaring the
linear correlation coefficient (r).
▶ It lies between 0 and 1.

100 / 132
Correlation techniques: Computing r and R

Example 4
The data in Table 3 relates the weekly maintenance cost ($) to the
age (in months) of ten machines of similar type in a manufacturing
company. Calculate the correlation coefficient (r ) and coefficient
of determination (R) between age and cost.

Table 3: Maintenance cost and age of machines

Machine     1   2   3   4   5   6   7   8   9  10
Age (x)     5  10  15  20  30  30  30  50  50  60
Cost (y)  190 240 250 300 310 335 300 300 350 395

101 / 132
Correlation coefficient and R
Solution for Example 4

  x     y      xy    x^2      y^2
  5   190     950     25    36100
 10   240    2400    100    57600
 15   250    3750    225    62500
 20   300    6000    400    90000
 30   310    9300    900    96100
 30   335   10050    900   112225
 30   300    9000    900    90000
 50   300   15000   2500    90000
 50   350   17500   2500   122500
 60   395   23700   3600   156025
Σx = 300   Σy = 2970   Σxy = 97650   Σx^2 = 12050   Σy^2 = 913050

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}$$

102 / 132
Correlation coefficient and R
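$$r = \frac{10 \times 97650 - 300 \times 2970}{\sqrt{10 \times 12050 - 300^2}\,\sqrt{10 \times 913050 - 2970^2}}$$
$$r = \frac{976500 - 891000}{\sqrt{120500 - 90000}\,\sqrt{9130500 - 8820900}}$$
$$r = \frac{85500}{\sqrt{30500}\,\sqrt{309600}}$$
$$r = \frac{85500}{174.64 \times 556.417}$$
r = 0.88
Then R = r^2 = 0.88^2 = 0.774
R = 0.774 means that 77.4% of the variation in maintenance cost is
explained by age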
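The same calculation can be sketched in Python using the x and y columns of the solution table for Example 4. The function name `correlation` is illustrative; it follows the sum-based formula for r term by term.

```python
import math

age = [5, 10, 15, 20, 30, 30, 30, 50, 50, 60]
cost = [190, 240, 250, 300, 310, 335, 300, 300, 350, 395]

def correlation(xs, ys):
    """Return the linear correlation coefficient r of paired data."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(xi * yi for xi, yi in zip(xs, ys))
    sum_x2 = sum(xi ** 2 for xi in xs)
    sum_y2 = sum(yi ** 2 for yi in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

r = correlation(age, cost)
R = r ** 2                       # coefficient of determination
print(round(r, 2), round(R, 3))
```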
103 / 132
Next .....

i. More practicals on linear regression problems

ii. Correlation
iii. Model evaluation and evaluation metrics
iv. Probability, matrices, sets
v. Classification problems
vi. Deep learning
vii. Clustering
viii. Data collection for ML tasks
ix. Training ML models using GPUs

104 / 132
Linear Regression: Practical

▶ Linear regression (LR) aims at establishing the function
which maps the independent features to the dependent feature
sufficiently well.
▶ Independent features are also called independent variables,
inputs, regressors or predictors.
▶ The dependent features are called dependent variables.
▶ Regression problems usually produce one continuous and
unbounded value.
▶ This differs from classification problems, which produce
discrete values.
▶ Inputs and outputs are represented as x and y, respectively.
▶ When there are n independent variables, they are represented
as a vector $X = (x_1, \ldots, x_n)$, where n is the number of
predictors.
105 / 132
Linear Regression: Simple linear regression

▶ Simple or single variate linear regression is the simplest
regression as it has a single independent variable X = x.
▶ It is also called univariate linear regression.
▶ The estimated function is represented by the equation
f(x) = b0 + b1 x, where b0 and b1 are the y-intercept and slope,
respectively.
▶ The goal is to calculate the optimal value for b0 and b1 that
minimizes the sum of squared residuals (SSR) and determine
the estimated regression function.
▶ b0 is the point where estimated regression line crosses the y
axis, i.e., the value of the output (f (x )) for x = 0.
▶ b1 determines the slope of the estimated regression line.

106 / 132
Linear Regression: Simple linear regression
▶ Simple linear regression is a linear regression of the form
y = c + bx where;
• y is the dependent variable, also known as outcome or response
• x is the independent variable (predictor)
• b is the slope of the line also known as regression coefficient
• c is the intercept
▶ In ML, y and x are commonly known as target and feature
vector, respectively.
▶ Regression line is the line that best fits the graph between
predictor variable (x ) and predicted variable i.e, y.
▶ Steps to observe when implementing linear regression;
• Install and import the packages needed, including numpy, pandas,
matplotlib and LinearRegression (from scikit-learn)
• Provide data to work with, and make appropriate transformations
• Visualize the data points to see if the data is suitable for
linear regression
• Create the regression model using the linear regression
algorithm and fit it with the existing data
• Assess the results of the model fitting to know whether it is
satisfactory
• Apply the model for prediction
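The steps above can be sketched with scikit-learn's `LinearRegression`. This is a minimal illustration rather than a full workflow: it reuses the x and y values tabulated in the solution for Example 1, skips the visualization step, and assumes numpy and scikit-learn are installed in the active environment.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Provide data; x must be 2D (one row per sample, one column per feature).
x = np.array([20, 22, 25, 26, 21, 23, 28, 20, 25, 29]).reshape(-1, 1)
y = np.array([106, 138, 158, 172, 120, 142, 184, 102, 165, 192])

# Create the model and fit it with the existing data.
model = LinearRegression().fit(x, y)

# Assess the fit: score() returns the coefficient of determination.
r_squared = model.score(x, y)
intercept = model.intercept_     # the constant c (or b0)
slope = model.coef_[0]           # the regression coefficient b (or b1)

# Apply the model for prediction on a new input value.
prediction = model.predict(np.array([[27]]))

print(intercept, slope, r_squared, prediction)
```

The fitted slope and intercept should agree with the values obtained by the least squares formulas, since `LinearRegression` solves the same ordinary least squares problem.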
107 / 132
Linear Regression: Simple linear regression
▶ Data structure: A collection of values stored in one variable
with common examples being array, dictionary, list, stack.
▶ Array: A data structure storing more than one value at a
time.
• We refer to the objects within an array as its elements.
• The method that we use to refer to elements in an array is
numbering and then indexing them
• If we have n elements in the sequence, we think of them as
being numbered from 0 to n - 1.
▶ One-dimensional array: One-dimensional (1D) array is an
array that stores a sequence of (references to) objects.
• Elements of a 1D array are indexed by a single integer
• list = [100, 200, 300, 400]
• list[2] # will display 300
▶ A two-dimensional array (2D): An array of (references to)
one-dimensional arrays
• The elements of a 2D array are indexed by a pair of integers:
the first specifying a row, and the second specifying a column.
• T = [[11, 12, 5, 2], [15, 6,10], [10, 8, 12, 5], [12,15,8,6]]
• T[0] produces [11 ,12, 5, 2]
• T[0][1] produces 12
108 / 132
▶ Classification algorithms predict one of a number of discrete
outcomes.
▶ For example, an email client sorts mail into personal mail
and junk mail- two outcomes.
▶ Involves machine learning algorithms which learn to associate
some input with some output, given a training set of examples
of inputs x and outputs y.
▶ The outputs y are commonly provided by a human "supervisor".
▶ Most classification algorithms are based on estimating a
probability distribution p(y | x ).
▶ This can be done by using maximum likelihood estimation to
find the best parameter vector θ for a parametric family of
distributions p(y | x ; θ).
▶ Popular classification algorithms include;
• Logistic regression
• K nearest neighbors (KNN)
• Support vector machine (SVM)
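To make the idea of estimating p(y | x; θ) by maximum likelihood more concrete, the sketch below fits a one-feature logistic regression model by gradient ascent on the log-likelihood. Everything specific here is an assumption made for illustration: the toy data, the learning rate, the number of steps and the function names. It is not a substitute for a library implementation.

```python
import math


def sigmoid(z):
    """Map a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(theta, xs, ys):
    """Sum of log p(y | x; theta) over the training set, with theta = (w, b)."""
    w, b = theta
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)          # model's estimate of p(y = 1 | x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


def fit_by_maximum_likelihood(xs, ys, learning_rate=0.1, steps=5000):
    """Gradient ascent on the log-likelihood (hyperparameters are assumptions)."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = sum((y - sigmoid(w * x + b)) * x for x, y in zip(xs, ys))
        grad_b = sum((y - sigmoid(w * x + b)) for x, y in zip(xs, ys))
        w += learning_rate * grad_w
        b += learning_rate * grad_b
    return w, b


# Toy data: larger x values tend to belong to class 1, with some overlap.
xs = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
ys = [0, 0, 1, 0, 1, 1]

w, b = fit_by_maximum_likelihood(xs, ys)

# The two class probabilities for a new input add up to 1.
p_class1 = sigmoid(w * 2.5 + b)
p_class0 = 1.0 - p_class1
print(w, b, p_class1, p_class0)
```

The fitted parameters give a higher log-likelihood than the starting point (w = 0, b = 0), which is what "finding the best parameter vector θ" means in this setting.
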
109 / 132
Classification - Binary classification
▶ If we have two possible classes (class 0 and class 1), then
we need only specify the probability of one of these classes.
▶ The probability of one class determines that of the other class,
as the two probabilities must add up to 1.
▶ This is binary classification, in which a classifier produces
one of two labels (output values), commonly represented as
{1, 0}, {+1, -1}, {False, True}, {blue, red} etc.
▶ In binary classification, the two labels are often referred
to as the positive and negative classes, respectively, such
as {+1, -1}.
▶ A support vector machine (SVM) is driven by the linear function
$w^\top x + b$, where $w^\top x$ is a weighted sum of the input x,
and b is a constant called the bias.
▶ However, an SVM does not output class probabilities, but only a
class label.
▶ When $w^\top x + b$ is positive, SVM predicts the presence of the
positive class.
▶ When $w^\top x + b$ is negative, SVM predicts the presence of the
negative class.
110 / 132
Classification: Binary classification
▶ SVM is suitable for binary classifications, i.e.,
classification in which the outcome consists of two values.
▶ Binary classification considers a predictor (classifier) of
the form $f : \mathbb{R}^D \to \{+1, -1\}$.
▶ Each sample (data point) $x_n$ is represented as a feature
vector of D real numbers.
▶ The two labels should not be confused with ordinary positive or
negative numbers; they simply represent two distinct classes.
▶ For instance, in a cancer detection task, a patient with cancer
can be labeled +1, and one without it -1.
▶ SVM solves a classification task using a set of examples
$x_n \in \mathbb{R}^D$, along with their corresponding (binary) labels
$y_n \in \{+1, -1\}$.
▶ Given a training data set consisting of example-label pairs
$\{(x_1, y_1), \ldots, (x_n, y_n)\}$, we estimate parameters of the model
which produce the smallest classification error.
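The decision rule (predict the positive class when the linear function of the input plus bias is positive, otherwise the negative class) and the notion of classification error can be sketched as follows. The weight vector, bias and example-label pairs are hypothetical values chosen for illustration. The sketch only evaluates a given linear classifier; it does not perform SVM training.

```python
def decision_value(w, x, b):
    """Linear function: weighted sum of the input x plus the bias b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, x, b):
    """Return +1 when the decision value is positive, otherwise -1."""
    return 1 if decision_value(w, x, b) > 0 else -1


def classification_error(w, b, examples):
    """Fraction of (x, y) pairs whose predicted label differs from y."""
    mistakes = sum(1 for x, y in examples if predict(w, x, b) != y)
    return mistakes / len(examples)


# Hypothetical parameters and example-label pairs (D = 2 features).
w = [1.0, -1.0]
b = 0.5
examples = [
    ([2.0, 0.0], +1),   # decision value  2.5 -> predicted +1 (correct)
    ([0.0, 3.0], -1),   # decision value -2.5 -> predicted -1 (correct)
    ([1.0, 1.0], +1),   # decision value  0.5 -> predicted +1 (correct)
    ([0.0, 1.0], +1),   # decision value -0.5 -> predicted -1 (wrong)
]

error = classification_error(w, b, examples)
print(error)
```

Training would search for the values of w and b that make this error (or a related loss) as small as possible on the training pairs.
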

111 / 132
Classification: Binary classification
▶ Intuitively, binary classes can be represented in a plane,
and separated by a hyperplane (linear separator) as shown in
Fig. 14.
▶ Every data sample $x_n$ is a two-dimensional location ($x_n^{(1)}$ and
$x_n^{(2)}$), and the corresponding binary label $y_n$ is shown as either
a red disc or a blue disc.
▶ An xy-plane represents data examples consisting of two
classes which contain features arranged in a way that can be
separated (classified) by drawing a straight line.
▶ The idea behind many classification algorithms is to
represent data in $\mathbb{R}^D$ and then partition this space such that
examples with the same label are in the same partition.
▶ With binary classifications, the hyperplane in the xy-plane
divides the space into two partitions corresponding to the
positive and negative classes, respectively.

112 / 132
Classification - Binary classification

Figure 14: Demonstration of a hyperplane separating binary classes

113 / 132
Support vector machines (SVMs)

▶ A class of popular supervised machine learning algorithms
(techniques) available for both classification and regression
(numeric prediction) tasks.
▶ Its two major categories, support vector classification (SVC)
and support vector regression (SVR), are used for
classification tasks and regression tasks, respectively.
▶ SVC is less sensitive to outliers than many classifiers, including
logistic regression, because it depends only on the points
closest to the decision boundary (the support vectors).
▶ Its main task is to assign new observations to one of
two classes.
▶ Although it is commonly used for binary classification tasks,
it can also be used for multiclass classification.

114 / 132
Standardization of dataset
▶ Standardization aims at making raw data more amenable to MLA.
▶ Common standardization techniques include vectorization,
normalization/scaling, data cleaning and feature cleaning.
▶ Since each feature contains values in its own range, it is
recommended to feed the MLAs with small values on a similar
scale (range), because MLAs work better with smaller values
than larger ones.
▶ It is therefore important to scale the values of all features
of the dataset onto the same standard (scale).
▶ Data normalization achieves this by rescaling each data
feature (x) so that it has a mean of 0 and a standard deviation of 1.
▶ Several Python libraries provide many tools (functions) to
perform data standardization.
▶ Ensure the data values have the following characteristics;
• Contain small values, typically in the range 0-1.
• Be homogeneous: All features take values in roughly the same range.
• Data values for each feature have a standard deviation of 1 and
a mean of 0.
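The two scalings mentioned above (mean 0 with standard deviation 1, and values in the range 0-1) can be sketched in plain Python as shown below. The feature values and function names are assumptions made for this example. The standard deviation used is the population form (dividing by n). Libraries such as scikit-learn provide equivalent tools.

```python
import math


def standardize(values):
    """Rescale a feature to mean 0 and standard deviation 1 (z-scores)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


def min_max_scale(values):
    """Rescale a feature linearly into the range 0-1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


# Hypothetical feature: heights in centimetres.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]

z_scores = standardize(heights_cm)
scaled = min_max_scale(heights_cm)

print(z_scores)
print(scaled)
```
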
115 / 132
Data encoding
▶ Since MLAs process only numerical values (numbers) and not
text, it is important to encode any categorical values
(labels) of any feature into some form of numerical values.
▶ The three classes (labels) of the iris dataset, i.e. [setosa,
versicolor, virginica], must be encoded into a certain form of
numbers to be fed as input to any MLA.
▶ Two methods commonly used to encode categorical data into
numerical format are;
• One-hot-encoded format
✓ Given [’Setosa’, ’Versicolor’, ’Virginica’], a prediction [0, 1,
0] means the model has predicted ’Versicolor’ while [1, 0, 0] means
’Setosa’.
• Class indices: Class labels are represented using integers,
commonly starting with 0.
✓ The [’Setosa’, ’Versicolor’, ’Virginica’] can be represented as
[0, 1, 2].
▶ Like data standardization, Python offers a wide range of
libraries with tools to perform these encodings automatically.
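Both encodings can be written by hand in a few lines, which may help to see what the library tools do. The sketch below uses the three iris class names; the function names are assumptions made for this example.

```python
classes = ["Setosa", "Versicolor", "Virginica"]

# Class indices: each label is mapped to an integer starting from 0.
class_index = {name: i for i, name in enumerate(classes)}


def to_class_indices(labels):
    """Encode labels as integers, e.g. 'Versicolor' -> 1."""
    return [class_index[label] for label in labels]


def to_one_hot(labels):
    """Encode labels as vectors with a single 1 at the class position."""
    vectors = []
    for label in labels:
        vector = [0] * len(classes)
        vector[class_index[label]] = 1
        vectors.append(vector)
    return vectors


def decode_one_hot(vector):
    """Return the class name marked by the 1 in a one-hot vector."""
    return classes[vector.index(1)]


labels = ["Versicolor", "Setosa", "Virginica"]
indices = to_class_indices(labels)
one_hot = to_one_hot(labels)

print(indices)
print(one_hot)
print(decode_one_hot([0, 1, 0]))
```
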

116 / 132
Model evaluation and Evaluation metrics
▶ Models must be evaluated to assess their performance on new
(unseen) data.
▶ Although you may measure the performance of the model on the training
set (data used to develop the model), common practice is to
use the validation set and test set.
▶ Models’ performance is estimated using various evaluation
(performance) metrics appropriate for the task.
▶ Evaluation metric: It is a quantitative measure used to
assess the performance and effectiveness of a statistical or
machine learning model.
▶ The final performance estimate of the model is made on
out-of-sample data, i.e. data that was not "seen" (used)
during the training process, such as the test set.
▶ Different sets of evaluation metrics exist for classification
tasks and numeric prediction tasks.
▶ It is important to choose the right evaluation metrics for a
particular ML task in order to assess the model objectively.
117 / 132
Evaluation metrics - Classification tasks
▶ Several evaluation metrics are used to evaluate performance
of classification models. These include;
• True Positive (TP): A correct prediction (classification) of a
positive class, i.e. both the actual (target) value and the
predicted value are positive.
✓ A model correctly predicts a positive sample as positive.
✓ A model correctly predicts a Setosa sample as Setosa.
• True Negative (TN): A correct prediction of a negative
class, i.e. both the actual value and the predicted value are
negative.
✓ The ground-truth contains no sample and the model predicts no sample.
✓ The model predicts a no-cancer sample as no-cancer.
• False Positive (FP): An incorrect prediction of a non-existing
object or a misclassification of an existing object by a model.
✓ The ground-truth is negative but the model predicts a positive sample.
✓ The model predicts a Setosa sample as a Versicolor sample
(a false positive for the Versicolor class).
✓ The model predicts a no-cancer sample as cancer.
• False Negative (FN): An incorrect prediction by the model of a
positive class as negative.
✓ A ground-truth (true positive) sample is falsely (incorrectly)
rejected by the model.
✓ The model predicts a cancer sample as no-cancer.
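The four outcomes defined above can be counted directly from a list of actual labels and a list of predicted labels. The sketch below does this for the cancer example; the six label pairs and the function name `confusion_counts` are made up for illustration.

```python
def confusion_counts(actual, predicted, positive):
    """Count TP, TN, FP and FN, treating `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1          # positive sample predicted as positive
        elif a != positive and p != positive:
            tn += 1          # negative sample predicted as negative
        elif a != positive and p == positive:
            fp += 1          # negative sample predicted as positive
        else:
            fn += 1          # positive sample predicted as negative
    return tp, tn, fp, fn


# Hypothetical ground-truth labels and model predictions.
actual = ["cancer", "cancer", "no-cancer", "no-cancer", "cancer", "no-cancer"]
predicted = ["cancer", "no-cancer", "no-cancer", "cancer", "cancer", "no-cancer"]

tp, tn, fp, fn = confusion_counts(actual, predicted, positive="cancer")
print(tp, tn, fp, fn)
```
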

118 / 132
Evaluation metrics - Classification tasks
▶ TP, TN, FP and FN are often organized into a confusion
matrix.
▶ Confusion matrix: It is a table describing the performance
of a machine learning model, such as a classification model.

Figure 15: Confusion matrix

119 / 132
Evaluation metrics - Classification tasks
▶ TP, FN, FP, and TN are combined to produce more
informative metrics. These include;
• Recall: Also referred to as sensitivity.
✓ It is defined as the percentage of ground-truth (actual positive)
samples that the model correctly classifies as positive.
✓ It is expressed as;
$$\text{Recall} = \frac{TP}{TP + FN} = \frac{TP}{\text{All ground-truths}} \tag{5}$$
✓ Ground-truth: A set of true (target) values for all samples in a
given data set.
• Precision: It is the percentage of true positive samples over
all positive predictions (TP + FP) made by the model.
$$\text{Precision} = \frac{TP}{TP + FP} \tag{6}$$
• Accuracy: It is the percentage of predictions which give the
correct answer over all predictions made by the model.
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \tag{7}$$
✓ Describes model’s performance across all classes in the dataset.
120 / 132
Evaluation metrics - Classification tasks

Figure 16: An example of a confusion matrix

121 / 132
Evaluation metrics - Classification tasks

▶ Using the confusion matrix in Figure 16, compute the recall,
precision and accuracy of the model.
▶ From the Figure, TP = 100, TN = 50, FP = 10, FN = 5
i. Recall:
$$\frac{TP}{TP + FN} = \frac{100}{100 + 5} = \frac{100}{105} \approx 0.952 = 95.2\% \tag{8}$$
ii. Precision:
$$\frac{TP}{TP + FP} = \frac{100}{100 + 10} = \frac{100}{110} \approx 0.909 = 90.9\% \tag{9}$$
iii. Accuracy:
$$\frac{TP + TN}{TP + FP + FN + TN} = \frac{100 + 50}{100 + 10 + 5 + 50} = \frac{150}{165} \approx 0.909 = 90.9\% \tag{10}$$
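The same calculation can be checked with a few lines of Python. The function names below are chosen for this sketch only; the four counts are the ones read from Figure 16.

```python
def recall(tp, fn):
    """Share of actual positive samples that were predicted as positive."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Share of positive predictions that were actually positive."""
    return tp / (tp + fp)


def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Counts taken from the worked example.
tp, tn, fp, fn = 100, 50, 10, 5

print(round(recall(tp, fn), 3))            # 0.952
print(round(precision(tp, fp), 3))         # 0.909
print(round(accuracy(tp, tn, fp, fn), 3))  # 0.909
```
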

122 / 132
Evaluation metrics - Numeric prediction
(Regression) tasks

▶ Mean absolute error (MAE): Computes the average (mean) of
the absolute difference between actual value ($y_i$) and
predicted value ($\bar{y}_i$) for each sample in the data set.
$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \bar{y}_i| \tag{11}$$
• Taking the absolute differences ensures that the negative and
positive residuals do not cancel out.
• A small MAE suggests that the model is doing well, while a
large MAE value means the model is performing poorly.

123 / 132
Evaluation metrics - Numeric prediction
(Regression) tasks

Figure 17: Mean absolute error (MAE)

124 / 132
Evaluation metrics - Numeric prediction
(Regression) tasks

▶ Mean squared error (MSE): It resembles MAE, but squares the
differences before summing them all.
• Compute the difference between actual value ($y_i$) and predicted
value ($\bar{y}_i$) for each sample, square the difference, sum them and
divide the sum by the number of samples.
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y}_i)^2 \tag{12}$$
• Outliers will contribute much more to the total error than they
would in MAE because their large differences $y_i - \bar{y}_i$
between actual value and predicted value are squared.
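The effect of an outlier on the two metrics can be seen in a small experiment. In the sketch below, both prediction lists are made up for illustration: the first is off by 1 on every sample, and the second is identical except that the last prediction is off by 10.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


actual = [10, 20, 30, 40]
close_predictions = [11, 21, 31, 41]     # every prediction is off by 1
with_outlier = [11, 21, 31, 50]          # last prediction is off by 10

print(mean_absolute_error(actual, close_predictions))   # 1.0
print(mean_squared_error(actual, close_predictions))    # 1.0
print(mean_absolute_error(actual, with_outlier))        # 3.25
print(mean_squared_error(actual, with_outlier))         # 25.75
```

A single large error raises the MAE from 1.0 to 3.25, but raises the MSE from 1.0 to 25.75.
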

125 / 132
Evaluation metrics - Numeric prediction
(Regression) tasks

▶ Root mean squared error (RMSE): MSE is expressed in squared
units of the data; RMSE solves this problem by taking the square
root of the final value so as to scale it back to the same units
as the data.
• A lower RMSE value signifies that the model performs well, while
a larger RMSE signifies that the model is not performing well.
$$\text{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y}_i)^2}{n}} \tag{13}$$
• RMSE is the square root of the MSE, i.e., $\text{RMSE} = \sqrt{\text{MSE}}$
• You only need to compute MSE and then find its square root to
get the RMSE.

126 / 132
Evaluation metrics - Numeric prediction
(Regression) tasks
▶ Given Table 4 below (n = 5), compute the values for MAE,
MSE and RMSE.

Table 4: Actual values and Predicted values for LR metrics

Actual value (yi)   Predicted value (ȳi)   |yi − ȳi|   (yi − ȳi)   (yi − ȳi)²
10                  12                     2           -2          4
23                  21                     2            2          4
45                  40                     5            5          25
21                  11                     10           10         100
44                  49                     5           -5          25

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \bar{y}_i| = \frac{1}{5} \times 24 = 4.8 \tag{14}$$
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y}_i)^2 = \frac{1}{5} \times 158 = 31.6 \tag{15}$$
$$\text{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y}_i)^2}{n}} = \sqrt{\frac{158}{5}} \approx 5.6 \tag{16}$$
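The values for Table 4 can be verified with a short Python sketch. The variable names are chosen for this example; the actual and predicted values are those listed in the table.

```python
import math

# Actual and predicted values from Table 4 (n = 5).
actual = [10, 23, 45, 21, 44]
predicted = [12, 21, 40, 11, 49]
n = len(actual)

# Differences between actual and predicted values, one per sample.
differences = [a - p for a, p in zip(actual, predicted)]

mae = sum(abs(d) for d in differences) / n      # sum of |differences| is 24
mse = sum(d ** 2 for d in differences) / n      # sum of squares is 158
rmse = math.sqrt(mse)                           # square root of the MSE

print(round(mae, 1), round(mse, 1), round(rmse, 1))   # 4.8 31.6 5.6
```
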
127 / 132
Deep learning
▶ Artificial neural networks (ANNs) are a special class of
machine learning algorithms which attempt to mimic the
functioning of a human brain.
▶ ANNs are organized into layered structure with each layer
consisting of computational (data processing) units called
artificial neurons (or neurons/nodes).
▶ Neurons of adjacent layers are connected by links carrying
numerical values called weights (also known as parameters).
▶ Unlike many conventional ML algorithms, ANNs extract features
from data automatically.
▶ There exist different types of ANNs, including deep learning.
▶ Deep learning (DL), also known as deep neural network (DNN),
is a class of ANNs which contain multiple layers of neurons.
▶ This multi-layer structure enables them to learn features
automatically and progressively from data.
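The layered structure described above can be illustrated with a tiny forward pass written in plain Python. This is a sketch under stated assumptions: the weights, the zero biases, the sigmoid activation function and the layer sizes are all chosen for illustration and are not taken from the slides. No training is performed.

```python
import math


def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs plus bias, then sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weight_rows, biases):
    """A layer is a list of neurons that all receive the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


# Input layer: two feature values (made up).
x = [0.5, -1.0]

# Hidden layer: three neurons, each with one weight per input.
hidden_weights = [[1.0, 2.0], [-1.0, 0.5], [0.0, 0.0]]
hidden = layer(x, hidden_weights, [0.0, 0.0, 0.0])

# Output layer: one neuron with one weight per hidden neuron.
output = layer(hidden, [[1.0, -1.0, 0.5]], [0.0])

print(hidden)
print(output)
```

Each value in `hidden` is computed from the raw inputs, and the output is computed from the hidden values. Stacking more such layers gives a deep network.
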

128 / 132
Deep learning

Figure 18: General structure of an artificial neural network (ANN)

129 / 132
Deep learning
▶ Compared to conventional ML algorithms and other types of
ANNs, DL offers a number of advantages including;
• They learn and extract features from data automatically (i.e. they do
not need manually extracted features)
• They are well suited to parallelization (i.e., parallel processing)
on single or multiple central processing units (CPUs), graphics
processing units (GPUs) and tensor processing units (TPUs)
• They can be trained on additional data without starting from
scratch
• They can be trained on one dataset and deployed on another
similar dataset
• They can efficiently handle large and high-dimensional data
including audio, text, images etc.
▶ Major types of DL include;
• Convolutional neural networks (CNNs)
• Recurrent neural networks
• Recursive neural networks
• Autoencoders
• Deep belief networks (DBNs)
• Generative adversarial networks (GANs)
• Deep Q-learning
130 / 132
Deep learning
▶ Of all types of DL algorithms, CNNs are commonly used for image
processing tasks including image classification because;
• Their layers have a spatial shape (width, height, depth) which matches
the structure of an image
✓ This shape enables CNNs to process images efficiently.
• They extract features from images automatically, as manual feature
extraction is time-consuming and labour-intensive
• They have locally connected patches of neurons, which reduce the
number of neuron connections between layers, parameters to
train and computations.
• They can learn translation-invariant local features, which
enables any learned feature to be recognized when it
appears in other parts of an image.
• They learn features hierarchically, with outer layers learning
generic features about objects, while inner layers learn more specific
features.
• They are designed to exploit relationship between space and
▶ Examples of CNN architectures include AlexNet, Residual
Network (ResNet), Inception ResNet, Visual Geometry Group
(VGG), MobileNet and GoogLeNet.
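Local connectivity and translation-invariant features can be illustrated with a hand-written convolution. In the sketch below, a small 2 x 2 kernel (only four weights) is slid over a 4 x 4 image. The images, the kernel and the function names are assumptions made for this example, and the kernel is fixed by hand rather than learned. As in most deep learning libraries, the kernel is applied without flipping.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Each output value depends only on a local kh x kw patch.
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map


def strongest_response(feature_map):
    """Return the largest value in the feature map and its (row, column)."""
    best = max(max(row) for row in feature_map)
    for r, row in enumerate(feature_map):
        for c, value in enumerate(row):
            if value == best:
                return best, (r, c)


# A kernel that responds to a small diagonal pattern.
kernel = [[1, 0],
          [0, 1]]

# The same diagonal pattern placed in two different parts of the image.
image_top_left = [[1, 0, 0, 0],
                  [0, 1, 0, 0],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0]]
image_bottom_right = [[0, 0, 0, 0],
                      [0, 0, 0, 0],
                      [0, 0, 1, 0],
                      [0, 0, 0, 1]]

print(strongest_response(convolve2d(image_top_left, kernel)))      # (2, (0, 0))
print(strongest_response(convolve2d(image_bottom_right, kernel)))  # (2, (2, 2))
```

The same four weights detect the pattern with the same strength in both positions; only the location of the response changes.
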
131 / 132
Deep learning

Figure 19: Generic structure of the convolutional neural network (CNN)

132 / 132
