
ML Unit-II Notes

The document discusses the importance of data in machine learning, highlighting its role in model training, validation, and testing. It categorizes data into qualitative and quantitative types, detailing subtypes such as nominal, ordinal, discrete, and continuous data. Additionally, it covers statistical concepts essential for data analysis, including descriptive statistics, measures of central tendency, and variability.

Uploaded by

Asfia Al Hera

ML Unit-II

Data in Machine Learning


Data refers to facts and statistics collected together for reference or analysis.

Set of values of qualitative or quantitative variables about one or more persons or objects

Any unprocessed fact, value, text, sound, picture, video, code, plot or graph, etc.

Without data, we can’t train any model, and all modern research and automation would be in vain.

Big enterprises spend large amounts of money just to gather as much data as possible.

Example: Why did Facebook acquire WhatsApp by paying a huge price of $19 billion?
To have access to the users’ information.
To facilitate the improvement of their services.

Helps in predicting the future or forecast based on the previous trend of data.

Helps in determining patterns that may exist between data.

Helps in detecting fraud by uncovering anomalies in the data.


Datasets in Machine Learning

Training Data - Used to train the model

Validation Data - Used for frequent evaluation of the model during training

Testing Data - After prediction, we test our model by comparing its output with the actual output present in the testing data

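As a rough illustration of the three roles above, the sketch below shuffles a list of examples and cuts it into training, validation and testing parts. The function name `split_dataset` and the 70/15/15 proportions are assumptions made for this example; the notes do not prescribe any particular ratio.

```python
import random


def split_dataset(rows, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle rows and split them into training, validation and testing parts.

    The fractions are illustrative assumptions, not fixed rules.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever is left is held out for testing
    return train, val, test


train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

Every example lands in exactly one of the three parts, so the testing data is never seen while the model is being trained.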
Types of Data in Machine Learning

Categories / Types of Data

Qualitative Data (Categorical or Attribute): Nominal, Ordinal

Quantitative Data (Numerical): Discrete / Interval, Continuous
Qualitative Data
Characteristics and descriptions that can’t be easily measured, but can be observed & recorded
subjectively. [ Non-numerical in nature]

Categorical data – data that can be arranged categorically based on the attributes and properties of a thing
or a phenomenon.

For example, think of a student reading a paragraph from a book during one of the class sessions. A teacher
who is listening to the reading gives feedback on how the child read that paragraph. If the teacher gives
feedback based on fluency, intonation, throw of words, clarity in pronunciation without giving a grade to
the child, this is considered as an example of qualitative data.

Descriptive and conceptual findings collected through questionnaires, interviews, or observation.

Qualitative data is about the emotions or perceptions of people, what they feel.

Gender, country name, animal species, and emotional state are examples of qualitative data.
Nominal Data
Data with no inherent order or ranking, such as gender or race, is called Nominal data.

In statistics, nominal data (also known as nominal scale) is a classification of categorical variables that do
not provide any quantitative value.

It is sometimes referred to as labelled or named data.

Nominal data cannot be ordered and cannot be measured.


Characteristics of Nominal Data
Ordinal Data
Data with an ordered series, such as that shown in the table, is called Ordinal data.

Ordinal data is a type of qualitative data where the variables have natural, ordered categories and the
distances between the categories are not known.

For example, ordinal data is said to have been collected when a customer inputs his/her satisfaction on the
variable scale — "satisfied, indifferent, dissatisfied".

An organization creates an employee exit questionnaire that primarily highlights this question:
“How well is Mr. Abdul Rais teaching Machine Learning?” (Likert Scale)
Excellent
Very Good
Good
Average
Poor
Discrete / Interval Data
Discrete data is a count that involves integers; only a limited (finite) number of values is possible.

This type of data cannot be subdivided into smaller parts.

Discrete data includes discrete variables that are finite, numeric, countable, and non-negative integers.

This data type is mainly used for simple statistical analysis because it’s easy to summarize and compute.

Discrete data is counted rather than measured. For example, you measure your weight with the help of a scale, so your weight is not discrete data.

Discrete data can be easily visualized and demonstrated using simple statistical methods such as bar
charts, line charts, or pie charts.
Examples of Discrete / Interval Data
Number of students in a class

The number of workers in a company.

The number of parts damaged during transportation.

Shoe sizes.

Number of languages an individual speaks.

The number of home runs in a baseball game.

The number of test questions you answered correctly.

Instruments in a shelf.

The number of siblings a randomly selected individual has.


Continuous Data
Continuous data can hold an infinite number of possible values between any two points.

Continuous data can assume any numeric value and can be meaningfully split into smaller parts.

Consequently, they have valid fractional and decimal values.

Generally, you measure them using a scale.

For example, you have continuous data when you measure weight, height, length, time, and temperature.

Frequently, you’ll use histograms and scatterplots to graph continuous variables. These graphs are
designed to handle values that fall on a continuous spectrum and have decimal places.
Data Representation
A machine learning model can't directly see, hear, or sense input examples.

Instead, you must create a representation of the data to provide the model with a useful vantage point into
the data's key qualities.

That is, in order to train a model, you must choose the set of features that best represent the data.

The main objective of machine learning is to build models by interpreting data. To do so, it is highly
important to feed the data in a way that is readable by the computer.

To feed data into a scikit-learn model, it must be represented as a table or matrix of the required
dimension

Most tables fed into machine learning problems are two-dimensional (rows and columns)

Each row represents an observation (an instance)

Each column represents a characteristic (feature) of each observation.


Data Representation
The following table is a fragment of a sample dataset of scikit-learn. The purpose of the dataset is to
differentiate from among three types of iris plants based on their characteristics.
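A minimal sketch of this two-dimensional layout in plain Python follows. The feature names and the three rows are illustrative values chosen for this example in the style of the iris data; they are an assumption, not a copy of the table referred to above.

```python
# Each inner list is one row (an observation); each position is one column (a feature).
feature_names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
X = [
    [5.1, 3.5, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4],
    [6.3, 3.3, 6.0, 2.5],
]
# One target label per row: the type of iris plant to be predicted.
y = ["setosa", "versicolor", "virginica"]

n_rows, n_cols = len(X), len(X[0])
print(n_rows, "observations x", n_cols, "features")
```

Keeping the features in `X` and the labels in `y` mirrors the row/column convention described above: rows are instances, columns are characteristics.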
Statistics
An area of applied mathematics concerned with data collection, analysis, interpretation, presentation and visualization.

Governmental needs for census data as well as information about a variety of economic activities
provided much of the early impetus for the field of statistics.

The need to turn large amounts of data into useful information has stimulated both theoretical and practical
developments in statistics.

Any time data are collected & analyzed, statistics are being done. This can range from government
agencies to academic research to analyzing investments.

There are two types of statistics

Descriptive Statistics

Inferential Statistics
Examples of Statistics
You and a friend are at a baseball game, and out of the blue he offers you a bet that neither team will hit a
home run in that game. Should you take the bet?

Your company has created a new drug that may cure cancer. How would you conduct a test to confirm the
drug’s effectiveness?

The latest sales data have just come in, and your boss wants you to prepare a report for management on
places where the company could improve its business. What should you look for? What should you not
look for?
Basic Terminologies in Statistics:
Population
A collection or set of individuals, objects or events whose properties are to be analyzed
Sample
A subset of population is called ‘Sample’. A well-chosen sample will contain most of the information
about a particular population parameter.
Descriptive Statistics
Descriptive statistics summarize & organize characteristics of a data set.

[A data set is a collection of responses or observations from a sample or entire population]

Provide descriptions of the population or sample, either through numerical calculations or graphs or tables

It is mainly focused upon the main characteristics of data. It provides graphical summary of the data.

Visual/ Graphical representation is more effective than presenting huge numbers.

Descriptive Statistical Analysis helps you to understand your data and is a very important part of ML. This is
due to ML being all about making predictions. On the other hand, statistics is all about drawing conclusions
from data, which is a necessary initial step.

Apron / T-Shirt Example


Types of Descriptive Statistics

Distribution concerns the frequency of each value.

Central tendency concerns the averages of the values.

Variability or dispersion concerns how spread out the values are.
Frequency Distribution
A data set is made up of a distribution of values, or scores.

In tables or graphs, you can summarize the frequency of every possible value of a variable in numbers or
percentages.
Simple frequency distribution table
For the variable of gender, you list all possible answers on the left column. You count the number or
percentage of responses for each answer and display it on the right column.
From this table, you can see that more women than men or people with another gender identity took
part in the study.

Gender Number

Male 182
Female 235
Other 27
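The simple frequency table above can be sketched in a few lines. The raw responses are rebuilt from the counts in the table purely for illustration; the percentage column is an addition that the table itself does not show.

```python
from collections import Counter

# Rebuild the individual responses from the counts in the table (illustration only).
responses = ["Male"] * 182 + ["Female"] * 235 + ["Other"] * 27

counts = Counter(responses)      # value -> number of occurrences
total = sum(counts.values())

for answer, count in counts.most_common():
    percent = 100 * count / total
    print(f"{answer:<8}{count:>5}{percent:>8.1f}%")
```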
Frequency Distribution
Grouped frequency distribution table

In a grouped frequency distribution, you can group numerical response values and add up the
number of responses for each group. You can also convert each of these numbers to percentages.

From this table, you can see that most people visited the library between 5 and 16 times in the past year.

Library visits in the past year Percent


0–4 6%
5–8 20%
9–12 42%
13–16 24%
17+ 8%
Measure of Central Tendency
Measures of central tendency estimate the center, or average, of a data set.

Central tendency is sometimes called “measures of location,” “central location,” or just “center”.

Central tendency doesn’t tell you specifics about the individual pieces of data, but it gives you an overall
picture of what is going on in the entire data set.

 The 3 most common measures of central tendency are

Mean: the sum of all values divided by the total number of values.

Median: the middle number in an ordered data set.

Mode: the most frequent value.


Mean / Arithmetic Mean
The mean, or M, is the most commonly used method for finding the average.
To find the mean, simply add up all response values and divide the sum by the total number of responses or
observations (N)
The mean is the average of a set of numbers.

Mean number of library visits

Data set 15, 3, 12, 0, 24, 3


Sum of all values 15 + 3 + 12 + 0 + 24 + 3 = 57
Total number of responses N=6
Mean Divide the sum of values by N to find M: 57/6 = 9.5
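The same calculation as a short sketch, using the library-visit data from the table; the variable names are chosen for this example.

```python
visits = [15, 3, 12, 0, 24, 3]

total = sum(visits)      # 57
n = len(visits)          # 6
mean = total / n         # sum of all values divided by the number of values
print(mean)              # 9.5
```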
Median
The median is the middle or midpoint of a set of numbers. Think of it like the median in a road (that grassy
area in the middle that separates traffic). Place your data in order, and the number in the exact centre of
a list is the median.

Median is the 50th percentile of the data. It is exactly the centre point of the data.

The median can be identified by ordering the data in ascending order, splitting it into two equal
parts, and finding the middle number.
Median number of library visits (even number of values)

Ordered data set: 0, 3, 3, 12, 15, 24
Middle numbers: 3, 12
Median: Find the mean of the two middle numbers: (3 + 12)/2 = 7.5

Median of a second data set (odd number of values)

Ordered data set: 10, 20, 30, 40, 50
Middle number: 30
Median: The middle number is the median: 30
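A small sketch of the even and odd cases, assuming a non-empty list of numbers; the function name `median` is chosen for this example (Python's standard `statistics.median` does the same job).

```python
def median(values):
    """Middle value of the ordered data; mean of the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                               # odd count: one middle number
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the two middle numbers


print(median([15, 3, 12, 0, 24, 3]))   # 7.5
print(median([10, 20, 30, 40, 50]))    # 30
```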
Mode
The mode is the most common number in a set.

The mode is the most popular or most frequent response value. A data set can have no mode, one mode,
or more than one mode.

To find the mode, order your data set from lowest to highest and find the response that occurs most
frequently.
Some data sets have no mode, one mode, two modes, etc.
None: 1, 2, 3, 4, 6, 8, 9.
One mode (unimodal): 1, 2, 3, 3, 4, 5.
Two modes (bimodal): 1, 1, 2, 3, 4, 4, 5.
Three modes (trimodal): 1, 1, 2, 3, 3, 4, 5, 5.
More than one (two, three or more) = multimodal.

Mode number of library visits
Ordered data set: 0, 3, 3, 12, 15, 24
Mode: Find the most frequently occurring response: 3
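The cases above (no mode, one mode, several modes) can be sketched as follows. The function name `modes` and the convention of returning an empty list when every value occurs only once are assumptions made for this example; the input is assumed to be non-empty.

```python
from collections import Counter


def modes(values):
    """Return every most-frequent value, or [] when no value repeats (no mode)."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:               # every value occurs once: treat as "no mode"
        return []
    return sorted(v for v, c in counts.items() if c == top)


print(modes([1, 2, 3, 4, 6, 8, 9]))        # []
print(modes([1, 1, 2, 3, 4, 4, 5]))        # [1, 4]
print(modes([0, 3, 3, 12, 15, 24]))        # [3]
```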
Measure of Variability / Spread
It is used to describe the variability in a sample or population.

The dispersion is the “Spread of the data”. It measures how far the data is spread.

In most data sets, the data values are located close to the mean. In some other data sets, the
values are widely spread out from the mean.

When the dispersion of a data set is large, the values in the set are widely scattered; when it is small, the items in
the set are tightly clustered.
These dispersions of data can be measured by
Range
Inter Quartile Range ( IQR )
Standard Deviation
Variance
 Spread can also be shown in graphs: dot plots, boxplots, and stem and leaf plots have a greater distance
with samples that have a larger dispersion and vice versa.
Range
It is highest value minus the lowest value.
Formula : Range = Max Value – Min Value
Spread of your data from the lowest to the highest value in the distribution.
It is a commonly used measure of variability.
To find the range, follow these steps:
Order all values in your data set from low to high.
Subtract the lowest value from the highest value.
This process is the same regardless of whether your values are positive or negative, or whole numbers
or fractions.
Range example: Your data set is the ages of 8 participants.

Participant    1    2    3    4    5    6    7    8
Age            37   19   31   29   21   26   33   36

First, order the values from low to high to identify the lowest value (L) and the highest value (H).

Age            19   21   26   29   31   33   36   37

R = H – L
R = 37 – 19 = 18
The range of our data set is 18 years.
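The range example as a short sketch; `max` and `min` find H and L directly, so sorting is only needed when doing the steps by hand.

```python
ages = [37, 19, 31, 29, 21, 26, 33, 36]

highest = max(ages)               # H = 37
lowest = min(ages)                # L = 19
data_range = highest - lowest     # R = H - L
print(data_range)                 # 18
```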
Inter Quartile Range (IQR)
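Quartiles are values that divide your data into quarters. They divide your data into four segments
according to where the numbers fall on the number line. The four quarters that divide a data set into
quartiles are:
The lowest 25% of numbers.
The next lowest 25% of numbers (up to the median).
The second highest 25% of numbers (above the median).
The highest 25% of numbers.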

Example: Divide the following data set into quartiles: 2, 5, 6, 7, 10, 22, 13, 14, 16, 65, 45, 12.

Step 1: Put the numbers in order: 2, 5, 6, 7, 10, 12, 13, 14, 16, 22, 45, 65.

Step 2: Count how many numbers there are in your set and then divide by 4 to cut the list of numbers into
quarters.

There are 12 numbers in this set, so you would have 3 numbers in each quartile:
2, 5, 6 | 7, 10, 12 | 13, 14, 16 | 22, 45, 65
Inter Quartile Range (IQR): It is the difference between the third quartile (Q3) and the first Quartile (Q1)

Formula : IQR = Q3 - Q1
Inter Quartile Range (IQR)
Step 1: Put the numbers in order.
1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27

Step 2: Find the median.


1, 2, 5, 6, 7, 9 , 12, 15, 18, 19, 27

Step 3: Place parentheses around the numbers above and below the median.
Not necessary statistically, but it makes Q1 and Q3 easier to spot.
(1, 2, 5, 6, 7), 9 , (12, 15, 18, 19, 27)

Step 4: Find Q1 and Q3


Think of Q1 as a median in the lower half of the data and think of Q3 as a median for the upper half of data.
(1, 2, 5, 6, 7), 9 , ( 12, 15, 18, 19, 27) Q1 = 5 and Q3 = 18.

Step 5: Subtract Q1 from Q3 to find the interquartile range.


18 – 5 = 13.
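The five steps can be sketched as below. The function name `quartiles` is chosen for this example, and it follows the convention used in the steps (Q1 and Q3 are the medians of the lower and upper halves, excluding the overall median when the count is odd). Other quartile conventions exist and can give slightly different values.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves of the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # With an odd count the overall median is left out of both halves.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(upper)


data = [1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27]
q1, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q3, iqr)   # 5 18 13
```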
Variance
Variance is the expectation of the squared deviation of a random variable from its population
mean or sample mean.
The variance is the average of squared deviations from the mean.
Variance reflects the degree of spread in the data set. The more spread the data, the larger the variance is
in relation to the mean.
Follow these five steps:
List the scores/observations and find their mean.
Find the deviation by subtracting the mean from each score/observation.
Square each deviation.
Total up all the squared deviations.
Divide the sum of the squared deviations by n-1 (Sample) or n (Population).
Equivalently, to find the variance, simply square the standard deviation.

The symbol for variance is s²


Standard Deviation
The standard deviation (s) is the average amount of variability in your dataset.

It tells you, on an average, how far each score lies from the mean.

The larger the standard deviation, the more variable the data set is.

The square root of the variance


Follow these six steps:
List the scores and find their mean.
Find the deviation by subtracting the mean from each score.
Square each deviation.
Total up all the squared deviations.
Divide the sum of the squared deviations by N-1.
Find the result’s square root.
Formula: s = √[ Σ(x – M)² / (N – 1) ], where x is each score, M is the mean and N is the number of scores.
Standard Deviation
Example

Standard deviations of visits to the library in the past year


In the table below, you complete Steps 1 through 4.

Raw data    Deviation from mean    Squared deviation
15          15 – 9.5 = 5.5         30.25
3           3 – 9.5 = -6.5         42.25
12          12 – 9.5 = 2.5         6.25
0           0 – 9.5 = -9.5         90.25
24          24 – 9.5 = 14.5        210.25
3           3 – 9.5 = -6.5         42.25
M = 9.5     Sum = 0                Sum of squares = 421.5

Step 5: 421.5/5 = 84.3
Step 6: √84.3 = 9.18
From learning that s = 9.18, you can say that on average, each score deviates from the mean by 9.18 points.
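The six steps, applied to the same library-visit data, can be sketched as follows. This uses the sample form (dividing by N-1), as in Step 5; the variable names are chosen for this example.

```python
import math

visits = [15, 3, 12, 0, 24, 3]

n = len(visits)
mean = sum(visits) / n                       # Step 1: 9.5
deviations = [x - mean for x in visits]      # Step 2: deviation of each score from the mean
squared = [d ** 2 for d in deviations]       # Step 3: square each deviation
sum_of_squares = sum(squared)                # Step 4: 421.5
sample_variance = sum_of_squares / (n - 1)   # Step 5: divide by N-1
sample_sd = math.sqrt(sample_variance)       # Step 6: square root of the variance

print(round(sample_variance, 2), round(sample_sd, 2))   # 84.3 9.18
```

Dividing by `n` instead of `n - 1` in Step 5 would give the population variance and standard deviation.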
Mean Absolute Deviation using Sample Dataset
 It is called Mean Absolute Deviation (MAD)
The mean absolute deviation of a dataset is the average distance between each data point and the mean.
It gives us an idea about the variability in a dataset.
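Formula: MAD = Σ|xi – x̄| / n, where xi is each data value, x̄ is the mean and n is the number of data values.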

Here's how to calculate the mean absolute deviation.


Step 1: Calculate the mean.
Step 2: Calculate how far away each data point is from the mean using positive distances. These are called
absolute deviations.
Step 3: Add those deviations together.
Step 4: Divide the sum by the number of data points.
Mean Absolute Deviation using Sample Dataset
Example:
Suppose that we start with the following data set: 1, 2, 2, 3, 5, 7, 7, 7, 7, 9.
The mean of this data set is 5. The following table will organize our work in calculating the mean absolute
deviation about the mean.

Data Value    Deviation from mean    Absolute Value of Deviation
1             1 - 5 = -4             |-4| = 4
2             2 - 5 = -3             |-3| = 3
2             2 - 5 = -3             |-3| = 3
3             3 - 5 = -2             |-2| = 2
5             5 - 5 = 0              |0| = 0
7             7 - 5 = 2              |2| = 2
7             7 - 5 = 2              |2| = 2
7             7 - 5 = 2              |2| = 2
7             7 - 5 = 2              |2| = 2
9             9 - 5 = 4              |4| = 4
Total of Absolute Deviations: 24

We now divide this sum by 10, since there are a total of ten data values. The mean absolute deviation about
the mean is 24/10 = 2.4.
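The four MAD steps on the same data set, as a short sketch with variable names chosen for this example:

```python
data = [1, 2, 2, 3, 5, 7, 7, 7, 7, 9]

mean = sum(data) / len(data)                     # Step 1: 5.0
abs_deviations = [abs(x - mean) for x in data]   # Step 2: positive distances from the mean
total_abs = sum(abs_deviations)                  # Step 3: 24.0
mad = total_abs / len(data)                      # Step 4: divide by the number of data points

print(mean, total_abs, mad)                      # 5.0 24.0 2.4
```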
Percentile
Percentiles indicate the percentage of scores that fall below a particular value.

They tell you where a score stands relative to other scores.

You might know that you scored 67 out of 90 on a test. But that figure has no real meaning unless you know
what percentile you fall into.

Example-1: If you know that your score is in the 90th percentile, that means you scored better than 90% of
people who took the test.

Example-2: A person with an IQ of 120 is at the 91st percentile, which indicates that their IQ is higher than 91
percent of other scores.
Percentile
The general rule is that if value X is at the kth percentile, then X is greater than k% of the values.

Percentiles are commonly used to report scores in tests, like the SAT, GRE and LSAT.

For example, the 70th percentile on the 2013 GRE was 156. That means if you scored 156 on the exam, your
score was better than 70 percent of test takers.

The 25th percentile is also called the first quartile.

The 50th percentile is generally the median

The 75th percentile is also called the third quartile.


Percentile: Three Definitions
There are three definitions of percentile:

Definition-1: The nth percentile is the lowest score that is greater than a certain percentage (“n”) of the
scores.

Definition-2: The nth percentile is the smallest score that is greater than or equal to a certain percentage
of the scores. To rephrase this, it’s the percentage of data that falls at or below a certain observation. This is
the definition used in AP statistics.

Definition-3: A weighted mean of the percentiles from the first two definitions.

Percentile Rank
If you score in the 25th percentile, then 25% of test takers are below your score. The “25” is called
the percentile rank.
How to find a Percentile
SCORE    RANK
30       1
33       2
43       3
53       4
56       5
67       6
68       7
72       8

Example question: Find out where the 25th percentile is in the above list.

Step 1: Calculate what rank is at the 25th percentile. Use the following formula:
Rank = Percentile / 100 * (number of items + 1)
Rank = 25 / 100 * (8 + 1) = 0.25 * 9 = 2.25.
A rank of 2.25 is at the 25th percentile. So you must either round up, or round down. As 2.25 is closer to 2
than 3, I’m going to round down to a rank of 2.
How to find a Percentile
Step 2: Choose either definition 1 or 2:
Definition 1: The lowest score that is greater than 25% of the scores. That equals a score of 43 on this list (a
rank of 3).
Definition 2: The smallest score that is greater than or equal to 25% of the scores. That equals a score of 33
on this list (a rank of 2).
Depending on which definition you use, the 25th percentile could be reported at 33 or 43! A third definition
attempts to correct this possible misinterpretation:
Definition 3: A weighted mean of the percentiles from the first two definitions.

In the above example, here’s how the percentile would be worked out using the weighted mean:

Multiply the difference between the scores by 0.25 (the fractional part of the rank of 2.25 calculated above). The scores
were 43 and 33, giving us a difference of 10:
(0.25)(43 – 33) = 2.5

Add the result to the lower score. 2.5 + 33 = 35.5

In this case, the 25th percentile score is 35.5, which makes more sense as it lies between 33 and 43.
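The weighted-mean approach (Definition 3) can be sketched as below, using the rank formula from Step 1. The function name `percentile` and the clamping at the ends of the list are assumptions made for this example; statistical packages offer several other interpolation conventions.

```python
def percentile(scores, p):
    """Percentile by rank = p/100 * (n + 1), interpolating between neighbouring scores."""
    ordered = sorted(scores)
    n = len(ordered)
    rank = p / 100 * (n + 1)      # e.g. 2.25 for the 25th percentile of 8 scores
    whole = int(rank)             # rank of the lower neighbour
    fraction = rank - whole       # how far to move toward the upper neighbour
    if whole < 1:                 # rank falls before the first score
        return ordered[0]
    if whole >= n:                # rank falls at or after the last score
        return ordered[-1]
    lower, upper = ordered[whole - 1], ordered[whole]
    return lower + fraction * (upper - lower)


scores = [30, 33, 43, 53, 56, 67, 68, 72]
print(percentile(scores, 25))     # 35.5
```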
Outlier
An outlier is a piece of data that is an abnormal distance from other points.

In other words, it’s data that lies outside the other values in the set. If you had Pinocchio in a class of
children, the length of his nose compared to the other children would be an outlier.

Outliers are stragglers — extremely high or extremely low values — in a data set that can throw off your
stats. For example, if you were measuring children’s nose length, your average value might be thrown off if
Pinocchio was in the class.

A box and whiskers chart (boxplot) often shows outliers

An outlier is a data point that is noticeably different from the rest.

They may represent errors in measurement, bad data collection, or simply show variables not considered
when collecting the data.
Outlier
In this set of random numbers, 1 and 201 are outliers:
1, 99, 100, 101, 103, 109, 110, 201
“1” is an extremely low value and “201” is an extremely high value.

Of course, trying to find outliers isn’t always that simple. Your data set may look like this:
61, 10, 32, 19, 22, 29, 36, 14, 49, 3.
You could take a guess that 3 might be an outlier and perhaps 61. But you’d be wrong: 61 is the only outlier
in this data set.
Impact of Outlier on ML Models
In supervised models, outliers can deceive the training process, resulting in prolonged training times or
leading to the development of less precise models.

According to Alvira Swalin, a data scientist at Uber, machine learning models like linear & logistic
regression are easily influenced by the outliers in the training data. Some models even exist that hike
the weights of misclassified points for every repetition of the training.


How to Find Outliers Using IQR
An outlier is defined as any data point that lies more than 1.5 IQRs below the first quartile (Q1) or
above the third quartile (Q3) in a data set.
High = (Q3) + 1.5 IQR
Low = (Q1) – 1.5 IQR

Example Question: Find the outliers for the following data set: 3, 10, 14, 22, 19, 29, 70, 49, 36, 32.

Step 1: Find the IQR, Q1 (25th percentile) and Q3 (75th percentile).


IQR = 22
Q1 = 14
Q3 = 36
How to Find Outliers Using IQR
Step 2: Multiply the IQR you found in Step 1 by 1.5:
IQR * 1.5 = 22 * 1.5 = 33.

Step 3: Add the amount you found in Step 2 to Q3 from Step 1:


33 + 36 = 69.

This is your upper limit. Set this number aside for a moment.

Step 4: Subtract the amount you found in Step 2 from Q1 from Step 1:
14 – 33 = -19.
This is your lower limit. Set this number aside for a moment.
How to Find Outliers Using IQR
Step 5: Put the numbers from your data set in order:
3, 10, 14, 19, 22, 29, 32, 36, 49, 70

Step 6: Insert your low and high values into your data set, in order:
-19, 3, 10, 14, 19, 22, 29, 32, 36, 49, 69, 70

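Step 7: Highlight any number below or above the numbers you inserted in Step 6:

-19, 3, 10, 14, 19, 22, 29, 32, 36, 49, 69, [70]

Only 70 lies outside the limits (it is above the upper limit of 69), so 70 is the only outlier in this data set.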
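The whole procedure can be sketched in one function. The name `iqr_outliers` is chosen for this example, and Q1 and Q3 are taken as the medians of the lower and upper halves of the ordered data, which reproduces the values used above; other quartile conventions may move the limits slightly.

```python
from statistics import median


def iqr_outliers(values):
    """Return (low, high, outliers) using the 1.5 * IQR rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + 1:] if n % 2 else ordered[half:])
    iqr = q3 - q1
    low = q1 - 1.5 * iqr      # lower limit
    high = q3 + 1.5 * iqr     # upper limit
    outliers = [x for x in ordered if x < low or x > high]
    return low, high, outliers


data = [3, 10, 14, 22, 19, 29, 70, 49, 36, 32]
print(iqr_outliers(data))     # (-19.0, 69.0, [70])
```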
Boxplot
A boxplot, also called a box and whisker plot

It is a way to show the spread and center of a data set.

Measures of spread include the range and the interquartile range.

Measures of center include the mean and median (the middle of a data set).
How to read a Boxplot / Five No. Summary
Step 1: Find the minimum.
The minimum (the smallest number in the data set). The minimum is shown at the far left of the chart, at the
end/ tip of the left “whisker.”

Step 2: Find Q1, the first quartile.


Q1 is represented by the far left hand side of the box.

Step 3: Find the median.


The median is represented by the vertical bar shown as a line in the center of the box.

Step 4: Find Q3, the third quartile.


Q3 is the far right hand edge of the box

Step 5: Find the maximum.


The maximum (the largest number in the data set) is shown at the far right of the chart, at the end/tip of the right “whisker.”
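The five values read off a boxplot can also be computed directly. The sketch below assumes the same quartile convention as the IQR steps earlier (medians of the lower and upper halves), and the function name `five_number_summary` is chosen for this example.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) of the data."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


summary = five_number_summary([1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27])
print(summary)   # (1, 5, 9, 18, 27)
```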
Uses of Boxplot
Box plots provide a visual summary of the data with which we can quickly

Identify the average value of the data,

How dispersed the data is

Whether the data is skewed or not (skewness).

The Median gives you the average value of the data.

Box plots show the skewness of the data.


Correlation
Correlation is used to test relationships between quantitative variables or categorical variables.

In other words, it’s a measure of how things are related.

The study of how variables are correlated is called correlation analysis.

Some examples of data that have a high correlation:

Your caloric intake and your weight.

Your eye color and your relatives’ eye colors.

The amount of time you study and your GPA.

Researchers have found a direct correlation between smoking and lung cancer.

Some examples of data that have a low correlation (or none at all):

A dog’s name and the type of dog biscuit they prefer.

The cost of a car wash and how long it takes to buy a soda inside the station.
Correlation
Correlations are useful because if you can find out what relationship variables have, you can
make predictions about future behaviour.

Knowing what the future holds is very important in the social sciences like government and healthcare.
Businesses also use these statistics for budgets and business plans.

The word Correlation is made of Co- (meaning "together"), and Relation

Correlation means association - more precisely it is a measure of the extent to which two variables are
related. There are three possible results of a correlational study:

A positive correlation,

A negative correlation, and

No correlation.
Correlation
A Positive Correlation:

It is a relationship between two variables in which both variables move in the same direction.

That is, one variable increases as the other variable increases, or one variable decreases as
the other decreases.

An example of positive correlation would be height and weight. Taller people tend to be heavier.

A Negative Correlation:

Relationship between two variables in which an increase in one variable is associated with a decrease
in the other.

An example of negative correlation would be height above sea level and temperature. As you climb the
mountain (increase in height) it gets colder (decrease in temperature).

A zero Correlation exists when there is no relationship between two variables. For example there is no
relationship between the amount of tea drunk and level of intelligence.
Correlation
A correlation can be expressed visually by drawing a scattergram (also known as a scatterplot, scatter
graph, scatter chart, or scatter diagram).

A scattergram is a graphical display that shows the relationships or associations between two numerical
variables (or co-variables), which are represented as points (or dots) for each pair of scores.

A scattergraph indicates the strength and direction of the correlation between the co-variables.

 Correlation can have a value:


1 is a perfect positive correlation
0 is no correlation (the values don't seem linked at all)
-1 is a perfect negative correlation
Correlation Coefficient
A correlation coefficient is a way to put a value to the relationship.

Correlation coefficients have a value of between -1 and 1.

A “0” means there is no relationship between the variables at all,

While -1 or 1 means that there is a perfect negative or positive correlation


Correlation Example
The local ice cream shop keeps track of how much ice cream they sell versus the temperature on that day.
Here are their figures for the last 12 days:

Ice Cream Sales vs Temperature
Temperature °C    Ice Cream Sales
14.2°             $215
16.4°             $325
11.9°             $185
15.2°             $332
18.5°             $406
22.1°             $522
19.4°             $412
25.1°             $614
23.4°             $544
18.1°             $421
22.6°             $445
17.2°             $408

And here is the same data as a Scatter Plot:
[Figure: scatter plot of Ice Cream Sales vs Temperature]
In fact the correlation is 0.9575
Calculating Correlation (Pearson's Correlation)
Let us call the two sets of data "x" and "y" (in our case Temperature is x and Ice Cream Sales is y):

Step 1: Find the mean of x, and the mean of y

Step 2: Subtract the mean of x from every x value (call them "a"), and subtract the mean of y from every y
value (call them "b")

Step 3: Calculate: ab, a² and b² for every value

Step 4: Sum up ab, sum up a² and sum up b²

Step 5: Divide the sum of ab by the square root of [(sum of a²) × (sum of b²)]

Here is how I calculated the first Ice Cream example (values rounded to 1 or 0 decimal places):
Calculating Correlation (Pearson's Correlation)
Formula: r = Σ(xi – x̄)(yi – ȳ) / √[ Σ(xi – x̄)² × Σ(yi – ȳ)² ]

Where:
Σ is Sigma, the symbol for "sum up"
(xi – x̄) is each x-value minus the mean of x (called "a" above)
(yi – ȳ) is each y-value minus the mean of y (called "b" above)
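The five steps can be sketched as below for the ice cream data. The function name `pearson` and the variable names are chosen for this example; the two lists are assumed to have the same length and not to be constant.

```python
import math


def pearson(x, y):
    """Pearson's correlation: sum(ab) / sqrt(sum(a^2) * sum(b^2))."""
    mean_x = sum(x) / len(x)                       # Step 1
    mean_y = sum(y) / len(y)
    a = [xi - mean_x for xi in x]                  # Step 2
    b = [yi - mean_y for yi in y]
    sum_ab = sum(ai * bi for ai, bi in zip(a, b))  # Steps 3 and 4
    sum_a2 = sum(ai ** 2 for ai in a)
    sum_b2 = sum(bi ** 2 for bi in b)
    return sum_ab / math.sqrt(sum_a2 * sum_b2)     # Step 5


temperature = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2]
sales = [215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408]

r = pearson(temperature, sales)
print(round(r, 4))   # approximately 0.9575, the value quoted for this data
```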
Covariance
Covariance is a measure of how much two random variables vary together.

It’s similar to variance, but where variance tells you how a single variable varies, covariance tells you
how two variables vary together.

Covariance is a statistical tool that is used to determine the direction of the relationship between the
movements of two random variables.

When two stocks tend to move together, they are seen as having a positive covariance; when they move
inversely, the covariance is negative.

This relationship is determined by the sign (positive or negative) of the covariance value.

In other words, whether they tend to move in the same or opposite directions.
Types of Covariance
Positive Covariance

A positive covariance between two variables indicates that these variables tend to be higher or lower
at the same time.

In other words, a positive covariance between variables x and y indicates that x is higher than average at
the same times that y is higher than average, and vice versa.

When charted on a two-dimensional graph, the data points will tend to slope upwards.

Negative Covariance

When the calculated covariance is less than zero, this indicates that the two variables have an
inverse relationship.

In other words, an x value that is lower than average tends to be paired with a y that is greater than
average, and vice versa.
Covariance Formula

\[ \mathrm{Cov}(X, Y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y}) \]

Where,
$x_i$ = data value of x
$y_i$ = data value of y
$\bar{x}$ = mean of x
$\bar{y}$ = mean of y
$N$ = number of data values.
[Figure: covariance of X and Y]

If cov(X, Y) is greater than zero, the covariance is positive and the two variables tend to move in the same direction.

If cov(X, Y) is less than zero, the covariance is negative and the two variables tend to move in opposite directions.

If cov(X, Y) is zero, there is no linear relationship between the two variables (a non-linear relationship may still exist).

The relationship between the correlation coefficient and covariance is given by:

\[ \rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y} \]

Where:
$\rho(X, Y)$ = correlation between the variables X and Y
$\mathrm{Cov}(X, Y)$ = covariance between the variables X and Y
$\sigma_X$ = standard deviation of the X variable
$\sigma_Y$ = standard deviation of the Y variable
Question:
Calculate the covariance for the following data:

X    2    8    18    20    28    30
Y    5    12   18    23    45    50

Solution:
Number of observations N = 6
Mean of X = 106/6 = 17.67
Mean of Y = 153/6 = 25.5
Cov(X, Y) = (1/6) [(2 – 17.67)(5 – 25.5) + (8 – 17.67)(12 – 25.5) + (18 – 17.67)(18 – 25.5) + (20 – 17.67)(23 – 25.5)
+ (28 – 17.67)(45 – 25.5) + (30 – 17.67)(50 – 25.5)]
Cov(X, Y) ≈ 947/6 = 157.83
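
The worked solution can be checked with a short Python sketch. It divides by N (population covariance), as the solution does, and then applies the relationship ρ(X,Y) = Cov(X,Y)/(σX σY). The helper name `population_covariance` is an assumption made for this illustration.

```python
from math import sqrt

x = [2, 8, 18, 20, 28, 30]
y = [5, 12, 18, 23, 45, 50]


def population_covariance(x, y):
    """Average product of deviations from the mean, dividing by N."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n


cov_xy = population_covariance(x, y)

# The variance of a variable is its covariance with itself,
# so the standard deviation is the square root of that.
sigma_x = sqrt(population_covariance(x, x))
sigma_y = sqrt(population_covariance(y, y))
rho = cov_xy / (sigma_x * sigma_y)

print(round(cov_xy, 2))  # 157.83, matching the worked solution
print(round(rho, 2))
```

The covariance depends on the units of X and Y, whereas dividing by both standard deviations rescales it to a value between -1 and 1.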
Causation
According to Merriam-Webster, causation is “the act or process of causing something to happen or exist.”

In other words, causation means that one event makes another happen.

If you paint, you’ll make a painting.

If you stand in the rain, you’ll get wet.

Correlation means there is a relationship, but one event does not guarantee the other. If you paint, you might sell a
painting. If you stand in the rain, you might get hit by lightning.

Causation indicates a relationship between two events where one event is affected by the other.

In statistics, when the value of one event, or variable, increases or decreases as a result of other events, it is
said there is causation.
Data preparation / Data Pre processing
Data Processing

It is the task of converting data from a given form to a much more usable and desired form i.e. making
it more meaningful and informative.

Using Machine Learning algorithms, mathematical modelling, and statistical knowledge, this entire process
can be automated.

The output of this complete process can be in any desired form like graphs, videos, charts, tables,
images, and many more, depending on the task we are performing and the requirements of the machine.

This might seem to be simple but when it comes to massive organizations like Twitter, Facebook,
Administrative bodies like Parliament, UNESCO, and health sector organizations, this entire process
needs to be performed in a very structured manner.

So, the steps to perform are as follows:


Data preparation / Data Pre processing
Collection
The most crucial step when starting with ML is to have data of good quality and accuracy.

Data can be collected from any trusted source such as data.gov.in, Kaggle or the UCI dataset
repository.

For example, while preparing for a competitive exam, students study from the best study material that
they can access so that they learn the best to obtain the best results.

In the same way, high-quality and accurate data makes the learning process of the model easier
and better, so that at the time of testing the model is more likely to yield good results.

A huge amount of capital, time and resources are consumed in collecting data.

Organizations or researchers have to decide what kind of data they need to execute their tasks or
research.

Example: Building a Facial Expression Recognizer requires numerous images showing a variety of
human expressions. Good data ensures that the results of the model are valid and can be trusted.
Preparation:
The collected data can be in a raw form which can’t be directly fed to the machine. So, this is a process of
collecting datasets from different sources, analysing these datasets and then constructing a new dataset
for further processing and exploration.

This preparation can be performed either manually or automatically.

Data can also be prepared in numeric form, which speeds up the model’s learning.

Example: An image can be converted to a matrix of N × N dimensions; the value of each cell indicates
the intensity of the corresponding image pixel.
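
As a minimal sketch of the image example, the snippet below reshapes a flat list of grayscale values into an N × N matrix and scales it to the range 0 to 1. The pixel values, the 0 to 255 intensity range and the helper names `to_matrix` and `normalize` are assumptions made for this illustration; real images are usually loaded with an imaging library.

```python
def to_matrix(pixels, n):
    """Reshape a flat list of n*n grayscale values into an n x n matrix."""
    if len(pixels) != n * n:
        raise ValueError("expected exactly n*n pixel values")
    return [pixels[row * n:(row + 1) * n] for row in range(n)]


def normalize(matrix, max_value=255):
    """Scale intensities from 0..max_value to 0.0..1.0."""
    return [[value / max_value for value in row] for row in matrix]


# Hypothetical 3 x 3 grayscale image (0 = black, 255 = white).
flat_pixels = [0, 128, 255, 64, 192, 32, 255, 0, 16]
image = to_matrix(flat_pixels, 3)
scaled = normalize(image)

for row in image:
    print(row)
```

Each cell of `image` holds one pixel's intensity, which is the numeric form a learning algorithm can consume.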
Data preparation is the process of transforming raw data so that data scientists and analysts can run it through machine
learning algorithms to uncover insights or make predictions.

Some datasets have values that are missing, invalid, or otherwise difficult for an algorithm to process. If
data is missing, the algorithm can’t use it. If data is invalid, the algorithm produces less accurate or
even misleading outcomes.

Some datasets are relatively clean but need to be shaped (e.g., aggregated or pivoted) and many
datasets are just lacking useful business context (e.g., poorly defined ID values), hence the need for
feature enrichment.
Why do we need Data Preprocessing?
Real-world data generally contains noise and missing values, and may be in a format that
cannot be used directly by machine learning models.
Data pre-processing is the set of tasks required to clean the data and make it suitable for a machine learning
model, which also increases the accuracy and efficiency of the model.
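
A small sketch of two common pre-processing tasks: filling missing values with the column mean, then rescaling to the range 0 to 1. The data and the helper names `impute_with_mean` and `min_max_scale` are hypothetical, and mean imputation is only one of several possible strategies.

```python
from statistics import mean

# Hypothetical raw column: None marks a missing reading.
ages = [25, None, 31, 40, None, 28]


def impute_with_mean(values):
    """Replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


clean_ages = impute_with_mean(ages)
scaled_ages = min_max_scale(clean_ages)
print(clean_ages)
print(scaled_ages)
```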
Input

Now the prepared data can be in the form that may not be machine-readable, so to convert this data
to the readable form, some conversion algorithms are needed.

For this task to be executed, high computation and accuracy is needed. Example: Data can be collected
through the sources like MNIST Digit data (images), Twitter comments, audio files, video clips.

Processing

This is the stage where algorithms and ML techniques are required to perform the instructions provided
over a large volume of data with accuracy and optimal computation.
Output

In this stage, the machine produces results in a meaningful form that can be interpreted
easily by the user.

Output can be in the form of reports, graphs, videos, etc.

Storage

This is the final step in which the obtained output and the data model and all the useful information
are saved for future use.
The data preparation process can be complicated by issues such as:

Missing or incomplete records.

Outliers or anomalies.

Improperly formatted / structured data.

Inconsistent values and non-standardized categorical variables.

Limited or sparse features / attributes.

The need for techniques such as feature engineering.
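
To illustrate the "non-standardized categorical variables" issue, the sketch below normalizes inconsistent spellings of the same category and then one-hot encodes the result. The column values and the helper names `standardize` and `one_hot` are hypothetical; trimming whitespace and lower-casing is just one simple standardization rule.

```python
# Hypothetical column with inconsistent spellings of the same categories.
raw_cities = [" Delhi", "delhi", "MUMBAI", "Mumbai ", "Chennai"]


def standardize(label):
    """Trim surrounding whitespace and lower-case the label."""
    return label.strip().lower()


def one_hot(label, categories):
    """Return a 0/1 vector with a single 1 at the label's category."""
    return [1 if label == category else 0 for category in categories]


cities = [standardize(c) for c in raw_cities]
categories = sorted(set(cities))          # ['chennai', 'delhi', 'mumbai']
encoded = [one_hot(c, categories) for c in cities]

for city, vector in zip(cities, encoded):
    print(city, vector)
```

Without the standardization step, "Delhi" and "delhi" would be treated as two different categories.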
Data Visualization
Data visualization is defined as a graphical representation that contains the information and the data.

By using visual elements like charts, graphs, and maps, data visualization techniques provide an accessible
way to see and understand trends, outliers, and patterns in data.

Today we have a lot of data in our hands. In the world of Big Data, data visualization tools and
technologies are crucial for analyzing massive amounts of information and making data-driven decisions.

Data visualization is an easy and quick way to convey concepts universally. You can explore
different scenarios by making slight adjustments.

It can visualize phenomena that cannot be observed directly, such as weather patterns, medical conditions,
or mathematical relationships.
Data Visualization Techniques / Methods:
1. Line chart / Plot:

A line chart illustrates changes over time.

The X-axis represents the period, whereas the Y-axis represents the quantity.

Both axes contain numerical values representative of the data.

To create a line chart, input the relevant time frame along the X-axis and the quantitative measurement on the
Y-axis. Plot the data in the graph by connecting the time value and the numeric value. After plotting all the
dots, connect them with a line.

A line graph can have one line or several. In the case of a chart with several lines, each one represents
a category.
2. Bar chart / Plot / Graph/ Column Charts:

Bar charts can also illustrate changes over time.

They are used to compare data along two axes.

One of the axes is numerical, while the other shows the categories or topics being measured.

You can use a bar chart with vertical bars or horizontal bars.

On vertical bar graphs, numerical values are on the Y-axis (vertical axis); on horizontal bar graphs, they are on the
X-axis (horizontal axis).
3. Pie chart / Plot :

Circular chart with multiple divisions where each division shows the contribution of each value to the total
value.

The Pie represents the total value, i.e., 100 percent, and each slice of the pie chart adds some percent to
the total.

The larger the contribution of an attribute, the larger will be the size of the slice of the pie chart.
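
The arithmetic behind a pie chart can be sketched as follows: each slice's share is its value divided by the total, expressed as a percentage or as an angle out of 360 degrees. The expense figures are hypothetical.

```python
# Hypothetical monthly expenses by category.
expenses = {"rent": 500, "food": 300, "travel": 120, "other": 80}

total = sum(expenses.values())

# Share of the whole pie, as a percentage and as a slice angle in degrees.
shares = {name: 100 * value / total for name, value in expenses.items()}
angles = {name: 360 * value / total for name, value in expenses.items()}

for name in expenses:
    print(f"{name:<7} {shares[name]:5.1f}%  {angles[name]:6.1f} degrees")
```

The percentages sum to 100 and the angles to 360, so the largest contribution always gets the largest slice.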
4. Scatter grams / Scatterplot / Scatter graph / Scatter chart / Scatter Diagram :

A correlation can be expressed visually.

A graphical display that shows the relationships or associations between two numerical variables (or
co-variables), which are represented as points (or dots), one for each pair of scores.

Indicates the strength and direction of the correlation between the co-variables.

Available in both 2-D as well as in 3-D. The 2-D scatter plot is the common one, where we will primarily try
to find the patterns, clusters, and separability of the data.
5. Box plot / Box and Whisker Plot:

This plot can be used to obtain more statistical details about the data.

The straight lines extending from the box towards the minimum and maximum are called whiskers.

Points that lie outside the whiskers are considered outliers.

The box plot also shows the 25th, 50th and 75th percentiles (the first quartile, the median and the third quartile).

We can also determine the interquartile range (IQR), the span between the 25th and 75th percentiles, which contains the middle 50% of the data.
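
The numbers a box plot displays can be computed directly. The sketch below uses `statistics.quantiles` from the Python standard library; the data is hypothetical, and placing the whisker limits at 1.5 × IQR beyond the quartiles is a common convention assumed here, not something these notes specify.

```python
from statistics import quantiles

# Hypothetical measurements; 40 is deliberately extreme.
data = [7, 9, 10, 11, 12, 12, 13, 14, 15, 40]

# Quartiles: 25th, 50th (median) and 75th percentiles.
q1, median, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: whiskers reach at most 1.5 * IQR beyond the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in data if v < lower_fence or v > upper_fence]

print("Q1, median, Q3:", q1, median, q3)
print("IQR:", iqr)
print("Outliers:", outliers)
```

Different tools interpolate quartiles slightly differently, so the exact values can vary between libraries.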
6. Histogram Plot:

Similar to a bar graph, but the bars represent ranges (bins) of a numerical variable rather than separate categories.

Used to analyze how frequently values fall within each range.

Histograms are usually drawn with vertical bars, whereas bar charts are commonly drawn both vertically and horizontally.

A histogram is a graphical representation that organizes a group of data points into user-specified ranges.

A histogram represents the frequency distribution of a variable in a data set.

On the other hand, a bar graph typically represents a graphical comparison of discrete or categorical
variables.
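
The binning step behind a histogram can be sketched without a plotting library. The exam marks, the bin width of 20 and the function name `histogram` are assumptions made for this illustration.

```python
def histogram(values, bin_width, start):
    """Count how many values fall into each half-open bin [lo, lo + bin_width)."""
    counts = {}
    for v in values:
        index = int((v - start) // bin_width)
        lo = start + index * bin_width
        counts[lo] = counts.get(lo, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical exam marks.
marks = [35, 42, 47, 51, 55, 58, 61, 63, 67, 72, 78, 91]
counts = histogram(marks, bin_width=20, start=20)

# Text rendering: one row per user-specified range, one '#' per data point.
for lo, count in counts.items():
    print(f"[{lo}, {lo + 20}) {'#' * count}")
```

Changing `bin_width` changes the shape of the histogram, which is why the ranges are described as user-specified.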
7. Density Plot:

Density plots are another quick and easy technique for getting each attribute's distribution.

A density plot is like a histogram, but with a smooth curve drawn through the top of each bin.

We can think of them as smoothed histograms.


Exploratory Data Analysis
Exploratory Data Analysis is a data analytics process used to understand the data in depth and to learn its different characteristics.

This allows you to get a better feel of your data and find useful patterns in it.

Exploratory Data Analysis (EDA) is an approach to analyse the data using visual techniques.

It refers to the critical process of performing initial investigations on data so as to discover trends and
patterns, spot anomalies, test hypotheses and check assumptions with the help of summary statistics
and graphical representations.

It is an approach of analysing data sets to summarize their main characteristics, often using statistical
graphics and other data visualization methods.
It helps you gather insights and make better sense of the data, and removes irregularities and
unnecessary values from data.

Helps you prepare your dataset for analysis.

Allows a machine learning model to make better predictions on our dataset.

Gives you more accurate results.

It also helps us to choose a better machine learning model.


Steps involved in Exploratory Data Analysis
1. Data Collection

It refers to the process of finding and loading data into our system.

Good, reliable data can be found on various public sites or bought from private organizations.

Some reliable sites for data collection are Kaggle, GitHub, the UCI Machine Learning Repository, etc.

2. Data Cleaning

Process of removing unwanted variables and values from your dataset and getting rid of any
irregularities in it.

Such anomalies can disproportionately skew the data and hence adversely affect the results.

Some steps that can be done to clean data are:

Removing missing values, outliers, and unnecessary rows/ columns.

Re-indexing and reformatting our data.
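
A minimal sketch of the cleaning steps listed above, using plain Python dictionaries as rows: drop an unnecessary column, drop rows with missing values, then re-index what remains. The records, column names and the function name `clean` are hypothetical.

```python
# Hypothetical raw records: None marks a missing value,
# and "notes" is a column that is not needed for analysis.
rows = [
    {"id": 10, "height_cm": 170, "weight_kg": 65, "notes": "ok"},
    {"id": 11, "height_cm": None, "weight_kg": 72, "notes": ""},
    {"id": 12, "height_cm": 158, "weight_kg": 54, "notes": "recheck"},
    {"id": 13, "height_cm": 181, "weight_kg": None, "notes": ""},
]


def clean(rows, drop_columns=("notes",)):
    """Drop unwanted columns and incomplete rows, then re-index from 0."""
    kept = []
    for row in rows:
        trimmed = {k: v for k, v in row.items() if k not in drop_columns}
        if any(v is None for v in trimmed.values()):
            continue  # skip rows that still contain a missing value
        kept.append(trimmed)
    return [{"index": i, **row} for i, row in enumerate(kept)]


cleaned = clean(rows)
for row in cleaned:
    print(row)
```

Dropping rows is the simplest option; when many rows have gaps, filling the missing values may preserve more data.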


3. Univariate Analysis

In Univariate Analysis, you analyze data of just one variable.

A variable in your dataset refers to a single feature/ column.

You can do this either with graphical or non-graphical means by finding specific mathematical values in the
data.

Some visual methods include:

Histograms: Bar plots in which the frequency of data is represented with rectangle bars.

Box-plots: Here the distribution is summarized by a box spanning the 25th to 75th percentiles, with whiskers extending towards the minimum and maximum.


5. Bivariate Analysis

Here, you use two variables and compare them.

This way, you can find how one feature is related to the other.

It is done with scatter plots, which plot individual data points, or with correlation matrices, which show the
strength of each correlation as a color hue. You can also use boxplots.

6. Handling Outliers

An outlier is a data object that deviates significantly from the rest of the objects, and is therefore sometimes called abnormal.

Outliers can be caused by measurement or execution errors. The analysis for outlier detection is referred to as
outlier mining.

There are many ways to detect outliers, and removing one is the same as removing any other
data item from a pandas DataFrame.
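
One possible detection method, sketched with the standard library only, is the modified z-score based on the median and the median absolute deviation (MAD). The notes do not name a specific method; the choice of this one, the scaling constant 0.6745 and the cutoff 3.5 are assumptions taken from a commonly cited rule of thumb, and the readings are hypothetical.

```python
from statistics import median


def remove_outliers(values, threshold=3.5):
    """Drop values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return list(values)  # no spread around the median: nothing to flag
    return [v for v in values if abs(0.6745 * (v - med) / mad) <= threshold]


# Hypothetical sensor readings; 95.0 looks like a measurement error.
readings = [21.0, 22.5, 21.8, 23.1, 22.0, 95.0, 21.5]
cleaned_readings = remove_outliers(readings)
print(cleaned_readings)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value barely moves them.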
