ML Unit-II Notes
Data
Set of values of qualitative or quantitative variables about one or more persons or objects.
Any unprocessed fact, value, text, sound, picture, video, code, plot or graph, etc.
Without data, we can’t train any model, and all modern research and automation would be in vain.
Big enterprises spend a lot of money just to gather as much data as possible.
Example: Why did Facebook acquire WhatsApp by paying a huge price of $19 billion?
To have access to the users’ information, and to facilitate the task of improving their services.
Data helps in predicting the future or forecasting based on previous trends in the data.
Types of Data
Qualitative Data (Categorical or Attribute): Nominal, Ordinal
Quantitative Data (Numerical): Discrete / Interval, Continuous
Qualitative Data
Characteristics and descriptions that can’t be easily measured, but can be observed & recorded subjectively. [Non-numerical in nature]
Categorical data – data that can be arranged categorically based on the attributes and properties of a thing
or a phenomenon.
For example, think of a student reading a paragraph from a book during one of the class sessions. A teacher
who is listening to the reading gives feedback on how the child read that paragraph. If the teacher gives
feedback based on fluency, intonation, throw of words, and clarity in pronunciation without giving a grade to
the child, this is considered an example of qualitative data.
Qualitative data is about the emotions or perceptions of people, what they feel.
Gender, country name, animal species, and emotional state are examples of qualitative data.
Nominal Data
Data with no inherent order or ranking, such as gender or race, is called nominal data.
In statistics, nominal data (also known as nominal scale) is a classification of categorical variables that do
not provide any quantitative value.
Ordinal Data
Ordinal data is a type of qualitative data where the variables have natural, ordered categories and the
distances between the categories are not known.
For example, ordinal data is said to have been collected when a customer inputs his/her satisfaction on the
variable scale — "satisfied, indifferent, dissatisfied".
An organization creates an employee exit questionnaire that primarily highlights this question:
“How is Mr. Abdul Rais teaching Machine Learning?” (Likert Scale)
Excellent
Very Good
Good
Average
Poor
Discrete / Interval Data
Discrete data is a count that involves integers — only a limited/ finite number of values is possible.
Discrete data includes discrete variables that are finite, numeric, countable, and non-negative integers.
This data type is mainly used for simple statistical analysis because it’s easy to summarize and compute.
By nature, discrete data is counted rather than measured. For example, you can measure your weight with the
help of a scale, so your weight is not discrete data.
Discrete data can be easily visualized and demonstrated using simple statistical methods such as bar
charts, line charts, or pie charts.
Examples of Discrete / Interval Data
Number of students in a class
Shoe sizes.
Instruments on a shelf.
Continuous Data
Continuous data can assume any numeric value and can be meaningfully split into smaller parts.
For example, you have continuous data when you measure weight, height, length, time, and temperature.
Frequently, you’ll use histograms and scatterplots to graph continuous variables. These graphs are
designed to handle values that fall on a continuous spectrum and have decimal places.
Data Representation
A machine learning model can't directly see, hear, or sense input examples.
Instead, you must create a representation of the data to provide the model with a useful vantage point into
the data's key qualities.
That is, in order to train a model, you must choose the set of features that best represent the data.
The main objective of machine learning is to build models by interpreting data. To do so, it is highly
important to feed the data in a way that is readable by the computer.
To feed data into a scikit-learn model, it must be represented as a table or matrix of the required
dimension
Most tables fed into machine learning problems are two-dimensional (rows and columns)
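As a minimal sketch of this idea (the feature values and the stand-in LinearRegression model are invented for illustration, not part of these notes):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# A two-dimensional table: each row is one example (sample),
# each column is one feature. The values are made up.
X = np.array([
    [5.1, 3.5],   # sample 1: feature 1, feature 2
    [4.9, 3.0],   # sample 2
    [6.2, 2.9],   # sample 3
])
y = np.array([1.4, 1.3, 4.3])   # one target value per row

# scikit-learn expects X with shape (n_samples, n_features).
model = LinearRegression()
model.fit(X, y)
print(X.shape)                  # (3, 2) -> 3 rows (samples), 2 columns (features)
```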
Statistics
Governmental needs for census data as well as information about a variety of economic activities
provided much of the early impetus for the field of statistics.
The need to turn large amounts of data into useful information has stimulated both theoretical and practical
developments in statistics.
Any time data are collected & analyzed, statistics are being done. This can range from government
agencies to academic research to analyzing investments.
Statistics is broadly divided into two branches:
Descriptive Statistics
Inferential Statistics
Examples of Statistics
You and a friend are at a baseball game, and out of the blue he offers you a bet that neither team will hit a
home run in that game. Should you take the bet?
Your company has created a new drug that may cure cancer. How would you conduct a test to confirm the
drug’s effectiveness?
The latest sales data have just come in, and your boss wants you to prepare a report for management on
places where the company could improve its business. What should you look for? What should you not
look for?
Basic Terminologies in Statistics:
Population
A collection or set of individuals, objects, or events whose properties are to be analyzed.
Sample
A subset of population is called ‘Sample’. A well-chosen sample will contain most of the information
about a particular population parameter.
Descriptive Statistics
Descriptive statistics summarize & organize characteristics of a data set.
Provide descriptions of the population or sample, either through numerical calculations or graphs or tables
It is mainly focused upon the main characteristics of data. It provides graphical summary of the data.
Descriptive statistical analysis helps you to understand your data and is a very important part of ML, because ML
is all about making predictions, while statistics is about drawing conclusions from data, which is a necessary
initial step.
Frequency Distribution
In tables or graphs, you can summarize the frequency of every possible value of a variable in numbers or
percentages.
Simple frequency distribution table
For the variable of gender, you list all possible answers on the left column. You count the number or
percentage of responses for each answer and display it on the right column.
From this table, you can see that more women than men or people with another gender identity took
part in the study.
Gender Number
Male 182
Female 235
Other 27
Grouped frequency distribution table
In a grouped frequency distribution, you can group numerical response values and add up the
number of responses for each group. You can also convert each of these numbers to percentages.
From this table, you can see that most people visited the library between 5 and 16 times in the past year.
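Tables like these can be produced directly in pandas; a small sketch with invented library-visit counts (the original numbers aren’t reproduced in these notes):

```python
import pandas as pd

# Hypothetical number of library visits per respondent (invented data).
visits = pd.Series([0, 3, 3, 5, 7, 9, 12, 12, 14, 15, 16, 20])

# Simple frequency distribution: count each distinct value.
simple = visits.value_counts().sort_index()

# Grouped frequency distribution: bucket the values into ranges, then count.
groups = pd.cut(visits, bins=[0, 4, 8, 12, 16, 20], include_lowest=True)
grouped = groups.value_counts().sort_index()

print(grouped)
print(grouped / len(visits) * 100)   # the same counts as percentages
```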
Measures of Central Tendency
Central tendency is sometimes called “measures of location,” “central location,” or just “center.”
Central tendency doesn’t tell you specifics about the individual pieces of data, but it gives you an overall
picture of what is going on in the entire data set.
Mean: the sum of all values divided by the total number of values.
Median: the 50th percentile of the data. It is exactly the centre point of the data.
The median can be identified by ordering the data in ascending order, splitting it into two equal
parts, and finding the middle number.
Median number of library visits
Ordered data set: 0, 3, 3, 12, 15, 24
Middle numbers: 3 and 12, so the median is (3 + 12) / 2 = 7.5

Mode
The mode is the most popular or most frequent response value. A data set can have no mode, one mode,
or more than one mode.
To find the mode, order your data set from lowest to highest and find the response that occurs most
frequently.
Some data sets have no mode, one mode, two modes, etc.:
No mode: 1, 2, 3, 4, 6, 8, 9
One mode (unimodal): 1, 2, 3, 3, 4, 5
Two modes (bimodal): 1, 1, 2, 3, 4, 4, 5
Three modes (trimodal): 1, 1, 2, 3, 3, 4, 5, 5
More than one mode (two, three or more) = multimodal

Mode number of library visits
Ordered data set: 0, 3, 3, 12, 15, 24
Find the most frequently occurring response: 3
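A minimal sketch of computing these measures with Python’s built-in statistics module, using the library-visits data above:

```python
import statistics

visits = [0, 3, 3, 12, 15, 24]

print(statistics.mean(visits))      # (0+3+3+12+15+24) / 6 = 9.5
print(statistics.median(visits))    # middle of 3 and 12 -> 7.5
print(statistics.mode(visits))      # most frequent value -> 3

# multimode returns every mode when a data set has more than one.
print(statistics.multimode([1, 1, 2, 3, 4, 4, 5]))   # bimodal -> [1, 4]
```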
Measure of Variability / Spread
It is used to describe the variability in a sample or population.
The dispersion is the “Spread of the data”. It measures how far the data is spread.
In some datasets, the values are closely located near the mean; in others, the values are widely spread
out from the mean.
When the dispersion of a data set is large, the values in the set are widely scattered; when it is small, the
items in the set are tightly clustered.
These dispersions of data can be measured by
Range
Inter Quartile Range ( IQR )
Standard Deviation
Variance
Spread can also be shown in graphs: dot plots, boxplots, and stem and leaf plots have a greater distance
with samples that have a larger dispersion and vice versa.
Range
It is the highest value minus the lowest value.
Formula : Range = Max Value – Min Value
Spread of your data from the lowest to the highest value in the distribution.
It is a commonly used measure of variability.
To find the range, follow these steps:
Order all values in your data set from low to high.
Subtract the lowest value from the highest value.
This process is the same regardless of whether your values are positive or negative, or whole numbers
or fractions.
Range example: Your data set is the ages of 8 participants.

Participant:  1   2   3   4   5   6   7   8
Age:         37  19  31  29  21  26  33  36

First, order the values from low to high to identify the lowest value (L) and the highest value (H):
Age: 19, 21, 26, 29, 31, 33, 36, 37

R = H – L = 37 – 19 = 18
The range of our data set is 18 years.
Inter Quartile Range (IQR)
Quartiles are values that divide your data into quarters. They divide your data into four segments
according to where the numbers fall on the number line. The four quarters that divide a data set into
quartiles are:
The lowest 25% of numbers (up to Q1).
The next lowest 25% of numbers (Q1 up to the median, Q2).
The second highest 25% of numbers (the median up to Q3).
The highest 25% of numbers (above Q3).
Example: Divide the following data set into quartiles: 2, 5, 6, 7, 10, 22, 13, 14, 16, 65, 45, 12.
Step 1: Put the numbers in order: 2, 5, 6, 7, 10, 12, 13, 14, 16, 22, 45, 65.
Step 2: Count how many numbers there are in your set and then divide by 4 to cut the list of numbers into
quarters.
There are 12 numbers in this set, so you would have 3 numbers in each quartile:
2, 5, 6 | 7, 10, 12 | 13, 14, 16 | 22, 45, 65
Inter Quartile Range (IQR): It is the difference between the third quartile (Q3) and the first Quartile (Q1)
Formula : IQR = Q3 - Q1
Inter Quartile Range (IQR)
Step 1: Put the numbers in order.
1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27
Step 2: Find the median: 1, 2, 5, 6, 7, (9), 12, 15, 18, 19, 27. The median is 9.
Step 3: Place parentheses around the numbers above and below the median.
Not necessary statistically, but it makes Q1 and Q3 easier to spot.
(1, 2, 5, 6, 7), 9, (12, 15, 18, 19, 27)
Step 4: Find Q1 and Q3. Q1 is the median of the lower half (5) and Q3 is the median of the upper half (18).
Step 5: Subtract Q1 from Q3: IQR = 18 – 5 = 13.
Standard Deviation
It tells you, on average, how far each score lies from the mean.
The larger the standard deviation, the more variable the data set is.
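A small sketch pulling these spread measures together with NumPy, reusing the 11-number set from the IQR example above (note that NumPy’s default quartile convention can differ slightly from the median-of-halves method):

```python
import numpy as np

data = np.array([1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27])

data_range = data.max() - data.min()      # Range = Max - Min = 26
q1, q3 = np.percentile(data, [25, 75])    # NumPy's default (linear) convention
iqr = q3 - q1                             # may differ from the median-of-halves value
variance = data.var(ddof=1)               # sample variance (divide by n - 1)
std_dev = data.std(ddof=1)                # sample standard deviation

print(data_range, iqr, round(variance, 2), round(std_dev, 2))
```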
Percentile
You might know that you scored 67 out of 90 on a test. But that figure has no real meaning unless you know
what percentile you fall into.
Example-1: If you know that your score is in the 90th percentile, that means you scored better than 90% of
people who took the test.
Example-2: A person with an IQ of 120 is at the 91st percentile, which indicates that their IQ is higher than 91
percent of other scores.
Percentile
The general rule is that if value X is at the kth percentile, then X is greater than k% of the values.
Percentiles are commonly used to report scores in tests, like the SAT, GRE and LSAT.
For example, the 70th percentile on the 2013 GRE was 156. That means if you scored 156 on the exam, your
score was better than 70 percent of test takers.
Definition-1: The nth percentile is the lowest score that is greater than a certain percentage (“n”) of the
scores.
Definition-2: The nth percentile is the smallest score that is greater than or equal to a certain percentage
of the scores. To rephrase this, it’s the percentage of data that falls at or below a certain observation. This is
the definition used in AP statistics.
Definition-3: A weighted mean of the percentiles from the first two definitions.
Percentile Rank
If you score in the 25th percentile, then 25% of test takers are below your score. The “25” is called
the percentile rank.
How to find a Percentile
Example question: Find out where the 25th percentile is in the following list:

Score  Rank
30     1
33     2
43     3
53     4
56     5
67     6
68     7
72     8

Step 1: Calculate what rank is at the 25th percentile. Use the following formula:
Rank = Percentile / 100 * (number of items + 1)
Rank = 25 / 100 * (8 + 1) = 0.25 * 9 = 2.25.
A rank of 2.25 is at the 25th percentile. So you must either round up or round down. As 2.25 is closer to 2
than 3, I’m going to round down to a rank of 2.
How to find a Percentile
Step 2: Choose either definition 1 or 2:
Definition 1: The lowest score that is greater than 25% of the scores. That equals a score of 43 on this list (a
rank of 3).
Definition 2: The smallest score that is greater than or equal to 25% of the scores. That equals a score of 33
on this list (a rank of 2).
Depending on which definition you use, the 25th percentile could be reported at 33 or 43! A third definition
attempts to correct this possible misinterpretation:
Definition 3: A weighted mean of the percentiles from the first two definitions.
In the above example, here’s how the percentile would be worked out using the weighted mean:
Multiply the difference between the scores by 0.25 (the fractional part of the rank we calculated above). The
scores were 43 and 33, giving us a difference of 10:
(0.25)(43 – 33) = 2.5
Add this to the lower score: 33 + 2.5 = 35.5.
In this case, the 25th percentile score is 35.5, which makes more sense as it lies between 33 and 43.
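These competing definitions are also built into NumPy’s percentile function (the method= keyword needs NumPy 1.22+); a short sketch on the score list above:

```python
import numpy as np

scores = [30, 33, 43, 53, 56, 67, 68, 72]

# Rough analogues of the three definitions above:
print(np.percentile(scores, 25, method="higher"))   # 43 (Definition 1 style)
print(np.percentile(scores, 25, method="lower"))    # 33 (Definition 2 style)

# 'weibull' uses rank = (percentile / 100) * (n + 1), the formula used above:
# rank 2.25 -> 33 + 0.25 * (43 - 33) = 35.5 (Definition 3, the weighted mean).
print(np.percentile(scores, 25, method="weibull"))  # 35.5
```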
Outlier
An outlier is a piece of data that is an abnormal distance from other points.
In other words, it’s data that lies outside the other values in the set. If you had Pinocchio in a class of
children, the length of his nose compared to the other children would be an outlier.
Outliers are stragglers — extremely high or extremely low values — in a data set that can throw off your
stats. For example, if you were measuring children’s nose length, your average value might be thrown off if
Pinocchio was in the class.
An outlier is a data point that is noticeably different from the rest.
They represent errors in measurement, bad data collection, or simply show variables not considered
when collecting the data.
Outlier
In this set of random numbers, 1 and 201 are outliers:
1, 99, 100, 101, 103, 109, 110, 201
“1” is an extremely low value and “201” is an extremely high value.
Of course, trying to find outliers isn’t always that simple. Your data set may look like this:
61, 10, 32, 19, 22, 29, 36, 14, 49, 3.
You could take a guess that 3 might be an outlier and perhaps 61. But you’d be wrong: 61 is the only outlier
in this data set.
Impact of Outlier on ML Models
In supervised models, outliers can deceive the training process, resulting in prolonged training times and less
accurate models.
According to Alvira Swalin, a data scientist at Uber, machine learning models like linear & logistic
regression are easily influenced by the outliers in the training data. Some models even amplify the effect of
outliers, for example boosting methods that increase the weights of misclassified points.
How to Find Outliers Using IQR
Example Question: Find the outliers for the following data set: 3, 10, 14, 22, 19, 29, 70, 49, 36, 32.
Step 1: Find the interquartile range. For this data set Q1 = 14 and Q3 = 36, so IQR = 36 – 14 = 22.
Step 2: Multiply the interquartile range you found in Step 1 by 1.5: 22 × 1.5 = 33.
Step 3: Add the amount you found in Step 2 to Q3: 36 + 33 = 69.
This is your upper limit. Set this number aside for a moment.
Step 4: Subtract the amount you found in Step 2 from Q1: 14 – 33 = -19.
This is your lower limit. Set this number aside for a moment.
Step 5: Put the numbers from your data set in order:
3, 10, 14, 19, 22, 29, 32, 36, 49, 70
Step 6: Insert your low and high values into your data set, in order:
-19, 3, 10, 14, 19, 22, 29, 32, 36, 49, 69, 70
Step 7: Highlight any number below or above the numbers you inserted in Step 6:
-19, 3, 10, 14, 19, 22, 29, 32, 36, 49, 69, 70
Only 70 falls outside the limits (it is above the upper limit of 69), so 70 is the outlier in this data set.
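A minimal sketch of the same 1.5 × IQR rule in Python; the quartiles are taken as medians of the lower and upper halves, matching the worked example:

```python
import statistics

data = sorted([3, 10, 14, 22, 19, 29, 70, 49, 36, 32])

# Quartiles as medians of the lower and upper halves, matching the
# worked example above: Q1 = 14, Q3 = 36.
half = len(data) // 2
q1 = statistics.median(data[:half])
q3 = statistics.median(data[-half:])
iqr = q3 - q1                       # 22

lower = q1 - 1.5 * iqr              # -19
upper = q3 + 1.5 * iqr              # 69

outliers = [x for x in data if x < lower or x > upper]
print(outliers)                     # [70]
```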
Boxplot
A boxplot, also called a box and whisker plot, displays the center and spread of a data set.
Measures of spread include the range and the interquartile range.
Measures of center include the mean and median (the middle of a data set).
How to read a Boxplot / Five Number Summary
The five number summary consists of the minimum, the first quartile (Q1), the median, the third quartile (Q3),
and the maximum.
Step 1: Find the minimum.
The minimum is the smallest number in the data set. It is shown at the far left of the chart, at the
end/tip of the left “whisker.”
Correlation
Researchers have found a direct correlation between smoking and lung cancer.
Some examples of data that have a low correlation (or none at all):
The cost of a car wash and how long it takes to buy a soda inside the station.
Correlation
Correlations are useful because if you can find out what relationship variables have, you can
make predictions about future behaviour.
Knowing what the future holds is very important in the social sciences like government and healthcare.
Businesses also use these statistics for budgets and business plans.
Correlation means association - more precisely it is a measure of the extent to which two variables are
related. There are three possible results of a correlational study:
A positive correlation,
A negative correlation,
No correlation.
Correlation
A Positive Correlation:
It is a relationship between two variables in which both variables move in the same direction: when one
variable increases, the other increases, and when one decreases, the other decreases.
An example of positive correlation would be height and weight. Taller people tend to be heavier.
A Negative Correlation:
Relationship between two variables in which an increase in one variable is associated with a decrease
in the other.
An example of negative correlation would be height above sea level and temperature. As you climb the
mountain (increase in height) it gets colder (decrease in temperature).
A zero Correlation exists when there is no relationship between two variables. For example there is no
relationship between the amount of tea drunk and level of intelligence.
Correlation
A correlation can be expressed visually by drawing a scattergram (also known as a scatterplot, scatter
graph, scatter chart, or scatter diagram).
A scattergram is a graphical display that shows the relationships or associations between two numerical
variables (or co-variables), which are represented as points (or dots) for each pair of score.
A scattergraph indicates the strength and direction of the correlation between the co-variables.
Calculating Correlation (Pearson's Correlation)
Step 1: Find the mean of the x values and the mean of the y values.
Step 2: Subtract the mean of x from every x value (call them "a"), and subtract the mean of y from every y
value (call them "b").
Step 3: For each pair, calculate ab, a² and b².
Step 4: Sum up all the ab values, all the a² values and all the b² values.
Step 5: Divide the sum of ab by the square root of [(sum of a²) × (sum of b²)].
Formula:
r = Σab / √(Σa² × Σb²) = Σ(x − x̄)(y − ȳ) / √[Σ(x − x̄)² × Σ(y − ȳ)²]
Where:
Σ is Sigma, the symbol for "sum up"
(x − x̄) is each x-value minus the mean of x (called "a" above)
(y − ȳ) is each y-value minus the mean of y (called "b" above)
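A small sketch of these five steps in Python, checked against NumPy’s corrcoef; the x and y values are the same ones used in the covariance example further below:

```python
import math
import numpy as np

x = [2, 8, 18, 20, 28, 30]
y = [5, 12, 18, 23, 45, 50]

mean_x = sum(x) / len(x)                          # Step 1
mean_y = sum(y) / len(y)
a = [xi - mean_x for xi in x]                     # Step 2
b = [yi - mean_y for yi in y]
sum_ab = sum(ai * bi for ai, bi in zip(a, b))     # Steps 3 and 4
sum_a2 = sum(ai ** 2 for ai in a)
sum_b2 = sum(bi ** 2 for bi in b)
r = sum_ab / math.sqrt(sum_a2 * sum_b2)           # Step 5

print(round(r, 4))                                # ~0.95: strong positive correlation
print(round(np.corrcoef(x, y)[0, 1], 4))          # NumPy agrees
```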
Covariance
Covariance is a measure of how much two random variables vary together.
It’s similar to variance, but where variance tells you how a single variable varies, covariance tells you
how two variables vary together.
Covariance is a statistical tool that is used to determine the direction of the relationship between the
movements of two random variables.
When two stocks tend to move together, they are seen as having a positive covariance; when they move
inversely, the covariance is negative.
This relationship is determined by the sign (positive or negative) of the covariance value.
In other words, whether they tend to move in the same or opposite directions.
Types of Covariance
Positive Covariance
A positive covariance between two variables indicates that these variables tend to be higher or lower
at the same time.
In other words, a positive covariance between variables x and y indicates that x is higher than average at
the same times that y is higher than average, and vice versa.
When charted on a two-dimensional graph, the data points will tend to slope upwards.
Negative Covariance
When the calculated covariance is less than zero, this indicates that the two variables have an
inverse relationship.
In other words, an x value that is lower than average tends to be paired with a y that is greater than
average, and vice versa.
Covariance Formula
Formula:
Cov(X, Y) = Σ (xi – x̄)(yi – ȳ) / N
Where,
xi = data value of x
yi = data value of y
x̄ = mean of x
ȳ = mean of y
N = number of data values.
Covariance
The covariance of X and Y can be interpreted as follows.
If cov(X, Y) is greater than zero, then we can say that the covariance for any two variables is positive and
both the variables move in the same direction.
If cov(X, Y) is less than zero, then we can say that the covariance for any two variables is negative and both
the variables move in the opposite direction.
If cov(X, Y) is zero, then we can say that there is no relation between two variables.
The relationship between the correlation coefficient and covariance is given by;
Correlation, ρ(X, Y) = Cov(X, Y) / (σX σY)
Where:
ρ(X,Y) = correlation between the variables X and Y
Cov(X,Y) = covariance between the variables X and Y
σX = standard deviation of the X variable
σY = standard deviation of the Y variable
Covariance
Question:
Calculate the covariance for the following data:
X 2 8 18 20 28 30
Y 5 12 18 23 45 50
Solution:
Number of observations = 6
Mean of X = 17.67
Mean of Y = 25.5
Cov(X, Y) = (⅙) [(2 – 17.67)(5 – 25.5) + (8 – 17.67)(12 – 25.5) + (18 – 17.67)(18 – 25.5) + (20 – 17.67)(23 – 25.5)
+ (28 – 17.67)(45 – 25.5) + (30 – 17.67)(50 – 25.5)]
Cov(X, Y) = 157.83
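The same calculation can be checked with NumPy; bias=True gives the divide-by-N population covariance used above:

```python
import numpy as np

x = [2, 8, 18, 20, 28, 30]
y = [5, 12, 18, 23, 45, 50]

# bias=True divides by N (population covariance), matching the 1/6 factor above.
cov_xy = np.cov(x, y, bias=True)[0, 1]
print(round(cov_xy, 2))   # ~157.83
```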
Causation
According to Merriam-Webster, causation is “the act or process of causing something to happen or exist.”
In other words, causation means one event is 100 percent certain to cause something else.
Correlation means there’s a relationship, but not a hundred percent. If you paint, you might sell a
painting. If you stand in the rain, you might get hit by lightning.
Causation indicates a relationship between two events where one event is affected by the other.
In statistics, when the value of one event, or variable, increases or decreases as a result of other events, it is
said there is causation.
Data preparation / Data Pre processing
Data Processing
It is the task of converting data from a given form to a much more usable and desired form i.e. making
it more meaningful and informative.
Using Machine Learning algorithms, mathematical modelling, and statistical knowledge, this entire process
can be automated.
The output of this complete process can be in any desired form like graphs, videos, charts, tables,
images, and many more, depending on the task we are performing and the requirements of the machine.
This might seem to be simple but when it comes to massive organizations like Twitter, Facebook,
Administrative bodies like Parliament, UNESCO, and health sector organizations, this entire process
needs to be performed in a very structured manner.
Collection:
Data can be collected from any authentic source like data.gov.in, Kaggle or the UCI dataset
repository.
For example, while preparing for a competitive exam, students study from the best study material that
they can access so that they learn the best to obtain the best results.
In the same way, high-quality and accurate data will make the learning process of the model easier
and better and at the time of testing, the model would yield state-of-the-art results.
A huge amount of capital, time and resources is consumed in collecting data.
Organizations or researchers have to decide what kind of data they need to execute their tasks or
research.
Example: Working on a Facial Expression Recognizer needs numerous images having a variety of
human expressions. Good data ensures that the results of the model are valid and can be trusted.
Data preparation / Data Pre processing
Preparation:
The collected data can be in a raw form which can’t be directly fed to the machine. So, this is a process of
collecting datasets from different sources, analysing these datasets and then constructing a new dataset
for further processing and exploration.
This preparation can be performed either manually or automatically.
Data can also be prepared in numeric form, which speeds up the model’s learning.
Example: An image can be converted to a matrix of N X N dimensions; the value of each cell will indicate
the image pixel.
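A tiny sketch of that idea with NumPy and Pillow (the file name is hypothetical):

```python
import numpy as np
from PIL import Image

# Hypothetical grayscale image file; convert("L") forces a single channel.
img = Image.open("digit.png").convert("L").resize((28, 28))

# The image becomes an N x N matrix; each cell holds one pixel intensity (0-255).
matrix = np.array(img)
print(matrix.shape)   # (28, 28)
```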
Data preparation / Data Pre processing
Preparation:
“Process of transforming raw data so that data scientists and analysts can run it through machine
learning algorithms to uncover insights or make predictions.”
Some datasets have values that are missing, invalid, or otherwise difficult for an algorithm to process. If
data is missing, the algorithm can’t use it. If data is invalid, the algorithm produces less accurate or
even misleading outcomes.
Some datasets are relatively clean but need to be shaped (e.g., aggregated or pivoted) and many
datasets are just lacking useful business context (e.g., poorly defined ID values), hence the need for
feature enrichment.
Why do we need Data Preprocessing?
Real-world data generally contains noise and missing values, and may be in an unusable format which
cannot be directly used for machine learning models.
Data pre-processing is the task of cleaning the data and making it suitable for a machine learning
model, which also increases the accuracy and efficiency of the model.
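A small scikit-learn sketch of two typical pre-processing steps, filling missing values and scaling, on an invented feature table:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Invented raw data with a missing value (np.nan).
X = np.array([
    [25.0, 50000.0],
    [32.0, np.nan],     # missing salary
    [47.0, 81000.0],
])

# Fill missing values with the column mean.
X = SimpleImputer(strategy="mean").fit_transform(X)

# Standardize each column to zero mean and unit variance.
X = StandardScaler().fit_transform(X)
print(X)
```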
Data preparation / Data Pre processing
Input
Now the prepared data can be in a form that may not be machine-readable, so to convert this data
to a readable form, some conversion algorithms are needed.
For this task to be executed, high computation power and accuracy are needed. Example: data can be collected
through sources like the MNIST digit data (images), Twitter comments, audio files, and video clips.
Processing
This is the stage where algorithms and ML techniques are required to perform the instructions provided
over a large volume of data with accuracy and optimal computation.
Data preparation / Data Pre processing
Output
In this stage, results are procured by the machine in a meaningful manner which can be inferred
easily by the user.
Storage
This is the final step in which the obtained output and the data model and all the useful information
are saved for future use.
The data preparation process can be complicated by issues such as:
Missing or incomplete records.
Outliers or anomalies.
Improperly formatted / non-structured data.
Inconsistent values and non-standardized categorical variables.
Limited or sparse features / attributes.
The need for techniques such as feature engineering.
Data Visualization
Data visualization is defined as a graphical representation that contains the information and the data.
By using visual elements like charts, graphs, and maps, data visualization techniques provide an accessible
way to see and understand trends, outliers, and patterns in data.
Nowadays we have a lot of data in our hands; in this world of Big Data, data visualization tools and
technologies are crucial to analyze massive amounts of information and make data-driven decisions.
Data visualization is an easy and quick way to convey concepts universally. You can experiment with a
different outline by making a slight adjustment.
Visualization can also depict phenomena that cannot be observed directly, such as weather patterns, medical
conditions, or mathematical relationships.
Data Visualization Techniques / Methods:
1. Line chart / Plot:
The X-axis represents the period, whereas the Y-axis represents the quantity.
To create a line chart, input the relevant time frame along the X-axis and the quantitative measurement on the
Y-axis. Plot the data in the graph by connecting the time value and the numeric value. After plotting all the
dots, connect them with a line.
A line graph can have one line or several. In the case of a chart with several lines, each one represents
a category.
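A minimal matplotlib sketch of a line chart with two lines; the monthly sales numbers are invented:

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May"]   # period on the X-axis
product_a = [10, 14, 12, 18, 21]               # quantity on the Y-axis
product_b = [8, 9, 13, 11, 16]

# One line per category; points are connected in time order.
plt.plot(months, product_a, marker="o", label="Product A")
plt.plot(months, product_b, marker="o", label="Product B")
plt.xlabel("Month")
plt.ylabel("Units sold")
plt.legend()
plt.show()
```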
Data Visualization Techniques / Methods:
2. Bar chart / Plot / Graph/ Column Charts:
One of the axes is numerical, while the other visualizes the categories or topics being measured.
You can use a bar chart with vertical bars or horizontal bars.
On vertical bar graphs, numerical values are on the Y-axis (vertical axis); on horizontal bars, they are on the X-
axis (horizontal axis.)
Data Visualization Techniques / Methods:
3. Pie chart / Plot :
Circular chart with multiple divisions where each division shows the contribution of each value to the total
value.
The Pie represents the total value, i.e., 100 percent, and each slice of the pie chart adds some percent to
the total.
The larger the contribution of an attribute, the larger will be the size of the slice of the pie chart.
Data Visualization Techniques / Methods:
4. Scatter grams / Scatterplot / Scatter graph / Scatter chart / Scatter Diagram :
A graphical display that shows the relationships or associations between two numerical variables (or co-
variables), which are represented as points (or dots) for each pair of score.
Indicates the strength and direction of the correlation between the co-variables.
Available in both 2-D as well as in 3-D. The 2-D scatter plot is the common one, where we will primarily try
to find the patterns, clusters, and separability of the data.
Data Visualization Techniques / Methods:
5. Box plot / Box and Whisker Plot:
This plot can be used to obtain more statistical details about the data.
The straight lines at the maximum and minimum are also called whiskers.
The box plot also gives us a description of the 25th, 50th and 75th percentiles (Q1, median, Q3).
We can also determine the IQR, the range where the middle 50% of the data lies.
Data Visualization Techniques / Methods:
6. Histogram Plot:
Histograms can only be vertical, unlike bar charts, which can be both vertical and horizontal.
A histogram is a graphical representation that organizes a group of data points into user-specified ranges.
A histogram represents the frequency distribution of variables in a data set over a specific time period
On the other hand, a bar graph typically represents a graphical comparison of discrete or categorical
variables.
Data Visualization Techniques / Methods:
7. Density Plot:
Another quick and easy technique for getting the distribution of each attribute is the density plot.
It is like a histogram, but with a smooth curve drawn through the top of each bin.
This allows you to get a better feel of your data and find useful patterns in it.
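A short pandas/matplotlib sketch drawing a histogram and a density plot of the same invented attribute (the density plot needs SciPy installed for the kernel density estimate):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Invented attribute: 200 normally distributed measurements.
values = pd.Series(np.random.normal(loc=50, scale=10, size=200))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
values.plot.hist(bins=15, ax=ax1, title="Histogram")
values.plot.density(ax=ax2, title="Density plot")   # smooth curve over the bins
plt.tight_layout()
plt.show()
```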
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is an approach to analyse the data using visual techniques.
It refers to the critical process of performing initial investigations on data so as to discover trends and
patterns, to spot anomalies, to test hypotheses and to check assumptions with the help of summary statistics
and graphical representations.
It is an approach of analysing data sets to summarize their main characteristics, often using statistical
graphics and other data visualization methods.
Exploratory Data Analysis
It helps you gather insights and make better sense of the data, and removes irregularities and
unnecessary values from data.
1. Data Collection
It refers to the process of finding and loading data into our system.
Good, reliable data can be found on various public sites or bought from private organizations.
Some reliable sites for data collection are Kaggle, GitHub, Machine Learning Repository, etc.
2. Data Cleaning
Process of removing unwanted variables and values from your dataset and getting rid of any
irregularities in it.
Such anomalies can disproportionately skew the data and hence adversely affect the results.
You can do this either with graphical or non-graphical means by finding specific mathematical values in the
data.
Histograms: Bar plots in which the frequency of data is represented with rectangle bars.
This way, you can find how one feature affects the other.
This is done with scatter plots, which plot individual data points, or correlation matrices that plot the
correlation in hues. You can also use boxplots.
6. Handling Outliers
They can be caused by measurement or execution errors. The analysis for outlier detection is referred to as
outlier mining.
There are many ways to detect outliers, and the removal process is the same as removing any other
data item from the pandas dataframe.
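A closing sketch of outlier removal from a pandas DataFrame using the 1.5 × IQR rule, reusing the earlier example where 1 and 201 are the outliers:

```python
import pandas as pd

# The earlier example data set: 1 and 201 are the outliers.
df = pd.DataFrame({"value": [1, 99, 100, 101, 103, 109, 110, 201]})

q1 = df["value"].quantile(0.25)
q3 = df["value"].quantile(0.75)
iqr = q3 - q1

# Keep only rows inside the 1.5 * IQR fences; the outliers are dropped.
inside = df["value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(df[inside])    # rows with 1 and 201 removed
print(df[~inside])   # the detected outliers: 1 and 201
```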