Statistic & Analytics
Primary Data
Primary data, or raw data, is information obtained directly from a first-hand
source through experiments, surveys, or observations. Primary data is further classified
into two types:
• Qualitative Data
• Quantitative Data
Qualitative Data
Data classified on the basis of qualitative characteristics or attributes is called
qualitative data.
It does not involve any mathematical calculations and is closely associated with
elements that are not quantitative. Qualitative data collection methods include interviews,
questionnaires, observations, case studies, surveys, focus group discussions, etc. The
methods used to collect this type of data are also called data collection tools.
Data collection tools
1. Questionnaires
2. Survey
3. Interviews
4. Focus group discussion.
• Questionnaires
In this method, a set of questions is mailed to the respondents. They should read, reply to, and
subsequently return the questionnaire. The questions are printed in a definite order on the
form. A good survey should have the following features:
Interviews
The interview method of collecting data involves the presentation of oral-verbal stimuli and
replies in terms of oral-verbal responses. It requires the interviewer to ask questions in
face-to-face contact with the person.
It is achieved in two ways, such as
In contrast to quantitative research, which involves numerical data, this data collection
method focuses on qualitative research. It falls under the primary category for data based
on the feelings and opinions of the respondents. This research involves asking open-ended
questions to a group of individuals, usually 6 to 10 people, to obtain feedback.
duration of time. Quantitative data is also known as numerical data which represents the
numerical value (i.e., how much, how often, how many). Numerical data gives information
about the quantities of a specific thing. Some examples of numerical data are height, length,
size, weight, and so on. The quantitative data can be classified into two different types based
on the data sets. The two different classifications of numerical data are discrete data and
continuous data.
Discrete Data
Discrete data can take only discrete values. Discrete information contains only a finite number
of possible values. Those values cannot be subdivided meaningfully. Here, things can be
counted in the whole numbers.
Example: Number of students in the class
Continuous Data
Continuous data is data that can be measured. It has an infinite number of possible values
within a given range.
Example: Temperature range
Data cleaning.
Data cleansing, or data cleaning, is the process of detecting and correcting corrupt or
inaccurate records in a record set, table, or database. It refers to identifying
incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing,
modifying, or deleting the dirty or coarse data.
There are several methods for cleaning data, depending on how it is stored and on
the answers being sought.
1. Descriptive statistics
2. Inferential statistics
Descriptive Statistics
Descriptive statistics is a way to organise, represent and describe a collection of data using
tables, graphs, and summary measures. For example, the number of people in a city who use
the internet or watch television.
In this type of statistics, the data is summarised through the given observations. The
summarisation is done from a sample of the population using measures such as the mean
or standard deviation.
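As a minimal sketch of such summary measures, the Python snippet below computes the mean and the sample standard deviation of a small data set. The values in `hours` are hypothetical and are not taken from these notes.

```python
import statistics

# Hypothetical sample: daily hours of internet use for 8 people.
hours = [2, 4, 4, 4, 5, 5, 7, 9]

mean_hours = statistics.mean(hours)   # arithmetic mean of the sample
sd_hours = statistics.stdev(hours)    # sample standard deviation (divides by n - 1)

print("Mean:", mean_hours)
print("Standard deviation:", round(sd_hours, 3))
```

If the eight values were the whole population rather than a sample, `statistics.pstdev` (which divides by n) would be the matching choice.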
Inferential Statistics
Inferential statistics is a method that allows us to use information collected from a sample
to make decisions, predictions or inferences about a population. It lets us make
statements that go beyond the available data or information. For example, deriving
estimates from hypothetical research.
This type of statistics is used to interpret the meaning of descriptive statistics. That is,
once the data has been collected, analysed and summarised, we use these statistics to describe
the meaning of the collected data. In other words, it is used to draw conclusions from data
that is subject to random variation such as observational errors, sampling variation, etc.
Data tabulation
Tabulation is the process of systematically arranging classified data in rows and columns,
in the form of a table.
Ex.1:
No. of oranges in the box    5    6    7    8    9    10    Total
No. of boxes                 5    8    10   6    3    13    45
Ex.2:
Height (cm)        140-150    150-160    160-170    170-180    Total
No. of students    6          24         18         2          50
The above two examples are frequency distributions, also called frequency tables.
Formation of Discrete frequency distribution:
To form a frequency distribution, three columns are used: variable, tally bars, and
frequency. In the first column, the values of the given variable are written in order, without
repetition. For each observation, a tally (stroke) is marked against its value in the second
column. In this way tally marks are recorded for all observations. For easy counting, the
tallies are grouped in fives (four strokes crossed by a fifth). Finally, the number of tally
bars for each value of the variable is counted and written in the third column. This is known
as the frequency. The total frequency (N) is equal to the total number of
observations.
Ex.3.
In a survey of 40 families in Haveri, the number of children per family was recorded and the
following data were obtained.
1,0,3,2,1,5,6,2,2,1,0,3,4,2,1,6,3,2,1,5,3,3,2,4,2,2,3,0,2,1,4,5,3,3,4,4,1,2,4,5.
Represent the data in the form of a discrete frequency distribution.
Sol: Frequency distribution of the number of children.
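The tally-and-count procedure for this example can be sketched in Python as follows. This is only an illustration using the standard library's `Counter`; the variable names are chosen here and the tally marks are printed as plain strokes.

```python
from collections import Counter

# Number of children per family, as listed in Ex.3 (40 observations).
children = [1, 0, 3, 2, 1, 5, 6, 2, 2, 1, 0, 3, 4, 2, 1, 6, 3, 2, 1, 5,
            3, 3, 2, 4, 2, 2, 3, 0, 2, 1, 4, 5, 3, 3, 4, 4, 1, 2, 4, 5]

frequency = Counter(children)        # value of the variable -> frequency

# Print the three columns: variable, tally bars, frequency.
for value in sorted(frequency):
    print(value, "|" * frequency[value], frequency[value])

total = sum(frequency.values())      # total frequency N
print("Total (N):", total)
```

The total frequency N should equal the number of observations, which gives a quick check that no value was missed.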
Exclusive class:
If the lower limit of a class is included in that class and the upper limit is excluded from
it but included in the next class, the class is called an exclusive class. Here, the upper
limit of a class is equal to the lower limit of the next class.
Ex: 30-40, 40-50, 50-60 are exclusive classes.
Correction factor:
It is half of the difference between the lower limit of a class and the upper
limit of the preceding class. Thus,
\[ \text{Correction factor (C.F)} = \frac{\text{lower limit of a class} - \text{upper limit of the preceding class}}{2} \]
To get exclusive class intervals from inclusive class intervals, subtract the C.F from all
lower limits and add it to all upper limits.
Ex: Convert the following inclusive class intervals to exclusive class intervals.
C-I    10-19    20-29    30-39    40-49
\[ \text{C.F} = \frac{20 - 19}{2} = \frac{1}{2} = 0.5 \]
By subtracting it from all lower limits we get the lower limits 9.5, 19.5, 29.5, 39.5, and by
adding it to all upper limits we get the upper limits 19.5, 29.5, 39.5, 49.5.
Therefore, the exclusive class intervals are 9.5-19.5, 19.5-29.5, 29.5-39.5 and 39.5-49.5.
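The conversion above can be sketched as a small Python function. The function name `to_exclusive` is hypothetical, and the sketch assumes that the gap between consecutive classes is the same throughout, so the correction factor is taken from the first two classes only.

```python
def to_exclusive(classes):
    """Convert inclusive class intervals [(lower, upper), ...] to exclusive ones."""
    # C.F = (lower limit of a class - upper limit of the preceding class) / 2
    cf = (classes[1][0] - classes[0][1]) / 2
    # Subtract C.F from every lower limit and add it to every upper limit.
    return [(lower - cf, upper + cf) for lower, upper in classes]

inclusive = [(10, 19), (20, 29), (30, 39), (40, 49)]
exclusive = to_exclusive(inclusive)
print(exclusive)
```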
Open-end classes
If the lower or upper limit of a class is not specified, the class is called an open-end
class. For example: less than (below) or more than (above) a particular class limit.
A frequency distribution based on open-end classes is called an open-end frequency
distribution.
Ex:
Class interval    Frequency
Less than 20      8
20-30             15
30-40             23
40-50             12
50-60             9
More than 60      3
Cumulative frequency
Frequencies added up successively are called cumulative frequencies.
There are two types of cumulative frequencies:
1. Less than type
2. More than type
The number of observations (frequencies) below a certain limit is the less than cumulative
frequency (L.C.F). The frequency distribution formed from less than cumulative frequencies
against the upper class limits is the less than cumulative frequency distribution.
Ex:
Frequency distribution              Less than cumulative frequency distribution
Weight (kg)   Number of persons     Weight (kg)     Number of persons
30-40         10                    Less than 40    10
40-50         15                    Less than 50    (10+15) = 25
50-60         20                    Less than 60    (25+20) = 45
60-70         15                    Less than 70    (45+15) = 60
The number of observations (frequencies) above a certain limit is the more than cumulative
frequency. The frequency distribution formed from more than cumulative frequencies against
the lower class limits is the more than cumulative frequency distribution.
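Both types of cumulative frequency can be sketched in Python with `itertools.accumulate`, using the weight data from the example above. The variable names are chosen here for illustration.

```python
from itertools import accumulate

classes = [(30, 40), (40, 50), (50, 60), (60, 70)]   # weight (kg)
persons = [10, 15, 20, 15]                           # frequencies

# Less than type: running total from the first class, read against upper limits.
less_than = list(accumulate(persons))

# More than type: running total from the last class, read against lower limits.
more_than = list(accumulate(reversed(persons)))[::-1]

for (lower, upper), lcf, mcf in zip(classes, less_than, more_than):
    print(f"Less than {upper}: {lcf}    More than {lower}: {mcf}")
```

The last less than value and the first more than value both equal the total frequency, which is a convenient consistency check.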
Frequency density is used to compare the concentration of the frequencies of different classes
for a given frequency distribution.
Weight (kg)    Number of persons (f)    Width of the class (c)    Frequency density (f/c)
0-10           10                       10                        10/10 = 1
10-30          15                       20                        0.75
30-50          40                       20                        2
50-60          45                       10                        4.5
60-65          20                       5                         4
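A short Python sketch of the f/c calculation, using the class limits and frequencies from the table above, is given below. The width of each class is computed here as upper limit minus lower limit.

```python
classes = [(0, 10), (10, 30), (30, 50), (50, 60), (60, 65)]   # weight (kg)
persons = [10, 15, 40, 45, 20]                                # frequency f

# Frequency density = frequency / class width (f/c).
densities = [f / (upper - lower) for (lower, upper), f in zip(classes, persons)]

for (lower, upper), f, d in zip(classes, persons, densities):
    print(f"{lower}-{upper}: f = {f}, c = {upper - lower}, f/c = {d}")
```

Computed this way, the last class gives 20 / 5 = 4.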
Relative frequency (relative frequency table)
Relative frequency is the ratio of the frequency of a value of the variable to the total
frequency, i.e.
\[ \text{Relative frequency (R.f)} = \frac{\text{frequency of the value of the variable}}{\text{total frequency}} \]
Ex:
No. of apples per box    No. of boxes (f)    Relative frequency (R.f = f/N)
5                        5                   5/45 = 0.111
6                        8                   8/45 = 0.178
7                        13                  13/45 = 0.289
8                        10                  10/45 = 0.222
9                        6                   6/45 = 0.133
10                       3                   3/45 = 0.067
Total (N)                45                  1
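The same relative frequencies can be reproduced with a few lines of Python. This is a sketch using the apples-per-box data from the table above; the dictionary name `boxes` is chosen here for illustration.

```python
# No. of apples per box -> no. of boxes (frequency f)
boxes = {5: 5, 6: 8, 7: 13, 8: 10, 9: 6, 10: 3}

total = sum(boxes.values())                       # total frequency N
relative = {apples: f / total for apples, f in boxes.items()}

for apples, rf in relative.items():
    print(apples, boxes[apples], round(rf, 3))
print("Total (N):", total, "Sum of R.f:", round(sum(relative.values()), 3))
```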
Diagrammatic and Graphical Representation of data using Excel
Microsoft Excel is a spreadsheet application that is commonly used for a variety of purposes.
At its core, a spreadsheet is a table consisting of rows and columns in which data is
displayed. Features include calculation, graphing tools, pivot
tables, and a macro programming language called Visual Basic for Applications.
Every worksheet is made up of thousands of rectangles, which are called cells. A cell is
the intersection of a row and a column—in other words, where a row and column meet.
Columns are identified by letters (A, B, C), while rows are identified by numbers (1,
2,3). Each cell has its own name—or cell address—based on its column and row. In the
example below, the selected cell intersects column C and row 5, so the cell address is C5.
Note that the cell address also appears in the Name box in the top-left corner, and that a
cell's column and row headings are highlighted when the cell is selected.
You can also select multiple cells at the same time. A group of cells is known as a cell
range. Rather than a single cell address, you will refer to a cell range using the cell addresses
of the first and last cells in the cell range, separated by a colon. For example, a cell range
that included cells A1, A2, A3, A4, and A5 would be written as A1:A5. Take a look at the
different cell ranges below:
• Cell range A1:F8
If the columns in your spreadsheet are labelled with numbers instead of letters, you'll need to
change the default reference style for Excel.
To select a cell:
To input or edit cell content, you'll first need to select the cell.
1. Click a cell to select it. In our example, we'll select cell D9.
2. A border will appear around the selected cell, and the column
heading and row heading will be highlighted. The cell will remain
selected until you click another cell in the worksheet.
You can also select cells using the arrow keys on your keyboard.
Cell content
Any information you enter into a spreadsheet will be stored in a cell. Each cell can contain
different types of content, including text, formatting, formulas, and functions.
• Text: Cells can contain text, such as letters, numbers, and dates.
• Formulas and functions: Cells can contain formulas and functions that calculate cell
values. In our example, SUM(C5:C18) adds the value of each cell in the cell range C5:C18
and displays the total in cell C19.
To insert content:
1. Click a cell to select it. In our example, we'll select cell G9.
2. Type something into the selected cell, then press Enter on your
keyboard. The content will appear in the cell and the formula bar.
You can also input and edit cell content in the formula bar.
To delete (or clear) cell content:
1. Select the cell(s) with content you want to delete. In our example,
we'll select the cell range A9:C9.
2. Select the Clear command on the Home tab, then click Clear
Contents.
You can also use the Delete key on your keyboard to delete content from multiple cells at
once. The Backspace key will only delete content from one cell at a time.
To delete cells: There is an important difference between deleting the content of a cell
and deleting the cell itself. If you delete the entire cell, the cells below it will shift to fill in
the gaps and replace the deleted cells.
1. Select the cell(s) you want to delete. In our example, we'll
select A9:C9.
2. Select the Delete command from the Home tab on the Ribbon.
2. Click the Copy command on the Home tab, or press Ctrl+C on your
keyboard.
3. Select the cell(s) where you want to paste the content. In our example,
we'll select F9:F17. The copied cell(s) will have a dashed box around
them.
4. Click the Paste command on the Home tab, or press Ctrl+V on your
keyboard.
5. The content will be pasted into the selected cells.
You can also access additional paste options, which are especially convenient when
working with cells that contain formulas or formatting. Just click the drop-down arrow on
the Paste command to see these options.
Instead of choosing commands from the Ribbon, you can access commands quickly by right-
clicking. Simply select the cell(s) you want to format, then right-click the mouse. A drop-
down menu will appear, where you'll find several commands that are also located on the
Ribbon.
3. Select the cells where you want to paste the content. In our example, we'll
select F10:F11. The cut cells will now have a dashed box around them.
4. Right-click the mouse and select the Paste command. Alternatively, you can
use the command on the Home tab, or press Ctrl+V on your keyboard.
5. The cut content will be removed from the original cells
and pasted into the selected cells.
Instead of cutting, copying, and pasting, you can drag and drop cells to move their contents.
3. Click and drag the cells to the desired location. In our example, we'll
move them to H4:H12.
4. Release the mouse. The cells will be dropped in the selected location.
A common problem that occurs as a database grows in size is that many duplicate rows appear
in it. And even if your huge database contains just a handful of identical records, those few
duplicates can cause a whole lot of problems, for example mailing multiple copies of the same
document to the same person, or calculating the same numbers more than once in a summary
report. So, before using a database, it makes sense to check it for duplicate entries, to make
sure you are not wasting time on repeating your efforts.
The Remove Duplicates tool allows you to find and remove absolute duplicates (cells or entire
rows) as well as partially matching records (rows that have identical values in a specified
column or columns). To do this, follow the steps below.
Note. Because the Remove Duplicates tool permanently deletes identical records, it's a good
idea to make a copy of the original data before removing duplicate rows.
1. To begin with, select the range in which you want to delete dupes. To select the
entire table, press Ctrl + A.
2. With the range selected, go to the Data tab > Data Tools group, and click
the Remove Duplicates button.
3. The Remove Duplicates dialog box will open. Select the columns to check for
duplicates, and click OK.
o To delete duplicate rows that have completely equal values in all
columns, leave the check marks next to all columns, like in the screenshot
below.
o To remove partial duplicates based on one or more key columns, select
only those relevant columns. If your table has many columns, the fastest
way is to click the Unselect All button, and then select the columns you
want to check for dupes.
o If your table does not have headers, clear the My data has headers box in
the upper-right corner of the dialog window, which is usually selected by
default.
Done! All duplicate rows in the selected range are deleted, and a message is displayed
indicating how many duplicate entries have been removed and how many unique values
remain.
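The logic of removing full and partial duplicates can be sketched outside Excel as well. The Python function below is only an illustration of the idea (keep the first occurrence, compare either all columns or selected key columns); the function name, the `key_columns` parameter and the sample rows are hypothetical, and this is not how Excel implements the tool.

```python
def remove_duplicates(rows, key_columns=None):
    """Keep the first occurrence of each row.

    If key_columns is given (a list of column indexes), rows are treated as
    duplicates when they match on those columns only (partial duplicates).
    """
    seen = set()
    unique = []
    for row in rows:
        if key_columns is None:
            key = tuple(row)                          # compare all columns
        else:
            key = tuple(row[i] for i in key_columns)  # compare key columns only
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Hypothetical table: name, city, amount.
rows = [("Ann", "Delhi", 100), ("Bob", "Pune", 200),
        ("Ann", "Delhi", 100), ("Ann", "Goa", 300)]

full = remove_duplicates(rows)                       # absolute duplicates removed
partial = remove_duplicates(rows, key_columns=[0])   # duplicates by name only
print(len(rows) - len(full), "duplicate rows removed;", len(full), "unique rows remain")
```

Because the function returns a new list, the original `rows` are left untouched, which mirrors the advice to keep a copy of the original data.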
Get rid of duplicates by copying unique records to another location
Another way to get rid of duplicates in Excel is separating unique values, and copying them
to another sheet or a different workbook. The detailed steps follow below.
1. Select the range or the entire table that you want to dedupe.
2. Navigate to the Data tab > Sort & Filter group, and click the Advanced button.
4. Finally, click OK, and the unique values will be copied to a new location:
Assuming you have our Ultimate Suite installed in your Excel, perform these simple steps to
eliminate duplicate rows or cells:
1. Select any cell in the table that you want to dedupe, and click the Dedupe
Table button on the Ablebits Data tab. Your entire table will get selected
automatically.
2. The Dedupe Table dialog window will open, and all the columns will be selected
by default. You pick Delete duplicates from the Select the action drop-down list
and click OK. Done!
As you can see in the following screenshot, all duplicate rows except 1st occurrences are
deleted:
Tip. If you want to remove duplicate rows based on values in a key column, leave only
that column(s) selected, and uncheck all other irrelevant columns.
And if you want to perform some other action, say, highlight duplicate rows without
deleting them, or copy duplicate values to another location, select the corresponding option
from the drop-down list:
If you want more options, such as deleting duplicate rows including first occurrences or
finding unique values, then use the Duplicate Remover wizard that provides all these features.
Below you will find full details and a step-by-step example.
Removing duplicates in Excel is a common operation. However, in each particular case, there
can be a number of specificities. While the Dedupe Table tool focuses on speed, the Duplicate
Remover offers a number of additional options to dedupe your Excel sheets exactly the way
you want.
1. Select any cell within the table where you want to delete duplicates, switch to
the Ablebits Data tab, and click the Duplicate Remover button.
The Duplicate Remover wizard will run and the entire table will get selected. The add-in will
also suggest creating a backup copy, and because you are going to permanently delete
duplicates, we strongly advise that you check this box. Verify that the table has been selected
correctly and click Next.
2. Select what records you want to find and remove. The following options are
available to you:
o Duplicates except 1st occurrences
o Duplicates including 1st occurrences
o Unique values
o Unique values and 1st duplicate occurrences
3. And now, select the columns to search for duplicates. Because our aim is to
eliminate duplicate rows, be sure to select all the columns (which is usually done
by default).
4. Finally, select the action you want to perform on dupes and click
the Finish button. In this example, we expectedly choose the Delete duplicate
values option.
That's it! The Duplicate Remover add-in swiftly does its job and notifies you how many
duplicate rows have been found and deleted:
That's how you can wipe duplicates off Excel Data.
Frequency Distribution
There are many ways of presentation of frequency distribution in the form of graphs. Some
commonly used graphs are
Now in order to calculate frequency, we have to group the data with students marks as shown
below.
Now using the frequency function we will group the data by following the below steps.
• Create a new column named Frequency
• Use the frequency formulation on G column by selecting G3 to G13.
• Here we need to select the entire frequency column; only then will the FREQUENCY function
work properly, otherwise we will get an error value.
• As shown in the above screenshot, we have selected the marks column as the data array and
the student marks bins as the bin array, =FREQUENCY(C3:C22,G3:G13), and then pressed
CTRL+SHIFT+ENTER.
• This fills in the values for the whole column.
• Once we press CTRL+SHIFT+ENTER, we can see opening and closing curly braces around the
formula, as shown below.
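To make the grouping rule concrete, here is a minimal Python sketch of what a FREQUENCY-style count does: each value is counted in the first bin whose upper limit is greater than or equal to it, and values above the last bin go into one extra count at the end. The function name `frequency`, the marks and the bin limits are hypothetical, and the bins are assumed to be sorted in ascending order.

```python
from bisect import bisect_left

def frequency(data, bins):
    """Count values per bin; bins holds ascending upper limits.

    Returns len(bins) + 1 counts; the last one holds values above the final bin.
    """
    counts = [0] * (len(bins) + 1)
    for value in data:
        # Index of the first bin limit that is >= value.
        counts[bisect_left(bins, value)] += 1
    return counts

marks = [5, 12, 15, 18, 20, 25, 33, 38, 40, 47]   # hypothetical student marks
bins = [10, 20, 30, 40]                           # upper limits of the bins
counts = frequency(marks, bins)
print(counts)
```

Note that a mark equal to a bin limit (for example 20) is counted in that bin, not in the next one.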
Now, using the Excel frequency distribution, we have grouped the students' marks by range:
1 student scored 0-10, 5 students scored 10-20, 1 student scored 20-30, 3 students scored
30-40, and so on, as shown below.
Press OK; the histogram data table appears.
Enter the input range, bin range and output range.
Press OK; the frequency distribution table appears along with the histogram.
How to Calculate Relative Frequency in Excel
A frequency table is a table that displays information about frequencies. Frequencies simply
tell us how many times a certain event has occurred.
For example, the following table shows how many items a shop sold in different price ranges
in a given week:
Item price    Frequency
1-10          20
11-20         21
21-30         13
31-40         8
41-50         4
The first column displays the price class and the second column displays the frequency of
that class.
It’s also possible to calculate the relative frequency for each class, which is simply the
frequency of each class as a proportion of the whole.
Item price    Frequency    Relative frequency
1-10          20           0.303
11-20         21           0.318
21-30         13           0.197
31-40         8            0.121
41-50         4            0.061
Total         66           1
In total, there were 66 items sold. Thus, we found the relative frequency of each class by taking
the frequency of each class and dividing by the total items sold.
For example, there were 20 items sold in the price range of 1 – 10. Thus, the relative frequency
of the class 1 – 10 is 20 / 66 = 0.303.
Next, there were 21 items sold in the price range of 11 – 20. Thus, the relative frequency of the
class 11 – 20 is 21 / 66 = 0.318. The remaining relative frequencies are 13/66 = 0.197,
8/66 = 0.121 and 4/66 = 0.061.
The following example illustrates how to find relative frequencies in Excel.
First, we will enter the class and the frequency in columns A and B:
Next, we will calculate the relative frequency of each class in column C. Column D shows
the formulas we used:
We can verify that our calculations are correct by making sure the sum of the relative
frequencies adds up to 1:
We can also create a relative frequency histogram to visualize the relative frequencies.
Simply highlight the relative frequencies:
Then go to the Charts group in the Insert tab and click the first chart type in Insert Column
or Bar Chart:
Modify the x-axis labels by right-clicking on the chart and clicking Select Data.
Under Horizontal (Category) Axis Labels click Edit and type in the cell range that contains
the item prices. Click OK and the new axis labels will automatically appear:
The same procedure as for the bar graph is used to draw a pie chart or a line graph: select
the data, go to the Insert tab, choose the required chart (pie or line) in the Charts group,
and click OK. The required graph then appears for the selected data.
Pie Graph: The “pie chart”, also known as a “circle chart”, divides a circular statistical
graphic into sectors or slices in order to illustrate numerical proportions. Each sector denotes
a proportionate part of the whole. A pie chart works best for showing the composition of
something. In many cases, pie charts can replace other graphs such as the bar
graph, line plots, histograms, etc. The pie chart is an important type of data
representation. It contains different segments or sectors, each of which
forms a certain portion (percentage) of the total. The total of all the data corresponds to
360°.
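Since the whole corresponds to 360°, the angle of each sector is its value divided by the total, multiplied by 360. A minimal Python sketch of this calculation follows; the function name `sector_angles` and the expense figures are hypothetical.

```python
def sector_angles(values):
    """Return the central angle (in degrees) of each pie sector."""
    total = sum(values.values())
    # Angle of a sector = value * 360 / total, so all angles add up to 360.
    return {label: value * 360 / total for label, value in values.items()}

expenses = {"Rent": 50, "Food": 30, "Travel": 20}   # hypothetical data
angles = sector_angles(expenses)

for label, angle in angles.items():
    print(label, angle)
```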
Grouped Data:
1. Bar Graph
Definition: The pictorial representation of grouped data in the form of vertical
or horizontal rectangular bars, where the lengths of the bars are proportional to the measure
of the data, is known as a bar graph or bar chart.
The bars drawn are of uniform width. The variable is represented on one of the
axes, and the measure of the variable is depicted on the other axis. The heights or lengths
of the bars denote the value of the variable, and these graphs are also used to compare certain
quantities. Frequency distribution tables can be easily represented using bar charts, which
simplify the calculations and understanding of data.
Even though the graph can be plotted horizontally or vertically, the most usual type of
bar graph is the vertical bar graph. The roles of the x-axis and y-axis are swapped
depending on whether the bar chart is vertical or horizontal. Apart from the vertical and
horizontal bar graphs, there are two other types of bar charts:
• Grouped Bar Graph
• Stacked Bar Graph
Now, let us discuss these two types of bar graphs.
Grouped Bar Graph
Grouped bar graph is also called the clustered bar graph, which is used to represent the discrete
value for more than one object that shares the same category. In this type of bar chart, the total
number of instances are combined into a single bar. In other words, a grouped bar graph is a
type of bar graph in which different sets of data items are compared. Here, a single color is
used to represent the specific series across the set. The grouped bar graph can be represented
using both vertical and horizontal bar charts.
Stacked Bar Graph
Stacked bar graph is also called the composite bar chart, which divides the aggregate into
different parts. In this type of bar graph, each part can be represented using different colors,
which helps to easily identify the different categories. The stacked bar chart requires specific
labeling to show the different parts of the bar. In a stacked bar graph, each bar represents the
whole and each segment represents the different parts of the whole.
Uses of Bar Graphs
Bar graphs are used to compare things between different groups or to track changes over time.
However, when trying to measure change over time, bar graphs are most suitable when the changes
are larger.
Bar charts possess a discrete domain of divisions and are normally scaled so that all the data
can fit on the graph. When there is no regular order of the divisions being compared, bars on the
chart may be organized in any order. Bar charts organized from the highest to the lowest
number are called Pareto charts.
How to Make a Frequency Polygon in Excel
Next, we will create the frequency polygon. Highlight the frequency values in column C:
Then go to the Charts group in the Insert tab and click the first chart type in Insert Line or
Area Chart:
To change the x-axis labels, right click anywhere on the chart and click Select Data. A new
window will pop up. Under Horizontal (Category) Axis Labels click Edit and type in the
cell range that contains the Midpoint values.
Click OK and the new axis labels will automatically appear:
Frequency polygon curve
The steps for the frequency polygon are also used to draw a frequency curve, but instead
of selecting a line chart (Chart Tools) we select a Scatter chart with smooth
lines. The curve chart is highlighted below. Select the data and go directly to the chart
tools to get the frequency curve.
Ex:
For a relative frequency polygon, refer to the steps for the relative frequency
distribution above, and calculate the midpoints of the class intervals and the relative
frequencies. Drawing the frequency polygon from these gives the relative frequency polygon.
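The points plotted in a relative frequency polygon can be prepared as in the following Python sketch: the x-value is the midpoint of each class and the y-value is its relative frequency. The class intervals and frequencies used here are hypothetical illustration values.

```python
classes = [(0, 10), (10, 20), (20, 30), (30, 40)]   # hypothetical class intervals
frequencies = [1, 5, 1, 3]                          # hypothetical frequencies

# Midpoint of a class = (lower limit + upper limit) / 2
midpoints = [(lower + upper) / 2 for lower, upper in classes]

total = sum(frequencies)
relative = [f / total for f in frequencies]

# (midpoint, relative frequency) pairs are the points joined by the polygon.
points = list(zip(midpoints, relative))
print(points)
```

For an ordinary frequency polygon, the same midpoints are paired with the frequencies themselves instead of the relative frequencies.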
Histogram
A histogram is a common data analysis tool in the business world. It’s a column chart that
shows the frequency of the occurrence of a variable in the specified range.
Histogram is a graphical representation, similar to a bar chart in structure, that organizes a
group of data points into user-specified ranges. The histogram condenses a data series into
an easily interpreted visual by taking many data points and grouping them into logical
ranges or bins.
A simple example of a histogram is the distribution of marks scored in a subject. You can
easily create a histogram and see how many students scored less than 35, how many were
between 35-50, how many between 50-60 and so on.
3. In the Charts group, click on the ‘Insert Statistic Chart’ option.
The above steps would insert a histogram chart based on your data set (as shown below).
Now you can customize this chart by right-clicking on the vertical axis and selecting Format
Axis.
This will open a pane on the right with all the relevant axis options.
Here are some of the things you can do to customize this histogram chart:
1. By Category: This option is used when you have text categories. This could
be useful when you have repetitions in categories and you want to know the
sum or count of the categories. For example, if you have sales data for items
such as Printer, Laptop, Mouse, and Scanner, and you want to know the total
sales of each of these items, you can use the By Category option. It isn’t
helpful in our example as all our categories are different (Student 1, Student 2,
Student3, and so on.)
2. Automatic: This option automatically decides what bins to create in the
Histogram. For example, in our chart, it decided that there should be four bins.
You can change this by using the ‘Bin Width/Number of Bins’ options
(covered below).
3. Bin Width: Here you can define how big the bin should be. If I enter 20 here,
it will create bins such as 36-56, 56-76, 76-96, 96-116.
4. Number of Bins: Here you can specify how many bins you want. It will
automatically create a chart with that many bins. For example, if I specify 7
here, it will create a chart as shown below. At a given point, you can either
specify Bin Width or Number of Bins (not both).
5. Overflow Bin: Use this bin if you want all the values above a certain value
clubbed together in the Histogram chart. For example, if I want to know the
number of students that have scored more than 75, I can enter 75 as the
Overflow Bin value. It will show me something as shown
6. Underflow Bin: Similar to Overflow Bin, if I want to know the number of
students that have scored less than 40, I can enter 40 as the value and show a
chart as shown below.
Once you have specified all the settings and have the histogram chart you want, you can
further customize it (changing the title, removing gridlines, changing colors, etc.)
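The effect of the Bin Width setting can be illustrated with a short Python sketch that builds equal-width bins starting from the smallest value. The function name `bins_from_width` is hypothetical; the minimum of 36 and the width of 20 follow the example above, while the maximum of 100 is an assumption made here. This only mirrors the arithmetic and is not Excel's own implementation.

```python
import math

def bins_from_width(minimum, maximum, width):
    """Return (lower, upper) bin edges of equal width covering minimum..maximum."""
    count = math.ceil((maximum - minimum) / width)   # number of bins needed
    return [(minimum + i * width, minimum + (i + 1) * width) for i in range(count)]

bins = bins_from_width(36, 100, 20)
print(bins)
```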
Histogram using Data Analysis Toolpak
Once you have the Analysis Toolpak enabled, you can use it to create a histogram in Excel.
Suppose you have a dataset as shown below. It has the marks (out of 100) of 40 students in a
subject.
To create a histogram using this data, we need to create the data intervals in which we want
to find the data frequency. These are called bins.
With the above dataset, the bins would be the marks intervals.
You need to specify these bins separately in an additional column as shown below:
Now that we have all the data in place, let’s see how to create a histogram using this data:
• Click the Data tab.
• In the Analysis group, click on Data Analysis.
• In the ‘Data Analysis’ dialog box, select Histogram from the list.
• Click OK.
• In the Histogram dialog box:
• Select the Input Range (all the marks in our example)
• Select the Bin Range (cells D2:D7)
• Leave the Labels checkbox unchecked (you need to check it if you
included labels in the data selection).
• Specify the Output Range if you want to get the Histogram in the
same worksheet. Else, choose New Worksheet/Workbook option
to get it in a separate worksheet/workbook.
• Select Chart Output.
• Click OK.
This would insert the frequency distribution table and the chart in the specified
location.
Now there are some things you need to know about the histogram created using the Analysis
Toolpak:
• The first bin includes all the values below it. In this case, 35 shows 3 values
indicating that there are three students who scored less than 35.
• The last specified bin is 90, however, Excel automatically adds another bin
– More. This bin would include any data point which lies after the last
specified bin. In this example, it means that there are 2 students who have
scored more than 90.
• Note that even if I add the last bin as 100, this additional bin would still be
created.
• This creates a static histogram chart. Since Excel creates and pastes the
frequency distribution as values, the chart would not update when you change
the underlying data. To refresh it, you’ll have to create the histogram again.
• The default chart is not always in the best format. You can change the
formatting like any other regular chart.
• Once created, you cannot use Ctrl + Z to revert it. You’ll have to
manually delete the table and the chart.
• If you create a histogram without specifying the bins (i.e., you leave the Bin
Range empty), Excel still creates the histogram. It automatically
creates six equally spaced bins and uses them to build the histogram.
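The binning behaviour described in the notes above can be sketched in Python. This is only an illustration under stated assumptions: the marks and the bin limits below are hypothetical (the worksheet data is not reproduced here), and `toolpak_style_counts` is a made-up helper name, not an Excel function. It assumes each value is counted in the first bin whose limit is greater than or equal to it, with everything above the last limit going to "More".

```python
from bisect import bisect_left

def toolpak_style_counts(values, bins):
    """Count values per bin: a value goes to the first bin limit that is
    >= the value; anything above the last limit goes to 'More'."""
    limits = sorted(bins)
    counts = [0] * (len(limits) + 1)      # the extra slot is the "More" bin
    for v in values:
        counts[bisect_left(limits, v)] += 1
    labels = [str(b) for b in limits] + ["More"]
    return dict(zip(labels, counts))

# Hypothetical marks and bin limits, for illustration only.
marks = [28, 35, 36, 50, 51, 64, 77, 90, 91, 98]
bins = [35, 50, 60, 70, 80, 90]

table = toolpak_style_counts(marks, bins)
for label, count in table.items():
    print(label, count)
```

Note that a mark of exactly 35 lands in the 35 bin and a mark of exactly 90 lands in the 90 bin, while 91 and 98 fall into "More".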
Box plot
A boxplot (also called a box plot or box-and-whisker plot) is a compact but efficient way to
represent a dataset using descriptive statistics. This “little diagram” combines informative,
standard values such as the first and third quartiles (the bottom and top of the box,
respectively), the median (the flat line inside the box) and sometimes the mean (a second flat
line inside the box). The whiskers are often used to represent the minimum and maximum values,
but some authors use other parameters, such as one standard deviation above and below the mean
of the data, OR the lowest and highest values contained in the range defined by the 1st quartile
minus 1.5 times the interquartile range and the 3rd quartile plus 1.5 times the interquartile
range (cf. “Tukey plot”), OR the 5th and 95th percentiles, etc. Because the whiskers are defined
by the user (and not by convention), it is important, when creating the boxplot, to state what
they represent in the legend of the chart.
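As a rough illustration of the values a boxplot displays, the sketch below computes the quartiles, the minimum and maximum, and the Tukey-style whiskers (1.5 times the interquartile range). The data are hypothetical, `box_plot_summary` is a made-up helper name, and the quartiles use the "inclusive" interpolation from Python's `statistics` module, which is one convention among several.

```python
from statistics import quantiles

def box_plot_summary(data):
    """Return the values a boxplot typically shows, using Tukey whiskers."""
    values = sorted(data)
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers end at the most extreme data points still inside the fences.
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "min": values[0], "q1": q1, "median": median, "q3": q3, "max": values[-1],
        "tukey_whiskers": (inside[0], inside[-1]),
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }

sample = [1, 2, 3, 4, 5, 6, 7, 8, 30]   # hypothetical data with one extreme value
summary = box_plot_summary(sample)
print(summary)
```

With min/max whiskers the upper whisker would reach 30. With Tukey whiskers it stops at 8 and 30 is shown as an outlier, which is why the legend should say which definition is used.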
Histogram using the built-in chart (Excel 2016)
Bins (i.e. categories that become the “bars” in the graph) are automatically created in Excel
2016 using Scott’s Rule.
Step 1: Enter your data into a single column.
Step 2: Highlight the data you entered in Step 1. To do this, click and hold on the first cell
and then drag the mouse down to the end of the data.
Step 3: Click the “Insert” tab, click Statistic Charts (a blue icon with three vertical bars) and
then click a histogram icon.
Click OK.
Example #1 – Box Plot in Excel
Suppose we have data as shown below which specifies the number of units we sold of a
product month-wise for years 2017, 2018 and 2019 respectively.
Step 1: Select the data and navigate to Insert option in the Excel ribbon. You will have
several graphical options under the Charts section.
Step 2: Select the Box and Whisker option which specifies the Box and Whisker plot.
Right-click on the chart, select the Format Data Series option then select the Show inner
points option. You can see a Box and Whisker plot as shown below.
Use Excel 2016 or later for box plots; creating them in earlier versions is somewhat more difficult.
Step 2: Since we are going to use a stacked chart and modify it into a box and whisker
plot, we need each statistic expressed as a difference from the statistic before it. Therefore,
we use the differences Q1 – Minimum and Maximum – Q3 as the whiskers, and Q1, Q2 – Q1 and
Q3 – Q2 (which together span the interquartile range) as the box. Combined, they form a
box and whisker plot.
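The differences described in this step can be sketched as follows. The five input statistics are hypothetical values, and `stacked_box_series` is a made-up helper name used only for illustration.

```python
def stacked_box_series(minimum, q1, median, q3, maximum):
    """Turn five summary statistics into the differences used for a
    stacked-column box plot: three stacked segments plus two whiskers."""
    return {
        "box_base (Q1)": q1,
        "box_lower (Q2 - Q1)": median - q1,
        "box_upper (Q3 - Q2)": q3 - median,
        "lower_whisker (Q1 - Minimum)": q1 - minimum,
        "upper_whisker (Maximum - Q3)": maximum - q3,
    }

# Hypothetical summary statistics for one data series.
series = stacked_box_series(minimum=10, q1=25, median=40, q3=60, maximum=95)
for name, value in series.items():
    print(name, value)
```

Stacking the three box segments reaches Q3, and the two whisker lengths extend from Q1 down to the minimum and from Q3 up to the maximum.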
Step 3: Now, we are about to add the boxes as the first part of this plot. Select the data from
B24:D26 for boxes (remember Q1 – Minimum and Maximum – Q3 are for the Whiskers?)
Step 4: Go to Insert tab on the excel ribbon and navigate to Recommended Charts under the
Charts section.
Step 5: Inside the Insert Chart window > All Charts, navigate to Column and select the
second option, which is the Stacked Column chart, and click OK.
Step 6: Now, we need to add whiskers. I will start with the lower whisker first. Select the
stack chart portion, which represents the Q1 (Blue bar) > Click on Plus Sign > Select Error
Bars > Navigate to More Options… dropdown under Error Bars.
Step 7: As soon as you click on More Options…, the Format Error Bars menu will appear > Error
Bar Options > Direction: Minus radio button (since we are adding the lower whisker) > End
Style: Cap radio button > Error Amount: Custom > select Specify Value.
This opens a window in which you specify the lower whisker values (Q1 – Minimum,
B23:D23) under Negative Error Value, and then click OK.
Step 8: Do the same for the upper whiskers. Select the gray bar (Q3 – Median bar) and, instead
of selecting Minus as the Direction, use Plus and add the values of Maximum – Q3, i.e. B27:D27,
under the Positive Error Value box.
This opens a window in which you specify the upper whisker values (Maximum – Q3,
B27:D27) under Positive Error Value, and then click OK.
The Graph now should look like the screenshot below:
Step 9: Remove the bars associated with Q1 – Minimum. Select the bars > Format Data
Series > Fill & Line > No Fill. This removes the lower part, which is not useful in the box
and whisker plot and was only added initially because we wanted to plot the stacked bar chart
as a first step.
Step 10: Select the orange bar (Median – Q1) > Format Data Series > Fill & Line > No fill
under the Fill section > Solid line under the Border section > Color > Black. This removes the
colors from the bars and represents them just as outline boxes.
Follow the same procedure for the gray bar (Q3 – Median) to remove the colour from it
and represent it as an outline box. The plot should look like the one in the screenshot below:
This is how we can create a box and whisker plot in any version of Excel. If you have Excel
2016 or above, you have the direct chart option for the box and whisker plot anyway.
Quartile
A quartile is a statistical term that describes a division of observations into four defined
intervals based on the values of the data and how they compare to the entire set of
observations. Each quartile contains 25% of the total observations. Generally, the data is
arranged from smallest to largest:
Select the data and, by using the above functions, we get the quartile values.
Ex:
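As a hedged illustration outside Excel, quartiles can be computed with Python's `statistics.quantiles`. The data below are hypothetical. The assumption here is that the "inclusive" and "exclusive" methods play the same role as Excel's QUARTILE.INC and QUARTILE.EXC functions; the notes above do not say which functions are meant, so treat the comparison as illustrative only.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # hypothetical, already sorted

# "inclusive" treats the data as containing its own minimum and maximum.
q_inc = quantiles(data, n=4, method="inclusive")
# "exclusive" assumes values more extreme than the sample may exist.
q_exc = quantiles(data, n=4, method="exclusive")

print("Q1, Q2, Q3 (inclusive):", q_inc)
print("Q1, Q2, Q3 (exclusive):", q_exc)
```

Both methods agree on the median here, but Q1 and Q3 differ slightly, so it is worth stating which quartile convention was used.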
Stem and leaf plot
A stem-and-leaf display (also known as a stemplot) is a diagram designed to allow you to
quickly assess the distribution of a given dataset. Basically, the plot splits two-digit numbers
in half:
Stems – The first digit
Leaves – The second digit
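The stem/leaf split can be sketched in a few lines of Python. The values are hypothetical two-digit numbers and `stem_and_leaf` is a made-up helper name; integer division by 10 gives the stem and the remainder gives the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group two-digit numbers by first digit (stem) and list second digits (leaves)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # e.g. 24 -> stem 2, leaf 4
        plot[stem].append(leaf)
    return dict(plot)

data = [24, 12, 38, 15, 21, 30, 24]   # hypothetical data
plot = stem_and_leaf(data)
for stem, leaves in plot.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Each printed row has one stem followed by its leaves in ascending order, which makes the shape of the distribution visible at a glance.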
Use the following steps to create a stem-and-leaf plot in Excel.
Step 1: Enter the data.
Enter the data values in a single column:
Step 3: Manually enter the “stems” based on the minimum and maximum values.
Once you finish typing the formula and click Enter, you will get the following result:
To double check that your results are correct, you can verify three numbers:
Central Tendency
Central tendency refers to a value, derived from a set of data, that reflects the centre of
the distribution of the data. It is generally described using measures such as the mean,
the median and the mode.
It is a single value that attempts to describe a set of data by identifying the central
position within the given dataset. These measures are sometimes called measures of central
location. The mean (otherwise known as the average) is the most commonly used measure of
central tendency, but there are other measures such as the median and the mode.
Example #1
Consider the following sample: 33, 55, 66, 56, 77, 63, 87, 45, 33, 82, 67, 56, 77, 62, 56. You are
required to compute the measures of central tendency.
Solution:
The data for the calculation is given below.
The mean will be:
• Mean = 915/15
Mean = 61
The calculation of the median is as follows. Enter the median formula:
=MEDIAN(data array)
Median = 62
Since the number of observations is odd, the middle value of the sorted data, which is in the
8th position, will be the median, which is 62.
The calculation of the mode is as follows:
Mode = 56
We can note from the above table that the observation that recurs the most times
is 56 (3 times in the dataset).
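The three results above can be cross-checked outside Excel. The sketch below uses Python's standard `statistics` module on the same 15 observations; it is a verification aid, not part of the Excel workflow.

```python
from statistics import mean, median, mode

sample = [33, 55, 66, 56, 77, 63, 87, 45, 33, 82, 67, 56, 77, 62, 56]

sample_mean = mean(sample)       # sum of the values divided by their count
sample_median = median(sample)   # middle value of the sorted data
sample_mode = mode(sample)       # most frequently occurring value

print("Sum:", sum(sample), "Count:", len(sample))
print("Mean:", sample_mean)
print("Median:", sample_median)
print("Mode:", sample_mode)
```

The sum is 915 over 15 observations, which gives the mean of 61; the median and mode match the values obtained above.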
As you can see, the answers for Average and Standard Deviation contain too many decimals.
You can easily get rid of them by using the rounding functions.
Absolute & relative dispersion are two different ways to measure the spread of a data set.
They are used extensively in biological statistics, as biological phenomena almost always
show some variation and spread.
The easiest way to differentiate relative dispersion/absolute dispersion is to check whether
your statistic involves units. Absolute measures always have units, while relative measures
do not.
Absolute Measures of Dispersion
Absolute measures of dispersion include:
• The range,
• The quartile deviation,
• The mean deviation,
• The standard deviation and variance.
Absolute measures of dispersion use the original units of data, and are most useful for
understanding the dispersion within the context of your experiment and measurements.
Relative Measures of Dispersion: Relative measures of dispersion are calculated as ratios or
percentages; for example, one relative measure of dispersion is the ratio of the standard
deviation to the mean. Relative measures of dispersion are always dimensionless, and they are
particularly useful for making comparisons between separate data sets or different experiments
that might use different units. They are sometimes called coefficients of dispersion.
Some Commonly Used Measures of Relative Dispersion / Absolute Dispersion
The simplest measure of absolute dispersion is the range. This is just the upper limit minus the
lower limit; the largest data point minus the smallest. We can write this as R = H – L.
For example, if a data set consisted of the points 2, 4, 5, 8, and 18, the range would be
18 – 2 = 16.
The analogous relative measure of dispersion is the coefficient of range. This is given by
(H – L)/(H + L). For our example data set, it would be the ratio (18 – 2)/(18 + 2), so (16/20)
or 4/5.
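The range and the coefficient of range for the example data set can be reproduced with a short Python sketch. The function names are made up for illustration.

```python
def data_range(values):
    """Absolute dispersion: R = H - L."""
    return max(values) - min(values)

def coefficient_of_range(values):
    """Relative dispersion: (H - L) / (H + L), a unit-free ratio."""
    high, low = max(values), min(values)
    return (high - low) / (high + low)

points = [2, 4, 5, 8, 18]
r = data_range(points)
cr = coefficient_of_range(points)
print("Range:", r)
print("Coefficient of range:", cr)
```

The range keeps the units of the data, while the coefficient of range (16/20 = 0.8 here) has no units.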
The standard deviation is a more complicated measure of absolute dispersion. You can
calculate it by squaring the difference between each data point and the mean, summing those
squares, dividing by a number that is one less than the number of your data points, and then
taking the square root of that. Since your values are squared and the square root is taken
again at the end, the standard deviation is given in your original units of measure.
The coefficient of standard deviation, the analogous measure of relative dispersion, is just the
standard deviation divided by the arithmetic mean. To give it as a percentage rather than a ratio,
multiply by 100%.
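A minimal sketch of these two measures, reusing the five example points from the range calculation. It follows the steps described above (divide by one less than the number of points) and then checks the result against `statistics.stdev`, which uses the same sample formula.

```python
from math import sqrt
from statistics import mean, stdev

points = [2, 4, 5, 8, 18]

m = mean(points)
squared_diffs = [(x - m) ** 2 for x in points]
sample_variance = sum(squared_diffs) / (len(points) - 1)   # divide by n - 1
sd = sqrt(sample_variance)                                  # back to original units

cv = sd / m   # coefficient of standard deviation (relative, unit-free)

print("Mean:", m)
print("Standard deviation:", round(sd, 4))
print("Coefficient of standard deviation:", round(cv * 100, 2), "%")
```

Multiplying the ratio by 100 gives the percentage form, which is convenient when comparing data sets measured in different units.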
How to Calculate the Mean Absolute Deviation in Excel
To calculate the mean absolute deviation in Excel, we can perform the following steps:
Step 1: Enter the data. For this example, we’ll enter 15 data values in cells A2:A16.
Step 2: Find the mean value. In cell D1, type the following
formula: =AVERAGE(A2:A16). This calculates the mean value for the data values, which
turns out to be 15.8.
Step 3: Calculate the absolute deviations. In cell B2, type the following
formula: =ABS(A2-$D$1). This calculates the absolute deviation of the value in cell A2 from
the mean value in the dataset.
Next, click cell B2. Then, hover over the bottom right corner of the cell until a
black + sign appears. Double-click the + sign to fill in the remaining values in column B.
Step 4: Calculate the mean absolute deviation.
In cell B17, type the following formula: =AVERAGE(B2:B16). This calculates the mean
absolute deviation for the data values, which turns out to be 6.1866.
Note that you can use these four steps to calculate the mean absolute deviation for any
number of data values. In this example, we used 15 data values but you could use these exact
steps to calculate the mean absolute deviation for 5 data values or 5,000 data values.
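The same four steps can be sketched in Python. The worksheet's 15 values are not reproduced here, so the data below are hypothetical, and `mean_absolute_deviation` is a made-up helper name.

```python
from statistics import mean

def mean_absolute_deviation(values):
    """Average of the absolute deviations of each value from the mean."""
    centre = mean(values)                               # Step 2: mean value
    abs_deviations = [abs(x - centre) for x in values]  # Step 3: |x - mean|
    return mean(abs_deviations)                         # Step 4: average them

data = [2, 4, 5, 8, 18]   # hypothetical data (Step 1)
mad = mean_absolute_deviation(data)
print("Mean absolute deviation:", mad)
```

For these five values the mean is 7.4 and the absolute deviations sum to 22.4, so the result is 22.4 / 5 = 4.48.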
Quartile Deviation
Quartile deviation is based on the difference between the third quartile and the first
quartile of the frequency distribution. This difference is known as the
interquartile range; the difference divided by two is known as the quartile deviation or
semi-interquartile range.
Use the following data for the calculation of quartile deviation.
Q1 = 148.75
The calculation of Q3 can be done as follows:
Q3 = 179.75
QD = (Q3 – Q1)/2 = (179.75 – 148.75)/2 = 15.50
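The arithmetic behind this result can be checked with a small sketch that takes the two quartiles reported above as inputs. The underlying data are not reproduced here, and `quartile_deviation` is a made-up helper name.

```python
def quartile_deviation(q1, q3):
    """Half of the interquartile range (semi-interquartile range)."""
    interquartile_range = q3 - q1
    return interquartile_range / 2

q1, q3 = 148.75, 179.75   # quartiles from the worked example
qd = quartile_deviation(q1, q3)
print("Interquartile range:", q3 - q1)
print("Quartile deviation:", qd)
```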
Enter the cell range for your list of numbers in the Number 1 box. For example, if your data
were in column A from row 1 to 13, you would enter A1:A13. Instead of typing the range,
you can also move the cursor to the beginning of the set of scores you wish to use and click
and drag the cursor across them. Once you have entered the range for your list, click
on OK at the bottom of the dialog box. The mean (average) for the list will appear in the cell
you selected.
Finding the Standard Deviation
Place the cursor where you wish to have the standard deviation appear and click the mouse
button. Select Insert Function (fx) from the FORMULAS tab. A dialog box will
appear. Select STDEV.S (for a sample) from the Statistical category. (Note: If your data
are from a population, click on STDEV.P.) After you have made your selections, click
on OK at the bottom of the dialog box.
Enter the cell range for your list of numbers in the Number 1 box. For example, if your data
were in column A from row 1 to 13, you would enter A1:A13. Instead of typing the range,
you can also move the cursor to the beginning of the set of scores you wish to use and click
and drag the cursor across them. Once you have entered the range for your list, click
on OK at the bottom of the dialog box. The standard deviation for the list will appear in the
cell you selected.
VARIANCE
Variance is one of the most useful tools in probability theory and statistics. It describes
how far the numbers in a data set are from the mean, measured as the average of the squared
deviations. In practice, it often shows how much something changes. For example, temperature
near the equator has less variance than in other climate zones. In this section, we will see
how to calculate variance in Excel.
How to find Variance in Excel?
This is complete data. When we capture complete data (the entire population), we calculate the
population variance. The Excel function for calculating the population variance is VAR.P.
The syntax of VAR.P is
=VAR.P(number1, [number2], ...)
number1, number2, ...: these are the numbers for which you want to calculate the variance.
The first number is compulsory.
Let's use this formula to calculate the variance of our data. We have data in cell C2:C15. So
the formula will be:
=VAR.P(C2:C15)
This returns a value 186.4285714, which is quite a large variance given our data.
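For comparison outside Excel, the sketch below computes a population variance (the VAR.P case, dividing by n) and a sample variance (dividing by n – 1) with Python's `statistics` module. The data are hypothetical, since the worksheet values in C2:C15 are not reproduced here.

```python
from statistics import pvariance, variance

data = [2, 4, 5, 8, 18]   # hypothetical data

population_var = pvariance(data)   # average squared deviation, divides by n
sample_var = variance(data)        # divides by n - 1 instead

print("Population variance:", population_var)
print("Sample variance:", sample_var)
```

The sample variance is always the larger of the two for the same data, because the same sum of squared deviations is divided by a smaller number.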
SKEWNESS and KURTOSIS
If one tail is longer than another, the distribution is skewed. These distributions are sometimes
called asymmetric or asymmetrical distributions as they don’t show any kind of symmetry.
Symmetry means that one half of the distribution is a mirror image of the other half. For
example, the normal distribution is a symmetric distribution with no skew. The tails are exactly
the same.
Skewness is the degree of distortion from the symmetrical bell curve or the normal
distribution. It measures the lack of symmetry in the data distribution.
It differentiates extreme values in one tail versus the other. A symmetrical distribution will
have a skewness of 0.
There are two types of skewness: positive and negative.
Positive skewness is when the tail on the right side of the distribution is longer or fatter.
The mean and median will typically be greater than the mode.
Negative skewness is when the tail on the left side of the distribution is longer or fatter than
the tail on the right side. The mean and median will typically be less than the mode.
So, when is the skewness too much?
The rule of thumb seems to be:
• If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.
• If the skewness is between -1 and -0.5 (negatively skewed) or between 0.5 and
1 (positively skewed), the data are moderately skewed.
• If the skewness is less than -1 (negatively skewed) or greater than 1 (positively
skewed), the data are highly skewed.
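The rule of thumb above can be sketched in Python. The skewness here is the simple moment coefficient (third central moment divided by the 1.5 power of the second), which is an assumption made for illustration; spreadsheet functions such as Excel's SKEW apply a sample-size adjustment, so their values can differ somewhat on small data sets. The data and function names are hypothetical.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5."""
    n = len(values)
    centre = sum(values) / n
    m2 = sum((x - centre) ** 2 for x in values) / n
    m3 = sum((x - centre) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

def describe_skew(g):
    """Apply the rule of thumb to a skewness value."""
    if abs(g) <= 0.5:
        return "fairly symmetrical"
    direction = "positively" if g > 0 else "negatively"
    degree = "moderately" if abs(g) <= 1 else "highly"
    return f"{degree} skewed ({direction})"

symmetric = [1, 2, 3, 4, 5]       # hypothetical, mirror-image around 3
right_tailed = [1, 1, 1, 2, 10]   # hypothetical, long right tail

for data in (symmetric, right_tailed):
    g = skewness(data)
    print(data, round(g, 3), describe_skew(g))
```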
Kurtosis
Kurtosis is all about the tails of the distribution, not the peakedness or flatness. It
describes how heavy the tails are, that is, how prone the distribution is to extreme values.
It is effectively a measure of the outliers present in the distribution.
High kurtosis in a data set is an indicator that the data has heavy tails or outliers. If there
is high kurtosis, we need to investigate why we have so many outliers. It could indicate many
things, such as wrong data entry. Investigate!
Low kurtosis in a data set is an indicator that the data has light tails or a lack of outliers.
If we get low kurtosis (too good to be true), we also need to investigate and trim the dataset
of unwanted results.
Mesokurtic: This distribution has a kurtosis statistic similar to that of the normal
distribution. It means that the extreme values of the distribution are similar to those of a
normal distribution. This definition is used so that the standard normal distribution has a
kurtosis of three.
Leptokurtic (Kurtosis > 3): The distribution is longer and its tails are fatter. The peak is
higher and sharper than mesokurtic, which means that the data are heavy-tailed or have a
profusion of outliers.
Outliers stretch the horizontal axis of the histogram graph, which makes the bulk of the data
appear in a narrow (“skinny”) vertical range, thereby giving the “skinniness” of a leptokurtic
distribution.
Platykurtic (Kurtosis < 3): The distribution is shorter and its tails are thinner than those of
the normal distribution. The peak is lower and broader than mesokurtic, which means that the
data are light-tailed or lack outliers.
The reason for this is that the extreme values are less extreme than those of the normal
distribution.
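The three categories can be illustrated with a short sketch. It computes the plain moment kurtosis (fourth central moment divided by the square of the second), for which a normal distribution gives 3. This is an assumption about which definition is meant; some tools, including Excel's KURT function, report excess kurtosis (centred on 0) with sample adjustments. The data and function names are hypothetical.

```python
def kurtosis(values):
    """Moment kurtosis: m4 / m2**2 (equals 3 for a normal distribution)."""
    n = len(values)
    centre = sum(values) / n
    m2 = sum((x - centre) ** 2 for x in values) / n
    m4 = sum((x - centre) ** 4 for x in values) / n
    return m4 / m2 ** 2

def classify_kurtosis(k, reference=3.0):
    """Compare a kurtosis value with the normal-distribution reference."""
    if k > reference:
        return "leptokurtic"
    if k < reference:
        return "platykurtic"
    return "mesokurtic"

flat = [1, 2, 3, 4, 5]                               # hypothetical, no extreme values
heavy_tailed = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]     # hypothetical, two outliers

for data in (flat, heavy_tailed):
    k = kurtosis(data)
    print(data, round(k, 3), classify_kurtosis(k))
```

In real data the kurtosis will rarely be exactly 3, so in practice "mesokurtic" means close to 3 rather than equal to it.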
We can use the Analysis Toolpak add-in to generate descriptive statistics. For example, you
may have the scores of 14 participants for a test.
To generate descriptive statistics for these scores, execute the following steps.
1. On the Data tab, in the Analysis group, click Data Analysis.
Note: If you can't find the Data Analysis button, load the Analysis ToolPak add-in first.
2. Select Descriptive Statistics and click OK.
6. Click OK.
Result:
Excel Frequency Distribution Using Pivot Table (Extra)
In this example, we will see how to make a frequency distribution in Excel, and display it
graphically, using the available sales database.
One of the easiest ways to make a frequency distribution in Excel is to use a pivot table,
from which we can also create a chart.
Consider the sales data below, which has year-wise sales. We will now see how to use it
with a pivot table in the following steps.
• Create a pivot table for the above sales data. To create a pivot table, go to the
Insert menu and select PivotTable.
• Drag Sales into Row Labels. Drag the same Sales field into Values.
• Make sure that the pivot field setting is set to Count, so that we get the
sales count numbers shown below.
• Click on a sales number in the row labels, right-click, and then choose the Group option.
• We will get the Grouping dialog box as shown below:
• Edit the grouping numbers to start at 5000 and end at 18000, group by
1000, and then click OK.
After that, we will get the following result, where the sales data has been grouped by 1000:
We can see that the sales data has been grouped by 1000 from the minimum to the maximum
values, which can be shown more professionally by displaying it in graphical format.
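The grouping that the pivot table performs can be sketched in Python. The sales figures are hypothetical (the sales database is not reproduced here), `group_counts` is a made-up helper name, and the class labels are an illustrative choice rather than Excel's exact output. For simplicity, values outside the start and end limits are skipped in this sketch.

```python
from collections import Counter

def group_counts(values, start, end, width):
    """Count values in equal-width classes [start, start + width), ... up to end."""
    counts = Counter()
    for v in values:
        if start <= v < end:
            lower = start + ((v - start) // width) * width   # lower class limit
            counts[lower] += 1
    return {f"{lower}-{lower + width - 1}": counts[lower] for lower in sorted(counts)}

# Hypothetical sales values, for illustration only.
sales = [5200, 5900, 6100, 7450, 7800, 7999, 12000, 17500]
frequency = group_counts(sales, start=5000, end=18000, width=1000)
for label, count in frequency.items():
    print(label, count)
```

Each value falls into exactly one class of width 1000, so the counts add up to the number of sales values inside the chosen limits.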
• Go to the Insert menu and select the Column chart.
• The output will be as follows:
We can find Histogram in the Data Analysis group under the Data menu, which is provided by an
add-in. We will see how to apply Histogram by following the steps below.
• Go to the Data menu; at the top right we can find Data Analysis. Click on Data
Analysis, which is highlighted as shown below.
• We will get the dialog box below. Choose the Histogram option and click OK.
• Give the Input Range and Bin Range as shown below.
• Make sure that we have check-marked all the options, such as Labels, Cumulative
Percentage and Chart Output, and then click OK.
• In the chart below, we get the output, which shows the cumulative percentage along
with the frequency.
We can display the above histogram more professionally by editing the sales data as follows.
• Right-click on the histogram chart and click on Select Data.
• We will get the dialog box to change the ranges. Click on Edit.
• Now we can edit the ranges. Edit the bin values to specify exactly the ranges we
need, so that we get the appropriate result, and then click OK.
• The result will be as below.
Things to Remember about Excel Frequency Distribution
• In an Excel frequency distribution, we might lose some of the data while grouping,
so make sure that the grouping is done properly.
• While using an Excel frequency distribution, make sure that the classes are of equal
size, with upper limit and lower limit values.