Statistic & Analytics

This document provides an overview of statistical data collection and analysis. It discusses the different types of data, including primary and secondary data. Primary data can be qualitative or quantitative, and both have different collection methods like surveys, interviews, and observations. Secondary data is previously collected data available from various published and unpublished sources. The document also discusses descriptive and inferential statistics. Descriptive statistics summarize and organize data, while inferential statistics allow inferences about populations based on samples. Finally, it covers data tabulation and frequency distributions which systematically arrange data in tables.

Uploaded by

Shanmukha Ak

For DIPLOMA FIRST YEAR

Rajashekhar K M & Gopal Ganiga


RURAL POLYTECHNIC HAUNSABHAVI,
HIREKERUR, HAVERI
STATISTICS AND ANALYTICS

STATISTICAL DATA COLLECTION AND TYPES


Data
Data is a collection of information gathered by observation, measurement, research or
analysis. It may consist of facts, numbers, names, figures or even descriptions of things. Data
is often organized in the form of graphs, charts or tables.
Data collection is the process of gathering information from all relevant sources to find a
solution to the research problem. It helps to evaluate the outcome of the problem and allows
the investigator to answer the relevant question. Most organizations use data collection
methods to make assumptions about future probabilities and trends. Once the data has been
collected, it must be organized.
Classification of Data
Data can be classified into two types, namely primary data and secondary data. Depending on
the type of data, data collection methods are divided into two categories, namely:

• Primary data (primary data collection methods)
• Secondary data (secondary data collection methods)

Primary Data
Primary data, or raw data, is information obtained directly from a first-hand
source through experiments, surveys, or observations. Primary data is further classified
into two types:
• Qualitative Data
• Quantitative Data
Qualitative Data
Data classified on the basis of qualitative characteristics or attributes is called
qualitative data.
It does not involve any mathematical calculations; it deals with characteristics that are
not numerical. Qualitative data collection methods include interviews, questionnaires,
observations, case studies, surveys, focus group discussions, etc. The methods used to
collect this type of data are also called data collection tools.
Data collection tools
1. Questionnaires
2. Survey
3. Interviews
4. Focus group discussion.

Questionnaires
In this method, a set of questions is mailed to the respondent, who reads them, replies and
then returns the questionnaire. The questions are printed in a definite order on the
form. A good questionnaire should have the following features:
• Short and simple
• Follows a logical sequence
• Provides adequate space for answers
• Avoids technical terms
• Has a good physical appearance (colour, quality of the paper) to attract the
attention of the respondent
Survey
A survey is a process of data gathering that may involve a variety of data collection methods,
including a questionnaire. A questionnaire is a list of questions that are either open-ended
(the respondent answers in his or her own words) or close-ended (the respondent selects
one or more options from a pre-determined set of responses). An enumerator goes to the
respondents, asks them the questions from the proforma in the order listed, and records the
responses in the space provided. The questionnaire is the most commonly used method in
surveys.

Interviews

The interview method of collecting data involves the presentation of oral-verbal stimuli and
replies in terms of oral-verbal responses. In its usual form, the interviewer asks questions in
face-to-face contact with the person and records the spoken answers.
It is carried out in two ways:

• Personal interview – In this method, a person known as an interviewer asks
questions face to face with the other person. The personal interview can be structured
or unstructured, a direct investigation, a focused conversation, etc.
• Telephonic interview – In this method, an interviewer obtains information by
contacting people on the telephone and asking for their answers or views orally.

Focus group discussion

Unlike quantitative research, which involves numerical data, this data collection
method focuses on qualitative research. It falls under primary data collection and is based
on the feelings and opinions of the respondents. It involves asking open-ended
questions to a group of individuals, usually 6 to 10 people, to obtain feedback.

Quantitative Data Collection Methods


The classification of units on the basis of quantitative characteristics or variables (such as
weight, wages, age in years, number of children, etc.) gives quantitative data.
Quantitative data collection is based on mathematical calculations and uses formats such as
close-ended questions, correlation and regression methods, and measures such as the mean,
median or mode. This method is generally cheaper than qualitative data collection, and it can
be applied in a short duration of time. Quantitative data is also known as numerical data,
since it represents a numerical value (i.e., how much, how often, how many). Numerical data
gives information about the quantities of a specific thing. Some examples of numerical data
are height, length, size, weight, and so on. Quantitative data can be classified into two
types: discrete data and continuous data.

Discrete Data
Discrete data can take only distinct, separate values. The possible values can be listed and
counted, and they cannot be subdivided meaningfully. Such data is counted in whole
numbers.
Example: Number of students in the class

Continuous Data
Continuous data is data that is measured rather than counted. It can take any of an infinite
number of possible values within a given range.
Example: Temperature

Secondary Data Collection Methods


Secondary data is data that has already been collected and analysed by someone
other than the actual user. It means that the information is already available and has been
analysed by someone else. Sources of secondary data include magazines, newspapers, books,
journals, etc. It may be either published or unpublished data.
Published data are available in various resources including
• Government publications
• Public records
• Historical and statistical documents
• Business documents
• Technical and trade journals
Unpublished data includes
• Diaries
• Letters
• Unpublished biographies etc.

Data cleaning
Data cleansing, or data cleaning, is the process of detecting and correcting corrupt or
inaccurate records in a record set, table, or database. It involves identifying
incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing,
modifying, or deleting this dirty or coarse data.
There are several methods for cleaning data, depending on how it is stored and on
the answers being sought.
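
As a rough illustration of the idea, the sketch below drops incomplete and implausible records from a small data set. The records, the field names and the plausibility limits are all hypothetical and chosen only for the example; real cleaning rules depend on the data and the question being asked.

```python
# Hypothetical survey records; None marks a missing value.
records = [
    {"name": "A", "age": 19, "height_cm": 162.0},
    {"name": "B", "age": None, "height_cm": 158.5},  # incomplete record
    {"name": "C", "age": 20, "height_cm": -5.0},     # inaccurate record
    {"name": "D", "age": 18, "height_cm": 171.2},
]


def is_clean(record):
    """Return True when every field is present and within assumed limits."""
    if any(value is None for value in record.values()):
        return False
    # The limits below are assumptions made for this example only.
    return 0 < record["age"] < 120 and 50 <= record["height_cm"] <= 250


# Keep only the records that pass both checks.
cleaned = [record for record in records if is_clean(record)]
print([record["name"] for record in cleaned])
```

Here the dirty records are deleted; replacing or modifying them, as mentioned above, would be an equally valid choice in other situations.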

Statistics is broadly categorised into two types:

1. Descriptive statistics
2. Inferential statistics
Descriptive Statistics

Descriptive statistics is a way to organise, represent and describe a collection of data using
tables, graphs, and summary measures. For example, the number of people in a city who use
the internet or watch television.
In this type of statistics, the data is summarised from the given observations. The
summarisation is done from a sample of the population using measures such as the mean
or standard deviation.
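
A minimal sketch of such a summary, using Python's standard `statistics` module on a made-up sample of five heights (the values are hypothetical and not taken from these notes):

```python
import statistics

# Hypothetical sample of heights in centimetres.
heights_cm = [150, 160, 165, 170, 155]

# The mean summarises the centre of the sample.
mean_height = statistics.mean(heights_cm)

# The sample standard deviation summarises the spread around the mean.
sd_height = statistics.stdev(heights_cm)

print(mean_height, round(sd_height, 2))
```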

Inferential Statistics
Inferential statistics is a method that allows us to use information collected from a sample
to make decisions, predictions or inferences about a population. It lets us make
statements that go beyond the available data or information. For example, deriving
estimates from hypothetical research.
This type of statistics is used to interpret the meaning of descriptive statistics. That is,
once the data has been collected, analysed and summarised, we use these statistics to explain
the meaning of the collected data. In other words, it is used to draw conclusions from data
that is subject to random variation such as observational errors, sampling variation, etc.
Data tabulation
Tabulation is the process of systematically arranging classified data in rows and columns,
in the form of a table.
Ex.1:
No. of oranges in the box   5    6    7    8    9    10   Total
No. of boxes                5    8    10   6    3    13   45

Ex.2:
Height (cm)       140-150   150-160   160-170   170-180   Total
No. of students   6         24        18        2         50

The above two examples are frequency distributions, or frequency tables.

A frequency distribution is a systematic presentation of the values taken by a variable along
with their frequencies. Frequency refers to the number of times an observation is repeated.
The number of observations corresponding to a particular class is known as the class frequency.
Class frequency is a non-negative integer (a positive integer or zero).
From the above examples, we can explain the following terms:
The number of oranges per box and the height are variables; the number of boxes and the
number of students are frequencies.
A frequency distribution framed without class intervals is called a discrete
frequency distribution (Ex.1).
A frequency distribution framed with class intervals is called a continuous
frequency distribution (Ex.2).

Formation of Discrete frequency distribution:

To form a frequency distribution, three columns are drawn: variable, tally bars and
frequency. In the first column, the values of the variable are written in order, without
repetition. For each observation, a tally (stroke) is marked against its value in the second
column, and this is repeated until all observations are tallied. For easy counting, the tallies
are grouped in fives, with the fifth stroke drawn across the first four. Finally, the number of
tally bars for each value of the variable is counted and written in the third column. This is
the frequency. The total frequency (N) is equal to the total number of observations.
Ex.3:
In a survey of 40 families in Haveri, the number of children per family was recorded and the
following data were obtained.
1,0,3,2,1,5,6,2,2,1,0,3,4,2,1,6,3,2,1,5,3,3,2,4,2,2,3,0,2,1,4,5,3,3,4,4,1,2,4,5.
Represent the data in the form of a discrete frequency distribution.
Sol: Frequency distribution of the number of children (IIII/ denotes a crossed group of five tallies).

Number of children (x)   Tally marks    Frequency (f)
0                        III            3
1                        IIII/ II       7
2                        IIII/ IIII/    10
3                        IIII/ III      8
4                        IIII/ I        6
5                        IIII           4
6                        II             2
Total                                   40
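
The tallying in Ex.3 can be checked with a few lines of Python. This is only a sketch using `collections.Counter` from the standard library; the variable names are illustrative.

```python
from collections import Counter

# Number of children per family for the 40 families in Ex.3.
children = [
    1, 0, 3, 2, 1, 5, 6, 2, 2, 1, 0, 3, 4, 2, 1, 6, 3, 2, 1, 5,
    3, 3, 2, 4, 2, 2, 3, 0, 2, 1, 4, 5, 3, 3, 4, 4, 1, 2, 4, 5,
]

# Counter plays the role of the tally column: one count per distinct value.
frequency_table = Counter(children)

# Print the values in ascending order with their frequencies.
for value in sorted(frequency_table):
    print(value, frequency_table[value])

# The total frequency N equals the number of observations.
total_frequency = sum(frequency_table.values())
print("Total", total_frequency)
```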

Formation of continuous frequency distribution:


Suitable class intervals are formed on the basis of the magnitude of the data. For each value, a
tally mark is made against the class in which it falls. This process is continued until all the
values are exhausted. The tallies of each class are then counted and written as the frequency
of that class.
To construct a continuous frequency distribution table, it is essential to know the following
terms:
Range: It is the difference between the highest and lowest values in the data,
i.e., Range = H.V - L.V.
Class:
Each sub-range of the data is called a class.
Class limits:
The lowest and highest values that define the boundaries of a class are its class
limits. The lowest value is called the lower limit (L.L) and the highest value the upper limit (U.L).
Example: for the classes (30-40), (40-50), ..., the lowest values 30, 40 are the lower
limits and the highest values 40, 50 are the upper limits.
Inclusive class:
If both the lower and upper limits are included in the same class, the class is called an
inclusive class. Here, the upper limit of a class is not equal to the lower limit of the next class.
Ex: 0-9, 10-19, 20-29, ... are inclusive classes.

Exclusive class:
If the lower limit is included in the class but the upper limit is excluded from
it and included in the next class, the class is called an exclusive class. Here, the upper
limit of a class is equal to the lower limit of the next class.
Ex: 30-40, 40-50, 50-60 are exclusive classes.
Correction factor:
It is half of the difference between the lower limit of a class and the upper
limit of the preceding class. Thus,
$$\text{Correction factor (C.F)} = \frac{\text{lower limit of a class} - \text{upper limit of the preceding class}}{2}$$
To get exclusive class intervals from inclusive class intervals, subtract the C.F from all lower
limits and add it to all upper limits.
Ex: Convert the following inclusive class intervals to exclusive class intervals.
C-I: 10-19, 20-29, 30-39, 40-49

$$\text{C.F} = \frac{20 - 19}{2} = \frac{1}{2} = 0.5$$
By subtracting it from all lower limits we get the lower limits 9.5, 19.5, 29.5, 39.5, and by
adding it to all upper limits we get the upper limits 19.5, 29.5, 39.5, 49.5.
Therefore, the exclusive class intervals are 9.5-19.5, 19.5-29.5, 29.5-39.5 and 39.5-49.5.
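
The same conversion can be sketched in Python. The function name `to_exclusive` is hypothetical, and the sketch assumes at least two classes with the same gap between every pair of consecutive classes.

```python
def to_exclusive(inclusive_classes):
    """Convert inclusive (lower, upper) class limits to exclusive ones."""
    # C.F = (lower limit of a class - upper limit of the preceding class) / 2,
    # taken here from the first two classes (assumes equal gaps throughout).
    correction = (inclusive_classes[1][0] - inclusive_classes[0][1]) / 2
    # Subtract C.F from every lower limit and add it to every upper limit.
    return [(lower - correction, upper + correction)
            for lower, upper in inclusive_classes]


inclusive = [(10, 19), (20, 29), (30, 39), (40, 49)]
exclusive = to_exclusive(inclusive)
print(exclusive)
```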
Open- end classes
If the lower or upper limit of a class is not specified, the class is called an open-end
class. For example: less than (below) or more than (above) a particular class limit.
A frequency distribution based on open-end classes is called an open-end frequency
distribution.
Ex:
Class interval   Frequency
Less than 20     8
20-30            15
30-40            23
40-50            12
50-60            9
More than 60     3

Mid-point (class mark)


The central value of a class is called its mid-point or class mark. It is the average of the class limits,
$$m \text{ (or } x\text{)} = \frac{\text{lower limit} + \text{upper limit}}{2}$$
Ex: The mid-point of the class (10-20) is
$$m = \frac{L.L + U.L}{2} = \frac{10 + 20}{2} = 15$$
Width (size) of the class:
The difference between the upper and lower limits of a class is called the width of the class. It is
denoted by c or I.
For example, the width of the class interval (30-40) is 40 - 30 = 10.
Number of classes:
The number of classes can be obtained by using Prof. Sturges' rule:
Number of classes (K) = 1 + 3.322 log N, where N is the number of observations and log is
the logarithm to base 10.
The width of the class can then be obtained by:
$$\text{Width of the class} = c = \frac{\text{Range}}{\text{Number of classes } (K)}$$
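
A small sketch of both formulas follows, using the figures of Ex.2 as input (50 students, heights from 140 cm to 180 cm). Rounding K up to a whole number of classes is an assumption made for the example; the rule itself does not say how to round.

```python
import math


def sturges_classes(n_observations):
    """Number of classes K = 1 + 3.322 log10(N)."""
    return 1 + 3.322 * math.log10(n_observations)


n = 50                      # number of observations
data_range = 180 - 140      # Range = H.V - L.V

k = sturges_classes(n)
k_rounded = math.ceil(k)    # assumption: round up to a whole number of classes
width = data_range / k_rounded

print(round(k, 3), k_rounded, round(width, 2))
```

In practice the width is usually rounded again to a convenient value such as 5 or 10.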

Cumulative frequency
Frequencies added up successively are called cumulative frequencies.
There are two types of cumulative frequencies:
1. Less than type
2. More than type
The number of observations (frequencies) below a certain limit is the less than cumulative
frequency (L.C.F). The frequency distribution formed by less than cumulative frequencies
against the upper class limits is the less than cumulative frequency distribution.

Ex:
Frequency distribution             Less than cumulative frequency distribution
Weight (kg)   Number of persons    Weight (kg)     Number of persons
30-40         10                   Less than 40    10
40-50         15                   Less than 50    (10+15) = 25
50-60         20                   Less than 60    (25+20) = 45
60-70         15                   Less than 70    (45+15) = 60

The number of observations above a certain limit is the more than cumulative frequency. The
frequency distribution formed by more than cumulative frequencies against the lower class
limits is the more than cumulative frequency distribution.

Frequency distribution             More than cumulative frequency distribution
Weight (kg)   Number of persons    Weight (kg)     Number of persons
30-40         10                   More than 30    (50+10) = 60
40-50         15                   More than 40    (35+15) = 50
50-60         20                   More than 50    (15+20) = 35
60-70         15                   More than 60    15
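
Both kinds of cumulative frequency can be computed as running totals. The sketch below uses `itertools.accumulate` on the weight data of the example; the variable names are illustrative.

```python
from itertools import accumulate

classes = [(30, 40), (40, 50), (50, 60), (60, 70)]
frequencies = [10, 15, 20, 15]

# Less than type: running total from the first class, read against upper limits.
less_than = list(accumulate(frequencies))

# More than type: running total from the last class, read against lower limits.
more_than = list(accumulate(reversed(frequencies)))[::-1]

for (lower, upper), lcf, mcf in zip(classes, less_than, more_than):
    print(f"Less than {upper}: {lcf}   More than {lower}: {mcf}")
```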
Frequency density:
The frequency per unit of class width is the frequency density
(f/c). It is the ratio of the class frequency to the width of the class interval,
$$\text{Frequency density} = \frac{\text{frequency of the class}}{\text{width of the class}}$$
It is used to compare the concentration of the frequencies of different classes for a given
frequency distribution.
Weight (kg)   Number of persons (f)   Width of the class (c)   Frequency density (f/c)
0-10          10                      10                       10/10 = 1
10-30         15                      20                       0.75
30-50         40                      20                       2
50-60         45                      10                       4.5
60-65         20                      5                        4
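
As a quick check of the ratio f/c, the sketch below recomputes the frequency density of each class from its limits and frequency; the variable names are illustrative.

```python
classes = [(0, 10), (10, 30), (30, 50), (50, 60), (60, 65)]
frequencies = [10, 15, 40, 45, 20]

densities = []
for (lower, upper), frequency in zip(classes, frequencies):
    width = upper - lower                 # width of the class (c)
    densities.append(frequency / width)   # frequency density (f/c)

for (lower, upper), density in zip(classes, densities):
    print(f"{lower}-{upper}: {density}")
```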

Relative frequency (relative frequency table)
Relative frequency is the ratio of the frequency of a value of the variable to the total frequency,
$$\text{Relative frequency (R.f)} = \frac{\text{frequency of the value of the variable}}{\text{total frequency}}$$
Ex:
No. of apples per box   No. of boxes (f)   Relative frequency (R.f = f/N)
5                       5                  5/45 = 0.111
6                       8                  8/45 = 0.178
7                       13                 13/45 = 0.289
8                       10                 10/45 = 0.222
9                       6                  6/45 = 0.133
10                      3                  3/45 = 0.067
Total (N)               45                 1
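
The relative frequency column can be reproduced with a short sketch; the dictionary below simply restates the apples-per-box table, and the names are illustrative.

```python
# Apples per box -> number of boxes, as in the example table.
boxes = {5: 5, 6: 8, 7: 13, 8: 10, 9: 6, 10: 3}

total = sum(boxes.values())  # total frequency N

# Relative frequency R.f = f / N for every value of the variable.
relative = {apples: count / total for apples, count in boxes.items()}

for apples, value in relative.items():
    print(apples, round(value, 3))

# The relative frequencies always add up to 1 (up to rounding error).
print("Total", round(sum(relative.values()), 3))
```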

Diagrammatic and Graphical Representation of data using Excel
Microsoft Excel is a spreadsheet application that is commonly used for a variety of purposes.
At its core, an Excel worksheet is a table consisting of rows and columns in which data is
displayed. Features include calculation, graphing tools, pivot tables, and a macro
programming language called Visual Basic for Applications.

Every worksheet is made up of thousands of rectangles, which are called cells. A cell is
the intersection of a row and a column—in other words, where a row and column meet.

Columns are identified by letters (A, B, C), while rows are identified by numbers (1, 2, 3).
Each cell has its own name, or cell address, based on its column and row. In the
example below, the selected cell intersects column C and row 5, so the cell address is C5.

Note that the cell address also appears in the Name box in the top-left corner, and that a
cell's column and row headings are highlighted when the cell is selected.

You can also select multiple cells at the same time. A group of cells is known as a cell
range. Rather than a single cell address, you will refer to a cell range using the cell addresses
of the first and last cells in the cell range, separated by a colon. For example, a cell range
that included cells A1, A2, A3, A4, and A5 would be written as A1:A5. Take a look at the
different cell ranges below:

• Cell range A1:A8

• Cell range A1:F8
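
To make the range notation concrete, the sketch below expands a reference such as A1:A5 into its individual cell addresses. It is only an illustration in Python, not something Excel itself provides; the function names are hypothetical and only simple references of the form A1:F8 are handled.

```python
import re


def column_to_index(column):
    """Convert a column label (A, B, ..., Z, AA, ...) to a 1-based index."""
    index = 0
    for letter in column:
        index = index * 26 + (ord(letter) - ord("A") + 1)
    return index


def index_to_column(index):
    """Convert a 1-based column index back to its letter label."""
    letters = ""
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def expand_range(reference):
    """List every cell address in a range written as first:last."""
    match = re.fullmatch(r"([A-Z]+)(\d+):([A-Z]+)(\d+)", reference)
    if match is None:
        raise ValueError(f"not a simple cell range: {reference!r}")
    first_col, first_row, last_col, last_row = match.groups()
    columns = range(column_to_index(first_col), column_to_index(last_col) + 1)
    rows = range(int(first_row), int(last_row) + 1)
    return [f"{index_to_column(c)}{r}" for r in rows for c in columns]


print(expand_range("A1:A5"))
print(len(expand_range("A1:F8")))
```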

If the columns in your spreadsheet are labelled with numbers instead of letters, you'll need to
change the default reference style for Excel.
To select a cell:

To input or edit cell content, you'll first need to select the cell.

1. Click a cell to select it. In our example, we'll select cell D9.
2. A border will appear around the selected cell, and the column
heading and row heading will be highlighted. The cell will remain
selected until you click another cell in the worksheet.

You can also select cells using the arrow keys on your keyboard.

To select a cell range:


Sometimes you may want to select a larger group of cells, or a cell range.
1. Click and drag the mouse until all of the adjoining cells you want to
select are highlighted. In our example, we'll select the cell
range B5:C18.
2. Release the mouse to select the desired cell range. The cells will
remain selected until you click another cell in the worksheet.

Cell content
Any information you enter into a spreadsheet will be stored in a cell. Each cell can contain
different types of content, including text, formatting, formulas, and functions.
• Text: Cells can contain text, such as letters, numbers, and dates.
• Formatting attributes: Cells can contain formatting attributes that
change the way letters, numbers, and dates are displayed. For
example, percentages can appear as 0.15 or 15%. You can even
change a cell's text or background colour.
• Formulas and functions: Cells can contain formulas and functions that calculate cell
values. In our example, SUM(C5:C18) adds the value of each cell in the cell range C5:C18
and displays the total in cell C19.

To insert content:

1. Click a cell to select it. In our example, we'll select cell G9.

2. Type something into the selected cell, then press Enter on your
keyboard. The content will appear in the cell and the formula bar.
You can also input and edit cell content in the formula bar.

To delete (or clear) cell content:
1. Select the cell(s) with content you want to delete. In our example,
we'll select the cell range A9:C9.

2. Select the Clear command on the Home tab, then click Clear
Contents.

3. The cell contents will be deleted.

You can also use the Delete key on your keyboard to delete content from multiple cells at
once. The Backspace key will only delete content from one cell at a time.
To delete cells:
There is an important difference between deleting the content of a cell
and deleting the cell itself. If you delete the entire cell, the cells below it will shift up to
fill in the gaps and replace the deleted cells.

1. Select the cell(s) you want to delete. In our example, we'll
select A9:C9.

2. Select the Delete command from the Home tab on the Ribbon.

3. The cells below will shift up and fill in the gaps.

To copy and paste cell content:


Excel allows you to copy content that is already entered into your spreadsheet and paste that
content to other cells, which can save you time and effort.
1. Select the cell(s) you want to copy. In our example, we'll select C9.

2. Click the Copy command on the Home tab, or press Ctrl+C on your
keyboard.

3. Select the cell(s) where you want to paste the content. In our example,
we'll select F9:F17. The copied cell(s) will have a dashed box around
them.

4. Click the Paste command on the Home tab, or press Ctrl+V on your
keyboard.

5. The content will be pasted into the selected cells.

To access more paste options:

You can also access additional paste options, which are especially convenient when
working with cells that contain formulas or formatting. Just click the drop-down arrow on
the Paste command to see these options.

Instead of choosing commands from the Ribbon, you can access commands quickly by right-
clicking. Simply select the cell(s) you want to format, then right-click the mouse. A drop-
down menu will appear, where you'll find several commands that are also located on the
Ribbon.

To cut and paste cell content:


Unlike copying and pasting, which duplicates cell content, cutting allows you
to move content between cells.
1. Select the cell(s) you want to cut. In our example, we'll select G5:G6.
2. Right-click the mouse and select the Cut command. Alternatively, you
can use the command on the Home tab, or press Ctrl+X on your
keyboard.

3. Select the cells where you want to paste the content. In our example, we'll
select F10:F11. The cut cells will now have a dashed box around them.
4. Right-click the mouse and select the Paste command. Alternatively, you can
use the command on the Home tab, or press Ctrl+V on your keyboard.

5. The cut content will be removed from the original cells
and pasted into the selected cells.

To drag and drop cells:

Instead of cutting, copying, and pasting, you can drag and drop cells to move their contents.

1. Select the cell(s) you want to move. In our example, we'll
select I4:I12.
2. Hover the mouse over the border of the selected cell(s) until the
mouse changes to a pointer with four arrows.

3. Click and drag the cells to the desired location. In our example, we'll
move them to H4:H12.

4. Release the mouse. The cells will be dropped in the selected location.

A common problem that occurs as a database grows in size is that many duplicate rows appear
in it. Even if a huge database contains just a handful of identical records, those few
duplicates can cause a whole lot of problems, for example mailing multiple copies of the same
document to the same person, or counting the same numbers more than once in a summary
report. So, before using a database, it makes sense to check it for duplicate entries, to make
sure you are not wasting time on repeated effort.

Excel's Remove Duplicates tool allows you to find and remove absolute duplicates (cells or
entire rows) as well as partially matching records (rows that have identical values in a
specified column or columns). To use it, follow the steps below.

Note. Because the Remove Duplicates tool permanently deletes identical records, it's a good
idea to make a copy of the original data before removing duplicate rows.
1. To begin with, select the range in which you want to delete dupes. To select the
entire table, press Ctrl + A.
2. With the range selected, go to the Data tab > Data Tools group, and click
the Remove Duplicates button.

3. The Remove Duplicates dialog box will open. Select the columns to check for
duplicates, and click OK.
o To delete duplicate rows that have completely equal values in all
columns, leave the check marks next to all columns, like in the screenshot
below.
o To remove partial duplicates based on one or more key columns, select
only those relevant columns. If your table has many columns, the fastest
way is to click the Unselect All button, and then select the columns you
want to check for dupes.
o If your table does not have headers, clear the My data has headers box in
the upper-right corner of the dialog window, which is usually selected by
default.

Done! All duplicate rows in the selected range are deleted, and a message is displayed
indicating how many duplicate entries have been removed and how many unique values
remain.
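
The difference between absolute and partial duplicates can be sketched outside Excel as well. The Python function below is a hypothetical illustration, not how Excel implements the tool; it keeps the first occurrence of each row and compares either all columns or only the chosen key columns. The sample rows are made up.

```python
def remove_duplicates(rows, key_columns=None):
    """Keep the first occurrence of each row.

    With key_columns=None every column is compared (absolute duplicates);
    otherwise only the listed column positions are compared (partial duplicates).
    """
    seen = set()
    unique_rows = []
    for row in rows:
        if key_columns is None:
            key = tuple(row)
        else:
            key = tuple(row[i] for i in key_columns)
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows


# Hypothetical table: name, town, amount.
rows = [
    ("Asha", "Haveri", 120),
    ("Ravi", "Hirekerur", 95),
    ("Asha", "Haveri", 120),  # identical in every column
    ("Asha", "Haveri", 80),   # identical only in name and town
]

print(remove_duplicates(rows))                      # all columns compared
print(remove_duplicates(rows, key_columns=[0, 1]))  # name and town only
```

Because the function returns a new list and leaves `rows` untouched, the original data is preserved, in the same spirit as the note about making a copy before removing duplicates.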

Get rid of duplicates by copying unique records to another location
Another way to get rid of duplicates in Excel is separating unique values, and copying them
to another sheet or a different workbook. The detailed steps follow below.
1. Select the range or the entire table that you want to dedupe.
2. Navigate to the Data tab > Sort & Filter group, and click the Advanced button.

3. In the Advanced Filter dialog window, do the following:


o Select the Copy to another location radio button.
o Verify that the correct range appears in the List range box. This should be
the range you selected in step 1.
o In the Copy to box, enter the range where you wish to copy the unique
values (it is sufficient to select the upper-left cell of the destination
range).
o Select the Unique records only check box.

4. Finally, click OK, and the unique values will be copied to the new location.

How to get rid of duplicates in Excel

Assuming you have the Ablebits Ultimate Suite add-in installed in Excel, perform these simple
steps to eliminate duplicate rows or cells:

1. Select any cell in the table that you want to dedupe, and click the Dedupe
Table button on the Ablebits Data tab. Your entire table will get selected
automatically.

2. The Dedupe Table dialog window will open, and all the columns will be selected
by default. Pick Delete duplicates from the Select the action drop-down list
and click OK. Done!

As a result, all duplicate rows except the 1st occurrences are deleted.

Tip. If you want to remove duplicate rows based on values in a key column, leave only
that column(s) selected, and uncheck all other irrelevant columns.

And if you want to perform some other action, say, highlight duplicate rows without
deleting them, or copy duplicate values to another location, select the corresponding option
from the drop-down list:

If you want more options, such as deleting duplicate rows including first occurrences or
finding unique values, then use the Duplicate Remover wizard that provides all these features.
Below you will find full details and a step-by-step example.

How to find and delete duplicate values with or without 1st occurrences

Removing duplicates in Excel is a common operation. However, in each particular case, there
can be a number of specificities. While the Dedupe Table tool focuses on speed, the Duplicate
Remover offers a number of additional options to dedupe your Excel sheets exactly the way
you want.

1. Select any cell within the table where you want to delete duplicates, switch to
the Ablebits Data tab, and click the Duplicate Remover button.

The Duplicate Remover wizard will run and the entire table will get selected. The add-in will
also suggest creating a backup copy, and because you are going to permanently delete
duplicates, it is strongly advised that you check this box. Verify that the table has been
selected correctly and click Next.

2. Select what records you want to find and remove. The following options are
available to you:
o Duplicates except 1st occurrences
o Duplicates including 1st occurrences
o Unique values
o Unique values and 1st duplicate occurrences
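
The four choices can be read as four different filters on the same list. The Python sketch below shows one plausible reading of each option; it is an interpretation made for illustration, with hypothetical option names, and is not the add-in's actual code.

```python
from collections import Counter


def select_records(rows, option):
    """Return the rows matched by one of the four (assumed) option meanings."""
    counts = Counter(rows)
    seen = set()
    selected = []
    for row in rows:
        is_first = row not in seen       # first time this value appears
        seen.add(row)
        is_repeated = counts[row] > 1    # value occurs more than once overall
        if option == "duplicates_except_first":
            keep = is_repeated and not is_first
        elif option == "duplicates_including_first":
            keep = is_repeated
        elif option == "unique":
            keep = not is_repeated
        elif option == "unique_and_first":
            keep = is_first
        else:
            raise ValueError(f"unknown option: {option!r}")
        if keep:
            selected.append(row)
    return selected


rows = ["a", "b", "a", "c", "b", "a"]
for option in ("duplicates_except_first", "duplicates_including_first",
               "unique", "unique_and_first"):
    print(option, select_records(rows, option))
```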

In this example, let's delete duplicate rows including 1st occurrences:

3. And now, select the columns to search for duplicates. Because our aim is to
eliminate duplicate rows, be sure to select all the columns (which is usually done
by default).

4. Finally, select the action you want to perform on dupes and click
the Finish button. In this example, we choose the Delete duplicate
values option.

That's it! The Duplicate Remover add-in swiftly does its job and notifies you how many
duplicate rows have been found and deleted:

That's how you can wipe duplicates off Excel Data.

Frequency Distribution
A frequency distribution can be presented graphically in several ways. Some
commonly used graphs are:

• Frequency polygon: A frequency polygon is a graph constructed by using lines to
join the midpoints of each interval, or bin.
• Frequency curve: A smooth freehand curve drawn through the points of the frequency
polygon is known as a frequency curve.
• Histogram: A frequency distribution shows how often each different value in a set of
data occurs. A histogram is the most commonly used graph to show frequency
distributions. It looks very much like a bar chart, but there are important differences
between them.

Frequency Distribution table using Excel functions


A frequency distribution in Excel gives an impression of how the data is spread out.
This can be done with a histogram, which gives a clear view of how the data is
distributed. To create a frequency distribution this way, Excel needs the Analysis ToolPak,
which can be activated from the Add-ins option (available, for example, in the Developer
tab). Once it is activated, select Histogram from Data Analysis and then select the data to
be plotted.
Frequency Formula in Excel
The syntax of the FREQUENCY function in Excel is:
=FREQUENCY(data_array, bins_array)

The function has two arguments:
• Data array: The array of values whose frequencies are to be counted. If the data
array contains no values, FREQUENCY returns an array of zeros.
• Bins array: The array of boundary values used to group the values in the data array.
If the bins array contains no values, FREQUENCY returns the number of
elements in the data array.

Example #1
In this example, we find the frequency distribution of the scores in a student
database.
Consider a list of student scores.
To calculate the frequencies, we first have to set up groups (bins) for the students' marks.

Now, using the FREQUENCY function, we group the data by following the steps below.
• Create a new column named Frequency.
• Apply the FREQUENCY formula to the G column by selecting G3 to G13.
• We need to select the entire frequency column; only then will the FREQUENCY function
work properly. Otherwise we will get an error value.
• As shown in the screenshot, we have selected the data column as the data array and the
student marks bins as the bins array, =FREQUENCY(C3:C22,G3:G13), and then pressed
CTRL+SHIFT+ENTER.
• The values are then filled in for the whole column.
• Once we press CTRL+SHIFT+ENTER, Excel encloses the formula in opening and closing
curly braces, as shown below.

Using the Excel frequency distribution, we have now grouped the students' marks by range:
1 student scored 0-10, 5 students scored 10-20, 1 student scored 20-30, 3 students scored
30-40, and so on, as shown below.

Find the frequency distribution table using Data analysis pack


First enter the given data and decide the bin values. Next, go to the Data tab and click
Data Analysis.

Then select Histogram.

Click OK; the Histogram dialog box appears.

Enter the Input Range, Bin Range and Output Range.

Click OK; the frequency distribution table appears along with the histogram.

How to Calculate Relative Frequency in Excel

A frequency table is a table that displays information about frequencies. Frequencies simply
tell us how many times a certain event has occurred.

For example, the following table shows how many items a shop sold in different price ranges
in a given week:

Item Price    Frequency
1-10          20
11-20         21
21-30         13
31-40         8
41-50         4

The first column displays the price class and the second column displays the frequency of
that class.

It’s also possible to calculate the relative frequency for each class, which is simply the
frequency of each class as a proportion of the whole.

Item Price    Frequency    Relative frequency
1-10          20           0.303
11-20         21           0.318
21-30         13           0.197
31-40         8            0.121
41-50         4            0.061
Total         66           1

In total, there were 66 items sold. Thus, we found the relative frequency of each class by taking
the frequency of each class and dividing by the total items sold.

For example, there were 20 items sold in the price range of 1 – 10. Thus, the relative frequency
of the class 1 – 10 is 20 / 66 = 0.303.

Next, there were 21 items sold in the price range of 11 – 20. Thus, the relative frequency of the
class 11 – 20 is 21 / 66 = 0.318. The remaining relative frequencies are 13/66 = 0.197,
8/66 = 0.121 and 4/66 = 0.061.
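
The same arithmetic can be checked with a few lines of Python. This is only an illustrative sketch using the price classes and frequencies from the table above; the variable names are arbitrary.

```python
# Frequencies of items sold per price class (from the table above)
frequencies = {"1-10": 20, "11-20": 21, "21-30": 13, "31-40": 8, "41-50": 4}

total = sum(frequencies.values())  # total number of items sold
relative = {price: count / total for price, count in frequencies.items()}

for price, value in relative.items():
    print(f"{price:>6}  {value:.3f}")

# The relative frequencies should add up to 1 (allowing for rounding error)
print("sum =", round(sum(relative.values()), 6))
```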

The following example illustrates how to find relative frequencies in Excel.

Example: Relative Frequencies in Excel

First, we will enter the class and the frequency in columns A and B:

Next, we will calculate the relative frequency of each class in column C. Column D shows
the formulas we used:

We can verify that our calculations are correct by making sure the sum of the relative
frequencies adds up to 1:

We can also create a relative frequency histogram to visualize the relative frequencies.
Simply highlight the relative frequencies:

Then go to the Charts group in the Insert tab and click the first chart type in Insert Column
or Bar Chart:

A relative frequency bar graph will automatically appear:

Modify the x-axis labels by right-clicking on the chart and clicking Select Data.
Under Horizontal (Category) Axis Labels click Edit and type in the cell range that contains
the item prices. Click OK and the new axis labels will automatically appear:

The same procedure used for the bar graph also draws a pie chart or a line graph: select
the data, go to the Charts group in the Insert tab, select the required chart type (pie or
line) and click OK. The required graph appears for the selected data.

Pie Graph: The pie chart, also known as a circle chart, divides a circular statistical
graphic into sectors or slices in order to illustrate numerical proportions. Each sector denotes
a proportionate part of the whole. A pie chart works best for showing the composition of
something. In many cases, pie charts can replace other graphs such as bar graphs, line plots
and histograms.
Formula: The pie chart is an important type of data representation. It is divided into
sectors, each of which forms a certain portion (percentage) of the total. The total of all
the data corresponds to 360°.
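
Since the whole data set corresponds to 360°, the angle of each sector is its share of the total multiplied by 360. A small sketch of that calculation follows; the function name `sector_angles` is hypothetical, and the values are simply the item frequencies from the relative frequency example, reused for illustration.

```python
def sector_angles(values):
    """Return the central angle (in degrees) of each pie sector."""
    total = sum(values)
    return [value / total * 360 for value in values]

# Item frequencies reused from the relative frequency example
angles = sector_angles([20, 21, 13, 8, 4])

for angle in angles:
    print(f"{angle:.2f} degrees")
```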

Grouped Data:
1. Bar Graph
Definition: The pictorial representation of grouped data in the form of vertical or
horizontal rectangular bars, where the lengths of the bars are proportional to the measure of
data, is known as a bar graph or bar chart.
The bars drawn are of uniform width. The categories of the variable are represented on one
axis, and the measure of the variable is depicted on the other axis. The heights or the lengths
of the bars denote the value of the variable, and these graphs are also used to compare certain
quantities. Frequency distribution tables can be easily represented using bar charts, which
simplify the calculations and understanding of data.

Types of Bar Charts


Bar graphs can be vertical or horizontal. The primary feature of any bar graph is the length
or height of its bars: the longer the bar, the greater the value it represents.
Bar graphs normally show categorical and numeric variables arranged in class intervals. They
consist of an axis and a series of labeled horizontal or vertical bars. The bars represent
frequencies of distinctive values of a variable or, commonly, the distinct values themselves. The
number of values on the x-axis of a bar graph or the y-axis of a column graph is called the
scale.
The types of bar charts are as follows:

1. Vertical bar chart


2. Horizontal bar chart

Even though the graph can be plotted horizontally or vertically, the most commonly used type
is the vertical bar graph. The orientation of the x-axis and y-axis changes depending on
whether the chart is vertical or horizontal. Apart from the vertical and horizontal bar
graphs, there are two other types of bar charts:
• Grouped Bar Graph
• Stacked Bar Graph
Now, let us discuss the four different types of bar graphs.

Vertical Bar Graphs


When the grouped data are represented vertically in a graph or chart with the help of bars,
where the bars denote the measure of data, such graphs are called vertical bar graphs. The data
is represented along the y-axis of the graph, and the height of the bars shows the values.

Horizontal Bar Graphs


When the grouped data are represented horizontally in a chart with the help of bars, where
the bars show the measure of data, such graphs are called horizontal bar graphs. The data is
depicted here along the x-axis of the graph, and the length of the bars denotes the values.

Grouped Bar Graph
The grouped bar graph is also called the clustered bar graph. It is used to represent the
discrete values of more than one object sharing the same category. In this type of bar chart,
the bars for the different objects are placed side by side within each category. In other
words, a grouped bar graph is a type of bar graph in which different sets of data items are
compared. Here, a single color is used to represent each specific series across the set. The
grouped bar graph can be represented using both vertical and horizontal bar charts.
Stacked Bar Graph
Stacked bar graph is also called the composite bar chart, which divides the aggregate into
different parts. In this type of bar graph, each part can be represented using different colors,
which helps to easily identify the different categories. The stacked bar chart requires specific
labeling to show the different parts of the bar. In a stacked bar graph, each bar represents the
whole and each segment represents the different parts of the whole.
Uses of Bar Graphs
Bar graphs are used to compare things between different groups or to track changes over time.
However, when tracking change over time, bar graphs are most suitable when the changes
are large.
Bar charts have a discrete domain of categories and are normally scaled so that all the data
can fit on the graph. When there is no natural order of the categories being compared, bars on
the chart may be arranged in any order. Bar charts arranged from the highest to the lowest
value are called Pareto charts.

How to Make a Frequency Polygon in Excel

A frequency polygon is a type of chart that helps us visualize a distribution of values.


This example explains how to create a frequency polygon in Excel.
Example: Frequency Polygon in Excel
Use the following steps to create a frequency polygon.
Step 1: Enter the data for a frequency table.
Enter the following data for a frequency table that shows the number of students who
received a certain score on an exam:

Step 2: Find the midpoint of each class.


Next, use the =AVERAGE() function in Excel to find the midpoint of each class, which
represents the middle number in each class:
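
The midpoint of a class is simply the average of its lower and upper limits, which is what =AVERAGE() computes for the two limit cells. A short sketch, assuming hypothetical class limits (the actual classes in the worksheet are not reproduced here):

```python
# Hypothetical class limits (lower, upper) for exam scores
classes = [(51, 60), (61, 70), (71, 80), (81, 90), (91, 100)]

# Midpoint of each class = average of its lower and upper limit
midpoints = [(lower + upper) / 2 for lower, upper in classes]

for (lower, upper), midpoint in zip(classes, midpoints):
    print(f"{lower}-{upper}: midpoint {midpoint}")
```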

Step 3: Create the frequency polygon.
Highlight the frequency values in column C:

Then go to the Charts group in the Insert tab and click the first chart type in Insert Line or
Area Chart:

A frequency polygon will automatically appear:

To change the x-axis labels, right click anywhere on the chart and click Select Data. A new
window will pop up. Under Horizontal (Category) Axis Labels click Edit and type in the
cell range that contains the Midpoint values.

Click OK and the new axis labels will automatically appear:

Ex: Marks data collected for 50 students in different subjects

Frequency curve
The steps for a frequency polygon are also used to obtain a frequency curve, except that
instead of selecting a line chart under Chart Tools we select the Scatter chart with
smooth lines. The curve chart is highlighted below. Select the data and go directly to the
chart tools to get the frequency curve.
Ex:

Relative frequency polygon

For a relative frequency polygon, follow the steps for the relative frequency distribution
given above, then calculate the midpoints of the class intervals. Drawing a frequency
polygon with these midpoints and the relative frequencies gives the relative frequency polygon.

Histogram
A histogram is a common data analysis tool in the business world. It’s a column chart that
shows the frequency of the occurrence of a variable in the specified range.
Histogram is a graphical representation, similar to a bar chart in structure, that organizes a
group of data points into user-specified ranges. The histogram condenses a data series into
an easily interpreted visual by taking many data points and grouping them into logical
ranges or bins.
A simple example of a histogram is the distribution of marks scored in a subject. You can
easily create a histogram and see how many students scored less than 35, how many were
between 35-50, how many between 50-60 and so on.

Here are the steps to create a Histogram chart in Excel 2016:


1. Select the entire dataset.
2. Click the Insert tab.

3. In the Charts group, click on the ‘Insert Statistic Chart’ option.

4. In the Histogram group, click on the Histogram chart icon.

The above steps would insert a histogram chart based on your data set (as shown below).

Now you can customize this chart by right-clicking on the horizontal axis and selecting Format
Axis.

This will open a pane on the right with all the relevant axis options.

Here are some of the things you can do to customize this histogram chart:

1. By Category: This option is used when you have text categories. This could
be useful when you have repetitions in categories and you want to know the
sum or count of the categories. For example, if you have sales data for items
such as Printer, Laptop, Mouse, and Scanner, and you want to know the total
sales of each of these items, you can use the By Category option. It isn’t
helpful in our example as all our categories are different (Student 1, Student 2,
Student 3, and so on).
2. Automatic: This option automatically decides what bins to create in the
Histogram. For example, in our chart, it decided that there should be four bins.
You can change this by using the ‘Bin Width/Number of Bins’ options
(covered below).
3. Bin Width: Here you can define how big the bin should be. If I enter 20 here,
it will create bins such as 36-56, 56-76, 76-96, 96-116.

4. Number of Bins: Here you can specify how many bins you want. It will
automatically create a chart with that many bins. For example, if I specify 7
here, it will create a chart as shown below. At a given point, you can either
specify Bin Width or Number of Bins (not both).

5. Overflow Bin: Use this bin if you want all the values above a certain value
clubbed together in the Histogram chart. For example, if I want to know the
number of students that have scored more than 75, I can enter 75 as the
Overflow Bin value. It will show me something as shown below.

6. Underflow Bin: Similar to Overflow Bin, if I want to know the number of
students that have scored less than 40, I can enter 40 as the value and show a
chart as shown below.

Once you have specified all the settings and have the histogram chart you want, you can
further customize it (changing the title, removing gridlines, changing colors, etc.)
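
To see how a bin width turns into bin ranges such as 36-56, 56-76, 76-96 and 96-116, the following sketch steps from the smallest value in fixed increments. The function name `bin_ranges` and the minimum of 36 and maximum of 100 are assumptions for illustration only; Excel's own rules for where the first bin starts and which boundary is inclusive are not reproduced here.

```python
def bin_ranges(minimum, maximum, width):
    """Return (lower, upper) bin ranges of equal width covering the data."""
    ranges = []
    lower = minimum
    while lower < maximum:
        ranges.append((lower, lower + width))
        lower += width
    return ranges

# Assumed lowest mark 36, highest mark 100, bin width 20
for lower, upper in bin_ranges(36, 100, 20):
    print(f"{lower}-{upper}")
```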

Histogram using Data Analysis Toolpak
Once you have the Analysis Toolpak enabled, you can use it to create a histogram in Excel.
Suppose you have a dataset as shown below. It has the marks (out of 100) of 40 students in a
subject.

To create a histogram using this data, we need to create the data intervals in which we want
to find the data frequency. These are called bins.
With the above dataset, the bins would be the marks intervals.
You need to specify these bins separately in an additional column as shown below:

Now that we have all the data in place, let’s see how to create a histogram using this data:
• Click the Data tab.
• In the Analysis group, click on Data Analysis.

• In the ‘Data Analysis’ dialog box, select Histogram from the list.

• Click OK.
• In the Histogram dialog box:
• Select the Input Range (all the marks in our example)
• Select the Bin Range (cells D2:D7)
• Leave the Labels checkbox unchecked (you need to check it if you
included labels in the data selection).
• Specify the Output Range if you want to get the Histogram in the
same worksheet. Else, choose New Worksheet/Workbook option
to get it in a separate worksheet/workbook.
• Select Chart Output.

• Click OK.
This would insert the frequency distribution table and the chart in the specified
location.

Now there are some things you need to know about the histogram created using the Analysis
Toolpak:
• The first bin includes all the values up to and including it. In this case, 35 shows
3 values, indicating that there are three students who scored 35 or less.
• The last specified bin is 90; however, Excel automatically adds another bin
– More. This bin includes any data point which lies above the last
specified bin. In this example, it means that there are 2 students who have
scored more than 90.
• Note that even if I add the last bin as 100, this additional bin would still be
created.
• This creates a static histogram chart. Since Excel creates and pastes the
frequency distribution as values, the chart would not update when you change
the underlying data. To refresh it, you’ll have to create the histogram again.
• The default chart is not always in the best format. You can change the
formatting like any other regular chart.
• Once created, you can not use Control + Z to revert it. You’ll have to
manually delete the table and the chart.
• If you create a histogram without specifying the bins (i.e., you leave the Bin
Range empty), it would still create the histogram. It would automatically
create six equally spaced bins and use them to create the histogram.

Box plot
A boxplot (box plot, or box-and-whisker plot) is a compact but efficient way to represent a
dataset using descriptive statistics. This “little diagram” combines informative, standard
values such as the first and third quartiles (the bottom and top of the box, respectively),
the median (the flat line inside the box) and sometimes the mean (a second flat line inside
the box). The whiskers are often used to represent the minimum and maximum values, but some
use other parameters, such as: one standard deviation above and below the mean of the data;
the lowest and highest values contained in the range defined by the 1st quartile minus 1.5
times the interquartile range and the 3rd quartile plus 1.5 times the interquartile range
(cf. “Tukey plot”); or the 5th and 95th percentiles. Because the whiskers are defined by the
user (and not by convention), it is important, when creating the boxplot, to state what they
represent in the legend of the chart.
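
As an illustration of the Tukey-style whiskers described above, the sketch below finds the lowest and highest data values that lie within 1.5 times the interquartile range of the quartiles. It is a minimal example, not a charting routine: the function name `tukey_whiskers` is hypothetical, and the quartiles are computed with Python's inclusive interpolation method, which is an assumption about how the quartiles should be defined.

```python
from statistics import quantiles

def tukey_whiskers(data):
    """Return the (lower, upper) whisker ends under the 1.5 x IQR rule."""
    q1, _median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers end at the most extreme data values still inside the fences
    inside = [x for x in data if low_fence <= x <= high_fence]
    return min(inside), max(inside)

print(tukey_whiskers([2, 4, 5, 8, 18]))  # 18 lies beyond the upper fence
```

Values outside the fences (here 18) would be drawn as individual outlier points rather than being covered by a whisker.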

Box plot
BINS (i.e. categories that become the “bars” in the graph) are automatically created in Excel
2016 using Scott’s Rule.
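
Scott's rule chooses a bin width from the spread and the size of the data. A rough sketch, assuming the common form h = 3.49 × s × n^(-1/3) with s taken as the sample standard deviation; the function name is hypothetical, and how Excel rounds the width or anchors the first bin is not reproduced here.

```python
from statistics import stdev

def scott_bin_width(data):
    """Bin width from Scott's normal reference rule: 3.49 * s * n^(-1/3)."""
    n = len(data)
    return 3.49 * stdev(data) * n ** (-1 / 3)

width = scott_bin_width([2, 4, 5, 8, 18])  # small illustrative data set
print(f"suggested bin width: {width:.2f}")
```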
Step 1: Enter your data into a single column.

Step 2: Highlight the data you entered in Step 1. To do this, click and hold on the first cell
and then drag the mouse down to the end of the data.

Step 3: Click the “Insert” tab, click the Insert Statistic Chart icon (a blue icon with three
vertical bars) and then click a histogram icon.

Click OK.

Example #1 – Box Plot in Excel
Suppose we have data as shown below which specifies the number of units we sold of a
product month-wise for years 2017, 2018 and 2019 respectively.

Step 1: Select the data and navigate to Insert option in the Excel ribbon. You will have
several graphical options under the Charts section.

Step 2: Select the Box and Whisker option which specifies the Box and Whisker plot.

Right-click on the chart, select the Format Data Series option then select the Show inner
points option. You can see a Box and Whisker plot as shown below.

Use Excel 2016 or later for box plots; creating one is somewhat more difficult in versions
earlier than Excel 2016.

Example #2 – Box and Whisker Plot in Excel using Excel 2013


In this example, we are going to plot the Box and Whisker plot using the five-number
summary which we have discussed earlier.
Step 1: Compute the minimum, maximum and quartile values. The MIN function gives the
minimum value, MEDIAN provides the median, QUARTILE.INC computes the quartile values,
and MAX gives the maximum value for the given data. See the screenshot below for the
five-number summary statistics.

Step 2: Since we are going to use a stacked column chart and modify it into a box and
whisker plot, we need each statistic as a difference from the statistic before it. Therefore,
we use the differences Q1 – Minimum and Maximum – Q3 as the whiskers, and Q1, Q2 – Q1 and
Q3 – Q2 (the two halves of the interquartile range) as the box. Combined, they form a
box-and-whisker plot.
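
The differences needed for the stacked chart can be worked out from the five-number summary as follows. This is only a sketch of the arithmetic; the function name `box_plot_segments` and the example summary values (2, 4, 5, 8, 18) are hypothetical.

```python
def box_plot_segments(minimum, q1, median, q3, maximum):
    """Return stacked-bar heights for the box and the two whisker lengths."""
    return {
        "box": [q1, median - q1, q3 - median],  # Q1, Q2 - Q1, Q3 - Q2
        "lower_whisker": q1 - minimum,          # Q1 - Minimum
        "upper_whisker": maximum - q3,          # Maximum - Q3
    }

segments = box_plot_segments(minimum=2, q1=4, median=5, q3=8, maximum=18)
print(segments)
```

The three box values are stacked on top of each other, so the top of the stack sits at Q3, while the whisker lengths are supplied as error-bar amounts.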

Step 3: Now, we are about to add the boxes as the first part of this plot. Select the data from
B24:D26 for boxes (remember Q1 – Minimum and Maximum – Q3 are for the Whiskers?)

Step 4: Go to Insert tab on the excel ribbon and navigate to Recommended Charts under the
Charts section.

Step 5: Inside the Insert Chart window > All Charts > navigate to Column and select the
second option, the Stacked Column chart, and click OK.

This is how it looks.

Step 6: Now, we need to add whiskers. I will start with the lower whisker first. Select the
stack chart portion, which represents the Q1 (Blue bar) > Click on Plus Sign > Select Error
Bars > Navigate to More Options… dropdown under Error Bars.

Step 7: As soon as you click on More Options… Format Error Bars menu will appear > Error
Bar Options > Direction : Minus Radio Button (since we are adding the lower whisker) > End
Style : Cap radio button > Error Amount : Custom > Select Specify Value.

This opens a window in which you specify the lower whisker values (Q1 – Minimum,
B23:D23) under Negative Error Value. Click OK.

Step 8: Do the same for the upper whiskers. Select the gray bar (Q3 – Median bar) and,
instead of selecting Minus as the Direction, use Plus.

This opens a window in which you specify the upper whisker values (Maximum – Q3,
B27:D27) under Positive Error Value. Click OK.

The Graph now should look like the screenshot below:

Step 9: Remove the bars associated with Q1 – Minimum. Select the bars > Format Data
Series > Fill & Line > No Fill. This removes the lower part, which is not useful in the
box-and-whisker plot and was only added initially because we needed to plot the stacked
column chart as a first step.

Step 10: Select the Orange bar (Median – Q1) > Format Data Series > Fill & Line > No fill
under Fill section > Solid line under Border section > Color > Black. This will remove the
colors from the bars and represent them just as outline boxes.

Follow the same procedure for the gray bar (Q3 – Median) to remove the colour from it
and show it as an outlined box. The plot should look like the one in the screenshot below:

This is how we can create Box-Whisker Plot under any version of Excel. If you have Excel
2016 and above, you anyway have the direct chart option for the Box-Whisker plot.

Quartile

A quartile is a statistical term that describes a division of observations into four defined
intervals based on the values of the data and how they compare to the entire set of
observations. Each quartile contains 25% of the total observations. Generally, the data is
arranged from smallest to largest:

1. First quartile: the lowest 25% of numbers
2. Second quartile: between 25% and 50% (up to the median)
3. Third quartile: between 50% and 75% (above the median)
4. Fourth quartile: the highest 25% of numbers

Statistic        QUARTILE formula       Equivalent function
Minimum value    QUARTILE(A1:A20,0)     MIN(A1:A20)
1st quartile     QUARTILE(A1:A20,1)
Median           QUARTILE(A1:A20,2)     MEDIAN(A1:A20)
3rd quartile     QUARTILE(A1:A20,3)
Maximum value    QUARTILE(A1:A20,4)     MAX(A1:A20)

Select the data and by using above functions we get the Quartile values.
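
For comparison, the same five values can be computed in Python. This is a sketch only: the function name `five_number_summary` and the sample data are hypothetical, and the "inclusive" method of `statistics.quantiles` is used on the assumption that it interpolates in the same way as Excel's QUARTILE function.

```python
from statistics import quantiles

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) of the data."""
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)

print(five_number_summary([2, 4, 5, 8, 18]))
```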
Ex:

Stem and leaf plot
A stem-and-leaf display (also known as a stemplot) is a diagram designed to allow you to
quickly assess the distribution of a given dataset. Basically, the plot splits two-digit numbers
in half:
Stems – The first digit
Leaves – The second digit
Use the following steps to create a stem-and-leaf plot in Excel.
Step 1: Enter the data.
Enter the data values in a single column:

Step 2: Identify the minimum and maximum values.

Step 3: Manually enter the “stems” based on the minimum and maximum values.

Step 4: Calculate the “leaves” for the first row.


The following calculation shows how to compute the leaves for the first row. Don’t be
intimidated by the length of the formula – it’s actually very simple, just repetitive.

Once you finish typing the formula and click Enter, you will get the following result:

Step 5: Repeat the calculation for each row.


To repeat this calculation for each row, simply click on cell D7, hover over the bottom-right
hand corner of the cell until a tiny + appears, then double-click. This will copy the formula to
the rest of the rows in the Stem-and-Leaf plot:

To double check that your results are correct, you can verify three numbers:

• Make sure the number of individual “leaves” matches the number of
observations. In our example, we have 10 total “leaves” which matches the 10
observations in our original dataset.
• Verify the minimum number. The very first leaf should match the minimum value
in your dataset. In our example, we see that the first leaf is “4” and is paired with a
stem of “1” which matches the minimum number of “14” in our dataset.
• Verify the maximum number. The very last leaf should match the maximum value
in your dataset. In our example, we see that the last leaf is “5” and is paired with a
stem of “3” which matches the maximum number of “35” in our dataset.
Once you verify these three numbers, you can be confident that your Stem-and-Leaf
plot is correct.
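
The splitting of two-digit numbers into stems and leaves can also be sketched in Python. The data below are hypothetical (only the minimum of 14 and maximum of 35 are taken from the example above), and the function name `stem_and_leaf` is an assumption for illustration. Note that a stem with no values simply does not appear in this sketch.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group two-digit values by first digit (stem) and second digit (leaf)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 14 -> stem 1, leaf 4
        plot[stem].append(leaf)
    return dict(plot)

data = [14, 17, 19, 21, 22, 25, 28, 31, 33, 35]  # hypothetical observations

for stem, leaves in stem_and_leaf(data).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```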

Central Tendency

Central tendency refers to a value derived from a set of data that reflects the centre of
the distribution of the data. It is generally described using measures such as the mean,
the median and the mode.

It is a single value that attempts to describe a set of data by identifying the middle or
central position within the given dataset. These measures are sometimes called measures
of central location. The mean (otherwise known as the average) is the most
commonly used measure of central tendency, but there are other measures such as the
median and the mode.

Example #1
Consider the following sample: 33, 55, 66, 56, 77, 63, 87, 45, 33, 82, 67, 56, 77, 62, 56. You
are required to find the measures of central tendency.
Solution:
Below is given data for calculation.

Using the above information, the calculation of mean will be as follows,

Mean = 915/15 = 61
The median is calculated as follows. Enter the median formula =MEDIAN(data array).

Median = 62
Since the number of observations is odd, the middle value of the sorted data, which is at the
8th position, is the median: 62.
The mode is calculated as follows.

Mode = 56
From the above table, we can note that the observation that recurs most often is 56
(3 times in the dataset).
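
These three results can be cross-checked with Python's standard `statistics` module, using the sample from Example #1. This is offered only as a verification sketch alongside the Excel functions.

```python
from statistics import mean, median, mode

# Sample from Example #1
sample = [33, 55, 66, 56, 77, 63, 87, 45, 33, 82, 67, 56, 77, 62, 56]

print("Mean   =", mean(sample))    # sum of values divided by their count
print("Median =", median(sample))  # middle value of the sorted sample
print("Mode   =", mode(sample))    # most frequently occurring value
```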

Standard Deviation in Excel


Standard deviation in Excel helps you understand how much your values deviate from the
average or mean; that is, it tells you whether your data stays close to the average
or fluctuates a lot. If the value is on the higher side, your data has a lot of
fluctuation, and vice versa. To calculate standard deviation in Excel we use the
STDEV function. In the same example, the formula will be =STDEV(B2:B12). Our answer is
around 20, which indicates that the marks of the students fluctuate a lot.

As you can see, the answers for Average and Standard Deviation contain too many decimals;
you can easily get rid of them by using the rounding functions.

Absolute & relative dispersion are two different ways to measure the spread of a data set.
They are used extensively in biological statistics, as biological phenomena almost always
show some variation and spread.
The easiest way to differentiate relative dispersion/absolute dispersion is to check whether
your statistic involves units. Absolute measures always have units, while relative measures
do not.
Absolute Measures of Dispersion
Absolute measures of dispersion include:
• The range,
• The quartile deviation,
• The mean deviation,
• The standard deviation and variance.
Absolute measures of dispersion use the original units of data, and are most useful for
understanding the dispersion within the context of your experiment and measurements.
Relative Measures of Dispersion
Relative measures of dispersion are calculated as ratios or percentages; for example, one
relative measure of dispersion is the ratio of the standard deviation to the mean. Relative
measures of dispersion are always dimensionless, and they are particularly useful for making
comparisons between separate data sets or different experiments that might use different
units. They are sometimes called coefficients of dispersion.
Some Commonly Used Measures of Relative Dispersion / Absolute Dispersion
The simplest measure of absolute dispersion is the range. This is just the upper limit minus the
lower limit; the largest data point minus the smallest. We can write this as R = H – L.

For example, if a data set consisted of the points 2, 4, 5, 8, and 18, the range would be
18 – 2 = 16.
The analogous relative measure of dispersion is the coefficient of range. This is given by
(H – L)/(H + L). For our example data set, it would be the ratio (18 – 2)/(18 + 2), so (16/20)
or 4/5.
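
The range and the coefficient of range for the example data set can be verified with a short sketch. The function names are hypothetical; the formulas R = H – L and (H – L)/(H + L) are the ones given above.

```python
def value_range(data):
    """Range R = H - L (largest value minus smallest value)."""
    return max(data) - min(data)

def coefficient_of_range(data):
    """Coefficient of range = (H - L) / (H + L)."""
    high, low = max(data), min(data)
    return (high - low) / (high + low)

points = [2, 4, 5, 8, 18]
print("Range =", value_range(points))
print("Coefficient of range =", coefficient_of_range(points))
```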
The standard deviation is a more complicated measure of absolute dispersion. You can
calculate it by squaring the difference between each data point and the mean, summing those
squares, dividing by a number that is one less than the number of your data points, and then
taking the square root of the result. Since the values are squared and the square root is
taken at the end, the standard deviation is given in your original units of measure.
The coefficient of standard deviation, the analogous measure of relative dispersion, is just the
standard deviation divided by the arithmetic mean. To give it as a percentage rather than a ratio,
multiply by 100%.
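
As a sketch of these two measures, the following Python lines compute the sample standard deviation (dividing by one less than the number of data points) and the coefficient of standard deviation. The data set 2, 4, 5, 8, 18 is reused from the range example purely for illustration.

```python
from statistics import mean, stdev

data = [2, 4, 5, 8, 18]

sd = stdev(data)      # sample standard deviation (divides by n - 1)
cv = sd / mean(data)  # coefficient of standard deviation (no units)

print(f"standard deviation = {sd:.4f}")
print(f"coefficient = {cv:.4f} ({cv * 100:.2f}%)")
```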

How to Calculate the Mean Absolute Deviation in Excel
To calculate the mean absolute deviation in Excel, we can perform the following steps:
Step 1: Enter the data. For this example, we’ll enter 15 data values in cells A2:A16.

Step 2: Find the mean value. In cell D1, type the following
formula: =AVERAGE(A2:A16). This calculates the mean value for the data values, which
turns out to be 15.8.

Step 3: Calculate the absolute deviations. In cell B2, type the following
formula: =ABS(A2-$D$1). This calculates the absolute deviation of the value in cell A2 from
the mean value in the dataset.

Next, click cell B2. Then, hover over the bottom right corner of the cell until a
black + sign appears. Double click the + sign to fill in the remaining values in column B.

Step 4: Calculate the mean absolute deviation.
In cell B17, type the following formula: =AVERAGE(B2:B16). This calculates the mean
absolute deviation for the data values, which turns out to be 6.1866.
Note that you can use these four steps to calculate the mean absolute deviation for any
number of data values. In this example, we used 15 data values but you could use these exact
steps to calculate the mean absolute deviation for 5 data values or 5,000 data values.
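
The four steps can be condensed into a few lines of Python. This is an illustrative sketch with a hypothetical five-value data set, not the 15 values used in the worksheet above; the function name `mean_absolute_deviation` is also an assumption.

```python
from statistics import mean

def mean_absolute_deviation(data):
    """Average of the absolute deviations of each value from the mean."""
    center = mean(data)                         # like =AVERAGE(A2:A16)
    deviations = [abs(x - center) for x in data]  # like =ABS(A2-$D$1) per row
    return mean(deviations)                     # like =AVERAGE(B2:B16)

values = [2, 4, 5, 8, 18]  # hypothetical data
print(f"MAD = {mean_absolute_deviation(values):.2f}")
```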

Quartile Deviation
Quartile deviation is based on the difference between the first quartile and the third
quartile in the frequency distribution. This difference is known as the interquartile
range; the difference divided by two is known as the quartile deviation or
semi-interquartile range.
Use the following data for the calculation of quartile deviation.

Calculation of Q1 can be done as follows,

Q1=148.75
Calculation of Q3 can be done as follows,

Q3= 179.75

Calculation of quartile deviation can be done as follows,

Using the quartile deviation formula, we have

QD = (179.75 - 148.75) / 2 = 15.50
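
The final step is a one-line calculation, sketched here with the Q1 and Q3 values from this example; the function name `quartile_deviation` is hypothetical.

```python
def quartile_deviation(q1, q3):
    """Quartile deviation (semi-interquartile range) = (Q3 - Q1) / 2."""
    return (q3 - q1) / 2

qd = quartile_deviation(q1=148.75, q3=179.75)
print("QD =", qd)
```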

Finding the Mean, Median and Mode by an Alternate Method


Enter the scores in one of the columns on the Excel spreadsheet (see the example below).
After the data have been entered, place the cursor where you wish to have the mean (average)
appear and click the mouse button. Select Insert Function (fx) from the FORMULAS tab. A
dialog box will appear. Select AVERAGE from the Statistical category and click OK. (Note:
If you want the Median, select MEDIAN. If you want the Mode, select MODE.SNGL. Excel
only provides one mode. If a data set had more than one mode, Excel would only display one
of them.)

Enter the cell range for your list of numbers in the Number 1 box. For example, if your data
were in column A from row 1 to 13, you would enter A1:A13. Instead of typing the range,
you can also move the cursor to the beginning of the set of scores you wish to use and click
and drag the cursor across them. Once you have entered the range for your list, click
on OK at the bottom of the dialog box. The mean (average) for the list will appear in the cell
you selected.

Finding the Standard Deviation
Place the cursor where you wish to have the standard deviation appear and click the mouse
button. Select Insert Function (fx) from the FORMULAS tab. A dialog box will
appear. Select STDEV.S (for a sample) from the Statistical category. (Note: If your data
are from a population, click on STDEV.P.) After you have made your selections, click
on OK at the bottom of the dialog box.

Enter the cell range for your list of numbers in the Number 1 box. For example, if your data
were in column A from row 1 to 13, you would enter A1:A13. Instead of typing the range,
you can also move the cursor to the beginning of the set of scores you wish to use and click
and drag the cursor across them. Once you have entered the range for your list, click
on OK at the bottom of the dialog box. The standard deviation for the list will appear in the
cell you selected.
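The difference between the sample and population versions can be seen in a small Python sketch. The data in `scores` are hypothetical. `statistics.stdev` divides by n - 1 (as STDEV.S does) and `statistics.pstdev` divides by n (as STDEV.P does).

```python
import statistics

# Hypothetical scores; the sum of squared deviations from the mean (5) is 32.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

sample_sd = statistics.stdev(scores)       # divides by n - 1, like STDEV.S
population_sd = statistics.pstdev(scores)  # divides by n, like STDEV.P

print(round(sample_sd, 4))  # 2.1381
print(population_sd)        # 2.0
```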

VARIANCE
Variance is one of the most useful tools in probability theory and statistics. It
measures how far the numbers in a data set lie from the mean: it is the average of the
squared deviations from the mean. In practice, it often shows
how much something changes. For example, temperature near the equator has less variance
than in other climate zones. In this section, we will see how to calculate
variance in Excel.

How to find Variance in Excel?

This is complete data. When we capture complete data (the entire population), we calculate the
population variance. The Excel function for calculating the population variance is VAR.P.
The syntax of VAR.P is
=VAR.P(number1, [number2], ...)
number1, number2, ...: these are the numbers whose variance you want to calculate.
Only number1 is compulsory. (For data that are only a sample of the population, Excel
provides VAR.S instead.)
Let's use this formula to calculate the variance of our data. We have data in cells C2:C15. So
the formula will be:
=VAR.P(C2:C15)
This returns a value of 186.4285714, which is quite a large variance given our data.
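As a rough cross-check of what VAR.P computes, the sketch below works out the population variance by hand (the mean of the squared deviations) and compares it with Python's `statistics.pvariance`. The `data` list is hypothetical and is not the C2:C15 data of the example; `population_variance` is an illustrative helper name.

```python
import statistics


def population_variance(values):
    """Mean of the squared deviations from the mean (what VAR.P reports)."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


# Hypothetical data: mean 14, squared deviations 16, 4, 0, 4, 16.
data = [10, 12, 14, 16, 18]

print(population_variance(data))    # 8.0
print(statistics.pvariance(data))   # same result from the standard library
print(statistics.variance(data))    # sample variance (divides by n - 1)
```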

SKEWNESS and KURTOSIS
If one tail of a distribution is longer than the other, the distribution is skewed. Such distributions are sometimes
called asymmetric or asymmetrical distributions, as they do not show any kind of symmetry.
Symmetry means that one half of the distribution is a mirror image of the other half. For
example, the normal distribution is a symmetric distribution with no skew: its two tails are exactly
the same.
Skewness is the degree of distortion from the symmetrical bell curve, or normal distribution. It
measures the lack of symmetry in a data distribution
by comparing the extreme values in one tail with those in the other. A symmetrical distribution
has a skewness of 0.
There are two types of Skewness: Positive and Negative

Positive skewness occurs when the tail on the right side of the distribution is longer or fatter
than the tail on the left side. The mean and median are typically greater than the mode.
Negative skewness occurs when the tail on the left side of the distribution is longer or fatter than
the tail on the right side. The mean and median are typically less than the mode.
So, when is the skewness too much?
The rule of thumb seems to be:
• If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.
• If the skewness is between -1 and -0.5 (negatively skewed) or between 0.5 and
1 (positively skewed), the data are moderately skewed.
• If the skewness is less than -1 (negatively skewed) or greater than 1 (positively
skewed), the data are highly skewed.
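The rule of thumb above can be sketched in Python as follows. The function names and data are illustrative assumptions. The skewness formula used is the adjusted sample skewness, n / ((n - 1)(n - 2)) times the sum of cubed standardized deviations, which is the form Excel's SKEW function is documented to use; the handling of values exactly on a boundary (0.5 or 1) is a choice made here.

```python
import math


def sample_skewness(data):
    """Adjusted sample skewness: n/((n-1)(n-2)) * sum(((x - mean)/s)**3)."""
    n = len(data)
    if n < 3:
        raise ValueError("need at least three values")
    mean = sum(data) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    if s == 0:
        raise ValueError("all values are identical")
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in data)


def describe_skewness(skew):
    """Apply the rule of thumb; boundary values go to the milder class."""
    size = abs(skew)
    if size <= 0.5:
        return "fairly symmetrical"
    label = "moderately skewed" if size <= 1 else "highly skewed"
    direction = "positive" if skew > 0 else "negative"
    return f"{label} ({direction})"


# Hypothetical data with a long right tail.
right_tailed = [1, 1, 2, 2, 3, 10]
print(describe_skewness(sample_skewness(right_tailed)))
```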

Kurtosis
Kurtosis is all about the tails of the distribution, not its peakedness or flatness. It
describes how extreme the values in the tails are, so it is in effect a measure of the
outliers present in the distribution.
High kurtosis in a data set indicates that the data have heavy tails, or outliers. If the
kurtosis is high, we need to investigate why there are so many outliers. There may be several
causes, such as wrong data entry. Investigate!
Low kurtosis in a data set indicates that the data have light tails, or a lack of outliers. If we get
low kurtosis (too good to be true), we also need to investigate and trim the dataset of
unwanted results.

Mesokurtic: This distribution has a kurtosis statistic similar to that of the normal distribution. It
means that the extreme values of the distribution are similar to those of a normal distribution.
Kurtosis is defined so that the standard normal distribution has a kurtosis of
three.
Leptokurtic (Kurtosis > 3): The distribution is longer and its tails are fatter. The peak is higher and sharper
than in a mesokurtic distribution, which means that the data are heavy-tailed or have a profusion of outliers.
Outliers stretch the horizontal axis of the histogram, which makes the bulk of the data
appear in a narrow (“skinny”) vertical range, thereby giving the “skinniness” of a leptokurtic
distribution.
Platykurtic (Kurtosis < 3): The distribution is shorter and its tails are thinner than those of the normal
distribution. The peak is lower and broader than in a mesokurtic distribution, which means that the data are
light-tailed or lack outliers.
This is because the extreme values are less extreme than those of the normal distribution.
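The three categories can be illustrated with a minimal Python sketch. It uses the plain moment definition of kurtosis (fourth central moment divided by the squared second central moment), under which the normal distribution scores 3, as in the text. The function names and data are illustrative assumptions. Note that Excel's KURT function reports excess kurtosis with a sample correction, so its values sit near 0 rather than 3 for normal data.

```python
def kurtosis(data):
    """Moment-based kurtosis m4 / m2**2 (a normal distribution gives 3)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    if m2 == 0:
        raise ValueError("all values are identical")
    return m4 / m2 ** 2


def kurtosis_type(k):
    """Classify a kurtosis value against the normal-distribution value of 3."""
    if k > 3:
        return "leptokurtic"
    if k < 3:
        return "platykurtic"
    return "mesokurtic"


# Hypothetical data: evenly spread values versus values with two outliers.
flat = [1, 2, 3, 4, 5]
heavy_tailed = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]

print(kurtosis_type(kurtosis(flat)))          # platykurtic
print(kurtosis_type(kurtosis(heavy_tailed)))  # leptokurtic
```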

Skewness graph and kurtosis

We can use the Analysis ToolPak add-in to generate descriptive statistics. For example, you
may have the scores of 14 participants for a test.

To generate descriptive statistics for these scores, execute the following steps.
1. On the Data tab, in the Analysis group, click Data Analysis.

Note: If you can't find the Data Analysis button, load the Analysis ToolPak add-in first (File > Options > Add-ins).
2. Select Descriptive Statistics and click OK.

3. Select the range A2:A15 as the Input Range.


4. Select cell C1 as the Output Range.
5. Make sure Summary statistics is checked.

6. Click OK.
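For comparison, a similar summary table can be produced with a short Python sketch. The 14 scores below are hypothetical stand-ins for the range A2:A15, and `summary_statistics` is an illustrative helper. It covers only some of the rows that the Summary statistics option reports (skewness and kurtosis are left out here).

```python
import math
import statistics


def summary_statistics(data):
    """Return a dictionary of basic descriptive statistics for a sample."""
    sd = statistics.stdev(data)  # sample standard deviation
    return {
        "Mean": statistics.mean(data),
        "Standard Error": sd / math.sqrt(len(data)),
        "Median": statistics.median(data),
        "Mode": statistics.mode(data),
        "Standard Deviation": sd,
        "Sample Variance": statistics.variance(data),
        "Range": max(data) - min(data),
        "Minimum": min(data),
        "Maximum": max(data),
        "Sum": sum(data),
        "Count": len(data),
    }


# Hypothetical test scores of 14 participants.
scores = [82, 93, 91, 69, 96, 61, 70, 88, 60, 73, 85, 90, 82, 77]
for name, value in summary_statistics(scores).items():
    print(f"{name:20s}{value}")
```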

Result:

Thank you

Excel Frequency Distribution Using Pivot Table (Extra)
In this example, we will see how to make an Excel frequency distribution, and display it
graphically, from an available sales database.
One of the easiest ways to make a frequency distribution in Excel is to use a pivot table, from
which we can also create a chart.
Consider the sales data below, which lists sales by year. We will now build the frequency
distribution with a pivot table, using the following steps.

• Create a pivot table for the above sales data. To create a pivot table, go
to the Insert menu and select PivotTable.

• Drag the Sales field to Row Labels. Drag the same Sales field to Values.

• Make sure that the value field setting is set to Count, so that we get the
count of sales, as shown below.

• Right-click a sales number under the row labels, then choose the Group option.

• The Grouping dialogue box appears, as shown below:

• Edit the grouping numbers to start at 5000 and end at 18000, group by
1000, and then click OK.

After that, we get the result below, where the sales data have been grouped by 1000:

The sales data are now grouped in steps of 1000 from the minimum to the maximum value.
This can be shown more professionally in graphical format.
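The grouping that the pivot table performs can be sketched in Python as follows. The function name `frequency_table`, the sales figures and the exact label format are illustrative assumptions; the sketch assumes whole-number sales values and simply ignores values outside the 5000 to 18000 range, which may differ from how Excel labels the end groups.

```python
from collections import Counter


def frequency_table(values, start=5000, end=18000, width=1000):
    """Count values per class of the given width, ordered by lower limit."""
    counts = Counter(
        (v - start) // width for v in values if start <= v <= end
    )
    return {
        f"{start + i * width}-{start + (i + 1) * width - 1}": counts[i]
        for i in sorted(counts)
    }


# Hypothetical sales values.
sales = [5200, 5900, 6100, 12000, 12999, 17500]
for group, count in frequency_table(sales).items():
    print(group, count)
```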
• Go to the Insert menu and select the Column chart.

• The output will be as follows:

Excel Frequency Distribution Using Histogram


Using the pivot table, we grouped the sales data. Now we will see how to present
sales data as a histogram using a frequency distribution in Excel.
Consider the sales data below for creating a histogram. It lists each salesperson's name with the
corresponding sales values, where CP stands for Consumer Pack and Tins gives the range
values, i.e. how many tins each salesperson has sold.

The Histogram tool is found under Data Analysis in the Data menu, which is provided by the
Analysis ToolPak add-in. We will see how to apply it by following the steps below.

• Go to the Data menu; Data Analysis is at the top right. Click on Data
Analysis, which is highlighted as shown below.

• The dialogue box below appears. Choose the Histogram option and click OK.

• The Histogram dialogue box appears, as shown below.

• Give the Input Range and Bin Range as shown below.

• Make sure that all the required options are checked, such as Labels, Cumulative
Percentage and Chart Output, and then click OK.

• The chart below shows the output: the frequency along with the cumulative
percentage.
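The frequency and cumulative percentage columns of such an output can be sketched in Python. The function name `histogram_table`, the data and the bin values are illustrative assumptions. The sketch assumes the usual convention of the Histogram tool: a value is counted in the first bin whose upper limit is greater than or equal to it, and values above the last bin go into a final "More" row.

```python
from bisect import bisect_left


def histogram_table(values, bins):
    """Return rows of (bin upper limit, frequency, cumulative percentage)."""
    if not values:
        raise ValueError("values must not be empty")
    limits = sorted(bins)
    counts = [0] * (len(limits) + 1)  # last slot is the "More" row
    for v in values:
        # Index of the first bin limit that is >= v.
        counts[bisect_left(limits, v)] += 1

    rows = []
    running = 0
    for label, count in zip([*limits, "More"], counts):
        running += count
        rows.append((label, count, 100 * running / len(values)))
    return rows


# Hypothetical tins sold per salesperson, with hypothetical bin limits.
tins = [5, 10, 12, 20, 25, 31]
for label, count, cumulative in histogram_table(tins, [10, 20, 30]):
    print(f"{label!s:>5} {count:3d} {cumulative:7.2f}%")
```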

We can display the above histogram more professionally by editing the sales data as follows.
• Right-click on the histogram chart and click on Select Data.

• A dialogue box for changing the ranges appears. Click on Edit.

• Edit the ranges as needed. Set the bin values to exactly the ranges we
want, so that we get the appropriate result, and then click OK.

• The result will be as below.

Things to Remember about Excel Frequency Distribution
• In an Excel frequency distribution, grouping may lose some of the detail in the data,
so make sure that the grouping is done in a proper manner.
• While using an Excel frequency distribution, make sure that the classes are of equal
size, each with an upper limit and a lower limit.
