Exploratory Data Analysis
constructed on either side at a distance of 3 times the IQR, where IQR denotes the interquartile
range. Observations that fall between the inner and outer fences are called suspected outliers,
while observations that lie beyond the outer fences are called outliers. These observations are
denoted with an asterisk (*), and the whiskers are drawn only to the extreme values within or on the
inner fences. When we are analysing a set of data, suspected outliers deserve a closer look and
outliers should be looked at very carefully.
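The fence rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the notes place the outer fences at 3 times the IQR, but the inner fences at 1.5 times the IQR, the "inclusive" quartile method, the function name classify_by_fences and the sample values are all assumptions made for this sketch and do not come from the notes.

```python
import statistics

def classify_by_fences(values, inner_k=1.5, outer_k=3.0):
    """Split values into suspected outliers and outliers using fences.

    inner_k = 1.5 is an assumed convention; outer_k = 3 follows the notes.
    """
    # Quartiles via the "inclusive" method (an assumption); other quartile
    # conventions can move the fences slightly.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    inner = (q1 - inner_k * iqr, q3 + inner_k * iqr)
    outer = (q1 - outer_k * iqr, q3 + outer_k * iqr)

    suspected, outliers = [], []
    for v in sorted(values):
        if v < outer[0] or v > outer[1]:
            outliers.append(v)       # beyond the outer fences
        elif v < inner[0] or v > inner[1]:
            suspected.append(v)      # between the inner and outer fences

    # Whiskers reach the extreme values within or on the inner fences.
    inside = [v for v in values if inner[0] <= v <= inner[1]]
    return {
        "inner": inner,
        "outer": outer,
        "suspected": suspected,
        "outliers": outliers,
        "whiskers": (min(inside), max(inside)),
    }

# Hypothetical sample, chosen only to show both kinds of flagged points.
sample = [10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 30, 60]
result = classify_by_fences(sample)
print(result["suspected"], result["outliers"], result["whiskers"])
```

For this made-up sample, 30 falls between the fences and 60 lies beyond the outer fence, so the two values would be marked differently on the plot.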
Use: A box plot gives a quick view of the centre of the data (given by the median, Q2). The spacings
between the different parts of the box plot help to indicate the degree of dispersion and skewness
in the data. The plot also helps to identify outliers.
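As a rough illustration of how the spacings indicate skewness, the sketch below computes the five values a box plot is built from and compares the two halves of the box. The function name five_number_summary, the "inclusive" quartile method and the sample values are assumptions made for this example, not part of the notes.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, Q2, Q3, maximum) for a list of numbers."""
    # "inclusive" is one of several quartile conventions (an assumption here).
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))

# Hypothetical right-skewed sample.
data = [1, 2, 2, 3, 3, 4, 6, 9, 15]
lowest, q1, q2, q3, highest = five_number_summary(data)

lower_half_of_box = q2 - q1   # spacing from Q1 to the median
upper_half_of_box = q3 - q2   # spacing from the median to Q3
print(lowest, q1, q2, q3, highest)
print(lower_half_of_box, upper_half_of_box)
```

Here the upper half of the box is wider than the lower half, which is the pattern one would expect from data with a longer right tail.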
Advantage: Organizing data in a box plot by using five key values is an efficient way of
dealing with large data sets that are too unwieldy for other graphs, such as stem and leaf plots.
Because a box plot is small, it is easy to display and compare several box plots in a
small space. A box plot is a good alternative or complement to a histogram and is usually better
for showing several simultaneous comparisons.
Disadvantage: When a large amount of data is summarized in a box plot, the exact
values and the details of the distribution are not retained.
Stem and Leaf display
A stem and leaf display is a graphical method of displaying data. It is particularly useful when
the data are not too numerous. A stem and leaf display is a "shorthand" notation for representing
numbers. We break each number into two parts. The last digit is called the leaf, and the rest of the
number is called the stem. For example, the number 75 has a stem of 7 and a leaf of 5. The
number 129 has a stem of 12 and a leaf of 9. We then collect all numbers with the same stem and
place them in a row. Let us illustrate the stem and leaf display with an example. Consider
the ages of 48 students in a statistics course:
22 19 30 21 41 24 19 23 27 32 30 20
22 36 24 26 39 20 21 19 19 19 22 30
31 17 18 21 26 21 25 21 22 22 20 40
23 19 21 17 20 33 22 31 19 24 37 22
Here we observe that the lowest age is 17 and the highest age is 41. The stem for 17 is 1, and the
stem for 41 is 4, so we start by making a table with all the possible stems, 1 through 4, and then
write each leaf in the row for its stem. This gives the stem and leaf display (see class notes).
Next we arrange the leaves in each row in increasing order to get an ordered stem and leaf plot
(see class notes). Occasionally, if one or more of the rows holds too much information, we can
divide each row into two equal halves, one for leaves 0 to 4 and one for leaves 5 to 9, to get a
stretched stem and leaf display (refer to class notes).
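The construction described above can be sketched in Python for the 48 ages. This is a minimal illustration, not a reproduction of the displays in the class notes: the function name stem_and_leaf and its ordered and split parameters are hypothetical, and the code assumes non-negative whole numbers so that the last digit is always the leaf.

```python
from collections import defaultdict

ages = [22, 19, 30, 21, 41, 24, 19, 23, 27, 32, 30, 20,
        22, 36, 24, 26, 39, 20, 21, 19, 19, 19, 22, 30,
        31, 17, 18, 21, 26, 21, 25, 21, 22, 22, 20, 40,
        23, 19, 21, 17, 20, 33, 22, 31, 19, 24, 37, 22]

def stem_and_leaf(values, ordered=True, split=False):
    """Return the rows of a stem and leaf display as strings.

    ordered: sort the leaves within each row.
    split:   divide each stem into two rows (leaves 0-4 and leaves 5-9).
    """
    parts = 2 if split else 1
    rows = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)          # last digit is the leaf
        half = 1 if (split and leaf >= 5) else 0
        rows[stem * parts + half].append(leaf)

    lines = []
    # Walk every row between the lowest and highest so gaps stay visible.
    for key in range(min(rows), max(rows) + 1):
        leaves = rows.get(key, [])
        if ordered:
            leaves = sorted(leaves)
        stem = key // parts
        lines.append(f"{stem} | {' '.join(str(x) for x in leaves)}".rstrip())
    return lines

plain_rows = stem_and_leaf(ages, ordered=False)    # leaves in data order
ordered_rows = stem_and_leaf(ages)                 # ordered display
split_rows = stem_and_leaf(ages, split=True)       # each row halved
print("\n".join(ordered_rows))
```

Printing plain_rows or split_rows in the same way gives the unordered display and the display with each row divided into halves.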
Advantages: Here we retain the original values of the variable, which are otherwise lost in a box
plot or histogram. Moreover, if the leaves are carefully aligned vertically, the table has the
same visual effect as a histogram, so that we can identify the shape of the distribution as well.
Sample percentiles can be found easily from a stem and leaf display. A back to back stem and leaf
diagram is used to compare two groups of data.
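A back to back display shares one column of stems, with the leaves of one group running to the left and the leaves of the other running to the right. The sketch below shows one possible layout; the function name back_to_back and both groups of values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def back_to_back(left_values, right_values):
    """Return rows of a back to back stem and leaf display as strings."""
    def group(values):
        rows = defaultdict(list)
        for v in values:
            stem, leaf = divmod(v, 10)      # last digit is the leaf
            rows[stem].append(leaf)
        return rows

    left, right = group(left_values), group(right_values)
    stems = range(min(min(left), min(right)), max(max(left), max(right)) + 1)

    # Left leaves are written in decreasing order so that they increase
    # when read outward from the stem column.
    left_text = {s: " ".join(str(x) for x in sorted(left.get(s, []), reverse=True))
                 for s in stems}
    width = max(len(t) for t in left_text.values())

    lines = []
    for s in stems:
        right_text = " ".join(str(x) for x in sorted(right.get(s, [])))
        lines.append(f"{left_text[s]:>{width}} | {s} | {right_text}".rstrip())
    return lines

# Two hypothetical groups of ages.
group_a = [17, 19, 21, 22, 30]
group_b = [18, 20, 22, 25, 31, 41]
rows = back_to_back(group_a, group_b)
print("\n".join(rows))
```

Because only two sides of the stem column are available, this layout does not extend naturally to three or more groups.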
Disadvantage: This presentation is not suitable when the data are numerous, since we
retain every observation. Moreover, it is not possible to compare three or more groups of data by
using a stem and leaf display.