Chapter IV Data Exploration and Visualization

CHAPTER 4

Learning Module in
IT Inst 3 – DATA SCIENCE ANALYTICS

INTENDED LEARNING OUTCOMES:

At the end of the lesson, the students are expected to:


• Explain the importance of using graphs and charts; and
• Create data visualizations to effectively communicate insights.

LESSON: Data Exploration and Visualization

I. Data Exploration
Exploratory Data Analysis (EDA) is a critical initial step in the data analysis process that
examines data to understand its characteristics, patterns, and relationships.
This involves visually exploring the data, summarizing its main features, and identifying
potential trends, outliers, and anomalies.

Univariate Analysis:
Univariate analysis involves the examination across cases of one variable at a time. There are
three major characteristics of a single variable that we tend to look at:
• the distribution
• the central tendency
• the dispersion
In most situations, we would describe all three of these characteristics for each of the
variables in our study.

The Distribution: The distribution is a summary of the frequency of individual values
or ranges of values for a variable. The simplest distribution would list every value of a
variable and the number of persons who had each value.

For instance, a typical way to describe the distribution of college students is by
year in college, listing the number or percent of students at each of the four years.
Or, we describe gender by listing the number or percent of males and females. In
these cases, the variable has few enough values that we can list each one and
summarize how many sample cases had the value. But what do we do for a variable
like income or GPA? With these variables, there can be a large number of possible
values, with relatively few people having each one. In this case, we group the raw
scores into categories according to ranges of values. For instance, we might look at
GPA according to the letter grade ranges. Or, we might group income into four or five
ranges of income values.
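
The grouping step described above can be sketched in Python. This is a minimal illustration, not a prescribed method: the GPA values and the cut-offs for the letter-grade ranges are assumptions invented for the example, and `gpa_range` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical GPA values, invented purely for illustration.
gpas = [3.9, 3.2, 2.8, 3.5, 1.9, 2.4, 3.7, 3.0, 2.1, 3.3]


def gpa_range(gpa):
    """Map a raw GPA to an assumed letter-grade range (cut-offs are illustrative)."""
    if gpa >= 3.5:
        return "A (3.5 and above)"
    if gpa >= 2.5:
        return "B (2.5 to below 3.5)"
    if gpa >= 1.5:
        return "C (1.5 to below 2.5)"
    return "D or lower (below 1.5)"


# Count how many cases fall into each range instead of listing every raw value.
distribution = Counter(gpa_range(g) for g in gpas)

for label, count in sorted(distribution.items()):
    print(f"{label}: {count}")
```

With many distinct raw values, the grouped counts are far easier to read than a list of every individual GPA.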

Frequency distribution table.

One of the most common ways to describe a single variable is with a frequency
distribution. Depending on the particular variable, all of the data values may be
represented, or you may group the values into categories first. With age, price,
or temperature variables, for example, it would usually not be sensible to determine
the frequency of each individual value. Rather, the values are grouped into ranges
and the frequency of each range is determined. Frequency distributions can be
depicted in two ways, as a table or as a graph. For example, Table 1 shows an age
frequency distribution with three age ranges defined. The same kind of distribution
can be depicted in a graph, as Figure 1 does for civil status.

Sample Percentage Distribution in Table and Graph Format

Table 1. Profile of the Respondents

Profile of the Respondents     Frequency     Percentage
Age
    25 – 39                       21            52.5
    40 – 50                       14            35.0
    51 – 64                        5            12.5
Civil Status
    Single                         6            15.0
    Married                       34            85.0
Sex
    Male                          14            35.0
    Female                        26            65.0
Status of Appointment
    Permanent                     37            92.5
    Job Order                      3             7.5
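
The percentage column of such a table can be reproduced from the frequencies alone. The sketch below uses the age frequencies from the table; the function name `percentage_distribution` is a hypothetical helper introduced only for this illustration.

```python
# Age frequencies taken from the Profile of the Respondents table.
age_frequencies = {"25-39": 21, "40-50": 14, "51-64": 5}


def percentage_distribution(frequencies):
    """Convert raw frequencies to percentages of the total, rounded to one decimal."""
    total = sum(frequencies.values())
    return {
        category: round(100 * count / total, 1)
        for category, count in frequencies.items()
    }


age_percentages = percentage_distribution(age_frequencies)

for category, percent in age_percentages.items():
    print(f"{category}: {age_frequencies[category]} respondents, {percent}%")
```

The same function applies to the other variables in the table (civil status, sex, status of appointment), since each group of frequencies sums to the same 40 respondents.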

Figure 1. Profile of the Respondents based on Civil Status

[Chart: Civil Status. Single 15%, Married 85%]

Frequency distribution using bar chart.


Distributions may also be displayed using percentages. For example, you could
use percentages to describe the:
• percentage of people in different income levels
• percentage of people in different age ranges
• percentage of people in different ranges of standardized test scores
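
As a rough sketch of how a percentage distribution might be charted, the example below draws a text-only bar chart from the civil status frequencies in the respondent profile. The helper `text_bar_chart` and its `width` parameter are assumptions made for this illustration; in practice a plotting library would normally be used instead.

```python
# Civil status frequencies from the respondent profile (Single 6, Married 34).
civil_status = {"Single": 6, "Married": 34}


def text_bar_chart(frequencies, width=40):
    """Return one text line per category, with bar length proportional to its share."""
    total = sum(frequencies.values())
    lines = []
    for category, count in frequencies.items():
        percent = 100 * count / total
        bar = "#" * round(width * count / total)
        lines.append(f"{category:<8} {bar} {percent:.1f}%")
    return lines


chart = text_bar_chart(civil_status)
for line in chart:
    print(line)
```

Each bar's length reflects the category's percentage of the total, which is the same information a bar chart conveys graphically.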

Central Tendency: The central tendency of a distribution is an estimate of the
"center" of a distribution of values. There are three major types of estimates of
central tendency:
• Mean
• Median
• Mode

The Mean or average is probably the most commonly used method of describing
central tendency. To compute the mean, add up all the values and divide
by the number of values. For example, the mean or average quiz score is determined
by summing all the scores and dividing by the number of students taking the quiz.
Consider the following test score values:

15, 20, 21, 20, 36, 15, 25, 15

The sum of these 8 values is 167, so the mean is 167/8 = 20.875.
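
The same arithmetic can be checked with a short Python sketch. It uses the eight scores from the example; `statistics.mean` from the standard library is shown alongside the manual calculation only as a cross-check.

```python
from statistics import mean

# The eight test scores from the example above.
scores = [15, 20, 21, 20, 36, 15, 25, 15]

total = sum(scores)            # add up all the values
average = total / len(scores)  # divide by the number of values

print(total, average)
print(mean(scores))  # the library function gives the same result
```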

The Median is the score found at the exact middle of the set of values. One way
to compute the median is to list all scores in numerical order, and then locate the
score in the center of the sample. For example, if there are 500 scores in the list,
the median falls between scores #250 and #251. If we order the 8 scores shown
above, we would get:

15, 15, 15, 20, 20, 21, 25, 36

There are 8 scores and scores #4 and #5 represent the halfway point. Since both
of these scores are 20, the median is 20. If the two middle scores had different
values, you would have to interpolate (typically by averaging the two) to determine
the median.
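
A minimal Python sketch of this procedure follows. The first list is the set of eight scores from the example; the second list, `other_scores`, is a hypothetical set added only to show the case where the two middle scores differ and are averaged.

```python
from statistics import median

# The eight test scores from the example above.
scores = [15, 20, 21, 20, 36, 15, 25, 15]

ordered = sorted(scores)
middle_pair = ordered[len(ordered) // 2 - 1 : len(ordered) // 2 + 1]  # scores #4 and #5
print(ordered, middle_pair, median(scores))

# Hypothetical scores where the two middle values (20 and 22) differ.
other_scores = [15, 20, 22, 36]
print(median(other_scores))  # average of the two middle scores
```

For an even number of values, `statistics.median` averages the two middle values, which is one common way to interpolate.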

The Mode is the most frequently occurring value in the set of scores. To
determine the mode, you might again order the scores as shown above, and then
count each one. The most frequently occurring value is the mode. In our example,
the value 15 occurs three times and is the mode. In some distributions, there is
more than one modal value. For instance, in a bimodal distribution, there are two
values that occur most frequently.
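
Counting occurrences can be sketched in Python as follows. The first list is the set of eight scores from the example; `bimodal_scores` is a hypothetical list invented to illustrate two modal values. `statistics.multimode` requires Python 3.8 or later.

```python
from collections import Counter
from statistics import multimode

# The eight test scores from the example above.
scores = [15, 20, 21, 20, 36, 15, 25, 15]

counts = Counter(scores)   # how many times each value occurs
modes = multimode(scores)  # list of all most frequent values
print(counts[15], modes)

# Hypothetical scores in which 15 and 20 both occur twice (bimodal).
bimodal_scores = [15, 15, 20, 20, 21]
bimodal_modes = multimode(bimodal_scores)
print(bimodal_modes)
```

Returning a list rather than a single value makes the bimodal case explicit instead of silently picking one of the tied values.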

Notice that for the same set of 8 scores, we got three different values -- 20.875,
20, and 15 -- for the mean, median and mode respectively. If the distribution is truly
normal (i.e., bell-shaped), the mean, median and mode are all equal to each other.
