Notes - Chapter 2 - IT Skills and Data Analysis I
Frequency Distribution Tables: we lose some information and gain other information.
In Table 2-6, we no longer know, for example, that the value 5.5 appears four times or
that the value 5.1 does not appear at all.
On the other hand, we can see from Table 2-6 that average inventory falls most often in
the range from 3.8 to 4.3 days.
A frequency distribution is a table that organizes data into classes, that is, into groups of values
describing one characteristic of the data.
A frequency distribution shows the number of observations from the data set that fall into each of
the classes.
A relative frequency distribution presents frequencies in terms of fractions or percentages.
Classes in any relative or simple frequency distribution are all-inclusive and mutually
exclusive.
Classes of qualitative data
Although Table 2-10 does not list every occupation held by the graduates of Central College, it is
still all-inclusive. Why? The class “other” covers all the observations that fail to fit one of the
enumerated categories. We use a class like this whenever our list does not specifically include
all the possibilities. Such a class is called an open-ended class.
Constructing a frequency distribution:
1. Decide on the type and number of classes for dividing the data.
2. Sort the data points into classes and count the number of points in each class.
3. Illustrate the data in a chart.
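As a sketch, the counting in step 2 can be carried out directly in Python. The inventory values and class limits below are made up for illustration, not the actual Table 2-6 data:

```python
# Made-up inventory data and class limits (not the Table 2-6 values).
data = [3.9, 4.1, 4.4, 3.8, 5.5, 4.0, 4.2, 5.5, 4.3, 4.1]

# All-inclusive, mutually exclusive classes: [lower, upper)
classes = [(3.8, 4.3), (4.3, 4.8), (4.8, 5.3), (5.3, 5.8)]

# Count the data points falling into each class (step 2).
freq = [sum(lo <= x < hi for x in data) for lo, hi in classes]
rel_freq = [f / len(data) for f in freq]

for (lo, hi), f, rf in zip(classes, freq, rel_freq):
    print(f"{lo}-{hi}: frequency {f}, relative frequency {rf:.2f}")
```

Note the half-open intervals: each value lands in exactly one class (mutually exclusive), and the classes together cover the whole range (all-inclusive).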
Histograms
Definition: A numerical variable is discrete if its set of possible values either is finite or else can
be listed in an infinite sequence (one in which there is a first number, a second number,
and so on). A numerical variable is continuous if its possible values consist of an entire
interval on the number line.
Example:
A discrete variable x almost always results from counting, in which case possible values
are 0, 1, 2, 3, . . . or some subset of these integers.
Continuous variables arise from making measurements. For example, if x is the pH of a
chemical substance, then in theory x could be any number between 0 and 14: 7.0, 7.03,
7.032, and so on. Of course, in practice there are limitations on the degree of accuracy of
any measuring instrument, so we may not be able to determine pH, reaction time, height,
and concentration to an arbitrarily large number of decimal places.
Frequency of any particular x value is the number of times that value occurs in the data set.
Relative frequency of a value is the fraction or proportion of times the value occurs:
relative frequency of a value = (number of times the value occurs, i.e., its frequency) / (number of observations in the data set)
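As a quick sketch, the frequency and relative frequency of each distinct value can be computed with the standard library; the data set here is invented for illustration:

```python
from collections import Counter

# Small made-up discrete data set.
data = [2, 3, 3, 5, 2, 2, 4]
n = len(data)

freq = Counter(data)                                    # frequency of each value
rel_freq = {v: count / n for v, count in freq.items()}  # fraction of the data

print(freq[2])      # the value 2 occurs 3 times
print(rel_freq[2])  # 3/7 of all observations
```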
Example: Power companies need information about customer usage to obtain accurate forecasts
of demands. Investigators from Wisconsin Power and Light determined energy consumption
(BTUs) during a particular period for a sample of 90 gas-heated homes. An adjusted
consumption value was calculated as follows:
adjusted consumption = consumption / [(weather, in degree days) × (house area)]
Equal-width classes may not be a sensible choice if there are some regions of the
measurement scale that have a high concentration of data values and other parts where
data is quite sparse.
Constructing a Histogram for Continuous Data: Unequal Class Widths
1. Arrange the data in increasing order.
2. Determine minimum value, maximum value and number of data.
3. Determine classes with unequal widths.
4. Make a table showing the frequency, relative frequency and density for each class.
Density = (relative frequency of the class) / (class width)
5. Mark the class boundaries on a horizontal measurement axis.
6. Above each class interval, draw a rectangle whose height is the corresponding
density.
Example
The area of each rectangle is the relative frequency of the corresponding class.
Furthermore, since the sum of relative frequencies should be 1, the total area of all
rectangles in a density histogram is 1.
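A minimal sketch of the density calculation and the area check, using made-up class boundaries and frequencies:

```python
# Made-up class boundaries (unequal widths) and frequencies.
boundaries = [0, 10, 20, 40, 80]
frequencies = [5, 15, 20, 10]
n = sum(frequencies)

widths = [boundaries[i + 1] - boundaries[i] for i in range(len(frequencies))]
rel_freqs = [f / n for f in frequencies]
densities = [rf / w for rf, w in zip(rel_freqs, widths)]  # rectangle heights

# Each rectangle's area = density x width = relative frequency,
# so the total area of the density histogram is 1.
total_area = sum(d * w for d, w in zip(densities, widths))
print(total_area)  # 1.0 (up to floating-point rounding)
```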
Advantage of the relative frequency histogram: while the absolute numbers may
change (as we test more looms, for example), the relationship among the classes may
remain stable. Twenty percent of all the looms may fall in the class “16.1–16.3 yards”
whether we test 30 or 300 looms. It is easy to compare the data from different sizes of
samples when we use relative frequency histograms.
Histogram Shapes:
Unimodal: A unimodal histogram is one that rises to a single peak and then declines.
Bimodal: A bimodal histogram has two different peaks.
Multimodal: A histogram with more than two peaks is said to be multimodal.
Smoothed:
Example
Frequency Polygons: Although less widely used, frequency polygons are another way to
portray graphically both simple and relative frequency distributions. To construct a frequency
polygon, we mark the frequencies on the vertical axis and the values of the variable we are
measuring on the horizontal axis, as we did with histograms. Next, we plot each class frequency
by drawing a dot above its midpoint, and connect the successive dots with straight lines to form a
polygon.
Converting a frequency polygon to a histogram: A frequency polygon is simply a line graph
that connects the midpoints of all the bars in a histogram. Therefore, we can reproduce the
histogram by drawing vertical lines from the bounds of the classes (as marked on the horizontal
axis) and connecting them with horizontal lines at the heights of the polygon at each midpoint.
Advantages of histograms:
1. The rectangle clearly shows each separate class in the distribution.
2. The area of each rectangle, relative to all the other rectangles, shows the proportion of
the total number of observations that occur in that class.
Advantages of polygons:
1. It sketches an outline of the data pattern more clearly.
2. The polygon becomes increasingly smooth and curvelike as we increase the number of
classes and the number of observations.
Frequency curve: A polygon such as the one we have just described, smoothed by added classes
and data points, is called a frequency curve. In Figure 2-10, we have used our carpet-loom
example, but we have increased the number of observations to 300 and the number of classes to
10. Notice that we have connected the points with curved lines to approximate the way the
polygon would look if we had a very large number of data points and very small class intervals.
Ogives: A graph of a cumulative frequency distribution is called an ogive (pronounced “oh-
jive”).
Branches of Statistics:
Descriptive Statistics: to summarize and describe important features of the data. Some of these
methods are graphical in nature; the construction of histograms, boxplots, and scatter plots are
primary examples. Other descriptive methods involve calculation of numerical summary
measures, such as means, standard deviations, and correlation coefficients.
Inferential Statistics: Techniques for generalizing from a sample to a population are gathered
within the branch of our discipline called inferential statistics. Having obtained a sample from a
population, an investigator would frequently like to use sample information to draw some type of
conclusion (make an inference of some sort) about the population.
Example: To see the contrasting focus of probability and inferential statistics, consider drivers’
use of manual lap belts in cars equipped with automatic shoulder belt systems. (The article
“Automobile Seat Belts: Usage Patterns in Automatic Belt Systems,” Human Factors,
1998: 126–135, summarizes usage data.)
In probability,
Assumption: we might assume that 50% of all drivers of cars equipped in this way in a
certain metropolitan area regularly use their lap belt
Q1) How likely is it that a sample of 100 such drivers will include at least 70 who
regularly use their lap belt?
Q2) How many of the drivers in a sample of size 100 can we expect to regularly use
their lap belt?
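Under the stated assumption p = 0.5, Q1 is a binomial tail probability and Q2 a binomial expectation. A sketch (assuming Python 3.8+ for math.comb):

```python
from math import comb

# Assumption from the notes: p = 0.5 usage rate, sample of n = 100 drivers.
n, p = 100, 0.5

# Q1: P(at least 70 of the 100 sampled drivers are regular users)
p_at_least_70 = sum(comb(n, k) * p**k * (1 - p)**(n - k)
                    for k in range(70, n + 1))

# Q2: expected number of regular users in the sample
expected = n * p

print(p_at_least_70)  # well below 0.0001, i.e. very unlikely
print(expected)       # 50.0
```

The tiny probability in Q1 is exactly why an observed count far above 50 would cast doubt on the 50% assumption, which is the inferential question that follows.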
In inferential statistics,
we have sample information available; for example, a sample of 100 drivers of such cars
revealed that 65 regularly use their lap belt.
Q3) Does this provide substantial evidence for concluding that more than 50% of all
such drivers in this area regularly use their lap belt?
In this latter scenario, we are attempting to use sample information to answer a question about
the structure of the entire population from which the sample was selected.
Measures of Location
A physical interpretation of the sample mean demonstrates how it measures the center of
a sample.
One of our first tasks in statistical inference will be to present methods based on the
sample mean for drawing conclusions about a population mean.
The mean suffers from one deficiency that makes it an inappropriate measure of center
under some circumstances: Its value can be greatly affected by the presence of even a
single outlier (unusually large or small observation).
Example:
Number of employees = 10:
9 employees earn Rs. 50,000 per month, and
1 employee earns Rs. 1,50,000 per month.
Mean salary = (9 × 50,000 + 1,50,000) / 10 = Rs. 60,000
The sample mean salary, Rs. 60,000, certainly does not seem representative of the
data.
In such a situation, it is desirable to employ a measure that is less sensitive to
outlying values than 𝑥̅ .
The Median
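The median, the middle value of the ordered sample, is one such less-sensitive measure. A quick check on the salary example from the notes:

```python
from statistics import mean, median

# The salary example from the notes: nine employees at Rs. 50,000
# and one outlier at Rs. 1,50,000.
salaries = [50_000] * 9 + [150_000]

print(mean(salaries))    # 60000: pulled upward by the single outlier
print(median(salaries))  # 50000: unchanged by the outlier
```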
Motivation for 𝒔𝟐
Sample Variance = 𝑠 2 , Sample Standard Deviation = 𝑠, Sample Mean = 𝑥̅
Population Variance = 𝜎 2 , Population Standard Deviation = 𝜎, Population mean = 𝜇
Sample size = 𝑛, Population size = 𝑁
s² = ∑ᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1),   σ² = ∑ᵢ₌₁ᴺ (xᵢ − μ)² / N
If we actually knew the value of 𝜇, then we could define the sample variance as the average
squared deviation of the sample 𝑥𝑖 ′𝑠 about 𝜇. However, the value of 𝜇 is almost never known, so
the sum of squared deviations about 𝑥̅ must be used. But the 𝑥𝑖 ′𝑠 tend to be closer to their
average x̄ than to the population average μ.
To compensate for this, the divisor (n − 1) is used rather than n. In other words, if we used a
divisor 𝑛 in the sample variance, then the resulting quantity would tend to underestimate 𝜎 2
(produce estimated values that are too small on the average), whereas dividing by the slightly
smaller (𝑛 − 1) corrects this underestimating.
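The underestimation can be illustrated by simulation: draw many small samples from a population with known variance and compare the divide-by-n and divide-by-(n − 1) averages. A sketch with made-up simulated data:

```python
import random
from statistics import pvariance

# Simulation sketch (made-up data): compare the divide-by-n and
# divide-by-(n - 1) variance estimates against a known population.
random.seed(0)
population = [random.gauss(0, 1) for _ in range(20_000)]
sigma2 = pvariance(population)  # the "true" population variance

n, trials = 5, 2000
biased_total = unbiased_total = 0.0
for _ in range(trials):
    sample = random.sample(population, n)
    m = sum(sample) / n
    ss = sum((x - m) ** 2 for x in sample)
    biased_total += ss / n          # divisor n
    unbiased_total += ss / (n - 1)  # divisor n - 1

print(biased_total / trials)    # tends to come out below sigma2
print(unbiased_total / trials)  # much closer to sigma2
print(sigma2)
```

With n = 5, the divide-by-n average settles near (4/5)·σ², matching the claim that dividing by n underestimates σ² on average.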
A Computational Formula for s²:
s² = ∑ᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1) = [∑ᵢ₌₁ⁿ xᵢ² − n(∑ᵢ₌₁ⁿ xᵢ / n)²] / (n − 1)
Proposition: Let x₁, x₂, …, xₙ be a sample with variance s_x² and standard deviation s_x, and let
c be any nonzero constant.
(1) If y₁ = x₁ + c, y₂ = x₂ + c, …, yₙ = xₙ + c, then s_y² = s_x² and s_y = s_x.
(2) If y₁ = cx₁, y₂ = cx₂, …, yₙ = cxₙ, then s_y² = c²s_x² and s_y = |c|s_x.
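A quick numerical check of both parts of the proposition on a small made-up sample:

```python
from statistics import stdev, variance

# Numerical check on a small made-up sample.
x = [2.0, 4.0, 7.0, 11.0]
c = -3.0

shifted = [xi + c for xi in x]  # part (1): y_i = x_i + c
scaled = [c * xi for xi in x]   # part (2): y_i = c * x_i

assert abs(variance(shifted) - variance(x)) < 1e-9        # shifting: variance unchanged
assert abs(variance(scaled) - c**2 * variance(x)) < 1e-9  # scaling: variance times c^2
assert abs(stdev(scaled) - abs(c) * stdev(x)) < 1e-9      # std dev times |c|
print("both parts verified")
```

Intuitively, adding c moves every point and the mean by the same amount, so deviations from the mean (and hence the spread) are unchanged; multiplying by c stretches every deviation by a factor |c|.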