Notes - Chapter 2 - IT Skills and Data Analysis I


Chapter 2

Grouping and Displaying Data to Convey Meaning: Tables and Graphs


The production manager of the Dalmon Carpet Company is responsible for the output of over
500 carpet looms. So that he does not have to measure the daily output (in yards) of each loom,
he samples the output from 30 looms each day and draws a conclusion as to the
average carpet production of the entire 500 looms. The table below shows the yards
produced by each of the 30 looms in yesterday’s sample. These production amounts are the raw
data from which the production manager can draw conclusions about the entire population of
looms yesterday.
YARDS PRODUCED YESTERDAY BY EACH OF 30 CARPET LOOMS
16.2 15.4 16.0 16.6 15.9 15.8 16.0 16.8 16.9 16.8
15.7 16.4 15.2 15.8 15.9 16.1 15.6 15.9 15.6 16.0
16.4 15.8 15.7 16.2 15.6 15.9 16.3 16.3 16.0 16.3

Data are collections of any number of related observations.


Example: We can collect the number of telephones that several workers install on a given
day or that one worker installs per day over a period of several days, and we can call the
results our data.
Data set: A collection of data is called a data set.
Data point: A single observation is called a data point.
Collecting Data
Statistics deals not only with the organization and analysis of data once it has been collected but
also with the development of techniques for collecting the data. If data is not properly collected,
an investigator may not be able to answer the questions under consideration with a reasonable
degree of confidence. One common problem is that the target population—the one about which
conclusions are to be drawn—may be different from the population actually sampled.
Statisticians select their observations so that all relevant groups are represented in the data.
Example: To determine the potential market for a new product, for example, analysts might
study 100 consumers in a certain geographical area. Analysts must be certain that this group
contains people representing variables such as income level, race, education, and neighborhood.
Data can come from actual observations or from records that are kept for normal purposes.
Use data about the past to make decisions about the future.
Example: We can analyze results of current and past batches to improve result of future
batches.
Simple random sample: one for which any particular subset of the specified size has the same
chance of being selected as any other subset of that size.
Sometimes alternative sampling methods can be used to make the selection process easier, to
obtain extra information, or to increase the degree of confidence in conclusions.
Stratified sampling entails separating the population units into nonoverlapping groups and
taking a sample from each one.
Example: An article in the New York Times (Jan. 27, 1987) reported that heart attack risk
could be reduced by taking aspirin.
This conclusion was based on a designed experiment involving both a control group of
individuals that took a placebo having the appearance of aspirin but known to be inert
and a treatment group that took aspirin according to a specified regimen.
Subjects were randomly assigned to the groups to protect against any biases and so that
probability-based methods could be used to analyze the data.
Of the 11,034 individuals in the control group, 189 subsequently experienced heart
attacks, whereas only 104 of the 11,037 in the aspirin/treatment group had a heart
attack.
The incidence rate of heart attacks in the treatment group was only about half that in the
control group. One possible explanation for this result is chance variation—that aspirin
really doesn’t have the desired effect and the observed difference is just typical variation
in the same way that tossing two identical coins would usually produce different numbers
of heads. However, in this case, inferential methods suggest that chance variation by
itself cannot adequately explain the magnitude of the observed difference.

Tests for data


Managers must be very careful to be sure that the data they are using are based on correct
assumptions and interpretations. Before relying on any interpreted data, from a computer or not,
test the data by asking these questions:
1. Where did the data come from? Is the source biased—that is, is it likely to have an
interest in supplying data points that will lead to one conclusion rather than another?
2. Do the data support or contradict other evidence we have?
3. Is evidence missing that might cause us to come to a different conclusion?
4. How many observations do we have? Do they represent all the groups we wish to study?
5. Is the conclusion logical? Have we made conclusions that the data do not support?
Study your answers to these questions. Are the data worth using? Or should we wait and collect
more information before acting?

The effect of incomplete or biased data


Example. A national association of truck lines claimed in an advertisement that “75
percent of everything you use travels by truck.” This might lead us to believe that cars,
railroads, airplanes, ships, and other forms of transportation carry only 25 percent of what
we use. Reaching such a conclusion is easy but not enlightening. Missing from the
trucking assertion is the question of double counting. What did they do when something
was carried to your city by rail and delivered to your house by truck? How were packages
treated if they went by airmail and then by truck?
Advantages of samples:
 Studying samples is easier than studying the whole population; it costs less and takes less
time.
 Examining an entire population still allows defective items to be accepted; thus,
sampling, in some instances, can raise the quality level.
A representative sample contains the relevant characteristics of the population in the same
proportions as they are included in that population.

Populations, Samples, and Processes


Population: An investigation will typically focus on a well-defined collection of objects
constituting a population of interest. In one study, the population might consist of all gelatin
capsules of a particular type produced during a specified period.
Census: When desired information is available for all objects in the population, we have what is
called a census. Constraints on time, money, and other scarce resources usually make a census
impractical or infeasible. Instead,
Sample: a subset of the population—a sample—is selected in some prescribed manner. Example,
we might select a sample of last year’s engineering graduates to obtain feedback about the
quality of the engineering curricula.
Variable: A variable is any characteristic whose value may change from one object to another in
the population. We shall initially denote variables by lowercase letters from the end of our
alphabet. Examples include
x = brand of calculator owned by a student
y = number of visits to a particular Web site during a specified period
z = braking distance of an automobile under specified conditions
A univariate data set consists of observations on a single variable. For example, we might
determine the type of transmission, automatic (A) or manual (M), on each of ten automobiles
recently purchased at a certain dealership, resulting in the categorical data
MAAAMAAMAA
We have bivariate data when observations are made on each of two variables. Our data set might
consist of a (height, weight) pair for each basketball player on a team, with the first observation as
(72, 168), the second as (75, 212), and so on. If an engineer determines the value of both
x = component lifetime and y = reason for component failure, the resulting data set is bivariate
with one variable numerical and the other categorical.
Multivariate data arises when observations are made on more than one variable (so bivariate is a
special case of multivariate).
Why should we arrange data: The purpose of organizing data is to enable us to see quickly
some of the characteristics of the data we have collected. We look for things such as the range
(the largest and smallest values), apparent patterns, what values the data may tend to group
around, what values appear most often, and so on.
Raw Data: Information before it is arranged and analyzed is called raw data. It is “raw” because
it is unprocessed by statistical methods.
Example: The carpet-loom data in the chapter-opening problem was one example of raw
data.
The data array is one of the simplest ways to present data. It arranges values in ascending or
descending order.
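A data array for the carpet-loom sample can be produced with a one-line sort; a minimal sketch in Python, using the 30 observations from the table above:

```python
# Data array: the 30 loom outputs (yards) from the chapter-opening problem,
# arranged in ascending order so the range and clustering become visible.
raw = [16.2, 15.4, 16.0, 16.6, 15.9, 15.8, 16.0, 16.8, 16.9, 16.8,
       15.7, 16.4, 15.2, 15.8, 15.9, 16.1, 15.6, 15.9, 15.6, 16.0,
       16.4, 15.8, 15.7, 16.2, 15.6, 15.9, 16.3, 16.3, 16.0, 16.3]

data_array = sorted(raw)                       # ascending data array
smallest, largest = data_array[0], data_array[-1]
print(smallest, largest)                       # 15.2 16.9 -> the sample range
```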
A Better Way to Arrange Data: The Frequency Distribution

Frequency Distribution Tables: Lose some information and Gain some other information
 In Table 2-6, we no longer know, for example, that the value 5.5 appears four times or
that the value 5.1 does not appear at all.
 We can see from Table 2-6 that average inventory falls most often in the range from 3.8
to 4.3 days.
A frequency distribution is a table that organizes data into classes, that is, into groups of values
describing one characteristic of the data.
A frequency distribution shows the number of observations from the data set that fall into each of
the classes.
A relative frequency distribution presents frequencies in terms of fractions or percentages.
 Classes in any relative or simple frequency distribution are all-inclusive and mutually
exclusive
Classes of qualitative data

Although Table 2-10 does not list every occupation held by the graduates of Central College, it is
still all-inclusive. Why? The class “other” covers all the observations that fail to fit one of the
enumerated categories. We will use a class like this whenever our list does not specifically
include all the possibilities. This “other” is called an open-ended class.

 “72 and older” is an open-ended class.


CONSTRUCTING A FREQUENCY DISTRIBUTION
1. Decide on the type and number of classes for dividing the data.
 Divide the range by equal classes: The range must be divided by equal classes;
that is, the width of the interval from the beginning of one class to the beginning
of the next class must be the same for every class.
 Problems with unequal classes: If the classes were unequal and the width of the
intervals differed among the classes, then we would have a distribution that is
much more difficult to interpret than one with equal intervals.
 Use 6 to 15 classes: As a rule, statisticians rarely use fewer than 6 or more than
15 classes.
There are no hard-and-fast rules concerning either the number of classes or the choice
of classes themselves. Between 5 and 20 classes will be satisfactory for most data sets.
Generally, the larger the number of observations in a data set, the more classes should
be used.
A reasonable rule of thumb is
number of classes ≈ √(number of observations)
 Determine the width of the class intervals:
width of class intervals ≈ (largest value − smallest value) / (number of classes)
2. Sort the data points into classes and count the number of points in each class.
3. Illustrate the data in a chart.
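The three steps above can be sketched on the carpet-loom sample. Working in integer tenths of a yard is an implementation choice made here to keep class boundaries exact (avoiding floating-point comparisons), not something the text prescribes:

```python
import math

raw = [16.2, 15.4, 16.0, 16.6, 15.9, 15.8, 16.0, 16.8, 16.9, 16.8,
       15.7, 16.4, 15.2, 15.8, 15.9, 16.1, 15.6, 15.9, 15.6, 16.0,
       16.4, 15.8, 15.7, 16.2, 15.6, 15.9, 16.3, 16.3, 16.0, 16.3]

# Step 1: number of classes ~ sqrt(n), rounded up into the 6-to-15 range.
n = len(raw)
k = math.ceil(math.sqrt(n))                    # sqrt(30) -> 6 classes

# Convert to integer tenths so class boundaries are exact.
tenths = [round(v * 10) for v in raw]
lo, hi = min(tenths), max(tenths)              # 152, 169
width = math.ceil((hi - lo + 1) / k)           # 3 tenths = 0.3 yards per class

# Step 2: sort each data point into its class and count.
counts = [0] * k
for v in tenths:
    counts[(v - lo) // width] += 1

# Step 3: print the table (a chart would plot these counts).
for i, c in enumerate(counts):
    a = (lo + i * width) / 10
    b = (lo + (i + 1) * width) / 10
    print(f"{a:.1f}-{b:.1f}: {c}")
```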
Histograms
Definition: A numerical variable is discrete if its set of possible values either is finite or else can
be listed in an infinite sequence (one in which there is a first number, a second number,
and so on). A numerical variable is continuous if its possible values consist of an entire
interval on the number line.
Example:
A discrete variable x almost always results from counting, in which case possible values
are 0, 1, 2, 3, . . . or some subset of these integers.
Continuous variables arise from making measurements. For example, if x is the pH of a
chemical substance, then in theory x could be any number between 0 and 14: 7.0, 7.03,
7.032, and so on. Of course, in practice there are limitations on the degree of accuracy of
any measuring instrument, so we may not be able to determine pH, reaction time, height,
and concentration to an arbitrarily large number of decimal places.
Frequency of any particular x value is the number of times that value occurs in the data set.
Relative frequency of a value is the fraction or proportion of times the value occurs:
relative frequency of a value = (number of times the value occurs, i.e., its frequency) / (number of observations in the data set)

Multiplying a relative frequency by 100 gives a percentage;


In the college-course example, 35% of the students in the sample are taking three courses.
 The relative frequencies, or percentages, are usually of more interest than the
frequencies themselves.
 In theory, the relative frequencies should sum to 1, but in practice the sum may differ
slightly from 1 because of rounding.
 A frequency distribution is a tabulation of the frequencies and/or relative frequencies

Constructing a Histogram for Discrete Data


1. Arrange the data in increasing order.
2. Make a table showing the frequency and relative frequency of each data point (x value).
3. Then mark possible data points (i.e. x values) on a horizontal scale.
4. Above data point (i.e. x value), draw a rectangle whose height is the relative
frequency (or alternatively, the frequency) of that value.
This construction ensures that the area of each rectangle is proportional to the relative frequency
of the value. Thus if the relative frequencies of the values 1 and 5 are 0.35 and 0.07,
respectively, then the area of the rectangle above 1 is five times the area of the rectangle above 5.
Example
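The four construction steps can be sketched as follows. The sample here (numbers of courses taken by ten students) is invented for illustration and is not the data set behind the 35% figure mentioned above:

```python
from collections import Counter

# Steps 1-2: sort the data and tabulate frequency and relative frequency.
data = sorted([3, 1, 2, 3, 4, 3, 2, 5, 3, 1])     # step 1: ascending order

freq = Counter(data)                               # step 2: frequency of each x
n = len(data)
rel_freq = {x: f / n for x, f in freq.items()}     # relative frequencies

# Steps 3-4 would mark each x on a horizontal scale and draw a rectangle
# of height rel_freq[x] above it.
for x in sorted(freq):
    print(x, freq[x], rel_freq[x])
```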
Constructing a histogram for continuous data (measurements) entails subdividing the
measurement axis into a suitable number of class intervals or classes, such that each observation
is contained in exactly one class.
 An observation on a boundary is placed in the interval to the right of the boundary.
Constructing a Histogram for Continuous Data: Equal Class Widths
1. Arrange the data in increasing order.
2. Determine minimum value, maximum value and number of data.
3. Determine classes with equal widths.
4. Make a table showing the frequency and relative frequency for each class.
5. Mark the class boundaries on a horizontal measurement axis.
6. Above each class interval, draw a rectangle whose height is the corresponding
relative frequency (or frequency).
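The counting step can be sketched with half-open classes [a, b), which implement the rule stated above that an observation on a boundary goes into the interval to its right. The sample and boundaries below are illustrative only:

```python
import bisect

# Steps 1-4 for a small illustrative sample with four equal-width classes.
data = sorted([2.0, 2.4, 3.0, 3.0, 3.7, 4.1, 4.9, 5.0, 5.5, 5.9])
boundaries = [2.0, 3.0, 4.0, 5.0, 6.0]         # class edges, width 1.0 each

n = len(data)
counts = [0] * (len(boundaries) - 1)
for v in data:
    # bisect_right puts a boundary value (e.g. 3.0) into the class on its right.
    counts[bisect.bisect_right(boundaries, v) - 1] += 1

rel_freq = [c / n for c in counts]
print(counts, rel_freq)
```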

Example: Power companies need information about customer usage to obtain accurate forecasts
of demands. Investigators from Wisconsin Power and Light determined energy consumption
(BTUs) during a particular period for a sample of 90 gas-heated homes. An adjusted
consumption value was calculated as follows:
adjusted consumption = consumption / [(weather, in degree days) × (house area)]
 Equal-width classes may not be a sensible choice if there are some regions of the
measurement scale that have a high concentration of data values and other parts where
data is quite sparse.
Constructing a Histogram for Continuous Data: Unequal Class Widths
1. Arrange the data in increasing order.
2. Determine minimum value, maximum value and number of data.
3. Determine classes with unequal widths.
4. Make a table showing the frequency, relative frequency and density for each class.
Density = (relative frequency of the class) / (class width)
5. Mark the class boundaries on a horizontal measurement axis.
6. Above each class interval, draw a rectangle whose height is the corresponding
density.
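The density computation can be sketched on an invented sample that is dense at low values, exactly the situation where unequal widths make sense. Note that the density rectangles' total area comes out to 1:

```python
# Density = relative frequency / class width, for unequal-width classes.
data = [1.1, 1.2, 1.3, 1.4, 1.6, 1.8, 2.1, 2.6, 3.4, 4.8]
boundaries = [1.0, 1.5, 2.0, 3.0, 5.0]         # narrow where data is dense

n = len(data)
rows = []
for a, b in zip(boundaries, boundaries[1:]):
    count = sum(1 for v in data if a <= v < b) # half-open classes [a, b)
    rel = count / n
    rows.append((a, b, count, rel, rel / (b - a)))   # last entry = density

for a, b, count, rel, dens in rows:
    print(f"[{a}, {b}): freq={count} rel={rel:.2f} density={dens:.2f}")
```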

Example 1.11: Present the histogram for the random data


Qualitative Data Histogram:
Both a frequency distribution and a histogram can be constructed when the data set is qualitative
(categorical) in nature. In some cases, there will be a natural ordering of classes—for example,
freshmen, sophomores, juniors, seniors, graduate students— whereas in other cases the order will
be arbitrary—for example, Catholic, Jewish, Protestant, and the like. With such categorical data,
the intervals above which rectangles are constructed should have equal width.
Constructing a Histogram for Qualitative Data
1. Make a table showing the frequency and relative frequency of each category.
2. Write all categories on a horizontal scale.
3. Above each category, draw a rectangle whose height is the relative frequency (or
alternatively, the frequency) of that category.

Example
 The area of each rectangle is the relative frequency of the corresponding class.
Furthermore, since the sum of relative frequencies should be 1, the total area of all
rectangles in a density histogram is 1.
 Advantage of the relative frequency histogram: while the absolute numbers may
change (as we test more looms, for example), the relationship among the classes may
remain stable. Twenty percent of all the looms may fall in the class “16.1–16.3 yards”
whether we test 30 or 300 looms. It is easy to compare the data from different sizes of
samples when we use relative frequency histograms.
Histogram Shapes:
Unimodal: A unimodal histogram is one that rises to a single peak and then declines.
Bimodal: A bimodal histogram has two different peaks.
Multimodal: A histogram with more than two peaks is said to be multimodal.
Smoothed:
Example
Frequency Polygons: Although less widely used, frequency polygons are another way to
portray graphically both simple and relative frequency distributions. To construct a frequency
polygon, we mark the frequencies on the vertical axis and the values of the variable we are
measuring on the horizontal axis, as we did with histograms. Next, we plot each class frequency
by drawing a dot above its midpoint, and connect the successive dots with straight lines to form a
polygon.
Converting a frequency polygon to a histogram: A frequency polygon is simply a line graph
that connects the midpoints of all the bars in a histogram. Therefore, we can reproduce the
histogram by drawing vertical lines from the bounds of the classes (as marked on the horizontal
axis) and connecting them with horizontal lines at the heights of the polygon at each midpoint.

 Advantages of histograms:
1. The rectangle clearly shows each separate class in the distribution.
2. The area of each rectangle, relative to all the other rectangles, shows the proportion of
the total number of observations that occur in that class.
 Advantages of polygons:
1. It sketches an outline of the data pattern more clearly.
2. The polygon becomes increasingly smooth and curvelike as we increase the number of
classes and the number of observations.
Frequency curve: A polygon such as the one we have just described, smoothed by added classes
and data points, is called a frequency curve. In Figure 2-10, we have used our carpet-loom
example, but we have increased the number of observations to 300 and the number of classes to
10. Notice that we have connected the points with curved lines to approximate the way the
polygon would look if we had a very large number of data points and very small class intervals.
Ogives: A graph of a cumulative frequency distribution is called an ogive (pronounced “oh-
jive”).
Branches of Statistics:
Descriptive Statistics: to summarize and describe important features of the data. Some of these
methods are graphical in nature; the construction of histograms, boxplots, and scatter plots are
primary examples. Other descriptive methods involve calculation of numerical summary
measures, such as means, standard deviations, and correlation coefficients.
Inferential Statistics: Techniques for generalizing from a sample to a population are gathered
within the branch of our discipline called inferential statistics. Having obtained a sample from a
population, an investigator would frequently like to use sample information to draw some type of
conclusion (make an inference of some sort) about the population.

Example: To illustrate the contrasting focus of probability and inferential statistics, consider drivers’ use
of manual lap belts in cars equipped with automatic shoulder belt systems. (The article
“Automobile Seat Belts: Usage Patterns in Automatic Belt Systems,” Human Factors,
1998: 126–135, summarizes usage data.)
In probability,
Assumption: we might assume that 50% of all drivers of cars equipped in this way in a
certain metropolitan area regularly use their lap belt
Q1) How likely is it that a sample of 100 such drivers will include at least 70 who
regularly use their lap belt?
Q2) How many of the drivers in a sample of size 100 can we expect to regularly use
their lap belt?
In inferential statistics,
we have sample information available; for example, a sample of 100 drivers of such cars
revealed that 65 regularly use their lap belt.
Q3) Does this provide substantial evidence for concluding that more than 50% of all
such drivers in this area regularly use their lap belt?
In this latter scenario, we are attempting to use sample information to answer a question about
the structure of the entire population from which the sample was selected
Measures of Location

 A physical interpretation of the sample mean demonstrates how it measures the center of
a sample.

𝑥̅ : sample mean; 𝑥̅ = (sum of 𝑛 sample values) / 𝑛
𝜇 : population mean; 𝜇 = (sum of 𝑁 population values) / 𝑁

 One of our first tasks in statistical inference will be to present methods based on the
sample mean for drawing conclusions about a population mean.
 The mean suffers from one deficiency that makes it an inappropriate measure of center
under some circumstances: Its value can be greatly affected by the presence of even a
single outlier (unusually large or small observation).
Example:
Number of employees = 10:
9 employees earn Rs. 50,000 per month, and 1 employee earns Rs. 1,50,000 per month.
Mean salary = (9 × 50,000 + 1,50,000) / 10 = Rs. 60,000
The sample mean salary, Rs. 60,000, certainly does not seem representative of the
data.
 In such a situation, it is desirable to employ a measure that is less sensitive to
outlying values than 𝑥̅ .
The Median

 The sample median is very insensitive to outliers.


𝑥̃ : Sample median
𝜇̃ : Population median
As with 𝑥̅ and 𝜇, we can think of using the sample median 𝑥̃ to make inference about population
median 𝜇̃.
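The salary example above can be checked numerically with the standard-library statistics module, showing how the outlier inflates the mean while leaving the median untouched:

```python
import statistics

# Nine employees at Rs. 50,000 and one outlier at Rs. 1,50,000.
salaries = [50_000] * 9 + [150_000]

mean = statistics.mean(salaries)       # pulled up by the single outlier
median = statistics.median(salaries)   # matches what most employees earn
print(mean, median)                    # 60000.0 50000.0
```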
Measures of Variability
Reporting a measure of center gives only partial information about a data set or distribution.
Different samples or populations may have identical measures of center yet differ from one
another in other important ways.
Figure 1.19 shows dotplots of three samples with the same mean and median, yet the extent of
spread about the center is different for all three samples. The first sample has the largest amount
of variability, the third has the smallest amount, and the second is intermediate to the other two
in this respect.

Measures of Variability for Sample Data


Range: Range is the difference between the largest and smallest sample values. This is the
simplest measure of variability in a sample.
A defect of the range, though, is that it depends on only the two most extreme observations and
disregards the positions of the remaining values.
Deviations from the mean: 𝑥1 − 𝑥̅ , 𝑥2 − 𝑥̅ , 𝑥3 − 𝑥̅ , … , 𝑥𝑛 − 𝑥̅
If all the deviations are small in magnitude, then all 𝑥𝑖 ′s are close to the mean and there is
little variability. Alternatively, if some of the deviations are large in magnitude, then
some 𝑥𝑖 ′s lie far from 𝑥̅ , suggesting a greater amount of variability.
A simple way to combine the deviations into a single quantity is to average them. Unfortunately,
this is a bad idea:
Average of deviations = ∑ᵢ₌₁ⁿ(𝑥ᵢ − 𝑥̅) / 𝑛 = (∑ᵢ₌₁ⁿ 𝑥ᵢ − ∑ᵢ₌₁ⁿ 𝑥̅) / 𝑛 = (𝑛𝑥̅ − 𝑛𝑥̅) / 𝑛 = 0
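The identity that deviations from the mean average to zero is easy to confirm numerically (here on the first five loom values from the chapter-opening table):

```python
# Deviations from the sample mean always sum to zero (up to rounding).
data = [16.2, 15.4, 16.0, 16.6, 15.9]
xbar = sum(data) / len(data)

deviations = [x - xbar for x in data]
print(round(sum(deviations), 10))      # 0.0
```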
Absolute deviations from mean: |𝑥1 − 𝑥̅ |, |𝑥2 − 𝑥̅ |, |𝑥3 − 𝑥̅ |, … , |𝑥𝑛 − 𝑥̅ |
The absolute value operation leads to a number of theoretical difficulties
Square deviations from mean: (𝑥1 − 𝑥̅ )2 , (𝑥2 − 𝑥̅ )2 , (𝑥3 − 𝑥̅ )2 , … , (𝑥𝑛 − 𝑥̅ )2
 Note that 𝑠 2 and 𝑠 are both nonnegative.
 The unit for 𝑠 is the same as the unit for each of the 𝑥𝑖 ′𝑠.

Motivation for 𝒔𝟐
Sample Variance = 𝑠 2 , Sample Standard Deviation = 𝑠, Sample Mean = 𝑥̅
Population Variance = 𝜎 2 , Population Standard Deviation = 𝜎, Population mean = 𝜇
Sample size = 𝑛, Population size = 𝑁

𝑠² = ∑ᵢ₌₁ⁿ(𝑥ᵢ − 𝑥̅)² / (𝑛 − 1),   𝜎² = ∑ᵢ₌₁ᴺ(𝑥ᵢ − 𝜇)² / 𝑁
If we actually knew the value of 𝜇, then we could define the sample variance as the average
squared deviation of the sample 𝑥𝑖 ′𝑠 about 𝜇. However, the value of 𝜇 is almost never known, so
the sum of squared deviations about 𝑥̅ must be used. But the 𝑥𝑖 ′𝑠 tend to be closer to their
average 𝑥̅ than to the population average 𝜇,
To compensate for this the divisor (𝑛 − 1) is used rather than 𝑛. In other words, if we used a
divisor 𝑛 in the sample variance, then the resulting quantity would tend to underestimate 𝜎 2
(produce estimated values that are too small on the average), whereas dividing by the slightly
smaller (𝑛 − 1) corrects this underestimating.
A computational formula for 𝑠²:
𝑠² = ∑ᵢ₌₁ⁿ(𝑥ᵢ − 𝑥̅)² / (𝑛 − 1) = [∑ᵢ₌₁ⁿ 𝑥ᵢ² − (∑ᵢ₌₁ⁿ 𝑥ᵢ)² / 𝑛] / (𝑛 − 1)
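A quick check, on a handful of loom values, that the computational shortcut agrees with the definitional form of the sample variance:

```python
# Definitional vs. shortcut formula for the sample variance s^2.
data = [16.2, 15.4, 16.0, 16.6, 15.9, 15.8]
n = len(data)
xbar = sum(data) / n

s2_def = sum((x - xbar) ** 2 for x in data) / (n - 1)            # definition
s2_short = (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)

print(s2_def, s2_short)                # equal up to floating-point rounding
```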

Proposition: Let 𝑥₁, 𝑥₂, …, 𝑥ₙ be a sample with variance 𝑠ₓ² and standard deviation 𝑠ₓ, and let
𝑐 be any nonzero constant.
(1) If 𝑦₁ = 𝑥₁ + 𝑐, 𝑦₂ = 𝑥₂ + 𝑐, …, 𝑦ₙ = 𝑥ₙ + 𝑐, then 𝑠ᵧ² = 𝑠ₓ² and 𝑠ᵧ = 𝑠ₓ.
(2) If 𝑦₁ = 𝑐𝑥₁, 𝑦₂ = 𝑐𝑥₂, …, 𝑦ₙ = 𝑐𝑥ₙ, then 𝑠ᵧ² = 𝑐²𝑠ₓ² and 𝑠ᵧ = |𝑐|𝑠ₓ.
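Both parts of the proposition can be verified on a small sample (the values and the constant c below are arbitrary choices for illustration):

```python
import statistics

x = [2.0, 4.0, 7.0, 11.0]
c = 3.0

shifted = [xi + c for xi in x]         # (1) adding c leaves s^2 unchanged
scaled = [c * xi for xi in x]          # (2) scaling by c multiplies s^2 by c^2

sx2 = statistics.variance(x)
print(statistics.variance(shifted))    # equals sx2
print(statistics.variance(scaled))     # equals c**2 * sx2
```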
