Module 4
Data are pieces of information about individuals organized into variables. By an individual, we mean a
particular person or object. By a variable, we mean a particular characteristic of the individual.
A dataset is a set of data identified with particular circumstances. Datasets are typically displayed in
tables, in which rows represent individuals and columns represent variables.
Categorical variables take category or label values and place an individual into one of several groups.
Each observation can be placed in only one category, and the categories are mutually exclusive.
Quantitative variables take numerical values and represent some kind of measurement.
It would make sense to average a quantitative variable but not a categorical variable.
In order to summarize the distribution of a categorical variable, we first create a table of the different
values (categories) the variable takes, how many times each value occurs (count) and, more importantly,
how often each value occurs (by converting the counts to percentages); this table is called a frequency
distribution.
The pie chart emphasizes how the different categories relate to the whole, and the bar chart emphasizes
how the different categories compare with each other.
The center of the distribution is its midpoint—the value that divides the distribution so that approximately
half the observations take smaller values, and approximately half the observations take larger values.
The spread (also called variability) of the distribution can be described by the approximate range covered
by the data.
The stemplot is a simple but useful visual display of quantitative data.
Note that when n is odd, the median is not included in either the bottom or top half of the data.
An observation is a suspected outlier if it falls below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR) (the 1.5 x IQR criterion).
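The 1.5 x IQR criterion can be sketched in a few lines of Python (the small dataset is made up, and note that different software packages compute quartiles slightly differently):

```python
import statistics

def iqr_outliers(data):
    """Flag suspected outliers using the 1.5 * IQR criterion."""
    q1, _, q3 = statistics.quantiles(data, n=4)   # Q1, median, Q3
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]

print(iqr_outliers([2, 3, 4, 4, 5, 5, 6, 7, 40]))   # -> [40]
```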
Even though it is an extreme value, if an outlier can be understood to have been produced by essentially
the same sort of physical or biological process as the rest of the data, and if such extreme values are
expected to eventually occur again, then such an outlier indicates something important and interesting
about the process you're investigating, and it should be kept in the data.
If an outlier can be explained to have been produced under fundamentally different conditions from the
rest of the data (or by a fundamentally different process), such an outlier can be removed from the data if
your goal is to investigate only the process that produced the rest of the data.
An outlier might indicate a mistake in the data (like a typo, or a measuring error), in which case it should
be corrected if possible or else removed from the data before calculating summary statistics or making
inferences from the data (and the reason for the mistake should be investigated).
The standard deviation gives the average (or typical) distance between a data point and the mean.
Approximately 68% of the observations fall within 1 standard deviation of the mean.
Approximately 95% of the observations fall within 2 standard deviations of the mean.
Approximately 99.7% (or virtually all) of the observations fall within 3 standard deviations of the
mean.
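The Standard Deviation Rule can be checked by simulation. The sketch below draws observations from a normal distribution with a made-up mean and SD, then counts the fraction within 1, 2, and 3 standard deviations:

```python
import random
import statistics

# Draw many observations from a normal distribution (made-up mean and SD)
random.seed(0)
xs = [random.gauss(100, 15) for _ in range(100_000)]
mean, sd = statistics.fmean(xs), statistics.stdev(xs)

# Count the fraction of observations within k standard deviations of the mean
for k in (1, 2, 3):
    within = sum(mean - k * sd <= x <= mean + k * sd for x in xs) / len(xs)
    print(f"within {k} SD: {within:.3f}")   # roughly 0.68, 0.95, 0.997
```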
A positive (or increasing) relationship means that an increase in one of the variables is associated with
an increase in the other.
A negative (or decreasing) relationship means that an increase in one of the variables is associated with
a decrease in the other.
Properties of the correlation coefficient:
The correlation does not change when the units of measurement of either one of the variables change. In
other words, if we change the units of measurement of the explanatory variable and/or the response
variable, the change has no effect on the correlation (r).
The correlation measures only the strength of a linear relationship between two variables. It ignores any
other type of relationship, no matter how strong it is.
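The unit-invariance property can be illustrated directly. This sketch computes r from z-scores for a made-up height/weight dataset, converts the units (inches to centimeters, pounds to kilograms), and shows r is unchanged:

```python
import statistics

def pearson_r(xs, ys):
    """Pearson correlation: average product of the paired z-scores."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    return sum((x - mx) / sx * (y - my) / sy
               for x, y in zip(xs, ys)) / (len(xs) - 1)

# Made-up data: heights in inches, weights in pounds
heights_in = [60, 62, 65, 68, 71, 74]
weights_lb = [115, 120, 135, 155, 170, 190]

r1 = pearson_r(heights_in, weights_lb)
# Change the units (inches -> cm, pounds -> kg): r does not change
r2 = pearson_r([h * 2.54 for h in heights_in],
               [w * 0.4536 for w in weights_lb])
print(round(r1, 6) == round(r2, 6))   # -> True
```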
Types of Samples:
A sampling frame is the list of individuals from which the sample is actually drawn. If the frame does not cover the entire population of interest, the resulting sample may be biased.
In an observational study, variables of interest are recorded as they naturally occur. There is no interference by the researchers who conduct the study.
A sample survey is a particular type of observational study in which individuals report variables' values themselves, frequently by giving their opinions.
Perform an experiment. Instead of assessing the values of the variables as they naturally occur, the
researchers interfere, and they are the ones who assign the values of the explanatory variable to the
individuals. The researchers "take control" of the values of the explanatory variable because they want to
see how changes in the value of the explanatory variable affect the response variable. (Note: By nature,
any experiment involves at least two variables.)
In general, we control for the effects of a lurking variable by separately studying groups that are similar
with respect to this variable.
If neither the subjects nor the researchers know who was assigned what treatment, then the experiment is
called double-blind
The most reliable way to determine whether the explanatory variable is actually causing changes in the
response variable is to carry out a randomized controlled double-blind experiment.
Some of the inherent difficulties that may be encountered in experimentation are the Hawthorne effect,
lack of realism, noncompliance, and treatments that are unethical, impossible, or impractical to impose.
This phenomenon, whereby people in an experiment behave differently from how they would normally
behave, is called the Hawthorne effect.
Probability
One method for determining whether two events are independent is to compare P(B | A) and P(B).
If the two are equal (i.e., knowing or not knowing whether A has occurred has no effect on the probability
of B occurring) then the two events are independent. Otherwise, if the probability changes depending
on whether we know that A has occurred or not, then the two events are not independent. Similarly,
using the same reasoning, we can compare P(A | B) and P(A).
P(B | A) = P(B)
P(A | B) = P(A)
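A minimal sketch of this comparison, using a hypothetical two-way table of counts:

```python
# Hypothetical two-way table of counts (made-up numbers)
counts = {("A", "B"): 20, ("A", "not B"): 30,
          ("not A", "B"): 40, ("not A", "not B"): 60}
total = sum(counts.values())                                   # 150

# P(B): all outcomes where B occurred, out of everything
p_b = (counts[("A", "B")] + counts[("not A", "B")]) / total    # 60/150 = 0.4
# P(B | A): restrict attention to the outcomes where A occurred
p_b_given_a = counts[("A", "B")] / (counts[("A", "B")] + counts[("A", "not B")])  # 20/50 = 0.4

print(p_b == p_b_given_a)   # -> True: in this table, A and B are independent
```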
A random variable assigns a unique numerical value to the outcome of a random experiment.
When describing a scatterplot, we consider the direction, form, and strength of the relationship, and then look for outliers.
Page 56 is an important page discussing the features of the correlation coefficient.
a lurking variable, by definition, is a variable that was not included in the study, but could have a
substantial effect on our understanding of the relationship between the two studied variables.
Binomial experiments are random experiments that consist of a fixed number of repeated trials, like
tossing a coin 10 times, randomly choosing 10 people, rolling a die 5 times, etc. These trials, however,
need to be independent in the sense that the outcome in one trial has no effect on the outcome in other
trials. In each of these repeated trials there is one outcome that is of interest to us (we call this outcome
"success"), and each of the trials is identical in the sense that the probability that the trial will end in a
"success" is the same in each of the trials.
The random variable X that represents the number of successes in those n trials is called binomial.
The number (X) of successes in a sample of size n taken without replacement from a population with
proportion (p) of successes is approximately binomial with n and p as long as the sample size (n) is at
most 10% of the population size (N).
Consider a random experiment that consists of n trials, each one ending up in either success or failure.
The number of possible outcomes in the sample space that have exactly k successes out of n is:
C(n, k) = n!/(k!(n−k)!)
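This count is available in Python's standard library as math.comb, and combining it with the success probability gives the binomial probability formula P(X = k) = C(n, k) p^k (1−p)^(n−k):

```python
from math import comb

def binom_pmf(n, k, p):
    """P(X = k) for a binomial random variable: n trials, success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(comb(10, 4))                        # ways to get 4 successes in 10 trials: 210
print(round(binom_pmf(10, 4, 0.5), 4))    # P(exactly 4 heads in 10 fair tosses) -> 0.2051
```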
For a normal distribution, the first quartile is about 0.67 standard deviations below the mean, so 25% of the data falls below mean − 0.67(SD).
We can generalize what we learned in the last example and say that when two individuals are selected at
random from a large population (like in the example, the entire U.S.) any event associated with one
individual is independent of any event associated with the other individual. The fact that the two are
chosen from a large population is key to the independence.
If A and B are two independent events, then P(A and B) = P(A) * P(B).
A parameter is a number that describes the population; a statistic is a number that is computed from the
sample.
The standard deviation of sample proportions is sqrt(p(1 − p)/n).
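The formula sqrt(p(1 − p)/n) for the standard deviation of sample proportions can be checked against a simulation of many samples (p and n here are made-up values):

```python
import math
import random
import statistics

p, n = 0.6, 100   # made-up population proportion and sample size

# Formula for the standard deviation of sample proportions
sd_formula = math.sqrt(p * (1 - p) / n)

# Simulation: draw many samples of size n, record each sample proportion
random.seed(1)
phats = [sum(random.random() < p for _ in range(n)) / n for _ in range(20_000)]
sd_sim = statistics.stdev(phats)

print(round(sd_formula, 3), round(sd_sim, 3))   # both close to 0.049
```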
The null hypothesis suggests nothing special is going on; in other words, there is no change from the status quo, no difference from the traditional state of affairs, no relationship.
The significance level (α) of the test is the cutoff that determines how surprising the data must be, assuming the null hypothesis is true, for the result to count as statistically significant.
If the p-value < α (usually .05), then the data we got are considered to be "rare (or surprising) enough" when Ho is true, and we say that the data provide significant evidence against Ho, so we reject Ho and accept Ha.
If the p-value > α (usually .05), then our data are not considered to be "surprising enough" when Ho is true, and we say that our data do not provide enough evidence to reject Ho (or, equivalently, that the data do not provide enough evidence to accept Ha).
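This decision rule can be sketched with a one-proportion z-test (a large-sample method; the survey numbers below are made up):

```python
import math

def one_proportion_z_test(phat, p0, n, alpha=0.05):
    """Two-sided z-test for Ho: p = p0 (sketch; assumes a large sample)."""
    se = math.sqrt(p0 * (1 - p0) / n)                    # SD of sample proportions under Ho
    z = (phat - p0) / se
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))    # standard normal CDF
    p_value = 2 * (1 - cdf)                              # two-sided
    return z, p_value, p_value < alpha                   # True -> reject Ho

# Made-up survey: 64 successes out of n = 100, testing Ho: p = 0.5
z, p_value, reject = one_proportion_z_test(phat=0.64, p0=0.5, n=100)
print(round(z, 2), round(p_value, 4), reject)   # p-value < .05, so we reject Ho
```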