Define Statistics
Statistics can be used, for example, to analyze consumer buying habits or to attempt to predict future events, such as projecting the future return of a security or asset class based on returns in a sample period.
Regression analysis is a widely used technique of statistical inference used to
determine the strength and nature of the relationship (i.e., the correlation) between
a dependent variable and one or more explanatory (independent) variables. The
output of a regression model is often analyzed for statistical significance, which
refers to the claim that a result from findings generated by testing or
experimentation is not likely to have occurred randomly or by chance but is likely to
be attributable to a specific cause elucidated by the data. Having statistical
significance is important for academic disciplines or practitioners that rely heavily on
analyzing data and research.
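To make the idea concrete, here is a minimal sketch of fitting a simple linear regression in Python with NumPy. The data points are invented purely for illustration; a real analysis would also test the fitted coefficients for statistical significance.

```python
# Minimal sketch: fit y = a*x + b by ordinary least squares (illustrative data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # explanatory (independent) variable
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # dependent variable

a, b = np.polyfit(x, y, deg=1)             # slope and intercept of the fitted line
r = np.corrcoef(x, y)[0, 1]                # correlation between x and y

print(f"slope={a:.3f}, intercept={b:.3f}, r={r:.3f}")
```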
The Branches of Statistics
The field of statistics comprises two branches: descriptive statistics and inferential statistics.
Descriptive Statistics
CONCEPT The branch of statistics that focuses on collecting, summarizing, and
presenting a set of data.
EXAMPLES The average age of citizens who voted for the winning candidate in the
last presidential election, the average length of all books about statistics, the
variation in the weight of 100 boxes of cereal selected from a factory's production
line.
Inferential Statistics
CONCEPT The branch of statistics that analyzes sample data to draw conclusions
about a population.
EXAMPLE A survey that sampled 2,001 full- or part-time workers ages 50 to 70,
conducted by the American Association of Retired Persons (AARP), discovered that
70% of those polled planned to work past the traditional mid-60s retirement age. By
using methods discussed in Section 6.4, this statistic could be used to draw
conclusions about the population of all workers ages 50 to 70.
Mode: the only average that can be used if the data set is not numerical. However, there can be more than one mode, and there can also be no mode, which means the mode is not always representative of the data.
What Is Variance?
The variance is a measure of variability. It is calculated by taking the average of
squared deviations from the mean.
Variance tells you the degree of spread in your data set. The more spread the
data, the larger the variance is in relation to the mean.
Variance = (Standard deviation)² = σ²
Population variance
When you have collected data from every member of the population that
you’re interested in, you can get an exact value for population variance.
The population variance formula looks like this:

σ² = Σ(X − μ)² / N

Where:
σ² = population variance
Σ = sum of…
X = each value
μ = population mean
N = number of values in the population
Sample variance
When you collect data from a sample, the sample variance is used to make
estimates or inferences about the population variance.
The sample variance formula looks like this:

s² = Σ(X − x̄)² / (n − 1)

Where:
s² = sample variance
Σ = sum of…
X = each value
x̄ = sample mean
n = number of values in the sample
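As a quick illustration of the two formulas, Python's standard statistics module implements both denominators; the data below are made up.

```python
# Population variance divides by N; sample variance divides by n - 1.
import statistics

data = [4, 8, 6, 5, 3, 2, 8, 9, 2, 5]

print(statistics.pvariance(data))   # population variance (divide by N)
print(statistics.variance(data))    # sample variance (divide by n - 1)
```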
The formula includes Q3 and Q1, which mark the top 25% and the bottom 25% of the data, respectively. Taking the difference between these two and halving it gives a measure of spread or dispersion:

Quartile deviation = (Q3 − Q1) / 2

So, to calculate the quartile deviation, first find Q1, then find Q3, then take the difference between the two, and finally divide by 2.
It is one of the best measures of dispersion for open-ended data.
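A short sketch of those three steps, using NumPy's percentile function on made-up data:

```python
# Quartile deviation = (Q3 - Q1) / 2, computed on illustrative data.
import numpy as np

data = [19, 22, 24, 25, 28, 31, 33, 35, 40, 42]

q1 = np.percentile(data, 25)   # step 1: find Q1
q3 = np.percentile(data, 75)   # step 2: find Q3
print((q3 - q1) / 2)           # step 3: halve the difference
```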
You can think of skewness in terms of tails. A tail is a long, tapering end of a
distribution. It indicates that there are observations at one of the extreme
ends of the distribution, but that they’re relatively infrequent. A right-skewed
distribution has a long tail on its right side.
Right skew: mean > median
For example, the mean number of sunspots observed per year was 48.6, which
is greater than the median of 39.
What is left skew (negative skew)?
A left-skewed distribution is longer on the left side of its peak than on its right.
In other words, a left-skewed distribution has a long tail on its left side. Left
skew is also referred to as negative skew.
Left skew: mean < median
For example, the mean zoology test score was 53.7, which is less than the
median of 55.
How to calculate skewness
There are several formulas to measure skewness. One of the simplest is
Pearson’s median skewness. It takes advantage of the fact that the mean and
median are unequal in a skewed distribution.
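Pearson's median skewness is defined as 3 × (mean − median) / standard deviation. A small sketch with invented data:

```python
# Pearson's median skewness: 3 * (mean - median) / standard deviation.
import statistics

data = [10, 25, 30, 39, 45, 60, 75, 90, 120]   # illustrative values

skew = 3 * (statistics.mean(data) - statistics.median(data)) / statistics.stdev(data)
print(skew)   # positive => right skew, negative => left skew
```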
Events in Probability
Events in probability can be defined as a set of outcomes of a random
experiment. The sample space indicates all possible outcomes of an
experiment. Thus, events in probability can also be described as subsets of the
sample space.
There are many different types of events in probability. Each type of event has
its own individual properties. This classification of events in probability helps to
simplify mathematical calculations. In this article, we will learn more about
events in probability, their types and see certain associated examples.
Events in Probability Example
Suppose a fair die is rolled. The total number of possible outcomes will form
the sample space and are given by {1, 2, 3, 4, 5, 6}. Let an event, E, be defined
as getting an even number on the die. Then E = {2, 4, 6}. Thus, it can be seen
that E is a subset of the sample space of the rolling of a die.
Types of Events in Probability
There are several different types of events in probability. There can only be one sample space for a random experiment; however, there can be many
different types of events. Some of the important events in probability are listed
below.
Independent and Dependent Events
Independent events in probability are those events whose outcome does not
depend on some previous outcome. No matter how many times an experiment
has been conducted the probability of occurrence of independent events will
be the same. For example, tossing a coin is an independent event in
probability.
Dependent events in probability are events whose outcome depends on a
previous outcome. This implies that the probability of occurrence of a
dependent event will be affected by some previous outcome. For example,
drawing two balls one after another from a bag without replacement.
Impossible and Sure Events
An event that can never happen is known as an impossible event. As
impossible events in probability will never take place thus, the chance that
they will occur is always 0. For example, the sun revolving around the earth is
an impossible event.
A sure event is one that will always happen. The probability of occurrence of a
sure event will always be 1. For example, the earth revolving around the sun is
a sure event.
Simple and Compound Events
If an event consists of a single point or a single result from the sample space, it
is termed a simple event. The event of getting less than 2 on rolling a fair die,
denoted as E = {1}, is an example of a simple event.
If an event consists of more than a single result from the sample space, it is
called a compound event. An example of a compound event in probability is
rolling a fair die and getting an odd number. E = {1, 3, 5}.
Complementary Events
When there are two events such that one event can occur if and only if the
other does not take place then such events are known as complementary
events in probability. The sum of the probability of complementary events will
always be equal to 1. For example, on tossing a coin let E be defined as getting
a head. Then the complement of E is E' which will be the event of getting a tail.
Thus, E and E' together make up complementary events. Such events are
mutually exclusive and exhaustive.
Mutually Exclusive Events
Events that cannot occur at the same time are known as mutually exclusive
events. Thus, mutually exclusive events in probability do not have any common
outcomes. For example, S = {10, 9, 8, 7, 6, 5, 4}, A = {4, 6, 7} and B = {10, 9, 8}.
As there is nothing common between sets A and B thus, they are mutually
exclusive events.
Exhaustive Events
Exhaustive events in probability are those events that, when taken together, form the sample space of a random experiment. In other words, a set of events out
of which at least one is sure to occur when the experiment is performed are
exhaustive events. For example, the outcome of an exam is either passing or
failing.
Equally Likely Events
Equally likely events in probability are those events in which the outcomes are
equally possible. For example, on tossing a coin, getting a head or getting a tail,
are equally likely events.
Intersection of Events in Probability
The intersection of events in probability corresponds to the AND event. If two
events are associated with the "AND" operator, it implies that the common
outcomes of both events will be the result. It is denoted by the intersection
symbol "∩". For example, A = {1, 2, 3, 4}, B = {2, 3, 5, 6} then A ∩ B = {2, 3}.
What Is Sampling?
Sampling is a process in statistical analysis where researchers take a
predetermined number of observations from a larger population. The method
of sampling depends on the type of analysis being performed, but it may
include simple random sampling or systematic sampling.
KEY TAKEAWAYS
Certified Public Accountants use sampling during audits to determine
the accuracy and completeness of account balances.
Types of sampling include random sampling, block sampling, judgement
sampling, and systematic sampling.
Companies use sampling as a marketing tool to identify the needs and
wants of their target market.
Sampling may be defined as the procedure in which a sample is selected from an individual or a group of people of a certain kind for research purposes. In
sampling, the population is divided into a number of parts called sampling
units.
Merits:
1. Economical:
It is economical, because we do not have to collect all the data. Instead of getting data from 5,000 farmers, we get it from only 50-100.
2. Reliable:
If the sample is taken judiciously, the results are very reliable and accurate.
3. Organisational Convenience:
As samples are taken and the number of units is smaller, better (trained) enumerators can be employed by the organisation.
4. More Scientific:
According to Prof. R.A. Fisher, “The sample technique has four important advantages over census technique of data collection. They are Speed, Economy, Adaptability and Scientific approach.”
5. Detailed Enquiry:
A detailed study can be undertaken of the units included in the sample. The size of the sample can be chosen according to the time and money available to the investigator.
6. Indispensable Method:
Demerits:
1. Wrong Conclusion:
If the sample is not representative, the results will not be correct and will lead to wrong conclusions.
2. Small Universe:
Sometimes the universe is so small that proper samples cannot be taken from it, because the number of units is so small.
3. Specialised Knowledge:
Sampling is a scientific method. Therefore, to get a good and representative sample, one should have the special knowledge needed to select a good sample and to perform proper analysis so that reliable results may be achieved.
4. Inherent Defects:
The results achieved through the analysis of sampling data may not be accurate, as this method has inherent defects. There is not a single method of sampling that is free of demerits.
5. Sampling Error:
This method of sampling is subject to many errors.
6. Personal Bias:
As in many cases the investigator chooses the sample (as in the convenience method), chances of personal bias creep in.
This histogram shows us that our initial sample mean of 103 falls near the
center of the sampling distribution. Means occur in this range the most
frequently—18 of the 50 samples (36%) fall within the middle bar. However,
other samples from the same population have higher and lower means. The
frequency of means is highest in the sampling distribution center and tapers
off in both directions. None of our 50 sample means fall outside the range of
85-118. Consequently, it is very unusual to obtain sample means outside this
range.
Typically, you don’t know the population parameters. Instead, you use samples
to estimate them. However, we know the parameters for this simulation
because I’ve set the population to follow a normal distribution with a mean (µ)
weight of 100 grams and a standard deviation (σ) of 15 grams. Those are
the parameters of the apple population from which we’ve been sampling.
Notice how the histogram centers on the population mean of 100,
and sample means become rarer further away. It’s also a reasonably
symmetric distribution. Those are features of many sampling distributions. This
distribution isn’t particularly smooth because 50 samples is a small number for
this purpose, as you’ll see.
I used Excel to create this example. I had it randomly draw 50 samples with
a sample size of 10 from a population with µ = 100 and σ = 15.
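For readers who prefer code to spreadsheets, the same experiment can be sketched in Python; the seed is arbitrary, so exact results will differ from the Excel run described above.

```python
# Draw 50 samples of size 10 from a normal population (mu = 100, sigma = 15)
# and look at the distribution of the 50 sample means.
import numpy as np

rng = np.random.default_rng(seed=1)
samples = rng.normal(loc=100, scale=15, size=(50, 10))
sample_means = samples.mean(axis=1)

print(sample_means.min(), sample_means.mean(), sample_means.max())
# The means cluster near the population mean of 100 and taper off in both directions.
```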
This value can be used to calculate the coefficient of determination (R²) using Formula 1:

R² = 1 − RSS / TSS

Where:
RSS = sum of squared residuals
TSS = total sum of squares

These values can be used to calculate the coefficient of determination (R²) using Formula 2:

R² = (coefficient of correlation)²
If x & y are the two variables of discussion, then the correlation coefficient can
be calculated using the formula
Here,
n = Number of values or elements
∑x = Sum of 1st values list
∑y = Sum of 2nd values list
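As a cross-check, the same coefficient can be computed with NumPy on two illustrative value lists:

```python
# Pearson correlation coefficient r for two made-up lists.
import numpy as np

x = [15, 18, 21, 24, 27]
y = [25, 25, 27, 31, 32]

print(np.corrcoef(x, y)[0, 1])
```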
4.5.2 Optimum allocation
Optimum allocation takes into consideration both the sizes of the strata and
the variability inside the strata. In order to obtain the minimum sampling
variance the total sample size should be allocated to the strata proportionally
to their sizes and also to the standard deviation of their values, i.e. to the
square root of the variances.
nh = constant × Nh sh

so that

nh = n × (Nh sh) / Σ (Nh sh)

where n is total sample size, nh is the sample size in stratum h, Nh is the size of stratum h and sh is the square root of the variance in stratum h.
4.5.3 Optimum allocation with variable cost
In some sampling situations, the cost of sampling in terms of time or money is
composed of a fixed part and of a variable part depending on the stratum.
The sampling cost function is thus of the form:

C = c0 + Σ ch nh

where C is the total cost of the sampling, c0 is an overhead cost and ch is the cost per sampling unit in stratum h, which may vary from stratum to stratum.
The optimum allocation of the sample to the strata in this situation is allocating sample size to the strata proportional to the size and the standard error, and inversely proportional to the square root of the cost of sampling in each stratum. This gives the following sample size for stratum h:

nh = n × (Nh sh / √ch) / Σ (Nh sh / √ch)
Very often, it is the total cost of the sampling, rather than the total sample size, that is fixed. This is usually the case with research vessel surveys, in which the number of days is fixed beforehand. In this case, the optimum allocation of sample size among strata is

nh = (C − c0) × (Nh sh / √ch) / Σ (Nh sh √ch)
Although the starting point may be random, the sampling involves the use of fixed intervals between each member.
Example of Systematic Sampling
The goal of systematic sampling is to obtain an unbiased sample. The way to achieve this is to assign a number to every participant in the population and then select members at the same designated interval throughout the population to create the sample.
For example, you could choose every 5th participant or every 20th participant, but you must use the same interval across the whole population. The process of selecting every nth member is systematic sampling.
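A minimal sketch of that procedure in Python, with a hypothetical numbered population and an interval of 5:

```python
# Systematic sampling: random start, then every 5th member.
import random

population = list(range(1, 101))   # members numbered 1..100
k = 5                              # fixed sampling interval

start = random.randrange(k)        # random starting point in the first interval
print(population[start::k])        # every 5th member from the start point
```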
Cluster Sampling
Cluster sampling is another type of random statistical measure. This method is
used when there are different subsets of groups present in a larger population.
These groups are known as clusters. Cluster sampling is commonly used
by marketing groups and professionals.
Cluster sampling is a two-step procedure. First, the entire population is
selected and separated into different clusters. Random samples are then
chosen from these subgroups. For example, a researcher may find it difficult to
construct the entire population of customers of a grocery store to interview.
However, they may be able to create a random subset of stores; this
represents the first step in the process. The second step is to interview a
random sample of the customers of those stores.
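The two-step procedure can be sketched as follows; the store names and customer IDs are hypothetical, and this version surveys every customer in the chosen stores (one-stage clustering):

```python
# Cluster sampling: randomly choose whole clusters, then take their members.
import random

clusters = {
    "store_A": ["c1", "c2", "c3"],
    "store_B": ["c4", "c5"],
    "store_C": ["c6", "c7", "c8"],
    "store_D": ["c9", "c10"],
}

chosen = random.sample(list(clusters), k=2)          # step 1: sample clusters
sample = [m for c in chosen for m in clusters[c]]    # step 2: take their members
print(chosen, sample)
```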
Example of Cluster Sampling
For example, say an academic study is being conducted to determine how
many employees at investment banks hold MBAs, and of those MBAs, how
many are from Ivy League schools. It would be difficult for the statistician to go
to every investment bank and ask every single employee their educational
background. To achieve the goal, a statistician can employ cluster sampling.
The first step would be to form a cluster of investment banks. Rather than
study every investment bank, the statistician can choose to study the top three
largest investment banks based on revenue, forming the first cluster. From
there, rather than interviewing every employee in all three investment banks,
a statistician could form another cluster, which would include employees from
only certain departments, for example, sales and trading or mergers and
acquisitions.
What is a Rejection Region?
A rejection region (also called a critical region) is an area of a graph where you
would reject the null hypothesis if your test results fall into that area. In other
words, if your results fall into that area then they are statistically significant.
The main purpose of statistics is to test theories or results from experiments.
For example, you might have invented a new fertilizer that you think makes
plants grow 50% faster. In order to prove your theory is true, your experiment
must:
1. Be repeatable.
2. Be compared to a known fact about plants (in this example, probably the
average growth rate of plants without the fertilizer).
Acceptance Region:
In hypothesis testing, the test procedure partitions all the possible sample
outcomes into two subsets (on the basis of whether the observed value of the
test statistic is smaller than a threshold value or not). The subset that is
considered to be consistent with the null hypothesis is called the "acceptance
region"; another subset is called the "rejection region" (or "critical region").
If the sample outcome falls into the acceptance region, then the null
hypothesis is accepted. If the sample outcome falls into the rejection region,
then the null hypothesis is rejected (i.e. the alternative hypothesis is accepted).
We call this type of statistical testing a hypothesis test. The rejection region is a part of the testing process. Specifically, it is an area of probability that tells you if your theory (your “hypothesis”) is probably true.
All possible values which a test-statistic may assume can be divided into two
mutually exclusive groups: one group consisting of values which appear to be
consistent with the null hypothesis and the other having values which are
unlikely to occur if Ho is true. The first group is called the acceptance region
and the second set of values is known as the rejection region for a test. The
rejection region is also called the critical region. The value(s) that separate the critical region from the acceptance region are called the critical value(s). The critical value, which can be in the same units as the parameter or in standardized units, is to be decided by the experimenter, keeping in view the degree of confidence they are willing to have in the null hypothesis.
Hypothesis Testing
Hypothesis testing is a formal procedure for investigating our ideas about the
world using statistics. It is most often used by scientists to test specific
predictions, called hypotheses, that arise from theories.
There are 5 main steps in hypothesis testing:
1. State your research hypothesis as a null hypothesis (H0) and an alternate hypothesis (Ha or H1).
2. Collect data in a way designed to test the hypothesis.
3. Perform an appropriate statistical test.
4. Decide whether to reject or fail to reject your null hypothesis.
5. Present the findings in your results and discussion section.
What is an Estimator?
The sample mean is an estimator for the population mean.
An estimator is a statistic that estimates some fact about the population. You
can also think of an estimator as the rule that creates an estimate. For
example, the sample mean (x̄) is an estimator for the population mean, μ.
The quantity that is being estimated (i.e. the one you want to know) is called
the estimand. For example, let’s say you wanted to know the average height of
children in a certain school with a population of 1000 students. You take a
sample of 30 children, measure them and find that the mean height is 56
inches. This is your sample mean, the estimator. You use the sample mean to
estimate that the population mean (your estimand) is about 56 inches.
Point vs. Interval
Estimators can be a range of values (like a confidence interval) or a single value
(like the standard deviation). When an estimator is a range of values, it’s called
an interval estimate. For the height example above, you might add on a
confidence interval of a couple of inches either way, say 54 to 58 inches. When
it is a single value — like 56 inches — it’s called a point estimate.
Types
Estimators can be described in several ways:
Biased: a statistic that is either an overestimate or an underestimate.
Efficient: a statistic with small variance (the one with the smallest possible variance is also called the “best”). Inefficient estimators can give you good results as well, but they usually require much larger samples.
Invariant: statistics that are not easily changed by transformations, like simple
data shifts.
Shrinkage: a raw estimate that’s improved by combining it with other
information. See also: The James-Stein estimator.
Sufficient: a statistic that estimates the population parameter as well as if you
knew all of the data in all possible samples.
Unbiased: an accurate statistic that neither underestimates nor overestimates.
What is a Point Estimate?
In simple terms, any statistic can be a point estimate. A statistic is
an estimator of some parameter in a population. For example:
The sample standard deviation (s) is a point estimate of the population standard deviation (σ).
The sample mean (x̄) is a point estimate of the population mean, μ.
The sample variance (s²) is a point estimate of the population variance (σ²).
The following are the main uses of statistics in various business activities:
With the help of statistical methods, quantitative information about
production, sale, purchase, finance, etc. can be obtained. This type of
information helps businessmen in formulating suitable policies.
The arithmetic mean (or simply "mean") of a sample is the sum of the sampled
values divided by the number of items in the sample.
Median:
The median is that value of the series which divides the group into two equal
parts, one part comprising all values greater than the median value and the
other part comprising all the values smaller than the median value.
Merits of median
(1) Simplicity:- It is a very simple measure of the central tendency of the series. In the case of a simple statistical series, just a glance at the data is enough to locate the median value.
(2) Free from the effect of extreme values: - Unlike the arithmetic mean, the median value is not distorted by the extreme values of the series.
(3) Certainty: - Certainty is another merit of the median. The median value is always a certain specific value in the series.
(4) Real value: - Median value is a real value and is a better representative value of the series compared to the arithmetic mean, the value of which may not exist in the series at all.
(5) Graphic presentation: - Besides algebraic approach, the median value can
be estimated also through the graphic presentation of data.
(6) Possible even when data is incomplete: - Median can be estimated even in
the case of certain incomplete series. It is enough if one knows the number of
items and the middle item of the series.
Demerits of median:
(1) Median is not a dependable measure of central tendency in the case of series whose different values are wide apart from each other. Also, median is of limited representative character, as it is not based on all the items in the series.
(2) Unrealistic:- When the median is located somewhere between the two
middle values, it remains only an approximate measure, not a precise value.
Mode:
Merits of mode:
(1) Simple and popular: - Mode is a very simple measure of central tendency. Sometimes, just a glance at the series is enough to locate the modal value. Because of its simplicity, it is a very popular measure of the central tendency.
(2) Less effect of marginal values: - Compared to the mean, mode is less affected by marginal values in the series. Mode is determined only by the value with the highest frequency.
(3) Graphic presentation: - Mode can be located graphically, with the help of a histogram.
(4) Best representative: - Mode is that value which occurs most frequently in
the series. Accordingly, mode is the best representative value of the series.
(5) No need of knowing all the items or frequencies: - The calculation of mode
does not require knowledge of all the items and frequencies of a distribution.
In simple series, it is enough if one knows the items with highest frequencies in
the distribution.
Demerits of mode:
(1) Uncertain and vague: - Mode is an uncertain and vague measure of the
central tendency.
(2) Not capable of algebraic treatment: - Unlike mean, mode is not capable of
further algebraic treatment.
(3) Difficult: - When the frequencies of all items are identical, it is difficult to identify the modal value.
Two distributions that are similar in statistics are the Binomial distribution and
the Poisson distribution.
This tutorial provides a brief explanation of each distribution along with the
similarities and differences between the two.
The Binomial Distribution
The Binomial distribution describes the probability of obtaining k successes
in n binomial experiments.
If a random variable X follows a binomial distribution, then the probability
that X = k successes can be found by the following formula:
P(X=k) = nCk * p^k * (1-p)^(n-k)
where:
n: number of trials
k: number of successes
p: probability of success on a given trial
nCk: the number of ways to obtain k successes in n trials
For example, suppose we flip a coin 3 times. We can use the formula above to
determine the probability of obtaining 0 heads during these 3 flips:
P(X=0) = 3C0 * .5^0 * (1-.5)^(3-0) = 1 * 1 * (.5)^3 = 0.125
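The hand calculation can be verified with math.comb:

```python
# Verify P(X = 0) for n = 3 coin flips with p = 0.5.
from math import comb

n, k, p = 3, 0, 0.5
print(comb(n, k) * p**k * (1 - p)**(n - k))   # 0.125
```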
The Poisson Distribution
The Poisson distribution describes the probability of experiencing k events
during a fixed time interval.
If a random variable X follows a Poisson distribution, then the probability
that X = k events can be found by the following formula:
P(X=k) = (λ^k * e^(−λ)) / k!
where:
λ: mean number of successes that occur during a specific interval
k: number of successes
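A small worked sketch of the formula; the rate λ = 4 is an assumed example value, not from the text:

```python
# Poisson probability P(X = k) for an assumed mean rate lambda = 4.
from math import exp, factorial

lam, k = 4, 2
print(lam**k * exp(-lam) / factorial(k))   # P(X = 2), roughly 0.1465
```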
KEY TAKEAWAYS
Subjective probability is a type of probability derived from an individual's
personal judgment or own experience about whether a specific outcome is
likely to occur.
It contains no formal calculations and only reflects the subject's opinions and past experience rather than data or computation.
Subjective probabilities differ from person to person and contain a high degree
of personal bias.
Example of Subjective Probability
An example of subjective probability is asking New York Yankees fans, before
the baseball season starts, about the chances of New York winning the World
Series. While there is no absolute mathematical proof behind the answer to
the example, fans might still reply in actual percentage terms, such as the
Yankees having a 25% chance of winning the World Series.
What Is Conditional Probability?
Conditional probability is defined as the likelihood of an event or outcome
occurring, based on the occurrence of a previous event or outcome.
Conditional probability is calculated by multiplying the probability of the
preceding event by the updated probability of the succeeding, or conditional,
event.
Conditional probability can be contrasted with unconditional probability.
Unconditional probability refers to the likelihood that an event will take place
irrespective of whether any other events have taken place or any other
conditions are present.
KEY TAKEAWAYS
Conditional probability refers to the chances that some outcome occurs
given that another event has also occurred.
It is often stated as the probability of B given A and is written as P(B|A),
where the probability of B depends on that of A happening.
Conditional probability can be contrasted with unconditional probability.
Probabilities are classified as either conditional, marginal, or joint.
Bayes' theorem is a mathematical formula used in calculating conditional
probability.
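A concrete sketch with a fair die, using the standard relation P(B|A) = P(A and B) / P(A); the events here are chosen for illustration:

```python
# Conditional probability on a fair die: P(even | roll > 3).
sample_space = {1, 2, 3, 4, 5, 6}
A = {4, 5, 6}   # event: roll greater than 3
B = {2, 4, 6}   # event: roll is even

p_A = len(A) / len(sample_space)
p_A_and_B = len(A & B) / len(sample_space)
print(p_A_and_B / p_A)   # (2/6) / (3/6) = 2/3
```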
As the power of a test increases, the risk of making a type II error is reduced.
E.g., suppose that, on the basis of sample results, the research team of an organisation claims that less than 50% of all customers like the new service started by the company, when the true proportion is, in fact, greater than 50%.
Key Differences Between Type I and Type II Error
The points given below are substantial so far as the differences between type I and type II error are concerned:
1. Type I error is an error that takes place when the outcome is a rejection
of null hypothesis which is, in fact, true. Type II error occurs when the
sample results in the acceptance of null hypothesis, which is actually
false.
2. A type I error is otherwise known as a false positive; in essence, the positive result is equivalent to the rejection of the null hypothesis. In contrast, a type II error is known as a false negative, i.e. a negative result, which leads to the acceptance of the null hypothesis.
3. When the null hypothesis is true but mistakenly rejected, it is type I
error. As against this, when the null hypothesis is false but erroneously
accepted, it is type II error.
4. Type I error tends to assert something that is not really present, i.e. it is
a false hit. On the contrary, type II error fails in identifying something,
that is present, i.e. it is a miss.
5. The probability of committing a type I error is the same as the level of significance. Conversely, the likelihood of committing a type II error is equal to one minus the power of the test.
6. The Greek letter ‘α’ denotes a type I error, whereas a type II error is denoted by the Greek letter ‘β’.
Types of Probability
There are three major types of probabilities:
Theoretical Probability
Experimental Probability
Axiomatic Probability
Theoretical Probability
It is based on the possible chances of something happening. The theoretical probability is mainly based on the reasoning behind probability. For example, if a coin is tossed, the theoretical probability of getting a head will be ½.
Experimental Probability
It is based on the observations of an experiment. The experimental probability can be calculated as the number of times an event occurs divided by the total number of trials. For example, if a coin is tossed 10 times and heads is recorded 6 times, then the experimental probability for heads is 6/10, or 3/5.
Axiomatic Probability
In axiomatic probability, a set of rules or axioms are set which applies to all
types. These axioms are set by Kolmogorov and are known as Kolmogorov’s
three axioms. With the axiomatic approach to probability, the chances of
occurrence or non-occurrence of the events can be quantified. The axiomatic
probability lesson covers this concept in detail with Kolmogorov’s three rules
(axioms) along with various examples.
Conditional Probability is the likelihood of an event or outcome occurring
based on the occurrence of a previous event or outcome.
Probability of an Event
Assume an event E can occur in r ways out of a total of n probable or possible equally likely ways. Then the probability of the event happening, or its success, is expressed as:
P(E) = r/n
The probability that the event will not occur, known as its failure, is expressed as:
P(E’) = (n-r)/n = 1-(r/n)
E’ represents that the event will not occur.
Therefore, now we can say;
P(E) + P(E’) = 1
This means that the total of all the probabilities in any random test or
experiment is equal to 1.
What Is a Confidence Interval?
A confidence interval, in statistics, refers to the probability that a population parameter will fall between a set of values for a certain proportion of times. Analysts often use confidence intervals that contain either 95% or 99% of expected observations.
Sampling error is one which occurs due to the unrepresentativeness of the sample selected for observation. Conversely, non-sampling error is an error that arises from human error, such as an error in problem identification or in the method or procedure used.
An ideal research design seeks to control various types of error, but there are
some potential sources which may affect it. In sampling theory, total error can
be defined as the variation between the mean value of population parameter
and the observed mean value obtained in the research. The total error can be
classified into two categories, i.e. sampling error and non-sampling error.
In this article excerpt, you can find the important differences between
sampling and non-sampling error in detail.
4. Cluster sampling
Cluster sampling also involves dividing the population into subgroups, but each
subgroup should have similar characteristics to the whole sample. Instead of
sampling individuals from each subgroup, you randomly select entire
subgroups.
If it is practically possible, you might include every individual from each
sampled cluster. If the clusters themselves are large, you can also sample
individuals from within each cluster using one of the techniques above. This is
called multistage sampling.
This method is good for dealing with large and dispersed populations, but
there is more risk of error in the sample, as there could be substantial
differences between clusters. It’s difficult to guarantee that the sampled
clusters are really representative of the whole population.
Example: Cluster sampling
The company has offices in 10 cities across the country (all with roughly the
same number of employees in similar roles). You don’t have the capacity to
travel to every office to collect your data, so you use random sampling to
select 3 offices – these are your clusters.
Non-probability sampling methods
In a non-probability sample, individuals are selected based on non-random
criteria, and not every individual has a chance of being included.
This type of sample is easier and cheaper to access, but it has a higher risk
of sampling bias. That means the inferences you can make about the
population are weaker than with probability samples, and your conclusions
may be more limited. If you use a non-probability sample, you should still aim
to make it as representative of the population as possible.
Non-probability sampling techniques are often used
in exploratory and qualitative research. In these types of research, the aim is
not to test a hypothesis about a broad population, but to develop an initial
understanding of a small or under-researched population.
1. Convenience sampling
A convenience sample simply includes the individuals who happen to be most
accessible to the researcher.
This is an easy and inexpensive way to gather initial data, but there is no way
to tell if the sample is representative of the population, so it can’t
produce generalizable results. Convenience samples are at risk for
both sampling bias and selection bias.
Example: Convenience sampling
You are researching opinions about student support services in your university,
so after each of your classes, you ask your fellow students to complete
a survey on the topic. This is a convenient way to gather data, but as you only
surveyed students taking the same classes as you at the same level, the sample
is not representative of all the students at your university.
2. Voluntary response sampling
Similar to a convenience sample, a voluntary response sample is mainly based
on ease of access. Instead of the researcher choosing participants and directly
contacting them, people volunteer themselves (e.g. by responding to a public
online survey).
Voluntary response samples are always at least somewhat biased, as some
people will inherently be more likely to volunteer than others, leading to self-
selection bias.
Example: Voluntary response sampling
You send out the survey to all students at your university and a lot of students
decide to complete it. This can certainly give you some insight into the topic,
but the people who responded are more likely to be those who have strong
opinions about the student support services, so you can’t be sure that their
opinions are representative of all students.
3. Purposive sampling
This type of sampling, also known as judgement sampling, involves the
researcher using their expertise to select a sample that is most useful to the
purposes of the research.
It is often used in qualitative research, where the researcher wants to gain
detailed knowledge about a specific phenomenon rather than make statistical
inferences, or where the population is very small and specific. An effective
purposive sample must have clear criteria and rationale for inclusion. Always
make sure to describe your inclusion and exclusion criteria and beware
of observer bias affecting your arguments.
Some research will do both of these things, but usually the research problem
focuses on one or the other. The type of research problem you choose
depends on your broad topic of interest and the type of research you think will
fit best.
This article helps you identify and refine a research problem. When writing
your research proposal or introduction, formulate it as a problem statement
and/or research questions.
Why is the research problem important?
Having an interesting topic isn’t a strong enough basis for academic research.
Without a well-defined research problem, you are likely to end up with an
unfocused and unmanageable project.
You might end up repeating what other people have already said, trying to say
too much, or doing research without a clear purpose and justification. You
need a clear problem in order to do research that contributes new and
relevant insights.
Whether you’re planning your thesis, starting a research paper, or writing a
research proposal, the research problem is the first step towards knowing
exactly what you’ll do and why.
Step 1: Identify a broad problem area
As you read about your topic, look for under-explored aspects or areas of
concern, conflict, or controversy. Your goal is to find a gap that your research
project can fill.
Practical research problems
If you are doing practical research, you can identify a problem by reading
reports, following up on previous research, or talking to people who work in
the relevant field or organization. You might look for:
Issues with performance or efficiency
Processes that could be improved
Areas of concern among practitioners
Difficulties faced by specific groups of people
Examples of practical research problems
Voter turnout in New England has been decreasing, in contrast to the rest of
the country.
The HR department of a local chain of restaurants has a high staff turnover
rate.
A non-profit organization faces a funding gap that means some of its programs
will have to be cut.
Theoretical research problems
If you are doing theoretical research, you can identify a research problem by
reading existing research, theory, and debates on your topic to find a gap in
what is currently known about it. You might look for:
A phenomenon or context that has not been closely studied
A contradiction between two or more perspectives
A situation or relationship that is not well understood
A troubling question that has yet to be resolved
Examples of theoretical research problems
The effects of long-term Vitamin D deficiency on cardiovascular health are not
well understood.
The relationship between gender, race, and income inequality has yet to be
closely studied in the context of the millennial gig economy.
Historians of Scottish nationalism disagree about the role of the British Empire
in the development of Scotland’s national identity.
Step 2: Learn more about the problem
Next, you have to find out what is already known about the problem, and
pinpoint the exact aspect that your research will address.
Context and background
Who does the problem affect?
Is it a newly-discovered problem, or a well-established one?
What research has already been done?
What, if any, solutions have been proposed?
Achievability: Confirm that your project is feasible within the timeline of your program or funding deadline.
In/0 = Gn / G0

or, expressed as a percentage,

In/0 = (Gn / G0) × 100

Properties of Simple Index Numbers
Identity: If two compared situations (or two periods) are identical, the value of the index number is equal to 1 (or 100 when expressed as a percentage).
food, clothing, fuel, and lighting, house rent, etc., govern the market for such
goods and services.
4. In Measuring Changes in Industrial Production:
Index numbers of industrial production measure the increase or decrease in industrial production in a given year as compared to the base year. We can know from such an index number the actual condition of different industries, whether production is increasing or decreasing in them, for an industrial index number measures changes in the quantity of production.
5. In Internal Trade:
The study of indices of the wholesale prices of consumer and industrial goods
and of industrial production helps commerce and industry in expanding or
decreasing internal trade.
6. In External Trade:
The foreign trade position of a country can be assessed on the basis of its export and import indices. These indices reveal whether the external trade of
the country is increasing or decreasing.
7. In Economic Policies:
Index numbers are helpful to the state in formulating and adopting
appropriate economic policies. Index numbers measure changes in such
magnitudes as prices, incomes, wages, production, employment, products,
exports, imports, etc. By comparing the index numbers of these magnitudes
for different periods, the government can know the present trend of economic
activity and accordingly adopt price policy, foreign trade policy and general
economic policies.
8. In Determining the Foreign Exchange Rate:
Index numbers of wholesale prices of two countries are used to determine their rate of foreign exchange. They are the basis of the purchasing power parity theory, which determines the exchange rate between two countries on an inconvertible paper standard.
Q1 = l + (h/f)(n/4 − C.F.) and Q3 = l + (h/f)(3n/4 − C.F.), while Q2 = median.

Where,
l = lower class boundary of the class containing Q1 or Q3, i.e. the class corresponding to the cumulative frequency in which n/4 or 3n/4 lies
h = class interval size of the class containing Q1 or Q3
f = frequency of the class containing Q1 or Q3
n = number of values, or the total frequency
C.F. = cumulative frequency of the class preceding the class containing Q1 or Q3
Deciles:
The values which divide an array into ten equal parts are called deciles. The first, second, …, ninth deciles are denoted by D1, D2, …, D9 respectively. The fifth decile (D5) corresponds to the median. The second, fourth, sixth and eighth deciles, which collectively divide the data into five equal parts, are called quintiles.
Deciles for Grouped Data:
Deciles for grouped data are calculated from the following formula:

Dk = l + (h/f)(kn/10 − C.F.), for k = 1, 2, …, 9

Where,
l = lower class boundary of the class containing Dk, i.e. the class corresponding to the cumulative frequency in which kn/10 lies
h = class interval size of the class containing Dk
f = frequency of the class containing Dk
n = number of values, or the total frequency
C.F. = cumulative frequency of the class preceding the class containing Dk
Percentiles:
The values which divide an array into one hundred equal parts are called percentiles. The first, second, …, ninety-ninth percentiles are denoted by P1, P2, …, P99. The 50th percentile (P50) corresponds to the median, the 25th percentile (P25) corresponds to the first quartile, and the 75th percentile (P75) corresponds to the third quartile.
Percentiles for Grouped Data:
Percentiles for grouped data are calculated from the following formula:

Pk = l + (h/f)(kn/100 − C.F.), for k = 1, 2, …, 99

Where,
l = lower class boundary of the class containing Pk, i.e. the class corresponding to the cumulative frequency in which kn/100 lies
h = class interval size of the class containing Pk
f = frequency of the class containing Pk
n = number of values, or the total frequency
C.F. = cumulative frequency of the class preceding the class containing Pk
Suppose you sample flowers from a garden and measure 25 petals of each species. You can test the difference between these two groups using a t test, with null and alternative hypotheses.
The null hypothesis (H0) is that the true difference between these group
means is zero.
The alternate hypothesis (Ha) is that the true difference is different from
zero.
What type of t test should I use?
When choosing a t test, you will need to consider two things: whether the
groups being compared come from a single population or two different
populations, and whether you want to test the difference in a specific
direction.
One-sample, two-sample, or paired t test?
If the groups come from a single population (e.g., measuring before and
after an experimental treatment), perform a paired t test. This is
a within-subjects design.
If the groups come from two different populations (e.g., two different
species, or people from two separate cities), perform a two-
sample t test (a.k.a. independent t test). This is a between-subjects
design.
If there is one group being compared against a standard value (e.g.,
comparing the acidity of a liquid to a neutral pH of 7), perform a one-
sample t test.
One-tailed or two-tailed t test?
If you only care whether the two populations are different from one
another, perform a two-tailed t test.
If you want to know whether one population mean is greater than or
less than the other, perform a one-tailed t test.
t test example
In your test of whether petal length differs by species:
Your observations come from two separate populations (separate
species), so you perform a two-sample t test.
You don’t care about the direction of the difference, only whether there
is a difference, so you choose to use a two-tailed t test.
T test formula
The formula for the two-sample t test (a.k.a. the Student's t test) is:

t = (x̄1 − x̄2) / √( s² (1/n1 + 1/n2) )

In this formula, t is the t value, x̄1 and x̄2 are the means of the two groups being compared, s² is the pooled standard error of the two groups, and n1 and n2 are the number of observations in each of the groups.
A larger t value shows that the difference between group means is greater
than the pooled standard error, indicating a more significant difference
between the groups.
You can compare your calculated t value against the values in a critical value
chart (e.g., Student’s t table) to determine whether your t value is greater than
what would be expected by chance. If so, you can reject the null hypothesis
and conclude that the two groups are in fact different.
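In practice this comparison is usually done with a library rather than a table lookup. A sketch with SciPy, using invented petal lengths for the two species:

```python
# Two-tailed, two-sample t test (equal variances assumed, as in the formula above).
from scipy import stats

species_a = [4.2, 4.5, 4.1, 4.8, 4.4, 4.6, 4.3, 4.7]
species_b = [5.1, 4.9, 5.3, 5.0, 5.4, 4.8, 5.2, 5.1]

t_stat, p_value = stats.ttest_ind(species_a, species_b)
print(t_stat, p_value)   # reject H0 if p_value is below the significance level
```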
What Is Analysis of Variance (ANOVA)?
Analysis of variance (ANOVA) is an analysis tool used in statistics that splits an
observed aggregate variability found inside a data set into two parts:
systematic factors and random factors. The systematic factors have a statistical
influence on the given data set, while the random factors do not. Analysts use
the ANOVA test to determine the influence that independent variables have on
the dependent variable in a regression study.
The t- and z-test methods developed in the 20th century were used for statistical analysis until 1918, when Ronald Fisher created the analysis of variance method. ANOVA is also called the Fisher analysis of variance, and it is the extension of the t- and z-tests. The term became well-known in 1925, after appearing in Fisher's book, "Statistical Methods for Research Workers." It was employed in experimental psychology and later expanded to subjects that were more complex.
Interpolation Formula
An unknown value between the data points can be found using linear interpolation or Lagrange's interpolation formula.
Interpolation Methods
There are different types of interpolation methods. They are:
Linear Interpolation Method – This method applies a distinct linear polynomial
between each pair of data points for curves, or within the sets of three points
for surfaces.
Nearest Neighbour Method – This method sets the value of an interpolated point to the value of the nearest data point. Therefore, this method does not produce any new data points.
Cubic Spline Interpolation Method – This method fits a different cubic
polynomial between each pair of data points for curves, or between sets of
three points for surfaces.
Shape-Preservation Method – This method is also known as Piecewise Cubic
Hermite Interpolation (PCHIP). It preserves the monotonicity and the shape of
the data. It is for curves only.
Thin-plate Spline Method – This method consists of smooth surfaces that also extrapolate well. It is for surfaces only.
Biharmonic Interpolation Method – This method is applied to the surfaces
only.
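As a small sketch of the first method in the list above, NumPy's interp performs linear interpolation between known points; the data are illustrative:

```python
# Linear interpolation: estimate y at x = 1.5 from known points.
import numpy as np

x_known = [0, 1, 2, 3]
y_known = [0, 10, 18, 24]

print(np.interp(1.5, x_known, y_known))   # 14.0, on the line joining (1, 10) and (2, 18)
```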
2. The rise and fall in the values should be uniform. For example, if we are
given data regarding rainfall in various years and some of the observations are
for the years in which El-Nino occurred, then interpolation methods are not
applicable.
3. When we apply calculus of finite differences, we assume that the given set
of observations is capable of being expressed in a polynomial form.
---------------------------------------------------------------------------------------
What is Research?
Research is the careful consideration of study regarding a particular concern or
problem using scientific methods. According to the American sociologist Earl
Robert Babbie, “research is a systematic inquiry to describe, explain, predict,
and control the observed phenomenon. It involves inductive and deductive
methods.”
Inductive methods analyze an observed event, while deductive methods verify
the observed event. Inductive approaches are associated with qualitative
research, and deductive methods are more commonly associated with
quantitative analysis.
Research is conducted with a purpose to:
Identify potential and new customers
Understand existing customers
Set pragmatic goals
Develop productive market strategies
Address business challenges
Put together a business expansion plan
Identify new business opportunities
What are the characteristics of research?
1. Good research follows a systematic approach to capture accurate data.
Researchers need to practice ethics and a code of conduct while making
observations or drawing conclusions.
Approach used: Unstructured | Structured | Highly structured
Conducted through: Asking questions | Asking questions | By using hypotheses
Types of Research
1. Descriptive Research
There are many methods through which this research is done: conducting meta-analysis, literary research or scientific trials, and gauging public opinion, among others.
3. Applied Research
When a business or say, the society is faced with an issue that needs an
immediate solution or resolution, Applied Research is the research type that
comes to the rescue.
The crux of Applied Research is to figure out the solution to a certain growing
practical issue.
4. Fundamental Research
5. Quantitative Research
Quantitative Research, as the name suggests, is based on the measurement of
a particular amount or quantity of a particular phenomenon. It focuses on
gathering and interpreting numerical data and can be adopted for discovering
any averages or patterns or for making predictions.
This form of research is number-based and is one of the two main research types. It makes use of tables, data and graphs to reach a conclusion. The outcomes generated from this research are measurable and can be repeated, unlike the outcomes of qualitative research. This research type is mainly adopted for scientific and field-based research.
Descriptive research - The study variables are analyzed and a summary of the same is sought.
6. Qualitative Research
As the name suggests, this form of Research is more concerned with the quality of a certain phenomenon; it dives into the “why” alongside the “what”.
For instance, let’s consider a gender neutral clothing store which has more
women visiting it than men.
Qualitative research would be determining why men are not visiting the store
by carrying out an in-depth interview of some potential customers in this
category.
This form of research is interested in getting to the bottom of the reasons for human behaviour, i.e. understanding why people take certain actions or why they think certain thoughts.
Through this research, the factors that influence people to behave in a certain way, or that control their preferences towards a certain thing, can be interpreted.
7. Conceptual Research
8. Empirical Research
This is a research method that focuses solely on aspects like observation and
experience, without focusing on the theory or system. It is based on data, and it can yield conclusions that can be confirmed or verified through observation
and experiment. Empirical Research is mainly undertaken to determine proof
that certain variables are affecting the others in a particular way.
The method for making a frequency table differs between the four types of
frequency distributions. You can follow the guides below or use software such
as Excel, SPSS, or R to make a frequency table.
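For instance, a basic frequency table can be sketched in Python with collections.Counter; the responses below are made up:

```python
# Build a simple frequency table (value, frequency, relative frequency).
from collections import Counter

responses = ["agree", "neutral", "agree", "disagree", "agree", "neutral"]

for value, count in Counter(responses).most_common():
    print(value, count, round(count / len(responses), 2))
```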