Fhca Notes
UNIT I
Introduction
Healthcare analytics involves the application of data analysis and insights in the
healthcare industry to improve patient outcomes, operational efficiency, and decision-
making processes.
Fundamental aspects:
Data
The raw material of statistics is data. For our purposes we may define data as
numbers. The two kinds of numbers that we use in statistics are numbers that result
from the taking—in the usual sense of the term—of a measurement, and those that
result from the process of counting. For example, when a nurse weighs a patient or
takes a patient’s temperature, a measurement is obtained.
Variable: A characteristic that takes on different values in different persons, places, or
things is called a variable. Examples include diastolic blood pressure, heart rate, the
heights of adult males, the weights of preschool children, and the ages of patients seen
in a dental clinic.
Quantitative Variables: A quantitative variable is one that can be measured in the
usual sense; measurements made on it convey information regarding amount.
Qualitative Variables: Measurements made on qualitative variables convey
information regarding attribute (category) rather than amount.
Random Variable: When the values obtained arise as a result of chance factors, so that
they cannot be exactly predicted in advance, the variable is called a random variable.
An example of a random variable is adult height.
Discrete Random Variable: Variables may be characterized further as to whether they
are discrete or continuous. A discrete variable is characterized by gaps or interruptions
in the values that it can assume.
Continuous Random Variable : A continuous random variable does not possess the
gaps or interruptions characteristic of a discrete random variable
Population: A population of entities is the largest collection of entities in which we
have an interest at a particular time. A population may consist of people, animals,
machines, places, or cells.
Sample: A sample may be defined simply as a part of a population.
Introduction to biostatistics
Biostatistics is the branch of statistics that deals with data related to living organisms,
health, and biology. It involves the application of statistical methods to design
experiments, collect, analyze, and interpret data in fields such as medicine, public
health, genetics, ecology, and more.
Here are some key aspects:
1. Study Design: Biostatisticians play a crucial role in designing experiments and
studies. They determine the sample size, randomization methods, and data
collection techniques to ensure that results are reliable and meaningful.
2. Data Collection: They collect data through various methods, such as surveys,
clinical trials, observations, or experiments. This data may include information
on diseases, treatments, genetics, environmental factors, and more.
3. Data Analysis: Once data is collected, biostatisticians use statistical methods to
analyze it. They employ techniques like hypothesis testing, regression analysis,
survival analysis, and more to draw conclusions and make inferences from the
data.
4. Interpretation: Biostatisticians interpret the results of their analyses, often
collaborating with researchers, doctors, or policymakers to understand the
implications of their findings. This interpretation guides decision-making in
healthcare, policy formulation, and scientific research.
5. Application in Public Health: Biostatistics plays a vital role in public health by
analyzing patterns of diseases, evaluating the effectiveness of interventions, and
predicting health outcomes in populations.
Elementary Properties of Probability
Given some process (or experiment) with n mutually exclusive outcomes E1, E2, . . . , En:
1. The probability of any event Ei is a nonnegative number; that is, P(Ei) ≥ 0.
2. The sum of the probabilities of the mutually exclusive outcomes is equal to 1. This is
the property of exhaustiveness and refers to the fact that the observer of a probabilistic
process must allow for all possible events, and when all are taken together, their total
probability is 1.
3. Consider any two mutually exclusive events, Ei and Ej. The probability of the
occurrence of either Ei or Ej is equal to the sum of their individual probabilities:
P(Ei or Ej) = P(Ei) + P(Ej)
Problem:
Suppose we pick a person at random from the 318 subjects. What is the probability
that this person will have an Early age at onset (E)?
Solution:
For purposes of illustrating the calculation of probabilities we consider this group of
318 subjects to be the largest group for which we have an interest. In other words, for
this example, we consider the 318 subjects as a population. We assume that Early and
Later are mutually exclusive categories and that the likelihood of selecting any one
person is equal to the likelihood of selecting any other person. We define the desired
probability as the number of subjects with the characteristic of interest (Early) divided
by the total number of subjects. We may write the result in probability notation as
follows:
P(E) = number of Early subjects / total number of subjects = 141/318 = 0.4434
Problem:
We wish to compute the joint probability of Early age at onset (E) and a negative family
history of mood disorders (A) from a knowledge of an appropriate marginal
probability and an appropriate conditional probability.
Solution:
The probability we seek is P(E ∩ A). We have the marginal probability
P(E) = 141/318 = 0.4434,
and the conditional probability P(A|E) = 28/141 = 0.1986.
By the multiplication rule,
P(E ∩ A) = P(E) · P(A|E) = (0.4434)(0.1986) = 0.0881
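As a quick check, the marginal, conditional, and joint probabilities above can be reproduced in Python using only the counts given in the example (318 subjects in total, 141 Early, 28 Early subjects with a negative family history):

```python
# Reproducing the probabilities from the 318-subject example
total = 318          # all subjects (treated as the population)
early = 141          # subjects with Early age at onset (E)
early_and_A = 28     # Early subjects with a negative family history (A)

p_E = early / total                  # marginal probability P(E)
p_A_given_E = early_and_A / early    # conditional probability P(A|E)
p_E_and_A = p_E * p_A_given_E        # multiplication rule: P(E and A)

print(round(p_E, 4), round(p_A_given_E, 4), round(p_E_and_A, 4))
# 0.4434 0.1986 0.0881
```

Note that P(E) · P(A|E) reduces exactly to 28/318, the joint count over the total.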
The Addition Rule
Given two events A and B, the probability that event A, or event B, or both occur is
equal to the probability that event A occurs, plus the probability that event B occurs,
minus the probability that the events occur simultaneously. The addition rule may be
written
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Independent Events
When P(A|B) = P(A), we say that A and B are independent events. The
multiplication rule for two independent events, then, may be written as
P(A ∩ B) = P(A) · P(B)
That is, if two events are independent, the probability of their joint occurrence is equal
to the product of the probabilities of their individual occurrences. When two events
with nonzero probabilities are independent, each of the following statements is true:
P(A|B) = P(A),  P(B|A) = P(B),  P(A ∩ B) = P(A) · P(B)
Marginal Probability
Given some variable that can be broken down into m categories designated by
A1, A2, . . . , Ai, . . . , Am and another jointly occurring variable that is broken down into n
categories designated by B1, B2, . . . , Bj, . . . , Bn, the marginal probability of Ai, P(Ai), is
equal to the sum of the joint probabilities of Ai with all the categories of B. That is,
P(Ai) = Σj P(Ai ∩ Bj), for all values of j
Bayes Theorem
Bayes’ theorem relates the two conditional probabilities P(A|B) and P(B|A):
P(A|B) = P(B|A) · P(A) / P(B), provided the probability of B is not zero.
Conditional Probability
The conditional probability of A given B is equal to the probability of A ∩ B divided by
the probability of B, provided the probability of B is not zero:
P(A|B) = P(A ∩ B) / P(B)
Problem:
Suppose we pick a subject at random from the 318 subjects and find that he is 18 years
or younger (E). What is the probability that this subject will be one who has no family
history of mood disorders (A)?
Solution: The total number of subjects is no longer of interest, since, with the selection
of an Early subject, the Later subjects are eliminated. We may define the desired
probability, then, as follows: What is the probability that a subject has no family history
of mood disorders (A), given that the selected subject is Early (E)? This is a conditional
probability and is written as P(A|E), in which the vertical line is read "given." The 141
Early subjects become the denominator of this conditional probability, and 28, the
number of Early subjects with no family history of mood disorders, becomes the
numerator. Our desired probability, then, is
P(A|E) = 28/141 = 0.1986
Problems
In an article appearing in the Journal of the American Dietetic Association, Holben et
al. (A-1) looked at food security status in families in the Appalachian region of southern
Ohio. The purpose of the study was to examine hunger rates of families with children in
a local Head Start program in Athens, Ohio. The survey instrument included the
18-question U.S. Household Food Security Survey Module for measuring hunger and
food security. In addition, participants were asked how many food assistance
programs they had used in the last 12 months. The table shows the number of food
assistance programs used by subjects in this sample. We wish to construct the
probability distribution of the discrete variable X, where X = number of food assistance
programs used by the study subjects.
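The frequency table itself is not reproduced in these notes, so the counts below are hypothetical placeholders; the sketch only shows the mechanics of building the probability distribution of a discrete variable X by dividing each frequency by the total:

```python
# Hypothetical frequency counts (the study's actual table is not shown here):
# x = number of food assistance programs used, value = number of subjects
freq = {1: 62, 2: 47, 3: 39, 4: 39, 5: 58, 6: 37, 7: 4, 8: 11}
total = sum(freq.values())

# P(X = x) = frequency / total; the probabilities must sum to 1
prob = {x: f / total for x, f in freq.items()}

# Cumulative distribution P(X <= x)
cum = {}
running = 0.0
for x in sorted(prob):
    running += prob[x]
    cum[x] = running

print(total, round(sum(prob.values()), 4))
```

The same two columns (relative frequency and cumulative relative frequency) are what the textbook's probability distribution table contains.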
Likelihood & odds
The likelihood function (likelihood) represents the probability of random variable
realizations conditional on particular values of the statistical parameters.
The likelihood is the chance, the possibility of doing or achieving something, and the
condition that can ensure success. The likelihood of a hypothesis (H) given some data
(D) is the probability of obtaining D given that H is true multiplied by an arbitrary
positive constant K:
L(H) = K × P(D|H)
In most cases, a hypothesis represents a value of a parameter in a statistical model,
such as the mean of a normal distribution. Because likelihood is not actually a
probability, it does not obey various rules of probability; for example, likelihoods need
not sum to 1. In the case of a conditional probability, P(D|H), the hypothesis is fixed
and the data are free to vary. Likelihood, however, is the opposite. The likelihood of a
hypothesis, L(H), is conditioned on the data, as if they are fixed while the hypothesis
can vary. Suppose a coin is flipped n times, and we observe x heads and n – x tails. The
probability of getting x heads in n flips is defined by the binomial distribution as
follows:
P(X = x | p) = C(n, x) p^x (1 − p)^(n − x)
where p is the probability of heads and the binomial coefficient,
C(n, x) = n! / (x! (n − x)!),
counts the number of ways to get x heads in n flips. For example, if x = 2 and n = 3, the
binomial coefficient is calculated as 3!/(2! × 1!), which is equal to 3; there are three
distinct ways to get two heads in three flips (i.e., head-head-tail, head-tail-head, tail-
head-head). Thus, the probability of getting two heads in three flips if p is .50 would be
.375 (3 × .502 × (1 – .50)1), or 3 out of 8.
If the coin is fair, so that p = .50, and we flip it 10 times, the probability of six heads and
four tails is
P(X = 6 | p = .50) = [10!/(6! × 4!)] (.50)^6 (1 − .50)^4 ≈ .21
If the coin is a trick coin, so that p = .75, the probability of six heads in 10 tosses is
P(X = 6 | p = .75) = [10!/(6! × 4!)] (.75)^6 (1 − .75)^4 ≈ .15
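The two binomial probabilities above can be checked with SciPy, and their ratio gives the likelihood ratio of the fair-coin hypothesis to the trick-coin hypothesis:

```python
from scipy.stats import binom

# Likelihood of the data (6 heads in 10 flips) under each hypothesis
L_fair = binom.pmf(6, 10, 0.50)    # approximately .21
L_trick = binom.pmf(6, 10, 0.75)   # approximately .15

# Likelihood ratio: fair-coin hypothesis vs. trick-coin hypothesis
ratio = L_fair / L_trick
print(round(L_fair, 2), round(L_trick, 2))
# 0.21 0.15
```

A ratio greater than 1 means the observed data are more probable under p = .50 than under p = .75.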
Likelihoods may seem overly restrictive because we have compared only two simple
statistical hypotheses in a single likelihood ratio. More generally, the likelihood can be
plotted as a curve over all possible values of the parameter, and the likelihood ratio of
any two hypotheses is simply the ratio of their heights on this likelihood curve.
Multiplying the binomial likelihood by a beta(a, b) prior suggests that we can interpret
the information contained in the prior as adding a certain amount of previous data
(i.e., a − 1 past successes and b − 1 past failures) to the data from our current
experiment. Because we are multiplying together terms with the same base, the
exponents can be added together in a final simplification step:
Posterior ∝ p^(x + a − 1) (1 − p)^(n − x + b − 1)
This final formula looks like our original beta distribution but with new shape
parameters equal to x + a and n – x + b. In other words, we started with the prior
distribution beta (a,b) and added the successes from the data, x, to a and the failures, n –
x, to b, and our posterior distribution is a beta(x + a,n – x + b) distribution.
consider the previous example of observing 60 heads in 100 flips of a coin. Imagine that
going into this experiment, we had some reason to believe the coin’s bias was within .20
of being fair in either direction; that is, we believed that p was likely within the range of
.30 to .70. We could choose to represent this information using the beta(25,25)
distribution shown as the dotted line. The likelihood function for the 60 flips is shown
as the dot-and-dashed line and is identical to that shown in the middle panel.
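A minimal sketch of this beta-binomial update in Python, for the case described above (beta(25, 25) prior, 60 heads in 100 flips, giving a beta(85, 65) posterior):

```python
from scipy.stats import beta

a, b = 25, 25     # beta(25, 25) prior: belief that p is near .50
x, n = 60, 100    # observed data: 60 heads in 100 flips

# Posterior is beta(x + a, n - x + b) = beta(85, 65)
posterior = beta(x + a, n - x + b)
print(round(posterior.mean(), 4))   # posterior mean = 85/150
```

The posterior mean, 85/150 ≈ 0.567, sits between the prior mean (0.5) and the sample proportion (0.6), as the conjugate update formula implies.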
The statistical distribution using an appropriate software tool – Python
The data is described in such a way that it can express some meaningful information
that can also be used to find some future trends. Describing and summarizing a
single variable is called univariate analysis. Describing a statistical relationship
between two variables is called bivariate analysis. Describing the statistical
relationship between multiple variables is called multivariate analysis.
There are two types of Descriptive Statistics:
1. Measures of central tendency
2. Measures of variability
The measures of central tendency include:
Mean
Median
Median Low
Median High
Mode
Mean
It is the sum of observations divided by the total number of observations; in other
words, the average. The mean() function returns the mean or average of the data
passed in its arguments. If the passed argument is empty, StatisticsError is raised.
Example: Python code to calculate mean
# Python code to demonstrate the working of
# mean()
# importing statistics to handle statistical
# operations
import statistics
# initializing list
li = [1, 2, 3, 3, 2, 2, 2, 1]
# using mean() to calculate average of list
# elements
print ("The average of list values is : ",end="")
print (statistics.mean(li))
Output
The average of list values is : 2
The median_low() function returns the median of the data in the case of an odd
number of elements; in the case of an even number of elements, it returns the lower of
the two middle elements. If the passed argument is empty, StatisticsError is raised.
# Python code to demonstrate the
# working of median_low()
# importing the statistics module
import statistics
# simple list of a set of integers
set1 = [1, 3, 3, 4, 5, 7]
# Print median of the data-set
# Median value may or may not
# lie within the data-set
print("Median of the set is %s" % (statistics.median(set1)))
# Print low median of the data-set
print("Low Median of the set is %s"
      % (statistics.median_low(set1)))
Output:
Median of the set is 3.5
Low Median of the set is 3
In Python, you can use various libraries such as NumPy, SciPy, and Matplotlib to
analyze data and determine the statistical distribution. Here's an example of how you
might find the distribution of a dataset using these libraries:
Firstly, let's generate some sample data. For demonstration purposes, we'll create a
dataset following a normal distribution.
This code snippet demonstrates:
Generating a dataset of 1000 data points following a normal distribution. Plotting a
histogram to visualize the distribution of the generated data.
Fitting a normal distribution curve to the data and plotting it over the histogram.
The stats.norm.fit() function in this example fits a normal distribution to the data
using maximum likelihood estimation, estimating the mean and standard deviation of
the distribution. You can replace 'norm' with other distribution names like 'gamma',
'expon', etc., to fit different distributions to your data.
This is a basic example, and in practice, you might need to preprocess and analyze
your data differently based on its characteristics and the specific analysis you're
conducting. But this should give you a starting point for determining the statistical
distribution of your data using Python.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# Generating a dataset with a normal distribution
np.random.seed(42)  # Setting seed for reproducibility
data = np.random.normal(loc=0, scale=1, size=1000)  # Mean=0, SD=1, 1000 points
# Plotting a histogram to visualize the distribution
plt.hist(data, bins=30, density=True, alpha=0.6)
# Fitting a normal distribution to the data and plotting it over the histogram
mu, sigma = stats.norm.fit(data)
x = np.linspace(data.min(), data.max(), 100)
plt.plot(x, stats.norm.pdf(x, mu, sigma))
plt.show()
The P-value is known as the probability value. It is defined as the probability of getting
a result that is either the same or more extreme than the actual observations. The P-
value is known as the level of marginal significance within the hypothesis testing that
represents the probability of occurrence of the given event. The P-value is used as an
alternative to the rejection point to provide the least significance at which the null
hypothesis would be rejected. If the P-value is small, then there is stronger evidence in
favour of the alternative hypothesis.
Definition :
A p value is the probability that the computed value of a test statistic is at least as
extreme as a specified value of the test statistic when the null hypothesis is true. Thus,
the p value is the smallest value of α for which we can reject the null hypothesis.
Generally, the level of statistical significance is often expressed in p-value and
the range between 0 and 1. The smaller the p-value, the stronger the evidence and
hence, the result should be statistically significant. Hence, the rejection of the null
hypothesis is highly possible, as the p-value becomes smaller. A statistician wants to
test the hypothesis H0: μ = 120 using the alternative hypothesis Hα: μ > 120 and
assuming that α = 0.05. For that, he took the sample values as n =40, σ = 32.17 and x̄ =
105.37. Determine the conclusion for this hypothesis?
Solution:
We know that the test statistic is z = (x̄ − μ)/(σ/√n). Substituting,
z = (105.37 − 120)/(32.17/√40) = −14.63/5.09 ≈ −2.88
Since the alternative hypothesis is Hα: μ > 120, only large positive values of z favour
Hα. Here z is strongly negative, so the sample provides no evidence in the direction of
the alternative, and we fail to reject H0 at α = 0.05.
• One-sided p-value: You can use this method of testing if a large or unexpected
change in the data makes only a small or no difference to your data set.
Typically, this is unusual, and you would use a two-sided p-value test instead.
• Two-sided p-value: You can use this method of testing if a large change in the
data would affect the outcome of the research and if the alternative hypothesis is
fairly general instead of specific. Most professionals use this method to ensure
they account for large changes in data.
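As a sketch, the z statistic and one-sided p-value for the μ = 120 problem above can be computed with SciPy:

```python
import math
from scipy.stats import norm

n, sigma, xbar, mu0 = 40, 32.17, 105.37, 120
z = (xbar - mu0) / (sigma / math.sqrt(n))   # approximately -2.88
p_one_sided = norm.sf(z)   # P(Z >= z), matching Ha: mu > 120
print(round(z, 2))
# -2.88
```

Because z is far in the wrong tail for Hα: μ > 120, the one-sided p-value is large (close to 1), and H0 is not rejected at α = 0.05.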
Chi-square Test
The chi-square distribution is the most frequently employed statistical technique for the
analysis of count or frequency data. A statistical test that is used to compare observed
and expected results. The Chi-square statistic compares the size of any discrepancies
between the expected results and actual results.
The test statistic is
X² = Σ [(Oi − Ei)² / Ei]
where Oi is the observed frequency for the ith category of the variable of interest and
Ei is the expected frequency for that category. X² is distributed approximately as χ²
with k − r degrees of freedom. The quantity X² will be small if the observed and
expected frequencies are close together and will be large if the differences are large.
The computed value of X² is compared with the tabulated value of χ² with k − r
degrees of freedom. The decision rule, then, is: Reject H0 if X² is greater than or equal
to the tabulated χ² for the chosen value of α.
Types of Chi-square
Tests of goodness-of-fit
Test of independence
Test of Homogeneity
Tests of goodness-of-fit
• The chi-square test for goodness-of-fit uses frequency data from a sample to test
hypotheses about the shape or proportions of a population.
• The data, called observed frequencies, simply count how many individuals from
the sample are in each category.
Problem:
Eye colour was recorded for a random sample of 40 persons: Blue 12, Brown 21,
Green 3, Others 4. The eye colour proportions in the population are Brown 80%,
Blue 10%, Green 2%, Others 8%. Is there any difference between the sample
proportions and the population proportions? Use α = 0.05.
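A sketch of how this goodness-of-fit problem can be checked with scipy.stats.chisquare, where the expected counts are the population proportions multiplied by the sample size of 40:

```python
from scipy.stats import chisquare

observed = [12, 21, 3, 4]                                 # Blue, Brown, Green, Others
expected = [0.10 * 40, 0.80 * 40, 0.02 * 40, 0.08 * 40]   # [4, 32, 0.8, 3.2]
stat, p = chisquare(f_obs=observed, f_exp=expected)
print(round(stat, 2))
# 26.03
```

Since 26.03 exceeds the tabulated χ² with 3 degrees of freedom at α = 0.05 (7.815), the sample proportions differ significantly from the population proportions.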
Problem:
A total of 1500 workers on 2 operators (A & B) were classified as deaf and non-deaf
according to the following table. Is there an association between deafness and type of
operator? Let α = 0.05.
Calculate the expected frequencies:
E = (Tr × Tc) / GT
where Tr is the row total, Tc is the column total, and GT is the grand total.
Steps:
1. Data
Of the 1500 workers, 1000 were on operator A, of whom 100 were deaf, while 500 were
on operator B, of whom 60 were deaf.
2. Assumption
• Sample is randomly selected from the population.
3. Hypothesis
• HO: there is no significant association between type of operator & deafness.
• HA: there is significant association between type of operator & deafness.
4. Level of significance: (α = 0.05)
• 5% chance factor effect area
• 95% influencing factor effect area
• d.f. (degrees of freedom) = (r − 1)(c − 1) = (2 − 1)(2 − 1) = 1
Tabulated χ² for d.f. = 1 at α = 0.05 is 3.841.
5. Apply a proper test of significance
6. Statistical decision
Calculated χ² (≈ 1.40) < tabulated χ² (3.841)
P > 0.05
7. Conclusion
We accept H0
HO may be true
There is no significant association between type of operator & deafness
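The 2×2 table above can also be analyzed with scipy.stats.chi2_contingency; correction=False is used here so the result matches the uncorrected hand calculation:

```python
from scipy.stats import chi2_contingency

# Rows: operator A, operator B; columns: deaf, not deaf
table = [[100, 900],
         [60, 440]]
stat, p, dof, expected = chi2_contingency(table, correction=False)
print(round(stat, 2), dof)
# 1.4 1
```

The computed χ² of about 1.40 is below the tabulated 3.841 for 1 d.f., and p > 0.05, so H0 (no association) is not rejected, agreeing with the conclusion above.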
When a 2×2 chi-square table has a zero cell (one of the four cells is zero), we cannot
apply the chi-square test because we have what is called complete dependence.
For an a×b chi-square table with a zero cell, we likewise cannot apply the test
unless we do proper categorization to get rid of the zero cell.
Hypothesis Testing
A hypothesis may be defined simply as a statement about one or more populations.
Statistical hypotheses are hypotheses that are stated in such a way that they may be
evaluated by appropriate statistical techniques.
Hypothesis Testing Steps
1. Data. The nature of the data that form the basis of the testing procedures must be
understood, since this determines the particular test to be employed
2. Assumptions : A general procedure is modified depending on the assumptions
3. Hypothesis: There are two statistical hypotheses involved in hypothesis testing, and
these should be stated explicitly. The null hypothesis is the hypothesis to be tested; it is
designated by the symbol H0. The alternative hypothesis, designated by the symbol
HA, is a statement of what we will believe is true if our sample data cause us to reject
the null hypothesis. Suppose, for example, we want to know whether we can conclude
that a certain population mean is not 50.
The null hypothesis is
H0: μ = 50 and the alternative is
HA: μ ≠ 50
Suppose we want to know if we can conclude that the population mean is
greater than
50. Our hypotheses are
H0: μ ≤ 50 HA: μ > 50
If we want to know if we can conclude that the population mean is less than 50, the
hypotheses are
H0: μ ≥50 HA: μ <50
4. Test statistic. The test statistic is some statistic that may be computed from the data
of the sample. As we will see, the test statistic serves as a decision maker, since the
decision to reject or not to reject the null hypothesis depends on the magnitude of the
test statistic. An example of a test statistic is the quantity
z = (x̄ − μ0) / (σ/√n)
where μ0 is a hypothesized value of a population mean.
5. Distribution of test statistic. The key to statistical inference is the sampling
distribution of the test statistic when the null hypothesis is true.
6. Decision rule. The decision rule tells us to reject the null hypothesis if the value of the
test statistic that we compute from our sample is one of the values in the rejection
region and to not reject the null hypothesis if the computed value of the test statistic is
one of the values in the nonrejection region
7. Calculation of test statistic. From the data contained in the sample we compute a
value of the test statistic and compare it with the rejection and nonrejection regions
that have already been specified.
8. Statistical decision. The statistical decision consists of rejecting or of not rejecting
the null hypothesis
It is rejected if the computed value of the test statistic falls in the rejection region, and it
is not rejected if the computed value of the test statistic falls in
the nonrejection region.
9. Conclusion.
If H0 is rejected, we conclude that HA is true.
If H0 is not rejected, we conclude that H0 may be true.
10. p values.
The p value is a number that tells us how unusual our sample results are,
given that the null hypothesis is true.
A p value indicating that the sample results are not likely to have occurred, if
the null hypothesis is true, provides justification for doubting the truth of the
null hypothesis.
Purpose of Hypothesis Testing
The purpose of hypothesis testing is to assist administrators and clinicians in making
decisions. The administrative or clinical decision usually depends on the statistical
decision. If the null hypothesis is rejected, the administrative or clinical decision usually
reflects this, in that the decision is compatible with the alternative hypothesis. The
reverse is usually true if the null hypothesis is not rejected. The administrative or
clinical decision, however, may take other forms, such as a decision to gather more
data.
Hypothesis Testing:
A single population mean
The testing of a hypothesis about a population mean under three different conditions:
(1) when sampling is from a normally distributed population of values with known
variance; (2) when sampling is from a normally distributed
population with unknown variance, and (3) when sampling is from a population that is
not normally distributed. When sampling is from a normally distributed population
and the population variance is known, the test statistic for testing H0: μ = μ0 is
z = (x̄ − μ0) / (σ/√n)
Problems:
1. Does the evidence support the idea that the average lecture consists of 3000
words if a random sample of the lectures of 16 professors had a mean of 3472
words, given the population standard deviation is 500 words? Use α = 0.01.
Assume that lecture lengths are approximately normally distributed. Show all
steps.
μ = 3000
σ = 500
𝐱̅ = 3472
n = 16
α = 0.01
1) Ho: μ = 3000
2) Ha : μ ≠ 3000
3) α = 0.01
4) Reject Ho if z < −2.576 or z > 2.576
5) z = (3472 − 3000)/(500/√16) = 3.78
6) Reject Ho, because 3.78 > 2.576
7) At α = 0.01, the population mean is not equal to 3000 words.
2. Suppose that scores on the Scholastic Aptitude Test form a normal distribution with
μ = 500 and σ = 100. A high school counselor has developed a special course designed
to boost SAT scores. A random sample of 16 students is selected to take the course and
then the SAT. The sample had an average score of x̄ = 544. Does the course boost SAT
scores? Test at α = 0.01. Show all steps.
μ = 500
σ = 100
𝐱̅ = 544
n = 16
α = 0.01
1) Ho: μ = 500
2) Ha : μ > 500
3) α = 0.01
4) Reject Ho if z > 2.326
5) z = (544 − 500)/(100/√16) = 1.76
6) Accept Ho, because 1.76 < 2.326
7) At α = 0.01, there is not sufficient evidence to conclude that the course boosts SAT
scores (that μ > 500).
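The z statistics for both problems can be verified with a small helper function, assuming the usual z = (x̄ − μ0)/(σ/√n) formula:

```python
import math

def z_stat(xbar, mu0, sigma, n):
    # z = (sample mean - hypothesized mean) / (sigma / sqrt(n))
    return (xbar - mu0) / (sigma / math.sqrt(n))

z1 = z_stat(3472, 3000, 500, 16)   # lecture-length problem: 472/125 = 3.776
z2 = z_stat(544, 500, 100, 16)     # SAT problem: 44/25 = 1.76
print(round(z1, 2), round(z2, 2))
# 3.78 1.76
```

Comparing each z against the critical value (±2.576 two-sided, 2.326 one-sided, both at α = 0.01) reproduces the decisions shown above.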
One-Sided Hypothesis Tests
Hypothesis test may be one-sided, in which case all the rejection region is in one or the
other tail of the distribution. Whether a one-sided or a two-sided test is used depends
on the nature of the question being asked by the researcher.
Problem
Researchers are interested in the mean age of a certain population. Let us say that they
are asking the following question: Can we conclude that the mean age of this
population is different from 30 years? Suppose, instead of asking if they could conclude
that μ ≠ 30, the researchers had asked: Can we conclude that μ < 30? To this question
we would reply that they can so conclude if they can reject the null hypothesis that μ ≥
30.
1. Data. See the previous example.
2. Assumptions. See the previous example.
3. Hypotheses.
H0: μ ≥ 30
HA: μ < 30
The inequality in the null hypothesis implies that the null hypothesis
consists of an infinite number of hypotheses.
5. Test statistic. The computed value of the test statistic is
z = −2.12
8. Statistical decision. We are able to reject the null hypothesis since
−2.12 < −1.645
9. Conclusion. We conclude that the mean age of this population is less than 30 years.
Problem:
1. Researchers wish to know if the data they have collected provide sufficient evidence
to indicate a difference in mean serum uric acid levels between normal individuals and
individuals with Down’s syndrome. The data consist of serum uric acid readings on 12
individuals with Down’s syndrome and 15 normal individuals. The means are
x̄1 = 4.5 mg/100 ml and x̄2 = 3.4 mg/100 ml.
We will say that the sample data do provide evidence that the population means are
not equal if we can reject the null hypothesis that the population means are equal. Let
us reach a conclusion by means of the ten-step hypothesis testing procedure.
1. Data. See problem statement.
2. Assumptions. The data constitute two independent simple random samples each
drawn from a normally distributed population with a variance equal to 1 for the
Down’s syndrome population and 1.5 for the normal population.
3. Hypotheses:
H0: μ1 − μ2 = 0  HA: μ1 − μ2 ≠ 0
4. Test statistic.
z = (x̄1 − x̄2) / √(σ1²/n1 + σ2²/n2)
5. Distribution of test statistic. When the null hypothesis is true, the test statistic follows
the standard normal distribution.
6. Decision rule. Let α = 0.05. The critical values of z are ±1:96.
Reject H0 unless −1.96 < zcomputed < 1.96.
7. Calculation of test statistic. z = (4.5 − 3.4)/√(1/12 + 1.5/15) = 2.57.
8. Statistical decision. Reject H0, since 2.57 > 1.96.
9. Conclusion. On the basis of these data, there is an indication that the two
population means are not equal.
10. p value. For this test, p = 0.0102.
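A sketch of the two-sample z computation in Python, using the means, variances, and sample sizes given above:

```python
import math
from scipy.stats import norm

x1bar, x2bar = 4.5, 3.4    # sample means (mg/100 ml)
var1, n1 = 1.0, 12         # Down's syndrome group
var2, n2 = 1.5, 15         # normal group

z = (x1bar - x2bar) / math.sqrt(var1 / n1 + var2 / n2)   # approximately 2.57
p = 2 * norm.sf(z)                                       # two-sided p-value
print(round(z, 2), round(p, 4))
# 2.57 0.0102
```

The doubled tail area reproduces the p value of 0.0102 quoted in step 10.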
Problem:
The purpose of a study by Wilkins et al. (A-28) was to measure the effectiveness of
recombinant human growth hormone (rhGH) on children with total body surface area
burns > 40 percent. In this study, 16 subjects received daily injections at home of rhGH.
At baseline, the researchers wanted to know the current levels of insulin-like growth
factor (IGF-I) prior to administration of rhGH. The sample variance of IGF-I levels (in
ng/ml) was 670.81. We wish to know if we may conclude from these data that the
population variance is not 600.
1. Data. See statement in the example.
2.Assumptions. The study sample constitutes a simple random sample from a
population of similar children. The IGF-I levels are normally distributed.
3. Hypotheses.
H0: σ² = 600  HA: σ² ≠ 600
4. Test statistic. The test statistic is
X² = (n − 1)s² / σ0²
5. Distribution of test statistic. When the null hypothesis is true, the test statistic is
distributed as x2 with n - 1 degrees of freedom.
6. Decision rule. Let α = 0.05. Critical values of χ² are 6.262 and 27.488. Reject H0
unless the computed value of the test statistic is between 6.262 and 27.488. The
rejection and nonrejection regions are shown in the figure.
7. Calculation of test statistic. X² = (15)(670.81)/600 = 16.77
8. Statistical decision. Do not reject H0 since 6.262 < 16.77 < 27.488.
9. Conclusion. Based on these data we are unable to conclude that the population
variance is not 600.
10. p value. The determination of the p value for this test is complicated by the fact that
we have a two-sided test and an asymmetric sampling distribution. When we have a
two-sided test and a symmetric sampling distribution such as the standard normal or t,
we may, as we have seen, double the one-sided p value. Problems arise when we
attempt to do this with an asymmetric sampling distribution such as the chi-square
distribution
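One common convention for a two-sided p value with an asymmetric distribution is to double the smaller tail area; this is only one convention, not the only one, and can be sketched as follows:

```python
from scipy.stats import chi2

n, s_sq, sigma0_sq = 16, 670.81, 600
stat = (n - 1) * s_sq / sigma0_sq   # = 16.77
df = n - 1                          # 15 degrees of freedom

# Double the smaller of the two tail areas around the computed statistic
p = 2 * min(chi2.cdf(stat, df), chi2.sf(stat, df))
print(round(stat, 2), p > 0.05)
```

Here the statistic falls well inside the nonrejection region (6.262, 27.488), so the p value is large and H0 is not rejected, consistent with step 8 above.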
Hypothesis Testing in Python
Imagine a woman in her seventies who has a noticeable tummy bump. Medical
professionals could presume the bulge is a fibroid. In this instance, our first
assumption (the null hypothesis) is that this woman has a fibroid, and our alternative
assumption is that she does not. We shall use the terms null hypothesis (beginning
assumption) and alternative hypothesis (countering assumption) to conduct
hypothesis testing. The next step is gathering the data samples we can use to validate
the null hypothesis. The following tests are commonly used:
T- Test: When comparing the mean values of two samples that specific characteristics
may connect, a t-test is performed to see if there exists a substantial difference. It is
typically employed when data sets, such as those obtained from tossing a coin 100
times and stored as results, would exhibit a normal distribution. It could have
unknown variances. The t-test is a method for evaluating hypotheses that allows you to
assess a population-applicable assumption.
Assumptions
o Each sample's data is randomly and uniformly distributed (iid).
o Each sample's data have a normal distribution.
o Every sample's data share the same variance.
T-tests are of two types: 1. one-sampled t-test and 2. two-sampled t-test.
One sample t-test: The One Sample t-test ascertains whether the sample mean differs
statistically from a known or hypothesized population mean. The One Sample t-test is
a parametric testing technique.
Example: You are determining if the average age of 10 people is 30 or otherwise. Check
the Python script below for the implementation.
Code
# Python program to implement T-Test on a sample of ages
# Importing the required libraries
from scipy.stats import ttest_1samp
import numpy as np
# Creating a sample of ages
ages = [45, 89, 23, 46, 12, 69, 45, 24, 34, 67]
print(ages)
# Calculating the mean of the sample
mean = np.mean(ages)
print(mean)
# Performing the T-Test
t_test, p_val = ttest_1samp(ages, 30)
print("P-value is: ", p_val)
# taking the threshold value as 0.05 or 5%
if p_val < 0.05:
    print("We can reject the null hypothesis")
else:
    print("We can accept the null hypothesis")
Output
[45, 89, 23, 46, 12, 69, 45, 24, 34, 67]
45.4
P-value is: 0.07179988272763554
We can accept the null hypothesis
Chi-Square test
The chi-square test of independence is a statistical method to determine if two
categorical variables have a significant correlation between them. Both variables
should be from the same population and they should be categorical, like Yes/No,
Male/Female, Red/Green, etc. For example, we can build a data set with observations
on people's ice-cream buying pattern and try to correlate the gender of a person with
the flavour of the ice-cream they prefer. If a correlation is found, we can plan an
appropriate stock of flavours by knowing the number of people of each gender
visiting. We use various functions in the SciPy library to carry out the chi-square test;
the snippet below plots the chi-square distribution itself for several degrees of freedom.
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 100)
fig,ax = plt.subplots(1,1)
linestyles = [':', '--', '-.', '-']
deg_of_freedom = [1, 4, 7, 6]
for df, ls in zip(deg_of_freedom, linestyles):
    ax.plot(x, stats.chi2.pdf(x, df), linestyle=ls, label='df=%d' % df)
plt.xlim(0, 10)
plt.ylim(0, 0.4)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Chi-Square Distribution')
plt.legend()
plt.show()
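To actually carry out the test of independence described above, scipy.stats.chi2_contingency can be applied to a contingency table of counts; the counts below are hypothetical, purely for illustration of the gender-versus-flavour idea:

```python
from scipy.stats import chi2_contingency

# Hypothetical observed counts: rows = gender, columns = preferred flavour
table = [[30, 10, 20],    # e.g. male: vanilla, strawberry, chocolate
         [20, 25, 15]]    # e.g. female
stat, p, dof, expected = chi2_contingency(table)
print(dof, p < 0.05)
# 2 True
```

The function returns the χ² statistic, the p-value, the degrees of freedom ((r − 1)(c − 1) = 2 here), and the table of expected counts under independence.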
Calculating a one-proportion z-test using the formula
z = (P − Po) / sqrt(Po(1 − Po)/n)
Where:
P: Observed sample proportion
Po: Hypothesized population proportion
n: Sample size
In this example, we set P to 0.86, Po to 0.80, and n to 100, and use these values to
calculate the one-proportion z-test in the Python programming language.
Code:
import math
P = 0.86
Po = 0.80
n = 100
a = (P-Po)
b = Po*(1-Po)/n
z = a/math.sqrt(b)
print(z)
Output:
1.4999999999999984