
MAHENDRA COLLEGE OF ENGINEERING

Salem-campus, Attur main road, Minnampalli, Salem-636 106

DEPARTMENT OF BIOMEDICAL ENGINEERING

BM3651- FUNDAMENTALS OF HEALTHCARE ANALYTICS

UNIT I

Introduction

Fundamentals of Healthcare Analytics- Overview

Healthcare analytics involves the application of data analysis and insights in the
healthcare industry to improve patient outcomes, operational efficiency, and decision-
making processes.

Fundamental aspects:

1. Data Collection: Gathering and managing various types of healthcare data,
including patient records, clinical data, administrative information, and financial
data.

2. Data Processing and Integration: Cleaning, organizing, and integrating data
from disparate sources to create a unified dataset. This often involves using tools
and techniques to handle structured and unstructured data.

3. Descriptive Analytics: Utilizing historical data to understand past trends and
patterns in healthcare, such as patient demographics, disease prevalence, and
resource utilization.

4. Predictive Analytics: Forecasting future outcomes or trends based on historical
data. This involves using statistical models and machine learning algorithms to
predict events like disease outbreaks, patient readmissions, or resource needs.

5. Prescriptive Analytics: Recommending actions or interventions based on
predictive analytics to optimize decision-making. For instance, suggesting the
most effective treatment plans or resource allocation strategies.

6. Performance Measurement: Evaluating and monitoring the effectiveness of
healthcare initiatives, interventions, and programs through key performance
indicators (KPIs) and metrics.

7. Privacy and Security: Ensuring compliance with regulations like HIPAA (Health
Insurance Portability and Accountability Act) to safeguard patient data and
maintain confidentiality.

8. Technological Tools: Using advanced technologies like artificial intelligence,
machine learning, big data analytics, and data visualization tools to derive
meaningful insights from healthcare data.

9. Clinical Decision Support: Providing clinicians with data-driven insights at the
point of care to aid in diagnosis, treatment planning, and personalized medicine.

10. Healthcare Economics: Analyzing the financial aspects of healthcare, including
cost analysis, revenue cycle management, and reimbursement models.

Data

The raw material of statistics is data. For our purposes we may define data as
numbers. The two kinds of numbers that we use in statistics are numbers that result
from the taking—in the usual sense of the term—of a measurement, and those that
result from the process of counting. For example, when a nurse weighs a patient or
takes a patient's temperature, a measurement is obtained.

Statistics: The meaning of statistics is implicit in the previous section. More
concretely, however, we may say that statistics is a field of study concerned with (1)
the collection, organization, summarization, and analysis of data; and (2) the drawing
of inferences about a body of data when only a part of the data is observed.
Sources of Data: Data are usually available from one or more of the following sources:
1. Routinely kept records
2. Surveys
3. Experiments
4. External sources

Variable: If, as we observe a characteristic, it takes on different values in different
persons, places, or things, we call the characteristic a variable. Examples of variables
include diastolic blood pressure, heart rate, the heights of adult males, the weights of
preschool children, and the ages of patients seen in a dental clinic.
Quantitative Variables: A quantitative variable is one that can be measured in the
usual sense, i.e., its values can be expressed numerically.
Qualitative Variables: Measurements made on qualitative variables convey
information regarding an attribute rather than a quantity.
Random Variable: Whenever the values obtained arise as a result of chance factors, so
that they cannot be exactly predicted in advance, the variable is called a random
variable. An example of a random variable is adult height.
Discrete Random Variable: Variables may be characterized further as to whether they
are discrete or continuous. A discrete variable is characterized by gaps or interruptions
in the values that it can assume.
Continuous Random Variable: A continuous random variable does not possess the
gaps or interruptions characteristic of a discrete random variable.
Population: A population of entities is the largest collection of entities for which we
have an interest at a particular time. A population may consist of people, animals,
machines, places, or cells.
Sample: A sample may be defined simply as a part of a population.
Introduction to biostatistics
Biostatistics is the branch of statistics that deals with data related to living organisms,
health, and biology. It involves the application of statistical methods to design
experiments, collect, analyze, and interpret data in fields such as medicine, public
health, genetics, ecology, and more.
Here are some key aspects:
1. Study Design: Biostatisticians play a crucial role in designing experiments and
studies. They determine the sample size, randomization methods, and data
collection techniques to ensure that results are reliable and meaningful.
2. Data Collection: They collect data through various methods, such as surveys,
clinical trials, observations, or experiments. This data may include information
on diseases, treatments, genetics, environmental factors, and more.
3. Data Analysis: Once data is collected, biostatisticians use statistical methods to
analyze it. They employ techniques like hypothesis testing, regression analysis,
survival analysis, and more to draw conclusions and make inferences from the
data.
4. Interpretation: Biostatisticians interpret the results of their analyses, often
collaborating with researchers, doctors, or policymakers to understand the
implications of their findings. This interpretation guides decision-making in
healthcare, policy formulation, and scientific research.
5. Application in Public Health: Biostatistics plays a vital role in public health by
analyzing patterns of diseases, evaluating the effectiveness of interventions, and
predicting health outcomes in populations.

Usage of biostatistics in health care

• Documentation of medical history of diseases.
• Planning and conduct of clinical studies.
• Evaluating the merits of different procedures.
• Providing methods for the definition of "normal" and "abnormal".

Role of Biostatistics in patient care


• In increasing awareness regarding diagnostic, therapeutic, and prognostic
uncertainties and providing rules of probability to delineate those uncertainties
• In providing methods to integrate chances with value judgments that could be
most beneficial to the patient
• In providing methods such as sensitivity, specificity, and predictive values that
help choose valid tests for patient assessment
• In providing tools such as scoring systems and expert systems that can help
reduce epistemic uncertainties
• In carrying out a valid and reliable health situation analysis, including the proper
summarization and interpretation of data.

COMPUTERS AND BIOSTATISTICAL ANALYSIS


The widespread use of computers has had a tremendous impact on health
sciences research in general and biostatistical analysis in particular. The necessity to
perform long and tedious arithmetic computations as part of the statistical analysis of
data lives only in the memory of those researchers and practitioners whose careers
antedate the so-called computer revolution.
The use of computers makes it possible for investigators to devote more time to
the improvement of the quality of raw data and the interpretation of the results. The
current prevalence of microcomputers and the abundance of available statistical
software programs have further revolutionized statistical computing. Computers
currently on the market are equipped with random number generating capabilities. As
an alternative to using printed tables of random numbers, investigators may use
computers to generate the random numbers they need.
Actually, the "random" numbers generated by most computers are in reality
pseudorandom numbers because they are the result of a deterministic formula.
Nevertheless, they serve satisfactorily for many practical purposes.
The usefulness of the computer in the health sciences is not limited to statistical
analysis. Computers play a pivotal role in biostatistical analysis, revolutionizing the
way researchers process, analyze, and interpret biological data. Here's how:
1. Data Processing: Computers efficiently handle large volumes of biological data,
such as DNA sequences, gene expressions, protein structures, and patient
records. They organize and preprocess this data for analysis.
2. Statistical Analysis: Software and algorithms perform complex statistical
analyses on biological data. This includes hypothesis testing, regression analysis,
survival analysis, and more. These analyses help researchers identify patterns,
correlations, and associations within biological datasets.
3. Machine Learning and AI: Computers utilize machine learning and artificial
intelligence techniques to identify subtle patterns within biological data that
might be challenging for humans to detect. These methods contribute to
predictive modeling, classification of diseases, drug discovery, and personalized
medicine.
4. Visualization: Computers generate visual representations, such as graphs,
charts, and 3D models, to help researchers interpret and communicate their
findings effectively.
5. Database Management: Databases store vast amounts of biological data,
and computers efficiently manage, update, and retrieve this information for
researchers, facilitating cross-study comparisons and meta-analyses.
6. High-Performance Computing: Complex computational tasks, like
molecular modeling, simulation of biological systems, or analyzing large-scale
genomic data, require high-performance computing. Supercomputers and
clusters of powerful machines enable these calculations within reasonable
time frames.
7. Reproducibility and Collaboration: Computers facilitate reproducibility in
research by allowing scientists to share code, algorithms, and methodologies.
Collaboration across geographical boundaries becomes easier through shared
platforms and cloud-based tools.
Introduction to probability
• A probability provides a quantitative description of the chances or likelihoods
associated with various outcomes.
• It provides a bridge between descriptive and inferential statistics.
The concept of objective probability may be categorized further under the headings of
(1) classical, or a priori, probability, and (2) the relative frequency, or a posteriori,
concept of probability.
Classical Probability:
If an event can occur in N mutually exclusive and equally likely ways, and if m of these
possess a trait E, the probability of the occurrence of E is equal to m/N.
If we read P(E) as "the probability of E," we may express this definition as
P(E) = m / N

Relative Frequency Probability: The relative frequency approach to probability depends
on the repeatability of some process and the ability to count the number of repetitions,
as well as the number of times that some event of interest occurs.
Definition:
If some process is repeated a large number of times, n, and if some resulting event with
the characteristic E occurs m times, the relative frequency of occurrence of E, m/n, will
be approximately equal to the probability of E:
P(E) = m / n
Subjective Probability
Under this concept of probability, one may evaluate the probability of an event that can
only happen once, for example, the probability that a cure for cancer will be discovered
within the next 10 years.
Bayesian Methods
Probabilities based on classical or relative frequency concepts are designed to allow for
decisions to be made solely on the basis of collected data, whereas Bayesian methods
also make use of what are known as prior probabilities and posterior probabilities.
Definition :
The prior probability of an event is a probability based on prior knowledge, prior
experience, or results derived from prior data collection activity. The posterior
probability of an event is a probability obtained by using new information to update or
revise a prior probability.
ELEMENTARY PROPERTIES OF PROBABILITY
The three properties are as follows.
1. Given some process (or experiment) with n mutually exclusive outcomes (called
events), E1, E2, . . . , En, the probability of any event Ei is assigned a nonnegative
number. That is,
P(Ei) ≥ 0
A key concept in the statement of this property is the concept of mutually
exclusive outcomes. Two events are said to be mutually exclusive if they cannot occur
simultaneously.

2. The sum of the probabilities of the mutually exclusive outcomes is equal to 1:
P(E1) + P(E2) + . . . + P(En) = 1
This is the property of exhaustiveness and refers to the fact that the observer of a
probabilistic process must allow for all possible events, and when all are taken
together, their total probability is 1.

3. Consider any two mutually exclusive events, Ei and Ej. The probability of the
occurrence of either Ei or Ej is equal to the sum of their individual probabilities:
P(Ei ∪ Ej) = P(Ei) + P(Ej)

CALCULATING THE PROBABILITY OF AN EVENT


When probabilities are calculated with a subset of the total group as the denominator,
the result is a conditional probability.
Joint Probability
Sometimes we want to find the probability that a subject picked at random from a
group of subjects possesses two characteristics at the same time. Such a probability is
referred to as a joint probability.
Problem:
The primary aim of a study by Carter et al. (A-1) was to investigate the effect of the age
at onset of bipolar disorder on the course of the illness. One of the variables
investigated was family history of mood disorders. Table shows the frequency of a
family history of mood disorders in the two groups of interest (Early age at onset
defined to be 18 years or younger and Later age at onset defined to be later than 18
years). Suppose we pick a person at random from this sample. What is the probability
that this person will be 18 years old or younger?

Solution:
For purposes of illustrating the calculation of probabilities we consider this group of
318 subjects to be the largest group for which we have an interest. In other words, for
this example, we consider the 318 subjects as a population. We assume that Early and
Later are mutually exclusive categories and that the likelihood of selecting any one
person is equal to the likelihood of selecting any other person. We define the desired
probability as the number of subjects with the characteristic of interest (Early) divided
by the total number of subjects. We may write the result in probability notation as
follows:

P(E) = number of Early subjects / total number of subjects = 141/318 = 0.4434
Problem:
What is the probability that a person picked at random from the 318 subjects will be
Early (E) and will be a person who has no family history of mood disorders (A)?
The probability we are seeking may be written in symbolic notation as P(E ∩ A), in
which the symbol ∩ is read either as "intersection" or "and." The statement E ∩ A
indicates the joint occurrence of conditions E and A. The number of subjects satisfying
both of the desired conditions is found in Table 3.4.1 at the intersection of the column
labeled E and the row labeled A and is seen to be 28. Since the selection will be made
from the total set of subjects, the denominator is 318. Thus, we may write the joint
probability as
P(E ∩ A) = 28/318 = 0.0881
The Multiplication Rule
A probability may be computed from other probabilities. For example, a joint
probability may be computed as the product of an appropriate marginal probability
and an appropriate conditional probability. This relationship is known as the
multiplication rule of probability.
The conditional probability of A given B is equal to the probability of A ∩ B divided by
the probability of B, provided the probability of B is not zero:
P(A│B) = P(A ∩ B) / P(B),  P(B) ≠ 0
Problem:
We wish to compute the joint probability of Early age at onset (E) and a negative family
history of mood disorders (A) from a knowledge of an appropriate marginal
probability and an appropriate conditional probability.
Solution:
The probability we seek is P(E ∩ A). We have already computed the marginal probability
P(E) = 141/318 = 0.4434,
and the conditional probability P(A│E) = 28/141 = 0.1986. Thus,
P(E ∩ A) = P(E) · P(A│E) = (0.4434)(0.1986) = 0.0881
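The arithmetic of the marginal, conditional, and multiplication rules can be checked with a short Python sketch (the counts 318, 141, and 28 are the ones quoted above):

Code:
n_total = 318        # all subjects
n_early = 141        # Early age at onset (E)
n_early_and_A = 28   # Early AND no family history of mood disorders (A)
p_E = n_early / n_total                 # marginal probability P(E) = 0.4434
p_A_given_E = n_early_and_A / n_early   # conditional probability P(A|E) = 0.1986
p_E_and_A = p_E * p_A_given_E           # multiplication rule: P(E ∩ A) = 0.0881
print(round(p_E, 4), round(p_A_given_E, 4), round(p_E_and_A, 4))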
The Addition Rule
Given two events A and B, the probability that event A, or event B, or both occur is
equal to the probability that event A occurs, plus the probability that event B occurs,
minus the probability that the events occur simultaneously. The addition rule may be
written as
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)

When events A and B cannot occur simultaneously, P(A ∪ B) is sometimes called
"exclusive or," and P(A ∩ B) = 0. When events A and B can occur simultaneously,
P(A ∪ B) is sometimes called "inclusive or," and we use the addition rule to calculate
P(A ∪ B).

Independent Events
If P(A│B) = P(A), we say that A and B are independent events. The multiplication rule
for two independent events may then be written as
P(A ∩ B) = P(A) · P(B)

That is, if two events are independent, the probability of their joint occurrence is equal
to the product of the probabilities of their individual occurrences. When two events
with nonzero probabilities are independent, each of the following statements is true:
P(A│B) = P(A),  P(B│A) = P(B),  P(A ∩ B) = P(A) · P(B)

Marginal Probability
Given some variable that can be broken down into m categories designated by
A1, A2, . . . , Ai, . . . , Am and another jointly occurring variable that is broken down into
n categories designated by B1, B2, . . . , Bj, . . . , Bn, the marginal probability of Ai, P(Ai),
is equal to the sum of the joint probabilities of Ai with all the categories of B. That is,
P(Ai) = Σ P(Ai ∩ Bj), summed over all values of j

Bayes Theorem
Conditional Probability
The conditional probability of A given B is equal to the probability of A ∩ B divided by
the probability of B, provided the probability of B is not zero:
P(A│B) = P(A ∩ B) / P(B)
Problem:
Suppose we pick a subject at random from the 318 subjects and find that he is 18 years
or younger (E). What is the probability that this subject will be one who has no family
history of mood disorders (A)?
Solution: The total number of subjects is no longer of interest, since, with the selection
of an Early subject, the Later subjects are eliminated. We may define the desired
probability, then, as follows: What is the probability that a subject has no family history
of mood disorders (A), given that the selected subject is Early (E)? This is a conditional
probability and is written as P(A│E), in which the vertical line is read "given." The 141
Early subjects become the denominator of this conditional probability, and 28, the
number of Early subjects with no family history of mood disorders, becomes the
numerator. Our desired probability, then, is

P(A│E) = 28/141 = 0.1986

Problems
In an article appearing in the Journal of the American Dietetic Association, Holben et
al. (A-1) looked at food security status in families in the Appalachian region of southern
Ohio. The purpose of the study was to examine hunger rates of families with children in
a local Head Start program in Athens, Ohio. The survey instrument included the
18-question U.S. Household Food Security Survey Module for measuring hunger and
food security. In addition, participants were asked how many food assistance
programs they had used in the last 12 months. Table shows the number of food
assistance programs used by subjects in this sample. We wish to construct the
probability distribution of the discrete variable X, where X = number of food assistance
programs used by the study subjects.
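Since the original frequency table is not reproduced here, the following Python sketch uses hypothetical placeholder counts purely to illustrate how such a probability distribution is constructed (relative frequency = frequency / total):

Code:
# HYPOTHETICAL counts, not the Holben et al. data:
# x = number of food assistance programs used -> frequency
counts = {0: 20, 1: 35, 2: 30, 3: 15}
n = sum(counts.values())
distribution = {x: f / n for x, f in counts.items()}
for x, p in distribution.items():
    print("P(X = %d) = %.4f" % (x, p))
print(sum(distribution.values()))  # the probabilities sum to 1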
Likelihood & odds
The likelihood function (likelihood) represents the probability of random variable
realizations conditional on particular values of the statistical parameters.
Informally, the likelihood measures how well a hypothesis explains the observed data.
The likelihood of a hypothesis (H) given some data (D) is the probability of obtaining
D given that H is true multiplied by an arbitrary positive constant K:
L(H) = K × P(D|H)
In most cases, a hypothesis represents a value of a parameter in a statistical model,
such as the mean of a normal distribution. Because likelihood is not actually a
probability, it does not obey various rules of probability; for example, likelihoods need
not sum to 1. In the case of a conditional probability, P(D|H), the hypothesis is fixed
and the data are free to vary. Likelihood, however, is the opposite. The likelihood of a
hypothesis, L(H), is conditioned on the data, as if they are fixed while the hypothesis
can vary. Suppose a coin is flipped n times, and we observe x heads and n – x tails. The
probability of getting x heads in n flips is defined by the binomial distribution as
follows:
P(X = x | p) = C(n, x) p^x (1 − p)^(n−x)
where p is the probability of heads and the binomial coefficient,
C(n, x) = n! / (x! (n − x)!),

counts the number of ways to get x heads in n flips. For example, if x = 2 and n = 3, the
binomial coefficient is calculated as 3!/(2! × 1!), which is equal to 3; there are three
distinct ways to get two heads in three flips (i.e., head-head-tail, head-tail-head, tail-
head-head). Thus, the probability of getting two heads in three flips if p is .50 would be
.375 (3 × .50^2 × (1 – .50)^1), or 3 out of 8.
If the coin is fair, so that p = .50, and we flip it 10 times, the probability of six heads and
four tails is
P(X = 6 | p = .50) = (10! / (6! × 4!)) (.50)^6 (1 − .50)^4 ≈ .21
If the coin is a trick coin, so that p = .75, the probability of six heads in 10 tosses is
P(X = 6 | p = .75) = (10! / (6! × 4!)) (.75)^6 (1 − .75)^4 ≈ .15
Likelihoods may seem overly restrictive because we have compared only two simple
statistical hypotheses in a single likelihood ratio. The likelihood ratio of any two
hypotheses is simply the ratio of the heights of the likelihood function at those two
parameter values.
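A small Python sketch (using SciPy's binomial probability mass function) reproduces the two likelihoods above and their ratio:

Code:
from scipy.stats import binom
x, n = 6, 10                      # observed heads and number of flips
L_fair = binom.pmf(x, n, 0.50)    # ≈ .21
L_trick = binom.pmf(x, n, 0.75)   # ≈ .15
print(L_fair, L_trick)
print(L_fair / L_trick)           # likelihood ratio ≈ 1.4 in favour of p = .50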

Likelihood functions can be formed for many common distributions, such as the
normal, chi-square, binomial, Poisson, and uniform distributions. Likelihoods are also a key component of
Bayesian inference. The Bayesian approach to statistics is fundamentally about making
use of all available information when drawing inferences in the face of
uncertainty. Previous information is quantified using what is known as a prior
distribution. Mathematically, a well-known conditional-probability theorem states that
the procedure for obtaining the posterior distribution of θ is as follows:
𝑷(𝜽|𝑫) = 𝑲 × 𝑷(𝜽) × 𝑷(𝑫|𝜽)
In this context, K is merely a rescaling constant and is equal to 1/P(D). We often write
this theorem more simply as
𝑷(𝜽|𝑫) ∝ 𝑷(𝜽) × 𝑷(𝑫|𝜽)
where ∝ means "is proportional to."
Conjugate distributions are convenient in that they reduce Bayesian updating to some
simple algebra. We begin with the formula for the binomial likelihood function
Likelihood ∝ p^x (1 − p)^(n−x)
and then multiply it by the formula for the beta prior with a and b shape parameters,
Prior ∝ p^(a−1) (1 − p)^(b−1)
to obtain the following formula for the posterior distribution:
Posterior ∝ [p^(a−1) (1 − p)^(b−1)] × [p^x (1 − p)^(n−x)]
                 (Prior)                  (Likelihood)

which suggests that we can interpret the information contained in the prior as adding a
certain amount of previous data (i.e., a – 1 past successes and b – 1 past failures) to the
data from our current experiment. Because we are multiplying together terms with the
same base, the exponents can be added together in a final simplification step:
Posterior ∝ p^(x+a−1) (1 − p)^(n−x+b−1)
This final formula looks like our original beta distribution but with new shape
parameters equal to x + a and n – x + b. In other words, we started with the prior
distribution beta (a,b) and added the successes from the data, x, to a and the failures, n –
x, to b, and our posterior distribution is a beta(x + a,n – x + b) distribution.
Consider the previous example of observing 60 heads in 100 flips of a coin. Imagine that
going into this experiment, we had some reason to believe the coin's bias was within .20
of being fair in either direction; that is, we believed that p was likely within the range of
.30 to .70. We could choose to represent this information using a beta(25, 25) prior
distribution. The likelihood function for the 60 flips is the binomial likelihood with
x = 60 and n = 100.
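The conjugate update above reduces to simple arithmetic, as this Python sketch shows (beta(25, 25) prior, 60 heads in 100 flips, giving a beta(85, 65) posterior):

Code:
from scipy.stats import beta
a_prior, b_prior = 25, 25
x, n = 60, 100                    # observed heads and total flips
a_post = x + a_prior              # 85
b_post = (n - x) + b_prior        # 65
posterior = beta(a_post, b_post)
print(posterior.mean())           # posterior mean of p ≈ 0.567
print(posterior.interval(0.95))   # central 95% posterior interval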
The statistical distribution using an appropriate software tool – Python
The data is described in such a way that it can express some meaningful information
that can also be used to find some future trends. Describing and summarizing a
single variable is called univariate analysis. Describing a statistical relationship
between two variables is called bivariate analysis. Describing the statistical
relationship between multiple variables is called multivariate analysis.
There are two types of Descriptive Statistics:
• The measure of central tendency
• The measure of variability

Measure of Central Tendency

The measure of central tendency is a single value that attempts to describe the whole
set of data. The main measures of central tendency (together with the Python
statistics functions that compute them) are:

• Mean
• Median
• Median Low
• Median High
• Mode

Mean
It is the sum of the observations divided by the total number of observations, i.e., the
average (sum divided by count). The mean() function returns the mean or average of
the data passed in its arguments. If the passed argument is empty, StatisticsError is
raised.
Example: Python code to calculate mean
# Python code to demonstrate the working of mean()
# importing the statistics module to handle statistical operations
import statistics

# initializing the list
li = [1, 2, 3, 3, 2, 2, 2, 1]

# using mean() to calculate the average of the list elements
print("The average of list values is : ", end="")
print(statistics.mean(li))

Output
The average of list values is : 2

The median_low() function returns the median of data in case of odd number of
elements, but in case of even number of elements, returns the lower of two middle
elements. If the passed argument is empty, StatisticsError is raised
# Python code to demonstrate the working of median_low()
# importing the statistics module
import statistics

# simple list of a set of integers
set1 = [1, 3, 3, 4, 5, 7]

# Print the median of the data-set.
# The median value may or may not lie within the data-set.
print("Median of the set is %s" % (statistics.median(set1)))

# Print the low median of the data-set
print("Low Median of the set is %s" % (statistics.median_low(set1)))
Output:
Median of the set is 3.5
Low Median of the set is 3

In Python, you can use various libraries such as NumPy, SciPy, and Matplotlib to
analyze data and determine the statistical distribution. Here's an example of how you
might find the distribution of a dataset using these libraries:
Firstly, let's generate some sample data. For demonstration purposes, we'll create a
dataset following a normal distribution.
This code snippet demonstrates:
Generating a dataset of 1000 data points following a normal distribution. Plotting a
histogram to visualize the distribution of the generated data.
Fitting a normal distribution curve to the data and plotting it over the histogram.
The stats.norm.fit() function in this example fits a normal distribution to the data
using maximum likelihood estimation, estimating the mean and standard deviation of
the distribution. You can replace 'norm' with other distribution names like 'gamma',
'expon', etc., to fit different distributions to your data.
This is a basic example, and in practice, you might need to preprocess and analyze
your data differently based on its characteristics and the specific analysis you're
conducting. But this should give you a starting point for determining the statistical
distribution of your data using Python.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Generating a dataset with a normal distribution
np.random.seed(42)  # Setting seed for reproducibility
data = np.random.normal(loc=0, scale=1, size=1000)  # Mean=0, SD=1, 1000 data points

# Plotting a histogram to visualize the distribution
plt.hist(data, bins=30, density=True, alpha=0.5, color='blue')
plt.title('Histogram of Sample Data')
plt.xlabel('Values')
plt.ylabel('Density')

# Fitting a distribution to the data.
# You can try different distributions like 'norm' for the normal
# distribution, 'gamma', 'expon', etc.
param = stats.norm.fit(data)  # Fitting a normal distribution to the data
x = np.linspace(min(data), max(data), 100)
pdf_fitted = stats.norm.pdf(x, *param)
plt.plot(x, pdf_fitted, 'r-', linewidth=2)
plt.show()
UNIT II STATISTICAL PARAMETERS

Statistical parameters p-values

The P-value is known as the probability value. It is defined as the probability of getting
a result that is either the same or more extreme than the actual observations. The P-
value is known as the level of marginal significance within the hypothesis testing that
represents the probability of occurrence of the given event. The P-value is used as an
alternative to the rejection point to provide the least significance at which the null
hypothesis would be rejected. If the P-value is small, then there is stronger evidence in
favour of the alternative hypothesis.

P-value Table

The P-value table shows the conventional hypothesis interpretations:
• P > 0.05: not statistically significant; the evidence against the null hypothesis is weak
• P ≤ 0.05: statistically significant; the null hypothesis is rejected
• P ≤ 0.01: highly statistically significant; strong evidence against the null hypothesis

Definition:
A p value is the probability that the computed value of a test statistic is at least as
extreme as a specified value of the test statistic when the null hypothesis is true. Thus,
the p value is the smallest value of α for which we can reject a null hypothesis.
Generally, the level of statistical significance is often expressed as a p-value in the
range between 0 and 1. The smaller the p-value, the stronger the evidence and, hence,
the more statistically significant the result. The rejection of the null hypothesis
becomes more likely as the p-value becomes smaller.

Problem: A statistician wants to test the hypothesis H0: μ = 120 using the alternative
hypothesis Hα: μ > 120 and assuming that α = 0.05. For that, he took the sample values
as n = 40, σ = 32.17, and x̄ = 105.37. Determine the conclusion for this hypothesis.
Solution:
We know that the test statistic is
z = (x̄ – μ) / (σ/√n)
Now substitute the given values. The standard error of the mean is
σ/√n = 32.17/√40 = 5.0865
Using the test statistic formula, we get
z = (105.37 – 120) / 5.0865
Therefore, z = –2.8762
Since the alternative hypothesis is one-sided (μ > 120), the P-value is P(z > –2.8762).
From the Z-score table, we get
P(z < –2.8762) = P(z > 2.8762) = 0.003
Therefore,
P(z > –2.8762) = 1 – 0.003 = 0.997
P-value = 0.997 > 0.05
Therefore, since p > 0.05, the null hypothesis is accepted (we fail to reject it). Hence, the
conclusion is "fail to reject H0."
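The same computation can be checked in Python (a sketch using SciPy's normal distribution; norm.sf gives the upper-tail probability):

Code:
import math
from scipy.stats import norm
x_bar, mu0, sigma, n = 105.37, 120, 32.17, 40
se = sigma / math.sqrt(n)     # ≈ 5.0865
z = (x_bar - mu0) / se        # ≈ -2.8762
p_value = norm.sf(z)          # P(z > -2.8762) ≈ 0.998 (the table above gives 0.997)
print(z, p_value)             # p > 0.05, so we fail to reject H0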

There are two types of p-value you can use:

• One-sided p-value: You can use this method of testing if a large or unexpected
change in the data makes only a small or no difference to your data set.
Typically, this is unusual and you can use a two-sided p-value test instead.
• Two-sided p-value: You can use this method of testing if a large change in the
data would affect the outcome of the research and if the alternative hypothesis is
fairly general instead of specific. Most professionals use this method to ensure
they account for large changes in data.

Chi-square Test
The chi-square distribution is the most frequently employed statistical technique for the
analysis of count or frequency data. A statistical test that is used to compare observed
and expected results, the chi-square statistic measures the size of any discrepancies
between the expected results and the actual results:
X² = Σ (Oi – Ei)² / Ei
X² is distributed approximately as χ² with k – r degrees of freedom, where
Oi is the observed frequency for the ith category of the variable of interest,
and Ei is the expected frequency.

Applications of Chi-square test:

1. Goodness-of-fit
2. The 2 x 2 chi-square test (contingency table, fourfold table)
3. The a x b chi-square test (r x c chi-square test)
Steps of Chi-square hypothesis testing
1. Data: counts or proportions.
2. Assumption: random sample selected from a population.
3. HO: no significant difference in proportion, no significant association.
HA: significant difference in proportion, significant association.
4. Level of significance:
• d.f. for the 1st application = k – 1 (k is the number of groups)
• d.f. for the 2nd and 3rd applications = (columns – 1)(rows – 1)
• In the 2nd application (contingency table),
d.f. = 1 and the tabulated chi-square = 3.841 always
• The graph is one-sided (only +ve)
5. Apply the appropriate test of significance
6. Statistical decision
7. Conclusion
• If calculated chi < tabulated chi, P > 0.05, accept HO (it may be true)
• If calculated chi > tabulated chi, P < 0.05, reject HO and accept HA.

The Decision Rule

The quantity
X² = Σ (Oi – Ei)² / Ei
will be small if the observed and expected frequencies are close together and will be
large if the differences are large.
The computed value of X² is compared with the tabulated value of X² with k – r
degrees of freedom. The decision rule, then, is: Reject H0 if X² is greater than or equal to
the tabulated X² for the chosen value of α.
Types of Chi-square
• Tests of goodness-of-fit
• Test of independence
• Test of homogeneity
Tests of goodness-of-fit
• The chi-square test for goodness-of-fit uses frequency data from a sample to test
hypotheses about the shape or proportions of a population.
• The data, called observed frequencies, simply count how many individuals from
the sample are in each category.

Problem:
In a random sample of 40 persons, the observed eye colours are: blue 12, brown 21,
green 3, others 4. In the population, the eye-colour proportions are: brown 80%, blue
10%, green 2%, others 8%. Is there any difference between the proportions of the
sample and those of the population? Use α = 0.05.

Expected blue = 10/100 × 40 = 4
Expected brown = 80/100 × 40 = 32
Expected green = 2/100 × 40 = 0.8
Expected others = 8/100 × 40 = 3.2
Steps:
1. Data
Represents the eye colour of 40 persons in the following distribution:
brown = 21 persons, blue = 12 persons, green = 3, others = 4
2. Assumption
The sample is randomly selected from the population.
3. Hypothesis
• Null hypothesis: there is no significant difference in the proportions of eye colour
of the sample to those of the population
• Alternative hypothesis: there is a significant difference in the proportions of eye
colour of the sample to those of the population

4. Level of significance (α = 0.05)

• 5% chance factor effect area, 95% influencing factor effect area
• d.f. (degrees of freedom) = K – 1 (K = number of subgroups) = 4 – 1 = 3
• Tabulated chi-square for α = 0.05 with 3 d.f. = 7.81

5. Apply a proper test of significance

X² = (12 – 4)²/4 + (21 – 32)²/32 + (3 – 0.8)²/0.8 + (4 – 3.2)²/3.2
   = (64/4) + (121/32) + (4.84/0.8) + (0.64/3.2)
   = 16 + 3.78 + 6.05 + 0.20
Calculated chi-square = 26.03
6. Statistical decision:
Calculated chi-square (26.03) > tabulated chi-square (7.81), P < 0.05
7. Conclusion
We reject H0 and accept HA: there is a significant difference in the proportions of eye
colour of the sample to those of the population.
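The same goodness-of-fit test can be run with SciPy (a sketch; scipy.stats.chisquare takes the observed and expected counts and returns the statistic and p-value):

Code:
from scipy.stats import chisquare
observed = [12, 21, 3, 4]         # blue, brown, green, others
expected = [4.0, 32.0, 0.8, 3.2]  # 10%, 80%, 2%, 8% of 40
stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(stat, p_value)              # chi-square ≈ 26.03, p < 0.05: reject H0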

The Chi-Square Test for Independence
The second chi-square test, the chi-square test for independence, can be used and
interpreted in two different ways:
• Testing hypotheses about the relationship between two variables in a
population, or (2×2)
• Testing hypotheses about differences between proportions for two or more
populations.(a×b)
• The data, called observed frequencies, simply show how many individuals from
the sample are in each cell of the matrix.
• The null hypothesis for this test states that there is no relationship between the
two variables; that is, the two variables are independent.

2x2 chi-square (contingency table)

Expected value:
E = Tr × Tc / GT (row total × column total / grand total)
d.f. = (r – 1)(c – 1) = 1; tabulated chi-square at α = 0.05 = 3.841

Problem:
A total of 1500 workers on 2 operators (A and B) were classified as deaf and non-deaf
according to the following table. Is there an association between deafness and type of
operator? Let α = 0.05.

             Deaf    Non-deaf    Total
Operator A    100       900       1000
Operator B     60       440        500
Total         160      1340       1500

Calculate:
E = Tr × Tc / GT
Steps:
1. Data
Represents 1500 workers: 1000 on operator A, 100 of them deaf, and 500 on
operator B, 60 of them deaf.
2. Assumption
• Sample is randomly selected from the population.
3. Hypothesis
• HO: there is no significant association between type of operator & deafness.
• HA: there is significant association between type of operator & deafness.
4. Level of significance (α = 0.05):
• 5% chance factor effect area
• 95% influencing factor effect area
• d.f. (degrees of freedom) = (r – 1)(c – 1) = (2 – 1)(2 – 1) = 1
Tabulated chi-square for 1 d.f. at α = 0.05 = 3.841
5. Apply a proper test of significance

X² = (100 – 106.7)²/106.7 + (900 – 893.3)²/893.3 + (60 – 53.3)²/53.3 + (440 – 446.7)²/446.7
   = 0.42 + 0.05 + 0.84 + 0.10
   = 1.41

6. Statistical decision
Calculated chi-square (1.41) < tabulated chi-square (3.841), P > 0.05
7. Conclusion
We accept H0 (H0 may be true):
there is no significant association between type of operator and deafness.
When a 2x2 chi-square test has a zero cell (one of the four cells is zero), we cannot apply
the chi-square test because we have what is called complete dependence. For the
a x b chi-square test, if one of the cells is zero, we cannot apply the test unless we do a
proper re-categorization to get rid of the zero cell.
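For the 2x2 example above, SciPy's chi2_contingency reproduces the hand calculation (a sketch; correction=False disables the Yates continuity correction so the result matches the formula used above):

Code:
import numpy as np
from scipy.stats import chi2_contingency
table = np.array([[100, 900],    # operator A: deaf, not deaf
                  [60, 440]])    # operator B: deaf, not deaf
stat, p_value, df, expected = chi2_contingency(table, correction=False)
print(stat, p_value, df)   # chi-square ≈ 1.41, df = 1, p > 0.05: do not reject H0
print(expected)            # [[106.7, 893.3], [53.3, 446.7]]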

Properties of Chi-square test:

1. The mean of the X² distribution is equal to the number of degrees of freedom.
2. The variance of the X² distribution is twice the degrees of freedom.
3. If X² is a chi-square variate with γ degrees of freedom, then X²/2 is a gamma variate.
4. The standard X² variate tends to the standard normal variate as n → ∞.
Applications:
1. To test the hypothetical value of the population
2. To test the goodness of fit
3. To test the independence of attributes
4. To test the homogeneity of independent estimates
5. To combine various probabilities to give a single set of significance

Hypothesis Testing
A hypothesis may be defined simply as a statement about one or more populations.
Statistical hypotheses are hypotheses that are stated in such a way that they may be
evaluated by appropriate statistical techniques.
Hypothesis Testing Steps
1. Data. The nature of the data that form the basis of the testing procedures must be
understood, since this determines the particular test to be employed.
2. Assumptions. A general procedure is modified depending on the assumptions.
3. Hypothesis. There are two statistical hypotheses involved in hypothesis testing, and
these should be stated explicitly. The null hypothesis is the hypothesis to be tested. It is
designated by the symbol H0. The alternative hypothesis is a statement of what we will
believe is true if our sample data cause us to reject the null hypothesis; we designate
the alternative hypothesis by the symbol HA. Suppose we want to know if we can
conclude that a certain population mean is not 50.
The null hypothesis is
H0: μ = 50 and the alternative is
HA: μ ≠ 50
Suppose we want to know if we can conclude that the population mean is greater than
50. Our hypotheses are
H0: μ ≤ 50   HA: μ > 50
If we want to know if we can conclude that the population mean is less than 50, the
hypotheses are
H0: μ ≥ 50   HA: μ < 50
4. Test statistic. The test statistic is some statistic that may be computed from the data
of the sample. As we will see, the test statistic serves as a decision maker, since the
decision to reject or not to reject the null hypothesis depends on the magnitude of the
test statistic. An example of a test statistic is the quantity
z = (x̄ – μ0) / (σ/√n)
where μ0 is a hypothesized value of a population mean. This test statistic is related to
the statistic
z = (x̄ – μ) / (σ/√n)
5. Distribution of test statistic. The key to the statistical decision is the distribution of
the test statistic when the null hypothesis is true. For the statistic above, this is the
standard normal distribution.

6. Decision rule. The decision rule tells us to reject the null hypothesis if the value of the
test statistic that we compute from our sample is one of the values in the rejection
region and to not reject the null hypothesis if the computed value of the test statistic is
one of the values in the nonrejection region
7. Calculation of test statistic. From the data contained in the sample we compute a
value of the test statistic and compare it with the rejection and nonrejection regions
that have already been specified.
8. Statistical decision. The statistical decision consists of rejecting or of not rejecting
the null hypothesis
It is rejected if the computed value of the test statistic falls in the rejection region, and it
is not rejected if the computed value of the test statistic falls in
the nonrejection region.
9. Conclusion.
If H0 is rejected, we conclude that HA is true.
If H0 is not rejected, we conclude that H0 may be true.
10. p values.
The p value is a number that tells us how unusual our sample results are,
given that the null hypothesis is true.
A p value indicating that the sample results are not likely to have occurred, if
the null hypothesis is true, provides justification for doubting the truth of the
null hypothesis.
Purpose of Hypothesis Testing
The purpose of hypothesis testing is to assist administrators and clinicians in making
decisions. The administrative or clinical decision usually depends on the statistical
decision. If the null hypothesis is rejected, the administrative or clinical decision usually
reflects this, in that the decision is compatible with the alternative hypothesis. The
reverse is usually true if the null hypothesis is not rejected. The administrative or
clinical decision, however, may take other forms, such as a decision to gather more
data.
Hypothesis Testing:
A single population mean
The testing of a hypothesis about a population mean under three different conditions:
(1) when sampling is from a normally distributed population of values with known
variance; (2) when sampling is from a normally distributed
population with unknown variance, and (3) when sampling is from a population that is
not normally distributed. When sampling is from a normally distributed population
and the population variance is known, the test statistic for testing H0: μ = μ0 is
z = (x̄ – μ0) / (σ/√n)
which, when H0 is true, is distributed as the standard normal.

Problems:
1. Does the evidence support the idea that the average lecture consists of 3000
words if a random sample of the lectures of 16 professors had a mean of 3472
words, given the population standard deviation is 500 words? Use α = 0.01.
Assume that lecture lengths are approximately normally distributed. Show all
steps.

μ = 3000
σ = 500
x̄ = 3472
n = 16
α = 0.01

1) Ho: μ = 3000
2) Ha: μ ≠ 3000
3) α = 0.01
4) Reject Ho if z < –2.576 or z > 2.576
5) z = (3472 – 3000) / (500/√16) = 3.78
6) Reject Ho, because 3.78 > 2.576
7) At α = 0.01, the population mean is not equal to 3000 words.

2. Suppose that scores on the Scholastic Aptitude Test form a normal distribution with
μ = 500 and σ = 100. A high school counselor has developed a special course designed
to boost SAT scores. A random sample of 16 students is selected to take the course and
then the SAT. The sample had an average score of x̄ = 544. Does the course boost SAT
scores? Test at α = 0.01. Show all steps.

μ = 500
σ = 100
x̄ = 544
n = 16
α = 0.01
1) Ho: μ = 500
2) Ha: μ > 500
3) α = 0.01
4) Reject Ho if z > 2.326
5) z = (544 – 500) / (100/√16) = 1.76
6) Accept Ho (fail to reject), because 1.76 < 2.326
7) At α = 0.01, there is not sufficient evidence that the course boosts SAT scores.
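Both z-tests above can be reproduced with a short Python sketch (the helper function z_stat is ours, not part of any library):

Code:
import math
from scipy.stats import norm
def z_stat(x_bar, mu0, sigma, n):
    # z = (x_bar - mu0) / (sigma / sqrt(n))
    return (x_bar - mu0) / (sigma / math.sqrt(n))
z1 = z_stat(3472, 3000, 500, 16)   # ≈ 3.78 > 2.576: reject Ho (two-sided)
z2 = z_stat(544, 500, 100, 16)     # = 1.76 < 2.326: fail to reject Ho (one-sided)
print(z1, 2 * norm.sf(abs(z1)))    # two-sided p-value
print(z2, norm.sf(z2))             # one-sided p-value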
One-Sided Hypothesis Tests
A hypothesis test may be one-sided, in which case all the rejection region is in one or
the other tail of the distribution. Whether a one-sided or a two-sided test is used
depends on the nature of the question being asked by the researcher.

Problem
Researchers are interested in the mean age of a certain population. Let us say that they
are asking the following question: Can we conclude that the mean age of this
population is different from 30 years? Suppose, instead of asking if they could conclude
that μ ≠ 30, the researchers had asked: Can we conclude that μ < 30? To this question
we would reply that they can so conclude if they can reject the null hypothesis that μ ≥
30.
1. Data. See the previous example.
2. Assumptions. See the previous example.
3. Hypotheses.
H0: μ ≥ 30
HA: μ < 30
The inequality in the null hypothesis implies that the null hypothesis
consists of an infinite number of hypotheses.
4. Test statistic.
z = (x̄ – μ0) / (σ/√n)
5. Distribution of test statistic. Standard normal, when H0 is true.
6. Decision rule. Let us again use α = 0.05. The rejection region consists of the lower
tail: reject H0 if z ≤ –1.645.
7. Calculation of test statistic.
z = –2.12
8. Statistical decision. We are able to reject the null hypothesis since
–2.12 < –1.645.
9. Conclusion. We conclude that the mean age of this population is less than 30 years.
THE DIFFERENCE BETWEEN TWO POPULATION MEANS


Hypothesis testing involving the difference between two population means is most
frequently employed to determine whether or not it is reasonable to conclude that the
two population means are unequal.

Sampling from Normally Distributed Populations: Population Variances Known

When each of two independent simple random samples has been drawn from a
normally distributed population with a known variance, the test statistic for testing the
null hypothesis of equal population means is
z = (x̄1 – x̄2) / √(σ1²/n1 + σ2²/n2)

Problem:
1. Researchers wish to know if the data they have collected provide sufficient evidence
to indicate a difference in mean serum uric acid levels between normal individuals and
individuals with Down’s syndrome. The data consist of serum uric acid readings on 12
individuals with Down's syndrome and 15 normal individuals. The means are x̄1 =
4.5 mg/100 ml and x̄2 = 3.4 mg/100 ml.
We will say that the sample data do provide evidence that the population means are
not equal if we can reject the null hypothesis that the population means are equal. Let
us reach a conclusion by means of the ten-step hypothesis testing procedure.
1. Data. See problem statement.
2. Assumptions. The data constitute two independent simple random samples each
drawn from a normally distributed population with a variance equal to 1 for the
Down’s syndrome population and 1.5 for the normal population.
3. Hypotheses:
H0: μ1 – μ2 = 0   HA: μ1 – μ2 ≠ 0
An alternative way of stating the hypotheses is as follows:
H0: μ1 = μ2   HA: μ1 ≠ μ2
4. The test statistic.
z = ((x̄1 – x̄2) – 0) / √(σ1²/n1 + σ2²/n2)
5. Distribution of test statistic. When the null hypothesis is true, the test statistic follows
the standard normal distribution.
6. Decision rule. Let α = 0.05. The critical values of z are ±1.96.
Reject H0 unless –1.96 < z(computed) < 1.96.
7. Calculation of test statistic.
z = (4.5 – 3.4) / √(1/12 + 1.5/15) = 1.1 / 0.4282 = 2.57
8. Statistical decision. Reject H0, since 2.57 > 1.96.
9. Conclusion. Conclude that, on the basis of these data, there is an indication that the
two population means are not equal.
10. p value. For this test, p = 0.0102.
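A Python sketch of the two-sample z statistic with the known variances confirms the result:

Code:
import math
from scipy.stats import norm
x1, x2 = 4.5, 3.4       # sample means (mg/100 ml)
var1, var2 = 1.0, 1.5   # known population variances
n1, n2 = 12, 15         # sample sizes
z = (x1 - x2) / math.sqrt(var1 / n1 + var2 / n2)  # ≈ 2.57
p_value = 2 * norm.sf(abs(z))                     # two-sided p ≈ 0.0102
print(z, p_value)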

Hypothesis testing: a single population variance


The general principles presented in that section may be employed to test a hypothesis
about a population variance. When the data available for analysis consist of a simple
random sample drawn from a normally distributed population, the test statistic for
testing hypotheses about a population variance σ² is
X² = (n – 1)s² / σ²
Problem:
The purpose of a study by Wilkins et al. (A-28) was to measure the effectiveness of
recombinant human growth hormone (rhGH) on children with total body surface area
burns > 40 percent. In this study, 16 subjects received daily injections at home of rhGH.
At baseline, the researchers wanted to know the current levels of insulin-like growth
factor (IGF-I) prior to administration of rhGH. The sample variance of IGF-I levels (in
ng/ml) was 670.81. We wish to know if we may conclude from these data that the
population variance is not 600.
1. Data. See statement in the example.
2.Assumptions. The study sample constitutes a simple random sample from a
population of similar children. The IGF-I levels are normally distributed.
3. Hypothesis
H0: σ² = 600   HA: σ² ≠ 600
4. Test statistic. The test statistic is
X² = (n – 1)s² / σ0²
5. Distribution of test statistic. When the null hypothesis is true, the test statistic is
distributed as χ² with n – 1 degrees of freedom.
6. Decision rule. Let α = 0.05. Critical values of χ² are 6.262 and 27.488. Reject H0 unless
the computed value of the test statistic is between 6.262 and 27.488.
7. Calculation of test statistic.
X² = 15 × 670.81 / 600 = 16.77
8. Statistical decision. Do not reject H0 since 6.262 < 16.77 < 27.488.
9. Conclusion. Based on these data we are unable to conclude that the population
variance is not 600.
10. p value. The determination of the p value for this test is complicated by the fact that
we have a two-sided test and an asymmetric sampling distribution. When we have a
two-sided test and a symmetric sampling distribution such as the standard normal or t,
we may, as we have seen, double the one-sided p value. Problems arise when we
attempt to do this with an asymmetric sampling distribution such as the chi-square
distribution
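The test statistic and the chi-square critical values for this example can be computed in Python (a sketch using scipy.stats.chi2.ppf for the percentiles):

Code:
from scipy.stats import chi2
n, s2, sigma2_0 = 16, 670.81, 600
stat = (n - 1) * s2 / sigma2_0       # = 16.77
lower = chi2.ppf(0.025, df=n - 1)    # 6.262
upper = chi2.ppf(0.975, df=n - 1)    # 27.488
print(stat, lower, upper)            # 6.262 < 16.77 < 27.488: do not reject H0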
Hypothesis Testing in Python
Imagine a woman in her seventies who has a noticeable tummy bump. Medical
professionals could presume the bulge is a fibroid. In this instance, our first finding (or
the null hypothesis) is that this woman has a fibroid, and our alternative finding is that
she does not. We shall use the terms null hypothesis (beginning assumption) and
alternate hypothesis (countering assumption) to conduct hypothesis testing. The next
step is gathering the data samples we can use to validate the null hypothesis. Two
kinds of error remain possible:

o Although the null hypothesis (H0) was correct, we rejected it (a Type I error).

o Although the null hypothesis (H0) was incorrect, we did not reject it (a Type II error).
P-value: The likelihood of discovering the recorded or more severe outcomes
whenever the null hypothesis (H0) of a research question is true is known as the P
value or computed probability; the meaning of "severe" relies upon how the hypothesis
has been tested. When your P value falls below the selected significance threshold, you
dismiss the null hypothesis and agree that your sample contains solid proof that the
alternative hypothesis is true. It still does not suggest a "significant" or "important"
change; you must determine that while evaluating the applicability of your conclusion
in the actual world.

T-Test: When comparing the mean values of two samples whose characteristics may be
related, a t-test is performed to see if there exists a substantial difference. It is
typically employed when the data sets would exhibit a normal distribution, such as
results recorded from tossing a coin 100 times, and the variances may be unknown.
The t-test is a method for evaluating hypotheses that allows you to assess a
population-applicable assumption.
Assumptions
o Each sample's data are independent and identically distributed (iid).
o Each sample's data have a normal distribution.
o Every sample's data share the same variance.
T-tests are of two types: 1. one-sample t-test and 2. two-sample t-test.
One sample t-test: The One Sample t-test ascertains if the sample average differs
statistically from a known or hypothesized population mean. The One Sample t-test is
a parametric testing technique.
Example: You are determining if the average age of 10 people is 30 or otherwise. Check
the Python script below for the implementation.
Code
# Python program to implement a one-sample T-Test on a sample of ages
# Importing the required libraries
from scipy.stats import ttest_1samp
import numpy as np

# Creating a sample of ages
ages = [45, 89, 23, 46, 12, 69, 45, 24, 34, 67]
print(ages)

# Calculating the mean of the sample
mean = np.mean(ages)
print(mean)

# Performing the T-Test against the hypothesized mean of 30
t_test, p_val = ttest_1samp(ages, 30)
print("P-value is: ", p_val)

# taking the threshold value as 0.05 or 5%
if p_val < 0.05:
    print("We can reject the null hypothesis")
else:
    print("We can accept the null hypothesis")

Output

[45, 89, 23, 46, 12, 69, 45, 24, 34, 67]
45.4
P-value is: 0.07179988272763554
We can accept the null hypothesis
Chi-Square test
The chi-square test is a statistical method to determine if two categorical variables have
a significant correlation between them. Both variables should be from the same
population and they should be categorical, like Yes/No, Male/Female, Red/Green, etc.
For example, we can build a data set with observations on people's ice-cream buying
pattern and try to correlate the gender of a person with the flavour of the ice-cream
they prefer. If a correlation is found, we can plan for an appropriate stock of flavours by
knowing the number of people of each gender visiting. We use functions in the SciPy
library to work with the chi-square distribution; the example below plots its density
for several degrees of freedom.
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
fig, ax = plt.subplots(1, 1)
linestyles = [':', '--', '-.', '-']
deg_of_freedom = [1, 4, 7, 6]
for df, ls in zip(deg_of_freedom, linestyles):
    ax.plot(x, stats.chi2.pdf(x, df), linestyle=ls, label='df=%d' % df)

plt.xlim(0, 10)
plt.ylim(0, 0.4)
plt.xlabel('Value')
plt.ylabel('Probability density')
plt.title('Chi-Square Distribution')
plt.legend()
plt.show()
Calculating a one-proportion Z-test using the formula
z = (P – Po) / sqrt(Po(1 – Po)/n)

Where:
• P: Observed sample proportion
• Po: Hypothesized population proportion
• n: Sample size
In this example, we set P to 0.86, Po to 0.80, and n to 100, and using these values we
calculate the one-proportion z-test in the Python programming language.

Code:
import math
P = 0.86
Po = 0.80
n = 100
a = (P-Po)
b = Po*(1-Po)/n
z = a/math.sqrt(b)
print(z)
Output:
1.4999999999999984
