Concise Biostatistics Manual

Concise
Biostatistics
Manual
By
Dr Prashant R Rao
MBBS, MS, DNB, MNAMS, FMAS, FIAGES
DNB (Surgical Gastroenterology)
Assistant Professor in Surgical Gastroenterology
LTMMC & LTMGH, Sion, Mumbai
&
Dr Sarika P Rao
MBBS, MS, MCh, DNB (Plastic & Reconstructive Surgery)
Fellowship in Microvascular & Aesthetic Surgery
Assistant Professor in Plastic & Reconstructive Surgery
LTMMC & LTMGH, Sion, Mumbai
Dedicated to
Our Parents
Concise Biostatistics Manual
© Leelavathi Publications
First Edition: 2019
All rights reserved.
The authors have taken special care to ensure that the information provided in
the text is correct to the best of their abilities. However, mistakes are inevitable.
Hence the readers are requested to check and confirm the information provided
in the book in case of any doubt. The authors are not liable to anyone for any loss
or damage caused by the errors.
PUBLISHED BY: LEELAVATI PUBLICATIONS
Cost: free. I just hope this book reaches whoever needs it and helps you in
any way possible to pass your theory exams.
Help us make this book better by sending your valuable feedback, constructive
criticism and suggestions to us by email at concisecancermanual@gmail.com
Also like and follow us on our Facebook page
“Concise Cancer Manual- Preparatory Manual for Surgery Exams”
for snippets from the book with the same title and the latest updates in GI
Surgery

About the Book
This book has been prepared by compiling and editing “DATA” from various study
materials and notes provided to us by our seniors and colleague friends, along
with some very good articles and books.
Multiple topics in the subject of Biostatistics which are frequently asked in the
medical examinations have been covered in this book.
The highlights of this book are:

➢ Designed keeping in mind the examination patterns of our universities
➢ Important topics frequently asked in exams are all compiled in one place
➢ Simplified language for easy understanding
➢ Point wise standardized description for better grasping and answer writing
Further reading
If one has the time and patience please try and go through the following:
➢ Ghoshal UC, Tripathi S, Chourasia D (2007) Principle of statistical analysis in

clinical research: a primer. In: Mehta R (ed) Clinical gastroenterology. Paras
Publishing, Hyderabad, pp 372–386
➢ “High-Yield Biostatistics, Epidemiology & Public Health” by Anthony N Glaser

➢ “Methods in Biostatistics for Medical Students and Research Workers” by B K
Mahajan
Concise Biostatistics Manual: Topics covered:
➢ Types of Research Studies

➢ Case Control Study
➢ Cohort study
➢ Case Control Study vs Cohort Study
➢ Randomized Control Trial
➢ Concept of Randomization in RCT
➢ Concept of blinding in RCT
➢ Concept of Allocation concealment in RCT
➢ Meta-analysis
➢ Bias in Clinical Research
➢ Types of Data in Statistics
➢ Measures of Central tendency
➢ Measures of Dispersion of Data
➢ Concept of hypothesis testing
➢ P value
➢ Types of Error in Statistics
➢ Concept of power of a study
➢ Sample size
➢ Statistical Tests and Choosing a statistical test
➢ Concept of Univariate and Multivariate Analysis
➢ Correlation and Regression
➢ Incidence and Prevalence
➢ Screening
➢ Evaluation of an investigative test
➢ Kaplan Meier plots
➢ Forest plots
➢ Receiver Operating Characteristic curve
➢ Evidence based medicine
➢ Levels of evidence and Grades of Recommendation
➢ Ethics in Research and Informed consent
➢ Clavien-Dindo classification of surgical complications
Types of research studies
• Descriptive studies
• Analytical studies
o Ecological studies
o Case control studies
o Cohort studies
o Cross sectional studies
• Experimental studies
o Randomised control trial
o Uncontrolled trial
• Integrative studies
o Systematic review
o Meta-analysis
Case Control Study
• Definition: it is a type of observational study comparing characteristics of

individuals with the disease of interest with a suitable control group of
individuals without the disease
• Since it is an observational study no intervention is attempted or no

attempt is made to alter the course of disease
• Case control studies are commonly retrospective in nature
• It is also known as: backward looking study, effect to cause study, disease
to risk factor study, outcome to exposure study
• It provides Odds ratio, which is an estimate of relative risk
• Distinguishing features:
o Both exposure and disease have occurred before the start of study
o Study proceeds backward from effect to cause
o It uses a control/ comparison group to support or refute an inference
o It provides Odds ratio which is a measure of the strength of

association between the risk factor and outcome
• It is based on 3 assumptions:
o Cases must be representative of those with the disease
o Controls must be representative of those without the disease
o The disease being investigated must be relatively rare
• Basic steps:
o Selection of cases and controls
o Matching
o Assessment of exposure
o Analysis and interpretation
• Study design:
Exposure    Diseased/ cases    Non diseased/ controls
Yes         a                  b
No          c                  d
o Exposure rate in cases: a / (a + c)
o Exposure rate in controls: b / (b + d)
o Odds ratio = (a/b) / (c/d) = ad / bc
• Advantages of case control study:
o Short duration
o Rare diseases can be studied
o Multiple risk factors can be simultaneously looked at
o No follow up required
o Inexpensive and Rapid
o No extra manpower required, less administrative problems
o No ethical problems
o No Hawthorne effect
o One can calculate odds ratio
o No risk to subjects
• Disadvantages
o Retrospective study so data may be of poor quality
o Incidence, relative risk and attributable risk cannot be calculated; it only
yields an estimate of relative risk, namely the odds ratio
o Interviewer bias, confounding bias, recall bias and selection bias may be
involved
o Selection of appropriate controls is difficult
o It does not differentiate between causes and associated factors
Odds ratio
• It is a measure of the strength of association between the risk factor and
outcome
               Case    Control
Exposed        a       b
Non exposed    c       d
• Odds ratio (OR) = ad/bc
• Interpretation:
o OR = 1: exposure to risk factor is identical in both case and control

group
o OR < 1: exposure to risk factor is lower in cases than in control
o OR > 1: exposure to risk factor is greater in cases than in control
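The odds ratio formula above can be sketched in a few lines of Python. This is only an illustration: the function name `odds_ratio` and the counts in the example are hypothetical and are not taken from any real study.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 case control table.

    a: exposed cases       b: exposed controls
    c: unexposed cases     d: unexposed controls
    """
    if b == 0 or c == 0:
        # ad/bc cannot be computed when the denominator is zero
        raise ValueError("odds ratio is undefined when b or c is zero")
    return (a * d) / (b * c)


# Hypothetical counts, for illustration only
example_or = odds_ratio(a=30, b=10, c=70, d=90)
print(round(example_or, 2))  # OR > 1: exposure is more common among cases
```

With these made-up counts the odds ratio is greater than 1, which by the interpretation above means exposure is greater in cases than in controls.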
Cohort study
• Definition: it is a type of analytical study which is undertaken to obtain

evidence to support or refute existence of association between suspected
cause and disease
• Also known as: prospective study, forward-looking study, incidence study,

cause-effect study, longitudinal study, exposure to outcome study
• Since it is an observational study no intervention is attempted or no

attempt is made to alter the course of disease
• Analysis may be both retrospective and prospective
• Definition of cohort: defined as a group of people who share a common

characteristic or experience within a defined time period
o Example: all those born in 2019 form birth cohort of 2019
• Distinguishing features:
o The two cohorts are identified prior to appearance of the disease

under investigation
o The study groups are observed over a period of time to determine

the frequency of disease among them
o Study proceeds forward from cause to effect
o The exposure has occurred but disease has not
• Indication:
o Good evidence of an association between exposure and disease
o When the exposure is rare, but the incidence of disease is higher

among exposed
o When attrition in study population can be minimized
• Types of cohort studies
o Prospective cohort studies
o Retrospective cohort studies
o Combination of retrospective and prospective studies
o Prospective study is also known as current cohort study and

retrospective cohort study is also known as historical cohort study
• Elements or steps of performing cohort study
o Selection of study subjects
o Obtaining data on exposure
o Selection of suitable comparison group who are unexposed
o Follow up
• Cohort study design
• Analysis
Exposure    Diseased    Non diseased
Yes         a           b
No          c           d
o Incidence in exposed: a / (a + b)
o Incidence in non-exposed: c / (c + d)
o Interpretation:
▪ If incidence in exposed is more than incidence in non exposed:

risk is present
▪ If incidence in exposed is equal to incidence in non exposed:

there is no risk and
▪ If the incidence in exposed is less than that in non exposed: the

exposure is protective
o Relative risk:
▪ Is the ratio of incidence of disease among exposed to incidence

of disease among non-exposed
▪ Relative risk is direct measure of strength of association

between suspected cause and effect
▪ Interpretation of relative risk:
• If RR > 1: risk is present
• If RR = 1: there is no risk
• If RR < 1: exposure is protective
• The larger the RR, the greater the strength of association
between suspected factor and disease
• Example: a relative risk of 2 indicates that the incidence
rate of disease is twice as high in exposed subjects
as in non-exposed ones
o Attributable risk
▪ Defined as the difference in incidence rate of disease in

exposed group and un exposed group
▪ Expressed as percentage
▪ AR = (incidence in exposed - incidence in unexposed) / incidence in exposed x 100
▪ Uses: It indicates the extent to which the disease under study

can be attributed to the exposure
o Population attributable risk =
(incidence rate in total sample - incidence in unexposed) / incidence rate in total sample x 100
▪ Use: it provides an estimate of the amount by which the

disease could be reduced in that population if the suspected
factor was eliminated or modified
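The cohort measures described above can be worked through in a short Python sketch. The function name `cohort_measures` and the example counts are hypothetical. Population attributable risk is taken here as (incidence in total sample - incidence in unexposed) / incidence in total sample x 100, which is the usual textbook form; treat that choice as an assumption of this sketch.

```python
def cohort_measures(a, b, c, d):
    """Incidence-based measures from a 2x2 cohort table.

    a: exposed, diseased       b: exposed, non diseased
    c: unexposed, diseased     d: unexposed, non diseased
    """
    inc_exposed = a / (a + b)
    inc_unexposed = c / (c + d)
    inc_total = (a + c) / (a + b + c + d)
    return {
        "incidence_exposed": inc_exposed,
        "incidence_unexposed": inc_unexposed,
        # Ratio of the two incidences
        "relative_risk": inc_exposed / inc_unexposed,
        # Share of disease among the exposed attributable to exposure (%)
        "attributable_risk_pct": (inc_exposed - inc_unexposed) / inc_exposed * 100,
        # Share of disease in the whole sample attributable to exposure (%)
        "population_attributable_risk_pct": (inc_total - inc_unexposed) / inc_total * 100,
    }


# Hypothetical cohort: 100 exposed (40 diseased), 100 unexposed (10 diseased)
for name, value in cohort_measures(a=40, b=60, c=10, d=90).items():
    print(name, round(value, 2))
```

In this made-up cohort the incidence is 40% in the exposed and 10% in the unexposed, so the relative risk is 4 and three quarters of the disease among the exposed is attributable to the exposure.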
• Advantages of cohort study:
o In case of prospective study, data is of better quality
o It yields incidence rate, relative risk and attributable risk
o Multiple diseases resulting from an etiological factor under study can

be simultaneously looked at / several possible outcomes related to
exposure can be studied simultaneously
o No recall bias as in case control study
• Disadvantages:
o Long duration of study
o Unsuitable for study of rare diseases
o Attrition is a problem
o It is expensive
o It can be difficult to find a suitable cohort group
o Administrative and ethical problems
o Hawthorne effect: change in behaviour of study subjects
o If retrospective, there will be problem with data quality
Case Control Study vs Cohort Study
Case control study | Cohort study
proceeds from effect to cause, that is retrograde | proceeds from cause to effect, that is antegrade
relatively inexpensive | it is expensive
tests whether the suspected cause occurs more frequently in those with the disease than among those without the disease | it tests whether the disease occurs more frequently in those exposed than in unexposed individuals
it is retrospective | it may be retrospective or prospective
involves fewer subjects | it involves larger number of subjects
provides quick results | it usually involves long follow-up period
suitable for study of rare diseases | unsuitable for study of rare diseases
generally yields only an estimate of relative risk, namely the odds ratio | it yields incidence rate, relative risk and attributable risk
this is usually the first approach to the testing of a hypothesis | it is reserved for testing of precisely formulated hypothesis
Randomized Control Trial
• It’s a type of experimental study design which aims to reduce bias when
testing a new treatment/ intervention
• Basic steps:
o Drawing up a protocol
▪ One of the essential features of RCT is that this study is

conducted under a strict protocol
▪ The protocol specifies the aims and objectives of the study,

questions to be answered, inclusion and exclusion criteria for
selection of study and control group subjects, sample size,
treatment or intervention to be performed
o Selecting a reference and experimental population
▪ Reference population: it is the population to which the findings

of the trial if found successful will be applicable
▪ Study/ experimental population:
• Derived from reference population
• Should be randomly chosen
• Should be representative of reference population
• Should be eligible for the trial and should give consent
for the same
• Is also called as experimental population
o Randomization
▪ It is the heart of randomized control trial
▪ Definition: randomization is a statistical procedure by which

the participants are allocated into either study or control
groups, to receive or not receive the intervention under study
▪ Randomization is done in an attempt to eliminate bias and

allow for comparability between the study and control group
▪ It ensures that the investigator has no control over the

allocation process and this helps eliminate selection bias
▪ It means that every individual study subject has an equal

chance of being allocated to either study or control group
▪ It is best done by using a table of Random numbers
o Intervention/ manipulation: Deliberate application of treatment/

intervention to be tested or withdrawal/ reduction of suspected
causal factors
o Follow up: examination of the experimental and control group

subjects at defined interval of time to look for desired study outcome
▪ Attrition: it is loss to follow up, which may happen due to
death, migration, loss of interest or withdrawal of consent.
Some losses are inevitable, but every effort should be made to
minimize them
o Assessment of data to derive results
• Basic study design:
Protocol
↓
Select a suitable population (reference/ target population)
↓
Select a suitable sample (study population)
↓
Make necessary exclusions (those not eligible, those not willing)
↓
Randomization
↓
Experimental group and control group
↓
Manipulation and follow up
↓
Assessment
• Classification of RCTs:
o On basis of hypothesis:
▪ Superiority trial
▪ Non inferiority trial
▪ Equivalence trial
o On basis of outcome of interest:
▪ Explanatory
▪ Pragmatic
o Study designs in randomized control trial:
▪ Concurrent parallel study design
▪ Crossover type study design
o Types of RCT:
▪ Animal experiment
▪ Human clinical trial
▪ Preventive trial
▪ Risk factor trial
▪ Cessation experiment
• Advantages of RCT
o Considered most reliable form of scientific evidence (level 1
evidence)
o Used to find cause and effect relationship
o No selection bias
o Multiple outcome variables can be measured in a single study
• Disadvantages of RCT
o Expensive
o Longer study duration
o Ethical restrictions and administrative issues
o Participant and observer bias
o Noncompliance of controls threatens the validity of study
Concept of Randomization in RCT
• It is the heart of randomized control trial
• Definition: randomization is a statistical procedure by which the participants

are allocated into either study or control groups, to receive or not receive the
intervention under study
• Randomization is done in an attempt to eliminate bias and allow for

comparability between the study and control group
• It ensures that the investigator has no control over the allocation process and
this helps eliminate selection bias
• It means that every individual study subject has an equal chance of being
allocated to either study or control group
• It is best done by using a table of Random numbers
• Methods:
o Simple randomization (easiest method)
o Systematic randomization
o Block randomization
o Stratified randomization
• Advantages:
o Eliminates bias, especially selection bias and confounding bias
o Allows for comparability between study and control group
o Facilitates the concept of ‘blinding’
o It permits the use of probability theory to express the likelihood that any
difference in outcome between treatment groups is merely due to
chance
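Two of the methods listed above, simple and block randomization, can be sketched in Python as follows. This is an illustrative sketch only: the function names, the group labels "study" and "control", the block size and the seed are assumptions made for the example, and a real trial would use a validated randomization service together with allocation concealment.

```python
import random


def simple_randomization(n_subjects, seed=None):
    """Each subject independently has an equal chance of either group."""
    rng = random.Random(seed)
    return [rng.choice(["study", "control"]) for _ in range(n_subjects)]


def block_randomization(n_subjects, block_size=4, seed=None):
    """Shuffle within blocks so that group sizes stay balanced."""
    if block_size % 2 != 0:
        raise ValueError("block_size must be even for two equal groups")
    rng = random.Random(seed)
    allocations = []
    while len(allocations) < n_subjects:
        half = block_size // 2
        block = ["study"] * half + ["control"] * half
        rng.shuffle(block)  # random order inside the block
        allocations.extend(block)
    return allocations[:n_subjects]


print(simple_randomization(8, seed=1))
print(block_randomization(8, block_size=4, seed=1))
```

Simple randomization can leave the two groups unequal in size by chance, especially in small trials, whereas block randomization guarantees equal numbers after every complete block.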
Concept of blinding in RCT
• It refers to the concealment of group allocation from one or more

individuals involved in the research study
• It is also called as ‘masking’
• Classification:
o Single Blind: Participant is not aware whether he belongs to study or
control group.
o Double Blind: Neither the participant nor the investigator is aware of
the group allocation and treatment or manipulation received.
o Triple Blind: Participant, investigator as well as the person analyzing
the data are unaware of the group allocation and treatment or
manipulation received.
• Purpose:
o Randomization minimizes the differences between the treatment
and control groups
o Reduces bias in a study
o Since the participant is unaware about which treatment group they
are in, their beliefs about the treatment are less likely to influence
the outcome.
o Also, since the researcher is unaware of which subjects are receiving
the tested treatment, they are less likely to influence the outcomes.
Concept of ‘Allocation concealment’ in RCT
• It refers to the stringent precautions taken to ensure that the group

assignment of study subjects is not revealed prior to definitively allocating
them to their respective groups.
• It means that the person randomizing the subjects does not know what the
next treatment allocation would be
• There is a possibility that the person randomizing the patients to different
groups may selectively allocate a subject to a specific group if he knows
what the next allocation would be, thus introducing selection bias
• To prevent this, one should be unaware of the future group allocation
• This is called ‘allocation concealment’
• Example: if a health care provider knows what the next group allocation is,
he may try to assign a particular subject to it. This will lead to the
introduction of selection bias. Allocation concealment can help eliminate
this bias.
• Methods of allocation concealment
o SNOSE: sequentially numbered opaque sealed envelope
o Numbered/ coded container
o Secured computer method
o Pharmacy controlled method
o Centralized service: where researcher calls trial office to know
allocation sequence. This is the best method
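• Importance of allocation concealment
o Trials with inadequate allocation concealment tend to yield larger
(exaggerated) estimates of effect
o They also tend to show greater heterogeneity in results
o Adequate allocation concealment helps to reduce selection bias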
Meta-analysis
• First done by Karl Pearson in 1904; the term was coined by Gene V Glass in 1976
• Meta-analysis is a statistical analysis that systematically combines the

results of several studies which address a specific research question
• Steps:
o Define research question and specify hypothesis
o Define criteria for inclusion and exclusion of studies
o Literature search
o Selection of studies on specified subject
o Aggregate finding across different studies
o Selection of meta regression model; three types are:
▪ Simple regression model
▪ Fixed effects meta regression model
▪ Random effect regression model
o Combine study results using different approaches; example: inverse
variance method, Mantel Haenszel method, Peto method.
o Report results
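To make the "combine study results" step more concrete, here is a minimal Python sketch of the inverse variance method under a fixed effect model. The function name and the example numbers are hypothetical. It assumes each study supplies an effect estimate on an additive scale (for example a mean difference or a log odds ratio) together with its standard error.

```python
import math


def inverse_variance_pool(estimates, standard_errors):
    """Fixed effect pooled estimate using inverse variance weights."""
    if len(estimates) != len(standard_errors) or not estimates:
        raise ValueError("need one standard error per estimate")
    # More precise studies (smaller standard error) get larger weights
    weights = [1 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1 / total_weight)
    return pooled, pooled_se


# Hypothetical effect estimates from three studies
pooled, pooled_se = inverse_variance_pool([0.20, 0.35, 0.10], [0.10, 0.20, 0.15])
# Approximate 95% confidence interval using the normal distribution
low, high = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(round(pooled, 3), round(low, 3), round(high, 3))
```

The pooled standard error is smaller than that of any single study, which is the main statistical gain from combining studies.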
• Meta-analysis flowchart:
Define research question
↓
Perform literature search
↓
Select studies
↓
Extract data
↓
Analyze data
↓
Statistical analysis
↓
Report results
Bias in Clinical Research
• It is a systematic error which is defined as disproportionate weight in favor

of or against one thing, person or group compared with another
• It is in a way considered to be unfair
• Types of bias
o Selection bias:
▪ Berksonian bias is a well-known form of selection bias
▪ It arises when some individuals are more likely to be selected for the
study than others
o Funding bias: It refers to bias which has been introduced to derive

outcomes which favor the study’s financial sponsor.
o Reporting bias: It refers to bias which is introduced when reporting

observations; such that observations of a certain kind are more likely
to be reported than others
o Analytical bias: It refers to bias which is introduced while analyzing

the study results
o Exclusion bias: It refers to bias which is introduced due to systematic

exclusion of certain individuals from the study to alter study results
o Attrition bias: It refers to bias which is introduced due to loss of

participants/ attrition; example: death or loss to follow up
o Recall bias: It refers to bias which arises as a result of inability of the

participant to recollect past events accurately
o Observer bias: It refers to bias which is introduced by the researcher/

observer. It is the subconscious cognitive bias of judgement by the
researcher
o Confounding bias:
▪ It refers to a situation in which association between exposure

and outcome is distorted by presence of a confounding factor/
variable
▪ Schematic: exposure → outcome, with the confounding variable linked to
both exposure and outcome
▪ Types:
• Positive confounding: observed association is biased

away from null
• Negative confounding: observed association is biased

towards null
• Various methods to eliminate bias in RCT
o Randomization
o Blinding
o Allocation concealment
Types of Data in Statistics
Data refers to observed values of a variable
Types of data
• Qualitative or categorical or nonparametric data: it refers to data which can

be separated into different categories
o Ordinal data
▪ It refers to data which can be arranged in an ascending or

descending order
▪ Example: stages of cancer (1, 2, 3, 4)
o Nominal data
▪ It refers to data which can't be arranged in ascending or

descending order
▪ Example: sex, eye color
• Quantitative or numerical or parametric data: it refers to data which can be

measured
o Interval data
▪ It refers to data which is measured along a scale in which each

data point is equidistant from one another
▪ Example: level of pain rated from 1 to 10 on scale
o Ratio
▪ It refers to data which can be measured as multiples of one
another
▪ That is, data which can be multiplied or divided
Measures of Central tendency
✓ Mean
✓ Median
✓ Mode
• Mean
o It is derived by adding all the individual observations and then

dividing it by the total number of observations
o Example:
If individual observations are a, b, c and d; the
mean = (a + b + c + d) / 4
o Advantages
▪ Easy to calculate
▪ Easy to understand
▪ All values in the distribution are included in its calculation
▪ It is most commonly used statistical measure of central

tendency
o Disadvantage
▪ It is affected by the extreme values which may result in skewed

results
▪ The value at times may look ridiculous
• Median
o It is derived by first arranging the data in ascending or descending

order; the value of the observation in the middle of the set is the
median
o It is a better indicator of central tendency as compared to mean,

when the lowest and highest observation are wide apart or when
they are unevenly distributed
o Advantage
▪ It is easy to calculate
▪ Easy to understand
▪ Not affected by sampling variations
o Disadvantage
▪ It does not consider all values in the distribution
o Example
▪ If individual observations are 1, 2, 3, 4, 5, then the median is 3
▪ If individual observations are 1, 2, 3, 4, then the median is
(2 + 3) / 2 = 2.5
• Mode
o It refers to the most commonly occurring value in a distribution of

data
o It is the most frequent item or the most fashionable value in the

series of observations
o Types
▪ Unimodal: single mode
▪ Bimodal: distribution having 2 modes
▪ Multimodal: distribution having more than 2 modes
o Advantage
▪ Easy to calculate
▪ Easy-to-understand
▪ Not affected by sample variation
o Disadvantage
▪ Exact location is often uncertain and not clearly defined
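The three measures of central tendency can be computed with Python's standard `statistics` module. The observations below are made up for illustration; they include one extreme value to show how it pulls the mean while leaving the median and mode unchanged.

```python
import statistics

# Hypothetical observations with one extreme value (100)
observations = [1, 2, 2, 3, 4, 100]

mean_value = statistics.mean(observations)      # sum divided by count
median_value = statistics.median(observations)  # average of the two middle values
modes = statistics.multimode(observations)      # list of most frequent values

print("mean:", round(mean_value, 2))
print("median:", median_value)
print("mode(s):", modes)
```

`statistics.multimode` returns every most-frequent value, so a bimodal or multimodal series shows up as a list with more than one entry.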
Measures of Dispersion of Data

• Also known as variability of data
• Measures of dispersion or variability of a data give an idea of the extent to
which the values are clustered or spread out.
• In other words, it gives an idea of homogeneity and heterogeneity of data.
• Two sets of data can have similar measures of central tendency but
different measures of dispersion
• Therefore, measures of central tendency should be reported along with
measures of dispersion.
• Measures of dispersion include:
o Range:
▪ It is the simplest measure of dispersion.
▪ It can be represented as the difference between maximum and
minimum value or simply as maximum and minimum value.
▪ Range is given with median.
o Mean deviation: It is the average of deviation from arithmetic mean
o Standard deviation:
▪ it is always given with mean.
▪ It denotes the extent of variation of values from the mean.
▪ Example: if the standard deviation is 10, then the values tend
to be about 10 units above and below the mean.
▪ Higher values of standard deviation represent higher variability
in the data and vice versa.
▪ Zero represents no variability
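The point that two data sets can share a mean yet differ in dispersion can be checked with a small Python sketch. The function name `dispersion_summary` and both data sets are hypothetical. The standard deviation here is the sample standard deviation (divisor n - 1), which is an assumption of this sketch.

```python
import statistics


def dispersion_summary(values):
    """Return (range, mean deviation, sample standard deviation)."""
    mean_value = statistics.mean(values)
    value_range = max(values) - min(values)
    # Average absolute distance of each value from the arithmetic mean
    mean_deviation = sum(abs(v - mean_value) for v in values) / len(values)
    standard_deviation = statistics.stdev(values)  # divisor n - 1
    return value_range, mean_deviation, standard_deviation


# Two hypothetical data sets with the same mean (50)
tight = [48, 50, 52]
spread = [10, 50, 90]
print(dispersion_summary(tight))
print(dispersion_summary(spread))
```

Both sets have a mean of 50, but every measure of dispersion is far larger for the second set, which is why a measure of central tendency should be reported together with a measure of dispersion.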
Concept of hypothesis testing

• Hypothesis testing or significance testing is used to quantify the evidence
against a particular hypothesis.
• For example, in a clinical trial for testing a new drug against the current
one,
o Null hypothesis (H0): assumes no effect of the given drug (i.e. the
new drug is no better, than the current drug or there is no difference
between the two drugs).
o Alternative hypothesis (H1): holds that the null hypothesis is not true.
The alternative hypothesis is what we wish to prove (i.e. the new
drug has a significantly different effect, on average, compared to that
of the current drug).
• A P value of less than 0.05 means that, if the null hypothesis (H0) were
true, a difference at least as large as the one observed would be expected
by chance less than 5% of the time (less than 5 out of 100 means less than
0.05 out of 1). So if P < 0.05 then H0 is rejected and H1 is accepted
[P value is discussed in detail in the next chapter]
P value
• P value = Probability value
• Definition:
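o It is the probability of obtaining a result at least as extreme as the one
observed purely by chance, assuming that the null hypothesis is true (null
hypothesis assumes that there is no significant difference between specified
populations)
o It is not the probability of the null hypothesis itself being true
o It is compared with the significance level (alpha), which is the accepted
probability of type 1 error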
• For example, if you toss a fair coin the probability of getting a head is 50%,
so the probability (P) is 0.5
• Lesser the p-value, lesser is the probability of the event occurring by chance
• As the confidence level increases (for example, from 95% to 99%), a previously
significant value may become non significant
• Interpretation of P value
o P = 0.5 means probability of occurrence of an event by chance is 50

in 100 or 50%
o P = 0.05 means the probability of occurrence of an event by chance is

5 in 100 or 5%
o P = 0.01 means the probability of the occurrence of an event by

chance is 1 in 100 or 1%
o P = 0.001 means the probability of occurrence of an event by chance

is 1 in 1000 cases
• The cut-off for the P value is usually kept at 0.05. P < 0.05 means that a
difference as large as the one observed would occur by chance in less than 5%
of cases if the null hypothesis were true. So, if P < 0.05, the null
hypothesis is rejected, that is alternative hypothesis is accepted and the
difference is statistically significant
• Significance:
o P < 0.05 is considered statistically significant
o Lesser the P-value, lesser is the probability of occurrence of the event

by chance
o Higher the P value lesser the significance
o As the confidence level increases (for example, from 95% to 99%), previously
significant values may become insignificant
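As a worked illustration of a P value, the coin toss idea can be extended into a small exact test in Python. The null hypothesis is that the coin is fair. The function name `binomial_two_sided_p` and the numbers of tosses are hypothetical. The sketch counts every outcome that lies at least as far from the expected number of heads as the observed one.

```python
from math import comb


def binomial_two_sided_p(heads, tosses):
    """Exact two-sided P value for the null hypothesis of a fair coin."""
    observed_distance = abs(heads - tosses / 2)
    # Count sequences whose number of heads is at least as extreme
    # (in either direction) as the observed number
    extreme_sequences = sum(
        comb(tosses, k)
        for k in range(tosses + 1)
        if abs(k - tosses / 2) >= observed_distance
    )
    return extreme_sequences / 2 ** tosses


p_nine = binomial_two_sided_p(heads=9, tosses=10)
p_eight = binomial_two_sided_p(heads=8, tosses=10)
print(round(p_nine, 4), p_nine < 0.05)
print(round(p_eight, 4), p_eight < 0.05)
```

With a cut-off of 0.05, nine heads out of ten would lead to rejection of the null hypothesis, while eight heads out of ten would not.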
Types of Error in Statistics
• There are four possible outcomes at the end of statistical analysis in a

research study
o True positive
o True negative
o False positive
o False negative
• An error occurs in research results when false positive or false negative
outcomes are accepted
• Type I error: To reject the null hypothesis when it is true or
o False positive error or
o Alpha error.
o Example: type I error would mean that the effects of two drugs
studied were found to be different by statistical analysis, when in fact
there was no difference between them.
• Type II error: To accept the null hypothesis when it is false or
o False negative error or
o Beta error.
o Example: type II error would mean that the effects of two drugs
studied were not found different by statistical analysis, when in fact
there was difference
• One has to increase sample size to reduce error
Status of null hypothesis | Null hypothesis accepted on statistical analysis | Null hypothesis rejected on statistical analysis
Null hypothesis is true | Correct decision (true negative) | False positive or type I error
Null hypothesis is false | False negative or type II error | Correct decision (true positive)
Concept of power of a study

• The power of a statistical hypothesis test measures the test's ability to
reject the null hypothesis when it is actually false - that is, to make the
correct decision.
• Statistical power = probability of rejecting the null hypothesis when the alternate hypothesis is true
= probability of making a correct decision
= 1 - β (where β is the probability of type II error)
• The maximum power a test can have is 1 and the minimum is 0.
• Ideally, we want a test to have high power, close to 1.
• Increasing the sample size is the best way to increase the power of a
statistical test.
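The link between sample size and power can be illustrated with a Python sketch. It assumes a two-sided comparison of two group means with a known common standard deviation and a normal approximation (a Z test); the function name and the numbers used are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sample_z(effect, sd, n_per_group, alpha=0.05):
    """Approximate power (1 - beta) of a two-sided two-sample Z test."""
    z = NormalDist()
    z_critical = z.inv_cdf(1 - alpha / 2)
    # True difference expressed in standard error units
    shift = abs(effect) / (sd * sqrt(2 / n_per_group))
    # Probability that the test statistic lands in either rejection region
    return z.cdf(shift - z_critical) + z.cdf(-shift - z_critical)


# Hypothetical scenario: true difference 0.5, standard deviation 1
for n in (16, 32, 64, 128):
    print(n, round(power_two_sample_z(0.5, 1.0, n), 2))
```

In this scenario the power rises towards 1 as the number of subjects per group increases, and when the true difference is zero the function returns alpha, the type I error rate.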
Sample size
• Sample size is defined as the number of subjects that are included in a

given research study
• It is usually not feasible to study the whole population, so a sample is
selected from the population in a random manner
• The sample size should be just large enough to be able to detect a

difference if it exists
• This number is usually represented by the term ‘n’
• Importance of sample size
o Calculation of sample size helps in planning study
o Calculation of sample size helps to estimate the resources that would

be required, that is: manpower, money, material and time that
would be required to complete the research
o Sample size calculation helps to ensure scientific and ethical integrity

of the research study
• Factors to consider while calculating sample size
o Type of study?
o What is the primary outcome variable of the study?
o What is estimated value of primary outcome variable and acceptable

precision?
o What is acceptable Type 1 and 2 error for hypothesis testing?
o What is the desired effect size?
• Factors affecting sample size
o Feasibility: it refers to what is possible or what one can do,

depending upon the resources available to you
o Nonresponse rates and dropout rates
▪ Not everyone who you select in your sample will respond.

Some will be non-responders and some will dropout
▪ Nonresponse rates and dropout rates lead to reduction in the
sample size studied, so the power of the study also decreases
• Small vs large sample size
o Small sample size
▪ May not be possible to detect significant difference even if it

exists
▪ May result in estimation of false results
▪ Objectives of study may not be achieved
o Larger sample size
▪ May result in wastage of resources (manpower, time, effort,

money)
▪ Very small clinically insignificant differences may be detected,

which may not be helpful in clinical practice
o Larger sample size can minimise the sampling error. That is, larger
samples tend to be associated with smaller margin of error.
However, there is a point at which increasing sample size no longer
impacts the sampling error: this is known as law of diminishing
returns
• Calculation of sample size
o Methods: software method and formula method
o For calculation one needs to know:
▪ Population size
▪ Expected frequency of disease in population
▪ Confidence limit
• Significance:
o Sample size influences 2 statistical parameters
▪ Precision of the study

▪ Power of the study to draw conclusions: power is defined as the
probability of finding a statistically significant result when a true
difference exists
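As an example of the formula method, the sketch below uses one commonly quoted formula for comparing two means: n per group = 2 x (Z for alpha + Z for power)^2 x SD^2 / (difference)^2. The manual itself does not give a formula, so this choice, the function name and the optional `dropout_rate` adjustment are all assumptions made for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_two_means(effect, sd, alpha=0.05, power=0.80, dropout_rate=0.0):
    """Subjects needed per group to detect a difference between two means."""
    if effect == 0:
        raise ValueError("effect size must be non-zero")
    if not 0 <= dropout_rate < 1:
        raise ValueError("dropout_rate must be in [0, 1)")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided type I error
    z_beta = z.inv_cdf(power)           # power = 1 - type II error
    n = 2 * ((z_alpha + z_beta) * sd / effect) ** 2
    # Inflate the number recruited to allow for expected dropouts
    return ceil(n / (1 - dropout_rate))


print(sample_size_two_means(effect=0.5, sd=1.0))
print(sample_size_two_means(effect=0.5, sd=1.0, power=0.90))
print(sample_size_two_means(effect=0.5, sd=1.0, dropout_rate=0.10))
```

A smaller expected difference, a higher desired power or a higher expected dropout rate each increase the number of subjects required.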
Statistical Tests
• Non parametric tests:
o Statistical test used in the case of non parametric data
▪ Chi square test
▪ Mann-Whitney U test
▪ Fisher exact test
▪ Wilcoxon test
• Parametric tests:
o Statistical test used in the case of parametric data
▪ Large sample size: Z test
▪ Small sample size:
• Paired T test
• Unpaired T test
• One-way ANOVA
• Two-way ANOVA
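To illustrate one of the tests listed above, here is a minimal Python sketch of the Chi square test for a 2x2 table. The function name and the counts are hypothetical. It uses the shortcut formula N(ad - bc)^2 / (row and column totals multiplied together), applies no continuity correction, and obtains the P value for 1 degree of freedom from the complementary error function.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Chi square statistic and P value (1 degree of freedom) for a 2x2 table."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be non-zero")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(chi square > x) = erfc(sqrt(x / 2))
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical 2x2 table of counts
chi2, p_value = chi_square_2x2(a=30, b=10, c=70, d=90)
print(chi2, p_value < 0.05)
```

When any expected cell count is small, the Fisher exact test is generally preferred over this approximation.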
Choosing a statistical test

• Once data has been collected and tabulated into the spreadsheet, we need
to decide what statistical test needs to be performed for different variables
tabulated into the spreadsheet.
• Choice of different tests depends on the following factors:
o Whether the variable is categorical or continuous?
o Whether the data in that variable is normally distributed (parametric)
or not normally distributed (nonparametric)?
o Whether the data is paired or unpaired?
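The three factors above can be turned into a small decision helper. This is a rough sketch limited to comparing two groups and to the tests named in this manual; the function name and argument names are hypothetical, and real test selection also depends on sample size and study design.

```python
def choose_test(variable_type, normally_distributed=None, paired=False):
    """Suggest a two-group comparison test from the three factors above.

    variable_type:        "categorical" or "continuous"
    normally_distributed: True for parametric data; False or None is
                          treated as nonparametric in this sketch
    paired:               True if the two sets of values come from the
                          same subjects
    """
    if variable_type == "categorical":
        # Pairing is ignored for categorical data in this sketch.
        return "Chi square test (Fisher exact test when counts are small)"
    if variable_type != "continuous":
        raise ValueError("variable_type must be 'categorical' or 'continuous'")
    if normally_distributed:
        return "Paired T test" if paired else "Unpaired T test"
    return "Wilcoxon test" if paired else "Mann Whitney U test"


print(choose_test("continuous", normally_distributed=True, paired=False))
print(choose_test("continuous", normally_distributed=False, paired=True))
```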
Source:
Ghoshal UC, Tripathi S, Chourasia D (2007) Principle of statistical analysis in clinical research: a primer. In: Mehta R (ed) Clinical gastroenterology. Paras Publishing, Hyderabad, pp 372–386
Source: https://cyfar.org/types-statistical-tests
Concept of Univariate and Multivariate Analysis

• Univariate analysis: it analyzes whether one variable is associated with another or not
o That is, only one variable is analyzed at a time
o Such association does not necessarily mean causation.
• Multivariate analysis: it simultaneously tests the effect of multiple factors on an outcome
o That is, more than two variables are analyzed at a time
o It helps in inferring which are the independent factors that are
associated with the outcome
Correlation
• Correlation is relationship between the two sets of continuous data
o Example: relationship between height and body weight; relationship
between fasting blood sugar and body weight
• Correlation statistics is used to determine the extent to which two variables are related and yields a number called the coefficient of correlation.
• Correlation coefficient may be positive or negative and may vary from -1 to
+1
• Positive correlation means that values of two different variables increase
and decrease together (direct relationship).
o For example, speed of running and pulse rate correlates positively.
• Negative correlation means that if value of one variable decreases then
value of the other variable increases (inverse relationship).
o For example, age and number of scalp hairs may correlate negatively.
• The strength of a correlation is determined by absolute value of correlation
coefficient
• The closer the absolute value is to 1, the stronger the correlation.
• Correlation between two variables is shown by scatter plot
• P value in a correlation statistics indicates whether the correlation (or no
correlation) observed is real or by chance.
• Correlation analysis is important because it can be used to predict values of
one variable on the basis of value of other variable.
• A correlation does not mean causation but it also does not mean absence
of causation, that is, if two variables exhibit strong correlation then one of
the variables may cause the other.
• Correlation data is therefore not sufficient evidence for causation.
• Pearson correlation is applied for parametric data while Spearman
correlation is applied for nonparametric data.
• The combined effect of a group of variables upon a variable not included in the group is called multiple correlation.
Fig: Scatter plots
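The coefficient of correlation described above can be computed directly. The sketch below uses the standard Pearson formula for parametric data and obtains the Spearman coefficient by applying the same formula to ranks; the function names and the height and weight figures are made up for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))


# Hypothetical data: height (cm) and body weight (kg) of five people.
height = [150, 160, 170, 180, 190]
weight = [50, 58, 66, 80, 85]
print(round(pearson_r(height, weight), 3))
print(round(spearman_rho(height, weight), 3))
```

A value near +1 here would indicate a strong direct relationship; the P value mentioned above needs a separate significance test and is not computed in this sketch.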
Regression
• Regression analysis is used to predict the values of a quantitative dependent variable based on the values of one or more independent variables.
• In simple regression analysis, there is one quantitative dependent variable
and one independent variable.
• In multiple regression analysis, there is one quantitative dependent variable
and two or more independent variables.
o For example, one may derive a formula to predict liver span
(dependent variable) from the height of a person (independent
variable).
• Linear regression statistics finds the best-fit line (line of regression) that
predicts dependent variable from independent variable
• Linear regression statistics is applied to data where the dependent (outcome) variable is continuous.
• If the dependent variable is categorical (e.g. present vs. absent) then logistic regression is used.
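The best-fit line of simple linear regression can be found by the ordinary least squares method. The following is a minimal sketch under that assumption; the function names and the height and liver span values are hypothetical and serve only to illustrate the liver span example above.

```python
def fit_line(x, y):
    """Least squares line y = intercept + slope * x for paired data."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x_new, slope, intercept):
    """Predicted value of the dependent variable for a new x."""
    return intercept + slope * x_new


# Hypothetical data: height (cm) and liver span (cm).
height = [150, 160, 170, 180]
liver_span = [10.0, 10.8, 11.5, 12.4]
slope, intercept = fit_line(height, liver_span)
print(round(predict(165, slope, intercept), 2))
```

Multiple regression extends the same idea to two or more independent variables and is usually fitted with statistical software rather than by hand.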
Incidence
• Incidence rate is defined as the number of new cases occurring in a defined population during a specified period of time
IR = (number of new cases of a specific disease diagnosed during a given time period / population at risk during this time period) x 1000
• Thus, incidence rates refer to:
o Only new cases
o Diagnosed during a given time period
o In a specified / at risk population
• Uses
o It measures the rate at which new cases are occurring in the population
o Helps describe the magnitude of the illness
o It acts as a health status indicator
o Is useful for taking action to control the disease
o For planning research to identify etiology, pathogenesis and distribution of disease
o To determine the efficacy of vaccination by calculating secondary attack rate in vaccinated and unvaccinated groups
o To evaluate effectiveness of disease control measures such as isolation, immunization, disinfection
• Special incidence rates
o Attack rate
o Secondary attack rate
o Hospital admission rate
• Attack rate:
o Attack rate = (number of new cases of a specified disease during a specified time interval / total population at risk during the same time interval) x 100
o Usually expressed as a percentage
o Used during epidemics
• Secondary attack rate:
o It is defined as the number of exposed persons developing the disease within the range of the incubation period following exposure to a primary case
o Secondary attack rate = (number of exposed persons developing the disease within the range of the incubation period / total number of exposed susceptible contacts) x 100
o Denominator includes only susceptible individuals; individuals who are vaccinated or have previously suffered from the disease are excluded
o The index case is excluded from both numerator and denominator
o Uses
▪ It is a measure of communicability of a disease
▪ Helps determine efficacy of vaccination by calculating secondary attack rate in unvaccinated and vaccinated groups
o Limitation
▪ Secondary attack rate cannot be measured for diseases with subclinical manifestation
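The three incidence measures above reduce to simple arithmetic. The sketch below follows the formulas in the text; the function names, argument names and the numbers in the example are hypothetical.

```python
def incidence_rate(new_cases, population_at_risk, per=1000):
    """New cases per 1000 population at risk in the given period."""
    return new_cases / population_at_risk * per


def attack_rate(new_cases, population_at_risk):
    """Attack rate as a percentage, used during epidemics."""
    return new_cases / population_at_risk * 100


def secondary_attack_rate(contacts, primary_cases, immune_contacts,
                          secondary_cases):
    """Secondary attack rate as a percentage.

    contacts includes everyone in the exposed group. The primary (index)
    cases and the immune contacts (vaccinated or previously infected)
    are removed from the denominator, as described above.
    """
    susceptible = contacts - primary_cases - immune_contacts
    return secondary_cases / susceptible * 100


# Hypothetical household: 10 members, 1 index case, 1 vaccinated member,
# 2 further cases within the incubation period.
print(secondary_attack_rate(10, 1, 1, 2))   # 25.0
```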
Prevalence
• Definition: it is the total number of individuals who have the disease at a particular time (or during a particular period) divided by the population at risk of having the disease at that time
• Prevalence is a ratio
• The term refers to all current cases (new and old existing cases) at a given
time period in a given population
• There are two types:
o Point prevalence
o Period prevalence
• Point prevalence
o Point prevalence is more commonly used than period prevalence
o The term prevalence when used alone refers to point prevalence
o It is defined as the total number of all current cases (old and new) of
a disease at one point of time in a defined population
o Point prevalence = (number of all current cases (old and new) of a disease at one point of time / estimated population at the same point of time) x 100
o It can be made specific for age, sex and other relevant factors
• Period prevalence:
o It measures the frequency of all current cases (old and new) existing
during a defined period of time in a defined population
o It includes cases arising before but extending into the defined period
as well as those arising during the defined time period
o Period prevalence = (number of existing cases (old and new) of a specified disease during a given period of time / estimated mid-interval population) x 100
• Uses
o To estimate the magnitude of health problem in community
o Identify potential high-risk populations
o Useful for administrative and planning purposes; example: allocation of hospital beds, manpower and rehabilitation facilities
• Relationship between incidence and prevalence
o Prevalence depends on incidence and duration of illness
o If the population is stable and the incidence and duration of illness are unchanging, then:
o Prevalence = incidence x duration of illness
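A short worked sketch of the prevalence formulas follows. It assumes a stable population, as stated above, and that incidence and duration use matching time units; the function names and figures are hypothetical.

```python
def point_prevalence(current_cases, population, per=100):
    """All current cases (old and new) per 100 population at one point."""
    return current_cases / population * per


def prevalence_from_incidence(incidence, mean_duration):
    """Prevalence = incidence x duration of illness (stable population).

    incidence and the result share the same denominator (e.g. per 1000);
    mean_duration must be in the time unit of the incidence (e.g. years).
    """
    return incidence * mean_duration


# Hypothetical: 10 new cases per 1000 per year, illness lasting 5 years.
print(prevalence_from_incidence(10, 5))   # 50 (per 1000)
```

The same relationship explains why a disease with a short duration (rapid cure or rapid death) can have a low prevalence even when its incidence is high.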
Screening Test
• It is defined as the search for unrecognized disease or conditions by means of rapidly applied tests, examinations or other procedures in apparently healthy individuals
Screening test | Diagnostic test
Applied to apparently healthy individuals | Applied to a diseased individual
Applied to groups of individuals | Applied to a specific individual and not to a group
Based on a cut off point | Based on evaluation of a number of symptoms, signs and test findings
Less accurate | More accurate
Less expensive | More expensive
Not a basis for initiation of treatment; it is a basis for further evaluation with diagnostic tests | Used as a basis for initiation of treatment
Example: newborn screening, screening of anemia in pregnant women | Example: HbA1c in diabetics
• Uses of screening tests
o Case detection: also known as prescriptive screening
▪ People are screened primarily for their own benefit
▪ Example neonatal screening
o Control of diseases: also known as prospective screening
▪ People are screened for the benefit of others
▪ Example: screening of immigrants for detection of infectious diseases such as yellow fever
o Research purpose: screening may sometimes be performed for research purposes; example: in chronic diseases whose natural history is not fully known
o Educational purposes
• Types of screening
o Mass screening: screening of the whole population or a sizable subgroup of the population
o High risk screening: screening of a particular subgroup of the population which is deemed to be at high risk for the particular disease. It is more productive
o Multiphasic screening: application of two or more screening tests
• Criteria to consider diseases for screening:
o Important public health problem with high prevalence
o Recognizable by screening test in asymptomatic phase
o Natural history is well understood
o Test should be able to detect the disease prior to the onset of signs and symptoms
o Facilities to confirm the diagnosis should be available
o Effective treatment should be available
o Good evidence is available that early detection and treatment of the said disease reduces morbidity and mortality of the disease
o Expected benefits exceed the risk and cost of the test
• Characteristics of screening test
o Acceptability: it should be acceptable to the individuals in whom it has to be done
o Repeatability
▪ The test must give consistent results when repeated more than
once on the same individual under the same conditions
▪ Factors which affect repeatability of the test
• Observer variation: it may be either intra-observer variation (one observer finding different values in the same patient) or inter-observer variation (two different observers finding different values in the same patient)
• Biological variation
o Validity: also known as accuracy
▪ It is defined as the ability of the test to separate or distinguish those who have the disease from those who do not
▪ It has two components: sensitivity and specificity
▪ Both are expressed as a percentage
Evaluation of an investigative test:
• Sensitivity
• Specificity
• Positive predictive value
• Negative predictive value
Screening test results | Diseased | Non diseased | Total
Positive | a (true positive) | b (false positive) | a + b
Negative | c (false negative) | d (true negative) | c + d
Total | a + c | b + d | a + b + c + d
• Sensitivity
o Defined as the ability of a test to correctly identify all those who have the disease, that is, the true positives
o Sensitivity = [a / (a + c)] x 100
• Specificity
o Defined as the ability of the test to correctly identify those who do not have the disease, that is, the true negatives
o Specificity = [d / (b + d)] x 100
• Positive predictive value: Proportion of individuals with a positive test result who have the disease.
o Positive predictive value = [a / (a + b)] x 100
• Negative predictive value: Proportion of individuals with a negative test result who do not have the disease
o Negative predictive value = [d / (c + d)] x 100
• Predictive value reflects diagnostic power of the test
• Diagnostic accuracy (DA): Accuracy is the proportion of all test results (positive and negative) that are correct.
o Diagnostic accuracy = (a + d) / (a + b + c + d)
• An ideal screening test should be 100% sensitive and 100 % specific
• However, for a given test, raising the sensitivity (for example by shifting the cut off point) generally lowers the specificity and vice versa; that is, sensitivity and specificity are inversely related
• In general
o Screening test should have high sensitivity and
o Diagnostic test should have high specificity
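All of the measures above can be computed from the four cells of the 2 x 2 table. The following is a minimal sketch using the same cell letters (a, b, c, d); the function name and the counts in the example are hypothetical. Negative predictive value is computed as d / (c + d), the true negatives over all negative results, following its definition, and accuracy is reported as a percentage for consistency with the other measures.

```python
def evaluate_test(a, b, c, d):
    """Evaluate a test from a 2 x 2 table.

    a: true positives    b: false positives
    c: false negatives   d: true negatives
    All results are percentages.
    """
    return {
        "sensitivity": a / (a + c) * 100,
        "specificity": d / (b + d) * 100,
        "positive_predictive_value": a / (a + b) * 100,
        "negative_predictive_value": d / (c + d) * 100,
        "accuracy": (a + d) / (a + b + c + d) * 100,
    }


# Hypothetical screening of 1000 people, 100 of whom have the disease.
for name, value in evaluate_test(a=90, b=30, c=10, d=870).items():
    print(f"{name}: {value:.1f}%")
```

In this made-up example the positive predictive value is much lower than the sensitivity because the disease is uncommon in the screened group; predictive values depend on prevalence, while sensitivity and specificity do not.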
Kaplan Meier plots
• It is a non parametric statistical analysis used to estimate the survival function from lifetime data
• In medical research it is often used to measure the fraction of patients living for a certain amount of time after treatment, that is, survival
• Named after Edward L Kaplan and Paul Meier
• Basic concepts
o A Kaplan Meier plot is a series of declining horizontal steps which, with a large enough sample size, approaches the true survival function for that specific population
o In order to generate Kaplan Meier plots at least two pieces of data are required for each patient
▪ The status at the last observation (event occurred or censored)
▪ Time to event (or time to censoring)
o Length of horizontal lines along the x-axis represents survival duration for that interval
o Vertical axis represents estimated probability of survival
• Example:
o The survival of patients in the surgery plus adjuvant chemotherapy group (group 1) is better than that of patients in the surgery only group (group 2)
• Advantage
o The advantage of KM plots is the possibility of inclusion of censored data, which means that the information about patients who are lost at any point of time, for any reason, can be used for the analysis
o With the use of these plots professionals have an improved understanding of the disease processes
o Kaplan Meier plots can be used in clinical practice for counselling while dealing with patients with different types of diseases
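The step-wise survival estimate behind a Kaplan Meier plot can be sketched with the product-limit calculation: at each time where an event occurs, the running survival probability is multiplied by the fraction of at-risk patients who survive that time. The function name and the five-patient data set below are hypothetical; the two inputs per patient are the time and the status described under basic concepts.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times:  follow-up time for each patient
    events: 1 if the event (e.g. death) occurred at that time,
            0 if the patient was censored (lost to follow-up, study ended)
    Returns a list of (time, survival_probability), one entry per step.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= (at_risk - deaths) / at_risk
            curve.append((t, survival))
        # Censored patients leave the risk set without causing a step.
        at_risk -= leaving
    return curve


# Hypothetical follow-up in months; 0 marks a censored patient.
for t, s in kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0]):
    print(t, round(s, 2))
```

Censored patients still count in the at-risk denominator up to the time they are lost, which is how their information contributes to the analysis.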
Forest plots
• A forest plot is a graphical manner of presenting the effect estimates (such as means) and confidence intervals of studies, so that they can be easily reviewed and compared
• It is a convenient and easily understandable manner of presenting study results in a systematic review/ meta-analysis
• It is used mainly to present the results of individual studies in a systematic review/ meta-analysis as well as the result of the systematic review/ meta-analysis itself
• It shows effect size of all studies and results of meta analysis
• Example of a forest plot
• Parts of a forest plot:
o Left side: it enumerates the names of the various studies which have been included in the meta-analysis, in chronological order
o Right side: it represents the measure of effect of the studies included
Fig: understanding the components of a forest plot
o The square in the diagram is a measure of effect of the study (example: mean, odds ratio, relative risk) and the horizontal line represents the 95% confidence interval of the study
o Size of each square or the area of a square is proportional to the weight of the study in the meta-analysis
o The diamond in the plot represents the overall measure of effect of the meta-analysis
o The vertical line represents the line of no effect/ null hypothesis
• Advantage
o Easy to understand
o Simple and convenient method of representing the results of individual studies and the net result of the meta-analysis
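How the study weights (square sizes) and the overall effect (diamond) might be obtained can be illustrated with fixed-effect inverse-variance pooling. This is only one of several pooling methods and is an assumption of this sketch, not a method stated in the text. It also assumes effects on an additive scale (for example mean differences, or the logarithm of an odds ratio); the function name and study values are hypothetical.

```python
import math


def pool_fixed_effect(effects, ci_lowers, ci_uppers):
    """Inverse-variance pooled effect from study effects and 95% CIs.

    Returns (pooled_effect, ci_lower, ci_upper, relative_weights).
    """
    # Standard error recovered from the width of each 95% CI.
    std_errors = [(u - l) / (2 * 1.96) for l, u in zip(ci_lowers, ci_uppers)]
    weights = [1 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total
    pooled_se = math.sqrt(1 / total)
    relative = [w / total for w in weights]
    return (pooled, pooled - 1.96 * pooled_se,
            pooled + 1.96 * pooled_se, relative)


# Three hypothetical studies reporting a mean difference with 95% CI.
pooled, low, high, rel = pool_fixed_effect(
    effects=[1.2, 0.8, 1.0],
    ci_lowers=[0.2, 0.3, -0.5],
    ci_uppers=[2.2, 1.3, 2.5],
)
print(round(pooled, 2), round(low, 2), round(high, 2))
print([round(w, 2) for w in rel])
```

Studies with narrower confidence intervals receive larger weights, which is why they appear as larger squares, and the pooled interval (the width of the diamond) is narrower than that of any single study.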
Receiver Operating Characteristic curve

• Receiver operating characteristic is a plot of sensitivity vs 1 - Specificity or
plot of true positive rate against the false positive rate, for the different
possible cut offs (threshold values) of a diagnostic test.
• It shows the relationship between sensitivity and specificity (any increase in
sensitivity is accompanied by a decrease in specificity).
• ROC curve gives an idea of accuracy of a test (efficiency of the test to
discriminate between true positive and true negative).
• The area under the curve gives the measure of test accuracy.
• The area under an ROC curve is calculated by complex mathematical models but can be obtained easily by various computer programs.
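A simple way to see what such programs do is to compute sensitivity and 1 - specificity at every possible cut off and then add up the area with the trapezoidal rule. This is a minimal sketch of that one approach; the function names and the test values are hypothetical, and a result is treated as positive when the value is at or above the cut off.

```python
def roc_points(scores, labels):
    """(1 - specificity, sensitivity) for every cut off, highest first.

    scores: test value for each subject
    labels: 1 if the subject has the disease, 0 if not
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= cutoff and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= cutoff and l == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the points returned by roc_points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical test values for 3 diseased and 3 non diseased subjects.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
print(round(area_under_curve(roc_points(scores, labels)), 3))
```

An area of 1 corresponds to a test that separates diseased from non diseased subjects perfectly, while an area of 0.5 corresponds to a test no better than chance.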
Evidence based medicine
• It is defined as the process of turning clinical problems into questions, then answering them by systematically locating, appraising and using research findings to finally help with clinical decision making
• It is also known as evidence based clinical practice
• Steps in Evidence based medicine
o Evaluate your patient in the form of history, clinical examination and laboratory investigations
o Ask appropriate clinical questions
▪ The question should include all components of PICO, that is:
• P: description of patient population
• I: intervention
• C: comparison group
• O: outcome
o Acquire the best evidence in the form of research available
o Appraise the evidence
o Apply the evidence to patient care
o Self evaluation
• Rules of evidence based medicine
o Not all evidence is equivalent
o Evidence alone cannot help make clinical decisions
• Advantages
o It helps upgrade the knowledge base of the clinician
o It helps improve the understanding of the clinician in aspects of research and its methods
o It improves the confidence of the clinician in managing clinical situations
o It improves the computer literacy and data searching skills
o It allows group problem solving and teaching
o It improves our reading habits
o Wasteful practices can be abandoned
o It helps with more effective use of resources
o Helps keep the clinician up to date
o Helps in decision making process
• Disadvantage
o It takes time to learn the methods and to put them into clinical
practice
o Research is costly
Cochrane collaboration
• One of the international Agencies which has taken up the task of building
evidence-based medicine is Cochrane collaboration
• Goal of the collaboration is
o To produce high quality systematic reviews
o To ensure that these systematic reviews are subjected to very high-quality peer review
o To disseminate these systematic reviews electronically via the Internet
Levels of evidence
From the Centre for Evidence-Based Medicine, http://www.cebm.net.
Level Type of evidence
1A Systematic review (with homogeneity) of RCTs
1B Individual RCT (with narrow confidence intervals)
1C All or none study
2A Systematic review (with homogeneity) of cohort studies
2B Individual Cohort study (including low quality RCT, e.g. <80% follow-up)
2C “Outcomes” research; Ecological studies
3A Systematic review (with homogeneity) of case-control studies
3B Individual Case-control study
4 Case series and poor quality cohort and case-control study
5 Expert opinion without explicit critical appraisal or based on physiology, bench research or “first principles”
Grades of Recommendation
A based on level 1 studies
B based on level 2 or 3 studies or extrapolations from level 1 studies
C based on level 4 studies or extrapolations from level 2 or 3 studies
D based on level 5 evidence or troublingly inconsistent or inconclusive studies of any level
“Extrapolations” are where data is used in a situation that has potentially clinically
important differences than the original study situation.
Ethics in Research
• Ethics is defined as the philosophy of morality
• Principles of ethics:
o Respect for autonomy
o Beneficence
o Non maleficence
o Justice
• Respect for autonomy:
o It is the obligation to respect the decision-making capacity of the individual
o You need to consult research participants and obtain their agreement before you start your work
o Informed consent is obligatory
o This principle gives the individual subject, the right to gather as much
information as possible so that they can make their informed choice
whether to go forward with the intervention or not
o Patient need not give any reason for withdrawal
• Beneficence (means: to do good) & non maleficence (means: first do no harm)
o This principle places on the researcher the responsibility of protecting the physical, mental and social well-being of the research participant during the research.
o The researcher must consider the principles of beneficence & non maleficence together and aim at producing net benefit over harm
• Justice
o Justice precludes exposing one group of individuals to the risks of research for the benefit of another group
Informed consent
• It is defined as consent given by a competent individual who has received the necessary information in the language he best understands, has adequately understood the information and, after considering the information, has arrived at the decision without having been subjected to coercion, undue influence, inducement or intimidation
• If an individual is willing to participate in research, the investigator should take the participant's informed consent
• In the case of a minor or other vulnerable participant, parents/ guardians are the legally authorized representatives who can give consent on their behalf
o Vulnerable participants: any individual who lacks the ability to fully consent to participate in a study; example: minor, illiterate person
o However, in the case of minors above 7 years of age, their consent (also known as assent) should be obtained to the extent of the child's capabilities
o For children below 7 years of age, consent has to be obtained from parents/ guardians alone
• Elements of informed consent
o Volunteerism
o Information disclosure
o Decision making capacity
o Investigator should give information about the following to the participant before taking consent
▪ Purpose of the study
▪ Expectation from the participant
▪ Responsibilities of the investigator
▪ Risk and benefits of the intervention under study
▪ Alternatives available
▪ Option to withdraw from the study
▪ In case of complication whom to contact
The Clavien-Dindo classification of surgical complications
