Statistics Interview Questions
Quantitative Data:
It is analyzed using statistical analysis.
Types: Discrete Data and Continuous Data.
Ex: Age, Height, Weight, Income, Group size, Test score.

Qualitative Data:
It is analyzed by grouping the data into categories and themes.
Types: Nominal Data and Ordinal Data.
Ex: Gender, Marital status, Native language, Qualifications, Colours.
They are used in various fields, including business, finance, healthcare, education, and more. The choice of KPIs depends on the goals and objectives of the organization or process being assessed. By regularly monitoring and analyzing KPIs, organizations can identify areas of improvement, make data-driven decisions, and measure progress toward their strategic goals.
Q 5. What is the difference between Univariate, Bivariate, and Multivariate Analysis?
Q 6. How would you approach a dataset that’s missing more than 30%
of its values?
Choose an appropriate imputation method based on the nature of the missing data:
b. Median:
To find the median, first arrange the incomes in ascending order:
{25, 28, 30, 32, 35, 38, 40, 42, 45, 5000}
Median = (35 + 38) / 2 = 36.5
The median income (36.5) is a better measure of central tendency in this scenario because it is not affected by extreme values.
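The contrast between the mean and the median on this income data can be checked in a few lines of Python:

```python
# Mean vs. median on the income data (in thousands) from the example above.
from statistics import mean, median

incomes = [25, 28, 30, 32, 35, 38, 40, 42, 45, 5000]

print(mean(incomes))    # 531.5 -- dragged upward by the 5000 outlier
print(median(incomes))  # 36.5  -- (35 + 38) / 2, unaffected by the outlier
```

The single extreme value pulls the mean far above every typical income, while the median stays representative.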
Descriptive statistics:
Typically used at the initial stage of data analysis to understand the dataset and identify patterns, trends, and important features.
Generally applied to both populations and samples. They can be used to summarize data from a complete population or from a sample drawn from the population.
Examples: Common descriptive statistics include measures of central tendency (e.g., mean, median, mode), measures of dispersion (e.g., range, variance, standard deviation), frequency distributions, histograms, and summary tables.

Inferential statistics:
Typically used after the initial data exploration (descriptive statistics) when researchers want to make predictions, test hypotheses, or make statements about a population.
Focused on making statements or inferences about a population based on data from a sample. They involve estimating population parameters and assessing the uncertainty associated with those estimates.
Examples: Common inferential statistical techniques include hypothesis testing, confidence intervals, regression analysis, analysis of variance (ANOVA), chi-square tests, and various forms of multivariate analysis.
Q 9. Can you state the methods of dispersion of the data in statistics?
In statistics, measures of dispersion, also known as measures of variability or spread, are used to describe how data points in a dataset are spread out or dispersed. These measures provide valuable insights into the extent to which data values deviate from the central tendency (e.g., the mean) and how variable or homogeneous the dataset is.
Range:
The range is the simplest measure of dispersion and is calculated as the difference between the maximum and minimum values in a dataset. It provides an idea of the spread of data but is sensitive to outliers.
Variance:
Variance quantifies the average squared difference between each data point and the mean. It is calculated by taking the average of the squared deviations from the mean.
Standard Deviation:
The standard deviation is the square root of the variance. It provides a measure of dispersion in the same units as the original data, making it easier to interpret.
Example:
Suppose you have a dataset of exam scores for a class of students:
Exam Scores: [60, 72, 78, 85, 92, 95]
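The three measures of dispersion described above can be computed from scratch for these exam scores:

```python
# Range, population variance, and standard deviation for the exam scores above.
from math import sqrt

scores = [60, 72, 78, 85, 92, 95]

data_range = max(scores) - min(scores)                      # 95 - 60 = 35
m = sum(scores) / len(scores)                               # mean
variance = sum((x - m) ** 2 for x in scores) / len(scores)  # population variance
std_dev = sqrt(variance)                                    # same units as the data

print(f"range={data_range}, variance={variance:.2f}, std dev={std_dev:.2f}")
```

Note the standard deviation (about 12 points) is directly comparable to the scores, while the variance is in squared points.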
Q 12. What are the scenarios where outliers are kept in the data?
Outliers may be kept in data when they represent important and meaningful information, unusual events, or rare occurrences that are relevant to the analysis, such as detecting anomalies, understanding extreme behavior, or studying unique cases.
Q 13. What is the meaning of standard deviation?
The standard deviation is a statistical measure that quantifies the amount of variation or dispersion in a set of data values.
It provides insight into how spread out or clustered the data points are around the mean (average) value.
In other words, the standard deviation helps us understand the extent to which individual data points deviate from the mean.
The standard deviation is calculated as the square root of the variance, by determining each data point's deviation relative to the mean:
σ = √( Σ(x − x̄)² / N )
The key idea behind Bessel's correction is that when you calculate the variance or standard deviation using sample data (rather than data from the entire population), you tend to underestimate the true population variance or standard deviation. This underestimation occurs because you are basing your calculations on a smaller subset of the data.
Bessel's correction adjusts for this underestimation by dividing the sum of squared differences from the mean by (n − 1), where "n" is the sample size. In contrast, when calculating population variance and standard deviation, you divide by "n" (the actual population size). By using (n − 1) instead of "n" in the formula, Bessel's correction increases the calculated variance and standard deviation slightly, making them more representative of the population.
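Bessel's correction can be seen directly by computing both divisors on a small illustrative sample (the numbers here are made up for demonstration):

```python
# Bessel's correction: dividing by (n - 1) instead of n for a sample.
sample = [4, 8, 6, 5, 3]
n = len(sample)
m = sum(sample) / n
ss = sum((x - m) ** 2 for x in sample)  # sum of squared deviations

population_var = ss / n        # divide by n (no correction)
sample_var = ss / (n - 1)      # Bessel's correction: divide by n - 1

print(population_var, sample_var)  # the corrected estimate is always larger
```

This is the same distinction the statistics module draws between `pvariance` (divide by n) and `variance` (divide by n − 1).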
High variability:
It is associated with a higher standard deviation and a larger range or interquartile range (IQR).
Example: A dataset of income levels for a diverse population, where some individuals have very high incomes and others have very low incomes, creating a wide spread.

Low variability:
It is associated with a lower standard deviation and a smaller range or interquartile range (IQR).
Example: A dataset of test scores for a group of students who all scored very close to each other, creating a concentrated distribution.
CV = (σ / μ) × 100%, where CV = coefficient of variation, σ = standard deviation of the dataset, μ = mean of the dataset.
The coefficient of variation is particularly useful when you want to compare the relative variability of two or more datasets with different units of measurement or different means. It provides a standardized way to express the dispersion of data relative to the mean, making it easier to compare datasets of varying scales.
Example:
Test Scores: Consider two classes, Class A and Class B, with test scores. Here are the statistics for both classes:
Class A: Mean Score = 85, Standard Deviation = 10
Class B: Mean Score = 90, Standard Deviation = 8
In this example, Class A has a higher coefficient of variation (11.76%) compared to Class B (8.89%). This suggests that the test scores in Class A are more variable relative to their mean than those in Class B.
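The two percentages quoted above follow directly from CV = (σ / μ) × 100%:

```python
# Coefficient of variation for the two classes in the example above.
def cv(std_dev, mean):
    """CV = (sigma / mu) * 100, expressed as a percentage."""
    return std_dev / mean * 100

cv_a = cv(10, 85)  # Class A: mean 85, std dev 10
cv_b = cv(8, 90)   # Class B: mean 90, std dev 8
print(f"Class A: {cv_a:.2f}%  Class B: {cv_b:.2f}%")  # 11.76% vs 8.89%
```

Even though Class B has the larger mean, its scores are less variable relative to that mean.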
Q 17. What is meant by mean imputation for missing data? Why is it bad?
Mean imputation is a method for handling missing data by replacing missing values with the mean (average) value of the available data in the same column. It is often considered bad practice because it artificially shrinks the variance of the variable and can distort its relationships (e.g., correlations) with other variables.
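A minimal sketch of mean imputation, using a made-up list where `None` marks the missing entries:

```python
# Mean imputation: replace missing values (None) with the column mean.
data = [10, None, 14, 12, None, 16]

observed = [x for x in data if x is not None]
col_mean = sum(observed) / len(observed)  # mean of the available values

imputed = [col_mean if x is None else x for x in data]
print(imputed)  # [10, 13.0, 14, 12, 13.0, 16]
```

Note that every missing entry gets the same value, which is exactly why the imputed column has less spread than the true data would.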
a. Minimum (Min): This is the smallest value in the dataset, representing the lowest data point. It gives you an idea of the floor or lower boundary of the data.
b. First Quartile (Q1): The first quartile, also known as the lower quartile, is the value below which 25% of the data falls. It marks the 25th percentile of the dataset and represents the lower boundary of the middle 50% of the data.
c. Median (Q2): The median, or the second quartile, is the middle value of the dataset when it is sorted in ascending order. It divides the data into two equal halves, with 50% of the data falling below it and 50% above it. The median represents the central tendency of the data.
d. Third Quartile (Q3): The third quartile, also known as the upper quartile, is the value below which 75% of the data falls. It marks the 75th percentile of the dataset and represents the upper boundary of the middle 50% of the data.
e. Maximum (Max): This is the largest value in the dataset, representing the highest data point. It gives you an idea of the ceiling or upper boundary of the data.
The five-number summary is often used to create box plots (box-and-whisker plots), which provide a visual representation of these five summary statistics. Box plots are helpful for understanding the spread, central tendency, and presence of outliers in a dataset. The box in the plot represents the interquartile range (IQR), which is the range between the first quartile (Q1) and the third quartile (Q3), while the whiskers extend to the minimum and maximum values, indicating the range of the data.
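The five-number summary can be computed with the standard library alone; here it is on a small illustrative dataset (quartile values depend slightly on the interpolation method, so this uses the "inclusive" method):

```python
# Five-number summary: min, Q1, median, Q3, max.
from statistics import quantiles

data = [60, 72, 78, 85, 92, 95]
q1, q2, q3 = quantiles(data, n=4, method="inclusive")  # the three quartile cut points

summary = {"min": min(data), "Q1": q1, "median": q2, "Q3": q3, "max": max(data)}
print(summary)

iqr = q3 - q1  # the box in a box plot spans Q1..Q3
```

The whiskers of the corresponding box plot would run from `min` to `max`, with the box covering `iqr`.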
Q 20. What is the difference between the 1st quartile, the 2nd quartile and the 3rd quartile?
The 1st quartile (Q1) is the value below which 25% of the data falls. It represents the lower boundary of the middle 50% of the data.
The 2nd quartile (Q2), also known as the median, is the middle value of the data when it's sorted. It divides the data into two equal halves, with 50% below it and 50% above it.
The 3rd quartile (Q3) is the value below which 75% of the data falls. It represents the upper boundary of the middle 50% of the data.
Think of quartiles as dividing your data into four equal parts, with Q1 marking the 25% point, Q2 (median) marking the 50% point, and Q3 marking the 75% point. These values help you understand where the data is concentrated and how it's spread out.
Percent: For example, 25 percent (25%) is equivalent to 0.25 or 25/100. It means 25 out of every 100, or one-quarter of the whole.
Percentile: For example, the 25th percentile (also known as the first quartile, Q1) is the value below which 25% of the data points in a dataset lie.
Q 22. What is an Outlier?
An outlier is a data point that significantly deviates from the rest of the data in a dataset.
In other words, it's an observation that is unusually distant from other observations in the dataset.
Outliers can be either exceptionally high values (positive outliers) or exceptionally low values (negative outliers).
1. Negative Impacts:
Influence on Measures of Central Tendency:
A single extreme outlier can pull the mean in its direction, making it unrepresentative of the majority of the data.
Impact on Dispersion Measures:
The presence of outliers can inflate measures like the standard deviation and the interquartile range (IQR), making them larger than they would be without outliers.
Skewing Data Distributions:
Positive outliers can result in right-skewed distributions, while negative outliers can result in left-skewed distributions. This can affect the interpretation of the data.
Misleading Summary Statistics:
Outliers can distort the interpretation of summary statistics.
Impact on Hypothesis Testing:
Outliers can affect the results of hypothesis tests. They can lead to incorrect conclusions, such as detecting significant differences when none exist or failing to detect real differences when outliers mask them.
2. Positive Impacts:
Detection of Anomalies:
Outliers can signal the presence of anomalies or rare events in a dataset. Identifying these anomalies can be valuable in various fields, including fraud detection, quality control, and outlier detection in scientific experiments.
Robust Modeling:
In some cases, outliers can be genuine observations that are important to model. For example, in financial modeling, extreme stock price movements may contain valuable information for predicting market trends.
Q 24. Mention methods to screen for outliers in a dataset.
There are several methods to screen for outliers in a dataset, ranging from graphical techniques to statistical tests. Here are some commonly used methods:
Scatterplots:
Scatterplots are particularly useful for identifying outliers in bivariate or multivariate data. Outliers can appear as data points that are far from the main cluster of points in the scatterplot.
Z-Scores:
Z-scores (standard scores) measure how many standard deviations a data point is away from the mean. Data points with high absolute Z-scores (typically greater than 2 or 3) are often considered potential outliers.
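The Z-score screen can be sketched in a few lines; the data here is made up, with one value planted to stand out:

```python
# Screening for outliers with Z-scores: flag points with |z| > 2.
from statistics import mean, pstdev

data = [12, 14, 15, 13, 16, 14, 15, 45]  # 45 looks suspicious

m, s = mean(data), pstdev(data)
outliers = [x for x in data if abs((x - m) / s) > 2]
print(outliers)  # [45]
```

The threshold (2 vs 3) is a judgment call; a stricter cutoff of 3 flags fewer points.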
It's important to note that the choice of outlier detection method should be guided by the characteristics of your data and the specific goals of your analysis.
Q 25. How can you handle outliers in a dataset?
Handling outliers in datasets is an important step in data preprocessing to ensure that they do not unduly influence the results of your analysis or modeling. The approach you choose for handling outliers depends on the nature of the data, the context of the analysis, and your specific objectives. Here are several methods for handling outliers:
Range:
The range is the simplest measure of spread in a dataset. It is the difference between the maximum and minimum values in the dataset.
Interquartile Range (IQR):
IQR = Q3 − Q1
The empirical rule, also known as the 68-95-99.7 rule or the three-sigma rule, is a statistical guideline used to describe the approximate distribution of data in a normal distribution (bell-shaped) curve. It provides insights into how data values are distributed around the mean (average) in a normally distributed dataset.
Approximately 68% of the data falls within one standard deviation of the mean.
Approximately 95% of the data falls within two standard deviations of the mean.
Approximately 99.7% of the data falls within three standard deviations of the mean.
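The three percentages of the empirical rule can be verified against the exact normal CDF in the standard library:

```python
# Checking the 68-95-99.7 rule against the exact standard normal CDF.
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, std dev 1
for k in (1, 2, 3):
    coverage = z.cdf(k) - z.cdf(-k)  # P(-k < Z < k)
    print(f"within {k} sigma: {coverage:.4f}")
```

The printed values (0.6827, 0.9545, 0.9973) show that "68-95-99.7" is a rounded mnemonic, not an exact identity.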
Q 28. What is skewness?
Skewness is a measure of the asymmetry of a distribution. A distribution is asymmetrical when its left and right side are not mirror images.
A distribution can have right (or positive), left (or negative), or zero skewness. A right-skewed distribution is longer on the right side of its peak, and a left-skewed distribution is longer on the left side of its peak.
The three types of kurtosis: Mesokurtic, Leptokurtic, Platykurtic.
Q 33. Can you give an example to denote the working of the central limit theorem?
A population follows a Poisson distribution (left image). If we take 10,000 samples from the population, each with a sample size of 50, the sample means follow a normal distribution, as predicted by the central limit theorem (right image).
Q 34. What general conditions must be satisfied for the central limit theorem to hold?
For the Central Limit Theorem (CLT) to hold:
Random Sampling:
Data must be randomly selected from the population.
Independence:
Data points must be independent of each other.
Sufficient Sample Size:
The sample size should generally be greater than or equal to 30.
Finite Variance:
The population should have a finite variance.
Identical Distribution:
Ideally, data should come from a population with the same distribution.
The CLT states that as sample size increases, sample means approach a normal distribution.
Observer selec on
Attrition
Protopathic bias
Time intervals
Sampling bias
Q 37. What is the probability of throwing two fair dice when the sum is 8?
To find the probability of throwing two fair dice and getting a sum of 8, we need to determine how many favorable outcomes (sums of 8) there are and divide that by the total number of possible outcomes when rolling two dice.
Each die has 6 sides, numbered from 1 to 6. When you roll two dice, there are 6 × 6 = 36 possible outcomes because each die has 6 possible outcomes, and they are independent.
Now, let's calculate the favorable outcomes where the sum is 8:
(2, 6), (3, 5), (4, 4), (5, 3), (6, 2) ---- There are 5 favorable outcomes.
So, the probability of getting a sum of 8 when rolling two fair dice is:
Probability = (favorable outcomes) / (total outcomes) = 5/36 ≈ 0.139
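The counting argument above can be brute-force checked by enumerating all 36 outcomes:

```python
# Brute-force check: probability that two fair dice sum to 8.
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # all 36 (die1, die2) pairs
favorable = [o for o in outcomes if sum(o) == 8]  # the 5 pairs listed above

prob = Fraction(len(favorable), len(outcomes))
print(prob)  # 5/36
```

Using `Fraction` keeps the answer exact instead of a rounded decimal.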
There are two main types of probability distributions: Discrete and Continuous.
Symmetry: The normal distribution is symmetric, meaning that it is centred around a single peak, and the left and right tails are mirror images of each other. The mean, median, and mode of a normal distribution are all equal and located at the centre of the distribution.
Bell-shaped: The PDF of a normal distribution has a bell-shaped curve, with the highest point (peak) at the mean value and gradually decreasing probabilities as you move away from the mean in either direction.
Mean and Standard Deviation: The normal distribution is fully characterized by two parameters: the mean (μ) and the standard deviation (σ). The mean represents the centre of the distribution, while the standard deviation controls the spread or dispersion of the data. Larger standard deviations result in wider distributions.
Empirical Rule: The normal distribution follows the empirical rule (also known as the 68-95-99.7 rule), which states that approximately:
a. About 68% of the data falls within one standard deviation of the mean.
b. About 95% of the data falls within two standard deviations of the mean.
c. About 99.7% of the data falls within three standard deviations of the mean.
Continuous: The normal distribution is a continuous probability distribution, which means that it can take on an infinite number of values within its range. There are no gaps or discontinuities in the distribution.
Many natural phenomena, such as weights, heights and IQ scores, approximate a normal distribution. It is also fundamental in hypothesis testing and statistical modeling.
Q 40. Can you state the formula for normal distribution?
This formula represents the bell-shaped curve of the normal distribution, which is symmetric around the mean (μ) and characterized by its mean and standard deviation. It describes the probability density of observing a specific value (x) in a normally distributed dataset.
f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))
Where: x is the value of the variable, μ is the mean, σ is the standard deviation, and e is Euler's number (≈ 2.718).
Q 41. What is the relationship between mean and median in a normal distribution?
In a normal distribution, the mean and median are equal and coincide at the centre of the distribution.
Q 42. What are some of the properties of a normal distribution?
A normal distribution, also known as a Gaussian distribution or bell curve, has several key properties:
Bell-Shaped Curve: The distribution looks like a symmetrical bell, with a peak in the middle and tails that taper off gradually on both sides.
Symmetry: It's perfectly symmetric, meaning if you fold the curve in half, one side is a mirror image of the other.
Central Peak: The highest point (peak) of the curve is at the mean, which is also the middle of the data.
Mean = Median = Mode: The mean (average), median (middle value), and mode (most common value) are all at the same point in the middle of the distribution.
Tails Extend to Infinity: The tails of the curve stretch infinitely in both directions, but they get closer and closer to the horizontal axis as they go farther from the mean.
Standard Deviation Controls Spread: The width of the bell curve is determined by the standard deviation. A larger standard deviation makes the curve wider, and a smaller one makes it narrower.
Empirical Rule: This rule helps you estimate where data points are likely to be within the distribution. It's based on the 68-95-99.7 rule. Approximately 68% of the data falls within one standard deviation of the mean, about 95% falls within two standard deviations, and roughly 99.7% falls within three standard deviations.
Used in Many Real-Life Situations: The normal distribution is commonly seen in nature and in human-made systems, including things like height measurements, IQ scores, and errors in manufacturing.
Easy for Statistical Analysis: Because of its well-defined properties, the normal distribution is often used in statistics for modeling and making predictions about data.
Here are the steps to convert a value from a normal distribution to a standard normal distribution:
Determine the Mean and Standard Deviation of the Original Normal Distribution:
Identify the mean (μ) and standard deviation (σ) of the original normal distribution.
Calculate the Z-Score:
The Z-score (also known as the standard score) measures how many standard deviations a particular value is from the mean in the original distribution.
Calculate the Z-score using the formula:
Z = (X − μ) / σ
where:
Z is the Z-score.
X is the value from the original distribution that you want to convert.
μ is the mean of the original distribution.
σ is the standard deviation of the original distribution.
The Resulting Z-Score Represents the Standard Normal Distribution:
The Z-score you calculate in step 2 represents the equivalent value in a standard normal distribution.
By following these steps, you can convert any value from a normal distribution into a corresponding value in the standard normal distribution. This conversion is useful for performing standard normal distribution-based calculations and making comparisons between data from different normal distributions.
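The steps above reduce to one line of arithmetic; the numbers in the example call are illustrative, not from the text:

```python
# Converting a value from N(mu, sigma) to a standard normal Z-score.
def to_z(x, mu, sigma):
    """Z = (X - mu) / sigma: distance from the mean in standard deviations."""
    return (x - mu) / sigma

# e.g. a value of 70 from a distribution with mean 60 and std dev 5
print(to_z(70, 60, 5))  # 2.0 -- two standard deviations above the mean
```

Because Z-scores are unitless, two values from differently scaled distributions can be compared directly once converted.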
Q 45. Can you tell me the range of the values in standard normal distribution?
In a standard normal distribution, also known as the standard normal or Z-distribution, the range of possible values extends from negative infinity (−∞) to positive infinity (+∞).
However, it's important to note that while the range of possible values is theoretically infinite, the vast majority of values in a standard normal distribution are concentrated within a relatively narrow range around the mean, which is 0. The distribution is bell-shaped, and as you move away from the mean in either direction, the probability density of values decreases. The tails of the distribution extend to infinity, but they become increasingly rare as you move farther from the mean.
Statistically, most of the values in a standard normal distribution fall within a few standard deviations of the mean. Approximately:
68% of values fall within 1 standard deviation of the mean (between −1 and +1).
95% of values fall within 2 standard deviations of the mean (between −2 and +2).
99.7% of values fall within 3 standard deviations of the mean (between −3 and +3).
This means that the values within the range of roughly −3 to +3 standard deviations from the mean cover the vast majority of observations in a standard normal distribution. Beyond this range, the probability of observing a value becomes extremely low.
Understanding the skewness of a dataset is essential in statistics because it can affect the choice of appropriate statistical analyses and modeling techniques. Left-skewed and right-skewed distributions often require different approaches for analysis and interpretation.
The tail of the distribution extends to the right, meaning there are some relatively large values that pull the mean in that direction.
The median, being the middle value, is less affected by extreme values in the tail, so it is typically lower than the mean in a positively skewed distribution.
Since the distribution is left-skewed, it means that the tail of the distribution is on the left side, and there are some relatively small values that are pulling the mean in that direction.
The median, being the middle value, is less affected by extreme values in the tail. In a left-skewed distribution, the median is typically greater than the mean.
In a left-skewed distribution, the mode is typically greater than the median and the mean. It is often closer to the peak of the distribution, which is located to the right of the centre.
In summary, you can conclude that in a left-skewed distribution with a median of 60, the mean is likely less than 60, and the mode is likely greater than 60.
Q 50. Imagine that Jeremy took part in an examination. The test has a mean score of 160, and it has a standard deviation of 15. If Jeremy's z-score is 1.20, what would be his score on the test?
To find Jeremy's score on the test given his z-score, you can use the formula for calculating a score from a z-score in a normal distribution:
Z = (X − μ) / σ  ⇔  Z × σ = X − μ  ⇔  X = (Z × σ) + μ
In this case: X = (1.20 × 15) + 160 = 18 + 160 = 178, so Jeremy's score on the test is 178.
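Plugging the question's numbers into X = (Z × σ) + μ:

```python
# Jeremy's test score from his z-score: X = (Z * sigma) + mu.
mu, sigma, z = 160, 15, 1.20

x = z * sigma + mu
print(x)  # 178.0
```

This is just the z-score formula Z = (X − μ) / σ rearranged to solve for X.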
Q 51. The standard normal curve has a total area under it of one, and it is symmetric around zero. True or False?
True. The total area under the standard normal curve is 1, and the curve is symmetric around its mean of zero.
Covariance can help you understand whether two variables tend to move in the same direction (positive covariance) or in opposite directions (negative covariance).
Q 53. Can you tell me the difference between unimodal, bimodal and bell-shaped curves?
Unimodal, bimodal, and bell-shaped curves are terms used to describe different characteristics of the shape of a data distribution:
1. Unimodal Curve:
Definition: A unimodal curve represents a data distribution with a single distinct peak or mode, meaning that there is one value around which the data cluster the most.
Shape: Unimodal distributions are typically symmetric or asymmetric but have only one primary peak.
Examples: A normal distribution, where data is symmetrically distributed around the mean, is a classic example of a unimodal curve. Other unimodal distributions can be skewed to the left (negatively skewed) or to the right (positively skewed).
2. Bimodal Curve:
Definition: A bimodal curve represents a data distribution with two distinct peaks or modes, indicating that there are two values around which the data cluster the most.
Shape: Bimodal distributions have two primary peaks separated by a trough or dip in the distribution.
Examples: The distribution of test scores in a classroom with two distinct groups of high achievers and low achievers might be bimodal. Similarly, a distribution of daily temperatures in a year might have two peaks, one for summer and one for winter.
3. Bell-Shaped Curve:
Definition: A bell-shaped curve represents a data distribution that has a smooth, symmetric shape resembling a bell.
Shape: Bell-shaped distributions have a single peak (unimodal) and are symmetric, with the tails of the distribution tapering off gradually as you move away from the peak.
Examples: The classic example of a bell-shaped curve is a normal distribution, where data is symmetrically distributed around the mean. However, other distributions with a similar bell-shaped appearance can also exist.
So, while symmetry and unimodality often go together, symmetry does not inherently require unimodality, and a symmetric distribution can have multiple modes.
Q 55. What are some examples of data sets with non-Gaussian distributions?
Many real-world datasets exhibit non-Gaussian or non-normal distributions due to various underlying factors. Here are some examples of data sets with non-Gaussian distributions:
1. Income Distribution: Income data is often right-skewed, with most people earning average incomes and a few earning very high incomes. This leads to a distribution that does not follow a normal curve.
2. Stock Returns: Daily stock returns can have fat tails and exhibit volatility clustering, making their distribution non-normal. Events like stock market crashes can cause significant deviations from normality.
3. Website Traffic: The number of visitors to a website on any given day often follows a distribution with a long tail. A few days with extremely high traffic can result in a skewed distribution.
4. Ages at Retirement: The distribution of ages at which people retire can be left-skewed, with many retiring around a certain age and very few retiring at younger ages.
5. Number of Customer Arrivals: The number of customers arriving at a store or service centre follows a Poisson distribution, which is discrete and not normal.
6. Test Scores: Test scores, particularly in educational settings, often have a distribution with modes due to various subpopulations of students, leading to a multimodal distribution.
7. City Population Sizes: The distribution of city population sizes worldwide is often right-skewed, with a few megacities having very high populations and the majority of cities having smaller populations.
8. Wait Times: The distribution of wait times in queues or lines can often be right-skewed, with a few people experiencing very long waits and most people experiencing shorter waits.
9. Social Media Engagement: The number of likes, shares, or comments on social media posts can exhibit a highly skewed distribution, with a few posts going viral and receiving a disproportionate number of interactions.
10. Height and Weight: While human height and weight often follow roughly normal distributions, they can also be influenced by factors like nutrition and genetics, leading to deviations from normality in some populations.
These examples illustrate that real-world data can take on various shapes and characteristics, and not all datasets follow the idealized Gaussian or normal distribution. Understanding the distribution of data is essential for making accurate statistical inferences and modeling.
The probability mass function (PMF) of the binomial distribution is given by the formula:
P(X = k) = C(n, k) · pᵏ · qⁿ⁻ᵏ
where n is the number of trials, k is the number of successes, p is the probability of success on a single trial, q = 1 − p is the probability of failure, and C(n, k) is the number of ways to choose k successes out of n trials.
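The PMF above translates directly into code; the coin-flip numbers in the example call are illustrative:

```python
# Binomial PMF: P(X = k) = C(n, k) * p**k * q**(n - k), with q = 1 - p.
from math import comb

def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# e.g. probability of exactly 3 heads in 5 fair coin flips
print(binomial_pmf(3, 5, 0.5))  # 0.3125
```

Summing the PMF over k = 0..n always gives 1, which is a quick sanity check on the formula.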
Q 57. What are the criteria that Binomial distributions must meet?
The binomial distribution is a probability distribution that models a specific type of random experiment. To use the binomial distribution, certain criteria or assumptions must be met:
4. Cluster Sampling:
5. Convenience Sampling:
The choice of sampling method depends on the research objectives, available resources, and the characteristics of the population being studied. Each method has its own strengths and limitations, and researchers must consider these factors when designing and conducting a study.
Q 61. Why is sampling required?
Sampling is required for several simple and practical reasons:
1. Efficiency: Sampling is faster and more cost-effective than collecting data from an entire population, especially when the population is large.
2. Resource Conservation: It saves time, money, and resources, making research more feasible and practical.
3. Timeliness: Allows for quicker data collection and analysis, which can be crucial in time-sensitive situations.
4. Accessibility: Some populations are difficult to access, making sampling the only practical option.
5. Accuracy: When done correctly, sampling provides accurate estimates of population characteristics.
6. Risk Reduction: Reduces the potential for errors in data collection and analysis.
7. Inference: Provides a basis for making conclusions about the entire population based on the characteristics of the sample.
8. Privacy and Ethics: Respects privacy and ethical considerations, especially in sensitive research areas.
9. Analysis: Simplifies data analysis, particularly for large datasets.
Sampling is a practical and essential tool for researchers to gather valuable information while managing constraints and practical limitations.
Sample size calculations ensure your study has enough data to draw meaningful conclusions while controlling for errors and precision.
Q 63. Can you give the difference between stratified sampling and cluster sampling?
The key distinction between stratified sampling and cluster sampling lies in how the population is divided and sampled:
Stratified sampling divides the population into homogeneous subgroups (strata) and selects samples from each stratum independently to ensure representation from all subgroups.
Cluster sampling divides the population into clusters and randomly selects clusters to sample, then collects data from all individuals/items within the selected clusters.
Q 64. Where is inferential statistics used?
Inferential statistics is used in various fields and contexts to make predictions, draw conclusions, and make inferences about populations based on sample data.
Here are some common areas and applications where inferential statistics is used:
1. Scientific Research:
Inferential statistics is fundamental in scientific research across disciplines such as biology, physics, chemistry, and environmental science. Researchers use statistical tests to analyze data and draw conclusions about hypotheses.
2. Business and Economics:
Businesses use inferential statistics for market research, sales forecasting, quality control, and decision-making. Econometric models are employed to analyze economic data and make policy recommendations.
3. Healthcare and Medicine:
Medical researchers and healthcare professionals use inferential statistics to study the effectiveness of treatments, analyze patient data, and draw conclusions about disease prevalence. Clinical trials rely heavily on inferential statistics.
4. Education:
In the field of education, inferential statistics are used to assess the effectiveness of teaching methods, evaluate standardized test scores, and make policy decisions about educational programs.
5. Market Research and Data Analysis:
Market researchers use inferential statistics to make predictions about consumer preferences, market trends, and the impact of marketing campaigns.
6. Finance and Investment:
In finance, inferential statistics are used to assess investment risk, analyze stock market data, and estimate future asset prices. Portfolio optimization and risk management rely on statistical modeling.
7. Criminal Justice and Criminology:
Researchers and law enforcement agencies use inferential statistics to analyze crime data, study crime patterns, and evaluate the effectiveness of criminal justice programs.
8. Sports and Athletics:
In sports analytics, inferential statistics are used to analyze player performance, predict game outcomes, and make strategic decisions in sports management.
Q 65. What are population and sample in Inferential Statistics, and how are they different?
In inferential statistics, the concepts of "population" and "sample" are fundamental and play distinct roles.
Population:
Purpose: In inferential statistics, the population is the ultimate target for making conclusions and generalizations. However, it is often impractical or impossible to collect data from the entire population.
Sample:
Purpose: The primary purpose of taking a sample is practicality. It's often more feasible, cost-effective, and efficient to collect data from a sample rather than the entire population. Inferential statistics use data from the sample to make inferences, predictions, or generalizations about the larger population.
Q 66. What is the relationship between the confidence level and the
significance level in statistics?
The relationship between the confidence level and the significance level in statistics is inverse and
complementary. These two concepts are essential in hypothesis testing and statistical inference.
Relationship:
If you set a confidence level of 95% (1 − α = 0.95), the significance level is 0.05 (α = 0.05).
If you set a confidence level of 99% (1 − α = 0.99), the significance level is 0.01 (α = 0.01).
Confidence level:
The confidence level (often denoted as 1 − α) represents the probability that a confidence
interval calculated from sample data contains the true population parameter. It is a measure
of how confident you are that the interval you calculated captures the parameter you are
estimating. Commonly used confidence levels include 90%, 95%, and 99%.
Significance level:
The significance level (denoted as α) is the probability of making a Type I error in hypothesis
testing. It is also known as the "alpha level" or "level of significance." A Type I error occurs
when you incorrectly reject a true null hypothesis; in other words, α is the probability of
finding a significant result (rejecting the null hypothesis) when there is no real effect or
difference in the population. Commonly used significance levels are 0.05 (5%), 0.01 (1%),
and 0.10 (10%).
Point estimate:
A point estimate provides a "best guess" — a single numerical value — for the parameter.
For example, if you calculate the sample mean (x̄) from a sample of data, x̄ itself is a point
estimate of the population mean (μ).
Confidence interval estimate:
A confidence interval provides a range of plausible values for the parameter along with a
level of confidence (e.g., a 95% confidence interval). The interval reflects the uncertainty
associated with the estimate and quantifies how confident you are that the true parameter
falls within it. For example, a 95% confidence interval for the population mean (μ) might be
(60, 70), indicating that you are 95% confident that the true population mean falls between
60 and 70.
Key Difference:
The main difference between a point estimate and a confidence interval estimate is that a
point estimate provides a single value, while a confidence interval estimate provides a range
of values.
Point estimates are useful when you need a single, specific value for a parameter.
Confidence interval estimates are useful when you want to convey the uncertainty
associated with your estimate and provide a range of values within which the parameter is
likely to fall.
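The distinction can be sketched in a few lines of Python (a minimal illustration with invented sample values; the z critical value 1.96 is the standard choice for a 95% interval, though small samples would normally use a t-value instead):

```python
import math

def point_and_interval_estimate(sample, z=1.96):
    """Return the point estimate (sample mean) and an approximate
    95% confidence interval for the population mean."""
    n = len(sample)
    mean = sum(sample) / n                      # point estimate of mu
    # sample standard deviation (n - 1 in the denominator)
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    moe = z * s / math.sqrt(n)                  # margin of error
    return mean, (mean - moe, mean + moe)

mean, ci = point_and_interval_estimate([60, 62, 65, 68, 70, 72, 75])
# mean is the single "best guess"; ci is the range of plausible values
```

The single number `mean` answers "what is your estimate?", while `ci` answers "how uncertain is that estimate?".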
Biased:
A statistical estimator is said to be "biased" if, on average, it systematically overestimates or
underestimates the true population parameter. In other words, a biased estimator tends to
consistently deviate from the true value in a specific direction (either consistently too high
or too low). Biased estimators can result from flaws in the estimation method or sampling
procedure. When using a biased estimator, it is important to be aware of the direction and
magnitude of the bias so you can adjust for it in data analysis or decision-making.
Unbiased:
A statistical estimator is considered "unbiased" if, on average, it provides estimates that are
equal to the true population parameter. In mathematical terms, the expected value (mean)
of an unbiased estimator equals the true value of the parameter being estimated. Unbiased
estimators are desirable because, over repeated sampling, they provide accurate estimates
of the population parameter. While unbiased estimators are preferred, they are not always
achievable, and in some cases a biased estimator may be the best available option.
Q 69. How does the width of the confidence interval change with the
confidence level?
The width of a confidence interval changes with the level of confidence and the precision of the
estimate. As you increase the level of confidence or decrease the precision (increase the margin of
error), the width of the confidence interval increases, and vice versa.
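This behaviour is easy to see numerically. A short sketch (the two-sided z critical values below are the standard ones for each confidence level; s and n are made-up values):

```python
import math

# Approximate two-sided z critical values for common confidence levels.
Z = {0.90: 1.645, 0.95: 1.960, 0.99: 2.576}

def ci_width(s, n, conf):
    """Width of a z-based confidence interval: 2 * z * s / sqrt(n)."""
    return 2 * Z[conf] * s / math.sqrt(n)

s, n = 10.0, 25
widths = {conf: round(ci_width(s, n, conf), 2) for conf in Z}
# Higher confidence -> wider interval; a larger n would shrink every width.
```

With s = 10 and n = 25, the widths grow from about 6.58 (90%) to 7.84 (95%) to 10.30 (99%).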
Use a Larger Sample: The bigger the sample, the closer our estimate is to reality.
Randomly Choose the Sample: Ensure that everyone in the population has an equal chance
of being in the sample.
Be Careful with Surveys: Encourage more people to respond to surveys to make sure they
represent the whole population.
Use Proper Methods: Follow good statistical methods to analyze the data from your sample.
Reducing sampling error helps us make more accurate predictions about the population based on
our sample.
Q 72. How do the standard error and the margin of error relate?
In simple words, think of the standard error (SE) as a measure of how much a sample statistic can
vary from the true population value. It is a measure of how shaky or uncertain our estimate is.
The margin of error (MOE) is directly related to the standard error. It tells us how much we should
add to and subtract from our sample estimate to create a range that likely includes the true
population value. It is like a safety buffer around our estimate.
So the standard error tells us about the uncertainty in our estimate, and the margin of error tells us
the size of the safety buffer we need to account for that uncertainty. If you want a narrower margin
of error, you need a more precise estimate, which usually means a larger sample size or a lower level
of confidence.
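The relationship MOE = critical value × SE can be sketched directly (a minimal illustration; 1.96 is the standard z value for 95% confidence, and the numbers are invented):

```python
import math

def standard_error(s, n):
    """SE of the sample mean: how much the mean varies from sample to sample."""
    return s / math.sqrt(n)

def margin_of_error(s, n, z=1.96):
    """MOE = critical value * SE -- the 'safety buffer' around the estimate."""
    return z * standard_error(s, n)

# Quadrupling the sample size halves the standard error (and the MOE).
se_small = standard_error(12.0, 36)    # 12 / 6 = 2.0
se_large = standard_error(12.0, 144)   # 12 / 12 = 1.0
```

This also makes the trade-off in the text concrete: to shrink the MOE you either grow n (smaller SE) or lower the confidence level (smaller z).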
Here are the key components and steps involved in hypothesis testing:
Formulate Hypotheses
Collect Data
Calculate Test Statistic
Determine Critical Region
Compare Test Statistic and Critical Region
Calculate P-Value
Make a Decision
Draw Conclusions
It is particularly useful when you have a sample and you want to
assess whether it represents a population with a specific mean.
Q 77. What is the meaning of degrees of freedom (DF) in statistics?
In statistics, degrees of freedom (DF) refer to the number of values in the final calculation of a
statistic that are free to vary. Degrees of freedom are a fundamental concept in hypothesis testing,
confidence intervals, and various statistical analyses. They are used in statistical tests such as
t-tests, chi-square tests, and analysis of variance (ANOVA).
The concept of degrees of freedom can be a bit abstract, but it is essential to understand because it
affects the behaviour of statistical tests and the interpretation of their results. Here's a basic
explanation:
T-Tests:
In a t-test, degrees of freedom are related to the sample size. If you have a sample of size
"n", then:
1. One-sample t-test: Degrees of freedom = n − 1
2. Two-sample t-test: Degrees of freedom = n1 + n2 − 2
where "n1" and "n2" are the sample sizes of the two groups being compared. This "n1 + n2 −
2" represents the number of data points that are free to vary after estimating the means of
the two groups.
Chi-Square Tests:
In chi-square tests, degrees of freedom are related to the number of categories being
compared.
For a chi-square test of independence, the degrees of freedom are calculated as:
Degrees of freedom = (rows − 1) × (columns − 1)
where "rows" and "columns" represent the number of categories in the rows and columns of
the contingency table. This calculation reflects the number of categories that can vary freely.
ANOVA:
In analysis of variance (ANOVA), degrees of freedom are associated with the number of
groups being compared.
There are two types of degrees of freedom in ANOVA:
1. Between-group degrees of freedom:
The between-group degrees of freedom equal the number of groups minus one.
2. Within-group degrees of freedom:
The within-group degrees of freedom equal the total sample size minus the
number of groups.
These degrees of freedom help determine whether there are significant differences between
group means.
In essence, degrees of freedom represent the flexibility or "freedom" in the data or the statistical
model. Understanding degrees of freedom is crucial because they affect the distribution of test
statistics and, consequently, the interpretation of p-values and the conclusions drawn from
statistical analyses. Different statistical tests have different formulas for calculating degrees of
freedom, chosen to ensure the validity of the test being performed.
Q 78. What is the p-value in hypothesis testing?
Formulate Hypotheses:
Start by defining your null hypothesis (H0) and alternative hypothesis (Ha). H0 typically
represents a statement of no effect or no difference, while Ha suggests there is an effect or
difference.
Choose a Statistical Test:
Select the appropriate statistical test based on your research question and the type of data
you have. The choice of test depends on whether you're comparing means, testing
proportions, examining associations, etc.
Collect Data:
Collect relevant data for your analysis. The data should match the assumptions and
requirements of the chosen statistical test.
Calculate the Test Statistic:
Calculate the test statistic that corresponds to your chosen test. This involves using
mathematical formulas specific to the test.
Determine the Sampling Distribution:
Determine the theoretical sampling distribution of the test statistic under the assumption
that the null hypothesis is true. This distribution depends on the test you're conducting (e.g.,
t-distribution, chi-square distribution, F-distribution, normal distribution).
Find the Observed Test Statistic:
Calculate the observed test statistic using your data.
Calculate the p-value:
The p-value is calculated based on the observed test statistic and its distribution under the
null hypothesis.
1. For one-tailed tests (where you are only interested in one direction of an effect), the p-
value is the probability of observing a test statistic as extreme or more extreme than the
observed value in that direction.
2. For two-tailed tests (where you are interested in both directions of an effect), the p-value
is the probability of observing a test statistic as extreme or more extreme than the
observed value in either direction.
Compare the p-value to the Significance Level (α):
Decide on a significance level (α), which is typically set at 0.05 but can vary depending on the
study.
1. If the p-value is less than or equal to α, you reject the null hypothesis (conclude there is
evidence for the alternative hypothesis).
2. If the p-value is greater than α, you fail to reject the null hypothesis (insufficient evidence
to support the alternative hypothesis).
It's important to note that the specific calculations for the test statistic and p-value depend on the
chosen statistical test. Different tests have different formulas and assumptions. In practice, statistical
software or calculators are often used to perform these calculations automatically, as they can be
complex for many tests. Additionally, when conducting hypothesis tests, make sure to consider the
assumptions and limitations of the chosen test to ensure the validity of your results.
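As a concrete sketch, the p-value for a two-tailed one-sample z-test can be computed with nothing but the standard library (the sample mean, hypothesized mean, sigma, and n below are hypothetical numbers chosen for illustration):

```python
import math

def normal_cdf(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def z_test_p_value(sample_mean, mu0, sigma, n, two_tailed=True):
    """P-value for a one-sample z-test of H0: mu = mu0."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    tail = 1 - normal_cdf(abs(z))          # area beyond |z| in one tail
    return 2 * tail if two_tailed else tail

# Hypothetical example: sample mean 52 vs. H0 mean 50, sigma = 10, n = 100.
p = z_test_p_value(52, 50, 10, 100)       # z = 2.0, p is about 0.0455
```

Here p ≤ 0.05, so at α = 0.05 you would reject H0; at α = 0.01 you would fail to reject it.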
= (0.7) ^ 3 = 0.343
Type I error:
1. The chance or probability that you will reject a null hypothesis that should not have been
rejected.
2. This results in deciding two groups are different, or two variables are related, when they
really are not.
3. The probability of a Type I error is called alpha (α).
Type II error:
1. The chance or probability that you will not reject a null hypothesis when it should have
been rejected.
2. This results in deciding two groups are not different, or two variables are not related,
when they really are.
3. The probability of a Type II error is called beta (β).
Q 83. When should you use a t-test vs a z-test?
A z-test is used to test a null hypothesis when the population variance is known, or when the sample
size is larger than 30 even if the population variance is unknown. A t-test is used when the sample
size is less than 30 and the population variance is unknown.
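This rule of thumb can be captured in a tiny helper (an illustration of the decision rule above, not a library API):

```python
def choose_test(n, population_variance_known):
    """Rule of thumb: z-test if sigma is known or n > 30, otherwise t-test."""
    if population_variance_known or n > 30:
        return "z-test"
    return "t-test"

choose_test(50, False)   # large sample            -> "z-test"
choose_test(12, False)   # small, unknown variance -> "t-test"
```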
Q 84. What is the difference between the F-test and the ANOVA test?
The F-test and ANOVA (Analysis of Variance) are related statistical tests, but they serve different
purposes and are used in different contexts.
F-test:
Purpose: The F-test is a statistical test used to compare the variances of two or more
populations or samples.
The test statistic for the F-test follows an F-distribution, which is a right-skewed distribution.
The F-statistic is calculated by dividing the variance of one group by the variance of another
group.
Common use cases for the F-test include comparing the variances of two groups (F-test for
equality of variances), assessing the goodness of fit of a statistical model, and performing
regression analysis (F-test for overall model fit).
ANOVA:
Purpose: ANOVA, on the other hand, is used to compare the means of three or more groups
to determine if there are statistically significant differences among the group means.
ANOVA uses an F-statistic as well, but the calculation differs from the variance-ratio F-test:
it assesses the ratio of variation between group means to the variation within groups.
ANOVA is commonly used in experimental designs where you have several treatments or
conditions and want to determine if there is a statistically significant difference in the means
of these groups. It is often followed by post-hoc tests to identify which specific group means
differ from each other.
1. Bootstrapping:
Bootstrap Sampling: In bootstrap resampling, you randomly select data points from your
dataset with replacement to create multiple "bootstrap samples" of the same size as the
original dataset.
Purpose: Bootstrapping is often used to estimate the sampling distribution of a statistic (e.g.,
mean, median, standard deviation) or to construct confidence intervals.
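A percentile-bootstrap confidence interval can be sketched in plain Python (the data, the 2,000 resamples, and the fixed seed are arbitrary choices for illustration):

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for a statistic."""
    rng = random.Random(seed)
    boot_stats = sorted(
        # each resample draws len(data) points WITH replacement
        stat([rng.choice(data) for _ in range(len(data))])
        for _ in range(n_boot)
    )
    lo = boot_stats[int((alpha / 2) * n_boot)]
    hi = boot_stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

lo, hi = bootstrap_ci([4, 8, 15, 16, 23, 42])   # rough 95% CI for the mean
```

The appeal of the method is that it needs no distributional formula: the spread of the resampled statistics stands in for the unknown sampling distribution.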
2. Cross-Validation:
K-Fold Cross-Validation: In cross-validation, you partition your dataset into "k" subsets
(folds). You iteratively use k − 1 folds for training and the remaining fold for testing, repeating
this process k times.
Purpose: Cross-validation is widely used in machine learning to assess model performance,
tune hyperparameters, and detect overfitting.
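The fold-splitting logic itself is simple enough to sketch without any library (a minimal illustration of how k-fold partitioning works; in practice shuffling before splitting is common):

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n))
    # distribute any remainder so fold sizes differ by at most one
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 5))   # 5 folds, each with 2 test points
```

Every data point appears in exactly one test fold, which is what makes the k test-fold scores a fair estimate of out-of-sample performance.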
In other words, if you construct a large number of confidence intervals using the same method and
the same confidence level (e.g., 95%), and you repeat this process many times, then approximately
5% of these intervals will not contain the true population parameter.
In simpler terms, a confounding variable is an extra factor that can distort the observed relationship
between two other variables by either masking or falsely suggesting a connection between them.
Example: Suppose you are studying the relationship between coffee consumption (independent
variable) and the risk of heart disease (dependent variable). Age is a confounding variable because it
is related to both coffee consumption (people of different ages may drink different amounts of
coffee) and the risk of heart disease (older individuals tend to have a higher risk). Without
considering age as a confounder, you may mistakenly conclude that coffee consumption directly
affects heart disease risk.
Q 88. What are the steps we should take in hypothesis testing?
Hypothesis testing is a structured process used in statistics to make inferences about population
parameters based on sample data. Here are the steps typically involved in hypothesis testing:
1. Formulate Hypotheses:
State the null hypothesis (H0): This is a statement of no effect or no difference. It represents
the default assumption you want to test.
State the alternative hypothesis (Ha): This is the hypothesis you want to provide evidence for,
suggesting that there is an effect, difference, or relationship in the population.
2. Choose a Significance Level (α):
Select the significance level (α), which represents the probability of making a Type I error
(rejecting the null hypothesis when it is true). Common choices include 0.05 (5%) and 0.01
(1%).
3. Collect and Analyse Data:
Collect sample data that are relevant to your research question.
Perform appropriate statistical analysis based on the type of data and research design. This
analysis depends on the specific hypothesis test you're conducting (e.g., t-test, chi-square
test, ANOVA).
4. Calculate the Test Statistic:
Calculate the test statistic based on your sample data and the null hypothesis. The test
statistic quantifies how different your sample data are from what you would expect under
the null hypothesis.
5. Determine the Critical Region:
Identify the critical region or rejection region in the probability distribution of the test
statistic. This is the range of values that would lead to rejecting the null hypothesis if the test
statistic falls within it.
6. Compare the Test Statistic to Critical Values:
Compare the calculated test statistic to the critical values (cut-off values) corresponding to
the chosen significance level. If the test statistic falls in the critical region, you reject the null
hypothesis. Otherwise, you fail to reject it.
7. Calculate the P-Value:
Alternatively, you can calculate the p-value, which is the probability of observing a test
statistic as extreme as, or more extreme than, the one calculated, assuming the null
hypothesis is true.
If the p-value is less than or equal to the chosen significance level (α), you reject the null
hypothesis.
If the p-value is greater than α, you fail to reject it.
8. Make a Decision:
Based on the comparison of the test statistic (or p-value) to the critical values (or α), make a
decision:
If you reject the null hypothesis, conclude that there is evidence for the alternative
hypothesis.
If you fail to reject the null hypothesis, conclude that there is insufficient evidence to support
the alternative hypothesis.
9. Interpret Results:
Interpret the results in the context of your research question. Explain the practical
significance of your findings and their implications.
10. Report Findings:
Clearly communicate your results, including the test statistic, p-value (if used), conclusion,
and any relevant effect size measures, in a clear and concise manner.
Imagine you're a detective investigating a case. You have a suspect on trial, and you want to know if
there's enough evidence to say they are guilty.
The p-value is like a measure of how strong your evidence is against the suspect. It tells you the
likelihood of getting the evidence you have if the suspect is innocent.
Interpolation estimates values within the range of known data, where you have observed the actual
pattern or relationship between data points. As long as this relationship is relatively consistent,
interpolation tends to provide reasonably accurate estimates.
Extrapolation, on the other hand, involves predicting values beyond the range of known data, which
is inherently uncertain. Extrapolation assumes that the same pattern or trend will continue, and this
assumption may not always hold true, especially when data are subject to changing conditions or
unobserved factors.
Q 92. You roll a biased coin (p(head) = 0.8) five times. What's the
probability of getting three or more heads?
To satisfy the question, we need 3, 4, or 5 heads. Using the binomial formula
P(X = k) = C(5, k) × (0.8)^k × (0.2)^(5−k):
P(X ≥ 3) = P(3) + P(4) + P(5) = 0.2048 + 0.4096 + 0.32768 = 0.94208.
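The tail probability for 3, 4, or 5 heads can be computed directly from the binomial PMF:

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for a binomial(n, p) random variable."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# P(3 or more heads in 5 flips of a coin with p(head) = 0.8)
p_three_or_more = sum(binom_pmf(k, 5, 0.8) for k in (3, 4, 5))
# 0.2048 + 0.4096 + 0.32768 = 0.94208
```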
1. Calculate the expected number of infections under the standard rate. Standard infection rate = 1
infection per 100 person-days:
Expected infections = (1 / 100) × 1787 = 17.87
2. Use the Poisson distribution to find the probability of observing 10 or fewer infections when the
expected number is 17.87. The Poisson probability mass function is:
P(X = x) = (e^(−λ) × λ^x) / x!
3. Calculate the cumulative probability of observing 10 or fewer infections:
P(X ≤ 10) = Σ (x = 0 to 10) (e^(−17.87) × 17.87^x) / x!
4. Find the p-value, which is the probability of observing 10 or fewer infections:
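The cumulative Poisson probability in step 3 can be computed directly (a small sketch using only the standard library; each PMF term is derived from the previous one to avoid large factorials):

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for a Poisson(lam) variable, summing the PMF term by term."""
    term = math.exp(-lam)     # P(X = 0)
    total = term
    for x in range(1, k + 1):
        term *= lam / x       # P(X = x) from P(X = x - 1)
        total += term
    return total

p_value = poisson_cdf(10, 17.87)   # P(X <= 10) when 17.87 are expected
```

Since 10 observed infections sit well below the 17.87 expected, this left-tail probability comes out small.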
To calculate a 95% Student's t-confidence interval for the mean brain volume in the population, you
can use the following formula:
Confidence Interval = x̄ ± (t × s / √n)
Where:
x̄ is the sample mean (here 1,100 cc), s is the sample standard deviation (here 30 cc), n is
the sample size (here 9), and t is the critical t-value for a 95% confidence interval with
(n − 1) degrees of freedom.
First, let's find the critical t-value for a 95% confidence interval with 8 degrees of freedom (9 − 1 = 8).
You can use a t-table or a calculator to find this value. For a 95% confidence level and 8 degrees of
freedom, the critical t-value is approximately 2.306.
Now, plug the values into the formula:
Confidence Interval = 1100 ± (2.306 × 30 / √9)
Confidence Interval = 1100 ± (2.306 × 10)
Confidence Interval = 1100 ± 23.06
So, the 95% confidence interval for the mean brain volume in this new population is approximately
1,076.94 cc to 1,123.06 cc. This means that we are 95% confident that the true mean brain volume in
the population falls within this range.
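The arithmetic above can be verified with a short sketch (the critical value 2.306 is taken from a t-table, as in the worked example):

```python
import math

def t_confidence_interval(mean, s, n, t_crit):
    """Confidence interval: mean +/- t_crit * s / sqrt(n).
    t_crit must come from a t-table for the chosen level and n - 1 df."""
    moe = t_crit * s / math.sqrt(n)
    return mean - moe, mean + moe

# Values from the worked example: x_bar = 1100, s = 30, n = 9, t = 2.306.
lo, hi = t_confidence_interval(1100, 30, 9, 2.306)   # (1076.94, 1123.06)
```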
The main idea behind ANOVA is to partition the total variance in the data into different components,
which can be attributed to different sources or factors.
If the p-value is less than or equal to the chosen significance level (α), typically 0.05, it
suggests that the observed results are statistically significant. In this case, you reject the null
hypothesis.
If the p-value is greater than the significance level, it suggests that the observed results are
not statistically significant. In this case, you fail to reject the null hypothesis.
In short, it's a way to decide whether the data provide enough evidence to challenge a specific
hypothesis or not.
Histogram:
Purpose: Histograms are used to visualize the distribution of continuous data by dividing it
into bins or intervals and displaying the frequency or count of data points within each bin.
Appearance: A histogram consists of a series of adjacent bars or bins, with the width of each
bin representing a range of values. The height of each bar represents the frequency or count
of data points in that bin.
Usage: Commonly used for exploring the distribution of data, identifying patterns, and
assessing data characteristics.
Box plot:
Purpose: Box plots are used to display the distribution, central tendency, and spread
(variability) of a dataset. They are particularly useful for identifying outliers and comparing
the distributions of multiple datasets.
Appearance: A box plot consists of a rectangular "box" with a line inside it (the median) and
"whiskers" that extend from the box. Sometimes individual data points are plotted as dots.
Usage: Commonly used for comparing distributions between different groups or visualizing
the spread of data.
1. What is the probability you picked the double-headed coin (now referred to as D)?
2. What is the probability of getting a head on the next toss?
PART 1
We are trying to find the probability of having the double-headed coin. We know that the same coin
has been flipped 10 times, and we've gotten 10 heads (intuitively, you're probably thinking there is a
significant chance we have the double-headed coin). Formally, we're trying to find
P(D | 10 heads).
Since we have all the components of P(D | 10 H), compute and you'll find that the probability of
having the double-headed coin is 0.506. We have finished the first question.
PART 2
The second question is then easily answered: we just compute the two individual possibilities and
add.
Example: Suppose you calculate a 95% confidence interval for the average height of a population,
and you obtain the interval [165 cm, 175 cm].
"We are 95% confident that the true average height of the population falls within the range of 165
cm to 175 cm."
Q 102. How do you stay up-to-date with the new and upcoming
concepts in statistics?
To stay up-to-date with new concepts in statistics:
Read Journals: Regularly read statistical journals and publications.
Online Courses: Take online courses and webinars.
Conferences: Attend statistical conferences and workshops.
Join Forums: Participate in online statistical forums and communities.
Network: Connect with statisticians and data scientists.
Subscribe: Subscribe to statistical newsletters and blogs.
Follow Researchers: Follow leading statisticians on social media.
Continuous Learning: Embrace a culture of continuous learning.
Correlation Coefficient:
The most common way to measure correlation is by calculating the correlation coefficient,
represented by the symbol "r" or "ρ" (rho). The correlation coefficient is a numerical value
that ranges between −1 and 1, with the following interpretations:
1. A positive correlation (r > 0) indicates that as one variable increases, the other tends to
increase as well.
2. A negative correlation (r < 0) indicates that as one variable increases, the other tends to
decrease.
3. A correlation coefficient of 0 (r = 0) suggests no linear relationship between the
variables.
Strength of Correlation:
The absolute value of the correlation coefficient (|r|) indicates the strength of the
relationship. Values closer to −1 or 1 represent stronger correlations, while values closer to 0
represent weaker correlations.
Direction of Correlation:
The sign of the correlation coefficient (+ or −) indicates the direction of the relationship. A
positive coefficient means the variables move in the same direction, while a negative
coefficient means they move in opposite directions.
Scatterplots:
Scatterplots are often used to visually represent the relationship between two variables.
Points on the plot represent data points, and the pattern they form can give an indication of
the correlation.
Association: A high positive correlation implies that, on average, as the amount of time a
person sleeps increases, their productive work also tends to increase. In other words, there
appears to be a connection between sleep and productivity.
Predictive Value: The strength of the correlation can indicate the extent to which sleep time
can be used to predict or estimate productive work. If the correlation is strong, sleep time
may be a good predictor of work productivity.
Direction: A positive correlation means that as one variable (sleep time) increases, the other
variable (productive work) tends to increase as well. This suggests that getting more sleep is
associated with higher productivity, which aligns with common understanding.
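Pearson's r can be computed from its definition in a few lines (the sleep/productivity numbers below are invented purely for illustration):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

sleep_hours = [5, 6, 7, 8, 9]
productivity = [50, 55, 65, 70, 80]
r = pearson_r(sleep_hours, productivity)   # strongly positive, close to 1
```

A strong positive r here would support the "more sleep, more productive work" association described above, though, as always, correlation alone does not establish causation.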
Q 107. How will you determine the test for the continuous data?
Common tests for analyzing continuous data in statistics include t-tests (one-sample, two-sample,
paired), ANOVA, correlation, and linear regression.
The choice of test depends on your research question, data distribution, and experimental design.
Skewness: Data may be skewed to the left (negatively skewed) or right (positively skewed),
leading to non-normality.
Outliers: Extreme values or outliers in the dataset can distort the normal distribution.
Sampling Bias: Non-random sampling or selection bias may result in data that do not
reflect the population's true distribution.
Non-linear Relationships: Data influenced by non-linear relationships or complex
interactions may deviate from normality.
Data Type: Some data, such as counts or proportions, inherently follow non-normal
distributions.
Natural Variation: In some cases, data may naturally follow a non-normal distribution due to
the underlying process being studied.
Measurement Errors: Errors in data collection or measurement can introduce non-normality.
Censoring or Floor/Ceiling Effects: Data may be bounded, leading to deviations from
normality at the bounds.
Understanding the cause of non-normality is essential for appropriate data analysis and for choosing
the right statistical techniques or transformations.
Q 109. Why is there no such thing as a 3-sample t-test? Why does the
t-test fail with 3 samples?
There is no dedicated "3-sample t-test" because traditional t-tests are designed for comparing
means between two groups, not three. When you have three or more groups to compare, you
typically use analysis of variance (ANOVA) or its variations, which can determine whether there are
statistically significant differences among multiple groups. T-tests can be applied to compare pairs of
groups within an ANOVA framework, but they are not used to directly compare three groups
simultaneously.
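For completeness, the one-way ANOVA F-statistic that replaces a "3-sample t-test" can be sketched from its definition (the three small groups below are made up for illustration):

```python
def one_way_anova_f(groups):
    """F-statistic for one-way ANOVA: between-group mean square
    divided by within-group mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ssw = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    msb = ssb / (k - 1)          # between-group df = k - 1
    msw = ssw / (n - k)          # within-group df  = n - k
    return msb / msw

f = one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])   # -> 3.0
```

The resulting F is then compared against an F-distribution with (k − 1, n − k) degrees of freedom to decide whether any group means differ, after which pairwise post-hoc tests can localize the difference.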