Statistics, Statistical Modelling & Data Analytics
Unit - I
Statistics: Introduction
Statistical Methods:
Applications of Statistics:
Example:
Suppose a pharmaceutical company wants to test the effectiveness of a new
drug. They conduct a clinical trial where they administer the drug to a sample of
patients and measure its effects on their symptoms. By analyzing the data from
the trial using statistical methods, such as hypothesis testing and regression
analysis, the company can determine whether the drug is effective and make
decisions about its future development and marketing.
Understanding the basic concepts of statistics is essential for interpreting data
effectively and making informed decisions in various fields.
1. Mean:
The mean, also known as the average, is calculated by summing up all the
values in a dataset and then dividing by the total number of values.
2. Median:
The median is the middle value of a dataset when the values are arranged in order. It divides the dataset into two equal halves, with half of the values lying below and half lying above the median.
Example: For the dataset {1, 3, 5, 6, 9}, the median is 5. For the dataset {2, 4,
6, 8}, the median is (4 + 6) / 2 = 5.
3. Mode:
The mode is the value that occurs most frequently in a dataset. Unlike the mean and median, the mode can be applied to both numerical and categorical data.
A dataset may have one mode (unimodal), two modes (bimodal), or more than
two modes (multimodal). It is also possible for a dataset to have no mode if all
values occur with the same frequency.
Applications:
Mean is often used in situations where the data is normally distributed and outliers are not a concern, such as calculating average test scores.
Median is preferred when the data is skewed or contains outliers, such as reporting household incomes.
Mode is useful for identifying the most common value in a dataset, such as the most frequently occurring color in a survey.
Understanding the mean, median, and mode allows for a comprehensive analysis
of data distribution and central tendency, aiding in decision-making and
interpretation of datasets.
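For concreteness, a quick sketch of these three measures in Python using the standard library's statistics module; the datasets are the small examples above plus an illustrative color survey:

```python
import statistics

data_odd = [1, 3, 5, 6, 9]   # odd number of values -> the middle value is the median
data_even = [2, 4, 6, 8]     # even number of values -> average of the two middle values

print(statistics.mean(data_odd))      # 4.8
print(statistics.median(data_odd))    # 5
print(statistics.median(data_even))   # 5.0

# The mode works for categorical data as well; multimode returns all most-frequent values.
colors = ["red", "blue", "red", "green", "blue", "red"]
print(statistics.mode(colors))                 # 'red'
print(statistics.multimode([1, 1, 2, 2, 3]))   # [1, 2] -> a bimodal dataset
```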
1. Variance:
Variance measures the average squared deviation of each data point from the
mean of the dataset.
It quantifies the spread of the data points and indicates how much they
deviate from the mean.
2. Standard Deviation:
It represents the average distance of data points from the mean and is
expressed in the same units as the original data.
Formula: Standard Deviation (σ) = √(Σ[(x - μ)²] / n), where Σ represents the
sum, x represents each individual data point, μ represents the mean, and n
represents the total number of data points.
Since standard deviation is the square root of variance, they measure the
same underlying concept of data dispersion.
Applications:
Variance and standard deviation are used to quantify the spread of data points
in various fields such as finance, engineering, and social sciences.
They are essential for assessing the consistency and variability of data,
identifying outliers, and making predictions based on data patterns.
Example:
Consider the following dataset representing the daily temperatures (in degrees
Celsius) recorded over a week: {25, 26, 27, 24, 26, 28, 23}.
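A short Python check of these formulas on the temperature data, treating the week as the whole population (i.e. dividing by n):

```python
import numpy as np

temps = np.array([25, 26, 27, 24, 26, 28, 23])   # daily temperatures in °C

mean = temps.mean()                   # μ ≈ 25.57
variance = np.var(temps, ddof=0)      # population variance Σ(x - μ)²/n ≈ 2.53
std_dev = np.std(temps, ddof=0)       # population standard deviation ≈ 1.59 °C

# If the week were a sample from a longer period, ddof=1 divides by (n - 1) instead.
sample_variance = np.var(temps, ddof=1)   # ≈ 2.95

print(mean, variance, std_dev, sample_variance)
```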
Understanding variance and standard deviation provides valuable insights into the
variability and consistency of data, aiding in decision-making and analysis of
datasets.
Data Visualization
Data visualization is the graphical representation of data to communicate
information effectively and efficiently. It involves converting raw data into visual
formats such as charts, graphs, and maps to facilitate understanding, analysis,
and interpretation. Data visualization plays a crucial role in exploratory data
analysis, decision-making, and communication of insights in various fields
including business, science, healthcare, and academia.
Key Concepts:
1. Data Types: Data visualization techniques vary based on the type of data
being visualized. Common data types include:
Categorical Data: Represented using pie charts, bar charts, stacked bar charts, etc.
Numerical Data: Represented using histograms, line graphs, scatter plots, box plots, etc.
3. Visualization Tools: There are numerous software tools and libraries available
for creating data visualizations, including:
Graphical Tools: Microsoft Excel, Tableau, Google Data Studio, Power BI.
Programming Libraries: for example, Matplotlib, Seaborn, and Plotly in Python; ggplot2 in R; D3.js in JavaScript.
Example:
Consider a dataset containing sales data for a retail store over a year. To analyze
sales performance, various visualizations can be created:
A line graph showing sales trends over time, highlighting seasonal patterns or
trends.
A heatmap illustrating sales volume by day of the week and time of day.
By visualizing the sales data using these techniques, stakeholders can quickly
grasp key insights such as peak sales periods, top-selling products, and regional
sales patterns.
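As an illustration of one of these visualizations, here is a minimal line-graph sketch in Python with matplotlib; the monthly sales figures are made up for the example, not actual store data:

```python
import matplotlib.pyplot as plt

# Hypothetical monthly sales totals for one year (in thousands of dollars).
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
sales = [52, 48, 55, 60, 63, 70, 74, 72, 66, 61, 80, 95]

plt.plot(months, sales, marker="o")   # line graph to expose the trend over time
plt.title("Monthly Sales (hypothetical data)")
plt.xlabel("Month")
plt.ylabel("Sales (k$)")
plt.tight_layout()
plt.show()
```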
1. Random Variables:
For discrete random variables, the probability mass function (PMF) gives the probability that the random variable takes on a specific value.
For continuous random variables, the probability density function (PDF) describes the relative likelihood of the variable taking values in a given interval.
Both the PMF and PDF describe how probability is distributed across the possible values of the random variable.
Each distribution has its own set of parameters that govern its shape,
center, and spread.
Applications:
Example:
Consider a manufacturing process that produces light bulbs. The number of
defective bulbs produced in a day follows a Poisson distribution with a mean of 5
defective bulbs per day. By understanding the properties of the Poisson
distribution, such as its mean and variance, the manufacturer can assess the
likelihood of different outcomes and make informed decisions about process
improvements and quality control measures.
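Under the stated Poisson(5) assumption, such probabilities can be computed directly, for example with scipy:

```python
from scipy.stats import poisson

mu = 5  # mean number of defective bulbs per day

# Probability of exactly 3 defective bulbs in a day.
p_exactly_3 = poisson.pmf(3, mu)          # ≈ 0.140

# Probability of at most 7 defective bulbs (useful for quality-control limits).
p_at_most_7 = poisson.cdf(7, mu)          # ≈ 0.867

# Probability of more than 10 defective bulbs (an unusually bad day).
p_more_than_10 = 1 - poisson.cdf(10, mu)  # ≈ 0.014

print(p_exactly_3, p_at_most_7, p_more_than_10)
# For a Poisson distribution the mean and the variance are both equal to mu (5 here).
```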
Probability distributions provide a powerful framework for quantifying uncertainty
and analyzing random phenomena in diverse fields. Mastery of probability
distributions is essential for statistical analysis, decision-making, and modeling of
real-world processes.
Hypothesis Testing
Hypothesis testing is a statistical method used to make inferences about
population parameters based on sample data. It involves formulating two
competing hypotheses, the null hypothesis (H0) and the alternative hypothesis
(H1), and using sample evidence to determine which hypothesis is more plausible.
Hypothesis testing follows a structured process involving the selection of a
significance level, calculation of a test statistic, and comparison of the test
statistic to a critical value or p-value.
Key Concepts:
1. Null Hypothesis (H0):
The null hypothesis represents the status quo or the default assumption.
2. Alternative Hypothesis (H1):
The alternative hypothesis contradicts the null hypothesis and states the researcher's claim or hypothesis.
3. Significance Level (α):
The significance level is the probability threshold for rejecting the null hypothesis, commonly set at 0.05.
4. Test Statistic:
The test statistic is a numerical value calculated from sample data that
measures the strength of evidence against the null hypothesis.
The choice of test statistic depends on the type of hypothesis being tested
and the characteristics of the data.
1. Parametric Tests:
2. Nonparametric Tests:
Steps in Hypothesis Testing:
1. Formulate Hypotheses: State the null hypothesis (H0) and the alternative hypothesis (H1).
2. Choose Significance Level: Select the significance level (α), commonly 0.05.
3. Collect Sample Data: Collect and analyze sample data relevant to the hypothesis being tested.
4. Calculate Test Statistic: Compute the test statistic using the sample data and
the chosen test method.
5. Determine Critical Value or P-value: Determine the critical value from the
appropriate probability distribution or calculate the p-value.
6. Make Decision: Compare the test statistic to the critical value or p-value and
decide whether to reject or fail to reject the null hypothesis.
Example:
Suppose a researcher wants to test whether the mean weight of a certain species
of fish is different from 100 grams. The null and alternative hypotheses are
formulated as follows:
Null Hypothesis (H0): μ = 100 (Mean weight of fish is equal to 100 grams).
Alternative Hypothesis (H1): μ ≠ 100 (Mean weight of fish is not equal to 100 grams).
3. Collect Sample Data: Sample mean (x̄) = 105, Sample size (n) = 30.
5. Determine Critical Value or P-value: Look up the critical value from the t-
distribution table or calculate the p-value.
6. Make Decision: Compare the test statistic to the critical value or p-value.
7. Draw Conclusion: If the p-value is less than the significance level (α), reject
the null hypothesis. Otherwise, fail to reject the null hypothesis.
In this example, if the calculated p-value is less than 0.05, the researcher would
reject the null hypothesis and conclude that the mean weight of the fish is
significantly different from 100 grams.
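A sketch of this test in Python. The example does not report a sample standard deviation, so the value of 15 grams below is an assumption made purely to complete the calculation:

```python
import numpy as np
from scipy import stats

mu0 = 100      # hypothesized mean under H0 (grams)
x_bar = 105    # sample mean from the example
n = 30         # sample size from the example
s = 15         # sample standard deviation (assumed value, not given in the example)
alpha = 0.05

# One-sample t statistic: t = (x̄ - μ0) / (s / √n)
t_stat = (x_bar - mu0) / (s / np.sqrt(n))          # ≈ 1.83
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)    # two-sided p-value ≈ 0.08
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)      # critical value ≈ 2.045

if p_value < alpha:
    print("Reject H0: mean weight differs from 100 g")
else:
    print("Fail to reject H0: no significant difference detected")
```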
Understanding hypothesis testing allows researchers to draw meaningful
conclusions from sample data and make informed decisions based on statistical
evidence. It is a powerful tool for testing research hypotheses, analyzing data, and
drawing conclusions about population parameters.
Linear Algebra
Linear algebra is a branch of mathematics that deals with vector spaces and linear
mappings between them. It provides a framework for representing and solving
systems of linear equations, as well as analyzing geometric transformations and
structures. Linear algebra has applications in various fields including engineering,
computer science, physics, economics, and data analysis.
Key Concepts:
1. Scalars and Vectors:
Scalars are quantities that only have magnitude, such as real numbers.
Vectors are quantities that have both magnitude and direction, typically represented as ordered lists (arrays) of numbers.
2. Vector Operations:
Dot Product: Also known as the scalar product, it yields a scalar quantity
by multiplying corresponding components of two vectors and summing
the results.
Eigenvectors are nonzero vectors whose direction is unchanged (they are only scaled) by a linear transformation; the corresponding scaling factor is the eigenvalue.
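A brief NumPy sketch of these operations; the vectors and the matrix are arbitrary illustrations:

```python
import numpy as np

# Dot product: multiply corresponding components and sum the results.
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(np.dot(a, b))          # 1*4 + 2*5 + 3*6 = 32

# Eigenvalues/eigenvectors: A @ v = λ * v for each eigenpair (λ, v).
A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)           # eigenvalues 2.0 and 3.0
print(eigenvectors)          # columns are the eigenvectors (here the coordinate axes)

# Check the defining property for the first eigenpair.
v = eigenvectors[:, 0]
print(np.allclose(A @ v, eigenvalues[0] * v))   # True
```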
Applications:
Example:
Population Statistics
Population statistics refer to the quantitative measurements and analysis of
characteristics or attributes of an entire population. A population in statistics
represents the entire group of individuals, objects, or events of interest that share
common characteristics. Population statistics provide valuable insights into the
overall characteristics, trends, and variability of a population, enabling
researchers, policymakers, and businesses to make informed decisions and draw
meaningful conclusions.
Key Concepts:
1. Population Parameters:
4. Population Proportion:
5. Population Distribution:
Applications:
Example:
Suppose a city government wants to estimate the average household income of all
residents in the city. They collect income data from a random sample of 500 households. Using population statistics:
Population Mean (μ): The city government can use the sample mean as an
estimate of the population mean income, assuming the sample is
representative of the entire population.
Population Variance (σ²) and Standard Deviation (σ): Since the city
government only has sample data, they can estimate the population variance
and standard deviation using statistical formulas for sample variance and
sample standard deviation.
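A minimal sketch of this estimation step in Python, using hypothetical income values in place of the survey data; note ddof=1, which gives the unbiased sample variance:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# Hypothetical annual household incomes (in $1000s) for the 500 sampled households.
incomes = rng.normal(loc=62, scale=18, size=500)

# Point estimate of the population mean μ.
mean_estimate = incomes.mean()

# Sample variance and standard deviation (ddof=1 divides by n - 1, which makes
# the variance estimator unbiased for the population variance σ²).
var_estimate = incomes.var(ddof=1)
std_estimate = incomes.std(ddof=1)

print(mean_estimate, var_estimate, std_estimate)
```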
By analyzing population statistics, the city government can gain insights into the
income distribution, identify income disparities, and formulate policies to address
socioeconomic issues effectively.
Understanding population statistics is essential for making informed decisions,
conducting meaningful research, and addressing societal challenges based on
comprehensive and accurate data about entire populations.
1. Mathematical Methods:
Linear Algebra: Linear algebra involves the study of vectors, matrices, and systems of linear equations, with applications in solving linear systems, representing transformations, and working with data in matrix form.
2. Probability Theory:
Central Limit Theorem: The central limit theorem states that the
distribution of the sum (or average) of a large number of independent,
identically distributed random variables approaches a normal distribution,
regardless of the original distribution.
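The central limit theorem can be seen in a quick simulation: means of samples drawn from a strongly skewed population still end up approximately normal, with standard deviation σ/√n. The population below is simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# A heavily skewed (exponential) population -- clearly not normal.
population = rng.exponential(scale=2.0, size=100_000)

# Sampling distribution of the mean: average n observations, repeated many times.
n = 50
sample_means = rng.choice(population, size=(10_000, n)).mean(axis=1)

# The sample means cluster around the population mean with sd ≈ σ/√n,
# and their histogram is approximately bell-shaped despite the skewed source.
print(population.mean(), sample_means.mean())              # both ≈ 2.0
print(population.std() / np.sqrt(n), sample_means.std())   # both ≈ 0.28
```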
Applications:
4. Computer Science and Machine Learning: Probability theory forms the basis
of algorithms and techniques used in machine learning, pattern recognition,
artificial intelligence, and probabilistic graphical models, while mathematical
methods are used in algorithm design, computational geometry, and
optimization problems in computer science.
Example:
Consider a scenario where a company wants to model the daily demand for its
product. They collect historical sales data and use mathematical methods to fit a
probability distribution to the data. Based on the analysis, they find that the
demand follows a normal distribution with a mean of 100 units and a standard
deviation of 20 units.
Using probability theory, the company can make predictions about future demand,
estimate the likelihood of stockouts or excess inventory, and optimize inventory
levels to minimize costs while meeting customer demand effectively.
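For instance, with demand modelled as Normal(mean = 100, sd = 20), the risk of a stockout at a given (hypothetical) inventory level can be read directly off the fitted distribution:

```python
from scipy.stats import norm

mean_demand, sd_demand = 100, 20   # fitted normal distribution from the example

stock_level = 130                  # hypothetical inventory on hand for the day

# Probability that demand exceeds the stock on hand (a stockout).
p_stockout = norm.sf(stock_level, loc=mean_demand, scale=sd_demand)       # ≈ 0.067

# Inventory level needed to keep the stockout probability below 5%.
stock_for_95_service = norm.ppf(0.95, loc=mean_demand, scale=sd_demand)   # ≈ 132.9

print(p_stockout, stock_for_95_service)
```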
1. Sampling Distributions:
The central limit theorem states that the sampling distribution of the
sample mean approaches a normal distribution as the sample size
increases, regardless of the shape of the population distribution, provided
that the sample size is sufficiently large.
2. Point Estimation:
Common point estimators include the sample mean (for population mean
estimation) and the sample proportion (for population proportion
estimation).
Point estimators aim to provide the best guess or "point estimate" of the
population parameter based on available sample data.
3. Confidence Intervals:
Applications:
Example:
Suppose a researcher wants to estimate the average height of adult males in a
population. They collect a random sample of 100 adult males and calculate the
sample mean height to be 175 cm with a standard deviation of 10 cm.
Using statistical inference techniques:
Point Estimation: The researcher uses the sample mean (175 cm) as a point
estimate of the population mean height.
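A 95% confidence interval for the same example can be sketched as follows, using the t distribution because the population standard deviation is estimated from the sample:

```python
import numpy as np
from scipy import stats

x_bar, s, n = 175, 10, 100          # sample mean (cm), sample sd, sample size
conf = 0.95

se = s / np.sqrt(n)                                 # standard error = 1.0 cm
t_crit = stats.t.ppf((1 + conf) / 2, df=n - 1)      # ≈ 1.984
ci_low, ci_high = x_bar - t_crit * se, x_bar + t_crit * se

print(f"95% CI for the mean height: ({ci_low:.1f} cm, {ci_high:.1f} cm)")
# ≈ (173.0 cm, 177.0 cm): the population mean height is plausibly within this range.
```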
Quantitative Analysis
Quantitative analysis involves the systematic and mathematical examination of
data to understand and interpret numerical information. It employs various
statistical and mathematical techniques to analyze, model, and interpret data,
providing insights into patterns, trends, relationships, and associations within the
data. Quantitative analysis is widely used across disciplines such as finance,
economics, business, science, engineering, and social sciences to inform
decision-making, forecast outcomes, and derive actionable insights.
Key Concepts:
1. Data Collection:
2. Descriptive Statistics:
3. Inferential Statistics:
4. Regression Analysis:
Applications:
Example:
Suppose a retail company wants to analyze sales data to understand the factors
influencing sales revenue. They collect data on sales revenue, advertising
expenditure, store location, customer demographics, and promotional activities
over the past year.
Using quantitative analysis:
Time Series Analysis: The company examines sales data over time to identify
seasonal patterns, trends, and any cyclicality in sales performance.
By employing quantitative analysis techniques, the company can gain insights into
the drivers of sales revenue, identify opportunities for improvement, and optimize
marketing strategies to maximize profitability.
Unit - II
Statistical Modeling
Statistical modeling is a process of using statistical techniques to describe,
analyze, and make predictions about relationships and patterns within data. It
involves formulating mathematical models that represent the underlying structure
of data and capturing the relationships between variables. Statistical models are
used to test hypotheses, make predictions, and infer information about
populations based on sample data. Statistical modeling is widely employed across
various disciplines, including economics, finance, biology, sociology, and
engineering, to understand complex phenomena and inform decision-making.
Key Concepts:
1. Model Formulation:
The choice of model depends on the nature of the data, the research
question, and the assumptions underlying the modeling process.
2. Parameter Estimation:
3. Model Evaluation:
4. Model Selection:
Applications:
Example:
Suppose a pharmaceutical company wants to develop a statistical model to
predict the effectiveness of a new drug in treating a particular medical condition.
They collect data on patient characteristics, disease severity, treatment dosage,
and treatment outcomes from clinical trials.
Using statistical modeling:
Once validated, the model can be used to predict treatment outcomes for new
patients and inform clinical decision-making.
1. Model Specification:
This can include linear models, nonlinear models, hierarchical models, and
more complex structures.
2. Parameter Estimation:
3. Model Evaluation:
Model evaluation assesses how well the model fits the data and whether it
provides meaningful insights or predictions.
Inference involves using the fitted model to make conclusions about the
population parameters and test hypotheses.
3. Time Series Models: Used to analyze and forecast time-dependent data, such
as autoregressive integrated moving average (ARIMA) models and seasonal
decomposition models.
Applications:
Example:
Suppose a retail company wants to develop a statistical model to predict customer
churn. They collect data on customer demographics, purchase history, and
engagement metrics.
Using statistical modeling:
They estimate the model parameters using historical data and validate the
model's performance using a holdout dataset or cross-validation.
Once validated, the model can be used to identify at-risk customers and
implement targeted retention strategies.
1. Variability:
2. Hypothesis Testing:
The test statistic used in ANOVA is the F-statistic, which compares the
ratio of between-group variability to within-group variability.
3. Types of ANOVA:
4. Assumptions:
ANOVA assumes that the data within each group are normally distributed,
the variances of the groups are homogeneous (equal), and the
observations are independent.
Example:
Suppose a researcher wants to compare the effectiveness of three different
training programs on employee performance. They randomly assign employees to
three groups: Group A receives training program 1, Group B receives training
program 2, and Group C receives training program 3.
Using ANOVA:
The researcher collects performance data from each group and conducts a
one-way ANOVA to compare the mean performance scores across the three
groups.
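A minimal version of this analysis in Python with scipy; the performance scores below are hypothetical stand-ins for the researcher's data:

```python
from scipy import stats

# Hypothetical performance scores for the three training programs.
group_a = [78, 85, 82, 88, 75, 80]
group_b = [82, 86, 90, 85, 88, 84]
group_c = [70, 72, 68, 75, 71, 74]

# One-way ANOVA: the F statistic compares between-group to within-group variability.
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

# If p < 0.05, at least one program's mean performance differs; a post-hoc test
# (e.g. Tukey's HSD) would then identify which pairs of programs differ.
```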
By using ANOVA, the researcher can determine whether there are significant
differences in performance outcomes among the training programs and make
informed decisions about which program is most effective for improving employee
performance.
Analysis of variance is a versatile statistical technique with widespread
applications in experimental design, quality control, social sciences, and many
other fields. It provides valuable insights into group differences and helps
researchers draw meaningful conclusions from their data.
Gauss-Markov Theorem
The OLS estimator provides estimates of the coefficients that best fit the
observed data points in a least squares sense.
3. Gauss-Markov Theorem:
The Gauss-Markov theorem states that under certain conditions, the OLS
estimator is the best linear unbiased estimator (BLUE) of the coefficients in
a linear regression model.
Specifically, if the errors (residuals) in the model have a mean of zero, are
uncorrelated, and have constant variance (homoscedasticity), then the
OLS estimator is unbiased and has minimum variance among all linear
unbiased estimators.
Additionally, the OLS estimator is efficient in the sense that it achieves the
smallest possible variance among all linear unbiased estimators, making it
the most precise estimator under the specified conditions.
4. Finance and Business: In finance and business analytics, the theorem is used
to model relationships between financial variables, forecast future trends, and
assess the impact of business decisions.
Example:
Suppose a researcher wants to estimate the relationship between advertising
spending (X) and sales revenue (Y) for a particular product. They collect data on
advertising expenditures and corresponding sales revenue for several months and
fit a linear regression model to the data using OLS estimation.
Using the Gauss-Markov theorem:
If the assumptions of the theorem hold (e.g., errors have zero mean, are
uncorrelated, and have constant variance), then the OLS estimator provides
unbiased and efficient estimates of the regression coefficients.
The researcher can use the OLS estimates to assess the impact of advertising
spending on sales revenue and make predictions about future sales based on
advertising budgets.
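A sketch of this regression in Python using statsmodels; the advertising and sales figures are hypothetical illustrations, not real data:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical monthly advertising spend and sales revenue (both in $1000s).
advertising = np.array([10, 12, 15, 18, 20, 22, 25, 28, 30, 35])
sales = np.array([120, 130, 148, 160, 175, 180, 198, 210, 222, 245])

X = sm.add_constant(advertising)      # adds the intercept column
model = sm.OLS(sales, X).fit()        # ordinary least squares estimates

print(model.params)                   # [intercept, slope]
print(model.summary())                # standard errors, t-tests, R², etc.

# Under the Gauss-Markov conditions (zero-mean, uncorrelated, homoscedastic errors),
# these OLS estimates are BLUE: unbiased with minimum variance among linear
# unbiased estimators.
```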
The OLS regression line is the line that best fits the observed data points
by minimizing the sum of squared vertical distances (residuals) between
the observed yᵢ values and the corresponding predicted values on the
regression line.
The residual for each observation is the vertical distance between the
observed yᵢ value and the predicted value on the regression line.
Each observed data point (xᵢ, yᵢ) can be projected onto the regression line to obtain the predicted value ŷᵢ.
The vertical distance between the observed data point and its projection
onto the regression line represents the residual for that observation.
4. Minimization of Residuals:
2. Assessment of Model Fit: Geometric insights can help assess the adequacy
of the regression model by examining the distribution of residuals around the
regression line. A good fit is indicated by residuals that are randomly scattered
around the line with no discernible pattern.
Example:
Consider a simple linear regression predicting students' exam scores from a single predictor such as hours studied.
Each observed data point can be projected onto the regression line to obtain
the predicted exam score.
The vertical distance between each data point and its projection onto the
regression line represents the residual for that observation.
The OLS regression line is chosen to minimize the sum of squared residuals,
ensuring that the residuals are orthogonal to the line.
By understanding the geometry of least squares, analysts can gain insights into
how the OLS estimator works geometrically, facilitating better interpretation and
application of regression analysis in various fields.
In summary, the geometry of least squares provides a geometric perspective on
the OLS estimation method in linear regression analysis. It visualizes the
relationship between observed data points and the fitted regression line, aiding in
understanding OLS properties, model diagnostics, and interpretation of regression
results.
Each observed data point corresponds to a vector in the space, where the
components represent the values of the independent variables.
In the context of linear models, the space spanned by the observed data
points is the data subspace, while the space spanned by the regression
coefficients is the coefficient subspace.
Basis vectors are vectors that span a subspace, meaning that any vector in
the subspace can be expressed as a linear combination of the basis
vectors.
The projection of a data point onto the coefficient subspace represents the
predicted response value for that data point based on the linear model.
The difference between the observed response value and the projected
value is the residual, representing the error or discrepancy between the
observed data and the model prediction.
4. Orthogonal Decomposition:
Example:
Consider a simple linear regression model with one independent variable (x) and
one dependent variable (y). The subspace formulation represents the observed
data points (xᵢ, yᵢ) as vectors in a two-dimensional space, where xᵢ is the
independent variable value and yᵢ is the corresponding dependent variable value.
Using the subspace formulation:
The data subspace is spanned by the observed data points, representing the
space of possible values for the dependent variable given the independent
variable.
The regression line is the projection of the data subspace onto the coefficient
subspace, representing the best linear approximation to the relationship
between x and y.
Example:
Key Concepts:
In regression analysis, the observed data points are projected onto the
model space defined by the regression coefficients.
2. Orthogonality of Residuals:
The least squares criterion aims to minimize the sum of squared residuals,
which is equivalent to finding the orthogonal projection of the data onto
the model space.
4. Orthogonal Decomposition:
Applications:
Example:
Consider a simple linear regression model with one predictor variable (X) and one
response variable (Y). The goal is to estimate the regression coefficients
(intercept and slope) that best describe the relationship between X and Y.
Using least squares estimation:
The observed data points (Xᵢ, Yᵢ) are projected onto the model space spanned
by the predictor variable X.
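This projection view can be verified numerically: after an OLS fit, the residual vector is orthogonal to every column of the design matrix (and hence to the fitted values). A small NumPy sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical predictor and response with a roughly linear relationship.
x = rng.uniform(0, 10, size=50)
y = 3.0 + 2.0 * x + rng.normal(scale=1.5, size=50)

X = np.column_stack([np.ones_like(x), x])          # design matrix [1, x]
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]    # OLS estimates via least squares

fitted = X @ beta_hat
residuals = y - fitted

# The residual vector is orthogonal to every column of the design matrix,
# i.e. to the model space onto which y was projected.
print(X.T @ residuals)      # both entries ≈ 0 (up to floating-point error)
print(fitted @ residuals)   # also ≈ 0: residuals ⟂ fitted values
```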
1. Factorial Design:
2. Main Effects:
The main effect of a factor refers to the average change in the response
variable associated with changing the levels of that factor, while holding
other factors constant.
Main effects represent the overall influence of each factor on the response
variable, ignoring interactions with other factors.
3. Interaction Effects:
Interaction effects occur when the effect of one factor on the response
variable depends on the level of another factor.
4. Factorial Notation:
The notation "k1 x k2 x ... x kn" represents a factorial design with k1 levels
of the first factor, k2 levels of the second factor, and so on.
Advantages:
Applications:
2. Model Formula:
3. Assumptions:
4. Hypothesis Testing:
Applications:
1. Residuals:
2. Types of Residuals:
3. Residual Analysis:
4. Influence Diagnostics:
Advantages:
Applications:
Example:
Suppose a researcher conducts a multiple linear regression analysis to predict
housing prices based on various predictor variables such as square footage,
number of bedrooms, and location. After fitting the regression model, the
researcher performs regression diagnostics to evaluate the model's performance
and reliability.
The researcher conducts the following diagnostic checks:
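Checks of this kind typically include a residuals-versus-fitted plot, a normality check on the residuals, and influence measures such as Cook's distance. A minimal sketch using statsmodels and matplotlib, with hypothetical housing data standing in for the researcher's dataset:

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=7)

# Hypothetical housing data: price vs square footage and number of bedrooms.
sqft = rng.uniform(800, 3000, size=100)
beds = rng.integers(1, 5, size=100)
price = 50_000 + 120 * sqft + 8_000 * beds + rng.normal(scale=20_000, size=100)

X = sm.add_constant(np.column_stack([sqft, beds]))
results = sm.OLS(price, X).fit()

# 1. Residuals vs fitted values: look for non-random patterns
#    (non-linearity, heteroscedasticity).
plt.scatter(results.fittedvalues, results.resid)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# 2. Normality of residuals: points on a Q-Q plot should lie near the line.
sm.qqplot(results.resid, line="45", fit=True)
plt.show()

# 3. Influence diagnostics: large Cook's distances flag influential observations.
cooks_d = results.get_influence().cooks_distance[0]
print("Most influential observations:", np.argsort(cooks_d)[-5:])
```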
1. Logarithmic Transformation:
Log transformations are useful for dealing with data that exhibit
exponential growth or decay, such as financial data, population growth
rates, or reaction kinetics.
2. Square Root Transformation:
Square root transformations involve taking the square root of the variable.
3. Reciprocal Transformation:
Reciprocal transformations are useful for dealing with data that exhibit a
curvilinear relationship, where the effect of the predictor variable on the
response variable diminishes as the predictor variable increases.
4. Exponential Transformation:
Choosing Transformations:
1. Visual Inspection:
2. Statistical Tests:
Applications:
Example:
Suppose a researcher conducts a regression analysis to predict house prices
based on square footage (X1) and number of bedrooms (X2). However, the
scatterplot of house prices against square footage shows a curved relationship,
indicating the need for a transformation.
The researcher decides to apply a logarithmic transformation to the square footage variable (X1_log = log(X1)) before fitting the regression model. The transformed model becomes:
Price = β0 + β1 · X1_log + β2 · X2 + ε
This often straightens the curved relationship so that a linear model fits the data well.
The Box-Cox transformation assumes that the data are strictly positive;
therefore, it is not suitable for non-positive data.
Applications:
2. Time Series Analysis: In time series analysis, the Box-Cox transformation can be applied to stabilize the variance of time series data and remove trends or seasonal effects before modeling.
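A short sketch of the Box-Cox transformation in Python using scipy; the skewed, strictly positive data below are simulated for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)

# Strictly positive, right-skewed data (e.g. reaction times or incomes).
data = rng.lognormal(mean=2.0, sigma=0.8, size=200)

# Box-Cox estimates the power parameter λ that makes the data most normal-like;
# λ = 0 corresponds to a log transformation.
transformed, lam = stats.boxcox(data)
print(f"Estimated λ ≈ {lam:.2f}")

# Skewness should be much closer to 0 after the transformation.
print(stats.skew(data), stats.skew(transformed))
```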
2. Model Complexity:
4. Model Interpretability:
Model interpretability refers to the ease with which the model's predictions
can be explained and understood by stakeholders.
Strategies:
2. Iterative Model Building: Iteratively add or remove variables from the model
based on their significance and contribution to model performance.
Applications:
Example:
Suppose a data scientist is tasked with building a predictive model to forecast
housing prices based on various predictor variables such as square footage, number of bedrooms, and location.
3. Model Building: Start with a simple linear regression model using the selected
predictor variables and assess its performance using cross-validation
techniques (e.g., k-fold cross-validation).
By following these model selection and building strategies, the data scientist can
develop a reliable predictive model for housing price forecasting that effectively
captures the relationships between predictor variables and housing prices while
ensuring robustness and generalizability.
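The cross-validation step mentioned above might look like the following sketch in scikit-learn, with hypothetical housing features generated for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(seed=11)

# Hypothetical predictors: square footage, bedrooms, distance to city centre (km).
X = np.column_stack([
    rng.uniform(800, 3000, size=200),
    rng.integers(1, 5, size=200),
    rng.uniform(1, 30, size=200),
])
y = (60_000 + 110 * X[:, 0] + 9_000 * X[:, 1]
     - 1_500 * X[:, 2] + rng.normal(scale=25_000, size=200))

# 5-fold cross-validation: the average out-of-sample R² guards against
# overfitting when comparing candidate models.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(scores.mean(), scores.std())
```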
1. Binary Outcome:
2. Logit Function:
The logistic regression model uses the logit function to model the
relationship between the predictor variables and the probability of the
binary outcome.
The logit function is defined as the natural logarithm of the odds: logit(p) = ln(p / (1 − p)), where p is the probability of the event occurring.
3. Model Equation:
4. Interpretation of Coefficients:
The regression coefficients (β) in logistic regression represent the change in the log-odds of the outcome for a one-unit change in the predictor variable, holding other variables constant.
Exponentiating the coefficients yields the odds ratio, which represents the
multiplicative change in the odds of the outcome for a one-unit increase in
the predictor variable.
Assumptions:
1. Linearity in the Logit: The relationship between the predictor variables and
the log-odds of the outcome is assumed to be linear.
4. Large Sample Size: Logistic regression performs well with large sample sizes.
Applications:
Example:
Suppose a bank wants to predict whether a credit card transaction is fraudulent
based on transaction features such as transaction amount, merchant category,
and time of day. The bank collects historical data on credit card transactions,
including whether each transaction was fraudulent or not.
The bank decides to use logistic regression to build a predictive model. They
preprocess the data, splitting it into training and testing datasets. Then, they fit a
logistic regression model to the training data, with transaction features as
predictor variables and the binary outcome variable (fraudulent or not) as the
response variable.
After fitting the model, they evaluate its performance using metrics such as
accuracy, precision, recall, and the area under the ROC curve (AUC-ROC) on the
testing dataset. The bank uses these metrics to assess the model's predictive
accuracy and determine its suitability for detecting fraudulent transactions in real-
time.
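A simplified sketch of this workflow in Python with scikit-learn; the transactions below are synthetic stand-ins for the bank's data, and the feature encoding is deliberately minimal:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

rng = np.random.default_rng(seed=5)
n = 5_000

# Synthetic transaction features: amount, merchant category (encoded), hour of day.
X = np.column_stack([
    rng.exponential(scale=80, size=n),
    rng.integers(0, 10, size=n),
    rng.integers(0, 24, size=n),
])
# Synthetic label: fraud is more likely for large, late-night transactions.
logit = -5 + 0.01 * X[:, 0] + 1.5 * (X[:, 2] >= 22)
y = rng.random(n) < 1 / (1 + np.exp(-logit))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]
preds = model.predict(X_test)

print("accuracy :", accuracy_score(y_test, preds))
print("precision:", precision_score(y_test, preds, zero_division=0))
print("recall   :", recall_score(y_test, preds, zero_division=0))
print("AUC-ROC  :", roc_auc_score(y_test, probs))

# exp(coefficient) gives the odds ratio for a one-unit increase in each feature.
print("odds ratios:", np.exp(model.coef_))
```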
4. Interpretation of Coefficients:
Exponentiating the coefficients yields the incidence rate ratio (IRR), which
represents the multiplicative change in the expected count of the event for
a one-unit increase in the predictor variable.
Assumptions:
The relationship between the predictor variables and the log expected
count of the event is assumed to be linear.
3. No Overdispersion:
Applications:
Example:
Suppose a researcher wants to study the factors influencing the number of
customer complaints received by a company each month. The researcher collects
data on various predictor variables, including product type, customer
demographics, and service quality ratings.
The researcher decides to use Poisson regression to model the count of customer
complaints as a function of the predictor variables. They preprocess the data,
splitting it into training and testing datasets. Then, they fit a Poisson regression
model to the training data, with predictor variables as covariates and the count of
customer complaints as the outcome variable.
After fitting the model, they assess the model's goodness of fit using diagnostic
tests and evaluate the significance of the predictor variables using hypothesis
tests. Finally, they use the model to make predictions on the testing dataset and
assess its predictive accuracy.
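A condensed sketch of this workflow in Python using a statsmodels Poisson GLM; the complaint counts and predictors are simulated for illustration, and the train/test split is omitted for brevity:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(seed=8)
n = 300

# Hypothetical monthly data: service quality rating (1-10) and product type (0/1).
quality = rng.integers(1, 11, size=n)
product = rng.integers(0, 2, size=n)

# Complaint counts generated so that better quality means fewer complaints.
rate = np.exp(2.0 - 0.15 * quality + 0.3 * product)
complaints = rng.poisson(rate)

X = sm.add_constant(np.column_stack([quality, product]))
model = sm.GLM(complaints, X, family=sm.families.Poisson()).fit()

print(model.summary())
# Exponentiated coefficients are incidence rate ratios (IRRs): the multiplicative
# change in the expected complaint count per one-unit increase in a predictor.
print("IRRs:", np.exp(model.params))
```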
In summary, Poisson regression models are valuable tools for analyzing count
data and understanding the factors influencing the frequency of events or
occurrences. They provide insights into the relationship between predictor
variables and event rates, allowing researchers to make informed decisions in
various fields of study.