Statistical Modeling Notes

UNIT 1
BASIC DEFINITIONS AND FACTS-REVISION

DEFINITIONS (RECALLING)
1. Random Experiment or Statistical Experiment
Is an experiment in which the outcome cannot be predetermined with
certainty.
For example, tossing a coin or rolling a die is a random experiment.
2. Sample Space
Is the set that contains all the possible outcomes of a random experiment.
A sample space will be denoted by the symbol Ω.
For example, when a coin is tossed once, Ω = {H, T}.
3. Event
Is a subset of the sample space.
Denoted by E
A coin is tossed once. Write down all the possible events.
The possible events will be {H,T}, {H}, {T}, {}.
4. Mutually Exclusive Events
Events are said to be mutually exclusive if the occurrence of one event
excludes the occurrence of all the others.
For example, a coin landing tails excludes the possibility of it landing
heads on the same toss.
5. Probability
Probability of an event, denoted P(E), is a number that satisfies the following
conditions or axioms:
a) 0 ≤ P(E) ≤ 1
b) P(Ω) = 1
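c) If $E_1, E_2, \ldots$ are mutually exclusive events, then
$P(E_1 \cup E_2 \cup \cdots) = P(E_1) + P(E_2) + \cdots$
Informally, probability can be viewed as the ratio of the number of
experimental outcomes of interest to the total number of trials, as the number
of trials increases to infinity.
• Thus, $P(E) = \lim_{n \to \infty} \frac{n_E}{n}$, where $n_E$ is the number of
the $n$ trials in which E occurs.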
6. Random Variable
A random variable is a function that maps a sample space into a set of real
numbers.
For example, a coin is tossed once. Construct a random variable of your
choice. Suppose, for instance, we define X by X(H) = 1 and X(T) = 0; then X
is a random variable.
Again, tossing a coin twice would mean Ω = {HH, HT, TH, TT}.
Let the random variable X represent the number of heads; then we must have
X(HH) = 2, X(HT) = X(TH) = 1 and X(TT) = 0.
7. Probability Distribution
Is a function that assigns a probability to each event or outcome of a random
experiment.
• It is also described as a function that maps each value of a random variable
to a probability.
For example, the table below shows a probability distribution.
Number of heads   0      1     2
Probability       0.25   0.5   0.25
8. Parameter and Statistic
A parameter is a summary measure of a population, such as the population
mean, while a statistic is a summary measure of a sample, such as the sample
mean, which is used to estimate the parameter.
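The definitions above can be tied together with a small sketch. It enumerates the sample space for two tosses of a fair coin, defines the random variable "number of heads", and derives its probability distribution. The names `sample_space`, `number_of_heads` and `distribution` are illustrative choices, and equally likely outcomes are assumed.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space for two tosses of a fair coin: every ordered pair of H/T.
sample_space = ["".join(outcome) for outcome in product("HT", repeat=2)]

def number_of_heads(outcome):
    """Random variable X: maps an outcome such as 'HT' to a real number."""
    return outcome.count("H")

# Outcomes are assumed equally likely, so
# P(X = k) = (number of outcomes with X = k) / (size of the sample space).
counts = Counter(number_of_heads(outcome) for outcome in sample_space)
distribution = {k: Fraction(c, len(sample_space)) for k, c in sorted(counts.items())}

for k, p in distribution.items():
    print(k, float(p))
```

The printed probabilities agree with the table of the number of heads given under the definition of a probability distribution.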
STATISTICAL MODEL
• Is a probability distribution that is constructed on a data set to enable
inferences to be drawn about the population.
• Is also defined as a set of probability distributions on a sample space.
• For example, suppose the probability distribution given in Table 1.1 is
for the distribution of the ages of students in your class.
• Table 1.1
Age           30     41    52
Probability   0.25   0.5   0.25
• If your class has 1000 students, how many students do you expect to be
aged 30, 41 or 52?
EXAMPLE CONTINUED
• Solution: multiply each probability by the class size of 1000.
• Table 1.2
Age                  30    41    52
Number of students   250   500   250
• If you went somewhere to collect data and found that the data you
collected looked like the data in Table 1.2, you would reasonably
conclude that its probability distribution is the one in Table 1.1.
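The arithmetic behind the age example can be sketched as follows. The variable names are illustrative; the probabilities and class size are the ones stated in the example.

```python
# Probability distribution of ages (Table 1.1 in the notes).
age_distribution = {30: 0.25, 41: 0.5, 52: 0.25}
class_size = 1000

# Expected count in each category = class size x probability of that category.
expected_counts = {age: class_size * p for age, p in age_distribution.items()}

# Going the other way: the relative frequencies of the counts
# recover the probabilities of the model.
relative_frequencies = {age: n / class_size for age, n in expected_counts.items()}

print(expected_counts)
print(relative_frequencies)
```

Real data would rarely match the expected counts exactly, so in practice the relative frequencies only approximate the model probabilities.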
COMMON STATISTICAL
DISTRIBUTIONS
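1. The random variable X is said to follow the normal distribution with
mean µ and variance σ² if and only if its probability density function
(pdf) is
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$, where −∞ < x < ∞.
2. The random variable X is said to follow the Poisson distribution with
mean λ if and only if its probability mass function (pmf) is
$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$, where k = 0, 1, 2, …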
3. The random variable X is said to follow the binomial distribution
with parameters (n, p) if and only if its probability mass function
(pmf) is
$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$, where k = 0, 1, …, n.
4. The random variable X is said to follow an exponential distribution
with parameter λ > 0 if and only if its probability density function is
$f(x) = \lambda e^{-\lambda x}$ for x ≥ 0, and $f(x) = 0$ otherwise.
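As a quick check on the four distributions named above, here is a minimal sketch that evaluates their standard textbook pdf/pmf formulas using only Python's `math` module. The function names are illustrative, not taken from the notes.

```python
import math

def normal_pdf(x, mu, sigma2):
    """Normal density with mean mu and variance sigma2."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def poisson_pmf(k, lam):
    """Poisson probability of k events when the mean is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

def binomial_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def exponential_pdf(x, lam):
    """Exponential density with rate lam; zero for negative x."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

# Binomial(2, 0.5) is the number of heads in two tosses of a fair coin.
print([binomial_pmf(k, 2, 0.5) for k in range(3)])
```

The binomial values for n = 2 and p = 0.5 reproduce the number-of-heads table from the definitions, which is a useful sanity check on the formula.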
TYPES OF STATISTICAL MODELS
• Statistical models are generally categorised into three: Parametric,
semi-parametric and non-parametric models.
1. Parametric model
A statistical model is said to be parametric if its distribution is completely
specified by a finite set of parameters.
For example, if we let Θ be the set of parameters, then for
• Normal distribution: Θ = {µ, σ²}
• Poisson distribution: Θ = {λ}
• Bernoulli distribution: Θ = {p}
• Binomial distribution: Θ = {n, p}
We write the pdf as f(x; Θ) to emphasize the parameter.
2. Non-parametric model
A statistical model is said to be non-parametric when it cannot be
parameterized by a fixed number of parameters.
Examples of non-parametric methods include the sign test and the Wilcoxon
signed-rank test.
3. Semi-parametric model
A statistical model is said to be semi-parametric if it has both
finite-dimensional and infinite-dimensional parameters.
For example, Cox regression for survival analysis.
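The contrast between the parametric and non-parametric cases can be illustrated with a small sketch. It uses a simulated sample (an assumption made for illustration, not data from the notes): the parametric description needs only two estimated numbers, while the empirical distribution function depends on the whole sample.

```python
import random
import statistics

rng = random.Random(42)
# Hypothetical sample: 500 draws from a normal distribution (mean 10, sd 2).
sample = [rng.gauss(10.0, 2.0) for _ in range(500)]

# Parametric: assume a normal model, so two parameters describe everything.
mu_hat = statistics.fmean(sample)
sigma2_hat = statistics.pvariance(sample, mu_hat)

def normal_cdf(x):
    """P(X <= x) under the fitted normal model."""
    return statistics.NormalDist(mu_hat, sigma2_hat ** 0.5).cdf(x)

# Non-parametric: the empirical CDF makes no distributional assumption;
# it is described by the whole sample, so its complexity grows with n.
def empirical_cdf(x):
    """Fraction of sample values that are <= x."""
    return sum(1 for value in sample if value <= x) / len(sample)

print(round(mu_hat, 2), round(sigma2_hat, 2))
print(normal_cdf(10.0), empirical_cdf(10.0))
```

When the normal assumption is reasonable, the two estimates of P(X ≤ 10) should be close; when it is not, only the empirical version remains trustworthy.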
GENERAL LINEAR MODELS
• Linear Statistical Model
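A linear model $y = \beta_0 + \beta_1 x + \varepsilon$ is said to be a statistical
model if the error term $\varepsilon$ has its own distribution, i.e. it is a
random variable, usually $\varepsilon \sim N(0, \sigma^2)$.
Taking the expectation of the linear equation gives the mean value
$E(y) = \beta_0 + \beta_1 x$.
For many independent variables, the equation would be
$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon$, where each
independent variable is continuous and the dependent variable is also continuous.
This is called the General Linear Regression Model.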
ASSUMPTIONS OF GENERAL LINEAR
MODELS
• Linearity - We assume that each independent variable $x_j$ is linearly
related to the dependent variable $y$.
• Independence - We assume that our observations are independent of each
other (the errors $\varepsilon$ are independent, and so are the values of $y$).
• Normality - We assume that the error terms are independently and
identically distributed (iid) normal with mean zero and variance $\sigma^2$
($\varepsilon \sim N(0, \sigma^2)$).
• Equality of Variances - We assume that the data points are homoscedastic,
i.e. equally scattered about the regression line regardless of the
values of the independent variables.
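To make the assumptions concrete, the sketch below simulates data that satisfy them (one predictor, independent normal errors with constant variance) and recovers the coefficients by ordinary least squares. The true values `beta0`, `beta1` and `sigma` are hypothetical, chosen only for illustration.

```python
import random

rng = random.Random(0)
beta0, beta1, sigma = 1.0, 2.0, 1.0   # hypothetical true values

# Simulate y = beta0 + beta1 * x + error, with iid N(0, sigma^2) errors.
xs = [rng.uniform(0.0, 10.0) for _ in range(1000)]
ys = [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]

# Ordinary least squares estimates for one predictor (closed form).
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar

# Residuals should be centred on zero with variance close to sigma^2.
residuals = [y - (beta0_hat + beta1_hat * x) for x, y in zip(xs, ys)]
residual_variance = sum(r ** 2 for r in residuals) / (len(xs) - 2)

print(beta0_hat, beta1_hat, residual_variance)
```

With data that violate the assumptions (for example, error variance growing with x), the same code still runs, but the residual variance no longer summarises the scatter at every value of x.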
GENERALIZED LINEAR MODELS
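• Generalized linear models are usually of the form
$f(\mu) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$, where $\mu = E(Y)$ is the
mean of the outcome variable Y.
• Generalized linear models relax the last two assumptions of the general
linear model (normality and equality of variances).
• They generalize the possible distributions of the outcome variable to a
family of distributions called the exponential family.
• For example, the binomial, Poisson and exponential distributions.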
• When you change the distribution of the outcome variable, it turns out that
the relationship between the mean of Y and the model parameters is no longer
linear.
• However, for each distribution in the exponential family, there exists at
least one function f(µ) of the mean of Y whose relationship with the
model parameters is linear.
• This function is called the link function.
• The link function you choose will depend on which distribution you are
choosing for the outcome variable.
• For example, a binomial outcome can use a probit or a logit link function.
• A Poisson outcome typically uses a log link function.
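The logit and log links mentioned above can be written out directly. This is a minimal sketch with hypothetical coefficients, showing how a linear predictor is mapped back to a mean on the right scale for each distribution.

```python
import math

def logit(mu):
    """Logit link for a binomial mean 0 < mu < 1."""
    return math.log(mu / (1.0 - mu))

def inverse_logit(eta):
    """Maps any real linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def log_link(mu):
    """Log link for a Poisson mean mu > 0."""
    return math.log(mu)

def inverse_log_link(eta):
    """Maps any real linear predictor back to a positive mean."""
    return math.exp(eta)

# Hypothetical coefficients and predictor value, for illustration only.
beta0, beta1, x = -1.0, 0.5, 3.0
eta = beta0 + beta1 * x             # linear predictor, here 0.5

print(inverse_logit(eta))           # a probability between 0 and 1
print(inverse_log_link(eta))        # a positive expected count
```

The design point is that the linear predictor can take any real value, while the inverse link guarantees a mean that is valid for the chosen distribution.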
STRUCTURAL EQUATION MODELS
• In defining structural equation models, we first discuss the important
components of the modeling, which are factor analysis, path analysis and
regression analysis.
• Factor analysis is a statistical method used to describe variability among
observed, correlated variables in terms of a potentially lower number of
unobserved variables called factors.
• For example, it is possible that variations in 4 observed variables mainly reflect the
variations in two unobserved variables.
• Like principal components analysis, factor analysis is a multivariate method used for
data reduction purposes.
• It is used for interval data, although ordinal data can also be used (e.g. a Likert
scale with levels of happiness such as happy, most happy, happiest).
• Variables in factor analysis should be linearly related to each other.
• Regression Analysis is a statistical process for estimating the relationships
among variables.
We can estimate the values of the parameters in the model if we have
sufficient data values.
Finding such estimates and putting them in the model is what is known as
regression analysis.
• Path Analysis compares two or more causal models using the correlation
matrix.
The path of the model is shown by a square and an arrow, which shows the
causation.
The regression weights are predicted by the model. A goodness-of-fit statistic
is then calculated in order to assess how well the model fits.
• Path Model is a diagram showing independent, intermediate, and
dependent variables.
• A single-headed arrow shows a causal effect among the independent,
intermediate and dependent variables.
• A double-headed arrow shows the covariance between two variables.
• Exogenous variables have causes that lie outside the model, while endogenous
variables are determined by variables within the model.
• Structural equation modeling (SEM) is a methodology for
representing, estimating, and testing a network of relationships
between variables (measured variables and latent constructs).
• Structural equation modeling (SEM) uses various types of models to
depict relationships among observed variables, with the same basic
goal of providing a quantitative test of a theoretical model
hypothesized by the researcher.
• For example, a health care professional might believe that a good diet and
regular exercise reduce the risk of a heart attack.
• Thus, if one of the two recommendations is not followed, it is likely that
one can suffer a heart attack, although there are many underlying
factors for a heart attack other than the two recommended activities.
• The goal of structural equation model analysis is to determine the
extent to which the theoretical model is supported by sample data.
• If the sample data support the theoretical model, then more complex
theoretical models can be hypothesized.
• SEM includes factor analysis, path analysis and regression analysis.
MULTILEVEL MODELS
• They are also called hierarchical linear models, nested models, mixed models, random
coefficient models, random-effects models, random parameters models or split-plot
designs.
• They are statistical models of parameters that vary at more than one level.
• Multilevel models recognise the existence of data hierarchies by allowing for residual
components at each level in the hierarchy.
• For example, children with the same parents tend to be more alike in their physical and
mental characteristics than individuals chosen at random from the population at large.
• Individuals may be further nested within geographical areas or institutions such as
schools or employers.
• Multilevel data structures also arise in longitudinal studies where an individual's
responses over time are correlated with each other.
• Thus, for children nested within schools, the residual variance is
partitioned into a between-school component (the variance of the
school-level residuals) and a within-school component (the variance of the
child-level residuals).
• The school residuals, often called school effects, represent unobserved
school characteristics that affect the child outcomes.
• These unobserved variables lead to correlation between outcomes for
children from the same school.
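• In the simplest case, multilevel models have two levels: the measurement
level and the subject level.
• Such a model has the general regression form
$y_{ij} = \beta_0 + \beta_1 x_{ij} + u_j + \varepsilon_{ij}$, where $i$ indexes
measurements and $j$ indexes subjects.
• Note the presence of two random variables: the measurement-level random
variable $\varepsilon_{ij}$ and the subject-level random variable $u_j$.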
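The partition into between-school and within-school components described above can be sketched with simulated data. Everything here is an assumption made for illustration: the numbers of schools and children, the two standard deviations, and the simple one-way ANOVA estimator (a dedicated mixed-model routine would normally be used instead).

```python
import random
import statistics

rng = random.Random(1)
n_schools, n_children = 200, 20
sigma_u, sigma_e = 1.0, 2.0           # hypothetical school / child residual sds

# Outcome = overall mean + school residual u_j + child residual e_ij.
schools = []
for _ in range(n_schools):
    u_j = rng.gauss(0.0, sigma_u)
    schools.append([50.0 + u_j + rng.gauss(0.0, sigma_e) for _ in range(n_children)])

# Within-school variance: average variance of children around their school mean
# (equal group sizes, so the plain average is the pooled estimate).
within_var = statistics.fmean([statistics.variance(s) for s in schools])

# The variance of the school means overestimates the between-school variance
# by within_var / n_children, so subtract that part (one-way ANOVA estimator).
school_means = [statistics.fmean(s) for s in schools]
between_var = statistics.variance(school_means) - within_var / n_children

# Intraclass correlation: share of the total variance that is due to schools.
icc = between_var / (between_var + within_var)

print(within_var, between_var, icc)
```

With the values assumed here the true intraclass correlation is 1 / (1 + 4) = 0.2, so the estimate should land near that figure. A non-zero value is what signals correlation between children from the same school.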
ADVANTAGES OF MULTILEVEL
MODELS
• They allow variation between and within subjects to be examined at
once.
• They can be applied to clustered data.
• They use few parameters.
