
Statistical Model

Uploaded by

hasan jami
Copyright
© All Rights Reserved

Statistical model

A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the
generation of sample data (and similar data from a larger population). A statistical model represents, often in
considerably idealized form, the data-generating process.[1]

A statistical model is usually specified as a mathematical relationship between one or more random
variables and other non-random variables. As such, a statistical model is "a formal representation of a
theory" (Herman Adèr quoting Kenneth Bollen).[2]

All statistical hypothesis tests and all statistical estimators are derived via statistical models. More generally,
statistical models are part of the foundation of statistical inference.

Contents
Introduction
Formal definition
An example
General remarks
Dimension of a model
Nested models
Comparing models
See also
Notes
References
Further reading

Introduction
Informally, a statistical model can be thought of as a statistical assumption (or set of statistical assumptions)
with a certain property: that the assumption allows us to calculate the probability of any event. As an
example, consider a pair of ordinary six-sided dice. We will study two different statistical assumptions
about the dice.

The first statistical assumption is this: for each of the dice, the probability of each face (1, 2, 3, 4, 5, and 6)
coming up is $\tfrac{1}{6}$. From that assumption, we can calculate the probability of both dice coming up 5:
$\tfrac{1}{6} \times \tfrac{1}{6} = \tfrac{1}{36}$. More generally, we can calculate the probability of any event:
e.g. (1 and 2) or (3 and 3) or (5 and 6).

The alternative statistical assumption is this: for each of the dice, the probability of the face 5 coming up is
$\tfrac{1}{8}$ (because the dice are weighted). From that assumption, we can calculate the probability of both dice
coming up 5: $\tfrac{1}{8} \times \tfrac{1}{8} = \tfrac{1}{64}$. We cannot, however, calculate the probability of
any other nontrivial event, as the probabilities of the other faces are unknown.

The first statistical assumption constitutes a statistical model, because with the assumption alone we can
calculate the probability of any event. The alternative statistical assumption does not constitute a statistical
model, because with the assumption alone we cannot calculate the probability of every event.
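
The contrast between the two assumptions can be checked with a few lines of Python. This is only an illustrative sketch: the helper `event_probability` and the variable names are hypothetical, the two dice are assumed to be independent (as the multiplication in the text implies), and the event "(1 and 2) or (3 and 3) or (5 and 6)" is read as ignoring the order of the dice.

```python
from fractions import Fraction
from itertools import product

# First assumption: every face of each die has probability 1/6.
fair_die = {face: Fraction(1, 6) for face in range(1, 7)}

def event_probability(die, event):
    """Probability that two independent rolls (a, b) of `die` satisfy `event`."""
    return sum(
        die[a] * die[b]
        for a, b in product(die, repeat=2)
        if event(a, b)
    )

# Both dice come up 5.
p_both_five = event_probability(fair_die, lambda a, b: a == 5 and b == 5)

# (1 and 2) or (3 and 3) or (5 and 6), in either order.
listed = {(1, 2), (3, 3), (5, 6)}
p_listed = event_probability(
    fair_die, lambda a, b: tuple(sorted((a, b))) in listed
)

# Alternative assumption: only the probability of face 5 is specified.
weighted_known = {5: Fraction(1, 8)}
p_both_five_weighted = weighted_known[5] ** 2

# Faces whose probability the alternative assumption leaves open.
unknown_faces = [f for f in range(1, 7) if f not in weighted_known]

print(p_both_five)           # 1/36
print(p_listed)              # 5/36
print(p_both_five_weighted)  # 1/64
print(unknown_faces)         # [1, 2, 3, 4, 6]
```

Under the first assumption every event gets a probability; under the alternative assumption the calculation stops as soon as a face other than 5 is involved.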

In the example above, with the first assumption, calculating the probability of an event is easy. With some
other examples, though, the calculation can be difficult, or even impractical (e.g. it might require millions of
years of computation). For an assumption to constitute a statistical model, such difficulty is acceptable:
doing the calculation does not need to be practicable, just theoretically possible.

Formal definition
In mathematical terms, a statistical model is usually thought of as a pair $(S, \mathcal{P})$, where $S$ is the set of
possible observations, i.e. the sample space, and $\mathcal{P}$ is a set of probability distributions on $S$.[3]

The intuition behind this definition is as follows. It is assumed that there is a "true" probability distribution
induced by the process that generates the observed data. We choose $\mathcal{P}$ to represent a set (of distributions)
which contains a distribution that adequately approximates the true distribution.

Note that we do not require that $\mathcal{P}$ contains the true distribution, and in practice that is rarely the case.
Indeed, as Burnham & Anderson state, "A model is a simplification or approximation of reality and hence
will not reflect all of reality"[4]—hence the saying "all models are wrong".

The set $\mathcal{P}$ is almost always parameterized: $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$. The set
$\Theta$ defines the parameters of the model. A parameterization is generally required to have distinct parameter
values give rise to distinct distributions, i.e. $P_{\theta_1} = P_{\theta_2} \Rightarrow \theta_1 = \theta_2$ must
hold (in other words, it must be injective). A parameterization that meets the requirement is said to be identifiable.[3]
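
As a rough illustration of identifiability, the sketch below compares two hypothetical parameterizations of a unit-variance Gaussian that are not taken from the source: one indexed by the mean itself, and one indexed by a pair (a, b) whose sum is the mean. The second is not injective, because different pairs with the same sum give the same distribution. Comparing densities on a finite grid is only numerical evidence, not a proof.

```python
import math

def normal_pdf(x, mean, sd=1.0):
    """Density of a Gaussian with the given mean and standard deviation."""
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

def density_mean_param(x, theta):
    """Hypothetical identifiable parameterization: theta is the mean."""
    return normal_pdf(x, theta)

def density_sum_param(x, theta):
    """Hypothetical non-identifiable parameterization: mean = a + b."""
    a, b = theta
    return normal_pdf(x, a + b)

# Finite grid of evaluation points from -5.0 to 10.0.
grid = [i / 10 for i in range(-50, 101)]

def same_on_grid(density, theta1, theta2):
    """True if the two parameter values give matching densities on the grid."""
    return all(
        math.isclose(density(x, theta1), density(x, theta2)) for x in grid
    )

# Two distinct parameter values, one distribution: not identifiable.
sum_param_collides = same_on_grid(density_sum_param, (1.0, 2.0), (2.0, 1.0))

# Two distinct means give visibly different densities.
mean_param_collides = same_on_grid(density_mean_param, 3.0, 3.5)

print(sum_param_collides)   # True
print(mean_param_collides)  # False
```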

An example
Suppose that we have a population of children, with the ages of the children distributed uniformly in the
population. The height of a child will be stochastically related to the age: e.g. when we know that a child is
of age 7, this influences the chance of the child being 1.5 meters tall. We could formalize that relationship in
a linear regression model, like this: $\text{height}_i = b_0 + b_1 \text{age}_i + \varepsilon_i$, where $b_0$ is the
intercept, $b_1$ is a parameter that age is multiplied by to obtain a prediction of height, $\varepsilon_i$ is the
error term, and $i$ identifies the child. This implies that height is predicted by age, with some error.

An admissible model must be consistent with all the data points. Thus, a straight line
($\text{height}_i = b_0 + b_1 \text{age}_i$) cannot be the equation for a model of the data, unless it exactly fits
all the data points, i.e. all the data points lie perfectly on the line. The error term, $\varepsilon_i$, must be
included in the equation, so that the model is consistent with all the data points.

To do statistical inference, we would first need to assume some probability distributions for the $\varepsilon_i$.
For instance, we might assume that the $\varepsilon_i$ distributions are i.i.d. Gaussian, with zero mean. In this
instance, the model would have 3 parameters: $b_0$, $b_1$, and the variance of the Gaussian distribution.

We can formally specify the model in the form $(S, \mathcal{P})$ as follows. The sample space, $S$, of our model
comprises the set of all possible pairs (age, height). Each possible value of $\theta = (b_0, b_1, \sigma^2)$
determines a distribution on $S$; denote that distribution by $P_\theta$. If $\Theta$ is the set of all possible
values of $\theta$, then $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$. (The parameterization is identifiable, and
this is easy to check.)

In this example, the model is determined by (1) specifying $S$ and (2) making some assumptions relevant to
$\mathcal{P}$. There are two assumptions: that height can be approximated by a linear function of age; that errors in
the approximation are distributed as i.i.d. Gaussian. The assumptions are sufficient to specify $\mathcal{P}$, as they
are required to do.
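
The height and age model can be made concrete by simulating data from one member of the model and then estimating the three parameters. The sketch below uses only the Python standard library. The "true" parameter values, the age range of 2 to 12 years and the sample size are hypothetical choices made purely for illustration, and the estimators are the usual closed-form least-squares formulas for a single predictor.

```python
import random
import statistics

random.seed(0)

# Hypothetical parameter values, chosen only for illustration:
# intercept (m), slope (m per year), error standard deviation (m).
b0_true, b1_true, sigma_true = 0.75, 0.06, 0.05

# Ages drawn uniformly, heights generated as b0 + b1 * age + Gaussian error.
ages = [random.uniform(2, 12) for _ in range(500)]
heights = [b0_true + b1_true * a + random.gauss(0, sigma_true) for a in ages]

# Closed-form least-squares estimates for one predictor.
mean_age = statistics.mean(ages)
mean_height = statistics.mean(heights)
sxx = sum((a - mean_age) ** 2 for a in ages)
sxy = sum((a - mean_age) * (h - mean_height) for a, h in zip(ages, heights))
b1_hat = sxy / sxx
b0_hat = mean_height - b1_hat * mean_age

# Maximum-likelihood estimate of the error variance (divides by n).
residuals = [h - (b0_hat + b1_hat * a) for a, h in zip(ages, heights)]
sigma2_hat = sum(r ** 2 for r in residuals) / len(residuals)

print(f"b0_hat={b0_hat:.3f}, b1_hat={b1_hat:.4f}, sigma2_hat={sigma2_hat:.5f}")
```

With a few hundred simulated children the estimates should land close to the values used to generate the data; the residuals play the role of the error term in the fitted model.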

General remarks
A statistical model is a special class of mathematical model. What distinguishes a statistical model from
other mathematical models is that a statistical model is non-deterministic. Thus, in a statistical model
specified via mathematical equations, some of the variables do not have specific values, but instead have
probability distributions; i.e. some of the variables are stochastic. In the above example with children's
heights, ε is a stochastic variable; without that stochastic variable, the model would be deterministic.

Statistical models are often used even when the data-generating process being modeled is deterministic. For
instance, coin tossing is, in principle, a deterministic process; yet it is commonly modeled as stochastic (via
a Bernoulli process).

Choosing an appropriate statistical model to represent a given data-generating process is sometimes
extremely difficult, and may require knowledge of both the process and relevant statistical analyses.
Relatedly, the statistician Sir David Cox has said, "How [the] translation from subject-matter problem to
statistical model is done is often the most critical part of an analysis".[5]

There are three purposes for a statistical model, according to Konishi & Kitagawa.[6]

Predictions
Extraction of information
Description of stochastic structures

Those three purposes are essentially the same as the three purposes indicated by Friendly & Meyer:
prediction, estimation, description.[7] The three purposes correspond with the three kinds of logical
reasoning: deductive reasoning, inductive reasoning, abductive reasoning.

Dimension of a model
Suppose that we have a statistical model $(S, \mathcal{P})$ with $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$.
The model is said to be parametric if $\Theta$ has a finite dimension. In notation, we write that
$\Theta \subseteq \mathbb{R}^k$ where $k$ is a positive integer ($\mathbb{R}$ denotes the real numbers; other sets can
be used, in principle). Here, $k$ is called the dimension of the model.

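As an example, if we assume that data arise from a univariate Gaussian distribution, then we are assuming
that

$$\mathcal{P} = \left\{ P_{\mu,\sigma}(x) \equiv \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) : \mu \in \mathbb{R},\ \sigma > 0 \right\}.$$

In this example, the dimension, $k$, equals 2.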
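
For the univariate Gaussian example, the dimension can be read off as the number of coordinates in the parameter vector. The following sketch draws a hypothetical sample (mean 10 and standard deviation 2 are arbitrary illustrative values) and computes the maximum-likelihood estimates of the two coordinates.

```python
import math
import random

random.seed(1)

# Hypothetical sample; the generating values 10.0 and 2.0 are arbitrary.
data = [random.gauss(10.0, 2.0) for _ in range(2000)]
n = len(data)

# Maximum-likelihood estimates of the mean and the standard deviation.
mu_hat = sum(data) / n
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in data) / n)

# The parameter is one point in a 2-dimensional parameter set.
theta_hat = (mu_hat, sigma_hat)
k = len(theta_hat)

print(f"theta_hat=({mu_hat:.3f}, {sigma_hat:.3f}), dimension k={k}")
```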

As another example, suppose that the data consists of points (x, y) that we assume are distributed according
to a straight line with i.i.d. Gaussian residuals (with zero mean): this leads to the same statistical model as
was used in the example with children's heights. The dimension of the statistical model is 3: the intercept of
the line, the slope of the line, and the variance of the distribution of the residuals. (Note that in geometry, a
straight line has dimension 1.)

Although formally $\theta \in \Theta$ is a single parameter that has dimension $k$, it is sometimes regarded as
comprising $k$ separate parameters. For example, with the univariate Gaussian distribution, $\theta$ is formally a
single parameter with dimension 2, but it is sometimes regarded as comprising 2 separate parameters: the
mean and the standard deviation.

A statistical model is nonparametric if the parameter set $\Theta$ is infinite dimensional. A statistical model is
semiparametric if it has both finite-dimensional and infinite-dimensional parameters. Formally, if $k$ is the
dimension of $\Theta$ and $n$ is the number of samples, both semiparametric and nonparametric models have
$k \to \infty$ as $n \to \infty$. If $k/n \to 0$ as $n \to \infty$, then the model is semiparametric; otherwise, the
model is nonparametric.

Parametric models are by far the most commonly used statistical models. Regarding semiparametric and
nonparametric models, Sir David Cox has said, "These typically involve fewer assumptions of structure
and distributional form but usually contain strong assumptions about independencies".[8]

Nested models
Two statistical models are nested if the first model can be transformed into the second model by imposing
constraints on the parameters of the first model. As an example, the set of all Gaussian distributions has,
nested within it, the set of zero-mean Gaussian distributions: we constrain the mean in the set of all
Gaussian distributions to get the zero-mean distributions. As a second example, the quadratic model

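$$y = b_0 + b_1 x + b_2 x^2 + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)$$

has, nested within it, the linear model

$$y = b_0 + b_1 x + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)$$

which is obtained by constraining the parameter $b_2$ to equal 0.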

In both those examples, the first model has a higher dimension than the second model (for the first example,
the zero-mean model has dimension 1). Such is often, but not always, the case. As a different example, the
set of positive-mean Gaussian distributions, which has dimension 2, is nested within the set of all Gaussian
distributions.
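
One consequence of nesting can be seen numerically with the first example: because the zero-mean Gaussian family is a subset of the family of all Gaussians, the maximized log-likelihood of the larger model can never be lower than that of the nested one. The sketch below is a minimal illustration on a hypothetical simulated sample; the function name `gaussian_loglik` and the generating values are assumptions made for the example.

```python
import math
import random

def gaussian_loglik(data, mean, var):
    """Log-likelihood of i.i.d. Gaussian data with the given mean and variance."""
    n = len(data)
    return (-0.5 * n * math.log(2 * math.pi * var)
            - sum((x - mean) ** 2 for x in data) / (2 * var))

random.seed(2)
data = [random.gauss(0.3, 1.0) for _ in range(200)]  # hypothetical sample
n = len(data)

# Full model: mean and variance both free (dimension 2).
mean_full = sum(data) / n
var_full = sum((x - mean_full) ** 2 for x in data) / n
loglik_full = gaussian_loglik(data, mean_full, var_full)

# Nested model: mean constrained to 0, variance free (dimension 1).
var_zero = sum(x ** 2 for x in data) / n
loglik_zero = gaussian_loglik(data, 0.0, var_zero)

print(f"full: {loglik_full:.2f}, zero-mean: {loglik_zero:.2f}")
```

The better fit of the larger model is guaranteed by construction, which is why a raw likelihood comparison is not by itself a fair way to choose between nested models.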

Comparing models
Comparing statistical models is fundamental for much of statistical inference. Indeed, Konishi & Kitagawa
(2008, p. 75) state this: "The majority of the problems in statistical inference can be considered to be
problems related to statistical modeling. They are typically formulated as comparisons of several statistical
models."

Common criteria for comparing models include the following: $R^2$, Bayes factor, Akaike information
criterion, and the likelihood-ratio test together with its generalization, the relative likelihood.
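
Two of the criteria listed above can be sketched for the pair of Gaussian models from the nested-models section. The code uses the standard definition of the Akaike information criterion, AIC = 2k − 2 ln L, where k is the model dimension and L the maximized likelihood, and computes the likelihood-ratio statistic as twice the gap in maximized log-likelihood. The sample is hypothetical (generated with mean 0.5), and the helper names are illustrative only.

```python
import math
import random

def max_loglik(n, var_hat):
    """Maximized Gaussian log-likelihood, given the ML variance estimate."""
    return -0.5 * n * (math.log(2 * math.pi * var_hat) + 1)

def aic(loglik, k):
    """Akaike information criterion: 2k - 2 * (maximized log-likelihood)."""
    return 2 * k - 2 * loglik

random.seed(3)
data = [random.gauss(0.5, 1.0) for _ in range(300)]  # hypothetical sample
n = len(data)

# Full model (mean and variance free, k = 2).
mean_hat = sum(data) / n
var_full = sum((x - mean_hat) ** 2 for x in data) / n
loglik_full = max_loglik(n, var_full)

# Zero-mean model (variance free only, k = 1).
var_zero = sum(x ** 2 for x in data) / n
loglik_zero = max_loglik(n, var_zero)

aic_full = aic(loglik_full, k=2)
aic_zero = aic(loglik_zero, k=1)

# Likelihood-ratio statistic for the nested pair.
lr_statistic = 2 * (loglik_full - loglik_zero)

# Lower AIC is preferred.
preferred = "full" if aic_full < aic_zero else "zero-mean"
print(f"AIC full={aic_full:.1f}, AIC zero-mean={aic_zero:.1f}, "
      f"LR={lr_statistic:.1f}, preferred={preferred}")
```

Unlike the raw likelihood, AIC charges each model for its dimension, so the larger model is preferred only when its gain in fit exceeds the penalty for the extra parameter.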

See also
All models are wrong
Blockmodel
Conceptual model
Design of experiments
Deterministic model
Effective theory
Predictive model
Response modeling methodology
Scientific model
Statistical inference
Statistical model specification
Statistical model validation
Statistical theory
Stochastic process

Notes
1. Cox 2006, p. 178
2. Adèr 2008, p. 280 (https://books.google.com/books?id=LCnOj4ZFyjkC&pg=PA280)
3. McCullagh 2002
4. Burnham & Anderson 2002, §1.2.5
5. Cox 2006, p. 197
6. Konishi & Kitagawa 2008, §1.1
7. Friendly & Meyer 2016, §11.6
8. Cox 2006, p. 2

References
Adèr, H. J. (2008), "Modelling", in Adèr, H. J.; Mellenbergh, G. J. (eds.), Advising on
Research Methods: A consultant's companion, Huizen, The Netherlands: Johannes van
Kessel Publishing, pp. 271–304.
Burnham, K. P.; Anderson, D. R. (2002), Model Selection and Multimodel Inference
(2nd ed.), Springer-Verlag.
Cox, D. R. (2006), Principles of Statistical Inference, Cambridge University Press.
Friendly, M.; Meyer, D. (2016), Discrete Data Analysis with R, Chapman & Hall.
Konishi, S.; Kitagawa, G. (2008), Information Criteria and Statistical Modeling, Springer.
McCullagh, P. (2002), "What is a statistical model?" (http://www.stat.uchicago.edu/~pmcc/pubs/AOS023.pdf)
(PDF), Annals of Statistics, 30 (5): 1225–1310,
doi:10.1214/aos/1035844977 (https://doi.org/10.1214%2Faos%2F1035844977).

Further reading
Davison, A. C. (2008), Statistical Models, Cambridge University Press
Drton, M.; Sullivant, S. (2007), "Algebraic statistical models"
(http://www3.stat.sinica.edu.tw/statistica/oldpdf/A17n41.pdf) (PDF), Statistica Sinica, 17: 1273–1297
Freedman, D. A. (2009), Statistical Models, Cambridge University Press
Helland, I. S. (2010), Steps Towards a Unified Basis for Scientific Models and Methods,
World Scientific
Kroese, D. P.; Chan, J. C. C. (2014), Statistical Modeling and Computation, Springer
Shmueli, G. (2010), "To explain or to predict?", Statistical Science, 25 (3): 289–310,
arXiv:1101.0891 (https://arxiv.org/abs/1101.0891),
doi:10.1214/10-STS330 (https://doi.org/10.1214%2F10-STS330)
Retrieved from "https://en.wikipedia.org/w/index.php?title=Statistical_model&oldid=1083915276"

This page was last edited on 21 April 2022, at 14:44 (UTC).

Text is available under the Creative Commons Attribution-ShareAlike License 3.0; additional terms may apply. By
using this site, you agree to the Terms of Use and Privacy Policy. Wikipedia® is a registered trademark of the
Wikimedia Foundation, Inc., a non-profit organization.
