BOOK REVIEWS
Linear Models with Python, by Julian J. Faraway. Boca
Raton, FL, Chapman and Hall/CRC, Taylor & Francis Group,
2021, 308 pp., 85 b/w illustrations, $99.95 (Hardback),
ISBN: 978-1-138-48395-8.
The book belongs to the Texts in Statistical Science series, and it presents a convenient way to learn more about both statistical modeling and the Python language, in continuation of the author’s previous book on linear models with R. While R is the statistical language par excellence, Python is more widely applied in machine learning (ML) and in data and computer science projects. The author writes in the Preface:
This book is written in three languages: English, Mathematics and Python. I aim to combine these three seamlessly to
allow coherent exposition of the practice of linear modeling. This requires the reader to become somewhat fluent in
Python. This is not a book about learning Python but like
any foreign language, one becomes proficient by practicing it
rather than by memorizing the dictionary. (p. ix)
The book consists of 17 chapters on statistical methods with
their applications via Python packages, and an appendix about
Python basics.
Chapter 1, “Introduction,” starts with a practical question: how to import real data from a repository of ML databases and describe the data using Python tools. The first important point is that Python operates numerically via packages, so several of them must be loaded; for example, this can be done with the command import numpy as np. A medical dataset is imported, and Python code and output are given for describing the data. In Python, the package name is prefixed to a function; for instance, the command for a sum can be np.sum. It is also shown how to plot the data and build a histogram or a bivariate scatterplot. The history of least squares modeling is traced through the works of T. Mayer, A.M. Legendre, C.F. Gauss, and F. Galton, who coined the term regression in 1875. Several examples of regression modeling are given. The next several chapters are devoted to multiple regression modeling.
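As a rough sketch of this workflow (the file name data.csv and the column name weight here are hypothetical, not taken from the book):

    import numpy as np               # numerical routines, called as np.sum, np.log, ...
    import pandas as pd              # data frames
    import matplotlib.pyplot as plt  # plotting

    df = pd.read_csv("data.csv")     # import the dataset
    print(df.describe())             # numerical summary of each column
    print(np.sum(df["weight"]))      # the package name prefixes the function

    df["weight"].hist()              # histogram of one variable
    plt.show()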
Chapter 2, “Estimation,” presents formulas for linear and linearized regression in matrix form, with many illustrations of building models in Python code. The Moore–Penrose inverse method and the QR matrix decomposition, the Gauss–Markov theorem and the best linear unbiased estimator (BLUE), goodness of fit and the coefficient of multiple determination R², and questions of identifiability and orthogonality are described as well.
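The following sketch, on simulated rather than the book’s data, mirrors the chapter’s matrix formulation: the coefficients solve the normal equations, here computed through the QR decomposition, and the same fit is obtained with statsmodels:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    X = sm.add_constant(rng.normal(size=(50, 2)))  # n x (p+1) design matrix
    beta = np.array([1.0, 2.0, -0.5])
    y = X @ beta + rng.normal(size=50)

    # X = QR reduces the normal equations to the triangular system R b = Q'y
    Q, R = np.linalg.qr(X)
    betahat = np.linalg.solve(R, Q.T @ y)

    # The same fit via statsmodels; add_constant supplies the intercept
    res = sm.OLS(y, X).fit()
    print(betahat, res.params, res.rsquared)       # estimates and R²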
Chapter 3, “Inference,” considers how to perform hypothesis tests and construct confidence intervals (CI). It includes tests to compare models for choosing the best set of predictors by the partial and total Fisher F-statistics, permutation tests, analysis of variance (ANOVA), sampling, CIs for the regression parameters, and bootstrap CIs. Chapter 4, “Prediction,” applies the fitted regression for point and interval predictions. In particular, it describes prediction of body fat percentage by Brozek’s model and of the number of airline passengers by autoregression models.
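A minimal sketch of interval prediction with statsmodels, again on simulated data: summary_frame returns both the confidence interval for the mean response and the wider prediction interval for a new observation.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    x = rng.uniform(0, 10, 40)
    y = 3 + 1.5 * x + rng.normal(size=40)
    res = sm.OLS(y, sm.add_constant(x)).fit()

    x_new = sm.add_constant(np.array([2.0, 5.0, 8.0]), has_constant="add")
    pred = res.get_prediction(x_new)
    print(pred.summary_frame(alpha=0.05))  # mean, mean CI, prediction interval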
Chapter 5, “Explanation,” deals with the interpretation of regression models, including the meaning of the regression parameters, and discusses observational data versus experimental design, the causality problem, and confounding variables. The data are taken from the Democratic Party primary to select the U.S. presidential candidate, held in New Hampshire in January 2008. In that primary, H. Clinton defeated B. Obama, contrary to the expectations of preelection opinion polls.
Two different voting technologies were used in New
Hampshire. Some wards (administrative districts) used
paper ballots, counted by hand, while others used optically
scanned ballots, counted by machine. Among the paper
ballots, Obama had more votes than Clinton, while Clinton
defeated Obama on just the machine-counted ballots. Since
the method of voting should make no causal difference to
the outcome, suspicions were raised regarding the integrity
of the election. (p. 65).
The considered approaches are applied to this case, together with a matching technique of the kind used in clinical trials, where a treatment is compared with a control; a greedy matching algorithm is employed. The results do not demonstrate a statistically significant difference, which means no clear preference for hand or machine counting. Chapter 6, “Diagnostics,” focuses on the model residuals, their plots and Q–Q plots, correlated errors, identification of unusual observations, outliers and influential observations, and partial regression plots.
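A minimal diagnostics sketch on simulated data, illustrating the residuals-versus-fitted plot and the Q–Q plot discussed above:

    import numpy as np
    import matplotlib.pyplot as plt
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 100)
    y = 2 + 0.5 * x + rng.normal(size=100)
    res = sm.OLS(y, sm.add_constant(x)).fit()

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
    ax1.scatter(res.fittedvalues, res.resid)  # look for patterns or funnel shapes
    ax1.axhline(0, linestyle="--")
    ax1.set_xlabel("Fitted values")
    ax1.set_ylabel("Residuals")
    sm.qqplot(res.resid, line="45", fit=True, ax=ax2)  # normality of residuals
    plt.show()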
Chapter 7, “Problems with the Predictors,” studies the case of errors in the independent variables and special estimators of the regression parameters for such data. Changes of scale in the variables and their translation into the model parameters are also considered. Collinearity among the predictors and the variance inflation factor are discussed and modeled.
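A small sketch of the variance inflation factor on simulated collinear predictors (the statsmodels function is real; the data are made up):

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(3)
    x1 = rng.normal(size=100)
    x2 = x1 + 0.1 * rng.normal(size=100)  # nearly collinear with x1
    x3 = rng.normal(size=100)
    X = sm.add_constant(np.column_stack([x1, x2, x3]))

    for i in range(1, X.shape[1]):        # skip the intercept column
        print(f"VIF for x{i}: {variance_inflation_factor(X, i):.1f}")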
Chapter 8, “Problems with the Error,” considers extensions of ordinary least squares (OLS) in several directions. If the errors are dependent, generalized least squares (GLS) can be used. If the errors are independent but not identically distributed across the predictor values, weighted least squares (WLS) can be applied. These techniques correspond to a known covariance matrix of the errors, and they are modeled on several datasets, particularly the interesting example of a French presidential election. Lack-of-fit testing, polynomial fits, robust regression with M-estimators, and high-breakdown estimators, including the random sample consensus algorithm developed in image processing and the least trimmed squares technique, are also described.
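A hedged sketch of WLS for independent but heteroscedastic errors, with the weights taken as the inverse error variances (assumed known here for simplicity):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    x = rng.uniform(1, 10, 60)
    sigma = 0.5 * x                          # error spread grows with x
    y = 1 + 2 * x + rng.normal(scale=sigma)
    X = sm.add_constant(x)

    wls = sm.WLS(y, X, weights=1 / sigma**2).fit()
    ols = sm.OLS(y, X).fit()
    print(wls.params, ols.params)            # WLS downweights the noisy points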
Chapter 9, “Transformation,” concerns possible alterations of both the response and the predictor variables for a better fit. It includes model linearization, the Box–Cox power transformation, the so-called broken stick regression with switching parameters, polynomials, particularly Legendre orthogonal polynomials, response surface models, splines, and additive models, with numerous numerical examples.
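A brief illustration of the Box–Cox transformation via scipy (not necessarily the book’s implementation), which estimates the power lambda by maximum likelihood:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    y = rng.lognormal(mean=1.0, sigma=0.5, size=200)  # positive, skewed response

    y_transformed, lam = stats.boxcox(y)
    print(f"estimated lambda: {lam:.2f}")  # a value near 0 suggests log(y)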
Chapter 10, “Model Selection,” describes choosing the theoretical form and the predictors for a regression, for example, adding mixed-effect variables, with backward elimination and forward selection in stepwise regression, and estimation by the residual sum of squares and adjusted R², Kullback–Leibler information, and the Akaike and Bayesian information criteria (AIC and BIC, respectively). Another approach, recursive feature elimination (RFE), which recursively eliminates the least important variable from the model, refits, and repeats, is also applied for ranking the predictors. Sample splitting is used to obtain training data for fitting and testing data for evaluating model quality by the root mean squared error (RMSE). The cross-validation approach, well known in ML, splits the data into k subsets, uses k − 1 parts to fit the model and the remaining part to evaluate it, repeats this process k times, and averages the performance measure to obtain an overall estimate of model quality. An implementation of the RFE method via the Python function RFECV is demonstrated.
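A sketch of RFECV on simulated data; RFECV is the scikit-learn function named above, combining RFE with cross-validated scoring:

    from sklearn.datasets import make_regression
    from sklearn.feature_selection import RFECV
    from sklearn.linear_model import LinearRegression

    X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                           noise=5.0, random_state=0)

    selector = RFECV(LinearRegression(), step=1, cv=5,
                     scoring="neg_root_mean_squared_error")
    selector.fit(X, y)
    print(selector.support_)   # mask of the selected predictors
    print(selector.ranking_)   # 1 = kept; larger ranks were eliminated earlier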
Chapter 11, “Shrinkage Methods,” presents principal component analysis (PCA) for building orthogonal predictors, robust PCA, partial least squares, ridge regression, and the least absolute shrinkage and selection operator (lasso), with applications to various datasets performed with the scikit-learn package.
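A minimal sketch of two of the shrinkage estimators named above, again with scikit-learn and simulated data:

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge, Lasso

    X, y = make_regression(n_samples=100, n_features=8, n_informative=3,
                           noise=10.0, random_state=1)

    ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty shrinks all coefficients
    lasso = Lasso(alpha=1.0).fit(X, y)  # L1 penalty can zero some out
    print(ridge.coef_)
    print(lasso.coef_)                  # exact zeros mark dropped predictors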
Chapter 12, “Insurance Redlining—A Complete Example,” discusses the whole process of statistical modeling, with possible difficulties in the data and their processing. Redlining refers to marked city areas or zip codes where insurance companies refuse to issue insurance to homeowners because of high fire or crime levels in those areas, or other issues. Chicago data from 1977–1978 are studied, with the so-called ecological correlation analysis, a full regression model, its diagnostics, sensitivity analysis, and interpretation of the results.
Chapter 13, “Missing Data,” gives a classification of missing-data cases and describes the representation and detection of missing values and their single and multiple imputation, using the multiple imputation by chained equations (MICE) method in a Python package.
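As an illustration of the chained-equations idea, here is a sketch with scikit-learn’s IterativeImputer, which stands in for whichever MICE implementation the book uses:

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])
    imputer = IterativeImputer(random_state=0)
    print(imputer.fit_transform(X))  # missing entries filled by regressing
                                     # each column on the others, iteratively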
Qualitative predictors are considered in the next chapters. Chapter 14, “Categorical Predictors,” deals with categorical independent variables, also called factors, which can have several levels. Using factors as predictors in regression is described, together with continuous predictors and their interaction terms, under different factor codings defined by a contrast matrix, for example, treatment coding or sum coding. Chapter 15, “One-Factor Models,” deals with a single qualitative predictor, which can be studied by the ANOVA approach of variance decomposition, demonstrated in numerical examples with diagnostics and graphical analysis. Pairwise comparisons between the levels, with CI estimates, are also performed. The false discovery rate, the familywise error rate, and the Bonferroni correction are discussed as well.

Chapter 16, “Models with Several Factors,” continues with categorical predictors in designed experiments, also called factorial designs. A design in which all possible combinations of the factors’ levels are present is called a full factorial design, and repeated observations at the same combination of factor levels are called replicates. Two factors, without and with replication, are considered in models with interaction between the factors; diagnostic plots and Tukey’s nonadditivity test are applied to investigate an interaction. Larger factorial experiments, with many factors at different numbers of levels, are described too: a full experiment has at least one observation at each combination of all factors’ levels, while a fractional factorial experiment uses just a fraction of the full design combinations.

Chapter 17, “Experiments with Blocks,” describes the completely randomized design, with the treatments assigned to the experimental units at random. The so-called block design can be more effective for heterogeneous data gathered into blocks so that the within-block variation is small but the between-block variation is large. In a randomized block design, the treatment levels are assigned randomly within a block, and when there is one observation on each treatment in each block, it is called a randomized complete block design (RCBD). Latin squares, useful when there are two blocking variables, are presented as well. In an RCBD the block size equals the number of treatments; when the block size is less than the number of treatments, an incomplete block design is used. The balanced incomplete block design, with all pairwise differences identifiable and of the same standard error, is also described; pairwise differences are often more useful than other contrasts. All these models are studied on various datasets with the help of Python packages, and the numerical results are discussed.
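A compact sketch tying together the factor coding of Chapter 14 and the two-factor ANOVA of Chapter 16, using the statsmodels formula interface; the column names y, a, and b and the data are hypothetical:

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    df = pd.DataFrame({
        "y": [12.1, 13.4, 11.8, 14.2, 15.0, 13.7, 10.9, 12.5],
        "a": ["low", "low", "high", "high", "low", "low", "high", "high"],
        "b": ["ctrl", "trt", "ctrl", "trt", "ctrl", "trt", "ctrl", "trt"],
    })

    # C() declares a factor; Treatment and Sum choose the contrast coding
    model = smf.ols("y ~ C(a, Treatment) * C(b, Sum)", data=df).fit()
    print(sm.stats.anova_lm(model))  # two-way ANOVA table with interaction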
Finally, Appendix A describes how to learn Python. It is possible to install Python from Anaconda at www.anaconda.com and to use Jupyter notebooks from www.jupyter.org. Several packages come preinstalled with the Anaconda release, and they are used for the data and some functions in the book: numpy, scipy, pandas, statsmodels, matplotlib, seaborn, scikit-learn, and patsy, just to get something similar to the base R environment. The author also recommends installing his package faraway from the Python Package Index (PyPI) at www.pypi.org. Readers should see https://julianfaraway.github.io/LMP/ for information about changes or errors found in the text. Several additional author’s notes on Python are as follows: (i) Base R is quite functional without loading any packages; in Python, you will always need to load some packages even to do basic computations. (ii) Python is very fussy about namespaces, and you will have to prefix every loaded function; for example, you cannot write log(x), you need to write np.log(x), indicating that log comes from the numpy package. (iii) Python array indices start from zero, which is something the R user has to adjust to continually. (iv) matplotlib is the Python equivalent of the R base plotting functionality. (v) statsmodels provides the linear modeling functionality found in R, but with some differences that will trip you up; in particular, no intercept term is included by default, and the handling of saturated models is different, though you can work around all these issues. (vi) Python uses pipes very commonly, and it helps if you have already started using these in R via the %>% operator to get into that frame of mind.
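Notes (ii) and (iii) in particular are worth internalizing; a two-line illustration:

    import numpy as np

    x = np.array([1.0, np.e, np.e**2])
    print(np.log(x))  # log must be prefixed as np.log; bare log(x) is an error
    print(x[0])       # indexing starts at zero, unlike R's x[1]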
A bibliography of many dozens of sources and an index are supplied as well. Multiple Python scripts and screenshots of their output fill the book, and each chapter offers numerous exercises for practice in coding. The book presents an amazingly valuable source of knowledge on statistical modeling and Python tools for students and practitioners.
Note: a brief 30-page introduction to Python is freely available online: Bell, A., Python for Economists, python_for_economists.pdf (harvard.edu), 2016. (S.L.)
Stan Lipovetsky
Minneapolis, MN