
Linear Models with Python

Technometrics, 63:3 (2021), 426-427. DOI: 10.1080/00401706.2021.1945323

Linear Models with Python, by Julian J. Faraway. Boca Raton, FL: Chapman and Hall/CRC, Taylor & Francis Group, 2021, 308 pp., 85 b/w illustrations, $99.95 (Hardback), ISBN: 978-1-138-48395-8.

The book belongs to the Texts in Statistical Science series and offers a convenient way to learn more about both statistical modeling and Python as a statistical language, continuing the author's previous book on linear models with R. While R is the statistical language par excellence, Python is more widely used in machine learning (ML) and in data- and computer-science projects. The author writes in the Preface:

This book is written in three languages: English, Mathematics and Python. I aim to combine these three seamlessly to allow coherent exposition of the practice of linear modeling. This requires the reader to become somewhat fluent in Python. This is not a book about learning Python but like any foreign language, one becomes proficient by practicing it rather than by memorizing the dictionary. (p. ix)

The book consists of 17 chapters on statistical methods and their applications via Python packages, plus an appendix on Python basics.

Chapter 1, "Introduction," starts with a practical question: how to import real data from a repository of ML databases and describe the data using Python tools. The first important point is that Python does its numerical work through packages, so several of them must be loaded; for example, this can be done with a command such as import numpy as np. A medical dataset is imported, and Python code and output are given for describing the data. In Python the package name prefixes the function, so, for instance, a sum is computed with np.sum. It is also shown how to plot data and build a histogram or a bivariate scatterplot. The history of least-squares modeling is traced through the works of T. Mayer, A. M. Legendre, C. F. Gauss, and F. Galton, who coined the term regression in 1875. Several examples of regression modeling are given.

The next several chapters are devoted to multiple regression modeling. Chapter 2, "Estimation," presents formulas for linear or linearized regression in matrix form, with many illustrations of building models in Python. The Moore-Penrose inverse, the QR matrix decomposition, the Gauss-Markov theorem and the best linear unbiased estimator (BLUE), goodness of fit and the coefficient of multiple determination R², and questions of identifiability and orthogonality are described as well.
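To give a flavor of this workflow, the following is a minimal sketch, assuming a small synthetic data frame in place of the medical dataset used in the book; the variable names and numbers are invented for illustration, not taken from the text.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
df["y"] = 1.0 + 2.0 * df["x1"] - 0.5 * df["x2"] + rng.normal(size=50)

print(df.describe())          # numerical summary of the data
df["y"].plot.hist()           # quick histogram of the response

lmod = smf.ols("y ~ x1 + x2", data=df).fit()   # OLS fit from a model formula
print(lmod.summary())         # coefficients, standard errors, R², F-statistic

# The same estimates from the least-squares matrix solution beta = (X'X)^{-1} X'y,
# computed here with numpy's numerically stable least-squares solver:
X = np.column_stack([np.ones(len(df)), df["x1"], df["x2"]])
beta, *_ = np.linalg.lstsq(X, df["y"].to_numpy(), rcond=None)
print(beta)
```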
Chapter 3, "Inference," considers how to perform hypothesis tests and construct confidence intervals (CI). It includes tests for comparing models and choosing the best set of predictors via partial and overall Fisher F-statistics, permutation tests, analysis of variance (ANOVA), sampling, CIs for the regression parameters, and bootstrap CIs. Chapter 4, "Prediction," applies the fitted regression to point and interval predictions. In particular, it describes predicting body fat percentage with Brozek's model and the number of airline passengers with autoregressive models.

Chapter 5, "Explanation," deals with the interpretation of regression models, including the meaning of the regression parameters, and discusses observational data and experimental design, the problem of causality, and confounding variables. The data come from the Democratic Party primary to select the U.S. presidential candidate, held in New Hampshire in January 2008.

In the primary, H. Clinton defeated B. Obama contrary to the expectations of preelection opinion polls. Two different voting technologies were used in New Hampshire. Some wards (administrative districts) used paper ballots, counted by hand, while others used optically scanned ballots, counted by machine. Among the paper ballots, Obama had more votes than Clinton, while Clinton defeated Obama on just the machine-counted ballots. Since the method of voting should make no causal difference to the outcome, suspicions were raised regarding the integrity of the election. (p. 65)

The approaches considered are applied to this case, together with a matching technique used in clinical trials where a treatment is compared with a control; a greedy matching algorithm is employed. The results show no statistically significant difference, that is, no clear effect of hand versus machine counting.

Chapter 6, "Diagnostics," focuses on the model residuals, their plots and Q-Q plots, correlated errors, identification of unusual observations, outliers, and influential observations, and partial regressions. Chapter 7, "Problems with the Predictors," studies errors in the independent variables and special estimators of the regression parameters for such data. Changes of scale in the variables and their effect on the model parameters are also considered. Collinearity among the predictors and the variance inflation factor are discussed and modeled.

Chapter 8, "Problems with the Error," extends ordinary least squares (OLS) in several ways. If the errors are dependent, generalized least squares (GLS) can be used; if the errors are independent but not identically distributed across the predictors, weighted least squares (WLS) can be applied. These techniques assume a known error covariance matrix and are illustrated on several datasets, in particular an interesting example on the French presidential election. Lack-of-fit testing, polynomial fits, and robust regression with M-estimators and high-breakdown estimators, including the random sample consensus algorithm developed in image processing and the least trimmed squares technique, are also described. Chapter 9, "Transformation," concerns possible alterations of both the response and the predictor variables for a better fit. It covers model linearization, the Box-Cox power transformation, so-called broken-stick regression with switching parameters, polynomials (particularly orthogonal Legendre polynomials), response surface models, splines, and additive models, with numerous numerical examples.
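The inference and prediction machinery summarized above (the partial F-test of Chapter 3, parameter CIs, interval prediction from Chapter 4, and a WLS fit from Chapter 8) can be sketched as follows; this is a hedged illustration on invented data, not a reproduction of the book's examples, and the weights are made up.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.normal(size=60), "x2": rng.normal(size=60)})
df["y"] = 1.0 + 2.0 * df["x1"] + rng.normal(size=60)

small = smf.ols("y ~ x1", data=df).fit()
big = smf.ols("y ~ x1 + x2", data=df).fit()

# Partial F-test: does adding x2 significantly improve the smaller model?
print(sm.stats.anova_lm(small, big))

# Confidence intervals for the regression parameters
print(big.conf_int(alpha=0.05))

# Point prediction with confidence and prediction intervals for a new observation
new = pd.DataFrame({"x1": [0.5], "x2": [-1.0]})
print(big.get_prediction(new).summary_frame(alpha=0.05))

# Weighted least squares for nonconstant error variance (weights invented here)
w = 1.0 / (1.0 + df["x1"] ** 2)
wmod = smf.wls("y ~ x1 + x2", data=df, weights=w).fit()
print(wmod.params)
```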
Chapter 10, "Model Selection," describes choosing the theoretical form and the predictors of a regression, for example, adding mixed-effect variables, with backward elimination and forward selection in stepwise regression, and with assessment by the residual sum of squares, adjusted R², Kullback-Leibler information, and the Akaike and Bayesian information criteria (AIC and BIC). Another approach, recursive feature elimination (RFE), which repeatedly removes the least important variable from the model, refits, and repeats, is also applied for ranking the predictors. Sample splitting provides training data for fitting and testing data for evaluating model quality by the root mean squared error (RMSE). The cross-validation approach well known in ML splits the data into k subsets, uses k-1 parts to fit the model and the remaining part to evaluate it, repeats this process k times, and averages the performance measure to obtain an overall estimate of model quality. Implementation of the RFE method via the Python function RFECV is demonstrated.

Chapter 11, "Shrinkage Methods," presents principal component analysis (PCA) for building orthogonal predictors, robust PCA, partial least squares, ridge regression, and the least absolute shrinkage and selection operator (lasso), with applications to various datasets using the scikit-learn package. Chapter 12, "Insurance Redlining - A Complete Example," discusses the whole process of statistical modeling, with possible difficulties in the data and their processing. Redlining refers to marked city areas or zip codes where insurance companies refuse to issue insurance to homeowners because of high fire or crime levels in those areas, or other issues. Chicago data from 1977-1978 are studied, with so-called ecological correlation analysis, a full regression model, its diagnostics, a sensitivity analysis, and interpretation of the results. Chapter 13, "Missing Data," classifies types of missing cases and describes the representation and detection of missing values and their single and multiple imputation, using the multiple imputation by chained equations method in a Python package.

Qualitative predictors are considered in the next chapters. Chapter 14, "Categorical Predictors," deals with categorical independent variables, also called factors, which may have several levels. Using factors as predictors in a regression is described, together with continuous predictors and their interaction terms, under different factor codings defined by a contrast matrix, for example, treatment coding or sum coding. Chapter 15, "One-Factor Models," deals with a single qualitative predictor, which can be studied by the ANOVA approach of variance decomposition, demonstrated in numerical examples with diagnostics and graphical analysis. Pairwise comparisons between the levels with CI estimates are also performed, and the false discovery rate, the familywise error rate, and the Bonferroni correction are discussed. Chapter 16, "Models with Several Factors," continues with categorical predictors in designed experiments, also called factorial designs. A design containing all possible combinations of the factors' levels is called a full factorial design, and repeated observations for the same combination of factor levels are called replicates. Two-factor models without and with replication are considered, including interaction between the factors. Diagnostic plots and Tukey's nonadditivity test are applied to investigate an interaction.
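The factor handling and ANOVA decomposition of Chapters 14-16 can be sketched roughly as follows; the groups and effects are made up for illustration, and Tukey's HSD is used here for the pairwise level comparisons (it is not the nonadditivity test mentioned above).

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(2)
df = pd.DataFrame({"group": np.repeat(["a", "b", "c"], 20)})
df["y"] = df["group"].map({"a": 0.0, "b": 1.0, "c": 2.0}) + rng.normal(size=60)

# Treatment (dummy) coding is the default for a factor in a formula;
# C(group, Sum) would request sum-to-zero coding instead.
onefac = smf.ols("y ~ C(group)", data=df).fit()
print(sm.stats.anova_lm(onefac))      # one-way ANOVA decomposition

# Pairwise comparisons between the factor levels with Tukey's HSD adjustment
print(pairwise_tukeyhsd(df["y"], df["group"]).summary())

# A two-factor model with interaction would follow the same pattern, e.g.
# smf.ols("y ~ C(a) * C(b)", data=...), with anova_lm() giving the factorial table.
```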
Larger factorial experiments with many factors having different numbers of levels are described too: a full experiment contains at least one observation for each combination of all the factors' levels, while a fractional factorial experiment uses just a fraction of the full-design combinations. Chapter 17, "Experiments with Blocks," describes the completely randomized design, in which the treatments are assigned to the experimental units at random. A so-called block design can be more effective for heterogeneous data gathered into blocks so that the within-block variation is small but the between-block variation is large. In a randomized block design, the treatment levels are assigned randomly within a block, and when there is one observation on each treatment in each block the design is called a randomized complete block design (RCBD). Latin squares, useful when there are two blocking variables, are presented as well. In an RCBD the block size equals the number of treatments; when the block size is less than the number of treatments, an incomplete block design is used. The balanced incomplete block design, in which all pairwise differences are identifiable and have the same standard error, is also described; pairwise differences are often more useful than other contrasts. All these models are studied on various datasets with the help of Python packages, and the numerical results are discussed.

Finally, Appendix A describes how to learn Python. Python can be installed from Anaconda at www.anaconda.com, and Jupyter notebooks can be used from www.jupyter.org. Several packages come preinstalled with the Anaconda release and are used for the data and some functions in the book: numpy, scipy, pandas, statsmodels, matplotlib, seaborn, scikit-learn, and patsy, just to get something similar to the base R environment. The author also recommends installing his package faraway from the Python Package Index (PyPI) at www.pypi.org. The page https://julianfaraway.github.io/LMP/ serves for information about changes or errors found in the text. Several additional notes by the author on Python are as follows:

(i) Base R is quite functional without loading any packages; in Python, you will always need to load some packages even to do basic computations.
(ii) Python is very fussy about namespaces, and every loaded function must be prefixed; for example, you cannot write log(x), you need np.log(x) to indicate that log comes from the numpy package.
(iii) Python array indices start from zero, something the R user has to continually adjust to.
(iv) matplotlib is the Python equivalent of the R base plotting functionality.
(v) statsmodels provides the linear modeling functionality found in R, but some differences will trip you up; in particular, no intercept term is included by default and the handling of saturated models is different. Of course, you can work around all these issues.
(vi) Python uses pipes very commonly; it helps if you have already started using these in R via the %>% operator to get into that frame of mind.
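A tiny sketch illustrating notes (ii), (iii), and (v) on invented numbers; for (v), the point applies to the array interface sm.OLS, since the formula interface adds the intercept itself.

```python
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

print(np.log(x))   # (ii) functions carry their package prefix: np.log, not log
print(x[0])        # (iii) indexing starts at zero, so x[0] is the first element

# (v) with the array interface, statsmodels does not add an intercept automatically;
# sm.add_constant() appends the column of ones explicitly.
X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()
print(fit.params)  # intercept and slope
```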
A bibliography of many dozens of sources and an index are supplied as well. Numerous Python scripts and screenshots of the output fill the book, and each chapter offers exercises for practice in coding. The book presents an amazingly valuable source of knowledge on statistical modeling and Python tools for students and practitioners.

Note: a brief 30-page introduction to Python is freely available online: Bell A., Python for Economists, python_for_economists.pdf (harvard.edu), 2016. (S.L.)

Stan Lipovetsky
Minneapolis, MN