Regression Modeling Strategies: Frank E. Harrell, JR

This document provides an overview and summary of the book "Regression Modeling Strategies" by Frank E. Harrell Jr. It discusses the use of regression modeling for linear models, logistic regression, and survival analysis. The book covers topics such as model formulation, interpreting model parameters, assessing model fit and assumptions, handling missing data, variable selection, and model validation. It also provides case studies demonstrating various modeling strategies.

Uploaded by

Cipriana Gîrbea
Copyright
© All Rights Reserved

Frank E. Harrell, Jr.

Regression Modeling
Strategies
With Applications to
Linear Models,
Logistic Regression,
and Survival Analysis

With 141 Figures

Springer

Contents

Preface . . . vii
Typographical Conventions . . . xxiii

1 Introduction . . . 1
1.1 Hypothesis Testing, Estimation, and Prediction . . . 1
1.2 Examples of Uses of Predictive Multivariable Modeling . . . 3
1.3 Planning for Modeling . . . 4
1.3.1 Emphasizing Continuous Variables . . . 6
1.4 Choice of the Model . . . 6
1.5 Further Reading . . . 8

2 General Aspects of Fitting Regression Models . . . 11
2.1 Notation for Multivariable Regression Models . . . 11
2.2 Model Formulations . . . 12
2.3 Interpreting Model Parameters . . . 13
2.3.1 Nominal Predictors . . . 14
2.3.2 Interactions . . . 14

2.3.3 Example: Inference for a Simple Model . . . 15
2.4 Relaxing Linearity Assumption for Continuous Predictors . . . 16
2.4.1 Simple Nonlinear Terms . . . 16
2.4.2 Splines for Estimating Shape of Regression Function and Determining Predictor Transformations . . . 18
2.4.3 Cubic Spline Functions . . . 19
2.4.4 Restricted Cubic Splines . . . 20
2.4.5 Choosing Number and Position of Knots . . . 23
2.4.6 Nonparametric Regression . . . 24
2.4.7 Advantages of Regression Splines over Other Methods . . . 26
2.5 Recursive Partitioning: Tree-Based Models . . . 26
2.6 Multiple Degree of Freedom Tests of Association . . . 27
2.7 Assessment of Model Fit . . . 29
2.7.1 Regression Assumptions . . . 29
2.7.2 Modeling and Testing Complex Interactions . . . 32
2.7.3 Fitting Ordinal Predictors . . . 34
2.7.4 Distributional Assumptions . . . 35
2.8 Further Reading . . . 36
2.9 Problems . . . 37

3 Missing Data . . . 41
3.1 Types of Missing Data . . . 41
3.2 Prelude to Modeling . . . 42
3.3 Missing Values for Different Types of Response Variables . . . 43
3.4 Problems with Simple Alternatives to Imputation . . . 43
3.5 Strategies for Developing Imputation Algorithms . . . 44
3.6 Single Conditional Mean Imputation . . . 47
3.7 Multiple Imputation . . . 47
3.8 Summary and Rough Guidelines . . . 48
3.9 Further Reading . . . 50
3.10 Problems . . . 51

4 Multivariable Modeling Strategies . . . 53
4.1 Prespecification of Predictor Complexity Without Later Simplification . . . 53

4.2 Checking Assumptions of Multiple Predictors Simultaneously . . . 56
4.3 Variable Selection . . . 56
4.4 Overfitting and Limits on Number of Predictors . . . 60
4.5 Shrinkage . . . 61
4.6 Collinearity . . . 64
4.7 Data Reduction . . . 66
4.7.1 Variable Clustering . . . 66
4.7.2 Transformation and Scaling Variables Without Using Y . . . 67
4.7.3 Simultaneous Transformation and Imputation . . . 69
4.7.4 Simple Scoring of Variable Clusters . . . 70
4.7.5 Simplifying Cluster Scores . . . 72
4.7.6 How Much Data Reduction Is Necessary? . . . 73
4.8 Overly Influential Observations . . . 74
4.9 Comparing Two Models . . . 77
4.10 Summary: Possible Modeling Strategies . . . 79
4.10.1 Developing Predictive Models . . . 79
4.10.2 Developing Models for Effect Estimation . . . 82
4.10.3 Developing Models for Hypothesis Testing . . . 83
4.11 Further Reading . . . 84

5 Resampling, Validating, Describing, and Simplifying the Model . . . 87
5.1 The Bootstrap . . . 87
5.2 Model Validation . . . 90
5.2.1 Introduction . . . 90
5.2.2 Which Quantities Should Be Used in Validation? . . . 91
5.2.3 Data-Splitting . . . 91
5.2.4 Improvements on Data-Splitting: Resampling . . . 93
5.2.5 Validation Using the Bootstrap . . . 94
5.3 Describing the Fitted Model . . . 97
5.4 Simplifying the Final Model by Approximating It . . . 98
5.4.1 Difficulties Using Full Models . . . 98
5.4.2 Approximating the Full Model . . . 99
5.5 Further Reading . . . 101

6 S-PLUS Software . . . 105

6.1 The S Modeling Language . . . 106
6.2 User-Contributed Functions . . . 107
6.3 The Design Library . . . 108
6.4 Other Functions . . . 119
6.5 Further Reading . . . 120

7 Case Study in Least Squares Fitting and Interpretation of a Linear Model . . . 121
7.1 Descriptive Statistics . . . 122
7.2 Spending Degrees of Freedom/Specifying Predictor Complexity . . . 127
7.3 Fitting the Model Using Least Squares . . . 128
7.4 Checking Distributional Assumptions . . . 131
7.5 Checking Goodness of Fit . . . 135
7.6 Overly Influential Observations . . . 135
7.7 Test Statistics and Partial R² . . . 136
7.8 Interpreting the Model . . . 137
7.9 Problems . . . 142

8 Case Study in Imputation and Data Reduction . . . 147
8.1 Data . . . 147
8.2 How Many Parameters Can Be Estimated? . . . 150
8.3 Variable Clustering . . . 151
8.4 Single Imputation Using Constants or Recursive Partitioning . . . 154
8.5 Transformation and Single Imputation Using transcan . . . 157
8.6 Data Reduction Using Principal Components . . . 160
8.7 Detailed Examination of Individual Transformations . . . 168
8.8 Examination of Variable Clusters on Transformed Variables . . . 169
8.9 Transformation Using Nonparametric Smoothers . . . 170
8.10 Multiple Imputation . . . 172
8.11 Further Reading . . . 175
8.12 Problems . . . 176

9 Overview of Maximum Likelihood Estimation . . . 179
9.1 General Notions - Simple Cases . . . 179
9.2 Hypothesis Tests . . . 183

9.2.1 Likelihood Ratio Test . . . 183
9.2.2 Wald Test . . . 184
9.2.3 Score Test . . . 184
9.2.4 Normal Distribution - One Sample . . . 185
9.3 General Case . . . 186
9.3.1 Global Test Statistics . . . 187
9.3.2 Testing a Subset of the Parameters . . . 187
9.3.3 Which Test Statistics to Use When . . . 189
9.3.4 Example: Binomial - Comparing Two Proportions . . . 190
9.4 Iterative ML Estimation . . . 192
9.5 Robust Estimation of the Covariance Matrix . . . 193
9.6 Wald, Score, and Likelihood-Based Confidence Intervals . . . 194
9.7 Bootstrap Confidence Regions . . . 195
9.8 Further Use of the Log Likelihood . . . 202
9.8.1 Rating Two Models, Penalizing for Complexity . . . 202
9.8.2 Testing Whether One Model Is Better than Another . . . 203
9.8.3 Unitless Index of Predictive Ability . . . 203
9.8.4 Unitless Index of Adequacy of a Subset of Predictors . . . 205
9.9 Weighted Maximum Likelihood Estimation . . . 206
9.10 Penalized Maximum Likelihood Estimation . . . 207
9.11 Further Reading . . . 210
9.12 Problems . . . 212

10 Binary Logistic Regression . . . 215
10.1 Model . . . 215
10.1.1 Model Assumptions and Interpretation of Parameters . . . 217
10.1.2 Odds Ratio, Risk Ratio, and Risk Difference . . . 220
10.1.3 Detailed Example . . . 221
10.1.4 Design Formulations . . . 227
10.2 Estimation . . . 228
10.2.1 Maximum Likelihood Estimates . . . 228
10.2.2 Estimation of Odds Ratios and Probabilities . . . 228
10.3 Test Statistics . . . 229
10.4 Residuals . . . 230

10.5 Assessment of Model Fit . . . 230
10.6 Collinearity . . . 244
10.7 Overly Influential Observations . . . 245
10.8 Quantifying Predictive Ability . . . 247
10.9 Validating the Fitted Model . . . 249
10.10 Describing the Fitted Model . . . 253
10.11 S-PLUS Functions . . . 257
10.12 Further Reading . . . 264
10.13 Problems . . . 265

11 Logistic Model Case Study 1: Predicting Cause of Death . . . 269
11.1 Preparation for Modeling . . . 269
11.2 Regression on Principal Components, Cluster Scores, and Pretransformations . . . 271
11.3 Fit and Diagnostics for a Full Model, and Interpreting Pretransformations . . . 276
11.4 Describing Results Using a Reduced Model . . . 285
11.5 Approximating the Full Model Using Recursive Partitioning . . . 291
11.6 Validating the Reduced Model . . . 294

12 Logistic Model Case Study 2: Survival of Titanic Passengers . . . 299
12.1 Descriptive Statistics . . . 300
12.2 Exploring Trends with Nonparametric Regression . . . 305
12.3 Binary Logistic Model With Casewise Deletion of Missing Values . . . 305
12.4 Examining Missing Data Patterns . . . 312
12.5 Single Conditional Mean Imputation . . . 316
12.6 Multiple Imputation . . . 320
12.7 Summarizing the Fitted Model . . . 322
12.8 Problems . . . 326

13 Ordinal Logistic Regression . . . 331
13.1 Background . . . 331
13.2 Ordinality Assumption . . . 332
13.3 Proportional Odds Model . . . 333
13.3.1 Model . . . 333
13.3.2 Assumptions and Interpretation of Parameters . . . 333

13.3.3 Estimation . . . 334
13.3.4 Residuals . . . 334
13.3.5 Assessment of Model Fit . . . 335
13.3.6 Quantifying Predictive Ability . . . 335
13.3.7 Validating the Fitted Model . . . 337
13.3.8 S-PLUS Functions . . . 337
13.4 Continuation Ratio Model . . . 338
13.4.1 Model . . . 338
13.4.2 Assumptions and Interpretation of Parameters . . . 338
13.4.3 Estimation . . . 339
13.4.4 Residuals . . . 339
13.4.5 Assessment of Model Fit . . . 339
13.4.6 Extended CR Model . . . 339
13.4.7 Role of Penalization in Extended CR Model . . . 340
13.4.8 Validating the Fitted Model . . . 341
13.4.9 S-PLUS Functions . . . 341
13.5 Further Reading . . . 342
13.6 Problems . . . 342

14 Case Study in Ordinal Regression, Data Reduction, and Penalization . . . 345
14.1 Response Variable . . . 346
14.2 Variable Clustering . . . 347
14.3 Developing Cluster Summary Scores . . . 349
14.4 Assessing Ordinality of Y for each X, and Unadjusted Checking of PO and CR Assumptions . . . 351
14.5 A Tentative Full Proportional Odds Model . . . 352
14.6 Residual Plots . . . 355
14.7 Graphical Assessment of Fit of CR Model . . . 357
14.8 Extended Continuation Ratio Model . . . 357
14.9 Penalized Estimation . . . 359
14.10 Using Approximations to Simplify the Model . . . 364
14.11 Validating the Model . . . 367
14.12 Summary . . . 369
14.13 Further Reading . . . 371

14.14 Problems . . . 371

15 Models Using Nonparametric Transformations of X and Y . . . 375
15.1 Background . . . 375
15.2 Generalized Additive Models . . . 376
15.3 Nonparametric Estimation of Y-Transformation . . . 376
15.4 Obtaining Estimates on the Original Scale . . . 377
15.5 S-PLUS Functions . . . 378
15.6 Case Study . . . 379

16 Introduction to Survival Analysis . . . 389
16.1 Background . . . 389
16.2 Censoring, Delayed Entry, and Truncation . . . 391
16.3 Notation, Survival, and Hazard Functions . . . 392
16.4 Homogeneous Failure Time Distributions . . . 398
16.5 Nonparametric Estimation of S and Λ . . . 400
16.5.1 Kaplan-Meier Estimator . . . 400
16.5.2 Altschuler-Nelson Estimator . . . 403
16.6 Analysis of Multiple Endpoints . . . 404
16.6.1 Competing Risks . . . 404
16.6.2 Competing Dependent Risks . . . 405
16.6.3 State Transitions and Multiple Types of Nonfatal Events . . . 406
16.6.4 Joint Analysis of Time and Severity of an Event . . . 407
16.6.5 Analysis of Multiple Events . . . 407
16.7 S-PLUS Functions . . . 408
16.8 Further Reading . . . 410
16.9 Problems . . . 411

17 Parametric Survival Models . . . 413
17.1 Homogeneous Models (No Predictors) . . . 413
17.1.1 Specific Models . . . 413
17.1.2 Estimation . . . 414
17.1.3 Assessment of Model Fit . . . 416
17.2 Parametric Proportional Hazards Models . . . 417
17.2.1 Model . . . 417

17.2.2 Model Assumptions and Interpretation of Parameters . . . 418
17.2.3 Hazard Ratio, Risk Ratio, and Risk Difference . . . 419
17.2.4 Specific Models . . . 421
17.2.5 Estimation . . . 422
17.2.6 Assessment of Model Fit . . . 423
17.3 Accelerated Failure Time Models . . . 426
17.3.1 Model . . . 426
17.3.2 Model Assumptions and Interpretation of Parameters . . . 427
17.3.3 Specific Models . . . 427
17.3.4 Estimation . . . 428
17.3.5 Residuals . . . 429
17.3.6 Assessment of Model Fit . . . 430
17.3.7 Validating the Fitted Model . . . 434
17.4 Buckley-James Regression Model . . . 435
17.5 Design Formulations . . . 435
17.6 Test Statistics . . . 435
17.7 Quantifying Predictive Ability . . . 436
17.8 S-PLUS Functions . . . 436
17.9 Further Reading . . . 441
17.10 Problems . . . 441

18 Case Study in Parametric Survival Modeling and Model Approximation . . . 443
18.1 Descriptive Statistics . . . 443
18.2 Checking Adequacy of Log-Normal Accelerated Failure Time Model . . . 448
18.3 Summarizing the Fitted Model . . . 454
18.4 Internal Validation of the Fitted Model Using the Bootstrap . . . 454
18.5 Approximating the Full Model . . . 458
18.6 Problems . . . 464

19 Cox Proportional Hazards Regression Model . . . 465
19.1 Model . . . 465
19.1.1 Preliminaries . . . 465
19.1.2 Model Definition . . . 466
19.1.3 Estimation of β . . . 466

19.1.4 Model Assumptions and Interpretation of Parameters . . . 468
19.1.5 Example . . . 468
19.1.6 Design Formulations . . . 470
19.1.7 Extending the Model by Stratification . . . 470
19.2 Estimation of Survival Probability and Secondary Parameters . . . 472
19.3 Test Statistics . . . 474
19.4 Residuals . . . 476
19.5 Assessment of Model Fit . . . 476
19.5.1 Regression Assumptions . . . 477
19.5.2 Proportional Hazards Assumption . . . 483
19.6 What to Do When PH Fails . . . 489
19.7 Collinearity . . . 491
19.8 Overly Influential Observations . . . 492
19.9 Quantifying Predictive Ability . . . 492
19.10 Validating the Fitted Model . . . 493
19.10.1 Validation of Model Calibration . . . 493
19.10.2 Validation of Discrimination and Other Statistical Indexes . . . 494
19.11 Describing the Fitted Model . . . 496
19.12 S-PLUS Functions . . . 499
19.13 Further Reading . . . 506

20 Case Study in Cox Regression . . . 509
20.1 Choosing the Number of Parameters and Fitting the Model . . . 509
20.2 Checking Proportional Hazards . . . 513
20.3 Testing Interactions . . . 516
20.4 Describing Predictor Effects . . . 517
20.5 Validating the Model . . . 517
20.6 Presenting the Model . . . 519
20.7 Problems . . . 522

Appendix . . . 523
References . . . 527
Index . . . 559
