ML Sas
The diagram containing this analysis is stored as an XML file on the course data disk. To open it in
SAS Enterprise Miner, right-click Diagrams and select Import Diagram from XML. All nodes in the
opened file, except the data node, contain the property settings outlined in this case study. To run
the diagram, you need to re-create the case study data set using the metadata settings indicated
below.
Appendix A Case Studies
Copyright © 2015, SAS Institute Inc., Cary, North Carolina, USA. ALL RIGHTS RESERVED.
A.1 Credit Risk Case Study
A SAS Enterprise Miner data source was defined for the CREDIT data set using the metadata settings
indicated above. The data source definition was expedited by customizing the Advanced Metadata
Advisor in the Data Source Wizard as indicated.
The Decisions option Default with Inverse Prior Weights was selected to provide the values on the
Decision Weights tab.
It can be shown that, theoretically, the so-called central decision rule optimizes model performance based
on the KS statistic.
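The decision weights and the central decision rule can be illustrated with a minimal sketch. It assumes that the inverse prior weights place 1/prior on each correct decision and 0 elsewhere, and it uses a hypothetical primary-outcome prior of 0.2; the function names are illustrative and are not SAS Enterprise Miner identifiers. Under these assumptions, choosing the decision with the larger expected profit is the same as comparing the predicted probability with the prior.

```python
def inverse_prior_profit_matrix(prior_primary):
    """Rows: actual outcome (primary, secondary); columns: decision (primary, secondary).

    Assumed form: a correct decision earns 1/prior of that outcome, an incorrect one earns 0.
    """
    prior_secondary = 1.0 - prior_primary
    return [[1.0 / prior_primary, 0.0],
            [0.0, 1.0 / prior_secondary]]


def best_decision(p_primary, prior_primary):
    """Pick the decision with the larger expected profit for one case."""
    matrix = inverse_prior_profit_matrix(prior_primary)
    expected_primary = p_primary * matrix[0][0] + (1.0 - p_primary) * matrix[1][0]
    expected_secondary = p_primary * matrix[0][1] + (1.0 - p_primary) * matrix[1][1]
    return "primary" if expected_primary > expected_secondary else "secondary"


# Hypothetical prior of 0.2: cases scored above 0.2 are assigned the primary decision.
print(best_decision(0.25, 0.2))
print(best_decision(0.15, 0.2))
```

Algebraically, p/prior > (1 - p)/(1 - prior) reduces to p > prior, which is why the cutoff sits at the prior rather than at 0.5.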
The StatExplore node was used to provide preliminary statistics on the target variable.
BanruptcyInd and TARGET were the only two class variables in the CREDIT data set.
The Interval Variable Summary shows missing values on 11 of the 27 interval inputs.
By creating plots using the Explore window, it was found that several of the interval inputs show
somewhat skewed distributions. Transformation of the more severe cases was pursued in regression
modeling.
Because it was the most likely model to be selected for deployment, a regression model was considered
first.
In the Data Partition node, 50% of the data was chosen for training and 50% for validation.
The Impute node replaced missing values for the interval inputs with the input mean (the default for
interval-valued input variables), and added a unique imputation indicator for each input with missing
values.
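As a rough illustration of mean imputation with indicators, the sketch below fills missing values with the mean of the observed training values and returns a 0/1 flag marking which values were imputed. It is a plain-Python stand-in, not the Impute node's implementation; the function name and the sample values are hypothetical, and missing values are assumed to be encoded as NaN.

```python
import math


def impute_mean_with_indicator(train_values, values):
    """Replace NaN in `values` with the mean of the non-missing `train_values`.

    Returns (imputed_values, indicator), where indicator is 1 for imputed entries.
    """
    observed = [v for v in train_values if not math.isnan(v)]
    mean = sum(observed) / len(observed)
    imputed = [mean if math.isnan(v) else v for v in values]
    indicator = [1 if math.isnan(v) else 0 for v in values]
    return imputed, indicator


# Hypothetical input with one missing value.
train = [1.0, float("nan"), 3.0]
imputed, flag = impute_mean_with_indicator(train, train)
print(imputed, flag)
```

Keeping the indicator as a separate input lets the regression pick up any association between missingness itself and the target.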
The Regression node used the stepwise method for input variable selection, and validation profit for
complexity optimization.
The selected model included seven inputs. See line 1195 of the Output window.
The odds ratio estimates facilitated model interpretation. Increasing risk was associated with increasing
values of IMP_TLBalHCPct, InqFinanceCnt24, TLDel3060Cnt24, TLDel60Cnt, and TLOpenPct.
Increasing risk was associated with decreasing values of IMP_TLSatPct and TLTimeFirst.
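For readers who want the arithmetic behind the odds ratio report, a logistic regression odds ratio is the exponential of the fitted coefficient: a value above 1 means the odds of the modeled event rise as the input increases, and a value below 1 means they fall. The coefficients in this sketch are hypothetical and are not taken from the fitted model.

```python
import math


def odds_ratio(coefficient, unit_change=1.0):
    """Odds ratio for a `unit_change` increase in an input with the given logit coefficient."""
    return math.exp(coefficient * unit_change)


# Hypothetical coefficients: a positive one raises the odds, a negative one lowers them.
print(odds_ratio(0.5))
print(odds_ratio(-0.5))
```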
The iteration plot (found by selecting View > Model > Iteration Plot in the Results window) can be
set to show average profit versus iteration.
In theory, the average profit for a model using the defined profit matrix equals 1 plus the KS statistic.
Thus, the iteration plot (from the Regression node's Results window) showed how the profit (or, in turn,
the KS statistic) varied with model complexity. From the plot, the maximum validation profit equaled
1.43, which implies that the maximum KS statistic equaled 0.43.
The value of KS calculated by the Model Comparison node was found to differ slightly from this
value. (See below.)
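The relationship between average profit and KS can be checked numerically. The sketch below assumes the profit matrix pays 1/prior for each correct decision and 0 otherwise, with priors taken as the sample proportions; the data and function names are hypothetical. Under those assumptions, the average profit at any cutoff equals the true positive rate plus the true negative rate, that is, 1 + (TPR - FPR), and the maximum of TPR - FPR over cutoffs is the KS statistic.

```python
def average_profit_and_separation(actual, predicted_prob, cutoff):
    """Return (average profit, TPR - FPR) when cases with probability > cutoff are flagged."""
    n = len(actual)
    n_pos = sum(actual)
    n_neg = n - n_pos
    prior_pos = n_pos / n
    prior_neg = n_neg / n
    tp = sum(1 for y, p in zip(actual, predicted_prob) if y == 1 and p > cutoff)
    tn = sum(1 for y, p in zip(actual, predicted_prob) if y == 0 and p <= cutoff)
    fp = n_neg - tn
    # Assumed inverse-prior profit: 1/prior for each correct decision, 0 otherwise.
    avg_profit = (tp / prior_pos + tn / prior_neg) / n
    separation = tp / n_pos - fp / n_neg
    return avg_profit, separation


def ks_statistic(actual, predicted_prob):
    """Largest TPR - FPR over all observed cutoffs."""
    return max(average_profit_and_separation(actual, predicted_prob, c)[1]
               for c in sorted(set(predicted_prob)))


# Hypothetical scored validation cases (1 = primary outcome).
actual = [1, 1, 0, 0, 0, 1, 0, 0]
prob = [0.9, 0.6, 0.4, 0.2, 0.7, 0.3, 0.1, 0.5]
profit, sep = average_profit_and_separation(actual, prob, 0.45)
print(profit, 1 + sep, ks_statistic(actual, prob))
```

Small gaps between 1 + KS and the reported profit can arise in practice because the profit is evaluated at one fixed decision rule while KS is a maximum over all cutoffs.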
Although it is not possible to deploy it as the final prediction model, a neural network was used to
investigate regression lack of fit.
The default settings of the Neural Network node were used in combination with inputs selected by the
Stepwise Regression node.
The iteration plot showed a slightly higher validation average profit compared to the stepwise regression
model.
It was possible (although not likely) that transformations to the regression inputs could improve
regression prediction.
In assaying the data, it was noted that some of the inputs had rather skewed distributions. Such
distributions create high leverage points that can distort an input’s association with the target.
The Transform Variables node was used to regularize the distributions of the model inputs
before fitting the stepwise regression.
The Transform Variables node was set to maximize the normality of each interval input by selecting from
one of several power and logarithmic transformations.
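The idea of choosing a transformation to improve normality can be sketched as follows. This is a simplified stand-in that picks, from a small hypothetical candidate set, the transformation whose output has the smallest absolute skewness; the criterion and the candidate list are assumptions, not the Transform Variables node's actual algorithm. Inputs are assumed to be nonnegative and not all equal.

```python
import math


def skewness(values):
    """Population skewness: third central moment over the 1.5 power of the variance."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical candidate set of power and logarithmic transformations.
CANDIDATES = {
    "identity": lambda x: x,
    "sqrt": math.sqrt,
    "log1p": math.log1p,
    "square": lambda x: x * x,
}


def most_symmetric_transform(values):
    """Name of the candidate transformation giving the smallest absolute skewness."""
    best_name, best_skew = None, float("inf")
    for name, fn in CANDIDATES.items():
        skew = abs(skewness([fn(v) for v in values]))
        if skew < best_skew:
            best_name, best_skew = name, skew
    return best_name


# A right-skewed hypothetical input favors the logarithmic candidate.
print(most_symmetric_transform([1, 2, 4, 8, 16, 32, 64]))
```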
The Transformed Stepwise Regression node performed stepwise selection from the transformed inputs.
The selected model had many of the same inputs as the original stepwise regression model, but on a
transformed (and difficult to interpret) scale.
The transformations would be justified (despite the increased difficulty in model interpretation) if they
resulted in significant improvement in model fit. Based on the profit calculation, the transformed stepwise
regression model showed only marginal performance improvement compared to the original stepwise
regression model.
Partitioning input variables into discrete ranges was another common risk-modeling method that was
investigated.
Three discretization approaches were investigated. The Bucket Input Variables node partitioned each
interval input into four bins with equal widths. The Bin Input Variables node partitioned each interval
input into four bins with equal sizes. The Optimal Discrete Input Variables node found optimal partitions
for each input variable using decision tree methods.
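The difference between equal-width (bucket) and equal-size (bin) partitions can be seen in a short sketch. The function names and sample values are hypothetical, and the equal-size version assigns bins by rank, which splits tied values arbitrarily; it is an illustration rather than the nodes' implementation.

```python
def bucket_bins(values, n_bins=4):
    """Equal-width bins: split the range [min, max] into n_bins intervals of the same width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def quantile_bins(values, n_bins=4):
    """Equal-size bins: assign bins by rank so each bin holds about the same number of cases."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


# A skewed hypothetical input: one large value stretches the range.
values = [1, 2, 3, 4, 5, 6, 7, 100]
print(bucket_bins(values))
print(quantile_bins(values))
```

With a skewed input, the equal-width scheme puts nearly every case in the first bin and leaves the others sparse or empty, while the rank-based scheme keeps the bin counts balanced.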
Bucket Transformation
The relatively small size of the CREDIT data set resulted in problems for the bucket stepwise regression
model. Many of the bins had a small number of observations, which resulted in quasi-complete separation
problems for the regression model, as dramatically illustrated by the selected model’s odds ratio report.
Go to line 1057 of the Output window.
The iteration plot showed substantially worse performance compared to the other modeling efforts.
For the bin (equal-size) stepwise regression, the improved model fit was also seen in the iteration plot,
although the average profit of the selected model was still not as large as that of the original stepwise
regression model.
Optimal Transformation
A final attempt at discretization was made using the optimistically named Optimal Discrete
transformation. The final model, with 18 degrees of freedom, included 10 separate inputs (more than
any other model). Contents of the Output window starting at line 1696 are shown below.
The validation average profit was still slightly smaller than the original model. A substantial difference in
profit between the training and validation data was also observed. Such a difference was suggestive of
overfitting by the model.
The collection of models was assessed using the Model Comparison node. (Only a portion of the flow is
shown to assist in viewing.)
The Fit Statistics table from the Output window is shown below.
Data Role=Valid

Statistics                                                  Reg   Neural     Reg5     Reg2     Reg4     Reg3
Valid: Kolmogorov-Smirnov Statistic                        0.43     0.46     0.42     0.44     0.45     0.39
Valid: Average Profit for TARGET                           1.43     1.42     1.42     1.42     1.41     1.38
Valid: Average Squared Error                               0.12     0.12     0.12     0.12     0.12     0.13
Valid: Roc Index                                           0.77     0.77     0.76     0.78     0.77     0.73
Valid: Average Error Function                              0.38     0.39     0.40     0.38     0.39     0.43
Valid: Percent Capture Response                           14.40    12.00    11.60    14.40    12.64     9.60
Valid: Divisor for VASE                                 3000.00  3000.00  3000.00  3000.00  3000.00  3000.00
Valid: Error Function                                   1152.26  1168.64  1186.46  1131.42  1158.23  1282.59
Valid: Gain                                              180.00   152.00   148.00   192.00   144.89   124.00
Valid: Gini Coefficient                                    0.54     0.54     0.53     0.56     0.54     0.47
Valid: Bin-Based Two-Way Kolmogorov-Smirnov Statistic      0.43     0.44     0.41     0.44     0.45     0.39
Valid: Lift                                                2.88     2.40     2.32     2.88     2.53     1.92
Valid: Maximum Absolute Error                              0.97     0.99     1.00     0.98     0.99     1.00
Valid: Misclassification Rate                              0.17     0.17     0.17     0.17     0.17     0.17
Valid: Mean Square Error                                   0.12     0.12     0.12     0.12     0.12     0.13
Valid: Sum of Frequencies                               1500.00  1500.00  1500.00  1500.00  1500.00  1500.00
Valid: Total Profit for TARGET                          2143.03  2131.02  2127.45  2127.44  2121.42  2072.25
Valid: Root Average Squared Error                          0.35     0.35     0.35     0.34     0.35     0.36
Valid: Percent Response                                   48.00    40.00    38.67    48.00    42.13    32.00
Valid: Root Mean Square Error                              0.35     0.35     0.35     0.34     0.35     0.36
Valid: Sum of Square Errors                              359.70   367.22   371.58   352.69   366.76   381.44
Valid: Sum of Case Weights Times Freq                   3000.00  3000.00  3000.00  3000.00  3000.00  3000.00

The best model, as measured by average profit, was the original regression. The neural network had the
highest KS statistic. The transformed regression, Reg2, had the highest ROC index.
If the purpose of a credit risk model is to order the cases, then Reg2 had the highest rank decision
statistic, the ROC index.
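As a reminder of what the ROC index measures, the sketch below computes it as the fraction of (primary, secondary) case pairs in which the primary case receives the higher score, counting ties as one half. It also derives the Gini coefficient as 2 x ROC index - 1, a relationship that the reported Fit Statistics values match up to rounding. The data and function names are hypothetical.

```python
def roc_index(actual, score):
    """Share of (primary, secondary) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(actual, score) if y == 1]
    neg = [s for y, s in zip(actual, score) if y == 0]
    concordant = sum(1.0 if p > q else 0.5 if p == q else 0.0
                     for p in pos for q in neg)
    return concordant / (len(pos) * len(neg))


def gini_coefficient(actual, score):
    """Gini coefficient derived from the ROC index."""
    return 2.0 * roc_index(actual, score) - 1.0


# Hypothetical scored cases (1 = primary outcome).
actual = [1, 1, 0, 0, 0, 1, 0, 0]
score = [0.9, 0.6, 0.4, 0.2, 0.7, 0.3, 0.1, 0.5]
print(roc_index(actual, score), gini_coefficient(actual, score))
```

Because the ROC index depends only on the ordering of the scores, it rewards a model for ranking cases well regardless of the cutoff used for the final decision.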
In short, the best model for deployment was as much a matter of taste as of statistical performance. The
relatively small validation data set used to compare the models did not produce a clear winner.
In the end, the model selected for deployment was the original stepwise regression, because it offered
consistently good performance across multiple assessment measures.