Glass Box Neural Networks

Ross  Bettinger

SESUG Paper 102-2019 Glass Box Neural Networks Ross Bettinger, Silver Spring, MD Abstract Neural network models are typically described as “black boxes” because their inner workings are not easy to understand. We propose that, since a neural network model that accurately predicts its target variable is a good representation of the training data, the output of the model may be recast as a target variable and subjected to standard regression algorithms to “explain” it as a response variable. Thus, the “black box” of the internal mechanism is transformed into a “glass box” that facilitates understanding of the underlying model. Deriving a regression model from a set of training data analogous to a neural network is an effective means to understand a neural network model because regression algorithms are commonly-used tools and the interpretation of a regression model is straight-forward and well-understood. Keywords Ordinary least squares regression, logistic regression, multilayer perceptron, neural networks, target variable, dependent variable, continuous variable, categorical variable, variable selection, black box, glass box, SAS® Enterprise Miner® Introduction Neural networks are machine learning algorithms that are noted for their ability to learn descriptive features of a set of training data. The model represented by a neural network can be applied to new data for the purpose of predicting the value of an unknown target variable. Neural networks use supervised learning to create a model based on features of a continuous target variable or of a categorical target variable, and thus can be used for building regression or classification models. Unlike ordinary least squares regression or logistic regression models, neural networks do not produce a set of parameter estimates. Such parameters can be used to signify the unit change in a continuous target variable given a unit change in a regressor, or the change in the odds ratio of a categorical target variable for a specified value of a categorical predictor. Hence, neural networks are often called “black box” models because their inner workings are opaque and are hard to interpret. We demonstrate a technique by which the output of a neural network can be analyzed by regression to transform the neural network output into the context of a regression problem. The parameter estimates of the regression model may be used as surrogates for the neural network variable weights and biases to reveal the inner workings of the neural network. Discussion If a neural network accurately fits its set of training data, we may conclude that it has successfully abstracted from the data the relevant relationships between the target variable and the independent variables associated with the target variable. We will assume that this is the case, so that the output of the neural network, which represents the prediction of the model, may be reinterpreted to be a target variable for a subsequent model. We may then build a second model using the neural network output and all of the original variables used in the first model. While we understand that a model is an approximation to and an abstraction 1 from the relationships in the data used to build it, we base our thesis on the concept that “… all models are wrong; the practical question is how wrong do they have to be to not be useful.”1 We restrict our exposition to multilayer perceptron neural networks that produce a single output value which becomes the dependent variable for a regression algorithm. Methodology We describe the methodology of converting a “black box” neural network into a “glass box” model briefly and demonstrate the technique with an example. There are three phases to the technique: 1. Build a single-output neural network model 1.1. Build a classification model if the neural network target variable2 is categorical in nature, e.g., the target variable is to be assigned a label from a (typically small) finite set of labels. The output is then a label stored in the predicted target variable that is assigned to an observation. 1.2. Build a prediction model if the target variable is numeric and continuous in nature, e.g., the target variable may represent a potentially infinite number of values. The output is then a numeric value assigned to the predicted target variable. 2. Build a regression model using the output of the neural network model as the dependent variable based on all of the original variables used to build the NN model. All of the original variables must be used because the information contained in the modeling data is related to the output of the NN, e.g., the label assigned to the target variable or the value computed for it, and the regression algorithm, must use the same information to interpret the NN output as was used to create the NN model.3 2.1. If a classification NN model was built, use logistic regression for a binary-valued target variable or multinomial logistic regression for a nominal or ordinal-valued target variable.4 2.2. If a continuous NN model was built, use ordinary least-squares regression. 3. Assuming that the regression model is a close approximation to the neural network model, use the parameter estimates of the regression model to explain the effect of the predictor variables on the value of the target variable. By assumption, since the regression model output closely approximates the NN model output, the regression parameter estimates are useful proxies for the NN model predictor variable weights, and we may describe the opaque workings of the NN model in terms of the transparent regression equation. Example of Categorical Target Variable A categorical target variable can be binary, nominal, or ordinal in its measurement scale. It has a finite set of values, typically a very small number. For the purpose of this discussion, we use sample data from 1 George Box, https://en.wikipedia.org/wiki/All_models_are_wrong In the machine learning literature, a “target variable” is the variable whose values are to be predicted by a machine learning algorithm. The ML “target variable” is the same as the statistician’s “dependent variable”. It is not clear to us why there is a difference in terminology, but there are some things which are not given us to know. 3 More complex algorithms may be used, e.g., generalized linear models, but a simple, well-understood algorithm admits of readily-understood interpretations. 4 A generalized linear model may be used if the relationship between the linkage of the odds ratio and the dependent variables is not linear, but increasing sophistication may beget increasing subtlety of interpretation. 2 2 the 1994 Census database [1]. The target variable is a binary variable which contains 1 if a person’s income is over $50,000/year and 0 if the person’s income is less than or equal to $50,000/year. In addition to the binary target variable, there were four interval-scale and eight nominal-scale input variables. Table 1 contains a brief description of the variables used in the model. Table 1: Categorical Target Modeling Variables Variable Class Age Cap_Gain Cap_Loss Country Educ Hourweek Marital Occupatn Race Relation Sex Workclass Measurement Scale Binary Continuous Continuous Continuous Nominal Nominal Continuous Nominal Nominal Nominal Nominal Nominal Nominal Description Target variable Person’s age Income from investments, apart from wages/ salary Losses from investments, apart from wages/salary Country of origin Highest educational level achieved Hours worked per week Marital status Occupational category White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black Wife, Own-child, Husband, Not-in family, Other-relative, Unmarried Male/Female Private sector, public sector, &c. Exploratory Data Analysis Exploratory data analysis revealed that the variables most strongly correlated with the target variable were Age, Educ, Hourweek, Occupatn, and Relation. The other variables were omitted from the analysis because they were not strongly associated with the target variable. The Relation variable was later discarded because it created an error condition called “quasi-complete separation”. This topic is discussed below. 3 Stacked histograms of the predictor variables show the distribution of the target variable by grouping interval: Figure 1: Stacked histograms of raw data We see that the peak earning years for the ages of persons included in the data sample extend from the mid-30’s to the mid-50’s. College graduates with bachelor’s degrees and higher education are much more likely to have incomes greater than $50,000 than those who did not complete four years of higher education. Those who worked more than 40 hours/week are represented proportionately more in the greater than $50,000 class than other workers. Salaried employees in sales, professional, and executive occupations are also high-income individuals compared to blue collar workers or support occupations. Quasi-Complete Separation After we built preliminary NN models, we noticed that the Relation variable contained categories that did not contain any values for target variable = 1, e.g., for the case where the income was greater than $50,000. The NN models classified all cases of ‘Own-child’ and ‘Unmarried’ into tgt_class = 0, thus creating a condition called “Quasi-complete separation”. The logistic regression algorithm is designed to produce a rule that separates the set of input data into two subsets that have minimal overlap, and if the 4 data contain disjoint subsets, the parameter estimation algorithm fails to converge. One remedy in this case is to exclude the variable causing the separation from the modeling process.5 Where appropriate and meaningful, we grouped the predictors into fewer discrete categories than are present in the data so as to ensure that there would be an adequate representation of the target variable in each category to avoid quasi-complete separation. For example, in Table 2, we see that the rule “If Relation = ‘Own-child’ or Relation = ‘Unmarried’ then Into: tgt_class = 0” would completely separate the dataset into two disjoint sets.6 No other variables would be required, and in this case, the logistic regression algorithm would fail. We did not include the variable Relation in subsequent analysis for this reason. Table 2: Example of Quasi-Complete Separation Modifying the Data Histograms of the grouped variables are shown in Figure 2. We arranged the groups based on visual inspection for variables Age, Educ, and Hourweek, and used the SAS Enterprise Miner® Decision Tree node to group the Occupatn variable. SAS Usage Node 22599: “Understanding and correcting complete or quasi-complete separation problems”, addresses this situation (http://support.sas.com/kb/22/599.html) 6 The variable “Into: tgt_class” is created by the NN model and contains the decision made by the NN model to assign an observation into the class “<=50K” or “>50K”. This variable is critical to the glass box process in that it links the NN model to the logistic regression model. “Into: tgt_class” is the output of the NN model and it is used as the dependent variable of the logistic regression modeling procedure. 5 5 Figure 2: Stacked Histograms of Grouped Data Neural Network Modeling We used Enterprise Miner to build three NN models to explore the effect of complexity on classification accuracy. The multilayer perceptron neural networks had one, two, and three combination functions in the hidden layer. We tested the hypothesis that complex relationships between the target variable and the predictor variables would be better represented by more complex NN models. Figure 3 shows the ROC plot for the three models across the training, validation, and test datasets used to train and evaluate the models. 6 Figure 3: ROC Plots for Neural Network Models Table 2 indicates the improvement in performance due to increased NN complexity, which is implemented by increasing the number of hidden nodes. The AUROC statistic is computed from the Test data, which represents the holdout sample and is assumed to be similar to data that would be used in scoring for a deployed model. Table 3: Number of Hidden Nodes Number of Hidden Nodes 1 2 3 Area Under ROC Plot 0.877 0.878 0.876 We see that there is very little improvement in classification accuracy attributable to increasing complexity, so we used Ockham’s Razor7 and invoked the principle of parsimony to select the simplest model, e.g., the NN model with one hidden node.8 Logistic Regression Modeling We used the same data (training and validation datasets) and predictors (Age_Group, Educ_Group, Hourweek_Group, Occupatn_Group) that served as inputs to the NN model as input to the logistic regression model. The dependent variable for the logistic regression model was the output of the NN model, “Into: tgt_class”. The resulting logistic regression model is 𝑙𝑜𝑔 ( 𝑝𝑖 1−𝑝𝑖 )=𝛽0 + 𝛽1 𝐴𝑔𝑒_𝐺𝑟𝑜𝑢𝑝𝑖 + 𝛽2 𝐸𝑑𝑢𝑐_𝐺𝑟𝑜𝑢𝑝𝑖 + 𝛽3 𝐻𝑜𝑢𝑟𝑤𝑒𝑒𝑘_𝐺𝑟𝑜𝑢𝑝𝑖 + 𝛽4 𝑂𝑐𝑐𝑢𝑝𝑎𝑡𝑛_𝐺𝑟𝑜𝑢𝑝𝑖 where 𝑝𝑖 represents the probability that observation i belongs to the income > $50,000 class, e.g., Into_class = 1. If we define the log of the odds ratio as the logit(𝑝𝑖 ), then we can say that 7 [1] The principal of parsimony states that "Entities are not to be multiplied without necessity". See, e.g., https://en.wikipedia.org/wiki/Occam%27s_razor for historical background. 8 It can be shown that a three-layer NN with one hidden node is equivalent to a logistic regression algorithm. See Appendix A. 7 𝑝 𝑙𝑜𝑔𝑖𝑡(𝑝𝑖 ) = log ( 1−𝑝𝑖 ) = 𝜷′ 𝒙𝒊 so that 𝑝𝑖 = 𝑖 ′ 𝑒 𝜷 𝒙𝒊 ′ 1+𝑒 𝜷 𝒙𝒊 is the probability that observation i is in class 1 [2]. We note that the model diagnostics indicated satisfactory performance and that the logistic regression model based on the output of the NN model represented the NN model’s performance to a high degree of accuracy. The area under the ROC curve (AUROC) was 0.9324, indicating that the logistic regression model performed very well under a variety of event definitions where the probability of Into: tgt_class ranged from 0 to 1. Perfect separation of the Into:tgt_class dependent variable into disjoint subsets would produce an AUROC of 1. Figure 4 shows this performance. Figure 4: LR Model Based on NN Model Interpretation of Logistic Regression Results To simplify the interpretation of the LR results, we built a model using educational attainment alone. Table 4 shows the distribution of the dependent variable, Into:tgt_class, by category. We see that every category is populated, although “Elem-Some-High-School” is sparse for high-income observations. Table 4: Education Group 8 The SAS code used to build the model shown in Equation 2 is proc logistic data=train_validate plots( only )=( oddsratio( group ) roc ) ; class I_tgt_class educ_group( ref='4 Bachelors' ) ; model I_tgt_class( event = '1' ) = educ_group / rsquare ; oddsratio educ_group / diff=ref ; run ; The model equation is 𝑙𝑜𝑔𝑖𝑡(𝑝𝑖 ) = 𝛽0 +𝛽1 (𝐸𝑑𝑢𝑐_𝐺𝑟𝑜𝑢𝑝𝑖 = ′ 1𝐸𝑙𝑒𝑚 − 𝑆𝑜𝑚𝑒 − 𝐻𝑖𝑔ℎ − 𝑆𝑐ℎ𝑜𝑜𝑙 ′ ) +𝛽2 (𝐸𝑑𝑢𝑐_𝐺𝑟𝑜𝑢𝑝𝑖 = ′ 2 𝐻𝑆 − 𝐺𝑟𝑎𝑑𝑢𝑎𝑡𝑒 ′ ) +𝛽3 (𝐸𝑑𝑢𝑐_𝐺𝑟𝑜𝑢𝑝𝑖 = ′ 3 𝐴𝑠𝑠𝑜𝑐, 𝑆𝑜𝑚𝑒 − 𝐶𝑜𝑙𝑙𝑒𝑔𝑒 ′ ) +0 (𝐸𝑑𝑢𝑐_𝐺𝑟𝑜𝑢𝑝𝑖 = ′ 4 𝐵𝑎𝑐ℎ𝑒𝑙𝑜𝑟𝑠 ′ ) +𝛽5 (𝐸𝑑𝑢𝑐_𝐺𝑟𝑜𝑢𝑝𝑖 = ′ 5 𝑃𝑜𝑠𝑡𝑔𝑟𝑎𝑑𝑢𝑎𝑡𝑒′) [2] Since we used the ‘4 Bachelors’ category as the reference value to which other categories are compared, it is not represented in the equation. Table 5 contains the parameter estimates obtained by the maximum likelihood process. Table 5: Logistic Regression Maximum Likelihood Estimates The parameter estimates from the maximum likelihood estimation process have been substituted into Eq. 2 to produce the model that represents the effect of educational attainment on achieving high income. 𝑙𝑜𝑔𝑖𝑡(𝑝𝑖 ) = −2.1777 −4.8533 ∙ (𝐸𝑑𝑢𝑐_𝐺𝑟𝑜𝑢𝑝𝑖 = ′ 1𝐸𝑙𝑒𝑚 − 𝑆𝑜𝑚𝑒 − 𝐻𝑖𝑔ℎ − 𝑆𝑐ℎ𝑜𝑜𝑙 ′ ) −0.5519 ∙ (𝐸𝑑𝑢𝑐_𝐺𝑟𝑜𝑢𝑝𝑖 = ′ 2 𝐻𝑆 − 𝐺𝑟𝑎𝑑𝑢𝑎𝑡𝑒 ′ ) +0.7464 ∙ (𝐸𝑑𝑢𝑐_𝐺𝑟𝑜𝑢𝑝𝑖 = ′ 3 𝐴𝑠𝑠𝑜𝑐, 𝑆𝑜𝑚𝑒 − 𝐶𝑜𝑙𝑙𝑒𝑔𝑒 ′ ) +0 ∙ (𝐸𝑑𝑢𝑐_𝐺𝑟𝑜𝑢𝑝𝑖 = ′ 4 𝐵𝑎𝑐ℎ𝑒𝑙𝑜𝑟𝑠 ′ ) +2.7002 ∙ (𝐸𝑑𝑢𝑐_𝐺𝑟𝑜𝑢𝑝𝑖 = ′ 5 𝑃𝑜𝑠𝑡𝑔𝑟𝑎𝑑𝑢𝑎𝑡𝑒′) [3] Table 6 displays the odds ratio estimates computed by PROC LOGISTIC. Each category is compared to ′ ‘4 Bachelors’. We recall that the odds ratio of a particular category is 𝑂𝑅 = 𝑒 𝛽 𝒙 . 9 Table 6: Odds Ratio Estimates Figure 5 graphically represents the impact of education on income. Figure 5: Odds Ratios of Education Referred to Baccalaureate The odds ratio estimates in Table 5 can be converted into probabilities by using the relationship 𝑝 = 𝑂𝑅 . 1+𝑂𝑅 We computed the probabilities of being in the high-income class based on educational achievement compared to a Bachelors degree and include them in Table 7: Table 7: Probability of High Income Based on Educational Achievement Category Elem-Some-High-School vs Bachelors HS Graduate vs Bachelors Assoc, Some College vs Bachelors Postgraduate vs Bachelors Probability 0.000999 .0749 .2296 .6773 Clearly, the importance of educational achievement on earning power cannot be disregarded. The ROC plot for the logistic regression model is shown in Figure 6. The single variable Educ_Group has strong predictive power! 10 Figure 6: ROC Curve for Educ_Group Model Summary We propose a three-phase model building approach in the context of a classification problem. It may equally well be applied to a prediction problem for a continuous target variable. We demonstrated the feasibility of using logistic regression to illuminate the inner workings of a simple feed-forward neural network model with a categorical target variable. By mapping NN methodology into a regression context, we converted the black box NN model into a “glass box” logistic regression model. We believe that this technique is applicable to various modeling problems and is highly useful in understanding NN models. Converting the “black box” of the hidden layer of NN modeling into a “glass box” of regression removes the mystery of the NN model and may reduce the natural tendency to avoid what cannot be understood. We hope that this work may encourage application of NN modeling technology to a wider audience of decision makers. 11 Appendix The single hidden layer perceptron shown in Figure A1 computes the signum function (Figure A2). The combination function 𝑓 combines the bias and weighted inputs and produces an input to the signum function. The signum function applies thresholding to the input and creates a discrete value in the interval [-1,1]. Figure A1: Single Hidden Layer Perceptron The signum function applies hard limiting to its input. If x < 0 then signum( x ) = -1, if x = 0 then signum( x ) = 0, and if x > 0 then signum( x ) = +1. Figure A2: Signum Function 12 The logistic function is defined to be 1 1 + 𝑒 −𝑥 If the logistic function is used instead of the signum function, the discriminatory power of the neural network is increased because the output of logistic( x ) is a continuous value in the interval (0, 1). This output may be defined to be the probability of an event, which represents the outcome of some process under observation. Then we may say that, if the probability of an event is, e.g., 0.75, and if the output of the neural network is 0.80, the label applied to the event is “Occurred”. Otherwise, it did not occur if the threshold of 0.75 is not exceeded, and the label is “Nonoccurrence”. In this case, the event is binary-valued. Other definitions are possible, based on the number of states (labels) that an event can represent. 𝑙𝑜𝑔𝑖𝑠𝑡𝑖𝑐( 𝑥 ) = Figure A3: Logistic Function Then the neural network equation 𝑦 = 𝑓(𝑏𝑖𝑎𝑠 + 𝑤1 𝑥1 + 𝑤2 𝑥2 ) has the same structure as the logistic regression equation 𝑝 𝑙𝑜𝑔𝑖𝑡( ) = 𝛽0 + 𝛽1 𝑥1 + 𝛽2 𝑥2 1−𝑝 where 𝑓(𝑥) = 𝑙𝑜𝑔𝑖𝑠𝑡𝑖𝑐(𝑥) and the equivalence between a neural network model and a logistic regression model is apparent. References [1] Kohavi, Ronny and Becker, Barry (1996). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml/datasets/Adult]. Irvine, CA: University of California, School of Information and Computer Science. [2] Allison, Paul D. (1999). Logistic Regression Using the SAS® System: Theory and Application. Cary, NC: SAS Institute Inc. 13 Acknowledgements We thank Garrett Frere and Mark Leventhal for their generosity in taking the time to review this document. Contact Information Your comments and questions are valued and encouraged. Contact the author at: Name: Ross Bettinger Enterprise: Consultant E-mail: rsbettinger@gmail.com SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies. 14

Log In

Glass Box Neural Networks

Glass Box Neural Networks

Related Papers

RELATED PAPERS