1. Introduction
The selection of parsimonious log-linear and logistic models for discrete multivariate data is crucial for concise data interpretation based on statistical inference methods. Since the advent of linear regression, researchers have developed model selection methods such as ridge regression [1], the C_p statistic [2], and the Akaike information criterion (AIC) [3]. As a tool for enhancing the prediction accuracy of a selected model, the AIC inspired the development of the Bayesian information criterion (BIC) [4], which maximizes the posterior likelihood against a finite-mixture prior distribution. Conventional selection procedures with various loss-function penalty criteria have been implemented to facilitate the selection of models and corresponding variables in generalized linear models (GLMs) [5,6,7,8]. However, the inclusion of AIC and BIC in statistical software packages [9] has led to different sets of predictors being selected when GLMs are applied to the same dataset [10,11].
Researchers in the field of model and variable selection have recently grappled with high dimensionality (numerous variables) and small (sparse) datasets. This has led to the development of methods based on optimizing penalized loss-function criteria, such as the Lasso, adaptive Lasso, elastic net, SCAD, and the Dantzig selector [12,13,14,15,16]. In a large-scale meta-analysis, Wang et al. [17] observed considerable variation in the cross-validation performance of model/variable selection methods.
When employing contingency tables as the core of discrete multivariate data analysis, researchers have established that valid log-linear models can be used to generate valid logistic models; however, the converse is not true. The construction of hierarchical log-linear models is generally based on stepwise selection procedures, beginning with the inspection of two-way association effects, followed by significant three-way and higher-order interaction effects. The effect of including additional parameters can be assessed using the likelihood ratio (LR) deviance or chi-square statistic, as a basic way of testing goodness of fit when discriminating between models with nested parameter structures [18,19,20,21,22]. In contingency table analysis, the LR deviance is a log-likelihood measure of the geometric projection from the data to a hypothetical model. By analogy, the algebraic definition of a log-linear or logistic model is equivalent to describing the geometry of an orthogonal decomposition of the log-likelihood that intrinsically characterizes the association effects between variables.
The analysis of variance (ANOVA) for a linear model illustrates the orthogonal log-likelihood decomposition of normally distributed variables. A basic extension of this analysis is the mutual information (MI) identity, which introduces an orthogonal decomposition of the log-likelihood ratios of the joint distribution of variables [23]. For example, the model of independence between three variables can be decomposed into a sum of three orthogonal terms: two two-way MI association terms and one three-way conditional MI (CMI) term. Each term defines a Kullback–Leibler (KL) divergence value between a pair of variables [24]. A two-way KL divergence value describes a log-likelihood projection from the data to the model of independence between the variables, which characterizes the basic Pythagorean law of information identity [25]. The three-way Pythagorean law describes the CMI as the sum of two orthogonal components: the interaction (which vanishes with normal distributions) and the partial association. When dealing with a three-way contingency table, this law yields two-step orthogonal LR tests rather than a non-orthogonal combination of the Breslow–Day test [26] and the Mantel–Haenszel test [27]. Related examples can be found in [28]. The MI of a vector variable can generally be decomposed as a sum of orthogonal lower-dimensional MI and CMI terms, which capture the significant main and interaction effects between variables.
In this paper, we propose an MI decomposition approach to constructing log-linear and logistic models based on the multinomial distribution of a contingency table. In Section 2, we demonstrate that the basic forms of log-linear models can be expressed using MI and CMI terms. We also demonstrate the execution of a two-step LR test for conditional independence based on the Pythagorean law. In Section 3, we examine the main and interaction effects among factors by testing a series of MI identities in a clinical dataset, which includes six binary risk factors and the dichotomous stroke status of patients. We then use the observed significant effects to construct a concise log-linear model and present its graphical model. In Section 4, we present a parsimonious logistic model for predicting stroke status through the MI deletion of redundant factors and effects. In Section 5, we outline the stepwise forward and backward MI selection procedure for indispensable predictors and construct a parsimonious logistic model by inspecting an MI identity comprising significant main and interaction effects. In Section 6, we apply the proposed MI procedure to a German banking credit dataset (from the UCI Machine Learning Repository), which comprises 21 attributes of 1000 customers [7]. Section 7 presents an empirical comparison between the proposed MI analysis and two state-of-the-art methods.
To summarize, the proposed MI analysis can be used to construct parsimonious log-linear and logistic models for discrete multivariate data. It is also shown that the discretization of continuous variables offers a legitimate and promising way to apply MI analysis to datasets that include both continuous and discrete variables. For extended applications to GLMs, the corresponding inference theory remains to be fully developed.
2. Log-Linear Model and Mutual Information
Among the GLMs, log-linear models are commonly applied to modeling contingency tables through the log link function and the multinomial and Poisson distributions. The conditional distribution of independent Poisson cell counts given the total count characterizes the multinomial distribution of a contingency table [18,21,29]. We illustrate the essential connections between hierarchical log-linear models and mutual information decompositions in this section. Let n = {n_ijk} denote a three-way I × J × K contingency table with cell counts n_ijk, i = 1, …, I, j = 1, …, J, k = 1, …, K, and total count N = Σ_ijk n_ijk. Denote the joint probability density function (pdf) and cell mean by p_ijk and m_ijk (= N p_ijk), respectively. The (multivariate) multinomial distribution of the table is

P(n) = (N! / ∏_ijk n_ijk!) ∏_ijk p_ijk^n_ijk,   (1)

where N = Σ_ijk n_ijk and Σ_ijk p_ijk = 1. Equation (1) can be derived from a random total count N, where n follows a multivariate Poisson sampling distribution

P(n) = ∏_ijk e^(−m_ijk) m_ijk^n_ijk / n_ijk!.

Each cell mean m_ijk is a Poisson mean, and e is the Euler number [29]. Aside from a constant term, the log-likelihood of the Poisson multinomial model is

log L(n) = Σ_ijk n_ijk log m_ijk.   (2)
In a hierarchical log-linear model, for example, the saturated model of three variables specifies the logarithmic factor in (2) as

log m_ijk = λ + λ_i^X + λ_j^Y + λ_k^Z + λ_ij^XY + λ_ik^XZ + λ_jk^YZ + λ_ijk^XYZ.   (3)

Model (3) explains the observed table through inspecting significant main effects and two- and three-way effects by checking the LR deviance statistics between nested models [18,19,21,30,31,32].
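The nested-model comparison described above can be reproduced with standard GLM software. The following is a minimal sketch (ours, not from the paper) that fits two nested log-linear models to a hypothetical 2 × 2 × 2 table of counts using the Python package statsmodels and compares them by the LR deviance; the variable names and cell counts are illustrative assumptions.

```python
# Sketch: LR deviance between nested log-linear models on a hypothetical 2x2x2 table.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chi2

# Hypothetical cell counts for binary X, Y, Z (illustrative only).
cells = pd.DataFrame([(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)],
                     columns=["X", "Y", "Z"])
cells["count"] = [50, 30, 28, 42, 33, 47, 45, 25]

# Conditional independence of X and Y given Z: log-linear model {XZ, YZ}.
m0 = smf.glm("count ~ C(X)*C(Z) + C(Y)*C(Z)", cells,
             family=sm.families.Poisson()).fit()
# Add the XY term: log-linear model {XZ, YZ, XY}.
m1 = smf.glm("count ~ C(X)*C(Z) + C(Y)*C(Z) + C(X):C(Y)", cells,
             family=sm.families.Poisson()).fit()

lr = m0.deviance - m1.deviance          # LR deviance for the added XY effect
df = m0.df_resid - m1.df_resid
print(f"LR = {lr:.3f}, df = {df}, p = {chi2.sf(lr, df):.4f}")
```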
We now formulate a theoretically equivalent approach to inspecting hierarchical log-linear models such as (3) through the analysis of MI and illustrate the connection between the two procedures. In a three-dimensional table, the Shannon entropy defines a basic equation of joint and marginal probabilities as

H(X) + H(Y) + H(Z) = H(X, Y, Z) + I(X, Y, Z),   (4)

where H(X, Y, Z) is the joint entropy. A marginal entropy such as H(X) is defined by analogy. Here, I(X, Y, Z) is the MI between the three variables [33]. The MI is defined with a geometric notion; that is, it is the KL divergence value from the joint pdf onto the product of marginal pdfs, or the KL projection root in the hypothesized space of independence [24,25,34]. By factoring the joint log-likelihood, an orthogonal partition of the MI terms between the variables is expressed as the following MI identity

I(X, Y, Z) = I(X, Z) + I(Y, Z) + I(X, Y | Z).   (5)
A two-way MI term such as I(X, Z) is analogously defined with the marginal (X, Z) table. The CMI term I(X, Y | Z) defines the expected log-likelihood ratio for testing the conditional independence between X and Y across levels of Z. The right-hand side of (5) admits three information-equivalent forms through exchanging the common variable Z with either X or Y. It is crucial that the MI of a three-way table (X, Y, Z) is decomposed as the sum of two two-way MI terms and a three-way CMI term. With the multinomial log-likelihood, the sample version of (5) satisfies the same MI identity through the expression

2n Î(X, Y, Z) = 2n [Î(X, Z) + Î(Y, Z) + Î(X, Y | Z)], with Î(X, Z) = Σ_ik p̂_ik log (p̂_ik / (p̂_i p̂_k)),   (6)

where n is the total sample size. The notation p̂_ik denotes the sample estimate of the joint pdf in the (i, k) cell, and p̂_i p̂_k is the product pdf estimate under the assumption of independence. Other notations in (6) are defined by analogy. The constant factor 2n is added to yield a quadratic approximation in distribution such that the scaled 2n Î(X, Z), 2n Î(Y, Z), and 2n Î(X, Y | Z) are asymptotically chi-square distributed with (I − 1)(K − 1), (J − 1)(K − 1), and (I − 1)(J − 1)K degrees of freedom (dfs), respectively. For ease of notation, the factor 2n will be implicitly used as in (6) but omitted in the rest of the text.
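As a numerical check of identity (6), the following sketch (ours) computes the scaled terms 2nÎ(X, Z), 2nÎ(Y, Z), and 2nÎ(X, Y|Z) directly from a hypothetical three-way table of counts and verifies that they sum to 2nÎ(X, Y, Z); the counts are randomly generated for illustration.

```python
# Sketch: sample MI identity (6) for a hypothetical I x J x K table of counts.
import numpy as np

rng = np.random.default_rng(0)
n_ijk = rng.integers(5, 40, size=(2, 3, 2)).astype(float)   # illustrative counts
n = n_ijk.sum()
p = n_ijk / n                                               # joint pdf estimate

def mi(p_xy):
    """Mutual information of a two-way probability table."""
    px = p_xy.sum(axis=1, keepdims=True)
    py = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float((p_xy[mask] * np.log(p_xy[mask] / (px * py)[mask])).sum())

p_xz = p.sum(axis=1)        # marginal (X, Z) table
p_yz = p.sum(axis=0)        # marginal (Y, Z) table
p_z = p.sum(axis=(0, 1))    # marginal Z

# CMI I(X, Y | Z) = sum_z p(z) * I(X, Y | Z = z)
cmi = sum(p_z[k] * mi(p[:, :, k] / p_z[k]) for k in range(p.shape[2]))

# Total MI I(X, Y, Z): KL divergence from the joint pdf to the product of marginals.
px = p.sum(axis=(1, 2)); py = p.sum(axis=(0, 2)); pz = p_z
prod = px[:, None, None] * py[None, :, None] * pz[None, None, :]
total = float((p * np.log(p / prod)).sum())

print("2n I(X,Z)   =", 2 * n * mi(p_xz))
print("2n I(Y,Z)   =", 2 * n * mi(p_yz))
print("2n I(X,Y|Z) =", 2 * n * cmi)
print("sum vs 2n I(X,Y,Z):", 2 * n * (mi(p_xz) + mi(p_yz) + cmi), 2 * n * total)
```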
In application, the sample MI Î(X, Y | Z) in (6), comparable to the LR deviance statistic [18], is used to test the hypothesis of conditional independence between X and Y given Z; that is, I(X, Y | Z) = 0, which defines the log-linear model {XZ, YZ}. With this hypothesis, a remarkable Pythagorean law characterizes that Î(X, Y | Z) depicts the hypotenuse of a right triangle with two orthogonal sides: one side is the three-way interaction Int(X, Y, Z), which defines the heterogeneous association between X and Y across the levels of Z; the other side is the partial association Par(X, Y | Z), which defines the homogeneous (uniform) association between X and Y across the levels of Z. Specifically, the three-way CMI term of (5) can be expressed as the sum of two orthogonal components

I(X, Y | Z) = Int(X, Y, Z) + Par(X, Y | Z).   (7)
We now illustrate the connection between (3) and (4), where (4) represents the expectation of the sample version of (3) except for a (normalizing) constant. The sum H(X) + H(Y) + H(Z) − H(X, Y, Z) represents the MI I(X, Y, Z), which in turn represents the total association (the two- and three-way λ terms) in (3) using I(X, Z) + I(Y, Z) + I(X, Y | Z) of (5). The last statement and (7) explain that the sum of the XY and XYZ effects in (3) is exactly presented by I(X, Y | Z) = Int(X, Y, Z) + Par(X, Y | Z), where the two-way XY term in (3) actually represents the three-way term Par(X, Y | Z) [34]. This does not yield incorrect estimates because it is the unique three-way partial association term, which must match the last two-way XY term. In other words, the XY effect should be interpreted as the three-way partial association between X and Y given Z when the XZ and YZ effects are already in the model.
The sample analogs of the terms in (7), being the last summand in (6), satisfy the same Equation (7). In practice, the sample CMI Î(X, Y | Z) is the maximum likelihood estimate (MLE) for the test of conditional independence. The MLE of the interaction Int(X, Y, Z) is computed using iterative proportional fitting or the Newton–Raphson procedure [18]. The MLE of the partial association Par(X, Y | Z) can then be obtained by subtraction. Here, the Pythagorean law asserts that testing for conditional independence using the MLE Î(X, Y | Z) can be decomposed as a two-step LR test, where the usual test size α = 0.05 is replaced by two separate sizes α1 and α2 that together account for the overall size by the geometry of orthogonal decomposition. Because all three tests logically include testing for conditional independence, the law stipulates that the hypothesis of no interaction should be tested using the estimate of Int(X, Y, Z) with (I − 1)(J − 1)(K − 1) dfs against the size α1 (more stringent than using α), and the hypothesis of uniform (partial) association should be tested using the estimate of Par(X, Y | Z) with (I − 1)(J − 1) dfs against the size α2 [28].
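In practice, the two components of (7) can be obtained from nested Poisson GLM deviances: the deviance of the no-three-factor-interaction model {XY, XZ, YZ} estimates the interaction component, the deviance of the conditional-independence model {XZ, YZ} estimates the CMI, and the partial association follows by subtraction. The sketch below (ours) illustrates the two-step test on a hypothetical 2 × 2 × 2 table; the cell counts and the split of sizes α1 and α2 are illustrative assumptions.

```python
# Sketch: two-step LR test for conditional independence via the decomposition (7).
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chi2

# Hypothetical 2x2x2 cell counts (illustrative only).
cells = pd.DataFrame([(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)],
                     columns=["X", "Y", "Z"])
cells["count"] = [60, 22, 25, 55, 40, 35, 30, 45]
fam = sm.families.Poisson()

m_ci  = smf.glm("count ~ C(X)*C(Z) + C(Y)*C(Z)", cells, family=fam).fit()    # {XZ, YZ}
m_noi = smf.glm("count ~ (C(X) + C(Y) + C(Z))**2", cells, family=fam).fit()  # {XY, XZ, YZ}

cmi = m_ci.deviance          # 2n * CMI estimate of I(X, Y | Z)
interaction = m_noi.deviance # 2n * interaction estimate Int(X, Y, Z)
partial = cmi - interaction  # 2n * partial association estimate Par(X, Y | Z)

alpha1, alpha2 = 0.01, 0.04  # illustrative split of the overall test size
p_int = chi2.sf(interaction, m_noi.df_resid)              # (I-1)(J-1)(K-1) dfs
p_par = chi2.sf(partial, m_ci.df_resid - m_noi.df_resid)  # (I-1)(J-1) dfs
print(f"Int: G2={interaction:.3f}, p={p_int:.4f} (tested at {alpha1})")
print(f"Par: G2={partial:.3f}, p={p_par:.4f} (tested at {alpha2})")
```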
Formulas (5) to (7) can be straightforwardly extended to multi-way tables. Suppose that a four-way table (T, X, Y, Z) is available; the updated MI identity is

I(T, X, Y, Z) = I(X, Z) + I(Y, Z) + I(X, Y | Z) + [I(T, Z) + I(T, Y | Z) + I(T, X | {Y, Z})].   (8)

Similar to the connection between (3) and (5), the difference in information decomposition between the usual log-linear model of all (six) two-way effects {XZ, YZ, TZ, XY, TY, TX} and the MI association model in (8) is now explained. When the first three two-way terms {XZ, YZ, TZ}, or the three two-way MI components in (8), are already in the model, the terms XY and TY actually represent the partial association components Par(X, Y | Z) and Par(T, Y | Z) within the CMI terms I(X, Y | Z) and I(T, Y | Z) by (7), respectively. The last two-way term TX actually yields the component Par(T, X | {Y, Z}) within the term I(T, X | {Y, Z}) in (8). These illustrations show that the sum of the three interaction effects Int(X, Y, Z), Int(T, Y, Z), and Int(T, X, {Y, Z}) is the LR deviance statistic for testing the goodness of fit of the model with six two-way effects. If the interaction effects are significant, the corresponding graphical model may not be expressed as a union of non-overlapping cliques; such models were termed indecomposable by [35] and [36].
Meanwhile, the last three summands in the bracket of (8) account for the association between T and {X, Y, Z}. They can be used to explain the (logistic) regression of T on {X, Y, Z} when T is the response variable of interest. For instance, if the null hypothesis of vanishing CMI "I(T, X | {Y, Z}) = 0" is tested and accepted, X can be removed from the regression model. While information-equivalent MI identities to (8) can be derived by interchanging the explanatory variables [23], the MI approach to building a log-linear or logistic model must be validated by testing for a valid MI identity.
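For the regression reading of (8), a vanishing CMI Î(T, X | {Y, Z}) indicates that X adds nothing to the explanation of T given {Y, Z}. The following generic sketch (ours) estimates such a CMI from raw categorical records and reports its chi-square p-value; the data frame, column names, and helper function are hypothetical.

```python
# Sketch: test I(T, X | {Y, Z}) = 0 from raw categorical records (hypothetical columns).
import numpy as np
import pandas as pd
from scipy.stats import chi2

def cmi_test(df, target, candidate, given):
    """Scaled CMI 2n*I_hat(target, candidate | given) with its chi-square p-value."""
    g2, dof = 0.0, 0
    for _, block in df.groupby(list(given)):
        tab = pd.crosstab(block[target], block[candidate]).to_numpy(float)
        if tab.shape[0] < 2 or tab.shape[1] < 2:
            continue                     # stratum carries no testable association
        p = tab / tab.sum()
        expect = np.outer(p.sum(axis=1), p.sum(axis=0))
        mask = p > 0
        g2 += 2 * tab.sum() * (p[mask] * np.log(p[mask] / expect[mask])).sum()
        dof += (tab.shape[0] - 1) * (tab.shape[1] - 1)
    return g2, dof, chi2.sf(g2, dof)

# Illustrative data frame with binary target T and regressors X, Y, Z.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.integers(0, 2, size=(500, 4)), columns=["T", "X", "Y", "Z"])
print(cmi_test(df, "T", "X", ["Y", "Z"]))   # a large p-value suggests X can be dropped
```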
3. MI Log-Linear Modelling
It is convenient to illustrate MI log-linear analysis using a contingency table of finitely many variables. The basic idea is to delete as many insignificant higher-order CMI and interaction effects as possible, so that the fewest significant lower-order effects are kept in the model. We emphasize that non-unique concise log-linear models can often be obtained from different sets of selected variables.
In this section, a clinical dataset is used to illustrate the MI approach for constructing a log-linear model. In the dataset, brain computed tomography (CT) scans were available from 354 patients who received a diagnosis of ischemic stroke in the middle cerebral artery (MCA) territory and from 1518 control patients who did not have any ischemic stroke symptoms at the time of the CT scans [37]. The data were collected from 2006 through 2008 for a research project on the association between ischemic stroke and its risk factors, among which the calcification burden in the MCA territory was of main concern when specifying a logistic or log-linear model. The target variable was the status of MCA stroke patients (S: 1 = case; 0 = control), and the risk factors consisted of the calcification burden in the MCA territory (C: 1 = yes; 0 = no), age (A: 1 = age ≥ 60; 0 = age < 60), gender (G: 1 = male; 0 = female), hypertension (H: 1 = SBP > 140 mm Hg or DBP > 90 mm Hg; 0 = otherwise), diabetes mellitus (D: 1 = fasting serum glucose level > 7.8 mmol/L; 0 = otherwise), and smoking (M: 1 = more than 1 cigarette/day; 0 = none). The risk factors were coded as 1 for presence and 0 for absence, except for age and gender.
Our goal is to assess parsimonious models for the interpretation of the stroke dataset. We proceed by deleting the most insignificant higher-way MI and CMI terms between the variables. Among the seven factors, factor C (calcification burden) yields the least significant MI measures with the other six factors. Putting aside factor C, the next factor M has the least significant MI, followed by factor G. The remaining four factors {S, A, D, H} are then significantly associated with each other. As an analog of (8), an information identity between the seven factors can be expressed as

I(C, M, G, S, A, D, H) = I(C, {M, G, S, A, D, H}) + I(M, {G, S, A, D, H}) + I(G, {S, A, D, H}) + I(S, {A, D, H}) + I(A, D, H).   (9)
Note that all summands on the right-hand side of (9) are orthogonal to each other. The first summand is the MI between C and the other six factors, which can be decomposed into the sum of six orthogonal terms below. Each asterisk "*" indicates an insignificant chi-square test statistic at level α = 0.05, and the values in the parentheses following each MI term are the LR (goodness-of-fit) statistic, degrees of freedom, and chi-square p-value, respectively.
Equation (10) can be summarized as
In (10) and (11), it was found that the interaction Int(C, A, S) (8.234, df = 1, p = 0.004) was a significant component of Î(C, A|S), such that the association between the calcification burden in the MCA territory and age differed between the case and control groups. Moreover, Î(C, A|S) (23.215, df = 1) was smaller than Î(C, A) (46.316, df = 1), so that Î(C, {S, A}) was less than the sum of Î(C, A) and Î(C, S).
By analogy, the factor M in (9) yields a similar decomposition as follows:
Here, the CMI estimate Î(M, {A, D, H}|{G, S}) is decomposed into the first three insignificant terms on the right-hand side of (12). The last summand Î(M, {G, S}) is equal to the sum of the two significant terms Î(M, G) (314.21, df = 1) and Î(M, S|G) (16.935, df = 2). The latter is equal to the sum Int(M, S, G) (7.224, df = 1) + Par(M, S|G) (9.211, df = 1), each of which is smaller than Î(M, S) (12.903, df = 1). This indicates that, in a final valid log-linear model, the three-way CMI term Î(M, S|G) can possibly be replaced by the two-way MI term Î(M, S) through compensation of information from other terms.
Next, the factor G in (9) provides the following decomposition:
The summand Î(G, D|A) in (13) includes the significant interaction Int(G, D, A) (13.529, df = 1), which is close to Î(G, A) and greater than Î(G, D) (8.742, df = 1).
Finally, the last two terms in (9) consist of the following significant components
and
Equations (14) and (15) include the significant interaction terms Int(S, D, {A, H}) (19.690, df = 3), Int(S, A, H) (13.543, df = 1, p < 0.001), and Int(A, D, H) (16.797, df = 1). Here, Î(S, D|{A, H}) is slightly less than Î(S, D) (24.08), Î(S, H|A) is less than Î(S, H) (105.425), and Î(D, A|H) is less than Î(D, A) (66.98). However, the significant interaction terms Int(S, A, H) and Int(A, D, H) may be indispensable, as illustrated below.
By collecting significant MI and interaction terms in (11) and (13)–(15), significant terms in (9) are included in the following decomposition
Among hierarchical log-linear models, the class of graphical models composed of nodes and edges has been attractive for its simple network representations [38]. Note that such graphs are constructed from models comprising mainly significant two-way effects (edges) between the variables (nodes), with possibly significant cliques (triangles). From the information identity (8), it is known that no more than six orthogonal two-way effects can be derived from the seven variables {C, A, S, M, G, D, H}. However, from among the two- and higher-way effects in Equations (14)–(16), it is found that the set of six two-way effects {CS, MG, AH, DH, SH, DA} with the greatest MI estimates could only yield an under-fit model. Suppose that a valid log-linear model is sought by inserting additional non-orthogonal two-way effects into the last model of six orthogonal effects. Then, the largest model of twelve significant two-way effects is the following null graphical model
which is still an under-fit model with LR deviance 171.577 (df = 108, p < 0.001). Consequently, the above analysis based on (16) indicates that a valid graphical model must include some three- or higher-way interaction effects.
By (16), it was shown that adding the orthogonal three-way effects {ADG, SAH, ADH} to the subset {CA, CS, MS, MG, SD} of model (17) would yield the valid log-linear model

LLM1: {CA, CS, MS, MG, SD, ADG, SAH, ADH},   (18)

with deviance 127.96 (df = 105, p = 0.063). The two-way effect AD is repeated in (18), and the duplication can be avoided by regrouping the terms in (16) to yield another valid model, LLM2 (19), with deviance 126.53 (df = 103, p = 0.058). Since LLM1 uses two fewer parameters than LLM2, it can be regarded as the best parsimonious model for the MCA stroke data by the proposed MI analysis. Consequently, the so-called graphical model LLM1 must include the three significant interaction terms {ADG, ADH, SAH} of (16). A modification of the standard diagram is recommended to plot the graphical model of LLM1, which shows three significant triangles (cliques) for three sets of three nodes (see Figure 1 below). This extends the construction and interpretation of graphical models beyond the existing literature, for example, [30,38].
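The deviance comparisons reported above can be reproduced with Poisson GLMs once the seven-way table is available. The sketch below (ours) assumes a patient-level file with binary columns corresponding to C, A, S, M, G, D, H, which is not distributed with this paper; in particular, the exact list of twelve two-way effects in (17) is not fully spelled out in the text, so the terms used for model (17) below are only illustrative.

```python
# Sketch: deviances of the null graphical model (17) and LLM1 (18) via Poisson GLMs.
# Assumes a patient-level file "stroke.csv" (hypothetical) with binary columns
# c, a, s, m, g, d, h corresponding to the paper's factors C, A, S, M, G, D, H.
import itertools
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chi2

cols = ["c", "a", "s", "m", "g", "d", "h"]
stroke = pd.read_csv("stroke.csv")                       # hypothetical file
counts = stroke.value_counts(cols).rename("n").reset_index()
# Include unobserved cells as zeros so the residual df matches the full 2^7 table.
grid = pd.DataFrame(list(itertools.product([0, 1], repeat=7)), columns=cols)
counts = grid.merge(counts, how="left").fillna({"n": 0})

fam = sm.families.Poisson()
# Null graphical model of twelve two-way effects, cf. (17) (illustrative term list).
f17 = ("n ~ c + a + s + m + g + d + h + c:s + m:g + a:h + d:h + s:h + d:a "
       "+ c:a + m:s + s:d + s:a + a:g + d:g")
# LLM1 of (18): two-way effects {CA, CS, MS, MG, SD} plus three-way ADG, SAH, ADH.
f18 = ("n ~ c + a + s + m + g + d + h + c:a + c:s + m:s + m:g + s:d "
       "+ a*d*g + s*a*h + a*d*h")
for name, f in [("model (17)", f17), ("LLM1 (18)", f18)]:
    mdl = smf.glm(f, counts, family=fam).fit()
    print(name, f"deviance={mdl.deviance:.2f}, df={mdl.df_resid}, "
                f"p={chi2.sf(mdl.deviance, mdl.df_resid):.3f}")
```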
To summarize, the above scheme for identifying the best parsimonious log-linear model LLM1 for the seven-way stroke data also exhibits the significant edges and cliques (triangles) of a graphical model. Because the sample size was large relative to the number of variables, it was unnecessary to use a penalty criterion as is commonly practiced in the literature.
4. MI Logistic Regression
As a member of the GLM family, the logistic model extends classical linear models to a binary (or discrete) response variable [39]. The covariates can be purely qualitative with nominal levels, or mixed categorical and numerical variables. Recalling Equation (8), given the target variable T and the regressors X, Y, and Z, a logistic model consisting of three significant main effects and the YZ interaction effect can be expressed as

logit P(T = 1 | x, y, z) = β0 + β1 x + β2 y + β3 z + β4 yz.   (20)

Model (20), being equivalent to the log-linear model {XYZ, TYZ, TX}, explains T using the composite of main and interaction effects for (Y, Z), in addition to the main effect X. The goodness of fit of model (20) is examined using the deviance estimate of Int(T, X, {Y, Z}) according to (8), where a significant Î(T, X|{Y, Z}) supports the main effect X. This ascertains correct main-effect parameter estimates while bypassing the model of purely main effects (corresponding to the log-linear model with all possible two-way effects plus one three-way effect), according to the discussion of the proper log-likelihood decomposition in (8). The MI approach to modeling logistic regression will be illustrated using the MCA stroke dataset.
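A minimal sketch (ours) of fitting a logistic model of the form (20) and checking its goodness of fit against the covariate-class-saturated model is given below; the data are simulated binary variables, so the coefficients and deviances are purely illustrative.

```python
# Sketch: logistic model (20) with main effects X, Y, Z and a YZ interaction.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chi2

# Hypothetical individual-level binary data (illustrative only).
rng = np.random.default_rng(2)
df = pd.DataFrame(rng.integers(0, 2, size=(800, 3)), columns=["X", "Y", "Z"])
logit_p = -0.5 + 0.8 * df.X + 0.6 * df.Y - 0.4 * df.Z + 0.7 * df.Y * df.Z
df["T"] = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

m20 = smf.glm("T ~ X + Y + Z + Y:Z", df, family=sm.families.Binomial()).fit()
print(m20.params)

# Goodness of fit against the saturated logistic model over the 2x2x2 covariate classes.
sat = smf.glm("T ~ C(X)*C(Y)*C(Z)", df, family=sm.families.Binomial()).fit()
g2 = m20.deviance - sat.deviance
dfree = m20.df_resid - sat.df_resid
print(f"GOF deviance = {g2:.3f}, df = {dfree}, p = {chi2.sf(g2, dfree):.3f}")
```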
Consider the stroke dataset where the target variable is the stroke status S (1 for cases and 0 for controls). The parsimonious log-linear model LLM1 of (18) presented the essential association between S and the six risk factors. In particular, it was shown that the factor G (gender) is not needed for explaining S, because it is unrelated to S in model LLM1. Indeed, the first-step MI analysis confirmed this observation:
The first summand in (21) confirmed that factor G was dispensable, as expected. Next, the second summand of (21) indicated that the factor M (smoking) was marginally significant conditional on the other four factors. A clinical concern about the calcification burden in the brain MCA territory is whether M is related to ischemic stroke. Under this scenario, the relevant CMI terms can be expressed as
which confirms the significant partial association between S and M conditional on the factors {H, D, C, A}. Given the effect of M on S in Equation (22), the MI between S and the five risk factors {C, M, H, A, D} can be rearranged to yield the decomposed MI identity in Table 1 below.
By deleting two insignificant higher-order interactions, the five main-effect (partial association) terms {M, C, H, A, D} and the interaction effect Int(S, H, {D, A}) remain in the MI identity (23), as shown in Table 1. Although this four-way interaction term is significant, it was not selected into a valid model, because it is smaller than the sum of the two three-way estimates Int(S, A, H) and Int(S, D, H) and can be replaced by these two three-way effects. In fact, Table 1 leads to the MI logistic model with significant main and interaction effects as
Model (24) was the most concise logistic model with the fewest significant effects, given the LR deviance 26.651 (df = 23, p = 0.271) in Table 1. Remarkably, the minimum-AIC model using the five factors {C, M, H, A, D} can be found by including more interaction effects among {MA, MD, MH} in the standard procedure (cf. the SAS CATMOD or SPSS logistic procedure). The minimum-AIC model using significant parameter estimates gives
with LR deviance 18.97 (df = 21, p = 0.587). It is worth noting that the estimate 0.495 of the parameter M in (24) is closer to the raw-data logarithmic odds ratio 0.487 (between S and M) than the estimate 1.180 of M in the less concise AIC model in (25). This indicates that the effect of M on S was confounded by the extra interaction effects {MA, MD} in (25).
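The raw-data log odds ratio quoted in the last sentence is computed from the marginal 2 × 2 table of S and M. A minimal sketch (ours) is shown below; the cell counts are placeholders, not the actual stroke data.

```python
# Sketch: marginal log odds ratio between stroke status S and smoking M.
import numpy as np

# Placeholder 2x2 counts [[n_S0M0, n_S0M1], [n_S1M0, n_S1M1]]; not the actual data.
tab = np.array([[1000.0, 518.0],
                [220.0, 134.0]])
log_or = np.log(tab[0, 0] * tab[1, 1] / (tab[0, 1] * tab[1, 0]))
print(f"raw-data log odds ratio = {log_or:.3f}")  # compare with the coefficient of M in (24)
```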
6. An Empirical Study
The Statlog (German Banking Credit Data) dataset in the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data), accessed on 27 April 2023) lists the banking data of 21 discrete attributes (categorical and discretized numerical variables) of 1000 customers (cf. [7]). This dataset was extracted from the original data of 1101 customers, in which three numerical attributes {A2, A5, A13} were discretized into {10, 9, 5} levels, respectively [40].
The binary credibility variable of interest is denoted by T (T = 1 for creditworthy, T = 2 for not creditworthy). The other 20 attributes, denoted by the set X, were available to explain T (see Appendix A for the list of 21 attributes). Fahrmeir et al. [40] selected, for the discretized data, a main-effect logistic regression model with five predictors {A1, A2, A3, A4, A9} (defined as model FT1 in Table 2 below), which excluded "A5, credit amount" and "A13, age" from a previously selected subset of seven attributes {A1, A2, A3, A4, A5, A9, A13}. Fahrmeir and Tutz [7] further discussed several variable selection criteria, namely the Wald test, AIC, BIC, and the generalized score test, and concluded the updated analysis with the same model FT1 of five main effects.
Consider the MI analysis discussed in Section 5. Steps 1 to 4 were used to analyze the 1994 raw-level data with the binary target T and 20 discrete attributes. This yielded two MI logistic models, denoted by MI1: {A1, A3, A12} and MI2: {A1, A3 × A14}, with LR deviance p-values 0.13 and 0.08, respectively, as listed in Table 2. For the Statlog raw-level data, Table 3 records the MI decomposition of the best parsimonious model MI1 without significant interaction effects.
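The selection behind models MI1 and MI2 amounts to repeatedly testing conditional MI terms of the form Î(A, T | selected set) and adding the attribute with the most significant contribution. The following sketch (ours) of such a forward CMI screening assumes a data frame with the target column T and the discrete attribute columns; it is only a simplified rendering of the kind of stepwise procedure outlined in Section 5, not the authors' exact Steps 1 to 4.

```python
# Sketch: forward screening of credit attributes by conditional MI (data frame assumed).
import numpy as np
import pandas as pd
from scipy.stats import chi2

def cmi_pvalue(df, target, cand, given):
    """p-value of the scaled CMI 2n*I_hat(target, cand | given)."""
    g2, dof = 0.0, 0
    groups = [("", df)] if not given else df.groupby(list(given))
    for _, block in groups:
        tab = pd.crosstab(block[target], block[cand]).to_numpy(float)
        if min(tab.shape) < 2:
            continue
        p = tab / tab.sum()
        e = np.outer(p.sum(1), p.sum(0))
        g2 += 2 * tab.sum() * (p[p > 0] * np.log(p[p > 0] / e[p > 0])).sum()
        dof += (tab.shape[0] - 1) * (tab.shape[1] - 1)
    return chi2.sf(g2, dof) if dof else 1.0

credit = pd.read_csv("german_credit.csv")   # hypothetical file: columns T, A1, ..., A20
selected, candidates = [], [c for c in credit.columns if c != "T"]
while candidates:
    pvals = {a: cmi_pvalue(credit, "T", a, selected) for a in candidates}
    best = min(pvals, key=pvals.get)
    if pvals[best] > 0.05:                  # stop when no attribute adds information
        break
    selected.append(best)
    candidates.remove(best)
print("selected predictors:", selected)
```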
In Table 2, Fahrmeir and Tutz [7] obtained the model FT1: {A1, A2*, A3, A4*, A9*} for the raw-level data, which included the main-effect (subset) model FT2: {A1, A3, A9*}. The MI analysis showed that both models were over-fit, because each of the three attributes {A2, A4, A9} was dispensable (marked with an asterisk "*") given the set of predictors {A1, A3}; that is, the estimates Î(A9, T|A1, A3), Î(A2, T|A1, A3), and Î(A4, T|A1, A3) were insignificant. Meanwhile, the MI analysis yielded models MI1 and MI2, as shown in Table 2, using the raw-level data. The partial and interaction effects of MI1 are decomposed in Table 3. In Table 2, the model SIS1: {A1, A2*, A3} was the only model selected by the SIS method, which will be explained in Section 7. In the rest of this study, the AIC estimates given in Table 2, Table 4, and Table 5 are recorded without further discussion.
7. Comparison with Existing Methods
Researchers have recently focused on methods for the selection of models and variables to deal with data featuring high-dimensional attributes. The least absolute shrinkage and selection operator, in short the Lasso method [15], has attracted particular attention in recent years. Examples include the smoothly clipped absolute deviation (SCAD) penalty [13], sure independence screening (SIS) for high-dimensional feature spaces [41], SIS for GLMs [42,43], the elastic net [44], the adaptive Lasso [16], the Dantzig selector [12], and feature screening with categorical data [45]. In the following, we outline a comparison of the SIS method and the proposed MI scheme in the analysis of the banking credit data. The SIS computing program (http://CRAN.R-project.org/package=SIS, accessed on 27 April 2023) yielded the unique model SIS1: {A1, A2*, A3} in Table 2. Note that the result exhibited over-fitting, because discretizing attribute A2 into 10 levels rendered the CMI estimate Î(A2, T|A1, A3) insignificant. The same model, labelled SIS3 in Table 5, was marginally valid (p-value 0.049) when A2 was discretized as a three-level attribute.
Chernoff et al. [46] proposed an innovative nonparametric approach to variable and model selection for GLMs, referred to as the "I-scores" method. Their method begins with the partitioning of the explanatory attributes versus a target (response) variable in the data. In other words, a family of weighted mean squared differences of the target values (i.e., I-scores) is computed across various attribute partitions. Attributes of influence among the partitions are identified as the factors which, if excluded (or included), would decrease (or increase) a large proportion of the I-scores. This means that influential attributes can be selected by evaluating sample histograms of the various I-scores across partitions. The authors recommended two working rules: (1) a hard-thresholding (HT) rule for the selection of a predictor (effect) that induces a sufficiently large change in the I-score, and (2) a soft-thresholding (ST) rule for a predictor that induces a moderate change in the I-score. In practice, the attributes are partitioned into finite discrete sample spaces, such that the I-scores can be computed for an effective comparison. Discretizing each continuous attribute to a two- or three-level factor is a convenient feature of the I-score computing program at https://github.com/adelinelo/iscreen (accessed on 27 April 2023) [47].
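For orientation, one common form of the influence score is the weighted sum of squared deviations of the partition-cell means of the target from its grand mean; the exact standardization used in [46,47] may differ, so the following Python rendering (ours) is only illustrative.

```python
# Sketch: an influence (I-) score for one subset of discrete attributes (illustrative form).
import numpy as np
import pandas as pd

def i_score(df, target, subset):
    """Weighted sum of squared deviations of partition-cell means from the grand mean."""
    y_bar = df[target].mean()
    score = 0.0
    for _, cell in df.groupby(list(subset)):
        n_j = len(cell)
        score += (n_j ** 2) * (cell[target].mean() - y_bar) ** 2
    return score

# Hypothetical binary data: target T plus three candidate attributes.
rng = np.random.default_rng(3)
df = pd.DataFrame(rng.integers(0, 2, size=(400, 4)), columns=["T", "A1", "A2", "A3"])
for s in (["A1"], ["A1", "A2"], ["A1", "A2", "A3"]):
    print(s, round(i_score(df, "T", s), 2))
```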
We performed a comparison between the SIS, I-scores, and proposed MI methods using two modified forms of the numerical attributes A2, A5, and A13 in the original banking credit data. The first form scaled these numerical attributes as two-level attributes (e.g., {0, 1}), whereas the second form scaled them as three-level attributes (e.g., {0, 1, 2}), as shown in Appendix A. Table 4 and Table 5, respectively, present the predictors and models selected from the dataset using these two modified forms.
As shown in Table 4, the SIS method selected the main-effect model SIS2: {A1, A2, A3, A6*, A12*} for the first-form modified data. Model SIS2 was over-fit, based on the fact that the variables {A6, A12} were dispensable given {A1, A2, A3}. In other words, the CMI estimates Î(A6, T|A1, A2, A3) and Î(A12, T|A1, A2, A3) were insignificant.
Results of the I-score method are listed in two parts. The first set of results includes models obtained using the hard-thresholding (HT) rule, by which the I-scores of each influential variable were set at ≥ 11.0 as a high threshold in the I-score histogram. This yielded models HT1 and HT2. Note that, as a singleton {A1}, model HT1 was clearly under-fit, based on the observation that all other models in Table 4 included {A1} as well as other factors to enhance the data interpretation. Model HT2: {A1, A2, A14*, A2 × A14*} was over-fit, based on the estimate Î(A14, T|A1, A2) = 25.39 (df = 16, p = 0.063). The second set of results provided soft-thresholding (ST) models, which allowed the selection of more attributes and parameter effects by imposing an inequality constraint (I-scores ≥ 7.0) that is weaker than the HT rule. This yielded models ST1 and ST2, as presented in Table 4. Both models were over-fit, due to their use of at least four attributes and interactions (as with model SIS2).
MI analysis yielded two valid models for the first-form modified data: MI3: {A1, A2, A12} and MI4: {A1, A2 × A3}. It should be noted that the modified two-level attribute A2 (duration of the individual banking credit account) was identified as useful and was, therefore, more likely to be selected than its original 10-level version, which did not appear in Table 2.
Table 5 lists the models selected for the second-form modified data using the three-level attributes A2, A5, and A13. The unique SIS model SIS3: {A1, A2, A3}, which matched the over-fit model SIS1 (Table 2), was deemed marginally valid after modifying attribute A2 (with ten levels in Table 3 and two levels in Table 4) into a three-level attribute. This demonstrates that the poor fit of model SIS1 can be rectified using the three-level attribute A2. The barely valid model MI5: {A1, A2, A12} matched the valid model MI3 (see Table 4); therefore, the corresponding interpretation did not vary when A2 was discretized as a two- or three-level attribute. It should be noted that model MI6: {A1, A3, A5} was valid when attribute A5 (individual credit amount) was replaced by a three-level attribute, considering that A5 had not previously occurred in a valid model.
The I-score model HT3: {A1, A2} proved valid (using two predictors); however, it was clearly under-fit when compared to models SIS3 and MI5. Models HT4, ST3, and ST4 were clearly over-fit (using four or more predictors with interaction parameters) when compared to the previous analysis of the models in Table 2.
It should be noted that the numerical attribute A5 (individual credit amount) was found to be useful (together with the two other attributes A1 and A3) in explaining the target credibility only when applied via the discretized three-level attribute A5 in model MI6. This raises the question of whether model MI6 maintains its validity when A5 is returned to its original numerical status. The answer is yes: the same main-effect logistic model {A1, A3, A5 (numerical)} remains valid, with a deviance of 1021.3 (df = 977, p = 0.158). The same question can be posed for the two other models: MI3 (MI5): {A1, A2, A12} and MI4 (SIS3): {A1, A2, A3} (or {A1, A2 × A3}). We found that, when using the original numerical A2, model {A1, A2, A3} was indeed valid, with a deviance of 259.37 (df = 220, p = 0.354). By contrast, model {A1, A2, A12} had to be revised as {A1, A2, A12, A1 × A2, A1 × A12, A2 × A12}, which required the inclusion of three two-way interaction effects to yield a valid model with a deviance of 234.82 (df = 201, p = 0.051).
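The check described in this paragraph, refitting a selected model with the original numerical attribute in place of its discretized version and inspecting the LR deviance, can be scripted as in the sketch below (ours); the file name, the 0/1 recoding of the target, and the particular three-level cut of A5 are assumptions.

```python
# Sketch: compare logistic fits with A5 numerical vs. discretized (column names assumed).
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chi2

credit = pd.read_csv("german_credit.csv")          # hypothetical file
credit["y"] = (credit["T"] == 1).astype(int)       # assumed coding: 1 = creditworthy
credit["A5_3lev"] = pd.qcut(credit["A5"], 3, labels=False)   # one possible 3-level cut

for label, formula in [("A5 numerical", "y ~ C(A1) + C(A3) + A5"),
                       ("A5 three-level", "y ~ C(A1) + C(A3) + C(A5_3lev)")]:
    m = smf.glm(formula, credit, family=sm.families.Binomial()).fit()
    # Note: with ungrouped data this residual-deviance p-value is only a rough guide.
    print(label, f"deviance={m.deviance:.1f}, df={m.df_resid}, "
                 f"p={chi2.sf(m.deviance, m.df_resid):.3f}")
```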
Table 2, Table 4, and Table 5 present a summary of the selected models. Logistic regression analysis and the SIS method (Lasso type) were both shown to select a unique (i.e., best) model; however, the I-score method (using either hard or soft thresholding) and the proposed MI method often selected more than one valid model. It should be noted that the attributes {A1, A2, A3} were consistently selected in the SIS models as well as in models MI4, ST1, and ST3. Note, however, that among these models, only MI4 and SIS3 were deemed valid.
From the five-variable main-effect model of the original analyses by [7] and [40], MI analysis consistently selected the variables {A1, A3}, while the variables {A2, A4, A9} were deemed dispensable. As shown in Table 2, the 10-level variable A2 was deemed dispensable in model SIS1: {A1, A2*, A3}. MI analysis confirmed that valid models could be obtained by replacing A2 with A12 (MI1) or A14 (MI2). Indeed, attributes A1 (status of bank account) and (two- or three-level) A2 (length of credit account) were selected as useful predictors by all three methods in Table 4 and Table 5.
When applied to the banking credit data with a moderate sample size (1000) and a moderate number of discrete explanatory attributes (20), MI analysis selected several concise logistic models by which to obtain valid interpretations of the data. Models MI1 and MI2 proved equally useful (Table 2), as did MI3 and MI4 (Table 4). MI3 and MI5 described the same model; however, they differed in their use of attribute A2 (two-level or three-level, respectively). MI analysis verified that attribute A5 in model MI6 was a useful predictor as a discrete three-level or a numerical variable. Note that the discretization of numerical attributes in MI analysis makes it possible to obtain valid models using the same predictors in a discrete or numerical form. To summarize, MI analysis yielded parsimonious logistic models with predictors capable of explaining the creditability of the banking credit data.