2.2.1. Interpreting the Dataset
Our early work with PDD focused on interpretable pattern discovery. In this section, we present how PDD works for interpretability. As detailed in
Appendix A, the processed dataset of 114 features is used to compare the supervised learning models. This dataset includes features related to patient information, physical assessments, and temporal features capturing two observations (i.e., time points 2 and 14) of six physical measurements. Statistics such as the mean, standard deviation, minimum, maximum, kurtosis, and skewness are calculated for each observation. In the pattern discovery process, we aim to discover associations between feature values. Given that many of the statistical features convey the same information, we retained only the standard deviation for each observation, as it best describes the variability of the data, which simplifies the analysis and enhances interpretability. To avoid the complexity introduced by an overwhelming number of features (attributes), we thus retained a subset of 27 features describing patient demographics and physical assessments: 15 features related to physical measurements and patient information (i.e., GCS, systolic blood pressure, diastolic blood pressure, mean blood pressure, pulse pressure, heart rate, respiration rate, SpO2, age, gender, ethnicity, discharge status, admission weight, discharge weight, and height); two observation features for each of the six physical measurements, totaling 12 features; and one label feature. Consequently, the dimension of the dataset is 10,743 by 28.
Because numerical features have unlimited degrees of freedom, correlating them with the target variable and interpreting the resulting associations is challenging. The first step of the PDD process is therefore to discretize the numerical features into event-based or discrete categories according to clinical standards, as Table 1 shows. Other numerical features without clear clinical significance, such as age and admission weight, were discretized into intervals using the Equal-Frequency method [24]. This method ensures an equal distribution of data points (i.e., records) within each interval: the numerical data are first sorted in ascending order and then divided into three intervals (bins), each containing an equal number of data points. In information-theoretic terms, this method indirectly maximizes the informational content of the data. It also handles skewed distributions and yields interpretable categories that are easier to understand than precise numerical values. By applying this discretization approach, all features are converted to categorical ones.
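As a concrete illustration, equal-frequency binning of a numerical feature into three intervals can be sketched as follows; the feature values and bin labels are hypothetical, and `pandas.qcut` is one common implementation of quantile-based binning, not necessarily the one used in this study:

```python
import pandas as pd

# Hypothetical ages for nine records; the values and bin labels are
# illustrative only.
age = pd.Series([23, 35, 41, 47, 52, 58, 64, 71, 80])

# Equal-frequency (quantile) binning into three intervals, each holding
# the same number of records.
bins = pd.qcut(age, q=3, labels=["low", "mid", "high"])

print(bins.value_counts().to_dict())  # each of the three bins holds 3 records
```

Because the bin edges are taken from the empirical quantiles, each interval covers one third of the records regardless of how skewed the underlying distribution is.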
Then, on the discretized dataset, using PDD [16], we construct a statistical residual matrix (SR-Matrix) to account for the statistical strength of the associations among feature values. In pattern discovery, the term “attribute” is used instead of “feature”, so “attribute value” (AV) will be used subsequently. Since the meaning of the attribute values (AVs) and their class labels is implicit, the discovery of a statistically significant association of the AVs is unaffected by prior knowledge or confounding factors. To evaluate the association between each pair of AVs in the SR-Matrix, we calculate the adjusted standardized residual, which represents the statistical weight of the association between distinct AV pairs. For instance, let x denote that attribute A takes the value H, and let y denote that attribute B takes the value L. The adjusted standardized residual of the association between x and y, denoted sr(x, y), is calculated in Equation (1):

sr(x, y) = (o(x, y) − e(x, y)) / sqrt( e(x, y) · (1 − n_x/N) · (1 − n_y/N) )    (1)
where n_x and n_y represent the number of occurrences of each attribute value; o(x, y) is the total number of co-occurrences of the two attribute values; e(x, y) = n_x · n_y / N refers to the expected frequency of co-occurrences of the two attribute values; and N is the total number of entities. Further details on the calculation are provided in Appendix B.
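Under the definitions above, the residual computation can be sketched as follows; the function name and the illustrative counts are ours, not part of PDD's implementation:

```python
import math

def adjusted_standardized_residual(o_xy, n_x, n_y, N):
    """Adjusted standardized residual of the co-occurrence of two
    attribute values x and y.

    o_xy : observed co-occurrences of x and y
    n_x, n_y : occurrences of x and of y individually
    N : total number of entities (records)
    """
    e_xy = n_x * n_y / N                           # expected co-occurrences
    variance = e_xy * (1 - n_x / N) * (1 - n_y / N)
    return (o_xy - e_xy) / math.sqrt(variance)

# Illustrative counts (hypothetical): x and y co-occur 40 times in 100 records.
sr = adjusted_standardized_residual(o_xy=40, n_x=50, n_y=60, N=100)
print(round(sr, 2))  # 4.08 — |sr| > 1.96, significant at the 95% level
```

Here the expected count is 30 and the residual exceeds the 1.96 threshold, so this AV pair would be flagged as statistically significant.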
The key idea of pattern discovery and disentanglement is briefly described below. We assume that certain attribute values (AVs) or associations co-occur in samples due to certain primary causes (referred to as primary sources). Other significant or insignificant AV associations may also occur in the samples, entangled with those originating from the primary sources. Hence, the goal of PDD is to discover statistically significant association patterns from AV groups on a disentangled principal component (PC) and then cluster each AV group into subgroups, denoting them as Disentangled Space Units (DSUs). These DSUs are functionally independent; the AVs within a DSU are statistically connected within the same group but not associated with other groups.
To achieve this goal, PDD first applies a linear transformation, Principal Component Analysis (PCA), to decompose the SR-Matrix into Principal Components (PCs). Each PC is functionally independent, capturing associations uncorrelated with those captured by other PCs. To look for AVs with statistically significant associations with other AVs on the same PC, PDD reprojects each PC back onto an SR-Matrix to generate a Reconstructed SR-Matrix (RSR-Matrix) for each distinct PC. If the maximum residual between a pair of attribute values (AVs) within an RSR-Matrix exceeds a statistical threshold, such as 1.96 for the 95% confidence level, the association is considered statistically significant. Notably, the associations identified within each RSR-Matrix (or PC) remain functionally independent of those in other RSR-Matrices (or PCs). Then, on the PCs with statistically significant associations, we obtain AV groups whose AVs are statistically connected. From the entities containing these AVs, we use a hierarchical clustering algorithm to obtain entity clusters containing AV subgroups whose AVs are statistically connected within, but not outside of, that subgroup. These are the DSUs originating from distinct primary sources.
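The decomposition-and-reprojection step can be sketched as follows, assuming a small symmetric SR-Matrix with hypothetical residual values; this illustrates the idea only and is not PDD's actual implementation:

```python
import numpy as np

# Hypothetical symmetric SR-Matrix over four attribute values
# (illustrative residual values only).
SR = np.array([
    [ 0.0,  4.8, -3.9,  0.4],
    [ 4.8,  0.0, -4.2,  0.1],
    [-3.9, -4.2,  0.0, -0.3],
    [ 0.4,  0.1, -0.3,  0.0],
])

# Principal component decomposition of the symmetric SR-Matrix.
eigvals, eigvecs = np.linalg.eigh(SR)

THRESHOLD = 1.96  # residual threshold for the 95% confidence level

significant_pcs = []
for k in range(len(eigvals)):
    v = eigvecs[:, k]
    # Reproject PC k back onto a Reconstructed SR-Matrix (rank-1 term).
    RSR_k = eigvals[k] * np.outer(v, v)
    # Retain the PC only if some reconstructed residual is significant.
    if np.abs(RSR_k).max() > THRESHOLD:
        significant_pcs.append(k)

print(significant_pcs)
```

Because the rank-1 terms sum back to the original SR-Matrix, each retained RSR-Matrix isolates one functionally independent slice of the associations.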
Figure 1 illustrates the AV association disentanglement concept. After principal component decomposition of the SR-Matrix, PCs and their corresponding RSR-Matrices are obtained. Only RSR-Matrices containing SR values that exceed the statistical significance threshold (i.e., 1.96), together with their corresponding PCs, are retained. For each retained PC, three groups of projected AVs can be identified along its axis, each showing a different degree of statistical association.
Those close to the origin are not statistically significant and thus are not associated with distinct groups/classes (marked as (a) in Figure 1); they lie close to the origin because all their coordinates (SRs) are low and insignificant. The remaining AVs project to one extremal end of the axis or to the opposite extremal end (both marked as (b)). The AV groups or subgroups in (b), if their AVs are statistically connected within the group but disconnected from other groups, may be associated with distinct sources (i.e., classes) (marked as (c)).
As a result, two AV groups at opposite extremes are discovered. Each AV within such a group is statistically linked to at least one other AV in the group, and none is statistically connected to AVs in other groups. Furthermore, some AV subgroups may occur only on subsets of entities uncorrelated with other subgroups. Hence, to achieve a more detailed separation, each AV group is divided into several subgroups based on their appearance in entity groups, using a similarity measure defined by the overlap of the entities each AV covers. AV subgroups found in different entity groups can then be identified. We denote such an AV subgroup by a three-digit code [#PC, #Group, #SubGroup] and refer to it as a Disentangled Space Unit (DSU). We hypothesize that these DSUs originate from distinct functional sources.
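The entity-overlap separation can be illustrated with a simplified sketch: a greedy grouping of AVs by the Jaccard overlap of the records they cover. The AV names, coverage sets, and threshold are hypothetical, and PDD's actual hierarchical procedure is more elaborate:

```python
# Hypothetical entity coverage for four AVs in one AV group: the set of
# record IDs each AV appears in (illustrative data only).
coverage = {
    "GCS=low":       {1, 2, 3, 4},
    "SpO2=low":      {1, 2, 3, 5},
    "HR=high":       {7, 8, 9},
    "RespRate=high": {7, 8, 10},
}

def jaccard(a, b):
    # Overlap of the entity sets covered by two AVs.
    return len(a & b) / len(a | b)

# Greedy grouping: an AV joins an existing subgroup when the overlap of
# the entities it covers with some member is high enough (the threshold
# is arbitrary here).
THRESHOLD = 0.3
subgroups = []
for av, entities in coverage.items():
    for sg in subgroups:
        if any(jaccard(entities, coverage[other]) >= THRESHOLD for other in sg):
            sg.append(av)
            break
    else:
        subgroups.append([av])

print(subgroups)  # [['GCS=low', 'SpO2=low'], ['HR=high', 'RespRate=high']]
```

The two resulting subgroups cover disjoint entity sets, mirroring how DSUs capture AV associations that occur on different subsets of records.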
Each subgroup denoted by a DSU therefore contains a set of AVs, which are referred to as pattern candidates. We then developed a pattern discovery algorithm to grow high-order patterns, called comprehensive patterns, from these pattern candidates. In the end, a set of high-order comprehensive patterns is generated within each DSU, all associated with the same distinct source.
The interpretable output of PDD is organized in a PDD Knowledge Base. This framework is divided into three parts: the Knowledge Space, the Pattern Space, and the Data Space. Firstly, the Knowledge Space lists the disentangled AV subgroups, each referred to as a Disentangled Space Unit (DSU) and denoted by a three-digit code [#PC, #Group, #SubGroup] shown in three columns to indicate the different levels of grouping, linking them to the patterns discovered by PDD on the records. Secondly, the Pattern Space displays the discovered patterns, detailing their associations and their targets (the specified classes or groups). Thirdly, the Data Space shows the record IDs of each patient, linking them to the knowledge source (DSU) and the associated patterns. Thus, the Knowledge Base effectively links knowledge, patterns, and data together. If an entity (i.e., a record) is labelled as a class, we can trace the “what” (i.e., the patterns it possesses), the “why” (the specific functional group it belongs to), and the “how” (by linking the patterns to the entity clusters containing the pattern(s)).
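A minimal in-memory sketch of how the three spaces could be linked; all field names and values are illustrative only, not PDD's actual schema:

```python
# Toy Knowledge Base linking the Knowledge, Pattern, and Data Spaces.
knowledge_base = {
    "knowledge_space": [
        {"dsu": (1, 1, 1)},   # [#PC, #Group, #SubGroup]
        {"dsu": (1, 2, 1)},
    ],
    "pattern_space": [
        {"id": "P1", "dsu": (1, 1, 1),
         "avs": ["GCS=low", "SpO2=low"], "target": "class 1"},
        {"id": "P2", "dsu": (1, 2, 1),
         "avs": ["HR=normal"], "target": "class 2"},
    ],
    "data_space": [
        {"record_id": 42, "patterns": ["P1"]},
        {"record_id": 43, "patterns": ["P2"]},
    ],
}

# Trace a record ("what" patterns it possesses) back to the DSU it
# belongs to ("why"), via the Pattern Space.
patterns = {p["id"]: p for p in knowledge_base["pattern_space"]}
record = knowledge_base["data_space"][0]
print([patterns[pid]["dsu"] for pid in record["patterns"]])  # [(1, 1, 1)]
```

The point of the structure is the bidirectional linkage: from a DSU down to the records that carry its patterns, and from a record up to the functional source it came from.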
The novelty and uniqueness of PDD lie in its ability to discover the most fundamental, explainable, and displayable associations at the AV level from entities (i.e., records) associated with presumed distinct primary sources. This is based on robust statistics, unbiased by class labels, confounding factors, or imbalanced group sizes, yet its results are traceable and verifiable by other scientific methods.
2.2.2. Clustering Patient Records
Without specifying the number of clusters to direct the unsupervised process, PDD can cluster records based on the disentangled pattern groups and subgroups.
As described in Section 2.2.1, the output of PDD is organized into a Knowledge Base, where each pattern subgroup is represented by a DSU code [#PC, #Group, #SubGroup]. As defined in Section 2.2.1, the set of AVs displayed in each DSU is a summarized pattern, representing the union of all the comprehensive patterns discovered on entities from that subgroup. We denote the number of comprehensive patterns discovered from the summarized pattern in a DSU as N[DSU]. For example, in DSU[1,1,1], if 10 comprehensive patterns are found, then N[1,1,1] = 10. Each record may possess none, one, or multiple comprehensive patterns of each DSU. We denote the number of comprehensive patterns possessed by a record r in a specific DSU as n_r[DSU]. For example, n_m[1,1,1] = 5 and n_m[1,2,1] = 6 indicate that the record with ID m possesses 5 comprehensive patterns in DSU[1,1,1] and 6 comprehensive patterns in DSU[1,2,1].
Each DSU can represent a specific function or characteristic in the data, potentially associated with a particular class. For example, in this study, DSU[1,1,1] is associated with one class, while DSU[1,2,1] is associated with the other. The fact that Group 1 and Group 2 appear as two opposite groups on PC1 (Figure 1) indicates that their AV associations have significant differences, as captured by PC1. Some DSUs might reveal rare patterns not associated with any class, since the class label is not part of the association.
Based on the definitions described above, we cluster the records by assigning each record to the class whose comprehensive patterns it matches most, compared with any other class. To explain the clustering process in more detail, consider the following example. The DSUs outputted by PDD are DSU[1,1,1], DSU[2,1,1], DSU[1,2,1], and DSU[2,2,1]; the first two are associated with one class and the latter two with the other. The total numbers of comprehensive patterns in these DSUs are N[1,1,1], N[2,1,1], N[1,2,1], and N[2,2,1], respectively. Consider a record m whose numbers of possessed comprehensive patterns are n_m[1,1,1], n_m[2,1,1], n_m[1,2,1], and n_m[2,2,1].
Because the number of comprehensive patterns varies across DSUs, we use a percentage rather than an absolute value to measure the association of a record with the pattern groups. Hence, to determine how the record m is associated with the patterns, we calculate, for each class, the average percentage of the class-associated comprehensive patterns possessed by the record, denoted as P1(m) and P2(m). If n_m[2,1,1] = 0, indicating that the record is not covered by DSU[2,1,1], that DSU is excluded from the calculation to avoid the significant impact of a zero value on the final percentage. Hence, the association of the record m with the patterns of the first class is calculated as P1(m) = n_m[1,1,1] / N[1,1,1]; similarly, P2(m) is the mean of n_m[1,2,1] / N[1,2,1] and n_m[2,2,1] / N[2,2,1]. The record is then assigned to the class with the greater percentage. To evaluate the accuracy of this assignment for all records, we compare the assigned class label with the original implicit class label.
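The assignment rule can be sketched as follows, with hypothetical pattern counts; the DSU-to-class mapping and all numbers are illustrative:

```python
# Total comprehensive patterns discovered per DSU (hypothetical counts).
totals = {
    (1, 1, 1): 10, (2, 1, 1): 8,    # DSUs associated with class 1
    (1, 2, 1): 12, (2, 2, 1): 6,    # DSUs associated with class 2
}
class_of = {(1, 1, 1): 1, (2, 1, 1): 1, (1, 2, 1): 2, (2, 2, 1): 2}

# Comprehensive patterns possessed by one record m (hypothetical).
possessed = {(1, 1, 1): 5, (2, 1, 1): 0, (1, 2, 1): 3, (2, 2, 1): 3}

def class_percentage(cls):
    # Average possession percentage over the class's DSUs, skipping DSUs
    # that do not cover the record (count 0) to avoid zero-value bias.
    ratios = [possessed[d] / totals[d]
              for d in totals
              if class_of[d] == cls and possessed[d] > 0]
    return sum(ratios) / len(ratios) if ratios else 0.0

p1, p2 = class_percentage(1), class_percentage(2)
assigned = 1 if p1 > p2 else 2
print(p1, p2, assigned)  # 0.5 0.375 1
```

Here class 1 averages 5/10 = 50% (DSU[2,1,1] is skipped because it does not cover the record), class 2 averages (3/12 + 3/6)/2 = 37.5%, so the record is assigned to class 1.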
2.2.3. Detecting Abnormal Records
The evaluation of classification or prediction involves comparing the predicted labels with the original class labels. However, this comparison is unreliable if mislabelled records exist in the original data. To address this issue, we propose an error detection method that identifies abnormal records using the patterns discovered by PDD. In our early work on PDD, we integrated both supervised and unsupervised methods for error detection and class association. In this paper, we simplify the process by using only a novel unsupervised method on a dataset with implicit class labels as the ground truth, making the error detection process more succinct.
To determine whether a record is abnormal, the proposed algorithm compares the class assigned by PDD with the record's original label, evaluating the consistency of the discovered patterns with their respective explicit or implicit class labels. We define three statuses for an abnormal record: Mislabelled, Outlier, and Undecided, which are detailed below.
Mislabelled: If a record is categorized into one class but matches more patterns from a different class according to the PDD output, the record may be mislabelled. For example, consider the same record m described in Section 2.2.2, with the same setting of the pattern groups and the class-association percentages (denoted here P1(m) and P2(m)). If the record is originally labelled as one class in the dataset, but the relative difference between the two percentages in favour of the other class is greater than 0.1, this suggests that the record m is more associated with the other class than with its original one. The relative difference is used instead of the absolute difference because it provides a scale-independent comparison of the number of patterns associated with one class relative to another. A value greater than 0.1 indicates that the number of patterns associated with one class is substantially greater than the number associated with the other. Hence, the record m may be mislabelled.
Outlier: If a record possesses no patterns or very few patterns, it may be an outlier. For example, consider a record m under the previously described pattern-group settings whose possessed comprehensive patterns yield class percentages that are both less than or equal to 1%. This suggests that record m possesses fewer than 1% of the patterns associated with either class, which may indicate that it is an outlier.
Undecided: If the numbers of possessed patterns for a record are similar across different classes, the record is classified as undecided. For example, consider a record k under the previously described pattern-group settings, where the class percentages P1(k) and P2(k) are each calculated as the mean of the covered DSU percentages for that class. If the difference between the two percentages is zero, or the relative difference is below the 0.1 threshold, record k may be associated with both classes, suggesting that it is undecided.
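Putting the three statuses together, a simplified decision rule might look like this; the 1% outlier cut-off and the 0.1 relative-difference threshold come from the text, but the exact form of the rule is our reading, so treat it as a sketch:

```python
def record_status(p1, p2, original_label):
    """Classify a record as outlier / undecided / mislabelled / normal
    from its class percentages p1 and p2 (fractions of class-associated
    patterns the record possesses) and its original class label (1 or 2).
    """
    # Outlier: almost no patterns of either class (<= 1%).
    if p1 <= 0.01 and p2 <= 0.01:
        return "outlier"
    hi, lo = max(p1, p2), min(p1, p2)
    # Undecided: similar support for both classes (relative diff <= 0.1).
    if (hi - lo) / hi <= 0.1:
        return "undecided"
    # Mislabelled: the PDD assignment contradicts the original label.
    pdd_label = 1 if p1 > p2 else 2
    if pdd_label != original_label:
        return "mislabelled"
    return "normal"

print(record_status(0.50, 0.375, original_label=2))   # mislabelled
print(record_status(0.005, 0.008, original_label=1))  # outlier
print(record_status(0.40, 0.40, original_label=1))    # undecided
```

The ordering of the checks matters: the outlier test runs first so that records with negligible pattern support are never reported as mislabelled or undecided.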
To avoid introducing incorrect information, records flagged as mislabelled, undecided, or outliers are removed from the dataset. Hence, to validate the effectiveness of the abnormal-record detection by PDD, we compared the classification results obtained on the original dataset with those obtained on the dataset without abnormal records when various classifiers were applied.