1. Introduction
In biology, phylogenetic inference is an important research focus that aims to discover the evolutionary history of species and their relationships. Its goal is to assemble a tree representing a hypothesis of the evolutionary ancestry of a set of genes, species, or other taxa.
Figure 1A–H shows selected fossil images, a species-against-attributes matrix, and the results of the phylogenetic inference analysis.
Due to incomplete records, there are almost always missing values for fossils; for example, the parts marked with “?” in image (C) in
Figure 1 indicate missing data. Thus, it may be difficult to support the results of the phylogenetic inference analysis under such circumstances. To solve this problem, four methods have been developed to deal with missing data. First, a certain proportion of incomplete species or attributes can be removed [
2,
3]. However, in many cases, the exclusion of incomplete species and attributes is carried out in an arbitrary manner without specific explanations or reasons [
3,
4,
5,
6]. Second, the number of attributes can be increased [
7,
8]. Research results showed that if the overall number of attributes in the analysis was sufficiently large (more than 1000 attributes), a phylogenetic inference method accurately reconstructed the position of highly incomplete taxa (e.g., 95% missing data) [
2,
3]. However, because early fossil organisms have simple structures, species often have fewer than 200 attributes. When the proportion of missing data is high, the phylogeny cannot be inferred accurately. Highly incomplete taxa may produce multiple equally parsimonious trees and poorly resolved consensus trees, resulting in low phylogenetic accuracy [
9]. Third, the missing values can be filled in. In the Hennig86 [
10] and PAUP [
11] programs, for example, each unspecified attribute is randomly assigned a value that is suitable for the attribute. Each of these three methods of dealing with missing data has its strengths and weaknesses, but none reflects the true value of the missing data. Fourth, the species-against-attribute matrix with missing data can be transformed into a suitable sparse representation by a sparse sampling algorithm, and a reconstruction algorithm is then used to reconstruct the sampling points. Sparse signal recovery theory shows that this method can accurately reconstruct data. A common sparse representation method is wavelet analysis [
12,
13,
14,
15]. By sparse representation of the data, wavelet analysis could potentially be applied to the task of recovering missing phylogenetic information. It has been successfully applied to signals, images, gene classification and so on [
16,
17]. However, it has received little attention in morphological phylogenetic analysis and is worth exploring.
In addition, there are three main approaches based on the principle of optimality for inferring the phylogenetic tree, namely maximum parsimony (MP) [
18], maximum likelihood (ML) [
19] and Bayesian inference (BI) methods [
9,
20]. The ML and Bayesian methods are commonly used probabilistic approaches based on matrices containing only gene data from living species [
21]. However, since DNA is usually not available for fossil taxa, only the fossil occurrence dates are used in time-calibrated phylogenies [
22]. Moreover, researchers have found that the ML and Bayesian methods do not deal effectively with missing morphological data [
20]. MP is well known to be non-deterministic polynomial-time (NP)-hard [
23]. Given the large number of taxa, an exhaustive search is infeasible, and the only practical way to obtain a near-optimal phylogenetic tree is to perform a heuristic search. However, studies have shown that heuristic MP searches may become trapped in a local optimum. Therefore, complex and flexible heuristics are needed to ensure that the tree space is fully explored.
Our motivation is to introduce a phylogenetic inference method that reduces the impact of missing data. In this paper, we propose an evolution analysis algorithm based on bi-directional cognitive processing; we call this approach phylogenetic deduction based on a concept decision tree (CDT). We use a cognitive model to reduce the search scope caused by incomplete data. In this model, a priori knowledge of relatively complete species is used to create a highly reliable phylogenetic tree as an initial seed. Attribute reduction [
24] based on rough sets [
25] is used to construct multiple concept-sample templates for each node of the initial seed tree by removing unrelated or unimportant attributes in order to improve the classification or decision-making [
26], thereby reducing the impact of missing data. We apply a matching algorithm to evaluate the matching degree between species’ attributes and the nodes’ concept-sample templates; hence we determine the location of the species by a serial search in the phylogenetic tree. Therefore, the global combinatorial explosion problem is decomposed into a classification framework that prevents instability. Compared with the traditional parallel phylogenetic inference process applied to all species, our method greatly reduces the computational scale and complexity of the task. Gradually, a complete phylogenetic tree is established.
Here we compare our method with the MP, ML, and BI methods using morphological datasets with different amounts of missing data. The proposed algorithm constructs phylogenetic trees from morphological data with an average accuracy of 86.5%, whereas the MP, ML, and BI methods achieve average accuracies of 85.5%, 82.8%, and 85.1%, respectively. We also compare the stability of the trees produced by the methods. The experimental results show that the tree lengths obtained with our method and the other methods are similar, with an average variance of 0.0872. Therefore, a stable phylogenetic tree can be constructed.
The rest of the paper is organized as follows.
Section 2 introduces the framework of the CDT algorithm. The process of developing concept-sample templates based on genetic algorithms (GAs) is described in
Section 3.
Section 4 presents the experimental results of the CDT and the discussion. Finally,
Section 5 provides the conclusions of the study.
2. Framework of the CDT Algorithm
The objective of the CDT algorithm is to construct a phylogenetic tree T for a set of species S, expressed as where . We input a species-against-attribute matrix for a set of species S. The species are sorted in order of completeness from high to low, which is denoted as . For each species , there are m attributes, which are defined as . We divide S into and , where and . The species in are relatively complete, whereas those in are missing many attributes.
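The ordering and partition step can be illustrated with a minimal sketch. It assumes that missing values are coded as "?" and that a completeness threshold separates the relatively complete species from the incomplete ones; the threshold value, the function names, and the toy matrix are illustrative assumptions, not part of the original method.

```python
from typing import Dict, List, Tuple

MISSING = "?"  # assumed marker for a missing attribute value


def completeness(row: List[str]) -> float:
    """Fraction of attributes in a row that are not missing."""
    return sum(value != MISSING for value in row) / len(row)


def split_by_completeness(
    matrix: Dict[str, List[str]], threshold: float = 0.8
) -> Tuple[List[str], List[str]]:
    """Sort species from most to least complete, then split at a threshold.

    The threshold is a hypothetical parameter chosen for illustration.
    """
    ordered = sorted(matrix, key=lambda name: completeness(matrix[name]), reverse=True)
    complete = [s for s in ordered if completeness(matrix[s]) >= threshold]
    incomplete = [s for s in ordered if completeness(matrix[s]) < threshold]
    return complete, incomplete


# Toy species-against-attribute matrix (hypothetical values).
matrix = {
    "sp_a": ["1", "0", "1", "1", "0"],
    "sp_b": ["1", "?", "?", "1", "?"],
    "sp_c": ["0", "0", "1", "?", "1"],
}
complete, incomplete = split_by_completeness(matrix)
```

In this sketch, the first list would seed the initial tree and the second list would be grafted in order of decreasing completeness.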
The framework of the phylogenetic inference based on the CDT is shown in
Figure 2.
We divide the framework into four steps as follows:
(1) The establishment of the initial seed tree
Due to the ambiguity of phylogenetic tree construction, the initial concept establishment is very important because it reduces the complexity of the subsequent steps. During the analysis of species evolution, we first apply either biologists’ prior knowledge or common software tools (such as MrBayes [
27], PAUP* [
11], or TNT [
28]) to a set of relatively complete species
in order to build a reliable phylogenetic tree
as an initial seed, where
,
.
(2) The generation of decision points in the initial seed tree
To take advantage of the established concepts, we perform attribute reduction on the rough set at each branch node of the initial seed tree by analyzing the species ’ location. In this way, we obtain the concept-sample templates for the branch nodes in . Therefore, the branch nodes have decision-making functions that become decision points. Correspondingly, the phylogenetic seed tree becomes the decision tree , which provides the basis for the grafting of species with missing data.
(3) Species grafting
For species in , we can determine its location in the phylogenetic tree by matching the species’ attributes with multiple concept-sample templates of each decision point in a top-down manner.
(4) The construction of a complete phylogenetic tree
The evolutionary process starts with the most reliable species in , followed by grafting it onto the tree, as described in Step 3. The next species is then added, and so on, finishing with species . In this way, a complete phylogenetic tree is constructed.
In this paper, we focus on the generation of decision points in the initial seed tree (
Section 3) and species grafting (
Section 4).
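The top-down grafting of Step 3 can be sketched as follows. This is only an illustration under stated assumptions: each decision point stores, for every child, a list of templates that map attribute indices to expected states, and the matching degree is taken to be the fraction of observed (non-missing) template attributes that agree with the species. The `Node` class, the matching rule, and all example values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence

MISSING = "?"  # assumed marker for a missing attribute value


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)
    # templates[child name] -> list of templates {attribute index: expected state}
    templates: Dict[str, List[Dict[int, str]]] = field(default_factory=dict)


def match_degree(species: Sequence[str], template: Dict[int, str]) -> Optional[float]:
    """Agreement over the template attributes that are observed in the species."""
    observed = [(i, v) for i, v in template.items() if species[i] != MISSING]
    if not observed:
        return None  # the template cannot be evaluated for this species
    return sum(species[i] == v for i, v in observed) / len(observed)


def best_match(species: Sequence[str], templates: List[Dict[int, str]]) -> float:
    """Best matching degree over all usable templates of one child."""
    degrees = [match_degree(species, t) for t in templates]
    usable = [d for d in degrees if d is not None]
    return max(usable, default=0.0)


def graft(species: Sequence[str], root: Node) -> List[str]:
    """Descend from the root, always following the best-matching child."""
    path, node = [root.name], root
    while node.children:
        current = node
        node = max(
            current.children,
            key=lambda child: best_match(species, current.templates.get(child.name, [])),
        )
        path.append(node.name)
    return path


a = Node("A", [Node("A1"), Node("A2")],
         {"A1": [{3: "1"}], "A2": [{3: "0"}, {1: "0"}]})
root = Node("root", [a, Node("B")],
            {"A": [{0: "1", 1: "1"}, {2: "0"}], "B": [{0: "0"}, {3: "0"}]})
path = graft(["1", "?", "0", "1"], root)
```

Keeping several templates per child means that a species missing the attributes of one template can still be placed using another, which is the motivation the text gives for multiple concept-sample templates.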
3. Construction of Multiple Concept-Sample Templates
The internal nodes in the phylogenetic tree are an important decision-making basis for phylogenetic inference. Therefore, we transform the internal nodes into decision points. Due to a large number of missing and inconsistent attributes, traditional pattern recognition methods are not applicable. Therefore, a method is required to provide decision-making attribute sets for the internal nodes.
We propose to generate multiple concept-sample templates for the internal nodes based on the species’ location. The purpose of rough set attribute reduction is to remove unrelated or unimportant attributes in order to improve classification or decision-making [
21,
29]. Attribute reduction has been shown to be an NP-hard problem for combinatorial optimization [
22,
23]. In many applications, it is sufficient to find a single minimal attribute reduction. However, because morphological data in paleontology are often missing many values, we need multiple concept-sample templates to make full use of the data. In this study, we use entropy-based genetic algorithms (GAs) [
24] to find the optimal template sets heuristically because they can simulate the optimal solution of a natural evolutionary process, and phylogenetic inference is essentially part of the study of evolution.
3.1. The Design of the Genetic Algorithm for Attribute Reduction
In this section, we introduce the details of the GA to deal with attribute reduction in the rough set theory.
3.1.1. Encoding Method
A variable-length decimal array of one-dimensional strings represents the chromosome. The length of the chromosome equals the number of the species’ attributes, i.e., N. Each gene bit corresponds to an attribute in the chromosome. Each gene bit in the chromosome is numbered , and the corresponding code ranges from 0 to the number of the species’ attributes, where 0 denotes that the attribute is not selected and i denotes that the ith attribute is selected as the attribute of the concept-sample template. The chromosomes in the initial population are generated using uniformly distributed random numbers.
When the length of the chromosome is
N, each chromosome corresponds to a unique set of concept-sample templates for a total of
, as shown in
Table 1 below:
For example,
Table 2 shows the encoding method of a chromosome with
. Sites 8 and 9 have the same value 9, indicating that attributes 8 and 9 belong to the same concept-sample template. Site 10 has value 8 and the other sites have different codes; therefore, attribute 10 represents a single concept-sample template. For example, if the template set X for a decision point is
, it contains
.
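The decoding implied by this example can be sketched as below, under the assumption that gene sites sharing the same non-zero code belong to the same concept-sample template and that code 0 marks an unselected attribute. Only the values at sites 8 to 10 follow the example in the text; the other values and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def decode_chromosome(chromosome: List[int]) -> Dict[int, List[int]]:
    """Group 1-based attribute sites by their code; code 0 means 'not selected'."""
    templates = defaultdict(list)
    for site, code in enumerate(chromosome, start=1):
        if code != 0:
            templates[code].append(site)
    return dict(templates)


# Sites 8 and 9 carry code 9, site 10 carries code 8 (as in the text);
# the remaining codes are made up for illustration.
chromosome = [1, 0, 3, 0, 5, 0, 7, 9, 9, 8]
templates = decode_chromosome(chromosome)
```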
3.1.2. Fitness Function
The fitness of a chromosome determines the probability with which it will be inherited by the next generation. Here, the fitness of a chromosome is calculated by reference to the concept-sample template set generated by it. According to the principle of attribute reduction, B represents the attribute subset of the present mapping, represents the attribute set of the species, and represents the class label of the species belonging to the node.
Definition 1. Let be a non-empty finite set of objects, called the domain. , the B-lower approximation set of X is defined as follows:where denotes an equivalence class determined by object x. Definition 2. Assuming that , , the lower approximation set is defined as follows:That is, the lower approximation set is obtained from all of the sets contained in X. If
, we calculate
and substitute it into the fitness function of Equation (
3). If
,
. The fitness function is defined as follows:
where
L represents the number of concept template sets in the chromosome,
represents the number of species attributes,
n represents the
nth concept-sample template, and
represents the number of attributes in the
nth template.
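The lower approximation used in the fitness calculation can be illustrated with the standard rough-set construction: objects are grouped into equivalence classes by their values on the attribute subset B, and the lower approximation of X is the union of the classes entirely contained in X. The sketch below treats every state, including a missing one, as an ordinary value, which is a simplifying assumption; the table and names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set


def equivalence_classes(table: Dict[str, Sequence[str]], attrs: List[int]) -> List[Set[str]]:
    """Partition the objects by their values on the chosen attributes."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attrs)].add(obj)
    return list(classes.values())


def lower_approximation(
    table: Dict[str, Sequence[str]], attrs: List[int], target: Set[str]
) -> Set[str]:
    """Union of the equivalence classes that lie entirely inside the target set."""
    lower: Set[str] = set()
    for cls in equivalence_classes(table, attrs):
        if cls <= target:
            lower |= cls
    return lower


table = {
    "s1": ["1", "0", "1"],
    "s2": ["1", "0", "0"],
    "s3": ["0", "1", "1"],
    "s4": ["1", "0", "1"],
}
X = {"s1", "s2"}  # species assumed to belong to the node of interest
coarse = lower_approximation(table, [0, 1], X)
fine = lower_approximation(table, [0, 1, 2], X)
```

With attributes 0 and 1 alone, s1, s2, and s4 are indiscernible, so no class fits inside X; adding attribute 2 isolates s2, which then belongs to the lower approximation.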
3.1.3. Selection Operator
We use the roulette wheel selection method to choose the best individual to continue to the next generation. Individuals are selected with a probability proportional to their fitness values [
30]. Given a population $P = \{x_1, x_2, \ldots, x_M\}$ (where $M$ is the population size) in which the fitness of individual $x_i$ is $f_i$, the probability $p_i$ of individual $x_i$ being selected is
$$p_i = \frac{f_i}{\sum_{j=1}^{M} f_j},$$
where $p_i$ reflects the proportion of the fitness value of individual $x_i$ with respect to the sum of the fitness values of all individuals.
In order to ensure that the best individuals survive to the next generation, we use the optimal preservation strategy [
31]. If the fitness value of the worst individual in the current generation is less than the fitness value of the best individual in the previous generation, we use the best individual in the previous generation to replace the worst individual in the current generation. In the case of more than one optimal individual, the optimal individual is randomly selected to replace the worst individual.
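Roulette wheel selection and the optimal preservation strategy can be sketched as follows. The function names are illustrative, fitness values are assumed to be non-negative, and the fallback to uniform probabilities when all fitness values are zero is an added assumption.

```python
import random
from typing import List, Sequence, Tuple


def selection_probabilities(fitness: Sequence[float]) -> List[float]:
    """Each individual's share of the total fitness."""
    total = sum(fitness)
    if total <= 0:  # assumed fallback: uniform selection
        return [1.0 / len(fitness)] * len(fitness)
    return [f / total for f in fitness]


def roulette_select(population: Sequence, fitness: Sequence[float], k: int,
                    rng: random.Random) -> List:
    """Draw k individuals with probability proportional to fitness."""
    return rng.choices(population, weights=selection_probabilities(fitness), k=k)


def preserve_best(prev_pop: Sequence, prev_fit: Sequence[float],
                  cur_pop: Sequence, cur_fit: Sequence[float],
                  rng: random.Random) -> Tuple[List, List[float]]:
    """Replace the worst current individual with the previous best if it is worse."""
    cur_pop, cur_fit = list(cur_pop), list(cur_fit)
    best_prev = max(prev_fit)
    worst = min(range(len(cur_fit)), key=cur_fit.__getitem__)
    if cur_fit[worst] < best_prev:
        # Ties among the previous best individuals are broken at random.
        candidates = [p for p, f in zip(prev_pop, prev_fit) if f == best_prev]
        cur_pop[worst] = rng.choice(candidates)
        cur_fit[worst] = best_prev
    return cur_pop, cur_fit


rng = random.Random(0)
probs = selection_probabilities([1.0, 3.0])
new_pop, new_fit = preserve_best(["a", "b"], [0.9, 0.1], ["c", "d"], [0.5, 0.2], rng)
```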
3.1.4. Crossover Operator
The crossover operation uses a random single-point crossover strategy. An individual is chosen to take part in the crossover at a certain probability . All selected individuals are randomly paired. For each pair of individuals, a cross-point is selected randomly. Some of the chromosomes of the paired individuals are exchanged at the cross-point. In this way, the next generation of individuals is generated.
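A minimal sketch of this operator is given below. It assumes chromosomes are equal-length lists with at least two genes; the names `single_point_crossover` and `crossover_population` are illustrative, and an unpaired individual (when an odd number is selected) is simply copied unchanged, which is an assumption.

```python
import random
from typing import List, Tuple


def single_point_crossover(p1: List[int], p2: List[int],
                           rng: random.Random) -> Tuple[List[int], List[int]]:
    """Swap the tails of two parents at a random cross-point."""
    point = rng.randint(1, len(p1) - 1)  # both segments are non-empty
    return p1[:point] + p2[point:], p2[:point] + p1[point:]


def crossover_population(pop: List[List[int]], pc: float,
                         rng: random.Random) -> List[List[int]]:
    """Select individuals with probability pc, pair them at random, and cross them."""
    chosen = [i for i in range(len(pop)) if rng.random() < pc]
    rng.shuffle(chosen)
    offspring = [list(c) for c in pop]
    for a, b in zip(chosen[::2], chosen[1::2]):
        offspring[a], offspring[b] = single_point_crossover(pop[a], pop[b], rng)
    return offspring


rng = random.Random(0)
child1, child2 = single_point_crossover([1, 1, 1, 1], [0, 0, 0, 0], rng)
offspring = crossover_population([[1] * 4, [0] * 4, [2] * 4, [3] * 4], 1.0, rng)
```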
3.1.5. Mutation Operator
The mutation operations use the “basic bit” variation. For each chromosome selected with probability , its mutation point is specified by a random probability and the value at the specified mutation point becomes another state value. In this way, we can generate further members of the next generation to improve the performance of the heuristic search.
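The mutation step can be sketched as below, assuming that gene codes range from 0 to the number of attributes and that the mutated site always takes a value different from its current one. The function names are illustrative.

```python
import random
from typing import List


def mutate(chromosome: List[int], n_attributes: int, rng: random.Random) -> List[int]:
    """Change one randomly chosen gene to a different code in 0..n_attributes."""
    site = rng.randrange(len(chromosome))
    alternatives = [c for c in range(n_attributes + 1) if c != chromosome[site]]
    mutated = list(chromosome)
    mutated[site] = rng.choice(alternatives)
    return mutated


def mutate_population(pop: List[List[int]], pm: float, n_attributes: int,
                      rng: random.Random) -> List[List[int]]:
    """Mutate each chromosome with probability pm; copy the others unchanged."""
    return [mutate(c, n_attributes, rng) if rng.random() < pm else list(c) for c in pop]


rng = random.Random(0)
original = [0, 2, 0, 4, 5]
mutated = mutate(original, 5, rng)
```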
3.1.6. Modification Operator
Step 1: Calculate the mutual information
of the condition attribute set
C and the decision attribute set
D. The mutual information [
32] of
C and
D is defined as
where
and the conditional entropy of
X and
Y is defined as
. When
X and
Y are independent,
; otherwise, this index is positive [
32,
33] and it increases with the degree of dependence between the components
and
.
Step 2: Calculate and . If then repeat steps 3 and 4; otherwise, end the modification;
Step 3: Select attribute a in so that reaches the maximum value. reflects the increment of mutual information when a is added to . According to the definition of attribute importance of the mutual information, we select the attribute and set it to ;
Step 4: Change the bit corresponding to from 0 to j and return to step 2;
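The modification steps can be read as a greedy repair: attributes are added one at a time, each time choosing the attribute with the largest gain in mutual information, until the selected subset carries as much information about the class labels as the full condition attribute set. The sketch below follows that reading and uses the standard definitions H(X) = -Σ p(x) log2 p(x), H(D|C) = H(C,D) - H(C), and I(C;D) = H(D) - H(D|C); the stopping rule, the function names, and the toy table are assumptions rather than the paper's exact procedure.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy (base 2) of the empirical distribution of values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(rows: List[Sequence[str]], attrs: List[int],
                       labels: Sequence[str]) -> float:
    """I(B; D) = H(D) - H(D | B), with H(D | B) = H(B, D) - H(B)."""
    if not attrs:
        return 0.0
    cond = [tuple(row[a] for a in attrs) for row in rows]
    conditional = entropy(list(zip(cond, labels))) - entropy(cond)
    return entropy(labels) - conditional


def repair(rows: List[Sequence[str]], labels: Sequence[str],
           selected: List[int]) -> List[int]:
    """Greedily add attributes until the subset matches the full set's information."""
    all_attrs = list(range(len(rows[0])))
    target = mutual_information(rows, all_attrs, labels)
    selected = list(selected)
    while mutual_information(rows, selected, labels) < target - 1e-12:
        remaining = [a for a in all_attrs if a not in selected]
        if not remaining:
            break
        best = max(remaining,
                   key=lambda a: mutual_information(rows, selected + [a], labels))
        selected.append(best)
    return selected


rows = [["0", "0", "1"], ["0", "1", "1"], ["1", "0", "0"], ["1", "1", "0"]]
labels = ["a", "b", "a", "b"]  # determined by attribute 1 alone
repaired = repair(rows, labels, [0])
```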
3.2. Algorithm Description
Input: An attribute table of Species C, the class label of the species D
Output: Concept-sample template sets for each internal node
Step 0: Set the parameters: chromosome size m, population size , crossover probability , mutation probability , and maximum generation . Let generation .
Step 1: Generate chromosomes randomly.
Step 2: Calculate the fitness value of each chromosome.
Step 3: Perform crossover on individuals selected with probability .
Step 4: Perform mutation on individuals selected with probability .
Step 5: Create the new population. Select individuals from the parents and offspring for the next generation by the roulette wheel selection method.
Step 6: Perform modification of the individuals.
Step 7: Stop calculating. If , then output the corresponding concept-template collection and stop, else let and return to Step 2.
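The overall loop of Steps 0 to 7 can be sketched in compact form. This is a skeleton under several assumptions: the fitness function is supplied by the caller and is non-negative, the modification operator of Step 6 is omitted, the mutated gene is redrawn uniformly (so it may occasionally keep its value), and the name `run_ga` and all default parameter values are hypothetical.

```python
import random
from typing import Callable, List


def run_ga(fitness: Callable[[List[int]], float], n_genes: int, pop_size: int = 20,
           pc: float = 0.8, pm: float = 0.1, max_gen: int = 50, seed: int = 0) -> List[int]:
    rng = random.Random(seed)
    # Step 1: random initial population with codes in 0..n_genes.
    pop = [[rng.randint(0, n_genes) for _ in range(n_genes)] for _ in range(pop_size)]
    best = max(pop, key=fitness)  # Step 2: evaluate fitness
    for _ in range(max_gen):
        # Step 3: single-point crossover on randomly paired, selected individuals.
        offspring = [list(c) for c in pop]
        chosen = [i for i in range(pop_size) if rng.random() < pc]
        rng.shuffle(chosen)
        for a, b in zip(chosen[::2], chosen[1::2]):
            cut = rng.randint(1, n_genes - 1)
            offspring[a] = pop[a][:cut] + pop[b][cut:]
            offspring[b] = pop[b][:cut] + pop[a][cut:]
        # Step 4: redraw one random gene in each selected offspring.
        for child in offspring:
            if rng.random() < pm:
                child[rng.randrange(n_genes)] = rng.randint(0, n_genes)
        # Step 5: roulette wheel selection over parents and offspring.
        pool = pop + offspring
        weights = [fitness(c) for c in pool]
        if sum(weights) <= 0:
            weights = [1.0] * len(pool)
        pop = [list(c) for c in rng.choices(pool, weights=weights, k=pop_size)]
        # Optimal preservation: the previous best replaces the worst if it is better.
        worst = min(range(pop_size), key=lambda i: fitness(pop[i]))
        if fitness(pop[worst]) < fitness(best):
            pop[worst] = list(best)
        best = max(pop, key=fitness)
    # Step 7: stop after max_gen generations and return the best chromosome.
    return best


def toy_fitness(chromosome: List[int]) -> float:
    """Hypothetical fitness: rewards the number of distinct non-zero codes."""
    return 1.0 + len(set(chromosome) - {0})


best_final = run_ga(toy_fitness, 8, max_gen=30)
best_initial = run_ga(toy_fitness, 8, max_gen=0)
```

Because of the preservation step, the best fitness in the population never decreases from one generation to the next in this sketch.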
5. Experimental Results
To assess the accuracy and reliability of the CDT, we conducted experiments on six species datasets. The summary information for the datasets is shown in
Table 3.
The datasets were used to construct phylogenetic trees using our CDT algorithm as well as three other standard methods, namely MP, ML, and BI. The specific steps are described in
Section 5.1. The grafting results of the CDT and the trees produced by the other methods were compared to the accepted tree topologies (model trees) that are part of the datasets.
5.1. CDT Accuracy Analysis
The accuracy rate of the assignment of a species, i.e., the accuracy of the species’ phylogenetic analysis, depends on the node path of that species. The path of a species in a phylogenetic tree model accepted by biologists is considered to be the standard path sequence
. The path sequences of the grafted species
obtained from the CDT, MP, ML, and BI methods were compared with the standard path sequences.
denotes that
matches the standard sequence
.
is the number of path matching species and
is the total number of standard sequence species. The accuracy can be expressed by Equation (
7). For example, if
and
, then
.
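The accuracy measure can be sketched as the fraction of species whose path in the inferred tree equals the standard path. The path encoding (a list of node labels from the root), the function name, and the example values below are hypothetical.

```python
from typing import Dict, List


def path_accuracy(predicted: Dict[str, List[str]], standard: Dict[str, List[str]]) -> float:
    """Number of species with a matching path divided by the number of standard species."""
    matches = sum(predicted.get(species) == path for species, path in standard.items())
    return matches / len(standard)


standard = {
    "sp1": ["root", "A", "A1"],
    "sp2": ["root", "A", "A2"],
    "sp3": ["root", "B"],
    "sp4": ["root", "B"],
}
predicted = {
    "sp1": ["root", "A", "A1"],
    "sp2": ["root", "A", "A1"],  # placed under the wrong child
    "sp3": ["root", "B"],
    "sp4": ["root", "B"],
}
accuracy = path_accuracy(predicted, standard)
```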
To verify the performance of the CDT algorithm, the attributes of the species which are to be grafted are randomly chosen to be incomplete. The missing proportions are 0%, 10%, 20%, 30%, 40%, 50%, 60% and 70%. On the basis of different proportions of missing data, we apply the CDT algorithm for species grafting and the MP, ML, and BI methods to establish phylogenetic trees. The bootstrap method [
41,
42] is used to resample the data set 1000 times and the average accuracy of the four methods is calculated. For six species datasets in
Table 3, the accuracies of the four methods of phylogenetic analysis under different proportions of missing data are shown in
Figure 6.
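The experimental protocol of hiding a given proportion of a species' attributes can be sketched as follows. The sketch assumes that missing values are coded as "?" and that the proportion is taken relative to the total number of attributes. The column-resampling helper reflects the usual phylogenetic bootstrap, which may differ from the resampling scheme actually used here. All names are illustrative.

```python
import random
from typing import List

MISSING = "?"  # assumed marker for a missing attribute value


def mask_attributes(row: List[str], proportion: float, rng: random.Random) -> List[str]:
    """Hide a random subset of observed attributes, sized as a share of all attributes."""
    observed = [i for i, v in enumerate(row) if v != MISSING]
    k = min(round(proportion * len(row)), len(observed))
    hidden = set(rng.sample(observed, k))
    return [MISSING if i in hidden else v for i, v in enumerate(row)]


def bootstrap_columns(n_attributes: int, rng: random.Random) -> List[int]:
    """Resample attribute indices with replacement (one bootstrap replicate)."""
    return rng.choices(range(n_attributes), k=n_attributes)


rng = random.Random(1)
row = ["1", "0", "1", "1", "0", "0", "1", "0", "1", "1"]
masked = mask_attributes(row, 0.3, rng)
replicate = bootstrap_columns(len(row), rng)
```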
We observe the following:
In general, an increase in missing data results in insufficient information and a decrease in accuracy.
When the proportion of missing data is less than 10%, the accuracies are similar for the different methods, i.e., the species can be classified accurately.
The proposed method significantly improves the accuracy of the results, especially for datasets with a large proportion of missing data (missing proportions > 40%). This occurs because, as the number of attributes per species increases, the amount of data available for the concept-sample templates increases; even though the proportion of missing data increases, it becomes much easier to assign the species to the correct location.
The average accuracies of the CDT, MP, ML and BI methods are shown in
Table 4. The accuracy of the CDT method was 86.5% whereas the accuracies of the MP, ML, and BI methods were 85.5%, 82.8%, and 85.1%, respectively, indicating that the proposed method had the highest average accuracy.
5.2. CDT Reliability Analysis
To evaluate the reliability of the CDT method, we used the tree length [
43] as the optimality criterion. In phylogenetics, the length of a phylogenetic tree measures the morphological change implied by the tree, i.e., the number of changes in the attribute states. The shorter the tree, the more reliable it is considered to be. Therefore, a phylogenetic tree with the lowest number of changes in the attribute states is preferred.
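One common way to count attribute state changes on a fixed tree is Fitch's parsimony algorithm. The sketch below is an illustration and not necessarily the procedure used to obtain the reported tree lengths. It assumes a rooted binary tree written as nested tuples of leaf names, unordered attribute states, and missing values ("?") treated as compatible with any observed state.

```python
from typing import Dict, FrozenSet, Sequence, Tuple, Union

MISSING = "?"  # assumed marker for a missing attribute value
Tree = Union[str, Tuple["Tree", "Tree"]]


def _fitch(node: Tree, column: Dict[str, str],
           wildcard: FrozenSet[str]) -> Tuple[FrozenSet[str], int]:
    """Return (possible states, number of changes) for one attribute below node."""
    if isinstance(node, str):
        state = column[node]
        return (wildcard if state == MISSING else frozenset([state])), 0
    left_states, left_changes = _fitch(node[0], column, wildcard)
    right_states, right_changes = _fitch(node[1], column, wildcard)
    common = left_states & right_states
    if common:
        return common, left_changes + right_changes
    # Disjoint state sets imply one additional change at this node.
    return left_states | right_states, left_changes + right_changes + 1


def tree_length(tree: Tree, matrix: Dict[str, Sequence[str]]) -> int:
    """Sum the minimum number of state changes over all attributes."""
    n_attributes = len(next(iter(matrix.values())))
    total = 0
    for c in range(n_attributes):
        column = {name: row[c] for name, row in matrix.items()}
        wildcard = frozenset(v for v in column.values() if v != MISSING)
        if wildcard:  # skip attributes that are missing in every species
            total += _fitch(tree, column, wildcard)[1]
    return total


tree = (("a", "b"), ("c", "d"))
matrix = {"a": ["0", "0"], "b": ["0", "1"], "c": ["1", "?"], "d": ["1", "1"]}
length = tree_length(tree, matrix)
```

In this toy case each attribute needs one change, and the missing state of species c never forces an extra change.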
We used the results of the phylogenetic tree described in
Section 5.1 and calculated the tree length separately as shown in
Figure 7.
In
Figure 7, it is observed that grafted species with different proportions of missing data have little effect on the tree length. The data in
Table 5 were obtained by analyzing the results in
Figure 7.
Table 5 shows that the tree length of the phylogenetic tree is similar for all four methods and the average variance of the tree length is 0.0872. Therefore, our method is as reliable as the other methods.
5.3. Phylogenetic Inference on Cambrian Lobopodians
In this study, we apply the CDT to the phylogenetic analysis of the Cambrian lobopodians. The Cambrian lobopodians paleontological morphological dataset [
1] contains large amounts of missing data; for example, the species
Opabinia has 32% missing data, while
Hadranax and
Orstenotubulus have 48% missing data. The species
Opabinia,
Hadranax, and
Orstenotubulus were sequentially used for grafting to construct phylogenetic trees, as shown in
Figure 8. The results show that our method provides a phylogenetic tree that is consistent with the assessment of paleontologists.