Reference-Based Sequence Classification
ABSTRACT Sequence classification is an important data mining task in many real-world applications. Over the past few decades, many sequence classification methods have been proposed from different aspects. In particular, the pattern-based method is one of the most important and widely studied sequence classification methods in the literature. In this paper, we present a reference-based sequence classification framework, which can unify existing pattern-based sequence classification methods under the same umbrella. More importantly, this framework can be used as a general platform for developing new sequence classification algorithms. By utilizing this framework as a tool, we propose new sequence classification algorithms that are quite different from existing solutions. Experimental results show that new methods developed under the proposed framework are capable of achieving comparable classification accuracy to those state-of-the-art sequence classification algorithms.

INDEX TERMS Sequence classification, sequential data analysis, cluster analysis, hypothesis testing, sequence embedding.
choose any one of these existing classification methods to fulfill the sequence classification task. Second, in some popular feature-based methods such as pattern-based methods, each feature has good interpretability. Last but not least, the extraction of features from sequences has been extensively studied across different fields, making it feasible to generate sequence features in an effective manner.

The k-mer (in bioinformatics) or k-gram (in natural language processing), a substring composed of k consecutive elements, is probably the most widely used feature in feature-based sequence classification. Such a k-mer-based feature construction method is further generalized by the pattern-based method, in which a feature is a sequential pattern (a subsequence) that satisfies some constraints (e.g., frequent pattern, discriminative pattern). Over the past few decades, a large number of pattern-based methods have been presented in the context of sequence classification [5]–[30].

In this paper, we present a reference-based sequence classification framework, which can be considered as a non-trivial generalization of the pattern-based methods. This framework has several key steps: candidate set construction, reference point selection and feature value construction. In the first step, a set of sequences that serve as the candidate reference points is constructed. Then, some sequences from the candidate set are selected as the reference points according to certain criteria. The number of features in the transformed vectorial data will equal the number of selected reference points. In other words, each reference point will correspond to a transformed feature. Finally, a similarity function is used to calculate the similarity between each sequence in the data and every reference point. The similarity to each reference point will be used as the corresponding feature value.

The reference-based sequence classification framework is quite general and flexible since the selection of both reference points and similarity functions is arbitrary. Existing feature-based methods can be regarded as special variants under our framework obtained by (1) using (frequent or discriminative) sequential patterns (subsequences) as reference points and (2) utilizing a boolean function (output 1 if the reference point is contained in a given sequence and output 0 otherwise) as the similarity function. Besides unifying existing pattern-based methods under the same umbrella, the reference-based sequence classification framework can be used as a general platform for developing new feature-based sequence classification methods. To justify this point, we develop a new feature-based method in which a subset of training sequences is used as the reference points and the Jaccard coefficient is used as the similarity function. In particular, we present two instance selection methods to select a good set of reference points.

To demonstrate the feasibility and advantages of this new framework, we conduct a series of comprehensive performance studies on real sequential data sets. In the experiments, we compare several variants under our framework with some existing sequence classification methods in terms of classification accuracy. Experimental results show that new methods developed under the proposed framework are capable of achieving better classification accuracy than traditional sequence classification methods. This indicates that such a reference-based sequence classification framework is promising from a practical point of view.

The main contributions of this paper can be summarized as follows:
• We present a general reference-based framework for feature-based sequence classification. It offers a unified view for understanding and explaining many existing feature-based sequence classification methods in which different types of sequential patterns are used as features.
• The reference-based framework can be used as a general platform for developing new feature-based sequence classification algorithms. To verify this point, we design new feature-based sequence classification algorithms under this framework and demonstrate their advantages through extensive experimental results on real sequential data sets.

The rest of the paper is structured as follows. Section II gives a discussion on the related work. In Section III, we introduce the reference-based sequence classification framework in detail. In Section IV, we show that many existing feature-based sequence classification algorithms can be reformulated within the reference-based framework. In Section V, we present new feature-based sequence classification algorithms under this framework, which are effective and quite different from available solutions. We experimentally evaluate the proposed reference-based framework through a series of experiments on real-life data sets in Section VI. Finally, we summarize our research and discuss future work in Section VII.

II. RELATED WORK
In this section, we discuss previous research efforts that are closely related to our method. In Section II-A, we provide a categorization of existing feature-based sequence classification methods. In Section II-B, we discuss several instance-based feature generation methods in the literature of time series classification. In Section II-C, we present a concise discussion on reference-based sequence clustering algorithms. In Section II-D, we provide a short summary of dimension reduction and embedding methods based on landmark points.

A. FEATURE-BASED METHODS
1) EXPLICIT SUBSEQUENCE REPRESENTATION WITHOUT SELECTION
The naive approach in dealing with discrete sequences is to treat each element as a feature. However, the order information between different elements will be lost and the sequential nature cannot be captured in the classification. Short sequence segments of k consecutive elements, called k-grams, can be used as features to solve this problem. Given a set of k-grams, a sequence can be represented as a vector of the presence or absence of the k-grams, or of their frequencies. In this feature representation method, all k-grams (for a specified k value) are explicitly used as the features without feature selection.
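To make this representation concrete, the following is a minimal sketch of k-gram feature construction; the function names and the binary/frequency switch are our own illustration, not taken from the cited works:

```python
from itertools import chain

def kgrams(seq, k):
    """Return all k-length contiguous segments (k-grams) of a sequence."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]

def kgram_features(sequences, k, binary=True):
    """Encode each sequence as a vector over the full k-gram vocabulary.

    Each coordinate is the presence/absence (binary=True) or the frequency
    (binary=False) of one k-gram, i.e., the explicit representation without
    feature selection described above.
    """
    vocab = sorted(set(chain.from_iterable(kgrams(s, k) for s in sequences)))
    index = {g: i for i, g in enumerate(vocab)}
    vectors = []
    for s in sequences:
        v = [0] * len(vocab)
        for g in kgrams(s, k):
            v[index[g]] = 1 if binary else v[index[g]] + 1
        vectors.append(v)
    return vocab, vectors

# Example: two symbolic sequences encoded over their 2-grams.
vocab, X = kgram_features([list("abcde"), list("ecdc")], k=2)
```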
2) EXPLICIT SUBSEQUENCE REPRESENTATION WITH SELECTION (CLASSIFIER-INDEPENDENT)
Lesh et al. [26] present a pattern-based classification method in which a sequential pattern is chosen as a feature. The selected pattern should satisfy the following criteria: (1) be frequent, (2) be distinctive of at least one class and (3) not be redundant. Towards this direction, many pattern-based classification methods have been subsequently proposed, in which different constraints are imposed on the patterns that should be selected as features [5]–[25], [27]–[30]. Note that any classifier designed for vectorial data can be applied to the transformed data generated from such pattern-based methods. In other words, such feature generation methods are classifier-independent.

3) EXPLICIT SUBSEQUENCE REPRESENTATION WITH SELECTION (CLASSIFIER-DEPENDENT)
The above pattern-based methods are universal and classifier-independent. However, some patterns that are critical to the classifier may be filtered out during the selection process. Thus, several methods which can select pattern features from the entire pattern space for a specific classifier have been proposed [31]–[33].

In [31], a coordinate-wise gradient ascent technique is presented for learning the logistic regression function in the space of all k-grams. The method exploits the inherent structure of the k-gram feature space to automatically provide a compact set of highly discriminative k-gram features. In [32], a framework is presented in which linear classifiers such as logistic regression and support vector machines can work directly in the explicit high-dimensional space of all subsequences. The key idea is a gradient-bounded coordinate-descent strategy to quickly retrieve features without explicitly enumerating all potential subsequences. In [33], a novel document classification method using all substrings as features is proposed, in which L1 regularization is applied to a multi-class logistic regression model to fulfill the feature selection task automatically and efficiently.

4) IMPLICIT SUBSEQUENCE REPRESENTATION
In contrast to explicit subsequence representation, kernel-based methods employ an implicit subsequence representation strategy. A kernel function is the key ingredient for learning with support vector machines (SVMs) and it implicitly defines a high-dimensional feature space. Some kernel functions $K(x, y)$ have been presented for measuring the similarity between two sequences $x$ and $y$ (e.g. [34]). There are a variety of string kernels which are widely used for sequence classification (e.g. [35]–[38]). A sequence is transformed into a feature space and the kernel function is the inner product of two transformed feature vectors.

Leslie et al. [35] propose a k-spectrum kernel for protein classification. Given a number $k \ge 1$, the k-spectrum of an input sequence is the set of all its k-length (contiguous) subsequences.

Lodhi et al. [36] present a string kernel based on gapped k-length subsequences for text classification. The subsequences are weighted by an exponentially decaying factor of their full length in the text.

In [37], a mismatch string kernel is proposed, in which a certain number of mismatches are allowed in counting the occurrences of a subsequence. Several string kernels related to the mismatch kernel are presented in [38]: restricted gappy kernels, substitution kernels and wildcard kernels.

5) SEQUENCE EMBEDDING
All the methods mentioned above use subsequences as features. Alternatively, the sequence embedding method generates a vector representation in which each feature does not have a clear interpretation. Most existing approaches for sequence embedding are proposed for texts in natural language processing, where word and document embeddings are used as an efficient way to encode the text (e.g. [39], [40]). The basic assumption in these methods is that words that appear in similar contexts have similar meanings.

The word2vec model [39] uses a two-layer neural network to learn a vector representation for each word. The sequence (text) embedding vector can be further generated by combining the feature vectors for words. The doc2vec model [40] extends word2vec by directly learning feature vectors for entire sentences, paragraphs, or documents.

Nguyen et al. [41] propose an unsupervised method (named Sqn2Vec) for learning sequence embeddings by predicting a sequence's singleton symbols and sequential patterns (SPs). The main objective of Sqn2Vec is to address the limitations of two existing approaches: pattern-based methods often produce sparse and high-dimensional feature vectors, while sequence embedding methods in natural language processing may fail on data sets with a small vocabulary.

6) SUMMARY OF FEATURE-BASED METHODS
Roughly, existing feature-based sequence classification methods can be divided into the above five categories. Each of these methods has its pros and cons, which we discuss briefly next.

First, using k-grams as features without feature selection is simple and effective in practice. However, the feature length k cannot be large and many redundant features may be included.

Second, in the pattern-based method, the length of a feature is not restricted as long as the feature satisfies the given constraints, and redundant features can be filtered out in some formulations. However, it is a non-trivial task to efficiently mine patterns that can satisfy the constraints.

Third, sequence classification methods based on adaptive feature selection can automatically select features from the set of all subsequences. The basic idea is to integrate the feature selection and classifier construction into the same procedure. Hence, these methods are classifier-dependent in nature.
In the second step, we generate the set of candidate reference sequences CR from the alphabet $I$. Note that any sequence over $I$ can be a member of CR. In other words, CR can be an infinite set. In practice, some constraints will be imposed on the potential members of CR. For instance, the pattern-based methods only consider subsequences of TrainD as members of CR under our framework, which will be further discussed in Section IV. Furthermore, the use of different construction methods for building the candidate set CR will lead to the generation of many new feature-based sequence classification methods.

In the third step, we select a subset of sequences R from CR as the landmark sequences for generating features. That is, each reference sequence will correspond to a transformed feature. The critical issue in this step is how to design an effective pivot sequence selection method. To date, existing pattern-based methods typically utilize some simple criteria to conduct the reference sequence selection task. For example, those methods based on frequent subsequences use the minimal support constraint as the criterion for reference sequence selection. Apparently, many new and interesting pivot sequence selection methods remain unexplored under our framework. In the subsequent paragraphs of this subsection, we list some commonly used criteria for selecting reference sequences from the set of candidate pivot sequences.

Constraint 1 (Gap Constraint [11]): Given two sequences $s = \langle s_1, s_2, \ldots, s_l \rangle$ and $t = \langle t_1, t_2, \ldots, t_r \rangle$, if $t$ is a subsequence of $s$ such that $t_1 = s_{i_1}, t_2 = s_{i_2}, \ldots, t_r = s_{i_r}$, the gap between $i_k$ and $i_{k+1}$ is defined as $Gap(s, i_k, i_{k+1}) = i_{k+1} - i_k - 1$. Given two thresholds mingap and maxgap ($0 \le mingap \le maxgap$), if $mingap \le Gap(s, i_k, i_{k+1}) \le maxgap$ for all $1 \le k \le r - 1$, then the occurrence of $t$ in $s$ fulfills the gap constraint.

Constraint 2 (Minsup Constraint [12]): Given a set of sequences $D_{c_i}$ with the class label $c_i$ and a sequence $t$, $count_{D_{c_i}}(t)$ is used to denote the number of sequences in $D_{c_i}$ that contain $t$ as a subsequence. The support of $t$ in $D_{c_i}$ is defined as $sup_{D_{c_i}}(t) = count_{D_{c_i}}(t) / |D_{c_i}|$. Given a positive threshold minsup, if $sup_{D_{c_i}}(t) \ge minsup$, then $t$ satisfies the minsup constraint and $t$ is a frequent sequential pattern in $D_{c_i}$.

Constraint 3 (Mindisc Constraint [48]): Given two class labels $c_1$ and $c_2$, a sequence $t$ is said to be a discriminative pattern if it is over-expressed on $D_{c_1}$ against $D_{c_2}$ (or vice versa). To evaluate the discriminative power, many measures/functions have been proposed in the literature [48]. If the discriminative function value of $t$ can pass certain constraints, then it satisfies the mindisc constraint. Here we just list some measures that have been used for selecting discriminative patterns in sequence classification.

• Discriminative Function (DF) 1 [12]:
$sup_{D_{c_1}}(t) > minsup, \quad sup_{D_{c_2}}(t) \le minsup$,   (III.1)
where minsup is a given support threshold.

• Discriminative Function (DF) 2 [11]:
$occ_{D_{c_1}}(t) > mincount, \quad occ_{D_{c_2}}(t) \le mincount$,   (III.2)
where $occ_{D_{c_1}}(t) = occount_{D_{c_1}}(t) / |D_{c_1}|$ and mincount is a given threshold. Here $occount_{D_{c_1}}(t)$ is the number of non-overlapping occurrences of $t$ in $D_{c_1}$.

• Discriminative Function (DF) 3 [12]:
$supdiff = sup_{D_{c_1}}(t) - sup_{D_{c_2}}(t)$.   (III.3)

• Discriminative Function (DF) 4 [11]:
$F\text{-}ratio = \dfrac{Occ_{between}}{Occ_{within}}$,   (III.4)
where
$Occ_{between} = |D_{c_1}|\left(occ_{D_{c_1}}(t) - \dfrac{occ_{D_{c_1}}(t) + occ_{D_{c_2}}(t)}{2}\right)^2 + |D_{c_2}|\left(occ_{D_{c_2}}(t) - \dfrac{occ_{D_{c_1}}(t) + occ_{D_{c_2}}(t)}{2}\right)^2$,
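Constraint 2 and DF 3 above are straightforward to operationalize. A minimal sketch follows; the helper names (`is_subsequence`, `support`, `supdiff`) are our own, and the containment test treats $t$ as a not-necessarily-contiguous subsequence, as in the definitions above:

```python
def is_subsequence(t, s):
    """True if t occurs in s as a (not necessarily contiguous) subsequence."""
    it = iter(s)
    return all(item in it for item in t)

def support(t, D):
    """sup_D(t): fraction of sequences in D that contain t as a subsequence."""
    return sum(is_subsequence(t, s) for s in D) / len(D)

def is_frequent(t, D_ci, minsup):
    """Constraint 2: t is a frequent sequential pattern in D_ci."""
    return support(t, D_ci) >= minsup

def supdiff(t, D_c1, D_c2):
    """DF 3: support difference of t between the two classes."""
    return support(t, D_c1) - support(t, D_c2)
```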
FIGURE 3. The process of feature value generation, model construction and prediction.
where $C(t, s)$ is the cohesion of $t$ in the sequence $s$.

• Similarity Function (SF) 4 [18]:
$Sim(s, t) = \begin{cases} occnum, & \text{if } t \subseteq s, \\ 0, & \text{otherwise,} \end{cases}$   (III.9)
where occnum is the number of occurrences of $t$ in $s$.

• Similarity Function (SF) 5 [11]:
$Sim(s, t) = \begin{cases} occount_s(t), & \text{if } t \subseteq s, \\ 0, & \text{otherwise,} \end{cases}$   (III.10)
where $occount_s(t)$ is the number of non-overlapping occurrences of $t$ in $s$.

• Similarity Function (SF) 6 [19]:
$Sim(s, t) = \dfrac{|LCS(s, t)|}{\max\{|s|, |t|\}}$,   (III.11)
where $|LCS(s, t)|$ is the length of the longest common subsequence, and $|s|$ and $|t|$ are the lengths of $s$ and $t$, respectively.

C. MODEL CONSTRUCTION AND PREDICTION
In the third stage of the presented framework, we construct a prediction model to make predictions. As shown in the right part of Fig. 3, this procedure can be further divided into three steps: model construction, prediction and classification result generation.

In the first step, an existing vectorial data classification method is used to construct a prediction model from the vectorial training set TrainD′, since we have transformed the training sequences into feature vectors in the second stage. Numerous classification methods have been designed for classifying feature vectors (e.g. support vector machines and decision trees) [4], [49]. After training a classifier with TrainD′, the prediction model is ready for classifying unknown samples.

In the second step, we forward the vectorial testing set TestD′ to the classifier to make predictions. In the third step, we output the prediction result and compute the classification accuracy by comparing the predicted class labels with the ground-truth labels.
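A minimal sketch of this whole third stage is given below, assuming the similarity-based transformation of the second stage; the choice of an SVM via scikit-learn's SVC is only one example of a vectorial classifier, and the function names are ours:

```python
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def transform(sequences, references, sim):
    """Map each sequence to its vector of similarities to the reference points."""
    return [[sim(s, r) for r in references] for s in sequences]

def train_and_evaluate(train_seqs, y_train, test_seqs, y_test, references, sim):
    X_train = transform(train_seqs, references, sim)   # TrainD'
    X_test = transform(test_seqs, references, sim)     # TestD'
    clf = SVC().fit(X_train, y_train)                  # model construction
    y_pred = clf.predict(X_test)                       # prediction
    return accuracy_score(y_test, y_pred)              # classification result
```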
IV. GENERAL FRAMEWORK FOR FEATURE-BASED CLASSIFICATION
In this section, we show that many existing feature-based sequence classification algorithms can be reformulated within the presented reference-based framework. The differences between these algorithms mainly lie in the selection of reference points and similarity functions. As summarized in Table 1, we can categorize these existing methods according to three criteria: (1) How to construct the candidate set of reference points? (2) How to choose a set of reference points? (3) Which similarity function should be used? Note that the definitions and notations for the different constraints and similarity functions have been presented in Section III-A and Section III-B.

TABLE 1. The categorization of some existing feature-based sequence classification algorithms under our framework.

From Table 1, we have the following observations.

First of all, any sequence over the alphabet can be a potential member of the candidate set of reference points CR. However, all feature-based sequence classification algorithms in Table 1 use SubTrainD to construct CR, since the idea of using subsequences as features is quite natural and offers
good interpretability. Although SubTrainD is a finite set, its size is still very large and most sequences in SubTrainD are useless and redundant for classification. Therefore, it is necessary to explore alternative methods for constructing the set of candidate reference points. For instance, we may use all original sequences in TrainD to construct CR, so that the size of CR will be greatly reduced and the corresponding features may be more representative.

Second, many sequence selection criteria have been proposed to select R from CR, such as minsup and mindisc. The main objective of applying these criteria is to select a subset of sequences that can generate good features for building the classifier. However, it is not an easy task to set suitable thresholds for these constraints so as to produce a set of reference sequences of moderate size. More importantly, most of these constraints come from the literature of sequential pattern mining, and may be only applicable to the selection of reference sequences from SubTrainD. In other words, more general reference point selection strategies should be developed.

Last, the most widely used similarity function in Table 1 is SF 1, which is a boolean function based on whether the reference point is a subsequence of the sequence in TrainD. Although some non-boolean functions have been used, the potential of utilizing more elaborate similarity functions between two sequences still needs further investigation.

Overall, our reference-based sequence classification framework is quite generic, in which many existing pattern-based sequence classification methods can be reformulated as its special variants. Meanwhile, there are still many limitations in current research efforts under this framework. Hence, new and effective sequence classification methods should be developed in this direction.

V. NEW VARIANTS UNDER THE FRAMEWORK
In addition to encompassing existing pattern-based methods, this framework can also be used as a general platform to design new feature-based sequence classification methods. As discussed in Section IV, there are three key ingredients in our framework: the construction of the candidate reference point set, the selection of reference points and the selection of the similarity function. Obviously, we can generate a ''new'' sequence classification algorithm from any unexplored combination of these three components. In view of the fact that the number of possible combinations is quite large, it is infeasible to enumerate all these variants. Instead, we only present two variants that are quite different from existing algorithms to demonstrate the advantage of this framework.

A. THE USE OF TRAINING SET AS THE CANDIDATE SET
Within our framework, all previous pattern-based sequence classification methods utilize the set SubTrainD as the candidate reference point set CR in the first step. One limitation of this strategy is that the actual size of CR will be very large. As a result, it poses great challenges for the reference point selection task in the subsequent step. To alleviate these issues, we propose to use all original sequences in TrainD to construct the set of candidate reference points. The rationale for this candidate set construction method is based on the following observations.

Firstly, all information given for building the classifier is contained in the original training set. In other words, we will not lose any relevant information for the classification task if TrainD is used as the candidate set of reference sequences. In fact, the widely used candidate set SubTrainD is derived from TrainD.
Secondly, even if we use all the training sequences in TrainD as the reference points, the transformed vectorial data will be a |TrainD| × |TrainD| table. That is, the number of features is still no larger than the number of samples. Therefore, we do not need to analyze a HDLSS (high-dimension, low-sample-size) data set during the classification stage. In contrast, the number of features may be much larger than the number of samples in the vectorial data obtained from SubTrainD if the parameters are not properly specified during the reference point selection procedure. In fact, we have tested the performance when all training sequences are used as reference points. The experimental results show that this quite simple idea is able to achieve comparable performance in terms of classification accuracy.

Finally, the same idea has been employed in the literature of time series classification [42], [43]. Its success motivates us to investigate its feasibility and advantages in the context of discrete sequence classification.
B. TWO REFERENCE POINT SELECTION METHODS
To select reference sequences from TrainD, the existing constraints proposed in the context of sequential pattern mining are not applicable. Therefore, we have to develop new algorithms to choose a subset of representative reference sequences from TrainD. To this end, two different reference sequence selection methods are presented. The first one is an unsupervised method, which selects reference sequences based on cluster analysis without considering the class label information. The second one is a supervised method, which evaluates each candidate sequence according to its discriminative ability across different classes. In the following two sub-sections, we present the details of these two reference point selection algorithms.

1) UNSUPERVISED REFERENCE POINT SELECTION
As we have discussed in Section V-A, we may choose all sequences in the training set as reference points. However, the number of features in the transformed vectorial data can still be very large if the number of training instances is large. The selection of a small subset of representative training sequences as reference points will greatly reduce the computational burden in the subsequent stage. One natural idea is to divide the training sequences in CR into different clusters using a clustering algorithm [50]. Then, we can select a representative sequence from each cluster as the reference point.

To date, many algorithms have been presented for clustering discrete sequences (e.g. [51]). We can simply adopt an existing sequence clustering algorithm in our pipeline. Here we choose the Group-average Agglomerative Hierarchical Clustering (GAHC) algorithm [52] to fulfill the sequence clustering task. This algorithm is used because it can often generate a high-quality clustering result and can handle any form of similarity measure.

In the following, we describe the details of the reference point selection method based on GAHC.

In the first stage, the i-th sequence in CR forms a singleton cluster Ci.

In the second stage, a similarity function is used to calculate the similarity between each pair of clusters to produce a similarity matrix Sim, where Sim[i, j] is the similarity between the two clusters Ci and Cj. Many similarity measures have been presented for sequential data (e.g. [53]). Here we choose the Jaccard coefficient. More specific details on the similarity function will be discussed in Section V-C.

In the third stage, we first search the similarity matrix Sim to identify the maximum value maxSim, which corresponds to the most similar pair of clusters Ck and Cl. Then, these two clusters are merged to form a new cluster Ck and the total number of clusters is decreased by 1. Meanwhile, the entries related to Cl in Sim are set to 0 and Sim is updated by recalculating the similarity between Ck and each of the remaining clusters. Since we use the group-average method, the similarity between the newly generated cluster and each of the remaining clusters is calculated as the average similarity between all members of the two clusters. We repeat the third stage until the number of clusters is equal to the number of reference points we want to select.

In the last stage, we select a representative sequence from each cluster. For each cluster, any sequence in this cluster can be used as a representative. To provide a consistent and deterministic output, we use the sequence with the minimum subscript in each cluster as the reference point.
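The following sketch summarizes the four stages. It is a plain, unoptimized implementation written for clarity: it recomputes the group-average linkage directly from the base pairwise matrix rather than updating Sim in place, which gives the same merges; the function names are ours:

```python
def gahc_select(CR, sim, num_points):
    """Select num_points reference sequences from CR by group-average
    agglomerative clustering, taking the minimum-subscript member of each
    final cluster as its deterministic representative."""
    clusters = [[i] for i in range(len(CR))]            # stage 1: singletons
    S = [[sim(CR[i], CR[j]) for j in range(len(CR))] for i in range(len(CR))]

    def group_avg(a, b):                                # group-average linkage
        return sum(S[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > num_points:                   # stage 3: merge
        a, b = max(((x, y) for x in range(len(clusters))
                    for y in range(x + 1, len(clusters))),
                   key=lambda p: group_avg(clusters[p[0]], clusters[p[1]]))
        clusters[a] += clusters[b]
        del clusters[b]
    # stage 4: one representative per cluster, minimum subscript first
    return [CR[min(c)] for c in clusters]
```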
2) SUPERVISED REFERENCE POINT SELECTION
To choose a subset of representative reference sequences from TrainD, we can also employ a supervised method in which the class label information is utilized. As we have discussed in Section IV, different mindisc constraints have been widely used to evaluate the discriminative power of sequential patterns. Unfortunately, these constraints are only applicable to the selection of reference points from SubTrainD. In addition, it is not an easy task to set suitable thresholds to control the number of selected reference points. In order to overcome these limitations, we present a reference point selection method based on hypothesis testing, in which the statistical significance in terms of the p-value is used to assess the discriminative power of each candidate sequence.

Hypothesis testing is a commonly used method in statistical inference. The usual line of reasoning is as follows: first, formulate the null hypothesis and the alternative hypothesis; second, select an appropriate test statistic; third, set a significance level threshold; finally, reject the null hypothesis if and only if the p-value is less than the significance level threshold, where the p-value is the probability of getting a value of the test statistic that is at least as extreme as what is actually observed, on condition that the null hypothesis is true.

In order to assess the discriminative power of each candidate sequence in terms of the p-value, we can use the null hypothesis that this sequence does not belong to any class and that all sequences from different classes are drawn from the same population. If the above null hypothesis is true, then the similarities between the candidate sequence and the training sequences are drawn from the same population. Therefore, we can formulate the corresponding hypothesis testing problem as a two-sample testing problem [54], where one sample is the set of similarities between the candidate sequence and the training sequences from one target class, and the other sample is the set of similarities between the candidate sequence and the training sequences from the remaining classes.

Since we test all candidate sequences in CR at the same time, it is actually a multiple hypothesis testing problem. If no multiple testing correction is conducted, then the number of false positives among the reported reference sequences may be very high. To tackle this problem, we adopt the BH procedure to control the FDR (False Discovery Rate) [55], which is the expected proportion of false positives among all reported sequences.

The reference point selection method based on MHT (Multiple Hypothesis Testing) is shown in Algorithm 1. In the following, we elaborate on this algorithm in detail.
Algorithm 1 Reference Point Selection Based on MHT
Input: Candidate reference sequence set CR, significance level α
Output: Reference point set R
1: R ← ∅;
2: for each D_ci in CR do
3:   D+ ← D_ci;
4:   D− ← CR − D_ci;
5:   for each sequence S_k in D+ do
6:     Sim+ ← ∅;
7:     Sim− ← ∅;
8:     for each sequence S_j in D+ do
9:       calculate Sim[k, j];
10:      Sim+ ← Sim+ ∪ {Sim[k, j]};
11:    end for
12:    for each sequence S_j in D− do
13:      calculate Sim[k, j];
14:      Sim− ← Sim− ∪ {Sim[k, j]};
15:    end for
16:    S_k.pvalue ← Utest(Sim+, Sim−);
17:  end for
18:  sort D+;
19:  maxindex ← 0;
20:  for each sequence S_k in D+ do
21:    if S_k.pvalue ≤ αk/|D+| then
22:      maxindex ← k;
23:    end if
24:  end for
25:  for k ← maxindex + 1 to |D+| do
26:    D+ ← D+ − {S_k};
27:  end for
28:  R ← R ∪ D+;
29: end for
30: return R;

In the first stage (steps 1-4), we select the set of sequences $D_{c_i}$ with the class label $c_i$ from CR; then we regard $D_{c_i}$ as the positive data set D+ and use the set of all remaining sequences in CR as the negative data set D−.

In the second stage (steps 5-17), for each sequence $S_k$ in D+, a similarity function is used to calculate the similarity between $S_k$ and each sequence in D+ and D−, where the similarity function is the same as that used in Section V-B1 and Sim[k, j] is the similarity between the two sequences $S_k$ and $S_j$. Then, the Mann-Whitney U test [56] is used to calculate the p-value based on the two similarity sets Sim+ and Sim−.

In the third stage (steps 18-27), the BH method first sorts the sequences in D+ according to their corresponding p-values in ascending order, i.e., $D^+ = S_1, S_2, \ldots, S_{|D^+|}$ ($S_1.pvalue \le S_2.pvalue \le \ldots \le S_{|D^+|}.pvalue$). Then, we sequentially search D+ to identify the maximal sequence index maxindex which satisfies the condition $S_k.pvalue \le \alpha k / |D^+|$, where α is the significance level threshold. Those sequences whose indices are larger than maxindex are removed from D+.

In the last stage (steps 28-30), we select all remaining sequences in D+ as reference points. The whole process terminates after the set of sequences from every class has been regarded as D+.
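A compact sketch of Algorithm 1 under these definitions, using scipy's Mann-Whitney U test; the data-structure choices (a dictionary mapping each class label to its sequence set) are ours:

```python
from scipy.stats import mannwhitneyu

def mht_select(CR_by_class, sim, alpha=0.05):
    """Reference point selection based on MHT. CR_by_class maps each class
    label c_i to its sequence set D_ci; sim is the similarity function of
    Section V-B1 (e.g., the modified Jaccard coefficient)."""
    R = []
    for ci, D_pos in CR_by_class.items():
        D_neg = [s for cj, D in CR_by_class.items() if cj != ci for s in D]
        # steps 5-17: one Mann-Whitney U test per candidate sequence
        pvals = []
        for sk in D_pos:
            sim_pos = [sim(sk, sj) for sj in D_pos]
            sim_neg = [sim(sk, sj) for sj in D_neg]
            pvals.append(mannwhitneyu(sim_pos, sim_neg,
                                      alternative='two-sided').pvalue)
        # steps 18-27: Benjamini-Hochberg cut-off at level alpha
        order = sorted(range(len(D_pos)), key=lambda k: pvals[k])
        m, maxindex = len(D_pos), 0
        for rank, k in enumerate(order, start=1):
            if pvals[k] <= alpha * rank / m:
                maxindex = rank
        # steps 28-30: keep only the sequences below the cut-off
        R.extend(D_pos[k] for k in order[:maxindex])
    return R
```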
C. SIMILARITY FUNCTION
In order to measure the similarity between two sequences, we choose the Jaccard coefficient as the similarity function in our method. The larger the Jaccard coefficient between two sequences is, the more similar they are.

Given two sequences $s = \langle s_1, s_2, \ldots, s_l \rangle$ and $t = \langle t_1, t_2, \ldots, t_r \rangle$, the Jaccard coefficient is defined as:
$J(s, t) = \dfrac{|s \cap t|}{|s| + |t| - |s \cap t|}$,   (V.1)
where $|s \cap t|$ is the number of items in the intersection of $s$ and $t$. However, this may lose the order information of the sequences. To alleviate this issue, we use the LCS (Longest Common Subsequence) between $s$ and $t$ to replace $s \cap t$. Then, the Jaccard coefficient is redefined as:
$J(s, t) = \dfrac{|LCS(s, t)|}{|s| + |t| - |LCS(s, t)|}$.   (V.2)

Example 1: Given two sequences $s = \langle a, b, c, d, e \rangle$ and $t = \langle e, c, d, c \rangle$, $LCS(s, t)$ is $\langle c, d \rangle$, so the modified Jaccard coefficient is
$J(abcde, ecdc) = \dfrac{2}{5 + 4 - 2} \approx 0.286$.
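A direct implementation of Eq. (V.2) is straightforward once the LCS length is available; the sketch below (function names are ours) reproduces Example 1:

```python
def lcs_length(s, t):
    """Length of the longest common subsequence of s and t (textbook DP)."""
    dp = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i, x in enumerate(s):
        for j, y in enumerate(t):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x == y
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[-1][-1]

def jaccard_lcs(s, t):
    """Modified Jaccard coefficient of Eq. (V.2)."""
    l = lcs_length(s, t)
    return l / (len(s) + len(t) - l)

print(jaccard_lcs("abcde", "ecdc"))  # 2 / (5 + 4 - 2) = 0.2857...
```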
Note that we can also use other similarity functions from the literature, such as those summarized and reviewed in [53]. The choice of a more appropriate similarity function may yield better performance than the modified Jaccard coefficient. In order to check the effect of the similarity function on the classification performance, we also consider the following two alternative similarity functions.
The first one is the String Subsequence Kernel (SSK) [36]. The main idea of SSK is to compare two sequences by means of the subsequences they contain in common: the more subsequences in common, the more similar they are.

Given two sequences $s = \langle s_1, s_2, \ldots, s_l \rangle$ and $t = \langle t_1, t_2, \ldots, t_r \rangle$ and a parameter $n$, the SSK is defined as:
$K_n(s, t) = \langle \Phi(s), \Phi(t) \rangle = \sum_{u \in I^n} \phi_u(s)\,\phi_u(t) = \sum_{u \in I^n} \sum_{u \subseteq s} \lambda^{l_s(u)} \sum_{u \subseteq t} \lambda^{l_t(u)} = \sum_{u \in I^n} \sum_{u \subseteq s} \sum_{u \subseteq t} \lambda^{l_s(u) + l_t(u)}$,   (V.3)
where $\phi_u(s)$ is the feature mapping for the sequence $s$ and each $u \in I^n$, $I$ is a finite alphabet, $I^n$ is the set of all subsequences of length $n$, $u$ is a subsequence of $s$ such that $u_1 = s_{i_1}, u_2 = s_{i_2}, \ldots, u_n = s_{i_n}$, $l_s(u) = i_n - i_1 + 1$ is the length of $u$ in $s$, and $\lambda \in (0, 1)$ is a decay factor which is used to penalize gaps. The calculation steps are as follows: enumerate all subsequences of length $n$, compute the feature vectors for the two sequences, and then compute the similarity. The normalized kernel value is given by
$\hat{K}_n(s, t) = \dfrac{K_n(s, t)}{\sqrt{K_n(s, s)\,K_n(t, t)}}$.   (V.4)

Example 2: Given two sequences $s = \langle a, b, c, d, e \rangle$ and $t = \langle e, c, d, c \rangle$, the subsequences of length 1 ($n = 1$) are $a, b, c, d, e$. The corresponding feature vectors can be denoted as $\phi_1(s) = \langle \lambda, \lambda, \lambda, \lambda, \lambda \rangle$ and $\phi_1(t) = \langle 0, 0, 2\lambda, \lambda, \lambda \rangle$, so the normalized kernel value is
$\hat{K}_1(abcde, ecdc) = \dfrac{K_1(abcde, ecdc)}{\sqrt{K_1(abcde, abcde)\,K_1(ecdc, ecdc)}} = \dfrac{4\lambda^2}{\sqrt{5\lambda^2 \times 6\lambda^2}} \approx 0.73$.
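For $n = 1$ the kernel reduces to a weighted symbol-count match, and the decay factor cancels after normalization. The following sketch (function names and the default lam value are ours) reproduces Example 2:

```python
import math
from collections import Counter

def ssk1(s, t, lam=0.5):
    """SSK of Eq. (V.3) for n = 1: each common symbol u contributes
    lam^2 * count_s(u) * count_t(u)."""
    cs, ct = Counter(s), Counter(t)
    return lam ** 2 * sum(cs[u] * ct[u] for u in cs.keys() & ct.keys())

def ssk1_normalized(s, t, lam=0.5):
    """Normalized kernel of Eq. (V.4); lam cancels for n = 1."""
    return ssk1(s, t, lam) / math.sqrt(ssk1(s, s, lam) * ssk1(t, t, lam))

print(ssk1_normalized("abcde", "ecdc"))  # 4 / sqrt(5 * 6) = 0.7302...
```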
When this function is employed in our method, $n = 1$ is used as the default parameter setting. Although the setting of $n = 1$ may lose the order information, it greatly reduces the computational cost and can provide satisfactory results in practice.

Another alternative similarity function is the normalized LCS. The larger the normalized LCS between two sequences is, the more similar they are.

Given two sequences $s = \langle s_1, s_2, \ldots, s_l \rangle$ and $t = \langle t_1, t_2, \ldots, t_r \rangle$, the normalized LCS is defined as:
$Sim(s, t) = \dfrac{|LCS(s, t)|}{\min\{|s|, |t|\}}$,   (V.5)
where $|LCS(s, t)|$ is the length of the longest common subsequence, and $|s|$ and $|t|$ are the lengths of $s$ and $t$.

Example 3: Given two sequences $s = \langle a, b, c, d, e \rangle$ and $t = \langle e, c, d, c \rangle$, $LCS(s, t)$ is $\langle c, d \rangle$, so the normalized LCS is
$Sim(abcde, ecdc) = \dfrac{2}{4} = 0.5$.

VI. EXPERIMENTS
To demonstrate the feasibility and advantages of this new framework, we conducted experiments on fourteen real sequential data sets. We compared our two algorithms derived under the reference-based framework with other sequence classification algorithms in terms of classification accuracy. All experiments were conducted on a PC with an Intel(R) Xeon(R) CPU at 2.40 GHz and 12 GB memory. All the reported accuracies in the experiments are the average accuracies obtained by repeating 5-fold cross-validation 5 times, except for SCIP (accuracies for SCIP were obtained using 10-fold cross-validation because this is a fixed setting in the software package provided by the authors).

A. DATA SETS
We choose fourteen benchmark data sets which are widely used for evaluating sequence classification algorithms: Activity [57], Aslbu [14], Auslan2 [14], Context [58], Epitope [12], Gene [59], News [5], Pioneer [14], Question [60], Reuters [5], Robot [5], Skating [14], Unix [5] and Webkb [5]. The main characteristics of these data sets are summarized in Table 2, where |D| represents the number of sequences in the data set, #items denotes the number of distinct elements, minl, maxl and avgl denote the minimum, maximum and average length of the sequences, respectively, and #classes represents the number of distinct classes in the data set.

TABLE 2. Summary of the sequential data sets used in the experiments.

B. PARAMETER SETTINGS
Our two algorithms are denoted by R-MHT (Reference Point Selection Based on MHT) and R-GAHC (Reference Point Selection Based on GAHC), respectively. In addition, the method that uses all sequences in TrainD as reference points is denoted as R-A, which is also included in the performance comparison. We compare our algorithms with five existing sequence classification algorithms: MiSeRe [17] (http://www.misere.co.nf),
Sqn2Vec [41] (https://github.com/nphdang/Sqn2Vec), SCIP [5] (http://adrem.ua.ac.be/sites/adrem.ua.ac.be/files/SCIP.zip), FSP (the algorithm based on frequent sequential patterns) and DSP (the algorithm based on discriminative sequential patterns).

In MiSeRe, num_of_rules is specified to be 1024 and execution_time is set to 5 minutes for all data sets.

Sqn2Vec is an unsupervised method for learning sequence embeddings from both singleton symbols and sequential patterns. It has two variants: Sqn2VecSEP and Sqn2VecSIM, where Sqn2VecSEP (Sqn2VecSIM) generates sequence representations from singleton symbols and sequential patterns separately (simultaneously). In these two variants, minsup = 0.05, maxgap = 4 and the embedding dimension d is set to 128 for all data sets.

SCIP is a sequence classification method based on interesting patterns, which has four different variants: SCII_HAR, SCII_MA, SCIS_HAR and SCIS_MA. In the experiments, the following parameter setting is used for all data sets: minsup = 0.05, minint = 0.02, maxsize = 3, conf = 0.5 and topk = 11.

Frequent sequential patterns have been widely used as features in sequence classification. To include the algorithm based on frequent sequential patterns in the comparison (denoted by FSP), we employ the PrefixSpan algorithm [61] as the frequent sequential pattern mining algorithm. The parameters are specified as follows: maxsize = 3 and minsup = 0.3 for all data sets except Context (the minsup for Context is set to 0.9 in order to avoid the generation of too many patterns).

Similarly, discriminative sequential patterns are widely used as features in many sequence classification algorithms and applications as well. To include the algorithm based on discriminative sequential patterns in the comparison (denoted by DSP), we first use the PrefixSpan algorithm to mine a set of frequent sequential patterns and then detect discriminative patterns from the frequent pattern set. The parameters for PrefixSpan are identical to those used in FSP, and minGR = 3 is used as the threshold for filtering discriminative sequential patterns.

C. RESULTS
In Table 3, the detailed performance comparison results in terms of classification accuracy are presented. Note that the result of DSP on the Skating data set is N/A because we cannot find any discriminative patterns in this data set under the given parameter setting. In the experiments, α = 0.05 is used for R-MHT and pointnum is specified to be 1/10 of the size of TrainD for R-GAHC. After transforming sequences into feature vectors, we chose NB (Naive Bayes), DT (Decision Tree), SVM (Support Vector Machine) and KNN (k Nearest Neighbors) as the classifiers. The implementation of each classifier was obtained from WEKA [62], except for Sqn2Vec, where all classifiers were obtained from scikit-learn [63] since its source code is written in Python.

In order to have a global picture of the overall performance of the different algorithms, we calculate the average accuracy over all data sets for each classifier. The corresponding average accuracies for the different methods are recorded in Table 4.

TABLE 4. The average classification accuracies of different methods over all data sets used in the experiment.

The results show that, among our two methods, R-MHT can achieve better performance than R-GAHC when NB, DT and SVM are used as the classifier. However, R-MHT has a bad performance when KNN is used as the classifier. Since we select a representative sequence from each cluster in R-GAHC and any sequence in a cluster can be used as a representative, we may miss the most representative sequence. Meanwhile, the choice of the clustering method and the specification of the number of clusters will influence the results.

In addition, the R-A method outperforms R-MHT and R-GAHC, since we do not lose any relevant information for the classification task when all training sequences are used as reference points. However, the feature dimension will be very high in R-A, which will incur a high computational cost in practice.

Compared with the other classification methods, our methods are able to achieve comparable performance. In particular, R-A and MiSeRe [17] achieve the highest average classification accuracy among all competitors, since all information given for building the classifier is contained in the reference point set in R-A. The reason why R-MHT and R-GAHC are slightly worse may be that their reference points are less distinct from each other in different classes and some sequences that are important for classification are missed. This is quite remarkable, since R-A is a very simple algorithm derived from our framework; it indicates that the proposed reference-based sequence classification framework is quite useful in practice. It can be expected that more accurate feature-based sequence classification methods will be developed under this framework in the future. From Table 3 and Table 4, it can also be observed that none of the algorithms in the comparison can always achieve the best performance across all data sets.
Therefore, more research efforts should still be devoted to the development of effective sequence classification algorithms.

TABLE 5. The average classification accuracies of different similarity functions over all data sets used in the experiment.

The use of different similarity functions may affect the performance of our algorithms. To investigate this issue, we use two additional similarity functions in the experiments for comparison: SSK and the normalized LCS, whose details have been introduced in Section V-C.

Table 5 presents the average classification accuracies of the different similarity functions over all data sets. The Jaccard coefficient, SSK and normalized LCS are denoted as J, S and N, respectively. In Table 5, R-A-J means that the Jaccard coefficient is used as the similarity function in R-A; the other notations in this table can be interpreted in a similar manner. The results show that the use of different similarity functions can affect the performance of our algorithms. Among these three similarity functions, the use of the Jaccard coefficient as the similarity function achieves better performance in most cases. However, R-MHT-J has unsatisfactory performance when KNN is used as the classifier. It can also be observed that none of the similarity functions is always the best performer. Therefore, more suitable similarity functions should be developed.

The above experimental results and analysis show that the proposed new methods based on our framework can achieve comparable performance to state-of-the-art sequence classification algorithms, which demonstrates the feasibility and advantages of our framework. Our framework is quite general and flexible, since the selection of both reference points and similarity functions is arbitrary. However, since the feature selection and classifier construction in our framework are separate and any existing vectorial data classification method can be used to tackle the sequence classification problem, some features that are critical to the classifier may be filtered out during the selection process.

VII. CONCLUSION
In this paper, we present a reference-based sequence classification framework by generalizing the pattern-based methods. This framework is quite general and flexible, and can be used as a general platform to develop new algorithms for sequence classification. To verify this point, we present several new feature-based sequence classification algorithms under this new framework. A series of comprehensive experiments on real data sets shows that our methods are capable of achieving better classification accuracy than existing sequence classification algorithms. Thus, the reference-based sequence classification framework is quite promising and useful in practice.

In future work, we intend to explore more appropriate reference sequence selection methods and similarity functions to improve the performance and reduce the computational cost. As a result, more accurate feature-based sequence classification methods would be derived under this framework.

REFERENCES
[1] J. Han, J. Pei, and M. Kamber, Data Mining: Concepts and Techniques. Amsterdam, The Netherlands: Elsevier, 2011.
[2] M. Deshpande and G. Karypis, ''Evaluation of techniques for classifying biological sequences,'' in Proc. 6th Pacific–Asia Conf. Adv. Knowl. Discovery Data Mining. Berlin, Germany: Springer, 2002, pp. 417–431.
[3] Z. Xing, J. Pei, and E. Keogh, ''A brief survey on sequence classification,'' ACM SIGKDD Explor. Newslett., vol. 12, no. 1, pp. 40–48, Nov. 2010.
[4] E. Cernadas and D. Amorim, ''Do we need hundreds of classifiers to solve real world classification problems?'' J. Mach. Learn. Res., vol. 15, no. 1, pp. 3133–3181, 2014.
[5] C. Zhou, B. Cule, and B. Goethals, ''Pattern based sequence classification,'' IEEE Trans. Knowl. Data Eng., vol. 28, no. 5, pp. 1285–1298, May 2016.
[6] T. P. Exarchos, M. G. Tsipouras, C. Papaloukas, and D. I. Fotiadis, ''A two-stage methodology for sequence classification based on sequential pattern mining and optimization,'' Data Knowl. Eng., vol. 66, no. 3, pp. 467–487, Sep. 2008.
[7] D. Lo, H. Cheng, J. Han, S.-C. Khoo, and C. Sun, ''Classification of software behaviors for failure detection: A discriminative pattern mining approach,'' in Proc. 15th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2009, pp. 557–566.
[8] R. She, F. Chen, K. Wang, M. Ester, J. L. Gardy, and F. S. Brinkman, ''Frequent-subsequence-based prediction of outer membrane proteins,'' in Proc. 9th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2003, pp. 436–445.
[9] T. Hopf and S. Kramer, ''Mining class-correlated patterns for sequence labeling,'' in Proc. Int. Conf. Discovery Sci. Berlin, Germany: Springer, 2010, pp. 311–325.
[10] H. Haleem, P. K. Sharma, and M. M. S. Beg, ''Novel frequent sequential patterns based probabilistic model for effective classification of Web documents,'' in Proc. Int. Conf. Comput. Commun. Technol. (ICCCT), Sep. 2014, pp. 361–371.
[11] K. Deng and O. R. Zaïane, ''An occurrence based approach to mine emerging sequences,'' in Proc. Int. Conf. Data Warehousing Knowl. Discovery. Berlin, Germany: Springer, 2010, pp. 275–284.
[12] K. Deng and O. R. Zaïane, ''Contrasting sequence groups by emerging sequences,'' in Proc. 12th Int. Conf. Discovery Sci. Berlin, Germany: Springer, 2009, pp. 377–384.
[13] Z. He, S. Zhang, and J. Wu, ''Significance-based discriminative sequential pattern mining,'' Expert Syst. Appl., vol. 122, pp. 54–64, May 2019.
[14] D. Fradkin and F. Mörchen, ''Mining sequential patterns for classification,'' Knowl. Inf. Syst., vol. 45, no. 3, pp. 731–749, Dec. 2015.
[15] H. Yahyaoui and A. Al-Mutairi, ''A feature-based trust sequence classification algorithm,'' Inf. Sci., vol. 328, pp. 455–484, Jan. 2016.
[16] C.-H. Lee, ''A multi-phase approach for classifying multi-dimensional sequence data,'' Intell. Data Anal., vol. 19, no. 3, pp. 547–561, Jun. 2015.
[17] E. Egho, D. Gay, M. Boullé, N. Voisine, and F. Clérot, ''A user parameter-free approach for mining robust sequential classification rules,'' Knowl. Inf. Syst., vol. 52, no. 1, pp. 53–81, Jul. 2017.
[18] A. N. Ntagiou, M. G. Tsipouras, N. Giannakeas, and A. T. Tzallas, ''Protein structure recognition by means of sequential pattern mining,'' in Proc. IEEE 17th Int. Conf. Bioinf. Bioeng. (BIBE), Oct. 2017, pp. 334–339.
[19] C.-Y. Tsai and C.-J. Chen, ''A PSO-AB classifier for solving sequence classification problems,'' Appl. Soft Comput., vol. 27, pp. 11–27, Feb. 2015.
[20] I. Batal, H. Valizadegan, G. F. Cooper, and M. Hauskrecht, ''A temporal pattern mining approach for classifying electronic health record data,'' ACM Trans. Intell. Syst. Technol., vol. 4, no. 4, p. 63, 2013.
[21] C.-Y. Tsai, C.-J. Chen, and C.-J. Chien, ''A time-interval sequence classification method,'' Knowl. Inf. Syst., vol. 37, no. 2, pp. 251–278, Nov. 2013.
[22] T. P. Exarchos, M. G. Tsipouras, C. Papaloukas, and D. I. Fotiadis, ''An optimized sequential pattern matching methodology for sequence classification,'' Knowl. Inf. Syst., vol. 19, no. 2, pp. 249–264, May 2009.
[23] V. S. M. Tseng and C.-H. Lee, ''CBS: A new classification method by using sequential patterns,'' in Proc. SIAM Int. Conf. Data Mining, Apr. 2005, pp. 596–600.
[24] V. S. Tseng and C.-H. Lee, ''Effective temporal data classification by integrating sequential pattern mining and probabilistic induction,'' Expert Syst. Appl., vol. 36, no. 5, pp. 9524–9532, Jul. 2009.
[25] Z. Syed, P. Indyk, and J. Guttag, ''Learning approximate sequential patterns for classification,'' J. Mach. Learn. Res., vol. 10, no. 8, pp. 1913–1936, 2009.
[26] N. Lesh, M. J. Zaki, and M. Ogihara, ''Mining features for sequence classification,'' in Proc. 5th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 1999, pp. 342–346.
[27] P. Rani and V. Pudi, ''RBNBC: Repeat based Naïve Bayes classifier for biological sequences,'' in Proc. 8th IEEE Int. Conf. Data Mining, 2008, pp. 989–994.
[28] P. Holat, M. Plantevit, C. Raïssi, N. Tomeh, T. Charnois, and B. Crémilleux, ''Sequence classification based on delta-free sequential patterns,'' in Proc. IEEE Int. Conf. Data Mining, Dec. 2014, pp. 170–179.
[29] J. K. Febrer-Hernández, R. Hernández-León, C. Feregrino-Uribe, and J. Hernández-Palancar, ''SPaC-NF: A classifier based on sequential patterns with high netconf,'' Intell. Data Anal., vol. 20, no. 5, pp. 1101–1113, Sep. 2016.
[30] Z. He, S. Zhang, F. Gu, and J. Wu, ''Mining conditional discriminative sequential patterns,'' Inf. Sci., vol. 478, pp. 524–539, Apr. 2019.
[31] G. Ifrim, G. Bakir, and G. Weikum, ''Fast logistic regression for text categorization with variable-length n-grams,'' in Proc. 14th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2008, pp. 354–362.
[32] G. Ifrim and C. Wiuf, ''Bounded coordinate-descent for biological sequence classification in high dimensional predictor space,'' in Proc. 17th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining (KDD), 2011, pp. 708–716.
[33] D. Okanohara and J. Tsujii, ''Text categorization with all substring features,'' in Proc. SIAM Int. Conf. Data Mining, 2009, pp. 838–846.
[34] S. Sonnenburg, G. Rätsch, and C. Schäfer, ''Learning interpretable SVMs for biological sequence classification,'' in Proc. 9th Annu. Int. Conf. Res. Comput. Mol. Biol. Berlin, Germany: Springer, 2005, pp. 389–407.
[35] C. Leslie, E. Eskin, and W. S. Noble, ''The spectrum kernel: A string kernel for SVM protein classification,'' in Proc. Pacific Symp. Biocomput., 2002, pp. 564–575.
[36] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins, ''Text classification using string kernels,'' J. Mach. Learn. Res., vol. 2, no. 2, pp. 419–444, 2002.
[37] E. Eskin, J. Weston, W. S. Noble, and C. S. Leslie, ''Mismatch string kernels for SVM protein classification,'' in Proc. Adv. Neural Inf. Process. Syst., 2003, pp. 1441–1448.
[38] C. Leslie and R. Kuang, ''Fast string kernels using inexact matching for protein sequences,'' J. Mach. Learn. Res., vol. 5, no. 11, pp. 1435–1455, 2004.
[39] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, ''Distributed representations of words and phrases and their compositionality,'' in Proc. Adv. Neural Inf. Process. Syst., 2013, pp. 3111–3119.
[40] Q. Le and T. Mikolov, ''Distributed representations of sentences and documents,'' in Proc. Int. Conf. Mach. Learn., 2014, pp. 1188–1196.
[41] D. Nguyen, W. Luo, T. D. Nguyen, S. Venkatesh, and D. Phung, ''Sqn2vec: Learning sequence representation via sequential patterns with a gap constraint,'' in Proc. Joint Eur. Conf. Mach. Learn. Knowl. Discovery Databases. Cham, Switzerland: Springer, 2018, pp. 569–584.
[42] A. Iosifidis, A. Tefas, and I. Pitas, ''Multidimensional sequence classification based on fuzzy distances and discriminant analysis,'' IEEE Trans. Knowl. Data Eng., vol. 25, no. 11, pp. 2564–2575, Nov. 2013.
[43] R. J. Kate, ''Using dynamic time warping distances as features for improved time series classification,'' Data Mining Knowl. Discovery, vol. 30, no. 2, pp. 283–312, Mar. 2016.
[44] G. Blackshields, M. Larkin, I. M. Wallace, A. Wilm, and D. G. Higgins, ''Fast embedding methods for clustering tens of thousands of sequences,'' Comput. Biol. Chem., vol. 32, no. 4, pp. 282–286, Aug. 2008.
[45] K. Voevodski, M.-F. Balcan, H. Röglin, S.-H. Teng, and Y. Xia, ''Active clustering of biological sequences,'' J. Mach. Learn. Res., vol. 13, no. 1, pp. 203–225, 2012.
[46] C. Faloutsos and K.-I. Lin, ''Fastmap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets,'' in Proc. ACM SIGMOD Int. Conf. Manage. Data, 1995, pp. 163–174.
[47] G. R. Hjaltason and H. Samet, ''Properties of embedding methods for similarity searching in metric spaces,'' IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 5, pp. 530–549, May 2003.
[48] X. Liu, J. Wu, F. Gu, J. Wang, and Z. He, ''Discriminative pattern mining and its applications in bioinformatics,'' Briefings Bioinf., vol. 16, no. 5, pp. 884–900, Sep. 2015.
[49] C. Zhang, C. Liu, X. Zhang, and G. Almpanidis, ''An up-to-date comparison of state-of-the-art classification algorithms,'' Expert Syst. Appl., vol. 82, pp. 128–150, Oct. 2017.
[50] A. K. Jain, M. N. Murty, and P. J. Flynn, ''Data clustering: A review,'' ACM Comput. Surv., vol. 31, no. 3, pp. 264–323, 1999.
[51] T. Xiong, S. Wang, Q. Jiang, and J. Z. Huang, ''A novel variable-order Markov model for clustering categorical sequences,'' IEEE Trans. Knowl. Data Eng., vol. 26, no. 10, pp. 2339–2353, Oct. 2014.
[52] P. Willett, ''Recent trends in hierarchic document clustering: A critical review,'' Inf. Process. Manage., vol. 24, no. 5, pp. 577–597, Jan. 1988.
[53] K. Rieck and P. Laskov, ''Linear-time computation of similarity measures for sequential data,'' J. Mach. Learn. Res., vol. 9, no. 1, pp. 23–48, 2008.
[54] J. D. Gibbons and S. Chakraborti, Nonparametric Statistical Inference. Berlin, Germany: Springer, 2011.
[55] Y. Benjamini and Y. Hochberg, ''Controlling the false discovery rate: A practical and powerful approach to multiple testing,'' J. Roy. Stat. Soc. B, Methodol., vol. 57, no. 1, pp. 289–300, Jan. 1995.
[56] H. B. Mann and D. R. Whitney, ''On a test of whether one of two random variables is stochastically larger than the other,'' Ann. Math. Statist., vol. 18, no. 1, pp. 50–60, Mar. 1947.
[57] M. Lichman, ''UCI machine learning repository,'' School Inf. Comput. Sci., Univ. California Irvine, Irvine, CA, USA, Tech. Rep., 2013. [Online]. Available: http://archive.ics.uci.edu/ml
[58] J. Mäntyjärvi, J. Himberg, P. Kangas, U. Tuomela, and P. Huuskonen, ''Sensor signal data set for exploring context recognition of mobile devices,'' in Proc. 2nd Int. Conf. Pervasive Comput., 2004, pp. 18–23.
[59] L. Wei, M. Liao, Y. Gao, R. Ji, Z. He, and Q. Zou, ''Improved and promising identification of human MicroRNAs by incorporating a high-quality negative set,'' IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 11, no. 1, pp. 192–201, Jan. 2014.
[60] Y. Kim, ''Convolutional neural networks for sentence classification,'' in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2014, pp. 1746–1751.
[61] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, ''PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth,'' in Proc. 17th Int. Conf. Data Eng., 2001, pp. 215–224.
[62] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, ''The Weka data mining software: An update,'' ACM SIGKDD Explor. Newslett., vol. 11, no. 1, pp. 10–18, 2009.
[63] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, and B. Thirion, ''Scikit-learn: Machine learning in Python,'' J. Mach. Learn. Res., vol. 12, no. 10, pp. 2825–2830, 2011.

ZENGYOU HE received the B.S., M.S., and Ph.D. degrees in computer science from the Harbin Institute of Technology, China, in 2000, 2002, and 2006, respectively. He was a Research Associate with the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, from February 2007 to February 2010. He is currently a Professor with the School of Software, Dalian University of Technology. His research interests include data mining and bioinformatics.
GUANGYAO XU received the B.S. degree in electronic and information engineering from Dalian Maritime University, China, in 2018. He is currently pursuing the M.S. degree with the School of Software, Dalian University of Technology. His research interest includes data mining and its applications.

BO XU (Member, IEEE) received the B.Sc. and Ph.D. degrees from the Dalian University of Technology, China, in 2007 and 2014, respectively. She is currently an Associate Professor with the School of Software, Dalian University of Technology. Her current research interests include biomedical literature data mining, information retrieval, and natural language processing.

... of Software. His research interest includes data mining and its applications.

... that his more than 100 articles have been cited more than 5000 times. He is the Editor-in-Chief of Current Bioinformatics, an Associate Editor of IEEE ACCESS, and an Editorial Board Member of Computers in Biology and Medicine, Genes, and Scientific Reports. He was selected as one of the Clarivate Analytics Highly Cited Researchers in 2018.