Reference-Based Sequence Classification
ABSTRACT Sequence classification is an important data mining task in many real-world applications. Over the past few decades, many sequence classification methods have been proposed from different aspects. In particular, the pattern-based method is one of the most important and widely studied sequence classification methods in the literature. In this paper, we present a reference-based sequence classification framework, which can unify existing pattern-based sequence classification methods under the same umbrella. More importantly, this framework can be used as a general platform for developing new sequence classification algorithms. By utilizing this framework as a tool, we propose new sequence classification algorithms that are quite different from existing solutions. Experimental results show that new methods developed under the proposed framework are capable of achieving comparable classification accuracy to those state-of-the-art sequence classification algorithms.

INDEX TERMS Sequence classification, sequential data analysis, cluster analysis, hypothesis testing, sequence embedding.
choose any one of these existing classification methods to fulfill the sequence classification task. Second, in some popular feature-based methods such as pattern-based methods, each feature has good interpretability. Last but not least, the extraction of features from sequences has been extensively studied across different fields, making it feasible to generate sequence features in an effective manner.

The k-mer (in bioinformatics) or k-gram (in natural language processing), a substring composed of k consecutive elements, is probably the most widely used feature in feature-based sequence classification. Such a k-mer-based feature construction method is further generalized by the pattern-based method, in which a feature is a sequential pattern (a subsequence) that satisfies some constraints (e.g., frequent pattern, discriminative pattern). Over the past few decades, a large number of pattern-based methods have been presented in the context of sequence classification [5]–[30].

In this paper, we present a reference-based sequence classification framework, which can be considered as a non-trivial generalization of the pattern-based methods. This framework has several key steps: candidate set construction, reference point selection and feature value construction. In the first step, a set of sequences that serve as the candidate reference points is constructed. Then, some sequences from the candidate set are selected as the reference points according to certain criteria. The number of features in the transformed vectorial data will equal the number of selected reference points. In other words, each reference point will correspond to a transformed feature. Finally, a similarity function is used to calculate the similarity between each sequence in the data and every reference point. The similarity to each reference point will be used as the corresponding feature value.

The reference-based sequence classification framework is quite general and flexible since the selection of both reference points and similarity functions is arbitrary. Existing feature-based methods can be regarded as special variants under our framework obtained by (1) using (frequent or discriminative) sequential patterns (subsequences) as reference points and (2) utilizing a boolean function (output 1 if the reference point is contained in a given sequence and output 0 otherwise) as the similarity function. Besides unifying existing pattern-based methods under the same umbrella, the reference-based sequence classification framework can be used as a general platform for developing new feature-based sequence classification methods. To justify this point, we develop a new feature-based method in which a subset of training sequences is used as the reference points and the Jaccard coefficient is used as the similarity function. In particular, we present two instance selection methods to select a good set of reference points.

To demonstrate the feasibility and advantages of this new framework, we conduct a series of comprehensive performance studies on real sequential data sets. In the experiments, we compare several variants under our framework with some existing sequence classification methods in terms of classification accuracy. Experimental results show that new methods developed under the proposed framework are capable of achieving better classification accuracy than traditional sequence classification methods. This indicates that such a reference-based sequence classification framework is promising from a practical point of view.

The main contributions of this paper can be summarized as follows:
• We present a general reference-based framework for feature-based sequence classification. It offers a unified view for understanding and explaining many existing feature-based sequence classification methods in which different types of sequential patterns are used as features.
• The reference-based framework can be used as a general platform for developing new feature-based sequence classification algorithms. To verify this point, we design new feature-based sequence classification algorithms under this framework and demonstrate their advantages through extensive experimental results on real sequential data sets.

The rest of the paper is structured as follows. Section II gives a discussion on the related work. In Section III, we introduce the reference-based sequence classification framework in detail. In Section IV, we show that many existing feature-based sequence classification algorithms can be reformulated within the reference-based framework. In Section V, we present new feature-based sequence classification algorithms under this framework, which are effective and quite different from available solutions. We experimentally evaluate the proposed reference-based framework through a series of experiments on real-life data sets in Section VI. Finally, we summarize our research and discuss future work in Section VII.

II. RELATED WORK
In this section, we discuss previous research efforts that are closely related to our method. In Section II-A, we provide a categorization of existing feature-based sequence classification methods. In Section II-B, we discuss several instance-based feature generation methods in the literature of time series classification. In Section II-C, we present a concise discussion on reference-based sequence clustering algorithms. In Section II-D, we provide a short summary of dimension reduction and embedding methods based on landmark points.

A. FEATURE-BASED METHODS
1) EXPLICIT SUBSEQUENCE REPRESENTATION WITHOUT SELECTION
The naive approach in dealing with discrete sequences is to treat each element as a feature. However, the order information between different elements will be lost and the sequential nature cannot be captured in the classification. Short sequence segments of k consecutive elements, called k-grams, can be used as features to solve this problem. Given a set of k-grams, a sequence can be represented as a vector of the presence or absence of the k-grams, or of their frequencies. In this feature representation method, all k-grams (for a specified k value) are explicitly used as the features without feature selection.
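To make this representation concrete, the following is a minimal sketch of k-gram feature construction; the function names and the binary/frequency switch are our own illustration, not taken from the cited works:

```python
from itertools import chain

def kgrams(seq, k):
    """Return all k-length contiguous segments (k-grams) of a sequence."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]

def kgram_features(sequences, k, binary=True):
    """Encode each sequence as a vector over the full k-gram vocabulary.

    Each coordinate is the presence/absence (binary=True) or the frequency
    (binary=False) of one k-gram, i.e., the explicit representation without
    feature selection described above.
    """
    vocab = sorted(set(chain.from_iterable(kgrams(s, k) for s in sequences)))
    index = {g: i for i, g in enumerate(vocab)}
    vectors = []
    for s in sequences:
        v = [0] * len(vocab)
        for g in kgrams(s, k):
            v[index[g]] = 1 if binary else v[index[g]] + 1
        vectors.append(v)
    return vocab, vectors

# Example: two symbolic sequences encoded over their 2-grams.
vocab, X = kgram_features([list("abcde"), list("ecdc")], k=2)
```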
2) EXPLICIT SUBSEQUENCE REPRESENTATION WITH SELECTION (CLASSIFIER-INDEPENDENT)
Lesh et al. [26] present a pattern-based classification method in which a sequential pattern is chosen as a feature. The selected pattern should satisfy the following criteria: (1) be frequent, (2) be distinctive of at least one class and (3) not be redundant. Towards this direction, many pattern-based classification methods have been subsequently proposed, in which different constraints are imposed on the patterns that should be selected as features [5]–[25], [27]–[30]. Note that any classifier designed for vectorial data can be applied to the transformed data generated from such pattern-based methods. In other words, such feature generation methods are classifier-independent.

3) EXPLICIT SUBSEQUENCE REPRESENTATION WITH SELECTION (CLASSIFIER-DEPENDENT)
The above pattern-based methods are universal and classifier-independent. However, some patterns that are critical to the classifier may be filtered out during the selection process. Thus, several methods which can select pattern features from the entire pattern space for a specific classifier have been proposed [31]–[33].

In [31], a coordinate-wise gradient ascent technique is presented for learning the logistic regression function in the space of all k-grams. The method exploits the inherent structure of the k-gram feature space to automatically provide a compact set of highly discriminative k-gram features. In [32], a framework is presented in which linear classifiers such as logistic regression and support vector machines can work directly in the explicit high-dimensional space of all subsequences. The key idea is a gradient-bounded coordinate-descent strategy to quickly retrieve features without explicitly enumerating all potential subsequences. In [33], a novel document classification method using all substrings as features is proposed, in which L1 regularization is applied to a multi-class logistic regression model to fulfill the feature selection task automatically and efficiently.

4) IMPLICIT SUBSEQUENCE REPRESENTATION
In contrast to explicit subsequence representation, kernel-based methods employ an implicit subsequence representation strategy. A kernel function is the key ingredient for learning with support vector machines (SVMs) and it implicitly defines a high-dimensional feature space. Some kernel functions $K(x, y)$ have been presented for measuring the similarity between two sequences $x$ and $y$ (e.g. [34]). There are a variety of string kernels which are widely used for sequence classification (e.g. [35]–[38]). A sequence is transformed into a feature space and the kernel function is the inner product of two transformed feature vectors.

Leslie et al. [35] propose a k-spectrum kernel for protein classification. Given a number $k \ge 1$, the k-spectrum of an input sequence is the set of all its k-length (contiguous) subsequences.

Lodhi et al. [36] present a string kernel based on gapped k-length subsequences for text classification. The subsequences are weighted by an exponentially decaying factor of their full length in the text.

In [37], a mismatch string kernel is proposed, in which a certain number of mismatches are allowed in counting the occurrences of a subsequence. Several string kernels related to the mismatch kernel are presented in [38]: restricted gappy kernels, substitution kernels and wildcard kernels.

5) SEQUENCE EMBEDDING
All the methods mentioned above use subsequences as features. Alternatively, the sequence embedding method generates a vector representation in which each feature does not have a clear interpretation. Most existing approaches for sequence embedding are proposed for texts in natural language processing, where word and document embeddings are used as an efficient way to encode the text (e.g. [39], [40]). The basic assumption in these methods is that words that appear in similar contexts have similar meanings.

The word2vec model [39] uses a two-layer neural network to learn a vector representation for each word. The sequence (text) embedding vector can be further generated by combining the feature vectors for words. The doc2vec model [40] extends word2vec by directly learning feature vectors for entire sentences, paragraphs, or documents.

Nguyen et al. [41] propose an unsupervised method (named Sqn2Vec) for learning sequence embeddings by predicting a sequence's singleton symbols and sequential patterns (SPs). The main objective of Sqn2Vec is to address the limitations of two existing approaches: pattern-based methods often produce sparse and high-dimensional feature vectors, while sequence embedding methods in natural language processing may fail on data sets with a small vocabulary.

6) SUMMARY OF FEATURE-BASED METHODS
Roughly, existing feature-based sequence classification methods can be divided into the above five categories. Each of these methods has its pros and cons, which we discuss briefly next.

First, using k-grams as features without feature selection is simple and effective in practice. However, the feature length k cannot be large and many redundant features may be included.

Second, in the pattern-based method, the length of a feature is not restricted as long as the feature satisfies the given constraints, and redundant features can be filtered out in some formulations. However, it is a non-trivial task to efficiently mine patterns that can satisfy the constraints.

Third, sequence classification methods based on adaptive feature selection can automatically select features from the set of all subsequences. The basic idea is to integrate the feature selection and classifier construction into the same procedure. Hence, these methods are classifier-dependent in nature.
In the second step, we generate the set of candidate reference sequences CR from the alphabet $I$. Note that any sequence over $I$ can be a member of CR. In other words, CR can be an infinite set. In practice, some constraints will be imposed on the potential members of CR. For instance, the pattern-based methods only consider subsequences of TrainD as members of CR under our framework, which will be further discussed in Section IV. Furthermore, the use of different construction methods for building the candidate set CR will lead to the generation of many new feature-based sequence classification methods.

In the third step, we select a subset of sequences R from CR as the landmark sequences for generating features. That is, each reference sequence will correspond to a transformed feature. The critical issue in this step is how to design an effective pivot sequence selection method. To date, existing pattern-based methods typically utilize some simple criteria to conduct the reference sequence selection task. For example, those methods based on frequent subsequences use the minimal support constraint as the criterion for reference sequence selection. Apparently, many new and interesting pivot sequence selection methods remain unexplored under our framework. In the subsequent paragraphs of this subsection, we list some commonly used criteria for selecting reference sequences from the set of candidate pivot sequences.

Constraint 1 (Gap Constraint [11]): Given two sequences $s = \langle s_1, s_2, \ldots, s_l \rangle$ and $t = \langle t_1, t_2, \ldots, t_r \rangle$, if $t$ is a subsequence of $s$ such that $t_1 = s_{i_1}, t_2 = s_{i_2}, \ldots, t_r = s_{i_r}$, the gap between $i_k$ and $i_{k+1}$ is defined as $Gap(s, i_k, i_{k+1}) = i_{k+1} - i_k - 1$. Given two thresholds mingap and maxgap ($0 \le mingap \le maxgap$), if $mingap \le Gap(s, i_k, i_{k+1}) \le maxgap$ for all $1 \le k \le r - 1$, then the occurrence of $t$ in $s$ fulfills the gap constraint.

Constraint 2 (Minsup Constraint [12]): Given a set of sequences $D_{c_i}$ with the class label $c_i$ and a sequence $t$, $count_{D_{c_i}}(t)$ is used to denote the number of sequences in $D_{c_i}$ that contain $t$ as a subsequence. The support of $t$ in $D_{c_i}$ is defined as $sup_{D_{c_i}}(t) = count_{D_{c_i}}(t) / |D_{c_i}|$. Given a positive threshold minsup, if $sup_{D_{c_i}}(t) \ge minsup$, then $t$ satisfies the minsup constraint and $t$ is a frequent sequential pattern in $D_{c_i}$.

Constraint 3 (Mindisc Constraint [48]): Given two class labels $c_1$ and $c_2$, a sequence $t$ is said to be a discriminative pattern if it is over-expressed on $D_{c_1}$ against $D_{c_2}$ (or vice versa). To evaluate the discriminative power, many measures/functions have been proposed in the literature [48]. If the discriminative function value of $t$ can pass certain constraints, then it satisfies the mindisc constraint. Here we just list some measures that have been used for selecting discriminative patterns in sequence classification.

• Discriminative Function (DF) 1 [12]:
$sup_{D_{c_1}}(t) > minsup, \quad sup_{D_{c_2}}(t) \le minsup$,   (III.1)
where minsup is a given support threshold.

• Discriminative Function (DF) 2 [11]:
$occ_{D_{c_1}}(t) > mincount, \quad occ_{D_{c_2}}(t) \le mincount$,   (III.2)
where $occ_{D_{c_1}}(t) = occount_{D_{c_1}}(t) / |D_{c_1}|$ and mincount is a given threshold. Here $occount_{D_{c_1}}(t)$ is the number of non-overlapping occurrences of $t$ in $D_{c_1}$.

• Discriminative Function (DF) 3 [12]:
$supdiff = sup_{D_{c_1}}(t) - sup_{D_{c_2}}(t)$.   (III.3)

• Discriminative Function (DF) 4 [11]:
$F\text{-}ratio = \dfrac{Occ_{between}}{Occ_{within}}$,   (III.4)
where
$Occ_{between} = |D_{c_1}|\left(occ_{D_{c_1}}(t) - \dfrac{occ_{D_{c_1}}(t) + occ_{D_{c_2}}(t)}{2}\right)^2 + |D_{c_2}|\left(occ_{D_{c_2}}(t) - \dfrac{occ_{D_{c_1}}(t) + occ_{D_{c_2}}(t)}{2}\right)^2$,
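Constraint 2 and DF 3 above are straightforward to operationalize. A minimal sketch follows; the helper names (`is_subsequence`, `support`, `supdiff`) are our own, and the containment test treats $t$ as a not-necessarily-contiguous subsequence, as in the definitions above:

```python
def is_subsequence(t, s):
    """True if t occurs in s as a (not necessarily contiguous) subsequence."""
    it = iter(s)
    return all(item in it for item in t)

def support(t, D):
    """sup_D(t): fraction of sequences in D that contain t as a subsequence."""
    return sum(is_subsequence(t, s) for s in D) / len(D)

def is_frequent(t, D_ci, minsup):
    """Constraint 2: t is a frequent sequential pattern in D_ci."""
    return support(t, D_ci) >= minsup

def supdiff(t, D_c1, D_c2):
    """DF 3: support difference of t between the two classes."""
    return support(t, D_c1) - support(t, D_c2)
```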
FIGURE 3. The process of feature value generation, model construction and prediction.
where $C(t, s)$ is the cohesion of $t$ in the sequence $s$.

• Similarity Function (SF) 4 [18]:
$Sim(s, t) = \begin{cases} occnum, & \text{if } t \subseteq s, \\ 0, & \text{otherwise,} \end{cases}$   (III.9)
where occnum is the number of occurrences of $t$ in $s$.

• Similarity Function (SF) 5 [11]:
$Sim(s, t) = \begin{cases} occount_s(t), & \text{if } t \subseteq s, \\ 0, & \text{otherwise,} \end{cases}$   (III.10)
where $occount_s(t)$ is the number of non-overlapping occurrences of $t$ in $s$.

• Similarity Function (SF) 6 [19]:
$Sim(s, t) = \dfrac{|LCS(s, t)|}{\max\{|s|, |t|\}}$,   (III.11)
where $|LCS(s, t)|$ is the length of the longest common subsequence, and $|s|$ and $|t|$ are the lengths of $s$ and $t$, respectively.

C. MODEL CONSTRUCTION AND PREDICTION
In the third stage of the presented framework, we construct a prediction model to make predictions. As shown in the right part of Fig. 3, this procedure can be further divided into three steps: model construction, prediction and classification result generation.

In the first step, an existing vectorial data classification method is used to construct a prediction model from the vectorial training set TrainD′, since we have transformed the training sequences into feature vectors in the second stage. Numerous classification methods have been designed for classifying feature vectors (e.g. support vector machines and decision trees) [4], [49]. After training a classifier with TrainD′, the prediction model is ready for classifying unknown samples.

In the second step, we forward the vectorial testing set TestD′ to the classifier to make predictions. In the third step, we output the prediction result and compute the classification accuracy by comparing the predicted class labels with the ground-truth labels.
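A minimal sketch of this whole third stage is given below, assuming the similarity-based transformation of the second stage; the choice of an SVM via scikit-learn's SVC is only one example of a vectorial classifier, and the function names are ours:

```python
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def transform(sequences, references, sim):
    """Map each sequence to its vector of similarities to the reference points."""
    return [[sim(s, r) for r in references] for s in sequences]

def train_and_evaluate(train_seqs, y_train, test_seqs, y_test, references, sim):
    X_train = transform(train_seqs, references, sim)   # TrainD'
    X_test = transform(test_seqs, references, sim)     # TestD'
    clf = SVC().fit(X_train, y_train)                  # model construction
    y_pred = clf.predict(X_test)                       # prediction
    return accuracy_score(y_test, y_pred)              # classification result
```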
IV. GENERAL FRAMEWORK FOR FEATURE-BASED CLASSIFICATION
In this section, we show that many existing feature-based sequence classification algorithms can be reformulated within the presented reference-based framework. The differences between these algorithms mainly lie in the selection of reference points and similarity functions. As summarized in Table 1, we can categorize these existing methods according to three criteria: (1) How to construct the candidate set of reference points? (2) How to choose a set of reference points? (3) Which similarity function should be used? Note that the definitions and notations for the different constraints and similarity functions have been presented in Section III-A and Section III-B.

TABLE 1. The categorization of some existing feature-based sequence classification algorithms under our framework.

From Table 1, we have the following observations.

First of all, any sequence over the alphabet can be a potential member of the candidate set of reference points CR. However, all feature-based sequence classification algorithms in Table 1 use SubTrainD to construct CR, since the idea of using subsequences as features is quite natural and offers
good interpretability. Although SubTrainD is a finite set, its size is still very large and most sequences in SubTrainD are useless and redundant for classification. Therefore, it is necessary to explore alternative methods for constructing the set of candidate reference points. For instance, we may use all original sequences in TrainD to construct CR, so that the size of CR will be greatly reduced and the corresponding features may be more representative.

Second, many sequence selection criteria have been proposed to select R from CR, such as minsup and mindisc. The main objective of applying these criteria is to select a subset of sequences that can generate good features for building the classifier. However, it is not an easy task to set suitable thresholds for these constraints so as to produce a set of reference sequences of moderate size. More importantly, most of these constraints come from the literature of sequential pattern mining, and may be only applicable to the selection of reference sequences from SubTrainD. In other words, more general reference point selection strategies should be developed.

Last, the most widely used similarity function in Table 1 is SF 1, which is a boolean function based on whether the reference point is a subsequence of the sequence in TrainD. Although some non-boolean functions have been used, the potential of utilizing more elaborate similarity functions between two sequences still needs further investigation.

Overall, our reference-based sequence classification framework is quite generic, in which many existing pattern-based sequence classification methods can be reformulated as its special variants. Meanwhile, there are still many limitations in current research efforts under this framework. Hence, new and effective sequence classification methods should be developed in this direction.

V. NEW VARIANTS UNDER THE FRAMEWORK
In addition to encompassing existing pattern-based methods, this framework can also be used as a general platform to design new feature-based sequence classification methods. As discussed in Section IV, there are three key ingredients in our framework: the construction of the candidate reference point set, the selection of reference points and the selection of the similarity function. Obviously, we can generate a ''new'' sequence classification algorithm from any unexplored combination of these three components. In view of the fact that the number of possible combinations is quite large, it is infeasible to enumerate all these variants. Instead, we only present two variants that are quite different from existing algorithms to demonstrate the advantage of this framework.

A. THE USE OF TRAINING SET AS THE CANDIDATE SET
Within our framework, all previous pattern-based sequence classification methods utilize the set SubTrainD as the candidate reference point set CR in the first step. One limitation of this strategy is that the actual size of CR will be very large. As a result, it poses great challenges for the reference point selection task in the subsequent step. To alleviate these issues, we propose to use all original sequences in TrainD to construct the set of candidate reference points. The rationale for this candidate set construction method is based on the following observations.

Firstly, all information given for building the classifier is contained in the original training set. In other words, we will not lose any relevant information for the classification task if TrainD is used as the candidate set of reference sequences. In fact, the widely used candidate set SubTrainD is derived from TrainD.
Secondly, even if we use all the training sequences in TrainD as the reference points, the transformed vectorial data will be a |TrainD| × |TrainD| table. That is, the number of features is still no larger than the number of samples. Therefore, we do not need to analyze a HDLSS (high-dimension, low-sample-size) data set during the classification stage. In contrast, the number of features may be much larger than the number of samples in the vectorial data obtained from SubTrainD if the parameters are not properly specified during the reference point selection procedure. In fact, we have tested the performance when all training sequences are used as reference points. The experimental results show that this quite simple idea is able to achieve comparable performance in terms of classification accuracy.

Finally, the same idea has been employed in the literature of time series classification [42], [43]. Its success motivates us to investigate its feasibility and advantages in the context of discrete sequence classification.
B. TWO REFERENCE POINT SELECTION METHODS
To select reference sequences from TrainD, the existing constraints proposed in the context of sequential pattern mining are not applicable. Therefore, we have to develop new algorithms to choose a subset of representative reference sequences from TrainD. To this end, two different reference sequence selection methods are presented. The first one is an unsupervised method, which selects reference sequences based on cluster analysis without considering the class label information. The second one is a supervised method, which evaluates each candidate sequence according to its discriminative ability across different classes. In the following two sub-sections, we present the details of these two reference point selection algorithms.

1) UNSUPERVISED REFERENCE POINT SELECTION
As we have discussed in Section V-A, we may choose all sequences in the training set as reference points. However, the number of features in the transformed vectorial data can still be very large if the number of training instances is large. The selection of a small subset of representative training sequences as reference points will greatly reduce the computational burden in the subsequent stage. One natural idea is to divide the training sequences in CR into different clusters using a clustering algorithm [50]. Then, we can select a representative sequence from each cluster as the reference point.

To date, many algorithms have been presented for clustering discrete sequences (e.g. [51]). We can simply adopt an existing sequence clustering algorithm in our pipeline. Here we choose the Group-average Agglomerative Hierarchical Clustering (GAHC) algorithm [52] to fulfill the sequence clustering task. This algorithm is used because it can often generate a high-quality clustering result and can handle any form of similarity measure.

In the following, we describe the details of the reference point selection method based on GAHC.

In the first stage, the i-th sequence in CR forms a singleton cluster Ci.

In the second stage, a similarity function is used to calculate the similarity between each pair of clusters to produce a similarity matrix Sim, where Sim[i, j] is the similarity between the two clusters Ci and Cj. Many similarity measures have been presented for sequential data (e.g. [53]). Here we choose the Jaccard coefficient. More specific details on the similarity function will be discussed in Section V-C.

In the third stage, we first search the similarity matrix Sim to identify the maximum value maxSim, which corresponds to the most similar pair of clusters Ck and Cl. Then, these two clusters are merged to form a new cluster Ck and the total number of clusters is decreased by 1. Meanwhile, the entries related to Cl in Sim are set to 0 and Sim is updated by recalculating the similarity between Ck and each of the remaining clusters. Since we use the group-average method, the similarity between the newly generated cluster and each of the remaining clusters is calculated as the average similarity between all members of the two clusters. We repeat the third stage until the number of clusters is equal to the number of reference points we want to select.

In the last stage, we select a representative sequence from each cluster. For each cluster, any sequence in this cluster can be used as a representative. To provide a consistent and deterministic output, we use the sequence with the minimum subscript in each cluster as the reference point.
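The following sketch summarizes the four stages. It is a plain, unoptimized implementation written for clarity: it recomputes the group-average linkage directly from the base pairwise matrix rather than updating Sim in place, which gives the same merges; the function names are ours:

```python
def gahc_select(CR, sim, num_points):
    """Select num_points reference sequences from CR by group-average
    agglomerative clustering, taking the minimum-subscript member of each
    final cluster as its deterministic representative."""
    clusters = [[i] for i in range(len(CR))]            # stage 1: singletons
    S = [[sim(CR[i], CR[j]) for j in range(len(CR))] for i in range(len(CR))]

    def group_avg(a, b):                                # group-average linkage
        return sum(S[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > num_points:                   # stage 3: merge
        a, b = max(((x, y) for x in range(len(clusters))
                    for y in range(x + 1, len(clusters))),
                   key=lambda p: group_avg(clusters[p[0]], clusters[p[1]]))
        clusters[a] += clusters[b]
        del clusters[b]
    # stage 4: one representative per cluster, minimum subscript first
    return [CR[min(c)] for c in clusters]
```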
2) SUPERVISED REFERENCE POINT SELECTION
To choose a subset of representative reference sequences from TrainD, we can also employ a supervised method in which the class label information is utilized. As we have discussed in Section IV, different mindisc constraints have been widely used to evaluate the discriminative power of sequential patterns. Unfortunately, these constraints are only applicable to the selection of reference points from SubTrainD. In addition, it is not an easy task to set suitable thresholds to control the number of selected reference points. In order to overcome these limitations, we present a reference point selection method based on hypothesis testing, in which the statistical significance in terms of the p-value is used to assess the discriminative power of each candidate sequence.

Hypothesis testing is a commonly used method in statistical inference. The usual line of reasoning is as follows: first, formulate the null hypothesis and the alternative hypothesis; second, select an appropriate test statistic; third, set a significance level threshold; finally, reject the null hypothesis if and only if the p-value is less than the significance level threshold, where the p-value is the probability of getting a value of the test statistic that is at least as extreme as what is actually observed, on condition that the null hypothesis is true.

In order to assess the discriminative power of each candidate sequence in terms of the p-value, we can use the null hypothesis that this sequence does not belong to any class and that all sequences from different classes are drawn from the same population. If the above null hypothesis is true, then the similarities between the candidate sequence and the training sequences are drawn from the same population. Therefore, we can formulate the corresponding hypothesis testing problem as a two-sample testing problem [54], where one sample is the set of similarities between the candidate sequence and the training sequences from one target class, and the other sample is the set of similarities between the candidate sequence and the training sequences from the remaining classes.

Since we test all candidate sequences in CR at the same time, it is actually a multiple hypothesis testing problem. If no multiple testing correction is conducted, then the number of false positives among the reported reference sequences may be very high. To tackle this problem, we adopt the BH procedure to control the FDR (False Discovery Rate) [55], which is the expected proportion of false positives among all reported sequences.

The reference point selection method based on MHT (Multiple Hypothesis Testing) is shown in Algorithm 1. In the following, we elaborate on this algorithm in detail.
Algorithm 1 Reference Point Selection Based on MHT
Input: Candidate reference sequence set CR, significance level α
Output: Reference point set R
1: R ← ∅;
2: for each D_ci in CR do
3:   D+ ← D_ci;
4:   D− ← CR − D_ci;
5:   for each sequence S_k in D+ do
6:     Sim+ ← ∅;
7:     Sim− ← ∅;
8:     for each sequence S_j in D+ do
9:       calculate Sim[k, j];
10:      Sim+ ← Sim+ ∪ {Sim[k, j]};
11:    end for
12:    for each sequence S_j in D− do
13:      calculate Sim[k, j];
14:      Sim− ← Sim− ∪ {Sim[k, j]};
15:    end for
16:    S_k.pvalue ← Utest(Sim+, Sim−);
17:  end for
18:  sort D+;
19:  maxindex ← 0;
20:  for each sequence S_k in D+ do
21:    if S_k.pvalue ≤ αk/|D+| then
22:      maxindex ← k;
23:    end if
24:  end for
25:  for k ← maxindex + 1 to |D+| do
26:    D+ ← D+ − {S_k};
27:  end for
28:  R ← R ∪ D+;
29: end for
30: return R;

In the first stage (steps 1-4), we select the set of sequences $D_{c_i}$ with the class label $c_i$ from CR; then we regard $D_{c_i}$ as the positive data set D+ and use the set of all remaining sequences in CR as the negative data set D−.

In the second stage (steps 5-17), for each sequence $S_k$ in D+, a similarity function is used to calculate the similarity between $S_k$ and each sequence in D+ and D−, where the similarity function is the same as that used in Section V-B1 and Sim[k, j] is the similarity between the two sequences $S_k$ and $S_j$. Then, the Mann-Whitney U test [56] is used to calculate the p-value based on the two similarity sets Sim+ and Sim−.

In the third stage (steps 18-27), the BH method first sorts the sequences in D+ according to their corresponding p-values in ascending order, i.e., $D^+ = S_1, S_2, \ldots, S_{|D^+|}$ ($S_1.pvalue \le S_2.pvalue \le \ldots \le S_{|D^+|}.pvalue$). Then, we sequentially search D+ to identify the maximal sequence index maxindex which satisfies the condition $S_k.pvalue \le \alpha k / |D^+|$, where α is the significance level threshold. Those sequences whose indices are larger than maxindex are removed from D+.

In the last stage (steps 28-30), we select all remaining sequences in D+ as reference points. The whole process terminates after the set of sequences from every class has been regarded as D+.
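A compact sketch of Algorithm 1 under these definitions, using scipy's Mann-Whitney U test; the data-structure choices (a dictionary mapping each class label to its sequence set) are ours:

```python
from scipy.stats import mannwhitneyu

def mht_select(CR_by_class, sim, alpha=0.05):
    """Reference point selection based on MHT. CR_by_class maps each class
    label c_i to its sequence set D_ci; sim is the similarity function of
    Section V-B1 (e.g., the modified Jaccard coefficient)."""
    R = []
    for ci, D_pos in CR_by_class.items():
        D_neg = [s for cj, D in CR_by_class.items() if cj != ci for s in D]
        # steps 5-17: one Mann-Whitney U test per candidate sequence
        pvals = []
        for sk in D_pos:
            sim_pos = [sim(sk, sj) for sj in D_pos]
            sim_neg = [sim(sk, sj) for sj in D_neg]
            pvals.append(mannwhitneyu(sim_pos, sim_neg,
                                      alternative='two-sided').pvalue)
        # steps 18-27: Benjamini-Hochberg cut-off at level alpha
        order = sorted(range(len(D_pos)), key=lambda k: pvals[k])
        m, maxindex = len(D_pos), 0
        for rank, k in enumerate(order, start=1):
            if pvals[k] <= alpha * rank / m:
                maxindex = rank
        # steps 28-30: keep only the sequences below the cut-off
        R.extend(D_pos[k] for k in order[:maxindex])
    return R
```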
C. SIMILARITY FUNCTION
In order to measure the similarity between two sequences, we choose the Jaccard coefficient as the similarity function in our method. The larger the Jaccard coefficient between two sequences is, the more similar they are.

Given two sequences $s = \langle s_1, s_2, \ldots, s_l \rangle$ and $t = \langle t_1, t_2, \ldots, t_r \rangle$, the Jaccard coefficient is defined as:
$J(s, t) = \dfrac{|s \cap t|}{|s| + |t| - |s \cap t|}$,   (V.1)
where $|s \cap t|$ is the number of items in the intersection of $s$ and $t$. However, this may lose the order information of the sequences. To alleviate this issue, we use the LCS (Longest Common Subsequence) between $s$ and $t$ to replace $s \cap t$. Then, the Jaccard coefficient is redefined as:
$J(s, t) = \dfrac{|LCS(s, t)|}{|s| + |t| - |LCS(s, t)|}$.   (V.2)

Example 1: Given two sequences $s = \langle a, b, c, d, e \rangle$ and $t = \langle e, c, d, c \rangle$, $LCS(s, t)$ is $\langle c, d \rangle$, so the modified Jaccard coefficient is
$J(abcde, ecdc) = \dfrac{2}{5 + 4 - 2} \approx 0.286$.
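A direct implementation of Eq. (V.2) is straightforward once the LCS length is available; the sketch below (function names are ours) reproduces Example 1:

```python
def lcs_length(s, t):
    """Length of the longest common subsequence of s and t (textbook DP)."""
    dp = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i, x in enumerate(s):
        for j, y in enumerate(t):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x == y
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[-1][-1]

def jaccard_lcs(s, t):
    """Modified Jaccard coefficient of Eq. (V.2)."""
    l = lcs_length(s, t)
    return l / (len(s) + len(t) - l)

print(jaccard_lcs("abcde", "ecdc"))  # 2 / (5 + 4 - 2) = 0.2857...
```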
Note that we can also use other similarity functions from the literature, such as those summarized and reviewed in [53]. The choice of a more appropriate similarity function may yield better performance than the modified Jaccard coefficient. In order to check the effect of the similarity function on the classification performance, we also consider the following two alternative similarity functions.
The first one is the String Subsequence Kernel (SSK) [36]. The main idea of SSK is to compare two sequences by means of the subsequences they contain in common: the more subsequences in common, the more similar they are.

Given two sequences $s = \langle s_1, s_2, \ldots, s_l \rangle$ and $t = \langle t_1, t_2, \ldots, t_r \rangle$ and a parameter $n$, the SSK is defined as:
$K_n(s, t) = \langle \Phi(s), \Phi(t) \rangle = \sum_{u \in I^n} \phi_u(s)\,\phi_u(t) = \sum_{u \in I^n} \sum_{u \subseteq s} \lambda^{l_s(u)} \sum_{u \subseteq t} \lambda^{l_t(u)} = \sum_{u \in I^n} \sum_{u \subseteq s} \sum_{u \subseteq t} \lambda^{l_s(u) + l_t(u)}$,   (V.3)
where $\phi_u(s)$ is the feature mapping for the sequence $s$ and each $u \in I^n$, $I$ is a finite alphabet, $I^n$ is the set of all subsequences of length $n$, $u$ is a subsequence of $s$ such that $u_1 = s_{i_1}, u_2 = s_{i_2}, \ldots, u_n = s_{i_n}$, $l_s(u) = i_n - i_1 + 1$ is the length of $u$ in $s$, and $\lambda \in (0, 1)$ is a decay factor which is used to penalize gaps. The calculation steps are as follows: enumerate all subsequences of length $n$, compute the feature vectors for the two sequences, and then compute the similarity. The normalized kernel value is given by
$\hat{K}_n(s, t) = \dfrac{K_n(s, t)}{\sqrt{K_n(s, s)\,K_n(t, t)}}$.   (V.4)

Example 2: Given two sequences $s = \langle a, b, c, d, e \rangle$ and $t = \langle e, c, d, c \rangle$, the subsequences of length 1 ($n = 1$) are $a, b, c, d, e$. The corresponding feature vectors can be denoted as $\phi_1(s) = \langle \lambda, \lambda, \lambda, \lambda, \lambda \rangle$ and $\phi_1(t) = \langle 0, 0, 2\lambda, \lambda, \lambda \rangle$, so the normalized kernel value is
$\hat{K}_1(abcde, ecdc) = \dfrac{K_1(abcde, ecdc)}{\sqrt{K_1(abcde, abcde)\,K_1(ecdc, ecdc)}} = \dfrac{4\lambda^2}{\sqrt{5\lambda^2 \times 6\lambda^2}} \approx 0.73$.
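For $n = 1$ the kernel reduces to a weighted symbol-count match, and the decay factor cancels after normalization. The following sketch (function names and the default lam value are ours) reproduces Example 2:

```python
import math
from collections import Counter

def ssk1(s, t, lam=0.5):
    """SSK of Eq. (V.3) for n = 1: each common symbol u contributes
    lam^2 * count_s(u) * count_t(u)."""
    cs, ct = Counter(s), Counter(t)
    return lam ** 2 * sum(cs[u] * ct[u] for u in cs.keys() & ct.keys())

def ssk1_normalized(s, t, lam=0.5):
    """Normalized kernel of Eq. (V.4); lam cancels for n = 1."""
    return ssk1(s, t, lam) / math.sqrt(ssk1(s, s, lam) * ssk1(t, t, lam))

print(ssk1_normalized("abcde", "ecdc"))  # 4 / sqrt(5 * 6) = 0.7302...
```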
When this function is employed in our method, $n = 1$ is used as the default parameter setting. Although the setting of $n = 1$ may lose the order information, it greatly reduces the computational cost and can provide satisfactory results in practice.

Another alternative similarity function is the normalized LCS. The larger the normalized LCS between two sequences is, the more similar they are.

Given two sequences $s = \langle s_1, s_2, \ldots, s_l \rangle$ and $t = \langle t_1, t_2, \ldots, t_r \rangle$, the normalized LCS is defined as:
$Sim(s, t) = \dfrac{|LCS(s, t)|}{\min\{|s|, |t|\}}$,   (V.5)
where $|LCS(s, t)|$ is the length of the longest common subsequence, and $|s|$ and $|t|$ are the lengths of $s$ and $t$.

Example 3: Given two sequences $s = \langle a, b, c, d, e \rangle$ and $t = \langle e, c, d, c \rangle$, $LCS(s, t)$ is $\langle c, d \rangle$, so the normalized LCS is
$Sim(abcde, ecdc) = \dfrac{2}{4} = 0.5$.

VI. EXPERIMENTS
To demonstrate the feasibility and advantages of this new framework, we conducted experiments on fourteen real sequential data sets. We compared our two algorithms derived under the reference-based framework with other sequence classification algorithms in terms of classification accuracy. All experiments were conducted on a PC with an Intel(R) Xeon(R) CPU at 2.40 GHz and 12 GB memory. All the reported accuracies in the experiments are the average accuracies obtained by repeating 5-fold cross-validation 5 times, except for SCIP (accuracies for SCIP were obtained using 10-fold cross-validation because this is a fixed setting in the software package provided by the authors).

A. DATA SETS
We choose fourteen benchmark data sets which are widely used for evaluating sequence classification algorithms: Activity [57], Aslbu [14], Auslan2 [14], Context [58], Epitope [12], Gene [59], News [5], Pioneer [14], Question [60], Reuters [5], Robot [5], Skating [14], Unix [5] and Webkb [5]. The main characteristics of these data sets are summarized in Table 2, where |D| represents the number of sequences in the data set, #items denotes the number of distinct elements, minl, maxl and avgl denote the minimum, maximum and average length of the sequences, respectively, and #classes represents the number of distinct classes in the data set.

TABLE 2. Summary of the sequential data sets used in the experiments.

B. PARAMETER SETTINGS
Our two algorithms are denoted by R-MHT (Reference Point Selection Based on MHT) and R-GAHC (Reference Point Selection Based on GAHC), respectively. In addition, the method that uses all sequences in TrainD as reference points is denoted as R-A, which is also included in the performance comparison. We compare our algorithms with five existing sequence classification algorithms: MiSeRe [17] (http://www.misere.co.nf),
Sqn2Vec [41] (https://github.com/nphdang/Sqn2Vec), SCIP [5] (http://adrem.ua.ac.be/sites/adrem.ua.ac.be/files/SCIP.zip), FSP (the algorithm based on frequent sequential patterns) and DSP (the algorithm based on discriminative sequential patterns).

In MiSeRe, num_of_rules is specified to be 1024 and execution_time is set to 5 minutes for all data sets.

Sqn2Vec is an unsupervised method for learning sequence embeddings from both singleton symbols and sequential patterns. It has two variants: Sqn2VecSEP and Sqn2VecSIM, where Sqn2VecSEP (Sqn2VecSIM) generates sequence representations from singleton symbols and sequential patterns separately (simultaneously). In these two variants, minsup = 0.05, maxgap = 4 and the embedding dimension d is set to 128 for all data sets.

SCIP is a sequence classification method based on interesting patterns, which has four different variants: SCII_HAR, SCII_MA, SCIS_HAR and SCIS_MA. In the experiments, the following parameter setting is used for all data sets: minsup = 0.05, minint = 0.02, maxsize = 3, conf = 0.5 and topk = 11.

Frequent sequential patterns have been widely used as features in sequence classification. To include the algorithm based on frequent sequential patterns in the comparison (denoted by FSP), we employ the PrefixSpan algorithm [61] as the frequent sequential pattern mining algorithm. The parameters are specified as follows: maxsize = 3 and minsup = 0.3 for all data sets except Context (the minsup for Context is set to 0.9 in order to avoid the generation of too many patterns).

Similarly, discriminative sequential patterns are widely used as features in many sequence classification algorithms and applications as well. To include the algorithm based on discriminative sequential patterns in the comparison (denoted by DSP), we first use the PrefixSpan algorithm to mine a set of frequent sequential patterns and then detect discriminative patterns from the frequent pattern set. The parameters for PrefixSpan are identical to those used in FSP, and minGR = 3 is used as the threshold for filtering discriminative sequential patterns.

C. RESULTS
In Table 3, the detailed performance comparison results in terms of classification accuracy are presented. Note that the result of DSP on the Skating data set is N/A because we cannot find any discriminative patterns in this data set under the given parameter setting. In the experiments, α = 0.05 is used for R-MHT and pointnum is specified to be 1/10 of the size of TrainD for R-GAHC. After transforming sequences into feature vectors, we chose NB (Naive Bayes), DT (Decision Tree), SVM (Support Vector Machine) and KNN (k Nearest Neighbors) as the classifiers. The implementation of each classifier was obtained from WEKA [62], except for Sqn2Vec, where all classifiers were obtained from scikit-learn [63] since its source code is written in Python.

In order to have a global picture of the overall performance of the different algorithms, we calculate the average accuracy over all data sets for each classifier. The corresponding average accuracies for the different methods are recorded in Table 4.

TABLE 4. The average classification accuracies of different methods over all data sets used in the experiment.

The results show that, among our two methods, R-MHT can achieve better performance than R-GAHC when NB, DT and SVM are used as the classifier. However, R-MHT has a bad performance when KNN is used as the classifier. Since we select a representative sequence from each cluster in R-GAHC and any sequence in a cluster can be used as a representative, we may miss the most representative sequence. Meanwhile, the choice of the clustering method and the specification of the number of clusters will influence the results.

In addition, the R-A method outperforms R-MHT and R-GAHC, since we do not lose any relevant information for the classification task when all training sequences are used as reference points. However, the feature dimension will be very high in R-A, which will incur a high computational cost in practice.

Compared with the other classification methods, our methods are able to achieve comparable performance. In particular, R-A and MiSeRe [17] achieve the highest average classification accuracy among all competitors, since all information given for building the classifier is contained in the reference point set in R-A. The reason why R-MHT and R-GAHC are slightly worse may be that their reference points are less distinct from each other in different classes and some sequences that are important for classification are missed. This is quite remarkable, since R-A is a very simple algorithm derived from our framework; it indicates that the proposed reference-based sequence classification framework is quite useful in practice. It can be expected that more accurate feature-based sequence classification methods will be developed under this framework in the future. From Table 3 and Table 4, it can also be observed that none of the algorithms in the comparison can always achieve the best performance across all data sets.
Therefore, more research efforts should still be devoted to the development of effective sequence classification algorithms.

TABLE 5. The average classification accuracies of different similarity functions over all data sets used in the experiment.

The use of different similarity functions may affect the performance of our algorithms. To investigate this issue, we use two additional similarity functions in the experiments for comparison: SSK and the normalized LCS, whose details have been introduced in Section V-C.

Table 5 presents the average classification accuracies of the different similarity functions over all data sets. The Jaccard coefficient, SSK and normalized LCS are denoted as J, S and N, respectively. In Table 5, R-A-J means that the Jaccard coefficient is used as the similarity function in R-A; the other notations in this table can be interpreted in a similar manner. The results show that the use of different similarity functions can affect the performance of our algorithms. Among these three similarity functions, the use of the Jaccard coefficient as the similarity function achieves better performance in most cases. However, R-MHT-J has unsatisfactory performance when KNN is used as the classifier. It can also be observed that none of the similarity functions is always the best performer. Therefore, more suitable similarity functions should be developed.

The above experimental results and analysis show that the proposed new methods based on our framework can achieve comparable performance to state-of-the-art sequence classification algorithms, which demonstrates the feasibility and advantages of our framework. Our framework is quite general and flexible, since the selection of both reference points and similarity functions is arbitrary. However, since the feature selection and classifier construction in our framework are separate and any existing vectorial data classification method can be used to tackle the sequence classification problem, some features that are critical to the classifier may be filtered out during the selection process.

VII. CONCLUSION
In this paper, we present a reference-based sequence classification framework by generalizing the pattern-based methods. This framework is quite general and flexible, and can be used as a general platform to develop new algorithms for sequence classification. To verify this point, we present several new feature-based sequence classification algorithms under this new framework. A series of comprehensive experiments on real data sets shows that our methods are capable of achieving better classification accuracy than existing sequence classification algorithms. Thus, the reference-based sequence classification framework is quite promising and useful in practice.

In future work, we intend to explore more appropriate reference sequence selection methods and similarity functions to improve the performance and reduce the computational cost. As a result, more accurate feature-based sequence classification methods would be derived under this framework.

REFERENCES
[1] J. Han, J. Pei, and M. Kamber, Data Mining: Concepts and Techniques. Amsterdam, The Netherlands: Elsevier, 2011.
[2] M. Deshpande and G. Karypis, ''Evaluation of techniques for classifying biological sequences,'' in Proc. 6th Pacific–Asia Conf. Adv. Knowl. Discovery Data Mining. Berlin, Germany: Springer, 2002, pp. 417–431.
[3] Z. Xing, J. Pei, and E. Keogh, ''A brief survey on sequence classification,'' ACM SIGKDD Explor. Newslett., vol. 12, no. 1, pp. 40–48, Nov. 2010.
[4] E. Cernadas and D. Amorim, ''Do we need hundreds of classifiers to solve real world classification problems?'' J. Mach. Learn. Res., vol. 15, no. 1, pp. 3133–3181, 2014.
[5] C. Zhou, B. Cule, and B. Goethals, ''Pattern based sequence classification,'' IEEE Trans. Knowl. Data Eng., vol. 28, no. 5, pp. 1285–1298, May 2016.
[6] T. P. Exarchos, M. G. Tsipouras, C. Papaloukas, and D. I. Fotiadis, ''A two-stage methodology for sequence classification based on sequential pattern mining and optimization,'' Data Knowl. Eng., vol. 66, no. 3, pp. 467–487, Sep. 2008.
[7] D. Lo, H. Cheng, J. Han, S.-C. Khoo, and C. Sun, ''Classification of software behaviors for failure detection: A discriminative pattern mining approach,'' in Proc. 15th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2009, pp. 557–566.
[8] R. She, F. Chen, K. Wang, M. Ester, J. L. Gardy, and F. S. Brinkman, ''Frequent-subsequence-based prediction of outer membrane proteins,'' in Proc. 9th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2003, pp. 436–445.
[9] T. Hopf and S. Kramer, ''Mining class-correlated patterns for sequence labeling,'' in Proc. Int. Conf. Discovery Sci. Berlin, Germany: Springer, 2010, pp. 311–325.
[10] H. Haleem, P. K. Sharma, and M. M. S. Beg, ''Novel frequent sequential patterns based probabilistic model for effective classification of Web documents,'' in Proc. Int. Conf. Comput. Commun. Technol. (ICCCT), Sep. 2014, pp. 361–371.
[11] K. Deng and O. R. Zaïane, ''An occurrence based approach to mine emerging sequences,'' in Proc. Int. Conf. Data Warehousing Knowl. Discovery. Berlin, Germany: Springer, 2010, pp. 275–284.
[12] K. Deng and O. R. Zaïane, ''Contrasting sequence groups by emerging sequences,'' in Proc. 12th Int. Conf. Discovery Sci. Berlin, Germany: Springer, 2009, pp. 377–384.
[13] Z. He, S. Zhang, and J. Wu, ''Significance-based discriminative sequential pattern mining,'' Expert Syst. Appl., vol. 122, pp. 54–64, May 2019.
[14] D. Fradkin and F. Mörchen, ''Mining sequential patterns for classification,'' Knowl. Inf. Syst., vol. 45, no. 3, pp. 731–749, Dec. 2015.
[15] H. Yahyaoui and A. Al-Mutairi, ''A feature-based trust sequence classification algorithm,'' Inf. Sci., vol. 328, pp. 455–484, Jan. 2016.
[16] C.-H. Lee, ''A multi-phase approach for classifying multi-dimensional sequence data,'' Intell. Data Anal., vol. 19, no. 3, pp. 547–561, Jun. 2015.
[17] E. Egho, D. Gay, M. Boullé, N. Voisine, and F. Clérot, ''A user parameter-free approach for mining robust sequential classification rules,'' Knowl. Inf. Syst., vol. 52, no. 1, pp. 53–81, Jul. 2017.
[18] A. N. Ntagiou, M. G. Tsipouras, N. Giannakeas, and A. T. Tzallas, ''Protein structure recognition by means of sequential pattern mining,'' in Proc. IEEE 17th Int. Conf. Bioinf. Bioeng. (BIBE), Oct. 2017, pp. 334–339.
[19] C.-Y. Tsai and C.-J. Chen, ''A PSO-AB classifier for solving sequence classification problems,'' Appl. Soft Comput., vol. 27, pp. 11–27, Feb. 2015.
[20] I. Batal, H. Valizadegan, G. F. Cooper, and M. Hauskrecht, ''A temporal pattern mining approach for classifying electronic health record data,'' ACM Trans. Intell. Syst. Technol., vol. 4, no. 4, p. 63, 2013.
[21] C.-Y. Tsai, C.-J. Chen, and C.-J. Chien, ''A time-interval sequence classification method,'' Knowl. Inf. Syst., vol. 37, no. 2, pp. 251–278, Nov. 2013.
[22] T. P. Exarchos, M. G. Tsipouras, C. Papaloukas, and D. I. Fotiadis, ''An optimized sequential pattern matching methodology for sequence classification,'' Knowl. Inf. Syst., vol. 19, no. 2, pp. 249–264, May 2009.
[23] V. S. M. Tseng and C.-H. Lee, ''CBS: A new classification method by using sequential patterns,'' in Proc. SIAM Int. Conf. Data Mining, Apr. 2005, pp. 596–600.
[24] V. S. Tseng and C.-H. Lee, ''Effective temporal data classification by integrating sequential pattern mining and probabilistic induction,'' Expert Syst. Appl., vol. 36, no. 5, pp. 9524–9532, Jul. 2009.
[25] Z. Syed, P. Indyk, and J. Guttag, ''Learning approximate sequential patterns for classification,'' J. Mach. Learn. Res., vol. 10, no. 8, pp. 1913–1936, 2009.
[26] N. Lesh, M. J. Zaki, and M. Ogihara, ''Mining features for sequence classification,'' in Proc. 5th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 1999, pp. 342–346.
[27] P. Rani and V. Pudi, ''RBNBC: Repeat based Naïve Bayes classifier for biological sequences,'' in Proc. 8th IEEE Int. Conf. Data Mining, 2008, pp. 989–994.
[28] P. Holat, M. Plantevit, C. Raïssi, N. Tomeh, T. Charnois, and B. Crémilleux, ''Sequence classification based on delta-free sequential patterns,'' in Proc. IEEE Int. Conf. Data Mining, Dec. 2014, pp. 170–179.
[29] J. K. Febrer-Hernández, R. Hernández-León, C. Feregrino-Uribe, and J. Hernández-Palancar, ''SPaC-NF: A classifier based on sequential patterns with high netconf,'' Intell. Data Anal., vol. 20, no. 5, pp. 1101–1113, Sep. 2016.
[30] Z. He, S. Zhang, F. Gu, and J. Wu, ''Mining conditional discriminative sequential patterns,'' Inf. Sci., vol. 478, pp. 524–539, Apr. 2019.
[31] G. Ifrim, G. Bakir, and G. Weikum, ''Fast logistic regression for text categorization with variable-length n-grams,'' in Proc. 14th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2008, pp. 354–362.
[32] G. Ifrim and C. Wiuf, ''Bounded coordinate-descent for biological sequence classification in high dimensional predictor space,'' in Proc. 17th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining (KDD), 2011, pp. 708–716.
[33] D. Okanohara and J. Tsujii, ''Text categorization with all substring features,'' in Proc. SIAM Int. Conf. Data Mining, 2009, pp. 838–846.
[34] S. Sonnenburg, G. Rätsch, and C. Schäfer, ''Learning interpretable SVMs for biological sequence classification,'' in Proc. 9th Annu. Int. Conf. Res. Comput. Mol. Biol. Berlin, Germany: Springer, 2005, pp. 389–407.
[35] C. Leslie, E. Eskin, and W. S. Noble, ''The spectrum kernel: A string kernel for SVM protein classification,'' in Proc. Pacific Symp. Biocomput., 2002, pp. 564–575.
[36] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins, ''Text classification using string kernels,'' J. Mach. Learn. Res., vol. 2, no. 2, pp. 419–444, 2002.
[37] E. Eskin, J. Weston, W. S. Noble, and C. S. Leslie, ''Mismatch string kernels for SVM protein classification,'' in Proc. Adv. Neural Inf. Process. Syst., 2003, pp. 1441–1448.
[38] C. Leslie and R. Kuang, ''Fast string kernels using inexact matching for protein sequences,'' J. Mach. Learn. Res., vol. 5, no. 11, pp. 1435–1455, 2004.
[39] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, ''Distributed representations of words and phrases and their compositionality,'' in Proc. Adv. Neural Inf. Process. Syst., 2013, pp. 3111–3119.
[40] Q. Le and T. Mikolov, ''Distributed representations of sentences and documents,'' in Proc. Int. Conf. Mach. Learn., 2014, pp. 1188–1196.
[41] D. Nguyen, W. Luo, T. D. Nguyen, S. Venkatesh, and D. Phung, ''Sqn2vec: Learning sequence representation via sequential patterns with a gap constraint,'' in Proc. Joint Eur. Conf. Mach. Learn. Knowl. Discovery Databases. Cham, Switzerland: Springer, 2018, pp. 569–584.
[42] A. Iosifidis, A. Tefas, and I. Pitas, ''Multidimensional sequence classification based on fuzzy distances and discriminant analysis,'' IEEE Trans. Knowl. Data Eng., vol. 25, no. 11, pp. 2564–2575, Nov. 2013.
[43] R. J. Kate, ''Using dynamic time warping distances as features for improved time series classification,'' Data Mining Knowl. Discovery, vol. 30, no. 2, pp. 283–312, Mar. 2016.
[44] G. Blackshields, M. Larkin, I. M. Wallace, A. Wilm, and D. G. Higgins, ''Fast embedding methods for clustering tens of thousands of sequences,'' Comput. Biol. Chem., vol. 32, no. 4, pp. 282–286, Aug. 2008.
[45] K. Voevodski, M.-F. Balcan, H. Röglin, S.-H. Teng, and Y. Xia, ''Active clustering of biological sequences,'' J. Mach. Learn. Res., vol. 13, no. 1, pp. 203–225, 2012.
[46] C. Faloutsos and K.-I. Lin, ''Fastmap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets,'' in Proc. ACM SIGMOD Int. Conf. Manage. Data, 1995, pp. 163–174.
[47] G. R. Hjaltason and H. Samet, ''Properties of embedding methods for similarity searching in metric spaces,'' IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 5, pp. 530–549, May 2003.
[48] X. Liu, J. Wu, F. Gu, J. Wang, and Z. He, ''Discriminative pattern mining and its applications in bioinformatics,'' Briefings Bioinf., vol. 16, no. 5, pp. 884–900, Sep. 2015.
[49] C. Zhang, C. Liu, X. Zhang, and G. Almpanidis, ''An up-to-date comparison of state-of-the-art classification algorithms,'' Expert Syst. Appl., vol. 82, pp. 128–150, Oct. 2017.
[50] A. K. Jain, M. N. Murty, and P. J. Flynn, ''Data clustering: A review,'' ACM Comput. Surv., vol. 31, no. 3, pp. 264–323, 1999.
[51] T. Xiong, S. Wang, Q. Jiang, and J. Z. Huang, ''A novel variable-order Markov model for clustering categorical sequences,'' IEEE Trans. Knowl. Data Eng., vol. 26, no. 10, pp. 2339–2353, Oct. 2014.
[52] P. Willett, ''Recent trends in hierarchic document clustering: A critical review,'' Inf. Process. Manage., vol. 24, no. 5, pp. 577–597, Jan. 1988.
[53] K. Rieck and P. Laskov, ''Linear-time computation of similarity measures for sequential data,'' J. Mach. Learn. Res., vol. 9, no. 1, pp. 23–48, 2008.
[54] J. D. Gibbons and S. Chakraborti, Nonparametric Statistical Inference. Berlin, Germany: Springer, 2011.
[55] Y. Benjamini and Y. Hochberg, ''Controlling the false discovery rate: A practical and powerful approach to multiple testing,'' J. Roy. Stat. Soc. B, Methodol., vol. 57, no. 1, pp. 289–300, Jan. 1995.
[56] H. B. Mann and D. R. Whitney, ''On a test of whether one of two random variables is stochastically larger than the other,'' Ann. Math. Statist., vol. 18, no. 1, pp. 50–60, Mar. 1947.
[57] M. Lichman, ''UCI machine learning repository,'' School Inf. Comput. Sci., Univ. California Irvine, Irvine, CA, USA, Tech. Rep., 2013. [Online]. Available: http://archive.ics.uci.edu/ml
[58] J. Mäntyjärvi, J. Himberg, P. Kangas, U. Tuomela, and P. Huuskonen, ''Sensor signal data set for exploring context recognition of mobile devices,'' in Proc. 2nd Int. Conf. Pervasive Comput., 2004, pp. 18–23.
[59] L. Wei, M. Liao, Y. Gao, R. Ji, Z. He, and Q. Zou, ''Improved and promising identification of human MicroRNAs by incorporating a high-quality negative set,'' IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 11, no. 1, pp. 192–201, Jan. 2014.
[60] Y. Kim, ''Convolutional neural networks for sentence classification,'' in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2014, pp. 1746–1751.
[61] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, ''PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth,'' in Proc. 17th Int. Conf. Data Eng., 2001, pp. 215–224.
[62] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, ''The Weka data mining software: An update,'' ACM SIGKDD Explor. Newslett., vol. 11, no. 1, pp. 10–18, 2009.
[63] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, and B. Thirion, ''Scikit-learn: Machine learning in Python,'' J. Mach. Learn. Res., vol. 12, no. 10, pp. 2825–2830, 2011.

ZENGYOU HE received the B.S., M.S., and Ph.D. degrees in computer science from the Harbin Institute of Technology, China, in 2000, 2002, and 2006, respectively. He was a Research Associate with the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, from February 2007 to February 2010. He is currently a Professor with the School of Software, Dalian University of Technology. His research interests include data mining and bioinformatics.
GUANGYAO XU received the B.S. degree in electronic and information engineering from Dalian Maritime University, China, in 2018. He is currently pursuing the M.S. degree with the School of Software, Dalian University of Technology. His research interest includes data mining and its applications.

BO XU (Member, IEEE) received the B.Sc. and Ph.D. degrees from the Dalian University of Technology, China, in 2007 and 2014, respectively. She is currently an Associate Professor with the School of Software, Dalian University of Technology. Her current research interests include biomedical literature data mining, information retrieval, and natural language processing.

... of Software. His research interest includes data mining and its applications.

... that his more than 100 articles have been cited more than 5000 times. He is the Editor-in-Chief of Current Bioinformatics, an Associate Editor of IEEE ACCESS, and an Editorial Board Member of Computers in Biology and Medicine, Genes, and Scientific Reports. He was selected as one of the Clarivate Analytics Highly Cited Researchers in 2018.