International Journal of Artificial Intelligence & Applications (IJAIA), Vol.3, No.5, September 2012
KEYWORD-DRIVEN SUFFIX ARRAYS FOR
ON-LINE KEYWORD SEARCHING FROM
DOCUMENTS IN CHINESE
Yanhua Zhang
School of Software Engineering of University of Science and
Technology of China
Suzhou, China
cmlzyh@gmail.com
ABSTRACT
On-line keyword searching from documents in Chinese usually relies on inverted indexing as the main
technique, which has its difficulties. The Suffix Array is widely used for processing text in Western
languages, but it has not been widely adopted for Chinese processing because of the special traits of
Chinese. The Suffix Array is a powerful tool; its major bottleneck is that it costs too much space. Based on
observations of on-line search patterns and of the traits of Chinese, this paper proposes a data structure
called Keyword-driven Suffix Array for on-line keyword searching from documents in Chinese. When the
document database is large enough, this data structure reduces the space used by about 5/6 without
sacrificing time efficiency.
KEYWORDS
Chinese word segmentation, inverted index, Suffix Array, information retrieval, time/space efficiency
1. INTRODUCTION
1.1. Inverted-indexing mechanism
Given a document database consisting of a list of documents and a keyword, keyword searching is
to find all occurrences of this word in the documents. Searching with an inverted file is now the
main way of keyword searching from documents in Chinese. As stated in [1], an inverted file
contains, for each term that appears anywhere in the database, a list of the numbers of the
documents containing that term. To build an inverted file for documents in Chinese, word
segmentation techniques must be used, because there is often no delimiter between two characters
in Chinese text.
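As a minimal sketch of the inverted file described above, the following Python fragment maps each term to the list of document numbers containing it. It assumes that the documents have already been segmented into terms, which is exactly the hard step for Chinese. The function name `build_inverted_file` and the sample documents are illustrative assumptions, not taken from [1].

```python
from collections import defaultdict


def build_inverted_file(segmented_docs):
    """Map each term to the list of document numbers that contain it.

    segmented_docs: list of term lists; document numbers start at 1.
    Word segmentation is assumed to have been done beforehand.
    """
    inverted = defaultdict(list)
    for doc_no, terms in enumerate(segmented_docs, start=1):
        for term in set(terms):          # record each document once per term
            inverted[term].append(doc_no)
    return dict(inverted)


# Illustrative, hand-segmented documents (hypothetical data).
docs = [["中国", "人民"], ["中国", "银行"], ["人民", "银行"]]
inverted_file = build_inverted_file(docs)
```

Looking up a keyword is then a single dictionary access, but only if the keyword matches a term produced by the segmenter.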
However, this method fails to work efficiently when the text is in Chinese. Firstly, the efficiency
of searching depends on Chinese word segmentation [2], and Chinese word segmentation has its
own difficulties. For example, ambiguous segmentation is often encountered. Besides, this
method doesn’t work well for unstructured keywords.
DOI : 10.5121/ijaia.2012.3503
1.2. Suffix Array and its bottleneck
Suffix Array is a method for on-line string searching proposed by Udi Manber and Gene Myers [3].
It is a full-text index data structure that indexes all suffixes of the text. Thus a suffix array
can work independently of word segmentation.
Given a keyword K, on-line searching of K in a string S using suffix arrays can be broken into two
phases. One is construction of the suffix array and the other is search.
Algorithms for linear-time construction of suffix arrays have been proposed in [4]-[7].
The second phase is search. The longest common prefixes (lcps) of adjacent elements in the
suffix array can be computed in linear time [8], and the range minimum query problem (RMQ)
can be solved in constant time per query [9]-[11]. When the two are coupled, string searching
can be answered in O(P + log N) time, where P is the length of K and N is the length of S.
Both phases are asymptotically optimal in theory. The fly in the ointment is that suffix arrays
built for large texts cost too much space.
Compressed suffix arrays were proposed subsequently in [12]-[13]. Space efficiency is indeed
improved using compressed suffix arrays. However, most such data structures rely on data
compression. That is, data compression techniques are first used to store the suffix array, and
the corresponding decompression techniques are later used to retrieve information. Thus space is
gained at the cost of time. Besides, compressed suffix arrays are often more complicated to
implement and are not very efficient for large alphabets.
A data structure called Keyword-driven Suffix Array is proposed in this paper for on-line
keyword searching from documents in Chinese, together with its construction algorithm.
Unlike most compressed suffix arrays, this structure aims to remove useless information from
suffix arrays in order to improve space efficiency without sacrificing time efficiency.
What is more, none of the difficulties of the inverted indexing method are encountered with this
new data structure. It could also be used for languages that share the same traits as Chinese.
1.3. Overview
This paper is organized as follows. Section 2 introduces the notation. Section 3 gives a brief
introduction to two basic structures, Suffix Array and Splay Tree, in preparation for section 4.
Section 4 explains the algorithm to build the new data structure called Keyword-driven Suffix
Array. Section 5 gives the conclusion.
2. NOTATION
(1) Arrays index from 0 in the following sections.
(2) Suppose that S denotes a string of length n and that x and y are integers with 0≤x≤y≤n-1.
We denote the substring S[x]S[x+1]…S[y-1]S[y] as S[x..y].
(3) Given two strings, we use the symbol “°” to represent their concatenation. That is,
xy°zw = xyzw, where “x”, “y”, “z” and “w” are all letters.
(4) We use “log” to denote binary logarithm.
(5) We use the symbol “O” to denote the asymptotic upper bound.
3. BRIEF INTRODUCTION TO TWO DATA STRUCTURES
3.1. Introduction to Suffix Array
3.1.1. Definition of Suffix Array
Given a string S of length n, its suffix array SA[0..n-1] is an array of the suffix starting
positions 0..n-1 that satisfies the following inequality:
suffix(SA[i]) ≤ suffix(SA[j]), ∀ 0 ≤ i < j ≤ n − 1    (1)
Here suffix(i) denotes the suffix of S starting from position i, that is, suffix(i) = S[i..n-1].
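Definition (1) can be illustrated with a deliberately naive sketch that sorts the suffix starting positions directly. This is an assumption-level illustration in Python (the function name `build_suffix_array` is ours); it runs in roughly O(n² log n) time and is not a linear-time construction.

```python
def build_suffix_array(s):
    """Return SA such that suffix(SA[i]) <= suffix(SA[j]) for all i < j.

    Naive illustration: sorts positions by comparing whole suffixes.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])


# Suffixes of "banana": a(5) < ana(3) < anana(1) < banana(0) < na(4) < nana(2)
sa = build_suffix_array("banana")
```

Python compares strings by code point, so the same function works unchanged on Chinese text, with the code point playing the role of a character code.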
3.1.2. Linear time suffix array construction
We will use the DC-3 algorithm, which was proposed in 2006 [6], for the construction of suffix
arrays.
3.1.3. Search
Given two strings S1 and S2, we use lcp(S1, S2) to denote the length of their longest common
prefix. The LCP array is the array that satisfies the following condition:
LCP[i] = lcp(suffix(SA[i]), suffix(SA[i − 1])), ∀ 0 < i ≤ n − 1    (2)
In addition, LCP[0] is defined to be 0.
The LCP array can be constructed in O(n) time. Using the LCP array together with Range Minimum
Queries (RMQ), a keyword of length p can be searched in O(p + log n) time.
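The following Python sketch illustrates equation (2) and a suffix-array search. The LCP computation follows the usual linear-time scheme in the style of [8]. The search, however, is a plain binary search that costs O(p log n), a simplification assumed here for brevity; it is not the O(p + log n) LCP/RMQ-accelerated search. All function names are illustrative.

```python
def build_suffix_array(s):
    """Naive suffix array (illustration only, not linear time)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def build_lcp_array(s, sa):
    """LCP[i] = lcp(suffix(SA[i]), suffix(SA[i-1])), with LCP[0] = 0."""
    n = len(s)
    rank = [0] * n
    for i, pos in enumerate(sa):
        rank[pos] = i
    lcp = [0] * n
    h = 0                                   # current common-prefix length
    for pos in range(n):
        if rank[pos] == 0:
            h = 0
            continue
        prev = sa[rank[pos] - 1]            # suffix just before pos in SA
        while pos + h < n and prev + h < n and s[pos + h] == s[prev + h]:
            h += 1
        lcp[rank[pos]] = h
        if h > 0:
            h -= 1                          # next suffix shares at least h-1
    return lcp


def find_range(s, sa, pattern):
    """Inclusive (start, end) range of SA whose suffixes begin with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                          # first suffix with m-prefix >= pattern
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                          # first suffix with m-prefix > pattern
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return (start, lo - 1) if start < lo else None


text = "banana"
sa = build_suffix_array(text)
lcp = build_lcp_array(text, sa)
hit = find_range(text, sa, "an")            # occurrences are sa[hit[0]..hit[1]]
```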
3.2. Introduction to Splay Tree
Splay Tree is a self-adjusting form of binary search tree developed by Daniel Dominic Sleator and
Robert Endre Tarjan [14]. On an n-node splay tree, all the standard search tree operations have
an amortized time bound of O(log n) per operation.
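For readers unfamiliar with splay trees, a compact recursive sketch of the splay step and of INSERT is given here in Python. It is a textbook-style illustration under our own naming (`Node`, `splay`, `insert`), not the implementation used in this paper; SEARCH amounts to splaying on the key and checking whether the root then carries it.

```python
class Node:
    def __init__(self, key, value=None):
        self.key, self.value = key, value
        self.left = self.right = None


def rotate_right(x):
    y = x.left
    x.left, y.right = y.right, x
    return y


def rotate_left(x):
    y = x.right
    x.right, y.left = y.left, x
    return y


def splay(root, key):
    """Move the node with `key` (or the last node on its search path) to the root."""
    if root is None or root.key == key:
        return root
    if key < root.key:
        if root.left is None:
            return root
        if key < root.left.key:                       # zig-zig
            root.left.left = splay(root.left.left, key)
            root = rotate_right(root)
        elif key > root.left.key:                     # zig-zag
            root.left.right = splay(root.left.right, key)
            if root.left.right is not None:
                root.left = rotate_left(root.left)
        return root if root.left is None else rotate_right(root)
    if root.right is None:
        return root
    if key > root.right.key:                          # zig-zig (mirror)
        root.right.right = splay(root.right.right, key)
        root = rotate_left(root)
    elif key < root.right.key:                        # zig-zag (mirror)
        root.right.left = splay(root.right.left, key)
        if root.right.left is not None:
            root.right = rotate_right(root.right)
    return root if root.right is None else rotate_left(root)


def insert(root, key, value=None):
    """Insert (key, value) and return the new root, which holds `key`."""
    if root is None:
        return Node(key, value)
    root = splay(root, key)
    if root.key == key:
        root.value = value
        return root
    node = Node(key, value)
    if key < root.key:
        node.right, node.left, root.left = root, root.left, None
    else:
        node.left, node.right, root.right = root, root.right, None
    return node


def in_order(node):
    return [] if node is None else in_order(node.left) + [node.key] + in_order(node.right)


tree = None
for ch in "badce":
    tree = insert(tree, ch)
tree = splay(tree, "c")          # SEARCH: "c" is present, so it becomes the root
```

Because recently accessed keys move towards the root, frequently searched keys stay cheap to reach, which is the property the construction in section 4 relies on.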
4. KEYWORD-DRIVEN SUFFIX ARRAYS
4.1. Several observations
4.1.1. On-line search pattern
Keywords belong to many fields, such as politics, economics and military affairs. The number of
users varies a lot from one field to another: some fields are more popular than others and
attract more interested users. Besides, each user tends to take interest in only a limited
number of fields. On the other hand, each keyword is often closely associated with one certain
field. Thus some keywords are accessed more frequently than others over a relatively long time
interval.
4.1.2. Traits of Chinese
Most keywords searched by users are technical words or terms in frequent daily use, though their
lengths may vary. To study the ability of each Chinese character to start a keyword, professional
term lists in Chinese for computer science, economics and medicine were first downloaded. Then a
few Chinese characters were chosen randomly, and finally the number of words (or terms) starting
with each of these characters was counted in each list. The result shows that the number of
technical words that begin with a certain Chinese character is often not very large, and that
technical words beginning with a certain Chinese character tend to be closely associated with a
limited number of fields.
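The counting procedure described above can be sketched as follows in Python. The term lists below are tiny made-up examples used only to show the mechanics; they are not the downloaded lists and say nothing about the actual statistics. The function name `count_terms_by_first_char` is an assumption.

```python
from collections import Counter


def count_terms_by_first_char(term_lists, chars):
    """For each field, count the terms that start with each chosen character."""
    return {
        field: Counter(t[0] for t in terms if t and t[0] in chars)
        for field, terms in term_lists.items()
    }


# Hypothetical toy term lists (illustration only).
term_lists = {
    "computer science": ["数据库", "数据结构", "算法", "网络"],
    "economics": ["数量经济学", "通货膨胀", "网络经济"],
}
counts = count_terms_by_first_char(term_lists, {"数", "网"})
```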
4.2. Description of the algorithm for constructing Keyword-driven Suffix Arrays for
on-line keyword searching from documents in Chinese
Based on the observations above, the algorithm for constructing the Keyword-driven Suffix Array
is stated below. The algorithm can be broken into two phases: pre-processing and search.
4.2.1. Pre-processing
0) Construct a dictionary, Dict, for the frequently used Chinese characters. That is, map each
Chinese character to a positive integer code. We use MAXChineseChar to denote the number of
frequently used Chinese characters.
1) Get the source string S.
Concatenate all the documents to get the source string S, in which a keyword K is to be
searched. Suppose that there are s documents in the database D, denoted D1, D2, …, Ds
respectively, and that string S is of size N. Then we have equation (3).
S[0..N − 1] = D = D1 ° D2 ° D3 ° … ° Ds    (3)
2) Construct the suffix array, SA_old, for string S using the DC-3 algorithm. Next, construct the
corresponding LCP array, LCP_old. Samples of the two arrays are shown in figure 1.
3) Construct two arrays, sub_SA and sub_LCP, both indexed from 1 to MAXChineseChar.
For each i (1≤i≤MAXChineseChar), sub_SA[i] and sub_LCP[i] are themselves arrays. sub_SA[i]
stores the starting positions of all suffixes that begin with the character whose Dict code is i,
sorted lexicographically by Dict code. Since sub_SA[i] consists of consecutive items of array
SA_old, its size can be computed using binary search. Array sub_LCP[i] has the same length as
sub_SA[i].
Figure 1. Sample of suffix array and LCP array for source string S
4) Initialize both sub_SA and sub_LCP.
Scan array SA_old item by item from the beginning to the end. For each i (0≤i≤N-1), the
character at position SA_old[i] in string S is S[SA_old[i]] and its Dict code is
Dict[S[SA_old[i]]]. Element SA_old[i] therefore belongs in sub_SA[Dict[S[SA_old[i]]]]: the
first element of array sub_SA[Dict[S[SA_old[i]]]] (respectively sub_LCP[Dict[S[SA_old[i]]]])
that hasn’t been initialized yet is set to SA_old[i] (respectively LCP_old[i]). Repeat this
procedure for all i, where 0≤i≤N-1. The resulting arrays look as figure 2 shows.
Figure 2. Sample of array sub_SA and array sub_LCP
5) Free both arrays SA_old and LCP_old.
6) Construct an array of splay trees, splay[1..MAXChineseChar].
Element splay[i] (1≤i≤MAXChineseChar) is a splay tree corresponding to the character C1
whose Dict code is i. Each node in splay[i] corresponds to a character C2 such that C1°C2°xyz
(“xyz” here represents some string, which may be empty) has been searched as a keyword. Each
node in splay[i] has two fields. One is the character field, which is C2 here. The other is a
pair field (start, end), where start and end are the beginning and ending positions of the
interval Intvl of sub_SA[i] such that the suffix indexed at each position in Intvl begins with
C1°C2. The character field serves as the key for comparison between nodes. Initialize each splay
tree splay[i] to be empty.
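Steps 2) to 5) of the pre-processing phase can be sketched in Python as below. Several simplifications are assumed and are not part of the algorithm as stated: the suffix array is built by naive sorting instead of DC-3, the LCP array by direct comparison of neighbours, and a Python dictionary keyed by the character itself stands in for the arrays indexed by Dict code 1..MAXChineseChar. The name `build_sub_arrays` is illustrative.

```python
def common_prefix_len(s, i, j):
    """Length of the longest common prefix of suffix(i) and suffix(j)."""
    k = 0
    while i + k < len(s) and j + k < len(s) and s[i + k] == s[j + k]:
        k += 1
    return k


def build_sub_arrays(s):
    """Split SA_old / LCP_old into per-first-character sub_SA / sub_LCP."""
    sa_old = sorted(range(len(s)), key=lambda i: s[i:])        # naive, not DC-3
    lcp_old = [0] + [common_prefix_len(s, sa_old[i - 1], sa_old[i])
                     for i in range(1, len(sa_old))]
    sub_sa, sub_lcp = {}, {}
    for pos, lcp in zip(sa_old, lcp_old):                      # one scan of SA_old
        first_char = s[pos]
        sub_sa.setdefault(first_char, []).append(pos)
        sub_lcp.setdefault(first_char, []).append(lcp)
    return sub_sa, sub_lcp                                     # SA_old, LCP_old go out of scope


source = "中国人民中国银行"
sub_sa, sub_lcp = build_sub_arrays(source)
```

Because suffixes with the same first character are consecutive in SA_old, a single left-to-right scan fills every bucket in sorted order, and the first LCP value of each bucket is 0.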
4.2.2. Search
Figure 3. Sample of searching of keyword
The search phase is illustrated below. It works as figure 3 shows.
Suppose that the keyword K being searched consists of m characters, and denote it as K[0..m-1].
To search for K, first locate the splay tree splay[Dict[K[0]]], which corresponds to the
character K[0]. Then search this tree for a node whose character field equals K[1], using the
SEARCH operation for Splay Tree.
a) If such a node exists, suppose that the value of its (start, end) field is (s, e). Then,
following the suffix-array-based search algorithm stated in section 3.1.3, the remaining
substring of K (K[2..m-1]) can be searched using sub_SA[Dict[K[0]]][s..e] and
sub_LCP[Dict[K[0]]][s..e].
b) If such a node doesn’t exist, the substring K[0]°K[1] can be searched using
sub_SA[Dict[K[0]]] and sub_LCP[Dict[K[0]]], which yields the starting and ending positions s
and e. Next, construct a new node P with K[1] as its character field and (s, e) as its
(start, end) field. Finally, insert node P into splay[Dict[K[0]]] using the INSERT operation
for Splay Tree. Then search the remaining part of K as in case a).
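Cases a) and b) can be sketched as a small Python class. To keep the example short, a nested dictionary stands in for the array of splay trees, the search is a plain binary search without LCP/RMQ acceleration, and one-character or absent keywords are handled in an ad hoc way; all of these are assumptions of the sketch, as are the names `KeywordIndex`, `nodes` and `_range`.

```python
class KeywordIndex:
    """Toy version of the search phase before the trees become stable."""

    def __init__(self, s):
        self.s = s
        self.sub_sa = {}                  # first character -> sorted suffix positions
        for pos in sorted(range(len(s)), key=lambda i: s[i:]):
            self.sub_sa.setdefault(s[pos], []).append(pos)
        self.nodes = {}                   # K[0] -> {K[1]: (start, end)}; stand-in for splay[]

    def _range(self, bucket, pattern, lo, hi):
        """Inclusive range within bucket[lo..hi] of suffixes starting with pattern."""
        m = len(pattern)

        def prefix(idx):
            return self.s[bucket[idx]:bucket[idx] + m]

        a, b = lo, hi + 1
        while a < b:                      # first index with prefix >= pattern
            mid = (a + b) // 2
            if prefix(mid) < pattern:
                a = mid + 1
            else:
                b = mid
        start, b = a, hi + 1
        while a < b:                      # first index with prefix > pattern
            mid = (a + b) // 2
            if prefix(mid) <= pattern:
                a = mid + 1
            else:
                b = mid
        return (start, a - 1) if start < a else None

    def search(self, k):
        """Return the sorted starting positions of keyword k in the source string."""
        bucket = self.sub_sa.get(k[0]) if k else None
        if bucket is None:
            return []
        if len(k) == 1:
            return sorted(bucket)
        tree = self.nodes.setdefault(k[0], {})
        if k[1] not in tree:              # case b): locate K[0]K[1], then remember it
            found = self._range(bucket, k[:2], 0, len(bucket) - 1)
            if found is None:
                return []
            tree[k[1]] = found
        s, e = tree[k[1]]                 # case a): search only inside [s..e]
        found = self._range(bucket, k, s, e)
        return [] if found is None else sorted(bucket[found[0]:found[1] + 1])


index = KeywordIndex("中国人民中国银行中间")
first = index.search("中国")              # case b): creates the node for "国"
second = index.search("中国银行")          # case a): reuses that node
```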
As many keywords are searched, each splay tree expands quickly. Once the splay trees are large
enough, each of them becomes stable, which means that keywords searched later will probably be
found in their corresponding splay tree.
At this point, the (start, end) field of each node of all splay trees can be replaced by a pair
field named (SA, LCP), whose values are sub_SA[start..end] and sub_LCP[start..end]
respectively. The arrays sub_SA[1..MAXChineseChar] and sub_LCP[1..MAXChineseChar] are then of no
further use, and both of them can be freed.
Figure 4 gives a sample for splay trees in stable state.
Figure 4. Sample of splay trees in stable state
This paper calls suffix arrays formed in this way Keyword-driven Suffix Arrays.
How, then, is a keyword K searched after the splay trees become stable?
To search for K, first locate the splay tree splay[Dict[K[0]]], which corresponds to the
character K[0]. Then search this tree for a node whose character field equals K[1], using the
SEARCH operation for Splay Tree.
a) When such a node is found, name it Q. The remaining substring of K (K[2..m-1]) can be
searched using the value of the (SA, LCP) field of node Q.
b) Otherwise, construct a new node P with K[1] as its character field. Then find all starting
positions of occurrences of the substring K[0]°K[1] in string S using the KMP algorithm [15].
Next, sort all suffixes starting at the positions just obtained, using multi-key quicksort.
This gives the subfield SA of node P; the subfield LCP can then be computed by comparing
adjacent items of SA. Using the two fields, all occurrences of K can be obtained. To maintain
the corresponding splay tree splay[Dict[K[0]]], node P should first be inserted into it. Then
use a rotation operation to make node P the parent of node F, which was the parent of P before
the rotation. Then delete node F and use the SPLAY operation to move node P to the root.
Note: case a) occurs far more often than case b).
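The construction of the (SA, LCP) field in case b) can be sketched in Python as follows. Two substitutions are assumed for brevity: repeated calls to `str.find` stand in for the KMP algorithm, and the built-in `sorted` stands in for multi-key quicksort. The function name `build_node_fields` is illustrative, and the maintenance of the splay tree itself is not shown.

```python
def common_prefix_len(s, i, j):
    """Length of the longest common prefix of suffix(i) and suffix(j)."""
    k = 0
    while i + k < len(s) and j + k < len(s) and s[i + k] == s[j + k]:
        k += 1
    return k


def build_node_fields(s, two_chars):
    """Return (SA, LCP) restricted to the suffixes of s that begin with two_chars."""
    positions = []
    pos = s.find(two_chars)               # stand-in for KMP matching
    while pos != -1:
        positions.append(pos)
        pos = s.find(two_chars, pos + 1)
    sa = sorted(positions, key=lambda p: s[p:])   # stand-in for multi-key quicksort
    lcp = [0 if i == 0 else common_prefix_len(s, sa[i - 1], sa[i])
           for i in range(len(sa))]       # compare adjacent items of SA
    return sa, lcp


source = "中国人民中国银行中间"
node_sa, node_lcp = build_node_fields(source, "中国")
```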
Experiments show that the splay trees become stable when the size of each splay tree reaches 167
on average (the height of each splay tree is then 8 on average). The hit rate then reaches 95%,
which can be viewed as a threshold value.
Suppose that there are 5,000 frequently used Chinese characters. According to statistics, at
most one in six suffixes begins with a frequently searched keyword. Then the length of both SA
and LCP for each node in the splay trees is at most
N × (1/6) × (1/5000) × (1/167) ≈ N / 5000000    (4)
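The arithmetic behind equation (4) can be checked with a few lines of Python. The exact denominator is 6 × 5000 × 167 = 5,010,000, which the text rounds to 5,000,000; the variable and function names are illustrative.

```python
from fractions import Fraction

frequent_share = Fraction(1, 6)   # share of suffixes that start a frequent keyword
num_chars = 5000                  # frequently used Chinese characters (one tree each)
nodes_per_tree = 167              # average tree size in stable state

per_node_fraction = frequent_share / num_chars / nodes_per_tree


def entries_per_node(n):
    """Upper bound from equation (4) on the length of SA (or LCP) in one node."""
    return float(n * per_node_fraction)
```

For a source string of N = 10^9 characters, for instance, `entries_per_node(10**9)` is about 200 entries per node.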
4.3. Performance analysis
4.3.1. Time efficiency
i. Pre-processing
Time used in this phase is O(N).
ii. Search
a) Before splay trees become stable
Time used for searching is O(m + log(N/5000)).
b) After splay trees become stable
Time used for searching is about O(m + log(N/5000000)).
To keep the analysis below concise, we count CPU operations. The “m” term and the
multiplication factors within the logarithm can be neglected when N is large enough. The number
of CPU operations for maintaining a splay tree can be taken as 8, since each splay tree has a
height of 8 on average.
Thus when N is large enough, for 95 percent of keyword searching requests the number of CPU
operations using this method is log(N/5000000) + 8. That is less than log N, which is the
number of CPU operations using Suffix Array.
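The comparison above can be made concrete with a small Python check, under the same simplified operation counts used in the text. Since log(N/5000000) + 8 = log N − log 5000000 + 8 and log 5000000 ≈ 22.25, the difference is about 14.25 operations regardless of N. The function names are illustrative.

```python
import math


def ops_keyword_driven(n):
    """Simplified count from the text: log2(N / 5,000,000) + 8."""
    return math.log2(n / 5_000_000) + 8


def ops_suffix_array(n):
    """Simplified count from the text for a plain suffix array: log2(N)."""
    return math.log2(n)


# Saving in operations, independent of N (about 14.25).
saving = math.log2(5_000_000) - 8
```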
4.3.2. Space efficiency
a) Pre-processing
Space used is O(N), the same as with Suffix Array.
b) Search
Extra space occupied by the pointers of the splay tree nodes can be neglected when N is large
enough.
1) Before splay trees become stable, space used is O(N), the same as with suffix arrays.
2) After splay trees become stable, the space occupied by the (SA, LCP) fields of all nodes in
all splay trees is about 1/6 of the O(N) space used before.
Space used by this data structure is therefore decreased by about 5/6 after the splay trees
become stable.
5. CONCLUSION
Space efficiency is improved using this new data structure because it is more strongly structured
than Suffix Array.
ACKNOWLEDGEMENTS
Firstly, I would like to show my deepest gratitude to Mr. Chenxi Shao and Mr. Liusheng Huang
for their professional guidance. Secondly, I appreciate Wei Zhou for his help with my English
during these years. Finally, I thank my mother and my father, though he has been gone for 9
years.
REFERENCES
[1] Alistair Moffat & Justin Zobel, (1996) “Self-indexing inverted files for fast text retrieval”, ACM
Transactions on Information Systems, Vol. 14, No. 4, pp. 349-379.
[2] Richard Sproat & Thomas Emerson, (2003) “The First International Chinese Word Segmentation
Bakeoff”, in Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, pp.
133-143.
[3] Udi Manber & Gene Myers, (1993) “Suffix arrays: A new method for on-line string searches”, SIAM
Journal on Computing, Vol. 22, No. 5, pp. 935-948.
[4] Dong Kyue Kim, Jeong Seop Sim, Heejin Park & Kunsoo Park, (2003) “Linear-Time Construction of
Suffix Arrays”, in Proceedings of the 14th Annual Conference on Combinatorial Pattern Matching,
pp. 186-199.
[5] Pang Ko & Srinivas Aluru, (2003) “Space Efficient Linear Time Construction of Suffix Arrays”, in
Proceedings of the 14th Annual Conference on Combinatorial Pattern Matching, pp. 200-210.
[6] Juha Kärkkäinen, Peter Sanders & Stefan Burkhardt, (2006) “Linear work suffix array
construction”, Journal of the ACM (JACM), Vol. 53, No. 6, pp. 918-936.
[7] Ge Nong, Sen Zhang & Wai Hong Chan, (2009) “Linear Suffix Array Construction by Almost Pure
Induced-Sorting”, in Data Compression Conference, pp. 193-202.
[8] T. Kasai, G. Lee, H. Arimura, S. Arikawa & K. Park, (2001) “Linear-time longest-common-prefix
computation in suffix arrays and its applications”, in Proceedings of the 12th Annual
Symposium on Combinatorial Pattern Matching, pp. 181-192.
[9] Michael A. Bender & Martin Farach-Colton, (2000) “The LCA problem revisited”, in Proceedings of
the 4th Latin American Symposium on Theoretical Informatics, pp. 88-94.
[10] Johannes Fischer & Volker Heun, (2006) “Theoretical and Practical Improvements on the
RMQ-Problem, with Applications to LCA and LCE”, in Proceedings of the 17th Annual Conference on
Combinatorial Pattern Matching, pp. 36-48.
[11] Erik Demaine, Gad Landau & Oren Weimann, (2009) “On Cartesian Trees and Range Minimum
Queries”, in Proceedings of the 36th International Colloquium on Automata, Languages and
Programming, pp. 341-353.
[12] Roberto Grossi & Jeffrey Scott Vitter, (2005) “Compressed suffix arrays and suffix trees with
applications to text indexing and string matching”, SIAM Journal on Computing, Vol. 35, No. 2, pp.
378-407.
[13] Roberto Grossi, (2011) “A quick tour on suffix arrays and compressed suffix arrays”, Theoretical
Computer Science, Vol. 412, No. 27, pp. 2964-2973.
[14] Daniel Dominic Sleator & Robert Endre Tarjan, (1985) “Self-adjusting binary search trees”, Journal
of the ACM (JACM), Vol. 32, No. 3, pp. 652-686.
[15] Donald Knuth, James H. Morris & Vaughan Pratt, (1977) “Fast Pattern Matching in Strings”, SIAM
Journal on Computing, Vol. 6, No. 2, pp. 323-350.
Authors
I am now a student at the School of Software Engineering of University of Science and Technology of
China. I graduated from Hebei Polytechnic University in July 2004 with a bachelor’s degree in
computer science and technology. From 2004 to 2007 I worked at Hebei Datang Information and
Technology Company Ltd. During these years I began to take an interest in algorithm design and analysis,
and I read several famous books such as Introduction to Algorithms to learn algorithms comprehensively
and systematically. To get more professional guidance, I entered the School of Software Engineering of
University of Science and Technology of China in July 2010. I first came to know suffix arrays about 1 year
ago. I read several papers about them and realized that, with a little change, they could be improved for
on-line searching from Chinese documents. Thus I wrote this paper.