A Bulk-Loading Algorithm for the BoND-Tree Index Scheme for Non-ordered Discrete Data Spaces

Proc. of 25th International Conf. on Software Engineering and Data Engineering (SEDE'16), Denver, Sept. 26 - 28, 2016
Dong-Yoon Choi†, AKM Tauhidul Islam†, Sakti Pramanik† and Qiang Zhu‡
† Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
‡ Computer and Information Science, University of Michigan, Dearborn, MI, USA

Abstract

Recent years have witnessed an increasing demand to process queries on large datasets in Non-ordered Discrete Data Spaces (NDDS) from numerous applications. A number of index trees have been proposed in the literature to support efficient queries on large datasets in an NDDS. However, the conventional tuple-loading method for building these index trees takes too much time. Although numerous bulk-loading techniques have been proposed to efficiently build index trees for large datasets in Continuous Data Spaces (CDS), limited work has been done to efficiently bulk-load an index tree for a large dataset in an NDDS. In this paper, we present a bulk-loading method to efficiently build the recently proposed BoND-tree for a large dataset in an NDDS. A number of effective strategies have been incorporated in the method to facilitate the efficient bulk-loading process. Experimental results demonstrate that the proposed bulk-loading method is quite promising in efficiently building a BoND-tree for a large non-ordered discrete dataset.

Keywords: Non-ordered discrete data space, multidimensional index, bulk loading, box query, algorithm.

1 Introduction

In recent years, there has been an increasing demand to process queries on large multidimensional non-ordered discrete datasets from numerous application domains such as bioinformatics, network security, and social media. For example, over the last decade, various sequence analysis approaches in bioinformatics have been developed that make use of fixed-length strings/subsequences, called k-mers, from genome sequences [1, 2]. A k-mer such as “aggctcaa” (an 8-mer) can be viewed as a vector in a k-dimensional Non-ordered Discrete Data Space (NDDS), where the value for each dimension comes from alphabet Ω = {a, g, t, c}. There is no natural ordering among elements/letters in Ω. Various sequence analysis problems such as sequencing error correction [3, 4, 5] and back translated peptide k-mer search and local alignment [6] can be solved by processing queries on large k-mer datasets. To efficiently process queries on such large datasets in NDDSs, an effective index method is required.

Several index schemes for an NDDS have been proposed recently. The ND-tree [7] is a data-partitioning based index method developed to support efficient similarity queries (e.g., range queries and k-nearest neighbor queries) in an NDDS. The NSP-tree [8] is a space-partitioning based index method developed to also support efficient similarity queries in an NDDS. The BoND-tree [9] is the most recent index method developed to support efficient box queries in an NDDS. Each of the above index methods adopts a different set of heuristics to build an index structure that is most suitable for its respective query type and assumed scenarios. Although these index methods can support efficient queries, using the conventional tuple-loading (TL) approach that comes with the methods to build an index tree for a large dataset often takes too much time. Efficient index tree building techniques are desired.

A number of so-called bulk-loading techniques have been suggested to efficiently build multidimensional index structures such as the R-tree [10] and its variants [11] in Continuous Data Spaces (CDS). One set of such techniques [12, 13, 14] applies a bottom-up approach. The basic idea is to sort all the input vectors according to a chosen dimension, place them in the leaf nodes in that order, and build the target index tree recursively from the bottom up. Another set of techniques [15, 16] adopts a top-down approach. The basic idea is to sort the input vectors according to a chosen dimension, divide them into a number of subsets of approximately equal size, put each subset into a subtree of the root, and build the target index tree recursively from the top down. Unfortunately, these sorting-based techniques cannot be directly applied to build an index tree for NDDSs since no ordering exists for the values on each dimension of such a space.

Another type of so-called generic bulk-loading techniques [17, 18] utilizes some operations (e.g., InsertIntoNode, ChooseSubtree and Split) provided by the conventional TL algorithm to bulk-load the target tree. A representative algorithm of this type, termed GBLA
[18], adopts a buffer-based approach. The key idea is to first build a temporary structure, called the buffer tree (whose nodes are called the index nodes). Each index node is associated with a buffer. Vectors are first inserted into the buffer b associated with the desired index node n. When buffer b is full, it is cleared and its vectors are pushed into the buffers of the child index nodes of n. Only the relevant index node and buffer pages are kept in memory. The first buffer tree is used to appropriately place all the input vectors into their leaf nodes of the target index tree. A second buffer tree is then built to appropriately arrange all the leaf nodes into their parent nodes of the target index tree, and so on. The target index tree is, therefore, built in a bottom-up fashion. Although the GBLA algorithm can be applied to any index structure that supports the required operations, it is typically not optimized for the target index structure. Proper modifications are needed to achieve better efficiency.

Limited work has been done on bulk-loading index trees in NDDSs. A bulk-loading algorithm [19], called NDTBL, was proposed for the ND-tree, which is essentially an improved version of GBLA. A bulk-loading algorithm [20], called NSPBL, was suggested for the NSP-tree, which adopts a bottom-up space splitting strategy based on histograms that record the space distribution of relevant vectors. However, no bulk-loading technique has been introduced for the most recent BoND-tree in NDDSs.

In this paper, we introduce a bulk-loading algorithm to efficiently build a BoND-tree in NDDSs. Our method was inspired by GBLA [18], but is optimized for the BoND-tree in several ways: (1) buffers are associated with the nodes of the target index tree directly instead of employing an auxiliary buffer tree; (2) all the buffers are kept in main memory; (3) the buffers of non-leaf index nodes directly above leaf index nodes can be larger than the buffers of non-leaf nodes at levels above; (4) auxiliary buffers are used to bulk-check vectors for duplicates before insertion; (5) vectors are inserted into leaf nodes in groups. Experiments demonstrate that the proposed bulk-loading BoND-tree algorithm is efficient compared to the conventional TL algorithm.

The rest of this paper is organized as follows. Section 2 gives an overview of relevant concepts that are needed for the discussion of this work. Section 3 discusses the details of the proposed bulk-loading method. Section 4 reports our experimental results. Section 5 concludes the paper.

2 Preliminaries

In general, a d-dimensional NDDS $\Omega_d$ is defined as a Cartesian product: $\Omega_d = A_1 \times A_2 \times \cdots \times A_d$, where $A_i$ ($1 \le i \le d$) is the alphabet or domain for the i-th dimension, which consists of a finite number of letters with no natural ordering among them. A discrete box/rectangle R in $\Omega_d$ is defined as $R = B_1 \times B_2 \times \cdots \times B_d$, where $B_i \subseteq A_i$. The area of rectangle R is defined as $|R| = \prod_{i=1}^{d} |B_i|$, where cardinality $|B_i|$ is called the edge length or span of R along the i-th dimension. For a set of vectors, their discrete minimum bounding rectangle/box (DMBR) is the smallest rectangle/box that contains all the vectors. A (discrete) box query q on a dataset S in an NDDS is defined as a query with a specified box R that returns all the vectors from S that lie within R.

The structure of the BoND-tree is similar to that of the R*-tree [11], except that the discrete geometric concepts (e.g., discrete rectangle/box) of an NDDS are used. More specifically, a BoND-tree satisfies the following two requirements: (1) every non-leaf node has between m and M children unless it is the root (which may have a minimum of two children in this case); (2) every leaf node contains between m and M entries unless it is the root (which may have a minimum of one entry/vector in this case). A leaf node contains an array of entries of the form (op, key), where key is a vector in the NDDS and op is a pointer to the object represented by key in the database. A non-leaf node N contains an array of entries of the form (cp, DMBR), where cp is a pointer to a child node $N_i$ of N and DMBR is the DMBR of $N_i$.

More concepts about the NDDS and the BoND-tree can be found in [7, 8, 21].

3 Bulk-Loading the BoND-Tree

In this section, we introduce a BoND-tree bulk-loading algorithm, called BoNDTBL, which was inspired by the generic bulk-loading algorithm GBLA [18]. Figure 1 presents an example illustrating the bulk-loading process for the BoND-tree using BoNDTBL.

Figure 1: An example of bulk-loading the BoND-tree

3.1 Key Idea

The basic idea of our algorithm is as follows. Each node of the target BoND-tree under construction is saved on disk. Each non-leaf node has an associated buffer. A buffer consists of several pages. Each page of the buffer can store at most B vectors. Hence, a buffer
with P pages can store at most B ∗ P vectors. Instead of keeping most of the buffer pages associated with each non-leaf node on disk as in GBLA, all pages of such auxiliary buffers attached to the non-leaf nodes of the BoND-tree are kept in memory until the construction of the tree is complete. Auxiliary buffers associated with non-leaf nodes of the BoND-tree that are directly above the leaf nodes are allowed to contain more buffer pages than the auxiliary buffers of non-leaf nodes at levels above. This is because a majority of the I/Os spent on constructing the BoND-tree comes from inserting vectors into the leaf nodes. For the rest of the paper, we will consider P′ pages for buffers right above the leaf level, where P′ ≥ P. In order to reduce the number of attempts to insert duplicate vectors into the BoND-tree, duplicate checking is performed before inserting vectors into the index tree. However, using the conventional duplicate checking method, which checks one by one whether each vector is already present in the index tree before insertion (as done in tuple-loading the BoND-tree), is costly and significantly slows down the construction of the BoND-tree. To make this process more efficient, auxiliary buffers are used to bulk-check vectors for duplicates in groups before inserting vectors into the root buffer. Once an auxiliary buffer associated with a non-leaf node directly above the leaf nodes becomes full, the input vectors are first grouped using the findBestChild function (an existing operation for the BoND-tree) and then inserted into the leaf nodes.

3.2 Main Components

Our bulk-loading algorithm starts with the following function, which filters duplicate input vectors before the bulk-loading process and inserts distinct input vectors into the BoND-tree.

Function 3.1. StartBulkLoading
Input: A vector V in a d-dimensional NDDS
    if (FilterBuffer is full) then
        uniqVectors := BulkDuplicateCheck(FilterBuffer)
        for each vector V in uniqVectors do
            InsertVector(Root, V)
        end for
    else
        InsertIntoBuffer(FilterBuffer, V)
    end if

Function StartBulkLoading initiates the duplicate checking process before vectors are inserted into the root buffer, to ensure that each vector reaching the root of the BoND-tree is unique. This routine simply gathers input vectors into a special auxiliary buffer called FilterBuffer until it reaches capacity. Once FilterBuffer reaches capacity, all vectors in FilterBuffer are passed into function BulkDuplicateCheck. It returns a set of distinct vectors, which are then inserted into the BoND-tree. Because the algorithm for BulkDuplicateCheck is similar to the bulk-loading process described in this section, the details of the algorithm are not included.

The following function inserts the distinct vectors passed in from StartBulkLoading to begin bulk-loading the BoND-tree.

Function 3.2. InsertVector
Input: A vector V in a d-dimensional NDDS
    if (R buffer is full) then
        newSiblings := ClearBuffer(R)
    end if
    if (!empty(newSiblings)) then
        Create a new root with R and newSiblings as children
        Create a new buffer for new root and store in BufferFrame
    else
        InsertIntoBuffer(R buffer, UV)
    end if

Function InsertVector inserts a unique vector into the root buffer R until its capacity is reached. Once the root buffer reaches capacity, all vectors in the root buffer are cleared to the level below by calling ClearBuffer. ClearBuffer returns new siblings of the current root node if a split occurred during the clearing process. If one or more new siblings were created, a new root node is created, and an associated buffer is also created and stored in a main-memory list of buffers called BufferFrame.

Function 3.3. ClearInternalBuffer
Input: Buffer Entries Associated to a Non-leaf Node N
Output: newSiblings
    overflowList := {}
    for each of B ∗ P vectors in buffer of N do
        BestChild := ChooseSubtree(N, UV)
        ChildBuffer := BufferFrame(BestChild)
        InsertIntoBuffer(ChildBuffer, UV)
        if ChildBuffer overflows then
            add BestChild to overflowList
        end if
    end for
    newChildren := {}
    for each Child in overflowList do
        add ClearBuffer(Child) to newChildren
    end for
    newSiblings := InsertChildren(N, newChildren)
    Store each new buffer created for a new sibling in BufferFrame
    return newSiblings

Whenever a buffer gets full, its entries are directed to the corresponding child nodes. A simple approach to clearing the entries would be to determine the most suitable child for each buffer entry in a first-come, first-served (FCFS) manner. However, this technique may require more I/Os while clearing entries to the leaf nodes, as the same leaf node may need to be accessed more than once. To optimize the I/O cost, we first determine the best child node for each buffer entry right above the leaf level and then group the entries based on the child DMBR index. The corresponding leaf nodes are then accessed only once to insert the vectors. This is referred to as optimized buffer clearing in the following text.

Function ClearBuffer distinguishes between non-leaf nodes and leaf nodes. Function ClearInternalBuffer
clears the vectors from an overflowing buffer of a non-leaf node. It is similar to the relevant procedure described in GBLA with a few changes. As in GBLA, ClearInternalBuffer clears B ∗ P vectors in the buffer of node N. However, there is no need to read in the pages of the buffer from disk as they are already present in main memory in BufferFrame for our algorithm. ClearInternalBuffer calls the ChooseSubtree function (an existing operation for the BoND-tree) for each vector to assign the vector to one of the children of N and inserts the vector into the buffer of the desired child node. The desired child node is determined by applying a set of heuristics such as minimum overlap enlargement, minimum area enlargement, and minimum area. To break a tie between two children, the heuristics are applied in the order they are presented. If there is still a tie after all the heuristics are applied, a child is chosen randomly. If the buffer of a child node overflows, the child is added to overflowList. It is important to note, however, that it can still receive vectors from the buffer of N. Once B ∗ P vectors have been cleared, ClearBuffer executes either ClearInternalBuffer or ClearLeafBuffer for each child node in overflowList, based on the buffer level in the tree. Any new children created from this clearing process are inserted into N, and if N splits, each new buffer created for a new sibling of N is stored in BufferFrame. The ClearInternalBuffer function returns the newSiblings of N.

Function 3.4. ClearLeafBuffer
Input: Buffer Entries Associated to a Leaf Node N
Output: newSiblings
    newSiblings := {}
    for each unique vector (UV) in buffer of N do
        DataNodeNumber := ChooseSubtree(N, UV)
        (* DataNodeNumber is temporarily stored with UV *)
    end for
    Sort UVs in buffer by desired DataNodeNumber
    for each group of UVs in buffer of N with the same desired DataNodeNumber do
        insert the UV into the desired DataNode of N
        if DataNode overflows then
            Split(DataNode) (* update entries of N *)
            if overflow in N then
                NewSibling := Split(N)
                create buffer for NewSibling & store in BufferFrame
                (* split buffer of N with new buffer of NewSibling *)
                add NewSibling to newSiblings
                add ClearLeafBuffer(N) to newSiblings
                add ClearLeafBuffer(NewSibling) to newSiblings
            end if
        end if
    end for
    return newSiblings

Function ClearLeafBuffer inserts the vectors in the buffer of node N into the desired leaf nodes in groups. For this to happen, the vectors in the buffer of N are first sorted based on the block number of the desired leaf node. Once the vectors are grouped by desired leaf node, each group of vectors is inserted into the relevant leaf node. In case of a leaf node split (via an existing operation Split for the BoND-tree), the DMBR of the new leaf node is inserted into N and the entries of N are updated. If all groups of vectors are inserted into the desired data pages without causing node N to split, function ClearLeafBuffer returns newSiblings as the empty set. However, during the insertion process the splitting of a leaf node may cause node N to split. In this case, a new sibling of node N and a new buffer for this new sibling are created. The block number of the new non-leaf node is added to newSiblings, and the remaining vectors in the buffer of N are then split between the buffer of N and the buffer of its new sibling. Then the vectors in the buffers of both N and the new sibling are cleared by calling ClearLeafBuffer on each sibling node, and any new sibling node resulting from this clearing process is also added to newSiblings. After clearing the buffers of both node N and its new sibling, function ClearLeafBuffer returns the set of new siblings created during the leaf buffer clearing process.

Figure 2: I/O comparison of buffer clearing techniques

4 Experiments

To examine the performance of the proposed bulk-loading method BoNDTBL, we conducted extensive experiments. The performance was evaluated in terms of index creation time and I/O, as well as box query search I/O. BoNDTBL was implemented in the C++ programming language. All the experiments were conducted in a virtual Linux environment (2.6 GHz Intel Core i5 CPU, 1 GB RAM and 20 GB storage) hosted on a MacBook with a 2.6 GHz Intel Core i5 CPU, 8 GB RAM and a 120 GB hard drive. The k-mers in the test database are generated from bacteria.105.1.genomic.fna sequences. Box queries are created from a set of Nitrogen Reductase (Nirk) protein sequences. The performance data was measured as the average of three executions. In the remaining discussions, we use BL BoND-tree and TL BoND-tree to denote the BoND-trees constructed by the bulk-loading and tuple-loading methods, respectively.

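To make the data preparation above and the definitions from Section 2 concrete, the following is a minimal Python sketch, not the authors' C++ implementation. It extracts k-mers over the alphabet {a, g, t, c} from a sequence string, computes a DMBR and its area, and evaluates a box query by a linear scan rather than through a BoND-tree. The function names (kmers, dmbr, area, box_query) are hypothetical and chosen only for illustration.

```python
ALPHABET = frozenset("agtc")  # the alphabet Ω from Section 1


def kmers(sequence, k):
    """Return all length-k substrings that use only letters of ALPHABET."""
    seq = sequence.lower()
    found = []
    for i in range(len(seq) - k + 1):
        candidate = seq[i:i + k]
        # Skip windows containing ambiguous symbols such as 'n'.
        if set(candidate) <= ALPHABET:
            found.append(candidate)
    return found


def dmbr(vectors):
    """Discrete minimum bounding rectangle: one letter set per dimension."""
    return [set(column) for column in zip(*vectors)]


def area(box):
    """Area |R| of a discrete box: the product of its edge lengths |B_i|."""
    result = 1
    for component in box:
        result *= len(component)
    return result


def box_query(dataset, box):
    """Return the vectors of dataset that lie within box (linear scan)."""
    return [
        vector
        for vector in dataset
        if len(vector) == len(box)
        and all(letter in allowed for letter, allowed in zip(vector, box))
    ]
```

For the sequence "aggctcaag" (the 8-mer from Section 1 followed by one extra letter) and k = 8, kmers yields "aggctcaa" and "ggctcaag". Their DMBR has edge length 2 in six dimensions and 1 in the other two, so its area is 64. A hand-written box whose first component is {a} returns only "aggctcaa", since the second 8-mer starts with g.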
4.1 Clearing Buffer Entries

We experimented with different techniques of clearing buffer entries. Figure 2 presents the comparison of the FCFS and optimized buffer clearing techniques on indexes of 4M 15-mer vectors for different buffer sizes. It shows that optimized buffer clearing reduces I/O cost significantly for any buffer size. As the buffer size increases, the optimized approach reduces I/O proportionately because more vectors can be inserted into child nodes at once.

4.2 Buffer Size Effect

The buffer size used in creating the BL BoND-tree was determined by constructing indexes using different buffer sizes and picking the buffer size that constructed the tree in the least amount of time. The buffer size is defined by the total number of vectors it can hold. Figure 3 shows the average BL BoND-tree construction time for a dataset of 20 million 15-mer vectors using different buffer sizes. Increasing the buffer size involves a trade-off between reduced time for I/Os and increased processing time for grouping vectors. Our experiments showed that a buffer size of 280 entries required the lowest tree construction time for the datasets. We also experimented with datasets of different sizes to validate the optimum buffer size. The difference in index construction time over a range of buffer sizes increases with increasing dataset size. However, in Figure 3, we show the index construction time of only one dataset so that the effect of buffer sizes can be seen explicitly.

Figure 3: Buffer size effect on index construction time (plot "BL BoND-tree Buffer Size vs. Time"; x-axis: Buffer Size (# Entries), 100 to 400; y-axis: Index Creation Time (min), 55 to 62)

4.3 Index Creation Time and I/O

We compared the index creation performance of the BL BoND-tree algorithm and the TL BoND-tree approach. The evaluation was based on the number of disk I/Os and the construction time in minutes. The disk block size was set to 4 KB. The minimum space utilization for a disk block was set to 30%. Figures 4(a) and 4(b) show the number of I/Os and the time in minutes, respectively, needed to construct the BoND-trees for datasets of different sizes using the tuple-loading and bulk-loading methods.

Figure 4: Comparison of Bulk-loading and Tuple-loading BoND-tree index construction a) I/Os and b) time

Both figures show that the proposed bulk-loading algorithm for the BoND-tree outperformed the conventional tuple-loading algorithm in terms of disk I/Os, which translates to better construction time. For example, BoNDTBL with bulk duplicate-checking was able to reduce construction time by 51% when loading 20 million genomic vectors in our experiments. The bulk-loading algorithm consistently performed better with an increasing database size, which shows that our strategies of keeping buffers in main memory, using auxiliary buffers for bulk-checking duplicates, and applying the optimized buffer clearing technique for leaf nodes were effective.

4.4 Quality of the Index

We also compared the quality of BL BoND-trees with that of TL BoND-trees by examining box query search performance. The evaluation was based on the box query search I/Os. We generated 1000 15-mer box queries from a set of Nirk query sequences and executed the queries on both 15-mer BL BoND-trees and TL BoND-trees. Table 1 shows the query search I/Os for different index sizes. Although the bulk-loading and tuple-loading methods may construct different index structures, the resulting k-mer vectors are exactly the same, which shows the correctness of the proposed method: the same set of singleton k-mer vectors is stored at the leaf level, and the box queries search all the relevant paths in the index. Moreover, the numbers of I/Os for both trees are almost the same. Hence, the qualities of the trees are comparable.

Table 1: Comparison of Box-query Search I/Os between BoNDTBL and BoND-tree

k-mers (M)   k-mer Hits   # Box Query I/Os
                          BL BoND-tree   TL BoND-tree
4            2006         4516           4516
20           9492         4695           4679
36           13249        4742           4734
52           16675        4794           4786

5 Conclusions

Although many bulk-loading techniques have been proposed to construct index trees in CDSs, limited similar work has been done specifically for NDDSs. In this paper, we have presented a bulk-loading algorithm to bulk-load the BoND-tree for large datasets in NDDSs. The algorithm incorporates various effective strategies, including the use of available memory to store all buffers associated with non-leaf nodes, an efficient duplicate bulk-checking mechanism, and the grouping of entries before inserting them into leaf nodes of the index tree. Experiments show that the proposed bulk-loading algorithm outperforms the conventional tuple-loading algorithm regarding disk I/Os and construction time of the index tree. Future work includes using information in buffers at the time of a node split for a better splitting process and exploring the multi-way splitting of leaf nodes as proposed in NDTBL [19].

Acknowledgment: Research was supported by the US National Science Foundation (NSF) (under Grants IIS-1319909 and IIS-1320078) for Research Experiences for Undergraduates (REU).

References

[1] M. Metzker. Sequencing technologies - the next generation. Nat Rev Genet, 11:31–46, 2010.
[2] T. J. Treangen, S. L. Salzberg. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nature Reviews Genetics, 13(1):36–46, 2012.
[3] Y. Gu, Q. Zhu, X. Liu, Y. Dong, C. T. Brown, S. Pramanik. Using Disk Based Index and Box Queries for Genome Sequencing Error Correction. Proc. of BICoB’16, pp. 69–76, 2016.
[4] Y. Gu, X. Liu, Q. Zhu, Y. Dong, C. T. Brown, S. Pramanik. A new method for DNA sequencing error verification and correction via an on-disk index tree. Proc. of ACM BCB’15, pp. 503–504, 2015.
[5] D. R. Kelley, M. C. Schatz, et al. Quake: quality-aware detection and correction of sequencing errors. Genome Biol, 11(11):R116, 2010.
[6] AKM T. Islam, S. Pramanik, X. Ji, J. R. Cole, Q. Zhu. Back Translated Peptide k-mer Search and Local Alignment in Large DNA Sequence Databases Using BoND-SD-tree Indexing. Proc. of BIBE’15, pp. 1–6, 2015.
[7] G. Qian, Q. Zhu, Q. Xue, S. Pramanik. The ND-tree: A dynamic indexing technique for multidimensional non-ordered discrete data spaces. Proc. of VLDB’03, pp. 620–631, 2003.
[8] G. Qian, Q. Zhu, Q. Xue, S. Pramanik. A space-partitioning-based indexing method for multidimensional non-ordered discrete data spaces. ACM Trans. on Info. Syst., 23(1):79–110, 2006.
[9] C. Chen, A. Watve, S. Pramanik, Q. Zhu. The BoND-tree: an efficient indexing method for box queries in nonordered discrete data spaces. IEEE Trans. on Knowl. and Data Eng., 25(11):2629–2643, 2013.
[10] A. Guttman. R-tree: a Dynamic Index Structure for Spatial Searching. Proc. of SIGMOD’84, pp. 47–57, 1984.
[11] N. Beckmann, H.P. Kriegel, R. Schneider and B. Seeger. The R*-tree: an efficient and robust access method for points and rectangles. Proc. of SIGMOD’90, pp. 322–331, 1990.
[12] D. DeWitt, N. Kabra, J. Luo, J. Patel, J. Yu. Client-Server Paradise. Proc. of VLDB’94, pp. 558–569, 1994.
[13] I. Kamel, C. Faloutsos. On packing R-trees. Proc. of CIKM’93, pp. 490–499, 1993.
[14] N. Roussopoulos, D. Leifker. Direct spatial search on pictorial databases using packed R-trees. Proc. of SIGMOD’85, pp. 17–31, 1985.
[15] Y. Garcia, M. Lopez, S. Leutenegger. A greedy algorithm for bulk loading R-trees. Proc. of ACM-GIS’98, pp. 02–07, 1998.
[16] S. Leutenegger, J. Edgington, M. Lopez. STR: A Simple and Efficient Algorithm for R-Tree Packing. Proc. of ICDE’97, pp. 497–506, 1997.
[17] S. Berchtold, C. Bohm, H.-P. Kriegel. Improving the Query Performance of High-Dimensional Index Structures by Bulk-Load Operations. Proc. of EDBT’98, pp. 216–230, 1998.
[18] J. Bercken, B. Seeger, P. Widmayer. A Generic Approach to Bulk Loading Multidimensional Index Structures. Proc. of VLDB’97, pp. 406–415, 1997.
[19] H.-J. Seok, G. Qian, Q. Zhu, A. Oswald, S. Pramanik. Bulk-loading the ND-tree in non-ordered discrete data spaces. Proc. of DASFAA’08, pp. 156–171, 2008.
[20] G. Qian, H.-J. Seok, Q. Zhu, S. Pramanik. Space-partitioning-based bulk-loading for the NSP-tree in non-ordered discrete data spaces. Proc. of DEXA’08, pp. 404–418, 2008.
[21] G. Qian, Q. Zhu, Q. Xue, S. Pramanik. Dynamic indexing for multidimensional non-ordered discrete data spaces using a data-partitioning approach. ACM Trans. on Database Syst., 31(2):439–484, 2006.