A Bulk-Loading Algorithm For The Bond-Tree Index Scheme For Non-Ordered Discrete Data Spaces
All content following this page was uploaded by Qiang Zhu on 22 October 2016.
Dong-Yoon Choi†, AKM Tauhidul Islam†, Sakti Pramanik† and Qiang Zhu‡
† Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
‡ Computer and Information Science, University of Michigan, Dearborn, MI, USA
[18], adopts a buffer-based approach. The key idea is to first build a temporary structure, called the buffer tree (whose nodes are called the index nodes). Each index node is associated with a buffer. Vectors are first inserted into the buffer b associated with the desired index node n. When buffer b is full, it is cleared and its vectors are pushed into the buffers of the child index nodes of n. Only the relevant index node and buffer pages are kept in memory. The first buffer tree is used to place all the input vectors into their leaf nodes of the target index tree. A second buffer tree is then built to arrange all the leaf nodes into their parent nodes of the target index tree, and so on. The target index tree is, therefore, built in a bottom-up fashion. Although the GBLA algorithm can be applied to any index structure that supports the required operations, it is typically not optimized for the target index structure. Proper modifications are needed to achieve better efficiency.

Limited work has been done on bulk-loading index trees in NDDSs. A bulk-loading algorithm [19], called NDTBL, was proposed for the ND-tree; it is essentially an improved version of GBLA. A bulk-loading algorithm [20], called NSPBL, was suggested for the NSP-tree; it adopts a bottom-up space splitting strategy based on histograms that record the space distribution of relevant vectors. However, no bulk-loading technique has been introduced for the most recent BoND-tree in NDDSs.

In this paper, we introduce a bulk-loading algorithm to efficiently build a BoND-tree in NDDSs. Our method was inspired by GBLA [18], but is optimized for the BoND-tree in several ways: (1) buffers are associated with the nodes of the target index tree directly instead of employing an auxiliary buffer tree; (2) all the buffers are kept in main memory; (3) the buffers of non-leaf index nodes directly above leaf index nodes can be larger than the buffers of non-leaf nodes at levels above; (4) auxiliary buffers are used to bulk-check vectors for duplicates before insertion; (5) vectors are inserted into leaf nodes in groups. Experiments demonstrate that the proposed bulk-loading BoND-tree algorithm is efficient compared to the conventional tuple-loading (TL) algorithm.

The rest of this paper is organized as follows. Section 2 gives an overview of relevant concepts that are needed for the discussion of this work. Section 3 discusses the details of the proposed bulk-loading method. Section 4 reports our experimental results. Section 5 concludes the paper.

2 Preliminaries

In general, a d-dimensional NDDS $\Omega_d$ is defined as a Cartesian product: $\Omega_d = A_1 \times A_2 \times \cdots \times A_d$, where $A_i$ ($1 \le i \le d$) is the alphabet or domain for the i-th dimension, which consists of a finite number of letters with no natural ordering among them. A discrete box/rectangle R in $\Omega_d$ is defined as $R = B_1 \times B_2 \times \cdots \times B_d$, where $B_i \subseteq A_i$. The area of rectangle R is defined as $|R| = \prod_{i=1}^{d} |B_i|$, where the cardinality $|B_i|$ is called the edge length or span of R along the i-th dimension. For a set of vectors, their discrete minimum bounding rectangle/box (DMBR) is the smallest rectangle/box that contains all the vectors. A (discrete) box query q on a dataset S in an NDDS is defined as a query with a specified box R that returns all the vectors from S that lie within R.

The structure of the BoND-tree is similar to that of the R*-tree [11], except that the discrete geometric concepts (e.g., discrete rectangle/box) in an NDDS are used. More specifically, a BoND-tree satisfies the following two requirements: (1) every non-leaf node has between m and M children unless it is the root (which may have a minimum of two children in this case); (2) every leaf node contains between m and M entries unless it is the root (which may have a minimum of one entry/vector in this case). A leaf node contains an array of entries of the form (op, key), where key is a vector in the NDDS and op is a pointer to the object represented by key in the database. A non-leaf node N contains an array of entries of the form (cp, DMBR), where cp is a pointer to a child node $N_i$ of N and DMBR is the DMBR of $N_i$.

More concepts about the NDDS and the BoND-tree can be found in [7, 8, 21].

Figure 1: An example of bulk-loading the BoND-tree

3 Bulk-Loading the BoND-Tree

In this section, we introduce a BoND-tree bulk-loading algorithm, called BoNDTBL, which was inspired by the generic bulk-loading algorithm GBLA [18]. Figure 1 presents an example illustrating the bulk-loading process for the BoND-tree using BoNDTBL.

3.1 Key Idea

The basic idea of our algorithm is as follows. Each node of the target BoND-tree under construction is saved on disk. Each non-leaf node has an associated buffer. A buffer consists of several pages. Each page of the buffer can store at most B vectors. Hence, a buffer
with P pages can store at most B ∗ P vectors. Instead of keeping most of the buffer pages associated with each non-leaf node on disk as in GBLA, all pages of such auxiliary buffers attached to the non-leaf nodes of the BoND-tree are kept in memory until the construction of the tree is complete. Auxiliary buffers associated with non-leaf nodes of the BoND-tree that are directly above the leaf nodes are allowed to contain more buffer pages than the auxiliary buffers of non-leaf nodes at levels above. This is because a majority of the I/Os spent constructing the BoND-tree come from inserting vectors into the leaf nodes. For the rest of the paper, we will consider P′ pages for buffers right above the leaf level, where P′ ≥ P. In order to reduce the number of attempts to insert duplicated vectors into the BoND-tree, duplicate checking is performed before inserting vectors into the index tree. However, the conventional duplicate checking method, which checks one by one whether each vector is already present in the index tree before insertion (as done in tuple-loading the BoND-tree), is costly and significantly slows down the construction of the BoND-tree. To make this process more efficient, auxiliary buffers are used to bulk-check vectors for duplicates in groups before inserting vectors into the root buffer. Once an auxiliary buffer associated with a non-leaf node directly above the leaf nodes becomes full, the input vectors are first grouped using the findBestChild function (an existing operation for the BoND-tree) and then inserted into the leaf nodes.

3.2 Main Components

Our bulk-loading algorithm starts with the following function, which filters duplicate input vectors before the bulk-loading process and inserts distinct input vectors into the BoND-tree.

Function 3.1. StartBulkLoading
Input: A vector V in a d-dimensional NDDS
1. if (FilterBuffer is full) then
2.     uniqVectors = BulkDuplicateCheck(FilterBuffer)
3.     for each vector V in uniqVectors do
4.         InsertVector(Root, V)
5.     end for
6. else
7.     InsertIntoBuffer(FilterBuffer, V)
8. end if

Function StartBulkLoading initiates the duplicate checking process before vectors are inserted into the BoND-tree, to ensure that each vector arriving at the root of the BoND-tree is unique. This routine simply gathers input vectors into a special auxiliary buffer called FilterBuffer until it reaches capacity. Once FilterBuffer reaches capacity, all vectors in FilterBuffer are passed into function BulkDuplicateCheck. It returns a set of distinct vectors, which are then inserted into the BoND-tree. Because the algorithm for BulkDuplicateCheck is similar to the bulk-loading process described in this section, the details of the algorithm are not included.

The following function inserts the distinct vectors passed in from StartBulkLoading to begin bulk-loading the BoND-tree.

Function 3.2. InsertVector
Input: A vector V in a d-dimensional NDDS
1. if (R buffer is full) then
2.     newSiblings := ClearBuffer(R)
3. end if
4. if (!empty(newSiblings)) then
5.     Create a new root with R and newSiblings as children
6.     Create a new buffer for new root and store in BufferFrame
7. else
8.     InsertIntoBuffer(R buffer, UV)
9. end if

Function InsertVector inserts a unique vector (UV) into the buffer of the root R until its capacity is reached. Once the root buffer reaches capacity, all vectors in the root buffer are cleared to the level below by calling ClearBuffer. ClearBuffer returns new siblings of the current root node if a split occurred during the clearing process. If one or more new siblings were created, a new root node is created, and an associated buffer is also created and stored in a main-memory list of buffers called BufferFrame.

Function 3.3. ClearInternalBuffer
Input: Buffer Entries Associated to a Non-leaf Node N
Output: newSiblings
1. overflowList := {}
2. for each of B ∗ P vectors in buffer of N do
3.     BestChild := ChooseSubtree(N, UV)
4.     ChildBuffer := BufferFrame(BestChild)
5.     InsertIntoBuffer(ChildBuffer, UV)
6.     if ChildBuffer overflows then
7.         add BestChild to overflowList
8.     end if
9. end for
10. newChildren := {}
11. for each Child in overflowList do
12.     add ClearBuffer(Child) to newChildren
13. end for
14. newSiblings := InsertChildren(N, newChildren)
15. Store each new buffer created for a new sibling in BufferFrame
16. return newSiblings

Whenever a buffer gets full, its entries are directed to the corresponding child nodes. A simple approach to clearing the entries would be to determine the most suitable child for each buffer entry in a first-come, first-served (FCFS) manner. However, this technique may require more I/Os while clearing entries to the leaf nodes, as the same leaf node may need to be accessed more than once. To optimize the I/O cost, we first determine the best child node for each buffer entry right above the leaf level and then group the entries based on the child DMBR index. Then, each of the corresponding leaf nodes is accessed only once to insert the vectors. This is referred to as optimized buffer clearing in the following text.

Function ClearBuffer distinguishes between non-leaf nodes and leaf nodes. Function ClearInternalBuffer
clears the vectors from an overflowing buffer of a non-leaf node. It is similar to the relevant procedure described in GBLA, with a few changes. As in GBLA, ClearInternalBuffer clears B ∗ P vectors in the buffer of node N. However, in our algorithm there is no need to read in the pages of the buffer from disk, as they are already present in main memory in BufferFrame. ClearInternalBuffer calls the ChooseSubtree function (an existing operation for the BoND-tree) for each vector to assign the vector to one of the children of N, and inserts the vector into the buffer of the desired child node. The desired child node is determined by applying a set of heuristics such as minimum overlap enlargement, minimum area enlargement, and minimum area. To break a tie between two children, the heuristics are applied in the order they are presented. If there is still a tie after all the heuristics are applied, a child is chosen randomly. If the buffer of a child node overflows, the child is added to overflowList. It is important to note, however, that it can still receive vectors from the buffer of N. Once B ∗ P vectors have been cleared, ClearBuffer executes either ClearInternalBuffer or ClearLeafBuffer for each child node in overflowList, based on the buffer level in the tree. Any new children created by this clearing process are inserted into N, and if N splits, each new buffer created for a new sibling of N is stored in BufferFrame. The ClearInternalBuffer function returns the newSiblings of N.

Function 3.4. ClearLeafBuffer
Input: Buffer Entries Associated to a Leaf Node N
Output: newSiblings
1. newSiblings := {}
2. for each unique vector (UV) in buffer of N do
3.     DataNodeNumber := ChooseSubtree(N, UV)
       (* DataNodeNumber is temporarily stored with UV *)
4. end for
5. Sort UVs in buffer by desired DataNodeNumber
6. for each group of UVs in buffer of N with the same desired DataNodeNumber do
7.     insert the group of UVs into the desired DataNode of N
8.     if DataNode overflows then
9.         Split(DataNode) (* update entries of N *)
10.        if overflow in N then
11.            NewSibling := Split(N)
12.            create buffer for NewSibling & store in BufferFrame
13.            (* split buffer of N with new buffer of NewSibling *)
14.            add NewSibling to newSiblings
15.            add ClearLeafBuffer(N) to newSiblings
16.            add ClearLeafBuffer(NewSibling) to newSiblings
17.        end if
18.    end if
19. end for
20. return newSiblings

Function ClearLeafBuffer inserts the vectors in the buffer of node N into the desired leaf nodes in groups. For this to happen, the vectors in the buffer of N are first sorted based on the block number of the desired leaf node. Once the vectors are grouped by desired leaf node, each group of vectors is inserted into the relevant leaf node. In case of a leaf node split (via an existing operation Split for the BoND-tree), the DMBR of the new leaf node is inserted into N and the entries of N are updated. If all groups of vectors are inserted into the desired data pages without causing node N to split, function ClearLeafBuffer returns the empty set, newSiblings. However, during the insertion process the splitting of a leaf node may cause node N to split. In this case, a new sibling of node N and a new buffer for this new sibling are created. The block number of the new non-leaf node is added to newSiblings, and the remaining vectors in the buffer of N are then split between the buffer of N and the buffer of its new sibling. Then the vectors in the buffers of both N and the new sibling are cleared by calling ClearLeafBuffer on each sibling node, and any new sibling node resulting from this clearing process is also added to newSiblings. After clearing the buffers of both node N and its new sibling, function ClearLeafBuffer returns the set of new siblings created during the leaf buffer clearing process.

Figure 2: I/O comparison of buffer clearing techniques

4 Experiments

To examine the performance of the proposed bulk-loading method BoNDTBL, we conducted extensive experiments. The performance was evaluated in terms of index creation time and I/O, as well as box query search efficiency (I/O). BoNDTBL was implemented in the C++ programming language. All the experiments were conducted in a virtual Linux environment with a 2.6 GHz Intel Core i5 CPU, 1 GB RAM and 20 GB storage, hosted on a MacBook with a 2.6 GHz Intel Core i5 CPU, 8 GB RAM and a 120 GB hard drive. The k-mers in the test database are generated from bacteria.105.1.genomic.fna sequences. Box queries are created from a set of Nitrogen Reductase (Nirk) protein sequences. The performance data was measured as the average of three executions. In the remaining discussions, we use BL BoND-tree and TL BoND-tree to denote the BoND-trees constructed by the bulk-loading and tuple-loading methods, respectively.
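To picture the input vectors used in these experiments, the following minimal Python sketch extracts k-mers from a nucleotide string by sliding a window of length k over it; each k-mer can then be viewed as a vector in a k-dimensional NDDS over the alphabet {A, C, G, T}. The function names `kmers` and `distinct_kmers` are hypothetical and are not part of BoNDTBL (which was implemented in C++), and the in-memory de-duplication shown here only illustrates the goal of duplicate filtering; it is not the BulkDuplicateCheck procedure, whose details the paper does not give.

```python
def kmers(sequence, k):
    """Return every length-k window of `sequence`, in order of appearance."""
    # A sequence shorter than k yields an empty list.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def distinct_kmers(sequence, k):
    """Return the k-mers of `sequence` with repeats dropped (first-seen order).

    Illustration only: a plain in-memory filter, assumed here for clarity.
    """
    # dict.fromkeys preserves insertion order, so it acts as an ordered set.
    return list(dict.fromkeys(kmers(sequence, k)))
```

For instance, the 7-letter string "ACGTACG" contains five 3-mers, of which four are distinct, since "ACG" occurs twice.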
4.1 Clearing Buffer Entries

We have experimented with different techniques for clearing buffer entries. Figure 2 presents the comparison of the FCFS and optimized buffer clearing techniques on indexes of 4M 15-mer vectors for different buffer sizes. It shows that optimized buffer clearing reduces the I/O cost significantly for every buffer size tested. As the buffer size increases, the optimized approach reduces I/O proportionately because more vectors can be inserted into child nodes at once.

[Figure: BL BoND-tree Buffer Size vs. Time; axis: Index Creation Time (min)]

Both figures show that the proposed bulk-loading algorithm for the BoND-tree outperformed the conventional tuple-loading algorithm in terms of disk I/Os, which translates to better construction time. For example, BoNDTBL with bulk duplicate-checking was able to reduce construction time by 51% when loading 20 million genomic vectors in our experiments. The bulk-loading algorithm consistently performed better with an increasing database size, which shows that our strategies of keeping buffers in main memory, using auxiliary buffers for bulk-checking duplicates, and applying the optimized buffer clearing technique for leaf nodes were effective.
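The intuition behind the gap between FCFS and optimized buffer clearing can be illustrated with a small Python sketch that counts how often a leaf node has to be loaded. It is a simplified model, not the measurement procedure used in the experiments: it assumes that only one leaf page is held in memory at a time, it ignores node splits, and the function names and integer leaf identifiers are hypothetical.

```python
from itertools import groupby


def leaf_loads_fcfs(target_leaves):
    """Leaf loads when entries are cleared in arrival (FCFS) order."""
    # Each run of consecutive entries bound for the same leaf costs one load.
    return sum(1 for _ in groupby(target_leaves))


def leaf_loads_grouped(target_leaves):
    """Leaf loads when entries are first sorted by their target leaf."""
    # After sorting, every distinct leaf forms exactly one run.
    return sum(1 for _ in groupby(sorted(target_leaves)))
```

Under this model, a buffer whose entries target leaves 3, 1, 3, 2, 1, 3, 2, 2 (in arrival order) needs seven leaf loads with FCFS clearing but only three after grouping, one per distinct leaf.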
Figure 4: Comparison of Bulk-loading and Tuple-loading BoND-tree index construction a) I/Os and b) time
information in buffers at time of a node split for a better splitting process and exploring the multi-way splitting of leaf nodes as proposed in NDTBL [19].

Acknowledgment: Research was supported by the US National Science Foundation (NSF) (under Grants IIS-1319909 and IIS-1320078) for Research Experiences for Undergraduates (REU).

References
[1] M. Metzker. Sequencing technologies - the next generation. Nat Rev Genet, 11:31–46, 2010.
[2] T. J. Treangen, S. L. Salzberg. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nature Reviews Genetics, 13(1):36–46, 2012.
[3] Y. Gu, Q. Zhu, X. Liu, Y. Dong, C. T. Brown, S. Pramanik. Using Disk Based Index and Box Queries for Genome Sequencing Error Correction. Proc. of BICoB'16, pp. 69–76, 2016.
[4] Y. Gu, X. Liu, Q. Zhu, Y. Dong, C. T. Brown, S. Pramanik. A new method for DNA sequencing error verification and correction via an on-disk index tree. Proc. of ACM BCB'15, pp. 503–504, 2015.
[5] D. R. Kelley, M. C. Schatz, et al. Quake: quality-aware detection and correction of sequencing errors. Genome Biol, 11(11):R116, 2010.
[6] AKM T. Islam, S. Pramanik, X. Ji, J. R. Cole, Q. Zhu. Back Translated Peptide k-mer Search and Local Alignment in Large DNA Sequence Databases Using BoND-SD-tree Indexing. Proc. of BIBE'15, pp. 1–6, 2015.
[7] G. Qian, Q. Zhu, Q. Xue, S. Pramanik. The ND-tree: A dynamic indexing technique for multidimensional non-ordered discrete data spaces. Proc. of VLDB'03, pp. 620–631, 2003.
[8] G. Qian, Q. Zhu, Q. Xue, S. Pramanik. A space-partitioning-based indexing method for multidimensional non-ordered discrete data spaces. ACM Trans. on Info. Syst., 23(1):79–110, 2006.
[9] C. Chen, A. Watve, S. Pramanik, Q. Zhu. The BoND-tree: an efficient indexing method for box queries in nonordered discrete data spaces. IEEE Trans. on Knowl. and Data Eng., 25(11):2629–2643, 2013.
[10] A. Guttman. R-tree: a Dynamic Index Structure for Spatial Searching. Proc. of SIGMOD'84, pp. 47–57, 1984.
[11] N. Beckmann, H.P. Kriegel, R. Schneider and B. Seeger. The R*-tree: an efficient and robust access method for points and rectangles. Proc. of SIGMOD'90, pp. 322–331, 1990.
[12] D. DeWitt, N. Kabra, J. Luo, J. Patel, J. Yu. Client-Server Paradise. Proc. of VLDB'94, pp. 558–569, 1994.
[13] I. Kamel, C. Faloutsos. On packing R-trees. Proc. of CIKM'93, pp. 490–499, 1993.
[14] N. Roussopoulos, D. Leifker. Direct spatial search on pictorial databases using packed R-trees. Proc. of SIGMOD'85, pp. 17–31, 1985.
[15] Y. Garcia, M. Lopez, S. Leutenegger. A greedy algorithm for bulk loading R-trees. Proc. of ACM-GIS'98, pp. 02–07, 1998.
[16] S. Leutenegger, J. Edgington, M. Lopez. STR: A Simple and Efficient Algorithm for R-Tree Packing. Proc. of ICDE'97, pp. 497–506, 1997.
[17] S. Berchtold, C. Bohm, H.-P. Kriegel. Improving the Query Performance of High-Dimensional Index Structures by Bulk-Load Operations. Proc. of EDBT'98, pp. 216–230, 1998.
[18] J. Bercken, B. Seeger, P. Widmayer. A Generic Approach to Bulk Loading Multidimensional Index Structures. Proc. of VLDB'97, pp. 406–415, 1997.
[19] H.-J. Seok, G. Qian, Q. Zhu, A. Oswald, S. Pramanik. Bulk-loading the ND-tree in non-ordered discrete data spaces. Proc. of DASFAA'08, pp. 156–171, 2008.
[20] G. Qian, H.-J. Seok, Q. Zhu, S. Pramanik. Space-partitioning-based bulk-loading for the NSP-tree in non-ordered discrete data spaces. Proc. of DEXA'08, pp. 404–418, 2008.
[21] G. Qian, Q. Zhu, Q. Xue, S. Pramanik. Dynamic indexing for multidimensional non-ordered discrete data spaces using a data-partitioning approach. ACM Trans. on Database Syst., 31(2):439–484, 2006.