SLAC-PUB-1549 (Rev.)
STAN-CS-75-482
February 1975
Revised December 1975
Revised July 1976

AN ALGORITHM FOR FINDING BEST MATCHES IN LOGARITHMIC EXPECTED TIME

Jerome H. Friedman
Stanford Linear Accelerator Center
Stanford University, Stanford, Ca. 94305

Jon Louis Bentley
Department of Computer Science
University of North Carolina at Chapel Hill
Chapel Hill, N.C. 27514

Raphael Ari Finkel
Department of Computer Science
Stanford University, Stanford, Ca. 94305

ABSTRACT

An algorithm and data structure are presented for searching a file containing N records, each described by k real valued keys, for the m closest matches or nearest neighbors to a given query record. The computation required to organize the file is proportional to $kN\log N$. The expected number of records examined in each search is independent of the file size. The expected computation to perform each search is proportional to $\log N$. Empirical evidence suggests that, except for very small files, this algorithm is considerably faster than other methods.

(Submitted to ACM Transactions on Mathematical Software)

Work supported in part by U.S. Energy Research and Development Administration under contract E(043)515

The Best Match or Nearest Neighbor Problem

The best match or nearest neighbor problem applies to data files that store records with several real valued keys or attributes. The problem is to find those records in the file most similar to a query record according to some dissimilarity or distance measure. Formally, given a file of N records (each of which is described by k real valued attributes) and a dissimilarity measure D, find the m records closest to a query record (possibly not in the file) with specified attribute values.

A data file, for example, might contain information on all cities with post offices. Associated with each city is its longitude and latitude. If a letter is addressed to a town without a post office, the closest town that has a post office might be chosen as the destination.

The solution to this problem is of use in many applications. Information retrieval might involve searching a catalog for those items most similar to a given query item; each item in the file would be cataloged by numerical attributes that describe its characteristics. Classification decisions can be made by selecting prototype features from each category and finding which of these prototypes is closest to the record to be classified. Multivariate density estimation can be performed by calculating the volume about a given point containing the closest m neighbors.

Structures Used for Associative Searching

One straightforward technique for solving the best match problem is the cell method. The k-dimensional key space is divided into small, identically sized cells. A spiral search of the cells from any query record will find the best matches of that record. Although this procedure minimizes the number of records examined, it is extremely costly in space and time, especially when the dimensionality of the space is large.

Burkhard and Keller [1] and later Fukunaga and Narendra [2] describe heuristic strategies based on clustering techniques. These strategies use the triangle inequality to eliminate some of the records from consideration while searching the file. Although no calculations of expected performance are presented, simulation experiments indicate that these techniques permit a substantial fraction of the records to be eliminated from consideration.

Friedman, Baskett, and Shustek [3] describe another strategy for solving the nearest neighbor problem. It involves forming a projection of the records onto one or more keys, keeping a linear list on those keys, and searching only those records that match closely enough on one of the keys. They were able to show that the expected computation required to search the file with this method is proportional to $k\,m^{1/k} N^{1-1/k}$. The method is applicable to a wide variety of dissimilarity measures and does not require that they satisfy the triangle inequality.

Rivest [4] shows the optimality of an algorithm due to Elias that deals with binary keys. That is, each key takes on only two values; the dissimilarity measure is the Hamming distance.

Shamos [5] applies the Voronoi diagram (a general structure for searching the plane) to the best match problem for the special case of two keys per record (two dimensions) and Euclidean distance measure. He presents two algorithms. One can search for best matches in worst case $O[(\log N)^2]$ time, after a file organization that requires storage proportional to N and computation proportional to $N\log N$. The other algorithm can perform the search in worst case $O[\log N]$ time, after a file organization that requires both storage and computation proportional to $N^2$. Unfortunately, these methods have not yet been generalized to higher dimensionalities or more general dissimilarity measures.

Finkel and Bentley [6] describe a tree structure, termed the quad tree, for storing composite keys. It is a generalization of the binary tree for storing data on single keys. Bentley [7] develops a different generalization of the same one-dimensional structure; it is termed the k-d tree. In his article, Bentley suggests that k-d trees could be applied to the best match problem.

This paper introduces an optimized k-d tree algorithm for the problem of finding best matches. This data structure is very effective in partitioning the records in the file so that the average number of record examinations(1) involved in searching for best matches is quite small. This method can be applied with a wide variety of dissimilarity measures and does not require that they obey the triangle inequality. The storage required for file organization is proportional to N, while the computation required is proportional to $kN\log N$. For large files, the expected number of record examinations required for the search is shown to be independent of the file size, N. The time spent in descending the tree during the search is proportional to $\log N$, so that the expected time required to search for best matches with this method is proportional to $\log N$.

Definition of the k-d Tree

The k-d tree is a generalization of the simple binary tree used for sorting and searching. The k-d tree is a binary tree in which each node represents a subfile of the records in the file and a partitioning of that subfile. The root of the tree represents the entire file. Each nonterminal node has two sons or successor nodes. These successor nodes represent the two subfiles defined by the partitioning. The terminal nodes represent mutually exclusive small subsets of the data records, which collectively form a partition of the record space. These terminal subsets of records are called buckets.

In the case of one-dimensional searching, a record is represented by a single key and a partition is defined by some value of that key. All records in a subfile with key values less than or equal to the partition value belong to the left son, while those with a larger value belong to the right son. The key variable thus becomes a discriminator for assigning records to the two subfiles.

In k dimensions, a record is represented by k keys. Any one of these can serve as the discriminator for partitioning the subfile represented by a particular node in the tree; that is, the discriminator can range from 1 to k. The original k-d tree proposed by Bentley [7] chooses the discriminator for each node on the basis of its level in the tree; the discriminator for each level is obtained by cycling through the keys in order. That is,

$D = L \bmod k + 1$

where D is the discriminating key number for level L and the root node is defined to be at level zero. The partition values are chosen to be random key values in each particular subfile.

This paper deals with choosing both the discriminator and partition value for each subfile, as well as the bucket size, to minimize the expected cost of searching for nearest neighbors. This process yields what is termed an optimized k-d tree.
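
The cycling rule of the original (non-optimized) k-d tree can be illustrated with a few lines of Python. This is only a sketch of the formula $D = L \bmod k + 1$; the function name `cycling_discriminator` is a hypothetical choice made for this illustration and does not appear in the paper.

```python
def cycling_discriminator(level: int, k: int) -> int:
    """Return the 1-based discriminating key number D for a node at
    tree level `level` (root at level 0), following D = L mod k + 1."""
    return level % k + 1


# With k = 3 keys the discriminator cycles through keys 1, 2, 3, 1, 2, ...
print([cycling_discriminator(level, 3) for level in range(5)])
```

The optimized tree discussed in this paper replaces this fixed cycle with a data-dependent choice of discriminator at each node.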

The Search Algorithm

The k-d tree data structure provides an efficient mechanism for examining only those records closest to the query record, thereby greatly reducing the computation required to find the best matches.

The search algorithm is most easily described as a recursive procedure. The argument to the procedure is the node under investigation. The first invocation passes the root of the tree as this argument. Available as a global array are the geometric boundaries delimiting the subfile represented by that node. These boundaries are determined by the partitions defined at the nodes above it in the tree. The domain of the root node is defined to be plus and minus infinity on all keys. At each node, the partition not only divides the current subfile, but it also defines a lower or upper limit on the value of the discriminator key for each record in the two new subfiles. The accrual of these limits in the ancestors of any node defines a cell in the multidimensional record-key space containing its subfile. The volume of this cell is smaller for subfiles defined by nodes deeper in the tree.

If the node under investigation is terminal, then all the records in the bucket are examined. A list of the m closest records so far encountered and their dissimilarity to the query record is always maintained as a priority queue during the search. Whenever a record is examined and found to be closer than the most distant member of this list, the list is updated.

If the node under investigation is not terminal, the recursive procedure is called for the node representing the subfile on the same side of the partition as the query record. When control returns, a test is made to determine if it is necessary to consider the records on the side of the partition opposite the query record. It is necessary to consider that subfile only if the geometric boundaries delimiting those records overlap the ball centered at the query record with radius equal to the dissimilarity to the mth closest record so far encountered. This is referred to as the "bounds-overlap-ball" test. If the bounds-overlap-ball test fails, then none of the records on the opposite side of the partition can be among the m closest records to the query record. If the bounds do overlap the ball, then the records of that subfile must be considered and the procedure is called recursively for the node representing that subfile. A "ball-within-bounds" test is made before returning to determine if it is necessary to continue the search. This test determines whether the ball is entirely within the geometric domain of the node. If so, the current list of m best matches is correct for the entire file and no more records need be examined.

The bounds-overlap-ball and ball-within-bounds tests are described in Appendix 1. Appendix 2 contains a detailed description of the complete search algorithm in an algorithmic notation.

The Optimized k-d Tree

The goal of the optimization is to minimize the expected number of records examined with the search algorithm. The parameters to be adjusted are the discriminating key number and partition value at each non-terminal node, and the number of records contained in each terminal bucket.

The solution to the optimization will, in general, depend upon the distribution of query records in the record key space. Usually, one has no knowledge of this distribution in advance of the queries. Thus, we seek a procedure that is independent of the distribution of queries and only uses information contained in the file records. Such a procedure will be seen to be good for all possible query distributions but will not be optimal for any particular one.

A second restriction is that the k-d tree can be defined recursively, so that the discriminating key number and partition value at any particular node depend only on the subfile represented by that node. This restriction is necessary for avoiding a general binary tree optimization. Such an optimization is known to be NP-complete [8] and thus very likely of non-polynomial time complexity.

Under these two restrictions, we can provide a prescription for choosing the discriminating key and partition value at each nonterminal node. The information provided to the search algorithm by the partitioning is the location of the partition and the identities of those records that lie on either side. It is well known that the information provided by a binary choice is maximal when the two alternatives are equally likely. That is, each record should have had equal probability of being on either side of the partition. This criterion dictates that we locate the partition at the median of the marginal distribution of key values, irrespective of which key is chosen for the discriminator.

The search algorithm can exclude the subfile on the opposite side of the partition if the partition does not intersect the current ball, that is, if the distance to the partition is greater than the radius of the ball. By definition, the ball radius is the same along all key coordinates. Thus, the probability (averaged over all possible query locations) of the partition intersecting the m-nearest neighbor ball is least for that key which exhibited the greatest spread or range in values before the partitioning.

The prescription for optimizing the k-d tree, then, is to choose at every nonterminal node the key with the largest spread in values as the discriminator, and to choose the median of the discriminator key values as the partition. The optimum number of records for each terminal bucket is developed in the next section on analysis of performance. Appendix 3 presents an algorithm that builds an optimized k-d tree according to this prescription.
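
The prescription above (largest-spread key as discriminator, median as partition) can be sketched in Python as follows. This is a minimal illustration and not the algorithm of Appendix 3: the function name `build_optimized_kd_tree`, the dictionary node layout, the 0-based key numbering, the use of the range (maximum minus minimum) as the spread estimate, and the choice of the lower median are all assumptions made for this example.

```python
def build_optimized_kd_tree(records, bucket_size=1):
    """Build a k-d tree from a list of equal-length tuples of real keys.

    Assumes bucket_size >= 1. Nonterminal nodes are dicts holding a
    0-based discriminator, a partition value and two sons; terminal
    nodes hold a bucket of records.
    """
    if len(records) <= bucket_size:
        return {"bucket": list(records)}            # terminal node
    k = len(records[0])

    def key_range(j):
        # Spread estimate (an assumption here): range of the jth key.
        values = [r[j] for r in records]
        return max(values) - min(values)

    d = max(range(k), key=key_range)                # key with largest spread
    ordered = sorted(records, key=lambda r: r[d])
    mid = (len(ordered) - 1) // 2                   # position of lower median
    return {
        "discriminator": d,
        "partition": ordered[mid][d],
        "left": build_optimized_kd_tree(ordered[:mid + 1], bucket_size),
        "right": build_optimized_kd_tree(ordered[mid + 1:], bucket_size),
    }


tree = build_optimized_kd_tree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(tree["discriminator"], tree["partition"])
```

Splitting by sorted position rather than by value keeps the two subfiles balanced even when several records share the median key value; records equal to the partition value may then fall on either side, which a search has to allow for by treating the cell bounds as closed.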

Analysis of the Performance

The storage required for file organization is proportional to the file size, N. The discriminating key number and partition value must be stored for each nonterminal node of the k-d tree. The number of non-terminal nodes is $\lceil N/b \rceil - 1$, where b is the number of records in each terminal bucket.

The computation required to build the k-d tree is easily derived. At each level of the tree, the entire set of key values must be scanned. This requires computation proportional to kN. The depth of the tree is $\log N$, so the total computation to build the tree is proportional to $kN\log N$.(2) [Here we are solving the recurrence relation $T_N = 2T_{N/2} + kN$, which is well-known to have the solution $T_N = O(kN\log N)$.]

The expected time performance of the search is not so easily derived. It is most easily discussed in a geometric framework. Let $\vec{x}_i = [x_i(1), x_i(2), \ldots, x_i(k)]$ represent the set of key values for the ith record in the file. If the value of each key is plotted along a coordinate axis, then the set of key values for a record represents a point in a coordinate space of k dimensions. The entire file is a collection of such points in the k-dimensional coordinate space. The query record can similarly be represented as a point, $\vec{x}_q$, in this space. The best match problem is then to find the m closest points to the query point in this space by the given dissimilarity measure.

The performance of the algorithm may depend upon the total number of records in the file, N, the dimensionality (number of keys), k, the number of nearest neighbors sought, m, the number of records in the terminal buckets, b, the dissimilarity measure employed, $D(\vec{x},\vec{y})$, and the distribution, $p(\vec{x})$, of the file records in the key space.

Let $S_m(\vec{x}_q)$ be the smallest ball in the coordinate space centered at $\vec{x}_q$ that contains exactly the m closest points to $\vec{x}_q$. That is,

$S_m(\vec{x}_q) = \{\vec{x} \mid D(\vec{x},\vec{x}_q) \le D(\vec{x}_m,\vec{x}_q)\}$ (1)

where $\vec{x}_m$ is the mth nearest neighbor to $\vec{x}_q$. The volume of this region, $v_m(\vec{x}_q)$, is defined as

$v_m(\vec{x}_q) = \int_{S_m(\vec{x}_q)} d\vec{x}$ (2)

and the probability content of this region, $u_m(\vec{x}_q)$, as

$u_m(\vec{x}_q) = \int_{S_m(\vec{x}_q)} p(\vec{x})\, d\vec{x}, \quad 0 \le u_m(\vec{x}_q) \le 1$. (3)

It can be shown [9] that the probability distribution of $u_m(\vec{x}_q)$ follows a beta distribution, B(m,N); that is,

$p(u_m)\, du_m = \frac{N!}{(m-1)!\,(N-m)!}\, u_m^{m-1} [1-u_m]^{N-m}\, du_m$ (4)

independently of the probability density function of the points, $p(\vec{x})$, or the dissimilarity measure, $D(\vec{x},\vec{y})$. The expected value of this distribution is

$E[u_m] = \int_0^1 u_m\, p(u_m)\, du_m = \frac{m}{N+1}$. (5)

These results state that any compact volume enclosing exactly m points has probability content m/(N+1) on the average.

To proceed further, we assume that the file size, N, is large enough so that $S_m(\vec{x}_q)$ is small and thus the probability distribution $p(\vec{x})$ is approximately constant within the region $S_m(\vec{x}_q)$. In this case, we can approximate eqn 3 by

$u_m(\vec{x}_q) \cong \bar{p}(\vec{x}_q)\, v_m(\vec{x}_q)$ (6)

and from eqn 5

$E[v_m(\vec{x}_q)] \cong \frac{m}{N+1}\, \frac{1}{\bar{p}(\vec{x}_q)}$. (7)

Here $\bar{p}(\vec{x}_q)$ is the probability density averaged over the small region $S_m(\vec{x}_q)$. Note that $\bar{p}(\vec{x}_q)$ can never be zero.

Consider now the effect of the optimized k-d tree partitioning algorithm described in the previous section. Choosing the median insures that each bucket will contain very nearly b records, where b is the maximum bucket size. Choosing the key with the largest spread in values at each node insures that the geometric shape of these buckets will be reasonably compact. In fact, the expected shape of these buckets is hypercubical with edge length equal to the kth root of the volume of the space occupied by the bucket. The edges are parallel to the coordinate axes. The effect of the optimized k-d tree partitioning, then, is to divide the coordinate space into approximately hypercubical subregions, each containing very nearly the same number of records. From eqn 7, we have that the expected volume of such a bucket is

$E[v_b(\vec{x}_b)] \cong \frac{b}{N+1}\, \frac{1}{\bar{p}(\vec{x}_b)}$. (8)

[...]

$R \le b\bar{L} = b\,\{[mG(k)/b]^{1/k} + 1\}^k$. (12)

Two important results follow from this expression. First, minimizing it with respect to b yields the result b=1; to minimize the (upper bound on the) number of records examined, the terminal buckets should each contain one record. With this provision, eqn 12 becomes

$R \le \{[mG(k)]^{1/k} + 1\}^k$. (13)

The second important result is that the expected number of records examined is independent of the file size, N, and the probability distribution of the key values, $p(\vec{x})$, in the record key space.

Although derived here in a somewhat obtuse fashion, these results can be easily understood intuitively. If the goal is to minimize the number of records accumulated by any region, then the partitioning of the file should be as fine as possible. This is accomplished by making each bucket as small as possible.

The independence of the number of overlapped buckets to file size and distribution of key values is a direct consequence of the prescription for optimizing k-d trees. This prescription partitions the k-dimensional record space so that each terminal bucket has the same properties as the region $S_m(\vec{x}_q)$ containing the m best matches. Namely, each contains a fixed number of records (b and m, respectively) and their geometrical shapes are reasonably compact. As a result, the dependence of the bucket volumes on total file size or the local key density is identical to that for the region, $S_m(\vec{x}_q)$, containing the m best matches. As the file size or local key density increases, the bucket volumes and the volume containing the m best matches shrink at exactly the same rate, leaving the number of overlapped buckets, $\bar{L}$, constant.
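
The dependence of the bound on the bucket size can be checked numerically. The sketch below assumes that eqn 12 has the form $b\{[mG(k)/b]^{1/k} + 1\}^k$; the function name `records_examined_bound` is hypothetical, and the default value G = 1 for the geometric constant is an assumption chosen only to make the example concrete.

```python
def records_examined_bound(m, k, b=1, G=1.0):
    """Upper bound on the expected number of records examined,
    assuming the form b * ((m * G / b) ** (1 / k) + 1) ** k."""
    return b * ((m * G / b) ** (1.0 / k) + 1.0) ** k


# The bound grows with the bucket size b, so b = 1 minimizes it.
for b in (1, 4, 16):
    print(b, records_examined_bound(m=1, k=2, b=b))
```

The expression can be rewritten as $\{[mG(k)]^{1/k} + b^{1/k}\}^k$, which makes the monotone growth in b explicit and shows that neither N nor the key density enters the bound.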
overlapped is a direct each terminal Sm(?q), cal to that dis- If the goal is to minimize of the number of overlapped region, ket volumes on total in the record ex- as small as possible. k-d trees. shapes are reasonably number of records N, and the probability size, the buckets of key values space so that the expected intuitively. should be as fine The independence optimizing p (3, 03) here in a somewhat obtuse fashion, coverage of all by making each bucket record of the file understood the partitioning + ljk. is that of the key values, Although for result 1 E is identiAs the the bucket volumes and the at exactly constant. the same rate, The constancy creases implies logarithmic of the number of records that in file file search time for I. is directly The amount of backtracking N. binary to descend from the root to the terminal 4, which we have demonstrated portional is a balanced in the number of nodes,which size, size in- to search for best matches is The k-d tree size. the time required logarithmic the time required examined as file buckets is to the is proportional to Thus, the expected of N. the m best matches to a prespecified Thus, proportional in the tree to be independent tree. query record is pro- to 1ogN. Dissimilarity Measures The derivations concerning of the preceding the particular dissimilarity some implicit are, however, A dissimilarity section make no explicit measure, D(x",%, assumptions that measure is defined assumptions employed. There are now discussed., as k fib(i), D(x",,3 = F 11 i=l where the k -i- 1 arbitrary the basic properties f,b,Y) I functions YWI 3 F and {fi]tZl, 04) are required to satisfy of symmetry 15i5k = f,(Y,X) J. (15s) and monotonicity F(x) 2 F(Y) fi(x,z) The k functions, they define 2 fi(x,y) if if (15b) x>y ‘z 2 y-l or x xryr zj {fi(x,y)]t,L, are called the one-dimensional distance - 13 - \! Isick the coordinate . distance along each coordinate. 
05c) functions; Since the - spread in coordinate the center, estimate values the ith coordinate the bounds-overlap-ball To this the particular dissimilarity will yields serve just tions are all the linear functions It described if identical, that ?(x,y) is, fi(x,y) depends upon Any set of funcdistance the coordinate = f(x,y) functions distance for II func- i 5 k, then can be used to estimate = Ix-y/ that The purpose the k-d tree. as the coordinate For example, in Appen- is not necessary is to order the key numbers. as well. of the also appear in of the k-d tree measure employed. the same ordering function the construction tests be used in building from should be used to during distance the construction extent, of the spread estimation that key values and ball-within-bounds these functions tions function (Th ese coordinate k-d tree. dix 1.) to be the average distance distance the spread in the ith optimized exactly is defined the spread in key values. The properties directly through Appendix 1). measure. of the dissimilarity the bounds-overlap-ball These tests First, nondecreasing ordinate. with increasing coordinate by eqn 14, together distance, dissimilarity set. with tests along any co- IX(i)-W)l, of the co- dissimilarity based for a dissimilarity measure of eqn 15, are sufficient the restrictions (see must be 63, based on any subset The form required algorithm of a dissimilarity between two points, linear this to both of these properties. A dissimilarity addition only two properties the dissimilarity into and ball-within-bounds must be less than or equal to the actual on the full guarantee require Second, a partial ordinates measure enter measure is said to be a metric to symmetry and monotonicity (eqns 14-15c), distance it if, in obeys the tri- angle inequality D(x",?) + D(?,,z? 2 D(?,a. - 14 - 06) .- The most common metric distances are the vectorspace p-norms rk 0,(x”,?) 
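
The general form of a dissimilarity measure (an outer function F applied to a sum of coordinate distances) and its specialization to the vector space p-norms can be sketched as follows. The function names `dissimilarity` and `p_norm` are hypothetical, chosen for this illustration; the p = ∞ case is handled separately as the maximum coordinate distance, since it is a limit rather than a sum of the same form.

```python
import math


def dissimilarity(x, y, coordinate_distances, F):
    """D(x, y) = F(sum_i f_i(x(i), y(i))) for given f_i and F."""
    return F(sum(f(a, b) for f, a, b in zip(coordinate_distances, x, y)))


def p_norm(x, y, p):
    """Vector space p-norm distance between points x and y."""
    if math.isinf(p):
        # p = infinity: maximum coordinate distance
        return max(abs(a - b) for a, b in zip(x, y))
    # f_i(a, b) = |a - b|**p is symmetric and monotone; F(s) = s**(1/p)
    f = [lambda a, b: abs(a - b) ** p] * len(x)
    return dissimilarity(x, y, f, lambda s: s ** (1.0 / p))


x, y = (0.0, 0.0), (3.0, 4.0)
for p in (1, 2, math.inf):
    print(p, p_norm(x, y, p))
```

Because every coordinate uses the same function here, the linear distance |a - b| would order the keys by spread exactly as |a - b|**p does in the sense described above.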

Simulation Results

Several simulations were performed to gain insight into the performance of the algorithm and to compare it to the performance predicted by eqn 19. The results are presented in Figures 1 and 2. For each simulation, a file of 8192 sets of record keys was generated from a normal distribution with unit dispersion matrix. A similar set of 2000 query record keys was generated and the number of record examinations required to find the m best matches was averaged over these 2000 queries. The statistical uncertainty of these averages is quite small, being around two percent in the worst cases.

Figure 1 shows how the average number of record examinations required to find the best match (m=1) varies with dimensionality (number of keys per record). Results are shown both for the p=2 (Euclidean) and the p=∞ vector space norms. The solid line represents eqn 19, which predicts the expected number for the p = ∞ metric ($R = 2^k$). The behavior of the algorithm corresponds closely to that discussed in the previous section. For low dimensionality (k ≤ 6), the p=∞ results strongly exhibit the $2^k$ dependence. For those dimensionalities (k ≤ 6) where N = 8192 appears to be big enough for the validity of the large file assumption,(4) the simulation results for p = ∞ lie no more than 20% above that predicted by eqn 19. These simulation results indicate that, at least for m=1, the k-d tree search algorithm is not far from optimal.

The Euclidean distance results shown in Figure 1 confirm that the performance of the algorithm for the lower p-norms is not as good as for p=∞. The increase in expected number of records examined is not severe, but becomes more pronounced for the higher dimensionalities. If a distance is to be chosen mainly for rapid calculation, the p=∞ distance is a good choice.

Figure 2 shows how the number of records examined depends on the number of best matches sought.
The average number of record examinations required to find the corresponding number of best matches for both the Euclidean and p=∞ norms is displayed along with the prediction of eqn 19 (solid line). The average number of records examined rises slightly more slowly than linearly with increasing number of best matches. One would intuitively expect the increase to be linear, since the expected volume of the m-nearest neighbor ball grows linearly with m. The average number of overlapped cells, therefore, should increase similarly. This is approximately borne out by the results shown. Figure 2 also shows that the effect of the non-optimality of the search algorithm becomes more pronounced for a larger number of best matches. If it is assumed that 8192 records is large enough so that the large file assumption is valid even for m=25 in two dimensions, then Figure 2a shows that the inefficiency is [...]% for m=1 and [...]% for m=25.

Implementation

The above discussion has centered on the expected number of records examined as the sole criterion for performance evaluation of the algorithm. This has the advantage that the evaluation is independent of the details of implementation and the computer upon which the algorithm is executed. Although the computational requirements of the algorithm are strongly related to the number of records examined, there are other considerations as well. These considerations include the computation required to build the k-d tree and the overhead computation required to search the tree.

The computation required to build the tree is proportional to $kN\log N$, as previously stated. This is illustrated empirically in Figure 3, where the actual computation(5) per record required to build the tree is shown as a function of the total number of records for several values of k.

The overhead required to search the tree is dominated by the bounds-overlap-ball calculation. This calculation must be performed at each non-terminal node visited in the search. As described in Appendix 1, the calculation involves calculating the dissimilarity to the closest boundary of the subfile under consideration. The coordinate distances are compared one key at a time; if the boundary is far from the test point, then the subfile can be excluded quickly on the basis of only a few keys. If, on the other hand, the boundary is close to the test point, then it may be necessary to examine most or all of the keys. If the bounds do in fact overlap the ball, then all keys are included and the test becomes as expensive as a full dissimilarity calculation. This suggests that if a subfile is very likely to overlap the ball, the bounds-overlap-ball calculation should be omitted and the subfile simply be investigated.

This situation is most likely to occur near the bottom of the tree, where the file records are closest to the query record. Therefore, it may be profitable to increase the bucket sizes even at the expense of increasing the number of record comparisons. With one record per bucket, a bounds-overlap-ball calculation must be made for each file record close to the query record. With several records per bucket, a bounds-overlap-ball calculation need only be performed once for each bucket. Since the records in a bucket are relatively close together, it is very likely that if one of them passes the test, most or all will pass. It is then more computationally efficient to have larger bucket sizes near the bottom of the tree, even though this increases the number of records examined.

This speculation is confirmed in Figure 4. Here the computation required for finding best matches is shown for various bucket sizes. Increasing the bucket size from one record per bucket considerably improves the performance of the search. This improvement is approximately constant for bucket sizes from 4 to 32. Although Figure 4 shows results for only a few situations, other simulations (not shown) verify that this behavior is completely independent of dimensionality, k, number of best matches, m, and number of file records, N.

Comparison to Other Methods

The only previous algorithm with verified expected performance for various dimensionalities, number of best matches, and number of file records is the sorting algorithm of Friedman, Baskett and Shustek [3]. This algorithm has been shown to yield a considerable improvement over the brute force method (linear search over all the records in the file) for a wide variety of situations. Figure 5 shows the computation(6) (CPU milliseconds per query) required by this sorting algorithm and the k-d tree method (using buckets of sixteen records), for increasing file size.

Also shown is the average number of records examined under the k-d tree algorithm with increasing file size. The rate of increase of this average with increasing file size indicates how near it is to the asymptotic limit where the large file assumption is valid. In two dimensions, the asymptotic limit occurs even for files as small as 128 records. In four dimensions, near-asymptotic behavior appears for file sizes greater than 2000. In eight dimensions, the limit is not near for files of 16000 records. Even for this case, however, the increase in average number of records examined with increasing file size is only slightly faster than logarithmic.

The logarithmic behavior of the overall computation of the k-d tree algorithm as the file size increases is illustrated in Figure 5, except that for eight dimensions the increase is slightly faster.
is the same as the number of file about 25% of the total for eight 5, 5 shows that the preprocessing spent on preprocessing preprocessing as the file computation the k-d tree Figure When the number of query records five increasing 5 show that in two dimensions limit In eight in average number of records tation average with in Figure the asymptotic for and the k-d examined under the k-d tree occurs even for files file faster records) is to the asymptotic The results algorithm for records, two di- is between three and percent. The computation required by the sorting algorithm has been shown 11 I; 1-I; km N . Although this is much worse than [3] to be proportionalto logN, the sorting algorithm very small files, it introduces is faster very little than the k-d tree - 20 - overhead so that algorithm. for For larger files, ,however, the k-d tree especially advantage, Implementation Efficient all algorithm for dimensions. (7) higher operation of the k-d tree buckets reside essing, these data can be arranged records in the same bucket can be stored the external is not even necessary Only the top levels levels can be stored keeps non-terminal the entire of the tree storage so that Buckets close together there will k-d tree reside files, in fast that sons. ACKNOWLEDGMENT Helpful discussions and J.E. with F. Baskett, Zolnowsky are gratefully - 21 - M.G.N. Hine, acknowledged. it memory. memory; the lower device under an arrangement nodes close to their examines be few accesses to For ex&memely large need to be in fast on an external device Since the search algorithm on the average, that During the.preproc- memory. together. similarly. that does not require on an external for each query. 63) storage algorithm in fast are stored a small number of buckets .. computational on Secondary Storage of the terminal in the tree is seen to have a clear C.T. 
Zahn.

APPENDIX 1

This appendix describes algorithms for the bounds-overlap-ball and ball-within-bounds tests discussed in the text.

The purpose of the bounds-overlap-ball test is to determine whether the geometric boundaries delimiting a subfile of records overlap a ball centered at the query record with radius r equal to the dissimilarity between the query record and the mth closest record so far encountered. That is, $r = D(\vec{x}_m, \vec{x}_q)$, where $\vec{x}_q$ is the query record and $\vec{x}_m$ is the mth best match so far encountered in the search.

The technique employed is to find the smallest dissimilarity between the bounded region and the query record. If this dissimilarity is greater than r, then the subfile can be eliminated from consideration. This minimal dissimilarity is determined as follows: if the query record's jth key is within the bounds for the jth coordinate of the geometric domain, then the jth partial distance is set to zero; otherwise it is set to the coordinate distance $f_j$ (eqns 14, 15) by which the key falls outside the domain in that coordinate, measured to the closer boundary. If the sum of coordinate distances exceeds $F^{-1}(r)$ (eqn 14), then there is no overlap between the domain and the neighborhood. The test can terminate with failure as soon as the partial sum of coordinate distances exceeds $F^{-1}(r)$. In the special case of the $p = \infty$ vector space norm, this technique reduces to testing whether any of the coordinate distances is greater than the radius and, if so, failing.

The ball-within-bounds test is simpler. Here the coordinate distance along each key from the query record to the closer boundary is in turn compared to the radius, r. The test fails as soon as one of these coordinate distances is less than the radius. The test succeeds if all of these coordinate distances are greater than the radius.

Descriptions of these tests in an algorithmic notation are presented in the next appendix.
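The two tests described in this appendix can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a Minkowski p-norm dissimilarity with finite p, so that the coordinate distance is |x - y|^p and F(s) = s^(1/p), and the names `bounds_overlap_ball`, `ball_within_bounds`, `lower`, and `upper` are hypothetical.

```python
import math

def bounds_overlap_ball(query, lower, upper, radius, p=2.0):
    """Return True if the box [lower, upper] intersects the ball of the
    given radius centered at query (assumed Minkowski p-norm, finite p)."""
    limit = radius ** p          # F^{-1}(r) for the assumed F(s) = s**(1/p)
    total = 0.0
    for x, lo, hi in zip(query, lower, upper):
        if x < lo:               # key lies below the low boundary
            total += (lo - x) ** p
        elif x > hi:             # key lies above the high boundary
            total += (x - hi) ** p
        if total > limit:        # partial sum already too large: no overlap
            return False
    return True

def ball_within_bounds(query, lower, upper, radius):
    """Return True if the ball lies entirely inside the box, i.e. every
    coordinate distance to the closer boundary exceeds the radius."""
    for x, lo, hi in zip(query, lower, upper):
        if x - lo <= radius or hi - x <= radius:
            return False
    return True
```

Both functions accept infinite bounds (`math.inf`, `-math.inf`) and an infinite radius, which corresponds to the state of the search before m records have been examined.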
APPENDIX 2

This appendix presents the k-d tree search algorithm in an algorithmic notation.

    global Xq[1:k],          "key values of the query record"
           PQD[1:m],         "priority queue of the m closest distances encountered at any phase of the search. PQD[1] is the distance to the mth nearest neighbor so far encountered."
           PQR[1:m],         "priority queue of the record numbers corresponding to the m best matches encountered at any phase of the search"
           B+[1:k],          "coordinate upper bounds"
           B-[1:k],          "coordinate lower bounds"
           discriminator[1:I], "discriminator at each k-d tree node"
           partition[1:I];   "partition value at each k-d tree node"
                             "I is the number of internal nodes"

    "initialize" PQD[1:m] ← ∞; B+[1:k] ← ∞; B-[1:k] ← -∞;
    "search" SEARCH(root);

    procedure SEARCH(node);
    begin local p, d, temp;
        if node is terminal then
        begin
            (examine records in bucket(node), updating PQD, PQR);
            if BALL WITHIN BOUNDS then done else return;
        end;
        d ← discriminator[node];
        p ← partition[node];
        "recursive call on closer son"
        if Xq[d] ≤ p then
        begin
            temp ← B+[d]; B+[d] ← p;
            SEARCH(leftson(node));
            B+[d] ← temp;
        end
        else
        begin
            temp ← B-[d]; B-[d] ← p;
            SEARCH(rightson(node));
            B-[d] ← temp;
        end;
        "recursive call on farther son, if necessary"
        if Xq[d] ≤ p then
        begin
            temp ← B-[d]; B-[d] ← p;
            if BOUNDS OVERLAP BALL then SEARCH(rightson(node));
            B-[d] ← temp;
        end
        else
        begin
            temp ← B+[d]; B+[d] ← p;
            if BOUNDS OVERLAP BALL then SEARCH(leftson(node));
            B+[d] ← temp;
        end;
        "see if we should return or terminate"
        if BALL WITHIN BOUNDS then done else return;
    end;

    logical procedure BALL WITHIN BOUNDS;
    begin local d;
        for d ← 1 step 1 until k do
            if COORDINATE DISTANCE(d, Xq[d], B-[d]) ≤ PQD[1]
               or COORDINATE DISTANCE(d, Xq[d], B+[d]) ≤ PQD[1]
            then return(false);
        return(true);
    end;

    logical procedure BOUNDS OVERLAP BALL;
    begin local sum, d;
        sum ← 0;
        for d ← 1 step 1 until k do
            if Xq[d] < B-[d] then
            begin "lower than low boundary"
                sum ← sum + COORDINATE DISTANCE(d, Xq[d], B-[d]);
                if DISSIM(sum) > PQD[1] then return(false);
            end
            else if Xq[d] > B+[d] then
            begin "higher than high boundary"
                sum ← sum + COORDINATE DISTANCE(d, Xq[d], B+[d]);
                if DISSIM(sum) > PQD[1] then return(false);
            end;
        return(true);
    end;

The procedures DISSIM(x) and COORDINATE DISTANCE(j,x,y) are the functions F(x) and $f_j(x,y)$ that appear in the definition of the dissimilarity measure (eqn 14).

APPENDIX 3

This appendix presents a description in an algorithmic notation of the procedure for constructing an optimized k-d tree for best match file searching.

    root ← BUILD TREE(entire file);

    node procedure BUILD TREE(file);
    begin local j, d, maxspread, p;
        if SIZE(file) ≤ b then return(MAKE TERMINAL(file));
        maxspread ← 0;
        for j ← 1 step 1 until k do
        "find coordinate with greatest spread"
            if SPREADEST(j, file) > maxspread then
            begin
                maxspread ← SPREADEST(j, file);
                d ← j;
            end;
        p ← MEDIAN(d, file);
        return(MAKE NONTERMINAL(d, p, BUILD TREE(LEFT SUBFILE(d, p, file)),
                                BUILD TREE(RIGHT SUBFILE(d, p, file))));
    end;

The procedure SPREADEST(j, subfile) returns the estimated spread of the jth key value for the records in the subfile represented by the node, using the jth coordinate distance function. The procedure MEDIAN(j, subfile) returns the median of the jth key values. MAKE TERMINAL and MAKE NONTERMINAL are procedures that store their parameters as values of a node in the k-d tree and return a pointer to that node.

FOOTNOTES

(1) A record examination involves fetching the record keys from memory, calculating the dissimilarity to the query record, comparing it to the dissimilarity to the mth closest record so far encountered, and if necessary, updating the list of m closest records.

(2) Since the k-d tree is a complete binary tree, it is not necessary to store pointers to the sons of each nonterminal node [11].

(3) The spread of values along each key can be estimated by calculating the trimmed variance of the key values. The trimming insures that the estimate is robust against extreme outliers.

(4) Asymptotic behavior can be determined empirically by observing the rate of increase of the average number of records examined with increasing file size. This is illustrated in Figure 5.

(5) All simulations were performed on an IBM 370/168 computer. All programs were coded in FORTRAN IV and compiled with the IBM FORTRAN H (extended) compiler with optimization level two.

(6) The behavior for eight dimensions will, of course, become logarithmic for large enough file sizes.

(7) The comparison in Figure 5 is for the best match (m=1) since this is the most common application. The increase in computation with increasing m grows as $m^{1/k}$ for the sorting algorithm, while for the k-d tree algorithm it grows nearly linearly. Thus, for larger numbers of best matches, the crossover file size at which the performance of the two algorithms is comparable will increase.

(8) Inspection of Figure 5 shows that for a bucket size of 16 records, the average number of buckets accessed is 1.56, 6.25 and 75.0 for two, four, and eight dimensions, respectively, for a total file size of 16000 records. Increasing the bucket size to 32 records (not shown) reduces the average number of accesses for eight dimensions to 44.0 while increasing the total computation required for the search by only 8%.

REFERENCES

[1] Burkhard, W.A. and Keller, R.M. Some approaches to best match file searching. Comm. of ACM, Vol. 16 (April 1973), 230-236.

[2] Fukunaga, K., and Narendra, P.M. A branch and bound algorithm for computing k-nearest neighbors. IEEE Trans. Comput., C-24 (1975), 750-753.

[3] Friedman, J.H., Baskett, F., and Shustek, L.J. An algorithm for finding nearest neighbors. IEEE Trans. Comput., C-24 (1975), 1000-1006.

[4] Rivest, R. On the optimality of Elias' algorithm for performing best match searches. Proceedings IFIP Congress 74, Stockholm, Sweden (August 1974), 678-681.

[5] Shamos, M.I. Computational Geometry. Conference record of Seventh Annual ACM Symposium on Theory of Computing, Albuquerque, N.M. (May 7, 1975).

[6] Finkel, R.A.
and Bentley, J.L. Quad trees - a data structure for retrieval on composite keys. Acta Informatica 4(1) (1974), 1-9.

[7] Bentley, J.L. Multidimensional binary search trees used for associative searching. Comm. of ACM, Vol. 18 (Sept. 1975), 509-517.

[8] Hyafil, L., and Rivest, R.L. Constructing optimal binary decision trees is NP-complete. Information Processing Letters, Vol. 5 (May 1976), 15-17.

[9] Fukunaga, K., and Hostetler, L.D. Optimization of k-nearest neighbor density estimates. IEEE Trans. Info. Theory, IT-19 (1973), 320-326.

[10] Knuth, D.E. The Art of Computer Programming, Vol. 1, Addison-Wesley, Menlo Park, Ca., 1969, p 401.

[11] Pizer, S.M. Numerical Computing and Mathematical Analysis, Science Research Associates, Palo Alto, Ca., 1975, pp 88, eqn 87.

FIGURE CAPTIONS

FIGURE 1. Variation of the average number of records examined with dimensionality (number of keys per record) for constant file size. Results are shown for the Euclidean (p=2) and p=∞ metrics. The solid line is the prediction of eqn 19 for the p=∞ metric.

FIGURE 2. Variation of the average number of records examined with number of best matches sought for several dimensionalities. Results are shown for the Euclidean (p=2) and p=∞ metrics. The solid lines are the predictions of eqn 19 for the p=∞ metric.

FIGURE 3. Computation per file record required to build the k-d tree as a function of total file size for several dimensionalities.

FIGURE 4. Computation required for the best match search as a function of terminal bucket size.

FIGURE 5. Computation required for best match searching as a function of total file size for both the sorting and k-d tree algorithms at several dimensionalities. Also shown is the variation of the average number of records examined with total file size.
Terminal buckets of 16 records were used with the k-d tree algorithm.

[Figure 1: 8192 records; Euclidean metric and p=∞ metric. Horizontal axis: number of keys (dimensionality).]

[Figure 2a: two keys per record, 8192 records; Euclidean metric. Horizontal axis: number of best matches.]

[Figure 2b: four keys per record, 8192 records; Euclidean metric and p=∞ metric. Horizontal axis: number of best matches.]

[Figure 2c: six keys per record, 8192 records; Euclidean metric and p=∞ metric. Horizontal axis: number of best matches.]

[Figure 3: Euclidean metric; two, four, and eight keys per record. Horizontal axis: total number of records in file.]

[Figure 4: Euclidean metric; two keys per record (x10) and eight keys per record. Horizontal axis: number of records per bucket.]

[Figure 5a: vertical axes: milliseconds per query; average number of records examined.]

[Figure 5b: Euclidean metric, four keys per record; sorting algorithm, k-d tree algorithm, average records examined. Horizontal axis: total number of records in file.]

[Figure 5c: Euclidean metric, eight keys per record; sorting algorithm, k-d tree algorithm, average records examined. Horizontal axis: total number of records in file.]
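The construction procedure of Appendix 3 and the search procedure of Appendix 2 can be sketched together in Python. This is a minimal sketch under stated assumptions, not the authors' FORTRAN program: it uses the Euclidean metric, estimates spread by the range (maximum minus minimum) rather than the trimmed variance, stores explicit son pointers, applies the bounds-overlap-ball test to both sons, and omits the ball-within-bounds early termination. All names (`build_tree`, `search`, `bucket_size`) are hypothetical.

```python
import heapq
import math

def build_tree(records, bucket_size=16):
    """Build a k-d tree from a list of equal-length tuples of key values."""
    if len(records) <= bucket_size:
        return ("leaf", records)
    k = len(records[0])
    # Discriminator: key with the greatest spread (range is an assumed
    # stand-in for the trimmed variance mentioned in the footnotes).
    d = max(range(k), key=lambda j: max(r[j] for r in records)
                                    - min(r[j] for r in records))
    ordered = sorted(records, key=lambda r: r[d])
    mid = len(ordered) // 2
    p = ordered[mid - 1][d]          # partition value at the median
    return ("node", d, p,
            build_tree(ordered[:mid], bucket_size),   # keys <= p
            build_tree(ordered[mid:], bucket_size))   # keys >= p

def search(tree, query, m=1):
    """Return the m closest records as sorted (distance, record) pairs."""
    k = len(query)
    lower, upper = [-math.inf] * k, [math.inf] * k
    best = []    # max-heap of (-distance, record); best[0] is the mth closest

    def radius():
        return -best[0][0] if len(best) == m else math.inf

    def overlaps():
        # Bounds-overlap-ball test for the Euclidean metric.
        total = 0.0
        for x, lo, hi in zip(query, lower, upper):
            gap = max(lo - x, x - hi, 0.0)
            total += gap * gap
        return total <= radius() ** 2

    def visit(node):
        if node[0] == "leaf":
            for rec in node[1]:
                dist = math.dist(query, rec)
                if len(best) < m:
                    heapq.heappush(best, (-dist, rec))
                elif dist < radius():
                    heapq.heapreplace(best, (-dist, rec))
            return
        _, d, p, left, right = node
        order = ((left, upper), (right, lower))
        if query[d] > p:             # closer son is searched first
            order = order[::-1]
        for son, bound in order:
            saved, bound[d] = bound[d], p    # tighten the bound for this son
            if overlaps():
                visit(son)
            bound[d] = saved                 # restore the bound

    visit(tree)
    return sorted((-neg, rec) for neg, rec in best)
```

The overlap test passes trivially whenever the query lies inside a son's bounds, so the closer son is pruned only when it cannot contain a better match. A call such as `search(build_tree(points), query, m=4)` returns the four nearest records in order of increasing distance.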