
Relational link-based ranking

2004

Floris Geerts ∗
Laboratory for Foundations of Computer Science, School of Informatics, University of Edinburgh, UK
fgeerts@inf.ed.ac.uk

Heikki Mannila, Evimaria Terzi
Basic Research Unit, Helsinki Institute for Information Technology, Department of Computer Science, University of Helsinki, Finland
{mannila,terzi}@cs.helsinki.fi

Abstract

Link analysis methods show that the interconnections between web pages have lots of valuable information. The link analysis methods are, however, inherently oriented towards analyzing binary relations. We consider the question of generalizing link analysis methods for analyzing relational databases. To this aim, we provide a generalized ranking framework and address its practical implications. More specifically, we associate with each relational database and set of queries a unique weighted directed graph, which we call the database graph. We explore the properties of database graphs. In analogy to link analysis algorithms, which use the Web graph to rank web pages, we use the database graph to rank partial tuples. In this way we can, e.g., extend the PageRank link analysis algorithm to relational databases and give this extension a random querier interpretation. Similarly, we extend the HITS link analysis algorithm to relational databases. We conclude with some preliminary experimental results.

∗ Work done while at the Basic Research Unit, Helsinki Institute for Information Technology, Department of Computer Science, University of Helsinki

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004

1 Introduction

Methods for ranking elements have been widely discussed in a variety of settings. In the context of database systems the motivation for ranking has increased along with the size of databases. In huge databases the users that pose a query would like to see the top-k partial tuples that satisfy their query rather than thousands of tuples ordered in a completely uninformative way. Additionally, the necessity of ranking the query results goes far beyond the functionality of the existing ORDER BY operator, which sorts the results only according to the values in the specified attributes. A variety of algorithms that efficiently handle top-k selection [15, 19] and top-k join queries [20, 24] have been proposed.

Ranking is a notion that has also appeared in the context of Web search applications. The natural need in this context is to rank the web pages returned as a result to a user query. In this case the pages are ranked such that the more relevant the page is to the query, the higher it is ranked. Furthermore, among the web pages that are equally relevant those that are more “important” should precede the less “important” ones. Many ranking algorithms for web pages have been developed ([11, 6, 22, 9, 25]), with the most popular among them being the HITS algorithm proposed by Kleinberg [22] and the PageRank algorithm proposed by Brin et al. [11]. The latter has led to the popular Google search engine.

Web pages are categorical data, and thus the problem of ranking them as such is not trivial, since they do not have an intrinsic numerical value on which a ranking could be based. However, all the ranking algorithms developed for them exploit the hyperlink information, i.e., the structure of the Web graph, in order to assign to each web page a rank value and obtain a ranking based on these values.
In contrast to web pages, the assignment of rank values to categorical data in relational databases has not yet been investigated much. In this paper, we do exactly this. More specifically, we address the problem of automated assignment of rank values to categorical partial tuples. Based on this assignment we produce useful rankings of partial tuples. We will construct database graphs using queries and try to exploit their structure to obtain rank values.

These rank values can be used in a variety of database applications: First, one can get ranked answers to queries. Second, they can serve as input to the existing top-k algorithms mentioned above. Until now, the top-k algorithms are mainly applied to databases with non-categorical attributes and the top-k algorithms use these values as input. The rank values we obtain for categorical data can be used in a similar way. Finally, the obtained rank values can be helpful in providing ranked keyword search results in relational databases. How exactly the obtained values are going to be used is beyond the scope of this paper. Here we only consider how such rank values can be obtained.

More specifically, we present a general framework for obtaining such rank values for partial tuples of relational databases. The goal is to define those rank scores and find the algorithms to calculate them. For this we exploit information about the interconnections of the partial tuples in the database, as these can be discovered using relational algebra queries.

To obtain rankings for partial tuples we mimic the principles of link analysis algorithms. The well-studied algorithms ([11, 6, 22, 9, 25]) for the Web show that the structure of the interconnections of web pages has lots of valuable information. For example, Kleinberg’s HITS algorithm [22] suggests that each page should have a separate “authority” rating (based on the links going to the page) and “hub” rating (based on the links going from the page).
The intuition behind the algorithm is that important hubs have links to important authorities and important authorities are linked by important hubs. Brin’s PageRank algorithm [11], on the other hand, calculates globally the PageRank of a web page by considering a random walk on the Web graph and computing its stationary distribution. The PageRank algorithm can also be seen as a model of a user’s behavior where a hypothetical web surfer clicks on hyperlinks at random with no regard towards content. More specifically, when the random surfer is on a web page, the probability that he clicks on one hyperlink of the page depends solely on the number of outgoing links the latter has. However, sometimes the surfer gets bored and jumps to a random web page on the Web. The PageRank of a web page is the expected number of times the random surfer visits that page if he would click infinitely many times. Important web pages are ones which are visited very often by the random surfer.

[Figure 1: Random walk of random surfer using only 2 kinds of queries. The figure shows the binary table $W$, with tuples $(a, d)$, $(a, b)$, $(b, b)$, $(b, a)$, $(b, c)$, once for each of the queries $\pi_2 \sigma_{1=b} W$, $\pi_2 \sigma_{1=a} W$ and $\pi_1 W \cup \pi_2 W$.]

We now rephrase the random surfer in the relational database setting. Consider the fragment of the Web shown as the binary table $W$ in Figure 1. In the same figure we have shown the surf trail $b \to a \to b \to d$ of the random surfer. In order to walk along the partial tuples (pages) in $W$, the random surfer needs only two kinds of queries: The first is simply the query which returns all pages present in $W$. This can be expressed by the expression $\pi_1 W \cup \pi_2 W$. The second kind are queries expressed by $\pi_2 \sigma_{1=v} W$, in which $v$ is a page present in $W$. In other words, these queries ask for all pages reachable from a certain page $v$. After the random surfer has evaluated one of these queries, he selects a random tuple out of the query result and repeats this procedure again.
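The two kinds of queries can be made concrete with a minimal Python sketch. It assumes that W holds the five tuples read off Figure 1; the function names, the jump probability and the rule of always jumping from a page without outgoing links are illustrative assumptions rather than part of the paper, and duplicates in the "all pages" query are removed for simplicity.

```python
import random

# Binary table W as read off Figure 1: each pair (x, y) is a hyperlink x -> y.
W = [("a", "d"), ("a", "b"), ("b", "b"), ("b", "a"), ("b", "c")]


def all_pages(w):
    """First kind of query, pi_1 W UNION pi_2 W: every page present in W.

    Duplicates are removed here (an assumption of this sketch).
    """
    return sorted({x for x, _ in w} | {y for _, y in w})


def successors(w, v):
    """Second kind of query, pi_2 sigma_{1=v} W: pages reachable from page v."""
    return [y for x, y in w if x == v]


def surf(w, start, steps, jump_prob, rng):
    """Hypothetical helper: follow the surfer for a fixed number of steps."""
    trail = [start]
    page = start
    for _ in range(steps):
        out = successors(w, page)
        # The successor query may only be asked for the current page; the
        # "all pages" query may be asked anywhere. Jumping whenever a page
        # has no outgoing links is an assumption of this sketch.
        if not out or rng.random() < jump_prob:
            result = all_pages(w)
        else:
            result = out
        page = rng.choice(result)  # select a random tuple from the result
        trail.append(page)
    return trail


trail = surf(W, "b", steps=3, jump_prob=0.15, rng=random.Random(0))
```

Each step evaluates exactly one query and then picks a random tuple from its result, which is the behavior that the random querier later generalizes to arbitrary query sets.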
An important restriction is that while $\pi_1 W \cup \pi_2 W$ may be asked by the random surfer independent of the current page, $\pi_2 \sigma_{1=v} W$ may only be asked when the surfer is at page $v$. In Figure 1 we have shown which queries are asked in order to obtain the shown surf trail. We use this observation to extend the random surfer model to the random querier, which generalizes random-walk based link analysis algorithms by providing the random surfer with a different set of queries at his disposal. Additionally, the model facilitates extensions that allow for using this model for ranking partial tuples.

Seeing the Web as a database allows us to see a hyperlink between two web pages to exist due to queries that connect the two web pages. E.g., in Figure 1 the link between page $b$ and page $a$ can be seen to exist due to the fact that $a$ is in the query result $\pi_2 \sigma_{1=b} W$. This idea generalizes to arbitrary databases $D$ and any finite set of queries $\{q_1, \ldots, q_n\}$: There exists a link between two partial tuples $\vec{s}$ and $\vec{t}$ of a database $D$ if there exists an $i \in \{1, \ldots, n\}$ such that $\vec{t}$ is in the query result of $q_i$ when evaluated on $D$ and where the selection parameters of $q_i$ are instantiated with constants in $\vec{s}$. We augment these links with weights relative to some preference function on the queries and frequency information of tuples in the query results. In this way we obtain a weighted directed graph which we call the database graph.

The database graph is a natural generalization of the graphs used in link analysis. The database graph enables any graph-based link analysis method to be used for ranking partial tuples. For example, both the PageRank and HITS algorithms can be generalized to operate on the database graph; the generalizations provide tuple ranking algorithms for relational databases.

The contributions of this paper are the following:

• We define the database graph for a given database, set of queries and preference function and explore its theoretical properties.

• We study random walks on the database graph and show that they can be interpreted as the walks of a random querier. We use the stationary distribution of the random walks to assign rank values to partial tuples.

• We show that the random querier generalizes many well-known link analysis algorithms.

• As a second application of the database graph, we extend the HITS algorithm to relational databases and use it to assign rank values as well.

• We experimentally evaluate the use of the obtained rank values to rank query results.

Related work

The problem of assigning rank values to partial tuples in the relational framework is related to the problem of ranking web pages. The latter has been extensively investigated and several link analysis algorithms have been developed for this [11, 6, 22, 9, 25]. Even some unifying frameworks for link analysis algorithms exist [12]. Interesting work on ranking elements in relational databases based on measures from Information Retrieval (IR) is described in [3]; however, the notion of “link” provided by the queries has not been considered there.

The representation of a database as a graph and link-based ranking appears in the context of keyword search in [5, 18, 2]. The nodes in the graph are the database tuples and the directed relationships between the nodes are induced by foreign key or other constraints. The ranking values are related to the inverse of the path distance between nodes.

Graph representations of databases and random walks on them are considered also in the context of similarity of categorical attributes. Both [26] and [21] construct a graph where the nodes are the constants in the database and two nodes are linked when they appear in the same tuple. They perform different random walks on them in order to obtain a similarity measure for the values. A related iterative approach is the idea of hyperedges connecting tuples based on values [17].
The main difference is that we use partial tuples instead of tuples and that we use queries to connect them. A random walk approach to ranking on (semi)structured data is proposed in [4]. Although the approach to ranking is very similar to ours, the graph construction is heavily dependent on the presence of (semi-)structured data.

Organization

The rest of this paper is organized as follows. In Section 2 we define databases and query languages. In Section 3, we formally define the database graph and prove some of its properties. We then define the random walk on the database graph and the random querier in Section 4. In Section 5 we extend the PageRank and HITS algorithms to relational databases using the database graph. Section 6 describes some experimental results. We conclude the paper in Section 7.

2 Preliminaries

We refer to [1, 16] for a more detailed description of basic database notions. For simplicity of exposition, we assume that the database schema $S$ consists of a single relation name $R$ of arity $n$. However, all definitions and results generalize directly to arbitrary database schemas.

Let $D$ be a database instance over $S$. The active domain of $D$, denoted by $\mathrm{adom}(D)$, consists of all constants in $D$. For a tuple $\vec{t} \in D$ of size $n$, we denote the value of its $i$-th attribute by $t_i \in \mathrm{adom}(D)$. The active domain of a tuple $\vec{t} \in D$, denoted by $\mathrm{adom}(\vec{t})$, is the set $\{t_1, \ldots, t_n\}$.

The standard query language is the relational algebra, or equivalently the relational calculus, over the database schema $S$. We denote this query language by RA. Relations and queries are interpreted using the bag semantics, i.e., duplicate tuples are allowed. The reason for this is that we need the notion of frequency, which disappears if we do not allow for duplicates. We will not distinguish between queries and the RA expressions expressing them. We denote the query result of $q$ on $D$ by $q(D)$. Let $q \in \mathrm{RA}$ be an $n$-ary query and denote the set of attributes in the query result by $I$.
We will partition $I$ in source attributes $\vec{x}$ and the target attributes $\vec{y}$. We always assume that this partition is specified for each query $q$ we encounter. We make this explicit by writing $q(\vec{y} \mid \vec{x})$ instead of simply $q$. Let $\vec{s} \in \mathrm{adom}(D)^k$ where $k = |\vec{x}|$ and let $\ell = |\vec{y}|$. Then we define the RA expression

$$q(\vec{y} \mid \vec{s}) \equiv \pi_{y_1, \ldots, y_\ell}\, \sigma_{x_1 = s_1, \ldots, x_k = s_k}\, q(\vec{y} \mid \vec{x}).$$

We will denote the query result of $q(\vec{y} \mid \vec{s})$ on $D$ by $q(D, \vec{s})$. We extend the RA with the duplicate elimination operator $\delta$ for transforming bags into sets if necessary.

Given a tuple $\vec{s}$ and a query $q \in \mathrm{RA}$, the support of $\vec{s}$ in $q(D)$, denoted by $\mathrm{supp}(\vec{s}, q(D))$, is the number of times $\vec{s}$ appears in $q(D)$. The frequency of $\vec{s}$ in $q(D)$ is defined as

$$\mathrm{freq}(\vec{s}, q(D)) = \frac{\mathrm{supp}(\vec{s}, q(D))}{|q(D)|},$$

where $|q(D)|$ denotes the size of $q(D)$.

3 The database graph

As already mentioned in the Introduction, one can consider the web as a database $D$ over a binary relation $W$. Then following a hyperlink from a page $v$ can be seen as first querying the database using the query $q(y \mid x) \equiv W(x, y)$, and then selecting a page out of $q(D, v)$. Two web pages $v$ and $w$ are now linked by the query $q$ iff $w \in q(D, v)$. We generalize this idea to arbitrary databases and queries.

Definition 1 (Link). For a given database $D$ and query language $L \subseteq \mathrm{RA}$, a tuple $\vec{s} \in \mathrm{adom}(D)^k$ is $L$-linked to a tuple $\vec{t} \in \mathrm{adom}(D)^\ell$ iff there exists a query $q(\vec{y} \mid \vec{x}) \in L$ such that $|\vec{x}| = k$, $|\vec{y}| = \ell$, and $\vec{t} \in q(D, \vec{s})$.

From now on we assume that $L$ consists of a finite number of queries. Let $M = \langle D, L, f \rangle$ where $D$ is a database, $L \subseteq \mathrm{RA}$, and $f$ is some preference function $f : L \to \mathbb{Q}^+$. Here, $\mathbb{Q}^+$ denotes the set of strictly positive rational numbers.

We now define the database graph. The definition is rather technical but the intuition behind it is very natural. Indeed, the vertices of the database graph correspond to the active domain of tuples in the answers to queries in $L$.
The reason why we work with the active domains instead of the tuples themselves is that a constant appearing in some attribute can possibly be used in other attributes as well. So instead of storing a constant for each possible attribute separately, we store it only once. This slightly complicates the formal definition (see below) of database graph since many different tuples can correspond to the same vertex. The edge relation is based on Definition 1. Finally, we will assign weights to the edges corresponding to the preferences of the queries establishing this edge (or link) and the support of the tuples consistent with the target vertex in the query results. More formally,

Definition 2 (Database graph). Given $M = \langle D, L, f \rangle$ the corresponding database graph is the weighted directed graph $G_M = (V_M, E_M, \lambda_M)$ where,

• The set of vertices $V_M$ is constructed as follows: For each query $q(\vec{y} \mid \vec{x}) \in L$ we instantiate the parameters $\vec{x}$ with tuples $\vec{s} \in \mathrm{adom}(D)^k$, where $k = |\vec{x}|$. For each $\vec{t} \in q(D, \vec{s})$, we add the vertex $v = \mathrm{adom}(\vec{t})$ to $V_M$, if not already included. Note that $v$ is a set of constants. Thus, $V_M$ is

$$\{\mathrm{adom}(\vec{t}) \mid q(\vec{y} \mid \vec{x}) \in L,\ |\vec{x}| = k,\ \vec{s} \in \mathrm{adom}(D)^k,\ \vec{t} \in q(D, \vec{s})\}.$$

For a vertex $v \in V_M$, we denote by $v^k$ the set of all $k$-tuples formed from constants in $v$.

• The set of edges $E_M$ is equal to all ordered pairs of vertices $(v, w)$ such that there exists a tuple $\vec{s} \in v^k$ which is $L$-linked to a tuple $\vec{t} \in \mathrm{adom}(D)^\ell$ such that $w = \mathrm{adom}(\vec{t})$; and

• The weight function $\lambda_M : E_M \to \mathbb{Q}^+$ is defined as

$$\lambda_M(v, w) = \sum_{q(\vec{y} \mid \vec{x}) \in L} f(q) \Bigl( \sum_{\vec{s} \in v^k,\ k = |\vec{x}|}\ \sum_{\vec{t} \in w^\ell,\ \ell = |\vec{y}|} \mathrm{freq}(\vec{t}, q(D, \vec{s})) \Bigr).$$

We illustrate the concept of database graph by the following examples.

Example 1. Let $D$ be the database given by the table in Figure 2. The language $L$ consists of the queries $q_1(y \mid x) \equiv \pi_{1,2} R(x, y, z)$ and $q_2(y, z \mid x) \equiv R(x, y, z)$. Then for any constant $a$ appearing in the first attribute $q_1(D, a)$ equals $\{b \mid (a, b) \in q_1(D)\}$.
Similarly, for any constant $a$, $q_2(D, a)$ consists of the pairs $\{(b, c) \mid (a, b, c) \in q_2(D)\}$. This shows that $q_1$ will link the first attribute to the second one, while $q_2$ links the first attribute to the second and third one, as can be seen in Figure 2. We define the preference function as $f(q_1) = f(q_2) = 1$. The complete database graph is shown in Figure 2. E.g., the weight on the edge from $\{v_2\}$ to $\{t_2, v_3\}$ is equal to $f(q_2)\,\mathrm{freq}((t_2, v_3), q_2(D, v_2)) = 1$.

[Figure 2: The database $D$ (left) and the database graph $G_M$ of $M = \langle D, L, f \rangle$ of Example 1 (right). The table $D$ contains the tuples $(v_1, v_2, t_1)$, $(v_1, v_3, t_1)$, $(v_1, v_4, t_1)$, $(v_2, v_3, t_2)$, $(v_4, v_1, t_2)$, $(v_4, v_3, t_2)$.]

When we disregard the weights, another example is the Gaifman graph of finite model theory [13].

Example 2. Let $D$ be a database over an $n$-ary relation $R$. Consider the language $L$ consisting of $q_{i,j}(x_j \mid x_i) \equiv \pi_{i,j} R(x_1, \ldots, x_n)$ and $f(q_{i,j}) = 1$ for all $i, j = 1, \ldots, n$. The database graph has as vertices the constants in $\mathrm{adom}(D)$ and there is an edge between two constants iff they appear in the same tuple in $D$.

The database graph is a well-defined object. Indeed, we call $\langle D, L, f \rangle$ and $\langle D', L, f \rangle$ isomorphic, denoted by $\langle D, L, f \rangle \cong \langle D', L, f \rangle$, if there exists a bijection $b : \mathrm{adom}(D) \to \mathrm{adom}(D')$ such that for all $q(\vec{y} \mid \vec{x}) \in L$ and $\vec{s} \in \mathrm{adom}(D)^k$ for $k = |\vec{x}|$, we have for any $\vec{t} \in q(D, \vec{s})$ that

$$\mathrm{freq}(\vec{t}, q(D, \vec{s})) = \mathrm{freq}(b(\vec{t}), q(D', b(\vec{s}))),$$

where $b$ is extended to tuples $\vec{x}$ as $b(\vec{x}) = (b(x_1), \ldots, b(x_k))$.

Theorem 1. If $M = \langle D, L, f \rangle$ and $N = \langle D', L, f \rangle$ such that $M \cong N$, then $G_M$ is isomorphic to $G_N$.

Proof. We refer for the proof to the full paper.

We also have a monotonicity property with respect to taking sub-languages.

Theorem 2. If $M = \langle D, L, f \rangle$ and $N = \langle D, L', f' \rangle$ such that $L' \subseteq L$ and $f(q) = f'(q)$ for any $q \in L'$, then $G_N$ is isomorphic to a subgraph of $G_M$.

Proof. The proof is analogous to the proof of Theorem 1.

In general, the reverse of Theorem 1 is not true as can be seen from the following example.

Example 3. Consider the databases $D$ and $D'$ shown in Figure 3. Here, different symbols denote different constants. Let $L$ consist of the queries

$$\begin{aligned}
q_0(x \mid) &\equiv \pi_1 R(x, y, z, u, v),\\
q_1(y \mid x) &\equiv \pi_{1,2}\, \sigma_{3=5} R(x, y, z, u, v),\\
q_2(y \mid x) &\equiv \pi_{1,2}\, \sigma_{3=4} R(x, y, z, u, v),\\
q_3(y \mid x) &\equiv \pi_{1,2}\, \sigma_{3 \neq 4} R(x, y, z, u, v).
\end{aligned}$$

The preference function $f$ assigns weight 1 to each query. It is easily verified that the graphs $G_M$ and $G_N$ are isomorphic and correspond to the graph shown in Figure 3. However, there is no bijection making $\langle D, L, f \rangle$ and $\langle D', L, f \rangle$ isomorphic. Indeed, from $q_1(D, s)$ and $q_1(D', s)$ the bijection $b$ should map $b(t_1) = t_1$, while from $q_2(D, s)$, $q_2(D', s)$, $q_3(D, s)$ and $q_3(D', s)$ it follows that $b(t_1) = t_2$.

[Figure 3: Two non-equivalent databases (left) giving the same database graph (right) for $L$ of Example 3. The table $D$ contains the tuples $(s, t_1, \alpha, \alpha, \alpha)$, $(s, t_1, \alpha, \gamma, \alpha)$, $(s, t_1, \alpha, \gamma, \alpha)$, $(s, t_2, \beta, \gamma, \alpha)$, $(s, t_2, \beta, \beta, \alpha)$, $(s, t_2, \beta, \beta, \alpha)$; the table $D'$ contains the tuples $(s, t_1, \alpha, \alpha, \alpha)$, $(s, t_1, \alpha, \alpha, \alpha)$, $(s, t_1, \alpha, \gamma, \alpha)$, $(s, t_2, \beta, \gamma, \alpha)$, $(s, t_2, \beta, \gamma, \alpha)$, $(s, t_2, \beta, \beta, \alpha)$. The graph has the vertices $s$, $t_1$ and $t_2$.]

The database graph is defined without taking into account any semantic relationships between attributes or additional schema constraints. However, this can be easily incorporated in the queries used in the language $L$.

In the next section we define a random walk on the database graph. In order for the random walk to have nice convergence properties (see the next section), the underlying graph should be strongly connected and non-bipartite. This property turns out to be undecidable.

Theorem 3. Given a query language $L$, it is undecidable whether the database graph is strongly connected and non-bipartite for all $D$ and preference functions $f$.

Proof. First, we remark that the topology of the graph is independent of $f$. So, we can disregard the preference function in what follows. We use a reduction to the undecidability of satisfiability of relational algebra expressions on binary relations [8]. We construct for each $q(x_1, \ldots, x_k) \in \mathrm{RA}$ the language $L = \{q_1, q_2, q_3\}$ where,

$$\begin{aligned}
q_1(u, z \mid x, y) &\equiv \text{if } (\exists x_1 \cdots \exists x_k\, q(x_1, \ldots, x_k \mid)) \text{ then } \sigma_{1 \neq 2 \wedge 1 = 3 \wedge 2 = 4}\, R(x, y) \times R(u, z)\\
q_2(y \mid) &\equiv \text{if } (\exists x_1 \cdots \exists x_k\, q(x_1, \ldots, x_k \mid)) \text{ then } \pi_2 R(x, y)\\
q_3(z \mid) &\equiv \text{if not } (\exists x_1 \cdots \exists x_k\, q(x_1, \ldots, x_k \mid)) \text{ then } \pi_1 R(z, u) \cup \pi_2 R(u, z)
\end{aligned}$$

By construction, for any $D$ and $f$, the database graph associated with $\langle D, L, f \rangle$ will be connected and non-bipartite iff $q$ is not satisfiable. Indeed, if $q$ is not satisfiable then $L$ collapses to $q_3$. For any $D$ and $f$, the database graph associated with $\langle D, q_3, f \rangle$ is the complete graph with vertex set $\mathrm{adom}(D)$. This is clearly always a strongly connected and non-bipartite graph.

For the other direction, suppose that there exists $D$ and $f$ such that the graph associated with $\langle D, L, f \rangle$ is disconnected or bipartite. We need to show that this implies that on $D$ the query $q$ is satisfiable. Therefore, we show that for any $D$ and $f$, the database graph associated with $\langle D, \{q_1, q_2\}, f \rangle$ is disconnected. W.l.o.g., we may assume that $D$ only consists of tuples $(s, t)$ such that $s \neq t$. Indeed, if $|\mathrm{adom}(D)| > 1$ (the case when $|\mathrm{adom}(D)| = 1$ can be disregarded), applying first the query $\pi_{1,4}(R \times R)$ ensures that $D$ always contains $(s, t)$ with $s \neq t$. We then select only those pairs $(s, t)$ from $D$ such that $s \neq t$. So, the database graph will not be connected because there is an edge in the database graph from vertex $\{s, t\}$ to vertex $\{t\}$ by $q_2$, but no edge exists from $\{t\}$ to $\{s, t\}$. This is because $q_1$ only links $\{t\}$ to vertex $\{t, t\}$, which is by construction not in $D$.

4 Random walks on databases

Let $G = (V, E, \lambda)$ be a weighted directed graph. We next define the random walk on this graph, and then show how the concept applies to database graphs.

Definition 3 (RW on a graph, [7]). A simple random walk on $G$ is the following random process: Start in a randomly selected vertex $v \in V$. Next, jump to an adjacent vertex $w$ with probability

$$\lambda(v, w) \Big/ \Bigl( \sum_{(v, w') \in E} \lambda(v, w') \Bigr).$$

This is then repeated starting from vertex $w$.

A random walk on $G$ can also be seen as a Markov chain with state space $V$ where the transition probabilities are represented by a stochastic$^1$ $|V| \times |V|$-matrix $P_G = (P_{vw})$, where $P_{vw} = \lambda(v, w) / \bigl( \sum_{(v, w') \in E} \lambda(v, w') \bigr)$.

$^1$ A stochastic matrix is a matrix in which for each row the elements in the row sum up to one.

Theorem (Fundamental Theorem of Markov Chains, [23]). If $G$ is strongly connected and non-bipartite, then the Markov chain given by $P_G$ has the following properties. There exists a unique stationary distribution $\vec{p}$, i.e., $\vec{p} = \vec{p} P_G$ and $\sum_i p_i = 1$. Moreover, let $N(v, k)$ be the number of times the Markov chain visits $v$ in $k$ steps, then

$$\lim_{k \to \infty} \frac{N(v, k)}{k} = p_v.$$

The stationary distribution is a description of the steady-state behavior of the Markov chain. The stationary distribution will be used to obtain rank values for partial tuples in the next section.

Definition 4 (RW on database). The random walk on $M = \langle D, L, f \rangle$ is the simple random walk on the database graph $G_M$.

We now define the random querier. When in a certain vertex of the database graph, the random querier will select a query compatible with the vertex in which he currently is. The probability of selecting such a query depends on a given preference function. Once the query is asked, a random tuple is selected as input parameter for the query and a random tuple is selected from the output. We make this more formal in what follows.

Let $s \subseteq \mathrm{adom}(D)$. We denote by $L_s$ all queries $q(\vec{y} \mid \vec{x}) \in L$ for which there exists $\vec{s} \in s^k$ for $k = |\vec{x}|$ such that $q(D, \vec{s})$ is nonempty. For a query $q(\vec{y} \mid \vec{x}) \in L$ and $s \subseteq \mathrm{adom}(D)$, let $\Gamma(q, s)$ be the set of tuples $\vec{s} \in s^k$ for $k = |\vec{x}|$ such that $q(D, \vec{s})$ is nonempty. Moreover, let $\gamma(q, s) = |\Gamma(q, s)|$.

Definition 5 (Random Querier). The $(L, f)$-random querier on $D$ is the following random process: An initial element $s$ is selected randomly from the set of vertices $V_M$ from the database graph. Next, a query $q$ is chosen from $L_s$ with probability

$$\gamma(q, s) f(q) \Big/ \Bigl( \sum_{q' \in L_s} \gamma(q', s) f(q') \Bigr),$$

and a tuple $\vec{s} \in \Gamma(q, s)$ is chosen uniformly at random. Finally, a tuple $\vec{t}$ is selected randomly from $q(D, \vec{s})$. This is then repeated starting from $\mathrm{adom}(\vec{t})$.

Theorem 4. The random walk performed by the $(L, f)$-random querier on $D$ is the same random walk as the random walk on the database graph $G_M$ of $M = \langle D, L, f \rangle$.

Proof. We refer for the proof to the full paper.

Example 4. A well-known example of a random walk is the random surfer introduced by Brin [10]. The random surfer is the same as the $(L, f)$-random querier on the Web database $D$, with $L = \{q_1(y \mid x) \equiv W(x, y),\ q_2(x \mid) \equiv \delta(\pi_1 W(x, y) \cup \pi_2 W(y, x))\}$ and $f(q_1) = 1 - p$ and $f(q_2) = p$. Let $V = \mathrm{adom}(D)$. The transition matrix $P = (P_{vw})$ of the random surfer is given by

$$P_{vw} = \begin{cases} \dfrac{p}{|V|} + \dfrac{1 - p}{\mathrm{outdeg}(v)} & \text{if } (v, w) \in D\\[2mm] \dfrac{p}{|V|} & \text{otherwise.} \end{cases}$$

In the database graph corresponding to $\langle D, L, f \rangle$ we have an edge $(v, w)$ for any pair of vertices. The weight of an edge $(v, w)$ is given by the sum of

$$f(q_2)\,\mathrm{freq}(w, q_2(D)) = p\,\frac{1}{|V|},$$

and $f(q_1)\,\mathrm{freq}(w, q_1(D, v))$, which is equal to

$$\begin{cases} \dfrac{1 - p}{\mathrm{outdeg}(v)} & \text{if } (v, w) \in D\\[2mm] 0 & \text{otherwise.} \end{cases}$$

Hence, the $(L, f)$-random querier on $D$ has the same transition matrix as the one of the random surfer.

Also other link analysis algorithms fit perfectly in the random querier framework.

Example 5 (sHITS, [6]). The stochastic HITS algorithm has as transition matrix

$$P_{vw} = \frac{|\{x \in V \mid (x, v) \in E \wedge (x, w) \in E\}|}{\sum_{y \in V} |\{x \in V \mid (x, v) \in E \wedge (x, y) \in E\}|}.$$

A simple computation shows that the $(L, f)$-random querier with $L = \{q(z \mid x) \equiv \pi_{2,4}\, \sigma_{1=3}\, W(u, x) \times W(y, z)\}$ and $f(q) = 1$ results in the same random walk.

We are primarily interested in the stationary distribution of the random walks. By the Fundamental Theorem of Markov Chains, this can be obtained by computing the matrix and solving the eigenvector problem.
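The edge weights of Example 4 can be checked on a small instance. The following Python sketch is only illustrative: it assumes the five-tuple table W read off Figure 1, a jump probability of 0.15, and helper names that do not appear in the paper. It builds the weights from the two queries, normalizes each row, and compares the result with the random surfer formula.

```python
from collections import Counter
from fractions import Fraction

# Assumed Web table W (as read off Figure 1) and an assumed jump probability p.
W = [("a", "d"), ("a", "b"), ("b", "b"), ("b", "a"), ("b", "c")]
p = Fraction(15, 100)

# q2(D) = delta(pi_1 W UNION pi_2 W): the set of all pages.
pages = sorted({c for row in W for c in row})


def freq(t, bag):
    """freq(t, bag) = supp(t, bag) / |bag| under bag semantics."""
    return Fraction(Counter(bag)[t], len(bag))


def q1(v):
    """q1(D, v): targets of the hyperlinks leaving page v."""
    return [y for x, y in W if x == v]


def edge_weight(v, w):
    """lambda(v, w) = f(q2) freq(w, q2(D)) + f(q1) freq(w, q1(D, v))."""
    weight = p * freq(w, pages)
    out = q1(v)
    if out:  # q1 contributes only when q1(D, v) is nonempty
        weight += (1 - p) * freq(w, out)
    return weight


def transition_matrix():
    """Row-normalise the edge weights, as in the simple random walk."""
    matrix = {}
    for v in pages:
        row = {w: edge_weight(v, w) for w in pages}
        total = sum(row.values())
        matrix[v] = {w: x / total for w, x in row.items()}
    return matrix


P = transition_matrix()
```

For pages with outgoing links the rows already sum to one before normalization, so the entries coincide with p/|V| + (1 - p)/outdeg(v) on links and p/|V| elsewhere. For pages without outgoing links only q2 applies, and in this sketch the normalized row becomes uniform.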
However, for undirected graphs we have a closed form expression for the stationary distribution.

Theorem ([23]). Let $G = (V, E)$ be a connected, non-bipartite, undirected and un-weighted graph and let $m = |E|$. Then the stationary distribution $(p_v)_{v \in V}$ of the simple random walk on $G$ is given by $p_v = \deg(v)/2m$.

We have the following result.

Theorem 5. It is decidable for a given weighted directed graph $G = (V, E, \lambda)$, whether there exists an undirected multi-graph $G_u = (V_u, E_u)$ such that the simple random walks on $G$ and $G_u$ have the same transition matrix. Moreover, the graph $G_u$ can be computed, if it exists.

Proof. (Sketch) Note that a weighted directed graph $G$ can be transformed into a graph with integer weights by multiplying, for each vertex $v$, all weights of edges starting in $v$ with the least common multiple (l.c.m.) of the denominators of the weights of the edges starting in $v$. So, we may assume that $G$ has integer weights. Also note that for each node $v \in V$ we are allowed to multiply the weights of all outgoing edges by the same integer $n_v$ without affecting the random walk. We get rid of the integer weights by replacing each edge $(v, w)$ with integer weight $k$ by $k$ edges $(v, w)$ of weight 1. We abuse notation and call this edge set also $E$. So, in order to decide whether $G$ can be replaced by an undirected multigraph we need to check whether there exist integers $(n_v)$ with $v \in V$ such that the indegree becomes equal to the outdegree for every vertex $v$, or for every $v$,

$$\sum_{w : (v, w) \in E} n_v \lambda(v, w) = \sum_{w : (w, v) \in E} n_w \lambda(w, v). \qquad (1)$$

This can be decided using standard integer programming techniques [27]. So the answer to the decision problem is yes iff there exists a solution to equation (1). In case there exists a solution, we define $G_u = (V_u, E_u)$ as $V_u = V$ and $E_u$ contains an undirected edge $(v, w)$ for every pair of edges $(v, w)$ and $(w, v)$ in $E$.
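The closed form $\deg(v)/2m$ can be verified directly on a small example. The sketch below uses a hypothetical undirected graph (a triangle with one pendant vertex, hence connected and non-bipartite) and hypothetical function names; it checks that one step of the simple random walk leaves the closed-form distribution unchanged.

```python
from fractions import Fraction

# Hypothetical undirected, un-weighted graph: triangle a-b-c plus pendant d.
EDGES = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]


def neighbours(edges):
    """Adjacency lists of the undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def closed_form(edges):
    """Stationary distribution p_v = deg(v) / 2m with m = |E|."""
    adj = neighbours(edges)
    m = len(edges)
    return {v: Fraction(len(ns), 2 * m) for v, ns in adj.items()}


def step(dist, adj):
    """One step of the simple random walk, i.e. the product dist * P."""
    nxt = {v: Fraction(0) for v in adj}
    for v, ns in adj.items():
        for w in ns:
            nxt[w] += dist[v] / len(ns)  # move to each neighbour equally often
    return nxt


stationary = closed_form(EDGES)
after_one_step = step(stationary, neighbours(EDGES))
```

Every neighbour $v$ of $w$ contributes $p_v/\deg(v) = 1/2m$, so $w$ receives $\deg(w)/2m$ in total, which is why no eigenvector computation is needed in the undirected case.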
The importance of the previous result is that before starting to compute the stationary distribution of a random walk, one can decide whether the walk corresponds to a walk on an undirected graph. In this case the stationary distribution can be computed much more efficiently, using the Fundamental Theorem of Markov Chains. For some languages it can be shown that there exist integers $n_v$ such that Equation (1) holds for any $D$.

Example 6. Consider the database $D$ consisting of a single relation $R$, and let $L = \{q_{i,j}(x_j \mid x_i) \equiv \pi_{i,j} R(x_1, \ldots, x_n) \mid 1 \leq i \leq n,\ 1 \leq j \leq n\}$. All queries have preference 1. Let $G$ be the database graph. Then there exists an undirected graph $G_u$ satisfying the property stated in Theorem 5. Indeed, for each $s$ and each $t$ such that $t \in q_{i,j}(D, s)$ we have an edge $(s, t)$ of weight $\mathrm{freq}(t, q_{i,j}(D, s))$. The l.c.m. of the denominators for all outgoing edges from $s$ is $|q_{i,j}(D, s)|$, so we get the integer weight

$$\lambda(s, t) = \mathrm{freq}(t, q_{i,j}(D, s))\,|q_{i,j}(D, s)| = |\{\vec{u} \in D \mid u_i = s \wedge u_j = t\}|.$$

We get the same integer weight for the edge $(t, s)$, so $\lambda(s, t) = \lambda(t, s)$ and hence Equation (1) holds for $n_s = 1$ for all $s$. This reasoning is clearly independent of $D$.

5 Rank algorithms

In this section we describe two methods for obtaining rank values for partial tuples. Both are based on eigenvector computations. The first one, RelWalk, is based on the stationary distribution of a random walk on the database graph, similarly to PageRank. The second, RelHITS, uses the mutual reinforcement technique of HITS. Therefore, the assignment of rank values is based on the normalized principal eigenvector of a matrix associated to a certain subgraph of the database graph. Both algorithms output rank values of partial tuples which can serve either directly as a ranking of query results, or as input for top-k selection and join algorithms.
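As a rough illustration of ranking by a stationary distribution, the sketch below runs power iteration on a small weighted directed graph. Power iteration is only one possible way to solve the eigenvector problem and is not claimed to be the solver used in the paper; the graph, its weights and all function names are hypothetical.

```python
def transition_matrix(weights):
    """Row-normalise edge weights lambda(v, w) into probabilities P_vw."""
    matrix = {}
    for v, row in weights.items():
        total = sum(row.values())
        matrix[v] = {w: x / total for w, x in row.items()}
    return matrix


def stationary(matrix, iterations=200):
    """Approximate the solution of p = p P by repeated multiplication."""
    nodes = list(matrix)
    dist = {v: 1.0 / len(nodes) for v in nodes}  # start from uniform
    for _ in range(iterations):
        nxt = {v: 0.0 for v in nodes}
        for v in nodes:
            for w, prob in matrix[v].items():
                nxt[w] += dist[v] * prob
        dist = nxt
    return dist


# Hypothetical graph: strongly connected, and the self-loop at x rules out
# bipartiteness, so the stationary distribution is unique.
WEIGHTS = {
    "x": {"x": 1.0, "y": 1.0},
    "y": {"z": 2.0},
    "z": {"x": 3.0, "y": 1.0},
}

ranks = stationary(transition_matrix(WEIGHTS))
ranking = sorted(ranks, key=ranks.get, reverse=True)  # highest rank first
```

On this toy graph the exact stationary distribution is (3/7, 2/7, 2/7) for (x, y, z), so x is ranked first while y and z tie.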
5.1 RelWalk

The RelWalk algorithm takes as input the database $D$, a language $L$ and preference function $f$, and computes the rank values for subsets $s \subseteq \mathrm{adom}(D)$. The rank value of $s$ corresponds to the value $p_s$ in the stationary distribution $\vec{p}$ of the random walk on the database graph $G_M$ of $M = \langle D, L, f \rangle$. The Fundamental Theorem of Markov Chains says that $p_s$ gives the probability that the $(L, f)$-random querier on $D$ visits $s$, given that he was allowed to ask the queries in $L$ for an infinitely long time. Intuitively, frequently visited states are regarded as more important.

We compute the stationary distribution $\vec{p}$ by solving the eigenvector problem $\vec{p} P = \vec{p}$, where $P$ is the transition matrix of the random walk. The database graph must be strongly connected and non-bipartite in order for the stationary distribution to exist. Of course, not every database graph has this property. However, we can alter the query language such that we always end up with a strongly connected and non-bipartite database graph. Indeed, we simply add queries of the form

$$q(x_{i_1}, \ldots, x_{i_k} \mid) \equiv \pi_{i_1, \ldots, i_k} R$$

where the projections are chosen such that no new vertices are introduced in the database graph. The database graph is now strongly connected since all vertices are connected to each other. It is also non-bipartite since all vertices have a self-loop. Note that the PageRank algorithm uses the same adaptation by adding the query $q_2(x \mid) \equiv \pi_1 R(x, y) \cup \pi_2 R(y, x)$.

5.2 RelHITS

In contrast to RelWalk, RelHITS is query dependent. Therefore, the RelHITS algorithm takes as input $M = \langle D, L, f \rangle$ and an imposed query $q$. The algorithm considers the database graph $G_M$ and selects the subgraph $G'_M = (V'_M, E'_M, \lambda'_M)$ where

$$E'_M = \{(v, w) \in E_M \mid w \subseteq \mathrm{adom}(Q(q, D))\},$$

and $V'_M$ consists only of nodes connected by edges in $E'_M$. The weight function $\lambda'_M$ is the restriction of $\lambda_M$ to $E'_M$.
From the graph $G'_M$ we form the m × n matrix $Q = (Q_{vw}) = (\lambda'_M(v, w))$, where $m = |H|$ with $H = \{v \in V'_M \mid \exists w \in V'_M : (v, w) \in E'_M\}$ and $n = |A|$ with $A = \{w \in V'_M \mid \exists v \in V'_M : (v, w) \in E'_M\}$. The elements of the sets H and A can be thought of as the hubs and the authorities in the context of the HITS algorithm, and the RelHITS scores $h_j$ and $a_j$ are therefore computed iteratively as follows.

$$
\begin{cases}
h_j^{(t)} \leftarrow \sum_{(j,i) \in E'_M} \lambda'_M(j, i)\, a_i^{(t-1)} \\
a_j^{(t)} \leftarrow \sum_{(i,j) \in E'_M} \lambda'_M(i, j)\, h_i^{(t-1)}
\end{cases}
\qquad (2)
$$

The main idea is that important hubs are related to important authorities and vice versa. Moreover, the update scheme (2) converges to the principal eigenvector $\vec{h}$ of $QQ^T$ for the hub scores, while the authority scores converge to the principal eigenvector $\vec{a}$ of $Q^T Q$ ([22]). RelHITS normalizes these eigenvectors and outputs them.

6 Experimental evaluation

In this section we describe our implementation for constructing the database graph and obtaining rank values of partial tuples. We give the setup of our experiments and present the corresponding results. We implemented the database graph construction on top of the Postgres relational database management system. JDBC was used for connecting to the database system.

We ran the RelWalk and the RelHITS algorithms on the bibliography database². This database D consists of a single relation R with attributes paper title, author, conference, and year. There are 7 677 tuples in the database, 3 062 unique paper titles and 4 203 unique authors.

The main goal of the experiments is to show that different query languages yield different rankings. These rankings are closely related to the queries used for the construction of the database graph and in all cases have a meaningful interpretation in terms of these queries. The experiments also show the flexibility of our general framework. Any weighted set of queries can be used to construct a database graph from D.
This raises the question of which queries should be used, which is a very challenging problem to explore. Computing the rank values of partial tuples for a given query language is only a preprocessing step. Once the rank values are computed, they can be used for ranking tuples in the answers of queries imposed on the database. We illustrate this for simple query languages and queries in the next sections.

6.1 Experimental setup

For the purpose of the experiments we constructed the database graphs and obtained partial tuple rankings using both the RelWalk and RelHITS algorithms. For each algorithm we constructed the corresponding database graphs using two different query languages, namely L1 and L2 for RelWalk and L′1 and L′2 for RelHITS. The languages were selected in such a way that L1 (and L2) is expected to show similar rankings to L′1 (and L′2). In the sequel we show the ranked outputs (along with the rank values) obtained when the following two queries were imposed on the database: $q(x_2 \mid) \equiv \pi_2 R$ and $q'(x_2 \mid) \equiv \pi_2\, \sigma_{6 = \text{`H. Garcia-Molina'}}\, \sigma_{1=5} (R \times R)$. Query q is a simple projection on all the authors of the database, while query q′ is a projection on all the authors of the database that are co-authors of “H. Garcia-Molina”.

The queries q and q′ were selected mainly for two reasons. First, their output consists of partial tuples that are already assigned a rank value by our ranking algorithms, so we do not need to employ any additional procedure for ranking aggregates. Second, the output of the queries demonstrates the different features of the proposed ranking algorithms.

Query languages for RelWalk

The first language used for obtaining RelWalk rankings was $L_1 = \{q_1(x_j \mid x_i), q_2(x \mid)\}$, where $q_1(x_j \mid x_i) \equiv \pi_{i,j} R(x_1, x_2, x_3, x_4)$ with $i \ne j$ and preference $f_1 = 0.9$, and $q_2(x \mid) \equiv \pi_1 R \cup \pi_2 R \cup \pi_3 R \cup \pi_4 R$ with preference $f_2 = 0.1$.
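As a rough illustration of how a language such as L1 induces a ranking, the sketch below runs a power iteration on a toy relation. It makes simplifying assumptions that are not taken from the paper: partial tuples are single values, the random querier follows a relational link with probability 0.9 (choosing the target in proportion to co-occurrence counts over column pairs i ≠ j) and otherwise, with probability 0.1, jumps to a value chosen uniformly at random. The function names, the toy data and the convergence tolerance are hypothetical.

```python
from collections import defaultdict
from itertools import permutations


def link_matrix(relation):
    """Row-stochastic link probabilities from co-occurrence counts (columns i != j)."""
    counts = defaultdict(lambda: defaultdict(int))
    for u in relation:
        for s, t in permutations(u, 2):
            counts[s][t] += 1
    links = {}
    for s, row in counts.items():
        total = sum(row.values())
        links[s] = {t: c / total for t, c in row.items()}
    return links


def relwalk_sketch(relation, f_link=0.9, f_jump=0.1, tol=1e-12, max_iter=10_000):
    """Power iteration for the stationary distribution of the mixed walk."""
    links = link_matrix(relation)
    states = list(links)
    n = len(states)
    p = {s: 1.0 / n for s in states}
    for _ in range(max_iter):
        # Jump part: p sums to 1, so every state receives f_jump / n.
        nxt = {s: f_jump / n for s in states}
        # Link part: follow a relational link from the current state.
        for s, row in links.items():
            for t, w in row.items():
                nxt[t] += f_link * p[s] * w
        if max(abs(nxt[s] - p[s]) for s in states) < tol:
            return nxt
        p = nxt
    return p


# Hypothetical toy relation with attributes (paper, author, conference, year).
toy_relation = [
    ("p1", "alice", "vldb", "2003"),
    ("p1", "bob", "vldb", "2003"),
    ("p2", "alice", "sigmod", "2004"),
]
rank = relwalk_sketch(toy_relation)
top_values = sorted(rank, key=rank.get, reverse=True)
```

The uniform jump plays the role of q2: it makes every state reachable from every other state, so the iteration converges regardless of the link structure.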
The intuition behind L1 is exactly what one expects the random querier to do when he has the freedom to do a random walk on partial tuples of size one. Query q2 makes sure that the constructed database graph is strongly connected and non-bipartite. Therefore, there is an underlying stationary distribution. Additionally, notice that for L1 we can apply the result of Example 6 and show that the degrees of the nodes in the corresponding undirected graph, and hence the stationary distribution, are frequency related. The database graph constructed by RelWalk for language L1 consists of 7 294 nodes (partial tuples of size 1) and 50 434 edges (relational links taken into consideration for obtaining the ranking) when only query q1 is considered. The query q2 makes the graph a complete graph.

² The data set is available at http://liinwww.ira.uka.de/bibliography/

The second language used for obtaining RelWalk rankings was $L_2 = \{q_1(x_j \mid x_i), q_2(x_2 \mid x_6), q_3(x \mid)\}$, where $q_1(x_j \mid x_i) \equiv \pi_{i,j} R(x_1, x_2, x_3, x_4)$ with $i \ne j$ and preference $f_1 = 0.45$, $q_2(x_2 \mid x_6) \equiv \pi_{2,6}\, \sigma_{1=5,\, 2 \ne 6} (R \times R)$ with preference $f_2 = 0.45$, and finally $q_3(x \mid) \equiv \pi_1 R \cup \pi_2 R \cup \pi_3 R \cup \pi_4 R$ with preference $f_3 = 0.1$. For this language the results of Example 6 do not apply, and thus we expect the obtained rankings not to be related to frequency. This is due to the co-author query q2 that has been included in the language. The database graph constructed using language L2 has 7 294 nodes and 68 012 edges when only queries q1 and q2 are considered. Query q3 again makes the graph a complete graph.

Query languages for RelHITS

The two languages used for obtaining the RelHITS rankings are L′1 and L′2. They are selected such that they have a similar flavor to L1 and L2 used for RelWalk, which means that they are expected to give similar rankings.
Language $L'_1 = \{q_1(x_2 \mid x_3), q_2(x_2 \mid x_4)\}$ consists of $q_1(x_2 \mid x_3) \equiv \pi_{2,3} R(x_1, x_2, x_3, x_4)$ with preference $f_1 = 0.5$ and $q_2(x_2 \mid x_4) \equiv \pi_{2,4} R(x_1, x_2, x_3, x_4)$ with preference $f_2 = 0.5$. The intuition behind L′1 is that, as in L1, partial tuples of size 1 and direct links between them are again included in the database graph. Language $L'_2 = \{q_1(x_2 \mid x_3, x_6)\}$, on the other hand, consists of a single query $q_1(x_2 \mid x_3, x_6) \equiv \pi_{2,3,6}\, \sigma_{2 \ne 6,\, 1=5} (R \times R)$ with preference equal to 1. This is the co-author query, so there is a relationship between L′2 and L2 and the comparison between the obtained rankings is meaningful.

For RelHITS the size of the constructed database graph depends not only on the query languages (L′1 and L′2) used for the graph construction, but also on the imposed queries q and q′ whose results we want to rank. For L′1 and query q the constructed database graph consists of 4 232 nodes and 11 375 edges, while for the same language and query q′ the corresponding graph consists of only 86 nodes and 171 edges. For language L′2 the database graph that corresponds to q consists of 4 203 nodes and 24 620 edges, while the one that corresponds to q′ has only 144 nodes and 393 edges.

6.2 Experimental results

The ranked output for q when the RelWalk ranking algorithm is used is shown in Tables 1 and 2 for the query languages L1 and L2 respectively. Rankings obtained by using L1 are related to frequency and are thus used as a baseline case. The effect of the co-author query in L2, on the other hand, is visible in the obtained rankings. For example, “Nicolas Adiba” is among the most highly ranked authors, though he has a single paper in the database. However, that paper has 18 other co-authors, among them “Michael J. Carey” (ranked first) with 47 entries and “Daniela Florescu” with 16 entries.
The same holds for “Steve Kirsch” and “Michael Blow”, who are authors of the same paper.

The results obtained for q using the rankings of the RelHITS algorithm are shown in Tables 3 and 4 for query languages L′1 and L′2 respectively. There we can make observations similar to those made for the rankings obtained using the RelWalk algorithm. For example, “Eugene J. Shekita” has far fewer entries in the database than authors ranked after him. This is due to the fact that “Michael J. Carey” (first in the ranking) is among his co-authors.

A comparison plot for our rankings is shown in Figure 4. Each pair of rankings r, r′ is compared by taking the first k authors of each ranking (x-axis) and forming the sets r(k), r′(k). The y-axis is the value of $|r(k) \cap r'(k)| / k$ and is used as a similarity between the two rankings. For the results of the query $q(y \mid) \equiv \pi_2 R(x, y)$ we use the first 200 ranked authors. In the plot the subscripts in the names of the algorithms correspond to the language used for the ranking.

We performed similar experiments for ranking the co-authors of “H. Garcia-Molina”. Tables 5 and 6 for RelWalk and Tables 7 and 8 for RelHITS show the highest ranked authors that are co-authors of “H. Garcia-Molina” (query $q' \equiv \pi_2\, \sigma_{6 = \text{`H. Garcia-Molina'}}\, \sigma_{1=5} (R \times R)$). As before, the languages L1 (L′1) and L2 (L′2) were used for constructing the corresponding graphs. Phenomena analogous to the previous experiment appear here as well. For example, in Table 6 “Edward Chang” has only two papers in the database but is still ranked 5th. The same holds for “Svetlozar Nestorov”, who is highly ranked though he has only 3 entries in the database. This is due to his co-author list, which contains very highly ranked authors like S. Abiteboul, J. Widom, R. Motwani and J. Ullman.
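The similarity measure used for the comparison plots, the fraction of elements shared by the first k positions of two rankings, can be sketched as follows. The function name and the toy rankings are hypothetical and only serve to illustrate the definition.

```python
def topk_overlap(r1, r2, k):
    """Return |r1(k) ∩ r2(k)| / k for two ranked lists (best element first)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return len(set(r1[:k]) & set(r2[:k])) / k


# Hypothetical toy rankings of the same kind of elements.
ranking_a = ["a", "b", "c", "d"]
ranking_b = ["b", "a", "e", "c"]

# One similarity value per prefix length k, as on the x-axis of the plots.
curve = [topk_overlap(ranking_a, ranking_b, k) for k in range(1, 5)]
```

The measure ignores the order within the first k positions: two rankings that contain the same k elements in different orders have similarity 1.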
In Figure 5 we show an overall comparison plot of the rankings for elements which are answers to the query q′. The assumptions and notation are the same as those followed in the previous experiment. For the comparisons presented in Figure 5 we only used the 70 highest ranked authors (out of the 104).

RelWalk ranking
Language: $L_1 = \{q_1(x_j \mid x_i), q_2(x \mid)\}$ with:
  $q_1(x_j \mid x_i) \equiv \pi_{i,j} R(x_1, x_2, x_3, x_4)$ with $i \ne j$,   $f_1 = 0.9$
  $q_2(x \mid) \equiv \pi_1 R \cup \pi_2 R \cup \pi_3 R \cup \pi_4 R$,   $f_2 = 0.1$

Author                     Rank value
H. Garcia-Molina           1.608E-4
Michael J. Carey           1.608E-4
H. V. Jagadish             1.569E-4
David J. DeWitt            1.552E-4
S. Abiteboul               1.530E-4
Rakesh Agrawal             1.513E-4
C. Faloutsos               1.508E-4
Surajit Chaudhuri          1.502E-4
Michael Stonebraker        1.502E-4
Raghu Ramakrishnan         1.497E-4
Jeffrey F. Naughton        1.497E-4
Jennifer Widom             1.491E-4
Yannis E. Ioannidis        1.486E-4
A. Levy                    1.480E-4

Table 1: RelWalk ranking for all the authors in the database using L1.

6.3 Discussion

The experiments described in the previous subsection show that the obtained rankings are highly dependent on the query languages that are used for constructing the database graphs. For example, when the co-author query was included in the language L2, the ranking was strongly influenced by co-author relationships, meaning that authors with fewer papers but with more (or highly ranked) co-authors appear high in the obtained rankings. Additionally, the second experiment gives an idea of how different RelWalk and RelHITS are, since the latter is query dependent and only considers the part of the database that is related to the imposed query. This difference, although not apparent in the first experiment, where the imposed query considered all the authors in the database, becomes obvious in the second, where only the co-authors of “H. Garcia-Molina” are considered. The corresponding comparison Figures 4 and 5 also suggest this difference in the behavior of the two ranking algorithms for different queries.
RelWalk ranking
Language: $L_2 = \{q_1(x_j \mid x_i), q_2(x_2 \mid x_6), q_3(x \mid)\}$ with:
  $q_1(x_j \mid x_i) \equiv \pi_{i,j} R(x_1, x_2, x_3, x_4)$ with $i \ne j$,   $f_1 = 0.45$
  $q_2(x_2 \mid x_6) \equiv \pi_{2,6}\, \sigma_{1=5,\, 2 \ne 6} (R \times R)$,   $f_2 = 0.45$
  $q_3(x \mid) \equiv \pi_1 R \cup \pi_2 R \cup \pi_3 R \cup \pi_4 R$,   $f_3 = 0.1$

Author                     Rank value
Michael J. Carey           2.695E-4
S. Abiteboul               2.142E-4
A. Cichocki                1.990E-4
V. Kashyap                 1.990E-4
R. Brice                   1.990E-4
J. Fowler                  1.990E-4
W. Bohrer                  1.990E-4
R. J. Bayardo              1.990E-4
David J. DeWitt            1.983E-4
H. Garcia-Molina           1.919E-4
Daniela Florescu           1.915E-4
Steve Kirsch               1.844E-4
Michael Blow               1.844E-4
Nicolas Adiba              1.844E-4

Table 2: RelWalk ranking for all the authors in the database using L2.

RelHITS ranking
Language: $L'_1 = \{q_1(x_2 \mid x_3), q_2(x_2 \mid x_4)\}$ with:
  $q_1(x_2 \mid x_3) \equiv \pi_{2,3} R(x_1, x_2, x_3, x_4)$,   $f_1 = 0.5$
  $q_2(x_2 \mid x_4) \equiv \pi_{2,4} R(x_1, x_2, x_3, x_4)$,   $f_2 = 0.5$

Author                     Rank value
H. Garcia-Molina           0.007572
Michael J. Carey           0.006437
H. V. Jagadish             0.006081
Surajit Chaudhuri          0.004740
David J. DeWitt            0.004603
Rakesh Agrawal             0.004291
A. Levy                    0.004283
Jennifer Widom             0.004207
S. Abiteboul               0.004179
C. Faloutsos               0.004056
Raghu Ramakrishnan         0.003795
Jeffrey F. Naughton        0.003765

Table 3: RelHITS rankings for all the authors in the database using L′1.

7 Conclusions

We have shown how to associate with each database, query language, and preference function a unique database graph and explored some of its interesting properties. The database graph provides a nice framework for extending existing and creating new link analysis algorithms for the Web and relational databases. The flexibility of the framework is provided by the use of relational algebra queries. The database graph also enables us to define the random querier, which performs a random walk on databases by asking queries in each step of the walk. We applied our concepts to obtain two algorithms that provide rank values on partial tuples. These values are interesting on their own, but can also serve as input for top-k selection and join algorithms to obtain rankings of query results.
We point out some interesting questions and open problems:

• How can the database graph be used to define measures of similarity between categorical data? Possible measures include the shortest path between tuples and the commute distance between nodes on the database graph.
• Is there a closer connection between the expressive power of L and the database graph and random querier? What are the properties of the random querier with memory ([14])?
• Finally, is there an objective way of selecting the query language used for defining the database graph?

[Figure 4 plot. x-axis: number of highest ranked elements considered (10 to 200); y-axis: percentage of sharing in rankings. Curves: RelWalk_1 vs RelWalk_2, RelHITS_1 vs RelHITS_2, RelWalk_1 vs RelHITS_2, RelWalk_1 vs RelHITS_1, RelWalk_2 vs RelHITS_2, RelWalk_2 vs RelHITS_1.]

Figure 4: Comparison of rankings for all authors in the database.

[Figure 5 plot. x-axis: number of highest ranked elements considered (5 to 70); y-axis: percentage of sharing in rankings. Curves as in Figure 4.]

Figure 5: Comparison of rankings for all co-authors of “H. Garcia-Molina”.

RelHITS ranking
Language: $L'_2 = \{q_1(x_2 \mid x_3, x_6)\}$ with:
  $q_1(x_2 \mid x_3, x_6) \equiv \pi_{2,3,6}\, \sigma_{2 \ne 6,\, 1=5} (R \times R)$,   $f_1 = 1.0$

Author                     Rank value
Michael J. Carey           0.55364
David J. DeWitt            0.02967
Jeffrey F. Naughton        0.02593
H. Garcia-Molina           0.02593
Yannis E. Ioannidis        0.02266
Miron Livny                0.01887
Raghu Ramakrishnan         0.01659
H. Pirahesh                0.01442
Michael J. Franklin        0.01086
Eugene J. Shekita          0.01047
Jennifer Widom             0.01042
Praveen Seshadri           0.00925

Table 4: RelHITS rankings for all the authors in the database using L′2.

RelWalk ranking
Language: $L_1 = \{q_1(x_j \mid x_i), q_2(x \mid)\}$ with:
  $q_1(x_j \mid x_i) \equiv \pi_{i,j} R(x_1, x_2, x_3, x_4)$ with $i \ne j$,   $f_1 = 0.9$
  $q_2(x \mid) \equiv \pi_1 R \cup \pi_2 R \cup \pi_3 R \cup \pi_4 R$,   $f_2 = 0.1$

Author                     Rank value
S. Abiteboul               1.530E-4
Raghu Ramakrishnan         1.497E-4
Jennifer Widom             1.491E-4
A. Silberschatz            1.480E-4
J. Ullman                  1.425E-4
Rajeev Motwani             1.408E-4
Anand Rajaraman            1.391E-4
Luis Gravano               1.391E-4
Anthony Tomasic            1.375E-4
Ramana Yerneni             1.375E-4
Vasilis Vassalos           1.375E-4
Yannis Papakonstantinou    1.375E-4
Narayanan Shivakumar       1.375E-4
Sergey Brin                1.375E-4
Janet L. Wiener            1.375E-4
Kenneth Salem              1.375E-4

Table 5: RelWalk rankings for all co-authors of “H. Garcia-Molina” using L1.

Acknowledgment

We would like to thank Aris Gionis for helpful discussions.

References

[1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.
[2] S. Agrawal, S. Chaudhuri, and G. Das. DBXplorer: A system for keyword-based search over relational databases. In ICDE, 2002.
[3] S. Agrawal, S. Chaudhuri, G. Das, and A. Gionis. Automated ranking of database query results. In CIDR, 2003.
[4] A. Balmin, V. Hristidis, and Y. Papakonstantinou. ObjectRank: Authority-based keyword search in databases. In VLDB, 2004.
[5] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan. Keyword searching and browsing in databases using BANKS. In ICDE, 2002.
[6] K. Bharat and M. Henzinger. Improved algorithms for topic distillation in a hyperlinked environment. In SIGIR, 1998.
[7] B. Bollobás. Modern Graph Theory. Springer-Verlag, 1998.
[8] E. Börger, E. Grädel, and Y. Gurevich. The Classical Decision Problem. Springer-Verlag, 1997.
[9] A. Borodin, J. S. Rosenthal, G. O. Roberts, and P. Tsaparas. Finding authorities and hubs from link structures on the World Wide Web. In WWW, 2001.
[10] S. Brin, R. Motwani, L. Page, and T. Winograd. What can you do with the web in your pocket? Data Engineering Bulletin, 1998.
[11] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30:107–117, 1998.
[12] C. Ding, X. He, P. Husbands, H. Zha, and H. D. Simon. PageRank, HITS and a unified framework for link analysis. In SIGIR, 2002.
[13] H. D. Ebbinghaus and J. Flum. Finite Model Theory. Springer-Verlag, 1995.
[14] R. Fagin, A. Karlin, J. Kleinberg, P. Raghavan, S. Rajagopalan, R. Rubinfeld, M. Sudan, and A. Tomkins. Random walks with “back buttons”. Annals of Applied Probability, 11(3):810–862, 2001.
[15] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. In PODS, 2001.
[16] H. Garcia-Molina, J. Ullman, and J. Widom. Database Systems, The Complete Book. Prentice Hall, 2002.
[17] D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamical systems. In VLDB, 1998.
[18] V. Hristidis and Y. Papakonstantinou. DISCOVER: Keyword search in relational databases. In VLDB, 2002.
[19] V. Hristidis and Y. Papakonstantinou. Algorithms and applications for answering ranked queries using ranked views. VLDB Journal, 2003.
[20] I. F. Ilyas, W. G. Aref, and A. K. Elmagarmid. Supporting top-k join queries in relational databases. In VLDB, 2003.
[21] G. Jeh and J. Widom. SimRank: a measure of structural-context similarity. In KDD, 2002.
[22] J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46, 1999.
[23] R. Motwani and P. Raghavan. Randomized Algorithms. MIT Press, 1995.
[24] A. Natsev, Y.-C. Chang, J. R. Smith, C.-S. Li, and J. S. Vitter. Supporting incremental join queries on ranked inputs. In VLDB, 2001.
[25] A. Ng, A. Zheng, and M. Jordan. Stable algorithms for link analysis. In SIGIR, 2001.
[26] C. R. Palmer and C. Faloutsos. Electricity based external similarity of categorical attributes. In PAKDD, 2003.
[27] A. Schrijver. Theory of Linear and Integer Programming. John Wiley & Sons, 1998.

RelWalk ranking
Language: $L_2 = \{q_1(x_j \mid x_i), q_2(x_2 \mid x_6), q_3(x \mid)\}$ with:
  $q_1(x_j \mid x_i) \equiv \pi_{i,j} R(x_1, x_2, x_3, x_4)$ with $i \ne j$,   $f_1 = 0.45$
  $q_2(x_2 \mid x_6) \equiv \pi_{2,6}\, \sigma_{1=5,\, 2 \ne 6} (R \times R)$,   $f_2 = 0.45$
  $q_3(x \mid) \equiv \pi_1 R \cup \pi_2 R \cup \pi_3 R \cup \pi_4 R$,   $f_3 = 0.1$

Author                     Rank value
S. Abiteboul               2.142E-4
Jennifer Widom             1.771E-4
A. Silberschatz            1.736E-4
Raghu Ramakrishnan         1.703E-4
Edward Chang               1.619E-4
J. Ullman                  1.531E-4
Rajeev Motwani             1.498E-4
Roy Goldman                1.498E-4
Anand Rajaraman            1.490E-4
Svetlozar Nestorov         1.479E-4
Ramana Yerneni             1.443E-4
Yannis Papakonstantinou    1.439E-4
Luis Gravano               1.432E-4
Vasilis Vassalos           1.428E-4
Joachim Hammer             1.423E-4
Ming-Chien Shan            1.419E-4

Table 6: RelWalk rankings for all co-authors of “H. Garcia-Molina” using L2.

RelHITS ranking
Language: $L'_1 = \{q_1(x_2 \mid x_3), q_2(x_2 \mid x_4)\}$ with:
  $q_1(x_2 \mid x_3) \equiv \pi_{2,3} R(x_1, x_2, x_3, x_4)$,   $f_1 = 0.5$
  $q_2(x_2 \mid x_4) \equiv \pi_{2,4} R(x_1, x_2, x_3, x_4)$,   $f_2 = 0.5$

Author                     Rank value
Jennifer Widom             0.0552
Ramana Yerneni             0.0487
Narayanan Shivakumar       0.0417
Joachim Hammer             0.0390
Luis Gravano               0.0365
Anthony Tomasic            0.0326
Chen-Chuan K. Chang        0.0298
Yannis Papakonstantinou    0.0294
Junghoo Cho                0.0291
Yue Zhuge                  0.0262
Vasilis Vassalos           0.0258
Janet L. Wiener            0.0254
Sudarshan S. Chawathe      0.0245
Jeffrey Ullman             0.0238

Table 7: RelHITS rankings for all co-authors of “H. Garcia-Molina” using L′1.

RelHITS ranking
Language: $L'_2 = \{q_1(x_2 \mid x_3, x_6)\}$ with:
  $q_1(x_2 \mid x_3, x_6) \equiv \pi_{2,3,6}\, \sigma_{2 \ne 6,\, 1=5} (R \times R)$,   $f_1 = 1.0$

Author                     Rank value
Jennifer Widom             0.0514
Narayanan Shivakumar       0.0456
Ramana Yerneni             0.0424
Luis Gravano               0.0352
Joachim Hammer             0.0350
Yannis Papakonstantinou    0.0330
Wilburt J. Labio           0.0285
Chen-Chuan K. Chang        0.0258
Junghoo Cho                0.0254
Vasilis Vassalos           0.0249
Anthony Tomasic            0.0233
Chen Li                    0.0223
Jeffrey Ullman             0.0223
Yue Zhuge                  0.0203

Table 8: RelHITS rankings for all co-authors of “H. Garcia-Molina” using L′2.