We study equi-join computation in the massively parallel computation (MPC) model. Currently, a main open question under this topic is whether it is possible to design an algorithm that can process any join with load \(O(N \,\text{polylog}\, N / p^{1/\rho ^*})\), measured in the number of words communicated per machine, where N is the total number of tuples in the input relations, \(\rho ^*\) is the join’s fractional edge covering number, and p is the number of machines. We settle the question in the negative for the class of tuple-based algorithms (all the known MPC join algorithms fall in this class) by proving the existence of a join query with \(\rho ^* = 2\) that requires a load of \(\Omega (N/p^{1/3})\) to evaluate. Our lower bound provides solid evidence that the “AGM bound” alone is not sufficient for characterizing the hardness of join evaluation in MPC (a phenomenon that does not exist in RAM). The hard join instance identified in our argument is cyclic, which leaves the question of whether \(O(N \,\text{polylog}\, N / p^{1/\rho ^*})\) is still possible for acyclic joins. We answer this question in the affirmative by showing that any acyclic join can be evaluated with load \(O(N / p^{1/\rho ^*})\), which is asymptotically optimal (there are no polylogarithmic factors in our bound). The separation between cyclic and acyclic joins is yet another phenomenon that is absent in RAM. Our algorithm owes its existence to the discovery of a new mathematical structure of acyclic hypergraphs, which we call the “canonical edge cover”; it has numerous non-trivial properties and makes an elegant addition to database theory.
1 Introduction
Join evaluation is an important problem at the core of database theory. The past decade has witnessed significant progress towards understanding the problem’s complexity in the random access machine (RAM) model. For a join involving a constant number of attributes, Atserias, Grohe, and Marx proved in their seminal work [6] that the join result can include only \(O(N^{\rho ^*})\) tuples, where N is the number of tuples in the input relations, and \(\rho ^*\) is the join’s fractional edge covering number.\(^{1}\) The bound, commonly known as the AGM bound, is tight in the sense that a join can indeed return \(\Omega (N^{\rho ^*})\) tuples in the worst case. An algorithm, therefore, is worst-case optimal if it can process any join in \(O(N^{\rho ^*})\) time. Many algorithms whose running time matches this bound (sometimes up to an \(\tilde{O}(1)\) factor, where \(\tilde{O}(.)\) hides a \(\text{polylog}N\) term) have been discovered [5, 18, 21, 22, 23, 24, 25, 33].
In big-data analysis, the input relations may not fit in one machine’s memory, and therefore, joins are often processed with multiple machines on a massively parallel system like MapReduce [9], Spark [35], Hive [32], Dremel [20], and the like. CPU calculation is no longer the performance bottleneck in those environments. The new bottleneck, instead, is network communication, because of which the design of “massive join” algorithms has focused on the massively parallel computation (MPC) [8] model (to be formally defined in Section 1.1). Unraveling the worst-case complexity of join evaluation in MPC, however, has turned out to be an intriguing challenge. On the one hand, the AGM bound implies [19] a lower bound of \(\Omega (N / p^{1/\rho ^*})\) on the cost of any MPC algorithm, measured in the number of words communicated per machine, where p is the number of machines. On the other hand, despite significant efforts [2, 3, 8, 13, 14, 16, 17, 19, 27, 30], no known algorithms have been able to match this bound for arbitrary joins.
In this article, we will disprove the possibility of any MPC algorithm that can ensure cost \(\tilde{O}(N / p^{1/\rho ^*})\) for arbitrary joins. We will establish a new, higher, lower bound, thereby revealing the somewhat surprising fact that the join problem (unlike in RAM) cannot be characterized by the AGM bound alone in MPC. Furthermore, we will contrast the lower bound by developing an optimal algorithm with cost \(O(N / p^{1/\rho ^*})\) for “acyclic joins”, which form a class of joins with profound importance in database systems [1, 8, 13, 15, 34]. The separation between acyclic and cyclic joins is another characteristic of the join problem that does not exist in RAM.
1.1 Problem Definitions and Complexity Parameters
Natural Joins. Let \({\bf att}\) be a set where each element is called an attribute, and \({\bf dom}\) be another set where each element is called a value. The concrete choice of \({\bf dom}\) is unimportant, although each value in \({\bf dom}\) should occupy only a constant number of words. We assume a total order on \({\bf dom}\) (if necessary, manually impose one by ordering the values arbitrarily). A tuple over a set \(U \subseteq {\bf att}\) is a function \({\bf {u}}: U \rightarrow {\bf dom}\). For each attribute \(X \in U\), we refer to \({\bf {u}}(X)\) as the value of \({\bf {u}}\) on X. Given a subset \(U^{\prime } \subseteq U\), define \({\bf {u}}[U^{\prime }]\) as the tuple \({\bf {u^{\prime }}}\) over \(U^{\prime }\) such that \({\bf {u^{\prime }}}(X) = {\bf {u}}(X)\) for every \(X \in U^{\prime }\). A relation is a set R of tuples over the same set U of attributes; we call U the scheme of R, a fact denoted as \(\mathit {scheme}(R) = U\). Given a subset \(U \subseteq \mathit {scheme}(R)\), the projection of R on U — denoted as \(\pi _U(R)\) — is a relation with scheme U defined as \(\pi _U(R) = \lbrace \text{tuple $ {\bf {u}}$ over $U \mid \exists $ tuple $ {\bf {v}} \in R$ s.t. $ {\bf {u}}[U] = {\bf {v}}[U]$}\rbrace .\)
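We represent a join query (henceforth, simply a join or query) as a set Q of relations. Define \(\mathit {attset}(Q) = \bigcup _{R \in Q} \mathit {scheme}(R)\). The query result is the following relation over \(\mathit {attset}(Q)\):
\begin{equation} \mathit {Join}(Q) = \lbrace \text{tuple $ {\bf {u}}$ over $\mathit {attset}(Q) \mid \forall R \in Q$, $ {\bf {u}}[\mathit {scheme}(R)] \in R$}\rbrace . \end{equation}
(1)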
We refer to \(|\mathit {Join}(Q)|\) as the output size of Q. If the relations in Q are \(R_1, R_2, \ldots , R_{|Q|}\), we may also represent \(\mathit {Join}(Q)\) as \(R_1 \bowtie R_2 \bowtie \cdots \bowtie R_{|Q|}\).
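The definitions of projection and \(\mathit {Join}(Q)\) can be made concrete with a minimal brute-force sketch. It assumes that a tuple is modeled as a Python dict from attribute names to values and a relation as a list of such dicts without duplicates; the function names `project` and `natural_join` are illustrative only, and the code makes no attempt at efficiency.

```python
def project(relation, attrs):
    """pi_U(R): restrict every tuple to the attributes in attrs, dropping duplicates."""
    seen = set()
    out = []
    for t in relation:
        key = tuple(sorted((a, t[a]) for a in attrs))
        if key not in seen:
            seen.add(key)
            out.append(dict(key))
    return out


def natural_join(relations):
    """Join(Q): all tuples u over attset(Q) with u[scheme(R)] in R for every R in Q."""
    result = [{}]  # the empty tuple is the neutral starting point
    for rel in relations:
        result = [
            {**partial, **t}
            for partial in result
            for t in rel
            # t is compatible if it agrees with partial on every shared attribute
            if all(partial.get(a, v) == v for a, v in t.items())
        ]
    return result


# Hypothetical two-relation query with schemes {A, B} and {B, C}.
R = [{"A": 1, "B": 2}, {"A": 1, "B": 3}]
S = [{"B": 2, "C": 5}, {"B": 3, "C": 6}, {"B": 4, "C": 7}]
joined = natural_join([R, S])
```

Here the tuple of S with B = 4 finds no partner in R, so the output size is 2 although S contributes three tuples to the input size.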
The query Q can be characterized by a schema graph \(G = (V, E)\), which is a hypergraph where each vertex in V is a distinct attribute in \(\mathit {attset}(Q)\), and each edge in E is the scheme of a distinct relation in Q. The set E may contain identical edges because two (or more) relations in Q can have the same scheme. The term “hyper” suggests that an edge can have more than two attributes.
Fig. 1.
A query Q is acyclic if its schema graph is acyclic. Specifically, a hypergraph \(G = (V, E)\) is acyclic if we can create a tree T where
—
every node in T stores—hence, “corresponds to”—a distinct edge in E;
—
(the connectedness requirement) for every attribute \(X \in V\), the set of nodes whose corresponding edges contain X forms a connected subtree in T.
We will call T an edge tree of G. We say that Q is cyclic if it does not satisfy the above conditions.
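One common way to test this kind of acyclicity in practice is the GYO reduction, which is a swapped-in technique here and not the edge-tree construction itself: a hypergraph admits an edge tree exactly when repeatedly deleting attributes that occur in a single edge, and edges contained in another edge, eventually leaves nothing behind. The sketch below assumes single-letter attribute names so that an edge can be written as a string; `is_acyclic` is a hypothetical helper name.

```python
def is_acyclic(edges):
    """GYO reduction: True if the hypergraph reduces to (at most) one empty edge."""
    edges = [set(e) for e in edges]
    changed = True
    while changed:
        changed = False
        # Rule 1: remove attributes that occur in exactly one edge.
        for e in edges:
            lonely = {x for x in e if sum(x in f for f in edges) == 1}
            if lonely:
                e -= lonely
                changed = True
        # Rule 2: remove one edge that is contained in another edge.
        for i, e in enumerate(edges):
            if any(i != j and e <= f for j, f in enumerate(edges)):
                del edges[i]
                changed = True
                break
    return all(not e for e in edges)


path_query = ["AB", "BC", "CD"]             # a simple chain of relations
triangle_query = ["AB", "BC", "AC"]         # the classic cyclic example
five_edges = ["ABC", "DEF", "AD", "BE", "CF"]
```

On `path_query` the reduction empties the hypergraph, whereas on `triangle_query` and `five_edges` neither rule ever applies, so both are reported as cyclic.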
We use
\begin{equation} N = \sum _{R \in Q} |R| \end{equation}
(2)
to denote the input size of Q, namely, the total number of tuples in the relations participating in the join. Our discussion focuses on data complexities, that is, we are interested in the influence of N on the algorithm performance. For that reason, we assume that the schema graph G of Q has \(O(1)\) vertices, i.e., \(|\mathit {attset}(Q)| = O(1)\).
Computation Model. The MPC (massively parallel computation) model [8] has been widely deployed to design parallel algorithms on large-scale data [2, 3, 8, 13, 14, 16, 17, 19, 27, 29, 30]. In this model, we have p shared-nothing machines that are interconnected in a network. In the beginning, each machine stores \(O(N/p)\) tuples from the relations of a query Q. An algorithm starts by having each machine perform some initial computation on its local data and then executes in rounds, each having two phases:
—
in the first phase, the machines exchange messages (every message should have been prepared either in the initial computation or the second phase of the previous round);
—
in the second phase, each machine performs local computation.
An algorithm is required to finish in a constant number of rounds, and when it does, every tuple in \(\mathit {Join}(Q)\) is required to have been produced on at least one machine. The load of a round is the largest number of words received by a machine in that round. The load of an algorithm is the maximum load of all the rounds. We consider \(p \lt N^{1-\epsilon }\), where \(\epsilon \gt 0\) can be an arbitrarily small constant; this is a standard assumption behind all the previous work on MPC. With load N, any problem can be solved trivially in one round by simply sending all data to one machine. The crux of designing a load-efficient MPC algorithm is to limit the “intermediate results” that need to be transmitted across machines.
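To make the notion of load tangible, the following toy simulation runs a single round of a plain hash-partitioned join of two binary relations. It is only a sketch under simplifying assumptions that are not part of the model above: values are integers, a tuple with join value b is routed to machine b mod p, every routed tuple is counted as received (even if it already resided on its destination), and load is counted in tuples, each of which occupies a constant number of words. The function name `one_round_hash_join` is hypothetical.

```python
from collections import defaultdict


def one_round_hash_join(R, S, p):
    """Join R(A, B) with S(B, C) on p machines; return (result, load in tuples)."""
    inbox = defaultdict(lambda: ([], []))  # machine id -> (R-tuples, S-tuples)
    for a, b in R:
        inbox[b % p][0].append((a, b))
    for b, c in S:
        inbox[b % p][1].append((b, c))
    # Load of the round: the largest number of tuples received by any machine.
    load = max((len(r) + len(s) for r, s in inbox.values()), default=0)
    # Second phase: every machine joins what it received, using local data only.
    output = [
        (a, b, c)
        for r, s in inbox.values()
        for a, b in r
        for b2, c in s
        if b == b2
    ]
    return output, load


# Evenly spread join values: the 16 input tuples are split evenly over 4 machines.
out_even, load_even = one_round_hash_join(
    [(i, i) for i in range(8)], [(i, 100 + i) for i in range(8)], 4
)
# A single heavy join value: every tuple lands on the same machine.
out_skew, load_skew = one_round_hash_join(
    [(i, 0) for i in range(8)], [(0, j) for j in range(8)], 4
)
```

In the skewed input all 16 tuples reach one machine, which is the trivial "send everything to one place" situation that load-efficient algorithms are designed to avoid.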
We will confine our attention to the class of tuple-based algorithms, which treat tuples in the relations of Q as “atoms” that must always be transmitted in their entirety. Atoms are allowed to be copied, but each copy must again be sent in its entirety. To report a result tuple \({\bf {u}} \in \mathit {Join}(Q)\), a machine must have received all the atom tuples \({\bf {u}}[\mathit {scheme}(R)]\) for every \(R \in Q\). While the tuple-based class of algorithms does not encompass all possible approaches, it does include the existing MPC join algorithms that we are aware of, which will be discussed in Section 1.2. Therefore, analyzing the optimal communication complexity achievable by this class can provide valuable insights into the problem’s characteristics.
Results are reported by invoking a special zero-cost function \(\mathit {emit}(.)\). In particular, the machine, which has received all the necessary atom tuples for a result tuple \({\bf {u}} \in \mathit {Join}(Q)\), outputs \({\bf {u}}\) by employing \(\mathit {emit}( {\bf {u}})\), with the stipulation that \({\bf {u}}\) can be output only once (across all machines) throughout the algorithm’s execution. This reporting “style” reflects how join results are usually consumed in database systems: they could be (i) transmitted to a remote server via the network, (ii) written to a certain type of persistent storage, or (iii) directly supplied to a downstream process, such as an aggregate function like counting or a user-defined utility function. From the MPC model’s perspective, incorporating the \(\mathit {emit}(.)\) function effectively absolves the algorithm from the responsibility of storing the tuples of the join result. Generally, the size of \(\mathit {Join}(Q)\) could be polynomial in N (and exponential in \(|Q|\), which is regarded as a constant in this article), making \(|\mathit {Join}(Q)|/p\) potentially much larger than a machine’s memory capacity. In contrast, an MPC algorithm should utilize far less than N memory on each machine: ideally, the memory usage on each machine should be at the same order as the algorithm’s load.
Our lower bounds are combinatorial in nature. We count only how many atom tuples must be communicated in order to emit all the tuples in the join result, while any other information can be communicated for free.
Fractional Edge Coverings and Packings. Consider a query Q (which may or may not be acyclic) with schema graph \(G = (V, E)\). Let W be a function that associates every edge \(e \in E\) with a real-valued weight \(W(e)\) between 0 and 1. The function is called a fractional edge covering of G if
\[ \sum _{e \in E: X \in e} W(e) \ge 1 \]
holds for every attribute \(X \in V\), namely, the total weight of all the edges covering X is at least 1. Similarly, W is called a fractional edge packing of G if
\[ \sum _{e \in E: X \in e} W(e) \le 1 \]
holds for every attribute \(X \in V\), namely, the total weight of all the edges covering X is at most 1. In either case, we refer to \(\sum _{e \in E} W(e)\) as the total weight of W.
The fractional edge covering number of G (also of Q) — denoted as \(\rho ^*\) — is the minimum total weight of all possible fractional edge coverings of G. The fractional edge packing number of G (also of Q)—denoted as \(\tau ^*\) —is the maximum total weight of all possible fractional edge packings of G. A fractional edge covering (respectively, packing) is optimal if its total weight equals \(\rho ^*\) (respectively, \(\tau ^*\)).
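The two conditions are easy to check mechanically for a given weight function. The sketch below assumes single-letter attributes, edges written as strings with pairwise distinct schemes (so that a dict keyed by edge suffices), and exact rational weights; it only verifies a candidate W and does not search for an optimal one. All helper names are hypothetical.

```python
from fractions import Fraction


def weight_on(edges, W, x):
    """Total weight of the edges that contain attribute x."""
    return sum(W[e] for e in edges if x in e)


def has_valid_weights(edges, W):
    return all(0 <= W[e] <= 1 for e in edges)


def is_covering(edges, W):
    return has_valid_weights(edges, W) and all(
        weight_on(edges, W, x) >= 1 for x in set().union(*edges)
    )


def is_packing(edges, W):
    return has_valid_weights(edges, W) and all(
        weight_on(edges, W, x) <= 1 for x in set().union(*edges)
    )


# Hypothetical example: the triangle query with schemes AB, BC, and AC.
triangle = ["AB", "BC", "AC"]
half = {e: Fraction(1, 2) for e in triangle}
total_half = sum(half.values())
integral = {"AB": 1, "BC": 1, "AC": 0}
```

The all-halves function is simultaneously a covering and a packing of the triangle with total weight 3/2, while the integral function is a covering of total weight 2 but not a packing, because attribute B receives weight 2.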
1.2 Previous Results
AGM Bound and Join Algorithms in RAM. Consider an arbitrary join query Q whose schema graph \(G = (V, E)\) admits a fractional edge covering W. For each edge \(e \in E\), let \(R_e\) be the (only) relation in Q whose scheme corresponds to e. The AGM bound [6] states that \(\mathit {Join}(Q)\) can contain no more than \(\prod _{e \in E} |R_e|^{W(e)}\) tuples. Applying the trivial fact \(|R_e| \le N\) (where N is the input size of Q) and supplying an optimal fractional edge covering W, we obtain \(|\mathit {Join}(Q)| \le N^{\rho ^*}\). This inequality is asymptotically tight because, for any hypergraph \(G = (V, E)\) where V has a constant size, there exists a join query Q with schema graph G whose \(\mathit {Join}(Q)\) has \(\Omega (N^{\rho ^*})\) tuples [6].
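As a small numeric illustration of the product formula, the sketch below evaluates \(\prod _{e} |R_e|^{W(e)}\) for given relation sizes and a given fractional edge covering. The triangle query and the relation sizes are hypothetical, and floating-point arithmetic is used, so the value should be read as approximate.

```python
import math


def agm_bound(sizes, W):
    """Upper bound on |Join(Q)|: the product of |R_e| ** W(e) over all edges e."""
    return math.prod(sizes[e] ** W[e] for e in sizes)


# Triangle query with 100 tuples per relation and weight 1/2 on every edge.
sizes = {"AB": 100, "BC": 100, "AC": 100}
W = {"AB": 0.5, "BC": 0.5, "AC": 0.5}
bound = agm_bound(sizes, W)           # 100 ** 1.5
trivial = math.prod(sizes.values())   # size of the full Cartesian product
```

With these numbers the bound is about 1,000 tuples, far below the 1,000,000 tuples of the Cartesian product of the three relations.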
An algorithm able to answer Q using \(O(N^{\rho ^*})\) time in the RAM model is considered worst-case optimal because when \(|\mathit {Join}(Q)| = \Omega (N^{\rho ^*})\), we need \(\Theta (N^{\rho ^*})\) time even just to output \(\mathit {Join}(Q)\). Ngo et al. [23] designed the first algorithm that guarantees a running time of \(O(N^{\rho ^*})\) for all queries.\(^{2}\) Since then, the community has discovered more algorithms [5, 18, 21, 22, 23, 24, 25, 33] that are all worst-case optimal (sometimes up to an \(\tilde{O}(1)\) factor) but differ in their own features. When Q is acyclic, optimal efficiency can be achieved using a simpler algorithm due to Yannakakis [34].
Join Algorithms in MPC (and the Quest for Load \({\bf {\tilde{O}(N/p^{1/\rho ^*})}}\)). Via a reduction from the set-disjointness problem in communication complexity, Hu et al. [14] showed that \(\Omega (N/p)\) is a lower bound on the load of join evaluation in MPC.\(^{3}\) Separately, Koutris et al. [19] observed that the AGM bound implies another lower bound of \(\Omega (N/p^{1/\rho ^*})\) on the load. To understand why, suppose that each machine sees at most L atom tuples (i.e., tuples in the input relations of Q) during the entire algorithm. By the AGM bound, the machine can produce at most \(L^{\rho ^*}\) tuples in the join result. Thus, when \(|\mathit {Join}(Q)| = \Omega (N^{\rho ^*})\), we must have \(p \cdot L^{\rho ^*} = \Omega (N^{\rho ^*})\), which yields \(L = \Omega (N/p^{1/\rho ^*})\). For \(\rho ^* \gt 1\) (the case of \(\rho ^* = 1\) has been captured by the lower bound of [14]), \(N/p^{1/\rho ^*} \gg N/p\), so at least \(\Omega (N/p^{1/\rho ^*})\) of the tuples seen by a machine need to come from other machines, suggesting that the algorithm’s load must be \(\Omega (N/p^{1/\rho ^*})\).
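For concreteness, the counting step can be spelled out as follows, where \(c_0 \gt 0\) is a name introduced here (not in the original argument) for the constant hidden in \(|\mathit {Join}(Q)| = \Omega (N^{\rho ^*})\). Since the p machines together must produce every result tuple and each of them produces at most \(L^{\rho ^*}\),

\[ p \cdot L^{\rho ^*} \ge c_0 \cdot N^{\rho ^*} \;\Longrightarrow \; L^{\rho ^*} \ge \frac{c_0 \cdot N^{\rho ^*}}{p} \;\Longrightarrow \; L \ge c_0^{1/\rho ^*} \cdot \frac{N}{p^{1/\rho ^*}}. \]

Because \(\rho ^*\) is a constant, \(c_0^{1/\rho ^*}\) is a constant as well, which is exactly the statement \(L = \Omega (N/p^{1/\rho ^*})\).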
The above negative results have motivated considerable research looking for MPC algorithms whose loads are bounded by \(\tilde{O}(N/p^{1/\rho {^*}})\); such algorithms are worst-case optimal up to an \(\tilde{O}(1)\) factor. The goal has been realized only on four query classes. The first consists of all the Cartesian-product joins where the relations in Q have disjoint schemes; see [3], [7], and [17] for several optimal algorithms on such queries. The second is the so-called Loomis-Whitney join, where E consists of all the \(|V|\) possible edges of \(|V|-1\) attributes; see [19] for an optimal algorithm for such queries. The third class includes every join where each relation has at most two attributes; see [16], [17], [27], and [30] for optimal algorithms for these queries. The fourth class comprises all the so-called r-hierarchical joins, which constitute a subset of the acyclic queries considered in this article; see [13] for an optimal r-hierarchical algorithm.
We refer the reader to (i) [7] and [19] for join algorithms that perform only a single round, and (ii) [2], [13], and [14] for algorithms whose loads are sensitive to the join size \(|\mathit {Join}(Q)|\) and hence can be even lower than \(\Omega (N/p^{1/\rho ^*})\) when the join result is small.
1.3 Contributions
New Results. Our first result eliminates the possibility of answering an arbitrary join query with load \(\tilde{O}(N / p^{1/\rho ^*})\) in the MPC model. Specifically, we prove (in Theorem 1) the existence of a cyclic query Q with fractional edge covering number \(\rho ^* = 2\) and fractional edge packing number \(\tau ^* = 3\), such that any algorithm solving the query must incur a load of \(\Omega (N / p^{1/3})\) when \(p = O(N^{1/3})\). This offers solid evidence that, unlike in RAM, the AGM bound alone is insufficient to characterize the performance of join queries in MPC.
Given this new finding, a natural question is whether cyclicity is the “culprit” for the above, somewhat bizarre, MPC characteristic. We answer the question in the affirmative. Specifically, we prove (in Theorem 15) that every acyclic query can be evaluated with load \(O(N / p^{1/\rho ^*})\) in MPC, which is asymptotically optimal (note that the load complexity does not hide any polylogarithmic factors). This officially separates the class of acyclic queries from the class of cyclic queries in MPC (recall that no such separation exists in RAM). Our algorithm uses \(O(N / p^{1/\rho ^*})\) memory on every machine.
Fig. 2.
Our Techniques. The cyclic query behind our lower bound has a schema graph illustrated in Figure 2 (every letter is a vertex and every ellipse is an edge). A join having this schema graph, which we will refer to as a boat join, has five relations with schemes ABC, DEF, AD, BE, and CF, respectively (the reader should take a moment to verify its fractional edge covering number 2 and fractional edge packing number 3). The crux of our proof is to construct a boat join Q with a special property: for \(L = \Omega (N^{5/6})\), any L “atom tuples” from the input relations can produce \(O(L^3 / N)\) tuples in the join result. To contrast this property with the AGM bound, we note that any L atom tuples can produce at most \(L^2\) result tuples under the AGM bound. Because \(L^3/N = o(L^2)\) for \(L = o(N)\), each machine, if permitted to see only L atom tuples, can actually produce fewer result tuples than predicted by the AGM bound. This is the rationale behind our stronger lower bound.
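One possible way to carry out the verification suggested above is the following short calculation, in which S and T are shorthand introduced here: \(S = W(\texttt {AD}) + W(\texttt {BE}) + W(\texttt {CF})\) and \(T = W(\texttt {ABC}) + W(\texttt {DEF}) + S\) is the total weight. Each of ABC and DEF contains three attributes and each of AD, BE, and CF contains two, so adding up the covering constraints of the six attributes gives

\[ 3 W(\texttt {ABC}) + 3 W(\texttt {DEF}) + 2S \ge 6 \;\Longrightarrow \; 3T \ge 3 W(\texttt {ABC}) + 3 W(\texttt {DEF}) + 2S \ge 6 \;\Longrightarrow \; T \ge 2. \]

The value 2 is attained by \(W(\texttt {ABC}) = W(\texttt {DEF}) = 1\) with all other weights 0, hence \(\rho ^* = 2\). Adding up the six packing constraints instead gives

\[ 3 W(\texttt {ABC}) + 3 W(\texttt {DEF}) + 2S \le 6 \;\Longrightarrow \; 2T \le 3 W(\texttt {ABC}) + 3 W(\texttt {DEF}) + 2S \le 6 \;\Longrightarrow \; T \le 3. \]

The value 3 is attained by \(W(\texttt {AD}) = W(\texttt {BE}) = W(\texttt {CF}) = 1\) with all other weights 0, hence \(\tau ^* = 3\).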
Our construction has a deeper implication. The hard join query Q described earlier produces \(\Theta (N^2)\) join tuples, asymptotically the largest possible size asserted by the AGM bound. However, unlike in RAM where (for proving lower bounds) it suffices to look at the size of the global join result, our techniques suggest that in MPC it is imperative to look at the maximum size of local joins — namely, how many result tuples can be produced by \(L \ll N\) tuples only. The AGM bound can be very loose in bounding the local join sizes, which is the core reason why it does not (fully) characterize the join performance in MPC.
To develop our optimal MPC algorithm for acyclic queries, we present a theory of acyclic hypergraphs revolving around a new concept, the “canonical edge cover”. To pave the way for the concept, we first prove that any acyclic hypergraph \(G = (V, E)\) admits an integral optimal fractional edge covering W, namely, W assigns every edge in E an integer weight: either 0 or 1. This fact allows us to connect W to edge “covers”: a subset \(S \subseteq E\) is an edge cover\(^{4}\) of G if every attribute of V appears in at least one edge of S. Thus, the fractional edge covering number \(\rho ^*\) of G is simply the minimum size of all edge covers, namely, the smallest number of edges that we must pick to cover all the attributes.
A hypergraph G can have multiple optimal edge covers (all with size \(\rho ^*\)), among which we identify one as “canonical”. In Figure 1, the nine circled nodes constitute a canonical edge cover \(\mathcal {F}\) of G. Let us give an informal explanation on the derivation of \(\mathcal {F}\). After rooting the tree in Figure 1 at HN, we add to \(\mathcal {F}\) all the leaf nodes: BO, ABC, BD, EFG, HI, and LM. Next, we process the non-leaf nodes bottom up. At BCE, we ask: which attributes will disappear as we ascend further in the tree? The answer is B, which is thus a “disappearing” attribute of BCE. Then, we ask: does \(\mathcal {F}\) already cover B? The answer is yes, due to the existence of BO; we therefore do not include BCE in \(\mathcal {F}\). We continue to process CEF and CEJ similarly, but neither of them enters \(\mathcal {F}\). At EHJ, the disappearing attributes are E and J. In general, as long as one disappearing attribute has not been covered by \(\mathcal {F}\), we pick the node; this is why EHJ is in \(\mathcal {F}\). The other nodes HK and HN in \(\mathcal {F}\) are chosen based on the same reasoning.
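The informal bottom-up procedure just described can be sketched in a few lines. This is only an illustration of the informal explanation, under several assumptions: the rooted edge tree is supplied as a dict from each node to its list of children, nodes are strings of single-letter attributes, the disappearing attributes of the root are taken to be all of its attributes, and any additional conditions that a formal definition may impose on the edge tree are ignored. The function name and the tiny example tree are hypothetical and unrelated to Figure 1.

```python
def informal_canonical_cover(children, root):
    """Pick all leaves, then every non-leaf node that has an uncovered disappearing attribute."""
    parent = {c: n for n, kids in children.items() for c in kids}

    order = []  # post-order listing: every node appears after all its descendants

    def visit(node):
        for child in children.get(node, []):
            visit(child)
        order.append(node)

    visit(root)

    # Step 1: every leaf node enters the cover.
    cover = [n for n in order if not children.get(n)]
    covered = set().union(*cover)

    # Step 2: process the non-leaf nodes bottom up.
    for node in order:
        if not children.get(node):
            continue
        # Attributes of the node that are absent from its parent (all of them at the root).
        disappearing = set(node) - set(parent.get(node, ""))
        if not disappearing <= covered:
            cover.append(node)
            covered |= set(node)
    return cover


# Hypothetical edge tree: root CD, whose child is BC, whose child is the leaf AB.
chain_cover = informal_canonical_cover({"CD": ["BC"], "BC": ["AB"]}, "CD")
```

In this example the leaf AB is picked first; BC is skipped because its only disappearing attribute B is already covered by AB; and the root CD is picked because C and D are still uncovered.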
We show that a canonical edge cover determined this way has appealing properties, which eventually lead to a recursive strategy for evaluating any acyclic join optimally in MPC. At a high level, our algorithm works by simplifying G into several “residual” hypergraphs, each of which defines a sub-query to be computed recursively. Apart from some trivial modifications (such as removing the attributes and edges that have become irrelevant), a canonical edge cover of G remains canonical on every residual hypergraph. We utilize this crucial property to relate the load of the original query to the loads of the sub-queries, which yields an unusual recurrence whose solution proves an overall load of \(O(N/p^{1/\rho ^*})\). Canonical edge cover is, we believe, an elegant addition to database theory and finds further applications. In fact, by adapting our MPC algorithm to the external memory model [4], we can obtain an I/O-efficient algorithm for evaluating any acyclic join in \(O(\frac{N^{\rho ^*}}{M^{\rho ^*-1} B} \log _{M/B} \frac{N}{B})\) I/Os, which improves several existing algorithms [12, 19, 26].
Gottlob et al. [10] proved that detecting whether an acyclic query has an empty result is LOGCFL-complete, even if the number of relations in the query is not constant. Their result has important implications, such as the ability to solve the problem in a polylogarithmic number of steps on an EREW PRAM with a polynomial number of processors. However, our work on parallel evaluation of acyclic queries focuses on minimizing cross-machine communication, which is different from the goal of [10] to reduce concurrent computation steps. Thus, our findings are not directly comparable to theirs.
2 A Lower Bound for Boat Joins
In this section, we will focus on boat joins, which have the schema graph in Figure 2. Our main result is:
Recall that a boat join has fractional edge covering number \(\rho ^* = 2\) and fractional edge packing number \(\tau ^* = 3\). Hence, the theorem indicates that the join’s load can exceed \(O(N/p^{1/\rho ^*})\) and reach \(\Omega (N/p^{1/\tau ^*})\). The theorem is tight up to an \(\tilde{O}(1)\) factor because there are algorithms [19, 27] able to evaluate any boat join with load \(\tilde{O}(N/p^{1/3})\).
Given a join Q, (as before) we use the term, atom tuple, to refer to a tuple in the input relations of Q. The core of our argument is to prove:
Theorem 1 is in fact a corollary of Lemma 2 and the standard counting argument reviewed in Section 1.2. Consider the boat join Q given in the lemma. Let L be the maximum number of atom tuples that a machine sees during the entire evaluation of Q. As \(\mathit {Join}(Q)\) has \(\Theta (n^2)\) tuples, we know \(L = \Omega (n/p^{1/\rho ^*}) = \Omega (n/\sqrt {p})\) from the argument of [19] (see Section 1.2). To prove Theorem 1, we consider \(p \le c \cdot n^{1/3}\) for a sufficiently small constant \(c \gt 0\) to satisfy the requirement \(L^2 = \Omega (n^2 / p) \ge c^{\prime } \cdot n^{5/3}\), where \(c^{\prime }\) is the constant stated in Lemma 2. The lemma then tells us that each machine can generate \(O(L^3 / n)\) result tuples. To produce all the \(\Theta (n^2)\) tuples in \(\mathit {Join}(Q)\), we need \(p \cdot O(L^3 / n) = \Omega (n^2)\), which gives \(L = \Omega (n / p^{1/3}) = \Omega (N / p^{1/3})\), where the last step uses \(N = \Theta (n)\).
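Written out with explicit constants, the last step reads as follows; the names \(c_1\) and \(c_2\) are introduced here for the constants hidden in the per-machine bound \(O(L^3/n)\) and in the output size \(\Theta (n^2)\), respectively:

\[ p \cdot c_1 \cdot \frac{L^3}{n} \ge c_2 \cdot n^2 \;\Longrightarrow \; L^3 \ge \frac{c_2}{c_1} \cdot \frac{n^3}{p} \;\Longrightarrow \; L \ge \left(\frac{c_2}{c_1}\right)^{1/3} \cdot \frac{n}{p^{1/3}}. \]

Compared with the bound \(\Omega (n/\sqrt {p})\) obtained from the AGM bound alone, this is larger by a factor of \(p^{1/2} / p^{1/3} = p^{1/6}\).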
The rest of the section serves as a proof of Lemma 2. Given an integer \(k \ge 1\), we denote by \([k]\) the set of integers \(\lbrace 1, 2, \ldots , k\rbrace\). Fix integers n and L satisfying the condition \(L^2 \ge c^{\prime } \cdot n^{5/3}\), where the constant \(c^{\prime }\) will be chosen later in the proof. We consider, w.l.o.g., that \(n^{1/3}\) is an integer. A boat join, as shown in Figure 2, has attributes \(\texttt {A}, \texttt {B}, \ldots ,\) and \(\texttt {F}\). We design their domains to be
\[ {\bf dom}(\texttt {A}) = {\bf dom}(\texttt {B}) = {\bf dom}(\texttt {C}) = [n^{1/3}], \qquad {\bf dom}(\texttt {D}) = {\bf dom}(\texttt {E}) = {\bf dom}(\texttt {F}) = [n^{2/3}]. \]
Recall that a boat join has five relations: \(R_{\texttt {ABC}}, R_{\texttt {DEF}}, R_{\texttt {AD}}, R_{\texttt {BE}}\), and \(R_{\texttt {CF}}\). Henceforth, for each edge \(e \in \lbrace \texttt {ABC}, \texttt {DEF}, \texttt {AD}, \texttt {BE}, \texttt {CF}\rbrace\) in the schema graph, we use \(R_e\) to denote the relation with scheme e. Furthermore, for any \(\lbrace X_1, \ldots , X_k\rbrace \subseteq \lbrace \texttt {A}, \texttt {B}, \ldots , \texttt {F}\rbrace\) where \(k \ge 2\), we use \({\bf dom}(X_1) \times \cdots \times {\bf dom}(X_k)\) to denote the “Cartesian product” relation that contains \(\prod _{i=1}^k |{\bf dom}(X_i)|\) tuples such that, for any \((x_1, \ldots , x_k) \in {\bf dom}(X_1) \times \cdots \times {\bf dom}(X_k)\), the relation has a tuple \({\bf {u}}\) with \({\bf {u}}(X_i) = x_i\) for every \(i \in [k]\).
We will construct a set — denoted as \(\mathcal {Q}_\textrm {boat}\) — of boat joins. All those joins have precisely the same \(R_{\texttt {ABC}}, R_{\texttt {AD}}, R_{\texttt {BE}}\), and \(R_{\texttt {CF}}\), but differ in \(R_{\texttt {DEF}}\). Specifically, \(R_\texttt {ABC} = {\bf dom}(\texttt {A}) \times {\bf dom}(\texttt {B}) \times {\bf dom}(\texttt {C})\), \(R_\texttt {AD} = {\bf dom}(\texttt {A}) \times {\bf dom}(\texttt {D})\), \(R_\texttt {BE} = {\bf dom}(\texttt {B}) \times {\bf dom}(\texttt {E})\), and \(R_\texttt {CF} = {\bf dom}(\texttt {C}) \times {\bf dom}(\texttt {F})\). Note that these four relations have exactly n tuples each. It remains to clarify \(R_{\texttt {DEF}}\) (the relation that distinguishes different boat joins). In every boat join \(Q \in \mathcal {Q}_\textrm {boat}\), the relation \(R_{\texttt {DEF}} \in Q\) is a subset of the Cartesian-product relation \({\bf dom}(\texttt {D}) \times {\bf dom}(\texttt {E}) \times {\bf dom}(\texttt {F})\). The number of possible subsets is \(2^{n^2}\), which is exactly the number of boat joins in \(\mathcal {Q}_\textrm {boat}\), each using a distinct subset as its \(R_{\texttt {DEF}}\). For every join \(Q \in \mathcal {Q}_\textrm {boat}\), the result \(\mathit {Join}(Q)\) is always \(R_\texttt {ABC} \times R_\texttt {DEF}\).
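A small instance of this construction can be built and checked by brute force. The sketch assumes that A, B, and C each range over \([n^{1/3}]\) and that D, E, and F each range over \([n^{2/3}]\), which is consistent with the relation sizes stated above; the parameter m stands for \(n^{1/3}\), tuples are plain Python tuples, and the function names and the particular choice of \(R_{\texttt {DEF}}\) are hypothetical.

```python
from itertools import product


def build_boat_join(m, r_def):
    """Boat join with n = m**3: domains [m] for A, B, C and [m*m] for D, E, F."""
    small = range(1, m + 1)
    big = range(1, m * m + 1)
    return {
        "ABC": set(product(small, small, small)),
        "AD": set(product(small, big)),
        "BE": set(product(small, big)),
        "CF": set(product(small, big)),
        "DEF": set(r_def),  # the only relation that varies across the family
    }


def boat_join_result(Q):
    """Brute-force Join(Q); result tuples are ordered as (A, B, C, D, E, F)."""
    return {
        (a, b, c, d, e, f)
        for (a, b, c) in Q["ABC"]
        for (d, e, f) in Q["DEF"]
        if (a, d) in Q["AD"] and (b, e) in Q["BE"] and (c, f) in Q["CF"]
    }


m = 2            # n = 8
n = m ** 3
Q = build_boat_join(m, [(1, 2, 3), (4, 4, 4), (2, 1, 4)])
result = boat_join_result(Q)
```

Because AD, BE, and CF are full Cartesian products, the membership tests never fail, so the result has exactly \(|R_{\texttt {ABC}}| \cdot |R_{\texttt {DEF}}|\) tuples (24 in this instance).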
We will show that at least one of the joins in \(\mathcal {Q}_\textrm {boat}\) possesses the properties in Lemma 2. Our proof will proceed in two steps. First, we will reveal an intrinsic property of the boat joins in \(\mathcal {Q}_\textrm {boat}\) regarding how to select a designated number of tuples to maximize the number of result tuples. The second step will then utilize the property to find a hard boat join to establish Lemma 2.
2.1 Maximizing the Size of a Local Join
This subsection will concentrate on an arbitrary boat join \(Q \in \mathcal {Q}_\textrm {boat}\) and, therefore, every mention of “\(R_\texttt {DEF}\)” refers to the relation \(R_\texttt {DEF}\) in Q (remember \(R_\texttt {DEF}\) can be any subset of \({\bf dom}(\texttt {D}) \times {\bf dom}(\texttt {E}) \times {\bf dom}(\texttt {F})\)). We now define a combinatorial optimization problem crucial to our analysis:
Local-Join Maximization. Given a boat join \(Q \in \mathcal {Q}_\textrm {boat}\) and an arbitrary integer \(L \ge 1\), choose \(R^{\prime }_e \subseteq R_e\) for each \(e \in \lbrace \texttt {ABC}, \texttt {AD}, \texttt {BE}, \texttt {CF}\rbrace\) to maximize the output size of the local join \(\lbrace R^{\prime }_\texttt {ABC}, R^{\prime }_\texttt {AD}, R^{\prime }_\texttt {BE}, R^{\prime }_\texttt {CF}, R_\texttt {DEF}\rbrace\) subject to the constraint that each of the relations \(R^{\prime }_\texttt {ABC}, R^{\prime }_\texttt {AD}, R^{\prime }_\texttt {BE}\), and \(R^{\prime }_\texttt {CF}\) contains at most L tuples. We will represent the above problem as LJM\((L)\).
Note that the size-L constraint concerns only the edges ABC, AD, BE, and CF, while the entire \(R_\texttt {DEF}\) participates in the local join. Define
\[\begin{eqnarray} \mathrm{OPT}(Q,L) &=& \text{the maximum output size of all possible local joins in LJM$(L)$.} \end{eqnarray}\]
(3)
Solving the LJM problem exactly is challenging. However, as will become evident in Section 2.2, it suffices to find a way to approximate \(\mathrm{OPT}(Q,L)\) within a constant factor. For that purpose, we can restrict our attention to \(R^{\prime }_\texttt {ABC}, R^{\prime }_\texttt {AD}, R^{\prime }_\texttt {BE}\), and \(R^{\prime }_\texttt {CF}\) that conform to a special form. In general, a relation R with scheme \(\lbrace X_1, X_2, \ldots , X_k\rbrace\) (for some \(k \ge 1\)) is said to be in the Cartesian-product form (CP-form) if \(R = \pi _{X_1}(R) \times \pi _{X_2}(R) \times \cdots \times \pi _{X_k}(R)\), namely, R is the Cartesian product of its projections on the k attributes. We now define a variant of the LJM problem:
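The CP-form condition can be tested directly from its definition. In the sketch below a relation is assumed to be a collection of equal-length Python tuples, with position i playing the role of attribute \(X_{i+1}\); the empty relation is treated as being in CP-form, which is a convention chosen here. The function name is hypothetical.

```python
from itertools import product


def is_cp_form(relation):
    """True if the relation equals the Cartesian product of its single-attribute projections."""
    rel = set(relation)
    if not rel:
        return True  # convention for the empty relation (an assumption of this sketch)
    k = len(next(iter(rel)))
    projections = [{t[i] for t in rel} for i in range(k)]
    return rel == set(product(*projections))


grid = {(1, 1), (1, 2), (2, 1), (2, 2)}  # {1, 2} x {1, 2}
diagonal = {(1, 1), (2, 2)}              # projections are {1, 2} each, but two tuples are missing
```

The relation `grid` is in CP-form, whereas `diagonal` is not: the product of its projections has four tuples, of which it contains only two.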
Local-Join Maximization with Cartesian Products. Given a boat join \(Q \in \mathcal {Q}_\textrm {boat}\) and an arbitrary integer \(L \ge 1\), choose \(R^{\prime }_e \subseteq R_e\) for each \(e \in \lbrace \texttt {ABC}, \texttt {AD}, \texttt {BE}, \texttt {CF}\rbrace\) to maximize the output size of the local join \(\lbrace R^{\prime }_\texttt {ABC}, R^{\prime }_\texttt {AD}, R^{\prime }_\texttt {BE}, R^{\prime }_\texttt {CF}, R_\texttt {DEF}\rbrace\) subject to the constraint that each of the relations \(R^{\prime }_\texttt {ABC}, R^{\prime }_\texttt {AD}, R^{\prime }_\texttt {BE}\), and \(R^{\prime }_\texttt {CF}\) (i) contains at most L tuples, and (ii) is in the CP-form. We will represent the above problem as LJM-CP\((L)\).
Define
\begin{equation} \mathrm{OPT}_\mathrm{CP}(Q,L) = \text{the maximum output size of all possible local joins in LJM-CP$(L)$.} \end{equation}
(4)
The lemma below, whose proof is deferred to Section 2.3, gives a crucial relationship between the functions in Equations (3) and (4).
2.2 Identifying a Hard Boat Join
In this subsection, we will prove the existence of a boat join \(Q \in \mathcal {Q}_\textrm {boat}\) such that
—
Q has an input size \(\Theta (n),\)
—
\(|\mathit {Join}(Q)| = \Theta (n^2)\), and
—
\(\mathrm{OPT}_\mathrm{CP}(Q, 8L) = O(L^3 / n),\)
as long as n is sufficiently large and \(L \ge c^{\prime } \cdot n^{5/6}\) for some constant \(c^{\prime }\) to be chosen later. It follows from Lemma 3 that \(\mathrm{OPT}(Q,L) = O(L^3 / n)\). By definition of LJM\((L)\) and the meaning of \(\mathrm{OPT}(Q,L)\) (see Equation (3)), any L atom tuples of Q can produce \(O(L^3/n)\) result tuples. This will then complete the proof of Lemma 2.
Recall that all the boat joins in \(\mathcal {Q}_\textrm {boat}\) differ only in their \(R_\texttt {DEF}\), which can be any subset of \({\bf dom}(\texttt {D}) \times {\bf dom}(\texttt {E}) \times {\bf dom}(\texttt {F})\). Next, we impose a distribution over \(\mathcal {Q}_\textrm {boat}\). For this purpose, create \(R_\texttt {DEF}\) by including each tuple of \({\bf dom}(\texttt {D}) \times {\bf dom}(\texttt {E}) \times {\bf dom}(\texttt {F})\) independently with probability \(1/n\). The expected size of \(R_\texttt {DEF}\) is \((n^{2/3})^3 \cdot \frac{1}{n} = n\). Accordingly, the boat join Q thus obtained—which is now a random variable—has an expected input size of 5n and an expected output size of \(n^2\) (recall that \(\mathit {Join}(Q) = R_\texttt {ABC} \times R_\texttt {DEF}\)). Our goal is to prove that, with a positive probability, Q satisfies two conditions simultaneously:
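The random choice of \(R_{\texttt {DEF}}\) is straightforward to simulate. The sketch below assumes that D, E, and F each range over \([n^{2/3}]\), writes m for \(n^{1/3}\), and uses a seeded generator purely for reproducibility; `sample_r_def` is a hypothetical name. It draws the relation repeatedly and averages the observed sizes as an empirical check of the expectation, not as part of the proof.

```python
import random
from itertools import product


def sample_r_def(m, rng):
    """Include each tuple of dom(D) x dom(E) x dom(F) independently with probability 1/n."""
    n = m ** 3
    big = range(1, m * m + 1)  # [n^(2/3)]
    return {t for t in product(big, big, big) if rng.random() < 1 / n}


rng = random.Random(0)
m = 4                      # n = 64, so the Cartesian product has n**2 = 4096 tuples
n = m ** 3
samples = [sample_r_def(m, rng) for _ in range(200)]
average_size = sum(len(s) for s in samples) / len(samples)
```

The average size should come out close to n = 64, in line with the calculation \((n^{2/3})^3 \cdot \frac{1}{n} = n\).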
—
C2.2-1: \(R_\texttt {DEF}\) has at most 2n tuples;
—
C2.2-2: \(\mathrm{OPT}_\mathrm{CP}(Q, 8L) \le 2 \cdot (8L)^3 / n\).
The positive probability assures us that a boat join Q fulfilling the two conditions definitely exists. Condition C2.2-1 implies that Q has an input size \(\Theta (n)\) and an output size \(\Theta (n^2)\). It thus follows that Q has all the properties promised at the beginning of this subsection.
The satisfaction probability of C2.2-1 is easy to analyze: as each tuple in \({\bf dom}(\texttt {D}) \times {\bf dom}(\texttt {E}) \times {\bf dom}(\texttt {F})\) belongs to \(R_\texttt {DEF}\) independently with probability \(1/n\), a simple application of the Chernoff bound (42) in Appendix A (supplying \(\gamma = 1\)) shows that the probability for \(|R_\texttt {DEF}|\) to exceed twice its expectation \(\mathbf {E}[|R_\texttt {DEF}|] = n\) is at most \(\exp (-\Omega (\mathbf {E}[|R_\texttt {DEF}|])) = \exp (-\Omega (n))\), which is less than \(1/4\) for sufficiently large n. The following discussion will focus on Condition C2.2-2.
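The random construction of \(R_\texttt {DEF}\) and Condition C2.2-1 can be illustrated empirically. The sketch below is only a sanity check under assumed parameters: the value n = 216, the helper name `sample_r_def`, the fixed seed, and the number of trials are all hypothetical choices made for illustration, and each attribute domain is taken to have \(n^{2/3}\) values, as implied by the expected-size calculation above.

```python
import itertools
import random


def sample_r_def(n, dom_size, rng):
    """Include each tuple of dom(D) x dom(E) x dom(F) independently with probability 1/n."""
    domain = range(dom_size)
    return {
        t for t in itertools.product(domain, repeat=3)
        if rng.random() < 1.0 / n
    }


# Hypothetical small instance: n = 216, so each domain has n^(2/3) = 36 values.
n = 216
dom_size = round(n ** (2 / 3))
expected_size = dom_size ** 3 / n      # (n^(2/3))^3 * (1/n) = n

rng = random.Random(0)                 # fixed seed, for reproducibility only
trials = 20
# Count how many sampled relations violate C2.2-1, i.e., have more than 2n tuples.
violations = sum(
    len(sample_r_def(n, dom_size, rng)) > 2 * n for _ in range(trials)
)
print(dom_size, expected_size, violations)
```

Such a simulation does not replace the Chernoff argument; it only shows that, already for small n, exceeding 2n tuples is a very rare event.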
We refer to \(\lbrace R^{\prime }_\texttt {ABC}, R^{\prime }_\texttt {AD}, R^{\prime }_\texttt {BE}, R^{\prime }_\texttt {CF}\rbrace\) as a legal CP-form selection if, for each \(e \in \lbrace \texttt {ABC}, \texttt {AD}, \texttt {BE}, \texttt {CF}\rbrace\), the relation \(R^{\prime }_e\) (i) is a subset of \(R_e\), (ii) is in the CP-form, and (iii) contains at most 8L tuples. The next lemma offers a bound on the number of such selections.
In the LJM-CP\((8L)\) problem defined by a boat join Q, each local join is formed by a legal CP-form selection together with the relation \(R_\texttt {DEF} \in Q\). The value \(\mathrm{OPT}_\mathrm{CP}(Q,8L)\) is the maximum size of all those local joins. We thus have:
\[\begin{eqnarray} \!\!\!\!\!\!\!\!\!\! && \mathbf {Pr}[\mathrm{OPT}_\mathrm{CP}(Q,8L) \gt 2 \cdot (8L)^3/n] \nonumber \nonumber\\ && (\textrm {note: the probability is over the distribution of Q}) \nonumber \nonumber\\ &&= {\mathbf {Pr}[\exists \text{ one legal CP-form selection $\lbrace R^{\prime }_\texttt {ABC}, R^{\prime }_\texttt {AD}, R^{\prime }_\texttt {BE}, R^{\prime }_\texttt {CF}\rbrace $ s.t. $|\mathit {Join}(\lbrace R^{\prime }_\texttt {ABC}, R^{\prime }_\texttt {AD}, R^{\prime }_\texttt {BE}, R^{\prime }_\texttt {CF}, R_\texttt {DEF}\rbrace)| \gt 2 \cdot (8L)^3/n$}]} \nonumber \nonumber\\ && (\textrm {note: the probability is over the distribution of} \ {R_\texttt {DEF}}) \nonumber \nonumber\\ &&\le \sum _{\text{legal CP-form selection } \lbrace R^{\prime }_\texttt {ABC}, R^{\prime }_\texttt {AD}, R^{\prime }_\texttt {BE}, R^{\prime }_\texttt {CF}\rbrace } \mathbf {Pr}[|\mathit {Join}(\lbrace R^{\prime }_\texttt {ABC}, R^{\prime }_\texttt {AD}, R^{\prime }_\texttt {BE}, R^{\prime }_\texttt {CF}, R_\texttt {DEF}\rbrace)| \gt 2 \cdot (8L)^3/n] \\ && (\textrm {note: the probability is over the distribution of} \ {R_\texttt {DEF}}). \nonumber \nonumber \end{eqnarray}\]
for some sufficiently large constant \(c_0 \gt 0\). We can now fix the constant \(c^{\prime }\) in Lemma 2 to \(\sqrt {c_0}\).
In conclusion, we have shown that, with probability at least \(1 - (1/4) - (1/4) = 1/2\), a boat join Q we generated at the beginning of the subsection satisfies both conditions C2.2-1 and C2.2-2.
This subsection is devoted to proving Lemma 3. Recall that, in the context of this lemma, we concentrate on one (arbitrarily) given boat join \(Q \in \mathcal {Q}_\textrm {boat}\) (in other words, \(R_\texttt {DEF}\) has been fixed). Let \(R^*_\texttt {ABC}, R^*_\texttt {AD}, R^*_\texttt {BE}\), and \(R^*_\texttt {CF}\) constitute an optimal solution to LJM\((L)\), i.e., \(\mathrm{OPT}(Q,L)\) equals the output size of the local join \(\lbrace R^*_\texttt {ABC}, R^*_\texttt {AD}, R^*_\texttt {BE},\)\(R^*_\texttt {CF}, R_\texttt {DEF}\rbrace\). We will construct \(R^{\prime }_\texttt {ABC}, R^{\prime }_\texttt {AD}, R^{\prime }_\texttt {BE}\), and \(R^{\prime }_\texttt {CF}\) such that
—
all of them are in the CP-form and have at most 8L tuples each;
—
the join \(\lbrace R^{\prime }_\texttt {ABC}, R^{\prime }_\texttt {AD}, R^{\prime }_\texttt {BE}, R^{\prime }_\texttt {CF}, R_\texttt {DEF}\rbrace\) has size at least \(\frac{1}{128} \cdot \mathrm{OPT}(Q, L)\).
Our conversion starts by setting \(R^{\prime }_e = R^*_e\) for each \(e \in \lbrace \texttt {AD}, \texttt {BE}, \texttt {CF}, \texttt {ABC}\rbrace\) and proceeds by converting, in this order, \(R^{\prime }_\texttt {AD}\), \(R^{\prime }_\texttt {BE}\), \(R^{\prime }_\texttt {CF}\), and \(R^{\prime }_\texttt {ABC}\) to the CP-form incrementally. Each of the first three conversions (on \(R^{\prime }_\texttt {AD}\), \(R^{\prime }_\texttt {BE}\), and \(R^{\prime }_\texttt {CF}\)) can decrease the output size of the join \(\lbrace R^{\prime }_\texttt {ABC}, R^{\prime }_\texttt {AD}, R^{\prime }_\texttt {BE}, R^{\prime }_\texttt {CF}, R_\texttt {DEF}\rbrace\) by a factor of at most 4, while the size of \(R^{\prime }_e\) at most doubles for each \(e \in \lbrace \texttt {ABC}, \texttt {AD}, \texttt {BE}, \texttt {CF}\rbrace\). The last conversion (on \(R^{\prime }_\texttt {ABC}\)) can reduce the join output size by another factor of 2, but will not increase the size of any relation further.
Conversion of \({\bf {R^{\prime }_\texttt {AD}}}\). At this moment, \(R^{\prime }_e = R^*_e\) for each \(e \in \lbrace \texttt {ABC}, \texttt {AD}, \texttt {BE}, \texttt {CF}\rbrace\) and, hence, \(|R^{\prime }_e| \le L\). We will produce two new relations \(R^{\prime \prime }_\texttt {ABC}\) and \(R^{\prime \prime }_\texttt {AD}\) such that
—
\(R^{\prime \prime }_\texttt {AD}\) is in the CP-form (but \(R^{\prime \prime }_\texttt {ABC}\) may not be);
—
each of \(R^{\prime \prime }_\texttt {ABC}\) and \(R^{\prime \prime }_\texttt {AD}\) has at most 2L tuples;
—
the output size of the join \(\lbrace R^{\prime \prime }_\texttt {ABC}, R^{\prime \prime }_\texttt {AD}, R^{\prime }_\texttt {BE}, R^{\prime }_\texttt {CF}, R_\texttt {DEF}\rbrace\) is at least 1/4 of that of the join \(\lbrace R^{\prime }_\texttt {ABC}, R^{\prime }_\texttt {AD}, R^{\prime }_\texttt {BE}, R^{\prime }_\texttt {CF}, R_\texttt {DEF}\rbrace\).
The conversion will then finish by replacing \(R^{\prime }_\texttt {ABC}\) and \(R^{\prime }_\texttt {AD}\) with \(R^{\prime \prime }_\texttt {ABC}\) and \(R^{\prime \prime }_\texttt {AD}\).
Our strategy for generating \(R^{\prime \prime }_\texttt {ABC}\) and \(R^{\prime \prime }_\texttt {AD}\) involves three steps:
—
First, obtain a “good” subset \(S_\texttt {BC} \subseteq {\bf dom}(\texttt {B}) \times {\bf dom}(\texttt {C})\), and a “good” subset \(S_\texttt {D} \subseteq {\bf dom}(\texttt {D})\). The reader can regard \(S_\texttt {BC}\) as a relation over \(\lbrace \texttt {B}, \texttt {C}\rbrace\) and \(S_\texttt {D}\) as a relation over \(\lbrace \texttt {D}\rbrace\).
—
Second, choose a subset \(S_\texttt {A} \subseteq {\bf dom}(\texttt {A})\), which can be regarded as a relation over \(\lbrace \texttt {A}\rbrace\).
Given a value \(a \in {\bf dom}(\texttt {A})\), we denote by \(\lbrace a\rbrace\) the special “singleton” relation that contains only one tuple with value a on attribute \(\texttt {A}\). Regardless of \(S_\texttt {BC}\), \(S_\texttt {D}\), and \(S_\texttt {A}\), it always holds that
where the term “\(\lbrace 1\rbrace\)” in (10) refers to the relation \(\lbrace a\rbrace\) with \(a = 1 \in {\bf dom}(\texttt {A})\). The equality in (10) holds because, once \(S_\texttt {BC}\) and \(S_\texttt {D}\) are fixed, the output size of the join \(\lbrace \lbrace a\rbrace \times S_\texttt {BC}, \lbrace a\rbrace \times S_\texttt {D}, R^{\prime }_\texttt {BE}, R^{\prime }_\texttt {CF}, R_\texttt {DEF}\rbrace\) is identical for any \(a \in {\bf dom}(\texttt {A})\).
Motivated by (10), we consider the following refined variant of LJM:
Local-Join Maximization by Choosing BC and D. Fix an arbitrary integer \(t \ge 1\). Choose \(S_\texttt {BC} \subseteq {\bf dom}(\texttt {B}) \times {\bf dom}(\texttt {C})\) and \(S_\texttt {D} \subseteq {\bf dom}(\texttt {D})\) to maximize the size of the local join \(\lbrace \lbrace 1\rbrace \times S_\texttt {BC}, \lbrace 1\rbrace \times S_\texttt {D}, R^{\prime }_\texttt {BE}, R^{\prime }_\texttt {CF}, R_\texttt {DEF}\rbrace\) subject to the constraint \(|S_\texttt {BC}| + |S_\texttt {D}| \le t\). We will represent the above problem as LJM-choose-BC-D\((t)\).
Define
\begin{equation} \Delta (t) = \text{the maximum output size of all possible local joins in the problem LJM-choose-BC-D$(t)$.} \tag{11} \end{equation}
How to compute \(\Delta (t)\) precisely is of no relevance to us; what matters, instead, is that \(\Delta (t)\) exists and is monotonically increasing. Define
Note, importantly, that \(t^*\) is selected from the range \([ \frac{2L}{N^{1/3}}, 2L]\). We are now ready to explain how to generate \(S_\texttt {BC}\), \(S_\texttt {D}\), and \(S_\texttt {A}\) for computing \(R^{\prime \prime }_\texttt {ABC}\) and \(R^{\prime \prime }_\texttt {AD}\):
—
Set \(S_\texttt {BC}\) and \(S_\texttt {D}\) as in an optimal solution to the LJM-choose-BC-D\((t^*)\) problem, i.e., the join \(\lbrace \lbrace 1\rbrace \times S_\texttt {BC}, \lbrace 1\rbrace \times S_\texttt {D}, R^{\prime }_\texttt {BE}, R^{\prime }_\texttt {CF}, R_\texttt {DEF}\rbrace\) has size \(\Delta (t^*)\);
Hence, each of \(R^{\prime \prime }_\texttt {ABC}\) and \(R^{\prime \prime }_\texttt {AD}\) has at most 2L tuples, as desired. Next, we prove that the join \(\lbrace R^{\prime \prime }_\texttt {ABC}, R^{\prime \prime }_\texttt {AD}, R^{\prime }_\texttt {BE}, R^{\prime }_\texttt {CF}, R_\texttt {DEF}\rbrace\) has a sufficiently large result.
Conversion of \({\bf {R^{\prime }_\texttt {BE}}}\). At this moment, \(|R^{\prime }_e| \le 2L\) for each \(e \in \lbrace \texttt {ABC}, \texttt {AD}, \texttt {BE}, \texttt {CF}\rbrace\). We aim to produce two new relations \(R^{\prime \prime }_\texttt {ABC}\) and \(R^{\prime \prime }_\texttt {BE}\) such that (i) \(R^{\prime \prime }_\texttt {BE}\) is in the CP-form (but \(R^{\prime \prime }_\texttt {ABC}\) may not be), (ii) each of \(R^{\prime \prime }_\texttt {ABC}\) and \(R^{\prime \prime }_\texttt {BE}\) has at most 4L tuples, and (iii) the output size of the join \(\lbrace R^{\prime \prime }_\texttt {ABC}, R^{\prime }_\texttt {AD}, R^{\prime \prime }_\texttt {BE}, R^{\prime }_\texttt {CF}, R_\texttt {DEF}\rbrace\) is at least 1/4 of that of the join \(\lbrace R^{\prime }_\texttt {ABC}, R^{\prime }_\texttt {AD}, R^{\prime }_\texttt {BE}, R^{\prime }_\texttt {CF}, R_\texttt {DEF}\rbrace\). Due to symmetry, we can achieve the purpose by applying the same argument presented earlier for \(R^{\prime }_\texttt {AD}\) and changing L to 2L (it would help to “rename” A to B and D to E in applying the argument and then restore the names afterwards). The conversion then finishes by replacing \(R^{\prime }_\texttt {ABC}\) and \(R^{\prime }_\texttt {BE}\) with \(R^{\prime \prime }_\texttt {ABC}\) and \(R^{\prime \prime }_\texttt {BE}\). Note that \(R^{\prime }_\texttt {AD}\) is not affected by this conversion and hence remains in the CP-form.
Conversion of \({\bf {R^{\prime }_\texttt {CF}}}\). This should have become straightforward from the previous two conversions. \(R^{\prime }_\texttt {AD}\) and \(R^{\prime }_\texttt {BE}\) are not affected by this conversion and hence remain in the CP-form.
Conversion of \({\bf {R^{\prime }_\texttt {ABC}}}\). At this moment, \(|R^{\prime }_e| \le 8L\) for each \(e \in \lbrace \texttt {ABC}, \texttt {AD}, \texttt {BE}, \texttt {CF}\rbrace\). Furthermore, \(R^{\prime }_\texttt {AD}\), \(R^{\prime }_\texttt {BE}\), and \(R^{\prime }_\texttt {CF}\) are already in the CP-form. We will produce a new relation \(R^{\prime \prime }_\texttt {ABC}\) such that
—
\(R^{\prime \prime }_\texttt {ABC}\) is in the CP-form and has at most 8L tuples;
—
the output size of the join \(\lbrace R^{\prime \prime }_\texttt {ABC}, R^{\prime }_\texttt {AD}, R^{\prime }_\texttt {BE}, R^{\prime }_\texttt {CF}, R_\texttt {DEF}\rbrace\) is at least half of that of the join \(\lbrace R^{\prime }_\texttt {ABC}, R^{\prime }_\texttt {AD}, R^{\prime }_\texttt {BE}, R^{\prime }_\texttt {CF}, R_\texttt {DEF}\rbrace\).
After setting \(R^{\prime }_\texttt {ABC}\) to \(R^{\prime \prime }_\texttt {ABC}\), we will have obtained the join \(\lbrace R^{\prime }_\texttt {ABC}, R^{\prime }_\texttt {AD}, R^{\prime }_\texttt {BE}, R^{\prime }_\texttt {CF}, R_\texttt {DEF}\rbrace\) needed to complete the proof of Lemma 3; note that \(R^{\prime }_\texttt {AD}\), \(R^{\prime }_\texttt {BE}\), and \(R^{\prime }_\texttt {CF}\) are not affected by this conversion.
Given a tuple \({\bf {u}} \in {\bf dom}(\texttt {A}) \times {\bf dom}(\texttt {B}) \times {\bf dom}(\texttt {C})\), we use \(\lbrace {\bf {u}}\rbrace\) to denote the singleton relation with scheme ABC containing only \({\bf {u}}\). Given also a tuple \({\bf {v}} \in R_\texttt {DEF}\), we use \({\bf {u}} \circ {\bf {v}}\) to denote the tuple over scheme ABCDEF that takes value \({\bf {u}}(X)\) for every attribute \(X \in \lbrace \texttt {A}, \texttt {B}, \texttt {C}\rbrace\) and value \({\bf {v}}(X)\) for every attribute \(X \in \lbrace \texttt {D}, \texttt {E}, \texttt {F}\rbrace\). The lemma below explains why we want to make sure that \(R^{\prime }_\texttt {AD}\), \(R^{\prime }_\texttt {BE}\), and \(R^{\prime }_\texttt {CF}\) are already in the CP-form.
We denote by \(s_1\) the output size of the join \(\lbrace \lbrace {\bf {u}}\rbrace , R^{\prime }_\texttt {AD}, R^{\prime }_\texttt {BE}, R^{\prime }_\texttt {CF}, R_\texttt {DEF}\rbrace\) for an arbitrary \({\bf {u}} \in \pi _\texttt {A}(R^{\prime }_\texttt {AD}) \times \pi _\texttt {B}(R^{\prime }_\texttt {BE}) \times \pi _\texttt {C}(R^{\prime }_\texttt {CF})\). It follows from Lemma 7 that
Next, we will construct an \(R^{\prime \prime }_\texttt {ABC} \subseteq \pi _\texttt {A}(R^{\prime }_\texttt {AD}) \times \pi _\texttt {B}(R^{\prime }_\texttt {BE}) \times \pi _\texttt {C}(R^{\prime }_\texttt {CF})\) such that \(R^{\prime \prime }_\texttt {ABC}\) is in the CP-form and
Since (by Lemma 7) the output size of the join \(\lbrace R^{\prime \prime }_\texttt {ABC}, R^{\prime }_\texttt {AD}, R^{\prime }_\texttt {BE}, R^{\prime }_\texttt {CF}, R_\texttt {DEF}\rbrace\) is \(s_1 \cdot |R^{\prime \prime }_\texttt {ABC}|\), (16) and (18) together will assure us that the aforementioned join size is at least half of that of the join \(\lbrace R^{\prime }_\texttt {ABC}, R^{\prime }_\texttt {AD}, R^{\prime }_\texttt {BE}, R^{\prime }_\texttt {CF}, R_\texttt {DEF}\rbrace\).
Henceforth, we consider \(s_2 \gt 0\) (otherwise, simply choose \(R^{\prime \prime }_\texttt {ABC}\) to be an empty solution). Define
It must hold that \(|S^{\prime }_\texttt {A}||S^{\prime }_\texttt {B}||S^{\prime }_\texttt {C}| \ge s_2\) (otherwise, \(R^{\prime }_\texttt {ABC} \cap (\pi _\texttt {A}(R^{\prime }_\texttt {AD}) \times \pi _\texttt {B}(R^{\prime }_\texttt {BE}) \times \pi _\texttt {C}(R^{\prime }_\texttt {CF}))\) would have a size less than \(s_2\), giving a contradiction).
We will choose subsets \(S_\texttt {A} \subseteq S^{\prime }_\texttt {A}\), \(S_\texttt {B} \subseteq S^{\prime }_\texttt {B}\), \(S_\texttt {C} \subseteq S^{\prime }_\texttt {C}\), and then generate \(R^{\prime \prime }_\texttt {ABC} = S_\texttt {A} \times S_\texttt {B} \times S_\texttt {C}\). Specifically, we first set \(S_\texttt {A}\) directly to \(S^{\prime }_\texttt {A}\). Let \(k_\texttt {B}\) be the greatest integer in \([1, |S^{\prime }_\texttt {B}|]\) satisfying \(|S^{\prime }_\texttt {A}| \cdot k_\texttt {B} \le 8L\); note that, if \(k_\texttt {B} \lt |S_\texttt {B}^{\prime }|\), then we must have \(|S^{\prime }_\texttt {A}| \cdot k_\texttt {B} \gt 4L\).5 Now, create \(S_\texttt {B}\) by including \(k_\texttt {B}\) arbitrary values in \(S^{\prime }_\texttt {B}\). Let \(k_\texttt {C}\) be the greatest integer in \([1, |S^{\prime }_\texttt {C}|]\) satisfying \(|S^{\prime }_\texttt {A}| \cdot k_\texttt {B} \cdot k_\texttt {C} \le 8L\) (if \(k_\texttt {C} \lt |S^{\prime }_\texttt {C}|\), then \(|S^{\prime }_\texttt {A}| \cdot k_\texttt {B} \cdot k_\texttt {C} \gt 4L\)). Create \(S_\texttt {C}\) by including \(k_\texttt {C}\) arbitrary values in \(S^{\prime }_\texttt {C}\).
\(R^{\prime \prime }_\texttt {ABC}\) has size \(|S^{\prime }_\texttt {A}| \cdot k_\texttt {B} \cdot k_\texttt {C} \le 8L\). For validating (18), it remains to explain why \(|R^{\prime \prime }_\texttt {ABC}| \ge s_2 / 2\). This is obvious if \(k_\texttt {B} = |S^{\prime }_\texttt {B}|\) and \(k_\texttt {C} = |S^{\prime }_\texttt {C}|\) (in this case, \(|R^{\prime \prime }_\texttt {ABC}| = |S^{\prime }_\texttt {A}||S^{\prime }_\texttt {B}||S^{\prime }_\texttt {C}| \ge s_2\)). Otherwise, \(|S^{\prime }_\texttt {A}| \cdot k_\texttt {B} \cdot k_\texttt {C}\) must be greater than 4L, which is at least \(s_2 / 2\) because \(s_2 \le |R^{\prime }_\texttt {ABC}| \le 8L\). This concludes the proof of Lemma 3.
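The greedy choice of \(k_\texttt {B}\) and \(k_\texttt {C}\) in the conversion of \(R^{\prime }_\texttt {ABC}\) depends only on the three set sizes and on L, so it can be sketched in a few lines. The function name `choose_cp_sizes` and the sample numbers are hypothetical; the sketch also assumes \(1 \le |S^{\prime }_\texttt {A}| \le 8L\) and non-empty \(S^{\prime }_\texttt {B}\), \(S^{\prime }_\texttt {C}\), so that both integers exist.

```python
def choose_cp_sizes(size_a, size_b, size_c, L):
    """Return (|S_A|, k_B, k_C) chosen by the greedy rule described in the text.

    Assumes 1 <= size_a <= 8L and size_b, size_c >= 1 (an assumption of this sketch).
    """
    cap = 8 * L
    if not 1 <= size_a <= cap or size_b < 1 or size_c < 1:
        raise ValueError("sketch assumes 1 <= |S'_A| <= 8L and non-empty S'_B, S'_C")
    # Greatest k_B in [1, |S'_B|] with |S'_A| * k_B <= 8L.
    k_b = min(size_b, cap // size_a)
    # Greatest k_C in [1, |S'_C|] with |S'_A| * k_B * k_C <= 8L.
    k_c = min(size_c, cap // (size_a * k_b))
    return size_a, k_b, k_c


# Hypothetical sizes: |S'_A| = 5, |S'_B| = 10, |S'_C| = 7, L = 10.
a, k_b, k_c = choose_cp_sizes(5, 10, 7, L=10)
print(a * k_b * k_c)   # size of the resulting Cartesian product S_A x S_B x S_C
```

In this instance \(S_\texttt {C}\) is truncated, and the product size 50 lies in \((4L, 8L] = (40, 80]\), matching the case analysis above.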
3 Canonical Edge Covers for Acyclic Hypergraphs
Since it is no longer feasible to process all cyclic joins with a load of \(\tilde{O}(N/p^{1/\rho ^*})\), our focus will shift to acyclic joins, as defined in Section 1.1. In this section, we will concentrate solely on graph theory and introduce the concept of “canonical edge cover” for acyclic hypergraphs, along with several important properties. These properties will then be leveraged in Sections 4 and 5 to design an MPC algorithm for evaluating any acyclic join with a load of \(O(N/p^{1/\rho ^*})\).
Our discussion throughout the section is based on:
—
an acyclic hypergraph \(G = (V, E)\) with \(|E| \ge 2\), and
—
an edge tree T of G.
As G and T both contain “vertices” and “edges”, for better clarity we will obey several conventions in our presentation. A vertex in G will always be referred to as an attribute, while the term node is reserved for the vertices in T. Furthermore, to avoid confusion with the edges in G, we will always refer to an edge in T as a link.
An edge \(e \in E\) is subsumed in G if it is a subset of another edge \(e^{\prime } \in E\), i.e., \(e \subseteq e^{\prime }\). If an attribute X appears in only one edge of E, it is an exclusive attribute; otherwise, X is non-exclusive. Unless otherwise stated, we allow G to be an arbitrary acyclic hypergraph. This means that E can have two or more edges containing the same set of attributes (nonetheless, they are still different edges) and may even have empty edges. We say that G is non-empty if \(E \ne \emptyset\) and that G is reduced if E has no subsumed edges.
By rooting T at an arbitrary leaf, we can regard T as a rooted tree. Make all the links from parent to child; this way, T becomes a directed acyclic graph. We say that the root of T is the highest node in T and, in general, a node is higher (or lower) than any of its proper descendants (or ancestors). For any non-root node e, we denote its parent node in T as \(\mathit {parent}(e)\).
Now that there are two views of T (i.e., undirected and directed), we will be careful with tree terminology. By default, we will treat T as a directed tree. Accordingly, a leaf of T is a node with out-degree 0, a path is a sequence of nodes where each node has a link pointing to the next node, and a subtree rooted at a node e is the directed tree induced by the nodes reachable from e in T. Sometimes, we may revert to the undirected view of T. In that case, we use the term raw leaf for a leaf in the undirected T (a raw leaf can be a leaf or the root under the directed view).
3.1 Canonical Edge Cover: Formulation and Basic Properties
For each attribute \(X \in V\), we define the summit of X as the highest node in T containing X. If node e is the summit of X, we call X a disappearing attribute in e. By acyclicity’s connectedness requirement (Section 1.1), X can appear only in the subtree rooted at e and hence “disappears” as soon as we leave the subtree.
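Summits and disappearing attributes are straightforward to compute once T is rooted. The following sketch uses a hypothetical four-node edge tree; the dictionary-based representation (`attrs` for node contents, `parent` for links) and the function names are assumptions made for illustration only.

```python
def depth(node, parent):
    """Number of links between node and the root of T."""
    d = 0
    while parent[node] is not None:
        node = parent[node]
        d += 1
    return d


def summits(attrs, parent):
    """Map every attribute X to its summit, the highest node of T containing X."""
    summit = {}
    for node, node_attrs in attrs.items():
        for x in node_attrs:
            if x not in summit or depth(node, parent) < depth(summit[x], parent):
                summit[x] = node
    return summit


def disappearing(attrs, parent):
    """Map every node e to the set of attributes whose summit is e."""
    result = {node: set() for node in attrs}
    for x, node in summits(attrs, parent).items():
        result[node].add(x)
    return result


# Hypothetical edge tree rooted at e1: e1 -> e2 -> {e3, e4}.
attrs = {"e1": {"A", "B"}, "e2": {"B", "C"}, "e3": {"C", "D"}, "e4": {"C", "E"}}
parent = {"e1": None, "e2": "e1", "e3": "e2", "e4": "e2"}

print(summits(attrs, parent))
print(disappearing(attrs, parent))
```

In this example, attribute C has summit e2 even though it also appears in e3 and e4, so C is a disappearing attribute of e2 only.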
We say that a subset \(S \subseteq E\) covers an attribute \(X \in V\) if S has an edge containing X. Recall (from Section 1.3) that an optimal edge cover of G is the smallest S covering every attribute in V. Optimal edge covers are not unique; some are of particular importance to us, and we will identify them as “canonical”. Towards a procedural definition, consider the following algorithm:
As proved shortly, the output of edge-cover is uniquely determined by T, regardless of the reverse topological order used at Line 2. This permits us to define the canonical edge cover (CEC) of G induced by T to be the output of edge-cover.
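The listing of edge-cover is not reproduced in this excerpt. The sketch below is therefore only one plausible reading, consistent with the surrounding description: it visits the nodes of T in a reverse topological order (children before parents) and adds a node whenever it has a disappearing attribute that no previously added node covers. This rule, the function name `edge_cover_sketch`, and the sample tree are assumptions, not the authors' listing.

```python
def edge_cover_sketch(attrs, parent):
    """Greedy bottom-up cover; an assumed reading of edge-cover, not the paper's listing."""
    def depth(node):
        d = 0
        while parent[node] is not None:
            node, d = parent[node], d + 1
        return d

    # Summit of X: the highest (smallest-depth) node containing X.
    summit = {}
    for node, node_attrs in attrs.items():
        for x in node_attrs:
            if x not in summit or depth(node) < depth(summit[x]):
                summit[x] = node

    cover, covered = [], set()
    # Deepest nodes first: every node is visited before its parent,
    # which is one valid reverse topological order of the directed tree.
    for node in sorted(attrs, key=depth, reverse=True):
        disappearing = {x for x in attrs[node] if summit[x] == node}
        if disappearing - covered:      # some disappearing attribute is still uncovered
            cover.append(node)
            covered |= attrs[node]
    return cover


# Hypothetical edge tree rooted at e1: e1 -> e2 -> {e3, e4}.
attrs = {"e1": {"A", "B"}, "e2": {"B", "C"}, "e3": {"C", "D"}, "e4": {"C", "E"}}
parent = {"e1": None, "e2": "e1", "e3": "e2", "e4": "e2"}
print(sorted(edge_cover_sketch(attrs, parent)))
```

On this reduced hypergraph the sketch selects e1, e3, and e4, that is, the root and the two leaves, while e2 is skipped because its only disappearing attribute C is already covered by a child.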
The lemma below gives three properties of edge-cover that pave the foundation of all the development in the subsequent sections.
As a remark, Lemma 8 implies that any acyclic hypergraph G has an integral optimal fractional edge covering that maps every edge of G to either 0 or 1.
3.2 Signature Paths, Clusterings, k-Groups, Anchor Leaves, and Anchor Attributes
Suppose that we have already computed the CEC \(\mathcal {F}\) of \(G = (V, E)\) induced by an edge tree T of G. This subsection will introduce several concepts derived from \(\mathcal {F}\) that are important to our analysis.
Whenever \(\mathcal {F}\) includes the root of T, we can define a signature path — denoted as \(\textrm {sigpath}(f, T)\) — for each node \(f \in \mathcal {F}\). Specifically, \(\textrm {sigpath}(f, T)\) is a set of nodes obtained as follows.
—
If f is the root of T, \(\textrm {sigpath}(f, T) = \lbrace f\rbrace\).
—
Otherwise, let \(\hat{f}\) be the lowest node in \(\mathcal {F}\) that is a proper ancestor of f. Then, \(\textrm {sigpath}(f, T)\) is the set of nodes on the path from \(\hat{f}\) to f, except \(\hat{f}\).
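The two cases of the signature-path definition translate directly into a short upward walk in T. In the sketch below, the function name `sigpath`, the dictionary representation of T, and the sample cover are hypothetical; the code assumes, as the text requires, that the cover contains the root of T.

```python
def sigpath(f, cover, parent):
    """Nodes on the path from f's lowest proper ancestor in the cover down to f,
    excluding that ancestor; the nodes are listed from highest to lowest."""
    if parent[f] is None:               # f is the root of T
        return [f]
    path, node = [f], parent[f]
    while node not in cover:            # climb until the lowest proper ancestor in the cover
        path.append(node)
        node = parent[node]
        if node is None:
            raise ValueError("the cover must contain the root of T")
    return path[::-1]


# Hypothetical edge tree rooted at e1: e1 -> e2 -> {e3, e4}, with cover {e1, e3, e4}.
parent = {"e1": None, "e2": "e1", "e3": "e2", "e4": "e2"}
cover = {"e1", "e3", "e4"}
paths = {f: sigpath(f, cover, parent) for f in sorted(cover)}
print(paths)
```

Here node e2 lies on the signature paths of both e3 and e4, which illustrates that signature paths of different cover nodes may overlap.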
The concepts to be defined in the remainder of this subsection apply only if G is reduced. When G is reduced, \(\mathcal {F}\) contains the root and all the leaves of T (Lemma 8). In this case, we define the T-clustering of G as
For each \(f \in \mathcal {F}\), we will refer to \(\textrm {sigpath}(f, T)\) as a cluster of \(\mathcal {C}\). Note that every node of T (or equivalently, every edge of E) belongs to at least one, but possibly more than one, cluster. If f is not the root of T, we call \(\textrm {sigpath}(f,T)\) a non-root cluster. Given an integer \(k \ge 1\), we define a k-group of \(\mathcal {C}\) to be a collection of k edges, each taken from a distinct cluster in \(\mathcal {C}\).
Let \(f_\textrm {anc}\) be a leaf node in \(\mathcal {F}\), and \(\hat{f}\) be the lowest proper ancestor of \(f_\textrm {anc}\) in \(\mathcal {F}\). We call \(f_\textrm {anc}\) an anchor leaf of T if
—
\(\hat{f}\) has no non-leaf proper descendant in \(\mathcal {F}\), and
—
\(f_\textrm {anc}\) has an attribute \(A_\textrm {anc}\) such that
–
\(A_\textrm {anc}\notin \hat{f}\);
–
\(A_\textrm {anc}\in e\) for every node \(e \in \textrm {sigpath}(f_\textrm {anc}, T)\).
We call \(A_\textrm {anc}\) an anchor attribute of \(f_\textrm {anc}\). The above definition does not apply to the case where \(\hat{f} = \mathit {nil}\) (i.e., T has only a single node, which is \(f_\textrm {anc}\)); in that special case, we define the anchor leaf of T to be \(f_\textrm {anc}\) and call any attribute in \(f_\textrm {anc}\) an anchor attribute.
3.3 CEC Properties after Removing an Anchor Attribute
This subsection assumes that the input hypergraph \(G = (V, E)\) is reduced. As before, let T be an edge tree of G and \(\mathcal {F}\) be the CEC induced by T. Identify an arbitrary anchor leaf \(f_\textrm {anc}\) of T and an arbitrary anchor attribute \(A_\textrm {anc}\) of \(f_\textrm {anc}\). As will be clear in Section 4, in join evaluation, we will need to simplify G by removing \(A_\textrm {anc}\). CEC has several interesting properties under such simplification as discussed next.
3.3.1 Residual Hypergraph and Its CEC.
Removing \(A_\textrm {anc}\) from G produces a residual hypergraph \(G^{\prime } = (V^{\prime }, E^{\prime })\) where
—
\(V^{\prime } = V \setminus \lbrace A_\textrm {anc}\rbrace\), and
—
\(E^{\prime }\) is produced by including, for every \(e \in E\), an edge \(\mathit {map}(e) = e \setminus \lbrace A_\textrm {anc}\rbrace\).
Define \(\mathit {map}^{-1}\) as the inverse function of \(\mathit {map}\), namely, for each \(e^{\prime } \in E^{\prime }\), \(\mathit {map}^{-1}(e^{\prime })\) is the unique edge \(e \in E\) satisfying \(e^{\prime } = \mathit {map}(e)\). The functions \(\mathit {map}\) and \(\mathit {map}^{-1}\) capture the one-one correspondence between E and \(E^{\prime }\).
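The construction of the residual hypergraph can be sketched by keeping an identifier for every edge, so that the one-one correspondence given by \(\mathit {map}\) survives even when two residual edges contain the same attributes. The edge identifiers, the function names, and the four-edge example (with C playing the role of \(A_\textrm {anc}\)) are hypothetical.

```python
def remove_attribute(edges, a_anc):
    """Apply map(e) = e minus {a_anc} to every edge; edge ids keep the correspondence."""
    return {eid: attrs - {a_anc} for eid, attrs in edges.items()}


def subsumed_edges(edges):
    """Ids of edges that are subsets of another edge (compared by id, not by content)."""
    return sorted(
        eid for eid, attrs in edges.items()
        if any(other != eid and attrs <= other_attrs
               for other, other_attrs in edges.items())
    )


# Hypothetical reduced hypergraph; C is assumed to be the anchor attribute.
edges = {"e1": {"A", "B"}, "e2": {"B", "C"}, "e3": {"C", "D"}, "e4": {"C", "E"}}
residual = remove_attribute(edges, "C")

print(residual)
print(subsumed_edges(edges), subsumed_edges(residual))
```

Although the original hypergraph has no subsumed edge, the residual edge of e2 becomes {B}, a subset of e1, so the residual hypergraph is no longer reduced.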
Denote by \(T^{\prime }\) the edge tree of \(G^{\prime }\) obtained by discarding \(A_\textrm {anc}\) from every node in T. The next lemma, whose proof can be found in Appendix B, shows that the CEC of \(G^{\prime }\) induced by \(T^{\prime }\) can be derived directly from \(\mathcal {F}\).
Fig. 3.
3.3.2 Cleansing the Residual Graph and Preserving the CEC.
Even though G is reduced, the residual hypergraph \(G^{\prime }\) may contain subsumed edges. Next, we describe a cleansing procedure which converts \(G^{\prime }\) into a reduced hypergraph \(G^* = (V^{\prime }, E^*)\) (note that \(G^*\) has the same vertices as \(G^{\prime }\)) and converts \(T^{\prime }\) into an edge tree \(T^*\) of \(G^*\).
Fig. 4.
At the end of cleansing, we always set \(\mathcal {F}^* = \mathcal {F}^{\prime }\) directly. An important property of cleansing is that it does not affect the CEC, as formally stated below.
The proof of the lemma can be found in Appendix C.
3.3.3 Preserving k-Groups.
The next property concerns the hypergraph \(G^* = (V^{\prime }, E^*)\) after cleansing the original hypergraph \(G = (V, E)\). Recall that \(T^*\) and T are edge trees of \(G^*\) and G, respectively. Before proceeding, the reader should recall that every edge \(e^* \in E^*\) corresponds to a distinct edge \(\mathit {map}^{-1}(e^*) \in E\).
Fig. 5.
By definition of k-group (see Section 3.2), \(e^*_1, \ldots , e^*_k\) originate from k distinct clusters in \(\mathcal {C}^*\). The lemma essentially promises k different clusters in \(\mathcal {C}\), each of which contains a distinct edge in \(\lbrace \mathit {map}^{-1}(e^*_1), \ldots , \mathit {map}^{-1}(e^*_k)\rbrace\).
3.4 CEC Properties after Removing a Signature Path
This subsection will discuss another simplification needed in join evaluation. As before, we have a hypergraph \(G = (V, E)\), and denote by T an edge tree of G and by \(\mathcal {F}\) the CEC of G induced by T. Identify an arbitrary anchor leaf \(f_\textrm {anc}\) of T and an arbitrary anchor attribute \(A_\textrm {anc}\) of \(f_\textrm {anc}\). The simplification deletes all the edges in the signature path \(\textrm {sigpath}(f_\textrm {anc},T)\) from G. Next, we discuss the properties of CEC under such simplification.
3.4.1 Residual Hypergraphs and Their CECs.
Removing \(\textrm {sigpath}(f_\textrm {anc}, T)\) decomposes G into multiple components. To explain, define
\begin{equation} Z = \lbrace \text{node $z$ in $T$} \mid z \notin \textrm {sigpath}(f_\textrm {anc}, T) \text{ and } \mathit {parent}(z) \in \textrm {sigpath}(f_\textrm {anc}, T)\rbrace . \tag{21} \end{equation}
For each \(z \in Z\), define a rooted tree \(T^*_z\) as follows:
—
The root of \(T^*_z\) is the parent of z in T;
—
The root of \(T^*_z\) has only one child in \(T^*_z\), which is z;
—
The subtree rooted at z in \(T^*_z\) is the same as the subtree rooted at z in T.
Separately, define \(\overline{T^*}\) as the rooted tree obtained by removing from T the subtree rooted at the highest node in \(\textrm {sigpath}(f_\textrm {anc}, T)\).
From each \(T^*_z\), generate a residual hypergraph \(G^*_z = (V^*_z, E^*_z)\) where:
—
\(E^*_z\) includes all and only the nodes (a.k.a., edges of G) in \(T^*_z\);
—
\(V^*_z\) is the set of attributes appearing in at least one edge in \(E^*_z\).
Likewise, from \(\overline{T^*}\), generate a residual hypergraph \(\overline{G^*} = (\overline{V^*}, \overline{E^*})\) where
—
\(\overline{E^*}\) includes all and only the nodes in \(\overline{T^*}\);
—
\(\overline{V^*}\) is the set of attributes appearing in at least one edge in \(\overline{E^*}\).
Because G is reduced, so must be all the residual hypergraphs. For each \(z \in Z\), \(T^*_z\) is an edge tree of \(G^*_z\); similarly, \(\overline{T^*}\) is an edge tree of \(\overline{G^*}\).
Fig. 6.
Recall that \(\mathcal {F}\) is the CEC of G induced by T. The next lemma shows that the CECs of the residual hypergraphs can be derived from \(\mathcal {F}\) effortlessly.
We close the section with a property resembling Lemma 12. Define a super-k-group to be a set of edges \(K = \lbrace e_1, e_2, \ldots , e_k\rbrace\) satisfying:
—
each \(e_i\), \(i \in [k]\), is taken from either a cluster of the \(\overline{T^*}\)-clustering of \(\overline{G^*}\) or a non-root cluster6 of the \(T^*_z\)-clustering of \(G^*_z\) for some \(z \in Z\);
—
no two edges in K are taken from the same cluster.
The following sections will apply the theory of CECs to solve acyclic joins in the MPC model. Specifically, we will describe a new algorithm in Section 4 and present its analysis in Section 5. Figure 7 provides an overview of our algorithm.
Fig. 7.
4.1 Fundamental Definitions
In this subsection, we will introduce several basic definitions applicable to general acyclic queries. Consider an acyclic query Q whose schema graph is \(G = (V, E)\). Fix an arbitrary edge tree T of G and use the edge-cover algorithm (in Section 3.1) to compute the CEC \(\mathcal {F}\) of G induced by T. Let \(\mathcal {C}\) be the T-clustering of G in (19).
Recall that a k-group of \(\mathcal {C}\) is a collection of k edges, each taken from a distinct cluster in \(\mathcal {C}\). Given a k-group K of \(\mathcal {C}\), we define
where \(R_e\) is the (only) relation in Q with scheme e; the Q-product of K is simply the Cartesian-product size of all the input relations corresponding to the edges in K. Given an integer \(k \in [|\mathcal {F}|]\), we define the max-\((k,Q)\)-product of \(\mathcal {C}\) as the largest Q-product of all k-groups K, or formally:
\begin{equation} P_k(Q,\mathcal {C}) = \max _{\text{$k$-group $K$ of $\mathcal {C}$}} \text{$Q$-product of $K$.} \tag{25} \end{equation}
As the Q-product of any k-group is at most \({N}^k\) where N is the input size of Q, we always have \(P_k(Q,\mathcal {C}) \le {N}^k\). Finally, define
where \(|\mathcal {C}|\) is the number of clusters in \(\mathcal {C}\).
As \(P_k(Q,\mathcal {C}) \le {N}^k\) for any \(k \in [|\mathcal {C}|]\), the Q-induced load of \(\mathcal {C}\) is at most \(N / p^{1/|\mathcal {C}|}\). Another useful fact is \(P_1(Q, \mathcal {C}) = \Theta (N)\) (which holds because Q has a constant number of relations). This means that the Q-induced load of \(\mathcal {C}\) is \(\Omega (N/p)\).
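On small inputs, the max-\((k,Q)\)-product in (25) can be evaluated by brute force. The sketch below enumerates candidate edge sets, tests whether the edges can be matched to pairwise distinct clusters, and multiplies the relation sizes. It assumes that a k-group consists of k distinct edges; the cluster contents, relation sizes, and function names are hypothetical.

```python
from itertools import combinations
from math import prod


def is_k_group(edges, clusters):
    """True if the edges can be matched to pairwise distinct clusters containing them."""
    def assign(i, used):
        if i == len(edges):
            return True
        for c, cluster in enumerate(clusters):
            if c not in used and edges[i] in cluster and assign(i + 1, used | {c}):
                return True
        return False
    return assign(0, frozenset())


def max_k_product(k, clusters, sizes):
    """Largest product of relation sizes over all k-groups (0 if no k-group exists)."""
    all_edges = sorted(set().union(*clusters))
    return max(
        (prod(sizes[e] for e in group)
         for group in combinations(all_edges, k)
         if is_k_group(group, clusters)),
        default=0,
    )


# Hypothetical clustering with three clusters; edge e2 belongs to two of them.
clusters = [{"e1"}, {"e2", "e3"}, {"e2", "e4"}]
sizes = {"e1": 10, "e2": 50, "e3": 20, "e4": 30}   # |R_e| for each edge e
print([max_k_product(k, clusters, sizes) for k in (1, 2, 3)])
```

The exhaustive search is exponential in the number of edges, which is acceptable here only because a query has a constant number of relations.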
4.2 Configurations
Henceforth, we fix Q to be the acyclic query to be answered. Denote by \(G = (V, E)\) the schema graph of Q. We assume G to be reduced; otherwise, Q can be converted in load \(O(N/p)\) to a query that has the same result but with a reduced schema graph [14, 17]. We will also assume that Q has at least two relations; otherwise, the query is trivial and requires no communication.
Choose an arbitrary edge tree T of G and compute the CEC \(\mathcal {F}\) of G induced by T. Define \(\mathcal {C}\) as the T-clustering of G given in (19). Choose an arbitrary anchor leaf \(f_\textrm {anc}\) of T and an arbitrary anchor attribute \(A_\textrm {anc}\) of \(f_\textrm {anc}\); remember that \(A_\textrm {anc}\) appears in all the edges of \(\textrm {sigpath}(f_\textrm {anc}, T)\).
For each edge \(e \in E\), let \(R_e\) represent the relation in Q whose scheme is e. Fix a value \(x \in {\bf dom}\). Given an edge \(e \in \textrm {sigpath}(f_\textrm {anc}, T)\), we define the \(A_\textrm {anc}\)-frequency of x in \(R_e\) as the number of tuples \({\bf {u}} \in R_e\) such that \({\bf {u}}(A_\textrm {anc}) = x\). Moreover, define the signature-path \(A_\textrm {anc}\)-frequency of x as the sum of its \(A_\textrm {anc}\)-frequencies in the \(R_e\) of all \(e \in \textrm {sigpath}(f_\textrm {anc}, T)\).
We will use L to represent the Q-induced load of \(\mathcal {C}\) defined in (26). Given a value \(x \in {\bf dom}\), we say that x is
—
heavy, if its signature-path \(A_\textrm {anc}\)-frequency is at least L;
—
light, otherwise.
Divide \({\bf dom}\) into disjoint intervals such that either the entire \({\bf dom}\) is one interval or the light values in each interval have a total signature-path \(A_\textrm {anc}\)-frequency of \(\Theta (L)\). We will refer to those intervals as the light intervals of \(A_\textrm {anc}\).
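The classification of values and the construction of light intervals can be sketched in Python as follows. This is only an illustration under stated assumptions: tuples are modeled as dictionaries, a light interval is represented by the sorted list of light values it contains, and the greedy packing rule (close an interval once its total frequency reaches L, then merge any leftover into the last interval) is one hypothetical way to obtain totals of \(\Theta (L)\).

```python
from collections import Counter

def sigpath_frequencies(sigpath_relations, anchor):
    """Signature-path frequency: for each value, the number of tuples that
    carry it on the anchor attribute, summed over the given relations."""
    freq = Counter()
    for relation in sigpath_relations:
        for tup in relation:
            freq[tup[anchor]] += 1
    return freq

def configurations(freq, L):
    """Return (heavy values, light intervals as lists of light values)."""
    heavy = sorted(x for x, f in freq.items() if f >= L)
    light = sorted(x for x, f in freq.items() if f < L)
    intervals, current, total = [], [], 0
    for x in light:
        current.append(x)
        total += freq[x]
        if total >= L:          # each light value adds < L, so total < 2L
            intervals.append(current)
            current, total = [], 0
    if current:                 # leftover with total frequency < L
        if intervals:
            intervals[-1].extend(current)   # merged total stays below 3L
        else:
            intervals.append(current)       # the whole domain is one interval
    return heavy, intervals
```

With this rule every interval, except possibly a single interval covering the whole domain, has a total frequency in \([L, 3L)\).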
A configuration \(\eta\) is either a heavy value or a light interval of \(A_\textrm {anc}\). The number of configurations, which is the total number of heavy values and light intervals, is at most
where the second equality used the fact that \(\textrm {sigpath}(f_\textrm {anc},T)\) has \(O(1)\) edges and the third equality applied the definition of the max-\((k,Q)\)-product of \(\mathcal {C}\) in (25) and the definition of L, i.e., the Q-induced load of \(\mathcal {C}\) in (26).
For each edge \(e \in E\), define a relation \(R(e, \eta)\) as follows:
—
if \(\eta\) is a heavy value, \(R(e, \eta)\) includes all and only the tuples \({\bf {u}} \in R_e\) satisfying \({\bf {u}}(A_\textrm {anc}) = \eta\);
—
if \(\eta\) is a light interval, \(R(e, \eta)\) includes all and only the tuples \({\bf {u}} \in R_e\) where \({\bf {u}}(A_\textrm {anc})\) is a light value in \(\eta\).
Note that \(R(e, \eta) = R_e\) if \(A_\textrm {anc}\notin e\). We associate the configuration with a query
Our objective is to compute \(\mathit {Join}(Q_\eta)\) for all \(\eta\) in parallel. The final result \(\mathit {Join}(Q)\) is simply \(\bigcup _\eta \mathit {Join}(Q_\eta)\).
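As a sanity check of this decomposition, the Python sketch below builds the relations \(R(e, \eta)\) for a toy query and confirms that the union of the per-configuration joins equals the full join. The relations, the attribute names, and the two configurations (one heavy value and one light interval) are hypothetical, and the naive join is used only for illustration.

```python
def natural_join(relations):
    """Naive natural join of a list of relations (lists of dict tuples)."""
    result = [{}]
    for relation in relations:
        result = [
            {**left, **right}
            for left in result
            for right in relation
            if all(left[a] == right[a] for a in left.keys() & right.keys())
        ]
    return result

def restrict(relation, anchor, keep):
    """R(e, eta): tuples whose anchor value satisfies `keep`. A relation
    without the anchor attribute is returned unchanged."""
    return [t for t in relation if anchor not in t or keep(t[anchor])]

def canon(tuples):
    """Order-independent representation of a set of result tuples."""
    return sorted(tuple(sorted(t.items())) for t in tuples)

R1 = [{"A": 1, "B": 10}, {"A": 1, "B": 11}, {"A": 2, "B": 12}, {"A": 3, "B": 13}]
R2 = [{"A": 1, "C": 7}, {"A": 2, "C": 7}, {"A": 3, "C": 8}]
R3 = [{"C": 7, "D": 0}, {"C": 8, "D": 1}]
Q = [R1, R2, R3]

# Hypothetical configurations: heavy value 1 and one light interval {2, 3}.
configs = [lambda x: x == 1, lambda x: x in (2, 3)]

pieces = []
for keep in configs:
    Q_eta = [restrict(R, "A", keep) for R in Q]
    pieces.extend(natural_join(Q_eta))
```

Because every value of the anchor attribute belongs to exactly one configuration, the pieces are disjoint and together reproduce the result of the full join.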
Note that \(Q_\eta\) has the same schema graph G as Q. Recall that \(\mathcal {C}\) is the T-clustering of G. The rest of the section will explain how to solve \(\mathit {Join}(Q_\eta)\) for an arbitrary \(\eta\) using
machines, where \(P_k(Q_\eta , \mathcal {C})\) is the max-\((k,Q_\eta)\)-product of \(\mathcal {C}\). As will be proved in Section 5, we can adjust the constants in (28) to make sure \(\sum _\eta p_\eta \le p\).
4.3 Solving \(Q_\eta\) When \(\eta\) is a Heavy Value
Remove \(A_\textrm {anc}\) from G, and define the residual hypergraph \(G^{\prime } = (V^{\prime }, E^{\prime })\), as well as the functions \(\mathit {map}(.)\) and \(\mathit {map}^{-1}(.)\), in the way explained in Section 3.3. We compute \(\mathit {Join}(Q_\eta)\) in five steps.
Step 1. Send the tuples of \(R(e, \eta)\), for all \(e \in E\), to the \(p_\eta\) allocated machines such that each machine receives \(\Theta (\frac{1}{p_\eta } \sum _{e \in E} |R(e, \eta)|)\) tuples.
Step 2. For each \(e \in E\), convert \(R(e, \eta)\) to \(R^*(e^{\prime }, \eta)\) where \(e^{\prime } = \mathit {map}(e) = e \setminus \lbrace A_\textrm {anc}\rbrace\). Specifically, \(R^*(e^{\prime }, \eta)\) is a copy of \(R(e, \eta)\) but with \(A_\textrm {anc}\) discarded, or formally:
Note that if \(A_\textrm {anc}\notin e\), then \(R^*(e^{\prime }, \eta) = R(e, \eta)\). No communication occurs as each machine simply discards \(A_\textrm {anc}\) from every tuple \({\bf {u}} \in R(e, \eta)\) in the local storage.
Step 3. Cleanse \(G^{\prime }\) into \(G^* = (V^{\prime }, E^*)\) by calling the cleanse algorithm in Section 3.3.1. Every time cleanse performs an iteration in Lines 5-9 with edges \(e_\textrm {small}\) and \(e_\textrm {big}\), we perform a semi-join between \(R^*(e_\textrm {small}, \eta)\) and \(R^*(e_\textrm {big}, \eta)\). The semi-join removes every tuple \({\bf {u}}\) from \(R^*(e_\textrm {big}, \eta)\) with the property that \({\bf {u}}[e_\textrm {small}]\) is absent from \(R^*(e_\textrm {small}, \eta)\). \(R^*(e_\textrm {small}, \eta)\) is discarded after the semi-join.
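Step 4. Let \(Q^*_\eta\) denote the query formed by the relations \(R^*(e, \eta)\) of all \(e \in E^*\) that remain after Step 3. Note that \(Q^*_\eta\) involves one fewer attribute than Q (because \(A_\textrm {anc}\) no longer exists). Compute \(\mathit {Join}(Q^*_\eta)\) using \(p_\eta\) machines recursively.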
Step 5. Output \(\mathit {Join}(Q_\eta)\) by augmenting each tuple \({\bf {u}} \in \mathit {Join}(Q^*_\eta)\) with \({\bf {u}}(A_\textrm {anc}) = \eta\). No communication is needed.
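The local operations behind the steps above (discarding the anchor attribute, the semi-join, and the final augmentation) can be sketched in Python. This is a minimal single-machine illustration, assuming tuples are dictionaries and that the scheme of the small relation is a subset of that of the big one; the helper names are hypothetical, and the distribution of tuples across machines is not modeled.

```python
def drop_attribute(relation, attr):
    """Step 2: discard `attr` from every tuple (a purely local operation)."""
    return [{a: v for a, v in t.items() if a != attr} for t in relation]

def semi_join(big, small):
    """Step 3: keep the tuples of `big` whose projection onto the attributes
    of `small` appears in `small`."""
    if not small:
        return []               # nothing in `big` can find a match
    attrs = sorted(small[0])
    present = {tuple(t[a] for a in attrs) for t in small}
    return [t for t in big if tuple(t[a] for a in attrs) in present]

def augment(results, attr, eta):
    """Step 5: put the heavy value back on the anchor attribute."""
    return [{**t, attr: eta} for t in results]

# Toy instance with heavy value eta = 1 on anchor attribute "A".
eta = 1
R_e1 = [{"A": 1, "B": 10}, {"A": 1, "B": 11}]      # scheme {A, B}
R_e2 = [{"B": 10, "C": 5}, {"B": 12, "C": 6}]      # scheme {B, C}
small = drop_attribute(R_e1, "A")                  # scheme {B}, now subsumed
big = semi_join(R_e2, small)
answer = augment(big, "A", eta)
```

In this toy instance the cleansed query has a single relation, so its join is that relation itself and the recursion of Step 4 is trivial.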
Tuple-Based Implementation. For the benefit of reader comprehension, we have deliberately presented our algorithm in a way that aligns conceptually with the discussion in Section 3.3. However, this approach may inadvertently lead to the misconception that our algorithm does not handle each tuple in the original relations of Q as atoms, as mandated by the class of tuple-based algorithms outlined in Section 1.1. Specifically, the confusion lies in removing the attribute \(A_\textrm {anc}\) in Step 2 and “concatenating” it back in Step 5.
Nevertheless, once the reader has grasped the underlying rationale, it is straightforward to resolve the issue by opting for a tuple-based implementation. First, it is important to remember that \(\eta\) is the sole value under attribute \(A_\textrm {anc}\) for the sub-query \(Q_\eta\) we are processing. As mentioned, for each edge \(e \in E\) containing \(A_\textrm {anc}\), Step 2 removes \(A_\textrm {anc}\) from relation \(R(e, \eta)\) by generating the relation \(R^*(e^{\prime }, \eta)\) of (29), where \(e^{\prime } = e \setminus \lbrace A_\textrm {anc}\rbrace\). Specifically, for each tuple \({\bf {u}} \in R(e, \eta)\), Step 2 adds the tuple \({\bf {v}} = {\bf {u}}[e^{\prime }]\) to \(R^*(e^{\prime }, \eta)\), effectively retaining all values of \({\bf {u}}\) except \({\bf {u}}(A_\textrm {anc}) = \eta\). This gives rise to a sub-query devoid of attribute \(A_\textrm {anc}\). Steps 3-4 evaluate this sub-query by moving tuples like \({\bf {v}}\) among the \(p_\eta\) allocated machines. Each result tuple of the sub-query needs to be augmented with the value \(\eta\) on attribute \(A_\textrm {anc}\) before being returned (Step 5). To allow each machine to perform such augmentation locally, we can broadcast \(\eta\) to all the \(p_\eta\) allocated machines (this increases the load by only one). This way, whenever \({\bf {v}}\) is communicated between two machines, we are in fact transmitting a pair \(( {\bf {v}}, \eta)\), which is equivalent to sending the tuple \({\bf {u}}\) itself as an atom. This yields a tuple-based implementation of our algorithm.
4.4 Solving \(Q_\eta\) When \(\eta\) is a Light Interval
Remove \(\textrm {sigpath}(f_\textrm {anc}, T)\) from G, and define Z, \(G^*_z\) and \(T^*_z\) for each \(z \in Z\), \(\overline{G^*}\), and \(\overline{T^*}\) all in the way described in Section 3.4. We compute \(\mathit {Join}(Q_\eta)\) in four steps.
Step 1. Same as Step 1 of the algorithm in Section 4.3.
Step 2. For each \(e \in \textrm {sigpath}(f_\textrm {anc}, T)\), broadcast \(R(e, \eta)\) to all \(p_\eta\) machines. By definition of light interval, the size of every such \(R(e, \eta)\) is at most L.
Step 3. For each \(z \in Z\), define for \(G^*_z = (V^*_z, E^*_z)\):
Note that \(\overline{Q^*_\eta }\) and the \(Q^*_{\eta , z}\) of each \(z \in Z\) have at least one less relation than Q due to the disappearance of \(f_\textrm {anc}\).
where \(P_k(\overline{Q^*_\eta }, \overline{\mathcal {C}^*})\) is the max-\((k, \overline{Q^*_\eta })\)-product of the clustering \(\overline{\mathcal {C}^*}\). We will prove later that \(\mathit {Join}(Q^*_{\eta , z})\) of each \(z \in Z\) can be evaluated with load \(O(L)\) using \(p_{\eta ,z}\) machines, and \(\mathit {Join}(\overline{Q^*_\eta })\) can be evaluated with load \(O(L)\) using \(\overline{p_\eta }\) machines. Therefore, applying the Cartesian product algorithm given in Lemma 4 of [17], we can compute (30) with load \(O(L)\) using
machines. As proved in Section 5, we can adjust the constants in (31) and (32) to make sure that (33) is at most the value \(p_\eta\) given in (28).
Step 4. We combine the Cartesian product in (30) with the tuples broadcast in Step 2 to derive \(\mathit {Join}(Q_\eta)\) with no more communication. Specifically, for each tuple \({\bf {u}}\) in the Cartesian product, the machine where \({\bf {u}}\) resides outputs \(\lbrace {\bf {u}}\rbrace \bowtie \left(\bowtie _{e \in \textrm {sigpath}(f_\textrm {anc}, T)} R(e, \eta)\right)\). It is easy to verify that all the tuples of \(\mathit {Join}(Q_\eta)\) will be produced this way.
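The local work of Step 4 can be illustrated with a short Python generator. The sketch assumes tuples are dictionaries and that the broadcast relations \(R(e, \eta)\) of the signature path are already available on the machine; the function name and the sample data are hypothetical. Results are emitted one at a time, so no intermediate join result is materialized.

```python
def emit_join(u, broadcast_relations):
    """Yield every tuple of {u} joined with all the broadcast relations."""
    if not broadcast_relations:
        yield u
        return
    first, rest = broadcast_relations[0], broadcast_relations[1:]
    for t in first:
        # t is compatible with u if they agree on every shared attribute.
        if all(u[a] == t[a] for a in u.keys() & t.keys()):
            yield from emit_join({**u, **t}, rest)

# A tuple of the Cartesian product and two broadcast signature-path relations.
u = {"C": 7, "D": 0}
R_f = [{"A": 2, "B": 12}, {"A": 3, "B": 13}]
R_g = [{"A": 2, "C": 7}, {"A": 3, "C": 8}]
results = list(emit_join(u, [R_f, R_g]))
```

The recursion depth equals the number of broadcast relations, which is \(O(1)\) because the signature path has \(O(1)\) edges.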
5 Analysis of the Algorithm
This section will establish another main result of the article:
We will actually prove a stronger claim:
Before proving the lemma, let us first clarify how it leads to Theorem 15. First, if G is not reduced, we can convert Q into another query with the same result whose schema graph is reduced, which can be done with load \(O(N/p)\) using algorithms from [14] and [17]. If, on the other hand, G is reduced, the Q-induced load L of \(\mathcal {C}\) is at most \(N / p^{1/|\mathcal {C}|}\) (as discussed in Section 4.1). As can be seen from (19), \(|\mathcal {C}|\) equals the size of \(\mathcal {F}\), which by Lemma 8 is \(\rho ^*\). Therefore, \(L = O(N / p^{1/\rho ^*})\) and Theorem 15 follows.
The rest of the section serves as a proof of Lemma 16. All the notations below follow those in Section 4. Our proof is via induction on the number of participating attributes (i.e., \(|V|\)) and the number of participating relations (i.e., \(|Q|\)). If \(|Q| = 1\), the lemma trivially holds. If \(|V| = 1\), Q has only one relation (because Q is reduced) and the lemma again holds trivially. Next, assuming that the lemma holds on any query with either strictly fewer participating attributes or strictly fewer participating relations than Q, we will prove the lemma’s correctness on Q. Our analysis will answer three questions:
(1)
Why do we have enough machines to handle all configurations in parallel? In particular, we must show that \(\sum _\eta p_\eta \le p\), where \(p_\eta\) is the number of machines allocated to \(\eta\), as is given in (28).
(2)
Why does each step in Sections 4.3 and 4.4 entail a load of \(O(L)\)?
(3)
Why do we have \(\overline{p_\eta } \cdot \prod _{z \in Z} p_{\eta ,z} \le p_\eta\) in Step 3 of Section 4.4?
Settling these questions will complete the proof of Lemma 16.
Remark on Memory Usage. As a corollary of Theorem 15, our algorithm utilizes \(O(N/p^{1/\rho ^*})\) words of memory on each machine. More specifically, each machine receives in total \(O(N/p^{1/\rho ^*})\) “atom tuples” from the relations of Q, i.e., a subset of each relation in Q. Then, the machine locally computes the join induced by those subsets. Such computation can be done with no extra memory asymptotically\(^7\); recall that the join result is output by emission, rather than physically stored.
5.1 Total Number of Machines for All Configurations
It suffices to prove \(\sum _\eta p_\eta = O(p)\) because adjusting the hidden constants will then ensure \(\sum _\eta p_\eta \le p\). For every \(k \in [|\mathcal {C}|]\), we will show
Now, fix k to an arbitrary integer in \([|\mathcal {C}|]\). For any configuration \(\eta\), the schema graph of \(Q_\eta\) is always G (i.e., same as the schema graph of Q). Consider an arbitrary k-group K of \(\mathcal {C}\) (the concept of k-group was defined in Section 3.2). The \(Q_\eta\)-product of K, defined in (24), is \(\prod _{e \in K} |R(e, \eta)|\). Given any K, we will prove
It remains to prove (35). Let us first consider the case where \(K \cap \textrm {sigpath}(f_\textrm {anc}, T) \ne \emptyset\), namely, K has an edge \(e_0\) picked from the cluster \(\textrm {sigpath}(f_\textrm {anc}, T)\). In this case, we have:
For each \(e \in K \setminus \lbrace e_0\rbrace\), obviously \(|R(e, \eta)| \le |R_e|\). Regarding \(e_0\), because \(A_\textrm {anc}\) must be an attribute of \(e_0\), the relations \(R(e_0, \eta)\) of all the configurations \(\eta\) form a partition of \(R_{e_0}\).\(^8\) Hence:
Therefore, the left-hand side of (35) is bounded by \(\left(1/L^k\right) \cdot P_k(Q, \mathcal {C})\), where \(P_k(Q, \mathcal {C})\) is the max-\((k,Q)\)-product of \(\mathcal {C}\). This is at most p by definition of L (recall that L is the Q-induced load of \(\mathcal {C}\), defined in (26)).
Next, we consider \(K \cap \textrm {sigpath}(f_\textrm {anc}, T) = \emptyset\). In this case, we must have \(k = |K| \le |\mathcal {F}| - 1\), because the edges in K need to come from distinct clusters of \(\mathcal {C}\), and \(\mathcal {C}\) has \(|\mathcal {F}|\) clusters (one of them is \(\textrm {sigpath}(f_\textrm {anc}, T)\), which now must be excluded). We can derive:
which is at most p. This completes the proof of \(\sum _\eta p_\eta = O(p)\).
5.2 Heavy \(Q_\eta\)
This subsection will prove that the algorithm in Section 4.3 has load \(O(L)\). Steps 2 and 5 demand no communication. The following discussion focuses on the other steps.
Let us start with a technical lemma that will be useful later. Recall that G is the schema graph of query Q (and \(Q_\eta\)), T is an edge tree of G, and \(\mathcal {C}\) is the T-clustering of G. For any \(k \in [|\mathcal {C}|]\), \(P_k(Q_\eta , \mathcal {C})\) is the max-\((k,Q_\eta)\)-product of \(\mathcal {C}\), defined in (25). Our technical lemma is:
We now continue the analysis of the algorithm in Section 4.3. The loads of Steps 1 and 3 are both bounded\(^9\) by
where the second equality applied the definition of \(P_1(Q_\eta , \mathcal {C})\) in (25) — note that the \(Q_\eta\)-product of a 1-group K is merely the maximum size of the relations \(R(e, \eta)\) of all \(e \in K\) — and the third inequality applied Lemma 17.
For analyzing Step 4, let us first recall that, after removing the anchor attribute \(A_\textrm {anc}\), we convert G into residual hypergraph \(G^{\prime }\) and T into an edge tree \(T^{\prime }\) of \(G^{\prime }\). Then, Step 3 cleanses \(G^{\prime }\) into a reduced hypergraph \(G^*\) and, accordingly, converts \(T^{\prime }\) into an edge tree \(T^*\) of \(G^*\). Let \(\mathcal {C}^*\) be the \(T^*\)-clustering of \(G^*\). The discussion in Section 3.3 tells us \(|\mathcal {C}^*| \le |\mathcal {C}|\), where as mentioned before \(\mathcal {C}\) is the T-clustering of G. By the definition in (26), the \(Q^*_\eta\)-induced load of \(\mathcal {C}^*\) is
where \(P_k(Q^*_\eta ,\mathcal {C}^*)\) is the max-\((k,Q^*_\eta)\)-product of \(\mathcal {C}^*\) (defined in (25)). By our inductive assumption (that Lemma 16 holds on \(Q^*_\eta\)), Step 4 incurs load \(O(L^*_\eta)\). Next, we will argue that \(O(L^*_\eta) = O(L)\).
Applying the above lemma to (37), we now have \(L^*_\eta \le \max _{k=1}^{|\mathcal {C}^*|} (\frac{P_k(Q_\eta ,\mathcal {C})}{p_\eta })^{1/k}\), which is \(O(L)\) by Lemma 17.
5.3 Light \(Q_\eta\)
This subsection will concentrate on the algorithm of Section 4.4.
Load. Step 1 incurs load \(O(L)\) (same analysis as in Section 4.3). Step 2 also requires a load of \(O(L)\) because every broadcast relation has a size of at most L. Step 4 needs no communication.
For analyzing Step 3, let us recall that, at this moment, we have removed all the edges in the signature path \(\textrm {sigpath}(f_\textrm {anc}, T)\) from the schema graph G of Q. This yields set Z, defined in (21). For each edge \(z \in Z\), we have obtained a sub-query \(Q^*_{\eta , z}\), whose schema graph \(G^*_z\) has an edge tree \(T^*_z\), which defines a \(T^*_z\)-clustering \(\mathcal {C}^*_z\) of \(G^*_z\). In addition, we have also obtained another sub-query \(\overline{Q^*_\eta }\), whose schema graph \(\overline{G^*}\) has an edge tree \(\overline{T^*}\), which defines a \(\overline{T^*}\)-clustering \(\overline{\mathcal {C}^*}\) of \(\overline{G^*}\).
Let us first consider \(\overline{Q^*_\eta }\). The \(\overline{Q^*_\eta }\)-induced load of \(\overline{\mathcal {C}^*}\) is
where \(P_k(\overline{Q^*_\eta },\overline{\mathcal {C}^*})\) is the max-\((k,\overline{Q^*_\eta })\)-product of \(\overline{\mathcal {C}^*}\) (see the definition in (25)). Regarding the join \(Q^*_{\eta ,z}\) for each \(z \in Z\), the \(Q^*_{\eta ,z}\)-induced load of \(\mathcal {C}^*_z\) is
where \(P_k(Q^*_{\eta ,z},\mathcal {C}^*_z)\) is the max-\((k, Q^*_{\eta ,z})\)-product of \(\mathcal {C}^*_z\). By our inductive assumption—namely, Lemma 16 holds on \(\overline{Q^*_\eta }\) and the \(Q^*_{\eta ,z}\) of each \(z \in Z\) —we know:
—
evaluating \(\overline{Q^*_\eta }\) with \(\overline{p_\eta }\) machines requires load \(O(\overline{L^*_\eta })\), which is \(O(L)\) given the \(\overline{p_\eta }\) in (32), following an argument similar to that used to prove (37) is \(O(L)\);
—
evaluating \(Q^*_{\eta , z}\) of any \(z \in Z\) with \(p_{\eta ,z}\) machines requires load \(O(L^*_{\eta ,z})\), which is \(O(L)\) given the \(p_{\eta ,z}\) in (31), again following an argument similar to that used to prove (37) is \(O(L)\).
Thus, the Cartesian product at Step 3 can be computed with load \(O(L)\), as explained in Section 4.4.
Number of Machines in Step 3. To establish Lemma 16, it remains to prove that \(\overline{p_\eta } \cdot \prod _{z \in Z} p_{\eta ,z} \le p_\eta\) always holds in Step 3. It suffices to show \(\overline{p_\eta } \cdot \prod _{z \in Z} p_{\eta ,z} = O(p_\eta)\) because we can then adjust the constants to ensure \(\overline{p_\eta } \cdot \prod _{z \in Z} p_{\eta ,z} \le p_\eta\).
Fix an arbitrary \(z \in Z\). The root of \(T^*_z\) — denoted as \(e_\mathit {root}\) — must belong to \(\textrm {sigpath}(f_\textrm {anc}, T)\). Recall that a k-group K of \(\mathcal {C}^*_z\) takes edges from distinct clusters in \(\mathcal {C}^*_z\). Call K a
—
non-root k-group of \(\mathcal {C}^*_z\) if \(e_\mathit {root}\notin K\), or
—
root k-group of \(\mathcal {C}^*_z\), otherwise.
A non-root k-group K must have a size \(|K| \le |\mathcal {C}^*_z| - 1\) because \(e_\mathit {root}\) makes a cluster in \(\mathcal {C}^*_z\).
For each integer \(k\) with \(0 \le k \le |\mathcal {C}^*_z|\), define
\begin{equation*} P^\mathit {non}_k(Q^*_{\eta ,z},\mathcal {C}^*_z) = \left\lbrace \begin{array}{ll} 1 & \text{if } k = 0 \\ \text{max-}(k,Q^*_{\eta , z})\text{-product of all the non-root } k\text{-groups of } \mathcal {C}^*_z & \text{if } 1 \le k \le |\mathcal {C}^*_z| - 1 \\ -\infty & \text{if } k = |\mathcal {C}^*_z| \end{array} \right. \end{equation*}
where the second equality used the fact that \(P^\mathit {non}_k(Q^*_{\eta ,z},\mathcal {C}^*_z) = -\infty\) for \(k = |\mathcal {C}^*_z|\).
We are now ready to prove \(\overline{p_\eta } \cdot \prod _{z \in Z} p_{\eta ,z} = O(p_\eta)\). For each \(z \in Z\), the value \(p_{\eta ,z}\) in (40) is either \(\Theta (\max _{k=1}^{|\mathcal {C}^*_z|-1} \frac{P^\mathit {non}_k(Q^*_{\eta ,z},\mathcal {C}^*_z)}{L^k})\) or \(\Theta (1)\). Depending on which case it is, we define integer \(k_z\) and a set \(K_z\) of edges differently, as explained next:
—
If \(p_{\eta ,z} = \Theta (\frac{P^\mathit {non}_k(Q^*_{\eta ,z},\mathcal {C}^*_z)}{L^k})\) for some \(k \in [|\mathcal {C}^*_z|-1]\), then
In this case, define the \(Q^*_{\eta ,z}\)-product of \(K_z\) to be 1.
The above definitions of \(k_z\) and \(K_z\) guarantee \(p_{\eta ,z} = \Theta (\frac{\text{$Q^*_{\eta ,z}$-product of $K_z$}}{L^{k_z}})\) in all cases.
In the same fashion, concerning the value \(\overline{p_\eta }\) in (32), we define integer \(\overline{k}\) and a set \(\overline{K}\) of edges as follows:
—
If \(\overline{p_\eta } = \Theta (\frac{P_k(\overline{Q^*_\eta },\overline{\mathcal {C}^*})}{L^k})\) for some \(k \in [|\overline{\mathcal {C}^*}|]\), then
In this case, define the \(\overline{Q^*_\eta }\)-product of \(\overline{K}\) to be 1.
The above definitions of \(\overline{k}\) and \(\overline{K}\) guarantee \(\overline{p}_{\eta } = \Theta (\frac{\text{$\overline{Q^*_{\eta }}$-product of $\overline{K}$}}{L^{\overline{k}}})\) in all cases.
If \(K_\mathit {super}\ne \emptyset\), then \(K_\mathit {super}\) is a super-\(|K_\mathit {super}|\)-group (see Section 3.4 for the definition of “super-k-group”). By Lemma 14, \(K_\mathit {super}\) is a \(|K_\mathit {super}|\)-group of T. We thus have:
where the last equality used the definition of \(p_\eta\) in (28).
We have shown that \(\overline{p_\eta } \cdot \prod _{z \in Z} p_{\eta ,z} = O(p_\eta)\) holds in all cases. This completes the whole proof of Lemma 16.
6 Concluding Remarks
In this article, we have disproved the existence of any tuple-based algorithm that can evaluate an arbitrary join query in the MPC model with load \(\tilde{O}(N / p^{1/\rho ^*})\), where N is the query’s input size, \(\rho ^*\) is the query’s fractional edge covering number, and p is the number of machines. Specifically, we have established a new load lower bound of \(\Omega (N / p^{1/\tau ^*})\) for an instance of boat joins (see Figure 2 for the schema graph of such joins), where \(\tau ^* = 3\) is the fractional edge packing number for such joins, and their \(\rho ^*\) value equals 2. We can actually make the gap between \(\tilde{O}(N / p^{1/\rho ^*})\) and \(\Omega (N / p^{1/\tau ^*})\) arbitrarily large, by adapting our argument to a class of “generalized boat joins”, defined as follows. Fix any constant integer \(k \ge 3\). The schema graph G of a generalized boat join has 2k attributes \(X_1, X_2, \ldots , X_k\) and \(Y_1, Y_2, \ldots , Y_k\), and \(k + 2\) edges: \(\lbrace X_1, X_2, \ldots , X_k\rbrace\), \(\lbrace Y_1, Y_2, \ldots , Y_k\rbrace\), and \(\lbrace X_i, Y_i\rbrace\) for every \(i \in [k]\). G has a fractional edge covering number \(\rho ^* = 2\) and yet a fractional edge packing number \(\tau ^* = k\). It is not difficult to modify our argument to prove that, for every k, \(\Omega (N / p^{1/\tau ^*}) = \Omega (N / p^{1/k})\) is a load lower bound on at least one instance of generalized boat joins.
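The values \(\rho ^* = 2\) and \(\tau ^* = k\) for generalized boat joins can be checked with explicit certificates. The Python sketch below is only an illustration and relies on the standard linear-programming definitions: it builds the schema graph described above, exhibits a fractional edge cover of weight 2 and a fractional edge packing of weight k, and verifies matching dual solutions (vertex weights), so that both values are optimal by LP duality. The function names and the particular dual weights are our own choices, not taken from the article.

```python
from fractions import Fraction

def generalized_boat(k):
    """Vertices and edges of the generalized boat join on X_1..X_k, Y_1..Y_k."""
    xs = [f"X{i}" for i in range(1, k + 1)]
    ys = [f"Y{i}" for i in range(1, k + 1)]
    edges = [frozenset(xs), frozenset(ys)]
    edges += [frozenset({x, y}) for x, y in zip(xs, ys)]
    return xs + ys, edges

def vertex_loads(vertices, edges, w):
    """Total edge weight received by each vertex."""
    return [sum(w[e] for e in edges if v in e) for v in vertices]

def certificates(k):
    vertices, edges = generalized_boat(k)
    big, pairs = edges[:2], edges[2:]
    # rho* <= 2: weight 1 on the two large edges covers every vertex.
    cover = {e: Fraction(int(e in big)) for e in edges}
    # rho* >= 2: vertex weights whose total on every edge is at most 1.
    vw = {v: Fraction(0) for v in vertices}
    vw["X1"] = vw["Y2"] = Fraction(1)
    rho_ok = (all(x >= 1 for x in vertex_loads(vertices, edges, cover))
              and all(sum(vw[v] for v in e) <= 1 for e in edges)
              and sum(vw.values()) == sum(cover.values()))
    # tau* >= k: weight 1 on each edge {X_i, Y_i} overloads no vertex.
    packing = {e: Fraction(int(e in pairs)) for e in edges}
    # tau* <= k: weight 1/2 per vertex gives every edge a total of at least 1.
    half = {v: Fraction(1, 2) for v in vertices}
    tau_ok = (all(x <= 1 for x in vertex_loads(vertices, edges, packing))
              and all(sum(half[v] for v in e) >= 1 for e in edges)
              and sum(half.values()) == sum(packing.values()))
    return rho_ok, sum(cover.values()), tau_ok, sum(packing.values())
```

Calling `certificates(k)` for any \(k \ge 3\) returns the two verified optima, namely 2 for the covering number and k for the packing number.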
Boat joins, as well as their generalized counterparts, have cyclic schema graphs. We have shown that cyclicity is indeed what prevents us from guaranteeing a load of \(\tilde{O}(N / p^{1/\rho ^*})\). For that purpose, we have presented an algorithm that can evaluate any acyclic join with load \(O(N / p^{1/\rho ^*})\) — without any polylogarithmic factor—which matches the well-known lower bound of \(\Omega (N / p^{1/\rho ^*})\) and is therefore asymptotically optimal. Our algorithm is made possible by canonical edge cover, a new mathematical structure of acyclic hypergraphs that we discover in this article. Every acyclic hypergraph has a canonical edge cover, which constitutes an integral optimal fractional edge covering and has many interesting properties useful for algorithm design.
An intriguing open problem left behind by this article is whether (cyclic) join evaluation in MPC can be fully characterized by putting together the fractional edge covering number \(\rho ^*\) and fractional edge packing number \(\tau ^*\) of a join. Our lower bound argument does not rule out an algorithm with load \(\tilde{O}(N / p^{1/\chi ^*})\), where \(\chi ^* = \max \lbrace \rho ^*, \tau ^*\rbrace\). In fact, all the generalized boat joins can be evaluated with load \(\tilde{O}(N / p^{1/\tau ^*})\) using algorithms from [19] and [27]. Unfortunately, the algorithms in [19] and [27] fail to achieve load \(\tilde{O}(N / p^{1/\chi ^*})\) for arbitrary joins.
Footnotes
1
A formal definition of \(\rho ^*\) will appear in Section 1.1. For our discussion here, it suffices to understand \(\rho ^*\) as a value at least 1.
2
The algorithm proposed in [23] achieves this time complexity by using a preprocessing step that creates perfect-hashing data structures on the input relations. This step takes \(O(N)\) expected time. Without preprocessing, the algorithm has an expected time complexity of \(O(N^{\rho ^*})\) or a worst-case time complexity of \(O(N^{\rho ^*} \log N)\).
3
The lower bound of [14] holds even on algorithms that are not tuple-based.
4
In case the reader is wondering, the literature uses the words “covering” and “cover” exactly the way they are used in our article.
5
This is because otherwise \(|S^{\prime }_\texttt {A}| \cdot (k_\texttt {B} + 1)\), which is at most \(|S^{\prime }_\texttt {A}| \cdot (2k_\texttt {B})\), would be at most 8L, contradicting the definition of \(k_\texttt {B}\).
6
Namely, \(e_i\) cannot be the root of \(T^*_z\).
7
As CPU time is free in our model, one can compute the join on those subsets using a nested loop. For better practical efficiency, one can apply any of the join algorithms [5, 18, 21, 22, 23, 24, 25, 33] in RAM, which all require memory of the same order as the input size.
8
The \(R(e_0, \eta)\) of all \(\eta\) are mutually disjoint and their union equals \(R_{e_0}\).
9
Step 3 requires \(O(1)\) semi-joins, each of which can be performed by sorting. For sorting in the MPC model, see Section 2.2.1 of [14]. The stated bound for Steps 1 and 3 requires the assumption \(p \le N^{1-\epsilon }\) introduced in Section 1.1.
10
In the special case where \(e_\textrm {big}\) is the first in \(\sigma _0\), define \(e_\textrm {before}= \mathit {nil}\) with \(F_0(e_\textrm {before}) = F_1(e_\textrm {before}) = \emptyset\).
A Chernoff Bounds
Let \(X_1, X_2, \ldots , X_t\) be \(t \ge 1\) independent Bernoulli random variables such that \(\mathbf {Pr}[X_i = 1]\) is the same for all \(i \in [1, t]\) (hence, so is \(\mathbf {Pr}[X_i = 0]\)). Let \(X = \sum _{i=1}^t X_i\) and \(\mu = \mathbf {E}[X]\). For any \(0 \lt \gamma \le 1\), it holds that
We will first prove that \(\mathcal {F}^{\prime }\) is the CEC of \(G^{\prime }\) induced by \(T^{\prime }\) by discussing in Section B.1 the scenario where \(\mathit {map}(f_\textrm {anc}) = f_\textrm {anc}\setminus \lbrace A_\textrm {anc}\rbrace\) is a subsumed edge in \(G^{\prime }\) and in Section B.2 the scenario where \(\mathit {map}(f_\textrm {anc})\) is not subsumed in \(G^{\prime }\). Then, Section B.3 will explain why \(\mathcal {F}^{\prime }\) cannot contain any subsumed edge of \(G^{\prime }\).
B.1 \(\mathcal {F}^{\prime }\) is the CEC of \(G^{\prime }\): the Scenario Where \(\mathit {map}(f_\textrm {anc})\) Is Subsumed
Let \(\hat{e}\) be the parent of \(f_\textrm {anc}\) in T. As \(\mathit {map}(f_\textrm {anc})\) is subsumed in \(G^{\prime }\), \(\mathit {map}(f_\textrm {anc})\) must be a subset of \(\mathit {map}(\hat{e})\). This implies \(A_\textrm {anc}\notin \hat{e}\) (otherwise, \(f_\textrm {anc}\subseteq \hat{e}\) and G is not reduced). Because \(A_\textrm {anc}\) needs to appear in all the nodes of \(\textrm {sigpath}(f_\textrm {anc}, T)\), \(A_\textrm {anc}\notin \hat{e}\) indicates that \(\hat{e} \notin \textrm {sigpath}(f_\textrm {anc}, T)\) and thus \(\textrm {sigpath}(f_\textrm {anc}, T)\) has only a single node \(f_\textrm {anc}\). It thus follows that \(\hat{e} \in \mathcal {F}\) and \(A_\textrm {anc}\) is an exclusive attribute in \(f_\textrm {anc}\).
To show that \(\mathcal {F}^{\prime } = \mathcal {F}\setminus \lbrace f_\textrm {anc}\rbrace\) is the CEC of \(G^{\prime }\) induced by \(T^{\prime }\), it suffices to prove that \(\mathcal {F}^{\prime }\) is the output of edge-cover\((T^{\prime })\) on an arbitrary reverse topological order of \(T^{\prime }\) (Lemma 8 tells us that the output is not sensitive to the reverse topological order). For this purpose, consider \(\sigma _0\) as any reverse topological order of T where \(\hat{e}\) succeeds \(f_\textrm {anc}\) (i.e., \(\hat{e}\) is the immediate successor of \(f_\textrm {anc}\) in \(\sigma _0\)). Let \(\sigma _1\) be the sequence obtained by removing \(f_\textrm {anc}\) from \(\sigma _0\); \(\sigma _1\) must be a reverse topological order of \(T^{\prime }\). Let \(e_\textrm {before}\) be the node preceding \(f_\textrm {anc}\) in \(\sigma _0\) (i.e., \(e_\textrm {before}\) is the immediate predecessor of \(f_\textrm {anc}\) in \(\sigma _0\)) and hence preceding \(\hat{e}\) in \(\sigma _1\); define \(e_\textrm {before}= \mathit {nil}\) if \(f_\textrm {anc}\) is the first in \(\sigma _0\).
Let us compare the execution of edge-cover\((T)\) on \(\sigma _0\) to that of edge-cover\((T^{\prime })\) on \(\sigma _1\). The two executions are identical till the moment right after \(e_\textrm {before}\) has been processed (by Line 4 of edge-cover). By the fact that edge-cover\((T)\) adds \(\hat{e}\) to \(F_{{\mathrm{tmp}}}\) (we have proved earlier \(\hat{e} \in \mathcal {F}\)), \(\hat{e}\) has a disappearing attribute not covered by \(F_{{\mathrm{tmp}}}\) when it is processed. Hence, when \(\hat{e}\) is processed by edge-cover\((T^{\prime })\), it must also have a disappearing attribute not covered by \(F_{{\mathrm{tmp}}}\) and thus is added to \(F_{{\mathrm{tmp}}}\). The rest of the execution of edge-cover\((T)\) is the same as that of edge-cover\((T^{\prime })\) because every non-exclusive attribute of \(f_\textrm {anc}\) is in \(\hat{e}\). Therefore, the output of edge-cover\((T^{\prime })\) is the same as that of edge-cover\((T)\), except that the former does not include \(f_\textrm {anc}\).
B.2 \(\mathcal {F}^{\prime }\) is the CEC of \(G^{\prime }\): The Scenario Where \(\mathit {map}(f_\textrm {anc})\) is Not Subsumed in \(G^{\prime }\)
Let \(\sigma _0 = (e_1, e_2, \ldots , e_{|E|})\) be an arbitrary reverse topological order of T. Define \(e_i^{\prime } = \mathit {map}(e_i) = e_i \setminus \lbrace A_\textrm {anc}\rbrace\) for \(i \in [|E|]\). The sequence \(\sigma _1 = (e_1^{\prime }, e_2^{\prime }, \ldots , e_{|E|}^{\prime })\) is a reverse topological order of \(T^{\prime }\). We will compare the execution of edge-cover\((T)\) on \(\sigma _0\) to that of edge-cover\((T^{\prime })\) on \(\sigma _1\). Define \(F_0(e_i)\) (respectively, \(F_1(e_i^{\prime })\)) as the content of \(F_{{\mathrm{tmp}}}\) after edge-cover\((T)\) (respectively, edge-cover\((T^{\prime })\)) has processed \(e_i\) (respectively, \(e_i^{\prime }\)).
To prove the claim, first note that, because e is a leaf of T and G is reduced, e must have an exclusive attribute X. If edge-cover\((T^{\prime })\) does not add \(e^{\prime }\) to \(F_{{\mathrm{tmp}}}\), \(e^{\prime }\) has no exclusive attributes in \(T^{\prime }\). This implies \(X = A_\textrm {anc}\), which further implies \(f_\textrm {anc}= e\) (otherwise, \(A_\textrm {anc}\) appears in two distinct nodes and thus cannot be exclusive). However, in that case, \(e^{\prime }\) must contain an exclusive attribute in \(T^{\prime }\) (because \(e^{\prime } = \mathit {map}(f_\textrm {anc})\) is not subsumed in \(G^{\prime }\)), thus giving a contradiction.
We prove the claim by induction on i. Because \(e_1\) is a leaf of T, Lemma 8 and Claim 1 guarantee \(e_1 \in F_0(e_1)\) and \(e_1^{\prime } \in F_1(e_1^{\prime })\), respectively. Thus, Claim 2 holds for \(i = 1\).
Next, we prove the correctness on \(i \gt 1\), assuming that it holds on \(e_{i-1}\) and \(e^{\prime }_{i-1}\). The inductive assumption implies that \(F_0(e_{i-1})\) covers an attribute \(X \ne A_\textrm {anc}\) if and only if \(F_1(e_{i-1}^{\prime })\) covers X. If \(e_i \notin F_0(e_i)\), every disappearing attribute of \(e_i\) must be covered by \(F_0(e_{i-1})\). Hence, \(F_1(e_{i-1}^{\prime })\) must cover all the disappearing attributes of \(e_i^{\prime }\) and thus \(e_i^{\prime } \notin F_1(e_i^{\prime })\).
The rest of the proof assumes \(e_i \in F_0(e_i)\), i.e., \(e_i\) has a disappearing attribute X not covered by \(F_0(e_{i-1})\). If \(X \ne A_\textrm {anc}\), X is a disappearing attribute in \(e_i^{\prime }\) not covered by \(F_1(e_{i-1}^{\prime })\) and thus \(e_i^{\prime } \in F_1(e_i^{\prime })\). It remains to discuss the scenario \(X = A_\textrm {anc}\). As \(A_\textrm {anc}\) is disappearing at \(e_i\), \(A_\textrm {anc}\) cannot exist in any proper ancestor of \(e_i\). Thus, \(f_\textrm {anc}\) must be a descendant of \(e_i\). We can assert that \(f_\textrm {anc}= e_i\); otherwise, the leaf \(f_\textrm {anc}\) is processed before \(e_i\) and must exist in \(F_0(e_{i-1})\) (Lemma 8), contradicting the fact that \(A_\textrm {anc}\) is not covered by \(F_0(e_{i-1})\). Then, \(e^{\prime }_i \in F_1(e_i^{\prime })\) follows from Claim 1.
We can now conclude that \(\mathcal {F}^{\prime }\) is always the CEC of \(G^{\prime }\) induced by \(T^{\prime }\).
Consider any subsumed edge \(e^{\prime }\) in \(E^{\prime }\). Define \(e = \mathit {map}^{-1}(e^{\prime })\); we know that e must contain \(A_\textrm {anc}\) (otherwise, e is subsumed in E and G is not reduced). Hence, \(e = e^{\prime } \cup \lbrace A_\textrm {anc}\rbrace\). If \(e = f_\textrm {anc}\), then \(\mathit {map}(f_\textrm {anc}) = \mathit {map}(e) = e^{\prime }\) is subsumed in \(G^{\prime }\), in which case \(e^{\prime } \notin \mathcal {F}^{\prime }\) holds due to the explicit exclusion of \(f_\textrm {anc}\) from \(\mathcal {F}^{\prime }\) as shown in (20).
Next, we consider \(e \ne f_\textrm {anc}\). To prove \(e \notin \mathcal {F}^{\prime }\), by the way \(\mathcal {F}^{\prime }\) is computed in (20), it suffices to show \(e \notin \mathcal {F}\), where \(\mathcal {F}\) is the CEC of G induced by T. Assume, on the contrary, that \(e \in \mathcal {F}\). Let \(\hat{f}\) be the lowest proper ancestor of \(f_\textrm {anc}\) in \(\mathcal {F}\) (here, “ancestor” is defined with respect to T). The definition of \(A_\textrm {anc}\) assures us \(A_\textrm {anc}\notin \hat{f}\). Because \(A_\textrm {anc}\in f_\textrm {anc}\) and \(A_\textrm {anc}\in e\), e must be a proper descendant of \(\hat{f}\) in T (connectedness of acyclicity). By definition of \(f_\textrm {anc}\), \(\hat{f}\) cannot have any non-leaf proper descendant in \(\mathcal {F}\). Hence, e must be a leaf of T.
Because \(A_\textrm {anc}\) appears in two distinct leaves of T (i.e., \(f_\textrm {anc}\) and e), connectedness of acyclicity demands that \(A_\textrm {anc}\) should also exist in the parent \(\hat{e}\) of e. As G is reduced, e must have an attribute X that does not appear in \(\hat{e}\) and thus must be exclusive. It follows that \(X \ne A_\textrm {anc}\). However, in that case, \(e^{\prime } = e \setminus \lbrace A_\textrm {anc}\rbrace\) contains X and thus cannot be subsumed in \(G^{\prime }\) (X remains exclusive in \(G^{\prime }\)), giving a contradiction.
We discuss only the scenario where \(\mathit {map}(f_\textrm {anc})\) is not subsumed in \(G^{\prime }\) (the opposite case is easy and omitted). Our proof will establish a stronger claim:
\(\mathcal {F}^* = \mathcal {F}^{\prime }\) is the CEC of \(G^*\) induced by \(T^*\) every time Line 5 of cleanse is executed.
\(G^* = G^{\prime }\) and \(T^* = T^{\prime }\) at Line 1. \(\mathcal {F}^* = \mathcal {F}^{\prime }\) is the CEC of \(G^*\) induced by \(T^*\) at this moment (Lemma 10). Hence, the claim holds on the first execution of Line 5.
Inductively, assuming that the claim holds currently, we will show that it still does after cleanse deletes the next \(e_\textrm {small}\) from \(G^*\). Let \(G^*_0\) and \(T^*_0\) (respectively, \(G^*_1\) and \(T^*_1\)) be the \(G^*\) and \(T^*\) before (respectively, after) the deletion of \(e_\textrm {small}\). Because \(e_\textrm {small}\) is subsumed in \(G^*\), it is also subsumed in \(G^{\prime }\). By Lemma 10, \(e_\textrm {small}\notin \mathcal {F}^{\prime } = \mathcal {F}^*\).
Case 1: \({\bf {e_\textrm {big}}}\) parents \({\bf {e_\textrm {small}}}\). Let \(\sigma _0\) be a reverse topological order of \(T^*_0\) where \(e_\textrm {big}\) succeeds \(e_\textrm {small}\). As \(\mathcal {F}^*\) is the CEC of \(G^*_0\) induced by \(T^*_0\), edge-cover\((T^*_0)\) produces \(\mathcal {F}^*\) if executed on \(\sigma _0\) (Lemma 8).
Let \(\sigma _1\) be a copy of \(\sigma _0\) but with \(e_\textrm {small}\) removed; \(\sigma _1\) is a reverse topological order of \(T^*_1\). Every node in \(T^*_1\) retains the same disappearing attributes as in \(T^*_0\) (see Figure 4(a)), whereas \(e_\textrm {small}\) has no disappearing attributes. It is easy to verify that running edge-cover\((T^*_1)\) on \(\sigma _1\) has the same output \(\mathcal {F}^*\) as running edge-cover\((T^*_0)\) on \(\sigma _0\).
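The Case 1 argument can be checked on a toy instance with the Python sketch below. It assumes a simplified edge-cover, reconstructed from how the proofs in this section use it: nodes are processed in a reverse topological order, and a node enters \(F_{{\mathrm{tmp}}}\) exactly when one of its disappearing attributes (taken here as the attributes absent from its parent, which relies on the connectedness of the edge tree) is not yet covered. The function name `edge_cover`, the dictionary encoding, and the three-edge instance are illustrative assumptions, not the paper's pseudocode.

```python
def edge_cover(parent, attrs, order):
    """Greedy pass over `order`, a reverse topological order of the edge tree.

    parent: node -> parent node (None for the root)
    attrs:  node -> frozenset of the attributes of that edge
    A node is kept iff one of its disappearing attributes is still uncovered.
    """
    f_tmp, covered = [], set()
    for e in order:
        p = parent[e]
        above = attrs[p] if p is not None else frozenset()
        disappearing = attrs[e] - above  # attributes whose summit is e
        if not disappearing <= covered:
            f_tmp.append(e)
            covered |= attrs[e]
    return f_tmp


attrs = {
    "big": frozenset("ABC"),
    "small": frozenset("BC"),   # subsumed by "big"
    "leaf": frozenset("CD"),
}

# T*_0: e_big parents e_small, which parents a leaf.
parent0 = {"big": None, "small": "big", "leaf": "small"}
sigma0 = ["leaf", "small", "big"]

# T*_1: e_small is deleted and its child is re-attached to e_big.
parent1 = {"big": None, "leaf": "big"}
sigma1 = [e for e in sigma0 if e != "small"]

cover0 = edge_cover(parent0, attrs, sigma0)
cover1 = edge_cover(parent1, attrs, sigma1)
print(cover0, cover1)
```

In this instance \(e_\textrm {small}\) has no disappearing attribute, so it is skipped in the first run, and both runs select the same edges.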
Case 2: \({\bf {e_\textrm {small}}}\) parents \({\bf {e_\textrm {big}}}\). Let \(\sigma _0\) be a reverse topological order of \(T^*_0\) where \(e_\textrm {small}\) succeeds \(e_\textrm {big}\). Let \(\sigma _1\) be a copy of \(\sigma _0\) but with \(e_\textrm {small}\) removed; \(\sigma _1\) is a reverse topological order of \(T^*_1\). We will argue that running edge-cover\((T^*_1)\) on \(\sigma _1\) also returns \(\mathcal {F}^*\).
The reader should note several facts about disappearing attributes. If an attribute has \(e_\textrm {small}\) as the summit in \(T^*_0\), the attribute’s summit in \(T^*_1\) becomes \(e_\textrm {big}\) (see Figure 4(b)). If an attribute has \(e \ne e_\textrm {small}\) as the summit in \(T^*_0\), its summit in \(T^*_1\) is still e. Hence, every node in \(T^*_1\) except \(e_\textrm {big}\) retains the same disappearing attributes as in \(T^*_0\), whereas the disappearing attributes of \(e_\textrm {big}\) in \(T^*_1\) contain those of \(e_\textrm {big}\) and \(e_\textrm {small}\) in \(T^*_0\).
For each node e in \(\sigma _0\) (respectively, \(\sigma _1\)), denote by \(F_0(e)\) (respectively, \(F_1(e)\)) the content of \(F_{{\mathrm{tmp}}}\) after edge-cover\((T^*_0)\) (respectively, edge-cover\((T^*_1)\)) has processed e. Let \(e_\textrm {before}\) be the node before \(e_\textrm {big}\) in \(\sigma _0\). It is easy to see that edge-cover\((T^*_0)\) and edge-cover\((T^*_1)\) behave the same way until finishing with \(e_\textrm {before}\), which gives \(F_0(e_\textrm {before}) = F_1(e_\textrm {before})\). It must hold that \(e_\textrm {small}\notin F_0(e_\textrm {small})\) (otherwise, \(e_\textrm {small}\) would be a subsumed edge in \(\mathcal {F}^*\), contradicting Lemma 10). Two possibilities apply to \(e_\textrm {big}\):
(1) \(e_\textrm {big}\in F_0(e_\textrm {big})\). Hence, \(e_\textrm {big}\) has a disappearing attribute in \(T^*_0\) not covered by \(F_0(e_\textrm {before})\). This means that \(e_\textrm {big}\) also has a disappearing attribute in \(T^*_1\) not covered by \(F_1(e_\textrm {before}) = F_0(e_\textrm {before})\). It follows that \(e_\textrm {big}\in F_1(e_\textrm {big})\), meaning \(F_1(e_\textrm {big}) = F_0(e_\textrm {big}) = F_0(e_\textrm {small})\).
(2) \(e_\textrm {big}\notin F_0(e_\textrm {big})\). All the disappearing attributes of \(e_\textrm {big}\) and \(e_\textrm {small}\) in \(T^*_0\) are covered by \(F_0(e_\textrm {before})\). Hence, the disappearing attributes of \(e_\textrm {big}\) in \(T^*_1\) are covered by \(F_1(e_\textrm {before}) = F_0(e_\textrm {before})\). Therefore, \(e_\textrm {big}\notin F_1(e_\textrm {big})\), meaning \(F_0(e_\textrm {small}) = F_0(e_\textrm {before}) = F_1(e_\textrm {before}) = F_1(e_\textrm {big})\).
We now conclude that \(F_1(e_\textrm {big}) = F_0(e_\textrm {small})\) always holds. Every remaining node in \(\sigma _0\) and \(\sigma _1\) has the same disappearing attributes in \(T^*_0\) and \(T^*_1\). Hence, the rest of the execution of edge-cover\((T^*_0)\) is identical to that of edge-cover\((T^*_1)\).
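The identity \(F_1(e_\textrm {big}) = F_0(e_\textrm {small})\) can be observed on two small instances, one for each possibility. The Python sketch below records the content of \(F_{{\mathrm{tmp}}}\) after every step. It assumes the same simplified edge-cover rule as the proofs use (a node is added when a disappearing attribute of it is uncovered); the name `edge_cover_trace`, the node labels, and both instances are hypothetical.

```python
def edge_cover_trace(parent, attrs, order):
    """Return {e: content of F_tmp right after e has been processed}."""
    f_tmp, covered, trace = [], set(), {}
    for e in order:
        p = parent[e]
        above = attrs[p] if p is not None else frozenset()
        if not (attrs[e] - above) <= covered:  # uncovered disappearing attribute
            f_tmp.append(e)
            covered |= attrs[e]
        trace[e] = tuple(f_tmp)
    return trace


# Possibility (1): e_big is selected. Here e_before is the node "other".
attrs_a = {"top": frozenset("CE"), "other": frozenset("EG"),
           "small": frozenset("BC"), "big": frozenset("ABC")}
F0_a = edge_cover_trace(
    {"top": None, "other": "top", "small": "top", "big": "small"},
    attrs_a, ["other", "big", "small", "top"])
# After deleting e_small, e_big becomes a child of "top".
F1_a = edge_cover_trace(
    {"top": None, "other": "top", "big": "top"},
    attrs_a, ["other", "big", "top"])

# Possibility (2): e_big is not selected. Here e_before is the node "leaf".
attrs_b = {"top": frozenset("CE"), "small": frozenset("BC"),
           "big": frozenset("ABC"), "leaf": frozenset("ABF")}
F0_b = edge_cover_trace(
    {"top": None, "small": "top", "big": "small", "leaf": "big"},
    attrs_b, ["leaf", "big", "small", "top"])
F1_b = edge_cover_trace(
    {"top": None, "big": "top", "leaf": "big"},
    attrs_b, ["leaf", "big", "top"])

for F0, F1 in ((F0_a, F1_a), (F0_b, F1_b)):
    assert F1["big"] == F0["small"]   # the identity argued in Case 2
    assert F1["top"] == F0["top"]     # hence the final outputs agree
```

In both instances \(e_\textrm {small}\) is never selected, in line with Lemma 10.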
We will discuss only the scenario where \(\mathit {map}(f_\textrm {anc})\) is not subsumed (the opposite scenario is easy and omitted).
Departing from acyclic queries, let us consider a more general problem on a rooted tree \(\mathcal {T}\) where (i) every node is colored black or white, and (ii) the root and all the leaves are black. Denote by B the set of black nodes. Each black node \(b \in B\) is associated with a signature path:
—
If b is the root of \(\mathcal {T}\), its signature path contains just b itself.
—
Otherwise, let \(\hat{b}\) be the lowest ancestor of b among all the nodes in B; the signature path of b is the set of nodes on the path from \(\hat{b}\) to b, except \(\hat{b}\).
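As an illustration of this definition, the Python sketch below computes the signature path of every black node by climbing from the node until the lowest black proper ancestor is reached. The function name `signature_paths`, the parent-pointer encoding, and the sample tree are assumptions made for the example.

```python
def signature_paths(parent, black):
    """Map each black node b to its signature path, listed top to bottom.

    parent: node -> parent (None for the root); black: set of black nodes.
    """
    paths = {}
    for b in black:
        path, v = [b], parent[b]
        # Climb until the lowest black proper ancestor (excluded) or the root is passed.
        while v is not None and v not in black:
            path.append(v)
            v = parent[v]
        paths[b] = path[::-1]
    return paths


# Root "r" and the leaves "b1", "b2", "b3" are black; "w1" and "w2" are white.
parent = {"r": None, "w1": "r", "w2": "w1", "b1": "w2", "b3": "w1", "b2": "r"}
black = {"r", "b1", "b2", "b3"}
paths = signature_paths(parent, black)
for b in sorted(paths):
    print(b, paths[b])
```

The root's path contains only the root, while the path of "b1" also collects the white nodes "w1" and "w2" that separate it from "r".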
Fig. 8.
We define four types of contractions:
—
Type 1: We are given two white nodes \(v_1\) and \(v_2\) such that \(v_1\) parents \(v_2\). The contraction removes \(v_2\) from \(\mathcal {T}\) and makes \(v_1\) the new parent for all the child nodes of \(v_2\). See Figure 8(a).
—
Type 2: We are given two white nodes \(v_1\) and \(v_2\) such that \(v_1\) parents \(v_2\). The contraction removes \(v_1\) from \(\mathcal {T}\), makes \(v_2\) the new parent for all the child nodes of \(v_1\), and makes \(v_2\) a child of the original parent of \(v_1\). See Figure 8(b).
—
Type 3: Same as Type 1, except that \(v_1\) is black and \(v_2\) is white. See Figure 8(c).
—
Type 4: Same as Type 2, except that \(v_1\) is white and \(v_2\) is black. See Figure 8(d).
The facts below are evident:
—
The number of black nodes remains the same after a contraction.
—
After a contraction, each signature path either remains the same or shrinks.
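The four contraction types and the two facts can be exercised with the Python sketch below. A single helper `contract(parent, gone, heir)` models all four types: in Types 1 and 3 the removed node is the child \(v_2\), and in Types 2 and 4 it is the parent \(v_1\). The helper names, the sample tree, and the decision to measure "shrinks" by the number of nodes on a path are assumptions of this sketch.

```python
def signature_paths(parent, black):
    """Map each black node to its signature path, listed top to bottom."""
    paths = {}
    for b in black:
        path, v = [b], parent[b]
        while v is not None and v not in black:
            path.append(v)
            v = parent[v]
        paths[b] = path[::-1]
    return paths


def contract(parent, gone, heir):
    """Remove the white node `gone`; `heir` adopts the children of `gone`.

    If `heir` was a child of `gone` (Types 2 and 4), it moves up to the
    former parent of `gone`; otherwise (Types 1 and 3) it stays in place.
    """
    new_parent = {}
    for v, p in parent.items():
        if v == gone:
            continue
        if p == gone:
            new_parent[v] = parent[gone] if v == heir else heir
        else:
            new_parent[v] = p
    return new_parent


parent = {"r": None, "w1": "r", "w2": "w1", "b1": "w2", "b3": "w1",
          "b2": "r", "w3": "b2", "b4": "w3"}
black = {"r", "b1", "b2", "b3", "b4"}

cases = {                      # name -> (gone, heir)
    "type 1": ("w2", "w1"),    # white v1 = w1 parents white v2 = w2; v2 removed
    "type 2": ("w1", "w2"),    # same pair; v1 removed
    "type 3": ("w3", "b2"),    # black v1 = b2 parents white v2 = w3; v2 removed
    "type 4": ("w2", "b1"),    # white v1 = w2 parents black v2 = b1; v1 removed
}

before = signature_paths(parent, black)
results = {}
for name, (gone, heir) in cases.items():
    after = signature_paths(contract(parent, gone, heir), black)
    results[name] = after
    # Fact 1: the set of black nodes is untouched.
    assert after.keys() == before.keys()
    # Fact 2, read as: no signature path gains nodes.
    assert all(len(after[b]) <= len(before[b]) for b in black)
```

Each contraction is applied to the original tree independently, so the four results can be compared against the same starting set of signature paths.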
We now draw a correspondence between a contraction and an edge deletion in cleanse. \(\mathcal {T}\) corresponds to the current edge tree \(T^*\) in cleanse. The set B of black nodes equals \(\mathcal {F}^* = \mathcal {F}^{\prime }\) for the entire execution of cleanse. The set \(\lbrace v_1, v_2\rbrace\) corresponds to \(\lbrace e_\textrm {small}, e_\textrm {big}\rbrace\). As shown in Lemma 10, \(e_\textrm {small}\) cannot exist in \(\mathcal {F}^*\) and thus cannot correspond to a black node. If we denote by \(\mathcal {C}\) (respectively, \(\mathcal {C}^*\)) the set of signature paths at the beginning (respectively, end) of cleanse, each signature path in \(\mathcal {C}^*\) is obtained by repeatedly shrinking a distinct signature path in \(\mathcal {C}\). This implies Lemma 12, noting that \(\mathcal {C}= \lbrace \textrm {sigpath}(f, T) \mid f \in \mathcal {F}\rbrace\) and \(\mathcal {C}^* = \lbrace \textrm {sigpath}(f^*, T^*) \mid f^* \in \mathcal {F}^*\rbrace\).
We will first prove that, for any \(z \in Z\), \(\mathcal {F}^*_z\) is the CEC of \(G^*_z\) induced by \(T^*_z\). Let \(\hat{z}\) be the parent of z. Recall that \(\mathcal {F}\) is the CEC of G induced by T. Consider a reverse topological order \(\sigma _z\) of T satisfying the following condition: a prefix of \(\sigma _z\) is a permutation of the nodes in the subtree of T rooted at z. In other words, in \(\sigma _z\), every node in the aforementioned subtree must rank before every node outside the subtree. Define \(\sigma ^*_z\) to be the sequence obtained by deleting from \(\sigma _z\) all the nodes e such that \(e \ne \hat{z}\) and e is outside the subtree of T rooted at z. It is clear that \(\sigma ^*_z\) is a reverse topological order of \(T^*_z\).
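One way to obtain such orders is sketched below in Python: list the subtree of z in post-order, then append the post-order of the remaining nodes. The names `post_order` and `sigma_z`, the child-list encoding, and the sample tree are hypothetical and serve only to make the construction of \(\sigma _z\) and \(\sigma ^*_z\) concrete.

```python
def post_order(children, root):
    """List the subtree rooted at `root` with every child before its parent."""
    out = []
    for c in children.get(root, []):
        out.extend(post_order(children, c))
    out.append(root)
    return out


def sigma_z(children, root, z):
    """Return (prefix, order): a reverse topological order of the whole tree
    whose prefix is a permutation of the subtree rooted at z."""
    prefix = post_order(children, z)
    inside = set(prefix)
    rest = [v for v in post_order(children, root) if v not in inside]
    return prefix, prefix + rest


# Sample tree: "zhat" is the parent of z.
children = {"r": ["a", "zhat"], "zhat": ["z", "s"], "z": ["x", "y"]}
parent = {c: p for p, cs in children.items() for c in cs}

prefix, sigma = sigma_z(children, "r", "z")
# sigma*_z keeps the subtree of z plus the parent of z, in the same order.
sigma_star = [e for e in sigma if e in set(prefix) or e == parent["z"]]

# Both sequences list every node before its parent.
for order in (sigma, sigma_star):
    pos = {v: i for i, v in enumerate(order)}
    assert all(pos[v] < pos[parent[v]] for v in order
               if v in parent and parent[v] in pos)
print(sigma, sigma_star)
```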
Let us compare the execution of edge-cover\((T)\) on \(\sigma _z\) to that of edge-cover\((T^*_z)\) on \(\sigma ^*_z\). They are exactly the same until z has been processed. Hence, every node in the \(F_{{\mathrm{tmp}}}\) of edge-cover\((T)\) at this moment must have been added to \(F_{{\mathrm{tmp}}}\) by edge-cover\((T^*_z)\). This means that all the nodes in \(\mathcal {F}^*_z\), except \(\hat{z}\), must appear in the final \(F_{{\mathrm{tmp}}}\) output by edge-cover\((T^*_z)\). Finally, this \(F_{{\mathrm{tmp}}}\) must also contain \(\hat{z}\) due to Lemma 8 (notice that \(\hat{z}\) is a raw leaf of \(T^*_z\)). This shows that \(\mathcal {F}^*_z\) is the CEC of \(G^*_z\) induced by \(T^*_z\).
Next, we prove that \(\overline{\mathcal {F}^*}\) is the CEC of \(\overline{G^*}\) induced by \(\overline{T^*}\). Let \(\overline{e}\) be the highest node in \(\textrm {sigpath}(f_\textrm {anc},T)\). Consider a reverse topological order \(\overline{\sigma }\) of T satisfying the following condition: a prefix of \(\overline{\sigma }\) is a permutation of the nodes in the subtree of T rooted at \(\overline{e}\). Define \(\overline{\sigma ^*}\) to be the sequence obtained by deleting that prefix from \(\overline{\sigma }\). It is clear that \(\overline{\sigma ^*}\) is a reverse topological order of \(\overline{T^*}\). Define \(\hat{\overline{e}}\) to be the parent of \(\overline{e}\) in T. Note that \(\hat{\overline{e}}\) must belong to \(\mathcal {F}\) due to the definitions of \(\overline{e}\) and \(\textrm {sigpath}(f_\textrm {anc}, T)\).
We will compare the execution of edge-cover\((T)\) on \(\overline{\sigma }\) to that of edge-cover\((\overline{T^*})\) on \(\overline{\sigma ^*}\). For each e in \(\overline{\sigma }\), define \(F_0(e)\) as the content of \(F_{{\mathrm{tmp}}}\) after edge-cover\((T)\) has finished processing e. Similarly, for each e in \(\overline{\sigma ^*}\), define \(F_1(e)\) as the content of \(F_{{\mathrm{tmp}}}\) after edge-cover\((\overline{T^*})\) has finished processing e. Divide \(\overline{\sigma }\) into three segments: (i) \(\sigma _1\), which includes the prefix of \(\overline{\sigma }\) ending at (and including) \(\overline{e}\), (ii) \(\sigma _2\), which starts right after \(\sigma _1\) and ends at (and includes) \(\hat{\overline{e}}\), and (iii) \(\sigma _3\), which is the rest of \(\overline{\sigma }\). Note that \(\overline{\sigma ^*}\) is the concatenation of \(\sigma _2\) and \(\sigma _3\).
We prove Claim 1 by induction. As the base case, consider e as the first element in \(\sigma _2\). In \(\overline{T^*}\), e must be a leaf and, by Lemma 8, must be in \(F_1(e)\). In T, e is either a leaf or \(\hat{\overline{e}}\). In the former case, Lemma 8 assures us \(e \in F_0(e)\). In the latter case, e is also in \(F_0(e)\) because \(\hat{\overline{e}} \in \mathcal {F}\).
Next, we prove the claim on every other node e in \(\sigma _2\), assuming the claim’s correctness on the node \(e_\textrm {before}\) preceding e in \(\sigma _2\). This inductive assumption implies \(F_1(e_\textrm {before}) \subseteq F_0(e_\textrm {before})\). If \(e \in F_0(e)\), then e has a disappearing attribute X not covered by \(F_0(e_\textrm {before})\). As \(F_1(e_\textrm {before}) \subseteq F_0(e_\textrm {before})\), \(F_1(e_\textrm {before})\) does not cover X, either. Hence, edge-cover\((\overline{T^*})\) adds e to \(F_{{\mathrm{tmp}}}\), namely, \(e \in F_1(e)\).
Let us now focus on the case where \(e \in F_1(e)\). If \(e = \hat{\overline{e}}\), the fact \(\hat{\overline{e}} \in \mathcal {F}\) indicates \(e \in F_0(e)\). Next, we consider \(e \ne \hat{\overline{e}}\), meaning that e is a proper descendant of \(\hat{\overline{e}}\). The fact \(e \in F_1(e)\) suggests that e has a disappearing attribute X not covered by \(F_1(e_\textrm {before})\). If \(e \notin F_0(e)\), \(F_0(e_\textrm {before})\) must have a node \(e^{\prime }\) containing X. Node \(e^{\prime }\) must come from \(\sigma _1\) (the inductive assumption prohibits \(e^{\prime }\) from appearing in \(\sigma _2\)) and hence must be a descendant of \(\overline{e}\). By acyclicity’s connectedness requirement, X appearing in both e and \(e^{\prime }\) means that X must belong to \(\hat{\overline{e}}\). But this contradicts X disappearing at e. We thus conclude that \(e \in F_0(e)\).
Claim 1 assures us that \(F_1(\hat{\overline{e}}) \subseteq F_0(\hat{\overline{e}})\). Note also that \(\hat{\overline{e}}\) belongs to \(F_0(\hat{\overline{e}})\) (as explained before, \(\hat{\overline{e}} \in \mathcal {F}\)) and hence also to \(F_1(\hat{\overline{e}})\) (Claim 1). Any node \(e^{\prime } \in F_0(\hat{\overline{e}}) \setminus F_1(\hat{\overline{e}})\) must appear in the subtree rooted at \(\hat{\overline{e}}\) in T, whereas any node e in \(\sigma _3\) must be outside that subtree. By acyclicity’s connectedness requirement, if \(e^{\prime }\) contains an attribute X in e, then \(X \in \hat{\overline{e}}\) for sure. This means that \(F_1(\hat{\overline{e}})\) covers a disappearing attribute of e if and only if \(F_0(\hat{\overline{e}})\) does so. Therefore, edge-cover\((\overline{T^*})\) processes each node of \(\sigma _3\) in the same way as edge-cover\((T)\). This proves the correctness of Claim 2.
By putting Claims 1 and 2 together, we conclude that edge-cover\((\overline{T^*})\) returns all and only the nodes in \(\sigma _2 \cup \sigma _3\) output by edge-cover\((T)\). Therefore, the output of edge-cover\((\overline{T^*})\) is \(\mathcal {F}\cap \overline{E^*} = \overline{\mathcal {F}^*}\).
For any \(z \in Z\) and any \(f^* \in \mathcal {F}^*_z\) that is not the root of \(T^*_z\), it holds that \(\textrm {sigpath}(f^*, T^*_z) \subseteq \textrm {sigpath}(f^*, T)\). Similarly, for any \(f^* \in \overline{\mathcal {F}^*}\), it holds that \(\textrm {sigpath}(f^*, \overline{T^*}) \subseteq \textrm {sigpath}(f^*, T)\). To prove the lemma, it suffices to show that, given a super-k-group \(K = \lbrace e_1, \ldots , e_k\rbrace\), we can always assign each \(e_i\), \(i \in [k]\), to a distinct cluster in \(\lbrace \textrm {sigpath}(f, T) \mid f \in \mathcal {F}\rbrace\). This is easy: if \(e_i\) is picked from \(\textrm {sigpath}(f^*, T^*_z)\) for some \(z \in \mathcal {F}\) and \(f^* \in \mathcal {F}^*_z\), assign \(e_i\) to \(\textrm {sigpath}(f^*, T)\); if \(e_i\) is picked from \(\textrm {sigpath}(f^*, \overline{T^*})\) for some \(f^* \in \overline{\mathcal {F}^*}\), assign \(e_i\) to \(\textrm {sigpath}(f^*, T)\).
References
[1] Serge Abiteboul, Richard Hull, and Victor Vianu. 1995. Foundations of Databases. Addison-Wesley.
Foto N. Afrati, Manas R. Joglekar, Christopher Ré, Semih Salihoglu, and Jeffrey D. Ullman. 2017. GYM: A multiround distributed join algorithm. In Proceedings of the International Conference on Database Theory (ICDT’17). 4:1–4:18.
Foto N. Afrati and Jeffrey D. Ullman. 2011. Optimizing multiway joins in a map-reduce environment. IEEE Transactions on Knowledge and Data Engineering (TKDE) 23, 9 (2011), 1282–1298.
Alok Aggarwal and Jeffrey Scott Vitter. 1988. The input/output complexity of sorting and related problems. Communications of the ACM (CACM) 31, 9 (1988), 1116–1127.
Kaleb Alway, Eric Blais, and Semih Salihoglu. 2021. Box covers and domain orderings for beyond worst-case join processing. In Proceedings of the International Conference on Database Theory (ICDT’21). 3:1–3:23.
Paul Beame, Paraschos Koutris, and Dan Suciu. 2017. Communication steps for parallel query processing. Journal of the ACM (JACM) 64, 6 (2017), 40:1–40:58.
Christoph Berkholz, Jens Keppeler, and Nicole Schweikardt. 2017. Answering conjunctive queries under updates. In Proceedings of the ACM Symposium on Principles of Database Systems (PODS’17). 303–318.
Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified data processing on large clusters. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI’04). 137–150.
Georg Gottlob, Nicola Leone, and Francesco Scarcello. 2001. The complexity of acyclic conjunctive queries. Journal of the ACM (JACM) 48, 3 (2001), 431–498.
Xiao Hu. 2021. Cover or pack: New upper and lower bounds for massively parallel joins. In Proceedings of the ACM Symposium on Principles of Database Systems (PODS’21). 181–198.
Xiao Hu and Ke Yi. 2016. Towards a worst-case I/O-Optimal algorithm for acyclic joins. In Proceedings of the ACM Symposium on Principles of Database Systems (PODS’16). 135–150.
Xiao Hu and Ke Yi. 2019. Instance and output optimal parallel algorithms for acyclic joins. In Proceedings of the ACM Symposium on Principles of Database Systems (PODS’19). 450–463.
Xiao Hu, Ke Yi, and Yufei Tao. 2019. Output-optimal massively parallel algorithms for similarity joins. ACM Transactions on Database Systems (TODS) 44, 2 (2019), 6:1–6:36.
Muhammad Idris, Martín Ugarte, and Stijn Vansummeren. 2017. The dynamic Yannakakis algorithm: Compact and efficient query processing under updates. In Proceedings of the ACM Management of Data (SIGMOD’17). ACM, 1259–1274.
Bas Ketsman and Dan Suciu. 2017. A worst-case optimal multi-round algorithm for parallel computation of conjunctive queries. In Proceedings of the ACM Symposium on Principles of Database Systems (PODS’17). 417–428.
Mahmoud Abo Khamis, Hung Q. Ngo, Christopher Ré, and Atri Rudra. 2016. Joins via geometric resolutions: Worst case and beyond. ACM Transactions on Database Systems (TODS) 41, 4 (2016), 22:1–22:45.
Paraschos Koutris, Paul Beame, and Dan Suciu. 2016. Worst-case optimal algorithms for parallel query processing. In Proceedings of the International Conference on Database Theory (ICDT’16). 8:1–8:18.
Gonzalo Navarro, Juan L. Reutter, and Javiel Rojas-Ledesma. 2020. Optimal joins using compact data structures. In Proceedings of the International Conference on Database Theory (ICDT’20), Vol. 155. 21:1–21:21.
Hung Q. Ngo, Dung T. Nguyen, Christopher Ré, and Atri Rudra. 2014. Beyond worst-case analysis for joins with minesweeper. In Proceedings of the ACM Symposium on Principles of Database Systems (PODS’14). 234–245.
Hung Q. Ngo, Ely Porat, Christopher Ré, and Atri Rudra. 2012. Worst-Case optimal join algorithms: [Extended Abstract]. In Proceedings of the ACM Symposium on Principles of Database Systems (PODS’12). 37–48.
Hung Q. Ngo, Christopher Ré, and Atri Rudra. 2013. Skew strikes back: New developments in the theory of join algorithms. SIGMOD Rec. 42, 4 (2013), 5–16.
Anna Pagh and Rasmus Pagh. 2006. Scalable computation of acyclic joins. In Proceedings of the ACM Symposium on Principles of Database Systems (PODS’06). 225–232.
Miao Qiao and Yufei Tao. 2021. Two-attribute skew free, isolated CP theorem, and massively parallel joins. In Proceedings of the ACM Symposium on Principles of Database Systems (PODS’21). 166–180.
Cheng Sheng, Yufei Tao, and Jianzhong Li. 2012. Exact and approximate algorithms for the most connected vertex problem. ACM Transactions on Database Systems (TODS) 37, 2 (2012), 12:1–12:39.
Yufei Tao. 2018. Massively parallel entity matching with linear classification in low dimensional space. In Proceedings of the International Conference on Database Theory (ICDT’18), Vol. 98. 20:1–20:19.
Yufei Tao. 2020. A simple parallel algorithm for natural joins on binary relations. In Proceedings of the International Conference on Database Theory (ICDT’20). 25:1–25:18.
Yufei Tao. 2022. Parallel acyclic joins with canonical edge covers. In Proceedings of the International Conference on Database Theory (ICDT’22). 9:1–9:19.
Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Anthony, Hao Liu, and Raghotham Murthy. 2010. HIVE - a petabyte scale data warehouse using Hadoop. In Proceedings of the International Conference on Data Engineering (ICDE’10). 996–1005.
Todd L. Veldhuizen. 2014. Triejoin: A simple, worst-case optimal join algorithm. In Proceedings of the International Conference on Database Theory (ICDT’14). 96–106.
Mihalis Yannakakis. 1981. Algorithms for acyclic database schemes. In Proceedings of the 7th International Conference on Very Large Data Bases (September 9–11, 1981, Cannes, France). 82–94.
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI’12). 15–28.