research-article

Open access

Marrying Top-k with Skyline Queries: Operators with Relaxed Preference Input and Controllable Output Size

Authors:

Kyriakos Mouratidis,

Keming Li,

Bo TangAuthors Info & Claims

ACM Transactions on Database Systems, Volume 50, Issue 1

Article No.: 2, Pages 1 - 37

https://doi.org/10.1145/3705726

Published: 10 January 2025 Publication History

PDF eReader

Abstract

The two paradigms to identify records of preference in a multi-objective setting rely either on dominance (e.g., the skyline operator) or on a utility function defined over the records’ attributes (typically using a top-k query). Despite their proliferation, each has its own palpable drawbacks. Motivated by these drawbacks, we identify three hard requirements for practical decision support, namely, personalization, controllable output size, and flexibility in preference specification. With these requirements as a guide, we combine elements from both paradigms and propose two new operators, ORD and ORU. We present a suite of algorithms for their efficient processing, dedicating more technical effort to ORU, whose nature is inherently more challenging. Specifically, besides a sophisticated algorithm for ORD, we describe two exact methods for ORU and one approximate. We perform a qualitative study to demonstrate how our operators work and evaluate the performance of our algorithms against adaptations of previous work that mimic their output.

1 Introduction

In today’s connected world, users are presented with numerous alternatives to cover their everyday needs. Deciding among these alternatives generally entails the consideration of multiple, often conflicting aspects. Indeed, multi-objective optimization has been a traditional research topic [35, 53, 77] whose practical relevance has increased in the current reality. For a large set of alternatives (i.e., d-dimensional records), there are two main paradigms to determine those of most interest to the user, namely, based on dominance or ranking by utility.

The first paradigm considers that a record dominates another if all its attributes are more preferable. Based on that notion, the skyline operator reports the records that are not dominated [17], while the k-skyband reports those dominated by a maximum of $(k-1)$ others [69]. The dominance paradigm is intuitive to the user and straightforward to apply. However, it comes with two major drawbacks. First, it is not personalized, reporting the same result for every user. Furthermore, its output size (i.e., the number of reported records) is uncontrollable and often overwhelming [15, 36].

The second paradigm, ranking by utility, associates each record with a score via a (user-specific) function over the record’s attributes. Most commonly, the utility function is a weighted sum, with user preferences expressed by d per-attribute weights $w_i$ (together comprising a preference vector ${\boldsymbol{ w}}$). This linear type of scoring has been the most proliferate since the inception of ranking by utility [29, 49] and is shown by user studies to effectively model the way humans assess tradeoffs in real-life multi-objective decisions [72]. Ranking by utility, in the form of a $\text{top-}k$ query, offers personalization and control of the output size. Its Achilles’ heel, however, lies in specifying the “correct” weights, since a small change in ${\boldsymbol{ w}}$ can drastically alter the $\text{top-}k$ result [45, 95].

From the strengths and weaknesses of the two traditional paradigms, we infer three key desiderata for multi-objective querying:

—

Personalization. To effectively support a user’s decisions, an operator should take into account his or her individual preferences. This requirement is essential, especially nowadays, where an abundance of personal information is readily available via smartphones, fitness trackers, online activities, and so forth.

—

Controllable output size. Hick’s law, known since the 1950s, suggests that controlling the number of results presented to the user is essential to the quality of the decision and to the user experience [40, 43]. That law has been used as a cornerstone in e-commerce applications, meta-search engines, and so on [38, 41]. Dictating the output size is crucial also because of design considerations, such as display size, device capabilities, connection speed, and so forth. Hence, output-size specified (OSS) operators are required.

—

Flexibility in preference specification. In standard multi-objective querying, the user’s preferences (i.e., vector ${\boldsymbol{ w}}$) are assumed to be either input directly by the user or somehow mined (e.g., via online behavior and review mining [48, 85], pairwise comparisons of example records [46, 72], or some other preference learning technique [30]). In the former case, a user cannot be reasonably expected to quantify with absolute precision the relative importance of the various attributes. In the latter case, preference learning methods come with an understanding that they can only estimate the user’s latent preferences. Therefore, a practical operator should allow some slack in the preference input.

Henceforth referring to the above three as hard requirements, we propose a general querying methodology that satisfies all three of them. In particular, we define two operators that (1) are personalized, (2) are OSS, and (3) have a relaxed preference input. To achieve personalization, we employ linear scoring, due to its demonstrated effectiveness in modeling human decision-making [72]. However, we consider the input ${\boldsymbol{ w}}$ as a best-effort estimate. We therefore relax it by incrementally expanding it equally in all directions in the preference domain. Starting at w and its own $\text{top-}k$ records, as the expansion radius grows, we gradually include in the output additional records that cater to alternative, similar preferences. The stopping radius is indirectly (yet strictly) determined by the desired output size m.

Observe that our approach makes no assumption about (and requires no knowledge of) the accuracy of w or its distribution. Instead, the manual specification or the mining of w is external and orthogonal to our work, and so is the effectiveness/accuracy of that process. With the estimated w being the only preference information available to our algorithms, our rationale is to consider the absolute nearest possible vectors to it, thus producing an output (of size m) that caters to the tightest set of alternative preferences around w.

Research on both standard paradigms (dominance-based and ranking by utility) has considered their individual weaknesses, but no existing work satisfies all three hard requirements. The skyline literature includes formulations that control the output size by loosening the definition of dominance [18, 52, 87], identifying representatives [39, 54, 57, 78], or considering subspaces [19, 83]. For example, Lin et al. [57] report the m skyline members that dominate the most non-skyline records, while Chan et al. [19] shortlist the m records that belong to the most subspace skylines. These definitions aim to produce the most competitive or the most representative skyline records in a general sense, without a specific user in mind, thus lacking personalization.

Centered more on utility, studies on regret-minimizing sets report m representative records from the dataset. Typically, they define the regret ratio as the relative difference between the utility of the top-scoring record in the selected subset and the top-scorer in the entire dataset. Their objective is to minimize the aggregate (usually the maximum) regret ratio across every possible utility function [68, 89]; i.e., the reported subset is meant to satisfy as it best can all possible users, without an intent for personalization.

Two recent studies, [24] and [64], attempt to relax the preference input in ranking by utility. That is, they assume that the preference input is a convex polytope R instead of a vector ${\boldsymbol{ w}}$. Concordantly, they report the records that could be among the k most preferable for any ${\boldsymbol{ w}}\in R$ (for $k=1$ and $k \ge 1$ in [24] and [64], respectively). Unfortunately, these methods are not OSS and, moreover, come without even an estimate of the output size; i.e., the user/application is in the dark on whether R is too large or too small to produce, even approximately, the required number of records. Furthermore, these approaches may remove the need for a particular w but require specifying a polytope R in the preference domain. Deciding R is left to the user or application, a choice that becomes tougher considering that the dynamics in the preference domain are hard to gauge. In usability terms, specifying the output size m is arguably more tangible and more relatable to the user/application than specifying a polytope in the preference domain.

In this article, we define two new operators, $\mathsf {ORD}$ and $\mathsf {ORU}$, which satisfy all three hard requirements. They expand the preference input w in a similar way; however, they retain a stronger flavor of either paradigm each. $\mathsf {ORD}$ employs an adaptive notion of dominance that is guided by m, while $\mathsf {ORU}$ sticks closer to ranking by utility. Hard requirements aside, practicality also demands responsiveness and scalability. We make geometric observations and establish propositions that lead to efficient processing. Specifically, we propose an algorithm for $\mathsf {ORD}$ processing and two for $\mathsf {ORU}$. Our methods are orders of magnitude faster than adaptations of previous work, which can merely simulate the $\mathsf {ORD}$/$\mathsf {ORU}$ output, and still, without OSS guarantees. Furthermore, given that the nature of $\mathsf {ORU}$ renders it significantly more complex than $\mathsf {ORD}$, we develop an approximate $\mathsf {ORU}$ algorithm too, which trades accuracy for efficiency and comes with proven approximation guarantees.

In Table 1, we summarize the properties of existing multi-objective queries and juxtapose them with our operators. A more comprehensive description of related work and a formal definition of $\mathsf {ORD}$ and $\mathsf {ORU}$ are given in Sections 2 and 3, respectively. Following that, Section 4 presents our processing method for $\mathsf {ORD}$, while Sections 5 and 6 describe two exact methods for $\mathsf {ORU}$. Section 7 complements our exact processing suite with an approximate $\mathsf {ORU}$ algorithm. Section 8 discusses an additional use case of our operators that extends their applicability. Section 9 includes a qualitative study and the experimental evaluation, while Section 10 concludes the article.

Table 1.

Operator	Personalized	OSS	Flexible Input
Skyline/k-Skyband	✘	✘	✔
Top-k	✔	✔	✘
OSS skylines	✘	✔	✔
Regret-minimizing sets	✘	✔	✔
Fixed-region techniques	✔	✘	✔
Proposed ($\mathsf {ORD}$ and $\mathsf {ORU}$)	✔	✔	✔

Table 1. Multi-objective Queries and Their Properties

2 Related Work

The two traditional alternatives to determine the most preferable records from a dataset D with d attributes are based on dominance and on ranking by utility score. A record dominates another if it is at least as preferable in all dimensions and strictly more preferable in at least one dimension. The records that are not dominated by any other make up the skyline [17], while, more generally, those dominated by fewer than k form the k-skyband [69]. In contrast, in the ranking approach, the score of a record is typically defined as the weighted sum of its attributes for a vector of d user-specific weights. The $\text{top-}k$ set includes the k records with the largest scores [44].

For large, indexed datasets, the most common processing algorithms in both cases follow the branch-and-bound methodology. BBS [69] visits index nodes and data records in increasing distance from the top corner of the data space (i.e., the corner with the maximum possible attribute values), using a min-heap to organize them by that distance. It maintains as skyline/k-skyband the records dominated by none/fewer than k records encountered so far. BBR [79] computes the $\text{top-}k$ set by visiting index nodes and data records in decreasing (upper bound of) score, using a max-heap. The first k records popped from the heap are the $\text{top-}k$.

OSS Skylines: The size of the skyline is uncontrollable and oftentimes very large [36]. That being a major shortcoming, there have been several approaches to limit it.

Chan et al. [18] consider that a record ${\boldsymbol{ r}}_i$ m-dominates another ${\boldsymbol{ r}}_j$ for an $m \le d$ if there is a subspace of m dimensions where ${\boldsymbol{ r}}_i$ dominates ${\boldsymbol{ r}}_j$. A smaller m implies a smaller skyline, thus indirectly controlling its size. Koltun and Papadimitriou [52] propose $\epsilon$-dominance, where the attributes of a record ${\boldsymbol{ r}}_i$ are multiplied by $(1+\epsilon)$ to check whether it dominates another record ${\boldsymbol{ r}}_j$. In the same spirit, Xia et al. [87] increment the attributes of ${\boldsymbol{ r}}_i$ by an absolute $\delta$ value on all (appropriately scaled) dimensions. Other studies aim to select the m most representative skyline records. The dominance count of a record, i.e., the number of records it dominates, has been used as a measure of its importance [34, 81, 92]. By that intuition, Lin et al. [57] choose as representatives the m skyline records that dominate the most other records. Lee and Hwang [54] propose a pivot-based space partitioning for that problem, while Gao et al. [31] enhance it by favoring representatives that dominate the less frequently dominated non-skyline records. The latter’s performance is improved by Han et al. [39]. By a different, distance-based intuition, Tao et al. [78] choose as representatives the m skyline records that minimize the distance from the remaining skyline members.

Sarma et al. [73] pick m records from the skyline so that the probability that a random user would click on one of them is maximized. Assuming that a record ${\boldsymbol{ r}}_i$ is interesting if its attributes exceed a certain threshold per dimension and that the distribution of the threshold values is known, they propose approximate and sampling methods to select the m representatives. Magnani et al. [58] consider various measures of diversity and significance and assume a linear combination of these two factors as the objective function that the m chosen skyline records must maximize.

Another approach to select representatives considers membership in subspace skylines [70, 80]. For instance, Chan et al. [19] report the m skyline records that appear in the most subspace skylines. Vlachou and Vazirgiannis [83] measure importance according to dominance in different subspaces and assume propagation of importance via dominance links. Another formulation considers that some attributes are more important [50, 61]; Lee et al. [55] select representatives according to skyline membership in the induced prioritized subspaces.

Most OSS skylines do not take into account a user’s personal preferences. An exception, in abstract terms at least, is Bartolini et al. [13], who consider that record attributes correspond to user-specific ratings. If a user has not provided ratings for records ${\boldsymbol{ r}}_i$ and ${\boldsymbol{ r}}_j$ but at least a fraction of similar users have indicated ratings where ${\boldsymbol{ r}}_i$ dominates ${\boldsymbol{ r}}_j$, the same is assumed for the user at hand too. The required fraction indirectly controls the skyline size. The focus in [13] is to infer dominance when user ratings (i.e., record attributes) are missing. In contrast, in our target applications the records’ attributes are given and no information for other users is required. Another distinction between OSS skylines and our work is that they consider $k = 1$; i.e., once dominated, a record is eliminated. Instead, our operators may dig deeper, to larger k values, because they can rely on the personal preferences (roughly) specified by ${\boldsymbol{ w}}$. A final remark on OSS skylines is that they should not be confused with output-sensitive skyline algorithms [51, 60, 75]. The latter have an asymptotic complexity that depends on the size of their output, as opposed to controlling the skyline size itself.

Regret Minimization: Work on regret-minimizing sets (RMS) produces an m-sized subset $S \subset D$ that tries to satisfy as it best can any possible user. In the original formulation [68], the regret ratio for a user is defined as the relative difference between the maximum utility score in S and that in the entire D. The objective for S is to minimize the maximum regret ratio for any possible user. There have been many follow-up studies (e.g., [9, 90]) considering also RMS variants, most notably k-RMS [22] (where the regret ratio reflects the difference between the top-scorer in S and the top-k-th in D), defining regret based on the rank of records [10], and so forth. A survey is given in [89]. RMS studies are not concerned with personalization. Also, even if fed with our operators’ stopping radius, they cannot reproduce our output. For example, to solve classic RMS [68], it suffices to consider only skyline records. In contrast, our output may also include records below the skyline.

In [71], Peng and Wong assume that the probability density function of the users’ preference vectors is given. Their goal is to select m records from D so that the top-1 option for a randomly chosen user (preference vector) has the highest probability to be among these m records. In the case of linear utility functions, these options must fall on the convex hull. They propose an approximate method based on samples drawn from the given distribution and provide probabilistic guarantees on the inaccuracy (error) of their solution. Again focusing on the top-1 scenario (i.e., $k=1$) and assuming that the distribution of preference vectors is known, Zeighami and Wong [94] select a subset of m records from D so that the expected utility loss between the top record in the subset and the top record in the entire D is minimized. An exact solution is possible in the degenerate $d=2$ case (where the preference domain is practically 1-dimensional), but for general d, a sampling-based approximate algorithm is proposed that offers probabilistic error guarantees. Both [71] and [94] assume that the distribution of preferences is known. In contrast, our operators assume no knowledge of the distribution of ${\boldsymbol{ w}}$. Furthermore, [71, 94] focus on $k=1$ and are bound to candidate records that fall on the convex hull, whereas our methodology applies to general $k \ge 1$ and may look at deeper layers of the dataset. Even if [71, 94] could somehow be extended to general $k \ge 1$ and our operators’ stopping radius were fed to them, they could still not guarantee to report all possible $\text{top-}k$ records within that radius, because they rely on sampling.

Inspired by RMS but aiming for personalization, interactive regret minimization (IRM) involves the user in the search process [67]. Initially oblivious of the user’s preferences, IRM goes through multiple rounds of interaction. In each round, it presents the user with a number of records and asks him or her to choose the best, thus learning the user’s (latent) preference vector increasingly well. When the regret ratio is guaranteed to be small enough (or the actual top-scorer is found), the last record chosen becomes the answer for this user. The original IRM method [67] involves artificial records in its interactions, which is resolved in [88]. The latter is enhanced in [96] by asking the user to sort the presented records (instead of just choosing the best). IRM assumes a different query processing model altogether, requiring active user involvement. Moreover, its objective is to eventually identify the one record with maximum utility, and thus it considers only records on the skyline or convex hull.

Fixed-region Techniques: The closest related studies to our work are [24] and [64]. Given a convex preference polytope R, Ciaccia and Martinenghi [24] define that ${\boldsymbol{ r}}_i$ R-dominates ${\boldsymbol{ r}}_j$ if ${\boldsymbol{ r}}_i$ scores higher than ${\boldsymbol{ r}}_j$ for any ${\boldsymbol{ w}}\in R$. They propose an R-dominance test that checks one linear condition per extreme vertex of R and compute the R-skyline (i.e., the records that are not R-dominated by any other) by integrating that test into standard skyline algorithms. They also introduce an operator that reports as potentially optimal every ${\boldsymbol{ r}}_i$ that is the top record for at least one ${\boldsymbol{ w}}\in R$. To check a record for potential optimality, they solve a linear programming (LP) problem defined according to the extreme vertices of R. Mouratidis and Tang [64] extend potential optimality to $k \ge 1$; i.e., they identify the records that appear in the $\text{top-}k$ result for at least one ${\boldsymbol{ w}}\in R$. In a more advanced variant, they explicitly report every possible (order-insensitive) $\text{top-}k$ set for any ${\boldsymbol{ w}}\in R$. They first disqualify records R-dominated by k or more others. Among the remaining candidates, they determine the top-k-th in each partition of R and, accordingly, the (order-insensitive) prefix of the $\text{top-}k$ set.

In terms of practicality, the operators in [24] and [64] lack the OSS property, meaning that the user/application cannot determine (or even predict) the size of the output. The techniques themselves cannot be extended to our problem, because they rely on R being fixed and given in advance. For example, their R-dominance and LP tests are defined according to the extreme vertices of R and are contingent on these vertices being fixed and known. Furthermore, they require R to be a convex polytope. In contrast, in our case R is not specified, and the preference region (even if it were given in advance) is effectively a hyper-sphere, not a polytope. If we approximate hyper-spheres with hyper-cubes and make repetitive calls for different side-lengths of R in an exploratory manner, the approaches in [24] (for $k=1)$ or [64] (for general k) could somehow simulate our operators, but even with that slack, they would require an excessive number of trials/executions to produce an output of exactly m records. In other words, a second compromise is necessary, i.e., allow them to terminate when the output size is “almost” m (e.g., within a 10% deviation). Our framework is natively, strictly OSS. We note that the above description of [24] collectively represents its extension in [26] too.

Related Top-k Work: On the $\text{top-}k$ front, there are studies for unspecific or unknown preference vector w that are somewhat related to our work. For example, Soliman et al. [76] compute the most probable $\text{top-}k$ result if w is a random, uniformly distributed vector. Uncertain records/attributes have also been considered, leading to probabilistic $\text{top-}k$ outputs [7, 27, 91]. On the other hand, Zhang et al. [95] compute the preference region that corresponds to a given $\text{top-}k$ result in a task that, loosely speaking, is inverse from ours.

Preliminary Version of this Work: This article extends the study in [63], where we originally defined the $\mathsf {ORD}$ and $\mathsf {ORU}$ operators (current Section 3) and developed the first processing methods for them (Sections 4 and 5, respectively). That study showed that the $\mathsf {ORU}$ problem is considerably more challenging and that, although the algorithm for $\mathsf {ORD}$ delivers sub-second response times even for very large problem instances, the method for $\mathsf {ORU}$ is not quite there. Motivated by this, we tackle $\mathsf {ORU}$ further in this extension, by pursuing two orthogonal directions. Specifically, in Section 6 we present a new, fundamentally different approach for (exact) $\mathsf {ORU}$ processing, which is several times (and up to 3 orders of magnitude) faster than the one in [63]. On the other hand, in Section 7 we propose an approximate $\mathsf {ORU}$ algorithm, which offers proven accuracy guarantees and allows control of the tradeoff between accuracy and efficiency.

3 Problem Formulation

We consider that the available options are represented as d-dimensional records ${\boldsymbol{ r}}= (x_1, x_2, \ldots , x_d)$ in a dataset D indexed by a spatial access method, e.g., an R-tree [14, 74]. We follow the convention that the larger the attributes the better, yet our findings adapt easily to cases where some/all attributes are to be minimized. Given a preference vector v of d non-negative weights $w_i$, the utility score of a record ${\boldsymbol{ r}}$ is defined as their inner product, i.e., $U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}})=\sum _{i=1}^d x_i \cdot w_i$. Accordingly, the $\text{top-}k$ result comprises the k records with the highest scores. Ordering D by utility is independent from the magnitude of ${\boldsymbol{ v}}$ [42, 56], and thus we assume preference vectors where $\sum _{i=1}^d w_i =1$. In other words, the domain of the preference vectors, called the preference domain, is the unit $(d-1)$-simplex in a space whose d axes correspond to the $w_i$ values, i.e., the simplex $\Delta ^{d-1} = \lbrace {\boldsymbol{ v}}\in {\rm I\!R}_+^d | \sum _{i=1}^d w_i =1\rbrace$. For $d=3$, the preference domain is an equilateral triangle, shown in gray in Figure 1(a). Effectively, any valid preference vector is represented as a point in that triangle. For $d=4$, the preference domain is a tetrahedron, and so on.

Fig. 1.

Let ${\boldsymbol{ w}}$ be a best-effort estimate of the user’s preference vector, henceforth called the seed, and consider the preference vectors ${\boldsymbol{ v}}$ within distance $\rho$ from ${\boldsymbol{ w}}$, i.e., where $| {\boldsymbol{ v}}- {\boldsymbol{ w}}| \le \rho$. If a record ${\boldsymbol{ r}}_i$ scores at least as high as another ${\boldsymbol{ r}}_j$ for every such vector ${\boldsymbol{ v}}$ and strictly higher for at least one of them, we say that ${\boldsymbol{ r}}_i$ $\rho$-dominates ${\boldsymbol{ r}}_j$. The records that are $\rho$-dominated by fewer than k others form the $\rho$-skyband. This general notion includes the $\rho$-skyline as a special case for $k=1$. Note that a larger $\rho$ implies a larger $\rho$-skyband. In the extreme settings, $\rho = 0$ renders the $\rho$-skyband equivalent to a traditional $\text{top-}k$ query at w, while $\rho = \infty$ makes it equivalent to the standard k-skyband.¹ We may now define our first operator, abbreviated as $\mathsf {ORD}$ to stress its OSS property, relaxed input, and stronger dominance-oriented flavor.

Definition 1 ($\mathsf {ORD}$).

Given the seed vector ${\boldsymbol{ w}}$ and the required output size m, $\mathsf {ORD}$ reports the records that are $\rho$-dominated by fewer than k others for the minimum $\rho$ that produces exactly m records in the output.

Observe that user and application are both transparent to $\rho$, which relieves them from being concerned with the complex dynamics of the preference domain. The appropriate $\rho$ is determined automatically by our framework, according to the desired output size m. Our second operator shares that trait too but follows more closely the ranking by utility paradigm, thus its abbreviation, $\mathsf {ORU}$.

Definition 2 ($\mathsf {ORU}$).

Given the seed vector ${\boldsymbol{ w}}$ and the required output size m, $\mathsf {ORU}$ reports the records that belong to the $\text{top-}k$ result for at least one preference vector within distance $\rho$ from w for the minimum $\rho$ that produces exactly m records in the output.

While beyond the requirements of Definition 2, a byproduct of our $\mathsf {ORU}$ algorithms is the reporting of the specific (order-sensitive) $\text{top-}k$ result for any vector within radius $\rho$ from ${\boldsymbol{ w}}$. This enables additional applications, like determining the most stable [8, 95] or the most representative [76] $\text{top-}k$ results in the vicinity of ${\boldsymbol{ w}}$ according to the volume of the preference regions that produce them.

Our $\mathsf {ORD}$/$\mathsf {ORU}$ techniques require no precomputation other than a general-purpose spatial index on D. This implies that updates in D affect only (and are readily supported by) the index. Also, it enables the integration of common predicates into our framework. For example, should the user impose arbitrary range predicates (e.g., price between $150 and $200, size between 400ft$^2$ and 600ft$^2$, etc.), we may execute a multi-dimensional range query on D, followed by $\mathsf {ORD}$/$\mathsf {ORU}$ in the selected part of the index/dataset.

Multi-objective querying generally loses its meaning in high dimensions. For instance, for more than a handful of dimensions almost every record tends to belong to the skyline [19, 36], while utility-wise the scores of all records tend to converge [65, 93]. We hence focus on low-dimensional settings. A final remark is that although we position our work within preference-based record shortlisting for a human user, our techniques apply to general multi-objective scenarios where the suitability of available options is defined by a linear function over the options’ attributes.

4 $\mathsf {ORD}$ Algorithm

The output of $\mathsf {ORD}$ is a $\rho$-skyband, and in particular the one for the smallest $\rho$ that includes m records. We make several observations that lead to an efficient $\mathsf {ORD}$ processing methodology.

4.1 Observations and Main Idea

Properties of Candidate Records: Without loss of generality, assume that no two records coincide or score the same for the seed vector ${\boldsymbol{ w}}$. Unless a record belongs to the traditional k-skyband, it cannot belong to the $\text{top-}k$ result for any preference vector [17]. Hence, for any $\rho$, the $\rho$-skyband is a subset of the k-skyband. Therefore, the latter includes all the candidates we may need for $\mathsf {ORD}$. Consider a record ${\boldsymbol{ r}}_i$ among them. The remaining candidates fall into three categories regarding their potential to $\rho$-dominate ${\boldsymbol{ r}}_i$:

—

Records that score lower than ${\boldsymbol{ r}}_i$ for the seed vector ${\boldsymbol{ w}}$ cannot $\rho$-dominate it for any radius $\rho$ (because any $\rho$ includes the seed itself). By the way, a corollary of this is that the $\text{top-}k$ records of ${\boldsymbol{ w}}$ belong to the $\rho$-skyband for every $\rho$.

—

Records that dominate ${\boldsymbol{ r}}_i$ in the traditional sense score higher for any preference vector, and thus they $\rho$-dominate ${\boldsymbol{ r}}_i$ for any $\rho$.

—

The remaining records (i.e., those that do not dominate ${\boldsymbol{ r}}_i$ but score higher than it for ${\boldsymbol{ w}}$) $\rho$-dominate ${\boldsymbol{ r}}_i$ for a non-empty range of $\rho$ values, as we explain next.

Consider a record ${\boldsymbol{ r}}_j$ that falls in the third category. As such, it does not dominate ${\boldsymbol{ r}}_i$. Also, since $U_{ {\boldsymbol{ w}}}( {\boldsymbol{ r}}_j) \gt U_{ {\boldsymbol{ w}}}( {\boldsymbol{ r}}_i)$, record ${\boldsymbol{ r}}_j$ is not dominated by ${\boldsymbol{ r}}_i$ either. Every pair of records that do not dominate each other define a hyper-plane with equation $U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}_i) = U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}_j)$ that cuts through the preference domain; i.e., it divides $\Delta ^{d-1}$ into two non-empty parts. Since ${\boldsymbol{ r}}_j$ scores higher than ${\boldsymbol{ r}}_i$ for the seed ${\boldsymbol{ w}}$, it holds that $U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}_i) \lt U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}_j)$ in the entire part that includes ${\boldsymbol{ w}}$, while $U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}_i) \gt U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}_j)$ in the other part. Assuming $d=3$, Figure 1(b) illustrates in gray a hyper-plane with equation $U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}_i) = U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}_j)$.

Let ${\boldsymbol{ s}}_{i,j}$ be the intersection of hyper-plane $U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}_i) = U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}_j)$ with the preference domain $\Delta ^{d-1}$. Geometrically, ${\boldsymbol{ s}}_{i,j}$ is a $(d-2)$-simplex, e.g., for $d=3$ it is a line segment, shown in bold in Figure 1(b). Let also $\rho _{i,j}$ be the minimum distance between ${\boldsymbol{ w}}$ and any ${\boldsymbol{ v}}\in {\boldsymbol{ s}}_{i,j}$. For any preference vector in $\Delta ^{d-1}$ that is within distance $\rho _{i,j}$ from ${\boldsymbol{ w}}$, record ${\boldsymbol{ r}}_j$ scores higher than ${\boldsymbol{ r}}_i$; i.e., ${\boldsymbol{ r}}_j$ $\rho$-dominates ${\boldsymbol{ r}}_i$ for every $\rho \le \rho _{i,j}$. In contrast, it does not $\rho$-dominate ${\boldsymbol{ r}}_i$ for $\rho \gt \rho _{i,j}$. In implementation terms, we can compute the mindist $\rho _{i,j}$ using a quadratic programming solver [37, 62] with (squared) distance as the minimization objective, subject to the linear constraints that define ${\boldsymbol{ s}}_{i,j}$, i.e., $U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}_i) = U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}_j)$ and $\sum _{i=1}^d w_i =1$.

Inflection Radius: By considering all records ${\boldsymbol{ r}}_j$ in the third category against ${\boldsymbol{ r}}_i$, they are each mapped into an interval of $\rho$ values where they $\rho$-dominate it. Figure 2(a) offers an example. Assume that $k=5$ and that the 5-skyband includes eight records that score higher than ${\boldsymbol{ r}}_i$ for ${\boldsymbol{ w}}$. Out of these, ${\boldsymbol{ r}}_i$ is dominated in the traditional sense by three (i.e., ${\boldsymbol{ r}}_3$, ${\boldsymbol{ r}}_4$, ${\boldsymbol{ r}}_6$), thus their infinite intervals. The remaining five do not dominate ${\boldsymbol{ r}}_i$; hence, they fall in the third category; they each have a finite mindist $\rho _{i,j}$ mapped to intervals as illustrated. By sweeping the intervals from left to right, we can easily identify the $\rho$ value past which ${\boldsymbol{ r}}_i$ is dominated by fewer than k others; i.e., it becomes part of the $\rho$-skyband. We call that value the inflection radius of ${\boldsymbol{ r}}_i$ and denote it as $\rho _i$. In our example, $\rho _i = \rho _{i,7}$.

Fig. 2.

A Preliminary Approach: Based on the above, a first-cut $\mathsf {ORD}$ solution is to compute the entire k-skyband, and for each record in it, to derive the inflection radius, and then, to output the m of them with the smallest inflection radii. To exemplify, assume that the 5-skyband includes 12 records; in Figure 2(b), we map each of them to an interval of $\rho$ values where it belongs to the $\rho$-skyband, according to its inflection radius. These intervals have different meaning from Figure 2(a). In Figure 2(a), all intervals refer to ${\boldsymbol{ r}}_i$, helping to compute its own inflection radius $\rho _i$. In contrast, Figure 2(b) is a global representation of the $\rho$-skyband for different $\rho$ values. Specifically, if we sweep the chart with a vertical line, the intervals that intersect the line at any position indicate the $\rho$-skyband members for the $\rho$ value that corresponds to that position. In our example, assuming that $m=8$, the $\mathsf {ORD}$ output is set $\lbrace {\boldsymbol{ r}}_1, {\boldsymbol{ r}}_2, {\boldsymbol{ r}}_3, {\boldsymbol{ r}}_4, {\boldsymbol{ r}}_5, {\boldsymbol{ r}}_7, {\boldsymbol{ r}}_{10}, {\boldsymbol{ r}}_{12}\rbrace$, which corresponds to the $\rho$-skyband for $\rho = \rho _{12}$.

An interesting insight is that the $\mathsf {ORD}$ output may vary from standard ranking by utility ($\text{top-}k$) all the way to traditional dominance-based querying (k-skyband), depending on m. On the one hand, by definition, the $\text{top-}k$ records are the only members of the $\rho$-skyband for $\rho = 0$, which corresponds to the smallest possible m (i.e., $m=k$). On the other hand, every k-skyband member will appear in the $\rho$-skyband for a sufficiently large $\rho$ or, equivalently, for a sufficiently large m. To visualize these extremes, at the leftmost position in Figure 2(b) the sweeping line intersects only the intervals of the $\text{top-}k$ records (${\boldsymbol{ r}}_1$ to ${\boldsymbol{ r}}_5$), while at the rightmost² it intersects the entire k-skyband. That said, although this generality is welcome, the practical strength of $\mathsf {ORD}$ is for m values in between these extremes.

4.2 Efficient $\mathsf {ORD}$ Processing

The $\mathsf {ORD}$ processing idea described so far is a foundation that offers an abstract-level understanding of our method. However, an efficient solution must address several performance issues. Primarily, we would want to avoid computing the entire k-skyband in the beginning of the process. Indeed, the k-skyband may include numerous records, many times more than the m required [15, 36]. Ideally, we want to limit the number of considered candidates to as tight a superset of the $\mathsf {ORD}$ output as possible. The algorithm we present next serves that objective.

We first invoke a progressive k-skyband retrieval that fetches its members one by one and place them into a candidate set. Importantly, unlike standard k-skyband computation, we enforce that its members are fetched in decreasing score order for ${\boldsymbol{ w}}$ (we will explain how shortly). This retrieval order is essential, because when a new candidate ${\boldsymbol{ r}}_i$ is fetched, we can definitively compute its inflection radius $\rho _i$ already, without having to derive the entire k-skyband. The rationale is that only k-skyband records with higher score may $\rho$-dominate ${\boldsymbol{ r}}_i$, and these are guaranteed to be fetched before it.

We keep fetching new k-skyband members in that fashion until the candidate set reaches size $(m+1)$. At that stage, we evict the candidate with the largest inflection radius. Also, an important algorithmic shift takes place. Let $\bar{\rho }$ be the maximum inflection radius of the remaining m candidates. The $\bar{\rho }$-skyband is guaranteed to include at least the m existing candidates, and thus $\bar{\rho }$ upper bounds the eventual stopping radius of the algorithm. Therefore, from this point onward, we switch to fetching $\bar{\rho }$-skyband members (instead of k-skyband members), still in decreasing order of score for ${\boldsymbol{ w}}$. The switch can be performed transparently, as we elaborate later.

To exemplify, consider Figure 2(b). Assume that $k = 5$ and $m = 11$ and that we have fetched the $m+1 = 12$ depicted records, in the order indicated by their subscripts; i.e., ${\boldsymbol{ r}}_i$ was fetched i-th. To bring the candidates down to $m=11$, we discard ${\boldsymbol{ r}}_8$, as it has the largest inflection radius. Practically, that sets $\bar{\rho }$ to $\rho _8$.

As new candidates are fetched and evictions are made to keep their total number to m, the current $\bar{\rho }$ keeps shrinking. Meanwhile, as $\bar{\rho }$ shrinks, records tend to $\rho$-dominate more others. This implies that the $\bar{\rho }$-skyband retrieval becomes increasingly more selective, thus filtering out more aggressively regular k-skyband members that cannot participate in the $\mathsf {ORD}$ result. When the $\bar{\rho }$-skyband module cannot fetch any more records, the candidate set is finalized as the $\mathsf {ORD}$ result. The latter corresponds to the $\rho$-skyband for $\rho$ equal to the maximum inflection radius across its members.

The $\mathsf {ORD}$ algorithm relies on a progressive k-skyband module, with the extra requirement to fetch records in decreasing score according to ${\boldsymbol{ w}}$. We use an adaptation of BBS [69] where we visit index nodes and records in decreasing (upper bound of) score for ${\boldsymbol{ w}}$, using a max-heap. Once the $(m+1)$-th record is fetched, we shift to $\bar{\rho }$-skyband computation using the exact same heap as per normal, but replace the regular dominance tests of BBS with $\bar{\rho }$-dominance for the current $\bar{\rho }$ value. The visiting order by score and the use of $\bar{\rho }$-dominance tests instead of regular dominance are permissible modifications to vanilla BBS, because its correctness is guaranteed as long as no record ${\boldsymbol{ r}}_j$ fetched after another ${\boldsymbol{ r}}_i$ may dominate ${\boldsymbol{ r}}_i$ [69]. That property is upheld by our visiting order (by score), both initially for regular dominance and after the shift to $\bar{\rho }$-dominance. Indeed, $U_{ {\boldsymbol{ w}}}( {\boldsymbol{ r}}_i) \gt U_{ {\boldsymbol{ w}}}( {\boldsymbol{ r}}_j)$ ensures that ${\boldsymbol{ r}}_j$ cannot dominate or ${\rho }$-dominate ${\boldsymbol{ r}}_i$ for any $\rho$. An implementation note on the adapted BBS regards its ${\rho }$-dominance building block. That block tests whether an already-fetched $\bar{\rho }$-skyband record ${\boldsymbol{ r}}_i$ $\bar{\rho }$-dominates a not-yet-fetched record ${\boldsymbol{ r}}_j$ (or an unvisited index node whose top corner is ${\boldsymbol{ r}}_j$). The test is performed as explained in Section 4.1, by comparing the mindist $\rho _{i,j}$ with $\bar{\rho }$.

5 Basic $\mathsf {ORU}$ Algorithm

Our second operator, $\mathsf {ORU}$, adheres more closely to ranking by utility; it reports records that belong to the $\text{top-}k$ for at least one preference vector within radius $\rho$ from the seed w for the minimum $\rho$ that produces exactly m records. Despite the seeming similarity to $\mathsf {ORD}$’s definition, the $\text{top-}k$ ranking involved in $\mathsf {ORU}$ renders it innately different and its solution considerably more complex.

In this section, we describe a first solution to $\mathsf {ORU}$, which we call the basic ORU algorithm ($\mathsf {BORU}$). We present important preliminaries (in Section 5.1), a crucial theorem and algorithmic basis to $\mathsf {BORU}$ (in Section 5.2), and eventually the algorithm’s complete implementation (in Section 5.3).

5.1 Fundamentals

The abstractions and techniques used in $\mathsf {BORU}$ have the notion of the convex hull at their core [16]. The convex hull of D is the smallest convex polytope that encloses all its records. It comprises facets, each defined by d extreme vertices (records) in general position. The outer polygon in Figure 3(a) is the convex hull of an example dataset. Facet ${\boldsymbol{ r}}_1 {\boldsymbol{ r}}_2$ is defined by extreme vertices ${\boldsymbol{ r}}_1$ and ${\boldsymbol{ r}}_2$, and so forth.

Fig. 3.

A vector is normal to a hyper-plane when its direction is perpendicular to the hyper-plane. The norm of a facet on the hull is the normal vector to that facet whose sum of coordinates is 1 and is directed toward the exterior of the hull. In our example, the norm of ${\boldsymbol{ r}}_1 {\boldsymbol{ r}}_2$ is vector ${\boldsymbol{ v}}_1$. Effectively, the norm of a facet corresponds to a point in the preference domain $\Delta ^{d-1}$.

The top record for a preference vector ${\boldsymbol{ v}}$ is the one met first by a hyper-plane normal to ${\boldsymbol{ v}}$ that sweeps the data space from the top corner to the origin [28, 33]. Hence, the top record in D is guaranteed to lie on its convex hull [20, 59]. Since in our case the weights are non-negative, the top record is among the extreme vertices of facets with non-negative norms. We call upper hull the part that corresponds to these facets. In Figure 3(a), the upper hull is bold (and the rest of the convex hull is dashed). For the shown w, the top record is ${\boldsymbol{ r}}_3$, as it is met first by the sweeping line normal to w.

To explain the fundamentals of our methodology, assume that we have already computed the first k upper hull layers: the first layer, $L_1$, includes the upper hull of D; the second, $L_2$, includes the upper hull of $D - L_1$; and, generally, layer $L_i$ includes the upper hull of D after subtracting layers $L_1$ to $L_{i-1}$. In Figure 3(a), layer $L_1$ includes records ${\boldsymbol{ r}}_1$ to ${\boldsymbol{ r}}_5$, and $L_2$ records ${\boldsymbol{ r}}_6$ to ${\boldsymbol{ r}}_{10}$. Note that in reality the complete $\mathsf {BORU}$ algorithm does not require such precomputation but instead builds on the fly (i.e., at query time) only parts of the necessary layers, thus applying to arbitrary k, avoiding precomputation costs (time and space), and extending transparently to dynamic datasets, i.e., to cases where record insertions/deletions may occur in D and invalidate precomputed information. Also, assume that we already know the necessary radius $\rho$ for $\mathsf {ORU}$ to output m records. Of course, this too is an assumption we will drop later (in Section 5.3). Adjacent Set $\mathcal {A}( {\boldsymbol{ r}})$: Consider a record ${\boldsymbol{ r}}$ in layer $L_i$. We denote by $\mathcal {F}( {\boldsymbol{ r}})$ the set of $L_i$ facets with ${\boldsymbol{ r}}$ as one of their extreme vertices, and by $\mathcal {A}( {\boldsymbol{ r}})$ the records adjacent to ${\boldsymbol{ r}}$, i.e., the $L_i$ records (other than ${\boldsymbol{ r}}$) that define facets in $\mathcal {F}( {\boldsymbol{ r}})$. In Figure 3(a), for example, $\mathcal {F}( {\boldsymbol{ r}}_3) = \lbrace {\boldsymbol{ r}}_2 {\boldsymbol{ r}}_3, {\boldsymbol{ r}}_3 {\boldsymbol{ r}}_4\rbrace$ and $\mathcal {A}( {\boldsymbol{ r}}_3) = \lbrace {\boldsymbol{ r}}_2, {\boldsymbol{ r}}_4\rbrace$. The following Lemmas 1, 2, and 3 are crucial. Note that they refer to records within the same layer $L_i$.

Lemma 1.

Given a preference vector v whose top record in $L_i$ is r, if we start shifting v toward any direction in the preference domain, the first record in $L_i$ to outscore r is always in $\mathcal {A}( {\boldsymbol{ r}})$, i.e., among the records adjacent to r. Furthermore, each of the records in $\mathcal {A}( {\boldsymbol{ r}})$ is the first outscoring record for some shifting direction of v.

Proof.

Let $H_{ {\boldsymbol{ r}}}$ be the hyper-plane in the data space that is normal to v and passes through r. Record r is the top for v, as long as there is no record above $H_{ {\boldsymbol{ r}}}$ (i.e., in the half-space that includes the top corner of the data space). Assume that v gradually shifts toward a specific direction, with $H_{ {\boldsymbol{ r}}}$ always passing through r. As the orientation of $H_{ {\boldsymbol{ r}}}$ shifts together with v, the first record in $L_i$ that is met by $H_{ {\boldsymbol{ r}}}$ is the record ${\boldsymbol{ r}}_i$ that will outscore r if we shift v infinitesimally any further (in the same direction). Suppose that ${\boldsymbol{ r}}_i$ is not in $\mathcal {A}( {\boldsymbol{ r}})$; i.e., it shares no common $L_i$ facet with r. At the time that $H_{ {\boldsymbol{ r}}}$ touches ${\boldsymbol{ r}}_i$, according to the hypothesis, no other record in $L_i$ should lie above $H_{ {\boldsymbol{ r}}}$. This, however, is a contradiction, because the convexity of $L_i$ implies that any hyper-plane that passes through two non-adjacent records in $L_i$ (r and ${\boldsymbol{ r}}_i$ in this case) cuts through the interior of $L_i$; i.e., there is at least one other extreme vertex (record) in $L_i$ that lies above $H_{ {\boldsymbol{ r}}}$. We conclude that the first record to outscore r, for any direction of shifting v, must be in $\mathcal {A}( {\boldsymbol{ r}})$.

It remains to show that for each record ${\boldsymbol{ r}}_i$ in $\mathcal {A}( {\boldsymbol{ r}})$, there is a direction of shifting v that makes ${\boldsymbol{ r}}_i$ the first outscoring record. Let f be a facet in $\mathcal {F}( {\boldsymbol{ r}})$ where ${\boldsymbol{ r}}_i$ is a defining vertex. Consider the shifting of v toward the norm of f, equivalently, the shifting of $H_{ {\boldsymbol{ r}}}$ until it falls on f. Since f is a facet of the convex hull, it leaves all $L_i$ records toward its interior. Thus, there is no $L_i$ record above $H_{ {\boldsymbol{ r}}}$ at all times until now. After $H_{ {\boldsymbol{ r}}}$ has fallen on f, if v shifts infinitesimally toward ${\boldsymbol{ r}}_i$, ${\boldsymbol{ r}}_i$ will become the first to outscore r. □

By Lemma 1, if w shifts clockwise/anticlockwise in Figure 3(a), ${\boldsymbol{ r}}_4$ and ${\boldsymbol{ r}}_2$, respectively, will be the first $L_1$ records to outscore ${\boldsymbol{ r}}_3$.

Top-region $\mathcal {C}( {\boldsymbol{ r}})$: Building on Lemma 1, our next proposition reveals an important property within $L_i$ and helps define the top-region of a record ${\boldsymbol{ r}}\in L_i$, i.e., the region $\mathcal {C}( {\boldsymbol{ r}})$ in the preference domain where every vector has r as its top record in $L_i$.

Lemma 2.

Let r be a record in $L_i$. r is the top-scorer across all $L_i$ records for those preference vectors v that fall in the convex polytope $\mathcal {C}( {\boldsymbol{ r}})$ defined by (i.e., whose extreme vertices correspond to) the norms of the facets in $\mathcal {F}( {\boldsymbol{ r}})$.

Proof.

From Lemma 1, we infer that $\mathcal {C}( {\boldsymbol{ r}})$ is determined by records in $\mathcal {A}( {\boldsymbol{ r}})$, since they are the first to outscore r once v leaves $\mathcal {C}( {\boldsymbol{ r}})$. In particular, each adjacent record ${\boldsymbol{ r}}_i$ corresponds to a half-space $U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}) \ge U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}_i)$ in the preference domain (simply expressing that r should score no lower than ${\boldsymbol{ r}}_i$ anywhere in $\mathcal {C}( {\boldsymbol{ r}})$). $\mathcal {C}( {\boldsymbol{ r}})$ is the intersection of all these half-spaces, which (by definition [16]) is a convex polytope. Each facet of $\mathcal {C}( {\boldsymbol{ r}})$ is attributed to one of the intersected half-spaces, say, $U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}) \ge U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}_i),$ and, in effect, to one of the adjacent records, i.e., ${\boldsymbol{ r}}_i$ in this case. In general position, every extreme vertex of $\mathcal {C}( {\boldsymbol{ r}})$, say ${\boldsymbol{ v}}_j$, corresponds to the intersection of $(d-1)$ of its facets, i.e., to $(d-1)$ equalities of the form $U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}) = U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}_i)$, where each ${\boldsymbol{ r}}_i$ is adjacent to r. Let S be the record set composed of r and these specific $(d-1)$ adjacent records. As all records in S have the same score according to ${\boldsymbol{ v}}_j$, by Lemma 1, such a tie is only feasible if any pair of records in S is adjacent to each other on $L_i$. Since, in general position, each facet on $L_i$ is defined by d records, the records in S define a facet f in $\mathcal {F}( {\boldsymbol{ r}})$, with ${\boldsymbol{ v}}_j$ as its norm. In other words, there is a direct one-to-one mapping between the facets in $\mathcal {F}( {\boldsymbol{ r}})$ and the extreme vertices of $\mathcal {C}( {\boldsymbol{ r}})$. □

By Lemma 2, $\mathcal {C}( {\boldsymbol{ r}})$ can be seen as a dual representation of $\mathcal {F}( {\boldsymbol{ r}})$, where the former refers to the preference domain and the latter to the data space. Consider ${\boldsymbol{ r}}_3$ in Figure 3(a). Facet set $\mathcal {F}( {\boldsymbol{ r}}_3) = \lbrace {\boldsymbol{ r}}_2 {\boldsymbol{ r}}_3, {\boldsymbol{ r}}_3 {\boldsymbol{ r}}_4\rbrace$ translates to the top-region defined by their norms ${\boldsymbol{ v}}_2$ and ${\boldsymbol{ v}}_3$; i.e., $\mathcal {C}( {\boldsymbol{ r}}_3)$ is segment ${\boldsymbol{ v}}_2 {\boldsymbol{ v}}_3$ in the preference domain. Note that for $d=2$, the preference domain $\Delta ^{d-1}$ is a line segment.

Order Continuity: Lemmas 1 and 2, in tandem, suggest a continuity in the score order among $L_i$ records for every v. Specifically, the different top-regions for any given layer L define a partitioning of the preference domain, with adjacent records ${\boldsymbol{ r}}_i, {\boldsymbol{ r}}_j$ in L having neighboring top-regions $\mathcal {C}( {\boldsymbol{ r}}_i), \mathcal {C}( {\boldsymbol{ r}}_j)$ in $\Delta ^{d-1}$. Considering layer $L_1$ in our running example, Figure 3(b) demonstrates the partitioning of the preference domain. The top-region of ${\boldsymbol{ r}}_1$ is the segment from point $(0,1)$ to ${\boldsymbol{ v}}_1$; of ${\boldsymbol{ r}}_2$ from ${\boldsymbol{ v}}_1$ to ${\boldsymbol{ v}}_2$; of ${\boldsymbol{ r}}_3$ from ${\boldsymbol{ v}}_2$ to ${\boldsymbol{ v}}_3$; and so forth. Lemma 3 establishes a property for every vector in a top-region.

Lemma 3.

For any preference vector ${\boldsymbol{ v}}\in \mathcal {C}( {\boldsymbol{ r}})$, the top-2-nd record in $L_i$ is always in $\mathcal {A}( {\boldsymbol{ r}})$, i.e., among the records adjacent to ${\boldsymbol{ r}}$.

Proof.

Let $H_{ {\boldsymbol{ v}}}$ be the hyper-plane (in data space) that is normal to v. Sweeping the data space with $H_{ {\boldsymbol{ v}}}$, the first encountered record in $L_i$ is, by definition, r. As sweeping continues further, the convexity of $L_i$ ensures that $H_{ {\boldsymbol{ v}}}$ cuts only through the facets in $\mathcal {F}( {\boldsymbol{ r}})$. Since $L_i$ is hollow (i.e., has no records in its interior), the $L_i$ record to be encountered next (i.e., the top-2-nd in $L_i$) must be an extreme vertex of a facet in $\mathcal {F}( {\boldsymbol{ r}})$, i.e., a record in $\mathcal {A}( {\boldsymbol{ r}})$. □

In our example, Lemma 3 implies that the top-2-nd record in layer $L_1$ for any ${\boldsymbol{ v}}\in \mathcal {C}( {\boldsymbol{ r}}_3)$ is either ${\boldsymbol{ r}}_2$ or ${\boldsymbol{ r}}_4$. Note that all three lemmas consider a layer in isolation. For instance, although ${\boldsymbol{ w}}\in \mathcal {C}( {\boldsymbol{ r}}_3)$, its top-2-nd record in the entire D is none of ${\boldsymbol{ r}}_2$ or ${\boldsymbol{ r}}_4$, but ${\boldsymbol{ r}}_8$ from $L_2$.

5.2 An Algorithmic Basis to the $\mathsf {BORU}$ Technique

In this section, we prove an important theorem and provide an algorithmic basis to $\mathsf {BORU}$. Recall that we assumed we already know the minimum radius $\rho$ required to produce m records and that the first k upper hull layers are precomputed. We do not drop these assumptions yet. Here we focus on determining the $\text{top-}k$ result for any possible preference vector within radius $\rho$ from the seed w in order to form the $\mathsf {BORU}$ output.

First, we find all records in layer $L_1$ whose top-region has mindist to w no greater than $\rho$. Let C be one of these regions. We already know the top record in it, say, r. Considering C in isolation, our next task is to determine the top-2-nd record anywhere in it (i.e., for any possible preference vector ${\boldsymbol{ v}}\in C$) and to partition C accordingly. By Lemma 3, if we only considered $L_1$, the top-2-nd record for any ${\boldsymbol{ v}}\in C$ would be among those adjacent to r. On the other hand, in the remaining dataset (i.e., if we ignored the records in $L_1$), the top-2-nd record would be in $L_2$ and, more specifically, by Lemma 2, among the $L_2$ records ${\boldsymbol{ r}}_i$ whose top-region $\mathcal {C}( {\boldsymbol{ r}}_i)$ overlaps C. Theorem 1 generalizes this key observation.

Theorem 1.

Assume that anywhere in a preference region C the (order-sensitive) top-i result is the same and it is known. Also, let $L_t$ be the deepest layer that any of the top-i records belongs to. The top-$(i+1)$-th record anywhere in C must be in the union of:

—

Set (i): The adjacent records to any member of the known top-i result in its respective layer, and

—

Set (ii): The records in the $(t+1)$-th layer (i.e., $L_{t+1}$) whose top-region overlaps C.

Proof.

Let S be the union of all records in the first t layers. Due to Lemma 3, when it is applied to each of the top-i records in their respective layer, if we only considered S, the top-$(i+1)$-th record would be in Set (i). On the other hand, if we only considered the rest of the dataset, i.e., $D-S$, the next highest-scoring record would be in $L_{t+1}$ and, specifically, by Lemma 2, in Set (ii). Hence, in the overall product set D (i.e., in the union of S and $D-S$), the top-$(i+1)$-th record anywhere in C must be in the union of Sets (i) and (ii). □

Returning to our processing description for region C, the top record (the order-sensitive top-i result, in the general case) is already known and fixed anywhere in it, and thus we can readily determine Set (i). We can also extract from $L_2$ (from $L_{t+1}$, in the general case) the part of the upper hull that corresponds to records in Set (ii); let us denote that part as $L_{prt}$. We update the upper hull $L_{prt}$ to also cover Set (i) records and denote its updated version as $L_{upd}$. Next, we apply Lemma 2 to $L_{upd}$ to identify the top-2-nd records (the top-$(i+1)$-th, in the general case) for any ${\boldsymbol{ v}}\in C$, and we partition C accordingly. We continue this process recursively in each produced partition until the full, order-sensitive $\text{top-}k$ result is known anywhere in C. Repeating that process for all $L_1$ top-regions with mindist up to $\rho$, we derive all the required $\text{top-}k$ results.

In our example, assume that $\rho$ is as shown in Figure 3(b). The $L_1$ top-regions with mindist from w up to $\rho$ correspond to ${\boldsymbol{ r}}_2$, ${\boldsymbol{ r}}_3$, and ${\boldsymbol{ r}}_4$. Focusing on $\mathcal {C}( {\boldsymbol{ r}}_3)$ (i.e., segment ${\boldsymbol{ v}}_2 {\boldsymbol{ v}}_3$), we know already that the top record is ${\boldsymbol{ r}}_3$ and seek to find the top-2-nd. Set (i) includes ${\boldsymbol{ r}}_2$ and ${\boldsymbol{ r}}_4$. To determine Set (ii), we refer to $L_2$. Figure 4(a) illustrates the $L_2$ top-regions. Among them, those that overlap $\mathcal {C}( {\boldsymbol{ r}}_3)$ are $\mathcal {C}( {\boldsymbol{ r}}_7)$, $\mathcal {C}( {\boldsymbol{ r}}_8)$, and $\mathcal {C}( {\boldsymbol{ r}}_9)$. Thus, we form $L_{prt}$ as the part of $L_2$ that corresponds to ${\boldsymbol{ r}}_7, {\boldsymbol{ r}}_8, {\boldsymbol{ r}}_9$. Updating $L_{prt}$ to also cover Set (i) (i.e., ${\boldsymbol{ r}}_2, {\boldsymbol{ r}}_4$) results in the upper hull $L_{upd}$ in Figure 4(b). $L_{upd}$ suggests that the top-2-nd record is one of ${\boldsymbol{ r}}_2, {\boldsymbol{ r}}_8, {\boldsymbol{ r}}_4$. Furthermore, by Lemma 2, their $L_{upd}$ top-regions (determined by facet norms ${\boldsymbol{ v}}_a$ and ${\boldsymbol{ v}}_b$) help partition $\mathcal {C}( {\boldsymbol{ r}}_3)$ according to which exactly among them is the top-2-nd. Figure 4(c) presents the induced partitioning. The top-2 result is $\lbrace {\boldsymbol{ r}}_3, {\boldsymbol{ r}}_2\rbrace$ for preference vectors in ${\boldsymbol{ v}}_2 {\boldsymbol{ v}}_a$; $\lbrace {\boldsymbol{ r}}_3, {\boldsymbol{ r}}_8\rbrace$ in ${\boldsymbol{ v}}_a {\boldsymbol{ v}}_b$; and $\lbrace {\boldsymbol{ r}}_3, {\boldsymbol{ r}}_4\rbrace$ in ${\boldsymbol{ v}}_b {\boldsymbol{ v}}_3$. The process repeats recursively in order to determine the top-3-rd record in each of the three partitions, and so on.

Fig. 4.

Observe that we may not need to reach as deep as the k-th layer, since members of Set (i) could prevent those of Set (ii) from entering the result, thus giving more “width” to the $\mathsf {BORU}$ search (in the data space) than “depth.” For instance, in Figure 4(b), it could be the case that none of the $L_2$ records belongs to $L_{upd}$, i.e., that the top-2-nd record comes from $L_1$ (i.e., ${\boldsymbol{ r}}_2$ or ${\boldsymbol{ r}}_4$) for any ${\boldsymbol{ v}}\in \mathcal {C}( {\boldsymbol{ r}}_3)$.

5.3 Dropping Assumptions; Complete $\mathsf {BORU}$

In this section, we describe our complete $\mathsf {BORU}$ algorithm. So far, we have made two impractical assumptions, i.e., that we have already computed the first k upper hull layers and that we know in advance the necessary radius $\rho$ to output m records. Here we drop these assumptions, the first so that our algorithm is precomputation-free, and the second because it defies our problem formulation.

Without any precomputed layers or known $\rho$, our first step is to produce an overestimate of $\rho$, denoted as $\bar{\rho }$, which ensures an output size of at least m. That overestimate can be the radius required so that $\mathsf {ORU}$’s output for $k=1$ includes m records. This radius, for any k, is guaranteed to produce at least as many records.

A straightforward approach to derive $\bar{\rho }$ (based on $k=1$) would be to compute the upper hull of the entire dataset D, get the m top-regions with the smallest mindist to w, and use the largest mindist among them as $\bar{\rho }$. That, however, may be too costly. Ideally, we would want to localize the upper hull computation to just the vicinity of w. To achieve this, we exploit the fact that the $\rho$-skyline is a superset of $\mathsf {ORU}$’s output for the same $\rho$ and $k=1$, as follows directly from the definition of $\rho$-dominance.

We use an incremental $\rho$-skyline algorithm, which supports “get next” calls to extend a $\rho$-skyline for the immediately larger $\rho$ around w that admits exactly one new record to it. Details on that technique (for its general, $\rho$-skyband version) are presented in Section 5.3.2. We initialize that algorithm and prompt it until the $\rho$-skyline includes m records. Next, we compute their upper hull $L_{tmp}$. In general, not all $\rho$-skyline records will make it to $L_{tmp}$. If that is indeed the case, we keep prompting the $\rho$-skyline algorithm and updating the upper hull $L_{tmp}$ to cover the additional records until $L_{tmp}$ includes m extreme vertices (records). We use as $\bar{\rho }$ the final radius reported by the $\rho$-skyline algorithm. On top of this, note that the final $L_{tmp}$ is guaranteed to include all the parts of layer $L_1$ that $\mathsf {BORU}$ could possibly need for an output of size m. In other words, the final $L_{tmp}$ can serve already as layer $L_1$ in subsequent processing.

Using the obtained $\bar{\rho }$, we compute the $\rho$-skyband (for the actual k specified in the input). That can be done with a standard k-skyband algorithm by simply replacing regular dominance tests with $\bar{\rho }$-dominance ones (described in the last paragraph of Section 4.2). The derived $\bar{\rho }$-skyband is a guaranteed superset of the $\mathsf {ORU}$ output, and thus we place its members into a candidate set M.

Even with $\bar{\rho }$ available, we have only an overestimate of the actual radius required to output m records. Thus, computing upper hull layers directly on M would be computationally wasteful. To circumvent this, we employ an adaptive technique that progressively outputs confirmed candidates (i.e., records guaranteed to be in the $\mathsf {ORU}$ result) while the search is ongoing. Effectively, this enables the tightening of $\bar{\rho }$ on the fly, and hence the shrinking of the $\bar{\rho }$-skyband and the elimination of candidates from M, so that the layer (i.e., upper hull) computations execute on increasingly fewer records. This improved implementation is presented next.

5.3.1 Gradually Tightening the Overestimate $\bar{\rho }$.

What we already have is the initial overestimate $\bar{\rho }$, (the necessary part of) layer $L_1$, and the candidate set M. We treat the structure of all upper hull layers as an implicit tree and apply the best-first approach to gradually explore that tree in increasing distance from the seed w. In particular, we maintain a min-heap Q that organizes known top-i results (for $i \le k$) and their respective preference regions C, with mindist to w as their key. Let r be the top record according to w. We start with the top-region of r (i.e., $\mathcal {C}( {\boldsymbol{ r}})$), whose mindist is by definition 0. First, we partition $\mathcal {C}( {\boldsymbol{ r}})$ according to the possible top-2-nd records, which requires computing $L_2$ (i.e., the upper hull of record set $M - L_1$) and applying Theorem 1, as demonstrated in Section 5.2. Each produced partition is pushed into Q with key equal to its mindist to w and associated with its (now known) top-2 result. Second, for each $L_1$ record ${\boldsymbol{ r}}_i$ that is adjacent to r, we push its top-region $\mathcal {C}( {\boldsymbol{ r}}_i)$ into Q (with $\lbrace {\boldsymbol{ r}}_i \rbrace$ as its top-1 result). Then, we iteratively pop the heap. For each region C popped from Q, we distinguish two cases:

Case 1: If C corresponds to a top-i result (with $i \lt k$), we partition it according to the different top-$(i+1)$-th records in C (using Theorem 1, as in Section 5.2) and push the produced partitions into the heap (associated with their, now known, top-$(i+1)$ results). Applying Theorem 1 might require computing a new upper hull layer on the candidate set; details and an optimization are discussed later in this section. Importantly, if C corresponds to a top-1 result $\lbrace {\boldsymbol{ r}}_i \rbrace$, we additionally push into the heap the top-regions of its adjacent records (omitting any that were pushed into Q previously, to avoid duplication). The reason is that, unlike best-first search in an actual tree, at the “root level” of our implicit structure, we initially did not push into Q the top-regions of all $L_1$ records, only those neighboring $\mathcal {C}( {\boldsymbol{ r}})$. That was in order to save mindist calculations and unnecessary push operations, since many $L_1$ top-regions may lie too far from w to affect $\mathsf {BORU}$ processing. Instead, we use the continuity implied by Lemmas 1 and 2 to gradually push top-regions from $L_1$ into Q only when one of their neighbors is popped.

Case 2: If C corresponds to a $\text{top-}k$ result, it is considered finalized. That is, the $\text{top-}k$ result and its respective region C are appended to the $\mathsf {BORU}$ output. Observe that, as we explained in Section 3, our algorithm goes beyond Definition 2 to output not only records but also specific order-sensitive $\text{top-}k$ results, together with the preference regions that produce them.

The process terminates when the output includes m distinct records. The mindist of the last finalized region is the minimum radius $\rho$ that appears in Definition 2. In a nutshell, $\mathsf {BORU}$ explores (i.e., partitions or finalizes, for $i\lt k$ and for $i=k$, respectively) regions C in increasing distance from w, utilizing the implicit tree structure to dismiss those too distant to affect the result.

Figure 5 shows the implicit tree for our running example. Each node $N_j$ represents a preference region (i.e., a segment, for $d=2$) and its respective top-i result. The root corresponds to the $L_1$ top-region that includes w, i.e., $\mathcal {C}( {\boldsymbol{ r}}_3)$. First, we partition $\mathcal {C}( {\boldsymbol{ r}}_3)$ into three regions, namely, ${\boldsymbol{ v}}_2 {\boldsymbol{ v}}_a, {\boldsymbol{ v}}_a {\boldsymbol{ v}}_b, {\boldsymbol{ v}}_b {\boldsymbol{ v}}_3$, as demonstrated in Figure 4. Associated with their top-2 results, they conceptually form nodes $N_4, N_5, N_6$ and are pushed into Q. We also push the top-regions of $L_1$ records adjacent to ${\boldsymbol{ r}}_3$, i.e., $\mathcal {C}( {\boldsymbol{ r}}_2), \mathcal {C}( {\boldsymbol{ r}}_4)$, associated with their top-1 results (nodes $N_2, N_3$). Then, iterative popping commences.

Fig. 5.

The first popped node is $N_5$ (with mindist 0, since it actually includes w), whose top-2 result is $\lbrace {\boldsymbol{ r}}_3, {\boldsymbol{ r}}_8\rbrace$. If $k = 2$, it is finalized and its $\text{top-}k$ records are output directly (Case 2). Popping continues with $N_4$ and $N_6$, which are finalized too; their $\text{top-}k$ results contribute two new records to the output (i.e., ${\boldsymbol{ r}}_2$ and ${\boldsymbol{ r}}_4$). If $m=4$, the process terminates. Otherwise, the next popped node is $N_2$, for which we know the top-1 result (i.e., $N_2$ falls under Case 1). We will need to partition it by Theorem 1 and push into Q its resulting “children” (not illustrated). Importantly, since $N_2$ belongs to $L_1$, we will also need to push its neighboring top-regions that were not encountered before, i.e., $\mathcal {C}( {\boldsymbol{ r}}_1)$ (not illustrated). The implicit tree is constructed gradually, with new nodes formed for each Case 1 pop.

An important point on Case 1 is that partitioning C according to its different top-$(i+1)$-th records might require computing a new upper hull layer. That is, if layer $L_{t+1}$ (referring to the value of t in Theorem 1 for C) was not previously computed, we need to compute it now. A naïve approach is to simply remove the first t layers from the candidate set M and compute the upper hull on the remaining candidates. An improvement, however, is possible, by shrinking M. Recall that our initial $\bar{\rho }$ estimation assumed we needed m records in the $\bar{\rho }$-skyline. Letting $\tau$ be the number of distinct records that (1) belong to the $\text{top-}k$ results already finalized and (2) are not members of the $\bar{\rho }$-skyline, we can roll back the incremental $\rho$-skyline computation so that it includes only $m-\tau$ records. This backtracking effectively reduces $\bar{\rho }$ and in turn enables the shrinking of M to only keep $\rho$-skyband records (for the actual k input) for the reduced $\bar{\rho }$. The shrinking can be done trivially if we record the inflection radius for each record in the $\rho$-skyline during the original $\bar{\rho }$ estimation and for each record in the initial candidate set (i.e., in the $\rho$-skyband for the original $\bar{\rho }$ estimate).

5.3.2 Incremental $\rho$-skyband.

The OSS property and m being dictated by the user/application is a central point of our motivation and thus a hard requirement for our operators. However, $\mathsf {BORU}$ requires as a building block an incremental $\rho$-skyband module ($\mathsf {IRD}$). While $\mathsf {BORU}$ requires this for the $\rho$-skyline only (i.e., $k=1$), here we address the arbitrary k version for generality.

The $\mathsf {IRD}$ challenge is that, unlike $\mathsf {ORD}$, no $\rho$-dominance can ever be used to narrow down the search, because every k-skyband member may be output after a sufficient number of “get next” calls. Thus, the key question is how to serve these calls without computing the entire k-skyband. The main idea is to progressively fetch k-skyband members but only output them when their inflection radius is no larger than a gradually growing threshold $\underline{\rho }$, introduced later.

$\mathsf {IRD}$ invokes a regular k-skyband algorithm to progressively fetch its members in decreasing score order for ${\boldsymbol{ w}}$. We use BBS [69] for that building block but amend its default record/index node visiting order to order by score, as we did in Section 4.2. Let set T hold the k-skyband records fetched by BBS so far. As ensured by the score-based fetching order, we can compute the exact inflection radius for each record in T at the time it was fetched.

An invariant of branch-and-bound algorithms, like BBS, is that at any point during execution, their heap contents (records and index nodes) represent the not-yet-considered part of the dataset. Let S be the set of all records and nodes currently in the heap. For simplicity, we extend notation ${\boldsymbol{ r}}_i$ to nodes too, since BBS anyway represents them by the top corner of their minimum bounding box. For each ${\boldsymbol{ r}}_i \in S$ we can compute an inflection radius $\underline{\rho _i}$ based on the current set T. However, that $\underline{\rho _i}$ is just a lower bound of the actual inflection radius, because BBS may have not yet fetched in T all the k-skyband records with score larger than $U_{ {\boldsymbol{ w}}}( {\boldsymbol{ r}}_i)$; i.e., T may currently not include all records that $\rho$-dominate ${\boldsymbol{ r}}_i$.

Since S serves as a representation of all unexplored records, the minimum $\underline{\rho _i}$ among the members of S serves as an overall lower bound $\underline{\rho }$ for any non-fetched record. Therefore, every record in T with inflection radius no greater than $\underline{\rho }$ has a confirmed order in the output of $\mathsf {IRD}$. In other words, these records are guaranteed to comprise the $\rho$-skyband for radius $\underline{\rho }$.

Consider Figure 2(b), where $k=5$. When $\mathsf {IRD}$ is first invoked, we execute BBS to progressively fetch the first k records, i.e., ${\boldsymbol{ r}}_1$ to ${\boldsymbol{ r}}_5$, which by definition are the $\text{top-}k$. They are placed into set T and also output directly by $\mathsf {IRD}$. When prompted with a “get next” call, $\mathsf {IRD}$ resumes BBS to progressively fetch new k-skyband members into T. Whenever BBS fetches a new record, we update $\underline{\rho }$ according to the current contents of the BBS heap (i.e., of set S). Let ${\boldsymbol{ r}}_i$ be the not-yet-output record in T with the smallest inflection radius. If $\rho _i \le \underline{\rho }$, $\mathsf {IRD}$ outputs ${\boldsymbol{ r}}_i$ and pauses. Otherwise, we keep fetching new records by BBS and updating $\underline{\rho }$ after each retrieval, until $\underline{\rho }$ becomes at least as large as the inflection radius of one record in T. That record is output by $\mathsf {IRD}$ as the next $\rho$-skyband member. Subsequent “get next” calls are served by resuming this process.

Returning to our example, consider that $\mathsf {IRD}$ receives a “get next” call (after its initialization, which reports the top-5 records all at once). It resumes BBS, but as it fetches ${\boldsymbol{ r}}_6$, ${\boldsymbol{ r}}_7$, ${\boldsymbol{ r}}_8$ into T, assume that $\underline{\rho }$ does not become as large as any of the inflection radii in T after any of these retrievals. However, when ${\boldsymbol{ r}}_9$ is fetched too, suppose that the updated $\underline{\rho }$ becomes greater than (or equal to) $\rho _7$. $\mathsf {IRD}$ outputs ${\boldsymbol{ r}}_7$ and pauses (until it receives another “get next” call).

6 Fast $\mathsf {ORU}$ Algorithm

In this section, we present an alternative, fundamentally different approach for (exact) $\mathsf {ORU}$ processing, namely, the fast ORU algorithm ($\mathsf {FORU}$). It differs from $\mathsf {BORU}$ in three core aspects:

—

$\mathsf {BORU}$ starts with an overestimate of $\rho$, which it gradually shrinks until the output size drops to m. A disadvantage of this approach is that it examines a larger part of the preference domain (and therefore more candidate records) than necessary, thus wasting computations. In contrast, $\mathsf {FORU}$ uses the seed w as the starting point for a gradual expansion until the output includes m records.

—

$\mathsf {BORU}$ is centered on (and partitions the explored part of the preference domain according to) order-sensitive $\text{top-}k$ results. Order sensitivity is not required by Definition 2 and thus leads to an unnecessary waste of computations. $\mathsf {FORU}$ abolishes order sensitivity in the $\text{top-}k$ results that lead its execution.

—

$\mathsf {BORU}$ starts with preference regions that it partitions in order to determine the possible $\text{top-}k$ results therein. In contrast, $\mathsf {FORU}$ starts with possible (order-insensitive) $\text{top-}k$ results and determines the regions that correspond to them.

6.1 Fundamentals and Algorithmic Outline

In the context of $\mathsf {FORU}$, any reference to $\text{top-}k$ result/set refers to its order-insensitive definition. Also, in case of utility ties in the top-k-th position, the $\text{top-}k$ set is assumed to include all tied records (i.e., its cardinality may exceed k). A central concept in $\mathsf {FORU}$ is the result region.

Result Region $\mathcal {R}(S)$: Given a $\text{top-}k$ set S, the result region $\mathcal {R}(S)$ is the preference region that includes those and only those vectors ${\boldsymbol{ v}}$ whose $\text{top-}k$ set is S. Following directly from its definition, $\mathcal {R}(S)$ is the intersection of half-spaces $U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}) \ge U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}^{\prime })$ for each pair of ${\boldsymbol{ r}}\in S$ and ${\boldsymbol{ r}}^{\prime } \in D-S$. Formally,

\begin{equation} \mathcal {R}(S) = \bigcap _{ {\boldsymbol{ r}}\in S, {\boldsymbol{ r}}^{\prime } \in D-S} \;\; \lbrace {\boldsymbol{ v}}\in \Delta ^{d-1}| U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}) \ge U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}^{\prime })\rbrace . \end{equation}

(1)

If $\mathcal {R}(S) = \emptyset$, set S is not a possible $\text{top-}k$ result and we call it infeasible. On the other hand, if $\mathcal {R}(S) \ne \emptyset$, set S is a feasible $\text{top-}k$ result and, geometrically, $\mathcal {R}(S)$ (as the intersection of a finite number of half-spaces) is a convex polytope [16].

Let $S_{ {\boldsymbol{ w}}}$ be the $\text{top-}k$ set for the seed vector w and $\mathcal {R}_{ {\boldsymbol{ w}}}$ be the respective result region. For now, assume that we derive $\mathcal {R}_{ {\boldsymbol{ w}}}$ by Equation (1) using, however, the k-skyband instead of D, since it includes every record that could appear in any $\text{top-}k$ result (and thus any record that may affect $\mathcal {R}_{ {\boldsymbol{ w}}}$). Still, the k-skyband may be large and, hence, the number of intersected half-spaces excessive, but this can be improved as we explain later, in Section 6.2. A high-level outline of $\mathsf {FORU}$ is the following:

With $\mathcal {R}_{ {\boldsymbol{ w}}}$ at hand, $\mathsf {FORU}$ initially determines every possible $\text{top-}k$ set when the preference vector moves infinitesimally outside $\mathcal {R}_{ {\boldsymbol{ w}}}$ and derives accordingly the result regions for these $\text{top-}k$ sets. After computing all the neighbor regions to $\mathcal {R}_{ {\boldsymbol{ w}}}$, the process is repeated to determine their own neighbor regions, thus progressively tessellating the preference domain around w until the union of encountered $\text{top-}k$ sets hits the desired output size m.

Neighbor Region/Set: Consider a non-empty result region $\mathcal {R}(S)$. Formally, we define as neighbor region every non-empty result region $\mathcal {R}(S^{\prime })$ that shares at least one extreme vertex with $\mathcal {R}(S)$. Similarly, we refer to $S^{\prime }$ as a neighbor set to S. Figure 6 illustrates the result region $\mathcal {R}_{ {\boldsymbol{ w}}}$ that (includes the seed vector ${\boldsymbol{ w}}$ and) corresponds to $S_{ {\boldsymbol{ w}}}$ and the result region $\mathcal {R}(S)$ of another $\text{top-}k$ set, S. As $\mathcal {R}(S)$ has a common extreme vertex with $\mathcal {R}_{ {\boldsymbol{ w}}}$ (e.g., ${\boldsymbol{ v}}_1$), it is a neighbor region to $\mathcal {R}_{ {\boldsymbol{ w}}}$, and set S is a neighbor set to $S_{ {\boldsymbol{ w}}}$.

Fig. 6.

Lemma 4.

Given two neighboring result regions $\mathcal {R}(S)$ and $\mathcal {R}(S^{\prime })$ and letting ${\boldsymbol{ v}}$ be an extreme vertex they have in common, the $\text{top-}k$ set of ${\boldsymbol{ v}}$ is a superset of S and a superset of $S^{\prime }$.

Proof.

Consider a feasible set S and recall Equation (1). Any vector ${\boldsymbol{ v}}$ on the boundary of $\mathcal {R}(S)$ lies in at least one hyper-plane $U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}) = U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}^{\prime })$, where ${\boldsymbol{ r}}\in S, {\boldsymbol{ r}}^{\prime } \in D-S$. That is, the $\text{top-}k$ set of ${\boldsymbol{ v}}$ includes every record in S but also at least one extra record ${\boldsymbol{ r}}^{\prime } \notin S$. We can similarly show that the $\text{top-}k$ set for any vector ${\boldsymbol{ v}}$ on the boundary of $\mathcal {R}(S^{\prime })$ includes $S^{\prime }$ and at least one extra record. From the hypothesis that ${\boldsymbol{ v}}$ is a common extreme vertex of $\mathcal {R}(S)$ and $\mathcal {R}(S^{\prime }),$ it follows that ${\boldsymbol{ v}}$ lies on both their boundaries, and thus its $\text{top-}k$ set is a superset of S and of $S^{\prime }$. □

Referring to Figure 6, assume that $k=3$, that $S_{ {\boldsymbol{ w}}} = \lbrace {\boldsymbol{ r}}_1, {\boldsymbol{ r}}_2, {\boldsymbol{ r}}_3\rbrace$, and that the $\text{top-}k$ set for ${\boldsymbol{ v}}_1$ is $S_{ {\boldsymbol{ v}}_1} = \lbrace {\boldsymbol{ r}}_1, {\boldsymbol{ r}}_2, {\boldsymbol{ r}}_3, {\boldsymbol{ r}}_4, {\boldsymbol{ r}}_5 \rbrace$. According to Lemma 4, for any neighbor region that has ${\boldsymbol{ v}}_1$ as an extreme vertex (like $\mathcal {R}(S)$ in the figure), the respective $\text{top-}k$ set S is a subset of $S_{ {\boldsymbol{ v}}_1}$. $\mathsf {FORU}$ uses that fact to discover each and every neighbor region to $\mathcal {R}_{ {\boldsymbol{ w}}}$ that shares ${\boldsymbol{ v}}_1$ as an extreme vertex; recall that, although $S_{ {\boldsymbol{ w}}}$ (and thus $\mathcal {R}_{ {\boldsymbol{ w}}}$ and its extreme vertices) is clear how to derive, the neighbor regions are unknown. To compute the neighbor regions (and thus expand the tessellation around ${\boldsymbol{ w}}$) we exploit the fact that $S \subset S_{ {\boldsymbol{ v}}_1}$ and consider every possible k-combination from $S_{ {\boldsymbol{ v}}_1}$ as a candidate $\text{top-}k$ set, e.g., $S = \lbrace {\boldsymbol{ r}}_1, {\boldsymbol{ r}}_4, {\boldsymbol{ r}}_5 \rbrace$. For each of the candidate $\text{top-}k$ sets, we apply Equation (1), using, however, $S_{ {\boldsymbol{ v}}_1}$ in place of D; i.e., in our example where $S = \lbrace {\boldsymbol{ r}}_1, {\boldsymbol{ r}}_4, {\boldsymbol{ r}}_5 \rbrace$, the only non-result records ${\boldsymbol{ r}}^{\prime }$ we need to consider in Equation (1) are ${\boldsymbol{ r}}_2$ and ${\boldsymbol{ r}}_3$. Those feasible among the candidate sets, i.e., those where $\mathcal {R}(S) \ne \emptyset$, lead to the neighbor regions that share ${\boldsymbol{ v}}_1$ with $\mathcal {R}_{ {\boldsymbol{ w}}}$, like the one illustrated in Figure 6.

An important point is that, as we explained in the beginning of the section, the $\text{top-}k$ set for a neighbor region could include more than k records. Although we consider combinations of exactly k records from $S_{ {\boldsymbol{ v}}_1}$, even if there are ties in the top-k-th position and some tying records are cut out from a k-combination, such records will not be missed by the overall process, because they are bound to appear in another feasible k-combination.

By repeating the above investigation for each extreme vertex of $\mathcal {R}_{ {\boldsymbol{ w}}}$, we can derive all neighbor regions/sets to $\mathcal {R}_{ {\boldsymbol{ w}}}$, completing one hop in the expansion. The tessellation could continue in a “breadth-first” manner, completing one hop after the other. This, however, is not the most efficient approach, since expansion is bounded by distance (i.e., the stopping radius $\rho$) instead of number of hops. For that reason, $\mathsf {FORU}$ employs a min-heap Q that organizes encountered extreme vertices and result regions by minimum distance to ${\boldsymbol{ w}}$. This way, it incorporates their $\text{top-}k$ sets in the overall output in the appropriate order and prioritizes the derivation of new result regions in increasing distance from ${\boldsymbol{ w}}$ to save computations. In the following section, we present the technical specifics of $\mathsf {FORU}$ in the form of pseudo-code.

6.2 Technical Specifics

Algorithm 1 summarizes the $\mathsf {FORU}$ algorithm. Line 1 performs a regular $\text{top-}k$ computation using a standard technique, like BBR [79] (the same holds for Lines 9 and 34). Line 3 initializes $\mathcal {G}$, an array of size m, which at the end of the process will hold the output of $\mathsf {FORU}$. $\mathcal {G}$ maintains records sorted in ascending order of a key (a distance to w, in particular), initialized to $\infty$ for its unoccupied slots.

Line 4 initializes $T_s$, whose purpose is to avoid examining the same k-combination of records multiple times. Specifically, its elements are $\text{top-}k$ sets, starting with the seed’s $\text{top-}k$ set (i.e., $S_{ {\boldsymbol{ w}}}$). Each k-combination encountered by $\mathsf {FORU}$ (in Line 19) is book-kept as a separate element in $T_s$. That is essential, since the same k-combination could occur while examining different extreme vertices whose $\text{top-}k$ sets happen to overlap. Similarly, the role of $T_v$ is to avoid considering anew extreme vertices that have already been accounted for (in Line 30) in the context of another neighbor result region.

As explained previously, the min-heap Q (initialized in Line 6) organizes the encountered extreme vertices and result regions. If a heap element corresponds to an extreme vertex ${\boldsymbol{ v}}$, the element includes the respective $\text{top-}k$ set $S_{ {\boldsymbol{ v}}}$, with key the distance between ${\boldsymbol{ v}}$ and ${\boldsymbol{ w}}$. If the heap element corresponds to a result region $\mathcal {R}(S)$, it includes the respective k-combination S, with key the mindist between $\mathcal {R}(S)$ and w. When we pop the heap (Lines 13 and 27), the element with the smallest key is removed from the heap, and its record set ($S_{ {\boldsymbol{ v}}}$ or S) and key value ($dist( {\boldsymbol{ v}}, {\boldsymbol{ w}})$ or $mindist(\mathcal {R}(S), {\boldsymbol{ w}})$, respectively) become available.

Let $S_{ {\boldsymbol{ v}}}$ be the $\text{top-}k$ set and $dist( {\boldsymbol{ v}}, {\boldsymbol{ w}})$ be the key of the popped heap entry. For every record r in $S_{ {\boldsymbol{ v}}}$, we update $\mathcal {G}$ (in Lines 15-18). Specifically, if r is not already in $\mathcal {G}$ and $dist( {\boldsymbol{ v}}, {\boldsymbol{ w}})$ is smaller than the m-th key in $\mathcal {G}$, we insert r into $\mathcal {G}$ with key $dist( {\boldsymbol{ v}}, {\boldsymbol{ w}})$ (evicting the previous m-th record from $\mathcal {G}$ to keep its size to m). If r is already in $\mathcal {G}$ with a key greater than $dist( {\boldsymbol{ v}}, {\boldsymbol{ w}})$, we update its key to $dist( {\boldsymbol{ v}}, {\boldsymbol{ w}})$. The latter situation may occur if r was previously encountered in the $\text{top-}k$ set of another popped heap entry. As an example, consider Figure 6, where $\mathcal {R}$ is pushed into the heap only after vertex ${\boldsymbol{ v}}_1$ has been popped, despite the fact that $\mathcal {R}$ is actually nearer to ${\boldsymbol{ w}}$. If a record ${\boldsymbol{ r}}$ in $S_{ {\boldsymbol{ v}}_1}$ had been inserted into $\mathcal {G}$, and the same record is now encountered in the $\text{top-}k$ set of $\mathcal {R}$, the key of ${\boldsymbol{ r}}$ in $\mathcal {G}$ will be updated to $mindist(\mathcal {R}, {\boldsymbol{ w}})$. The process to update $\mathcal {G}$ when the popped heap entry corresponds to a result region is similar (Line 29).

$\mathsf {FORU}$ terminates when the m-th key in $\mathcal {G}$ is no greater than the key at the top of the heap, at which point $\mathcal {G}$ is finalized and output. The radius $\rho$ in Definition 2 corresponds to the m-th key in the finalized $\mathcal {G}$. Another termination scenario is when the heap becomes empty, which means that the entire preference domain has been explored. The latter case may occur when the possible $\text{top-}k$ records throughout the preference domain are fewer than m.

A low-level optimization regards Lines 23 and 24. To save computations, we may first check the feasibility of S via linear programming [16] before spending the (typically higher) cost to derive the exact geometry of $\mathcal {R}(S)$ via half-space intersection. In Line 25, we compute $mindist(\mathcal {R}(S), {\boldsymbol{ w}})$ using quadratic programming [37, 62].

As a note on Line 23, recall that $\mathcal {R}(S)$ is defined by Equation (1), using $S_{ {\boldsymbol{ v}}}$ in place of D for the half-space intersection. $S_{ {\boldsymbol{ v}}}$, being a $\text{top-}k$ set, is expected to be small, i.e., to include just slightly over k records. Unlike Line 23, however, the derivation of $\mathcal {R}_{ {\boldsymbol{ w}}}$ in Line 2 is more complex. Specifically, $\mathcal {R}_{ {\boldsymbol{ w}}}$ can still be computed by Equation (1), but the non-result records we need to consider are all the k-skyband records (i.e., by using the k-skyband in place of D). Surely, using the k-skyband instead of the entire D in Equation (1) is a relief. However, the k-skyband may still include numerous records, implying a large number of half-spaces to intersect. We may reduce the number of half-spaces based on the following two observations.

First, on closer inspection, even the entire k-skyband is not necessary (when replacing D in Equation (1)). In particular, let $S_{NR}$ be the skyline of set $D-S_{ {\boldsymbol{ w}}}$, i.e., the skyline of the dataset after we exclude the $\text{top-}k$ records for w. The records in $S_{NR}$ are the only possible candidates to enter the $\text{top-}k$ result when the preference vector moves infinitesimally outside $\mathcal {R}_{ {\boldsymbol{ w}}}$. To see this, from the definition of the skyline [17], it follows that for every non-result record ${\boldsymbol{ r}}_j \notin S_{NR}$, there is a record ${\boldsymbol{ r}}_i \in S_{NR}$ that dominates it. For any preference vector ${\boldsymbol{ v}}$, it holds that $U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}_i) \ge U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}_j)$. Therefore, referring to Equation (1), the condition that a result record ${\boldsymbol{ r}}\in S_{ {\boldsymbol{ w}}}$ stays ahead of ${\boldsymbol{ r}}_i$ utility-wise (i.e., the condition $U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}) \ge U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}_i)$) is stricter, and hence subsumes the condition that ${\boldsymbol{ r}}$ stays ahead of ${\boldsymbol{ r}}_j$. Formally, the half-space $U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}) \ge U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}_j)$ is not a bounding half-space of $\mathcal {R}_{ {\boldsymbol{ w}}}$ and can thus be ignored in Equation (1).

Second, consider a result record ${\boldsymbol{ r}}_i \in S_{ {\boldsymbol{ w}}}$ that dominates another result record ${\boldsymbol{ r}}_j \in S_{ {\boldsymbol{ w}}}$. For any preference vector ${\boldsymbol{ v}}$ it holds that $U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}_i) \ge U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}_j)$. Therefore, the condition that ${\boldsymbol{ r}}_j$ stays ahead of a non-result record ${\boldsymbol{ r}}^{\prime }$ utility-wise (i.e., the condition $U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}_j) \ge U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}^{\prime })$) is stricter, and thus subsumes the condition that ${\boldsymbol{ r}}_i$ stays ahead of ${\boldsymbol{ r}}^{\prime }$. In other words, the half-space $U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}_i) \ge U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}^{\prime })$ is not a bounding half-space of $\mathcal {R}_{ {\boldsymbol{ w}}}$ and can be ignored in Equation (1). Letting $S_R$ be the set of result records ${\boldsymbol{ r}}_j \in S_{ {\boldsymbol{ w}}}$ that do not dominate any other result record ${\boldsymbol{ r}}_i \in S_{ {\boldsymbol{ w}}}$, and using set $S_{NR}$ from the previous observation, $\mathcal {R}_{ {\boldsymbol{ w}}}$ is the intersection $\bigcap _{ {\boldsymbol{ r}}\in S_R, {\boldsymbol{ r}}^{\prime } \in S_{NR}}\lbrace {\boldsymbol{ v}}\in \Delta ^{d-1}| U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}) \ge U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}^{\prime })\rbrace$.

6.3 $\mathsf {FORU}$ and Order Sensitivity

As explained in the beginning of Section 6, a key difference between $\mathsf {BORU}$ and $\mathsf {FORU}$ is that the latter centers its execution around order-insensitive $\text{top-}k$ results. Although order sensitivity is not required by Definition 2, if it is desired, $\mathsf {FORU}$ can still be used.

Specifically, consider the definition of a result region $\mathcal {R}(S)$ and Equation (1). The order-sensitive result region for a specific (ordered) $\text{top-}k$ result is the intersection of $\mathcal {R}(S)$ from Equation (1) with a half-space $U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}_{i}) \ge U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}_{i+1})$ for each pair of consecutive records ${\boldsymbol{ r}}_{i}, {\boldsymbol{ r}}_{i+1}$ in the (ordered) $\text{top-}k$ result. Effectively, (1) the half-spaces involved in Equation (1) ensure that the result records stay ahead of the non-result ones, and (2) the newly introduced half-spaces (henceforth referred to as order-preserving half-spaces) ensure that the record order within the $\text{top-}k$ result stays the same. It follows that the order-sensitive result region for a $\text{top-}k$ result is contained in its order-insensitive counterpart.

Now, recall that $\mathsf {FORU}$ offers a tessellation of the preference domain around w into (order-insensitive) result regions $\mathcal {R}(S)$, and for each of them it reports the respective $\text{top-}k$ set S. Given such a region $\mathcal {R}(S)$, we may partition it into order-sensitive regions by considering every possible ranking within S and intersecting $\mathcal {R}(S)$ with the corresponding order-preserving half-spaces. If the intersection is empty, it means that the considered ordering is infeasible; otherwise, it is feasible and we know its (order-sensitive) result region. With $\mathsf {FORU}$’s stopping radius $\rho$ at hand, we may simply report the (order-sensitive) result regions that are within distance $\rho$ from w, together with their respective (ordered) $\text{top-}k$ sets.

6.4 Analytical Comparison to $\mathsf {BORU}$

In this section, we perform an analytical comparison between $\mathsf {BORU}$ and $\mathsf {FORU}$ to get an indication of the relative performance we expect in practice.

The complexity of $\mathsf {BORU}$ is dominated by the geometric computations spent on the candidate set M, and specifically in the computation of the preference regions C across the nodes of the implicit tree (elaborated in Section 5.3.1 and exemplified in Figure 5). The candidate set M is a subset of the k-skyband, whose expected cardinality is $\frac{k\ln ^{d-1} |D|}{d!}$ [36]. Hence, $|M|$ is in $O(\frac{k\ln ^{d-1} |D|}{d!})$. Deriving the region (convex polytope) C in a node $N_j$ of the implicit tree is equivalent to intersecting the preference regions that correspond to all the nodes along the path from $N_j$ to the root of the implicit tree and therefore (1) the time spent in region computations is dominated by those for the leaves of the implicit tree and (2) the time spent for each of these regions is upper bounded by the time to intersect $|M|$ half-spaces, that is, $O(|M|^{\lfloor {d/2} \rfloor })$ time [11]. Now, the number of leaves in the implicit tree is in the order of possible order-sensitive $\text{top-}k$ results within radius $\rho$, which in turn is upper bounded by the permutations selecting k records from the m in the $\mathsf {ORU}$ output, i.e., $\frac{m!}{(m-k)!}$. Overall, the time complexity of $\mathsf {BORU}$ is $O(\frac{m!}{(m-k)!}|M|^{{\lfloor {d/2} \rfloor }})$. The latter complexity also applies to the space requirements of $\mathsf {BORU}$, since the geometric representation of each node’s preference region takes $O(|M|^{{\lfloor {d/2} \rfloor }})$ space [16]. Note that the derived time/space complexities are greater than (and thus subsume) the computation time/space required for the k upper hull layers, i.e., $O(k|M|^{{\lfloor {d/2} \rfloor }})$ [21].

Returning to the $\mathsf {FORU}$ algorithm detailed in Section 6.2, its complexity too is dominated by the computational geometric primitives it calls on, and in particular by the computation of the different result regions $\mathcal {R}(S)$ it needs to consider. Due to the best-first fashion of the tessellation/expansion of $\mathsf {FORU}$, the total number of records it encounters is in $O(m)$. This implies that $\mathsf {FORU}$ examines $O(\binom{m}{k}) = O(\frac{m!}{k!(m-k)!})$ k-combinations of records. For the feasible among them (i.e., those where the result region $\mathcal {R}(S) \ne \emptyset$), it needs to compute the exact geometry of $\mathcal {R}(S)$ as the intersection of $O(m)$ half-spaces, requiring $O(m^{{\lfloor {d/2} \rfloor }})$ time [11]. Therefore, $\mathsf {FORU}$’s total time complexity is $O(\frac{m!}{k!(m-k)!} m^{{\lfloor {d/2} \rfloor }})$. The same complexity applies for its space requirements, since they are dominated by the size of its heap; at any point the heap contains a maximum of $O(\frac{m!}{k!(m-k)!})$ result regions, whose geometric representation requires $O(m^{{\lfloor {d/2} \rfloor }})$ space each [16].

In summary, the time/space complexity of $\mathsf {BORU}$ is $O(\frac{m!}{(m-k)!}|M|^{{\lfloor {d/2} \rfloor }})$ vis-à-vis $O(\frac{m!}{k!(m-k)!} m^{{\lfloor {d/2} \rfloor }})$ for $\mathsf {FORU}$. These complexities involve two factors that reflect the core differences between the two algorithms. First, the permutation (for $\mathsf {BORU}$) instead of the combination (for $\mathsf {FORU}$) is aligned with $\mathsf {BORU}$’s order sensitivity versus $\mathsf {FORU}$’s order insensitivity in the $\text{top-}k$ sets they consider/report. Second, factor $O(|M|)$ (for $\mathsf {BORU}$) versus $O(m)$ (for $\mathsf {FORU}$) represents $\mathsf {BORU}$’s exploration of a larger part of the preference domain due to its starting by an overestimate of $\rho$, as opposed to $\mathsf {FORU}$’s starting from w and its tight incremental expansion until the desired output size is reached.

7 Approximate $\mathsf {ORU}$ Processing

Our experiments corroborate the above comparison and demonstrate that $\mathsf {FORU}$ indeed achieves major performance improvements over $\mathsf {BORU}$. If even faster processing is required, however, and some inaccuracy can be tolerated, approximate computation is also possible. In this section, we present the approximate ORU algorithm ($\mathsf {AORU}$), which offers a controllable tradeoff between accuracy and running time and comes with proven approximation guarantees. $\mathsf {AORU}$ receives an extra input, $\delta$, which controls accuracy and determines the bounds of the approximation.

7.1 Algorithm Description

At the core of $\mathsf {AORU}$ lies the drill operation. A drill is a regular $\text{top-}k$ query for a preference vector, called the drill location. The general idea is to perform repetitive drills around w, with drill locations gradually moving farther from it, until the union of their $\text{top-}k$ results includes at least m distinct records. The drill locations are determined by parameter $\delta$.

In particular, we utilize an (implicit) partitioning of the preference domain according to d-dimensional hyper-cubes. The side-length of the hyper-cubes is $\frac{\delta }{d}$, so that the main diagonal of each hyper-cube has an $L^1$-norm length of $\delta$. The partitioning is centered around w. Note that since the preference domain is a $(d-1)$-simplex and the hyper-cubes are d-dimensional, we only consider hyper-cubes that overlap with the preference domain. In the 2-dimensional example of Figure 7, the relevant hyper-cubes (squares, in $d=2$) are drawn with a solid boundary. Their side-length is $\frac{\delta }{d} = \frac{\delta }{2}$, except for the border ones that may be smaller (since the partitioning is centered around w).

Fig. 7.

$\mathsf {AORU}$ performs the first drill at w itself. Subsequently, it considers the hyper-cubes one by one, in increasing minimum distance to w. For each considered hyper-cube in the aforementioned order, it performs a drill at the hyper-cube’s centroid. The records retrieved by the repetitive drills are appended to the $\mathsf {AORU}$ output (avoiding duplicates) until its size hits m and the algorithm terminates. In the event that the last drill introduces to the output more new records than necessary, we only include the higher-ranking among them to bring the output size to exactly m.

An implementation remark is that the partitioning of the preference domain into hyper-cubes is implicit, meaning that not all hyper-cubes need be exhaustively produced. Instead, since the partitioning is centered around w, we first produce the hyper-cubes that have one corner at w and whose open representation intersects the preference domain (i.e., hyper-cubes $\mathcal {Q}_1$ and $\mathcal {Q}_2$ in Figure 7), continue with their own neighboring hyper-cubes ($\mathcal {Q}_3$ and $\mathcal {Q}_4$ in our example), and so on incrementally, as needed by the $\mathsf {AORU}$ process until its output is complete. Note that the incremental generation of hyper-cubes around w hints directly to their order of consideration (i.e., drilling) by $\mathsf {AORU}$.

Another technical detail is that for $d\gt 2$ the centroids of the hyper-cubes lie higher than the preference domain, i.e., have coordinates where $\sum _{i=1}^d w_i \gt 1$. To demonstrate, a simple example for $d=3$ is that the centroid of the unit cube is vector $(0.5, 0.5, 0.5)$, which lies above the respective preference domain, i.e., above $\Delta ^{2} = \lbrace {\boldsymbol{ v}}\in {\rm I\!R}_+^3 | \sum _{i=1}^3 w_i =1\rbrace$. That fact, however, does not necessitate any normalization to the centroids prior to the drills; i.e., it does not affect the design of $\mathsf {AORU}$ (nor the validity of its accuracy analysis that follows in Section 7.2), because the magnitude of the centroids does not affect the retrieved $\text{top-}k$ results, as we established in the beginning of Section 3 for the preference vectors in general.

A high-level remark about $\mathsf {AORU}$ regards the twofold purpose it serves in our study. First, as originally (and primarily) motivated, it complements our algorithmic suite with a faster alternative to exact $\mathsf {ORU}$ algorithms. Second, the minimalism of its design enriches the performance evaluation of the exact $\mathsf {ORU}$ algorithms, giving a better sense of how efficient they are compared to a very intuitive (yet inexact) $\mathsf {ORU}$ processing approach. In particular, $\mathsf {AORU}$ relies on repetitive $\text{top-}k$ retrievals (drills) using the established BBR algorithm, which has trivial computational requirements [79]. Regarding the number of drills, if we assume that each drill is equally likely to contribute a new record to the output, $\mathsf {AORU}$ requires $O(m)$ calls to BBR.

In relation to the exact $\mathsf {ORU}$ algorithms, aside from performance, we must mention that $\mathsf {AORU}$ too executes an expansion around w, but instead of considering every possible preference vector along the way, it only accounts for some of them (namely, the drill locations). This implies that the extent of $\mathsf {AORU}$’s expansion is naturally greater. Specifically, its stopping radius corresponds to the distance between w and the farthest drill location, and that distance can be no smaller than the stopping radius of the exact algorithms. A corollary of this fact is that, up to the radius of the exact $\mathsf {ORU}$ solution (i.e., up to radius $\rho$ in Definition 2), all the drills, by definition, introduce actual $\mathsf {ORU}$ results in $\mathsf {AORU}$’s output (but this may not hold for farther drill locations).

7.2 Accuracy Analysis

In this section, we analyze $\mathsf {AORU}$’s accuracy. To facilitate presentation, note that just like the magnitude of a preference vector, the scaling of dataset D by a constant factor does not affect the ordering of D by utility. In particular, letting $\alpha$ be the maximum value across all attributes of any record in D, the $\mathsf {ORU}$ result on D (for a given ${\boldsymbol{ w}}$ and m) is identical to the $\mathsf {ORU}$ result if we scale (i.e., multiply) every record of D by constant factor $\frac{1}{\alpha }$. Without loss of generality, we assume that $\alpha = 1$.

Let $\rho$ be the stopping radius of the exact $\mathsf {ORU}$ operator and ${\boldsymbol{ v}}$ be the preference vector of a user where $| {\boldsymbol{ v}}- {\boldsymbol{ w}}| \lt \rho$. By Definition 2, $\mathsf {ORU}$’s output includes the user’s $\text{top-}k$ records in D, i.e., those with the maximum utility. We will center our analysis on the utility loss incurred in the $\text{top-}k$ result of ${\boldsymbol{ v}}$ when instead of the exact $\mathsf {ORU}$ output, the user is presented with $\mathsf {AORU}$’s output.

Since (as elaborated at the end of Section 7.1) $\mathsf {AORU}$’s search extends at least up to radius $\rho$ from ${\boldsymbol{ w}}$, it considers (i.e., drills) the hyper-cube $\mathcal {Q}$ that includes vector ${\boldsymbol{ v}}$. Let c be the centroid of $\mathcal {Q}$ and sets $S_{ {\boldsymbol{ c}}}$ and $S_{ {\boldsymbol{ v}}}$ be the $\text{top-}k$ results for c and v, respectively (where possible ties in the top-k-th position are resolved arbitrarily). If a record ${\boldsymbol{ r}}^{out}$ belongs to $S_{ {\boldsymbol{ v}}}$ but not in the $\mathsf {AORU}$ output, this means that ${\boldsymbol{ r}}^{out} \notin S_{ {\boldsymbol{ c}}}$ and, consequently, that there must be a record ${\boldsymbol{ r}}^{in} \in S_{ {\boldsymbol{ c}}}$ that does not belong to $S_{ {\boldsymbol{ v}}}$. The following theorem bounds (by $\delta$) the utility loss experienced by the user (i.e., preference vector ${\boldsymbol{ v}}$) if ${\boldsymbol{ r}}^{in}$ takes the place of ${\boldsymbol{ r}}^{out}$ in the $\text{top-}k$ result.

Theorem 2.

The utility of ${\boldsymbol{ r}}^{out}$ for v cannot be more than $\delta$ higher than the utility of ${\boldsymbol{ r}}^{in}$ for v, i.e., $U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}^{out}) - U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}^{in}) \le \delta$.

The $\mathsf {ORU}$ operator (Definition 2) is centered on the relevant vectors’ entire $\text{top-}k$ sets as opposed to individual records therein. Hence, a meaningful use of Theorem 2 is to bound the total (i.e., summed) utility loss experienced by v across its entire $\text{top-}k$ set. Corollary 1 bounds that loss.

Corollary 1.

Let k-sets $S_{ {\boldsymbol{ v}}}$ and $\bar{S}_{ {\boldsymbol{ v}}}$ represent the $\text{top-}k$ result for v in the entire D and in the output of $\mathsf {AORU}$, respectively. If $\bar{S}_{ {\boldsymbol{ v}}}$ is reported for v instead of $S_{ {\boldsymbol{ v}}}$, the summed utility loss cannot exceed $k \cdot \delta$, i.e., $\sum _{ {\boldsymbol{ r}}\in S_{ {\boldsymbol{ v}}}} U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}) - \sum _{ {\boldsymbol{ r}}\in \bar{S}_{ {\boldsymbol{ v}}}} U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}) \le k \cdot \delta$.

Proof.

Theorem 2 can be applied inductively to replace each record ${\boldsymbol{ r}}^{out} \in S_{ {\boldsymbol{ v}}}$ that is missed by $\mathsf {AORU}$ with a separate and distinct output record ${\boldsymbol{ r}}^{in} \in S_{ {\boldsymbol{ c}}}$ such that $U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}^{out}) - U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}^{in}) \le \delta$. In the worst case (where all k members of $S_{ {\boldsymbol{ v}}}$ are substituted), the summed utility loss is upper bounded by $k \cdot \delta$. □

We note that several common definitions of utility for ranked lists (e.g., discounted cumulative gain and its variants [47, 86]) associate a decreasing, non-negative weight $\lambda _i$ per rank position and report the weighted sum of the records’ utilities along the list as its overall utility. That is, if the (ordered) $\text{top-}k$ result for vector v is $\lbrace {\boldsymbol{ r}}_1, {\boldsymbol{ r}}_2, \ldots , {\boldsymbol{ r}}_k \rbrace$, the result’s overall utility is defined as $\sum _{i=1}^k \lambda _i \cdot U_{ {\boldsymbol{ v}}}( {\boldsymbol{ r}}_i)$. For any aggregate measure of that type, following the same lines as Corollary 1, Theorem 2 guarantees that $\mathsf {AORU}$ incurs an overall utility loss for v that cannot exceed $\sum _{i=1}^k \lambda _i \delta$.

8 Discussion: An $\mathsf {ORD}$/$\mathsf {ORU}$ Use Case for Unspecified k

Starting with an observation on the relationship between the input parameter k and our operators’ stopping radius $\rho$, in this section we broaden their applicability to situations where k is also left for the operators to decide, according to a balancing (cost) function between k and $\rho$. The following discussion applies equally to both $\mathsf {ORD}$ and $\mathsf {ORU}$.

From Definitions 1 and 2, it follows that (for given m) the operators’ stopping radius $\rho$ is monotonically decreasing with k. The radius $\rho$ determines the “width” of preference relaxation, while k determines the “depth” of the search, i.e., the deepest skyband or convex hull layer (in $\mathsf {ORD}$ and $\mathsf {ORU}$, respectively) that may contribute records to the operator’s output. Given that the width ($\rho$) and depth (k) render different types of diversity among the output records, it could be meaningful to define $\mathsf {ORD}$/$\mathsf {ORU}$ versions where the application specifies a cost function $Cost(k, \rho)$ that is increasingly monotone to k and $\rho$ and thus determines implicitly the desired balance between the width and the depth of the search. It is then required from the operator to identify the value of k that minimizes $Cost(k, \rho)$ and output the respective $\mathsf {ORD}$/$\mathsf {ORU}$ result for that k.

A straightforward approach would be to compute the standard $\mathsf {ORD}$/$\mathsf {ORU}$ result for every $k \in [1, m]$ and report the one that minimizes $Cost(k, \rho)$. That, however, requires $m-1$ calls to the standard $\mathsf {ORD}$/$\mathsf {ORU}$ operator³ and can be improved on. Since k determines $\rho$ and $\rho$ decreases monotonically with k, it follows that $Cost(k, \rho)$ is a convex function of k. The optimal k can therefore be identified by trying a total of $O(\log _2 m)$ values in the discrete domain $[1, m]$ (equivalently, by making $O(\log _2 m)$ calls to the standard $\mathsf {ORD}$/$\mathsf {ORU}$ operator) using binary search [66]. In a nutshell, if at any point the feasible range of the binary search is $[k_L, k_R]$, we (call standard $\mathsf {ORD}$/$\mathsf {ORU}$ and) evaluate the cost for $k = \lfloor {\frac{k_L+k_R}{2}} \rfloor$ and for $k^{\prime } = k+1$. If the cost for k is no greater than the cost for $k^{\prime }$, we continue left (i.e., set the feasible range to $[k_L,k]$); otherwise, we continue right (i.e., set the feasible range to $[k^{\prime }, k_R]$). Starting with $[1, m]$ as the initial feasible range, we repeat the halving process until the feasible range becomes trivial.

9 Experiments

In this section, we present qualitative and performance experiments. We use both real and synthetic datasets, i.e., HOTEL, HOUSE, NBA, and COR, IND, ANTI, respectively. HOTEL contains 166,378 hotel records with $d=4$ attributes [1]. HOUSE includes 315,265 records of $d=6$ types of household expenses [2]. NBA holds $d=8$ statistics for 21,960 NBA players [4]. The synthetic datasets represent typical distributions in multi-objective decisions [17]. Table 2 lists the problem parameters with their tested and default values (in bold). In each experiment, we vary one parameter and fix the others to their defaults. Every measurement is the average for 50 random seeds w. The datasets are indexed by R-trees and kept in memory. All algorithms were implemented in C++ and run on a machine with Intel i7-7700 CPU at 3.60GHz and 32Gb RAM. We used lp_solve [3] as the linear programming solver, QuadProg++ [32] as the quadratic programming solver, and Qhull [12] for computational geometric primitives. Our implementation is available at [5].

Table 2.

Parameter	Tested and Default Values
Dataset cardinality $\|D\|$	100K, 400K, 1.6M, 6.4M, 25.6M
Dimensionality d	2, 3, 4, 5, 6, 7
Parameter k	1, 5, 10, 15, 20
Output size m	10, 30, 50, 70, 90

Table 2. Parameters, Tested Values, and Defaults

We start with qualitative results to draw a distinction between our operators and representative previous ones from Table 1.

9.1 Qualitative Study

First, we perform a case study to visualize how the results of $\mathsf {ORD}$ and $\mathsf {ORU}$ differ from (1) an OSS skyline and (2) a top-m query for the same vector w. Focused on the results of our operators (rather than processing efficiency), note that $\mathsf {ORD}$ and $\mathsf {ORU}$ in this case study refer to the problem outputs in Definitions 1 and 2, respectively, and are orthogonal to the choice of processing algorithm (e.g., $\mathsf {BORU}$ or $\mathsf {FORU}$). From the OSS skyline family, we use as representative the most cited full-dimensionality formulation [57], which reports the m skyline members that dominate the most non-skyline records. We use the NBA 2018–2019 season statistics for the total of its 708 players on Assists and Rebounds (in Figure 8(a)) and Points and Rebounds (in Figure 8(b)), normalized in the $[0,1]$ range. We set $k=2$ and $m=6$ and use $(0.49,0.51)$ and $(0.43, 0.57)$ as the seed w, respectively. The results of the methods are illustrated as differently oriented/colored triangles.

Fig. 8.

A first observation is that our operators report distinct results from past approaches (and from each other). For example, in Figure 8(a) only $\mathsf {ORU}$ reports Trae Young (rising stars challenge player), while half of its output records are not in the top-m result, and one-third of them are not in the OSS skyline. The comparison of our operators with top-m reveals an even more interesting fact. In Figure 8(a), top-m misses Andre Drummond (the rebound leader in the 2018–2019 season), whom it would report if we only slightly revised w from $(0.49,0.51)$ to $(0.48,0.52)$. Similarly, in Figure 8(b), top-m misses James Harden (the season’s scoring leader), whom it would report if we revised w from $(0.43, 0.57)$ to $(0.44, 0.56)$. The inclusion of these players in the result of both of our operators (in the respective Figure 8(a) and 8(b) settings) confirms that they successfully employ some “width” in their search by reporting records that are particularly strong for alternative, very similar preferences to the seed w. Investigations in our full-scale experimental settings, using the Jaccard coefficient as the similarity measure, demonstrate that (1) the OSS skyline is more dissimilar to our operators than top-m, as it is not guided by w, and (2) top-m becomes increasingly more dissimilar to $\mathsf {ORD}$/$\mathsf {ORU}$ as m grows. We omit the charts for brevity, but as an indication, for IND data and the default parameters in Table 2, the Jaccard similarity of the OSS skyline to $\mathsf {ORD}$ is 0.25, and to $\mathsf {ORU}$ it is 0.24. In the same setting, the Jaccard similarity of top-m to $\mathsf {ORD}$ is 0.44, and to $\mathsf {ORU}$ it is 0.32.

Next, we use real TripAdvisor data and reviews (TA) [6]. TA includes ratings for 1,850 hotels on $d = 7$ aspects (value, room, location, etc.), forming dataset D. It also includes actual user reviews for these hotels, each comprising a comment and an overall score. The reviews offer a practical example of how preference vectors could be extracted for real users. Specifically, we employ [84], an established preference mining method that estimates a user’s weight vector based on his or her reviews. Applying [84], we get vectors for 137,563 users. The dataset and vectors from TA are used in the next experiment on fixed-region methods [24, 64], but at the same time they offer an end-to-end application example for our techniques.

The fixed-region methods require a preference polytope R to be specified as part of their input. We demonstrate that it is not feasible to estimate the size of R required to produce, even approximately, m records. Worse yet, even the same polytope R, when positioned at different parts of the preference domain, can produce vastly different output sizes. Specifically, we use $k=5$ and vary m from 10 to 20 (since the dataset in TA is highly correlated and its 5-skyband includes only 61 hotels). We first execute $\mathsf {ORU}$ for 50 randomly selected TA users (preference vectors) and record the average stopping radius, denoted as $\rho ^*$. We then produce a hyper-cube R with volume equal to the hyper-sphere with radius $\rho ^*$. Next, we count the number of distinct records (TA hotels) output by the fixed-region $\text{top-}k$ operator in [64] when R is positioned around each of the 50 user vectors (we use [64] since [24] works only for $k=1$). Figure 9(a) presents in box-plot the observed variation in output size. To confirm, in Figure 9(b), we repeat this process for our full-scale setting, using random preference vectors, IND data, the default parameters, and standard range for m from Table 2. The box-plots indicate that even with knowledge of $\rho ^*$, fixed-region techniques can produce dramatically different output sizes, which may hugely under- or over-shoot the target m. In contrast, our operators relieve the user/application from the need to meddle with the preference domain’s complex dynamics and abide strictly by the requested output size m. Note that the counterpart of this experiment, using $\mathsf {ORD}$ to compute $\rho ^*$ and producing the fixed-region R-skyband, demonstrates an even greater variability than Figure 9 for both TA and IND (charts omitted for brevity). For example, for target $m=50$ in IND data, the fixed-region R-skyband outputs from 12 all the way to 269 records.

Fig. 9.

9.2 Performance Evaluation for $\mathsf {ORD}$ Processing

In this section, we evaluate the performance of our $\mathsf {ORD}$ algorithm. For comparison, we adapted a fixed-region R-skyband technique ($\mathsf {RSB}$), as sketched in Section 2. In the adaptation, the initial hyper-cube R is sized so that its volume ratio to the preference domain equals the ratio of m to the expected k-skyband cardinality (i.e., $\frac{k\ln ^{d-1} |D|}{d!}$, according to [36]). Based on the size of the R-skyband computed for the initial R, its volume is re-estimated proportionally to the desired m. The trials (R-skyband computation and R re-estimation) are repeated until the output is within 10% from m. For the implementation of $\mathsf {RSB}$, we use the index-based R-skyband module of [64], as it is considerably faster than the no-index approach of [24]. We also include baseline ORD-BSL that computes the entire k-skyband according to Section 4.1, without the enhancements in Section 4.2.

In Figure 10, we present (in logarithmic scale) the running time for all competitors versus each problem parameter, using IND data. $\mathsf {ORD}$ is 2 to 4 orders of magnitude faster than $\mathsf {RSB}$, indicating the impracticality of the fixed-region approach to mimic $\mathsf {ORD}$’s output, despite the ample slack given. The main reason is the numerous trials required for $\mathsf {RSB}$ to “converge” to an acceptable deviation from m. The runner-up is ORD-BSL, which is 1 to 2 orders of magnitude slower than the fully enhanced $\mathsf {ORD}$. Moreover, the results also demonstrate $\mathsf {ORD}$’s ability to scale. Its running time grows almost linearly to k and m, and sub-linearly to $|D|$. It increases more sharply with d, due to the growing complexity of its geometric building blocks. Still, even for $d=7$ $\mathsf {ORD}$ takes only 3.2 seconds.

Fig. 10.

Having established the superiority of our $\mathsf {ORD}$ algorithm, in Figure 11(a) we plot its running time for different synthetic distributions, varying m. Processing in ANTI is the slowest. The reason is that dominance (and, by deduction, $\rho$-dominance too) is less common among ANTI records, and thus many candidates need to be considered before $\mathsf {ORD}$ can terminate. COR exhibits the inverse effect.

Fig. 11.

Again varying m, in Figure 11(b), we use real data. NBA is smaller than HOTEL, but it has larger dimensionality, which explains why $\mathsf {ORD}$ is slower on NBA in most cases. At the same time, NBA is significantly more correlated than HOTEL and HOUSE, which is why the running time increases less sharply with m compared to the other two datasets.

9.3 Performance Evaluation for $\mathsf {ORU}$ Processing

Turning to the $\mathsf {ORU}$ problem (Definition 2), we first consider exact processing. Our suite includes two exact algorithms, i.e., $\mathsf {BORU}$ (in Section 5) and $\mathsf {FORU}$ (in Section 6). For comparison, we use an adaptation of a fixed-region $\text{top-}k$ algorithm from [64], termed $\mathsf {JAA}$, to simulate our problem’s output. Specifically, we employ a similar R estimation and trial approach as for $\mathsf {RSB}$, allowing a deviation of up to 10% from the desired output size m. We also include $\mathsf {ORU}$-$\mathsf {BSL}$, a baseline that utilizes the initial overestimate $\bar{\rho }$ in Section 5.3, computes upper hull layers on the entire candidate set M, partitions the $L_1$ top-regions by Theorem 1, and reports the m-sized union of the $\text{top-}k$ records for the closest produced regions to w.

In Figure 12, we plot (in logarithmic scale) the running time of all competitors versus each problem parameter, using IND data. $\mathsf {ORU}$-$\mathsf {BSL}$, although identical to $\mathsf {BORU}$ for $k=1$, generally performs very poorly, failing to terminate within reasonable time in most large settings. This demonstrates the vital role of the progressive tightening of the radius overestimate and the respective elimination of candidate records, presented in Section 5.3.1. Indeed, $\mathsf {BORU}$ is 2 to 4 orders of magnitude faster than $\mathsf {ORU}$-$\mathsf {BSL}$. It is also 12 to 134 times faster than $\mathsf {JAA}$, confirming the general unsuitability of fixed-region approaches to our problems.

Fig. 12.

Having demonstrated the deficiencies of alternatives compared to $\mathsf {BORU}$, we may now focus on the further gains achieved by $\mathsf {FORU}$. Recall that $\mathsf {FORU}$ is one of the two (and by far the most important) technical extensions in this journal article. In Figure 12(a) (varying k) $\mathsf {FORU}$ is 6 to 56 times faster than $\mathsf {BORU}$, with the margin generally growing with k. The main reason behind this drastic improvement lies in the fundamentals of $\mathsf {FORU}$ expansion. Recall that $\mathsf {BORU}$ starts with an overestimate of $\rho$, which is conservatively set so that $\mathsf {ORU}$’s output for $k=1$ includes m records. Since this radius determines the search in deeper upper hull layers, being conservative, it leads to the consideration of a larger part of the upper hulls than necessary. As k grows, more upper hull layers become relevant to the problem, which means that the original overestimate (for $k=1$) is even further off from the actual stopping radius; i.e., effectively the overestimate becomes looser. On the other hand, $\mathsf {FORU}$ does not suffer from this issue, because it starts from w and tessellates the preference domain around it in an incremental, only-if-needed fashion.

In Figure 12(b) (varying m) $\mathsf {FORU}$ is 1 order of magnitude faster than $\mathsf {BORU}$, and the margin grows with m. The major factor at play for this behavior is again the overestimated starting radius in $\mathsf {BORU}$. As m grows, the overestimate necessarily becomes larger, since more top-1-st records (m, to be exact) need be confirmed by its $\mathsf {IRD}$ module (in Section 5.3.2). This means that it also becomes looser, leading to more unnecessary computations in the deeper upper hull layers. $\mathsf {FORU}$, on the other hand, relies on no overestimate and avoids that problem.

In Figure 12(c) (varying $|D|$) $\mathsf {FORU}$ is 3 to 89 times faster than $\mathsf {BORU}$. Moreover, just as with k and m, $\mathsf {FORU}$’s gains over $\mathsf {BORU}$ grow with $|D|$ too, demonstrating that $\mathsf {FORU}$ offers a general improved scalability to larger problem instances. The reason the gap gets wider with $|D|$ lies in the way these algorithms fetch candidate records. $\mathsf {BORU}$ fetches candidates using (partial) upper hull computations, whereas $\mathsf {FORU}$ uses plain $\text{top-}k$ search. Given that the former operation is significantly more complex, the advantage offered to $\mathsf {FORU}$ is amplified as the total number of records increases. Comparison with $\mathsf {BORU}$ aside, one of the boldest findings here is that $\mathsf {FORU}$ delivers sub-second running time even for the largest dataset of 25.6M records!

In Figure 12(d) (varying d) $\mathsf {FORU}$ is faster than $\mathsf {BORU}$ in all dimensionalities, being 3 orders of magnitude faster for $d=2$, with the gap progressively narrowing for higher d. The reason for this trend lies in the computational geometric primitives the two algorithms rely on, whose cost increases exponentially with d [16]. As d grows, these primitives consume a larger fraction of the total running time, until they completely dominate it, rendering largely inconsequential the individual optimizations and algorithmic distinctions of the two algorithms. The nullification of algorithmic ingenuity for large d befits the curse of dimensionality in multi-objective querying and its loss of meaning, discussed at the end of Section 3.

The next set of experiments (Figure 13) serves two purposes. The first is to conclude the comparison of the exact approaches by considering other datasets. Specifically, in Figures 13(a) through 13(e), we omit the problematic $\mathsf {JAA}$ and $\mathsf {ORU}$-$\mathsf {BSL}$ and focus on $\mathsf {FORU}$ and $\mathsf {BORU}$ (let us ignore the line of $\mathsf {AORU}$ for now). We present their running time on the remaining synthetic datasets (i.e., COR and ANTI) and on the real datasets, as we vary m according to Table 2. An exception to this setting is NBA; its strong correlation leads to only 31 distinct $\text{top-}k$ records (for the default $k=5$) throughout the entire preference domain, which are insufficient for the larger m values. To make the experiment meaningful, for NBA in particular, we use $k=15$ and a different range of m values (i.e., 30, 35, 40, 45, 50). The big picture from the five datasets is similar to Figure 12(b) for IND data, with $\mathsf {FORU}$ being several times (and up to 3 orders of magnitude) faster than $\mathsf {BORU}$. The only exception is the $m=10$ setting for ANTI data. That is due to $m =10$ being very small (just twice of k) and the fact that in anti-correlated data just a minuscule expansion suffices to output that many records. In turn, this implies that one-off initialization costs dominate the total running time, suppressing the effects of $\mathsf {FORU}$’s more efficient expansion process.

Fig. 13.

The second purpose of Figure 13 is to shed light on approximate processing. In particular, we included $\mathsf {AORU}$ in the first five charts (running time) and present its respective accuracy in a combined diagram (Figure 13(f)). To ensure homogeneity across the datasets, we set the $\delta$ parameter of $\mathsf {AORU}$ such that the side-length of the hyper-cubes is consistent (i.e., 0.035 in our experiments). That is, for a d-dimensional dataset, $\delta$ is set to $0.035\cdot d$ (recall that the synthetic datasets have the default $d=4$, while the real ones have their own native dimensionality, as described in the beginning of the experimental section). To measure $\mathsf {AORU}$’s accuracy, we present its F1 score (with respect to the exact result) in the combined Figure 13(f) that covers all datasets and settings tested in this experiment. The F1 score is the harmonic mean of precision and recall (i.e., $2\frac{precision \cdot recall}{precision + recall}$), which balances them effectively [82].

$\mathsf {AORU}$ is faster than the exact methods in all settings and offers a decent approximation accuracy. However, we observe a deviation from this norm for ANTI, where the running time is particularly short but the accuracy suffers. The reason for this behavior lies in the common hyper-cube side-length we employed across datasets, which is too large for ANTI but enables interesting insights. In particular, in anti-correlated data, there is a high density of different $\text{top-}k$ results around w. Using the same hyper-cube side-length as COR, for instance, leads to earlier termination, because the drills fetch vastly different $\text{top-}k$ results, and thus fewer drills are required to fill $\mathsf {AORU}$’s output with distinct records. On the other hand, the drill locations make large leaps in the preference domain, failing to account for many alternative $\text{top-}k$ results in between, which has a negative effect on accuracy. Another observation is that the accuracy drops with m. A larger m means a wider expansion, i.e., that $\mathsf {AORU}$ needs to drill farther from w. The farther we drill from w, the more likely it is to introduce non-result records to $\mathsf {AORU}$’s output.

In Figure 14, we investigate the effect of the hyper-cube side-length (effectively, of parameter $\delta$) on the running time and accuracy of $\mathsf {AORU}$. We use the default IND dataset and consider side-lengths 0.01, 0.02, 0.03, and 0.04, as indicated by the suffix of the $\mathsf {AORU}$ labels. In Figure 14(a), we also include the running times of $\mathsf {BORU}$ and $\mathsf {FORU}$ for reference. As expected, the side-length directly determines the performance of $\mathsf {AORU}$, because it controls the density (and thus the number) of drills performed. On the other hand, it also determines accuracy; the interesting finding here is that even for the larger side-lengths tested, the accuracy remains reasonable, which suggests that the running time drops much more aggressively than the accuracy.

Fig. 14.

10 Conclusion

In this article, we draw motivation from the known weaknesses of standard skyline and $\text{top-}k$ queries. Based on these shortcomings, we identify three hard requirements for practical decision support in multi-objective settings: personalization, controllable output size, and flexibility in preference specification. We define two operators to bridge that gap and develop a suite of algorithms for their efficient processing. Our qualitative analysis indicates that they offer a novel type of support, distinct from past practices. Also, our experiments demonstrate that our algorithms deliver practical and scalable performance.

A promising direction for future work is to consider our problems in highly skewed or sparse datasets, where multi-objective querying may be meaningful in higher dimensions too [93]. To this end, we will need to employ different indexing approaches so that sub-spaces can be effectively isolated and dealt with, e.g., per-attribute sorted lists. Furthermore, we will need to develop new algorithms, probably in the spirit of expressing geometric conditions in the form of attribute thresholds along the sorted lists, akin to [25] for R-skyband computation in distributed settings.

Footnotes

To see this, a fact about traditional dominance is that if ${\boldsymbol{ r}}_i$ scores no lower than ${\boldsymbol{ r}}_j$ for every vector in the preference domain and higher for at least one vector therein, then (and only then) ${\boldsymbol{ r}}_i$ dominates ${\boldsymbol{ r}}_j$ [23].

The largest meaningful $\rho$ (although visualized as $\infty$) is the distance between ${\boldsymbol{ w}}$ and its farthest point in $\Delta ^{d-1}$, since that $\rho$ already covers the entire preference domain.

The number of calls is $m-1$, because for $k=m$, we know that $\rho =0$ by definition, and thus $Cost(m, 0)$ can be directly evaluated without an $\mathsf {ORD}$/$\mathsf {ORU}$ call.

References

[1]

Hotel dataset. (n.d.). https://github.com/RyanBalfanz/hotelsbase

Parameter	Tested and Default Values
Dataset cardinality \(\|D\|\)	100K, 400K, 1.6M, 6.4M, 25.6M
Dimensionality d	2, 3, 4, 5, 6, 7
Parameter k	1, 5, 10, 15, 20
Output size m	10, 30, 50, 70, 90

Abstract

1 Introduction

2 Related Work

3 Problem Formulation

4 \(\mathsf {ORD}\) Algorithm

4.1 Observations and Main Idea

4.2 Efficient \(\mathsf {ORD}\) Processing

5 Basic \(\mathsf {ORU}\) Algorithm

5.1 Fundamentals

5.2 An Algorithmic Basis to the \(\mathsf {BORU}\) Technique

5.3 Dropping Assumptions; Complete \(\mathsf {BORU}\)

5.3.1 Gradually Tightening the Overestimate \(\bar{\rho }\).

5.3.2 Incremental \(\rho\)-skyband.

6 Fast \(\mathsf {ORU}\) Algorithm

6.1 Fundamentals and Algorithmic Outline

6.2 Technical Specifics

6.3 \(\mathsf {FORU}\) and Order Sensitivity

6.4 Analytical Comparison to \(\mathsf {BORU}\)

7 Approximate \(\mathsf {ORU}\) Processing

7.1 Algorithm Description

7.2 Accuracy Analysis

8 Discussion: An \(\mathsf {ORD}\)/\(\mathsf {ORU}\) Use Case for Unspecified k

9 Experiments

9.1 Qualitative Study

9.2 Performance Evaluation for \(\mathsf {ORD}\) Processing

9.3 Performance Evaluation for \(\mathsf {ORU}\) Processing

10 Conclusion

Footnotes

References

Index Terms

Recommendations

Marrying Top-k with Skyline Queries: Relaxing the Preference Input while Producing Output of Controllable Size

Preference elicitation in prioritized skyline queries

Top-k best probability queries and semantics ranking properties on probabilistic databases

Comments

Information

Published In

Publisher

Publication History

Check for updates

Author Tags

Qualifiers

Funding Sources

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

View options

PDF

eReader

Login options

Full Access

Figures

Other

Share

Share this Publication link

Share on social media

Affiliations