research-article

Public Access

Spectral Ranking Regression

Authors:

Stratis IoannidisAuthors Info & Claims

ACM Transactions on Knowledge Discovery from Data (TKDD), Volume 16, Issue 6

Article No.: 120, Pages 1 - 38

https://doi.org/10.1145/3530693

Published: 30 July 2022 Publication History

All formats PDF

Abstract

We study the problem of ranking regression, in which a dataset of rankings is used to learn Plackett–Luce scores as functions of sample features. We propose a novel spectral algorithm to accelerate learning in ranking regression. Our main technical contribution is to show that the Plackett–Luce negative log-likelihood augmented with a proximal penalty has stationary points that satisfy the balance equations of a Markov Chain. This allows us to tackle the ranking regression problem via an efficient spectral algorithm by using the Alternating Directions Method of Multipliers (ADMM). ADMM separates the learning of scores and model parameters, and in turn, enables us to devise fast spectral algorithms for ranking regression via both shallow and deep neural network (DNN) models. For shallow models, our algorithms are up to 579 times faster than the Newton’s method. For DNN models, we extend the standard ADMM via a Kullback–Leibler proximal penalty and show that this is still amenable to fast inference via a spectral approach. Compared to a state-of-the-art siamese network, our resulting algorithms are up to 175 times faster and attain better predictions by up to 26% Top-1 Accuracy and 6% Kendall-Tau correlation over five real-life ranking datasets.

1 Introduction

Learning from ranking observations arises in many domains, including, e.g., econometrics [McFadden 1973; Ryzin and Mahajan 1999], psychometrics [Thurstone 1927; Bradley and Terry 1952], sports [Elo 1978], and medicine [Tian et al. 2019], to name a few. Ranking observations are (potentially noisy and incomplete) orderings of subsets of samples. Given a dataset of such ranking observations, probabilistic inference typically assumes the existence of an underlying total ordering and aims at recovering it. The Plackett–Luce (PL) model [Plackett 1975] is a prominent parametric model in this setting; it postulates that (a) each sample is parametrized by a score and (b) the probability that a sample is ranked higher than a set of alternatives is proportional to this score.

PL scores are traditionally learned from ranking observations via Maximum Likelihood Estimation (MLE) [Dykstra 1960; Hunter 2004; Hajek et al. 2014; Negahban et al. 2018]; under a reparametrization, the negative log-likelihood is convex and PL scores can be estimated via, e.g., Newton’s method [Nocedal and Wright 2006]. Nevertheless, for large datasets, Newton’s method can be prohibitively slow [Hunter 2004]. Recently, Maystre and Grossglauser [2015] proposed a highly efficient iterative spectral method, termed Iterative Luce Spectral Ranking (ILSR), that estimates PL scores significantly faster than state-of-the-art methods. ILSR relies on the fact that ML estimates of PL scores constitute the stationary distribution of a Markov Chain (MC) with transition rates defined by ranking observations. In turn, this stationary distribution can be found very efficiently via spectral techniques, e.g., the power method [Lei et al. 2016].

The above approaches learn PL scores in the absence of sample features, which precludes rank predictions on samples outside the training set. A natural variant of the above setting is ranking regression. In ranking regression, PL scores are assumed to be a parametric function of sample features; the parameters of this function are then regressed from ranking observations. Ranking regression has received considerable attention in the literature, via both shallow [Joachims 2002; Pahikkala et al. 2009; Tian et al. 2019] and deep models [Burges et al. 2005; Chang et al. 2016; Dubey et al. 2016; Han 2018; Yıldız et al. 2019]. Particularly in deep neural network (DNN) regression of PL scores, siamese neural networks [Bromley et al. 1994] perform extremely well: For example, Yıldız et al. [2019] train a 5.9M parameter siamese neural network from pairwise comparisons on just 80 images, attaining 0.92 AUC.

Virtually all existing work on ranking regression relies on classic optimization methods for parameter inference. To the best of our knowledge, the opportunity to accelerate learning in ranking regression via spectral methods such as ILSR [Maystre and Grossglauser 2015] has not yet been explored. One reason is that this is technically challenging: In contrast to the feature-less setting, it is not a priori evident how to devise a spectral method akin to ILSR for ranking regression via PL. Particularly, PL scores regressed from sample features cannot be directly expressed as stationary distributions of an MC.

Our main technical contribution is to show that the PL negative log-likelihood augmented with a proximal penalty has stationary points that satisfy the balance equations of an MC (c.f. Theorem 3.2). This allows us to tackle the ranking regression problem via a spectral method akin to ILSR by using Alternating Directions Method of Multipliers (ADMM) [Boyd et al. 2011]. Intuitively, ADMM decouples the optimization of PL scores from regression model parameters, encapsulating them in a proximal penalty. In turn, this can be solved highly efficiently via an ILSR-like spectral algorithm.

Overall, we make the following contributions.

—

We propose the first approach to accelerate learning in ranking regression via spectral methods. As stated above, our main technical contribution is to show that the PL negative log-likelihood augmented with a proximal penalty has stationary points that satisfy the balance equations of an MC (c.f. Theorem 3.2). We further show that using ADMM to solve the ranking regression problem reduces it to the minimization of such an objective, which is thus amenable to a highly efficient spectral solution.

—

We employ our spectral algorithm to regress PL scores via shallow regression functions of sample features, focusing on standard ADMM with \(\ell _2\)-norm proximal penalty. Although the ranking regression problem is non-convex, we establish conditions that yield convergence guarantees for our spectral algorithm in the case of affine regression (c.f. Theorem 3.4).

—

We also extend the application of our spectral algorithm to ranking regression via DNNs. We show experimentally that replacing the standard \(\ell _2\)-norm proximal penalty in ADMM with Kullback–Leibler (KL) divergence [Kullback and Leibler 1951] has a better performance in this setting.

Our proposed algorithm yields significant performance improvements across the board. In the case of shallow ranking regression, our spectral algorithm is up to 579 times faster than traditional shallow ranking regression methods, and outperforms feature-less methods (including ILSR) by \(13\%\) accuracy. In the case of deep ranking regression, our algorithm significantly outperforms standard deep siamese networks, while the KL-divergence penalty attains better accuracy than the one attained by the \(\ell _2\)-penalty ADMM. As shown in Figure 1, siamese network training grows exponentially with ranking size \(K\); both KL-divergence (DSR-KL) and \(\ell _2\)-penalty (DSR-l2) ADMM scale more gracefully w.r.t. \(K\). However, DSR-KL has a considerably higher accuracy, comparable to the one attained by the less efficient siamese network.

Fig. 1.

The remainder of this article is organized as follows. We review related literature in Section 2. We formulate our main contributions and proposed algorithms in Section 3. We present our experiments in Section 4 and conclude with future work in Section 5.

2 Related Work and Technical Preliminary

2.1 Related Work

Rank Aggregation. The problem of rank aggregation [Dwork et al. 2001], in which a total ordering of samples is regressed from ranking observations, is classic; literature on the subject is vast—see, e.g., the surveys by Fligner and Verducci [1993], Cattelan [2012], and Marden [2014]. Probabilistic inference in this setting typically assumes (a) that a “true” total ordering of samples exists, and (b) that ranking observations exhibit a stochastic transitivity property [Agarwal 2016]: A sample is more likely to be ranked higher than another when this event is consistent with the underlying total ordering. The noisy permutation model is a non-parametric model for this setting: Pairwise comparisons consistent with the underlying total ordering are observed under i.i.d. Bernoulli noise. MLE is NP-hard in this setting. A polytime algorithm by Braverman and Mossel [2008] recovers the underlying ordering in \(\Theta (n\log n)\) comparisons, w.h.p.; this is tightened by several recent works [Wauthier et al. 2013; Mao et al. 2018a;, 2018b]. The Mallows model [Mallows 1957] assumes that the probability of a ranking observation is a decreasing function of its distance from the underlying total ordering, under appropriate notions of distance (e.g., Kendall-Tau (KT)); MLE can be approached, e.g., via EM [Lu and Boutilier 2011]. Shah et al. [2016b] learn the full matrix of pairwise comparison probabilities via a minimax optimal estimator. Rajkumar and Agarwal [2016] learn the matrix via matrix completion, requiring \(\Theta (nr \log n)\) comparisons, where \(r \ll n\) is the rank. Ammar and Shah [2011] assume that comparisons are sampled from an unknown distribution over total orderings and propose an entropy maximization algorithm requiring \(\Theta (n^2)\) comparisons.

Parametric Rank Aggregation. We focus on parametric models, as they are more natural in the context of regressing rankings from sample features. In both PL [Plackett 1975] and Thurstone [Thurstone 1927], each sample is parametrized by a score. In the Thurstone model, observations result from comparing scores after the addition of Gaussian noise. Vojnovic and Yun [2016] and Shah et al. [2016a] estimate Thurstone scores via MLE and provide sample complexity bounds that are inversely proportional to the smallest non-zero eigenvalue of the Laplacian of a graph modeling comparisons. In PL, the probability that a sample is chosen over a set of alternatives is proportional to its score. Hunter [2004] proposes a Minorization-Maximization (MM) approach to estimate PL scores via MLE, earlier used by Dykstra [1960] on pairwise comparisons (i.e., on the Bradley–Terry (BT) setting). Hajek et al. [2014] provide an upper bound on the error in estimating the PL scores via MLE and show that the latter is minimax-optimal. Negahban et al. [2018] propose a latent factor model, estimating parameters via a convex relaxation of the corresponding rank penalty and providing sample complexity guarantees. Vojnovic et al. [2020] show that gradient-descent and MM algorithms that solve the MLE problem converge with linear rate.

PL Inference via Spectral Methods. Our focus on PL is due to the recent emergence of spectral algorithms for inference in this setting. Negahban et al. [2012] propose the Rank Centrality (RC) algorithm for the BT setting and derive a minimax error bound. Chen and Suh [2015] propose a spectral MLE algorithm extending RC with an additional stage that cyclically performs MLE for each score. Soufiani et al. [2013] and Jang et al. [2017] extend RC to rankings of two or more samples by breaking rankings into independent comparisons. Improved bounds, applying also to broader noise settings, are provided by Rajkumar and Agarwal [2014]. Khetan and Oh [2016] generalize the work by Soufiani et al. [2013] by breaking rankings into independent shorter rankings, and building a hierarchy of tractable and consistent estimators. Blanchet et al. [2016] model sequential choices by state transitions in an MC, where transitions are functions of choice probabilities.

Bridging the above approaches with MLE, Maystre and Grossglauser [2015] show that the MLE of PL scores can be expressed as the stationary distribution of an MC. Their proposed ILSR algorithm estimates the PL scores faster than traditional optimization methods, such as, e.g., Hunter’s [Hunter 2004] and Newton’s method [Nocedal and Wright 2006], and more accurately than prior spectral rank aggregation methods. More recently, Ragain and Ugander [2016] showed that a spectral approach applies even after relaxing the assumption that the relative order of any two samples is independent of the alternatives. Agarwal et al. [2018] proposed another spectral method called accelerated spectral ranking by departing from the exact equivalence between MLE and MC approximation and demonstrate faster convergence than ILSR.

Ranking Regression via PL. We depart from all aforementioned methods by regressing ranked choices from sample features. Ranking regression via the PL [Plackett 1975] model has a long history, mostly focusing on pairwise comparisons, i.e., restricted to \(K=2\); this case is also known as the BT model [Bradley and Terry 1952]. The resulting algorithms have been employed in a broad array of domains including medicine [Tian et al. 2019], information retrieval on search engines [Joachims 2002; Burges et al. 2005], image-based aesthetic recommendations [Chang et al. 2016; Sun et al. 2017], click prediction for online advertisements [Sculley 2010], and image retrieval [Chen et al. 2015], to name a few.

Among shallow regressors, RankSVM [Joachims 2002] learns a target ranking from features via a linear Support Vector Machine (SVM), with constraints imposed by all possible comparisons. Pahikkala et al. [2009] propose a regularized least-squares based algorithm for learning to rank from comparisons. Several works [Joachims 2002; Pahikkala et al. 2009; Guo et al. 2018; Tian et al. 2019] regress comparisons via MLE over BT [Bradley and Terry 1952] models.

Deeper models for regressing comparisons have also been extensively explored in the BT setting [Burges et al. 2005; Chang et al. 2016; Dubey et al. 2016; Doughty et al. 2018; Yıldız et al. 2019]. Burges et al. [2005] estimate comparisons via a fully connected neural network called RankNet, using the BT model to construct a cross-entropy loss. Several works [Chang et al. 2016; Dubey et al. 2016; Doughty et al. 2018; Yıldız et al. 2019] learn from comparisons via siamese networks [Bromley et al. 1994]. Chang et al. [2016] and Yıldız et al. [2019] learn from comparisons using MLE on the BT model. Dubey et al. [2016] predict image comparisons, via a loss function combining cross-entropy and hinge loss. Doughty et al. [2018] learn similarities and comparisons of videos via hinge loss. All of the above methods generalize to rankings under the PL model, but suffer from the complexity issues outlined in the introduction.

Cao et al. [2007], Xia et al. [2008], and Ma et al. [2020] focus on learning from star/relevance ratings rather than rankings; they train a neural network called ListNet that treats ratings as PL scores. Particularly, given a dataset of \(n\) ratings, ListNet constructs all \(\frac{n!}{(n-K)!}\) possible \(K\)-sized rankings; it is then trained by minimizing the cross entropy between the PL probabilities that a given \(K\)-ranking comprises the top samples among all \(n\), with scores being (a) the predicted scores and (b) the ground-truth ratings, respectively. Hence, although ListNet solves a very different problem than ranking regression, it is related to the siamese methods mentioned above through its (exponential in \(K\)) objective.

ADMM Applications in DNN Training. When training DNNs, ADMM is traditionally used to enforce sparsity. One application is compressed sensing, in which a signal is recovered from undersampled measurements via a DNN transformation [Sun et al. 2016; Li et al. 2018; Liu et al. 2018; Ma et al. 2019; Yang et al. 2020]. The training objective combines a standard regression loss with a norm penalty on measurements, including, e.g., \(\ell _1\), \(\ell _2\), Frobenius, or nuclear norm. Another application is model pruning [Ye et al. 2019;, 2019; Zhao and Liao 2019]: ADMM is used to enforce weight sparsity by replacing norm penalties with (possibly non-convex) low-cardinality set constraints. Zhao et al. [2018] train a DNN classifier to recover samples distorted by additive adversarial noise, using an objective that combines a classification loss with an \(\ell _0\), \(\ell _1\), \(\ell _2\), or \(\ell _\infty\) norm penalty.

Relationship to Previous Works. The present article is a combination and extension of two of our earlier works [Yıldız et al. 2020;, 2021], which dealt with shallow and deep model settings separately. We unify the key theoretical results of these works, and include all proofs omitted from these conference versions. We extend our experimental evaluations, and provide a more detailed technical preliminary.

2.2 Technical Preliminary

2.2.1 The PL Model.

We introduce here the PL discriminative model [Plackett 1975] that we use in our analysis; before describing it for ranking observations, we first consider the simpler “maximal choice” setting. Consider a dataset of \(n\) samples, indexed by \(i \in \left[n\right]\equiv \lbrace 1,\ldots ,n\rbrace\). Every sample \(i\in \left[n\right]\) has a corresponding \(d\)-dimensional feature vector \(\mathbf {x}_i\in \mathbb {R}^d\). There exists an underlying total ordering of these \(n\) samples. A labeler of this dataset acts as a (possibly noisy) oracle revealing this total ordering: When presented with a query \(A\subseteq \left[n\right]\), i.e., a set of alternatives, the noisy labeler chooses the maximal sample in \(A\) w.r.t. the underlying total ordering. Formally, our “labeled” dataset \(\mathcal {D}= \lbrace (c_\ell , A_\ell) \rbrace _{\ell =1}^m\) consists of \(m\) observations \((c_\ell , A_\ell)\), \(\ell \in \left[m\right]\equiv \lbrace 1,\ldots ,m\rbrace\), where \(A_\ell \subseteq \left[n\right]\) is the \(\ell\)th query submitted to the labeler and \(c_\ell \in A_\ell\) is her respective \(\ell\)th maximal choice (i.e., the label). For every sample \(i\in \left[n\right]\), we denote by \(W_i = \lbrace \ell \in \left[m\right]\mid i \in A_\ell , c_\ell = i\rbrace\) the set of observations where sample \(i \in \left[n\right]\) is chosen, and by \(L_i = \lbrace \ell \in \left[m\right]\mid i \in A_\ell , c_\ell \ne i\rbrace\) the set of observations where sample \(i \in \left[n\right]\) is not chosen.

The PL model [Plackett 1975] asserts that every sample \(i\in \left[n\right]\) is associated with a non-negative deterministic score \({\pi }_i \in \mathbb {R}_+\). Given scores \(\mathbf {\pi }= [{\pi }_i]_{i \in \left[n\right]} \in \mathbb {R}^n_+\), then (i) observations \((c_\ell , A_\ell), \ell \in \left[m\right]\) are independent, and (ii) for a query \(A_\ell\), the probability that sample \(c_\ell \in A_\ell\) is chosen is

\begin{align} \mathbf {P}(c_\ell \,|\,A_\ell , \mathbf {\pi }) = {{\pi }_{c_\ell }}/{\sum \limits _{j \in A_\ell } {\pi }_j} = {{\pi }_\ell }/{\sum \limits _{j \in A_\ell } {\pi }_j}, \end{align}

(1)

where, in the last equation, we abuse notation to write the score of the chosen sample as \({\pi }_\ell \equiv {\pi }_{c_\ell }\). Note that \(\mathbf {P}(c_\ell \,|\,A_\ell , \mathbf {\pi }) = \mathbf {P}(c_\ell \,|\,A_\ell , s\mathbf {\pi })\), for all \(s\gt 0\); thus, w.l.o.g., we assume (or enforce via rescaling) that PL scores satisfy \(\mathbf {1}^\top \mathbf {\pi }= 1\), i.e., \(\mathbf {\pi }\) is a distribution over \(\left[n\right]\).

Ranking Observations. PL readily extends to datasets of (partial) ranking observations. In this setting, when presented with a query \(A_\ell \subseteq \left[n\right]\) of \(K=|A_\ell |\) samples, the labeler ranks the samples in \(A_\ell\) into an ordered sequence \(\alpha _{1}^\ell \succ \alpha _{2}^\ell \succ \cdots \succ \alpha ^\ell _{K}\). Under the PL model, this ranking is expressed as \(K-1\) maximal choice queries: \(\alpha _{1}^\ell\) over \(A_\ell\), \(\alpha _{2}^\ell\) over \(A_\ell \setminus \lbrace \alpha _{1}^\ell \rbrace\), and so on, so that

\begin{align} \mathbf {P}(\alpha _{1}^\ell \succ \alpha _{2}^\ell \succ \cdots \succ \alpha ^\ell _{K}\,|\,A_\ell , \mathbf {\pi }) = \prod \limits _{t=1}^{K-1} \left({{\pi }_{\alpha _t^\ell }}/{\sum \limits _{s=t}^{K} {\pi }_{\alpha _s^\ell }}\right). \end{align}

(2)

The product form of (2) implies that a ranking observation in response to a query \(A_\ell\) can be converted to \(K-1\) independent maximal-choice observations (again, \(\alpha _{1}^\ell\) over \(A_\ell\), \(\alpha _{2}^\ell\) over \(A_\ell \setminus \lbrace \alpha _{1}^\ell \rbrace\), and so forth), each governed by (1), that yield the same joint probability: \((\alpha _{1}^\ell \succ \cdots \succ \alpha ^\ell _{K})\) can be seen as the outcome of \(\alpha _{1}^\ell\) being chosen as the top within the query set \({A_\ell }\), \(\alpha _{2}^\ell\) being the top among \(A_\ell \setminus \lbrace \alpha _{1}^\ell \rbrace\), and so on. MLE over a dataset of ranking observations thus reduces to MLE over a dataset of maximal choice observations. For notational simplicity, we present our analysis over maximal-choice datasets, keeping the above reduction in mind.

Parameter Inference. Given the dataset of observations \(\mathcal {D}\), MLE of the PL scores \(\mathbf {\pi }\in \mathbb {R}^n_+\) amounts to minimizing the negative log-likelihood:

\begin{align} \mathop {\min }\limits _{\mathbf {\pi }\in \mathbb {R}^n_+} \mathcal {L}(\mathcal {D}\mid \mathbf {\pi }) \equiv \sum \limits _{\ell =1}^m\left(\log \sum \limits _{j \in A_\ell } {\pi }_j - \log \,{\pi }_\ell \right). \end{align}

(3)

Reparametrizing the scores as \({\pi }_i = e^{\theta _i}\), \(i \in \left[n\right]\) makes the negative log-likelihood \(\mathcal {L}\) convex in \(\mathbf {\theta }= [\theta _i]_{i\in \left[n\right]}\). Under this transform, a global optimum of Equation (3) can be computed via, e.g., Newton’s method.

2.2.2 Iterative Luce Spectral Ranking.

As described in more detail in Section 2.2.1, MLE of PL scores is convex under the reparametrization \({\pi }_i = e^{\theta _i}\), \(i \in \left[n\right]\), which, in turn, enables computing the PL scores via Newton’s method. Nevertheless, Newton’s method can be prohibitively slow for large \(n\) and \(m\) [Hunter 2004]. Maystre and Grossglauser [2015] proposed a novel spectral algorithm that is significantly faster than traditional optimization methods, such as, e.g., Hunter’s [Hunter 2004] and Newton’s method, and more accurate than prior spectral rank aggregation methods. They show that the MLE of PL scores can be expressed as the stationary distribution of an MC. Their algorithm relies on the following theorem which, for completeness, we re-prove in Appendix B:

Theorem 2.1 (Maystre and Grossglauser [2015]).

An optimal solution \(\mathbf {\pi }\in \mathbb {R}^n_+\) to (3) satisfies:

\begin{align} \sum _{j\ne i} {\pi }_j \lambda _{ji}(\mathbf {\pi }) = \sum _{j\ne i} {\pi }_i \lambda _{ij}(\mathbf {\pi }), \quad \text{for all}~i \in \left[n\right], \end{align}

(4)

where, for all \(i,j \in \left[n\right]\), with \(i\ne j\),

\begin{align} \lambda _{ji}(\mathbf {\pi }) = \sum _{\ell \in W_i \cap L_j} \left({\sum _{t\in A_\ell }{\pi }_t}\right)^{-1} \ge 0, \end{align}

(5)

for \(W_i = \lbrace \ell \,| i \in A_\ell , c_\ell = i\rbrace\) the observations where sample \(i \in \left[n\right]\) is chosen and \(L_i = \lbrace \ell \,| i \in A_\ell , c_\ell \ne i\rbrace\) the observations where sample \(i \in \left[n\right]\) is not chosen.

Equation (4) are the balance equations of a continuous-time MC with transition rates:

\begin{align} \mathbf {\Lambda }(\mathbf {\pi })= [\lambda _{ji}(\mathbf {\pi })]_{i,j \in \left[n\right]}, \end{align}

(6)

where \(\lambda _{ji}(\mathbf {\pi }) \,\,\) are given by Equation (5). Hence, \(\mathbf {\pi }\) is the stationary distribution of the MC defined by transition rates \(\mathbf {\Lambda }(\mathbf {\pi })\) (c.f. Appendix A).

Let \(\mathtt {ssd}(\mathbf {\Lambda })\) be the stationary distribution of an MC with transition rates \(\mathbf {\Lambda }\). When matrix \(\mathbf {\Lambda }\) is fixed (i.e., the transition rates are known), the vector \(\mathtt {ssd}(\mathbf {\Lambda })\) is a solution to the linear system defined by the balance equations (4) and \(\mathbf {1}^\top \mathbf {\pi }=1\), as it is a distribution. However, the transition matrix \(\mathbf {\Lambda }= \mathbf {\Lambda }(\mathbf {\pi })\) in Theorem 2.1 is itself a function of \(\mathbf {\pi }\), and is therefore a priori unknown. Maystre and Grossglauser [2015] find \(\mathbf {\pi }\) through an iterative algorithm. Starting from the uniform distribution \({\mathbf {\pi }}^0 = \frac{1}{n}\mathbf {1}\), they compute:

\begin{align} \mathbf {\pi }^{l+1} = \mathtt {ssd}\left(\mathbf {\Lambda }(\mathbf {\pi }^l)\right), \quad \text{for}~ l \in \mathbb {N}, \end{align}

(7)

where \(\mathbf {\Lambda }(\cdot)\) is given by (5), (6).

Maystre and Grossglauser [2015] refer to Equation (7) as the ILSR algorithm. They also establish that (7) converges to an optimal solution for MLE of PL scores under mild assumptions. Most importantly, as mentioned above, ILSR significantly outperforms state-of-the-art MLE algorithms in computational efficiency. We note that the dataset \(\mathcal {D}\) appears in the computation of the rate matrix \(\mathbf {\Lambda }\), via sets \(W_i\) and \(L_j\); the computation of these weights can be easily parallelized.

2.2.3 Alternating Directions Method of Multipliers.

Virtually all shallow and deep models for ranking regression rely on classic optimization methods for direct parameter inference, resulting in prohibitively slow learning for large rankings. We solve the ranking regression problem via the ADMM [Boyd et al. 2011] and attain spectral algorithms that significantly accelerate learning.

ADMM is a primal-dual algorithm designed for affine-constrained optimization problems with decoupled objectives, i.e., objectives that can be written as a sum of functions where each function depends on only one of the optimized variables. ADMM combines the traditional dual decomposition [Nocedal and Wright 2006] and augmented Lagrangian [Nocedal and Wright 2006] optimization algorithms; it has also been shown to be a special case of the Douglas–Rachford splitting [Nocedal and Wright 2006] algorithm. ADMM became particularly popular for being well-suited for distributed convex optimization, a problem that commonly arises in applied statistics and machine learning [Boyd et al. 2011]. We follow the notation by Boyd et al. [2011] in our exposition below.

Standard ADMM. Formally, the ADMM algorithm solves problems in the form:

\begin{align} \begin{split} \text{minimize} &\quad f(\mathbf {x}) + g(\mathbf {z})\\ \text{subject to} &\quad \mathbf {A} \mathbf {x} + \mathbf {B} \mathbf {z} = \mathbf {c}, \end{split} \end{align}

(8)

where \(\mathbf {x} \in \mathbb {R}^n\) and \(\mathbf {z} \in \mathbb {R}^m\) are the optimized variables, with \(\mathbf {A} \in \mathbb {R}^{p \times n}\), \(\mathbf {B} \in \mathbb {R}^{p \times m}\), and \(\mathbf {c} \in \mathbb {R}^p\). \(f\) and \(g\) are typically assumed to be convex functions that depend on decoupled variables \(\mathbf {x}\) and \(\mathbf {z}\), respectively.

ADMM solves a constrained optimization problem by minimizing the augmented Lagrangian instantiated below, rather than the standard Lagrangian:

\begin{align} L_\rho (\mathbf {x}, \mathbf {z}, \mathbf {y}) = f(\mathbf {x}) + g(\mathbf {z}) + \mathbf {y}^\top (\mathbf {A} \mathbf {x} + \mathbf {B} \mathbf {z} - \mathbf {c}) + \frac{\rho }{2} \parallel \mathbf {A} \mathbf {x} + \mathbf {B} \mathbf {z} - \mathbf {c} \parallel _2^2, \end{align}

(9)

where \(\mathbf {y}\in \mathbb {R}^p\) is the Lagrangian dual variable corresponding to the equality constraint and \(\rho \ge 0\) is a penalty parameter. Evidently, the difference of augmented Lagrangian from the standard Lagrangian is the additional quadratic proximal penalty on the equality constraint. This additional penalty is shown to greatly improve convergence properties of the algorithm [Boyd et al. 2011], compared to, e.g., dual ascent.

ADMM iteratively optimizes the augmented Lagrangian w.r.t. the primal variables \(\mathbf {x}\) and \(\mathbf {z}\), and dual variables \(\mathbf {y}\) as follows:

\begin{align} \begin{split} \mathbf {x}^{k+1} &= \mathop{\text{arg min}}\limits_{x} L_\rho (\mathbf {x}, \mathbf {z}^k, \mathbf {y}^k), \\ \mathbf {z}^{k+1} &= \mathop{\text{arg min}}\limits_{z} L_\rho (\mathbf {x}^{k+1}, \mathbf {z}, \mathbf {y}^k), \\ \mathbf {y}^{k+1} &= \mathbf {y}^k + \rho (\mathbf {A} \mathbf {x}^{k+1} + \mathbf {B} \mathbf {z}^{k+1} - \mathbf {c}). \end{split} \end{align}

(10)

For convex problems, there are well-established convergence properties for ADMM. If \(f\) and \(g\) are closed, proper, and convex, and the augmented Lagrangian without the quadratic penalty has a saddle point, the ADMM iterations are guaranteed to converge to a point where (a) the equality constraint is satisfied, and (b) objectives and dual variable attain optimal values. Even though the convergence has linear rate, in many applications, ADMM has been shown to converge to a modest accuracy in a few tens of iterations [Boyd et al. 2011]. ADMM has also been extensively used to solve non-convex problems similar to the one we study here [Chartrand and Wohlberg 2013; Guo et al. 2017; Hong 2018; Wang et al. 2019]. Focusing on specific instances of non-convex problems, Magnússon et al. [2015]; Guo et al. [2017] and Hong [2018] establish that ADMM converges to a stationary point of the augmented Lagrangian. In general, ADMM is not guaranteed to converge for non-convex problems, and even if it does, it may not converge a global optimum.

Generalization to Bregman Divergence Proximal Penalty. ADMM has extensions where the quadratic proximal penalty is generalized to, e.g., Bregman divergences. Wang and Banerjee [2014] employ ADMM with a Bregman divergence proximal penalty, where KL is a special case, to solve a problem of the form (8). They minimize the augmented Lagrangian:

\begin{align} L_\rho (\mathbf {x}, \mathbf {z}, \mathbf {y}) = f(\mathbf {x}) + g(\mathbf {z}) + \mathbf {y}^\top (\mathbf {A} \mathbf {x} + \mathbf {B} \mathbf {z} - \mathbf {c}) + \rho D_p(\mathbf {B} \mathbf {z} , \mathbf {c} - \mathbf {A} \mathbf {x}), \end{align}

(11)

where \(D_p\) is a Bregman divergence [Bregman 1967]. A Bregman divergence is associated with a strictly convex and contionously differentiable function \(h: \Omega \rightarrow \mathbb {R}\), where \(\Omega\) is a closed convex set. Given such a function \(h\) and any two points \(x, y \in \Omega\), the Bregman divergence is the difference between the value of \(h\) at point \(x\) and the value of the first-order Taylor expansion of \(h\) around point \(y\) evaluated at point \(x\) [Bregman 1967]:

\begin{align} D_p(x, y) = h(x) - h(y) - \nabla h(y) ^\top (x - y). \end{align}

(12)

Having adjusted the augmented Lagrangian as (11), Bregman-ADMM again iteratively optimizes the augmented Lagrangian w.r.t. the primal variables \(\mathbf {x}\) and \(\mathbf {z}\), and dual variables \(\mathbf {y}\) via (10). Wang and Banerjee [2014] show that ADMM with a Bregman divergence proximal penalty converges with linear rate for convex objectives. Wang et al. [2018] extend this convergence guarantee to local optima of non-convex problems. Several works [Shi et al. 2017; Yu et al. 2018; Yu and Açıkmeşe 2019] employ a Bregman ADMM algorithm for distributed optimization of separable problems.

3 Spectral Ranking Regression

3.1 Problem Formulation: Ranking Regression

As introduced in Section 2.2.1, MLE of the PL scores \(\mathbf {\pi }\in \mathbb {R}^n_+\) amounts to minimizing the negative log-likelihood:

\begin{align} \min _{\mathbf {\pi }\in \mathbb {R}^n_+} \mathcal {L}(\mathcal {D}\mid \mathbf {\pi }) \equiv \sum _{\ell =1}^m\left(\log \sum _{j \in A_\ell } {\pi }_j - \log \,{\pi }_\ell \right). \end{align}

(13)

To regress scores \(\mathbf {\pi }\) from sample features \(\mathbf {X}= {[\mathbf {x}_1,..,\mathbf {x}_n]}^T \in \mathbb {R}^{n\times d}\), we assume that there exists a function \(\tilde{{\pi }}:\mathbb {R}^d\rightarrow [0,1]\), parametrized by \(\mathbf {w}\in \mathbb {R}^{d^{\prime }}\), such that

\begin{align} {\pi }_i=\tilde{{\pi }}(\mathbf {x}_{i}; \mathbf {w}), \quad \text{for all}~i\in \left[n\right]\!. \end{align}

(14)

Then, MLE amounts to minimizing the following objective w.r.t. \(\mathbf {w}\in \mathbb {R}^{d^{\prime }}\):

\begin{align} \mathcal {L}(\mathcal {D}\mid \mathbf {w}) \equiv \sum _{\ell =1}^m\left(\log \sum _{j \in A_\ell } \tilde{{\pi }}(\mathbf {x}_{j}; \mathbf {w}) - \log \tilde{{\pi }}(\mathbf {x}_{\ell }; \mathbf {w}) \right). \end{align}

(15)

We tackle the regression problem via the MLE objective (15) w.r.t. three regression functions. Our notation is summarized in Table 1.

Table 1.

\(n\)	number of samples
\(m\)	number of choice/ranking observations
\(d\)	number of features of each sample
\(K\)	number of samples in each ranking query
\(\mathbf {w}\in \mathbb {R}^{d^{\prime }}\)	regression model parameters
\(\mathbf {x}_i \in \mathbb {R}^d\)	feature vector of sample \(i\)
\(\mathbf {X}\in \mathbb {R}^{n\times d}\)	matrix of feature vectors of all samples
\(\mathbf {\tilde{X}}= [\mathbf {X}\| \mathbf {1}] \in \mathbb {R}^{n\times (d+1)}\)	extended matrix of feature vectors of all samples
\(c_\ell\)	maximal choice for the \(\ell\)th observation
\(\alpha _\ell\)	ranking of samples in the \(\ell\)th observation
\(A_\ell\)	set of alternative samples, i.e., query set of the \(\ell\)th observation
\(\left[n\right]\)	set of all samples
\(\left[m\right]\)	set of all choice/ranking observations
\(\mathcal {D}\)	dataset of all query–observation pairs
\(W_i\)	set of observations in which sample \(i\) is chosen
\(L_i\)	set of observations in which sample \(i\) is not chosen
\(\mathbf {\pi }\in \mathbb {R}^n_+\)	scores of all samples
\(\tilde{\mathbf {\pi }}(\cdot , \mathbf {w}) \in \mathbb {R}^n_+\)	regression function of all scores
\(D_p\)	proximal penalty function of ADMM
\(\mathcal {L}\)	negative log-likelihood of the PL model
\(L\)	augmented Lagrangian of ADMM
\(\rho \gt 0\)	penalty parameter of ADMM
\(\mathbf {y}\in \mathbb {R}^{n}\)	dual variables of ADMM
\(\lambda _{ji}, \,\, i,j\in \left[n\right]\)	transition rates of the ILSR MC (c.f. (5))
\(\mathbf {\Lambda }\)	transition matrix of the ILSR MC (c.f. (6))
\(\mu _{ji}, \,\, i,j\in \left[n\right]\)	transition rates of the MC (c.f. (27))
\(\mathbf {M}\)	transition matrix of the MC (c.f. (28))
\(\mathbf {\sigma }\in \mathbb {R}^n\)	additional terms in the MC rates (c.f. Theorem 3.2)
\((\left[n\right]_{-}, \left[n\right]_{+})\)	partition of \(\left[n\right]\) s.t. \(\sigma _i \ge 0\) for \(i \in \left[n\right]_{+}\) and \(\sigma _i \lt 0\) for \(i \in \left[n\right]_{-}\)
\(\mathtt {ssd}\)	stationary distribution of an MC

Table 1. Notation Summary

Affine Regression. We assume that there exist \(\mathbf {a}\in \mathbb {R}^d\) and \({b}\in \mathbb {R}\), such that \(\mathbf {\pi }= \tilde{{\pi }}(\mathbf {X}; \mathbf {w})\equiv \mathbf {X}\mathbf {a}+ \mathbf {1} {b}\,\) for \(\mathbf {w}=(\mathbf {a},{b})\in \mathbb {R}^{d+1}\). Then, MLE of parameters \(\mathbf {w}\) amounts to solving:

\begin{align} \min _{\mathbf {w}=(\mathbf {a},{b}) \in \mathbb {R}^{d+1} :\tilde{{\pi }}(\mathbf {X}; \mathbf {w})\ge \mathbf {0}}\mathcal {L}\left(\mathcal {D}\,|\,\tilde{{\pi }}(\mathbf {X}; \mathbf {w})\equiv \mathbf {X}\mathbf {a}+ \mathbf {1} {b}\right)\!, \end{align}

(16)

where \(\mathcal {L}\) is given by (15). We note that Problem (16) is not convex, as the objective is not convex in \((\mathbf {a},{b})\in \mathbb {R}^{d+1}\).

Logistic Regression. In the logistic case, we assume that there exists \(\mathbf {w}\in \mathbb {R}^d\) s.t. \(\mathbf {\pi }=\tilde{{\pi }}(\mathbf {X}; \mathbf {w})\equiv [e^{\mathbf {w}^\top \mathbf {x}_i}]_{i\in \left[n\right]}.\) As \(\tilde{{\pi }}(\mathbf {X}; \mathbf {w})\ge 0\) by definition, MLE corresponds to

\begin{align} \min _{\mathbf {w}\in \mathbb {R}^d} \mathcal {L}\left(\mathcal {D}\,|\,\tilde{{\pi }}(\mathbf {X}; \mathbf {w})\equiv [e^{\mathbf {w}^\top \mathbf {x}_i}]_{i\in \left[n\right]}\right)\!. \end{align}

(17)

The objective of (17) is convex in \(\mathbf {w}\), so an optimal solution can be found via, e.g., Newton’s method [Nocedal and Wright 2006]. The convexity of Problem (17) w.r.t. \(\mathbf {w}\) follows by the facts that (a) the reparametrization \({\pi }_i = e^{\theta _i}\), \(i \in \left[n\right]\) makes \(\mathcal {L}\) convex, and (b) the composition of convex and affine is convex [Boyd and Vandenberghe 2004].

DNN Regression. Finally, we regress scores \(\mathbf {\pi }\) via a generic DNN function \(\tilde{{\pi }}(\mathbf {X}; \mathbf {w})\) by solving the following MLE problem:

\begin{align} \min _{\mathbf {w}\in \mathbb {R}^{d^{\prime }}} \mathcal {L}(\mathcal {D}\,|\,\tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w})), \end{align}

(18)

where \(\tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w})\in \mathbb {R}^n_+\) is the vector map whose coefficients \(\tilde{{\pi }}_i\in \mathbb {R}_+\) are given by \(\tilde{{\pi }}_i=\tilde{{\pi }}(\mathbf {x}_{i}; \mathbf {w})\), i.e., \(\tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w})\equiv [\tilde{{\pi }}(\mathbf {x}_{i}; \mathbf {w})]_{i \in \left[n\right]}\). For instance, \(\tilde{{\pi }}(\cdot , \mathbf {w})\) can be a fully connected feed-forward neural network if \(\mathbf {X}\) contains numerical features, or a convolutional neural network if \(\mathbf {X}\) is an image.

As also discussed in Section 1, virtually all state-of-the-art ranking regression methods minimize Equation (15) via classic gradient-based optimization [Joachims 2002; Burges et al. 2005; Pahikkala et al. 2009; Chang et al. 2016; Dubey et al. 2016; Doughty et al. 2018; Guo et al. 2018; Tian et al. 2019; Yıldız et al. 2019]. Note that the optimization of objective (15) implies the following process: For each observation \((c_\ell ,A_\ell)\) in \(\mathcal {D}\), the regression function \(\tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w})\) needs to be evaluated and updated over all samples in \(A_\ell\). Overall, the ranking regression objective (15) implies that \(K\) samples need to be loaded in RAM to process a single ranking or maximal choice query. To make matters worse, the number \(m\) of possible rankings/maximal queries in \(\mathcal {D}\) grows as \(\left({{n}\atop {K}}\right) = O(n^K)\). This means that the length of a training epoch of stochastic gradient descent (and hence, the number of expensive gradient computations) is \(m=O(n^K)\), i.e., grows exponentially in \(K\). Combined, these two effects can make the cost of a single optimization step over Equation (15) prohibitive (c.f. Figures 1 and 5(c)). This motivates us to explore efficient spectral algorithms to solve regression problems (16)–(18), as we discuss next.

3.2 Key Technical Result: Decoupling Optimization and a Spectral Method

Given ILSR’s significant computational benefits in Section 2, we wish to develop analogues in the regression setting. In contrast to the feature-less setting, it is not a priori evident how to solve Problems (16)–(18) via a spectral approach. Taking the affine case as an example, and momentarily ignoring issues of non-convexity, the stationary points of the optimization problem (16) cannot be expressed via the balance equations of an MC. Our main contribution is to circumvent this problem by using the ADMM [Boyd et al. 2011]. Intuitively, ADMM allows us to decouple the optimization of scores \(\mathbf {\pi }\) from model parameters \(\mathbf {w}\), encapsulating them in a proximal penalty: The latter becomes amenable to a spectral approach after a series of manipulations that we outline below (see Theorem 3.2).

We wish to devise an efficient algorithm to minimize the regression objective (15). To do so, we extend the standard ADMM [Boyd et al. 2011] introduced in Section 2. To this end, we rewrite the minimization of (15) as

\begin{align} \underset{\mathbf {\pi }, \mathbf {w}}{\text{Minimize}} &\quad \mathcal {L}(\mathcal {D}\,|\,\mathbf {\pi }) \equiv \sum _{\ell =1}^m\left(\log \sum _{j \in A_\ell } {\pi }_j - \log \,{\pi }_\ell \right) \end{align}

(19a)

\begin{align} \text{subject to:} & \quad \mathbf {\pi }= \tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w}), \quad \mathbf {\pi } \ge \mathbf {0}. \end{align}

(19b)

To solve (19) via ADMM, consider the following augmented Lagrangian (see Section 2):

\begin{align} L_\rho (\mathbf {\pi }, \mathbf {w}, \mathbf {y}) = \mathcal {L}(\mathcal {D}\,|\,\mathbf {\pi }) &+ \mathbf {y}^\top (\mathbf {\pi }-\tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w})) + \rho \cdot D_p(\mathbf {\pi }||\tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w})), \end{align}

(20)

where \(\mathbf {y}\in \mathbb {R}^n\) is the Lagrangian dual variable corresponding to the equality constraint in (19b), \(\rho \ge 0\) is a penalty parameter, and \(D_p(\cdot ||\cdot)\) is a proximal penalty term. This term satisfies: (i) \(D_p(\mathbf {\pi }||\tilde{\mathbf {\pi }}) \ge 0\) for all \(\mathbf {\pi },\tilde{\mathbf {\pi }}\in \mathbb {R}^n_+\), and (ii) \(D_p(\mathbf {\pi }||\tilde{\mathbf {\pi }}) = 0\) if and only if \(\tilde{\mathbf {\pi }}=\mathbf {\pi }\). Classic ADMM involves using the usual \(\ell _2\) proximal penalty, i.e., \(D_p(\mathbf {\pi }||\tilde{\mathbf {\pi }}) = \frac{1}{2} \parallel \! \mathbf {\pi }- \tilde{\mathbf {\pi }}\!\parallel _2^2\). We depart from this by considering more generic \(D_p\); as we discuss in Section 3.5, using KL divergence instead, i.e., \(D_p(\mathbf {\pi }||\tilde{\mathbf {\pi }}) = \sum _{i=1}^{n} {\pi }_i \log \frac{{\pi }_i}{\tilde{{\pi }}_i}\) has significant advantages in our setting.

In its general form, ADMM alternates between optimizing \(\mathbf {\pi }\) and \(\mathbf {w}\) until convergence via the following primal-dual algorithm on the augmented Lagrangian (c.f. Section 2):

\begin{align} \mathbf {\pi }^{k+1} &= \mathop{\text{arg min}}\limits_{\mathbf {\pi }\in \mathbb {R}^n_+} \,\, L_\rho (\mathbf {\pi }, \mathbf {w}^k, \mathbf {y}^k), \end{align}

(21a)

\begin{align} \mathbf {w}^{k+1} &= \mathop{\text{arg min}}\limits_{\mathbf {w}\in \mathbb {R}^{d^{\prime }}} \,\,L_\rho (\mathbf {\pi }^{k+1}, \mathbf {w}, \mathbf {y}^k), \end{align}

(21b)

\begin{align} \mathbf {y}^{k+1} &= \mathbf {y}^k + \rho (\mathbf {\pi }^{k+1}- \tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w}^{k+1})) \end{align}

(21c)

This has the following immediate computational advantages. First, step (21b) is equivalent to

\begin{align} \begin{split}\mathbf {w}^{k+1} = \underset{\mathbf {w}\in \mathbb {R}^{d^{\prime }}}{\text{arg min}} \,\, & \rho D_p\big (\mathbf {\pi }^{k+1}||\tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w})\big) - {\mathbf {y}^k}^\top \tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w}). \end{split} \end{align}

(22)

Note that this operation does not depend on dataset \(\mathcal {D}\): The model \(\tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w})\) is regressed directly from the present score estimates \(\mathbf {\pi }^{k+1}\) via penalty \(D_p\), with the additional linear dual term. In particular, as discussed below, \(D_p\) leads to a least-squares regression in the case of the \(\ell _2\)-proximal penalty, and a max-entropy penalty in the case of KL-divergence; both can be solved efficiently. Crucially, an optimization method that solves (22) iterates over the \(O(n)\) samples, instead of the \(O(n^K)\) ranking observations, which would be the case when Equation (15) is minimized directly without ADMM decoupling.

Most importantly, step (21b)—which does depend on \(\mathcal {D}\)—admits a highly efficient spectral implementation for a proximal penalty \(D_p\); ADMM (21) therefore delegates solving the “expensive” portion of the problem to a highly efficient algorithm. To establish this, we begin with the following lemma, which we prove in Appendix C.

Lemma 3.1.

Given \(\tilde{\mathbf {\pi }}^k\equiv \tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w}^k) \in \mathbb {R}^n_+\) and \(\mathbf {y}^k \in \mathbb {R}^n\), let \(\mathbf {\pi }\in \mathbb {R}_+^n\) be such that

\begin{align} \nabla _{\mathbf {\pi }}& L_\rho (\mathbf {\pi }, \mathbf {w}^k, \mathbf {y}^k) = \mathbf {0}. \end{align}

(23)

Let \(\mathbf {\sigma }(\mathbf {\pi })=[\sigma _i(\mathbf {\pi })]_{i\in \left[n\right]}\in \mathbb {R}^n\), where

\begin{align} \sigma _i(\mathbf {\pi }) &\equiv \rho \frac{\partial D_p(\mathbf {\pi }|| \tilde{\mathbf {\pi }}^k)}{\partial {{\pi }_i}} + {y}_i^k, \quad \text{for }i \in \left[n\right]\!, \end{align}

(24)

and

\begin{align} \lambda _{ji}(\mathbf {\pi }) \equiv \sum _{\ell \in W_i \cap L_j} \left({\sum _{t\in A_\ell }{\pi }_t}\right)^{-1} \ge 0,\quad \text{for }i,j \in \left[n\right]\text{ with }i\ne j, \end{align}

(25)

with \(W_i = \lbrace \ell \,| i \in A_\ell , c_\ell = i\rbrace\) denoting the observations where sample \(i \in \left[n\right]\) is chosen and \(L_i = \lbrace \ell \,| i \in A_\ell , c_\ell \ne i\rbrace\) denoting the observations where sample \(i \in \left[n\right]\) is not chosen. Then, (23) is equivalent to

\begin{align} \sum _{j\ne i} {\pi }_j \lambda _{ji}(\mathbf {\pi }) - \sum _{j\ne i} {\pi }_i \lambda _{ij}(\mathbf {\pi }) = {\pi }_i \sigma _i(\mathbf {\pi }), \quad \text{for all }i \in \left[n\right]\!. \end{align}

(26)

Contrasting Equation (26) to Equation (4), it is not immediately evident that the former corresponds to the balance equations of an MC as, in general, \(\mathbf {\sigma }(\mathbf {\pi }) = [\sigma _i(\mathbf {\pi })]_{i \in \left[n\right]} \ne \mathbf {0}\). Nevertheless, for an appropriate definition of transition rates, we prove that this is indeed the case:

Theorem 3.2.

Equations (26) are the balance equations of a continuous-time Markov Chain (MC) with transition rates:

\begin{align} \mu _{ji}(\mathbf {\pi }) &= {\left\lbrace \begin{array}{ll} \lambda _{ji}(\mathbf {\pi }) + \displaystyle \frac{2 {\pi }_i \sigma _i(\mathbf {\pi }) \sigma _j(\mathbf {\pi })}{ \sum \nolimits _{t\in \left[n\right]_{-}} {\pi }_t \sigma _t(\mathbf {\pi }) - \sum \nolimits _{t\in \left[n\right]_{+}} {\pi }_t \sigma _t(\mathbf {\pi })}, &\text{if} \,\,\, j \in \left[n\right]_{+}\,\,\, \text{and} \,\,\, i \in \left[n\right]_{-}, \\ \lambda _{ji}(\mathbf {\pi }) & \text{otherwise}, \end{array}\right.} \end{align}

(27)

where \((\left[n\right]_{+}\!, \left[n\right]_{-})\) denotes a partition of \(\left[n\right]\) such that \(\sigma _i(\mathbf {\pi }) \ge 0\) for all \(i \in \left[n\right]_{+}\) and \(\sigma _i(\mathbf {\pi }) \lt 0\) for all \(i \in \left[n\right]_{-}\).

The proof is in Appendix D. By Lemma 3.1 and Theorem 3.2, we conclude that a stationary \(\mathbf {\pi }\in \mathbb {R}^n_+\) satisfying (23) is also the stationary distribution of the continuous-time MC with transition rates:

\begin{align} \mathbf {M}(\mathbf {\pi })= [\mu _{ji}(\mathbf {\pi })]_{i,j \in \left[n\right]}, \end{align}

(28)

where \(\mu _{ji}(\mathbf {\pi })\) are given by Equation (27). Motivated by these observations, and mirroring ILSR (Equation (7)), we compute a solution to (21b) via

\begin{align} \mathbf {\pi }^{l+1} = \mathtt {ssd}\left(\mathbf {M}(\mathbf {\pi }^l)\right)\!, \quad \text{for}~ l \in \mathbb {N}, \end{align}

(29)

where \(\mathbf {M}(\cdot)\) is given by Equation (28) and \(\mathtt {ssd}(\cdot)\) is an algorithm computing the steady state distribution. We refer to this procedure as ILSRX (“ILSR with features”).

Our spectral algorithm solving Equation (19) is summarized in Algorithm 1 and Figure 2. We initialize all samples with equal scores, i.e., \(\mathbf {\pi }= \tilde{\mathbf {\pi }}= \frac{1}{n}\mathbf {1}\), and set the initial dual variable as \(\mathbf {y}^0 = \mathbf {0}\). We iteratively update \(\mathbf {\pi }\), \(\mathbf {w}\), and \(\mathbf {y}\) via Equation (21) until convergence. At each ADMM iteration \(k+1\), we initialize \(\mathbf {\pi }\) with \(\mathbf {\pi }^{k}\), and update \(\mathbf {\pi }\) via ILSRX given by Equation (29). We elaborate on the resulting implementation specifics for each of the regression problems (16)–(18) below.

Fig. 2.

3.3 Spectral Algorithm for Affine Regression

In this section, we apply Algorithm 1 on affine regression of PL scores via

\begin{align} \tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w})=\mathbf {X}\mathbf {a}+ \mathbf {1} {b}=\mathbf {\tilde{X}}\mathbf {w}. \end{align}

(30)

We introduce \(\mathbf {w}=(\mathbf {a}, {b}) \in \mathbb {R}^{d+1}\) and \(\mathbf {\tilde{X}}= [\mathbf {X}| \mathbf {1}] \in \mathbb {R}^{n\times (d+1)}\) for notational simplicity. Recall from Section 3.1 that the resulting MLE problem is given by

\begin{align} \min _{\mathbf {w}\in \mathbb {R}^{d+1} :\tilde{{\pi }}(\mathbf {X}; \mathbf {w})\ge \mathbf {0}}\mathcal {L}\left(\mathcal {D}\,|\,\tilde{{\pi }}(\mathbf {X}; \mathbf {w})\equiv \mathbf {\tilde{X}}\mathbf {w}\right)\!. \end{align}

(31)

In this setting, we employ standard ADMM with quadratic proximal penalty:

\begin{align} D_p(\mathbf {\pi }||\tilde{\mathbf {\pi }}) = \frac{1}{2} \parallel \! \mathbf {\pi }- \tilde{\mathbf {\pi }}\!\parallel _2^2\!. \end{align}

(32)

This leads to additional theoretical guarantees explained in Section 3.7. Employing Algorithm 1 with quadratic \(D_p(\mathbf {\pi }||\tilde{\mathbf {\pi }})\) and affine \(\tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w})\), the regression step (21b) of ADMM becomes:

\begin{align} \mathbf {w}^{k+1} &= \underset{\mathbf {w}\in \mathbb {R}^{d+1}}{\text{arg min}} \, {\mathbf {y}^k}^T (\mathbf {\pi }^{k+1} - \mathbf {\tilde{X}}\mathbf {w}) + \frac{\rho }{2}\parallel \! \mathbf {\pi }^{k+1}- \mathbf {\tilde{X}}\mathbf {w}\!\parallel _2^2\nonumber \nonumber\\ &= {(\mathbf {\tilde{X}}^T\mathbf {\tilde{X}})}^{-1}\mathbf {\tilde{X}}^T\left(\mathbf {\pi }^{k+1} + \frac{1}{\rho }\mathbf {y}^{k}\right). \end{align}

(33)

Equation (33) follows from completing the square by combining the linear and quadratic terms into a single quadratic objective. This has the following immediate computational advantage: The regression step (33) is a quadratic minimization and admits a closed form solution.

Crucially, the score update step (21b) is amenable to the spectral solution established in Theorem 3.2, where

\begin{align} \mathbf {\sigma }= \rho (\mathbf {\pi }- \mathbf {\tilde{X}}\mathbf {w}^k) + \mathbf {y}^k. \end{align}

(34)

Having adjusted the transition matrix \(\mathbf {M}(\mathbf {\pi })\) thusly, \(\mathbf {\pi }\) can be obtained by repeated iterations of ILSRX (29), with \(\tilde{{\pi }}(\mathbf {X}; \mathbf {w})\) and \(D_p\) given by Equations (30) and (32), respectively. We refer to this instance of Algorithm 1 as Spectral Ranking-l2 (SR-l2).

We note that, as Problem (19) is non-convex, selecting a good initialization point is important in practice. We discuss initialization below, leaving the additional theoretical guarantees to Section 3.7.

Initialization. We initialize \(\mathbf {w}\) so that \(\mathbf {\tilde{X}}\mathbf {w}\) is a good approximation of PL scores. We use a technique akin to Saha and Rajkumar [2018], applied to our affine setting. Given a distribution over queries \(A\subseteq \left[n\right]\), let \(P_{ij}=\mathbb {E}_A[c=i|\lbrace i,j\rbrace \subseteq A]\) be the probability that \(i\) is chosen given a query \(A\) that contains both \(i\) and \(j\). By (1), for \(i, j \in \left[n\right]\), \(\frac{{P}_{ij}}{{P}_{ji}} = \frac{{\pi }_i}{{\pi }_j} = \frac{\mathbf {x}_i^\top \mathbf {a}+ {b}}{\mathbf {x}_j^\top \mathbf {a}+ {b}}\), or

\begin{align} \begin{split} \delta _{ij}(\mathbf {w}) \equiv {({P}_{ij} \mathbf {x}_j - {P}_{ji} \mathbf {x}_i)}^\top \mathbf {a}+ {({P}_{ij} - {P}_{ji})}{b}= 0. \end{split} \end{align}

(35)

Motivated by (35), we estimate \(P_{ij}\) empirically from \(\mathcal {D}\), and obtain our initialization \(\mathbf {w}^0=(\mathbf {a}^0,{b}^0)\in \mathbb {R}^{d+1}\) by solving (35) in the least-square sense; that is,

\begin{align} \boldsymbol{w}^0 \mathop{\arg\min}\limits_{\mathbf {w}\in \mathbb {R}^{d+1}: \mathbf {\tilde{X}}\mathbf {w}\ge \mathbf {0} \,\wedge \mathbf {1}^\top \mathbf {\tilde{X}}\mathbf {w}= 1}\sum _{i,j} \delta _{ij}^2(\mathbf {w}). \end{align}

(36)

Note that this is a convex quadratic program. Given \(\mathbf {w}^0\), we generate the initial PL scores via the affine parametrization \(\mathbf {\pi }^0=\mathbf {\tilde{X}}\mathbf {w}^0\).

3.4 Spectral Algorithm for Logistic Regression

We describe here how to apply Algorithm 1 on logistic regression of PL scores via

\begin{align} \log \tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w})=\mathbf {X}\mathbf {w}, \end{align}

(37)

where \(\log \mathbf {\pi }=[\log {\pi }_i]_{i\in \left[n\right]}\) is the \(\mathbb {R}^n\) vector generated by applying \(\log\) to \(\mathbf {\pi }\) element-wise. Recall from Section 3.1 that the resulting MLE problem is given by

(38)

which is convex, and can thus be solved by Newton’s method. Nevertheless, we would like to accelerate its computation via a spectral method akin to ILSR.

Mirroring the affine case, we employ Algorithm 1 with quadratic proximal penalty:

\begin{align} D_p(\mathbf {\pi }||\tilde{\mathbf {\pi }}) = \frac{1}{2} \parallel \! \mathbf {\pi }- \tilde{\mathbf {\pi }}\!\parallel _2^2\!. \end{align}

(39)

As a result, the regression step (21b) corresponds to

\begin{align} \mathbf {w}^{k+1} &= \underset{\mathbf {w}\in \mathbb {R}^{d+1}}{\text{arg min}} \, {\mathbf {y}^k}^T (\log \mathbf {\pi }^{k+1} - \mathbf {X}\mathbf {w}) + \frac{\rho }{2}\parallel \! \log \mathbf {\pi }^{k+1} - \mathbf {X}\mathbf {w}\!\parallel _2^2 \nonumber \nonumber\\ &= {(\mathbf {X}^T\mathbf {X})}^{-1}\mathbf {X}^T\left(\log \mathbf {\pi }^{k+1} + \frac{1}{\rho }\mathbf {y}^{k}\right), \end{align}

(40)

which is again a quadratic minimization and admits a closed form solution.

More importantly, the score update step (21b) is amenable to the spectral solution established in Theorem 3.2. Mutatis mutandis, following the same manipulations in Lemma 3.1, a stationary point of the objective in each step (21b) can be cast as the stationary distribution of the continuous-time MC with transition rates \(\mu _{ji}(\mathbf {\pi })\), \(i,j\in \left[n\right]\) (Equation (27)), where vector \(\mathbf {\sigma }= [\sigma _i]_{i\in \left[n\right]}\) is now given by

\begin{align} \sigma _i = \frac{\rho (\log {\pi }_i - \mathbf {x}_i^T \mathbf {w}^k) + {y}_i^k}{{\pi }_i },\quad i \in \left[n\right]\!. \end{align}

(41)

Having adjusted the transition matrix \(\mathbf {M}(\mathbf {\pi })\) thusly, \(\mathbf {\pi }\) can again be obtained by repeated iterations of ILSRX (29), with \(\tilde{{\pi }}(\mathbf {X}; \mathbf {w})\) and \(D_p\) given by Equations (37) and (39), respectively. We refer to this instance of Algorithm 1 as Spectral Ranking-l2-log (SR-l2-log).

Initialization. Similar to the initialization of SR-l2 (c.f. Equation (36)), we initialize \(\mathbf {w}\) so that the initial scores obey the PL model, mirroring the approach by Saha and Rajkumar [2018]. Defining \(P_{ij}\), \(i,j\in \left[n\right]\) the same way, and using the logistic parametrization in Equation (17), we have that

\begin{align} \frac{{P}_{ij}}{{P}_{ji}}= \frac{{\pi }_i}{{\pi }_j} = e^{\mathbf {w}^T (\mathbf {x}_i - \mathbf {x}_j)}. \end{align}

(42)

Accordingly, we initialize \(\mathbf {w}\) as

\begin{align} \mathbf {w}^0 = \text{arg min}_{\mathbf {w}\in \mathbb {R}^d} \, \sum _{i,j} {\left(\mathbf {w}^T (\mathbf {x}_i - \mathbf {x}_j) - \log \left(\frac{\hat{P}_{ij}}{\hat{P}_{ji}}\right)\right)}^2, \end{align}

(43)

where \(\hat{P}_{ij}\), \(i,j\in \left[n\right]\), are again empirical estimates obtained from dataset \(\mathcal {D}\). Given \(\mathbf {w}^0\), we generate the initial PL scores via the logistic parametrization \(\mathbf {\pi }^0=[e^{\mathbf {x}_i^T \mathbf {w}^0}]_{i\in \left[n\right]}\).

3.5 Spectral Algorithm for DNN Regression

In this section, we employ Algorithm 1 for DNN regression of PL scores via \(\tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w})\) (c.f. Problem (18)).

To regress scores via a generic DNN \(\tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w})\), we generalize the traditional quadratic proximal penalty \(D_p(\mathbf {\pi }||\tilde{\mathbf {\pi }}) = \frac{1}{2} \parallel \! \mathbf {\pi }- \tilde{\mathbf {\pi }}\!\parallel _2^2\) to KL-divergence, which is naturally suited to Problem (19): This is because the scores computed by ILSRX (29) form, by construction, a distribution over \(\left[n\right]\). This generalization to KL penalty is still amenable to a spectral solution via ILSRX (29), as Theorem 3.2 holds for any proximal penalty \(D_p\); in practice, this also leads to a significantly improved performance over an \(\ell _2\)-proximal penalty (by up to \(56\%\) Top-1 accuracy and \(25\%\) KT correlation, as discussed in Section 4.9). As shown in Equation (48) below, the KL-divergence penalty leads to training \(\tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w})\) with a max-entropy loss and an additional linear term at each ADMM iteration. Compared to the \(\ell _2\)-proximal penalty, max-entropy has been observed to converge faster and lead to better fitting [Caruana and Niculescu-Mizil 2004; Golik et al. 2013; Bosman et al. 2020]. Finally, ADMM with Bregman divergence proximal penalties, including KL divergence, has been shown to offer local convergence guarantees for non-convex problems (c.f. Section 2.2.3); this further motivates us to employ KL over other distance measures between probability distributions.

We implement DSR with the two proximal penalty functions \(D_p(\mathbf {\pi }||\tilde{\mathbf {\pi }})\) mentioned above: KL divergence [Kullback and Leibler 1951] and \(\ell _2\) norm. We describe implementation specifics for each of these cases below.

\(\ell _2\) proximal penalty. For ADMM with quadratic proximal penalty:

\begin{align} D_p(\mathbf {\pi }||\tilde{\mathbf {\pi }}) = \frac{1}{2} \parallel \! \mathbf {\pi }- \tilde{\mathbf {\pi }}\!\parallel _2^2, \end{align}

(44)

the regression step (21b) of DSR becomes a least-squares minimization of the form:

\begin{align} \mathbf {w}^{k+1} = \underset{\mathbf {w}\in \mathbb {R}^{d^{\prime }}}{\text{arg min}} \,\, \parallel \! \tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w})- \left(\mathbf {\pi }^{k+1} + \frac{1}{\rho }\mathbf {y}^k\right) \!\parallel _2^2\!. \end{align}

(45)

As a result, we train the DNN regressor \(\tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w})\) via SGD optimization over the loss function (45).

Moreover, the score update step (21b) is amenable to the spectral approach established in Theorem 3.2 and solved via repeated iterations of ILSRX (29), where

\begin{align} \sigma _i(\mathbf {\pi }) = \rho \frac{\partial D_p(\mathbf {\pi }|| \tilde{\mathbf {\pi }}^k)}{\partial {{\pi }_i}} + {y}_i^k = \rho ({\pi }_i - \tilde{{\pi }}(\mathbf {x}_i, \mathbf {w}^k)) + {y}_i^k, \end{align}

(46)

for all \(i \in \left[n\right]\) in the transition rates (27). We refer to this instance of Algorithm 1 as Deep Spectral Ranking-l2 (DSR-l2).

KL proximal penalty. For ADMM with KL proximal penalty:

\begin{align} D_p(\mathbf {\pi }||\tilde{\mathbf {\pi }}) = \sum _{i=1}^{n} {\pi }_i \log \frac{{\pi }_i}{\tilde{{\pi }}_i}, \end{align}

(47)

the optimization step (21c) to train \(\tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w})\) corresponds to minimizing a max-entropy loss with an additional linear term:

\begin{align} \mathbf {w}^{k+1} = \underset{\mathbf {w}\in \mathbb {R}^{d^{\prime }}}{\text{arg min}} \,\, \sum _{i=1}^{n} \left(- \frac{{y}_i^k}{\rho } \tilde{{\pi }}(\mathbf {x}_i, \mathbf {w}) - {\pi }_i^{k+1} \log \tilde{{\pi }}(\mathbf {x}_i, \mathbf {w})\right). \end{align}

(48)

Moreover, the score update step (21b) is again solved via repeated iterations of ILSRX (29), where

\begin{align} \sigma _i(\mathbf {\pi }) = \rho \frac{\partial D_p(\mathbf {\pi }|| \tilde{\mathbf {\pi }}^k)}{\partial {{\pi }_i}} + {y}_i^k = \rho \left(1 + \log \frac{{\pi }_i}{\tilde{{\pi }}(\mathbf {x}_i, \mathbf {w}^k)}\right) + {y}_i^k , \end{align}

(49)

for all \(i \in \left[n\right]\) in the transition rates (27). We refer to this instance of Algorithm 1 as Deep Spectral Ranking-KL (DSR-KL).

Initialization. For both cases of penalties, we initialize all samples with equal scores, i.e., \(\mathbf {\pi }= \tilde{\mathbf {\pi }}= \frac{1}{n}\mathbf {1}\). At each ADMM iteration \(k\), we initialize \(\mathbf {\pi }\) with \(\mathbf {\pi }^{k-1}\), and update \(\mathbf {\pi }\) via ILSRX. Then, we initialize the DNN parameters \(\mathbf {w}\) with \(\mathbf {w}^{k-1}\), and fine-tune the DNN \(\tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w})\) via SGD over Equations (45) or (48).

3.6 Computational Complexity

Each iteration of our spectral algorithm (c.f. Algorithm 1) involves the three steps in Equation (21). One iteration of ILSRX at step (21b) is \(O(Km+ n^2)\) for constructing the transition rates via Equation (27) and for finding the stationary distribution \(\mathbf {\pi }\) via, e.g., a power method [Lei et al. 2016], respectively. It is important to note that the number of ranking observations \(m\) appears only in the construction of the transition rates through the terms \(\mu _{j, i}\) in Equation (21b), which, in turn, are computed via sums w.r.t. these \(m\) quantities (see Equation (27)). This is important, as the (potentially exponential in \(K\)) ranking observations affect the complexity only through scalar summations, which themselves can be easily parallelized (via, e.g., reduce-by-key operations). This is in sharp contrast to the classic gradient-based optimization (c.f. Section 3.1 and Figure 1), in which \(m\) directly determines the size of each epoch.

We discuss the complexity of the regression step (21c) separately for shallow and DNN model regression cases. For shallow model regression via affine or logistic parametrizations, i.e., SR-l2 and SR-l2-log, the regression step (21c) takes the specific forms (33) and (40), respectively. In both cases, \(\mathbf {w}\) update is \(O\big (n(d+1)\big)\) as a matrix-vector multiplication, since the matrix \({(\mathbf {X}^T\mathbf {X})}^{-1}\mathbf {X}^T\) can be precomputed. For DNN regression via DSR, the regression step (21c) is equivalent to training \(\tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w})\) via Equation (22). Most importantly, the length of each training epoch, is linear in number of samples \(n\): Each optimization step goes over \(O(n)\) samples to update the \({d^{\prime }}\) parameters. Again, this is in contrast to the \(m=O(n^K)\)-length epochs if one was minimizing the likelihood function directly via standard gradient descent methods. Finally, given \(\tilde{\mathbf {\pi }}\) and \(\mathbf {\pi }\), the update of \(\mathbf {y}\) at step (21a) is \(O(n)\).

Overall, updating \(\mathbf {\pi }\) via ILSRX at step (21b) and \(\mathbf {w}\) via gradient-based training at step (21c) are both iterative processes, for which the computational complexity until convergence have not yet been theoretically shown. That said, the literature on the complexity and convergence of iterative optimization over such highly non-convex objectives illustrates the challenges behind such a theoretical analysis [Jain and Kar 2017; Taheri et al. 2021]. Nevertheless, our Theorem 3.4 below, establishing convergence guarantees in the affine regression case, is a step toward this direction. This is in sharp contrast to, e.g., siamese networks, in which \(m\) characterizes the length of an entire training epoch.

3.7 Theoretical Guarantees

As the ranking regression problem (19) is, in general, non-convex, we aim at obtaining convergence guarantees for our spectral algorithm (c.f. Algorithm 1). To do so, we focus on the affine regression case explained in Section 3.3. Given this case, we wish to establish conditions such that (i) the scores computed at each ADMM iteration form a global optimum w.r.t. step (21a), and (ii) the scores and model parameters generated by Algorithm 1 converge to a stationary point of Problem (16).

Recall from Section 3.3 that the affine regression problem is given by

(50)

In this setting, we employ ADMM with standard \(\ell _2\) proximal penalty \(D_p(\mathbf {\pi }||\tilde{\mathbf {\pi }}) = \frac{1}{2} \parallel \! \mathbf {\pi }- \tilde{\mathbf {\pi }}\!\parallel _2^2\). As Problem (19) is non-convex, condition (23) is, in general, not sufficient for optimality w.r.t. step (21a). To show this, we require the following technical assumption:

Assumption 3.1.

For \(\lbrace \mathbf {\pi }^k\rbrace _{k\in \mathbb {N}}\), given by (21a), there exists an \(\epsilon \gt 0\) such that \({\pi }_i^k \gt \epsilon\) for all \(i \in \left[n\right]\) and \(k \in \mathbb {N}\).

In other words, we assume that the scores computed at each ADMM iteration are strictly positive. Under this assumption, we show that stationarity implies optimality w.r.t. (21a) for large enough \(\rho\):

Theorem 3.3.

Under Assumption 3.1, for \(\rho \ge \frac{2}{\epsilon ^2} \max _i \sum _{\ell |i\in A_\ell }\frac{1}{|A_\ell |^2}\) and \(D_p(\mathbf {\pi }||\tilde{\mathbf {\pi }}) = \frac{1}{2} \parallel \! \mathbf {\pi }- \tilde{\mathbf {\pi }}\!\parallel _2^2\), a \(\mathbf {\pi }\gt \mathbf {0}\) satisfying condition (23) is a minimizer of (21a).

The proof is in Appendix E. Theorem 3.3 implies that given large enough \(\rho\), the stationary scores satisfying condition (23) also form a global optimum of the ADMM step (21a). Using this result, we further establish the following convergence guarantee for Algorithm 1 in the affine regression case:

Theorem 3.4.

For \(\tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w})=\mathbf {\tilde{X}}\mathbf {w}\) and \(D_p(\mathbf {\pi }||\tilde{\mathbf {\pi }}) = \frac{1}{2} \parallel \! \mathbf {\pi }- \tilde{\mathbf {\pi }}\!\parallel _2^2\), suppose that there exists \(\kappa \gt 0\) such that \(\mathbf {\tilde{X}}^T \mathbf {\tilde{X}}\succeq \kappa \, \mathbf {I}\) and the sequence \(\lbrace (\mathbf {\pi }^k, \mathbf {y}^k, \mathbf {w}^k)\rbrace _{k \in \mathbb {N}}\) generated by (21) is bounded. Then, under Assumption 3.1, for \(\rho \gt \frac{2 \max _i |W_i|}{\epsilon ^2}\) and \(W_i = \lbrace \ell \,| i \in A_\ell , c_\ell = i\rbrace\) denoting the observations where sample \(i \in \left[n\right]\) is chosen, the sequence \(\lbrace (\mathbf {\pi }^k, \mathbf {y}^k, \mathbf {w}^k)\rbrace _{k \in \mathbb {N}}\) generated by (21) converges to a point that satisfies the Karush-Kuhn-Tucker (KKT) conditions of (16).

The proof is in Appendix F. By Theorem 3.4, we conclude that the scores \(\mathbf {\pi }\) and the affine regression parameters \(\mathbf {w}\) generated by Algorithm 1 converge to a stationary point of Problem (16) for large enough \(\rho\).

4 Experiments

4.1 Datasets

We evaluate our algorithms on synthetic and real-life datasets, summarized in Table 2.

Table 2.

Spec.	Dataset
Spec.	ROP-num / ROP-img	FAC	Pairwise Sushi	Triplet Sushi	ICLR-3/4	Movehub-Cost-4/5	Movehub-Quality-4/5	IMDB-4
\(K\)	2			3	\(3/4\)	\(4/5\)	\(4/5\)	4
\(n\)	100	1,000	100			50
\(d\)	\(143 / 224 \times 224\)	50	18		768	6	5	36
\(m\)	29,705	728	450	1,200	120, \(324/2,248,524\)	230, \(298/2,118,756\)	230, \(298/2,118,756\)	85,583
\(\mathbf {X}\)	numerical / image	numerical

Table 2. No. of Samples ( \(n\)), No. of Features ( \(d\)), No. of Observations ( \(m\)), and Query Size ( \(K\)) for Datasets with Image or Numerical Features ( \(\mathbf {X}\))

Synthetic Datasets. We generate the feature vectors \(\mathbf {x}_i \in \mathbb {R}^{d}, \;\; i \in \left[n\right]\) from \(\mathcal {N} (\mathbf {0},\sigma ^2_{x}\mathbf {I}_{d\times d})\) and a common parameter vector \({\mathbf {w}} \in \mathbb {R}^{d}\) from \(\mathcal {N} (\mathbf {0},\sigma ^2_{{\beta }}\mathbf {I}_{d\times d})\). Then, we generate the PL scores via the logistic parametrization \(\mathbf {\pi }=[e^{\mathbf {x}_i^T \mathbf {w}}]_{i\in \left[n\right]}\). We normalize the resulting scores, so that \(\mathbf {1}^\top \mathbf {\pi }=1\). We set \(\sigma ^2_{x}=\sigma ^2_{{\beta }}=0.8\) in all experiments. Given \(\mathbf {\pi }\), we generate each observation in \(\mathcal {D}\) as follows: We first select \(|A_\ell |=2\) samples out of \(n\) samples uniformly at random. Then, we generate the choice \(c_l\), \(l \in \left[m\right]\) from the PL model given by Equation (1).

Retinopathy of Prematurity (ROP). The ROP dataset contains \(n=100\) vessel-segmented retina images. Experts are provided with two images and are asked to choose the image with higher severity of the ROP disease. Five experts independently label 5,941 image pairs; the resulting dataset contains \(m=29,705\) pairwise comparisons. Note that some pairs are labeled more than once by different experts. We employ two ROP datasets, ROP-num and ROP-img for shallow and DNN regression, respectively: Each sample in ROP-num has \(d=143\) extracted numerical features [Ataer-Cansızoğlu 2015], while each sample in ROP-img is an image with dimensions \(d=224 \times 224\).

Filter Aesthetic Comparison (FAC). The FAC dataset [Sun et al. 2017] contains 1,280 unfiltered images pertaining to eight different categories. A total of 22 different image filters are applied to each image. Labelers are provided with two filtered images and are asked to identify which image has better quality. We select \(n=1,000\) images within one category, as only the filtered image pairs that are within the same category are compared. The resulting dataset contains \(m=728\) pairwise comparisons. Moreover, for each image, we extract features via a state-of-the-art convolutional neural network architecture, namely GoogLeNet [Szegedy et al. 2015], with weights pre-trained on the ImageNet dataset [Deng et al. 2009]. We select \(d=50\) of these features by Principal Component Analysis [Jolliffe 1986].

SUSHI. The SUSHI Preference dataset [Kamishima et al. 2009] contains \(n=100\) sushi ingredients with \(d=18\) features. Each of the 5,000 customers independently ranks 10 ingredients according to her preferences. We select the rankings provided by 10 customers, where an ingredient is ranked higher if it precedes the other ingredients in a customer’s ranked list. We generate two datasets: triplet Sushi containing \(m=1,200\) rankings of \(|A_\ell |=3\) ingredients, and pairwise Sushi containing \(m=450\) pairwise comparisons.

International Conference on Learning Representations (ICLR). The ICLR dataset contains abstracts and reviewer ratings of 2,561 papers that are submitted to ICLR 2020 conference and are available on OpenReview website [Sun 2020]. We choose the top \(n=100\) papers, and extract \(d=768\) numerical features from each abstract using the Deep Bidirectional Transformers (BERT) [Devlin et al. 2019] architecture, pre-trained on the Books Corpus dataset [Zhu et al. 2015] and English Wikipedia. We normalize \(\mathbf {X}\) to have 0 mean and unit variance over samples \(\left[n\right]\). We generate all possible \(m=120,324 (2,248,524)\) \(K=3(4)\)-way rankings w.r.t. the relative order of the average reviewer ratings. We add noise to the resulting rankings following the same process as Movehub-Cost.

Movehub-Cost. The Movehub-Cost dataset contains the total ranking of 216 cities w.r.t. cost of living [Blitzer 2017]. Each city is associated with \(d=6\) numerical features, which are average costs for cappuccino, cinema, wine, gasoline, rent, and disposable income. We normalize \(\mathbf {X}\) to have 0 mean and unit variance over samples \(\left[n\right]\). We select \(n=50\) cities and generate all \(m=230,298 (2,118,756)\) \(K=4(5)\)-way rankings w.r.t. the relative order of the queried cities in the total ranking. To mimic the real-life noise introduced by human labeling, we apply the following post-processing to the resulting rankings: For each ranking, we sample a value uniformly at random in \([0,1]\). If the value is less than 0.1, we add noise to the ranking by a cyclic permutation of the ranked samples.

Movehub-Quality. The Movehub-Quality dataset contains total ranking of the same 216 cities as Movehub-Cost, this time w.r.t. quality of life. Each city is associated with \(d=5\) numerical features, including overall scores for purchase power, healthcare, pollution, quality of life, and crime. We normalize \(\mathbf {X}\) to have 0 mean and unit variance over samples \(\left[n\right]\). We select \(n=50\) cities and generate all \(m=230,298 (2,118,756)\) \(K=4(5)\)-way rankings w.r.t. the relative order of the queried cities in the total ranking. We add noise to the rankings following the same process as Movehub-Cost.

IMDB. The IMDB Movies Dataset contains IMDB ratings of 14,762 movies, each of which is associated with \(d=36\) numerical features [Leka 2016]. We normalize \(\mathbf {X}\) to have 0 mean and unit variance over samples \(\left[n\right]\). We select \(n=50\) movies and generate all possible \(m=85,583\) \(K=4\)-way rankings w.r.t. the relative order of the ratings of queried movies. We add noise to the resulting rankings following the same process as Movehub-Cost.

4.2 Experiment Setup

We partition each dataset into training and test sets in two ways. In rank partitioning, we partition the dataset w.r.t. \(\left[m\right]\), using \(90\%\) of the \(m\) observations for training, and the remaining \(10\%\) for testing. In sample partitioning, we partition \(\left[n\right]\), using \(90\%\) of the \(n\) samples for training, and the remaining \(10\%\) for testing. In this setting, observations containing samples from both training and validation/test sets are discarded. We perform cross validation on the resulting training sets. We use 10 folds for synthetic datasets, ROP-num, FAC, and Triplet Sushi, and 3 folds for the other datasets. For synthetic datasets, we also repeat experiments over five random generations.

To evaluate the convergence speed of each algorithm, we measure the elapsed time, including time spent in initialization, in seconds (Time). Moreover, we measure the prediction performance using Top-1 accuracy (Top-1 Acc.) and KT correlation [Kendall 1938] on validation and test sets (c.f. 4.7). For synthetic datasets, we also measure the quality of convergence by the norm of the difference between estimated and true PL scores (\(\triangle \mathbf {\pi }\)); lower values indicate better estimation. We report averages and standard deviations over folds for all algorithms except for the siamese network method: As training takes several hours, we execute only one fold.

4.3 Shallow Regression Competing Methods

We implement¹ seven competing algorithms. Four are feature methods, i.e., algorithms that regress PL scores from features: SR-l2 described in Section 3.3, SR-l2-log described in Section 3.4, sequential least-squares quadratic programming (SLSQP), that solves (16), and Newton on \({{\mathbf {w}}}\), that solves the convex problem (17) via Newton’s method. The remaining three are featureless methods, i.e., algorithms that learn the PL scores from the choice observations alone: ILSR described by Equation (7), the MM algorithm [Hunter 2004], and Newton on \(\mathbf {\theta }\) that solves Equation (3) via the reparametrization \({\pi }_i = e^{\theta _i}\), \(i \in \left[n\right]\) using Newton’s method on \(\mathbf {\theta }=[\theta _i]_{i\in \left[n\right]}\). We expand upon the implementation of all algorithms below.

SR-l2. We compute the stationary distribution at each iteration of ILSRX (c.f. Equation (29)) using the power method [Lei et al. 2016]. As the stopping criterion, we use \(\parallel \!\! {\mathbf {\pi }}^k- {\mathbf {\pi }}^{k-1} \!\!\parallel _2 \,\lt \, {r_\mathtt {tol}} \parallel \!\! {\mathbf {\pi }}^k \!\!\parallel _2\) and \(\parallel \!\! \mathbf {\tilde{X}}\mathbf {w}^k - \mathbf {\tilde{X}}\mathbf {w}^{k-1} \!\!\parallel _2 \,\lt \, {{r_\mathtt {tol}}} \parallel \!\! \mathbf {\tilde{X}}\mathbf {w}^k \!\!\parallel _2\). We set the relative tolerance \(r_\mathtt {tol}=10^{-4}\) for all experiments. We use the same relative tolerance for the stopping criterion of the power method. We set \(\rho =1\) in our experiments, which is a standard choice in the ADMM literature [Boyd et al. 2011]. In our experiments, we consistently observe that Equation (26) is satisfied. That is why, we use \(c_j = \frac{- {\pi }_j \sigma _j}{\sum _{i\in \left[n\right]_{-}} {\pi }_i \sigma _i}\), \(j \in \left[n\right]_{+}\) to calculate the transition rates (27).

SR-l2-log. As the stopping criterion, we use \(\parallel \!\! {\mathbf {\pi }}^k- {\mathbf {\pi }}^{k-1} \!\!\parallel _2 \,\lt \, {r_\mathtt {tol}} \parallel \!\! {\mathbf {\pi }}^k \!\!\parallel _2\) and \(\parallel \!\! e^{\mathbf {X}{{\mathbf {w}}^k}} - e^{\mathbf {X}{{\mathbf {w}}^{k-1}}} \!\!\parallel _2 \,\lt \, {r_\mathtt {tol}} \parallel \!\! e^{\mathbf {X}{{\mathbf {w}}^k}} \!\!\parallel _2\), where exponentiation is applied elementwise.

SLSQP. We initialize SLSQP the same as SR-l2 (c.f. Section 3.3). As stopping criterion, we use \(\parallel \!\! {\mathbf {\pi }}^k- {\mathbf {\pi }}^{k-1} \!\!\parallel _2 \,\lt \, {r_\mathtt {tol}} \parallel \!\! {\mathbf {\pi }}^k \!\!\parallel _2\), where \(\mathbf {\pi }^k = \mathbf {X}\mathbf {a}^k + \mathbf {1} {b}^k\), \(k \in \mathbb {N}\). Each iteration of SLSQP is \(O(\sum _{\ell \in \mathcal {D}}(|A_\ell | (d+1)) + (d+1)^2)\) for constructing the gradient of Equation (15) w.r.t. \(\mathbf {w}\) and updating \(\mathbf {w}\), respectively.

Newton on \({{\mathbf {w}}}\). We initialize Newton on \({{\mathbf {w}}}\) the same as SR-l2-log (c.f. Section 3.4). As stopping criterion, we use \(\parallel \!\! {\mathbf {\pi }}^k- {\mathbf {\pi }}^{k-1} \!\!\parallel _2 \,\lt \, {r_\mathtt {tol}} \parallel \!\! {\mathbf {\pi }}^k \!\!\parallel _2\), where \(\mathbf {\pi }^k = [e^{ \mathbf {x}_i^T \mathbf {w}^k}]_{i\in \left[n\right]}\), \(k \in \mathbb {N}\). Each iteration of Newton on \({{\mathbf {w}}}\) is \(O(\sum _{\ell \in \mathcal {D}}(|A_\ell |\, d^2) + d^2)\) for constructing the Hessian of Equation (15) w.r.t. \(\mathbf {w}\) and updating \(\mathbf {w}\), respectively.

ILSR. We initialize ILSR with \({\mathbf {\pi }}^0 = \frac{1}{n}\mathbf {1}\). We compute the stationary distribution at each iteration of ILSR using the power method. As the stopping criterion, we use \(\parallel \!\! {\mathbf {\pi }}^k- {\mathbf {\pi }}^{k-1} \!\!\parallel _2 \,\lt \, {r_\mathtt {tol}} \parallel \!\! {\mathbf {\pi }}^k \!\!\parallel _2\). Each iteration of ILSR is \(O(\sum _{\ell \in \mathcal {D}}(|A_\ell |) + n^2)\) for constructing the transition matrix \(\mathbf {\Lambda }(\mathbf {\pi })\) (c.f. Equation (6)) and finding the stationary distribution \(\mathbf {\pi }\), respectively.

MM. We initialize MM with \({\mathbf {\pi }}^0 = \frac{1}{n}\mathbf {1}\). As the stopping criterion, we use \(\parallel \!\! {\mathbf {\pi }}^k- {\mathbf {\pi }}^{k-1} \!\!\parallel _2 \,\lt \, {r_\mathtt {tol}} \parallel \!\! {\mathbf {\pi }}^k \!\!\parallel _2\). Each iteration of MM is \(O(\sum _{\ell \in \mathcal {D}}(|A_\ell |))\).

Newton on \(\mathbf {\theta }\). We initialize Newton on \(\mathbf {\theta }\) with \({\mathbf {\theta }}^0 = [\theta _i^0]_{i\in \left[n\right]} = \mathbf {0}\). As stopping criterion, we use \(\parallel \!\! {\mathbf {\pi }}^k- {\mathbf {\pi }}^{k-1} \!\!\parallel _2 \,\lt \, {r_\mathtt {tol}} \parallel \!\! {\mathbf {\pi }}^k \!\!\parallel _2\), where \({\mathbf {\pi }}^k_i=e^{\theta _i^k}\), \(i \in \left[n\right]\), \(k \in \mathbb {N}\). Each iteration of Newton on \(\mathbf {\theta }\) is \(O(\sum _{\ell \in \mathcal {D}}(|A_\ell |^2) + n^2)\) for constructing the Hessian of Equation (15) w.r.t. \(\mathbf {\theta }\) and updating \(\mathbf {\theta }\), respectively.

4.4 DNN Regression Competing Methods

We implement² five competing algorithms. DSR with KL penalty (DSR-KL), DSR with \(\ell _2\) norm penalty (DSR-l2), and siamese network competitor regress scores via a DNN \(\tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w})\) with parameters \(\mathbf {w}\). SR-l2 and SR-KL use an affine regressor \(\tilde{\mathbf {\pi }}=\mathbf {X}\mathbf {a}+ \mathbf {1} {b}\) with parameters \(\mathbf {a}\) and \({b}\), and solve (19) via ADMM with \(\ell _2\) and KL penalties, respectively.

DSR-KL and DSR-l2 are explained in Section 3.5. We compute the stationary distribution at each iteration of ILSRX (c.f. Equation (29)) using the power method [Lei et al. 2016]. At each ADMM iteration, we fine-tune the DNN regressor \(\tilde{{\pi }}(\mathbf {X}; \mathbf {w})\) via Adam optimization [Kingma and Ba 2015] over Equation (22). We check the convergence of \(\mathcal {L}\) on the training set and KT correlation evaluations on the validation set (c.f. 4.7) as the stopping criterion for: (i) fine-tuning \(\tilde{{\pi }}(\mathbf {X}; \mathbf {w})\) at each ADMM iteration, and (ii) overall DSR-KL algorithm. We declare convergence when KT on validation set does not change for 5 iterations, for maximum total of 50 iterations, with relative tolerance \(r_\mathtt {tol}=10^{-4}\). We use the same relative tolerance for the stopping criterion of the power method.

To aid convergence in practice [Boyd et al. 2011], we update the dual variable at each ADMM iteration with a multiplicative smoothing parameter \(\gamma ^k=1/k\). We also adapt the penalty parameter as

\begin{align*} \rho ^{k+1} &= {\left\lbrace \begin{array}{ll} \tau \rho ^k,&\!\!\text{if} \parallel \! \tilde{\mathbf {\pi }}^k - \mathbf {\pi }^k \!\parallel _2^2 \gt \beta \parallel \! \tilde{\mathbf {\pi }}^k \!-\! \tilde{\mathbf {\pi }}^{k-1} \!\parallel _2^2\\ \frac{\rho ^k}{\tau }, &\!\!\text{if} \parallel \! \tilde{\mathbf {\pi }}^k \!-\! \mathbf {\pi }^k \!\parallel _2^2 \lt \beta \parallel \! \tilde{\mathbf {\pi }}^k \!-\! \tilde{\mathbf {\pi }}^{k-1} \!\parallel _2^2\\ \rho ^k, & \!\!\text{otherwise,} \end{array}\right.} \end{align*}

where \(\tau =2\) and \(\beta =10\).

The siamese network competitor minimizes Equation (15) w.r.t. \(\mathbf {w}\) via SGD. We train a siamese network architecture with base network \(\tilde{{\pi }}\) on observations \(\mathcal {D}\) via Adam optimization [Kingma and Ba 2015] over Equation (15). We employ the same convergence criterion and experiment setup as DSR-KL and DSR-l2 to train and optimize the siamese network.

Finally, we implement two spectral algorithms that regress Plackett-scores via an affine model, i.e., \(\tilde{\mathbf {\pi }}=\mathbf {X}\mathbf {a}+ \mathbf {1}{b}\): SR-l2, and its variant SR-KL that uses KL proximal penalty instead of \(\ell _2\) norm penalty for ADMM. As the stopping criterion for both algorithms, we use \(\parallel \!\! {\mathbf {\pi }}^k- {\mathbf {\pi }}^{k-1} \!\!\parallel _2 \,\lt \, {r_\mathtt {tol}} \parallel \!\! {\mathbf {\pi }}^k \!\!\parallel _2\) and \(\parallel \!\! (\mathbf {X}\mathbf {a}^k + \mathbf {1}{b}^k) - (\mathbf {X}\mathbf {a}^{k-1} +\mathbf {1} {b}^{k-1} \!\!\parallel _2 \,\lt \, {{r_\mathtt {tol}}} \parallel \!\! \mathbf {X}\mathbf {a}^k + \mathbf {1}{b}^k \!\!\parallel _2\). We set the ADMM penalty parameter as \(\rho =1\).

4.5 DNN Architecture and Training Details

To evaluate each method on ROP-img, we choose \(\tilde{{\pi }}(\mathbf {X}; \mathbf {w})\) as a state-of-the-art convolutional neural network architecture, namely GoogLeNet [Szegedy et al. 2015], followed by a fully connected output layer comprising a single neuron with sigmoid activation. We initialize the convolutional layers with weights pre-trained on the ImageNet dataset [Deng et al. 2009]. For other datasets, we design \(\tilde{{\pi }}(\mathbf {X}; \mathbf {w})\) as a fully connected architecture with relu activation for hidden layers and an output layer comprising a single neuron with sigmoid activation. We add \(\ell _2\) regularizers to all layers. For each configuration of (i) \(\ell _2\) regularization parameter varying in \([2\times 10^{-5},2\times 10^{-2}]\), (ii) learning rate varying in \([10^{-4},10^{-2}]\), and (iii) number of layers varying in \([1,10]\) for fully connected \(\tilde{{\pi }}(\mathbf {X}; \mathbf {w})\), we run each method until convergence (c.f. 4.4 for criteria). We determine the best set of hyperparameters w.r.t. the prediction performance via cross validation (c.f. 4.2).

4.6 Execution Environment

In Tables 3–5, we measure timing on an Intel Xeon CPU E5-2680v2 2.8 GHz with 128 GB RAM. Particularly for experiments on larger synthetic datasets (c.f. Figures 3–4), we use an Intel Xeon CPU E5-2680v4 2.4 GHz with 500 GB RAM. For all DNN regression experiments, we employ NVIDIA GPUs with Intel Gold [email protected] GHz CPUs and 128 GB RAM.

Fig. 3.

Fig. 4.

Table 3.

Partitioning	Method	Training Metrics			Performance Metrics on the Test Set
		Time (s) \(\downarrow\)	Iter. \(\downarrow\)	\(\triangle \mathbf {\pi }\) \(\downarrow\)	Top-1 Acc. \(\uparrow\)	KT \(\uparrow\)
Sample Partitioning	SR-l2	\(0.237 \pm 0.006\)	\(4 \pm 0\)	\(0.717 \pm 0.207\)	\(0.831 \pm 0.119\)	\(0.609 \pm 0.247\)
	SR-l2-log	\(1.428 \pm 2.595\)	\(49 \pm 79\)	\(0.845 \pm 0.204\)	\(0.668 \pm 0.159\)	\(0.335 \pm 0.318\)
	ILSR (no \(\mathbf {X}\))	\(0.045 \pm 0.002\)	\(2 \pm 0\)	\(0.718 \pm 0.207\)	\(0.5 \pm 0.0\)	\(-1.0 \pm 0.0\)
	MM (no \(\mathbf {X}\))	\(9.728 \pm 0.487\)	\(500 \pm 0\)	\(1.2 \pm 0.1\)	\(0.5 \pm 0.0\)	\(0.0 \pm 0.0\)
	Newton on \(\mathbf {\theta }\) (no \(\mathbf {X}\))	\(4.537 \pm 0.729\)	\(14 \pm 3\)	\(1.236 \pm 0.132\)	\(0.5 \pm 0.0\)	\(-0.08 \pm 0.272\)
	Newton on \({{\mathbf {w}}}\)	\(6.406 \pm 2.104\)	\(14 \pm 5\)	\(0.808 \pm 0.462\)	\(0.844 \pm 0.148\)	\(0.688 \pm 0.296\)
	SLSQP	\(43.908 \pm 24.469\)	\(229 \pm 132\)	\(0.718 \pm 0.206\)	\(0.796 \pm 0.106\)	\(0.592 \pm 0.211\)
Rank Partitioning	SR-l2	\(0.48 \pm 0.24\)	\(4 \pm 0\)	\(0.717 \pm 0.207\)	\(0.837 \pm 0.037\)	\(0.569 \pm 0.072\)
	SR-l2-log	\(1.58 \pm 2.027\)	\(29 \pm 14\)	\(0.883 \pm 0.208\)	\(0.699 \pm 0.066\)	\(0.398 \pm 0.132\)
	ILSR (no \(\mathbf {X}\))	\(0.098 \pm 0.056\)	\(2 \pm 0\)	\(0.718 \pm 0.208\)	\(0.708 \pm 0.045\)	\(0.389 \pm 0.088\)
	MM (no \(\mathbf {X}\))	\(11.302 \pm 0.515\)	\(500 \pm 0\)	\(0.864 \pm 0.19\)	\(0.685 \pm 0.037\)	\(0.354 \pm 0.074\)
	Newton on \(\mathbf {\theta }\) (no \(\mathbf {X}\))	\(8.218 \pm 1.782\)	\(14 \pm 3\)	\(1.244 \pm 0.121\)	\(0.506 \pm 0.029\)	\(0.01 \pm 0.05\)
	Newton on \({{\mathbf {w}}}\)	\(7.696 \pm 2.35\)	\(14 \pm 4\)	\(0.804 \pm 0.463\)	\(0.871 \pm 0.087\)	\(0.742 \pm 0.173\)
	SLSQP	\(47.824 \pm 28.585\)	\(219 \pm 138\)	\(0.718 \pm 0.206\)	\(0.819 \pm 0.035\)	\(0.637 \pm 0.07\)

Table 3. Evaluations on a Synthetic Dataset with \(n=1,000\), \(d=100\), and \(m=1,000\), Partitioned w.r.t. Sample Partitioning and Rank Partitioning

We report the convergence time (Time), number of iterations until convergence (Iter), norm error in estimating true PL scores ( \(\triangle \mathbf {\pi }\)), top-1 accuracy on the test set (Top-1 Acc.), and Kendall-Tau correlation on the test set (KT). ILSR, MM, and Newton on \(\mathbf {\theta }\) do not use the features \(\mathbf {X}\). Newton on \({{\mathbf {w}}}\) and SLSQP regress \(\mathbf {\pi }\) from \(\mathbf {X}\).

4.7 Performance Metrics

We measure the prediction performance by Top-1 accuracy (Top-1 Acc.) and KT correlation on the test set. Let the test set be \(\mathcal {D}_{\mathtt {rank}} = \lbrace (\alpha ^\ell , A_\ell) \,|\, \ell \in \lbrace 1,\ldots ,m_{\mathtt {test}}\rbrace \rbrace\), where \(\alpha ^\ell = \alpha _{1}^\ell \succ \alpha _{2}^\ell \succ \cdots \succ \alpha ^\ell _{K}\) is an ordered sequence of the samples in \(A_\ell\). Given \(A_\ell\), we predict the \(\ell\)th choice as \(\hat{c}_\ell = \text{arg,max}_{i \in A_\ell } {\pi }_i\). We calculate the Top-1 accuracy (Top-1 Acc.) as

\begin{align} \text{Top-1 Acc.} = \frac{\sum _{\ell =1}^{m_{\mathtt {test}}} \mathbb {1}(\hat{c}_\ell = \alpha _{1}^\ell)}{m_{\mathtt {test}}} \in [0,1]. \end{align}

(51)

We also predict the ranking as \(\hat{\alpha }^\ell = \text{arg,sort}[{\pi }_i]_{i \in A_\ell }\), i.e., sequence of the samples in \(A_\ell\) ordered w.r.t. their estimated scores. We calculate KT correlation [Kendall 1938] as a measure of the correlation between each true ranking \(\alpha ^\ell\) and predicted ranking \(\hat{\alpha }^\ell\), \(\ell \in \lbrace 1,\ldots ,m_{\mathtt {test}}\rbrace\). For observation \(\ell\), let \(T_\ell = \sum _{t=1}^{K} \sum _{s=1}^{K} \mathbb {1}(\hat{\alpha }_t^\ell \succ \hat{\alpha }_s^\ell \wedge \alpha _t^\ell \succ \alpha _s^\ell)\) be the number correctly predicted ranking positions, and \(F_\ell = \sum _{t=1}^{K} \sum _{s=1}^{K} \mathbb {1}(\hat{\alpha }_t^\ell \succ \hat{\alpha }_s^\ell \wedge \alpha _s^\ell \succ \alpha _t^\ell)\) be the number incorrectly predicted ranking positions. Then, KT is computed by

\begin{align} \text{KT} = \frac{\sum _{\ell =1}^{m_{\mathtt {test}}} (T_\ell - F_\ell)/ \binom{K}{2}}{m_{\mathtt {test}}} \in [-1, 1], \end{align}

(52)

where \(\left({{K}\atop {2}}\right)\) is the number of sample pairs.

4.8 Shallow Regression Results

Sample partitioning. We begin with the experiments on sample partitioning, in which we partition samples \(\left[n\right]\) into training and test sets. Table 3 shows the evaluations of all algorithms trained on a synthetic dataset with \(n=1,000\) samples, \(d=100\) features, \(m=1,000\) observations, and query size \(|A_\ell |=2\). SR-l2 and SR-l2-log converge 4–27 times faster than other feature methods, i.e., Newton on \({{\mathbf {w}}}\) and SLSQP. Recall that in sample partitioning, training observations contain only training samples: Test samples do not participate in any of the training observations. Thus, featureless methods ILSR, MM, and Newton on \(\mathbf {\theta }\) are no better than random predictors, with 0.5 Top-1 Acc. and 0.0 KT. By regressing the PL scores from features, SR-l2 and SR-l2-log significantly outperform the predictions of ILSR, MM, and Newton on \(\mathbf {\theta }\), by 16%–33% Top-1 Acc. and 16%– 30% KT.

Real datasets. We observe an equally significant speed gain on real datasets; Table 4 shows the evaluations on four real datasets partitioned w.r.t. sample partitioning. SR-l2 and SR-l2-log are 3–18 times faster than Newton on \({{\mathbf {w}}}\) and SLSQP. This speed gain is fundamentally due to the smaller per iteration complexity of SR-l2 and SR-l2-log. For instance, compared to Newton on \({{\mathbf {w}}}\), SR-l2-log requires about 2 times more iterations, but still converges 3 times faster than Newton on \({{\mathbf {w}}}\) on FAC. Moreover, while significantly decreasing the convergence time, SR-l2 or SR-l2-log consistently attain similar prediction performance to Newton on \({{\mathbf {w}}}\) and SLSQP, except for Sushi, for which they perform slightly worse (by \(13\%\)–\(20\%\) Top-1 Acc.), though the convergence time dividends are striking in comparison (\(78\%\)–\(95\%\)).

Table 4.

Dataset	Method	Training Metrics		Performance Metrics on the Test Set
Dataset	Method	Time (s) \(\downarrow\)	Iter. \(\downarrow\)	Top-1 Acc. \(\uparrow\)	KT \(\uparrow\)
FAC	SR-l2	\(0.301 \pm 0.048\)	\(4 \pm 0\)	\(0.654 \pm 0.237\)	\(0.307 \pm 0.473\)
	SR-l2-log	\(0.298 \pm 0.466\)	\(10 \pm 15\)	\(0.685 \pm 0.237\)	\(0.369 \pm 0.474\)
	ILSR (no \(\mathbf {X}\))	\(0.059 \pm 0.016\)	\(2 \pm 0\)	\(0.5 \pm 0.0\)	\(-1.0 \pm 0.0\)
	MM (no \(\mathbf {X}\))	\(5.905 \pm 0.282\)	\(500 \pm 0\)	\(0.5 \pm 0.0\)	\(0.0 \pm 0.0\)
	Newton on \(\mathbf {\theta }\) (no \(\mathbf {X}\))	\(7.604 \pm 0.805\)	\(18 \pm 2\)	\(0.5 \pm 0.0\)	\(-0.4 \pm 0.49\)
	Newton on \({\mathbf {w}}\)	\(0.859 \pm 0.077\)	\(6 \pm 1\)	\(0.67 \pm 0.17\)	\(0.34 \pm 0.339\)
	SLSQP	\(14.332 \pm 5.684\)	\(178 \pm 67\)	\(0.675 \pm 0.147\)	\(0.349 \pm 0.293\)
ROP-num	SR-l2	\(1.708 \pm 0.166\)	\(4 \pm 0\)	\(0.783 \pm 0.03\)	\(0.565 \pm 0.06\)
	SR-l2-log	\(0.325 \pm 0.028\)	\(1 \pm 0\)	\(0.724 \pm 0.105\)	\(0.448 \pm 0.209\)
	ILSR (no \(\mathbf {X}\))	\(0.649 \pm 0.053\)	\(2 \pm 0\)	\(0.5 \pm 0.0\)	\(-1.0 \pm 0.0\)
	MM (no \(\mathbf {X}\))	\(0.001 \pm 0.001\)	\(1 \pm 0\)	\(0.5 \pm 0.0\)	\(-1.0 \pm 0.0\)
	Newton on \(\mathbf {\theta }\) (no \(\mathbf {X}\))	\(68.924 \pm 5.521\)	\(8 \pm 0\)	\(0.497 \pm 0.012\)	\(-0.988 \pm 0.036\)
	Newton on \({\mathbf {w}}\)	\(47.563 \pm 8.342\)	\(2 \pm 1\)	\(0.552 \pm 0.048\)	\(0.103 \pm 0.096\)
	SLSQP	\(4.823 \pm 4.914\)	\(2 \pm 1\)	\(0.769 \pm 0.052\)	\(0.538 \pm 0.104\)
Pairwise Sushi	SR-l2	\(0.046 \pm 0.01\)	\(4 \pm 0\)	\(0.451 \pm 0.082\)	\(-0.09 \pm 0.177\)
	SR-l2-log	\(0.141 \pm 0.025\)	\(27 \pm 13\)	\(0.532 \pm 0.076\)	\(0.064 \pm 0.152\)
	ILSR (no \(\mathbf {X}\))	\(0.014 \pm 0.006\)	\(2 \pm 0\)	\(0.5 \pm 0.0\)	\(-1.0 \pm 0.0\)
	MM (no \(\mathbf {X}\))	\(1.513 \pm 0.587\)	\(352 \pm 183\)	\(0.5 \pm 0.0\)	\(-1.0 \pm 0.0\)
	Newton on \(\mathbf {\theta }\) (no \(\mathbf {X}\))	\(1.282 \pm 0.924\)	\(18 \pm 9\)	\(0.5 \pm 0.0\)	\(-0.666 \pm 0.472\)
	Newton on \({\mathbf {w}}\)	\(0.21 \pm 0.115\)	\(4 \pm 2\)	\(0.665 \pm 0.035\)	\(0.33 \pm 0.069\)
	SLSQP	\(4.619 \pm 6.321\)	\(168 \pm 235\)	\(0.624 \pm 0.065\)	\(0.248 \pm 0.13\)
Triplet Sushi	SR-l2	\(0.091 \pm 0.02\)	\(4 \pm 0\)	\(0.358 \pm 0.805\)	\(-0.333 \pm 0.924\)
	SR-l2-log	\(0.556 \pm 0.276\)	\(40 \pm 25\)	\(0.393 \pm 0.826\)	\(0.096 \pm 1.069\)
	ILSR (no \(\mathbf {X}\))	\(0.033 \pm 0.012\)	\(2 \pm 0\)	\(0.334 \pm 0.0\)	\(-0.047 \pm 1.089\)
	MM (no \(\mathbf {X}\))	\(1.824 \pm 0.701\)	\(267 \pm 151\)	\(0.334 \pm 0.0\)	\(0.0 \pm 0.0\)
	Newton on \(\mathbf {\theta }\) (no \(\mathbf {X}\))	\(2.728 \pm 1.475\)	\(13 \pm 3\)	\(0.334 \pm 0.0\)	\(-0.047 \pm 1.089\)
	Newton on \({\mathbf {w}}\)	\(1.966 \pm 3.158\)	\(10 \pm 19\)	\(0.322 \pm 0.802\)	\(-0.261 \pm 0.956\)
	SLSQP	\(1.656 \pm 1.793\)	\(20 \pm 30\)	\(0.608 \pm 0.826\)	\(0.358 \pm 0.928\)

Table 4. Evaluations on Real Datasets Partitioned w.r.t. Sample Partitioning

We report the convergence time (Time), number of iterations until convergence (Iter), top-1 accuracy on the test set (Top-1 Acc.), and Kendall-Tau correlation on the test set (KT). ILSR, MM, and Newton on \(\mathbf {\theta }\) do not use the features \(\mathbf {X}\). Newton on \({{\mathbf {w}}}\) and SLSQP regress \(\mathbf {\pi }\) from \(\mathbf {X}\).

Aligned with the prediction performance on synthetic datasets, featureless methods ILSR, MM, and Newton on \(\mathbf {\theta }\) can only attain 0.5 Top-1 Acc. and 0.0 KT. By regressing the PL scores from features, SR-l2 and SR-l2-log significantly outperform the predictions of ILSR, MM, and Newton on \(\mathbf {\theta }\), by \(3\%\)–\(31\%\) Top-1 Acc. and \(5\%\)–\(78\%\) KT.

Rank partitioning. A sample can appear in both training and test observations in rank partitioning. Hence, featureless methods ILSR, MM, and Newton on \(\mathbf {\theta }\) should fare better than in sample partitioning. Nonetheless, in Table 3, as \(n=1,000\) is larger than \(d=100\), there are more scores to learn than parameters. As a result, feature methods are advantageous for good predictions compared to featureless methods. Particularly, SR-l2 and SR-l2-log outperform the predictions of ILSR, MM, and Newton on \(\mathbf {\theta }\) in rank partitioning on Table 3, by \(13\%\) Top-1 Acc. and \(9\%\) KT. The relative performance of feature vs. featureless methods is governed by the relationship among \(n\), \(d\), and \(m\). We do not include Newton on \(\mathbf {\theta }\) and MM in this analysis, as they are too slow.

Impact of \(d\). To assess the impact of number of parameters, we fix \(n=1,000\), \(m=250\), \(|A_\ell |=2\) and generate synthetic datasets with \(d\in \lbrace 10,100,1000,10000\rbrace\). Figure 3(a) shows the Time and Top-1 Acc. of SR-l2, SR-l2-log, ILSR, and Newton on \({{\mathbf {w}}}\). As \(m=250\) observations are not enough to learn \(n=1,000\) scores, SR-l2 leads to significantly better Top-1 Acc. compared to ILSR. When \(d=10\), SR-l2 and SR-l2-log outperform ILSR by \(18\%\)–\(28\%\) Top-1 Acc. Moreover, SR-l2 and SR-l2-log are consistently faster than Newton on \({{\mathbf {w}}}\), for all \(d\gt 100\). Particularly, for \(d=10,000\), SR-l2 and SR-l2-log converge 42–579 times faster than Newton on \({{\mathbf {w}}}\). Interestingly, the convergence time of SR-l2-log can even decrease with increasing \(d\). This is because the number of iterations until convergence decreases. While significantly decreasing the convergence time, SR-l2 consistently attains better Top-1 Acc. than Newton on \({{\mathbf {w}}}\), up to \(8\%\) for \(d=100\). Impact of \(n\). To assess the impact of number of samples, we fix \(d=100\), \(m=250\), \(|A_\ell |=2\) and generate synthetic datasets with \(n\in \lbrace 50,100,1000,10000,100000\rbrace\); Figure 3(b) shows evaluations on the resulting datasets. For \(n\gt d=100\), i.e., when there are more scores to learn than parameters, SR-l2 leads to significantly better Top-1 Acc. compared to ILSR. Particularly, for \(n=100,000\), SR-l2 outperforms ILSR by \(25\%\) Top-1 Acc. This confirms that, especially when the number of observations \(m\) is not sufficient to learn \(n\) scores, exploiting the features associated with the samples is crucial in attaining good prediction performance. As expected, convergence time of Newton on \({{\mathbf {w}}}\) is not significantly affected by \(n\). Despite this, SR-l2 and SR-l2-log are faster than Newton on \({{\mathbf {w}}}\) for all \(n\lt 1,\!000\). Particularly, for \(n=50\), SR-l2 and SR-l2-log converge 2–75 times faster than Newton on \({{\mathbf {w}}}\). While decreasing the convergence time, SR-l2 consistently attains better Top-1 Acc. than Newton on \({{\mathbf {w}}}\), up to \(19\%\) for \(n=100,\!000\).

Impact of \(m\). Figure 4 shows the convergence time (Time) and top-1 test accuracy (Top-1 Acc.) of SR-l2, SR-l2-log, ILSR, and Newton on \({{\mathbf {w}}}\) when trained on synthetic datasets with number of observations \(m\in \lbrace 10,100,1000,10000,100000\rbrace\). Observations are partitioned w.r.t. rank partitioning, where number of samples is \(n=1,\!000\), number of parameters is \(d=100\), and size of each query is \(|A_\ell |=2\). As \(n\gt d\), SR-l2 benefits from being able to regress \(n\) scores from a smaller number of \(d\) parameters and leads to significantly better Top-1 Acc compared to ILSR in Figure 4. Especially when \(m\) is not enough to learn \(n=1,000\) scores, but to learn \(d=100\) parameters, SR-l2 gains the most performance advantage over ILSR, up to \(13\%\) Top-1 Acc. Moreover, SR-l2 and SR-l2-log are consistently faster than Newton on \({{\mathbf {w}}}\), for all number of observations \(m\gt 100\). Particularly, for \(m=100,000\), SR-l2 and SR-l2-log converge 4–60 times faster than Newton on \({{\mathbf {w}}}\).

Real datasets. Performance agrees with observations above regarding the dependence on \(n\) and \(d\). For datasets where \(n\gt m\gt d\), e.g., FAC, SR-l2, and SR-l2-log significantly outperform the predictions of ILSR, by \(10\%\) Top-1 Acc. and \(25\%\) KT. For datasets where \(m\) is much larger than \(n\) (c.f. Table 2), feature methods lead to similar prediction performance to each other and slightly lower performance than ILSR, MM, and Newton on \(\mathbf {\theta }\). Overall, SR-l2 and SR-l2-log consistently converge faster than Newton on \({{\mathbf {w}}}\) and SLSQP, by 3–27 times across all real datasets. Table 5 also confirms the observations from Figure 3(a) and (b) regarding the effect of \(n\) and \(d\). For FAC, \(n=1000 \gt m=728 \gt d=50\), i.e., there are enough observations to learn parameters, but not scores. As a result, SR-l2 and SR-l2-log significantly outperform the predictions of ILSR, by \(10\%\) Top-1 Acc. and \(25\%\) KT. For the other real datasets, \(m\) is much larger than \(n\) (c.f. Table 2). Thus, feature methods lead to similar prediction performance to each other and lower performance than ILSR, MM, and Newton on \(\mathbf {\theta }\).

Table 5.

Dataset	Method	Training Metrics		Performance Metrics on the Test Set
Dataset	Method	Time (s) \(\downarrow\)	Iter. \(\downarrow\)	Top-1 Acc. \(\uparrow\)	KT \(\uparrow\)
FAC	SR-l2	\(0.352 \pm 0.044\)	\(4 \pm 0\)	\(0.68 \pm 0.048\)	\(0.35 \pm 0.089\)
	SR-l2-log	\(0.17 \pm 0.033\)	\(4 \pm 0\)	\(0.691 \pm 0.054\)	\(0.378 \pm 0.11\)
	ILSR (no \(\mathbf {X}\))	\(0.066 \pm 0.012\)	\(2 \pm 0\)	\(0.591 \pm 0.067\)	\(-0.13 \pm 0.164\)
	MM (no \(\mathbf {X}\))	\(10.7 \pm 0.501\)	\(500 \pm 0\)	\(0.544 \pm 0.046\)	\(0.046 \pm 0.087\)
	Newton on \(\mathbf {\theta }\) (no \(\mathbf {X}\))	\(9.152 \pm 1.284\)	\(17 \pm 3\)	\(0.5 \pm 0.0\)	\(0.0 \pm 0.0\)
	Newton on \({\mathbf {w}}\)	\(1.531 \pm 0.169\)	\(6 \pm 1\)	\(0.701 \pm 0.04\)	\(0.398 \pm 0.08\)
	SLSQP	\(22.73 \pm 19.151\)	\(160 \pm 135\)	\(0.689 \pm 0.063\)	\(0.375 \pm 0.125\)
ROP-num	SR-l2	\(1.953 \pm 0.217\)	\(4 \pm 0\)	\(0.896 \pm 0.005\)	\(0.791 \pm 0.009\)
	SR-l2-log	\(0.359 \pm 0.027\)	\(1 \pm 0\)	\(0.904 \pm 0.005\)	\(0.807 \pm 0.01\)
	ILSR (no \(\mathbf {X}\))	\(0.716 \pm 0.058\)	\(2 \pm 0\)	\(0.891 \pm 0.005\)	\(0.781 \pm 0.009\)
	MM (no \(\mathbf {X}\))	\(356.497 \pm 29.11\)	\(500 \pm 0\)	\(0.905 \pm 0.004\)	\(0.81 \pm 0.008\)
	Newton on \(\mathbf {\theta }\) (no \(\mathbf {X}\))	\(85.42 \pm 6.849\)	\(9 \pm 0\)	\(0.906 \pm 0.004\)	\(0.811 \pm 0.008\)
	Newton on \({\mathbf {w}}\)	\(55.718 \pm 6.293\)	\(2 \pm 0\)	\(0.904 \pm 0.005\)	\(0.808 \pm 0.009\)
	SLSQP	\(9.595 \pm 7.136\)	\(2 \pm 1\)	\(0.683 \pm 0.049\)	\(0.366 \pm 0.098\)
Pairwise Sushi	SR-l2	\(0.061 \pm 0.002\)	\(4 \pm 0\)	\(0.669 \pm 0.034\)	\(0.338 \pm 0.068\)
	SR-l2-log	\(0.764 \pm 1.192\)	\(58 \pm 30\)	\(0.634 \pm 0.075\)	\(0.267 \pm 0.15\)
	ILSR (no \(\mathbf {X}\))	\(0.027 \pm 0.003\)	\(2 \pm 0\)	\(0.763 \pm 0.039\)	\(0.521 \pm 0.084\)
	MM (no \(\mathbf {X}\))	\(5.191 \pm 0.345\)	\(490 \pm 31\)	\(0.773 \pm 0.048\)	\(0.543 \pm 0.094\)
	Newton on \(\mathbf {\theta }\) (no \(\mathbf {X}\))	\(2.342 \pm 0.689\)	\(18 \pm 5\)	\(0.735 \pm 0.095\)	\(0.465 \pm 0.185\)
	Newton on \({\mathbf {w}}\)	\(0.176 \pm 0.17\)	\(2 \pm 2\)	\(0.685 \pm 0.044\)	\(0.369 \pm 0.087\)
	SLSQP	\(16.198 \pm 8.728\)	\(245 \pm 134\)	\(0.64 \pm 0.06\)	\(0.28 \pm 0.119\)
Triplet Sushi	SR-l2	\(0.127 \pm 0.007\)	\(4 \pm 0\)	\(0.569 \pm 0.035\)	\(0.218 \pm 0.045\)
	SR-l2-log	\(0.804 \pm 0.349\)	\(36 \pm 18\)	\(0.487 \pm 0.034\)	\(0.19 \pm 0.072\)
	ILSR (no \(\mathbf {X}\))	\(0.054 \pm 0.003\)	\(2 \pm 0\)	\(0.678 \pm 0.036\)	\(0.454 \pm 0.06\)
	MM (no \(\mathbf {X}\))	\(15.349 \pm 0.617\)	\(500 \pm 0\)	\(0.715 \pm 0.035\)	\(0.522 \pm 0.059\)
	Newton on \(\mathbf {\theta }\) (no \(\mathbf {X}\))	\(5.122 \pm 0.34\)	\(14 \pm 1\)	\(0.73 \pm 0.036\)	\(0.496 \pm 0.089\)
	Newton on \({\mathbf {w}}\)	\(1.12 \pm 0.659\)	\(3 \pm 2\)	\(0.605 \pm 0.058\)	\(0.285 \pm 0.062\)
	SLSQP	\(21.738 \pm 39.761\)	\(107 \pm 197\)	\(0.521 \pm 0.043\)	\(0.191 \pm 0.059\)

Table 5. Evaluations on Real Datasets Partitioned w.r.t. Rank Partitioning

We report the convergence time in seconds (Time), number of iterations until convergence (Iter), top-1 accuracy on the test set (Top-1 Acc.), and Kendall-Tau correlation on the test set (KT). ILSR, MM, and Newton on \(\mathbf {\theta }\) learn the PL scores \(\mathbf {\pi }\) from the choice observations alone and do not use the features \(\mathbf {X}\). Newton on \({{\mathbf {w}}}\) and SLSQP regress \(\mathbf {\pi }\) from \(\mathbf {X}\) (c.f. Section 4.3).

4.9 DNN Regression Results

Training Time vs. Prediction Performance. Figure 5(a) and (b) show the training time of DSR-KL and siamese network vs. Top-1 Acc. and KT on test sets of Movehub-Cost-4, Movehub-Quality-4, and IMDB-4 datasets, partitioned with rank partitioning. For all datasets, DSR-KL results lie on the top left region of both Top-1 Acc. and KT plots: DSR-KL consistently leads to much faster training and better predictions than the siamese counterpart.

Fig. 5.

Columns 3–5 in Table 6 show the training time and test set prediction performance of DSR-KL, DSR-l2, siamese network, SR-l2, and SR-KL trained on datasets partitioned with rank partition- ing. DSR-KL and DSR-l2 are 1.5–142 times faster than siamese network over all datasets. Moreover, DSR-KL consistently attains equivalent or better prediction performance than both siamese network and DSR-l2 w.r.t. both Top-1 Acc. and KT. DSR-KL leads to particularly better performance in ranking predictions, by up to \(6\%\) higher KT than siamese and \(25\%\) higher KT than DSR-l2 on IMDB-4.

Table 6.

Dataset	Method	Rank Partitioning			Sample Partitioning
		Time (s) \(\downarrow\)	Performance on the Test Set		Time (s) \(\downarrow\)	Performance on the Test Set
		Time (s) \(\downarrow\)	Top-1 Acc. \(\uparrow\)	KT \(\uparrow\)	Time (s) \(\downarrow\)	Top-1 Acc. \(\uparrow\)	KT \(\uparrow\)
ICLR-3	DSR-KL	152.86 \(\pm \; 29.98\)	0.9 \(\pm \; 0.0\)	0.86 \(\pm \;0.0\)	145.76 \(\pm \;9.78\)	0.48 \(\pm \;0.06\)	0.28 \(\pm \;0.09\)
	DSR-l2	165.59 \(\pm \;22.63\)	0.79 \(\pm \;0.0\)	0.5 \(\pm \;0.0\)	122.92 \(\pm \;75.43\)	0.51 \(\pm \;0.02\)	0.07 \(\pm \; 0.08\)
	Siamese	1445.77	0.88	0.8	827.05	0.48	0.05
	SR-KL	529.02 \(\pm \;117.26\)	0.37 \(\pm \;0.0\)	0.02 \(\pm \;0.0\)	96.49 \(\pm \;90.02\)	0.48 \(\pm \;0.0\)	0.0 \(\pm \;0.0\)
	SR-l2	20.59 \(\pm \;3.02\)	0.87 \(\pm \;0.0\)	0.82 \(\pm \;0.0\)	4.91 \(\pm \;4.6\)	0.47 \(\pm \;0.0\)	0.06 \(\pm \;0.0\)
Movehub-Cost-4	DSR-KL	49.54 \(\pm \;34.2\)	0.88 \(\pm \;0.07\)	0.85 \(\pm \;0.09\)	17.65 \(\pm \;7.68\)	0.61 \(\pm \;0.07\)	0.45 \(\pm \;0.12\)
	DSR-l2	73.38 \(\pm \;32.82\)	0.72 \(\pm \;0.02\)	0.58 \(\pm \;0.03\)	19.85 \(\pm \;3.7\)	0.05 \(\pm \;0.08\)	\(-\)0.3 \(\pm \;0.22\)
	Siamese	523.64	0.87	0.82	216.22	0.6	0.66
	SR-KL	29.64 \(\pm \;3.59\)	0.47 \(\pm \;0.0\)	0.27 \(\pm \;0.0\)	7.45 \(\pm \;0.3\)	0.31 \(\pm \;0.0\)	0.44 \(\pm \;0.0\)
	SR-l2	19.7 \(\pm \;0.72\)	0.8 \(\pm \;0.0\)	0.54 \(\pm \;0.0\)	4.13 \(\pm \;0.2\)	0.5 \(\pm \;0.0\)	0.71 \(\pm \;0.0\)
Movehub-Quality-4	DSR-KL	63.05 \(\pm \;4.14\)	0.88 \(\pm \;0.0\)	0.84 \(\pm \;0.0\)	31.41 \(\pm \;5.02\)	0.88 \(\pm \;0.0\)	0.74 \(\pm \;0.05\)
	DSR-l2	531.86 \(\pm \;212.91\)	0.82 \(\pm \;0.07\)	0.75 \(\pm \;0.08\)	34.72 \(\pm \;8.68\)	0.67 \(\pm \;0.11\)	0.81 \(\pm \;0.31\)
	Siamese	710.35	0.88	0.82	258.87	0.88	0.88
	SR-KL	98.02 \(\pm \;3.73\)	0.17 \(\pm \;0.0\)	\(-\)0.1 \(\pm \;0.0\)	7.68 \(\pm \;0.2\)	0.48 \(\pm \;0.0\)	0.05 \(\pm \;0.0\)
	SR-l2	18.98 \(\pm \;4.11\)	0.84 \(\pm \;0.0\)	0.62 \(\pm \;0.0\)	4.69 \(\pm \;0.1\)	0.51 \(\pm \;0.0\)	0.55 \(\pm \;0.0\)
IMDB-4	DSR-KL	23.93 \(\pm \;16.15\)	0.9 \(\pm \;0.0\)	0.9 \(\pm \;0.05\)	526.01 \(\pm \;11.57\)	0.73 \(\pm \;0.29\)	\(-\)0.14 \(\pm \;0.25\)
	DSR-l2	54.16 \(\pm \;13.22\)	0.78 \(\pm \;0.035\)	0.38 \(\pm \;0.07\)	27.73 \(\pm \;26.63\)	0.21 \(\pm \;0.21\)	\(-\)0.02 \(\pm \;0.08\)
	Siamese	3409.03	0.87	0.78	1240.98	0.47	0.02
	SR-KL	57.93 \(\pm \;0.87\)	0.16 \(\pm \;0.0\)	\(-\)0.04 \(\pm \;0.0\)	31.33 \(\pm \;13.26\)	0.04 \(\pm \;0.0\)	0.05 \(\pm \;0.0\)
	SR-l2	9.43 \(\pm \;1.02\)	0.52 \(\pm \;0.0\)	0.08 \(\pm \;0.0\)	6.97 \(\pm \;2.71\)	0.6 \(\pm \;0.06\)	0.04 \(\pm \;0.06\)
ICLR-4	DSR-KL	84.93 \(\pm \;1.76\)	0.88 \(\pm \;0.01\)	0.62 \(\pm \;0.06\)	288.45 \(\pm \;3.84\)	0.48 \(\pm \;0.13\)	0.06 \(\pm \;0.18\)
	DSR-l2	84.78 \(\pm \;1.21\)	0.63 \(\pm \;0.01\)	0.19 \(\pm \;0.01\)	280.72 \(\pm \;270.75\)	0.47 \(\pm \;0.02\)	0.032 \(\pm \;0.1\)
	Siamese	623.18	0.84	0.76	10142.55	0.32	\(-\)0.07
	SR-KL	347.29 \(\pm \;39.6\)	0.29 \(\pm \;\)	0.0 \(\pm \;0.0\)	131.8 \(\pm \;120.01\)	0.45 \(\pm \;0.0\)	\(-\)0.07 \(\pm \;0.0\)
	SR-l2	503.11 \(\pm \;40.35\)	0.86 \(\pm \;\)	0.82 \(\pm \;0.0\)	198.49 \(\pm \;6.87\)	0.37 \(\pm \;0.03\)	\(-\)0.01 \(\pm \;0.0\)
ROP-img	DSR-KL	776.71 \(\pm \;136.74\)	0.89 \(\pm \;0.0\)	0.79 \(\pm \;0.0\)	25.3 \(\pm \;15.4\)	0.8 \(\pm \;0.01\)	0.6 \(\pm \;0.02\)
	DSR-l2	694.99 \(\pm \;431.12\)	0.84 \(\pm \;0.03\)	0.68 \(\pm \;0.064\)	210.45 \(\pm \;47.16\)	0.79 \(\pm \;0.01\)	0.59 \(\pm \;0.02\)
	Siamese	1152.06	0.86	0.73	4438.61	0.82	0.65
	SR-KL	610.91 \(\pm \;20.69\)	0.5 \(\pm \;0.0\)	0.0 \(\pm \;0.0\)	527.76 \(\pm \;163.76\)	0.61 \(\pm \;0.0\)	0.23 \(\pm \;0.0\)
	SR-l2	3.08 \(\pm \;0.59\)	0.89 \(\pm \;0.0\)	0.79 \(\pm \;0.0\)	1.96 \(\pm \;1.1\)	0.45 \(\pm \;0.0\)	\(-\)0.08 \(\pm \;0.0\)
Movehub-Cost-5	DSR-KL	183.6 \(\pm \;48.2\)	0.85 \(\pm \;0.047\)	0.84 \(\pm \;0.08\)	111.13 \(\pm \;31.61\)	0.76 \(\pm \;0.29\)	0.67 \(\pm \;0.15\)
	DSR-l2	216.5 \(\pm \;64.34\)	0.81 \(\pm \;0.04\)	0.69 \(\pm \;0.02\)	137.14 \(\pm \;65.55\)	0.19 \(\pm \;0.05\)	0.52 \(\pm \;0.34\)
	Siamese	7986.22	0.89	0.83	2842.12	0.62	0.74
	SR-KL	189.85 \(\pm \;0.72\)	0.45 \(\pm \;0.0\)	0.28 \(\pm \;0.0\)	30.51 \(\pm \;0.3\)	0.16 \(\pm \;0.0\)	0.39 \(\pm \;0.0\)
	SR-l2	209.5 \(\pm \;1.5\)	0.78 \(\pm \;0.0\)	0.55 \(\pm \;0.0\)	30.14 \(\pm \;0.2\)	0.37 \(\pm \;0.0\)	0.73 \(\pm \;0.0\)
Movehub-Quality-5	DSR-KL	79.35 \(\pm \;2.91\)	0.85 \(\pm \;0.05\)	0.79 \(\pm \;0.08\)	114.03 \(\pm \;30.18\)	0.92 \(\pm \;0.06\)	0.35 \(\pm \;0.25\)
	DSR-l2	91.87 \(\pm \;6.31\)	0.62 \(\pm \;0.04\)	0.5 \(\pm \;0.05\)	46.99 \(\pm \;13.23\)	0.57 \(\pm \;0.34\)	0.73 \(\pm \;0.46\)
	Siamese	2241.49	0.89	0.83	752.06	0.76	0.08
	SR-KL	924.07 \(\pm \;0.2\)	0.13 \(\pm \;0.0\)	-0.1 \(\pm \;0.0\)	29.68 \(\pm \;1.2\)	0.62 \(\pm \;0.0\)	0.08 \(\pm \;0.0\)
	SR-l2	208.1 \(\pm \;0.1\)	0.84 \(\pm \;0.0\)	0.65 \(\pm \;0.0\)	30.03 \(\pm \;1.1\)	0.39 \(\pm \;0.0\)	0.59 \(\pm \;0.0\)

Table 6. Training Time vs. Top-1 Acc. and KT on Test Sets, Partitioned w.r.t. Rank and Sample Partitioning, Respectively

We report averages and standard deviations over folds for all algorithms except for siamese. Our algorithms, as well as the algorithm that attains the best performance for each dataset are indicated in bold.

Deeper regression methods consistently outperform the predictions of shallow regression methods. Particularly, our deep spectral algorithms DSR-KL and DSR-l2 lead to significantly better predictions than the shallow versions SR-l2 and SR-KL, up to \(38\%\) Top-1 Acc. and \(41\%\) KT on IMDB-4. The training times of DSR-KL and DSR-l2 are also not noticeably larger than the ones of shallow versions; SR-KL converges even slower than DSR-KL, by up to 12 times on Movehub-Quality-5. Unlike SR-l2 that solves a least-squares problem with closed form solution, parameter update step of SR-KL (c.f. 48) requires an iterative optimization at each ADMM iteration.

Columns 6–8 in Table 6 show the training time and test set prediction performance of DSR-KL, DSR-l2, siamese network, SR-l2, and SR-KL trained on all datasets, partitioned with sample partitioning. Note that this is the setting when regressing scores from features is essential, as training samples cannot participate in any observations in validation or test sets. Agreeing with the speed gain in rank partitioning, DSR-KL and DSR-l2 are 8–175 times faster than the siamese network over all datasets. Moreover, DSR-KL leads to particularly better performance in maximal-choice predictions, by up to \(26\%\) higher Top-1 Acc. than siamese on IMDB-4 and \(56\%\) higher Top-1 Acc. than DSR-l2 on Movehub-Cost-4. Finally, deeper regression methods DSR-KL, DSR-l2, and siamese network, outperform the predictions of shallow counterparts SR-l2 and SR-KL, by up to \(39\%\) Top-1 Acc. and \(11\%\) KT on Movehub-cost-5 and ICLR-3, respectively.

Impact of \(K\). Figures 1 and 5(c) show training time and prediction performances of DSR-KL, DSR-l2, and siamese network vs. query size (\(K\)) on Movehub-Cost and Movehub-Quality datasets partitioned w.r.t. rank partitioning. The speed gain of DSR-KL and DSR-l2 over siamese reach up to 43 times as \(K\) increases, as each epoch of siamese grows exponentially with \(K\). Moreover, agreeing with Table 6, DSR-KL consistently outperforms the predictions of DSR-l2.

Details on Convergence. Figure 6 shows the log-likelihood \(-\mathcal {L}\) on training and Top-1 Acc. on validation sets of Movehub-Cost-4, Movehub-Quality-4, and IMDB-4 datasets, partitioned w.r.t. rank partitioning. Each point for DSR-KL and DSR-l2 correspond to an overall iteration of ADMM (c.f. (21)), while each point for siamese corresponds to a training epoch. DSR-KL consistently attains higher log-likelihood and better validation performance than both siamese network and DSR-l2, while converging faster.

Fig. 6.

Comparison to Naïve Approach. A naïve spectral algorithm can be constructed by a single iteration of the primal steps in (21): (i) solving (21a) to learn \(\mathbf {\pi }\) via repeated iterations of (29), and (ii) given \(\mathbf {\pi }\), solving (21b) by training the DNN regressor \(\tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w})\) via SGD over Equation (22). Intuitively, this ignores/does not exploit the fact that samples with similar features ought to have similar scores. We denote this naive approach as One-iter.-DSR-KL. Unlike One-iter.-DSR-KL, our algorithm DSR-KL has the capability of repeatedly adapting both \(\mathbf {\pi }\) and \(\mathbf {w}\) by solving (19) via ADMM. This advantage is illustrated in Figure 6(a) and (b), where not only \(\tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w})\), but also \(\mathbf {\pi }\) are adjusted until convergence. Moreover, Figure 6(d) shows the corresponding test set predictions of DSR-KL vs. One-iter.-DSR-KL on Movehub-Cost-4 and Movehub-Quality-4; DSR-KL outperforms One-iter.-DSR-KL by 13–21% Top-1 Acc. and 6–15% KT.

5 Conclusion

We solve the MLE problem for the PL scores via ADMM. We show that the scores are equivalent to the stationary distribution of an MC and propose spectral algorithms for both shallow and DNN models. Our resulting spectral algorithms significantly outperform the traditional Newton’s method and siamese networks for ranking regression.

Given that the number of rankings grows exponentially in query size, designing active learning algorithms to identify which rankings to solicit from labelers is an interesting open problem. Generalizing existing active learning algorithms for shallow models, e.g., Guo et al. [2018], to deeper models via our efficient algorithms is a promising direction.

Footnotes

Our code for shallow model regression is publicly available at https://github.com/neu-spiral/FastAndAccurate-RankingRegression.

Our code for DNN regression is publicly available at https://github.com/neu-spiral/DeepSpectralRanking.

Appendices

In this section, we provide proofs for all theoretical results established in Sections 2 and 3, as well as a brief technical preliminary on finite-state homogeneous MCs.

A Finite-state Homogeneous Markov Chains

We briefly review finite-state homogeneous MCs due to the recent emergence of spectral algorithms for PL inference. All the results mentioned in this section are classic; our main reference is the book by Gallager [2013]. A finite-state MC is a sequence of random variables \(\lbrace Z_1, Z_2, \ldots \rbrace\), each taking values from a finite-state sample space \(\left[n\right]\equiv \lbrace 1,\ldots ,n\rbrace\), that satisfies the Markov property:

\begin{align} \mathbf {P}(Z_t | Z_{1:t-1}) = \mathbf {P}(Z_t | Z_{t-1}) , \quad \text{for all} \,\, t. \end{align}

(53)

A homogeneous MC additionally satisfies: \(\lambda _{ij} = \mathbf {P}(Z_t=j | Z_{t-1}=i) = \mathbf {P}(Z_{t-1}=j | Z_{t-2}=i) = \cdots = \mathbf {P}(Z_2=j | Z_1=i)\). In this case, the MC can be represented by a directed graph, in which there is a node for each \(i \in \left[n\right]\), and \(\lambda _{ij}\) denotes the edge weight on the transition from node \(i \in \left[n\right]\) to \(j \in \left[n\right]\). Particularly, \(\lambda _{ij}\) for all \((i, j)\) are called the transition rates, and the corresponding matrix \(\mathbf {\Lambda }=[\lambda _{ij}]_{i,j \in \left[n\right]}\) is called the transition matrix of the MC.

If the transition matrix \(\mathbf {\Lambda }\), and accordingly the corresponding directed graph, is irreducible, the MC is said to be ergodic and admits a unique stationary distribution \(\mathbf {\pi }\in \mathbb {R}^n_+\), that satisfies the so-called global balance equations:

\begin{align} \sum _{j\ne i} {\pi }_j \lambda _{ji} = \sum _{j\ne i} {\pi }_i \lambda _{ij}, \quad \text{for all}~i \in \left[n\right]\!. \end{align}

(54)

\(\mathbf {\pi }\) is the unique solution to the linear system defined by these balance equations and \(\mathbf {1}^\top \mathbf {\pi }=1\), as it is a distribution. This follows from the Perron–Frobenius Theorem [Horn and Johnson 2012]. In practice, the stationary distribution \(\mathbf {\pi }\) can be computed by uniformizing \(\mathbf {\Lambda }\), i.e., increasing self-transition rates until all states have the same outgoing rate, and finding the leading left eigenvector via, e.g., the power method [Lei et al. 2016]:

\begin{align} \mathbf {\pi }^{l+1} = \frac{\mathbf {\Lambda }\mathbf {\pi }^l}{\parallel \mathbf {\Lambda }\mathbf {\pi }^l \parallel _2}, \quad \text{for}~ l \in \mathbb {N}. \end{align}

(55)

B Proof of Theorem 2.1

We start by showing that \(\frac{\partial \mathcal {L}(\mathcal {D}\,|\,\mathbf {\pi })}{\partial {\pi }_i} = 0\), \(i \in \left[n\right]\) is the optimality condition to minimize Equation (15). Consider the reparametrization \({\pi }_i = e^{\theta _i}\), \(i \in \left[n\right]\). Equation (15) under this reparametrization is given by

\begin{align} \mathcal {L}(\mathcal {D}\,|\,\mathbf {\theta }) = \sum _{\ell =1}^n\left(\log \sum _{j \in A_\ell } e^{\theta _j} - \theta _\ell \right), \end{align}

(56)

which is convex w.r.t. \(\mathbf {\theta }= [\theta _i]_{i\in \left[n\right]}\), i.e., even though Equation (15) is not convex w.r.t. \(\mathbf {\pi }\), it is convex under the reparametrization \({\pi }_i = e^{\theta _i}\), \(i \in \left[n\right]\). This implies that \(\frac{\partial \mathcal {L}(\mathcal {D}\,|\,\mathbf {\theta })}{\partial \theta _i} = 0\), \(i \in \left[n\right]\) is the optimality condition to minimize Equation (56) w.r.t. \(\mathbf {\theta }\). By the chain rule, this condition can be written in terms of \({\pi }_i = e^{\theta _i}\), \(i \in \left[n\right]\) as

\begin{align} \frac{\partial \mathcal {L}(\mathcal {D}\,|\,\mathbf {\theta })}{\partial \theta _i} = \frac{\partial \mathcal {L}(\mathcal {D}\,|\,\mathbf {\pi })}{\partial {\pi }_i} e^{\theta _i} = 0 \,\,\,\, \forall i \in \left[n\right]\!. \end{align}

(57)

Note that \(e^{\theta _i} \gt 0\), \(i \in \left[n\right]\). Then, \(\frac{\partial \mathcal {L}(\mathcal {D}\,|\,\mathbf {\theta })}{\partial \theta _i} = 0\) is equivalent to \(\frac{\partial \mathcal {L}(\mathcal {D}\,|\,\mathbf {\pi })}{\partial {\pi }_i} = 0\), \(i \in \left[n\right]\), i.e., \({\pi }_i = e^{\theta _i}\), \(i \in \left[n\right]\) satisfies Equation (57) if and only if \(\theta _i\), \(i \in \left[n\right]\) is the minimizer of Equation (56). Hence, the stationarity condition \(\frac{\partial \mathcal {L}(\mathcal {D}\,|\,\mathbf {\pi })}{\partial {\pi }_i} = 0\), \(i \in \left[n\right]\) is also the optimality condition for problem (3).

The optimality condition is given explicitly by

\begin{align} \begin{split} \frac{\partial \mathcal {L}}{\partial {\pi }_i} &= \sum _{\ell \in W_i} \left(\frac{1}{\sum _{t\in A_\ell }{\pi }_t} - \frac{1}{{\pi }_i}\right) + \sum _{\ell \in L_i}\frac{1}{\sum _{t\in A_\ell }{\pi }_t} = 0 \,\,\,\, \forall i \in \left[n\right]\!, \end{split} \end{align}

(58)

where \(W_i = \lbrace \ell \,| i \in A_\ell , c_\ell = i\rbrace\) is the set of observations where sample \(i \in \left[n\right]\) is chosen and \(L_i = \lbrace \ell \,| i \in A_\ell , c_\ell \ne i\rbrace\) is the set of observations where sample \(i \in \left[n\right]\) is not chosen. Multiplying both sides of Equation (58) with \({\pi }_i\), \(i \in \left[n\right]\), we have

\begin{align} \sum _{\ell \in L_i} \left(\frac{{\pi }_i}{\sum _{t\in A_\ell }{\pi }_t}\right) - \sum _{\ell \in W_i} \left(\frac{\sum _{j\ne i\in A_\ell }{\pi }_j}{\sum _{t\in A_\ell }{\pi }_t} \right) = 0 , \end{align}

(59)

for all \(i \in \left[n\right]\). Note that \(\sum _{\ell \in W_i} \sum _{j\ne i\in A_\ell } \cdot = \sum _{j\ne i} \sum _{\ell \in W_i \cap L_j} \cdot\) and \(\sum _{\ell \in L_i} \cdot = \sum _{j\ne i} \sum _{\ell \in W_j \cap L_i} \cdot\). Accordingly, we rewrite Equation (59) as

\begin{align} \begin{split} &\sum _{j\ne i} \sum _{\ell \in W_j \cap L_i} \left(\frac{{\pi }_i}{\sum _{t\in A_\ell }{\pi }_t}\right) = \sum _{j\ne i} \sum _{\ell \in W_i \cap L_j} \left(\frac{{\pi }_j}{\sum _{t\in A_\ell }{\pi }_t}\right) \,\,\,\, \forall i \in \left[n\right]\!. \end{split} \end{align}

(60)

Then, an optimal solution \(\mathbf {\pi }\in \mathbb {R}^n_+\) to Equation (3) satisfies:

\begin{align} \sum _{j\ne i} {\pi }_j \lambda _{ji}(\mathbf {\pi }) = \sum _{j\ne i} {\pi }_i \lambda _{ij}(\mathbf {\pi }) \,\,\,\, \forall i \in \left[n\right]\!, \end{align}

(61)

where \(\lambda _{ji}(\mathbf {\pi })\), \(i,j \in \left[n\right], i\ne j\) are given by Equation (5).

C Proof of Lemma 3.1

Let a stationary point \(\mathbf {\pi }\in \mathbb {R}^n_+\) of the Augmented Lagrangian (20) be such that

\begin{align} & \frac{\partial L_\rho (\mathbf {\pi }, \mathbf {w}^k, \mathbf {y}^k)}{\partial {{\pi }_i}} = 0 \Leftrightarrow \frac{\partial \mathcal {L}(\mathcal {D}|\mathbf {\pi })}{\partial {{\pi }_i}} + {y}_i^k + \rho \frac{\partial D_p(\mathbf {\pi }|| \tilde{\mathbf {\pi }}^k)}{\partial {{\pi }_i}} = 0. \,\,\, \forall i \in \left[n\right]\!. \end{align}

(62a)

Let \(\sigma _i(\mathbf {\pi }) = \rho \frac{\partial D_p(\mathbf {\pi }|| \tilde{\mathbf {\pi }}^k)}{\partial {{\pi }_i}} + {y}_i^k\), for all \(i \in \left[n\right]\). Then, Equation (62a) is equivalent to

\begin{align} & \frac{\partial \mathcal {L}(\mathcal {D}|\mathbf {\pi })}{\partial {{\pi }_i}} + \sigma _i(\mathbf {\pi }) = 0 \,\,\, \forall i \in \left[n\right]\!. \end{align}

(63)

Partial derivatives of the negative log-likelihood \(\mathcal {L}(\mathcal {D}|\mathbf {\pi })\) are given by

\begin{align} \frac{\partial \mathcal {L}(\mathcal {D}|\mathbf {\pi })}{\partial {{\pi }_i}} &= \sum _{\ell \in W_i} \left(\frac{1}{\sum _{t\in A_\ell }{\pi }_t} - \frac{1}{{\pi }_i}\right) + \sum _{\ell \in L_i}\frac{1}{\sum _{t\in A_\ell }{\pi }_t} \,\,\, \forall i \in \left[n\right]\!, \end{align}

(64)

where \(W_i = \lbrace \ell \,| i \in A_\ell , c_\ell = i\rbrace\) is the set of observations where sample \(i \in \left[n\right]\) is chosen and \(L_i = \lbrace \ell \,| i \in A_\ell , c_\ell \ne i\rbrace\) is the set of observations where sample \(i \in \left[n\right]\) is not chosen. Setting \(\frac{\partial \mathcal {L}(\mathcal {D}|\mathbf {\pi })}{\partial {{\pi }_i}}\) from Equation (64) to Equation (63), we have

\begin{align} \frac{\partial L_\rho (\mathbf {\pi }, \mathbf {w}^k, \mathbf {y}^k)}{\partial {{\pi }_i}} = \sum _{\ell \in W_i} \left(\frac{1}{\sum _{t\in A_\ell }{\pi }_t} - \frac{1}{{\pi }_i}\right) + \sum _{\ell \in L_i} \frac{1}{\sum _{t\in A_\ell }{\pi }_t} + \sigma _i(\mathbf {\pi }) = 0, \,\,\, \forall i \in \left[n\right]\!. \end{align}

(65)

Multiplying both sides of Equation (65) with \(-{\pi }_i\), \(i \in \left[n\right]\), we have

\begin{align} \sum _{\ell \in W_i} \left(\frac{\sum _{j\ne i\in A_\ell }{\pi }_j}{\sum _{t\in A_\ell }{\pi }_t} \right) - \sum _{\ell \in L_i} \left(\frac{{\pi }_i}{\sum _{t\in A_\ell }{\pi }_t}\right) - {\pi }_i \sigma _i(\mathbf {\pi }) = 0 \,\,\, \forall i \in \left[n\right]\!. \end{align}

(66)

Note that \(\sum _{\ell \in W_i} \sum _{j\ne i\in A_\ell } \cdot = \sum _{j\ne i} \sum _{\ell \in W_i \cap L_j} \cdot\) and \(\sum _{\ell \in L_i} \cdot = \sum _{j\ne i} \sum _{\ell \in W_j \cap L_i} \cdot\). Accordingly, we rewrite Equation (66) as

\begin{align} \sum _{j\ne i} \sum _{\ell \in W_i \cap L_j} \left(\frac{{\pi }_j}{\sum _{t\in A_\ell }{\pi }_t}\right) - \sum _{j\ne i} \sum _{\ell \in W_j \cap L_i} \left(\frac{{\pi }_i}{\sum _{t\in A_\ell }{\pi }_t}\right) - {\pi }_i \sigma _i(\mathbf {\pi }) = 0 \,\,\, \forall i \in \left[n\right]\!. \end{align}

(67)

Then, the stationarity condition given by Equation (62a) is equivalent to

\begin{align} \sum _{j\ne i} {\pi }_j \lambda _{ji}(\mathbf {\pi }) - \sum _{j\ne i} {\pi }_i \lambda _{ij}(\mathbf {\pi }) = {\pi }_i \sigma _i(\mathbf {\pi }) \,\,\,\, \forall i \in \left[n\right]\!, \end{align}

(68)

where \(\lambda _{ji}(\mathbf {\pi })\), \(i,j \in \left[n\right], i\ne j\) are given by Equation (5).

D Proof of Theorem 3.2

Summing Equation (26) for \(i \in \left[n\right]\), we get

\begin{align} \sum _{i} \sum _{j} \left({\pi }_j \lambda _{ji}(\mathbf {\pi }) - {\pi }_i \lambda _{ij}(\mathbf {\pi }) \right)\mathbb {1}_{j \ne i} = \sum _{i} {\pi }_i \sigma _i = 0. \end{align}

(69)

Since the PL scores are non-negative, i.e., \({\pi }_i \ge 0 \,\, , \, i \in \left[n\right]\), Equation (69) implies that \(\mathbf {\sigma }\equiv [\sigma _i]_{i \in \left[n\right]}\) contains both positive and negative elements. Let \((\left[n\right]_{+}, \left[n\right]_{-})\) be a partition of \(\left[n\right]\) such that \(\sigma _i \ge 0\) for all \(i \in \left[n\right]_{+}\) and \(\sigma _i \lt 0\) for all \(i \in \left[n\right]_{-}\). Then, for \(i \in \left[n\right]_{+}\) in Equation (26), we have

\begin{align} \sum _{j\ne i} {\pi }_j \lambda _{ji}(\mathbf {\pi }) = {\pi }_i \left(\sum _{j\ne i} \lambda _{ij}(\mathbf {\pi }) + \sigma _i\right), \,\, \forall i \in \left[n\right]_{+}\!, \end{align}

(70)

where \(\lambda _{ij}(\mathbf {\pi }) + \sigma _i \ge 0\), \(i \in \left[n\right]_{+}\), and \(j \in \left[n\right]\). Equation (70) shows that from each state \(i \in \left[n\right]_{+}\) into the states in \(\left[n\right]_{-}\), there exists a total of \(\sigma _i\) “additional outgoing rate”, compared to Equation (4). At the same time, for \(i \in \left[n\right]_{-}\) in Equation (26), we have

\begin{align} \begin{split} &\sum _{j\in \left[n\right]_{+}} {\pi }_j \lambda _{ji}(\mathbf {\pi }) + \sum _{j\in \left[n\right]_{-}|j\ne i} {\pi }_j \lambda _{ji}(\mathbf {\pi }) = {\pi }_i \sum _{j\ne i} \lambda _{ij}(\mathbf {\pi }) + {\pi }_i \sigma _i, \,\,\,\, \forall i \in \left[n\right]_{-}\!. \end{split} \end{align}

(71)

Since \({\pi }_i \sigma _i \lt 0\), for \(i \in \left[n\right]_{-}\), we distribute these terms into the first sum on the left-hand side. Then, Equation (71) is equivalent to

\begin{align} \begin{split} &\sum _{j\in \left[n\right]_{+}} {\pi }_j \left(\lambda _{ji}(\mathbf {\pi }) - \frac{{\pi }_i \sigma _i c_j}{{\pi }_j} \right) + \sum _{j\in \left[n\right]_{-}|j\ne i} {\pi }_j \lambda _{ji}(\mathbf {\pi }) = {\pi }_i \sum _{j\ne i} \lambda _{ij}(\mathbf {\pi }), \,\,\,\, \forall i \in \left[n\right]_{-}\!, \end{split} \end{align}

(72)

where \(\sum _{j\in \left[n\right]_{+}} c_j = 1\).

To determine the \(\lbrace c_j\rbrace _{j \in \left[n\right]_{+}}\), recall from Equation (70) that from each state \(j \in \left[n\right]_{+}\) into the states \(i \in \left[n\right]_{-}\), there exists a total of \(\sigma _j\) additional outgoing rate. Then, Equation (72) implies that \(\sum _{i\in \left[n\right]_{-}} - \frac{{\pi }_i \sigma _i c_j}{{\pi }_j} = \sigma _j\), i.e., \(c_j = \frac{- {\pi }_j \sigma _j}{\sum _{i\in \left[n\right]_{-}} {\pi }_i \sigma _i}\), \(j \in \left[n\right]_{+}\). Using \(-\sum _{i\in \left[n\right]_{-}} {\pi }_i \sigma _i = \sum _{i\in \left[n\right]_{+}} {\pi }_i \sigma _i\) from Equation (69), we confirm that \(\sum _{j\in \left[n\right]_{+}} c_j = \frac{- \sum _{j\in \left[n\right]_{+}} {\pi }_j \sigma _j}{\sum _{i\in \left[n\right]_{-}} {\pi }_i \sigma _i} = 1\), and rewrite \(\lbrace c_j\rbrace _{j \in \left[n\right]_{+}}\) as

\begin{align} c_j = \frac{-2 {\pi }_j \sigma _j}{\sum _{i\in \left[n\right]_{-}} {\pi }_i \sigma _i - \sum _{i\in \left[n\right]_{+}}{\pi }_i \sigma _i}, \,\,\,\, \forall j \in \left[n\right]_{+}\!. \end{align}

(73)

Finally, setting \(\lbrace c_j\rbrace _{j \in \left[n\right]_{+}}\) into Equation (72), we have

\begin{align} \begin{split} &\sum _{j\in \left[n\right]_{+}} {\pi }_j \left(\lambda _{ji}(\mathbf {\pi }) + \frac{2 {\pi }_i \sigma _i \sigma _j}{\sum _{t\in \left[n\right]_{-}} {\pi }_t \sigma _t - \sum _{t\in \left[n\right]_{+}} {\pi }_t \sigma _t} \right) + \sum _{j\in \left[n\right]_{-}|j\ne i} {\pi }_j \lambda _{ji}(\mathbf {\pi }) = {\pi }_i \sum _{j\ne i} \lambda _{ij}(\mathbf {\pi }), \end{split} \end{align}

(74)

where \(\lambda _{ji}(\mathbf {\pi }) + \frac{2 {\pi }_i \sigma _i \sigma _j}{\sum _{t\in \left[n\right]_{-}} {\pi }_t \sigma _t - \sum _{t\in \left[n\right]_{+}} {\pi }_t \sigma _t} \ge 0\), \(j \in \left[n\right]_{+}\), and \(i \in \left[n\right]_{-}\).

Equation (26), partitioned as Equations (70) and (74), is the balance equation of a continuous-time MC with transition rates given by

\begin{align} \mu _{ji}(\mathbf {\pi }) &= {\left\lbrace \begin{array}{ll} \lambda _{ji}(\mathbf {\pi }) + \frac{2 {\pi }_i \sigma _i \sigma _j}{\sum _{t\in \left[n\right]_{-}} {\pi }_t \sigma _t - \sum _{t\in \left[n\right]_{+}} {\pi }_t \sigma _t} \\ \quad \quad \quad \quad \text{if} \,\,\, j \in \left[n\right]_{+} \,\,\, \text{and} \,\,\, i \in \left[n\right]_{-}\\ \lambda _{ji}(\mathbf {\pi }) \quad \ \ \text{otherwise}. \end{array}\right.} \end{align}

(75)

Hence, \(\mathbf {\pi }\) is the stationary distribution of this MC (c.f. Section 2).

E Proof of Theorem 3.3

We use the following definition.

Definition E.1 (Diagonal Dominance).

A matrix \(\mathbf {H}\) is diagonally dominant if \(|\mathbf {H}_{ii}| \ge \sum _{j \ne i} |\mathbf {H}_{ij}|\), \(i \in \left[n\right]\), i.e., for every row, magnitude of the diagonal element is larger than the sum of magnitudes of the corresponding off-diagonal elements [Horn and Johnson 2012].

For \(\mathbf {\sigma }=[\sigma _i]_{i \in \left[n\right]}\) given by (46), Equation (65) is equivalent to

\begin{align} \begin{split} \frac{\partial L_\rho (\mathbf {\pi },\mathbf {w}^k, \mathbf {y}^k)}{\partial {\pi }_i} &= \sum _{\ell \in W_i} - \frac{1}{{\pi }_i} + \sum _{\ell |i\in A_\ell } \frac{1}{\sum _{t\in A_\ell }{\pi }_t} + \rho {(\mathbf {\pi }- \tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w}^k))}_i + {y}^k_i, \end{split} \end{align}

(76)

for all \(i \in \left[n\right]\). At the \(k\)th iteration of (21), let \(\nabla ^2 L_\rho (\mathbf {\pi },\mathbf {w}^k, \mathbf {y}^k)\) be the Hessian of the augmented Lagrangian w.r.t. \(\mathbf {\pi }\). Differentiating Equation (76) w.r.t. \({\pi }_j\), \(\nabla ^2 L_\rho (\mathbf {\pi },\mathbf {w}^k, \mathbf {y}^k)\) has the following form:

\begin{align} &\nabla ^2_{ij} L_\rho (\mathbf {\pi },\mathbf {w}^k, \mathbf {y}^k) = {\left\lbrace \begin{array}{ll} \sum _{\ell \in W_i} \frac{1}{{\pi }_i^2} - \sum _{\ell |i\in A_\ell } \frac{1}{{(\sum _{t\in A_\ell }{\pi }_t)}^2} + \rho , & i = j \\ - \sum _{\ell |i,j\in A_\ell } \frac{1}{{(\sum _{t\in A_\ell }{\pi }_t)}^2}, & i \ne j. \end{array}\right.} \end{align}

(77)

Consider \(\rho \ge \frac{2}{\epsilon ^2} \max _i \sum _{\ell |i\in A_\ell }\frac{1}{|A_\ell |^2}\). By Assumption 3.1, we have

\begin{align} & \rho \ge \frac{2}{\epsilon ^2} \sum _{\ell |i\in A_\ell }\frac{1}{|A_\ell |^2} \,\, \forall i \in \left[n\right]\!,\nonumber \nonumber\\ & \Leftrightarrow \,\, \rho \ge \sum _{\ell |i\in A_\ell }\frac{2}{{(\sum _{t\in A_\ell }{\pi }_t)}^2} \,\, \forall i \in \left[n\right]\!,\nonumber \nonumber\\ & \Leftrightarrow \,\, \rho + \sum _{\ell \in W_i} \frac{1}{{\pi }_i^2} \gt \sum _{\ell |i\in A_\ell }\frac{2}{{(\sum _{t\in A_\ell }{\pi }_t)}^2} \,\, \forall i \in \left[n\right]\!, \end{align}

(78a)

\begin{align} & \Leftrightarrow \,\, \rho + \sum _{\ell \in W_i} \frac{1}{{\pi }_i^2} \gt \sum _{\ell |i\in A_\ell }\frac{1}{{(\sum _{t\in A_\ell }{\pi }_t)}^2} + \sum _{j \ne i}\sum _{\ell |i,j\in A_\ell }\frac{1}{{(\sum _{t\in A_\ell }{\pi }_t)}^2} \,\, \forall i \in \left[n\right]\!. \end{align}

(78b)

Equation (78a) implies that all diagonal elements of \(\nabla ^2 L_\rho (\mathbf {\pi },\mathbf {w}^k, \mathbf {y}^k)\) are positive. Moreover, by Equation (78b), \(\nabla ^2 L_\rho (\mathbf {\pi },\mathbf {w}^k, \mathbf {y}^k)\) is diagonally dominant (c.f. Definition E.1). Thus, \(\nabla ^2 L_\rho (\mathbf {\pi },\mathbf {w}^k, \mathbf {y}^k)\) is positive definite [Horn and Johnson 2012], i.e., \(L_\rho (\mathbf {\pi },\mathbf {w}^k, \mathbf {y}^k)\) is convex w.r.t. \(\mathbf {\pi }\). As a result, under Assumption 3.1, for \(\rho \ge \frac{2}{\epsilon ^2} \max _i \sum _{\ell |i\in A_\ell }\frac{1}{|A_\ell |^2}\) and \(D_p(\mathbf {\pi }||\tilde{\mathbf {\pi }}) = \frac{1}{2} \parallel \! \mathbf {\pi }- \tilde{\mathbf {\pi }}\!\parallel _2^2\), a stationary \(\mathbf {\pi }\gt \mathbf {0}\) satisfying condition (23) is also a minimizer of step (21a).

F Proof of Theorem 3.4

We make use of the following lemmas.

Lemma F.1 (Zeng et al. [2018]).

Logarithm and polynomials are Kurdyka–Łojasiewicz functions. Moreover, sums, products, compositions, and quotients (with denominator bounded away from 0) of Kurdyka–Łojasiewicz functions are also Kurdyka–Łojasiewicz.

Lemma F.2 (Guo et al. [2017]).

Consider the optimization problem:

\begin{equation} \begin{aligned}& \underset{\mathbf {w},\, \mathbf {\pi }}{\text{minimize}} & g(\mathbf {\pi }) \\ & \text{subject to} & \mathbf {\tilde{X}}\mathbf {w}= \mathbf {\pi }, \end{aligned} \end{equation}

(79)

and solve Equation (79) via Alternating Direction Method of Multipliers (ADMM) [Boyd et al. 2011]. Let \(\lbrace (\mathbf {\pi }^k, \mathbf {y}^k, \mathbf {w}^k)\rbrace _{k \in \mathbb {N}}\) be the sequence generated by the ADMM algorithm, and \(\rho\) be the penalty parameter of ADMM. Suppose that there exists \(\kappa \gt 0\) such that \(\mathbf {\tilde{X}}^T \mathbf {\tilde{X}}\succeq \kappa \, \mathbf {I}\), and the sequence \(\lbrace (\mathbf {\pi }^k, \mathbf {y}^k, \mathbf {w}^k)\rbrace _{k \in \mathbb {N}}\) is bounded.

If there exist solutions for the minimization steps of ADMM w.r.t. both \(\mathbf {\pi }\) and \(\mathbf {w}\), \(g(\mathbf {\pi })\) is a continuous differentiable function with an \(L\)-Lipschitz continuous gradient at \(\mathbf {\pi }^k\), \(k\in \mathbb {N}\) where \(L \gt 0\), and the augmented Lagrangian of Equation (79) is a Kurdyka–Łojasiewicz function, then, for \(\rho \gt 2L\), \(\lbrace (\mathbf {\pi }^k, \mathbf {y}^k, \mathbf {w}^k)\rbrace _{k \in \mathbb {N}}\) converges to a point that satisfies the Karush-Kuhn-Tucker (KKT) conditions of Equation (79).

For affine regression described in Section 3.3, there exist solutions for the minimization steps in (21): \(\mathbf {w}\) update has the closed form solution given by Equation (33) and \(\mathbf {\pi }\) update admits a minimizer for large enough \(\rho\) by Lemma 3.3.

By Assumption 3.1, \(\nabla _{\mathbf {\pi }} \mathcal {L}\) given by Equation (58) exists, i.e., \(\mathcal {L}\) is continuous differentiable at \(\mathbf {\pi }^k\), \(k\in \mathbb {N}\) generated by (21a). Let \(\nabla ^2(\mathcal {L})\) be the Hessian of \(\mathcal {L}\). Differentiating Equation (58) w.r.t. \({\pi }_j\), \(\nabla ^2(\mathcal {L})\) has the following form:

\begin{align} &\nabla ^2_{ij}(\mathcal {L}) = {\left\lbrace \begin{array}{ll} \sum _{\ell \in W_i} \frac{1}{{\pi }_i^2} - \sum _{\ell |i\in A_\ell } \frac{1}{{(\sum _{t\in A_\ell }{\pi }_t)}^2}, & i = j \\ - \sum _{\ell |i,j\in A_\ell } \frac{1}{{(\sum _{t\in A_\ell }{\pi }_t)}^2}, & i \ne j. \end{array}\right.} \end{align}

(80)

Consider \(L=\frac{\max _i |W_i|}{\epsilon ^2}\), where \(W_i\) is the set of observations where sample \(i \in \left[n\right]\) is chosen. By Assumption 3.1, we have

\begin{align} & L = \frac{\max _i |W_i|}{\epsilon ^2} \ge \sum _{\ell \in W_i} \frac{1}{{\pi }_i^2} \,\, \forall i \in \left[n\right]\!,\nonumber \nonumber\\ & \Leftrightarrow \,\, L - \sum _{\ell \in W_i} \frac{1}{{\pi }_i^2} + \sum _{\ell |i\in A_\ell }\frac{1}{{(\sum _{t\in A_\ell }{\pi }_t)}^2} \ge \sum _{\ell |i\in A_\ell }\frac{1}{{(\sum _{t\in A_\ell }{\pi }_t)}^2} \,\, \forall i \in \left[n\right]\!,\nonumber \nonumber\\ & \Leftrightarrow \,\, L - \sum _{\ell \in W_i} \frac{1}{{\pi }_i^2} + \sum _{\ell |i\in A_\ell }\frac{1}{{(\sum _{t\in A_\ell }{\pi }_t)}^2} \ge \sum _{j \ne i}\sum _{\ell |i,j\in A_\ell }\frac{1}{{(\sum _{t\in A_\ell }{\pi }_t)}^2}\,\, \forall i \in \left[n\right]\!. \end{align}

(81)

Now, consider the matrix \(L \mathbf {I}_{n\times n} - \nabla ^2(\mathcal {L})\). By Equation (81), \(L \mathbf {I}_{n\times n} - \nabla ^2(\mathcal {L})\) is diagonally dominant (c.f. Definition E.1) and all of its diagonal elements are positive, i.e., \(\nabla ^2(\mathcal {L})\) is upper bounded by \(L \mathbf {I}_{n\times n}\). Thus, the objective function of Equation (16), i.e., \(\mathcal {L}\), has an \(L\)-Lipschitz continuous gradient at \(\mathbf {\pi }^k\), \(k\in \mathbb {N}\), where \(L = \frac{\max _i |W_i|}{\epsilon ^2} \gt 0\).

Moreover, the augmented Lagrangian given by Equation (20) is a sum of three functions: logarithm of the ratio of two polynomials where the denominator is bounded away from 0 for all \(\mathbf {\pi }^k\), \(k\in \mathbb {N}\) by Assumption 3.1, and two other polynomial functions. By Lemma F.1, these three functions and their sum is Kurdyka–Łojasiewicz on the set \(\lbrace \mathbf {\pi }^k \,|\, {\pi }_i^k \gt \epsilon \,\,, i \in \left[n\right], k\in \mathbb {N}\rbrace\). As a result, the augmented Lagrangian of Equation (16) is a Kurdyka–Łojasiewicz function.

Putting it all together, by Lemma F.2, for affine regression (c.f. Section 3.3) with \(\rho \gt \frac{2 \max _i |W_i|}{\epsilon ^2}\), the sequence \(\lbrace (\mathbf {\pi }^k, \mathbf {y}^k, \mathbf {w}^k)\rbrace _{k \in \mathbb {N}}\) generated by (21) converges to a point that satisfies the KKT conditions [Nocedal and Wright 2006] of Problem (16).

References

[1]

Arpit Agarwal, Prathamesh Patil, and Shivani Agarwal. 2018. Accelerated spectral ranking. In Proceedings of the 35th International Conference on Machine Learning. 70–79.

Abstract

1 Introduction

2 Related Work and Technical Preliminary

2.1 Related Work

2.2 Technical Preliminary

2.2.1 The PL Model.

2.2.2 Iterative Luce Spectral Ranking.

2.2.3 Alternating Directions Method of Multipliers.

3 Spectral Ranking Regression

3.1 Problem Formulation: Ranking Regression

3.2 Key Technical Result: Decoupling Optimization and a Spectral Method

3.3 Spectral Algorithm for Affine Regression

3.4 Spectral Algorithm for Logistic Regression

3.5 Spectral Algorithm for DNN Regression

3.6 Computational Complexity

3.7 Theoretical Guarantees

4 Experiments

4.1 Datasets

4.2 Experiment Setup

4.3 Shallow Regression Competing Methods

4.4 DNN Regression Competing Methods

4.5 DNN Architecture and Training Details

4.6 Execution Environment

4.7 Performance Metrics

4.8 Shallow Regression Results

4.9 DNN Regression Results

5 Conclusion

Footnotes

Appendices

A Finite-state Homogeneous Markov Chains

B Proof of Theorem 2.1

C Proof of Lemma 3.1

D Proof of Theorem 3.2

E Proof of Theorem 3.3

F Proof of Theorem 3.4

References

Index Terms

Recommendations

Re-ranking search results using query logs

Spectral ranking using seriation

Focused ranking in a vertical search engine

Comments

Information

Published In

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Funding Sources

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

View options

PDF

eReader

HTML Format

Login options

Full Access

Share

Share this Publication link

Share on social media

Affiliations