
Spectral Ranking Regression

Published: 30 July 2022
    Abstract

    We study the problem of ranking regression, in which a dataset of rankings is used to learn Plackett–Luce scores as functions of sample features. We propose a novel spectral algorithm to accelerate learning in ranking regression. Our main technical contribution is to show that the Plackett–Luce negative log-likelihood augmented with a proximal penalty has stationary points that satisfy the balance equations of a Markov Chain. This allows us to tackle the ranking regression problem via an efficient spectral algorithm by using the Alternating Directions Method of Multipliers (ADMM). ADMM separates the learning of scores and model parameters, and in turn, enables us to devise fast spectral algorithms for ranking regression via both shallow and deep neural network (DNN) models. For shallow models, our algorithms are up to 579 times faster than Newton’s method. For DNN models, we extend the standard ADMM via a Kullback–Leibler proximal penalty and show that this is still amenable to fast inference via a spectral approach. Compared to a state-of-the-art siamese network, our resulting algorithms are up to 175 times faster and improve predictions by up to 26% in Top-1 accuracy and 6% in Kendall-Tau correlation over five real-life ranking datasets.

    1 Introduction

    Learning from ranking observations arises in many domains, including, e.g., econometrics [McFadden 1973; Ryzin and Mahajan 1999], psychometrics [Thurstone 1927; Bradley and Terry 1952], sports [Elo 1978], and medicine [Tian et al. 2019], to name a few. Ranking observations are (potentially noisy and incomplete) orderings of subsets of samples. Given a dataset of such ranking observations, probabilistic inference typically assumes the existence of an underlying total ordering and aims at recovering it. The Plackett–Luce (PL) model [Plackett 1975] is a prominent parametric model in this setting; it postulates that (a) each sample is parametrized by a score and (b) the probability that a sample is ranked higher than a set of alternatives is proportional to this score.
    PL scores are traditionally learned from ranking observations via Maximum Likelihood Estimation (MLE) [Dykstra 1960; Hunter 2004; Hajek et al. 2014; Negahban et al. 2018]; under a reparametrization, the negative log-likelihood is convex and PL scores can be estimated via, e.g., Newton’s method [Nocedal and Wright 2006]. Nevertheless, for large datasets, Newton’s method can be prohibitively slow [Hunter 2004]. Recently, Maystre and Grossglauser [2015] proposed a highly efficient iterative spectral method, termed Iterative Luce Spectral Ranking (ILSR), that estimates PL scores significantly faster than state-of-the-art methods. ILSR relies on the fact that ML estimates of PL scores constitute the stationary distribution of a Markov Chain (MC) with transition rates defined by ranking observations. In turn, this stationary distribution can be found very efficiently via spectral techniques, e.g., the power method [Lei et al. 2016].
    The above approaches learn PL scores in the absence of sample features, which precludes rank predictions on samples outside the training set. A natural variant of the above setting is ranking regression. In ranking regression, PL scores are assumed to be a parametric function of sample features; the parameters of this function are then regressed from ranking observations. Ranking regression has received considerable attention in the literature, via both shallow [Joachims 2002; Pahikkala et al. 2009; Tian et al. 2019] and deep models [Burges et al. 2005; Chang et al. 2016; Dubey et al. 2016; Han 2018; Yıldız et al. 2019]. Particularly in deep neural network (DNN) regression of PL scores, siamese neural networks [Bromley et al. 1994] perform extremely well: For example, Yıldız et al. [2019] train a 5.9M parameter siamese neural network from pairwise comparisons on just 80 images, attaining 0.92 AUC.
    Virtually all existing work on ranking regression relies on classic optimization methods for parameter inference. To the best of our knowledge, the opportunity to accelerate learning in ranking regression via spectral methods such as ILSR [Maystre and Grossglauser 2015] has not yet been explored. One reason is that this is technically challenging: In contrast to the feature-less setting, it is not a priori evident how to devise a spectral method akin to ILSR for ranking regression via PL. Particularly, PL scores regressed from sample features cannot be directly expressed as stationary distributions of an MC.
    Our main technical contribution is to show that the PL negative log-likelihood augmented with a proximal penalty has stationary points that satisfy the balance equations of an MC (c.f. Theorem 3.2). This allows us to tackle the ranking regression problem via a spectral method akin to ILSR by using the Alternating Directions Method of Multipliers (ADMM) [Boyd et al. 2011]. Intuitively, ADMM decouples the optimization of PL scores from regression model parameters, encapsulating them in a proximal penalty. In turn, this can be solved highly efficiently via an ILSR-like spectral algorithm.
    Overall, we make the following contributions.
    We propose the first approach to accelerate learning in ranking regression via spectral methods. As stated above, our main technical contribution is to show that the PL negative log-likelihood augmented with a proximal penalty has stationary points that satisfy the balance equations of an MC (c.f. Theorem 3.2). We further show that using ADMM to solve the ranking regression problem reduces it to the minimization of such an objective, which is thus amenable to a highly efficient spectral solution.
    We employ our spectral algorithm to regress PL scores via shallow regression functions of sample features, focusing on standard ADMM with \( \ell _2 \) -norm proximal penalty. Although the ranking regression problem is non-convex, we establish conditions that yield convergence guarantees for our spectral algorithm in the case of affine regression (c.f. Theorem 3.4).
    We also extend the application of our spectral algorithm to ranking regression via DNNs. We show experimentally that replacing the standard \( \ell _2 \) -norm proximal penalty in ADMM with Kullback–Leibler (KL) divergence [Kullback and Leibler 1951] has a better performance in this setting.
    Our proposed algorithm yields significant performance improvements across the board. In the case of shallow ranking regression, our spectral algorithm is up to 579 times faster than traditional shallow ranking regression methods, and outperforms feature-less methods (including ILSR) by \( 13\% \) accuracy. In the case of deep ranking regression, our algorithm significantly outperforms standard deep siamese networks, while the KL-divergence penalty attains better accuracy than the one attained by the \( \ell _2 \) -penalty ADMM. As shown in Figure 1, siamese network training grows exponentially with ranking size \( K \) ; both KL-divergence (DSR-KL) and \( \ell _2 \) -penalty (DSR-l2) ADMM scale more gracefully w.r.t. \( K \) . However, DSR-KL has a considerably higher accuracy, comparable to the one attained by the less efficient siamese network.
    Fig. 1. Training time and Top-1 prediction accuracy (indicated next to each marker) vs. query size ( \( K \) ) on the Movehub-Cost dataset partitioned w.r.t. rank partitioning (c.f. Section 4.2). We compare Deep Spectral Ranking with a KL penalty (DSR-KL), Deep Spectral Ranking with an \( \ell _2 \) penalty (DSR-l2), and a traditional siamese network. Siamese network training grows exponentially with \( K \) ; both our spectral algorithms (DSR-KL and DSR-l2) scale more gracefully w.r.t. \( K \) . Moreover, DSR-KL has a considerably higher accuracy, comparable to (or better than) the one attained by the less efficient siamese network.
    The remainder of this article is organized as follows. We review related literature in Section 2. We formulate our main contributions and proposed algorithms in Section 3. We present our experiments in Section 4 and conclude with future work in Section 5.

    2 Related Work and Technical Preliminary

    2.1 Related Work

    Rank Aggregation. The problem of rank aggregation [Dwork et al. 2001], in which a total ordering of samples is regressed from ranking observations, is classic; literature on the subject is vast—see, e.g., the surveys by Fligner and Verducci [1993], Cattelan [2012], and Marden [2014]. Probabilistic inference in this setting typically assumes (a) that a “true” total ordering of samples exists, and (b) that ranking observations exhibit a stochastic transitivity property [Agarwal 2016]: A sample is more likely to be ranked higher than another when this event is consistent with the underlying total ordering. The noisy permutation model is a non-parametric model for this setting: Pairwise comparisons consistent with the underlying total ordering are observed under i.i.d. Bernoulli noise. MLE is NP-hard in this setting. A polytime algorithm by Braverman and Mossel [2008] recovers the underlying ordering in \( \Theta (n\log n) \) comparisons, w.h.p.; this is tightened by several recent works [Wauthier et al. 2013; Mao et al. 2018a; 2018b]. The Mallows model [Mallows 1957] assumes that the probability of a ranking observation is a decreasing function of its distance from the underlying total ordering, under appropriate notions of distance (e.g., Kendall-Tau (KT)); MLE can be approached, e.g., via EM [Lu and Boutilier 2011]. Shah et al. [2016b] learn the full matrix of pairwise comparison probabilities via a minimax optimal estimator. Rajkumar and Agarwal [2016] learn the matrix via matrix completion, requiring \( \Theta (nr \log n) \) comparisons, where \( r \ll n \) is the rank. Ammar and Shah [2011] assume that comparisons are sampled from an unknown distribution over total orderings and propose an entropy maximization algorithm requiring \( \Theta (n^2) \) comparisons.
    Parametric Rank Aggregation. We focus on parametric models, as they are more natural in the context of regressing rankings from sample features. In both PL [Plackett 1975] and Thurstone [Thurstone 1927], each sample is parametrized by a score. In the Thurstone model, observations result from comparing scores after the addition of Gaussian noise. Vojnovic and Yun [2016] and Shah et al. [2016a] estimate Thurstone scores via MLE and provide sample complexity bounds that are inversely proportional to the smallest non-zero eigenvalue of the Laplacian of a graph modeling comparisons. In PL, the probability that a sample is chosen over a set of alternatives is proportional to its score. Hunter [2004] proposes a Minorization-Maximization (MM) approach to estimate PL scores via MLE, earlier used by Dykstra [1960] on pairwise comparisons (i.e., on the Bradley–Terry (BT) setting). Hajek et al. [2014] provide an upper bound on the error in estimating the PL scores via MLE and show that the latter is minimax-optimal. Negahban et al. [2018] propose a latent factor model, estimating parameters via a convex relaxation of the corresponding rank penalty and providing sample complexity guarantees. Vojnovic et al. [2020] show that gradient-descent and MM algorithms that solve the MLE problem converge with linear rate.
    PL Inference via Spectral Methods. Our focus on PL is due to the recent emergence of spectral algorithms for inference in this setting. Negahban et al. [2012] propose the Rank Centrality (RC) algorithm for the BT setting and derive a minimax error bound. Chen and Suh [2015] propose a spectral MLE algorithm extending RC with an additional stage that cyclically performs MLE for each score. Soufiani et al. [2013] and Jang et al. [2017] extend RC to rankings of two or more samples by breaking rankings into independent comparisons. Improved bounds, applying also to broader noise settings, are provided by Rajkumar and Agarwal [2014]. Khetan and Oh [2016] generalize the work by Soufiani et al. [2013] by breaking rankings into independent shorter rankings, and building a hierarchy of tractable and consistent estimators. Blanchet et al. [2016] model sequential choices by state transitions in an MC, where transitions are functions of choice probabilities.
    Bridging the above approaches with MLE, Maystre and Grossglauser [2015] show that the MLE of PL scores can be expressed as the stationary distribution of an MC. Their proposed ILSR algorithm estimates the PL scores faster than traditional optimization methods, such as, e.g., Hunter’s [Hunter 2004] and Newton’s method [Nocedal and Wright 2006], and more accurately than prior spectral rank aggregation methods. More recently, Ragain and Ugander [2016] showed that a spectral approach applies even after relaxing the assumption that the relative order of any two samples is independent of the alternatives. Agarwal et al. [2018] proposed another spectral method, called accelerated spectral ranking, that departs from the exact equivalence between MLE and the MC approximation, and demonstrated faster convergence than ILSR.
    Ranking Regression via PL. We depart from all aforementioned methods by regressing ranked choices from sample features. Ranking regression via the PL [Plackett 1975] model has a long history, mostly focusing on pairwise comparisons, i.e., restricted to \( K=2 \) ; this case is also known as the BT model [Bradley and Terry 1952]. The resulting algorithms have been employed in a broad array of domains including medicine [Tian et al. 2019], information retrieval on search engines [Joachims 2002; Burges et al. 2005], image-based aesthetic recommendations [Chang et al. 2016; Sun et al. 2017], click prediction for online advertisements [Sculley 2010], and image retrieval [Chen et al. 2015], to name a few.
    Among shallow regressors, RankSVM [Joachims 2002] learns a target ranking from features via a linear Support Vector Machine (SVM), with constraints imposed by all possible comparisons. Pahikkala et al. [2009] propose a regularized least-squares based algorithm for learning to rank from comparisons. Several works [Joachims 2002; Pahikkala et al. 2009; Guo et al. 2018; Tian et al. 2019] regress comparisons via MLE over BT [Bradley and Terry 1952] models.
    Deeper models for regressing comparisons have also been extensively explored in the BT setting [Burges et al. 2005; Chang et al. 2016; Dubey et al. 2016; Doughty et al. 2018; Yıldız et al. 2019]. Burges et al. [2005] estimate comparisons via a fully connected neural network called RankNet, using the BT model to construct a cross-entropy loss. Several works [Chang et al. 2016; Dubey et al. 2016; Doughty et al. 2018; Yıldız et al. 2019] learn from comparisons via siamese networks [Bromley et al. 1994]. Chang et al. [2016] and Yıldız et al. [2019] learn from comparisons using MLE on the BT model. Dubey et al. [2016] predict image comparisons, via a loss function combining cross-entropy and hinge loss. Doughty et al. [2018] learn similarities and comparisons of videos via hinge loss. All of the above methods generalize to rankings under the PL model, but suffer from the complexity issues outlined in the introduction.
    Cao et al. [2007], Xia et al. [2008], and Ma et al. [2020] focus on learning from star/relevance ratings rather than rankings; they train a neural network called ListNet that treats ratings as PL scores. Particularly, given a dataset of \( n \) ratings, ListNet constructs all \( \frac{n!}{(n-K)!} \) possible \( K \) -sized rankings; it is then trained by minimizing the cross entropy between the PL probabilities that a given \( K \) -ranking comprises the top samples among all \( n \) , with scores being (a) the predicted scores and (b) the ground-truth ratings, respectively. Hence, although ListNet solves a very different problem than ranking regression, it is related to the siamese methods mentioned above through its (exponential in \( K \) ) objective.
    ADMM Applications in DNN Training. When training DNNs, ADMM is traditionally used to enforce sparsity. One application is compressed sensing, in which a signal is recovered from undersampled measurements via a DNN transformation [Sun et al. 2016; Li et al. 2018; Liu et al. 2018; Ma et al. 2019; Yang et al. 2020]. The training objective combines a standard regression loss with a norm penalty on measurements, including, e.g., \( \ell _1 \) , \( \ell _2 \) , Frobenius, or nuclear norm. Another application is model pruning [Ye et al. 2019; Zhao and Liao 2019]: ADMM is used to enforce weight sparsity by replacing norm penalties with (possibly non-convex) low-cardinality set constraints. Zhao et al. [2018] train a DNN classifier to recover samples distorted by additive adversarial noise, using an objective that combines a classification loss with an \( \ell _0 \) , \( \ell _1 \) , \( \ell _2 \) , or \( \ell _\infty \) norm penalty.
    Relationship to Previous Works. The present article is a combination and extension of two of our earlier works [Yıldız et al. 2020; 2021], which dealt with shallow and deep model settings separately. We unify the key theoretical results of these works, and include all proofs omitted from these conference versions. We extend our experimental evaluations, and provide a more detailed technical preliminary.

    2.2 Technical Preliminary

    2.2.1 The PL Model.

    We introduce here the PL discriminative model [Plackett 1975] that we use in our analysis; before describing it for ranking observations, we first consider the simpler “maximal choice” setting. Consider a dataset of \( n \) samples, indexed by \( i \in \left[n\right]\equiv \lbrace 1,\ldots ,n\rbrace \) . Every sample \( i\in \left[n\right] \) has a corresponding \( d \) -dimensional feature vector \( \mathbf {x}_i\in \mathbb {R}^d \) . There exists an underlying total ordering of these \( n \) samples. A labeler of this dataset acts as a (possibly noisy) oracle revealing this total ordering: When presented with a query \( A\subseteq \left[n\right] \) , i.e., a set of alternatives, the noisy labeler chooses the maximal sample in \( A \) w.r.t. the underlying total ordering. Formally, our “labeled” dataset \( \mathcal {D}= \lbrace (c_\ell , A_\ell) \rbrace _{\ell =1}^m \) consists of \( m \) observations \( (c_\ell , A_\ell) \) , \( \ell \in \left[m\right]\equiv \lbrace 1,\ldots ,m\rbrace \) , where \( A_\ell \subseteq \left[n\right] \) is the \( \ell \) th query submitted to the labeler and \( c_\ell \in A_\ell \) is her respective \( \ell \) th maximal choice (i.e., the label). For every sample \( i\in \left[n\right] \) , we denote by \( W_i = \lbrace \ell \in \left[m\right]\mid i \in A_\ell , c_\ell = i\rbrace \) the set of observations where sample \( i \in \left[n\right] \) is chosen, and by \( L_i = \lbrace \ell \in \left[m\right]\mid i \in A_\ell , c_\ell \ne i\rbrace \) the set of observations where sample \( i \in \left[n\right] \) is not chosen.
    The PL model [Plackett 1975] asserts that every sample \( i\in \left[n\right] \) is associated with a non-negative deterministic score \( {\pi }_i \in \mathbb {R}_+ \) . Given scores \( \mathbf {\pi }= [{\pi }_i]_{i \in \left[n\right]} \in \mathbb {R}^n_+ \) , then (i) observations \( (c_\ell , A_\ell), \ell \in \left[m\right] \) are independent, and (ii) for a query \( A_\ell \) , the probability that sample \( c_\ell \in A_\ell \) is chosen is
    \( \begin{align} \mathbf {P}(c_\ell \,|\,A_\ell , \mathbf {\pi }) = {{\pi }_{c_\ell }}/{\sum \limits _{j \in A_\ell } {\pi }_j} = {{\pi }_\ell }/{\sum \limits _{j \in A_\ell } {\pi }_j}, \end{align} \)
    (1)
    where, in the last equation, we abuse notation to write the score of the chosen sample as \( {\pi }_\ell \equiv {\pi }_{c_\ell } \) . Note that \( \mathbf {P}(c_\ell \,|\,A_\ell , \mathbf {\pi }) = \mathbf {P}(c_\ell \,|\,A_\ell , s\mathbf {\pi }) \) , for all \( s\gt 0 \) ; thus, w.l.o.g., we assume (or enforce via rescaling) that PL scores satisfy \( \mathbf {1}^\top \mathbf {\pi }= 1 \) , i.e., \( \mathbf {\pi } \) is a distribution over \( \left[n\right] \) .
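    To make Equation (1) concrete, here is a minimal sketch in Python; the helper name pl_choice_prob and the toy scores are illustrative, not part of the original formulation.

```python
import numpy as np

def pl_choice_prob(pi, query, chosen):
    # Eq. (1): P(chosen | query) = pi_chosen / sum_{j in query} pi_j
    return pi[chosen] / pi[list(query)].sum()

pi = np.array([0.5, 0.3, 0.2])          # scores rescaled so that 1^T pi = 1
print(pl_choice_prob(pi, [0, 2], 0))    # 0.5 / (0.5 + 0.2) ~ 0.714
```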
    Ranking Observations. PL readily extends to datasets of (partial) ranking observations. In this setting, when presented with a query \( A_\ell \subseteq \left[n\right] \) of \( K=|A_\ell | \) samples, the labeler ranks the samples in \( A_\ell \) into an ordered sequence \( \alpha _{1}^\ell \succ \alpha _{2}^\ell \succ \cdots \succ \alpha ^\ell _{K} \) . Under the PL model, this ranking is expressed as \( K-1 \) maximal choice queries: \( \alpha _{1}^\ell \) over \( A_\ell \) , \( \alpha _{2}^\ell \) over \( A_\ell \setminus \lbrace \alpha _{1}^\ell \rbrace \) , and so on, so that
    \( \begin{align} \mathbf {P}(\alpha _{1}^\ell \succ \alpha _{2}^\ell \succ \cdots \succ \alpha ^\ell _{K}\,|\,A_\ell , \mathbf {\pi }) = \prod \limits _{t=1}^{K-1} \left({{\pi }_{\alpha _t^\ell }}/{\sum \limits _{s=t}^{K} {\pi }_{\alpha _s^\ell }}\right). \end{align} \)
    (2)
    The product form of (2) implies that a ranking observation in response to a query \( A_\ell \) can be converted to \( K-1 \) independent maximal-choice observations (again, \( \alpha _{1}^\ell \) over \( A_\ell \) , \( \alpha _{2}^\ell \) over \( A_\ell \setminus \lbrace \alpha _{1}^\ell \rbrace \) , and so forth), each governed by (1), that yield the same joint probability: \( (\alpha _{1}^\ell \succ \cdots \succ \alpha ^\ell _{K}) \) can be seen as the outcome of \( \alpha _{1}^\ell \) being chosen as the top within the query set \( {A_\ell } \) , \( \alpha _{2}^\ell \) being the top among \( A_\ell \setminus \lbrace \alpha _{1}^\ell \rbrace \) , and so on. MLE over a dataset of ranking observations thus reduces to MLE over a dataset of maximal choice observations. For notational simplicity, we present our analysis over maximal-choice datasets, keeping the above reduction in mind.
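    As a sketch, this reduction can be spelled out in a few lines; the function name ranking_to_choices is chosen for illustration.

```python
def ranking_to_choices(ranking):
    # Split (alpha_1 > alpha_2 > ... > alpha_K) into K-1 maximal-choice
    # observations (c_l, A_l), each governed by Eq. (1).
    remaining = list(ranking)
    observations = []
    while len(remaining) > 1:            # the final singleton query is trivial
        observations.append((remaining[0], tuple(remaining)))
        remaining = remaining[1:]
    return observations

print(ranking_to_choices((3, 1, 2)))     # [(3, (3, 1, 2)), (1, (1, 2))]
```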
    Parameter Inference. Given the dataset of observations \( \mathcal {D} \) , MLE of the PL scores \( \mathbf {\pi }\in \mathbb {R}^n_+ \) amounts to minimizing the negative log-likelihood:
    \( \begin{align} \mathop {\min }\limits _{\mathbf {\pi }\in \mathbb {R}^n_+} \mathcal {L}(\mathcal {D}\mid \mathbf {\pi }) \equiv \sum \limits _{\ell =1}^m\left(\log \sum \limits _{j \in A_\ell } {\pi }_j - \log \,{\pi }_\ell \right). \end{align} \)
    (3)
    Reparametrizing the scores as \( {\pi }_i = e^{\theta _i} \) , \( i \in \left[n\right] \) makes the negative log-likelihood \( \mathcal {L} \) convex in \( \mathbf {\theta }= [\theta _i]_{i\in \left[n\right]} \) . Under this transform, a global optimum of Equation (3) can be computed via, e.g., Newton’s method.
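    As a hedged sketch (assuming a dataset of \( (c_\ell , A_\ell) \) pairs encoded as index tuples), the reparametrized objective can be evaluated as follows; all names are illustrative.

```python
import numpy as np

def neg_log_likelihood(theta, dataset):
    # Eq. (3) under pi_i = exp(theta_i); convex in theta.
    nll = 0.0
    for chosen, query in dataset:
        idx = list(query)
        nll += np.log(np.exp(theta[idx]).sum()) - theta[chosen]
    return nll

dataset = [(0, (0, 1, 2)), (1, (1, 2))]
print(neg_log_likelihood(np.zeros(3), dataset))   # log 3 + log 2 ~ 1.792
```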

    2.2.2 Iterative Luce Spectral Ranking.

    As described in more detail in Section 2.2.1, MLE of PL scores is convex under the reparametrization \( {\pi }_i = e^{\theta _i} \) , \( i \in \left[n\right] \) , which, in turn, enables computing the PL scores via Newton’s method. Nevertheless, Newton’s method can be prohibitively slow for large \( n \) and \( m \) [Hunter 2004]. Maystre and Grossglauser [2015] proposed a novel spectral algorithm that is significantly faster than traditional optimization methods, such as, e.g., Hunter’s [Hunter 2004] and Newton’s method, and more accurate than prior spectral rank aggregation methods. They show that the MLE of PL scores can be expressed as the stationary distribution of an MC. Their algorithm relies on the following theorem which, for completeness, we re-prove in Appendix B:
    Theorem 2.1 (Maystre and Grossglauser [2015]).
    An optimal solution \( \mathbf {\pi }\in \mathbb {R}^n_+ \) to (3) satisfies:
    \( \begin{align} \sum _{j\ne i} {\pi }_j \lambda _{ji}(\mathbf {\pi }) = \sum _{j\ne i} {\pi }_i \lambda _{ij}(\mathbf {\pi }), \quad \text{for all}~i \in \left[n\right], \end{align} \)
    (4)
    where, for all \( i,j \in \left[n\right] \) , with \( i\ne j \) ,
    \( \begin{align} \lambda _{ji}(\mathbf {\pi }) = \sum _{\ell \in W_i \cap L_j} \left({\sum _{t\in A_\ell }{\pi }_t}\right)^{-1} \ge 0, \end{align} \)
    (5)
    for \( W_i = \lbrace \ell \,| i \in A_\ell , c_\ell = i\rbrace \) the observations where sample \( i \in \left[n\right] \) is chosen and \( L_i = \lbrace \ell \,| i \in A_\ell , c_\ell \ne i\rbrace \) the observations where sample \( i \in \left[n\right] \) is not chosen.
    Equations (4) are the balance equations of a continuous-time MC with transition rates:
    \( \begin{align} \mathbf {\Lambda }(\mathbf {\pi })= [\lambda _{ji}(\mathbf {\pi })]_{i,j \in \left[n\right]}, \end{align} \)
    (6)
    where \( \lambda _{ji}(\mathbf {\pi }) \,\, \) are given by Equation (5). Hence, \( \mathbf {\pi } \) is the stationary distribution of the MC defined by transition rates \( \mathbf {\Lambda }(\mathbf {\pi }) \) (c.f. Appendix A).
    Let \( \mathtt {ssd}(\mathbf {\Lambda }) \) be the stationary distribution of an MC with transition rates \( \mathbf {\Lambda } \) . When matrix \( \mathbf {\Lambda } \) is fixed (i.e., the transition rates are known), the vector \( \mathtt {ssd}(\mathbf {\Lambda }) \) is a solution to the linear system defined by the balance equations (4) and \( \mathbf {1}^\top \mathbf {\pi }=1 \) , as it is a distribution. However, the transition matrix \( \mathbf {\Lambda }= \mathbf {\Lambda }(\mathbf {\pi }) \) in Theorem 2.1 is itself a function of \( \mathbf {\pi } \) , and is therefore a priori unknown. Maystre and Grossglauser [2015] find \( \mathbf {\pi } \) through an iterative algorithm. Starting from the uniform distribution \( {\mathbf {\pi }}^0 = \frac{1}{n}\mathbf {1} \) , they compute:
    \( \begin{align} \mathbf {\pi }^{l+1} = \mathtt {ssd}\left(\mathbf {\Lambda }(\mathbf {\pi }^l)\right), \quad \text{for}~ l \in \mathbb {N}, \end{align} \)
    (7)
    where \( \mathbf {\Lambda }(\cdot) \) is given by (5), (6).
    Maystre and Grossglauser [2015] refer to Equation (7) as the ILSR algorithm. They also establish that (7) converges to an optimal solution for MLE of PL scores under mild assumptions. Most importantly, as mentioned above, ILSR significantly outperforms state-of-the-art MLE algorithms in computational efficiency. We note that the dataset \( \mathcal {D} \) appears in the computation of the rate matrix \( \mathbf {\Lambda } \) , via sets \( W_i \) and \( L_j \) ; the computation of these weights can be easily parallelized.
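    The following is a hedged sketch of ILSR under the definitions above: transition_rates builds \( \mathbf {\Lambda }(\mathbf {\pi }) \) from Equations (5) and (6), ssd solves the balance equations (4) directly (at scale, a power iteration would be used instead), and ilsr performs the fixed-point iteration (7). All names and the toy dataset are illustrative.

```python
import numpy as np

def transition_rates(pi, dataset, n):
    # Lambda[j, i] accumulates, over observations where i is chosen and j
    # loses, the weight 1 / sum_{t in A_l} pi_t  (Eqs. (5)-(6)).
    Lam = np.zeros((n, n))
    for chosen, query in dataset:
        weight = 1.0 / pi[list(query)].sum()
        for j in query:
            if j != chosen:
                Lam[j, chosen] += weight
    return Lam

def ssd(rates):
    # Stationary distribution of the continuous-time MC with the given rates:
    # solve pi^T Q = 0 together with 1^T pi = 1, where Q is the generator.
    n = rates.shape[0]
    Q = rates - np.diag(rates.sum(axis=1))
    A = np.vstack([Q.T, np.ones(n)])
    b = np.append(np.zeros(n), 1.0)
    return np.linalg.lstsq(A, b, rcond=None)[0]

def ilsr(dataset, n, iters=100):
    pi = np.ones(n) / n                               # uniform initialization
    for _ in range(iters):
        pi = ssd(transition_rates(pi, dataset, n))    # Eq. (7)
    return pi

# toy dataset: 0 beats 1, 1 beats 2, 2 beats 0, 0 beats 2
print(ilsr([(0, (0, 1)), (1, (1, 2)), (2, (0, 2)), (0, (0, 2))], n=3))
```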

    2.2.3 Alternating Directions Method of Multipliers.

    Virtually all shallow and deep models for ranking regression rely on classic optimization methods for direct parameter inference, resulting in prohibitively slow learning for large rankings. We solve the ranking regression problem via the ADMM [Boyd et al. 2011] and attain spectral algorithms that significantly accelerate learning.
    ADMM is a primal-dual algorithm designed for affine-constrained optimization problems with decoupled objectives, i.e., objectives that can be written as a sum of functions where each function depends on only one of the optimized variables. ADMM combines the traditional dual decomposition [Nocedal and Wright 2006] and augmented Lagrangian [Nocedal and Wright 2006] optimization algorithms; it has also been shown to be a special case of the Douglas–Rachford splitting [Nocedal and Wright 2006] algorithm. ADMM became particularly popular for being well-suited for distributed convex optimization, a problem that commonly arises in applied statistics and machine learning [Boyd et al. 2011]. We follow the notation by Boyd et al. [2011] in our exposition below.
    Standard ADMM. Formally, the ADMM algorithm solves problems in the form:
    \( \begin{align} \begin{split} \text{minimize} &\quad f(\mathbf {x}) + g(\mathbf {z})\\ \text{subject to} &\quad \mathbf {A} \mathbf {x} + \mathbf {B} \mathbf {z} = \mathbf {c}, \end{split} \end{align} \)
    (8)
    where \( \mathbf {x} \in \mathbb {R}^n \) and \( \mathbf {z} \in \mathbb {R}^m \) are the optimized variables, with \( \mathbf {A} \in \mathbb {R}^{p \times n} \) , \( \mathbf {B} \in \mathbb {R}^{p \times m} \) , and \( \mathbf {c} \in \mathbb {R}^p \) . \( f \) and \( g \) are typically assumed to be convex functions that depend on decoupled variables \( \mathbf {x} \) and \( \mathbf {z} \) , respectively.
    ADMM solves a constrained optimization problem by minimizing the augmented Lagrangian instantiated below, rather than the standard Lagrangian:
    \( \begin{align} L_\rho (\mathbf {x}, \mathbf {z}, \mathbf {y}) = f(\mathbf {x}) + g(\mathbf {z}) + \mathbf {y}^\top (\mathbf {A} \mathbf {x} + \mathbf {B} \mathbf {z} - \mathbf {c}) + \frac{\rho }{2} \parallel \mathbf {A} \mathbf {x} + \mathbf {B} \mathbf {z} - \mathbf {c} \parallel _2^2, \end{align} \)
    (9)
    where \( \mathbf {y}\in \mathbb {R}^p \) is the Lagrangian dual variable corresponding to the equality constraint and \( \rho \ge 0 \) is a penalty parameter. Evidently, the difference of augmented Lagrangian from the standard Lagrangian is the additional quadratic proximal penalty on the equality constraint. This additional penalty is shown to greatly improve convergence properties of the algorithm [Boyd et al. 2011], compared to, e.g., dual ascent.
    ADMM iteratively optimizes the augmented Lagrangian w.r.t. the primal variables \( \mathbf {x} \) and \( \mathbf {z} \) , and dual variables \( \mathbf {y} \) as follows:
    \( \begin{align} \begin{split} \mathbf {x}^{k+1} &= \mathop{\text{arg min}}\limits_{x} L_\rho (\mathbf {x}, \mathbf {z}^k, \mathbf {y}^k), \\ \mathbf {z}^{k+1} &= \mathop{\text{arg min}}\limits_{z} L_\rho (\mathbf {x}^{k+1}, \mathbf {z}, \mathbf {y}^k), \\ \mathbf {y}^{k+1} &= \mathbf {y}^k + \rho (\mathbf {A} \mathbf {x}^{k+1} + \mathbf {B} \mathbf {z}^{k+1} - \mathbf {c}). \end{split} \end{align} \)
    (10)
    For convex problems, there are well-established convergence properties for ADMM. If \( f \) and \( g \) are closed, proper, and convex, and the augmented Lagrangian without the quadratic penalty has a saddle point, then the ADMM iterations are guaranteed to converge to a point where (a) the equality constraint is satisfied, and (b) the objective and dual variable attain their optimal values. Even though convergence occurs at a linear rate, in many applications ADMM has been shown to reach modest accuracy within a few tens of iterations [Boyd et al. 2011]. ADMM has also been extensively used to solve non-convex problems similar to the one we study here [Chartrand and Wohlberg 2013; Guo et al. 2017; Hong 2018; Wang et al. 2019]. Focusing on specific instances of non-convex problems, Magnússon et al. [2015], Guo et al. [2017], and Hong [2018] establish that ADMM converges to a stationary point of the augmented Lagrangian. In general, ADMM is not guaranteed to converge for non-convex problems, and even if it does, it may not converge to a global optimum.
    Generalization to Bregman Divergence Proximal Penalty. ADMM has extensions where the quadratic proximal penalty is generalized to, e.g., Bregman divergences. Wang and Banerjee [2014] employ ADMM with a Bregman divergence proximal penalty, where KL is a special case, to solve a problem of the form (8). They minimize the augmented Lagrangian:
    \( \begin{align} L_\rho (\mathbf {x}, \mathbf {z}, \mathbf {y}) = f(\mathbf {x}) + g(\mathbf {z}) + \mathbf {y}^\top (\mathbf {A} \mathbf {x} + \mathbf {B} \mathbf {z} - \mathbf {c}) + \rho D_p(\mathbf {B} \mathbf {z} , \mathbf {c} - \mathbf {A} \mathbf {x}), \end{align} \)
    (11)
    where \( D_p \) is a Bregman divergence [Bregman 1967]. A Bregman divergence is associated with a strictly convex and continuously differentiable function \( h: \Omega \rightarrow \mathbb {R} \) , where \( \Omega \) is a closed convex set. Given such a function \( h \) and any two points \( x, y \in \Omega \) , the Bregman divergence is the difference between the value of \( h \) at point \( x \) and the value of the first-order Taylor expansion of \( h \) around point \( y \) evaluated at point \( x \) [Bregman 1967]:
    \( \begin{align} D_p(x, y) = h(x) - h(y) - \nabla h(y) ^\top (x - y). \end{align} \)
    (12)
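    For instance, the negative entropy \( h(x) = \sum _i x_i \log x_i \) generates the KL divergence used later in Section 3.5; a quick numerical check of Equation (12) (for \( x, y \) on the simplex, where the linear terms cancel):

```python
import numpy as np

def bregman(h, grad_h, x, y):
    # Eq. (12): D(x, y) = h(x) - h(y) - <grad h(y), x - y>
    return h(x) - h(y) - grad_h(y) @ (x - y)

h = lambda x: np.sum(x * np.log(x))       # negative entropy
grad_h = lambda x: np.log(x) + 1.0

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])
print(bregman(h, grad_h, x, y))           # equals KL(x || y) on the simplex
print(np.sum(x * np.log(x / y)))
```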
    Having adjusted the augmented Lagrangian as (11), Bregman-ADMM again iteratively optimizes the augmented Lagrangian w.r.t. the primal variables \( \mathbf {x} \) and \( \mathbf {z} \) , and dual variables \( \mathbf {y} \) via (10). Wang and Banerjee [2014] show that ADMM with a Bregman divergence proximal penalty converges with linear rate for convex objectives. Wang et al. [2018] extend this convergence guarantee to local optima of non-convex problems. Several works [Shi et al. 2017; Yu et al. 2018; Yu and Açıkmeşe 2019] employ a Bregman ADMM algorithm for distributed optimization of separable problems.

    3 Spectral Ranking Regression

    3.1 Problem Formulation: Ranking Regression

    As introduced in Section 2.2.1, MLE of the PL scores \( \mathbf {\pi }\in \mathbb {R}^n_+ \) amounts to minimizing the negative log-likelihood:
    \( \begin{align} \min _{\mathbf {\pi }\in \mathbb {R}^n_+} \mathcal {L}(\mathcal {D}\mid \mathbf {\pi }) \equiv \sum _{\ell =1}^m\left(\log \sum _{j \in A_\ell } {\pi }_j - \log \,{\pi }_\ell \right). \end{align} \)
    (13)
    To regress scores \( \mathbf {\pi } \) from sample features \( \mathbf {X}= {[\mathbf {x}_1,\ldots ,\mathbf {x}_n]}^T \in \mathbb {R}^{n\times d} \) , we assume that there exists a function \( \tilde{{\pi }}:\mathbb {R}^d\rightarrow [0,1] \) , parametrized by \( \mathbf {w}\in \mathbb {R}^{d^{\prime }} \) , such that
    \( \begin{align} {\pi }_i=\tilde{{\pi }}(\mathbf {x}_{i}; \mathbf {w}), \quad \text{for all}~i\in \left[n\right]\!. \end{align} \)
    (14)
    Then, MLE amounts to minimizing the following objective w.r.t. \( \mathbf {w}\in \mathbb {R}^{d^{\prime }} \) :
    \( \begin{align} \mathcal {L}(\mathcal {D}\mid \mathbf {w}) \equiv \sum _{\ell =1}^m\left(\log \sum _{j \in A_\ell } \tilde{{\pi }}(\mathbf {x}_{j}; \mathbf {w}) - \log \tilde{{\pi }}(\mathbf {x}_{\ell }; \mathbf {w}) \right). \end{align} \)
    (15)
    We tackle the regression problem via the MLE objective (15) w.r.t. three regression functions. Our notation is summarized in Table 1.
    Table 1. Notation Summary
    \( n \) number of samples
    \( m \) number of choice/ranking observations
    \( d \) number of features of each sample
    \( K \) number of samples in each ranking query
    \( \mathbf {w}\in \mathbb {R}^{d^{\prime }} \) regression model parameters
    \( \mathbf {x}_i \in \mathbb {R}^d \) feature vector of sample \( i \)
    \( \mathbf {X}\in \mathbb {R}^{n\times d} \) matrix of feature vectors of all samples
    \( \mathbf {\tilde{X}}= [\mathbf {X}| \mathbf {1}] \in \mathbb {R}^{n\times (d+1)} \) extended matrix of feature vectors of all samples
    \( c_\ell \) maximal choice for the \( \ell \) th observation
    \( \alpha _\ell \) ranking of samples in the \( \ell \) th observation
    \( A_\ell \) set of alternative samples, i.e., query set of the \( \ell \) th observation
    \( \left[n\right] \) set of all samples
    \( \left[m\right] \) set of all choice/ranking observations
    \( \mathcal {D} \) dataset of all query–observation pairs
    \( W_i \) set of observations in which sample \( i \) is chosen
    \( L_i \) set of observations in which sample \( i \) is not chosen
    \( \mathbf {\pi }\in \mathbb {R}^n_+ \) scores of all samples
    \( \tilde{\mathbf {\pi }}(\cdot , \mathbf {w}) \in \mathbb {R}^n_+ \) regression function of all scores
    \( D_p \) proximal penalty function of ADMM
    \( \mathcal {L} \) negative log-likelihood of the PL model
    \( L \) augmented Lagrangian of ADMM
    \( \rho \gt 0 \) penalty parameter of ADMM
    \( \mathbf {y}\in \mathbb {R}^{n} \) dual variables of ADMM
    \( \lambda _{ji}, \,\, i,j\in \left[n\right] \) transition rates of the ILSR MC (c.f. (5))
    \( \mathbf {\Lambda } \) transition matrix of the ILSR MC (c.f. (6))
    \( \mu _{ji}, \,\, i,j\in \left[n\right] \) transition rates of the MC (c.f. (27))
    \( \mathbf {M} \) transition matrix of the MC (c.f. (28))
    \( \mathbf {\sigma }\in \mathbb {R}^n \) additional terms in the MC rates (c.f. Theorem 3.2)
    \( (\left[n\right]_{-}, \left[n\right]_{+}) \) partition of \( \left[n\right] \) s.t. \( \sigma _i \ge 0 \) for \( i \in \left[n\right]_{+} \) and \( \sigma _i \lt 0 \) for \( i \in \left[n\right]_{-} \)
    \( \mathtt {ssd} \) stationary distribution of an MC
    Affine Regression. We assume that there exist \( \mathbf {a}\in \mathbb {R}^d \) and \( {b}\in \mathbb {R} \) , such that \( \mathbf {\pi }= \tilde{{\pi }}(\mathbf {X}; \mathbf {w})\equiv \mathbf {X}\mathbf {a}+ \mathbf {1} {b}\, \) for \( \mathbf {w}=(\mathbf {a},{b})\in \mathbb {R}^{d+1} \) . Then, MLE of parameters \( \mathbf {w} \) amounts to solving:
    \( \begin{align} \min _{\mathbf {w}=(\mathbf {a},{b}) \in \mathbb {R}^{d+1} :\tilde{{\pi }}(\mathbf {X}; \mathbf {w})\ge \mathbf {0}}\mathcal {L}\left(\mathcal {D}\,|\,\tilde{{\pi }}(\mathbf {X}; \mathbf {w})\equiv \mathbf {X}\mathbf {a}+ \mathbf {1} {b}\right)\!, \end{align} \)
    (16)
    where \( \mathcal {L} \) is given by (15). We note that Problem (16) is not convex, as the objective is not convex in \( (\mathbf {a},{b})\in \mathbb {R}^{d+1} \) .
    Logistic Regression. In the logistic case, we assume that there exists \( \mathbf {w}\in \mathbb {R}^d \) s.t. \( \mathbf {\pi }=\tilde{{\pi }}(\mathbf {X}; \mathbf {w})\equiv [e^{\mathbf {w}^\top \mathbf {x}_i}]_{i\in \left[n\right]}. \) As \( \tilde{{\pi }}(\mathbf {X}; \mathbf {w})\ge 0 \) by definition, MLE corresponds to
    \( \begin{align} \min _{\mathbf {w}\in \mathbb {R}^d} \mathcal {L}\left(\mathcal {D}\,|\,\tilde{{\pi }}(\mathbf {X}; \mathbf {w})\equiv [e^{\mathbf {w}^\top \mathbf {x}_i}]_{i\in \left[n\right]}\right)\!. \end{align} \)
    (17)
    The objective of (17) is convex in \( \mathbf {w} \) , so an optimal solution can be found via, e.g., Newton’s method [Nocedal and Wright 2006]. The convexity of Problem (17) w.r.t. \( \mathbf {w} \) follows by the facts that (a) the reparametrization \( {\pi }_i = e^{\theta _i} \) , \( i \in \left[n\right] \) makes \( \mathcal {L} \) convex, and (b) the composition of convex and affine is convex [Boyd and Vandenberghe 2004].
    DNN Regression. Finally, we regress scores \( \mathbf {\pi } \) via a generic DNN function \( \tilde{{\pi }}(\mathbf {X}; \mathbf {w}) \) by solving the following MLE problem:
    \( \begin{align} \min _{\mathbf {w}\in \mathbb {R}^{d^{\prime }}} \mathcal {L}(\mathcal {D}\,|\,\tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w})), \end{align} \)
    (18)
    where \( \tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w})\in \mathbb {R}^n_+ \) is the vector map whose coefficients \( \tilde{{\pi }}_i\in \mathbb {R}_+ \) are given by \( \tilde{{\pi }}_i=\tilde{{\pi }}(\mathbf {x}_{i}; \mathbf {w}) \) , i.e., \( \tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w})\equiv [\tilde{{\pi }}(\mathbf {x}_{i}; \mathbf {w})]_{i \in \left[n\right]} \) . For instance, \( \tilde{{\pi }}(\cdot , \mathbf {w}) \) can be a fully connected feed-forward neural network if \( \mathbf {X} \) contains numerical features, or a convolutional neural network if the samples in \( \mathbf {X} \) are images.
    As also discussed in Section 1, virtually all state-of-the-art ranking regression methods minimize Equation (15) via classic gradient-based optimization [Joachims 2002; Burges et al. 2005; Pahikkala et al. 2009; Chang et al. 2016; Dubey et al. 2016; Doughty et al. 2018; Guo et al. 2018; Tian et al. 2019; Yıldız et al. 2019]. Note that the optimization of objective (15) implies the following process: For each observation \( (c_\ell ,A_\ell) \) in \( \mathcal {D} \) , the regression function \( \tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w}) \) needs to be evaluated and updated over all samples in \( A_\ell \) . Overall, the ranking regression objective (15) implies that \( K \) samples need to be loaded in RAM to process a single ranking or maximal choice query. To make matters worse, the number \( m \) of possible rankings/maximal queries in \( \mathcal {D} \) grows as \( \binom{n}{K} = O(n^K) \) . This means that the length of a training epoch of stochastic gradient descent (and hence, the number of expensive gradient computations) is \( m=O(n^K) \) , i.e., grows exponentially in \( K \) . Combined, these two effects can make the cost of a single optimization step over Equation (15) prohibitive (c.f. Figures 1 and 5(c)). This motivates us to explore efficient spectral algorithms to solve regression problems (16)–(18), as we discuss next.

    3.2 Key Technical Result: Decoupling Optimization and a Spectral Method

    Given ILSR’s significant computational benefits, discussed in Section 2, we wish to develop analogues in the regression setting. In contrast to the feature-less setting, it is not a priori evident how to solve Problems (16)–(18) via a spectral approach. Taking the affine case as an example, and momentarily ignoring issues of non-convexity, the stationary points of the optimization problem (16) cannot be expressed via the balance equations of an MC. Our main contribution is to circumvent this problem by using the ADMM [Boyd et al. 2011]. Intuitively, ADMM allows us to decouple the optimization of scores \( \mathbf {\pi } \) from model parameters \( \mathbf {w} \) , encapsulating them in a proximal penalty: The latter becomes amenable to a spectral approach after a series of manipulations that we outline below (see Theorem 3.2).
    We wish to devise an efficient algorithm to minimize the regression objective (15). To do so, we extend the standard ADMM [Boyd et al. 2011] introduced in Section 2. To this end, we rewrite the minimization of (15) as
    \( \begin{align} \underset{\mathbf {\pi }, \mathbf {w}}{\text{Minimize}} &\quad \mathcal {L}(\mathcal {D}\,|\,\mathbf {\pi }) \equiv \sum _{\ell =1}^m\left(\log \sum _{j \in A_\ell } {\pi }_j - \log \,{\pi }_\ell \right) \end{align} \)
    (19a)
    \( \begin{align} \text{subject to:} & \quad \mathbf {\pi }= \tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w}), \quad \mathbf {\pi } \ge \mathbf {0}. \end{align} \)
    (19b)
    To solve (19) via ADMM, consider the following augmented Lagrangian (see Section 2):
    \( \begin{align} L_\rho (\mathbf {\pi }, \mathbf {w}, \mathbf {y}) = \mathcal {L}(\mathcal {D}\,|\,\mathbf {\pi }) &+ \mathbf {y}^\top (\mathbf {\pi }-\tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w})) + \rho \cdot D_p(\mathbf {\pi }||\tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w})), \end{align} \)
    (20)
    where \( \mathbf {y}\in \mathbb {R}^n \) is the Lagrangian dual variable corresponding to the equality constraint in (19b), \( \rho \ge 0 \) is a penalty parameter, and \( D_p(\cdot ||\cdot) \) is a proximal penalty term. This term satisfies: (i) \( D_p(\mathbf {\pi }||\tilde{\mathbf {\pi }}) \ge 0 \) for all \( \mathbf {\pi },\tilde{\mathbf {\pi }}\in \mathbb {R}^n_+ \) , and (ii) \( D_p(\mathbf {\pi }||\tilde{\mathbf {\pi }}) = 0 \) if and only if \( \tilde{\mathbf {\pi }}=\mathbf {\pi } \) . Classic ADMM involves using the usual \( \ell _2 \) proximal penalty, i.e., \( D_p(\mathbf {\pi }||\tilde{\mathbf {\pi }}) = \frac{1}{2} \parallel \! \mathbf {\pi }- \tilde{\mathbf {\pi }}\!\parallel _2^2 \) . We depart from this by considering more generic \( D_p \) ; as we discuss in Section 3.5, using KL divergence instead, i.e., \( D_p(\mathbf {\pi }||\tilde{\mathbf {\pi }}) = \sum _{i=1}^{n} {\pi }_i \log \frac{{\pi }_i}{\tilde{{\pi }}_i} \) has significant advantages in our setting.
    In its general form, ADMM alternates between optimizing \( \mathbf {\pi } \) and \( \mathbf {w} \) until convergence via the following primal-dual algorithm on the augmented Lagrangian (c.f. Section 2):
    \( \begin{align} \mathbf {\pi }^{k+1} &= \mathop{\text{arg min}}\limits_{\mathbf {\pi }\in \mathbb {R}^n_+} \,\, L_\rho (\mathbf {\pi }, \mathbf {w}^k, \mathbf {y}^k), \end{align} \)
    (21a)
    \( \begin{align} \mathbf {w}^{k+1} &= \mathop{\text{arg min}}\limits_{\mathbf {w}\in \mathbb {R}^{d^{\prime }}} \,\,L_\rho (\mathbf {\pi }^{k+1}, \mathbf {w}, \mathbf {y}^k), \end{align} \)
    (21b)
    \( \begin{align} \mathbf {y}^{k+1} &= \mathbf {y}^k + \rho (\mathbf {\pi }^{k+1}- \tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w}^{k+1})) \end{align} \)
    (21c)
    This has the following immediate computational advantages. First, step (21b) is equivalent to
    \( \begin{align} \begin{split}\mathbf {w}^{k+1} = \underset{\mathbf {w}\in \mathbb {R}^{d^{\prime }}}{\text{arg min}} \,\, & \rho D_p\big (\mathbf {\pi }^{k+1}||\tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w})\big) - {\mathbf {y}^k}^\top \tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w}). \end{split} \end{align} \)
    (22)
    Note that this operation does not depend on dataset \( \mathcal {D} \) : The model \( \tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w}) \) is regressed directly from the present score estimates \( \mathbf {\pi }^{k+1} \) via penalty \( D_p \) , with the additional linear dual term. In particular, as discussed below, \( D_p \) leads to a least-squares regression in the case of the \( \ell _2 \) -proximal penalty, and a max-entropy penalty in the case of KL-divergence; both can be solved efficiently. Crucially, an optimization method that solves (22) iterates over the \( O(n) \) samples, instead of the \( O(n^K) \) ranking observations, which would be the case when Equation (15) is minimized directly without ADMM decoupling.
    Most importantly, step (21a), which does depend on \( \mathcal {D} \) , admits a highly efficient spectral implementation for a generic proximal penalty \( D_p \) ; ADMM (21) therefore delegates solving the “expensive” portion of the problem to a highly efficient algorithm. To establish this, we begin with the following lemma, which we prove in Appendix C.
    Lemma 3.1.
    Given \( \tilde{\mathbf {\pi }}^k\equiv \tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w}^k) \in \mathbb {R}^n_+ \) and \( \mathbf {y}^k \in \mathbb {R}^n \) , let \( \mathbf {\pi }\in \mathbb {R}_+^n \) be such that
    \( \begin{align} \nabla _{\mathbf {\pi }}& L_\rho (\mathbf {\pi }, \mathbf {w}^k, \mathbf {y}^k) = \mathbf {0}. \end{align} \)
    (23)
    Let \( \mathbf {\sigma }(\mathbf {\pi })=[\sigma _i(\mathbf {\pi })]_{i\in \left[n\right]}\in \mathbb {R}^n \) , where
    \( \begin{align} \sigma _i(\mathbf {\pi }) &\equiv \rho \frac{\partial D_p(\mathbf {\pi }|| \tilde{\mathbf {\pi }}^k)}{\partial {{\pi }_i}} + {y}_i^k, \quad \text{for }i \in \left[n\right]\!, \end{align} \)
    (24)
    and
    \( \begin{align} \lambda _{ji}(\mathbf {\pi }) \equiv \sum _{\ell \in W_i \cap L_j} \left({\sum _{t\in A_\ell }{\pi }_t}\right)^{-1} \ge 0,\quad \text{for }i,j \in \left[n\right]\text{ with }i\ne j, \end{align} \)
    (25)
    with \( W_i = \lbrace \ell \,| i \in A_\ell , c_\ell = i\rbrace \) denoting the observations where sample \( i \in \left[n\right] \) is chosen and \( L_i = \lbrace \ell \,| i \in A_\ell , c_\ell \ne i\rbrace \) denoting the observations where sample \( i \in \left[n\right] \) is not chosen. Then, (23) is equivalent to
    \( \begin{align} \sum _{j\ne i} {\pi }_j \lambda _{ji}(\mathbf {\pi }) - \sum _{j\ne i} {\pi }_i \lambda _{ij}(\mathbf {\pi }) = {\pi }_i \sigma _i(\mathbf {\pi }), \quad \text{for all }i \in \left[n\right]\!. \end{align} \)
    (26)
    Contrasting Equation (26) to Equation (4), it is not immediately evident that the former corresponds to the balance equations of an MC as, in general, \( \mathbf {\sigma }(\mathbf {\pi }) = [\sigma _i(\mathbf {\pi })]_{i \in \left[n\right]} \ne \mathbf {0} \) . Nevertheless, for an appropriate definition of transition rates, we prove that this is indeed the case:
    Theorem 3.2.
    Equations (26) are the balance equations of a continuous-time Markov Chain (MC) with transition rates:
    \( \begin{align} \mu _{ji}(\mathbf {\pi }) &= {\left\lbrace \begin{array}{ll} \lambda _{ji}(\mathbf {\pi }) + \displaystyle \frac{2 {\pi }_i \sigma _i(\mathbf {\pi }) \sigma _j(\mathbf {\pi })}{ \sum \nolimits _{t\in \left[n\right]_{-}} {\pi }_t \sigma _t(\mathbf {\pi }) - \sum \nolimits _{t\in \left[n\right]_{+}} {\pi }_t \sigma _t(\mathbf {\pi })}, &\text{if} \,\,\, j \in \left[n\right]_{+}\,\,\, \text{and} \,\,\, i \in \left[n\right]_{-}, \\ \lambda _{ji}(\mathbf {\pi }) & \text{otherwise}, \end{array}\right.} \end{align} \)
    (27)
    where \( (\left[n\right]_{+}\!, \left[n\right]_{-}) \) denotes a partition of \( \left[n\right] \) such that \( \sigma _i(\mathbf {\pi }) \ge 0 \) for all \( i \in \left[n\right]_{+} \) and \( \sigma _i(\mathbf {\pi }) \lt 0 \) for all \( i \in \left[n\right]_{-} \) .
    The proof is in Appendix D. By Lemma 3.1 and Theorem 3.2, we conclude that a stationary point \( \mathbf {\pi }\in \mathbb {R}^n_+ \) satisfying (23) is also the stationary distribution of the continuous-time MC with transition rates:
    \( \begin{align} \mathbf {M}(\mathbf {\pi })= [\mu _{ji}(\mathbf {\pi })]_{i,j \in \left[n\right]}, \end{align} \)
    (28)
    where \( \mu _{ji}(\mathbf {\pi }) \) are given by Equation (27). Motivated by these observations, and mirroring ILSR (Equation (7)), we compute a solution to (21b) via
    \( \begin{align} \mathbf {\pi }^{l+1} = \mathtt {ssd}\left(\mathbf {M}(\mathbf {\pi }^l)\right)\!, \quad \text{for}~ l \in \mathbb {N}, \end{align} \)
    (29)
    where \( \mathbf {M}(\cdot) \) is given by Equation (28) and \( \mathtt {ssd}(\cdot) \) is an algorithm computing the steady state distribution. We refer to this procedure as ILSRX (“ILSR with features”).
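    A hedged sketch of the rate adjustment in Theorem 3.2 follows, reusing the transition_rates and ssd helpers sketched in Section 2.2.2; adjusted_rates is an illustrative name, and rates are indexed so that M[j, i] is the rate from state j to state i.

```python
import numpy as np

def adjusted_rates(Lam, pi, sigma):
    # Eq. (27): add the sigma-driven correction on transitions from states
    # in [n]_+ to states in [n]_-, leaving all other rates equal to Lambda(pi).
    pos = sigma >= 0                      # the partition ([n]_+, [n]_-)
    neg = ~pos
    denom = (pi[neg] * sigma[neg]).sum() - (pi[pos] * sigma[pos]).sum()
    M = Lam.copy()
    if pos.any() and neg.any():
        M[np.ix_(pos, neg)] += 2.0 * np.outer(sigma[pos], (pi * sigma)[neg]) / denom
    return M

# one ILSRX step, Eq. (29):
#   pi = ssd(adjusted_rates(transition_rates(pi, dataset, n), pi, sigma))
```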
    Our spectral algorithm solving Equation (19) is summarized in Algorithm 1 and Figure 2. We initialize all samples with equal scores, i.e., \( \mathbf {\pi }= \tilde{\mathbf {\pi }}= \frac{1}{n}\mathbf {1} \) , and set the initial dual variable as \( \mathbf {y}^0 = \mathbf {0} \) . We iteratively update \( \mathbf {\pi } \) , \( \mathbf {w} \) , and \( \mathbf {y} \) via Equation (21) until convergence. At each ADMM iteration \( k+1 \) , we initialize \( \mathbf {\pi } \) with \( \mathbf {\pi }^{k} \) , and update \( \mathbf {\pi } \) via ILSRX given by Equation (29). We elaborate on the resulting implementation specifics for each of the regression problems (16)–(18) below.
    Fig. 2. Spectral Ranking Algorithm Diagram. We iteratively update \( \mathbf {\pi } \) , \( \mathbf {w} \) , and \( \mathbf {y} \) via Equation (21) until convergence. Each block corresponds to one of the three ADMM steps in Equation (21). External inputs are provided on the left, in yellow, and optimized variables from the previous iteration or step are provided on the top. We indicate the variables provided by the previous ( \( k \) th) iteration in blue and the updated variables (i.e., the output at the \( (k+1) \) -th iteration) in green.
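    Since Algorithm 1 appears only as a figure in the original, the following is a hedged sketch of the loop it describes, reusing the helpers sketched above; predict, regress, and grad_Dp abstract the regression model, the solver for step (21b), and the gradient of \( D_p \) used in Equation (24), and all names are illustrative.

```python
import numpy as np

def spectral_ranking(dataset, X, predict, regress, grad_Dp, rho=1.0,
                     outer_iters=20, inner_iters=5):
    n = X.shape[0]
    pi = np.ones(n) / n                    # pi = pi_tilde = (1/n) 1
    y = np.zeros(n)                        # dual variable y^0 = 0
    w = regress(pi, y, rho)                # illustrative initial fit; the paper
                                           # uses the initializations of Secs. 3.3-3.4
    for _ in range(outer_iters):
        # step (21a): score update via ILSRX, Eq. (29)
        for _ in range(inner_iters):
            sigma = rho * grad_Dp(pi, predict(X, w)) + y      # Eq. (24)
            Lam = transition_rates(pi, dataset, n)
            pi = ssd(adjusted_rates(Lam, pi, sigma))
        # step (21b): model update, Eq. (22)
        w = regress(pi, y, rho)
        # step (21c): dual update
        y = y + rho * (pi - predict(X, w))
    return pi, w

# e.g., for the affine case: regress = lambda pi, y, rho: affine_w_update(X_tilde, pi, y, rho)
```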

    3.3 Spectral Algorithm for Affine Regression

    In this section, we apply Algorithm 1 on affine regression of PL scores via
    \( \begin{align} \tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w})=\mathbf {X}\mathbf {a}+ \mathbf {1} {b}=\mathbf {\tilde{X}}\mathbf {w}. \end{align} \)
    (30)
    We introduce \( \mathbf {w}=(\mathbf {a}, {b}) \in \mathbb {R}^{d+1} \) and \( \mathbf {\tilde{X}}= [\mathbf {X}| \mathbf {1}] \in \mathbb {R}^{n\times (d+1)} \) for notational simplicity. Recall from Section 3.1 that the resulting MLE problem is given by
    \( \begin{align} \min _{\mathbf {w}\in \mathbb {R}^{d+1} :\tilde{{\pi }}(\mathbf {X}; \mathbf {w})\ge \mathbf {0}}\mathcal {L}\left(\mathcal {D}\,|\,\tilde{{\pi }}(\mathbf {X}; \mathbf {w})\equiv \mathbf {\tilde{X}}\mathbf {w}\right)\!. \end{align} \)
    (31)
    In this setting, we employ standard ADMM with quadratic proximal penalty:
    \( \begin{align} D_p(\mathbf {\pi }||\tilde{\mathbf {\pi }}) = \frac{1}{2} \parallel \! \mathbf {\pi }- \tilde{\mathbf {\pi }}\!\parallel _2^2\!. \end{align} \)
    (32)
    This leads to additional theoretical guarantees explained in Section 3.7. Employing Algorithm 1 with quadratic \( D_p(\mathbf {\pi }||\tilde{\mathbf {\pi }}) \) and affine \( \tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w}) \) , the regression step (21b) of ADMM becomes:
    \( \begin{align} \mathbf {w}^{k+1} &= \underset{\mathbf {w}\in \mathbb {R}^{d+1}}{\text{arg min}} \, {\mathbf {y}^k}^T (\mathbf {\pi }^{k+1} - \mathbf {\tilde{X}}\mathbf {w}) + \frac{\rho }{2}\parallel \! \mathbf {\pi }^{k+1}- \mathbf {\tilde{X}}\mathbf {w}\!\parallel _2^2 \nonumber \\ &= {(\mathbf {\tilde{X}}^T\mathbf {\tilde{X}})}^{-1}\mathbf {\tilde{X}}^T\left(\mathbf {\pi }^{k+1} + \frac{1}{\rho }\mathbf {y}^{k}\right). \end{align} \)
    (33)
    Equation (33) follows from completing the square by combining the linear and quadratic terms into a single quadratic objective. This has the following immediate computational advantage: The regression step (33) is a quadratic minimization and admits a closed form solution.
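    In code, the update (33) is a single least-squares solve; using lstsq rather than forming the inverse explicitly is a standard, numerically safer choice (a sketch, with illustrative names):

```python
import numpy as np

def affine_w_update(X_tilde, pi, y, rho):
    # Eq. (33): w = (X~^T X~)^{-1} X~^T (pi + y / rho),
    # computed as the least-squares minimizer.
    return np.linalg.lstsq(X_tilde, pi + y / rho, rcond=None)[0]
```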
    Crucially, the score update step (21a) is amenable to the spectral solution established in Theorem 3.2, where
    \( \begin{align} \mathbf {\sigma }= \rho (\mathbf {\pi }- \mathbf {\tilde{X}}\mathbf {w}^k) + \mathbf {y}^k. \end{align} \)
    (34)
    Having adjusted the transition matrix \( \mathbf {M}(\mathbf {\pi }) \) in this way, \( \mathbf {\pi } \) can be obtained by repeated iterations of ILSRX (29), with \( \tilde{{\pi }}(\mathbf {X}; \mathbf {w}) \) and \( D_p \) given by Equations (30) and (32), respectively. We refer to this instance of Algorithm 1 as Spectral Ranking-l2 (SR-l2).
    We note that, as Problem (19) is non-convex, selecting a good initialization point is important in practice. We discuss initialization below, leaving the additional theoretical guarantees to Section 3.7.
    Initialization. We initialize \( \mathbf {w} \) so that \( \mathbf {\tilde{X}}\mathbf {w} \) is a good approximation of PL scores. We use a technique akin to Saha and Rajkumar [2018], applied to our affine setting. Given a distribution over queries \( A\subseteq \left[n\right] \) , let \( P_{ij}=\mathbf {P}_A[c=i\,|\,\lbrace i,j\rbrace \subseteq A] \) be the probability that \( i \) is chosen given a query \( A \) that contains both \( i \) and \( j \) . By (1), for \( i, j \in \left[n\right] \) , \( \frac{{P}_{ij}}{{P}_{ji}} = \frac{{\pi }_i}{{\pi }_j} = \frac{\mathbf {x}_i^\top \mathbf {a}+ {b}}{\mathbf {x}_j^\top \mathbf {a}+ {b}} \) , or
    \( \begin{align} \begin{split} \delta _{ij}(\mathbf {w}) \equiv {({P}_{ij} \mathbf {x}_j - {P}_{ji} \mathbf {x}_i)}^\top \mathbf {a}+ {({P}_{ij} - {P}_{ji})}{b}= 0. \end{split} \end{align} \)
    (35)
    Motivated by (35), we estimate \( P_{ij} \) empirically from \( \mathcal {D} \) , and obtain our initialization \( \mathbf {w}^0=(\mathbf {a}^0,{b}^0)\in \mathbb {R}^{d+1} \) by solving (35) in the least-squares sense; that is,
    \( \begin{align} \mathbf {w}^0 = \mathop{\arg\min}\limits _{\mathbf {w}\in \mathbb {R}^{d+1}: \mathbf {\tilde{X}}\mathbf {w}\ge \mathbf {0} \,\wedge \, \mathbf {1}^\top \mathbf {\tilde{X}}\mathbf {w}= 1}\sum _{i,j} \delta _{ij}^2(\mathbf {w}). \end{align} \)
    (36)
    Note that this is a convex quadratic program. Given \( \mathbf {w}^0 \) , we generate the initial PL scores via the affine parametrization \( \mathbf {\pi }^0=\mathbf {\tilde{X}}\mathbf {w}^0 \) .

    3.4 Spectral Algorithm for Logistic Regression

    We describe here how to apply Algorithm 1 on logistic regression of PL scores via
    \( \begin{align} \log \tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w})=\mathbf {X}\mathbf {w}, \end{align} \)
    (37)
    where \( \log \mathbf {\pi }=[\log {\pi }_i]_{i\in \left[n\right]} \) is the \( \mathbb {R}^n \) vector generated by applying \( \log \) to \( \mathbf {\pi } \) element-wise. Recall from Section 3.1 that the resulting MLE problem is given by
    \( \begin{align} \min _{\mathbf {w}\in \mathbb {R}^d} \mathcal {L}\left(\mathcal {D}\,|\,\tilde{{\pi }}(\mathbf {X}; \mathbf {w})\equiv [e^{\mathbf {w}^\top \mathbf {x}_i}]_{i\in \left[n\right]}\right)\!, \end{align} \)
    (38)
    which is convex, and can thus be solved by Newton’s method. Nevertheless, we would like to accelerate its computation via a spectral method akin to ILSR.
    Mirroring the affine case, we employ Algorithm 1 with a quadratic proximal penalty. Consistent with the logistic parametrization (37), we write the equality constraint in (19b) as \( \log \mathbf {\pi }= \mathbf {X}\mathbf {w} \) , so that both the dual term and the proximal penalty act on the log-scores:
    \( \begin{align} D_p(\mathbf {\pi }||\tilde{\mathbf {\pi }}) = \frac{1}{2} \parallel \! \log \mathbf {\pi }- \log \tilde{\mathbf {\pi }}\!\parallel _2^2\!. \end{align} \)
    (39)
    As a result, the regression step (21b) corresponds to
    \( \begin{align} \mathbf {w}^{k+1} &= \underset{\mathbf {w}\in \mathbb {R}^{d}}{\text{arg min}} \, {\mathbf {y}^k}^T (\log \mathbf {\pi }^{k+1} - \mathbf {X}\mathbf {w}) + \frac{\rho }{2}\parallel \! \log \mathbf {\pi }^{k+1} - \mathbf {X}\mathbf {w}\!\parallel _2^2 \nonumber \\ &= {(\mathbf {X}^T\mathbf {X})}^{-1}\mathbf {X}^T\left(\log \mathbf {\pi }^{k+1} + \frac{1}{\rho }\mathbf {y}^{k}\right), \end{align} \)
    (40)
    which is again a quadratic minimization and admits a closed form solution.
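For concreteness, a minimal numpy sketch of the closed-form update (40) is given below; solving the normal equations replaces the explicit inverse, and since \( \mathbf {X} \) is fixed across ADMM iterations, the factor \( {(\mathbf {X}^T\mathbf {X})}^{-1}\mathbf {X}^T \) can equivalently be precomputed once.
```python
# Minimal sketch of the regression step (40) for SR-l2-log; X, pi, y are
# assumed to be numpy arrays and rho a positive scalar (illustrative names).
import numpy as np

def w_update_logistic(X, pi, y, rho):
    # w^{k+1} = (X^T X)^{-1} X^T (log pi^{k+1} + y^k / rho)
    return np.linalg.solve(X.T @ X, X.T @ (np.log(pi) + y / rho))
```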
More importantly, the score update step (21b) is amenable to the spectral solution established in Theorem 3.2. Following, mutatis mutandis, the same manipulations as in Lemma 3.1, a stationary point of the objective in step (21b) can be cast as the stationary distribution of the continuous-time MC with transition rates \( \mu _{ji}(\mathbf {\pi }) \) , \( i,j\in \left[n\right] \) (Equation (27)), where the vector \( \mathbf {\sigma }= [\sigma _i]_{i\in \left[n\right]} \) is now given by
    \( \begin{align} \sigma _i = \frac{\rho (\log {\pi }_i - \mathbf {x}_i^T \mathbf {w}^k) + {y}_i^k}{{\pi }_i },\quad i \in \left[n\right]\!. \end{align} \)
    (41)
Having adjusted the transition matrix \( \mathbf {M}(\mathbf {\pi }) \) in this way, \( \mathbf {\pi } \) can again be obtained by repeated iterations of ILSRX (29), with \( \tilde{{\pi }}(\mathbf {X}; \mathbf {w}) \) and \( D_p \) given by Equations (37) and (39), respectively. We refer to this instance of Algorithm 1 as Spectral Ranking-l2-log (SR-l2-log).
Initialization. Similar to the initialization of SR-l2 (c.f. Equation (36)), we initialize \( \mathbf {w} \) so that the initial scores obey the PL model, mirroring the approach of Saha and Rajkumar [2018]. Defining \( P_{ij} \) , \( i,j\in \left[n\right] \) , in the same way, and using the logistic parametrization in Equation (17), we have that
    \( \begin{align} \frac{{P}_{ij}}{{P}_{ji}}= \frac{{\pi }_i}{{\pi }_j} = e^{\mathbf {w}^T (\mathbf {x}_i - \mathbf {x}_j)}. \end{align} \)
    (42)
    Accordingly, we initialize \( \mathbf {w} \) as
    \( \begin{align} \mathbf {w}^0 = \text{arg min}_{\mathbf {w}\in \mathbb {R}^d} \, \sum _{i,j} {\left(\mathbf {w}^T (\mathbf {x}_i - \mathbf {x}_j) - \log \left(\frac{\hat{P}_{ij}}{\hat{P}_{ji}}\right)\right)}^2, \end{align} \)
    (43)
    where \( \hat{P}_{ij} \) , \( i,j\in \left[n\right] \) , are again empirical estimates obtained from dataset \( \mathcal {D} \) . Given \( \mathbf {w}^0 \) , we generate the initial PL scores via the logistic parametrization \( \mathbf {\pi }^0=[e^{\mathbf {x}_i^T \mathbf {w}^0}]_{i\in \left[n\right]} \) .
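A minimal numpy sketch of the initialization (43) follows, assuming P_hat stores the empirical estimates \( \hat{P}_{ij} \) and is zero on pairs that are never queried; the names are illustrative.
```python
# Minimal sketch of the SR-l2-log initialization (43): an unconstrained
# least-squares fit over feature differences of queried pairs.
import numpy as np

def init_sr_l2_log(X, P_hat):
    n = X.shape[0]
    diffs, targets = [], []
    for i in range(n):
        for j in range(i + 1, n):
            if P_hat[i, j] > 0 and P_hat[j, i] > 0:
                diffs.append(X[i] - X[j])
                targets.append(np.log(P_hat[i, j] / P_hat[j, i]))
    w0, *_ = np.linalg.lstsq(np.array(diffs), np.array(targets), rcond=None)
    pi0 = np.exp(X @ w0)        # initial PL scores via the logistic map
    return w0, pi0
```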

    3.5 Spectral Algorithm for DNN Regression

    In this section, we employ Algorithm 1 for DNN regression of PL scores via \( \tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w}) \) (c.f. Problem (18)).
    To regress scores via a generic DNN \( \tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w}) \) , we generalize the traditional quadratic proximal penalty \( D_p(\mathbf {\pi }||\tilde{\mathbf {\pi }}) = \frac{1}{2} \parallel \! \mathbf {\pi }- \tilde{\mathbf {\pi }}\!\parallel _2^2 \) to KL-divergence, which is naturally suited to Problem (19): This is because the scores computed by ILSRX (29) form, by construction, a distribution over \( \left[n\right] \) . This generalization to KL penalty is still amenable to a spectral solution via ILSRX (29), as Theorem 3.2 holds for any proximal penalty \( D_p \) ; in practice, this also leads to a significantly improved performance over an \( \ell _2 \) -proximal penalty (by up to \( 56\% \) Top-1 accuracy and \( 25\% \) KT correlation, as discussed in Section 4.9). As shown in Equation (48) below, the KL-divergence penalty leads to training \( \tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w}) \) with a max-entropy loss and an additional linear term at each ADMM iteration. Compared to the \( \ell _2 \) -proximal penalty, max-entropy has been observed to converge faster and lead to better fitting [Caruana and Niculescu-Mizil 2004; Golik et al. 2013; Bosman et al. 2020]. Finally, ADMM with Bregman divergence proximal penalties, including KL divergence, has been shown to offer local convergence guarantees for non-convex problems (c.f. Section 2.2.3); this further motivates us to employ KL over other distance measures between probability distributions.
    We implement DSR with the two proximal penalty functions \( D_p(\mathbf {\pi }||\tilde{\mathbf {\pi }}) \) mentioned above: KL divergence [Kullback and Leibler 1951] and \( \ell _2 \) norm. We describe implementation specifics for each of these cases below.
    \( \ell _2 \) proximal penalty. For ADMM with quadratic proximal penalty:
    \( \begin{align} D_p(\mathbf {\pi }||\tilde{\mathbf {\pi }}) = \frac{1}{2} \parallel \! \mathbf {\pi }- \tilde{\mathbf {\pi }}\!\parallel _2^2, \end{align} \)
    (44)
the regression step (21c) of DSR becomes a least-squares minimization of the form:
    \( \begin{align} \mathbf {w}^{k+1} = \underset{\mathbf {w}\in \mathbb {R}^{d^{\prime }}}{\text{arg min}} \,\, \parallel \! \tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w})- \left(\mathbf {\pi }^{k+1} + \frac{1}{\rho }\mathbf {y}^k\right) \!\parallel _2^2\!. \end{align} \)
    (45)
    As a result, we train the DNN regressor \( \tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w}) \) via SGD optimization over the loss function (45).
    Moreover, the score update step (21b) is amenable to the spectral approach established in Theorem 3.2 and solved via repeated iterations of ILSRX (29), where
    \( \begin{align} \sigma _i(\mathbf {\pi }) = \rho \frac{\partial D_p(\mathbf {\pi }|| \tilde{\mathbf {\pi }}^k)}{\partial {{\pi }_i}} + {y}_i^k = \rho ({\pi }_i - \tilde{{\pi }}(\mathbf {x}_i, \mathbf {w}^k)) + {y}_i^k, \end{align} \)
    (46)
    for all \( i \in \left[n\right] \) in the transition rates (27). We refer to this instance of Algorithm 1 as Deep Spectral Ranking-l2 (DSR-l2).
    KL proximal penalty. For ADMM with KL proximal penalty:
    \( \begin{align} D_p(\mathbf {\pi }||\tilde{\mathbf {\pi }}) = \sum _{i=1}^{n} {\pi }_i \log \frac{{\pi }_i}{\tilde{{\pi }}_i}, \end{align} \)
    (47)
    the optimization step (21c) to train \( \tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w}) \) corresponds to minimizing a max-entropy loss with an additional linear term:
    \( \begin{align} \mathbf {w}^{k+1} = \underset{\mathbf {w}\in \mathbb {R}^{d^{\prime }}}{\text{arg min}} \,\, \sum _{i=1}^{n} \left(- \frac{{y}_i^k}{\rho } \tilde{{\pi }}(\mathbf {x}_i, \mathbf {w}) - {\pi }_i^{k+1} \log \tilde{{\pi }}(\mathbf {x}_i, \mathbf {w})\right). \end{align} \)
    (48)
    Moreover, the score update step (21b) is again solved via repeated iterations of ILSRX (29), where
    \( \begin{align} \sigma _i(\mathbf {\pi }) = \rho \frac{\partial D_p(\mathbf {\pi }|| \tilde{\mathbf {\pi }}^k)}{\partial {{\pi }_i}} + {y}_i^k = \rho \left(1 + \log \frac{{\pi }_i}{\tilde{{\pi }}(\mathbf {x}_i, \mathbf {w}^k)}\right) + {y}_i^k , \end{align} \)
    (49)
    for all \( i \in \left[n\right] \) in the transition rates (27). We refer to this instance of Algorithm 1 as Deep Spectral Ranking-KL (DSR-KL).
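For reference, the two gradients above can be sketched in a few lines of numpy; the function names are ours, and pi_tilde is assumed to hold the DNN outputs \( \tilde{{\pi }}(\mathbf {x}_i, \mathbf {w}^k) \) on all samples.
```python
# Minimal sketch of the penalty gradients (46) and (49) that enter the
# transition rates (27) during the score update step.
import numpy as np

def sigma_l2(pi, pi_tilde, y, rho):
    return rho * (pi - pi_tilde) + y                   # Eq. (46), DSR-l2

def sigma_kl(pi, pi_tilde, y, rho):
    return rho * (1.0 + np.log(pi / pi_tilde)) + y     # Eq. (49), DSR-KL
```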
Initialization. For both penalty choices, we initialize all samples with equal scores, i.e., \( \mathbf {\pi }= \tilde{\mathbf {\pi }}= \frac{1}{n}\mathbf {1} \) . At each ADMM iteration \( k \) , we initialize \( \mathbf {\pi } \) with \( \mathbf {\pi }^{k-1} \) , and update \( \mathbf {\pi } \) via ILSRX. Then, we initialize the DNN parameters \( \mathbf {w} \) with \( \mathbf {w}^{k-1} \) , and fine-tune the DNN \( \tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w}) \) via SGD over Equation (45) or (48).
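As a concrete illustration of the regression step, a minimal numpy sketch of the training objectives (45) and (48) follows, assuming pi_tilde holds the DNN outputs on all \( n \) training samples; in practice these objectives are minimized over \( \mathbf {w} \) via SGD/Adam as described above, and the small eps guard is our addition for numerical stability.
```python
# Minimal sketch of the per-iteration DSR regression objectives.
import numpy as np

def dsr_l2_loss(pi_tilde, pi, y, rho):
    # Least-squares objective (45) for DSR-l2.
    return np.sum((pi_tilde - (pi + y / rho)) ** 2)

def dsr_kl_loss(pi_tilde, pi, y, rho, eps=1e-12):
    # Max-entropy objective (48) for DSR-KL: linear term plus cross-entropy.
    return np.sum(-(y / rho) * pi_tilde - pi * np.log(pi_tilde + eps))
```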

    3.6 Computational Complexity

Each iteration of our spectral algorithm (c.f. Algorithm 1) involves the three steps in Equation (21). One iteration of ILSRX at step (21b) is \( O(Km+ n^2) \): \( O(Km) \) for constructing the transition rates via Equation (27), and \( O(n^2) \) for finding the stationary distribution \( \mathbf {\pi } \) via, e.g., a power method [Lei et al. 2016]. It is important to note that the number of ranking observations \( m \) enters step (21b) only through the construction of the transition rates \( \mu _{ji} \), which are computed via sums over these \( m \) quantities (see Equation (27)). This is important, as the (potentially exponential in \( K \)) ranking observations affect the complexity only through scalar summations, which themselves can be easily parallelized (via, e.g., reduce-by-key operations). This is in sharp contrast to classic gradient-based optimization (c.f. Section 3.1 and Figure 1), in which \( m \) directly determines the size of each epoch.
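As an illustration of the spectral step, a minimal numpy power-method sketch for the stationary distribution follows; it assumes the rate matrix M has non-negative off-diagonal entries \( \mu _{ji} \) with rows summing to zero, and the uniformization step size and names are our assumptions.
```python
# Minimal sketch of the power-method inner step of ILSRX: compute the
# stationary distribution of a continuous-time MC with rate matrix M.
import numpy as np

def stationary_distribution(M, r_tol=1e-4, max_iter=10000):
    n = M.shape[0]
    eps = 1.0 / np.max(np.abs(np.diag(M)))   # keeps I + eps * M row-stochastic
    P = np.eye(n) + eps * M                  # uniformized transition matrix
    pi = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        pi_next = pi @ P
        pi_next /= pi_next.sum()
        if np.linalg.norm(pi_next - pi) < r_tol * np.linalg.norm(pi_next):
            return pi_next
        pi = pi_next
    return pi
```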
We discuss the complexity of the regression step (21c) separately for the shallow and DNN regression cases. For shallow regression via the affine or logistic parametrizations, i.e., SR-l2 and SR-l2-log, the regression step (21c) takes the specific forms (33) and (40), respectively. In both cases, the \( \mathbf {w} \) update is \( O\big (n(d+1)\big) \), a matrix-vector multiplication, since the matrix \( {(\mathbf {X}^T\mathbf {X})}^{-1}\mathbf {X}^T \) can be precomputed. For DNN regression via DSR, the regression step (21c) is equivalent to training \( \tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w}) \) via Equation (22). Most importantly, the length of each training epoch is linear in the number of samples \( n \): each optimization step goes over \( O(n) \) samples to update the \( {d^{\prime }} \) parameters. Again, this is in contrast to the \( m=O(n^K) \)-length epochs if one were minimizing the likelihood function directly via standard gradient descent methods. Finally, given \( \tilde{\mathbf {\pi }} \) and \( \mathbf {\pi } \), the update of \( \mathbf {y} \) at step (21a) is \( O(n) \).
Overall, updating \( \mathbf {\pi } \) via ILSRX at step (21b) and \( \mathbf {w} \) via gradient-based training at step (21c) are both iterative processes, whose computational complexity until convergence has not yet been theoretically characterized. That said, the literature on the complexity and convergence of iterative optimization over such highly non-convex objectives illustrates the challenges behind such a theoretical analysis [Jain and Kar 2017; Taheri et al. 2021]. Nevertheless, our Theorem 3.4 below, establishing convergence guarantees in the affine regression case, is a step in this direction. In any case, the per-iteration costs above are in sharp contrast to, e.g., siamese networks, in which \( m \) characterizes the length of an entire training epoch.

    3.7 Theoretical Guarantees

As the ranking regression problem (19) is, in general, non-convex, we aim at obtaining convergence guarantees for our spectral algorithm (c.f. Algorithm 1). To do so, we focus on the affine regression case described in Section 3.3. In this case, we wish to establish conditions such that (i) the scores computed at each ADMM iteration form a global optimum w.r.t. step (21b), and (ii) the scores and model parameters generated by Algorithm 1 converge to a stationary point of Problem (16).
    Recall from Section 3.3 that the affine regression problem is given by
    \( \begin{align} \min _{\mathbf {w}\in \mathbb {R}^{d+1} :\tilde{{\pi }}(\mathbf {X}; \mathbf {w})\ge \mathbf {0}}\mathcal {L}\left(\mathcal {D}\,|\,\tilde{{\pi }}(\mathbf {X}; \mathbf {w})\equiv \mathbf {\tilde{X}}\mathbf {w}\right)\!. \end{align} \)
    (50)
In this setting, we employ ADMM with the standard \( \ell _2 \) proximal penalty \( D_p(\mathbf {\pi }||\tilde{\mathbf {\pi }}) = \frac{1}{2} \parallel \! \mathbf {\pi }- \tilde{\mathbf {\pi }}\!\parallel _2^2 \) . As Problem (19) is non-convex, condition (23) is, in general, not sufficient for optimality w.r.t. step (21b). To establish conditions under which it is, we require the following technical assumption:
    Assumption 3.1.
For \( \lbrace \mathbf {\pi }^k\rbrace _{k\in \mathbb {N}} \) , given by (21b), there exists an \( \epsilon \gt 0 \) such that \( {\pi }_i^k \gt \epsilon \) for all \( i \in \left[n\right] \) and \( k \in \mathbb {N} \) .
In other words, we assume that the scores computed at each ADMM iteration are strictly positive. Under this assumption, we show that stationarity implies optimality w.r.t. (21b) for large enough \( \rho \):
    Theorem 3.3.
Under Assumption 3.1, for \( \rho \ge \frac{2}{\epsilon ^2} \max _i \sum _{\ell |i\in A_\ell }\frac{1}{|A_\ell |^2} \) and \( D_p(\mathbf {\pi }||\tilde{\mathbf {\pi }}) = \frac{1}{2} \parallel \! \mathbf {\pi }- \tilde{\mathbf {\pi }}\!\parallel _2^2 \) , a \( \mathbf {\pi }\gt \mathbf {0} \) satisfying condition (23) is a minimizer of (21b).
The proof is in Appendix E. Theorem 3.3 implies that, given large enough \( \rho \) , the stationary scores satisfying condition (23) also form a global optimum of the ADMM step (21b). Using this result, we further establish the following convergence guarantee for Algorithm 1 in the affine regression case:
    Theorem 3.4.
    For \( \tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w})=\mathbf {\tilde{X}}\mathbf {w} \) and \( D_p(\mathbf {\pi }||\tilde{\mathbf {\pi }}) = \frac{1}{2} \parallel \! \mathbf {\pi }- \tilde{\mathbf {\pi }}\!\parallel _2^2 \) , suppose that there exists \( \kappa \gt 0 \) such that \( \mathbf {\tilde{X}}^T \mathbf {\tilde{X}}\succeq \kappa \, \mathbf {I} \) and the sequence \( \lbrace (\mathbf {\pi }^k, \mathbf {y}^k, \mathbf {w}^k)\rbrace _{k \in \mathbb {N}} \) generated by (21) is bounded. Then, under Assumption 3.1, for \( \rho \gt \frac{2 \max _i |W_i|}{\epsilon ^2} \) and \( W_i = \lbrace \ell \,| i \in A_\ell , c_\ell = i\rbrace \) denoting the observations where sample \( i \in \left[n\right] \) is chosen, the sequence \( \lbrace (\mathbf {\pi }^k, \mathbf {y}^k, \mathbf {w}^k)\rbrace _{k \in \mathbb {N}} \) generated by (21) converges to a point that satisfies the Karush-Kuhn-Tucker (KKT) conditions of (16).
    The proof is in Appendix F. By Theorem 3.4, we conclude that the scores \( \mathbf {\pi } \) and the affine regression parameters \( \mathbf {w} \) generated by Algorithm 1 converge to a stationary point of Problem (16) for large enough \( \rho \) .

    4 Experiments

    4.1 Datasets

    We evaluate our algorithms on synthetic and real-life datasets, summarized in Table 2.
Table 2.
Dataset | K | n | d | m | X
ROP-num / ROP-img | 2 | 100 | 143 / 224 × 224 | 29,705 | numerical / image
FAC | 2 | 1,000 | 50 | 728 | numerical
Pairwise Sushi | 2 | 100 | 18 | 450 | numerical
Triplet Sushi | 3 | 100 | 18 | 1,200 | numerical
ICLR-3/4 | 3 / 4 | 100 | 768 | 120,324 / 2,248,524 | numerical
Movehub-Cost-4/5 | 4 / 5 | 50 | 6 | 230,298 / 2,118,756 | numerical
Movehub-Quality-4/5 | 4 / 5 | 50 | 5 | 230,298 / 2,118,756 | numerical
IMDB-4 | 4 | 50 | 36 | 85,583 | numerical
    Table 2. No. of Samples ( \( n \) ), No. of Features ( \( d \) ), No. of Observations ( \( m \) ), and Query Size ( \( K \) ) for Datasets with Image or Numerical Features ( \( \mathbf {X} \) )
Synthetic Datasets. We generate the feature vectors \( \mathbf {x}_i \in \mathbb {R}^{d} \) , \( i \in \left[n\right] \) , from \( \mathcal {N} (\mathbf {0},\sigma ^2_{x}\mathbf {I}_{d\times d}) \) and a common parameter vector \( {\mathbf {w}} \in \mathbb {R}^{d} \) from \( \mathcal {N} (\mathbf {0},\sigma ^2_{{\beta }}\mathbf {I}_{d\times d}) \) . Then, we generate the PL scores via the logistic parametrization \( \mathbf {\pi }=[e^{\mathbf {x}_i^T \mathbf {w}}]_{i\in \left[n\right]} \) . We normalize the resulting scores so that \( \mathbf {1}^\top \mathbf {\pi }=1 \) . We set \( \sigma ^2_{x}=\sigma ^2_{{\beta }}=0.8 \) in all experiments. Given \( \mathbf {\pi } \) , we generate each observation in \( \mathcal {D} \) as follows: We first select \( |A_\ell |=2 \) samples out of the \( n \) samples uniformly at random. Then, we generate the choice \( c_\ell \) , \( \ell \in \left[m\right] \) , from the PL model given by Equation (1).
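The following minimal numpy sketch mirrors this generation process for the pairwise case; the RNG seed and variable names are illustrative.
```python
# Minimal sketch of the synthetic data generation described above.
import numpy as np

rng = np.random.default_rng(0)
n, d, m, sigma2 = 1000, 100, 1000, 0.8

X = rng.normal(0.0, np.sqrt(sigma2), size=(n, d))  # features x_i
w = rng.normal(0.0, np.sqrt(sigma2), size=d)       # common parameter vector
pi = np.exp(X @ w)
pi /= pi.sum()                                     # normalized PL scores

dataset = []
for _ in range(m):
    i, j = rng.choice(n, size=2, replace=False)    # query of size |A_l| = 2
    p_i = pi[i] / (pi[i] + pi[j])                  # PL choice probability (1)
    c = i if rng.random() < p_i else j
    dataset.append((c, (i, j)))
```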
    Retinopathy of Prematurity (ROP). The ROP dataset contains \( n=100 \) vessel-segmented retina images. Experts are provided with two images and are asked to choose the image with higher severity of the ROP disease. Five experts independently label 5,941 image pairs; the resulting dataset contains \( m=29,705 \) pairwise comparisons. Note that some pairs are labeled more than once by different experts. We employ two ROP datasets, ROP-num and ROP-img for shallow and DNN regression, respectively: Each sample in ROP-num has \( d=143 \) extracted numerical features [Ataer-Cansızoğlu 2015], while each sample in ROP-img is an image with dimensions \( d=224 \times 224 \) .
    Filter Aesthetic Comparison (FAC). The FAC dataset [Sun et al. 2017] contains 1,280 unfiltered images pertaining to eight different categories. A total of 22 different image filters are applied to each image. Labelers are provided with two filtered images and are asked to identify which image has better quality. We select \( n=1,000 \) images within one category, as only the filtered image pairs that are within the same category are compared. The resulting dataset contains \( m=728 \) pairwise comparisons. Moreover, for each image, we extract features via a state-of-the-art convolutional neural network architecture, namely GoogLeNet [Szegedy et al. 2015], with weights pre-trained on the ImageNet dataset [Deng et al. 2009]. We select \( d=50 \) of these features by Principal Component Analysis [Jolliffe 1986].
    SUSHI. The SUSHI Preference dataset [Kamishima et al. 2009] contains \( n=100 \) sushi ingredients with \( d=18 \) features. Each of the 5,000 customers independently ranks 10 ingredients according to her preferences. We select the rankings provided by 10 customers, where an ingredient is ranked higher if it precedes the other ingredients in a customer’s ranked list. We generate two datasets: triplet Sushi containing \( m=1,200 \) rankings of \( |A_\ell |=3 \) ingredients, and pairwise Sushi containing \( m=450 \) pairwise comparisons.
International Conference on Learning Representations (ICLR). The ICLR dataset contains abstracts and reviewer ratings of 2,561 papers submitted to the ICLR 2020 conference and available on the OpenReview website [Sun 2020]. We choose the top \( n=100 \) papers, and extract \( d=768 \) numerical features from each abstract using the Deep Bidirectional Transformers (BERT) [Devlin et al. 2019] architecture, pre-trained on the Books Corpus dataset [Zhu et al. 2015] and English Wikipedia. We normalize \( \mathbf {X} \) to have 0 mean and unit variance over samples \( \left[n\right] \) . We generate all possible \( m=120,324 (2,248,524) \) \( K=3(4) \) -way rankings w.r.t. the relative order of the average reviewer ratings. We add noise to the resulting rankings following the same process as for Movehub-Cost, described below.
    Movehub-Cost. The Movehub-Cost dataset contains the total ranking of 216 cities w.r.t. cost of living [Blitzer 2017]. Each city is associated with \( d=6 \) numerical features, which are average costs for cappuccino, cinema, wine, gasoline, rent, and disposable income. We normalize \( \mathbf {X} \) to have 0 mean and unit variance over samples \( \left[n\right] \) . We select \( n=50 \) cities and generate all \( m=230,298 (2,118,756) \) \( K=4(5) \) -way rankings w.r.t. the relative order of the queried cities in the total ranking. To mimic the real-life noise introduced by human labeling, we apply the following post-processing to the resulting rankings: For each ranking, we sample a value uniformly at random in \( [0,1] \) . If the value is less than 0.1, we add noise to the ranking by a cyclic permutation of the ranked samples.
Movehub-Quality. The Movehub-Quality dataset contains the total ranking of the same 216 cities as Movehub-Cost, this time w.r.t. quality of life. Each city is associated with \( d=5 \) numerical features, including overall scores for purchase power, healthcare, pollution, quality of life, and crime. We normalize \( \mathbf {X} \) to have 0 mean and unit variance over samples \( \left[n\right] \) . We select \( n=50 \) cities and generate all \( m=230,298 (2,118,756) \) \( K=4(5) \) -way rankings w.r.t. the relative order of the queried cities in the total ranking. We add noise to the rankings following the same process as Movehub-Cost.
    IMDB. The IMDB Movies Dataset contains IMDB ratings of 14,762 movies, each of which is associated with \( d=36 \) numerical features [Leka 2016]. We normalize \( \mathbf {X} \) to have 0 mean and unit variance over samples \( \left[n\right] \) . We select \( n=50 \) movies and generate all possible \( m=85,583 \) \( K=4 \) -way rankings w.r.t. the relative order of the ratings of queried movies. We add noise to the resulting rankings following the same process as Movehub-Cost.

    4.2 Experiment Setup

    We partition each dataset into training and test sets in two ways. In rank partitioning, we partition the dataset w.r.t. \( \left[m\right] \) , using \( 90\% \) of the \( m \) observations for training, and the remaining \( 10\% \) for testing. In sample partitioning, we partition \( \left[n\right] \) , using \( 90\% \) of the \( n \) samples for training, and the remaining \( 10\% \) for testing. In this setting, observations containing samples from both training and validation/test sets are discarded. We perform cross validation on the resulting training sets. We use 10 folds for synthetic datasets, ROP-num, FAC, and Triplet Sushi, and 3 folds for the other datasets. For synthetic datasets, we also repeat experiments over five random generations.
To evaluate the convergence speed of each algorithm, we measure the elapsed time in seconds (Time), including time spent in initialization. Moreover, we measure the prediction performance using Top-1 accuracy (Top-1 Acc.) and KT correlation [Kendall 1938] on the validation and test sets (c.f. Section 4.7). For synthetic datasets, we also measure the quality of convergence by the norm of the difference between the estimated and true PL scores ( \( \triangle \mathbf {\pi } \) ); lower values indicate better estimation. We report averages and standard deviations over folds for all algorithms except for the siamese network method: As training takes several hours, we execute only one fold.

    4.3 Shallow Regression Competing Methods

We implement seven competing algorithms. Four are feature methods, i.e., algorithms that regress PL scores from features: SR-l2, described in Section 3.3; SR-l2-log, described in Section 3.4; sequential least-squares quadratic programming (SLSQP), which solves (16); and Newton on \( {{\mathbf {w}}} \) , which solves the convex problem (17) via Newton's method. The remaining three are featureless methods, i.e., algorithms that learn the PL scores from the choice observations alone: ILSR, described by Equation (7); the MM algorithm [Hunter 2004]; and Newton on \( \mathbf {\theta } \) , which solves Equation (3) via the reparametrization \( {\pi }_i = e^{\theta _i} \) , \( i \in \left[n\right] \) , using Newton's method on \( \mathbf {\theta }=[\theta _i]_{i\in \left[n\right]} \) . We expand upon the implementation of all algorithms below.
SR-l2. We compute the stationary distribution at each iteration of ILSRX (c.f. Equation (29)) using the power method [Lei et al. 2016]. As the stopping criterion, we use \( \parallel \!\! {\mathbf {\pi }}^k- {\mathbf {\pi }}^{k-1} \!\!\parallel _2 \,\lt \, {r_\mathtt {tol}} \parallel \!\! {\mathbf {\pi }}^k \!\!\parallel _2 \) and \( \parallel \!\! \mathbf {\tilde{X}}\mathbf {w}^k - \mathbf {\tilde{X}}\mathbf {w}^{k-1} \!\!\parallel _2 \,\lt \, {{r_\mathtt {tol}}} \parallel \!\! \mathbf {\tilde{X}}\mathbf {w}^k \!\!\parallel _2 \) . We set the relative tolerance \( r_\mathtt {tol}=10^{-4} \) for all experiments. We use the same relative tolerance for the stopping criterion of the power method. We set \( \rho =1 \) in our experiments, which is a standard choice in the ADMM literature [Boyd et al. 2011]. In our experiments, we consistently observe that Equation (26) is satisfied. For this reason, we use \( c_j = \frac{- {\pi }_j \sigma _j}{\sum _{i\in \left[n\right]_{-}} {\pi }_i \sigma _i} \) , \( j \in \left[n\right]_{+} \) , to calculate the transition rates (27).
    SR-l2-log. As the stopping criterion, we use \( \parallel \!\! {\mathbf {\pi }}^k- {\mathbf {\pi }}^{k-1} \!\!\parallel _2 \,\lt \, {r_\mathtt {tol}} \parallel \!\! {\mathbf {\pi }}^k \!\!\parallel _2 \) and \( \parallel \!\! e^{\mathbf {X}{{\mathbf {w}}^k}} - e^{\mathbf {X}{{\mathbf {w}}^{k-1}}} \!\!\parallel _2 \,\lt \, {r_\mathtt {tol}} \parallel \!\! e^{\mathbf {X}{{\mathbf {w}}^k}} \!\!\parallel _2 \) , where exponentiation is applied elementwise.
SLSQP. We initialize SLSQP in the same way as SR-l2 (c.f. Section 3.3). As the stopping criterion, we use \( \parallel \!\! {\mathbf {\pi }}^k- {\mathbf {\pi }}^{k-1} \!\!\parallel _2 \,\lt \, {r_\mathtt {tol}} \parallel \!\! {\mathbf {\pi }}^k \!\!\parallel _2 \) , where \( \mathbf {\pi }^k = \mathbf {X}\mathbf {a}^k + \mathbf {1} {b}^k \) , \( k \in \mathbb {N} \) . Each iteration of SLSQP is \( O(\sum _{\ell \in \mathcal {D}}(|A_\ell | (d+1)) + (d+1)^2) \) for constructing the gradient of Equation (15) w.r.t. \( \mathbf {w} \) and updating \( \mathbf {w} \) , respectively.
Newton on \( {{\mathbf {w}}} \) . We initialize Newton on \( {{\mathbf {w}}} \) in the same way as SR-l2-log (c.f. Section 3.4). As the stopping criterion, we use \( \parallel \!\! {\mathbf {\pi }}^k- {\mathbf {\pi }}^{k-1} \!\!\parallel _2 \,\lt \, {r_\mathtt {tol}} \parallel \!\! {\mathbf {\pi }}^k \!\!\parallel _2 \) , where \( \mathbf {\pi }^k = [e^{ \mathbf {x}_i^T \mathbf {w}^k}]_{i\in \left[n\right]} \) , \( k \in \mathbb {N} \) . Each iteration of Newton on \( {{\mathbf {w}}} \) is \( O(\sum _{\ell \in \mathcal {D}}(|A_\ell |\, d^2) + d^2) \) for constructing the Hessian of Equation (15) w.r.t. \( \mathbf {w} \) and updating \( \mathbf {w} \) , respectively.
    ILSR. We initialize ILSR with \( {\mathbf {\pi }}^0 = \frac{1}{n}\mathbf {1} \) . We compute the stationary distribution at each iteration of ILSR using the power method. As the stopping criterion, we use \( \parallel \!\! {\mathbf {\pi }}^k- {\mathbf {\pi }}^{k-1} \!\!\parallel _2 \,\lt \, {r_\mathtt {tol}} \parallel \!\! {\mathbf {\pi }}^k \!\!\parallel _2 \) . Each iteration of ILSR is \( O(\sum _{\ell \in \mathcal {D}}(|A_\ell |) + n^2) \) for constructing the transition matrix \( \mathbf {\Lambda }(\mathbf {\pi }) \) (c.f. Equation (6)) and finding the stationary distribution \( \mathbf {\pi } \) , respectively.
    MM. We initialize MM with \( {\mathbf {\pi }}^0 = \frac{1}{n}\mathbf {1} \) . As the stopping criterion, we use \( \parallel \!\! {\mathbf {\pi }}^k- {\mathbf {\pi }}^{k-1} \!\!\parallel _2 \,\lt \, {r_\mathtt {tol}} \parallel \!\! {\mathbf {\pi }}^k \!\!\parallel _2 \) . Each iteration of MM is \( O(\sum _{\ell \in \mathcal {D}}(|A_\ell |)) \) .
    Newton on \( \mathbf {\theta } \) . We initialize Newton on \( \mathbf {\theta } \) with \( {\mathbf {\theta }}^0 = [\theta _i^0]_{i\in \left[n\right]} = \mathbf {0} \) . As stopping criterion, we use \( \parallel \!\! {\mathbf {\pi }}^k- {\mathbf {\pi }}^{k-1} \!\!\parallel _2 \,\lt \, {r_\mathtt {tol}} \parallel \!\! {\mathbf {\pi }}^k \!\!\parallel _2 \) , where \( {\mathbf {\pi }}^k_i=e^{\theta _i^k} \) , \( i \in \left[n\right] \) , \( k \in \mathbb {N} \) . Each iteration of Newton on \( \mathbf {\theta } \) is \( O(\sum _{\ell \in \mathcal {D}}(|A_\ell |^2) + n^2) \) for constructing the Hessian of Equation (15) w.r.t. \( \mathbf {\theta } \) and updating \( \mathbf {\theta } \) , respectively.

    4.4 DNN Regression Competing Methods

We implement five competing algorithms. DSR with KL penalty (DSR-KL), DSR with \( \ell _2 \) norm penalty (DSR-l2), and a siamese network competitor regress scores via a DNN \( \tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w}) \) with parameters \( \mathbf {w} \) . SR-l2 and SR-KL use an affine regressor \( \tilde{\mathbf {\pi }}=\mathbf {X}\mathbf {a}+ \mathbf {1} {b} \) with parameters \( \mathbf {a} \) and \( {b} \) , and solve (19) via ADMM with \( \ell _2 \) and KL penalties, respectively.
DSR-KL and DSR-l2 are explained in Section 3.5. We compute the stationary distribution at each iteration of ILSRX (c.f. Equation (29)) using the power method [Lei et al. 2016]. At each ADMM iteration, we fine-tune the DNN regressor \( \tilde{{\pi }}(\mathbf {X}; \mathbf {w}) \) via Adam optimization [Kingma and Ba 2015] over Equation (22). As the stopping criterion for (i) fine-tuning \( \tilde{{\pi }}(\mathbf {X}; \mathbf {w}) \) at each ADMM iteration and (ii) the overall DSR algorithm, we check the convergence of \( \mathcal {L} \) on the training set and of the KT correlation on the validation set (c.f. Section 4.7). We declare convergence when the KT on the validation set does not change for 5 iterations, up to a maximum of 50 iterations, with relative tolerance \( r_\mathtt {tol}=10^{-4} \) . We use the same relative tolerance for the stopping criterion of the power method.
    To aid convergence in practice [Boyd et al. 2011], we update the dual variable at each ADMM iteration with a multiplicative smoothing parameter \( \gamma ^k=1/k \) . We also adapt the penalty parameter as
    \( \begin{align*} \rho ^{k+1} &= {\left\lbrace \begin{array}{ll} \tau \rho ^k,&\!\!\text{if} \parallel \! \tilde{\mathbf {\pi }}^k - \mathbf {\pi }^k \!\parallel _2^2 \gt \beta \parallel \! \tilde{\mathbf {\pi }}^k \!-\! \tilde{\mathbf {\pi }}^{k-1} \!\parallel _2^2\\ \frac{\rho ^k}{\tau }, &\!\!\text{if} \parallel \! \tilde{\mathbf {\pi }}^k \!-\! \mathbf {\pi }^k \!\parallel _2^2 \lt \beta \parallel \! \tilde{\mathbf {\pi }}^k \!-\! \tilde{\mathbf {\pi }}^{k-1} \!\parallel _2^2\\ \rho ^k, & \!\!\text{otherwise,} \end{array}\right.} \end{align*} \)
    where \( \tau =2 \) and \( \beta =10 \) .
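In code, this adaptation rule amounts to a few comparisons; a minimal numpy sketch (with the values \( \tau =2 \) and \( \beta =10 \) above, and illustrative names) follows.
```python
# Minimal sketch of the residual-based penalty adaptation of rho.
import numpy as np

def update_rho(rho, pi_tilde_k, pi_k, pi_tilde_prev, tau=2.0, beta=10.0):
    primal = np.sum((pi_tilde_k - pi_k) ** 2)
    drift = beta * np.sum((pi_tilde_k - pi_tilde_prev) ** 2)
    if primal > drift:
        return tau * rho     # penalize the primal residual more heavily
    if primal < drift:
        return rho / tau
    return rho
```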
    The siamese network competitor minimizes Equation (15) w.r.t. \( \mathbf {w} \) via SGD. We train a siamese network architecture with base network \( \tilde{{\pi }} \) on observations \( \mathcal {D} \) via Adam optimization [Kingma and Ba 2015] over Equation (15). We employ the same convergence criterion and experiment setup as DSR-KL and DSR-l2 to train and optimize the siamese network.
Finally, we implement two spectral algorithms that regress PL scores via an affine model, i.e., \( \tilde{\mathbf {\pi }}=\mathbf {X}\mathbf {a}+ \mathbf {1}{b} \) : SR-l2, and its variant SR-KL that uses a KL proximal penalty instead of the \( \ell _2 \) norm penalty for ADMM. As the stopping criterion for both algorithms, we use \( \parallel \!\! {\mathbf {\pi }}^k- {\mathbf {\pi }}^{k-1} \!\!\parallel _2 \,\lt \, {r_\mathtt {tol}} \parallel \!\! {\mathbf {\pi }}^k \!\!\parallel _2 \) and \( \parallel \!\! (\mathbf {X}\mathbf {a}^k + \mathbf {1}{b}^k) - (\mathbf {X}\mathbf {a}^{k-1} +\mathbf {1} {b}^{k-1}) \!\!\parallel _2 \,\lt \, {{r_\mathtt {tol}}} \parallel \!\! \mathbf {X}\mathbf {a}^k + \mathbf {1}{b}^k \!\!\parallel _2 \) . We set the ADMM penalty parameter as \( \rho =1 \) .

    4.5 DNN Architecture and Training Details

To evaluate each method on ROP-img, we choose \( \tilde{{\pi }}(\mathbf {X}; \mathbf {w}) \) as a state-of-the-art convolutional neural network architecture, namely GoogLeNet [Szegedy et al. 2015], followed by a fully connected output layer comprising a single neuron with sigmoid activation. We initialize the convolutional layers with weights pre-trained on the ImageNet dataset [Deng et al. 2009]. For other datasets, we design \( \tilde{{\pi }}(\mathbf {X}; \mathbf {w}) \) as a fully connected architecture with ReLU activations for the hidden layers and an output layer comprising a single neuron with sigmoid activation. We add \( \ell _2 \) regularizers to all layers. For each configuration of (i) \( \ell _2 \) regularization parameter varying in \( [2\times 10^{-5},2\times 10^{-2}] \) , (ii) learning rate varying in \( [10^{-4},10^{-2}] \) , and (iii) number of layers varying in \( [1,10] \) for the fully connected \( \tilde{{\pi }}(\mathbf {X}; \mathbf {w}) \) , we run each method until convergence (c.f. Section 4.4 for the criteria). We determine the best set of hyperparameters w.r.t. prediction performance via cross validation (c.f. Section 4.2).

    4.6 Execution Environment

In Tables 3–5, we measure timing on an Intel Xeon CPU E5-2680v2 2.8 GHz with 128 GB RAM. Particularly for experiments on larger synthetic datasets (c.f. Figures 3 and 4), we use an Intel Xeon CPU E5-2680v4 2.4 GHz with 500 GB RAM. For all DNN regression experiments, we employ NVIDIA GPUs with Intel Gold CPUs and 128 GB RAM.
Fig. 3. Convergence time (Time) and top-1 test accuracy (Top-1 Acc.) vs. \( n \) and \( d \) for SR-l2, SR-l2-log, ILSR, and Newton on \( {{\mathbf {w}}} \) . Evaluations are on synthetic datasets containing \( m=250 \) observations partitioned w.r.t. rank partitioning. The number of samples varies over \( n\in \lbrace 50,100,1000,10000,100000\rbrace \) when the number of features is \( d=100 \) , and the number of features varies over \( d\in \lbrace 10,100,1000,10000\rbrace \) when the number of samples is \( n=1,000 \) .
Fig. 4. Convergence time (Conv. Time) and top-1 test accuracy (Top-1 Acc.) of SR-l2, SR-l2-log, ILSR, and Newton on \( {{\mathbf {w}}} \) evaluated on synthetic datasets vs. the number of observations \( m\in \lbrace 10,100,1000,10000,100000\rbrace \) . Observations are partitioned w.r.t. rank partitioning, with \( n=1,\!000 \) samples, \( d=100 \) features, and query size \( |A_\ell |=2 \) .
Table 3.
Partitioning | Method | Time (s) ↓ | Iter. ↓ | Δπ ↓ | Top-1 Acc. ↑ | KT ↑
Sample Partitioning | SR-l2 | 0.237 ± 0.006 | 4 ± 0 | 0.717 ± 0.207 | 0.831 ± 0.119 | 0.609 ± 0.247
 | SR-l2-log | 1.428 ± 2.595 | 49 ± 79 | 0.845 ± 0.204 | 0.668 ± 0.159 | 0.335 ± 0.318
 | ILSR (no X) | 0.045 ± 0.002 | 2 ± 0 | 0.718 ± 0.207 | 0.5 ± 0.0 | −1.0 ± 0.0
 | MM (no X) | 9.728 ± 0.487 | 500 ± 0 | 1.2 ± 0.1 | 0.5 ± 0.0 | 0.0 ± 0.0
 | Newton on θ (no X) | 4.537 ± 0.729 | 14 ± 3 | 1.236 ± 0.132 | 0.5 ± 0.0 | −0.08 ± 0.272
 | Newton on w | 6.406 ± 2.104 | 14 ± 5 | 0.808 ± 0.462 | 0.844 ± 0.148 | 0.688 ± 0.296
 | SLSQP | 43.908 ± 24.469 | 229 ± 132 | 0.718 ± 0.206 | 0.796 ± 0.106 | 0.592 ± 0.211
Rank Partitioning | SR-l2 | 0.48 ± 0.24 | 4 ± 0 | 0.717 ± 0.207 | 0.837 ± 0.037 | 0.569 ± 0.072
 | SR-l2-log | 1.58 ± 2.027 | 29 ± 14 | 0.883 ± 0.208 | 0.699 ± 0.066 | 0.398 ± 0.132
 | ILSR (no X) | 0.098 ± 0.056 | 2 ± 0 | 0.718 ± 0.208 | 0.708 ± 0.045 | 0.389 ± 0.088
 | MM (no X) | 11.302 ± 0.515 | 500 ± 0 | 0.864 ± 0.19 | 0.685 ± 0.037 | 0.354 ± 0.074
 | Newton on θ (no X) | 8.218 ± 1.782 | 14 ± 3 | 1.244 ± 0.121 | 0.506 ± 0.029 | 0.01 ± 0.05
 | Newton on w | 7.696 ± 2.35 | 14 ± 4 | 0.804 ± 0.463 | 0.871 ± 0.087 | 0.742 ± 0.173
 | SLSQP | 47.824 ± 28.585 | 219 ± 138 | 0.718 ± 0.206 | 0.819 ± 0.035 | 0.637 ± 0.07
    Table 3. Evaluations on a Synthetic Dataset with \( n=1,000 \) , \( d=100 \) , and \( m=1,000 \) , Partitioned w.r.t. Sample Partitioning and Rank Partitioning
    We report the convergence time (Time), number of iterations until convergence (Iter), norm error in estimating true PL scores ( \( \triangle \mathbf {\pi } \) ), top-1 accuracy on the test set (Top-1 Acc.), and Kendall-Tau correlation on the test set (KT). ILSR, MM, and Newton on \( \mathbf {\theta } \) do not use the features \( \mathbf {X} \) . Newton on \( {{\mathbf {w}}} \) and SLSQP regress \( \mathbf {\pi } \) from \( \mathbf {X} \) .

    4.7 Performance Metrics

We measure the prediction performance by Top-1 accuracy (Top-1 Acc.) and KT correlation on the test set. Let the test set be \( \mathcal {D}_{\mathtt {rank}} = \lbrace (\alpha ^\ell , A_\ell) \,|\, \ell \in \lbrace 1,\ldots ,m_{\mathtt {test}}\rbrace \rbrace \) , where \( \alpha ^\ell = \alpha _{1}^\ell \succ \alpha _{2}^\ell \succ \cdots \succ \alpha ^\ell _{K} \) is an ordered sequence of the samples in \( A_\ell \) . Given \( A_\ell \) , we predict the \( \ell \) th choice as \( \hat{c}_\ell = \mathop{\arg \max }_{i \in A_\ell } {\pi }_i \) . We calculate the Top-1 accuracy (Top-1 Acc.) as
    \( \begin{align} \text{Top-1 Acc.} = \frac{\sum _{\ell =1}^{m_{\mathtt {test}}} \mathbb {1}(\hat{c}_\ell = \alpha _{1}^\ell)}{m_{\mathtt {test}}} \in [0,1]. \end{align} \)
    (51)
We also predict the ranking as \( \hat{\alpha }^\ell = \operatorname{argsort}\,[{\pi }_i]_{i \in A_\ell } \) , i.e., the sequence of the samples in \( A_\ell \) ordered w.r.t. their estimated scores. We calculate the KT correlation [Kendall 1938] as a measure of the correlation between each true ranking \( \alpha ^\ell \) and predicted ranking \( \hat{\alpha }^\ell \) , \( \ell \in \lbrace 1,\ldots ,m_{\mathtt {test}}\rbrace \) . For observation \( \ell \) , let \( T_\ell = \sum _{t=1}^{K} \sum _{s=1}^{K} \mathbb {1}(\hat{\alpha }_t^\ell \succ \hat{\alpha }_s^\ell \wedge \alpha _t^\ell \succ \alpha _s^\ell) \) be the number of sample pairs whose relative order is predicted correctly, and \( F_\ell = \sum _{t=1}^{K} \sum _{s=1}^{K} \mathbb {1}(\hat{\alpha }_t^\ell \succ \hat{\alpha }_s^\ell \wedge \alpha _s^\ell \succ \alpha _t^\ell) \) be the number of sample pairs whose relative order is predicted incorrectly. Then, KT is computed by
    \( \begin{align} \text{KT} = \frac{\sum _{\ell =1}^{m_{\mathtt {test}}} (T_\ell - F_\ell)/ \binom{K}{2}}{m_{\mathtt {test}}} \in [-1, 1], \end{align} \)
    (52)
where \( \binom{K}{2} \) is the number of sample pairs.
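A minimal numpy sketch of both metrics follows, assuming each test observation is a pair \( (\alpha ^\ell , A_\ell) \) with \( \alpha ^\ell \) ordered best-first and pi holding the estimated scores; the names are illustrative.
```python
# Minimal sketch of Top-1 accuracy (51) and KT correlation (52).
import numpy as np
from itertools import combinations

def top1_and_kt(test_set, pi):
    top1, kt = [], []
    for alpha, A in test_set:
        pred = sorted(A, key=lambda i: -pi[i])        # argsort by score
        top1.append(pred[0] == alpha[0])              # Eq. (51) indicator
        r_true = {s: r for r, s in enumerate(alpha)}
        r_pred = {s: r for r, s in enumerate(pred)}
        pairs = list(combinations(A, 2))
        T = sum((r_pred[s] < r_pred[t]) == (r_true[s] < r_true[t])
                for s, t in pairs)                    # concordant pairs
        C = len(pairs)                                # binom(K, 2)
        kt.append((2 * T - C) / C)                    # (T - F) / binom(K, 2)
    return float(np.mean(top1)), float(np.mean(kt))
```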

    4.8 Shallow Regression Results

Sample partitioning. We begin with the experiments on sample partitioning, in which we partition the samples \( \left[n\right] \) into training and test sets. Table 3 shows the evaluations of all algorithms trained on a synthetic dataset with \( n=1,000 \) samples, \( d=100 \) features, \( m=1,000 \) observations, and query size \( |A_\ell |=2 \) . SR-l2 and SR-l2-log converge 4–27 times faster than the other feature methods, i.e., Newton on \( {{\mathbf {w}}} \) and SLSQP. Recall that in sample partitioning, training observations contain only training samples: Test samples do not participate in any of the training observations. Thus, the featureless methods ILSR, MM, and Newton on \( \mathbf {\theta } \) are no better than random predictors, with 0.5 Top-1 Acc. and 0.0 KT. By regressing the PL scores from features, SR-l2 and SR-l2-log significantly outperform the predictions of ILSR, MM, and Newton on \( \mathbf {\theta } \) , by 16%–33% Top-1 Acc. and 16%–30% KT.
Real datasets. We observe an equally significant speed gain on real datasets; Table 4 shows the evaluations on four real datasets partitioned w.r.t. sample partitioning. SR-l2 and SR-l2-log are 3–18 times faster than Newton on \( {{\mathbf {w}}} \) and SLSQP. This speed gain is fundamentally due to the smaller per-iteration complexity of SR-l2 and SR-l2-log. For instance, compared to Newton on \( {{\mathbf {w}}} \) on FAC, SR-l2-log requires about 2 times more iterations, but still converges 3 times faster. Moreover, while significantly decreasing the convergence time, SR-l2 and SR-l2-log consistently attain prediction performance similar to Newton on \( {{\mathbf {w}}} \) and SLSQP, except for Sushi, on which they perform slightly worse (by \( 13\% \) – \( 20\% \) Top-1 Acc.) while reducing convergence time strikingly (by \( 78\% \) – \( 95\% \) ).
Table 4.
Dataset | Method | Time (s) ↓ | Iter. ↓ | Top-1 Acc. ↑ | KT ↑
FAC | SR-l2 | 0.301 ± 0.048 | 4 ± 0 | 0.654 ± 0.237 | 0.307 ± 0.473
 | SR-l2-log | 0.298 ± 0.466 | 10 ± 15 | 0.685 ± 0.237 | 0.369 ± 0.474
 | ILSR (no X) | 0.059 ± 0.016 | 2 ± 0 | 0.5 ± 0.0 | −1.0 ± 0.0
 | MM (no X) | 5.905 ± 0.282 | 500 ± 0 | 0.5 ± 0.0 | 0.0 ± 0.0
 | Newton on θ (no X) | 7.604 ± 0.805 | 18 ± 2 | 0.5 ± 0.0 | −0.4 ± 0.49
 | Newton on w | 0.859 ± 0.077 | 6 ± 1 | 0.67 ± 0.17 | 0.34 ± 0.339
 | SLSQP | 14.332 ± 5.684 | 178 ± 67 | 0.675 ± 0.147 | 0.349 ± 0.293
ROP-num | SR-l2 | 1.708 ± 0.166 | 4 ± 0 | 0.783 ± 0.03 | 0.565 ± 0.06
 | SR-l2-log | 0.325 ± 0.028 | 1 ± 0 | 0.724 ± 0.105 | 0.448 ± 0.209
 | ILSR (no X) | 0.649 ± 0.053 | 2 ± 0 | 0.5 ± 0.0 | −1.0 ± 0.0
 | MM (no X) | 0.001 ± 0.001 | 1 ± 0 | 0.5 ± 0.0 | −1.0 ± 0.0
 | Newton on θ (no X) | 68.924 ± 5.521 | 8 ± 0 | 0.497 ± 0.012 | −0.988 ± 0.036
 | Newton on w | 47.563 ± 8.342 | 2 ± 1 | 0.552 ± 0.048 | 0.103 ± 0.096
 | SLSQP | 4.823 ± 4.914 | 2 ± 1 | 0.769 ± 0.052 | 0.538 ± 0.104
Pairwise Sushi | SR-l2 | 0.046 ± 0.01 | 4 ± 0 | 0.451 ± 0.082 | −0.09 ± 0.177
 | SR-l2-log | 0.141 ± 0.025 | 27 ± 13 | 0.532 ± 0.076 | 0.064 ± 0.152
 | ILSR (no X) | 0.014 ± 0.006 | 2 ± 0 | 0.5 ± 0.0 | −1.0 ± 0.0
 | MM (no X) | 1.513 ± 0.587 | 352 ± 183 | 0.5 ± 0.0 | −1.0 ± 0.0
 | Newton on θ (no X) | 1.282 ± 0.924 | 18 ± 9 | 0.5 ± 0.0 | −0.666 ± 0.472
 | Newton on w | 0.21 ± 0.115 | 4 ± 2 | 0.665 ± 0.035 | 0.33 ± 0.069
 | SLSQP | 4.619 ± 6.321 | 168 ± 235 | 0.624 ± 0.065 | 0.248 ± 0.13
Triplet Sushi | SR-l2 | 0.091 ± 0.02 | 4 ± 0 | 0.358 ± 0.805 | −0.333 ± 0.924
 | SR-l2-log | 0.556 ± 0.276 | 40 ± 25 | 0.393 ± 0.826 | 0.096 ± 1.069
 | ILSR (no X) | 0.033 ± 0.012 | 2 ± 0 | 0.334 ± 0.0 | −0.047 ± 1.089
 | MM (no X) | 1.824 ± 0.701 | 267 ± 151 | 0.334 ± 0.0 | 0.0 ± 0.0
 | Newton on θ (no X) | 2.728 ± 1.475 | 13 ± 3 | 0.334 ± 0.0 | −0.047 ± 1.089
 | Newton on w | 1.966 ± 3.158 | 10 ± 19 | 0.322 ± 0.802 | −0.261 ± 0.956
 | SLSQP | 1.656 ± 1.793 | 20 ± 30 | 0.608 ± 0.826 | 0.358 ± 0.928
    Table 4. Evaluations on Real Datasets Partitioned w.r.t. Sample Partitioning
    We report the convergence time (Time), number of iterations until convergence (Iter), top-1 accuracy on the test set (Top-1 Acc.), and Kendall-Tau correlation on the test set (KT). ILSR, MM, and Newton on \( \mathbf {\theta } \) do not use the features \( \mathbf {X} \) . Newton on \( {{\mathbf {w}}} \) and SLSQP regress \( \mathbf {\pi } \) from \( \mathbf {X} \) .
    Aligned with the prediction performance on synthetic datasets, featureless methods ILSR, MM, and Newton on \( \mathbf {\theta } \) can only attain 0.5 Top-1 Acc. and 0.0 KT. By regressing the PL scores from features, SR-l2 and SR-l2-log significantly outperform the predictions of ILSR, MM, and Newton on \( \mathbf {\theta } \) , by \( 3\% \) \( 31\% \) Top-1 Acc. and \( 5\% \) \( 78\% \) KT.
Rank partitioning. A sample can appear in both training and test observations in rank partitioning. Hence, the featureless methods ILSR, MM, and Newton on \( \mathbf {\theta } \) should fare better than in sample partitioning. Nonetheless, in Table 3, as \( n=1,000 \) is larger than \( d=100 \) , there are more scores to learn than parameters. As a result, feature methods retain an advantage in prediction over featureless methods. In particular, SR-l2 and SR-l2-log outperform the predictions of ILSR, MM, and Newton on \( \mathbf {\theta } \) in rank partitioning in Table 3, by \( 13\% \) Top-1 Acc. and \( 9\% \) KT. The relative performance of feature vs. featureless methods is governed by the relationship among \( n \) , \( d \) , and \( m \) , which we explore next. We do not include Newton on \( \mathbf {\theta } \) and MM in this scaling analysis, as they are too slow.
    Impact of \( d \) . To assess the impact of number of parameters, we fix \( n=1,000 \) , \( m=250 \) , \( |A_\ell |=2 \) and generate synthetic datasets with \( d\in \lbrace 10,100,1000,10000\rbrace \) . Figure 3(a) shows the Time and Top-1 Acc. of SR-l2, SR-l2-log, ILSR, and Newton on \( {{\mathbf {w}}} \) . As \( m=250 \) observations are not enough to learn \( n=1,000 \) scores, SR-l2 leads to significantly better Top-1 Acc. compared to ILSR. When \( d=10 \) , SR-l2 and SR-l2-log outperform ILSR by \( 18\% \) \( 28\% \) Top-1 Acc. Moreover, SR-l2 and SR-l2-log are consistently faster than Newton on \( {{\mathbf {w}}} \) , for all \( d\gt 100 \) . Particularly, for \( d=10,000 \) , SR-l2 and SR-l2-log converge 42–579 times faster than Newton on \( {{\mathbf {w}}} \) . Interestingly, the convergence time of SR-l2-log can even decrease with increasing \( d \) . This is because the number of iterations until convergence decreases. While significantly decreasing the convergence time, SR-l2 consistently attains better Top-1 Acc. than Newton on \( {{\mathbf {w}}} \) , up to \( 8\% \) for \( d=100 \) . Impact of \( n \) . To assess the impact of number of samples, we fix \( d=100 \) , \( m=250 \) , \( |A_\ell |=2 \) and generate synthetic datasets with \( n\in \lbrace 50,100,1000,10000,100000\rbrace \) ; Figure 3(b) shows evaluations on the resulting datasets. For \( n\gt d=100 \) , i.e., when there are more scores to learn than parameters, SR-l2 leads to significantly better Top-1 Acc. compared to ILSR. Particularly, for \( n=100,000 \) , SR-l2 outperforms ILSR by \( 25\% \) Top-1 Acc. This confirms that, especially when the number of observations \( m \) is not sufficient to learn \( n \) scores, exploiting the features associated with the samples is crucial in attaining good prediction performance. As expected, convergence time of Newton on \( {{\mathbf {w}}} \) is not significantly affected by \( n \) . Despite this, SR-l2 and SR-l2-log are faster than Newton on \( {{\mathbf {w}}} \) for all \( n\lt 1,\!000 \) . Particularly, for \( n=50 \) , SR-l2 and SR-l2-log converge 2–75 times faster than Newton on \( {{\mathbf {w}}} \) . While decreasing the convergence time, SR-l2 consistently attains better Top-1 Acc. than Newton on \( {{\mathbf {w}}} \) , up to \( 19\% \) for \( n=100,\!000 \) .
    Impact of \( m \) . Figure 4 shows the convergence time (Time) and top-1 test accuracy (Top-1 Acc.) of SR-l2, SR-l2-log, ILSR, and Newton on \( {{\mathbf {w}}} \) when trained on synthetic datasets with number of observations \( m\in \lbrace 10,100,1000,10000,100000\rbrace \) . Observations are partitioned w.r.t. rank partitioning, where number of samples is \( n=1,\!000 \) , number of parameters is \( d=100 \) , and size of each query is \( |A_\ell |=2 \) . As \( n\gt d \) , SR-l2 benefits from being able to regress \( n \) scores from a smaller number of \( d \) parameters and leads to significantly better Top-1 Acc compared to ILSR in Figure 4. Especially when \( m \) is not enough to learn \( n=1,000 \) scores, but to learn \( d=100 \) parameters, SR-l2 gains the most performance advantage over ILSR, up to \( 13\% \) Top-1 Acc. Moreover, SR-l2 and SR-l2-log are consistently faster than Newton on \( {{\mathbf {w}}} \) , for all number of observations \( m\gt 100 \) . Particularly, for \( m=100,000 \) , SR-l2 and SR-l2-log converge 4–60 times faster than Newton on \( {{\mathbf {w}}} \) .
Real datasets. Performance on real datasets agrees with the observations above regarding the dependence on \( n \) , \( d \) , and \( m \) . Overall, SR-l2 and SR-l2-log consistently converge faster than Newton on \( {{\mathbf {w}}} \) and SLSQP, by 3–27 times across all real datasets. Table 5 also confirms the observations from Figure 3(a) and (b) regarding the effect of \( n \) and \( d \) : For FAC, \( n=1,000 \gt m=728 \gt d=50 \) , i.e., there are enough observations to learn the parameters, but not the scores. As a result, SR-l2 and SR-l2-log significantly outperform the predictions of ILSR, by \( 10\% \) Top-1 Acc. and \( 25\% \) KT. For the other real datasets, \( m \) is much larger than \( n \) (c.f. Table 2); thus, feature methods lead to similar prediction performance to each other and slightly lower performance than ILSR, MM, and Newton on \( \mathbf {\theta } \) .
Table 5.
Dataset | Method | Time (s) ↓ | Iter. ↓ | Top-1 Acc. ↑ | KT ↑
FAC | SR-l2 | 0.352 ± 0.044 | 4 ± 0 | 0.68 ± 0.048 | 0.35 ± 0.089
 | SR-l2-log | 0.17 ± 0.033 | 4 ± 0 | 0.691 ± 0.054 | 0.378 ± 0.11
 | ILSR (no X) | 0.066 ± 0.012 | 2 ± 0 | 0.591 ± 0.067 | −0.13 ± 0.164
 | MM (no X) | 10.7 ± 0.501 | 500 ± 0 | 0.544 ± 0.046 | 0.046 ± 0.087
 | Newton on θ (no X) | 9.152 ± 1.284 | 17 ± 3 | 0.5 ± 0.0 | 0.0 ± 0.0
 | Newton on w | 1.531 ± 0.169 | 6 ± 1 | 0.701 ± 0.04 | 0.398 ± 0.08
 | SLSQP | 22.73 ± 19.151 | 160 ± 135 | 0.689 ± 0.063 | 0.375 ± 0.125
ROP-num | SR-l2 | 1.953 ± 0.217 | 4 ± 0 | 0.896 ± 0.005 | 0.791 ± 0.009
 | SR-l2-log | 0.359 ± 0.027 | 1 ± 0 | 0.904 ± 0.005 | 0.807 ± 0.01
 | ILSR (no X) | 0.716 ± 0.058 | 2 ± 0 | 0.891 ± 0.005 | 0.781 ± 0.009
 | MM (no X) | 356.497 ± 29.11 | 500 ± 0 | 0.905 ± 0.004 | 0.81 ± 0.008
 | Newton on θ (no X) | 85.42 ± 6.849 | 9 ± 0 | 0.906 ± 0.004 | 0.811 ± 0.008
 | Newton on w | 55.718 ± 6.293 | 2 ± 0 | 0.904 ± 0.005 | 0.808 ± 0.009
 | SLSQP | 9.595 ± 7.136 | 2 ± 1 | 0.683 ± 0.049 | 0.366 ± 0.098
Pairwise Sushi | SR-l2 | 0.061 ± 0.002 | 4 ± 0 | 0.669 ± 0.034 | 0.338 ± 0.068
 | SR-l2-log | 0.764 ± 1.192 | 58 ± 30 | 0.634 ± 0.075 | 0.267 ± 0.15
 | ILSR (no X) | 0.027 ± 0.003 | 2 ± 0 | 0.763 ± 0.039 | 0.521 ± 0.084
 | MM (no X) | 5.191 ± 0.345 | 490 ± 31 | 0.773 ± 0.048 | 0.543 ± 0.094
 | Newton on θ (no X) | 2.342 ± 0.689 | 18 ± 5 | 0.735 ± 0.095 | 0.465 ± 0.185
 | Newton on w | 0.176 ± 0.17 | 2 ± 2 | 0.685 ± 0.044 | 0.369 ± 0.087
 | SLSQP | 16.198 ± 8.728 | 245 ± 134 | 0.64 ± 0.06 | 0.28 ± 0.119
Triplet Sushi | SR-l2 | 0.127 ± 0.007 | 4 ± 0 | 0.569 ± 0.035 | 0.218 ± 0.045
 | SR-l2-log | 0.804 ± 0.349 | 36 ± 18 | 0.487 ± 0.034 | 0.19 ± 0.072
 | ILSR (no X) | 0.054 ± 0.003 | 2 ± 0 | 0.678 ± 0.036 | 0.454 ± 0.06
 | MM (no X) | 15.349 ± 0.617 | 500 ± 0 | 0.715 ± 0.035 | 0.522 ± 0.059
 | Newton on θ (no X) | 5.122 ± 0.34 | 14 ± 1 | 0.73 ± 0.036 | 0.496 ± 0.089
 | Newton on w | 1.12 ± 0.659 | 3 ± 2 | 0.605 ± 0.058 | 0.285 ± 0.062
 | SLSQP | 21.738 ± 39.761 | 107 ± 197 | 0.521 ± 0.043 | 0.191 ± 0.059
    Table 5. Evaluations on Real Datasets Partitioned w.r.t. Rank Partitioning
    We report the convergence time in seconds (Time), number of iterations until convergence (Iter), top-1 accuracy on the test set (Top-1 Acc.), and Kendall-Tau correlation on the test set (KT). ILSR, MM, and Newton on \( \mathbf {\theta } \) learn the PL scores \( \mathbf {\pi } \) from the choice observations alone and do not use the features \( \mathbf {X} \) . Newton on \( {{\mathbf {w}}} \) and SLSQP regress \( \mathbf {\pi } \) from \( \mathbf {X} \) (c.f. Section 4.3).

    4.9 DNN Regression Results

Training Time vs. Prediction Performance. Figure 5(a) and (b) show the training time of DSR-KL and the siamese network vs. Top-1 Acc. and KT on the test sets of the Movehub-Cost-4, Movehub-Quality-4, and IMDB-4 datasets, partitioned with rank partitioning. For all datasets, the DSR-KL results lie in the top-left region of both the Top-1 Acc. and KT plots: DSR-KL consistently leads to much faster training and better predictions than its siamese counterpart.
Fig. 5. (a)–(b) Training time of DSR-KL and the siamese network vs. Top-1 Acc. and KT on the test sets of Movehub-Cost-4, Movehub-Quality-4, and IMDB-4, partitioned w.r.t. rank partitioning. (c) Training time and prediction performance of DSR-KL, DSR-l2, and the siamese network vs. query size ( \( K \) ) on the Movehub-Quality dataset, partitioned w.r.t. rank partitioning. The Top-1 accuracy for each \( K \) is shown next to the corresponding marker.
Columns 3–5 in Table 6 show the training time and test-set prediction performance of DSR-KL, DSR-l2, siamese network, SR-l2, and SR-KL trained on datasets partitioned with rank partitioning. DSR-KL and DSR-l2 are 1.5–142 times faster than the siamese network over all datasets. Moreover, DSR-KL consistently attains equivalent or better prediction performance than both the siamese network and DSR-l2 w.r.t. both Top-1 Acc. and KT. DSR-KL leads to particularly better performance in ranking predictions, by up to \( 6\% \) higher KT than siamese and \( 25\% \) higher KT than DSR-l2 on IMDB-4.
Table 6.
Dataset | Method | Time (s) ↓ (Rank) | Top-1 Acc. ↑ (Rank) | KT ↑ (Rank) | Time (s) ↓ (Sample) | Top-1 Acc. ↑ (Sample) | KT ↑ (Sample)
ICLR-3 | DSR-KL | 152.86 ± 29.98 | 0.9 ± 0.0 | 0.86 ± 0.0 | 145.76 ± 9.78 | 0.48 ± 0.06 | 0.28 ± 0.09
 | DSR-l2 | 165.59 ± 22.63 | 0.79 ± 0.0 | 0.5 ± 0.0 | 122.92 ± 75.43 | 0.51 ± 0.02 | 0.07 ± 0.08
 | Siamese | 1445.77 | 0.88 | 0.88 | 27.05 | 0.48 | 0.05
 | SR-KL | 529.02 ± 117.26 | 0.37 ± 0.0 | 0.02 ± 0.0 | 96.49 ± 90.02 | 0.48 ± 0.0 | 0.0 ± 0.0
 | SR-l2 | 20.59 ± 3.02 | 0.87 ± 0.0 | 0.82 ± 0.0 | 4.91 ± 4.6 | 0.47 ± 0.0 | 0.06 ± 0.0
Movehub-Cost-4 | DSR-KL | 49.54 ± 34.2 | 0.88 ± 0.07 | 0.85 ± 0.09 | 17.65 ± 7.68 | 0.61 ± 0.07 | 0.45 ± 0.12
 | DSR-l2 | 73.38 ± 32.82 | 0.72 ± 0.02 | 0.58 ± 0.03 | 19.85 ± 3.7 | 0.05 ± 0.08 | −0.3 ± 0.22
 | Siamese | 523.64 | 0.87 | 0.82 | 216.22 | 0.6 | 0.66
 | SR-KL | 29.64 ± 3.59 | 0.47 ± 0.0 | 0.27 ± 0.0 | 7.45 ± 0.3 | 0.31 ± 0.0 | 0.44 ± 0.0
 | SR-l2 | 19.7 ± 0.72 | 0.8 ± 0.0 | 0.54 ± 0.0 | 4.13 ± 0.2 | 0.5 ± 0.0 | 0.71 ± 0.0
Movehub-Quality-4 | DSR-KL | 63.05 ± 4.14 | 0.88 ± 0.0 | 0.84 ± 0.0 | 31.41 ± 5.02 | 0.88 ± 0.0 | 0.74 ± 0.05
 | DSR-l2 | 531.86 ± 212.91 | 0.82 ± 0.07 | 0.75 ± 0.08 | 34.72 ± 8.68 | 0.67 ± 0.11 | 0.81 ± 0.31
 | Siamese | 710.35 | 0.88 | 0.82 | 258.87 | 0.88 | 0.88
 | SR-KL | 98.02 ± 3.73 | 0.17 ± 0.0 | −0.1 ± 0.0 | 7.68 ± 0.2 | 0.48 ± 0.0 | 0.05 ± 0.0
 | SR-l2 | 18.98 ± 4.11 | 0.84 ± 0.0 | 0.62 ± 0.0 | 4.69 ± 0.1 | 0.51 ± 0.0 | 0.55 ± 0.0
IMDB-4 | DSR-KL | 23.93 ± 16.15 | 0.9 ± 0.0 | 0.9 ± 0.05 | 526.01 ± 11.57 | 0.73 ± 0.29 | −0.14 ± 0.25
 | DSR-l2 | 54.16 ± 13.22 | 0.78 ± 0.035 | 0.38 ± 0.07 | 27.73 ± 26.63 | 0.21 ± 0.21 | −0.02 ± 0.08
 | Siamese | 3409.03 | 0.87 | 0.78 | 1240.98 | 0.47 | 0.02
 | SR-KL | 57.93 ± 0.87 | 0.16 ± 0.0 | −0.04 ± 0.0 | 31.33 ± 13.26 | 0.04 ± 0.0 | 0.05 ± 0.0
 | SR-l2 | 9.43 ± 1.02 | 0.52 ± 0.0 | 0.08 ± 0.0 | 6.97 ± 2.71 | 0.6 ± 0.06 | 0.04 ± 0.06
ICLR-4 | DSR-KL | 84.93 ± 1.76 | 0.88 ± 0.01 | 0.62 ± 0.06 | 288.45 ± 3.84 | 0.48 ± 0.13 | 0.06 ± 0.18
 | DSR-l2 | 84.78 ± 1.21 | 0.63 ± 0.01 | 0.19 ± 0.01 | 280.72 ± 270.75 | 0.47 ± 0.02 | 0.032 ± 0.1
 | Siamese | 623.18 | 0.84 | 0.76 | 10142.55 | 0.32 | −0.07
 | SR-KL | 347.29 ± 39.6 | 0.29 ± | 0.0 ± 0.0 | 131.8 ± 120.01 | 0.45 ± 0.0 | −0.07 ± 0.0
 | SR-l2 | 503.11 ± 40.35 | 0.86 ± | 0.82 ± 0.0 | 198.49 ± 6.87 | 0.37 ± 0.03 | −0.01 ± 0.0
ROP-img | DSR-KL | 776.71 ± 136.74 | 0.89 ± 0.0 | 0.79 ± 0.0 | 25.3 ± 15.4 | 0.8 ± 0.01 | 0.6 ± 0.02
 | DSR-l2 | 694.99 ± 431.12 | 0.84 ± 0.03 | 0.68 ± 0.064 | 210.45 ± 47.16 | 0.79 ± 0.01 | 0.59 ± 0.02
 | Siamese | 1152.06 | 0.86 | 0.73 | 4438.61 | 0.82 | 0.65
 | SR-KL | 610.91 ± 20.69 | 0.5 ± 0.0 | 0.0 ± 0.0 | 527.76 ± 163.76 | 0.61 ± 0.0 | 0.23 ± 0.0
 | SR-l2 | 3.08 ± 0.59 | 0.89 ± 0.0 | 0.79 ± 0.0 | 1.96 ± 1.1 | 0.45 ± 0.0 | −0.08 ± 0.0
Movehub-Cost-5 | DSR-KL | 183.6 ± 48.2 | 0.85 ± 0.047 | 0.84 ± 0.08 | 111.13 ± 31.61 | 0.76 ± 0.29 | 0.67 ± 0.15
 | DSR-l2 | 216.5 ± 64.34 | 0.81 ± 0.04 | 0.69 ± 0.02 | 137.14 ± 65.55 | 0.19 ± 0.05 | 0.52 ± 0.34
 | Siamese | 7986.22 | 0.89 | 0.83 | 2842.12 | 0.62 | 0.74
 | SR-KL | 189.85 ± 0.72 | 0.45 ± 0.0 | 0.28 ± 0.0 | 30.51 ± 0.3 | 0.16 ± 0.0 | 0.39 ± 0.0
 | SR-l2 | 209.5 ± 1.5 | 0.78 ± 0.0 | 0.55 ± 0.0 | 30.14 ± 0.2 | 0.37 ± 0.0 | 0.73 ± 0.0
Movehub-Quality-5 | DSR-KL | 79.35 ± 2.91 | 0.85 ± 0.05 | 0.79 ± 0.08 | 114.03 ± 30.18 | 0.92 ± 0.06 | 0.35 ± 0.25
 | DSR-l2 | 91.87 ± 6.31 | 0.62 ± 0.04 | 0.5 ± 0.05 | 46.99 ± 13.23 | 0.57 ± 0.34 | 0.73 ± 0.46
 | Siamese | 2241.49 | 0.89 | 0.83 | 752.06 | 0.76 | 0.08
 | SR-KL | 924.07 ± 0.2 | 0.13 ± 0.0 | −0.1 ± 0.0 | 29.68 ± 1.2 | 0.62 ± 0.0 | 0.08 ± 0.0
 | SR-l2 | 208.1 ± 0.1 | 0.84 ± 0.0 | 0.65 ± 0.0 | 30.03 ± 1.1 | 0.39 ± 0.0 | 0.59 ± 0.0
    Table 6. Training Time vs. Top-1 Acc. and KT on Test Sets, Partitioned w.r.t. Rank and Sample Partitioning, Respectively
    We report averages and standard deviations over folds for all algorithms except for siamese. Our algorithms, as well as the algorithm that attains the best performance for each dataset are indicated in bold.
Deeper regression methods consistently outperform the predictions of shallow regression methods. In particular, our deep spectral algorithms DSR-KL and DSR-l2 lead to significantly better predictions than the shallow versions SR-l2 and SR-KL, by up to \( 38\% \) Top-1 Acc. and \( 41\% \) KT on IMDB-4. The training times of DSR-KL and DSR-l2 are also not noticeably larger than those of the shallow versions; SR-KL even converges slower than DSR-KL, by up to 12 times on Movehub-Quality-5. Unlike SR-l2, which solves a least-squares problem with a closed-form solution, the parameter update step of SR-KL (cf. (48)) requires an iterative optimization at each ADMM iteration.
Columns 6–8 in Table 6 show the training time and test set prediction performance of DSR-KL, DSR-l2, siamese network, SR-l2, and SR-KL trained on all datasets, partitioned with sample partitioning. Note that this is the setting in which regressing scores from features is essential, as training samples cannot participate in any observations in the validation or test sets. Consistent with the speed gains under rank partitioning, DSR-KL and DSR-l2 are 8–175 times faster than the siamese network over all datasets. Moreover, DSR-KL is particularly strong in maximal-choice predictions, with up to \( 26\% \) higher Top-1 Acc. than siamese on IMDB-4 and \( 56\% \) higher Top-1 Acc. than DSR-l2 on Movehub-Cost-4. Finally, the deeper regression methods DSR-KL, DSR-l2, and siamese network outperform the predictions of their shallow counterparts SR-l2 and SR-KL, by up to \( 39\% \) Top-1 Acc. and \( 11\% \) KT on Movehub-Cost-5 and ICLR-3, respectively.
Impact of \( K \). Figures 1 and 5(c) show the training time and prediction performance of DSR-KL, DSR-l2, and siamese network vs. query size \( K \) on the Movehub-Cost and Movehub-Quality datasets under rank partitioning. The speed gain of DSR-KL and DSR-l2 over siamese reaches up to 43 times as \( K \) increases, as the cost of each siamese epoch grows exponentially with \( K \). Moreover, in agreement with Table 6, DSR-KL consistently outperforms the predictions of DSR-l2.
Details on Convergence. Figure 6 shows the log-likelihood \( -\mathcal {L} \) on the training sets and Top-1 Acc. on the validation sets of the Movehub-Cost-4, Movehub-Quality-4, and IMDB-4 datasets, partitioned w.r.t. rank partitioning. Each point for DSR-KL and DSR-l2 corresponds to an overall iteration of ADMM (cf. (21)), while each point for siamese corresponds to a training epoch. DSR-KL consistently attains a higher log-likelihood and better validation performance than both the siamese network and DSR-l2, while converging faster.
Fig. 6. (a)–(c) Log-likelihood \( -\mathcal {L} \) on training and Top-1 Acc. on validation sets of Movehub-Cost-4, Movehub-Quality-4, and IMDB-4, partitioned w.r.t. rank partitioning. Each point for DSR-KL and DSR-l2 corresponds to an iteration of ADMM, while each point for siamese corresponds to a training epoch. (d) Test set predictions of DSR-KL vs. One-iter.-DSR-KL on Movehub-Cost-4 and Movehub-Quality-4.
Comparison to Naïve Approach. A naïve spectral algorithm can be constructed from a single iteration of the primal steps in (21): (i) solving (21a) to learn \( \mathbf {\pi } \) via repeated iterations of (29), and (ii) given \( \mathbf {\pi } \), solving (21b) by training the DNN regressor \( \tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w}) \) via SGD over Equation (22). Intuitively, this does not exploit the fact that samples with similar features ought to have similar scores. We denote this naïve approach by One-iter.-DSR-KL. Unlike One-iter.-DSR-KL, our algorithm DSR-KL repeatedly adapts both \( \mathbf {\pi } \) and \( \mathbf {w} \) by solving (19) via ADMM. This advantage is illustrated in Figure 6(a) and (b), where not only \( \tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w}) \), but also \( \mathbf {\pi } \), is adjusted until convergence. Moreover, Figure 6(d) shows the corresponding test set predictions of DSR-KL vs. One-iter.-DSR-KL on Movehub-Cost-4 and Movehub-Quality-4; DSR-KL outperforms One-iter.-DSR-KL by 13–21% Top-1 Acc. and 6–15% KT.
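To make the alternation concrete, here is a minimal Python sketch of the generic scaled-dual ADMM loop that DSR-KL iterates and that One-iter.-DSR-KL truncates after a single pass. It follows the standard \( \ell_2 \) template of Boyd et al. [2011] for readability (the KL variant replaces the quadratic penalty and the dual step with their Bregman counterparts); `pi_update` and `fit_regressor` are illustrative stand-ins for the spectral step (21a) and the SGD step (21b), not the authors' API.

```python
import numpy as np

def admm_ranking_regression(pi_update, fit_regressor, X, n_iters=50):
    """Sketch of the ADMM alternation behind DSR (not the paper's exact updates).

    pi_update(target): spectral pi-step given the proximal target, e.g.,
        repeated iterations of the MC update.
    fit_regressor(X, target): trains the DNN regressor pi_tilde(X; w) to
        match `target` and returns its predictions on X.
    """
    n = X.shape[0]
    pi_tilde = np.full(n, 1.0 / n)   # regressor predictions, uniform start
    y = np.zeros(n)                  # scaled dual variable
    for _ in range(n_iters):
        pi = pi_update(pi_tilde - y)         # primal step for the scores
        pi_tilde = fit_regressor(X, pi + y)  # primal step for the DNN weights
        y += pi - pi_tilde                   # dual ascent on the consensus gap
    return pi, pi_tilde

# One-iter.-DSR-KL corresponds to running the two primal steps exactly once,
# with the dual variable fixed at zero.
```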

    5 Conclusion

    We solve the MLE problem for the PL scores via ADMM. We show that the scores are equivalent to the stationary distribution of an MC and propose spectral algorithms for both shallow and DNN models. Our resulting spectral algorithms significantly outperform the traditional Newton’s method and siamese networks for ranking regression.
    Given that the number of rankings grows exponentially in query size, designing active learning algorithms to identify which rankings to solicit from labelers is an interesting open problem. Generalizing existing active learning algorithms for shallow models, e.g., Guo et al. [2018], to deeper models via our efficient algorithms is a promising direction.

    Footnotes

1. Our code for shallow model regression is publicly available at https://github.com/neu-spiral/FastAndAccurate-RankingRegression.
2. Our code for DNN regression is publicly available at https://github.com/neu-spiral/DeepSpectralRanking.

    Appendices

In the following appendices, we provide proofs for all theoretical results established in Sections 2 and 3, as well as a brief technical preliminary on finite-state homogeneous MCs.

    A Finite-state Homogeneous Markov Chains

We briefly review finite-state homogeneous MCs, as they underpin the spectral algorithms recently proposed for PL inference. All results mentioned in this appendix are classic; our main reference is the book by Gallager [2013]. A finite-state MC is a sequence of random variables \( \lbrace Z_1, Z_2, \ldots \rbrace \), each taking values in the finite state space \( \left[n\right]\equiv \lbrace 1,\ldots ,n\rbrace \), that satisfies the Markov property:
    \( \begin{align} \mathbf {P}(Z_t | Z_{1:t-1}) = \mathbf {P}(Z_t | Z_{t-1}) , \quad \text{for all} \,\, t. \end{align} \)
    (53)
A homogeneous MC additionally satisfies \( \lambda _{ij} = \mathbf {P}(Z_t=j | Z_{t-1}=i) = \mathbf {P}(Z_{t-1}=j | Z_{t-2}=i) = \cdots = \mathbf {P}(Z_2=j | Z_1=i) \). In this case, the MC can be represented by a directed graph with a node for each \( i \in \left[n\right] \), where \( \lambda _{ij} \) is the edge weight on the transition from node \( i \in \left[n\right] \) to \( j \in \left[n\right] \). The \( \lambda _{ij} \), \( i,j \in \left[n\right] \), are called the transition rates, and the corresponding matrix \( \mathbf {\Lambda }=[\lambda _{ij}]_{i,j \in \left[n\right]} \) is called the transition matrix of the MC.
If the transition matrix \( \mathbf {\Lambda } \), and hence the corresponding directed graph, is irreducible, the MC is said to be ergodic and admits a unique stationary distribution \( \mathbf {\pi }\in \mathbb {R}^n_+ \) that satisfies the so-called global balance equations:
    \( \begin{align} \sum _{j\ne i} {\pi }_j \lambda _{ji} = \sum _{j\ne i} {\pi }_i \lambda _{ij}, \quad \text{for all}~i \in \left[n\right]\!. \end{align} \)
    (54)
    \( \mathbf {\pi } \) is the unique solution to the linear system defined by these balance equations and \( \mathbf {1}^\top \mathbf {\pi }=1 \) , as it is a distribution. This follows from the Perron–Frobenius Theorem [Horn and Johnson 2012]. In practice, the stationary distribution \( \mathbf {\pi } \) can be computed by uniformizing \( \mathbf {\Lambda } \) , i.e., increasing self-transition rates until all states have the same outgoing rate, and finding the leading left eigenvector via, e.g., the power method [Lei et al. 2016]:
    \( \begin{align} \mathbf {\pi }^{l+1} = \frac{\mathbf {\Lambda }\mathbf {\pi }^l}{\parallel \mathbf {\Lambda }\mathbf {\pi }^l \parallel _2}, \quad \text{for}~ l \in \mathbb {N}. \end{align} \)
    (55)
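As an illustration, a stationary distribution can be computed exactly as described above: uniformize the rate matrix and iterate (55). The following Python sketch does this; the function name, tolerance, and stopping rule are our own illustrative choices, not taken from the paper.

```python
import numpy as np

def stationary_distribution(rates, tol=1e-10, max_iter=10_000):
    """Power-method sketch for Equation (55).

    `rates` is an (n x n) array of non-negative transition rates
    lambda_ij with a zero diagonal.
    """
    n = rates.shape[0]
    out = rates.sum(axis=1)                    # total outgoing rate per state
    theta = out.max()                          # uniformization constant
    P = rates / theta                          # off-diagonal transition probabilities
    P[np.diag_indices(n)] = 1.0 - out / theta  # self-transitions equalize outgoing rates
    pi = np.full(n, 1.0 / n)                   # start from the uniform distribution
    for _ in range(max_iter):
        nxt = pi @ P                           # one power-iteration step (left eigenvector)
        nxt /= nxt.sum()                       # renormalize to a distribution
        if np.abs(nxt - pi).sum() < tol:       # stop once the iterate stabilizes
            return nxt
        pi = nxt
    return pi
```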

    B Proof of Theorem 2.1

    We start by showing that \( \frac{\partial \mathcal {L}(\mathcal {D}\,|\,\mathbf {\pi })}{\partial {\pi }_i} = 0 \) , \( i \in \left[n\right] \) is the optimality condition to minimize Equation (15). Consider the reparametrization \( {\pi }_i = e^{\theta _i} \) , \( i \in \left[n\right] \) . Equation (15) under this reparametrization is given by
    \( \begin{align} \mathcal {L}(\mathcal {D}\,|\,\mathbf {\theta }) = \sum _{\ell =1}^n\left(\log \sum _{j \in A_\ell } e^{\theta _j} - \theta _\ell \right), \end{align} \)
    (56)
    which is convex w.r.t. \( \mathbf {\theta }= [\theta _i]_{i\in \left[n\right]} \) , i.e., even though Equation (15) is not convex w.r.t. \( \mathbf {\pi } \) , it is convex under the reparametrization \( {\pi }_i = e^{\theta _i} \) , \( i \in \left[n\right] \) . This implies that \( \frac{\partial \mathcal {L}(\mathcal {D}\,|\,\mathbf {\theta })}{\partial \theta _i} = 0 \) , \( i \in \left[n\right] \) is the optimality condition to minimize Equation (56) w.r.t. \( \mathbf {\theta } \) . By the chain rule, this condition can be written in terms of \( {\pi }_i = e^{\theta _i} \) , \( i \in \left[n\right] \) as
    \( \begin{align} \frac{\partial \mathcal {L}(\mathcal {D}\,|\,\mathbf {\theta })}{\partial \theta _i} = \frac{\partial \mathcal {L}(\mathcal {D}\,|\,\mathbf {\pi })}{\partial {\pi }_i} e^{\theta _i} = 0 \,\,\,\, \forall i \in \left[n\right]\!. \end{align} \)
    (57)
Note that \( e^{\theta _i} \gt 0 \), \( i \in \left[n\right] \). Then, \( \frac{\partial \mathcal {L}(\mathcal {D}\,|\,\mathbf {\theta })}{\partial \theta _i} = 0 \) is equivalent to \( \frac{\partial \mathcal {L}(\mathcal {D}\,|\,\mathbf {\pi })}{\partial {\pi }_i} = 0 \), \( i \in \left[n\right] \), i.e., \( {\pi }_i = e^{\theta _i} \), \( i \in \left[n\right] \), satisfies Equation (57) if and only if \( \mathbf {\theta }= [\theta _i]_{i\in \left[n\right]} \) is a minimizer of Equation (56). Hence, the stationarity condition \( \frac{\partial \mathcal {L}(\mathcal {D}\,|\,\mathbf {\pi })}{\partial {\pi }_i} = 0 \), \( i \in \left[n\right] \), is also the optimality condition for problem (3).
    The optimality condition is given explicitly by
    \( \begin{align} \begin{split} \frac{\partial \mathcal {L}}{\partial {\pi }_i} &= \sum _{\ell \in W_i} \left(\frac{1}{\sum _{t\in A_\ell }{\pi }_t} - \frac{1}{{\pi }_i}\right) + \sum _{\ell \in L_i}\frac{1}{\sum _{t\in A_\ell }{\pi }_t} = 0 \,\,\,\, \forall i \in \left[n\right]\!, \end{split} \end{align} \)
    (58)
    where \( W_i = \lbrace \ell \,| i \in A_\ell , c_\ell = i\rbrace \) is the set of observations where sample \( i \in \left[n\right] \) is chosen and \( L_i = \lbrace \ell \,| i \in A_\ell , c_\ell \ne i\rbrace \) is the set of observations where sample \( i \in \left[n\right] \) is not chosen. Multiplying both sides of Equation (58) with \( {\pi }_i \) , \( i \in \left[n\right] \) , we have
    \( \begin{align} \sum _{\ell \in L_i} \left(\frac{{\pi }_i}{\sum _{t\in A_\ell }{\pi }_t}\right) - \sum _{\ell \in W_i} \left(\frac{\sum _{j\ne i\in A_\ell }{\pi }_j}{\sum _{t\in A_\ell }{\pi }_t} \right) = 0 , \end{align} \)
    (59)
    for all \( i \in \left[n\right] \) . Note that \( \sum _{\ell \in W_i} \sum _{j\ne i\in A_\ell } \cdot = \sum _{j\ne i} \sum _{\ell \in W_i \cap L_j} \cdot \) and \( \sum _{\ell \in L_i} \cdot = \sum _{j\ne i} \sum _{\ell \in W_j \cap L_i} \cdot \) . Accordingly, we rewrite Equation (59) as
    \( \begin{align} \begin{split} &\sum _{j\ne i} \sum _{\ell \in W_j \cap L_i} \left(\frac{{\pi }_i}{\sum _{t\in A_\ell }{\pi }_t}\right) = \sum _{j\ne i} \sum _{\ell \in W_i \cap L_j} \left(\frac{{\pi }_j}{\sum _{t\in A_\ell }{\pi }_t}\right) \,\,\,\, \forall i \in \left[n\right]\!. \end{split} \end{align} \)
    (60)
    Then, an optimal solution \( \mathbf {\pi }\in \mathbb {R}^n_+ \) to Equation (3) satisfies:
    \( \begin{align} \sum _{j\ne i} {\pi }_j \lambda _{ji}(\mathbf {\pi }) = \sum _{j\ne i} {\pi }_i \lambda _{ij}(\mathbf {\pi }) \,\,\,\, \forall i \in \left[n\right]\!, \end{align} \)
    (61)
    where \( \lambda _{ji}(\mathbf {\pi }) \) , \( i,j \in \left[n\right], i\ne j \) are given by Equation (5).
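The balance equations (61) are straightforward to verify numerically. The following sketch builds the rates \( \lambda _{ij}(\mathbf {\pi }) \) exactly as they are read off from (60) and (61), i.e., the rate from \( i \) to \( j \) accumulates \( 1/\sum _{t\in A_\ell }{\pi }_t \) over observations in which \( j \) is chosen and \( i \) is among the losers, and evaluates the residual of (61); the data format and function names are illustrative assumptions.

```python
import numpy as np

def transition_rates(queries, winners, pi):
    """lambda_ij(pi): for each observation l with choice c_l = j, every
    loser i in A_l gains rate 1 / sum_{t in A_l} pi_t toward j."""
    n = len(pi)
    lam = np.zeros((n, n))
    for A, c in zip(queries, winners):
        denom = pi[list(A)].sum()
        for i in A:
            if i != c:
                lam[i, c] += 1.0 / denom
    return lam

def balance_residual(queries, winners, pi):
    """Left-hand side minus right-hand side of (61); ~0 at an ML estimate."""
    lam = transition_rates(queries, winners, pi)
    return pi @ lam - pi * lam.sum(axis=1)
```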

    C Proof of Lemma 3.1

Consider a stationary point \( \mathbf {\pi }\in \mathbb {R}^n_+ \) of the Augmented Lagrangian (20), i.e., a point satisfying
\( \begin{align} & \frac{\partial L_\rho (\mathbf {\pi }, \mathbf {w}^k, \mathbf {y}^k)}{\partial {{\pi }_i}} = 0 \Leftrightarrow \frac{\partial \mathcal {L}(\mathcal {D}|\mathbf {\pi })}{\partial {{\pi }_i}} + {y}_i^k + \rho \frac{\partial D_p(\mathbf {\pi }|| \tilde{\mathbf {\pi }}^k)}{\partial {{\pi }_i}} = 0 \,\,\, \forall i \in \left[n\right]\!. \end{align} \)
    (62a)
    Let \( \sigma _i(\mathbf {\pi }) = \rho \frac{\partial D_p(\mathbf {\pi }|| \tilde{\mathbf {\pi }}^k)}{\partial {{\pi }_i}} + {y}_i^k \) , for all \( i \in \left[n\right] \) . Then, Equation (62a) is equivalent to
    \( \begin{align} & \frac{\partial \mathcal {L}(\mathcal {D}|\mathbf {\pi })}{\partial {{\pi }_i}} + \sigma _i(\mathbf {\pi }) = 0 \,\,\, \forall i \in \left[n\right]\!. \end{align} \)
    (63)
    Partial derivatives of the negative log-likelihood \( \mathcal {L}(\mathcal {D}|\mathbf {\pi }) \) are given by
    \( \begin{align} \frac{\partial \mathcal {L}(\mathcal {D}|\mathbf {\pi })}{\partial {{\pi }_i}} &= \sum _{\ell \in W_i} \left(\frac{1}{\sum _{t\in A_\ell }{\pi }_t} - \frac{1}{{\pi }_i}\right) + \sum _{\ell \in L_i}\frac{1}{\sum _{t\in A_\ell }{\pi }_t} \,\,\, \forall i \in \left[n\right]\!, \end{align} \)
    (64)
where \( W_i = \lbrace \ell \,| i \in A_\ell , c_\ell = i\rbrace \) is the set of observations where sample \( i \in \left[n\right] \) is chosen and \( L_i = \lbrace \ell \,| i \in A_\ell , c_\ell \ne i\rbrace \) is the set of observations where sample \( i \in \left[n\right] \) is not chosen. Substituting \( \frac{\partial \mathcal {L}(\mathcal {D}|\mathbf {\pi })}{\partial {{\pi }_i}} \) from Equation (64) into Equation (63), we have
    \( \begin{align} \frac{\partial L_\rho (\mathbf {\pi }, \mathbf {w}^k, \mathbf {y}^k)}{\partial {{\pi }_i}} = \sum _{\ell \in W_i} \left(\frac{1}{\sum _{t\in A_\ell }{\pi }_t} - \frac{1}{{\pi }_i}\right) + \sum _{\ell \in L_i} \frac{1}{\sum _{t\in A_\ell }{\pi }_t} + \sigma _i(\mathbf {\pi }) = 0, \,\,\, \forall i \in \left[n\right]\!. \end{align} \)
    (65)
    Multiplying both sides of Equation (65) with \( -{\pi }_i \) , \( i \in \left[n\right] \) , we have
    \( \begin{align} \sum _{\ell \in W_i} \left(\frac{\sum _{j\ne i\in A_\ell }{\pi }_j}{\sum _{t\in A_\ell }{\pi }_t} \right) - \sum _{\ell \in L_i} \left(\frac{{\pi }_i}{\sum _{t\in A_\ell }{\pi }_t}\right) - {\pi }_i \sigma _i(\mathbf {\pi }) = 0 \,\,\, \forall i \in \left[n\right]\!. \end{align} \)
    (66)
    Note that \( \sum _{\ell \in W_i} \sum _{j\ne i\in A_\ell } \cdot = \sum _{j\ne i} \sum _{\ell \in W_i \cap L_j} \cdot \) and \( \sum _{\ell \in L_i} \cdot = \sum _{j\ne i} \sum _{\ell \in W_j \cap L_i} \cdot \) . Accordingly, we rewrite Equation (66) as
    \( \begin{align} \sum _{j\ne i} \sum _{\ell \in W_i \cap L_j} \left(\frac{{\pi }_j}{\sum _{t\in A_\ell }{\pi }_t}\right) - \sum _{j\ne i} \sum _{\ell \in W_j \cap L_i} \left(\frac{{\pi }_i}{\sum _{t\in A_\ell }{\pi }_t}\right) - {\pi }_i \sigma _i(\mathbf {\pi }) = 0 \,\,\, \forall i \in \left[n\right]\!. \end{align} \)
    (67)
    Then, the stationarity condition given by Equation (62a) is equivalent to
    \( \begin{align} \sum _{j\ne i} {\pi }_j \lambda _{ji}(\mathbf {\pi }) - \sum _{j\ne i} {\pi }_i \lambda _{ij}(\mathbf {\pi }) = {\pi }_i \sigma _i(\mathbf {\pi }) \,\,\,\, \forall i \in \left[n\right]\!, \end{align} \)
    (68)
    where \( \lambda _{ji}(\mathbf {\pi }) \) , \( i,j \in \left[n\right], i\ne j \) are given by Equation (5).

    D Proof of Theorem 3.2

    Summing Equation (26) for \( i \in \left[n\right] \) , we get
    \( \begin{align} \sum _{i} \sum _{j} \left({\pi }_j \lambda _{ji}(\mathbf {\pi }) - {\pi }_i \lambda _{ij}(\mathbf {\pi }) \right)\mathbb {1}_{j \ne i} = \sum _{i} {\pi }_i \sigma _i = 0. \end{align} \)
    (69)
    Since the PL scores are non-negative, i.e., \( {\pi }_i \ge 0 \,\, , \, i \in \left[n\right] \) , Equation (69) implies that \( \mathbf {\sigma }\equiv [\sigma _i]_{i \in \left[n\right]} \) contains both positive and negative elements. Let \( (\left[n\right]_{+}, \left[n\right]_{-}) \) be a partition of \( \left[n\right] \) such that \( \sigma _i \ge 0 \) for all \( i \in \left[n\right]_{+} \) and \( \sigma _i \lt 0 \) for all \( i \in \left[n\right]_{-} \) . Then, for \( i \in \left[n\right]_{+} \) in Equation (26), we have
    \( \begin{align} \sum _{j\ne i} {\pi }_j \lambda _{ji}(\mathbf {\pi }) = {\pi }_i \left(\sum _{j\ne i} \lambda _{ij}(\mathbf {\pi }) + \sigma _i\right), \,\, \forall i \in \left[n\right]_{+}\!, \end{align} \)
    (70)
where \( \lambda _{ij}(\mathbf {\pi }) + \sigma _i \ge 0 \), \( i \in \left[n\right]_{+} \), and \( j \in \left[n\right] \). Equation (70) shows that, compared to Equation (4), each state \( i \in \left[n\right]_{+} \) has a total of \( \sigma _i \) “additional outgoing rate” into the states in \( \left[n\right]_{-} \). At the same time, for \( i \in \left[n\right]_{-} \) in Equation (26), we have
    \( \begin{align} \begin{split} &\sum _{j\in \left[n\right]_{+}} {\pi }_j \lambda _{ji}(\mathbf {\pi }) + \sum _{j\in \left[n\right]_{-}|j\ne i} {\pi }_j \lambda _{ji}(\mathbf {\pi }) = {\pi }_i \sum _{j\ne i} \lambda _{ij}(\mathbf {\pi }) + {\pi }_i \sigma _i, \,\,\,\, \forall i \in \left[n\right]_{-}\!. \end{split} \end{align} \)
    (71)
    Since \( {\pi }_i \sigma _i \lt 0 \) , for \( i \in \left[n\right]_{-} \) , we distribute these terms into the first sum on the left-hand side. Then, Equation (71) is equivalent to
    \( \begin{align} \begin{split} &\sum _{j\in \left[n\right]_{+}} {\pi }_j \left(\lambda _{ji}(\mathbf {\pi }) - \frac{{\pi }_i \sigma _i c_j}{{\pi }_j} \right) + \sum _{j\in \left[n\right]_{-}|j\ne i} {\pi }_j \lambda _{ji}(\mathbf {\pi }) = {\pi }_i \sum _{j\ne i} \lambda _{ij}(\mathbf {\pi }), \,\,\,\, \forall i \in \left[n\right]_{-}\!, \end{split} \end{align} \)
    (72)
    where \( \sum _{j\in \left[n\right]_{+}} c_j = 1 \) .
To determine the \( \lbrace c_j\rbrace _{j \in \left[n\right]_{+}} \), recall from Equation (70) that each state \( j \in \left[n\right]_{+} \) has a total of \( \sigma _j \) additional outgoing rate into the states \( i \in \left[n\right]_{-} \). Then, Equation (72) implies that \( \sum _{i\in \left[n\right]_{-}} - \frac{{\pi }_i \sigma _i c_j}{{\pi }_j} = \sigma _j \), i.e., \( c_j = \frac{- {\pi }_j \sigma _j}{\sum _{i\in \left[n\right]_{-}} {\pi }_i \sigma _i} \), \( j \in \left[n\right]_{+} \). Using \( -\sum _{i\in \left[n\right]_{-}} {\pi }_i \sigma _i = \sum _{i\in \left[n\right]_{+}} {\pi }_i \sigma _i \) from Equation (69), we confirm that \( \sum _{j\in \left[n\right]_{+}} c_j = \frac{- \sum _{j\in \left[n\right]_{+}} {\pi }_j \sigma _j}{\sum _{i\in \left[n\right]_{-}} {\pi }_i \sigma _i} = 1 \), and rewrite \( \lbrace c_j\rbrace _{j \in \left[n\right]_{+}} \) as
    \( \begin{align} c_j = \frac{-2 {\pi }_j \sigma _j}{\sum _{i\in \left[n\right]_{-}} {\pi }_i \sigma _i - \sum _{i\in \left[n\right]_{+}}{\pi }_i \sigma _i}, \,\,\,\, \forall j \in \left[n\right]_{+}\!. \end{align} \)
    (73)
    Finally, setting \( \lbrace c_j\rbrace _{j \in \left[n\right]_{+}} \) into Equation (72), we have
    \( \begin{align} \begin{split} &\sum _{j\in \left[n\right]_{+}} {\pi }_j \left(\lambda _{ji}(\mathbf {\pi }) + \frac{2 {\pi }_i \sigma _i \sigma _j}{\sum _{t\in \left[n\right]_{-}} {\pi }_t \sigma _t - \sum _{t\in \left[n\right]_{+}} {\pi }_t \sigma _t} \right) + \sum _{j\in \left[n\right]_{-}|j\ne i} {\pi }_j \lambda _{ji}(\mathbf {\pi }) = {\pi }_i \sum _{j\ne i} \lambda _{ij}(\mathbf {\pi }), \end{split} \end{align} \)
    (74)
    where \( \lambda _{ji}(\mathbf {\pi }) + \frac{2 {\pi }_i \sigma _i \sigma _j}{\sum _{t\in \left[n\right]_{-}} {\pi }_t \sigma _t - \sum _{t\in \left[n\right]_{+}} {\pi }_t \sigma _t} \ge 0 \) , \( j \in \left[n\right]_{+} \) , and \( i \in \left[n\right]_{-} \) .
    Equation (26), partitioned as Equations (70) and (74), is the balance equation of a continuous-time MC with transition rates given by
    \( \begin{align} \mu _{ji}(\mathbf {\pi }) &= {\left\lbrace \begin{array}{ll} \lambda _{ji}(\mathbf {\pi }) + \frac{2 {\pi }_i \sigma _i \sigma _j}{\sum _{t\in \left[n\right]_{-}} {\pi }_t \sigma _t - \sum _{t\in \left[n\right]_{+}} {\pi }_t \sigma _t} \\ \quad \quad \quad \quad \text{if} \,\,\, j \in \left[n\right]_{+} \,\,\, \text{and} \,\,\, i \in \left[n\right]_{-}\\ \lambda _{ji}(\mathbf {\pi }) \quad \ \ \text{otherwise}. \end{array}\right.} \end{align} \)
    (75)
Hence, \( \mathbf {\pi } \) is the stationary distribution of this MC (cf. Section 2).

    E Proof of Theorem 3.3

    We use the following definition.
    Definition E.1 (Diagonal Dominance).
A matrix \( \mathbf {H} \) is diagonally dominant if \( |\mathbf {H}_{ii}| \ge \sum _{j \ne i} |\mathbf {H}_{ij}| \), \( i \in \left[n\right] \), i.e., for every row, the magnitude of the diagonal element is at least the sum of the magnitudes of the corresponding off-diagonal elements [Horn and Johnson 2012].
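For instance, the symmetric matrix
\( \begin{align} \mathbf {H} = \begin{pmatrix} 3 & -1 \\ -1 & 2 \end{pmatrix} \end{align} \)
is diagonally dominant ( \( 3 \ge |-1| \) and \( 2 \ge |-1| \) ) with positive diagonal elements, and its eigenvalues \( (5 \pm \sqrt {5})/2 \) are indeed both positive. This is the pattern used twice below: diagonal dominance together with positive diagonal elements implies positive definiteness.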
    For \( \mathbf {\sigma }=[\sigma _i]_{i \in \left[n\right]} \) given by (46), Equation (65) is equivalent to
    \( \begin{align} \begin{split} \frac{\partial L_\rho (\mathbf {\pi },\mathbf {w}^k, \mathbf {y}^k)}{\partial {\pi }_i} &= \sum _{\ell \in W_i} - \frac{1}{{\pi }_i} + \sum _{\ell |i\in A_\ell } \frac{1}{\sum _{t\in A_\ell }{\pi }_t} + \rho {(\mathbf {\pi }- \tilde{\mathbf {\pi }}(\mathbf {X}; \mathbf {w}^k))}_i + {y}^k_i, \end{split} \end{align} \)
    (76)
    for all \( i \in \left[n\right] \) . At the \( k \) th iteration of (21), let \( \nabla ^2 L_\rho (\mathbf {\pi },\mathbf {w}^k, \mathbf {y}^k) \) be the Hessian of the augmented Lagrangian w.r.t. \( \mathbf {\pi } \) . Differentiating Equation (76) w.r.t. \( {\pi }_j \) , \( \nabla ^2 L_\rho (\mathbf {\pi },\mathbf {w}^k, \mathbf {y}^k) \) has the following form:
    \( \begin{align} &\nabla ^2_{ij} L_\rho (\mathbf {\pi },\mathbf {w}^k, \mathbf {y}^k) = {\left\lbrace \begin{array}{ll} \sum _{\ell \in W_i} \frac{1}{{\pi }_i^2} - \sum _{\ell |i\in A_\ell } \frac{1}{{(\sum _{t\in A_\ell }{\pi }_t)}^2} + \rho , & i = j \\ - \sum _{\ell |i,j\in A_\ell } \frac{1}{{(\sum _{t\in A_\ell }{\pi }_t)}^2}, & i \ne j. \end{array}\right.} \end{align} \)
    (77)
    Consider \( \rho \ge \frac{2}{\epsilon ^2} \max _i \sum _{\ell |i\in A_\ell }\frac{1}{|A_\ell |^2} \) . By Assumption 3.1, we have
    \( \begin{align} & \rho \ge \frac{2}{\epsilon ^2} \sum _{\ell |i\in A_\ell }\frac{1}{|A_\ell |^2} \,\, \forall i \in \left[n\right]\!,\nonumber \nonumber\\ & \Leftrightarrow \,\, \rho \ge \sum _{\ell |i\in A_\ell }\frac{2}{{(\sum _{t\in A_\ell }{\pi }_t)}^2} \,\, \forall i \in \left[n\right]\!,\nonumber \nonumber\\ & \Leftrightarrow \,\, \rho + \sum _{\ell \in W_i} \frac{1}{{\pi }_i^2} \gt \sum _{\ell |i\in A_\ell }\frac{2}{{(\sum _{t\in A_\ell }{\pi }_t)}^2} \,\, \forall i \in \left[n\right]\!, \end{align} \)
    (78a)
    \( \begin{align} & \Leftrightarrow \,\, \rho + \sum _{\ell \in W_i} \frac{1}{{\pi }_i^2} \gt \sum _{\ell |i\in A_\ell }\frac{1}{{(\sum _{t\in A_\ell }{\pi }_t)}^2} + \sum _{j \ne i}\sum _{\ell |i,j\in A_\ell }\frac{1}{{(\sum _{t\in A_\ell }{\pi }_t)}^2} \,\, \forall i \in \left[n\right]\!. \end{align} \)
    (78b)
    Equation (78a) implies that all diagonal elements of \( \nabla ^2 L_\rho (\mathbf {\pi },\mathbf {w}^k, \mathbf {y}^k) \) are positive. Moreover, by Equation (78b), \( \nabla ^2 L_\rho (\mathbf {\pi },\mathbf {w}^k, \mathbf {y}^k) \) is diagonally dominant (c.f. Definition E.1). Thus, \( \nabla ^2 L_\rho (\mathbf {\pi },\mathbf {w}^k, \mathbf {y}^k) \) is positive definite [Horn and Johnson 2012], i.e., \( L_\rho (\mathbf {\pi },\mathbf {w}^k, \mathbf {y}^k) \) is convex w.r.t. \( \mathbf {\pi } \) . As a result, under Assumption 3.1, for \( \rho \ge \frac{2}{\epsilon ^2} \max _i \sum _{\ell |i\in A_\ell }\frac{1}{|A_\ell |^2} \) and \( D_p(\mathbf {\pi }||\tilde{\mathbf {\pi }}) = \frac{1}{2} \parallel \! \mathbf {\pi }- \tilde{\mathbf {\pi }}\!\parallel _2^2 \) , a stationary \( \mathbf {\pi }\gt \mathbf {0} \) satisfying condition (23) is also a minimizer of step (21a).
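In practice, this penalty threshold can be computed directly from the observations. A minimal Python sketch, assuming each query is given as the list of sample indices in \( A_\ell \) (the function name is our own):

```python
import numpy as np

def rho_lower_bound_l2(queries, n, eps):
    """Theorem 3.3's sufficient penalty for the l2 proximal case:
    rho >= (2 / eps^2) * max_i sum_{l : i in A_l} 1 / |A_l|^2."""
    s = np.zeros(n)
    for A in queries:
        w = 1.0 / len(A) ** 2  # contribution of query l to every i in A_l
        for i in A:
            s[i] += w
    return 2.0 * s.max() / eps ** 2
```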

    F Proof of Theorem 3.4

    We make use of the following lemmas.
    Lemma F.1 (Zeng et al. [2018]).
    Logarithm and polynomials are Kurdyka–Łojasiewicz functions. Moreover, sums, products, compositions, and quotients (with denominator bounded away from 0) of Kurdyka–Łojasiewicz functions are also Kurdyka–Łojasiewicz.
    Lemma F.2 (Guo et al. [2017]).
    Consider the optimization problem:
    \( \begin{equation} \begin{aligned}& \underset{\mathbf {w},\, \mathbf {\pi }}{\text{minimize}} & g(\mathbf {\pi }) \\ & \text{subject to} & \mathbf {\tilde{X}}\mathbf {w}= \mathbf {\pi }, \end{aligned} \end{equation} \)
    (79)
    and solve Equation (79) via Alternating Direction Method of Multipliers (ADMM) [Boyd et al. 2011]. Let \( \lbrace (\mathbf {\pi }^k, \mathbf {y}^k, \mathbf {w}^k)\rbrace _{k \in \mathbb {N}} \) be the sequence generated by the ADMM algorithm, and \( \rho \) be the penalty parameter of ADMM. Suppose that there exists \( \kappa \gt 0 \) such that \( \mathbf {\tilde{X}}^T \mathbf {\tilde{X}}\succeq \kappa \, \mathbf {I} \) , and the sequence \( \lbrace (\mathbf {\pi }^k, \mathbf {y}^k, \mathbf {w}^k)\rbrace _{k \in \mathbb {N}} \) is bounded.
    If there exist solutions for the minimization steps of ADMM w.r.t. both \( \mathbf {\pi } \) and \( \mathbf {w} \) , \( g(\mathbf {\pi }) \) is a continuous differentiable function with an \( L \) -Lipschitz continuous gradient at \( \mathbf {\pi }^k \) , \( k\in \mathbb {N} \) where \( L \gt 0 \) , and the augmented Lagrangian of Equation (79) is a Kurdyka–Łojasiewicz function, then, for \( \rho \gt 2L \) , \( \lbrace (\mathbf {\pi }^k, \mathbf {y}^k, \mathbf {w}^k)\rbrace _{k \in \mathbb {N}} \) converges to a point that satisfies the Karush-Kuhn-Tucker (KKT) conditions of Equation (79).
For the affine regression described in Section 3.3, there exist solutions for the minimization steps in (21): the \( \mathbf {w} \) update has the closed-form solution given by Equation (33), and the \( \mathbf {\pi } \) update admits a minimizer for large enough \( \rho \) by Theorem 3.3.
By Assumption 3.1, \( \nabla _{\mathbf {\pi }} \mathcal {L} \) given by Equation (58) exists, i.e., \( \mathcal {L} \) is continuously differentiable at each \( \mathbf {\pi }^k \), \( k\in \mathbb {N} \), generated by (21a). Let \( \nabla ^2(\mathcal {L}) \) be the Hessian of \( \mathcal {L} \). Differentiating Equation (58) w.r.t. \( {\pi }_j \), \( \nabla ^2(\mathcal {L}) \) has the following form:
    \( \begin{align} &\nabla ^2_{ij}(\mathcal {L}) = {\left\lbrace \begin{array}{ll} \sum _{\ell \in W_i} \frac{1}{{\pi }_i^2} - \sum _{\ell |i\in A_\ell } \frac{1}{{(\sum _{t\in A_\ell }{\pi }_t)}^2}, & i = j \\ - \sum _{\ell |i,j\in A_\ell } \frac{1}{{(\sum _{t\in A_\ell }{\pi }_t)}^2}, & i \ne j. \end{array}\right.} \end{align} \)
    (80)
    Consider \( L=\frac{\max _i |W_i|}{\epsilon ^2} \) , where \( W_i \) is the set of observations where sample \( i \in \left[n\right] \) is chosen. By Assumption 3.1, we have
    \( \begin{align} & L = \frac{\max _i |W_i|}{\epsilon ^2} \ge \sum _{\ell \in W_i} \frac{1}{{\pi }_i^2} \,\, \forall i \in \left[n\right]\!,\nonumber \nonumber\\ & \Leftrightarrow \,\, L - \sum _{\ell \in W_i} \frac{1}{{\pi }_i^2} + \sum _{\ell |i\in A_\ell }\frac{1}{{(\sum _{t\in A_\ell }{\pi }_t)}^2} \ge \sum _{\ell |i\in A_\ell }\frac{1}{{(\sum _{t\in A_\ell }{\pi }_t)}^2} \,\, \forall i \in \left[n\right]\!,\nonumber \nonumber\\ & \Leftrightarrow \,\, L - \sum _{\ell \in W_i} \frac{1}{{\pi }_i^2} + \sum _{\ell |i\in A_\ell }\frac{1}{{(\sum _{t\in A_\ell }{\pi }_t)}^2} \ge \sum _{j \ne i}\sum _{\ell |i,j\in A_\ell }\frac{1}{{(\sum _{t\in A_\ell }{\pi }_t)}^2}\,\, \forall i \in \left[n\right]\!. \end{align} \)
    (81)
Now, consider the matrix \( L \mathbf {I}_{n\times n} - \nabla ^2(\mathcal {L}) \). By Equation (81), \( L \mathbf {I}_{n\times n} - \nabla ^2(\mathcal {L}) \) is diagonally dominant (cf. Definition E.1) and all of its diagonal elements are positive, i.e., \( \nabla ^2(\mathcal {L}) \preceq L \mathbf {I}_{n\times n} \). Thus, the objective function of Equation (16), i.e., \( \mathcal {L} \), has an \( L \)-Lipschitz continuous gradient at \( \mathbf {\pi }^k \), \( k\in \mathbb {N} \), where \( L = \frac{\max _i |W_i|}{\epsilon ^2} \gt 0 \).
Moreover, the augmented Lagrangian given by Equation (20) is a sum of three functions: the logarithm of a ratio of two polynomials whose denominator is bounded away from 0 for all \( \mathbf {\pi }^k \), \( k\in \mathbb {N} \), by Assumption 3.1, and two other polynomial functions. By Lemma F.1, each of these three functions, and hence their sum, is Kurdyka–Łojasiewicz on the set \( \lbrace \mathbf {\pi }^k \,|\, {\pi }_i^k \gt \epsilon \,\,, i \in \left[n\right], k\in \mathbb {N}\rbrace \). As a result, the augmented Lagrangian of Equation (16) is a Kurdyka–Łojasiewicz function.
Putting it all together, by Lemma F.2, for affine regression (cf. Section 3.3) with \( \rho \gt \frac{2 \max _i |W_i|}{\epsilon ^2} \), the sequence \( \lbrace (\mathbf {\pi }^k, \mathbf {y}^k, \mathbf {w}^k)\rbrace _{k \in \mathbb {N}} \) generated by (21) converges to a point that satisfies the KKT conditions [Nocedal and Wright 2006] of Problem (16).
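As with Theorem 3.3, this convergence threshold can be read directly off the data. A minimal sketch, assuming `winners` lists the chosen sample \( c_\ell \) of each observation (names are our own):

```python
from collections import Counter

def rho_for_affine_convergence(winners, eps):
    """Theorem 3.4's condition rho > 2L, with L = max_i |W_i| / eps^2 and
    |W_i| the number of observations in which sample i is chosen."""
    max_wins = max(Counter(winners).values())
    return 2.0 * max_wins / eps ** 2
```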

    References

    [1]
    Arpit Agarwal, Prathamesh Patil, and Shivani Agarwal. 2018. Accelerated spectral ranking. In Proceedings of the 35th International Conference on Machine Learning. 70–79.
    [2]
    Shivani Agarwal. 2016. On ranking and choice models. In Proceedings of the 25th International Joint Conference on Artificial Intelligence. 4050–4053.
    [3]
    Ammar Ammar and Devavrat Shah. 2011. Ranking: Compare, don’t score. In Proceedings of the 49th Annual Allerton Conference on Communication, Control, and Computing. IEEE, 776–783.
    [4]
    Esra Ataer-Cansızoğlu. 2015. Retinal Image Analytics: A Complete Framework from Segmentation to Diagnosis. Northeastern University.
    [5]
    Jose Blanchet, Guillermo Gallego, and Vineet Goyal. 2016. A Markov chain approximation to choice modeling. Operations Research 64, 4 (2016), 886–905.
    [6]
    Blitzer. 2017. Movehub City Rankings. Retrieved October 2020 from https://www.kaggle.com/blitzr/movehub-city-rankings?select=movehubqualityoflife.csv.
    [7]
    Anna Sergeevna Bosman, Andries Engelbrecht, and Mardé Helbig. 2020. Visualising basins of attraction for the cross-entropy and the squared error neural network loss functions. Neurocomputing 400 (2020), 113–136.
    [8]
    Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning 3, 1 (2011), 1–122.
    [9]
    Stephen Boyd and Lieven Vandenberghe. 2004. Convex Optimization. Cambridge University Press.
    [10]
    Ralph Allan Bradley and Milton E. Terry. 1952. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika 39, 3/4 (1952), 324–345.
    [11]
    Mark Braverman and Elchanan Mossel. 2008. Noisy sorting without resampling. In Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 268–276.
    [12]
    Lev M. Bregman. 1967. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics 7, 3 (1967), 200–217.
    [13]
    Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. 1994. Signature verification using a “siamese” time delay neural network. In Proceedings of the International Conference on Neural Information Processing Systems. 737–744.
    [14]
    Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. 2005. Learning to rank using gradient descent. In Proceedings of the International Conference on Machine Learning. ACM, 89–96.
    [15]
    Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to rank: From pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning. 129–136.
    [16]
    Rich Caruana and Alexandru Niculescu-Mizil. 2004. Data mining in metric space: An empirical analysis of supervised learning performance criteria. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 69–78.
    [17]
    Manuela Cattelan. 2012. Models for paired comparison data: A review with emphasis on dependent data. Statistical Science 27, 3 (2012), 412–433.
    [18]
    Huiwen Chang, Fisher Yu, Jue Wang, Douglas Ashley, and Adam Finkelstein. 2016. Automatic triage for a photo series. ACM Transactions on Graphics 35, 4 (2016), 148.
    [19]
    Rick Chartrand and Brendt Wohlberg. 2013. A nonconvex ADMM algorithm for group sparsity with sparse groups. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. 6009–6013.
    [20]
    Lin Chen, Peng Zhang, and Baoxin Li. 2015. Fusing pointwise and pairwise labels for supporting user-adaptive image retrieval. In Proceedings of the 5th ACM on International Conference on Multimedia Retrieval. ACM, 67–74.
    [21]
    Yuxin Chen and Changho Suh. 2015. Spectral MLE: Top-k rank aggregation from pairwise comparisons. In Proceedings of the 32nd International Conference on International Conference on Machine Learning. 371–380.
    [22]
    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248–255.
    [23]
    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4171–4186.
    [24]
    Hazel Doughty, Dima Damen, and Walterio Mayol-Cuevas. 2018. Who’s better? Who’s best? Pairwise deep ranking for skill determination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6057–6066.
    [25]
    Abhimanyu Dubey, Nikhil Naik, Devi Parikh, Ramesh Raskar, and César A. Hidalgo. 2016. Deep learning the city: Quantifying urban perception at a global scale. In Proceedings of the European Conference on Computer Vision. Springer, 196–212.
    [26]
    Cynthia Dwork, Ravi Kumar, Moni Naor, and Dandapani Sivakumar. 2001. Rank aggregation methods for the web. In Proceedings of the 10th International Conference on World Wide Web. ACM, 613–622.
    [27]
    Otto Dykstra. 1960. Rank analysis of incomplete block designs: A method of paired comparisons employing unequal repetitions on pairs. Biometrics 16, 2 (1960), 176–188.
    [28]
    Arpad E. Elo. 1978. The Rating of Chessplayers, Past and Present. Arco Pub.
    [29]
    Michael A. Fligner and Joseph S. Verducci. 1993. Probability Models and Statistical Analyses for Ranking Data. Vol. 80. Springer.
    [30]
    Robert G. Gallager. 2013. Stochastic Processes: Theory for Applications. Cambridge University Press.
    [31]
    Pavel Golik, Patrick Doetsch, and Hermann Ney. 2013. Cross-entropy vs. squared error training: A theoretical and experimental comparison. In Proceedings of the 14th Annual Conference of the International Speech Communication Association, Vol. 13. 1756–1760.
    [32]
    Ke Guo, D. R. Han, and Ting-Ting Wu. 2017. Convergence of alternating direction method for minimizing sum of two nonconvex functions with linear constraints. International Journal of Computer Mathematics 94, 8 (2017), 1653–1669.
    [33]
    Yuan Guo, Peng Tian, Jayashree Kalpathy-Cramer, Susan Ostmo, J. Peter Campbell, Michael F. Chiang, Deniz Erdoğmuş, Jennifer G. Dy, and Stratis Ioannidis. 2018. Experimental design under the Bradley-Terry model. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. 2198–2204.
    [34]
    Bruce Hajek, Sewoong Oh, and Jiaming Xu. 2014. Minimax-optimal inference from partial rankings. In Proceedings of the 27th International Conference on Neural Information Processing Systems. 1475–1483.
    [35]
Bo Han. 2018. DATELINE: Deep Plackett-Luce model with uncertainty measurements. arXiv:1812.05877. Retrieved from https://arxiv.org/abs/1812.05877.
    [36]
    Mingyi Hong. 2018. A distributed, asynchronous, and incremental algorithm for nonconvex optimization: An ADMM approach. IEEE Transactions on Control of Network Systems 5, 3 (2018), 935–945.
    [37]
    Roger A. Horn and Charles R. Johnson. 2012. Matrix Analysis. Cambridge University Press.
    [38]
    David R. Hunter. 2004. MM algorithms for generalized Bradley-Terry models. The Annals of Statistics 32, 1 (2004), 384–406.
    [39]
    Prateek Jain and Purushottam Kar. 2017. Non-convex optimization for machine learning. Foundations and Trends® in Machine Learning 10, 3–4 (2017), 142–336.
    [40]
    Minje Jang, Sunghyun Kim, Changho Suh, and Sewoong Oh. 2017. Optimal sample complexity of m-wise data for top-k ranking. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 1686–1696.
    [41]
    Thorsten Joachims. 2002. Optimizing search engines using clickthrough data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 133–142.
    [42]
    Ian T. Jolliffe. 2003. Principal component analysis. Technometrics 45, 3 (2003), 276.
    [43]
    T. Kamishima, M. Hamasaki, and S. Akaho. 2009. A simple transfer learning method and its application to personalization in collaborative tagging. In Proceedings of the 2009 9th IEEE International Conference on Data Mining.
    [44]
    Maurice G. Kendall. 1938. A new measure of rank correlation. Biometrika 30, 1/2 (1938), 81–93.
    [45]
    Ashish Khetan and Sewoong Oh. 2016. Computational and statistical tradeoffs in learning to rank. In Proceedings of the 30th International Conference on Neural Information Processing Systems. 739–747.
    [46]
    Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations.
    [47]
    Solomon Kullback and Richard A. Leibler. 1951. On information and sufficiency. The Annals of Mathematical Statistics 22, 1 (1951), 79–86.
    [48]
    Qi Lei, Kai Zhong, and Inderjit S. Dhillon. 2016. Coordinate-wise power method. In Proceedings of the 30th International Conference on Neural Information Processing Systems. 2064–2072.
    [49]
    Orges Leka. 2016. IMDB Movies Dataset. Retrieved October 2020 from https://www.kaggle.com/orgesleka/imdbmovies.
    [50]
    Y. Li, X. Cheng, and G. Gui. 2018. Co-Robust-ADMM-Net: Joint ADMM framework and DNN for robust sparse composite regularization. IEEE Access 6 (2018), 47943–47952.
    [51]
    R. Liu, Z. Jiang, X. Fan, H. Li, and Z. Luo. 2018. Single image layer separation via deep ADMM unrolling. In Proceedings of the 2018 IEEE International Conference on Multimedia and Expo. 1–6.
    [52]
    Tyler Lu and Craig Boutilier. 2011. Learning mallows models with pairwise preferences. In Proceedings of the 28th International Conference on Machine Learning. 145–152.
    [53]
    Jiawei Ma, Xiao-Yang Liu, Zheng Shou, and Xin Yuan. 2019. Deep tensor ADMM-net for snapshot compressive imaging. In Proceedings of the IEEE International Conference on Computer Vision. 10223–10232.
    [54]
Jiaqi Ma, Xinyang Yi, Weijing Tang, Zhe Zhao, Lichan Hong, Ed Chi, and Qiaozhu Mei. 2021. Learning-to-rank with partitioned preference: Fast estimation for the Plackett-Luce model. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics. PMLR, 130:928–936.
    [55]
    Sindri Magnússon, Pradeep Chathuranga Weeraddana, Michael G. Rabbat, and Carlo Fischione. 2015. On the convergence of alternating direction lagrangian methods for nonconvex structured optimization problems. IEEE Transactions on Control of Network Systems 3, 3 (2015), 296–309.
    [56]
    Colin L. Mallows. 1957. Non-null ranking models. I. Biometrika 44, 1/2 (1957), 114–130.
    [57]
    Cheng Mao, Ashwin Pananjady, and Martin J. Wainwright. 2018a. Breaking the \( 1/\sqrt {n} \) barrier: Faster rates for permutation-based models in polynomial time. In Proceedings of the 31st Conference on Learning Theory. 2037–2042.
    [58]
    Cheng Mao, Jonathan Weed, and Philippe Rigollet. 2018b. Minimax rates and efficient algorithms for noisy sorting. In Proceedings of the Algorithmic Learning Theory. PMLR, 821–847.
    [59]
    John I. Marden. 2014. Analyzing and Modeling Rank Data. Chapman and Hall/CRC.
    [60]
    Lucas Maystre and Matthias Grossglauser. 2015. Fast and accurate inference of Plackett-Luce models. In Proceedings of the 28th International Conference on Neural Information Processing Systems. 172–180.
    [61]
Daniel McFadden. 1973. Conditional logit analysis of qualitative choice behavior. In Frontiers in Econometrics, Paul Zarembka (Ed.). 105–142.
    [62]
    Sahand Negahban, Sewoong Oh, and Devavrat Shah. 2012. Iterative ranking from pair-wise comparisons. In Proceedings of the 25th International Conference on Neural Information Processing Systems. 2474–2482.
    [63]
    Sahand Negahban, Sewoong Oh, Kiran K. Thekumparampil, and Jiaming Xu. 2018. Learning from comparisons and choices. The Journal of Machine Learning Research 19, 1 (2018), 1478–1572.
    [64]
    Jorge Nocedal and Stephen Wright. 2006. Numerical Optimization. Springer Science & Business Media.
    [65]
    Tapio Pahikkala, Evgeni Tsivtsivadze, Antti Airola, Jouni Järvinen, and Jorma Boberg. 2009. An efficient algorithm for learning to rank from preference graphs. Machine Learning 75, 1 (2009), 129–165.
    [66]
    Robin L. Plackett. 1975. The analysis of permutations. Applied Statistics 24, 2 (1975), 193–202.
    [67]
    Stephen Ragain and Johan Ugander. 2016. Pairwise choice Markov chains. In Proceedings of the 30th International Conference on Neural Information Processing Systems. 3198–3206.
    [68]
    Arun Rajkumar and Shivani Agarwal. 2014. A statistical convergence perspective of algorithms for rank aggregation from pairwise data. In Proceedings of the 31st International Conference on International Conference on Machine Learning. 118–126.
    [69]
Arun Rajkumar and Shivani Agarwal. 2016. When can we rank well from comparisons of \( O(n \log n) \) non-actively chosen pairs? In Proceedings of the 29th Annual Conference on Learning Theory. 1376–1401.
    [70]
    Garrett van Ryzin and Siddharth Mahajan. 1999. On the relationship between inventory costs and variety benefits in retail assortments. Management Science 45, 11 (1999), 1496–1509.
    [71]
    Aadirupa Saha and Arun Rajkumar. 2018. Ranking with features: Algorithm and a graph theoretic analysis. arXiv:1808.03857. Retrieved from https://arxiv.org/abs/1808.03857.
    [72]
    David Sculley. 2010. Combined regression and ranking. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 979–988.
    [73]
    Nihar Shah, Sivaraman Balakrishnan, Aditya Guntuboyina, and Martin Wainwright. 2016b. Stochastically transitive models for pairwise comparisons: Statistical and computational issues. In Proceedings of the 33rd International Conference on International Conference on Machine Learning. 11–20.
    [74]
    Nihar B. Shah, Sivaraman Balakrishnan, Joseph Bradley, Abhay Parekh, Kannan Ramchandran, and Martin J. Wainwright. 2016a. Estimation from pairwise comparisons: Sharp minimax bounds with topology dependence. The Journal of Machine Learning Research 17, 1 (2016), 2049–2095.
    [75]
    Zhan Shi, Xinhua Zhang, and Yaoliang Yu. 2017. Bregman divergence for stochastic variance reduction: Saddle-point and adversarial prediction. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 6031–6041.
    [76]
    Hossein Azari Soufiani, William Chen, David C. Parkes, and Lirong Xia. 2013. Generalized method-of-moments for rank aggregation. In Proceedings of the 26th International Conference on Neural Information Processing Systems. 2706–2714.
    [77]
    Jian Sun, Huibin Li, Zongben Xu, and Yan Yang. 2016. Deep ADMM-Net for compressive sensing MRI. In Proceedings of the 30th International Conference on Neural Information Processing Systems. 10–18.
    [78]
    Shao-Hua Sun. 2020. Crawl and Visualize ICLR 2020 OpenReview Data. Retrieved October 2020 from https://github.com/shaohua0116/ICLR2020-OpenReviewData.
    [79]
    Wei-Tse Sun, Ting-Hsuan Chao, Yin-Hsi Kuo, and Winston H. Hsu. 2017. Photo filter recommendation by category-aware aesthetic learning. IEEE Transactions on Multimedia 19, 8 (2017), 1870–1880.
    [80]
    Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.
    [81]
    Mahsa Taheri, Fang Xie, and Johannes Lederer. 2021. Statistical guarantees for regularized neural networks. Neural Networks 142 (2021), 148–161.
    [82]
    Louis L. Thurstone. 1927. The method of paired comparisons for social values. The Journal of Abnormal and Social Psychology 21, 4 (1927), 384.
    [83]
    Peng Tian, Yuan Guo, Jayashree Kalpathy-Cramer, Susan Ostmo, John Peter Campbell, Michael F. Chiang, Jennifer Dy, Deniz Erdoğmuş, and Stratis Ioannidis. 2019. A severity score for retinopathy of prematurity. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1809–1819.
    [84]
    Milan Vojnovic and Seyoung Yun. 2016. Parameter estimation for generalized Thurstone choice models. In Proceedings of the International Conference on Machine Learning. 498–506.
    [85]
    Milan Vojnovic, Se-Young Yun, and Kaifang Zhou. 2020. Convergence rates of gradient descent and MM algorithms for Bradley-Terry models. In Proceedings of the International Conference on Artificial Intelligence and Statistics. PMLR, 1254–1264.
    [86]
    Fenghui Wang, Wenfei Cao, and Zongben Xu. 2018. Convergence of multi-block Bregman ADMM for nonconvex composite problems. Science China Information Sciences 61, 12 (2018), 122101.
    [87]
    Huahua Wang and Arindam Banerjee. 2014. Bregman alternating direction method of multipliers. In Proceedings of the 27th International Conference on Neural Information Processing Systems. 2816–2824.
    [88]
    Yu Wang, Wotao Yin, and Jinshan Zeng. 2019. Global convergence of ADMM in nonconvex nonsmooth optimization. Journal of Scientific Computing 78, 1 (2019), 29–63.
    [89]
    Fabian Wauthier, Michael Jordan, and Nebojsa Jojic. 2013. Efficient ranking from pairwise comparisons. In Proceedings of the International Conference on Machine Learning. 109–117.
    [90]
    Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. 2008. Listwise approach to learning to rank: Theory and algorithm. In Proceedings of the 25th International Conference on Machine Learning. 1192–1199.
    [91]
    Y. Yang, J. Sun, H. Li, and Z. Xu. 2020. ADMM-CSNet: A deep learning approach for image compressive sensing. IEEE Transactions on Pattern Analysis and Machine Intelligence 42, 3 (2020), 521–538.
    [92]
    Shaokai Ye, Xiaoyu Feng, Tianyun Zhang, Xiaolong Ma, Sheng Lin, Zhengang Li, Kaidi Xu, Wujie Wen, Sijia Liu, Jian Tang, Makan Fardad, Xue Lin, Yongpan Liu, and Yanzhi Wang. 2019. Progressive DNN compression: A key to achieve ultra high weight pruning and quantization rates using ADMM. arXiv preprint arXiv:1903.09769.
    [93]
Shaokai Ye, Tianyun Zhang, Kaiqi Zhang, Jiayu Li, Jiaming Xie, Yun Liang, Sijia Liu, Xue Lin, and Yanzhi Wang. 2019. A unified framework of DNN weight pruning and weight clustering/quantization using ADMM. arXiv preprint arXiv:1811.01907.
    [94]
    İlkay Yıldız, Jennifer Dy, Deniz Erdoğmuş, Jayashree Kalpathy-Cramer, Susan Ostmo, J. Peter Campbell, Michael F. Chiang, and Stratis Ioannidis. 2020. Fast and accurate ranking regression. In Proceedings of the International Conference on Artificial Intelligence and Statistics.
    [95]
    Ilkay Yıldız, Jennifer Dy, Deniz Erdoğmuş, Susan Ostmo, J. Peter Campbell, Michael F. Chiang, and Stratis Ioannidis. 2021. Deep spectral ranking. In Proceedings of the International Conference on Artificial Intelligence and Statistics. PMLR, 361–369.
    [96]
    İlkay Yıldız, Peng Tian, Jennifer Dy, Deniz Erdoğmuş, James Brown, Jayashree Kalpathy-Cramer, Susan Ostmo, J. Peter Campbell, Michael F. Chiang, and Stratis Ioannidis. 2019. Classification and comparison via neural networks. Neural Networks 118 (2019), 65–80.
    [97]
    Yue Yu and Behçet Açıkmeşe. 2019. Stochastic Bregman parallel direction method of multipliers for distributed optimization. In Proceedings of the 58th Conference on Decision and Control. IEEE, 5550–5555.
    [98]
    Yue Yu, Behçet Açıkmeşe, and Mehran Mesbahi. 2018. Bregman parallel direction method of multipliers for distributed optimization via mirror averaging. IEEE Control Systems Letters 2, 2 (2018), 302–306.
    [99]
    Jinshan Zeng, Shikang Ouyang, Tim Tsz-Kit Lau, Shaobo Lin, and Yuan Yao. 2018. Global convergence in deep learning with variable splitting via the Kurdyka-Lojasiewicz property. arXiv:1803.00225. Retrieved from https://arxiv.org/abs/1803.00225.
    [100]
    Haimeng Zhao and Peiyuan Liao. 2019. CAE-ADMM: Implicit Bitrate Optimization via ADMM-based Pruning in Compressive Autoencoders. arXiv preprint arXiv:1901.07196.
    [101]
Pu Zhao, Sijia Liu, Yanzhi Wang, and Xue Lin. 2018. An ADMM-based universal framework for adversarial attacks on deep neural networks. In Proceedings of the 26th ACM International Conference on Multimedia. ACM, New York, NY, 1065–1073.
    [102]
    Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision. 19–27.
