Abstract
We study the problem of clustering a set of items from binary user feedback. Such a problem arises in crowdsourcing platforms solving large-scale labeling tasks with minimal effort put on the users. For example, in some of the recent reCAPTCHA systems, users' clicks (binary answers) can be used to efficiently label images. In our inference problem, items are grouped into initially unknown non-overlapping clusters. To recover these clusters, the learner sequentially presents to users a finite list of items together with a question with a binary answer selected from a fixed finite set. For each of these items, the user provides a noisy answer whose expectation is determined by the item cluster, the question, and an item-specific parameter characterizing the hardness of classifying the item. The objective is to devise an algorithm with a minimal cluster recovery error rate. We derive problem-specific information-theoretical lower bounds on the error rate satisfied by any algorithm, for both uniform and adaptive (list, question) selection strategies. For uniform selection, we present a simple algorithm built upon the K-means algorithm whose performance almost matches the fundamental limits. For adaptive selection, we develop an adaptive algorithm that is inspired by the derivation of the information-theoretical error lower bounds, and in turn allocates the budget in an efficient way. The algorithm learns to select items that are hard to cluster and relevant questions more often. We numerically compare the performance of our algorithms with and without the adaptive selection strategy, and illustrate the gain achieved by being adaptive.
1 Introduction
Modern Machine Learning (ML) models require a massive amount of labeled data to be efficiently trained. Humans have so far been the main source of labeled data. This data collection is often tedious and very costly. Fortunately, most of the data can be simply labeled by non-experts. This observation is at the core of many crowdsourcing platforms such as reCAPTCHA, where users receive low or no payment. In these platforms, complex labeling problems are decomposed into simpler tasks, typically questions with binary answers. In reCAPTCHAs, for example, the user is asked to click on images (presented in batches) that contain a particular object (a car, a road sign), and the system leverages users’ answers to label images. As another motivating example, consider the task of classifying bird images. Users may be asked to answer binary questions like: “Is the bird grey?”, “Does it have a circular tail fin?”, “Does it have a pattern on its cheeks?”, etc. Correct answers to those questions, if well-processed, may lead to an accurate bird classification and image labels. In both aforementioned examples, some images may be harder to label than others, e.g., due to the photographic environment, the birds’ posture, etc. Some questions may be harder to answer than others, leading to a higher error rate. To build a reliable system, tasks/questions have to be carefully designed and selected, and user responses need to be smartly processed. Efficient systems must also learn the difficulty of the different tasks, and estimate how informative they are when solving the complex labeling problem.
This paper investigates the design of such systems, tackling clustering problems that have to be solved using answers to binary questions. We incorporate a model that takes into consideration the varying difficulty levels or heterogeneity of clustering each item. We propose a full analysis of the problem, including information-theoretical limits that hold for any algorithm and novel algorithms with provable performance guarantees. Before giving a precise statement of our results, we describe the problem setting and the statistical model dictating the way users answer. This model is inspired by models such as the Dawid–Skene model (Dawid & Skene, 1979), successfully used in the crowdsourcing literature; see, e.g., Khetan and Oh (2016) and references therein. However, to the best of our knowledge, this paper is the first to model and analyze clustering problems with binary feedback while accounting for item heterogeneity.
1.1 Problem setting and feedback model
Consider a large set \(\mathcal {I}\) of n items (e.g. images) partitioned into K disjoint unknown clusters \(\mathcal {I}_1,\ldots , \mathcal {I}_K\). Denote by \(\sigma (i)\) the cluster of item i. To recover these hidden clusters, the learner gathers binary user feedback sequentially. Upon arrival, a user is presented a list of \(w\ge 1\) items together with a question with a binary answer. The question is selected from a predefined finite set of cardinality L. The process of selecting the (list, question) pair for a given user can be carried out in either a nonadaptive or adaptive manner (in the latter case, the pair would depend on user feedback previously collected). Importantly, our model captures item heterogeneity: the difficulty of clustering items varies across items. We wish to devise algorithms recovering clusters as accurately as possible using the noisy binary answers collected from T users.
We use the following statistical model parametrized by a matrix \(\varvec{p}:=(p_{k\ell })_{k \in [K], \ell \in [L]}\)Footnote 1 with entries in [0, 1] and by a vector \(\varvec{h}:=(h_i)_{i \in \mathcal {I}}\in [1/2,1]^n\). These parameters are (initially) unknown. When the t-th user is asked a question \(\ell _t =\ell \in [L]\) for a set \(\mathcal {W}_t\) of \(w \ge 1\) items, she provides noisy answers: for each item \(i\in \mathcal {I}_k\) in the list \(\mathcal {W}_t\), her answer \(X_{i \ell t}\) is \(+1\) with probability \(q_{i\ell }:= h_i p_{k \ell }+{\bar{h}}_i{\bar{p}}_{k \ell }\), and \(-1\) with probability \({\bar{q}}_{i \ell }\), where for brevity, \({\bar{x}}\) denotes \(1-x\) for any \(x \in [0, 1]\). Answers are independent across items and users. Our model is simple, but general enough to include, as special cases, crowdsourcing models recently investigated in the literature. For example, the model in Khetan and Oh (2016) corresponds to our model with only one question (\(L=1\)), two clusters (\(K=2\)), and a question asked for a single item at a time (\(w=1\)). Note that in our model, answers are collected from a very large set of users, and a given user is very unlikely to interact with the system several times. This justifies the fact that answers provided by the various users are statistically identical.
Item hardness. An important aspect of our model stems from the item-specific parameter \(h_i\). It can be interpreted as the hardness of clustering item i, whereas \(p_{k \ell }\) corresponds to a latent parameter related to question \(\ell\) when asked for an item in cluster k. Note that when \(h_i=1/2\), \(q_{i \ell } = {1/2}\) irrespective of the cluster of item i. Hence any question \(\ell\) on item i receives completely random responses, and this item cannot be clustered. Further note that, intuitively, the larger the hardness parameter \(h_i\) of item i, the easier the item is to cluster. Indeed, when asking question \(\ell\), we can easily distinguish whether item i belongs to cluster k or \(k'\) if the difference between the corresponding answer probabilities \(h_i p_{k \ell }+{\bar{h}}_i{\bar{p}}_{k \ell }\) and \(h_i p_{k' \ell }+{\bar{h}}_i{\bar{p}}_{k' \ell }\) is large. This difference is \(| p_{k \ell }-p_{k' \ell }|(2h_i-1)\), an increasing function of \(h_i\). We believe that introducing item heterogeneity is critical to obtain a realistic model (without \(\varvec{h}\), all items from the same cluster would be exchangeable), but complicates the analysis. Most theoretical results on clustering or community detection do not account for this heterogeneity; see Sect. 2 for details.
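As a sanity check, the answer model and the separation identity above can be written out in a few lines of Python (the function names and parameter values below are illustrative, not taken from the paper):

```python
import random

def answer_prob(h_i, p_kl):
    """q_{il} = h_i * p_{kl} + (1 - h_i) * (1 - p_{kl})."""
    return h_i * p_kl + (1 - h_i) * (1 - p_kl)

def sample_answer(h_i, p_kl, rng=random):
    """User answer X in {+1, -1}: +1 with probability q_{il}."""
    return 1 if rng.random() < answer_prob(h_i, p_kl) else -1

# The separation between clusters k and k' under question l is
# |p_{kl} - p_{k'l}| * (2 h_i - 1), increasing in h_i:
h, p_k, p_kp = 0.9, 0.8, 0.3
gap = answer_prob(h, p_k) - answer_prob(h, p_kp)
assert abs(gap - (p_k - p_kp) * (2 * h - 1)) < 1e-12
```

In particular, for \(h_i = 1/2\) the function returns 1/2 regardless of \(p_{k\ell}\), matching the observation that such an item cannot be clustered.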
Illustrative Example. We introduce an example to illustrate the structure and characteristics of our model.
Example 1
Consider the task of classifying images into two types of birds: Mallards and Canadian Geese. Mallards (see Fig. 1a for an image), a type of duck, and Canadian Geese (see Fig. 1b for an image), which are not classified as ducks, present a unique classification challenge. In this case, \(L=1\), and the question posed to the users is: “Is the bird in the image a duck?”. We assign cluster 1 to the Mallard images and cluster 2 to the Canadian Goose images. Assume that \(p_{11}=0.8\) and \(p_{21}=0.3\): these are the latent probabilities of answering yes to the question given an image of a Mallard or a Canadian Goose, respectively.
These latent probabilities also account for the scenario where a user, randomly selected from a large set, may not answer a question correctly due to a lack of knowledge or other reasons. For each image i, \(h_i\) indicates the difficulty of classification. For instance, when image i shows a Mallard (a type of duck) and is clear, classification is relatively easy, and we set \(h_i=1\). Consequently, the probability of correctly identifying the Mallard is \(q_{i1} = h_ip_{11} + {\bar{h}}_i{\bar{p}}_{11} = 1\cdot 0.8 + 0\cdot 0.2=0.8\). However, when another image j of a Mallard is blurred due to poor lighting or other factors, and the classification difficulty is \(h_j=0.8\), the probability of correct classification decreases to \(q_{j1} = h_jp_{11} + {\bar{h}}_j{\bar{p}}_{11} = 0.8\cdot 0.8 + 0.2\cdot 0.2=0.68\). As a result, the feedback obtained for image j is more ambiguous than that for image i, due to the increased difficulty in classification.
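The arithmetic of Example 1 can be checked directly (a minimal sketch; `answer_prob` is a hypothetical helper name, not from the paper):

```python
def answer_prob(h, p):
    # q = h * p + (1 - h) * (1 - p), as in the model
    return h * p + (1 - h) * (1 - p)

p11 = 0.8                          # Mallard cluster, question "Is it a duck?"
q_clear = answer_prob(1.0, p11)    # clear image i, h_i = 1
q_blurred = answer_prob(0.8, p11)  # blurred image j, h_j = 0.8
assert abs(q_clear - 0.80) < 1e-12
assert abs(q_blurred - 0.68) < 1e-12
```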
Assumption
We make the following mild assumptions on our statistical model \(\mathcal {M}:= (\varvec{p}, \varvec{h})\). Define for each \(k \in [K]\), \(\varvec{r}_k:= (r_{k \ell })_{\ell \in [L]}\) with \({r}_{k\ell }:= {2p_{k\ell } - 1}\). Throughout the paper, \(\Vert \cdot \Vert\) denotes the \(\ell _\infty\)-norm, i.e., \(\Vert \varvec{x}\Vert = \max _i{|x_i|}\).
Assumption (A1) excludes the cases where clustering is impossible even if all parameters were accurately estimated. Indeed, when \(h_*= 0\), there exists at least one item i which receives completely random responses to any question, i.e., \(q_{i\ell } = 1/2\) for any \(\ell \in [L]\). Observe that when \(\rho _*= 0\), there exist \(k \ne k'\) and \(c \ge 0\) such that \(2{p}_{k \ell } -1 = c (2 {p}_{k' \ell } -1)\) for all \(\ell \in [L]\). Then, for item \(i \in \mathcal {I}_k\), we can find \(h' \in [1/2,1]\) such that \(2q_{i\ell } -1 = (2h' -1)(2p_{k'\ell } - 1)\). Items in different clusters k and \(k'\) can thus have the same value of \(q_{i \ell }\); as a consequence, from the answers, we cannot determine whether i is in cluster k or \(k'\). In Example 1, \(r_{11} = 0.6\), \(r_{21}= -0.4\), and the value of \(\rho _*\) is \(\rho _* = |0 \cdot 0.6 + 0.4 |= 0.4\). Assumption (A2) states some homogeneity among the parameters of the clusters. It implies that \(q_{i \ell } \in [\eta , 1- \eta ]\) for all \(i \in \mathcal {I}\) and \(\ell \in [L]\). Let \(\Omega\) be the set of all models satisfying (A1) and (A2).
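The computation of \(\rho_*\) for Example 1 can be reproduced numerically. The displayed definition of (A1) is not reproduced in this version; the sketch below reads the text as saying that \(\rho_*\) measures how far \(\varvec{r}_{k'}\) is, in sup-norm, from the ray \(\{c\,\varvec{r}_k : c \ge 0\}\), which matches the value 0.4 given above:

```python
# Brute-force grid search over c >= 0; function name and grid are illustrative.
def ray_distance(r_k, r_kp, c_max=10.0, steps=100001):
    """min over c in [0, c_max] of the sup-norm of c * r_k - r_kp."""
    best = float("inf")
    for t in range(steps):
        c = c_max * t / (steps - 1)
        best = min(best, max(abs(c * a - b) for a, b in zip(r_k, r_kp)))
    return best

# Example 1: r_11 = 0.6, r_21 = -0.4, and |0.6 c + 0.4| is minimized at c = 0:
assert abs(ray_distance([0.6], [-0.4]) - 0.4) < 1e-9
```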
For convenience, we provide a table summarizing all the notations in Appendix A.
1.2 Main contributions
We study both nonadaptive and adaptive sequential (list, question) selection strategies. In the case of a nonadaptive strategy, we assume that the selection of (list, question) pairs is uniform, in the sense that the number of times a given question is asked for a given item is (roughly) \(\lfloor Tw/(nL)\rfloor\). The objective is to devise a clustering algorithm taking as input the data collected over T users and returning estimated clusters that are as accurate as possible. When using adaptive strategies, the objective is to devise an algorithm that sequentially selects the (list, question) pairs presented to users, and that, after having collected answers from T users, returns accurate estimated clusters.
Our contributions are as follows. We first derive information-theoretical performance limits satisfied by any algorithm under uniform or adaptive sequential (list, question) selection strategy. We then propose a clustering algorithm that matches our limits order-wise in the case of uniform (list, question) selection. We further present a joint adaptive (list, question) selection strategy and clustering algorithm, and illustrate, using numerical experiments on both synthetic and real data, the advantage of being adaptive.
Fundamental limits. We provide a precise statement of our lower bounds on the cluster recovery error rate. These bounds are problem specific, i.e., they depend explicitly on the model \(\mathcal {M} = (\varvec{p}, \varvec{h})\), and they will guide us in the design of algorithms.
(Uniform selection) In this case, we derive a clustering error lower bound for each individual item. Let \(\pi\) denote a clustering algorithm, and define the clustering error rate of item \(i \in \mathcal {I}\) as \(\varepsilon ^\pi _i (n, T):=\mathbb {P}[i \in \mathcal {E}^\pi ]\), where \(\mathcal {E}^\pi\) denotes the set of mis-classified items under \(\pi\). The latter set is defined as \(\mathcal {E}^\pi := \cup _{k\in [K]} \mathcal {I}_k{\setminus } \mathcal {S}_{\gamma (k)}^\pi\), where \((\mathcal {S}_{1}^\pi ,\ldots , \mathcal {S}_{K}^\pi )\) denotes the output of \(\pi\) and \(\gamma\) is a permutation of [K] minimizing the cardinality of \(\cup _{k\in [K]} \mathcal {I}_k{\setminus } \mathcal {S}_{\zeta (k)}^\pi\) over all possible permutations \(\zeta\) of [K]. When deriving problem-specific error lower bounds, we restrict our attention to so-called uniformly good algorithms. An algorithm \(\pi\) is uniformly good if for all \(\mathcal {M} \in \Omega\) and \(i \in \mathcal {I}\), \(\varepsilon ^\pi _i (n, T)= o(1)\) as \(T \rightarrow \infty\) under \(T= \omega (n)\). We establish that for any \(\mathcal {M}\in \Omega\) satisfying (A1) and (A2), under any uniformly good clustering algorithm \(\pi\), as T grows large under \(T=\omega (n)\), for any item i, we have:
In the above definition of the divergence \(\mathcal {D}_{\mathcal {M}}^U(i)\), \({{\,\mathrm{\text {KL}}\,}}(a, b)\) is the Kullback–Leibler divergence between two Bernoulli distributions of means a and b (\({{\,\mathrm{\text {KL}}\,}}(a, b):= a \log \frac{a}{b}+{\bar{a}} \log \frac{{\bar{a}}}{{\bar{b}}}\)). Note that uniformly good algorithms actually exist (see Algorithm 1 presented in Sect. 4).
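For reference, the Bernoulli KL divergence used throughout can be implemented directly (a small helper, not part of the paper's algorithms):

```python
from math import log

def kl_bernoulli(a, b):
    """KL(a, b) = a log(a/b) + (1 - a) log((1-a)/(1-b)), for a, b in (0, 1)."""
    return a * log(a / b) + (1 - a) * log((1 - a) / (1 - b))

assert kl_bernoulli(0.5, 0.5) == 0.0   # identical distributions
assert kl_bernoulli(0.8, 0.3) > 0.0    # strictly positive otherwise
```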
(Adaptive selection) We also provide clustering error lower bounds in the case the algorithm is also sequentially selecting (list, question) pairs in an adaptive manner. Note that here a lower bound cannot be derived for each item individually, say item i, since an adaptive algorithm could well select this given item often so as to get no error when returning its cluster. Instead we provide a lower bound for the cluster recovery error rate averaged over all items, i.e., \(\varepsilon ^\pi (n,T):={ \frac{1}{n} }\sum _{i\in \mathcal {I}}\varepsilon ^\pi _i(n, T)\). Under any uniformly good joint (list, question) selection and clustering algorithm \(\pi\), as T grows large under \(T = \omega (n)\), we have:
In the above lower bound, the vector \(\varvec{y}\) encodes the expected numbers of times the various questions are asked for each item. Specifically, as shown later, \(y_{i \ell }\frac{Tw}{n}\) can be interpreted as the expected number of times question \(\ell\) is asked for item i. Maximizing over \(\varvec{y}\) in (4) hence corresponds to an optimal (list, question) selection strategy, and to the minimal error rate. Further interpretations and discussions of the divergences \(\mathcal {D}_{\mathcal {M}}^U(i)\) and \({\mathcal {D}}_{\mathcal {M}}^A(i, \varvec{y})\) are provided later in the paper.
Algorithms. We develop algorithms with both uniform and adaptive (list, question) selection strategies.
(Uniform selection) In this case, for each item i and based on the collected answers, we build a normalized vector (of dimension L) that concentrates (when T is large) around a vector depending on the cluster id \(\sigma (i)\) only. Our algorithm applies a K-means algorithm to these vectors (with an appropriate initialization) to reconstruct the clusters. We are able to establish that the algorithm almost matches our fundamental limits. More precisely, when \({T} = \omega \left( n\right)\) and \(T = o (n^2)\), under our algorithm, we have, for some absolute constant \(C>0\),
The above error rate has an optimal scaling in T, w, L, n. By deriving upper and lower bounds on \(\mathcal {D}_{\mathcal {M}}^U(i)\), we further show that the scaling is also optimal in \((2h_i-1)^2\) and almost optimal in \(\rho _*\) (see Assumption (A1)).
(Adaptive selection) The design of our adaptive algorithm is inspired by the information-theoretical lower bounds. The algorithm periodically updates estimates of the model parameters, and of the clusters. Based on these estimates, we further estimate lower bounds on the probabilities of misclassifying each item. The items we select are those with the highest lower bounds (the items that are most likely to be misclassified); we further select the question that would be the most informative about these items. We believe that our algorithm should approach the minimal possible error rate (since it follows the optimal (list, question) selection strategy). Our numerical experiments suggest that the adaptive algorithm significantly outperforms algorithms with a uniform (list, question) selection strategy, especially when items have very heterogeneous hardnesses.
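The item-selection rule can be caricatured as follows. This is a hypothetical sketch, not the actual algorithm: `D_hat` stands for per-item estimated divergences and `counts` for the numbers of answers already collected, so that `exp(-counts[i] * D_hat[i])` mimics the estimated misclassification lower bound of item i:

```python
def select_items(D_hat, counts, w):
    """Pick the w items whose estimated misclassification lower bound
    exp(-counts[i] * D_hat[i]) is largest, i.e., the items with the
    smallest accumulated 'information' counts[i] * D_hat[i]."""
    scores = [c * d for c, d in zip(counts, D_hat)]
    return sorted(range(len(scores)), key=scores.__getitem__)[:w]

# Item 2 is hard (small divergence) and under-sampled, so it is chosen first:
assert select_items(D_hat=[0.4, 0.6, 0.1], counts=[10, 8, 5], w=2) == [2, 0]
```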
2 Related work
To our knowledge, the model proposed in this paper has been neither introduced nor analyzed in previous work. The problem has similarities with crowdsourced classification problems, which enjoy a very rich literature (Dawid & Skene, 1979; Raykar et al., 2010; Karger et al., 2011; Zhou et al., 2012; Ho et al., 2013; Long et al., 2013; Zhang et al., 2014; Gao et al., 2016; Ok et al., 2016) (the Dawid–Skene model and its various extensions without clustered structure), and Vinayak and Hassibi (2016) and Gomes et al. (2011) (clustering without item heterogeneity). However, our model has clear differences. For instance, if we draw a parallel between our model and that considered in Khetan and Oh (2016), tasks there correspond to our items, and there are only two clusters of tasks. More importantly, the statistics of the answers for a particular task do not depend on the true cluster of the task, since the ground truth is defined by the majority of answers given by the various users. Our results also differ from those in the crowdsourcing literature from a methodological perspective. In this literature, fundamental limits are rarely investigated, and when they are, they are either derived in the minimax sense by postulating the worst parameter setting (e.g., Zhang et al. 2014; Khetan and Oh 2016; Gao et al. 2016), or problem-specific but without quantification of the error rate (e.g., Ok et al. 2016). Here we derive more precise problem-specific lower bounds on the error rate, i.e., we provide minimum clustering error rates given the model parameters \((\varvec{p}, \varvec{h})\). Further note that most of the classification tasks studied in the literature are simple (they can be solved using a single binary question).
Our problem also resembles cluster inference problems in the celebrated Stochastic Block Model (SBM), see (Abbe, 2018) for a recent survey. Plain SBM models, however, assume that the statistics of observations for items in the same cluster are identical (there are no items harder to cluster than others, which corresponds to \(h_i = 1, \forall i \in \mathcal {I}\) in our model), and observations are typically not collected in an adaptive manner. The closest work to ours in the context of the SBM is the analysis of the so-called Degree-Corrected SBM, where each node is associated with an average degree quantifying the number of observations obtained for this node. The average degree then plays the role of our hardness parameter \(h_i\) for item i. In Gao et al. (2018), the authors study the Degree-Corrected SBM, but deal with minimax performance guarantees only, and non-adaptive sampling strategies.
3 Information-theoretical limits
3.1 Uniform selection strategy
Recall that an algorithm \(\pi\) is uniformly good if for all \(\mathcal {M} \in \Omega\) and \(i \in \mathcal {I}\), \(\varepsilon ^\pi _i (n, T) = o(1)\) as \(T \rightarrow \infty\) under \(T=\omega (n)\). Assumptions (A1) and (A2) ensure the existence of uniformly good algorithms. The algorithm we present in Sect. 4 is uniformly good under these assumptions. The following theorem provides a lower bound on the error rate of uniformly good algorithms.
Theorem 1
If an algorithm \(\pi\) with uniform selection strategy is uniformly good, then for any \(\mathcal {M}\in \Omega\) satisfying (A1) and (A2), under \(T=\omega (n)\), the following holds:
The proof of Theorem 1 will be presented later in this section. Theorem 1 implies that the global error rate of any uniformly good algorithm satisfies:
Divergence \(\mathcal {D}_{\mathcal {M}}^U(i)\) and its properties. The divergence \(\mathcal {D}_{\mathcal {M}}^U(i)\), defined in Sect. 1, quantifies the hardness of classifying item i. This divergence naturally appears in the change-of-measure argument used to establish Theorem 1. To get a better understanding of \(\mathcal {D}_{\mathcal {M}}^U(i)\), and in particular to assess its dependence on the various system parameters, we provide the following useful upper and lower bounds, proved in Appendix B:
Proposition 1
Fix \(i\in \mathcal {I}\). Let \(k'\) be such that:
Then, we have:
Note that \(\mathcal {D}^U_{\mathcal {M}}(i)\) vanishes as \(h_i\) goes to 1/2, which makes sense since for \(h_i\approx 1/2\), item i is very hard to cluster. We also have \(\mathcal {D}_{\mathcal {M}}^U(i) = 0\) when \(\min _{{\frac{h_*}{2 h_i - 1} \le c \le \frac{1}{2 h_i -1}}}\Vert c\varvec{r}_{k'} - \varvec{r}_{\sigma (i)}\Vert _2^2 = 0\). In this case, \(\rho _* = 0\) and there exists \(h' \in [(1+h_*)/2, 1]\) such that for some \(k' \ne \sigma (i)\), \(2q_{i\ell } -1 = (2\,h'-1) (2{p}_{k'\ell } -1)\) for all \(\ell \in [L]\), so that clustering item i is impossible.
Application to the simpler model of Khetan and Oh (2016). Consider a model with a single question and two clusters of items. From Theorem 1, we can recover an asymptotic version of Theorem 2.4. in Khetan and Oh (2016).
Corollary 1
Let \(L=1, K=2\), \(\varvec{p} = (p_{11}, p_{21})\), and \(w=1\). If an algorithm \(\pi\) with uniform selection strategy is uniformly good, whenever \(\mathcal {M}\) satisfies (A1) and (A2), under \(T = \omega (n)\), we have:
where \(C>0\) is an absolute constant.
The proof of Corollary 1 is presented in Appendix D. Corollary 1 implies,
as \(T \rightarrow \infty\) under \(T = \omega (n)\). Smaller \(h_i\) and \({|p_{11} - p_{21}|}\) imply that item i is harder to classify. Note that Theorem 2.4. in Khetan and Oh (2016) (which corresponds to \({p_{21} = 1- p_{11}}\) in our Corollary 1) provides a minimax lower bound, whereas our result is problem-specific and hence more precise. Note that Corollary 1 also applies directly to Example 1 mentioned in the Introduction. The lower bound on the error probability for each item i scales as \(\exp ( -c \frac{T}{n}(2 h_i -1)^2 )\) for some constant \(c>0\).
Proof of Theorem 1
The proof leverages change-of-measure arguments, as those used in the classical multi-armed bandit problem (Lai & Robbins, 1985) or the Stochastic Block Model (Yun & Proutiere, 2016). Here, however, the proof is complicated by the fact that we seek a lower bound on the error rate for clustering each individual item.
Let \(\pi\) denote a uniformly good algorithm with uniform selection strategy, and let \(\mathcal {M}\in \Omega\) be a model satisfying Assumptions (A1) and (A2). In our change-of-measure argument, we denote by \(\mathcal {M}\) the original model and by \(\mathcal {N}\) a perturbed model. Fix \(i \in \mathcal {I}\) with \(\sigma (i) = k\). Let \(k' \in [K], h' \in [(h_*+1) /2, 1]\) denote the minimizers of the optimization problem leading to \(\mathcal {D}_{\mathcal {M}}^U(i)\), i.e.,
For these choices of \(i, k',\) and \(h'\), we construct the perturbed model \(\mathcal {N}\) as follows. Under \(\mathcal {N}\), all responses for items different than i are generated as under \(\mathcal {M}\). The responses for i under \(\mathcal {N}\) are generated as if i was in cluster \(k'\) and had difficulty \(h'\). We can write the log-likelihood ratio of the observation under \(\mathcal {N}\) to that under \(\mathcal {M}\) as follows:
where we let \(\varvec{q}':= (q'_\ell )_{\ell \in [L]}\) with \(q'_\ell = h'p_{k'\ell } + {\bar{h}}'{\bar{p}}_{k'\ell }\).
Let \(\mathbb {P}_{\mathcal {N}}\) and \({{\,\mathrm{\mathbb {E}}\,}}_{\mathcal {N}}\) (resp. \(\mathbb {P}_{\mathcal {M}} = \mathbb {P}\) and \({{\,\mathrm{\mathbb {E}}\,}}_{\mathcal {M}} = {{\,\mathrm{\mathbb {E}}\,}}\)) denote, respectively, the probability measure and the expectation under \(\mathcal {N}\) (resp. \(\mathcal {M}\)). Using the construction of \(\mathcal {N}\), a change-of-measure argument provides us with a connection between the error rate on item i under \(\mathcal {M}\) and the mean and the variance of \(\mathcal {L}\) under \(\mathcal {N}\):
\(\square\)
Proof of (10)
The distribution of the log-likelihood \(\mathcal {L}\) under \(\mathcal {N}\) satisfies: for any \(g \ge 0\),
Using the definition (9) of the log-likelihood ratio, we bound the first term in (11) as follows:
To bound the second term in (11), note that \((2h' -1)\) is a strictly positive constant.Footnote 2 Hence, the perturbed model \({\mathcal {N}}\) satisfies (A1). By the definition of a uniformly good algorithm, we have \({\mathbb {P}_{\mathcal {N}} \left\{ i \notin \mathcal {S}_{k'}^\pi \right\} } = o(1)\). Hence:
Combining (11), (12) and (13) with \(g = - \log (4 \varepsilon _i^\pi (n, T))\), we have
Using Chebyshev’s inequality, we obtain:
which implies \(\mathbb {P}_{\mathcal {N}} \left\{ \mathcal {L} \le \mathbb {E}_{\mathcal {N}}[\mathcal {L}]+ \sqrt{2 \mathbb {E}_{{\mathcal {N}}} [(\mathcal {L} - \mathbb {E}_{\mathcal {N}}[\mathcal {L}] )^2 ]} \right\} \ge \frac{1}{2}\). Combining this result with (14) implies the claim (10). \(\square\)
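For completeness, the Chebyshev step can be spelled out as follows (a reconstruction consistent with the surrounding text; the displayed equation itself is not reproduced in this version):

```latex
\mathbb{P}_{\mathcal{N}}\left\{ \left| \mathcal{L} - \mathbb{E}_{\mathcal{N}}[\mathcal{L}] \right|
\ge \sqrt{2\, \mathbb{E}_{\mathcal{N}}\!\left[ (\mathcal{L} - \mathbb{E}_{\mathcal{N}}[\mathcal{L}])^2 \right]} \right\}
\le \frac{\mathbb{E}_{\mathcal{N}}\!\left[ (\mathcal{L} - \mathbb{E}_{\mathcal{N}}[\mathcal{L}])^2 \right]}
{2\, \mathbb{E}_{\mathcal{N}}\!\left[ (\mathcal{L} - \mathbb{E}_{\mathcal{N}}[\mathcal{L}])^2 \right]}
= \frac{1}{2}.
```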
Next, Lemma 1 provides upper bounds on the mean and variance of \(\mathcal {L}\) under the model \({\mathcal {N}}\).
Lemma 1
Assume that (A2) holds. For \(i, i'\) such that \(\sigma (i) = k \ne k' = \sigma (i')\), under the uniform selection strategy, we have
The proof of this lemma is presented in Appendix C. Note that, in view of the above lemma, the r.h.s. of (10) is asymptotically dominated by \({{\,\mathrm{\mathbb {E}}\,}}_{\mathcal {N}}[\mathcal {L}]\), since \({{\,\mathrm{\mathbb {E}}\,}}_{\mathcal {N}}[\mathcal {L}] = \Omega (T/n)\) and \(\sqrt{ {{\,\mathrm{\mathbb {E}}\,}}_{\mathcal {N}} \left[ (\mathcal {L} - {{\,\mathrm{\mathbb {E}}\,}}_{\mathcal {N}} [\mathcal {L}]) ^2 \right] } = O(\sqrt{T/n})\). Thus, Theorem 1 follows from the claim in (10) and Lemma 1. \(\square\)
3.2 Adaptive selection strategy
The derivation of a lower bound on the error rate under adaptive (list, question) selection strategies is similar:
Theorem 2
For any \(\mathcal {M}\in \Omega\) satisfying (A1) and (A2), and for any uniformly good algorithm \(\pi\) with possibly adaptive (list, question) selection strategy, under \(T=\omega (n)\), we have:
Theorem 2 implies \(\varepsilon ^\pi (n, T) \ge \exp ( - \frac{Tw}{n} \widetilde{\mathcal {D}}_{\mathcal {M}}^A (1+ o(1)) ).\)
Proof of Theorem 2
Again we use a change-of-measure argument, where we swap two items from different clusters. First, we prove the lower bound for the error rate of a fixed item i. Fix \(i\in \mathcal {I},\) let j be an item satisfying \(\sigma (j) \ne \sigma (i)\) and let
\(\mathcal {D}^A_{\mathcal {M}}(i, \varvec{y})\) is the value of the optimization problem:
Consider a perturbed model \({\mathcal {N}}'\), in which items except i and j have the same response statistics as under \({\mathcal {M}}\), and in which item i behaves as item j, and item j behaves as item i. Let \(\mathbb {P}_{\mathcal {N}'}\) and \({{\,\mathrm{\mathbb {E}}\,}}_{\mathcal {N}' }\) denote, respectively, the probability measure and the expectation under \({\mathcal {N}}'\). The log-likelihood ratio of the responses under \({\mathcal {N}}'\) and under \({\mathcal {M}}\) is:
The mean and variance of \(\mathcal {L}\) under \({\mathcal {N}}'\) are:
using a slight modification of Lemma 1. By an argument similar to that used in the proof of Theorem 1, we get:
as in (10). Note that the r.h.s. of (16) is asymptotically dominated by \({{\,\mathrm{\mathbb {E}}\,}}_{\mathcal {N}' }[\mathcal {L}]\) as \({{\,\mathrm{\mathbb {E}}\,}}_{\mathcal {N}' }[\mathcal {L}] = \Omega (T/n)\) and \(\sqrt{ {{\,\mathrm{\mathbb {E}}\,}}_{\mathcal {N}'} \left[ (\mathcal {L} - {{\,\mathrm{\mathbb {E}}\,}}_{\mathcal {N}' } [\mathcal {L}]) ^2 \right] } = O(\sqrt{T/n})\). That is,
We deduce that \(\varepsilon _{i}^\pi (n, T) \ge \exp \left( - \frac{Tw}{n}{\mathcal {D}}_{\mathcal {M}}^A(i, \varvec{y}) (1+ o(1)) \right)\). Thus, from the definition of \(\widetilde{\mathcal {D}}_{\mathcal {M}}^A\), we have:
Taking the logarithm of the previous inequality, we conclude the proof. \(\square\)
4 Algorithms
In this section, we describe our algorithms for both uniform and adaptive (list, question) selection strategies.
4.1 Uniform selection strategy
In this case, we assume that with a budget of T users, each item receives the same number of answers to each question. After gathering these answers, we have to exploit the data to estimate the clusters. To this aim, we propose an extension of the K-means clustering algorithm that efficiently leverages the problem structure. The pseudo-code of the algorithm is presented in Algorithm 1.
The algorithm first estimates the parameters \(q_{i\ell }\): the estimator \({\hat{q}}_{i\ell }\) is the empirical fraction of positive answers item i has received for question \(\ell\). We denote by \(\hat{\varvec{q}}_i=({\hat{q}}_{i\ell })_\ell\) the resulting vector. By normalizing the vector \(2 \hat{\varvec{q}}_i -1\), we can decouple the nonlinear relationship between \(\varvec{q}\), \(\varvec{h}\) and \(\varvec{p}\). Let \(\hat{\varvec{r}}^i = \frac{2 \hat{\varvec{q}}_{i} - 1}{\Vert 2 \hat{\varvec{q}}_{i} - 1\Vert }\) be the normalized vector. Then, \(\hat{\varvec{r}}^i\) concentrates around \(\widetilde{\varvec{r}}_{\sigma (i)}:=\varvec{r}_{\sigma (i)}/\Vert \varvec{r}_{\sigma (i)}\Vert\). Importantly, the normalized vector \(\widetilde{\varvec{r}}_{\sigma (i)}\) does not depend on \(h_i\), but on the cluster index \(\sigma (i)\) only. The algorithm exploits this observation, and applies the K-means algorithm to cluster the vectors \(\hat{\varvec{r}}^i\). By analyzing how \(\hat{\varvec{r}}^i\) concentrates around \(\widetilde{\varvec{r}}_{\sigma (i)}\) and by applying the results to our properly tuned algorithm (decision thresholds), we establish the following theorem.
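The pipeline just described can be sketched as follows. This is a simplified stand-in for Algorithm 1, not the algorithm itself: the careful initialization and decision thresholds are omitted, and a deterministic farthest-point seeding of plain Lloyd's K-means is used instead:

```python
import numpy as np

def cluster_items(q_hat, K, n_iter=50):
    """q_hat: (n, L) array of empirical frequencies of +1 answers.

    Normalizing 2*q_hat - 1 in sup-norm removes the item-specific hardness
    h_i; the normalized vectors concentrate around r_k / ||r_k|| for the
    item's cluster k, and K-means on these vectors recovers the clusters.
    """
    r_hat = 2.0 * q_hat - 1.0
    norms = np.maximum(np.abs(r_hat).max(axis=1, keepdims=True), 1e-12)
    r_hat = r_hat / norms
    # Deterministic farthest-point initialization of the K centroids.
    centroids = [r_hat[0]]
    for _ in range(1, K):
        dists = np.min([((r_hat - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(r_hat[int(dists.argmax())])
    centroids = np.array(centroids)
    # Lloyd iterations: assign to nearest centroid, then recompute centroids.
    for _ in range(n_iter):
        d = ((r_hat[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = r_hat[labels == k].mean(axis=0)
    return labels
```

With noiseless frequencies \(q_{i\ell}\), the normalized vectors of items in the same cluster coincide exactly whatever their hardness, so the sketch recovers the partition; with finite T, concentration of \(\hat{\varvec{r}}^i\) plays this role.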
Theorem 3
Assume \(T= \omega \left( n\right)\) and \(T = o (n^{2})\). Under Algorithm 1, we have,
We will present the proof of Theorem 3 later in this section. In view of Proposition 1 and the lower bounds derived in the previous section, we observe that the exponent for the mis-classification error of item i has the correct dependence in Tw/Ln and the tightest possible scaling in the hardness of the item, namely \((2h_i-1)^2\). Also note that using Proposition 1, the equivalence between the \(\ell _\infty\)-norm and the Euclidean norm, and (A1), we have: \(\mathcal {D}_{\mathcal {M}}^U(i) \ge C \frac{(2 h_i - 1)^2}{L} \rho _*^2\), for some absolute constant \(C>0\). Hence, Algorithm 1 has a performance scaling optimally w.r.t. all the model parameters.
The computational complexity of Algorithm 1 is \(\mathcal {O}(n^2)\). By choosing a small (\(\log n\)) subset of items (and not all the items in \(\mathcal {I}\)) to compute centroids (\(T_i\)), it is possible to reduce the computational complexity to \(\mathcal {O}(n \log n)\). This would not affect the performance of the algorithm in practice, but would result in worse performance guarantees.
Proof of Theorem 3
In this proof, we let \(\tau = \lfloor \frac{T w}{n L} \rfloor\) be the number of times question \(\ell\) is asked for item i. We also denote by \(\varvec{\alpha }:= (\alpha _1, \ldots , \alpha _K)\) the fractions of items that are in the various clusters, i.e., \(|\mathcal {I}_k| = \alpha _k n\). Without loss of generality, and to simplify the notation, we assume that the set of misclassified items is \(\mathcal {E} = \cup _{k=1}^K (\mathcal {I}_k \setminus \mathcal {S}_k)\), where recall that \(\{ \mathcal {S}_k\}_{k\in [K]}\) is the output of the algorithm (i.e., the permutation \(\gamma\) in the definition of this set is just the identity).
The proof proceeds in three steps: (i) we first decompose the probability of clustering error for item i, using the design of the algorithm and Assumptions (A1) and (A2). We show that this probability can be upper bounded by the probabilities of events related to \(\Vert \hat{\varvec{r}}^i - \widetilde{\varvec{r}}_{\sigma (i)}\Vert\) and \(\Vert \xi _{k} - \widetilde{\varvec{r}}_{k}\Vert\) for all k, where recall that \(\widetilde{\varvec{r}}_{k}:=\varvec{r}_{k}/\Vert \varvec{r}_{k}\Vert\). The remaining steps of the proof aim at bounding the probabilities of these events. Towards this objective, (ii) in the second step, we establish a concentration result on \(\Vert \hat{\varvec{r}}^i - \widetilde{\varvec{r}}_{\sigma (i)}\Vert\), and (iii) in the last step, we upper bound \(\Vert \xi _{k} - \widetilde{\varvec{r}}_{k}\Vert\).
Step 1. Error probability decomposition. The algorithm assigns item i to the cluster k minimizing the distance between \(\hat{\varvec{r}}^i\) and \(\xi _k\). As a consequence, we have:
where the two above inequalities are obtained by simply applying the triangle inequality. Now observe that in view of Assumptions (A1) and (A2), we have: for \(k' \ne \sigma (i)\),
We deduce that:
Step 2. Concentration of \({\hat{r}}^i\) and upper bound on (a). We prove the following lemma, a consequence of a concentration result for \(\hat{\varvec{q}}_i\):
Lemma 2
Let \(0 < \varepsilon \le \frac{\Vert 2 \varvec{q}_i - 1\Vert }{16}\), \(\widetilde{\varvec{r}}^i = \frac{{\varvec{r}}_{\sigma (i)}}{\Vert {\varvec{r}}_{\sigma (i)}\Vert }\), and \(\widetilde{\varvec{r}}_k = \frac{{\varvec{r}}_{k}}{\Vert {\varvec{r}}_{k}\Vert }\). For each \(i \in \mathcal {I}\),
with probability at least \(1 - 2\,L \exp \left( -{2 \tau } \varepsilon ^2 \right)\).
The proof of Lemma 2 is presented in Appendix E. Note that by definition of \(\rho _*\), we have:
Applying Lemma 2 with \(\varepsilon = \frac{(2 h_i -1) \rho _{*}}{20} < \frac{\Vert 2 \varvec{q}_i -1\Vert }{16}\), we obtain an upper bound on the term (a):
Step 3. Upper bound of the term (b). Next, we establish the following claim:
To this aim, we first show that a large fraction of the items satisfy \(\Vert \hat{\varvec{r}}^v -\widetilde{\varvec{r}}^v\Vert \le \frac{1}{4} \left( \frac{n}{T}\right) ^{\frac{1}{4}}\). Applying Lemma 2 with \(\varepsilon = \frac{\Vert 2 \varvec{q}_i - 1 \Vert }{20} \left( \frac{n}{T}\right) ^{\frac{1}{4}}\), we get:
Define \(p_{\max }:= \max _{v \in \mathcal {I}_k} \mathbb {P} \left\{ \Vert \hat{\varvec{r}}^v- \widetilde{\varvec{r}}^v\Vert \ge \frac{1}{4} \left( \frac{n}{T}\right) ^{\frac{1}{4}} \right\}\). Then from (19), \(p_{\max } \le \exp \left( -\Theta \left( \left( \frac{T}{n}\right) ^{\frac{1}{2}} \right) \right)\). Further define S as the number of items in \(\mathcal {I}\) that violate \(\Vert \hat{\varvec{r}}^v -\widetilde{\varvec{r}}^v\Vert \le \frac{1}{4} \left( \frac{n}{T}\right) ^{\frac{1}{4}}\), i.e., \(S = \sum _{v \in \mathcal {I}} \mathbbm {1}_{\{\Vert \hat{\varvec{r}}^v- \widetilde{\varvec{r}}^v\Vert \ge \frac{1}{4} (\frac{n}{T})^{\frac{1}{4}} \}}\). Since \(\hat{\varvec{r}}^1, \ldots , \hat{\varvec{r}}^n\) are independent random variables, \(\mathbbm {1}_{\{\Vert \hat{\varvec{r}}^1- \widetilde{\varvec{r}}^1\Vert \ge \frac{1}{4} (\frac{n}{T})^{\frac{1}{4}} \}}, \ldots , \mathbbm {1}_{\{\Vert \hat{\varvec{r}}^n- \widetilde{\varvec{r}}^n\Vert \ge \frac{1}{4} (\frac{n}{T})^{\frac{1}{4}} \}}\) are independent Bernoulli random variables. From the Chernoff bound, we get:
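For reference, the generic Chernoff computation behind this step, for a sum \(S\) of independent indicator variables each with mean at most \(p_{\max}\), is the standard one (a sketch in the paper's notation, valid for any \(\lambda > 0\)):

```latex
\mathbb{P}\{S \ge s\}
 \;\le\; e^{-\lambda s}\, \mathbb{E}\big[e^{\lambda S}\big]
 \;=\; e^{-\lambda s} \prod_{v \in \mathcal{I}} \big(1 + p_v (e^{\lambda} - 1)\big)
 \;\le\; \exp\!\big(-\lambda s + n\, p_{\max} (e^{\lambda} - 1)\big),
```

using \(1 + x \le e^{x}\); the choice \(\lambda = \log (1/p_{\max})\) then yields \(\mathbb{P}\{S \ge s\} \le \exp \left( -s \log (1/p_{\max }) + n(1 - p_{\max }) \right)\).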
where for (i), we set \(\lambda = \log \frac{1}{p_{\max }}\). Therefore, to prove (18), it suffices to show that:
Assume that \(S \le \frac{n}{\log \left( \frac{T}{n}\right) }\). Then, every v having \(\min _{1\le k\le K}\Vert \hat{\varvec{r}}^v - \widetilde{\varvec{r}}_k \Vert \ge 2\left( \frac{n}{T}\right) ^{\frac{1}{4}}\) cannot be a center node (i.e., one of the \(i_k^*\) for \(k=1,\ldots ,K\)). This is due to the following facts:
-
(i)
\(|T_v| \le \frac{n}{\log \left( \frac{T}{n}\right) }\) when \(\min _{1\le k\le K}\Vert \hat{\varvec{r}}^v - \widetilde{\varvec{r}}_k \Vert \ge 2 \left( \frac{n}{T}\right) ^{\frac{1}{4}}\), since for all w such that \(\Vert \hat{\varvec{r}}^w- \widetilde{\varvec{r}}^w\Vert \le \frac{1}{4} (\frac{n}{T})^{\frac{1}{4}}\), \(\Vert \hat{\varvec{r}}^v - \hat{\varvec{r}}^w \Vert \ge \Vert \hat{\varvec{r}}^v - \widetilde{\varvec{r}}^w \Vert - \Vert \widetilde{\varvec{r}}^w - \hat{\varvec{r}}^w \Vert \ge \frac{3}{2} \left( \frac{n}{T}\right) ^{\frac{1}{4}}.\)
-
(ii)
\(|T_v| \ge \alpha _k n - \frac{n}{\log \left( \frac{T}{n}\right) }\) when \(\Vert \hat{\varvec{r}}^v - \widetilde{\varvec{r}}_k \Vert \le \frac{1}{2} \left( \frac{n}{T}\right) ^{\frac{1}{4}}\), since for all \(w\in \mathcal {I}_k\) such that \(\Vert \hat{\varvec{r}}^w- \widetilde{\varvec{r}}^w\Vert \le \frac{1}{4} (\frac{n}{T})^{\frac{1}{4}}\), \(\Vert \hat{\varvec{r}}^v - \hat{\varvec{r}}^w \Vert \le \Vert \hat{\varvec{r}}^v - \widetilde{\varvec{r}}_k \Vert + \Vert \widetilde{\varvec{r}}_k - \hat{\varvec{r}}^w \Vert \le \frac{3}{4} \left( \frac{n}{T}\right) ^{\frac{1}{4}}.\)
Therefore, when \(\frac{T}{n} = \omega (1)\),
Let \(\mathcal {R}_k\) denote the set \(\mathcal {S}_k\) as it is before the computation of \(\xi _k\) (i.e., the version of \(\mathcal {S}_k\) used in the calculation of \(\xi _k\)) – see the algorithm. Then, from (20) and the definition of \(\mathcal {S}_k\) before computing \(\xi _k\),
From the above inequality and Jensen’s inequality,
Therefore, when \(T = \omega (n)\),
which concludes the proof of (18).
The proof of the theorem is completed by remarking that when \(T=o(n^2)\), then
This implies that the upper bound we derived for the term (a) is dominating the upper bound of the term (b). Finally,
\(\square\)
4.2 Adaptive selection strategy
Our adaptive (item, question) selection and clustering algorithm is described in Algorithm 2. The design of the adaptive (item, question) selection strategy is inspired by the derivation of the information-theoretical error lower bounds. The algorithm maintains estimates of the model parameters \(\varvec{p}\) and \(\varvec{h}\) and of the clusters \(\{\mathcal {I}_k\}_{k\in [K]}\). These estimates, denoted by \(\hat{\varvec{p}}\), \(\hat{\varvec{h}}\), and \(\{\mathcal {S}_k\}_{k\in [K]}\), respectively, are updated every \(\tau =T/(4\log (T/n))\) users. More precisely, we use Algorithm 1 to compute \(\{\mathcal {S}_k\}_{k\in [K]}\), and from there, we update the estimates as:
where \(Y_{i\ell }\) is the number of times where question \(\ell\) has been asked for item i so far, and where \({\hat{\sigma }}(i)\) corresponds to the estimated cluster of i (i.e., \(i\in \mathcal {S}_{{\hat{\sigma }}(i)}\)). Let \(\varvec{Y}:= (Y_{i \ell })_{i\in \mathcal {I}, \ell \in [L]}\).
Now using the same arguments as those used to derive error lower bounds, we may estimate that after seeing the t-th user, a lower bound on the mis-classification error for item i is \(\exp \left( - {\hat{d}}_{i}(\varvec{Y}) \right)\), where
The above lower bounds are heuristic in nature, as they are based solely on estimated parameters and clusters. These are derived from the divergence \({\mathcal {D}}_{\mathcal {M}}^A (i, \varvec{y})\) using (5), with a particular emphasis on the adjustable parameters for item i. This approach takes a pessimistic view of the hardness parameters, with the exception of those for item i. Revisiting the scenario of Example 1, there is only one question (\(L=1\)) and the adaptability of the algorithm is principally determined by how the budget T is allocated among the items. Observe that, when \(h_i\) is estimated to be small, the value of \({{\,\mathrm{\text {KL}}\,}}( {h}' {\hat{p}}_{k' \ell }+\bar{{h}}' \bar{{\hat{p}}}_{k'\ell }, {\hat{h}}_i{\hat{p}}_{{\hat{\sigma }}(i) \ell }+\bar{{\hat{h}}}_i \bar{{\hat{p}}}_{{\hat{\sigma }}(i)\ell })\) tends to be small. Conversely, when \(h_i\) is estimated to be large, the value of \({{\,\mathrm{\text {KL}}\,}}( {h}' {\hat{p}}_{k' \ell }+\bar{{h}}' \bar{{\hat{p}}}_{k'\ell }, {\hat{h}}_i{\hat{p}}_{{\hat{\sigma }}(i) \ell }+\bar{{\hat{h}}}_i \bar{{\hat{p}}}_{{\hat{\sigma }}(i)\ell })\) tends to be large. Therefore, the more difficult the item i is, the greater the need for a larger \(Y_{i 1}\), and the higher the frequency of it being selected. Analyzing the accuracy of these lower bounds is particularly challenging (it is hard to analyze the estimated item hardness \({\hat{h}}_i\)). Using these estimated lower bounds, we select the items and the question to be asked next. We put in the list \(\mathcal {W}_t\) the w items with the smallest \({\hat{d}}_{i}(\varvec{Y})\). 
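The estimated exponents can be sketched as follows. This is a hedged illustration of the verbal description above, not the paper's exact formula: the minimization over the adversarial hardness \(h'\) is handled here by a simple grid search, and the helper names `bern_kl` and `d_hat` are ours.

```python
import numpy as np

def bern_kl(a, b, eps=1e-12):
    """KL divergence between Bernoulli(a) and Bernoulli(b), elementwise."""
    a = np.clip(a, eps, 1 - eps)
    b = np.clip(b, eps, 1 - eps)
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

def d_hat(Y_i, h_hat_i, p_hat, k_i, h_grid):
    """Heuristic mis-classification exponent for one item: the Y-weighted
    sum of Bernoulli KLs, minimized over confusing clusters k' != k_i and
    over the adversarial hardness h' (grid-searched for simplicity)."""
    K, L = p_hat.shape
    # estimated probability of a positive answer for this item, per question
    q_i = h_hat_i * p_hat[k_i] + (1 - h_hat_i) * (1 - p_hat[k_i])
    best = np.inf
    for k2 in range(K):
        if k2 == k_i:
            continue
        for h2 in h_grid:
            q2 = h2 * p_hat[k2] + (1 - h2) * (1 - p_hat[k2])
            best = min(best, float(np.sum(Y_i * bern_kl(q2, q_i))))
    return best
```

Items with the smallest \({\hat{d}}_i(\varvec{Y})\) are placed in the list \(\mathcal {W}_t\); harder items (with \({\hat{h}}_i\) closer to 1/2) receive smaller exponents and are therefore selected more often, as the text explains.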
The question \(\ell\) is chosen to maximize the term: \(\min _{k' \ne {\hat{\sigma }}(i^*)} {{\,\mathrm{\text {KL}}\,}}(h'_{i^*} {\hat{p}}_{k'\ell } + \bar{{h}}'_{i^*} \bar{{\hat{p}}}_{k'\ell }, {\hat{h}}_{i^*} {\hat{p}}_{{\hat{\sigma }}(i^*) \ell } +\bar{{\hat{h}}}_{i^*} \bar{{\hat{p}}}_{{\hat{\sigma }}(i^*) \ell } ),\) where \(i^*= \mathop {\mathrm {arg\,min}}\limits _{i \in \mathcal {I}} {\hat{d}}_i(\varvec{Y})\) (see Algorithm 2 for the details). Note that the question is selected by considering the item \(i^*\) that seems to be the most difficult to classify.
Note that in order to reduce the computational complexity of the algorithm, we may replace the KL function in the definition of \({\hat{d}}_i\) by a simple quadratic function (as suggested in the proof of Proposition 1). This simplifies the minimization problem over \(h'\): with this modification, we actually have an explicit expression for the minimizer \(h_i'\).
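To illustrate the simplification (our reading of this remark, not the paper's exact formula): since \(q'_\ell (h') = h' {\hat{p}}_{k'\ell } + (1-h')(1-{\hat{p}}_{k'\ell }) = (2{\hat{p}}_{k'\ell } - 1) h' + (1 - {\hat{p}}_{k'\ell })\) is affine in \(h'\), replacing each KL term by the quadratic \(2(a-b)^2\) (Pinsker's lower bound) turns the inner problem into a least-squares problem in \(h'\):

```latex
\min_{h'} \; \sum_{\ell=1}^{L} Y_{i\ell}\, 2 \big( q'_\ell(h') - \hat q_{i\ell} \big)^2
\quad\Longrightarrow\quad
h'^{*} = \frac{\sum_{\ell} Y_{i\ell} \, (2\hat p_{k'\ell} - 1) \big( \hat q_{i\ell} - (1 - \hat p_{k'\ell}) \big)}{\sum_{\ell} Y_{i\ell} \, (2\hat p_{k'\ell} - 1)^2},
```

where \(\hat q_{i\ell} := \hat h_i \hat p_{\hat\sigma(i)\ell} + \bar{\hat h}_i \bar{\hat p}_{\hat\sigma(i)\ell}\), followed by clipping \(h'^{*}\) to the admissible interval \([(h_*+1)/2,\, 1]\).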
The computational complexity of the adaptive algorithm (Algorithm 2 in Appendix) is: \(\mathcal {O}(n^2T/\tau ) = \mathcal {O}(n^2 \log (T/n))\). As in the uniform case, by choosing a small (\(\log n\)) subset of items (and not all the items in \(\mathcal {I}\)) to compute centroids (\(T_i\)), one can reduce the computational complexity to: \(\mathcal {O}(n \log (n) \log (T/n)).\) We provide experimental evidence on the superiority of our adaptive algorithm in the following sections.
5 Numerical experiments: synthetic data
In this section, we evaluate the performance of our algorithms on synthetic data. We consider different models. The problem investigated here differs from those found in the crowdsourcing or Stochastic Block Model literature; hence, we cannot directly compare our algorithms to existing algorithms developed in that literature. Instead, we focus on comparing the performance of our nonadaptive and adaptive algorithms.
Model 1: heterogeneous items with dummy questions: Consider \(n=1000\) items and two clusters (\(K=2\)) of equal sizes. The hardness parameters of the items are i.i.d., drawn uniformly at random from the interval [0.55, 1]. We ask each user to answer one of four questions. The answers’ statistics are as follows: for cluster \(k =1\), \(\varvec{p}_{1} = (0.01, 0.99, 0.5, 0.5)\) and for cluster \(k =2\), \(\varvec{p}_{2} = (0.99, 0.01, 0.5, 0.5)\). Note that only half of the questions (\(\ell = 1, 2\)) are useful; the other questions (\(\ell = 3, 4\)) generate completely random answers for both clusters. Figure 2 (top) plots the error rate averaged over all items and over 100 instances of our algorithms. Under both algorithms, the error rate decays exponentially with the budget T, as expected from the analysis. Selecting items and questions in an adaptive manner brings significant performance improvements. For example, after collecting the answers of \(t=200\)k users, the adaptive algorithm recovers the clusters exactly for most of the instances, whereas the algorithm using uniform selection does not achieve exact recovery even with \(t=1000\)k users. In particular, the adaptive algorithm is able to reduce the error rates on the \(20\%\) most difficult items, i.e., the items with the \(20\%\) smallest \(h_i\). In Fig. 2 (bottom), we present the error rate of these items. The error rates for these most difficult items are significantly reduced by being adaptive. In Fig. 3, we present the evolution over time of the budget allocation observed under our adaptive algorithm. We group items and questions into four categories. For example, one category corresponds to the questions \(\ell =1, 2\) and to the \(20\%\) most difficult items. As expected, the adaptive algorithm learns to select relevant questions (\(\ell =1,2\)) with hard items more and more often as time evolves.
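The generative model above can be sketched as follows (the helper and its name are ours, using the paper's parameter values; the answer probabilities follow \(q_{i\ell } = h_i p_{k\ell } + {\bar{h}}_i {\bar{p}}_{k\ell }\)):

```python
import numpy as np

def make_model1(n=1000, seed=0):
    """Model 1: two equal-size clusters, item hardness i.i.d. uniform on
    [0.55, 1], four questions of which only the first two are informative.
    Returns the (n, 4) matrix of positive-answer probabilities
    q[i, l] = h[i] * p[sigma[i], l] + (1 - h[i]) * (1 - p[sigma[i], l]),
    the cluster assignment sigma and the hardness vector h."""
    rng = np.random.default_rng(seed)
    p = np.array([[0.01, 0.99, 0.5, 0.5],
                  [0.99, 0.01, 0.5, 0.5]])
    sigma = np.array([0] * (n // 2) + [1] * (n - n // 2))
    h = rng.uniform(0.55, 1.0, size=n)
    q = h[:, None] * p[sigma] + (1 - h[:, None]) * (1 - p[sigma])
    return q, sigma, h
```

By construction, the dummy questions give \(q_{i\ell } = 0.5\) exactly, so their answers carry no information about the clusters.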
Model 2: heterogeneous items without dummy questions. This model is similar to Model 1, except that we remove the dummy questions \(\ell =3, 4\), i.e., we set \(\varvec{p}_{1} = (0.01, 0.99)\) and \(\varvec{p}_{2} = (0.99, 0.01)\). The performance of our algorithms is shown in Fig. 4. Overall, compared to Model 1, the error rates are better. For example, exact cluster recovery is achieved using only 100k users for almost all instances.
Model 3: homogeneous items with dummy questions. Here we study the homogeneous scenario where all items have the same hardness: \(h_i = 1, \forall i \in \mathcal {I}\). We still have 1000 items grouped into two clusters of equal sizes. We set \(\varvec{p}_{1} = (0.3, 0.2,0.2,0.2)\), \(\varvec{p}_{2} = (0.7, 0.2, 0.2,0.2)\) (questions \(\ell = 2, 3, 4\) are useless). The performance of the algorithms is shown in Fig. 5. The adaptive algorithm exhibits better error rates than the algorithm with uniform selection, although the improvement is not as spectacular as in heterogeneous models, where adaptive algorithms can gather more information about difficult items. In homogeneous models, the adaptive algorithm remains better because it selects questions wisely.
6 Numerical experiments: real-world data
Finally, we use real-world data to assess the performance of our algorithms. Finding data that fits our setting exactly (e.g., with several possible questions) is not easy. We restrict our attention here to scenarios with a single question, but with items of different hardnesses. We use the waterbird dataset by Welinder et al. (2010). This dataset contains 50 images of Mallards (a kind of duck) and 50 images of Canadian Geese (not ducks). The dataset reports the feedback of 40 users per image, collected using Amazon Mechanical Turk: each user is asked whether the image is that of a duck. This scenario mirrors the one outlined in Example 1 in the Introduction. Each image is unique in the sense that the orientation of the animal varies, the brightness and contrast differ, etc. We hence have good heterogeneity in terms of item hardness. Actually, the classification task is rather difficult, and the users’ answers seem very noisy – overall, answers are correct 76% of the time.
From this small dataset, we generated a larger dataset containing 1000 images (by just replicating images). To emulate the sequential nature of our clustering problem, in each round, we pick a user uniformly at random (with replacement), and observe her answers to the selected images.
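The emulation step can be sketched as follows (a minimal illustration; the function name and the `answer_matrix` input format are ours):

```python
import numpy as np

def sample_user_answers(answer_matrix, rounds, seed=0):
    """Emulate sequential user arrivals by drawing users uniformly at
    random with replacement from a fixed pool of recorded answers.
    answer_matrix has shape (num_users, num_items) with 0/1 entries;
    the returned array has one row of answers per simulated arrival."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, answer_matrix.shape[0], size=rounds)
    return answer_matrix[idx]
```

Sampling with replacement keeps the per-round answer distribution identical to the empirical distribution of the recorded users.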
The error rates of both algorithms are shown in Fig. 6. The global error rate is averaged over 100 instances. Both algorithms have rather low performance, which can be explained by the inherent hardness of the learning task. The adaptive algorithm becomes significantly better after \(t=20\)k users. This can be explained as follows. The adaptive algorithm needs to estimate the hardness of items before being efficient. Until the algorithm gathers enough answers on item i, its estimate \({\hat{h}}_i\) remains close to 0.5. As a consequence, the algorithm continues to pick items uniformly at random. As soon as the algorithm gets better estimates of the items’ hardnesses, it starts selecting items with strong preferences.
7 Conclusion
In this paper, we analyzed the problem of clustering complex items using very simple binary feedback provided by users. A key aspect of our problem is that it accounts for the fact that some items are inherently more difficult to cluster than others. Accounting for this heterogeneity is critical to get realistic models, and is unfortunately rarely investigated in the literature on clustering and community detection (e.g., the literature on the Stochastic Block Model). The item heterogeneity also significantly complicates any theoretical development.
For the case where data is collected uniformly (each item receives the same amount of user feedback), we derived a lower bound on the clustering error rate for any individual item, and we developed a clustering algorithm approaching the optimal error rate. We also investigated adaptive algorithms, under which the user feedback is received sequentially and can be adapted to past observations. Being adaptive makes it possible to gather more feedback on the more difficult items. We derived a lower bound on the error rate that holds for any adaptive algorithm. Based on our lower bounds, we devised an adaptive algorithm that smartly selects items and the nature of the feedback to be collected. We evaluated our algorithms on both synthetic and real-world data. These numerical experiments support our theoretical results, and demonstrate that being adaptive leads to drastic performance gains.
Availability of data and materials
We have attached the data used and the generated raw data at the following URL: https://bit.ly/3Am5DhX.
Code Availability
We have attached the source code at the following URL: https://bit.ly/3Am5DhX.
Notes
Define for any integer \(A\ge 1\), the set \([A]:=\{1,\ldots ,A\}\).
This is true as \(h'\) is optimized over \([(h_*+1)/2, 1]\).
References
Abbe, E. (2018). Community detection and stochastic block models. Foundations and Trends in Communications and Information Theory, 14(1–2), 1–162.
Boys in Bristol Photography. (2024). A Canada goose grooming while swimming in a lake. https://www.pexels.com/photo/a-canada-goose-grooming-while-swimming-in-a-lake-7589597/. Online accessed 2 Feb. 2024.
Chris F. (2024). A mallard duck on water. https://www.pexels.com/photo/a-mallard-duck-on-water-11798057/. Online; Accessed 2 Feb. 2024.
Dawid, A. P., & Skene, A. M. (1979). Maximum likelihood estimation of observer error-rates using the em algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics), 28(1), 20–28.
Gao, C., Lu, Y., & Zhou, D. (2016). Exact exponent in optimal rates for crowdsourcing. In Proceedings of the 33rd international conference on machine learning (pp. 603–611).
Gao, C., Ma, Z., Zhang, A. Y., Zhou, H. H., et al. (2018). Community detection in degree-corrected block models. The Annals of Statistics, 46(5), 2153–2185.
Gomes, R. G., Welinder, P., Krause, A., & Perona, P. (2011). Crowdclustering. In Advances in neural information processing systems, (pp. 558–566).
Ho, C.-J., Jabbari, S., & Vaughan, J. W. (2013). Adaptive task assignment for crowdsourced classification. In International conference on machine learning, (pp. 534–542).
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301), 13–30.
Karger, D. R, Oh, S., & Shah, D. (2011). Iterative learning for reliable crowdsourcing systems. In Advances in neural information processing systems (pp. 1953–1961)
Khetan, A., & Oh, S. (2016). Achieving budget-optimality with adaptive schemes in crowdsourcing. In Advances in Neural Information Processing Systems (Vol. 29).
Lai, T. L., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1), 4–22.
Tran-Thanh, L., Venanzi, M., Rogers, A., & Jennings, N. R. (2013). Efficient budget allocation with accuracy guarantees for crowdsourcing classification tasks. In Proceedings of the 2013 international conference on autonomous agents and multi-agent systems (pp. 901–908).
Ok, J., Oh, S., Shin, J., & Yi, Y. (2016). Optimality of belief propagation for crowdsourced classification. In Proceedings of the 33rd International Conference on Machine Learning (pp. 535–544).
Raykar, V. C., Shipeng, Yu., Zhao, L. H., Valadez, G. H., Florin, C., Bogoni, L., & Moy, L. (2010). Learning from crowds. Journal of Machine Learning Research, 11, 1297–1322.
Vinayak, R. K., & Hassibi, B. (2016). Crowdsourced clustering: Querying edges vs triangles. In Advances in Neural Information Processing Systems, (pp. 1316–1324).
Welinder, P., Branson, S., Perona, P., & Belongie, S. J. (2010). The multidimensional wisdom of crowds. In Advances in neural information processing systems, (pp. 2424–2432).
Yun, S.-Y., & Proutiere, A. (2016). Optimal cluster recovery in the labeled stochastic block model. In Advances in Neural Information Processing Systems, (pp. 965–973).
Zhang, Y., Chen, X., Zhou, D., & Jordan, M. I. (2014). Spectral methods meet EM: a provably optimal algorithm for crowdsourcing. In Advances in neural information processing systems, (pp. 1260–1268).
Zhou, D., Basu, S., Mao, Y., & Platt, J. C. (2012). Learning from the wisdom of crowds by minimax entropy. In Advances in neural information processing systems, (pp. 2195–2203).
Funding
Open access funding provided by Royal Institute of Technology. This research is partly funded by the Nakajima Foundation Scholarship (Kaito Ariu), Vetenskapsrådet (Alexandre Proutiere), Institute of Information and communications Technology Planning and Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST) (Seyoung Yun) and No. 2019-0-01906, Artificial Intelligence Graduate School Program (POSTECH) (Jungseul Ok)), and JSPS KAKENHI Grant No. 23K19986 (Kaito Ariu).
Author information
Authors and Affiliations
Contributions
K. Ariu, A. Proutiere, and S. Yun collaboratively developed the conceptualization and formulation of the model and problem. The establishment of proofs was carried out by K. Ariu, J. Ok, and S. Yun. Both K. Ariu and J. Ok conducted numerical experiments using synthetic data, while K. Ariu also conducted experiments with non-synthetic data. All authors actively participated in writing the entire manuscript.
Corresponding author
Ethics declarations
Conflict of interest
Not applicable.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Editor: Hendrik Blockeel.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Table of notations
Problem-specific notations

| Notation | Meaning |
|---|---|
| \(\mathcal {I}\) | Set of items |
| \(\mathcal {I}_k\) | Set of items in the item cluster k |
| n | Number of items |
| K | Number of item clusters |
| \(\sigma (i)\) | Cluster index of item i |
| \(\varvec{\alpha } {:}{=}(\alpha _1, \ldots , \alpha _K)\) | Fractions of items in each cluster |
| w | Number of items presented at the same time |
| \(\mathcal {W}_t\) | Set of items presented to the t-th user |
| L | Number of possible questions |
| \(\ell _t\) | Question asked to the t-th user |
| T | Total number of user arrivals within the time horizon |
| \(\varvec{p} {:}{=}(p_{k \ell })_{k\in [K], \ell \in [L]}\) | Statistical parameters of items in cluster k for question \(\ell\) |
| \(\varvec{h} {:}{=}(h_i)_{i \in \mathcal {I}}\) | Hardness parameters of the items |
| \(\mathcal {M}\) | Statistical model parameterized by \((\varvec{p}, \varvec{h})\) |
| \(X_{i \ell t}\) | Binary feedback from the t-th user for item i and question \(\ell\) |
| \(q_{i \ell } {:}{=}h_i p_{k \ell } + {\bar{h}}_i {\bar{p}}_{k \ell }\) | Probability of a positive answer for item i and question \(\ell\) |
| \(h_*\) | Minimum hardness across items, see Assumption (A1) |
| \(\rho _*\) | Minimum separation between different clusters, see Assumption (A1) |
| \(\eta\) | Homogeneity parameter among clusters, see Assumption (A2) |
| \(\Omega\) | Set of all models satisfying Assumptions (A1) and (A2) |
| \(r_{k\ell }\) | Value of \(2 p_{k \ell } -1\) |
| \(\mathcal {E}^\pi\) | Set of items misclassified by the algorithm \(\pi\) |
| \(\varepsilon ^\pi _i (n, T)\) | Probability that item i is misclassified after the T-th user arrival under the algorithm \(\pi\) |
| \(\varepsilon ^\pi (n, T)\) | Expected proportion of misclassified items after the T-th user arrival under the algorithm \(\pi\) |
| \(Y_{i \ell }\) | Number of times item i is presented together with question \(\ell\) |
| \(y_{i \ell }\) | Normalized expected number of times question \(\ell\) is asked for item i under some fixed algorithm |
| \(\mathcal {D}^U_{\mathcal {M}}(i)\) | Divergence for the misclassification of item i with the model \(\mathcal {M}\) under the uniform item selection strategy |
| \(\widetilde{\mathcal {D}}^U_{\mathcal {M}}\) | Global divergence with the model \(\mathcal {M}\) under the uniform item selection strategy |
| \(\mathcal {D}^A_{\mathcal {M}}(i, \varvec{y})\) | Divergence for the misclassification of item i with the model \(\mathcal {M}\) under an adaptive item selection strategy satisfying \({{\,\mathrm{\mathbb {E}}\,}}[Y_{i \ell }] = \frac{Tw}{n} y_{i \ell }\) |
| \(\widetilde{\mathcal {D}}^A_{\mathcal {M}}\) | Global divergence with the model \(\mathcal {M}\) under the optimal adaptive item selection strategy |

Generic notations

| Notation | Meaning |
|---|---|
| \({\hat{a}}\) | Estimated value of a |
| [a] | Set of positive integers up to a, i.e., \(\{1, \ldots , a\}\) |
| \(\mathbbm {1}\{A\}\) | Indicator function: 0 when A is false, 1 when A is true |
| \(\Vert \varvec{x}\Vert\) | \(\ell _\infty\) norm of \(\varvec{x}\), i.e., \(\Vert \varvec{x}\Vert = \max _i \vert x_i \vert\) |
| \(\Vert \varvec{x}\Vert _2\) | \(\ell _2\) norm of \(\varvec{x}\) |
| \({\bar{a}}\) | Value of \(1 - a\) |
| \(\mathbb {P}(A)\) | Probability that event A occurs |
| \({{\,\mathrm{\mathbb {E}}\,}}[a]\) | Expected value of a |
| \(\text {KL}(a, b)\) | Kullback–Leibler divergence between Bernoulli distributions with means a and b |
Appendix B: Proof of Proposition 1
For given \(i \in \mathcal {I}\), let \(k = \sigma (i)\) and \(k' \in [K]\) be such that:
Upper bound. Recalling the definition of \(q_{i\ell }:= h_i p_{k\ell } + {\bar{h}}_i {\bar{p}}_{k\ell }\), it follows that for any \(h' \in [(h_*+1) /2, 1],\)
where the second inequality follows from the comparison between the KL divergence and the \(\chi ^2\)-divergence, and the third inequality is from (A2), i.e., \(q_{i\ell } \in [\eta , 1-\eta ]\). Now observe that \(h' \in [(h_*+1) /2, 1]\) implies \(\frac{h_*}{2h_i - 1} \le \frac{2h' - 1}{2h_i - 1} \le \frac{1}{2h_i - 1}\). Taking the minimum over \(h' \in [(h_*+1) /2, 1]\), we obtain the upper bound.
Lower bound. Using Pinsker’s inequality, we obtain:
where for the last inequality, we again use the fact that \(h' \in [(h_*+1) /2, 1]\) implies \(\frac{h_*}{2h_i - 1} \le \frac{2\,h' - 1}{2h_i - 1} \le \frac{1}{2h_i - 1}\). This completes the proof of Proposition 1. Note that we can further write:
using the relationship between the \(\ell _\infty\)-norm and the Euclidean norm and (A1). \(\square\)
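For reference, the two classical comparisons used in these bounds, specialized to Bernoulli distributions with means \(a, b \in (0, 1)\), are Pinsker's inequality (lower bound) and the KL–\(\chi ^2\) comparison (upper bound):

```latex
2 (a - b)^2 \;\le\; \operatorname{KL}(a, b) \;\le\; \chi^2(a, b) \;=\; \frac{(a-b)^2}{b} + \frac{(a-b)^2}{1-b} \;=\; \frac{(a-b)^2}{b(1-b)}.
```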
Appendix C: Proof of Lemma 1
\(\mathbb {E}_{\mathcal {N}} [\mathcal {L}]\) can be obtained as follows:
To bound the variance of \(\mathcal {L}\), we first decompose \(\mathcal {L}^2\) as follows:
where \(\mathcal {L}_{t}:= \mathbbm {1}[i \in {\mathcal {W}}_t] \sum _{\ell =1}^L \mathbbm {1}[\ell _t = \ell ] \left( \mathbbm {1}[X_{i \ell t} = +1] \log \frac{q'_{\ell }}{q_{i\ell }} +\mathbbm {1}[X_{i \ell t} = -1] \log \frac{{\bar{q}}'_{\ell }}{{\bar{q}}_{i\ell }} \right)\). We compute \(\mathcal {L}_{t}^2\) as follows:
where the last inequality follows from the fact that \(q_{i \ell } \in [\eta , 1-\eta ]\) under (A2), i.e., \(\log \frac{q'_\ell }{q_{i\ell }} \le \log \frac{1}{\eta }\) and \(\log \frac{{\bar{q}}'_\ell }{{\bar{q}}_{i\ell }} \le \log \frac{1}{\eta }\). We deduce that:
where for the last inequality, we used Pinsker's inequality. Moreover, we can compute the expectation of \(\sum _{t\ne t'} \mathcal {L}_{t}\mathcal {L}_{t'}\) as follows:
where for the last equality, we use the expression (21) of \({{\,\mathrm{\mathbb {E}}\,}}_{\mathcal {N}}[\mathcal {L}]\). Combining (23) with the above, it follows that:
\(\square\)
Appendix D: Proof of Corollary 1
We have:
where (a) stems from the relationship between the KL divergence and the \(\chi ^2\)-divergence and (b) is from (A2). Combining this inequality and Theorem 1 yields Corollary 1. \(\square\)
Appendix E: Proof of Lemma 2
We use Hoeffding’s inequality to establish the lemma.
Theorem 4
(Hoeffding’s inequality for bounded independent random variables (Theorem 1 of Hoeffding (1963))) Let \(X_1,\dots ,X_n\) be independent random variables with values in [0, 1]. Denote \(\mu = {{\,\mathrm{\mathbb {E}}\,}}\left[ \frac{1}{n}\sum _{i=1}^n X_i\right] .\) Then, for any \(t \ge 0\),
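Lemma 3 below relies on the two-sided form \(\mathbb {P}\{ |\frac{1}{n}\sum _{i} X_i - \mu | \ge t \} \le 2 e^{-2 n t^2}\). As a quick numerical sanity check (our own illustration, with arbitrarily chosen \(\mu\), n and t):

```python
import numpy as np

def hoeffding_violation_rate(n=200, t=0.1, mu=0.3, trials=20000, seed=0):
    """Compare the empirical probability that a Bernoulli(mu) sample mean
    deviates from mu by at least t with the two-sided Hoeffding bound
    2 * exp(-2 * n * t**2)."""
    rng = np.random.default_rng(seed)
    x = rng.random((trials, n)) < mu          # trials x n Bernoulli(mu) samples
    deviations = np.abs(x.mean(axis=1) - mu)
    empirical = float((deviations >= t).mean())
    bound = float(2.0 * np.exp(-2.0 * n * t * t))
    return empirical, bound
```

The empirical violation rate always stays below the bound, which is distribution-free and hence typically loose.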
Lemma 3
Recall that by definition, \(\tau := \lfloor \frac{T w}{L n} \rfloor\). For any \(\varepsilon > 0\), \(\Vert \hat{\varvec{q}}_{i} - \varvec{q}_{i} \Vert \le \varepsilon\) with probability at least \(1 - 2\,L \exp \left( - {2 \tau }\varepsilon ^2\right) \;.\)
Proof of Lemma 3
Note that the number of times question \(\ell\) is asked for item i is \(\tau\). Using Hoeffding’s inequality (Theorem 4), it is straightforward to check: for any \(\varepsilon > 0\) and \(\ell \in [L]\),
We conclude the proof using the union bound as follows:
\(\square\)
Proof of Lemma 2
By Lemma 3, we have \(\Vert \hat{\varvec{q}}_i - {\varvec{q}}_i \Vert \le \varepsilon\) with probability at least \(1 - 2 L \exp \left( - {2 \varepsilon ^2} \tau \right)\). Suppose \(\Vert \hat{\varvec{q}}_i - {\varvec{q}}_i \Vert \le \varepsilon\) and \(0 < \varepsilon \le \frac{\Vert 2 \varvec{q}_i - 1\Vert }{16}\). Using the triangle inequality and the reverse triangle inequality, we have
Therefore,
which implies that:
From \(0 < \varepsilon \le \frac{\Vert 2 \varvec{q}_i - 1\Vert }{16},\) we have \(0< \frac{2 \varepsilon }{\Vert 2 \varvec{q}_i - 1\Vert } \le \frac{1}{8}\). Now observe that we have:
for all x such that \(0<x< \frac{1}{8}\). Then we obtain:
Then, there exists \(x \in \left[ - \frac{2 \varepsilon }{\Vert 2 {\varvec{q}}_i - 1\Vert }, \frac{16 \varepsilon }{7 \Vert 2 {\varvec{q}}_i - 1\Vert } \right]\) such that \(\frac{1}{\Vert 2 \hat{\varvec{q}}_i - 1\Vert } = \frac{1}{\Vert 2 {\varvec{q}}_i - 1\Vert } (1 + x)\). Using this x, we get:
with probability at least \(1 - 2\,L \exp \left( - {2 \varepsilon ^2} \tau \right)\) for all \(\varepsilon\) such that \(0< \varepsilon \le \frac{\Vert 2 \varvec{q}_i - 1\Vert }{16}\), where in (a), we use \(5x \ge \frac{30}{7}x + \frac{32}{7} x^2\) for all \(0<x<\frac{1}{16}\) and \(0<\varepsilon \le \frac{\Vert 2 \varvec{q}_i - 1 \Vert }{16} \le \frac{1}{16}\). This concludes the proof. \(\square\)
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Ariu, K., Ok, J., Proutiere, A. et al. Optimal clustering from noisy binary feedback. Mach Learn 113, 2733–2764 (2024). https://doi.org/10.1007/s10994-024-06532-z