
Adjusted Count Quantification Learning on Graphs

Clemens Damke, Institute of Informatics, LMU Munich, Germany
Eyke Hüllermeier, Institute of Informatics, LMU Munich, Germany; Munich Center for Machine Learning (MCML); German Centre for Artificial Intelligence (DFKI, DSA)
Abstract
Quantification learning (QL) is the task of predicting the label distribution of a set of instances. We study this problem in the context of graph-structured data, where the instances are vertices. Previously, this problem has only been addressed via node clustering methods. In this paper, we extend the popular Adjusted Classify & Count (ACC) method to graphs. We show that the prior probability shift assumption upon which ACC relies is often not fulfilled and propose two novel graph quantification techniques: Structural importance sampling (SIS) makes ACC applicable in graph domains with covariate shift. Neighborhood-aware ACC improves quantification in the presence of non-homophilic edges. We show the effectiveness of our techniques on multiple graph quantification tasks.


1 Introduction

We consider the task of quantification learning (QL) on graph-structured data. The term was first coined by Forman [2005, 2006, 2008] and describes the task of estimating label prevalences via supervised learning. A QL method receives a set of training instances with known labels, which is used to train a quantifier. The quantifier is then used to predict the label distribution of a set of test instances. Unlike standard instance-wise classification, QL does not concern itself with predicting an accurate label for each test instance but rather with predicting the overall prevalence of each label across all instances. QL can thus be seen as a dataset-level prediction task, where a single prediction is made for a population of instances.

Quantification problems naturally arise in polling and surveying, where the goal is to estimate the proportion of a population that has a certain property or holds a certain opinion. Examples include estimating the proportion of voters that support a certain political party or the proportion of customers that are satisfied with a product. Similarly, QL can be applied to epidemiology or ecological modelling to estimate the prevalence of diseases or species in a given population. We refer to Esuli et al. [2023] for a comprehensive overview of the applications of quantification.

Typically, QL is studied in the context of tabular data, where each instance $x \in \mathcal{X} = \mathbb{R}^d$ is represented by a feature vector. In this setting, instances are assumed to be independent, i.e., the label distribution $p(y \mid x)$ is fully determined by the instance $x$. However, in many real-world applications, this independence assumption does not hold. Consider the example of estimating the proportion of voters supporting a certain party. Assume we have access to a social network where each node represents a voter and each edge represents a social connection. In this case, the label distribution of a voter, i.e., their political preferences, may depend not only on their own features but also on the features of their social connections. Incorporating this relational information into the quantification process can lead to more accurate estimates.

Generally speaking, QL methods can be divided into two categories: aggregative and non-aggregative. Aggregative quantifiers rely upon an instance-wise label estimator, i.e., a regular classifier; the instance-level label estimates are then aggregated to obtain dataset-level label prevalence estimates. Non-aggregative quantifiers, on the other hand, directly estimate dataset-level label prevalences without first predicting labels for each instance. In this paper, we focus on aggregative quantification methods, which are more common and have been studied more extensively. An intuitively plausible aggregative method is to simply estimate the prevalence of a label as the fraction of test instances that the classifier assigns to that label. This method is known as Classify & Count (CC) and, given a perfect classifier, it will yield perfect quantification results. In practice, however, classifiers are not perfect, and even a good but imperfect classifier can lead to poor quantification results. Conversely, even a bad classifier can yield good quantification results. The reason for this disconnect is that the optimization goals of classification and quantification are misaligned. More specifically, while a good binary classifier should minimize the total number of misclassifications, i.e., $\mathrm{FP} + \mathrm{FN}$, a good binary quantifier should minimize $|\mathrm{FP} - \mathrm{FN}|$. If $\mathrm{FP} = \mathrm{FN}$, even a classifier with a high misclassification rate will yield perfect quantification results.
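The gap between classification accuracy and quantification accuracy can be checked on a tiny example. The sketch below uses made-up labels and predictions (both arrays are illustrative assumptions, not data from the paper) in which false positives and false negatives cancel out.

```python
import numpy as np

# Hypothetical toy data: 10 instances, 4 of which are positive.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
# A weak classifier that makes 2 false negatives and 2 false positives.
y_pred = np.array([1, 1, 0, 0, 1, 1, 0, 0, 0, 0])

fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false positives
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # false negatives

accuracy = float(np.mean(y_pred == y_true))
true_prevalence = float(np.mean(y_true == 1))
cc_prevalence = float(np.mean(y_pred == 1))  # Classify & Count estimate

print(f"FP={fp}, FN={fn}, accuracy={accuracy:.2f}")
print(f"true prevalence={true_prevalence:.2f}, CC estimate={cc_prevalence:.2f}")
```

Although 40% of the instances are misclassified, the CC estimate matches the true positive prevalence because the two error types have equal counts.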

This misalignment is commonly addressed by the family of Adjusted Classify & Count (ACC) methods, which use an estimate of the classifier's confusion matrix to adjust the predicted label prevalences [Vucetic and Obradovic, 2001, Saerens et al., 2002, Forman, 2005]. ACC has been shown to estimate the true test label prevalences in expectation if the so-called prior probability shift (PPS) assumption holds, i.e., if the class-conditional training distribution $p(x \mid y)$ equals the class-conditional test distribution $q(x \mid y)$ [Tasche, 2017].

In the following, we investigate the validity of the PPS assumption and thereby of ACC in the context of graph-structured data. In Section 2, we begin with a brief formal description of the quantification problem, the PPS assumption and how it is used in ACC. Section 3 describes two novel structure-aware aggregative quantification methods for graphs: Structural importance sampling and Neighborhood-aware ACC. To our knowledge, this is the first work on graph quantification learning using a node classifier. In Section 4, the proposed methods are evaluated on a series of node classification datasets under different shift assumptions. Finally, we conclude with a brief outlook in Section 5.

2 Quantification Learning

Let $\mathcal{X}$ denote the instance space and $\mathcal{Y} = \{1, \dots, K\}$ the (finite) label space. In QL, we assume that we are given a training set of labeled instances $\mathcal{D}_L \subseteq \mathcal{X} \times \mathcal{Y}$ drawn from a distribution $P(x, y)$ with corresponding density $p$. Additionally, there is a set of test instances $\mathcal{D}_U \subseteq \mathcal{X} \times \mathcal{Y}$, whose labels are unknown to the quantifier, drawn from a test distribution $Q(x, y)$ with corresponding density $q$. The goal of QL is to estimate $Q(Y = i)$ for all $i \in \mathcal{Y}$ given $\mathcal{D}_L$ and $\mathcal{X}_U \coloneqq \{x \mid (x, y) \in \mathcal{D}_U\}$. If $P = Q$, i.e., if the training and test data are drawn from the same distribution, the quantification problem is trivially solved via a maximum likelihood estimate of the label distribution on $\mathcal{D}_L$:

$$\hat{Q}^{\mathrm{MLPE}}(Y = i) \coloneqq \frac{1}{|\mathcal{D}_L|} \sum_{(x, y) \in \mathcal{D}_L} \mathbb{1}[y = i], \tag{1}$$

where $\mathbb{1}[\cdot]$ denotes the indicator function. This Maximum Likelihood Prevalence Estimation (MLPE) approach [Barranquero et al., 2013, Esuli et al., 2023] is akin to the majority classifier in classification in the sense that it predicts the most likely distribution in the absence of test data $\mathcal{X}_U$. However, if the training and test data are not identically distributed, the quantification problem becomes more challenging. A quantification approach has to account for the distribution shift between $P$ and $Q$ to provide accurate estimates of $Q(Y)$. Depending on the nature of this distribution shift, different quantification methods may be more or less suitable.
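The MLPE baseline of Eq. 1 amounts to counting training labels. A minimal sketch follows, assuming labels are encoded as integers 0..K-1 (the text uses 1..K); the function name `mlpe` and the toy label list are illustrative choices.

```python
import numpy as np

def mlpe(train_labels, num_classes):
    """Eq. 1: relative frequency of each label in the training set.

    The test instances are ignored entirely, which is why this estimate
    is only appropriate when training and test distributions coincide.
    """
    train_labels = np.asarray(train_labels)
    counts = np.bincount(train_labels, minlength=num_classes)
    return counts / counts.sum()

# Hypothetical training labels for K = 3 classes.
train_labels = [0, 0, 1, 2, 2, 2, 1, 0]
mlpe_estimate = mlpe(train_labels, num_classes=3)
print(mlpe_estimate)
```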

2.1 Types of Distribution Shift

If the training and test distributions differ, one should ask whether learning from the training data is still feasible. Certainly, if $P$ and $Q$ are completely unrelated, any information learned from $\mathcal{D}_L$ is useless for predicting $Q(Y)$. Quantification approaches therefore typically assume that $P$ and $Q$ are related in some way. The applicability of a quantification method then depends on whether those assumptions hold true for the given problem. First, note that $q$ can be expressed as

$$q(x, y) = q(y \mid x)\, q(x) = q(x \mid y)\, q(y).$$

By fixing one of the factors in the two right-hand terms, we obtain three types of distribution shifts [Esuli et al., 2023]:

  1. Concept Shift: The conditional label distribution changes, but the distribution of the instances remains the same, i.e., $q(y \mid x) \neq p(y \mid x)$, while $q(x) = p(x)$. This type of shift, also referred to as concept drift, can occur in domains with classes that are defined relative to some frame of reference. For example, consider the task of predicting the prevalence of local vs. world news articles in a newspaper. While the distribution of news articles may remain the same between training and test, the definition of what constitutes local or world news depends on the location of the newspaper.

  2. Covariate Shift: The distribution of the instances changes, but the conditional label distribution remains the same, i.e., $q(x) \neq p(x)$, while $q(y \mid x) = p(y \mid x)$. This is common in domain adaptation, where the training and test data are drawn from different but related domains. For example, assume the task is to predict the prevalence of a certain sentiment or opinion in social media posts. The training data may be drawn from one social media platform, while the test data is drawn from another. Given a post $x$, the probability of it expressing a certain sentiment $y$ is likely the same on both platforms, but the distribution of posts may differ.

  3. Prior Probability Shift: The label distribution changes, but not the class-conditional instance distribution, i.e., $q(y) \neq p(y)$, while $q(x \mid y) = p(x \mid y)$. Similar to covariate shift, prior probability shift (PPS) occurs between domains that share the same label concepts. For example, consider the task of predicting the percentage of a population that has a certain disease. The training data may come from a case-control study consisting of an equal number of healthy and infected individuals, while the test data is drawn from the general population. Given $y \in \{\mathit{infected}, \mathit{healthy}\}$, the feature distribution of an individual $x$ should be the same between training and test, whereas the prevalence of the disease will likely not be.

We do not consider the case where $q(y) = p(y)$, as this would imply that the label distribution remains unchanged, in which case the quantification problem is trivially solved by MLPE. Note that the difference between covariate shift and PPS is subtle. Whether it is $p(x)$ or $p(y)$ that changes between training and test is mostly a matter of the assumed causal relation between instances and labels, i.e., whether it is in the direction $\mathcal{X} \to \mathcal{Y}$ or $\mathcal{Y} \to \mathcal{X}$ [Fawcett and Flach, 2005, Schölkopf et al., 2012, Kull and Flach, 2014]. In QL, PPS is commonly assumed, as there are many $\mathcal{Y} \to \mathcal{X}$ domains in which this is reasonable [González et al., 2024]. Generally speaking, quantification under concept or covariate shift is more challenging and often requires additional assumptions or domain knowledge. We will get back to the question of which shift assumptions are appropriate for a given domain in Section 3.
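To make the subtle difference between the shift types concrete, the following sketch builds a small discrete example of prior probability shift. All numbers (the class-conditional table and both priors) are invented for illustration. It shows that keeping $p(x \mid y)$ fixed while changing the prior alters both the instance marginal and the posterior, so PPS is distinct from covariate shift, which would keep the posterior fixed.

```python
import numpy as np

# Hypothetical class-conditionals p(x | y) over 3 discrete feature values;
# row index = label y, column index = feature value x.
p_x_given_y = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.3, 0.6]])
p_y = np.array([0.5, 0.5])  # training label prior (assumed)
q_y = np.array([0.9, 0.1])  # test label prior (assumed)

def joint(x_given_y, prior):
    """joint[y, x] = p(x | y) * p(y)."""
    return x_given_y * prior[:, None]

def posterior(joint_yx):
    """posterior[y, x] = p(y | x), obtained by normalising each column."""
    return joint_yx / joint_yx.sum(axis=0, keepdims=True)

p_joint = joint(p_x_given_y, p_y)
q_joint = joint(p_x_given_y, q_y)  # PPS: identical p(x | y), different prior

p_x, q_x = p_joint.sum(axis=0), q_joint.sum(axis=0)
p_y_given_x, q_y_given_x = posterior(p_joint), posterior(q_joint)

print("p(x):", p_x, " q(x):", q_x)
print("p(y|x):\n", p_y_given_x, "\nq(y|x):\n", q_y_given_x)
```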

2.2 Adjusted Count

We will now describe the Adjusted Classify & Count (ACC) method, a popular approach to quantification under PPS [Forman, 2005]. As mentioned in the introduction, the naïve Classify & Count method estimates the prevalence of a label as the fraction of test instances that are predicted to belong to that label by a classifier $h : \mathcal{X} \to \mathcal{Y}$:

$$\hat{Q}^{\mathrm{CC}}(Y = i) \coloneqq \frac{1}{|\mathcal{X}_U|} \sum_{x \in \mathcal{X}_U} \mathbb{1}[h(x) = i]. \tag{2}$$

Since $h$ is trained on data drawn from $P$, the estimated prevalences $\hat{Q}^{\mathrm{CC}}(Y)$ will be biased towards $P(Y)$. ACC removes this bias by adjusting the predicted label prevalences based on an estimate of the classifier's confusion matrix. To understand ACC, note that the PPS assumption implies that

$$\begin{aligned} Q(\hat{Y} = j) &= \sum_{i=1}^{K} Q(\hat{Y} = j \mid Y = i) \cdot Q(Y = i) \\ &= \sum_{i=1}^{K} P(\hat{Y} = j \mid Y = i) \cdot Q(Y = i) \end{aligned} \tag{3}$$

for all $j \in \mathcal{Y}$, with $\hat{Y} = h(X)$. Given $\mathcal{D}_L$ and $\mathcal{X}_U$, we can obtain unbiased estimates of both $Q(\hat{Y})$ and $P(\hat{Y} \mid Y)$:

$$\begin{aligned} \hat{Q}(\hat{Y} = j) &= \hat{Q}^{\mathrm{CC}}(Y = j), \\ \hat{P}(\hat{Y} = j \mid Y = i) &= \frac{\sum_{(x, y) \in \mathcal{D}_L} \mathbb{1}[h(x) = j \land y = i]}{|\{(x, y) \in \mathcal{D}_L \mid y = i\}|}. \end{aligned} \tag{4}$$

Plugging these estimates into Eq. 3 yields a system of equations which can be solved to obtain estimates of $Q(Y)$ [Saerens et al., 2002]. Let $\hat{\mathbf{C}} \in [0, 1]^{K \times K}$ be the estimated confusion matrix of $h$ on $P$, i.e., $\hat{C}_{j,i} = \hat{P}(\hat{Y} = j \mid Y = i)$. Then, the ACC estimates of $Q(Y)$ are given by

$$\hat{Q}^{\mathrm{ACC}}(Y) = \hat{\mathbf{C}}^{-1} \cdot \hat{Q}(\hat{Y}). \tag{5}$$

While the binary version of ACC goes back at least to Gart and Buck [1966], it was first described as a quantification method by Vucetic and Obradovic [2001]. Tasche [2017] showed that ACC is an unbiased estimator of the true test label prevalences if the PPS assumption holds.
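The pipeline of Eqs. 2, 4 and 5 can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function names and the toy label arrays are assumptions, labels are encoded as 0..K-1, and every class is assumed to occur in the training set so that each column of the confusion matrix can be normalised.

```python
import numpy as np

def classify_and_count(y_pred, num_classes):
    """Eq. 2: fraction of test instances predicted as each class."""
    y_pred = np.asarray(y_pred)
    return np.bincount(y_pred, minlength=num_classes) / y_pred.size

def confusion_matrix_estimate(y_true, y_pred, num_classes):
    """Eq. 4: C[j, i] = share of training instances with label i predicted as j."""
    C = np.zeros((num_classes, num_classes))
    np.add.at(C, (np.asarray(y_pred), np.asarray(y_true)), 1.0)
    return C / C.sum(axis=0, keepdims=True)  # normalise per true class (column)

def acc_by_inversion(C_hat, q_pred):
    """Eq. 5: solve C_hat @ q = q_pred (requires an invertible C_hat)."""
    return np.linalg.solve(C_hat, q_pred)

# Hypothetical binary training data: 10 instances per class.
y_train = np.array([0] * 10 + [1] * 10)
h_train = np.array([0] * 8 + [1] * 2 + [1] * 7 + [0] * 3)  # classifier outputs
# Hypothetical classifier outputs on 10 unlabeled test instances.
h_test = np.array([0] * 4 + [1] * 6)

C_hat = confusion_matrix_estimate(y_train, h_train, num_classes=2)
cc_estimate = classify_and_count(h_test, num_classes=2)
acc_estimate = acc_by_inversion(C_hat, cc_estimate)
print("CC: ", cc_estimate)
print("ACC:", acc_estimate)
```

In this toy setup, CC reports a 40/60 split, while the adjustment recovers a 20/80 split, which is the unique prevalence vector consistent with the estimated error rates.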

Note that there are two practical problems with Eq. 5: First, if $\hat{\mathbf{C}}$ is not invertible, there might be no or multiple solutions for $\hat{Q}^{\mathrm{ACC}}(Y)$. Second, the adjusted label prevalences may not be a valid distribution over $\mathcal{Y}$, i.e., they could lie outside $[0, 1]$ or not sum to one. Possible reasons for this are that the PPS assumption might not be (fully) satisfied or simply that the estimates $\hat{\mathbf{C}}$ and $\hat{Q}(\hat{Y})$ are noisy, e.g., due to small sample sizes. A number of solutions to these problems have been proposed in the literature, including clipping and rescaling the estimates [Forman, 2008], adjusting the confusion matrix [Lipton et al., 2018], using the pseudo-inverse of $\hat{\mathbf{C}}$ or replacing the system of equations with a constrained optimization problem [Bunse, 2022]. In this work, we will use the last approach, i.e., constrained optimization, to solve Eq. 5:

$$\hat{Q}^{\mathrm{ACC}}(Y) = \operatorname*{arg\,min}_{\mathbf{q} \in \Delta_K} \left\| \hat{\mathbf{C}} \cdot \mathbf{q} - \hat{Q}(\hat{Y}) \right\|_2^2, \tag{6}$$

where $\Delta_K$ denotes the unit $(K - 1)$-simplex. This problem can be solved numerically, e.g., using a (quasi-)Newton method such as Sequential Least Squares Quadratic Programming (SLSQP). Bunse [2022] has shown that this approach is a sensible default choice, as it generally performs well in practice.
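One way to solve Eq. 6 numerically is SciPy's SLSQP solver, sketched below. The function name, the uniform starting point, the tightened `ftol`, and the final clip-and-renormalise step are assumptions of this sketch; the paper does not specify these details. The example inputs are chosen so that plain matrix inversion (Eq. 5) would yield the invalid vector (-0.1, 1.1).

```python
import numpy as np
from scipy.optimize import minimize

def acc_constrained(C_hat, q_pred):
    """Eq. 6: least-squares fit of C_hat @ q to q_pred over the probability simplex."""
    K = C_hat.shape[1]

    def objective(q):
        return np.sum((C_hat @ q - q_pred) ** 2)

    def gradient(q):
        return 2.0 * C_hat.T @ (C_hat @ q - q_pred)

    result = minimize(
        objective,
        x0=np.full(K, 1.0 / K),            # start from the uniform distribution
        jac=gradient,
        method="SLSQP",
        bounds=[(0.0, 1.0)] * K,           # each prevalence lies in [0, 1]
        constraints=[{"type": "eq", "fun": lambda q: np.sum(q) - 1.0}],
        options={"ftol": 1e-12},
    )
    q = np.clip(result.x, 0.0, 1.0)        # remove tiny numerical violations
    return q / q.sum()

# Hypothetical estimates for a binary problem.
C_hat = np.array([[0.8, 0.3],
                  [0.2, 0.7]])
q_pred = np.array([0.25, 0.75])

q_simplex = acc_constrained(C_hat, q_pred)
print(q_simplex)
```

Here the unconstrained solution falls outside the simplex, so the constrained solver returns the closest feasible point, which places all mass on the second class.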

In addition to the CC and ACC methods described above, which use a hard classifier $h : \mathcal{X} \to \mathcal{Y}$, one can also use a probabilistic classifier $h_s : \mathcal{X} \to \Delta_K$ [Bella et al., 2010]. Analogous to CC, Probabilistic Classify & Count (PCC) is defined as

$$\hat{Q}^{\mathrm{PCC}}(Y = i) \coloneqq \frac{1}{|\mathcal{X}_U|} \sum_{x \in \mathcal{X}_U} h_s(x)_i. \tag{7}$$

Likewise, Probabilistic Adjusted Classify & Count (PACC) estimates $Q(\hat{Y})$ and $P(\hat{Y} \mid Y)$ using predicted label probabilities:

$$\begin{aligned} \hat{Q}(\hat{Y} = j) &= \hat{Q}^{\mathrm{PCC}}(Y = j), \\ \hat{P}(\hat{Y} = j \mid Y = i) &= \frac{\sum_{(x, y) \in \mathcal{D}_L} h_s(x)_j \cdot \mathbb{1}[y = i]}{|\{(x, y) \in \mathcal{D}_L \mid y = i\}|}. \end{aligned} \tag{8}$$

The motivation for using a soft classifier instead of a hard one is that predicted label probabilities can be more informative than hard labels. Whether this is truly the case is problem-dependent and hinges on the quality of the predicted probabilities.
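The soft variants in Eqs. 7 and 8 replace hard counts with averaged probability vectors. The sketch below is illustrative only: the function names and the probability arrays are assumptions, labels are encoded as 0..K-1, and the final adjustment uses the plain linear solve of Eq. 5 for brevity (the constrained formulation of Eq. 6 could be substituted).

```python
import numpy as np

def pcc(test_probs):
    """Eq. 7: average predicted probability vector over the test instances."""
    return np.asarray(test_probs).mean(axis=0)

def soft_confusion_matrix(train_probs, y_true, num_classes):
    """Eq. 8: C[j, i] = mean predicted probability of class j among
    training instances whose true label is i (every class must occur)."""
    train_probs, y_true = np.asarray(train_probs), np.asarray(y_true)
    columns = [train_probs[y_true == i].mean(axis=0) for i in range(num_classes)]
    return np.stack(columns, axis=1)

# Hypothetical soft classifier outputs on labeled training instances.
train_probs = np.array([[0.9, 0.1], [0.7, 0.3], [0.2, 0.8], [0.4, 0.6]])
y_train = np.array([0, 0, 1, 1])
# Hypothetical soft classifier outputs on unlabeled test instances.
test_probs = np.array([[0.5, 0.5], [0.3, 0.7]])

C_soft = soft_confusion_matrix(train_probs, y_train, num_classes=2)
pcc_estimate = pcc(test_probs)
pacc_estimate = np.linalg.solve(C_soft, pcc_estimate)
print("PCC: ", pcc_estimate)
print("PACC:", pacc_estimate)
```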

3 Graph Quantification Learning

We now turn to the problem of quantification learning on graph-structured data. In Section 2, we assumed that the instances in $\mathcal{D}_L$ and $\mathcal{D}_U$ are i.i.d. with respect to $P$ and $Q$, respectively. This assumption does not hold for graph-structured data, where the instances are the vertices of a graph and the labels are associated with the vertices. More specifically, let $G = (V, E)$ be a graph with vertex set $V$ and edge set $E \subseteq V \times V$. Each vertex $v_i \in V$ is associated with a feature vector $x_i \in \mathcal{X}$ and a label $y_i \in \mathcal{Y}$. We use $\mathcal{N}(v_i) = \{v_j \mid (v_i, v_j) \in E\}$ to denote the set of neighbors of $v_i$. The edges in $G$ are used to encode homophily between vertices, i.e., similar vertices are more likely to be connected. Formally, an edge $(v_i, v_j) \in E$ should indicate that $P(y_i = y_j) \geq \varepsilon$, with $\varepsilon$ being either a graph-specific constant or a function of an edge weight $w_{i,j} \in \mathbb{R}$. Since homophily is symmetric by definition, $G$ is undirected, i.e., $(v_i, v_j) \in E \Leftrightarrow (v_j, v_i) \in E$. Such homophilic graphs are commonly used to represent social networks, citation networks, co-purchase graphs or the World Wide Web.
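The graph notation above maps directly onto an adjacency structure. The sketch below builds symmetric neighbor sets $\mathcal{N}(v_i)$ from an edge list and computes the empirical fraction of edges whose endpoints share a label. That fraction is a common homophily summary used here only as a rough stand-in for the bound $\varepsilon$; it is not a quantity defined in the paper, and the toy graph and labels are invented.

```python
from collections import defaultdict

def build_neighbors(edges):
    """Return N(v) for every vertex, storing each edge in both directions
    so that the resulting graph is undirected."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    return neighbors

def same_label_edge_fraction(edges, labels):
    """Fraction of edges that connect two vertices with the same label."""
    same = sum(labels[u] == labels[v] for u, v in edges)
    return same / len(edges)

# Hypothetical graph: vertices 0-2 carry label 0, vertices 3-4 carry label 1.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 2)]
labels = [0, 0, 0, 1, 1]

neighbors = build_neighbors(edges)
homophily = same_label_edge_fraction(edges, labels)
print(sorted(neighbors[2]), homophily)
```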

Figure 1: The Amazon Photos co-purchase graph. Colors indicate vertex labels ($K = 8$). The highlighted vertices are misclassifications by an APPNP classifier.

Figure 1 shows one such graph, namely the Amazon Photos co-purchase graph [Shchur et al., 2019], where vertices represent products, edges indicate that two products are frequently bought together and labels represent product categories. Due to homophily, the product categories form separate densely connected clusters, while cross-category edges are sparse.

Analogous to the tabular case, in graph quantification learning (GQL) we are given a training set of labeled vertices $\mathcal{D}_L$ drawn from a distribution $P$ and our goal is to estimate the label distribution of the vertices in a test set $\mathcal{D}_U$ drawn from a distribution $Q$. Given some vertex classifier $h_G : V \to \mathcal{Y}$, the GQL problem is, in principle, amenable to standard aggregative quantification methods, such as ACC or PACC. As discussed in Section 2.2, those adjusted count methods assume PPS, which in turn assumes a $\mathcal{Y} \to \mathcal{X}$ domain. This means that both the training and the test data are assumed to be generated by sampling from some fixed distribution $p(x \mid y)$ for all $y \in \mathcal{Y}$. We argue that this is often unrealistic for graph-structured data.

Consider the example of estimating the proportion of users holding a certain opinion. Here, the training data $\mathcal{D}_L$ may come from a social network where a (non-representative) local subset of users was sampled. The test data $\mathcal{D}_U$, on the other hand, may come from the entire social network or possibly some local subcluster of interest. In this setting, it is the instance distribution $p(x)$ that changes, while $p(y \mid x)$ remains fixed, i.e., covariate shift. More generally, a sampling process that is structure-dependent, e.g., by sampling local training or test neighborhoods, has covariate shift, not PPS. We will now discuss how such structural biases can be accounted for in the quantification process.

3.1 Structural Importance Sampling

ACC depends on being able to estimate the test confusion matrix $\mathbf{C}$ from training data, with $C_{j,i} \coloneqq Q(\hat{Y} = j \mid Y = i)$. As described, estimating $\mathbf{C}$ is trivial under PPS. We will now introduce structural importance sampling (SIS), a novel approach to graph quantification learning under covariate shift. First, note that $C_{j,i}$ can be expressed as

\begin{align}
C_{j,i} &= \sum_{v \in V} \mathds{1}[h_G(v) = j] \cdot Q(v \mid Y = i) \tag{9} \\
&= \sum_{v \in V} \mathds{1}[h_G(v) = j] \underbrace{\frac{Q(v \mid Y = i)}{P(v \mid Y = i)}}_{= \rho(v \mid i)} \cdot P(v \mid Y = i). \nonumber
\end{align}

Using the covariate shift assumption, we can rewrite $\rho$ as

\begin{align}
\rho(v \mid i) &= \frac{Q(Y = i \mid v) \cdot Q(v) \cdot P(Y = i)}{P(Y = i \mid v) \cdot P(v) \cdot Q(Y = i)} \nonumber \\
&= \frac{Q(v)}{P(v)} \cdot \frac{P(Y = i)}{Q(Y = i)} = \rho(v) \cdot \rho(i)^{-1}. \tag{10}
\end{align}
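The factorization in Eq. 10 can be checked numerically on a toy example. The following is a minimal sketch, assuming three vertices, two labels and a label posterior that is shared between $P$ and $Q$ (the covariate shift assumption); all numbers and helper names are illustrative and not taken from the experiments.

```python
# Toy check of Eq. 10: under covariate shift, rho(v | i) = rho(v) * rho(i)^{-1}.
# All numbers below are made up for illustration.

# Shared label posterior p(y | v) for 3 vertices and 2 labels (covariate shift).
posterior = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
P_v = [0.6, 0.3, 0.1]  # training vertex distribution P(v)
Q_v = [0.2, 0.3, 0.5]  # test vertex distribution Q(v)


def label_marginal(p_v):
    """Marginal label distribution: sum_v p(y | v) * p(v)."""
    return [sum(posterior[v][i] * p_v[v] for v in range(3)) for i in range(2)]


def class_conditional(p_v, i):
    """Bayes' rule: p(v | Y=i) = p(Y=i | v) * p(v) / p(Y=i)."""
    p_i = label_marginal(p_v)[i]
    return [posterior[v][i] * p_v[v] / p_i for v in range(3)]


P_y, Q_y = label_marginal(P_v), label_marginal(Q_v)
max_gap = 0.0
for i in range(2):
    for v in range(3):
        lhs = class_conditional(Q_v, i)[v] / class_conditional(P_v, i)[v]
        rhs = (Q_v[v] / P_v[v]) * (P_y[i] / Q_y[i])  # rho(v) * rho(i)^{-1}
        max_gap = max(max_gap, abs(lhs - rhs))
```

In this toy setting, both sides agree up to floating-point error for every vertex and label, even though the label marginals `P_y` and `Q_y` differ.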

Thus, $\mathbf{C}$ can be obtained by reweighting the vertices:

\begin{align}
C_{j,i} &= \rho(i)^{-1} \sum_{v \in V} \mathds{1}[h_G(v) = j] \cdot \rho(v) \cdot P(v \mid Y = i) \nonumber \\
&= \frac{\rho(i)^{-1} \sum_{v \in V} \mathds{1}[h_G(v) = j] \cdot \rho(v) \cdot P(v \mid Y = i)}{\rho(i)^{-1} \sum_{v \in V} \rho(v) \cdot P(v \mid Y = i)} \nonumber \\
&= \frac{\mathbb{E}_{v \sim P(\cdot \mid i)}\left[\mathds{1}[h_G(v) = j] \cdot \rho(v)\right]}{\mathbb{E}_{v \sim P(\cdot \mid i)}\left[\rho(v)\right]}. \tag{11}
\end{align}

Given $\mathcal{D}_L$, we can obtain an unbiased estimate of $C_{j,i}$:

\begin{align}
\hat{C}_{j,i} = \frac{\sum_{(v,y) \in \mathcal{D}_L} \mathds{1}[h_G(v) = j \land y = i] \cdot \rho(v)}{\sum_{(v,y) \in \mathcal{D}_L} \rho(v) \cdot \mathds{1}[y = i]}. \tag{12}
\end{align}
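The estimator in Eq. 12 amounts to a weighted confusion count followed by a per-class normalization. A minimal sketch, assuming labels are integers in $0, \dots, K-1$ and that the weights $\rho(v)$ are already given; the function name and the handling of classes without training mass are assumptions made for illustration:

```python
def weighted_confusion(preds, labels, weights, num_classes):
    """Estimate C[j][i] ~ Q(Yhat=j | Y=i) from labeled data, following Eq. 12.

    preds, labels: predicted / true label per labeled vertex (ints in 0..K-1).
    weights: importance weight rho(v) per labeled vertex.
    """
    K = num_classes
    numer = [[0.0] * K for _ in range(K)]  # numer[j][i]: weighted count of (pred j, true i)
    denom = [0.0] * K                      # denom[i]: total weight of true label i
    for j, i, w in zip(preds, labels, weights):
        numer[j][i] += w
        denom[i] += w
    # Assumption: columns of classes without weighted training mass stay at zero.
    return [[numer[j][i] / denom[i] if denom[i] > 0 else 0.0 for i in range(K)]
            for j in range(K)]


preds = [0, 0, 1, 1, 1, 0]
labels = [0, 0, 0, 1, 1, 1]
C_uniform = weighted_confusion(preds, labels, [1.0] * 6, 2)
C_weighted = weighted_confusion(preds, labels, [1, 1, 4, 1, 1, 2], 2)
```

With uniform weights, the sketch returns the ordinary column-normalized confusion matrix; upweighting a misclassified vertex shifts the corresponding column towards the wrong prediction.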

Note that this is essentially a weighted version of Eq. 4.

The problem with this formulation is that it requires $\rho(v) = \frac{Q(v)}{P(v)}$, which cannot be computed since $Q(v)$ is unknown. We do however have access to $\mathcal{X}_U$, which is sampled from $Q$. Using a suitable vertex kernel $k : V \times V \to \mathbb{R}$, we can thus estimate $\rho(v)$ via kernel density estimation:

\begin{align}
\rho(v) \mathrel{\overset{\propto}{\sim}} \hat{\rho}(v) \coloneqq \frac{1}{|\mathcal{X}_U|} \sum_{v' \in \mathcal{X}_U} k(v, v'). \tag{21}
\end{align}
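Eq. 21 is a plain average of kernel values over the unlabeled test vertices. The sketch below assumes vertices are hashable identifiers and that the kernel is passed in as a Python callable; the toy kernel `2 ** -|v - u|` on integer vertex ids is purely hypothetical and only serves to produce non-constant weights.

```python
def estimate_rho(vertices, test_vertices, kernel):
    """rho_hat(v): mean of kernel(v, v') over the unlabeled test vertices v' (Eq. 21)."""
    return {v: sum(kernel(v, u) for u in test_vertices) / len(test_vertices)
            for v in vertices}


# Hypothetical toy kernel on integer vertex ids (not one of the graph kernels).
def toy_kernel(v, u):
    return 2.0 ** -abs(v - u)


def constant_kernel(v, u):
    return 1.0


rho_toy = estimate_rho([0, 1, 2], [2, 3], toy_kernel)
rho_const = estimate_rho([0, 1, 2], [2, 3], constant_kernel)
```

Training vertices that are close to the test sample under the kernel receive larger weights, while the constant kernel assigns the same weight to every vertex.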

The suitability of the kernel $k$ depends on the nature of the test distribution $Q$. For example, if $Q(v \mid y)$ is uniform for all $y \in \mathcal{Y}$, i.e., the sampling is structure-agnostic, the constant kernel $k_1(v, v') = 1$ would be appropriate. Note that Eq. 12 simplifies to standard ACC in this case, as structure-agnostic test sampling implies PPS.

If the test nodes are sampled via a randomized breadth-first search (BFS), a shortest-path (SP) kernel, e.g.,

\begin{align}
k_{\mathrm{SP}}(v, v') = \exp(-\lambda \cdot d_{\mathrm{SP}}(v, v')), \tag{22}
\end{align}

might be more appropriate, where $d_{\mathrm{SP}}(v, v')$ is the length of the shortest path between $v$ and $v'$ and $\lambda > 0$ is a tunable hyperparameter. Similarly, for random walk (RW) sampling, one can use a kernel based on the personalized PageRank (PPR) algorithm [Page et al., 1999]:

\begin{align}
k_{\mathrm{PPR}}(v, v') = \Pi^L_{v',v}, \tag{23}
\end{align}

where $\Pi^L_{v',v}$ denotes the probability that a random walk of length $L$ starting at $v'$ ends at $v$. In general, the kernel should be treated as a hyperparameter which has to be tuned to the specific problem at hand.
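Both kernels can be sketched for an unweighted graph stored as an adjacency dictionary. This is a minimal illustration, not the implementation used in the experiments: the function names are hypothetical, unreachable vertex pairs are assigned a kernel value of 0, the walk is a plain uniform random walk of exactly $L$ steps without teleportation, and every vertex is assumed to have at least one neighbor.

```python
import math
from collections import deque


def shortest_path_lengths(adj, source):
    """BFS distances from source in an unweighted graph given as an adjacency dict."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def sp_kernel(adj, v, v_prime, lam=0.5):
    """k_SP(v, v') = exp(-lam * d_SP(v, v')); assumption: 0 if v' is unreachable."""
    d = shortest_path_lengths(adj, v).get(v_prime)
    return 0.0 if d is None else math.exp(-lam * d)


def rw_kernel(adj, v, v_prime, length):
    """Probability that a uniform random walk of `length` steps from v' ends at v."""
    probs = {v_prime: 1.0}
    for _ in range(length):
        nxt = {}
        for u, p in probs.items():
            for w in adj[u]:
                nxt[w] = nxt.get(w, 0.0) + p / len(adj[u])
        probs = nxt
    return probs.get(v, 0.0)


# Path graph 0 - 1 - 2 - 3 as a toy example.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
k_sp = sp_kernel(path, 0, 3, lam=0.5)
k_rw = rw_kernel(path, 2, 0, length=2)
```

For large graphs, recomputing a BFS per kernel evaluation is wasteful; one would typically cache distances or propagate probabilities from all test vertices at once.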

To summarize, SIS enables graph quantification under covariate shift by estimating the test confusion matrix $\mathbf{C}$ using a kernel density estimate of the test instance distribution $Q$. Using this estimate, the adjusted label prevalences can be computed using Eq. 6.

3.2 Neighborhood-aware Adjusted Count

In the previous section, we described how GQL can be performed even under covariate shift. In addition to the nature of the distribution shift, there is another aspect to consider when applying ACC: class identifiability. Consider a classifier that is unable to distinguish between two classes $i$ and $j$, i.e., it predicts the same label for both. In this case, $\mathbf{C}_{:,i} = \mathbf{C}_{:,j}$ and thus there is no unique solution for Eq. 5. This can lead to poor quantification results if the prediction vector $Q(\hat{Y})$ has a large overlap with both $\mathbf{C}_{:,i}$ and $\mathbf{C}_{:,j}$, since any distribution of probability mass between both classes may then be returned. To address this issue, we propose Neighborhood-aware ACC (NACC), which uses the neighborhood structure of the graph to improve class identifiability.
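The identifiability problem can be made concrete with a small numerical example. The sketch below uses a made-up $3 \times 3$ confusion matrix whose first two columns are identical and shows that two different prevalence vectors induce the same distribution of predictions; the matrix entries and names are illustrative assumptions.

```python
def predicted_distribution(C, q):
    """Q(Yhat = j) = sum_i C[j][i] * q[i] for a confusion matrix C and prevalences q."""
    return [sum(row[i] * q[i] for i in range(len(q))) for row in C]


# Classes 0 and 1 are indistinguishable: columns 0 and 1 of C are identical.
C = [[0.5, 0.5, 0.1],
     [0.5, 0.5, 0.1],
     [0.0, 0.0, 0.8]]
q_a = [0.6, 0.1, 0.3]
q_b = [0.1, 0.6, 0.3]  # same total mass on classes 0 and 1, split differently

p_a = predicted_distribution(C, q_a)
p_b = predicted_distribution(C, q_b)
indistinguishable = all(abs(a - b) < 1e-12 for a, b in zip(p_a, p_b))
```

Since `p_a` and `p_b` coincide, no adjustment based on this confusion matrix alone can tell `q_a` and `q_b` apart.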

First, note that Eq. 6 can be understood as finding a mixture of the columns of $\mathbf{C}$ that best approximates $Q(\hat{Y})$. In the case of collinear columns, this mixture is not unique. A simple way to break such symmetries is to use the neighborhood structure of the graph:

\begin{align*}
Q(\hat{Y}_{\mathcal{N}} = (j, k)) = \sum_{i=1}^{K} Q(\hat{Y}_{\mathcal{N}} = (j, k) \mid Y = i) \cdot Q(Y = i),
\end{align*}

where $\hat{Y}_{\mathcal{N}}$ is a random variable representing a tuple of the predicted label of a vertex and the majority predicted label of its neighbors. Using this decomposition of $Q(\hat{Y}_{\mathcal{N}})$, $Q(Y)$ can be estimated using ACC, and possibly SIS, with the only difference being that the confusion matrix estimate is now of shape $K^2 \times K$. Intuitively, this approach uses information on the presence or absence of homophily to improve class identifiability.
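The construction of $\hat{Y}_{\mathcal{N}}$ and the resulting $K^2 \times K$ confusion estimate can be sketched as follows. This is an unweighted illustration under several assumptions that the text does not fix: ties in the neighbor majority are broken towards the smallest label, isolated vertices fall back to their own prediction, and the tuple $(j, k)$ is mapped to row index $j \cdot K + k$. SIS weights could be added in the same way as in Eq. 12.

```python
from collections import Counter


def neighborhood_predictions(adj, preds):
    """Map each vertex to (own predicted label, majority predicted label of its neighbors)."""
    result = {}
    for v, neighbors in adj.items():
        if neighbors:
            counts = Counter(preds[u] for u in neighbors)
            # Assumption: ties are broken towards the smallest label.
            majority = min(counts, key=lambda label: (-counts[label], label))
        else:
            majority = preds[v]  # assumption: isolated vertices use their own prediction
        result[v] = (preds[v], majority)
    return result


def nacc_confusion(adj, preds, labels, num_classes):
    """Estimate Q(Yhat_N = (j, k) | Y = i) as a (K*K) x K matrix; row index is j*K + k."""
    K = num_classes
    pairs = neighborhood_predictions(adj, preds)
    counts = [[0.0] * K for _ in range(K * K)]
    totals = [0.0] * K
    for v, i in labels.items():  # only labeled vertices contribute
        j, k = pairs[v]
        counts[j * K + k][i] += 1.0
        totals[i] += 1.0
    return [[c / totals[i] if totals[i] > 0 else 0.0 for i, c in enumerate(row)]
            for row in counts]


# Path graph 0 - 1 - 2 - 3 with two classes.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
preds = {0: 0, 1: 0, 2: 1, 3: 1}
labels = {0: 0, 1: 0, 2: 1, 3: 1}
pairs = neighborhood_predictions(adj, preds)
C_nacc = nacc_confusion(adj, preds, labels, 2)
```

In this toy graph, the two vertices of class 1 end up in different rows, (1, 0) and (1, 1), which is the kind of additional structure that can separate otherwise identical confusion profiles.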

Consider Fig. 1, where a vertex is highlighted if it is misclassified by an APPNP classifier [Gasteiger et al., 2018]. Note that the vertices with label 7 (dark green) are often confused with vertices of label 1 (blue) or 6 (orange) because there are many non-homophilic edges between those classes. Using ACC, this would imply that the column vectors of labels 7, 1 and 6 are collinear, i.e., $C_{:,7} \approx \alpha \cdot C_{:,1} + (1 - \alpha) \cdot C_{:,6}$ for some $\alpha \in [0, 1]$. Using the neighborhood structure, NACC can break this symmetry. For labels 1 and 6, the majority of predicted neighbors will nearly always be of the same label due to homophily, whereas for label 7, both $\hat{Y}_{\mathcal{N}} = (1, 6)$ and $\hat{Y}_{\mathcal{N}} = (6, 1)$ are common. With this information, NACC is able to distinguish the confusion profile of label 7 from those of labels 1 and 6.

In principle, one could extend NACC to use even more neighborhood information, e.g., by considering the majority label of the neighbors of neighbors or by considering the second-most predicted neighboring label. However, given a finite training set $\mathcal{D}_L$, making the confusion profiles more fine-grained makes the confusion estimate $\hat{\mathbf{C}}$ noisier, counteracting the potential gains of additional information. We found that using the 1-hop majority label is a good trade-off between class identifiability and confusion estimate noise.

4 Evaluation

We assess the performance of SIS and NACC on a series of graph quantification tasks under both PPS and covariate shift. The quantification methods are applied to the predictions of multiple node classifiers. As baselines, we compare our proposed GQL methods with MLPE, (P)CC and (P)ACC.

4.1 Experimental Setup

Quantification Metrics

There is a large number of metrics to evaluate quantification methods [Esuli et al., 2023]. We use the following three: The absolute error (AE) is one of the most commonly used metrics in quantification. It is defined as

\begin{align}
\mathrm{AE}(q, \hat{q}) = \frac{1}{K} \sum_{i=1}^{K} |q_i - \hat{q}_i|. \tag{24}
\end{align}

The relative absolute error (RAE) [González-Castro et al., 2013] is a reweighted version of the AE that penalizes deviations from labels with low prevalence more heavily:

\begin{align}
\mathrm{RAE}(q, \hat{q}) = \frac{1}{K} \sum_{i=1}^{K} \frac{|q_i - \hat{q}_i|}{q_i}. \tag{25}
\end{align}

Last, we also use the Kullback-Leibler divergence (KLD) to measure the divergence between the true and estimated label distributions.
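The three metrics can be sketched in a few lines. AE and RAE follow Eqs. 24 and 25 directly; for the KLD, the sketch assumes the standard definition $\sum_i q_i \log(q_i / \hat{q}_i)$ with the true distribution as the first argument. The `eps` clipping is an assumption added to avoid division by zero and is not specified in the text.

```python
import math


def absolute_error(q, q_hat):
    """AE (Eq. 24): mean absolute deviation between two prevalence vectors."""
    return sum(abs(a - b) for a, b in zip(q, q_hat)) / len(q)


def relative_absolute_error(q, q_hat, eps=1e-12):
    """RAE (Eq. 25); assumption: true prevalences are clipped at eps before dividing."""
    return sum(abs(a - b) / max(a, eps) for a, b in zip(q, q_hat)) / len(q)


def kl_divergence(q, q_hat, eps=1e-12):
    """KL(q || q_hat) = sum_i q_i * log(q_i / q_hat_i); q_hat is clipped at eps."""
    return sum(a * math.log(a / max(b, eps)) for a, b in zip(q, q_hat) if a > 0)


q_true = [0.5, 0.5]
q_est = [0.25, 0.75]
ae = absolute_error(q_true, q_est)
rae = relative_absolute_error(q_true, q_est)
kld = kl_divergence(q_true, q_est)
```

Because the RAE divides by the true prevalence, the same absolute deviation counts more for rare labels than for frequent ones.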

Datasets

We use five node classification benchmark datasets and introduce different types of distribution shifts to evaluate the mentioned quantification methods. The datasets come from two different domains: three citation network datasets, namely CoraML, CiteSeer and PubMed [McCallum et al., 2000, Giles et al., 1998, Getoor, 2005, Sen et al., 2008, Namata et al., 2012], and two co-purchase datasets, namely Amazon Photos and Amazon Computers [McAuley et al., 2015, Shchur et al., 2019]. All reported results were obtained by averaging over 10 random splits of the node set into classifier-train/quantifier-train/test, with sizes of 5%/15%/80%, respectively. Since random splitting does not generate distribution shift, we synthetically introduce shifts in the test splits in two ways:

  1. PPS: To simulate PPS, we randomly sample a set of Zipf distributions over the labels [Qi et al., 2020]. Given one such Zipf distribution, we sample vertices uniformly at random for each label so that the target label frequencies are reached.

  2. Covariate Shift: This shift is introduced by uniformly sampling start vertices for each label and then performing a randomized BFS to sample a neighborhood of a fixed size around each start vertex.

In our experiments, the PPS and the covariate-shifted test splits each consist of 100 vertices. For each label, we sample 10 corresponding shifted splits and report the average results.
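The two shift-generating procedures can be sketched as follows. The text does not specify how the Zipf distributions are parameterized or how the BFS is randomized, so the details here are assumptions: Zipf weights $1 / r^s$ are assigned to the labels in random order, and the BFS shuffles the neighbor order at every expansion step. Function and parameter names are hypothetical.

```python
import random
from collections import deque


def zipf_prevalences(num_classes, exponent=1.0, rng=random):
    """Zipf weights 1 / rank**exponent, assigned to labels in random order (an assumption)."""
    ranks = list(range(1, num_classes + 1))
    rng.shuffle(ranks)
    weights = [1.0 / r ** exponent for r in ranks]
    total = sum(weights)
    return [w / total for w in weights]


def randomized_bfs_sample(adj, start, size, rng=random):
    """Collect up to `size` vertices around `start` via BFS with shuffled neighbor order."""
    visited = [start]
    seen = {start}
    queue = deque([start])
    while queue and len(visited) < size:
        u = queue.popleft()
        neighbors = list(adj[u])
        rng.shuffle(neighbors)
        for w in neighbors:
            if w not in seen and len(visited) < size:
                seen.add(w)
                visited.append(w)
                queue.append(w)
    return visited


rng = random.Random(0)  # fixed seed for reproducibility
target = zipf_prevalences(4, rng=rng)
line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
sample = randomized_bfs_sample(line, start=2, size=3, rng=rng)
```

The BFS sample is a connected neighborhood of the start vertex, which is what makes the resulting test split structure-dependent rather than label-dependent.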

Classifiers

We use four different node classifiers to predict the labels of the vertices: a structure-unaware MLP, a GCN [Kipf and Welling, 2017], a GAT [Veličković et al., 2018], and APPNP [Gasteiger et al., 2018]. All models are trained using the same training splits and hyperparameters, with two hidden layers/convolutions of width 64 and ReLU activations. Each model is trained ten times on each of the ten splits per dataset, totalling 100 models per dataset with which each quantifier is evaluated.

Quantifiers

SIS and NACC are evaluated both separately and in combination. SIS is evaluated using the SP kernel with $\lambda = 1/2$ and the PPR kernel from Eqs. 22 and 23.

4.2 Discussion of Results

Figure 2: Quantification results for different dataset, classifier, quantifier combinations, compared to classifier accuracy. Rows: CoraML, CiteSeer, Amazon Photos, Amazon Computers, PubMed. Columns: (a) Hard Classifier, (b) Probabilistic Classifier.

Figure 2 compares the AE of the different quantification methods with respect to classifier accuracy on the shifted test data for three different distribution shift scenarios: no shift, PPS and covariate shift.

Classifier Accuracy

The quality of the classifier has a significant impact on the quantification results, i.e., the error generally goes down with increasing classifier accuracy. Unsurprisingly, the structure-unaware MLP performs worst, while APPNP generally yields the best results. Despite these differences, compared to the CC quantifier, the ACC-based methods are generally able to compensate for the misclassifications of the weak classifiers, flattening the error curve.

Effectiveness of SIS

We observe that the difficulty of quantification depends on the type of distribution shift. While quantification without shift is trivial by definition, PPS is generally easier than covariate shift, as it does not require SIS and the choice of an appropriate vertex kernel. Under PPS, SIS with the PPR kernel generally performs worse than non-SIS methods since this kernel does not reflect the data-generating process. In contrast, under covariate shift, SIS generally outperforms the non-SIS methods, demonstrating that kernel density estimation improves confusion estimates. However, on the CiteSeer dataset, SIS performs significantly worse under covariate shift; this is likely because the PPR kernel is not well-suited for this dataset, highlighting the importance of choosing an appropriate kernel.

Effectiveness of NACC

Since the goal of NACC is to improve class identifiability via structural information, its effectiveness depends on the presence of non-homophilic regions which add collinearities to the confusion matrix. We find that NACC improves quantification results.

Table 1: Quantification using probabilistic classifiers (absolute error, relative absolute error and KL divergence). Columns: Model, Shift and Quantifier, followed by AE, RAE and KLD for each of CoraML, CiteSeer, Amazon Photos, Amazon Computers and PubMed, and the average rank per metric.

Table 1 shows probabilistic quantification results for the RAE and KLD metrics and SIS with the SP kernel. The bold results indicate that there is no significant difference between the reported mean and the best mean within a given block, determined by the 95th percentile of a one-sided t-test. The experiments show that both SIS and NACC are able to improve quantification results under PPS and covariate shift.

5 Conclusion

We have introduced two novel graph quantification methods, SIS and NACC; to our knowledge, this is the first work to investigate classifier-based graph quantification. SIS enables quantification under covariate shift by estimating the test confusion matrix using a kernel density estimate of the test instance distribution. NACC uses the neighborhood structure of the graph to improve class identifiability. The effectiveness of our approach was demonstrated on multiple graph benchmark datasets.

We envision two lines of future research. First, in this work, we focused on extensions of ACC to the graph setting. However, on tabular data, distribution matching quantifiers, such as DMy [González-Castro et al., 2013] or KDEy [Moreo et al., 2025], often outperform ACC-based approaches. An extension of distribution matching to GQL could further improve the quantification performance on graphs. Second, as our experiments showed, choosing an appropriate kernel in SIS is important. While the simple SP and PPR kernels generally perform well in our experiments, a deeper understanding of the practical applicability of vertex kernels for quantification is desirable. To this end, one could design an AutoML system that automatically determines the type of distribution shift in a graph quantification problem in order to find an appropriate kernel based on that shift.

References

  • Barranquero et al. [2013] Jose Barranquero, Pablo González, Jorge Díez, and Juan José del Coz. On the study of nearest neighbor algorithms for prevalence estimation in binary problems. Pattern Recognition, 46(2):472–482, February 2013. ISSN 0031-3203. doi: 10.1016/j.patcog.2012.07.022.
  • Bella et al. [2010] Antonio Bella, Cesar Ferri, José Hernández-Orallo, and María José Ramírez-Quintana. Quantification via Probability Estimators. In 2010 IEEE International Conference on Data Mining, pages 737–742, December 2010. doi: 10.1109/ICDM.2010.75.
  • Bunse [2022] Mirko Bunse. On multi-class extensions of adjusted classify and count. In Proceedings of the 2nd International Workshop on Learning to Quantify (LQ 2022), pages 43–50, 2022.
  • Esuli et al. [2023] Andrea Esuli, Alessandro Fabris, Alejandro Moreo, and Fabrizio Sebastiani. Learning to Quantify, volume 1 of The Information Retrieval Series. Springer, Cham, March 2023. ISBN 978-3-031-20467-8.
  • Fawcett and Flach [2005] Tom Fawcett and Peter A. Flach. A Response to Webb and Ting’s On the Application of ROC Analysis to Predict Classification Performance Under Varying Class Distributions. Machine Learning, 58(1):33–38, January 2005. ISSN 1573-0565. doi: 10.1007/s10994-005-5256-4.
  • Forman [2005] George Forman. Counting positives accurately despite inaccurate classification. In Proceedings of the 16th European Conference on Machine Learning, ECML’05, pages 564–575, Berlin, Heidelberg, October 2005. Springer-Verlag. ISBN 978-3-540-29243-2. doi: 10.1007/11564096_55.
  • Forman [2006] George Forman. Quantifying trends accurately despite classifier error and class imbalance. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06, pages 157–166, New York, NY, USA, August 2006. Association for Computing Machinery. ISBN 978-1-59593-339-3. doi: 10.1145/1150402.1150423.
  • Forman [2008] George Forman. Quantifying counts and costs via classification. Data Mining and Knowledge Discovery, 17(2):164–206, June 2008. ISSN 1573-756X. doi: 10.1007/s10618-008-0097-y.
  • Gart and Buck [1966] John J. Gart and Alfred A. Buck. Comparison of a screening test and a reference test in epidemiologic studies. II. A probabilistic model for the comparison of diagnostic tests. American Journal of Epidemiology, 83(3):593–602, May 1966. ISSN 1476-6256, 0002-9262. doi: 10.1093/oxfordjournals.aje.a120610.
  • Gasteiger et al. [2018] Johannes Gasteiger, Aleksandar Bojchevski, and Stephan Günnemann. Predict then Propagate: Graph Neural Networks meet Personalized PageRank. In International Conference on Learning Representations, September 2018.
  • Getoor [2005] Lise Getoor. Link-based Classification. In Sanghamitra Bandyopadhyay, Ujjwal Maulik, Lawrence B. Holder, and Diane J. Cook, editors, Advanced Methods for Knowledge Discovery from Complex Data, Advanced Information and Knowledge Processing, pages 189–207. Springer, London, 2005. ISBN 978-1-84628-284-3. doi: 10.1007/1-84628-284-5_7.
  • Giles et al. [1998] C. Lee Giles, Kurt D. Bollacker, and Steve Lawrence. CiteSeer: An automatic citation indexing system. In Proceedings of the Third ACM Conference on Digital Libraries, DL ’98, pages 89–98, New York, NY, USA, May 1998. Association for Computing Machinery. ISBN 978-0-89791-965-4. doi: 10.1145/276675.276685.
  • González et al. [2024] Pablo González, Alejandro Moreo, and Fabrizio Sebastiani. Binary quantification and dataset shift: An experimental investigation. Data Mining and Knowledge Discovery, 38(4):1670–1712, July 2024. ISSN 1573-756X. doi: 10.1007/s10618-024-01014-1.
  • González-Castro et al. [2013] Víctor González-Castro, Rocío Alaiz-Rodríguez, and Enrique Alegre. Class distribution estimation based on the Hellinger distance. Information Sciences, 218:146–164, January 2013. ISSN 0020-0255. doi: 10.1016/j.ins.2012.05.028.
  • Kipf and Welling [2017] Thomas N. Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. In International Conference on Learning Representations, 2017.
  • Kull and Flach [2014] Meelis Kull and Peter A. Flach. Patterns of dataset shift. 2014.
  • Lipton et al. [2018] Zachary Lipton, Yu-Xiang Wang, and Alexander Smola. Detecting and Correcting for Label Shift with Black Box Predictors. In Proceedings of the 35th International Conference on Machine Learning, pages 3122–3130. PMLR, July 2018.
  • McAuley et al. [2015] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. Image-Based Recommendations on Styles and Substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’15, pages 43–52, New York, NY, USA, August 2015. Association for Computing Machinery. ISBN 978-1-4503-3621-5. doi: 10.1145/2766462.2767755.
  • McCallum et al. [2000] Andrew Kachites McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. Automating the Construction of Internet Portals with Machine Learning. Information Retrieval, 3(2):127–163, July 2000. ISSN 1573-7659. doi: 10.1023/A:1009953814988.
  • Moreo et al. [2025] Alejandro Moreo, Pablo González, and Juan José del Coz. Kernel density estimation for multiclass quantification. Machine Learning, 114(4):1–38, February 2025. ISSN 1573-0565. doi: 10.1007/s10994-024-06726-5.
  • Namata et al. [2012] Galileo Namata, Ben London, Lise Getoor, and Bert Huang. Query-driven Active Surveying for Collective Classification. In Proceedings of the Workshop on Mining and Learning with Graphs (MLG-2012), Edinburgh, Scotland, UK, 2012.
  • Page et al. [1999] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web. In The Web Conference, November 1999.
  • Qi et al. [2020] Lei Qi, Mohammed Khaleel, Wallapak Tavanapong, Adisak Sukul, and David Peterson. A Framework for Deep Quantification Learning. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2020, Ghent, Belgium, September 14–18, 2020, Proceedings, Part I, pages 232–248, Berlin, Heidelberg, September 2020. Springer-Verlag. ISBN 978-3-030-67657-5. doi: 10.1007/978-3-030-67658-2_14.
  • Saerens et al. [2002] Marco Saerens, Patrice Latinne, and Christine Decaestecker. Adjusting the Outputs of a Classifier to New a Priori Probabilities: A Simple Procedure. Neural Computation, 14(1):21–41, January 2002. ISSN 0899-7667. doi: 10.1162/089976602753284446.
  • Schölkopf et al. [2012] Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. In Proceedings of the 29th International Conference on Machine Learning, ICML’12, pages 459–466, Madison, WI, USA, June 2012. Omnipress. ISBN 978-1-4503-1285-1.
  • Sen et al. [2008] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina Eliassi-Rad. Collective Classification in Network Data. AI Magazine, 29(3):93–106, September 2008. ISSN 2371-9621. doi: 10.1609/aimag.v29i3.2157.
  • Shchur et al. [2019] Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. Pitfalls of Graph Neural Network Evaluation, June 2019.
  • Tasche [2017] Dirk Tasche. Fisher consistency for prior probability shift. Journal of Machine Learning Research, 18(1):3338–3369, January 2017. ISSN 1532-4435.
  • Veličković et al. [2018] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph Attention Networks. In International Conference on Learning Representations, February 2018.
  • Vucetic and Obradovic [2001] Slobodan Vucetic and Zoran Obradovic. Classification on Data with Biased Class Distribution. In Luc De Raedt and Peter Flach, editors, Machine Learning: ECML 2001, pages 527–538, Berlin, Heidelberg, 2001. Springer. ISBN 978-3-540-44795-5. doi: 10.1007/3-540-44795-4_45.