\acsetup{format/first-long=}
\acsetup{barriers/use=true}
\acsetup{barriers/reset=true}
\acsetup{single}
\DeclareAcronym{ml}{short = ML, long = machine learning}
\DeclareAcronym{ql}{short = QL, long = quantification learning}
\DeclareAcronym{gql}{short = GQL, long = graph quantification learning}
\DeclareAcronym{nlp}{short = NLP, long = natural language processing}
\DeclareAcronym{nn}{short = NN, long = neural network}
\DeclareAcronym{mlp}{short = MLP, long = Multilayer Perceptron}
\DeclareAcronym{rnn}{short = RNN, long = recurrent neural network}
\DeclareAcronym{svm}{short = SVM, long = support vector machine}
\DeclareAcronym{cnn}{short = CNN, long = convolutional neural network}
\DeclareAcronym{gnn}{short = GNN, long = graph neural network}
\DeclareAcronym{gcn}{short = GCN, long = Graph Convolutional Network}
\DeclareAcronym{gin}{short = GIN, long = Graph Isomorphism Network}
\DeclareAcronym{gat}{short = GAT, long = Graph Attention Network}
\DeclareAcronym{appnp}{short = APPNP, long = approximate personalized propagation of neural predictions}
\DeclareAcronym{ce}{short = CE, long = cross-entropy}
\DeclareAcronym{uce}{short = UCE, long = uncertain cross-entropy}
\DeclareAcronym{ppr}{short = PPR, long = personalized page-rank}
\DeclareAcronym{arc}{short = ARC, long = accuracy-rejection curve}
\DeclareAcronym{ood}{short = OOD, long = out-of-distribution}
\DeclareAcronym{id}{short = ID, long = in-distribution}
\DeclareAcronym{auroc}{short = AUC-ROC, long = area under the receiver operating characteristic curve}
\DeclareAcronym{qgnn}{short = QGNN, long = Quantification Graph Neural Network}
\DeclareAcronym{cc}{short = CC, long = Classify \& Count}
\DeclareAcronym{pcc}{short = PCC, long = Probabilistic Classify \& Count}
\DeclareAcronym{acc}{short = ACC, long = Adjusted Classify \& Count}
\DeclareAcronym{pacc}{short = PACC, long = Probabilistic Adjusted Classify \& Count}
\DeclareAcronym{pps}{short = PPS, long = prior probability shift}
\DeclareAcronym{mlpe}{short = MLPE, long = Maximum Likelihood Prevalence Estimation}
\DeclareAcronym{slsqp}{short = SLSQP, long = Sequential Least Squares Quadratic Programming}
\DeclareAcronym{bfs}{short = BFS, long = breadth-first search}
\DeclareAcronym{sp}{short = SP, long = shortest path}
\DeclareAcronym{rw}{short = RW, long = random walk}
\DeclareAcronym{sis}{short = SIS, long = structural importance sampling}
\DeclareAcronym{nacc}{short = NACC, long = Neighborhood-aware ACC}
\DeclareAcronym{ae}{short = AE, long = absolute error}
\DeclareAcronym{rae}{short = RAE, long = relative absolute error}
\DeclareAcronym{kld}{short = KLD, long = Kullback-Leibler divergence}

Adjusted Count Quantification Learning on Graphs

Clemens Damke Institute of Informatics, LMU Munich, Germany

Eyke Hüllermeier Institute of Informatics, LMU Munich, Germany Munich Center for Machine Learning (MCML) German Centre for Artificial Intelligence (DFKI, DSA)

Abstract

\Aclql is the task of predicting the label distribution of a set of instances. We study this problem in the context of graph-structured data, where the instances are vertices. Previously, this problem has only been addressed via node clustering methods. In this paper, we extend the popular \acacc method to graphs. We show that the \aclpps assumption upon which \acacc relies is often not fulfilled and propose two novel graph quantification techniques: \Acsis makes \acacc applicable in graph domains with covariate shift. \Aclnacc improves quantification in the presence of non-homophilic edges. We show the effectiveness of our techniques on multiple graph quantification tasks.

\acbarrier

1 Introduction

We consider the task of \acql on graph-structured data. The term was coined by Forman [2005, 2006, 2008] and describes the task of estimating label prevalences via supervised learning. A \acql method receives a set of training instances with known labels which is used to train a quantifier. The quantifier is then used to predict the label distribution of a set of test instances. Unlike standard instance-wise classification, \acql does not concern itself with predicting an accurate label for each test instance but rather with predicting the overall prevalence of each label across all instances. \Acql can thus be seen as a dataset-level prediction task, where a single prediction is made for a population of instances.

Quantification problems naturally arise in polling and surveying, where the goal is to estimate the proportion of a population that has a certain property or holds a certain opinion. Examples include estimating the proportion of voters that support a certain political party or the proportion of customers that are satisfied with a product. Similarly, \acql can be applied to epidemiology or ecological modelling to estimate the prevalence of diseases or species in a given population. We refer to Esuli et al. [2023] for a comprehensive overview of the applications of quantification.

Typically, \acql is studied in the context of tabular data, where each instance $x\in\mathcal{X}=\mathbb{R}^{d}$ is represented by a feature vector. In this setting, instances are assumed to be independent, i.e., the label distribution $p(y\mid x)$ is fully determined by the instance $x$ . However, in many real-world applications, this independence assumption does not hold. Consider the example of estimating the proportion of voters supporting a certain party. Assume we have access to a social network where each node represents a voter and each edge represents a social connection. In this case, the label distribution of a voter, i.e., their political preferences, may depend not only on their own features but also on the features of their social connections. Incorporating this relational information into the quantification process can lead to more accurate estimates.

Generally speaking, \acql methods can be divided into two categories: aggregative and non-aggregative. Aggregative quantifiers rely upon an instance-wise label estimator, i.e., a regular classifier; the instance-level label estimates are then aggregated to obtain dataset-level label prevalence estimates. Non-aggregative quantifiers, on the other hand, directly estimate dataset-level label prevalences without first predicting labels for each instance. In this paper, we focus on aggregative quantification methods, which are more common and have been studied more extensively. An intuitively plausible aggregative method is to simply estimate the prevalence of a label as the fraction of test instances that are predicted to belong to that label by the classifier. This method is known as \accc and, given a perfect classifier, it will yield perfect quantification results. However, in practice, classifiers are not perfect and even good, but not perfect, classifiers can lead to poor quantification results. Conversely, even a bad classifier can yield good quantification results. The reason for this disconnect is that the optimization goals of classification and quantification are misaligned. More specifically, while a good binary classifier should minimize the total number of misclassifications, i.e., $(\mathrm{FP}+\mathrm{FN})$ , a good binary quantifier should minimize $\left|\mathrm{FP}-\mathrm{FN}\right|$ . If $\mathrm{FP}=\mathrm{FN}$ , even a classifier with a high misclassification rate will yield perfect quantification results.
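The disconnect between classification and quantification quality can be illustrated with a small sketch. The toy labels and predictions below are hypothetical and chosen so that $\mathrm{FP}=\mathrm{FN}$; the helper name `classify_and_count` is ours.

```python
def classify_and_count(predictions, positive=1):
    """Binary CC: fraction of instances predicted as `positive`."""
    return sum(1 for p in predictions if p == positive) / len(predictions)

# Hypothetical toy data: 10 instances, 4 of them truly positive.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# A weak classifier with 2 false negatives and 2 false positives.
y_pred = [1, 1, 0, 0, 1, 1, 0, 0, 0, 0]

fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
true_prevalence = sum(y_true) / len(y_true)
cc_estimate = classify_and_count(y_pred)
```

Although four out of ten instances are misclassified, the false positives and false negatives cancel and the estimated prevalence matches the true one.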

This misalignment is commonly addressed by the family of \acacc methods, which use an estimate of the classifier’s confusion matrix to adjust the predicted label prevalences [Vucetic and Obradovic, 2001, Saerens et al., 2002, Forman, 2005]. \Acacc has been shown to estimate the true test label prevalences in expectation if the so-called \acpps assumption holds, i.e., if the class-conditional training distribution $p(x\mid y)$ equals the class-conditional test distribution $q(x\mid y)$ [Tasche, 2017].

In the following, we investigate the validity of the \acpps assumption and thereby of \acacc in the context of graph-structured data. In Section 2, we begin with a brief formal description of the quantification problem, the \acpps assumption and how it is used in \acacc. Section 3 describes two novel structure-aware aggregative quantification methods for graphs: \Acl*sis and \acl*nacc. To our knowledge, this is the first work on graph quantification learning using a node classifier. In Section 4, the proposed methods are evaluated on a series of node classification datasets under different shift assumptions. Last, we conclude with a brief outlook in Section 5.

2 Quantification Learning

Let $\mathcal{X}$ denote the instance space and $\mathcal{Y}=\{1,\dots,K\}$ the (finite) label space. In \acql we assume to be given a training set of labeled instances $\mathcal{D}_{L}\subseteq\mathcal{X}\times\mathcal{Y}$ drawn from a distribution $P(x,y)$ with corresponding density $p$ . Additionally, there is a set of labeled instances $\mathcal{D}_{U}\subseteq\mathcal{X}\times\mathcal{Y}$ drawn from a test distribution $Q(x,y)$ with corresponding density $q$ . The goal of \acql is to estimate $Q(Y=i)$ for all $i\in\mathcal{Y}$ given $\mathcal{D}_{L}$ and $\mathcal{X}_{U}\coloneqq\{x\mid(x,y)\in\mathcal{D}_{U}\}$ . If $P=Q$ , i.e., if the training and test data are drawn from the same distribution, the quantification problem is trivially solved via a maximum likelihood estimate of the label distribution on $\mathcal{D}_{L}$ :

\begin{equation}
\hat{Q}^{\mathrm{MLPE}}(Y=i)\coloneqq\frac{1}{|\mathcal{D}_{L}|}\smashoperator[r]{\sum_{(x,y)\in\mathcal{D}_{L}}}\mathds{1}[y=i]
\tag{1}
\end{equation}

where $\mathds{1}[\cdot]$ denotes the indicator function. This \acmlpe approach [Barranquero et al., 2013, Esuli et al., 2023] is akin to the majority classifier in classification in the sense that it predicts the most likely distribution in the absence of test data $\mathcal{X}_{U}$ . However, if the training and test data are not identically distributed, the quantification problem becomes more challenging. A quantification approach has to account for the distribution shift between $P$ and $Q$ to provide accurate estimates of $Q(Y)$ . Depending on the nature of this distribution shift, different quantification methods may be more or less suitable.
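As a minimal sketch of Eq. 1, the \acmlpe estimate is simply the vector of relative label frequencies in the training set. The function name `mlpe` and the toy labels are illustrative; labels are assumed to be integers in $\{1,\dots,K\}$ as in the text.

```python
from collections import Counter

def mlpe(train_labels, num_classes):
    """Relative frequency of each label 1..K in the training set (Eq. 1)."""
    counts = Counter(train_labels)
    n = len(train_labels)
    return [counts.get(i, 0) / n for i in range(1, num_classes + 1)]

# Hypothetical training labels with K = 3.
estimate = mlpe([1, 1, 2, 3, 3, 3, 3, 2], num_classes=3)
```

Note that the test instances are never used, which is why this estimate is only adequate when $P=Q$.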

2.1 Types of Distribution Shift

If the train and test distributions differ, one should ask whether learning from the training data is still feasible. Certainly, if $P$ and $Q$ are completely unrelated, any information learned from $\mathcal{D}_{L}$ is useless for predicting $Q(Y)$ . Quantification approaches therefore typically assume that $P$ and $Q$ are related in some way. The applicability of a quantification method then depends on whether those assumptions hold true for the given problem. First, note that $q$ can be expressed as

\begin{equation*}
q(x,y)=q(y\mid x)q(x)=q(x\mid y)q(y)\ .
\end{equation*}

By fixing one of the factors in the two right-hand terms, we obtain three types of distribution shifts [Esuli et al., 2023]:

\begin{enumerate}
\item Concept Shift: The conditional label distribution changes, but the distribution of the instances remains the same, i.e., $q(y\mid x)\neq p(y\mid x)$, while $q(x)=p(x)$. This type of shift, also referred to as concept drift, can occur in domains with classes that are defined relative to some frame of reference. For example, consider the task of predicting the prevalence of local vs. world news articles in a newspaper. While the distribution of news articles may remain the same between training and test, the definition of what constitutes local or world news depends on the location of the newspaper.
\item Covariate Shift: The distribution of the instances changes, but the conditional label distribution remains the same, i.e., $q(x)\neq p(x)$, while $q(y\mid x)=p(y\mid x)$. This is common in domain adaptation, where the training and test data are drawn from different but related domains. For example, assume the task is to predict the prevalence of a certain sentiment or opinion in social media posts. The training data may be drawn from one social media platform, while the test data is drawn from another. Given a post $x$, the probability of it expressing a certain sentiment $y$ is likely the same on both platforms, but the distribution of posts may differ.
\item Prior Probability Shift: The label distribution changes, but not the class-conditional instance distribution, i.e., $q(y)\neq p(y)$, while $q(x\mid y)=p(x\mid y)$. Similar to covariate shift, \acfpps occurs between domains that share the same label concepts. For example, consider the task of predicting the percentage of a population that has a certain disease. The training data may come from a case-control study consisting of an equal number of healthy and infected individuals, while the test data is drawn from the general population. Given $y\in\{\mathit{infected},\mathit{healthy}\}$, the feature distribution of an individual $x$ should be the same between training and test, whereas the prevalence of the disease will likely not be.
\end{enumerate}

We do not consider the case where $q(y)=p(y)$ , as this would imply that the label distribution remains unchanged, in which case the quantification problem is trivially solved by \acmlpe. Note that the difference between covariate shift and \acpps is subtle. Whether it is $p(x)$ or $p(y)$ that changes between training and test is mostly a matter of the assumed causal relation between instances and labels, i.e., whether it is in the direction $\mathcal{X}\to\mathcal{Y}$ or $\mathcal{Y}\to\mathcal{X}$ [Fawcett and Flach, 2005, Schölkopf et al., 2012, Kull and Flach, 2014]. In \acql, \acpps is commonly assumed, as there are many $\mathcal{Y}\to\mathcal{X}$ domains in which this is reasonable [González et al., 2024]. Generally speaking, quantification under concept or covariate shift is more challenging and often requires additional assumptions or domain knowledge. We will get back to the question of which shift assumptions are appropriate for a given domain in Section 3.

2.2 Adjusted Count

We will now describe the \acfacc method, a popular approach to quantification under \acpps [Forman, 2005]. As mentioned in the introduction, the naïve \aclcc method estimates the prevalence of a label as the fraction of test instances that are predicted to belong to that label by a classifier $h:\mathcal{X}\to\mathcal{Y}$ :

\begin{equation}
\hat{Q}^{\mathrm{CC}}(Y=i)\coloneqq\frac{1}{|\mathcal{X}_{U}|}\smashoperator[r]{\sum_{x\in\mathcal{X}_{U}}}\mathds{1}[h(x)=i]\ .
\tag{2}
\end{equation}

Since $h$ is trained on data drawn from $P$, the estimated propensity scores $\hat{Q}^{\mathrm{CC}}(Y)$ will be biased towards $P(Y)$. \Acacc removes this bias by adjusting the predicted label prevalences based on an estimate of the classifier’s confusion matrix. To understand \acacc, note that the \acpps assumption implies that

\begin{align}
Q(\hat{Y}=j) &= \sum_{i=1}^{K}Q(\hat{Y}=j\mid Y=i)\cdot Q(Y=i) \notag\\
&= \sum_{i=1}^{K}P(\hat{Y}=j\mid Y=i)\cdot Q(Y=i) \tag{3}
\end{align}

for all $j\in\mathcal{Y}$, with $\hat{Y}=h(X)$. Given $\mathcal{D}_{L}$ and $\mathcal{X}_{U}$, we can obtain unbiased estimates of both $Q(\hat{Y})$ and $P(\hat{Y}\mid Y)$:

\begin{align}
\hat{Q}(\hat{Y}=j) &= \hat{Q}^{\mathrm{CC}}(Y=j)\ , \notag\\
\hat{P}(\hat{Y}=j\mid Y=i) &= \frac{\smashoperator[r]{\sum_{(x,y)\in\mathcal{D}_{L}}}\mathds{1}[h(x)=j\land y=i]}{|\{(x,y)\in\mathcal{D}_{L}\mid y=i\}|}\ . \tag{4}
\end{align}

Plugging these estimates into Eq. 3 yields a system of equations which can be solved to obtain estimates of $Q(Y)$ [Saerens et al., 2002]. Let $\hat{\mathbf{C}}\in{[0,1]}^{K\times K}$ be the estimated confusion matrix of $h$ on $P$ , i.e., $\hat{C}_{j,i}=\hat{P}(\hat{Y}=j\mid Y=i)$ . Then, the \acacc estimates of $Q(Y)$ are given by

\begin{equation}
\hat{Q}^{\mathrm{ACC}}(Y)=\hat{\mathbf{C}}^{-1}\cdot\hat{Q}(\hat{Y})\ .
\tag{5}
\end{equation}

While the binary version of \acacc goes back at least to Gart and Buck [1966], it was first described as a quantification method by Vucetic and Obradovic [2001]. Tasche [2017] showed that \acacc is an unbiased estimator of the true test label prevalences if the \acpps assumption holds.
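The steps of Eqs. 2, 4 and 5 can be sketched as follows. This is an illustrative NumPy sketch, not a reference implementation: the function names are ours, labels are assumed to be 0-indexed integers (unlike $\{1,\dots,K\}$ in the text), every class is assumed to occur in the training data, and $\hat{\mathbf{C}}$ is assumed to be invertible.

```python
import numpy as np

def cc(pred, K):
    """Eq. 2: frequencies of the predicted labels."""
    return np.bincount(pred, minlength=K) / len(pred)

def confusion_matrix(pred_train, y_train, K):
    """Eq. 4: C_hat[j, i] = P_hat(Y_hat = j | Y = i); each column sums to one."""
    C = np.zeros((K, K))
    for j, i in zip(pred_train, y_train):
        C[j, i] += 1
    return C / C.sum(axis=0, keepdims=True)

def acc(pred_test, pred_train, y_train, K):
    """Eq. 5: solve C_hat q = Q_hat(Y_hat) for q."""
    C_hat = confusion_matrix(pred_train, y_train, K)
    return np.linalg.solve(C_hat, cc(pred_test, K))

# Hypothetical binary example: the classifier is correct on 80% of class 0
# and on 90% of class 1 in the training data.
y_train = np.array([0] * 10 + [1] * 10)
pred_train = np.array([0] * 8 + [1] * 2 + [0] * 1 + [1] * 9)
# Test predictions consistent with a true test distribution of (0.3, 0.7).
pred_test = np.array([0] * 31 + [1] * 69)

cc_estimate = cc(pred_test, 2)
acc_estimate = acc(pred_test, pred_train, y_train, 2)
```

In this constructed example, \accc returns $(0.31, 0.69)$, while the adjustment recovers $(0.3, 0.7)$.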

Note that there are two practical problems with Eq. 5: First, if $\hat{\mathbf{C}}$ is not invertible, there might be no solution or multiple solutions for $\hat{Q}^{\mathrm{ACC}}(Y)$. Second, the adjusted label prevalences may not be a valid distribution over $\mathcal{Y}$, i.e., they could lie outside $[0,1]$ or not sum to one. Possible reasons for this are that the \acpps assumption might not be (fully) satisfied or simply that the estimates $\hat{\mathbf{C}}$ and $\hat{Q}(\hat{Y})$ are noisy, e.g., due to small sample sizes. A number of solutions to these problems have been proposed in the literature, including clipping and rescaling the estimates [Forman, 2008], adjusting the confusion matrix [Lipton et al., 2018], using the pseudo-inverse of $\hat{\mathbf{C}}$ or replacing the system of equations with a constrained optimization problem [Bunse, 2022]. In this work, we will use the latter approach, i.e., constrained optimization, to solve Eq. 5:

\begin{equation}
\hat{Q}^{\mathrm{ACC}}(Y)=\operatorname*{arg\,min}_{\mathbf{q}\in\Delta_{K}}\left\|\hat{\mathbf{C}}\cdot\mathbf{q}-\hat{Q}(\hat{Y})\right\|_{2}^{2}\ ,
\tag{6}
\end{equation}

where $\Delta_{K}$ denotes the unit $(K-1)$-simplex. This problem can be solved numerically, e.g., using a (quasi-)Newton method such as \acslsqp. Bunse [2022] has shown that this approach is a sensible default choice, as it generally performs well in practice.
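To illustrate the constrained problem in Eq. 6 without depending on a particular solver, the following sketch uses projected gradient descent with a sort-based Euclidean projection onto the simplex. This is a stand-in chosen for self-containedness and is not the \acslsqp solver mentioned above; the function names, the step count and the toy inputs are assumptions.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, len(v) + 1)
    k = ks[u - css / ks > 0][-1]
    return np.maximum(v - css[k - 1] / k, 0.0)

def acc_constrained(C_hat, q_pred, steps=5000):
    """Minimise ||C_hat q - q_pred||^2 over the simplex (Eq. 6)."""
    K = C_hat.shape[1]
    q = np.full(K, 1.0 / K)
    # Step size 1/L, where L = 2 * sigma_max(C_hat)^2 bounds the gradient's Lipschitz constant.
    lr = 1.0 / (2.0 * np.linalg.norm(C_hat, 2) ** 2)
    for _ in range(steps):
        grad = 2.0 * C_hat.T @ (C_hat @ q - q_pred)
        q = project_to_simplex(q - lr * grad)
    return q

C_hat = np.array([[0.8, 0.1], [0.2, 0.9]])
# Case 1: the unconstrained solution (0.3, 0.7) is already a valid distribution.
q_feasible = acc_constrained(C_hat, np.array([0.31, 0.69]))
# Case 2: plain matrix inversion would yield a negative prevalence for class 0.
q_clipped = acc_constrained(C_hat, np.array([0.05, 0.95]))
```

In the second case the unconstrained solution of Eq. 5 lies outside the simplex, whereas the constrained solution assigns all probability mass to the second class.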

In addition to the \accc and \acacc methods described above, which use a hard classifier $h:\mathcal{X}\to\mathcal{Y}$ , one can also use a probabilistic classifier $h_{s}:\mathcal{X}\to\Delta_{K}$ [Bella et al., 2010]. Analogous to \accc, \acfpcc is defined as

\begin{equation}
\hat{Q}^{\mathrm{PCC}}(Y=i)\coloneqq\frac{1}{|\mathcal{X}_{U}|}\smashoperator[r]{\sum_{x\in\mathcal{X}_{U}}}h_{s}(x)_{i}\ .
\tag{7}
\end{equation}

Likewise, \acfpacc estimates $Q(\hat{Y})$ and $P(\hat{Y}\mid Y)$ using predicted label probabilities:

\begin{align}
\hat{Q}(\hat{Y}=j) &= \hat{Q}^{\mathrm{PCC}}(Y=j)\ , \notag\\
\hat{P}(\hat{Y}=j\mid Y=i) &= \frac{\smashoperator[r]{\sum_{(x,y)\in\mathcal{D}_{L}}}h_{s}(x)_{j}\cdot\mathds{1}[y=i]}{|\{(x,y)\in\mathcal{D}_{L}\mid y=i\}|}\ . \tag{8}
\end{align}

The motivation for using a soft classifier instead of a hard one is that predicted label probabilities can be more informative than hard labels. Whether this is truly the case depends on the problem and on the quality of the predicted probabilities.
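The soft variants in Eqs. 7 and 8 differ from their hard counterparts only in that indicator values are replaced by predicted probabilities. A minimal NumPy sketch, assuming 0-indexed labels, a probability matrix of shape $(n, K)$ and an invertible $\hat{\mathbf{C}}$; all names and numbers are illustrative.

```python
import numpy as np

def pcc(probs):
    """Eq. 7: mean predicted probability per class; probs has shape (n, K)."""
    return probs.mean(axis=0)

def soft_confusion_matrix(probs_train, y_train, K):
    """Eq. 8: C_hat[j, i] = mean of h_s(x)_j over training instances with y = i."""
    columns = [probs_train[y_train == i].mean(axis=0) for i in range(K)]
    return np.stack(columns, axis=1)

def pacc(probs_test, probs_train, y_train, K):
    """PACC via plain inversion; Eq. 6 could be used instead."""
    C_hat = soft_confusion_matrix(probs_train, y_train, K)
    return np.linalg.solve(C_hat, pcc(probs_test))

# Hypothetical soft predictions for four training and ten test instances.
y_train = np.array([0, 0, 1, 1])
probs_train = np.array([[0.9, 0.1], [0.7, 0.3], [0.2, 0.8], [0.0, 1.0]])
probs_test = np.array([[0.8, 0.2]] * 3 + [[0.1, 0.9]] * 7)

pcc_estimate = pcc(probs_test)
pacc_estimate = pacc(probs_test, probs_train, y_train, 2)
```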

3 Graph Quantification Learning

We now turn to the problem of \aclql on graph-structured data. In Section 2, we assumed that the instances in $\mathcal{D}_{L}$ and $\mathcal{D}_{U}$ are i.i.d. wrt. $P$ and $Q$ respectively. This assumption does not hold for graph-structured data, where the instances are the vertices of a graph and the labels are associated with the vertices. More specifically, let $G=(V,E)$ be a graph with vertex set $V$ and edge set $E\subseteq V\times V$ . Each vertex $v_{i}\in V$ is associated with a feature vector $x_{i}\in\mathcal{X}$ and a label $y_{i}\in\mathcal{Y}$ . We use $\mathcal{N}(v_{i})=\{v_{j}\mid(v_{i},v_{j})\in E\}$ to denote the set of neighbors of $v_{i}$ . The edges in $G$ are used to encode homophily between vertices, i.e., similar vertices are more likely to be connected. Formally, an edge $(v_{i},v_{j})\in E$ should indicate that $P(y_{i}=y_{j})\geq\varepsilon$ , with $\varepsilon$ being either a graph-specific constant or a function of an edge weight $w_{i,j}\in\mathbb{R}$ . Since homophily is symmetric by definition, $G$ is undirected, i.e., $(v_{i},v_{j})\in E\Leftrightarrow(v_{j},v_{i})\in E$ . Such homophilic graphs are commonly used to represent social networks, citation networks, co-purchase graphs or the World Wide Web.

Figure 1: The Amazon Photos co-purchase graph. Colors indicate vertex labels ($K=8$). The highlighted vertices are misclassifications by an \acs*appnp classifier.

Figure 1 shows one such graph, namely the Amazon Photos co-purchase graph [Shchur et al., 2019], where vertices represent products, edges indicate that two products are frequently bought together and labels represent product categories. Due to homophily, the product categories form separate densely connected clusters, while cross-category edges are sparse.

Analogous to the tabular case, in \acgql we are given a training set of labeled vertices $\mathcal{D}_{L}$ drawn from a distribution $P$ and our goal is to estimate the label distribution of the vertices in a test set $\mathcal{D}_{U}$ drawn from a distribution $Q$ . Given some vertex classifier $h_{G}:V\to\mathcal{Y}$ , the \acgql problem is, in principle, amenable to standard aggregative quantification methods, such as \acacc or \acpacc. As discussed in Section 2.2, those adjusted count methods assume \acpps, which in turn assumes a $\mathcal{Y}\to\mathcal{X}$ domain. This means that both the training and the test data is assumed to be generated by sampling from some fixed distribution $p(x\mid y)$ for all $y\in\mathcal{Y}$ . We argue that this is oftentimes not realistic for graph-structured data.

Consider the example of estimating the proportion of users holding a certain opinion. Here, the training data $\mathcal{D}_{L}$ may come from a social network where a (non-representative) local subset of users was sampled. The test data $\mathcal{D}_{U}$ , on the other hand, may come from the entire social network or possibly some local subcluster of interest. In this setting, it is the instance distribution $p(x)$ that changes, while $p(y\mid x)$ remains fixed, i.e., covariate shift. More generally, a sampling process that is structure-dependent, e.g., by sampling local training or test neighborhoods, has covariate shift, not \acpps. We will now discuss how such structural biases can be accounted for in the quantification process.

3.1 Structural Importance Sampling

\Acacc depends on being able to estimate the test confusion matrix $\mathbf{C}$ from training data, with $C_{j,i}\coloneqq Q(\hat{Y}=j\mid Y=i)$. As described in Section 2.2, estimating $\mathbf{C}$ is trivial under \acpps. We will now introduce \acsis, a novel approach to \aclgql under covariate shift. First, note that $C_{j,i}$ can be expressed as

\begin{align}
C_{j,i} &= \smashoperator{\sum_{v\in V}}\mathds{1}[h_{G}(v)=j]\cdot Q(v\mid Y=i) \tag{9}\\
&= \smashoperator{\sum_{v\in V}}\mathds{1}[h_{G}(v)=j]\underbrace{\frac{Q(v\mid Y=i)}{P(v\mid Y=i)}}_{=\rho(v\mid i)}\cdot P(v\mid Y=i)\ . \notag
\end{align}

Using the covariate shift assumption, we can rewrite $\rho$ as

\begin{align}
\rho(v\mid i) &= \frac{Q(Y=i\mid v)\cdot Q(v)\cdot P(Y=i)}{P(Y=i\mid v)\cdot P(v)\cdot Q(Y=i)} \notag\\
&= \frac{Q(v)}{P(v)}\cdot\frac{P(Y=i)}{Q(Y=i)}=\rho(v)\cdot\rho(i)^{-1}\ . \tag{10}
\end{align}

Thus, $\mathbf{C}$ can be obtained by reweighting the vertices:

\begin{align}
C_{j,i} &= \rho(i)^{-1}\smashoperator{\sum_{v\in V}}\mathds{1}[h_{G}(v)=j]\cdot\rho(v)\cdot P(v\mid Y=i) \notag\\
&= \frac{\rho(i)^{-1}\smashoperator{\sum_{v\in V}}\mathds{1}[h_{G}(v)=j]\cdot\rho(v)\cdot P(v\mid Y=i)}{\rho(i)^{-1}\smashoperator{\sum_{v\in V}}\rho(v)\cdot P(v\mid Y=i)} \notag\\
&= \frac{\mathbb{E}_{v\sim P(\cdot\mid i)}\left[\mathds{1}[h_{G}(v)=j]\cdot\rho(v)\right]}{\mathbb{E}_{v\sim P(\cdot\mid i)}\left[\rho(v)\right]}\ . \tag{11}
\end{align}

Given $\mathcal{D}_{L}$ , we can obtain an unbiased estimate of $C_{j,i}$ :

\begin{equation}
\hat{C}_{j,i}=\frac{\smashoperator[r]{\sum_{(v,y)\in\mathcal{D}_{L}}}\mathds{1}[h_{G}(v)=j\land y=i]\cdot\rho(v)}{\smashoperator{\sum_{(v,y)\in\mathcal{D}_{L}}}\rho(v)\cdot\mathds{1}[y=i]}\ .
\tag{12}
\end{equation}

Note that this is essentially a weighted version of Eq. 4.
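The weighted estimate of Eq. 12 can be sketched as follows. The function name, the 0-indexed labels and the toy weights are assumptions for illustration; `rho` holds one importance weight per training vertex and is taken as given here.

```python
import numpy as np

def sis_confusion_matrix(pred_train, y_train, rho, K):
    """Eq. 12: importance-weighted confusion estimate; rho[n] weights training vertex n."""
    C = np.zeros((K, K))
    for j, i, w in zip(pred_train, y_train, rho):
        C[j, i] += w
    # Column i is normalised by the total weight of training vertices with label i.
    return C / C.sum(axis=0, keepdims=True)

y_train = np.array([0, 0, 0, 0, 1, 1])
pred_train = np.array([0, 0, 1, 1, 1, 1])

# Uniform weights recover the unweighted estimate of Eq. 4.
C_uniform = sis_confusion_matrix(pred_train, y_train, np.ones(6), 2)
# Hypothetical weights that emphasise the two correctly classified class-0 vertices.
C_weighted = sis_confusion_matrix(pred_train, y_train, np.array([3, 3, 1, 1, 1, 1]), 2)
```

Since only ratios of weights enter the estimate, $\rho$ needs to be known only up to a constant factor.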

The problem with this formulation is that it requires $\rho(v)=\frac{Q(v)}{P(v)}$ , which cannot be computed since $Q(v)$ is unknown. We do however have access to $\mathcal{X}_{U}$ , which is sampled from $Q$ . Using a suitable vertex kernel $k:V\times V\to\mathbb{R}$ , we can thus estimate $\rho(v)$ via kernel density estimation:

\begin{equation}
\rho(v)\mathrel{\overset{\propto}{\sim}}\hat{\rho}(v)\coloneqq\frac{1}{|\mathcal{X}_{U}|}\smashoperator[r]{\sum_{v^{\prime}\in\mathcal{X}_{U}}}k(v,v^{\prime})\ .
\tag{21}
\end{equation}

The suitability of the kernel $k$ depends on the nature of the test distribution $Q$ . For example, if $Q(v\mid y)$ is uniform for all $y\in\mathcal{Y}$ , i.e., the sampling is structure agnostic, the constant kernel $k_{1}(v,v^{\prime})=1$ would be appropriate. Note that Eq. 12 simplifies to standard \acacc in this case, as structure agnostic test sampling implies \acpps.

If the test nodes are sampled via a randomized \acbfs, a \acsp kernel, e.g.,

\begin{equation}
k_{\mathrm{SP}}(v,v^{\prime})=\exp(-\lambda\cdot d_{\mathrm{SP}}(v,v^{\prime}))\ ,
\tag{22}
\end{equation}

might be more appropriate, where $d_{\mathrm{SP}}(v,v^{\prime})$ is the length of the shortest path between $v$ and $v^{\prime}$ and $\lambda>0$ is a tunable hyperparameter. Similarly, for \acrw sampling, one can use a kernel based on the \acppr algorithm [Page et al., 1999]:

\begin{equation}
k_{\mathrm{PPR}}(v,v^{\prime})=\Pi^{L}_{v^{\prime},v}\ ,
\tag{23}
\end{equation}

where $\Pi^{L}_{v^{\prime},v}$ denotes the probability that a random walk of length $L$ starting at $v^{\prime}$ ends at $v$ . In general, the kernel should be treated as a hyperparameter which has to be tuned to the specific problem at hand.
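The density estimate of Eq. 21 with the kernels of Eqs. 22 and 23 can be sketched in plain Python. The adjacency-list representation, the function names and the toy path graph are assumptions. In particular, `random_walk_kernel` computes the endpoint distribution of a uniform random walk of fixed length without restarts, which follows the description of $\Pi^{L}$ in the text but may differ from how a \acppr matrix is computed in practice.

```python
from collections import deque
import math

def shortest_path_lengths(adj, source):
    """BFS hop distances from `source`; adj maps each vertex to a list of neighbours."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def rho_hat_sp(adj, test_vertices, lam=0.5):
    """Eq. 21 with the SP kernel of Eq. 22, evaluated for every vertex of the graph."""
    rho = {v: 0.0 for v in adj}
    for t in test_vertices:
        # The graph is undirected, so d_SP(v, t) = d_SP(t, v).
        for v, d in shortest_path_lengths(adj, t).items():
            rho[v] += math.exp(-lam * d)  # unreachable vertices contribute nothing
    return {v: r / len(test_vertices) for v, r in rho.items()}

def random_walk_kernel(adj, start, length):
    """Endpoint distribution of a `length`-step uniform random walk from `start` (cf. Eq. 23)."""
    p = {start: 1.0}
    for _ in range(length):
        nxt = {}
        for u, mass in p.items():
            for w in adj[u]:
                nxt[w] = nxt.get(w, 0.0) + mass / len(adj[u])
        p = nxt
    return p

# Hypothetical path graph 0 - 1 - 2 - 3 with a single test vertex 0.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
rho = rho_hat_sp(adj, test_vertices=[0], lam=0.5)
walk = random_walk_kernel(adj, start=0, length=2)
```

Training vertices close to the test vertices receive larger weights, which can then be used as `rho` in the weighted confusion estimate of Eq. 12.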

To summarize, \acsis enables graph quantification under covariate shift by estimating the test confusion matrix $\mathbf{C}$ using a kernel density estimate of the test instance distribution $Q$ . Using this estimate, the adjusted label prevalences can be computed using Eq. 6.

3.2 Neighborhood-aware Adjusted Count

In the previous section, we described how \acgql can be performed even under covariate shift. In addition to the nature of the distribution shift, there is another aspect to consider when applying \acacc: class identifiability. Consider a classifier that is unable to distinguish between two classes $i$ and $j$, i.e., it predicts the same label for both. In this case, $\mathbf{C}_{:,i}=\mathbf{C}_{:,j}$ and thus there is no unique solution for Eq. 5. This can lead to poor quantification results if the prediction vector $Q(\hat{Y})$ has a large overlap with both $\mathbf{C}_{:,i}$ and $\mathbf{C}_{:,j}$, since any distribution of probability mass between both classes may then be returned. To address this issue, we propose \acfnacc, which uses the neighborhood structure of the graph to improve class identifiability.

First, note that Eq. 6 can be understood as finding a mixture of the columns of $\mathbf{C}$ that best approximates $Q(\hat{Y})$ . In the case of collinear columns, this mixture is not unique. A simple way to break such symmetries is to use the neighborhood structure of the graph:

\begin{equation*}
Q(\hat{Y}_{\mathcal{N}}=(j,k))=\sum_{i=1}^{K}Q(\hat{Y}_{\mathcal{N}}=(j,k)\mid Y=i)\cdot Q(Y=i),
\end{equation*}

where $\hat{Y}_{\mathcal{N}}$ is a random variable representing a tuple of the predicted label of a vertex and the majority predicted label of its neighbors. Using this decomposition of $Q(\hat{Y}_{\mathcal{N}})$, $Q(Y)$ can be estimated using \acacc, and possibly \acsis, with the only difference being that the confusion matrix estimate is now of shape $K^{2}\times K$. Intuitively, this approach uses information on the presence or absence of homophily to improve class identifiability.
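A possible way to construct the neighborhood-aware predictions and the $K^{2}\times K$ confusion estimate is sketched below. The tuple encoding $(j,k)\mapsto j\cdot K+k$, the tie-breaking rule, the handling of isolated vertices and the toy graph are our assumptions and are not specified in the text; labels are 0-indexed.

```python
from collections import Counter
import numpy as np

def neighborhood_predictions(adj, pred, K):
    """Encode (own prediction j, majority neighbour prediction k) as the index j * K + k."""
    out = {}
    for v, nbrs in adj.items():
        if nbrs:
            counts = Counter(pred[u] for u in nbrs)
            # Ties are broken in favour of the smallest label (an assumption).
            k = min(counts, key=lambda c: (-counts[c], c))
        else:
            k = pred[v]  # isolated vertex: fall back to its own prediction (an assumption)
        out[v] = pred[v] * K + k
    return out

def nacc_confusion_matrix(nb_pred, labels, train_vertices, K):
    """K^2 x K estimate with C[(j, k), i] = P_hat(Y_hat_N = (j, k) | Y = i)."""
    C = np.zeros((K * K, K))
    for v in train_vertices:
        C[nb_pred[v], labels[v]] += 1
    return C / C.sum(axis=0, keepdims=True)

# Hypothetical graph with five vertices and K = 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
pred = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1}
labels = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1}

nb_pred = neighborhood_predictions(adj, pred, K=2)
C_hat = nacc_confusion_matrix(nb_pred, labels, train_vertices=[0, 1, 2, 3, 4], K=2)
```

The resulting rectangular matrix cannot be inverted as in Eq. 5, but it can be used directly in the least-squares formulation of Eq. 6, with the frequencies of the encoded test predictions as the target vector.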

Consider Fig. 1, where a vertex is highlighted if it is misclassified by an \acappnp classifier [Gasteiger et al., 2018]. Note that the vertices with label 7 (dark green) are often confused with vertices of label 1 (blue) or 6 (orange) because there are many non-homophilic edges between those classes. Using \acacc, this would imply that the column vectors of labels 7, 1 and 6 are nearly linearly dependent, i.e., $C_{:,7}\approx\alpha\cdot C_{:,1}+(1-\alpha)\cdot C_{:,6}$ for some $\alpha\in[0,1]$. Using the neighborhood structure, \acnacc can break this symmetry. For labels 1 and 6, the majority of predicted neighbors will nearly always be of the same label due to homophily, whereas for label 7, both $\hat{Y}_{\mathcal{N}}=(1,6)$ and $\hat{Y}_{\mathcal{N}}=(6,1)$ are common. With this information, \acnacc is able to distinguish the confusion profile of label 7 from those of labels 1 and 6.

In principle, one could extend \acnacc to use even more neighborhood information, e.g., by considering the majority label of the neighbors of neighbors or by considering the second-most predicted neighboring label. However, given a finite training set $\mathcal{D}_{L}$, making the confusion profiles more fine-grained also makes the confusion estimate $\hat{\mathbf{C}}$ noisier, counteracting the potential gains of additional information. We found that using the 1-hop majority label is a good trade-off between class identifiability and confusion estimate noise.

4 Evaluation

We assess the performance of \acsis and \acnacc on a series of graph quantification tasks using both \acpps and covariate shift. The quantification methods are applied to the predictions of multiple node classifiers. As baselines, we compare our proposed \acgql methods with \acmlpe, (P)CC and (P)ACC.

4.1 Experimental Setup

Quantification Metrics

There is a large number of metrics to evaluate quantification methods [Esuli et al., 2023]. We use the following three: The \acae is one of the most commonly used metrics in quantification. It is defined as

\begin{equation}
\mathrm{AE}(q,\hat{q})=\frac{1}{K}\sum_{i=1}^{K}|q_{i}-\hat{q}_{i}|\ .
\tag{24}
\end{equation}

The \acrae [González-Castro et al., 2013] is a reweighted version of the \acae that penalizes deviations from labels with low prevalence more heavily:

\begin{equation}
\mathrm{RAE}(q,\hat{q})=\frac{1}{K}\sum_{i=1}^{K}\frac{|q_{i}-\hat{q}_{i}|}{q_{i}}\ .
\tag{25}
\end{equation}

Last, we also use the \ackld to measure the divergence between the true and estimated label distributions.
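The three metrics can be sketched as follows. \acae and \acrae follow Eqs. 24 and 25 directly. For the \ackld, no formula is given above, so the sketch assumes the common form $\sum_{i}q_{i}\log(q_{i}/\hat{q}_{i})$ with the true distribution as the first argument; no smoothing is applied, so \acrae is undefined if some $q_{i}=0$ and the \ackld is undefined if some $\hat{q}_{i}=0$ while $q_{i}>0$.

```python
import math

def absolute_error(q, q_hat):
    """Eq. 24: mean absolute deviation per class."""
    return sum(abs(a - b) for a, b in zip(q, q_hat)) / len(q)

def relative_absolute_error(q, q_hat):
    """Eq. 25: absolute deviations relative to the true prevalences."""
    return sum(abs(a - b) / a for a, b in zip(q, q_hat)) / len(q)

def kl_divergence(q, q_hat):
    """Assumed form sum_i q_i * log(q_i / q_hat_i); terms with q_i = 0 contribute zero."""
    return sum(a * math.log(a / b) for a, b in zip(q, q_hat) if a > 0)

# Hypothetical true and estimated label distributions for K = 3.
q = [0.5, 0.3, 0.2]
q_hat = [0.4, 0.4, 0.2]
ae = absolute_error(q, q_hat)
rae = relative_absolute_error(q, q_hat)
kld = kl_divergence(q, q_hat)
```

In this example, the same absolute deviation of $0.1$ contributes more to the \acrae for the class with prevalence $0.3$ than for the class with prevalence $0.5$.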

Datasets

We use five node classification benchmark datasets and introduce different types of distribution shifts to evaluate the mentioned quantification methods. The datasets come from two different domains: Three citation network datasets, namely, CoraML, CiteSeer and PubMed [McCallum et al., 2000, Giles et al., 1998, Getoor, 2005, Sen et al., 2008, Namata et al., 2012], and two co-purchase datasets, namely Amazon Photos and Amazon Computers [McAuley et al., 2015, Shchur et al., 2019]. All reported results were obtained by averaging over 10 random splits of the node set into classifier-train/quantifier-train/test, with sizes of $5\%$ / $15\%$ / $80\%$ respectively. Since random splitting does not generate distribution shift, we synthetically introduce shifts in the test splits in two ways:

\begin{enumerate}
\item \acspps: To simulate \acpps, we randomly sample a set of Zipf distributions over the labels [Qi et al., 2020]. Given one such Zipf distribution, we sample vertices uniformly at random for each label so that the target label frequencies are reached.
\item Covariate Shift: This shift is introduced by uniformly sampling start vertices for each label and then performing a randomized \acbfs to sample a neighborhood of a fixed size around each start vertex.
\end{enumerate}

In our experiments, the \acpps and the covariate-shifted test splits each consist of 100 vertices. For each label, we sample 10 corresponding shifted splits and report the average results.
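The two sampling procedures can be sketched as follows. This is an illustration under stated assumptions rather than the exact experimental code: the Zipf exponent `s`, the rounding of per-label counts, the `rng` argument and all function names are ours, and labels are 0-indexed.

```python
import random
from collections import deque

def zipf_distribution(K, s=1.0, rng=random):
    """Zipf weights 1 / rank^s assigned to a random permutation of the K labels."""
    ranks = list(range(1, K + 1))
    rng.shuffle(ranks)
    weights = [1.0 / r ** s for r in ranks]
    total = sum(weights)
    return [w / total for w in weights]

def sample_pps_split(labels, target, size, rng=random):
    """Sample vertices uniformly per label so that label frequencies follow `target`."""
    by_label = {}
    for v, y in labels.items():
        by_label.setdefault(y, []).append(v)
    split = []
    for y, p in enumerate(target):
        pool = by_label.get(y, [])
        n = min(round(p * size), len(pool))
        split += rng.sample(pool, n)
    return split

def sample_bfs_split(adj, start, size, rng=random):
    """Randomized BFS: visit neighbours in random order until `size` vertices are collected."""
    seen, queue = {start}, deque([start])
    while queue and len(seen) < size:
        u = queue.popleft()
        nbrs = list(adj[u])
        rng.shuffle(nbrs)
        for w in nbrs:
            if w not in seen and len(seen) < size:
                seen.add(w)
                queue.append(w)
    return seen
```

Because the \acbfs split is determined by graph structure rather than by labels, it changes the vertex distribution while leaving the labeling of each vertex untouched, which corresponds to covariate shift.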

Classifiers

We use four different node classifiers to predict the labels of the vertices: A structure-unaware \acmlp, a \acgcn [Kipf and Welling, 2017], \acgat [Veličković et al., 2018], and \acappnp [Gasteiger et al., 2018]. All models are trained on the same training splits with the same hyperparameters; each uses two hidden layers/convolutions of width 64 with ReLU activations. Each model is trained ten times on each of the ten splits per dataset, totalling 100 models per dataset with which each quantifier is evaluated.

Quantifiers

\Acsis and \acnacc are evaluated both separately and in combination. \Acsis is evaluated using the \acsp kernel with $\lambda=\nicefrac{1}{2}$ and the \acppr kernel from Eqs. 22 and 23.

4.2 Discussion of Results

Figure 2 compares the \acae of the different quantification methods with respect to classifier accuracy on the shifted test data for three distribution shift scenarios: no shift, \acpps and covariate shift.

Classifier Accuracy

The quality of the classifier has a significant impact on the quantification results, i.e., the error generally decreases with increasing classifier accuracy. Unsurprisingly, the structure-unaware \acmlp performs worst, while \acappnp generally yields the best results. Despite these differences, compared to the \accc quantifier, the \acs*acc-based methods are generally able to compensate for the misclassifications of the weak classifiers, flattening the error curve.
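To illustrate why adjustment can compensate for a weak classifier, the following minimal sketch contrasts classify-and-count with the well-known binary adjusted variant. The binary setting, the function names and the toy rates are assumptions of this example. Multi-class adjusted counting generalizes the same idea by solving a linear system built from an estimated confusion matrix.

```python
def classify_and_count(predictions):
    """Fraction of instances predicted positive (predictions are 0/1)."""
    return sum(predictions) / len(predictions)


def adjusted_classify_and_count(predictions, tpr, fpr):
    """Binary adjusted count: invert cc = tpr * p + fpr * (1 - p) for p.

    tpr and fpr are the classifier's true and false positive rates, assumed
    to have been estimated on held-out labeled data.
    """
    if tpr <= fpr:
        raise ValueError("adjustment requires tpr > fpr")
    cc = classify_and_count(predictions)
    estimate = (cc - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, estimate))  # clip to a valid prevalence


# Toy case: true prevalence 0.7, classifier with tpr = 0.8 and fpr = 0.1.
# Expected positive rate: 0.7 * 0.8 + 0.3 * 0.1 = 0.59.
predictions = [1] * 59 + [0] * 41
cc_estimate = classify_and_count(predictions)
acc_estimate = adjusted_classify_and_count(predictions, tpr=0.8, fpr=0.1)
```

In this toy case the unadjusted count underestimates the prevalence by 0.11, while the adjusted estimate recovers 0.7 up to floating-point error.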

Effectiveness of \acs*sis

We observe that the difficulty of quantification depends on the type of distribution shift. While quantification without shift is trivial by definition, \acpps is generally easier than covariate shift, as it requires neither \acsis nor the choice of an appropriate vertex kernel. Under \acpps, \acsis with the \acppr kernel generally performs worse than non-\acs*sis methods since this kernel does not reflect the data-generating process. In contrast, under covariate shift, \acsis generally outperforms the non-\acs*sis methods, demonstrating that kernel density estimation improves confusion estimates. However, on the CiteSeer dataset, \acsis performs significantly worse under covariate shift; this is likely because the \acppr kernel is not well-suited for this dataset, highlighting the importance of choosing an appropriate kernel.

Effectiveness of \acs*nacc

Since the goal of \acnacc is to improve class identifiability via structural information, its effectiveness depends on the presence of non-homophilic regions which add collinearities to the confusion matrix. We find that \acnacc improves quantification results.

Table 1: Quantification using probabilistic classifiers (absolute error, relative absolute error and KL divergence).

[Table body not recovered; it is read from tables/pcc.csv. Columns: Model, Shift, Quantifier, followed by AE, RAE and KLD for each of CoraML, CiteSeer, Amazon Photos, Amazon Computers and PubMed, and the average rank per metric.]

Table 1 shows probabilistic quantification results for the \acrae and \ackld metrics and \acsis with the \acsp kernel. The bold results indicate that there is no significant difference between the reported mean and the best mean within a given block, determined by the 95th percentile of a one-sided t-test. The experiments show that both \acsis and \acnacc are able to improve quantification results under \acpps and covariate shift.

5 Conclusion

We have introduced two novel graph quantification methods, \acsis and \acnacc; to our knowledge, this is the first work to investigate classifier-based graph quantification. \Acsis enables quantification under covariate shift by estimating the test confusion matrix using a kernel density estimate of the test instance distribution. \Acnacc uses the neighborhood structure of the graph to improve class identifiability. The effectiveness of our approach was demonstrated on multiple graph benchmark datasets.

We envision two lines of future research. First, in this work, we focused on extensions of \acacc to the graph setting. However, on tabular data, distribution matching quantifiers, such as DMy [González-Castro et al., 2013] or KDEy [Moreo et al., 2025], often outperform \acs*acc-based approaches. An extension of distribution matching to \acgql could further improve the quantification performance on graphs. Second, as our experiments showed, choosing an appropriate kernel in \acsis is important. While the simple \acsp and \acppr kernels generally perform well in our experiments, a deeper understanding of the practical applicability of vertex kernels for quantification is desirable. To this end, one could design an AutoML system that automatically determines the type of distribution shift in a graph quantification problem and selects an appropriate kernel based on that shift.

References

Barranquero et al. [2013] Jose Barranquero, Pablo González, Jorge Díez, and Juan José del Coz. On the study of nearest neighbor algorithms for prevalence estimation in binary problems. Pattern Recognition, 46(2):472–482, February 2013. ISSN 0031-3203. doi: 10.1016/j.patcog.2012.07.022.
Bella et al. [2010] Antonio Bella, Cesar Ferri, José Hernández-Orallo, and María José Ramírez-Quintana. Quantification via Probability Estimators. In 2010 IEEE International Conference on Data Mining, pages 737–742, December 2010. doi: 10.1109/ICDM.2010.75.
Bunse [2022] Mirko Bunse. On multi-class extensions of adjusted classify and count. In Proceedings of the 2nd International Workshop on Learning to Quantify (LQ 2022), pages 43–50, 2022.
Esuli et al. [2023] Andrea Esuli, Alessandro Fabris, Alejandro Moreo, and Fabrizio Sebastiani. Learning to Quantify, volume 1 of The Information Retrieval Series. Springer, Cham, March 2023. ISBN 978-3-031-20467-8.
Fawcett and Flach [2005] Tom Fawcett and Peter A. Flach. A Response to Webb and Ting’s On the Application of ROC Analysis to Predict Classification Performance Under Varying Class Distributions. Machine Learning, 58(1):33–38, January 2005. ISSN 1573-0565. doi: 10.1007/s10994-005-5256-4.
Forman [2005] George Forman. Counting positives accurately despite inaccurate classification. In Proceedings of the 16th European Conference on Machine Learning, ECML’05, pages 564–575, Berlin, Heidelberg, October 2005. Springer-Verlag. ISBN 978-3-540-29243-2. doi: 10.1007/11564096_55.
Forman [2006] George Forman. Quantifying trends accurately despite classifier error and class imbalance. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06, pages 157–166, New York, NY, USA, August 2006. Association for Computing Machinery. ISBN 978-1-59593-339-3. doi: 10.1145/1150402.1150423.
Forman [2008] George Forman. Quantifying counts and costs via classification. Data Mining and Knowledge Discovery, 17(2):164–206, June 2008. ISSN 1573-756X. doi: 10.1007/s10618-008-0097-y.
Gart and Buck [1966] John J. Gart and Alfred A. Buck. Comparison of a screening test and a reference test in epidemiologic studies. II. A probabilistic model for the comparison of diagnostic tests. American Journal of Epidemiology, 83(3):593–602, May 1966. ISSN 1476-6256, 0002-9262. doi: 10.1093/oxfordjournals.aje.a120610.
Gasteiger et al. [2018] Johannes Gasteiger, Aleksandar Bojchevski, and Stephan Günnemann. Predict then Propagate: Graph Neural Networks meet Personalized PageRank. In International Conference on Learning Representations, September 2018.
Getoor [2005] Lise Getoor. Link-based Classification. In Sanghamitra Bandyopadhyay, Ujjwal Maulik, Lawrence B. Holder, and Diane J. Cook, editors, Advanced Methods for Knowledge Discovery from Complex Data, Advanced Information and Knowledge Processing, pages 189–207. Springer, London, 2005. ISBN 978-1-84628-284-3. doi: 10.1007/1-84628-284-5_7.
Giles et al. [1998] C. Lee Giles, Kurt D. Bollacker, and Steve Lawrence. CiteSeer: An automatic citation indexing system. In Proceedings of the Third ACM Conference on Digital Libraries, DL ’98, pages 89–98, New York, NY, USA, May 1998. Association for Computing Machinery. ISBN 978-0-89791-965-4. doi: 10.1145/276675.276685.
González et al. [2024] Pablo González, Alejandro Moreo, and Fabrizio Sebastiani. Binary quantification and dataset shift: An experimental investigation. Data Mining and Knowledge Discovery, 38(4):1670–1712, July 2024. ISSN 1573-756X. doi: 10.1007/s10618-024-01014-1.
González-Castro et al. [2013] Víctor González-Castro, Rocío Alaiz-Rodríguez, and Enrique Alegre. Class distribution estimation based on the Hellinger distance. Information Sciences, 218:146–164, January 2013. ISSN 0020-0255. doi: 10.1016/j.ins.2012.05.028.
Kipf and Welling [2017] Thomas N. Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. In International Conference on Learning Representations, 2017.
Kull and Flach [2014] Meelis Kull and Peter A. Flach. Patterns of dataset shift. 2014.
Lipton et al. [2018] Zachary Lipton, Yu-Xiang Wang, and Alexander Smola. Detecting and Correcting for Label Shift with Black Box Predictors. In Proceedings of the 35th International Conference on Machine Learning, pages 3122–3130. PMLR, July 2018.
McAuley et al. [2015] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. Image-Based Recommendations on Styles and Substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’15, pages 43–52, New York, NY, USA, August 2015. Association for Computing Machinery. ISBN 978-1-4503-3621-5. doi: 10.1145/2766462.2767755.
McCallum et al. [2000] Andrew Kachites McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. Automating the Construction of Internet Portals with Machine Learning. Information Retrieval, 3(2):127–163, July 2000. ISSN 1573-7659. doi: 10.1023/A:1009953814988.
Moreo et al. [2025] Alejandro Moreo, Pablo González, and Juan José del Coz. Kernel density estimation for multiclass quantification. Machine Learning, 114(4):1–38, February 2025. ISSN 1573-0565. doi: 10.1007/s10994-024-06726-5.
Namata et al. [2012] Galileo Namata, Ben London, Lise Getoor, and Bert Huang. Query-driven Active Surveying for Collective Classification. In Proceedings of the Workshop on Mining and Learning with Graphs (MLG-2012), Edinburgh, Scotland, UK, 2012.
Page et al. [1999] Lawrence Page, Sergey Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. In The Web Conference, November 1999.
Qi et al. [2020] Lei Qi, Mohammed Khaleel, Wallapak Tavanapong, Adisak Sukul, and David Peterson. A Framework for Deep Quantification Learning. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2020, Ghent, Belgium, September 14–18, 2020, Proceedings, Part I, pages 232–248, Berlin, Heidelberg, September 2020. Springer-Verlag. ISBN 978-3-030-67657-5. doi: 10.1007/978-3-030-67658-2_14.
Saerens et al. [2002] Marco Saerens, Patrice Latinne, and Christine Decaestecker. Adjusting the Outputs of a Classifier to New a Priori Probabilities: A Simple Procedure. Neural Computation, 14(1):21–41, January 2002. ISSN 0899-7667. doi: 10.1162/089976602753284446.
Schölkopf et al. [2012] Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. In Proceedings of the 29th International Conference on Machine Learning, ICML’12, pages 459–466, Madison, WI, USA, June 2012. Omnipress. ISBN 978-1-4503-1285-1.
Sen et al. [2008] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina Eliassi-Rad. Collective Classification in Network Data. AI Magazine, 29(3):93–93, September 2008. ISSN 2371-9621. doi: 10.1609/aimag.v29i3.2157.
Shchur et al. [2019] Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. Pitfalls of Graph Neural Network Evaluation, June 2019.
Tasche [2017] Dirk Tasche. Fisher consistency for prior probability shift. Journal of Machine Learning Research, 18(1):3338–3369, January 2017. ISSN 1532-4435.
Veličković et al. [2018] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph Attention Networks. In International Conference on Learning Representations, February 2018.
Vucetic and Obradovic [2001] Slobodan Vucetic and Zoran Obradovic. Classification on Data with Biased Class Distribution. In Luc De Raedt and Peter Flach, editors, Machine Learning: ECML 2001, pages 527–538, Berlin, Heidelberg, 2001. Springer. ISBN 978-3-540-44795-5. doi: 10.1007/3-540-44795-4_45.

[Figure residue: one row per dataset (CoraML, CiteSeer, Amazon Photos, Amazon Computers, PubMed); panels (a) Hard Classifier and (b) Probabilistic Classifier.]