Efficient $k$ -means with Individual Fairness via Exponential Tilting

Shengkun Zhu, Jinshan Zeng, Yuan Sun, Sheng Wang, Xiaodong Li, Zhiyong Peng

Shengkun Zhu, Sheng Wang, and Zhiyong Peng are with the School of Computer Science, Wuhan University. E-mail: {whuzsk66, swangcs, peng}@whu.edu.cn
Jinshan Zeng is with the School of Computer and Information Engineering, Jiangxi Normal University. E-mail: jinshanzeng@jxnu.edu.cn
Yuan Sun is with La Trobe Business School, La Trobe University. E-mail: yuan.sun@latrobe.edu.au
Xiaodong Li is with the School of Computing Technologies, RMIT University. E-mail: xiaodong.li@rmit.edu.au

Abstract

In location-based resource allocation scenarios, the distances between each individual and the facility are desired to be approximately equal, thereby ensuring fairness. Individually fair clustering is often employed to achieve the principle of “treating all points equally”, which can be applied in these scenarios. This paper proposes a novel algorithm, tilted $k$ -means (TKM), aiming to achieve individual fairness in clustering to ensure fair allocation of resources. We integrate the exponential tilting into the sum of squared errors (SSE) to formulate a novel objective function called tilted SSE. We demonstrate that the tilted SSE can generalize to SSE and employ the coordinate descent and first-order gradient method for optimization. We propose a novel fairness metric, the variance of the squared distance of each point to its nearest centroid within a cluster, which can alleviate the Matthew Effect typically caused by existing fairness metrics. Our theoretical analysis demonstrates that the well-known $k$ -means++ incurs a multiplicative error of $O(k\log k)$ with our objective function, and we establish the convergence of TKM under mild conditions. In terms of fairness, we prove that the variance in each cluster generated by TKM decreases with $t$ , where $t$ is a hyperparameter that adjusts the trade-off between utility and fairness. In terms of efficiency, we demonstrate the time complexity of TKM is linear with the dataset size. Moreover, we demonstrate the monotonicity of the tilted SSE with respect to $t$ in a simple case. Our experimental results demonstrate that TKM outperforms state-of-the-art methods in effectiveness, fairness, and efficiency. Specifically, TKM exhibits a better trade-off between clustering utility and fairness than six baselines and achieves hundreds or even thousands of times acceleration in running time. Moreover, TKM can overcome the RAM overflow issue that other methods encounter with a large dataset size.

Index Terms:

Location-based resource allocation, $k$ -means, individual fairness, exponential tilting, coordinate descent, variance.

1 Introduction

In the era of big data, the scale of data is increasing exponentially [47, 65], emerging from diverse fields [29, 30] with rich information and potential value [64]. Clustering algorithms have become powerful tools for exploring the internal structure of datasets by partitioning data points into different clusters [7], where data points within the same cluster are similar to each other, while those in different clusters are dissimilar [34, 66]. $k$ -means is one of the most classic clustering algorithms, which measures the similarity between data points using Euclidean distance and is suitable for various types of data [46, 17, 23, 61]. This characteristic makes $k$ -means widely applicable in various location-based resource allocation scenarios, such as opening new facilities to serve residents [38, 31, 62, 54].

Figure 1: A comparison between $k$ -means and individually fair $k$ -means. $k$ -means results in those minority residents being too far from the centroid, while in the clustering results of individually fair $k$ -means, the distance of each resident to the centroid is approximately equal.

However, applying $k$ -means to resource allocation scenarios may lead to the issue of unfairness [53, 35, 5]. Consider the scenario in Figure 1(a): when setting up public facilities such as hospitals for residents, $k$ -means tends to place these facilities closer to densely populated areas, resulting in sparsely populated areas having difficulty accessing public resources and unfair treatment for minority residents. Individual fairness is a promising concept that can ensure that within the same cluster, each data point is treated approximately equally [42, 18]. Figure 1(b) shows the clustering results obtained by $k$ -means with individual fairness, where the distances from each resident to the centroid are approximately equal. In this case, we consider the clustering result to be fair.

One of the most widely studied concepts in individual fairness for $k$ -clustering is the “service in your neighborhood” proposed by Jung et al. [35]. This concept ensures that each data point has a centroid within a small constant factor of their neighborhood radius. The neighborhood radius is defined as the minimum radius of a ball centered at each data point that includes at least $n/k$ data points, where $n$ is the total number of data points. Several studies [48, 40] have made significant improvements in clustering utility and yielded tighter theoretical bounds based on this individual fairness concept. Mahabadi and Vakilian [40] introduced a local search method for $k$ -clustering that notably surpasses [35] in effectiveness. Negahbani and Chakrabarty [48] proposed leveraging linear programming to develop improved algorithms for individually fair $k$ -clustering, both theoretically and practically.

However, the fairness definition of these methods faces a similar issue. To illustrate this, let us consider the scenario in Fig. 2: point B resides in a sparsely populated area, with its neighborhood radius larger than point A situated in a densely populated area. This tends to result in opening more facilities in densely populated areas while only opening a few facilities in sparsely populated areas [48]. Within the same radius, individual A in densely populated areas has more opportunities to choose facilities, while individual B in sparsely populated areas may have only a single facility available. Moreover, densely populated areas attract more individuals due to abundant resources. To meet the needs of these individuals, additional facilities must be opened. This results in the development of sparse areas continually lagging behind. This phenomenon, also known as the Matthew Effect [44, 10], is a sociological concept describing how the distribution of resources, wealth, and opportunities tends to favor individuals who already possess them.

Moreover, existing individually fair clustering methods suffer from an efficiency issue: their running time depends heavily on the dataset size. The best known theoretical result gives a time complexity of $O(kn^{4})$ [48]. According to U.S. Census data, the population of New York City was 8.468 million in 2021 [4]. Because of this high time complexity, no existing individually fair clustering algorithm can effectively perform clustering analysis on a dataset of that scale. Moreover, as the dataset size increases, existing methods run into RAM overflow, since they compute the distance between each pair of data points and therefore must store an $n\times n$ array in memory. Additionally, in the clustering results obtained by these algorithms, each centroid must be selected from the data points, which is often unreasonable in real-world applications.
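To give a sense of scale, the following back-of-the-envelope calculation estimates the size of a pairwise distance array for a dataset with one point per New York City resident. The 8 bytes per entry (double precision) is an assumption made for illustration, not a figure taken from the cited methods.

$$n^{2}\times 8\ \text{bytes}=(8.468\times 10^{6})^{2}\times 8\ \text{bytes}\approx 7.17\times 10^{13}\times 8\ \text{bytes}\approx 5.7\times 10^{14}\ \text{bytes}\approx 574\ \text{TB}.$$

Even storing only the upper triangle of the symmetric array would merely halve this to roughly 287 TB, which is far beyond the RAM of a typical machine.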

Exponential tilting is a widely used technique to induce parametric shifts in distributions in various disciplines, including statistics [16, 56, 59], probability [25], information theory [43, 11], and optimization [51, 55]. Li et al. [37] first proposed using exponential tilting in machine learning to ensure the fairness of empirical risk minimization (TERM). The flexibility of TERM lies in its ability to adjust the impact of individual losses using a scale parameter and thus enables us to effectively tune the influence of minority data points as required [63]. TERM provided examples of exponential tilting in supervised learning, such as linear regression and logistic regression. However, the practical applications of exponential tilting in unsupervised learning, especially in clustering algorithms, remain unresolved. Furthermore, some theoretical analysis of TERM relies on the assumption that the objective function follows a generalized linear model, which does not hold for clustering algorithms.

We aim to exploit the ability of exponential tilting to induce parametric shifts in distributions in order to ensure individual fairness in clustering analysis. Building on this idea, we propose a novel loss function, tilted SSE, for the individually fair $k$ -means problem, and solve the problem through coordinate descent (CD) and stochastic gradient descent (SGD). This pulls the centroid of each cluster closer to minority data points, thus ensuring individual fairness. Moreover, we demonstrate that tilted SSE reduces to SSE when the scale parameter $t$ in TKM is set to 0. Because existing fairness metrics may exacerbate the Matthew Effect in location-based resource allocation scenarios, we propose a novel criterion for evaluating fairness within clusters: the variance of the distances between each data point and its centroid. Our fairness metric aims to treat each individual more equitably than existing metrics do, thereby mitigating the Matthew Effect.

Our theoretical analysis comprises five parts: approximation guarantee, convergence analysis, fairness analysis, efficiency analysis, and monotonicity analysis. Our approximation guarantee indicates that the centroids obtained through the well-known $k$ -means++ incur a multiplicative error of $O(k\log k)$ . We establish the convergence of TKM under mild conditions. Specifically, we demonstrate that the expected tilted SSE is non-increasing with respect to iterations. For fairness analysis, we demonstrate that the variance of distances in each cluster decreases as the hyperparameter $t$ in TKM increases. A smaller variance indicates greater fairness, implying that as $t$ grows, clustering becomes fairer. For efficiency analysis, we demonstrate that the time complexity of TKM is $O(kn)$ , similar to that of $k$ -means. Note that the time complexity of the state-of-the-art method is $O(kn^{4})$ [48], while that of TKM is linear with respect to $n$ . Therefore, TKM exhibits a significant advantage in efficiency compared to other methods. For monotonicity analysis, we demonstrate that the tilted SSE monotonically increases with $t$ in a simple case. This property may guide the choice of $t$ for TKM in practical applications.

Our experimental evaluations demonstrate the effectiveness, fairness, efficiency, and convergence of TKM over ten real-world datasets with five measurements. Our experimental findings indicate that TKM outperforms the state-of-the-art methods regarding the trade-off between clustering utility and fairness. Specifically, we use SSE to measure clustering utility. The SSE of TKM is lower than that of the state-of-the-art method and is very close to clustering algorithms that do not consider individual fairness on some datasets. To evaluate fairness, we use not only variance as a metric but also the maximum distance from each sample point to the centroid within each cluster. Our results show that TKM outperforms the state-of-the-art fair clustering methods on both metrics. Moreover, TKM’s performance in efficiency is remarkably impressive. Due to the linear time complexity with dataset size, TKM achieves acceleration of hundreds or even thousands of times compared to other fairness-aware clustering methods. Furthermore, TKM can overcome the RAM overflow issues in other methods when dealing with large-scale data. Additionally, we validate the impact of different hyperparameters in TKM on its convergence.

Our contributions are summarized as follows:

• We incorporate exponential tilting into SSE to propose a novel method for individually fair $k$ -means: TKM.
• We theoretically analyze TKM’s approximation guarantee, convergence, fairness, efficiency, and monotonicity.
• We experimentally validate the effectiveness, fairness, efficiency, and convergence of TKM.

The remaining sections are structured as follows: Section 2 presents the notations used in this paper, Section 3 presents the related work, Section 4 introduces the preliminaries used in our study, Section 5 outlines our proposed method, TKM, Section 6 validates our algorithm through experiments, Section 7 concludes our paper, and Section 8 presents the proofs of our theories.

TABLE I: Summary of notations

Notation	Description
$\mathcal{X}:=\{\boldsymbol{x}_{i}\}_{i=1}^{n}$	The dataset of $n$ points
$\mathcal{S}:=\{\mathcal{S}_{j}\}_{j=1}^{k}$	The set of $k$ clusters
$\mathcal{C}:=\{\boldsymbol{c}_{j}\}_{j=1}^{k}$	The set of $k$ centroids
$\overline{\psi},\overline{\phi}$	The SSE, tilted SSE of all clusters
$\psi,\phi$	The SSE, tilted SSE of each cluster
$\operatorname{\textup{m}},\operatorname{\textup{Tm}}$	The arithmetic, tilted mean operator
$\eta$	The learning rate
$E$	The epoch size

2 Notations

We use different text formatting styles to represent different mathematical concepts: plain letters for scalars, bold letters for vectors, and calligraphic letters for sets. For instance, $k$ represents a scalar, $\boldsymbol{x}$ represents a vector, and $\mathcal{C}$ denotes a set. Without loss of generality, all data points in this paper are represented using vectors. We use $[k]$ to represent the set $\{1,2,...,k\}$ . The symbol $\mathbb{E}$ denotes the expectation of a random variable, and we use “ $:=$ ” to indicate a definition. We use $\mathbb{I}$ to denote the identity matrix. We use $\|\cdot\|$ to denote the Euclidean norm of a vector. We use the symbol “ $\log$ ” to denote the natural logarithm with base $e$ . Table I lists the notations appearing in this paper and their interpretations.

3 Related Work

We provide an overview of previous studies on fair clustering and the application of exponential tilting in various fields and highlight the limitations of these studies.

Fair Clustering. Fairness in clustering algorithms is typically divided into two categories: group fairness and individual fairness [15, 18, 42, 27]. The goal of group fairness is to achieve clustering of a given set of points with minimal cost while ensuring that all clusters are balanced with respect to certain protected attributes, such as gender or race. Group fairness is not the focus of this paper, so interested readers can refer to [20, 12, 22, 67, 26].

The concept of individual fairness was initially introduced by Dwork et al. [28] in the context of classification, which posits that “similar individuals should be treated equally”. Several studies have explored this definition of individual fairness in clustering, such as [14, 19]. Another widely used and researched concept of individual fairness is referred to as “service in your neighborhood”, which was initially suggested by Jung et al. [35]. This concept aims to ensure that each data point has a centroid within at most a small constant factor of its neighborhood radius, where the neighborhood radius is the minimum radius of a ball centered at the data point $\boldsymbol{x}_{i}$ that includes at least $n/k$ data points. Subsequently, various methods addressing individually fair $k$ -clustering were based on this paradigm [48, 40, 21], along with numerous improved theoretical upper bounds [32, 60]. Mahabadi and Vakilian [40] introduced a local search algorithm for $k$ -clustering, which significantly outperforms the method proposed by Jung et al. [35] in terms of clustering utility. Negahbani and Chakrabarty [48] proposed leveraging linear programming techniques to develop improved algorithms for individually fair $k$ -clustering, both theoretically and practically. The fairness metric used in these methods can alleviate some of the unfairness in location-based resource allocation scenarios by ensuring that facilities are within a neighborhood radius of each data point. However, this metric might exacerbate the Matthew Effect, as it tends to result in more facilities being opened in densely populated areas while fewer facilities are opened in sparsely populated areas. Moreover, existing individually fair clustering methods share the same issue: prohibitively long running times. Specifically, the time complexity of [40] is $O(k^{5}n^{4})$ , and that of [48] is $O(kn^{4})$ . To address the running time issue, Chhaya et al. [21] proposed a method to reduce the dataset size by constructing a coreset. However, this approach results in diminished clustering utility and does not remove the inherent dependence of the computational complexity of existing individually fair clustering methods on the dataset size.

Exponential Tilting. We elucidate the concept of exponential tilting and explore its applications across various disciplines. Let $\mathcal{P}:=\{p_{\theta}\}$ be a set of probability distributions with parameter $\theta$ , $X$ denote a random variable drawn from the probability distribution $p_{\theta}$ , then for any $x\in\mathcal{X}$ , the information of $x$ under $\theta$ [24] is defined as

$$f(x,\theta):=-\log p_{\theta}(X=x). \tag{1}$$

When $\theta$ is not specified, we assume that $X$ is a random variable drawn from the distribution $p(\cdot)$ . Then the cumulant generating function of $f(X,\theta)$ [25] is defined as

$$\Delta_{X}(t,\theta):=\log\Bigl(\mathbb{E}\Bigl[e^{tf(X,\theta)}\Bigr]\Bigr)=\log\sum_{x}p(x)p_{\theta}(x)^{-t}, \tag{2}$$

where $\mathbb{E}[e^{tf(X,\theta)}]$ is commonly referred to as an exponential tilting of the information density, which induces a shift in the probability distribution parameterized by $\theta$ . Exponential tilting has been applied in numerous fields, such as statistics [16, 56, 59], applied probability [25], information theory [43, 11], and optimization [51, 55]. Interested readers can refer to [37] for a more detailed introduction. Currently, there are relatively few applications of exponential tilting in machine learning [37, 63, 58]. Li et al. [37] proposed tilted empirical risk minimization (TERM), which allows flexible tuning of individual losses, marking a pioneering move in machine learning. TERM offers several examples in supervised learning, including linear regression and logistic regression, as illustrated in Fig. 3. Subsequent research has also concentrated on supervised learning, such as the additive model [63] and semantic segmentation [58].

Remarks. 1) Current fairness metrics might exacerbate the Matthew Effect, as they tend to lead to more facilities being opened in densely populated areas while fewer facilities are opened in sparsely populated areas. 2) The efficiency of existing individually fair $k$ -clustering algorithms heavily depends on the number of samples $n$ of the dataset. 3) Existing individually fair clustering algorithms cannot flexibly tune the trade-off between utility and fairness. Moreover, these algorithms require cluster centroids to be one of the data points. 4) The current application of exponential tilting is still limited to supervised learning, and it has not been applied in unsupervised learning, especially in clustering.

4 Preliminaries

We begin by introducing the definition of $k$ -means. Then, we present the well-known $k$ -means++ initialization method.

4.1 $k$ -means

$k$ -means is a widely used clustering algorithm designed to partition a dataset into $k$ distinct clusters based on similarities among data points. Let $\mathcal{X}:=\{\boldsymbol{x}_{i}\}_{i=1}^{n}$ be a dataset of $n$ points. $k$ -means aims to find a set $\mathcal{S}:=\{\mathcal{S}_{j}\}_{j=1}^{k}$ of $k$ clusters such that the sum of squared errors (SSE) is minimized,

$$\min_{\mathcal{S},\mathcal{C}}\Bigl\{\overline{\psi}(\mathcal{S},\mathcal{C}):=\sum_{j=1}^{k}\frac{1}{n}\sum_{\boldsymbol{x}_{i}\in\mathcal{S}_{j}}f(\boldsymbol{x}_{i},\boldsymbol{c}_{j})\Bigr\}, \tag{3}$$

where $\mathcal{C}:=\{\boldsymbol{c}_{j}\}_{j=1}^{k}$ is a set of centroids, $\boldsymbol{c}_{j}$ is the centroid of cluster $\mathcal{S}_{j}$ , and $f(\boldsymbol{x}_{i},\boldsymbol{c}_{j}):=\|\boldsymbol{x}_{i}-\boldsymbol{c}_{j}\|^{2}$ denotes the square of the Euclidean distance from a data point $\boldsymbol{x}_{i}$ to the centroid $\boldsymbol{c}_{j}$ . The commonly used method for solving $k$ -means is the well-known Lloyd’s heuristic [39], which iteratively computes the assignment of each data point and the centroids through coordinate descent. Next, we provide a detailed description of the optimization process of Lloyd’s heuristic. We begin by presenting the equivalent form of Problem (3) as

$$\min_{\mathcal{S},\mathcal{C}}\Bigl\{\overline{\psi}(\mathcal{S},\mathcal{C}):=\sum_{j=1}^{k}\psi(\boldsymbol{\delta}_{j},\boldsymbol{c}_{j}):=\sum_{j=1}^{k}\frac{1}{n}\sum_{i=1}^{n}f(\boldsymbol{x}_{i},\boldsymbol{c}_{j})\delta_{ij}\Bigr\}, \tag{4}$$

where $\delta_{ij},i\in[n],j\in[k]$ denotes the assignment of each data point, that is, $\delta_{ij}=1$ if $\boldsymbol{x}_{i}\in\mathcal{S}_{j}$ and $\delta_{ij}=0$ otherwise, $\boldsymbol{\delta}_{j}:=(\delta_{1j},\delta_{2j},\cdots,\delta_{nj})\in\mathbb{R}^{n}$ represents the assignment of data points in the $j$ -th cluster, and $\psi(\boldsymbol{\delta}_{j},\boldsymbol{c}_{j}):=\frac{1}{n}\sum_{i=1}^{n}f(\boldsymbol{x}_{i},\boldsymbol{c}_{j})\delta_{ij}$ denotes the SSE in the cluster $\mathcal{S}_{j}$ . To solve Problem (4), one may iteratively assign each point to its nearest centroid and refine $\boldsymbol{c}_{j}$ using Lloyd’s heuristic. Following initialization, with $\boldsymbol{c}_{j}$ held constant, the solution for $\delta_{ij}$ can be obtained as

$$\delta_{ij}=\begin{cases}1,&j=\arg\min_{l}f(\boldsymbol{x}_{i},\boldsymbol{c}_{l}),\\ 0,&\text{otherwise.}\end{cases} \tag{5}$$

With $\delta_{ij}$ held constant, the solution for $\boldsymbol{c}_{j}$ is

$$\boldsymbol{c}_{j}=\operatorname{\textup{m}}(\mathcal{S}_{j})=\frac{\sum_{i=1}^{n}\delta_{ij}\cdot\boldsymbol{x}_{i}}{\sum_{i=1}^{n}\delta_{ij}}, \tag{6}$$

where $\operatorname{\textup{m}}(\cdot)$ is the operator that computes the arithmetic mean of the points in a cluster.

Algorithm 1: $k$ -means++
Input: $\mathcal{X}:=\{\boldsymbol{x}_{i}\}_{i=1}^{n}$ , $k$ .
1 $\mathcal{C}\leftarrow$ Sample a point uniformly from $\mathcal{X}$ ;
2 $\mathcal{C}\leftarrow$ Sample the next centroid $\boldsymbol{c}_{j}\in\mathcal{X}$ with probability $\frac{D(\boldsymbol{c}_{j})^{2}}{\sum_{\boldsymbol{x}_{i}\in\mathcal{X}}D(\boldsymbol{x}_{i})^{2}}$ ;
3 Repeat Step 2 until $k$ centroids are chosen;
// Coordinate descent.
4 while not converged do
5     Update $\delta_{ij},i\in[n],j\in[k]$ by Equation (5);
6     Update $\boldsymbol{c}_{j},j\in[k]$ by Equation (6);
7 return $\mathcal{C}$

4.2 $k$ -means++

$k$ -means++ improves on $k$ -means by providing a more effective strategy for selecting the initial centroids, thus enhancing both speed and accuracy [8]. We provide the details of $k$ -means++ in Algorithm 1. The first centroid is sampled uniformly at random from the dataset (the first sampling step in Algorithm 1). Let $D(\boldsymbol{x}_{i})$ be the distance from a data point $\boldsymbol{x}_{i}$ to the closest centroid chosen so far. Each subsequent centroid is chosen from the data points based on their squared distances to the nearest existing centroids, where a point $\boldsymbol{x}^{\prime}$ is selected with probability $\frac{D(\boldsymbol{x}^{\prime})^{2}}{\sum_{\boldsymbol{x}_{i}\in\mathcal{X}}D(\boldsymbol{x}_{i})^{2}}$ (the second sampling step in Algorithm 1). This sampling step is repeated until $k$ centroids are chosen. After selecting $k$ centroids, $\boldsymbol{\delta}_{j}$ and $\boldsymbol{c}_{j}$ are updated through coordinate descent, which is identical to Lloyd’s heuristic (the while loop in Algorithm 1).
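The seeding and coordinate-descent steps described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names `kmeanspp_init` and `lloyd`, the iteration cap, and the handling of empty clusters are assumptions introduced here.

```python
import numpy as np

def kmeanspp_init(X, k, rng):
    """Seed k centroids: the first uniformly, the rest with probability proportional to D(x)^2."""
    n = X.shape[0]
    centroids = [X[rng.integers(n)]]
    for _ in range(k - 1):
        chosen = np.array(centroids)
        # Squared distance from every point to its closest already-chosen centroid.
        d2 = ((X[:, None, :] - chosen[None, :, :]) ** 2).sum(-1).min(axis=1)
        # Assumes at least one point differs from the chosen centroids (d2.sum() > 0).
        centroids.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centroids)

def lloyd(X, C, max_iter=100):
    """Alternate the assignment step (Equation (5)) and the mean update (Equation (6))."""
    C = C.astype(float).copy()
    for _ in range(max_iter):
        labels = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        # Keep the old centroid if a cluster happens to be empty (an illustrative choice).
        new_C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                          for j in range(C.shape[0])])
        if np.allclose(new_C, C):
            break
        C = new_C
    labels = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(axis=1)
    return C, labels
```

On two well-separated groups of points, the seeding almost always places one centroid in each group, and the coordinate descent then converges to the two group means.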

5 Proposed TKM

In this section, we begin by proposing the objective function of tilted $k$ -means (TKM) and presenting the corresponding optimization method. Then, we theoretically analyze the convergence, approximation guarantee, fairness, efficiency, and monotonicity of TKM.

5.1 Objective Function of TKM

Due to the characteristic of exponential tilting inducing parametric shifts in distributions, we consider incorporating exponential tilting into SSE to obtain tilted SSE. The objective of tilted $k$ -means is to minimize the tilted SSE within each cluster as follows:

$$\min_{\mathcal{S},\mathcal{C}}\Bigl\{\overline{\phi}(t,\mathcal{S},\mathcal{C}):=\sum_{j=1}^{k}\phi(t,\boldsymbol{\delta}_{j},\boldsymbol{c}_{j}):=\sum_{j=1}^{k}\frac{1}{t}\log\frac{1}{n}\sum_{i=1}^{n}e^{tf(\boldsymbol{x}_{i},\boldsymbol{c}_{j})\delta_{ij}}\Bigr\}, \tag{7}$$

where $t>0$ is a hyperparameter. Note that when $\mathcal{P}$ in (1) is an exponential set of distributions parameterized by $\boldsymbol{c}_{j}$ and $\boldsymbol{\delta}_{j}$ , the cumulant generating function can be written as:

$$\Delta_{X}(t,\boldsymbol{\delta}_{j},\boldsymbol{c}_{j}):=\log\frac{1}{n}\sum_{i=1}^{n}e^{tf(\boldsymbol{x}_{i},\boldsymbol{c}_{j})\delta_{ij}}. \tag{8}$$

Therefore, it is clear that the objective function of TKM can be considered as a properly scaled summation version of the cumulant generating function in Equation (8).

Next, we consider the case of $t=0$ in TKM. When $t\to 0$ , according to L’Hôpital’s rule, it holds that:

$$\lim_{t\to 0}\overline{\phi}(t,\mathcal{S},\mathcal{C})=\frac{1}{n}\sum_{j=1}^{k}\sum_{i=1}^{n}f(\boldsymbol{x}_{i},\boldsymbol{c}_{j})\delta_{ij}. \tag{9}$$

Therefore, when $t\to 0$ , tilted SSE reduces to SSE. Accordingly, we define

$$\phi(0,\boldsymbol{\delta}_{j},\boldsymbol{c}_{j}):=\psi(\boldsymbol{\delta}_{j},\boldsymbol{c}_{j}). \tag{10}$$
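The limit in Equation (9) can be checked numerically. The sketch below evaluates the tilted SSE of Equation (7) and the SSE of Equation (4) on a toy dataset; the function names `sse` and `tilted_sse` and the max-shift used to keep the exponentials from overflowing are assumptions introduced for illustration.

```python
import numpy as np

def sse(X, C, labels):
    """SSE of Equation (4): squared distances to the assigned centroids, divided by n."""
    return ((X - C[labels]) ** 2).sum() / X.shape[0]

def tilted_sse(X, C, labels, t):
    """Tilted SSE of Equation (7) for t > 0."""
    n = X.shape[0]
    f = ((X - C[labels]) ** 2).sum(axis=1)
    total = 0.0
    for j in range(C.shape[0]):
        # Points outside cluster j have delta_ij = 0 and contribute exp(0) = 1.
        exponents = np.where(labels == j, t * f, 0.0)
        m = exponents.max()  # shift for numerical stability
        total += (m + np.log(np.exp(exponents - m).sum() / n)) / t
    return total

X = np.array([[0.0], [1.0], [4.0], [5.0]])
C = np.array([[0.0], [4.5]])
labels = np.array([0, 0, 1, 1])
print(sse(X, C, labels), tilted_sse(X, C, labels, 1e-6), tilted_sse(X, C, labels, 2.0))
```

For a very small $t$ the two quantities agree closely, while for larger $t$ the tilted SSE is at least as large as the SSE, which follows from Jensen's inequality applied to each cluster term.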

Algorithm 2: Solving tilted $k$ -means via SGD
Input: $\mathcal{X}:=\{\boldsymbol{x}_{i}\}_{i=1}^{n}$ , $k$ : # of clusters, $E$ : # of epochs.
1 Initialize $\mathcal{C}:=\{\boldsymbol{c}_{j}\}_{j=1}^{k}$ via $k$ -means++;
2 while not converged do
      /* Assignment. */
3     Update $\delta_{ij},i\in[n],j\in[k]$ by (5);
      /* Refinement. */
4     for $e=1,\cdots,E$ do
5         Sample a mini-batch $\mathcal{B}$ from $\mathcal{X}$ ;
6         Update $\boldsymbol{c}_{j},j\in[k]$ by (13);
7 return $\mathcal{C}$

5.2 Solving Tilted $k$ -means

Since Problem (7) involves a highly non-convex objective function with multi-block variables, we consider using coordinate descent (CD) to solve it. We begin by fixing $\boldsymbol{c}_{j}$ and solving for $\delta_{ij}$ . Because the objective function is monotonically increasing with respect to $tf(\boldsymbol{x}_{i},\boldsymbol{c}_{j})$ , the solution for $\delta_{ij}$ is identical to that of Equation (5). Next, we fix $\delta_{ij}$ and solve for $\mathcal{C}$ . Since the tilted SSE is convex with respect to $\boldsymbol{c}_{j}$ (this property will be proven in Section 5.3), we can derive the optimality condition for the tilted SSE with respect to $\boldsymbol{c}_{j}$ . We first present the first-order gradient of $\overline{\phi}(t,\mathcal{S},\mathcal{C})$ with respect to $\boldsymbol{c}_{j}$ as follows,

$$\begin{aligned}\nabla_{\boldsymbol{c}_{j}}\overline{\phi}(t,\mathcal{S},\mathcal{C})&=\nabla_{\boldsymbol{c}_{j}}\phi(t,\boldsymbol{\delta}_{j},\boldsymbol{c}_{j})\\ &=\frac{\sum_{\boldsymbol{x}_{i}\in\mathcal{S}_{j}}e^{tf(\boldsymbol{x}_{i},\boldsymbol{c}_{j})}\cdot\nabla_{\boldsymbol{c}_{j}}f(\boldsymbol{x}_{i},\boldsymbol{c}_{j})}{\sum_{i=1}^{n}e^{tf(\boldsymbol{x}_{i},\boldsymbol{c}_{j})\delta_{ij}}},\end{aligned} \tag{11}$$

where $\nabla_{\boldsymbol{c}_{j}}f(\boldsymbol{x}_{i},\boldsymbol{c}_{j}):=2(\boldsymbol{c}_{j}-\boldsymbol{x}_{i})$ is the first-order gradient of $f(\boldsymbol{x}_{i},\boldsymbol{c}_{j})$ with respect to $\boldsymbol{c}_{j}$ . Setting Equation (11) equal to zero then yields the optimality condition for $\boldsymbol{c}_{j}$ :

$$\sum_{\boldsymbol{x}_{i}\in\mathcal{S}_{j}}e^{t\|\boldsymbol{x}_{i}-\boldsymbol{c}_{j}\|^{2}}(\boldsymbol{x}_{i}-\boldsymbol{c}_{j})=0. \tag{12}$$

We define a tilted mean operator $\operatorname{\textup{Tm}}(\cdot)$ , where $\boldsymbol{c}_{j}=\operatorname{\textup{Tm}}(t,\mathcal{S}_{j})$ represents the value of $\boldsymbol{c}_{j}$ that satisfies Equation (12). Note that obtaining a closed-form solution for $\boldsymbol{c}_{j}$ from Equation (12) is nontrivial; therefore, we employ a first-order gradient method to solve for $\boldsymbol{c}_{j}$ . Let $\mathcal{B}$ be a mini-batch sampled from $\mathcal{X}$ ; then $\boldsymbol{c}_{j}$ is updated as follows:

$$\boldsymbol{c}_{j}\leftarrow\boldsymbol{c}_{j}-\eta\cdot\nabla_{\boldsymbol{c}_{j}}\overline{\phi}(t,\mathcal{B},\mathcal{C}), \tag{13}$$

$$\nabla_{\boldsymbol{c}_{j}}\overline{\phi}(t,\mathcal{B},\mathcal{C})=\frac{\sum_{\boldsymbol{x}_{i}\in\mathcal{B}_{j}}e^{tf(\boldsymbol{x}_{i},\boldsymbol{c}_{j})}\cdot\nabla_{\boldsymbol{c}_{j}}f(\boldsymbol{x}_{i},\boldsymbol{c}_{j})}{\sum_{i=1}^{n}e^{tf(\boldsymbol{x}_{i},\boldsymbol{c}_{j})\delta_{ij}}}, \tag{14}$$

where $\eta$ is the learning rate and $\mathcal{B}_{j}:=\mathcal{S}_{j}\cap\mathcal{B}$ . First-order gradient methods are commonly used for solving problems of this kind. Interested readers may also consider second-order methods, such as Newton’s method [49], for solving $\boldsymbol{c}_{j}$ .
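To illustrate how the tilted mean differs from the arithmetic mean, the following sketch runs full-batch gradient descent on a single cluster. It is a simplified setting, not the mini-batch update itself: the cluster is assumed to contain all $n$ points, so the denominator of Equation (11) is the sum over the cluster, and the function name `tilted_mean`, the step size, and the step count are hypothetical choices.

```python
import numpy as np

def tilted_mean(S, t, lr=0.05, steps=2000):
    """Approximate Tm(t, S) by gradient descent, starting from the arithmetic mean."""
    c = S.mean(axis=0)
    for _ in range(steps):
        f = ((S - c) ** 2).sum(axis=1)
        # Normalized weights e^{t f_i} / sum_l e^{t f_l}; the shift avoids overflow.
        w = np.exp(t * (f - f.max()))
        w /= w.sum()
        # d/dc ||x - c||^2 = 2 (c - x), weighted as in Equation (11).
        grad = 2.0 * (w[:, None] * (c - S)).sum(axis=0)
        c = c - lr * grad
    return c

S = np.array([[0.0], [0.0], [0.0], [10.0]])
print(tilted_mean(S, t=0.0))  # arithmetic mean, 2.5
print(tilted_mean(S, t=0.1))  # pulled toward the isolated point at 10
```

With $t=0$ the weights are uniform and the result is the arithmetic mean. With $t=0.1$ the stationary point of Equation (12) for this toy cluster lies between 4.5 and 4.6, so the centroid moves noticeably toward the minority point.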

Algorithm Description. The algorithmic process of TKM can be summarized in three parts: initialization, assignment, and refinement. We provide algorithm details for TKM in Algorithm 2 and an example in Fig. 4. First, the centroid set $\mathcal{C}$ is initialized using $k$ -means++ (the initialization step of Algorithm 2). Subsequently, we employ CD to iteratively solve for $\delta_{ij}$ (assignment) and $\boldsymbol{c}_{j}$ (refinement) (the while loop of Algorithm 2). We use $E$ epochs for solving $\boldsymbol{c}_{j}$ , where in each epoch a mini-batch $\mathcal{B}$ is sampled from $\mathcal{X}$ , and the data points within $\mathcal{B}\cap\mathcal{S}_{j}$ are used to update $\boldsymbol{c}_{j}$ using Equation (13).
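The three parts described above can be put together in a short NumPy sketch. It assumes the initial centroids are supplied by the caller (for example from $k$ -means++ seeding) and is not the authors' implementation: the function name `tkm`, the default step size, batch size, outer-iteration cap, stopping tolerance, and the max-shift for numerical stability are all illustrative assumptions.

```python
import numpy as np

def tkm(X, C, t, lr=0.05, epochs=5, batch_size=32, max_outer=50, seed=0):
    """Sketch of tilted k-means: alternate assignment (5) and mini-batch updates (13)-(14)."""
    rng = np.random.default_rng(seed)
    n, k = X.shape[0], C.shape[0]
    C = C.astype(float).copy()
    for _ in range(max_outer):
        # Assignment: each point goes to its nearest centroid (Equation (5)).
        labels = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        C_old = C.copy()
        # Refinement: E mini-batch gradient steps per outer iteration.
        for _ in range(epochs):
            batch = rng.choice(n, size=min(batch_size, n), replace=False)
            for j in range(k):
                in_j = labels == j
                Bj = batch[in_j[batch]]  # indices of batch points assigned to cluster j
                if Bj.size == 0:
                    continue
                f_all = ((X[in_j] - C[j]) ** 2).sum(axis=1)
                m = t * f_all.max()  # shift so that the exponentials cannot overflow
                # Denominator of (14): cluster points contribute e^{t f}, all others e^0 = 1.
                denom = np.exp(t * f_all - m).sum() + (n - in_j.sum()) * np.exp(-m)
                f_b = ((X[Bj] - C[j]) ** 2).sum(axis=1)
                w = np.exp(t * f_b - m) / denom
                # Gradient of ||x - c||^2 with respect to c is 2 (c - x).
                grad = 2.0 * (w[:, None] * (C[j] - X[Bj])).sum(axis=0)
                C[j] = C[j] - lr * grad
        if np.linalg.norm(C - C_old) < 1e-6:
            break
    return C, labels
```

Each gradient step touches every point at most a constant number of times per cluster, so the cost per outer iteration grows linearly with $n$ , and no $n\times n$ distance array is ever formed.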

5.3 Theoretical Analysis

Our theoretical analysis consists of five parts. The first part provides an approximation guarantee for the initial centroids obtained by $k$ -means++ with respect to the tilted SSE. Then we present a convergence analysis of TKM. Next, we delve into a fairness analysis of TKM. In the fourth part, we explore the time complexity of TKM. Finally, we analyze the monotonicity of the tilted SSE using a simple case.

5.3.1 Definitions and Assumptions

We begin by providing some definitions and assumptions used throughout our theories.

Definition 1 (Tilted weight).

Given a cluster $\mathcal{S}_{j}$ and a centroid $\boldsymbol{c}_{j}$ , the tilted weight $w_{i}(t,\mathcal{S}_{j},\boldsymbol{c}_{j})$ of a data point $\boldsymbol{x}_{i}$ is defined as

$$w_{i}(t,\mathcal{S}_{j},\boldsymbol{c}_{j}):=\frac{e^{tf(\boldsymbol{x}_{i},\boldsymbol{c}_{j})}}{\sum_{\boldsymbol{x}_{l}\in\mathcal{S}_{j}}e^{tf(\boldsymbol{x}_{l},\boldsymbol{c}_{j})}}. \tag{15}$$

Definition 2 (Tilted empirical mean and variance).

Let $\mathrm{f}(\mathcal{S}_{j},\boldsymbol{c}_{j}):=\bigl\{f(\boldsymbol{x}_{i},\boldsymbol{c}_{j})|\boldsymbol{x}_{i}\in\mathcal{S}_{j}\bigr\}$ be the set of squared Euclidean distances of points in $\mathcal{S}_{j}$ to the centroid $\boldsymbol{c}_{j}$ , then the tilted empirical mean and variance in the cluster $\mathcal{S}_{j}$ are defined as

	$\displaystyle\mathbb{E}_{t}\bigl{(}\mathrm{f}(\mathcal{S}_{j},\boldsymbol{c}_{% j})\bigr{)}$	$\displaystyle:=\sum_{\boldsymbol{x}_{i}\in\mathcal{S}_{j}}w_{i}(t,\mathcal{S}_% {j},\boldsymbol{c}_{j})\cdot f(\boldsymbol{x}_{i},\boldsymbol{c}_{j}),$		(16)
	$\displaystyle\mathrm{Var}_{t}\bigl{(}\mathrm{f}(\mathcal{S}_{j},\boldsymbol{c}% _{j})\bigr{)}$	$\displaystyle:=\mathbb{E}_{t}\Bigl{(}f(\boldsymbol{x}_{i},\boldsymbol{c}_{j})-% \mathbb{E}_{t}\bigl{(}\mathrm{f}(\mathcal{S}_{j},\boldsymbol{c}_{j})\bigr{)}% \Bigr{)}^{2}.$		(17)

Note that when $t=0$ , tilted empirical mean and variance generalize to the standard mean and variance in statistics.
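Definitions 1 and 2 translate directly into code. The following is a minimal sketch, assuming the squared distances of one cluster are already available as a list; the function names are illustrative and not taken from the released code. It also checks numerically that $t=0$ recovers the ordinary mean and (population) variance.

```python
import math
from statistics import mean, pvariance

def tilted_weights(sq_dists, t):
    """Tilted weights of Definition 1 for one cluster: a softmax of t * f."""
    m = max(sq_dists)  # subtract the max before exponentiating for stability
    exps = [math.exp(t * (f - m)) for f in sq_dists]
    total = sum(exps)
    return [e / total for e in exps]

def tilted_mean(sq_dists, t):
    """Tilted empirical mean, Equation (16)."""
    w = tilted_weights(sq_dists, t)
    return sum(wi * f for wi, f in zip(w, sq_dists))

def tilted_variance(sq_dists, t):
    """Tilted empirical variance, Equation (17)."""
    w = tilted_weights(sq_dists, t)
    mu = sum(wi * f for wi, f in zip(w, sq_dists))
    return sum(wi * (f - mu) ** 2 for wi, f in zip(w, sq_dists))

sq_dists = [0.5, 1.0, 1.5, 6.0]  # illustrative squared distances, not from any dataset
print(tilted_mean(sq_dists, 0.0), mean(sq_dists))
print(tilted_variance(sq_dists, 0.0), pvariance(sq_dists))
print(tilted_mean(sq_dists, 1.0))  # larger t puts more weight on the far point
```

For $t>0$ the weights emphasize points far from the centroid, so the tilted mean moves from the plain average towards the largest squared distance.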

Definition 3 (Gradient Lipschitz Continuity).

A continuously differentiable function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is gradient Lipschitz continuous with Lipschitz constant $L>0$ if its gradient $\nabla f:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ satisfies, for any $\boldsymbol{c},\boldsymbol{c}^{\prime}\in\mathbb{R}^{d}$ ,

$$\|\nabla f(\boldsymbol{c})-\nabla f(\boldsymbol{c}^{\prime})\|\leq L\|\boldsymbol{c}-\boldsymbol{c}^{\prime}\|.\tag{18}$$

Definition 4 (Tilted Hessian).

For any $t\geq 0$ , we define the tilted Hessian $\nabla^{2}_{\boldsymbol{c}_{j}\boldsymbol{c}_{j}^{\top}}\phi(t,\boldsymbol{\delta}_{j},\boldsymbol{c}_{j})$ as the Hessian of $\phi(t,\boldsymbol{\delta}_{j},\boldsymbol{c}_{j})$ with respect to $\boldsymbol{c}_{j}$ . That is,

$$\begin{aligned}\nabla^{2}_{\boldsymbol{c}_{j}\boldsymbol{c}_{j}^{\top}}\phi(t,\boldsymbol{\delta}_{j},\boldsymbol{c}_{j})=&\,\frac{t}{n}\sum_{i=1}^{n}\bigl(\nabla_{\boldsymbol{c}_{j}}f(\boldsymbol{x}_{i},\boldsymbol{c}_{j})\delta_{ij}-\nabla_{\boldsymbol{c}_{j}}\phi(t,\boldsymbol{\delta}_{j},\boldsymbol{c}_{j})\bigr)\bigl(\nabla_{\boldsymbol{c}_{j}}f(\boldsymbol{x}_{i},\boldsymbol{c}_{j})\delta_{ij}-\nabla_{\boldsymbol{c}_{j}}\phi(t,\boldsymbol{\delta}_{j},\boldsymbol{c}_{j})\bigr)^{\top}e^{t\bigl(f(\boldsymbol{x}_{i},\boldsymbol{c}_{j})\delta_{ij}-\phi(t,\boldsymbol{\delta}_{j},\boldsymbol{c}_{j})\bigr)}\\&+\frac{1}{n}\sum_{i=1}^{n}e^{t\bigl(f(\boldsymbol{x}_{i},\boldsymbol{c}_{j})\delta_{ij}-\phi(t,\boldsymbol{\delta}_{j},\boldsymbol{c}_{j})\bigr)}\cdot 2\mathbb{I}\delta_{ij},\end{aligned}$$

where $\mathbb{I}$ is an identity matrix of appropriate size.

Lemma 1 (Strong Convexity of Tilted SSE [37]).

For any $t\geq 0$ , the tilted SSE is strongly convex with respect to $\boldsymbol{c}_{j}$ . That is

$$\nabla^{2}_{\boldsymbol{c}_{j}\boldsymbol{c}_{j}^{\top}}\phi(t,\boldsymbol{\delta}_{j},\boldsymbol{c}_{j})\succeq\frac{2|\mathcal{S}_{j}|}{n}\mathbb{I}.$$

Proof.

Note that the first term in tilted Hessian is positive semi-definite, and the second term is positive definite and lower bounded by $\frac{2|\mathcal{S}_{j}|}{n}\mathbb{I}$ , which completes the proof. ∎
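The bound in Lemma 1 can be checked numerically on a small example. The sketch below is illustrative only: it assumes $f(\boldsymbol{x},\boldsymbol{c})=\|\boldsymbol{x}-\boldsymbol{c}\|^{2}$ (so that $\nabla_{\boldsymbol{c}}f=2(\boldsymbol{c}-\boldsymbol{x})$), restricts itself to two-dimensional points so that the smallest eigenvalue has a closed form, and uses randomly generated data and hypothetical function names.

```python
import math
import random

def sq_dist(x, c):
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def tilted_hessian_2d(points, delta, c, t):
    """Tilted Hessian of Definition 4 for 2-D points, returned as (h11, h12, h22); needs t > 0."""
    n = len(points)
    g = [sq_dist(x, c) * d for x, d in zip(points, delta)]          # f(x_i, c_j) * delta_ij
    phi = math.log(sum(math.exp(t * gi) for gi in g) / n) / t       # per-cluster tilted SSE
    e = [math.exp(t * (gi - phi)) for gi in g]                      # e^{t (f delta - phi)}
    grads = [[2.0 * (ci - xi) * d for xi, ci in zip(x, c)] for x, d in zip(points, delta)]
    grad_phi = [sum(ei * gr[a] for ei, gr in zip(e, grads)) / n for a in range(2)]
    h11 = h12 = h22 = 0.0
    for ei, gr, d in zip(e, grads, delta):
        u0, u1 = gr[0] - grad_phi[0], gr[1] - grad_phi[1]
        h11 += (t * u0 * u0 + 2.0 * d) * ei / n   # outer-product term plus 2 * I * delta term
        h12 += t * u0 * u1 * ei / n
        h22 += (t * u1 * u1 + 2.0 * d) * ei / n
    return h11, h12, h22

def min_eigenvalue(h11, h12, h22):
    """Smallest eigenvalue of the symmetric 2x2 matrix [[h11, h12], [h12, h22]]."""
    return (h11 + h22) / 2.0 - math.hypot((h11 - h22) / 2.0, h12)

rng = random.Random(0)
points = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(50)]
delta = [1 if i < 20 else 0 for i in range(50)]   # the first 20 points form the cluster S_j
lam = min_eigenvalue(*tilted_hessian_2d(points, delta, [0.1, -0.2], t=0.5))
bound = 2 * sum(delta) / len(points)               # 2 |S_j| / n
print(lam, bound)
```

The smallest eigenvalue should never fall below $2|\mathcal{S}_{j}|/n$, in agreement with the lemma.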

Lemma 2 (Gradient Lipschitz Continuity of Tilted SSE [37]).

For any $t\geq 0$ , $\nabla_{\boldsymbol{c}_{j}}\phi(t,\boldsymbol{\delta}_{j},\boldsymbol{c}_{j})$ is $L(t)$ -Lipschitz with respect to $\boldsymbol{c}_{j}$ , where $L(t):=\sigma_{\max}\Bigl(\nabla^{2}_{\boldsymbol{c}_{j}\boldsymbol{c}_{j}^{\top}}\phi(t,\boldsymbol{\delta}_{j},\boldsymbol{c}_{j})\Bigr)$ , and $\sigma_{\max}$ denotes the largest eigenvalue.

Assumption 1.

Let $g(\mathcal{B},\boldsymbol{c}_{j}):=\overline{\phi}(t,\mathcal{B},\mathcal{C})$ denote the mini-batch gradient of $\overline{\phi}(t,\mathcal{S},\mathcal{C})$ , then the following conditions hold:

•

There exist scalars $\mu_{G}\geq\mu>0$ such that for any $\boldsymbol{c}_{j}\in\mathbb{R}^{d}$ ,

$$\begin{aligned}\nabla_{\boldsymbol{c}_{j}}\phi(t,\boldsymbol{\delta}_{j},\boldsymbol{c}_{j})^{\top}\mathbb{E}[g(\mathcal{B},\boldsymbol{c}_{j})]&\geq\mu\cdot\|\nabla_{\boldsymbol{c}_{j}}\phi(t,\boldsymbol{\delta}_{j},\boldsymbol{c}_{j})\|^{2},\\\|\mathbb{E}[g(\mathcal{B},\boldsymbol{c}_{j})]\|&\leq\mu_{G}\cdot\|\nabla_{\boldsymbol{c}_{j}}\phi(t,\boldsymbol{\delta}_{j},\boldsymbol{c}_{j})\|.\end{aligned}$$

•

There exist scalars $\nu\geq 0$ and $\nu_{H}\geq 0$ such that for any $\boldsymbol{c}_{j}\in\mathbb{R}^{d}$ , it holds that

$$\mathbb{E}\bigl[\|g(\mathcal{B},\boldsymbol{c}_{j})-\mathbb{E}[g(\mathcal{B},\boldsymbol{c}_{j})]\|^{2}\bigr]\leq\nu+\nu_{H}\cdot\|\nabla_{\boldsymbol{c}_{j}}\phi(t,\boldsymbol{\delta}_{j},\boldsymbol{c}_{j})\|^{2}.$$

The first requirement in Assumption 1 states that, in expectation, the vector $-g(\mathcal{B},\boldsymbol{c}_{j})$ is a direction of sufficient descent for $\phi$ from $\boldsymbol{c}_{j}$ , with a norm comparable to the norm of the gradient. The second requirement in Assumption 1 states that the variance of $g(\mathcal{B},\boldsymbol{c}_{j})$ is restricted, but in a relatively minor manner.

5.3.2 Approximation Guarantee

Let $\overline{\phi}^{*}$ denote the optimal value of $\overline{\phi}$ . We aim to prove that $k$ -means++ ensures that the resulting initial centroid set $\mathcal{C}$ satisfies $\mathbb{E}[\overline{\phi}(t,\mathcal{S},\mathcal{C})]\leq\alpha\cdot\overline{\phi}^{*}$ , where $\alpha$ is the multiplicative error. Next, we derive the value of $\alpha$ .

Theorem 1.

Let $\overline{\phi}^{*}$ be the optimal value of the tilted SSE, and let $\overline{\psi}^{\star}$ be the optimal value of the SSE. Then for any dataset $\mathcal{X}$ , centroid set $\mathcal{C}$ initialized by $k$ -means++, and induced clusters $\mathcal{S}$ , it holds that

$$\mathbb{E}[\overline{\phi}(t,\mathcal{S},\mathcal{C})]\leq O(k\log k)\cdot\overline{\psi}^{\star}\leq O(k\log k)\cdot\overline{\phi}^{*}.\tag{19}$$

The proof of Theorem 1 can be found in Section 8.1. $k$ -means++ has been proven to generate initial centroids with a multiplicative error of $O(\log k)$ in $k$ -means when fairness constraints are not considered [8]. Theorem 1 demonstrates that with individual fairness constraints, $k$ -means++ achieves the multiplicative error of $O(k\log k)$ .

5.3.3 Convergence Analysis

Next, we provide the convergence analysis of TKM by proving that the assignment and refinement steps ensure that the expected value of the tilted SSE decreases.

Theorem 2.

Let $\mathcal{S}^{it}$ , $\mathcal{C}^{it}$ and $\mathcal{S}^{it+1}$ , $\mathcal{C}^{it+1}$ be the solutions in the $it$ -th and $(it+1)$ -th iterations of TKM. Under Assumption 1 and by choosing the learning rate $\eta<\frac{2}{\mu\cdot L(t)}$ , it holds that

$$\mathbb{E}[\overline{\phi}(t,\mathcal{S}^{it+1},\mathcal{C}^{it+1})]\leq\overline{\phi}(t,\mathcal{S}^{it},\mathcal{C}^{it}).\tag{20}$$

The proof of Theorem 2 is provided in Section 8.2. Theorem 2 demonstrates that with the selection of an appropriate learning rate, the expected value of the tilted SSE can decrease until reaching convergence.

5.3.4 Fairness Analysis

We propose using the variance of each data point’s squared distance to the centroid within each cluster to measure the fairness of clustering algorithms. Note that when $t=0$ in the tilted weight, the tilted empirical variance generalizes to standard variance. We employ variance as a measure of fairness because it quantifies the extent to which sample points in a dataset are distributed around the mean, with smaller variance indicating reduced fluctuation in distances from the mean and thus greater fairness. Next, we consider the monotonicity of the tilted empirical variance with $t$ .

Theorem 3.

For any cluster $\mathcal{S}_{j}$ , any corresponding centroid $\boldsymbol{c}_{j}(t)=\operatorname{\textup{Tm}}(t,\mathcal{S}_{j})$ , and any $t\geq 0$ , suppose all data points are normalized to a unit norm, then it holds that

$$\frac{\partial}{\partial t}\Bigl\{\mathrm{Var}_{\tau}\Bigl(\mathrm{f}(\mathcal{S}_{j},\boldsymbol{c}_{j}(t))\Bigr)\Bigr\}<0.\tag{21}$$

The proof of Theorem 3 is provided in Section 8.3. Note that $\tau$ is a constant in the calculation of the tilted empirical variance, where it enters through the tilted weights. Theorem 3 states that the $\tau$ -tilted empirical variance of the squared distances between the data points in $\mathcal{S}_{j}$ and their centroid decreases as $t$ increases. Therefore, there exists a potential trade-off between SSE and variance, enabling solutions to flexibly achieve desirable clustering utility and fairness. While Theorem 3 assumes that all data points are normalized to unit norm, which does not hold for some datasets, we observe favorable numerical results, which motivates extending these results beyond the cases studied theoretically in this paper.

5.3.5 Time Complexity

We provide the time complexity of TKM and analyze why TKM is suitable for individually fair clustering analysis in big data scenarios.

Theorem 4.

The time complexity of TKM is $O(nkdET)$ , where $d$ is the number of attributes of each data point, $E$ is the epoch size, and $T$ is the total number of iterations.

The proof of Theorem 4 is provided in Section 8.4. Note that the time complexity of TKM is linear in the dataset size, which is the same as that of vanilla $k$ -means algorithms without fairness constraints, such as Lloyd’s heuristic [39] and SGD-based $k$ -means [52]. In contrast, existing individually fair clustering methods exhibit a time complexity of $O(kn^{4})$ [40, 48]. In the context of big data, employing these methods for clustering becomes impractical, as the required running time becomes difficult to estimate when the dataset size reaches the order of millions. Moreover, these methods encounter RAM overflow issues because they compute the distances between every pair of data points, which requires storing an $n\times n$ array in RAM. Conversely, TKM only computes the distances between each data point and the centroids during the assignment step, which requires only an $n\times k$ array and effectively mitigates the risk of RAM overflow.
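To give a feel for the gap between an $n\times n$ and an $n\times k$ array, the following back-of-the-envelope calculation may help. It is only a sketch under stated assumptions: distances are stored densely as 8-byte floats, $k=10$ is taken as the largest number of clusters used in the experiments, and the helper name `dense_array_gb` is hypothetical. Actual memory use of the compared methods depends on their implementations.

```python
def dense_array_gb(rows, cols, bytes_per_entry=8):
    """Size in decimal gigabytes of a dense rows x cols array (8-byte floats assumed)."""
    return rows * cols * bytes_per_entry / 1e9

k = 10  # assumed number of clusters
for n in (80_000, 90_000, 5_986_660):
    pairwise = dense_array_gb(n, n)    # n x n point-to-point distances
    to_centroids = dense_array_gb(n, k)  # n x k point-to-centroid distances
    print(f"n={n}: n x n = {pairwise:.1f} GB, n x k = {to_centroids:.4f} GB")
```

Under these assumptions the pairwise array alone needs 51.2 GB at $n=80{,}000$ and 64.8 GB at $n=90{,}000$, whereas the point-to-centroid array stays below 1 GB even for the nearly six million points of HMDA.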

5.3.6 Monotonicity Analysis

In this section, we provide a monotonicity analysis of the tilted SSE in a simple case.

Theorem 5.

When $k=1$ , suppose all data points are normalized to a unit norm, then for any $t\geq 0$ , it holds that,

$$\frac{\partial\overline{\phi}(t,\mathcal{S},\mathcal{C})}{\partial t}\geq 0.\tag{22}$$

The proof of Theorem 5 is provided in Section 8.5. When $k=1$ , $k$ -means simplifies to a point estimation problem. In this case, Theorem 5 shows that the tilted SSE is non-decreasing in $t$ . While this monotonicity result is restricted to the case $k=1$ , our experiments suggest that the tilted SSE also exhibits a monotonically increasing trend for other values of $k$ .

6 Experiments

TABLE II: An overview of the datasets

Ref.	Datasets	# of points	# of attributes	# of clusters
[33]	Athlete	271,117		3-10
[45]	Bank	4,521		3-10
[36]	Census	32,561		3-10
[57]	Diabetes	101,766		3-10
[6]	Recruitment	4,001		3-10
[9]	Spanish	4,747		3-10
[1]	Student	32,594		3-10
[2]	3D-spatial	434,874		3-10
[41]	Census1990	2,458,285		
[3]	HMDA	5,986,660		
[50]	Synthetic	200		2, 3

Goals. In this section, we verify the effectiveness and efficiency of TKM by comparing it with various methods. We also examine the impact of various hyperparameters on the convergence of TKM. Moreover, we provide visualizations of the centroids’ variations with varying $t$ .

6.1 Settings

Datasets. We employ ten real-world datasets and two synthetic datasets to validate the performance of TKM. To compare the effectiveness and fairness of TKM with various methods and parameters, we utilize Athlete, Bank, Census, Diabetes, Recruitment, Spanish, Student, and 3D-spatial. To compare the efficiency of TKM with other methods, we employ Census1990 and HMDA. For visualizing TKM, we use two synthetic datasets. We sampled numerical features from the ten real-world datasets and then standardized these features (the names of these features are provided in our repository). A comprehensive overview of the datasets is given in Table II.

Baselines. We experimentally evaluate the performance of TKM against six methods, namely, $k$ -means++ [8], JKL [35], MV [40], FR [48], SFR [48], and NF [52]. As explained in our related work, JKL first introduced the concept of individual fairness for $k$ -means. MV, FR, and SFR are three state-of-the-art methods for individually fair $k$ -means. Note that SFR is a sparsified version of FR. $k$ -means++ and NF are two clustering methods that do not take individual fairness into account. It is worth noting that NF differs from the classical Lloyd’s heuristic: it is solved through SGD and can be considered as the case of $t=0$ in TKM. For TKM and NF, we employed $k$ -means++ for initialization.

TABLE III: Comparison among TKM, MV, and FR in terms of running time (seconds). We abbreviate TLE as the time limit exceeded for 1 hour, SLE as the sampling size limit exceeded for the dataset dimension, and ROF as the RAM overflow.

Dataset	Method	1K	2K	5K	10K	15K	20K	25K	30K	40K	50K	60K	70K	80K	90K	2M	5M
Census1990	TKM	0.7	1.3	2.4	4.0	9.0	12.0	15.9	18.9	23.4	30.7	41.5	49.4	56.2	66.0	542.3	SLE
	SFR	0.5	2.6	11.7	31.4	45.4	65.2	77.3	98.9	156.7	195.2	291.8	483.1	601.3	ROF	ROF	SLE
	MV	5.6	30.7	85.9	250.9	1068.3	1783.9	4960.8	TLE	TLE	TLE	TLE	TLE	TLE	ROF	ROF	SLE
	FR	13.4	129.8	1053.4	10692.7	TLE	TLE	TLE	TLE	TLE	TLE	TLE	TLE	TLE	ROF	ROF	SLE
HMDA	TKM	0.3	1.0	2.2	3.8	9.3	12.3	15.5	19.4	24.3	31.1	45.2	53.0	59.1	71.3	743.9	1901.6
	SFR	2.3	5.7	17.9	41.1	65.8	72.9	88.6	111.8	174.5	211.9	348.8	528.1	712.5	ROF	ROF	ROF
	MV	5.0	27.8	61.2	304.5	406.8	1923.9	5187.6	TLE	TLE	TLE	TLE	TLE	TLE	ROF	ROF	ROF
	FR	49.8	263.3	2784.1	TLE	TLE	TLE	TLE	TLE	TLE	TLE	TLE	TLE	TLE	ROF	ROF	ROF

Measurements. We employ several metrics to evaluate the performance of clustering algorithms. We use SSE to measure the utility of different clustering algorithms, where a smaller SSE indicates better clustering utility. To measure fairness, we use two metrics. The first is the variance of each point’s distance to its nearest centroid within a cluster; a smaller variance indicates a fairer algorithm. The second is the maximum distance from each point in a cluster to the centroid, where a smaller maximum distance signifies greater fairness. For efficiency, we measure the running time of each algorithm. To verify the impact of different hyperparameters on the convergence of TKM, we use the tilted SSE as the metric.
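The utility and fairness metrics described above can be computed from the points and the centroids alone. The following is a minimal sketch rather than the evaluation code used in the experiments: the function name `clustering_metrics` is hypothetical, the SSE is normalized by $1/n$ as in the definition of $\psi$, and both fairness metrics are assumed to be taken over squared Euclidean distances.

```python
from statistics import pvariance

def sq_dist(x, c):
    return sum((a - b) ** 2 for a, b in zip(x, c))

def clustering_metrics(points, centroids):
    """Return (SSE, per-cluster variances, per-cluster maximum squared distances)."""
    k = len(centroids)
    per_cluster = [[] for _ in range(k)]
    for x in points:
        d = [sq_dist(x, c) for c in centroids]
        j = min(range(k), key=d.__getitem__)   # nearest centroid
        per_cluster[j].append(d[j])
    sse = sum(sum(ds) for ds in per_cluster) / len(points)
    variances = [pvariance(ds) if ds else 0.0 for ds in per_cluster]
    max_dists = [max(ds) if ds else 0.0 for ds in per_cluster]
    return sse, variances, max_dists

# Toy example with two obvious clusters (illustrative data only).
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]]
centroids = [[0.0, 1.0], [10.0, 2.0]]
sse, variances, max_dists = clustering_metrics(points, centroids)
print(sse, variances, max_dists)
```

In this toy example every point in a cluster is equally far from its centroid, so both per-cluster variances are zero even though the two clusters have different maximum distances.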

Implementations. Our algorithms were executed on a platform comprising an Intel i9-14900KF CPU with 24 cores, 64 GB of RAM, and operating on the CentOS 7 environment. The software implementations, including our methods and the comparison methods, were realized in Python 3.7 and open-sourced (https://github.com/zsk66/TKM-master).

6.2 Comparison among Various Methods

6.2.1 Effectiveness Analysis

Fig. 7 compares the SSE of six methods as $k$ varies on eight datasets: Athlete, Bank, Census, Diabetes, Recruitment, Spanish, Student, and 3D-spatial. Due to the long running time required by our comparison methods, we need to sample the datasets to accommodate them. We sampled 1000 data points from each dataset, repeated this process 10 times, conducted experiments on the resulting 10 sampled datasets, and averaged the obtained SSE values. We set the parameter $t$ of TKM to be 0.01, 0.05, 0.1, and 0.2, respectively. The learning rate for NF and TKM was set to 0.05, the number of epochs was set to 5, the batch size was set to 100, and the number of iterations was set to 500. JKL, MV, FR, and SFR adopted the default hyperparameter settings in their papers.

Observations. We can see that as $t$ increases, the SSE of TKM also increases. This is because an increase in $t$ inevitably brings the centroids closer to the minority data points, resulting in an increase in SSE. Comparing the SSE of different methods, we can observe that the SSE of JKL is consistently the highest across all datasets except for Bank and Spanish. In these two datasets, TKM has a large SSE at $t=0.2$ , which is due to excessively large $t$ causing the centroids obtained by TKM to be too close to those minority data points. The SSE of SFR is always larger than FR because SFR is a version of FR that applies the sparsification technique. The SSE for 3D-spatial and Recruitment in FR is lower than in MV, but on the other six datasets, MV has a lower SSE compared to FR. Meanwhile, TKM’s SSE at $t=0.01,0.05,0.1$ is consistently lower than JKL, MV, FR, and SFR, and even performs nearly as well as $k$ -means++ and NF on the Census and Recruitment, which reflects the outstanding effectiveness of TKM.

6.2.2 Fairness Analysis

Fig. 7 and Fig. 7 illustrate the variance and maximum distance within each cluster for various methods when $k=4$ . The variance and maximum distance values within Clusters 1-4 are arranged in descending order. The data processing and hyperparameter configurations for all methods remain consistent with those outlined in Section 6.2.1.

Observations. From Fig. 7, it can be seen that for TKM, as $t$ increases, the variance of each cluster decreases, which is consistent with our theoretical results. Next, without loss of generality, we examine the variance of each method on Cluster 1. It can be observed that JKL has the largest variance across all datasets except for Bank, Recruitment, and Spanish, while $k$ -means++ and NF have the largest variance on Bank and Spanish, and SFR has the largest variance on Recruitment. It is worth noting that in some datasets, such as Diabetes, Recruitment, Student, and 3D-spatial, even when $t=0.01$ , the variance of TKM is smaller than other comparison methods. Moreover, in other datasets, by adjusting $t$ , it is always possible to make the variance of TKM smaller than the comparison methods. From Fig. 7, we observe that the maximum distance within each cluster decreases as $t$ increases. This occurs because the greater maximum distance is caused by the centroids being farther from the minority points. With a higher $t$ , the centroids shift towards the minority points, thereby reducing the maximum distance. Comparing TKM with other methods reveals that TKM achieves the smallest maximum distance, demonstrating its fairness. Moreover, we observe that in 3D-spatial, the variance and maximum distance of JKL, MV, FR, and SFR are all larger than those of $k$ -means++, indicating that existing individually fair clustering methods might even exacerbate unfairness in our scenario.

6.2.3 Efficiency Analysis

Table III compares the running time of TKM with that of three state-of-the-art methods, MV, FR, and SFR (due to the poor performance of JKL and NF in effectiveness and fairness, we exclude these two methods from the efficiency comparison). We sampled Census1990 and HMDA with sizes $n_{s}$ of 1K, 2K, 5K, 10K, 15K, 20K, 25K, 30K, 40K, 50K, 60K, 70K, 80K, 90K, 2M, and 5M, respectively. We set the number of iterations for TKM to 500, the batch size to $\frac{1}{50}n_{s}$ , the number of epochs to 5, and the learning rate to 0.05. The hyperparameters for MV, FR, and SFR were set to the default values in their papers.

Observations. Experimental results demonstrate that the running time of TKM is significantly shorter than that of MV and FR at every sample size, and shorter than that of SFR at every sample size except 1K on Census1990 (0.7 s for TKM versus 0.5 s for SFR). TKM can cluster 5 million data points in about 30 minutes, while MV can only cluster about 20,000 data points within 30 minutes, and FR only about 5,000. Moreover, as the sample size increases, the speedup of TKM over MV and FR grows to hundreds or even thousands of times. For example, when the number of sampled points is 1,000, TKM achieves 8.0 $\times$ and 19.1 $\times$ acceleration compared to MV and FR in Census1990, respectively. When the number of sampled points is 10,000, TKM achieves 62.7 $\times$ and 2673.2 $\times$ acceleration compared to MV and FR in Census1990, respectively. Although the running time of SFR is significantly shorter than that of MV and FR, TKM still achieves approximately 10.7 $\times$ and 12.1 $\times$ acceleration over SFR with 80,000 data points in Census1990 and HMDA, respectively. Furthermore, when the sample size reaches 90,000, SFR must compute the distances between every pair of sample points, which leads to a memory overflow and causes the algorithm to terminate. This issue also arises in MV and FR.
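The acceleration factors quoted above are simply ratios of the running times in Table III. The short script below recomputes them from the table entries; the dictionary layout and the helper name `speedup` are illustrative choices.

```python
# Running times in seconds, copied from Table III.
census1990 = {
    "1K": {"TKM": 0.7, "MV": 5.6, "FR": 13.4},
    "10K": {"TKM": 4.0, "MV": 250.9, "FR": 10692.7},
    "80K": {"TKM": 56.2, "SFR": 601.3},
}
hmda = {"80K": {"TKM": 59.1, "SFR": 712.5}}

def speedup(times, baseline):
    """Ratio of a baseline's running time to TKM's, rounded to one decimal."""
    return round(times[baseline] / times["TKM"], 1)

for size, times in census1990.items():
    for method in times:
        if method != "TKM":
            print("Census1990", size, method, speedup(times, method))
print("HMDA", "80K", "SFR", speedup(hmda["80K"], "SFR"))
```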

6.2.4 Summary of Lessons Learned

We have provided the changes of SSE with respect to $k$ for different methods, the variance results in four clusters for different methods, and a comparison of the efficiency of different methods. Our experimental results have led us to draw the following conclusions:

•

TKM outperforms state-of-the-art methods in terms of effectiveness. Specifically, TKM achieves smaller SSE compared to state-of-the-art methods across different values of $k$ and $t$ . In some datasets, the SSE of TKM is almost the same as methods that do not consider individual fairness.
•

TKM outperforms state-of-the-art methods in terms of fairness. Specifically, TKM can achieve smaller variance and maximum distance than state-of-the-art methods when an appropriate value of $t$ is chosen.
•

TKM surpasses state-of-the-art methods in terms of efficiency. Specifically, TKM can cluster more data points in a shorter time, and as the sample size increases, this acceleration effect becomes even more pronounced. Moreover, TKM can overcome the RAM overflow issue that existing methods encounter when dealing with large-scale data.

6.3 Comparison among Various Parameters

6.3.1 Tilted SSE vs. $t$

Fig. 10 illustrates the convergence of tilted SSE with iterations at $t$ values of 0.01, 0.05, 0.1, 0.2, 0.5, and 1. We randomly select 1000 data points from each dataset, repeating this process 10 times. We then conduct experiments on these 10 subsampled datasets, calculating the average of the resulting tilted SSE values. For other hyperparameters, we set the learning rate to 0.05, the number of iterations to 500, the batch size to 100, and the epoch size to 5.

Observations. We observe that despite using SGD to update the centroids, the tilted SSE of TKM still decreases steadily with iterations, which confirms the convergence of TKM. As $t$ increases, the tilted SSE also increases. This confirms that our theoretical analysis of the monotonicity of the tilted SSE with respect to $t$ holds not only for $k=1$ . When $t=0.01$ , the tilted SSE remains nearly unchanged with iterations. This indicates that the tilted SSE is insensitive to variations in $t$ when $t$ is small.

6.3.2 Tilted SSE vs. Epoch

Fig. 10 illustrates the convergence of tilted SSE with different numbers of epochs during iterations. The data preprocessing for TKM here follows the same procedures outlined in Section 6.3.1. To visualize the curve of tilted SSE of TKM over iterations more intuitively, we set $t=0.5$ , learning rate to 0.03, number of iterations to 500, batch size to 50, and epoch size to 1, 3, 5, 7, 9.

Observations. From Fig. 10, it can be observed that as the number of iterations increases, the tilted SSE of TKM decreases and tends to stabilize after reaching a certain value on all datasets. With an increase in the epoch size, the convergence speed of TKM accelerates, and its convergence performance improves. This is because increasing the epoch size allows for higher precision in the solution obtained through SGD during each iteration, as more data can be utilized. When the epoch size is 7 and 9, the convergence and convergence speed of TKM are not significantly different. Therefore, selecting 7 as the epoch size is an appropriate choice. However, in some datasets, we found that increasing the epoch size does not necessarily improve convergence. For example, in Recruitment, a smaller epoch size of 7 yields better convergence compared to an epoch size of 9. This is attributed to the risk of overfitting when the epoch size is too large. Therefore, choosing an epoch size of 7 is deemed appropriate for these datasets.

6.3.3 Tilted SSE vs. Learning Rate

Fig. 10 illustrates the convergence of tilted SSE with various learning rates during iterations. The data preprocessing for TKM here is the same as in Section 6.3.1. For the parameter settings of TKM, we set $t=0.5$ , epoch size as 5, batch size as 50, number of iterations to 500, and learning rate as 0.01, 0.02, 0.03, 0.04, and 0.05.

Observations. From Fig. 10, we can see that, across the eight datasets, the convergence speed generally increases with the increase in learning rate. However, when the learning rate increases to a certain extent, the increase in convergence speed becomes slower. For example, when $\eta=0.03,0.04,0.05$ , the convergence speed and the converged tilted SSE value on the Bank are almost indistinguishable. Additionally, if the learning rate is excessively high, it can result in poorer convergence, as demonstrated in Diabetes where $\eta=0.04$ produces a smaller tilted SSE. This occurs because an overly large learning rate may cause the SGD step size to become excessive, hindering the achievement of locally optimal solutions.

6.3.4 Visualization

Fig. 11 demonstrates how the centroids change with $t$ in two synthetic datasets when the number of clusters is set to 2 and 3, respectively. We set the number of epochs for TKM to 5, the number of iterations to 1000, the learning rate to 0.01, and the batch size to 20. For the values of $t$ , we take a total of 60 geometrically spaced values between $10^{-2}$ and $10^{2}$ . We employ a blue-to-red gradient to depict the rising values of $t$ , and we use the same color to represent data points within the same cluster.
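For reference, a grid of 60 geometrically spaced values between $10^{-2}$ and $10^{2}$ can be generated as follows. This is a small sketch with a hypothetical helper name; the released code may construct the grid differently.

```python
def geometric_grid(start, stop, num):
    """Return num values from start to stop with a constant ratio between neighbours."""
    ratio = (stop / start) ** (1.0 / (num - 1))
    return [start * ratio ** i for i in range(num)]

t_values = geometric_grid(1e-2, 1e2, 60)
print(t_values[0], t_values[1], t_values[-1])
```

Consecutive values differ by a constant factor, so the grid samples small and large values of $t$ equally densely on a logarithmic scale.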

Observations. As $t$ increases, the centroids shift towards the minority data points in each cluster. This brings the clustering closer to the principle of “treating all points equally”, in line with the concept of individual fairness. Furthermore, the centroids do not shift excessively towards the minority data points as $t$ increases, so the distances from the majority data points to the centroids remain reasonable. This demonstrates that TKM ensures equal treatment of each data point.

6.3.5 Summary of Lessons Learned

We have presented the convergence behavior of TKM under different values of $t$ , epoch sizes, and learning rates, as well as visualizations of TKM on two-dimensional synthetic data. These experiments lead us to the following conclusions:

•

TKM is a convergent algorithm, and the tilted SSE increases monotonically with $t$ . Specifically, for different values of $t$ , the tilted SSE in TKM steadily decreases to a stable value. Moreover, as $t$ increases, the tilted SSE increases.
•

The convergence of TKM is influenced by the epoch size and learning rate. Specifically, selecting an appropriate epoch size and learning rate can lead to faster convergence speed and better convergence of TKM. However, choosing larger epoch sizes and learning rates does not necessarily improve the performance.
•

TKM indeed can ensure individual fairness for $k$ -means. Specifically, as $t$ increases, it can guarantee that those minority data points can be closer to the centroids, achieving the goal of treating each individual equally.

7 Conclusions and Future Work

This paper investigated the individually fair $k$ -means in the context of location-based resource allocation. To address the issue where existing individually fair clustering methods and fairness metrics may exacerbate unfairness, we proposed TKM, an algorithm designed to effectively solve the individually fair $k$ -means problem via exponential tilting. We constructed the tilted SSE as the objective function and proposed solving the optimization problem using CD and SGD. Moreover, we proposed to employ variance to measure fairness. Our theory and experiments have validated that the effectiveness, efficiency, and fairness of our proposed algorithm outperform existing state-of-the-art methods. It is noteworthy that existing individually fair clustering methods encounter challenges in their application to large-scale data clustering analysis scenarios, primarily due to their computational complexity, which depends on the dataset size. In contrast, TKM, due to its excellent efficiency performance, can be applied in many big data clustering analysis scenarios, such as resource allocation.

Due to privacy concerns, data is often stored on different devices and cannot be shared among them. Therefore, a hot topic of research is how to perform clustering analysis without sharing data. In the future, we will investigate individually fair $k$ -means in the framework of federated learning to address this issue.

8 Proofs

8.1 Proof of Theorem 1

Before proving Theorem 1, we present some useful lemmas.

Lemma 3.

Given a cluster $\mathcal{S}_{j}$ , let $\boldsymbol{\delta}_{j}$ and $\boldsymbol{c}_{j}$ be the corresponding assignment and centroid, then for any $t\geq 0$ , it holds that,

$$\psi(\boldsymbol{\delta}_{j},\boldsymbol{c}_{j})\leq\phi(t,\boldsymbol{\delta}_{j},\boldsymbol{c}_{j}).\tag{23}$$

Proof.

Following from (5.1), we have

$$\begin{aligned}\phi(t,\boldsymbol{\delta}_{j},\boldsymbol{c}_{j})&=\frac{1}{t}\log\frac{1}{n}\sum_{i=1}^{n}e^{tf(\boldsymbol{x}_{i},\boldsymbol{c}_{j})\delta_{ij}}\\&\geq\frac{1}{t}\cdot\frac{1}{n}\sum_{i=1}^{n}\log e^{tf(\boldsymbol{x}_{i},\boldsymbol{c}_{j})\delta_{ij}}&&\text{(24)}\\&=\frac{1}{n}\sum_{i=1}^{n}f(\boldsymbol{x}_{i},\boldsymbol{c}_{j})\delta_{ij}=\psi(\boldsymbol{\delta}_{j},\boldsymbol{c}_{j}),&&\text{(25)}\end{aligned}$$

where (24) follows from Jensen’s inequality. ∎
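Lemma 3 is easy to check numerically. The sketch below evaluates both sides of (23) on a small list of values $f(\boldsymbol{x}_{i},\boldsymbol{c}_{j})\delta_{ij}$; the numbers and the function names are illustrative assumptions, and the log-sum-exp is shifted by its maximum purely for numerical stability.

```python
import math

def sse_term(g):
    """psi: plain average of g_i = f(x_i, c_j) * delta_ij."""
    return sum(g) / len(g)

def tilted_sse_term(g, t):
    """phi: (1/t) * log of the average of exp(t * g_i), computed stably for t > 0."""
    m = max(g)
    return m + math.log(sum(math.exp(t * (gi - m)) for gi in g) / len(g)) / t

g = [0.2, 0.5, 0.0, 3.0, 0.0, 1.1]  # zeros stand for points with delta_ij = 0
psi = sse_term(g)
for t in (1e-6, 0.01, 0.1, 1.0, 10.0):
    print(t, psi, tilted_sse_term(g, t))
```

For very small $t$ the two quantities nearly coincide, and the gap widens as $t$ grows, which is the behaviour the Jensen step in (24) predicts.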

Lemma 4.

Given a set of clusters $\mathcal{S}=\{\mathcal{S}_{j}\}_{j=1}^{k}$ and a set of centroids $\mathcal{C}=\{\boldsymbol{c}_{j}\}_{j=1}^{k}$ , let $dist(\boldsymbol{x}_{i},\mathcal{C}):=\min_{\boldsymbol{c}_{j}\in\mathcal{C}}\|\boldsymbol{x}_{i}-\boldsymbol{c}_{j}\|^{2}$ . Then for any $t\geq 0$ , there exists a scalar $\epsilon\geq k\cdot\frac{\max_{\boldsymbol{x}_{i}\in X}dist(\boldsymbol{x}_{i},\mathcal{C})}{\min_{\boldsymbol{x}_{i}\in X}dist(\boldsymbol{x}_{i},\mathcal{C})}$ such that the following inequality holds:

$$\overline{\phi}(t,\mathcal{S},\mathcal{C})\leq\epsilon\cdot\overline{\psi}(\mathcal{S},\mathcal{C}).\tag{26}$$

Proof.

Consider the case when $t\to\infty$ , according to L’Hôpital’s rule, it holds that,

$$\lim_{t\to\infty}\phi(t,\boldsymbol{\delta}_{j},\boldsymbol{c}_{j})=\max_{\boldsymbol{x}_{i}\in\mathcal{S}_{j}}f(\boldsymbol{x}_{i},\boldsymbol{c}_{j}),\tag{27}$$

which implies that for any $j\in[k]$ , $\phi(t,\boldsymbol{\delta}_{j},\boldsymbol{c}_{j})$ is bounded. Then there must exist a scalar $\epsilon$ such that

$$\begin{aligned}\epsilon&\geq k\cdot\frac{\max_{\boldsymbol{x}_{i}\in X}dist(\boldsymbol{x}_{i},\mathcal{C})}{\min_{\boldsymbol{x}_{i}\in X}dist(\boldsymbol{x}_{i},\mathcal{C})}\\&=\frac{\sum_{j=1}^{k}\frac{1}{t}\log\frac{1}{n}\sum_{i=1}^{n}e^{t\max_{\boldsymbol{x}_{i}\in X}dist(\boldsymbol{x}_{i},\mathcal{C})}}{\min_{\boldsymbol{x}_{i}\in X}dist(\boldsymbol{x}_{i},\mathcal{C})}&&\text{(28)}\\&\geq\frac{\sum_{j=1}^{k}\frac{1}{t}\log\frac{1}{n}\sum_{i=1}^{n}e^{tf(\boldsymbol{x}_{i},\boldsymbol{c}_{j})\delta_{ij}}}{\frac{1}{n}\sum_{j=1}^{k}\sum_{\boldsymbol{x}_{i}\in\mathcal{S}_{j}}f(\boldsymbol{x}_{i},\boldsymbol{c}_{j})}&&\text{(29)}\\&=\frac{\overline{\phi}(t,\mathcal{S},\mathcal{C})}{\overline{\psi}(\mathcal{S},\mathcal{C})},&&\text{(30)}\end{aligned}$$

which completes the proof. ∎
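The limit in (27) can also be observed numerically. Writing $m$ for the largest value of $f(\boldsymbol{x}_{i},\boldsymbol{c}_{j})\delta_{ij}$, every term of the average $\frac{1}{n}\sum_{i}e^{t(g_{i}-m)}$ is at most one and at least one term equals one, so the per-cluster tilted SSE is sandwiched as $m-\frac{\log n}{t}\leq\phi\leq m$. The sketch below checks this elementary sandwich on illustrative numbers; the function name is hypothetical.

```python
import math

def tilted_sse_term(g, t):
    """(1/t) * log of the average of exp(t * g_i), with the max factored out for stability."""
    m = max(g)
    return m + math.log(sum(math.exp(t * (gi - m)) for gi in g) / len(g)) / t

g = [0.2, 0.5, 0.0, 3.0, 0.0, 1.1]  # illustrative values of f(x_i, c_j) * delta_ij
m, n = max(g), len(g)
for t in (1.0, 10.0, 100.0, 1000.0):
    phi = tilted_sse_term(g, t)
    print(t, phi, m - phi, math.log(n) / t)
```

The gap $m-\phi$ shrinks at least as fast as $\log n/t$, which is why $\phi$ stays bounded for every $t$, as used in the proof.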

Proposition 1.

Let $\delta_{j}^{\star},\boldsymbol{c}_{j}^{\star},j\in[k]$ be the optimal solution of SSE, let $\boldsymbol{\delta}_{j}^{*},\boldsymbol{c}_{j}^{*},j\in[k]$ be the optimal solutions of tilted SSE, and let $\overline{\psi}^{\star}$ , $\overline{\phi}^{*}$ be the corresponding optimal objective function values, then for any $t\geq 0$ , we have $\overline{\psi}^{\star}\leq\overline{\phi}^{*}$ .

Proof.

Based on Lemma 3 and the optimality conditions, we obtain

$$\overline{\psi}^{\star}=\psi(\boldsymbol{\delta}_{j}^{\star},\boldsymbol{c}_{j}^{\star})\leq\psi(\boldsymbol{\delta}_{j}^{*},\boldsymbol{c}_{j}^{*})\leq\phi(t,\boldsymbol{\delta}_{j}^{*},\boldsymbol{c}_{j}^{*})=\overline{\phi}^{*}. \tag{31}$$

Summing (31) over $j=1,\dots,k$ implies Proposition 1. ∎

Lemma 5 (Theorem 1.1 in [8]).

Let $\overline{\psi}^{\star}$ be the optimal SSE of $k$-means, let $\mathcal{C}$ be the set of centroids constructed by $k$-means++, and let $\mathcal{S}$ be the corresponding induced assignment. Then for any set of data points, it holds that $\mathbb{E}[\overline{\psi}(\mathcal{S},\mathcal{C})]\leq 8(\log k+2)\overline{\psi}^{\star}$.
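For readers unfamiliar with the seeding procedure that Lemma 5 refers to, the following is a minimal sketch of $D^{2}$ sampling as described in [8]: the first centroid is drawn uniformly, and each subsequent one is drawn with probability proportional to the squared distance to the nearest centroid chosen so far. The function name `kmeanspp_seed` and the toy data are hypothetical and serve illustration only.

```python
import random

def sq_dist(x, c):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, c))

def kmeanspp_seed(X, k, rng):
    """Pick k initial centroids from X by D^2 sampling (k-means++ seeding)."""
    centroids = [list(rng.choice(X))]             # first centroid: uniform at random
    d2 = [sq_dist(x, centroids[0]) for x in X]    # squared distance to nearest centroid
    while len(centroids) < k:
        # Sample the next centroid with probability proportional to d2.
        idx = rng.choices(range(len(X)), weights=d2, k=1)[0]
        centroids.append(list(X[idx]))
        d2 = [min(old, sq_dist(x, centroids[-1])) for old, x in zip(d2, X)]
    return centroids

def sse(X, C):
    """Average squared distance of each point to its nearest centroid."""
    return sum(min(sq_dist(x, c) for c in C) for x in X) / len(X)

# Hypothetical toy data: three Gaussian blobs.
rng = random.Random(0)
X = [[rng.gauss(mx, 0.3), rng.gauss(my, 0.3)]
     for mx, my in [(0, 0), (4, 0), (0, 4)] for _ in range(50)]
seeds = kmeanspp_seed(X, 3, rng)
seeded_sse = sse(X, seeds)
print(seeds, seeded_sse)
```

Points already chosen have zero weight and are therefore never drawn again, so the seeds are distinct whenever the data contain at least $k$ distinct points.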

Next, we are ready to prove Theorem 1 based on the above lemmas.

Proof of Theorem 1.

Let $\mathcal{C}$ be the set of centroids constructed by $k$-means++, and let $\mathcal{S}$ be the corresponding induced set of clusters. It then follows from Lemma 4 that

$$\overline{\phi}(t,\mathcal{S},\mathcal{C})\leq\epsilon\cdot\overline{\psi}(\mathcal{S},\mathcal{C}). \tag{32}$$

We can therefore bound $\mathbb{E}[\overline{\phi}(t,\mathcal{S},\mathcal{C})]$ as

$$\begin{aligned}
\mathbb{E}[\overline{\phi}(t,\mathcal{S},\mathcal{C})]&\leq\epsilon\cdot\mathbb{E}[\overline{\psi}(\mathcal{S},\mathcal{C})] \\
&\leq 8\epsilon(\log k+2)\cdot\overline{\psi}^{\star} &&\text{(33)}\\
&\leq 8\epsilon(\log k+2)\cdot\overline{\phi}^{*}, &&\text{(34)}
\end{aligned}$$

where (33) follows from Lemma 5, and (34) follows from Proposition 1. According to Lemma 4, we have $\epsilon\geq k\cdot\frac{\max_{\boldsymbol{x}_{i}\in X}dist(\boldsymbol{x}_{i},\mathcal{C})}{\min_{\boldsymbol{x}_{i}\in X}dist(\boldsymbol{x}_{i},\mathcal{C})}$. Taking $\epsilon$ equal to this lower bound and treating the ratio of the maximum to the minimum distance as a constant, we derive $\mathbb{E}[\overline{\phi}(t,\mathcal{S},\mathcal{C})]\leq O(k\log k)\overline{\phi}^{*}$, which completes the proof. ∎

8.2 Proof of Theorem 2

By the Mean Value Theorem, gradient Lipschitz continuity implies the following proposition.

Proposition 2.

For any $t\geq 0$ and $\tilde{\boldsymbol{c}}_{j},\tilde{\boldsymbol{c}}_{j}^{\prime}\in\mathbb{R}^{d}$, it holds that

$$\phi(t,\boldsymbol{\delta}_{j},\tilde{\boldsymbol{c}}_{j})-\phi(t,\boldsymbol{\delta}_{j},\tilde{\boldsymbol{c}}_{j}^{\prime})\leq\nabla_{\boldsymbol{c}_{j}}\phi(t,\boldsymbol{\delta}_{j},\tilde{\boldsymbol{c}}_{j}^{\prime})^{\top}(\tilde{\boldsymbol{c}}_{j}-\tilde{\boldsymbol{c}}_{j}^{\prime})+\frac{L(t)}{2}\|\tilde{\boldsymbol{c}}_{j}-\tilde{\boldsymbol{c}}_{j}^{\prime}\|^{2}.$$

Proof.

Following Lemma 2, it holds that

$$\begin{aligned}
\phi(t,\boldsymbol{\delta}_{j},\tilde{\boldsymbol{c}}_{j})
&=\phi(t,\boldsymbol{\delta}_{j},\tilde{\boldsymbol{c}}_{j}^{\prime})+\int_{0}^{1}\frac{\partial\phi(t,\boldsymbol{\delta}_{j},\tilde{\boldsymbol{c}}_{j}^{\prime}+y(\tilde{\boldsymbol{c}}_{j}-\tilde{\boldsymbol{c}}_{j}^{\prime}))}{\partial y}dy \\
&=\phi(t,\boldsymbol{\delta}_{j},\tilde{\boldsymbol{c}}_{j}^{\prime})+\int_{0}^{1}\nabla_{\boldsymbol{c}_{j}}\phi(t,\boldsymbol{\delta}_{j},\tilde{\boldsymbol{c}}_{j}^{\prime}+y(\tilde{\boldsymbol{c}}_{j}-\tilde{\boldsymbol{c}}_{j}^{\prime}))^{\top}(\tilde{\boldsymbol{c}}_{j}-\tilde{\boldsymbol{c}}_{j}^{\prime})dy \\
&=\phi(t,\boldsymbol{\delta}_{j},\tilde{\boldsymbol{c}}_{j}^{\prime})+\nabla_{\boldsymbol{c}_{j}}\phi(t,\boldsymbol{\delta}_{j},\tilde{\boldsymbol{c}}_{j}^{\prime})^{\top}(\tilde{\boldsymbol{c}}_{j}-\tilde{\boldsymbol{c}}_{j}^{\prime}) \\
&\quad+\int_{0}^{1}\bigl[\nabla_{\boldsymbol{c}_{j}}\phi(t,\boldsymbol{\delta}_{j},\tilde{\boldsymbol{c}}_{j}^{\prime}+y(\tilde{\boldsymbol{c}}_{j}-\tilde{\boldsymbol{c}}_{j}^{\prime}))-\nabla_{\boldsymbol{c}_{j}}\phi(t,\boldsymbol{\delta}_{j},\tilde{\boldsymbol{c}}_{j}^{\prime})\bigr]^{\top}(\tilde{\boldsymbol{c}}_{j}-\tilde{\boldsymbol{c}}_{j}^{\prime})dy \\
&\leq\phi(t,\boldsymbol{\delta}_{j},\tilde{\boldsymbol{c}}_{j}^{\prime})+\nabla_{\boldsymbol{c}_{j}}\phi(t,\boldsymbol{\delta}_{j},\tilde{\boldsymbol{c}}_{j}^{\prime})^{\top}(\tilde{\boldsymbol{c}}_{j}-\tilde{\boldsymbol{c}}_{j}^{\prime})+\int_{0}^{1}L(t)\,y\,\|\tilde{\boldsymbol{c}}_{j}-\tilde{\boldsymbol{c}}_{j}^{\prime}\|^{2}dy \\
&=\phi(t,\boldsymbol{\delta}_{j},\tilde{\boldsymbol{c}}_{j}^{\prime})+\nabla_{\boldsymbol{c}_{j}}\phi(t,\boldsymbol{\delta}_{j},\tilde{\boldsymbol{c}}_{j}^{\prime})^{\top}(\tilde{\boldsymbol{c}}_{j}-\tilde{\boldsymbol{c}}_{j}^{\prime})+\frac{L(t)}{2}\|\tilde{\boldsymbol{c}}_{j}-\tilde{\boldsymbol{c}}_{j}^{\prime}\|^{2},
\end{aligned}$$

which completes the proof. ∎
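The inequality of Proposition 2 can be checked numerically in a simple setting. The sketch below assumes a single cluster containing unit-norm data, for which the Hessian of the tilted objective is $2I+4t\,\mathrm{Cov}_{w}(\boldsymbol{x})$, so that $2+4t$ is a valid smoothness constant. This constant is an assumption of the sketch and need not coincide with the $L(t)$ of Lemma 2; the data are synthetic.

```python
import math
import random

def sq_dist(x, c):
    return sum((a - b) ** 2 for a, b in zip(x, c))

def tilted_weights(c, X, t):
    """Softmax weights w_i proportional to exp(t * ||x_i - c||^2)."""
    e = [t * sq_dist(x, c) for x in X]
    m = max(e)
    w = [math.exp(v - m) for v in e]
    s = sum(w)
    return [wi / s for wi in w]

def phi(c, X, t):
    """Single-cluster tilted objective (1/t) * log((1/n) * sum_i exp(t * ||x_i - c||^2))."""
    e = [t * sq_dist(x, c) for x in X]
    m = max(e)
    return (m + math.log(sum(math.exp(v - m) for v in e) / len(X))) / t

def grad_phi(c, X, t):
    """Gradient of phi with respect to c: sum_i w_i * 2 * (c - x_i)."""
    w = tilted_weights(c, X, t)
    return [sum(wi * 2.0 * (c[d] - x[d]) for wi, x in zip(w, X)) for d in range(len(c))]

def unit(v):
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

rng = random.Random(0)
X = [unit([rng.gauss(0, 1) for _ in range(3)]) for _ in range(50)]  # unit-norm toy data
t = 2.0
L = 2.0 + 4.0 * t  # assumed smoothness bound for unit-norm data, not the paper's L(t)

worst_gap = float("inf")
for _ in range(200):
    c1 = [rng.uniform(-2, 2) for _ in range(3)]
    c2 = [rng.uniform(-2, 2) for _ in range(3)]
    g = grad_phi(c2, X, t)
    diff = [a - b for a, b in zip(c1, c2)]
    upper = (phi(c2, X, t) + sum(gi * di for gi, di in zip(g, diff))
             + 0.5 * L * sum(d * d for d in diff))
    worst_gap = min(worst_gap, upper - phi(c1, X, t))
print("smallest (upper bound - phi) over random pairs:", worst_gap)
```

A non-negative `worst_gap` means the quadratic upper bound held for every sampled pair of centroids.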

Next, we show the proof of Theorem 2.

Proof of Theorem 2.

We prove the decreasing property of TKM in two parts: refinement and assignment. Our proof for the refinement follows [13], which establishes convergence for objective functions with Lipschitz continuous gradients. Under the gradient Lipschitz continuity of $\phi(t,\boldsymbol{\delta}_{j},\boldsymbol{c}_{j})$ with respect to $\boldsymbol{c}_{j}$, applying Proposition 2 shows that the iterations of SGD satisfy the following inequality:

$$\mathbb{E}[\phi(t,\boldsymbol{\delta}_{j}^{it},\boldsymbol{c}_{j}^{it+1})]-\phi(t,\boldsymbol{\delta}_{j}^{it},\boldsymbol{c}_{j}^{it})\leq-\eta\nabla_{\boldsymbol{c}_{j}}\phi(t,\boldsymbol{\delta}_{j}^{it},\boldsymbol{c}_{j}^{it})^{\top}\mathbb{E}[g(\mathcal{B},\boldsymbol{c}_{j}^{it})]+\frac{1}{2}\eta^{2}L(t)\mathbb{E}[\|g(\mathcal{B},\boldsymbol{c}_{j}^{it})\|^{2}]. \tag{35}$$

According to the Cauchy-Schwarz inequality and Assumption 1, it holds that

$$\|\mathbb{E}[g(\mathcal{B},\boldsymbol{c}_{j}^{it})]\|^{2}\geq\mu^{2}\|\nabla_{\boldsymbol{c}_{j}}\phi(t,\boldsymbol{\delta}_{j}^{it},\boldsymbol{c}_{j}^{it})\|^{2}. \tag{36}$$

Next, we bound $\mathbb{E}[\|g(\mathcal{B},\boldsymbol{c}_{j}^{it})\|^{2}]$ under Assumption 1 as follows:

$$\begin{aligned}
\mathbb{E}[\|g(\mathcal{B},\boldsymbol{c}_{j}^{it})\|^{2}]&=\mathbb{E}[\|g(\mathcal{B},\boldsymbol{c}_{j}^{it})-\mathbb{E}[g(\mathcal{B},\boldsymbol{c}_{j}^{it})]\|^{2}]+\|\mathbb{E}[g(\mathcal{B},\boldsymbol{c}_{j}^{it})]\|^{2} \\
&\leq\nu+\nu_{G}\cdot\|\nabla_{\boldsymbol{c}_{j}}\phi(t,\boldsymbol{\delta}_{j}^{it},\boldsymbol{c}_{j}^{it})\|^{2}, &&\text{(37)}
\end{aligned}$$

where $\nu_{G}:=\nu_{H}+\mu_{G}^{2}\geq\mu^{2}$. Then, applying Assumption 1 and (37) to (35), we obtain

$$\mathbb{E}[\phi(t,\boldsymbol{\delta}_{j}^{it},\boldsymbol{c}_{j}^{it+1})]-\phi(t,\boldsymbol{\delta}_{j}^{it},\boldsymbol{c}_{j}^{it})\leq-\bigl(\mu-\frac{1}{2}\eta\nu_{G}L(t)\bigr)\eta\|\nabla_{\boldsymbol{c}_{j}}\phi(t,\boldsymbol{\delta}_{j}^{it},\boldsymbol{c}_{j}^{it})\|^{2}+\frac{1}{2}\eta^{2}\nu L(t). \tag{38}$$

To ensure that the objective function value decreases within refinement, we need $\mu-\frac{1}{2}\eta\nu_{G}L(t)>0$, which implies $\eta<\frac{2\mu}{\nu_{G}L(t)}\leq\frac{2}{\mu L(t)}$. Next, we prove the decreasing property in the assignment. Following the optimality condition with respect to $\boldsymbol{\delta}_{j}$, the following inequality holds:

$$\mathbb{E}[\phi(t,\boldsymbol{\delta}_{j}^{it+1},\boldsymbol{c}_{j}^{it+1})]\leq\mathbb{E}[\phi(t,\boldsymbol{\delta}_{j}^{it},\boldsymbol{c}_{j}^{it+1})]. \tag{39}$$

Combining (38) and (39) yields

$$\mathbb{E}[\phi(t,\boldsymbol{\delta}_{j}^{it+1},\boldsymbol{c}_{j}^{it+1})]\leq\phi(t,\boldsymbol{\delta}_{j}^{it},\boldsymbol{c}_{j}^{it}). \tag{40}$$

Summing over (40) from $1$ to $k$ proves Theorem 2. ∎
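The refinement step can be illustrated in its simplest deterministic form. The sketch below replaces the mini-batch estimate $g(\mathcal{B},\boldsymbol{c}_{j})$ with the exact gradient, so that, under our reading of Assumption 1, $\mu=\mu_{G}=1$ and the noise term $\nu$ vanishes, and (38) reduces to a guaranteed decrease whenever $\eta<2/L(t)$. It assumes a single cluster of unit-norm synthetic data and uses $2+4t$ as a smoothness constant; both are assumptions of the sketch rather than the setting of Theorem 2.

```python
import math
import random

def sq_dist(x, c):
    return sum((a - b) ** 2 for a, b in zip(x, c))

def phi(c, X, t):
    """Single-cluster tilted objective (1/t) * log((1/n) * sum_i exp(t * ||x_i - c||^2))."""
    e = [t * sq_dist(x, c) for x in X]
    m = max(e)
    return (m + math.log(sum(math.exp(v - m) for v in e) / len(X))) / t

def grad_phi(c, X, t):
    """Exact gradient: sum_i w_i * 2 * (c - x_i), with tilted softmax weights w_i."""
    e = [t * sq_dist(x, c) for x in X]
    m = max(e)
    w = [math.exp(v - m) for v in e]
    s = sum(w)
    return [sum(wi * 2.0 * (c[d] - x[d]) for wi, x in zip(w, X)) / s for d in range(len(c))]

def unit(v):
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

rng = random.Random(1)
X = [unit([rng.gauss(0, 1) for _ in range(3)]) for _ in range(50)]  # unit-norm toy data
t = 1.0
L = 2.0 + 4.0 * t   # assumed smoothness constant for unit-norm data
eta = 1.0 / L       # step size satisfying eta < 2 / L

c = [rng.uniform(-1, 1) for _ in range(3)]
history = [phi(c, X, t)]
for _ in range(100):
    g = grad_phi(c, X, t)
    c = [ci - eta * gi for ci, gi in zip(c, g)]  # full-batch refinement step
    history.append(phi(c, X, t))
print("initial objective:", history[0], "final objective:", history[-1])
```

With stochastic gradients the same recursion only decreases in expectation and up to the variance term $\frac{1}{2}\eta^{2}\nu L(t)$, which is why this sketch uses the full batch.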

8.3 Proof of Theorem 3

We begin by defining the tilted weight, tilted empirical mean, and tilted empirical variance when all data points are normalized to a unit norm.

Definition 5 (Tilted gradient and weight).

Suppose the dataset is normalized. Then the tilted weight is defined as

$$w_{i}(t,\mathcal{S}_{j},\boldsymbol{c}_{j}):=\frac{e^{t\|\boldsymbol{x}_{i}-\boldsymbol{c}_{j}\|^{2}}}{\sum_{\boldsymbol{x}_{l}\in\mathcal{S}_{j}}e^{t\|\boldsymbol{x}_{l}-\boldsymbol{c}_{j}\|^{2}}}=\frac{1}{|\mathcal{S}_{j}|}e^{-2t\boldsymbol{c}_{j}^{\top}\boldsymbol{x}_{i}-\Gamma(t,\mathcal{S}_{j},\boldsymbol{c}_{j})},$$

where $\Gamma(t,\mathcal{S}_{j},\boldsymbol{c}_{j}):=\log\frac{1}{|\mathcal{S}_{j}|}\sum_{\boldsymbol{x}_{i}\in\mathcal{S}_{j}}e^{-2t\boldsymbol{c}_{j}^{\top}\boldsymbol{x}_{i}}$.
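The two expressions for the tilted weight coincide because, for unit-norm points, $\|\boldsymbol{x}_{i}-\boldsymbol{c}_{j}\|^{2}=1+\|\boldsymbol{c}_{j}\|^{2}-2\boldsymbol{c}_{j}^{\top}\boldsymbol{x}_{i}$ and the constant $1+\|\boldsymbol{c}_{j}\|^{2}$ cancels in the ratio. A minimal sketch of this check follows; the function names, the centroid, and the data are hypothetical.

```python
import math
import random

def sq_dist(x, c):
    return sum((a - b) ** 2 for a, b in zip(x, c))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def tilted_weights_direct(X, c, t):
    """w_i = exp(t * ||x_i - c||^2) / sum_l exp(t * ||x_l - c||^2), computed stably."""
    e = [t * sq_dist(x, c) for x in X]
    m = max(e)
    w = [math.exp(v - m) for v in e]
    s = sum(w)
    return [wi / s for wi in w]

def tilted_weights_normalized(X, c, t):
    """w_i = (1/|S|) * exp(-2 t c^T x_i - Gamma); valid only when every ||x_i|| = 1."""
    s = [-2.0 * t * dot(c, x) for x in X]
    m = max(s)
    gamma = m + math.log(sum(math.exp(v - m) for v in s) / len(X))  # Gamma(t, S, c)
    return [math.exp(v - gamma) / len(X) for v in s]

rng = random.Random(0)
X = [unit([rng.gauss(0, 1) for _ in range(3)]) for _ in range(40)]  # unit-norm toy cluster
c = [0.3, -0.2, 0.5]  # hypothetical centroid
t = 1.5

w_direct = tilted_weights_direct(X, c, t)
w_norm = tilted_weights_normalized(X, c, t)
print(max(abs(a - b) for a, b in zip(w_direct, w_norm)))
```

Points far from the centroid receive larger weights as $t$ grows, which is the mechanism by which tilting emphasizes poorly served points.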

Definition 6 (Tilted empirical mean and variance).

Suppose the dataset is normalized. The tilted empirical mean and variance in each cluster are defined as

$$\begin{aligned}
\mathbb{E}_{t}\Bigl(\mathrm{f}(\mathcal{S}_{j},\boldsymbol{c}_{j})\Bigr)&:=\|\boldsymbol{c}_{j}\|^{2}+\sum_{\boldsymbol{x}_{i}\in\mathcal{S}_{j}}w_{i}(t,\mathcal{S}_{j},\boldsymbol{c}_{j})\|\boldsymbol{x}_{i}\|^{2}-\boldsymbol{c}_{j}^{\top}M(t,\mathcal{S}_{j},\boldsymbol{c}_{j}), \\
\mathrm{Var}_{t}\Bigl(\mathrm{f}(\mathcal{S}_{j},\boldsymbol{c}_{j})\Bigr)&:=\mathbb{E}_{t}\Bigl(\boldsymbol{c}_{j}^{\top}\bigl(-2\boldsymbol{x}_{i}-M(t,\mathcal{S}_{j},\boldsymbol{c}_{j})\bigr)\Bigr)^{2}=\boldsymbol{c}_{j}^{\top}V(t,\mathcal{S}_{j},\boldsymbol{c}_{j})\boldsymbol{c}_{j},
\end{aligned}$$

where $M(t,\mathcal{S}_{j},\boldsymbol{c}_{j}):=\sum_{\boldsymbol{x}_{i}\in\mathcal{S}_{j}}2w_{i}(t,\mathcal{S}_{j},\boldsymbol{c}_{j})\boldsymbol{x}_{i}$, and

$$\begin{aligned}
V(t,\mathcal{S}_{j},\boldsymbol{c}_{j})&:=\mathbb{E}_{t}\bigl(-2\boldsymbol{x}_{i}-M(t,\mathcal{S}_{j},\boldsymbol{c}_{j})\bigr)^{\top}\bigl(-2\boldsymbol{x}_{i}-M(t,\mathcal{S}_{j},\boldsymbol{c}_{j})\bigr) \\
&=\sum_{\boldsymbol{x}_{i}\in\mathcal{S}_{j}}w_{i}(t,\mathcal{S}_{j},\boldsymbol{c}_{j})\bigl(-2\boldsymbol{x}_{i}-M(t,\mathcal{S}_{j},\boldsymbol{c}_{j})\bigr)^{\top}\bigl(-2\boldsymbol{x}_{i}-M(t,\mathcal{S}_{j},\boldsymbol{c}_{j})\bigr).
\end{aligned}$$

Lemma 6 (Partial derivatives of $M(t,\mathcal{S}_{j},\boldsymbol{c}_{j})$ and $\Gamma(t,\mathcal{S}_{j},\boldsymbol{c}_{j})$ ).

For any $t\geq 0$ and any $\boldsymbol{c}_{j}\in\mathbb{R}^{d}$, it holds that

$$\begin{aligned}
\frac{\partial}{\partial t}M(t,\mathcal{S}_{j},\boldsymbol{c}_{j})&=-V(t,\mathcal{S}_{j},\boldsymbol{c}_{j})\boldsymbol{c}_{j}, &&\text{(41)}\\
\nabla_{\boldsymbol{c}_{j}}M(t,\mathcal{S}_{j},\boldsymbol{c}_{j})&=-tV(t,\mathcal{S}_{j},\boldsymbol{c}_{j}), &&\text{(42)}\\
\frac{\partial}{\partial t}\Gamma(t,\mathcal{S}_{j},\boldsymbol{c}_{j})&=-\boldsymbol{c}_{j}^{\top}M(t,\mathcal{S}_{j},\boldsymbol{c}_{j}), &&\text{(43)}\\
\nabla_{\boldsymbol{c}_{j}}\Gamma(t,\mathcal{S}_{j},\boldsymbol{c}_{j})&=-tM(t,\mathcal{S}_{j},\boldsymbol{c}_{j}). &&\text{(44)}
\end{aligned}$$

Proof of Theorem 3.

Let $\boldsymbol{c}_{j}(t):=\operatorname{Tm}(t,\mathcal{S}_{j})$ be the solution of (5.1). Substituting $t$, $\mathcal{S}_{j}$ and $\boldsymbol{c}_{j}(t)$ into the tilted weight, and denoting the result by $\hat{w}_{i}:=w_{i}(t,\mathcal{S}_{j},\boldsymbol{c}_{j}(t))$, we obtain the tilted empirical mean and variance for each cluster as

$$\begin{aligned}
\mathbb{E}_{t}\Bigl(\mathrm{f}(\mathcal{S}_{j},\boldsymbol{c}_{j})\Bigr)&=\sum_{\boldsymbol{x}_{i}\in\mathcal{S}_{j}}\hat{w}_{i}\cdot f(\boldsymbol{x}_{i},\boldsymbol{c}_{j}) \\
&=\|\boldsymbol{c}_{j}\|^{2}+\sum_{\boldsymbol{x}_{i}\in\mathcal{S}_{j}}\hat{w}_{i}\|\boldsymbol{x}_{i}\|^{2}-\boldsymbol{c}_{j}^{\top}M_{t}, \\
\mathrm{Var}_{t}\Bigl(\mathrm{f}(\mathcal{S}_{j},\boldsymbol{c}_{j})\Bigr)&=\mathbb{E}_{t}\Bigl(\boldsymbol{c}_{j}^{\top}\bigl(-2\boldsymbol{x}_{i}-M_{t}\bigr)\Bigr)^{2}=\boldsymbol{c}_{j}^{\top}V_{t}\boldsymbol{c}_{j},
\end{aligned}$$

where $M_{t}:=2\sum_{\boldsymbol{x}_{i}\in\mathcal{S}_{j}}\hat{w}_{i}\cdot\boldsymbol{x}_{i}$ and $V_{t}:=\sum_{\boldsymbol{x}_{i}\in\mathcal{S}_{j}}\hat{w}_{i}\bigl(-2\boldsymbol{x}_{i}-M_{t}\bigr)^{\top}\bigl(-2\boldsymbol{x}_{i}-M_{t}\bigr)$ are constants. Then, taking the derivative of $\mathrm{Var}_{\tau}\Bigl(\mathrm{f}\bigl(\mathcal{S}_{j},\boldsymbol{c}_{j}(t)\bigr)\Bigr)$ with respect to $t$, we have

$$\begin{aligned}
\frac{\partial}{\partial t}\Bigl\{\mathrm{Var}_{\tau}\Bigl(\mathrm{f}\bigl(\mathcal{S}_{j},\boldsymbol{c}_{j}(t)\bigr)\Bigr)\Bigr\}
&=\Bigl(\frac{\partial}{\partial t}\boldsymbol{c}_{j}(t)\Bigr)^{\top}\cdot\nabla_{\boldsymbol{c}_{j}}\Bigl\{\mathrm{Var}_{\tau}\Bigl(\mathrm{f}\bigl(\mathcal{S}_{j},\boldsymbol{c}_{j}(t)\bigr)\Bigr)\Bigr\} \\
&=2\Bigl(\frac{\partial}{\partial t}\boldsymbol{c}_{j}(t)\Bigr)^{\top}V_{\tau}\boldsymbol{c}_{j}(t). &&\text{(45)}
\end{aligned}$$

Based on the optimality condition with respect to $\boldsymbol{c}_{j}$, we have

$$0=\sum_{\boldsymbol{x}_{i}\in\mathcal{S}_{j}}e^{t\|\boldsymbol{x}_{i}-\boldsymbol{c}_{j}(t)\|^{2}}(\boldsymbol{x}_{i}-\boldsymbol{c}_{j}(t)). \tag{46}$$

Dividing both sides of (46) by $-\frac{1}{2}\sum_{\boldsymbol{x}_{i}\in\mathcal{S}_{j}}e^{t\|\boldsymbol{x}_{i}-\boldsymbol{c}_{j}(t)\|^{2}}$ and differentiating with respect to $t$ yields

$$\begin{aligned}
0&=\frac{\partial}{\partial t}\Bigl\{\sum_{\boldsymbol{x}_{i}\in\mathcal{S}_{j}}w_{i}(t,\mathcal{S}_{j},\boldsymbol{c}_{j}(t))\cdot 2(\boldsymbol{c}_{j}(t)-\boldsymbol{x}_{i})\Bigr\} \\
&=\frac{\partial}{\partial t}\Bigl\{2\boldsymbol{c}_{j}(t)-M(t,\mathcal{S}_{j},\boldsymbol{c}_{j}(t))\Bigr\} \\
&=\frac{\partial\boldsymbol{c}_{j}(t)}{\partial t}\Bigl(2-\nabla_{\boldsymbol{c}_{j}}M(t,\mathcal{S}_{j},\boldsymbol{c}_{j}(t))\Bigr)-\frac{\partial}{\partial\tau}M(\tau,\mathcal{S}_{j},\boldsymbol{c}_{j}(t))\Big|_{\tau=t} &&\text{(47)}\\
&=\frac{\partial\boldsymbol{c}_{j}(t)}{\partial t}\Bigl(2+tV(t,\mathcal{S}_{j},\boldsymbol{c}_{j}(t))\Bigr)+V(t,\mathcal{S}_{j},\boldsymbol{c}_{j}(t))\boldsymbol{c}_{j}(t), &&\text{(48)}
\end{aligned}$$

where (47) follows from the chain rule, and (48) follows from Lemma 6. Then we can infer from (48) that

$$\frac{\partial\boldsymbol{c}_{j}(t)}{\partial t}=-V(t,\mathcal{S}_{j},\boldsymbol{c}_{j}(t))\boldsymbol{c}_{j}(t)\cdot\frac{1}{2+tV(t,\mathcal{S}_{j},\boldsymbol{c}_{j}(t))}. \tag{49}$$

Substituting (49) into (45), we obtain

$$\begin{aligned}
\frac{\partial}{\partial t}\Bigl\{\mathrm{Var}_{\tau}\Bigl(\mathrm{f}\bigl(\mathcal{S}_{j},\boldsymbol{c}_{j}(t)\bigr)\Bigr)\Bigr\}
&=2\Bigl(\frac{\partial}{\partial t}\boldsymbol{c}_{j}(t)\Bigr)^{\top}V_{\tau}\boldsymbol{c}_{j}(t) \\
&=\underbrace{-\frac{2\boldsymbol{c}_{j}(t)^{\top}V(t,\mathcal{S}_{j},\boldsymbol{c}_{j}(t))V_{\tau}\boldsymbol{c}_{j}(t)}{2+tV(t,\mathcal{S}_{j},\boldsymbol{c}_{j}(t))}}_{<0},
\end{aligned}$$

which completes the proof. ∎

8.4 Proof of Theorem 4

Proof.

Initializing the centroids with $k$-means++ requires $O(nkd)$ multiplications. In each iteration, assignment and refinement require $O(nkd)$ and $O(nkdE)$ multiplications, respectively. With the number of iterations set to $T$, the number of multiplications required by TKM is $O(nkdET)$. ∎

8.5 Proof of Theorem 5

Proof.

When $k=1$, we obtain

$$\overline{\phi}(t,\mathcal{S},\boldsymbol{c})=\frac{1}{t}\log\frac{1}{n}\sum_{i=1}^{n}e^{tf(\boldsymbol{x}_{i},\boldsymbol{c})}, \tag{50}$$

where $\mathcal{S}=X$ and $\boldsymbol{c}=\operatorname{Tm}(t,\mathcal{S})$ are the unique cluster and centroid. We directly take the partial derivative of $\overline{\phi}(t,\mathcal{S},\boldsymbol{c})$ with respect to $t$, yielding:

$$\begin{aligned}
\frac{\partial\overline{\phi}(t,\mathcal{S},\boldsymbol{c})}{\partial t}
&=\frac{1}{t}\frac{\sum_{i=1}^{n}e^{t\|\boldsymbol{x}_{i}-\boldsymbol{c}\|^{2}}\|\boldsymbol{x}_{i}-\boldsymbol{c}\|^{2}}{\sum_{i=1}^{n}e^{t\|\boldsymbol{x}_{i}-\boldsymbol{c}\|^{2}}}-\frac{1}{t^{2}}\log\frac{1}{n}\sum_{i=1}^{n}e^{t\|\boldsymbol{x}_{i}-\boldsymbol{c}\|^{2}} \\
&=-\frac{1}{t}\boldsymbol{c}^{\top}\sum_{i=1}^{n}2w_{i}(t,\mathcal{S},\boldsymbol{c})\boldsymbol{x}_{i}-\frac{1}{t^{2}}\log\frac{1}{n}\sum_{i=1}^{n}e^{-2t\boldsymbol{c}^{\top}\boldsymbol{x}_{i}} &&\text{(51)}\\
&=-\frac{1}{t}\boldsymbol{c}^{\top}M(t,\mathcal{S},\boldsymbol{c})-\frac{1}{t^{2}}\Gamma(t,\mathcal{S},\boldsymbol{c})=:g(t,\mathcal{S},\boldsymbol{c}), &&\text{(52)}
\end{aligned}$$

where (51) follows from the fact that all data points are normalized, and (52) defines $g(t,\mathcal{S},\boldsymbol{c})$. Next, we consider

$$\begin{aligned}
\frac{\partial}{\partial t}\{t^{2}g(t,\mathcal{S},\boldsymbol{c})\}&=\frac{\partial}{\partial t}\Bigl\{-t\boldsymbol{c}^{\top}M(t,\mathcal{S},\boldsymbol{c})-\Gamma(t,\mathcal{S},\boldsymbol{c})\Bigr\} \\
&=t\boldsymbol{c}^{\top}V(t,\mathcal{S},\boldsymbol{c})\boldsymbol{c}, &&\text{(53)}
\end{aligned}$$

where (53) follows from (41), (43) and the chain rule. Since $t\boldsymbol{c}^{\top}V(t,\mathcal{S},\boldsymbol{c})\boldsymbol{c}\geq 0$ for any $t\geq 0$, $t^{2}g(t,\mathcal{S},\boldsymbol{c})$ is monotonically non-decreasing in $t$, and its minimum value is attained at $t=0$. When $t=0$, we have

$$\begin{aligned}
g(0,\mathcal{S},\boldsymbol{c})&:=\lim_{t\rightarrow 0}-\frac{\Gamma(t,\mathcal{S},\boldsymbol{c})+t\boldsymbol{c}^{\top}M(t,\mathcal{S},\boldsymbol{c})}{t^{2}} \\
&=\frac{1}{2}\boldsymbol{c}^{\top}V(0,\mathcal{S},\boldsymbol{c})\boldsymbol{c}, &&\text{(54)}
\end{aligned}$$

where (54) follows from (41), (43) and L’Hôpital’s rule. Then we obtain $t^{2}g(t,\mathcal{S},\boldsymbol{c})\geq 0$, and consequently infer $g(t,\mathcal{S},\boldsymbol{c})\geq 0$ for any $t\geq 0$. In conjunction with Equation (52), Theorem 5 is implied. ∎
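The monotonicity stated in Theorem 5 can be observed numerically for $k=1$. The following minimal sketch computes the tilted mean by gradient descent on unit-norm synthetic data and evaluates the tilted SSE (50) on a grid of $t$ values. The step size $1/(2+4t)$, the iteration count, and the data are assumptions of this sketch and are not the solver used by TKM.

```python
import math
import random

def sq_dist(x, c):
    return sum((a - b) ** 2 for a, b in zip(x, c))

def phi(c, X, t):
    """Tilted SSE for k = 1: (1/t) * log((1/n) * sum_i exp(t * ||x_i - c||^2))."""
    e = [t * sq_dist(x, c) for x in X]
    m = max(e)
    return (m + math.log(sum(math.exp(v - m) for v in e) / len(X))) / t

def grad_phi(c, X, t):
    """Gradient sum_i w_i * 2 * (c - x_i) with tilted softmax weights w_i."""
    e = [t * sq_dist(x, c) for x in X]
    m = max(e)
    w = [math.exp(v - m) for v in e]
    s = sum(w)
    return [sum(wi * 2.0 * (c[d] - x[d]) for wi, x in zip(w, X)) / s for d in range(len(c))]

def tilted_mean(X, t, iters=400):
    """Approximate the tilted mean Tm(t, X) by gradient descent from the plain mean."""
    dim = len(X[0])
    c = [sum(x[d] for x in X) / len(X) for d in range(dim)]
    eta = 1.0 / (2.0 + 4.0 * t)  # assumed step size, valid for unit-norm data
    for _ in range(iters):
        g = grad_phi(c, X, t)
        c = [ci - eta * gi for ci, gi in zip(c, g)]
    return c

def unit(v):
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

rng = random.Random(0)
X = [unit([rng.gauss(0, 1) for _ in range(3)]) for _ in range(60)]  # unit-norm toy data

ts = [0.1, 0.5, 1.0, 2.0, 5.0]
values = [phi(tilted_mean(X, t), X, t) for t in ts]
# Residual of the optimality condition (46) at the largest t, as a convergence check.
grad_norm = math.sqrt(sum(g * g for g in grad_phi(tilted_mean(X, ts[-1]), X, ts[-1])))
for t, v in zip(ts, values):
    print(f"t={t:<4} tilted SSE at the tilted mean = {v:.6f}")
```

The printed values should not decrease as $t$ grows, in line with $g(t,\mathcal{S},\boldsymbol{c})\geq 0$.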

References

[1] Open university learning analytics dataset. https://analyse.kmi.open.ac.uk/open_dataset, 2015.
[2] 3d road network (north jutland, denmark). https://archive.ics.uci.edu/dataset/246/3d+road+network+north+jutland+denmark, 2017.
[3] The home mortgage disclosure act. https://ffiec.cfpb.gov/data-browser/, 2017.
[4] The U.S. census data. https://www.census.gov/glossary/#term_Populationestimates, 2021.
[5] A tutorial and resources for fair clustering. https://www.fairclustering.com/, 2022.
[6] Utrecht fairness recruitment dataset. https://www.kaggle.com/datasets/ictinstitute/utrecht-fairness-recruitment-dataset, 2022.
[7] M. Ankerst, M. M. Breunig, H. Kriegel, and J. Sander. OPTICS: ordering points to identify the clustering structure. In SIGMOD, pages 49–60, 1999.
[8] D. Arthur and S. Vassilvitskii. k-means++: the advantages of careful seeding. In SODA, pages 1027–1035, 2007.
[9] T. Athanasios and L. Max. UCI machine learning repository: Gender gap in spanish wp data set. https://archive.ics.uci.edu/ml/datasets/Parkinsons+Telemonitoring, 2009.
[10] P. Azoulay, T. E. Stuart, and Y. Wang. Matthew: Effect or fable? Manag. Sci., 60(1):92–109, 2014.
[11] A. Beirami, A. R. Calderbank, M. M. Christiansen, K. R. Duffy, and M. Médard. A characterization of guesswork on swiftly tilting curves. IEEE Trans. Inf. Theory, 65(5):2850–2871, 2019.
[12] S. K. Bera, D. Chakrabarty, N. Flores, and M. Negahbani. Fair algorithms for clustering. In NeurIPS, pages 4955–4966, 2019.
[13] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Rev., 60(2):223–311, 2018.
[14] B. Brubach, D. Chakrabarti, J. P. Dickerson, S. Khuller, A. Srinivasan, and L. Tsepenekas. A pairwise fair and community-preserving approach to k-center clustering. In ICML, pages 1178–1189, 2020.
[15] F. Buet-Golfouse and I. Utyagulov. Towards fair unsupervised learning. In FAccT, pages 1399–1409, 2022.
[16] R. W. Butler. Saddlepoint approximations with applications. Cambridge University Press, 2007.
[17] X. Cao, G. Cong, and C. S. Jensen. Mining significant semantic locations from gps data. Proc. VLDB Endow., 3(1):1009–1020, 2010.
[18] S. Caton and C. Haas. Fairness in machine learning: A survey. ACM Comput. Surv., 2023.
[19] D. Chakrabarti, J. P. Dickerson, S. A. Esmaeili, A. Srinivasan, and L. Tsepenekas. A new notion of individually fair clustering: $\alpha$ -equitable k-center. In AISTATS, volume 151, pages 6387–6408, 2022.
[20] X. Chen, B. Fain, L. Lyu, and K. Munagala. Proportionally fair clustering. In ICML, volume 97, pages 1032–1041, 2019.
[21] R. Chhaya, A. Dasgupta, J. Choudhari, and S. Shit. On coresets for fair regression and individually fair clustering. In AISTATS, volume 151, pages 9603–9625, 2022.
[22] F. Chierichetti, R. Kumar, S. Lattanzi, and S. Vassilvitskii. Fair clustering through fairlets. In NeurIPS, pages 5029–5037, 2017.
[23] G. Cong, C. S. Jensen, and D. Wu. Efficient retrieval of the top-k most relevant spatial web objects. Proc. VLDB Endow., 2(1):337–348, 2009.
[24] T. M. Cover. Elements of information theory. John Wiley & Sons, 1999.
[25] A. Dembo and O. Zeitouni. Large deviations techniques and applications. Springer Science & Business Media, 2009.
[26] J. P. Dickerson, S. A. Esmaeili, J. H. Morgenstern, and C. J. Zhang. Doubly constrained fair clustering. In NeurIPS, 2024.
[27] Y. Dong, J. Ma, S. Wang, C. Chen, and J. Li. Fairness in graph mining: A survey. IEEE Trans. Knowl. Data Eng., 35(10):10583–10602, 2023.
[28] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. S. Zemel. Fairness through awareness. In ITCS, pages 214–226, 2012.
[29] P. Edara and M. Pasumansky. Big metadata: When metadata is big data. Proc. VLDB Endow., 14(12):3083–3095, 2021.
[30] W. Fan. Big graphs: Challenges and opportunities. Proc. VLDB Endow., 15(12):3782–3797, 2022.
[31] S. Gupta, R. Kumar, K. Lu, B. Moseley, and S. Vassilvitskii. Local search methods for k-means with outliers. Proc. VLDB Endow., 10(7):757–768, 2017.
[32] M. Bateni, V. Cohen-Addad, A. Epasto, and S. Lattanzi. A scalable algorithm for individually fair k-means clustering. arXiv:2402.06730, 2024.
[33] L. Huang, S. H. Jiang, and N. K. Vishnoi. Coresets for clustering with fairness constraints. In NeurIPS, pages 7587–7598, 2019.
[34] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Comput. Surv., 31(3):264–323, 1999.
[35] C. Jung, S. Kannan, and N. Lutz. Service in your neighborhood: Fairness in center location. In FORC, volume 156, pages 5:1–5:15, 2020.
[36] R. Kohavi. Scaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid. In KDD, pages 202–207, 1996.
[37] T. Li, A. Beirami, M. Sanjabi, and V. Smith. On tilted losses in machine learning: Theory and applications. J. Mach. Learn. Res., 24:142:1–142:79, 2023.
[38] E. Liberty, Z. Karnin, B. Xiang, L. Rouesnel, B. Coskun, R. Nallapati, J. Delgado, A. Sadoughi, Y. Astashonok, P. Das, C. Balioglu, S. Chakravarty, M. Jha, P. Gautier, D. Arpin, T. Januschowski, V. Flunkert, Y. Wang, J. Gasthaus, L. Stella, S. Rangapuram, D. Salinas, S. Schelter, and A. Smola. Elastic machine learning algorithms in Amazon SageMaker. In SIGMOD, pages 731–737, 2020.
[39] S. P. Lloyd. Least squares quantization in PCM. IEEE Trans. Inf. Theory, 28(2):129–136, 1982.
[40] S. Mahabadi and A. Vakilian. Individual fairness for k-clustering. In ICML, volume 119, pages 6586–6596, 2020.
[41] C. Meek, B. Thiesson, and D. Heckerman. The learning-curve sampling method applied to model-based clustering. J. Mach. Learn. Res., 2:397–418, 2002.
[42] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan. A survey on bias and fairness in machine learning. ACM Comput. Surv., 54(6):115:1–115:35, 2022.
[43] N. Merhav. List decoding - random coding exponents and expurgated exponents. IEEE Trans. Inf. Theory, 60(11):6749–6759, 2014.
[44] R. K. Merton. The matthew effect in science. Science, 159(3810):56–63, 1968.
[45] S. Moro, P. Cortez, and P. Rita. A data-driven approach to predict the success of bank telemarketing. Decis. Support Syst., 62:22–31, 2014.
[46] K. O. Mortensen, F. Zardbani, M. A. Haque, S. Y. Agustsson, D. Mottin, P. Hofmann, and P. Karras. Marigold: Efficient k-means clustering in high dimensions. Proc. VLDB Endow., 16(7):1740–1748, 2023.
[47] F. Nargesian, E. Zhu, R. J. Miller, K. Q. Pu, and P. C. Arocena. Data lake management: Challenges and opportunities. Proc. VLDB Endow., 12(12):1986–1989, 2019.
[48] M. Negahbani and D. Chakrabarty. Better algorithms for individually fair k-clustering. In NeurIPS, pages 13340–13351, 2021.
[49] Y. E. Nesterov. Introductory Lectures on Convex Optimization - A Basic Course, volume 87. Springer, 2004.
[50] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. VanderPlas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in python. J. Mach. Learn. Res., 12:2825–2830, 2011.
[51] E. Y. Pee and J. O. Royset. On solving large-scale finite minimax problems using exponential smoothing. J. Optim. Theory Appl., 148(2):390–421, 2011.
[52] D. Sculley. Web-scale k-means clustering. In WWW, pages 1177–1178, 2010.
[53] S. Shaham, G. Ghinita, and C. Shahabi. Models and mechanisms for spatial data fairness. Proc. VLDB Endow., 16(2):167–179, 2022.
[54] S. Shang, L. Chen, Z. Wei, C. S. Jensen, J. Wen, and P. Kalnis. Collective travel planning in spatial networks. IEEE Trans. Knowl. Data Eng., 28(5):1132–1146, 2016.
[55] C. Shen and H. Li. On the dual formulation of boosting algorithms. IEEE Trans. Pattern Anal. Mach. Intell., 32(12):2216–2231, 2010.
[56] D. Siegmund. Importance sampling in the monte carlo study of sequential tests. Ann. Stat., pages 673–684, 1976.
[57] B. Strack, J. P. DeShazo, C. Gennings, J. L. Olmo, S. Ventura, K. J. Cios, J. N. Clore, et al. Impact of hba1c measurement on hospital readmission rates: Analysis of 70,000 clinical database patient records. Biomed Res. Int., 2014.
[58] A. Szabó, H. J. Rad, and S. Mannava. Tilted cross-entropy (TCE): promoting fairness in semantic segmentation. In CVPR, pages 2305–2310, 2021.
[59] R. Tang and Y. Yang. Bayesian inference for risk minimization via exponentially tilted empirical likelihood. J. R. Stat. Soc. B., 84(4):1257–1286, 2022.
[60] A. Vakilian and M. Yalçiner. Improved approximation algorithms for individually fair clustering. In AISTATS, volume 151, pages 8758–8779, 2022.
[61] D. Van Aken, A. Pavlo, G. J. Gordon, and B. Zhang. Automatic database management system tuning through large-scale machine learning. In SIGMOD, pages 1009–1024, 2017.
[62] S. Wang, Y. Sun, and Z. Bao. On the efficiency of k-means clustering: Evaluation, optimization, and algorithm selection. Proc. VLDB Endow., 14(2):163–175, 2020.
[63] Y. Wang, H. Chen, W. Liu, F. He, T. Gong, Y. Fu, and D. Tao. Tilted sparse additive models. In ICML, volume 202, pages 35579–35604, 2023.
[64] X. Wu, X. Zhu, G. Wu, and W. Ding. Data mining with big data. IEEE Trans. Knowl. Data Eng., 26(1):97–107, 2014.
[65] H. Zhang, G. Chen, B. C. Ooi, K. Tan, and M. Zhang. In-memory big data management and processing: A survey. IEEE Trans. Knowl. Data Eng., 27(7):1920–1948, 2015.
[66] Y. Zhou, H. Cheng, and J. X. Yu. Graph clustering based on structural/attribute similarities. Proc. VLDB Endow., 2(1):718–729, 2009.
[67] S. Zhu, Q. Xu, J. Zeng, S. Wang, Y. Sun, Z. Yang, C. Yang, and Z. Peng. F3KM: federated, fair, and fast k-means. Proc. ACM Manag. Data, 1(4):241:1–241:25, 2023.

