
Efficient $k$-means with Individual Fairness via Exponential Tilting

Shengkun Zhu, Jinshan Zeng, Yuan Sun, Sheng Wang, Xiaodong Li, Zhiyong Peng

Shengkun Zhu, Sheng Wang, and Zhiyong Peng are with the School of Computer Science, Wuhan University. E-mail: {whuzsk66, swangcs, peng}@whu.edu.cn
Jinshan Zeng is with the School of Computer and Information Engineering, Jiangxi Normal University. E-mail: jinshanzeng@jxnu.edu.cn
Yuan Sun is with La Trobe Business School, La Trobe University. E-mail: yuan.sun@latrobe.edu.au
Xiaodong Li is with the School of Computing Technologies, RMIT University. E-mail: xiaodong.li@rmit.edu.au

Abstract

In location-based resource allocation scenarios, it is desirable for the distances between individuals and their facility to be approximately equal, thereby ensuring fairness. Individually fair clustering is often employed to achieve the principle of “treating all points equally”, which can be applied in these scenarios. This paper proposes a novel algorithm, tilted $k$-means (TKM), which aims to achieve individual fairness in clustering to ensure fair allocation of resources. We integrate exponential tilting into the sum of squared errors (SSE) to formulate a novel objective function called tilted SSE. We demonstrate that the tilted SSE generalizes the SSE and employ coordinate descent and a first-order gradient method for optimization. We propose a novel fairness metric, the variance of the squared distance of each point to its nearest centroid within a cluster, which can alleviate the Matthew Effect typically caused by existing fairness metrics. Our theoretical analysis demonstrates that the well-known $k$-means++ incurs a multiplicative error of $O(k \log k)$ with our objective function, and we establish the convergence of TKM under mild conditions. In terms of fairness, we prove that the variance in each cluster generated by TKM decreases with $t$, where $t$ is a hyperparameter that adjusts the trade-off between utility and fairness. In terms of efficiency, we demonstrate that the time complexity of TKM is linear in the dataset size. Moreover, we demonstrate the monotonicity of the tilted SSE with respect to $t$ in a simple case. Our experimental results demonstrate that TKM outperforms state-of-the-art methods in effectiveness, fairness, and efficiency. Specifically, TKM exhibits a better trade-off between clustering utility and fairness than six baselines and runs hundreds or even thousands of times faster. Moreover, TKM can overcome the RAM overflow issue that other methods encounter on large datasets.

Index Terms:
Location-based resource allocation, $k$-means, individual fairness, exponential tilting, coordinate descent, variance.

1 Introduction

In the era of big data, the scale of data is increasing exponentially [47, 65], emerging from diverse fields [29, 30] with rich information and potential value [64]. Clustering algorithms have become powerful tools for exploring the internal structure of datasets by partitioning data points into different clusters [7], where data points within the same cluster are similar to each other, while those in different clusters are dissimilar [34, 66]. $k$-means is one of the most classic clustering algorithms; it measures the similarity between data points using Euclidean distance and is suitable for various types of data [46, 17, 23, 61]. This characteristic makes $k$-means widely applicable in various location-based resource allocation scenarios, such as opening new facilities to serve residents [38, 31, 62, 54].

Figure 1: A comparison between $k$-means and individually fair $k$-means. $k$-means leaves minority residents too far from the centroid, whereas in the clustering results of individually fair $k$-means, the distance from each resident to the centroid is approximately equal.

However, applying $k$-means to resource allocation scenarios may lead to unfairness [53, 35, 5]. Consider the scenario in Figure 1(a): when setting up public facilities such as hospitals for residents, $k$-means tends to place these facilities closer to densely populated areas, so that sparsely populated areas have difficulty accessing public resources and minority residents are treated unfairly. Individual fairness is a promising concept that can ensure that within the same cluster, each data point is treated approximately equally [42, 18]. Figure 1(b) shows the clustering results obtained by $k$-means with individual fairness, where the distances from each resident to the centroid are approximately equal. In this case, we consider the clustering result to be fair.

One of the most widely studied concepts in individual fairness for $k$-clustering is “service in your neighborhood”, proposed by Jung et al. [35]. This concept ensures that each data point has a centroid within a small constant factor of its neighborhood radius. The neighborhood radius is defined as the minimum radius of a ball centered at each data point that includes at least $n/k$ data points, where $n$ is the total number of data points. Several studies [48, 40] have made significant improvements in clustering utility and yielded tighter theoretical bounds based on this individual fairness concept. Mahabadi and Vakilian [40] introduced a local search method for $k$-clustering that notably surpasses [35] in effectiveness. Negahbani and Chakrabarty [48] proposed leveraging linear programming to develop improved algorithms for individually fair $k$-clustering, both theoretically and practically.

However, the fairness definition shared by these methods has a common weakness. To illustrate this, consider the scenario in Fig. 2: point B resides in a sparsely populated area, and its neighborhood radius is larger than that of point A, which is situated in a densely populated area. This tends to result in opening more facilities in densely populated areas while only opening a few facilities in sparsely populated areas [48]. Within the same radius, individual A in a densely populated area has more opportunities to choose facilities, while individual B in a sparsely populated area may have only a single facility available. Moreover, densely populated areas attract more individuals due to abundant resources. To meet the needs of these individuals, additional facilities must be opened. As a result, the development of sparse areas continually lags behind. This phenomenon, known as the Matthew Effect [44, 10], is a sociological concept describing how the distribution of resources, wealth, and opportunities tends to favor individuals who already possess them.

Moreover, existing individually fair clustering methods suffer from an efficiency issue: their running time depends heavily on the dataset size. The best known theoretical result gives a time complexity of $O(kn^4)$ [48]. Based on data from the U.S. Census, the population of New York City was 8.468 million in 2021 [4]. Due to the high time complexity of existing algorithms, no individually fair clustering algorithm can effectively perform clustering analysis on such a large-scale dataset. Moreover, as the dataset size increases, existing methods suffer from RAM overflow since they compute the distance between each pair of data points, which requires storing an $n \times n$ array in memory. Additionally, in the clustering results obtained by these algorithms, each centroid must be selected from the data points, which is often unreasonable in real-world applications.
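To make the memory argument concrete, the following sketch estimates the size of a dense $n \times n$ distance matrix. It assumes 8-byte floating-point entries and no exploitation of symmetry; the function name and these storage assumptions are illustrative and are not taken from the methods cited above.

```python
def pairwise_matrix_bytes(n: int, bytes_per_entry: int = 8) -> int:
    """Bytes needed to store a dense n x n distance matrix.

    Assumes one fixed-size entry per ordered pair (no symmetry savings).
    """
    return n * n * bytes_per_entry


if __name__ == "__main__":
    n = 8_468_000  # New York City population in 2021, as cited above
    size = pairwise_matrix_bytes(n)
    print(f"{size / 1e12:.1f} TB")
```

Under these assumptions, a dataset of this size would need several hundred terabytes for the distance matrix alone, far beyond the RAM of a typical machine.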

Figure 2: In sparse areas, the neighborhood radius for point B is larger than the neighborhood radius for point A in dense areas. Within the same radius, individual A can access more facilities.

Exponential tilting is a widely used technique to induce parametric shifts in distributions in various disciplines, including statistics [16, 56, 59], probability [25], information theory [43, 11], and optimization [51, 55]. Li et al. [37] first brought exponential tilting to machine learning by proposing tilted empirical risk minimization (TERM), which promotes fairness in empirical risk minimization. The flexibility of TERM lies in its ability to adjust the impact of individual losses using a scale parameter, which makes it possible to tune the influence of minority data points as required [63]. TERM provided examples of exponential tilting in supervised learning, such as linear regression and logistic regression. However, the practical application of exponential tilting to unsupervised learning, especially clustering, remains an open problem. Furthermore, some of the theoretical analysis of TERM relies on the assumption that the objective function follows a generalized linear model, which does not hold for clustering algorithms.

We aim to utilize the ability of exponential tilting to induce parametric shifts in distributions to ensure individual fairness in clustering analysis. Building on this concept, we propose a novel loss function, tilted SSE, for the individually fair $k$-means problem, and solve this problem through coordinate descent (CD) and stochastic gradient descent (SGD), ensuring that the centroid in each cluster is closer to minority data points and thus promoting individual fairness. Moreover, we demonstrate that tilted SSE generalizes SSE and recovers it when the scale parameter in TKM is set to 0. Because existing fairness metrics may exacerbate the Matthew Effect in location-based resource allocation scenarios, we propose a novel criterion for evaluating fairness within clusters, based on the variance of distances between each data point and its centroid. Our fairness metric aims to treat each individual more equitably than existing metrics, thereby mitigating the Matthew Effect.
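As a rough illustration of a variance-based fairness criterion, the sketch below computes, for each cluster, the variance of the squared distances from its points to their nearest centroid. This is one plausible reading of the metric described here, not the paper's formal definition: the use of squared distances, the population (rather than sample) variance, and the function names are all assumptions made for illustration.

```python
from statistics import pvariance


def sq_dist(x, c):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def cluster_variances(points, centroids):
    """Per-cluster population variance of squared distances to the nearest centroid.

    Assumption: each point belongs to the cluster of its nearest centroid,
    with ties broken by the lowest centroid index.
    """
    per_cluster = [[] for _ in centroids]
    for x in points:
        d = [sq_dist(x, c) for c in centroids]
        j = min(range(len(centroids)), key=d.__getitem__)
        per_cluster[j].append(d[j])
    # An empty cluster is reported with variance 0.0 (illustrative choice).
    return [pvariance(v) if v else 0.0 for v in per_cluster]
```

A variance of zero in a cluster means every point in that cluster is equally far from its centroid, which matches the intuition of Figure 1(b).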

Our theoretical analysis comprises five parts: approximation guarantee, convergence analysis, fairness analysis, efficiency analysis, and monotonicity analysis. Our approximation guarantee indicates that the centroids obtained through the well-known $k$-means++ incur a multiplicative error of $O(k \log k)$. We establish the convergence of TKM under mild conditions. Specifically, we demonstrate that the expected tilted SSE is non-increasing with respect to iterations. For fairness analysis, we demonstrate that the variance of distances in each cluster decreases as the hyperparameter $t$ in TKM increases. A smaller variance indicates greater fairness, implying that as $t$ grows, clustering becomes fairer. For efficiency analysis, we demonstrate that the time complexity of TKM is $O(kn)$, similar to that of $k$-means. Note that the time complexity of the state-of-the-art method is $O(kn^4)$ [48], while that of TKM is linear with respect to $n$. Therefore, TKM exhibits a significant advantage in efficiency compared to other methods. For monotonicity analysis, we demonstrate that the tilted SSE monotonically increases with $t$ in a simple case. This property may guide the choice of $t$ for TKM in practical applications.

Our experimental evaluations demonstrate the effectiveness, fairness, efficiency, and convergence of TKM over ten real-world datasets with five measurements. Our experimental findings indicate that TKM outperforms the state-of-the-art methods regarding the trade-off between clustering utility and fairness. Specifically, we use SSE to measure clustering utility. The SSE of TKM is lower than that of the state-of-the-art method and is very close to clustering algorithms that do not consider individual fairness on some datasets. To evaluate fairness, we use not only variance as a metric but also the maximum distance from each sample point to the centroid within each cluster. Our results show that TKM outperforms the state-of-the-art fair clustering methods on both metrics. Moreover, TKM’s performance in efficiency is remarkably impressive. Due to the linear time complexity with dataset size, TKM achieves acceleration of hundreds or even thousands of times compared to other fairness-aware clustering methods. Furthermore, TKM can overcome the RAM overflow issues in other methods when dealing with large-scale data. Additionally, we validate the impact of different hyperparameters in TKM on its convergence.

Our contributions are summarized as follows:

  • We incorporate exponential tilting into SSE to propose a novel method for individually fair $k$-means: TKM.

  • We theoretically analyze TKM’s approximation guarantee, convergence, fairness, efficiency, and monotonicity.

  • We experimentally validate the effectiveness, fairness, efficiency, and convergence of TKM.

The remaining sections are structured as follows: Section 2 presents the notations used in this paper, Section 3 presents the related work, Section 4 introduces the preliminaries used in our study, Section 5 outlines our proposed method, TKM, Section 6 validates our algorithm through experiments, Section 7 concludes our paper, and Section 8 presents the proofs of our theories.

TABLE I: Summary of notations
Notation | Description
$\mathcal{X} := \{\boldsymbol{x}_i\}_{i=1}^{n}$ | The dataset of $n$ points
$\mathcal{S} := \{\mathcal{S}_j\}_{j=1}^{k}$ | The set of $k$ clusters
$\mathcal{C} := \{\boldsymbol{c}_j\}_{j=1}^{k}$ | The set of $k$ centroids
$\overline{\psi}, \overline{\phi}$ | The SSE, tilted SSE of all clusters
$\psi, \phi$ | The SSE, tilted SSE of each cluster
$\operatorname{m}, \operatorname{Tm}$ | The arithmetic, tilted mean operator
$\eta$ | The learning rate
$E$ | The epoch size

2 Notations

We use different text formatting styles to represent different mathematical concepts: plain letters for scalars, bold letters for vectors, and calligraphic letters for sets. For instance, $k$ represents a scalar, $\boldsymbol{x}$ represents a vector, and $\mathcal{C}$ denotes a set. Without loss of generality, all data points in this paper are represented using vectors. We use $[k]$ to represent the set $\{1, 2, \ldots, k\}$. The symbol $\mathbb{E}$ denotes the expectation of a random variable, and we use “$:=$” to indicate a definition. We use $\mathbb{I}$ to denote the identity matrix and $\|\cdot\|$ to denote the Euclidean norm of a vector. We use the symbol “$\log$” to denote the natural logarithm (base $e$). Table I lists the notations appearing in this paper and their interpretations.

3 Related Work

We provide an overview of previous studies on fair clustering and the application of exponential tilting in various fields and highlight the limitations of these studies.

Fair Clustering.  Fairness in clustering algorithms is typically divided into two categories: group fairness and individual fairness [15, 18, 42, 27]. The goal of group fairness is to achieve clustering of a given set of points with minimal cost while ensuring that all clusters are balanced with respect to certain protected attributes, such as gender or race. Group fairness is not the focus of this paper, so interested readers can refer to [20, 12, 22, 67, 26].

The concept of individual fairness was initially introduced by Dwork et al. [28] in the context of classification, which posits that “similar individuals should be treated equally”. Several studies have explored this definition of individual fairness in clustering, such as [14, 19]. Another widely used and researched concept of individual fairness is referred to as “service in your neighborhood”, which was initially suggested by Jung et al. [35]. This concept aims to ensure that each data point has a centroid within at most a small constant factor of its neighborhood radius, where the neighborhood radius is the minimum radius of a ball centered at the data point $\boldsymbol{x}_i$ that includes at least $n/k$ data points. Subsequently, various methods addressing individually fair $k$-clustering were based on this paradigm [48, 40, 21], along with numerous improved theoretical upper bounds [32, 60]. Mahabadi and Vakilian [40] introduced a local search algorithm for $k$-clustering, which significantly outperforms the method proposed by Jung et al. [35] in terms of clustering utility. Negahbani and Chakrabarty [48] proposed leveraging linear programming techniques to develop improved algorithms for individually fair $k$-clustering, both theoretically and practically. The fairness metric used in these methods can alleviate some of the unfairness in location-based resource allocation scenarios by ensuring that facilities are within a neighborhood radius of each data point. However, this metric might exacerbate the Matthew Effect, as it tends to result in more facilities being opened in densely populated areas while fewer facilities are opened in sparsely populated areas. Moreover, existing individually fair clustering methods share a further issue: their running time is prohibitively high. Specifically, the time complexity of [40] is $O(k^5 n^4)$, and that of [48] is $O(kn^4)$. To address the running time issue, Chhaya et al. [21] proposed a method to reduce the dataset size by constructing a coreset. However, this approach diminishes clustering utility and does not remove the inherent dependence of the computational complexity of existing individually fair clustering methods on the dataset size.
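The neighborhood radius described above can be computed by brute force, as in the following minimal sketch. It assumes that a point counts itself among the $n/k$ points in its ball and that $n/k$ is rounded up; both conventions, and the function name, are assumptions made for illustration rather than details confirmed by the cited works.

```python
import math


def neighborhood_radii(points, k):
    """Brute-force neighborhood radius of every point.

    The radius of a point x is taken to be the smallest r such that the
    closed ball of radius r around x contains at least ceil(n / k) points,
    counting x itself.
    """
    n = len(points)
    m = math.ceil(n / k)
    radii = []
    for x in points:
        # Distances from x to all points, including x itself (distance 0).
        dists = sorted(math.dist(x, y) for y in points)
        radii.append(dists[m - 1])
    return radii


if __name__ == "__main__":
    # Three points in a dense region and one isolated point.
    pts = [(0.0,), (1.0,), (2.0,), (10.0,)]
    print(neighborhood_radii(pts, k=2))
```

In this toy example the isolated point receives a much larger radius than the three points in the dense region, which is the asymmetry behind the Matthew Effect concern. The sketch also touches every pair of points, which hints at why methods built on all pairwise distances scale poorly with $n$.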

Exponential Tilting.  We elucidate the concept of exponential tilting and explore its applications across various disciplines. Let $\mathcal{P} := \{p_\theta\}$ be a set of probability distributions with parameter $\theta$, and let $X$ denote a random variable drawn from the probability distribution $p_\theta$. Then, for any $x \in \mathcal{X}$, the information of $x$ under $\theta$ [24] is defined as

$$f(x, \theta) := -\log p_\theta(X = x). \tag{1}$$

When $\theta$ is not specified, we assume that $X$ is a random variable drawn from the distribution $p(\cdot)$. Then the cumulant generating function of $f(X, \theta)$ [25] is defined as

$$\Delta_X(t, \theta) := \log\Bigl(\mathbb{E}\Bigl[e^{t f(X, \theta)}\Bigr]\Bigr) = \log \sum_{x} p(x)\, p_\theta(x)^{-t}, \tag{2}$$

where $\mathbb{E}[e^{t f(X, \theta)}]$ is commonly referred to as an exponential tilting of the information density and can induce a shift in the probability distribution parameterized by $\theta$. Exponential tilting has been applied in numerous fields, such as statistics [16, 56, 59], applied probability [25], information theory [43, 11], and optimization [51, 55]. Interested readers can refer to [37] for a more detailed introduction. Currently, there are relatively few applications of exponential tilting in machine learning [37, 63, 58]. Li et al. [37] proposed tilted empirical risk minimization (TERM), which allows flexible tuning of individual losses, marking a pioneering move in machine learning. TERM offers several examples of supervised learning, including linear regression and logistic regression, as illustrated in Fig. 3. Recent research has also concentrated on supervised learning, such as the additive model [63] and semantic segmentation [58].
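As a small numerical check of Eq. (2), the sketch below evaluates both sides for a discrete distribution. It takes $f(x, \theta) = -\log p_\theta(x)$, the convention under which the two sides of Eq. (2) agree; the function names and the example distributions are hypothetical and chosen only for illustration.

```python
import math


def cgf_expectation(p, p_theta, t):
    """log E_p[exp(t * f(X, theta))] with f(x, theta) = -log p_theta(x)."""
    total = sum(px * math.exp(t * -math.log(q)) for px, q in zip(p, p_theta))
    return math.log(total)


def cgf_closed_form(p, p_theta, t):
    """log sum_x p(x) * p_theta(x)^(-t), the right-hand side of Eq. (2)."""
    return math.log(sum(px * q ** (-t) for px, q in zip(p, p_theta)))


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]        # sampling distribution p(.)
    p_theta = [0.6, 0.3, 0.1]  # model distribution p_theta(.)
    for t in (0.0, 0.5, 2.0):
        print(t, cgf_expectation(p, p_theta, t), cgf_closed_form(p, p_theta, t))
```

For positive $t$, outcomes with small $p_\theta(x)$ contribute disproportionately to the sum, which is the mechanism that lets a tilt parameter emphasize rare, high-loss points.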

Remarks. 1) Current fairness metrics might exacerbate the Matthew Effect, as they tend to lead to more facilities being opened in densely populated areas while fewer facilities are opened in sparsely populated areas. 2) The efficiency of existing individually fair $k$-clustering algorithms heavily depends on the number of samples $n$ in the dataset. 3) Existing individually fair clustering algorithms cannot flexibly tune the trade-off between utility and fairness. Moreover, these algorithms require each cluster centroid to be one of the data points. 4) The current application of exponential tilting is still limited to supervised learning, and it has not been applied in unsupervised learning, especially in clustering.

Figure 3: Two examples of TERM [37]. Increasing the parameter $t$ can magnify the impact of minority points on the models.

4 Preliminaries

We begin by introducing the definition of $k$-means. Then, we present the well-known $k$-means++ initialization method.

4.1 $k$-means

$k$-means is a widely used clustering algorithm designed to partition a dataset into $k$ distinct clusters based on similarities among data points. Given a dataset $\mathcal{X} := \{\boldsymbol{x}_i\}_{i=1}^{n}$ of $n$ points, $k$-means aims to find a set $\mathcal{S} := \{\mathcal{S}_j\}_{j=1}^{k}$ of $k$ clusters such that the sum of squared errors (SSE) is minimized,

$$\min_{\mathcal{S}, \mathcal{C}} \Bigl\{ \overline{\psi}(\mathcal{S}, \mathcal{C}) := \sum_{j=1}^{k} \frac{1}{n} \sum_{\boldsymbol{x}_i \in \mathcal{S}_j} f(\boldsymbol{x}_i, \boldsymbol{c}_j) \Bigr\}, \tag{3}$$

where $\mathcal{C} := \{\boldsymbol{c}_j\}_{j=1}^{k}$ is a set of centroids, $\boldsymbol{c}_j$ is the centroid of cluster $\mathcal{S}_j$, and $f(\boldsymbol{x}_i, \boldsymbol{c}_j) := \|\boldsymbol{x}_i - \boldsymbol{c}_j\|^2$ denotes the square of the Euclidean distance from a data point $\boldsymbol{x}_i$ to the centroid $\boldsymbol{c}_j$. The commonly used method for solving $k$-means is the well-known Lloyd’s heuristic [39], which iteratively computes the assignment of each data point and the centroids through coordinate descent. Next, we provide a detailed description of the optimization process of Lloyd’s heuristic. We begin by presenting the equivalent form of Problem (3) as

$$\min_{\mathcal{S}, \mathcal{C}} \Bigl\{ \overline{\psi}(\mathcal{S}, \mathcal{C}) := \sum_{j=1}^{k} \psi(\boldsymbol{\delta}_j, \boldsymbol{c}_j) := \sum_{j=1}^{k} \frac{1}{n} \sum_{i=1}^{n} f(\boldsymbol{x}_i, \boldsymbol{c}_j)\, \delta_{ij} \Bigr\}, \tag{4}$$

where $\delta_{ij}$, $i \in [n]$, $j \in [k]$, denotes the assignment of each data point, that is, $\delta_{ij} = 1$ if $\boldsymbol{x}_i \in \mathcal{S}_j$ and $\delta_{ij} = 0$ otherwise; $\boldsymbol{\delta}_j := (\delta_{1j}, \delta_{2j}, \cdots, \delta_{nj}) \in \mathbb{R}^n$ represents the assignment of data points in the $j$-th cluster; and $\psi(\boldsymbol{\delta}_j, \boldsymbol{c}_j) := \frac{1}{n} \sum_{i=1}^{n} f(\boldsymbol{x}_i, \boldsymbol{c}_j)\, \delta_{ij}$ denotes the SSE in the cluster $\mathcal{S}_j$. To solve Problem (4), one may iteratively assign each point to its nearest centroid and refine $\boldsymbol{c}_j$ using Lloyd’s heuristic. Following initialization, with $\boldsymbol{c}_j$ held constant, the solution for $\delta_{ij}$ can be obtained as

$$\delta_{ij} = \begin{cases} 1, & j = \arg\min_{l} f(\boldsymbol{x}_i, \boldsymbol{c}_l), \\ 0, & \text{otherwise.} \end{cases} \tag{5}$$

When δijsubscript𝛿𝑖𝑗\delta_{ij}italic_δ start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT holds constant, solve for 𝒄jsubscript𝒄𝑗\boldsymbol{c}_{j}bold_italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT as follows:

𝒄j=m(𝒮j)=i=1nδij𝒙ii=1nδij,subscript𝒄𝑗msubscript𝒮𝑗superscriptsubscript𝑖1𝑛subscript𝛿𝑖𝑗subscript𝒙𝑖superscriptsubscript𝑖1𝑛subscript𝛿𝑖𝑗\displaystyle\boldsymbol{c}_{j}=\operatorname{\textup{m}}(\mathcal{S}_{j})=% \frac{\sum_{i=1}^{n}\delta_{ij}\cdot\boldsymbol{x}_{i}}{\sum_{i=1}^{n}\delta_{% ij}},bold_italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = m ( caligraphic_S start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) = divide start_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_δ start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ⋅ bold_italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_δ start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT end_ARG , (6)

where $\operatorname{m}(\cdot)$ is an operator to calculate the weighted mean.
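
The alternation between Equations (5) and (6) can be sketched in a few lines of Python. This is only a minimal illustration, assuming $f$ is the squared Euclidean distance; the function names (`sq_dist`, `assign`, `update`, `lloyd`) and the decision to leave the centroid of an empty cluster unchanged are assumptions made for the example and do not come from the paper.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(x: Point, c: Point) -> float:
    """Squared Euclidean distance, used here as f(x, c)."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def assign(points: List[Point], centroids: List[Point]) -> List[int]:
    """Equation (5): index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda j: sq_dist(x, centroids[j]))
        for x in points
    ]


def update(points: List[Point], labels: List[int],
           centroids: List[Point]) -> List[List[float]]:
    """Equation (6): mean of the points currently assigned to each centroid."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [x for x, lab in zip(points, labels) if lab == j]
        if not members:
            # Assumption for this sketch: an empty cluster keeps its centroid.
            new_centroids.append(list(old))
            continue
        new_centroids.append(
            [sum(x[d] for x in members) / len(members) for d in range(len(old))]
        )
    return new_centroids


def lloyd(points: List[Point], centroids: List[Point],
          max_iter: int = 100) -> Tuple[List[List[float]], List[int]]:
    """Alternate assignment and update until the labels stop changing."""
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return [list(c) for c in centroids], labels
```

Ties in the assignment step are broken in favor of the lowest centroid index, which is one of several reasonable conventions.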

Input: $\mathcal{X} := \{\boldsymbol{x}_i\}_{i=1}^{n}$, $k$.
1 $\mathcal{C} \leftarrow$ Sample a point uniformly from $\mathcal{X}$;
2 $\mathcal{C} \leftarrow$ Sample the next centroid $\boldsymbol{c}_j \in \mathcal{X}$ with probability $\frac{D(\boldsymbol{c}_j)^2}{\sum_{\boldsymbol{c}_j \in \mathcal{X}} D(\boldsymbol{c}_j)^2}$;
3 Repeat Step 2 until $k$ centroids are chosen;
// Coordinate descent.
4 while not converged do
5       Update $\delta_{ij}, i \in [n], j \in [k]$ by Equation (5);
6       Update $\boldsymbol{c}_j, j \in [k]$ by Equation (6);
7 return $\mathcal{C}$.
Algorithm 1 $k$-means++

4.2 $k$-means++

$k$-means++ improves on $k$-means by providing a more effective strategy for selecting the initial centroids, thus enhancing speed and accuracy [8]. We provide the details of $k$-means++ in Algorithm 1. The first centroid is sampled uniformly at random from the dataset (the first step of Algorithm 1). Let $D(\boldsymbol{x}_i)$ be the distance from a data point $\boldsymbol{x}_i$ to the closest centroid already chosen. Each subsequent centroid $\boldsymbol{x}'$ is sampled from the data points with probability $\frac{D(\boldsymbol{x}')^2}{\sum_{\boldsymbol{x}_i\in\mathcal{X}}D(\boldsymbol{x}_i)^2}$, that is, in proportion to its squared distance to the nearest existing centroid (the second step of Algorithm 1). This sampling is repeated until $k$ centroids are chosen (the third step of Algorithm 1). After the $k$ centroids are selected, $\boldsymbol{\delta}_j$ and $\boldsymbol{c}_j$ are updated by coordinate descent, which is identical to Lloyd's heuristic (the while loop of Algorithm 1).
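
The seeding procedure described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `kmeanspp_init`, the explicit `rng` argument, and the uniform fallback used when every remaining point already coincides with a chosen centroid are all assumptions introduced for the example.

```python
import random
from typing import List, Optional, Sequence

Point = Sequence[float]


def sq_dist(x: Point, c: Point) -> float:
    """Squared Euclidean distance between a point and a centroid."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def kmeanspp_init(points: List[Point], k: int,
                  rng: Optional[random.Random] = None) -> List[List[float]]:
    """Choose k initial centroids by D^2-weighted sampling."""
    if rng is None:
        rng = random.Random()
    # First centroid: sampled uniformly from the data.
    centroids = [list(rng.choice(points))]
    # d2[i] holds D(x_i)^2, the squared distance to the closest chosen centroid.
    d2 = [sq_dist(x, centroids[0]) for x in points]
    while len(centroids) < k:
        if sum(d2) == 0.0:
            # Assumption: fall back to uniform sampling when all weights vanish.
            idx = rng.randrange(len(points))
        else:
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
        centroids.append(list(points[idx]))
        # Only the newly added centroid can shrink D(x_i)^2.
        d2 = [min(old, sq_dist(x, centroids[-1])) for old, x in zip(d2, points)]
    return centroids
```

Because a point that has already been chosen has weight zero, it cannot be drawn again as long as some other point has positive weight.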

5 Proposed TKM

In this section, we begin by proposing the objective function of tilted $k$-means (TKM) and presenting the corresponding optimization method. Then, we theoretically analyze the convergence, approximation guarantee, fairness, efficiency, and monotonicity of TKM.

5.1 Objective Function of TKM

Because exponential tilting induces parametric shifts in distributions, we incorporate it into the SSE to obtain the tilted SSE. The objective of tilted $k$-means is to minimize the tilted SSE within each cluster as follows:

$$\min_{\mathcal{S},\mathcal{C}}\Bigl\{\overline{\phi}(t,\mathcal{S},\mathcal{C}) := \sum_{j=1}^{k}\phi(t,\boldsymbol{\delta}_j,\boldsymbol{c}_j) := \sum_{j=1}^{k}\frac{1}{t}\log\frac{1}{n}\sum_{i=1}^{n}e^{tf(\boldsymbol{x}_i,\boldsymbol{c}_j)\delta_{ij}}\Bigr\}, \tag{7}$$

where $t>0$ is a hyperparameter. Note that when $\mathcal{P}$ in (1) is an exponential set of distributions parameterized by $\boldsymbol{c}_j$ and $\boldsymbol{\delta}_j$, the cumulant generating function can be written as:

$$\Delta_X(t,\boldsymbol{\delta}_j,\boldsymbol{c}_j):=\log\frac{1}{n}\sum_{i=1}^{n}e^{tf(\boldsymbol{x}_i,\boldsymbol{c}_j)\delta_{ij}}. \tag{8}$$

Therefore, the objective function of TKM is a scaled sum of cumulant generating functions: each cluster contributes the function in Equation (8) multiplied by $1/t$.
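
As a concrete reading of Equation (7), the sketch below evaluates the tilted SSE for a given assignment. It assumes $f$ is the squared Euclidean distance and encodes $\delta_{ij}$ as a list of cluster labels; the function name `tilted_sse` and the max-shift (log-sum-exp) trick used to avoid overflow for large $t$ are additions for the example and are not taken from the paper.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def sq_dist(x: Point, c: Point) -> float:
    """Squared Euclidean distance, used here as f(x, c)."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def tilted_sse(points: List[Point], labels: List[int],
               centroids: List[Point], t: float) -> float:
    """Equation (7): sum over clusters of (1/t) * log((1/n) * sum_i exp(t * f * delta_ij))."""
    if t <= 0:
        raise ValueError("this sketch covers t > 0 only")
    n = len(points)
    total = 0.0
    for j, c in enumerate(centroids):
        # Exponent t * f(x_i, c_j) * delta_ij; it is 0 for points outside cluster j.
        exps = [t * sq_dist(x, c) if lab == j else 0.0
                for x, lab in zip(points, labels)]
        shift = max(exps)
        # Shifting by the maximum keeps every exp() argument non-positive.
        log_sum = shift + math.log(sum(math.exp(e - shift) for e in exps))
        total += (log_sum - math.log(n)) / t
    return total
```

For very small $t$ the returned value approaches the average squared distance, while for large $t$ each cluster's term is dominated by its farthest point.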

Next, we consider the case $t=0$ in TKM. As $t\to 0$, L'Hôpital's rule gives:

$$\lim_{t\to 0}\overline{\phi}(t,\mathcal{S},\mathcal{C})=\frac{1}{n}\sum_{j=1}^{k}\sum_{i=1}^{n}f(\boldsymbol{x}_i,\boldsymbol{c}_j)\delta_{ij}. \tag{9}$$

Therefore, when $t\to 0$, the tilted SSE reduces to the SSE. Accordingly, we define

$$\phi(0,\boldsymbol{\delta}_j,\boldsymbol{c}_j):=\psi(\boldsymbol{\delta}_j,\boldsymbol{c}_j). \tag{10}$$
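
The limit in Equation (9), which motivates this definition, can be spelled out for a single cluster. In the sketch below, $g_j$ is shorthand introduced only for this illustration; it denotes the cumulant generating function of Equation (8) viewed as a function of $t$.

```latex
% g_j is illustrative shorthand for Equation (8) as a function of t.
\begin{align*}
g_j(t) &:= \log\frac{1}{n}\sum_{i=1}^{n} e^{t f(\boldsymbol{x}_i,\boldsymbol{c}_j)\delta_{ij}},
  \qquad g_j(0) = \log\frac{n}{n} = 0, \\
g_j'(t) &= \frac{\sum_{i=1}^{n} f(\boldsymbol{x}_i,\boldsymbol{c}_j)\delta_{ij}\,
  e^{t f(\boldsymbol{x}_i,\boldsymbol{c}_j)\delta_{ij}}}
  {\sum_{i=1}^{n} e^{t f(\boldsymbol{x}_i,\boldsymbol{c}_j)\delta_{ij}}},
  \qquad g_j'(0) = \frac{1}{n}\sum_{i=1}^{n} f(\boldsymbol{x}_i,\boldsymbol{c}_j)\delta_{ij}, \\
\lim_{t\to 0}\phi(t,\boldsymbol{\delta}_j,\boldsymbol{c}_j)
  &= \lim_{t\to 0}\frac{g_j(t)}{t} = g_j'(0)
  = \frac{1}{n}\sum_{i=1}^{n} f(\boldsymbol{x}_i,\boldsymbol{c}_j)\delta_{ij}.
\end{align*}
```

Since $g_j(0)=0$, the ratio $g_j(t)/t$ is a $0/0$ form and L'Hôpital's rule applies; summing the last line over $j=1,\dots,k$ recovers the right-hand side of Equation (9).
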
Input: $\mathcal{X} := \{\boldsymbol{x}_i\}_{i=1}^{n}$, $k$: # of clusters, $E$: # of epochs.
1 Initialize $\mathcal{C} := \{\boldsymbol{c}_j\}_{j=1}^{k}$ by $k$-means++;
2 while not converged do
       /* Assignment. */
3       Update $\delta_{ij}, i \in [n], j \in [k]$ by (5);
       /* Refinement. */
4       for $e = 1, \cdots, E$ do
5             Sample a mini-batch $\mathcal{B}$ from $\mathcal{X}$;
6             Update $\boldsymbol{c}_j, j \in [k]$ by (13);
7 return $\mathcal{C}$.
Algorithm 2 Solving tilted $k$-means via SGD

5.2 Solving Tilted $k$-means

Since Problem (7) involves a highly non-convex objective function with multi-block variables, we use coordinate descent (CD) to solve it. We begin by fixing $\boldsymbol{c}_j$ and solving for $\delta_{ij}$. Because the objective function is monotonically increasing in $tf(\boldsymbol{x}_i,\boldsymbol{c}_j)$, the solution for $\delta_{ij}$ is identical to that of Equation (5). Next, we fix $\delta_{ij}$ and solve for $\mathcal{C}$. Since the tilted SSE is convex with respect to $\boldsymbol{c}_j$ (this property will be proven in Section 5.3), we can derive the optimality condition for the tilted SSE with respect to $\boldsymbol{c}_j$. The first-order gradient of $\overline{\phi}(t,\mathcal{S},\mathcal{C})$ with respect to $\boldsymbol{c}_j$ is as follows:

$$\begin{split}\nabla_{\boldsymbol{c}_j}\overline{\phi}(t,\mathcal{S},\mathcal{C})=&\ \nabla_{\boldsymbol{c}_j}\phi(t,\boldsymbol{\delta}_j,\boldsymbol{c}_j)\\ =&\ \frac{\sum_{\boldsymbol{x}_i\in\mathcal{S}_j}e^{tf(\boldsymbol{x}_i,\boldsymbol{c}_j)}\cdot\nabla_{\boldsymbol{c}_j}f(\boldsymbol{x}_i,\boldsymbol{c}_j)}{\sum_{i=1}^{n}e^{tf(\boldsymbol{x}_i,\boldsymbol{c}_j)\delta_{ij}}},\end{split} \tag{11}$$

where $\nabla_{\boldsymbol{c}_j}f(\boldsymbol{x}_i,\boldsymbol{c}_j):=2(\boldsymbol{c}_j-\boldsymbol{x}_i)$ is the first-order gradient of $f(\boldsymbol{x}_i,\boldsymbol{c}_j)=\|\boldsymbol{x}_i-\boldsymbol{c}_j\|^2$ with respect to $\boldsymbol{c}_j$. Setting Equation (11) equal to zero yields the optimality condition for $\boldsymbol{c}_j$:

$$\sum_{\boldsymbol{x}_i\in\mathcal{S}_j}e^{t\|\boldsymbol{x}_i-\boldsymbol{c}_j\|^2}(\boldsymbol{x}_i-\boldsymbol{c}_j)=0. \tag{12}$$

We define a tilted mean operator $\operatorname{Tm}(\cdot)$, where $\boldsymbol{c}_j=\operatorname{Tm}(t,\mathcal{S}_j)$ denotes the value of $\boldsymbol{c}_j$ that satisfies Equation (12). Obtaining a closed-form solution for $\boldsymbol{c}_j$ from Equation (12) is nontrivial; therefore, we employ a first-order gradient method to solve for $\boldsymbol{c}_j$. Let $\mathcal{B}$ be a mini-batch sampled from $\mathcal{X}$; then $\boldsymbol{c}_j$ is updated as follows:

\begin{align}
\boldsymbol{c}_j &\leftarrow \boldsymbol{c}_j-\eta\cdot\nabla_{\boldsymbol{c}_j}\overline{\phi}(t,\mathcal{B},\mathcal{C}), \tag{13}\\
\nabla_{\boldsymbol{c}_j}\overline{\phi}(t,\mathcal{B},\mathcal{C}) &= \frac{\sum_{\boldsymbol{x}_i\in\mathcal{B}_j}e^{tf(\boldsymbol{x}_i,\boldsymbol{c}_j)}\cdot\nabla_{\boldsymbol{c}_j}f(\boldsymbol{x}_i,\boldsymbol{c}_j)}{\sum_{i=1}^{n}e^{tf(\boldsymbol{x}_i,\boldsymbol{c}_j)\delta_{ij}}}, \tag{14}
\end{align}

where $\eta$ is the learning rate and $\mathcal{B}_j:=\mathcal{S}_j\cap\mathcal{B}$. The first-order gradient method is commonly used for problems of this kind. Interested readers may also try second-order methods such as Newton's method [49] to solve for $\boldsymbol{c}_j$.
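
The update in Equations (13) and (14) can be sketched as follows. The example assumes $f$ is the squared Euclidean distance, so that its gradient with respect to the centroid is $2(\boldsymbol{c}_j-\boldsymbol{x}_i)$. The names `tilted_grad` and `tilted_mean`, the step size and iteration count, and the max-shift used for numerical stability are choices made for this illustration; `tilted_mean` uses full-batch steps for simplicity.

```python
import math
from typing import Iterable, List, Optional, Sequence

Point = Sequence[float]


def sq_dist(x: Point, c: Point) -> float:
    """Squared Euclidean distance, used here as f(x, c)."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def tilted_grad(points: List[Point], labels: List[int], j: int, c: Point,
                t: float, batch: Optional[Iterable[int]] = None) -> List[float]:
    """Gradient of Equation (11), or Equation (14) when `batch` holds point indices."""
    # Exponents t * f * delta_ij over all n points (the denominator's terms).
    exps = [t * sq_dist(x, c) if lab == j else 0.0
            for x, lab in zip(points, labels)]
    shift = max(exps)
    denom = sum(math.exp(e - shift) for e in exps)
    indices = range(len(points)) if batch is None else batch
    grad = [0.0] * len(c)
    for i in indices:
        if labels[i] != j:
            continue  # only points in B ∩ S_j enter the numerator
        w = math.exp(exps[i] - shift) / denom
        for d in range(len(c)):
            grad[d] += w * 2.0 * (c[d] - points[i][d])
    return grad


def tilted_mean(points: List[Point], labels: List[int], j: int, c0: Point,
                t: float, lr: float = 0.1, steps: int = 500) -> List[float]:
    """Approximate Tm(t, S_j) by repeating the update of Equation (13)."""
    c = list(c0)
    for _ in range(steps):
        g = tilted_grad(points, labels, j, c, t)
        c = [cd - lr * gd for cd, gd in zip(c, g)]
    return c
```

With $t=0$ the weights are uniform and the iteration converges to the ordinary cluster mean; for $t>0$ distant points receive larger weights, which pulls the centroid toward them.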

Algorithm Description.  The algorithmic process of TKM can be summarized in three parts: initialization, assignment, and refinement. We provide the details of TKM in Algorithm 2 and an example in Fig. 4. First, the centroid set $\mathcal{C}$ is initialized using $k$-means++ (the first line of Algorithm 2). Subsequently, we employ CD to iteratively solve for $\delta_{ij}$ (assignment) and $\boldsymbol{c}_j$ (refinement) (the while loop of Algorithm 2). We run $E$ epochs to solve for $\boldsymbol{c}_j$; in each epoch, a mini-batch $\mathcal{B}$ is sampled from $\mathcal{X}$, and the data points in $\mathcal{B}\cap\mathcal{S}_j$ are used to update $\boldsymbol{c}_j$ by Equation (13).

Figure 4: An example of TKM includes the stages of initialization, assignment, and refinement.
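
The three stages (initialization, assignment, refinement) can be combined into a compact sketch of the whole loop. This is an illustrative reading of Algorithm 2 under stated assumptions, not the authors' code: $f$ is taken to be the squared Euclidean distance, convergence is declared when the assignment stops changing, and the function names, the optional `init` argument, and the default values of `epochs`, `batch_size`, `lr`, and `seed` are all hypothetical.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

Point = Sequence[float]


def sq_dist(x: Point, c: Point) -> float:
    return sum((a - b) ** 2 for a, b in zip(x, c))


def kmeanspp(points: List[Point], k: int, rng: random.Random) -> List[List[float]]:
    """Initialization: D^2-weighted seeding as in Algorithm 1."""
    cents = [list(rng.choice(points))]
    while len(cents) < k:
        d2 = [min(sq_dist(x, c) for c in cents) for x in points]
        if sum(d2) == 0.0:  # assumption: uniform fallback for degenerate data
            cents.append(list(rng.choice(points)))
        else:
            cents.append(list(rng.choices(points, weights=d2, k=1)[0]))
    return cents


def assign(points: List[Point], cents: List[Point]) -> List[int]:
    """Assignment: nearest centroid, Equation (5)."""
    return [min(range(len(cents)), key=lambda j: sq_dist(x, cents[j]))
            for x in points]


def refine_step(points, labels, cents, batch, t, lr):
    """Refinement: one mini-batch update of every centroid, Equations (13)-(14)."""
    new_cents = []
    for j, c in enumerate(cents):
        exps = [t * sq_dist(x, c) if lab == j else 0.0
                for x, lab in zip(points, labels)]
        shift = max(exps)  # max-shift for numerical stability
        denom = sum(math.exp(e - shift) for e in exps)
        grad = [0.0] * len(c)
        for i in batch:
            if labels[i] == j:
                w = math.exp(exps[i] - shift) / denom
                for d in range(len(c)):
                    grad[d] += w * 2.0 * (c[d] - points[i][d])
        new_cents.append([cd - lr * gd for cd, gd in zip(c, grad)])
    return new_cents


def tkm(points: List[Point], k: int, t: float, epochs: int = 50,
        batch_size: int = 4, lr: float = 0.2, max_iter: int = 100,
        init: Optional[List[Point]] = None,
        seed: int = 0) -> Tuple[List[List[float]], List[int]]:
    rng = random.Random(seed)
    cents = [list(c) for c in init] if init is not None else kmeanspp(points, k, rng)
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, cents)
        if new_labels == labels:  # assignment unchanged: treat as converged
            break
        labels = new_labels
        for _ in range(epochs):
            batch = rng.sample(range(len(points)), min(batch_size, len(points)))
            cents = refine_step(points, labels, cents, batch, t, lr)
    return cents, assign(points, cents)
```

With `lr` at most 0.5, each refinement step is a convex combination of the current centroid and the sampled cluster members, so a centroid that starts inside the convex hull of its cluster stays there.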

5.3 Theoretical Analysis

Our theoretical analysis consists of five parts. The first part provides an approximation guarantee for the initial centroids obtained by $k$-means++ with respect to the tilted SSE. Then we present a convergence analysis of TKM. Next, we delve into a fairness analysis of TKM. In the fourth part, we explore the time complexity of TKM. Finally, we analyze the monotonicity of the tilted SSE using a simple case.

5.3.1 Definitions and Assumptions

We begin by providing the definitions and assumptions used throughout our analysis.

Definition 1 (Tilted weight).

Given a cluster $\mathcal{S}_j$ and a centroid $\boldsymbol{c}_j$, the tilted weight $w_i(t,\mathcal{S}_j,\boldsymbol{c}_j)$ of a data point $\boldsymbol{x}_i$ is defined as

$$w_i(t,\mathcal{S}_j,\boldsymbol{c}_j):=\frac{e^{tf(\boldsymbol{x}_i,\boldsymbol{c}_j)}}{\sum_{\boldsymbol{x}_i\in\mathcal{S}_j}e^{tf(\boldsymbol{x}_i,\boldsymbol{c}_j)}}. \tag{15}$$
Definition 2 (Tilted empirical mean and variance).

Let $\mathrm{f}(\mathcal{S}_j,\boldsymbol{c}_j):=\bigl\{f(\boldsymbol{x}_i,\boldsymbol{c}_j)\,|\,\boldsymbol{x}_i\in\mathcal{S}_j\bigr\}$ be the set of squared Euclidean distances from the points in $\mathcal{S}_j$ to the centroid $\boldsymbol{c}_j$; then the tilted empirical mean and variance in the cluster $\mathcal{S}_j$ are defined as

\begin{align}
\mathbb{E}_t\bigl(\mathrm{f}(\mathcal{S}_j,\boldsymbol{c}_j)\bigr) &:= \sum_{\boldsymbol{x}_i\in\mathcal{S}_j}w_i(t,\mathcal{S}_j,\boldsymbol{c}_j)\cdot f(\boldsymbol{x}_i,\boldsymbol{c}_j), \tag{16}\\
\mathrm{Var}_t\bigl(\mathrm{f}(\mathcal{S}_j,\boldsymbol{c}_j)\bigr) &:= \mathbb{E}_t\Bigl(f(\boldsymbol{x}_i,\boldsymbol{c}_j)-\mathbb{E}_t\bigl(\mathrm{f}(\mathcal{S}_j,\boldsymbol{c}_j)\bigr)\Bigr)^2. \tag{17}
\end{align}

Note that when $t=0$, the tilted empirical mean and variance reduce to the standard mean and variance in statistics.
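
Definitions 1 and 2 translate directly into code. The sketch below takes the squared distances $f(\boldsymbol{x}_i,\boldsymbol{c}_j)$ of one cluster as a plain list; the function names `tilted_weights` and `tilted_mean_var`, and the max-shift applied before exponentiation, are assumptions made for the example.

```python
import math
from typing import List, Tuple


def tilted_weights(dists: List[float], t: float) -> List[float]:
    """Equation (15): weights proportional to exp(t * f) within one cluster."""
    exps = [t * f for f in dists]
    shift = max(exps)  # subtracting the maximum avoids overflow for large t
    unnorm = [math.exp(e - shift) for e in exps]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def tilted_mean_var(dists: List[float], t: float) -> Tuple[float, float]:
    """Equations (16) and (17): tilted empirical mean and variance."""
    w = tilted_weights(dists, t)
    mean = sum(wi * f for wi, f in zip(w, dists))
    var = sum(wi * (f - mean) ** 2 for wi, f in zip(w, dists))
    return mean, var
```

At $t=0$ every weight equals $1/|\mathcal{S}_j|$, so the functions return the ordinary mean and (population) variance; as $t$ grows, the weight concentrates on the largest distance.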

Definition 3 (Gradient Lipschitz Continuity).

The objective function $f:\mathbb{R}^d\rightarrow\mathbb{R}$ is continuously differentiable, and its gradient $\nabla f:\mathbb{R}^d\rightarrow\mathbb{R}^d$ is Lipschitz continuous with Lipschitz constant $L>0$, if for any $\boldsymbol{c},\boldsymbol{c}'\in\mathbb{R}^d$, it holds that

$$\|\nabla f(\boldsymbol{c})-\nabla f(\boldsymbol{c}')\|\leq L\|\boldsymbol{c}-\boldsymbol{c}'\|. \tag{18}$$
Definition 4 (Tilted Hessian).

For any $t\geq 0$, we define the tilted Hessian $\nabla^2_{\boldsymbol{c}_j\boldsymbol{c}_j^{\top}}\phi(t,\boldsymbol{\delta}_j,\boldsymbol{c}_j)$ as the Hessian of $\phi(t,\boldsymbol{\delta}_j,\boldsymbol{c}_j)$ with respect to $\boldsymbol{c}_j$. That is,

$$\begin{aligned}
\nabla^2_{\boldsymbol{c}_j\boldsymbol{c}_j^{\top}}\phi(t,\boldsymbol{\delta}_j,\boldsymbol{c}_j)={}&\frac{t}{n}\sum_{i=1}^{n}\bigl(\nabla_{\boldsymbol{c}_j}f(\boldsymbol{x}_i,\boldsymbol{c}_j)\delta_{ij}-\nabla_{\boldsymbol{c}_j}\phi(t,\boldsymbol{\delta}_j,\boldsymbol{c}_j)\bigr)\\
&\quad\cdot\bigl(\nabla_{\boldsymbol{c}_j}f(\boldsymbol{x}_i,\boldsymbol{c}_j)\delta_{ij}-\nabla_{\boldsymbol{c}_j}\phi(t,\boldsymbol{\delta}_j,\boldsymbol{c}_j)\bigr)^{\top}e^{t\bigl(f(\boldsymbol{x}_i,\boldsymbol{c}_j)\delta_{ij}-\phi(t,\boldsymbol{\delta}_j,\boldsymbol{c}_j)\bigr)}\\
&+\frac{1}{n}\sum_{i=1}^{n}e^{t\bigl(f(\boldsymbol{x}_i,\boldsymbol{c}_j)\delta_{ij}-\phi(t,\boldsymbol{\delta}_j,\boldsymbol{c}_j)\bigr)}\cdot 2\mathbb{I}\delta_{ij},
\end{aligned}$$

and 𝕀𝕀\mathbb{I}blackboard_I is an identity matrix of appropriate size.

Lemma 1 (Strong Convexity of Tilted SSE [37]).

For any $t\geq 0$, the tilted SSE is strongly convex with respect to $\boldsymbol{c}_{j}$. That is,

$$
\nabla^{2}_{\boldsymbol{c}_{j}\boldsymbol{c}_{j}^{\top}}\phi(t,\boldsymbol{\delta}_{j},\boldsymbol{c}_{j})\succeq\frac{2|\mathcal{S}_{j}|}{n}\mathbb{I}.
$$

Proof.

Note that the first term in the tilted Hessian is positive semi-definite, and the second term is positive definite and lower bounded by $\frac{2|\mathcal{S}_{j}|}{n}\mathbb{I}$, which completes the proof. ∎

Lemma 2 (Gradient Lipschitz Continuity of Tilted SSE [37]).

For any $t\geq 0$, $\nabla_{\boldsymbol{c}_{j}}\phi(t,\boldsymbol{\delta}_{j},\boldsymbol{c}_{j})$ is $L(t)$-Lipschitz with respect to $\boldsymbol{c}_{j}$, where $L(t):=\sigma_{\max}\bigl(\nabla^{2}_{\boldsymbol{c}_{j}\boldsymbol{c}_{j}^{\top}}\phi(t,\boldsymbol{\delta}_{j},\boldsymbol{c}_{j})\bigr)$, and $\sigma_{\max}$ denotes the largest eigenvalue.
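Both lemmas can be checked numerically on a toy cluster. The sketch below is only an illustration, not the authors' implementation. It assumes $f(\boldsymbol{x}_{i},\boldsymbol{c}_{j})=\|\boldsymbol{x}_{i}-\boldsymbol{c}_{j}\|^{2}$ and the log-sum-exp form $\phi=\frac{1}{t}\log\bigl(\frac{1}{n}\sum_{i}e^{tf(\boldsymbol{x}_{i},\boldsymbol{c}_{j})\delta_{ij}}\bigr)$, which is consistent with the tilted Hessian above but is not restated in this section. The function names and the random data are hypothetical.

```python
import numpy as np

def tilted_weights(X, delta, c, t):
    """Weights e^{t(f_i*delta_i - phi)} / n, computed as a stable softmax."""
    f = np.sum((X - c) ** 2, axis=1)      # assumed f(x_i, c) = ||x_i - c||^2
    z = t * f * delta
    w = np.exp(z - z.max())
    return w / w.sum()

def tilted_hessian(X, delta, c, t):
    """Tilted Hessian w.r.t. c, built from the two terms of the formula."""
    n, d = X.shape
    w = tilted_weights(X, delta, c, t)
    G = 2.0 * (c - X) * delta[:, None]    # rows: grad_c f(x_i, c) * delta_i
    grad_phi = w @ G                      # tilted gradient
    D = G - grad_phi                      # per-point gradient minus tilted gradient
    first = t * (D * w[:, None]).T @ D    # positive semi-definite term
    second = 2.0 * np.sum(w * delta) * np.eye(d)
    return first + second

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
delta = (rng.random(50) < 0.4).astype(float)   # membership indicator of cluster j
delta[0] = 1.0                                 # keep the cluster non-empty
c = X[delta == 1].mean(axis=0)

H = tilted_hessian(X, delta, c, t=0.5)
eigs = np.linalg.eigvalsh(H)
strong_convexity_bound = 2.0 * delta.sum() / len(X)   # 2|S_j|/n from Lemma 1
L_t = eigs.max()                                      # L(t) from Lemma 2 at this c
```

Here `eigs.min()` should not fall below `strong_convexity_bound`, and `L_t` is the local value of the Lipschitz constant at the chosen centroid.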

Assumption 1.

Let $g(\mathcal{B},\boldsymbol{c}_{j}):=\overline{\phi}(t,\mathcal{B},\mathcal{C})$ denote the mini-batch gradient of $\overline{\phi}(t,\mathcal{S},\mathcal{C})$. Then the following conditions hold:

  • There exist scalars $\mu_{G}\geq\mu>0$ such that for any $\boldsymbol{c}_{j}\in\mathbb{R}^{d}$,

    $$
    \begin{aligned}
    \nabla_{\boldsymbol{c}_{j}}\phi(t,\boldsymbol{\delta}_{j},\boldsymbol{c}_{j})^{\top}\mathbb{E}[g(\mathcal{B},\boldsymbol{c}_{j})] &\geq \mu\cdot\|\nabla_{\boldsymbol{c}_{j}}\phi(t,\boldsymbol{\delta}_{j},\boldsymbol{c}_{j})\|^{2}, \\
    \|\mathbb{E}[g(\mathcal{B},\boldsymbol{c}_{j})]\| &\leq \mu_{G}\cdot\|\nabla_{\boldsymbol{c}_{j}}\phi(t,\boldsymbol{\delta}_{j},\boldsymbol{c}_{j})\|.
    \end{aligned}
    $$
  • There exist scalars $\nu\geq 0$ and $\nu_{H}\geq 0$ such that for any $\boldsymbol{c}_{j}\in\mathbb{R}^{d}$, it holds that

    $$
    \mathbb{E}\bigl[\|g(\mathcal{B},\boldsymbol{c}_{j})-\mathbb{E}[g(\mathcal{B},\boldsymbol{c}_{j})]\|^{2}\bigr]\leq\nu+\nu_{H}\cdot\|\nabla_{\boldsymbol{c}_{j}}\phi(t,\boldsymbol{\delta}_{j},\boldsymbol{c}_{j})\|^{2}.
    $$

The first requirement in Assumption 1 states that, in expectation, the vector $-g(\mathcal{B},\boldsymbol{c}_{j})$ is a direction of sufficient descent for $\phi$ from $\boldsymbol{c}_{j}$, with a norm comparable to the norm of the gradient. The second requirement states that the variance of $g(\mathcal{B},\boldsymbol{c}_{j})$ is restricted, but in a relatively minor manner.
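As a sanity check on the first requirement, consider the idealized case in which the mini-batch gradient is unbiased. This is an assumption made here purely for illustration, since mini-batch gradients of a log-sum-exp objective are in general biased. In that case:

```latex
\mathbb{E}[g(\mathcal{B},\boldsymbol{c}_{j})]
  = \nabla_{\boldsymbol{c}_{j}}\phi(t,\boldsymbol{\delta}_{j},\boldsymbol{c}_{j})
\;\Longrightarrow\;
\nabla_{\boldsymbol{c}_{j}}\phi^{\top}\,\mathbb{E}[g(\mathcal{B},\boldsymbol{c}_{j})]
  = \|\nabla_{\boldsymbol{c}_{j}}\phi\|^{2},
\qquad
\|\mathbb{E}[g(\mathcal{B},\boldsymbol{c}_{j})]\|
  = \|\nabla_{\boldsymbol{c}_{j}}\phi\|.
```

Both inequalities then hold with $\mu=\mu_{G}=1$. Allowing $\mu<1$ or $\mu_{G}>1$ relaxes this to biased estimators whose expectation still has a positive inner product with the true gradient.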

5.3.2 Approximation Guarantee

Let $\overline{\phi}^{*}$ denote the optimal value of $\overline{\phi}$. We aim to prove that the initial centroid set $\mathcal{C}$ produced by $k$-means++ satisfies $\mathbb{E}[\overline{\phi}(t,\mathcal{S},\mathcal{C})]\leq\alpha\cdot\overline{\phi}^{*}$, where $\alpha$ is a multiplicative error. Next, we derive the value of $\alpha$.

Theorem 1.

Let $\overline{\phi}^{*}$ be the optimal value of the tilted SSE, and let $\overline{\psi}^{\star}$ be the optimal value of the SSE. Then, for any dataset $\mathcal{X}$, centroid set $\mathcal{C}$ initialized by $k$-means++, and induced clusters $\mathcal{S}$, it holds that

$$
\mathbb{E}[\overline{\phi}(t,\mathcal{S},\mathcal{C})]\leq O(k\log k)\cdot\overline{\psi}^{\star}\leq O(k\log k)\cdot\overline{\phi}^{*}. \tag{19}
$$

The proof of Theorem 1 can be found in Section 8.1. $k$-means++ has been proven to generate initial centroids with a multiplicative error of $O(\log k)$ in $k$-means when fairness constraints are not considered [8]. Theorem 1 demonstrates that, with individual fairness constraints, $k$-means++ achieves a multiplicative error of $O(k\log k)$.
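For reference, the seeding step that Theorem 1 analyzes can be sketched as follows. This is the standard $D^{2}$-sampling procedure of $k$-means++ written as a minimal example. The function name `kmeans_pp_init` and the random data are hypothetical, and the repository linked in the paper may organize this step differently.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """D^2 sampling: each new centroid is drawn with probability proportional
    to its squared distance to the nearest centroid chosen so far."""
    n = X.shape[0]
    first = rng.integers(n)                       # first centroid: uniform draw
    centers = [X[first]]
    d2 = np.sum((X - X[first]) ** 2, axis=1)      # squared distance to nearest centroid
    for _ in range(1, k):
        total = d2.sum()
        # Fall back to uniform sampling if every point coincides with a centroid.
        probs = d2 / total if total > 0 else np.full(n, 1.0 / n)
        idx = rng.choice(n, p=probs)
        centers.append(X[idx])
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.vstack(centers)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
C = kmeans_pp_init(X, k=4, rng=rng)
```

A point that is already a centroid has zero squared distance and therefore zero probability of being drawn again, so the returned centroids are distinct whenever the data points are.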

5.3.3 Convergence Analysis

Next, we provide the convergence analysis of TKM by proving that the assignment and refinement steps ensure that the expected value of the tilted SSE decreases.

Theorem 2.

Let $\mathcal{S}^{it}$, $\mathcal{C}^{it}$ and $\mathcal{S}^{it+1}$, $\mathcal{C}^{it+1}$ be the solutions in the $it$-th and $(it+1)$-th iterations of TKM. Under Assumption 1 and by choosing the learning rate $\eta<\frac{2}{\mu\cdot L(t)}$, it holds that

$$
\mathbb{E}[\overline{\phi}(t,\mathcal{S}^{it+1},\mathcal{C}^{it+1})]\leq\overline{\phi}(t,\mathcal{S}^{it},\mathcal{C}^{it}). \tag{20}
$$

The proof of Theorem 2 is provided in Section 8.2. Theorem 2 demonstrates that with the selection of an appropriate learning rate, the expected value of the tilted SSE can decrease until reaching convergence.
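The role of the learning-rate condition can be seen from a generic descent-lemma argument. The derivation below is a standard sketch for SGD under conditions of the form in Assumption 1 and Lemma 2. It is not the proof in Section 8.2, it treats $L(t)$ as a Lipschitz constant valid along the update, and its constants need not match those of Theorem 2. Write $\phi(\boldsymbol{c})$ for $\phi(t,\boldsymbol{\delta}_{j},\boldsymbol{c}_{j})$, $g$ for $g(\mathcal{B},\boldsymbol{c}_{j})$, and $\boldsymbol{c}^{+}=\boldsymbol{c}-\eta g$ for one refinement step.

```latex
\begin{aligned}
\phi(\boldsymbol{c}^{+})
  &\leq \phi(\boldsymbol{c}) - \eta\,\nabla\phi(\boldsymbol{c})^{\top} g
        + \tfrac{L(t)}{2}\,\eta^{2}\,\|g\|^{2}, \\
\mathbb{E}\|g\|^{2}
  &= \|\mathbb{E}[g]\|^{2} + \mathbb{E}\|g-\mathbb{E}[g]\|^{2}
   \leq (\mu_{G}^{2}+\nu_{H})\,\|\nabla\phi(\boldsymbol{c})\|^{2} + \nu, \\
\mathbb{E}[\phi(\boldsymbol{c}^{+})] - \phi(\boldsymbol{c})
  &\leq -\eta\Bigl(\mu - \tfrac{L(t)\,\eta}{2}\,(\mu_{G}^{2}+\nu_{H})\Bigr)
        \|\nabla\phi(\boldsymbol{c})\|^{2}
        + \tfrac{L(t)\,\eta^{2}}{2}\,\nu .
\end{aligned}
```

The first term on the right-hand side of the last line is negative once $\eta$ is small relative to $1/L(t)$, which is the kind of requirement Theorem 2 places on the learning rate. The $\nu$ term captures the residual mini-batch noise.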

5.3.4 Fairness Analysis

We propose using the variance of each data point’s squared distance to the centroid within each cluster to measure the fairness of clustering algorithms. Note that when $t=0$ in the tilted weight, the tilted empirical variance reduces to the standard variance. We employ variance as a measure of fairness because it quantifies the extent to which sample points in a dataset are distributed around the mean, with smaller variance indicating reduced fluctuation in distances from the mean and thus greater fairness. Next, we consider the monotonicity of the tilted empirical variance with respect to $t$.

Theorem 3.

For any cluster $\mathcal{S}_{j}$, any corresponding centroid $\boldsymbol{c}_{j}(t)=\operatorname{Tm}(t,\mathcal{S}_{j})$, and any $t\geq 0$, suppose all data points are normalized to a unit norm. Then it holds that

$$
\frac{\partial}{\partial t}\Bigl\{\mathrm{Var}_{\tau}\Bigl(\mathrm{f}(\mathcal{S}_{j},\boldsymbol{c}_{j}(t))\Bigr)\Bigr\}<0. \tag{21}
$$

The proof of Theorem 3 is provided in Section 8.3. Note that $\tau$ is a constant in the calculation of the tilted empirical variance, where it contributes to the tilted weight adjustment. Theorem 3 states that the $\tau$-tilted empirical variance of the distances between the data points in $\mathcal{S}_{j}$ and their corresponding centroid decreases as $t$ increases. Therefore, there exists a potential trade-off between SSE and variance, enabling solutions to flexibly achieve desirable clustering utility and fairness. Although Theorem 3 assumes that all data points are normalized to a unit norm, which does not hold for some datasets, we observe favorable numerical results that motivate extending these results beyond the cases studied theoretically in this paper.
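The effect described by Theorem 3 can be illustrated on a one-dimensional toy cluster with unit-norm points. The sketch below rests on several assumptions that are not taken from the paper: it minimizes the single-cluster objective $\frac{1}{t}\log\bigl(\frac{1}{n}\sum_{i}e^{t\|\boldsymbol{x}_{i}-\boldsymbol{c}\|^{2}}\bigr)$ by plain gradient descent as a stand-in for $\operatorname{Tm}(t,\mathcal{S}_{j})$, it uses a conservative step size, and it reports the standard variance rather than the $\tau$-tilted variance. All function names are hypothetical.

```python
import numpy as np

def tilted_centroid(X, t, steps=500):
    """Minimize (1/t) * log(mean(exp(t * ||x_i - c||^2))) over c."""
    # Step size 1 / (2 + 4 t diam^2): a conservative upper bound on the
    # largest Hessian eigenvalue (an assumption of this sketch).
    diam2 = np.max(np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
    eta = 1.0 / (2.0 + 4.0 * t * diam2)
    c = X.mean(axis=0)                       # start from the ordinary mean
    for _ in range(steps):
        z = t * np.sum((X - c) ** 2, axis=1)
        w = np.exp(z - z.max())
        w /= w.sum()                         # tilted weights, summing to one
        c = c - eta * 2.0 * (c - w @ X)      # gradient is 2 (c - sum_i w_i x_i)
    return c

def sq_dist_variance(X, c):
    """Standard (population) variance of squared distances to c."""
    return float(np.var(np.sum((X - c) ** 2, axis=1)))

# Nine majority points at -1 and one minority point at +1 (all unit norm).
X = np.array([[-1.0]] * 9 + [[1.0]])
c_small = tilted_centroid(X, t=0.1)
c_large = tilted_centroid(X, t=1.0)
var_small = sq_dist_variance(X, c_small)
var_large = sq_dist_variance(X, c_large)
```

In this example the centroid for the larger $t$ lies closer to the minority point, and the variance of the squared distances is smaller, which matches the qualitative behavior the theorem describes.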

5.3.5 Time Complexity

We provide the time complexity of TKM and analyze why TKM is suitable for individually fair clustering analysis in big data scenarios.

Theorem 4.

The time complexity of TKM is $O(nkdET)$, where $d$ is the number of attributes of each data point, $E$ is the epoch size, and $T$ is the total number of iterations.

The proof of Theorem 4 is provided in Section 8.4. Note that the time complexity of TKM is linear in the dataset size, the same as that of vanilla $k$-means algorithms without fairness constraints, such as Lloyd’s heuristic [39] and SGD-based $k$-means [52]. In contrast, existing individually fair clustering methods exhibit a time complexity of $O(kn^{4})$ [40, 48]. In the context of big data, employing these methods for clustering becomes impractical, as the required running time becomes difficult to estimate when the dataset size reaches the order of millions. Moreover, these methods encounter RAM overflow issues because they must compute the distance between every pair of data points, which requires storing an $n\times n$ array in RAM. Conversely, TKM only needs the distances between each data point and the centroids during the assignment step, and therefore computes only an $n\times k$ array, which effectively mitigates the risk of RAM overflow.
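The memory argument can be made concrete with a short calculation. The sketch below assumes dense storage with 8 bytes per entry (float64), which the paper does not state, and the helper name is hypothetical. The sample sizes are those used in the efficiency experiments of Section 6.

```python
def dense_array_bytes(rows, cols, bytes_per_entry=8):
    """Size in bytes of a dense rows x cols array (8 bytes = float64)."""
    return rows * cols * bytes_per_entry

GB = 10 ** 9
pairwise_80k = dense_array_bytes(80_000, 80_000) / GB      # n x n with n = 80K
pairwise_90k = dense_array_bytes(90_000, 90_000) / GB      # n x n with n = 90K
point_centroid_5m = dense_array_bytes(5_000_000, 4) / GB   # n x k with n = 5M, k = 4
print(pairwise_80k, pairwise_90k, point_centroid_5m)
```

Under these assumptions, a pairwise distance array needs about 51.2 GB at 80K points and about 64.8 GB at 90K points, which is on the order of the 64 GB of RAM of the test platform, whereas the point-to-centroid array for 5 million points and $k=4$ needs about 0.16 GB.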

5.3.6 Monotonicity Analysis

In this section, we provide a monotonicity analysis for the tilted SSE in a simple case.

Theorem 5.

When $k=1$, suppose all data points are normalized to a unit norm. Then for any $t\geq 0$, it holds that

$$
\frac{\partial\overline{\phi}(t,\mathcal{S},\mathcal{C})}{\partial t}\geq 0. \tag{22}
$$

The proof of Theorem 5 is provided in Section 8.5. When $k=1$, $k$-means simplifies to a point estimation problem. In this case, Theorem 5 shows that the tilted SSE increases as $t$ increases. While the monotonicity of the tilted SSE is restricted to the scenario when $k=1$, our experiments suggest that the tilted SSE also exhibits a monotonically increasing trend for other values of $k$.
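A quick numerical check of this behavior is sketched below for a single cluster with a fixed centroid. It assumes the single-cluster objective $\frac{1}{t}\log\bigl(\frac{1}{n}\sum_{i}e^{t\|\boldsymbol{x}_{i}-\boldsymbol{c}\|^{2}}\bigr)$, which is not restated in this section, and the function name and data are hypothetical. Holding the centroid fixed is a simplification: Theorem 5 concerns the tilted SSE itself, but a quantity that is nondecreasing in $t$ for every fixed centroid also has a nondecreasing minimum over centroids.

```python
import numpy as np

def tilted_sse_single_cluster(X, c, t):
    """(1/t) * log(mean(exp(t * ||x_i - c||^2))), evaluated stably."""
    z = t * np.sum((X - c) ** 2, axis=1)
    m = z.max()
    return (m + np.log(np.mean(np.exp(z - m)))) / t

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm points, as in Theorem 5
c = X.mean(axis=0)
f = np.sum((X - c) ** 2, axis=1)                # squared distances to the centroid

ts = np.geomspace(1e-2, 1e2, 30)
values = np.array([tilted_sse_single_cluster(X, c, t) for t in ts])
```

The resulting `values` move from roughly the mean of the squared distances (small $t$) toward their maximum (large $t$) without decreasing.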

6 Experiments

TABLE II: An overview of the datasets

| Ref. | Dataset | # of points | # of attributes | # of clusters |
|---|---|---|---|---|
| [33] | Athlete | 271,117 | 15 | 3-10 |
| [45] | Bank | 4,521 | 16 | 3-10 |
| [36] | Census | 32,561 | 15 | 3-10 |
| [57] | Diabetes | 101,766 | 24 | 3-10 |
| [6] | Recruitment | 4,001 | 50 | 3-10 |
| [9] | Spanish | 4,747 | 15 | 3-10 |
| [1] | Student | 32,594 | 21 | 3-10 |
| [2] | 3D-spatial | 434,874 | 12 | 3-10 |
| [41] | Census1990 | 2,458,285 | 69 | 4 |
| [3] | HMDA | 5,986,660 | 53 | 4 |
| [50] | Synthetic | 200 | 2 | 2, 3 |

Goals.  In this section, we verify the effectiveness and efficiency of TKM by comparing it with various methods. We also examine the impact of various hyperparameters on the convergence of TKM. Moreover, we provide visualizations of the centroids’ variations with varying $t$.

6.1 Settings

Datasets.  We employ ten real-world datasets and two synthetic datasets to validate the performance of TKM. To compare the effectiveness and fairness of TKM with various methods and parameters, we utilize Athlete, Bank, Census, Diabetes, Recruitment, Spanish, Student, and 3D-spatial. To compare the efficiency of TKM with other methods, we employ Census1990 and HMDA. For visualizing TKM, we use two synthetic datasets. We sampled numerical features from ten real-world datasets and then standardized these features (the names of these features are provided in our repository). A comprehensive overview of the datasets can be obtained from Table II.

Baselines.  We experimentally evaluate the performance of TKM against six methods, namely, $k$-means++ [8], JKL [35], MV [40], FR [48], SFR [48], and NF [52]. As explained in our related works, JKL first introduced the concept of individual fairness for $k$-means. MV, FR, and SFR are three state-of-the-art methods for individually fair $k$-means. Note that SFR is a sparsified version of FR. $k$-means++ and NF are two clustering methods that do not take individual fairness into account. It is worth noting that NF differs from the classical Lloyd’s heuristic: it is solved through SGD and can be considered as the case of $t=0$ in TKM. For TKM and NF, we employed $k$-means++ for initialization.

Figure 5: Comparison among various methods in terms of SSE at varying values of $k$.
Figure 6: Comparison among various methods in terms of the variance in each cluster.
Figure 7: Comparison among various methods in terms of the maximum distance in each cluster.
TABLE III: Comparison among TKM, SFR, MV, and FR in terms of running time (seconds). We abbreviate TLE as the time limit exceeded for 1 hour, SLE as the sampling size limit exceeded for the dataset dimension, and ROF as the RAM overflow.

| Dataset | Method | 1K | 2K | 5K | 10K | 15K | 20K | 25K | 30K | 40K | 50K | 60K | 70K | 80K | 90K | 2M | 5M |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Census1990 | TKM | 0.7 | 1.3 | 2.4 | 4.0 | 9.0 | 12.0 | 15.9 | 18.9 | 23.4 | 30.7 | 41.5 | 49.4 | 56.2 | 66.0 | 542.3 | SLE |
| Census1990 | SFR | 0.5 | 2.6 | 11.7 | 31.4 | 45.4 | 65.2 | 77.3 | 98.9 | 156.7 | 195.2 | 291.8 | 483.1 | 601.3 | ROF | ROF | SLE |
| Census1990 | MV | 5.6 | 30.7 | 85.9 | 250.9 | 1068.3 | 1783.9 | 4960.8 | TLE | TLE | TLE | TLE | TLE | TLE | ROF | ROF | SLE |
| Census1990 | FR | 13.4 | 129.8 | 1053.4 | 10692.7 | TLE | TLE | TLE | TLE | TLE | TLE | TLE | TLE | TLE | ROF | ROF | SLE |
| HMDA | TKM | 0.3 | 1.0 | 2.2 | 3.8 | 9.3 | 12.3 | 15.5 | 19.4 | 24.3 | 31.1 | 45.2 | 53.0 | 59.1 | 71.3 | 743.9 | 1901.6 |
| HMDA | SFR | 2.3 | 5.7 | 17.9 | 41.1 | 65.8 | 72.9 | 88.6 | 111.8 | 174.5 | 211.9 | 348.8 | 528.1 | 712.5 | ROF | ROF | ROF |
| HMDA | MV | 5.0 | 27.8 | 61.2 | 304.5 | 406.8 | 1923.9 | 5187.6 | TLE | TLE | TLE | TLE | TLE | TLE | ROF | ROF | ROF |
| HMDA | FR | 49.8 | 263.3 | 2784.1 | TLE | TLE | TLE | TLE | TLE | TLE | TLE | TLE | TLE | TLE | ROF | ROF | ROF |

Measurements.  We employ several metrics to evaluate the performance of clustering algorithms. We use SSE to measure the utility of different clustering algorithms, where a smaller SSE indicates better clustering utility. To measure fairness, we use two metrics. The first is the variance of each point’s squared distance to its nearest centroid within a cluster, where a smaller variance indicates a fairer algorithm. The second is the maximum distance from each point in a cluster to the centroid, where a smaller maximum distance signifies greater fairness. For efficiency, we measure the running time of each algorithm. To verify the impact of different hyperparameters on the convergence of TKM, we use the tilted SSE as the metric.
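The three quality metrics described above can be computed as in the following sketch. It is a minimal illustration with hypothetical names, not the evaluation code from the repository. It assumes a hard assignment with no empty clusters and uses the population variance of squared distances within each cluster.

```python
import numpy as np

def clustering_metrics(X, centroids, labels):
    """Return SSE, per-cluster variance of squared distances, and
    per-cluster maximum distance for a given hard assignment."""
    sq = np.sum((X - centroids[labels]) ** 2, axis=1)   # squared distance to own centroid
    k = centroids.shape[0]
    sse = float(sq.sum())
    variances = np.array([sq[labels == j].var() for j in range(k)])
    max_dists = np.array([np.sqrt(sq[labels == j].max()) for j in range(k)])
    return sse, variances, max_dists

# Tiny hand-checkable example: squared distances are 1, 4 and 4, 4.
X = np.array([[0.0, 0.0], [3.0, 0.0], [10.0, 0.0], [10.0, 4.0]])
centroids = np.array([[1.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
sse, variances, max_dists = clustering_metrics(X, centroids, labels)
```

In this example the first cluster has unequal squared distances (1 and 4) and therefore a positive variance, while the second cluster has equal squared distances and zero variance, which is the fairer configuration under this metric.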

Implementations.  Our algorithms were executed on a platform comprising an Intel i9-14900KF CPU with 24 cores, 64 GB of RAM, and operating on the CentOS 7 environment. The software implementations, including our methods and the comparison methods, were realized in Python 3.7 and open-sourced (https://github.com/zsk66/TKM-master).

6.2 Comparison among Various Methods

6.2.1 Effectiveness Analysis

Fig. 5 compares the SSE of TKM and the six baselines as $k$ varies on eight datasets: Athlete, Bank, Census, Diabetes, Recruitment, Spanish, Student, and 3D-spatial. Because the comparison methods require long running times, we sample the datasets to accommodate them. We sampled 1000 data points from each dataset, repeated this process 10 times, conducted experiments on the resulting 10 sampled datasets, and averaged the obtained SSE values. We set the parameter $t$ of TKM to 0.01, 0.05, 0.1, and 0.2, respectively. The learning rate for NF and TKM was set to 0.05, the number of epochs to 5, the batch size to 100, and the number of iterations to 500. JKL, MV, FR, and SFR adopted the default hyperparameter settings in their papers.

Observations. We can see that as $t$ increases, the SSE of TKM also increases. This is because an increase in $t$ inevitably brings the centroids closer to the minority data points, resulting in an increase in SSE. Comparing the SSE of different methods, we can observe that the SSE of JKL is consistently the highest across all datasets except for Bank and Spanish. In these two datasets, TKM has a large SSE at $t=0.2$, because an excessively large $t$ causes the centroids obtained by TKM to be too close to the minority data points. The SSE of SFR is always larger than that of FR because SFR is a version of FR that applies the sparsification technique. On 3D-spatial and Recruitment, FR has a lower SSE than MV, but on the other six datasets, MV has a lower SSE than FR. Meanwhile, TKM’s SSE at $t=0.01,0.05,0.1$ is consistently lower than that of JKL, MV, FR, and SFR, and is even nearly as low as that of $k$-means++ and NF on Census and Recruitment, which reflects the outstanding effectiveness of TKM.

6.2.2 Fairness Analysis

Fig. 6 and Fig. 7 illustrate the variance and maximum distance within each cluster for various methods when $k=4$. The variance and maximum distance values within Clusters 1-4 are arranged in descending order. The data processing and hyperparameter configurations for all methods remain consistent with those outlined in Section 6.2.1.

Observations. From Fig. 6, it can be seen that for TKM, as $t$ increases, the variance of each cluster decreases, which is consistent with our theoretical results. Next, without loss of generality, we examine the variance of each method on Cluster 1. It can be observed that JKL has the largest variance across all datasets except for Bank, Recruitment, and Spanish, while $k$-means++ and NF have the largest variance on Bank and Spanish, and SFR has the largest variance on Recruitment. It is worth noting that in some datasets, such as Diabetes, Recruitment, Student, and 3D-spatial, even when $t=0.01$, the variance of TKM is smaller than that of the comparison methods. Moreover, in the other datasets, by adjusting $t$, it is always possible to make the variance of TKM smaller than that of the comparison methods. From Fig. 7, we observe that the maximum distance within each cluster decreases as $t$ increases. This occurs because a large maximum distance is caused by the centroids being far from the minority points. With a higher $t$, the centroids shift towards the minority points, thereby reducing the maximum distance. Comparing TKM with other methods reveals that TKM achieves the smallest maximum distance, demonstrating its fairness. Moreover, we observe that in 3D-spatial, the variance and maximum distance of JKL, MV, FR, and SFR are all larger than those of $k$-means++, indicating that existing individually fair clustering methods might even exacerbate unfairness in our scenario.

6.2.3 Efficiency Analysis

Table III compares the running time of TKM with that of three state-of-the-art methods: MV, FR, and SFR. (Because JKL and NF perform poorly in effectiveness and fairness, we do not consider these two methods in the efficiency comparison.) We sampled Census1990 and HMDA with sizes $n_{s}$ of 1K, 2K, 5K, 10K, 15K, 20K, 25K, 30K, 40K, 50K, 60K, 70K, 80K, 90K, 2M, and 5M, respectively. We set the number of iterations for TKM to 500, the batch size to $\frac{1}{50}n_{s}$, the number of epochs to 5, and the learning rate to 0.05. The hyperparameters for MV, FR, and SFR were set to their default values in their papers.

Figure 8: Effect of $t$ on the convergence of TKM.
Figure 9: Effect of the epoch on the convergence of TKM.
Figure 10: Effect of the learning rate on the convergence of TKM.
Figure 11: Visualization of two synthetic 2-dimensional data for $k=2$ and $k=3$ by TKM.

Observations. The experimental results show that TKM runs faster than MV and FR at every sample size, and faster than SFR at every sample size except 1K on Census1990 (0.7 s versus 0.5 s). TKM can cluster 5 million data points in about 30 minutes, whereas MV can only cluster about 20,000 samples and FR only about 5,000 data points in the same time. Moreover, as the sample size increases, the speedup of TKM over MV and FR grows and can reach hundreds or even thousands of times. For example, when the number of sampled points is 1,000, TKM achieves 8.0$\times$ and 19.1$\times$ acceleration compared to MV and FR in Census1990, respectively. When the number of sampled points is 10,000, TKM achieves 62.7$\times$ and 2673.2$\times$ acceleration compared to MV and FR in Census1990, respectively. Although the running time of SFR is significantly shorter than that of MV and FR, TKM still achieves approximately 10.7$\times$ and 12.1$\times$ acceleration with 80,000 data points in Census1990 and HMDA, respectively. Furthermore, when the sample size reaches 90,000, SFR must compute the distances between every pair of sample points, which leads to a memory overflow that terminates the algorithm. This issue also arises in MV and FR.
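The acceleration factors quoted above follow directly from the running times in Table III. The short sketch below recomputes them. The dictionary layout and the helper name `speedup` are only illustrative choices.

```python
# Running times in seconds, copied from Table III.
census1990 = {
    "TKM": {"1K": 0.7, "10K": 4.0, "80K": 56.2},
    "SFR": {"80K": 601.3},
    "MV": {"1K": 5.6, "10K": 250.9},
    "FR": {"1K": 13.4, "10K": 10692.7},
}
hmda = {
    "TKM": {"80K": 59.1},
    "SFR": {"80K": 712.5},
}

def speedup(times, baseline, size):
    """Ratio of a baseline's running time to TKM's at one sample size."""
    return round(times[baseline][size] / times["TKM"][size], 1)

mv_1k = speedup(census1990, "MV", "1K")
fr_1k = speedup(census1990, "FR", "1K")
mv_10k = speedup(census1990, "MV", "10K")
fr_10k = speedup(census1990, "FR", "10K")
sfr_80k_census = speedup(census1990, "SFR", "80K")
sfr_80k_hmda = speedup(hmda, "SFR", "80K")
print(mv_1k, fr_1k, mv_10k, fr_10k, sfr_80k_census, sfr_80k_hmda)
```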

6.2.4 Summary of Lessons Learned

We have provided the changes of SSE with respect to $k$ for different methods, the variance results in four clusters for different methods, and a comparison of the efficiency of different methods. Our experimental results have led us to draw the following conclusions:

  • TKM outperforms state-of-the-art methods in terms of effectiveness. Specifically, TKM achieves smaller SSE compared to state-of-the-art methods across different values of $k$ and $t$. In some datasets, the SSE of TKM is almost the same as that of methods that do not consider individual fairness.

  • TKM outperforms state-of-the-art methods in terms of fairness. Specifically, TKM can achieve smaller variance and maximum distance than state-of-the-art methods when an appropriate value of $t$ is chosen.

  • TKM surpasses state-of-the-art methods in terms of efficiency. Specifically, TKM can cluster more data points in a shorter time, and as the sample size increases, this acceleration effect becomes even more pronounced. Moreover, TKM can overcome the RAM overflow issue that existing methods encounter when dealing with large-scale data.

6.3 Comparison among Various Parameters

6.3.1 Tilted SSE vs. $t$

Fig. 8 illustrates the convergence of the tilted SSE with iterations at $t$ values of 0.01, 0.05, 0.1, 0.2, 0.5, and 1. We randomly select 1000 data points from each dataset, repeating this process 10 times. We then conduct experiments on these 10 subsampled datasets, calculating the average of the resulting tilted SSE values. For other hyperparameters, we set the learning rate to 0.05, the number of iterations to 500, the batch size to 100, and the epoch size to 5.

Observations. We observe that despite using SGD to update the centroids, the tilted SSE of TKM still decreases steadily with iterations, which confirms the convergence of TKM. As $t$ increases, the tilted SSE also increases. This suggests that the monotonicity of the tilted SSE with respect to $t$, which we establish theoretically for $k=1$, also holds empirically for other values of $k$. When $t=0.01$, the tilted SSE remains nearly unchanged with iterations. This indicates that the tilted SSE is insensitive to variations in $t$ when $t$ is small.

6.3.2 Tilted SSE vs. Epoch

Fig. 9 illustrates the convergence of the tilted SSE with different numbers of epochs during iterations. The data preprocessing for TKM here follows the same procedures outlined in Section 6.3.1. To visualize the curve of the tilted SSE of TKM over iterations more intuitively, we set $t=0.5$, the learning rate to 0.03, the number of iterations to 500, the batch size to 50, and the epoch size to 1, 3, 5, 7, and 9.

Observations. From Fig. 9, it can be observed that as the number of iterations increases, the tilted SSE of TKM decreases and tends to stabilize after reaching a certain value on all datasets. With an increase in the epoch size, the convergence speed of TKM accelerates, and its convergence performance improves. This is because increasing the epoch size allows for higher precision in the solution obtained through SGD during each iteration, as more data can be utilized. When the epoch size is 7 or 9, the convergence behavior and convergence speed of TKM are not significantly different. Moreover, in some datasets, we found that increasing the epoch size does not necessarily improve convergence. For example, in Recruitment, an epoch size of 7 yields better convergence than an epoch size of 9, which we attribute to the risk of overfitting when the epoch size is too large. Therefore, 7 is an appropriate choice for the epoch size.

6.3.3 Tilted SSE vs. Learning Rate

Fig. 10 illustrates the convergence of tilted SSE with various learning rates during iterations. The data preprocessing for TKM here is the same as in Section 6.3.1. For the parameter settings of TKM, we set $t=0.5$, epoch size to 5, batch size to 50, number of iterations to 500, and learning rate to 0.01, 0.02, 0.03, 0.04, and 0.05.

Observations. From Fig. 10, we can see that, across the eight datasets, the convergence speed generally increases with the learning rate. However, beyond a certain learning rate, the gain in convergence speed diminishes. For example, when $\eta = 0.03, 0.04, 0.05$, the convergence speed and the converged tilted SSE value on Bank are almost indistinguishable. Additionally, an excessively high learning rate can result in poorer convergence, as demonstrated on Diabetes, where $\eta = 0.04$ produces a smaller tilted SSE than the larger $\eta = 0.05$. This occurs because an overly large learning rate may make the SGD step size excessive, hindering convergence to a locally optimal solution.
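The effect of the step size can be reproduced on a toy problem. The sketch below runs full-batch gradient descent (a simplification of the mini-batch SGD used by TKM) on the tilted SSE of a single one-dimensional cluster; the data, the helper names, and the step count are assumptions made for illustration. The gradient used is $\sum_i w_i \cdot 2(c - x_i)$ with weights $w_i \propto e^{t (x_i - c)^2}$ normalized to sum to one, which follows from differentiating the tilted SSE of a single cluster.

```python
import math

def tilted_sse(xs, c, t):
    """Tilted SSE of one 1-D cluster with centroid c (requires t > 0)."""
    exps = [t * (x - c) ** 2 for x in xs]
    shift = max(exps)  # log-sum-exp shift for numerical stability
    total = sum(math.exp(e - shift) for e in exps)
    return (shift + math.log(total / len(xs))) / t

def tilted_grad(xs, c, t):
    """Gradient w.r.t. c: softmax-weighted average of 2 * (c - x)."""
    exps = [t * (x - c) ** 2 for x in xs]
    shift = max(exps)
    weights = [math.exp(e - shift) for e in exps]
    norm = sum(weights)
    return sum(w / norm * 2.0 * (c - x) for w, x in zip(weights, xs))

def descend(xs, t, lr, steps):
    """Run gradient descent from the cluster mean; return the final objective."""
    c = sum(xs) / len(xs)
    for _ in range(steps):
        c -= lr * tilted_grad(xs, c, t)
    return tilted_sse(xs, c, t)

xs = [0.0, 0.1, 0.2, 0.3, 2.0]  # hypothetical cluster with one distant point
t, steps = 0.5, 50
start = tilted_sse(xs, sum(xs) / len(xs), t)
final = {lr: descend(xs, t, lr, steps) for lr in (0.01, 0.03, 0.05)}

for lr, value in final.items():
    print(f"lr={lr}: tilted SSE after {steps} steps = {value:.6f} (start {start:.6f})")
```

On this convex toy problem, a larger step size reaches a lower objective within the same number of steps. The sketch does not reproduce the degradation at overly large rates, which the paper reports for mini-batch SGD on real data.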

6.3.4 Visualization

Fig. 11 demonstrates how centroids change with $t$ in two synthetic datasets when the number of clusters is set to 2 and 3, respectively. We set the number of epochs for TKM to 5, the number of iterations to 1000, the learning rate to 0.01, and the batch size to 20. For the values of $t$, we take a total of 60 geometrically spaced values between $10^{-2}$ and $10^{2}$. We employ a blue-to-red gradient to depict the rising values of $t$, and we use the same color to represent data points within the same cluster.

Observations. It can be observed that as $t$ increases, the centroids tend to shift towards the minority data points in each cluster. This shift reduces the disparity in distances within each cluster, in line with the principle of “treating all points equally” that underlies individual fairness. Furthermore, we observe that as $t$ increases, the centroids do not shift excessively towards minority data points, so the distance from majority data points to the centroids remains reasonable. This demonstrates that TKM ensures equal treatment of each data point.
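A one-dimensional toy example makes the centroid shift concrete. The sketch below is an illustration under stated assumptions, not the experiment of Fig. 11: it uses a hypothetical cluster with four points at 0 and one minority point at 2, a grid of 60 geometrically spaced values of $t$ between $10^{-2}$ and $10^{2}$, and a derivative-free ternary search (the single-cluster tilted SSE is convex in the centroid) instead of SGD.

```python
import math

def tilted_sse(xs, c, t):
    """Tilted SSE of one 1-D cluster with centroid c (requires t > 0)."""
    exps = [t * (x - c) ** 2 for x in xs]
    shift = max(exps)  # log-sum-exp shift for numerical stability
    total = sum(math.exp(e - shift) for e in exps)
    return (shift + math.log(total / len(xs))) / t

def tilted_centroid(xs, t, iters=100):
    """Minimize the convex objective over [min(xs), max(xs)] by ternary search."""
    lo, hi = min(xs), max(xs)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if tilted_sse(xs, m1, t) < tilted_sse(xs, m2, t):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2.0

xs = [0.0, 0.0, 0.0, 0.0, 2.0]  # hypothetical: majority at 0, minority at 2
ts = [10 ** (-2 + 4 * i / 59) for i in range(60)]  # geometric grid on [1e-2, 1e2]
centroids = [tilted_centroid(xs, t) for t in ts]

mean = sum(xs) / len(xs)             # minimizer of the plain SSE
midrange = (min(xs) + max(xs)) / 2   # minimizer of the largest squared distance
print(f"mean={mean}, midrange={midrange}")
print(f"t={ts[0]:.2f}: centroid={centroids[0]:.4f}")
print(f"t={ts[-1]:.2f}: centroid={centroids[-1]:.4f}")
```

In this example the centroid starts near the cluster mean, moves toward the minority point as $t$ grows, and stays below the midrange of the cluster, so it never passes the point at which the majority points would become the farthest ones.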

6.3.5 Summary of Lessons Learned

We have presented the convergence behavior of TKM under different epoch sizes and learning rates, as well as visualizations of TKM on two-dimensional synthetic data. These experiments lead us to the following conclusions:

  • TKM is a convergent algorithm, and the tilted SSE increases monotonically with $t$. Specifically, for each value of $t$, the tilted SSE in TKM steadily decreases to a stable value over iterations, and the converged value is larger for larger $t$.

  • The convergence of TKM is influenced by the epoch size and learning rate. Specifically, selecting an appropriate epoch size and learning rate can lead to faster convergence speed and better convergence of TKM. However, choosing larger epoch sizes and learning rates does not necessarily improve the performance.

  • TKM indeed can ensure individual fairness for $k$-means. Specifically, as $t$ increases, minority data points end up closer to the centroids, achieving the goal of treating each individual equally.

7 Conclusions and Future Work

This paper investigated the individually fair $k$-means in the context of location-based resource allocation. To address the issue where existing individually fair clustering methods and fairness metrics may exacerbate unfairness, we proposed TKM, an algorithm designed to effectively solve the individually fair $k$-means problem via exponential tilting. We constructed the tilted SSE as the objective function and proposed solving the optimization problem using CD and SGD. Moreover, we proposed to employ variance to measure fairness. Our theory and experiments have validated that the effectiveness, efficiency, and fairness of our proposed algorithm outperform existing state-of-the-art methods. It is noteworthy that existing individually fair clustering methods encounter challenges in their application to large-scale data clustering analysis scenarios, primarily due to their computational complexity, which depends on the dataset size. In contrast, TKM, due to its excellent efficiency performance, can be applied in many big data clustering analysis scenarios, such as resource allocation.

Due to privacy concerns, data is often stored on different devices and cannot be shared among them. Therefore, a hot topic of research is how to perform clustering analysis without sharing data. In the future, we will investigate individually fair $k$-means in the framework of federated learning to address this issue.

8 Proofs

8.1 Proof of Theorem 1

Before proving Theorem 1, we present some useful lemmas.

Lemma 3.

Given a cluster $\mathcal{S}_j$, let $\boldsymbol{\delta}_j$ and $\boldsymbol{c}_j$ be the corresponding assignment and centroid. Then for any $t \geq 0$, it holds that

$$\psi(\boldsymbol{\delta}_j, \boldsymbol{c}_j) \leq \phi(t, \boldsymbol{\delta}_j, \boldsymbol{c}_j). \tag{23}$$
Proof.

Following from (5.1), we have

\begin{align}
\phi(t, \boldsymbol{\delta}_j, \boldsymbol{c}_j) &= \frac{1}{t}\log\frac{1}{n}\sum_{i=1}^{n} e^{t f(\boldsymbol{x}_i, \boldsymbol{c}_j)\delta_{ij}} \notag\\
&\geq \frac{1}{t}\cdot\frac{1}{n}\sum_{i=1}^{n}\log e^{t f(\boldsymbol{x}_i, \boldsymbol{c}_j)\delta_{ij}} \tag{24}\\
&= \frac{1}{n}\sum_{i=1}^{n} f(\boldsymbol{x}_i, \boldsymbol{c}_j)\delta_{ij} = \psi(\boldsymbol{\delta}_j, \boldsymbol{c}_j), \tag{25}
\end{align}

where (24) follows from Jensen's inequality applied to the concave logarithm. ∎
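For completeness, the Jensen step in (24) can be written out as follows. This is a sketch of the standard argument, with the shorthand $a_i$ introduced here for illustration; it assumes $t > 0$ so that dividing by $t$ preserves the inequality.

```latex
% Shorthand (introduced here): a_i = e^{t f(x_i, c_j) \delta_{ij}} > 0.
% Jensen's inequality for the concave logarithm with uniform weights 1/n:
\log\Big(\frac{1}{n}\sum_{i=1}^{n} a_i\Big) \;\geq\; \frac{1}{n}\sum_{i=1}^{n}\log a_i
  \;=\; \frac{t}{n}\sum_{i=1}^{n} f(\boldsymbol{x}_i,\boldsymbol{c}_j)\,\delta_{ij}.
% Dividing both sides by t > 0 gives \phi(t,\delta_j,c_j) \geq \psi(\delta_j,c_j).
```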

Lemma 4.

Given a set of clusters $\mathcal{S} = \{\mathcal{S}_j\}_{j=1}^{k}$ and a set of centroids $\mathcal{C} = \{\boldsymbol{c}_j\}_{j=1}^{k}$, let $\mathrm{dist}(\boldsymbol{x}_i, \mathcal{C}) := \min_{\boldsymbol{c}_j \in \mathcal{C}} \|\boldsymbol{x}_i - \boldsymbol{c}_j\|^2$. Then for any $t \geq 0$, there exists a scalar $\epsilon \geq k \cdot \frac{\max_{\boldsymbol{x}_i \in X} \mathrm{dist}(\boldsymbol{x}_i, \mathcal{C})}{\min_{\boldsymbol{x}_i \in X} \mathrm{dist}(\boldsymbol{x}_i, \mathcal{C})}$ such that the following inequality holds:

$$\overline{\phi}(t, \mathcal{S}, \mathcal{C}) \leq \epsilon \cdot \overline{\psi}(\mathcal{S}, \mathcal{C}). \tag{26}$$
Proof.

Consider the case $t \to \infty$. According to L’Hôpital’s rule, it holds that

$$\lim_{t \to \infty} \phi(t, \boldsymbol{\delta}_j, \boldsymbol{c}_j) = \max_{\boldsymbol{x}_i \in \mathcal{S}_j} f(\boldsymbol{x}_i, \boldsymbol{c}_j), \tag{27}$$

which implies that for any $j \in [k]$, $\phi(t, \boldsymbol{\delta}_j, \boldsymbol{c}_j)$ is bounded. Then there must exist a scalar $\epsilon$ such that

\begin{align}
\epsilon &\geq k \cdot \frac{\max_{\boldsymbol{x}_i \in X} \mathrm{dist}(\boldsymbol{x}_i, \mathcal{C})}{\min_{\boldsymbol{x}_i \in X} \mathrm{dist}(\boldsymbol{x}_i, \mathcal{C})} \notag\\
&= \frac{\sum_{j=1}^{k} \frac{1}{t}\log\frac{1}{n}\sum_{i=1}^{n} e^{t \max_{\boldsymbol{x}_i \in X} \mathrm{dist}(\boldsymbol{x}_i, \mathcal{C})}}{\min_{\boldsymbol{x}_i \in X} \mathrm{dist}(\boldsymbol{x}_i, \mathcal{C})} \tag{28}\\
&\geq \frac{\sum_{j=1}^{k} \frac{1}{t}\log\frac{1}{n}\sum_{i=1}^{n} e^{t f(\boldsymbol{x}_i, \boldsymbol{c}_j)\delta_{ij}}}{\frac{1}{n}\sum_{j=1}^{k}\sum_{\boldsymbol{x}_i \in \mathcal{S}_j} f(\boldsymbol{x}_i, \boldsymbol{c}_j)} \tag{29}\\
&= \frac{\overline{\phi}(t, \mathcal{S}, \mathcal{C})}{\overline{\psi}(\mathcal{S}, \mathcal{C})}, \tag{30}
\end{align}

which completes the proof. ∎
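The bound of Lemma 4 can be checked numerically on a small example. The sketch below is an illustration with hypothetical one-dimensional data, centroids, and helper names; it evaluates the aggregate tilted SSE and SSE as they appear in (29) and (30), and it assumes that no point coincides with its centroid so that the minimum distance is positive.

```python
import math

def sq_dist(x, c):
    return (x - c) ** 2

def nearest(x, centroids):
    """Index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda j: sq_dist(x, centroids[j]))

def aggregate_objectives(points, centroids, t):
    """Return (phi_bar, psi_bar) for the assignment induced by nearest centroids."""
    n = len(points)
    labels = [nearest(x, centroids) for x in points]
    phi_bar, psi_bar = 0.0, 0.0
    for j, c in enumerate(centroids):
        # delta_ij = 1 for members of cluster j; non-members contribute exp(0) = 1.
        terms = [math.exp(t * sq_dist(x, c)) if lab == j else 1.0
                 for x, lab in zip(points, labels)]
        phi_bar += math.log(sum(terms) / n) / t
        psi_bar += sum(sq_dist(x, c) for x, lab in zip(points, labels) if lab == j) / n
    return phi_bar, psi_bar

points = [0.0, 1.0, 2.0, 10.0, 11.0, 13.0]   # hypothetical data
centroids = [0.8, 11.5]                      # hypothetical centroids
t = 1.0

dists = [min(sq_dist(x, c) for c in centroids) for x in points]
eps = len(centroids) * max(dists) / min(dists)  # smallest epsilon allowed by Lemma 4
phi_bar, psi_bar = aggregate_objectives(points, centroids, t)
print(f"psi_bar={psi_bar:.4f}, phi_bar={phi_bar:.4f}, eps*psi_bar={eps * psi_bar:.4f}")
```

On this example the aggregate tilted SSE lies between the aggregate SSE and $\epsilon$ times the aggregate SSE. The ratio of the largest to the smallest squared distance makes $\epsilon$ large whenever some point lies very close to its centroid.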

Proposition 1.

Let $\boldsymbol{\delta}_j^{\star}, \boldsymbol{c}_j^{\star}$, $j \in [k]$, be the optimal solution of SSE, let $\boldsymbol{\delta}_j^{*}, \boldsymbol{c}_j^{*}$, $j \in [k]$, be the optimal solution of tilted SSE, and let $\overline{\psi}^{\star}$, $\overline{\phi}^{*}$ be the corresponding optimal objective function values. Then for any $t \geq 0$, we have $\overline{\psi}^{\star} \leq \overline{\phi}^{*}$.

Proof.

Based on Lemma 3 and the optimality conditions, we obtain

$$\overline{\psi}^{\star} = \psi(\boldsymbol{\delta}_j^{\star}, \boldsymbol{c}_j^{\star}) \leq \psi(\boldsymbol{\delta}_j^{*}, \boldsymbol{c}_j^{*}) \leq \phi(t, \boldsymbol{\delta}_j^{*}, \boldsymbol{c}_j^{*}) = \overline{\phi}^{*}. \tag{31}$$

Summing (31) over $j$ from 1 to $k$ implies Proposition 1. ∎

Lemma 5 (Theorem 1.1 in [8]).

Let $\overline{\psi}^{\star}$ be the optimal SSE of $k$-means, let $\mathcal{C}$ be the set of centroids constructed by $k$-means++, and let $\mathcal{S}$ be the corresponding induced assignment. Then for any set of data points, it holds that $\mathbb{E}[\overline{\psi}(\mathcal{S}, \mathcal{C})] \leq 8(\log k + 2)\,\overline{\psi}^{\star}$.
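For readers unfamiliar with the seeding procedure that Lemma 5 refers to, the following is a minimal sketch of $k$-means++ seeding ($D^2$ sampling) using only the Python standard library. The function names, the toy data, and the fixed seed are assumptions for illustration; it is not the implementation used in the experiments.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeanspp_seeds(points, k, rng):
    """Pick k initial centroids: the first uniformly at random, each later one
    with probability proportional to its squared distance to the nearest
    centroid chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

# Hypothetical 2-D data: two tight groups and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
seeds = kmeanspp_seeds(points, k=3, rng=random.Random(0))
print(seeds)
```

Already chosen points receive weight zero, so the seeds are distinct data points whenever the data points themselves are distinct.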

Next, we are ready to prove Theorem 1 based on the above lemmas.

Proof of Theorem 1.

Let $\mathcal{C}$ be the set of centroids constructed by $k$-means++, and let $\mathcal{S}$ be the corresponding induced set of clusters. Following from Lemma 4, we have

$$\overline{\phi}(t, \mathcal{S}, \mathcal{C}) \leq \epsilon \cdot \overline{\psi}(\mathcal{S}, \mathcal{C}). \tag{32}$$

Then we can bound $\mathbb{E}[\overline{\phi}(t, \mathcal{S}, \mathcal{C})]$ as

\begin{align}
\mathbb{E}[\overline{\phi}(t, \mathcal{S}, \mathcal{C})] &\leq \epsilon \cdot \mathbb{E}[\overline{\psi}(\mathcal{S}, \mathcal{C})] \notag\\
&\leq 8\epsilon(\log k + 2) \cdot \overline{\psi}^{\star} \tag{33}\\
&\leq 8\epsilon(\log k + 2) \cdot \overline{\phi}^{*}, \tag{34}
\end{align}

where (33) follows from Lemma 5, and (34) follows from Proposition 1. According to Lemma 4, we have $\epsilon \geq k \cdot \frac{\max_{\boldsymbol{x}_i \in X} \mathrm{dist}(\boldsymbol{x}_i, \mathcal{C})}{\min_{\boldsymbol{x}_i \in X} \mathrm{dist}(\boldsymbol{x}_i, \mathcal{C})}$, and we can then derive that $\mathbb{E}[\overline{\phi}(t, \mathcal{S}, \mathcal{C})] \leq O(k \log k)\,\overline{\phi}^{*}$, which completes the proof. ∎

8.2 Proof of Theorem 2

By the Mean Value Theorem, the gradient Lipschitz continuity indicates the following proposition.

Proposition 2.

For any $t \geq 0$ and $\tilde{\boldsymbol{c}}_j, \tilde{\boldsymbol{c}}_j' \in \mathbb{R}^d$, it holds that

$$\phi(t, \boldsymbol{\delta}_j, \tilde{\boldsymbol{c}}_j) - \phi(t, \boldsymbol{\delta}_j, \tilde{\boldsymbol{c}}_j') \leq \nabla_{\boldsymbol{c}_j}\phi(t, \boldsymbol{\delta}_j, \tilde{\boldsymbol{c}}_j')^{\top}(\tilde{\boldsymbol{c}}_j - \tilde{\boldsymbol{c}}_j') + \frac{L(t)}{2}\|\tilde{\boldsymbol{c}}_j - \tilde{\boldsymbol{c}}_j'\|^2.$$
Proof.

Following Lemma 2, it holds that

$$\begin{aligned}
\phi(t, \boldsymbol{\delta}_j, \tilde{\boldsymbol{c}}_j) &= \phi(t, \boldsymbol{\delta}_j, \tilde{\boldsymbol{c}}_j') + \int_0^1 \frac{\partial \phi(t, \boldsymbol{\delta}_j, \tilde{\boldsymbol{c}}_j' + y(\tilde{\boldsymbol{c}}_j - \tilde{\boldsymbol{c}}_j'))}{\partial y}\,dy \\
&= \phi(t, \boldsymbol{\delta}_j, \tilde{\boldsymbol{c}}_j') + \int_0^1 \nabla_{\boldsymbol{c}_j}\phi(t, \boldsymbol{\delta}_j, \tilde{\boldsymbol{c}}_j' + y(\tilde{\boldsymbol{c}}_j - \tilde{\boldsymbol{c}}_j'))^{\top}(\tilde{\boldsymbol{c}}_j - \tilde{\boldsymbol{c}}_j')\,dy \\
&= \phi(t, \boldsymbol{\delta}_j, \tilde{\boldsymbol{c}}_j') + \nabla_{\boldsymbol{c}_j}\phi(t, \boldsymbol{\delta}_j, \tilde{\boldsymbol{c}}_j')^{\top}(\tilde{\boldsymbol{c}}_j - \tilde{\boldsymbol{c}}_j') \\
&\quad + \int_0^1 \big[\nabla_{\boldsymbol{c}_j}\phi(t, \boldsymbol{\delta}_j, \tilde{\boldsymbol{c}}_j' + y(\tilde{\boldsymbol{c}}_j - \tilde{\boldsymbol{c}}_j')) - \nabla_{\boldsymbol{c}_j}\phi(t, \boldsymbol{\delta}_j, \tilde{\boldsymbol{c}}_j')\big]^{\top}(\tilde{\boldsymbol{c}}_j - \tilde{\boldsymbol{c}}_j')\,dy \\
&\leq \phi(t, \boldsymbol{\delta}_j, \tilde{\boldsymbol{c}}_j') + \nabla_{\boldsymbol{c}_j}\phi(t, \boldsymbol{\delta}_j, \tilde{\boldsymbol{c}}_j')^{\top}(\tilde{\boldsymbol{c}}_j - \tilde{\boldsymbol{c}}_j') + \frac{L(t)}{2}\|\tilde{\boldsymbol{c}}_j - \tilde{\boldsymbol{c}}_j'\|^2,
\end{aligned}$$

which completes the proof. ∎

Next, we show the proof of Theorem 2.

Proof of Theorem 2.

We prove the decreasing property of TKM in two parts: refinement and assignment. The refinement part follows [13], which establishes convergence for objective functions with Lipschitz continuous gradients. Since $\phi(t,\boldsymbol{\delta}_j,\boldsymbol{c}_j)$ is gradient Lipschitz continuous with respect to $\boldsymbol{c}_j$, applying Proposition 2 shows that the iterations of SGD satisfy the following inequality:

$$
\begin{aligned}
&\mathbb{E}[\phi(t,\boldsymbol{\delta}_j^{it},\boldsymbol{c}_j^{it+1})]-\phi(t,\boldsymbol{\delta}_j^{it},\boldsymbol{c}_j^{it}) \\
&\quad\leq -\eta\,\nabla_{\boldsymbol{c}_j}\phi(t,\boldsymbol{\delta}_j^{it},\boldsymbol{c}_j^{it})^{\top}\,\mathbb{E}[g(\mathcal{B},\boldsymbol{c}_j^{it})]+\frac{1}{2}\eta^2 L(t)\,\mathbb{E}[\|g(\mathcal{B},\boldsymbol{c}_j^{it})\|^2].
\end{aligned}
\tag{35}
$$

According to the Cauchy-Schwarz inequality and Assumption 1, it holds that

$$
\|\mathbb{E}[g(\mathcal{B},\boldsymbol{c}_j^{it})]\|^2\geq\mu^2\|\nabla_{\boldsymbol{c}_j}\phi(t,\boldsymbol{\delta}_j^{it},\boldsymbol{c}_j^{it})\|^2.
\tag{36}
$$

Next, we bound $\mathbb{E}[\|g(\mathcal{B},\boldsymbol{c}_j^{it})\|^2]$ under Assumption 1 as follows:

$$
\begin{aligned}
\mathbb{E}[\|g(\mathcal{B},\boldsymbol{c}_j^{it})\|^2]
&=\mathbb{E}[\|g(\mathcal{B},\boldsymbol{c}_j^{it})-\mathbb{E}[g(\mathcal{B},\boldsymbol{c}_j^{it})]\|^2]+\|\mathbb{E}[g(\mathcal{B},\boldsymbol{c}_j^{it})]\|^2 \\
&\leq\nu+\nu_G\cdot\|\nabla_{\boldsymbol{c}_j}\phi(t,\boldsymbol{\delta}_j^{it},\boldsymbol{c}_j^{it})\|^2,
\end{aligned}
\tag{37}
$$

where $\nu_G:=\nu_H+\mu_G^2\geq\mu^2$. Then, by applying Assumption 1 and (37) to (35), we obtain

$$
\begin{aligned}
&\mathbb{E}[\phi(t,\boldsymbol{\delta}_j^{it},\boldsymbol{c}_j^{it+1})]-\phi(t,\boldsymbol{\delta}_j^{it},\boldsymbol{c}_j^{it}) \\
&\quad\leq-\Bigl(\mu-\frac{1}{2}\eta\nu_G L(t)\Bigr)\eta\,\|\nabla_{\boldsymbol{c}_j}\phi(t,\boldsymbol{\delta}_j^{it},\boldsymbol{c}_j^{it})\|^2+\frac{1}{2}\eta^2\nu L(t).
\end{aligned}
\tag{38}
$$

To ensure that the objective function value decreases during refinement, we need $\mu-\frac{1}{2}\eta\nu_G L(t)>0$, that is, $\eta<\frac{2\mu}{\nu_G L(t)}$; since $\nu_G\geq\mu^2$, this upper limit satisfies $\frac{2\mu}{\nu_G L(t)}\leq\frac{2}{\mu L(t)}$. Next, we prove the decreasing property in the assignment step. By the optimality condition for $\boldsymbol{\delta}_j$, the following inequality holds:

$$
\mathbb{E}[\phi(t,\boldsymbol{\delta}_j^{it+1},\boldsymbol{c}_j^{it+1})]\leq\mathbb{E}[\phi(t,\boldsymbol{\delta}_j^{it},\boldsymbol{c}_j^{it+1})].
\tag{39}
$$

Combining (38) and (39) yields

$$
\mathbb{E}[\phi(t,\boldsymbol{\delta}_j^{it+1},\boldsymbol{c}_j^{it+1})]\leq\phi(t,\boldsymbol{\delta}_j^{it},\boldsymbol{c}_j^{it}).
\tag{40}
$$

Summing (40) over $j$ from $1$ to $k$ proves Theorem 2. ∎
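As an illustrative sketch (not part of the original analysis), the right-hand side of (38) and the step-size condition derived in the proof can be evaluated for given constants. The function names and all numerical values below are hypothetical; $\mu$, $\nu$, $\nu_G$ and $L(t)$ stand for the constants of Assumption 1 and the Lipschitz constant used above.

```python
def max_step_size(mu, nu_G, L_t):
    """Upper limit 2*mu / (nu_G * L(t)) on eta, from mu - eta*nu_G*L(t)/2 > 0."""
    if nu_G < mu ** 2:
        raise ValueError("the proof assumes nu_G >= mu^2")
    return 2.0 * mu / (nu_G * L_t)


def refinement_bound(eta, mu, nu, nu_G, L_t, grad_sq_norm):
    """Right-hand side of (38): bound on the expected one-step change of phi."""
    descent = (mu - 0.5 * eta * nu_G * L_t) * eta * grad_sq_norm
    noise = 0.5 * eta ** 2 * nu * L_t
    return -descent + noise


# Hypothetical constants, chosen only for illustration.
mu, nu, nu_G, L_t = 0.5, 0.1, 1.0, 2.0
eta_max = max_step_size(mu, nu_G, L_t)
bound = refinement_bound(eta=0.1, mu=mu, nu=nu, nu_G=nu_G, L_t=L_t, grad_sq_norm=1.0)
```

When $\nu>0$, the bound is negative only if the squared gradient norm is large relative to the noise term $\frac{1}{2}\eta^2\nu L(t)$, so the sketch is best read as a check on the coefficients in (38).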

8.3 Proof of Theorem 3

We begin by defining the tilted weight, tilted empirical mean, and tilted empirical variance when all data points are normalized to a unit norm.

Definition 5 (Tilted gradient and weight).

Suppose the dataset is normalized. Then the tilted weight is defined as

$$
w_i(t,\mathcal{S}_j,\boldsymbol{c}_j):=\frac{e^{t\|\boldsymbol{x}_i-\boldsymbol{c}_j\|^2}}{\sum_{\boldsymbol{x}_i\in\mathcal{S}_j}e^{t\|\boldsymbol{x}_i-\boldsymbol{c}_j\|^2}}=\frac{1}{|\mathcal{S}_j|}e^{-2t\boldsymbol{c}_j^{\top}\boldsymbol{x}_i-\Gamma(t,\mathcal{S}_j,\boldsymbol{c}_j)},
$$

where $\Gamma(t,\mathcal{S}_j,\boldsymbol{c}_j):=\log\frac{1}{|\mathcal{S}_j|}\sum_{\boldsymbol{x}_i\in\mathcal{S}_j}e^{-2t\boldsymbol{c}_j^{\top}\boldsymbol{x}_i}$.
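The two expressions for the tilted weight in Definition 5 can be compared numerically. The following is a minimal sketch in plain Python, assuming unit-norm points; the function names and the toy data are hypothetical and are not taken from the paper's implementation.

```python
import math


def tilted_weights(points, center, t):
    """First form: exp(t*||x_i - c||^2) normalized over the cluster."""
    sq_dists = [sum((x - c) ** 2 for x, c in zip(p, center)) for p in points]
    # Subtract the largest exponent before exponentiating so that large t
    # does not overflow; the shift cancels in the ratio.
    shift = max(t * d for d in sq_dists)
    exps = [math.exp(t * d - shift) for d in sq_dists]
    total = sum(exps)
    return [e / total for e in exps]


def tilted_weights_normalized(points, center, t):
    """Second form, valid for unit-norm points: (1/|S|) exp(-2t c^T x_i - Gamma)."""
    n = len(points)
    inner = [sum(x * c for x, c in zip(p, center)) for p in points]
    gamma = math.log(sum(math.exp(-2.0 * t * ip) for ip in inner) / n)
    return [math.exp(-2.0 * t * ip - gamma) / n for ip in inner]


# Hypothetical toy cluster: three unit-norm points in the plane.
points = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
center = (0.2, 0.3)
w_a = tilted_weights(points, center, t=0.5)
w_b = tilted_weights_normalized(points, center, t=0.5)
```

For $t>0$ the point farthest from the centroid receives the largest weight, and for $t=0$ the weights are uniform.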

Definition 6 (Tilted empirical mean and variance).

Suppose the dataset is normalized. The tilted empirical mean and variance in each cluster are defined as

$$
\begin{aligned}
\mathbb{E}_t\bigl(\mathrm{f}(\mathcal{S}_j,\boldsymbol{c}_j)\bigr)&:=\|\boldsymbol{c}_j\|^2+\sum_{\boldsymbol{x}_i\in\mathcal{S}_j}w_i(t,\mathcal{S}_j,\boldsymbol{c}_j)\|\boldsymbol{x}_i\|^2-\boldsymbol{c}_j^{\top}M(t,\mathcal{S}_j,\boldsymbol{c}_j), \\
\mathrm{Var}_t\bigl(\mathrm{f}(\mathcal{S}_j,\boldsymbol{c}_j)\bigr)&:=\mathbb{E}_t\Bigl(\boldsymbol{c}_j^{\top}\bigl(-2\boldsymbol{x}_i-M(t,\mathcal{S}_j,\boldsymbol{c}_j)\bigr)\Bigr)^2=\boldsymbol{c}_j^{\top}V(t,\mathcal{S}_j,\boldsymbol{c}_j)\boldsymbol{c}_j,
\end{aligned}
$$

where $M(t,\mathcal{S}_j,\boldsymbol{c}_j):=\sum_{\boldsymbol{x}_i\in\mathcal{S}_j}2w_i(t,\mathcal{S}_j,\boldsymbol{c}_j)\boldsymbol{x}_i$, and

$$
\begin{aligned}
V(t,\mathcal{S}_j,\boldsymbol{c}_j)&:=\mathbb{E}_t\bigl(-2\boldsymbol{x}_i-M(t,\mathcal{S}_j,\boldsymbol{c}_j)\bigr)^{\top}\bigl(-2\boldsymbol{x}_i-M(t,\mathcal{S}_j,\boldsymbol{c}_j)\bigr) \\
&=\sum_{\boldsymbol{x}_i\in\mathcal{S}_j}w_i(t,\mathcal{S}_j,\boldsymbol{c}_j)\bigl(-2\boldsymbol{x}_i-M(t,\mathcal{S}_j,\boldsymbol{c}_j)\bigr)^{\top}\bigl(-2\boldsymbol{x}_i-M(t,\mathcal{S}_j,\boldsymbol{c}_j)\bigr).
\end{aligned}
$$
Lemma 6 (Partial derivatives of $M(t,\mathcal{S}_j,\boldsymbol{c}_j)$ and $\Gamma(t,\mathcal{S}_j,\boldsymbol{c}_j)$).

For any $t\geq 0$ and any $\boldsymbol{c}_j\in\mathbb{R}^d$, it holds that

$$
\frac{\partial}{\partial t}M(t,\mathcal{S}_j,\boldsymbol{c}_j)=-V(t,\mathcal{S}_j,\boldsymbol{c}_j)\boldsymbol{c}_j,
\tag{41}
$$

$$
\nabla_{\boldsymbol{c}_j}M(t,\mathcal{S}_j,\boldsymbol{c}_j)=-tV(t,\mathcal{S}_j,\boldsymbol{c}_j),
\tag{42}
$$

$$
\frac{\partial}{\partial t}\Gamma(t,\mathcal{S}_j,\boldsymbol{c}_j)=-\boldsymbol{c}_j^{\top}M(t,\mathcal{S}_j,\boldsymbol{c}_j),
\tag{43}
$$

$$
\nabla_{\boldsymbol{c}_j}\Gamma(t,\mathcal{S}_j,\boldsymbol{c}_j)=-tM(t,\mathcal{S}_j,\boldsymbol{c}_j).
\tag{44}
$$
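The two identities for $\Gamma$ in Lemma 6, (43) and (44), can be checked against central finite differences. The sketch below is only a numerical sanity check under stated assumptions: the product in (43) is read as the inner product $\boldsymbol{c}_j^{\top}M$, the points have unit norm, and all function names and data are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gamma(points, c, t):
    """Gamma(t, S, c) = log( (1/|S|) * sum_i exp(-2 t c^T x_i) )."""
    return math.log(sum(math.exp(-2.0 * t * dot(c, x)) for x in points) / len(points))


def tilted_M(points, c, t):
    """M(t, S, c) = sum_i 2 w_i x_i, with w_i = (1/|S|) exp(-2 t c^T x_i - Gamma)."""
    g = gamma(points, c, t)
    n = len(points)
    w = [math.exp(-2.0 * t * dot(c, x) - g) / n for x in points]
    return [2.0 * sum(wi * x[k] for wi, x in zip(w, points)) for k in range(len(c))]


def central_diff(f, z, h=1e-6):
    """Central finite-difference approximation of f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)


# Hypothetical unit-norm points and centroid.
points = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.6, 0.8)]
c, t = [0.2, 0.3], 0.7
M = tilted_M(points, c, t)

# (43): dGamma/dt compared with -c^T M.
err_43 = abs(central_diff(lambda s: gamma(points, c, s), t) + dot(c, M))

# (44): each component of grad_c Gamma compared with -t * M.
err_44 = 0.0
for k in range(len(c)):
    def gamma_along_axis(v, k=k):
        shifted = list(c)
        shifted[k] = v
        return gamma(points, shifted, t)
    err_44 = max(err_44, abs(central_diff(gamma_along_axis, c[k]) + t * M[k]))
```

Both errors should be at the level of the finite-difference truncation and rounding error.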
Proof of Theorem 3.

Let $\boldsymbol{c}_j(t):=\operatorname{Tm}(t,\mathcal{S}_j)$ be the solution of (5.1). Substituting $t$, $\mathcal{S}_j$ and $\boldsymbol{c}_j(t)$ into the tilted weight gives $\hat{w}_i:=w_i(t,\mathcal{S}_j,\boldsymbol{c}_j(t))$, and the tilted empirical mean and variance for each cluster become

$$
\begin{aligned}
\mathbb{E}_t\bigl(\mathrm{f}(\mathcal{S}_j,\boldsymbol{c}_j)\bigr)&=\sum_{\boldsymbol{x}_i\in\mathcal{S}_j}\hat{w}_i\cdot f(\boldsymbol{x}_i,\boldsymbol{c}_j) \\
&=\|\boldsymbol{c}_j\|^2+\sum_{\boldsymbol{x}_i\in\mathcal{S}_j}\hat{w}_i\|\boldsymbol{x}_i\|^2-\boldsymbol{c}_j^{\top}M_t, \\
\mathrm{Var}_t\bigl(\mathrm{f}(\mathcal{S}_j,\boldsymbol{c}_j)\bigr)&=\mathbb{E}_t\Bigl(\boldsymbol{c}_j^{\top}\bigl(-2\boldsymbol{x}_i-M_t\bigr)\Bigr)^2=\boldsymbol{c}_j^{\top}V_t\boldsymbol{c}_j,
\end{aligned}
$$

where $M_t:=2\sum_{\boldsymbol{x}_i\in\mathcal{S}_j}\hat{w}_i\cdot\boldsymbol{x}_i$ and $V_t:=\sum_{\boldsymbol{x}_i\in\mathcal{S}_j}\hat{w}_i\bigl(-2\boldsymbol{x}_i-M_t\bigr)^{\top}\bigl(-2\boldsymbol{x}_i-M_t\bigr)$ are constants. Then, taking the derivative of $\mathrm{Var}_{\tau}\bigl(\mathrm{f}(\mathcal{S}_j,\boldsymbol{c}_j(t))\bigr)$ with respect to $t$, we have

\begin{align}
\frac{\partial}{\partial t}\Bigl\{\mathrm{Var}_{\tau}\Bigl(\mathrm{f}\bigl(\mathcal{S}_j,\boldsymbol{c}_j(t)\bigr)\Bigr)\Bigr\}
&=\Bigl(\frac{\partial}{\partial t}\boldsymbol{c}_j(t)\Bigr)^{\top}\cdot\nabla_{\boldsymbol{c}_j}\Bigl\{\mathrm{Var}_{\tau}\Bigl(\mathrm{f}\bigl(\mathcal{S}_j,\boldsymbol{c}_j(t)\bigr)\Bigr)\Bigr\} \notag\\
&=2\Bigl(\frac{\partial}{\partial t}\boldsymbol{c}_j(t)\Bigr)^{\top}V_{\tau}\boldsymbol{c}_j(t). \tag{45}
\end{align}

From the optimality condition with respect to $\boldsymbol{c}_j$, we have

$$0=\sum_{\boldsymbol{x}_i\in\mathcal{S}_j}e^{t\|\boldsymbol{x}_i-\boldsymbol{c}_j(t)\|^{2}}\bigl(\boldsymbol{x}_i-\boldsymbol{c}_j(t)\bigr).\tag{46}$$

Dividing both sides of (46) by $-\frac{1}{2}\sum_{\boldsymbol{x}_i\in\mathcal{S}_j}e^{t\|\boldsymbol{x}_i-\boldsymbol{c}_j(t)\|^{2}}$ and differentiating with respect to $t$ yields

\begin{align}
0&=\frac{\partial}{\partial t}\Bigl\{\sum_{\boldsymbol{x}_i\in\mathcal{S}_j}w_i(t,\mathcal{S}_j,\boldsymbol{c}_j(t))\cdot 2\bigl(\boldsymbol{c}_j(t)-\boldsymbol{x}_i\bigr)\Bigr\} \notag\\
&=\frac{\partial}{\partial t}\Bigl\{2\boldsymbol{c}_j(t)-M(t,\mathcal{S}_j,\boldsymbol{c}_j(t))\Bigr\} \notag\\
&=\frac{\partial\boldsymbol{c}_j(t)}{\partial t}\Bigl(2-\nabla_{\boldsymbol{c}_j}M(t,\mathcal{S}_j,\boldsymbol{c}_j(t))\Bigr)-\frac{\partial}{\partial\tau}M(\tau,\mathcal{S}_j,\boldsymbol{c}_j(t))\Big|_{\tau=t} \tag{47}\\
&=\frac{\partial\boldsymbol{c}_j(t)}{\partial t}\Bigl(2+tV(t,\mathcal{S}_j,\boldsymbol{c}_j(t))\Bigr)+V(t,\mathcal{S}_j,\boldsymbol{c}_j(t))\,\boldsymbol{c}_j(t), \tag{48}
\end{align}

where (47) follows from the chain rule, and (48) follows from Lemma 6. Then we can infer from (48) that

$$\frac{\partial\boldsymbol{c}_j(t)}{\partial t}=-V(t,\mathcal{S}_j,\boldsymbol{c}_j(t))\,\boldsymbol{c}_j(t)\cdot\frac{1}{2+tV(t,\mathcal{S}_j,\boldsymbol{c}_j(t))}.\tag{49}$$

Substituting (49) into (45), we obtain

\begin{align}
\frac{\partial}{\partial t}\Bigl\{\mathrm{Var}_{\tau}\Bigl(\mathrm{f}\bigl(\mathcal{S}_j,\boldsymbol{c}_j(t)\bigr)\Bigr)\Bigr\}
&=2\Bigl(\frac{\partial}{\partial t}\boldsymbol{c}_j(t)\Bigr)^{\top}V_{\tau}\boldsymbol{c}_j(t) \notag\\
&=\underbrace{-\frac{\boldsymbol{c}_j(t)^{\top}V(t,\mathcal{S}_j,\boldsymbol{c}_j(t))\,V_{\tau}\boldsymbol{c}_j(t)}{2+tV(t,\mathcal{S}_j,\boldsymbol{c}_j(t))}}_{<0}, \notag
\end{align}

which completes the proof. ∎
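As an informal numerical sanity check of the statement just proved, the following sketch approximates the centroid satisfying the stationarity condition (46) by gradient descent and reports the variance of the squared distances for increasing $t$. It is not the authors' implementation: the function names, the step-size rule, the iteration count, and the toy data set are assumptions made for illustration, and the variance is computed with uniform weights (here taken to correspond to $\tau=0$).

```python
import numpy as np


def tilted_weights(X, c, t):
    """Normalized weights w_i proportional to exp(t * ||x_i - c||^2)."""
    z = t * ((X - c) ** 2).sum(axis=1)
    z -= z.max()  # shift for numerical stability; the weights are unchanged
    w = np.exp(z)
    return w / w.sum()


def tilted_centroid(X, t, iters=2000):
    """Approximate the centroid satisfying the stationarity condition (46).

    Assumption: plain gradient descent with step 1/L, where
    L = 2 + 4 * t * max_i ||x_i - mean||^2 upper-bounds the curvature.
    """
    X = np.asarray(X, dtype=float)
    c = X.mean(axis=0)  # the t = 0 solution is the ordinary mean
    if t == 0:
        return c
    r2 = ((X - c) ** 2).sum(axis=1).max()
    step = 1.0 / (2.0 + 4.0 * t * r2)
    for _ in range(iters):
        w = tilted_weights(X, c, t)
        grad = 2.0 * (c - w @ X)  # equals sum_i w_i * 2 * (c - x_i)
        c = c - step * grad
    return c


def sq_dist_variance(X, c):
    """Uniform-weight variance of the squared distances to c."""
    return ((np.asarray(X, dtype=float) - c) ** 2).sum(axis=1).var()


# Hypothetical toy cluster of unit-norm points: three at (-1, 0), one at (1, 0).
X_toy = np.array([[-1.0, 0.0], [-1.0, 0.0], [-1.0, 0.0], [1.0, 0.0]])
for t in [0.0, 0.25, 0.5, 1.0, 2.0]:
    c_t = tilted_centroid(X_toy, t)
    print(f"t={t:<4} centroid={c_t} variance={sq_dist_variance(X_toy, c_t):.4f}")
```

On this toy cluster the centroid moves from the mean toward the midpoint of the two locations as $t$ grows, which equalizes the squared distances and shrinks their variance. A fixed-point update $\boldsymbol{c}\leftarrow\sum_i w_i\boldsymbol{x}_i$ would also satisfy (46) at convergence; gradient descent is used here only because its step size is easy to bound.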

8.4 Proof of Theorem 4

Proof.

Initializing the centroids with $k$-means++ requires $O(nkd)$ multiplications. In each iteration, the assignment and refinement steps require $O(nkd)$ and $O(nkdE)$ multiplications, respectively, so one iteration costs $O(nkdE)$. With the number of iterations set to $T$, the total number of multiplications required by TKM is $O(nkd+nkdET)=O(nkdET)$. ∎
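The arithmetic behind this bound can be written as a small cost model. This is only a sketch: the function name is hypothetical, every hidden constant in the $O(\cdot)$ terms is set to 1, and $n$, $k$, $d$, $E$, $T$ are used exactly as in the proof.

```python
def tkm_multiplications(n, k, d, E, T):
    """Rough multiplication count for TKM with all hidden constants set to 1."""
    init = n * k * d          # k-means++ initialization, done once
    assign = n * k * d        # assignment step, per iteration
    refine = n * k * d * E    # refinement step, per iteration
    return init + T * (assign + refine)


# Doubling the data set size doubles the modeled cost (linear in n).
base = tkm_multiplications(n=1000, k=5, d=10, E=3, T=20)
doubled = tkm_multiplications(n=2000, k=5, d=10, E=3, T=20)
print(base, doubled, doubled / base)
```

Under this model the total never exceeds $3\,nkdET$ when $E,T\geq 1$, which is the sense in which the count is $O(nkdET)$.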

8.5 Proof of Theorem 5

Proof.

When $k=1$, we obtain

$$\overline{\phi}(t,\mathcal{S},\boldsymbol{c})=\frac{1}{t}\log\frac{1}{n}\sum_{i=1}^{n}e^{tf(\boldsymbol{x}_i,\boldsymbol{c})},\tag{50}$$

where $\mathcal{S}=X$ and $\boldsymbol{c}=\operatorname{Tm}(t,\mathcal{S})$ are the unique cluster and centroid. We directly take the partial derivative of $\overline{\phi}(t,\mathcal{S},\boldsymbol{c})$ with respect to $t$, yielding:

\begin{align}
&\frac{\partial\Pi(t,\mathcal{S}_j,\boldsymbol{c}_j)}{\partial t} \notag\\
=\;&\frac{1}{t}\frac{\sum_{i=1}^{n}e^{t\|\boldsymbol{x}_i-\boldsymbol{c}\|^{2}}\|\boldsymbol{x}_i-\boldsymbol{c}\|^{2}}{\sum_{i=1}^{n}e^{t\|\boldsymbol{x}_i-\boldsymbol{c}\|^{2}}}-\frac{1}{t^{2}}\log\frac{1}{n}\sum_{i=1}^{n}e^{t\|\boldsymbol{x}_i-\boldsymbol{c}\|^{2}} \notag\\
=\;&-\frac{1}{t}\boldsymbol{c}_j^{\top}\sum_{i=1}^{n}2w_i(t,\mathcal{S},\boldsymbol{c})\boldsymbol{x}_i-\frac{1}{t^{2}}\log\frac{1}{n}\sum_{i=1}^{n}e^{-2t\boldsymbol{c}^{\top}\boldsymbol{x}_i} \tag{51}\\
=\;&-\frac{1}{t}\boldsymbol{c}_j^{\top}M(t,\mathcal{S},\boldsymbol{c})-\frac{1}{t^{2}}\Gamma(t,\mathcal{S},\boldsymbol{c})=:g(t,\mathcal{S},\boldsymbol{c}), \tag{52}
\end{align}

where (51) follows from the fact that all data points are normalized, so that $\|\boldsymbol{x}_i-\boldsymbol{c}\|^{2}=\|\boldsymbol{x}_i\|^{2}+\|\boldsymbol{c}\|^{2}-2\boldsymbol{c}^{\top}\boldsymbol{x}_i$ differs from $-2\boldsymbol{c}^{\top}\boldsymbol{x}_i$ only by a constant common to all points, which cancels between the two terms, and (52) defines $g(t,\mathcal{S}_j,\boldsymbol{c}_j)$. Next, we consider

\begin{align}
\frac{\partial}{\partial t}\bigl\{t^{2}g(t,\mathcal{S},\boldsymbol{c})\bigr\}
&=\frac{\partial}{\partial t}\Bigl\{-t\boldsymbol{c}^{\top}M(t,\mathcal{S},\boldsymbol{c})-\Gamma(t,\mathcal{S},\boldsymbol{c})\Bigr\} \notag\\
&=t\boldsymbol{c}^{\top}V(t,\mathcal{S},\boldsymbol{c})\boldsymbol{c}, \tag{53}
\end{align}

where (53) follows from (41), (43) and the chain rule. Since $t\boldsymbol{c}^{\top}V(t,\mathcal{S},\boldsymbol{c})\boldsymbol{c}\geq 0$ for any $t\geq 0$, $t^{2}g(t,\mathcal{S},\boldsymbol{c})$ is monotonically non-decreasing in $t$ on $[0,\infty)$, and its minimum value is attained at $t=0$. When $t=0$, we have

\begin{align}
g(0,\mathcal{S},\boldsymbol{c})&:=\lim_{t\rightarrow 0}-\frac{\Gamma(t,\mathcal{S},\boldsymbol{c})+t\boldsymbol{c}_j^{\top}M(t,\mathcal{S},\boldsymbol{c})}{t^{2}} \notag\\
&=\frac{1}{2}\boldsymbol{c}^{\top}V(0,\mathcal{S},\boldsymbol{c})\boldsymbol{c}, \tag{54}
\end{align}

where (54) follows from (41), (43) and L’Hôpital’s rule: the numerator vanishes at $t=0$ and, by (53), its derivative is $t\boldsymbol{c}^{\top}V(t,\mathcal{S},\boldsymbol{c})\boldsymbol{c}$, while the derivative of the denominator is $2t$. Since $g(0,\mathcal{S},\boldsymbol{c})$ is finite, $t^{2}g(t,\mathcal{S},\boldsymbol{c})$ equals $0$ at $t=0$, and by monotonicity we obtain $t^{2}g(t,\mathcal{S},\boldsymbol{c})\geq 0$, and consequently $g(t,\mathcal{S},\boldsymbol{c})\geq 0$ for any $t\geq 0$. Together with Equation (52), this implies Theorem 5. ∎
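The sign of the partial derivative can also be checked numerically. The sketch below evaluates the single-cluster objective (50) at a fixed centroid, matching the partial derivative taken in the proof, and prints its value for increasing $t$. The function name, the random unit-norm data, and the choice of the plain mean as the fixed centroid are assumptions made for illustration only.

```python
import numpy as np


def tilted_sse_single_cluster(X, c, t):
    """Evaluate (1/t) * log((1/n) * sum_i exp(t * ||x_i - c||^2)) as in (50).

    For t = 0 the limit is returned, i.e. the mean squared distance.
    """
    d2 = ((np.asarray(X, dtype=float) - c) ** 2).sum(axis=1)
    if t == 0:
        return d2.mean()
    z = t * d2
    m = z.max()  # log-sum-exp shift to avoid overflow for large t
    return (m + np.log(np.exp(z - m).mean())) / t


# Hypothetical data: 200 random points scaled to unit length.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
X_demo /= np.linalg.norm(X_demo, axis=1, keepdims=True)
c_demo = X_demo.mean(axis=0)  # any fixed centroid works for this check

for t in [0.0, 0.1, 0.5, 1.0, 2.0, 5.0]:
    value = tilted_sse_single_cluster(X_demo, c_demo, t)
    print(f"t={t:<4} tilted objective={value:.6f}")
```

The printed values should start at the mean squared distance for $t=0$ and never decrease, staying below the largest squared distance in the cluster.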

References

  • [1] Open university learning analytics dataset. https://analyse.kmi.open.ac.uk/open_dataset, 2015.
  • [2] 3d road network (north jutland, denmark). https://archive.ics.uci.edu/dataset/246/3d+road+network+north+jutland+denmark, 2017.
  • [3] The home mortgage disclosure act. https://ffiec.cfpb.gov/data-browser/, 2017.
  • [4] The U.S. census data. https://www.census.gov/glossary/#term_Populationestimates, 2021.
  • [5] A tutorial and resources for fair clustering. https://www.fairclustering.com/, 2022.
  • [6] Utrecht fairness recruitment dataset. https://www.kaggle.com/datasets/ictinstitute/utrecht-fairness-recruitment-dataset, 2022.
  • [7] M. Ankerst, M. M. Breunig, H. Kriegel, and J. Sander. OPTICS: ordering points to identify the clustering structure. In SIGMOD, pages 49–60, 1999.
  • [8] D. Arthur and S. Vassilvitskii. K-means++ the advantages of careful seeding. In SODA, pages 1027–1035, 2007.
  • [9] T. Athanasios and L. Max. UCI machine learning repository: Gender gap in spanish wp data set. https://archive.ics.uci.edu/ml/datasets/Parkinsons+Telemonitoring, 2009.
  • [10] P. Azoulay, T. E. Stuart, and Y. Wang. Matthew: Effect or fable? Manag. Sci., 60(1):92–109, 2014.
  • [11] A. Beirami, A. R. Calderbank, M. M. Christiansen, K. R. Duffy, and M. Médard. A characterization of guesswork on swiftly tilting curves. IEEE Trans. Inf. Theory, 65(5):2850–2871, 2019.
  • [12] S. K. Bera, D. Chakrabarty, N. Flores, and M. Negahbani. Fair algorithms for clustering. In NeurIPS, pages 4955–4966, 2019.
  • [13] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Rev., 60(2):223–311, 2018.
  • [14] B. Brubach, D. Chakrabarti, J. P. Dickerson, S. Khuller, A. Srinivasan, and L. Tsepenekas. A pairwise fair and community-preserving approach to k-center clustering. In ICML, pages 1178–1189, 2020.
  • [15] F. Buet-Golfouse and I. Utyagulov. Towards fair unsupervised learning. In FAccT, pages 1399–1409, 2022.
  • [16] R. W. Butler. Saddlepoint approximations with applications. Cambridge University Press, 2007.
  • [17] X. Cao, G. Cong, and C. S. Jensen. Mining significant semantic locations from gps data. Proc. VLDB Endow., 3(1):1009–1020, 2010.
  • [18] S. Caton and C. Haas. Fairness in machine learning: A survey. ACM Comput. Surv., 2023.
  • [19] D. Chakrabarti, J. P. Dickerson, S. A. Esmaeili, A. Srinivasan, and L. Tsepenekas. A new notion of individually fair clustering: α𝛼\alphaitalic_α-equitable k-center. In AISTATS, volume 151, pages 6387–6408, 2022.
  • [20] X. Chen, B. Fain, L. Lyu, and K. Munagala. Proportionally fair clustering. In ICML, volume 97, pages 1032–1041, 2019.
  • [21] R. Chhaya, A. Dasgupta, J. Choudhari, and S. Shit. On coresets for fair regression and individually fair clustering. In AISTATS, volume 151, pages 9603–9625, 2022.
  • [22] F. Chierichetti, R. Kumar, S. Lattanzi, and S. Vassilvitskii. Fair clustering through fairlets. In NeurIPS, pages 5029–5037, 2017.
  • [23] G. Cong, C. S. Jensen, and D. Wu. Efficient retrieval of the top-k most relevant spatial web objects. Proc. VLDB Endow., 2(1):337–348, 2009.
  • [24] T. M. Cover. Elements of information theory. John Wiley & Sons, 1999.
  • [25] A. Dembo and O. Zeitouni. Large deviations techniques and applications. Springer Science & Business Media, 2009.
  • [26] J. P. Dickerson, S. A. Esmaeili, J. H. Morgenstern, and C. J. Zhang. Doubly constrained fair clustering. In NeurIPS, 2024.
  • [27] Y. Dong, J. Ma, S. Wang, C. Chen, and J. Li. Fairness in graph mining: A survey. IEEE Trans. Knowl. Data Eng., 35(10):10583–10602, 2023.
  • [28] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. S. Zemel. Fairness through awareness. In ITCS, pages 214–226, 2012.
  • [29] P. Edara and M. Pasumansky. Big metadata : When metadata is big data. Proc. VLDB Endow., 14(12):3083–3095, 2021.
  • [30] W. Fan. Big graphs: Challenges and opportunities. Proc. VLDB Endow., 15(12):3782–3797, 2022.
  • [31] S. Gupta, R. Kumar, K. Lu, B. Moseley, and S. Vassilvitskii. Local search methods for k-means with outliers. Proc. VLDB Endow., 10(7):757–768, 2017.
  • [32] M. Hossein, Bateni, V. Cohen-Addad, A. Epasto, and S. Lattanzi. A scalable algorithm for individually fair k-means clustering. arXiv:2402.06730, 2024.
  • [33] L. Huang, S. H. Jiang, and N. K. Vishnoi. Coresets for clustering with fairness constraints. In NeurIPS, pages 7587–7598, 2019.
  • [34] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Comput. Surv., 31(3):264–323, 1999.
  • [35] C. Jung, S. Kannan, and N. Lutz. Service in your neighborhood: Fairness in center location. In FORC, volume 156, pages 5:1–5:15, 2020.
  • [36] R. Kohavi. Scaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid. In KDD, pages 202–207, 1996.
  • [37] T. Li, A. Beirami, M. Sanjabi, and V. Smith. On tilted losses in machine learning: Theory and applications. J. Mach. Learn. Res., 24:142:1–142:79, 2023.
  • [38] E. Liberty, Z. Karnin, B. Xiang, L. Rouesnel, B. Coskun, R. Nallapati, J. Delgado, A. Sadoughi, Y. Astashonok, P. Das, C. Balioglu, S. Chakravarty, M. Jha, P. Gautier, D. Arpin, T. Januschowski, V. Flunkert, Y. Wang, J. Gasthaus, L. Stella, S. Rangapuram, D. Salinas, S. Schelter, and A. Smola. Elastic machine learning algorithms in amazon sagemaker. In SIGMOD, page 731–737, 2020.
  • [39] S. P. Lloyd. Least squares quantization in PCM. IEEE Trans. Inf. Theory, 28(2):129–136, 1982.
  • [40] S. Mahabadi and A. Vakilian. Individual fairness for k-clustering. In ICML, volume 119, pages 6586–6596, 2020.
  • [41] C. Meek, B. Thiesson, and D. Heckerman. The learning-curve sampling method applied to model-based clustering. J. Mach. Learn. Res., 2:397–418, 2002.
  • [42] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan. A survey on bias and fairness in machine learning. ACM Comput. Surv., 54(6):115:1–115:35, 2022.
  • [43] N. Merhav. List decoding - random coding exponents and expurgated exponents. IEEE Trans. Inf. Theory, 60(11):6749–6759, 2014.
  • [44] R. K. Merton. The matthew effect in science. Science, 159(3810):56–63, 1968.
  • [45] S. Moro, P. Cortez, and P. Rita. A data-driven approach to predict the success of bank telemarketing. Decis. Support Syst., 62:22–31, 2014.
  • [46] K. O. Mortensen, F. Zardbani, M. A. Haque, S. Y. Agustsson, D. Mottin, P. Hofmann, and P. Karras. Marigold: Efficient k-means clustering in high dimensions. Proc. VLDB Endow., 16(7):1740–1748, 2023.
  • [47] F. Nargesian, E. Zhu, R. J. Miller, K. Q. Pu, and P. C. Arocena. Data lake management: Challenges and opportunities. Proc. VLDB Endow., 12(12):1986–1989, 2019.
  • [48] M. Negahbani and D. Chakrabarty. Better algorithms for individually fair k-clustering. In NeurIPS, pages 13340–13351, 2021.
  • [49] Y. E. Nesterov. Introductory Lectures on Convex Optimization - A Basic Course, volume 87. Springer, 2004.
  • [50] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. VanderPlas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in python. J. Mach. Learn. Res., 12:2825–2830, 2011.
  • [51] E. Y. Pee and J. O. Royset. On solving large-scale finite minimax problems using exponential smoothing. J. Optim. Theory Appl., 148(2):390–421, 2011.
  • [52] D. Sculley. Web-scale k-means clustering. In WWW, pages 1177–1178, 2010.
  • [53] S. Shaham, G. Ghinita, and C. Shahabi. Models and mechanisms for spatial data fairness. Proc. VLDB Endow., 16(2):167–179, 2022.
  • [54] S. Shang, L. Chen, Z. Wei, C. S. Jensen, J. Wen, and P. Kalnis. Collective travel planning in spatial networks. IEEE Trans. Knowl. Data Eng., 28(5):1132–1146, 2016.
  • [55] C. Shen and H. Li. On the dual formulation of boosting algorithms. IEEE Trans. Pattern Anal. Mach. Intell., 32(12):2216–2231, 2010.
  • [56] D. Siegmund. Importance sampling in the monte carlo study of sequential tests. Ann. Stat., pages 673–684, 1976.
  • [57] B. Strack, J. P. DeShazo, C. Gennings, J. L. Olmo, S. Ventura, K. J. Cios, J. N. Clore, et al. Impact of hba1c measurement on hospital readmission rates: Analysis of 70,000 clinical database patient records. Biomed Res. Int., 2014.
  • [58] A. Szabó, H. J. Rad, and S. Mannava. Tilted cross-entropy (TCE): promoting fairness in semantic segmentation. In CVPR, pages 2305–2310, 2021.
  • [59] R. Tang and Y. Yang. Bayesian inference for risk minimization via exponentially tilted empirical likelihood. J. R. Stat. Soc. B., 84(4):1257–1286, 2022.
  • [60] A. Vakilian and M. Yalçiner. Improved approximation algorithms for individually fair clustering. In AISTATS, volume 151, pages 8758–8779, 2022.
  • [61] D. Van Aken, A. Pavlo, G. J. Gordon, and B. Zhang. Automatic database management system tuning through large-scale machine learning. In SIGMOD, page 1009–1024, 2017.
  • [62] S. Wang, Y. Sun, and Z. Bao. On the efficiency of k-means clustering: Evaluation, optimization, and algorithm selection. Proc. VLDB Endow., 14(2):163–175, 2020.
  • [63] Y. Wang, H. Chen, W. Liu, F. He, T. Gong, Y. Fu, and D. Tao. Tilted sparse additive models. In ICML, volume 202, pages 35579–35604, 2023.
  • [64] X. Wu, X. Zhu, G. Wu, and W. Ding. Data mining with big data. IEEE Trans. Knowl. Data Eng., 26(1):97–107, 2014.
  • [65] H. Zhang, G. Chen, B. C. Ooi, K. Tan, and M. Zhang. In-memory big data management and processing: A survey. IEEE Trans. Knowl. Data Eng., 27(7):1920–1948, 2015.
  • [66] Y. Zhou, H. Cheng, and J. X. Yu. Graph clustering based on structural/attribute similarities. Proc. VLDB Endow., 2(1):718–729, 2009.
  • [67] S. Zhu, Q. Xu, J. Zeng, S. Wang, Y. Sun, Z. Yang, C. Yang, and Z. Peng. F3KM: federated, fair, and fast k-means. Proc. ACM Manag. Data, 1(4):241:1–241:25, 2023.
Shengkun Zhu received the BE degree in electronic information engineering from Dalian University of Technology, China in 2018. He is currently working toward a Ph.D. degree in computer science and technology, School of Computer Science, Wuhan University, China. His research interests mainly include fair clustering, federated learning and nonconvex optimization.
Jinshan Zeng received the Ph.D. degree in mathematics from Xi’an Jiaotong University, Xi’an, China, in 2015. He is currently a Distinguished Professor with the School of Computer and Information Engineering, Jiangxi Normal University, Nanchang, China, and serves as the Director of the Department of Data Science and Big Data. He has authored more than 40 papers in high-impact journals and conferences such as IEEE TPAMI, JMLR, IEEE TSP, ICML, and AAAI. He has coauthored two papers with collaborators that received the International Consortium of Chinese Mathematicians (ICCM) Best Paper Award in 2018 and 2020. His current research interests include nonconvex optimization, machine learning (in particular deep learning), and remote sensing.
Yuan Sun is a Lecturer in Business Analytics and Artificial Intelligence at La Trobe University, Australia. He received his BSc in Applied Mathematics from Peking University, China, and his PhD in Computer Science from The University of Melbourne, Australia. His research interests are in artificial intelligence, machine learning, operations research, and evolutionary computation. He has contributed significantly to the emerging research area of leveraging machine learning for combinatorial optimisation. His research has been published in top-tier journals and conferences such as IEEE TPAMI, IEEE TEVC, EJOR, NeurIPS, ICLR, VLDB, ICDE, and AAAI.
Sheng Wang received the BE degree in information security and the ME degree in computer technology from Nanjing University of Aeronautics and Astronautics, China, in 2013 and 2016, respectively, and the Ph.D. degree from RMIT University in 2019. He is a professor at the School of Computer Science, Wuhan University. His research interests mainly include mobile databases, multi-modal data management, and fairness-aware data analysis. He has published full research papers as first author in top database and information systems venues such as TKDE, SIGMOD, PVLDB, and ICDE.
Xiaodong Li (Fellow, IEEE) received the B.Sc. degree from Xidian University, Xi’an, China, in 1988, and the Ph.D. degree in information science from the University of Otago, Dunedin, New Zealand, in 1998. He is a Professor with the School of Science (Computer Science and Software Engineering), RMIT University, Melbourne, VIC, Australia. His research interests include machine learning, evolutionary computation, neural networks, data analytics, multiobjective optimization, multimodal optimization, and swarm intelligence. Prof. Li is the recipient of the 2013 ACM SIGEVO Impact Award and the 2017 IEEE CIS IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION Outstanding Paper Award. He serves as an Associate Editor for the IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION, Swarm Intelligence (Springer), and International Journal of Swarm Intelligence Research. He is a Founding Member of IEEE CIS Task Force on Swarm Intelligence, the Vice-Chair of IEEE Task Force on Multimodal Optimization, and the Former Chair of IEEE CIS Task Force on Large Scale Global Optimization.
Zhiyong Peng received the BSc degree from Wuhan University, in 1985, the MEng degree from the Changsha Institute of Technology, China, in 1988, and the PhD degree from Kyoto University, Japan, in 1995. He is a professor with the School of Computer Science, Wuhan University, China. He worked as a researcher in Advanced Software Technology & Mechatronics Research Institute of Kyoto from 1995 to 1997 and as a member of technical staff in Hewlett-Packard Laboratories Japan from 1997 to 2000. His research interests include complex data management, web data management, and trusted data management. He is a member of the IEEE Computer Society and ACM SIGMOD, and vice director of Database Society of Chinese Computer Federation. He was general co-chair of WAIM 2011 and DASFAA 2013, and PC co-chair of DASFAA 2012, WISE 2006, and CIT 2004.