research-article

F3KM: Federated, Fair, and Fast k-means

Authors:

Zhiyong Peng

Proceedings of the ACM on Management of Data, Volume 1, Issue 4

Article No.: 241, Pages 1 - 25

https://doi.org/10.1145/3626728

Published: 12 December 2023

Abstract

This paper proposes a federated, fair, and fast k-means algorithm (F3KM) that solves the fair clustering problem efficiently in scenarios where data cannot be shared among different parties. The algorithm decomposes the fair k-means problem into multiple subproblems and assigns each subproblem to a client for local computation. Each client may hold multiple sensitive attributes or none at all. We propose an in-processing method that employs the alternating direction method of multipliers (ADMM) to solve each subproblem. While the subproblems are being solved, the server and the clients exchange only computation results, never the raw data. Our theoretical analysis shows that F3KM is efficient in terms of both communication and computation complexity. Specifically, it achieves a better trade-off between utility and communication complexity, and reduces the computation complexity to linear in the dataset size. Our experiments show that F3KM achieves a better trade-off between utility and fairness than other methods. Moreover, F3KM can cluster five million points in one hour, highlighting its efficiency.
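To make the "only computation results are exchanged" idea concrete, here is a minimal sketch of one common federated pattern for k-means. It is not the F3KM algorithm: it uses plain Lloyd-style updates instead of ADMM, it has no fairness constraint, and it assumes the data is split by rows across clients. All names (`client_update`, `server_aggregate`, `federated_kmeans`) and the toy data are hypothetical and chosen for illustration only.

```python
from typing import List, Tuple

Point = Tuple[float, ...]

def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def client_update(points: List[Point], centers: List[Point]):
    """Runs on a client: returns per-cluster coordinate sums and counts.

    Only these aggregates leave the client; the raw points do not.
    """
    k, d = len(centers), len(centers[0])
    sums = [[0.0] * d for _ in range(k)]
    counts = [0] * k
    for p in points:
        # Assign the point to its nearest current center.
        j = min(range(k), key=lambda c: sq_dist(p, centers[c]))
        counts[j] += 1
        for t in range(d):
            sums[j][t] += p[t]
    return sums, counts

def server_aggregate(updates, centers: List[Point]) -> List[Point]:
    """Runs on the server: merges client aggregates into new centers."""
    k, d = len(centers), len(centers[0])
    new_centers = []
    for j in range(k):
        total = sum(counts[j] for _, counts in updates)
        if total == 0:
            # No client assigned any point here; keep the old center.
            new_centers.append(centers[j])
            continue
        new_centers.append(tuple(
            sum(sums[j][t] for sums, _ in updates) / total
            for t in range(d)))
    return new_centers

def federated_kmeans(clients, centers: List[Point], rounds: int = 10):
    """Alternates local client computation and server aggregation."""
    for _ in range(rounds):
        updates = [client_update(pts, centers) for pts in clients]
        centers = server_aggregate(updates, centers)
    return centers

# Toy data (assumed for illustration): two clients, two obvious clusters.
clients = [
    [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0)],
    [(2.0, 0.0), (2.0, 2.0), (10.0, 12.0), (12.0, 10.0), (12.0, 12.0)],
]
final_centers = federated_kmeans(clients, [(0.0, 0.0), (10.0, 10.0)], rounds=5)
print(final_centers)
```

In this sketch each round costs every client one pass over its own points, and the message sent to the server has size proportional to the number of clusters times the dimension, independent of how many points the client holds. A fair variant would additionally need to constrain the assignment step, which is the part the paper handles with ADMM.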

Supplemental Material

MP4 File: Presentation video (112.14 MB)

Cited By

Huang W, Ye M, Shi Z, Wan G, Li H, Du B, Yang Q. (2024). Federated Learning for Generalization, Robustness, Fairness: A Survey and Benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence 46:12 (9387-9406). Online publication date: 1-Dec-2024.
https://dl.acm.org/doi/10.1109/TPAMI.2024.3418862

Index Terms

Computing methodologies → Machine learning → Learning paradigms → Unsupervised learning → Cluster analysis

Published In

Proceedings of the ACM on Management of Data Volume 1, Issue 4

PACMMOD

December 2023

1317 pages

EISSN:2836-6573

DOI:10.1145/3637468

Editor:
Divyakant Agrawal
UC Santa Barbara, United States

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

Key R&D in Hubei Province
Thousand Talents Plan of Jiangxi Province
National Natural Science Foundation of China
