research-article

F3KM: Federated, Fair, and Fast k-means

Published: 12 December 2023

Abstract

This paper proposes a federated, fair, and fast k-means algorithm (F3KM) to solve the fair clustering problem efficiently in scenarios where data cannot be shared among different parties. The proposed algorithm decomposes the fair k-means problem into multiple subproblems and assigns each subproblem to a client for local computation. Our algorithm allows each client to hold multiple sensitive attributes, or none at all. We propose an in-processing method that employs the alternating direction method of multipliers (ADMM) to solve each subproblem. While the subproblems are being solved, the server and the clients exchange only computation results, never the raw data. Our theoretical analysis shows that F3KM is efficient in both communication and computational complexity. Specifically, it achieves a better trade-off between utility and communication complexity, and reduces the computational complexity to linear in the dataset size. Our experiments show that F3KM achieves a better trade-off between utility and fairness than other methods. Moreover, F3KM is able to cluster five million points in one hour.
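
To illustrate the idea that only computation results, not raw data, travel between clients and server, here is a minimal sketch of one plain federated k-means round in Python. It is not the F3KM algorithm: the fairness constraints and the ADMM subproblem solver are omitted, and the function names (`client_update`, `server_aggregate`, `federated_kmeans`) and the choice of per-cluster sums and counts as the exchanged statistics are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def client_update(points: List[Point], centers: List[Point]):
    """Hypothetical client step: assign local points to the nearest center
    and return only per-cluster coordinate sums and counts."""
    k, dim = len(centers), len(centers[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        j = min(range(k), key=lambda c: sq_dist(p, centers[c]))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts  # raw points never leave the client


def server_aggregate(stats, centers: List[Point]) -> List[Point]:
    """Hypothetical server step: merge client statistics into new centers.
    A cluster that received no points keeps its previous center."""
    k, dim = len(centers), len(centers[0])
    new_centers = []
    for j in range(k):
        total = sum(counts[j] for _, counts in stats)
        if total == 0:
            new_centers.append(centers[j])
            continue
        new_centers.append(tuple(
            sum(sums[j][d] for sums, _ in stats) / total for d in range(dim)
        ))
    return new_centers


def federated_kmeans(clients, centers, rounds=10):
    """Repeat broadcast -> local statistics -> aggregation for a fixed
    number of rounds (no fairness term in this sketch)."""
    for _ in range(rounds):
        stats = [client_update(pts, centers) for pts in clients]
        centers = server_aggregate(stats, centers)
    return centers
```

In this sketch each round costs every client one pass over its local points, and the message size depends on the number of clusters and dimensions rather than on the number of points.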

Supplemental Material

MP4 File
Presentation video

Published In

Proceedings of the ACM on Management of Data, Volume 1, Issue 4
PACMMOD
December 2023
1317 pages
EISSN:2836-6573
DOI:10.1145/3637468
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. ADMM
  2. fair
  3. fast
  4. federated
  5. k-means

Qualifiers

  • Research-article
