Joint cluster analysis of attribute data and relationship data: The connected k-center problem, algorithms and applications

Published: 24 July 2008

Abstract

Attribute data and relationship data are two principal types of data, representing the intrinsic and extrinsic properties of entities, respectively. While attribute data have been the main source of data for cluster analysis, relationship data such as social networks or metabolic networks are becoming increasingly available. The two data types often carry complementary information, as in market segmentation and community identification, which calls for a joint cluster analysis of both to achieve better results. In this article, we introduce the novel Connected k-Center (CkC) problem, a clustering model that takes into account attribute data as well as relationship data. We analyze the complexity of the problem and prove its NP-hardness. We therefore analyze the approximability of the problem and present a constant-factor approximation algorithm. For the special case of the CkC problem in which the relationship data form a tree structure, we propose a dynamic programming method that gives an optimal solution in polynomial time. We further present NetScan, a heuristic algorithm that is efficient and effective for large real databases. Our extensive experimental evaluation on real datasets demonstrates the meaningfulness and accuracy of the NetScan results.
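
The abstract does not state the formal objective, so the following is only a minimal sketch under an assumed reading of the CkC problem: partition the nodes of a relationship graph into k non-empty clusters, each inducing a connected subgraph, so that the largest cluster radius in attribute space is as small as possible. The function names (`is_connected`, `cluster_radius`, `brute_force_ckc`), the choice of Euclidean distance, and the restriction of centers to cluster members are all assumptions made for illustration. The exhaustive search is not the article's approximation algorithm, its dynamic program, or NetScan; it is only practical for a handful of nodes.

```python
from itertools import product
from math import dist, inf


def is_connected(nodes, adj):
    """Return True if `nodes` is non-empty and induces a connected subgraph of `adj`."""
    nodes = set(nodes)
    if not nodes:
        return False
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v in nodes and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == nodes


def cluster_radius(nodes, coords):
    """Smallest radius obtainable when the center is one of the cluster's own nodes."""
    return min(max(dist(coords[c], coords[v]) for v in nodes) for c in nodes)


def brute_force_ckc(coords, adj, k):
    """Try every labeling of the nodes with k labels (assumed objective, see lead-in)."""
    n = len(coords)
    best_radius, best_labels = inf, None
    for labels in product(range(k), repeat=n):
        clusters = [[v for v in range(n) if labels[v] == c] for c in range(k)]
        # Relationship constraint: every cluster must be non-empty and connected.
        if not all(is_connected(cl, adj) for cl in clusters):
            continue
        # Attribute objective: the largest radius over all clusters.
        radius = max(cluster_radius(cl, coords) for cl in clusters)
        if radius < best_radius:
            best_radius, best_labels = radius, labels
    return best_radius, best_labels


# Toy instance (hypothetical data): a path 0-1-2-3 whose attribute values alternate.
coords = [(0.0,), (10.0,), (0.5,), (10.5,)]
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
radius, labels = brute_force_ckc(coords, adj, k=2)
print(radius, labels)
```

On this toy instance, clustering on attributes alone would pair nodes 0 and 2 (radius 0.5), but they are not adjacent, so the connectivity constraint forces a contiguous split of the path and the best achievable radius rises to 9.5. This illustrates how the two data types can pull a clustering in different directions.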

    Published In

    ACM Transactions on Knowledge Discovery from Data, Volume 2, Issue 2
    July 2008
    152 pages
    ISSN: 1556-4681
    EISSN: 1556-472X
    DOI: 10.1145/1376815
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 24 July 2008
    Accepted: 01 December 2007
    Revised: 01 July 2007
    Received: 01 December 2006
    Published in TKDD Volume 2, Issue 2

    Author Tags

    1. Attribute data
    2. NP-hardness
    3. approximation algorithms
    4. community identification
    5. document clustering
    6. joint cluster analysis
    7. market segmentation
    8. relationship data

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Bibliometrics & Citations

    Article Metrics

    • Downloads (last 12 months): 34
    • Downloads (last 6 weeks): 9
    Reflects downloads up to 14 Oct 2024

    Cited By

    • (2024) Connected k-Center and k-Diameter Clustering. Algorithmica 86:11 (3425-3464). DOI: 10.1007/s00453-024-01266-9. Online publication date: 2-Sep-2024
    • (2024) Background. Finding Communities in Social Networks Using Graph Embeddings (17-36). DOI: 10.1007/978-3-031-60916-9_2. Online publication date: 29-Apr-2024
    • (2021) The Nearest Neighbor Algorithm for Balanced and Connected k-Center Problem under Modular Distance. 2021 International Conference on Networking and Network Applications (NaNA) (385-389). DOI: 10.1109/NaNA53684.2021.00073. Online publication date: Oct-2021
    • (2020) Community detection in node-attributed social networks: A survey. Computer Science Review 37 (100286). DOI: 10.1016/j.cosrev.2020.100286. Online publication date: Aug-2020
    • (2020) Attributed Networks Partitioning Based on Modularity Optimization. Advances in Data Science (169-185). DOI: 10.1002/9781119695110.ch8. Online publication date: 15-Jan-2020
    • (2019) Co-Association Matrix-Based Multi-Layer Fusion for Community Detection in Attributed Networks. Entropy 21:1 (95). DOI: 10.3390/e21010095. Online publication date: 20-Jan-2019
    • (2019) MinerLSD: efficient mining of local patterns on attributed networks. Applied Network Science 4:1. DOI: 10.1007/s41109-019-0155-y. Online publication date: 27-Jun-2019
    • (2019) Community detection in large-scale social networks: state-of-the-art and future directions. Social Network Analysis and Mining 9:1. DOI: 10.1007/s13278-019-0566-x. Online publication date: 18-May-2019
    • (2018) Consensus-based methodology for detection communities in multilayered networks. Physica A: Statistical Mechanics and its Applications 494 (547-558). DOI: 10.1016/j.physa.2017.11.130. Online publication date: Mar-2018
    • (2017) Finding overlapping communities based on information fusion in social network. 2017 International Conference on Service Systems and Service Management (1-6). DOI: 10.1109/ICSSSM.2017.7996310. Online publication date: Jun-2017
