DOI: 10.1109/SFCS.2005.36
Article

Fitting tree metrics: Hierarchical clustering and Phylogeny

Published: 23 October 2005

    Abstract

    Given dissimilarity data on pairs of objects in a set, we study the problem of fitting a tree metric to this data so as to minimize additive error (i.e. some measure of the difference between the tree metric and the given data). This problem arises in constructing an M-level hierarchical clustering of objects (or an ultrametric on objects) so as to match the given dissimilarity data, a basic problem in statistics. Viewed in this way, the problem is a generalization of the correlation clustering problem (which corresponds to M = 1). We give a very simple randomized combinatorial algorithm for the M-level hierarchical clustering problem that achieves an approximation ratio of M+2. This is a generalization of a previous factor 3 algorithm for correlation clustering on complete graphs. The problem of fitting tree metrics also arises in phylogeny, where the objective is to learn the evolutionary tree by fitting a tree to dissimilarity data on taxa. The quality of the fit is measured by taking the $\ell_p$ norm of the difference between the tree metric constructed and the given data. Previous results obtained a factor 3 approximation for finding the closest tree metric under the $\ell_\infty$ norm. No non-trivial approximation for general $\ell_p$ norms was known before. We present a novel LP formulation for this problem and use it to obtain an $O((\log n \log\log n)^{1/p})$ approximation. En route, we obtain an $O((\log n \log\log n)^{1/p})$ approximation for the closest ultrametric under the $\ell_p$ norm. Our techniques are based on representing and viewing an ultrametric as a hierarchy of clusterings, and may be useful in other contexts.
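The objects named in the abstract can be made concrete with a minimal Python sketch. It is not the paper's M-level algorithm or its LP rounding. It only illustrates three standard ingredients: the $\ell_p$ fitting error, the three-point test that characterizes an ultrametric, and the random-pivot scheme commonly associated with the earlier factor 3 result for correlation clustering on complete graphs (the M = 1 case). The function names, the `frozenset`-keyed dictionary layout, and the `similar` predicate are assumptions made for illustration.

```python
import itertools
import random


def lp_error(fitted, data, p):
    """l_p norm of the difference between a fitted metric and the data.

    Both arguments map frozenset({i, j}) to a non-negative number.
    Pass p=float("inf") for the maximum absolute deviation.
    """
    diffs = [abs(fitted[pair] - data[pair]) for pair in data]
    if p == float("inf"):
        return max(diffs, default=0.0)
    return sum(d ** p for d in diffs) ** (1.0 / p)


def is_ultrametric(points, dist, tol=1e-9):
    """Three-point test: in every triple, the two largest distances are equal."""
    for a, b, c in itertools.combinations(points, 3):
        _, mid, top = sorted(
            [dist[frozenset({a, b})], dist[frozenset({a, c})], dist[frozenset({b, c})]]
        )
        if top - mid > tol:
            return False
    return True


def pivot_cluster(points, similar, rng):
    """Random-pivot clustering on a complete graph (illustrative M = 1 case).

    `similar(u, v)` is an assumed predicate saying whether the data labels
    the pair as belonging together.
    """
    remaining = list(points)
    clusters = []
    while remaining:
        pivot = rng.choice(remaining)
        # The pivot takes every remaining point that the data marks as similar.
        cluster = [v for v in remaining if v == pivot or similar(pivot, v)]
        clusters.append(cluster)
        remaining = [v for v in remaining if v not in cluster]
    return clusters


# Toy data (hypothetical): two tight pairs at distance 1, everything else at 3.
points = ["a", "b", "c", "d"]
data = {frozenset(pair): 3.0 for pair in itertools.combinations(points, 2)}
data[frozenset({"a", "b"})] = 1.0
data[frozenset({"c", "d"})] = 1.0

clusters = pivot_cluster(
    points, lambda u, v: data[frozenset({u, v})] <= 1.0, random.Random(0)
)
```

On this toy input the data is already an ultrametric, so the pivot step recovers the two pairs whichever pivot is drawn. On inconsistent data the output depends on the random pivots, which is where an approximation analysis becomes necessary.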



    Published In

    FOCS '05: Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science
    October 2005
    645 pages
    ISBN: 0769524680

    Publisher

    IEEE Computer Society

    United States

