Article

Fitting tree metrics: Hierarchical clustering and Phylogeny

Authors:

Moses Charikar

FOCS '05: Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science

October 2005

Pages 73 - 82

https://doi.org/10.1109/SFCS.2005.36

Published: 23 October 2005

Abstract

Given dissimilarity data on pairs of objects in a set, we study the problem of fitting a tree metric to this data so as to minimize additive error (i.e., some measure of the difference between the tree metric and the given data). This problem arises in constructing an M-level hierarchical clustering of objects (or an ultrametric on objects) so as to match the given dissimilarity data, a basic problem in statistics. Viewed in this way, the problem is a generalization of the correlation clustering problem (which corresponds to M = 1). We give a very simple randomized combinatorial algorithm for the M-level hierarchical clustering problem that achieves an approximation ratio of M+2. This is a generalization of a previous factor 3 algorithm for correlation clustering on complete graphs. The problem of fitting tree metrics also arises in phylogeny, where the objective is to learn the evolution tree by fitting a tree to dissimilarity data on taxa. The quality of the fit is measured by taking the $\ell_p$ norm of the difference between the tree metric constructed and the given data. Previous results obtained a factor 3 approximation for finding the closest tree metric under the $\ell_\infty$ norm. No non-trivial approximation for general $\ell_p$ norms was known before. We present a novel LP formulation for this problem and obtain an $O((\log n \log\log n)^{1/p})$ approximation using this. En route, we obtain an $O((\log n \log\log n)^{1/p})$ approximation for the closest ultrametric under the $\ell_p$ norm. Our techniques are based on representing and viewing an ultrametric as a hierarchy of clusterings, and may be useful in other contexts.
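
The objects named in the abstract can be made concrete with a small sketch. The Python below is illustrative only and is not the paper's algorithm. It shows the $\ell_p$ fit error between a candidate metric and the data, the strong triangle inequality that characterizes an ultrametric, and a generic random-pivot partition for the single-level case M = 1. All function names, the `threshold` parameter, and the 0/1 encoding of a partition (0 inside a cluster, 1 across clusters) are assumptions made for this example.

```python
import itertools
import math
import random


def lp_error(d, t, p):
    """l_p norm of (t - d) over unordered pairs i < j; p may be math.inf."""
    diffs = [abs(t[i][j] - d[i][j])
             for i, j in itertools.combinations(range(len(d)), 2)]
    if not diffs:
        return 0.0
    if math.isinf(p):
        return float(max(diffs))
    return sum(x ** p for x in diffs) ** (1.0 / p)


def is_ultrametric(t):
    """Check the strong triangle inequality t[i][j] <= max(t[i][k], t[k][j])."""
    n = len(t)
    return all(t[i][j] <= max(t[i][k], t[k][j])
               for i in range(n) for j in range(n) for k in range(n))


def pivot_clusters(d, threshold, rng):
    """Random-pivot partition (assumed scheme for M = 1): each pivot takes
    every still-unclustered point whose dissimilarity to it is below threshold."""
    remaining = list(range(len(d)))
    clusters = []
    while remaining:
        pivot = rng.choice(remaining)
        cluster = [v for v in remaining
                   if v == pivot or d[pivot][v] < threshold]
        clusters.append(cluster)
        remaining = [v for v in remaining if v not in cluster]
    return clusters


def partition_metric(n, clusters):
    """0/1 encoding of a partition: 0 inside a cluster, 1 across clusters."""
    label = {v: c for c, members in enumerate(clusters) for v in members}
    return [[0 if label[i] == label[j] else 1 for j in range(n)]
            for i in range(n)]


# Toy data with two perfectly separated groups, {0, 1} and {2, 3}.
d = [[0, 0, 1, 1],
     [0, 0, 1, 1],
     [1, 1, 0, 0],
     [1, 1, 0, 0]]
clusters = pivot_clusters(d, threshold=1, rng=random.Random(0))
t = partition_metric(len(d), clusters)
fit_l1 = lp_error(d, t, 1)
```

On this toy input every pivot order recovers the same two groups, so the fitted 0/1 metric coincides with the data. On inconsistent data the pivot choice matters, and under this encoding the $\ell_1$ error counts the pairs on which the partition disagrees with the data.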


Index Terms

1. Mathematics of computing
  1. Discrete mathematics
    1. Graph theory
      1. Graph algorithms
      2. Trees
  2. Probability and statistics
    1. Statistical paradigms
      1. Statistical graphics

Published In

FOCS '05: Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science

October 2005

645 pages

ISBN:0769524680

Publisher

IEEE Computer Society

United States

Cited By

Chien E, Tabaghi P, Milenkovic O, Zhang A, Rangwala H (2022). HyperAid. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 201-211. Online publication date: 14-Aug-2022.
https://dl.acm.org/doi/10.1145/3534678.3539378
Roy A, Pokutta S (2017). Hierarchical clustering via spreading metrics. The Journal of Machine Learning Research, 18:1, pages 3077-3111. Online publication date: 1-Jan-2017.
https://dl.acm.org/doi/10.5555/3122009.3176832
Sidiropoulos A, Wang D, Wang Y, Klein P (2017). Metric embeddings with outliers. Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 670-689. Online publication date: 16-Jan-2017.
https://dl.acm.org/doi/10.5555/3039686.3039729
Roy A, Pokutta S (2016). Hierarchical clustering via spreading metrics. Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 2324-2332. Online publication date: 5-Dec-2016.
https://dl.acm.org/doi/10.5555/3157096.3157356
Zheng L, Li T, Ding C (2014). A Framework for Hierarchical Ensemble Clustering. ACM Transactions on Knowledge Discovery from Data, 9:2, pages 1-23. Online publication date: 23-Sep-2014.
https://dl.acm.org/doi/10.1145/2611380
Chepoi V, Seston M (2011). Seriation in the Presence of Errors. Algorithmica, 59:4, pages 521-568. Online publication date: 1-Apr-2011.
https://dl.acm.org/doi/10.5555/3118745.3118913
Ailon N (2010). Aggregation of Partial Rankings, p-Ratings and Top-m Lists. Algorithmica, 57:2, pages 284-300. Online publication date: 1-Jun-2010.
https://dl.acm.org/doi/10.5555/3118232.3118513
van Zuylen A, Williamson D (2009). Deterministic Pivoting Algorithms for Constrained Ranking and Clustering Problems. Mathematics of Operations Research, 34:3, pages 594-620. Online publication date: 1-Aug-2009.
https://dl.acm.org/doi/10.1287/moor.1090.0385
Karpinski M, Schudy W, Mitzenmacher M (2009). Linear time approximation schemes for the Gale-Berlekamp game and related minimization problems. Proceedings of the forty-first annual ACM symposium on Theory of computing, pages 313-322. Online publication date: 31-May-2009.
https://dl.acm.org/doi/10.1145/1536414.1536458
Ailon N, Charikar M, Newman A (2008). Aggregating inconsistent information. Journal of the ACM, 55:5, pages 1-27. Online publication date: 5-Nov-2008.
https://dl.acm.org/doi/10.1145/1411509.1411513
