Abstract
We propose a new biclustering method for binary data matrices based on maximum penalized Bernoulli likelihood estimation. Our method uses a multi-layer model defined on the logits of the success probabilities, where each layer represents a simple bicluster structure and the combination of multiple layers can reveal multiple, possibly complicated, biclusters. The method allows for non-pure biclusters and can simultaneously identify 1-prevalent blocks and 0-prevalent blocks. A computationally efficient algorithm is developed, and guidelines are provided for specifying the tuning parameters, including initial values of the model parameters, the number of layers, and the penalty parameters. Missing-data imputation is handled within the EM framework. The method is tested on synthetic and real datasets and shows good performance.
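To make the layered-logit formulation concrete, the sketch below simulates a binary matrix from a model of the kind described above. It is a minimal illustration, not the authors' implementation: the rank-one parameterization of each layer, the membership sampling, and all variable names are our assumptions.

```python
import numpy as np

# Minimal sketch of a multi-layer logit model for binary biclustering.
# The rank-one layer form theta += a * outer(r, c) is an illustrative
# assumption, not the paper's exact parameterization.
rng = np.random.default_rng(0)
n, p, L = 100, 80, 2                  # rows, columns, number of layers

theta = np.zeros((n, p))              # logits of the success probabilities
for layer in range(L):
    r = np.zeros(n)                   # row-membership indicator of a bicluster
    r[rng.choice(n, 20, replace=False)] = 1.0
    c = np.zeros(p)                   # column-membership indicator
    c[rng.choice(p, 15, replace=False)] = 1.0
    a = rng.choice([-3.0, 3.0])       # layer effect on the logit scale
    theta += a * np.outer(r, c)

prob = 1.0 / (1.0 + np.exp(-theta))   # success probabilities pi(theta)
Y = rng.binomial(1, prob)             # observed binary data matrix
print(Y.shape, Y.mean())
```

A layer with a negative effect depresses the logits and yields a 0-prevalent block, while a positive effect yields a 1-prevalent block, which is why a single model of this kind can identify both simultaneously.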
References
Ben-Dor, A., Chor, B., Karp, R., Yakhini, Z.: Discovering local structure in gene expression data: the order-preserving sub-matrix problem. In: Proceedings of the 6th Annual International Conference on Computational Biology, pp. 49–57 (2002)
Brookes, A.J.: Review: the essence of SNPs. Gene 234, 177–186 (1999)
Cheng, Y., Church, G.M.: Biclustering of expression data. In: Proceedings of International Conference on Intelligent Systems for Molecular Biology, pp. 93–103 (2000)
Collins, M., Dasgupta, S., Schapire, R.E.: A generalization of principal component analysis to the exponential family. In: Dietterich, T.G., Becker, S., Ghahramani, Z. (eds.) Advances in Neural Information Processing Systems, vol. 14, pp. 617–642. MIT Press, Cambridge (2002)
de Leeuw, J.: Principal component analysis of binary data by iterated singular value decomposition. Comput. Stat. Data Anal. 50, 21–39 (2006)
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. B 39, 1–38 (1977)
Dhollander, T., Sheng, Q., Lemmens, K., De Moor, B., Marchal, K., Moreau, Y.: Query-driven module discovery in microarray data. Bioinformatics 23, 2573–2580 (2007)
Ewens, W.J., Spielman, R.S.: The transmission/disequilibrium test: history, subdivision, and admixture. Am. J. Hum. Genet. 57, 455–464 (1995)
Frank, A., Asuncion, A.: UCI machine learning repository. http://archive.ics.uci.edu/ml, Irvine, CA: University of California, School of Information and Computer Science (2010)
Friedman, J., Hastie, T., Höfling, H., Tibshirani, R.: Pathwise coordinate optimization. Ann. Appl. Stat. 1, 302–332 (2007)
Gormley, I.C., Murphy, T.B.: A mixture of experts model for rank data with applications in election studies. Ann. Appl. Stat. 2, 1452–1477 (2008)
Govaert, G., Nadif, M.: Block clustering with Bernoulli mixture models: comparison of different approaches. Comput. Stat. Data Anal. 52, 3233–3245 (2008)
Huber, P., Ronchetti, E., Victoria-Feser, M.-P.: Estimation of generalized linear latent variable models. J. R. Stat. Soc. B 66, 893–908 (2004)
Hunter, D.R., Li, R.: Variable selection using MM algorithms. Ann. Stat. 33, 1617–1642 (2005)
Ihmels, J., Friedlander, G., Bergmann, S., Sarig, O., Ziv, Y., Barkai, N.: Revealing modular organization in the yeast transcriptional network. Nat. Genet. 31, 370–377 (2002)
Jaakkola, T.S., Jordan, M.I.: Bayesian parameter estimation via variational methods. Stat. Comput. 10, 25–37 (2000)
Kwok, P.Y., Deng, Q., Zakeri, H., Taylor, S.L., Nickerson, D.A.: Increasing the information content of STS-based genome maps: identifying polymorphisms in mapped STSs. Genomics 31, 123–126 (1996)
Lange, K., Hunter, D.R., Yang, I.: Optimization transfer using surrogate objective functions (with discussion). J. Comput. Graph. Stat. 9, 1–20 (2000)
Lazzeroni, L., Owen, A.B.: Plaid models for gene expression data. Stat. Sin. 12, 61–86 (2002)
Lee, M., Shen, H., Huang, J.Z., Marron, J.S.: Biclustering via sparse singular value decomposition. Biometrics 66, 1087–1095 (2010a)
Lee, S., Huang, J.Z., Hu, J.: Sparse logistic principal components analysis for binary data. Ann. Appl. Stat. 4, 1579–1601 (2010b)
Murali, T.M., Kasif, S.: Extracting conserved gene expression motifs from gene expression data. Pac. Symp. Biocomput. 8, 77–88 (2003)
Prelić, A., Bleuler, S., Zimmermann, P., Wille, A., Bühlmann, P., Gruissem, W., Hennig, L., Thiele, L., Zitzler, E.: A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics 22, 1122–1129 (2006)
Rodriguez-Baena, D.S., Perez-Pulido, A., Aguilar-Ruiz, J.S.: A biclustering algorithm for extracting bit-patterns from binary datasets. Bioinformatics 27, 2746–2753 (2011)
Schwarz, G.E.: Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978)
Schein, A., Saul, L.K., Ungar, L.H.: A generalized linear model for principal component analysis of binary data. In: Bishop, C.M., Frey, B.J. (eds.) Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, pp. 14–21. Key West, FL (2003)
Serre, D., Montpetit, A., Paré, G., Engert, J.C., Yusuf, S., Keavney, B., Hudson, T.J., Anand, S.: Correction of population stratification in large multi-ethnic association studies. PLoS ONE 3, e1382 (2008). doi:10.1371/journal.pone.0001382
Shamir, R., Maron-Katz, A., Tanay, A., Linhart, C., Steinfeld, I., Sharan, R., Shiloh, Y., Elkon, R.: EXPANDER—an integrative program suite for microarray data analysis. BMC Bioinform. 6, 232 (2005)
Shen, H., Huang, J.Z.: Sparse principal component analysis via regularized low rank matrix approximation. J. Multivar. Anal. 99, 1015–1034 (2008)
Sheng, Q., Moreau, Y., De Moor, B.: Biclustering microarray data by Gibbs sampling. Bioinformatics 19, 196–205 (2003)
Tanay, A., Sharan, R., Shamir, R.: Discovering statistically significant biclusters in gene expression data. Bioinformatics 18(suppl. 1), S136–S144 (2002)
The International HapMap Consortium: A haplotype map of the human genome. Nature 437, 1299–1320 (2005)
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B 58, 267–288 (1996)
Van Uitert, M., Meuleman, W., Wessels, L.: Biclustering sparse binary genomic data. J. Comput. Biol. 15, 1329–1345 (2008)
Wyse, J., Friel, N.: Block clustering with collapsed latent block models. Stat. Comput. 22, 415–428 (2012)
Acknowledgements
The authors would like to thank the editor, the associate editor, and two reviewers for helpful comments. Dr. Lan Zhou carefully read the paper and gave many useful suggestions for improving the writing. Lee’s work was supported by Basic Science Research Program through the National Research Foundation (NRF) of Korea (2011-0011608). Huang’s work was partially supported by NCI (CA57030), NSF (DMS-0907170, DMS-1007618, DMS-1208952), and King Abdullah University of Science and Technology (KUS-CI-016-04).
Appendix
1.1 Derivations of (4) and (5)
Lemma 1
The function \(\pi(x)\{1-\pi(x)\}\) is decreasing in \(x\ge 0\), where \(\pi(x)=\{1+\exp(-x)\}^{-1}\).
Proof
Since \(\pi'(x)=\pi(x)\{1-\pi(x)\}\), the first derivative of \(\pi(x)\{1-\pi(x)\}\) is \(\pi'(x)\{1-\pi(x)\}-\pi(x)\pi'(x)=\pi'(x)\{1-2\pi(x)\}=\pi(x)\{1-\pi(x)\}\{1-2\pi(x)\}\). Since \(1/2\le\pi(x)\le 1\) for \(x\ge 0\), the derivative is nonpositive and therefore the function is decreasing. □
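As a quick numerical illustration of Lemma 1 (a sanity check only, not part of the proof), one can evaluate \(\pi(x)\{1-\pi(x)\}\) on a grid:

```python
import numpy as np

# Sanity check of Lemma 1: pi(x){1 - pi(x)} is nonincreasing on x >= 0.
pi = lambda x: 1.0 / (1.0 + np.exp(-x))
x = np.linspace(0.0, 10.0, 1001)
v = pi(x) * (1.0 - pi(x))
assert np.all(np.diff(v) <= 0.0)
print(v[0], v[-1])   # starts at 1/4 and decays toward 0
```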
Lemma 2
The function \(r(x)=\log\pi(\sqrt{x})-\sqrt{x}/2\) is convex for \(x\ge 0\).
Proof
The second derivative of r(x) is
\[r''(x)=\frac{\pi(\sqrt{x})-\frac{1}{2}-\sqrt{x}\,\pi'(\sqrt{x})}{4x^{3/2}}.\]
Note that
\[\pi(\sqrt{x})-\frac{1}{2}=\frac{\pi(\sqrt{x})-\pi(-\sqrt{x})}{2}=\sqrt{x}\,\pi'(\xi)\]
with \(\xi\in(-\sqrt{x},\sqrt{x})\) from the mean value theorem. Since \(\pi'(t)=\pi(t)\{1-\pi(t)\}\) depends on t only through |t|, it follows from \(|\xi|<\sqrt{x}\) and Lemma 1 that \(\pi'(\xi)\ge\pi'(\sqrt{x})\), so the second derivative of r(x) is nonnegative. This completes the proof of Lemma 2. □
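The convexity claimed by Lemma 2 can be checked numerically in the same illustrative spirit, via second differences of r on a grid:

```python
import numpy as np

# Sanity check of Lemma 2: r(x) = log pi(sqrt(x)) - sqrt(x)/2 is convex on x > 0.
pi = lambda t: 1.0 / (1.0 + np.exp(-t))
r = lambda x: np.log(pi(np.sqrt(x))) - np.sqrt(x) / 2.0
x = np.linspace(0.01, 25.0, 2000)
second_diff = np.diff(r(x), n=2)      # discrete analogue of r''(x), up to h^2
assert np.all(second_diff > -1e-12)   # nonnegative up to floating-point noise
print(second_diff.min())
```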
From the convexity of r(x), we get \(r(x)\ge r(y)+r'(y)(x-y)\) at any y, so that
\[\log\pi(\sqrt{x})-\frac{\sqrt{x}}{2}\ \ge\ \log\pi(\sqrt{y})-\frac{\sqrt{y}}{2}+\frac{1-2\pi(\sqrt{y})}{4\sqrt{y}}\,(x-y),\]
or
\[-\log\pi(\sqrt{x})\ \le\ -\log\pi(\sqrt{y})-\frac{\sqrt{x}-\sqrt{y}}{2}+\frac{2\pi(\sqrt{y})-1}{4\sqrt{y}}\,(x-y).\]
By replacing \(\sqrt{x}\) and \(\sqrt{y}\) with x and y respectively, we obtain
\[-\log\pi(x)\ \le\ -\log\pi(y)-\frac{x-y}{2}+\frac{2\pi(y)-1}{4y}\,(x^{2}-y^{2}).\]
The coefficient of the quadratic term is bounded above by 1/8 because
\[\frac{2\pi(y)-1}{4y}=\frac{\pi(y)-\pi(-y)}{4y}=\frac{2y\,\pi'(\xi)}{4y}=\frac{\pi(\xi)\{1-\pi(\xi)\}}{2}\ \le\ \frac{1}{8}\]
for \(\xi\in(-y,y)\) by the mean value theorem, using \(\pi(y)-\pi(-y)=2\pi(y)-1\) and \(\pi(\xi)\{1-\pi(\xi)\}\le 1/4\). At y=0, the coefficient of the quadratic term is not defined properly. In this case, we use the limit as y approaches zero. By L'Hôpital's rule we get
\[\lim_{y\to 0}\frac{2\pi(y)-1}{4y}=\lim_{y\to 0}\frac{2\pi'(y)}{4}=\frac{\pi'(0)}{2}=\frac{1}{8}.\]
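Both facts about the quadratic coefficient, the 1/8 bound and the y → 0 limit, are easy to confirm numerically; the following is again only an illustrative check:

```python
import numpy as np

# The coefficient (2*pi(y) - 1)/(4*y) is at most 1/8 and approaches 1/8 as y -> 0.
pi = lambda t: 1.0 / (1.0 + np.exp(-t))
y = np.linspace(1e-4, 20.0, 2000)
w = (2.0 * pi(y) - 1.0) / (4.0 * y)
assert np.all(w <= 0.125 + 1e-12)
print(w.max(), w[0])   # the supremum 1/8 is approached in the y -> 0 limit
```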
Since \(x^{2}-y^{2}=(x-y)^{2}+2y(x-y)\) and \(2y\cdot\frac{2\pi(y)-1}{4y}=\pi(y)-\frac{1}{2}\), the bound can be rewritten as
\[-\log\pi(x)\ \le\ -\log\pi(y)-\{1-\pi(y)\}(x-y)+\frac{2\pi(y)-1}{4y}\,(x-y)^{2}.\]
Because \((x-y)^{2}\ge 0\), the coefficient of the quadratic term can be replaced by its upper bound 1/8. Now, by completing squares around x, we get the upper bound as
\[-\log\pi(x)\ \le\ -\log\pi(y)-2\{1-\pi(y)\}^{2}+\frac{1}{8}\bigl[x-y-4\{1-\pi(y)\}\bigr]^{2}.\]
Replacing x and y by \(q_{ij}\theta_{ij}\) and \(q_{ij}\theta_{ij}^{o}\) respectively, we obtain the upper bound in (4). In fact, the first two terms in (4) are obtained straightforwardly after x and y are replaced by \(q_{ij}\theta_{ij}\) and \(q_{ij}\theta_{ij}^{o}\) respectively. After the replacement, the third term of the above displayed formula becomes
\[\frac{1}{8}\bigl[q_{ij}\theta_{ij}-q_{ij}\theta_{ij}^{o}-4\{1-\pi(q_{ij}\theta_{ij}^{o})\}\bigr]^{2}=\frac{1}{8}q_{ij}^{2}\bigl[\theta_{ij}-\theta_{ij}^{o}-4q_{ij}\{1-\pi(q_{ij}\theta_{ij}^{o})\}\bigr]^{2}.\]
In the above, we used \(q_{ij}^{2}=1\) since \(q_{ij}\) takes values −1 or 1 only. Now, if we define \(x_{ij}=\theta_{ij}^{o}+4q_{ij}\{1-\pi(q_{ij}\theta_{ij}^{o})\}\), the desired result is obtained.
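The resulting quadratic majorizer can also be verified numerically in the x, y form (before the substitution \(x=q_{ij}\theta_{ij}\)): it dominates \(-\log\pi(x)\) for all x and touches it at x = y, which is exactly the tangency that the majorization argument requires. This check is illustrative only:

```python
import numpy as np

# Check: -log pi(x) <= -log pi(y) - 2{1-pi(y)}^2 + (1/8)[x - y - 4{1-pi(y)}]^2
# for all x, y on a grid, with equality on the diagonal x = y.
pi = lambda t: 1.0 / (1.0 + np.exp(-t))
x = np.linspace(-10.0, 10.0, 401)[:, None]
y = np.linspace(-10.0, 10.0, 401)[None, :]
lhs = -np.log(pi(x)) + 0.0 * y        # broadcast the left side over the grid
rhs = -np.log(pi(y)) - 2.0 * (1.0 - pi(y))**2 \
      + (x - y - 4.0 * (1.0 - pi(y)))**2 / 8.0
assert np.all(lhs <= rhs + 1e-10)
print(np.abs(np.diag(lhs - rhs)).max())   # ~0: the bound is tight at x = y
```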
Applying (4) to (1), we obtain an upper bound of the negative penalized Bernoulli log-likelihood of the form \(g(\mbox{\boldmath$\varXi$}|\mbox{\boldmath$\varXi$}^{o})+C\), where \(g(\mbox{\boldmath$\varXi$}|\mbox{\boldmath$\varXi$}^{o})\) is given in equation (5) of the manuscript, and C is the constant term \(-\sum_{i=1}^{n}\sum_{j=1}^{p}\log\pi(q_{ij}\theta_{ij}^{o})-2\sum_{i=1}^{n}\sum_{j=1}^{p}\{1-\pi(q_{ij}\theta_{ij}^{o})\}^{2}\), which does not involve \(\theta_{ij}\). Thus, g can be used as the surrogate function for the upper-bound (majorize-minimize) optimization.
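To show how such a surrogate drives a majorize-minimize iteration, the sketch below forms the working matrix \(x_{ij}=\theta_{ij}^{o}+4q_{ij}\{1-\pi(q_{ij}\theta_{ij}^{o})\}\) from the derivation above and then, as an illustrative stand-in for minimizing the penalized surrogate g of (5), fits a single unpenalized rank-one layer to it by SVD. This is not the authors' algorithm: the penalty and the multi-layer structure are omitted and all names are ours, but it exhibits the monotonicity that the majorization guarantees.

```python
import numpy as np

# One majorize-minimize step based on the surrogate derived above: form the
# working matrix X with entries x_ij = theta_ij + 4 q_ij {1 - pi(q_ij theta_ij)},
# then minimize (1/8)||Theta - X||_F^2 over rank-one Theta (Eckart-Young).
# The rank-one, penalty-free update is an illustrative stand-in for g in (5).
pi = lambda t: 1.0 / (1.0 + np.exp(-t))

def mm_step(Y, Theta):
    Q = 2.0 * Y - 1.0                              # q_ij in {-1, +1}
    X = Theta + 4.0 * Q * (1.0 - pi(Q * Theta))    # working matrix x_ij
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return s[0] * np.outer(U[:, 0], Vt[0])         # best rank-one fit to X

rng = np.random.default_rng(1)
Y = rng.binomial(1, 0.5, size=(50, 40)).astype(float)
Theta, prev = np.zeros((50, 40)), -np.inf
for _ in range(20):
    Theta = mm_step(Y, Theta)
    loglik = np.sum(np.log(pi((2.0 * Y - 1.0) * Theta)))
    assert loglik >= prev - 1e-9                   # MM never decreases it
    prev = loglik
print(round(prev, 2))
```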
Cite this article
Lee, S., Huang, J.Z. A biclustering algorithm for binary matrices based on penalized Bernoulli likelihood. Stat Comput 24, 429–441 (2014). https://doi.org/10.1007/s11222-013-9379-3