Abstract
The triple jump extrapolation method is an effective approximation of Aitken's acceleration that can speed up the convergence of many algorithms for data mining, including EM and generalized iterative scaling (GIS). It comes in two variants: global and componentwise extrapolation. Empirical studies have shown that neither variant dominates the other, and it has not been known which is preferable under which conditions. In this paper, we investigate this problem and conclude that componentwise extrapolation is more effective when the Jacobian of the underlying mapping is (block) diagonal. We derive two hints for determining block diagonality. The first is that when the data set is highly sparse, the Jacobian of the EM mapping for training a Bayesian network will be block diagonal. The second is that the block diagonality of the Jacobian of the GIS mapping for training a CRF is negatively correlated with the strength of feature dependencies. We empirically verify these hints with controlled and real-world data sets and show that they accurately predict which method will be superior. We also show that both global and componentwise extrapolation provide substantial acceleration. In particular, when applied to training large-scale CRF models, the GIS variant accelerated by componentwise extrapolation not only outperforms its global counterpart, as our hint predicts, but is also competitive with limited-memory BFGS (L-BFGS), the de facto standard for CRF training, in terms of both computational efficiency and F-scores. Although none of these methods is as fast as stochastic gradient descent (SGD), SGD requires careful tuning, and the results in this paper provide a useful foundation for automating that tuning.
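To illustrate the distinction the abstract draws, the following is a minimal sketch of Aitken-style fixed-point extrapolation on a toy linear map. The map, names, and safeguards here are illustrative assumptions standing in for one EM or GIS update, not the paper's exact triple jump algorithm. The componentwise variant estimates a convergence rate per coordinate, while the global variant estimates a single rate from the two step vectors.

```python
import numpy as np

# Toy contraction mapping theta -> A @ theta + b with a *diagonal* Jacobian A.
# Hypothetical stand-in for one EM or GIS update, chosen so the fixed point
# is known in closed form: theta* = b / (1 - diag(A)) = [10, 4].
A = np.diag([0.9, 0.5])
b = np.array([1.0, 2.0])

def mapping(theta):
    return A @ theta + b

def triple_jump_step(theta, mapping, componentwise=True, eps=1e-12):
    """Two plain steps of the mapping, then one Aitken-style extrapolation.

    For a linear iteration with rate gamma, the fixed point satisfies
    theta* = theta + d1 / (1 - gamma), where d1 is the first step.
    """
    t1 = mapping(theta)
    t2 = mapping(t1)
    d1, d2 = t1 - theta, t2 - t1
    if componentwise:
        # Per-coordinate rate d2_i / d1_i; exact when the Jacobian is diagonal.
        safe_d1 = np.where(np.abs(d1) > eps, d1, 1.0)
        gamma = np.where(np.abs(d1) > eps, d2 / safe_d1, 0.0)
    else:
        # Single global rate estimated from the two step vectors.
        gamma = np.dot(d1, d2) / max(np.dot(d1, d1), eps)
    denom = 1.0 - gamma
    denom = np.where(np.abs(denom) < eps, eps, denom)  # guard gamma near 1
    return theta + d1 / denom
```

On this diagonal map, a single componentwise jump from the origin lands exactly on the fixed point [10, 4], while the global variant needs several jumps to get close; on a map whose Jacobian is far from (block) diagonal the advantage can reverse, which is the condition the paper's hints are designed to detect.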
Communicated by Charles Elkan.
Huang, HS., Yang, BH., Chang, YM. et al. Global and componentwise extrapolations for accelerating training of Bayesian networks and conditional random fields. Data Min Knowl Disc 19, 58–94 (2009). https://doi.org/10.1007/s10618-009-0128-3