Abstract
Along with a massive amount of information being placed online, it is a challenge to exploit the internal and external information of documents when assessing similarity between them. A variety of approaches have been proposed to model the document similarity based on different foundations, but usually they are not applicable for combining internal and external information. In this paper, we introduce a link-based method into content analysis, which is based on random walk on graphs. By defining similarity as the meeting probability of two random surfers, we propose a computational model for content analysis, which can also be integrated with external information of documents. Empirical study shows that our method achieves good accuracy, acceptable performance and fast convergent rate in multi-relational document similarity measuring.
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Renda, M.E., Straccia, U.: A Personalized Collaborative Digital Library Environment: a model and an application. Information Processing and Management 41(1), 5–21 (2005)
Salton, G., Wong, A., Yang, C.S.: A Vector Space Model for Automatic Indexing. Communications of the ACM 18, 613–620 (1975)
Damashek, M.: Gauging similarity with n-grams: Language-independent categorization of text. Science 267, 843–848
Deerwester, S.C., Dumais, S.T., Landauer, T.K., Furnas, G.W., Harshman, R.A.: Indexing by latent semantic analysis. Journal of the American Society of Information Science 41(6), 391–407 (1990)
Jeh, G., Widom, J.: SimRank: A measure of structural-context similarity. In: SIGKDD 2002, pp. 538–543 (2002)
Xi, W., Fox, E.A., Fan, W., Zhang, B., Chen, Z., Yan, J., Zhuang, D.: SimFusion: measuring similarity using unified relationship matrix. In: SIGIR 2005, pp. 130–137 (2005)
Yin, X., Han, J., Yu, P.S.: Linkclus: Efficient clustering via heterogeneous semantic links. In: VLDB 2006, pp. 427–438 (2006)
Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons, Chichester (1990)
Lovasz, L.: Random walks on graphs: a survey. In: Combinatorics, Paul Erdos is Eighty, vol. 2, pp. 1–46, Keszthely, Hungary (1993)
Kallenberg, O.: Foundations of Modern Probability. Springer, New York (1997)
Fogaras, D., Racz, B.: Scaling Link-Based Similarity Search. In: WWW 2005, pp. 641–650 (2005)
Getoor, L., Diehl, C.P.: Link mining: A survey. In: SIGKDD 2005 Explorations, vol. 7(2), pp. 3–12.
Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Information Processing & Management 24(5), 513–523 (1988)
Hammouda, K.M., Kamel, M.S.: Phrase-based Document Similarity Based on an Index Graph Model. In: ICDM 2002, pp. 203–210 (2002)
Aslam, J.A., Frost, M.: An Information-theoretic Measure for Document Similarity. In: SIGIR 2003, pp. 449–450 (2003)
Calado, P., Cristo, M., Moura, E.S., Ziviani, N., Ribeiro-Neto, B.A., Goncalves, M.A.: Combining link-based and content-based methods for web document classification. In: CIKM 2003, pp. 394–401 (2003)
Jin, R., Dumais, S.: Probabilistic Combination of Content and Links. In: SIGIR 2001, pp. 402–403 (2001)
Zhu, S., Yu, K., Chi, Y., Gong, Y.: Combining content and link for classification using matrix factorization. In: SIGIR 2007, pp. 487–494 (2007)
Porter, M.: An algorithm for suffix stripping. Program, vol. 14(3), pp. 130–137 (1980), http://www.tartarus.org/~martin/PorterStemmer
Salton, G.: The SMART Retrieval System—Experiments in Automatic Document Processing. Prentice-Hall, Inc., Upper Saddle River (1971)
The stop-words list, http://members.unine.ch/jacques.savoy/clef/englishST.txt
ACM Computing Classification System, http://portal.acm.org/ccs.cfm
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Li, P., Li, Z., Liu, H., He, J., Du, X. (2009). Using Link-Based Content Analysis to Measure Document Similarity Effectively. In: Li, Q., Feng, L., Pei, J., Wang, S.X., Zhou, X., Zhu, QM. (eds) Advances in Data and Web Management. APWeb WAIM 2009 2009. Lecture Notes in Computer Science, vol 5446. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00672-2_40
Download citation
DOI: https://doi.org/10.1007/978-3-642-00672-2_40
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00671-5
Online ISBN: 978-3-642-00672-2
eBook Packages: Computer ScienceComputer Science (R0)