A New Retrieval Model Based on TextTiling for Document Similarity Search

Wan, Xiao-Jun; Peng, Yu-Xin

doi:10.1007/s11390-005-0552-9

Regular Paper
Published: July 2005

Volume 20, pages 552–558, (2005)

Journal of Computer Science and Technology

Abstract

Document similarity search finds documents similar to a given query document and returns them to the user as a ranked list. It is widely used in text and web systems such as digital libraries and search engines. Traditional retrieval models, including Okapi's BM25 model and Smart's vector space model with length normalization, can handle this problem to some extent by treating the query document as a long query. In practice, the Cosine measure is considered the best model for document similarity search because it measures the similarity between two documents well. In this paper, the performance of these models is compared quantitatively through experiments. Because the Cosine measure cannot reflect the structural similarity between documents, a new retrieval model based on TextTiling is proposed. The proposed model takes the subtopic structure of documents into account. It first splits the documents into text segments with TextTiling and calculates the similarities for different pairs of text segments in the documents. Finally, the overall similarity between the documents is obtained by combining the similarities of the segment pairs with the optimal matching method. Experimental results show that: 1) the popular retrieval models (Okapi's BM25 model and Smart's vector space model with length normalization) do not perform well for document similarity search; 2) the proposed model based on TextTiling is effective and outperforms the other models, including the Cosine measure; 3) the methods chosen for the three components of the proposed model are appropriate.
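
The three components named in the abstract (segmentation, segment-pair similarity, and combination by optimal matching) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration: `split_segments` is a hypothetical stand-in that splits on blank lines instead of running TextTiling, segment pairs are scored with plain term-frequency cosine similarity, and the matched total is divided by the larger segment count, a normalization the abstract does not specify.

```python
import math
from collections import Counter
from functools import lru_cache


def term_vector(text):
    """Bag-of-words term-frequency vector over lowercased whitespace tokens."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * v[term] for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def split_segments(document):
    """Assumption: blank-line-separated paragraphs stand in for TextTiling segments."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]


def optimal_matching_weight(weights):
    """Maximum total weight of a one-to-one matching between rows and columns.

    Memoised exhaustive search over the set of used columns. This is only
    practical for a small number of segments per document.
    """
    if not weights or not weights[0]:
        return 0.0
    # Transpose if needed so that every row can be assigned a distinct column.
    if len(weights) > len(weights[0]):
        weights = [list(col) for col in zip(*weights)]
    n_rows, n_cols = len(weights), len(weights[0])

    @lru_cache(maxsize=None)
    def best(row, used):
        if row == n_rows:
            return 0.0
        return max(
            weights[row][col] + best(row + 1, used | (1 << col))
            for col in range(n_cols)
            if not used & (1 << col)
        )

    return best(0, 0)


def segment_similarity(doc_a, doc_b):
    """Overall similarity: optimally matched segment-pair similarities, normalized."""
    segs_a = [term_vector(s) for s in split_segments(doc_a)]
    segs_b = [term_vector(s) for s in split_segments(doc_b)]
    if not segs_a or not segs_b:
        return 0.0
    weights = [[cosine(a, b) for b in segs_b] for a in segs_a]
    # Assumption: normalize by the larger segment count, so unmatched segments lower the score.
    return optimal_matching_weight(weights) / max(len(segs_a), len(segs_b))
```

The matching step is what separates this from a greedy pairing: for the weights `[[0.9, 0.8], [0.9, 0.1]]`, picking the best pair first yields 0.9 + 0.1, whereas the optimal matching yields 0.8 + 0.9. For documents with many segments, the exhaustive search would need to be replaced by a polynomial-time assignment algorithm.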



Author information

Authors and Affiliations

National Key Laboratory of Text Processing Technology, Institute of Computer Science and Technology, Peking University, Beijing, 100871, P.R. China
Xiao-Jun Wan & Yu-Xin Peng

Corresponding author

Correspondence to Yu-Xin Peng.

Additional information

In this paper, we use the abbreviations “Okapi” and “Smart” to refer to Okapi's BM25 model and Smart's vector space model with length normalization, respectively.

http://trec.nist.gov

About this article

Cite this article

Wan, XJ., Peng, YX. A New Retrieval Model Based on TextTiling for Document Similarity Search. J Comput Sci Technol 20, 552–558 (2005). https://doi.org/10.1007/s11390-005-0552-9

Received: 16 October 2004
Revised: 08 March 2005
Issue Date: July 2005
DOI: https://doi.org/10.1007/s11390-005-0552-9
