Abstract
Many information systems keep multiple versions of documents. Examples include content management systems, version control systems (e.g., ClearCase, CVS), wikis, and backup and archiving solutions. It is often desirable to support free-text search over such repositories, i.e., to allow queries that may match any version of any document. We propose an indexing method that takes advantage of the inherent redundancy present in versioned documents by solving a variant of the multiple sequence alignment problem. The scheme produces an index that is much more compact than a standard index that treats each version independently. In experiments over publicly available versioned data, our method achieved compaction ratios of 81% as compared with standard indexing, while supporting the same retrieval capabilities.
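To make the contrast with a standard index concrete, the following is a minimal illustrative sketch, not the method of the paper. It assumes a deliberately simplified setting: versions are whitespace-tokenized strings, the index is term-level rather than positional, and redundancy is exploited only by merging runs of consecutive versions that contain a term into a single range posting. The paper instead aligns the token sequences of the versions. All function names (`standard_index`, `range_index`, `search`, `posting_count`) and the sample data are hypothetical.

```python
from collections import defaultdict


def standard_index(versions):
    """Baseline: each version is indexed independently,
    so every (term, version) pair gets its own posting."""
    index = defaultdict(list)
    for v, text in enumerate(versions):
        for term in set(text.split()):
            index[term].append(v)
    return dict(index)


def range_index(versions):
    """Simplified redundancy-aware index: a run of consecutive versions
    containing a term is stored as one (first, last) range posting."""
    index = defaultdict(list)
    for v, text in enumerate(versions):
        for term in set(text.split()):
            ranges = index[term]
            if ranges and ranges[-1][1] == v - 1:
                ranges[-1][1] = v          # extend the current run
            else:
                ranges.append([v, v])      # start a new run
    return {t: [tuple(r) for r in rs] for t, rs in index.items()}


def search(index, term):
    """Expand the range postings of a term back into version numbers."""
    return [v for first, last in index.get(term, [])
            for v in range(first, last + 1)]


def posting_count(index):
    """Total number of postings stored in an index."""
    return sum(len(postings) for postings in index.values())


# Hypothetical sample data: three versions of one short document.
versions = [
    "the quick brown fox",
    "the quick brown fox jumps",
    "the quick red fox jumps",
]

baseline = standard_index(versions)
compact = range_index(versions)
print(posting_count(baseline), posting_count(compact))
print(search(compact, "brown"))
```

On this toy input the range-based index stores fewer postings than the baseline while returning the same set of matching versions for every term. The actual savings on real data depend on how much consecutive versions overlap, and this sketch makes no claim about the compaction ratio reported above.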
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
Herscovici, M., Lempel, R., Yogev, S. (2007). Efficient Indexing of Versioned Document Sequences. In: Amati, G., Carpineto, C., Romano, G. (eds) Advances in Information Retrieval. ECIR 2007. Lecture Notes in Computer Science, vol 4425. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71496-5_10
DOI: https://doi.org/10.1007/978-3-540-71496-5_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71494-1
Online ISBN: 978-3-540-71496-5
eBook Packages: Computer Science (R0)