Abstract
We propose a new similarity measure between texts which, contrary to the current state-of-the-art approaches, takes a global view of the texts to be compared. We have implemented a tool to compute our textual distance and conducted experiments on several corpuses of texts. The experiments show that our methods can reliably identify different global types of texts.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Asarin, E., Degorre, A.: Volume and entropy of regular timed languages. hal (2009). http://hal.archives-ouvertes.fr/hal-00369812
Basset, N., Asarin, E.: Thin and thick timed regular languages. In: Fahrenberg and Tripakis [9], pp. 113–128
Cortelazzo, M.A., Nadalutti, P., Tuzzi, A.: Improving Labbé’s intertextual distance: testing a revised version on a large corpus of italian literature. J. Quant. Linguist. 20(2), 125–152 (2013)
Damerau, F.: A technique for computer detection and correction of spelling errors. Commun. ACM 7(3), 171–176 (1964)
Fahrenberg, U., Biondi, F., Corre, K., Jegourel, C., Kongshøj, S., Legay, A.: Measuring global similarity between texts. Technical report, arxiv (2014). http://arxiv.org/abs/1403.4024
Fahrenberg, U., Legay, A.: Generalized quantitative analysis of metric transition systems. In: Shan, C. (ed.) APLAS 2013. LNCS, vol. 8301, pp. 192–208. Springer, Heidelberg (2013)
Fahrenberg, U., Legay, A.: The quantitative linear-time-branching-time spectrum. Theor. Comput. Sci. (2013). http://dx.doi.org/10.1016/j.tcs.2013.07.030
Fahrenberg, U., Legay, A., Thrane, C.R.: The quantitative linear-time-branching-time spectrum. In: Chakraborty, S., Kumar, A. (eds.) FSTTCS. vol. 13 of LIPIcs, pp. 103–114 (2011)
Fahrenberg, U., Tripakis, S. (eds.): FORMATS 2011. LNCS, vol. 6919. Springer, Heidelberg (2011)
Haverkort, B.R.: Formal modeling and analysis of timed systems: Technology push or market pull? In: Fahrenberg and Tripakis [9], pp. 18–24
Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York (1990)
Kharmeh, S.A., Eder, K., May, D.: A design-for-verification framework for a configurable performance-critical communication interface. In: Fahrenberg and Tripakis [9], pp. 335–351
Kuhn, H.W.: The Hungarian method for the assignment problem. Nav. Res. Logist. Q. 2(1–2), 83–97 (1955)
Labbé, C.: Ike Antkare, one of the great stars in the scientific firmament. ISSI Newsl. 6(2), 48–52 (2010). http://hal.archives-ouvertes.fr/hal-00713564
Labbé, C., Labbé, D.: Inter-textual distance and authorship attribution Corneille and Molière. J. Quant. Linguist. 8(3), 213–231 (2001)
Labbé, C., Labbé, D.: A tool for literary studies: intertextual distance and tree classification. Literary Linguist. Comp. 21(3), 311–326 (2006)
Labbé, C., Labbé, D.: Duplicate and fake publications in the scientific literature: how many SCIgen papers in computer science? Scientometrics 94(1), 379–396 (2013)
Labbé, D.: Experiments on authorship attribution by intertextual distance in English. J. Quant. Linguist. 14(1), 33–80 (2007)
Lin, C.Y., Hovy, E.H.: Automatic evaluation of summaries using n-gram co-occurrence statistics. In: HLT-NAACL (2003)
Lin, C.Y., Och, F.J.: Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In: Scott, D., Daelemans, W., Walker, M.A. (eds.) ACL. pp. 605–612. ACL (2004)
Needleman, S.B., Wunsch, C.D.: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48(3), 443–453 (1970)
Noorden, R.V.: Publishers withdraw more than 120 gibberish papers. Nature News & Comment, February 2014. http://dx.doi.org/10.1038/nature.2014.14763
Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: ACL. pp. 311–318. ACL (2002)
Sankaranarayanan, S., Homaei, H., Lewis, C.: Model-based dependability analysis of programmable drug infusion pumps. In: Fahrenberg and Tripakis [9], pp. 317–334
Savoy, J.: Authorship attribution: a comparative study of three text corpora and three languages. J. Quant. Linguist. 19(2), 132–161 (2012)
Savoy, J.: Authorship attribution based on specific vocabulary. ACM Trans. Inf. Syst. 30(2), 12 (2012)
Smith, S.T., Kao, E.K., Senne, K.D., Bernstein, G., Philips, S.: Bayesian discovery of threat networks. CoRR abs/1311.5552v1 (2013)
Smith, S.T., Senne, K.D., Philips, S., Kao, E.K., Bernstein, G.: Network detection theory and performance. CoRR abs/1303.5613v1 (2013)
Smith, T., Waterman, M.: Identification of common molecular subsequences. J. Mol. Biol. 147(1), 195–197 (1981)
Springer second update on SCIgen-generated papers in conference proceedings. Springer Statement, April 2014. http://www.springer.com/about+springer/media/statements?SGWID=0-1760813-6-1460747-0
Tomasi, F., Bartolini, I., Condello, F., Degli Esposti, M., Garulli, V., Viale, M.: Towards a taxonomy of suspected forgery in authorship attribution field. A case: Montale’s Diario Postumo. In: DH-CASE. pp. 10:1–10:8. ACM (2013)
Ulusoy, A., Smith, S.L., Ding, X.C., Belta, C.: Robust multi-robot optimal path planning with temporal logic constraints. CoRR abs/1202.1307v2 (2012)
Ulusoy, A., Smith, S.L., Ding, X.C., Belta, C., Rus, D.: Optimal multi-robot path planning with temporal logic constraints. CoRR abs/1107.0062v1 (2011)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Fahrenberg, U., Biondi, F., Corre, K., Jegourel, C., Kongshøj, S., Legay, A. (2014). Measuring Global Similarity Between Texts. In: Besacier, L., Dediu, AH., MartÃn-Vide, C. (eds) Statistical Language and Speech Processing. SLSP 2014. Lecture Notes in Computer Science(), vol 8791. Springer, Cham. https://doi.org/10.1007/978-3-319-11397-5_17
Download citation
DOI: https://doi.org/10.1007/978-3-319-11397-5_17
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11396-8
Online ISBN: 978-3-319-11397-5
eBook Packages: Computer ScienceComputer Science (R0)