Analyzing Word Frequencies in Large Text Corpora Using Inter-arrival Times and Bootstrapping

Lijffijt, Jefrey; Papapetrou, Panagiotis; Puolamäki, Kai; Mannila, Heikki

doi:10.1007/978-3-642-23783-6_22

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6912)

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

Abstract

Comparing frequency counts over texts or corpora is an important task in many applications and scientific disciplines. Given a text corpus, we want to test a hypothesis, such as “word X is frequent”, “word X has become more frequent over time”, or “word X is more frequent in male than in female speech”. For this purpose we need a null model of word frequencies. The commonly used bag-of-words model, which corresponds to a Bernoulli process with fixed parameter, does not account for any structure present in natural languages. Using this model for word frequencies results in large numbers of words being reported as unexpectedly frequent. We address how to take into account the inherent occurrence patterns of words in significance testing of word frequencies. Based on studies of words in two large corpora, we propose two methods for modeling word frequencies that both take into account the occurrence patterns of words and go beyond the bag-of-words assumption. The first method models word frequencies based on the spatial distribution of individual words in the language. The second method is based on bootstrapping and takes into account only word frequency at the text level. The proposed methods are compared to the current gold standard in a series of experiments on both corpora. We find that words obey different spatial patterns in the language, ranging from bursty to non-bursty/uniform, independent of their frequency, showing that the traditional approach leads to many false positives.
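
The contrast drawn in the abstract can be made concrete with a small sketch. The Python code below is an illustration under stated assumptions, not the authors' implementation: it assumes a corpus is a list of texts, each a list of tokens, and all function names are hypothetical. It shows (1) the inter-arrival times of a word, (2) the upper-tail probability under the bag-of-words (binomial) null model, and (3) a simple text-level bootstrap for the hypothesis "word X is more frequent in corpus A than in corpus B", using an empirical p-value with add-one smoothing.

```python
import random
from math import comb


def inter_arrival_times(tokens, word):
    """Gaps (in tokens) between consecutive occurrences of `word`."""
    positions = [i for i, tok in enumerate(tokens) if tok == word]
    return [b - a for a, b in zip(positions, positions[1:])]


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), i.e. the bag-of-words null model.

    Exact summation; intended for small n only.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def relative_frequency(texts, word):
    """Occurrences of `word` divided by the total number of tokens."""
    total = sum(len(text) for text in texts)
    hits = sum(text.count(word) for text in texts)
    return hits / total if total else 0.0


def bootstrap_p_value(texts_a, texts_b, word, n_resamples=2000, seed=0):
    """One-sided empirical p-value for 'word is more frequent in A than in B'.

    Assumption of this sketch: whole texts are resampled with replacement,
    so only text-level frequencies matter, and the p-value is smoothed
    as (1 + count) / (1 + n_resamples).
    """
    rng = random.Random(seed)
    not_greater = 0
    for _ in range(n_resamples):
        sample_a = rng.choices(texts_a, k=len(texts_a))
        sample_b = rng.choices(texts_b, k=len(texts_b))
        freq_a = relative_frequency(sample_a, word)
        freq_b = relative_frequency(sample_b, word)
        if freq_a <= freq_b:
            not_greater += 1
    return (1 + not_greater) / (1 + n_resamples)
```

Because the bootstrap resamples whole texts, a word whose occurrences are concentrated in a few texts produces widely varying resampled frequencies, whereas the binomial model treats every token as an independent draw. This is one way to see why the two null models can disagree for bursty words.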

Author information

Authors and Affiliations

Department of Information and Computer Science, Aalto University, Finland
Jefrey Lijffijt, Panagiotis Papapetrou, Kai Puolamäki & Heikki Mannila
Helsinki Institute for Information Technology (HIIT), Finland
Jefrey Lijffijt, Panagiotis Papapetrou, Kai Puolamäki & Heikki Mannila

Editor information

Editors and Affiliations

Department of Informatics and Telecommunications, University of Athens, Panepistimioupolis, Ilisia, 15784, Athens, Greece
Dimitrios Gunopulos
Google Switzerland GmbH, Brandschenkestrasse 110, 8002, Zurich, Switzerland
Thomas Hofmann
Department of Computer Science, University of Bari “Aldo Moro”, via Orabona 4, 70125, Bari, Italy
Donato Malerba
Department of Informatics, Athens University of Economics and Business, Patision 76, 10434, Athens, Greece
Michalis Vazirgiannis

Cite this paper

Lijffijt, J., Papapetrou, P., Puolamäki, K., Mannila, H. (2011). Analyzing Word Frequencies in Large Text Corpora Using Inter-arrival Times and Bootstrapping. In: Gunopulos, D., Hofmann, T., Malerba, D., Vazirgiannis, M. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2011. Lecture Notes in Computer Science, vol 6912. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23783-6_22

DOI: https://doi.org/10.1007/978-3-642-23783-6_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23782-9
Online ISBN: 978-3-642-23783-6
eBook Packages: Computer Science, Computer Science (R0)
